
koodaamo/linkextractor


linkextractor - HTML link extraction

UNDER DEVELOPMENT. NOT READY FOR USE.

This library aims to provide a means of extracting links from both HTML and plain text, using a combination of lxml, BeautifulSoup and regular expressions.

It is built with linked resource retrieval in mind, so it tries very hard to find all links in a document.

For HTML, the library supports:

  • extraction of any link found in a SRC or HREF attribute
  • extraction of CSS import links
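
The two HTML cases above can be sketched roughly as follows. This is an illustration only, not the library's code: it swaps in the standard library's html.parser for lxml and BeautifulSoup so that it runs without dependencies, and the names LinkCollector, collect_links and CSS_IMPORT are hypothetical. The @import pattern is deliberately simple and does not cover every CSS form.

```python
import re
from html.parser import HTMLParser

# Matches @import "x.css"; and @import url(x.css); (illustrative, not exhaustive)
CSS_IMPORT = re.compile(r"""@import\s+(?:url\(\s*)?['"]?([^'")\s;]+)""", re.IGNORECASE)


class LinkCollector(HTMLParser):
    """Hypothetical helper: gathers SRC/HREF values and CSS @import targets."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._in_style = False

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases attribute names, so SRC and src both match
        for name, value in attrs:
            if name in ("src", "href") and value:
                self.links.append(value)
        if tag == "style":
            self._in_style = True

    def handle_endtag(self, tag):
        if tag == "style":
            self._in_style = False

    def handle_data(self, data):
        # Only text inside <style> elements is searched for @import rules
        if self._in_style:
            self.links.extend(CSS_IMPORT.findall(data))


def collect_links(html):
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links
```

Links come back in document order and unexpanded, so relative values such as /about are returned as written.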

For plaintext documents:

  • extraction of all valid URLs
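
For plain text, one possible approach is a permissive regular expression followed by a sanity check on each candidate. The sketch below assumes that only http and https URLs are of interest and that trailing sentence punctuation should be dropped; the names URL_CANDIDATE and urls_in_text are hypothetical and the pattern is not the one the library uses.

```python
import re
from urllib.parse import urlparse

# Permissive candidate pattern: scheme, then everything up to whitespace or a quote/bracket
URL_CANDIDATE = re.compile(r"""https?://[^\s<>"']+""")


def urls_in_text(text):
    found = []
    for match in URL_CANDIDATE.finditer(text):
        # Drop punctuation that usually belongs to the surrounding sentence
        candidate = match.group().rstrip(".,;:!?)")
        parts = urlparse(candidate)
        # Keep only candidates that still have a scheme and a host
        if parts.scheme in ("http", "https") and parts.netloc:
            found.append(candidate)
    return found
```

Stripping a trailing parenthesis is a trade-off: it handles URLs quoted in brackets, at the cost of truncating the rare URL that legitimately ends in one.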

Optional functionality:

  • validate and fix the source documents prior to parsing
  • extract BASE URL and expand relative URLs
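
Expanding relative URLs can be illustrated with the standard library's urljoin. In this sketch the function name expand_links and its parameters are hypothetical: document_url is where the document was fetched from, and base_href is the value of a BASE element's HREF attribute if the document has one.

```python
from urllib.parse import urljoin


def expand_links(links, document_url, base_href=None):
    # A BASE href may itself be relative, so resolve it against the document URL first
    base = urljoin(document_url, base_href) if base_href else document_url
    # urljoin leaves absolute links untouched and resolves relative ones against base
    return [urljoin(base, link) for link in links]
```

For example, with document_url set to http://example.com/docs/page.html and no BASE element, the relative link img/a.png resolves to http://example.com/docs/img/a.png.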

Usage:

>>> from linkextractor import from_html
>>> docs_iter = [doc1, doc2, doc3]
>>> linksets = [links for links in from_html(docs_iter)]

The from_html call is a generator that yields one list of links for each document. Besides an iterable producing source documents, it takes the following optional boolean keywords:

  • validate - whether to validate the document beforehand
  • fix - whether to also fix it (implies validate=True)
  • expand - whether to expand relative URLs

The similarly importable from_text(docs_iter) function behaves the same way, except that it takes no keyword arguments; all links in text documents are expected to be well-formed, valid URLs.
