urltools

Some functions to parse and normalize URLs.

NOTE and credit

This is based on the original work that used to live at github.com/rbaier/urltools.git, which is no longer available.

Functions

Normalize

>>> urltools.normalize("Http://exAMPLE.com./foo")
http://example.com/foo

Rules that are applied to normalize a URL:

tolower scheme
tolower host (also works with IDNs)
remove default port
remove ':' without port
remove DNS root label
unquote path, query, fragment
collapse path (remove '//', '/./', '/../')
sort query params and remove params without value
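
The rules above can be approximated with the standard library alone. The following is a minimal sketch, not the urltools implementation: the names `normalize_sketch`, `collapse_path` and `DEFAULT_PORTS` are made up for illustration, the port table is a small hand-picked sample, and the sketch assumes absolute URLs without IPv6 hosts.

```python
from urllib.parse import unquote, urlsplit, urlunsplit

# Illustrative default ports; an assumption, not urltools' own table.
DEFAULT_PORTS = {"http": 80, "https": 443, "ftp": 21}


def collapse_path(path):
    """Remove '//', '/./' and '/../' from an absolute path."""
    segments = []
    for seg in path.split("/"):
        if seg in ("", "."):
            continue                      # empty segment ('//') or '.'
        if seg == "..":
            if segments:
                segments.pop()            # step up one level, never above root
            continue
        segments.append(seg)
    collapsed = "/" + "/".join(segments)
    if segments and path.endswith("/"):
        collapsed += "/"                  # keep a trailing slash
    return collapsed


def normalize_sketch(url):
    parts = urlsplit(url)
    scheme = parts.scheme.lower()

    # hostname is already lowercased; strip the DNS root label ("example.com.")
    host = (parts.hostname or "").rstrip(".")

    # Keep the port only if it is present and not the scheme's default.
    netloc = host
    if parts.port is not None and parts.port != DEFAULT_PORTS.get(scheme):
        netloc = f"{host}:{parts.port}"
    if parts.username:
        userinfo = parts.username
        if parts.password:
            userinfo += ":" + parts.password
        netloc = f"{userinfo}@{netloc}"

    path = collapse_path(unquote(parts.path))

    # Drop params without a value, sort the rest, then unquote.
    pairs = [p.split("=", 1) for p in parts.query.split("&") if "=" in p]
    kept = sorted((k, v) for k, v in pairs if v)
    query = unquote("&".join(f"{k}={v}" for k, v in kept))

    fragment = unquote(parts.fragment)
    return urlunsplit((scheme, netloc, path, query, fragment))


print(normalize_sketch("Http://exAMPLE.com./foo"))  # http://example.com/foo
```

Each step maps to one rule in the list, so the sketch can be used to check how a single rule affects a given URL.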

Parse

The result of parse and extract is a ParseResult named tuple that contains scheme, username, password, subdomain, domain, tld, port, path, query and fragment.

>>> urltools.parse("http://example.co.uk/foo/bar?x=1#abc")
ParseResult(scheme='http', username='', password='', subdomain='', domain='example', tld='co.uk', port='', path='/foo/bar', query='x=1', fragment='abc')

If the scheme is missing, parse interprets the URL as relative.

>>> urltools.parse("www.example.co.uk/abc")
ParseResult(scheme='', username='', password='', subdomain='', domain='', tld='', port='', path='www.example.co.uk/abc', query='', fragment='')
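
The standard library treats scheme-less input the same way, which may help explain the result above. This sketch only uses `urllib.parse.urlsplit`; whether parse delegates to it internally is an assumption.

```python
from urllib.parse import urlsplit

with_scheme = urlsplit("http://www.example.co.uk/abc")
without_scheme = urlsplit("www.example.co.uk/abc")

# With a scheme, the host ends up in netloc.
print(with_scheme.netloc, with_scheme.path)           # www.example.co.uk /abc

# Without a scheme, the whole string is taken as a relative path.
print(repr(without_scheme.netloc), without_scheme.path)  # '' www.example.co.uk/abc
```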

Extract

extract does not care about relative URLs and always tries to extract as much information as possible.

>>> urltools.extract("www.example.co.uk/abc")
ParseResult(scheme='', username='', password='', subdomain='www', domain='example', tld='co.uk', port='', path='/abc', query='', fragment='')
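
One common way to get host information out of a scheme-less URL is to prefix it with `//` before splitting, so that the first segment is read as the network location. The sketch below shows that trick with the standard library; `split_assuming_host` is a hypothetical helper, and this is not necessarily how extract works internally.

```python
from urllib.parse import urlsplit


def split_assuming_host(url):
    """Split a URL, treating scheme-less input as starting with a host."""
    if "://" not in url and not url.startswith("//"):
        url = "//" + url              # '//' marks the start of a netloc
    return urlsplit(url)


parts = split_assuming_host("www.example.co.uk/abc")
print(parts.netloc, parts.path)       # www.example.co.uk /abc
```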

Additional functions

Besides the main functions described above, urltools has a few more functions for manipulating segments of a URL.

encode (IDNA, see RFC 3490)

  >>> urltools.encode("http://müller.de")
  'http://xn--mller-kva.de/'

assemble a new URL from a ParseResult
normalize_host
normalize_port

normalize_path

  >>> normalize_path("/a/b/../../c")
  '/c'

normalize_query

  >>> normalize_query("x=1&y=&z=3")
  'x=1&z=3'

normalize_fragment
unquote

split (basically the same as urlparse.urlparse)

  >>> split("http://www.example.com/abc?x=1&y=2#foo")
  SplitResult(scheme='http', netloc='www.example.com', path='/abc', query='x=1&y=2', fragment='foo')

split_netloc

  >>> split_netloc("foo:bar@www.example.com:8080")
  ('foo', 'bar', 'www.example.com', '8080')

split_host

  >>> split_host("www.example.ac.at")
  ('www', 'example', 'ac.at')
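
To illustrate the idea behind split_host, here is a minimal sketch that splits a host against a tiny, hand-picked suffix set. `split_host_sketch` and `SUFFIXES` are hypothetical names; the real Public Suffix List is far larger and also has wildcard and exception rules, which this sketch ignores.

```python
# Tiny sample suffix set for illustration only.
SUFFIXES = {"com", "de", "uk", "co.uk", "at", "ac.at"}


def split_host_sketch(host, suffixes=SUFFIXES):
    """Return (subdomain, domain, tld) using the longest matching suffix."""
    labels = host.lower().rstrip(".").split(".")
    # Walk from the longest candidate suffix to the shortest,
    # so that 'ac.at' wins over 'at'.
    for i in range(len(labels)):
        candidate = ".".join(labels[i:])
        if candidate in suffixes:
            domain = labels[i - 1] if i > 0 else ""
            subdomain = ".".join(labels[:i - 1]) if i > 1 else ""
            return subdomain, domain, candidate
    # No known suffix: treat the whole host as the domain.
    return "", host, ""


print(split_host_sketch("www.example.ac.at"))  # ('www', 'example', 'ac.at')
```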

Installation

Note: installing with pip is not working yet. The intended way to install urltools from the Python Package Index (PyPI) is:

pip install urltools

... or get the newest version directly from GitHub:

pip install -e git+https://github.com/itzik-h/urltools.git#egg=urltools

Public Suffix List

urltools uses the Public Suffix List to split domain names correctly. E.g. the TLD of example.co.uk would be .co.uk and not .uk.

I recommend using a local copy of this list. Otherwise it will be downloaded each time urltools is imported. Point the PUBLIC_SUFFIX_LIST environment variable at the local file:

export PUBLIC_SUFFIX_LIST="/path/to/effective_tld_names.dat"

For more information see http://publicsuffix.org/
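
For readers who want to inspect a local copy of the list, the following sketch reads the file format: one rule per line, with blank lines and `//` comments skipped. `load_suffix_rules` is a hypothetical helper and the sample data is a hand-written excerpt; how urltools itself loads the file is not shown here.

```python
import os


def load_suffix_rules(lines):
    """Collect suffix rules, skipping blank lines and '//' comments."""
    rules = set()
    for line in lines:
        rule = line.strip()
        if not rule or rule.startswith("//"):
            continue
        # A rule is the text up to the first whitespace on the line.
        rules.add(rule.split()[0])
    return rules


# Hand-written sample in the same format as the real file.
sample = """\
// sample excerpt
uk
co.uk

at
ac.at
""".splitlines()

rules = load_suffix_rules(sample)

# If the environment variable points at a real file, read that instead.
path = os.environ.get("PUBLIC_SUFFIX_LIST")
if path and os.path.exists(path):
    with open(path, encoding="utf-8") as fh:
        rules = load_suffix_rules(fh)

print(len(rules), "rules loaded")
```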

Tests

To run the tests I use pytest:

py.test -vrxs
