Some functions to parse and normalize URLs.
>>> urltools.normalize("Http://exAMPLE.com./foo")
http://example.com/foo
>>> urltools.parse("http://example.co.uk/foo/bar?x=1#abc")
ParseResult(scheme='http', subdomain='', domain='example', tld='co.uk', port='', path='/foo/bar', query='x=1', fragment='abc')
>>> urltools.parse("www.example.co.uk/abc")
ParseResult(scheme='', subdomain='', domain='', tld='', port='', path='www.example.co.uk/abc', query='', fragment='')
The difference between extract and parse is that parse respects relative URLs: a URL without a scheme is treated as a relative path. extract, in contrast, always tries to extract as much information as possible.
>>> urltools.extract("www.example.co.uk/abc")
ParseResult(scheme='', subdomain='www', domain='example', tld='co.uk', port='', path='/abc', query='', fragment='')
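The relative-URL behaviour of parse mirrors what Python's standard library does with a scheme-less string. The following is only an illustration using urllib.parse, not urltools' own code; how urltools implements this internally is not shown here.

```python
from urllib.parse import urlsplit

# Without a scheme (or a leading "//"), the whole string is a relative path.
relative = urlsplit("www.example.co.uk/abc")
print(relative.netloc, relative.path)

# With a scheme, the host and the path are separated.
absolute = urlsplit("http://www.example.co.uk/abc")
print(absolute.netloc, absolute.path)
```

This is why parse leaves the host inside path for such input, while extract assumes the string starts with a host name.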
You can install urltools from the Python Package Index (PyPI):
pip install urltools
... or get the newest version directly from GitHub:
pip install -e git://github.com/rbaier/urltools.git#egg=urltools
urltools uses the Public Suffix List to split domain names correctly. E.g. the TLD of example.co.uk would be .co.uk and not .uk.
I recommend using a local copy of this list. Otherwise it will be downloaded after each import of urltools.
export PUBLIC_SUFFIX_LIST="/path/to/effective_tld_names.dat"
For more information see http://publicsuffix.org/
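To make the suffix-based splitting described above more concrete, here is a minimal sketch. It is not urltools' implementation: split_host and SUFFIXES are hypothetical names, and the three-entry set is a stand-in for the real list, whose wildcard and exception rules are ignored here.

```python
# Hypothetical, tiny stand-in for the Public Suffix List.
SUFFIXES = {"com", "uk", "co.uk"}


def split_host(host, suffixes=SUFFIXES):
    """Return (subdomain, domain, tld) using the longest matching suffix."""
    normalized = host.lower().rstrip(".")
    labels = normalized.split(".")
    # Walk from the longest candidate suffix to the shortest,
    # so that "co.uk" is found before "uk".
    for i in range(1, len(labels)):
        candidate = ".".join(labels[i:])
        if candidate in suffixes:
            return ".".join(labels[:i - 1]), labels[i - 1], candidate
    # No known suffix: treat the whole host as the domain.
    return "", normalized, ""


print(split_host("www.example.co.uk"))
print(split_host("example.co.uk"))
```

Checking the longest candidate first is what keeps example.co.uk from being split into the domain "co" with the TLD "uk".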
To run the tests I use pytest:
py.test -vrxs