Developed by Khalid Aziz, Peter Li, Christopher Moran, and Ethan Romba
BURP is a Python package that performs static analysis of HTML, URL tokens, HTTP headers, and WHOIS information, extracting features that can be used to evaluate the reputation of an arbitrary URL. The extracted features can be fed into a machine-learning system such as Weka to enable intelligent classification of URLs.
The package includes a script for analyzing URLs in bulk (e.g. for creating training sets), as well as a script that uses Weka to classify individual URLs as malicious or benign based on a decision-tree model developed from a training set of ~44,000 URLs.
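As an illustration of how extracted features might be handed to Weka, the sketch below renders a list of feature dictionaries as ARFF text, the file format Weka reads. It is not part of BURP: the `to_arff` helper, the `label` key, and the sample values are hypothetical, and every feature is assumed to be numeric or boolean.

```python
def _format_value(value):
    # Booleans become 0/1 so that every feature column is numeric.
    return str(int(value)) if isinstance(value, bool) else str(value)


def to_arff(relation, rows, label_key="label"):
    """Render feature dictionaries as ARFF text (hypothetical helper, not BURP's)."""
    # Take the attribute names from the first row, sorted for a stable order.
    feature_names = sorted(key for key in rows[0] if key != label_key)
    labels = sorted(set(row[label_key] for row in rows))

    lines = ["@relation " + relation, ""]
    for name in feature_names:
        lines.append("@attribute %s numeric" % name)
    # The class attribute is nominal: its allowed values are listed in braces.
    lines.append("@attribute %s {%s}" % (label_key, ",".join(labels)))
    lines.extend(["", "@data"])
    for row in rows:
        values = [_format_value(row[name]) for name in feature_names]
        values.append(row[label_key])
        lines.append(",".join(values))
    return "\n".join(lines)


# Made-up sample rows, for illustration only.
rows = [
    {"numCharacters": 1200, "percentWhitespace": 12.5,
     "hasDoubleDocuments": False, "label": "benign"},
    {"numCharacters": 300, "percentWhitespace": 40.0,
     "hasDoubleDocuments": True, "label": "malicious"},
]
print(to_arff("burp", rows))
```

Non-numeric values such as strings or dates would need their own ARFF attribute types, which this sketch leaves out.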
BURP requires Python 2.6+ / 3.1+.
- Install Weka 3.7.7+
- Add the weka.jar file to your CLASSPATH
- Install lxml
- Install the BURP fork of python-whois:
    git clone https://github.com/eecs-354-burp/python-whois
    cd python-whois
    python setup.py install
- Install BURP:
    git clone https://github.com/eecs-354-burp/BURP
    cd BURP
    python setup.py install
Run BURP from the command line, passing the URL you would like to classify:
burp [URL]
The BURP HTML analyzer is optimized for retrieving and analyzing HTML from URLs:
from burp.html import HTMLAnalyzer
analyzer = HTMLAnalyzer(url)
analysis = analyzer.analyze()
...
analyzer.loadUrl(url2)
analysis2 = analyzer.analyze()
...
To analyze an HTML string directly, be sure to call the setUrl() method with the URL the HTML came from:
from burp.html import HTMLAnalyzer
html = '<html>Hello World!</html>'
analyzer = HTMLAnalyzer()
analyzer.loadHtml(html)
analyzer.setUrl('http://www.example.com')
analysis = analyzer.analyze()
The analyze() method returns a dictionary with the following keys:
- numCharacters (Int): The number of characters in the HTML document
- percentWhitespace (Float): The percentage of whitespace characters in the HTML document
- percentScriptContent (Float): The percentage of inline script content in the HTML document
- numIframes (Int): The number of <iframe> elements
- numScripts (Int): The number of <script> elements
- numScriptsWithWrongExtension (Int): The number of <script> elements with the wrong extension (i.e. not .js)
- numEmbeds (Int): The number of <embed> elements
- numObjects (Int): The number of <object> elements
- numSuspiciousObjects (Int): The number of <object> elements whose classid is contained in a list of ActiveX controls known to be exploitable
- numHyperlinks (Int): The number of <a> elements
- numMetaRefresh (Int): The number of <meta> elements with an http-equiv="refresh" attribute
- numHiddenElements (Int): The number of elements with a style attribute that sets their CSS display property to "none" or their visibility property to "hidden"
- numSmallElements (Int): The number of elements with width, height, or style attributes that set their width or height to < 2 px or their total area to < 30 sq. px
- hasDoubleDocuments (Bool): True if the HTML document has more than one <html>, <head>, <title>, or <body> element
- numUnsafeIncludedUrls (Int): The total number of URLs included by elements that can be used to include executable code (<script>, <iframe>, <frame>, <embed>, <form>, <object>)
- numExternalUrls (Int): The total number of included URLs that point to an external domain
- percentUnknownElements (Float): The percentage of elements that are not recognized by the HTML specification
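To make a few of these definitions concrete, here is a minimal sketch that computes some of them with only the Python standard library. It is not BURP's implementation (BURP depends on lxml): the `FeatureCounter` class and `sketch_features` function are hypothetical names, the percentage is assumed to be on a 0 to 100 scale, and the extension check is assumed to ignore query strings.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2


class FeatureCounter(HTMLParser):
    """Counts a handful of elements while the parser walks the document."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.counts = {
            "numIframes": 0,
            "numScripts": 0,
            "numScriptsWithWrongExtension": 0,
            "numHiddenElements": 0,
        }

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "iframe":
            self.counts["numIframes"] += 1
        elif tag == "script":
            self.counts["numScripts"] += 1
            src = attrs.get("src")
            # Assumption: compare only the part before any query string.
            if src and not src.split("?")[0].lower().endswith(".js"):
                self.counts["numScriptsWithWrongExtension"] += 1
        # Strip spaces so "display: none" and "display:none" both match.
        style = (attrs.get("style") or "").replace(" ", "").lower()
        if "display:none" in style or "visibility:hidden" in style:
            self.counts["numHiddenElements"] += 1


def sketch_features(html):
    """Return a small, illustrative subset of the features listed above."""
    counter = FeatureCounter()
    counter.feed(html)
    counter.close()
    features = dict(counter.counts)
    features["numCharacters"] = len(html)
    whitespace = sum(1 for ch in html if ch.isspace())
    # Assumption: percentages are expressed on a 0-100 scale.
    features["percentWhitespace"] = 100.0 * whitespace / len(html) if html else 0.0
    return features


print(sketch_features('<html><iframe style="display: none"></iframe></html>'))
```

A real analyzer would also need to handle malformed markup and styles applied through stylesheets, which this sketch ignores.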
The BURP URL analyzer examines the URL itself, the IP address associated with it, and its WHOIS information:
from burp.html import URLAnalyzer
analyzer = URLAnalyzer()
analysis1 = analyzer.analyze(url1)
analysis2 = analyzer.analyze(url2)
The analyze() method returns a dictionary with the following keys:
- tokens (Dictionary): The tokens contained in the URL. This dictionary contains the following keys:
    - 'subdomain_length': (int)
    - 'domain': (string)
    - 'number_subdomains': (int)
    - 'domain_length': (int)
    - 'path': (string)
    - 'subdomain': (string)
    - 'port': (string)
- ip (String): The IP address associated with the URL
- tokens (Dictionary): The WHOIS information for the URL. This dictionary contains the following keys:
    - 'last_updated': (Datetime Object)
    - 'name': (string)
    - 'expiration_date': (Datetime Object)
    - 'creation_date': (Datetime Object)
    - 'registrar': (string)
    - 'name_servers': (Set of Strings)
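For a sense of what the URL tokens look like, the following sketch derives similarly named fields with the standard library's `urlparse`. It is an illustration only, not BURP's tokenizer: `sketch_tokens` is a hypothetical name, and the domain is naively assumed to be the last two labels of the hostname, which is wrong for suffixes such as .co.uk.

```python
try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse  # Python 2


def sketch_tokens(url):
    """Split a URL into fields named after the token keys listed above."""
    parts = urlparse(url)
    host = parts.hostname or ""
    labels = host.split(".") if host else []
    # Naive assumption: the registered domain is the last two labels.
    domain = ".".join(labels[-2:])
    subdomain_labels = labels[:-2]
    subdomain = ".".join(subdomain_labels)
    return {
        "domain": domain,
        "domain_length": len(domain),
        "subdomain": subdomain,
        "subdomain_length": len(subdomain),
        "number_subdomains": len(subdomain_labels),
        "path": parts.path,
        # The port is kept as a string; it is empty when the URL has none.
        "port": str(parts.port) if parts.port is not None else "",
    }


print(sketch_tokens("http://mail.shop.example.com:8080/login.php"))
```

Handling multi-part public suffixes correctly would need a public-suffix list, which is outside the scope of this sketch.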
To run the test suite:
python setup.py test