BURP - The Better URL Reputation Platform

Developed by Khalid Aziz, Peter Li, Christopher Moran, and Ethan Romba

BURP is a Python package that performs static analysis of HTML, URL tokens, HTTP headers, and WHOIS information, extracting features that can be used to evaluate the reputation of an arbitrary URL. The extracted features can be fed into a machine-learning system such as Weka to enable intelligent classification of URLs.

The package includes a script for analyzing URLs in bulk (e.g. for creating training sets), as well as a script that uses Weka to classify individual URLs as malicious or benign based on a decision-tree model developed from a training set of ~44,000 URLs.
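To make the hand-off to Weka concrete, here is a minimal sketch of how a list of extracted feature dictionaries could be written out in Weka's ARFF format. This is not BURP's bulk-analysis script: the function name `to_arff`, the `class` key, and the `benign`/`malicious` label values are assumptions made for illustration, and the sketch targets Python 3.

```python
def to_arff(relation, rows, class_values=("benign", "malicious")):
    """Render feature dicts as ARFF text.

    Hypothetical helper: every row is assumed to hold the same feature
    keys plus a 'class' key with the training label.
    """
    names = sorted(key for key in rows[0] if key != "class")
    lines = ["@RELATION " + relation, ""]
    for name in names:
        # Booleans become a nominal attribute; everything else is numeric.
        if isinstance(rows[0][name], bool):
            kind = "{True,False}"
        else:
            kind = "NUMERIC"
        lines.append("@ATTRIBUTE %s %s" % (name, kind))
    lines.append("@ATTRIBUTE class {%s}" % ",".join(class_values))
    lines.extend(["", "@DATA"])
    for row in rows:
        cells = [str(row[name]) for name in names] + [row["class"]]
        lines.append(",".join(cells))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    sample = [
        {"numIframes": 2, "hasDoubleDocuments": False, "class": "malicious"},
        {"numIframes": 0, "hasDoubleDocuments": False, "class": "benign"},
    ]
    print(to_arff("urls", sample))
```

String-valued features (such as a domain name) would need a STRING or nominal attribute declaration, which this sketch leaves out.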

BURP requires Python 2.6+ / 3.1+.

Installation

Install Weka 3.7.7+
Add the weka.jar file to your CLASSPATH (see the Weka documentation for instructions)
Install lxml

Install the BURP fork of python-whois:

 git clone https://github.com/eecs-354-burp/python-whois
 cd python-whois
 python setup.py install

Install BURP:

 git clone https://github.com/eecs-354-burp/BURP
 cd BURP
 python setup.py install

Usage

Run BURP from the command line, passing the URL you would like to classify:

burp [URL]

HTML Analyzer

The BURP HTML analyzer is optimized for retrieving and analyzing HTML from URLs:

from burp.html import HTMLAnalyzer
analyzer = HTMLAnalyzer(url)
analysis = analyzer.analyze()
...
analyzer.loadUrl(url2)
analysis2 = analyzer.analyze()
...

To analyze an HTML string directly, be sure to call the setUrl() method with the URL from which the HTML originated:

from burp.html import HTMLAnalyzer
html = '<html>Hello World!</html>'
analyzer = HTMLAnalyzer()
analyzer.loadHtml(html)
analyzer.setUrl('http://www.example.com')
analysis = analyzer.analyze()
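One plausible reason the originating URL matters is that relative references in the HTML can only be resolved, and classified as internal or external, against a base URL. The sketch below illustrates that idea with the Python 3 standard library. It is not BURP's implementation: `count_external_urls` is a hypothetical helper, and it compares full host names, so a different subdomain of the same site counts as external here.

```python
from urllib.parse import urljoin, urlparse


def count_external_urls(page_url, included_urls):
    """Count included URLs whose host differs from the page's host."""
    page_host = urlparse(page_url).hostname
    count = 0
    for raw in included_urls:
        # urljoin resolves relative and scheme-relative references
        # against the URL the HTML was loaded from.
        absolute = urljoin(page_url, raw)
        host = urlparse(absolute).hostname
        if host is not None and host != page_host:
            count += 1
    return count


if __name__ == "__main__":
    urls = ["/js/app.js", "http://evil.test/x.js", "img/logo.png"]
    print(count_external_urls("http://www.example.com/a/index.html", urls))
```

Without a base URL, a reference such as `/js/app.js` has no host at all, so it cannot be classified either way.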

The analyze() method returns a dictionary with the following keys:

numCharacters
(Int) The number of characters in the HTML document
percentWhitespace
(Float) The percentage of whitespace characters in the HTML document
percentScriptContent
(Float) The percentage of inline script content in the HTML document
numIframes
(Int) The number of <iframe> elements
numScripts
(Int) The number of <script> elements
numScriptsWithWrongExtension
(Int) The number of <script> elements with the wrong extension (i.e. not .js)
numEmbeds
(Int) The number of <embed> elements
numObjects
(Int) The number of <object> elements
numSuspiciousObjects
(Int) The number of <object> elements whose classid is contained in a list of ActiveX controls known to be exploitable
numHyperlinks
(Int) The number of <a> elements
numMetaRefresh
(Int) The number of <meta> elements with an http-equiv="refresh" attribute
numHiddenElements
(Int) The number of elements with a style attribute that sets their CSS display property to "none" or their visibility property to "hidden"
numSmallElements
(Int) The number of elements with width, height, or style attributes that set their width or height to < 2 px or their total area to < 30 sq. px
hasDoubleDocuments
(Bool) True if the HTML document has more than one <html>, <head>, <title>, or <body>
numUnsafeIncludedUrls
(Int) The total number of URLs included by elements that can be used to include executable code (<script>, <iframe>, <frame>, <embed>, <form>, <object>)
numExternalUrls
(Int) The total number of included URLs that point to an external domain
percentUnknownElements
(Float) The percentage of elements that are not recognized by the HTML specification
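To give a feel for how a few of these features can be computed, here is a rough sketch using only the Python 3 standard library. It is not how BURP computes them (BURP depends on lxml). The names `FeatureCounter` and `analyze_html` are hypothetical, only a handful of the keys above are covered, and percentWhitespace is assumed to be on a 0-100 scale.

```python
import re
from html.parser import HTMLParser

# Inline styles that hide an element, per the numHiddenElements description.
HIDDEN_STYLE = re.compile(r"display\s*:\s*none|visibility\s*:\s*hidden", re.I)


class FeatureCounter(HTMLParser):
    """Counts a few element-based features while parsing."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.counts = {
            "numIframes": 0,
            "numScripts": 0,
            "numHyperlinks": 0,
            "numMetaRefresh": 0,
            "numHiddenElements": 0,
        }

    def handle_starttag(self, tag, attrs):
        # The parser lowercases tag and attribute names for us.
        attrs = dict(attrs)
        if tag == "iframe":
            self.counts["numIframes"] += 1
        elif tag == "script":
            self.counts["numScripts"] += 1
        elif tag == "a":
            self.counts["numHyperlinks"] += 1
        elif tag == "meta":
            if (attrs.get("http-equiv") or "").lower() == "refresh":
                self.counts["numMetaRefresh"] += 1
        if HIDDEN_STYLE.search(attrs.get("style") or ""):
            self.counts["numHiddenElements"] += 1


def analyze_html(html):
    """Return a dictionary with a subset of the features listed above."""
    parser = FeatureCounter()
    parser.feed(html)
    parser.close()
    features = dict(parser.counts)
    features["numCharacters"] = len(html)
    whitespace = sum(1 for ch in html if ch.isspace())
    features["percentWhitespace"] = (
        100.0 * whitespace / len(html) if html else 0.0
    )
    return features


if __name__ == "__main__":
    print(analyze_html('<a href="/">home</a> <iframe style="display:none"></iframe>'))
```

The hidden-element check only inspects inline style attributes, matching the description above; styles applied through stylesheets are not detected.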

URL Analyzer

The BURP URL analyzer is optimized for URLs themselves, the IP addresses associated with the URLs, and the WHOIS information related to URLs:

from burp.html import URLAnalyzer
analyzer = URLAnalyzer()
analysis1 = analyzer.analyze(url1)
analysis2 = analyzer.analyze(url2)

The analyze() method returns a dictionary with the following keys:

tokens
(Dictionary) The tokens contained in the URL. This dictionary contains the following keys:
- 'subdomain_length': (int)
- 'domain': (string)
- 'number_subdomains': (int)
- 'domain_length': (int)
- 'path': (string)
- 'subdomain': (string)
- 'port': (string)
ip
(String) IP address associated with the URL.
whois
(Dictionary) The WHOIS information for the URL. This dictionary contains the following keys:
- 'last_updated': (datetime object)
- 'name': (string)
- 'expiration_date': (datetime object)
- 'creation_date': (datetime object)
- 'registrar': (string)
- 'name_servers': (set of strings)
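As an illustration of where values like these could come from, the sketch below derives a token dictionary with the same keys from a URL, and turns a WHOIS creation date into a numeric feature. It is a guess at the general approach, not BURP's code: `tokenize_url` and `domain_age_days` are hypothetical names, the "domain" is naively taken to be the last two labels of the host name (wrong for suffixes such as .co.uk and for bare IP addresses), and the code targets Python 3.

```python
from datetime import datetime
from urllib.parse import urlsplit


def tokenize_url(url):
    """Split a URL into tokens keyed like the 'tokens' dictionary above."""
    parts = urlsplit(url)
    labels = (parts.hostname or "").split(".")
    # Naive assumption: the registered domain is the last two labels.
    domain = ".".join(labels[-2:])
    subdomain = ".".join(labels[:-2])
    return {
        "domain": domain,
        "domain_length": len(domain),
        "subdomain": subdomain,
        "subdomain_length": len(subdomain),
        "number_subdomains": len(labels[:-2]),
        "path": parts.path,
        # Empty string when the URL names no explicit port (an assumption).
        "port": str(parts.port) if parts.port is not None else "",
    }


def domain_age_days(creation_date, now):
    """Days between a WHOIS creation date and a reference time."""
    return (now - creation_date).days


if __name__ == "__main__":
    print(tokenize_url("http://mail.example.com:8080/inbox"))
    print(domain_age_days(datetime(2012, 1, 1), datetime(2012, 1, 31)))
```

Datetime values usually need a conversion like `domain_age_days` before they can serve as numeric attributes in a classifier.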

Running the HTML Test Suite

python setup.py test
