Project for the security course at CentraleSupelec, CS track.
Use Python 3.4, 3.5, or 3.6 (required for compatibility with TensorFlow).
python -m venv venv
venv\Scripts\activate.bat
pip install -r requirements.txt
See the file predict.py.
Dataset 1: Unbalanced dataset with 80% safe URLs and 20% malicious URLs (contains repeated URLs)
Dataset 2: Balanced dataset
Dataset 3: Dated malicious URLs, built from PhishTank and Malware Domains Blocklist
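The class balance and the repeated URLs mentioned for these datasets can be checked before training. The following is a minimal sketch, not code from this repository: it assumes each dataset can be loaded as (url, label) pairs, and the helper names `summarize_labels` and `drop_repeated_urls` as well as the toy sample are hypothetical.

```python
from collections import Counter


def summarize_labels(rows):
    """Return the share of each label in an iterable of (url, label) pairs."""
    counts = Counter(label for _, label in rows)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


def drop_repeated_urls(rows):
    """Keep only the first occurrence of each URL, preserving order."""
    seen = set()
    unique = []
    for url, label in rows:
        if url not in seen:
            seen.add(url)
            unique.append((url, label))
    return unique


# Hypothetical toy sample (not taken from the real datasets):
# four safe rows, one of them repeated, and one malicious row.
sample = [
    ("http://example.com/", "safe"),
    ("http://example.org/docs", "safe"),
    ("http://example.com/", "safe"),
    ("http://example.net/login", "malicious"),
    ("http://example.org/about", "safe"),
]

print(summarize_labels(sample))
print(summarize_labels(drop_repeated_urls(sample)))
```

Removing repeated URLs changes the label shares, so it may be worth recomputing the balance after deduplication rather than relying on the nominal 80/20 split.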
The code here is based on the work of the following people:
- Hillary Sanders and Joshua Saxe - Garbage In, Garbage Out: How purportedly great ML models can be screwed up by bad data - paper, slides
- Joshua Saxe and Konstantin Berlin - eXpose: A Character-Level Convolutional Neural Network with Embeddings For Detecting Malicious URLs, File Paths and Registry Keys - paper and their GitHub
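A character-level model of the kind described in the eXpose paper takes each URL as a fixed-length sequence of integer character indices, which an embedding layer then maps to vectors. The sketch below illustrates only that input encoding step. It is not the code from this repository or from the eXpose authors: the alphabet (printable ASCII), the default length of 200, and the name `encode_url` are assumptions made for illustration.

```python
import string

# Assumed alphabet: printable ASCII. Index 0 is reserved and shared by
# padding and by characters outside the alphabet.
ALPHABET = string.printable
CHAR_TO_INDEX = {ch: i + 1 for i, ch in enumerate(ALPHABET)}


def encode_url(url, max_len=200):
    """Map a URL to a list of exactly max_len integer character indices."""
    # Truncate long URLs, then look up each character (unknown -> 0).
    indices = [CHAR_TO_INDEX.get(ch, 0) for ch in url[:max_len]]
    # Right-pad short URLs with zeros up to max_len.
    return indices + [0] * (max_len - len(indices))


encoded = encode_url("http://example.com/login", max_len=32)
print(len(encoded))
print(encoded)
```

Sharing index 0 between padding and unknown characters keeps the sketch short; a separate index for unknown characters is an equally reasonable choice.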