FluxEnder

Malicious domains are websites that spread viruses, worms, and Trojan horses. These sites are also often associated with phishing and pornography, and they pose a serious threat to Internet users' information and property. With the use of Fast-Flux and DGA techniques, these sites have become more difficult to track and take down.

To identify these malicious domains, this paper proposes a machine learning classifier based on the random forests algorithm, called FluxEnder.

We propose a storage model based on TokuMX to store Passive DNS data. This model offers high scalability, query speed, and space utilization. It is optimized for analysis and also supports reverse queries on IP addresses. We build a malicious domain classifier from five aspects: diversity, time, growth, relatedness, and IP info, comprising 18 features, 7 of which are proposed for the first time in this paper. This paper introduces the idea of a domain pool, which describes the relation between domains: if several domains point to the same IP, they are in the same domain pool. We can therefore identify these domains by their group behavior. A malicious domain might use a malicious IP, so our learning model also uses IP information, that is, information used to judge the reliability of an IP. In addition, the model takes DGA domains into consideration.
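The domain pool idea can be illustrated with a small in-memory sketch. This is not the code from /src/extract_feature_lib.py; the function names `build_ip_index` and `pool_of`, the `(domain, ip)` record shape, and the sample data are all assumptions made for illustration. It also only groups domains that directly share an IP; whether FluxEnder merges pools transitively is not stated here.

```python
from collections import defaultdict


def build_ip_index(records):
    """Map each IP to the set of domains that resolved to it.

    `records` is an iterable of (domain, ip) pairs, a hypothetical
    simplification of Passive DNS rows.
    """
    ip_to_domains = defaultdict(set)
    for domain, ip in records:
        ip_to_domains[ip].add(domain)
    return ip_to_domains


def pool_of(domain, records):
    """Return every domain that shares at least one IP with `domain`."""
    pool = set()
    for domains in build_ip_index(records).values():
        if domain in domains:
            pool |= domains
    return pool


# Hypothetical sample data using documentation-reserved names and addresses.
records = [
    ("a.example", "192.0.2.1"),
    ("b.example", "192.0.2.1"),
    ("b.example", "192.0.2.2"),
    ("c.example", "192.0.2.2"),
    ("d.example", "198.51.100.7"),
]

print(sorted(pool_of("b.example", records)))  # ['a.example', 'b.example', 'c.example']
print(sorted(pool_of("d.example", records)))  # ['d.example']
```

The IP-to-domains index built here is also a rough picture of what a reverse query on an IP address returns: every domain seen pointing at that IP.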

Setup

1. Download the DNS logs from here; these are the input DNS logs. Remember to update 'dns_log_path' in log_to_mongo.py. If you have your own DNS logs, that's fine; you can write your own extraction code.
2. I suggest updating the malicious blocklist and CDN domain lists in the /resources folder to make sure you have the latest legitimate and malware domains. These lists are used for noise reduction. The scripts in /scripts will help you with this part.
3. Store the raw logs into TokuMX or MongoDB with /src/log_to_mongo.py and /src/log_to_mongo_lib.py. You can run 'python log_to_mongo.py args1 args2', but I also wrote worker.py, which is a main entry point for both log-to-mongo and feature extraction. It has just two parts; comment out the part you do not want to run.
4. Extract features from the MongoDB rows (stored in part 3). In this part, it's easy to come up with something new to help your paper. Read /src/extract_feature.py and /src/extract_feature_lib.py; you can get the core ideas of FluxEnder from these two files. Then I store the extracted features.
5. Final classification, with /src/classify_domain.py. I use random forests from scikit-learn (thanks to the developers of these fancy tools): I train the model with labeled data, then classify the test domains and calculate the precision. You can read dga hacking here; it will help you understand.
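The last step (train on labeled data, classify test domains, compute precision) can be sketched with scikit-learn as follows. This is a minimal sketch and not the contents of /src/classify_domain.py: the feature matrix is synthetic random data standing in for the 18 extracted features, and the split ratio, `n_estimators`, and random seeds are arbitrary assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the extracted features (assumption: 18 numeric
# features per domain; label 1 = malicious, 0 = legitimate).
rng = np.random.default_rng(0)
n_features = 18
legit = rng.normal(loc=0.0, scale=1.0, size=(200, n_features))
malicious = rng.normal(loc=3.0, scale=1.0, size=(200, n_features))
X = np.vstack([legit, malicious])
y = np.array([0] * 200 + [1] * 200)

# Hold out part of the labeled data as the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Train the random forest on the labeled training data.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)

# Classify the test domains and compute precision for the malicious class.
y_pred = clf.predict(X_test)
precision = precision_score(y_test, y_pred)
print("precision:", precision)
```

With real data, `X` would hold one row of extracted features per domain and `y` the labels derived from the blocklists and whitelists in /resources.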

Tuning

The malicious domains used(!!)
CDN names you use(!!)
Thresholds in classify_domain.py(!!!)

DIRs

output: generated by /src/classify_domain.py

fn.txt: false negative domains
fp.txt: false positive domains
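For readers unsure how these two lists relate to the reported precision, here is a small sketch. The function name `split_errors` and the dict-based inputs are hypothetical and only illustrate the definitions; how classify_domain.py actually produces the files is not shown in this README.

```python
def split_errors(labels, predictions):
    """Split predictions into false positives and false negatives.

    Both arguments map a domain name to a bool (True = malicious).
    Returns (false_positives, false_negatives, precision).
    """
    fp = sorted(d for d, pred in predictions.items() if pred and not labels[d])
    fn = sorted(d for d, pred in predictions.items() if not pred and labels[d])
    tp = sum(1 for d, pred in predictions.items() if pred and labels[d])
    flagged = tp + len(fp)
    # Precision = true positives / everything flagged as malicious.
    precision = tp / flagged if flagged else 0.0
    return fp, fn, precision


# Hypothetical ground truth and classifier output.
labels = {
    "good.example": False,
    "cdn.example": False,
    "bad.example": True,
    "flux.example": True,
}
predictions = {
    "good.example": False,
    "cdn.example": True,    # legitimate but flagged: false positive
    "bad.example": True,
    "flux.example": False,  # malicious but missed: false negative
}

fp, fn, precision = split_errors(labels, predictions)
print(fp, fn, precision)  # ['cdn.example'] ['flux.example'] 0.5
```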

resources: whitelist, blocklist, and CDN domain lists

cdn.txt: top 50 CDN domains on this planet (manually generated)
cdn_ip.txt: IPs associated with the CDN domains above
cdn_new.txt: domains that use IPs in cdn_ip.txt (cdn_ip.txt and cdn_new.txt are generated by /scripts/find_cdn_user.py)
domain_whitelist.txt: legitimate domains that might otherwise be classified as malicious (manually generated)
hosts_*: different kinds of malicious domains (generated by /scripts/update_malware_files.py and /scripts/update_malware_files.sh)
top-1m.csv: Alexa top 1m sites (downloaded from Alexa)

scripts

drop.py: drop collections or dbs from mongo
find_cdn_cname.py: CDN domains may use CNAME records recursively; this script follows them to find all the domain names
find_cdn_user.py: find domains that use a CDN IP
plotting.py: plot various graphs to show how legitimate and malicious domains differ on each feature
printting.py: ancient code for displaying feature values; works well with interruption
test.py: tests as it is...
update_malware_files.*: malware blocklists are updated frequently, so run these to refresh them

src

models: DGA detection related (check here)
_init_.py: various global variables
classify_domain.py: run on labeled domains and their extracted features to check whether the model works; this is the last step
dga_model_eval.py: DGA detection related (check here)
extract_feature.py: extract features from raw data, slow
extract_feature_lib.py: imported by extract_feature.py
log_to_mongo.py: read log file, store into mongo
log_to_mongo_lib.py: imported by log_to_mongo.py
worker.py: the 'main()' for log_to_mongo and extract_feature (python, glue language)

Good luck :) In case you need help: chenkainp@gmail.com
