Skip to content

Systematic Classification Engine And Data ANalysis

License

GPL-2.0, Unknown licenses found

Licenses found

GPL-2.0
LICENSE.txt
Unknown
COPYING
Notifications You must be signed in to change notification settings

WeiliangLuo/sceadan

 
 

Repository files navigation

Sceadan v 1.2.1

Sceadan (pronounced "scee-den") is a Systematic Classification Engine for Advanced Data ANalysis tool.

Sceadan is originally Old English / Proto-Germanic for "to classify."

Copyright (c) 2012-2013 The University of Texas at San Antonio

License: GPLv2

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Public License for more details.

You should have received a copy of the GNU General Public License along with this program; if not, write to the Free Software Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.

Written by:

Dr. Nicole Beebe (nicole.beebe@utsa.edu) and Lishu Liu, Department of Information Systems and Cyber Security, and Laurence Maddox, Department of Computer Science, University of Texas at San Antonio, One UTSA Circle, San Antonio, Texas 78209

Simson L. Garfinkel (simsong@acm.org) Naval Postgraduate School Arlington, VA 22201

Purpose

Sceadan is a C module that uses libsvm to determine the "type" of bulk data based on file content, without using magic numbers, internal data structures, or file system metadata. Currently this is done by extracting features from the bulk data and applying a statical model based on a support vector machine. Current veatures include unigram counts, bigram counts, and other statistical measures.

The sceadan module ships with an application called sceadan_app that demonstrates the use of the program.

Layout

$ROOT/DATA - The training data. Extension is assumed to be the file type. Can contain subdirectories $ROOT/src - Sceadan source code $ROOT/src/sceadan_app - generates vectors $ROOT/tools/ - This directory $ROOT/tools/Makefile - runs all training $ROOT/tools/sceadan_train.py - Python program run by Makefile

Build Instructions

./configure
make

Usage

src/sceadan_app [options] <target> 
    python3 tools/sceadan_train.py <args>

If target is a directory, Sceadan will operate on each file within the directory individual, otherwise Sceadan will classify the single file referenced.

Training Guide

Sceadan comes with a pre-trained model that is compiled into the program. The model is generated by liblinear and then compiled into C code with the program src/mcompile.cpp. This produces a .c output file called src/sceadan_model_precompiled.c which is compiled by the C compiler.

However, you may wish to train your own model. For example, you can:

  • Experiment with different block sizes
  • Experiment with different features
  • Train on different document types

To train sceadan you will use the files in the tools/ directory. Please see the file doc/training_procedure.md for further information.

Machine Learning Science

Sceadan uses the default support vector machine model:

  • Linear kernel function
  • L2 regularized L2 loss function, primal solver (Liblinear solver_type L2R_L2LOSS_SVC) (e=0.005)
  • Uses C=256 for the penalty parameter for the error term
  • The model was trained on 1,800 items in each of 48 file type classes
  • The features used for classification in this version are the concatenated bi-gram and unigram count vectors:
    • features 1-65536 are bigrams \x0000-\x00FF then \xFF00-\xFFFF
    • features 65537-65792 are unigrams\x00-\xFF.

This version was statistically trained to identify 48 of the following file/data types:

TEXT = 1, /* file type description: Plain text file extensions: .text, .txt */
CSV = 2, /* file type description: Delimited file extensions: .csv */
LOG = 3, /* file type description: Log files file extensions: .log */
HTML = 4, /* file type description: HTML file extensions: .html */
XML = 5, /* file type description: xml file extensions: .xml */
JSON = 7, /* file type description: JSON records file extensions: .json */
JS = 8, /* file type description: JavaScript code file extensions: .js */
JAVA = 9, /* file type description: Java Source Code file extensions: .java */
CSS = 10, /* file type description: css file extensions: .css */
B64 = 11, /* file type description: Base64 encoding file extensions: .b64 */
A85 = 12, /* file type description: Base85 encoding file extensions: .a85 */
B16 = 13, /* file type description: Hex encoding file extensions: .b16 */
URL = 14, /* file type description: URL encoding file extensions: .urlencoded */
RTF = 16, /* file type description: Rich Text File file extensions: .rtf */
TBIRD = 17, /* file type description: Thunderbird Mail Files (data and index) file extensions: .msf and no extension */
PST = 18, /* file type description: MS Outlook PST files file extensions: .pst */
PNG = 19, /* file type description: Portable Network Graphic file extensions:  */
GIF = 20, /* file type description: GIF file extensions: .gif */
TIF = 21, /* file type description: Bi-tonal images file extensions: .tif, .tiff */
JB2 = 22, /* file type description: JBIG2 file extensions: .jb2 */
GZ = 23, /* file type description: ZLIB - DEFLATE compression file extensions: .gz, .gzip, .tgz, .z, .taz */
ZIP = 24, /* file type description: ZLIB - DEFLATE compression file extensions: .zip */
BZ2 = 27, /* file type description: BZ2 file extensions: .bz, .tbz, .bz2, bzip2, .tbz2 */
PDF = 28, /* file type description: PDF file extensions: .pdf */
DOCX = 29, /* file type description: MS-DOCX file extensions: .docx */
XLSX = 30, /* file type description: MS-XLSX file extensions: .xlsx */
PPTX = 31, /* file type description: MS-PPTX file extensions: .pptx */
JPG = 32, /* file type description: JPG file extensions: .jpg, .jpeg */
MP3 = 33, /* file type description: MP3 file extensions: .mp3 */
M4A = 34, /* file type description: AAC file extensions: .m4a */
MP4 = 35, /* file type description: H264 file extensions: .mp4 */
AVI = 36, /* file type description: AVI file extensions: .avi */
WMV = 37, /* file type description: WMV file extensions: .wmv */
FLV = 38, /* file type description: FLV file extensions: .flv */
WAV = 40, /* file type description: Windows Audio File file extensions: .wav */
MOV = 42, /* file type description: Apple Quicktime Move file extensions: .mov */
DOC = 43, /* file type description: MS-DOC file extensions: .doc */
XLS = 44, /* file type description: MS-XLS file extensions: .xls */
PPT = 45, /* file type description: MS-PPT file extensions: .ppt */
FAT = 46, /* file type description: FS-FAT file extensions: .fat */
NTFS = 47, /* file type description: FS-NTFS file extensions: .ntfs */
EXT3 = 48, /* file type description: FS-EXT file extensions: .ext3 */
EXE = 49, /* file type description: Windows Portable Executables (PE) file extensions: .exe */
DLL = 50, /* file type description: Windows Dynamic Link Library files file extensions: .dll */
ELF = 51, /* file type description: Linux Executables file extensions: .elf */
BMP = 52, /* file type description: Bitmap file extensions: .bmp */

This version of Sceadan detects the following types by rule or pattern:

  • Random
  • Constant

This version has support for enabling advanced feature extraction and usage in future versions and with use of other model files.

NOTE: The formulas contained in the code are not verified as correct and should be used only with extreme caution.

Such features include:

  • Bi-gram entropy
  • Hamming weight
  • Mean byte value
  • Standard deviation of byte values
  • Absolute deviation of byte values
  • Skewness
  • Kurtosis
  • Compressed item length -- Burrows-Wheeler algorithm (bzip)
  • Compressed item length -- Lempel-Ziv-Welch algorithm (LZW used by zlib and gzip)
  • Average contiguity between bytes
  • Max byte streak
  • Low ASCII frequency
  • Medium ASCII frequency
  • High ASCII frequency NOTE: Experimentation and testing showed the bi-gram/unigram byte count vectors achieved higher prediction accuracy than other combinations which included some or all of the above advanced features in a 38-class scenario.

Should advanced features be enabled, the following dependencies will be required:

  • zlib v1.2.7
  • libbz v1.0.2-5

Modes of operation:

  • AVAILABLE: "Container mode": calculates statistics on the entire input item (file) NOT AVAILABLE: "Block mode": calculates statistics on blocks of specified size within a container file; marches through container at specified byte-level offsets
  • NOT AVAILABLE:"Feature vector extraction mode": does not predict type; simply produces vector files compatible with libsvm and liblinear format; useful for researchers who want to use Sceadan to generate vector files for research experiments

Bugs and Platforms Tested On

This version was built for a Linux and Mac environments.

This version calculates hamming weight and entropy values incorrectly. This has no impact on the default operation, since Sceadan does not use these features in this version's prediction model.

Where to find things

You'll want to be familiar with:

Acknowledgments

Foundational research, design, and project management by Nicole L. Beebe, Ph.D. at The University of Texas at San Antonio (UTSA)

Software development by Laurence Maddox and Lishu Liu, UTSA.

Experimentation by Lishu Liu, UTSA.

Special thanks to DJ Bauch, SAIC, for software development assistance.

Special thanks to Minghe Sun, Ph.D., UTSA for support vector machine assistance.

Special thanks to Matt Beebe for tool naming inspiration.

This research was supported in part by a research grant from the U.S. Naval Postgraduate School (Grant No. N00244-11-1-0011), and in part by funding from the UTSA Provost Summer Research Mentorship program.

About

Systematic Classification Engine And Data ANalysis

Resources

License

GPL-2.0, Unknown licenses found

Licenses found

GPL-2.0
LICENSE.txt
Unknown
COPYING

Stars

Watchers

Forks

Packages

No packages published

Languages

  • C++ 51.6%
  • C 39.2%
  • Python 8.2%
  • Other 1.0%