A genetic algorithm for designing distributed cell classifier circuits.
RAccoon has the following dependencies:
- Python 3+
Check requirements.txt for required packages or use:
$ conda create --name <env_name> --file requirements.txt
To download RAccoon from GitHub, run:
$ git clone https://github.com/MelaniaNowicka/RAccoon
Input data must be a .csv file in the following format: the first column contains unique sample IDs, the second column contains the annotation (0 - negative samples, 1 - positive samples), and the remaining columns contain miRNA profiles. Use a semicolon as the separator. Examples are shown below:
Continuous data:
ID | Annots | miR1 | miR2 |
---|---|---|---|
1 | 0 | 244 | 455 |
2 | 1 | 12 | 7945 |
3 | 1 | 7 | 2369 |
If you use continuous data, keep discretization turned on (--discretize t, the default).
Discretized data:
ID | Annots | miR1 | miR2 |
---|---|---|---|
1 | 0 | 1 | 0 |
2 | 1 | 0 | 1 |
3 | 1 | 0 | 1 |
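As a quick check of the format described above, the following minimal Python sketch parses a semicolon-separated file with the standard `csv` module. The function name `read_profiles` and the inline sample are illustrative and not part of RAccoon; the header names follow the example tables.

```python
import csv
import io

def read_profiles(handle):
    """Parse semicolon-separated data: ID; annotation (0/1); feature columns."""
    reader = csv.reader(handle, delimiter=";")
    header = next(reader)
    features = header[2:]          # every column after ID and annotation
    ids, annots, rows = [], [], []
    for record in reader:
        if not record:             # skip blank lines
            continue
        annotation = int(record[1])
        if annotation not in (0, 1):
            raise ValueError(f"sample {record[0]}: annotation must be 0 or 1")
        ids.append(record[0])
        annots.append(annotation)
        rows.append([float(v) for v in record[2:]])
    if len(set(ids)) != len(ids):
        raise ValueError("sample IDs must be unique")
    return features, ids, annots, rows

# Inline sample mirroring the continuous-data table above.
sample = "ID;Annots;miR1;miR2\n1;0;244;455\n2;1;12;7945\n3;1;7;2369\n"
features, ids, annots, rows = read_profiles(io.StringIO(sample))
print(features, annots)
```

To check a real file, pass an open file handle instead of the `io.StringIO` object, for example `open("train_data.csv", newline="")`.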
To train a classifier on the training data, run:
python raccoon.py --train train_data.csv
Try it with the example data: train_data.csv, test_data.csv.
Description of the command-line parameters:
--train - training data set in the .csv format (obligatory)
--test - testing data set in the .csv format (default: None)
Training and test data should be formatted according to the description in the Data format section.
--config - config file name
Most of the following parameters can be set in a config file instead of on the command line. See the example config file here.
--rules - path to a file of pre-optimized rules (default: None)
--filter - filter out non-relevant features (default: t, f to turn off)
Non-relevant features (columns containing only 0s or only 1s) are filtered out because they are not informative.
--discretize - discretize the data (default: t, f to turn off)
--mbin - discretization parameter: m segments (default: 50)
--abin - discretization parameter: alpha (default: 0.5)
--lbin - discretization parameter: lambda (default: 0.1)
Data are discretized according to Wang et al. (2014); see that publication for details.
-c - maximal size of a classifier (maximal number of single rules in the classifier, default: 5)
-a - classification threshold (default: None)
If the classification threshold is set to a value (e.g., 0.5), it is fixed for all classifiers in the GA run; if it is None, one of the thresholds 0.25, 0.45, 0.5, 0.75 and 1.0 is randomly assigned to each classifier.
-w - multi-objective function weight (default: 0.5)
-u - uniqueness option (related to calculation of CDD classifier score, default: True)
-i - number of iterations without improvement after which the algorithm terminates (default: 30)
-f - number of fixed iterations after which the algorithm terminates (default: None)
-p - population size (default: 300)
--elitism - if True, the best solutions found are added to the population in each selection operation (default: True)
--poptfrac - pre-optimized fraction of the population; the remaining solutions are generated randomly (default: 0.5)
-x - crossover probability (default: 0.8)
-m - mutation probability (default: 0.1)
-t - tournament size (default: 0.2)
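Two of the behaviours described above, feature filtering (`--filter`) and threshold assignment (`-a`), can be sketched in a few lines of Python. This only illustrates the description and is not RAccoon's actual code: the function names `filter_non_relevant` and `assign_threshold` are hypothetical, and the data are assumed to be already discretized to 0/1.

```python
import random

# Threshold values listed in the description of -a.
THRESHOLDS = (0.25, 0.45, 0.5, 0.75, 1.0)

def filter_non_relevant(features, rows):
    """Drop columns that contain only 0s or only 1s."""
    keep = [j for j in range(len(features))
            if len({row[j] for row in rows}) > 1]
    kept_features = [features[j] for j in keep]
    kept_rows = [[row[j] for j in keep] for row in rows]
    return kept_features, kept_rows

def assign_threshold(fixed=None, rng=random):
    """Return the fixed threshold if one is given, else pick one at random."""
    if fixed is not None:
        return fixed
    return rng.choice(THRESHOLDS)

features = ["miR1", "miR2", "miR3"]
rows = [[1, 0, 1],
        [0, 0, 1],
        [0, 0, 0]]
print(filter_non_relevant(features, rows))   # miR2 is all 0s and is dropped
print(assign_threshold(fixed=0.5))
```

Passing `rng=random.Random(seed)` makes the random assignment reproducible in this sketch.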
Run the analysis using:
python run_tests.py --train train_data.csv --config config_tuning.ini [--val validation_data.csv --test test_data.csv --rules rule_file.csv --run_id name_of_the_run]
Try it with the example data (train_data.csv, test_data.csv):
python run_tests.py --train train_data.csv --test test_data.csv --config config_tuning.ini
You can change any of the parameters in config.ini. A description can be found here.
Output log description:
READING CONFIG - config parameter values
READING DATA - data processing information
PARAMETER TUNING - parameter tuning section including data division and pre-processing information as well as parameter tuning results
FINAL TEST - results of the final tests (the classifiers are trained with tuned parameters and tested on test data)
simDataGenerator generates a simulated GED data set with the compcodeR package, preprocesses it by splitting it into training and test data sets, and normalizes it with the TMM normalization method (edgeR).
LIBRARIES REQUIRED: compcodeR, edgeR, matrixStats
Run prepareSimulatedDataset() with parameters:
n.genes - number of genes, e.g., 500
samples.per.cond - number of samples per class, e.g., 100
n.diffexp - number of differentially expressed genes, e.g., 50 (10%)
fraction.upregulated - fraction of differentially expressed genes that are upregulated, e.g., 0.5
random.outlier.high.prob - fraction of random outliers (higher values), e.g., 0.5
random.outlier.low.prob - fraction of random outliers (lower values), e.g., 0.5
train.fraction - fraction of the data that becomes the training data set, e.g., 0.8 (1-train.fraction = test.fraction)
is.seed - set to TRUE to be able to reproduce the results for certain conditions, set to FALSE if you want to generate different data sets
generateSummary - set to TRUE if you want to generate a compcodeR data report
imbalanced - set to TRUE if the data set should be imbalanced
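To make the effect of `train.fraction` concrete, here is a small Python sketch of the split arithmetic. It is not the R implementation used by simDataGenerator: the function name `split_indices` is hypothetical, and it assumes two classes of `samples.per.cond` samples each and a simple shuffled, non-stratified split.

```python
import random

def split_indices(n_samples, train_fraction, seed=None):
    """Shuffle sample indices and split them into train and test parts."""
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be between 0 and 1")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)   # a fixed seed gives a reproducible split
    n_train = round(n_samples * train_fraction)
    return indices[:n_train], indices[n_train:]

# samples.per.cond = 100 in two classes -> 200 samples in total.
train, test = split_indices(2 * 100, 0.8, seed=1)
print(len(train), len(test))   # 160 40
```

The fixed seed plays the same role here as `is.seed = TRUE`: repeated calls with the same seed return the same split.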