This project evaluates whole machine learning workflows specified in a simple JSON format. As such, it can be considered an extension of the Pipeline class in scikit-learn. It also contains a simple XML-RPC interface which allows a number of workflows to be evaluated in parallel.
The project originated as part of GP-ML, a framework for the automatic design of machine learning workflows based on genetic programming, where it was used to evaluate the population in a distributed way. The population is provided through the XML-RPC interface as a single JSON-formatted string which contains a list of workflows. Each workflow is a dictionary of methods indexed by (workflow-unique) identifiers. Each method is in turn described as a triple of (list of input IDs, method with parameters, list of output IDs).
To start using the project, run the xmlrpc_interface.py script, which starts the server on localhost:8080.
python -m scoop xmlrpc_interface.py <log_path>
Once started, the server can be called through XML-RPC and provides three basic functions:
- get_param_sets(datafile) returns the set of parameters for all machine learning methods the server understands, based on the input datafile (the ranges of parameters for some methods may change depending on, e.g., the number of attributes in the dataset)
- evaluate(json_string, datafile) evaluates the machine learning workflows described by the json_string (see below) on the datafile
- quit() stops the server
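As a rough illustration of calling these functions, the sketch below uses Python's standard xmlrpc.client module. The function names evaluate and the address localhost:8080 come from this document; the helper names, the file name data.csv, and the shape of the returned value are assumptions made for the example. The workflow used is the single decision tree example shown later in this document.

```python
import json
import xmlrpc.client

# Address taken from the text above; adjust if the server runs elsewhere.
SERVER_URL = "http://localhost:8080"


def build_payload(workflows):
    """Serialize a list of workflow dictionaries into one JSON string."""
    return json.dumps(workflows)


def evaluate_remote(workflows, datafile, url=SERVER_URL):
    """Send the workflows to a running server and return its reply unchanged.

    Hypothetical helper: the format of the reply is not described here,
    so it is passed through as-is.
    """
    proxy = xmlrpc.client.ServerProxy(url)
    return proxy.evaluate(build_payload(workflows), datafile)


# A workflow with a single decision tree applied to the input data.
single_tree = {
    "input": [[], "input", ["IN:0"]],
    "DT": [["IN:0"], ["DT", {"max_depth": 10}], []],
}
payload = build_payload([single_tree])

# With the server running, a call could look like this
# ("data.csv" is a placeholder file name):
# result = evaluate_remote([single_tree], "data.csv")
```

Note that the payload is always a list, even when only one workflow is evaluated.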
The input JSON contains a list of machine learning workflows [wf_1, ..., wf_n]
which should be evaluated on the same dataset. Each workflow in the list is a
dictionary {"method1": method_spec1, "method2": method_spec2, ...} which
contains the specification of at least two methods. Each method specification,
in turn, is a triple [inputs, [method, parameters], outputs].
In each workflow, an input method must be specified. This method has no
inputs and is used only to rename the input file to an ID which is used in the
rest of the workflow. For example, the following method specification tells the
system that the input file will be referenced as "IN:0" in the rest of the
workflow.
{
"input" : [ [], "input", ["IN:0"] ],
...
}
The workflow further contains any number of other methods. Each method must provide the list of its inputs, its name and parameters, and a list of outputs. There are four types of methods:
- a splitter is a method which has a single input and provides a list of outputs; for example, the k-means algorithm can be used to split the data into clusters
- a transformer is a method with a single input and a single output which is used to transform the data in any way; for example, PCA or feature selection can be performed with a transformer
- a classifier/regressor is a method which is capable of learning from the data and provides predictions for new data
- an aggregator is a method with several inputs and a single output; it is used to aggregate results from several classifiers or regressors. Currently only voting is supported as an aggregator (however, the implementation of voting can also be used to combine the results if the dataset was split into several smaller datasets)
Inputs and outputs of the methods are specified by unique IDs. These IDs only
connect the outputs of one method to the inputs of other methods and their
values do not affect the results in any way. The same holds for the keys of the
methods in the dictionary. The only exception to this rule is the input method,
which must have this name.
The last method in the workflow must contain an empty list of outputs, which signifies that its output is the output of the whole workflow.
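The rules above can be checked before a workflow is sent to the server. The following is a minimal sketch, not part of the project: the function name check_workflow is hypothetical, and it assumes that exactly one method has an empty output list and that every output ID is produced only once.

```python
def check_workflow(workflow):
    """Return a list of problems found in a workflow dictionary (empty if none)."""
    problems = []

    # The input method must exist under this exact key and take no inputs.
    if "input" not in workflow:
        problems.append("missing 'input' method")
    elif workflow["input"][0]:
        problems.append("'input' method must have no inputs")

    # Collect every produced ID and find the methods with no outputs.
    produced = set()
    final = []
    for name, (_inputs, _method, outputs) in workflow.items():
        for out in outputs:
            if out in produced:
                problems.append(f"output ID {out!r} is produced more than once")
            produced.add(out)
        if not outputs:
            final.append(name)

    # Assumption: exactly one method ends the workflow.
    if len(final) != 1:
        problems.append(f"expected one method with empty outputs, found {len(final)}")

    # Every input must refer to an ID produced somewhere in the workflow.
    for name, (inputs, _method, _outputs) in workflow.items():
        for inp in inputs:
            if inp not in produced:
                problems.append(f"{name!r} reads unknown ID {inp!r}")

    return problems


valid = {
    "input": [[], "input", ["IN:0"]],
    "DT": [["IN:0"], ["DT", {"max_depth": 10}], []],
}
broken = {"DT": [["IN:0"], ["DT", {}], []]}

valid_problems = check_workflow(valid)
broken_problems = check_workflow(broken)
```

Here valid_problems is empty, while broken_problems reports the missing input method and the dangling reference to "IN:0".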
Let's provide a few examples of workflow specifications.
This workflow contains only a single decision tree classifier with depth limited to 10 used on the input data directly.
{
"input" : [ [], "input", ["IN:0"] ],
"DT" : [ ["IN:0"], ["DT", {"max_depth": 10}], [] ]
}
This workflow contains a decision tree with a maximum depth of 10, a Gaussian naive Bayes classifier, and a support vector classifier with the complexity constant set to 0.5. The results of these three methods are aggregated with voting.
{
"input" : [ [], "input", ["IN:0"] ],
"DT" : [ ["IN:0"], ["DT", {"max_depth": 10}], ["DT:0"] ],
"GNB" : [ ["IN:0"], ["gaussianNB", {}], ["GNB:0"] ],
"SVC" : [ ["IN:0"], ["SVC", {"C": 0.5}], ["SVC:0"] ],
"vote" : [ ["DT:0", "GNB:0", "SVC:0"], ["vote", {}], [] ]
}
In this example, the data are first pre-processed using PCA and then split
into two groups using k-means clustering. The first group is classified using
the support vector classifier, the other group is classified with a decision
tree. The results of these two methods are merged back together with voting.
Additionally, the raw data are also processed with another decision tree.
Finally, a second vote aggregates the merged result with the output of that
second decision tree.
{
"input" : [ [], "input", ["IN:0"] ],
"PCA" : [ ["IN:0"], ["PCA", {"feat_frac": 0.1}], ["PCA:0"] ],
"kMeans" : [ ["PCA:0"], ["kMeans", {}], ["kM:0", "kM:1"] ],
"SVC" : [ ["kM:0"], ["SVC", {}], ["SVC:0"] ],
"DT1" : [ ["kM:1"], ["DT", {}], ["DT1:0"] ],
"vote1" : [ ["SVC:0", "DT1:0"], ["vote", {}], ["VT1:0"] ],
"DT2" : [ ["IN:0"], ["DT", {}], ["DT2:0"] ],
"vote2" : [ ["VT1:0", "DT2:0"], ["vote", {}], [] ]
}
- Python 3
- numpy, scipy, pandas, scikit-learn, matplotlib, scoop
Internally, the evaluator runs in iterations. In each iteration, it checks all the unprocessed methods and selects those that have all data available. These are then trained on their data. It essentially implements a simple topological sorting algorithm.
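The iteration scheme described above can be sketched as follows. This is an illustration of the idea only, not the code in eval.py: the function name execution_order is hypothetical, and the sketch just computes which methods become ready in each iteration without training anything.

```python
def execution_order(workflow):
    """Group method names into iterations.

    A method is ready once all of its input IDs have been produced
    by methods processed in earlier iterations.
    """
    available = set()          # IDs whose data already exist
    pending = dict(workflow)   # methods not processed yet
    iterations = []
    while pending:
        ready = [
            name
            for name, (inputs, _method, _outputs) in pending.items()
            if all(i in available for i in inputs)
        ]
        if not ready:
            # Nothing can run: an input is never produced, or there is a cycle.
            raise ValueError(f"cannot schedule: {sorted(pending)}")
        for name in ready:
            outputs = pending.pop(name)[2]
            available.update(outputs)
        iterations.append(ready)
    return iterations


# The PCA / k-means example from this document.
workflow = {
    "input": [[], "input", ["IN:0"]],
    "PCA": [["IN:0"], ["PCA", {"feat_frac": 0.1}], ["PCA:0"]],
    "kMeans": [["PCA:0"], ["kMeans", {}], ["kM:0", "kM:1"]],
    "SVC": [["kM:0"], ["SVC", {}], ["SVC:0"]],
    "DT1": [["kM:1"], ["DT", {}], ["DT1:0"]],
    "vote1": [["SVC:0", "DT1:0"], ["vote", {}], ["VT1:0"]],
    "DT2": [["IN:0"], ["DT", {}], ["DT2:0"]],
    "vote2": [["VT1:0", "DT2:0"], ["vote", {}], []],
}
order = execution_order(workflow)
# order == [["input"], ["PCA", "DT2"], ["kMeans"], ["SVC", "DT1"], ["vote1"], ["vote2"]]
```

Methods that appear in the same iteration do not depend on each other, so the order within an iteration does not matter.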
Adding new machine learning methods which implement the scikit-learn interface
is simple: just add the method to the model_names dictionary in
method_params.py and add the possible values of its parameters to the
create_param_set method.
For feature selection methods, do the same, and additionally add the handling
of the number of selected features to the train_dag method in eval.py. This
currently cannot be handled automatically, as the methods require the number of
features, which is not known in advance (e.g. if such a method follows another
feature selection method). Therefore, the number of features is handled as a
fraction during the parameter search and is transformed to the actual number
during training.
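The conversion from a fraction to a feature count might look like the sketch below. The parameter name feat_frac appears in the PCA example above; the helper name, the rounding rule, and the lower bound of one feature are assumptions, and the actual handling in train_dag may differ.

```python
def n_features_from_fraction(feat_frac, n_input_features):
    """Convert a fraction in (0, 1] into a feature count of at least one.

    Hypothetical helper: rounds to the nearest integer and never
    returns fewer than one feature.
    """
    if not 0 < feat_frac <= 1:
        raise ValueError("feat_frac must be in (0, 1]")
    return max(1, int(round(feat_frac * n_input_features)))


# The count depends on how many features reach the method, which is only
# known once the preceding methods have been trained.
after_first = n_features_from_fraction(0.1, 64)            # 64 input features
after_second = n_features_from_fraction(0.5, after_first)  # chained selection
```

Chaining two selections shows why the fraction cannot be resolved in advance: the second method's count depends on the output of the first.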