sitongan/vSearch


README for vSearch

A Parallelised Script for Variable Selection

vSearch is a script that automates wrapper methods for variable selection. It ranks variables according to their importance, a common need in machine learning analyses in high energy physics.


==================================================
How to use:

1. Check that the program includes:

		vsearch.py
		vSearchStrategy.py
		vSearchHelper.py
		vsearch-runtest.sh

		* anl-plot.py is available for easy visualisation of performance from vsearch output

2. Environment requirements:

		1. Sun Grid Engine, for parallelisation. 
		2. TMVA Framework, for machine learning training.
		3. Program tested on RedHat Scientific 6.9 and Python 2.7.13

3. The program requires:

		1. Three folders to be created in the program's directory:
			1)	vsearch-result, for output files
			2)	vsearch-varlist, for initial variable lists and successively generated variable lists
				Initial variable lists must be put in this folder
			3)	vsearch-log, for log files generated by TMVA
			The names of these folders can be redefined in vsearch.py

		2. A variable list that defines all variables available for selection ("mother_list"), placed in the program's directory

		3. Initial settings to be defined at the top of vsearch.py

		4. SGE and TMVA environment***
		The bash script vsearch-runtest.sh is environment-specific. vsearch.py calls it to send batch jobs to SGE. It reads a list of variable lists from a temp file (created and passed by vsearch.py) and runs TMVA on each of them as a batch of parallel jobs.
		To adapt to your local SGE environment, you may need to change the commands in both vsearch-runtest.sh and the run_jobs() function in vsearch.py. Note that jobs must be run as a batch and in sync: the "-sync y" and "-t" arguments in run_jobs() are necessary for vsearch to run.
		Depending on your application, you might also need to change the command that runs TMVA, which is the last line of vsearch-runtest.sh. The variable ${filedir} gives the path ("vsearch-varlist/itrAd-0-0-0-1", for example) of the variable list to be run in EACH JOB. The variable ${ffname} gives only the last part ("itrAd-0-0-0-1"). By default, both are passed as arguments to the TMVA wrapper mva-train-test, but you will likely need to change this line to run TMVA your own way.
			Note that vsearch requires the result root file from the training of each variable list to be stored in a folder named ${ffname}. This can be done by passing ${ffname} to the TMVA wrapper that you use.

		5. Format of variable list filename:
		vlistitle-"generation"-"parent id"-"no. of var"-"id in this generation"
		For example, itrAd-0-0-0-1 is a variable list for iterative addition that is used in the 0th generation (initial list), has parent id 0, contains 0 variables (the initial list for iterative addition is empty) and is the 1st variable list in its generation.
		Every variable list placed in vsearch-varlist must follow this filename format, but "mother_list" does not have to.

		6. Format of variable list
		This depends on your TMVA implementation. For the out-of-the-box vsearch, a variable list has the following format:

			Title (Headerline) 
			Var1
			Var2
			Var3
			...
			&
			Everything behind this ampersand is ignored.

		This is the format required by mva-train-test, the TMVA wrapper used for vsearch during its development. vSearch writes its metadata after the ampersand sign. If the format of your variable list is different, you will need to change the relevant code in the spawnvlist() function in vsearch.py.
		If your variable list simply has one variable definition per line (without a header line or an ampersand at the end), you can set varlistextraformat to False in the initial settings of vsearch.py. vsearch will then not output any metadata.
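
		The filename and file formats described in points 5 and 6 can be illustrated with a short Python sketch. This is not code taken from vsearch.py: the names parse_vlist_name, read_vlist and VlistName are hypothetical, and the sketch assumes that the ampersand sits at the start of its own line, as in the layout shown above.

```python
from collections import namedtuple

# Hypothetical helper names for illustration; vsearch.py may organise this differently.
VlistName = namedtuple("VlistName", "title generation parent_id n_vars list_id")


def parse_vlist_name(filename):
    """Split a name such as 'itrAd-0-0-0-1' into its five fields."""
    # Split from the right so that a title containing dashes still parses.
    title, gen, parent, n_vars, list_id = filename.rsplit("-", 4)
    return VlistName(title, int(gen), int(parent), int(n_vars), int(list_id))


def read_vlist(lines, extra_format=True):
    """Return (title, variables) from the lines of a variable list.

    extra_format=True:  first line is the header, and everything from the
                        first line starting with '&' onwards is ignored.
    extra_format=False: every non-empty line is a variable definition.
    """
    rows = [line.strip() for line in lines]
    title = None
    if extra_format and rows:
        title, rows = rows[0], rows[1:]
        for i, row in enumerate(rows):
            if row.startswith("&"):
                rows = rows[:i]  # drop the ampersand and the metadata after it
                break
    return title, [row for row in rows if row]
```

		For instance, parse_vlist_name("itrAd-0-0-0-1") gives generation 0, parent id 0, 0 variables and id 1 in its generation. The extra_format argument plays the role that varlistextraformat plays in the vsearch.py settings, but how vsearch.py handles it internally is not shown here.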


==================================================

Alternative Search Strategy/Performance Benchmark:

		The Area Under the Receiver Operating Characteristic curve (AUROC) is the default performance benchmark for vsearch. It is defined in check_benchmark_roc() in vsearch.py. This function can easily be changed to use other benchmarks.
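
		As a reminder of what the AUROC measures, the sketch below computes it directly from classifier scores, using the standard equivalence with the probability that a randomly chosen signal event scores higher than a randomly chosen background event (ties count as one half). The function name auroc and its arguments are hypothetical. This is not how check_benchmark_roc() obtains the value, and it is only meant to illustrate the quantity being compared.

```python
def auroc(signal_scores, background_scores):
    """AUROC as the fraction of (signal, background) pairs ranked correctly.

    A pair counts 1 if the signal score is higher, 0.5 if the scores tie,
    and 0 otherwise. The loop is O(n*m), which is fine for a small example.
    """
    wins = 0.0
    for s in signal_scores:
        for b in background_scores:
            if s > b:
                wins += 1.0
            elif s == b:
                wins += 0.5
    return wins / (len(signal_scores) * len(background_scores))
```

		A value of 1.0 means perfect separation of signal from background and 0.5 corresponds to random guessing, so a higher value marks a better variable list.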

		Search strategies are defined in the file vSearchStrategy.py. Each strategy is defined by two functions. choose_next_vlistlist_STRATEGYNAME takes a list of tested variable lists and their performance, and decides which ones to use as parent variable lists for the next generation. genvlist_STRATEGYNAME takes a list of selected variable lists and generates the next generation of variable lists according to predefined logic. Function interfaces and examples are documented in vSearchStrategy.py, so you can code up your own search strategy for use with vSearch.
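
		To make the division of labour between the two functions concrete, here is a minimal sketch of a greedy "add one variable per generation" strategy. The argument lists and return types are assumptions made for this illustration (variable lists as plain Python lists, performance as a float). The real interfaces are the ones defined in vSearchStrategy.py and may differ.

```python
def choose_next_vlistlist_sketch(tested, n_keep=1):
    """Pick the parents of the next generation.

    tested: list of (variables, score) pairs, where a higher score is better.
    Returns the n_keep best variable lists (assumed interface).
    """
    ranked = sorted(tested, key=lambda pair: pair[1], reverse=True)
    return [variables for variables, _ in ranked[:n_keep]]


def genvlist_sketch(parents, mother_list):
    """Generate children by adding one unused variable to each parent.

    mother_list: all variables available for selection.
    Children that contain the same set of variables are emitted only once.
    """
    children = []
    seen = []
    for parent in parents:
        for var in mother_list:
            if var in parent:
                continue
            child = parent + [var]
            key = sorted(child)
            if key not in seen:
                seen.append(key)
                children.append(child)
    return children
```

		With tested = [(["a"], 0.7), (["b"], 0.9)], the first function selects ["b"] as the parent, and the second function then proposes one child for each remaining variable in the mother list.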

==================================================

Output Files:

		After vSearch finishes the search process, the folder vsearch-result is populated with result files, among which:
		1) best-gen-N: the best variable list of the Nth generation
		2) STRATEGYNAME-N: the performance of the different variable lists of the Nth generation
		3) STRATEGYNAME.result: the best performance for each possible number of variables. This is the most important output file.

		STRATEGYNAME.result can be easily visualised with anl-plot.py. See anl-plot.py for examples and instructions.


==================================================

Author: 

Sitong An
University of Cambridge
sa747@cam.ac.uk

vSearch is written as part of the DESY Summer Student Programme 2017
Project Report available at http://www.desy.de/f/students/2017/reports/SitongAn.pdf
