DIG workflow processing for the EFFECT project.
- Download and install conda - https://www.continuum.io/downloads. Example for 64-bit Linux:
  a. wget https://repo.continuum.io/archive/Anaconda2-4.4.0-Linux-x86_64.sh
  b. bash Anaconda2-4.4.0-Linux-x86_64.sh
  c. source ~/.bashrc
- Install conda env -
conda install -c conda conda-env
- Clone this repo:
git clone https://github.com/usc-isi-i2/effect-workflows.git
cd effect-workflows
- Create user effect on HDFS. Add the folders /user/effect/data and /user/effect/workflow and give all users write permission to those folders.
- Run the install script:
  ./install.sh
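The HDFS user/folder setup described above can be sketched as follows (a minimal sketch using the standard hdfs dfs shell; run as the HDFS superuser and adjust ownership/permissions to your cluster's policy):

```shell
# Create the data and workflow folders for the effect user
hdfs dfs -mkdir -p /user/effect/data /user/effect/workflow
# Make the effect user own its home directory
hdfs dfs -chown -R effect /user/effect
# Give all users write permission to the two folders
hdfs dfs -chmod -R a+w /user/effect/data /user/effect/workflow
```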
- To copy CDR data from existing machine, follow instructions in
copyCDR.txt
- Import all workflows in oozie*.json into your Oozie using Hue and schedule the coordinators
NOTE: You should build the environment on the same hardware/OS you're going to run the job on.
Running the script to convert CSV, JSON, XML, or CDR data into the format used for Karma modeling:
- Switch to the effect-env:
source activate effect-env
- Execute:
python generateDataForKarmaModeling.py --input <input filename> --output <output filename> \
--format <input format-csv/json/xml/cdr> --source <a name for the source> \
--separator <column separator for CSV files>
Example Invocations:
python generateDataForKarmaModeling.py --input ~/github/effect/effect-data/nvd/sample/nvdcve-2.0-2003.xml \
--output nvd.jl --format xml --source nvd
python generateDataForKarmaModeling.py --input ~/github/effect/effect-data/hackmageddon/sample/hackmageddon_20160730.csv \
--output hackmageddon.jl --format csv --source hackmageddon
python generateDataForKarmaModeling.py --input ~/github/effect/effect-data/hackmageddon/sample/hackmageddon_20160730.jl \
--output hackmageddon.jl --format json --source hackmageddon
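The invocations above can be batched over a directory of files of the same format, e.g. (a sketch; the sample directory and the nvd source name are taken from the examples above, adjust to your data):

```shell
# Convert every XML file in the sample directory, writing one .jl
# output per input, named after the input file
for f in ~/github/effect/effect-data/nvd/sample/*.xml; do
    out="$(basename "$f" .xml).jl"
    python generateDataForKarmaModeling.py --input "$f" \
        --output "$out" --format xml --source nvd
done
```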
- See hiveQueries.sql for examples
- See copyCDR.txt to copy all data from one hive install to another
- The install.sh script will build all jars and files required to run the workflow. To run it:
  cp sparkRunCommands/run_effectWorkflow.sh ./
  ./run_effectWorkflow.sh
This will load data from HIVE table CDR, apply karma models to it and save the output to HDFS.
To load the data to ES:
  cp sparkRunCommands/run_effectWorkflow-es.sh ./
  ./run_effectWorkflow-es.sh
- To remove the environment run
conda env remove -n effect-env
- To see all environments run
conda env list
- Run the Oozie workflow from the command line - takes in job.properties and workflow.xml
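A command-line submission can be sketched with the standard Oozie CLI (the server URL here is a placeholder; substitute your Oozie host):

```shell
# Submit and start the workflow; prints a job id on success
oozie job -oozie http://oozie-host:11000/oozie -config job.properties -run
# Check the status of a submitted job using the returned id
oozie job -oozie http://oozie-host:11000/oozie -info <job-id>
```

The workflow.xml is not passed on the command line; job.properties points Oozie at the workflow's location on HDFS (typically via the oozie.wf.application.path property).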