Currently, the data available for clustering consists of Finance articles. The source can be found in the S3 bucket: `s3://corpii/Finance`.
- Required Environment Variables (`~/.bashrc`)

```bash
export INDICO_API_KEY=""                    # Needs access to the themeextraction server
export CUSTOM_INDICO_API_KEY=""             # Needs access to the custom collections
export AWS_ACCESS_KEY_ID=""                 # Access to the S3 for contact@indico.io
export AWS_SECRET_ACCESS_KEY=""             # Access to the S3 for contact@indico.io
export AWS_HOSTED_ZONE_ID="Z2GXF43FTQVWH2"  # us-west-2
```
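Before running anything, it can save debugging time to confirm these variables are actually set to non-empty values. A minimal sketch, assuming bash (`check_vars` is a hypothetical helper, not part of the repo):

```bash
# Print which of the named variables are unset or empty (the exports above
# default to "", so an empty value counts as missing here).
check_vars() {
    missing=""
    for v in "$@"; do
        # ${!v} is bash indirect expansion: the value of the variable named by $v
        [ -n "${!v}" ] || missing="$missing $v"
    done
    if [ -n "$missing" ]; then
        echo "missing:$missing"
        return 1
    fi
    echo "ok"
}
```

For example, `check_vars INDICO_API_KEY CUSTOM_INDICO_API_KEY AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY AWS_HOSTED_ZONE_ID` prints `ok` only once every key has been filled in.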
- (Optional) Useful scripts for process control & using tmux (`~/.bashrc`)

```bash
# i.e. `die python` - kills existing python processes
die() {
    # wrap the first character in brackets so grep doesn't match its own command line
    proc=$(echo $1 | sed 's/^\(.\)/[\1]/')
    sudo kill -9 $(ps aux | grep $proc | awk '{print $2}')
}

# i.e. `attach 1` - attaches tmux session 1
attach() {
    tmux attach-session -t $1
}
```
- Don't forget to `source ~/.bashrc`.
- Save the contents of `./scripts/setup.sh` to a file and make it executable with `sudo chmod +x setup.sh`. Run the script.
- Run elasticsearch by running the `./scripts/run_elasticsearch_host.sh` script in a tmux session (e.g. `tmux new-session -d -s es './scripts/run_elasticsearch_host.sh'`, then `attach es` to inspect it).
- Either restore elasticsearch data from a backup, or run data ingress to populate the elasticsearch store:

```bash
# With the data in <ClusterRSS root>/inputxl
# With a <ClusterRSS root>/completed.txt file containing finished file names
python -m cluster.search.load_data [number_of_processes] 2>&1 | tee raw.log
```
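The two comments above describe the on-disk layout `load_data` expects. A minimal sketch of preparing it from the ClusterRSS root (paths taken from the comments; the data files themselves come from the S3 bucket noted at the top):

```bash
mkdir -p inputxl     # the ingress data goes here
touch completed.txt  # finished file names, one per line; starts empty
ls inputxl | wc -l   # quick check: how many input files are waiting
```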
- Run the app:

```bash
python -m indicluster.app
# navigate to localhost:8002/text-mining in your browser
```
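The app takes a moment to come up, so it can help to poll the port before opening the browser. A sketch assuming bash (`wait_for_port` is a hypothetical helper, not part of the repo, and `/dev/tcp` is a bash-only pseudo-device):

```bash
# Poll host:port roughly once per second until it accepts a TCP connection.
wait_for_port() {
    host=$1; port=$2; tries=${3:-30}
    while [ "$tries" -gt 0 ]; do
        # the subshell opens fd 3 on the target port and closes it on exit
        if (exec 3<>"/dev/tcp/$host/$port") 2>/dev/null; then
            echo "up"
            return 0
        fi
        tries=$((tries - 1))
        [ "$tries" -gt 0 ] && sleep 1
    done
    echo "timed out"
    return 1
}
```

For example, `wait_for_port localhost 8002 30` before visiting `localhost:8002/text-mining`.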