Currently, the data available for clustering consists of Finance articles. The source can be found in the S3 bucket: `s3://corpii/Finance`.
- Required Environment Variables (`~/.bashrc`)

```bash
export INDICO_API_KEY=""                    # Needs access to the themeextraction server
export CUSTOM_INDICO_API_KEY=""             # Needs access to the custom collections
export AWS_ACCESS_KEY_ID=""                 # Access to the S3 for contact@indico.io
export AWS_SECRET_ACCESS_KEY=""             # Access to the S3 for contact@indico.io
export AWS_HOSTED_ZONE_ID="Z2GXF43FTQVWH2"  # us-west-2
```
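Before running anything, it can save debugging time to confirm these variables are actually set to non-empty values. A minimal sketch, assuming bash (`check_vars` is a hypothetical helper, not part of the repo):

```bash
# Print which of the named variables are unset or empty (the exports above
# default to "", so an empty value counts as missing here).
check_vars() {
    missing=""
    for v in "$@"; do
        # ${!v} is bash indirect expansion: the value of the variable named by $v
        [ -n "${!v}" ] || missing="$missing $v"
    done
    if [ -n "$missing" ]; then
        echo "missing:$missing"
        return 1
    fi
    echo "ok"
}
```

For example, `check_vars INDICO_API_KEY CUSTOM_INDICO_API_KEY AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY AWS_HOSTED_ZONE_ID` prints `ok` only once every key has been filled in.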
- (Optional) Useful scripts for process control & using tmux (`~/.bashrc`)

```bash
# i.e. `die python` - kills existing python processes
die() {
    # wrap the first character in brackets so grep doesn't match its own command line
    proc=$(echo $1 | sed 's/^\(.\)/[\1]/')
    sudo kill -9 $(ps aux | grep $proc | awk '{print $2}')
}

# i.e. `attach 1` - attaches tmux session 1
attach() {
    tmux attach-session -t $1
}
```
- Don't forget to `source ~/.bashrc`.
- Save the contents of `./scripts/setup.sh` to a file and make it executable with `sudo chmod +x setup.sh`. Run the script.
- Run elasticsearch by running the `./scripts/run_elasticsearch_host.sh` script in a tmux session (e.g. `tmux new-session -d -s es './scripts/run_elasticsearch_host.sh'`, then `attach es` to inspect it).
- Either restore elasticsearch data from a backup, or run data ingress to populate the elasticsearch store:

```bash
# With the data in <ClusterRSS root>/inputxl
# With a <ClusterRSS root>/completed.txt file containing finished file names
python -m cluster.search.load_data [number_of_processes] 2>&1 | tee raw.log
```
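The two comments above describe the on-disk layout `load_data` expects. A minimal sketch of preparing it from the ClusterRSS root (paths taken from the comments; the data files themselves come from the S3 bucket noted at the top):

```bash
mkdir -p inputxl     # the ingress data goes here
touch completed.txt  # finished file names, one per line; starts empty
ls inputxl | wc -l   # quick check: how many input files are waiting
```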
- Run the app:

```bash
python -m indicluster.app
# navigate to localhost:8002/text-mining in your browser
```
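The app takes a moment to come up, so it can help to poll the port before opening the browser. A sketch assuming bash (`wait_for_port` is a hypothetical helper, not part of the repo, and `/dev/tcp` is a bash-only pseudo-device):

```bash
# Poll host:port roughly once per second until it accepts a TCP connection.
wait_for_port() {
    host=$1; port=$2; tries=${3:-30}
    while [ "$tries" -gt 0 ]; do
        # the subshell opens fd 3 on the target port and closes it on exit
        if (exec 3<>"/dev/tcp/$host/$port") 2>/dev/null; then
            echo "up"
            return 0
        fi
        tries=$((tries - 1))
        [ "$tries" -gt 0 ] && sleep 1
    done
    echo "timed out"
    return 1
}
```

For example, `wait_for_port localhost 8002 30` before visiting `localhost:8002/text-mining`.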