It's a trimmed-down version of an ad campaign analysis platform.
- NLP needs NLTK and related libraries.
- The platform has to traverse all existing articles, so it must keep a large database. Since it makes no sense to duplicate the same database in Ruby, Python also serves the web side.
- NLP needs to keep the dictionary, corpus, and LSI models in RAM, so a live Python process is required.
- For continued experiments with the data in the future, Python is the best choice.
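The point about keeping models in RAM can be illustrated with a small sketch: an index is built once when the process starts and is then queried many times without reloading. This is only an illustration under stated assumptions. It uses plain TF-IDF cosine similarity from the standard library as a stand-in, not the LSI model the project keeps in memory, and the class name `InMemoryIndex` and the sample documents are hypothetical.

```python
import math
from collections import Counter


class InMemoryIndex:
    """Toy similarity index built once and kept in RAM by a long-lived process."""

    def __init__(self, docs):
        # docs: mapping of title -> text (hypothetical sample data)
        self.titles = list(docs)
        tokenized = [docs[t].lower().split() for t in self.titles]
        n = len(tokenized)
        # Document frequency of each term across the corpus
        df = Counter(term for tokens in tokenized for term in set(tokens))
        # Smoothed inverse document frequency
        self.idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
        self.vectors = [self._vectorize(tokens) for tokens in tokenized]

    def _vectorize(self, tokens):
        # TF-IDF weights, L2-normalised so a dot product equals cosine similarity
        tf = Counter(t for t in tokens if t in self.idf)
        vec = {t: c * self.idf[t] for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        return {t: w / norm for t, w in vec.items()} if norm else {}

    def most_similar(self, text, top_n=3):
        # Score the query against every stored document vector
        query = self._vectorize(text.lower().split())
        scores = [
            (title, sum(w * vec.get(t, 0.0) for t, w in query.items()))
            for title, vec in zip(self.titles, self.vectors)
        ]
        return sorted(scores, key=lambda pair: pair[1], reverse=True)[:top_n]


# Built once at process start; every later request reuses it.
INDEX = InMemoryIndex({
    "ads": "ad campaign budget and click rate",
    "nlp": "topic model and text corpus similarity",
    "ops": "server process restart and log rotation",
})
```

A web handler in the same process would call `INDEX.most_similar(text)` per request, which is why a short-lived script per request would be a poor fit.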
- install : sudo easy_install virtualenv
- create : virtualenv my_pyenv
- activate : source my_pyenv/bin/activate
- deactivate : deactivate
bottle.run(server='cherrypy', host='api.piposay.com', port=9800) # in python main file
Supervisord : http://supervisord.org/running.html
- install : pip install supervisor
- configure : echo_supervisord_conf > supervisord.conf , then sudo mv supervisord.conf /etc/supervisord.conf
- add program in /etc/supervisord.conf
- run : supervisord
- restart : just 'kill -9 ' in shell
- startup : get /etc/init.d/supervisord from https://github.com/Supervisor/initscripts
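As a rough sketch of the "add program" step, the snippet below generates a `[program:piposay]` section with the standard-library `configparser`. The command, directory, and log path are placeholders and not the project's real values; only the option names (`command`, `directory`, `autostart`, `autorestart`, `redirect_stderr`, `stdout_logfile`) are standard supervisord program settings.

```python
import configparser
import io

# Hypothetical job definition: the command and paths below are placeholders,
# not the project's real values.
config = configparser.ConfigParser()
config["program:piposay"] = {
    "command": "/path/to/my_pyenv/bin/python piposay.py",
    "directory": "/path/to/project",
    "autostart": "true",
    "autorestart": "true",
    "redirect_stderr": "true",
    "stdout_logfile": "/var/log/piposay.log",
}

# Render the section as text so it can be pasted into /etc/supervisord.conf
buffer = io.StringIO()
config.write(buffer)
program_section = buffer.getvalue()
print(program_section)
```

With `autorestart` enabled, supervisord starts the program again after its process is killed, which is presumably what the restart step above relies on.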
ipython piposay # cherrypy server for bottle
- piposay.py : main program, page extract & summary, with bottle + cherrypy web server.
- collector.py : scrape whole-site content from sitemap.
- gravity.py : build LSA model from corpus, find similar topics.
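The sitemap step that collector.py is described as performing could look roughly like the sketch below. It is not the project's code: the function name `extract_urls` and the sample sitemap are hypothetical, and only the URL-extraction part is shown (no downloading or page parsing).

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def extract_urls(sitemap_xml):
    """Return every <loc> URL found in a sitemap document, in order."""
    root = ET.fromstring(sitemap_xml)
    return [
        loc.text.strip()
        for loc in root.findall(".//sm:loc", SITEMAP_NS)
        if loc.text
    ]


# Hypothetical sitemap content for illustration
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://example.com/a.html</loc></url>
  <url><loc>http://example.com/b.html</loc></url>
</urlset>"""

urls = extract_urls(SAMPLE)
```

Each URL in `urls` would then be fetched and passed to the page-extraction step.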
/usr/bin/ipython notebook --profile=myserver # start remote ipython-notebook
sudo supervisorctl restart piposay # restart job, see /etc/supervisord.conf
print schema.Article.m.find().count() # calculate count
post1 = schema.Article.m.find({'title': 'MyPage'}).all()[3] # fetch one
print post1.title
post2 = schema.Article.m.get(title='MyPage')[3] # same as last
post3 = schema.Article(dict(title='MyPage', text='')) # new post
post3.m.save() # save it
[Jieba] https://github.com/fxsjy/jieba alg: http://ddtcms.com/blog/archive/2013/2/4/69/jieba-fenci-suanfa-lijie/
[jusText] https://github.com/miso-belica/jusText alg: http://code.google.com/p/justext/wiki/Algorithm
[Python-Goose] https://github.com/xgdlm/python-goose by GravityLab goose demo: http://jimplush.com/blog/goose
[Gensim] topic modeling, similarity query. http://radimrehurek.com/gensim/
[Articles Categorize using NaiveBayesClassifier] http://www.ibm.com/developerworks/cn/opensource/os-pythonnltk/#list4
[Ming for MongoDB] http://merciless.sourceforge.net/tour.html