
stackexchange

Stack Exchange is a Q&A platform where software engineers, scientists, and students share knowledge and get their questions answered.

As users, we are interested in:

  • Which topics are most heatedly discussed
  • How to filter the best answers from all the answers given

As developers, we are interested in:

  • The problems users are facing, and how such information can be used to improve products and documentation

Our project addresses these problems by:

  • Extracting topics from a large number of posts, along with the topic distribution of each document
  • Predicting the best answers by building a classification model
  • Visualizing the “network” of questions to reveal trends and relationships among the discussed topics

Analytics:

  • Text mining and feature extraction: NLTK and other Python text-mining tools, plus the Spark LDA API in Python and Scala
  • Sentiment analysis using AlchemyAPI
  • Classification algorithm: Random Forest (a minimal sketch follows this list)
  • Graph database built in Neo4j
  • Visualizations built with the D3.js toolkit to represent the output
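The classifier code itself is not reproduced here, so the following is only a minimal sketch of how the best-answer prediction could look with scikit-learn's RandomForestClassifier. The file name answers_features.csv and the feature columns (answer length, score, answerer reputation, sentiment) are hypothetical placeholders, not files or names from this repository.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Hypothetical feature table: one row per answer, label 1 if it was accepted.
# Column and file names are placeholders, not outputs of this repository.
df = pd.read_csv("answers_features.csv")
X = df[["answer_length", "score", "answerer_reputation", "sentiment"]]
y = df["is_accepted"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Random Forest classifier, as listed under Analytics above.
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)

print(classification_report(y_test, clf.predict(X_test)))
```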

LDA: intermediate results are saved in the “result” folder.

  • make_tdm.ipynb or make_tdm.py clean the dataset using the module Word2VecUtility.py[1] and then generate the word-document matrix matrix.csv, which is the input to our LDA model.
  • lda_spark.py takes matrix.csv as input and generates lda_topicMatrix.csv, where each column is a topic and each row is a word with the same index as in matrix.csv (a minimal sketch of this step follows the list).
  • lda_spark.scala uses matrix0.csv (matrix.csv without its header) to generate topicDist.txt, the topic distribution of each document.
  • clean_topic_dist_result.ipynb or clean_topic_dist_result.py clean the raw output in topicDist.txt and produce topic_Distribution_for_each_doc.csv, whose first column is the index of each document and whose following 10 columns are the distribution over the 10 topics. They then link this result with the document IDs, outputting the topic distribution of each document ID as doc_id_topic_doc_dist.csv.
  • get_top20_words_each_topic.ipynb extracts the 20 highest-weighted words for each topic. The output is topic_word_distribution.csv.
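As a rough illustration of the lda_spark.py step, the sketch below runs Spark MLlib's LDA on a count matrix read from matrix.csv and writes the word-by-topic weights out as lda_topicMatrix.csv. The parsing of matrix.csv (rows treated as documents, header skipped) and the choice of k=10 topics and 50 iterations are assumptions, not a copy of the repository's script.

```python
import csv
import numpy as np
from pyspark import SparkContext
from pyspark.mllib.clustering import LDA
from pyspark.mllib.linalg import Vectors

sc = SparkContext(appName="lda_sketch")

# Assumption: each row of matrix.csv is one document's word counts
# (header row skipped); the real matrix.csv layout may differ.
lines = sc.textFile("result/matrix.csv")
header = lines.first()
counts = lines.filter(lambda l: l != header) \
              .map(lambda l: Vectors.dense([float(x) for x in l.split(",")]))

# MLlib's LDA expects an RDD of [document_id, term-count vector].
corpus = counts.zipWithIndex().map(lambda x: [x[1], x[0]]).cache()

model = LDA.train(corpus, k=10, maxIterations=50)

# topicsMatrix() is vocabSize x k: rows are words, columns are topics,
# matching the description of lda_topicMatrix.csv above.
topics = np.array(model.topicsMatrix())
with open("result/lda_topicMatrix.csv", "w") as f:
    csv.writer(f).writerows(topics)

sc.stop()
```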

In the “graphAnalysis” folder, the intermediate results are stored in the “outputData” folder.

  • nodeCreate.ipynb or nodeCreate.py extract the schema from the Posts.xml, Tags.xml and Users.xml files and output the corresponding csv files (in table format) for later use. The output includes post.csv, user.csv, tag.csv, post_relation.csv, userPost.csv and post_tag_relation_frame.csv. All of the above data are in the “data” folder under the root directory.
  • post_user_database.cypher is the Cypher code that creates the nodes and relationships in the Neo4j database.
  • useGraphDB.ipynb walks through examples of analysing the StackExchange data in Neo4j, integrating Python packages such as py2neo and ipython-cypher with Neo4j. Specifically, it queries the database for the tag-connection network data, which can be visualized with a force-directed graph and an edge bundling graph (a minimal query sketch follows this list).
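The sketch below only illustrates the kind of query useGraphDB.ipynb runs to pull a tag co-occurrence network out of Neo4j. The connection settings, the node labels (Post, Tag) and the relationship type (HAS_TAG) are assumptions about the schema created by post_user_database.cypher, not verified names, and it uses the py2neo Graph.run API.

```python
import json
from py2neo import Graph

# Connection details are placeholders; adjust to your Neo4j instance.
graph = Graph("bolt://localhost:7687", auth=("neo4j", "password"))

# Assumed schema: (:Post)-[:HAS_TAG]->(:Tag). Count how often two tags
# appear on the same post, yielding the edges of a tag co-occurrence network.
query = """
MATCH (t1:Tag)<-[:HAS_TAG]-(p:Post)-[:HAS_TAG]->(t2:Tag)
WHERE t1.name < t2.name
RETURN t1.name AS source, t2.name AS target, count(p) AS weight
ORDER BY weight DESC
LIMIT 200
"""
edges = graph.run(query).data()

# Dump the edges as JSON for the D3 force-directed / edge bundling graphs.
with open("outputData/tag_network.json", "w") as f:
    json.dump(edges, f, indent=2)
```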

Visualization part:

Under the “Apache-tomcat/webapps/d3” folder, after running ./bin/startup.sh to start up Tomcat, we can view our visualizations at http://localhost:8080/d3/… We visualize the StackExchange dataset with the following graphs:

  • wordCloud: a word cloud whose input JSON can be generated from the output of get_top20_words_each_topic.ipynb (a sketch of the conversion follows this list).
  • forceDirectedGraph: a force-directed graph whose input JSON is generated by useGraphDB.ipynb.
  • hirarlinks: an edge bundling graph whose input JSON is generated by useGraphDB.ipynb.
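As a rough sketch of preparing the word cloud input, the snippet below converts topic_word_distribution.csv into the flat list of {"text", "size"} objects commonly fed to d3-cloud. The column names (topic, word, weight) and the output file naming are assumptions; the actual JSON format expected by the wordCloud page is defined in the D3 code under webapps/d3.

```python
import json
import pandas as pd

# Assumed columns in topic_word_distribution.csv: topic, word, weight.
# The real column names produced by get_top20_words_each_topic.ipynb may differ.
df = pd.read_csv("result/topic_word_distribution.csv")

# One word-cloud JSON file per topic, scaling weights to font sizes.
for topic_id, group in df.groupby("topic"):
    max_w = group["weight"].max()
    words = [
        {"text": str(row["word"]), "size": float(10 + 60 * row["weight"] / max_w)}
        for _, row in group.iterrows()
    ]
    with open("wordcloud_topic_{}.json".format(topic_id), "w") as f:
        json.dump(words, f, indent=2)
```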
