
stackexchange

Stack Exchange is a Q&A platform where software engineers, scientists, and students share knowledge and get their questions answered.

As users, we are interested in:

  • Which topics are most heatedly discussed
  • How to filter the best answers from all the answers given

As developers, we are interested in:

  • The problems users are facing, and how such information can be used to improve products and documentation

Our project addresses these problems by:

  • Extracting topics from a large number of posts, along with the topic distribution of each document
  • Predicting the best answers by building a classification model
  • Visualizing the “network” of questions to reveal trends and relationships among the discussed topics

Analytics:

  • Text mining and feature extraction: NLTK and other Python text-mining tools, plus the Spark LDA API in Python and Scala
  • Sentiment analysis using AlchemyAPI
  • Classification algorithm: Random Forest (a minimal sketch follows this list)
  • Graph database built in Neo4j
  • Visualizations built with the D3.js toolkit to represent the output
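The classifier code itself is not reproduced here, so the following is only a minimal sketch of how the best-answer prediction could look with scikit-learn's RandomForestClassifier. The file name answers_features.csv and the feature columns (answer length, score, answerer reputation, sentiment) are hypothetical placeholders, not files or names from this repository.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Hypothetical feature table: one row per answer, label 1 if it was accepted.
# Column and file names are placeholders, not outputs of this repository.
df = pd.read_csv("answers_features.csv")
X = df[["answer_length", "score", "answerer_reputation", "sentiment"]]
y = df["is_accepted"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Random Forest classifier, as listed under Analytics above.
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)

print(classification_report(y_test, clf.predict(X_test)))
```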

LDA: intermediate results are saved in the “result” folder.

  • make_tdm.ipynb or make_tdm.py clean the dataset using the module Word2VecUtility.py[1] and then generate the word-document matrix matrix.csv, which is the input to our LDA model.
  • lda_spark.py takes matrix.csv as input and generates lda_topicMatrix.csv, where each column is a topic and each row is a word with the same index as in matrix.csv (a minimal sketch of this step follows the list).
  • lda_spark.scala uses matrix0.csv (matrix.csv without its header) to generate topicDist.txt, the topic distribution of each document.
  • clean_topic_dist_result.ipynb or clean_topic_dist_result.py clean the raw output in topicDist.txt and produce topic_Distribution_for_each_doc.csv, whose first column is the index of each document and whose following 10 columns are the distribution over the 10 topics. They then link this result with the document IDs, outputting the topic distribution of each document ID as doc_id_topic_doc_dist.csv.
  • get_top20_words_each_topic.ipynb extracts the 20 highest-weighted words for each topic. The output is topic_word_distribution.csv.
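As a rough illustration of the lda_spark.py step, the sketch below runs Spark MLlib's LDA on a count matrix read from matrix.csv and writes the word-by-topic weights out as lda_topicMatrix.csv. The parsing of matrix.csv (rows treated as documents, header skipped) and the choice of k=10 topics and 50 iterations are assumptions, not a copy of the repository's script.

```python
import csv
import numpy as np
from pyspark import SparkContext
from pyspark.mllib.clustering import LDA
from pyspark.mllib.linalg import Vectors

sc = SparkContext(appName="lda_sketch")

# Assumption: each row of matrix.csv is one document's word counts
# (header row skipped); the real matrix.csv layout may differ.
lines = sc.textFile("result/matrix.csv")
header = lines.first()
counts = lines.filter(lambda l: l != header) \
              .map(lambda l: Vectors.dense([float(x) for x in l.split(",")]))

# MLlib's LDA expects an RDD of [document_id, term-count vector].
corpus = counts.zipWithIndex().map(lambda x: [x[1], x[0]]).cache()

model = LDA.train(corpus, k=10, maxIterations=50)

# topicsMatrix() is vocabSize x k: rows are words, columns are topics,
# matching the description of lda_topicMatrix.csv above.
topics = np.array(model.topicsMatrix())
with open("result/lda_topicMatrix.csv", "w") as f:
    csv.writer(f).writerows(topics)

sc.stop()
```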

In the “graphAnalysis” folder, the intermediate results are stored in the “outputData” folder.

  • nodeCreate.ipynb or nodeCreate.py extract the schema from the Posts.xml, Tags.xml and Users.xml files and output the corresponding csv files (in table format) for later use. The output includes post.csv, user.csv, tag.csv, post_relation.csv, userPost.csv and post_tag_relation_frame.csv. All of the above data are in the “data” folder under the root directory.
  • post_user_database.cypher is the Cypher code that creates the nodes and relationships in the Neo4j database.
  • useGraphDB.ipynb walks through examples of analysing the StackExchange data in Neo4j, integrating Python packages such as py2neo and ipython-cypher with Neo4j. Specifically, it queries the database for the tag-connection network data, which can be visualized with a force-directed graph and an edge bundling graph (a minimal query sketch follows this list).
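The sketch below only illustrates the kind of query useGraphDB.ipynb runs to pull a tag co-occurrence network out of Neo4j. The connection settings, the node labels (Post, Tag) and the relationship type (HAS_TAG) are assumptions about the schema created by post_user_database.cypher, not verified names, and it uses the py2neo Graph.run API.

```python
import json
from py2neo import Graph

# Connection details are placeholders; adjust to your Neo4j instance.
graph = Graph("bolt://localhost:7687", auth=("neo4j", "password"))

# Assumed schema: (:Post)-[:HAS_TAG]->(:Tag). Count how often two tags
# appear on the same post, yielding the edges of a tag co-occurrence network.
query = """
MATCH (t1:Tag)<-[:HAS_TAG]-(p:Post)-[:HAS_TAG]->(t2:Tag)
WHERE t1.name < t2.name
RETURN t1.name AS source, t2.name AS target, count(p) AS weight
ORDER BY weight DESC
LIMIT 200
"""
edges = graph.run(query).data()

# Dump the edges as JSON for the D3 force-directed / edge bundling graphs.
with open("outputData/tag_network.json", "w") as f:
    json.dump(edges, f, indent=2)
```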

Visualization part:

Under the “Apache-tomcat/webapps/d3” folder, after running ./bin/startup.sh to start up Tomcat, we can view our visualizations at http://localhost:8080/d3/… We visualize the StackExchange dataset with the following graphs:

  • wordCloud: a word cloud whose input JSON can be generated from the output of get_top20_words_each_topic.ipynb (a sketch of the conversion follows this list).
  • forceDirectedGraph: a force-directed graph whose input JSON is generated by useGraphDB.ipynb.
  • hirarlinks: an edge bundling graph whose input JSON is generated by useGraphDB.ipynb.
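As a rough sketch of preparing the word cloud input, the snippet below converts topic_word_distribution.csv into the flat list of {"text", "size"} objects commonly fed to d3-cloud. The column names (topic, word, weight) and the output file naming are assumptions; the actual JSON format expected by the wordCloud page is defined in the D3 code under webapps/d3.

```python
import json
import pandas as pd

# Assumed columns in topic_word_distribution.csv: topic, word, weight.
# The real column names produced by get_top20_words_each_topic.ipynb may differ.
df = pd.read_csv("result/topic_word_distribution.csv")

# One word-cloud JSON file per topic, scaling weights to font sizes.
for topic_id, group in df.groupby("topic"):
    max_w = group["weight"].max()
    words = [
        {"text": str(row["word"]), "size": float(10 + 60 * row["weight"] / max_w)}
        for _, row in group.iterrows()
    ]
    with open("wordcloud_topic_{}.json".format(topic_id), "w") as f:
        json.dump(words, f, indent=2)
```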
