ECE 143 Final Project: The Difficulties in Learning Python

by Mahmoud Maarouf, Vidya Kanekal, Hamed Mojtahed, Kevin Anderson, and Songlin Chen


This GitHub repository provides the complete analysis of the "Python Questions from Stack Overflow" Kaggle dataset (https://www.kaggle.com/stackoverflow/pythonquestions/version/2), used to identify the common difficulties in learning Python. The library dependencies are listed in the environment.yml file.

Data Preprocessing

The extract_data.py file contains a parser class used to process the HTML-formatted text from the dataset. The parser is applied to a dataframe containing the question and answer bodies: it strips the HTML tags from the text, separates out the code blocks, and removes newline characters. It requires Questions.csv, Answers.csv, and Tags.csv to be loaded into dataframes, after which the parser can be applied to the desired columns. There is also an option to use the Kaggle API to download the required dataset directly. extract_data.py also joins the .csv files together using the Parent_ID column (a sketch of this workflow follows the module list below). The modules that must be imported include:

  • Pandas
  • Kaggle API
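
A minimal sketch of this preprocessing flow, assuming a simplified stand-in for the parser class and the column names from the Kaggle CSVs (Body, Id, ParentId); the actual implementation in extract_data.py may differ:

```python
import re
from html.parser import HTMLParser

import pandas as pd


class BodyParser(HTMLParser):
    """Simplified stand-in for the parser class: splits a post body into
    plain text and code blocks."""

    def __init__(self):
        super().__init__()
        self.text_parts, self.code_parts = [], []
        self._in_code = False

    def handle_starttag(self, tag, attrs):
        if tag == "code":
            self._in_code = True

    def handle_endtag(self, tag):
        if tag == "code":
            self._in_code = False

    def handle_data(self, data):
        (self.code_parts if self._in_code else self.text_parts).append(data)


def clean_body(html_body):
    """Strip HTML tags, separate the code, and remove newline characters."""
    parser = BodyParser()
    parser.feed(html_body)
    text = re.sub(r"\s+", " ", " ".join(parser.text_parts)).strip()
    code = " ".join(parser.code_parts).replace("\n", " ")
    return text, code


# Load the CSVs (the encoding may need to be "latin-1" for this dataset).
questions = pd.read_csv("Questions.csv", encoding="latin-1")
answers = pd.read_csv("Answers.csv", encoding="latin-1")

# Apply the parser to the question bodies.
questions["CleanBody"], questions["CodeBody"] = zip(
    *questions["Body"].astype(str).map(clean_body)
)

# Join answers to their questions on the parent-ID column.
joined = answers.merge(questions, left_on="ParentId", right_on="Id",
                       suffixes=("_answer", "_question"))
```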

LDA Topic Modeling

Topic_Modeling_Questions_Answers_Tags.py trains the topic modeling algorithms. The code starts by preprocessing the dataset; this includes tokenization, removal of stop words, building bigrams and trigrams, and lemmatization. Then the dictionary and corpus of all unique words are built. Once preprocessing is complete, the data is fed to the topic modeling algorithms. The LDA algorithms used are LDA Mallet (by UMass) and LDA and LDA Multicore (both from the gensim library). The quality of each model is then evaluated by computing its coherence score, and each model is compared against the rest using the Jaccard distance, displayed as a heatmap. LDA is trained with topic counts from 25 to 150 in steps of 25 to assess the appropriate number of topics. Visualizations of the output are created using word clouds and an interactive chart for each model; a sketch of the gensim portion of this pipeline follows the module list below. The third-party modules needed to run the code are:

  • LDA Mallet (download from the UMass website)
  • JDK (Only needed to run LDA Mallet)
  • en_core_web_sm
  • scipy
  • matplotlib
  • nltk
  • seaborn
  • spacy
  • gensim
  • wordcloud
  • pandas
  • numpy
  • pickle5
  • plotly
  • pyLDAvis
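
A minimal sketch of the gensim portion of this pipeline, using toy documents in place of the cleaned question texts and small illustrative parameter values (the exact settings in Topic_Modeling_Questions_Answers_Tags.py may differ; gensim's LdaMulticore is a drop-in for the multicore variant):

```python
import nltk
from gensim import corpora
from gensim.models import CoherenceModel, LdaModel
from gensim.utils import simple_preprocess
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)
stop_words = set(stopwords.words("english"))

# Toy documents stand in for the cleaned question/answer texts from
# extract_data.py; bigram/trigram building and lemmatization are omitted here.
docs = [
    "How do I read a csv file into a pandas dataframe?",
    "Getting a syntax error in my list comprehension",
    "How can I install packages with pip on Windows?",
]
tokens = [[w for w in simple_preprocess(d) if w not in stop_words] for d in docs]

# Dictionary and bag-of-words corpus of all unique words.
dictionary = corpora.Dictionary(tokens)
corpus = [dictionary.doc2bow(doc) for doc in tokens]

# Sweep the number of topics and compare coherence scores; the project
# sweeps 25 to 150 in steps of 25 on the full dataset.
for num_topics in (2, 3):
    lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=num_topics,
                   passes=10, random_state=0)
    coherence = CoherenceModel(model=lda, corpus=corpus, dictionary=dictionary,
                               coherence="u_mass").get_coherence()
    print(f"{num_topics} topics: u_mass coherence = {coherence:.3f}")
```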

Sentiment Analysis

The sentiment.py file contains the code to analyze the sentiment of the dataset and create visualizations for the analysis. The code performs two major analyses: the overall tone of the data and the average sentiment for each topic (100 of them in total). The file requires the Questions.csv and Answers.csv files to be in the same directory as the Python file in order to execute properly. The file also needs to load the topics generated by the topic modeling from a pickle file; this pickle file, created by Topic_Modeling_Questions_Answers_Tags.py, should be placed in the same folder as sentiment.py. A sketch of the sentiment scoring follows the module list below. The third-party modules needed are:

  • Matplotlib
  • TextBlob
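
A minimal sketch of TextBlob-based scoring, using illustrative variable names; the per-topic averaging in sentiment.py may differ:

```python
import pandas as pd
from textblob import TextBlob

# Load the answer bodies (the encoding may need to be "latin-1" for this dataset).
answers = pd.read_csv("Answers.csv", encoding="latin-1")

# Polarity ranges from -1 (negative) to +1 (positive).
answers["Polarity"] = answers["Body"].astype(str).apply(
    lambda text: TextBlob(text).sentiment.polarity
)

# Overall tone of the data.
print("Average polarity:", answers["Polarity"].mean())

# Hypothetical per-topic averaging: assumes a "Topic" column assigned from the
# topics loaded out of the topic-modeling pickle file.
# print(answers.groupby("Topic")["Polarity"].mean())
```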

Extra Analysis

The extra_analysis.py file contains the code to analyze the dataset and create visual representations of the trends. This analysis includes the most frequently mentioned libraries, operating systems, IDEs, and package managers (a sketch of this kind of count follows the module list below). It requires the Questions.csv, Answers.csv, and Tags.csv files in the same directory as the code in order to read the data. The third-party modules needed include:

  • Pandas
  • Matplotlib
  • Numpy
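
A minimal sketch of this kind of frequency count, using an illustrative keyword list rather than the one in extra_analysis.py:

```python
import matplotlib.pyplot as plt
import pandas as pd

questions = pd.read_csv("Questions.csv", encoding="latin-1")
bodies = questions["Body"].astype(str).str.lower()

# Illustrative keyword list; extra_analysis.py counts libraries, operating
# systems, IDEs, and package managers in a similar way.
libraries = ["numpy", "pandas", "matplotlib", "django", "flask"]
counts = {lib: bodies.str.contains(lib, regex=False).sum() for lib in libraries}

pd.Series(counts).sort_values(ascending=False).plot(kind="bar")
plt.ylabel("Number of questions mentioning the library")
plt.tight_layout()
plt.show()
```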
