ECE 143 Final Project: The Difficulties in Learning Python

by Mahmoud Maarouf, Vidya Kanekal, Hamed Mojtahed, Kevin Anderson, and Songlin Chen


This GitHub repository provides the complete analysis of the "Python Questions from Stack Overflow" Kaggle dataset (https://www.kaggle.com/stackoverflow/pythonquestions/version/2), used to identify the common difficulties in learning Python. The library dependencies are listed in the environment.yml file.

Data Preprocessing

The extract_data.py file contains a parser class used to process the HTML-formatted text from the dataset. The parser is applied to a dataframe containing the question and answer bodies: it strips the HTML tags from the text, separates out the code blocks, and removes newline characters. It requires Questions.csv, Answers.csv, and Tags.csv to be loaded into dataframes, after which the parser can be applied to the desired columns. There is also an option to use the Kaggle API to download the required dataset directly. extract_data.py also joins the .csv files together using the Parent_ID column (a sketch of this workflow follows the module list below). The modules that must be imported include:

  • Pandas
  • Kaggle API
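
A minimal sketch of this preprocessing flow, assuming a simplified stand-in for the parser class and the column names from the Kaggle CSVs (Body, Id, ParentId); the actual implementation in extract_data.py may differ:

```python
import re
from html.parser import HTMLParser

import pandas as pd


class BodyParser(HTMLParser):
    """Simplified stand-in for the parser class: splits a post body into
    plain text and code blocks."""

    def __init__(self):
        super().__init__()
        self.text_parts, self.code_parts = [], []
        self._in_code = False

    def handle_starttag(self, tag, attrs):
        if tag == "code":
            self._in_code = True

    def handle_endtag(self, tag):
        if tag == "code":
            self._in_code = False

    def handle_data(self, data):
        (self.code_parts if self._in_code else self.text_parts).append(data)


def clean_body(html_body):
    """Strip HTML tags, separate the code, and remove newline characters."""
    parser = BodyParser()
    parser.feed(html_body)
    text = re.sub(r"\s+", " ", " ".join(parser.text_parts)).strip()
    code = " ".join(parser.code_parts).replace("\n", " ")
    return text, code


# Load the CSVs (the encoding may need to be "latin-1" for this dataset).
questions = pd.read_csv("Questions.csv", encoding="latin-1")
answers = pd.read_csv("Answers.csv", encoding="latin-1")

# Apply the parser to the question bodies.
questions["CleanBody"], questions["CodeBody"] = zip(
    *questions["Body"].astype(str).map(clean_body)
)

# Join answers to their questions on the parent-ID column.
joined = answers.merge(questions, left_on="ParentId", right_on="Id",
                       suffixes=("_answer", "_question"))
```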

LDA Topic Modeling

Topic_Modeling_Questions_Answers_Tags.py trains the topic modeling algorithms. The code starts by preprocessing the dataset; this includes tokenization, removal of stop words, building bigrams and trigrams, and lemmatization. Then the dictionary and corpus of all unique words are built. Once preprocessing is complete, the data is fed to the topic modeling algorithms. The LDA algorithms used are LDA Mallet (by UMass) and LDA and LDA Multicore (both from the gensim library). The quality of each model is then evaluated by computing its coherence score, and each model is compared against the rest using the Jaccard distance, displayed as a heatmap. LDA is trained with topic counts from 25 to 150 in steps of 25 to assess the appropriate number of topics. Visualizations of the output are created using word clouds and an interactive chart for each model; a sketch of the gensim portion of this pipeline follows the module list below. The third-party modules needed to run the code are:

  • LDA Mallet (download from the UMass website)
  • JDK (Only needed to run LDA Mallet)
  • en_core_web_sm
  • scipy
  • matplotlib
  • nltk
  • seaborn
  • spacy
  • gensim
  • wordcloud
  • pandas
  • numpy
  • pickle5
  • plotly
  • pyLDAvis
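
A minimal sketch of the gensim portion of this pipeline, using toy documents in place of the cleaned question texts and small illustrative parameter values (the exact settings in Topic_Modeling_Questions_Answers_Tags.py may differ; gensim's LdaMulticore is a drop-in for the multicore variant):

```python
import nltk
from gensim import corpora
from gensim.models import CoherenceModel, LdaModel
from gensim.utils import simple_preprocess
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)
stop_words = set(stopwords.words("english"))

# Toy documents stand in for the cleaned question/answer texts from
# extract_data.py; bigram/trigram building and lemmatization are omitted here.
docs = [
    "How do I read a csv file into a pandas dataframe?",
    "Getting a syntax error in my list comprehension",
    "How can I install packages with pip on Windows?",
]
tokens = [[w for w in simple_preprocess(d) if w not in stop_words] for d in docs]

# Dictionary and bag-of-words corpus of all unique words.
dictionary = corpora.Dictionary(tokens)
corpus = [dictionary.doc2bow(doc) for doc in tokens]

# Sweep the number of topics and compare coherence scores; the project
# sweeps 25 to 150 in steps of 25 on the full dataset.
for num_topics in (2, 3):
    lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=num_topics,
                   passes=10, random_state=0)
    coherence = CoherenceModel(model=lda, corpus=corpus, dictionary=dictionary,
                               coherence="u_mass").get_coherence()
    print(f"{num_topics} topics: u_mass coherence = {coherence:.3f}")
```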

Sentiment Analysis

The sentiment.py file contains the code to analyze the sentiment of the dataset and create visualizations for the analysis. The code performs two major analyses: the overall tone of the data and the average sentiment for each topic (100 of them in total). The file requires the Questions.csv and Answers.csv files to be in the same directory as the Python file in order to execute properly. The file also needs to load the topics generated by the topic modeling from a pickle file; this pickle file, created by Topic_Modeling_Questions_Answers_Tags.py, should be placed in the same folder as sentiment.py. A sketch of the sentiment scoring follows the module list below. The third-party modules needed are:

  • Matplotlib
  • TextBlob
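
A minimal sketch of TextBlob-based scoring, using illustrative variable names; the per-topic averaging in sentiment.py may differ:

```python
import pandas as pd
from textblob import TextBlob

# Load the answer bodies (the encoding may need to be "latin-1" for this dataset).
answers = pd.read_csv("Answers.csv", encoding="latin-1")

# Polarity ranges from -1 (negative) to +1 (positive).
answers["Polarity"] = answers["Body"].astype(str).apply(
    lambda text: TextBlob(text).sentiment.polarity
)

# Overall tone of the data.
print("Average polarity:", answers["Polarity"].mean())

# Hypothetical per-topic averaging: assumes a "Topic" column assigned from the
# topics loaded out of the topic-modeling pickle file.
# print(answers.groupby("Topic")["Polarity"].mean())
```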

Extra Analysis

The extra_analysis.py file contains the code to analyze the dataset and create visual representations of the trends. This analysis includes the most frequently mentioned libraries, operating systems, IDEs, and package managers (a sketch of this kind of count follows the module list below). It requires the Questions.csv, Answers.csv, and Tags.csv files in the same directory as the code in order to read the data. The third-party modules needed include:

  • Pandas
  • Matplotlib
  • Numpy
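
A minimal sketch of this kind of frequency count, using an illustrative keyword list rather than the one in extra_analysis.py:

```python
import matplotlib.pyplot as plt
import pandas as pd

questions = pd.read_csv("Questions.csv", encoding="latin-1")
bodies = questions["Body"].astype(str).str.lower()

# Illustrative keyword list; extra_analysis.py counts libraries, operating
# systems, IDEs, and package managers in a similar way.
libraries = ["numpy", "pandas", "matplotlib", "django", "flask"]
counts = {lib: bodies.str.contains(lib, regex=False).sum() for lib in libraries}

pd.Series(counts).sort_values(ascending=False).plot(kind="bar")
plt.ylabel("Number of questions mentioning the library")
plt.tight_layout()
plt.show()
```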
