GitHub - may-tal/nlpProject


The project contains the following files:

* main.py - Main project file.

* classifiers.py - Takes a labeled data set and splits it into training and test sets.
Each classifier is then trained on the training set and predicts labels for the test set.
The file contains several classifiers and returns the scores of each one.

* evaluation.py - Computes all evaluation scores for a given classifier and plots the ROC curve.

* clustering.py - Clusters the data into three classes using the k-means algorithm and plots a word cloud.

* data.py - Takes the path of the folder that contains the data files and returns the data in CSV form.

* feature_extraction.py - Transforms the raw data into features that better represent the underlying problem, with the aim of improving the model's predictive accuracy on unseen data (the feature engineering step).

* feature_selection.py - Takes the training data and selects features using three feature selection methods: removing features with low variance, selecting the best features based on univariate statistical tests, and model-based selection (select-from-model).

* text_statistics.py - Contains functions that take the preprocessed data and return statistics about it.

* text_normalization.py - Contains all the functions of the preprocessing step: cleaning and normalizing the data.

* topic_modeling.py - Applies several topic modeling methods: SVD and NMF.

* heb_stopwords.txt - Text file that contains the Hebrew stop words.

* Data/sentences.neg - Contains the data labeled as negative.

* Data/sentences.pos - Contains the data labeled as positive.

* sentiment_lexicon/negative_words_he.txt - Hebrew sentiment lexicon of negative words.

* sentiment_lexicon/positive_words_he.txt - Hebrew sentiment lexicon of positive words.

* orig_data.csv - Contains the original data as a dataframe; each line contains a text and its label.

* norm_data.csv - Contains the normalized data (cleaning + YAP) as a dataframe.

* yap_punc_data.csv - Contains the data after punctuation removal and YAP as a dataframe.

* clean_data.csv - Contains the cleaned data as a dataframe.

* norm_no_yap_data.csv - Contains the normalized data without YAP as a dataframe.

* words_vector.npy - The list of word vectors (arranged according to word_list.txt).

* word_list.txt - The list of words for which word2vec vectors are available.
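
The relationship between the Data folder, data.py and orig_data.csv described above can be sketched roughly as follows. This is only an illustration, not the code in data.py: the function name folder_to_csv, the column names "text" and "label", and the 0/1 label encoding are assumptions, since the list above does not specify them.

```python
import csv
from pathlib import Path

# Assumption: negative examples are encoded as 0 and positive ones as 1.
# The README does not say how the labels are actually stored.
LABELS = {"sentences.neg": 0, "sentences.pos": 1}


def folder_to_csv(data_dir, out_csv):
    """Write one (text, label) row per non-empty line of each data file.

    Hypothetical helper: reads sentences.neg and sentences.pos from
    data_dir and writes them to out_csv with a header row.
    """
    rows = []
    for file_name, label in LABELS.items():
        path = Path(data_dir) / file_name
        with path.open(encoding="utf-8") as f:
            for line in f:
                text = line.strip()
                if text:  # skip blank lines
                    rows.append((text, label))

    with open(out_csv, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["text", "label"])  # assumed column names
        writer.writerows(rows)
    return rows
```

Under these assumptions, folder_to_csv("Data", "orig_data.csv") would produce a file in which each line holds a text and its label.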

---- How to run our project? ----
To run this code, download twitter-w2w from https://drive.google.com/drive/folders/1b1Pj1oWBqs3y0Qncaqpz4IK-ujzChy2Z
and place the two files word_list.txt and words_vector.npy in the project folder.
Then install the required packages and run main.py.
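
Before running main.py, it may help to confirm that the two downloaded files are in place. The sketch below is not part of the project; the helper names missing_embedding_files and load_embeddings are hypothetical. The loader additionally assumes that word_list.txt holds one word per line and that words_vector.npy is a plain numeric NumPy array whose rows follow the same order.

```python
from pathlib import Path

# The two files the run instructions ask you to place in the project folder.
REQUIRED_FILES = ("word_list.txt", "words_vector.npy")


def missing_embedding_files(project_dir="."):
    """Return the names of the required files not found in project_dir."""
    return [
        name for name in REQUIRED_FILES
        if not (Path(project_dir) / name).is_file()
    ]


def load_embeddings(project_dir="."):
    """Map each word to its vector.

    Assumption: row i of words_vector.npy corresponds to line i of
    word_list.txt, and the array can be loaded without pickling.
    """
    import numpy as np  # imported here so the file check works without NumPy

    word_path = Path(project_dir) / "word_list.txt"
    words = word_path.read_text(encoding="utf-8").splitlines()
    vectors = np.load(Path(project_dir) / "words_vector.npy")
    return dict(zip(words, vectors))
```

If missing_embedding_files() returns an empty list, both files were found in the current folder and main.py can be started.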