KaggleDathena

To understand our project, please read our PDF report (Reputation_Analysis_Dathena_competition.pdf).

Files

Notebooks

Pipeline.ipynb: Main notebook. It builds the pipeline for this project (using the other modules) and creates the submission file.

Data_frame_creation.ipynb: Reads and extracts features from all documents and saves them as a pandas DataFrame (in safe/df_file_final.csv).

Test.ipynb: Some unit tests for the utility functions.

Train_sentiment_analyser.ipynb: Notebook that trains our sentiment analyser CNN. Warning: you will want a good GPU to train it. The training is done either on Amazon reviews or on IMDB movie reviews using the Keras dataset.

Train_word_embeddings.ipynb: Based on the corpus (all documents), this learns a vector representation for each lemma (word) and a mapping dictionary.
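To make the "mapping dictionary" mentioned for Train_word_embeddings.ipynb more concrete, here is a minimal stdlib-only sketch that assigns an integer index to each lemma of a tokenised corpus. It is an assumption that the mapping goes from lemma to integer index; the names build_lemma_index, encode, min_count and the reserved tokens are hypothetical and not taken from the notebook. Learning the vectors themselves is left out.

```python
from collections import Counter

# Hypothetical reserved tokens for padding and out-of-vocabulary lemmas.
PAD, UNK = "<pad>", "<unk>"


def build_lemma_index(documents, min_count=1):
    """Map each lemma seen at least `min_count` times to an integer index."""
    counts = Counter(lemma for doc in documents for lemma in doc)
    index = {PAD: 0, UNK: 1}
    # Most frequent lemmas get the smallest indices; ties are broken alphabetically.
    for lemma, n in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        if n >= min_count:
            index[lemma] = len(index)
    return index


def encode(doc, index):
    """Replace each lemma by its index, falling back to the UNK index."""
    return [index.get(lemma, index[UNK]) for lemma in doc]


# Toy corpus: one list of lemmas per document.
corpus = [["company", "report", "profit"], ["company", "loss"]]
lemma_index = build_lemma_index(corpus)
encoded = encode(["company", "merger"], lemma_index)
```

A dictionary like this is typically what lets a row of an embedding matrix be looked up for a given lemma.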

Python code (modules)

extract_text.py: Functions to extract text data from doc(x), pdf and xls(x) files.

sentiment_analyzer.py: Code for the convolutional neural network built with Keras, including methods to train it and to predict.

text_preprocessing.py: Code to preprocess the text: lemmatisation, vectorization, stop-word removal and some regex cleaning.

text_summarization.py: Code to extract interesting sentences about the companies.

util.py: Utility functions that do not fit in any other file.
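The kind of cleaning described for text_preprocessing.py can be sketched roughly as follows. This is an illustration under assumptions, not the module's actual code: the function names, the regular expressions and the tiny stop-word list are all hypothetical, and lemmatisation is omitted because it needs a language model.

```python
import re

# Tiny illustrative stop-word list (hypothetical); a real run would use a fuller one.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "was", "for", "on"}

URL_RE = re.compile(r"https?://\S+")   # drop URLs
NON_ALPHA_RE = re.compile(r"[^a-z\s]")  # drop digits and punctuation


def clean_text(text):
    """Lowercase the text, strip URLs and non-letters, and collapse whitespace."""
    text = text.lower()
    text = URL_RE.sub(" ", text)
    text = NON_ALPHA_RE.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()


def preprocess(text, stop_words=STOP_WORDS):
    """Return the cleaned tokens of `text` with stop words removed."""
    return [tok for tok in clean_text(text).split() if tok not in stop_words]


tokens = preprocess(
    "The company reported a profit of 12% in 2018, see https://example.com/report."
)
```

Removing URLs before the non-letter filter keeps fragments such as "https" and "com" from leaking into the token list.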

Others

submission_mapper.csv: The provided file, slightly modified (the names of the .doc files containing parentheses have been changed).

Folders

safe: Contains checkpoints for faster execution of the code, stored as pickle files or as CSV files for pandas.

dataset: Not filled in; it is meant to contain the Amazon review dataset.

files: Contains the dataset for this project.

models: Contains the models produced by the code.
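The checkpoint idea behind the safe folder can be illustrated with a small caching helper. This is only a sketch under assumptions: the helper name cached, the file name features.pkl and the toy expensive function are hypothetical, and the project's own checkpoint code may be organised differently.

```python
import os
import pickle
import tempfile


def cached(path, compute):
    """Return the pickled result stored at `path`, computing and saving it if missing."""
    if os.path.exists(path):
        with open(path, "rb") as fh:
            return pickle.load(fh)
    result = compute()
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    with open(path, "wb") as fh:
        pickle.dump(result, fh)
    return result


calls = []


def expensive():
    """Stand-in for a slow step such as feature extraction."""
    calls.append(1)
    return {"n_documents": 3}


# A temporary directory stands in for the repository's safe folder.
with tempfile.TemporaryDirectory() as tmp:
    checkpoint = os.path.join(tmp, "safe", "features.pkl")
    first = cached(checkpoint, expensive)   # computes and writes the checkpoint
    second = cached(checkpoint, expensive)  # loads the checkpoint instead
```

Deleting the checkpoint file forces the step to be recomputed on the next run.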

Dependencies

pip install PyPDF2
pip install python-docx
pip install xlrd
pip install pdfrw
pip install sumy
pip install gensim
pip install nltk
pip install glob2


python -m spacy download en

Then, in a Python session:

import nltk
nltk.download('punkt')
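Before running the notebooks, it can help to check which of the packages above are importable. The sketch below uses only the standard library; the helper name missing_packages is hypothetical, and spacy is included on the assumption that it is required, since the spaCy download command above relies on it.

```python
import importlib.util

# pip package name -> name used in `import` statements.
REQUIRED = {
    "PyPDF2": "PyPDF2",
    "python-docx": "docx",
    "xlrd": "xlrd",
    "pdfrw": "pdfrw",
    "sumy": "sumy",
    "gensim": "gensim",
    "nltk": "nltk",
    "glob2": "glob2",
    "spacy": "spacy",  # assumed requirement, see the lead-in
}


def missing_packages(required=REQUIRED):
    """Return the pip names of the packages whose module cannot be found."""
    return sorted(
        pkg
        for pkg, module in required.items()
        if importlib.util.find_spec(module) is None
    )


missing = missing_packages()
if missing:
    print("Missing packages: pip install " + " ".join(missing))
else:
    print("All listed packages are importable.")
```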

sshuster/KaggleDathena
