CZ4045_Natural_Language_Processing

Please run all the commands in the directory ntu-nlp/

Install

Please make sure that you have installed Conda.

# install Conda env
conda env create -f environment.yml
# activate Conda env
conda activate ntu-nlp

# create data folders
mkdir -p core/{data,output}

# Download packages
python -m spacy download en
python -c "import nltk; nltk.download('punkt'); nltk.download('vader_lexicon')"
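
As a quick sanity check after installation, the following minimal sketch reports which of the required Python packages cannot be found in the active environment. The helper name `missing_modules` is hypothetical and not part of this repository; it only checks that the packages are importable, not that the spaCy model or NLTK resources were downloaded.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names from `names` that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # spacy and nltk are the packages used by the download commands above.
    missing = missing_modules(["spacy", "nltk"])
    print("missing modules:", missing or "none")
```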

Prepare Data

Three CSV files are needed for this project:

data.csv (tasks 3.2, 3.3 and 3.4)
train.csv (task 3.3 BERT model and task 3.4)
val.csv (task 3.3 BERT model and task 3.4)

Step 1: Download Data

Option 1. Auto-script

sh scripts/download-data.sh

Option 2. Step-by-step download

Download data, move it into core/data/ and unzip:

cd core/data
unzip data.zip

Step 2: Process Data

python core/examples/data_preprocess.py core/data/reviewSelected100.json

Note:

You need to provide the correct absolute path to the JSON data file.
Input: the data file in JSON format.
Output: core/data/{data, train, val}.csv, which are used by the tasks that follow.
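
The general shape of such a preprocessing step can be sketched as below. This is not the code in data_preprocess.py: the JSON Lines layout, the field names in `FIELDS`, the 80/20 split and the seed are all assumptions made for illustration, and the helper names are hypothetical.

```python
import csv
import json
import random

# Assumed field names; the README does not state the actual schema.
FIELDS = ["business_id", "stars", "text"]


def load_reviews(json_path):
    """Read one JSON object per line (assumed JSON Lines layout)."""
    with open(json_path, encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


def write_csv(rows, csv_path):
    """Write the selected fields of each row to a CSV file with a header."""
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS, extrasaction="ignore")
        writer.writeheader()
        writer.writerows(rows)


def split_train_val(rows, val_fraction=0.2, seed=0):
    """Shuffle a copy of the rows and split it; fraction and seed are illustrative."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_val = int(len(shuffled) * val_fraction)
    return shuffled[n_val:], shuffled[:n_val]
```

Under these assumptions, `write_csv(rows, "core/data/data.csv")` would produce the full file, and the two lists returned by `split_train_val(rows)` would be written to train.csv and val.csv in the same way.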

Execution

3.2 Data Analysis

Sentence Segmentation

python core/examples/3.2-Dataset-analysis/sentence_segmentation.py

Note:

Input: core/data/data.csv
Output: core/examples/3.2-Dataset-analysis/results/sentence_segmentation_by_star/
Once sentence segmentation for a rating star is completed, the corresponding plot of review counts vs. sentence counts is displayed. Plots are currently displayed one at a time: to view the plot for the next rating star, close the current plot. To save the image, click the "save" icon at the bottom of the plot window.
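
The quantity behind each plot (how many reviews contain a given number of sentences) can be sketched as follows. The regex splitter is a crude stand-in, not the segmenter used by the script; if NLTK and its punkt data are installed, `nltk.sent_tokenize` can be passed as `segment` instead. Both function names are hypothetical.

```python
import re
from collections import Counter


def naive_sentences(text):
    """Split after ., ! or ? followed by whitespace; a crude stand-in segmenter."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def sentence_count_histogram(reviews, segment=naive_sentences):
    """Map 'number of sentences in a review' to 'number of reviews with that count'."""
    return Counter(len(segment(review)) for review in reviews)


if __name__ == "__main__":
    sample = ["Great food. Will return!", "Bad.", "Slow service. Cold food."]
    print(sentence_count_histogram(sample))
```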

Tokenization and Stemming

python core/examples/3.2-Dataset-analysis/tokenization_stemming.py

Note:

Input: core/data/data.csv
Output: core/examples/3.2-Dataset-analysis/results/tokenize_and_stemming/, core/examples/3.2-Dataset-analysis/results/top_20_words/
Once tokenization is completed, the corresponding plot of review counts vs. token counts is displayed. Plots are currently displayed one at a time: to view the next plot, close the current plot. To save the image, click the "save" icon at the bottom of the plot window.
You may change the argument in get_most_freq(num=20) to view a different number of most frequent tokens.
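
For illustration, counting the `num` most frequent tokens can be sketched as below. This is not the repository's get_most_freq: the regex tokenizer is a simplification, stemming is omitted, and the function names are hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase and keep runs of letters or apostrophes; a simplified tokenizer."""
    return re.findall(r"[a-z']+", text.lower())


def most_frequent_tokens(reviews, num=20):
    """Return the `num` most common tokens as (token, count) pairs."""
    counts = Counter(token for review in reviews for token in tokenize(review))
    return counts.most_common(num)


if __name__ == "__main__":
    print(most_frequent_tokens(["Good food, good price", "Food was good"], num=2))
```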

POS Tagging

python core/examples/3.2-Dataset-analysis/pos_tag.py

Note:

Input: core/data/data.csv
Output: core/examples/3.2-Dataset-analysis/results/pos_tagging/tagger_result.csv
The CSV file contains five sections; each section includes the tagging results produced by a different tagger for the same sentence. The taggers appear in this order: default tagger, regex-based tagger, baseline tagger, unigram tagger, unigram tagger with backoff, bigram tagger, bigram tagger with backoff, trigram tagger, trigram tagger with backoff, and perceptron tagger.
The random seed is set to 22 so that the script produces the same output on every run. Change the seed if you want different output.
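
To make the idea of a tagger "with backoff" concrete, here is a minimal pure-Python sketch of a unigram tagger that falls back to a default tag for unseen words. It is an illustration only, not the code in pos_tag.py; NLTK provides the same pattern through `UnigramTagger(train, backoff=DefaultTagger("NN"))`. The function names and the tiny training set are hypothetical.

```python
from collections import Counter, defaultdict


def train_unigram(tagged_sentences):
    """Learn the most frequent tag for each word seen in training."""
    tag_counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            tag_counts[word.lower()][tag] += 1
    return {word: counts.most_common(1)[0][0] for word, counts in tag_counts.items()}


def tag_with_backoff(words, unigram_table, default_tag="NN"):
    """Use the unigram table for known words, else back off to default_tag."""
    return [(w, unigram_table.get(w.lower(), default_tag)) for w in words]


if __name__ == "__main__":
    train = [[("the", "DT"), ("food", "NN"), ("is", "VBZ"), ("good", "JJ")]]
    table = train_unigram(train)
    # "pizza" was never seen in training, so it receives the default tag.
    print(tag_with_backoff(["The", "pizza", "is", "good"], table))
```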

Most Frequent Adjectives for each Rating

python core/examples/3.2-Dataset-analysis/most_freq_adj.py

Note:

Input: core/data/data.csv
Output: core/examples/3.2-Dataset-analysis/results/most_freq_adj/
The script first groups the reviews by rating star and generates a CSV file for each rating (e.g. r1_review.csv). It then counts the most frequent adjectives and stores the results in most_freq_adj.csv. Finally, it calculates the most indicative adjectives and stores the results in most_indicative_adj.csv.
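
The README does not say how indicativeness is calculated. One common choice, assumed here purely for illustration, is the pointwise relative-entropy score P(w|r) * log(P(w|r) / P(w)), which rewards words that are much more frequent within a rating r than overall. The function name `indicativeness` is hypothetical.

```python
import math
from collections import Counter


def indicativeness(counts_in_rating, counts_overall):
    """Score each word by P(w|r) * log(P(w|r) / P(w)).

    Assumes every word in counts_in_rating also appears in counts_overall,
    which holds when the rating's reviews are a subset of all reviews.
    """
    total_rating = sum(counts_in_rating.values())
    total_overall = sum(counts_overall.values())
    scores = {}
    for word, count in counts_in_rating.items():
        p_word_given_rating = count / total_rating
        p_word = counts_overall[word] / total_overall
        scores[word] = p_word_given_rating * math.log(p_word_given_rating / p_word)
    return scores


if __name__ == "__main__":
    one_star = Counter({"terrible": 3, "good": 1})
    overall = Counter({"terrible": 3, "good": 9})
    print(indicativeness(one_star, overall))
```

Words with a positive score are over-represented in the rating; words with a negative score are under-represented.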

3.3 Noun Adjective Pair Summarizer

Rule based method: POS Tagging + FSA

python core/examples/3.3-Adj-Noun-Pairs/adj_noun_extractor1.py

Note:

Input: core/data/data.csv
Output: core/examples/3.3-Adj-Noun-Pairs/adj_noun_pairs1.csv
numberOfBusinessId=5 [line 11, int, the number of different business IDs]
numberOfPairs=5 [line 12, int, the number of noun-adj pairs for each business ID]
withExtra=False [line 13, boolean, whether extra words are included, e.g. good / very good]
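
The combination of POS tags and a finite-state automaton can be sketched as below. This is a simplified illustration, not the logic in adj_noun_extractor1.py: it assumes Penn Treebank tags (RB, JJ, NN) on already-tagged tokens and only matches an optional adverb run, then an adjective, then a noun. The `with_extra` argument mirrors the idea of withExtra; all names are hypothetical.

```python
def extract_pairs(tagged_tokens, with_extra=False):
    """Scan (word, tag) pairs and emit (noun, adjective phrase) matches.

    States: collecting adverbs -> adjective seen -> noun closes the pair.
    Any other tag resets the automaton to its start state.
    """
    pairs = []
    adverbs, adjective = [], None
    for word, tag in tagged_tokens:
        if tag.startswith("RB") and adjective is None:
            adverbs.append(word)
        elif tag.startswith("JJ"):
            adjective = word  # keep the adjective closest to the noun
        elif tag.startswith("NN") and adjective is not None:
            modifier = " ".join(adverbs + [adjective]) if with_extra else adjective
            pairs.append((word, modifier))
            adverbs, adjective = [], None
        else:
            adverbs, adjective = [], None
    return pairs


if __name__ == "__main__":
    tagged = [("very", "RB"), ("good", "JJ"), ("food", "NN"),
              ("and", "CC"), ("slow", "JJ"), ("service", "NN")]
    print(extract_pairs(tagged))
    print(extract_pairs(tagged, with_extra=True))
```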

BERT-based method:

python core/examples/3.3-Adj-Noun-Pairs/adj_noun_extractor2.py

Note:

Input: core/data/data.csv
Output: the extracted noun-adj pairs, printed in the order of the business IDs.

3.4 Application

Sentiment analysis

python core/examples/3.4-sentiment-analysis/sentiment_analysis.py
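
Since the install step downloads the NLTK vader_lexicon, this script may rely on VADER; that is an assumption, not something the README states. If so, the compound score returned by `SentimentIntensityAnalyzer().polarity_scores(text)["compound"]` is conventionally mapped to a label with thresholds of plus or minus 0.05, as in this minimal sketch. The function name `label_from_compound` is hypothetical.

```python
def label_from_compound(compound, threshold=0.05):
    """Map a VADER-style compound score in [-1, 1] to a sentiment label."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


if __name__ == "__main__":
    for score in (0.8, 0.0, -0.6):
        print(score, label_from_compound(score))
```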

RNN model

Train:

export PYTHONPATH="${PWD}/core"
python core/examples/3.4-sentiment-analysis/sentiment_analysis_train.py
unset PYTHONPATH

Predict:
Change inputs here:

...
    input_list=[[
        "Appreciate the call ahead seating! Always a joy to go in n sit down with in 15 mins MAX! Thanks for awesome food and service!"
    ], ["delicious, grubby, chinese food in generous portions and always great service. their szchewaun chicken is THE BOMB. so spicy it makes me sweat."]],
...

And run:

export PYTHONPATH="${PWD}/core"
python core/examples/3.4-sentiment-analysis/sentiment_analysis_predict.py
unset PYTHONPATH

Note:

Input: a text string
Output: the predicted degree of sentiment.

Run Web Application

You need to run three microservices, located in serving, server, and web-app. See the model server, API server, and frontend instructions.
