GitHub - Tuan-Lee-23/Vietnamese-corpus-search-and-analysis-Web-app: Vietnamese corpus search tools and statistical analysis

This project is written entirely in Python (v3.7).

Features:

Corpus search tool:

The tool can search a corpus by:

Ambiguous: search for anything, such as a character, a number, or a morpheme
Noun (POS tagging)
Verb (POS tagging)
Adjective (POS tagging)
Name of Person (NER model)
Name of Location (NER model)
Name of Organization (NER model)
Top 10 words most similar to your input (Gensim Word2Vec)
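
The tag-based search modes above can be pictured as filtering tagged tokens. The following is a minimal sketch, assuming the corpus has already been tagged into (word, tag) pairs, which is the shape Underthesea's `pos_tag` returns. The function name `search_by_tag`, the hand-tagged sample sentences, and the tag labels used here are illustrative assumptions, not the app's actual code.

```python
from typing import List, Tuple

# One sentence = a list of (word, tag) pairs.
TaggedSentence = List[Tuple[str, str]]


def search_by_tag(corpus: List[TaggedSentence], query: str, tag: str) -> List[str]:
    """Return the sentences in which `query` occurs with the given tag."""
    query = query.lower()
    hits = []
    for sentence in corpus:
        # Keep the sentence if any token matches both the word and the tag.
        if any(word.lower() == query and t == tag for word, t in sentence):
            hits.append(" ".join(word for word, _ in sentence))
    return hits


# Hand-tagged toy data (hypothetical, for illustration only).
corpus = [
    [("Tôi", "P"), ("học", "V"), ("tiếng Việt", "N")],
    [("Việc", "N"), ("học", "N"), ("rất", "R"), ("quan trọng", "A")],
]

# Search for the word as a verb only; the second sentence is skipped.
print(search_by_tag(corpus, "học", "V"))
```

The same filter would cover the NER modes if the tags were entity labels instead of part-of-speech labels.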

Corpus dataset:

The corpus consists of about 12k description lines that I scraped from vnexpress.net.

Libraries used:

Dash + Dash bootstrap components
Plotly
Gensim
Underthesea (note: Underthesea now requires PyTorch 1.4.0)
nltk
numpy
pandas
statsmodels

How to run:

Open a terminal in the following directory: "Vietnamese-corpus-search-and-analysis-Web-app/"

Using the corpus search app

Run in the terminal:

python src/app.py

Wait about 1 minute for the server to start. When the localhost link appears in the terminal, Ctrl-click it or copy and paste it into your browser.

Using the corpus statistical analysis app

Run in the terminal:

python src_statistics/app.py

Wait about 1 minute for the server to start. When the localhost link appears in the terminal, Ctrl-click it or copy and paste it into your browser.

Using another corpus

Rename your corpus file to "vn_express.txt" and use it to replace the existing file in resources/
Run "python src/create_NER_pickle.py", then enter the path to your corpus, "resources/vn_express.txt", when prompted. This builds the NER model and the Word2vec model and writes them to two files, ner.pik and w2v.pik
You only need to run this once for each new corpus
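
Since the models only need rebuilding when the corpus changes, a small check can tell whether the pickle files are stale. This is a minimal sketch, not part of the repository: the helper name `needs_rebuild` is hypothetical, and it assumes that comparing file modification times is a good enough signal that the corpus was replaced.

```python
from pathlib import Path
from typing import List


def needs_rebuild(corpus: Path, artifacts: List[Path]) -> bool:
    """Return True if any artifact is missing or older than the corpus file."""
    corpus_mtime = corpus.stat().st_mtime
    for artifact in artifacts:
        if not artifact.exists() or artifact.stat().st_mtime < corpus_mtime:
            return True
    return False


if __name__ == "__main__":
    resources = Path("resources")
    corpus = resources / "vn_express.txt"
    pickles = [resources / "ner.pik", resources / "w2v.pik"]
    # Only suggest a rebuild when the corpus exists and the pickles are stale.
    if corpus.exists() and needs_rebuild(corpus, pickles):
        print("Models are out of date; run: python src/create_NER_pickle.py")
```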

Folder structure:

docs/: documentation folder
- NLP.pptx: slides
src/: source code of corpus search app
src_statistics/: source code of corpus statistical analysis app
resources/:
- ner.pik: pickle file of NER model
- w2v.pik: pickle file of Word2vec model
- vn_express.txt: main corpus data
- corpus_mini.txt: small 2k corpus for fast debugging
- stop_words.txt: list of Vietnamese stopwords
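
A stopword file like the one listed above is typically loaded into a set and used to filter tokens before counting or analysis. The following is a minimal sketch, assuming the file is UTF-8 with one stopword per line; that layout, the function names, and the sample tokens are assumptions for illustration, not taken from the repository.

```python
from pathlib import Path
from typing import Iterable, List, Set


def load_stop_words(path: str) -> Set[str]:
    """Read one stopword per line (assumed layout), skipping blank lines."""
    text = Path(path).read_text(encoding="utf-8")
    return {line.strip().lower() for line in text.splitlines() if line.strip()}


def remove_stop_words(tokens: Iterable[str], stop_words: Set[str]) -> List[str]:
    """Keep only the tokens that are not stopwords (case-insensitive)."""
    return [t for t in tokens if t.lower() not in stop_words]


# Inline stopwords so the example runs without the resources/ folder;
# in the project layout this would be load_stop_words("resources/stop_words.txt").
stop_words = {"và", "là", "của"}
tokens = ["Hà Nội", "là", "thủ đô", "của", "Việt Nam"]
print(remove_stop_words(tokens, stop_words))
```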
