-Notion21_3

1. Hierarchical clustering model

Required files:

Hierarchy_Min_Final.ipynb
stop.txt

Required data:

Royal Commission hearing data for Round 3, Round 4 and Round 5.
The data for each hearing round should be stored in a separate folder. Each round contains 5 JSON files. Create the folder and file names as shown below:

parent(specify the path to the parent directory)
|--r3 (folder)
|--|--1.json
|--|--2.json
|--|--3.json
|--|--4.json
|--|--5.json
|--r4 (folder)
|--|--1.json
|--|--2.json
|--|--3.json
|--|--4.json
|--|--5.json
|--r5 (folder)
|--|--1.json
|--|--2.json
|--|--3.json
|--|--4.json
|--|--5.json
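
Before running the notebook, it can help to confirm that the folder layout above is complete. The following is a minimal sketch, not part of the repository: the helper name `missing_files` and the constants are assumptions made for illustration.

```python
from pathlib import Path

# Folder names and file count taken from the layout above.
ROUNDS = ("r3", "r4", "r5")
FILES_PER_ROUND = 5  # 1.json ... 5.json


def missing_files(parent):
    """Return the expected JSON paths that do not exist under `parent`.

    `missing_files` is a hypothetical helper, not a function from the notebook.
    """
    parent = Path(parent)
    expected = [
        parent / round_dir / f"{i}.json"
        for round_dir in ROUNDS
        for i in range(1, FILES_PER_ROUND + 1)
    ]
    return [path for path in expected if not path.is_file()]
```

An empty list means all 15 files are in place; otherwise the returned paths show which files still need to be created or renamed.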

Instructions:

To run this file, stop.txt should be in the working directory.

Run the code from top to bottom to see the same results.

Section 1: Import libraries.
Section 2: Import data and extract content.
Sections 3-7: Define functions.
Section 8: Show results (top features, top sentences, features network).
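
The notebook expects stop.txt in the working directory. As a rough sketch of how such a file could be read and applied, assuming it holds one stop word per line (the actual file format and the notebook's own loading code may differ, and both function names here are hypothetical):

```python
from pathlib import Path


def load_stop_words(path="stop.txt"):
    """Read stop words from a text file, ASSUMING one word per line."""
    text = Path(path).read_text(encoding="utf-8")
    return {line.strip().lower() for line in text.splitlines() if line.strip()}


def remove_stop_words(tokens, stop_words):
    """Drop tokens whose lowercase form is in the stop-word set."""
    return [token for token in tokens if token.lower() not in stop_words]
```

Because the default path is relative, the file is resolved against the working directory, which is why stop.txt has to be placed there.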

2. LDA model

Required files and environment:

Recommended environment: Jupyter Notebook version 4.4.0
Jupyter notebook file: LDA_Method.ipynb
Python file: LDA_Method.py
Data files: JSON files (Royal Commission)
=================================================================

Libraries:

nltk
pandas
gensim
pyLDAvis
py2neo
matplotlib
pprint
re

=================================================================

Loading data

Download the data for the 3 hearing rounds
Rename the folders to r3, r4 and r5 respectively
Rename the JSON files in each folder to 1.json, ..., 5.json

Inside the LDA_Method.ipynb or LDA_Method.py

Change the value of the variable called 'dir' to the path of the data folder, e.g. dir='/Users/shengyuan/Desktop/Study/CAPSTONE/RoyalCommission'
Change the value of the variable called 'round' to the number of the round you want to analyze, e.g. round=3
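
To illustrate how the data path and the round number combine, here is a minimal sketch of loading one round. It is not the code from LDA_Method.py: `load_round`, `data_dir` and `round_no` are hypothetical names, chosen partly because `dir` and `round` are also Python built-ins.

```python
import json
from pathlib import Path


def load_round(data_dir, round_no, n_files=5):
    """Load 1.json ... n_files.json from the folder r<round_no>.

    Hypothetical helper; the JSON content is returned unprocessed.
    """
    round_dir = Path(data_dir) / f"r{round_no}"
    documents = []
    for i in range(1, n_files + 1):
        with open(round_dir / f"{i}.json", encoding="utf-8") as fh:
            documents.append(json.load(fh))
    return documents
```

For example, `load_round('/Users/shengyuan/Desktop/Study/CAPSTONE/RoyalCommission', 3)` would read the five files in the r3 folder.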
=================================================================

Get the stop words

There are two ways to get the customized stop words:

myStops=getNameFromClassifier(data,path1,path2)

Requirement

Download the Stanford tagger library from https://nlp.stanford.edu/software/tagger.shtml (the example paths below point into the stanford-ner-2018-02-27 package)
Inside the LDA_Method.ipynb or LDA_Method.py
Change the values of the variables called 'path1' and 'path2' to the paths of the classifier model and the jar file respectively, e.g.
path1='/Users/shengyuan/desktop/study/Capstone/RoyalCommission/stanford-ner-2018-02-27/classifiers/english.all.3class.distsim.crf.ser.gz'
path2='/Users/shengyuan/desktop/study/Capstone/RoyalCommission/stanford-ner-2018-02-27/stanford-ner.jar'

myStops=getNameFromFile(round)

Download the files called Round3name.txt, Round4name.txt and Round5name.txt
Place those files in the same folder as the code file

Attention: You only need to execute one of the two functions above.
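
For the file-based option, the idea is to read the names for the chosen round from RoundNname.txt and treat them as extra stop words. The sketch below is an illustration only: `names_from_file` is a hypothetical stand-in for getNameFromFile, and it assumes the names are separated by whitespace or line breaks, which may not match the real file format.

```python
from pathlib import Path


def names_from_file(round_no, folder="."):
    """Collect lowercase name tokens from Round<round_no>name.txt.

    ASSUMPTION: names are separated by whitespace and/or newlines.
    """
    path = Path(folder) / f"Round{round_no}name.txt"
    names = set()
    for line in path.read_text(encoding="utf-8").splitlines():
        for part in line.split():
            names.add(part.lower())
    return names
```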

Getting the LDA performance

Execution code:
LDA_Analysis(data,myStops)

=================================================================

Getting the result using the optimal parameters

Execution code:
myResult,LDA_model, corpus_tfidf, dictionary=myLDA(data,myStops,n_topics,n_words,passes,iteration, top_n_sentence,top_n_article,graph)

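Parameters

Number of topics (n_topics)
Number of words provided by a topic (n_words)
Passes (passes)
Iterations (iteration)
Number of relevant articles (top_n_article)
Number of key sentences for one article (top_n_sentence)
Neo4j graph object (graph), created as described under Visualization
e.g. n_topics=8 n_words=10 passes=1 iteration=50 top_n_sentence=2 top_n_article=4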
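
Since myLDA takes six numeric parameters in a fixed order, it is easy to swap two of them by accident. One possible way to keep them together is sketched below; `LdaParams` is a hypothetical helper, not part of LDA_Method.py, with field names and defaults taken from the signature and example values given in this README.

```python
from dataclasses import dataclass, astuple


@dataclass(frozen=True)
class LdaParams:
    """Hypothetical bundle of the numeric myLDA arguments, in call order."""
    n_topics: int = 8
    n_words: int = 10
    passes: int = 1
    iteration: int = 50
    top_n_sentence: int = 2
    top_n_article: int = 4

    def __post_init__(self):
        # Every parameter is a count, so it must be a positive integer.
        for name, value in vars(self).items():
            if not isinstance(value, int) or value < 1:
                raise ValueError(f"{name} must be a positive integer, got {value!r}")


params = LdaParams()
# Intended use (requires the functions from LDA_Method.py):
# myResult, LDA_model, corpus_tfidf, dictionary = myLDA(data, myStops, *astuple(params), graph)
```

The field order mirrors the myLDA signature, so unpacking with `astuple` passes each value in the expected position.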

Visualization

Requirement

Download Neo4j from https://neo4j.com/
Create a graph and set a password
Pass the password when creating the variable called 'graph', e.g. graph = Graph(password="123")

Execution code:
lda_vis = pyLDAvis.gensim.prepare(LDA_model, corpus_tfidf, dictionary)
pyLDAvis.display(lda_vis)

In the Neo4j browser, type "MATCH (n) RETURN n" to view the whole result, or type "MATCH (n) DETACH DELETE n" to delete the whole result.
=================================================================

3. Watson model

File: Watson NLU.ipynb
