GitHub - ChenluJi/cs224n-project: Shakespeare in One Hundred Words


README
------
This file describes the various scripts and utilities and the manner in which
they were used to generate the summaries.

PRE-PROCESSING
--------------
The general sequence of pre-processing steps is as follows. Each loop is run from
inside the directory that holds the output of the previous step:

1) The directory 'shakespeare' contains the raw text of the plays, marked up
   in XML. Source: http://www.ibiblio.org/xml/examples/shakespeare/

2) Each document was manually edited to remove the credits line (typically the
   first few words in the text). The results were stored in the directory called
   'processed'.

3) All-caps words were then removed from each processed file: these are typically
   the names of the speakers preceding their lines in the play. Removing them keeps
   the algorithms from being distracted by this noise.

   for f in *.txt; do python ../remove_all_caps_words.py < "$f" > ../remove_all_caps/"$f"; done

4) Words in the text of each play are then uniformly lower-cased.

   for f in *.txt; do tr '[:upper:]' '[:lower:]' < "$f" > ../lowercase/"$f"; done

5) Punctuation is removed.

   for f in `ls *.txt`; do cat $f | sed 's/[:;?!.,-]/ /g' | tr "'" " " > ../remove_punct/$f ; done

6) Stop words are removed.

   for f in `ls *.txt`; do python ../remove_stop_words.py $f > ../remove_stop_words/$f ; done

7) The text is stemmed using a Porter stemmer.

   for f in `ls *.txt`; do python ../porter.py $f > ../stemmed/$f ; done

8) A lexicon of stemmed words is created.

   cat stemmed/*.txt | tr ' ' '\n' | sed '/^[[:space:]]*$/d' | sort | uniq -c > LEXICON
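
Steps 3 through 6 above can also be sketched as a single Python function. This is
only an illustration of the sequence, not the code in remove_all_caps_words.py or
remove_stop_words.py: the helper names, the tiny STOP_WORDS set, and the rule that
an "all-caps word" has at least two letters are assumptions made for the example.
Stemming (step 7) is left out because it depends on porter.py.

```python
import re

# Hypothetical stop-word list for illustration only; the project keeps
# its real list in stop_words.txt.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it"}


def is_all_caps(token):
    """True if the token has at least two letters and all of them are upper-case."""
    letters = [c for c in token if c.isalpha()]
    return len(letters) > 1 and all(c.isupper() for c in letters)


def preprocess(line, stop_words=STOP_WORDS):
    # Step 3: drop all-caps tokens (speaker labels such as "BRUTUS").
    tokens = [t for t in line.split() if not is_all_caps(t)]
    # Step 4: lower-case the remaining text.
    text = " ".join(tokens).lower()
    # Step 5: replace the same punctuation characters as the sed/tr commands.
    text = re.sub(r"[:;?!.,'-]", " ", text)
    # Step 6: remove stop words; split() also discards empty tokens.
    return [t for t in text.split() if t not in stop_words]


if __name__ == "__main__":
    print(preprocess("CAESAR The ides of March are come."))
```

Note that all-caps removal has to happen before lower-casing, which is why the
steps are ordered the way they are.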



# How many unique words across all plays?
cat shakespeare/*.txt | tr ' !?,:;.-' '\n' | tr '[:upper:]' '[:lower:]' | sed '/^$/d' | sort | uniq | wc -l


CALCULATING SENTENCE FREQUENCIES
--------------------------------
(*** These are not used. ***)
cd remove_all_caps
for f in `ls *.txt`; do
    python ../sf.py $f
    mv sf.dat ../sentence-frequencies/`echo $f | cut -d'.' -f1`.sf;
done

CALCULATING TERM FREQUENCIES
----------------------------
cd stemmed
for f in *.txt; do tr ' ' '\n' < "$f" | sed '/^[[:space:]]*$/d' | sort | uniq -c > ../term-frequencies/"$f"; done
# rename the tf files--just to avoid confusion
cd ../term-frequencies
for f in *.txt; do mv "$f" "${f%.txt}.tf"; done
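
The same per-play term count can be sketched in Python with collections.Counter.
The function names here are hypothetical, and the output only approximates the
layout of `uniq -c` (count first, then the term); sort order may differ from the
shell's locale-dependent `sort`.

```python
from collections import Counter


def term_frequencies(text):
    """Count how often each whitespace-separated term occurs in a stemmed text."""
    # str.split() with no argument splits on any whitespace run and never
    # yields empty tokens, so no separate blank-line filter is needed.
    return Counter(text.split())


def format_counts(counts):
    """Render counts as "count term" lines, sorted by term."""
    return [f"{n:7d} {term}" for term, n in sorted(counts.items())]


if __name__ == "__main__":
    for line in format_counts(term_frequencies("brutu caesar  brutu\nrome")):
        print(line)
```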

CALCULATING DOCUMENT FREQUENCIES
--------------------------------
Document frequencies are computed by the script 'df.py' and saved in the file 'df.dat':
python df.py term-frequencies/*.tf
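
For reference, a document frequency is the number of documents (here, plays) in
which a term occurs at least once. The sketch below shows that definition only;
it is not the code in df.py, and the function name and input layout are
assumptions.

```python
from collections import Counter


def document_frequencies(docs):
    """Map each term to the number of documents that contain it.

    docs: dict of document name -> iterable of terms in that document.
    """
    df = Counter()
    for terms in docs.values():
        # set() ensures a term counts once per document, however often it repeats.
        df.update(set(terms))
    return df


if __name__ == "__main__":
    plays = {
        "julius_caesar": ["caesar", "brutu", "caesar", "rome"],
        "hamlet": ["hamlet", "ghost", "rome"],
    }
    print(document_frequencies(plays))
```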

SUMMARY CREATION
----------------
Each summarizer takes as input the text of a play from step 3 above (all-caps words
removed); the remaining processing (lower-casing and stemming) is done in the code
itself. This retains the original sentence as part of the Sentence object, so that the
result can be displayed in terms of the original text. All computation is done on the
processed text, i.e., tf-idf values, SVD, and PageRank operate on stemmed, lower-cased words.
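
The idea of keeping the original sentence alongside its processed form can be
sketched as follows. Everything here is an assumption made for illustration: the
fields of Sentence, the normalize() stand-in (which only lower-cases and strips
punctuation, with no stemming), the tf * log(N / df) weighting, and the choice to
score a sentence by its mean term weight. The project's sentence.py and tfidf.py
may do this differently.

```python
import math
from dataclasses import dataclass, field

PUNCT = ".,;:?!'-"


@dataclass
class Sentence:
    original: str                                # text shown in the summary
    tokens: list = field(default_factory=list)   # processed form used for scoring


def normalize(text):
    """Stand-in for the real processing: lower-case and strip punctuation."""
    words = (w.strip(PUNCT).lower() for w in text.split())
    return [w for w in words if w]


def tfidf_score(sentence, tf, df, n_docs):
    """Mean tf-idf weight of the sentence's tokens (assumed scoring rule)."""
    if not sentence.tokens:
        return 0.0
    total = sum(
        tf.get(t, 0) * math.log(n_docs / df[t])
        for t in sentence.tokens
        if df.get(t)
    )
    return total / len(sentence.tokens)


def summarize(raw_sentences, tf, df, n_docs, k=1):
    """Rank on processed tokens, but return the original sentence text."""
    sentences = [Sentence(s, normalize(s)) for s in raw_sentences]
    ranked = sorted(
        sentences, key=lambda s: tfidf_score(s, tf, df, n_docs), reverse=True
    )
    return [s.original for s in ranked[:k]]
```

Terms that occur in every document get a weight of log(1) = 0, so sentences made
up mostly of common words fall to the bottom of the ranking.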

# Summarize ALL plays! (cd into remove_all_caps)
for f in `ls *.txt`; do python ../main.py -doc $f -tf ../term-frequencies/`echo $f | cut -d'.' -f1`.tf -df ../df.dat -alg tfidf > ../summary-tfidf/$f; done

# SVD version
for f in `ls *.txt`; do python ../main.py -doc $f -tf ../term-frequencies/`echo $f | cut -d'.' -f1`.tf -df ../df.dat -alg svd > ../summary-svd/$f; done

# PageRank
for f in `ls *.txt`; do python ../main.py -doc $f -tf ../term-frequencies/`echo $f | cut -d'.' -f1`.tf -df ../df.dat -alg pagerank > ../summary-pagerank/$f; done
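
As background for the PageRank variant, the sketch below ranks sentences by
running PageRank over a graph whose nodes are sentences and whose edge weights
are pairwise similarities. It is a generic illustration, not the code in
pagerank.py: the Jaccard similarity, the damping factor of 0.85, the fixed
iteration count, and all function names are assumptions.

```python
def similarity(a, b):
    """Jaccard overlap between the token sets of two sentences."""
    a, b = set(a), set(b)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def pagerank(weights, damping=0.85, iterations=50):
    """Power iteration on a weighted graph given as an n x n matrix."""
    n = len(weights)
    ranks = [1.0 / n] * n
    out_sums = [sum(row) for row in weights]
    for _ in range(iterations):
        new_ranks = []
        for j in range(n):
            # Each node i passes rank to j in proportion to its edge weight.
            # Nodes with no outgoing weight pass nothing (a simplification).
            incoming = sum(
                ranks[i] * weights[i][j] / out_sums[i]
                for i in range(n)
                if out_sums[i] > 0
            )
            new_ranks.append((1 - damping) / n + damping * incoming)
        ranks = new_ranks
    return ranks


def rank_sentences(token_lists):
    """Return sentence indices ordered from most to least central."""
    n = len(token_lists)
    weights = [
        [similarity(token_lists[i], token_lists[j]) if i != j else 0.0
         for j in range(n)]
        for i in range(n)
    ]
    scores = pagerank(weights)
    return sorted(range(n), key=lambda i: scores[i], reverse=True)
```

A summary would then take the top-ranked indices and print the corresponding
original sentences.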

EVALUATION
----------
Trigrams
========
python evaluation.py -gold gold_julius_caesar.txt -test summary-pagerank/julius_caesar.txt 
P: 0.12962962963 R: 0.4375 F1: 0.2

python evaluation.py -gold gold_julius_caesar.txt -test summary-tfidf/julius_caesar.txt 
P: 0.0344827586207 R: 0.25 F1: 0.0606060606061

python evaluation.py -gold gold_julius_caesar.txt -test summary-kmeans/julius_caesar.txt 
P: 0.0666666666667 R: 0.25 F1: 0.105263157895

python evaluation.py -gold gold_julius_caesar.txt -test summary-svd/julius_caesar.txt 
P: 0.0 R: 0.0 F1: 0

Unigrams
========
python evaluation.py -gold gold_julius_caesar.txt -test summary-pagerank/julius_caesar.txt -model 1
P: 0.3 R: 0.405405405405 F1: 0.344827586207

python evaluation.py -gold gold_julius_caesar.txt -test summary-tfidf/julius_caesar.txt -model 1
P: 0.155555555556 R: 0.378378378378 F1: 0.220472440945

python evaluation.py -gold gold_julius_caesar.txt -test summary-kmeans/julius_caesar.txt -model 1
P: 0.254901960784 R: 0.351351351351 F1: 0.295454545455

python evaluation.py -gold gold_julius_caesar.txt -test summary-svd/julius_caesar.txt -model 1
P: 0.0983606557377 R: 0.162162162162 F1: 0.122448979592
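
For readers unfamiliar with these measures, the sketch below shows one common way
to compute n-gram precision, recall, and F1 between a gold summary and a test
summary. It is not the code in evaluation.py: comparing sets of distinct n-grams
is an assumption, as are the function names. The F1 formula, 2PR / (P + R), does
agree with the figures reported above (for example, P = 0.1296 and R = 0.4375
give F1 = 0.2).

```python
def ngrams(tokens, n):
    """Set of distinct n-grams (as tuples) in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def evaluate(gold_tokens, test_tokens, n=3):
    """Return (precision, recall, F1) of the test n-grams against the gold n-grams."""
    gold = ngrams(gold_tokens, n)
    test = ngrams(test_tokens, n)
    overlap = len(gold & test)
    precision = overlap / len(test) if test else 0.0
    recall = overlap / len(gold) if gold else 0.0
    return precision, recall, f1_score(precision, recall)


if __name__ == "__main__":
    print(evaluate("a b c d".split(), "a b c x".split(), n=3))
```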