turkish-parliament-texts

Install

wget http://voltran.cmpe.boun.edu.tr/temporary_download/datasets/tbmm/tbmm-corpus-v0.1.tar.gz
tar -zxvf tbmm-corpus-v0.1.tar.gz

wget http://voltran.cmpe.boun.edu.tr/temporary_download/datasets/tbmm/corpus-dev.tar.gz
tar -zxvf corpus-dev.tar.gz

pip3 install pipenv
pipenv install --python 3

Run

Example code to save a figure:

pipenv shell
ipython
import corpus_loader
# plot distribution of "mebus" and "milletvekil" keywords
corpus_loader.corpus.plot_word_freqs_given_a_regexp_for_each_year([r"^mebus", r"^milletvekil"], keyword="milletvekil_and_mebus")

# to see the plotted values
plot_values, counts, total_count, all_keywords = corpus_loader.corpus._word_freqs_given_a_regexp_for_each_year(r"^(milletvekil|vekil)", keyword="milletvekil")
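
To make the idea behind these calls concrete, here is a minimal, self-contained sketch of counting regexp matches per year and normalizing by the yearly token count. It is not the code in corpus_loader.py or tbmmcorpus.py: the function name `word_freqs_by_year`, the `docs_by_year` input shape, and the toy tokens are all hypothetical and chosen only for illustration.

```python
import re


def word_freqs_by_year(docs_by_year, patterns):
    """Hypothetical helper: for each year, count tokens matching any pattern.

    docs_by_year maps a year to a list of tokens (an assumed input shape,
    not the corpus format used by this repository).
    Returns {year: (match_count, total_count, relative_frequency)}.
    """
    compiled = [re.compile(p) for p in patterns]
    result = {}
    for year, tokens in sorted(docs_by_year.items()):
        total = len(tokens)
        matches = sum(1 for tok in tokens if any(c.search(tok) for c in compiled))
        # Guard against empty years to avoid division by zero.
        result[year] = (matches, total, matches / total if total else 0.0)
    return result


# Toy data standing in for tokenized parliament transcripts.
toy = {
    1920: ["mebus", "meclis", "mebuslar", "reis"],
    1990: ["milletvekili", "meclis", "milletvekilleri", "başkan", "kanun"],
}

freqs = word_freqs_by_year(toy, [r"^mebus", r"^milletvekil"])
for year, (match_count, total, rel) in freqs.items():
    print(year, match_count, total, rel)
```

Normalizing by the yearly total keeps years with more transcribed text from dominating the plot, which is presumably why the real call also returns a total count alongside the raw counts.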

Jupyter Notebook

You can use the notebook.ipynb file to load and query the corpus.

Preprocessing

Run the command below to test the new text-cleaning suite.

python3 cleaning_text_files.py
