PSHAT

(Pronounced "P'Shot") Part of Speech Handling for Aramaic Talmud

This is the official repo for Noah's Master's thesis.

This project aims to fill the gaping hole in ancient Aramaic POS tagging. Astonishingly, research in this field is scant. My work begins to show that modern machine learning techniques can learn syntactic patterns in Talmud, despite two major issues:

Talmud has no punctuation. Because of this, it can be very difficult to break up sentences and ideas, even if one is familiar with the Aramaic and the structure of the text.
Talmud is actually a mix of two languages, Mishnaic Hebrew and Talmudic Aramaic. While in some places the distinction between these languages is clearly marked, the majority of Talmud is a mixture of the two.

Despite these issues, LSTMs were able to achieve above 90% POS tagging accuracy on a validation set.
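
For readers unfamiliar with how a tagging score like this is usually computed, here is a minimal sketch of per-word tagging accuracy. It assumes the figure above is the fraction of words that receive the correct tag; the function name and the example tags are hypothetical and are not taken from this repo or from the CAL tag set.

```python
def tagging_accuracy(gold, predicted):
    """Fraction of words whose predicted tag equals the gold tag."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        raise ValueError("cannot score an empty sequence")
    correct = sum(1 for g, p in zip(gold, predicted) if g == p)
    return correct / len(gold)


# Hypothetical tags for a five-word span (illustration only).
gold = ["N", "V", "N", "PREP", "N"]
predicted = ["N", "V", "ADJ", "PREP", "N"]
print(tagging_accuracy(gold, predicted))  # 4 of the 5 tags match
```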

I gratefully thank CAL and especially Steve Kaufman for working with me on this project. The use of his dataset was crucial and his help working with the dataset was just as important.

Requirements

This project uses the Sefaria library. Certain scripts require you to have Sefaria set up on your computer. Follow the instructions on their repo to set it up.
You need to install dynet to run the LSTMs.
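
Before running the pipeline, it can help to confirm that both dependencies are importable. The sketch below assumes Python 3 and assumes the import names are `dynet` and `sefaria`; adjust them if your setup differs. It only looks the modules up and does not import or configure them.

```python
import importlib.util

# Assumed import names: "dynet" for DyNet, "sefaria" for the Sefaria library.
REQUIRED_MODULES = ["dynet", "sefaria"]


def missing_modules(names):
    """Return the names that cannot be located on the current import path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_modules(REQUIRED_MODULES)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules were found.")
```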

Pipeline

1. Dataset_Matcher.py: takes input from data/1_cal_input and outputs to data/2_matched_sefaria.
2. LangDatasetGenerator.py: generates the language training dataset from the Sefaria library and CAL files. Aramaic training data comes from data/1_cal_input/caldbfull.txt and Mishnaic Hebrew training data comes from Sefaria's Mishnah. Outputs the training set as a JSON file to data/3_lang_tagged/model/lstm_training.json. NOTE: This file isn't written perfectly. There's a bool at the bottom, make_training. If it is True, the script generates training files. Otherwise, see step (4).
3. LangTagger.py: takes input from data/3_lang_tagged/model/lstm_training.json and trains an LSTM to differentiate between Hebrew and Aramaic (on individual words only). Outputs to data/3_lang_tagged.
4. Dilate the language-tagged output: run LangDatasetGenerator.py with make_training = False. Outputs to 4_lang_tagged_dilated.
5. POSTagger2MLP-beam.py: takes input from 4_lang_tagged_dilated and 2_sefaria_matched, and outputs to 5_pos_tagged. Trains an LSTM to learn the POS tags of Aramaic words in Talmud.
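
Because each step reads what an earlier step wrote, a quick check of which inputs exist can tell you where to resume. The sketch below is not part of the repo: it transcribes the paths from the five steps above exactly as written (including the ones given without a `data/` prefix), and it assumes that step 4 reads the output of step 3, which the description implies but does not state. The function name is hypothetical.

```python
from pathlib import Path

# (step number, input paths, output path), copied from the pipeline list.
# Steps 4 and 5 are listed there without a "data/" prefix; kept as written.
STAGES = [
    (1, ["data/1_cal_input"], "data/2_matched_sefaria"),
    (2, ["data/1_cal_input/caldbfull.txt"],
     "data/3_lang_tagged/model/lstm_training.json"),
    (3, ["data/3_lang_tagged/model/lstm_training.json"], "data/3_lang_tagged"),
    # Assumption: dilation reads the language-tagged output of step 3.
    (4, ["data/3_lang_tagged"], "4_lang_tagged_dilated"),
    (5, ["4_lang_tagged_dilated", "2_sefaria_matched"], "5_pos_tagged"),
]


def first_blocked_step(root):
    """Return (step, missing_inputs) for the first step lacking inputs, else None."""
    root = Path(root)
    for step, inputs, _output in STAGES:
        missing = [p for p in inputs if not (root / p).exists()]
        if missing:
            return step, missing
    return None


# Check relative to the current directory (run this from the repo root).
print(first_blocked_step("."))
```

This only inspects the filesystem. It does not run any of the scripts, and the make_training flag still has to be set by hand in LangDatasetGenerator.py.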
