N-Gram-Language-Modeling

We are writing a program that computes unsmoothed unigrams and bigrams for an arbitrary text corpus, in this case open source books from Gutenberg.org. Since we are working with raw texts, so we need to do tokenization, based on the design decisions we make.

(a) Data and preprocessing: We are working with a selection of books from gutenberg.org which are separated according to their genres: children’s books, crime/detective and history/biography. We will use the books as our corpora to train language models.

(b) Random sentence generation: We will generate random sentences based on the unigram and bigram language model that we generated in the first step. We will also do the same with seeding, i.e. starting from an incomplete sentence of our choice and completing it by generating from our language model, instead of generating from scratch.

(c) Smoothing: We will implement Add-One smoothing and Good-Turing smoothing to improve the accuracy of the language models

(d) Handle unknown words: We will implement an algorithm that will allow us to handle unknown words in the test data that were not found in the training data.

(e) Genre Classification: Using K-Nearest Neighbor algorithm to classify test books into one of the three genres - crime, history or children

Name		Name	Last commit message	Last commit date
Latest commit History 88 Commits
.idea		.idea
KNearestGenreClassification		KNearestGenreClassification
ModelCreation_SentenceGenerator		ModelCreation_SentenceGenerator
Smoothing		Smoothing
utils		utils
.gitignore		.gitignore
.project		.project
.pydevproject		.pydevproject
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

.idea

.idea

KNearestGenreClassification

KNearestGenreClassification

ModelCreation_SentenceGenerator

ModelCreation_SentenceGenerator

Smoothing

Smoothing

utils

utils

.gitignore

.gitignore

.project

.project

.pydevproject

.pydevproject

README.md

README.md

Repository files navigation

N-Gram-Language-Modeling

About

Releases

Packages

Languages

gkeswani92/N-Gram-Language-Modeling

Folders and files

Latest commit

History

Repository files navigation

N-Gram-Language-Modeling

About

Resources

Stars

Watchers

Forks

Languages