nlp-proj-LPD

Hello, The goal of this project was to identify LPD possible in sentences for hebrew

for example: "היה לי לוחם ממש טוב" <=> "היה לי חלום ממש טוב", "רוצה לבוא לאכול גדילה" <=> "רוצה לבוא לאכול גלידה"

In this project has several parts:

Preparation of the information:

I downloaded form Mila site : http://www.mila.cs.technion.ac.il/resources_corpora_wikipedia_2013.html Corpus for wikpdia 2013, And extracted as many sentences as I could (more 192000000 sentences) It was not simple (prepareTheCorpus.py)

create Vocabulary,Bigram Model, And LPD sentence

The next step was to create a vocabulary, a Bigram model and I test sentences with LPD the test sentence is not import to train courps.

tests:

baseLine:

the baseLine is simple: For each word I choose a random combinator, that exists in a dictionary. of all possible combinations. The results were really bad: 0% Almost all the words in the sentence were changed and no sentence came out as the desired sentence.

unigram or other words Which appeared most often:

For each word I choose a the word that appeared most often. The results were a bit nicer: 64%

Bigram:

For each word I choose the word that its probability when known as the previous word and the next word is the highest. The results in this case were really impressive:97%

Play yourself:

for play yourself run test model bigram.

You can enter any sentence you want and test it.
You can choose a sentence between the test corpus.
You can select a sentence that has an error from the test corpus. (It helped me check the errors, and improve the results)

Thanks

Name		Name	Last commit message	Last commit date
Latest commit History 19 Commits
.idea		.idea
__pycache__		__pycache__
src		src
test		test
.gitattributes		.gitattributes
.gitignore		.gitignore
README.md		README.md
common.pyc		common.pyc
wordcount.lex		wordcount.lex
זיהוי ותיקון של שיכול אותיות.docx		זיהוי ותיקון של שיכול אותיות.docx

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

.idea

.idea

pycache

pycache

src

src

test

test

.gitattributes

.gitattributes

.gitignore

.gitignore

README.md

README.md

common.pyc

common.pyc

wordcount.lex

wordcount.lex

זיהוי ותיקון של שיכול אותיות.docx

זיהוי ותיקון של שיכול אותיות.docx

Repository files navigation

nlp-proj-LPD

Preparation of the information:

create Vocabulary,Bigram Model, And LPD sentence

tests:

baseLine:

unigram or other words Which appeared most often:

Bigram:

Play yourself:

About

Releases

Packages

Languages

aviadpinis/nlp-proj-LPD

Folders and files

Latest commit

History

Repository files navigation

nlp-proj-LPD

Preparation of the information:

create Vocabulary,Bigram Model, And LPD sentence

tests:

baseLine:

unigram or other words Which appeared most often:

Bigram:

Play yourself:

About

Resources

Stars

Watchers

Forks

Languages