Language Classifier

Train the model:

Training: The data file on which the model is trained.
Output: The name of the file that will contain the trained model as a pickle object.
Type: The type of model to train. Use "dt" for decision trees or "ada" for ada-boosted stumps.
Testing: The file on which the model is tested for parameter tuning (for example, finding the best tree depth).
```
  python3 train.py <Training> <Output> <Type> <Testing>
```

Test the model:

Model: The filename of the model generated by train.py.
Data: The data file whose classifications are to be predicted.
```
  python3 predict.py <Model> <Data>
```
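
The trained model is written as a pickle object by train.py and read back by predict.py. A minimal sketch of that round trip, assuming the standard `pickle` module is used directly; `save_model` and `load_model` are hypothetical names for illustration, not functions taken from this repository:

```python
import pickle

def save_model(model, path):
    """Serialize a trained model object to the file at path (hypothetical helper)."""
    with open(path, "wb") as f:
        pickle.dump(model, f)

def load_model(path):
    """Load a model object previously written by save_model (hypothetical helper)."""
    with open(path, "rb") as f:
        return pickle.load(f)
```

Pickle files can execute arbitrary code when loaded, so only load model files you generated yourself.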

Features Selected:

All the selected features are based on the most frequently occurring words in each language. Before settling on these, I tried average word length, grammar, sentence construction, the letters used, and letter frequency.

All of these features failed to provide accurate classifications.

For example, for letter frequency: in Dutch, E, N, and A are the most frequently used letters, whereas in English they are E, T, and A. This was a poor differentiator.
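
A minimal sketch of how such a letter-frequency comparison could be computed. `top_letters` is a hypothetical helper written for illustration; it is not the code used in this repository:

```python
from collections import Counter

def top_letters(text, n=3):
    """Return the n most frequent letters in text, ignoring case and non-letters."""
    counts = Counter(ch for ch in text.lower() if ch.isalpha())
    return [letter for letter, _ in counts.most_common(n)]
```

Running this over a Dutch corpus and an English corpus would let you compare the top letters of each. Because the two rankings overlap heavily, the feature separates the languages poorly.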

For average word length, both languages have a similar curve, with a spike between 3 and 4 letters. This, too, was a poor differentiator.
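
The word-length comparison can be sketched as follows. The function names are hypothetical, and treating a word as any run of letters is an assumption made for this illustration:

```python
import re
from collections import Counter

def word_lengths(text):
    """Return the length of every word, where a word is a run of letters."""
    return [len(word) for word in re.findall(r"[^\W\d_]+", text)]

def average_word_length(text):
    """Return the mean word length, or 0.0 if the text has no words."""
    lengths = word_lengths(text)
    return sum(lengths) / len(lengths) if lengths else 0.0

def length_histogram(text):
    """Count how many words there are of each length."""
    return Counter(word_lengths(text))
```

Plotting `length_histogram` for each language gives the kind of curve described above.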

For sentence make-up and grammar, it was difficult to find the nouns, adjectives, pronouns, etc. in the sentence.

Dutch does not follow the same rules as English, and some sentences fail to give the right answer. Hence, the features I selected are based on the most common words used in each language.

Check if the sentence contains “the”: the most common word in the English language.
Check if the sentence contains “het” or “de”: the Dutch equivalents of “the”.
Check if the sentence contains “and”.
Check if the sentence contains “ik”: the Dutch equivalent of “I”.
Check if the sentence contains “een”: the Dutch equivalent of “a”.
Check if the sentence contains “en”: the Dutch equivalent of “and”.
Check if the sentence contains “he” or “she”: used in most conversations and third-person sentences.
Check if the sentence contains “hij”, “ze”, or “zij”: the Dutch equivalents of “he” and “she”.
Check if the sentence contains “van”: the Dutch equivalent of “from”, “of”, or “by”.
Check if the sentence contains “a”: the English article “a”.
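
The checks above can be sketched as a function that turns a sentence into a list of boolean features. This is an illustration only: `FEATURE_WORDS` and `extract_features` are hypothetical names, and matching whole words case-insensitively (rather than substrings) is an assumption, not necessarily what the code in this repository does.

```python
import re

# One tuple per feature, in the order listed above (hypothetical layout).
FEATURE_WORDS = [
    ("the",),
    ("het", "de"),
    ("and",),
    ("ik",),
    ("een",),
    ("en",),
    ("he", "she"),
    ("hij", "ze", "zij"),
    ("van",),
    ("a",),
]

def extract_features(sentence):
    """Return one boolean per feature: True if any of its words occurs in the sentence."""
    # Assumption: compare lowercase whole words, so "de" does not match inside "under".
    words = set(re.findall(r"\w+", sentence.lower()))
    return [any(word in words for word in group) for group in FEATURE_WORDS]
```

For example, `extract_features("The cat and a dog")` sets only the “the”, “and”, and “a” features to `True`. A decision tree or a boosted stump can then split on these values.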

