language-classification

Example Collection:

The training data was extended by 60 lines, for a total of 70 example English and Dutch sentences (35 each).
The testing data is the original file downloaded from the website given in the lab guideline. More training data generally yields a more accurate prediction model, but since no shared data was available on the discussion post, I collected as many training examples as I could myself.

Features

Boolean: usage of the word “and” in daily English sentences
Boolean: usage of the word “en” in daily Dutch sentences
Boolean: usage of the word “the” in daily English sentences
Boolean: usage of the word “de” in daily Dutch sentences
Boolean: usage of the word “een” in daily Dutch sentences
Boolean: usage of the word “het” in daily Dutch sentences
Boolean: contains the substring “ij”, which is common in daily Dutch words
Range: average word length (words in Dutch tend to be longer than words in English)
Range: frequency of double vowels, i.e. two consecutive identical vowels (more frequent in Dutch than in English)
Range: frequency of double consonants, i.e. two consecutive identical consonants (more frequent in Dutch than in English)
Range: frequency of the letters “j”, “k”, “v”, and “z” in words (more frequent in Dutch than in English)

Decision Tree Learning

• A decision tree model was built from the training data of 70 example entries (sentences).
• Entropy is used to measure the impurity at each level of classification.
• Information gain is used to choose the best feature to split the entries on at each level.
• A maximum depth of 15 was set so that the tree can handle larger training and testing data. For my default training set, however, I found that a depth of 5 is enough to produce a good trained model.

Adaboost

• A boosted ensemble model was built from the same training data. Each entry (sentence) is assigned a weight, and the Adaboost algorithm adjusts the weight of each entry before moving on to the next stump.
• The maximum number of stumps was set to 5, which is enough for my default training data.
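
The two ideas described above, choosing a split by information gain and reweighting entries between stumps, can be sketched as follows. This is a textbook formulation written for illustration and may differ from what decision_tree.py, adaboost.py, and weights.py actually do. The function names `entropy`, `information_gain`, and `adaboost_reweight` are assumptions, and the gain function handles boolean features only.

```python
import math
from collections import Counter


def entropy(labels, weights=None):
    """Weighted Shannon entropy (in bits) of a list of class labels."""
    if weights is None:
        weights = [1.0] * len(labels)
    total = sum(weights)
    if total == 0:
        return 0.0
    mass = Counter()
    for label, w in zip(labels, weights):
        mass[label] += w
    return -sum((m / total) * math.log2(m / total) for m in mass.values() if m > 0)


def information_gain(feature_values, labels, weights=None):
    """Entropy reduction obtained by splitting on one boolean feature."""
    if weights is None:
        weights = [1.0] * len(labels)
    total = sum(weights)
    remainder = 0.0
    for branch in (True, False):
        idx = [i for i, v in enumerate(feature_values) if bool(v) == branch]
        branch_weight = sum(weights[i] for i in idx)
        if branch_weight > 0:
            remainder += (branch_weight / total) * entropy(
                [labels[i] for i in idx], [weights[i] for i in idx]
            )
    return entropy(labels, weights) - remainder


def adaboost_reweight(weights, predictions, labels):
    """One boosting round: return (normalized new weights, stump vote weight)."""
    total = sum(weights)
    error = sum(w for w, p, y in zip(weights, predictions, labels) if p != y) / total
    # Clamp so a perfect or fully wrong stump does not cause log(0) or division by zero.
    error = min(max(error, 1e-10), 1 - 1e-10)
    alpha = 0.5 * math.log((1 - error) / error)
    # Misclassified entries gain weight, correctly classified entries lose weight.
    updated = [
        w * math.exp(alpha if p != y else -alpha)
        for w, p, y in zip(weights, predictions, labels)
    ]
    norm = sum(updated)
    return [w / norm for w in updated], alpha


if __name__ == "__main__":
    labels = ["en", "en", "nl", "nl"]
    print(information_gain([True, True, False, False], labels))
    print(adaboost_reweight([0.25] * 4, ["en", "en", "nl", "en"], labels))
```

Passing the current entry weights to `information_gain` lets the same split criterion be reused when training each weighted stump.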

To run the program:

The default training and testing files are located in the data directory. The best decision tree model (tree.o) and the best Adaboost model (ensemble.o) were generated in the out directory. With the predict action, the program simply outputs the predicted label (en/nl) for each sentence in the test file.

• To train: python classify.py train
  P.S. full file paths are required when entering files.
  Example: classify.py train data\train.dat out\tree.o dt
• To predict: python classify.py predict

yxy6465/language-classification
