GitHub - xiaoyugan0418/Text-classification-with-Naive-Bayes

We use the popular Reuters-21578 collection of documents as our training dataset, which can be obtained from this site: http://archive.ics.uci.edu/ml/machine-learning-databases/reuters21578-mld/. The collection has a total of 123 categories. We build a Naïve Bayes classifier in the following steps:

1. Parse the XML documents to extract topics and their related content.
2. Tokenize and stem the documents.
3. Build a dictionary of all words (i.e., the vocabulary) in the collection and compute an inverse document frequency (IDF) for each term.
4. Vectorize the documents using TF-IDF scores.
5. Train the NB classifier.
6. Classify HTML documents.
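The tokenizing, IDF, TF-IDF, and training steps can be sketched on a toy corpus as follows. This is a minimal illustration using only the standard library, not the code in the repository's scripts. The four sample sentences, the crude plural-stripping "stemmer", the smoothed IDF formula, the Laplace smoothing constant `alpha`, and all function names (`tokenize`, `build_idf`, `tfidf`, `train_nb`, `predict`, `classify`) are assumptions made for the example. It also assumes a multinomial Naïve Bayes model in which TF-IDF weights stand in for raw term counts.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase, keep alphabetic tokens, and strip a trailing 's'.

    The suffix strip is a stand-in for a real stemmer (assumption).
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t[:-1] if t.endswith("s") and len(t) > 3 else t for t in tokens]


def build_idf(docs_tokens):
    """Smoothed IDF per term: log((1 + N) / (1 + df)) + 1 (assumed variant)."""
    n_docs = len(docs_tokens)
    df = Counter(term for tokens in docs_tokens for term in set(tokens))
    return {t: math.log((1 + n_docs) / (1 + d)) + 1.0 for t, d in df.items()}


def tfidf(tokens, idf):
    """Sparse TF-IDF vector as a dict; out-of-vocabulary terms are dropped."""
    counts = Counter(tokens)
    return {t: c * idf[t] for t, c in counts.items() if t in idf}


def train_nb(vectors, labels, vocab, alpha=1.0):
    """Multinomial NB with TF-IDF weights used as fractional counts."""
    class_docs = Counter(labels)
    weights = defaultdict(lambda: defaultdict(float))
    for vec, label in zip(vectors, labels):
        for term, w in vec.items():
            weights[label][term] += w
    model = {}
    for label, n_c in class_docs.items():
        total = sum(weights[label].values()) + alpha * len(vocab)
        log_like = {t: math.log((weights[label][t] + alpha) / total) for t in vocab}
        model[label] = (math.log(n_c / len(labels)), log_like)
    return model


def predict(model, vec):
    """Return the label with the highest log posterior score."""
    def score(label):
        log_prior, log_like = model[label]
        return log_prior + sum(w * log_like[t] for t, w in vec.items())
    return max(model, key=score)


# Toy training data (invented for illustration, not taken from Reuters-21578).
corpus = [
    ("Wheat and corn exports rose as grain harvests improved", "grain"),
    ("Farmers expect a record wheat crop and higher grain shipments", "grain"),
    ("Crude oil prices fell after OPEC raised oil production", "crude"),
    ("The refinery cut crude output as oil barrels piled up", "crude"),
]
docs_tokens = [tokenize(text) for text, _ in corpus]
labels = [label for _, label in corpus]
idf = build_idf(docs_tokens)
vectors = [tfidf(tokens, idf) for tokens in docs_tokens]
model = train_nb(vectors, labels, vocab=set(idf))


def classify(text):
    """Tokenize, vectorize, and classify a new piece of text."""
    return predict(model, tfidf(tokenize(text), idf))


print(classify("Crude oil production fell"))
print(classify("Record wheat harvest boosts grain exports"))
```

In practice the same pipeline would run over the parsed Reuters articles instead of the toy sentences, and a proper stemmer would replace the suffix strip. For the final step, text extracted from an HTML page would be passed to `classify` in the same way.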

Files:

- README.md
- step1parse.py
- step2trainNB.py
- step3urltest.py
