
NLP - A Naive Bayes text classifier applied to the MD&A section of a company's 10-Q.


jamespeterthornton/stockclassifier


stockclassifier

Classification: James Thornton and Dylan Hurwitz

Web Scraping: Thomas Thornton

This is a Naive Bayes text classifier. It classifies the Management's Discussion and Analysis (MD&A) section of a company's 10-Q, a quarterly report filed with the SEC.
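For readers unfamiliar with the technique, here is a minimal sketch of a multinomial Naive Bayes text classifier with add-one smoothing. It is not the code in this repository: the function names `train`, `log_scores`, and `classify`, the whitespace tokenizer, and the `BUY`/`SELL` labels in the usage note are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def train(docs):
    """Build count tables from a list of (text, label) pairs."""
    label_counts = Counter()            # number of documents per label
    word_counts = defaultdict(Counter)  # per-label word frequencies
    vocab = set()
    for text, label in docs:
        label_counts[label] += 1
        for word in text.lower().split():
            word_counts[label][word] += 1
            vocab.add(word)
    return label_counts, word_counts, vocab


def log_scores(model, text):
    """Return log P(label) + sum of log P(word | label) for each label."""
    label_counts, word_counts, vocab = model
    total_docs = sum(label_counts.values())
    scores = {}
    for label, n_docs in label_counts.items():
        score = math.log(n_docs / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocab:
                continue  # ignore words never seen in training
            # Add-one (Laplace) smoothing avoids log(0) for unseen pairs.
            count = word_counts[label][word] + 1
            score += math.log(count / (total_words + len(vocab)))
        scores[label] = score
    return scores


def classify(model, text):
    """Pick the label with the highest log score."""
    scores = log_scores(model, text)
    return max(scores, key=scores.get)
```

With a toy training set such as `[("revenue increased strong growth", "BUY"), ("revenue declined losses impairment", "SELL")]`, calling `classify(train(docs), "strong growth")` picks the label whose training text shares those words.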

To run the code, start the Python shell, import evaluation, and run eval("training_set.data"). This splits the training data (10-Qs for the companies in the Dow Jones Industrial Average) into five parts and uses cross-validation to gauge the classifier's accuracy: it trains on four parts, tests on the fifth, and rotates the test part until every part has been used for testing once. It then generates nine more splits, repeats the process on each, and returns the average accuracy across all of these results. The basic classifier generally reaches about 65% accuracy.
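The evaluation procedure described above amounts to ten repetitions of five-fold cross-validation. The sketch below shows one way to write it. It is not the contents of evaluation.py: the names `k_fold_accuracies` and `repeated_cv`, the `train_fn`/`classify_fn` callables, and the use of a seeded random shuffle to produce each split are assumptions.

```python
import random


def k_fold_accuracies(docs, train_fn, classify_fn, k=5, rng=None):
    """Shuffle docs into k folds; return the accuracy measured on each fold."""
    if len(docs) < k:
        raise ValueError("need at least k documents")
    rng = rng or random.Random()
    shuffled = list(docs)
    rng.shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]  # k near-equal parts
    accuracies = []
    for i, test_fold in enumerate(folds):
        # Train on every fold except the one held out for testing.
        train_docs = [d for j, fold in enumerate(folds) if j != i for d in fold]
        model = train_fn(train_docs)
        correct = sum(classify_fn(model, text) == label
                      for text, label in test_fold)
        accuracies.append(correct / len(test_fold))
    return accuracies


def repeated_cv(docs, train_fn, classify_fn, k=5, repeats=10, seed=0):
    """Average accuracy over `repeats` independent k-fold splits."""
    rng = random.Random(seed)
    results = []
    for _ in range(repeats):
        results.extend(k_fold_accuracies(docs, train_fn, classify_fn, k, rng))
    return sum(results) / len(results)
```

Here `docs` is a list of `(text, label)` pairs, `train_fn(docs)` returns a model, and `classify_fn(model, text)` returns a label. Averaging over several splits reduces the chance that one lucky or unlucky partition dominates the reported number.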

Note that holds are treated somewhat artificially: no documents are pre-classified as "HOLD". Instead, the classifier is allowed to "not bet" on reports that it is unsure of, and the hold classification exists for that purpose.
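One common way to let a classifier abstain is to compare its two best scores and return "HOLD" when they are too close. The sketch below illustrates that idea only. The repository may decide holds differently, and the function name `decide`, the `min_margin` parameter and its default, and the `BUY`/`SELL` labels are hypothetical.

```python
def decide(scores, min_margin=1.0):
    """Return the top-scoring label, or "HOLD" if the lead is too small.

    scores: dict mapping each label to a log-probability score.
    min_margin: assumed threshold on the gap between the top two scores.
    """
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    if len(ranked) < 2:
        return ranked[0][0]  # only one candidate label, nothing to compare
    (best_label, best), (_, runner_up) = ranked[0], ranked[1]
    if best - runner_up < min_margin:
        return "HOLD"  # not confident enough to bet either way
    return best_label
```

For example, `decide({"BUY": -10.0, "SELL": -10.4})` abstains because the gap of 0.4 is below the default margin, while `decide({"BUY": -10.0, "SELL": -13.0})` commits to `BUY`.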

Also, feel free to modify the list of tickers at the end of scraper.py and use it to generate new training sets.
