yewno project
- Create a pipeline to pull data from Gutenberg.org - STALLED
- Books are saved to a specific folder for now
- Use that pipeline to save a few books in CSV format - COMPLETE
- Expansion idea: Retrieve books based on more metadata (e.g. author, topic/subject)
- Database idea: Set up a database rather than saving books as CSV
- Preprocess the book data in the data folder - COMPLETE
- Save the preprocessed data - COMPLETE
- Expansion idea: More preprocessing steps to get cleaner text (e.g. stemming)
- Create an algorithm to detect the language of each sentence in a book - COMPLETE
- Save the detected language percentage per sentence, per book - COMPLETE
- Expansion idea: Add more language-detection methods and compare them against each other
- If crowd-sourced responses to a sentence correct its language and the agreement is "overwhelming", change the label from the detected language to the crowd-sourced language
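
A minimal sketch of the crowd-sourced relabeling rule in the last item. The notes do not define "overwhelming", so the function name `resolve_label` and the `threshold` and `min_votes` values are assumptions for illustration only, not part of the existing pipeline.

```python
from collections import Counter


def resolve_label(detected, crowd_votes, threshold=0.8, min_votes=5):
    """Return the language label to keep for one sentence.

    detected    -- language code produced by the detection algorithm
    crowd_votes -- list of language codes submitted by crowd responders
    threshold   -- assumed share of votes that counts as "overwhelming"
    min_votes   -- assumed minimum number of votes before overriding
    """
    # Too few responses: keep the detected label.
    if len(crowd_votes) < min_votes:
        return detected

    # Find the most common crowd answer and its share of all votes.
    top_language, top_count = Counter(crowd_votes).most_common(1)[0]
    share = top_count / len(crowd_votes)

    # Override only when the crowd disagrees with the detector
    # and the agreement among responders meets the threshold.
    if top_language != detected and share >= threshold:
        return top_language
    return detected
```

For example, `resolve_label("en", ["fr"] * 9 + ["en"])` returns `"fr"` because 9 of 10 votes agree, while `resolve_label("en", ["fr"] * 3)` keeps `"en"` because there are too few votes to override the detector.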