GitHub - omidrohanian/PLSA-Persian: Unsupervised text classification in Persian using Probabilistic Latent Semantic Analysis

Author: Omid Rohanian

This project uses Probabilistic Latent Semantic Analysis (PLSA) to group textual data into separate semantic categories. The corpus consists of 15 text documents in Persian, each of which maps to exactly one of three subjects: "15-day weight loss diet plan", "religion" and "chess".

The program classifies the documents into K separate groups (K is set manually to 3) and prints the most probable words in each category.
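As a rough illustration of the approach described above, here is a minimal pure-Python sketch of PLSA fitted with expectation-maximization. It is not the code in main.py: the function names `plsa` and `top_words`, the count-matrix input format, the iteration count, the seed, and the toy data are all assumptions made for this example.

```python
import random


def _normalize(row, eps=1e-12):
    # Turn non-negative weights into a probability distribution.
    # The tiny eps keeps the division safe if a row is all zeros.
    total = sum(row) + eps * len(row)
    return [(x + eps) / total for x in row]


def plsa(counts, n_topics, n_iter=50, seed=0):
    """Fit PLSA with EM (illustrative sketch).

    counts[d][w] is how often word w occurs in document d.
    Returns (p_z_d, p_w_z): P(topic | document) and P(word | topic).
    """
    rng = random.Random(seed)
    n_docs, n_words = len(counts), len(counts[0])
    # Random initialisation; every row is a valid distribution.
    p_z_d = [_normalize([rng.random() for _ in range(n_topics)])
             for _ in range(n_docs)]
    p_w_z = [_normalize([rng.random() for _ in range(n_words)])
             for _ in range(n_topics)]

    for _ in range(n_iter):
        new_z_d = [[0.0] * n_topics for _ in range(n_docs)]
        new_w_z = [[0.0] * n_words for _ in range(n_topics)]
        for d in range(n_docs):
            for w in range(n_words):
                n = counts[d][w]
                if n == 0:
                    continue
                # E-step: P(z | d, w) is proportional to P(z | d) * P(w | z).
                joint = [p_z_d[d][z] * p_w_z[z][w] for z in range(n_topics)]
                total = sum(joint)
                if total == 0.0:
                    continue
                for z in range(n_topics):
                    resp = n * joint[z] / total
                    # Accumulate expected counts for the M-step.
                    new_z_d[d][z] += resp
                    new_w_z[z][w] += resp
        # M-step: renormalise the expected counts.
        p_z_d = [_normalize(row) for row in new_z_d]
        p_w_z = [_normalize(row) for row in new_w_z]
    return p_z_d, p_w_z


def top_words(p_w_z, vocab, n=5):
    # For each topic, list the n words with the highest P(word | topic).
    result = []
    for row in p_w_z:
        ranked = sorted(range(len(vocab)), key=lambda w: row[w], reverse=True)
        result.append([vocab[w] for w in ranked[:n]])
    return result


if __name__ == "__main__":
    # Toy count matrix (made up): 4 documents over a 4-word vocabulary.
    toy_counts = [[3, 2, 0, 0], [2, 3, 0, 0], [0, 0, 3, 2], [0, 0, 2, 3]]
    toy_vocab = ["w0", "w1", "w2", "w3"]
    doc_topics, topic_words = plsa(toy_counts, n_topics=2)
    for k, words in enumerate(top_words(topic_words, toy_vocab, n=2)):
        print("topic", k, words)
```

Each document can then be assigned to the topic with the largest P(topic | document). EM only finds a local optimum, so results may vary with the seed.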

Tokenization and normalization of the text were done using the open-source NLP package Hazm. Stop words were removed in the preprocessing stage. The Persian stop words were taken from:

Kazem Taghva, Russell Beckley, Mohammad Sadeh (2003). A List of Farsi Stop words. ISRI Technical Report No. 2003-01, Information Science Research Institute, University of Nevada, Las Vegas.
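The preprocessing stage might look roughly like the sketch below. It is an assumption-laden illustration rather than the project's actual code: the helper names `preprocess` and `build_count_matrix` are hypothetical, and the defaults (no normalization, whitespace tokenization) are stand-ins so the example runs without Hazm. With Hazm installed, one would presumably pass `hazm.Normalizer().normalize` and `hazm.word_tokenize` instead.

```python
from collections import Counter


def preprocess(text, stop_words, normalize=lambda s: s, tokenize=str.split):
    # Normalize, tokenize, then drop stop words.
    # The default normalize/tokenize are placeholders, not Hazm.
    tokens = tokenize(normalize(text))
    return [t for t in tokens if t not in stop_words]


def build_count_matrix(docs_tokens):
    # Build a sorted vocabulary and a document-by-word count matrix.
    vocab = sorted({t for doc in docs_tokens for t in doc})
    index = {word: i for i, word in enumerate(vocab)}
    counts = []
    for doc in docs_tokens:
        row = [0] * len(vocab)
        for token, n in Counter(doc).items():
            row[index[token]] = n
        counts.append(row)
    return vocab, counts


if __name__ == "__main__":
    # Tiny made-up example: "و" ("and") is treated as a stop word.
    stop_words = {"و"}
    texts = ["شطرنج و دین", "شطرنج و شطرنج"]
    docs = [preprocess(t, stop_words) for t in texts]
    vocab, counts = build_count_matrix(docs)
    print(vocab)
    print(counts)
```

The resulting count matrix is the usual input to a topic model such as PLSA.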

I also borrowed code from a GitHub repository that appears to have since been removed. If you happen to know the original source, drop me a line and I'll add a proper reference.

Documentation (in Persian) is included for further clarification.

Repository contents:

- `__pycache__`
- `documents`
- `README.md`
- `documentation.pdf`
- `main.py`
- `test_results.txt`
