Skip to content

omidrohanian/PLSA-Persian

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

8 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Author: Omid Rohanian

This project uses Probabilistic Latent Semantic Analysis to categorize textual data into separate semantic groups. There are 15 random text documents in Persian, each can be uniquely mapped to a separate subject: "15-day weight loss diet plan", "religion" and "chess".

The program classifies documents into K separate groups (K is manually set to be 3) and prints the most probable words in each category.

Tokenization and Normalization of text was done using the open source NLP package Hazm. Stop words were deleted in the preprocessing stage. The Persian stop words were taken from here:

Kazem Taghva, Russell Beckley, Mohammad Sadeh(2003) A List of Farsi Stop words, ISRI Technical Report No. 2003-01 Information Science Research Institute University of Nevada, Las Vegas

I also borrowed code from a github repo that seems to have been removed for a while. If you happen to know the original source drop me a line and I'll give proper reference.

There is a documentation included (in Persian) for further clarification.

About

Unsupervised text classification in Persian using Probabilistic Latent Semantic Analysis

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages