Samialsayadi/TextClassification

A Fuzzy Declarative Approach to Classify Unlabeled Short Texts Based on Automatically Constructed WordNet Ontologies

This library classifies short texts using WordNet ontologies and fuzzy declarative modeling. It is geared towards semantic classification of corpora of relatively short documents, such as comments on social media or online product reviews. WordNet glosses and hyponyms are used to build the ontologies, and a fuzzy declarative method is then applied to classify the short documents.

The library consists of two main scripts to generate ontologies: aut_ontology_based_gloss.py and auto_ontology_based_hy.py.

Both scripts use the same pre-processing steps:

  • Tokenization using the NLTK library
from nltk.tokenize import word_tokenize
tokens=word_tokenize(token)
  • Removal of stopwords using an augmented version of the NLTK English stopwords corpus
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
  • Conversion of all plural noun forms (regular and irregular) into singular forms using the pattern library
from pattern.text.en import singularize

for w in tokens:
    # Skip stopwords and single-character tokens, then singularize the rest.
    if w not in stop_words and len(w) > 1:
        token = singularize(w)
  • Lemmatization using the NLTK WordNetLemmatizer
from nltk.stem.wordnet import WordNetLemmatizer
Lem = WordNetLemmatizer()
# verb: pos='v'
token = Lem.lemmatize(token, 'v')
# adjective: pos='a'
token = Lem.lemmatize(token, 'a')
# adverb: pos='r'
token = Lem.lemmatize(token, 'r')
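
The four steps above can be chained into one function. The following is only a minimal sketch, not the code from the repository scripts: the names `preprocess` and `nltk_components` are hypothetical, and the NLTK and pattern pieces are imported lazily so the pipeline itself can be tried with simple stand-ins.

```python
def preprocess(text, tokenize, stop_words, singularize, lemmatize):
    """Tokenize, drop stopwords and 1-character tokens, singularize, lemmatize."""
    result = []
    for w in tokenize(text):
        if w in stop_words or len(w) <= 1:
            continue
        token = singularize(w)
        # Lemmatize as verb, adjective, and adverb, in that order.
        for pos in ("v", "a", "r"):
            token = lemmatize(token, pos)
        result.append(token)
    return result


def nltk_components():
    """Return the NLTK/pattern pieces named in the steps above (hypothetical helper)."""
    from nltk.corpus import stopwords
    from nltk.stem.wordnet import WordNetLemmatizer
    from nltk.tokenize import word_tokenize
    from pattern.text.en import singularize

    lem = WordNetLemmatizer()
    return word_tokenize, set(stopwords.words("english")), singularize, lem.lemmatize


# Usage with the real libraries (requires the NLTK data to be downloaded):
#     tokens = preprocess("some short text", *nltk_components())
```

Passing the components as arguments keeps the order of the steps explicit and makes it easy to swap in a different stopword list.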

Generate ontologies based on WordNet definitions (glosses):

from nltk.corpus import wordnet as wn  
for label in [list1[Terms]]:
    for def_labels in wn.synsets(label):
        # get the gloss (definition) of each synset
        deftoken = def_labels.definition()

Generate ontologies based on WordNet hyponyms:

from nltk.corpus import wordnet as wn  

for def_label in wn.synsets(label):
    # collect the lemma names of all direct and indirect hyponyms
    tkn = list(set([w for s in def_label.closure(lambda s: s.hyponyms()) for w in s.lemma_names()]))
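
The `closure` call above follows the hyponym relation transitively, so it yields indirect hyponyms as well as direct ones. The sketch below illustrates that idea on a plain dictionary, without WordNet. The function `hyponym_closure` and the toy taxonomy are hypothetical and exist only for illustration.

```python
from collections import deque


def hyponym_closure(term, hyponyms):
    """Return every term reachable from `term` by following hyponym links."""
    seen = set()
    queue = deque(hyponyms.get(term, []))
    while queue:
        current = queue.popleft()
        if current in seen:
            continue
        seen.add(current)
        # Enqueue the hyponyms of the hyponym, so indirect ones are found too.
        queue.extend(hyponyms.get(current, []))
    return seen


# Toy taxonomy (made up for this example, not taken from WordNet).
toy_hyponyms = {
    "energy": ["solar_energy", "wind_energy"],
    "solar_energy": ["photovoltaics"],
}
print(sorted(hyponym_closure("energy", toy_hyponyms)))
```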

You can compute the similarity degree between a keyword and the terms from its definition or hyponyms with the following code:

from nltk.corpus import wordnet as wn  

wordFromList1 = wn.synsets(label)
wordFromList2 = wn.synsets(token)
if wordFromList1 and wordFromList2: 
    degree = wordFromList1[0].path_similarity(wordFromList2[0])
    if degree is not None and degree > 0.01:
        # save the ontology entry to a text file (data is an open, writable file object)
        print(label, '~', token, '=', degree, '.', file=data)
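
The filtering and writing step can be separated from the WordNet lookup. The sketch below assumes the similarity degrees have already been computed and only reproduces the threshold check and the output format of the `print` call above. The name `write_ontology` and the sample pairs are hypothetical.

```python
import io


def write_ontology(pairs, out, threshold=0.01):
    """Write one 'label ~ token = degree .' line per pair above the threshold.

    `pairs` is an iterable of (label, token, degree) tuples; a degree of None
    (no similarity available) is skipped. Returns the number of lines written.
    """
    written = 0
    for label, token, degree in pairs:
        if degree is not None and degree > threshold:
            print(label, "~", token, "=", degree, ".", file=out)
            written += 1
    return written


# Example with made-up degrees, written to an in-memory buffer.
buffer = io.StringIO()
write_ontology([("oil", "petroleum", 0.5), ("oil", "cat", None)], buffer)
print(buffer.getvalue(), end="")
```

Any writable text file object can be passed in place of the in-memory buffer.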

Classification method using the Bousi~Prolog system

To perform an experiment, launch the goal experiment(Measure, FileName, CategoryList, Process). The "Measure" parameter can be one of the following constants: [ont, path, wup, lch, res, jcn, jin, yarm]. The constant "ont" means that the experiment uses a predefined ontology, which you must load before launching the predicate "experiment/4". The other constants are acronyms of standard linguistic similarity measures, which are computed through the connection to the Prolog version of the WordNet database. This connection is established automatically by the cataloging.bpl program and does not require loading an ontology.

The remaining parameters are:

  • FileName: the name or path of a file containing the documents to be classified (stored in the SMART standard format).
  • CategoryList: a list of categories related to the loaded ontology.
  • Process: a compatibility degree operator (e.g., sUm, wa, mIn, mAx).
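
When many experiments are run, the goal text can be assembled programmatically and then pasted at the BPL> prompt. The helper below is a hypothetical convenience, not part of Bousi~Prolog or of this repository. It only checks the measure against the constants listed above and formats the four arguments.

```python
# Measure constants as listed in this README.
KNOWN_MEASURES = ("ont", "path", "wup", "lch", "res", "jcn", "jin", "yarm")


def build_experiment_goal(measure, file_name, categories, process):
    """Return the text of an experiment/4 goal for the given arguments."""
    if measure not in KNOWN_MEASURES:
        raise ValueError(f"unknown measure: {measure!r}")
    category_list = ", ".join(categories)
    # The file name is quoted because it contains '/' characters.
    return f"experiment({measure}, '{file_name}', [{category_list}], {process})"


print(build_experiment_goal(
    "ont", "finalexperiments/odp/odp", ["renewable", "electricity"], "sUm"
))
```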

To install the Bousi~Prolog system, follow these steps:

Download the Bousi~Prolog system, then follow the installation instructions in the Bousi~Prolog ON-LINE MANUAL (https://dectau.uclm.es/bousi-prolog/2019/02/26/on-line-manual).

To run the Bousi~Prolog system and reproduce the experiments, follow these steps:

  1. After installing, follow the instructions in the [Bousi~Prolog ON-LINE MANUAL](https://dectau.uclm.es/bousi-prolog/2019/02/26/on-line-manual) to launch the Bousi~Prolog system.
    >> bousi
    
    Universidad de Castilla - La Mancha
    BOUSI~PROLOG (Version Devel ~ January, 22nd 2020)

    Welcome to Bousi~Prolog, a fuzzy logic programming system created by
    Juan Gallardo-Casero and Pascual Julian-Iranzo. Fernando Saenz-Perez
    (UCM) contributed to this version. This software is for research and
    educational purposes only, and it is distributed with NO WARRANTY.
    Please visit our website for the latest news on Bousi~Prolog:
                https://dectau.uclm.es/bousi-prolog
    


  2. Go to the working directory: the one that contains the cataloging.bpl program and a 'finalexperiments' directory with the document collections and ontologies.
    BPL> cd Downloads/A Fuzzy to Classify based ontologies
    
  3. Load the cataloging.bpl program.
    BPL> ld cataloging
    Parsing and translating 'cataloging.bpl'...
    'cataloging.tpl' is being loaded...
    Program loaded!
    
  4. Load an ontology.
    BPL> ld -o 'finalexperiments/odp/wikipedia.ont'
    Parsing and translating 'cataloging.bpl' using ontology 'wikipedia.ont'...
    'cataloging-wikipedia.tpl' is being loaded...
    Ontology loaded!
    
  5. Reproduce the experiments.
    BPL> experiment(ont, 'finalexperiments/odp/odp',[renewable, electricity, oil_gas, utilities, fuel_cells, hydrogen, consulting, employment, associations, management], sUm)
    
The results:
Processing file, finalexperiments/odp/odp, for category: renewable.  This may take a while ...
End of file reached.
Processing file, finalexperiments/odp/odp, for category: electricity.  This may take a while ...
End of file reached.
Processing file, finalexperiments/odp/odp, for category: oil_gas.  This may take a while ...
End of file reached.
Processing file, finalexperiments/odp/odp, for category: utilities.  This may take a while ...
End of file reached.
Processing file, finalexperiments/odp/odp, for category: fuel_cells.  This may take a while ...
End of file reached.
Processing file, finalexperiments/odp/odp, for category: hydrogen.  This may take a while ...
End of file reached.
Processing file, finalexperiments/odp/odp, for category: consulting.  This may take a while ...
End of file reached.
Processing file, finalexperiments/odp/odp, for category: employment.  This may take a while ...
End of file reached.
Processing file, finalexperiments/odp/odp, for category: associations.  This may take a while ...
End of file reached.
Processing file, finalexperiments/odp/odp, for category: management.  This may take a while ...
End of file reached.
Reading file, finalexperiments/odp/odp.exp with an expert classification...
.............................................................................
............ ABSOLUTE RESULTS ...............................................
 Positive classifications:       88
 Negative classifications:       27
 Undefined classifications:      0
 Total classifications:          153
 Total Expert classifications:   115
 Total positive classifications: 88
.............................................................................
............ PERCENT RESULTS ................................................

 Positive classifications (w.r.t. the total num. of documents): 76.52173913043478
 
 Negative classifications (w.r.t. the total num. of documents): 23.47826086956522
 
 Undefined classifications (w.r.t. the total num. of documents): 0
 
 Precision (w.r.t. the set of categories): 57.51633986928104
 
 Recall (w.r.t. the set of categories):    76.52173913043478
 
 F measure:                                      :  65.67164179104478
 
.............................................................................
true .
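
The percentages in this output are consistent with the usual definitions of precision, recall, and F measure applied to the absolute counts. The sketch below recomputes them from the three counts shown above. The function name `summarize` is hypothetical, and the formulas are an assumption about how cataloging.bpl derives its figures.

```python
def summarize(positive, total_classifications, expert_classifications):
    """Return (precision, recall, f_measure) as percentages."""
    precision = positive / total_classifications
    recall = positive / expert_classifications
    # F measure: harmonic mean of precision and recall.
    f_measure = 2 * precision * recall / (precision + recall)
    return 100 * precision, 100 * recall, 100 * f_measure


# Counts taken from the ABSOLUTE RESULTS block above.
precision, recall, f_measure = summarize(88, 153, 115)
print(f"Precision: {precision:.2f}")
print(f"Recall:    {recall:.2f}")
print(f"F measure: {f_measure:.2f}")
```

Here precision is 88/153, recall is 88/115, and the F measure reduces to 2·88/(153+115).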

Dependencies
