GitHub - vwoloszyn/pylinguistics: Open source readability metrics for Portuguese and English.

Pylinguistics [English and Portuguese]

Readability assessment is an important step in many NLP tasks, for instance automatic text simplification. In this project, we've designed an open source library for computing a set of readability metrics in Portuguese and English.


Metrics

Word count
Sentence count
Words per sentence
Syllable count
Average of syllables per word
Adjective Incidence
Noun Incidence
Verb Incidence
Adverbs Incidence
Pronoun Incidence
Content words Incidence
Functional word Incidence
Lexical Diversity
Content word Diversity
Logic Negation Incidence
Logic If Incidence
Logic Or Incidence
Logic And Incidence
Logic Operators Incidence
Connective Incidence
Connective Additive Incidence
Connective Logic Incidence
Connective Temporal Incidence
Connective Causal Incidence

... and more
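The "Incidence" and "Diversity" metrics can be read as simple ratios. The sketch below is an illustration only, not the library's code: it assumes incidence means occurrences per 1,000 words, which is inferred from the sample output in the usage section (250.0 for an 8-word sentence corresponds to 2 occurrences), and it assumes diversity is a plain type-token ratio. The function names `incidence` and `type_token_ratio` are hypothetical.

```python
def incidence(count, word_count, per=1000):
    """Occurrences per `per` words; 0.0 for an empty text.

    Assumption: the per-1,000-words scale is inferred from the sample
    output, not taken from the Pylinguistics source.
    """
    if word_count == 0:
        return 0.0
    return count * per / word_count


def type_token_ratio(tokens):
    """Distinct tokens divided by total tokens, ignoring case.

    Assumption: a generic lexical-diversity measure; the library may
    tokenize or normalize differently.
    """
    if not tokens:
        return 0.0
    lowered = [t.lower() for t in tokens]
    return len(set(lowered)) / len(lowered)


if __name__ == "__main__":
    print(incidence(2, 8))
    print(type_token_ratio(["a", "b", "a", "c"]))
```

Normalizing by text length in this way is what makes values comparable between a short sentence and a long article.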

Case Study: Intelligibility of Scientific Journalism

To illustrate the use of our tool, we apply it to a real-world problem: the complexity and intelligibility of scientific journalism. Few computational linguistic studies have examined the textual makeup of this genre, particularly the stylistic elements that characterize it. A thorough description of scientific journalism matters for many of the core problems that computational linguists are concerned with. Parsing accuracy could be increased by taking genre into account; for instance, certain object-less constructions occur only in recipes in English. The same holds for POS tagging: the frequency of "trend" used as a verb in the Journal of Commerce is 35 times higher than in Sociological Abstracts. In information retrieval, genre classification could let users sort search results according to their immediate interests, for example scholarly articles about supercolliders, novels about the French Revolution, and so forth.

In this study, we compared two corpora: Pesquisa Fapesp, a Brazilian specialized science magazine; and Folha de Sao Paulo (FSP), a Brazilian newspaper aimed at the general public. Because the two are geared towards different audiences, they employ different vocabularies and textual structures, which can be classified into different levels of complexity.

We then assessed the predictive power of our features using three feature selection algorithms commonly used for text categorization: Information Gain, Gain Ratio and Chi-square. Figure 1 shows the performance of the SVM classifier as the number of features selected by each method varies. With only 2 features, the genre can already be predicted with over 93% accuracy, and with 7 metrics the classifier reaches its best result (97%). Finding a small subset of predictors is important to avoid overfitting.
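As an illustration of one of the selection criteria named above, here is a minimal sketch of Information Gain for discrete features, written with the standard library only. It is not the pipeline used in the study: the data is a made-up toy example, the helper names are hypothetical, continuous metrics would need to be binned first, and the SVM step is left out.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def information_gain(feature_values, labels):
    """Entropy of the labels minus their entropy after splitting on the feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


def top_k_features(columns, labels, k):
    """Names of the k features with the highest information gain."""
    scores = {name: information_gain(values, labels)
              for name, values in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]


if __name__ == "__main__":
    # Toy data (hypothetical): four documents, two binary features.
    labels = ["magazine", "magazine", "newspaper", "newspaper"]
    columns = {
        "long_sentences": [1, 1, 0, 0],  # separates the two classes
        "has_negation": [1, 0, 1, 0],    # carries no class information
    }
    print(top_k_features(columns, labels, 1))
```

Ranking features this way and keeping only the top few is one route to the small predictor subsets discussed above; Gain Ratio and Chi-square would replace only the scoring function.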

Basic Usage

>>> import Pylinguistics as pl
>>> objpl=pl.text('Ia bem em matemática, porém reprovou em física.')
>>> objpl.setLanguage("pt-br")
>>> # compute all available metrics for the text
>>> objpl.getFeatures()
{'ConnectiveAdditiveIncidence': 125.0, 'redability': 66.6, 'word_count': 8, 'ConnectiveLogicIncidence': 0.0, 'syllable_count': 17, 'avg_word_per_sentence': 8.0, 'LogicIfIncidence': 0.0, 'LogicAndIncidence': 0.0, 'ContentDiversty': 1.0, 'pronIncidence': 0.0, 'LogicOperatorsIncidence': 0.0, 'verbIncidence': 250.0, 'functionalIncidence': 375.0, 'nounIncidence': 250.0, 'LogicOrIncidence': 0.0, 'adjectiveIncidence': 0.0, 'LogicNegationIncidence': 0.0, 'contentIncidence': 625.0, 'ConnectiveIncidence': 125.0, 'avg_syllables_per_word': 2.125, 'ConnectiveTemporalIncidence': 0.0, 'sentence_count': 1, 'ConnectiveCasualIncidence': 0.0, 'advIncidence': 125.0, 'LexicalDiversty': 0.9}
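The returned value is an ordinary dictionary, so it can be inspected with plain Python. The sketch below copies a few entries from the output above into a literal (so it runs without Pylinguistics installed), checks that the two averages agree with the raw counts, and lists the non-zero incidence metrics. The helper name `nonzero_incidences` is hypothetical.

```python
# A subset of the dictionary shown above, copied verbatim.
features = {
    "word_count": 8,
    "sentence_count": 1,
    "syllable_count": 17,
    "avg_word_per_sentence": 8.0,
    "avg_syllables_per_word": 2.125,
    "verbIncidence": 250.0,
    "nounIncidence": 250.0,
    "advIncidence": 125.0,
    "adjectiveIncidence": 0.0,
    "pronIncidence": 0.0,
}


def nonzero_incidences(feats):
    """Return (name, value) pairs for non-zero incidence metrics, highest first."""
    nonzero = [(k, v) for k, v in feats.items()
               if k.endswith("Incidence") and v > 0]
    return sorted(nonzero, key=lambda kv: kv[1], reverse=True)


# The averages in the sample output match the raw counts.
assert features["avg_word_per_sentence"] == (
    features["word_count"] / features["sentence_count"])
assert features["avg_syllables_per_word"] == (
    features["syllable_count"] / features["word_count"])

if __name__ == "__main__":
    for name, value in nonzero_incidences(features):
        print(name, value)
```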

Dependencies

Pylinguistics also requires extra resources: NLTK and nlpnet. Additionally, NLTK needs some extra downloads. After installing it, call:

>>> import nltk
>>> nltk.download()

Try

You can also try pylinguistics yourself without any installation: http://app.mybinder.org/1746087056/notebooks/pylinguistics_test.ipynb

Install

(Not working yet; we'll do it ASAP.)

pip install git+git://github.com/vwoloszyn/pylinguistics.git

Publications

http://www.fsma.edu.br/si/edicao18/FSMA_SI_2016_2_Principal_2_en.html

https://www.lume.ufrgs.br/bitstream/handle/10183/147640/000999695.pdf?sequence=1

How to cite

@Article{Castilhos2016,
  author  = {Castilhos, S. and Woloszyn, V. and Barno, D. and Wives, L. K.},
  title   = {Pylinguistics: an open source library for readability assessment of texts written in Portuguese},
  journal = {Revista de Sistemas de Informação da FSMA},
  year    = {2016},
  volume  = {18},
  issn    = {1983-5604},
}
