JordiCarreraVentura/textnorm

Summary

This repository contains Textnorm (after "[text] [norm]alization"), a Python class for detecting multiword expressions over a large corpus of raw free text.

Description

A multiword expression is a sequence of words that shows a significantly high statistical association, that is, words that appear together much more often than chance would predict. Multiwords are a superset of compounds (e.g. chicken soup), idioms (kick the bucket), phrasal verbs (come up with), collocations (extraordinary circumstances versus uncommon circumstances: both combinations are largely semantically equivalent, yet the first one is usually preferred), and standard multiword entities (Barack Obama, Barack H. Obama, Obama, Mr. Barack Hussein Obama and President Obama all behave like single units despite the white space).
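One common way to quantify "appearing together more often than chance" is pointwise mutual information (PMI). The sketch below is only an illustration of that idea on a toy sentence; the source does not say which association measure Textnorm uses, and the function name `bigram_pmi` is hypothetical.

```python
import math
from collections import Counter

def bigram_pmi(tokens):
    """Return {(w1, w2): PMI in bits} for every adjacent word pair in `tokens`."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    scores = {}
    for (w1, w2), count in bigrams.items():
        p_pair = count / n_bi                                    # observed probability
        p_indep = (unigrams[w1] / n_uni) * (unigrams[w2] / n_uni)  # expected if independent
        scores[(w1, w2)] = math.log2(p_pair / p_indep)
    return scores

# Toy corpus (invented for illustration only).
tokens = "chicken soup is good and chicken soup is cheap and soup is hot".split()
scores = bigram_pmi(tokens)
print(scores[("chicken", "soup")] > scores[("and", "soup")])  # True
```

A higher PMI means the pair co-occurs more often than its individual word frequencies would predict, which is why "chicken soup" outranks "and soup" here.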

In many cases, word tokenization (naïvely performed by splitting at white spaces) results in linguistically incorrect statistics. For example, when calculating the most frequent words in a text, naïve word tokenization will yield Barack as one word with a certain frequency and Obama as another word with a frequency very close to that of the first (but not necessarily the same, because the two words do not always occur together). What we want instead is a single token consisting of both words, with a single frequency count. Ultimately, this can have significant consequences for any statistical approach that relies on the independence assumption, such as Naïve Bayes.
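The effect on frequency counts can be seen in a few lines of Python. This is a minimal illustration on an invented sentence, with the multiword joined by hand rather than detected by Textnorm.

```python
from collections import Counter

# Invented example text.
text = "Barack Obama met reporters . Obama spoke briefly . Barack Obama left ."

# Naive tokenization: split at white spaces.
naive = Counter(text.split())

# Same text with the multiword joined into a single token beforehand.
joined = Counter(text.replace("Barack Obama", "Barack_Obama").split())

print(naive["Barack"], naive["Obama"])          # 2 3
print(joined["Barack_Obama"], joined["Obama"])  # 2 1
```

With naive tokenization the two words get separate, slightly different counts; after joining, the full name has a single count and only the standalone use of Obama is counted separately.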

Put another way, the goal of this library is to minimize the discrepancy between the following two claims:

i) white spaces separate distributionally distinct linguistic units (our goal)
ii) white spaces separate graphical words (default situation in many data pipelines)

In the standard written form of most languages, the cases covered by claim ii) are only a subset of those covered by claim i), and the difference amounts to errors in linguistic analysis that translate into suboptimal statistical language models.

Textnorm both detects likely multiword candidates and maps them as annotations onto the original text, returning a copy of the input in which every multiword is marked as a sequence of words connected with underscores. Examples are provided in the Samples section at the bottom.
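The annotation step alone (not the detection) can be sketched as follows, assuming the list of multiwords is already known. The function `mark_multiwords` is hypothetical and is not part of Textnorm's code.

```python
import re

def mark_multiwords(text, multiwords):
    """Return `text` with each known multiword joined by underscores."""
    # Try longer expressions first so "Wall Street Journal" wins over "Wall Street".
    ordered = sorted(multiwords, key=len, reverse=True)
    pattern = re.compile(r"\b(?:" + "|".join(re.escape(m) for m in ordered) + r")\b")
    return pattern.sub(lambda match: match.group(0).replace(" ", "_"), text)

annotated = mark_multiwords(
    "He reads the Wall Street Journal on Wall Street .",
    ["Wall Street", "Wall Street Journal"],
)
print(annotated)  # He reads the Wall_Street_Journal on Wall_Street .
```

Sorting the alternatives by length matters because a regular expression alternation takes the first branch that matches, not the longest one.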

Performance

Speed

Compared with NLTK's built-in collocation extraction classes and methods [1], Textnorm runs in the same range over small datasets (~1 MB corpora), although somewhat slower in the measurement below. Over larger datasets (~36 MB corpora and above), Textnorm is about 7x faster than NLTK:

Running time (in seconds)

corpus size   corpus                  NLTK   Textnorm
1 MB          Paul Graham's essays       6         15
36.6 MB       Blog posts               732        109

Note that, in all cases, Textnorm's running time also includes returning the normalized input, an additional part of the process that NLTK does not handle by default.

Parameters

Textnorm is designed to work without parameterization (although all parameters can still be modified at invocation time if necessary). NLTK, by contrast, returns a list of collocations that, in a typical scenario, is a ranked list (sorted by relevance) of all possible n-grams. Further post-processing (sometimes involving non-trivial steps) must then be applied to filter out irrelevant candidates. For instance, below are the collocations ranked 490th to 500th by NLTK over the corpus of Paul Graham's essays included in this repository:

[(u'other', u'direction'), (u'smarter', u'than'), (u'better', u'off'), (u'most', u'productive'),
 (u'same', u'thing'), (u'anyone', u'who'), (u'get', u'paid'), (u'for', u'example'),
 (u'power', u'between'), (u'quite', u'different')]

Over the same data, and with its default parameters (no manual parameterization), Textnorm returned the multiwords in sublist (a) and did not return any of those in sublist (b):

(a)	better_off, same_thing, for_example, quite_different,
	transmission_of_[power_between]_generations

(b)	[(u'other', u'direction'), (u'smarter', u'than'), (u'most', u'productive'), (u'anyone', u'who'),
	 (u'get', u'paid')]

NOTE: Observe that power between was extracted by Textnorm as one of the items in a longer sequence of several consecutive bigrams, up to the highest-order gram they all belong to (transmission of power between generations).
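The idea of extending consecutive bigrams up to the longest gram they all belong to can be sketched as below. This is an assumption about the general technique, not a description of Textnorm's internals; `chain_bigrams` and the `accepted` set are hypothetical.

```python
def chain_bigrams(tokens, accepted):
    """Join maximal runs of tokens whose adjacent pairs are all in `accepted`."""
    if not tokens:
        return []
    merged = []
    run = [tokens[0]]
    for prev, word in zip(tokens, tokens[1:]):
        if (prev, word) in accepted:
            run.append(word)                 # the pair extends the current multiword
        else:
            merged.append("_".join(run))     # close the current run
            run = [word]
    merged.append("_".join(run))
    return merged

# Hypothetical set of bigrams that passed the association test.
accepted = {
    ("transmission", "of"), ("of", "power"),
    ("power", "between"), ("between", "generations"),
}
tokens = "the transmission of power between generations matters".split()
print(chain_bigrams(tokens, accepted))
# ['the', 'transmission_of_power_between_generations', 'matters']
```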

Crucially, the expressions in (a) are provably legitimate multiword units (they all behave as unitary phrases from a distributional point of view) whereas some of those in (b) would generally be regarded as subunits of larger units, e.g., smarter than N, anyone who V (with the only exception of get paid).

Usage

Running the script with default parameters

python textnorm.py -i PATH/TO/INPUT/FILE

By default, the output is stored in PATH/TO/INPUT/FILE.textnorm.out.txt (the input path with ".textnorm.out.txt" appended). Alternatively, a different output location can be specified with the flag "-o" at invocation, as shown below:

python textnorm.py -i PATH/TO/INPUT/FILE -o PATH/TO/OUTPUT/FILE

When invoking Textnorm in this way, the system will auto-configure its parameters following a small set of statistical assumptions that usually hold for most natural language text across a variety of settings. During testing, the default parameters have provided (subjectively) good results with inputs as diverse as e-commerce titles and blog posts, although substantial fluctuations can be expected for varying sample sizes and different degrees of text naturalness.

Description of parameters and advanced settings

flag      description                       format           required/optional  default
-i        input file                        string           required           None
-o        output file                       string           optional           Input file + ".textnorm.out.txt"
-t        temporary file                    string           optional           "/tmp/textnorm.main.temp"
-n        order of grams [3]                int              optional           5
--flush   gram flushing ratio [5]           int:int          optional           1:200000
--smooth  smoothing ratio [6]               float or 'auto'  optional           'auto'
--silent  disables messages on stdout [7]   None             optional           false
--ndocs   To be filled
--maxf    To be filled
--minf    To be filled
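For readers who want to call the script from their own tooling, the documented flags could be mirrored with `argparse` roughly as follows. This is a sketch built only from the table above, not the parser actually used in textnorm.py; the helper names and `dest` names are hypothetical, and the undocumented flags (--ndocs, --maxf, --minf) are left out.

```python
import argparse

def parse_flush(value):
    """Parse a flushing ratio such as '1:200000' into the pair (x, y)."""
    x, y = value.split(":")
    return int(x), int(y)

def parse_smooth(value):
    """Accept either the string 'auto' or a float."""
    return value if value == "auto" else float(value)

def parse_args(argv):
    parser = argparse.ArgumentParser(prog="textnorm.py")
    parser.add_argument("-i", dest="input", required=True)
    parser.add_argument("-o", dest="output", default=None)
    parser.add_argument("-t", dest="temp", default="/tmp/textnorm.main.temp")
    parser.add_argument("-n", dest="order", type=int, default=5)
    parser.add_argument("--flush", type=parse_flush, default=(1, 200000))
    parser.add_argument("--smooth", type=parse_smooth, default="auto")
    parser.add_argument("--silent", action="store_true")
    args = parser.parse_args(argv)
    if args.output is None:
        # Documented default: input path plus a fixed suffix.
        args.output = args.input + ".textnorm.out.txt"
    return args

args = parse_args(["-i", "corpus.txt", "--flush", "2:1000"])
print(args.output)  # corpus.txt.textnorm.out.txt
print(args.flush)   # (2, 1000)
```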

Notes

[1] http://www.nltk.org/howto/collocations.html

[2] https://en.wikipedia.org/wiki/Function_word

[3] Order n of the [n]-grams to be used in the system's calculations.

[4] Number k of top most frequent words to be disregarded as high frequency noise by the system. It is intended to prevent phenomena such as function words [2] from interfering with the analysis. If a floating point number is provided, the value for this parameter will be interpreted as a ratio over the total number of documents.

[5] Indicates a ratio x:y specifying the x minimum number of times any gram must appear for every y documents processed. Any gram with a frequency lower than x times over y documents will be deleted upon reaching y documents since the time when that gram was added. Any deleted gram may be added again later on.

[6] Specifies a ratio r (where 0 <= r <= 1.0), computed over the total frequency of any multiword, at or above which any adjacent function word [2] will be attached to the multiword. For example: the should not be added to most multiwords but in cases such as The Wall Street Journal, it is actually part of the multiword. Since the ratio between the frequency of The Wall Street Journal (as a multiword) and the frequency of Wall Street Journal will be nearly 1.0 in most standard datasets, a value equal to or lower than 1.0 for this parameter will result in The being merged with Wall Street Journal, yielding The Wall Street Journal as the final multiword. For problematic cases, a value of 1.1 will effectively deactivate this type of smoothing.

[7] NOTE: Not implemented yet.
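The two ratio parameters described in notes [5] and [6] can be illustrated with a small sketch. It rests on stated assumptions: the flushing ratio x:y is read as a rate (a gram needs at least x occurrences for every full block of y documents since it was added), and the smoothing test compares the frequency of the extended multiword with that of the bare multiword. The class and function names are hypothetical, the frequencies are invented, and none of this is taken from Textnorm's code.

```python
class FlushingCounter:
    """Gram counter that drops grams seen fewer than x times per y documents."""

    def __init__(self, min_count=1, window=200000):
        self.min_count = min_count  # x in the ratio x:y
        self.window = window        # y in the ratio x:y
        self.ndocs = 0
        self.counts = {}            # gram -> frequency since it was (re)added
        self.added_at = {}          # gram -> document index when it was (re)added

    def add_document(self, grams):
        self.ndocs += 1
        for gram in grams:
            if gram not in self.counts:   # new, or re-added after deletion
                self.counts[gram] = 0
                self.added_at[gram] = self.ndocs
            self.counts[gram] += 1
        self._flush()

    def _flush(self):
        # Linear scan per document: fine for a sketch, too slow for a real corpus.
        for gram in list(self.counts):
            age = self.ndocs - self.added_at[gram]
            required = self.min_count * (age // self.window)
            if self.counts[gram] < required:
                del self.counts[gram]
                del self.added_at[gram]


def attach_function_word(function_word, multiword, freq, ratio):
    """Return the multiword, extended with the function word if the ratio test passes."""
    extended = function_word + " " + multiword
    base = freq.get(multiword, 0)
    if base > 0 and freq.get(extended, 0) / base >= ratio:
        return extended
    return multiword


counter = FlushingCounter(min_count=2, window=3)
for doc in (["a b", "c d"], ["a b"], ["x y"], ["x y"]):
    counter.add_document(doc)
kept = sorted(counter.counts)  # "c d" had one occurrence when its window closed

freq = {"Wall Street Journal": 100, "The Wall Street Journal": 97}  # invented counts
merged = attach_function_word("The", "Wall Street Journal", freq, ratio=0.9)
unmerged = attach_function_word("The", "Wall Street Journal", freq, ratio=1.1)
```

Because the bare multiword is counted every time the extended one occurs, the ratio can never exceed 1.0, which is why a value of 1.1 switches the attachment off.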

Samples

The only person exempted from that restriction is the American ambassador to Iraq, Ryan_Crocker, who can discuss Iraq-related issues with Iranian officials on a_regular_basis, according to a State_Department_official in Washington who spoke_on_condition_of_anonymity.Ambassador Khalilzad aroused the ire of Secretary_Rice who is reportedly upset that such a high ranking American official would participate in the same forum as the Iranian_foreign_minister. Mr. Khalilzad did not stray from American talking points at the forum. But Powerline is reporting that the moderator of the panel_discussion, the head_of_the_International Crisis Group Gareth Evans, insulted former UN_Ambassador John Bolton.


"Revolutionary_Armed_Forces_of_Colombia" and their mission of violence:"Founded in 1964, the FARC is a self-proclaimed communist and revolutionary guerrilla organization. They claim_to_represent the poor in their struggle against the country's wealthier classes, striving to seize power through armed revolution. These declarations notwithstanding, however, the group has largely abandoned its political agenda, and the FARC are now merely a drug_trafficking and terrorist group with complete_disregard for human_rights_and_international humanitarian law. Since the late 1980s, the Colombian government has repeatedly attempted to negotiate a solution and peace_settlement, without success. Directly_or_indirectly, all Colombians, including those of us here in Princeton, have been affected by their inhumane actions."My article from 2006 on the general topic can be accessed via this link.Ari J. Kaufman


On Super_Tuesday, 22 states and a couple territories with a combined 1,688 pledged_delegates will hold nominating_contests. From this point, quick math shows that after Super_Tuesday, only 1,428 pledged_delegates will still be available. Now, here is where the problem shows up. According to current_polling averages, the largest_possible victory for either candidate on Super_Tuesday will be Clinton 889 pledged_delegates, to 799 pledged_delegates for Obama. (In all likelihood, the winning margin will be lower than this, but using these numbers helps emphasize the seriousness_of_the_situation.) Power advised Obama on foreign policy, having spent her career detailing genocides and international responses to them, including a Pulitzer_Prize-winning book_on_the_subject.


A new video has Barack Obama speaking in 2007 about the burgeoning crisis in the financial_markets, but focusing on accounting_fraud rather than the_root_cause of the meltdown: the widespread issuance of bad credit, securitized by Fannie_Mae_and_Freddie_Mac at the behest of Congress. In this audio_clip, Obama defends the idea of subprime_lending just over a year ago:


Iran has watched the drop_in_oil_prices with growing_alarm, the_International_Herald-Tribune reports, and it wants OPEC to take action to support prices at higher than $100 per_barrel.


The anger generated from that information has nothing to do with racism, and everything to do with the breach_of_trust between Congress and its constituents.  Frank, Chris_Dodd, and others like Lacy_Clay and Maxine_Waters tried the racist meme out on regulators who tried_to_warn Congress of the pending collapse.  They have to smear their critics.  They certainly can’t admit that Congress failed spectacularly.


With Congress grilling Wall_Street_executives over the_financial_collapse, why not have some of the real culprits testify in their investigations?  One of them is close at hand; in fact, he’s pretending to lead the investigation while really being one of its best targets.  Senator Chris_Dodd took massive_amounts in political contributions from Fannie_Mae, Freddie_Mac, and Countrywide, while securing sweetheart_deals from that same lender, all while supposedly providing the oversight that somehow missed the rotten struts under the entire subprime_market.  The_Wall_Street_Journal wonders why Dodd’s asking_questions rather than answering them:Dodd should get expelled first for his conflict_of_interest in accepting his sweetheart loans_from_Countrywide in the “Friends of Angelo” program. 
