(Mental) maps of texts with kernel density estimation and force-directed networks.

Textplot

War and Peace (click to zoom)

Textplot is a little program that turns a document into a network of terms, linked according to the extent to which they appear in the same locations in the text. For each term:

  1. Get the set of offsets in the document where the term appears.

  2. Using kernel density estimation, compute a probability density function (PDF) that represents the word's distribution across the document. E.g., from War and Peace:

(figure: a term's smoothed distribution across War and Peace)

  3. Compute the similarity between the term's PDF and the PDFs of all other terms in the document (1 minus the Bray-Curtis dissimilarity between the two curves). This measures the extent to which two words appear in the same locations.

  4. Sort this list in descending order to get a custom "topic" for the term, and skim off the top X words (usually 10) to get the strongest links. E.g., here's "napoleon" from War and Peace:

[('napoleon', 1.0),
('war', 0.65319871313854128),
('military', 0.64782349297012154),
('men', 0.63958189887106576),
('order', 0.63636730075877446),
('general', 0.62621616907584432),
('russia', 0.62233286026418089),
('king', 0.61854160459241103),
('single', 0.61630514751638699),
('killed', 0.61262010905310182),
('peace', 0.60775702746632576),
('contrary', 0.60750138486684579),
('number', 0.59936009740377516),
('accompanied', 0.59748552019874168),
('clear', 0.59661288775164523),
('force', 0.59657370362505935),
('army', 0.59584331507492383),
('authority', 0.59523854206807647),
('troops', 0.59293965397478188),
('russian', 0.59077308177196441)]

  5. Shovel all of these links into a network and export a GML file.
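The five steps above can be sketched with scikit-learn, scipy, and networkx. This is not textplot's actual implementation — the function names (term_pdfs, build_graph) and the toy offsets are invented for illustration, assuming a gaussian kernel and normalized densities:

```python
import numpy as np
import networkx as nx
from sklearn.neighbors import KernelDensity
from scipy.spatial.distance import braycurtis

def term_pdfs(offsets_by_term, doc_length, bandwidth=2000, samples=1000):
    """Estimate each term's smoothed distribution across the document."""
    xs = np.linspace(0, doc_length, samples)[:, None]
    pdfs = {}
    for term, offsets in offsets_by_term.items():
        kde = KernelDensity(kernel='gaussian', bandwidth=bandwidth)
        kde.fit(np.array(offsets, dtype=float)[:, None])
        density = np.exp(kde.score_samples(xs))  # score_samples returns log-density
        pdfs[term] = density / density.sum()     # normalize so curves are comparable
    return pdfs

def build_graph(pdfs, skim_depth=10):
    """Link each term to the skim_depth terms whose PDFs overlap it most."""
    g = nx.Graph()
    for term, pdf in pdfs.items():
        # similarity = 1 - Bray-Curtis dissimilarity; a term's self-similarity is 1.0
        sims = {other: 1 - braycurtis(pdf, opdf) for other, opdf in pdfs.items()}
        topic = sorted(sims.items(), key=lambda kv: kv[1], reverse=True)
        for other, weight in topic[:skim_depth + 1]:  # +1 to skip past self
            if other != term:
                g.add_edge(term, other, weight=weight)
    return g
```

The resulting graph can then be written out with networkx's write_gml, mirroring step 5.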

Generating graphs

The easiest way to build out a graph is to use the frequent function, which wraps up all the intermediate steps: tokenizing the text, computing the term distance matrix, generating the per-word topic lists, and so on. (Or use the clumpy function, which tries to pick words that concentrate tightly in specific parts of the text.) First, spin up a virtualenv:

virtualenv env
. env/bin/activate
pip install -r requirements.txt

Then, fire up an IPython terminal and build a network:

In [1]: from textplot import frequent

In [2]: g = frequent('path/to/file.txt')
Indexing terms:
[############################### ] 140000/140185 - 00:00:03
Generating graph:
[################################] 530/530 - 00:00:00

In [3]: g.write_gml('path/to/file.gml')

The frequent function takes these arguments:

  • (int) term_depth=500 - The number of terms to include in the network. Right now, the code just takes the top X most frequent terms, after stopwords are removed.

  • (int) skim_depth=10 - The number of connections skimmed off the top of each term's "topic" list and added to the network as edges.

  • (bool) d_weights=False - Should the edge weights be treated as measures of "similarity" (similar terms have "heavy" weights) or "distance" (similar terms have "short" distances)?

  • (int) bandwidth=2000 - The bandwidth for the kernel density estimation, which controls the smoothness of the curve. 2000 is a sensible default for long novels, but bump it down if you're working with shorter texts.

  • (int) samples=1000 - The number of equally-spaced points on the X-axis where the kernel density is sampled. 1000 is almost always enough, unless you're working with a huge document.

  • (str) kernel="gaussian" - The kernel function. The scikit-learn implementation also supports tophat, epanechnikov, exponential, linear, and cosine.
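To see what bandwidth does in practice, here's a minimal sketch, assuming a gaussian kernel and toy offsets in a hypothetical 1,000-word document (real novels would use the 2000 default; this is not textplot code):

```python
import numpy as np
from sklearn.neighbors import KernelDensity

# Hypothetical offsets: two tight clusters of a term's occurrences.
offsets = np.array([100.0, 120.0, 900.0, 950.0])[:, None]
xs = np.linspace(0, 1000, 500)[:, None]

densities = {}
for bandwidth in (20, 200):
    kde = KernelDensity(kernel='gaussian', bandwidth=bandwidth).fit(offsets)
    densities[bandwidth] = np.exp(kde.score_samples(xs))

# The small bandwidth yields tall, sharp peaks over each cluster; the
# large bandwidth smears the same mass into low, broad bumps.
```

A smaller bandwidth makes the estimate track the raw offsets more literally, which is why shorter texts — where a few hundred words is a meaningful "location" — call for lower values.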


Textplot uses numpy, scipy, scikit-learn, matplotlib, networkx, and clint.
