[Figure: term network for War and Peace]
Textplot is a little program that turns a document into a network of terms that are connected to each other depending on the extent to which they appear in the same locations in the text. For each term:
- Get the set of offsets in the document where the term appears.
- Using kernel density estimation, compute a probability density function (PDF) that represents the word's distribution across the document. [Figure: example PDF from War and Peace]
- Compute a Bray-Curtis dissimilarity between the term's PDF and the PDFs of all other terms in the document. This measures the extent to which two words appear in the same locations (see the sketch after this list).
- Sort this list in descending order to get a custom "topic" for the term. Skim off the top X words (usually 10) to get the strongest links. E.g., here's "napoleon" from War and Peace:
```
[('napoleon', 1.0),
 ('war', 0.65319871313854128),
 ('military', 0.64782349297012154),
 ('men', 0.63958189887106576),
 ('order', 0.63636730075877446),
 ('general', 0.62621616907584432),
 ('russia', 0.62233286026418089),
 ('king', 0.61854160459241103),
 ('single', 0.61630514751638699),
 ('killed', 0.61262010905310182),
 ('peace', 0.60775702746632576),
 ('contrary', 0.60750138486684579),
 ('number', 0.59936009740377516),
 ('accompanied', 0.59748552019874168),
 ('clear', 0.59661288775164523),
 ('force', 0.59657370362505935),
 ('army', 0.59584331507492383),
 ('authority', 0.59523854206807647),
 ('troops', 0.59293965397478188),
 ('russian', 0.59077308177196441)]
```
- Shovel all of these links into a network and export a GML file.
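Concretely, the KDE and Bray-Curtis steps look something like the sketch below, built on scikit-learn and scipy (both textplot dependencies). This is a toy illustration, not textplot's actual code: the helper names, toy offsets, document length, and bandwidths are all made up. And since the scores in the list above behave like similarities (1.0 for the term itself), the sketch reports 1 minus the Bray-Curtis dissimilarity.

```python
import numpy as np
from scipy.spatial.distance import braycurtis
from sklearn.neighbors import KernelDensity

def term_pdf(offsets, doc_length, bandwidth=2000, samples=1000):
    """Estimate a term's distribution across the document with a KDE."""
    kde = KernelDensity(kernel='gaussian', bandwidth=bandwidth)
    kde.fit(np.asarray(offsets, dtype=float).reshape(-1, 1))
    xs = np.linspace(0, doc_length, samples).reshape(-1, 1)
    # score_samples() returns log densities; exponentiate to get the PDF.
    return np.exp(kde.score_samples(xs))

def similarity(pdf_a, pdf_b):
    """1 - Bray-Curtis dissimilarity: near 1.0 when two terms co-locate."""
    return 1 - braycurtis(pdf_a, pdf_b)

# Toy example: two terms that cluster in the same stretch of a
# 100,000-token document, and one that clusters elsewhere.
doc_length = 100000
napoleon = term_pdf([5000, 5400, 6100, 7000], doc_length, bandwidth=1000)
war = term_pdf([4800, 5600, 6500], doc_length, bandwidth=1000)
dance = term_pdf([80000, 81000, 83000], doc_length, bandwidth=1000)

print(similarity(napoleon, war))    # high - overlapping distributions
print(similarity(napoleon, dance))  # low  - mostly disjoint distributions
```

Sorting every other term by this score and keeping the top 10 gives exactly the kind of list shown above; those top links become the weighted edges of the network.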
The easiest way to build out a graph is to use the `frequent` function, which wraps up all the intermediate steps of tokenizing the text, computing the term distance matrix, generating the per-word topic lists, etc. (Or, use the `clumpy` function, which tries to pick words that concentrate tightly in specific parts of the text.) First, spin up a virtualenv:
```
virtualenv env
. env/bin/activate
pip install -r requirements.txt
```
Then, fire up an IPython terminal and build a network:
```
In [1]: from textplot import frequent

In [2]: g = frequent('path/to/file.txt')
Indexing terms:
[############################### ] 140000/140185 - 00:00:03
Generating graph:
[################################] 530/530 - 00:00:00

In [3]: g.write_gml('path/to/file.gml')
```
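Since the export is standard GML and networkx is already a dependency, you can read the file back in to inspect the network. Assuming the nodes are labeled with the terms themselves (as in the "napoleon" list above):

```python
import networkx as nx

# Load the exported network and list a term's neighbors.
g = nx.read_gml('path/to/file.gml')
print(sorted(g.neighbors('napoleon')))
```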
The `frequent` function takes these arguments:

- `term_depth=500` (int) - The number of terms to include in the network. Right now, the code just takes the top X most frequent terms, after stopwords are removed.
- `skim_depth=10` (int) - The number of connections to skim off the top of the "topics" computed for each of the words and added to the network as edges.
- `d_weights=False` (bool) - Should the edge weights be treated as measures of "similarity" (similar terms have "heavy" weights) or "distance" (similar terms have "short" distances)?
- `bandwidth=2000` (int) - The bandwidth for the kernel density estimation, which controls the "smoothness" of the curve. 2000 is a sensible default for long novels, but bump it down if you're working with shorter texts.
- `samples=1000` (int) - The number of equally-spaced points on the X-axis where the kernel density is sampled. 1000 is almost always enough, unless you're working with a huge document.
- `kernel="gaussian"` (str) - The kernel function. The scikit-learn implementation also supports `tophat`, `epanechnikov`, `exponential`, `linear`, and `cosine`.
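For example, to build a denser network from a shorter text, you might lower the bandwidth and skim more links per term. The particular values below are illustrative, not recommendations:

```python
from textplot import frequent

# Shorter text: fewer terms, more edges per term, narrower KDE bandwidth.
g = frequent(
    'path/to/file.txt',
    term_depth=300,
    skim_depth=15,
    bandwidth=500,
    kernel='epanechnikov',
)

g.write_gml('path/to/file.gml')
```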
Textplot uses numpy, scipy, scikit-learn, matplotlib, networkx, and clint.