xuezhizeng/textacy

 
 

textacy: NLP, before and after spaCy

textacy is a Python library for performing a variety of natural language processing (NLP) tasks, built on the high-performance spaCy library. With the fundamentals (tokenization, part-of-speech tagging, dependency parsing, etc.) delegated to another library, textacy focuses on the tasks that come before and follow after.

Features

  • Provide a convenient entry point and interface to one or many documents, with the core processing delegated to spaCy
  • Stream text, json, csv, spaCy binary, and other data to and from disk
  • Download and explore a variety of included datasets with both text content and metadata, from Congressional speeches to historical literature to Reddit comments
  • Clean and normalize raw text before analyzing it
  • Access and filter basic linguistic elements, such as words, ngrams, and noun chunks; extract named entities, acronyms and their definitions, and key terms
  • Flexibly tokenize and vectorize documents and corpora, then train, interpret, and visualize topic models using LSA, LDA, or NMF methods
  • Compare strings, sets, and documents by a variety of similarity metrics
  • Calculate common text statistics, including Flesch-Kincaid Grade Level, SMOG Index, and multilingual Flesch Reading Ease

... and more!
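To make a few of the features above concrete, here is a minimal standard-library sketch of three of the ideas: normalizing raw text, extracting n-grams, and comparing two token sets with Jaccard similarity. It deliberately does not call textacy or spaCy, and the helper names (`normalize`, `words`, `ngrams`, `jaccard`) are hypothetical ones chosen for illustration; they are not textacy's API. See the published docs for the library's actual functions.

```python
import re


def normalize(text):
    """Collapse runs of whitespace and lowercase the text (illustrative only)."""
    return re.sub(r"\s+", " ", text).strip().lower()


def words(text):
    """Split normalized text into simple word tokens.

    This is a naive regex tokenizer, not spaCy's tokenizer.
    """
    return re.findall(r"[a-z0-9']+", normalize(text))


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def jaccard(a, b):
    """Jaccard similarity of two collections: |A & B| / |A | B|."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 1.0  # convention chosen here: two empty sets count as identical
    return len(set_a & set_b) / len(set_a | set_b)


if __name__ == "__main__":
    doc1 = words("The  quick\nbrown fox jumps.")
    doc2 = words("The quick red fox sleeps.")
    print(ngrams(doc1, 2))
    print(jaccard(doc1, doc2))
```

In practice, textacy delegates tokenization to spaCy, so a real pipeline would operate on spaCy tokens rather than regex matches; the sketch only shows the shape of the computation.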

Note: ReadTheDocs builds have been failing for months, so those docs are currently out-of-date. Very sorry. As a (temporary?) workaround, docs for the latest version (v0.6.0) have been published via GitHub Pages:

https://chartbeat-labs.github.io/textacy

Maintainer

Howdy, y'all. 👋
