Preprocessor

Preprocessor is a Python library for preprocessing tweet data.

Machine learning systems built on tweet data usually need a preprocessing step. This library makes it easy to clean, parse, or tokenize tweets.

Features

Currently supports cleaning, tokenizing and parsing:

URLs
Hashtags
Mentions
Reserved words (RT, FAV)
Emojis
Smileys

Supports Python 2.7 and 3.3+

Usage

Basic cleaning:

>>> import preprocessor as p
>>> p.clean('Preprocessor is #awesome 👍 https://github.com/s/preprocessor')
'Preprocessor is'

Tokenizing:

>>> p.tokenize('Preprocessor is #awesome 👍 https://github.com/s/preprocessor')
'Preprocessor is $HASHTAG$ $EMOJI$ $URL$'
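
The idea behind cleaning and tokenizing can be sketched with plain regular expressions. This is only an illustration, not the library's implementation: the patterns, `sketch_tokenize`, and `sketch_clean` are hypothetical names made up here, and emojis, smileys, and reserved words are left out to keep it short.

```python
import re

# Hypothetical, simplified patterns for illustration only.
# The library's own patterns are not reproduced here.
PATTERNS = [
    ("URL", re.compile(r"https?://\S+")),   # checked first so '#' inside a URL is not a hashtag
    ("MENTION", re.compile(r"@\w+")),
    ("HASHTAG", re.compile(r"#\w+")),
]


def sketch_tokenize(text):
    """Replace every match with a $NAME$ placeholder."""
    for name, pattern in PATTERNS:
        text = pattern.sub("$" + name + "$", text)
    return text


def sketch_clean(text):
    """Remove every match, then collapse the leftover whitespace."""
    for _, pattern in PATTERNS:
        text = pattern.sub("", text)
    return " ".join(text.split())


tweet = 'Preprocessor is #awesome https://github.com/s/preprocessor'
tokenized = sketch_tokenize(tweet)
cleaned = sketch_clean(tweet)
```

In this sketch, cleaning is the same substitution as tokenizing with an empty replacement, followed by whitespace normalization.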

Parsing:

>>> parsed_tweet = p.parse('Preprocessor is #awesome https://github.com/s/preprocessor')
>>> parsed_tweet
<preprocessor.parse.ParseResult instance at 0x10f430758>
>>> parsed_tweet.urls
[(25:58) => https://github.com/s/preprocessor]
>>> parsed_tweet.urls[0].start_index
25
>>> parsed_tweet.urls[0].match
'https://github.com/s/preprocessor'
>>> parsed_tweet.urls[0].end_index
58
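
To show where the indices above come from, here is a minimal sketch that collects match spans with `re.finditer`. It is an assumption-laden stand-in rather than the library's parser: `Match`, `URL_PATTERN`, and `sketch_find_urls` are hypothetical names, and the URL pattern is deliberately simplified.

```python
import re
from collections import namedtuple

# Hypothetical record type that mirrors the attributes shown in the example above.
Match = namedtuple("Match", ["start_index", "end_index", "match"])

# Simplified, assumed URL pattern; real-world URL matching is more involved.
URL_PATTERN = re.compile(r"https?://\S+")


def sketch_find_urls(text):
    """Return one Match per URL, with start (inclusive) and end (exclusive) offsets."""
    return [
        Match(m.start(), m.end(), m.group(0))
        for m in URL_PATTERN.finditer(text)
    ]


urls = sketch_find_urls('Preprocessor is #awesome https://github.com/s/preprocessor')
first = urls[0]
```

Here `first.start_index` and `first.end_index` follow Python's slicing convention, so `text[start_index:end_index]` gives back the matched string.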

Installation

Using pip:

$ pip install tweet-preprocessor
