aaa121/smappdragon

 
 


									 _
 ___ _ __ ___   __ _ _ __  _ __   __| |_ __ __ _  __ _  ___  _ __ 
/ __| '_ ` _ \ / _` | '_ \| '_ \ / _` | '__/ _` |/ _` |/ _ \| '_ \ 
\__ \ | | | | | (_| | |_) | |_) | (_| | | | (_| | (_| | (_) | | | |
|___/_| |_| |_|\__,_| .__/| .__/ \__,_|_|  \__,_|\__, |\___/|_| |_|
					|_|   |_|                    |___/


🐉 smappdragon is a set of tools for working with twitter data. a more abstract / contextual wrapper for smappdragon can be found in pysmap (work in progress). the old smapp-toolkit is here.

##installation

pip install smappdragon

pip install smappdragon --upgrade

smappdragon runs in python 3. if you don't have python 3, install anaconda or miniconda. whichever you choose, we recommend at least python 3.0.

check your python binary location with which python. it should be /usr/bin/python (mac osx base install), /usr/local/bin/python (homebrew), /Users/YOURNAME/miniconda3/bin/python (miniconda), or /Users/YOURNAME/anaconda/bin/python (anaconda).

check your version with python --version. if it reports 2.X.X, you're out of date.
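a minimal sketch of the same version check done from inside python, assuming you want a script to fail early on python 2. the helper name is_python3 is made up for illustration and is not part of smappdragon.

```python
import sys

def is_python3(version_info=sys.version_info):
    # hypothetical helper: true when the interpreter's major version is 3 or higher
    return version_info[0] >= 3

if not is_python3():
    # stop before importing anything that needs python 3
    raise SystemExit('smappdragon needs python 3, found ' + sys.version.split()[0])
```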

##testing

you absolutely need to write unit tests for any methods you add to smappdragon. this software needs to stay as stable as possible, as it will be the basis for other software.

this folder contains tests for smappdragon.

the bson folder contains two bson files on which to run tests. one is a valid.bson file with tweets that have properly formatted fields. the other is a sketchy.bson file that has tweets with strange fields, missing fields, etc.
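a rough sketch of what a unit test for a new method might look like, using the standard unittest module. count_hashtags is a hypothetical method invented for this example, and the two inline tweets stand in for the kind of data found in valid.bson and sketchy.bson.

```python
import unittest

def count_hashtags(tweet):
    # hypothetical method under test: counts hashtags, tolerating missing fields
    return len(tweet.get('entities', {}).get('hashtags', []))

class TestCountHashtags(unittest.TestCase):
    def test_valid_tweet(self):
        # a well formed tweet with two hashtags
        tweet = {'entities': {'hashtags': [{'text': 'a'}, {'text': 'b'}]}}
        self.assertEqual(count_hashtags(tweet), 2)

    def test_sketchy_tweet(self):
        # a tweet with no entities field should not raise
        self.assertEqual(count_hashtags({'text': 'no entities here'}), 0)
```

saved in the tests folder, a file like this would be picked up by python -m unittest in the same way as the test commands listed in the sections below.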

our test coverage setup: https://github.com/coagulant/coveralls-python

##collection

classes for interfacing with tweets from different data sources

##mongo_collection

this allows you to plug into a running live mongodb database and run smappdragon methods on the resulting collection object.

abstract:

from smappdragon import MongoCollection

collection = MongoCollection('HOST', PORT, 'USER_NAME', 'PASSWORD', 'DB_NAME', 'COLLECTION_NAME')

practical:

from smappdragon import MongoCollection

collection = MongoCollection('superhost.bio.nyu.edu', 27574, 'smappReadWriteUserName', 'PASSWORD', 'GERMANY_ELECTION_2015_Nagler', 'tweet_collection_name')

returns a collection object that can have methods called on it

test: python -m unittest test.test_mongo_collection

you should create a config.py file in the tests directory structured like so:

config = {
	'mongo':{
		'host': 'HOSTNAME',
		'port': PORT,
		'user': 'DB_USER',
		'password': 'DB_PASS',
		'database': 'DB_NAME',
		'collection': 'COLLECTION_NAME'
	},
	'blah':{
		.
		.
		.
	}
}

this config is used for testing; it is gitignored.

##bson_collection

this allows you to use any bson file as a data source for smappdragon

abstract:

from smappdragon import BsonCollection

collection = BsonCollection('/PATH/TO/BSON/FILE.bson')

practical:

from smappdragon import BsonCollection

collection = BsonCollection('~/Documents/file.bson')

returns a collection object that can have methods called on it

test: python -m unittest test.test_bson_collection

you should create a config.py file in the tests directory structured like so:

config = {
	'blah':{
		.
		.
		.
	},
	'bson':{
		'valid': 'bson/valid.bson'
	}
}

this config is used for testing; it is gitignored.

##json_collection

this allows you to use any json file (with a json object on each line) as a data source for smappdragon

abstract:

from smappdragon import JsonCollection

collection = JsonCollection('/PATH/TO/JSON/FILE.json')

practical:

from smappdragon import JsonCollection

collection = JsonCollection('~/Documents/file.json')

returns a collection object that can have methods called on it
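to make the expected file layout concrete, here is a small sketch of reading a file with one json object per line using only the standard library. iter_json_lines is a hypothetical helper, not smappdragon's implementation, and an in-memory file stands in for a real path.

```python
import io
import json

def iter_json_lines(handle):
    # hypothetical helper: yield one parsed object per non-empty line
    for line in handle:
        line = line.strip()
        if line:
            yield json.loads(line)

# two tweets, one json object on each line
sample = io.StringIO('{"id_str": "1"}\n{"id_str": "2"}\n')
tweets = list(iter_json_lines(sample))
```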

test: python -m unittest test.test_json_collection

you should create a config.py file in the tests directory structured like so:

config = {
	'blah':{
		.
		.
		.
	},
	'json':{
		'valid': 'json/valid.json'
	}
}

this config is used for testing; it is gitignored.

##csv_collection

newly added

this allows you to use any csv file (with a csv header) as a data source for smappdragon

abstract:

from smappdragon import CsvCollection

collection = CsvCollection('/PATH/TO/CSV/FILE.csv')

practical:

from smappdragon import CsvCollection

collection = CsvCollection('~/Documents/file.csv')

returns a collection object that can have methods called on it
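the header row is what lets each csv line be read back as a record with named fields. a minimal sketch with the standard csv module, assuming a file with id_str and text columns (the column names and data are made up for illustration):

```python
import csv
import io

# an in-memory stand-in for a csv file whose first line is the header
sample = io.StringIO('id_str,text\n1,hello\n2,world\n')

# DictReader uses the header row as the keys of each record
rows = list(csv.DictReader(sample))
```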

test: python -m unittest test.test_csv_collection

you should create a config.py file in the tests directory structured like so:

config = {
	'blah':{
		.
		.
		.
	},
	'csv':{
		'valid': 'json/valid.csv'
	}
}

this config is used for testing; it is gitignored.

##base_collection

this is the base class for all collection objects. methods that all collection objects use are found here. this is actually the most important class.

test: python -m unittest test.test_base_collection

##get_iterator

makes an iterator that can iterate through all tweets in a particular collection

abstract:

collection.get_iterator()

practical:

for tweet in collection.get_iterator():
	print(tweet)

returns an iterable object that will yield all the tweets in a particular collection

##set_limit

sets a limit on the number of documents a collection can return

abstract:

collection.set_limit(TWEET_LIMIT_NUMBER)

practical:

collection.set_limit(10)
# or 
collection.set_limit(10).top_entities({'hashtags':10})

returns a collection object limited to querying / filtering only as many tweets as the limit number allows. a limit of 10 will only allow 10 tweets to be processed.
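the effect of a limit can be pictured as cutting an iterator short. this is only a sketch of the idea using itertools.islice, not how smappdragon implements set_limit; the limited helper and the fake tweets are assumptions.

```python
from itertools import islice

def limited(tweets, limit):
    # hypothetical helper: stop yielding after `limit` tweets
    return islice(tweets, limit)

# a lazy stream of fake tweets; only the first 10 are ever produced
tweets = ({'id': i} for i in range(1000))
first_ten = list(limited(tweets, 10))
```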

##strip_tweets

abstract:

collection.strip_tweets(FIELDS_TO_KEEP)

practical:

collection.strip_tweets(['id', 'user.id', 'entities.user_mentions'])

returns a collection object that will return reduced tweet objects where all the fields but the specified ones are filtered away.

note: list indexes do not work here. 'entities.user_mentions.0' does not work; you can only preserve entire lists.
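a sketch of what keeping only dotted field paths could look like, written as a standalone function. strip_fields is hypothetical and is not the library's code; it is written so that a path that tries to index into a list is skipped, which matches the note about list indexes.

```python
def strip_fields(tweet, keep_fields):
    # hypothetical helper: build a new dict holding only the requested dotted paths
    stripped = {}
    for path in keep_fields:
        keys = path.split('.')
        value = tweet
        for key in keys:
            if not isinstance(value, dict) or key not in value:
                break  # path is missing, or it tries to index into a list: skip it
            value = value[key]
        else:
            # the whole path resolved, so copy the value into the same nested position
            target = stripped
            for key in keys[:-1]:
                target = target.setdefault(key, {})
            target[keys[-1]] = value
    return stripped

tweet = {
    'id': 1,
    'text': 'hi',
    'user': {'id': 7, 'name': 'y'},
    'entities': {'user_mentions': [{'id': 9}], 'urls': []},
}
small = strip_fields(tweet, ['id', 'user.id', 'entities.user_mentions', 'entities.user_mentions.0'])
```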

##set_filter

sets a filter to apply to all tweets. the filter is a mongo style query dictionary

abstract:

collection.set_filter(TWEET_FILTER)

practical:

collection.set_filter({'id_str':'4576334'})
# or 
collection.set_filter({'id_str':'4576334', 'user':{'screen_name':'yvanscher'}}).top_entities({'hashtags':10})

returns a collection object that will only return tweets that match the specified filter. so if you ask for {id_str:4576334} you will only get tweets where the id_str field is 4576334.

note: passing an empty filter will return all tweets in a collection, empty filters {} are like no filter.

note: to make sure you are querying what you really want, examine the twitter docs on tweet and user objects. some field names are shared between objects (for example, id_str is part of both user and tweet objects, even when a user object is nested inside a tweet object).

##set_custom_filter

sets a method you define as a filter for tweets

abstract:

collection.set_custom_filter(FUNCTION)

practical:

def is_tweet_a_retweet(tweet):
	if 'retweeted' in tweet and tweet['retweeted']:
		return True
	else:
		return False
collection.set_custom_filter(is_tweet_a_retweet)
# or 
collection.set_custom_filter(is_tweet_a_retweet).top_entities({'hashtags':10})

returns a collection object that will only return tweets that match or pass the specified custom filter method.
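conceptually, a custom filter is just a function that is called on every tweet and decides whether to keep it. a sketch of that idea as a plain generator; the filtered helper is an assumption for illustration and not part of smappdragon.

```python
def is_tweet_a_retweet(tweet):
    # true only when the field exists and is truthy
    return bool(tweet.get('retweeted'))

def filtered(tweets, custom_filter):
    # hypothetical helper: yield only the tweets the filter function accepts
    for tweet in tweets:
        if custom_filter(tweet):
            yield tweet

tweets = [{'id': 1, 'retweeted': True}, {'id': 2, 'retweeted': False}, {'id': 3}]
retweets = list(filtered(tweets, is_tweet_a_retweet))
```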

##set_custom_filter_list

sets a list of methods you define as a set of filters for tweets

abstract:

collection.set_custom_filter_list([FUNCTION_ONE, FUNCTION_TWO, ETC])

practical:

def is_tweet_a_retweet(tweet):
	if 'retweeted' in tweet and tweet['retweeted']:
		return True
	else:
		return False
def screen_name_is_yvan(tweet):
	if 'screen_name' in tweet['user'] and tweet['user']['screen_name'] == 'yvan':
		return True
	return False
collection.set_custom_filter_list([is_tweet_a_retweet, screen_name_is_yvan])
# or 
collection.set_custom_filter_list([is_tweet_a_retweet, screen_name_is_yvan]).top_entities({'hashtags':10})

returns a collection object that will only return tweets that match or pass the specified custom filter methods.

note: passing an empty filter will return all tweets in a collection, empty filters [] are like no filter.
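one way to picture a filter list is that a tweet must pass every function in it. the sketch below uses python's built-in all(), which also explains why an empty list behaves like no filter. passes_all is a hypothetical helper, and requiring every filter to pass is an assumption about the library's behavior.

```python
def is_tweet_a_retweet(tweet):
    return bool(tweet.get('retweeted'))

def screen_name_is_yvan(tweet):
    # .get with a default avoids a KeyError when the tweet has no user object
    return tweet.get('user', {}).get('screen_name') == 'yvan'

def passes_all(tweet, filters):
    # all() over an empty list is True, so [] lets every tweet through
    return all(custom_filter(tweet) for custom_filter in filters)

tweet = {'retweeted': True, 'user': {'screen_name': 'yvan'}}
both = passes_all(tweet, [is_tweet_a_retweet, screen_name_is_yvan])
no_filters = passes_all({'text': 'anything'}, [])
```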

##dump_to_bson

dumps all tweets in a collection to bson.

abstract:

collection.dump_to_bson('/PATH/TO/OUTPUT/FILE.bson')

practical:

collection.dump_to_bson('~/smappstuff/file.bson')
# or
collection.set_limit(5).dump_to_bson('/Users/kevin/work/smappwork/file.bson')

returns a bson file that dumps to disk.

##dump_to_json

dumps all tweets in a collection to json formatted bson (a json object on each line of the file).

abstract:

collection.dump_to_json('/PATH/TO/OUTPUT/FILE.json')

practical:

collection.dump_to_json('~/smappstuff/file.json')
# or
collection.set_limit(5).dump_to_json('/Users/kevin/work/smappwork/file.json')

returns a json file that dumps to disk.
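for reference, the one-object-per-line layout can be produced with the standard json module. this is a sketch of the file format only, not the library's dump code; write_json_lines, the temp directory, and the sample tweets are assumptions.

```python
import json
import os
import tempfile

def write_json_lines(tweets, path):
    # hypothetical helper: one json object per line, newline terminated
    with open(path, 'w') as handle:
        for tweet in tweets:
            handle.write(json.dumps(tweet) + '\n')

# write to a throwaway directory so the sketch does not touch real data
path = os.path.join(tempfile.mkdtemp(), 'file.json')
write_json_lines([{'id_str': '1'}, {'id_str': '2'}], path)

with open(path) as handle:
    lines = handle.read().splitlines()
```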

##dump_to_csv

dumps all tweets in a collection to csv.

abstract:

collection.dump_to_csv('/PATH/TO/OUTPUT/FILE.csv', ['FIELD1', 'FIELD2', 'FIELD3.SUBFIELD', ETC])

practical:

collection.dump_to_csv('~/smappstuff/file.csv', ['id_str', 'entities.hashtags.0', 'entities.hashtags.1'])
# or 
collection.set_limit(5).dump_to_csv('/Users/kevin/work/smappwork/file.csv', ['id_str', 'entities.hashtags.0', 'entities.hashtags.1'])

returns a csv file that dumps to disk.

note: media and lists of objects are converted to a unicode string and put as one field in the csv.

note: this will dump in the order in which fields appear in a tweet and not the order in which you list them in your list of fields.

note: empty lists [] will return nothing. you must specify fields.
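a sketch of how dotted field paths with list indexes, such as 'entities.hashtags.0', could be resolved into csv cells. get_path is a hypothetical helper and this is not the library's implementation: it writes columns in the order of the field list (unlike the ordering note above) and leaves a cell empty when a path is missing.

```python
import csv
import io

def get_path(tweet, path):
    # hypothetical helper: walk dict keys and numeric list indexes along a dotted path
    value = tweet
    for key in path.split('.'):
        if isinstance(value, dict) and key in value:
            value = value[key]
        elif isinstance(value, list) and key.isdigit() and int(key) < len(value):
            value = value[int(key)]
        else:
            return ''  # missing field: leave the csv cell empty
    return value

fields = ['id_str', 'entities.hashtags.0', 'entities.hashtags.1']
tweets = [{'id_str': '1', 'entities': {'hashtags': [{'text': 'a'}]}}]

out = io.StringIO()
writer = csv.writer(out)
writer.writerow(fields)
for tweet in tweets:
    # nested objects are turned into a string so they fit in one cell
    writer.writerow([str(get_path(tweet, field)) for field in fields])
```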

##tools

these are tools that our collection classes use but that can also be used on their own if you have some kind of custom tweet input data source

##tweet_parser

a parser for tweets that can perform all sorts of transformations on tweets or extract data from them easily.

abstract / practical:

from smappdragon import TweetParser
tweet_parser = TweetParser()

returns an instance of the TweetParser class that can extract data from tweets or entities

test: python -m unittest test.test_tweet_parser

##contains_entity

tells you whether or not a tweet object has a certain twitter entity

abstract:

tweet_parser.contains_entity(ENTITY_TYPE, TWEET_OBJ)

practical:

tweet_parser.contains_entity('media', { ... tweet object here ... })
#or
tweet_parser.contains_entity('user_mentions', { ... tweet object here ... })
#etc

returns true or false depending on whether a tweet contains the given entity

note: entity_type must be 'urls' 'hashtags' 'user_mentions' 'media' or 'symbols'
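a standalone sketch of this kind of check, assuming entities live under the tweet's 'entities' key and that an empty list counts as not containing the entity. has_entity is hypothetical; the library's own handling of empty lists and bad entity types may differ.

```python
ENTITY_TYPES = ('urls', 'hashtags', 'user_mentions', 'media', 'symbols')

def has_entity(entity_type, tweet):
    # hypothetical helper: reject entity types outside the documented set
    if entity_type not in ENTITY_TYPES:
        raise ValueError('unknown entity type: ' + entity_type)
    entities = tweet.get('entities', {})
    # a missing key or an empty list both count as "no entity" in this sketch
    return bool(entities.get(entity_type))

tweet = {'entities': {'hashtags': [{'text': 'a'}], 'urls': []}}
```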

##get_entity

gets a particular list of twitter entities for you

abstract:

tweet_parser.get_entity(ENTITY_TYPE, TWEET_OBJ)

practical:

print(tweet_parser.get_entity('urls', { ... tweet object here ... }))

output:

[
	{
		"url": "https:\/\/t.co\/XdXRudPXH5",
		"expanded_url": "https:\/\/blog.twitter.com\/2013\/rich-photo-experience-now-in-embedded-tweets-3",
		"display_url": "blog.twitter.com\/2013\/rich-phot\u2026",
		"indices": [80, 103]
	},
	{
		"url": "https:\/\/t.co\/XdXRudPXH4",
		"expanded_url": "https:\/\/blog.twitter.com\/2013\/rich-photo-experience-now-in-embedded-tweets-3",
		"display_url": "blog.twitter.com\/2013\/rich-deio\u2026",
		"indices": [80, 103]
	}
]

returns a list of entity objects stored inside the tweet object's entity field.

note: entity_type must be 'urls' 'hashtags' 'user_mentions' 'media' or 'symbols'

##get_entity_field

gets the field of a particular twitter entity object for you

abstract:

tweet_parser.get_entity_field(FIELD, ENTITY)

practical:

for entity in tweet_parser.get_entity('user_mentions', tweet):
	entity_value = tweet_parser.get_entity_field('id_str', entity)
# or
print(tweet_parser.get_entity_field('url', {
	"url": "https:\/\/t.co\/XdXRudPXH5",
	"expanded_url": "https:\/\/blog.twitter.com\/2013\/rich-photo-experience-now-in-embedded-tweets-3",
	"display_url": "blog.twitter.com\/2013\/rich-phot\u2026",
	"indices": [80, 103]
}))

output:

# the second would output
'https://t.co/XdXRudPXH5'

returns the value stored in this entity object in the field you specified

note: those urls look weird because they are escaped: a \ is placed in front of every /.
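the \/ sequence is a valid json escape for /, so parsing the raw json text with the standard json module gives back the plain url. a short demonstration, using a raw string so python itself leaves the backslashes alone:

```python
import json

# raw json text as it might appear in a file, with escaped slashes
raw = r'{"url": "https:\/\/t.co\/XdXRudPXH5"}'

# json.loads turns each \/ back into a plain /
entity = json.loads(raw)
print(entity['url'])
```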

##tweet_passes_filter

checks to see if a tweet passes a filter

abstract:

tweet_parser.tweet_passes_filter(FILTER_OBJECT, TWEET_OBJECT)

practical:

tweet_parser.tweet_passes_filter({'a':'b'}, {'a':'b', 'c':'d'})

output:

True

returns true if a tweet passes a filter or false if a tweet fails to pass a filter.
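a simplified sketch of this kind of matching: every key in the filter must be present in the tweet with an equal value, and nested dictionaries are compared recursively. passes_filter is a hypothetical stand-in that handles only plain equality; it does not attempt mongo operators and is not the library's implementation.

```python
def passes_filter(filter_obj, tweet):
    # hypothetical helper: every filter key must match; extra tweet fields are ignored
    for key, expected in filter_obj.items():
        if key not in tweet:
            return False
        actual = tweet[key]
        if isinstance(expected, dict) and isinstance(actual, dict):
            # nested filter: recurse into the nested tweet object
            if not passes_filter(expected, actual):
                return False
        elif actual != expected:
            return False
    return True

simple = passes_filter({'a': 'b'}, {'a': 'b', 'c': 'd'})
nested = passes_filter({'user': {'screen_name': 'yvanscher'}},
                       {'id_str': '4576334', 'user': {'screen_name': 'yvanscher', 'id_str': '1'}})
```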

##flatten_dict

flattens a tweet into a list of key paths and values. this is a one dimensional structure. you can flatten two objects and then compare them more easily.

abstract:

tweet_parser.flatten_dict(TWEET_OBJ)

practical:

tweet_parser.flatten_dict({'key':{'key2':{'key3':'blah blah'}}, 'cat':'tab'})

output:

[
(['key', 'key2', 'key3'], 'blah blah'),
(['cat'], 'tab')
]

returns a list of tuples where each tuple contains a list of keys to get to a value and the value located at those nested keys.

see: http://stackoverflow.com/questions/11929904/traverse-a-nested-dictionary-and-get-the-path-in-python

see: the tweet_passes_filter method in tweet_parser.py for an example of how to use it to compare two objects.
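a recursive sketch of this kind of flattening, assuming only nested dictionaries need to be descended into (lists are kept whole as values). flatten is a hypothetical function written for illustration, not the code in tweet_parser.py.

```python
def flatten(obj, path=()):
    # hypothetical helper: collect (key path, value) pairs for every non-dict value
    flat = []
    for key, value in obj.items():
        if isinstance(value, dict):
            # descend, remembering the keys used to get here
            flat.extend(flatten(value, path + (key,)))
        else:
            flat.append((list(path + (key,)), value))
    return flat

flat = flatten({'key': {'key2': {'key3': 'blah blah'}}, 'cat': 'tab'})
```

because the result is a flat list, two objects can be compared by checking whether each (path, value) pair of one appears in the other.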

##tweet_passes_custom_filter

tells you whether or not a tweet passes a certain custom filter method that you define

abstract:

tweet_parser.tweet_passes_custom_filter(FILTER_FUNCTION, TWEET)

practical:

def is_tweet_a_retweet(tweet):
	if 'retweeted' in tweet and tweet['retweeted']:
		return True
	else:
		return False
tweet_parser.tweet_passes_custom_filter(is_tweet_a_retweet, {'text': 'blah blah', 'retweeted': True})

returns true or false depending on whether or not a tweet passes through the filter

##tweet_passes_custom_filter_list

tells you whether or not a tweet passes a list of custom filter methods that you define

abstract:

tweet_parser.tweet_passes_custom_filter_list([FILTER_FUNCTION, ANOTHER_FILTER_FUNCTION], TWEET)

practical:

def is_tweet_a_retweet(tweet):
	if 'retweeted' in tweet and tweet['retweeted']:
		return True
	else:
		return False
def screen_name_is_yvan(tweet):
	if 'user' in tweet and tweet['user'].get('screen_name') == 'yvan':
		return True
	return False
tweet_parser.tweet_passes_custom_filter_list([screen_name_is_yvan, is_tweet_a_retweet], {'text': 'blah blah', 'retweeted': True})

returns true or false depending on whether or not a tweet passes through the list of filters

##strip_tweet

strips a tweet of all its fields except the ones specified

abstract:

tweet_parser.strip_tweet(KEEP_FIELDS, TWEET)

practical:

tweet_parser.strip_tweet(['id', 'user.id', 'entities.user_mentions'], tweet)

returns a tweet stripped down to the fields you want, retaining only specified fields

##contributing

install the developer environment: conda env create -f environment.yml

run pylint smappdragon and fix style issues

submit your pull request on a feature branch (for example feature/added-language-support) to be merged with the dev branch

resources:

best tutorial on python encoding/decoding

csv encoding explanation

bad style:

do not write excessively long 'one-liners'. these are difficult to understand and will be rejected. break them up into multiple lines. posterity will thank you.

use as few dependencies as possible. if you have a choice between writing a little extra code and importing a dependency to write a little less code, do not import the dependency. write the extra code.

only create an extra file with methods if those methods could be used on their own. in other words, do not make pure helper classes for the sake of abstracting code. it just makes the project more confusing. if there's code that's repeated more than 3-4x, make a helper method in the place where it's used, not a separate file.

an example of good helper code is the tweet_parser in smappdragon/tools.

good guide to distributing to pypi

python setup.py sdist upload

##author

yvan
