Geo-Inferencing in Twitter [Soft-Boiled]

General

The usage examples below assume that you have created a zip file named soft-boiled.zip containing the top-level directory of the repo.
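One way to produce that zip from a local checkout (a minimal sketch; the paths are placeholders you will need to adjust):

# Build soft-boiled.zip from a local checkout (illustrative paths only)
import shutil

# Archives the repo's top-level 'soft-boiled' directory into /path/to/zip/soft-boiled.zip
shutil.make_archive('/path/to/zip/soft-boiled', 'zip', '/path/to/parent/dir', 'soft-boiled')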

Spatial Label Propagation [slp.py]

Usage:

sc.addPyFile('/path/to/zip/soft-boiled.zip')  # Can be an HDFS path
from src.algorithms.slp import SLP

# Create dataframe from parquet data
tweets = sqlCtx.parquetFile('hdfs:///post_etl_datasets/twitter')

# Configure options object
options = {'dispersion_threshold': 100, 'num_located_neighbors_req': 3,
           'num_iters': 5, 'hold_out': set(['9'])}
# Create algorithm with options
slp = SLP(sc, sqlCtx, options)
# Train 
slp.train(tweets)
# Test
slp.test(tweets)
# Save a dictionary of known and estimated user id_str and their corresponding location
slp.save('/my/local/path/filename.pkl') 
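The saved file is a plain Python pickle, so the results can be inspected outside of Spark (a minimal sketch, assuming the dictionary maps id_str to a location, as the comment above describes):

import pickle

# Load the dictionary written by slp.save (id_str -> known or estimated location)
with open('/my/local/path/filename.pkl', 'rb') as f:
    user_locations = pickle.load(f)

print(len(user_locations), 'users with known or estimated locations')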

Options:

Related to calculating the median point amongst a collection of points:

dispersion_threshold: The maximum median distance, in km, that a point may be from the remaining points while still producing a location estimate (see the sketch after this options list).

num_points_req_for_known: Number of geotagged posts that a user must have to be included in ground truth.

home_radius_for_known: The maximum median distance, in km, that a user's geotagged points may be from each other for the user to be included in ground truth.

num_located_neighbors_req: The minimum number of vertex neighbors with known locations required before a median location is calculated and assigned to an unlocated user.

Related to the actual label propagation:

num_iters: This controls the number of iterations of label propagation performed

Related to reading in data [Applicable if slp.load is used]:

use_parquet: If true, load the data as parquet files; if no use_[type] option is set, the code assumes JSON.

hold_out: A set of final digits of id_str that will be excluded from training and used only for testing [e.g. set(['9'])].

json_path: Path to a file describing the schema of the data. For the JSON input format this avoids an extra pass through the data to infer the schema and ensures a consistent schema regardless of how much data is included.
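To make the median-point options above concrete, the sketch below (illustrative only, not the library's implementation) shows the kind of geometric test that dispersion_threshold and num_located_neighbors_req control: among a user's located neighbors, take the point whose median haversine distance to the others is smallest, and keep it only if enough neighbors exist and that median distance is under the threshold.

# Illustrative median-point / dispersion check (not the library's code)
from math import radians, sin, cos, asin, sqrt

def haversine_km(p, q):
    """Great-circle distance in km between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(radians, (p[0], p[1], q[0], q[1]))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))

def median(values):
    s = sorted(values)
    mid = len(s) // 2
    return s[mid] if len(s) % 2 else (s[mid - 1] + s[mid]) / 2.0

def estimate_location(neighbor_points, dispersion_threshold=100, num_located_neighbors_req=3):
    """Return a median-like point among located neighbors, or None."""
    if len(neighbor_points) < num_located_neighbors_req:
        return None  # too few located neighbors to estimate
    # Candidate estimate: the point with the smallest median distance to the rest
    best = min(neighbor_points,
               key=lambda p: median([haversine_km(p, q) for q in neighbor_points if q is not p]))
    dispersion = median([haversine_km(best, q) for q in neighbor_points if q is not best])
    return best if dispersion <= dispersion_threshold else None

With three tightly clustered points this returns one of them; with widely scattered points the dispersion exceeds the threshold and it returns None.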

Gaussian Mixture Model [gmm.py]

Usage:

sc.addPyFile('/path/to/zip/soft-boiled.zip')  # Can be an HDFS path
from src.algorithms.gmm import GMM

# Create dataframe from parquet data
tweets = sqlCtx.parquetFile('hdfs:///post_etl_datasets/twitter')

# Configure options object
options = {'fields':set(['user.location', 'text'])}
# Create algorithm with options
gmm = GMM(sc, sqlCtx, options)
# Train
gmm.train(tweets)
# Test
gmm.test(tweets)
# Save a dictionary of words and the GMM that corresponds to that word
gmm.save('/my/local/path/filename.pkl')

Options:

Related to GMM:

fields: A set of fields used to train/test the GMM model. Currently only user.location and text are supported.

Related to reading in data [Applicable if gmm.load is used]:

use_parquet: If true, load the data as parquet files; if no use_[type] option is set, the code assumes JSON.

json_path: Path to a file describing the schema of the data. For the JSON input format this avoids an extra pass through the data to infer the schema and ensures a consistent schema regardless of how much data is included.
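For intuition about the model itself, the sketch below (using scikit-learn and made-up coordinates, not the repo's training pipeline) fits a per-word mixture of Gaussians over the (lat, lon) points of geotagged tweets containing that word, then reads off the heaviest component's mean as a simple point estimate:

# Toy per-word GMM fit (sketch; the library's actual training differs)
import numpy as np
from sklearn.mixture import GaussianMixture

# Hypothetical (lat, lon) coordinates of geotagged tweets containing the word 'beach'
coords = np.array([[34.0, -118.2], [33.9, -118.4], [25.8, -80.1], [25.7, -80.2]])

gmm = GaussianMixture(n_components=2, random_state=0).fit(coords)

# Use the mean of the heaviest mixture component as a point estimate for the word
best = int(np.argmax(gmm.weights_))
print("estimated location for 'beach':", gmm.means_[best])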
