A native Python implementation of Spark's RDD interface. The primary objective is not to have RDDs that are resilient and distributed, but to remove the dependency on the JVM and Hadoop. The focus is on having a lightweight and fast implementation for small datasets. It is a drop-in replacement for PySpark's SparkContext and RDD.
Use case: you have a pipeline that processes 100k input documents and converts them to normalized features. They are used to train a local scikit-learn classifier. The preprocessing is perfect for a full Spark task. Now, you want to use this trained classifier in an API endpoint. You need the same pre-processing pipeline for a single document per API call. This does not have to be done in parallel, but there should be only a small overhead in initialization and preferably no dependency on the JVM. This is what pysparkling is for.
pip install pysparkling
- Supports multiple URI schemes: s3://, http:// and file://. Specify multiple files separated by comma. Resolves * and ? wildcards.
- Handles .gz and .bz2 compressed files.
- Parallelization via multiprocessing.Pool, concurrent.futures.ThreadPoolExecutor or any other Pool-like object that has a map(func, iterable) method.
- Only dependencies: boto for AWS S3 and requests for HTTP.
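For example, a single textFile call can combine comma-separated paths, wildcards and schemes; the paths and bucket name below are made up for illustration:

from pysparkling import Context

# hypothetical paths: local wildcard plus gzipped files on S3
lines = Context().textFile('logs/2015-*.txt,s3://my-bucket/logs/*.gz')
print(lines.count())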
The change log is in HISTORY.rst.
Word Count
from pysparkling import Context
counts = Context().textFile(
    'README.rst'
).map(
    lambda line: ''.join(ch if ch.isalnum() else ' ' for ch in line)
).flatMap(
    lambda line: line.split(' ')
).map(
    lambda word: (word, 1)
).reduceByKey(
    lambda a, b: a + b
)
print(counts.collect())
which prints a long list of pairs of words and their counts. This and a few more advanced examples are demoed in docs/demo.ipynb.
A usual pysparkling session starts with either parallelizing a list or reading data from a file, using the methods Context.parallelize(my_list) or Context.textFile("path/to/textfile.txt"). These two methods return an RDD which can then be processed with the methods below.
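A minimal session might look like this; the input list and the filter condition are made up for illustration:

from pysparkling import Context

rdd = Context().parallelize([1, 2, 3, 4, 5])
print(rdd.filter(lambda x: x % 2 == 0).map(lambda x: x * 10).collect())
# [20, 40]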
- aggregate(zeroValue, seqOp, combOp): aggregate value in partition with seqOp and combine with combOp
- aggregateByKey(zeroValue, seqFunc, combFunc): aggregate by key
- cache(): synonym for persist()
- cartesian(other): cartesian product
- coalesce(): do nothing
- collect(): return the underlying list
- count(): get length of internal list
- countApprox(): same as count()
- countByKey(): input is list of pairs, returns a dictionary
- countByValue(): input is a list, returns a dictionary
- context(): return the context
- distinct(): returns a new RDD containing the distinct elements
- filter(func): return new RDD filtered with func
- first(): return first element
- flatMap(func): return a new RDD of a flattened map
- flatMapValues(func): return new RDD
- fold(zeroValue, op): aggregate elements
- foldByKey(zeroValue, op): aggregate elements by key
- foreach(func): apply func to every element
- foreachPartition(func): apply func to every partition
- getNumPartitions(): number of partitions
- getPartitions(): returns an iterator over the partitions
- groupBy(func): group by the output of func
- groupByKey(): group by key where the RDD is of type [(key, value), ...]
- histogram(buckets): buckets can be a list or an int
- id(): currently just returns None
- intersection(other): return a new RDD with the intersection
- isCheckpointed(): returns False
- join(other): join
- keyBy(func): creates tuple in new RDD
- keys(): returns the keys of tuples in new RDD
- leftOuterJoin(other): left outer join
- lookup(key): return list of values for this key
- map(func): apply func to every element and return a new RDD
- mapPartitions(func): apply func to entire partitions
- mapValues(func): apply func to value in (key, value) pairs and return a new RDD
- max(): get the maximum element
- mean(): mean
- min(): get the minimum element
- name(): RDD's name
- persist(): caches outputs of previous operations (previous steps are still executed lazily)
- pipe(command): pipe the elements through an external command line tool
- reduce(): reduce
- reduceByKey(): reduce by key and return the new RDD
- repartition(numPartitions): repartition
- rightOuterJoin(other): right outer join
- sample(withReplacement, fraction, seed=None): sample from the RDD
- sampleStdev(): sample standard deviation
- sampleVariance(): sample variance
- saveAsTextFile(path): save RDD as text file
- stats(): return a StatCounter
- stdev(): standard deviation
- subtract(other): return a new RDD without the elements in other
- sum(): sum
- take(n): get the first n elements
- takeSample(n): get n random samples
- toLocalIterator(): get a local iterator
- union(other): form union
- variance(): variance
- zip(other): other has to have the same length
- zipWithUniqueId(): pairs each element with a unique index
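The key/value operations above work on RDDs of (key, value) pairs. A small, made-up illustration, assuming Spark-like join semantics where missing matches appear as None (output order may differ):

from pysparkling import Context

c = Context()
ages = c.parallelize([('alice', 31), ('bob', 27)])
cities = c.parallelize([('alice', 'Zurich')])
print(ages.leftOuterJoin(cities).collect())
# something like [('alice', (31, 'Zurich')), ('bob', (27, None))]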
A Context describes the setup. Instantiating a Context with the default arguments using Context() is the most lightweight setup. All data is just in the local thread and is never serialized or deserialized.
If you want to process the data in parallel, you can use the multiprocessing module. Given the limitations of the default pickle serializer, you can specify to serialize all methods with dill instead. For example, a common instantiation with multiprocessing looks like this:
import multiprocessing
import dill
from pysparkling import Context

c = Context(
    multiprocessing.Pool(4),
    serializer=dill.dumps,
    deserializer=dill.loads,
)
This assumes that your data is serializable with pickle, which is generally faster than dill. You can also specify a custom serializer/deserializer for data.
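A sketch of such a setup, assuming you want dill for the data as well (the pool size is arbitrary; the keyword names match the __init__ signature listed below):

import multiprocessing
import dill
from pysparkling import Context

c = Context(
    pool=multiprocessing.Pool(4),
    serializer=dill.dumps,
    deserializer=dill.loads,
    data_serializer=dill.dumps,
    data_deserializer=dill.loads,
)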
- __init__(pool=None, serializer=None, deserializer=None, data_serializer=None, data_deserializer=None): pool is any instance with a map(func, iterator) method
- broadcast(var): returns an instance of Broadcast(). Access its value with value.
- newRddId(): incrementing number [internal use]
- parallelize(list_or_iterator, numPartitions): returns a new RDD
- textFile(filename): load every line of a text file into an RDD. filename can contain a comma-separated list of many files, ? and * wildcards, file paths on S3 (s3://bucket_name/filename.txt) and local file paths (relative/path/my_text.txt, /absolute/path/my_text.txt or file:///absolute/file/path.txt). If the filename points to a folder containing part* files, those are resolved.
- version: the version of pysparkling
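As an illustration, a broadcast variable is created once and read through its value attribute; the dictionary here is arbitrary:

from pysparkling import Context

lookup_table = Context().broadcast({'a': 1, 'b': 2})
print(lookup_table.value['b'])  # 2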
The functionality provided by this module is used in Context.textFile() for reading and in RDD.saveAsTextFile() for writing. You can use this submodule directly with File(filename).dump(some_data), File(filename).load() and File.exists(path) to write, read and check for the existence of a file. All methods transparently handle http://, s3:// and file:// locations and compression/decompression of .gz and .bz2 files.
Use environment variables AWS_SECRET_ACCESS_KEY and AWS_ACCESS_KEY_ID for auth and use file paths of the form s3://bucket_name/filename.txt.
File:
- __init__(filename): filename is a URI of a file (can include http://, s3:// and file:// schemes)
- dump(stream): write the stream to the file
- [static] exists(path): check for existence of path
- load(): return the contents as BytesIO
- make_public(recursive=False): only for files on S3
- [static] resolve_filenames(expr): given an expression with * and ? wildcard characters, get a list of all matching filenames. Multiple expressions separated by , can also be specified. Spark-style partitioned datasets (folders containing part-* files) are resolved as well to a list of the individual files.
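A small sketch of using File directly; this assumes File is importable from the pysparkling.fileio submodule and that dump() accepts a BytesIO stream, as described above:

from io import BytesIO
from pysparkling.fileio import File

path = 'local_copy.txt.gz'             # .gz compression is handled transparently
File(path).dump(BytesIO(b'hello\n'))   # write
print(File.exists(path))               # check existence
print(File(path).load().read())        # read back the contents (returned as BytesIO)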