nunofernandes-plight/pyspark-cassandra

pyspark-cassandra

Utilities and examples to assist in working with Cassandra and PySpark.

The repository currently contains an updated, more robust example of reading from Cassandra 2.1 with a SparkContext's newAPIHadoopRDD and writing to it with an RDD's saveAsNewAPIHadoopDataset. The example demonstrates the use of CQL collections: lists, sets and maps.
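
As a rough illustration of the read path, the sketch below builds a Hadoop configuration for Cassandra's CqlInputFormat and hands it to newAPIHadoopRDD. The helper names (build_input_conf, read_table), the host, and the page size are hypothetical and are not taken from this repository. The key, value and converter class names are left as caller-supplied parameters because they depend on the converters bundled in the jar. The Thrift port 9160 is an assumed default.

```python
def build_input_conf(host, keyspace, table, page_size=1000):
    """Return a Hadoop configuration dict for reading one Cassandra table.

    All values must be strings, since they end up in a Hadoop Configuration.
    """
    return {
        "cassandra.input.thrift.address": host,
        "cassandra.input.thrift.port": "9160",  # assumption: default Thrift port
        "cassandra.input.keyspace": keyspace,
        "cassandra.input.columnfamily": table,
        "cassandra.input.partitioner.class": "Murmur3Partitioner",
        "cassandra.input.page.row.size": str(page_size),
    }


def read_table(sc, conf, key_class, value_class, key_converter, value_converter):
    """Read a Cassandra table as an RDD of (key, value) pairs.

    `sc` is an existing SparkContext. The class names are plain strings
    naming Java classes that must be on the driver classpath; which ones
    to use depends on the converters you ship (hypothetical here).
    """
    return sc.newAPIHadoopRDD(
        "org.apache.cassandra.hadoop.cql3.CqlInputFormat",
        key_class,
        value_class,
        keyConverter=key_converter,
        valueConverter=value_converter,
        conf=conf,
    )


# Build the configuration for the "users" table in the "test" keyspace.
conf = build_input_conf("localhost", "test", "users")
```

Keeping the configuration in a plain function makes it easy to inspect or adjust the settings before a SparkContext exists.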

Proper integration with the DataStax Cassandra Spark Connector is in progress.

Building

You'll need Maven in order to build the uberjar required for the examples.

mvn clean package

This creates an uberjar at target/pyspark-cassandra-<version>-SNAPSHOT.jar.

Using with PySpark

spark-submit --driver-class-path /path/to/pyspark-cassandra.jar myscript.py ...

Using examples

pip install -r requirements.txt

Then run examples either directly with spark-submit, or use the run_script.py utility.

Running the PySpark Cassandra Hadoop Example

The example can first create the schema it requires via:

./run_script.py src/main/python/pyspark_cassandra_hadoop_example.py init test

The init command creates the keyspace and table and inserts sample data; "test" is the name of the keyspace. A users table is created in this keyspace with two sample users so that there is data to read.

Afterwards, you can run:

./run_script.py src/main/python/pyspark_cassandra_hadoop_example.py run test

This runs a sample PySpark driver program that reads the existing rows in the users table and then writes two new users to it.
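
The write half of such a driver program might look roughly like the sketch below. It is not the repository's example script: the column names (user_id, fname, lname), the helper names, and the host are hypothetical, since the actual users schema is defined by the init command. The converter class names are left as parameters, and the Thrift port 9160 is an assumed default. With CqlOutputFormat, each record is a pair of the primary key columns (a dict) and the list of values bound to the `?` placeholders of the UPDATE statement.

```python
def build_output_conf(host, keyspace, table, update_cql):
    """Return a Hadoop configuration dict for writing to one Cassandra table."""
    return {
        "cassandra.output.thrift.address": host,
        "cassandra.output.thrift.port": "9160",  # assumption: default Thrift port
        "cassandra.output.keyspace": keyspace,
        "cassandra.output.partitioner.class": "Murmur3Partitioner",
        "cassandra.output.cql": update_cql,
        "mapreduce.output.basename": table,
        "mapreduce.outputformat.class":
            "org.apache.cassandra.hadoop.cql3.CqlOutputFormat",
        "mapreduce.job.output.key.class": "java.util.Map",
        "mapreduce.job.output.value.class": "java.util.List",
    }


def to_output_pair(user):
    """Split a user dict into (primary key columns, bound values).

    The column names are hypothetical; adapt them to the real schema.
    The value order must match the `?` placeholders in the UPDATE.
    """
    return ({"user_id": user["user_id"]}, [user["fname"], user["lname"]])


def write_users(rdd, conf, key_converter, value_converter):
    """Write an RDD of user dicts; converter class names are caller-supplied."""
    rdd.map(to_output_pair).saveAsNewAPIHadoopDataset(
        conf=conf,
        keyConverter=key_converter,
        valueConverter=value_converter,
    )


# Hypothetical statement for the "users" table in the "test" keyspace.
out_conf = build_output_conf(
    "localhost", "test", "users",
    "UPDATE test.users SET fname = ?, lname = ?",
)
pair = to_output_pair({"user_id": 3, "fname": "Ada", "lname": "Lovelace"})
```

Only the non-key columns appear in the SET clause; the key columns travel in the first element of each pair.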
