pyspark-cassandra

Utilities and examples to assist in working with Cassandra and PySpark.

Currently contains an updated and more robust example that uses a SparkContext's newAPIHadoopRDD to read from Cassandra 2.1 and an RDD's saveAsNewAPIHadoopDataset to write to it. The example demonstrates usage of CQL collections: lists, sets and maps.
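As a rough illustration of that read/write pattern, the sketch below builds the Hadoop configuration dictionaries and passes them to the two PySpark calls. It is not the repository's example script. The property names and the CqlOutputFormat class are assumed to match those used by the Cassandra example scripts that ship with Apache Spark; the input format, key/value classes and converter class names are left as parameters because they depend on the classes packaged in the uberjar. The helper names are hypothetical.

```python
def cassandra_input_conf(host, keyspace, table, port="9160"):
    """Hadoop properties for reading one table (assumed key names).

    PySpark passes these to the JVM as strings, so every value is a str.
    """
    return {
        "cassandra.input.thrift.address": host,
        "cassandra.input.thrift.port": str(port),
        "cassandra.input.keyspace": keyspace,
        "cassandra.input.columnfamily": table,
        "cassandra.input.partitioner.class": "Murmur3Partitioner",
    }


def cassandra_output_conf(host, keyspace, table, update_cql, port="9160"):
    """Hadoop properties for writing to one table (assumed key names).

    update_cql is the prepared statement, with '?' placeholders, that the
    output format executes for every record.
    """
    return {
        "cassandra.output.thrift.address": host,
        "cassandra.output.thrift.port": str(port),
        "cassandra.output.keyspace": keyspace,
        "cassandra.output.partitioner.class": "Murmur3Partitioner",
        "cassandra.output.cql": update_cql,
        "mapreduce.output.basename": table,
        "mapreduce.outputformat.class":
            "org.apache.cassandra.hadoop.cql3.CqlOutputFormat",
        "mapreduce.job.output.key.class": "java.util.Map",
        "mapreduce.job.output.value.class": "java.util.List",
    }


def read_table(sc, conf, input_format, key_class, value_class,
               key_converter, value_converter):
    """Read a table as an RDD of (key, value) pairs.

    All class names are fully qualified Java class names supplied by the
    caller; the converters turn JVM objects into Python objects.
    """
    return sc.newAPIHadoopRDD(
        input_format, key_class, value_class,
        keyConverter=key_converter,
        valueConverter=value_converter,
        conf=conf)


def write_rows(rdd, conf, key_converter, value_converter):
    """Write an RDD of (key, value) pairs using the output format in conf."""
    rdd.saveAsNewAPIHadoopDataset(
        conf=conf,
        keyConverter=key_converter,
        valueConverter=value_converter)
```

Keeping the configuration in plain dictionaries makes it easy to inspect or log the settings before a job is submitted; only read_table and write_rows need a live SparkContext and the uberjar on the classpath.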

Proper integration with the DataStax Cassandra Spark Connector is in progress.

Building

You'll need Maven in order to build the uberjar required for the examples.

mvn clean package

This creates an uberjar at target/pyspark-cassandra-<version>-SNAPSHOT.jar.

Using with PySpark

spark-submit --driver-class-path /path/to/pyspark-cassandra.jar myscript.py ...
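If you prefer to launch scripts from Python, the command above can be assembled programmatically. The following is a minimal sketch, not the repository's run_script.py: the function names find_uberjar and spark_submit_command are hypothetical, and the jar is assumed to be in the target directory under the name pattern produced by the build step.

```python
import glob
import os


def find_uberjar(project_dir="."):
    """Return the most recently built uberjar under <project_dir>/target."""
    pattern = os.path.join(
        project_dir, "target", "pyspark-cassandra-*-SNAPSHOT.jar")
    jars = glob.glob(pattern)
    if not jars:
        raise IOError("no uberjar found; run 'mvn clean package' first")
    # If several versions were built, prefer the newest file.
    return max(jars, key=os.path.getmtime)


def spark_submit_command(jar, script, *script_args):
    """Build the spark-submit argument list for a PySpark script."""
    return ["spark-submit", "--driver-class-path", jar, script] + list(script_args)
```

The resulting list can be passed to subprocess.call, for example subprocess.call(spark_submit_command(find_uberjar(), "myscript.py")), provided spark-submit is on your PATH.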

Using examples

pip install -r requirements.txt

Then run the examples either directly with spark-submit or through the run_script.py utility.

Running the PySpark Cassandra Hadoop Example

The example can first create the schema it requires via:

./run_script.py src/main/python/pyspark_cassandra_hadoop_example.py init test

The init command creates the keyspace and table and inserts sample data. "test" is the name of the keyspace. A users table is created in this keyspace with two sample users so that there is data to read.

Afterwards, you can run:

./run_script.py src/main/python/pyspark_cassandra_hadoop_example.py run test

This runs a sample PySpark driver program that reads the existing rows in the users table and then writes two new users to it.
