
sparkmagic

Sparkmagic is a set of tools for interactively working with remote Spark clusters through Livy, a Spark REST server, in Jupyter notebooks. The Sparkmagic project includes a set of magics for interactively running Spark code in multiple languages, as well as some kernels that you can use to turn Jupyter into an integrated Spark environment.
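
If you only want the magics inside an existing IPython notebook, rather than the wrapper kernels, they are loaded with a %load_ext call. A minimal sketch, assuming the extension lives under this snapshot's remotespark package name; check the examples directory for the exact module path:

    %load_ext remotespark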

[Demo screenshots: automatic SparkContext and SQLContext creation; automatic visualization; help]

Features

  • Run Spark code in multiple languages against any remote Spark cluster through Livy

  • Automatic visualization of SQL queries with the %%sql magic in the PySpark and Spark kernels; an easy visual interface lets you construct visualizations interactively, with no code required

  • Capture the output of SQL queries as Pandas dataframes to work with them on your local machine (a sketch follows this list)
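
As a sketch of the last two bullets, a cell in the PySpark wrapper kernel might look like the one below. The -o flag, which names a local Pandas dataframe to capture the result into, matches current sparkmagic behavior and is assumed to work the same way in this snapshot; the sales table is hypothetical:

    %%sql -o top_products
    SELECT product, COUNT(*) AS cnt
    FROM sales
    GROUP BY product
    ORDER BY cnt DESC
    LIMIT 10

After the cell runs, the visual interface can chart the result with no further code, and top_products is available locally as a Pandas dataframe.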

Examples

Check out the examples directory.

Installation

  1. Install the library

     git clone https://github.com/jupyter-incubator/sparkmagic
     cd sparkmagic
     pip install -e .
    
  2. (Optional) Install the wrapper kernels

     jupyter-kernelspec install remotespark/kernels/sparkkernel
     jupyter-kernelspec install remotespark/kernels/pysparkkernel
    
  3. (Optional) Copy the example configuration file to your home directory

     cp remotespark/example_config.json ~/.sparkmagic/config.json
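
The configuration file is where each language's kernel is pointed at a Livy endpoint, with credentials if the endpoint needs them. A minimal sketch of the shape it might take; treat the exact keys as an assumption and use example_config.json as the authoritative template:

    {
      "kernel_python_credentials": {
        "username": "",
        "password": "",
        "url": "http://localhost:8998"
      },
      "kernel_scala_credentials": {
        "username": "",
        "password": "",
        "url": "http://localhost:8998"
      }
    }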
    

Architecture

Sparkmagic uses Livy, a REST server for Spark, to remotely execute all user code. The library then automatically collects the output of your code as plain text or a JSON document, displaying the results to you as formatted text or as a Pandas dataframe as appropriate.

[Architecture diagram]
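
To make the flow concrete, here is a minimal sketch of the Livy REST conversation that Sparkmagic automates, written against Livy's documented /sessions and /statements endpoints; the endpoint URL is a placeholder:

    # Sketch of the Livy REST flow that Sparkmagic drives for you.
    import json
    import time

    import requests

    LIVY = "http://localhost:8998"  # placeholder Livy endpoint
    HEADERS = {"Content-Type": "application/json"}

    # 1. Ask Livy to start a remote PySpark session (a Spark driver on the cluster).
    r = requests.post(LIVY + "/sessions", headers=HEADERS,
                      data=json.dumps({"kind": "pyspark"}))
    session = LIVY + "/sessions/" + str(r.json()["id"])

    # 2. Wait until the remote driver is up and idle.
    while requests.get(session, headers=HEADERS).json()["state"] != "idle":
        time.sleep(1)

    # 3. Submit a statement; the code string executes on the remote driver.
    r = requests.post(session + "/statements", headers=HEADERS,
                      data=json.dumps({"code": "1 + 1"}))
    statement = session + "/statements/" + str(r.json()["id"])

    # 4. Poll for the result, which comes back wrapped in JSON.
    while True:
        result = requests.get(statement, headers=HEADERS).json()
        if result["state"] == "available":
            print(result["output"]["data"]["text/plain"])  # prints: 2
            break
        time.sleep(1)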

This architecture offers us some important advantages:

  1. Run Spark code completely remotely; no Spark components need to be installed on the Jupyter server

  2. Multi-language support; the Python and Scala kernels are equally feature-rich, and adding support for more languages will be easy

  3. Support for multiple endpoints; you can use a single notebook to start multiple Spark jobs in different languages and against different remote clusters (a sketch follows this list)

  4. Easy integration with any Python library for data science or visualization, like Pandas or Plotly
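
As a sketch of point 3: with the %spark magic you can register several Livy endpoints and direct each cell at one of them. The flag syntax below follows the current sparkmagic %spark magic and is an assumption for this snapshot; the session names and cluster URLs are placeholders:

    %spark add -s py_job -l python -u http://cluster-a.example.com:8998
    %spark add -s scala_job -l scala -u http://cluster-b.example.com:8998

    %%spark -s py_job
    print(sc.version)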

However, there are some important limitations to note:

  1. Some overhead added by sending all code and output through Livy

  2. Since all code is run on a remote driver through Livy, all structured data must be serialized to JSON and parsed by the Sparkmagic library so that it can be manipulated and visualized on the client side. In practice this means that you must use Python for client-side data manipulation in %%local mode (a sketch follows this list).
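
The second limitation in practice: a query result captured with -o arrives client-side as a Pandas dataframe, and any further manipulation happens in Python inside a %%local cell. A sketch, assuming the -o and %%local behavior of current sparkmagic; the sales table is again hypothetical:

    %%sql -o region_totals
    SELECT region, SUM(amount) AS total FROM sales GROUP BY region

    %%local
    # Runs on the Jupyter server, not on the cluster: region_totals is now
    # an ordinary Pandas dataframe that any local Python library can use.
    region_totals.sort_values("total", ascending=False).head()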

Contributing

We welcome contributions from everyone. If you've made an improvement to our code, please send us a pull request.

If you'd like to see an enhancement made but don't have time to work on it yourself, feel free to open an issue for us to look at.
