- Simplifies plotting Spark DataFrames by computing the data for each plot inside Spark
- Plot types: Histogram, 2D Histogram, Bar
- Generates Matplotlib plots with an API similar to the pandas plotting API
TODO:
- Other plot types
- Support for multiple Python plotting frontends (Altair, Plotly and more)
pip install spark-plot
Look at the full NYCFlights example notebook. A short summary is presented below.
Create a Spark DataFrame:
from nycflights13 import flights as flights_pd
flights = spark.createDataFrame(flights_pd)
Import spark-plot
Matplotlib frontend:
from spark_plot import mpl
mpl.hist(flights, "distance", color="#474747")
Specify the bin_width instead:
mpl.hist(flights, "distance", bin_width=400, color="#474747")
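To make the effect of `bin_width` concrete, here is a minimal pure-Python sketch of fixed-width binning: each value maps to a bin, and only the per-bin counts are needed to draw the bars. This is an illustration, not spark-plot's actual implementation. `bin_counts` is a hypothetical helper and the sample distances are made up. In Spark, the same idea could be expressed as a `groupBy` over a floored column, so that only the counts reach the driver.

```python
from collections import Counter
import math


def bin_counts(values, bin_width, start=0.0):
    """Count values per fixed-width bin (hypothetical helper).

    Returns a dict mapping each bin's left edge to the number of values
    falling in [edge, edge + bin_width).
    """
    counts = Counter()
    for v in values:
        if v is None:  # skip missing values
            continue
        index = math.floor((v - start) / bin_width)
        counts[start + index * bin_width] += 1
    return dict(sorted(counts.items()))


# Made-up sample of flight distances, for illustration only.
distances = [80, 399, 400, 1400, 1598, 2475]
hist = bin_counts(distances, bin_width=400)
print(hist)
```

The result has one entry per non-empty bin, so its size depends on the number of bins rather than the number of rows.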
Similar to a histogram but in two dimensions.
ax = mpl.hist2d(flights, col_x="sched_dep_time", col_y="sched_arr_time", title="Sched Arrival vs Departure", cmap="Blues_r")
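The two-dimensional case can be sketched the same way: each row is assigned to a rectangular cell and only the cell counts are kept. The code below is a pure-Python illustration under assumed fixed bin widths, not the library's implementation. `bin_counts_2d` is a hypothetical helper and the sample pairs are made up.

```python
from collections import Counter


def bin_counts_2d(points, width_x, width_y):
    """Count (x, y) pairs per rectangular cell (hypothetical helper).

    Each key is the (x, y) lower-left corner of a cell.
    """
    counts = Counter()
    for x, y in points:
        cell = (int(x // width_x) * width_x, int(y // width_y) * width_y)
        counts[cell] += 1
    return dict(counts)


# Made-up (sched_dep_time, sched_arr_time) pairs, for illustration only.
points = [(515, 819), (529, 830), (540, 850), (1359, 1655)]
grid = bin_counts_2d(points, width_x=100, width_y=100)
print(grid)
```

A colormap such as `Blues_r` then only needs to color one rectangle per cell according to its count.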
mpl.bar(flights, x="origin")
Use any Spark aggregate function
import pyspark.sql.functions as F
mpl.bar(flights, x="origin", y="dep_delay", agg=F.mean)
Pass multiple columns.
import pyspark.sql.functions as F
ax = mpl.bar(flights, x="origin", y=["dep_delay", "arr_delay"], agg=F.sum)
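For intuition, the bar examples above reduce the data to one aggregated value per group and per column before anything is drawn. The following pure-Python sketch mimics that reduction on a few made-up rows. It is not spark-plot's code: `grouped_agg` is a hypothetical helper, and in PySpark the equivalent would roughly be `flights.groupBy("origin").agg(F.sum("dep_delay"), F.sum("arr_delay"))`.

```python
from collections import defaultdict


def grouped_agg(rows, key, value_cols, agg):
    """Aggregate several columns per group (hypothetical helper).

    Missing values (None) are skipped, as Spark's sum and mean do with nulls.
    """
    groups = defaultdict(lambda: {c: [] for c in value_cols})
    for row in rows:
        for c in value_cols:
            if row[c] is not None:
                groups[row[key]][c].append(row[c])
    return {
        k: {c: agg(vals) for c, vals in cols.items()}
        for k, cols in sorted(groups.items())
    }


# Made-up rows, for illustration only.
rows = [
    {"origin": "EWR", "dep_delay": 2, "arr_delay": 11},
    {"origin": "EWR", "dep_delay": -4, "arr_delay": None},
    {"origin": "JFK", "dep_delay": 2, "arr_delay": 33},
    {"origin": "LGA", "dep_delay": 4, "arr_delay": 20},
]
totals = grouped_agg(rows, "origin", ["dep_delay", "arr_delay"], sum)
print(totals)
```

Each group then yields one bar per aggregated column, which is why passing a list to `y` produces grouped bars.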