This repository has been archived by the owner on Mar 12, 2023. It is now read-only.

danielfrg/spark-plot

spark-plot

  • Simplifies plotting Spark DataFrames by computing the plot data inside Spark
  • Plot types: Histogram, 2D Histogram, Bar plot
  • Generates Matplotlib plots through an API similar to the pandas plotting API

TODO:

  • Other plot types
  • Support multiple Python plotting frontends (Altair, Plotly and more)

Installation

pip install spark-plot

Usage

See the NYCFlights example notebook for the full walkthrough. A short summary follows.

Create a Spark DataFrame:

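from pyspark.sql import SparkSession
from nycflights13 import flights as flights_pd

# Reuse the active Spark session, or start one if none exists
spark = SparkSession.builder.getOrCreate()
flights = spark.createDataFrame(flights_pd)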

Import the spark-plot Matplotlib frontend:

from spark_plot import mpl

Histogram

mpl.hist(flights, "distance", color="#474747")

[Figure: Flights Histogram]

Set the width of each bin with bin_width instead:

mpl.hist(flights, "distance", bin_width=400, color="#474747")

[Figure: Flights Histogram with bin_width=400]
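
This README does not say how spark-plot bins the data internally. As a rough plain-Python sketch of what a fixed bin_width implies, assuming bins start at the column minimum (an assumption, not confirmed by the source), each value maps to a bin index and only the per-bin counts are needed to draw the bars. The helper bin_counts and the sample distances below are hypothetical and not part of spark-plot.

```python
import math
from collections import Counter


def bin_counts(values, bin_width):
    """Count values per fixed-width bin, keyed by each bin's left edge.

    Hypothetical helper for illustration; bins are assumed to start
    at the smallest value.
    """
    lo = min(values)
    # Bin index = floor((value - minimum) / bin_width)
    index_counts = Counter(math.floor((v - lo) / bin_width) for v in values)
    # Convert each bin index back to the bin's left edge
    return {lo + i * bin_width: n for i, n in sorted(index_counts.items())}


# Made-up sample of flight distances (not taken from nycflights13)
distances = [80, 200, 529, 762, 1400, 1416, 2475, 2586]
counts = bin_counts(distances, bin_width=400)
```

In PySpark the same index could be written as a column expression such as F.floor((F.col("distance") - lo) / bin_width) followed by groupBy(...).count(), so that only one small row per bin is collected for plotting.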

Histogram 2D

Like a histogram, but binned along two columns at once.

ax = mpl.hist2d(flights, col_x="sched_dep_time", col_y="sched_arr_time", title="Sched Arrival vs Departure", cmap="Blues_r")

[Figure: Flights Histogram 2D]
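
To make the two-dimensional case concrete, here is a minimal plain-Python sketch that counts points per rectangular cell. It assumes fixed-width bins starting at zero on both axes, which may differ from how spark-plot chooses its bins. The function bin_counts_2d and the sample points are hypothetical.

```python
from collections import Counter


def bin_counts_2d(points, width_x, width_y):
    """Count (x, y) points per cell, keyed by (x_bin, y_bin) index.

    Hypothetical helper for illustration; bins are assumed to start at 0.
    """
    return Counter((int(x // width_x), int(y // width_y)) for x, y in points)


# Made-up (sched_dep_time, sched_arr_time) pairs in HHMM form
points = [(515, 819), (529, 830), (540, 850), (600, 837), (1730, 2010)]
cells = bin_counts_2d(points, width_x=100, width_y=100)
```

Each cell count then becomes the color intensity of one rectangle in the plot.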

Bar plot

mpl.bar(flights, x="origin")

[Figure: Flights Bar Plot]

Aggregate Functions

Pass any Spark aggregate function through agg:

import pyspark.sql.functions as F

mpl.bar(flights, x="origin", y="dep_delay", agg=F.mean)

[Figure: Flights Bar Plot, Mean Delay]
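
As an illustration of what an aggregated bar plot represents, the plain-Python sketch below groups rows by the x column and applies an aggregate to the y values of each group, one bar per group. The helper aggregate_by and the sample rows are hypothetical. Null values are skipped here because Spark's built-in aggregates such as F.mean ignore nulls.

```python
from collections import defaultdict
from statistics import mean


def aggregate_by(rows, x, y, agg):
    """Group rows by column x and aggregate column y within each group.

    Hypothetical helper for illustration, not part of spark-plot.
    """
    groups = defaultdict(list)
    for row in rows:
        if row[y] is not None:  # skip nulls, as Spark aggregates do
            groups[row[x]].append(row[y])
    return {key: agg(values) for key, values in groups.items()}


# Made-up rows (not taken from nycflights13)
rows = [
    {"origin": "EWR", "dep_delay": 2},
    {"origin": "EWR", "dep_delay": 4},
    {"origin": "JFK", "dep_delay": 10},
    {"origin": "JFK", "dep_delay": None},
    {"origin": "JFK", "dep_delay": 20},
    {"origin": "LGA", "dep_delay": -3},
]
bars = aggregate_by(rows, x="origin", y="dep_delay", agg=mean)
```

In PySpark the equivalent aggregation is typically written flights.groupBy("origin").agg(F.mean("dep_delay")). Whether spark-plot issues exactly this query is not stated in this README.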

Multiple columns

Pass a list of columns as y to aggregate and plot several columns together:

import pyspark.sql.functions as F

ax = mpl.bar(flights, x="origin", y=["dep_delay", "arr_delay"], agg=F.sum)

[Figure: Flights Bar Plot, Sum of Delays]
