Example #1
def display_q1(spark: SparkSession):
    st.title("Building your first DataFrames")
    st.markdown("""
    A `Dataset` is a distributed collection of data that combines the benefits of RDDs (strong typing, the ability to use lambda functions) 
    with Spark SQL's optimized execution engine.

    A `DataFrame` is a `Dataset` organized into named columns. 
    It is conceptually equivalent to a table in a relational database, or a data frame in Python/R. 
    In other words, a `DataFrame` is a `Dataset` of `Row`s.

    As with RDDs, applications can create DataFrames from an existing RDD, a Hive table, or Spark data sources.
    """)
    st.subheader("Question 1 - Convert a RDD of Row to a DataFrame")
    st.markdown("""
    Recall from the previous assignment how we used two tables about students: 
    one mapping students to grades, the other mapping students to gender. 

    Let's create a function which takes an `RDD` of `Row`s and a schema as arguments 
    and generates the corresponding DataFrame.
    
    Edit `create_dataframe` in `src/session3/sparksql.py` to solve the issue.
    """)
    test_create_dataframe(spark)
    display_exercise_solved()

    st.subheader("Question 2 - Load a CSV file to a DataFrame")
    st.markdown("""
    Let's reload the `FL_insurance_sample.csv` file from last session and freely interact with it.

    Edit `read_csv` in `src/session3/sparksql.py` to solve the issue.
    """)
    test_read_csv(spark)
    display_goto_next_section()
Example #2
def display_q1(sc: SparkContext):
    st.title("Building your first RDDs")
    with st.beta_expander("Introduction"):
        st.markdown("""
        In this section, we are going to introduce Spark's core abstraction for working with data 
        in a distributed and resilient way: the **Resilient Distributed Dataset**, or RDD. 
        Under the hood, Spark automatically distributes RDDs and their processing across 
        the cluster, so we can focus on our code rather than on distributed processing problems, 
        such as handling data locality or resiliency in case of node failure.

        An RDD consists of a collection of elements partitioned across the nodes of a cluster of machines 
        that can be operated on in parallel. 
        In Spark, work is expressed by the creation and transformation of RDDs using Spark operators.
        """)
        st.image("./img/spark-rdd.png", use_column_width=True)
        st.markdown("""
        _Note_: the RDD is the core data structure of Spark, but the style of programming we are studying 
        in this lesson is considered the _lowest-level API_ for Spark. 
        The Spark community now encourages structured programming with DataFrames/Datasets instead, 
        an optimized interface for working with structured and semi-structured data, 
        which we will cover later. 
        Understanding RDDs is still important because it teaches you how Spark works under the hood 
        and will help you understand and optimize your application once it is deployed to production.

        There are two ways to create RDDs: parallelizing an existing collection in your driver program, 
        or referencing a dataset in an external storage system, 
        such as a shared filesystem, HDFS, HBase, or any data source offering a Hadoop InputFormat.
        """)

    st.subheader("Question 1 - From Python collection to RDD")
    st.markdown("""
    Edit the `rdd_from_list` method in `src/session2/rdd.py` 
    to transform a Python list into a Spark RDD.

    Ex:
    ```python
    rdd_from_list([1, 2, 3]) should be an RDD with values [1, 2, 3]
    ```
    """)
    test_rdd_from_list(sc)
    display_exercise_solved()

    st.subheader("Question 2 - From text file to RDD")
    st.markdown("""
    Edit the `load_file_to_rdd` method in `src/session2/rdd.py` 
    to generate a Spark RDD from a text file. 
    
    Each line of the file will be an element of the RDD.

    Ex:
    ```python
    load_file_to_rdd("./data/FL_insurance_sample.csv") should be an RDD with each line of the file as an element
    ```
    """)
    test_load_file_to_rdd(sc)
    display_exercise_solved()
    display_goto_next_section()
def display_q2():
    st.subheader("Question 2 - Square numbers in the list")
    st.markdown("""
    Edit the `squared` method in `src/session1/hello.py` to square all elements in a list.
    
    Ex:
    ```python
    squared([1, 2, 3]) should be [1, 4, 9]
    ```
    """)
    test_squared()
    display_exercise_solved()
    display_goto_next_section()
def display_q1():
    st.subheader("Question 1 - Sum of two numbers")
    st.markdown("""
    Edit the `add` method in `src/session1/hello.py` to return the sum of 2 numbers.

    Ex:
    ```python
    add(1, 2) should be 3
    ```
    """)
    test_add()
    display_exercise_solved()
    display_goto_next_section()
Example #5
def display_q3(spark: SparkSession):
    st.title("Machine Learning on DataFrames - Titanic example")
    st.markdown("""
    Following the evolution of Spark, there are two ways to do Machine Learning on Spark:

    * MLlib, or `spark.mllib`, was the first ML library implemented in the core Spark library and runs on RDDs. As of today, the library is in maintenance mode, but as we did for RDDs vs DataFrames, it is important that we cover some aspects of the older library. MLlib is also the only library that supports training models for Spark Streaming.
    * ML, or `spark.ml`, is now the primary ML library on Spark and runs on DataFrames. Its API is close to that of other mainstream libraries like scikit-learn.

    We will dive into both APIs in this section, using the `titanic.csv` file for classification purposes on the `Survived` column.

    _If you need a description of the Titanic dataset, [find it here](https://www.kaggle.com/c/titanic/data)_.
    """)

    display_goto_next_section()
Example #6
def display_q4(sc: SparkContext):
    st.title("Manipulating a CSV file")
    st.markdown("""
    We provide a `FL_insurance_sample.csv` file inside the `data` folder to use in our computations; 
    it will be loaded through the `load_file_to_rdd()` method you previously implemented.

    The first line of the CSV is the header, and it is annoying to have it mixed with the data. 
    In the lower-level RDD API we need to write code to specifically filter that first line.

    Edit the `filter_header` method to remove the first element of an RDD.
    """)
    with st.beta_expander("Hint ?"):
        st.markdown("""
        **Hint**: `rdd.zipWithIndex()` is a useful function when you need to filter by position 
        in a file _(though it is computationally expensive)_.
        """)
    test_filter_header(sc)
    display_exercise_solved()

    st.markdown("""
    Let's try some statistics on the `county` variable, which is the second column of the dataset.

    Edit `county_count` to return the number of times each county appears.
    """)
    test_county_count(sc)
    display_exercise_solved()

    st.markdown("""
    A little bonus. Streamlit can display plots directly:
    * Matplotlib: `st.pyplot`
    * Plotly: `st.plotly_chart`
    * Bokeh: `st.bokeh_chart`
    * Altair: `st.altair_chart`

    So as a bonus question, display a bar chart of the number of occurrences 
    for each county directly in the app by editing the `bar_chart_county` method.
    """)
    bar_chart_county(sc)
    display_goto_next_section()
Example #7
def display_q2(spark: SparkSession):
    st.title("Running queries on DataFrames")
    st.subheader("Question 1 - The comeback of 'Mean grades per student'")
    st.markdown("""
    Let's generate DataFrames from the student tables for the upcoming questions, 
    using our newly created `create_dataframe` function. 
    """)
    with st.beta_expander(
            "The following code is run automatically by the Streamlit app."):
        st.markdown("""
        ```python
        genders_rdd = spark.sparkContext.parallelize(
            [("1", "M"), ("2", "M"), ("3", "F"), ("4", "F"), ("5", "F"), ("6", "M")]
        )
        grades_rdd = spark.sparkContext.parallelize(
            [("1", 5), ("2", 12), ("3", 7), ("4", 18), ("5", 9), ("6", 5)]
        )

        genders_schema = StructType(
            [
                StructField("ID", StringType(), True),
                StructField("gender", StringType(), True),
            ]
        )
        grades_schema = StructType(
            [
                StructField("ID", StringType(), True),
                StructField("grade", StringType(), True),
            ]
        )

        genders_df = create_dataframe(spark, genders_rdd, genders_schema)
        grades_df = create_dataframe(spark, grades_rdd, grades_schema)
        ```
        """)
    with st.beta_expander("There are 2 ways of interacting with DataFrames"):
        st.markdown("""
        * DataFrames provide a domain-specific language for structured data manipulation:

        ```python
        >> genders_df.filter(genders_df['ID'] > 2).show()
        +---+------+
        | ID|gender|
        +---+------+
        |  3|     F|
        |  4|     F|
        |  5|     F|
        |  6|     M|
        +---+------+
        ```

        In simpler cases, you can interact with DataFrames using a syntax close to that of Pandas.

        ```python
        >> genders_df[genders_df['ID'] > 2].show()
        +---+------+
        | ID|gender|
        +---+------+
        |  3|     F|
        |  4|     F|
        |  5|     F|
        |  6|     M|
        +---+------+
        ```

        * The `sql` function of a SparkSession lets you run SQL queries directly on the frame 
        and returns a DataFrame, on which you can chain further computations.

        Before doing that, you must create temporary views for those DataFrames 
        so you can reference them within the SQL query.

        ```python
        # Register the DataFrame as a SQL temporary view beforehand
        >> genders_df.createOrReplaceTempView('genders')

        # Now use the temporary view inside a SQL query; Spark maps the view name to the actual DataFrame
        >> spark.sql('SELECT * FROM genders WHERE ID > 2').show()
        +---+------+
        | ID|gender|
        +---+------+
        |  3|     F|
        |  4|     F|
        |  5|     F|
        |  6|     M|
        +---+------+
        ```

        Don't hesitate to check the [DataFrame Function Reference](https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#module-pyspark.sql.functions) 
        for all of the operators you can use on a DataFrame, and experiment with them in the exercises below :)
        """)

    st.markdown("""
    ---

    Remember the mean grade per gender question from the last assignment? Remember how unpleasant it was? 
    Let's do that directly in Spark SQL in the `mean_grade_per_gender` method in `src/session3/sparksql.py`. 

    PS: if you are using the programmatic SQL interface, you can register temporary views for intermediate DataFrames. 
    You may want to delete those views at the end of your function with `spark.catalog.dropTempView('your_view')`. 
    """)
    test_mean_grade_per_gender(spark)
    display_exercise_solved()

    st.subheader("Question 2 - The comeback of 'counting counties'")
    st.markdown("""
    Let's plot the number of occurrences of each county in a histogram, like in the previous assignment. 
    To do that, make `count_county` return a Pandas DataFrame which contains, for each county, 
    the number of its occurrences in the dataset.

    > Hint: a Spark DataFrame is distributed over a number of workers, so it cannot be plotted as is. 
    > You will need to collect the data you want to plot back to the driver. 
    > The `toPandas()` method retrieves a local Pandas DataFrame; 
    > be careful to only use it on small DataFrames!
    """)
    test_count_county(spark)
    display_exercise_solved()

    st.markdown("""
    A little bonus. Streamlit can display plots directly:
    * Matplotlib: `st.pyplot`
    * Plotly: `st.plotly_chart`
    * Bokeh: `st.bokeh_chart`
    * Altair: `st.altair_chart`

    So as a bonus question, display a bar chart of the number of occurrences 
    for each county directly in the app by editing the `bar_chart_county` method.
    """)
    bar_chart_county(spark)
    display_goto_next_section()
Example #8
def display_q3(sc: SparkContext):
    st.title("Using Key-value RDDs")
    st.markdown("""
    If you recall the classic MapReduce paradigm, you were dealing with key/value pairs 
    to reduce your data in a distributed manner. 
    We define a pair as a tuple of two elements, 
    the first element being the key and the second the value.

    Key/value pairs are good for solving many problems efficiently in a parallel fashion 
    so let us delve into them.
    ```python
    pairs = [('b', 3), ('d', 4), ('a', 6), ('f', 1), ('e', 2)]
    pairs_rdd = sc.parallelize(pairs)
    ```

    ### reduceByKey

    The `.reduceByKey()` method works in a similar way to `.reduce()`, 
    but it performs the reduction on a key-by-key basis.
    The following computes the sum of all values for each key.
    ```python
    pairs = [('b', 3), ('d', 4), ('a', 6), ('f', 1), ('e', 2)]
    pairs_rdd = sc.parallelize(pairs).reduceByKey(lambda x,y: x+y)
    ```
    """)

    st.header("Time for the classic Hello world question !")
    st.markdown(
        "You know the drill. Edit `wordcount()` to count the number of occurences of each word."
    )
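    with st.beta_expander("Possible approach (sketch)"):
        st.markdown("""
        A minimal sketch, assuming `wordcount` receives an RDD of lines of text:

        ```python
        def wordcount(rdd):
            return (
                rdd.flatMap(lambda line: line.split())   # one element per word
                   .map(lambda word: (word, 1))          # (word, 1) pairs
                   .reduceByKey(lambda x, y: x + y)      # sum the 1s for each word
            )
        ```
        """)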
    test_wordcount(sc)
    display_exercise_solved()

    st.subheader("Question 2 - Joins")
    with st.beta_expander("About joins"):
        st.markdown("""
        The `.join()` method joins two RDDs of pairs together on their keys.
        
        ```python
        genders_rdd = sc.parallelize([('1', 'M'), ('2', 'M'), ('3', 'F'), ('4', 'F'), ('5', 'F'), ('6', 'M')])
        grades_rdd = sc.parallelize([('1', 5), ('2', 12), ('3', 7), ('4', 18), ('5', 9), ('6', 5)])

        genders_rdd.join(grades_rdd)
        ```
        """)
    st.markdown("""
    Let's give ourselves a `student-gender` RDD and a `student-grade` RDD. 
    Compute the mean grade for each gender.
    """)

    with st.beta_expander("Hint ?"):
        st.markdown("""
        _This is a long exercise._
        Remember that the mean for a gender equals the sum of all grades 
        divided by the number of grades. 

        You already know how to sum by key, 
        and you can use the `countByKey()` action to get a dict mapping each gender to its number of grades, 
        then use that dict inside a map function to divide. 
        
        Good luck !
        """)
    test_mean_grade_per_gender()
    display_goto_next_section()
Example #9
def display_q2(sc: SparkContext):
    st.title("Operations on RDDs")
    with st.beta_expander("Introduction"):
        st.markdown("""
        RDDs have two sets of parallel operations:

        * transformations: return pointers to new RDDs without computing them; the computation is deferred until an action needs the result.
        * actions: return values to the driver after running the computation. For example, the `collect()` action retrieves all the elements of the distributed RDD to the driver.

        RDD transformations are _lazy_ in the sense that they do not compute their results immediately.

        The following exercises study the usage of the most common Spark RDD operations.
        """)
    st.subheader("Question 1 - Map")
    with st.beta_expander(".map() and flatMap() transformation"):
        st.markdown("""
        The `.map(function)` transformation applies the function given as argument to each of the elements 
        inside the RDD. 

        The following adds one to every number in the RDD, in a distributed manner.
        ```python
        sc.parallelize([1,2,3]).map(lambda num: num+1)
        ```

        The `.flatMap(function)` transformation applies the function given as argument to each element 
        of the RDD, then flattens the result so that there are no nested lists left. 
        
        The following splits each line of the RDD on the comma and returns all numbers in a single RDD.
        ```python
        sc.parallelize(["1,2,3", "2,3,4", "4,5,3"]).flatMap(lambda csv_line: csv_line.split(","))
        ```
        ---
        """)
        st.markdown("""
        What would be the result of:
        ```python
        sc.parallelize(["1,2,3", "2,3,4", "4,5,3"]).map(lambda csv_line: csv_line.split(","))
        ```
        ?

        ---
        """)
    st.markdown("""
    Suppose we have an RDD containing only lists of 2 elements:

    ```python
    matrix = [[1,3], [2,5], [8,9]]
    matrix_rdd = sc.parallelize(matrix)
    ```

    This data structure is reminiscent of a matrix.

    Edit the `op1()` method to multiply the first column (the first coordinate of each element) 
    of the matrix by 2, and subtract 3 from the second column (the second coordinate).
    """)
    test_op1(sc)
    display_exercise_solved()

    st.subheader("Question 2 - Extracting words from sentences")
    st.markdown("""
    Suppose we have an RDD containing sentences:

    ```python
    sentences_rdd = sc.parallelize(
        ['Hi everybody', 'My name is Fanilo', 'and your name is Antoine everybody']
    )
    ```

    Edit `op2()` to return all the words in the RDD, after splitting each sentence on the whitespace character.
        
    """)
    test_op2(sc)
    display_exercise_solved()

    st.subheader("Question 3 - Filtering")
    st.markdown("""
    The `.filter(function)` transformation lets us keep only the elements that satisfy a given predicate function.

    Suppose we have an RDD containing numbers.

    Edit `op3()` to return all the odd numbers.
    """)
    test_op3(sc)
    display_exercise_solved()

    st.subheader("Question 4 - Reduce")
    with st.beta_expander("About reduce"):
        st.markdown("""
        The `.reduce(function)` action reduces all elements of the RDD into a single value 
        using the given function.

        This next example sums elements 2 by 2 in a distributed manner, 
        which produces the sum of all elements in the RDD.
        ```python
        sc.parallelize([1,2,3,4,5]).reduce(lambda x, y: x + y)
        ```

        Do take note that, as in the Hadoop ecosystem, the function used to reduce the dataset 
        should be associative and commutative.

        ---
        """)

    st.markdown("""
    Suppose we have an RDD containing numbers.

    Edit `op4()` to return the sum of 
    all squared odd numbers in the RDD, using the `.reduce()` operation.
    """)
    test_op4(sc)
    display_goto_next_section()