Example #1
0
import os

from client.api.notebook import Notebook


def test_nb_grade_simple_valid():
    """
    Test parsing & running a simple oktest file.
    """
    here = os.path.dirname(__file__)

    nb = Notebook(os.path.join(here, 'oktests/simple.ok'))

    nb.grade('simple_valid')
Example #2
0
import wandb

from client.api.notebook import Notebook


class WandbTrackedOK(object):

    def __init__(self, entity, path, project):
        self.grader = Notebook(path)
        wandb.init(entity=entity, project=project, anonymous="must")
        self.test_map = self.grader.assignment.test_map
        self.pass_dict = {k: 0 for k in self.test_map}
        self.log()

    def grade(self, question, *args, **kwargs):
        result = self.grader.grade(question, *args, **kwargs)
        self.pass_dict[question] = result["passed"]
        self.log()

    def log(self):
        total = sum(self.pass_dict.values())
        wandb.log({"passes": self.pass_dict,
                   "total": total})

    def __del__(self):
        wandb.join()
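Incidentally, the finalizer hook in Python is `__del__`; the `__delete__` name belongs to the descriptor protocol and is never called on ordinary object destruction. A minimal, wandb-free sketch of a finalizer:

```python
import gc

log = []

class Tracked:
    def __del__(self):
        # Called when the instance is garbage-collected.
        log.append("finalized")

t = Tracked()
del t          # drop the last reference
gc.collect()   # not strictly needed under CPython refcounting, but explicit
assert log == ["finalized"]
```

Relying on finalizers to flush external state is fragile; an explicit `close()`-style call is usually safer.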
Example #3
0
    return classify(row, train_20, train_movies.column("Genre"), 3)


# In[181]:


new_test_guesses = test_20.apply(another_classifier)
new_proportion_correct = np.count_nonzero(new_test_guesses == test_movies.column("Genre")) / test_movies.num_rows
new_proportion_correct


# Briefly describe what you tried to improve your classifier. As long as you put in some effort to improving your classifier and describe what you have done, you will receive full credit for this problem.

# Original accuracy: 73%.
# 
# I first tried changing the value of k: increasing it to 20 dropped my accuracy to 67.5%, while decreasing it to 7 raised it to 78%. I then tried appending the staff features to my original 20, which did not help either, lowering my accuracy to 46%. Next I used the staff variables alone, which brought my accuracy down to 59%. Since the only improvement came from lowering k, I finally dropped k to 3, which matched the 78% accuracy I got with k = 7.

# Congratulations: you're done with the required portion of the project! Time to submit.

# In[183]:


_ = ok.submit()

# For your convenience, you can run this cell to run all the tests at once!
import os
print("Running all tests...")
_ = [ok.grade(q[:-3]) for q in os.listdir("tests") if q.startswith('q')]
print("Finished running all tests.")
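The comprehension above derives question names by stripping the three-character `.py` suffix from each test filename. A small sketch with hypothetical filenames:

```python
# Hypothetical directory listing; real test files live under tests/
files = ['q1.py', 'q2.py', '__init__.py', 'notes.txt']
questions = [f[:-3] for f in files if f.startswith('q')]
assert questions == ['q1', 'q2']
```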

Example #4
0
seconds_in_a_decade = ...

# We've put this line in this cell so that it will print
# the value you've given to seconds_in_a_decade when you
# run it.  You don't need to change this.
seconds_in_a_decade


# ## 2.1. Checking your code
# Now that you know how to name things, you can start using the built-in *tests* to check whether your work is correct. Try not to change the contents of the test cells. Running the following cell will test whether you have assigned `seconds_in_a_decade` correctly in Question 3.2. If you haven't, this test will tell you the correct answer. Resist the urge to just copy it, and instead try to adjust your expression. (Sometimes the tests will give hints about what went wrong...)

# In[ ]:


# Test cell; please do not change!
_ = ok.grade('q22')


# ## 2.2. Comments
# You may have noticed this line in the cell above:
# 
#     # Test cell; please do not change!
# 
# That is called a *comment*.  It doesn't make anything happen in Python; Python ignores anything on a line after a #.  Instead, it's there to communicate something about the code to you, the human reader.  Comments are extremely useful.
# 
# <img src="http://imgs.xkcd.com/comics/future_self.png" alt="comic about comments">
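A one-line sketch of a comment in action:

```python
radius = 5  # Python ignores everything after the '#'
assert radius == 5
```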

# ## 2.3. Application: A physics experiment
# 
# On the Apollo 15 mission to the Moon, astronaut David Scott famously replicated Galileo's physics experiment in which he showed that gravity accelerates objects of different mass at the same rate. Because there is no air resistance for a falling object on the surface of the Moon, even two objects with very different masses and densities should fall at the same rate. David Scott compared a feather and a hammer.
# 
#
#
# <!--
# BEGIN QUESTION
# name: q1_1
# manual: false
# -->

# In[30]:

all_unique_causes = np.unique(causes_of_death.column("Cause"))
sorted(all_unique_causes)
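`np.unique` already returns its values in sorted order; plain Python reaches the same result with `sorted(set(...))`. A sketch with toy data (not the real `causes_of_death` table):

```python
causes = ['Cancer', 'Stroke', 'Cancer', 'Diabetes', 'Stroke']
unique_causes = sorted(set(causes))
assert unique_causes == ['Cancer', 'Diabetes', 'Stroke']
```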

# In[31]:

ok.grade("q1_1")

# In[32]:


# This function may be useful for Question 2.
def elem(x):
    return x.item(0)


# **Question 2:** We would like to plot the death rate for each disease over time. To do so, we must create a table with one column for each cause and one row for each year.
#
# Create a table called `causes_for_plotting`. It should have one column called `Year`, and then a column with age-adjusted death rates for each of the causes you found in Question 1. There should be as many of these columns in `causes_for_plotting` as there are causes in Question 1.
#
# *Hint*: Use `pivot`, and think about how the `elem` function might be useful in getting the **Age Adjusted Death Rate** for each cause and year combination.
#
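The pivot reshaping the hint describes (one row per year, one column per cause) can be sketched in plain Python; the records here are made up for illustration:

```python
# Made-up (year, cause, rate) records standing in for the real table
records = [(2000, 'Cancer', 1.2), (2000, 'Stroke', 0.8),
           (2001, 'Cancer', 1.1), (2001, 'Stroke', 0.7)]
years = sorted({r[0] for r in records})
causes = sorted({r[1] for r in records})
# One row per year, one column per cause
pivoted = {year: {cause: None for cause in causes} for year in years}
for year, cause, rate in records:
    pivoted[year][cause] = rate
assert pivoted[2000]['Stroke'] == 0.8
assert list(pivoted) == [2000, 2001]
```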
Example #6
0
# In[4]:


a = 5 * 13 * 31 + 2
b = abs(2**5 - 2**11 - 2**1)  # the question asks for the absolute value, 2018
settings.new_year = max(a, b)
settings.new_year


# Check your work by executing the next cell.

# In[5]:


_ = ok.grade('q11')


# **Question 2.1.** Yuri Gagarin was the first person to travel through outer space.  When he emerged from his capsule upon landing on Earth, he [reportedly](https://en.wikiquote.org/wiki/Yuri_Gagarin) had the following conversation with a woman and girl who saw the landing:
# 
#     The woman asked: "Can it be that you have come from outer space?"
#     Gagarin replied: "As a matter of fact, I have!"
# 
# The cell below contains unfinished code.  Fill in the `...`s so that it prints out this conversation *exactly* as it appears above.

# In[7]:


settings.woman_asking = 'The woman asked:'
woman_quote = '"Can it be that you have come from outer space?"'
gagarin_reply = 'Gagarin replied:'
Example #7
0
    'The Shawshank Redemption (1994)', 'The Godfather (1972)',
    'The Godfather: Part II (1974)', 'Pulp Fiction (1994)',
    "Schindler's List (1993)",
    'The Lord of the Rings: The Return of the King (2003)',
    '12 Angry Men (1957)', 'The Dark Knight (2008)',
    'Il buono, il brutto, il cattivo (1966)',
    'The Lord of the Rings: The Fellowship of the Ring (2001)')

top_10_movies = ...
# We've put this next line here so your table will get printed out when you
# run this cell.
top_10_movies

# In[ ]:

_ = ok.grade('q2_1')

# #### Loading a table from a file
# In most cases, we aren't going to go through the trouble of typing in all the data manually. Instead, we can use our `Table` functions.
#
# `Table.read_table` takes one argument, a path to a data file (a string) and returns a table.  There are many formats for data files, but CSV ("comma-separated values") is the most common.
#
# **Question 2.2.** <br/>The file `imdb.csv` contains a table of information about the 250 highest-rated movies on IMDb.  Load it as a table called `imdb`.

# In[ ]:

imdb = ...
imdb
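A CSV file is just lines of comma-separated values, so the standard library can parse one directly; the rows below are made up, not taken from `imdb.csv`:

```python
import csv
import io

csv_text = "Title,Rating\nThe Godfather (1972),9.2\n12 Angry Men (1957),8.9\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))
assert rows[0]['Title'] == 'The Godfather (1972)'
assert float(rows[1]['Rating']) == 8.9
```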

# In[ ]:
import sqlite3

conn = sqlite3.connect('taxi.db')
lon_bounds = [-74.03, -73.75]
lat_bounds = [40.6, 40.88]

squery = """
    SELECT * FROM taxi
    WHERE pickup_lon BETWEEN {0} AND {1} AND dropoff_lon BETWEEN {0} AND {1}
      AND pickup_lat BETWEEN {2} AND {3} AND dropoff_lat BETWEEN {2} AND {3}
""".format(lon_bounds[0], lon_bounds[1], lat_bounds[0], lat_bounds[1])

all_taxi = pd.read_sql(squery, conn)
all_taxi.head()
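The same BETWEEN filtering can be tried against a throwaway in-memory database; passing parameters with `?` placeholders is generally safer than string formatting. Toy coordinates, not the real `taxi.db`:

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE taxi (pickup_lon REAL, pickup_lat REAL)')
conn.executemany('INSERT INTO taxi VALUES (?, ?)',
                 [(-73.9, 40.7),    # inside the NYC bounding box
                  (-80.0, 25.8)])   # Miami: outside, should be filtered out
query = ('SELECT * FROM taxi WHERE pickup_lon BETWEEN ? AND ? '
         'AND pickup_lat BETWEEN ? AND ?')
rows = conn.execute(query, (-74.03, -73.75, 40.6, 40.88)).fetchall()
assert rows == [(-73.9, 40.7)]
```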

# In[4]:

ok.grade("q1a")

# A scatter plot of pickup locations shows that most of them are on the island of Manhattan. The empty white rectangle is Central Park; cars are not allowed there.

# In[5]:


def pickup_scatter(t):
    plt.scatter(t['pickup_lon'], t['pickup_lat'], s=2, alpha=0.2)
    plt.xlabel('Longitude')
    plt.ylabel('Latitude')
    plt.title('Pickup locations')


plt.figure(figsize=(8, 8))
pickup_scatter(all_taxi)
Example #9
0
# In[25]:

# These lines load the tests.
from client.api.notebook import Notebook
ok = Notebook('lab01.ok')

# Running the following cell will test whether you have assigned `seconds_in_a_decade` correctly in Question 2.2.
#
# Sometimes the tests will give hints about what went wrong. If the test doesn't pass, read the output, adjust your answer to the question, run the answer cell again to update the name `seconds_in_a_decade`, then run this test cell again.
#
# Sometimes the tests will tell you the answer. Rather than copying the answer, try to understand how it was reached.

# In[29]:

# Test cell; please do not change!
_ = ok.grade('q22')

# ### 2.2. Comments
# You may have noticed this line in the cell above:
#
#     # Test cell; please do not change!
#
# That is called a *comment*.  It doesn't make anything happen in Python; Python ignores anything on a line after a #.  Instead, it's there to communicate something about the code to you, the human reader.  Comments are extremely useful.
#
# <img src="http://imgs.xkcd.com/comics/future_self.png" alt="comic about comments">

# ### 2.3. Application: A physics experiment
#
# On the Apollo 15 mission to the Moon, astronaut David Scott famously replicated Galileo's physics experiment in which he showed that gravity accelerates objects of different mass at the same rate. Because there is no air resistance for a falling object on the surface of the Moon, even two objects with very different masses and densities should fall at the same rate. David Scott compared a feather and a hammer.
#
# You can run the following cell to watch a video of the experiment.
Example #10
0
# 
# Try to fix the code above so that you can run the cell and see the intended message instead of an error.

# ### 1.5. The Kernel
# The kernel is a program that executes the code inside your notebook and outputs the results. In the top right of your window, you can see a circle that indicates the status of your kernel. If the circle is empty (⚪), the kernel is idle and ready to execute code. If the circle is filled in (⚫), the kernel is busy running some code. 
# 
# You may run into problems where your kernel is stuck for an excessive amount of time, your notebook is very slow and unresponsive, or your kernel loses its connection. If this happens, try the following steps:
# 1. At the top of your screen, click **Kernel**, then **Interrupt**.
# 2. If that doesn't help, click **Kernel**, then **Restart**. If you do this, you will have to run your code cells from the start of your notebook up until where you paused your work.
# 3. If that doesn't help, restart your server. First, save your work by clicking **File** at the top left of your screen, then **Save and Checkpoint**. Next, click **Control Panel** at the top right. Choose **Stop My Server** to shut it down, then **My Server** to start it back up. Then, navigate back to the notebook you were working on.

# ### 1.6. Submitting your work
# All assignments in the course will be distributed as notebooks like this one, and you will submit your work from the notebook. We will use a system called OK that checks your work and helps you submit. At the top of each assignment, you'll see a cell like the one below that prompts you to identify yourself. Run it to import your autograder tests.

# In[ ]:


# Don't change this cell; just run it.
# These statements import the autograder tests.
from client.api.notebook import Notebook
ok = Notebook('lab00.ok')


# When you finish a question, you need to check your answer by running the grade command below. It's OK to grade multiple times; OK will only grade your final submission for each question.

# In[ ]:


_ = ok.grade("q0")

source = [i['source'] for i in all_tweets]
text = [i['text'] if 'text' in i else i['full_text'] for i in all_tweets]
retweet_count = [i['retweet_count'] for i in all_tweets]
trump = pd.DataFrame(
    {
        'time': time,
        'source': source,
        'text': text,
        'retweet_count': retweet_count
    },
    index=id)
trump.head()

# In[7]:

ok.grade("q1")

# ---
# # Part 2: Tweet Source Analysis
#
# In the following questions, we are going to examine the characteristics of Trump's tweets and the devices used to post them.
#
# First let's examine the source field:

# In[8]:

trump['source'].unique()

# ## Question 2
#
# Notice how sources like "Twitter for Android" or "Instagram" are surrounded by HTML tags. In the cell below, clean up the `source` field by removing the HTML tags from each `source` entry.
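One common approach (an assumption here, not the official solution) is a regular expression that deletes anything between angle brackets; pandas' `str.replace` can apply the same pattern column-wise:

```python
import re

raw = '<a href="http://twitter.com/download/android" rel="nofollow">Twitter for Android</a>'
cleaned = re.sub(r'<[^>]*>', '', raw)  # drop every <...> tag
assert cleaned == 'Twitter for Android'
```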
Example #12
0
# ### Part 1a: Looking Inside and Extracting the Zip Files
#

# In[401]:

my_zip = zipfile.ZipFile(file=dest_path, mode='r')
#my_zip.extractall('data')
data_dir_path = Path('data')  # creates a Path object that points to the data directory
list_names = [x.name for x in data_dir_path.glob('*') if x.is_file()]
list_names

# In[402]:

ok.grade("q1a")

from pathlib import Path
data_dir = Path('data')
my_zip.extractall(data_dir)
get_ipython().system('ls {data_dir}')

# The cell above created a folder called `data`, and in it there should be four CSV files. Open up `legend.csv` to see its contents.
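`Path.glob` plus `is_file()` is how the listing above skips subdirectories; a self-contained sketch in a temporary directory:

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as d:
    (Path(d) / 'legend.csv').write_text('a,b\n')
    (Path(d) / 'subdir').mkdir()            # directories should be skipped
    names = [p.name for p in Path(d).glob('*') if p.is_file()]

assert names == ['legend.csv']
```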

# ### Part 1b: Programmatically Looking Inside the Files

#...
ds100_utils.head('data/businesses.csv', 5)
ds100_utils.head('data/inspections.csv', 5)
ds100_utils.head('data/legend.csv', 5)
ds100_utils.head('data/violations.csv', 5)
Example #13
0
#
# In the `population` table, the `geo` column contains three-letter codes established by the [International Organization for Standardization](https://en.wikipedia.org/wiki/International_Organization_for_Standardization) (ISO) in the [Alpha-3](https://en.wikipedia.org/wiki/ISO_3166-1_alpha-3#Current_codes) standard. We will begin by taking a close look at Bangladesh. Inspect the standard to find the 3-letter code for Bangladesh.

# **Question 1.** Create a table called `b_pop` that has two columns labeled `time` and `population_total`. The first column should contain the years from 1970 through 2015 (including both 1970 and 2015) and the second should contain the population of Bangladesh in each of those years.

# In[3]:

b_pop_1 = population.select('time', 'population_total', 'geo')
b_pop = (b_pop_1.where('geo', are.equal_to('bgd'))
         .drop('geo')
         .where('time', are.above_or_equal_to(1970))
         .where('time', are.below_or_equal_to(2015)))
b_pop

# In[4]:

_ = ok.grade('q1_1')

# Run the following cell to create a table called `b_five` that has the population of Bangladesh every five years. At a glance, it appears that the population of Bangladesh has been growing quickly indeed!

# In[5]:

b_pop.set_format('population_total', NumberFormatter)

fives = np.arange(1970, 2016, 5)  # 1970, 1975, 1980, ...
b_five = b_pop.sort('time').where('time', are.contained_in(fives))
b_five

# **Question 2.** Assign `b_1970_through_2010` to a table that has the same columns as `b_five` and has one row for every five years from 1970 through 2010 (but not 2015). Then, use that table to assign `initial` to an array that contains the population for every five year interval from 1970 to 2010. Finally, assign `changed` to an array that contains the population for every five year interval from 1975 to 2015.
#
# *Hint*: You may find the `exclude` method to be helpful ([Docs](http://data8.org/datascience/_autosummary/datascience.tables.Table.exclude.html)).
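The five-year grid and the shifted initial/changed views can be sketched with plain ranges and slices (illustration only, not the graded answer):

```python
fives = list(range(1970, 2016, 5))   # 1970, 1975, ..., 2015
assert fives == [1970, 1975, 1980, 1985, 1990, 1995, 2000, 2005, 2010, 2015]
initial_years = fives[:-1]           # 1970 through 2010
changed_years = fives[1:]            # 1975 through 2015
assert len(initial_years) == len(changed_years) == 9
```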
Example #14
0
# In[ ]:


def first(values):
    return values.item(0)


latest = ...

latest.relabel(0, 'geo').relabel(1, 'time').relabel(
    2, 'poverty_percent')  # You should *not* change this line.

# In[ ]:

_ = ok.grade('q3_1')

# **Question 3.2.** <br/>Using both `latest` and `population`, create a four-column table called `recent` with one row for each country in `latest`. The four columns should have the following labels and contents:
# 1. `geo` contains the 3-letter country code,
# 1. `poverty_percent` contains the most recent poverty percent,
# 1. `population_total` contains the population of the country in 2010,
# 1. `poverty_total` contains the number of people in poverty **rounded to the nearest integer**, based on the 2010 population and most recent poverty rate.
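The rounding in item 4 is ordinary arithmetic; a sketch with hypothetical numbers:

```python
population_total = 1_000_000   # hypothetical 2010 population
poverty_percent = 12.3         # hypothetical most recent poverty rate
poverty_total = round(population_total * poverty_percent / 100)
assert poverty_total == 123000
```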

# In[ ]:

poverty_and_pop = ...
recent = ...
recent

# In[ ]:
Example #15
0
    
    Output:
      a list of the top n richest neighborhoods as measured by the metric function
    """
    table = (data[["Neighborhood", "SalePrice"]]
             .groupby("Neighborhood")
             .agg(metric)
             .sort_values("SalePrice", ascending=False))
    neighborhoods = list(table.index[:n])
    return neighborhoods

rich_neighborhoods = find_rich_neighborhoods(training_data, 3, np.median)
rich_neighborhoods
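The group-then-rank logic of `find_rich_neighborhoods` can be sketched without pandas (toy prices, hypothetical neighborhood names):

```python
from collections import defaultdict
from statistics import median

# Hypothetical (neighborhood, sale price) pairs
sales = [('A', 100), ('A', 300), ('B', 500), ('B', 700), ('C', 400)]
prices = defaultdict(list)
for hood, price in sales:
    prices[hood].append(price)
# Rank neighborhoods by median sale price, highest first
ranked = sorted(prices, key=lambda h: median(prices[h]), reverse=True)
assert ranked[:2] == ['B', 'C']
```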


# In[7]:


ok.grade("q1b");


# In[ ]:





# ### Question 1c <a name="q1c"></a> 
# 
# We now have a list of neighborhoods we've deemed as richer than others.  Let's use that information to make a new variable `in_rich_neighborhood`.  Write a function `add_rich_neighborhood` that adds an indicator variable which takes on the value 1 if the house is part of `rich_neighborhoods` and the value 0 otherwise.
# 
# **Hint:** [`pd.Series.astype`](https://pandas.pydata.org/pandas-docs/version/0.23.4/generated/pandas.Series.astype.html) may be useful for converting True/False values to integers.
# 
# *The provided tests check that you answered correctly, so that future analyses are not corrupted by a mistake.*
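pandas' `astype(int)` performs this True/False-to-integer conversion column-wise; a plain-Python sketch of the indicator (hypothetical neighborhoods):

```python
rich_neighborhoods = ['B', 'C']          # hypothetical output of Question 1b
hoods = ['A', 'B', 'C', 'A']
in_rich_neighborhood = [int(h in rich_neighborhoods) for h in hoods]
assert in_rich_neighborhood == [0, 1, 1, 0]
```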
Example #16
0

bottom_left = 1

# What properties does a word in the bottom right corner have?

# In[33]:


bottom_right = 3


# In[88]:


_ = ok.grade("q3_0_2")
_ = ok.backup()


# What properties does a word in the top right corner have?

# In[34]:


top_right = 4


# In[90]:


_ = ok.grade("q3_0_3")
Example #17
0
# BEGIN QUESTION
# name: q6a
# points: 1
# -->

# In[6]:


zero_predictor_fp = 0
zero_predictor_fn = sum(Y_train == 1)


# In[7]:


ok.grade("q6a");


# ### Question 6b
# 
# What are the accuracy and recall of `zero_predictor` (classifies every email as ham) on the training set? Do **NOT** use any `sklearn` functions.
# 
# <!--
# BEGIN QUESTION
# name: q6b
# points: 1
# -->
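Accuracy is the fraction of correct predictions and recall is the fraction of actual positives that get flagged; a sketch on a made-up label vector (not the real training set):

```python
# Made-up labels: 1 = spam, 0 = ham
y_train = [0, 0, 0, 1, 1]
predictions = [0] * len(y_train)   # zero_predictor: classify everything as ham
accuracy = sum(p == y for p, y in zip(predictions, y_train)) / len(y_train)
recall = sum(p == 1 and y == 1 for p, y in zip(predictions, y_train)) / sum(y_train)
assert accuracy == 0.6
assert recall == 0.0
```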

# In[8]:

Example #18
0
#
# 1. the absolute value of $2^{5}-2^{11}-2^1$, and
# 2. $5 \times 13 \times 31 + 2$.
#
# Try to use just one statement (one line of code).

# In[ ]:

new_year = ...
new_year

# Check your work by executing the next cell.

# In[ ]:

_ = ok.grade('q11')

# # 2. Text
# Programming doesn't just concern numbers. Text is one of the most common types of values used in programs.
#
# A snippet of text is represented by a **string value** in Python. The word "*string*" is a programming term for a sequence of characters. A string might contain a single character, a word, a sentence, or a whole book.
#
# To distinguish text data from actual code, we demarcate strings by putting quotation marks around them. Single quotes (`'`) and double quotes (`"`) are both valid, but the types of opening and closing quotation marks must match. The contents can be any sequence of characters, including numbers and symbols.
#
# We've seen strings before in `print` statements.  Below, two different strings are passed as arguments to the `print` function.

# In[ ]:

print("I <3", 'Data Science')
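A small sketch of the two quote styles:

```python
single = 'Data Science'
double = "Data Science"
assert single == double   # the quote style does not change the value
```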

# Just like names can be given to numbers, names can be given to string values.  The names and strings aren't required to be similar in any way. Any name can be assigned to any string.
Example #19
0
# <!--
# BEGIN QUESTION
# name: q1
# points: 3
# -->

# In[9]:

# These should be True or False
q1statement1 = False
q1statement2 = True
q1statement3 = True

# In[10]:

ok.grade("q1")

# ### SalePrice vs Gr_Liv_Area
#
# Next, we visualize the association between `SalePrice` and `Gr_Liv_Area`.  The `codebook.txt` file tells us that `Gr_Liv_Area` measures "above grade (ground) living area square feet."
#
# This variable represents the square footage of the house excluding anything underground.  Some additional research (into real estate conventions) reveals that this value also excludes the garage space.

# In[11]:

sns.jointplot(x='Gr_Liv_Area',
              y='SalePrice',
              data=training_data,
              stat_func=None,
              kind="reg",
              ratio=4,
Example #20
0
#def check_get_hashtags(file, hashtag, answer):
#    with open(file) as json_file:
#        statuses = json.load(json_file)
#    other_hashtags = get_hashtags(statuses, hashtag)
#    other_hashtags = [s.replace('#', '') for s in other_hashtags]
#    return other_hashtags == answer

#NEWCELL
ok = Notebook(cf['ok_file'])
_ = ok.auth(inline=False)
results = {
    q[:-3]: ok.grade(q[:-3])
    for q in os.listdir("tests") if q.startswith('q')
}

#NEWCELL
import importlib

import autograde as ag
importlib.reload(ag)


def output_tests(cf, results):
    autograde = {}
    autograde['github_id'] = cf['github_id']
    # This is a selection of variables from the config file.
    for s in cf['variables']:
        if s in globals():
            autograde[s] = eval(s)
Example #21
0
# In the `population` table, the `geo` column contains three-letter codes established by the [International Organization for Standardization](https://en.wikipedia.org/wiki/International_Organization_for_Standardization) (ISO) in the [Alpha-3](https://en.wikipedia.org/wiki/ISO_3166-1_alpha-3#Current_codes) standard. We will begin by taking a close look at Bangladesh. Inspect the standard to find the 3-letter code for Bangladesh.

# **Question 1.** Create a table called `b_pop` that has two columns labeled `time` and `population_total`. The first column should contain the years from 1970 through 2015 (including both 1970 and 2015) and the second should contain the population of Bangladesh in each of those years.

# In[4]:

temp = np.arange(1970, 2016)
bgd = ["bgd"]
holder = population.where("geo", are.contained_in(bgd))
b_pop = holder.where("time", are.contained_in(temp))
b_pop = b_pop.drop("geo")
b_pop

# In[5]:

_ = ok.grade('q1_1')

# Run the following cell to create a table called `b_five` that has the population of Bangladesh every five years. At a glance, it appears that the population of Bangladesh has been growing quickly indeed!

# In[6]:

b_pop.set_format('population_total', NumberFormatter)

fives = np.arange(1970, 2016, 5)  # 1970, 1975, 1980, ...
b_five = b_pop.sort('time').where('time', are.contained_in(fives))
b_five

# **Question 2.** Assign `b_1970_through_2010` to a table that has the same columns as `b_five` and has one row for every five years from 1970 through 2010 (but not 2015). Then, use that table to assign `initial` to an array that contains the population for every five year interval from 1970 to 2010. Finally, assign `changed` to an array that contains the population for every five year interval from 1975 to 2015.
#
# *Hint*: You may find the `exclude` method to be helpful ([Docs](http://data8.org/datascience/_autosummary/datascience.tables.Table.exclude.html)).
Example #22
0
#
# <!--
# BEGIN QUESTION
# name: q1ci
# points: 3
# -->

# In[5]:

ins_named = ins.merge(bus[['name', 'address', 'bid']], how='left')

ins_named.head()

# In[6]:

ok.grade("q1ci")

# In[7]:

worst_restaurant = (ins_named[['score', 'name']]
                    .sort_values('score', ascending=True)
                    .iloc[0])
worst_restaurant

# **Use the cell above to identify the restaurant** with the lowest inspection scores ever. Be sure to include the name of the restaurant as part of your answer in the cell below. You can also head to yelp.com and look up the reviews page for this restaurant. Feel free to add anything interesting you want to share.
#
# <!--
# BEGIN QUESTION
# name: q1cii
# points: 1
# manual: True
Example #23
0
# #### Question 1.1
# Set `expected_row_sum` to the number that you __expect__ will result from summing all proportions in each row, excluding the first six columns.
#
# <!--
# BEGIN QUESTION
# name: q1_1
# -->

# In[315]:

# Set expected_row_sum to a number that's the (approximate) sum of each row of word proportions.
expected_row_sum = 1

# In[316]:

ok.grade("q1_1")

# This dataset was extracted from [a dataset from Cornell University](http://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html). After transforming the dataset (e.g., converting the words to lowercase, removing the naughty words, and converting the counts to frequencies), we created this new dataset containing the frequency of 5000 common words in each movie.

# In[317]:

print('Words with frequencies:', movies.drop(np.arange(6)).num_columns)
print('Movies with genres:', movies.num_rows)

# ## 1.1. Word Stemming
# The columns other than "Title", "Genre", "Year", "Rating", "# Votes" and "# Words" in the `movies` table are all words that appear in some of the movies in our dataset.  These words have been *stemmed*, or abbreviated heuristically, in an attempt to make different [inflected](https://en.wikipedia.org/wiki/Inflection) forms of the same base word into the same string.  For example, the column "manag" is the sum of proportions of the words "manage", "manager", "managed", and "managerial" (and perhaps others) in each movie. This is a common technique used in machine learning and natural language processing.
#
# Stemming makes it a little tricky to search for the words you want to use, so we have provided another table that will let you see examples of unstemmed versions of each stemmed word.  Run the code below to load it.
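A toy suffix-stripping stemmer illustrates the idea; the real dataset was built with a proper stemming algorithm, and these suffix rules are invented for the example:

```python
def naive_stem(word):
    # Invented suffix rules for illustration; not the dataset's actual stemmer.
    for suffix in ('erial', 'ement', 'ed', 'er', 'e'):
        if word.endswith(suffix) and len(word) - len(suffix) >= 5:
            return word[:len(word) - len(suffix)]
    return word

words = ['manage', 'manager', 'managed', 'managerial']
assert all(naive_stem(w) == 'manag' for w in words)
```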

# In[318]:
Example #24
0
np.average(raw_compensation.column("Total Pay"))


# You should see an error. Let's examine why this error occurred by looking at the values in the "Total Pay" column. Use the `type` function and set `total_pay_type` to the type of the first value in the "Total Pay" column.

# In[ ]:


total_pay_type = ...
total_pay_type


# In[ ]:


_ = ok.grade('q1_1')


# **Question 1.2.** <br/>You should have found that the values in "Total Pay" column are strings (text). It doesn't make sense to take the average of the text values, so we need to convert them to numbers if we want to do this. Extract the first value in the "Total Pay" column.  It's Mark Hurd's pay in 2015, in *millions* of dollars.  Call it `mark_hurd_pay_string`.
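Once extracted, a string like this could be converted to a number by stripping the dollar sign; the value and formatting here are hypothetical, and the real column may differ:

```python
# Hypothetical pay string; the real column's formatting may differ
pay_string = '$53.25'
pay_in_millions = float(pay_string.lstrip('$'))
assert pay_in_millions == 53.25
```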

# In[ ]:


mark_hurd_pay_string = ...
mark_hurd_pay_string


# In[ ]:


_ = ok.grade('q1_2')
Example #25
0
#
# 1. the absolute value of $2^{5}-2^{11}-2^1$, and
# 2. $5 \times 13 \times 31 + 2$.
#
# Try to use just one statement (one line of code).

# In[ ]:

new_year = ...
new_year

# Check your work by executing the next cell.

# In[ ]:

_ = ok.grade('q11')

# ## 2. Text
# Programming doesn't just concern numbers. Text is one of the most common types of values used in programs.
#
# A snippet of text is represented by a **string value** in Python. The word "*string*" is a programming term for a sequence of characters. A string might contain a single character, a word, a sentence, or a whole book.
#
# To distinguish text data from actual code, we demarcate strings by putting quotation marks around them. Single quotes (`'`) and double quotes (`"`) are both valid, but the types of opening and closing quotation marks must match. The contents can be any sequence of characters, including numbers and symbols.
#
# We've seen strings before in `print` statements.  Below, two different strings are passed as arguments to the `print` function.

# In[ ]:

print("I <3", 'Data Science')

# Just like names can be given to numbers, names can be given to string values.  The names and strings aren't required to be similar in any way. Any name can be assigned to any string.
Example #26
0
# <!--
# BEGIN QUESTION
# name: q1_0
# -->

# In[35]:


# Set expected_row_sum to a number that's the (approximate) sum of each row of word proportions.
expected_row_sum = 1


# In[36]:


ok.grade("q1_0");


# This dataset was extracted from [a dataset from Cornell University](http://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html). After transforming the dataset (e.g., converting the words to lowercase, removing the naughty words, and converting the counts to frequencies), we created this new dataset containing the frequency of 5000 common words in each movie.

# In[37]:


print('Words with frequencies:', movies.drop(np.arange(5)).num_columns) 
print('Movies with genres:', movies.num_rows)


# ## 1.1. Word Stemming
# The columns other than "Title", "Year", "Rating", "Genre", and "# Words" in the `movies` table are all words that appear in some of the movies in our dataset.  These words have been *stemmed*, or abbreviated heuristically, in an attempt to make different [inflected](https://en.wikipedia.org/wiki/Inflection) forms of the same base word into the same string.  For example, the column "manag" is the sum of proportions of the words "manage", "manager", "managed", and "managerial" (and perhaps others) in each movie. This is a common technique used in machine learning and natural language processing.
# 
# Stemming makes it a little tricky to search for the words you want to use, so we have provided another table that will let you see examples of unstemmed versions of each stemmed word.  Run the code below to load it.
Example #27
0
train_email_nan = original_training_data['email'].isna()
original_training_data['email'] = original_training_data['email'].fillna("")
print(sum(train_email_nan))

#Test Set
test_subject_nan = test['subject'].isna()
test['subject'] = test['subject'].fillna("")
print(sum(test_subject_nan))

test_email_nan = test['email'].isna()
test['email'] = test['email'].fillna("")
print(sum(test_email_nan))

# In[5]:

ok.grade("q1a")

# ### Question 1b
#
# In the cell below, print the text of the first ham and the first spam email in the original training set.
#
# *The provided tests just ensure that you have assigned `first_ham` and `first_spam` to rows in the data, but only the hidden tests check that you selected the correct observations.*
#
# <!--
# BEGIN QUESTION
# name: q1b
# points: 1
# -->

# In[6]:
Example #28
0
np.average(raw_compensation.column("Total Pay"))


# You should see an error. Let's examine why this error occurred by looking at the values in the "Total Pay" column. Use the `type` function and set `total_pay_type` to the type of the first value in the "Total Pay" column.

# In[ ]:


total_pay_type = ...
total_pay_type


# In[ ]:


_ = ok.grade('q1_1')


# **Question 1.2.** <br/>You should have found that the values in "Total Pay" column are strings (text). It doesn't make sense to take the average of the text values, so we need to convert them to numbers if we want to do this. Extract the first value in the "Total Pay" column.  It's Mark Hurd's pay in 2015, in *millions* of dollars.  Call it `mark_hurd_pay_string`.

# In[ ]:


mark_hurd_pay_string = ...
mark_hurd_pay_string


# In[ ]:


_ = ok.grade('q1_2')
Example #29
0
    'The Godfather: Part II (1974)', 'Pulp Fiction (1994)',
    "Schindler's List (1993)",
    'The Lord of the Rings: The Return of the King (2003)',
    '12 Angry Men (1957)', 'The Dark Knight (2008)',
    'Il buono, il brutto, il cattivo (1966)',
    'The Lord of the Rings: The Fellowship of the Ring (2001)')

top_10_movies = Table().with_columns(
    "Rating", top_10_movie_ratings,
    "Name", top_10_movie_names)
# We've put this next line here so your table will get printed out when you
# run this cell.
top_10_movies

# In[8]:

_ = ok.grade('q2_1')

# #### Loading a table from a file
# In most cases, we aren't going to go through the trouble of typing in all the data manually. Instead, we can use our `Table` functions.
#
# `Table.read_table` takes one argument, a path to a data file (a string) and returns a table.  There are many formats for data files, but CSV ("comma-separated values") is the most common.
#
# **Question 2.2.** <br/>The file `imdb.csv` contains a table of information about the 250 highest-rated movies on IMDb.  Load it as a table called `imdb`.

# In[9]:

imdb = Table.read_table("imdb.csv")
imdb

# In[10]:
Example #30
0
        if pd.isnull(value):
            x.iloc[index] = 'Missing'
    return x

for i in original_training_data.columns:
    original_training_data[i] = f(original_training_data[i])
    post_missing += [1 for k in original_training_data[i] if pd.isnull(k)]

print(f"There are now {sum(post_missing)} missing values.")



# In[46]:


ok.grade("q1a");


# ### Question 1b
# 
# In the cell below, print the text of the first ham and the first spam email in the original training set.
# 
# *The provided tests just ensure that you have assigned `first_ham` and `first_spam` to rows in the data, but only the hidden tests check that you selected the correct observations.*
# 
# <!--
# BEGIN QUESTION
# name: q1b
# points: 1
# -->

# In[47]: