Example #1
  - [How mrjob is run](https://pythonhosted.org/mrjob/guides/concepts.html#how-your-program-is-run)
  - [Adding passthrough options](https://pythonhosted.org/mrjob/job.html#mrjob.job.MRJob.add_passthrough_option)
  - [An example of someone solving similar problems](http://arunxjacob.blogspot.com/2013/11/hadoop-streaming-with-mrjob.html)

Finally, if you find yourself processing a lot of special cases, you are
probably doing it wrong.  For example, the MapReduce jobs for
`Top100WordsSimpleWikipediaPlain`, `Top100WordsSimpleWikipediaText`, and
`Top100WordsSimpleWikipediaNoMetaData` are less than 150 lines of code
(including generous blank lines and boilerplate code).
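For a sense of scale, a complete mrjob job really is this short.  A minimal word-count sketch in the mrjob style (a hedged illustration, not one of the course solutions; the class name is made up):

```python
from mrjob.job import MRJob
import re

WORD_RE = re.compile(r"[\w']+")


class MRWordCount(MRJob):
    def mapper(self, _, line):
        # emit one (word, 1) pair per token on the line
        for word in WORD_RE.findall(line):
            yield word.lower(), 1

    def reducer(self, word, counts):
        # counts is an iterator over the 1s emitted above
        yield word, sum(counts)


if __name__ == '__main__':
    MRWordCount.run()
```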
"""

from lib import QuestionList, Question, StringNumberListValidateMixin, JsonValidateMixin, TupleListValidateMixin, EntropyValidateMixin
QuestionList.set_name("mr")


class TupleNumberListValidateMixin(TupleListValidateMixin):
    @classmethod
    def list_length(cls):
        return 100

    @classmethod
    def tuple_validators(cls):
        return (cls.validate_tuple, cls.validate_number)


@QuestionList.add
class Top100WordsSimpleWikipediaPlain(StringNumberListValidateMixin, Question):
    """
Example #2
      - Look for commonly repeated threads (e.g. you might end up picking up the photo credits).
      - Long captions are often not lists of people.  The cutoff is subjective, so to be definitive, *let's set that cutoff at 250 characters*.

  2. You will want to separate the captions based on various forms of punctuation.  Try using `re.split`, which is more sophisticated than `string.split`.

  3. You might find a person named "ra Lebenthal".  There is no one by this name.  Can anyone spot what's happening here?

  4. This site is pretty formal and likes to say things like "Mayor Michael Bloomberg" after his election but "Michael Bloomberg" before his election.  Can you find other ('optional') titles that are being used?  They should probably be filtered out because they ultimately refer to the same person: "Michael Bloomberg."

For the analysis, we think of the problem in terms of a [network](http://en.wikipedia.org/wiki/Computer_network) or a [graph](http://en.wikipedia.org/wiki/Graph_%28mathematics%29).  Any time a pair of people appear in a photo together, that is considered a link.  What we have described is more appropriately called an (undirected) [multigraph](http://en.wikipedia.org/wiki/Multigraph) with no self-loops, but this has an obvious analog in terms of an undirected [weighted graph](http://en.wikipedia.org/wiki/Graph_%28mathematics%29#Weighted_graph).  In this problem, we will analyze the social graph of the New York social elite.

For this problem, we recommend using python's `networkx` library.
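A hedged sketch of building that weighted graph with `networkx` (the `captions` list of name lists is an assumption standing in for your parsed data; the real names come from the scraped captions):

```python
import itertools
import networkx as nx

# toy stand-in for the parsed photo captions: one list of names per photo
captions = [
    ['Michael Bloomberg', 'Jean Shafiroff'],
    ['Michael Bloomberg', 'Jean Shafiroff', 'Mark Gilbertson'],
]

G = nx.Graph()
for names in captions:
    # every unordered pair appearing together adds 1 to that edge's weight
    for a, b in itertools.combinations(sorted(set(names)), 2):
        if G.has_edge(a, b):
            G[a][b]['weight'] += 1
        else:
            G.add_edge(a, b, weight=1)

# weighted degree (an edge of weight 2 counts for 2), sorted descending;
# slice to 100 on the real data, as the Degree question below asks
degrees = dict(G.degree(weight='weight'))
top = sorted(degrees.items(), key=lambda kv: kv[1], reverse=True)[:100]
```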
"""

from lib import QuestionList, Question, StringNumberListValidateMixin, TupleListValidateMixin
QuestionList.set_name("graph")

@QuestionList.add
class Degree(StringNumberListValidateMixin, Question):
  """
  The simplest question you might want to ask is 'who is the most popular'?  The easiest way to answer this question is to look at how many connections everyone has.  Return the top 100 people and their degree.  Remember that if an edge of the graph has weight 2, it counts for 2 in the degree.
  """
  def solution(self):
    """
    A list of 100 tuples of (name, degree) in descending order of degree
    """
    #return [('Alec Baldwin', 69)] * 100
    return [(u'Jean Shafiroff', 452), (u'Mark Gilbertson', 372), (u'Gillian Miniter', 345), (u'Alexandra Lebenthal', 279), (u'Geoffrey Bradfield', 262), (u'Somers Farkas', 215), (u'Andrew Saffir', 205), (u'Debbie Bancroft', 202), (u'Yaz Hernandez', 198), (u'Kamie Lightburn', 198), (u'Alina Cho', 196), (u'Eleanora Kennedy', 191), (u'Jamee Gregory', 188), (u'Sharon Bush', 187), (u'Muffie Potter Aston', 170), (u'Allison Aston', 168), (u'Mario Buatta', 166), (u'Lucia Hwong Gordon', 162), (u'Lydia Fenet', 160), (u'Bonnie Comley', 160), (u'Karen LeFrak', 157), (u'Patrick McMullan', 154), (u'Deborah Norville', 153), (u'John', 152), (u'Bettina Zilkha', 147), (u'Barbara Tober', 139), (u'Michael Bloomberg', 139), (u'Martha Stewart', 138), (u'Audrey Gruss', 136), (u'Stewart Lane', 136), (u'Liz Peek', 134), (u'Grace Meigher', 128), (u'Diana Taylor', 126), (u'Daniel Benedict', 126), (u'Kipton Cronkite', 126), (u'Roric Tobin', 125), (u'Nicole Miller', 125), (u'Rosanna Scotto', 124), (u'Margo Langenberg', 121), (u'Fe Fendi', 121), (u'Martha Glass', 120), (u'Janna Bullock', 120), (u'Adelina Wong Ettelson', 119), (u'Barbara Regna', 119), (u'Elizabeth Stribling', 118), (u'Leonard Lauder', 118), (u'Couri Hay', 118), (u'Margaret Russell', 117), (u'Alexandra Lind Rose', 117), (u'Lisa Anastos', 116), (u'Jennifer Creel', 116), (u'Dennis Basso', 115), (u'Julia Koch', 114), (u'Amy Fine Collins', 113), (u'Gregory Long', 113), (u'Sylvester Miniter', 112), (u'Wendy Carduner', 111), (u'Nathalie Kaplan', 108), (u'Deborah Roberts', 107), (u'Michele Herbert', 107), (u'Stephanie Winston Wolkoff', 105), (u'Dayssi Olarte de Kanavos', 105), (u'Gerald Loughlin', 105), (u'David', 105), (u'CeCe Black', 104), (u'Hilary Geary Ross', 104), (u'Karen Klopp', 104), (u'Fernanda Kellogg', 104), (u'Clare McKeon', 103), (u'Coco Kopelman', 103), (u'Alexia Hamm Ryan', 102), (u'Russell Simmons', 101), (u'Michael', 101), (u'Coralie Charriol Paul', 101), (u'Richard Johnson', 100), (u'Mary Davidson', 99), (u'Fern Mallis', 99), (u'Felicia Taylor', 99), (u'Alec Baldwin', 98), (u'Wilbur Ross', 98), (u'Frederick Anderson', 98), (u'Susan Shin', 98), (u'Amy Hoadley', 98), (u'Evelyn Lauder', 96), (u'Dawne Marie Grannum', 96), (u'Jonathan Tisch', 95), (u'Donna Karan', 94), (u'Melanie Holland', 93), (u'Suzanne Cochran', 92), (u'Pamela Fiori', 92), (u'Liliana Cavendish', 92), (u'Paula Zahn', 91), (u'Kelly Rutherford', 91), (u'Jonathan Farkas', 91), (u'Tory Burch', 91), (u'Georgina Schaeffer', 90), (u'Peter', 89), (u'Lizzie Tisch', 89), (u'Lauren Bush', 88), (u'Caryn Zucker', 88)]


@QuestionList.add
Example #3
# -*- coding: utf-8-*-
from __future__ import unicode_literals

from numbers import Number

from lib import QuestionList, Question, TupleListValidateMixin, catch_validate_exception
QuestionList.set_name("sql")
"""
The city of New York does restaurant inspections and assigns a grade.  Inspection data from the last 4 years are available [here](https://s3.amazonaws.com/thedataincubator/coursedata/nyc_inspection_data.zip).

The file `RI_Webextract_BigApps_Latest.xls` contains a description of each of the datafiles.  Take a look and then load the csv-formatted `*.txt` files into Postgresql as five tables:
1. `actions`
2. `cuisines`
3. `violations`
4. `grades` (from `WebExtract.txt`)
5. `boroughs` (from `RI_Webextract_BigApps_Latest.xls`)

**Hints:**
1. Postgresql has a [`\copy` command](http://www.postgresql.org/docs/9.2/static/app-psql.html#APP-PSQL-META-COMMANDS-COPY) that can both save and load files in various formats.  It is a convenience wrapper for the [`copy` command](http://www.postgresql.org/docs/9.2/static/sql-copy.html) but behaves better (e.g. relative paths).

2. The files may contain malformed text.  Unfortunately, this is all too common.  As a stopgap, remember that `iconv` is a unix utility that can convert files between different text encodings.

3. For more sophisticated needs, a good strategy is to write simple python scripts that will reparse files.  For example, commas (',') within a single field will trick many csv parsers into breaking up the field.  Write a python script that converts these 'inadvertent' delimiters into semicolons (';').
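A hedged sketch of hint 3 (file names are placeholders): the `csv` module already understands quoted fields, so rewriting embedded commas as semicolons takes only a few lines:

```python
import csv

# rewrite commas *inside* fields as semicolons so that naive loaders
# (including `\copy`) see an unambiguous delimiter
with open('WebExtract.txt', 'rb') as fin, open('WebExtract_clean.txt', 'wb') as fout:
    writer = csv.writer(fout)
    for row in csv.reader(fin):
        writer.writerow([field.replace(',', ';') for field in row])
```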
"""


class GroupbyValidateMixin(TupleListValidateMixin):
    @classmethod
    def list_length(cls):
        return cls._list_length
Example #4
# -*- coding: utf-8-*-

from numbers import Number

from lib import QuestionList, Question, TupleListValidateMixin, catch_validate_exception
QuestionList.set_name("sql")

"""
The city of New York does restaurant inspections and assigns a grade.  Inspection data from the last 4 years are available [here](https://s3.amazonaws.com/thedataincubator/coursedata/nyc_inspection_data.zip).

The file `RI_Webextract_BigApps_Latest.xls` contains a description of each of the datafiles.  Take a look and then load the csv-formatted `*.txt` files into Postgresql as five tables:
1. `actions`
2. `cuisines`
3. `violations`
4. `grades` (from `WebExtract.txt`)
5. `boroughs` (from `RI_Webextract_BigApps_Latest.xls`)

**Hints:**
1. It is recommended to use sqlite3 for this project.  Postgresql can work but will be more difficult to set up properly on Digital Ocean.  If you do use sqlite, in order to do mathematical calculations like square root, you will need to compile and install the extension described on the [wiki](https://sites.google.com/a/thedataincubator.com/the-data-incubator-wiki/course-information-and-logistics/getting-started/setup).  A minimal sqlite3 loading sketch follows these hints.

2. Postgresql has a [`\copy` command](http://www.postgresql.org/docs/9.2/static/app-psql.html#APP-PSQL-META-COMMANDS-COPY) that can both save and load files in various formats.  It is a convenience wrapper for the [`copy` command](http://www.postgresql.org/docs/9.2/static/sql-copy.html) but behaves better (e.g. relative paths).

3. The files may contain malformed text.  Unfortunately, this is all too common.  As a stopgap, remember that `iconv` is a unix utility that can convert files between different text encodings.

4. For more sophisticated needs, a good strategy is to write simple python scripts that will reparse files.  For example, commas (',') within a single field will trick many csv parsers into breaking up the field.  Write a python script that converts these 'inadvertent' delimiters into semicolons (';').
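As promised above, a minimal sketch of loading one of the csv-formatted files into sqlite3 (the table/file pairing follows the problem statement; the header row and database name are assumptions):

```python
import csv
import sqlite3

conn = sqlite3.connect('nyc_inspections.db')
with open('WebExtract.txt', 'rb') as f:
    reader = csv.reader(f)
    header = next(reader)                      # assumes a header row
    cols = ', '.join('"%s"' % c.strip() for c in header)
    marks = ', '.join('?' * len(header))
    conn.execute('CREATE TABLE grades (%s)' % cols)
    # every remaining row is inserted as-is; clean malformed rows first
    conn.executemany('INSERT INTO grades VALUES (%s)' % marks, reader)
conn.commit()
```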
"""

class GroupbyValidateMixin(TupleListValidateMixin):
  @classmethod
  def list_length(cls):
    return cls._list_length
Example #5
Your model will be assessed based on the root mean squared error of the number of stars you predict.  There is a reference solution, which should not be too hard to beat; it has a score of 1.

**Download the data here**: http://thedataincubator.s3.amazonaws.com/coursedata/mldata/yelp_train_academic_dataset_review.json.gz


## Download and parse the data

The data is in the same format as in `ml.py`.

## Helpful notes:
- You may run into trouble with the size of your models and Heroku's memory limit.  This is a major concern in real-world applications: your production environment will likely not be that different from Heroku, so being able to deploy there matters, and companies don't want to hire data scientists who cannot cope with this.  Think about what information the different stages of your pipeline need and how you can reduce the memory footprint.
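One way to shrink the footprint (a hedged suggestion, not a requirement): a hashing vectorizer stores no vocabulary, so the serialized pipeline stays small:

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline

# HashingVectorizer is stateless: there is no fitted vocabulary to pickle
pipeline = Pipeline([
    ('vec', HashingVectorizer(n_features=2 ** 18)),
    ('reg', Ridge()),
])
# pipeline.fit(texts, stars); the dumped model stays compact
```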

"""

from lib import QuestionList, Question, list_or_dict, ListValidateMixin, YelpListOrDictValidateMixin
QuestionList.set_name("nlp")


class NLPValidateMixin(YelpListOrDictValidateMixin, Question):
  @classmethod
  def fields(cls):
    return ['text']

  @classmethod
  def _test_json(cls):
    return [
      {"votes": {"funny": 0, "useful": 0, "cool": 0}, "user_id": "WsGQfLLy3YlP_S9jBE3j1w", "review_id": "kzFlI35hkmYA_vPSsMcNoQ", "stars": 5, "date": "2012-11-03", "text": "Love it!!!!! Love it!!!!!! love it!!!!!!!   Who doesn't love Culver's!", "type": "review", "business_id": "LRKJF43s9-3jG9Lgx4zODg"},
      {"votes": {"funny": 0, "useful": 0, "cool": 0}, "user_id": "Veue6umxTpA3o1eEydowZg", "review_id": "Tfn4EfjyWInS-4ZtGAFNNw", "stars": 3, "date": "2013-12-30", "text": "Everything was great except for the burgers they are greasy and very charred compared to other stores.", "type": "review", "business_id": "LRKJF43s9-3jG9Lgx4zODg"},
      {"votes": {"funny": 0, "useful": 0, "cool": 0}, "user_id": "u5xcw6LCnnMhddoxkRIgUA", "review_id": "ZYaS2P5EmK9DANxGTV48Tw", "stars": 5, "date": "2010-12-04", "text": "I really like both Chinese restaurants in town.  This one has outstanding crab rangoon.  Love the chicken with snow peas and mushrooms and General Tso Chicken.  Food is always ready in 10 minutes which is accurate.  Good place and they give you free pop.", "type": "review", "business_id": "RgDg-k9S5YD_BaxMckifkg"},
      {"votes": {"funny": 0, "useful": 0, "cool": 0}, "user_id": "kj18hvJRPLepZPNL7ySKpg", "review_id": "uOLM0vvnFdp468ofLnszTA", "stars": 3, "date": "2011-06-02", "text": "Above average takeout with friendly staff. The sauce on the pan fried noodle is tasty. Dumplings are quite good.", "type": "review", "business_id": "RgDg-k9S5YD_BaxMckifkg"},
      {"votes": {"funny": 0, "useful": 0, "cool": 0}, "user_id": "L5kqM35IZggaPTpQJqcgwg", "review_id": "b3u1RHmZTNRc0thlFmj2oQ", "stars": 4, "date": "2012-05-28", "text": "We order from Chang Jiang often and have never been disappointed.  The menu is huge, and can accomodate anyone's taste buds.  The service is quick, usually ready in 10 minutes.", "type": "review", "business_id": "RgDg-k9S5YD_BaxMckifkg"}
Example #6
from lib import QuestionList, Question, StringNumberListValidateMixin, \
TupleListValidateMixin, catch_validate_exception

QuestionList.set_name("spark_pagerank")
"""
# The PageRank Algorithm

In this assignment, you'll run PageRank on a list of connected citations in High Energy
Physics. The PageRank algorithm assigns a measure of importance (a "rank") to each document 
in a set based on how many documents have links to it. 

Download the dataset of citations with
`s3cmd get s3://thedataincubator-course/spark_pagerank/Cit-HepPh.txt`.
Here's the original source and description: http://snap.stanford.edu/data/ca-AstroPh.html

## Implementation: some hints

There are two important RDDs you should be making:
	1. (pageID, linkList) : the list of neighbors for each page
	2. (pageID, rank) : contains the current rank for each page

A page's rank is a sum of "contributions" from each of its neighbors, where a neighbor's
contribution is its own rank divided by its number of neighbors.  To calculate the ranks,
make repeated passes over the set of pages, updating each page's rank on every pass until
the ranks converge.

Algorithm steps:
	1. Initialize each page's rank to 1.0.
	2. On each iteration, have page p send a contribution of rank(p) / numNeighbors(p) to its 
	neighbors (i.e. the pages it has links to).
	3. Set each page's rank to 0.15 + 0.85 * contributionsReceived.  The last two steps repeat
	for several iterations, during which the algorithm converges to the correct PageRank values.
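A hedged PySpark sketch of these steps (it assumes a live `SparkContext` named `sc`, and that `Cit-HepPh.txt` contains whitespace-separated `source target` pairs with `#` comment lines, per the SNAP format):

```python
def contributions(pair):
    # pair is (pageID, (neighbor_list, rank))
    page, (neighbors, rank) = pair
    neighbors = list(neighbors)
    return [(n, rank / len(neighbors)) for n in neighbors]

lines = sc.textFile('Cit-HepPh.txt').filter(lambda l: not l.startswith('#'))
links = lines.map(lambda l: tuple(l.split())).distinct().groupByKey().cache()
ranks = links.mapValues(lambda _: 1.0)   # step 1: every rank starts at 1.0

for _ in range(10):   # steps 2-3, repeated until the ranks (roughly) converge
    contribs = links.join(ranks).flatMap(contributions)
    ranks = (contribs.reduceByKey(lambda a, b: a + b)
                     .mapValues(lambda c: 0.15 + 0.85 * c))
```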
Example #7
class q1_tsm_MH(BaseEstimator, RegressorMixin, TransformerMixin):
    def __init__(self):
        self.q1_tsm_MH = q1_city_year_month_hour_model

    def transform(self, X):
        city_year_month_hour = str(X[12] + '-' + X[1] + '-' + X[3])
        prediction = float(self.q1_tsm_MH[city_year_month_hour])
        return prediction


'''*************************************************************************************************************************************'''

from numbers import Number
from lib import QuestionList, Question, list_or_dict, catch_validate_exception

QuestionList.set_name("ts")


class TimeSeriesRecordMixin(object):
    @classmethod
    def _test_txt(cls):
        return [
            u"2000 01 01 00   -11   -72 10197   220    26     4     0     0 bos",
            u"2000 01 01 01    -6   -78 10206   230    26     2     0 -9999 bos",
            u"2000 01 01 02   -17   -78 10211   230    36     0     0 -9999 bos",
            u"2000 01 01 03   -17   -78 10214   230    36     0     0 -9999 bos",
            u"2000 01 01 04   -17   -78 10216   230    36     0     0 -9999 bos",
        ]

    @classmethod
    def get_test_cases(cls):
Example #8

## A few helpful notes about performance.

1. To deploy a model (get a trained model into Heroku), we suggest using the [`dill` library](https://pypi.python.org/pypi/dill) or [`joblib`](http://scikit-learn.org/stable/modules/model_persistence.html) to save it to disk and check it into git.  This allows you to train the model offline in another file and then load and run it here.  The model is way too complicated to be trained in real-time!

2. Make sure you load the `dill` file upon server start, not upon a call to `solution`.  This can be done by loading the model into the global scope.  The model is way too complicated to be even loaded in real-time!

3. Make sure you call `predict` once per call to `solution`.  This can be done because `predict` is made to take a list of elements.

4. You probably want to use GridSearchCV to find the best hyperparameters by splitting the data into training and test.  But for the final model that you submit, don't forget to retrain on all your data (training and test) with these best parameters.
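A minimal sketch of notes 1 and 2 together (file names are placeholders, and `DummyRegressor` merely stands in for your real pipeline):

```python
import dill
from sklearn.dummy import DummyRegressor

# offline, in your training script: fit and serialize
model = DummyRegressor().fit([[0.0], [1.0]], [3.5, 4.0])
with open('model.dill', 'wb') as f:
    dill.dump(model, f)

# in this file, at import time (i.e. on server start), so that each call
# to `solution` does nothing heavier than one `predict` on a list:
with open('model.dill', 'rb') as f:
    MODEL = dill.load(f)
```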
"""

from lib import (QuestionList, Question, list_or_dict, catch_validate_exception,
  YelpListOrDictValidateMixin)
QuestionList.set_name("ml")


class MLValidateMixin(YelpListOrDictValidateMixin, Question):
  @classmethod
  def fields(cls):
    return cls._fields

  @classmethod
  def _test_json(cls):
    return [
      {"business_id": "vcNAWiLM4dR7D2nwwJ7nCA", "full_address": "4840 E Indian School Rd\nSte 101\nPhoenix, AZ 85018", "hours": {"Tuesday": {"close": "17:00", "open": "08:00"}, "Friday": {"close": "17:00", "open": "08:00"}, "Monday": {"close": "17:00", "open": "08:00"}, "Wednesday": {"close": "17:00", "open": "08:00"}, "Thursday": {"close": "17:00", "open": "08:00"}}, "open": True, "categories": ["Doctors", "Health & Medical"], "city": "Phoenix", "review_count": 7, "name": "Eric Goldberg, MD", "neighborhoods": [], "longitude": -111.98375799999999, "state": "AZ", "stars": 3.5, "latitude": 33.499313000000001, "attributes": {"By Appointment Only": True}, "type": "business"},
      {"business_id": "JwUE5GmEO-sH1FuwJgKBlQ", "full_address": "6162 US Highway 51\nDe Forest, WI 53532", "hours": {}, "open": True, "categories": ["Restaurants"], "city": "De Forest", "review_count": 26, "name": "Pine Cone Restaurant", "neighborhoods": [], "longitude": -89.335843999999994, "state": "WI", "stars": 4.0, "latitude": 43.238892999999997, "attributes": {"Take-out": True, "Good For": {"dessert": False, "latenight": False, "lunch": True, "dinner": False, "breakfast": False, "brunch": False}, "Caters": False, "Noise Level": "average", "Takes Reservations": False, "Delivery": False, "Ambience": {"romantic": False, "intimate": False, "touristy": False, "hipster": False, "divey": False, "classy": False, "trendy": False, "upscale": False, "casual": False}, "Parking": {"garage": False, "street": False, "validated": False, "lot": True, "valet": False}, "Has TV": True, "Outdoor Seating": False, "Attire": "casual", "Alcohol": "none", "Waiter Service": True, "Accepts Credit Cards": True, "Good for Kids": True, "Good For Groups": True, "Price Range": 1}, "type": "business"},
      {"business_id": "uGykseHzyS5xAMWoN6YUqA", "full_address": "505 W North St\nDe Forest, WI 53532", "hours": {"Monday": {"close": "22:00", "open": "06:00"}, "Tuesday": {"close": "22:00", "open": "06:00"}, "Friday": {"close": "22:00", "open": "06:00"}, "Wednesday": {"close": "22:00", "open": "06:00"}, "Thursday": {"close": "22:00", "open": "06:00"}, "Sunday": {"close": "21:00", "open": "06:00"}, "Saturday": {"close": "22:00", "open": "06:00"}}, "open": True, "categories": ["American (Traditional)", "Restaurants"], "city": "De Forest", "review_count": 16, "name": "Deforest Family Restaurant", "neighborhoods": [], "longitude": -89.353437, "state": "WI", "stars": 4.0, "latitude": 43.252267000000003, "attributes": {"Take-out": True, "Good For": {"dessert": False, "latenight": False, "lunch": False, "dinner": False, "breakfast": False, "brunch": True}, "Caters": False, "Noise Level": "quiet", "Takes Reservations": False, "Delivery": False, "Parking": {"garage": False, "street": False, "validated": False, "lot": True, "valet": False}, "Has TV": True, "Outdoor Seating": False, "Attire": "casual", "Ambience": {"romantic": False, "intimate": False, "touristy": False, "hipster": False, "divey": False, "classy": False, "trendy": False, "upscale": False, "casual": True}, "Waiter Service": True, "Accepts Credit Cards": True, "Good for Kids": True, "Good For Groups": True, "Price Range": 1}, "type": "business"},
      {"business_id": "LRKJF43s9-3jG9Lgx4zODg", "full_address": "4910 County Rd V\nDe Forest, WI 53532", "hours": {"Monday": {"close": "22:00", "open": "10:30"}, "Tuesday": {"close": "22:00", "open": "10:30"}, "Friday": {"close": "22:00", "open": "10:30"}, "Wednesday": {"close": "22:00", "open": "10:30"}, "Thursday": {"close": "22:00", "open": "10:30"}, "Sunday": {"close": "22:00", "open": "10:30"}, "Saturday": {"close": "22:00", "open": "10:30"}}, "open": True, "categories": ["Food", "Ice Cream & Frozen Yogurt", "Fast Food", "Restaurants"], "city": "De Forest", "review_count": 7, "name": "Culver's", "neighborhoods": [], "longitude": -89.374983, "state": "WI", "stars": 4.5, "latitude": 43.251044999999998, "attributes": {"Take-out": True, "Wi-Fi": "free", "Takes Reservations": False, "Delivery": False, "Parking": {"garage": False, "street": False, "validated": False, "lot": True, "valet": False}, "Wheelchair Accessible": True, "Attire": "casual", "Accepts Credit Cards": True, "Good For Groups": True, "Price Range": 1}, "type": "business"},
      {"business_id": "RgDg-k9S5YD_BaxMckifkg", "full_address": "631 S Main St\nDe Forest, WI 53532", "hours": {"Monday": {"close": "22:00", "open": "11:00"}, "Tuesday": {"close": "22:00", "open": "11:00"}, "Friday": {"close": "22:30", "open": "11:00"}, "Wednesday": {"close": "22:00", "open": "11:00"}, "Thursday": {"close": "22:00", "open": "11:00"}, "Sunday": {"close": "21:00", "open": "16:00"}, "Saturday": {"close": "22:30", "open": "11:00"}}, "open": True, "categories": ["Chinese", "Restaurants"], "city": "De Forest", "review_count": 3, "name": "Chang Jiang Chinese Kitchen", "neighborhoods": [], "longitude": -89.343721700000003, "state": "WI", "stars": 4.0, "latitude": 43.2408748, "attributes": {"Take-out": True, "Has TV": False, "Outdoor Seating": False, "Attire": "casual"}, "type": "business"}
Example #9
      - Look for commonly repeated threads (e.g. you might end up picking up the photo credits).
      - Long captions are often not lists of people.  The cutoff is subjective, so to be definitive, *let's set that cutoff at 250 characters*.

  2. You will want to separate the captions based on various forms of punctuation.  Try using `re.split`, which is more sophisticated than `string.split`.

  3. You might find a person named "ra Lebenthal".  There is no one by this name.  Can anyone spot what's happening here?

  4. This site is pretty formal and likes to say things like "Mayor Michael Bloomberg" after his election but "Michael Bloomberg" before his election.  Can you find other ('optional') titles that are being used?  They should probably be filtered out because they ultimately refer to the same person: "Michael Bloomberg."

For the analysis, we think of the problem in terms of a [network](http://en.wikipedia.org/wiki/Computer_network) or a [graph](http://en.wikipedia.org/wiki/Graph_%28mathematics%29).  Any time a pair of people appear in a photo together, that is considered a link.  What we have described is more appropriately called an (undirected) [multigraph](http://en.wikipedia.org/wiki/Multigraph) with no self-loops, but this has an obvious analog in terms of an undirected [weighted graph](http://en.wikipedia.org/wiki/Graph_%28mathematics%29#Weighted_graph).  In this problem, we will analyze the social graph of the New York social elite.

For this problem, we recommend using python's `networkx` library.
"""

from lib import QuestionList, Question, StringNumberListValidateMixin, TupleListValidateMixin
QuestionList.set_name("graph")


@QuestionList.add
class Degree(StringNumberListValidateMixin, Question):
    """
  The simplest question you might want to ask is 'who is the most popular'?  The easiest way to answer this question is to look at how many connections everyone has.  Return the top 100 people and their degree.  Remember that if an edge of the graph has weight 2, it counts for 2 in the degree.
  """
    def solution(self):
        """
    A list of 100 tuples of (name, degree) in descending order of degree
    Overall solution stats:
    Number of nodes: 102261
    Number of edges: 191926
    Average degree:   3.7536
Example #10
1. edit source code in `Main.scala`
2. run the command `sbt package` from the root directory of the project
3. use [spark-submit](https://spark.apache.org/docs/latest/submitting-applications.html) locally: this means adding a flag like `--master local[2]` to the `spark-submit` command
4. use the `create_spark_cluster` script to run `spark-submit` on EMR

**Tips**
1. It makes sense to do your development on some subset of the entire dataset for the sake of expediency. Data from the much smaller stats.stackexchange.com is available in the same format on [AWS S3](s3://thedataincubator-course/spark-stats-data/)
2. SBT has some nice features, for example [Continuous build and test](http://www.scala-sbt.org/0.12.4/docs/Getting-Started/Running.html#continuous-build-and-test), which can greatly speed up your development.
3. Try e.g. `cat output_dir/* | sort -n -t , -k 1.2 -o sorted_output` to concatenate the various `part-xxxxx` files.
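If you would rather stay in Python, a hedged equivalent of tip 3 (the `-k 1.2` key starts at the second character of field 1, e.g. skipping a leading parenthesis; paths are placeholders):

```python
import glob

rows = []
for path in sorted(glob.glob('output_dir/part-*')):
    with open(path) as f:
        rows.extend(line.rstrip('\n') for line in f)

# numeric sort on field 1, skipping its first character (as `-k 1.2` does)
rows.sort(key=lambda r: float(r.split(',')[0][1:]))

with open('sorted_output', 'w') as f:
    f.write('\n'.join(rows) + '\n')
```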

**Question:** Why do we need to use Spark?  What are the circumstances where we would favor using this approach over others?
"""

from lib import QuestionList, Question, JsonValidateMixin, TupleListValidateMixin

QuestionList.set_name("spark")


class TupleNumberListValidateMixin(TupleListValidateMixin):
    @classmethod
    def list_length(cls):
        return 100

    @classmethod
    def tuple_validators(cls):
        return (cls.validate_int, cls.validate_number)


@QuestionList.add
class UpvotePercentageByFavorites(TupleNumberListValidateMixin, Question):
    @classmethod
Example #11
from numbers import Number

from lib import QuestionList, Question, catch_validate_exception
QuestionList.set_name('ass1')


@QuestionList.add
class APlusB(Question):
    """
  What is a plus b?
  """
    def solution(self, a, b):
        return a + b

    """
  Do not touch!
  """

    @catch_validate_exception
    def validate(self):
        ans = self.solution(2, 3)
        if not isinstance(ans, Number):
            return "Answer is not a number! Need to return a single number."

        return None
Example #13
1. Cross validation is very different for time series than for other machine-learning problem classes.  In normal machine learning, we select a random subset of data as a validation set to estimate performance.  In time series, we have to consider that the problem we are trying to solve is often to predict a value in the future.  Therefore, the validation data always has to occur *after* the training data.  As a simple example, consider that it would not be very useful to have a predictor of tomorrow's temperature that depended on the temperature the day after.<br/>
We usually handle this by doing a **sliding-window validation method**: we train on the last $n$ data points and validate the prediction on the next $m$ data points, sliding the $n + m$ training / validation window in time (a sketch follows this list).  In this way, we can estimate the parameters of our model.  To test the validity of the model, we might use a block of data at the end of our time series which is reserved for testing the model with the learned parameters.

1. Another concern is whether the time series results are predictive.  In economics and finance, we refer to this as the ergodicity assumption: that past behavior can inform future behavior.  Many wonder whether past behavior in daily stock returns gives much predictive power for future behavior.
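A minimal sliding-window split sketch (a hedged helper of our own, not course code):

```python
def sliding_window_splits(n_samples, n_train, n_test):
    # yield (train_indices, test_indices); the test block always comes
    # strictly *after* the train block
    start = 0
    while start + n_train + n_test <= n_samples:
        yield (list(range(start, start + n_train)),
               list(range(start + n_train, start + n_train + n_test)))
        start += n_test

# e.g. 10 points, train on 4, validate on the next 2, sliding by 2:
# [0..3]/[4,5], [2..5]/[6,7], [4..7]/[8,9]
splits = list(sliding_window_splits(10, 4, 2))
```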

**Warning**: Feature generation is sometimes a little different for time series.  Usually, feature generation is based only on data in that training example (e.g. extracting the time of day of the temperature measurement).  In time series, we often want to use *lagged* data (the temperature an hour ago).  The easiest way to do this is to do the feature generation *before* making the training and validation split.

## Per city model:

It makes sense for each city to have its own model.  Build a "groupby" estimator that takes an estimator as an argument and builds the resulting "groupby" estimator on each city.  That is, `fit` should fit a model per city, while the `predict` method should look up the corresponding model and predict with it, etc ...
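A hedged sketch of such a groupby estimator (the convention that the city label sits in a fixed column of each row is an assumption):

```python
from sklearn.base import BaseEstimator, RegressorMixin, clone


class CityGroupbyEstimator(BaseEstimator, RegressorMixin):
    # fit one clone of base_estimator per city; row[city_col] holds the
    # city label and the remaining columns are the features
    def __init__(self, base_estimator, city_col=0):
        self.base_estimator = base_estimator
        self.city_col = city_col

    def _split(self, row):
        features = [v for i, v in enumerate(row) if i != self.city_col]
        return row[self.city_col], features

    def fit(self, X, y):
        grouped = {}
        for row, target in zip(X, y):
            city, features = self._split(row)
            grouped.setdefault(city, ([], []))
            grouped[city][0].append(features)
            grouped[city][1].append(target)
        self.models_ = dict(
            (city, clone(self.base_estimator).fit(f, t))
            for city, (f, t) in grouped.items())
        return self

    def predict(self, X):
        preds = []
        for row in X:
            city, features = self._split(row)
            preds.append(self.models_[city].predict([features])[0])
        return preds
```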
"""

from numbers import Number
from lib import QuestionList, Question, list_or_dict, catch_validate_exception
QuestionList.set_name("ts")


class TimeSeriesRecordMixin(object):
  @classmethod
  def _test_txt(cls):
    return [
      "2000 01 01 00   -11   -72 10197   220    26     4     0     0 bos",
      "2000 01 01 01    -6   -78 10206   230    26     2     0 -9999 bos",
      "2000 01 01 02   -17   -78 10211   230    36     0     0 -9999 bos",
      "2000 01 01 03   -17   -78 10214   230    36     0     0 -9999 bos",
      "2000 01 01 04   -17   -78 10216   230    36     0     0 -9999 bos",
    ]

  @classmethod
  def get_test_cases(cls):
Example #14
File: cf.py Project: balsam2/test
from lib import (QuestionList, Question, MovieMixin, _number_validate,
                 catch_validate_exception)

QuestionList.set_name("cf")


class MovieReviewsValidateMixin(MovieMixin, Question):
    @classmethod
    def _test_records(cls):
        return [
            "1::122::5::838985046",
            "1::185::5::838983525",
            "2::736::3::868244698",
            "2::780::3::868244698",
            "3::590::3.5::1136075494",
        ]

    @catch_validate_exception
    def validate(self):
        # check that every value returned by solution() is a number
        for case in self.get_test_cases():
            solutions = self.solution(*case['args'], **case['kwargs'])
            for sol in solutions:
                val = _number_validate(sol)

                if val is not None:
                    return val

        return None
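
A hedged sketch of parsing the `::`-delimited records above (the `user::movie::rating::timestamp` field order is an assumption based on the MovieLens format):

```python
def parse_record(line):
    user_id, movie_id, rating, timestamp = line.split('::')
    return int(user_id), int(movie_id), float(rating), int(timestamp)

records = [parse_record(r) for r in [
    '1::122::5::838985046',
    '3::590::3.5::1136075494',
]]
# -> [(1, 122, 5.0, 838985046), (3, 590, 3.5, 1136075494)]
```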
