def aggregate_by_name(js):
    '''
    We take a JSON of the form {'expenses':[expenseJSON1,expenseJSON2,...]},
    where expenseJSON1 is a dictionary of the form
    {'user':user1,'amount':amount1,'date':date1}. Here user1 is a string
    for the user id, amount1 is an integer, and date1 is a date string of
    the form 'YYYY-MM-DD'.

    We use 'merge_ls_dct_no_key' to delete the 'user' key from the embedded
    JSONs, since it is redundant.

    Here we go, line by line:

    (merge_ls_dct_no_key,_,'user'),

    We take a list of expenseJSONs and build a dict keyed by the 'user'
    field of these expenseJSONs, with a list of expenseJSONs associated
    with each user.  However, unlike 'merge_ls_dct', this function deletes
    the 'user' key from each expenseJSON in the list, since it is already
    in the key.

    describe_aggregation,

    We run 'describe_aggregation' on the resulting dictionary.
    '''
    return p(
        js['expenses'],
        (merge_ls_dct_no_key, _, 'user'),
        describe_aggregation,
    )

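# A minimal plain-Python sketch of the grouping step described in the
# docstring above: group a list of dicts by one field and drop that field
# from each grouped dict.  `group_without_key` is a hypothetical name used
# for illustration only; this is an assumption about the behavior, not
# pype's actual implementation of merge_ls_dct_no_key.

```python
from collections import defaultdict


def group_without_key(records, key):
    """Group dicts by record[key], dropping `key` from each grouped dict."""
    grouped = defaultdict(list)
    for record in records:
        # Build a new dict so the caller's records are left untouched.
        rest = {k: v for k, v in record.items() if k != key}
        grouped[record[key]].append(rest)
    return dict(grouped)


expenses = [
    {'user': 'ann', 'amount': 10, 'date': '2019-02-15'},
    {'user': 'bob', 'amount': 5, 'date': '2019-02-16'},
    {'user': 'ann', 'amount': 7, 'date': '2019-03-01'},
]
by_user = group_without_key(expenses, 'user')
```

# Copying each record is a design choice of this sketch; it avoids the
# in-place mutation that otherwise makes a deepcopy necessary.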
def fib3(n):
    '''
    Finally, we have the most concise implementation.  _if(cond,expr)
    returns a switch dict:

    {cond:expr,
     'else':_}
    '''
    return p(n, _if(_ > 1, l(fib2, _ - 1) + (fib2, _ - 2)))

def fib3_opt(n):
    '''
    Finally, we have the most concise implementation.  _if(cond,expr)
    returns a switch dict:

    {cond:expr,
     'else':_}

    However, in this case, notice that we do not use a PypeVal expression
    such as l(fib2,_-1).  This is because the optimizer automatically
    encloses the first value of a binary operator in a PypeVal, which makes
    the code much cleaner.
    '''
    return p(n, _if(_ > 1, (fib3_opt, _ - 1) + (fib3_opt, _ - 2)))

def train_classifier(texts, y):
    '''
    Here is a perfect example of the "feel it ... func it" philosophy:

    The pype call uses the function arguments and function body to specify
    three variables: texts, a list of strings; y, a list of floats; and
    vectorizer, a scikit-learn object that vectorizes text.  This
    reiterates the advice that you should use the function body and
    function arguments to declare your scope, whenever you can.

    Line-by-line, here we go:

    {'vectorizer':vectorizer.fit,
     'X':vectorizer.transform},

    We build a dict, the first element of which is the fitted vectorizer.
    Luckily, the 'fit' function returns the trained vectorizer instance, so
    we do not need to use _do.  This vectorizer is then assigned to
    'vectorizer'.  Because dictionaries preserve the order in which their
    keys were declared (an implementation detail in CPython 3.6, guaranteed
    by the language from Python 3.7), the fit function runs on the texts
    first, and its result is assigned to the 'vectorizer' key.  We need
    this instance of the vectorizer to run the classifier on unknown texts.
    After this, we apply 'transform' to convert the texts into a training
    matrix keyed by 'X', whose rows are texts and whose columns are words.

    _a('classifier',(Classifier().fit,_['X'],y)),

    Finally, we can build a classifier.  _a, or _assoc, means we are adding
    a key-value pair to the previous dictionary.  The value is a new
    instance of our Classifier, trained through the fit function on the
    text-word matrix 'X' and the labels vector y.

    _d('X'),

    Since we don't need the X matrix anymore, we delete it from the
    returned JSON, which now only contains 'vectorizer' and 'classifier',
    the two things we will need to classify unknown texts.
    '''
    vectorizer = Vectorizer()

    return p(
        texts,
        {
            'vectorizer': vectorizer.fit,
            'X': vectorizer.transform
        },
        _a('classifier', (Classifier().fit, _['X'], y)),
        _d('X'),
    )

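# The dict build above relies on two things: fit() returning the fitted
# object, and dict entries being processed in declaration order.  Below is
# a plain-Python sketch of that idea.  `ToyVectorizer` is a hypothetical
# stand-in written for this example (it is not scikit-learn and not the
# Vectorizer used above), and the dict comprehension only imitates what
# the pype dict build is described as doing.

```python
class ToyVectorizer:
    """Hypothetical bag-of-words vectorizer whose fit() returns self."""

    def __init__(self):
        self.vocabulary = []

    def fit(self, texts):
        # Learn a sorted vocabulary from whitespace-separated words.
        self.vocabulary = sorted({word for text in texts for word in text.split()})
        return self

    def transform(self, texts):
        # One row per text, one column per vocabulary word (word counts).
        return [[text.split().count(word) for word in self.vocabulary]
                for text in texts]


texts = ['buy now', 'hello friend now']
vectorizer = ToyVectorizer()
steps = {'vectorizer': vectorizer.fit, 'X': vectorizer.transform}

# Insertion order guarantees fit() runs before transform().
model = {name: step(texts) for name, step in steps.items()}
```

# If transform() ran first, the vocabulary would still be empty, which is
# why the key order in the dict matters.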
def sum_by_month(js):
    '''
    Line-by-line:

    _['expenses'],

    Extract the 'expenses' field from the JSON.

    [_a('month',(date_string_to_month_string,_['date']))],

    Because 'expenses' is a list, we map over it and add a month string to
    each element.  These date-string-to-month-string mappings are cached in
    pype.time_helpers.  If the date string is '2019-02-15', its month
    string is '2019-02-01'.  This avoids the object creation that datetime
    parsing requires, which becomes a performance bottleneck for large data
    volumes.

    (merge_ls_dct_no_key,_,'month'),

    Aggregate the JSONs by month, eliminating the 'month' key from each
    JSON.

    [[_['amount']]],

    The value is a dictionary keyed by the month string, with lists of
    JSONs as values.  Because each JSON contains an integer amount, we
    iterate through each list and extract this integer amount.  The result
    is a JSON keyed by month strings, with lists of integers as values.

    [sum],

    Now, we sum these values.
    '''
    return p(
        js,
        _['expenses'],
        [_a('month', (date_string_to_month_string, _['date']))],
        (merge_ls_dct_no_key, _, 'month'),
        [[_['amount']]],
        [sum],
    )

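# A plain-Python sketch of the same computation, for comparison.  The
# names `month_start` and `sum_amounts_by_month` are hypothetical, and
# functools.lru_cache is an assumed stand-in for the cache that the
# docstring says lives in pype.time_helpers; the real helpers may work
# differently.

```python
from collections import defaultdict
from functools import lru_cache


@lru_cache(maxsize=None)
def month_start(date_string):
    """Map 'YYYY-MM-DD' to the first day of its month by string slicing.

    No datetime object is created, and repeated dates hit the cache.
    """
    return date_string[:7] + '-01'


def sum_amounts_by_month(expenses):
    """Sum the 'amount' field of each expense, keyed by month string."""
    totals = defaultdict(int)
    for expense in expenses:
        totals[month_start(expense['date'])] += expense['amount']
    return dict(totals)


expenses = [
    {'user': 'ann', 'amount': 10, 'date': '2019-02-15'},
    {'user': 'bob', 'amount': 5, 'date': '2019-02-16'},
    {'user': 'ann', 'amount': 7, 'date': '2019-03-01'},
]
monthly = sum_amounts_by_month(expenses)
```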
def fib2(n):
    '''
    Notice here we have a more concise description, using the l helper from
    pype.vals.  We use this because we want the expression in 'else' to be
    passed over by the Python interpreter, but evaluated by the pype
    interpreter.  Therefore, we need to turn it into a PypeVal.

    As we know, PypeVals override their operators to produce LamTups, which
    are hashable objects that contain a pype-interpretable expression.  For
    example, the expression:

    _ <= 1

    is evaluated by the Python interpreter as:

    L(<built-in function le>, G('_pype_mirror_',), 1)

    This is because _ is a PypeVal, whose operators are overridden to
    produce a LamTup.  LamTups are hashable, so they can be used as set
    elements and dictionary keys.

    In the pype interpreter, when a LamTup is encountered, a function delam
    is run to extract a lambda expression:

    delam(_ <= 1) => delam(L(<built-in function le>, G('_pype_mirror_',), 1))
                  => (<built-in function le>, G('_pype_mirror_',), 1)

    We can see that the pype interpreter can now evaluate the final
    expression:

    p(0,_ <= 1) <=> p(0,(<built-in function le>, G('_pype_mirror_',), 1))
                <=> True

    However, in the case of lambda expressions, the double parentheses make
    this very awkward:

    v((fib2,_-1)) + (fib2,_-2)

    So the l helper builds a PypeVal which encloses this tuple:

    l(fib2,_-1) + (fib2,_-2)
    '''
    return p(n, {_ <= 1: _, 'else': l(fib2, _ - 1) + (fib2, _ - 2)})

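# The mechanism described above (operators overridden to return a
# hashable, deferred expression that an interpreter evaluates later) can
# be sketched in a few lines of plain Python.  `Expr`, `Placeholder`, and
# `evaluate` are hypothetical names invented for this sketch; they only
# imitate the roles the docstring gives to LamTup, _, and the pype
# interpreter, and are not pype's classes.

```python
import operator


class Expr(tuple):
    """A hashable deferred expression stored as (function, *args)."""


class Placeholder:
    """Stands for the value flowing through the pipeline."""

    def __le__(self, other):
        # Build an expression instead of comparing right away.
        return Expr((operator.le, self, other))

    def __sub__(self, other):
        return Expr((operator.sub, self, other))


def evaluate(expr, value):
    """Evaluate a deferred expression against a concrete value."""
    if isinstance(expr, Placeholder):
        return value
    if isinstance(expr, Expr):
        func, *args = expr
        return func(*(evaluate(arg, value) for arg in args))
    return expr  # a plain constant


X = Placeholder()
condition = X <= 1              # an Expr, not a bool
switch = {condition: 'small'}   # usable as a dict key because it is hashable
```

# Here evaluate(condition, 0) applies operator.le to 0 and 1, mirroring
# the p(0,_ <= 1) example in the docstring.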
def fib1(n):
    '''
    Here is an example of the recursive Fibonacci sequence, and a
    demonstration of how to use dict builds as scopes for the succeeding
    expressions.

    Line-by-line:

    {_ <= 1:_,
     'else':_p({'fib1':(fib1,_-1),
                'fib2':(fib1,_-2)},
               _.fib1+_.fib2)}

    This is a switch dict, as we can tell because it has the 'else' key.

    _ <= 1:_,

    If n is less than or equal to 1, return n.

    'else':_p({'fib1':(fib1,_-1),
               'fib2':(fib1,_-2)},
              _.fib1+_.fib2)}

    _p means we are building an embedded pype.  The dict build assigns
    fib1(n-1) to the key 'fib1' and fib1(n-2) to the key 'fib2'.

    _.fib1+_.fib2

    This adds the two values in the previous dict build.
    '''
    return p(
        n,
        {
            _ <= 1: _,
            'else': _p({
                'fib1': (fib1, _ - 1),
                'fib2': (fib1, _ - 2)
            }, _.fib1 + _.fib2)
        })

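# For comparison, a plain-Python sketch of the same recursion, with the
# dict build written as an explicit local scope.  `fib_plain` is a
# hypothetical name for this example; it shows what the switch dict
# computes, not how pype evaluates it.

```python
def fib_plain(n):
    """Recursive Fibonacci, mirroring the switch dict branch by branch."""
    # Corresponds to the `_ <= 1: _` branch.
    if n <= 1:
        return n
    # Corresponds to the 'else' branch: build a scope, then combine it.
    scope = {'fib1': fib_plain(n - 1), 'fib2': fib_plain(n - 2)}
    return scope['fib1'] + scope['fib2']


first_eight = [fib_plain(i) for i in range(8)]
```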
    '''
    return p(
        js['expenses'],
        [_a('date', (parse, _['date']))],
        [_a('month', (get_month, _['date']))],
        [_d('date')],
        [_a('month', (date_to_str, _['month']))],
        (merge_ls_dct_no_key, _, 'user'),
        [(merge_ls_dct_no_key, _, 'month')],
        [describe_aggregation],
    )


if __name__ == '__main__':

    js = p(sys.argv[1], open, _.read, json.loads)
    '''
    For performance, pype doesn't strictly make the object immutable.
    Therefore, if you want to run different functions on the same object,
    it is best to do a deepcopy beforehand.
    '''
    js1 = deepcopy(js)
    js2 = deepcopy(js)
    js3 = deepcopy(js)

    print("Printing JS before Aggregation")
    print("Aggregating by Name")

    aggregatedByName = aggregate_by_name(js1)

def sum_by_month_numpy(js):
    '''
    Now, we throw numpy into the mix.

    [_a('month_int',_p(_['date'],
                       date_string_to_month_string,
                       month_string_to_int))],

    Instead of adding a month string, we add a month integer, which is
    derived from the string of concatenated year, month, and day values.
    '2019-02-01' would be turned into 20190201.  This is done by cache
    lookups in pype.time_helpers:

    _p(_['date'],
       date_string_to_month_string,
       month_string_to_int))

    date_string_to_month_string looks up the month string, as described
    above.  month_string_to_int maps this month string to an integer, as
    described above.

    (zip_ls,[_['month_int']],[_['amount']]),

    We create a list of tuples containing the month integer and the amount.
    Notice that we are using two maps on the JSON list: [_['month_int']]
    creates a list of month integers, and [_['amount']] creates a list of
    amounts.  zip_ls is a helper that takes the zip of two lists and
    converts the result into a list.

    np.array

    Cast the list into an array with two columns, the first for the month
    integers and the second for amounts.

    aggregate_by_key

    This is a function from pype's numpy helpers which returns three
    things:

    1) A matrix whose i-th row holds all the values corresponding to the
       i-th key.  The matrix is padded with zeros so that lists of varying
       length can fit into it.
    2) A list of keys, where the i-th key corresponds to the i-th row of
       (1).
    3) A list of counts for the keys, where the i-th count is the count of
       the i-th key.

    (zip,_p( _1, [int_to_month_string]), (sum_by_row,_0)),

    Notice we are using zip instead of zip_ls, because the following
    tup_dct can take a zip object, whereas numpy.array requires an explicit
    list.

    _p( _1, [int_to_month_string]),

    _1 takes the second element of the tuple produced by aggregate_by_key,
    which is the month integers.  [int_to_month_string] iterates through
    these keys and converts them back to month strings.

    (sum_by_row,_0)

    _0 is the matrix containing all the amounts, with the i-th row
    corresponding to the i-th key in _1.  sum_by_row just sums the rows of
    this matrix, so now we have a zip over tuples (month string, amount
    sum).

    tup_dct

    Converts these tuples into a dictionary mapping the first element of
    each tuple to the second.
    '''
    return p(
        js,
        _['expenses'],
        [
            _a('month_int',
               _p(_['date'], date_string_to_month_string, month_string_to_int))
        ],
        (zip_ls, [_['month_int']], [_['amount']]),
        np.array,
        aggregate_by_key,
        (zip, _p(_1, [int_to_month_string]), (sum_by_row, _0)),
        tup_dct,
    )

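# The three return values described for aggregate_by_key (zero-padded
# matrix, keys, counts) can be illustrated without numpy.  The function
# below is a pure-Python sketch under stated assumptions: the name
# `aggregate_rows_by_key` is hypothetical, nested lists stand in for the
# numpy matrix, and sorting the keys is a choice made here, not something
# the docstring specifies.

```python
def aggregate_rows_by_key(pairs):
    """Return (matrix, keys, counts) for an iterable of (key, value) pairs."""
    groups = {}
    for key, value in pairs:
        groups.setdefault(key, []).append(value)
    keys = sorted(groups)
    counts = [len(groups[key]) for key in keys]
    width = max(counts, default=0)
    # Pad each row with zeros so every row has the same length.
    matrix = [groups[key] + [0] * (width - len(groups[key])) for key in keys]
    return matrix, keys, counts


pairs = [(20190201, 10), (20190301, 7), (20190201, 5)]
matrix, keys, counts = aggregate_rows_by_key(pairs)

# Zero padding does not change the row sums, so summing rows gives totals.
totals = dict(zip(keys, (sum(row) for row in matrix)))
```

# Padding with zeros is harmless for sums but would distort a mean, which
# is one reason to return the per-key counts alongside the matrix.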
if __name__ == '__main__':

    js = p(
        sys.argv[1],
        open,
        _.read,
        json.loads,
    )

    js1 = p(js, deepcopy, sum_by_month_imperative)
    js2 = p(js, deepcopy, sum_by_month)
    js3 = p(js, deepcopy, sum_by_month_numpy)

    print('Original JSON is:')
    pp.pprint(js)
    print('*' * 30)
    print('Output of imperative implementation is:')
    pp.pprint(js1)
    print('*' * 30)
    print('Output of pure pype implementation is:')
    pp.pprint(js2)
    print('*' * 30)

def classify():
    '''
    This is the function that is run on a JSON containing one field,
    'texts', which is a list of strings.  This function will return a list
    of JSONs containing the label for that text given by the classifier
    (1 or -1), and the original text.  Notice that, in this routine, we
    need access to 'texts' in (zip,_,texts).

    Line-by-line:

    global MODEL

    We need this to refer to the model we trained at the initialization of
    the microservice.

    texts=request.get_json(force=True)['texts']

    This extracts the 'texts' list from the JSON embedded in the request.

    MODEL['vectorizer'].transform,

    This uses the vectorizer to convert the list of strings in texts to a
    text-word matrix that can be fed into the classifier.

    MODEL['classifier'].predict,

    This runs the prediction on the text-word matrix, producing an array of
    1's and -1's, where 1 indicates that the classification is positive (it
    is spam), and -1 indicates that the classification is negative (it is
    not spam).

    (zip,_,texts),

    We know that the n-th label produced by the classifier is for the n-th
    string in texts, so we zip them together to produce an iterable of
    tuples (label,text).

    [{'label':_0,
      'text':_1,
      'description':{_0 == 1: 'spam',
                     'else':'not spam'}}],

    Here, we are performing a mapping over the (label,text) tuples produced
    by the zip.  For each tuple, we build a dictionary with three items.
    The first is the label, which is numeric, either 1.0 or -1.0.  The
    second is the actual text string.  However, to help the user, we also
    include a description of what the label means:

    'description':{_0 == 1: 'spam',
                   'else':'not spam'}

    The value is a switch dict.  Since _0 is a Getter object, it overrides
    the == operator to produce a LamTup, which Python will accept, but
    which the pype interpreter will run as an expression.  _0 == 1 simply
    means "the first element of the (label,text) tuple, label, is 1".  If
    this is true, 'description' is set to 'spam'.  Otherwise, it is set to
    'not spam'.

    jsonify

    This just turns the resulting JSON, a list of dictionaries, into
    something that can be returned to the client over HTTP.
    '''
    global MODEL

    texts = request.get_json(force=True)['texts']

    return p(texts,
             MODEL['vectorizer'].transform,
             MODEL['classifier'].predict,
             (zip, _, texts),
             [{
                 'label': _0,
                 'text': _1,
                 'description': {
                     _0 == 1: 'spam',
                     'else': 'not spam'
                 }
             }],
             jsonify)

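# A plain-Python sketch of the zip-and-describe step, leaving out the
# request handling, the model, and jsonify.  It assumes, as the docstring
# states, that a label of 1 marks spam and -1 marks non-spam.
# `describe_predictions` and its `spam_label` parameter are hypothetical
# names introduced for this example.

```python
def describe_predictions(labels, texts, spam_label=1):
    """Pair each label with its text and add a human-readable description."""
    return [
        {
            'label': label,
            'text': text,
            'description': 'spam' if label == spam_label else 'not spam',
        }
        # The n-th label belongs to the n-th text, so zip lines them up.
        for label, text in zip(labels, texts)
    ]


labels = [1.0, -1.0]
texts = ['win a free prize now', 'see you at lunch']
described = describe_predictions(labels, texts)
```

# Making the spam label a parameter keeps the mapping explicit, which
# helps when the training labels use a different convention.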