def fold(self, num_folds):
    r"""
    Compute and return a partition of the sample supplied during
    instantiation, stratifying on the properties summarized by the
    additional lists or tuples passed to the constructor.

    :param num_folds: number of folds in the partition.

    :type num_folds: integer

    :returns: the sample partition.

    :rtype: list of lists

    EXAMPLES:

    In order to show how properties are specified, consider the following
    toy sample:

    >>> from yaplf.data import Example
    >>> sample = 5*(Example(10),) + 7*(Example(11),) + 8*(Example(12),) +\
    ... 10*(Example(-10),) + 20*(Example(-11),) + 50*(Example(-12),)

    That is, sample will contain five copies of the item `10`, seven copies
    of `11`, and so on. Suppose that each item is functionally associated
    to a *class* and to a *quality profile* as follows:

    - the most significant digit (including its sign) identifies the item's
      class;

    - the least significant digit identifies the item's quality.

    Therefore, items in the above mentioned sample can belong either to
    class `1` or to class `-1`, and their quality ranges from `0` to `2`.
    For instance, the first item (`10`) belongs to class `1` and has `0` as
    quality. So, it is easy to check that this sample contains 5 elements
    having quality `0` and belonging to class `1`, 80 elements belonging to
    class `-1`, and so on.

    Although each item explicitly includes its values for class and quality
    (which is not what typically happens in the real world), let's also
    build two lists containing the class and the quality of each item:

    >>> classes = 5*(1,) + 7*(1,) + 8*(1,) + 10*(-1,) + 20*(-1,) + 50*(-1,)
    >>> qualities = 5*(0,) + 7*(1,) + 8*(2,) + 10*(0,) + 20*(1,) + 50*(2,)

    Given variables describing data, classes, and qualities it is possible
    to create an instance of :class:`StratifiedSampleFolder` and to invoke
    its :meth:`fold` method:

    >>> from yaplf.utility.folding import StratifiedSampleFolder
    >>> strat = StratifiedSampleFolder(sample, classes, qualities)
    >>> partition = strat.fold(5)
    >>> partition #doctest: +NORMALIZE_WHITESPACE
    [[Example(10), Example(11), Example(12), Example(-10), Example(-10),
    Example(-11), Example(-11), Example(-11), Example(-11), Example(-12),
    Example(-12), Example(-12), Example(-12), Example(-12), Example(-12),
    Example(-12), Example(-12), Example(-12), Example(-12)],
    [Example(10), Example(11), Example(12), Example(12), Example(-10),
    Example(-10), Example(-11), Example(-11), Example(-11), Example(-11),
    Example(-12), Example(-12), Example(-12), Example(-12), Example(-12),
    Example(-12), Example(-12), Example(-12), Example(-12), Example(-12)],
    [Example(10), Example(11), Example(11), Example(12), Example(-10),
    Example(-10), Example(-11), Example(-11), Example(-11), Example(-11),
    Example(-12), Example(-12), Example(-12), Example(-12), Example(-12),
    Example(-12), Example(-12), Example(-12), Example(-12), Example(-12)],
    [Example(10), Example(11), Example(12), Example(12), Example(-10),
    Example(-10), Example(-11), Example(-11), Example(-11), Example(-11),
    Example(-12), Example(-12), Example(-12), Example(-12), Example(-12),
    Example(-12), Example(-12), Example(-12), Example(-12), Example(-12)],
    [Example(10), Example(11), Example(11), Example(12), Example(12),
    Example(-10), Example(-10), Example(-11), Example(-11), Example(-11),
    Example(-11), Example(-12), Example(-12), Example(-12), Example(-12),
    Example(-12), Example(-12), Example(-12), Example(-12), Example(-12),
    Example(-12)]]

    In order to check whether or not the partitioning preserved in each
    fold the percentage of elements belonging to the two classes (recall
    that the whole sample contained 80% of items in class `-1`), it is
    possible to count the percentage of such elements within the various
    folds:

    >>> [100*len([item for item in fold if item.pattern < 0])/len(fold) \
    ... for fold in partition]
    [84, 80, 80, 80, 76]

    Analogously, it is possible to check that the percentage of items
    having `2` as quality (amounting to 58% in the whole sample, as easily
    obtained through inspection of the above sample definition) is
    approximately preserved, too:

    >>> [100*len([item for item in fold \
    ... if str(item.pattern)[-1] == '2'])/len(fold) for fold in partition]
    [57, 60, 55, 60, 57]

    AUTHORS:

    - Dario Malchiodi (2011-01-21)

    """

    SampleFolder._check_and_shuffle(self)
    distinct_data = tuple([tuple(set(data)) \
        for data in self.stratification_data])
    distinct_combinations = cartesian_product(*distinct_data)
    # Gather in a same group all the sample items sharing the same
    # combination of stratification values.
    groups_with_equal_combination = [[self.sample[pos] \
        for pos in range(len(self.sample)) \
        if tuple(data[pos] \
        for data in self.stratification_data) == combination] \
        for combination in distinct_combinations]
    # Partition each group separately, then merge the chunks fold-wise, so
    # that every fold approximately preserves the original proportions of
    # each combination of stratification values.
    partitioned_groups = [SampleFolder.partition(group, num_folds) \
        for group in groups_with_equal_combination]
    return [flatten([group[fold] for group in partitioned_groups]) \
        for fold in range(num_folds)]
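
# What follows is a minimal, self-contained sketch (not part of the yaplf
# API: `stratified_fold_sketch` is a hypothetical name) illustrating the
# technique implemented by :meth:`fold`: items sharing the same combination
# of stratification values are gathered in a same group, each group is
# partitioned separately, and the per-group chunks are merged fold-wise.
# Here a simple round-robin deal stands in for :meth:`SampleFolder.partition`,
# and the preliminary shuffling performed by the actual implementation is
# omitted.

def stratified_fold_sketch(sample, stratification_data, num_folds):
    from collections import defaultdict
    # Group the items through the tuple of their stratification values.
    groups = defaultdict(list)
    for pos, item in enumerate(sample):
        groups[tuple(data[pos] for data in stratification_data)].append(item)
    folds = [[] for _ in range(num_folds)]
    for group in groups.values():
        # Deal each group's items round-robin, so that every fold receives
        # (approximately) the same share of each combination.
        for pos, item in enumerate(group):
            folds[pos % num_folds].append(item)
    return folds
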
def cross_validation_step(learning_algorithm, parameters, split_sample,
    **kwargs):
    r"""
    Perform one step of cross validation, running :obj:`learning_algorithm`
    with the parameter assignment :obj:`parameters`. For each sample chunk
    in :obj:`split_sample`, the algorithm is run using the remaining
    chunks, merged together, as training set, and the inferred model is
    validated on the excluded chunk. The operation is cycled over all
    chunks, and the obtained errors are averaged in order to assess the
    overall performance.

    :param learning_algorithm: learning algorithm to be used for training.

    :type learning_algorithm: :class:`yaplf.algorithms.LearningAlgorithm`

    :param parameters: parameters and values to be fed to the learning
      algorithm (note that typically these parameters have different values
      at each invocation of :meth:`cross_validation_step`).

    :type parameters: dictionary with parameter names as keys

    :param split_sample: partition of the available sample into chunks
      (approximately) having the same size.

    :type split_sample: list or tuple composed of lists or tuples of
      :class:`yaplf.data.Example`

    :param fixed_parameters: assignments to those parameters of the
      learning algorithm whose value does not change in the various cross
      validation steps.

    :type fixed_parameters: dictionary with parameter names as keys,
      default: {}

    :param error_measure: function to be used in order to average the test
      errors on the various sample chunks.

    :type error_measure: function taking a list/tuple as argument and
      returning a float, default: :obj:`numpy.mean`

    :param run_parameters: assignments to parameters to be passed to the
      :meth:`run` method of the learning algorithm (forwarded to
      :meth:`train_and_test`).

    :type run_parameters: dictionary with parameter names as keys,
      default: {}

    :param error_model: error model to be used in order to evaluate the
      test error on a single chunk (forwarded to :meth:`train_and_test`).

    :type error_model: :class:`yaplf.utility.error.ErrorModel`, default:
      :class:`yaplf.utility.error.MSE`

    :returns: averaged performance of the induced models.

    :rtype: float

    EXAMPLES:

    Starting from two data sets, the following instructions train a
    perceptron through the Rosenblatt algorithm [Rosenblatt, 1958] on one
    of them, subsequently testing the inferred perceptron on the remaining
    set. The procedure is then repeated after exchanging the roles of
    training and test set, and the two test errors are averaged:

    >>> from yaplf.data import LabeledExample
    >>> from yaplf.algorithms.neural import RosenblattPerceptronAlgorithm
    >>> from yaplf.utility.validation import cross_validation_step
    >>> split_sample = ((LabeledExample((0, 0), (0,)),
    ... LabeledExample((0, 1), (1,))), (LabeledExample((1, 0), (1,)),
    ... LabeledExample((1, 1), (1,))))
    >>> parameters = {'threshold': True}
    >>> cross_validation_step(RosenblattPerceptronAlgorithm, parameters,
    ... split_sample, fixed_parameters={'num_steps': 500})
    0.75

    REFERENCES:

    [Rosenblatt, 1958] Frank Rosenblatt, The Perceptron: A Probabilistic
    Model for Information Storage and Organization in the Brain,
    Psychological Review, Vol. 65, No. 6, pp. 386-408,
    doi:10.1037/h0042519.

    AUTHORS:

    - Dario Malchiodi (2010-02-22)

    """

    # Keyword arguments handled here default, respectively, to no fixed
    # parameters and to the arithmetic mean as error measure; the remaining
    # ones are forwarded to train_and_test.
    fixed_parameters = kwargs.get('fixed_parameters', {})
    error_measure = kwargs.get('error_measure', mean)
    filtered_args = filter_arguments(kwargs, \
        ('fixed_parameters', 'error_measure'))
    parameters.update(fixed_parameters)
    # Cycle over the chunks: at each step the i-th chunk is held out as
    # test set, while the remaining ones are merged into the training set.
    errors = [train_and_test(learning_algorithm,
        flatten(split_sample[:i] + split_sample[i + 1:]),
        split_sample[i], parameters, **filtered_args)
        for i in range(len(split_sample))]
    return error_measure(errors)
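
# A minimal sketch of how :meth:`cross_validation_step` is typically driven
# during model selection: one invocation per candidate parameter assignment,
# retaining the assignment attaining the lowest averaged error. The helper
# below is illustrative only (`select_parameters_sketch` is a hypothetical
# name, not part of the yaplf API).

def select_parameters_sketch(learning_algorithm, candidate_parameters,
    split_sample, **kwargs):
    r"""
    Return the dictionary in :obj:`candidate_parameters` (an iterable of
    dictionaries) scoring the lowest cross validation error.
    """

    # Each candidate is copied before being passed on, for
    # cross_validation_step updates its parameters argument in place with
    # the fixed parameters.
    scored = [(cross_validation_step(learning_algorithm, dict(parameters),
        split_sample, **kwargs), parameters)
        for parameters in candidate_parameters]
    return min(scored, key=lambda pair: pair[0])[1]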