Example #1
    def fold(self, num_folds):
        r"""
        Compute and return a partition of the sample supplied at
        instantiation, stratifying on the properties summarized by the
        additional lists or tuples passed to the constructor.

        :param num_folds: number of folds in the partition.

        :type num_folds: integer

        :returns: the sample partition.

        :rtype: list of lists

        EXAMPLES:

        In order to show how properties are specified, let's consider the
        following toy sample:

        >>> from yaplf.data import Example
        >>> sample = 5*(Example(10),) + 7*(Example(11),) + 8*(Example(12),) +\
        ... 10*(Example(-10),) + 20*(Example(-11),) + 50*(Example(-12),)

        That is, the sample will contain five copies of the item `10`, seven
        copies of `11` and so on. Suppose that each item is functionally
        associated with a *class* and with a *quality profile* as follows:

        - the most significant digit (including its sign) identifies the
          item's class;
        - the least significant digit identifies the item's quality.

        Therefore, items in the above mentioned sample can belong either to
        class `1` or to class `-1`, and their qualities range from `0` to
        `2`. For instance, the first item (`10`) belongs to class `1` and
        has `0` as quality. Thus, it is easy to check that this sample
        contains 5 elements having quality `0` and belonging to class `1`,
        80 elements belonging to class `-1` and so on. Although each item
        explicitly includes its values for class and quality (which is rarely
        the case with real-world data), let's also build two lists containing
        the class and the quality of each item:

        >>> classes = 5*(1,) + 7*(1,) + 8*(1,) + 10*(-1,) + 20*(-1,) + 50*(-1,)
        >>> qualities = 5*(0,) + 7*(1,) + 8*(2,) + 10*(0,) + 20*(1,) + 50*(2,)
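
        As a sanity check, note that the three sequences have one entry per
        sample item, so that their lengths coincide:

        >>> len(sample) == len(classes) == len(qualities)
        True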

        Given variables describing data, classes, and qualities, it is
        possible to create an instance of :class:`StratifiedSampleFolder` and
        invoke its :meth:`fold` method:

        >>> from yaplf.utility.folding import StratifiedSampleFolder
        >>> strat = StratifiedSampleFolder(sample, classes, qualities)
        >>> partition = strat.fold(5)
        >>> partition #doctest: +NORMALIZE_WHITESPACE
        [[Example(10), Example(11), Example(12), Example(-10), Example(-10),
        Example(-11), Example(-11), Example(-11), Example(-11), Example(-12),
        Example(-12), Example(-12), Example(-12), Example(-12), Example(-12),
        Example(-12), Example(-12), Example(-12), Example(-12)], [Example(10),
        Example(11), Example(12), Example(12), Example(-10), Example(-10),
        Example(-11), Example(-11), Example(-11), Example(-11), Example(-12),
        Example(-12), Example(-12), Example(-12), Example(-12), Example(-12),
        Example(-12), Example(-12), Example(-12), Example(-12)], [Example(10),
        Example(11), Example(11), Example(12), Example(-10), Example(-10),
        Example(-11), Example(-11), Example(-11), Example(-11), Example(-12),
        Example(-12), Example(-12), Example(-12), Example(-12), Example(-12),
        Example(-12), Example(-12), Example(-12), Example(-12)], [Example(10),
        Example(11), Example(12), Example(12), Example(-10), Example(-10),
        Example(-11), Example(-11), Example(-11), Example(-11), Example(-12),
        Example(-12), Example(-12), Example(-12), Example(-12), Example(-12),
        Example(-12), Example(-12), Example(-12), Example(-12)], [Example(10),
        Example(11), Example(11), Example(12), Example(12), Example(-10),
        Example(-10), Example(-11), Example(-11), Example(-11), Example(-11),
        Example(-12), Example(-12), Example(-12), Example(-12), Example(-12),
        Example(-12), Example(-12), Example(-12), Example(-12), Example(-12)]]

        In order to check whether the partitioning preserved the percentage
        of elements belonging to the two classes in each fold (recall that
        the whole sample contained 80% of items in class `-1`), it is
        possible to count the percentage of such elements within the various
        folds:

        >>> [100*len([item for item in fold if item.pattern<0])/len(fold) \
        ... for fold in partition]
        [84, 80, 80, 80, 76]
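
        The folds also form a genuine partition of the sample: their sizes
        sum up to the total number of available items.

        >>> sum([len(fold) for fold in partition])
        100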

        Analogously, it is possible to check that the percentage of items
        having `2` as quality measure is approximately 58% in each fold (this
        percentage is easily obtained through inspection of the above sample
        definition):

        >>> [100*len([item for item in fold \
        ... if str(item.pattern)[-1]=='2'])/len(fold) for fold in partition]
        [57, 60, 55, 60, 57]
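
        The same count performed on the whole sample confirms the exact
        overall percentage:

        >>> 100*len([item for item in sample \
        ... if str(item.pattern)[-1]=='2'])/len(sample)
        58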

        AUTHORS:

        - Dario Malchiodi (2011-01-21)

        """

        SampleFolder._check_and_shuffle(self)

        # Collect the distinct values taken by each stratification variable.
        distinct_data = tuple([tuple(set(data))
            for data in self.stratification_data])

        # Enumerate all value combinations, that is, the possible strata.
        distinct_combinations = cartesian_product(*distinct_data)

        # Group together the sample items sharing the same stratum.
        groups_with_equal_combination = [[self.sample[pos]
            for pos in range(len(self.sample))
            if tuple(data[pos]
                for data in self.stratification_data) == combination]
            for combination in distinct_combinations]

        # Split each stratum into num_folds chunks of similar size...
        partitioned_groups = [SampleFolder.partition(group, num_folds)
            for group in groups_with_equal_combination]

        # ...and build each fold by merging the corresponding chunks,
        # so that strata proportions are preserved across folds.
        return [flatten([group[fold] for group in partitioned_groups])
            for fold in range(num_folds)]
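
The fold-merging step above can also be illustrated in isolation with plain
lists. The following is a self-contained sketch using a simplified stand-in
for SampleFolder.partition (not the actual yaplf helper):

num_folds = 2

# Two strata: four items of class 1, six items of class -1.
strata = [[10, 10, 10, 10], [-10, -10, -10, -10, -10, -10]]

def partition(group, n):
    # Split group into n chunks of (approximately) equal size;
    # the last chunk absorbs any remainder.
    size = len(group) // n
    return [group[i * size:(i + 1) * size] if i < n - 1
        else group[i * size:] for i in range(n)]

partitioned = [partition(group, num_folds) for group in strata]

# The i-th fold merges the i-th chunk of every stratum, so each fold
# preserves the original 40%/60% class proportions.
folds = [sum([group[i] for group in partitioned], [])
    for i in range(num_folds)]
print(folds)  # [[10, 10, -10, -10, -10], [10, 10, -10, -10, -10]]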
Example #2
def cross_validation_step(learning_algorithm, parameters, split_sample,
    **kwargs):
    r"""
    Perform one step of cross validation, using :obj:`learning_algorithm` as
    algorithm and :obj:`parameters` as its parameters. For each sample chunk
    in :obj:`split_sample`, the algorithm is run using the remaining chunks
    as training set, and the resulting model is validated on the excluded
    chunk. The operation is cycled over all chunks, and the obtained errors
    are averaged in order to assess the overall performance.

    :param learning_algorithm: learning algorithm to be used for training.

    :type learning_algorithm: :class:`yaplf.algorithms.LearningAlgorithm`

    :param parameters: parameters and values to be fed to the learning
      algorithm.

    :type parameters: dictionary with parameter names as keys (note that
      these parameters typically have different values at each invocation
      of :meth:`cross_validation_step`).

    :param split_sample: partition of the available sample into chunks of
      (approximately) equal size.

    :type split_sample: list or tuple composed of lists or tuples of
      :class:`yaplf.data.Example`

    :param fixed_parameters: assignments to parameters of the learning
      algorithm whose values do not change across the various cross
      validation steps.

    :type fixed_parameters: dictionary with parameter names as keys, default:
      {}

    :param error_measure: function to be used in order to average test errors
      on the various sample chunks.

    :type error_measure: function taking a list/tuple as argument and returning
      a float, default: numpy.mean

    :param run_parameters: assignments to parameters to be passed to the
      :meth:`run` method of the learning algorithm (forwarded to
      :meth:`train_and_test`).

    :type run_parameters: dictionary with parameter names as keys, default:
      {}

    :param error_model: error model to be used in order to evaluate the test
      error of a single chunk (forwarded to :meth:`train_and_test`).

    :type error_model: :class:`yaplf.utility.error.ErrorModel`, default:
      :class:`yaplf.utility.error.MSE`

    :returns: averaged performance of the induced models.

    :rtype: float

    EXAMPLES:

    Starting from two data sets, the following instructions train a perceptron
    using the Rosenblatt algorithm [Rosenblatt, 1958] on one of them and
    subsequently test the inferred perceptron on the remaining set. The
    procedure is then repeated after exchanging training and test sets, and
    the two test errors are averaged:

    >>> from yaplf.data import LabeledExample
    >>> from yaplf.algorithms.neural import RosenblattPerceptronAlgorithm
    >>> from yaplf.utility.validation import cross_validation_step
    >>> split_sample = ((LabeledExample((0, 0), (0,)),
    ... LabeledExample((0, 1), (1,))), (LabeledExample((1, 0), (1,)),
    ... LabeledExample((1, 1), (1,))))
    >>> parameters = {'threshold': True}
    >>> cross_validation_step(RosenblattPerceptronAlgorithm, parameters,
    ... split_sample, fixed_parameters = {'num_steps': 500})
    0.75
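
    The function used to average the per-chunk errors can be overridden
    through :obj:`error_measure`, for instance replacing the mean with the
    median (the result is assigned to a variable so that the example does
    not depend on its specific value):

    >>> from numpy import median
    >>> error = cross_validation_step(RosenblattPerceptronAlgorithm,
    ... parameters, split_sample, fixed_parameters = {'num_steps': 500},
    ... error_measure = median)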

    REFERENCES:

    [Rosenblatt, 1958] Frank Rosenblatt, The Perceptron: A Probabilistic Model
    for Information Storage and Organization in the Brain, Psychological
    Review, Vol. 65, No. 6, pp. 386-408, doi:10.1037/h0042519.

    AUTHORS:

    - Dario Malchiodi (2010-02-22)

    """

    # Extract the optional arguments handled here, falling back to their
    # default values when unspecified.
    fixed_parameters = kwargs.get('fixed_parameters', {})
    error_measure = kwargs.get('error_measure', mean)

    # Forward all remaining keyword arguments to train_and_test.
    filtered_args = filter_arguments(kwargs,
        ('fixed_parameters', 'error_measure'))

    # Merge the fixed assignments into the variable ones (note that this
    # updates the parameters argument in place).
    parameters.update(fixed_parameters)

    # Train on all chunks but the i-th one, test on the latter, and cycle
    # over all chunks; finally, average the obtained test errors.
    errors = [train_and_test(learning_algorithm,
        flatten(split_sample[:i] + split_sample[i + 1:]),
        split_sample[i], parameters, **filtered_args)
        for i in range(len(split_sample))]
    return error_measure(errors)
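
In a complete cross validation, this step is repeated for each candidate
assignment of the variable parameters, retaining the assignment attaining
the lowest averaged error. The following is a minimal sketch of such a
driver (a hypothetical helper, not part of yaplf); note that each
assignment is copied before the call, since cross_validation_step updates
its parameters argument in place:

def select_parameters(learning_algorithm, split_sample, parameter_grid,
    **kwargs):
    # Return the assignment in parameter_grid attaining the lowest
    # averaged cross validation error (hypothetical helper).
    best_parameters, best_error = None, None
    for parameters in parameter_grid:
        error = cross_validation_step(learning_algorithm,
            dict(parameters), split_sample, **kwargs)
        if best_error is None or error < best_error:
            best_parameters, best_error = dict(parameters), error
    return best_parameters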