Example #1
def dmatrix(formula_like,
            data={},
            eval_env=0,
            NA_action="drop",
            return_type="matrix"):
    """Construct a single design matrix given a formula_like and data.

    :arg formula_like: An object that can be used to construct a design
      matrix. See below.
    :arg data: A dict-like object that can be used to look up variables
      referenced in `formula_like`.
    :arg eval_env: Either a :class:`EvalEnvironment` which will be used to
      look up any variables referenced in `formula_like` that cannot be
      found in `data`, or else a depth represented as an
      integer which will be passed to :meth:`EvalEnvironment.capture`.
      ``eval_env=0`` means to use the context of the function calling
      :func:`dmatrix` for lookups. If calling this function from a library,
      you probably want ``eval_env=1``, which means that variables should be
      resolved in *your* caller's namespace.
    :arg NA_action: What to do with rows that contain missing values. You can
      ``"drop"`` them, ``"raise"`` an error, or for customization, pass an
      :class:`NAAction` object. See :class:`NAAction` for details on what
      values count as 'missing' (and how to alter this).
    :arg return_type: Either ``"matrix"`` or ``"dataframe"``. See below.

    The `formula_like` can take a variety of forms:

    * A formula string like "x1 + x2" (for :func:`dmatrix`) or "y ~ x1 + x2"
      (for :func:`dmatrices`). For details see :ref:`formulas`.
    * A :class:`ModelDesc`, which is a Python object representation of a
      formula. See :ref:`formulas` and :ref:`expert-model-specification` for
      details.
    * A :class:`DesignMatrixBuilder`.
    * An object that has a method called :meth:`__patsy_get_model_desc__`.
      For details see :ref:`expert-model-specification`.
    * A numpy array_like (for :func:`dmatrix`) or a tuple
      (array_like, array_like) (for :func:`dmatrices`). These will have
      metadata added, representation normalized, and then be returned
      directly. In this case `data` and `eval_env` are
      ignored. There is special handling for two cases:

      * :class:`DesignMatrix` objects will have their :class:`DesignInfo`
        preserved. This allows you to set up custom column names and term
        information even if you aren't using the rest of the patsy
        machinery.
      * :class:`pandas.DataFrame` or :class:`pandas.Series` objects will have
        their (row) indexes checked. If two are passed in, their indexes must
        be aligned. If ``return_type="dataframe"``, then their indexes will be
        preserved on the output.
      
    Regardless of the input, the return type is always either:

    * A :class:`DesignMatrix`, if ``return_type="matrix"`` (the default)
    * A :class:`pandas.DataFrame`, if ``return_type="dataframe"``.

    The actual contents of the design matrix are identical in both cases, and
    in both cases a :class:`DesignInfo` will be available in a
    ``.design_info`` attribute on the return value. However, for
    ``return_type="dataframe"``, any pandas indexes on the input (either in
    `data` or directly passed through `formula_like`) will be
    preserved, which may be useful for e.g. time-series models.
    """
    eval_env = EvalEnvironment.capture(eval_env, reference=1)
    (lhs, rhs) = _do_highlevel_design(formula_like, data, eval_env, NA_action,
                                      return_type)
    if lhs.shape[1] != 0:
        raise PatsyError("encountered outcome variables for a model "
                         "that does not expect them")
    return rhs
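
A minimal usage sketch with illustrative data (pandas is only required for return_type="dataframe"):

# Usage sketch for dmatrix (illustrative variable names and data).
import numpy as np
from patsy import dmatrix

data = {"x1": np.array([1.0, 2.0, 3.0]), "x2": np.array([4.0, 5.0, 6.0])}
mat = dmatrix("x1 + x2", data)
print(mat.design_info.column_names)   # ['Intercept', 'x1', 'x2']
df = dmatrix("x1 + x2", data, return_type="dataframe")   # same contents as a DataFrame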
Example #2
    def subset(self, which_terms):
        """Create a new :class:`DesignInfo` for design matrices that contain a
        subset of the terms that the current :class:`DesignInfo` does.

        For example, if ``design_info`` has terms ``x``, ``y``, and ``z``,
        then::

          design_info2 = design_info.subset(["x", "z"])

        will return a new DesignInfo that can be used to construct design
        matrices with only the columns corresponding to the terms ``x`` and
        ``z``. After we do this, then in general these two expressions will
        return the same thing (here we assume that ``x``, ``y``, and ``z``
        each generate a single column of the output)::

          build_design_matrix([design_info], data)[0][:, [0, 2]]
          build_design_matrix([design_info2], data)[0]

        However, a critical difference is that in the second case, ``data``
        need not contain any values for ``y``. This is very useful when doing
        prediction using a subset of a model, in which situation R usually
        forces you to specify dummy values for ``y``.

        If using a formula to specify the terms to include, remember that like
        any formula, the intercept term will be included by default, so use
        ``0`` or ``-1`` in your formula if you want to avoid this.

        This method can also be used to reorder the terms in your design
        matrix, in case you want to do that for some reason. I can't think of
        any.

        Note that this method will generally *not* produce the same result as
        creating a new model directly. Consider these DesignInfo objects::

            design1 = dmatrix("1 + C(a)", data)
            design2 = design1.subset("0 + C(a)")
            design3 = dmatrix("0 + C(a)", data)

        Here ``design2`` and ``design3`` will both produce design matrices
        that contain an encoding of ``C(a)`` without any intercept term. But
        ``design3`` uses a full-rank encoding for the categorical term
        ``C(a)``, while ``design2`` uses the same reduced-rank encoding as
        ``design1``.

        :arg which_terms: The terms which should be kept in the new
          :class:`DesignMatrixBuilder`. If this is a string, then it is parsed
          as a formula, and then the names of the resulting terms are taken as
          the terms to keep. If it is a list, then it can contain a mixture of
          term names (as strings) and :class:`Term` objects.

        .. versionadded:: 0.2.0
           New method on the class DesignMatrixBuilder.

        .. versionchanged:: 0.4.0
           Moved from DesignMatrixBuilder to DesignInfo, as part of the
           removal of DesignMatrixBuilder.

        """
        if isinstance(which_terms, str):
            desc = ModelDesc.from_formula(which_terms)
            if desc.lhs_termlist:
                raise PatsyError("right-hand-side-only formula required")
            which_terms = [term.name() for term in desc.rhs_termlist]

        if self.term_codings is None:
            # This is a minimal DesignInfo
            # If the name is unknown we just let the KeyError escape
            new_names = []
            for t in which_terms:
                new_names += self.column_names[self.term_name_slices[t]]
            return DesignInfo(new_names)
        else:
            term_name_to_term = {}
            for term in self.term_codings:
                term_name_to_term[term.name()] = term

            new_column_names = []
            new_factor_infos = {}
            new_term_codings = OrderedDict()
            for name_or_term in which_terms:
                term = term_name_to_term.get(name_or_term, name_or_term)
                # If the name is unknown we just let the KeyError escape
                s = self.term_slices[term]
                new_column_names += self.column_names[s]
                for f in term.factors:
                    new_factor_infos[f] = self.factor_infos[f]
                new_term_codings[term] = self.term_codings[term]
            return DesignInfo(new_column_names,
                              factor_infos=new_factor_infos,
                              term_codings=new_term_codings)
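
A usage sketch with illustrative data: subsetting a design for prediction, so the new data does not need to contain ``y``. Note that passing a list keeps only the listed terms, so the intercept is dropped here.

# Sketch of DesignInfo.subset for prediction (illustrative data).
from patsy import dmatrix, build_design_matrices

design_info = dmatrix("x + y + z", {"x": [1, 2], "y": [3, 4], "z": [5, 6]}).design_info
sub_info = design_info.subset(["x", "z"])              # only the listed terms are kept
(mat,) = build_design_matrices([sub_info], {"x": [7, 8], "z": [9, 10]})  # no 'y' needed
print(mat.design_info.column_names)                    # ['x', 'z']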
Example #3
def _do_highlevel_design(formula_like, data, eval_env, NA_action, return_type):
    if return_type == "dataframe" and not have_pandas:
        raise PatsyError("pandas.DataFrame was requested, but pandas "
                         "is not installed")
    if return_type not in ("matrix", "dataframe"):
        raise PatsyError("unrecognized output type %r, should be "
                         "'matrix' or 'dataframe'" % (return_type, ))

    def data_iter_maker():
        return iter([data])

    builders = _try_incr_builders(formula_like, data_iter_maker, eval_env,
                                  NA_action)
    if builders is not None:
        return build_design_matrices(builders,
                                     data,
                                     NA_action=NA_action,
                                     return_type=return_type)
    else:
        # No builders, but maybe we can still get matrices
        if isinstance(formula_like, tuple):
            if len(formula_like) != 2:
                raise PatsyError("don't know what to do with a length %s "
                                 "matrices tuple" % (len(formula_like), ))
            (lhs, rhs) = formula_like
        else:
            # subok=True is necessary here to allow DesignMatrixes to pass
            # through
            (lhs, rhs) = (None, asarray_or_pandas(formula_like, subok=True))
        # some sort of explicit matrix or matrices were given. Currently we
        # have them in one of these forms:
        #   -- an ndarray or subclass
        #   -- a DesignMatrix
        #   -- a pandas.Series
        #   -- a pandas.DataFrame
        # and we have to produce a standard output format.
        def _regularize_matrix(m, default_column_prefix):
            di = DesignInfo.from_array(m, default_column_prefix)
            if have_pandas and isinstance(m,
                                          (pandas.Series, pandas.DataFrame)):
                orig_index = m.index
            else:
                orig_index = None
            if return_type == "dataframe":
                m = atleast_2d_column_default(m, preserve_pandas=True)
                m = pandas.DataFrame(m)
                m.columns = di.column_names
                m.design_info = di
                return (m, orig_index)
            else:
                return (DesignMatrix(m, di), orig_index)

        rhs, rhs_orig_index = _regularize_matrix(rhs, "x")
        if lhs is None:
            lhs = np.zeros((rhs.shape[0], 0), dtype=float)
        lhs, lhs_orig_index = _regularize_matrix(lhs, "y")

        assert isinstance(getattr(lhs, "design_info", None), DesignInfo)
        assert isinstance(getattr(rhs, "design_info", None), DesignInfo)
        if lhs.shape[0] != rhs.shape[0]:
            raise PatsyError("shape mismatch: outcome matrix has %s rows, "
                             "predictor matrix has %s rows" %
                             (lhs.shape[0], rhs.shape[0]))
        if rhs_orig_index is not None and lhs_orig_index is not None:
            if not rhs_orig_index.equals(lhs_orig_index):
                raise PatsyError("index mismatch: outcome and "
                                 "predictor have incompatible indexes")
        if return_type == "dataframe":
            if rhs_orig_index is not None and lhs_orig_index is None:
                lhs.index = rhs.index
            if rhs_orig_index is None and lhs_orig_index is not None:
                rhs.index = lhs.index
        return (lhs, rhs)
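
A sketch of the explicit-matrix branch above, with illustrative arrays: :func:`dmatrices` accepts a (lhs, rhs) tuple directly, and default column names are generated from the "y" and "x" prefixes.

# Sketch: passing explicit arrays instead of a formula; data/eval_env are ignored.
import numpy as np
from patsy import dmatrices

y = np.array([[1.0], [2.0], [3.0]])
X = np.array([[1.0, 4.0], [2.0, 5.0], [3.0, 6.0]])
lhs, rhs = dmatrices((y, X))
print(lhs.design_info.column_names, rhs.design_info.column_names)   # ['y0'] ['x0', 'x1']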
Example #4
    def __init__(self, matrix, column_suffixes):
        self.matrix = np.asarray(matrix)
        self.column_suffixes = column_suffixes
        if self.matrix.shape[1] != len(column_suffixes):
            raise PatsyError("matrix and column_suffixes don't conform")
Example #5
def raise_patsy_error(x):
    raise PatsyError("WHEEEEEE")
Example #6
File: desc.py  Project: joaonatali/patsy
def _check_interactable(expr):
    if expr.intercept:
        raise PatsyError("intercept term cannot interact with "
                            "anything else", expr.intercept_origin)
Example #7
File: desc.py  Project: joaonatali/patsy
def _eval_number(evaluator, tree):
    raise PatsyError("numbers besides '0' and '1' are "
                        "only allowed with **", tree)
Example #8
    def _handle_NA_raise(self, values, is_NAs, origins):
        for is_NA, origin in zip(is_NAs, origins):
            if np.any(is_NA):
                raise PatsyError("factor contains missing values", origin)
        return values
Example #9
def categorical_to_int(data, levels, NA_action, origin=None):
    assert isinstance(levels, tuple)
    # In this function, missing values are always mapped to -1

    if have_pandas_categorical and isinstance(data, pandas.Categorical):
        data_levels_tuple = tuple(data.levels)
        if not data_levels_tuple == levels:
            raise PatsyError(
                "mismatching levels: expected %r, got %r" %
                (levels, data_levels_tuple), origin)
        # pandas.Categorical also uses -1 to indicate NA, and we don't try to
        # second-guess its NA detection, so we can just pass it back.
        return data.labels

    if isinstance(data, _CategoricalBox):
        if data.levels is not None and tuple(data.levels) != levels:
            raise PatsyError(
                "mismatching levels: expected %r, got %r" %
                (levels, tuple(data.levels)), origin)
        data = data.data

    data = _categorical_shape_fix(data)

    try:
        level_to_int = dict(zip(levels, range(len(levels))))
    except TypeError:
        raise PatsyError(
            "Error interpreting categorical data: "
            "all items must be hashable", origin)

    # fastpath to avoid doing an item-by-item iteration over boolean arrays,
    # as requested by #44
    if hasattr(data, "dtype") and np.issubdtype(data.dtype, np.bool_):
        if level_to_int[False] == 0 and level_to_int[True] == 1:
            return data.astype(np.int_)
    out = np.empty(len(data), dtype=int)
    for i, value in enumerate(data):
        if NA_action.is_categorical_NA(value):
            out[i] = -1
        else:
            try:
                out[i] = level_to_int[value]
            except KeyError:
                SHOW_LEVELS = 4
                level_strs = []
                if len(levels) <= SHOW_LEVELS:
                    level_strs += [repr(level) for level in levels]
                else:
                    level_strs += [
                        repr(level) for level in levels[:SHOW_LEVELS // 2]
                    ]
                    level_strs.append("...")
                    level_strs += [
                        repr(level) for level in levels[-SHOW_LEVELS // 2:]
                    ]
                level_str = "[%s]" % (", ".join(level_strs))
                raise PatsyError(
                    "Error converting data to categorical: "
                    "observation with value %r does not match "
                    "any of the expected levels (expected: %s)" %
                    (value, level_str), origin)
            except TypeError:
                raise PatsyError(
                    "Error converting data to categorical: "
                    "encountered unhashable value %r" % (value, ), origin)
    if have_pandas and isinstance(data, pandas.Series):
        out = pandas.Series(out, index=data.index)
    return out
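
The heart of the function is the level-to-index mapping; here is a standalone sketch of that idea (not patsy's internal API), with None standing in for a missing value:

# Standalone sketch of the level-to-int mapping; missing values map to -1.
levels = ("a", "b", "c")
level_to_int = dict(zip(levels, range(len(levels))))   # {'a': 0, 'b': 1, 'c': 2}
codes = [-1 if v is None else level_to_int[v] for v in ["b", "a", None, "c"]]
print(codes)   # [1, 0, -1, 2]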
Example #10
    def memorize_passes_needed(self, state, eval_env):
        # 'state' is just an empty dict which we can do whatever we want with,
        # and that will be passed back to later memorize functions
        state["transforms"] = {}

        eval_env = eval_env.with_outer_namespace(_builtins_dict)
        env_namespace = eval_env.namespace
        subset_names = [name for name in ast_names(self.code)
                        if name in env_namespace]
        eval_env = eval_env.subset(subset_names)
        state["eval_env"] = eval_env

        # example: self.code == "2 * center(x)"
        i = [0]
        def new_name_maker(token):
            value = eval_env.namespace.get(token)
            if hasattr(value, "__patsy_stateful_transform__"):
                obj_name = "_patsy_stobj%s__%s__" % (i[0], token)
                i[0] += 1
                obj = value.__patsy_stateful_transform__()
                state["transforms"][obj_name] = obj
                return obj_name + ".transform"
            else:
                return token
        # example: eval_code == "2 * _patsy_stobj0__center__.transform(x)"
        eval_code = replace_bare_funcalls(self.code, new_name_maker)
        state["eval_code"] = eval_code
        # paranoia: verify that none of our new names appeared anywhere in the
        # original code
        if has_bare_variable_reference(state["transforms"], self.code):
            raise PatsyError("names of this form are reserved for "
                                "internal use (%s)" % (token,), token.origin)
        # Pull out all the '_patsy_stobj0__center__.transform(x)' pieces
        # to make '_patsy_stobj0__center__.memorize_chunk(x)' pieces
        state["memorize_code"] = {}
        for obj_name in state["transforms"]:
            transform_calls = capture_obj_method_calls(obj_name, eval_code)
            assert len(transform_calls) == 1
            transform_call = transform_calls[0]
            transform_call_name, transform_call_code = transform_call
            assert transform_call_name == obj_name + ".transform"
            assert transform_call_code.startswith(transform_call_name + "(")
            memorize_code = (obj_name
                             + ".memorize_chunk"
                             + transform_call_code[len(transform_call_name):])
            state["memorize_code"][obj_name] = memorize_code
        # Then sort the codes into bins, so that every item in bin number i
        # depends only on items in bin (i-1) or less. (By 'depends', we mean
        # that in something like:
        #   spline(center(x))
        # we have to first run:
        #    center.memorize_chunk(x)
        # then
        #    center.memorize_finish(x)
        # and only then can we run:
        #    spline.memorize_chunk(center.transform(x))
        # Since all of our objects have unique names, figuring out who
        # depends on who is pretty easy -- we just check whether the
        # memorization code for spline:
        #    spline.memorize_chunk(center.transform(x))
        # mentions the variable 'center' (which in the example, of course, it
        # does).
        pass_bins = []
        unsorted = set(state["transforms"])
        while unsorted:
            pass_bin = set()
            for obj_name in unsorted:
                other_objs = unsorted.difference([obj_name])
                memorize_code = state["memorize_code"][obj_name]
                if not has_bare_variable_reference(other_objs, memorize_code):
                    pass_bin.add(obj_name)
            assert pass_bin
            unsorted.difference_update(pass_bin)
            pass_bins.append(pass_bin)
        state["pass_bins"] = pass_bins

        return len(pass_bins)
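
A standalone sketch of the pass-binning step above, on toy inputs; the toy_pass_bins helper is illustrative and uses naive substring matching where the real code uses has_bare_variable_reference, so it only demonstrates the idea.

# Toy sketch of sorting memorize code into dependency bins.
def toy_pass_bins(memorize_code):
    unsorted = set(memorize_code)
    bins = []
    while unsorted:
        # an object goes in the current bin if its code mentions no other unsorted object
        pass_bin = {name for name in unsorted
                    if not any(other in memorize_code[name]
                               for other in unsorted - {name})}
        assert pass_bin            # acyclic dependencies guarantee progress
        unsorted -= pass_bin
        bins.append(pass_bin)
    return bins

# spline's memorize code mentions center, so center must be memorized first:
print(toy_pass_bins({"center": "center.memorize_chunk(x)",
                     "spline": "spline.memorize_chunk(center.transform(x))"}))
# [{'center'}, {'spline'}]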
Example #11
def build_design_matrices(design_infos,
                          data,
                          NA_action="drop",
                          return_type="matrix",
                          dtype=np.dtype(float)):
    """Construct several design matrices from :class:`DesignMatrixBuilder`
    objects.

    This is one of Patsy's fundamental functions. This function and
    :func:`design_matrix_builders` together form the API to the core formula
    interpretation machinery.

    :arg design_infos: A list of :class:`DesignInfo` objects describing the
      design matrices to be built.
    :arg data: A dict-like object which will be used to look up data.
    :arg NA_action: What to do with rows that contain missing values. You can
      ``"drop"`` them, ``"raise"`` an error, or for customization, pass an
      :class:`NAAction` object. See :class:`NAAction` for details on what
      values count as 'missing' (and how to alter this).
    :arg return_type: Either ``"matrix"`` or ``"dataframe"``. See below.
    :arg dtype: The dtype of the returned matrix. Useful if you want to use
      single-precision or extended-precision.

    This function returns either a list of :class:`DesignMatrix` objects (for
    ``return_type="matrix"``) or a list of :class:`pandas.DataFrame` objects
    (for ``return_type="dataframe"``). In both cases, all returned design
    matrices will have ``.design_info`` attributes containing the appropriate
    :class:`DesignInfo` objects.

    Note that unlike :func:`design_matrix_builders`, this function takes only
    a simple data argument, not any kind of iterator. That's because this
    function doesn't need a global view of the data -- everything that depends
    on the whole data set is already encapsulated in the ``design_infos``. If
    you are incrementally processing a large data set, simply call this
    function for each chunk.

    Index handling: This function always checks for indexes in the following
    places:

    * If ``data`` is a :class:`pandas.DataFrame`, its ``.index`` attribute.
    * If any factors evaluate to a :class:`pandas.Series` or
      :class:`pandas.DataFrame`, then their ``.index`` attributes.

    If multiple indexes are found, they must be identical (same values in the
    same order). If no indexes are found, then a default index is generated
    using ``np.arange(num_rows)``. One way or another, we end up with a single
    index for all the data. If ``return_type="dataframe"``, then this index is
    used as the index of the returned DataFrame objects. Examining this index
    makes it possible to determine which rows were removed due to NAs.

    Determining the number of rows in design matrices: This is not as obvious
    as it might seem, because it's possible to have a formula like "~ 1" that
    doesn't depend on the data (it has no factors). For this formula, it's
    obvious what every row in the design matrix should look like (just the
    value ``1``); but, how many rows like this should there be? To determine
    the number of rows in a design matrix, this function always checks in the
    following places:

    * If ``data`` is a :class:`pandas.DataFrame`, then its number of rows.
    * The number of entries in any factors present in any of the design
      matrices being built.

    All these values must match. In particular, if this function is called to
    generate multiple design matrices at once, then they must all have the
    same number of rows.

    .. versionadded:: 0.2.0
       The ``NA_action`` argument.

    """
    if isinstance(NA_action, str):
        NA_action = NAAction(NA_action)
    if return_type == "dataframe" and not have_pandas:
        raise PatsyError("pandas.DataFrame was requested, but pandas "
                         "is not installed")
    if return_type not in ("matrix", "dataframe"):
        raise PatsyError("unrecognized output type %r, should be "
                         "'matrix' or 'dataframe'" % (return_type, ))
    # Evaluate factors
    factor_info_to_values = {}
    factor_info_to_isNAs = {}
    rows_checker = _CheckMatch("Number of rows", lambda a, b: a == b)
    index_checker = _CheckMatch("Index", lambda a, b: a.equals(b))
    if have_pandas and isinstance(data, pandas.DataFrame):
        index_checker.check(data.index, "data.index", None)
        rows_checker.check(data.shape[0], "data argument", None)
    for design_info in design_infos:
        # We look at evaluators rather than factors here, because it might
        # happen that we have the same factor twice, but with different
        # memorized state.
        for factor_info in six.itervalues(design_info.factor_infos):
            if factor_info not in factor_info_to_values:
                value, is_NA = _eval_factor(factor_info, data, NA_action)
                factor_info_to_isNAs[factor_info] = is_NA
                # value may now be a Series, DataFrame, or ndarray
                name = factor_info.factor.name()
                origin = factor_info.factor.origin
                rows_checker.check(value.shape[0], name, origin)
                if (have_pandas
                        and isinstance(value,
                                       (pandas.Series, pandas.DataFrame))):
                    index_checker.check(value.index, name, origin)
                # Strategy: we work with raw ndarrays for doing the actual
                # combining; DesignMatrixBuilder objects never see pandas
                # objects. Then at the end, if a DataFrame was requested, we
                # convert. So every entry in this dict is either a 2-d array
                # of floats, or a 1-d array of integers (representing
                # categories).
                value = np.asarray(value)
                factor_info_to_values[factor_info] = value
    # Handle NAs
    values = list(factor_info_to_values.values())
    is_NAs = list(factor_info_to_isNAs.values())
    origins = [
        factor_info.factor.origin for factor_info in factor_info_to_values
    ]
    pandas_index = index_checker.value
    num_rows = rows_checker.value
    # num_rows is None iff factor_info_to_values (and associated lists like
    # 'values') are empty, i.e., we have no actual evaluators involved
    # (formulas like "~ 1").
    if return_type == "dataframe" and num_rows is not None:
        if pandas_index is None:
            pandas_index = np.arange(num_rows)
        values.append(pandas_index)
        is_NAs.append(np.zeros(len(pandas_index), dtype=bool))
        origins.append(None)
    new_values = NA_action.handle_NA(values, is_NAs, origins)
    # NA_action may have changed the number of rows.
    if new_values:
        num_rows = new_values[0].shape[0]
    if return_type == "dataframe" and num_rows is not None:
        pandas_index = new_values.pop()
    factor_info_to_values = dict(zip(factor_info_to_values, new_values))
    # Build factor values into matrices
    results = []
    for design_info in design_infos:
        results.append(
            _build_design_matrix(design_info, factor_info_to_values, dtype))
    matrices = []
    for need_reshape, matrix in results:
        if need_reshape:
            # There is no data-dependence, at all -- a formula like "1 ~ 1".
            # In this case the builder just returns a single-row matrix, and
            # we have to broadcast it vertically to the appropriate size. If
            # we can figure out what that is...
            assert matrix.shape[0] == 1
            if num_rows is not None:
                matrix = DesignMatrix(np.repeat(matrix, num_rows, axis=0),
                                      matrix.design_info)
            else:
                raise PatsyError(
                    "No design matrix has any non-trivial factors, "
                    "the data object is not a DataFrame. "
                    "I can't tell how many rows the design matrix should "
                    "have!")
        matrices.append(matrix)
    if return_type == "dataframe":
        assert have_pandas
        for i, matrix in enumerate(matrices):
            di = matrix.design_info
            matrices[i] = pandas.DataFrame(matrix,
                                           columns=di.column_names,
                                           index=pandas_index)
            matrices[i].design_info = di
    return matrices
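
A usage sketch of the prediction workflow this enables, with illustrative data: the DesignInfo returned by :func:`dmatrix` carries the memorized state of the ``center()`` stateful transform, so matrices built for new data reuse the training mean.

# Usage sketch: reusing a DesignInfo from training data on new data.
import numpy as np
from patsy import dmatrix, build_design_matrices

design_info = dmatrix("center(x)", {"x": np.array([1.0, 2.0, 3.0])}).design_info
(new_mat,) = build_design_matrices([design_info], {"x": np.array([4.0, 5.0])})
print(np.asarray(new_mat))   # centering uses the training mean (2.0), not the new data's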
Example #12
def _max_allowed_dim(dim, arr, factor):
    if arr.ndim > dim:
        msg = ("factor '%s' evaluates to an %s-dimensional array; I only "
               "handle arrays with dimension <= %s" %
               (factor.name(), arr.ndim, dim))
        raise PatsyError(msg, factor)
Example #13
File: build.py  Project: joaonatali/patsy
    def subset(self, which_terms):
        """Create a new :class:`DesignMatrixBuilder` that includes only a
        subset of the terms that this object does.

        For example, if `builder` has terms `x`, `y`, and `z`, then::

          builder2 = builder.subset(["x", "z"])

        will return a new builder that will return design matrices with only
        the columns corresponding to the terms `x` and `z`. After we do this,
        then in general these two expressions will return the same thing (here
        we assume that `x`, `y`, and `z` each generate a single column of the
        output)::

          build_design_matrix([builder], data)[0][:, [0, 2]]
          build_design_matrix([builder2], data)[0]

        However, a critical difference is that in the second case, `data` need
        not contain any values for `y`. This is very useful when doing
        prediction using a subset of a model, in which situation R usually
        forces you to specify dummy values for `y`.

        If using a formula to specify the terms to include, remember that like
        any formula, the intercept term will be included by default, so use
        `0` or `-1` in your formula if you want to avoid this.

        :arg which_terms: The terms which should be kept in the new
          :class:`DesignMatrixBuilder`. If this is a string, then it is parsed
          as a formula, and then the names of the resulting terms are taken as
          the terms to keep. If it is a list, then it can contain a mixture of
          term names (as strings) and :class:`Term` objects.

        .. versionadded:: 0.2.0
        """
        factor_to_evaluators = {}
        for evaluator in self._evaluators:
            factor_to_evaluators[evaluator.factor] = evaluator
        design_info = self.design_info
        term_name_to_term = dict(zip(design_info.term_names,
                                     design_info.terms))
        if isinstance(which_terms, basestring):
            # We don't use this EvalEnvironment -- all we want to do is to
            # find matching terms, and we can't do that using == on Term
            # objects, because that calls == on factor objects, which in turn
            # compares EvalEnvironments. So all we do with the parsed formula
            # is pull out the term *names*, which the EvalEnvironment doesn't
            # affect. This is just a placeholder then to allow the ModelDesc
            # to be created:
            env = EvalEnvironment({})
            desc = ModelDesc.from_formula(which_terms, env)
            if desc.lhs_termlist:
                raise PatsyError("right-hand-side-only formula required")
            which_terms = [term.name() for term in desc.rhs_termlist]
        terms = []
        evaluators = set()
        term_to_column_builders = {}
        for term_or_name in which_terms:
            if isinstance(term_or_name, basestring):
                if term_or_name not in term_name_to_term:
                    raise PatsyError("requested term %r not found in "
                                     "this DesignMatrixBuilder" %
                                     (term_or_name, ))
                term = term_name_to_term[term_or_name]
            else:
                term = term_or_name
            if term not in self._termlist:
                raise PatsyError("requested term '%s' not found in this "
                                 "DesignMatrixBuilder" % (term, ))
            for factor in term.factors:
                evaluators.add(factor_to_evaluators[factor])
            terms.append(term)
            column_builder = self._term_to_column_builders[term]
            term_to_column_builders[term] = column_builder
        return DesignMatrixBuilder(terms, evaluators, term_to_column_builders)
Example #14
def build_design_matrices(builders, data,
                          NA_action="drop",
                          return_type="matrix",
                          dtype=np.dtype(float)):
    """Construct several design matrices from :class:`DesignMatrixBuilder`
    objects.

    This is one of Patsy's fundamental functions. This function and
    :func:`design_matrix_builders` together form the API to the core formula
    interpretation machinery.

    :arg builders: A list of :class:`DesignMatrixBuilder` objects specifying the
      design matrices to be built.
    :arg data: A dict-like object which will be used to look up data.
    :arg NA_action: What to do with rows that contain missing values. You can
      ``"drop"`` them, ``"raise"`` an error, or for customization, pass an
      :class:`NAAction` object. See :class:`NAAction` for details on what
      values count as 'missing' (and how to alter this).
    :arg return_type: Either ``"matrix"`` or ``"dataframe"``. See below.
    :arg dtype: The dtype of the returned matrix. Useful if you want to use
      single-precision or extended-precision.

    This function returns either a list of :class:`DesignMatrix` objects (for
    ``return_type="matrix"``) or a list of :class:`pandas.DataFrame` objects
    (for ``return_type="dataframe"``). In the latter case, the DataFrames will
    preserve any (row) indexes that were present in the input, which may be
    useful for time-series models etc. In any case, all returned design
    matrices will have ``.design_info`` attributes containing the appropriate
    :class:`DesignInfo` objects.

    Unlike :func:`design_matrix_builders`, this function takes only a simple
    data argument, not any kind of iterator. That's because this function
    doesn't need a global view of the data -- everything that depends on the
    whole data set is already encapsulated in the `builders`. If you are
    incrementally processing a large data set, simply call this function for
    each chunk.
    """
    if isinstance(NA_action, basestring):
        NA_action = NAAction(NA_action)
    if return_type == "dataframe" and not have_pandas:
        raise PatsyError("pandas.DataFrame was requested, but pandas "
                            "is not installed")
    if return_type not in ("matrix", "dataframe"):
        raise PatsyError("unrecognized output type %r, should be "
                            "'matrix' or 'dataframe'" % (return_type,))
    # Evaluate factors
    evaluator_to_values = {}
    evaluator_to_isNAs = {}
    num_rows = None
    pandas_index = None
    for builder in builders:
        # We look at evaluators rather than factors here, because it might
        # happen that we have the same factor twice, but with different
        # memorized state.
        for evaluator in builder._evaluators:
            if evaluator not in evaluator_to_values:
                value, is_NA = evaluator.eval(data, NA_action)
                evaluator_to_isNAs[evaluator] = is_NA
                # value may now be a Series, DataFrame, or ndarray
                if num_rows is None:
                    num_rows = value.shape[0]
                else:
                    if num_rows != value.shape[0]:
                        msg = ("Row mismatch: factor %s had %s rows, when "
                               "previous factors had %s rows"
                               % (evaluator.factor.name(), value.shape[0],
                                  num_rows))
                        raise PatsyError(msg, evaluator.factor)
                if (have_pandas
                    and isinstance(value, (pandas.Series, pandas.DataFrame))):
                    if pandas_index is None:
                        pandas_index = value.index
                    else:
                        if not pandas_index.equals(value.index):
                            msg = ("Index mismatch: pandas objects must "
                                   "have aligned indexes")
                            raise PatsyError(msg, evaluator.factor)
                # Strategy: we work with raw ndarrays for doing the actual
                # combining; DesignMatrixBuilder objects never see pandas
                # objects. Then at the end, if a DataFrame was requested, we
                # convert. So every entry in this dict is either a 2-d array
                # of floats, or a 1-d array of integers (representing
                # categories).
                value = np.asarray(value)
                evaluator_to_values[evaluator] = value
    # Handle NAs
    values = evaluator_to_values.values()
    is_NAs = evaluator_to_isNAs.values()
    if return_type == "dataframe" and num_rows is not None:
        if pandas_index is None:
            pandas_index = np.arange(num_rows)
        values.append(pandas_index)
        is_NAs.append(np.zeros(len(pandas_index), dtype=bool))
    origins = [evaluator.factor.origin for evaluator in evaluator_to_values]
    new_values = NA_action.handle_NA(values, is_NAs, origins)
    if return_type == "dataframe" and num_rows is not None:
        pandas_index = new_values.pop()
    evaluator_to_values = dict(zip(evaluator_to_values, new_values))
    # Build factor values into matrices
    results = []
    for builder in builders:
        results.append(builder._build(evaluator_to_values, dtype))
    matrices = []
    for need_reshape, matrix in results:
        if need_reshape and num_rows is not None:
            assert matrix.shape[0] == 1
            matrices.append(DesignMatrix(np.repeat(matrix, num_rows, axis=0),
                                         matrix.design_info))
        else:
            # There is no data-dependence, at all -- a formula like "1 ~ 1". I
            # guess we'll just return some single-row matrices. Perhaps it
            # would be better to figure out how many rows are in the input
            # data and broadcast to that size, but eh. Input data is optional
            # in the first place, so even that would be no guarantee... let's
            # wait until someone actually has a relevant use case before we
            # worry about it.
            matrices.append(matrix)
    if return_type == "dataframe":
        assert have_pandas
        for i, matrix in enumerate(matrices):
            di = matrix.design_info
            matrices[i] = pandas.DataFrame(matrix,
                                           columns=di.column_names,
                                           index=pandas_index)
            matrices[i].design_info = di
    return matrices
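
A sketch of the chunk-wise workflow this docstring describes, using patsy's incremental high-level helper :func:`incr_dbuilder`; in this older version it returns a DesignMatrixBuilder, while newer releases return a DesignInfo, and either feeds straight into build_design_matrices. The data and chunking below are illustrative.

# Sketch of incremental (chunked) processing with incr_dbuilder.
import numpy as np
from patsy import incr_dbuilder, build_design_matrices

chunks = [{"x": np.array([1.0, 2.0])}, {"x": np.array([3.0, 4.0])}]
builder = incr_dbuilder("center(x)", lambda: iter(chunks))   # one full pass to memorize state
for chunk in chunks:
    (mat,) = build_design_matrices([builder], chunk)         # then build chunk by chunk
    print(np.asarray(mat))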