Example 1
def test_process_control_chars(self):
    tools.assert_raises(TypeError, misc.process_control_chars, b'byte string')
    tools.assert_raises(ControlCharError, misc.process_control_chars,
            *[self.u_ascii_chars], **{'strategy':'strict'})
    tools.ok_(misc.process_control_chars(self.u_ascii_chars,
        strategy='ignore') == self.u_ascii_no_ctrl)
    tools.ok_(misc.process_control_chars(self.u_ascii_chars,
        strategy='replace') == self.u_ascii_ctrl_replace)
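
The three strategies exercised by this test can be illustrated with a small stand-in. This is a minimal sketch, not kitchen's implementation: `strip_control_chars` and `ControlCharFound` are hypothetical names, and the set of disallowed characters (code points below 32 other than tab, line feed, and carriage return) is an assumption made for the example.

```python
class ControlCharFound(Exception):
    """Hypothetical stand-in for kitchen's ControlCharError."""


# Assumption: C0 control characters other than tab, LF and CR are disallowed.
_CONTROL_CODES = frozenset(range(32)) - {9, 10, 13}


def strip_control_chars(string, strategy='replace'):
    """Illustrative handling of control characters in a str."""
    if not isinstance(string, str):
        raise TypeError('expected a str, got %s' % type(string).__name__)
    if strategy not in ('replace', 'ignore', 'strict'):
        raise ValueError('strategy must be replace, ignore, or strict')

    out = []
    for char in string:
        if ord(char) in _CONTROL_CODES:
            if strategy == 'strict':
                raise ControlCharFound('control character %r found' % char)
            if strategy == 'replace':
                out.append('?')
            # With 'ignore' the character is simply dropped.
            continue
        out.append(char)
    return ''.join(out)
```

As in the test above, a byte string raises `TypeError`, `strict` raises on the first control character, and `ignore` and `replace` return a cleaned copy.
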
Example 3
def unicode_to_xml(string, encoding='utf-8', attrib=False,
        control_chars='replace'):
    '''Take a :class:`str` string and turn it into a byte :class:`bytes`
    suitable for xml

    :arg string: :class:`str` string to encode into an XML compatible byte
        :class:`bytes`
    :kwarg encoding: encoding to use for the returned byte :class:`bytes`.
        Default is to encode to :term:`UTF-8`.  If some of the characters in
        :attr:`string` are not encodable in this encoding, the unknown
        characters will be entered into the output string using xml character
        references.
    :kwarg attrib: If :data:`True`, quote the string for use in an xml
        attribute.  If :data:`False` (default), quote for use in an xml text
        field.
    :kwarg control_chars: :term:`control characters` are not allowed in XML
        documents.  When we encounter those we need to know what to do.  Valid
        options are:

        :replace: (default) Replace the control characters with ``?``
        :ignore: Remove the characters altogether from the output
        :strict: Raise an :exc:`~kitchen.text.exceptions.XmlEncodeError`  when
            we encounter a :term:`control character`

    :raises kitchen.text.exceptions.XmlEncodeError: If :attr:`control_chars`
        is set to ``strict`` and the string to be made suitable for output to
        xml contains :term:`control characters` or if :attr:`string` is not
        a :class:`str` string then we raise this exception.
    :raises ValueError: If :attr:`control_chars` is set to something other than
        ``replace``, ``ignore``, or ``strict``.
    :rtype: byte :class:`bytes`
    :returns: representation of the :class:`str` string as a valid XML
        byte :class:`bytes`

    XML files consist mainly of text encoded using a particular charset.  XML
    also denies the use of certain bytes in the encoded text (example: ``ASCII
    Null``).  There are also special characters that must be escaped if they
    are present in the input (example: ``<``).  This function takes care of
    all of those issues for you.

    There are a few different ways to use this function depending on your
    needs.  The simplest invocation is like this::

       unicode_to_xml(u'String with non-ASCII characters: <"á と">')

    This will return the following to you, encoded in :term:`utf-8`::

      'String with non-ASCII characters: &lt;"á と"&gt;'

    Pretty straightforward.  Now, what if you need to encode your document in
    something other than :term:`utf-8`?  For instance, ``latin-1``?  Let's
    see what happens::

       unicode_to_xml(u'String with non-ASCII characters: <"á と">', encoding='latin-1')
       'String with non-ASCII characters: &lt;"á &#12392;"&gt;'

    Because the ``と`` character is not available in the ``latin-1`` charset,
    it is replaced with ``&#12392;`` in our output.  This is an xml character
    reference which represents the character at unicode codepoint ``12392``, the
    ``と`` character.

    When you want to reverse this, use :func:`xml_to_unicode` which will turn
    a byte :class:`bytes` into a :class:`str` string and replace the xml
    character references with the unicode characters.

    XML also has the quirk of not allowing :term:`control characters` in its
    output.  The :attr:`control_chars` parameter allows us to specify what to
    do with those.  For use cases that don't need absolute character by
    character fidelity (example: holding strings that will just be used for
    display in a GUI app later), the default value of ``replace`` works well::

        unicode_to_xml(u'String with disallowed control chars: \\u0000\\u0007')
        'String with disallowed control chars: ??'

    If you do need to be able to reproduce all of the characters at a later
    date (examples: if the string is a key value in a database or a path on a
    filesystem) you have many choices.  Here are a few that rely on ``utf-7``,
    a verbose encoding that encodes :term:`control characters` (as well as
    non-:term:`ASCII` unicode values) to characters from within the
    :term:`ASCII` printable characters.  The good thing about doing this is
    that the code is pretty simple.  You just need to use ``utf-7`` both when
    encoding the field for xml and when decoding it for use in your python
    program::

        unicode_to_xml(u'String with unicode: と and control char: \u0007', encoding='utf7')
        'String with unicode: +MGg and control char: +AAc-'
        # [...]
        xml_to_unicode('String with unicode: +MGg and control char: +AAc-', encoding='utf7')
        u'String with unicode: と and control char: \u0007'

    As you can see, the ``utf-7`` encoding will transform even characters that
    would be representable in :term:`utf-8`.  This can be a drawback if you
    want unicode characters in the file to be readable without being decoded
    first.  You can work around this with increased complexity in your
    application code::

        encoding = 'utf-8'
        u_string = u'String with unicode: と and control char: \u0007'
        try:
            # First attempt to encode to utf8
            data = unicode_to_xml(u_string, encoding=encoding, control_chars='strict')
        except XmlEncodeError:
            # Fallback to utf-7
            encoding = 'utf-7'
            data = unicode_to_xml(u_string, encoding=encoding)
        write_tag('<mytag encoding="%s">%s</mytag>' % (encoding, data))
        # [...]
        encoding = tag.attributes.encoding
        u_string = xml_to_unicode(u_string, encoding=encoding)

    Using code similar to that, you can have some fields encoded using your
    default encoding and fall back to ``utf-7`` if there are :term:`control
    characters` present.

    .. note::

        If your goal is to preserve the :term:`control characters`, you cannot
        simply save the entire file as ``utf-7`` and set the xml encoding
        parameter to ``utf-7``.  Because XML doesn't allow :term:`control
        characters`, you have to encode those separately from any encoding
        work that the XML parser itself knows about.

    .. seealso::

            if you're dealing with bytes that are non-text or of an unknown
            encoding that you must preserve on a byte for byte level.
            if you're dealing with strings in unknown encodings that you don't
            need to save with char-for-char fidelity.
    '''
    if not string:
        # Small optimization
        return b''
    try:
        process_control_chars(string, strategy=control_chars)
    except TypeError:
        raise XmlEncodeError('unicode_to_xml must have a unicode type as'
                ' the first argument.  Use bytes_string_to_xml for byte'
                ' strings.')
    except ValueError:
        raise ValueError('The control_chars argument to unicode_to_xml'
                ' must be one of ignore, replace, or strict')
    except ControlCharError as exc:
        raise XmlEncodeError(exc.args[0])

    # Escape characters that have special meaning in xml
    if attrib:
        string = xml.sax.saxutils.escape(string, entities={'"':"&quot;"})
    else:
        string = xml.sax.saxutils.escape(string)

    string = string.encode(encoding, 'xmlcharrefreplace')

    return string
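
The escaping and encoding steps at the end of the function rely only on the standard library, so their effect can be checked in isolation. The sketch below is a simplified illustration rather than the library function: `escape_and_encode` is a hypothetical helper that skips control-character handling entirely.

```python
from xml.sax.saxutils import escape


def escape_and_encode(string, encoding='utf-8', attrib=False):
    """Hypothetical helper: escape XML special characters, then encode."""
    # Inside an attribute value the double quote must be escaped as well.
    entities = {'"': '&quot;'} if attrib else {}
    escaped = escape(string, entities=entities)
    # Characters the target charset cannot represent become numeric
    # character references such as &#12392;.
    return escaped.encode(encoding, 'xmlcharrefreplace')


text = 'String with non-ASCII characters: <"á と">'
latin1 = escape_and_encode(text, encoding='latin-1')
# latin1 == b'String with non-ASCII characters: &lt;"\xe1 &#12392;"&gt;'

# UTF-7 keeps control characters within printable ASCII and round-trips them.
raw = 'control char: \x07'
assert raw.encode('utf-7').decode('utf-7') == raw
```

Escaping runs before encoding, so the `&` in each generated character reference is not escaped a second time.
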
