Example #1
 def test_process_control_chars(self):
     tools.assert_raises(TypeError, misc.process_control_chars, b'byte string')
     tools.assert_raises(ControlCharError, misc.process_control_chars,
             *[self.u_ascii_chars], **{'strategy':'strict'})
     tools.ok_(misc.process_control_chars(self.u_ascii_chars,
         strategy='ignore') == self.u_ascii_no_ctrl)
     tools.ok_(misc.process_control_chars(self.u_ascii_chars,
         strategy='replace') == self.u_ascii_ctrl_replace)
Example #2
 def test_process_control_chars(self):
     tools.assert_raises(TypeError, misc.process_control_chars,
                         'byte string')
     tools.assert_raises(ControlCharError, misc.process_control_chars,
                         *[self.u_ascii_chars], **{'strategy': 'strict'})
     tools.ok_(
         misc.process_control_chars(self.u_ascii_chars, strategy='ignore')
         == self.u_ascii_no_ctrl)
     tools.ok_(
         misc.process_control_chars(self.u_ascii_chars, strategy='replace')
         == self.u_ascii_ctrl_replace)
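
The tests above exercise three strategies (`strict`, `ignore`, `replace`) plus a type check. The following is a minimal sketch of a function with that behaviour, not the library's actual implementation. The name `process_control_chars_sketch`, the local `ControlCharError` class, and the exact set of code points treated as control characters (those that XML 1.0 disallows below U+0020) are assumptions made for illustration.

```python
class ControlCharError(Exception):
    """Hypothetical stand-in for the exception the tests above expect."""


# Assumption: treat as "control characters" the code points below U+0020
# that XML 1.0 forbids, i.e. everything except tab, newline, carriage return.
_CONTROL_CODES = frozenset(list(range(0, 9)) + [11, 12] + list(range(14, 32)))


def process_control_chars_sketch(string, strategy="replace"):
    """Replace, drop, or reject control characters in a str."""
    if not isinstance(string, str):
        raise TypeError("expected str, got %s" % type(string).__name__)
    if strategy not in ("replace", "ignore", "strict"):
        raise ValueError("strategy must be replace, ignore, or strict")

    if strategy == "strict":
        if any(ord(char) in _CONTROL_CODES for char in string):
            raise ControlCharError("control character found in input")
        return string

    # str.translate maps a code point to "?" (replace) or None (delete).
    replacement = "?" if strategy == "replace" else None
    table = {code: replacement for code in _CONTROL_CODES}
    return string.translate(table)


print(process_control_chars_sketch("a\x00b\x07c", strategy="replace"))  # a?b?c
print(process_control_chars_sketch("a\x00b\x07c", strategy="ignore"))   # abc
```

Because `str` is immutable, the cleaned text is only available through the return value; a caller that discards the result keeps the original, unprocessed string.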
Example #3
def unicode_to_xml(string, encoding='utf-8', attrib=False,
        control_chars='replace'):
    '''Take a :class:`str` string and turn it into a :class:`bytes` string
    suitable for xml

    :arg string: :class:`str` string to encode into an XML compatible
        :class:`bytes` string
    :kwarg encoding: encoding to use for the returned :class:`bytes` string.
        Default is to encode to :term:`UTF-8`.  If some of the characters in
        :attr:`string` are not encodable in this encoding, the unknown
        characters will be entered into the output string using xml character
        references.
    :kwarg attrib: If :data:`True`, quote the string for use in an xml
        attribute.  If :data:`False` (default), quote for use in an xml text
        field.
    :kwarg control_chars: :term:`control characters` are not allowed in XML
        documents.  When we encounter those we need to know what to do.  Valid
        options are:

        :replace: (default) Replace the control characters with ``?``
        :ignore: Remove the characters altogether from the output
        :strict: Raise an :exc:`~kitchen.text.exceptions.XmlEncodeError` when
            we encounter a :term:`control character`

    :raises kitchen.text.exceptions.XmlEncodeError: If :attr:`control_chars`
        is set to ``strict`` and the string to be made suitable for output to
        xml contains :term:`control characters` or if :attr:`string` is not
        a :class:`str` string then we raise this exception.
    :raises ValueError: If :attr:`control_chars` is set to something other than
        ``replace``, ``ignore``, or ``strict``.
    :rtype: :class:`bytes`
    :returns: representation of the :class:`str` string as a valid XML
        :class:`bytes` string

    XML files consist mainly of text encoded using a particular charset.  XML
    also denies the use of certain bytes in the encoded text (example: ``ASCII
    Null``).  There are also special characters that must be escaped if they
    are present in the input (example: ``<``).  This function takes care of
    all of those issues for you.

    There are a few different ways to use this function depending on your
    needs.  The simplest invocation is like this::

       unicode_to_xml(u'String with non-ASCII characters: <"á と">')

    This will return the following to you, encoded in :term:`utf-8`::

      'String with non-ASCII characters: &lt;"á と"&gt;'

    Pretty straightforward.  Now, what if you need to encode your document in
    something other than :term:`utf-8`?  For instance, ``latin-1``?  Let's
    see::

       unicode_to_xml(u'String with non-ASCII characters: <"á と">', encoding='latin-1')
       'String with non-ASCII characters: &lt;"á &#12392;"&gt;'

    Because the ``と`` character is not available in the ``latin-1`` charset,
    it is replaced with ``&#12392;`` in our output.  This is an xml character
    reference which represents the character at unicode codepoint ``12392``, the
    ``と`` character.

    When you want to reverse this, use :func:`xml_to_unicode` which will turn
    a :class:`bytes` string into a :class:`str` string and replace the xml
    character references with the unicode characters.

    XML also has the quirk of not allowing :term:`control characters` in its
    output.  The :attr:`control_chars` parameter allows us to specify what to
    do with those.  For use cases that don't need absolute character by
    character fidelity (example: holding strings that will just be used for
    display in a GUI app later), the default value of ``replace`` works well::

        unicode_to_xml(u'String with disallowed control chars: \\u0000\\u0007')
        'String with disallowed control chars: ??'

    If you do need to be able to reproduce all of the characters at a later
    date (examples: if the string is a key value in a database or a path on a
    filesystem) you have many choices.  Here are a few that rely on ``utf-7``,
    a verbose encoding that encodes :term:`control characters` (as well as
    non-:term:`ASCII` unicode values) to characters from within the
    :term:`ASCII` printable characters.  The good thing about doing this is
    that the code is pretty simple.  You just need to use ``utf-7`` both when
    encoding the field for xml and when decoding it for use in your python
    program::

        unicode_to_xml(u'String with unicode: と and control char: \\u0007', encoding='utf7')
        'String with unicode: +MGg and control char: +AAc-'
        # [...]
        xml_to_unicode('String with unicode: +MGg and control char: +AAc-', encoding='utf7')
        u'String with unicode: と and control char: \\u0007'

    As you can see, the ``utf-7`` encoding will transform even characters that
    would be representable in :term:`utf-8`.  This can be a drawback if you
    want unicode characters in the file to be readable without being decoded
    first.  You can work around this with increased complexity in your
    application code::

        encoding = 'utf-8'
        u_string = u'String with unicode: と and control char: \\u0007'
        try:
            # First attempt to encode to utf8
            data = unicode_to_xml(u_string, encoding=encoding, errors='strict')
        except XmlEncodeError:
            # Fallback to utf-7
            encoding = 'utf-7'
            data = unicode_to_xml(u_string, encoding=encoding, errors='strict')
        write_tag('<mytag encoding=%s>%s</mytag>' % (encoding, data))
        # [...]
        encoding = tag.attributes.encoding
        u_string = xml_to_unicode(u_string, encoding=encoding)

    Using code similar to that, you can have some fields encoded using your
    default encoding and fallback to ``utf-7`` if there are :term:`control
    characters` present.

    .. note::

        If your goal is to preserve the :term:`control characters` you cannot
        save the entire file as ``utf-7`` and set the xml encoding parameter
        to ``utf-7``.  Because XML doesn't allow :term:`control characters`,
        you have to encode those separately from any encoding work that the
        XML parser itself knows about.

    .. seealso::

        :func:`bytes_to_xml`
            if you're dealing with bytes that are non-text or of an unknown
            encoding that you must preserve on a byte for byte level.
        :func:`guess_encoding_to_xml`
            if you're dealing with strings in unknown encodings that you don't
            need to save with char-for-char fidelity.
    '''
    if not string:
        # Small optimization
        return b''
    try:
        # str is immutable, so keep the processed copy that is returned
        string = process_control_chars(string, strategy=control_chars)
    except TypeError:
        raise XmlEncodeError('unicode_to_xml must have a unicode type as'
                ' the first argument.  Use bytes_string_to_xml for byte'
                ' strings.')
    except ValueError:
        raise ValueError('The control_chars argument to unicode_to_xml'
                ' must be one of ignore, replace, or strict')
    except ControlCharError as exc:
        raise XmlEncodeError(exc.args[0])

    # Escape characters that have special meaning in xml
    if attrib:
        string = xml.sax.saxutils.escape(string, entities={'"':"&quot;"})
    else:
        string = xml.sax.saxutils.escape(string)

    string = string.encode(encoding, 'xmlcharrefreplace')

    return string
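
The last two steps of the function above (escaping, then encoding with `xmlcharrefreplace`) can be tried with the standard library alone. This is a small sketch under that assumption; the helper name `to_xml_bytes` is made up for illustration and the control-character handling is left out.

```python
from xml.sax.saxutils import escape


def to_xml_bytes(text, encoding="utf-8", attrib=False):
    """Escape XML special characters, then encode with character references."""
    # For attribute values the double quote has to be escaped as well.
    entities = {'"': "&quot;"} if attrib else {}
    escaped = escape(text, entities)
    # Characters the target charset cannot represent become &#NNNN; references.
    return escaped.encode(encoding, "xmlcharrefreplace")


text = '<"á と">'
print(to_xml_bytes(text))                       # UTF-8 bytes, no references needed
print(to_xml_bytes(text, encoding="latin-1"))   # b'&lt;"\xe1 &#12392;"&gt;'
print(to_xml_bytes(text, attrib=True))          # quotes become &quot;
```

Escaping before encoding matters here: the `&` in a generated `&#12392;` reference must not itself be escaped to `&amp;`.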
Example #4
def unicode_to_xml(string,
                   encoding='utf8',
                   attrib=False,
                   control_chars='replace'):
    '''Take a unicode string and turn it into a byte string suitable for xml

    :arg string: unicode string to encode for return
    :kwarg encoding: encoding to use for the returned byte string.  Default is
        to encode to utf8.  If some of the characters in string are not
        encodable in this encoding, the unknown characters will be entered into
        the output string using xml character references.
    :kwarg attrib: If True, quote the string for use in an xml attribute.
        If False (default), quote for use in an xml text field.
    :kwarg control_chars: XML does not allow ASCII control characters.  When
        we encounter those we need to know what to do.  Valid options are:

        :replace: (default) Replace the control characters with "?"
        :ignore: Remove the characters altogether from the output
        :strict: Raise an error when we encounter a control character

    :raises XmlEncodeError: If control_chars is set to 'strict' and the string
        to be made suitable for output to xml contains control characters or if
        :attr:`string` is not a unicode type then we raise this exception.
    :raises ValueError: If control_chars is set to something other than
        replace, ignore, or strict.
    :rtype: byte string
    :returns: representation of the unicode string with any bytes that aren't
        available in xml taken care of.

    XML files consist mainly of text encoded using a particular charset.  XML
    also denies the use of certain bytes in the encoded text (example: ASCII
    Null).  There are also special characters that must be escaped if they are
    present in the input (example: "<").  This function takes care of all of
    those issues for you.

    There are a few different ways to use this function depending on your needs.
    The simplest invocation is like this::

       unicode_to_xml(u'String with non-ASCII characters: <"á と">')

    This will return the following to you, encoded in utf8::

      'String with non-ASCII characters: &lt;"á と"&gt;'

    Pretty straightforward.  Now, what if you need to encode your document in
    something other than utf8?  For instance, latin1?  Let's see::

       unicode_to_xml(u'String with non-ASCII characters: <"á と">', encoding='latin1')
       'String with non-ASCII characters: &lt;"á &#12392;"&gt;'

    Because the "と" character is not available in the latin1 charset, it is
    replaced with a "&#12392;" in our output.  This is an xml character
    reference which represents the character at unicode codepoint 12392, the
    "と" character.

    When you want to reverse this, use :func:`xml_to_unicode` which will turn
    a byte string to unicode and replace the xmlcharrefs with the unicode
    characters.

    XML also has the quirk of not allowing ASCII control characters in its
    output.  The control_chars parameter allows us to specify what to do with
    those.  For use cases that don't need absolute character by character
    fidelity (example: holding strings that will just be used for display
    in a GUI app later), the default value of 'replace' works well::

        unicode_to_xml(u'String with disallowed control chars: \u0000\u0007')
        'String with disallowed control chars: ??'

    If you do need to be able to reproduce all of the characters at a later
    date (examples: if the string is a key value in a database or a path on a
    filesystem) you have many choices.  Here are a few that rely on utf7, a
    verbose encoding that encodes control values (as well as non-ASCII unicode
    values) to characters from within the ASCII printable characters.  The good
    thing about doing this is that the code is pretty simple.  You just need to
    use utf7 both when encoding the field for xml and when decoding it for use
    in your python program::

        unicode_to_xml(u'String with unicode: と and control char: \u0007', encoding='utf7')
        'String with unicode: +MGg and control char: +AAc-'
        [...]
        xml_to_unicode('String with unicode: +MGg and control char: +AAc-', encoding='utf7')
        u'String with unicode: と and control char: \u0007'

    As you can see, the utf7 encoding will transform even characters that
    would be representable in utf8.  This can be a drawback if you want
    unicode characters in the file to be readable without being decoded first.
    You can work around this with increased complexity in your application
    code::

        encoding = 'utf8'
        u_string = u'String with unicode: と and control char: \u0007'
        try:
            # First attempt to encode to utf8
            data = unicode_to_xml(u_string, encoding=encoding, errors='strict')
        except XmlEncodeError:
            # Fallback to utf7
            encoding = 'utf7'
            data = unicode_to_xml(u_string, encoding=encoding, errors='strict')
        write_tag('<mytag encoding=%s>%s</mytag>' % (encoding, data))
        [...]
        encoding = tag.attributes.encoding
        u_string = xml_to_unicode(u_string, encoding=encoding)

    Using code similar to that, you can have some fields encoded using your
    default encoding and fallback to utf7 if there are control characters
    present.

    .. seealso::
        :func:`bytes_to_xml`
            if you're dealing with bytes that are non-text or of an unknown
            encoding that you must preserve on a byte for byte level.
        :func:`guess_encoding_to_xml`
            if you're dealing with strings in unknown encodings that you don't
            need to save with char-for-char fidelity.
    '''
    if not string:
        # Small optimization
        return ''
    try:
        # Strings are immutable, so keep the processed copy that is returned
        string = process_control_chars(string, strategy=control_chars)
    except TypeError:
        raise XmlEncodeError(
            _('unicode_to_xml must have a unicode type as'
              ' the first argument.  Use bytes_string_to_xml for byte'
              ' strings.'))
    except ValueError:
        raise ValueError(
            _('The control_chars argument to unicode_to_xml'
              ' must be one of ignore, replace, or strict'))
    except ControlCharError as e:
        raise XmlEncodeError(e.args[0])
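
The UTF-7 round trip that the docstring above describes can be checked with Python's built-in `utf-7` codec, independent of this library. The sketch below only demonstrates the codec property the docstring relies on (control characters and non-ASCII characters end up as printable ASCII and decode back unchanged); the helper name `utf7_round_trip` is hypothetical.

```python
def utf7_round_trip(text):
    """Encode text to UTF-7 bytes and decode it back again."""
    encoded = text.encode("utf-7")
    decoded = encoded.decode("utf-7")
    return encoded, decoded


original = "String with unicode: と and control char: \u0007"
encoded, decoded = utf7_round_trip(original)

# The encoded form is pure ASCII and no longer contains the raw BEL byte.
print(encoded)
print(decoded == original)  # True
```

The cost, as the docstring notes, is readability: characters such as `と` that UTF-8 could store directly are turned into base64 runs.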