def test_process_control_chars(self):
    tools.assert_raises(TypeError, misc.process_control_chars, b'byte string')
    tools.assert_raises(ControlCharError, misc.process_control_chars,
            *[self.u_ascii_chars], **{'strategy':'strict'})
    tools.ok_(misc.process_control_chars(self.u_ascii_chars,
        strategy='ignore') == self.u_ascii_no_ctrl)
    tools.ok_(misc.process_control_chars(self.u_ascii_chars,
        strategy='replace') == self.u_ascii_ctrl_replace)
def test_process_control_chars(self):
    tools.assert_raises(TypeError, misc.process_control_chars, 'byte string')
    tools.assert_raises(ControlCharError, misc.process_control_chars,
            *[self.u_ascii_chars], **{'strategy': 'strict'})
    tools.ok_(
        misc.process_control_chars(self.u_ascii_chars,
            strategy='ignore') == self.u_ascii_no_ctrl)
    tools.ok_(
        misc.process_control_chars(self.u_ascii_chars,
            strategy='replace') == self.u_ascii_ctrl_replace)
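The test above exercises three strategies plus a type check. The following is a minimal sketch of a function with that behaviour, not kitchen's actual implementation: the name `process_control_chars_sketch`, the stand-in `ControlCharError` class, and the exact set of code points treated as control characters (0-31 except tab, newline and carriage return) are all assumptions made for illustration.

```python
import re


class ControlCharError(Exception):
    """Hypothetical stand-in for kitchen's ControlCharError."""


# Assumption: control characters are code points 0-31, minus tab (9),
# newline (10) and carriage return (13), which XML 1.0 permits.
_CONTROL_RE = re.compile(r'[\x00-\x08\x0b\x0c\x0e-\x1f]')


def process_control_chars_sketch(string, strategy='replace'):
    """Handle control characters in a str according to ``strategy``."""
    if not isinstance(string, str):
        raise TypeError('expected a str, got %s' % type(string).__name__)
    if strategy == 'ignore':
        # Drop the control characters entirely
        return _CONTROL_RE.sub('', string)
    if strategy == 'replace':
        # Substitute a question mark for each control character
        return _CONTROL_RE.sub('?', string)
    if strategy == 'strict':
        if _CONTROL_RE.search(string):
            raise ControlCharError('control character found in input')
        return string
    raise ValueError('strategy must be one of ignore, replace, or strict')
```

Note that the sketch returns a new string rather than modifying its argument, so a caller has to use the return value to see the effect of `ignore` or `replace`.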
def unicode_to_xml(string, encoding='utf-8', attrib=False,
        control_chars='replace'):
    '''Take a :class:`str` string and turn it into a byte :class:`bytes`
    suitable for xml

    :arg string: :class:`str` string to encode into an XML compatible byte
        :class:`bytes`
    :kwarg encoding: encoding to use for the returned byte :class:`bytes`.
        Default is to encode to :term:`UTF-8`. If some of the characters in
        :attr:`string` are not encodable in this encoding, the unknown
        characters will be entered into the output string using xml character
        references.
    :kwarg attrib: If :data:`True`, quote the string for use in an xml
        attribute. If :data:`False` (default), quote for use in an xml text
        field.
    :kwarg control_chars: :term:`control characters` are not allowed in XML
        documents. When we encounter those we need to know what to do. Valid
        options are:

        :replace: (default) Replace the control characters with ``?``
        :ignore: Remove the characters altogether from the output
        :strict: Raise an :exc:`~kitchen.text.exceptions.XmlEncodeError` when
            we encounter a :term:`control character`

    :raises kitchen.text.exceptions.XmlEncodeError: If :attr:`control_chars`
        is set to ``strict`` and the string to be made suitable for output to
        xml contains :term:`control characters` or if :attr:`string` is not
        a :class:`str` string then we raise this exception.
    :raises ValueError: If :attr:`control_chars` is set to something other
        than ``replace``, ``ignore``, or ``strict``.
    :rtype: byte :class:`bytes`
    :returns: representation of the :class:`str` string as a valid XML
        byte :class:`bytes`

    XML files consist mainly of text encoded using a particular charset. XML
    also denies the use of certain bytes in the encoded text (example: ``ASCII
    Null``). There are also special characters that must be escaped if they
    are present in the input (example: ``<``). This function takes care of
    all of those issues for you.

    There are a few different ways to use this function depending on your
    needs.
    The simplest invocation is like this::

        unicode_to_xml(u'String with non-ASCII characters: <"á と">')

    This will return the following to you, encoded in :term:`utf-8`::

        'String with non-ASCII characters: &lt;"á と"&gt;'

    Pretty straightforward. Now, what if you need to encode your document in
    something other than :term:`utf-8`? For instance, ``latin-1``? Let's
    see::

        unicode_to_xml(u'String with non-ASCII characters: <"á と">', encoding='latin-1')
        'String with non-ASCII characters: &lt;"á &#12392;"&gt;'

    Because the ``と`` character is not available in the ``latin-1`` charset,
    it is replaced with ``&#12392;`` in our output. This is an xml character
    reference which represents the character at unicode codepoint ``12392``,
    the ``と`` character.

    When you want to reverse this, use :func:`xml_to_unicode` which will turn
    a byte :class:`bytes` into a :class:`str` string and replace the xml
    character references with the unicode characters.

    XML also has the quirk of not allowing :term:`control characters` in its
    output. The :attr:`control_chars` parameter allows us to specify what to
    do with those. For use cases that don't need absolute character by
    character fidelity (example: holding strings that will just be used for
    display in a GUI app later), the default value of ``replace`` works well::

        unicode_to_xml(u'String with disallowed control chars: \\u0000\\u0007')
        'String with disallowed control chars: ??'

    If you do need to be able to reproduce all of the characters at a later
    date (examples: if the string is a key value in a database or a path on a
    filesystem) you have many choices. Here are a few that rely on ``utf-7``,
    a verbose encoding that encodes :term:`control characters` (as well as
    non-:term:`ASCII` unicode values) to characters from within the
    :term:`ASCII` printable characters. The good thing about doing this is
    that the code is pretty simple.
    You just need to use ``utf-7`` both when encoding the field for xml and
    when decoding it for use in your python program::

        unicode_to_xml(u'String with unicode: と and control char: \\u0007', encoding='utf7')
        'String with unicode: +MGg and control char: +AAc-'
        # [...]
        xml_to_unicode('String with unicode: +MGg and control char: +AAc-', encoding='utf7')
        u'String with unicode: と and control char: \\u0007'

    As you can see, the ``utf-7`` encoding will transform even characters that
    would be representable in :term:`utf-8`. This can be a drawback if you
    want unicode characters in the file to be readable without being decoded
    first. You can work around this with increased complexity in your
    application code::

        encoding = 'utf-8'
        u_string = u'String with unicode: と and control char: \\u0007'
        try:
            # First attempt to encode to utf8
            data = unicode_to_xml(u_string, encoding=encoding, control_chars='strict')
        except XmlEncodeError:
            # Fallback to utf-7
            encoding = 'utf-7'
            data = unicode_to_xml(u_string, encoding=encoding)
        write_tag('<mytag encoding=%s>%s</mytag>' % (encoding, data))
        # [...]
        encoding = tag.attributes.encoding
        u_string = xml_to_unicode(u_string, encoding=encoding)

    Using code similar to that, you can have some fields encoded using your
    default encoding and fall back to ``utf-7`` if there are
    :term:`control characters` present.

    .. note::

        If your goal is to preserve the :term:`control characters` you cannot
        save the entire file as ``utf-7`` and set the xml encoding parameter
        to ``utf-7``. Because XML doesn't allow :term:`control characters`,
        you have to encode those separately from any encoding work that the
        XML parser itself knows about.

    .. seealso::

        :func:`bytes_to_xml`
            if you're dealing with bytes that are non-text or of an unknown
            encoding that you must preserve on a byte for byte level.
        :func:`guess_encoding_to_xml`
            if you're dealing with strings in unknown encodings that you don't
            need to save with char-for-char fidelity.
    '''
    if not string:
        # Small optimization
        return b''
    try:
        process_control_chars(string, strategy=control_chars)
    except TypeError:
        raise XmlEncodeError('unicode_to_xml must have a unicode type as'
                ' the first argument. Use bytes_string_to_xml for byte'
                ' strings.')
    except ValueError:
        raise ValueError('The control_chars argument to unicode_to_xml'
                ' must be one of ignore, replace, or strict')
    except ControlCharError as exc:
        raise XmlEncodeError(exc.args[0])

    # Escape characters that have special meaning in xml
    if attrib:
        string = xml.sax.saxutils.escape(string, entities={'"':"&quot;"})
    else:
        string = xml.sax.saxutils.escape(string)

    string = string.encode(encoding, 'xmlcharrefreplace')
    return string
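The last steps of the function above (escaping, then encoding with `xmlcharrefreplace`) and the ``utf-7`` round trip described in its docstring can be tried with the standard library alone. This is a small illustration rather than part of kitchen; the variable names are made up for the example.

```python
from xml.sax.saxutils import escape

text = 'String with non-ASCII characters: <"á と">'

# Text field: only &, < and > are escaped before encoding.  Characters the
# target charset cannot represent become numeric character references.
as_text = escape(text).encode('latin-1', 'xmlcharrefreplace')

# Attribute: additionally escape the double quote.
as_attrib = escape(text, entities={'"': '&quot;'}).encode(
    'latin-1', 'xmlcharrefreplace')

# utf-7 maps the control character to printable ASCII, so it round-trips.
original = 'control char: \x07'
encoded = original.encode('utf-7')
restored = encoded.decode('utf-7')

print(as_text)
print(as_attrib)
print(encoded, restored == original)
```

Here `á` survives as a single ``latin-1`` byte while `と` is written as the character reference for codepoint 12392, matching the docstring's description.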
def unicode_to_xml(string, encoding='utf8', attrib=False,
        control_chars='replace'):
    '''Take a unicode string and turn it into a byte string suitable for xml

    :arg string: unicode string to encode for return
    :kwarg encoding: encoding to use for the returned byte string. Default is
        to encode to utf8. If all the characters in string are not encodable
        in this encoding, the unknown characters will be entered into the
        output string using xml character references.
    :kwarg attrib: If True, quote the string for use in an xml attribute.
        If False (default), quote for use in an xml text field.
    :kwarg control_chars: XML does not allow ASCII control characters. When
        we encounter those we need to know what to do. Valid options are:

        :replace: (default) Replace the control characters with "?"
        :ignore: Remove the characters altogether from the output
        :strict: Raise an error when we encounter a control character

    :raises XmlEncodeError: If control_chars is set to 'strict' and the string
        to be made suitable for output to xml contains control characters or
        if :attr:`string` is not a unicode type then we raise this exception.
    :raises ValueError: If control_chars is set to something other than
        replace, ignore, or strict.
    :rtype: byte string
    :returns: representation of the unicode string with any bytes that aren't
        available in xml taken care of.

    XML files consist mainly of text encoded using a particular charset. XML
    also denies the use of certain bytes in the encoded text (example: ASCII
    Null). There are also special characters that must be escaped if they are
    present in the input (example: "<"). This function takes care of all of
    those issues for you.

    There are a few different ways to use this function depending on your
    needs. The simplest invocation is like this::

        unicode_to_xml(u'String with non-ASCII characters: <"á と">')

    This will return the following to you, encoded in utf8::

        'String with non-ASCII characters: &lt;"á と"&gt;'

    Pretty straightforward.
    Now, what if you need to encode your document in something other than
    utf8? For instance, latin1? Let's see::

        unicode_to_xml(u'String with non-ASCII characters: <"á と">', encoding='latin1')
        'String with non-ASCII characters: &lt;"á &#12392;"&gt;'

    Because the "と" character is not available in the latin1 charset, it is
    replaced with a "&#12392;" in our output. This is an xml character
    reference which represents the character at unicode codepoint 12392, the
    "と" character.

    When you want to reverse this, use :func:`xml_to_unicode` which will turn
    a byte string to unicode and replace the xmlcharrefs with the unicode
    characters.

    XML also has the quirk of not allowing ASCII control characters in its
    output. The control_chars parameter allows us to specify what to do with
    those. For use cases that don't need absolute character by character
    fidelity (example: holding strings that will just be used for display in
    a GUI app later), the default value of 'replace' works well::

        unicode_to_xml(u'String with disallowed control chars: \u0000\u0007')
        'String with disallowed control chars: ??'

    If you do need to be able to reproduce all of the characters at a later
    date (examples: if the string is a key value in a database or a path on a
    filesystem) you have many choices. Here are a few that rely on utf7, a
    verbose encoding that encodes control values (as well as all other unicode
    values) to characters from within the ASCII printable characters. The good
    thing about doing this is that the code is pretty simple. You just need to
    use utf7 both when encoding the field for xml and when decoding it for use
    in your python program::

        unicode_to_xml(u'String with unicode: と and control char: \u0007', encoding='utf7')
        'String with unicode: +MGg and control char: +AAc-'
        [...]
        xml_to_unicode('String with unicode: +MGg and control char: +AAc-', encoding='utf7')
        u'String with unicode: と and control char: \u0007'

    As you can see, the utf7 encoding will transform even characters that
    would be representable in utf8.
    This can be a drawback if you want unicode characters in the file to be
    readable without being decoded first. You can work around this with
    increased complexity in your application code::

        encoding = 'utf8'
        u_string = u'String with unicode: と and control char: \u0007'
        try:
            # First attempt to encode to utf8
            data = unicode_to_xml(u_string, encoding=encoding, control_chars='strict')
        except XmlEncodeError:
            # Fallback to utf7
            encoding = 'utf7'
            data = unicode_to_xml(u_string, encoding=encoding)
        write_tag('<mytag encoding=%s>%s</mytag>' % (encoding, data))
        [...]
        encoding = tag.attributes.encoding
        u_string = xml_to_unicode(u_string, encoding=encoding)

    Using code similar to that, you can have some fields encoded using your
    default encoding and fall back to utf7 if there are control characters
    present.

    .. seealso::

        :func:`bytes_to_xml`
            if you're dealing with bytes that are non-text or of an unknown
            encoding that you must preserve on a byte for byte level.
        :func:`guess_encoding_to_xml`
            if you're dealing with strings in unknown encodings that you don't
            need to save with char-for-char fidelity.
    '''
    if not string:
        # Small optimization
        return ''
    try:
        process_control_chars(string, strategy=control_chars)
    except TypeError:
        raise XmlEncodeError(
            _('unicode_to_xml must have a unicode type as'
              ' the first argument. Use bytes_string_to_xml for byte'
              ' strings.'))
    except ValueError:
        raise ValueError(
            _('The control_chars argument to unicode_to_xml'
              ' must be one of ignore, replace, or strict'))
    except ControlCharError, e:
        raise XmlEncodeError(e.args[0])