Python LOSSY_UTF8_RE Exemples

Langage de programmation: Python

Espace de nommage/Pack: ftfy.chardata

Class/Type: LOSSY_UTF8_RE

Exemples au hotexamples.com: 2

Python LOSSY_UTF8_RE - 2 exemples trouvés. Ce sont les exemples réels les mieux notés de ftfy.chardata.LOSSY_UTF8_RE extraits de projets open source. Vous pouvez noter les exemples pour nous aider à en améliorer la qualité.

Méthodes fréquemment utilisées

Afficher Cacher

sub(1)

Méthodes fréquemment utilisées

sub (1)

Exemple #1

0

Afficher le fichier

Fichier : fixes.py Projet : pombredanne/python-ftfy

def replace_lossy_sequences(byts): """ This function identifies sequences where information has been lost in a "sloppy" codec, indicated by byte 1A, and if they would otherwise look like a UTF-8 sequence, it replaces them with the UTF-8 sequence for U+FFFD. A further explanation: ftfy can now fix text in a few cases that it would previously fix incompletely, because of the fact that it can't successfully apply the fix to the entire string. A very common case of this is when characters have been erroneously decoded as windows-1252, but instead of the "sloppy" windows-1252 that passes through unassigned bytes, the unassigned bytes get turned into U+FFFD (�), so we can't tell what they were. This most commonly happens with curly quotation marks that appear ``â€œ like this â€�``. We can do better by building on ftfy's "sloppy codecs" to let them handle less-sloppy but more-lossy text. When they encounter the character ``�``, instead of refusing to encode it, they encode it as byte 1A -- an ASCII control code called SUBSTITUTE that once was meant for about the same purpose. We can then apply a fixer that looks for UTF-8 sequences where some continuation bytes have been replaced by byte 1A, and decode the whole sequence as �; if that doesn't work, it'll just turn the byte back into � itself. As a result, the above text ``â€œ like this â€�`` will decode as ``“ like this �``. If U+1A was actually in the original string, then the sloppy codecs will not be used, and this function will not be run, so your weird control character will be left alone but wacky fixes like this won't be possible. This is used as a transcoder within `fix_encoding`. """ return LOSSY_UTF8_RE.sub("\ufffd".encode("utf-8"), byts)

Exemple #2

0

Afficher le fichier

Fichier : fixes.py Projet : LuminosoInsight/python-ftfy

def replace_lossy_sequences(byts): """ This function identifies sequences where information has been lost in a "sloppy" codec, indicated by byte 1A, and if they would otherwise look like a UTF-8 sequence, it replaces them with the UTF-8 sequence for U+FFFD. A further explanation: ftfy can now fix text in a few cases that it would previously fix incompletely, because of the fact that it can't successfully apply the fix to the entire string. A very common case of this is when characters have been erroneously decoded as windows-1252, but instead of the "sloppy" windows-1252 that passes through unassigned bytes, the unassigned bytes get turned into U+FFFD (�), so we can't tell what they were. This most commonly happens with curly quotation marks that appear ``â€œ like this â€�``. We can do better by building on ftfy's "sloppy codecs" to let them handle less-sloppy but more-lossy text. When they encounter the character ``�``, instead of refusing to encode it, they encode it as byte 1A -- an ASCII control code called SUBSTITUTE that once was meant for about the same purpose. We can then apply a fixer that looks for UTF-8 sequences where some continuation bytes have been replaced by byte 1A, and decode the whole sequence as �; if that doesn't work, it'll just turn the byte back into � itself. As a result, the above text ``â€œ like this â€�`` will decode as ``“ like this �``. If U+1A was actually in the original string, then the sloppy codecs will not be used, and this function will not be run, so your weird control character will be left alone but wacky fixes like this won't be possible. This is used as a transcoder within `fix_encoding`. """ return LOSSY_UTF8_RE.sub('\ufffd'.encode('utf-8'), byts)