def replace_lossy_sequences(byts): """ This function identifies sequences where information has been lost in a "sloppy" codec, indicated by byte 1A, and if they would otherwise look like a UTF-8 sequence, it replaces them with the UTF-8 sequence for U+FFFD. A further explanation: ftfy can now fix text in a few cases that it would previously fix incompletely, because of the fact that it can't successfully apply the fix to the entire string. A very common case of this is when characters have been erroneously decoded as windows-1252, but instead of the "sloppy" windows-1252 that passes through unassigned bytes, the unassigned bytes get turned into U+FFFD (�), so we can't tell what they were. This most commonly happens with curly quotation marks that appear ``“ like this â€�``. We can do better by building on ftfy's "sloppy codecs" to let them handle less-sloppy but more-lossy text. When they encounter the character ``�``, instead of refusing to encode it, they encode it as byte 1A -- an ASCII control code called SUBSTITUTE that once was meant for about the same purpose. We can then apply a fixer that looks for UTF-8 sequences where some continuation bytes have been replaced by byte 1A, and decode the whole sequence as �; if that doesn't work, it'll just turn the byte back into � itself. As a result, the above text ``“ like this â€�`` will decode as ``“ like this �``. If U+1A was actually in the original string, then the sloppy codecs will not be used, and this function will not be run, so your weird control character will be left alone but wacky fixes like this won't be possible. This is used as a transcoder within `fix_encoding`. """ return LOSSY_UTF8_RE.sub("\ufffd".encode("utf-8"), byts)
def replace_lossy_sequences(byts): """ This function identifies sequences where information has been lost in a "sloppy" codec, indicated by byte 1A, and if they would otherwise look like a UTF-8 sequence, it replaces them with the UTF-8 sequence for U+FFFD. A further explanation: ftfy can now fix text in a few cases that it would previously fix incompletely, because of the fact that it can't successfully apply the fix to the entire string. A very common case of this is when characters have been erroneously decoded as windows-1252, but instead of the "sloppy" windows-1252 that passes through unassigned bytes, the unassigned bytes get turned into U+FFFD (�), so we can't tell what they were. This most commonly happens with curly quotation marks that appear ``“ like this â€�``. We can do better by building on ftfy's "sloppy codecs" to let them handle less-sloppy but more-lossy text. When they encounter the character ``�``, instead of refusing to encode it, they encode it as byte 1A -- an ASCII control code called SUBSTITUTE that once was meant for about the same purpose. We can then apply a fixer that looks for UTF-8 sequences where some continuation bytes have been replaced by byte 1A, and decode the whole sequence as �; if that doesn't work, it'll just turn the byte back into � itself. As a result, the above text ``“ like this â€�`` will decode as ``“ like this �``. If U+1A was actually in the original string, then the sloppy codecs will not be used, and this function will not be run, so your weird control character will be left alone but wacky fixes like this won't be possible. This is used as a transcoder within `fix_encoding`. """ return LOSSY_UTF8_RE.sub('\ufffd'.encode('utf-8'), byts)