ftfy: fixes text for you

This is a module for making text less broken and more consistent. It works in Python 2.6, Python 3.2, or later.

The most interesting kind of brokenness that this resolves is when someone has encoded Unicode with one standard and decoded it with a different one. This often shows up as characters that turn into nonsense sequences:

  • The word schön might appear as schÃ¶n.
  • An em dash (—) might appear as â€”.
  • Text that was meant to be enclosed in quotation marks might end up instead enclosed in â€œ and â€� (and that last character probably won't even display as anything meaningful).

This happens very often to real text. Fortunately, the nonsense sequences usually contain all the information you need to reconstruct what character was supposed to be there.
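
To make the mechanism concrete, here is a minimal sketch in plain Python (no ftfy required) of how such a string comes to exist and why the damage is reversible:

>>> # Encode with one standard, then decode with a different one:
>>> garbled = 'schön'.encode('utf-8').decode('windows-1252')
>>> print(garbled)
schÃ¶n

>>> # The reverse round trip recovers the original text, which is
>>> # essentially what ftfy's encoding fixer automates, including
>>> # detecting which mixup happened:
>>> print(garbled.encode('windows-1252').decode('utf-8'))
schön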

Any given text string might have other irritating properties, possibly even interacting with the erroneous decoding:

  • The text could contain HTML entities such as &amp; in place of certain characters, when you would rather see what the characters actually are.
  • For that matter, it could contain instructions for a text terminal to do something like change colors, but you are not sending the text to a terminal, so those instructions are just going to look like ^[[30m; or something in the middle of the text.
  • The text could write words in non-standard ways for display purposes, such as using the three characters ﬂ o p for the word "flop". This can happen when you copy text out of a PDF, for example.

Of course you're better off if all the text you take as input is decoded properly and written in standard ways. But often, your input is something you have no control over. Somebody else's minor mistake becomes your problem.

ftfy will do everything it can to fix the problem.

Examples

In these examples, unicode_literals are turned on. ftfy always expects Unicode strings as input.

>>> from __future__ import unicode_literals
>>> from ftfy import fix_text

>>> print(fix_text('This â€” should be an em dash'))
This — should be an em dash

>>> print(fix_text('uÌˆnicode'))
ünicode

>>> print(fix_text('Broken text… it’s flubberific!'))
Broken text... it's flubberific!

>>> print(fix_text('HTML entities &lt;3'))
HTML entities <3

>>> print(fix_text('<em>HTML entities &lt;3</em>'))
<em>HTML entities &lt;3</em>

>>> wtf = '\xc3\xa0\xc2\xb2\xc2\xa0_\xc3\xa0\xc2\xb2\xc2\xa0'
>>> print(fix_text(wtf))
ಠ_ಠ

Using ftfy

The main function, fix_text, runs text through a sequence of fixes. If the text changed, it runs the text through the whole sequence again, so you can be sure the output ends up in a standard form that fix_text will leave unchanged.

All the fixes are on by default, but you can pass options to turn them off.

  • If remove_terminal_escapes is True, remove sequences of bytes that are instructions for Unix terminals, such as the codes that make text appear in different colors.
  • If fix_entities is True, consider replacing HTML entities with their equivalent characters. However, this never applies to text with a pair of angle brackets in it already; you're probably not supposed to decode entities there, and you'd make things ambiguous if you did.
  • If fix_encoding is True, look for common mistakes that come from encoding or decoding Unicode text incorrectly, and fix them if they are reasonably fixable.
  • If normalization is not None, apply the specified form of Unicode normalization, which can be one of 'NFC', 'NFKC', 'NFD', or 'NFKD'. The default, 'NFKC', applies the following relevant transformations:
    • C: Combine characters and diacritics that are written using separate code points, such as converting "e" plus an acute accent modifier into "é", or converting "ka" (か) plus a dakuten into the single character "ga" (が).
    • K: Replace characters that are functionally equivalent with the most common form. For example, half-width katakana will be replaced with full-width versions, full-width Roman characters will be replaced with ASCII characters, ellipsis characters will be replaced with three periods, and the ligature 'ﬂ' will be replaced with 'fl'.
  • If uncurl_quotes is True, replace various curly quotation marks with plain-ASCII straight quotes.
  • If fix_line_breaks is True, convert all line breaks to Unix style (CRLF and CR line breaks become LF line breaks).
  • If fix_control_characters is True, remove all C0 control characters except the common useful ones: TAB, CR, LF, and FF. (CR characters may have already been removed by the fix_line_breaks step.)
  • If remove_bom is True, remove the Byte-Order Mark if it exists. (It's a hint for a UTF-16 decoder. It's not meant to actually end up in your string.)
  • If anything was changed, repeat all the steps, so that the function is idempotent. "&amp;amp;" will become "&", for example, not "&amp;".
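
As a sketch of how these options combine (using the option names documented above; the exact fix_text signature may vary between ftfy versions):

>>> from ftfy import fix_text

>>> # Defaults: fix the encoding mixup, then straighten the
>>> # recovered curly quotes.
>>> print(fix_text('â€œquotedâ€\x9d'))
"quoted"

>>> # Keep the curly quotes and skip normalization:
>>> print(fix_text('â€œquotedâ€\x9d', uncurl_quotes=False, normalization=None))
“quoted”

>>> # NFKC normalization (the default) flattens compatibility
>>> # characters such as the 'ﬂ' and 'ﬁ' ligatures:
>>> print(fix_text('ﬂubberiﬁc'))
flubberific

>>> # Because the fixes repeat until nothing changes, doubly-escaped
>>> # entities unwind all the way:
>>> print(fix_text('&amp;amp;'))
&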

Encodings ftfy can handle

ftfy can understand text that was decoded as any of these single-byte encodings:

  • Latin-1 (ISO-8859-1)
  • Windows-1252 (cp1252 -- used in Microsoft products)
  • Windows-1251 (cp1251 -- the Russian version of cp1252)
  • MacRoman (used on Mac OS 9 and earlier)
  • cp437 (used in MS-DOS)

when it was actually intended to be decoded as one of these variable-length encodings:

  • UTF-8
  • CESU-8 (what some people think is UTF-8)
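
The difference matters for characters outside the Basic Multilingual Plane. A quick Python 3 sketch (Python has no built-in CESU-8 codec, so the surrogatepass error handler stands in for it here):

>>> # Genuine UTF-8 encodes U+1F4A9 directly, in four bytes:
>>> '\U0001F4A9'.encode('utf-8')
b'\xf0\x9f\x92\xa9'

>>> # CESU-8 encodes each of its UTF-16 surrogates (U+D83D, U+DCA9)
>>> # as a separate three-byte sequence, giving six bytes:
>>> '\ud83d\udca9'.encode('utf-8', 'surrogatepass')
b'\xed\xa0\xbd\xed\xb2\xa9'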

It can also understand text that was intended as Windows-1252 but decoded as Latin-1. That's the very common case where things like smart-quotes and bullets turn into single weird control characters.
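
For example (a minimal sketch of that very common case):

>>> from ftfy import fix_text

>>> # Windows-1252 curly quotes are bytes 0x93 and 0x94; decoded as
>>> # Latin-1 they become invisible C1 control characters. ftfy maps
>>> # them back, and the default uncurl_quotes straightens them:
>>> print(fix_text(b'\x93weird quotes\x94'.decode('latin-1')))
"weird quotes"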

However, ftfy cannot understand other mixups between single-byte encodings, because it is extremely difficult to detect which mixup in particular is the one that happened.

Non-Unicode strings

When first using ftfy, you might be confused to find that you can't give it a bytestring (the type of object called str in Python 2).

ftfy fixes text. Treating bytestrings as text is exactly the kind of thing that causes the Unicode problems that ftfy has to fix. So if you don't give it a Unicode string, ftfy will point you to the Python Unicode HOWTO.

Reasonable ways that you might exchange data, such as JSON or XML, already have perfectly good ways of expressing Unicode strings. Given a Unicode string, ftfy can apply fixes that are very likely to work without false positives.

But what if all you actually have is a mess of bytes on a disk? Well, you've got a problem, and ftfy is not quite the right tool to solve it.

As a sort of half-measure that covers a few common cases, you can decode the bytes as Latin-1 and let ftfy take it from there, which might include reinterpreting the Latin-1 text as Windows-1252 or UTF-8.

>>> print(fix_text(b'\x85test'))
UnicodeError: [informative error message]

>>> print(fix_text(b'\x85test'.decode('latin-1')))
...test

A note on encoding detection

If your input is a mess of unmarked bytes, you might want a tool that can just statistically analyze those bytes and predict what encoding they're in.

ftfy is not that tool. I might want to write that tool someday.

You may have heard of chardet. Chardet is admirable, but it is not that tool either. Its heuristics only work on multi-byte encodings, such as UTF-8 and the language-specific encodings used in East Asian languages. It works very badly on single-byte encodings, to the point where it will output wrong answers with high confidence.

There is lots of real-world text that's in an unknown single-byte encoding. There might be enough information to statistically sort out which encoding is which. But nothing, so far, actually does that.

Command-line usage

ftfy installs itself as a command line tool that reads a file and applies fix_text to it.

This has exactly the problem described above: a file on a disk is made of bytes in an unspecified encoding. It could assume the file is UTF-8, but if you had totally valid UTF-8 you probably wouldn't need this command line utility, and there's a slight chance that the file could contain Latin-1 that coincidentally looks like UTF-8.

Instead, it will follow the "half-measure" above.

You can type ftfy FILENAME, and it will read in FILENAME as Latin-1 text, fix everything that fix_text fixes (including re-interpreting it as UTF-8 if appropriate), and write the result to standard out as UTF-8.
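
In Python terms, the command behaves roughly like this (a sketch of the behavior described above, not the tool's actual source):

import sys
from ftfy import fix_text

# Read the file as Latin-1: every byte maps to *some* character, so
# nothing is lost, and fix_text can reinterpret the result as
# Windows-1252 or UTF-8 where appropriate.
with open(sys.argv[1], encoding='latin-1') as infile:
    fixed = fix_text(infile.read())

# Write the result to standard output as UTF-8.
sys.stdout.buffer.write(fixed.encode('utf-8'))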

This is not necessarily a good idea, but it's convenient. Consider this a proof of concept until we get a real encoding detector.

Who maintains ftfy?

I'm Rob Speer (rob@luminoso.com). I develop this tool as part of my text-understanding company, Luminoso, where it has proven essential.

Luminoso provides ftfy as free, open source software under the extremely permissive MIT license.

You can report bugs regarding ftfy on GitHub and we'll handle them.
