import unicodedata

# WEIRDNESS_RE, MOJIBAKE_SYMBOL_RE, COMMON_SYMBOL_RE, and chars_to_classes are
# assumed to be helpers defined elsewhere in the surrounding module.


def sequence_weirdness(text):
    """
    Determine how often a text has unexpected characters or sequences of
    characters. This metric is used to disambiguate when text should be
    re-decoded or left as is.

    We start by normalizing text in NFC form, so that penalties for
    diacritical marks don't apply to characters that know what to do with
    them.

    The following things are deemed weird:

    - Lowercase letters followed by non-ASCII uppercase letters
    - Non-Latin characters next to Latin characters
    - Un-combined diacritical marks, unless they're stacking on non-alphabetic
      characters (in languages that do that kind of thing a lot) or other
      marks
    - C1 control characters
    - Adjacent symbols from any different pair of these categories:

      - Modifier marks
      - Letter modifiers
      - Non-digit numbers
      - Symbols (including math and currency)

    The return value is the number of instances of weirdness.
    """
    text2 = unicodedata.normalize('NFC', text)
    weirdness = len(WEIRDNESS_RE.findall(chars_to_classes(text2)))
    # Symbols typical of mojibake add to the score; symbols common in
    # ordinary text subtract from it.
    adjustment = len(MOJIBAKE_SYMBOL_RE.findall(text2)) * 2 - len(
        COMMON_SYMBOL_RE.findall(text2)
    )
    return weirdness * 2 + adjustment
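# A hedged, self-contained illustration (the function name below is
# hypothetical, not part of the module): the first "weird" pattern in the
# docstring, a lowercase letter followed by a non-ASCII uppercase letter, is
# exactly what UTF-8 text mis-decoded as Latin-1 tends to produce.
def _demo_lowercase_then_uppercase():
    original = 'café'
    # Encode correctly, then decode the bytes with the wrong codec.
    mojibake = original.encode('utf-8').decode('latin-1')
    print(mojibake)                   # 'cafÃ©'
    print(mojibake[2], mojibake[3])   # 'f' 'Ã': lowercase, then non-ASCII uppercase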
def sequence_weirdness(text):
    """
    Determine how often a text has unexpected characters or sequences of
    characters. This metric is used to disambiguate when text should be
    re-decoded or left as is.

    We start by normalizing text in NFC form, so that penalties for
    diacritical marks don't apply to characters that know what to do with
    them.

    The following things are deemed weird:

    - Lowercase letters followed by non-ASCII uppercase letters
    - Non-Latin characters next to Latin characters
    - Un-combined diacritical marks, unless they're stacking on non-alphabetic
      characters (in languages that do that kind of thing a lot) or other
      marks
    - C1 control characters
    - Adjacent symbols from any different pair of these categories:

      - Modifier marks
      - Letter modifiers
      - Non-digit numbers
      - Symbols (including math and currency)

    The return value is the number of instances of weirdness.
    """
    text2 = unicodedata.normalize('NFC', text)
    weirdness = len(WEIRDNESS_RE.findall(chars_to_classes(text2)))
    # Symbols common in ordinary text earn a discount on the score.
    punct_discount = len(COMMON_SYMBOL_RE.findall(text2))
    return weirdness * 2 - punct_discount
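# Another hedged illustration (hypothetical name, not part of the module): the
# "C1 control characters" bullet covers a further mojibake symptom. Windows-1252
# curly quotes decoded as Latin-1 land in the C1 control range.
def _demo_c1_controls():
    raw = b'\x93quoted\x94'              # Windows-1252 curly quotes as bytes
    as_cp1252 = raw.decode('cp1252')     # '\u201cquoted\u201d' -- real quotation marks
    as_latin1 = raw.decode('latin-1')    # '\x93quoted\x94' -- C1 control characters
    print(unicodedata.category(as_cp1252[0]))   # 'Pi' (initial punctuation)
    print(unicodedata.category(as_latin1[0]))   # 'Cc' (control character)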
def sequence_weirdness(text):
    """
    Determine how often a text has unexpected characters or sequences of
    characters. This metric is used to disambiguate when text should be
    re-decoded or left as is.

    We start by normalizing text in NFC form, so that penalties for
    diacritical marks don't apply to characters that know what to do with
    them.
    """
    text2 = unicodedata.normalize('NFC', text)
    return len(WEIRDNESS_RE.findall(chars_to_classes(text2)))
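# Hedged illustration (hypothetical name, not part of the module): the NFC step
# composes combining marks onto letters that "know what to do with them", so a
# decomposed 'e' + COMBINING ACUTE ACCENT is not penalized as an un-combined
# diacritical mark.
def _demo_nfc_composition():
    decomposed = 'e\u0301'                          # 'e' plus a combining acute accent
    composed = unicodedata.normalize('NFC', decomposed)
    print(len(decomposed), len(composed))           # 2 1
    print(composed == '\u00e9')                     # True: the single character 'é'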