def tokenize(self, text):
    """Tokenize a chunk of text.

    Pulled mostly verbatim from the SpamBayes code.
    """
    maxword = 20
    # Expand numeric character entities (e.g. "&#97;" for the letter 'a').
    text = numeric_entity_re.sub(numeric_entity_replacer, text)
    # Let each cracker pull structured tokens (URLs, ...) out of the text;
    # a cracker returns the possibly-rewritten text plus extracted tokens.
    for cracker in (crack_urls,):
        text, extracted = cracker(text)
        for token in extracted:
            yield token
    # Breaking tags (<br>, <p>) act as whitespace; every other HTML/XML
    # tag is stripped entirely.
    text = breaking_entity_re.sub(' ', text)
    text = html_re.sub('', text)
    # Split on whitespace; hand overlong words to tokenize_word() and
    # silently drop anything shorter than three characters.
    for word in text.split():
        length = len(word)
        if length < 3:
            continue
        if length <= maxword:
            yield word
        else:
            for token in tokenize_word(word):
                yield token
def tokenize(self, text):
    """Tokenize a chunk of text.

    Pulled mostly verbatim from the SpamBayes code.
    """
    maxword = 20

    # Replace numeric character entities (like &#97; for the letter 'a').
    text = numeric_entity_re.sub(numeric_entity_replacer, text)

    # Crack open URLs and extract useful bits of marrow.  Each cracker
    # returns the (possibly rewritten) text plus any tokens it found.
    for cracker in (crack_urls,):
        text, found = cracker(text)
        for tok in found:
            yield tok

    # <br> and <p> break text in a browser, so turn them into spaces...
    text = breaking_entity_re.sub(' ', text)
    # ...but eliminate all other HTML/XML tags outright rather than
    # blanking them; replacing with a blank would let tricks like
    #     Wr<!$FS|i|R3$s80sA >inkle Reduc<!$FS|i|R3$s80sA >tion
    # disguise words.  <br>/<p> were special-cased above because browsers
    # break text on them, so they can't be used to hide words effectively.
    text = html_re.sub('', text)

    # Tokenize everything in the body.  Keep this length range in sync
    # with the one used in tokenize_word().
    for tok in text.split():
        size = len(tok)
        if size < 3:
            continue
        elif size > maxword:
            for sub in tokenize_word(tok):
                yield sub
        else:
            yield tok
def tokenize(self, text):
    """Tokenize a chunk of text.

    Pulled mostly verbatim from the SpamBayes code.
    """
    maxword = 20
    # Numeric character entities (&#97; == 'a') become their characters.
    text = numeric_entity_re.sub(numeric_entity_replacer, text)
    # Crackers mine the text for structured tokens (URLs) and may
    # rewrite the text as they go.
    crackers = (crack_urls,)
    for crack in crackers:
        text, mined = crack(text)
        for piece in mined:
            yield piece
    # Browsers break text on <br>/<p>, so those become spaces; all other
    # HTML/XML tags are deleted (not blanked) so that tags split across a
    # word can't be used to disguise it.
    text = breaking_entity_re.sub(' ', text)
    text = html_re.sub('', text)
    # Emit words of acceptable length; defer long ones to tokenize_word().
    # NOTE: this 3..maxword range must match the one in tokenize_word().
    for chunk in text.split():
        n = len(chunk)
        if 3 <= n <= maxword:
            yield chunk
        elif n > maxword:
            for sub in tokenize_word(chunk):
                yield sub