Python filterRepeatedchars Examples

Programming language: Python

Namespace/package name: scraper_utils

Method/function: filterRepeatedchars

Examples at hotexamples.com: 3

Python filterRepeatedchars - 3 examples found. These are the top-rated real-world Python examples of scraper_utils.filterRepeatedchars, extracted from open source projects.

Example #1

 def checkAndCleanText(self, inputText, rawData):
     """ Check and clean article text
     """
     cleanedText = inputText
     invalidFlag = False
     try:
         for badString in self.invalidTextStrings:
             if cleanedText.find(badString) >= 0:
                 logger.debug(
                     "%s: Found invalid text strings in data extracted: %s",
                     self.pluginName, badString)
                 invalidFlag = True
         # check if article content is not valid or is too little
         if invalidFlag or len(cleanedText) < self.minArticleLengthInChars:
             cleanedText = self.extractArticleBody(rawData)
         # replace repeated spaces, tabs, hyphens, '\n', '\r\n', etc.
         cleanedText = filterRepeatedchars(
             cleanedText,
             deDupeList([' ', '\t', '\n', '\r\n', '-', '_', '.']))
         cleanedText = cleanedText.replace('\n', ' ')
         # remove invalid substrings:
         for stringToFilter in deDupeList(self.subStringsToFilter):
             cleanedText = cleanedText.replace(stringToFilter, " ")
     except Exception as e:
         logger.error("Error cleaning text: %s", e)
     return cleanedText
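
The example above calls deDupeList, whose definition is not shown on this page. Judging by the name and how it is used, it presumably removes duplicate entries from a list before the list is passed on. A minimal sketch of such a helper follows; the name dedupe_list and the order-preserving behavior are assumptions, and the real scraper_utils helper may behave differently.

```python
def dedupe_list(items):
    """Return items with duplicates removed, keeping first-seen order.

    Hypothetical stand-in for deDupeList; not the actual implementation.
    """
    # dict keys are unique and keep insertion order (Python 3.7+)
    return list(dict.fromkeys(items))


# Example: the repeated ' ' entry is dropped, the order is unchanged
print(dedupe_list([' ', '\t', ' ', '-']))
```

Keeping the original order can matter when the result drives a sequence of replacements, since each replacement sees the output of the previous one.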

Example #2

def test_filterRepeatedchars():
    # Test that runs of repeated characters are collapsed
    import sys  # required by sys.path.append below
    (parentFolder, sourceFolder, testdataFolder) = getAppFolders()
    sys.path.append(sourceFolder)
    import scraper_utils
    baseText = 'A good sentence with repeated    spaces and tabs \t\t\t and\n\n\n newlines and hyphens---- dots....'
    charList = [' ', '\t', '\n', '-']
    resultText = scraper_utils.filterRepeatedchars(baseText, charList)
    print('Result after filtering repeated characters:\n', resultText)
    assert resultText == "A good sentence with repeated spaces and tabs \t and\n newlines and hyphens- dots....",\
        "10. filterRepeatedchars() is not filtering repeated characters correctly."
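
The test above pins down the expected behavior: each run of a listed character collapses to a single occurrence, while characters not in the list (the dots here) are left alone. The following is a minimal sketch of a function with that behavior. It is not the scraper_utils implementation; the name filter_repeated_chars and the regex-based approach are assumptions made for illustration.

```python
import re


def filter_repeated_chars(text, char_list):
    """Collapse each run of a listed string into a single occurrence.

    Hypothetical stand-in for scraper_utils.filterRepeatedchars; the
    real implementation may differ.
    """
    for item in char_list:
        # (?:item){2,} matches two or more back-to-back copies of item;
        # re.escape keeps characters such as '.' and '-' literal
        pattern = "(?:" + re.escape(item) + "){2,}"
        # a function is used as the replacement so item is inserted verbatim
        text = re.sub(pattern, lambda _match: item, text)
    return text


base_text = ('A good sentence with repeated    spaces and tabs \t\t\t and\n\n\n'
             ' newlines and hyphens---- dots....')
print(repr(filter_repeated_chars(base_text, [' ', '\t', '\n', '-'])))
```

Because the pattern is built per list entry, multi-character entries such as '\r\n' (used in the other examples) are handled the same way as single characters.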

Example #3

 def checkAndCleanText(self, inputText, rawData):
     """ Check and clean article text
     """
     cleanedText = inputText
     try:
         # ignore the newspaper extracted text, the alternate method text is more accurate:
         cleanedText = self.extractArticleBody(rawData)
         for badString in self.invalidTextStrings:
             if cleanedText.find(badString) >= 0:
                 logger.debug("%s: Found invalid text strings in data extracted: %s", self.pluginName, badString)
                 return None
         # replace repeated spaces, tabs, hyphens, '\n', '\r\n', etc.
         cleanedText = filterRepeatedchars(cleanedText,
                                           deDupeList([' ', '\t', '\n', '\r\n', '-', '_', '.']))
         # remove invalid substrings:
         for stringToFilter in deDupeList(self.subStringsToFilter):
             cleanedText = cleanedText.replace(stringToFilter, " ")
     except Exception as e:
         logger.error("Error cleaning text: %s", e)
     return cleanedText