Python TextUtils.clean_text Examples

Programming language: Python

Namespace/package name: utils

Class/type: TextUtils

Method/function: clean_text

Code examples on hotexamples.com: 2

Python TextUtils.clean_text - 2 code examples found. These are real-world Python examples of utils.TextUtils.clean_text, extracted from open source projects and selected as the top-rated ones.

Frequently used methods

TextUtils(3)

clean_text(2)

create_dataset(2)

clean(1)

create_batch(1)

create_label_vocab(1)

create_word_vocab(1)

is_blank(1)

is_plain_text(1)

Example #1

File: HTMLParser.py Project: skywindy/fddc-extraction-release

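import codecs

from bs4 import BeautifulSoup

from utils import TextUtils  # assumed import path, based on the namespace listed above


# Method excerpted from a parser class in HTMLParser.py; the class itself is not shown.
def parse_content(self, html_file_path):
    """
    Parse the paragraph text in an HTML file.

    Paragraphs are returned in document order. Each paragraph is collected
    as a list of content lines, which are then joined into one string.

    :param html_file_path: path to the HTML file to parse
    :return: list of paragraph strings
    """
    rs = []
    with codecs.open(html_file_path, encoding='utf-8', mode='r') as fp:
        soup = BeautifulSoup(fp.read(), "html.parser")
        # Collect every div whose type attribute is 'paragraph'.
        paragraphs = []
        for div in soup.find_all('div'):
            div_type = div.get('type')
            if div_type is not None and div_type == 'paragraph':
                paragraphs.append(div)
        for paragraph_div in paragraphs:
            # Skip wrappers that contain a nested paragraph; the nested
            # paragraph is itself in the list and is handled on its own turn.
            has_sub_paragraph = False
            for div in paragraph_div.find_all('div'):
                div_type = div.get('type')
                if div_type is not None and div_type == 'paragraph':
                    has_sub_paragraph = True
            if has_sub_paragraph:
                continue
            # One sub-list of cleaned content lines per paragraph.
            rs.append([])
            for content_div in paragraph_div.find_all('div'):
                div_type = content_div.get('type')
                if div_type is not None and div_type == 'content':
                    rs[-1].append(TextUtils.clean_text(content_div.text))
    # Join each non-empty list of content lines into a single string.
    paragraphs = []
    for content_list in rs:
        if len(content_list) > 0:
            paragraphs.append(''.join(content_list))
    return paragraphs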
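
The listing above calls TextUtils.clean_text on the text of each content div, but the body of clean_text does not appear on this page. As a rough illustration only, here is a minimal sketch of what such a helper might do. The normalization steps (Unicode NFKC folding, then removing all whitespace) are assumptions, not the project's actual implementation.

```python
import re
import unicodedata


def clean_text(text):
    """Hypothetical stand-in for TextUtils.clean_text (real body not shown)."""
    # Assumption: fold full-width and other compatibility characters
    # (for example U+3000 ideographic space) into their canonical forms.
    text = unicodedata.normalize("NFKC", text)
    # Assumption: remove every whitespace character, including newlines.
    return re.sub(r"\s+", "", text)


print(clean_text(" 甲方\u3000乙方\n"))  # prints: 甲方乙方
```

Removing all whitespace suits text that is later joined with an empty separator, as parse_content does, but it would run English words together; a real helper may well behave differently.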

Example #2

import codecs

from bs4 import BeautifulSoup

from utils import TextUtils  # assumed import path, based on the namespace listed above


# Method excerpted from a parser class; the class itself is not shown on this page.
def parse_content(self, html_file_path):
    """
    Parse the paragraph text in an HTML file.

    Paragraphs are returned in document order. Each paragraph is collected
    as a list of content lines, which are then joined into one string.

    :param html_file_path: path to the HTML file to parse
    :return: list of paragraph strings
    """
    rs = []
    with codecs.open(html_file_path, encoding='utf-8', mode='r') as fp:
        soup = BeautifulSoup(fp.read(), "html.parser")
        paragraphs = []
        for div in soup.find_all('div'):
            div_type = div.get('type')
            # Collect the div blocks whose type is 'paragraph'.
            if div_type is not None and div_type == 'paragraph':
                paragraphs.append(div)
        for paragraph_div in paragraphs:
            has_sub_paragraph = False  # does this paragraph contain a nested paragraph?
            for div in paragraph_div.find_all('div'):
                div_type = div.get('type')
                if div_type is not None and div_type == 'paragraph':
                    has_sub_paragraph = True
            if has_sub_paragraph:
                continue  # skip: the nested paragraph is visited later in the loop
            rs.append([])  # each paragraph's content goes into its own sub-list of rs
            # Append the paragraph's content lines to that sub-list.
            for content_div in paragraph_div.find_all('div'):
                div_type = content_div.get('type')
                if div_type is not None and div_type == 'content':
                    rs[-1].append(TextUtils.clean_text(content_div.text))
    paragraphs = []
    for content_list in rs:
        if len(content_list) > 0:
            paragraphs.append(
                ''.join(content_list))  # join each content_list into one string
    return paragraphs
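
To make the traversal rule in parse_content concrete, the sketch below applies the same logic to a small in-memory sample: keep only innermost paragraph divs, gather their content divs, and drop empty paragraphs. It swaps BeautifulSoup for the standard library's xml.etree.ElementTree so that it runs without third-party packages, which means it only accepts well-formed markup. The sample markup, the leaf_paragraphs name, and the use of str.strip in place of TextUtils.clean_text are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample markup; the real input files are not shown on this page.
SAMPLE = """
<html><body>
  <div type="paragraph">
    <div type="paragraph">
      <div type="content"> First line. </div>
      <div type="content">Second line.</div>
    </div>
  </div>
  <div type="paragraph">
    <div type="content">Standalone line.</div>
  </div>
  <div type="paragraph"></div>
</body></html>
"""


def leaf_paragraphs(markup):
    """Return the joined content text of each innermost paragraph div."""
    root = ET.fromstring(markup.strip())
    result = []
    for paragraph in root.iter("div"):
        if paragraph.get("type") != "paragraph":
            continue
        # iter() also yields the element itself, so exclude it here.
        descendants = [d for d in paragraph.iter("div") if d is not paragraph]
        if any(d.get("type") == "paragraph" for d in descendants):
            continue  # outer wrapper: its inner paragraph is handled separately
        lines = [
            "".join(d.itertext()).strip()  # strip() stands in for clean_text
            for d in descendants
            if d.get("type") == "content"
        ]
        if lines:  # paragraphs without content are dropped
            result.append("".join(lines))
    return result


print(leaf_paragraphs(SAMPLE))
# prints: ['First line.Second line.', 'Standalone line.']
```

The outer wrapper paragraph is skipped because it contains another paragraph, and the empty paragraph contributes nothing, which mirrors the has_sub_paragraph check and the final length filter in the listing.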