Python format_html_tokens Examples

Programming Language: Python

Namespace/Package Name: article_extraction.html

Method/Function: format_html_tokens

Examples at hotexamples.com: 2

Python format_html_tokens - 2 examples found. These are the top rated real-world Python examples of article_extraction.html.format_html_tokens extracted from open source projects.

Example #1

File: test_html.py Project: mylove00025/article_extraction

    def test_format_html_tokens(self):
        tokens = ["<p>", "this", "is", "a", "test", "</p>",
                  "<a>", "link", "</a>", "text",
                  "<h1>", "header", "</h1>"]

        expected_result = ["this", "is", "a", "test",
                           "\n", "\n",
                           "link", "text",
                           "\n",
                           "header",
                           "\n"]

        result = format_html_tokens(tokens)

        self.assertListEqual(result, expected_result)
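
The test above only pins down the expected output for one input, so the actual rules inside format_html_tokens are not shown here. The following is a minimal sketch of one rule set that would satisfy this test, not the project's implementation. The names format_html_tokens_sketch, TAG_PATTERN and HEADING_TAGS are hypothetical, and the rules (two line breaks for a closing paragraph tag, one line break for each heading tag, every other tag dropped) are assumptions inferred from the expected result.

```python
import re

# Hypothetical helper names; the rules below are inferred from the test only.
TAG_PATTERN = re.compile(r"^</?([A-Za-z][A-Za-z0-9]*)[^>]*>$")
HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


def format_html_tokens_sketch(tokens):
    """Replace tag tokens with line breaks and keep text tokens as-is."""
    result = []
    for token in tokens:
        match = TAG_PATTERN.match(token)
        if match is None:
            # Not a tag: keep the text token unchanged.
            result.append(token)
            continue
        name = match.group(1).lower()
        if name == "p" and token.startswith("</"):
            # Assumption: a closing paragraph tag yields a blank line.
            result.extend(["\n", "\n"])
        elif name in HEADING_TAGS:
            # Assumption: opening and closing heading tags each yield one break.
            result.append("\n")
        # Any other tag (for example <p>, <a>, </a>) is dropped.
    return result


sample_tokens = ["<p>", "this", "is", "a", "test", "</p>",
                 "<a>", "link", "</a>", "text",
                 "<h1>", "header", "</h1>"]
sample_result = format_html_tokens_sketch(sample_tokens)
```

Other rule sets could produce the same list for this input, so the sketch should be read as an illustration of the token-in, token-out contract rather than a description of the library.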

Example #2

File: mss.py Project: mylove00025/article_extraction

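    def extract_article(self, document):
        """Extract the article from the page contents."""
        # Parse the raw page and clean the resulting HTML tree.
        html_document = clean_html(html.document_fromstring(document))

        # Split the cleaned document into tokens and score each one.
        tokens = tokenize_html(html_document)
        scores = [self.scoring.score(term) for term in tokens]

        # Keep the tokens selected by extract_maximum_subsequence.
        terms = extract_maximum_subsequence(tokens, scores)

        # Convert the remaining tag tokens (see the test in Example #1).
        terms = format_html_tokens(terms)

        # Drop a space that directly follows a newline inside a term.
        terms = [re.sub(r"\n ", "\n", term, flags=re.UNICODE)
                 for term in terms]

        # Build the output text from the formatted terms.
        contents = create_text(terms)

        return contents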
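
The re.sub call in extract_article can be tried in isolation. The snippet below is a small illustration using made-up sample terms (the list terms and its values are hypothetical, not output from the project); it shows that the pattern removes a single space that directly follows a newline and leaves other terms untouched.

```python
import re

# Hypothetical sample terms; real input would come from format_html_tokens.
terms = ["intro", "\n ", "body", "\n \n tail"]

# Same substitution as in extract_article: "\n " becomes "\n" within each term.
cleaned = [re.sub(r"\n ", "\n", term, flags=re.UNICODE) for term in terms]

print(cleaned)  # ['intro', '\n', 'body', '\n\ntail']
```

Because the substitution runs on each term separately, it only affects terms that themselves contain a newline followed by a space; it does not touch spacing between neighbouring terms.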