Python Configuration.keep_article_html Examples

Programming language: Python

Namespace/Package: newspaper.configuration

Class/Type: Configuration

Method/Function: keep_article_html

Examples at hotexamples.com: 2

Python Configuration.keep_article_html - 2 examples found. These are the top-rated real-world examples of newspaper.configuration.Configuration.keep_article_html extracted from open source projects. You can rate the examples to help us improve their quality.

Frequently used methods

Configuration(26)

language(6)

fetch_images(5)

follow_meta_refresh(4)

get_parser(4)

browser_user_agent(3)

memoize_articles(3)

keep_article_html(2)

MAX_AUTHORS(1)

MAX_TITLE(1)

MIN_WORD_COUNT(1)

is_memoize_articles(1)

verbose(1)

Example #1

File: newspaper3k.py Project: 5l1v3r1/metahtml

def newspaper_fulltext2(parser, language, url):
    '''
    This is a faster version of the function that uses some internal newspaper3k functions
    so that the lxml parse tree doesn't need to be recreated.

    Adapted from https://github.com/codelucas/newspaper/blob/master/newspaper/api.py#L71
    but modified to use an already existing lxml parser
    '''
    from newspaper.cleaners import DocumentCleaner
    from newspaper.configuration import Configuration
    from newspaper.extractors import ContentExtractor
    from newspaper.outputformatters import OutputFormatter

    config = Configuration()
    config.language = language
    config.keep_article_html = True
    extractor = ContentExtractor(config)
    document_cleaner = DocumentCleaner(config)
    output_formatter = OutputFormatter(config)

    doc = parser
    doc = document_cleaner.clean(doc)
    doc = extractor.calculate_best_node(doc)
    if doc is not None:
        doc = extractor.post_cleanup(doc)
        text, html = output_formatter.get_formatted(doc)
    else:
        text = ''
        html = ''

    return {
        'value': {
            'text': text,
            'html': html,
        },
        'pattern': 'newspaper3k',
    }
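When `calculate_best_node` returns `None`, Example #1 falls back to empty strings for both `text` and `html` rather than raising. A caller might therefore want to check for that case before using the result. The following is a minimal sketch of such a check; the helper name `has_extracted_content` and the sample dictionaries are hypothetical and not part of either project.

```python
def has_extracted_content(result):
    """Return True when the result dict carries a non-empty text or html value."""
    value = result.get('value', {})
    return bool(value.get('text') or value.get('html'))


# Hypothetical sample: the shape returned when no best node was found.
empty = {'value': {'text': '', 'html': ''}, 'pattern': 'newspaper3k'}

# Hypothetical sample: the shape returned when extraction succeeded.
found = {
    'value': {'text': 'Hello', 'html': '<p>Hello</p>'},
    'pattern': 'newspaper3k',
}

print(has_extracted_content(empty))
print(has_extracted_content(found))
```

Checking both fields keeps the helper usable even if a caller later builds the same dictionary with `keep_article_html` left off, in which case `html` may be empty while `text` is not.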

Example #2

File: newspaper3k_modified.py Project: maxinebaghdadi/metahtml

def modified_fulltext(parser, language, url):
    '''
    Adapted from https://github.com/codelucas/newspaper/blob/master/newspaper/api.py#L71
    but modified to use an already existing lxml parser
    '''
    from urllib.parse import urlparse

    url_parsed = urlparse(url)

    from newspaper.cleaners import DocumentCleaner
    from newspaper.configuration import Configuration
    from newspaper.extractors import ContentExtractor
    from newspaper.outputformatters import OutputFormatter

    config = Configuration()
    config.language = language
    config.keep_article_html = True
    extractor = ContentExtractor(config)
    document_cleaner = DocumentCleaner(config)
    output_formatter = OutputFormatter(config)

    # rm_ads, clean, calculate_best_node, post_cleanup and get_formatted are
    # not shown in this excerpt; presumably they are defined elsewhere in the
    # project and replace the newspaper3k calls left commented out below.
    doc = parser
    doc = rm_ads(doc,url_parsed.hostname)
    doc = clean(document_cleaner,doc)
    #doc = document_cleaner.clean(doc)
    doc = calculate_best_node(extractor,doc)
    #doc = extractor.calculate_best_node(doc)
    if doc is not None:
        #doc = extractor.add_siblings(doc)
        doc = post_cleanup(doc)
        #doc = extractor.post_cleanup(doc)
        text, html = get_formatted(doc)
        #text, html = output_formatter.get_formatted(doc)
    else:
        text = ''
        html = ''

    return {
        'value' : {
            'text' : text,
            'html' : html,
        },
        'pattern' : 'modified',
    }
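Example #2 passes `urlparse(url).hostname` to its `rm_ads` helper. As a small standard-library illustration of what that value looks like, the sketch below wraps the call in a function; the name `hostname_of` and the sample URLs are made up for this illustration and do not come from the project.

```python
from urllib.parse import urlparse


def hostname_of(url):
    """Return the lowercased host part of a URL, or None if the URL has no network location."""
    return urlparse(url).hostname


# Hypothetical sample URLs.
print(hostname_of('https://WWW.Example.com:8080/news/story.html?id=1'))
print(hostname_of('example.com/news/story.html'))
```

The `hostname` attribute drops the port and lowercases the host. For a URL without a scheme or leading `//`, such as the second sample, `urlparse` treats the whole string as a path and `hostname` is `None`, so any code that receives this value should probably be prepared for `None`.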