Python clean_html Examples

Programming language: Python

Namespace/Package: streamcorpus_pipeline._clean_html

Method/Function: clean_html

Examples at hotexamples.com: 2

Python clean_html - 2 examples found. These are the top-rated real-world examples of streamcorpus_pipeline._clean_html.clean_html, extracted from open source projects.

Example #1

File: soft_selectors.py Project: anukat2015/dossier.models

import logging

import streamcorpus
from streamcorpus_pipeline._clean_html import clean_html
# Assumed module path for clean_visible; it is not shown on this page.
from streamcorpus_pipeline._clean_visible import clean_visible

logger = logging.getLogger(__name__)


def ids_and_clean_visible_from_streamcorpus_chunk_path(corpus_path):
    '''converts a streamcorpus.Chunk file into the structure that is
    passed by the search engine to find_soft_selectors

    '''
    # Instantiate both pipeline stages with their default configurations.
    ch = clean_html(clean_html.default_config)
    cv = clean_visible(clean_visible.default_config)
    ids_and_clean_visible = []
    for si in streamcorpus.Chunk(path=corpus_path):
        if not si.body.clean_visible:
            ## attempt to make clean_visible
            if not si.body.raw:
                logger.critical('no raw content, so skipping: %r', si.abs_url)
                continue
            # Save the URL first: a failed stage returns a falsy value in place of si.
            abs_url = si.abs_url
            si = ch(si, {})
            if not si:
                logger.critical(
                    'failed to make clean_html, so skipping: %r', abs_url)
                continue
            si = cv(si, {})
            if not si or not si.body.clean_visible:
                logger.critical(
                    'failed to make clean_visible, so skipping: %r', abs_url)
                continue
        rec = (si.stream_id, si.body.clean_visible.decode('utf8'), {})
        ids_and_clean_visible.append(rec)
    return ids_and_clean_visible
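The control flow of Example #1 (reuse existing clean_visible text, otherwise try to derive it, and skip the item when a stage fails) can be tried without installing streamcorpus. The following is only a sketch under stated assumptions: FakeItem, fake_clean_visible and ids_and_text are hypothetical stand-ins written for this illustration, and the regex-based tag stripping is not how streamcorpus_pipeline cleans HTML.

```python
import html
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class FakeItem:
    """Hypothetical stand-in for a StreamItem; not the streamcorpus class."""
    stream_id: str
    raw: Optional[bytes] = None
    clean_visible: Optional[bytes] = None


def fake_clean_visible(item):
    """Stand-in stage: strip tags from raw HTML; return None if no text is left."""
    text = re.sub(r'<[^>]+>', ' ', item.raw.decode('utf8'))
    text = ' '.join(html.unescape(text).split())
    if not text:
        return None
    item.clean_visible = text.encode('utf8')
    return item


def ids_and_text(items):
    """Build (stream_id, text, {}) records, skipping items that cannot be cleaned."""
    records = []
    for item in items:
        if not item.clean_visible:
            if not item.raw:
                continue  # no raw content to derive text from
            item = fake_clean_visible(item)
            if not item:
                continue  # the stage failed, so skip this item
        records.append((item.stream_id, item.clean_visible.decode('utf8'), {}))
    return records


items = [
    FakeItem('1-a', clean_visible=b'already clean'),
    FakeItem('2-b', raw=b'<p>Hello &amp; <b>world</b></p>'),
    FakeItem('3-c'),                 # neither raw nor clean_visible
    FakeItem('4-d', raw=b'<br/>'),   # cleaning leaves no text
]
records = ids_and_text(items)
```

Here records holds entries for '1-a' and '2-b' only; the other two items are dropped, mirroring the continue branches in the example above.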

Example #2

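import os

from streamcorpus import StreamItem, ContentItem
from streamcorpus_pipeline._clean_html import clean_html


def test_stage(test_data_dir):
    stage = clean_html({})  # NB: not even defaults
    path = os.path.join(test_data_dir, 'test')
    # Python 2 code: read() returns a byte string, hence the explicit decode.
    with open(os.path.join(path, 'nytimes-index.html'), 'r') as f:
        raw = f.read().decode('utf8')
    si = StreamItem(body=ContentItem(raw=raw, media_type='text/html'))
    si = stage(si, {})
    # Compare the stage output with the stored known-good result.
    with open(os.path.join(path, 'nytimes-index-clean-stable.html'), 'r') as f:
        stable = f.read()
    assert si.body.clean_html == stable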
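Example #2 compares the stage output against a stored known-good file. A minimal, self-contained sketch of that comparison pattern follows. It assumes a hypothetical fake_stage function (trailing-whitespace and line-ending normalization) in place of clean_html, and it writes its own temporary input and expected files rather than using the nytimes-index test data.

```python
import os
import tempfile


def fake_stage(raw):
    """Hypothetical stand-in for a cleaning stage: strips trailing whitespace per line."""
    return '\n'.join(line.rstrip() for line in raw.splitlines()) + '\n'


def check_against_golden(input_path, golden_path):
    """Run the stage on input_path and compare with the stored expected output."""
    with open(input_path, encoding='utf8') as f:
        actual = fake_stage(f.read())
    with open(golden_path, encoding='utf8') as f:
        expected = f.read()
    return actual == expected


with tempfile.TemporaryDirectory() as tmp:
    input_path = os.path.join(tmp, 'input.html')
    golden_path = os.path.join(tmp, 'input-clean-stable.html')
    # newline='' writes the bytes exactly as given on every platform.
    with open(input_path, 'w', encoding='utf8', newline='') as f:
        f.write('<p>hi</p>  \r\n<p>there</p>\r\n')
    with open(golden_path, 'w', encoding='utf8', newline='') as f:
        f.write('<p>hi</p>\n<p>there</p>\n')
    matches = check_against_golden(input_path, golden_path)
```

Keeping the expected output in a file next to the input, as the original test does, makes any change in stage behavior show up as a plain comparison failure.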