Python HTMLParser.count Examples

Programming language: Python

Namespace / package name: HTMLParser

Class / type: HTMLParser

Method / function: count

Examples on hotexamples.com: 1

Python HTMLParser.count - 1 example found. This is the top-rated Python example of HTMLParser.HTMLParser.count, extracted from open source projects.

Frequently used methods

HTMLParser(30)

__init__(30)

close(30)

feed(30)

encode(19)

get_starttag_text(4)

decode(3)

endswith(3)

get_data(2)

_init_(2)

clear_cdata_mode(2)

parse_mangastream(2)

fed(2)

rstrip(1)

handle_date(1)

error(1)

getContentFromTags(1)

find(1)

__getattribute(1)

count(1)

upper(1)

Example #1

File: textfiles-to-mongodb.py Project: bildlich/textfiles-to-mongodb

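import io
import random
import re

from HTMLParser import HTMLParser  # Python 2 module name
from nltk.tokenize import sent_tokenize
import smartypants

# Note: complete_pairs() is not defined in this excerpt; it presumably
# lives elsewhere in the project.

def fileToSentenceList(pathToTextFile):
	# Import string from file; the with-block closes the file afterwards
	with io.open(pathToTextFile, mode='r') as textFile:
		rawString = textFile.read().strip()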

	# Basic cleaning: replace line breaks with spaces
	def removeLineBreaks(string):
		cleanString = re.sub(r"[\n\r]+", " ", string)  # Line breaks to spaces
		cleanString = re.sub(r"\s{2,}", " ", cleanString)  # Collapse repeated whitespace
		return cleanString

	cleanString = removeLineBreaks(rawString)
	
	# Use nltk to tokenize sentences
	# See http://www.nltk.org/api/nltk.tokenize.html#module-nltk.tokenize
	sentences = sent_tokenize(cleanString)

	# Look at all the sentences and throw out things that we don't like
	buffer = sentences
	sentences = []
	for sentence in buffer:

		# 1.
		# Throw out sentences that don't begin w/ capital letter (happens often after direct speech).
		# These are correct sentences but I prefer not to have them in the pool because they make little
		# sense without context.
		# The pattern allows leading whitespace, brackets, quotes and dashes before the first letter or digit.
		regex = r'^[\s({\["\'“‘\-«»‹›]*[A-ZÄ-Ü0-9]'
		match = re.match(regex, sentence)
		if match is None:
			#print "thrown out b/c sentence doesn't start w/ capital letter: ", sentence
			continue

		# 2.
		# Throw out one-word or two-word sentences that contain numbers
		# They are probably headlines: 'Chapter 2.' or '1.F.1.'
		# str.count(" ") counts the spaces: fewer than 2 spaces means at most two words
		if sentence.count(" ") < 2 and re.search(r"\d", sentence) is not None:
			#print "thrown out b/c it seems like a nonsensical headline:", sentence
			continue

		# Remove white-space at the beginning and end
		sentence = sentence.strip()
		
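		# Use typographically correct quotation marks, apostrophes and dashes.
		# smartypants emits HTML entities, so unescape them into Unicode characters.
		# (HTMLParser().unescape is the Python 2 API; Python 3 offers html.unescape instead.)
		sentence = HTMLParser().unescape(smartypants.smartypants(sentence))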
		
		# Avoid unclosed (or unopened) quotation marks, parentheses, brackets, braces
		sentence = complete_pairs(sentence)

		sentences.append({
			'sentence': sentence,
			'numberOfWords': sentence.count(' ') + 1,
			'file': pathToTextFile,
			'randomPoint': [random.random(), 0] # For efficient random entry retrieval. See http://stackoverflow.com/a/9499484/836005
		})
		
	return sentences
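
The two filter rules and the word count in the example rely on `str.count`, and the entity decoding has a direct standard-library counterpart in Python 3. The following is a minimal sketch of those steps in isolation, assuming Python 3 and no third-party packages. The names `STARTS_WITH_CAPITAL`, `keep_sentence` and `word_count` are hypothetical helpers introduced here for illustration; they are not part of the original project.

```python
import html
import re

# Optional opening punctuation, then an uppercase letter or digit
# (same pattern as in the example above).
STARTS_WITH_CAPITAL = re.compile(r'^[\s({\["\'“‘\-«»‹›]*[A-ZÄ-Ü0-9]')


def keep_sentence(sentence):
    """Return True if the sentence passes both filter rules (hypothetical helper)."""
    # Rule 1: must start with a capital letter or digit.
    if STARTS_WITH_CAPITAL.match(sentence) is None:
        return False
    # Rule 2: fewer than two spaces (at most two words) plus a digit
    # suggests a headline such as 'Chapter 2.'.
    if sentence.count(" ") < 2 and re.search(r"\d", sentence) is not None:
        return False
    return True


def word_count(sentence):
    """Approximate the word count as number of spaces plus one (hypothetical helper)."""
    return sentence.count(" ") + 1


candidates = ["Chapter 2.", "he said quietly.", "The rain stopped at noon.", "Stop!"]
kept = [s for s in candidates if keep_sentence(s)]
print(kept)
print(word_count("The rain stopped at noon."))

# Python 3 counterpart of HTMLParser().unescape: decode HTML entities.
print(html.unescape("&#8220;Hello&#8221;"))
```

Counting spaces only approximates the number of words; it assumes that whitespace has already been collapsed to single spaces, as the cleaning step in the example does.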