Example #1
	def get_body(self, content):
		# Strip <ref>...</ref> citation markup, the abbreviation "i.e",
		# stray periods, and all remaining non-alphanumeric characters,
		# then collapse runs of spaces. Requires "import re" at module level.
		content = re.sub(r"<ref.*?</ref>", ' ', content)
		content = re.sub(r"i\.e", '', content)
		content = re.sub(r"\.", ' ', content)
		content = re.sub(r'[^a-zA-Z0-9 ]', '', content)
		content = re.sub(r' +', ' ', content)

		return remove_stop_words(stem_tokens(tokenize(content.lower())))
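All of these examples lean on tokenize, stem_tokens, and remove_stop_words, none of which appear on this page. A minimal sketch of what they might look like, assuming NLTK (the original project may implement them differently):

	from nltk.stem import PorterStemmer
	from nltk.corpus import stopwords

	_stemmer = PorterStemmer()
	_stops = set(stopwords.words('english'))  # needs nltk.download('stopwords') once

	def tokenize(text):
		# Plain whitespace split; the real helpers may use a smarter tokenizer.
		return text.split()

	def stem_tokens(tokens):
		return [_stemmer.stem(t) for t in tokens]

	def remove_stop_words(tokens):
		return [t for t in tokens if t not in _stops]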
Example #2
	def extract_external_links(self, content):
		# Lines in a wiki "External links" section look like
		# "* [http://... anchor text]"; keep the anchor text, drop the URL.
		for line in content.split("\n"):
			if '* [' in line or '*[' in line:
				words = [token for token in line.split(' ') if 'http' not in token]
				text = ' '.join(words)
				self.article.token['external_links'].extend(
					remove_stop_words(stem_tokens(tokenize(text))))
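The key step is the list-comprehension filter that drops URL tokens. In isolation (plain Python, no class needed):

	line = "* [http://example.org Official site of the project]"
	kept = [token for token in line.split(' ') if 'http' not in token]
	print(' '.join(kept))  # -> "* Official site of the project]"
	# Leftover wiki markup ("*" and the trailing "]") survives this filter.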
Example #3
	def get_tokens(self, content, title):
		# Tokenize each article field separately and store it under a
		# per-field key (lowercase keys, consistent with the other fields).
		self.article.token['title'] = tokenize(title)
		self.article.token['headings'] = tokenize(self.get_headings())
		self.article.token['references'] = tokenize(self.get_references(content))
		self.article.token['text'] = self.get_body(self.article.content)
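self.article.token is evidently a dict mapping field names to token lists. A hypothetical sketch of the Article container these methods assume (the name and shape are guesses from usage, not from the source):

	class Article:
		def __init__(self, content):
			self.content = content
			# One token list per indexed field, matching the keys used above.
			self.token = {
				'title': [],
				'headings': [],
				'references': [],
				'text': [],
				'external_links': [],
			}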
Example #4
def processTitle(title):
	"""
	The title is converted to lower case, tokenized, and stemmed.
	"""
	# "stemmer" is expected to be defined at module level; note that
	# stem_tokens takes it as an explicit argument here, unlike Example #1.
	return stem_tokens(tokenize(title.lower()), stemmer)
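A self-contained usage sketch for this variant, with an NLTK PorterStemmer standing in for the module-level stemmer (an assumption, not shown in the source):

	from nltk.stem import PorterStemmer

	stemmer = PorterStemmer()

	def tokenize(text):
		return text.split()

	def stem_tokens(tokens, stemmer):
		return [stemmer.stem(t) for t in tokens]

	print(processTitle("Conservation of Angular Momentum"))
	# -> ['conserv', 'of', 'angular', 'momentum'] with the Porter stemmer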