Python CategorizedPlaintextCorpusReader.paras Examples

Programming language: Python

Namespace/package name: nltk.corpus.reader

Method/function: paras

Examples at hotexamples.com: 1

Python CategorizedPlaintextCorpusReader.paras - 1 example found. This is the top-rated real-world Python example of nltk.corpus.reader.CategorizedPlaintextCorpusReader.paras, extracted from open source projects.
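
For orientation, paras() returns a list of paragraphs, where each paragraph is a list of sentences and each sentence is a list of word strings. The sketch below uses hand-built data in that shape rather than a real corpus, so it runs without NLTK. The sample sentences, the tiny stop word set, and the helper name flatten_paragraph are all hypothetical and only illustrate how such a nested result can be turned into one token list per paragraph.

```python
# Hand-built stand-in for the value returned by paras():
# paragraphs -> sentences -> word strings. Not read from a real corpus.
paras = [
    [["Das", "ist", "ein", "Test", "."], ["Er", "ist", "kurz", "."]],
    [["Die", "Sonne", "scheint", "!"]],
]

# Tiny hypothetical stop word set (the real list would come from NLTK).
stop_words = {"das", "ist", "ein", "er", "die"}


def flatten_paragraph(para, stop_words):
    """Return the lowercased words of one paragraph, without
    single-character tokens and without stop words."""
    return [
        w.lower()
        for sent in para
        for w in sent
        if len(w) > 1 and w.lower() not in stop_words
    ]


first_sentence = paras[0][0]  # first sentence of the first paragraph
docs = [flatten_paragraph(p, stop_words) for p in paras]

print(len(paras))      # number of paragraphs
print(first_sentence)
print(docs)
```

Indexing once selects a paragraph and indexing twice selects a sentence, which is why flattening needs two nested loops per paragraph.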

Example #1

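from nltk.corpus import stopwords
from nltk.corpus.reader import CategorizedPlaintextCorpusReader
from wordcloud import WordCloud  # not used in this excerpt

# German stop words (requires the NLTK "stopwords" corpus to be downloaded).
# Stored as a set so that membership tests are fast.
stopwordlist = set(stopwords.words('german'))

# Corpus root and the pattern selecting the feed-text files below it.
rootDir = "../01access/GERMAN"
filepattern = r"(?!\.)[\w_]+(/RSS/FeedText/)[\w-]+/[\w-]+\.txt"
#filepattern = r"(?!\.)[\w_]+(/RSS/FullText/)[\w-]+/[\w-]+\.txt"
# The first path component of each file id is used as its category.
catpattern = r"([\w_]+)/.*"
rssreader = CategorizedPlaintextCorpusReader(rootDir, filepattern, cat_pattern=catpattern)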


# In[3]:


# Each paragraph is a list of sentences; each sentence is a list of words.
techparas = rssreader.paras(categories="TECH")
singleDoc = techparas[0]
print("The first paragraph:\n", singleDoc)
print("Number of paragraphs in category TECH: ", len(techparas))


# In[4]:


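# One token list per paragraph: lowercase each word, drop single-character
# tokens (mostly punctuation) and German stop words.
techdocs = [
    [w.lower() for sent in para for w in sent
     if len(w) > 1 and w.lower() not in stopwordlist]
    for para in rssreader.paras(categories="TECH")
]
print("Number of documents in category Tech: ", len(techdocs))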


# In[5]:


# Same preprocessing for the paragraphs of category GENERAL.
generaldocs = [
    [w.lower() for sent in para for w in sent
     if len(w) > 1 and w.lower() not in stopwordlist]
    for para in rssreader.paras(categories="GENERAL")
]
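
The reader in this example is configured with two regular expressions: filepattern selects the files and catpattern derives a category from each file id. The sketch below reproduces that selection with the standard re module only. It assumes that NLTK requires the file pattern to match the whole path relative to the corpus root and takes the first capture group of cat_pattern as the category. The file ids and the helper name select_and_categorize are hypothetical.

```python
import re

# Patterns copied from the example above.
FILE_PATTERN = r"(?!\.)[\w_]+(/RSS/FeedText/)[\w-]+/[\w-]+\.txt"
CAT_PATTERN = r"([\w_]+)/.*"

# Hypothetical file ids, relative to the corpus root.
candidates = [
    "TECH/RSS/FeedText/golem/2017-11-02.txt",
    "GENERAL/RSS/FeedText/spiegel/feed-1.txt",
    "TECH/RSS/FullText/golem/2017-11-02.txt",  # FullText, not FeedText
    ".hidden/RSS/FeedText/golem/feed.txt",     # starts with a dot
]


def select_and_categorize(file_ids):
    """Map every file id accepted by FILE_PATTERN to the category
    captured by the first group of CAT_PATTERN."""
    result = {}
    for file_id in file_ids:
        if re.fullmatch(FILE_PATTERN, file_id):
            result[file_id] = re.match(CAT_PATTERN, file_id).group(1)
    return result


categories = select_and_categorize(candidates)
for file_id, category in categories.items():
    print(category, file_id)
```

Under these assumptions only the two FeedText files are kept, and their top-level directory names (TECH, GENERAL) become the categories that paras(categories=...) filters on.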