MichaelF89/zoekmachines

 
 

######
#
# Subset of New York Times Corpus
# for Project Information Retrieval (PIR)
#
#
# contact: Manos Tsagkias <e.tsagkias@uva.nl>
#          Christof Monz  <c.monz@uva.nl>
# last revision: 26 January 2010
########

The data/ directory contains a subset of the New York Times
Corpus released by the LDC. The subset includes 7,167
articles from April 2007.

The directory structure is as follows:
year-month -> day -> article.xml

Each article comes in one XML file in the
corresponding directory. 
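
The layout above can be walked with a few lines of Python. This is only
a minimal sketch: the function name count_articles_per_day is made up
for illustration, and it assumes nothing about the directory names
beyond the two levels (year-month, then day) described above.

```python
from collections import Counter
from pathlib import Path


def count_articles_per_day(data_dir):
    """Count article XML files under data_dir, keyed by (year-month, day)."""
    counts = Counter()
    # Layout from this README: year-month -> day -> article.xml
    for xml_path in Path(data_dir).glob("*/*/*.xml"):
        day_dir = xml_path.parent
        counts[(day_dir.parent.name, day_dir.name)] += 1
    return counts
```

Calling count_articles_per_day("data") from the repository root should
give one entry per day directory. Summing the values is a quick check
that the download is complete.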

Sample data is provided in the sample/ directory, which
contains 208 articles from May 01, 2007.

The docs/ directory contains guidelines on how to access
the XML data. Extraction tools written in Java can be
found in the tools/ directory.
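
For a first look at an article without the Java tools, the Python
standard library is enough. The sketch below does not assume any
particular element names, since the schema is documented in docs/ and
not in this README. It only reports which tags occur in a file and how
often. The function name summarize_tags is hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def summarize_tags(xml_file):
    """Return a Counter of element tag names found in one XML file."""
    root = ET.parse(xml_file).getroot()
    # root.iter() visits the root element and every descendant.
    return Counter(element.tag for element in root.iter())
```

Running summarize_tags on any file under sample/ gives a rough picture
of the document structure before reading the guidelines in docs/.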
   
