naiaden/VacCor

Processing the vacancies: The vacancies are stored in a single XML file. Since processing this file takes a lot of time, we want to distribute the work. This script walks through the XML file efficiently and splits it into individual vacancies, which are then passed on to multiple threads.
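The splitting step could be sketched roughly as follows. This is not the repository's script, only a minimal illustration of the idea, assuming each vacancy sits in a `<vacancy>` element (the tag name is a guess) and using a hypothetical `process_vacancy` worker that just counts tokens.

```python
import io
import xml.etree.ElementTree as ET
from concurrent.futures import ThreadPoolExecutor


def iter_vacancies(source, tag="vacancy"):
    """Yield each <tag> element as an XML string, one at a time.

    The tag name "vacancy" is an assumption about the file layout.
    """
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event == "end" and elem.tag == tag:
            yield ET.tostring(elem, encoding="unicode")
            root.clear()  # drop already-processed elements from the tree


def process_vacancy(xml_text):
    """Hypothetical worker: count whitespace-separated tokens in one vacancy."""
    elem = ET.fromstring(xml_text)
    return len("".join(elem.itertext()).split())


def process_all(source, workers=4):
    """Split the file into vacancies and hand them to a pool of threads."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(process_vacancy, iter_vacancies(source)))


if __name__ == "__main__":
    sample = io.BytesIO(
        b"<vacancies>"
        b"<vacancy><title>Python developer</title></vacancy>"
        b"<vacancy><title>Senior data engineer wanted</title></vacancy>"
        b"</vacancies>"
    )
    print(process_all(sample, workers=2))
```

Note that `Executor.map` submits every item up front, so for a very large file a bounded queue between the splitter and the workers would probably be a better fit than this sketch.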

Processing the Twente News Corpus: Since ucto is a memory hog on large input files, we tokenise the individual files separately. The tokenised files are then concatenated into one large file. Kind of like the reverse of the process for the vacancies :-) Concatenating is fine here because the background corpus is aggregated into type counts, ignoring any other information.
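The concatenate-then-count step might look something like the sketch below. It assumes the files have already been tokenised (for example by ucto) into whitespace-separated tokens; the function names and file layout are illustrative and not taken from the repository.

```python
from collections import Counter


def concatenate(tokenised_files, combined_path):
    """Append every tokenised file to one large file, one file at a time."""
    with open(combined_path, "w", encoding="utf-8") as out:
        for path in tokenised_files:
            with open(path, encoding="utf-8") as src:
                for line in src:
                    # Ensure a newline so tokens from adjacent files never merge.
                    out.write(line if line.endswith("\n") else line + "\n")


def type_counts(combined_path):
    """Aggregate the combined file into type counts, reading line by line."""
    counts = Counter()
    with open(combined_path, encoding="utf-8") as src:
        for line in src:
            counts.update(line.split())
    return counts
```

Because only the counts per type are kept, document boundaries and file order do not matter, which is why a plain concatenation is enough.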

About

The title is not directed to you, so take no offence
