ppBruce/dbce

 
 


Diff Based Content Extraction

Diff Based Content Extraction is a Python framework I developed for my bachelor thesis. Its main purpose was to research methods for extracting content from large collections of HTML documents stored in web archives.
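To illustrate the general idea behind the name, here is a minimal sketch of diff-based extraction. It is not this framework's API or the method from the thesis. It assumes that two pages from the same site share their boilerplate (navigation, footer) line for line, so lines that appear in only one page are likely content. The function name `extract_unique_lines` and the sample pages are hypothetical.

```python
import difflib


def extract_unique_lines(target_html: str, reference_html: str) -> list[str]:
    """Return the lines of target_html that have no match in reference_html.

    Illustrative assumption: shared lines are boilerplate, unique lines are content.
    """
    # Normalize: strip whitespace and drop empty lines.
    target = [line.strip() for line in target_html.splitlines() if line.strip()]
    reference = [line.strip() for line in reference_html.splitlines() if line.strip()]

    # Diff the two line sequences; autojunk is disabled so that frequently
    # repeated lines are not silently ignored on large pages.
    matcher = difflib.SequenceMatcher(a=reference, b=target, autojunk=False)

    unique = []
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        # "replace" and "insert" mark spans present only in the target page.
        if tag in ("replace", "insert"):
            unique.extend(target[j1:j2])
    return unique


# Two hypothetical snapshots that share the same template.
page_a = """<div id="nav">Home | About</div>
<p>Article one text.</p>
<div id="footer">Copyright</div>"""

page_b = """<div id="nav">Home | About</div>
<p>Article two text.</p>
<div id="footer">Copyright</div>"""

print(extract_unique_lines(page_a, page_b))  # ['<p>Article one text.</p>']
```

A line-level diff is the simplest possible granularity. Real archived pages would likely need comparison at the DOM level and against more than one reference document.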

Copyright notice

This repository contains content that has been crawled for research purposes.

About

Diff Based Content Extraction is part of my bachelor thesis, "Joint Approach to Boilerplate Detection in Web Archives".

Languages

  • HTML 57.2%
  • Python 41.8%
  • Other 1.0%