A runner for the automatic data extraction methods RoadRunner and Webstemmer, as well as a semi-automatic method that extracts data using XPath with the help of the Scrapy framework. It works on news articles. Once your environment is set up, you can run all methods at once by running the executable Python script:

```
$ ./main.py
```
`main.py` runs four functions:

- `scrape()`, which downloads the HTML from the desired webpages into a chosen destination folder (see `SCRAPE_DEST_FOLDER` below) and creates a zip of them for Webstemmer to use. The destination folder is named `scraped-folders`. As we sometimes want the corpus to stay constant, the environment variable `ENABLE_WEB_SCRAPING` may be set to false inside the `.env` file. In addition to the scraped websites, we also create 'mixed' groups which contain HTML and zip files from different websites, to see how the methods perform on pages which are not similar to one another. The `scraped-folders` folder must be kept in sync with the `FOLDER_NAMES` constant in `constants.py`. The exact corpus used as an example for this work is available for download at: https://drive.google.com/file/d/1podWDM7qokj7Wn7-hS-yAPpOVX5nl6_3/view?usp=sharing
- `roadrunner()`, which executes the RoadRunner method and saves results and generated wrappers (in the form of `*.html` and `*.xml` files) into `./roadrunner/output/`
- `webstemmer()`, which executes the Webstemmer method and saves results (in the form of `*.txt` files) and generated wrappers (in the form of `*.pat` files) into `./webstemmer/webstemmer/`
- `scrapy()`, which executes a custom implementation of Scrapy, a web crawler that extracts data using XPath, and writes the results into `*.json` files in `./scrapynews/scraped-content/`
At the end, the time needed per webpage by each data extraction method is displayed.
- Add URLs to `constants.py` to choose the custom webpages from which data will be extracted, and the folder names into which they will be extracted. The folder name and URL indices must correspond.
- Create a new file `.env` in the same folder as `main.py` and add the absolute path of the folder in which you wish to save scraped data to the `SCRAPE_DEST_FOLDER` environment variable, e.g. `SCRAPE_DEST_FOLDER="/home/scraped-folders/"`. You may also set `ENABLE_WEB_SCRAPING` to False if you wish to prevent web scraping in the `scrape` function.
- Initialize pipenv by running `pipenv install`.
- Open the pipenv shell by running `pipenv shell`.
- Add the repository folder to the bash `PATH` variable, since the committed `geckodriver` binary is needed by Selenium to compare results. The executable is downloaded from the geckodriver GitHub page. This is done by running `PATH=${PATH}:$(pwd)` in your bash terminal while located inside the repository's root folder.
- Run the program by running `./main.py` from the repository's root folder.
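The index correspondence between URLs and folder names described above might look like this in `constants.py`. The values here are placeholders, not the repository's actual lists.

```python
# Illustrative constants.py: index i of URLS corresponds to index i of
# FOLDER_NAMES, so each site is scraped into its own folder.
URLS = [
    "https://example.com/news",      # placeholder URL
    "https://example.org/articles",  # placeholder URL
]
FOLDER_NAMES = [
    "example-com",
    "example-org",
]

# The two lists must stay the same length so the indices correspond.
assert len(URLS) == len(FOLDER_NAMES)
```

Keeping the pairing positional avoids a separate mapping structure, at the cost of having to update both lists together.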
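Reading the two `.env` variables might be sketched as follows. This is a minimal stdlib sketch that assumes the variables are already present in the environment (e.g. loaded by a helper such as python-dotenv); the function name and the default values are assumptions, not the repository's code.

```python
import os

def get_scrape_config():
    """Read scraping configuration from the environment.

    SCRAPE_DEST_FOLDER: absolute path where scraped HTML is stored.
    ENABLE_WEB_SCRAPING: set to "False" to keep the corpus constant.
    """
    dest = os.environ.get("SCRAPE_DEST_FOLDER", "/home/scraped-folders/")
    # Treat anything other than the literal string "False" as enabled.
    enabled = os.environ.get("ENABLE_WEB_SCRAPING", "True") != "False"
    return dest, enabled
```

Returning both values from one function keeps the scraping toggle and the destination path in a single place for `scrape()` to consume.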