Data Extraction Methods

A runner for the automatic data extraction methods RoadRunner and Webstemmer, as well as for a semi-automatic method that extracts data using XPath with the help of the Scrapy framework. It works on news articles. Once your environment is set up, you can run all methods at once by executing the main Python script:

$ ./main.py

main.py runs four functions:

  1. scrape(), which downloads the HTML of the desired webpages into a chosen destination folder (see SCRAPE_DEST_FOLDER below) and creates a zip of them for Webstemmer to use. The destination folder is named scraped-folders. Because we sometimes want the corpus to stay constant, the environment variable ENABLE_WEB_SCRAPING may be set to False inside the .env file. In addition to the scraped websites, we also create 'mixed' groups, which contain HTML and zip files from different websites, to see how the methods perform on pages that are not similar to one another. The scraped-folders folder must be kept in sync with the FOLDER_NAMES constant in constants.py. The exact corpus used as an example for this work is available for download at: https://drive.google.com/file/d/1podWDM7qokj7Wn7-hS-yAPpOVX5nl6_3/view?usp=sharing
  2. roadrunner(), which executes the RoadRunner method and saves results and generated wrappers (in the form of *.html and *.xml files) into ./roadrunner/output/
  3. webstemmer(), which executes the Webstemmer method and saves results (in the form of *.txt files) and generated wrappers (in the form of *.pat files) into ./webstemmer/webstemmer/
  4. scrapy(), which executes a custom spider built on Scrapy, a web crawling framework, extracting data with XPath and writing the results into *.json files in ./scrapynews/scraped-content/ (a minimal spider sketch follows this list)
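
Item 4 refers to a custom Scrapy spider that extracts article data with XPath. Below is a minimal sketch of what such a spider could look like; the spider name, start URL, and XPath expressions are placeholders rather than the ones actually used in ./scrapynews/.

```python
# Illustrative spider only; names, URLs and XPath expressions are placeholders.
import scrapy


class NewsSpider(scrapy.Spider):
    name = "news"
    start_urls = ["https://example.com/news"]  # placeholder listing page

    def parse(self, response):
        # Follow links to the individual articles found on the listing page.
        for href in response.xpath("//article//a/@href").getall():
            yield response.follow(href, callback=self.parse_article)

    def parse_article(self, response):
        # Extract the title and body text with XPath and emit one item per article.
        yield {
            "url": response.url,
            "title": response.xpath("//h1/text()").get(),
            "content": " ".join(response.xpath("//article//p//text()").getall()).strip(),
        }
```

Such a spider can be run standalone with scrapy runspider news_spider.py -o articles.json to obtain JSON output.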

Finally, the time needed per webpage for each data extraction method is reported.
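
As a rough illustration, main.py could orchestrate the four steps and the timing along these lines; the module path in the import is an assumption (only the function names come from this README), and the real script reports time per webpage rather than a single total per method.

```python
#!/usr/bin/env python3
# Illustrative sketch of main.py's structure; the 'runners' module is hypothetical.
import os
import time

from runners import scrape, roadrunner, webstemmer, scrapy  # hypothetical module path


def main():
    # pipenv loads the .env file automatically, so the variables are in os.environ.
    if os.getenv("ENABLE_WEB_SCRAPING", "True").lower() != "false":
        scrape()  # download HTML into SCRAPE_DEST_FOLDER and zip it for Webstemmer

    # Time the extraction methods; here only the total per method is printed.
    for method in (roadrunner, webstemmer, scrapy):
        start = time.perf_counter()
        method()
        print(f"{method.__name__}: {time.perf_counter() - start:.2f} s")


if __name__ == "__main__":
    main()
```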

Environment Setup

  • Add URLs to constants.py to choose the webpages from which data will be extracted, together with the names of the folders into which the data will be saved. The folder names and URL indices must correspond (see the configuration sketch after this list)
  • Create a new file .env in the same folder as main.py and add the absolute path of the folder in which you wish to save scraped data to the SCRAPE_DEST_FOLDER environment variable, e.g. SCRAPE_DEST_FOLDER="/home/scraped-folders/". You may also set ENABLE_WEB_SCRAPING to False if you wish to prevent web scraping in the scrape function.
  • Initialize pipenv by running pipenv install
  • Open pipenv shell by running pipenv shell
  • Add the repository folder to the bash PATH variable, since Selenium needs the committed geckodriver binary (downloaded from the geckodriver GitHub releases page) to compare results. Do this by running PATH=${PATH}:$(pwd) in your bash terminal from inside the repository's root folder.
  • Run the program with ./main.py from the repository's root folder
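
For reference, the two configuration files described above might look as follows; the URLS constant name is an assumption (this README only names FOLDER_NAMES), and the URLs and path are placeholders.

```python
# constants.py -- the indices of the two lists must correspond
URLS = [  # name assumed; only FOLDER_NAMES is named in this README
    "https://example-news-site-1.com",
    "https://example-news-site-2.com",
]
FOLDER_NAMES = [
    "example-news-site-1",
    "example-news-site-2",
]
```

```
# .env -- placed in the same folder as main.py
SCRAPE_DEST_FOLDER="/home/scraped-folders/"
ENABLE_WEB_SCRAPING=False
```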
