Primary contributor: Abhishek Divekar
You can view examples of large databases built using this tool by visiting the Extracted_Articles repository, which contains 1,253 articles gathered from four online news sites.
The aim of this project was to make it easy to extract and store a large number of high-quality text articles from a list of websites, mainly online newspapers and blogs. The main idea was to create an extraction-code for each website: a 10-20 line script which can be applied to essentially any webpage on the domain to get the text of the article on that page. These extraction-codes are usually a mix of Python regular expressions and HTML-parsing code (e.g. BeautifulSoup).
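To illustrate the shape of an extraction-code, here is a hypothetical one for a single news site. The real extraction-codes mix regular expressions with HTML parsing (e.g. BeautifulSoup); this regex-only sketch, with made-up tag patterns, just shows the idea of a short script that turns one domain's raw HTML into article text.

```python
import re

def extract_article(html):
    # Pull the headline out of the first <h1> and the body out of <p> tags.
    # These patterns are illustrative; a real extraction-code is tuned to
    # one domain's actual markup.
    headline = re.search(r'<h1[^>]*>(.*?)</h1>', html, re.DOTALL)
    paragraphs = re.findall(r'<p[^>]*>(.*?)</p>', html, re.DOTALL)
    return {
        "headline": headline.group(1).strip() if headline else "",
        "body": "\n".join(p.strip() for p in paragraphs),
    }

sample = "<html><h1>Stocks Rally</h1><p>Markets rose today.</p><p>Analysts agree.</p></html>"
article = extract_article(sample)
print(article["headline"])  # Stocks Rally
```

A script of roughly this size, adapted to each domain's markup, is all an extraction-code needs to be.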
I noticed that a lot of the extraction codes for different websites looked very similar, and it was easy to build them from one another. It took me just a few hours to hack together the extraction codes for quite a few websites.
The reason is fairly obvious: articles on websites all follow the same structure: headline, sub-headline(s), body, date-time, etc. This common format got me wondering whether there was a better way to handle all the data that comes with an HTML page, and to abstract it into its necessary parts. Thus the project grew to include the ArticleObject class, which stores all of the relevant features of an article and allows them to be manipulated more fluidly by different functions.
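A minimal sketch of the idea behind ArticleObject: bundle the parts every article shares so they can be passed around and manipulated uniformly. The field names and helper method here are illustrative, not the class's actual API.

```python
from dataclasses import dataclass, field

@dataclass
class ArticleObject:
    # The recurring structure shared by news articles across sites.
    url: str
    headline: str
    body: str
    sub_headlines: list = field(default_factory=list)
    date: str = ""

    def word_count(self):
        # Example of a function that works uniformly on any article.
        return len(self.body.split())

a = ArticleObject(url="http://example.com/x", headline="Hi", body="one two three")
print(a.word_count())  # 3
```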
Another thing I found tiresome during development was finding articles: after I had found a few, it was both difficult and time-consuming to find more. This was particularly hard when I didn't know exactly what I was looking for; I would just end up in some deep corner of a domain which had nothing to do with what I originally wanted. So, I decided to enlist everyone's favourite search engine, Google. It would find the articles I wanted, but only on the websites I had extraction-codes for. This meant I could get the pure text, very fast (I built this when I needed a lot of data for another project).
- The article URLs are automatically found via Google Search results.
- We restrict the search query with certain criteria:
- specific words, e.g. company names, names of persons, etc.
- specific range of dates
- only over websites of your choice
- Such a specialized search ensures that the articles are of higher quality.
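A sketch of how such a restricted query might be assembled. The quoted-phrase and `site:` operators are standard Google search syntax; the function name and the way the real script encodes the date window are not shown here and are my own stand-ins.

```python
def build_query(keywords, sites):
    # Quote each keyword phrase so Google matches it exactly, and OR
    # together site: filters so results come only from chosen domains.
    phrase = " ".join('"%s"' % k for k in keywords)
    site_filter = " OR ".join("site:%s" % s for s in sites)
    return "%s (%s)" % (phrase, site_filter)

q = build_query(["Reliance Industries"], ["example-news.com", "example-blog.com"])
print(q)  # "Reliance Industries" (site:example-news.com OR site:example-blog.com)
```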
- Note: This step is deliberately slower than the others (taking around 3 minutes per page of results). This is because Google tends to notice if you send it hundreds of search queries in a short period of time, and temporarily blocks your IP.
To counter this as much as possible, the project code includes a redirect function: every few searches, it Google-searches something totally unrelated. This way your IP does not get restricted as quickly, allowing the code to gather more results before you need to wait and run the script again.
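The decoy idea can be sketched as follows: every few real searches, an unrelated query is slipped into the stream. The decoy topics and the interval (every 4th search here) are illustrative choices, not the script's actual values.

```python
import random

DECOY_QUERIES = ["weather today", "chocolate cake recipe", "football scores"]

def plan_searches(real_queries, decoy_every=4):
    # Interleave an unrelated search after every `decoy_every` real ones,
    # so the query stream looks less like automated scraping.
    planned = []
    for i, q in enumerate(real_queries, start=1):
        planned.append(q)
        if i % decoy_every == 0:
            planned.append(random.choice(DECOY_QUERIES))
    return planned

plan = plan_searches(["q1", "q2", "q3", "q4", "q5"])
# plan contains the five real queries with one decoy inserted after the 4th
```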
- This step is almost instantaneous with a fair internet connection:
- a 2 MBps connection takes about 5 seconds per extraction.
- It is unfortunately not possible to build a single extraction-code for all websites, as the HTML formatting varies from site to site (and even from time to time).
- However, with the few examples of extraction-codes provided, it is easy to extract the most relevant features: the headline, the sub-headlines and the body of the article.
- The basic template can then be tweaked to work for other websites, and for updates to existing websites.
- Inside the file "regex_builder_helper.py", there is an example template of how I went about building the extraction codes.
- The extraction-codes have been made as reusable as possible, with old extraction-codes saved in a MySQL table. When extracting text from a particular domain, the most recently updated code is tried first; if that fails, the script falls back to the older extraction-codes, in the hope that one of them works.
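The fallback scheme can be sketched like this: extraction-codes for a domain are kept newest-first (in the real project they live in a MySQL table), and each is tried until one yields text. The codes here are plain functions and purely illustrative.

```python
def extract_with_fallback(html, codes):
    # `codes` is ordered most-recently-updated first.
    for code in codes:
        try:
            result = code(html)
            if result:          # treat empty output as a failed extraction
                return result
        except Exception:
            continue            # a code that crashes: fall through to the next
    return None

new_code = lambda html: None          # newest code no longer matches the site
old_code = lambda html: html.upper()  # an older code still works
print(extract_with_fallback("body text", [new_code, old_code]))  # BODY TEXT
```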
- The filenames are automatically truncated to fit within the host OS's path-length limit.
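A sketch of such truncation, assuming a 255-byte cap (a common per-component filename limit on Linux and macOS filesystems); the real script's exact limit and naming scheme may differ.

```python
def safe_filename(name, ext=".txt", max_len=255):
    # Cut the name down so that name + extension fits the limit,
    # keeping the extension intact.
    room = max_len - len(ext)
    return name[:room] + ext

fname = safe_filename("a" * 300)
print(len(fname))  # 255
```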
- The salient feature of this design is that Google results are obtained in 30-day periods (one 'month'), so all articles in that period go into a single directory. The database is thus organized by months, each containing a specified number of articles. Note: the "month"s do NOT start on the 1st and run to the 30th/31st; they are just 30-day blocks used for organization.
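The 30-day bucketing boils down to integer division on the day offset from a chosen start date; the directory naming in the comment is illustrative.

```python
from datetime import date

def month_bucket(article_date, start_date):
    # How many whole 30-day blocks separate this article from the start?
    # Each block index maps to one directory in the database.
    return (article_date - start_date).days // 30

start = date(2016, 1, 15)
print(month_bucket(date(2016, 1, 20), start))  # 0  -> e.g. directory "month_0"
print(month_bucket(date(2016, 2, 20), start))  # 1
```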
- On subsequent runs of the script (which might be necessary due to the Google IP problem), we thus do not need to start from the beginning, but can start from a later month if we are satisfied with the articles obtained previously.
- This structure also lets you expand the number of articles obtained each month. The code supports this: it does not re-extract articles which already exist in the database (it looks at the URL part of the filenames which are already saved).
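Since the saved filenames encode the source URL, a quick scan of the existing files tells us which URLs to skip. The URL-to-filename encoding below is a made-up stand-in for whatever the script actually uses.

```python
def url_to_filename(url):
    # Hypothetical encoding: flatten the URL into a filesystem-safe name.
    return url.replace("://", "_").replace("/", "_") + ".txt"

def urls_to_fetch(candidate_urls, existing_filenames):
    # Keep only URLs whose encoded filename is not already on disk.
    existing = set(existing_filenames)
    return [u for u in candidate_urls if url_to_filename(u) not in existing]

saved = [url_to_filename("http://example.com/a")]
todo = urls_to_fetch(["http://example.com/a", "http://example.com/b"], saved)
print(todo)  # ['http://example.com/b']
```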
- This step becomes clearer if you look inside the "test_example" folder, whose contents are all generated automatically by running the script.
The end effect of all this is that we can quickly and painlessly build a large, high-quality database of articles just by specifying the following:
- The search query topic (e.g. "Reliance Industries").
- How many months of articles you want (from a specified starting point or from the present day).
- How many articles you want per month.
- Which websites you want to get these articles from (the extraction-code must already exist).
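The four knobs above can be pictured as a single call. The function name and signature here are hypothetical, not the script's actual interface; this only shows how the parameters fit together.

```python
def build_database(query, n_months, articles_per_month, sites):
    # Gather the run parameters described above into one plan.
    return {
        "query": query,
        "months": n_months,
        "per_month": articles_per_month,
        "sites": sites,
        "total_target": n_months * articles_per_month,
    }

plan = build_database("Reliance Industries", 12, 25, ["example-news.com"])
print(plan["total_target"])  # 300
```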