News Crawler

This repo provides a Docker-based utility for parsing CommonCrawl News (a.k.a. CC-NEWS) files into a state ready for text analysis.

Most of the code is taken from the RealNews part of the Grover repo and adapted to run in Docker.

Understanding CC-NEWS

CC-NEWS is the 'live' news version of the CommonCrawl (CC-MAIN). CC-MAIN provides a monthly slice of the internet, which is useful but only refreshed once a month. CC-NEWS produces a new file of news every 1-2 hours, with a typical size of 1.2GB.

Apart from size, the main difficulties come from the WARC container format and the raw HTML of the crawled pages. This repo extracts only the text of each article and stores the results directly in an S3 bucket of your choice.
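To give a feel for what "text only" means for a raw HTML page, here is a minimal sketch using only the Python standard library. It is not the extraction code this repo uses; the class and function names (TextExtractor, html_to_text) and the list of block-level tags are assumptions made for illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""

    SKIPPED_TAGS = {"script", "style"}
    # Illustrative subset of tags that should start a new line of text.
    BLOCK_TAGS = {"p", "div", "br", "li", "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1
        elif tag in self.BLOCK_TAGS:
            self._chunks.append("\n")

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS:
            if self._skip_depth > 0:
                self._skip_depth -= 1
        elif tag in self.BLOCK_TAGS:
            self._chunks.append("\n")

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        lines = "".join(self._chunks).splitlines()
        # Collapse runs of whitespace and drop empty lines.
        cleaned = (" ".join(line.split()) for line in lines)
        return "\n".join(line for line in cleaned if line)


def html_to_text(html):
    """Return the visible text of an HTML document, one block per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

A real news page also needs boilerplate removal (menus, footers, ads), which this sketch does not attempt.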

For a sample result, see the included realnews_tiny.jsonl file (also taken from the Grover repo).
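A quick way to inspect the sample file is to read it line by line. The sketch below assumes, based on the .jsonl extension, that each non-empty line holds one JSON object; the helper names read_jsonl and field_names are made up for this example, and no particular field names are assumed.

```python
import json


def read_jsonl(path):
    """Yield one parsed JSON object per non-empty line of a JSON Lines file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def field_names(records):
    """Return the sorted set of keys seen across all records."""
    keys = set()
    for record in records:
        keys.update(record)
    return sorted(keys)
```

Calling field_names(read_jsonl("realnews_tiny.jsonl")) from the repo root should list the fields available in each record.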

Usage with Docker

Clone this repo, move to the real_news folder, and build the image:

cd real_news
docker build --tag real_news .

Run on a selected file

docker run \
    -e AWS_ACCESS_KEY=[YOUR_ACCESS_KEY] \
    -e AWS_SECRET_KEY=[YOUR_SECRET_KEY] \
    real_news \
    python real_news.py \
    --path crawl-data/CC-NEWS/YYYY/MM/FILE_NAME \
    --bucket_name [YOUR_S3_TARGET_BUCKET]

For example, for the file CC-NEWS-20201015155253-00179.warc.gz:

docker run \
    -e AWS_ACCESS_KEY=my_super_access_key \
    -e AWS_SECRET_KEY=my_super_secret_key \
    real_news \
    python real_news.py \
    --path crawl-data/CC-NEWS/2020/10/CC-NEWS-20201015155253-00179.warc.gz \
    --bucket_name real-news
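The --path value can be derived from the file name alone, since the year and month are encoded in its timestamp. The helper below is a sketch, not part of this repo: path_for_file is a hypothetical name, and the file name pattern is inferred from the examples in this README rather than from any CommonCrawl specification.

```python
import re

# Pattern inferred from the file names shown in this README:
# CC-NEWS-<YYYYMMDDhhmmss>-<serial>.warc.gz
FILE_NAME_PATTERN = re.compile(r"^CC-NEWS-(\d{4})(\d{2})\d{8}-\d+\.warc\.gz$")


def path_for_file(file_name):
    """Build the crawl-data path expected by --path from a CC-NEWS file name."""
    match = FILE_NAME_PATTERN.match(file_name)
    if match is None:
        raise ValueError(f"unexpected CC-NEWS file name: {file_name!r}")
    year, month = match.groups()
    return f"crawl-data/CC-NEWS/{year}/{month}/{file_name}"
```

For the file above, path_for_file("CC-NEWS-20201015155253-00179.warc.gz") gives the same path as in the example command.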

Finding news files

First, find a news file to parse. To list all the files for a given day, use

aws s3 ls s3://commoncrawl/crawl-data/CC-NEWS/YYYY/MM/CC-NEWS-YYYYMMDD --no-sign-request

Replace YYYY, MM and DD with the desired year, month and day. For instance, the following call

aws s3 ls s3://commoncrawl/crawl-data/CC-NEWS/2020/10/CC-NEWS-20201015 --no-sign-request

yields a list of 18 files for that day:

[DATE]     [TIME]   [SIZE_BYTES] [FILE_NAME]
2020-10-15 05:05:03 1072702920 CC-NEWS-20201015011026-00168.warc.gz
2020-10-15 07:05:03 1072729221 CC-NEWS-20201015032129-00169.warc.gz
[...]
2020-10-16 02:05:03 1072726498 CC-NEWS-20201015225649-00185.warc.gz
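If you want to script this step, the prefix for a given day and the parsing of the listing can be sketched as follows. The function names listing_prefix and parse_listing are hypothetical; the code reads the third column as the object size in bytes, which is how aws s3 ls reports it, and ignores any line that does not have the four expected columns.

```python
from datetime import date


def listing_prefix(day):
    """Build the S3 prefix that matches all CC-NEWS files for one day."""
    return (
        "s3://commoncrawl/crawl-data/CC-NEWS/"
        f"{day:%Y}/{day:%m}/CC-NEWS-{day:%Y%m%d}"
    )


def parse_listing(output):
    """Turn `aws s3 ls` output into (file_name, size_in_bytes) pairs."""
    files = []
    for line in output.splitlines():
        parts = line.split()
        # Expect exactly: date, time, size, file name. Skip anything else.
        if len(parts) == 4 and parts[2].isdigit():
            files.append((parts[3], int(parts[2])))
    return files
```

Pass listing_prefix(date(2020, 10, 15)) to aws s3 ls together with --no-sign-request, then feed the command's output to parse_listing to get the file names to use with --path.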
