alifbae/news-data-extraction

News Data Extraction

Scripts for extracting news articles from US newspapers

Scraped data is available in folders inside each newspaper's directory within the 'articleData' directory.

Structure:

Each .json file in the articleData directory stores the data for one article as valid JSON. Each JSON object has 5 keys:

  • 'Title' : Heading of the article
  • 'Content' : Body of the article
  • 'Date' : Date the article was published
  • 'Author' : Author(s) of the article
  • 'Link' : URL of that article

Note: Some articles have "NULL" as their 'Author' value because they are op-eds or opinion pieces that do not necessarily have an author (e.g. letters to the editor).
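The structure above can be checked in code before the data is used. The following is a minimal sketch, not part of the repository's scripts: `parse_article` and `EXPECTED_KEYS` are hypothetical names, and the sketch assumes the 'Author' value is the literal string "NULL" when no author is available.

```python
import json

# The 5 keys listed above; this set is an illustration, not part of the repo.
EXPECTED_KEYS = {"Title", "Content", "Date", "Author", "Link"}

def parse_article(line):
    """Parse one JSON string into a dict and replace a "NULL" author with None."""
    article = json.loads(line)
    missing = EXPECTED_KEYS - article.keys()
    if missing:
        raise KeyError("article is missing keys: " + ", ".join(sorted(missing)))
    # Assumption: a missing author is stored as the string "NULL".
    if article["Author"] == "NULL":
        article["Author"] = None
    return article

# Made-up sample record, only to show the call.
sample = '{"Title": "T", "Content": "C", "Date": "D", "Author": "NULL", "Link": "http://example.com"}'
print(parse_article(sample)["Author"])
```

Converting "NULL" to None up front lets later code test `if article["Author"]` instead of comparing strings in several places.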

How to use the data:

import json
import os

articleDataDirectoryPath = "" # path of the directory that holds the .json files
fileNameList = os.listdir(articleDataDirectoryPath) # gets a list of file names (not full paths)

for fileName in fileNameList:
	# os.path.join inserts the path separator if the directory path lacks one
	absFilePath = os.path.join(articleDataDirectoryPath, fileName)
	with open(absFilePath) as f:
		for line in f:
			articleData = json.loads(line) # json.loads parses a string; json.load expects a file object

			# use the data:
			title = articleData['Title']
			content = articleData['Content']
			date = articleData['Date']
			url = articleData['Link']
			author = articleData['Author'] # use a check for "NULL" author if you wish
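
Because the data sits in folders nested under each newspaper's directory, it can be convenient to walk the whole tree in one pass. This is a rough sketch under stated assumptions: `iter_articles` is a hypothetical helper, files are assumed to be UTF-8 encoded, and each non-empty line is assumed to hold one JSON object, as the loop above implies.

```python
import json
import os

def iter_articles(root_dir):
    """Yield (file_path, article_dict) for every .json file under root_dir."""
    for dir_path, _dir_names, file_names in os.walk(root_dir):
        for file_name in sorted(file_names):
            if not file_name.endswith(".json"):
                continue  # ignore anything that is not article data
            file_path = os.path.join(dir_path, file_name)
            # Assumption: the files are UTF-8 encoded.
            with open(file_path, encoding="utf-8") as f:
                for line in f:
                    line = line.strip()
                    if line:  # skip blank lines
                        yield file_path, json.loads(line)
```

A call such as `for path, article in iter_articles("articleData"):` would then visit every article regardless of how deeply its folder is nested.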

Tokenizing for paragraphs and sentences:

To split the content into paragraphs, call Python's split() method on the 'Content' value with ". , " as the delimiter, as shown below:

import json

jsonFilePath = "" # path of one article .json file

with open(jsonFilePath) as f:
	for line in f:
		articleData = json.loads(line)
		articleContent = articleData["Content"]

		articleParagraphs = articleContent.split(". , ") # Delimiter for the new paragraph

		# enumerate numbers the paragraphs from 1, so no manual counter is needed
		for paragraphCounter, paragraph in enumerate(articleParagraphs, 1):
			print("Paragraph " + str(paragraphCounter) + " >> " + paragraph)

		print("\n")
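
The paragraph split can be extended to sentences. The sketch below is a naive approach and not the repository's own method: `split_paragraphs` and `split_sentences` are hypothetical helpers, the regular expression breaks on whitespace after '.', '!' or '?' (so abbreviations such as "U.S." will be split incorrectly), and it assumes the period inside the ". , " delimiter is the one that ended the previous paragraph.

```python
import re

PARAGRAPH_DELIMITER = ". , "  # paragraph delimiter used in the article content

def split_paragraphs(content):
    """Split content into paragraphs, restoring the period the delimiter consumed."""
    parts = content.split(PARAGRAPH_DELIMITER)
    # Assumption: the delimiter's leading "." ended the preceding paragraph.
    return [part + "." for part in parts[:-1]] + parts[-1:]

def split_sentences(paragraph):
    """Naively split a paragraph wherever whitespace follows '.', '!' or '?'."""
    return [s for s in re.split(r"(?<=[.!?])\s+", paragraph.strip()) if s]

# Made-up content string, only to show the two steps together.
content = "First one. Second one. , New paragraph here."
for paragraph in split_paragraphs(content):
    print(split_sentences(paragraph))
```

For anything beyond a quick look at the data, a dedicated sentence tokenizer from an NLP library is likely to handle abbreviations and quotations better than this regular expression.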

Currently data is available for:

  • LA Times
  • Seattle Times
  • Houston Chronicle
  • Chicago Tribune

Work in progress:

  • Philly
