Skip to content

fiestabonita/predictimdb

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

6 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

predictimdb

Python tools to scrape data from IDMB and predict movie revenue!

In the data folder, you have raw, unparsed data retrieved by the scraper.

The data is stored as a JSON encoded list of dictionaries. Each movie is a dictionary with the following fields:

      'name' : string, movie name
      'mpaa' : string, MPAA rating
      'link' : string, link to IMDB movie page
      ‘year' : string, release year
      ‘runtime' : string, length of movie in minutes
      ‘revenue' : string, US box office revenue of movie
      'stars' : list of name-URL dictionaries
      'directors' : list of name-URL dictionaries
      'writers' : list of name-URL dictionaries
      'budget' : string, filming budget given by IMDB
      'genres' : list of strings
      'castlist' : list of strings, cast names
      'release' : string, release date
      'countries' : list of strings
      'tagline' : string, brief line about the movie
      'production' : list of name-URL dictionaries
      'userscore' : dictionary with fields ‘score' and ‘count'
      'metascore' : string, metascore as given by IMDB

A name-URL dictionary refers to a dictionary with the fields ‘name' and ‘url’. The ‘name' is simply the name of the entity and the ‘url' is the url to the IMDB page for that entity . The purpose of this is that if we want to get more metadata on any of the entities, we can. NOTE: any of these fields may be empty - check before referencing

To scrape the data yourself, refer to code in the scraping directory: scraper.py contains helper functions that parse HTML, and imdb.py does the actual scraping (iterates through URLs, stores data, etc)

About

Python tools to scrape data from IDMB and predict movie revenue

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • Python 53.4%
  • R 46.6%