
julykid/youTube-8m

 
 


Simple search project using the YouTube-8M dataset and ElasticSearch

The YouTube-8M dataset is a large-scale labeled video collection of 8 million YouTube video IDs with associated labels drawn from a diverse vocabulary of 4,800 visual entities (Google Knowledge Graph entities). More about the Google Knowledge Graph here.

Prerequisites

Requirements

  • Swift
  • ElasticSearch
  • Kibana
  • Python libraries: pafy, pandas, numpy, requests (third-party); csv, json, itertools, urllib (standard library)

Provision VMs

  • Hardware: 8 CPUs, 32 GB RAM
  • Connection: 100 GB local, 1 GB network

Directions

Parsing TensorFlow data

  • In the ParseTFrecord directory, run yt8m_parse.py

ElasticSearch

  • Execute batchrunpy2.sh
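As a rough illustration of what pushing records to ElasticSearch can involve, the sketch below builds a newline-delimited payload for the `_bulk` endpoint and posts it. The index name `yt8m`, the `video_id` key, and the localhost URL are assumptions made for this example; the repository's own scripts may index documents differently.

```python
import json
import urllib.request


def build_bulk_payload(docs, index_name="yt8m"):
    """Return an NDJSON string for the ElasticSearch _bulk endpoint.

    Each document becomes two lines: an action line, then the document.
    `index_name` and the "video_id" key are hypothetical.
    """
    lines = []
    for doc in docs:
        action = {"index": {"_index": index_name, "_id": doc["video_id"]}}
        lines.append(json.dumps(action))
        lines.append(json.dumps(doc))
    if not lines:
        return ""
    # The bulk API expects a newline after the last line as well.
    return "\n".join(lines) + "\n"


def send_bulk(payload, es_url="http://localhost:9200"):
    """POST the payload to a running node (the URL is an assumption)."""
    request = urllib.request.Request(
        es_url + "/_bulk",
        data=payload.encode("utf-8"),
        headers={"Content-Type": "application/x-ndjson"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

`build_bulk_payload` can be checked without a cluster; `send_bulk` needs a reachable ElasticSearch node.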

Change Log

YouTube Retrieve Metadata + Push to ES

Version 1

  • Base files that retrieve metadata from a sample subset, intended to run locally

Version 2

  • Cleaned up the code for easier reading and debugging
  • Updated the code so that it can process the full set

Version 3

  • Fixed index ID numbering and added more columns (description, rating, likes, dislikes, author, published, etc.)
  • Added exception handling for invalid YouTube data
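The exception handling described above might look roughly like the following sketch: each video is fetched inside its own try/except, so an invalid ID is recorded and skipped instead of stopping the run. `fetch_metadata` is a hypothetical stand-in for whatever call retrieves the metadata, and `fake_fetch` exists only to make the example runnable; neither name comes from the repository.

```python
def collect_metadata(video_ids, fetch_metadata):
    """Fetch metadata for each ID, skipping IDs whose lookup raises."""
    records, failed = [], []
    for video_id in video_ids:
        try:
            info = fetch_metadata(video_id)
        except Exception as error:
            # Broad on purpose: removed, private, or otherwise invalid
            # videos can fail in different ways.
            failed.append((video_id, str(error)))
            continue
        records.append({"video_id": video_id, **info})
    return records, failed


def fake_fetch(video_id):
    """Hypothetical fetcher used only for this demonstration."""
    if video_id.startswith("bad"):
        raise ValueError("video unavailable")
    return {"title": "Title for " + video_id}


records, failed = collect_metadata(["a1", "bad2", "c3"], fake_fetch)
print(len(records), "fetched;", len(failed), "skipped")
```

Keeping the failed IDs in a separate list makes it possible to retry or inspect them later without rerunning the whole set.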

Version 4

  • Fixed thumbnail retrieval, which was not working
  • Fixed the try/except loop, which exited prematurely on an error and so could not get through all videos in the given document
  • Modified push2ES_batch.py to take two command-line arguments that specify which documents to process
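One plausible reading of the two arguments is a first and last document number. The sketch below parses such a pair with `argparse`; the argument names, their meaning, and the inclusive range are assumptions for illustration, not the documented interface of push2ES_batch.py.

```python
import argparse


def parse_document_range(argv):
    """Parse two positional integers into an inclusive range of documents."""
    parser = argparse.ArgumentParser(
        description="Select which documents to process (illustrative only)."
    )
    parser.add_argument("start", type=int, help="first document (assumed)")
    parser.add_argument("end", type=int, help="last document (assumed)")
    args = parser.parse_args(argv)
    if args.end < args.start:
        parser.error("end must not be smaller than start")
    return range(args.start, args.end + 1)


documents = parse_document_range(["3", "5"])
print(list(documents))  # prints [3, 4, 5]
```

In a real script the list passed in would be `sys.argv[1:]`, so each instance can be launched with its own slice of the data.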

Shell Script batchrunpy, batchrunpy2

Version 1

  • Created shell scripts that run 100 (batchrunpy) and 84 (batchrunpy2) simultaneous instances of the push2ES_batch.py script

Version 2

  • Version 1 crashed because all of the instances started and logged into the YouTube API at the same time
  • Added "sleep" and "nohup" to the command chains to stagger the starts
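To make the staggered start concrete, the sketch below generates the kind of command chain described here: each instance is started in the background with `nohup`, followed by a `sleep` before the next one. The delay, the start/end argument layout, the number of documents per instance, and the log file names are all assumptions; the actual batchrunpy scripts may differ.

```python
def staggered_commands(instances, docs_per_instance=1, delay_seconds=5,
                       script="push2ES_batch.py"):
    """Return shell script text that launches instances one after another."""
    lines = ["#!/bin/sh"]
    for i in range(instances):
        start = i * docs_per_instance
        end = start + docs_per_instance - 1
        # nohup keeps the process alive after logout; & runs it in the background.
        lines.append(
            f"nohup python {script} {start} {end} > run_{i}.log 2>&1 &"
        )
        # Pause so the instances do not all hit the API at the same moment.
        lines.append(f"sleep {delay_seconds}")
    return "\n".join(lines) + "\n"


print(staggered_commands(2))
```

Generating the script rather than writing it by hand keeps the per-instance arguments consistent when the instance count changes.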
