The YouTube-8M dataset is a large-scale labeled video collection consisting of 8 million YouTube video IDs and associated labels drawn from a diverse vocabulary of 4,800 visual entities taken from the Google Knowledge Graph.
- Swift
- ElasticSearch
- Kibana
- Python libraries: pafy, csv, json, itertools, pandas, numpy, requests, urllib
- Hardware: 8 CPUs, 32 GB RAM
- Connection: 100 GB local storage, 1 Gb network
- Under the `ParseTFrecord` directory, execute `yt8m_parse.py`
- Execute `batchrunpy2.sh`
- Added base files to retrieve metadata from a sample subset, runnable locally
- Cleaned code for easier viewing and debugging
- Updated the code so that it can run on the full dataset
- Fixed index id numbering and added column information (description, rating, likes, dislikes, author, published, etc.)
- Added exception handling for invalid YouTube data
- Fixed the thumbnail-retrieval step, which was not working
- Fixed a try/except loop that was exiting prematurely on error (previously it could not get through all of the videos in a given document)
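The try/except fix above can be sketched as follows. This is a hedged reconstruction, not the actual script: `fetch_metadata` and the video-ID list are hypothetical stand-ins for the real pafy calls. The key point is that the try/except sits inside the loop, so one invalid video is recorded and skipped instead of aborting the whole document.

```python
def collect_metadata(video_ids, fetch_metadata):
    """Fetch metadata per video, skipping videos that raise errors."""
    records, failures = [], []
    for vid in video_ids:
        try:
            records.append(fetch_metadata(vid))
        except Exception as exc:      # e.g. deleted or private videos
            failures.append((vid, str(exc)))
            continue                  # keep going instead of exiting
    return records, failures
```

Logging the failures separately also makes it easy to retry or audit the invalid videos later.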
- Modified `push2ES_batch.py` to take two system arguments specifying which documents to process
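A minimal sketch of that argument handling, assuming the two arguments are a start and end document index (the real script's argument semantics may differ):

```python
import sys

def parse_doc_range(argv):
    """Parse the two system arguments selecting which documents to process."""
    if len(argv) != 3:
        raise SystemExit(f"usage: {argv[0]} START_DOC END_DOC")
    start, end = int(argv[1]), int(argv[2])
    if start > end:
        raise SystemExit("START_DOC must be <= END_DOC")
    return start, end

# In the script itself: start, end = parse_doc_range(sys.argv)
```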
- Created shell scripts to simultaneously run 100 and 84 instances, respectively, of the `push2ES_batch.py` script
- Fixed a crash caused by all of the instances starting and logging into the YouTube API at the same time
- Added `sleep` and `nohup` to the command chains so the instances have a staggered start
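The staggered-start command chains can be generated programmatically. The helper below is a hypothetical reconstruction: the batch layout, delay, and exact command line are assumptions (the real scripts launched 100 and 84 instances).

```python
def staggered_commands(num_instances, batch_size, delay_seconds):
    """Build nohup/sleep command chains for a staggered instance start."""
    cmds = []
    for i in range(num_instances):
        start = i * batch_size
        end = start + batch_size
        # Each instance waits i * delay_seconds before starting, so the
        # YouTube API logins no longer happen all at once.
        cmds.append(
            f"sleep {i * delay_seconds} && "
            f"nohup python push2ES_batch.py {start} {end} &"
        )
    return cmds

for cmd in staggered_commands(3, 10, 5):
    print(cmd)
```

Writing the generated lines into a shell script reproduces the staggered launch without hand-editing 184 command chains.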