The YouTube-8M dataset is a large-scale labeled video collection consisting of 8 million YouTube video IDs and associated labels drawn from a diverse vocabulary of 4,800 visual entities taken from the Google Knowledge Graph.
- Swift
- ElasticSearch
- Kibana
- Python libraries: pafy, csv, json, itertools, pandas, numpy, requests, urllib
- Hardware: 8 CPUs, 32 GB RAM
- Connection: 100 GB local storage, 1 Gb network
- Under the `ParseTFrecord` directory, execute `yt8m_parse.py`
- Execute `batchrunpy2.sh`
- Added base files to retrieve metadata from a sample subset, runnable locally
- Cleaned code for easier viewing and debugging
- Updated the code so that it can run on the full dataset
- Fixed index id numbering and added column information (description, rating, likes, dislikes, author, published, etc.)
- Added exception handling for invalid YouTube data
- Fixed the thumbnail-retrieval step, which was not working
- Fixed a try/except loop that was exiting prematurely on error (previously it could not get through all of the videos in a given document)
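The try/except fix above can be sketched as follows. This is a hedged reconstruction, not the actual script: `fetch_metadata` and the video-ID list are hypothetical stand-ins for the real pafy calls. The key point is that the try/except sits inside the loop, so one invalid video is recorded and skipped instead of aborting the whole document.

```python
def collect_metadata(video_ids, fetch_metadata):
    """Fetch metadata per video, skipping videos that raise errors."""
    records, failures = [], []
    for vid in video_ids:
        try:
            records.append(fetch_metadata(vid))
        except Exception as exc:      # e.g. deleted or private videos
            failures.append((vid, str(exc)))
            continue                  # keep going instead of exiting
    return records, failures
```

Logging the failures separately also makes it easy to retry or audit the invalid videos later.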
- Modified `push2ES_batch.py` to take two system arguments specifying which documents to process
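A minimal sketch of that argument handling, assuming the two arguments are a start and end document index (the real script's argument semantics may differ):

```python
import sys

def parse_doc_range(argv):
    """Parse the two system arguments selecting which documents to process."""
    if len(argv) != 3:
        raise SystemExit(f"usage: {argv[0]} START_DOC END_DOC")
    start, end = int(argv[1]), int(argv[2])
    if start > end:
        raise SystemExit("START_DOC must be <= END_DOC")
    return start, end

# In the script itself: start, end = parse_doc_range(sys.argv)
```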
- Created shell scripts to simultaneously run 100 and 84 instances, respectively, of the `push2ES_batch.py` script
- Fixed a crash caused by all of the instances starting and logging into the YouTube API at the same time
- Added `sleep` and `nohup` to the command chains so the instances have a staggered start
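The staggered-start command chains can be generated programmatically. The helper below is a hypothetical reconstruction: the batch layout, delay, and exact command line are assumptions (the real scripts launched 100 and 84 instances).

```python
def staggered_commands(num_instances, batch_size, delay_seconds):
    """Build nohup/sleep command chains for a staggered instance start."""
    cmds = []
    for i in range(num_instances):
        start = i * batch_size
        end = start + batch_size
        # Each instance waits i * delay_seconds before starting, so the
        # YouTube API logins no longer happen all at once.
        cmds.append(
            f"sleep {i * delay_seconds} && "
            f"nohup python push2ES_batch.py {start} {end} &"
        )
    return cmds

for cmd in staggered_commands(3, 10, 5):
    print(cmd)
```

Writing the generated lines into a shell script reproduces the staggered launch without hand-editing 184 command chains.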