To uncover the algorithmic biases exhibited by the YouTube content ranking and recommendation algorithms.

FuckBrains/youtube-algorithmic-bias

 
 


Uncover YouTube Algorithmic Bias

This is a research project under Prof. Nithyanand at the University of Iowa. See the group page for the research outline: https://sparta.cs.uiowa.edu/

Generic Research Method:

The goal is to uncover the algorithmic biases exhibited by the YouTube content ranking and recommendation algorithms. First, profiles are created for the groups of interest. Each profile is then trained by viewing a number of pre-selected videos, with the idea of giving the YouTube algorithm a strong indication of that profile. Finally, each profile is tested with a keyword or a seed link to observe and study how YouTube ranks and recommends content.

Sampling/Training/Testing details:

Sampling

1. Generate base profiles:

Based on data for 7 profiles from Reddit, a number of videos X can be specified for each profile, for example 50. The top 50 videos for each profile (ranked by upvotes) are then selected as the base profile. If a profile has fewer videos in total than the specified number, all of them are chosen. Videos can be shuffled to ensure randomness.
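
The base-profile selection above can be sketched as follows. This is a minimal illustration, not the project's code: the function name `build_base_profile`, the `upvotes` field, and the sample data are all assumptions made for the example.

```python
import random

def build_base_profile(videos, x=50, shuffle=True, seed=None):
    """Pick the top-x videos by upvotes; if fewer than x exist, take all of them."""
    ranked = sorted(videos, key=lambda v: v["upvotes"], reverse=True)
    selected = ranked[:x]  # slicing past the end simply returns everything
    if shuffle:
        # Shuffle the selected videos so viewing order is not tied to upvote rank.
        random.Random(seed).shuffle(selected)
    return selected

# Hypothetical input: seven videos with made-up upvote counts.
videos = [{"id": f"vid{i}", "upvotes": i * 10} for i in range(1, 8)]
base = build_base_profile(videos, x=3, seed=0)
```

With `x=3` the three most-upvoted videos are kept; with `x=50` all seven would be returned, matching the "fewer than specified" rule.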

2. Generate extended profiles (two methods):

  1. By random number generator: Let R be the ratio given for a subreddit in the profiles, D be the number of digits after the decimal point of R, and C be a random number generated in the range [0, 10^D). If C <= R*10^D, the top P% of videos in this subreddit are added to the extended list. The default P is 10 (so the top 10% of videos are sampled if the subreddit is chosen). This random selection is performed for each subreddit in each profile, and all chosen videos are combined with the base videos to form one extended profile. Videos can be shuffled to ensure randomness.
  2. By diversity index: Based on the profiles, a diversity number N can be chosen, for example 1.2. The top X*N (here 50*1.2 = 60) subreddits that have a cross population with the base profiles, ranked by overlap ratio, are then chosen. For each profile, the overlap ratios of those top (60) subreddits are normalized; the normalized ratio is denoted R. The number of videos for each subreddit is calculated as K = R*X*N, rounded to the nearest integer. The K most-upvoted videos of each subreddit are chosen and combined with the base videos to form one extended profile. Videos can be shuffled to ensure randomness.
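
Both extension methods can be sketched in a few lines. The sketch below rests on several assumptions that the description leaves open: C is drawn as an integer, the top P% is rounded up, normalization means dividing by the sum of the selected overlap ratios, and ties in "round to nearest" go up. The function names and sample data are hypothetical.

```python
import math
import random
from decimal import Decimal

def subreddit_is_chosen(ratio_text, rng):
    """Random-number method: draw C in [0, 10^D) and test C <= R * 10^D."""
    ratio = Decimal(ratio_text)             # R, passed as text so D is well defined
    d = max(0, -ratio.as_tuple().exponent)  # D: digits after the decimal point
    c = rng.randrange(10 ** d)              # C: assumed to be an integer draw
    return c <= ratio * 10 ** d

def top_percent(videos, p=10):
    """Top P% of a subreddit's videos by upvotes (rounding up is an assumption)."""
    ranked = sorted(videos, key=lambda v: v["upvotes"], reverse=True)
    return ranked[:math.ceil(len(ranked) * p / 100)]

def diversity_quotas(overlap, x=50, n=1.2):
    """Diversity-index method: K = R * X * N per subreddit.

    `overlap` maps subreddit name to overlap ratio; R is assumed to be that
    ratio divided by the sum over the selected subreddits.
    """
    limit = math.floor(x * n + 0.5)         # keep the top X*N subreddits
    top = sorted(overlap.items(), key=lambda kv: kv[1], reverse=True)[:limit]
    total = sum(r for _, r in top)
    return {name: math.floor(r / total * x * n + 0.5) for name, r in top}

# Hypothetical usage with made-up ratios.
rng = random.Random(0)
chosen = subreddit_is_chosen("0.25", rng)
quotas = diversity_quotas({"a": 0.5, "b": 0.3, "c": 0.2}, x=5, n=1.2)
```

In the usage example, X*N = 6 videos are split across the three subreddits in proportion to their normalized overlap ratios.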

Training

  1. The inputs are a list of login cookies and, for each profile, a list of videos to view. The scripts run in parallel: each opens a browser, loads the cookies, and automatically views all the videos on its list. For more details on the settings, check the Setting.py file on GitHub.
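
The parallel structure of training can be sketched as below. This only illustrates running one worker per profile; it does not drive a real browser. The `watch` callable stands in for the browser automation (open the browser, load cookies, play the video) and, like the function names and file names, is an assumption made for the example.

```python
from concurrent.futures import ThreadPoolExecutor

def train_profile(cookie_file, video_urls, watch):
    """View every video on one profile's list using that profile's cookies."""
    watched = []
    for url in video_urls:
        watch(cookie_file, url)  # hypothetical hook for the browser automation
        watched.append(url)
    return cookie_file, watched

def train_all(cookie_files, video_lists, watch, max_workers=4):
    """Run one training job per (cookie file, video list) pair in parallel."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [
            pool.submit(train_profile, cookies, urls, watch)
            for cookies, urls in zip(cookie_files, video_lists)
        ]
        # result() re-raises any exception from a worker, so failures are not silent.
        return dict(f.result() for f in futures)

# Hypothetical usage with a no-op stand-in for the browser.
result = train_all(
    ["profile_a.cookies", "profile_b.cookies"],
    [["url1", "url2"], ["url3"]],
    watch=lambda cookie_file, url: None,
)
```

Passing the viewing step in as a callable keeps the scheduling logic testable without launching Firefox.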

Testing

1. Pilot testing

To verify whether creating Google accounts and logging in is necessary for training and testing, collect the top 50 search results for the keyword “Mueller report”, together with the recommendation list for each search result, and compare these results between the logged-in and non-logged-in cases. The subreddits used for training are The Donald and EnoughTrumpSpam.
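
One possible way to compare the two result lists is sketched below. The description does not say which comparison metric is used, so Jaccard overlap and same-rank agreement are assumptions chosen for illustration, as are the function names and the sample video IDs.

```python
def jaccard(a, b):
    """Share of distinct items that appear in both lists (1.0 if both are empty)."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 1.0

def compare_runs(logged_in, logged_out, top_n=50):
    """Compare the top-N results of a logged-in run and a non-logged-in run."""
    a, b = logged_in[:top_n], logged_out[:top_n]
    # Count positions where both runs returned the same video at the same rank.
    same_rank = sum(1 for x, y in zip(a, b) if x == y)
    return {"jaccard": jaccard(a, b), "same_rank": same_rank}

# Hypothetical usage with made-up video IDs.
report = compare_runs(["v1", "v2", "v3"], ["v1", "v3", "v4"])
```

The same comparison could be applied to each pair of recommendation lists as well as to the search results.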

Description of folders/files:

  • config: contains the Firefox binaries, profiles, etc.
  • generated_data: holds crawled data. It currently contains the video lists for both base and extended profiles generated by the two methods. */summary* contains a summary of how many videos are in each subreddit (and, for extended videos, how many crossing subreddits are chosen). -error.txt lists the subreddits that don't contain videos.
  • settings.py: contains all the available options for sampling/training/testing.
  • src: contains all the source code

Languages

  • Python 68.3%
  • Jupyter Notebook 17.9%
  • JavaScript 13.8%