hroncok/badger


Badger

Badger is a small set of scripts for mining association rules from Fedora Badges. The Crawler downloads the data from the Badges API; the Miner generates the rules.

Baby Badger

Usage

For the Miner, at least, it is recommended to use PyPy to get the results faster. For the Crawler you can use any Python implementation you like; PyPy brings no improvement there.

Here is an example of how to run this on Fedora, supposing the pypy and python-virtualenv packages are installed:

virtualenv -p /usr/bin/pypy venv
. venv/bin/activate
pip install -r requirements.txt
pypy crawler.py
pypy miner.py 9 0.9 10 15

Note that downloading the data from the Badges API takes a while. Therefore, data.json is already provided with data from 07-05-2015; to use it, just skip pypy crawler.py and go straight to pypy miner.py ...

The Miner requires some arguments:

pypy miner.py <min_support> <min_confidence> [<user_limit> <badge_limit>]
  • min_support is the minimal support of the rules to be mined. You can experiment with it. The higher the number, the fewer the rules.
  • min_confidence is the minimal confidence of the rules to be mined. It is a float between 0 and 1. The higher the number, the fewer the rules.
  • user_limit is a limit on the number of users used for mining. If you set it, you get the top X users of Fedora Badges. You can omit the limits to use all the users, i.e. a little fewer than 1500 at the moment. (This would most certainly eat all your RAM.)
  • badge_limit is a limit on the number of badges used for mining. If you set it, the Miner only sees the first X badges in the order in which the Crawler downloaded the data (although the order is deterministic, you can think of it as if it were random). You can omit the limits to use all the badges that users have earned, i.e. a little more than 250 at the moment. (This would most certainly eat all your RAM.)
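To make the two thresholds concrete, here is a minimal sketch of how support and confidence of a single rule can be computed. It is not the Miner's code: the function name `rule_stats`, the `users` mapping and the sample data are made up for illustration, and support is assumed to be an absolute number of users, as the example rule in this README suggests (support 820).

```python
def rule_stats(users, antecedent, consequent):
    """Return (support, confidence) for the rule antecedent -> consequent.

    users maps a user name to the set of badge ids that user holds
    (a hypothetical structure, not necessarily the layout of data.json).
    Support is counted as an absolute number of users.
    """
    antecedent = set(antecedent)
    both = antecedent | set(consequent)
    # Users holding every badge of the antecedent.
    n_antecedent = sum(1 for badges in users.values() if antecedent <= badges)
    # Users holding every badge of the antecedent and of the consequent.
    n_both = sum(1 for badges in users.values() if both <= badges)
    confidence = float(n_both) / n_antecedent if n_antecedent else 0.0
    return n_both, confidence


# Made-up sample data: three users hold "a", two of them also hold "b".
users = {
    "alice": {"a", "b", "c"},
    "bob": {"a", "b"},
    "carol": {"a"},
    "dave": {"b", "c"},
}
support, confidence = rule_stats(users, {"a"}, {"b"})
```

A rule would pass the thresholds only if `support >= min_support` and `confidence >= min_confidence`, which is why raising either number yields fewer rules.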

Badge Off!

The idea

The original idea was to get all the results, look at them, and filter them either manually or with some scripted logic, to get rid of Captain Obvious rules such as:

if-you-build-it...-koji-success-ii -> if-you-build-it...-koji-success-i with support 820 and confidence 1.0

This means that if a user has If you build it... (Koji Success II), they will most certainly also have If you build it... (Koji Success I). This information is useless, because users always earn those badges in order, which makes Koji Success I effectively a prerequisite of Koji Success II.
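One possible form of such scripted logic is sketched below. It is only a heuristic and is not part of Badger: it assumes that badges of a series share an id prefix and end in a lowercase Roman numeral (as the two ids above do), and the names `split_series` and `is_obvious` are hypothetical. An id that merely happens to end in something like `-vi` would be misread as a series level.

```python
import re

# Roman numeral suffixes recognised by this sketch (an arbitrary cut-off).
ROMAN = {"i": 1, "ii": 2, "iii": 3, "iv": 4, "v": 5,
         "vi": 6, "vii": 7, "viii": 8, "ix": 9, "x": 10}
SERIES_RE = re.compile(r"^(?P<series>.+)-(?P<level>[ivx]+)$")


def split_series(badge_id):
    """Return (series, level) for ids like 'foo-ii', or None otherwise."""
    match = SERIES_RE.match(badge_id)
    if match and match.group("level") in ROMAN:
        return match.group("series"), ROMAN[match.group("level")]
    return None


def is_obvious(antecedent, consequent):
    """True if every consequent badge is a lower level of a series
    that already appears in the antecedent."""
    if not consequent:
        return False
    held = [parsed for parsed in map(split_series, antecedent) if parsed]
    for badge in consequent:
        parsed = split_series(badge)
        if parsed is None:
            return False
        series, level = parsed
        if not any(s == series and lvl > level for s, lvl in held):
            return False
    return True


obvious = is_obvious(
    ["if-you-build-it...-koji-success-ii"],
    ["if-you-build-it...-koji-success-i"],
)
```

Rules for which `is_obvious` returns True could then be dropped before anyone looks at the output.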

Senior Badger (Badger II)

The problem

Unfortunately, this approach turned out to be quite useless: running the Miner with 20 users and 20 badges eats all my RAM and swap, so running it without limits would probably require a large SSD-backed swap. Even running it with 12 users and 15 badges (which my RAM can handle) generates more than 100 000 rules, so any manual filtering is off the table. I'm also fairly sure that even with scripted filtering, the filtered results would be so large that no one would ever want to read them :(
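The blow-up is easier to appreciate with a quick count. For n distinct badges, the number of possible rules with a non-empty antecedent and a non-empty, disjoint consequent is 3^n - 2^(n+1) + 1. The sketch below evaluates that upper bound; the helper name `max_rules` is made up, and how many rules the Miner actually emits depends on min_support, min_confidence and the data.

```python
def max_rules(n_badges):
    """Upper bound on association rules over n_badges distinct items.

    Each item goes to the antecedent, the consequent, or neither (3**n
    assignments); assignments with an empty antecedent or an empty
    consequent are then excluded.
    """
    return 3 ** n_badges - 2 ** (n_badges + 1) + 1


for n in (5, 10, 15, 20):
    print("%d badges: up to %d possible rules" % (n, max_rules(n)))
```

Every additional badge roughly triples the bound, which is consistent with a handful of extra badges making the difference between a run that fits in RAM and one that does not.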

Old Yeller
