An HTCondor Account Group manager, used by the US ATLAS Tier-1 at BNL.
HTCondor Hierarchical Group Quotas are used to partition the pool logically into queues, with each subgroup containing a single species of job. The groups are organized into a tree: jobs are submitted into the leaf nodes, and the higher levels serve to classify those jobs.
For example, the ATLAS tree is partitioned into production and analysis, each of which contains a few different queues.
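As a rough illustration of the tree described above, the sketch below models groups as nested nodes and lists the leaves that would accept jobs. The `Group` class and the queue names under production and analysis are hypothetical and are not taken from this repo; the dotted naming follows the usual HTCondor convention for hierarchical group names.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class Group:
    """One node in the group tree (hypothetical model, not the repo's schema)."""
    name: str
    children: List["Group"] = field(default_factory=list)

    def leaves(self, prefix: str = "") -> Iterator[str]:
        """Yield the dotted full name of every leaf below (or at) this node."""
        full = f"{prefix}.{self.name}" if prefix else self.name
        if not self.children:
            # Leaf node: this is where jobs would actually be submitted.
            yield full
            return
        for child in self.children:
            yield from child.leaves(full)


# Queue names below are invented for the example.
tree = Group("group_atlas", [
    Group("prod", [Group("single_core"), Group("multi_core")]),
    Group("analysis", [Group("short"), Group("long")]),
])

leaf_names = list(tree.leaves())
```

Only the names in `leaf_names` would receive jobs; `group_atlas.prod` and `group_atlas.analysis` exist purely to classify them.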
This repo contains code for a website to manage the groups, some scripts to put the groups in the configuration of the 'condor_negotiator', and software to auto-balance the groups based on which groups have demand and on the size of the jobs in each group.
The software needs some packages built via the setup scripts in this repo, a webserver running Apache (nginx would work too), and a MySQL database somewhere (relatively low traffic).
All parts rely on the 'gq' RPM package, built by setup-gq.py. This contains code used by the website, the balancing software, and the other scripts.
- Set up database and web servers
- Build & install the gq and gqweb packages via the setup scripts
- Run the init_db.py script to define the tables from the website models
- Configure the webpage settings, especially the SECRET_KEY, and decide what kind of authentication you want to set up. The user is loaded in the site code in gqweb/views/pre_initalize.py, which calls 'load_user_header()' from util/userload.py. By default it uses the REMOTE_USER environment variable, so some kind of BasicAuth should be easy to set up.
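To make the REMOTE_USER default concrete, here is a minimal sketch of reading the authenticated user from a request environment. The function name `remote_user` is hypothetical and this is not the code in util/userload.py; it only assumes that the web server (for example Apache with BasicAuth) sets REMOTE_USER after a successful login.

```python
from typing import Mapping, Optional


def remote_user(environ: Mapping[str, str]) -> Optional[str]:
    """Return the user name set by the web server, or None if absent.

    `environ` is the per-request environment (a WSGI environ dict, or
    os.environ under CGI). An empty or whitespace-only value is treated
    as "not authenticated".
    """
    user = environ.get("REMOTE_USER", "").strip()
    return user or None
```

A site using a different scheme would replace this lookup with its own logic and leave the rest of the pages unchanged.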
The balancing scripts require knowledge about how much workload is in each queue. To provide this data you need to set up software like that in gq/jobquery/, which holds modules containing a get_jobs() method that returns a mapping of group_name → number of idle jobs at the time it is called. See 'pandajobs.py' for an example that looks to PANDA for activated jobs.
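A site-specific module of this kind might look roughly like the sketch below. Only the `get_jobs()` name and its return shape (group name → idle count) come from the description above; `fetch_records()`, the record fields, and the sample data are assumptions standing in for whatever workload system you query.

```python
from collections import Counter
from typing import Dict, Iterable, List, Mapping


def count_idle(records: Iterable[Mapping[str, str]]) -> Dict[str, int]:
    """Count idle jobs per group from an iterable of job records."""
    counts: Counter = Counter()
    for rec in records:
        if rec.get("status") == "idle":
            counts[rec["group"]] += 1
    return dict(counts)


def fetch_records() -> List[Mapping[str, str]]:
    """Placeholder for a real query (database, REST API, batch system).

    The hard-coded sample below is invented for the example.
    """
    return [
        {"group": "group_atlas.prod.single_core", "status": "idle"},
        {"group": "group_atlas.prod.single_core", "status": "running"},
        {"group": "group_atlas.analysis.short", "status": "idle"},
        {"group": "group_atlas.analysis.short", "status": "idle"},
    ]


def get_jobs() -> Dict[str, int]:
    """Return a mapping of group_name -> number of idle jobs right now."""
    return count_idle(fetch_records())
```

Keeping the counting separate from the fetching makes the module easy to test without access to the real workload system.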
The balancing scripts themselves are bin/get_idle_jobs.py and bin/balance_load.py, both of which should be called via Cron or some other regular mechanism.
The script bin/update_quotas.py safely writes a config file that contains the current groups and their parameters, and triggers a reconfig of condor if anything changed.
It too requires the 'gq' package, and should be run via Cron wherever the 'condor_negotiator' for your pool runs.
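One common way to write a config file safely and reconfigure only on change is sketched below. This is not the code in bin/update_quotas.py; the function names are hypothetical, and the approach (write to a temporary file in the same directory, then rename over the target) is an assumption about what "safely" could mean here. `condor_reconfig` is the standard HTCondor command for asking daemons to re-read their configuration.

```python
import os
import subprocess
import tempfile
from typing import Sequence


def write_if_changed(path: str, content: str) -> bool:
    """Atomically replace `path` with `content`; return True if it changed."""
    try:
        with open(path, "r", encoding="utf-8") as fh:
            if fh.read() == content:
                return False  # Nothing to do, so no reconfig is needed.
    except FileNotFoundError:
        pass  # First run: the file does not exist yet.

    directory = os.path.dirname(os.path.abspath(path))
    # The temp file lives in the same directory so the rename stays on one
    # filesystem, which is what makes os.replace atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".quota-", suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            fh.write(content)
            fh.flush()
            os.fsync(fh.fileno())
        # mkstemp creates the file as 0600; make it readable by the daemons.
        os.chmod(tmp_path, 0o644)
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the partial temp file and leave the old config untouched.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
    return True


def update(path: str, content: str,
           reconfig_cmd: Sequence[str] = ("condor_reconfig",)) -> bool:
    """Write the config and run the reconfig command only when it changed."""
    changed = write_if_changed(path, content)
    if changed:
        subprocess.run(list(reconfig_cmd), check=True)
    return changed
```

Because readers only ever see the old file or the complete new one, a crash mid-write cannot leave the negotiator with a truncated group configuration.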