Knock yourselves out just keep your requests simple.
The meeseeks-box agent runs on each node. Head, routing, compute, doesn't matter. It's all the same. Just don't connect the nodes in cycles.
$ ./meeseeks-box [config files..]
config files are JSON, some sample configs are included:
examples/master.cfg
this is how a master node would be configured
it connects to the head nodes
examples/head*.cfg
this is how a head node would be configured
it does not have any pools but listens and connects to the compute nodes
examples/node*.cfg
these are cluster compute nodes
they each have pools to process jobs
in these sample configs, the port number is being changed so all of them can run one one host generally you don't change the port number, and run one meeseeks-box per host
The config sections, objects, and defaults are as follows: {
name: #set the name of the node, defaults to the hostname if not set
user: #set the effective username/id of the node. Only effective if root.
group: #set the effective group of the node. Only effective if root.
#Defaults to user's primary group.
logging: {
#logging config, use python logging.basicConfig parameters such as level=10 for DEBUG
}
defaults: {
# if a parameter is set here, it will change the default value in all other sections if applicable to that section
}
listen: { #configures the listening socket
address: defaults to localhost
port: defaults to 49463
ssl: {SSLContext config}
}
state: { #configures the state manager
expire: 300 # how long in seconds a job will persist without being updated
# the state of completed/failed/killed jobs will be available for this long
expire_active_jobs: true #if true, jobs in pools will be expired if node is down
timeout: 60 #timeout in seconds to receive updated node status before it is marked offline
file: <filename> #if set, save/reload state from this file)
checkpoint: <int> #if set, save state to file every <int> seconds)
history: <filename> #if set, write finished/expired jobs to this file
}
nodes: list of downstream nodes to connect to
{ <nodename>:{
address: defaults to <nodename>
port: defaults to 49463
ssl: {SSLContext config}
refresh: 1 # how often in seconds we sync state
poll: 10 # how often in seconds we request status
timeout: 10 # timeout in seconds to connect/send/receive data
} , ... }
pools: list of job processing pools on this node
{ <poolname>:{
slots: 0
# if > 0 sets limit of how many jobs can run simultaneously
# 0 sets no limit, but nodes with slots will be preferred
hold: false # if true, jobs will not start until hold=false
drain: false # if true, no new jobs will be assigned to this pool
runtime: null # if set, limit of how long a job can run for
update: 0 # how often in seconds the state of running jobs is updated
#this is only required if you want task info updates
#jobs in pools will not expire while node is up
plugin: optional <path.module.Class> to provide this pool instance
} , ... }
use_loadavg: false #if set true, load average will be used to select nodes vs. free pool slots
wait_in_pool: false #if set true, jobs will be assigned to nodes with full pools and run when a slot is free
#if false (default) jobs will remain unassigned until a slot is free
}
config can also be provided on the command line using key.key.key=value
example:
$ ./meeseeks-box name=master defaults.refresh=10 state.expire=300 nodes.n11.address=10.0.0.11 nodes.n12.address=10.0.0.12
connect with something like
nc localhost 49463
and send JSON. newline sends requests for processing. double newline disconnects client.
[ {
"submit" :{
"id": string #job id, optional, MUST be unique. A UUID will be generated if id is omitted
#if an existing job id is given, the job will be modified if possible
"uid": [optional] username/id of the job. If not specified the job will run as the node's user.
"gid": [optional] groupname/id of the job.
(If uid/gid differ from the node and the node was not started as root the job will fail.)
"pool": string #pool name, REQUIRED.
"args": [executable, arg, arg, arg] #The command to run and arguments. If subprocess.Popen likes it, it will work.
"node": string #optional. Node selection, can be * for all in pool or end with * for wildcard
"stdin": path #path to file to use for the job's stdin
"stdout": path #optional, path to file to use for the job's stdout else stdout_data returns the base64 encoded output
"stderr": path #optional, path to file to use for the job's stderr else stderr_data returns the base64 encoded output
"runtime: int #optional, maximum runtime of the job
"hold": false|true #optional, if true job will be assigned to a node but not run until set false
"restart": false|true #if true, job will be restarted on the same node if it exits with success (rc == 0)
"retries": int #if >0, job will be restarted a max of retrues on the same node if it exits with failure
"resubmit": false|true #if true, when job is finished (done or failed), resubmit it to the submit_node
"state": {new|killed} #set state of job, killed will stop running job, new will restart finished job
"tags": [...] #list of tags, can be matched in query with tag=
}
response will be:
{
"submit": the job_id:job map or jid:false if submission failed
job attributes (also includes keys from submit spec):
node: the node the job is assigned to
state: the job state (new,running,done,failed,killed)
active: True if the job is being processed by a node.
To move a job: kill the job, wait for active=False, then reassign and set state='new'.
rc: exit code if job done/failed
error: error details if job did not spawn or exit
stdout: base64 encoded stdout after exit if not redirected to a file
stdout: base64 encoded stderr after exit if not redirected to a file
ts: update timestamp
seq: sync sequence number. Jobs with the highest seq are most recently updated on this node.
submit_ts: submit timestamp
start_ts: job start timestamp
end_ts: job end timestamp
start_count: count of time job has started
fail_count: count of times job has failed
}
"get": job_id | [job_ids] | {query spec}
response will be jid:job map, or false if job_id does not exist
"kill": job_id | [job_ids] | {query spec} #kills a job.
response will be jid:job map, or false if job_id does not exist
"nodes" : {}
fetch the node status this node knows about
response will be:
{
"nodes": { nodename:{ ts: , online:true|false, loadavg: , routing: [nodelist] }, .... },
}
}
"pools" : {}
fetch the pool status this node knows about
response will be:
{
"pools": { poolname:{ nodename: slots, ... }, ... }
}
}
"config": {...} #push a new configuration (if provided) to the node, response is current config
#configuration can be pushed to remote nodes via a job in the __config pool
#example: {"submit":{"pool":"__config","node":"<node>","args":{<config>}}}
#when state is 'done', job args will reflect current config
} ]
new: job has been submitted but not started. May be assigned to a node's pool if wait_in_pool is set. if claimed by a pool, node will be set and active=True
running: job is running. pid will be set. node is set to the node the job is running on.
done: job is finished. stdout_data and stderr_data will contain output
failed: job failed. rc will be set, or error will be set. If the job expired and retries are set, node may cleared
killed: job was killed, rc may be set if job was running. if acknowledged and released by node, active=False
meeseeks-client [client-options] <command> [args...]
or use q-symlinks: q{stat|sub|job|del|mod|conf} [args]
commands are:
sub [submit-options] <pool[@node]> <executable> [args....] (submits job, returns job info)
submit-options:
pool= sets pool for job to run in
node= sets node(s) for job to run on. Can be comma-seperated list or end with * for all/wildcard
stdin= stdout= stderr= (redirect job in/out/err to files named)
restart= (1=restart when done)
retries= (times to retry if job fails)
resubmit= (1=resubmit when job is done or fails)
runtime= (max runtime of job)
id= (set new job's id or submit changes to existing job)
state= (change existing job state, 'new' will restart finished job)
hold= (1=queue but do not start job)
tag= list of tags, can be matched in query with tag=
job|get [jobids|filter] (get all or specified job info as JSON)
jobids are any number of job id arguments
filter is any job attributes such as node= pool= state= tag=
ls [filter] (list job ids)
del|kill <jobids|filter> (kill job)
mod|set <jobids|filter : > key=value ... (set key=value in jobs matching jobids or filter, return new job info)
if a filter is provided, ':' is used to delimit filter key=value from job key=value
set a job in any finished state (done,failed,killed) to state='new' to restart job
if node is not specified when restarting job, node will be cleared.
stat|show [filter] {nodes pools jobs active tree}
(prints flat or tree cluster status,
specify which elements to show, defaults to flat/all)
Job Flags: A=active H=hold E=error R=repeating
nodes (prints full node status JSON)
pools (prints full pool status JSON)
conf [key=value] [node]
get/sends config to directly connected or specified node
client-options can be set by the environment var MEESEEKS_CONF and default to:
address= (default localhost)
port= (defult 49463)
refresh= (interval to continuously refresh status until no jobs left)
meeseeks-watch [key=value]... [config-file]
watches files, submits jobs on them
JSON config format, can also be specified on command line as key.subkey..=value|value,value
{
"defaults" : { ... defaults for all other sections ... },
"name" : <string> root name for logging/jobs
"client" : { ... configuration for connecting to meeseeks ... },
"template" : {
"<name>": { defines a watch template, see spec for watch. }
},
"watch" : {
"<name>": {
"template": <template(s)> applies template(s) to this watch config.
Keys defined in watch override keys in template
Name in logging/jobs will be root_name-template_name-watch_name
"plugin": optional <path.module.Class> to provide this watch instance.
meeseeks.watch.WatchXattr is included for tracking state with filesystem xattrs
"path" : <path to watch>
"glob" : <pattern> | [ <pattern>, ... ]
Watches the files matching the pattern.
A list of patterns can be specified to generate multiple lists of files
Use a glob of "*" for all files
If no globs are defined, this watch will simply ensure the jobs defined are always running
"reverse" : <bool> files are ASCII sorted Z->A, 9->0 to handle datestamps newest to oldest.
If true, reverse sort the files (oldest to newest)
"split" : <character> optional character to split filenames on to generate filename parts.
"match" : <int> match filename parts across globs to create filesets
a fileset is complete when the specified parts of a filename in each list exist.
<int> filename parts from the first glob are used as the key
files will not be processed until a complete fileset exists, unless partial is set true.
For example,
glob: [*.foo,*.bar,*.baz]
split: .
match: 2
the set will be complete if we have 20200101.00.foo, 20200101.00.bar, 20200101.00.baz
default is 0 (no filesets)
"partial" : <bool> if true, process fileset as soon as a file from the first glob is present
"skip" : <suffix> fileset will be skipped and marked done if <matched parts>.<suffix> is present
if skip: "out: in above fileset example, job will be skipped if 20200101.00.out exists
"updated" : <bool> if set, files will be reprocessed if modtime changes. Default false.
If multiple jobs are defined, only the first will be run on file update.
"retry" : <bool> if true, files with failed jobs will be reprocessed. Default true.
killed jobs will never be retried
"run_all" : <bool> if true, unprocessed jobs from 0 to the file's index will be sequentially submitted.
if false, only the unprocessed job for the file's index will be submitted.
default true
"min_age" : <int> if set, file must be at least <int> seconds old to be considered
"max_age" : <int> if set, file must be newer than <int> seconds to be considered
"max_index": <int> if set, maximum index down the list we will submit jobs for.
"refresh" : <int> interval in seconds running jobs are checked and file status updated, default 10
"rescan" : <int> interval in seconds files in path are rescanned, default 60
"count" : <int> if set, exit after jobs have run count times.
"jobs" : [
{ jobspec } | [ {jobspec}, ... ] ,
jobspec(s) to submit on first (usually newest) file/fileset in the list(s)
,
{} | [{},...]
jobspec(s) to submit on next file/fileset...
, ..... ,
{} | [{},...]
last jobspec(s) will be submitted on all other unprocessed files
]
}
}
}
if file list(s) have been generated, jobs[n] is run on file[n] in each list:
if only one job is defined, this job will be run on all files
if multiple jobs are defined, files will have the jobs run on them in sequence as more files appear
if a jobspec is empty or null, do nothing.
for example, if jobs[0] and jobs[1] are defined, when a new file[0] appears:
jobs[0] will run on new file[0]
previously processed file[0] will now be file[1], so jobs[1] will run on it.
file[1] will now be file[2] but if jobs[2] does not exist nothing happens.
if multiple file lists, each jobs[n] can also be a list of jobspecs:
jobs[n][0] will be submitted for file[n] from list[0],
jobs[n][1] for list[1][n],
jobs[n][-1] will process any remaining lists.
in jobspecs, the following format keys available:
%(name)s name of this watch
%(filename)s filename
%(file)s full path to file including filename
%(fileset)s all filenames in the fileset
%(fileset<n>)s if a fileset, will be glob[n] filename
%(<n>)s part <n> of the filename or fileset match pattern
%(index)s job index
%(<k>)s key <k> in the watch config, such as %(path)s
When a job ends, the job result JSON is written to a hidden file in <path> named:
._<name>_<n>_<filename>.<state>
<name> is the watch name.
<n> is the index of the job in the jobspec list
<filename> is the filename
<state> is the finished job state, typically "done" if successful.
If a .done (or .failed if retry=0) file exists the file is considered processed
If updated=True, last file mtime is tracked in ._<name>_<n>_<filename>.mtime
hidden ._ files will be deleted when associated files are deleted, unless cleanup=0
to generate a config by applying multiple templates to a list of paths, define only the paths in config:
"watch": {
"name1":{ "path":"path1" },
"name2":{ "path":"path2" },
...
}
and apply templates globally with:
meeseeks-watch defaults.template=<template>,<template>,<template> templates.cfg paths.cfg
generated watch names will be <template>-<name>
Meeseeks is not designed to provide any kind of real security.
Job uid/gid of 0 will be ignored (unless the node is effectively running as root, so don't do that! Always set user/group in the config) and meeseeks-client always set sets the job uid to the running user. However, any user can run jobs as any nonzero uid, or kill any job regardless of uid. Enforcing uid checks are pointless when anyone can connect to the listener port.
If you will be running a multi-user cluster, you should place the service, commands, and trusted users in a 'meeseeks' group and restrict access to the port with SSL and certificates that only the meeseeks group can access.