h-baselines

h-baselines is a repository of high-performing and benchmarked hierarchical reinforcement learning models and algorithms. This repository is motivated by, and partially adapted from, the baselines and stable-baselines repositories.

The models and algorithms supported within this repository can be found here, and benchmarking results are available here.

Setup Instructions
1.1. Basic Installation
1.2. Installing MuJoCo
1.3. Importing AntGather
1.4. Installing Flow
Supported Models/Algorithms
2.1. RL Algorithms
   2.1.1. Synchronous Updates
2.2. Fully Connected Neural Networks
2.3. Goal-Conditioned HRL
   2.3.1. Meta Period
   2.3.2. Intrinsic Rewards
   2.3.3. HIRO (Data Efficient Hierarchical Reinforcement Learning)
   2.3.4. HAC (Learning Multi-level Hierarchies With Hindsight)
   2.3.5. CHER (Inter-Level Cooperation in Hierarchical Reinforcement Learning)
2.4. Multi-Agent Policies
Environments
3.1. MuJoCo Environments
3.2. Flow Environments
Citing
Bibliography

1. Setup Instructions

1.1 Basic Installation

To install the h-baselines repository, begin by opening a terminal and setting its working directory to the root of the repository:

cd path/to/h-baselines

Next, create and activate a conda environment for this repository by running the commands in the script below. Note that this is not required, but highly recommended. If you do not have Anaconda on your device, refer to the provided links to install either Anaconda or Miniconda.

conda env create -f environment.yml
source activate h-baselines

Finally, install the contents of the repository onto your conda environment (or your local python build) by running the following command:

pip install -e .

If you would like to (optionally) validate that the repository was successfully installed and is running, you can do so by executing the unit tests as follows:

nose2

The test should return a message along the lines of:

----------------------------------------------------------------------
Ran XXX tests in YYYs

OK

1.2 Installing MuJoCo

In order to run the MuJoCo environments described within the README, you will need to install MuJoCo and the mujoco-py package. To install both components follow the setup instructions located here. This package should work with all versions of MuJoCo (with some changes likely to the version of gym provided); however, the algorithms have been benchmarked to perform well on mujoco-py==1.50.1.68.

1.3 Importing AntGather

To properly import and run the AntGather environment, you will first need to clone and install the rllab library. You can do so by running the following commands:

git clone https://github.com/rll/rllab.git
cd rllab
python setup.py develop
git submodule add -f https://github.com/florensacc/snn4hrl.git sandbox/snn4hrl

While all other environments run on all versions of MuJoCo, this one requires MuJoCo-1.3.1. You may also need to install some missing packages that are required by rllab. If your installation is successful, the following command should not fail:

python experiments/run_fcnet.py "AntGather"

When benchmarking this environment, we modified the control range and frame skip to match those used for the other Ant environments. If you would like to recreate these results and replay any pretrained policies, you will need to modify the rllab module such that the git diff of the repository returns the following:

--- a/rllab/envs/mujoco/mujoco_env.py
+++ b/rllab/envs/mujoco/mujoco_env.py
@@ -82,6 +82,7 @@ class MujocoEnv(Env):
             size = self.model.numeric_size.flat[init_qpos_id]
             init_qpos = self.model.numeric_data.flat[addr:addr + size]
             self.init_qpos = init_qpos
+        self.frame_skip = 5
         self.dcom = None
         self.current_com = None
         self.reset()
diff --git a/vendor/mujoco_models/ant.xml b/vendor/mujoco_models/ant.xml
index 1ee575e..906f350 100644
--- a/vendor/mujoco_models/ant.xml
+++ b/vendor/mujoco_models/ant.xml
@@ -68,13 +68,13 @@
     </body>
   </worldbody>
   <actuator>
-    <motor joint="hip_4" ctrlrange="-150.0 150.0" ctrllimited="true" />
-    <motor joint="ankle_4" ctrlrange="-150.0 150.0" ctrllimited="true" />
-    <motor joint="hip_1" ctrlrange="-150.0 150.0" ctrllimited="true" />
-    <motor joint="ankle_1" ctrlrange="-150.0 150.0" ctrllimited="true" />
-    <motor joint="hip_2" ctrlrange="-150.0 150.0" ctrllimited="true" />
-    <motor joint="ankle_2" ctrlrange="-150.0 150.0" ctrllimited="true" />
-    <motor joint="hip_3" ctrlrange="-150.0 150.0" ctrllimited="true" />
-    <motor joint="ankle_3" ctrlrange="-150.0 150.0" ctrllimited="true" />
+    <motor joint="hip_4" ctrlrange="-30.0 30.0" ctrllimited="true" />
+    <motor joint="ankle_4" ctrlrange="-30.0 30.0" ctrllimited="true" />
+    <motor joint="hip_1" ctrlrange="-30.0 30.0" ctrllimited="true" />
+    <motor joint="ankle_1" ctrlrange="-30.0 30.0" ctrllimited="true" />
+    <motor joint="hip_2" ctrlrange="-30.0 30.0" ctrllimited="true" />
+    <motor joint="ankle_2" ctrlrange="-30.0 30.0" ctrllimited="true" />
+    <motor joint="hip_3" ctrlrange="-30.0 30.0" ctrllimited="true" />
+    <motor joint="ankle_3" ctrlrange="-30.0 30.0" ctrllimited="true" />
   </actuator>
 </mujoco>

1.4 Installing Flow

In order to run any of the mixed-autonomy traffic flow tasks described here, you will need to install the flow library, along with any necessary third-party tools. To do so, follow the commands located on this link. If your installation was successful, the following command should run without failing:

python experiments/run_fcnet.py "ring-v0"

Once you've installed Flow, you will also be able to run all training environments located in the flow/examples folder from this repository. These can be accessed by prepending "flow:" to the environment name when running the scripts in h-baselines/experiments. For example, if you would like to run the "singleagent_ring" environment in flow/examples/rl/exp_configs, run:

python experiments/run_fcnet.py "flow:singleagent_ring"

2. Supported Models/Algorithms

This repository currently supports several algorithms for training goal-conditioned hierarchical reinforcement learning models.

2.1 RL Algorithms

This repository supports the training of policies via two off-policy RL algorithms: TD3 and SAC, as well as one on-policy RL algorithm: PPO.

To train a policy using one of these algorithms, create an RLAlgorithm object and execute its learn method, providing the algorithm with the appropriate policy in the process:

from hbaselines.algorithms import RLAlgorithm
from hbaselines.fcnet.td3 import FeedForwardPolicy  # for TD3 algorithm

# create the algorithm object
alg = RLAlgorithm(policy=FeedForwardPolicy, env="AntGather")

# train the policy for the allotted number of timesteps
alg.learn(total_timesteps=1000000)

The specific algorithm that is executed is defined by the policy that is provided. If, for example, you would like to switch the above script to train a feed-forward policy using the SAC or PPO algorithms, then the policy must simply be changed to:

from hbaselines.fcnet.sac import FeedForwardPolicy  # for SAC
from hbaselines.fcnet.ppo import FeedForwardPolicy  # for PPO

The hyperparameters and modifiable features of this algorithm are as follows:

policy (type [ hbaselines.base_policies.Policy ]) : the policy model to use
env (gym.Env or str) : the environment to learn from (if registered in Gym, can be str)
eval_env (gym.Env or str) : the environment to evaluate from (if registered in Gym, can be str)
nb_train_steps (int) : the number of training steps
nb_rollout_steps (int) : the number of rollout steps
nb_eval_episodes (int) : the number of evaluation episodes
actor_update_freq (int) : number of training steps per actor policy update step. The critic policy is updated every training step.
meta_update_freq (int) : number of training steps per meta policy update step. The actor policy of the meta-policy is further updated at the frequency provided by the actor_update_freq variable. Note that this value is only relevant when using the GoalConditionedPolicy policy.
reward_scale (float) : the value the reward should be scaled by
render (bool) : enable rendering of the training environment
render_eval (bool) : enable rendering of the evaluation environment
num_envs (int) : number of environments used to run simulations in parallel. Each environment is run on a separate CPU and uses the same policy as the rest. Must be less than or equal to nb_rollout_steps. This term is covered in the following section.
verbose (int) : the verbosity level: 0 none, 1 training information, 2 tensorflow debug
policy_kwargs (dict) : policy-specific hyperparameters
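
To make the update-frequency terms above concrete, the following sketch counts how often each component would be updated over a number of training steps. It is a simplified illustration of the descriptions of actor_update_freq and meta_update_freq, not the repository's implementation. The function name `count_updates` and the rule that an update fires whenever the step index is a multiple of the frequency are assumptions made for this example.

```python
def count_updates(n_steps, actor_update_freq, meta_update_freq=None):
    """Count updates per component under an assumed schedule.

    Assumptions (not taken from the hbaselines source): the critic is
    updated at every training step, the actor every `actor_update_freq`
    steps, and the meta-policy (if any) every `meta_update_freq` steps.
    """
    counts = {"critic": 0, "actor": 0, "meta": 0}
    for step in range(n_steps):
        # The critic is updated at every training step.
        counts["critic"] += 1
        # The actor is updated less frequently (delayed policy updates).
        if step % actor_update_freq == 0:
            counts["actor"] += 1
        # The meta-policy is only relevant for hierarchical policies.
        if meta_update_freq is not None and step % meta_update_freq == 0:
            counts["meta"] += 1
    return counts


counts = count_updates(n_steps=10, actor_update_freq=2, meta_update_freq=5)
```

With these illustrative values, the critic is updated on all 10 steps, the actor on steps 0, 2, 4, 6, and 8, and the meta-policy on steps 0 and 5.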

2.1.1 Synchronous Updates

This repository supports parallelism via synchronous updates to speed up training for environments that are relatively slow to simulate. In order to do so, a specified number of environments are instantiated and updated in parallel for a number of rollout steps before calling the next policy update operation, as seen in the figure below. The number of environments in this case must be less than or equal to the number of rollout steps, as specified under nb_rollout_steps.

To assign multiple CPUs/environments for a given training algorithm, set the num_envs term as seen below:

from hbaselines.algorithms import RLAlgorithm

alg = RLAlgorithm(
    ...,
    # set num_envs as seen in the above figure
    num_envs=3,
    # set nb_rollout_steps as seen in the above figure
    nb_rollout_steps=5,
)
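
The alternation between synchronous rollouts and policy updates can be sketched with a toy example. This is only an illustration of the pattern described above: `CounterEnv` and `run_training` are hypothetical names, and the assumption that every environment advances nb_rollout_steps steps between updates may not match how the repository distributes rollout steps across environments.

```python
class CounterEnv:
    """Hypothetical stand-in environment whose state sums the actions."""

    def __init__(self):
        self.state = 0

    def step(self, action):
        self.state += action
        return self.state


def run_training(num_envs, nb_rollout_steps, num_updates):
    """Alternate between synchronous rollouts and policy updates."""
    if num_envs > nb_rollout_steps:
        raise ValueError("num_envs must be <= nb_rollout_steps")

    envs = [CounterEnv() for _ in range(num_envs)]
    samples, updates = [], 0
    for _ in range(num_updates):
        # All environments are stepped with the same (fixed) policy before
        # the next update operation is called.
        for _ in range(nb_rollout_steps):
            samples.extend(env.step(action=1) for env in envs)
        updates += 1  # placeholder for the policy update operation
    return len(samples), updates


n_samples, n_updates = run_training(num_envs=3, nb_rollout_steps=5, num_updates=2)
```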

2.2 Fully Connected Neural Networks

We include a generic feed-forward neural network within the repository to validate the performance of typically used neural network models on the benchmarked environments. This consists of a pair of actor and critic fully connected networks with a tanh nonlinearity at the output layer of the actor. The output of the actor for the off-policy algorithms (TD3 and SAC) is also scaled to match the desired action space.

The feed-forward policy can be imported by including the following script:

# for TD3
from hbaselines.fcnet.td3 import FeedForwardPolicy

# for SAC
from hbaselines.fcnet.sac import FeedForwardPolicy

# for PPO
from hbaselines.fcnet.ppo import FeedForwardPolicy

This model can then be passed to the algorithm via the policy parameter. The modifiable parameters of this policy are as follows:

sess (tf.compat.v1.Session) : the current TensorFlow session
ob_space (gym.spaces.*) : the observation space of the environment
ac_space (gym.spaces.*) : the action space of the environment
co_space (gym.spaces.*) : the context space of the environment
verbose (int) : the verbosity level: 0 none, 1 training information, 2 tensorflow debug
model_params (dict) : dictionary of model-specific parameters, including:
- model_type (str) : the type of model to use. Must be one of {"fcnet", "conv"}.
- layers (list of int) : the size of the neural network for the policy
- layer_norm (bool) : enable layer normalisation
- act_fun (tf.nn.*) : the activation function to use in the neural network
- ignore_image (bool) : whether to ignore the image when the observation includes one. Required if "model_type" is set to "conv".
- image_height (int) : the height of the image in the observation. Required if "model_type" is set to "conv".
- image_width (int) : the width of the image in the observation. Required if "model_type" is set to "conv".
- image_channels (int) : the number of channels of the image in the observation. Required if "model_type" is set to "conv".
- kernel_sizes (list of int) : the kernel size of the neural network conv layers for the policy. Required if "model_type" is set to "conv".
- strides (list of int) : the strides of the neural network conv layers for the policy. Required if "model_type" is set to "conv".
- filters (list of int) : the channels of the neural network conv layers for the policy. Required if "model_type" is set to "conv".

Additionally, TD3 policy parameters are:

buffer_size (int) : the max number of transitions to store
batch_size (int) : SGD batch size
actor_lr (float) : actor learning rate
critic_lr (float) : critic learning rate
tau (float) : target update rate
gamma (float) : discount factor
use_huber (bool) : specifies whether to use the huber distance function as the loss for the critic. If set to False, the mean-squared error metric is used instead
noise (float) : scaling term to the range of the action space, that is subsequently used as the standard deviation of Gaussian noise added to the action if apply_noise is set to True in get_action
target_policy_noise (float) : standard deviation term to the noise from the output of the target actor policy. See TD3 paper for more.
target_noise_clip (float) : clipping term for the noise injected in the target actor policy
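
As an illustration of how target_policy_noise and target_noise_clip interact, the sketch below applies TD3-style target policy smoothing to a single action. It follows the general technique from the TD3 paper rather than this repository's code: the function name `smoothed_target_action` is hypothetical, and whether the repository additionally scales these terms by the action range is not covered here.

```python
import random


def smoothed_target_action(target_action, target_policy_noise,
                           target_noise_clip, low, high, rng):
    """Add clipped Gaussian noise to a target action (TD3-style smoothing)."""
    noisy_action = []
    for a, a_low, a_high in zip(target_action, low, high):
        noise = rng.gauss(0.0, target_policy_noise)
        # Clip the noise itself so that the perturbation stays small.
        noise = max(-target_noise_clip, min(target_noise_clip, noise))
        # Clip the result so that it remains a valid action.
        noisy_action.append(max(a_low, min(a_high, a + noise)))
    return noisy_action


rng = random.Random(0)
action = smoothed_target_action(
    target_action=[0.9, -0.2],
    target_policy_noise=0.2,
    target_noise_clip=0.5,
    low=[-1.0, -1.0],
    high=[1.0, 1.0],
    rng=rng,
)
```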

SAC policy parameters are:

buffer_size (int) : the max number of transitions to store
batch_size (int) : SGD batch size
actor_lr (float) : actor learning rate
critic_lr (float) : critic learning rate
tau (float) : target update rate
gamma (float) : discount factor
use_huber (bool) : specifies whether to use the huber distance function as the loss for the critic. If set to False, the mean-squared error metric is used instead
target_entropy (float): target entropy used when learning the entropy coefficient. If set to None, a heuristic value is used.

And PPO policy parameters are:

learning_rate (float) : the learning rate
n_minibatches (int) : number of training minibatches per update
n_opt_epochs (int) : number of training epochs per update procedure
gamma (float) : the discount factor
lam (float) : factor for trade-off of bias vs variance for Generalized Advantage Estimator
ent_coef (float) : entropy coefficient for the loss calculation
vf_coef (float) : value function coefficient for the loss calculation
max_grad_norm (float) : the maximum value for the gradient clipping
cliprange (float) : clipping parameter, it can be a function
cliprange_vf (float) : clipping parameter for the value function, it can be a function. This is a parameter specific to the OpenAI implementation. If None is passed (default), then cliprange (that is used for the policy) will be used. IMPORTANT: this clipping depends on the reward scaling. To deactivate value function clipping (and recover the original PPO implementation), you have to pass a negative value (e.g. -1).

These parameters can be assigned when using the algorithm object by assigning them via the policy_kwargs term. For example, if you would like to train a fully connected network using the TD3 algorithm with a hidden size of [64, 64], this could be done as such:

from hbaselines.algorithms import RLAlgorithm
from hbaselines.fcnet.td3 import FeedForwardPolicy  # for TD3 algorithm

# create the algorithm object
alg = RLAlgorithm(
    policy=FeedForwardPolicy, 
    env="AntGather",
    policy_kwargs={
        # modify the network to include a hidden shape of [64, 64]
        "model_params": {
            "layers": [64, 64],
        },
    }
)

# train the policy for the allotted number of timesteps
alg.learn(total_timesteps=1000000)

All policy_kwargs terms that are not specified are assigned default parameters. These default terms are available via the following command:

from hbaselines.algorithms.rl_algorithm import FEEDFORWARD_PARAMS
print(FEEDFORWARD_PARAMS)

Additional algorithm-specific default policy parameters can be found via the following commands:

# for TD3
from hbaselines.algorithms.rl_algorithm import TD3_PARAMS
print(TD3_PARAMS)

# for SAC
from hbaselines.algorithms.rl_algorithm import SAC_PARAMS
print(SAC_PARAMS)

# for PPO
from hbaselines.algorithms.rl_algorithm import PPO_PARAMS
print(PPO_PARAMS)

2.3 Goal-Conditioned HRL

Goal-conditioned HRL models, also known as feudal models, are a variant of hierarchical models that have been widely studied in the HRL community. This repository supports a two-level (Manager/Worker) variant of this policy, seen in the figure below. The policy can be imported via the following command:

# for TD3
from hbaselines.goal_conditioned.td3 import GoalConditionedPolicy

# for SAC
from hbaselines.goal_conditioned.sac import GoalConditionedPolicy

This network consists of a high-level, or Manager, policy $\pi_m$ that computes and outputs goals $g_t \sim \pi_m(s_t, c)$ every $k$ time steps, and a low-level policy $\pi_w$ that takes as inputs the current state and the assigned goals and is encouraged to perform actions $a_t \sim \pi_w(s_t, g_t)$ that satisfy these goals via an intrinsic reward function: $r_w(s_t, g_t, s_{t+1})$. The contextual term, $c$, parametrizes the environmental objective (e.g. desired position to move to), and consequently is passed both to the manager policy as well as the environmental reward function $r_m(s_t, c)$.
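
The interaction between the two levels can be sketched as a simple control loop. The example below is a one-dimensional toy, not the repository's implementation: the policies, the dynamics, and the function names (`manager_policy`, `worker_policy`, `intrinsic_reward`, `rollout`) are all hypothetical stand-ins chosen to show when goals are issued and how the Worker is rewarded.

```python
def manager_policy(state, context):
    """Toy Manager: propose a goal halfway between the state and the context."""
    return state + 0.5 * (context - state)


def worker_policy(state, goal):
    """Toy Worker: take a bounded step toward the assigned goal."""
    return max(-1.0, min(1.0, goal - state))


def intrinsic_reward(state, goal, next_state):
    """Negative distance between the goal and the state that was reached."""
    return -abs(goal - next_state)


def rollout(initial_state, context, meta_period, horizon):
    state, goal, goals, rewards = initial_state, None, [], []
    for t in range(horizon):
        # The Manager issues a new goal every `meta_period` steps.
        if t % meta_period == 0:
            goal = manager_policy(state, context)
        # The Worker acts on the current state and the assigned goal.
        action = worker_policy(state, goal)
        next_state = state + action  # toy dynamics
        rewards.append(intrinsic_reward(state, goal, next_state))
        goals.append(goal)
        state = next_state
    return state, goals, rewards


final_state, goals, rewards = rollout(0.0, context=8.0, meta_period=3, horizon=6)
```

In this run the goal stays fixed for three steps at a time, and the intrinsic reward rises toward zero as the Worker approaches each goal.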

All of the parameters specified within the Fully Connected Neural Networks section are valid for this policy as well. Further parameters are described in the subsequent sections below.

All policy_kwargs terms that are not specified are assigned default parameters. These default terms are available via the following command:

from hbaselines.algorithms.rl_algorithm import GOAL_CONDITIONED_PARAMS
print(GOAL_CONDITIONED_PARAMS)

Moreover, similar to the feed-forward policy, additional algorithm-specific default policy parameters can be found via the following commands:

# for TD3
from hbaselines.algorithms.rl_algorithm import TD3_PARAMS
print(TD3_PARAMS)

# for SAC
from hbaselines.algorithms.rl_algorithm import SAC_PARAMS
print(SAC_PARAMS)

# for PPO
from hbaselines.algorithms.rl_algorithm import PPO_PARAMS
print(PPO_PARAMS)

2.3.1 Meta Period

The meta-policy action period, $k$, can be specified to the policy during training by passing the term under the meta_period policy parameter. This can be assigned through the algorithm as follows:

from hbaselines.algorithms import RLAlgorithm
from hbaselines.goal_conditioned.td3 import GoalConditionedPolicy  # for TD3 algorithm

alg = RLAlgorithm(
    ...,
    policy=GoalConditionedPolicy,
    policy_kwargs={
        # specify the meta-policy action period
        "meta_period": 10
    }
)

2.3.2 Intrinsic Rewards

The intrinsic rewards, or $r_w(s_t, g_t, s_{t+1})$, define the rewards assigned to the lower-level policies for achieving goals assigned by the policies immediately above them. The choice of intrinsic reward can have a significant effect on the training performance of both the upper- and lower-level policies. Currently, this repository supports the use of three intrinsic reward functions:

negative_distance: This is of the form:

$$r_w(s_t, g_t, s_{t+1}) = -||g_t - s_{t+1}||_2$$

if relative_goals is set to False, and

$$r_w(s_t, g_t, s_{t+1}) = -||s_t + g_t - s_{t+1}||_2$$

if relative_goals is set to True. This attribute is described in the section on HIRO.

non_negative_distance: This reward function is designed to maintain a positive value within the intrinsic rewards to prevent the lower-level agents from being incentivized to fall/die in environments that can terminate prematurely. This is done by offsetting the value by the maximum assignable distance, assuming that the states always fall within the goal space ($g_\text{min}$, $g_\text{max}$). This reward is of the form:

$$r_w(s_t, g_t, s_{t+1}) = ||g_\text{max} - g_\text{min}||_2 - ||g_t - s_{t+1}||_2$$

if relative_goals is set to False, and

$$r_w(s_t, g_t, s_{t+1}) = ||g_\text{max} - g_\text{min}||_2 - ||s_t + g_t - s_{t+1}||_2$$

if relative_goals is set to True. This attribute is described in the section on HIRO.

exp_negative_distance: This reward function is designed to maintain the reward between 0 and 1 for environments that may terminate prematurely. This is of the form:

$$r_w(s_t, g_t, s_{t+1}) = \exp\left(-\left(||g_t - s_{t+1}||_2\right)^2\right)$$

if relative_goals is set to False, and

$$r_w(s_t, g_t, s_{t+1}) = \exp\left(-\left(||s_t + g_t - s_{t+1}||_2\right)^2\right)$$

if relative_goals is set to True. This attribute is described in the section on HIRO.

Intrinsic rewards of the form above are not scaled by any term, and as such may be dominated by the largest term in the goal space. To circumvent this, we also include a scaled variant of each of the above intrinsic rewards where the states and goals are divided by the goal space of the higher-level policies. The new scaled rewards are then:

$$r_{w,\text{scaled}}(s_t, g_t, s_{t+1}) = r_w\left(\frac{s_t}{0.5 (g_\text{max} - g_\text{min})}, \frac{g_t}{0.5 (g_\text{max} - g_\text{min})}, \frac{s_{t+1}}{0.5 (g_\text{max} - g_\text{min})}\right)$$

where $g_\text{max}$ is the goal-space high values and $g_\text{min}$ are the goal-space low values. These intrinsic rewards can be used by prefixing the reward name with "scaled_", for example: scaled_negative_distance, scaled_non_negative_distance, or scaled_exp_negative_distance.
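
The reward functions described in this section can be sketched in plain Python as follows. This is a minimal illustration and not the repository's code. It assumes that the distance is the L2 norm between the goal and the next state, that relative goals are offsets from the current state, that the exponential variant exponentiates the negative squared distance, and that the scaled variants divide by half the width of the goal space. The helper name `scaled` is hypothetical.

```python
import math


def negative_distance(state, goal, next_state, relative_goals=False):
    """Negative L2 distance between the goal and the next state."""
    if relative_goals:
        # A relative goal is an offset from the current state.
        goal = [s + g for s, g in zip(state, goal)]
    return -math.sqrt(sum((g - n) ** 2 for g, n in zip(goal, next_state)))


def non_negative_distance(state, goal, next_state, goal_low, goal_high,
                          relative_goals=False):
    """Offset the negative distance by the largest distance in goal space."""
    offset = math.sqrt(sum((h - l) ** 2 for h, l in zip(goal_high, goal_low)))
    return offset + negative_distance(state, goal, next_state, relative_goals)


def exp_negative_distance(state, goal, next_state, relative_goals=False):
    """Map the distance to (0, 1]; the squared form is an assumption."""
    dist = negative_distance(state, goal, next_state, relative_goals)
    return math.exp(-(dist ** 2))


def scaled(reward_fn, goal_low, goal_high):
    """Wrap a reward so that states and goals are divided by the goal range.

    Assumption: the scale is half the width of the goal space per dimension.
    """
    scale = [0.5 * (h - l) for h, l in zip(goal_high, goal_low)]

    def divide(x):
        return [xi / si for xi, si in zip(x, scale)]

    def scaled_fn(state, goal, next_state, relative_goals=False):
        return reward_fn(divide(state), divide(goal), divide(next_state),
                         relative_goals)

    return scaled_fn


scaled_negative_distance = scaled(negative_distance, [-10.0, -1.0], [10.0, 1.0])
```

In the scaled example, a goal of (10, 1) sits at the edge of the goal space in both dimensions, so both dimensions contribute equally to the reward instead of the first one dominating.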

To assign your choice of intrinsic rewards when training a hierarchical policy, set the intrinsic_reward_type attribute to the type of intrinsic reward you would like to use:

from hbaselines.algorithms import RLAlgorithm
from hbaselines.goal_conditioned.td3 import GoalConditionedPolicy  # for TD3 algorithm

alg = RLAlgorithm(
    ...,
    policy=GoalConditionedPolicy,
    policy_kwargs={
        # assign the intrinsic reward you would like to use
        "intrinsic_reward_type": "scaled_negative_distance"
    }
)

2.3.3 HIRO (Data Efficient Hierarchical Reinforcement Learning)

The HIRO [3] algorithm provides two primary contributions to improve training of generic goal-conditioned hierarchical policies.

First of all, the HIRO algorithm redefines the assigned goals from absolute desired states to relative changes in states. This is done by redefining the intrinsic rewards provided to the Worker policies (see the Intrinsic Rewards section). In order to maintain the same absolute position of the goal regardless of state change, a fixed goal-transition function is used in between goal updates by the manager policy. The goal transition function is accordingly defined as:

$$g_{t+1} = \begin{cases} \pi_m(s_{t+1}, c) & \text{if } (t+1) \bmod k = 0 \\ h(g_t, s_t, s_{t+1}) = s_t + g_t - s_{t+1} & \text{otherwise} \end{cases}$$

where $k$ is the meta_period, $\pi_m$ is the manager policy, and $c$ is the contextual term.
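
A short worked example helps show why a fixed goal transition keeps the absolute target unchanged. The sketch below assumes the transition takes the form h(g, s, s') = s + g - s', as in the HIRO paper [3]; the function name `goal_transition` and the numbers are illustrative only.

```python
def goal_transition(goal, state, next_state):
    """Fixed goal transition h(g, s, s') = s + g - s' for relative goals."""
    return [s + g - n for g, s, n in zip(goal, state, next_state)]


# The Worker is asked to move by (+3, +4) from the state (1, 1).
state, goal = [1.0, 1.0], [3.0, 4.0]
target = [s + g for s, g in zip(state, goal)]  # absolute target: (4, 5)

# After the agent moves to (2, 3), the relative goal shrinks accordingly...
next_state = [2.0, 3.0]
next_goal = goal_transition(goal, state, next_state)

# ...while the absolute target it encodes is unchanged.
next_target = [s + g for s, g in zip(next_state, next_goal)]
```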

In order to use relative goals when training a hierarchical policy, set the relative_goals parameter to True:

from hbaselines.algorithms import RLAlgorithm
from hbaselines.goal_conditioned.td3 import GoalConditionedPolicy  # for TD3 algorithm

alg = RLAlgorithm(
    ...,
    policy=GoalConditionedPolicy,
    policy_kwargs={
        # add this line to include HIRO-style relative goals
        "relative_goals": True
    }
)

Second, HIRO addresses the non-stationarity effects between the Manager and Worker policies, which can have a detrimental effect particularly in off-policy training, by relabeling the manager actions (or goals) to make the actual observed action sequence more likely to have happened with respect to the current instantiation of the Worker policy. This is done by sampling a set of candidate goals from a Gaussian centered at $s_{t+k} - s_t$ and choosing the candidate goal that maximizes the log-probability of the actions that were originally performed by the Worker.

In order to use HIRO's goal relabeling (or off-policy corrections) procedure when training a hierarchical policy, set the off_policy_corrections parameter to True:

from hbaselines.algorithms import RLAlgorithm
from hbaselines.goal_conditioned.td3 import GoalConditionedPolicy  # for TD3 algorithm

alg = RLAlgorithm(
    ...,
    policy=GoalConditionedPolicy,
    policy_kwargs={
        # add this line to include HIRO-style off policy corrections
        "off_policy_corrections": True
    }
)
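
The goal relabeling procedure described above can be sketched on a one-dimensional toy problem. This is an illustration of the idea from the HIRO paper [3] rather than this repository's implementation: `worker_policy` is a hypothetical deterministic Worker, the candidate count and standard deviation are arbitrary, and the log-probability is approximated by the negative squared error between the observed actions and the actions the current Worker would take.

```python
import random


def worker_policy(state, goal):
    """Toy deterministic Worker: the action is a bounded copy of the goal."""
    return max(-1.0, min(1.0, goal))


def action_error(goal, states, actions):
    """Squared error between the observed and the re-computed Worker actions.

    Relative goals are propagated with h(g, s, s') = s + g - s'.
    """
    error = 0.0
    for t, action in enumerate(actions):
        error += (action - worker_policy(states[t], goal)) ** 2
        goal = states[t] + goal - states[t + 1]
    return error


def relabel_goal(original_goal, states, actions, num_candidates=8,
                 std=0.5, rng=None):
    """Pick the candidate goal under which the observed actions fit best."""
    rng = rng or random.Random(0)
    center = states[-1] - states[0]  # state change over the meta period
    # The original goal and the observed state change are also candidates.
    candidates = [original_goal, center]
    candidates += [rng.gauss(center, std) for _ in range(num_candidates)]
    # Maximizing the (approximate) log-probability of the observed actions
    # amounts to minimizing the squared action error.
    return min(candidates, key=lambda g: action_error(g, states, actions))


states = [0.0, 0.5, 1.0, 1.5]
actions = [0.5, 0.5, 0.5]
new_goal = relabel_goal(original_goal=3.0, states=states, actions=actions)
```

Because the original goal is always one of the candidates, the relabeled goal never explains the observed actions worse than the goal that was originally stored.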

2.3.4 HAC (Learning Multi-level Hierarchies With Hindsight)

The HAC algorithm [5] attempts to address non-stationarity between levels of a goal-conditioned hierarchy by employing various forms of hindsight to samples within the replay buffer.

Hindsight action transitions assist by training each subgoal policy with respect to a transition function that simulates the optimal lower-level policy hierarchy. This is done by replacing the action performed by the manager with the subgoal state achieved in hindsight. For example, given an original sub-policy transition:

sample = {
    "meta observation": s_0,
    "meta action": g_0,
    "meta reward": r,
    "worker observations": [
        (s_0, g_0),
        (s_1, h(g_0, s_0, s_1)),
        ...
        (s_k, h(g_{k-1}, s_{k-1}, s_k))
    ],
    "worker actions": [
        a_0,
        a_1,
        ...
        a_{k-1}
    ],
    "intrinsic rewards": [
        r_w(s_0, g_0, s_1),
        r_w(s_1, h(g_0, s_0, s_1), s_2),
        ...
        r_w(s_{k-1}, h(g_{k-2}, s_{k-2}, s_{k-1}), s_k)
    ]
}

The original goal is relabeled to match the state achieved in hindsight as follows:

sample = {
    "meta observation": s_0,
    "meta action": s_k, <---- the changed component
    "meta reward": r,
    "worker observations": [
        (s_0, g_0),
        (s_1, h(g_0, s_0, s_1)),
        ...
        (s_k, h(g_{k-1}, s_{k-1}, s_k))
    ],
    "worker actions": [
        a_0,
        a_1,
        ...
        a_{k-1}
    ],
    "intrinsic rewards": [
        r_w(s_0, g_0, s_1),
        r_w(s_1, h(g_0, s_0, s_1), s_2),
        ...
        r_w(s_{k-1}, h(g_{k-2}, s_{k-2}, s_{k-1}), s_k)
    ]
}

In cases when the relative_goals feature is being employed, the hindsight goal is labeled using the inverse goal transition function. In other words, for a sample with a meta period of length $k$, the goal for every worker observation indexed by $t$ is:

$$\bar{g}_t = \begin{cases} 0 & \text{if } t = k \\ \bar{g}_{t+1} - s_t + s_{t+1} & \text{otherwise} \end{cases}$$

The "meta action", as represented in the example above, is then $\bar{g}_0$.
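
The backward relabeling of relative goals can be sketched as follows. This is a minimal example under the assumption that the inverse goal transition is obtained by inverting h(g, s, s') = s + g - s', with a zero goal at the final step (the final state is treated as having achieved its goal). The function name `hindsight_relative_goals` is hypothetical.

```python
def hindsight_relative_goals(states):
    """Relabel relative goals so that the final state is the achieved goal.

    Applies the inverse goal transition backwards in time, starting from a
    zero goal at the last step.
    """
    k = len(states) - 1
    goals = [None] * (k + 1)
    goals[k] = [0.0] * len(states[k])
    for t in range(k - 1, -1, -1):
        goals[t] = [
            g_next - s + s_next
            for g_next, s, s_next in zip(goals[t + 1], states[t], states[t + 1])
        ]
    return goals


states = [[0.0, 0.0], [1.0, 0.5], [1.5, 2.0], [3.0, 2.5]]
goals = hindsight_relative_goals(states)
```

Unrolling the recursion shows that each relabeled goal equals the difference between the final state and the state at that step, so the first entry (the relabeled "meta action") is the total displacement over the meta period.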

Hindsight goal transitions extend the use of hindsight to the worker observations and intrinsic rewards within the sample as well. This is done by modifying the relevant worker-specific features as follows:

sample = {
    "meta observation": s_0,
    "meta action": \bar{g}_0,
    "meta reward": r,
    "worker observations": [ <-----------
        (s_0, \bar{g}_0),               |
        (s_1, \bar{g}_1),               |---- the changed components
        ...                             |
        (s_k, \bar{g}_k)                |
    ], <---------------------------------
    "worker actions": [
        a_0,
        a_1,
        ...
        a_{k-1}
    ],
    "intrinsic rewards": [ <-----------------
        r_w(s_0, \bar{g}_0, s_1),           |
        r_w(s_1, \bar{g}_1, s_2),           |---- the changed components
        ...                                 |
        r_w(s_{k-1}, \bar{g}_{k-1}, s_k)    |
    ] <--------------------------------------
}

where $\bar{g}_t$ for $t = 0, \dots, k$ is equal to $s_k$ if relative_goals is False and is defined by the equation above if set to True.

Finally, sub-goal testing promotes exploration when using hindsight by storing the original (non-hindsight) sample in the replay buffer as well. This happens at a rate defined by the subgoal_testing_rate term.

In order to use hindsight action and goal transitions when training a hierarchical policy, set the hindsight parameter to True:

from hbaselines.algorithms import RLAlgorithm
from hbaselines.goal_conditioned.td3 import GoalConditionedPolicy  # for TD3 algorithm

alg = RLAlgorithm(
    ...,
    policy=GoalConditionedPolicy,
    policy_kwargs={
        # include hindsight action and goal transitions in the replay buffer
        "hindsight": True,
        # specify the sub-goal testing rate
        "subgoal_testing_rate": 0.3
    }
)

2.3.5 CHER (Inter-Level Cooperation in Hierarchical Reinforcement Learning)

The CHER algorithm [4] attempts to promote cooperation between Manager and Worker policies in a goal-conditioned hierarchy by including a weighted cooperative gradient term to the Manager's gradient update procedure (see the right figure below).

Under this formulation, the Manager's update step is defined as:

To use the cooperative gradient update procedure, set the cooperative_gradients term in policy_kwargs to True. The weighting term ($\lambda$ in the above equation) can be modified via the cg_weights term (see the example below).

from hbaselines.algorithms import RLAlgorithm
from hbaselines.goal_conditioned.td3 import GoalConditionedPolicy  # for TD3 algorithm

alg = RLAlgorithm(
    ...,
    policy=GoalConditionedPolicy,
    policy_kwargs={
        # add this line to include the cooperative gradient update procedure
        # for the higher-level policies
        "cooperative_gradients": True,
        # specify the cooperative gradient (lambda) weight
        "cg_weights": 0.01
    }
)

2.4 Multi-Agent Policies

This repository also supports the training of multi-agent variants of both the fully connected and goal-conditioned policies. The fully connected policies are imported via the following commands:

# for TD3
from hbaselines.multiagent.td3 import MultiFeedForwardPolicy

# for SAC
from hbaselines.multiagent.sac import MultiFeedForwardPolicy

Moreover, the hierarchical variants are imported via the following commands:

# for TD3
from hbaselines.multiagent.h_td3 import MultiGoalConditionedPolicy

# for SAC
from hbaselines.multiagent.h_sac import MultiGoalConditionedPolicy

These policies support training off-policy variants of three popular multi-agent algorithms:

Independent learners: Independent (or Naive) learners provide a separate policy with independent parameters to each agent in an environment. Within this setting, agents are provided separate observations and reward signals, and store their samples and perform updates separately. A review of independent learners in reinforcement learning can be found here: https://hal.archives-ouvertes.fr/hal-00720669/document

To train a policy using independent learners, do not modify any policy-specific attributes:
```
from hbaselines.algorithms.rl_algorithm import RLAlgorithm
from hbaselines.multiagent.td3 import MultiFeedForwardPolicy  # for TD3

alg = RLAlgorithm(
    policy=MultiFeedForwardPolicy,
    env="...",  # replace with an appropriate environment
    policy_kwargs={}
)
```
Shared policies: Unlike the independent learners formulation, shared policies utilize a single policy with shared parameters for all agents within the network. Moreover, the samples experienced by all agents are stored within one unified replay buffer. See the following link for an early review of the benefit of shared policies: https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.55.8066&rep=rep1&type=pdf

To train a policy using the shared policy feature, set the shared attribute to True:
```
from hbaselines.algorithms.rl_algorithm import RLAlgorithm
from hbaselines.multiagent.td3 import MultiFeedForwardPolicy  # for TD3

alg = RLAlgorithm(
    policy=MultiFeedForwardPolicy,
    env="...",  # replace with an appropriate environment
    policy_kwargs={
        "shared": True,
    }
)
```
MADDPG: We implement algorithmic variants of MADDPG for all supported off-policy RL algorithms. See: https://arxiv.org/pdf/1706.02275.pdf

To train a policy using its MADDPG variant as opposed to independent learners, set the maddpg attribute to True:
```
from hbaselines.algorithms.rl_algorithm import RLAlgorithm
from hbaselines.multiagent.td3 import MultiFeedForwardPolicy  # for TD3

alg = RLAlgorithm(
    policy=MultiFeedForwardPolicy,
    env="...",  # replace with an appropriate environment
    policy_kwargs={
        "maddpg": True,
        "shared": False,  # or True
    }
)
```
This works for both shared and non-shared policies. For shared policies, we use a single centralized value function instead of a value function for each agent.

Note: MADDPG variants of the goal-conditioned hierarchies are currently not supported.
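
The difference between independent learners and shared policies in how samples are stored can be illustrated with a small sketch. It only mirrors the description above (separate buffers per agent versus one unified replay buffer) and does not reflect the repository's replay buffer classes; the function name `store_samples` and the buffer key "shared" are hypothetical.

```python
from collections import defaultdict


def store_samples(samples, shared):
    """Group (agent_id, transition) pairs into replay buffers.

    With `shared=True`, every agent writes to one unified buffer; otherwise
    each agent keeps its own buffer.
    """
    buffers = defaultdict(list)
    for agent_id, transition in samples:
        key = "shared" if shared else agent_id
        buffers[key].append(transition)
    return dict(buffers)


samples = [("agent_0", "t0"), ("agent_1", "t1"), ("agent_0", "t2")]
independent = store_samples(samples, shared=False)
unified = store_samples(samples, shared=True)
```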

3. Environments

We benchmark the performance of all algorithms on a set of standardized MuJoCo [7] (robotics) and Flow [8] (mixed-autonomy traffic) benchmarks. A description of each of the studied environments can be found below.

3.1 MuJoCo Environments

AntGather

This task was initially provided by [6].

In this task, a quadrupedal (Ant) agent is placed in a 20x20 space with 8 apples and 8 bombs. The agent receives a reward of +1 for collecting an apple and -1 for collecting a bomb. All other actions yield a reward of 0.

AntMaze

This task was initially provided by [3].

In this task, immovable blocks are placed to confine the agent to a U-shaped corridor. That is, blocks are placed everywhere except at (0,0), (8,0), (16,0), (16,8), (16,16), (8,16), and (0,16). The agent is initialized at position (0,0) and tasked with reaching a specific target position. "Success" in this environment is defined as being within an L2 distance of 5 from the target.

AntPush

This task was initially provided by [3].

In this task, immovable blocks are placed everywhere except at (0,0), (-8,0), (-8,8), (0,8), (8,8), (16,8), and (0,16), and a movable block is placed at (0,8). The agent is initialized at position (0,0), and is tasked with the objective of reaching position (0,19). Therefore, the agent must first move to the left, push the movable block to the right, and then finally navigate to the target. "Success" in this environment is defined as being within an L2 distance of 5 from the target.

AntFall

This task was initially provided by [3].

In this task, the agent is initialized on a platform of height 4. Immovable blocks are placed everywhere except at (-8,0), (0,0), (-8,8), (0,8), (-8,16), (0,16), (-8,24), and (0,24). The raised platform is absent in the region [-4,12]x[12,20], and a movable block is placed at (8,8). The agent is initialized at position (0,0,4.5), and is tasked with the objective of reaching position (0,27,4.5). To achieve this, the agent must first push the movable block into the chasm and walk on top of it before navigating to the target. "Success" in this environment is defined as being within an L2 distance of 5 from the target.

3.2 Flow Environments

We also explore the use of hierarchical policies on a suite of mixed-autonomy traffic control tasks, built on the Flow [8] framework for RL in microscopic (vehicle-level) traffic simulators. Within these environments, a subset of the vehicles in any given network is replaced with "automated" vehicles whose actions are provided by an RL policy. A description of the attributes of the MDP within these tasks is provided in the following sub-sections. Additional information can be found through the environment classes and flow-specific parameters.

The table below describes all tasks within this repository that are available for training. Any of these environments can be used by passing the environment name to the `env` parameter in the algorithm class. The multi-agent variants of these environments can also be trained by adding "multiagent-" to the start of the environment name (e.g. "multiagent-ring-v0").

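| Network type | Environment name | Number of AVs | Total vehicles | AV ratio | Inflow rate (veh/hr) | Acceleration penalty | Stopping penalty |
|--------------|------------------|---------------|----------------|----------|----------------------|----------------------|------------------|
| ring | ring-v0 | 5 | 50 - 75 | 1/15 - 1/10 | -- | yes | yes |
| ring | ring-v1 | 5 | 50 - 75 | 1/15 - 1/10 | -- | yes | no |
| ring | ring-v2 | 5 | 50 - 75 | 1/15 - 1/10 | -- | no | no |
| merge | merge-v0 | ~5 | ~50 | 1/10 | 2000 | yes | no |
| merge | merge-v1 | ~13 | ~50 | 1/4 | 2000 | yes | no |
| merge | merge-v2 | ~17 | ~50 | 1/3 | 2000 | yes | no |
| highway | highway-v0 | ~10 | ~150 | 1/12 | 2215 | yes | yes |
| highway | highway-v1 | ~10 | ~150 | 1/12 | 2215 | yes | no |
| highway | highway-v2 | ~10 | ~150 | 1/12 | 2215 | no | no |
| I-210 | i210-v0 | ~50 | ~800 | 1/15 | 10250 | yes | yes |
| I-210 | i210-v1 | ~50 | ~800 | 1/15 | 10250 | yes | no |
| I-210 | i210-v2 | ~50 | ~800 | 1/15 | 10250 | no | no |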
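As a small illustration of the naming convention, the multi-agent variant of an environment name can be derived from the single-agent name. The helper `multiagent_env_name` below is hypothetical and not part of the repository; it only builds the string that would be passed as the environment name.

```python
def multiagent_env_name(env_name):
    """Return the multi-agent variant of a Flow environment name.

    Hypothetical helper: prepends "multiagent-" unless the name already
    carries that prefix.
    """
    prefix = "multiagent-"
    if env_name.startswith(prefix):
        return env_name
    return prefix + env_name


# Build the multi-agent names for the three ring environments.
for name in ["ring-v0", "ring-v1", "ring-v2"]:
    print(multiagent_env_name(name))
```

The resulting string would then be supplied as the environment name when creating the algorithm object.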

States

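The state for any of these environments consists of the speeds and bumper-to-bumper gaps of the vehicles immediately preceding and following the AVs, as well as the speed of the AVs, i.e. $s := (v_{i,\text{lead}}, h_{i,\text{lead}}, v_{i,\text{lag}}, h_{i,\text{lag}}, v_i)$ for each AV $i$, where $v$ denotes a speed and $h$ a bumper-to-bumper gap. In single-agent settings, these observations are concatenated into a single observation that is passed to a centralized policy.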

In order to account for variability in the number of AVs ($n_{\text{AV}}$) in the single-agent setting, a constant term $n_{\text{RL}}$ is defined. When $n_{\text{AV}} > n_{\text{RL}}$, information from the extra AVs is not included in the state. Moreover, if $n_{\text{AV}} < n_{\text{RL}}$, the state is padded with zeros.
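The truncation and zero-padding described above can be sketched as follows. This is an illustrative sketch, not the environment's code: the function name `build_state`, the argument name `n_rl` for the constant term, the ordering of the five per-AV values, and the choice to keep the first `n_rl` AVs are all assumptions.

```python
def build_state(per_av_obs, n_rl, obs_per_av=5):
    """Concatenate per-AV observations into a fixed-length state vector.

    per_av_obs: list with one observation per AV, each holding
        obs_per_av values (assumed here: lead speed, lead gap,
        follower speed, follower gap, own speed).
    n_rl: constant number of AV slots in the state (hypothetical name).
    """
    state = []
    # Keep at most n_rl AVs; information from any extra AVs is dropped.
    for obs in per_av_obs[:n_rl]:
        if len(obs) != obs_per_av:
            raise ValueError("each observation must have obs_per_av entries")
        state.extend(float(x) for x in obs)
    # Pad with zeros when fewer than n_rl AVs are present.
    state.extend([0.0] * (n_rl * obs_per_av - len(state)))
    return state


# One AV observed, two slots available: the second slot is zero-padded.
print(build_state([[12.0, 30.0, 11.0, 25.0, 11.5]], n_rl=2))
```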

Actions

The actions consist of a list of bounded accelerations, one for each AV, i.e. $a_i \in [a_{\min}, a_{\max}]$ for each AV $i$, where $a_{\min}$ and $a_{\max}$ are the minimum and maximum accelerations, respectively. In the single-agent setting, all actions are provided as an output from a single policy.

Once again, a constant term $n_{\text{RL}}$ is used to handle variable numbers of AVs ($n_{\text{AV}}$) in the single-agent setting. If $n_{\text{AV}} > n_{\text{RL}}$, the extra AVs are treated as human-driven vehicles and their states are updated using human driver models. Moreover, if $n_{\text{AV}} < n_{\text{RL}}$, the extra actions are ignored.
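A minimal sketch of how a fixed-length action vector might be mapped onto a variable number of AVs is given below. The function name `assign_actions` is hypothetical, and enforcing the acceleration bounds by clipping is an assumption (the text only states that the accelerations are bounded).

```python
def assign_actions(actions, n_av, a_min, a_max):
    """Return the accelerations applied to the controlled AVs.

    actions: policy output, one entry per AV slot (fixed length).
    n_av: number of AVs currently in the network.
    Assumption: out-of-range values are clipped to [a_min, a_max].
    """
    # With fewer AVs than slots, the extra actions are ignored. With more
    # AVs than slots, only len(actions) AVs receive an action; the
    # remaining AVs would follow a human driver model.
    used = actions[:n_av]
    return [min(max(a, a_min), a_max) for a in used]


# Three action slots but only two AVs: the third action is ignored.
print(assign_actions([2.0, -3.0, 0.5], n_av=2, a_min=-1.0, a_max=1.0))
```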

Rewards

The reward provided by the environment is equal to the negative vector norm of the distance between the speeds of all vehicles in the network and a desired speed, and is offset by the largest possible negative term to ensure non-negativity if environments terminate prematurely. The exact mathematical formulation of this reward is:

$$r := \| v_{\text{des}} \cdot \mathbb{1}^n \|_2 - \| v - v_{\text{des}} \cdot \mathbb{1}^n \|_2$$

where $v$ is the speed of the individual vehicles, $v_{\text{des}}$ is the desired speed, and $n$ is the number of vehicles in the network.

This reward may additionally include two optional penalties:

- acceleration penalty: If set to True in env_params, the negative of the sum of squares of the accelerations by the AVs is added to the reward.
- stopping penalty: If set to True in env_params, a penalty of -5 is added to the reward for every RL vehicle that is not moving.
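The reward and its two optional penalties can be sketched as below. This is one reading of the description above rather than the environment's actual code: the function and argument names are hypothetical, the offset is taken to be the norm of the desired-speed vector, and "not moving" is assumed to mean a speed at or below `stop_threshold`.

```python
import math


def compute_reward(speeds, v_des, av_accels=(), av_speeds=(),
                   accel_penalty=False, stop_penalty=False,
                   stop_threshold=0.0):
    """Desired-speed reward with optional penalties (illustrative only)."""
    n = len(speeds)
    # Largest possible deviation norm for speeds between 0 and v_des; used
    # as the offset (assumption).
    offset = v_des * math.sqrt(n)
    # L2 norm of the deviation of all vehicle speeds from the desired speed.
    deviation = math.sqrt(sum((v - v_des) ** 2 for v in speeds))
    reward = offset - deviation
    if accel_penalty:
        # Negative of the sum of squared AV accelerations.
        reward -= sum(a ** 2 for a in av_accels)
    if stop_penalty:
        # Penalty of -5 for every AV that is not moving.
        reward -= 5.0 * sum(1 for v in av_speeds if v <= stop_threshold)
    return reward


# All vehicles at the desired speed, one stopped AV, small accelerations.
print(compute_reward([10.0, 10.0], 10.0, av_accels=[1.0, 2.0],
                     av_speeds=[0.0, 3.0],
                     accel_penalty=True, stop_penalty=True))
```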

Networks

We investigate the performance of our algorithms on a variety of network configurations demonstrating diverse traffic instabilities and forms of congestion. These networks are detailed below.

ring

This scenario consists of 50 (if density is fixed) or 50-75 vehicles (5 of which are automated) placed on a single-lane circular track of length 1500 m. In the absence of the automated vehicles, the human-driven vehicles exhibit stop-and-go instabilities brought about by the string-unstable characteristic of human car-following dynamics.

merge

This scenario is adapted from [9]. It consists of a single-lane highway network with an on-ramp used to generate periodic perturbations to sustain congested behavior. In order to model the effect of p% AV penetration on the network, every 100/pth vehicle is replaced with an automated vehicle whose actions are sampled from an RL policy.

highway

This scenario consists of a single-lane highway in which downstream traffic instabilities brought about by an edge with a reduced speed limit generate congestion in the form of stop-and-go waves. In order to model the effect of p% AV penetration on the network, every 100/pth vehicle is replaced with an automated vehicle whose actions are sampled from an RL policy.
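The penetration rule used in the merge and highway networks can be illustrated with a short sketch. The function name `is_automated` and the 1-based counting of vehicles as they enter the network are assumptions, as is rounding 100/p to the nearest integer when it is not a whole number.

```python
def is_automated(vehicle_index, penetration_pct):
    """Return True if the vehicle with this 1-based index is an AV.

    Every (100 / p)-th vehicle is automated, where p is the penetration
    rate in percent. Rounding the period is an assumption.
    """
    if not 0 < penetration_pct <= 100:
        raise ValueError("penetration_pct must be in (0, 100]")
    period = round(100 / penetration_pct)
    return vehicle_index % period == 0


# With 10% penetration, every 10th vehicle is automated.
print([i for i in range(1, 31) if is_automated(i, 10)])
```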

I-210

This scenario is a recreation of a subsection of the I-210 network in Los Angeles, CA. For the moment, the on-ramps and off-ramps are disabled within this network, rendering it similar to a multi-lane variant of the highway network.

4. Citing

To cite this repository in publications, use the following:

@misc{h-baselines,
  author = {Kreidieh, Abdul Rahman},
  title = {Hierarchical Baselines},
  year = {2019},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/AboudyKreidieh/h-baselines}},
}

5. Bibliography

[1] Dayan, Peter, and Geoffrey E. Hinton. "Feudal reinforcement learning." Advances in neural information processing systems. 1993.

[2] Vezhnevets, Alexander Sasha, et al. "Feudal networks for hierarchical reinforcement learning." Proceedings of the 34th International Conference on Machine Learning-Volume 70. JMLR.org, 2017.

[3] Nachum, Ofir, et al. "Data-efficient hierarchical reinforcement learning." Advances in Neural Information Processing Systems. 2018.

[4] Kreidieh, Abdul Rahman, et al. "Inter-Level Cooperation in Hierarchical Reinforcement Learning." arXiv preprint arXiv:1912.02368 (2019).

[5] Levy, Andrew, et al. "Learning Multi-Level Hierarchies with Hindsight." (2018).

[6] Florensa, Carlos, Yan Duan, and Pieter Abbeel. "Stochastic neural networks for hierarchical reinforcement learning." arXiv preprint arXiv:1704.03012 (2017).

[7] Todorov, Emanuel, Tom Erez, and Yuval Tassa. "MuJoCo: A physics engine for model-based control." 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems. IEEE, 2012.

[8] Wu, Cathy, et al. "Flow: A Modular Learning Framework for Autonomy in Traffic." arXiv preprint arXiv:1710.05465 (2017).

[9] Kreidieh, Abdul Rahman, Cathy Wu, and Alexandre M. Bayen. "Dissipating stop-and-go waves in closed and open networks via deep reinforcement learning." 2018 21st International Conference on Intelligent Transportation Systems (ITSC). IEEE, 2018.
