Model-free approach to solving the Rubik’s Cube

This is an experimental repository for the project of solving the Rubik's Cube with sparse rewards, without built-in knowledge and without an environment model. For classic RL algorithms such as DQN this is a virtually impossible task: regardless of the exploration quality, the agent cannot reach the solved state from a randomly scrambled cube and thus observes no positive reward at all. However, a successful policy can be trained using the Hindsight Experience Replay (HER) technique. The agent trained this way is capable of solving all cubes scrambled with several moves, despite observing no successful trajectory during the whole training procedure.
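As a rough illustration of the core idea (a sketch only, not the code from this repository), hindsight relabeling of an episode can look as follows; the "future" goal-sampling strategy and the 0/1 reward convention are assumptions made for this example:

import random

def her_relabel(episode, k=4):
    # Illustrative sketch of hindsight relabeling ("future" strategy);
    # NOT the code from episode_replay_buffer.py, just the idea behind HER.
    # `episode` is a list of (state, action, next_state, goal) tuples,
    # with states assumed to be directly comparable (e.g. tuples).
    relabeled = []
    for t, (state, action, next_state, goal) in enumerate(episode):
        # Original transition: sparse reward, positive only if the true goal was reached.
        relabeled.append((state, action, next_state, goal, float(next_state == goal)))
        # Extra copies: pretend a state reached later in the same episode was
        # the goal all along, so the agent does observe rewarded transitions.
        for _ in range(k):
            future_goal = random.choice(episode[t:])[2]
            relabeled.append((state, action, next_state, future_goal, float(next_state == future_goal)))
    return relabeled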

Project structure

This repository allows running the HER algorithm on the Rubik's Cube environment with sparse rewards. Additionally, it provides two benchmark goal-oriented environments - a simple grid maze, in which the agent has to reach a target field, and the BitFlipper environment, in which the agent can flip the values of n bits and has to reach a target configuration. The latter was proposed by the authors of HER and used as a motivating example for their work.
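For intuition, the BitFlipper task can be captured in a few lines; this is a simplified sketch of the task, not the environment implementation used in this repository:

import numpy as np

class BitFlipperSketch:
    # Simplified illustration of the BitFlipper task, not the repository's
    # environment code: n bits, action i flips bit i, and the reward is
    # sparse (positive only when the state matches the goal).
    def __init__(self, n=10):
        self.n = n

    def reset(self):
        self.state = np.random.randint(2, size=self.n)
        self.goal = np.random.randint(2, size=self.n)
        return self.state.copy(), self.goal.copy()

    def step(self, action):
        self.state[action] ^= 1
        done = bool((self.state == self.goal).all())
        reward = 1.0 if done else 0.0
        return self.state.copy(), reward, done, {}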

All the sources are located in the src/ directory. The most important files are:

  • run.py - the main function, which selects the experiment to run and specifies its complexity.
  • learning_configurations.py - specifies all the training parameters, including its length, learning rate, exploration, etc.
  • episode_replay_buffer.py - the replay buffer, which, apart from storing experience, manages goal assignment for replays.
  • DQN_HER.py - implementation of the main loop of the training algorithm. Its base, the vanilla DQN algorithm, was adapted from the stable-baselines repository; at the time this project was created, the HER technique was not yet implemented in stable-baselines. The changes mostly concern adapting to goal-oriented observations, since goal assignment is handled by the replay buffer (see the sketch after this list).
  • utility.py - performance evaluators, training metrics and helper functions.
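For orientation, the overall flow implemented across these files can be sketched roughly as below; the q_net, buffer and env interfaces are hypothetical placeholders for this illustration and do not reproduce the actual APIs of DQN_HER.py or episode_replay_buffer.py:

import numpy as np

def train_dqn_her(env, q_net, buffer, episodes, batch_size=64, gamma=0.99, epsilon=0.1):
    # Rough sketch of a DQN+HER loop; all interfaces here are hypothetical.
    for _ in range(episodes):
        obs, goal = env.reset()                      # goal-oriented env also returns the target
        episode, done = [], False
        while not done:
            # Goal-oriented observation: concatenate the state with the goal.
            q_input = np.concatenate([np.ravel(obs), np.ravel(goal)])
            if np.random.rand() < epsilon:
                action = env.sample_action()         # random exploration
            else:
                action = int(np.argmax(q_net.predict(q_input)))
            next_obs, reward, done, _ = env.step(action)
            episode.append((obs, action, reward, next_obs, goal))
            obs = next_obs
        buffer.store_episode(episode)                # the buffer relabels goals in hindsight
        q_net.update(buffer.sample(batch_size), gamma)  # standard DQN TD update on the batch
    return q_net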

Having set the desired experiment in run.py, run it with the command

./scripts/run_local.sh ANONYMOUS "test"

This script prepares a virtual environment and runs the selected experiment locally. The results can be sent to Neptune for visualization.

Experimental results

The agent trained to solve the Rubik's Cube starts learning quickly and after about 140000 episodes it successfully solves over 80% of cubes scrambled with 8 random moves. Charts presenting the detailed performance of this agent during this stage of training can be found here. Though the progress slows down with time, after 1500000 training episodes its performance is the following:

[Chart: success rate]

The agent successfully solves moderately scrambled cubes, but still solves only a few completely random instances. Detailed results can be found here. Note that this level of performance is reached by an agent that observed hardly any successful episodes during training.

In the case of the benchmarks, the proposed implementation of HER easily reaches an almost perfect success rate:

[Charts: benchmark success rates]

Note that the last example is far more difficult than the one shown in the HER paper, though the authors did not aim to optimize this particular task.

Code ownership

The main training loop and the replay buffer are based on the DQN implementation from hill-a/stable-baselines. The Rubik's Cube environment is taken from do-not-be-hasty/gym-rubik, which was forked from yoavain/gym-rubik. The BitFlipper environment is taken from do-not-be-hasty/BitFlipper, which is a fork of JoyChopra1298/BitFlipper. The maze environment is taken from do-not-be-hasty/mazelab, a fork of zuoxingdong/mazelab. The remaining relevant parts of the code were developed by the author; the most important are: the modifications of the DQN training loop for using hindsight in DQN_HER.py, the replay buffer in episode_replay_buffer.py, the performance evaluators and other metrics in utility.py, the experiment configurations in learning_configurations.py, the Q-network architectures in networks.py, and others. My total contribution to this project amounts to about 2000 lines of code.
