ALL THE CODEZ
On AWS:
- Create a new VM with Ubuntu 18.04.
- Assign it a new Elastic IP.
Locally:
- Update ~/.ssh/config with a new entry:
Host <name>
HostName <ip address/url>
User ubuntu
IdentityFile ~/.ssh/aws-macbookpro.pem # or wherever.
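The entry above can also be appended from a script. This is a minimal sketch, assuming a hypothetical helper named add_ssh_host (it is not part of OpenSSH); the alias, address, and key path in the usage comment are placeholders.

```shell
# Hypothetical helper: append a Host entry to an SSH config file.
# Usage: add_ssh_host <config-file> <alias> <address> <identity-file>
add_ssh_host() {
  config="$1"; alias_name="$2"; address="$3"; key="$4"
  # Do nothing if an entry for this alias already exists.
  if [ -f "$config" ] && grep -qxF "Host ${alias_name}" "$config"; then
    echo "Host ${alias_name} already present in ${config}" >&2
    return 0
  fi
  mkdir -p "$(dirname "$config")"
  printf 'Host %s\n  HostName %s\n  User ubuntu\n  IdentityFile %s\n' \
    "$alias_name" "$address" "$key" >> "$config"
}

# Example (placeholder values):
# add_ssh_host ~/.ssh/config my-aws-box 203.0.113.10 ~/.ssh/aws-macbookpro.pem
```

Afterwards `ssh my-aws-box` should pick up the user and key from the config.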
On the machine:
First,
sudo apt update
sudo apt upgrade
sudo reboot
- Set the hostname.
- Install nuvemfs.
- Install mujoco.
- Install linuxbrew and pipenv.
On AWS Ubuntu 18.04,
user$ sudo su
root$ hostnamectl set-hostname <whatever>
sudo apt install -y cifs-utils
wget https://nuvemfscliassets.blob.core.windows.net/nuvemfs-cli-assets/stable/nuvemfs-cli-x86_64-unknown-linux-musl
chmod +x nuvemfs-cli-x86_64-unknown-linux-musl
echo "alias nuvemfs=\"~/nuvemfs-cli-x86_64-unknown-linux-musl\"" >> ~/.profile
source ~/.profile
- Download and install Mujoco.
sudo apt install -y unzip clang
wget https://www.roboti.us/download/mujoco200_linux.zip
unzip mujoco200_linux.zip
mkdir ~/.mujoco
mv mujoco200_linux ~/.mujoco/mujoco200
rm mujoco200_linux.zip
echo "export LD_LIBRARY_PATH=\$LD_LIBRARY_PATH:/home/ubuntu/.mujoco/mujoco200/bin" >> ~/.profile
source ~/.profile
- Install dependencies
# libosmesa6-dev: Fixes `fatal error: GL/osmesa.h: No such file or directory`
# libglew-dev: Fixes `/usr/bin/ld: cannot find -lGL`
# ffmpeg: Necessary for mujoco videos.
sudo apt install -y libosmesa6-dev libglew-dev ffmpeg
# These seem to be only necessary on circleci/python:
# patchelf: Fixes `No such file or directory: 'patchelf'`.
# libglfw3-dev: Fixes `ImportError: Failed to load GLFW3 shared library.`.
sudo apt install -y patchelf libglfw3-dev
# These are required for slycot which is required by control...
sudo apt install -y gfortran libblas-dev liblapack-dev
Either clang will need to be set as the default cc alternative (sudo update-alternatives --config cc) or you'll need to use gcc version 8. If you follow these instructions exactly (without ever installing build-essential) then it should work no problemo.
Logging in/out to fix $PATH may also be necessary.
See
- Put the license key at ~/.mujoco/mjkey.txt.
cp ~/nu/skainswo/mjkey.txt ~/.mujoco/mjkey.txt
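Once the steps above are done, a quick check can confirm the pieces are in place. This is a rough sketch, assuming a hypothetical helper named check_mujoco that takes the MuJoCo root and a library path as arguments; for the ubuntu user, ~/.mujoco is the same directory as the /home/ubuntu/.mujoco path used above.

```shell
# Hypothetical helper: report missing pieces of a MuJoCo 2.0 setup.
# Usage: check_mujoco <mujoco-root> <library-path>
check_mujoco() {
  root="$1"; libpath="$2"; rc=0
  [ -d "$root/mujoco200/bin" ] || { echo "missing: $root/mujoco200/bin"; rc=1; }
  [ -f "$root/mjkey.txt" ] || { echo "missing: $root/mjkey.txt"; rc=1; }
  # Wrap both sides in colons so only whole path entries match.
  case ":$libpath:" in
    *":$root/mujoco200/bin:"*) ;;
    *) echo "not on LD_LIBRARY_PATH: $root/mujoco200/bin"; rc=1 ;;
  esac
  return "$rc"
}

# Example:
# check_mujoco "$HOME/.mujoco" "$LD_LIBRARY_PATH" && echo "MuJoCo setup looks complete"
```

If the last check fails right after editing ~/.profile, logging out and back in is usually enough.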
See https://docs.brew.sh/Homebrew-on-Linux.
# See https://stackoverflow.com/questions/24426424/unattended-no-prompt-homebrew-installation-using-expect.
echo | sh -c "$(curl -fsSL https://raw.githubusercontent.com/Linuxbrew/install/master/install.sh)"
echo 'eval $(/home/linuxbrew/.linuxbrew/bin/brew shellenv)' >> ~/.profile
source ~/.profile
brew install pipenv
The nvidia-driver-430 and nvidia-cuda-toolkit packages on Ubuntu 18.04 install CUDA 9.1, which is not supported by JAX at the moment.
- Remove any current installation.
# Quote the patterns so the shell does not expand them against local files.
sudo apt-get purge '*cuda*'
sudo apt-get purge '*nvidia*'
sudo apt-get purge '*cudnn*'
and then follow the runfile uninstall steps (https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#runfile-uninstallation).
- Make sure that gcc is the current cc alternative:
sudo update-alternatives --config cc
cc --version
(This was necessary for CUDA 10.1. May not be necessary for 10.0.)
- Follow the CUDA installation instructions for the "runfile (local)" version. Install version 10.0, since TF and PyTorch do not yet support 10.1.
- Add
export PATH=/usr/local/cuda-10.0/bin${PATH:+:${PATH}}
export LD_LIBRARY_PATH=/usr/local/cuda-10.0/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
to ~/.profile.
- Download the "cuDNN Library for Linux" (https://developer.nvidia.com/rdp/cudnn-download), not the deb version. You'll need to be logged in for the downloads to work, so wget/curl alone isn't sufficient. It's easiest to download the files locally and then scp them to the remote machine.
- Install cuDNN (https://docs.nvidia.com/deeplearning/sdk/cudnn-install/index.html#installlinux-tar), but note that the CUDA installation directory is /usr/local/cuda-10.0, not /usr/local/cuda.
- Reboot.
- Follow the pip instructions here (https://github.com/google/jax#pip-installation) in a pipenv shell to install the new GPU versions of jax/jaxlib.
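The `${PATH:+:${PATH}}` form in the export lines above may look odd at first. The sketch below shows what it does, using a hypothetical helper named prepend_path that exists only for illustration.

```shell
# Hypothetical helper: prepend a directory to a colon-separated list.
# ${2:+:$2} expands to ":<list>" only when the list is non-empty, so an
# empty list does not gain a trailing colon (an empty entry means the
# current directory).
prepend_path() {
  printf '%s\n' "$1${2:+:$2}"
}

prepend_path /usr/local/cuda-10.0/bin ""          # /usr/local/cuda-10.0/bin
prepend_path /usr/local/cuda-10.0/bin /usr/bin    # /usr/local/cuda-10.0/bin:/usr/bin
```

This is why the exports stay safe even when LD_LIBRARY_PATH is unset, which is common on a fresh machine.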
See
- https://developer.nvidia.com/cuda-zone
- https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/install-nvidia-driver.html
- https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/optimize_gpu.html
- https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#faq2
- https://stackoverflow.com/questions/50622525/which-tensorflow-and-cuda-version-combinations-are-compatible
- https://discuss.pytorch.org/t/when-pytorch-supports-cuda-10-1/38852
Note that the deb installation does not seem to support multiple CUDA installations living in harmony. This may become problematic as some packages like pytorch do not yet support CUDA 10.1.
With CUDA 10.0, JAX may require the xla_gpu_cuda_data_dir XLA flag to be set as well:
XLA_FLAGS=--xla_gpu_cuda_data_dir=/usr/local/cuda-10.0/
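If XLA_FLAGS already carries other flags, assigning it outright would drop them. One possible way to add the flag without clobbering the rest is sketched below; append_xla_flag is a hypothetical helper, and the sketch assumes XLA_FLAGS is a space-separated list of flags.

```shell
# Hypothetical helper: add a flag to an XLA_FLAGS-style string,
# keeping any flags that are already there.
append_xla_flag() {
  current="$1"; flag="$2"
  case " $current " in
    *" $flag "*) printf '%s\n' "$current" ;;              # already set, leave unchanged
    *) printf '%s\n' "${current:+$current }$flag" ;;      # append, space-separated
  esac
}

XLA_FLAGS="$(append_xla_flag "${XLA_FLAGS:-}" "--xla_gpu_cuda_data_dir=/usr/local/cuda-10.0/")"
export XLA_FLAGS
```

The last two lines can go in ~/.profile next to the CUDA exports.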
No downtime is necessary.
- Change the volume in the console.
- Then follow https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/recognize-expanded-volume-linux.html. Use df -T to get the filesystem type.
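The filesystem type matters because ext and XFS filesystems are grown with different tools. A rough sketch of that choice follows, assuming a hypothetical helper named grow_fs_command; the device name /dev/xvda1 in the usage comment is a placeholder and will differ between instance types. The partition itself may first need extending with growpart, as the linked page describes.

```shell
# Hypothetical helper: print the command that grows a filesystem of the
# given type. Device and mount point are supplied by the caller.
grow_fs_command() {
  fstype="$1"; device="$2"; mountpoint="$3"
  case "$fstype" in
    ext2|ext3|ext4) echo "sudo resize2fs $device" ;;
    xfs)            echo "sudo xfs_growfs -d $mountpoint" ;;
    *)              echo "unsupported filesystem type: $fstype" >&2; return 1 ;;
  esac
}

# Example: look up the type of the root filesystem, then print the command.
# fstype="$(df -T / | awk 'NR==2 {print $2}')"
# grow_fs_command "$fstype" /dev/xvda1 /
```

The helper only prints the command, so it can be reviewed before anything is run.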
Sometimes the mujoco lockfile gets screwed up, and in that case it's necessary to delete it. If jobs are just hanging forever without starting, try deleting the lockfile:
rm $(pipenv --venv)/lib/python3.7/site-packages/mujoco_py/generated/mujocopy-buildlock.lock
See openai/mujoco-py#424.
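The path above hardcodes python3.7. A small sketch that finds the lockfile under any Python version directory is shown here, assuming a hypothetical helper named remove_mujoco_locks and the lockfile name used above.

```shell
# Hypothetical helper: delete mujoco-py build lockfiles under a virtualenv,
# whatever the Python version directory is called.
# Prints each file it removes.
remove_mujoco_locks() {
  venv="$1"
  find "$venv" -type f -name 'mujocopy-buildlock.lock' -print -delete
}

# Example:
# remove_mujoco_locks "$(pipenv --venv)"
```

Only run this when no mujoco-py build is actually in progress, since the lockfile exists to guard that build.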