onsails/datadog_nvml

datadog_nvml

Monitoring NVIDIA GPU status using Datadog

This is an Agent Check script for monitoring the status of NVIDIA GPUs with Datadog. It uses the nvidia-ml-py module.

Currently monitored items

The following items are currently collected for each GPU.

Metrics

  • nvml.util.gpu: Percent of time over the past sample period during which one or more kernels was executing on the GPU.
  • nvml.util.memory: Percent of time over the past sample period during which global (device) memory was being read or written.
  • nvml.mem.total: total memory
  • nvml.mem.used: memory in use
  • nvml.mem.free: free memory
  • nvml.temp: temperature
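
As a rough illustration of where these values can come from, the sketch below maps readings from pynvml (the module installed by nvidia-ml-py) onto the metric names above. It is not the contents of nvml.py: the helper names `build_metrics` and `collect_all` and the two namedtuple stand-ins are hypothetical. NVML itself reports memory in bytes and temperature in degrees Celsius; whether nvml.py rescales them is not stated here.

```python
from collections import namedtuple

# Hypothetical stand-ins with the same field names as the structs pynvml
# returns, so the mapping can be exercised on a machine without a GPU.
Utilization = namedtuple("Utilization", ["gpu", "memory"])
MemoryInfo = namedtuple("MemoryInfo", ["total", "used", "free"])


def build_metrics(util, mem, temp):
    """Map raw NVML readings onto the metric names listed above."""
    return {
        "nvml.util.gpu": util.gpu,
        "nvml.util.memory": util.memory,
        "nvml.mem.total": mem.total,
        "nvml.mem.used": mem.used,
        "nvml.mem.free": mem.free,
        "nvml.temp": temp,
    }


def collect_all():
    """Query every GPU through pynvml (needs an NVIDIA driver)."""
    import pynvml  # provided by the nvidia-ml-py package

    pynvml.nvmlInit()
    try:
        results = []
        for index in range(pynvml.nvmlDeviceGetCount()):
            handle = pynvml.nvmlDeviceGetHandleByIndex(index)
            util = pynvml.nvmlDeviceGetUtilizationRates(handle)
            mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
            temp = pynvml.nvmlDeviceGetTemperature(
                handle, pynvml.NVML_TEMPERATURE_GPU)
            results.append(build_metrics(util, mem, temp))
        return results
    finally:
        # Always release the NVML handle, even if a query fails.
        pynvml.nvmlShutdown()
```

Keeping the mapping in `build_metrics` separate from the NVML calls makes it possible to check the metric names with fake readings before touching real hardware.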

Tags

  • name: GPU name (e.g. GEFORCE_GTX_660)
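
The example value suggests the device name is upper-cased and its spaces replaced with underscores. A minimal sketch of such a normalization follows. The function `format_name_tag` is hypothetical, and the exact rule used by nvml.py is an assumption inferred only from the example above.

```python
def format_name_tag(raw_name):
    """Turn an NVML device name into a `name:` tag (assumed format)."""
    # Older pynvml releases return the device name as bytes.
    if isinstance(raw_name, bytes):
        raw_name = raw_name.decode("utf-8")
    # Upper-case the name and join whitespace-separated parts with "_".
    return "name:" + "_".join(raw_name.upper().split())


print(format_name_tag("GeForce GTX 660"))  # name:GEFORCE_GTX_660
```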

REQUIRES

The nvidia-ml-py module is required.

$ sudo /opt/datadog-agent/embedded/bin/pip install nvidia-ml-py
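
To confirm that the install landed in the interpreter the Agent uses, a small import probe can help. The sketch below is only an illustration: `module_available` is a hypothetical helper, and running it with the Agent's embedded Python (presumably located next to the embedded pip shown above) is an assumption.

```python
def module_available(module_name):
    """Return True if `module_name` can be imported by this interpreter."""
    try:
        __import__(module_name)
    except ImportError:
        return False
    return True


# nvidia-ml-py installs a module named `pynvml`.
print("pynvml available:", module_available("pynvml"))
```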

SETUP

Copy the two files into the checks.d and conf.d directories under /etc/dd-agent.

  • nvml.py: /etc/dd-agent/checks.d
  • nvml.yaml.default: /etc/dd-agent/conf.d

$ git clone https://github.com/ngi644/datadog_nvml.git
$ cd datadog_nvml
$ sudo cp nvml.py /etc/dd-agent/checks.d
$ sudo cp nvml.yaml.default /etc/dd-agent/conf.d
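
For orientation, a file in checks.d generally defines a subclass of the Agent's `AgentCheck` with a `check` method that submits metrics, while the matching file in conf.d supplies its configuration. The sketch below shows that shape only. The class `ExampleGpuCheck`, its `read_gpus` method, and the fixed reading are hypothetical and are not taken from nvml.py. The fallback base class exists solely so the sketch runs outside the Agent.

```python
try:
    from checks import AgentCheck  # base class shipped with Datadog Agent v5
except ImportError:
    class AgentCheck(object):
        """Minimal stand-in so the sketch runs outside the Agent."""

        def __init__(self, *args, **kwargs):
            self.submitted = []

        def gauge(self, name, value, tags=None):
            # Record the submission instead of sending it to Datadog.
            self.submitted.append((name, value, tags or []))


class ExampleGpuCheck(AgentCheck):
    """Hypothetical check that reports one gauge per GPU."""

    def read_gpus(self):
        # Placeholder reading; a real check would query NVML here.
        return [{"name": "GEFORCE_GTX_660", "temp": 61}]

    def check(self, instance):
        # The Agent calls check() once per configured instance.
        for reading in self.read_gpus():
            tags = ["name:" + reading["name"]]
            self.gauge("nvml.temp", reading["temp"], tags=tags)
```

Outside the Agent, calling `ExampleGpuCheck().check({})` records the gauge on the stand-in's `submitted` list, which is a convenient way to inspect metric names and tags before deploying.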

Restart the Datadog Agent.

$ sudo service datadog-agent restart
