angeloskyratzakos/datadog_nvml

datadog_nvml

Monitor the status of NVIDIA GPUs with Datadog.

A Datadog Agent check captures and sends the metrics.

The nvidia-ml-py Python module serves as the API interface to NVML.

Currently Supported Monitoring

The check collects the following items for each GPU.

Metrics

  • nvml.util.gpu: Percent of time over the past sample period during which one or more kernels was executing on the GPU.
  • nvml.util.memory: Percent of time over the past sample period during which global (device) memory was being read or written.
  • nvml.util.decode: Percent utilization of the hardware decoder (NVDEC) over the last sample period (*)
  • nvml.util.encode: Percent utilization of the hardware encoder (NVENC) over the last sample period (*)
  • nvml.mem.total: Total Memory
  • nvml.mem.used: Used Memory
  • nvml.mem.free: Free Memory
  • nvml.temp: Temperature

(*) HW accelerated encode and decode are supported on NVIDIA GeForce, Quadro, Tesla, and GRID products with Fermi, Kepler, Maxwell and Pascal generation GPUs.
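
The metric names above map naturally onto calls in the pynvml module that nvidia-ml-py installs. The following is a minimal sketch of that mapping, not the code in this repository's nvml.py. The function name `collect_gpu_metrics` is hypothetical, and passing the module in as a parameter is a choice made here so the logic can be exercised without a GPU.

```python
def collect_gpu_metrics(nvml):
    """Return one dict of metric name -> value for each GPU.

    `nvml` is the pynvml module, or any object exposing the same functions.
    """
    nvml.nvmlInit()
    try:
        results = []
        for index in range(nvml.nvmlDeviceGetCount()):
            handle = nvml.nvmlDeviceGetHandleByIndex(index)
            util = nvml.nvmlDeviceGetUtilizationRates(handle)
            mem = nvml.nvmlDeviceGetMemoryInfo(handle)
            metrics = {
                "nvml.util.gpu": util.gpu,        # percent
                "nvml.util.memory": util.memory,  # percent
                "nvml.mem.total": mem.total,      # as reported by NVML (bytes)
                "nvml.mem.used": mem.used,
                "nvml.mem.free": mem.free,
                "nvml.temp": nvml.nvmlDeviceGetTemperature(
                    handle, nvml.NVML_TEMPERATURE_GPU
                ),
            }
            # The encoder and decoder calls return [utilization, sampling period].
            # GPUs without NVENC/NVDEC support raise NVMLError, so those
            # metrics are simply left out for such devices.
            optional = (
                ("nvml.util.encode", nvml.nvmlDeviceGetEncoderUtilization),
                ("nvml.util.decode", nvml.nvmlDeviceGetDecoderUtilization),
            )
            for key, getter in optional:
                try:
                    metrics[key] = getter(handle)[0]
                except nvml.NVMLError:
                    pass
            results.append(metrics)
        return results
    finally:
        # Always release NVML, even if a query above failed.
        nvml.nvmlShutdown()

# Usage on a host with the NVIDIA driver and nvidia-ml-py installed:
#   import pynvml
#   print(collect_gpu_metrics(pynvml))
```

Skipping the encode and decode metrics on error, rather than failing the whole run, is one way to handle cards outside the generations listed in the note above.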

Tags

  • name: GPU name (for example, GEFORCE_GTX_660)
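
The README shows only the resulting tag value, not how it is derived. As an illustration, a tag of that shape could be produced from the device name NVML reports by upper-casing it and joining the words with underscores. The helper `gpu_name_tag` and this normalization rule are assumptions, not taken from nvml.py.

```python
def gpu_name_tag(raw_name):
    """Turn a device name such as 'GeForce GTX 660' into a 'name:...' tag."""
    # Depending on the Python and pynvml versions, the name may arrive as bytes.
    if isinstance(raw_name, bytes):
        raw_name = raw_name.decode("utf-8")
    # Assumed rule: upper-case, with runs of whitespace replaced by underscores.
    return "name:" + "_".join(raw_name.upper().split())

print(gpu_name_tag("GeForce GTX 660"))  # name:GEFORCE_GTX_660
```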

REQUIRES

nvidia-ml-py (v7.352.0)

$ sudo /opt/datadog-agent/embedded/bin/pip install nvidia-ml-py==7.352.0

Check that it was installed correctly:

# /opt/datadog-agent/embedded/bin/pip show nvidia-ml-py
Name: nvidia-ml-py
Version: 7.352.0
Summary: Python Bindings for the NVIDIA Management Library
Home-page: http://www.nvidia.com/
Author: NVIDIA Corporation
Author-email: nvml-bindings@nvidia.com
License: BSD
Location: /opt/datadog-agent/embedded/lib/python2.7/site-packages
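
`pip show` confirms only that the package is present. A complementary check is whether the module imports and NVML initializes under the Agent's Python. The sketch below assumes it is run with the interpreter that sits next to the embedded pip (presumably /opt/datadog-agent/embedded/bin/python); the function name `nvml_status` is hypothetical.

```python
import importlib

def nvml_status(import_module=importlib.import_module):
    """Return a one-line report on whether pynvml and NVML are usable."""
    try:
        nvml = import_module("pynvml")  # module installed by nvidia-ml-py
    except ImportError as exc:
        return "nvidia-ml-py is not importable: %s" % exc
    try:
        nvml.nvmlInit()
    except nvml.NVMLError as exc:
        # Typically a missing or mismatched NVIDIA driver.
        return "pynvml imported, but NVML failed to initialize: %s" % exc
    try:
        driver = nvml.nvmlSystemGetDriverVersion()
        count = nvml.nvmlDeviceGetCount()
    finally:
        nvml.nvmlShutdown()
    if isinstance(driver, bytes):
        driver = driver.decode("utf-8")
    return "OK: driver %s, %d GPU(s)" % (driver, count)

print(nvml_status())
```

The `import_module` parameter exists only so the function can be tried with a stand-in module on a machine without a GPU.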

SETUP

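Copy the two files into the checks.d and conf.d directories of the Agent configuration directory. This is /etc/dd-agent on Agent v5 and /etc/datadog-agent on Agent v6 and later; the commands below use /etc/datadog-agent, so adjust the path for your Agent version.

  • nvml.py: checks.d
  • nvml.yaml.default: conf.d
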
$ git clone https://github.com/ngi644/datadog_nvml.git
$ cd datadog_nvml
$ sudo cp nvml.py /etc/datadog-agent/checks.d
$ sudo cp nvml.yaml.default /etc/datadog-agent/conf.d
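
For orientation, a file placed in checks.d is a Python module that defines a subclass of `AgentCheck` and reports values through methods such as `gauge`. The skeleton below is a generic sketch of that shape, not the contents of this repository's nvml.py. The class `ExampleGpuCheck`, its `read_gpus` helper, and the placeholder reading are hypothetical, and the final fallback class exists only so the file can be run outside the Agent.

```python
try:
    from datadog_checks.base import AgentCheck  # Agent v6 and later
except ImportError:
    try:
        from checks import AgentCheck  # Agent v5
    except ImportError:
        class AgentCheck(object):
            """Stand-in used only when running outside the Agent."""
            def __init__(self, *args, **kwargs):
                self.submitted = []

            def gauge(self, name, value, tags=None):
                self.submitted.append((name, value, tags))


class ExampleGpuCheck(AgentCheck):
    def read_gpus(self):
        # Hypothetical placeholder; a real check would query NVML here.
        return [{"name": "GEFORCE_GTX_660", "temp": 45}]

    def check(self, instance):
        # The Agent calls check() once per configured instance on every run.
        for gpu in self.read_gpus():
            tags = ["name:" + gpu["name"]] + list(instance.get("tags", []))
            self.gauge("nvml.temp", gpu["temp"], tags=tags)
```

The Agent pairs a check with its configuration by file name, which is why nvml.py and nvml.yaml.default share the `nvml` prefix.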

Restart the Datadog Agent so that it compiles the Python source and loads the new check.

$ sudo service datadog-agent restart

Verify that the check was loaded correctly:

$ sudo service datadog-agent info
Checks
  ======
...
    nvml (5.14.1)

      - instance #0 [OK]
      - Collected 16 metrics, 0 events & 1 service check
...

Tested

Tested on the AWS EC2 G2 family (g2.2xlarge), which includes one NVIDIA GRID K520 card. Also tested on a bare-metal Supermicro server with NVIDIA Tesla P4 and P40 cards.
