DevOps Trouble Map

Why is so much knowledge about your IT architecture implicit? Why do we need to check what is running during an incident to know about the state of the system? Which components are affected by this Nagios alert? Why does no one ever update the system documentation?

If you care about the questions above, try the "DevOps Trouble Map" (DOTM for short), which

  • doesn't reinvent monitoring, but integrates with Nagios, Icinga & Co.
  • provides automatic layer 4 system architecture charts.
  • maps alerts live into system architecture charts.

Note that the project is pre-alpha right now. Here are some impressions of what the code does so far:

Mapping of Nagios alerts to detected services (note the 2nd column in the alert table):

Alert Mapping

Those Nagios "service check" to "service" mappings are fuzzy logic regular expressions. DOTM brings presets and allows the user to refine them as needed. The fact that those mappings are actually necessary indicates the intrinsic problem of the missing service relation in Nagios, which mixes the concepts of "services" and "service checks". Only with "services" (which we detect based on open TCP ports) we can auto-detect impact.

Service Mapping
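To illustrate the idea, here is a minimal sketch of such a fuzzy mapping. The preset patterns, service names and the helper function are made-up examples, not DOTM's shipped defaults:

    # Illustrative only: patterns and service names are examples, not DOTM presets.
    import re

    # Regular expressions mapping Nagios service check names to detected services.
    CHECK_TO_SERVICE_PRESETS = [
        (re.compile(r'HTTP', re.IGNORECASE), 'apache2'),
        (re.compile(r'SMTP|Mail', re.IGNORECASE), 'postfix'),
        (re.compile(r'MySQL|DB', re.IGNORECASE), 'mysqld'),
    ]

    def map_check_to_service(check_name, detected_services):
        """Return the detected service a Nagios check most likely belongs to."""
        for pattern, service in CHECK_TO_SERVICE_PRESETS:
            if pattern.search(check_name) and service in detected_services:
                return service
        return None  # unmapped checks stay visible, but without impact information

    print(map_check_to_service('HTTP vhost example.com', {'apache2', 'mysqld'}))  # -> apache2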

In addition to the Nagios node and service states, DOTM aggregates the current connection details from the nodes. It remembers old connections so it can show service usage transitions and raise alarms for services that have been unused for a long time or were suddenly disconnected. This helps with typical questions like "do we actually still need this X" and can uncover a wrong firewall configuration.

Connection and Service Tracking
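As a rough sketch of what such an alarm could look like, the snippet below flags remembered connections that have not been seen for a while. The field names follow the connection hash described in the Redis schema below; the threshold and sample data are arbitrary:

    # Illustrative sketch: the threshold and sample data are made up.
    import time

    STALE_AFTER = 14 * 24 * 3600  # e.g. two weeks without any connection

    def stale_connections(connections, now=None):
        """Yield remembered connections whose 'last_connection' is too old."""
        now = now or time.time()
        for conn in connections:
            if now - conn['last_connection'] > STALE_AFTER:
                yield conn

    example = [
        {'process': 'postgres', 'remote_host': 'app01', 'last_connection': time.time() - 30 * 24 * 3600},
        {'process': 'nginx', 'remote_host': 'lb01', 'last_connection': time.time() - 60},
    ]
    for conn in stale_connections(example):
        print('possibly unused: %s <- %s' % (conn['process'], conn['remote_host']))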

Finally, those two sources of information are combined into a very simple "graphical" representation:

Node Graph

In this "node graph" color coding indicates Nagios alerts as well as service states only discovered by DOTM.

Server Installation

The DOTM server has the following dependencies:

  • netcat
  • redis-server
  • MaxMind GeoIP Lite (optional)
  • Python2 modules
    • redis
    • bottle
    • requests
    • GeoIP

To automatically install the server including its dependencies on Debian/Ubuntu, simply run:

scripts/install-server.sh

Agent Installation

The DOTM agent has the following dependencies:

  • glib-2.0
  • libevent2

To automatically install the dotm_node agent, simply run:

scripts/install-agent.sh

Of course, as the agent has to run on all monitored systems, its single binary should be distributed to all nodes using your favourite automation tool.

Software Stack

DOTM will use the following technologies (a small backend sketch follows the overview below):

  • Simple remote agent "dotm_node" (in C using libevent and glib)
  • Redis as backend store
  • Python2 bottle with Jinja templating
  • JSON backend data access
  • any jQuery library for rendering

Architecture Overview
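As a rough idea of how these pieces fit together, here is a minimal bottle route serving a node's Redis hash as JSON. The route path and response shape are illustrative, not DOTM's actual API:

    # Illustrative sketch: the route path and response shape are examples only.
    import json

    import redis
    from bottle import Bottle, response

    app = Bottle()
    rdb = redis.StrictRedis(host='localhost', port=6379, db=0, decode_responses=True)

    @app.route('/json/node/<name>')
    def node_details(name):
        """Return the per-node hash (last_fetch, fetch_status, ips, ...) as JSON."""
        response.content_type = 'application/json'
        return json.dumps(rdb.hgetall('dotm::nodes::%s' % name))

    if __name__ == '__main__':
        app.run(host='localhost', port=8080)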

Redis Data Schema

So far the following relations are probably needed:

Entity Overview

Right now the following relation namespaces are used in Redis (a small write/read sketch follows the list):

  • dotm::nodes (list of node names, resolvable via the local resolver and identical to the remote hostname)
  • dotm::nodes::<node name>
    • 'last_fetch' => <timestamp>
    • 'fetch_status' => <'OK' or error message>
    • 'ips' => <comma separated list of IPs>
    • 'service_alerts' => <hash of service name - status tuples>
  • dotm::connections::<node name>::<port>::<remote node/IP> (hash):
    • 'process' => <string>
    • 'connections' => <int>
    • 'last_connection' => <timestamp>
    • 'last_seen' => <timestamp>
    • 'direction' => <in/out>
    • 'remote_host' => <IP or node name>
    • 'remote_port' => <port number or 'high'>
    • 'local_port' => <port number>
  • dotm::services::<node name>::<port> (hash with the following key values):
    • 'process' => <string>
    • 'last_seen' => <timestamp>
  • dotm::resolver::ip_to_node::<IP> (string, <node name>)
  • dotm::checks::nodes::<node name> (key with set expire):
    • JSON containing basic status information: { "node": "hostname01", "status": "UP", "last_check": <timestamp>, "last_status_change": <timestamp>, "status_information": "hostname01 status information" }
  • dotm::checks::services::<node name> (list of service JSONs with set expire):
    • List containing all associated node checks: [ { "node": "hostname01", "service": "Service01 name", "status": "OK", "last_check": <timestamp>, "last_status_change": <timestamp>, "status_information": "Service01 status information" }, { "node": "hostname01", "service": "Service02 name", "status": "CRITICAL", "last_check": <timestamp>, "last_status_change": <timestamp>, "status_information": "Service02 status information" } ]
  • dotm::config::* (all preferences, for descriptions check the 'Settings' page)
  • dotm::state (state info, usually update locks + timestamps)
    • last_updated (timestamp)
    • update_running (0 or 1)
    • last_snapshot (timestamp)
  • dotm::queue (list of queued backend tasks in JSON)
    • {"id": <task key>, "fn": <function name/action>, "args": <function arguments>, "kwargs": <function keywords>}
  • dotm::queue::result::<uuid4 name> (status and result of the queued task in JSON)
    • {"status": <pending/processing/ready>, "result": <result in JSON>}
  • dotm::history (list of history <timestamps>)
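The following sketch writes a few entries that follow this schema; the hostnames, IPs and values are made-up examples:

    # Illustrative sketch: hostnames, IPs and values are made-up examples.
    import time

    import redis

    rdb = redis.StrictRedis(host='localhost', port=6379, db=0, decode_responses=True)
    now = int(time.time())

    # Register the node and store its fetch status
    # (a real update run would avoid pushing duplicate node names).
    rdb.rpush('dotm::nodes', 'web01')
    rdb.hmset('dotm::nodes::web01', {
        'last_fetch': now,
        'fetch_status': 'OK',
        'ips': '192.0.2.10,10.0.0.10',
    })

    # One observed incoming connection on web01:80 from lb01.
    rdb.hmset('dotm::connections::web01::80::lb01', {
        'process': 'apache2',
        'connections': 12,
        'last_connection': now,
        'last_seen': now,
        'direction': 'in',
        'remote_host': 'lb01',
        'remote_port': 'high',
        'local_port': 80,
    })

    # The service behind that connection and the reverse IP lookup.
    rdb.hmset('dotm::services::web01::80', {'process': 'apache2', 'last_seen': now})
    rdb.set('dotm::resolver::ip_to_node::192.0.2.10', 'web01')

    print(rdb.hgetall('dotm::nodes::web01'))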
