DevOps Trouble Map

Why is so much knowledge about your IT architecture implicit? Why do we need to check what is running during an incident to know about the state of the system? Which components are affected by this Nagios alert? Why does no one ever update the system documentation?

If you care about the questions above, try the "DevOps Trouble Map" (DOTM for short), which

  • doesn't reinvent monitoring, but integrates with Nagios, Icinga & Co.
  • provides automatic layer 4 system architecture charts.
  • maps alerts live into system architecture charts.

Note that the project is pre-alpha right now. Here are some impressions of what the code does so far:

Mapping of Nagios alerts to detected services (note the 2nd column in the alert table):

Alert Mapping

Those Nagios "service check" to "service" mappings are fuzzy logic regular expressions. DOTM brings presets and allows the user to refine them as needed. The fact that those mappings are actually necessary indicates the intrinsic problem of the missing service relation in Nagios, which mixes the concepts of "services" and "service checks". Only with "services" (which we detect based on open TCP ports) we can auto-detect impact.

Service Mapping
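To illustrate the idea, here is a minimal sketch of such a fuzzy mapping. The preset patterns, service names and the helper function are made-up examples, not DOTM's shipped defaults:

    # Illustrative only: patterns and service names are examples, not DOTM presets.
    import re

    # Regular expressions mapping Nagios service check names to detected services.
    CHECK_TO_SERVICE_PRESETS = [
        (re.compile(r'HTTP', re.IGNORECASE), 'apache2'),
        (re.compile(r'SMTP|Mail', re.IGNORECASE), 'postfix'),
        (re.compile(r'MySQL|DB', re.IGNORECASE), 'mysqld'),
    ]

    def map_check_to_service(check_name, detected_services):
        """Return the detected service a Nagios check most likely belongs to."""
        for pattern, service in CHECK_TO_SERVICE_PRESETS:
            if pattern.search(check_name) and service in detected_services:
                return service
        return None  # unmapped checks stay visible, but without impact information

    print(map_check_to_service('HTTP vhost example.com', {'apache2', 'mysqld'}))  # -> apache2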

In addition to the Nagios node and service states, DOTM aggregates the current connection details from the nodes. It remembers old connections so it can show service usage transitions and raise alarms for services that have been unused for a long time or were suddenly disconnected. This helps with typical questions like "do we actually still need this X" and can uncover a wrong firewall configuration.

Connection and Service Tracking
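As a rough sketch of what such an alarm could look like, the snippet below flags remembered connections that have not been seen for a while. The field names follow the connection hash described in the Redis schema below; the threshold and sample data are arbitrary:

    # Illustrative sketch: the threshold and sample data are made up.
    import time

    STALE_AFTER = 14 * 24 * 3600  # e.g. two weeks without any connection

    def stale_connections(connections, now=None):
        """Yield remembered connections whose 'last_connection' is too old."""
        now = now or time.time()
        for conn in connections:
            if now - conn['last_connection'] > STALE_AFTER:
                yield conn

    example = [
        {'process': 'postgres', 'remote_host': 'app01', 'last_connection': time.time() - 30 * 24 * 3600},
        {'process': 'nginx', 'remote_host': 'lb01', 'last_connection': time.time() - 60},
    ]
    for conn in stale_connections(example):
        print('possibly unused: %s <- %s' % (conn['process'], conn['remote_host']))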

Finally, those two sources of information are combined into a very simple "graphical" representation:

Node Graph

In this "node graph" color coding indicates Nagios alerts as well as service states only discovered by DOTM.

Server Installation

The DOTM server has the following dependencies:

  • netcat
  • redis-server
  • MaxMind GeoIP Lite (optional)
  • Python2 modules
    • redis
    • bottle
    • requests
    • GeoIP

To automatically install the server including its dependencies on Debian/Ubuntu, simply run:

scripts/install-server.sh

Agent Installation

The DOTM agent has the following dependencies:

  • glib-2.0
  • libevent2

To automatically install the dotm_node agent, simply run:

scripts/install-agent.sh

Of course, as the agent has to run on all monitored systems, its single binary should be distributed to all nodes using your favourite automation tool.

Software Stack

DOTM will use the following technologies (a small backend sketch follows the overview below):

  • Simple remote agent "dotm_node" (in C using libevent and glib)
  • Redis as backend store
  • Python2 bottle with Jinja templating
  • JSON backend data access
  • any jQuery library for rendering

Architecture Overview
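As a rough idea of how these pieces fit together, here is a minimal bottle route serving a node's Redis hash as JSON. The route path and response shape are illustrative, not DOTM's actual API:

    # Illustrative sketch: the route path and response shape are examples only.
    import json

    import redis
    from bottle import Bottle, response

    app = Bottle()
    rdb = redis.StrictRedis(host='localhost', port=6379, db=0, decode_responses=True)

    @app.route('/json/node/<name>')
    def node_details(name):
        """Return the per-node hash (last_fetch, fetch_status, ips, ...) as JSON."""
        response.content_type = 'application/json'
        return json.dumps(rdb.hgetall('dotm::nodes::%s' % name))

    if __name__ == '__main__':
        app.run(host='localhost', port=8080)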

Redis Data Schema

So far the following relations are probably needed:

Entity Overview

Right now the following relation namespaces are used in Redis (a small write/read sketch follows the list):

  • dotm::nodes (list of node names, resolvable via the local resolver and identical to the remote hostname)
  • dotm::nodes::<node name>
    • 'last_fetch' => <timestamp>
    • 'fetch_status' => <'OK' or error message>
    • 'ips' => <comma separated list of IPs>
    • 'service_alerts' => <hash of service name - status tuples>
  • dotm::connections::<node name>::<port>::<remote node/IP> (hash):
    • 'process' => <string>
    • 'connections' => <int>
    • 'last_connection' => <timestamp>
    • 'last_seen' => <timestamp>
    • 'direction' => <in/out>
    • 'remote_host' => <IP or node name>
    • 'remote_port' => <port number or 'high'>
    • 'local_port' => <port number>
  • dotm::services::<node name>::<port> (hash with the following key values):
    • 'process' => <string>
    • 'last_seen' => <timestamp>
  • dotm::resolver::ip_to_node::<IP> (string, <node name>)
  • dotm::checks::nodes::<node name> (key with set expire):
    • JSON containing basic status information: { "node": "hostname01", "status": "UP", "last_check": <timestamp>, "last_status_change": <timestamp>, "status_information": "hostname01 status information" }
  • dotm::checks::services::<node name> (list of service JSONs with set expire):
    • List containing all associated node checks: [ { "node": "hostname01", "service": "Service01 name", "status": "OK", "last_check": <timestamp>, "last_status_change": <timestamp>, "status_information": "Service01 status information" }, { "node": "hostname01", "service": "Service02 name", "status": "CRITICAL", "last_check": <timestamp>, "last_status_change": <timestamp>, "status_information": "Service02 status information" } ]
  • dotm::config::* (all preferences, for descriptions check the 'Settings' page)
  • dotm::state (state info, usually update locks + timestamps)
    • last_updated (timestamp)
    • update_running (0 or 1)
    • last_snapshot (timestamp)
  • dotm::queue (list of queued backend tasks in JSON)
    • {"id": <task key>, "fn": <function name/action>, "args": <function arguments>, "kwargs": <function keywords>}
  • dotm::queue::result::<uuid4 name> (status and result of the queued task in JSON)
    • {"status": <pending/processing/ready>, "result": <result in JSON>}
  • dotm::history (list of history <timestamps>)
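The following sketch writes a few entries that follow this schema; the hostnames, IPs and values are made-up examples:

    # Illustrative sketch: hostnames, IPs and values are made-up examples.
    import time

    import redis

    rdb = redis.StrictRedis(host='localhost', port=6379, db=0, decode_responses=True)
    now = int(time.time())

    # Register the node and store its fetch status
    # (a real update run would avoid pushing duplicate node names).
    rdb.rpush('dotm::nodes', 'web01')
    rdb.hmset('dotm::nodes::web01', {
        'last_fetch': now,
        'fetch_status': 'OK',
        'ips': '192.0.2.10,10.0.0.10',
    })

    # One observed incoming connection on web01:80 from lb01.
    rdb.hmset('dotm::connections::web01::80::lb01', {
        'process': 'apache2',
        'connections': 12,
        'last_connection': now,
        'last_seen': now,
        'direction': 'in',
        'remote_host': 'lb01',
        'remote_port': 'high',
        'local_port': 80,
    })

    # The service behind that connection and the reverse IP lookup.
    rdb.hmset('dotm::services::web01::80', {'process': 'apache2', 'last_seen': now})
    rdb.set('dotm::resolver::ip_to_node::192.0.2.10', 'web01')

    print(rdb.hgetall('dotm::nodes::web01'))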
