armada/doc/source/operations/metrics.rst
Sean Eagan 0721ed43aa Implement Prometheus metric integration
This implements Prometheus metric integration, including metric
definition, collection, and exportation.

End user documentation for supported metric data and exportation
interface is included.

Change-Id: Ia0837f28073d6cd8e0220ac84cdd261b32704ae4
2019-08-15 16:12:17 +00:00


Metrics

Armada exposes metric data for consumption by Prometheus.

Exporting

Metric data can be exported via:

  • API: Prometheus exporter at the /metrics endpoint. The Armada chart includes the appropriate Prometheus scrape configuration for this endpoint.
  • CLI: the --metrics-output=<path> option of the apply command. The node exporter's textfile collector can then be used to export the resulting text files to Prometheus.
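
As a rough illustration of how a file written via --metrics-output could be inspected before it is handed to the textfile collector, the sketch below parses samples from text in the Prometheus text exposition format, which is the format the textfile collector expects. It uses only the Python standard library. The helper name parse_samples, the sample text, and the manifest name are hypothetical and are not taken from Armada.

```python
import re

# One sample line: metric name, optional {labels}, value, optional timestamp.
SAMPLE_RE = re.compile(
    r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)(?:\s+-?\d+)?$')
# One label pair inside the braces. Escape sequences in the value are
# matched but left as-is (this sketch does not unescape them).
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_samples(text):
    """Return a list of (name, labels, value) tuples from exposition text."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith('#'):
            continue
        match = SAMPLE_RE.match(line)
        if match is None:
            continue  # Ignore lines this simple parser does not understand.
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ''))
        samples.append((name, labels, float(value)))
    return samples


# Hypothetical file content; real output depends on what Armada recorded.
EXAMPLE = '''\
# HELP armada_apply_attempt_total hypothetical help text
# TYPE armada_apply_attempt_total counter
armada_apply_attempt_total{manifest="example-manifest"} 1.0
armada_apply_failure_total{manifest="example-manifest"} 0.0
'''

samples = parse_samples(EXAMPLE)
for name, labels, value in samples:
    print(name, labels, value)
```

In practice the text would be read from the path passed to --metrics-output rather than from an inline string.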

Metric Names

Metric names are as follows:

armada_ + <action> + _ + <metric>
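
The naming scheme above can be sketched as a small helper. The function name metric_name is hypothetical; the action and metric used in the example are ones listed in this document.

```python
PREFIX = "armada"


def metric_name(action, metric):
    """Build a full metric name of the form armada_<action>_<metric>."""
    return "_".join((PREFIX, action, metric))


print(metric_name("chart_deploy", "duration_seconds"))
# prints: armada_chart_deploy_duration_seconds
```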

Supported <action>s

The following tree of <action>s is measured. Supported Prometheus labels are noted for each action. Labels are inherited by sub-actions except where noted.

  • `apply`:
    • description: apply a manifest
    • labels: manifest
    • sub-actions:
      • `chart_handle`:
        • description: fully handle a chart (see sub-actions below)
        • labels:
          • chart
          • action (install|upgrade|noop) (not included in sub-actions)
        • sub-actions:
          • chart_download
          • chart_deploy
          • chart_test
      • `chart_delete`:
        • description: delete a chart (e.g. due to FAILED status)
        • labels: chart
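
To make the label inheritance rule concrete, the following sketch encodes the tree above as a nested dictionary and computes the labels that apply to each action. The dictionary layout, the own_labels key (used here for labels that are not passed down to sub-actions), and the effective_labels helper are illustrative assumptions, not Armada's internal representation.

```python
# "labels" are inherited by sub-actions; "own_labels" are not.
ACTION_TREE = {
    "apply": {
        "labels": ["manifest"],
        "sub_actions": {
            "chart_handle": {
                "labels": ["chart"],
                "own_labels": ["action"],
                "sub_actions": {
                    "chart_download": {},
                    "chart_deploy": {},
                    "chart_test": {},
                },
            },
            "chart_delete": {"labels": ["chart"]},
        },
    },
}


def effective_labels(tree, inherited=()):
    """Map each action name to the full list of labels that apply to it."""
    result = {}
    for name, node in tree.items():
        # Labels passed down to sub-actions: the parent's plus this node's.
        inheritable = list(inherited) + node.get("labels", [])
        # Labels on this action itself also include its non-inherited ones.
        result[name] = inheritable + node.get("own_labels", [])
        result.update(
            effective_labels(node.get("sub_actions", {}), inheritable))
    return result


labels = effective_labels(ACTION_TREE)
for action, names in labels.items():
    print(action, names)
```

Under these assumptions, chart_handle carries manifest, chart, and action, while its sub-actions such as chart_deploy carry only manifest and chart.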

Supported <metric>s

  • `failure_total`: total failed attempts
  • `attempt_total`: total attempts
  • `attempt_inprogress`: number of attempts currently in progress
  • `duration_seconds`: duration of each attempt

Timeouts

The chart_handle and chart_test actions additionally include the following metrics:

  • `timeout_duration_seconds`: configured chart timeout duration in seconds
  • `timeout_usage_ratio`: fraction of the configured timeout that was used, i.e. duration_seconds / timeout_duration_seconds

These can help identify charts whose timeouts may need to be raised to avoid potential failures, or lowered to achieve faster failures.
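
As a worked example of the ratio, the sketch below computes timeout_usage_ratio for two charts and flags those whose timeouts might deserve a second look. The chart names, the observed values, and the 0.9 and 0.1 thresholds are all hypothetical; suitable thresholds depend on the deployment.

```python
def timeout_usage_ratio(duration_seconds, timeout_duration_seconds):
    """Fraction of the configured timeout that an attempt actually used.

    Assumes timeout_duration_seconds is positive.
    """
    return duration_seconds / timeout_duration_seconds


# Hypothetical observations: chart -> (duration_seconds, timeout_duration_seconds)
observations = {
    "chart-a": (855.0, 900.0),
    "chart-b": (30.0, 600.0),
}

HIGH = 0.9  # hypothetical: close to timing out
LOW = 0.1   # hypothetical: timeout far larger than needed


def advise(ratio):
    """Turn a usage ratio into a rough suggestion."""
    if ratio > HIGH:
        return "consider raising the timeout"
    if ratio < LOW:
        return "consider lowering the timeout"
    return "timeout looks reasonable"


advice = {}
for chart, (duration, timeout) in observations.items():
    ratio = timeout_usage_ratio(duration, timeout)
    advice[chart] = advise(ratio)
    print(f"{chart}: ratio={ratio:.2f}, {advice[chart]}")
```

Here chart-a uses about 95% of its timeout and is at risk of timing out, while chart-b uses about 5%, so a failure of chart-b would take much longer to surface than necessary.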

Chart concurrency

The chart_handle action additionally includes the following metric:

  • `concurrency_count`: count of charts being handled concurrently

This can help identify opportunities for greater chart concurrency.