Jun 15, 2025 - 23:40
Grafana - detecting abnormal behavior of applications

We have Grafana on premises with Prometheus.

Some anomalies can be detected by viewing a set of charts (slow requests, retries, pending transactions, etc.). To manage incidents, SRE operators need to see all of this information in one Grafana widget instead of across multiple charts. They also need to be familiar with all services.

Abnormal behavior can easily be recognized in one chart:

Single time series

but when trying to find a problem across 10 or more services in one chart, focusing on one graph means losing sight of the others, and the problem goes unnoticed:

Multiple time series widget

The easiest workaround seems to be converting the linear time series graphs into a single status history widget. I tried the rate() function with zero, 15-minute, and 24-hour offsets. I can also filter events by status: success="yes". For one service, the status history looks like this:
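To show the general shape of what I mean (these are not my real formulas; the metric and label names below are placeholders):

```promql
# current 15-minute success rate per service
sum by (service) (rate(events_total{success="yes"}[15m]))

# the same rate shifted 15 minutes and 24 hours back, for comparison
sum by (service) (rate(events_total{success="yes"}[15m] offset 15m))
sum by (service) (rate(events_total{success="yes"}[15m] offset 24h))
```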

Grafana status history widget

But for all services, the recognition ability is still unacceptable:

Grafana status history widget

and some problems remain unresolved:

  • The detector can occasionally be triggered by problems that happened 24 hours ago
  • It doesn't detect gaps with zero events as incidents (it should)
  • The detector is triggered by events that are not present in the queries
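For the zero-events gap in particular, I assume the detection would look something like the following (placeholder metric name again; `absent_over_time()` needs a concrete selector, so the `service` label value here is hypothetical):

```promql
# fire when a service produced no events over the whole 15m window
sum by (service) (increase(events_total[15m])) == 0

# or, when the series disappears from Prometheus entirely
absent_over_time(events_total{service="checkout"}[15m])
```

but I haven't found how to make that show up as an incident in the status history widget.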

LLM chatbots recommend using avg_over_time() conditions, but I don't know how to express that in PromQL.
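My best guess at what they mean is a deviation-from-baseline expression built on subqueries, roughly like this z-score sketch (placeholder names; windows chosen arbitrarily):

```promql
# how far the current 5m rate deviates from its 1-hour baseline,
# in standard deviations, per service (PromQL subquery syntax)
(
    sum by (service) (rate(events_total[5m]))
  - avg_over_time(sum by (service) (rate(events_total[5m]))[1h:5m])
)
/
stddev_over_time(sum by (service) (rate(events_total[5m]))[1h:5m])
```

Is that the right direction, and would it address the three problems above?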

I'm deliberately not including my actual formulas so as not to prejudge the discussion before it goes in the right direction. I can't find the solution I'm looking for here or in open resources, e.g. on play.grafana.org