From 61defed938f49f338bc5df9a9a8794dfae2021e6 Mon Sep 17 00:00:00 2001
From: Nobuto Murata
Date: Sun, 17 Mar 2024 22:55:39 +0900
Subject: [PATCH] Don't expect a static job name

A job name passed via the prometheus_scrape library doesn't end up as a
static job name in the prometheus configuration file in the COS world
even though COS expects a fixed string. Practically we cannot have a
static job name like job=ceph in any of the alert rules in COS since the
charms will convert the string "ceph" into:

> juju_MODELNAME_ID_APPNAME_prometheus_scrape_JOBNAME(ceph)-N

Let's give up the possibility of the static job name and use "up{}" so
it will be annotated with the model name/ID, etc. without any specific
job related condition. It will break the alert rules when one unit has
more than one scraping endpoint because there will be no way to
distinguish multiple scraping jobs. Ceph MON only has one prometheus
endpoint for the time being so this change shouldn't cause an immediate
issue. Overall, it's not ideal but at least better than the current
status, which is an alert error out of the box.

The following alert rule:

> up{} == 0

will be converted and annotated as:

> up{juju_application="ceph-mon",juju_model="ceph",juju_model_uuid="UUID"} == 0

Closes-Bug: #2044062
Change-Id: I0df8bc0238349b5f03179dfb8f4da95da48140c7
(cherry picked from commit fb3262183102171da5704868d7522290b3a9ede4)
---
 files/prometheus_alert_rules/prometheus_alerts.yml.default | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/files/prometheus_alert_rules/prometheus_alerts.yml.default b/files/prometheus_alert_rules/prometheus_alerts.yml.default
index a544d41e..b292a3c4 100644
--- a/files/prometheus_alert_rules/prometheus_alerts.yml.default
+++ b/files/prometheus_alert_rules/prometheus_alerts.yml.default
@@ -343,7 +343,7 @@ groups:
         annotations:
           description: "The mgr/prometheus module at {{ $labels.instance }} is unreachable. This could mean that the module has been disabled or the mgr daemon itself is down. Without the mgr/prometheus module metrics and alerts will no longer function. Open a shell to an admin node or toolbox pod and use 'ceph -s' to to determine whether the mgr is active. If the mgr is not active, restart it, otherwise you can determine module status with 'ceph mgr module ls'. If it is not listed as enabled, enable it with 'ceph mgr module enable prometheus'."
           summary: "The mgr/prometheus module is not available"
-        expr: "up{job=\"ceph\"} == 0"
+        expr: "up{} == 0"
         for: "1m"
         labels:
           oid: "1.3.6.1.4.1.50495.1.2.1.6.2"
@@ -601,7 +601,7 @@ groups:
         annotations:
           description: "The prometheus job that scrapes from Ceph is no longer defined, this will effectively mean you'll have no metrics or alerts for the cluster. Please review the job definitions in the prometheus.yml file of the prometheus instance."
           summary: "The scrape job for Ceph is missing from Prometheus"
-        expr: "absent(up{job=\"ceph\"})"
+        expr: "absent(up{})"
         for: "30s"
         labels:
           oid: "1.3.6.1.4.1.50495.1.2.1.12.1"