
Merge "Update documentation for 1.0"

tags/1.0rc1^0
Jenkins committed 2 years ago
parent commit 1c12b277bf

3 changed files with 2291 additions and 17 deletions:

  1. doc/user/source/appendix_alarms.rst   (+2230 -10)
  2. doc/user/source/configure_alarms.rst  (+33 -7)
  3. doc/user/source/release_notes.rst     (+28 -0)

doc/user/source/appendix_alarms.rst (+2230 -10)

File diff suppressed because it is too large.


doc/user/source/configure_alarms.rst (+33 -7)

@@ -368,12 +368,21 @@ file. This file has the following sections:
    to that category of nodes. For example::
 
      node_cluster_alarms:
-        controller:
-         cpu: ['cpu-critical-controller', 'cpu-warning-controller']
-         root-fs: ['root-fs-critical', 'root-fs-warning']
-         log-fs: ['log-fs-critical', 'log-fs-warning']
-
-   Creates three alarm groups for the cluster of nodes called 'controller':
+        controller-nodes:
+            apply_to_node: controller
+            alerting: enabled
+            members:
+                cpu:
+                    alarms: ['cpu-critical-controller', 'cpu-warning-controller']
+                root-fs:
+                    alarms: ['root-fs-critical', 'root-fs-warning']
+                log-fs:
+                    alarms: ['log-fs-critical', 'log-fs-warning']
+                hdd-errors:
+                    alerting: enabled_with_notification
+                    alarms: ['hdd-errors-critical']
+
+   Creates four alarm groups for the cluster of controller nodes:
 
    * The *cpu* alarm group is mapped to two alarms defined in the ``alarms``
      section known as the 'cpu-critical-controller' and
@@ -388,6 +397,13 @@ file. This file has the following sections:
      section known as the 'log-fs-critical' and 'log-fs-warning' alarms. These
      alarms monitor the file system where the logs are created on the
      controller nodes.
+   * The *hdd-errors* alarm group is mapped to the 'hdd-errors-critical' alarm
+     defined in the ``alarms`` section. This alarm monitors the ``kern.log``
+     entries that contain critical I/O errors detected by the kernel.
+     The *hdd-errors* alarm group has the *enabled_with_notification* alerting
+     attribute, meaning that the operator is notified whenever one of the
+     controller nodes encounters a disk failure. The other alarms send
+     notifications at the aggregated cluster level rather than per node.
 
    .. note:: An *alarm group* is a mere implementation artifact (although it
       has functional value) that is primarily used to distribute the alarms
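
The new *hdd-errors* group above implies a matching 'hdd-errors-critical'
entry in the ``alarms`` section of the same file. The following is a minimal
sketch only, assuming the usual alarm-definition layout of that section; the
metric name ``hdd_errors_rate``, the threshold, and the window are
illustrative assumptions, not values taken from the plugin defaults::

    alarms:
      - name: 'hdd-errors-critical'
        description: 'Critical I/O errors detected in kern.log'  # assumed wording
        severity: 'critical'
        trigger:
          rules:
            - metric: hdd_errors_rate      # hypothetical metric name
              relational_operator: '>'
              threshold: 0                 # any error is critical (assumption)
              window: 60
              function: max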
@@ -425,7 +441,7 @@ structure of that file.
    important to keep exactly the same copy of
    ``/etc/hiera/override/gse_filters.yaml`` across all the nodes of the
    OpenStack environment including the node(s) where Nagios is installed.
-   
+
 The aggregation rules and correlation policies are defined in the ``/etc/hiera/override/gse_filters.yaml`` configuration file.
 
 This file has the following sections:
@@ -590,6 +606,7 @@ the service cluster aggregation rules::
     output_metric_name: cluster_service_status
     interval: 10
     warm_up_period: 20
+    alerting: enabled_with_notification
     clusters:
       nova-api:
         policy: highest_severity
@@ -638,6 +655,10 @@ Where
 |   The number of seconds after a (re)start that the GSE plugin will wait
     before emitting its metric messages.
 
+| alerting
+|   Type: string (one of 'disabled', 'enabled' or 'enabled_with_notification').
+|   The alerting configuration of the service clusters.
+
 | clusters
 |   Type: list
 |   The list of service clusters that the plugin handles. See
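
To make the new attribute concrete, here is a hedged sketch of the top of the
service-cluster aggregation rules with alerting turned off; every key comes
from the excerpt above, only the ``alerting`` value differs::

    output_metric_name: cluster_service_status
    interval: 10
    warm_up_period: 20
    alerting: disabled    # other accepted values: 'enabled',
                          # 'enabled_with_notification'
    clusters:
      nova-api:
        policy: highest_severity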
@@ -720,6 +741,7 @@ cluster aggregation rules::
     output_metric_name: cluster_node_status
     interval: 10
     warm_up_period: 80
+    alerting: enabled_with_notification
     clusters:
       controller:
         policy: majority_of_members
@@ -768,6 +790,10 @@ Where
 |   The number of seconds after a (re)start that the GSE plugin will wait
     before emitting its metric messages.
 
+| alerting
+|   Type: string (one of 'disabled', 'enabled' or 'enabled_with_notification').
+|   The alerting configuration of the node clusters.
+
 | clusters
 |   Type: list
 |   The list of node clusters that the plugin handles. See
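
Similarly for node clusters, setting the attribute to 'enabled' presumably
raises the cluster alerts without notifying the operator (the documentation
above only names the three values, so this interpretation is an assumption);
a sketch reusing only the keys shown in the excerpt above::

    output_metric_name: cluster_node_status
    interval: 10
    warm_up_period: 80
    alerting: enabled    # alerts raised; presumably no notification sent
    clusters:
      controller:
        policy: majority_of_members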

doc/user/source/release_notes.rst (+28 -0)

@@ -10,6 +10,34 @@ Release notes
 Version 1.0.0
 +++++++++++++
 
+The StackLight Collector plugin 1.0.0 for Fuel contains the following updates:
+
+New alarms:
+
+  * Monitor RabbitMQ from the Pacemaker point of view
+  * Monitor all partitions and OSD disk(s)
+  * Horizon HTTP 5xx errors
+  * Keystone slow response times
+  * HDD errors
+  * Swap usage percentage
+  * Network packet drops
+  * Local OpenStack API checks
+  * Local checks for services: Apache, Memcached, MySQL, RabbitMQ, Pacemaker
+
+Alarm enhancements:
+
+  * Added support for the ``group by`` attribute in alarm rules (see the
+    sketch after this diff)
+  * Added support for pattern matching to filter metric dimensions
+
+Bug fixes:
+
+  * Fixed the concurrent execution of logrotate.
+    See `#1455104 <https://bugs.launchpad.net/lma-toolchain/+bug/1455104>`_.
+  * Implemented the capability for the Elasticsearch bulk size to increase
+    when required. See `#1617211 <https://bugs.launchpad.net/lma-toolchain/+bug/1617211>`_.
+  * Implemented the capability to use the RabbitMQ management API in place of
+    the :command:`rabbitmqctl` command.
+
 Version 0.10.0
 ++++++++++++++
 
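
Illustrating the ``group by`` enhancement noted in the release notes above:
grouping an alarm rule by a metric dimension makes the alarm evaluate once per
distinct value of that dimension. This is a hedged sketch only; the YAML key
spelling ``group_by``, the metric name, the dimension, the threshold, and the
window are assumptions rather than shipped defaults::

    - name: 'fs-warning'
      description: 'The filesystem free space is low'   # hypothetical alarm
      severity: 'warning'
      trigger:
        rules:
          - metric: fs_space_percent_free   # assumed metric name
            group_by: ['fs']                # one alarm evaluation per filesystem
            relational_operator: '<'
            threshold: 5
            window: 60
            function: min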
