From d3d4f84a0b45d24f710a43c8d55451b8265107f2 Mon Sep 17 00:00:00 2001
From: James Gu
Date: Thu, 22 Feb 2018 17:49:49 -0800
Subject: [PATCH] Metrics retention policy enhancement

Support a differentiated metrics retention policy based on metrics type.

Change-Id: I915376827604bc692cd26b7ed00812c64ee2e3c0
story: 2001576
---
 specs/rocky/approved/metrics-retention.rst | 289 ++++++++++++++++++
 .../approved/python-persister-metrics.rst | 199 ++++++++++++
 2 files changed, 488 insertions(+)
 create mode 100644 specs/rocky/approved/metrics-retention.rst
 create mode 100644 specs/rocky/approved/python-persister-metrics.rst

diff --git a/specs/rocky/approved/metrics-retention.rst b/specs/rocky/approved/metrics-retention.rst
new file mode 100644
index 0000000..49619cf
--- /dev/null
+++ b/specs/rocky/approved/metrics-retention.rst
@@ -0,0 +1,289 @@
+..
+   This work is licensed under a Creative Commons Attribution 3.0 Unported
+   License.
+
+   http://creativecommons.org/licenses/by/3.0/legalcode
+
+=======================
+Metric Retention Policy
+=======================
+
+Story board: https://storyboard.openstack.org/#!/story/2001576
+
+A metric retention policy must be in place to keep the time series database
+disks from filling up. The retention period should be adjustable for
+different types of metrics, e.g., monitoring vs. metering or aggregate vs.
+raw meters.
+
+Problem description
+===================
+
+In a cloud of 200 compute hosts, up to one billion metrics can be generated
+daily. The time series database disks will fill up within months, if not
+weeks, unless old metric data is purged regularly. The retention requirement
+can differ based on the type of the metrics and the usage model. For
+example, a customer may want to preserve metering metrics for months or
+years, while having no interest in monitoring metrics more than a week old.
+Some customers' billing systems may pull the metering data on a daily basis,
+which could eliminate the need for longer retention of metering metrics.
+Monasca needs to support a metric retention policy that can be tailored per
+metric or metric type.
+
+Use Cases
+---------
+
+- Use case 1
+  Operator configures the default metric retention in the persister
+  configuration.
+
+  The default retention policy is applied if a metric doesn't specify its
+  own retention policy. This default retention is generally a shorter period
+  of time and is targeted at the monitoring metrics.
+
+- Use case 2
+  Operator configures the retention policy for the roll up metrics in the
+  Monasca transform. Roll up metrics generally require a longer retention
+  period.
+
+- Use case 3
+  Operator configures the retention policy for Ceilometer metrics in the
+  pipeline and mapping configuration file. Metering metrics generally
+  require a longer retention period.
+
+- Use case 4
+  The metric agent plugin sets the retention policy when it generates a new
+  metric. This is mostly a means to override the default retention policy
+  for monitoring metrics.
+
+Proposed change
+===============
+
+**Posting to get preliminary feedback on the scope of this spec.**
+
+1. Monasca API
+
+   Add an optional metric property "TTL" in the create metrics API. TTL is
+   the number of seconds until the metric expires. If set, the TTL property
+   will be included when posting a new metric message to Kafka.
+
+2. Monasca Persister
+
+   The persister reads the default retention policy setting from the service
+   configuration file, in the influxDbConfiguration and
+   cassandraDbConfiguration sections.
+   ::
+
+     # Retention policy may be left blank to indicate default policy.
+     retentionPolicy: 7
+
+   It may make more sense to move this property to the metricConfiguration
+   section and to use seconds instead of days as the unit.
+
+   The persister will retrieve the TTL property from the incoming metric
+   message.
+   If not set, the TTL value from the default retention policy will be used
+   instead.
+
+   The TTL is set in the parameterized database query when persisting the
+   metrics into the time series database, for both Cassandra and InfluxDB.
+
+3. Monasca Ceilometer (aka Ceilosca)
+
+   Add a TTL property in pipeline-api.yaml
+   ::
+
+     - name: image_source
+       interval: 30
+       # expires after 90 days
+       TTL: 7776000
+       meters:
+           - "image"
+           - "image.size"
+           - "image.update"
+           - "image.upload"
+           - "image.delete"
+       sinks:
+           - meter_sink
+
+   The Monasca-ceilometer implementation will parse the new property and set
+   the TTL when posting new metric messages.
+
+4. Monasca Transform
+
+   Add a TTL property in transform_specs.json
+   ::
+
+     {"aggregation_params_map":{"aggregation_pipeline":{"source":"streaming","usage":"fetch_quantity","setters":["rollup_quantity","set_aggregated_metric_name","set_aggregated_period"],"insert":["prepare_data","insert_data_pre_hourly"]},"aggregated_metric_name":"vm.mem.total_mb_agg","aggregation_period":"hourly","TTL":"7776000","aggregation_group_by_list": ["host", "metric_id", "tenant_id", "resource_uuid"],"usage_fetch_operation": "avg","filter_by_list": [],"setter_rollup_group_by_list":["tenant_id"],"setter_rollup_operation": "sum","dimension_list":["aggregation_period","host","project_id"],"pre_hourly_operation":"avg","pre_hourly_group_by_list":["default"]},"metric_group":"vm_mem_total_mb_project","metric_id":"vm_mem_total_mb_project"}
+
+   The Monasca-transform implementation will parse the new property and set
+   the TTL when posting new rolled up metric messages.
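The persister-side TTL fallback described in the Proposed change section can be sketched as follows. This is a minimal illustration only; the function and constant names are hypothetical, not the actual monasca-persister code, and the default is assumed to be the configured retention converted to seconds.

```python
# Hypothetical sketch of how the persister could resolve the effective TTL
# for an incoming metric message. DEFAULT_TTL_SECONDS stands in for the
# retentionPolicy setting from the persister configuration, assumed here to
# be expressed as 7 days converted to seconds.
DEFAULT_TTL_SECONDS = 7 * 24 * 3600


def effective_ttl(metric_message):
    """Return the metric's own TTL if present, else the configured default."""
    ttl = metric_message.get('TTL')
    return int(ttl) if ttl is not None else DEFAULT_TTL_SECONDS


# The resolved value would then be bound into the parameterized upsert,
# e.g. the TTL parameter of a Cassandra statement ending in "USING TTL ?".
```

A metering metric that carries ``TTL: 7776000`` would keep its 90-day expiry, while a monitoring metric without the property would fall back to the 7-day default.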
+
+Alternatives
+------------
+
+None
+
+Data model impact
+-----------------
+
+None
+
+REST API impact
+---------------
+
+The create metrics API is changed as follows:
+
+* Specification change for the create metrics API
+
+  * Create metrics
+
+    * Method type: POST
+
+    * Normal http response code(s): No change
+
+    * Expected error http response code(s): No change
+
+    * URL: /v2.0/metrics
+
+    * Parameters: No change
+
+    * Request body: Consists of a single metric object or an array of metric
+      objects. A metric has the following properties:
+
+      * name (string(255), required) - The name of the metric.
+      * dimensions ({string(255): string(255)}, optional) - A dictionary
+        consisting of (key, value) pairs used to uniquely identify a metric.
+      * timestamp (string, required) - The timestamp in milliseconds from
+        the Epoch.
+      * value (float, required) - Value of the metric. Values with base-10
+        exponents greater than 126 or less than -130 are truncated.
+      * value_meta ({string(255): string}(2048), optional) - A dictionary
+        consisting of (key, value) pairs used to add information about the
+        value. value_meta key/value combinations must be 2048 characters or
+        less, including the 7 characters of '{"":""}' from every JSON
+        string.
+      * TTL (integer, optional) - Time to live in seconds.
+
+Security impact
+---------------
+
+None. Security measures already in place for the Monasca API would remain.
+
+Other end user impact
+---------------------
+
+None
+
+Performance Impact
+------------------
+
+This feature has no direct impact on the write throughput. However, it
+allows the user to set a shorter retention period for monitoring metrics,
+which can potentially improve the read performance of queries that involve
+search, grouping and filtering when there are fewer metrics in the table.
+It also reduces the storage footprint.
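For rough capacity planning, the effect of the retention period on disk usage can be estimated from the daily volume cited in the problem description. The per-point byte cost below is an assumed placeholder, not a measured figure; actual cost varies by database and compression.

```python
# Back-of-envelope steady-state disk sizing under a retention policy.
# metrics_per_day comes from the problem description (200 compute hosts);
# bytes_per_point is an assumption, not a benchmarked value.
metrics_per_day = 1_000_000_000
bytes_per_point = 50          # assumed average on-disk cost per point
retention_days = 7            # e.g. a short monitoring retention

disk_bytes = metrics_per_day * bytes_per_point * retention_days
print(disk_bytes / 10**12)    # roughly 0.35 TB at steady state
```

With a 90-day metering retention the same arithmetic yields about 4.5 TB, which is why differentiating retention per metric type matters for sizing.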
+
+Other deployer impact
+---------------------
+
+No change in deployment of the services.
+
+For capacity planning, the user now has the option to specify a shorter
+retention period for monitoring metrics, or even per metric or metric
+category. The disk size should be calculated from the retention policy
+accordingly.
+
+Developer impact
+----------------
+
+Monasca agent plugin developers should be aware of the new TTL property now
+available to them. It is optional and only needed when a metric requires a
+TTL different from the default retention policy configured in the persister
+service.
+
+
+Implementation
+==============
+
+Assignee(s)
+-----------
+
+Contributors are welcome!
+
+Primary assignee:
+
+
+Other contributors:
+
+
+Work Items
+----------
+
+Work items or tasks -- break the feature up into the things that need to be
+done to implement it. Those parts might end up being done by different
+people, but we're mostly trying to understand the timeline for
+implementation.
+
+
+Dependencies
+============
+
+This feature depends on retention support in the TSDB storage. Both
+Cassandra and InfluxDB support specifying a retention policy.
+
+Testing
+=======
+
+~Please discuss the important scenarios needed to test here, as well as
+specific edge cases we should be ensuring work correctly. For each
+scenario please specify if this requires specialized hardware, a full
+openstack environment, or can be simulated inside the Monasca tree.~
+
+~Please discuss how the change will be tested. We especially want to know
+what tempest tests will be added. It is assumed that unit test coverage
+will be added so that doesn't need to be mentioned explicitly, but
+discussion of why you think unit tests are sufficient and we don't need to
+add more tempest tests would need to be included.~
+
+~Is this untestable in gate given current limitations (specific hardware /
+software configurations available)?
+If so, are there mitigation plans (3rd party testing, gate enhancements,
+etc.)?~
+
+
+Documentation Impact
+====================
+
+~Which audiences are affected most by this change, and which documentation
+titles on docs.openstack.org should be updated because of this change? Don't
+repeat details discussed above, but reference them here in the context of
+documentation for multiple audiences. For example, the Operations Guide
+targets cloud operators, and the End User Guide would need to be updated if
+the change offers a new feature available through the CLI or dashboard. If a
+config option changes or is deprecated, note here that the documentation
+needs to be updated to reflect this specification's change.~
+
+References
+==========
+
+~Please add any useful references here. You are not required to have any
+reference. Moreover, this specification should still make sense when your
+references are unavailable. Examples of what you could include are:~
+
+* ~Links to mailing list or IRC discussions~
+
+* ~Links to notes from a summit session~
+
+* ~Links to relevant research, if appropriate~
+
+* ~Related specifications as appropriate (e.g. if it's an EC2 thing, link
+  the EC2 docs)~
+
+* ~Anything else you feel it is worthwhile to refer to~
+
+
+History
+=======
+
+Optional section intended to be used each time the spec is updated to
+describe new design, API, or database schema updates. Useful to let the
+reader understand what has happened over time.
+
+.. list-table:: Revisions
+   :header-rows: 1
+
+   * - Release Name
+     - Description
+   * - Queens
+     - Introduced

diff --git a/specs/rocky/approved/python-persister-metrics.rst b/specs/rocky/approved/python-persister-metrics.rst
new file mode 100644
index 0000000..d8e216e
--- /dev/null
+++ b/specs/rocky/approved/python-persister-metrics.rst
@@ -0,0 +1,199 @@
+..
+   This work is licensed under a Creative Commons Attribution 3.0 Unported
+   License.
+
+   http://creativecommons.org/licenses/by/3.0/legalcode
+
+=====================================================
+Python Persister Performance Metrics Collection (WIP)
+=====================================================
+
+Story board: https://storyboard.openstack.org/#!/story/2001576
+
+This spec defines the list of measurements for the metric upsert processing
+time and throughput in the Python Persister, and provides a REST API to
+retrieve those measurements.
+
+Problem description
+===================
+
+The Java Persister, built on top of the DropWizard framework, provides a
+list of internal performance related metrics, e.g., the total number of
+metric messages that have been processed since the last service start up,
+and the average, min and max metric processing time. The Python Persister,
+on the other hand, lacks such instrumentation. This presents a challenge
+both to the operator who wants to monitor, triage, and tune the Persister
+performance, and to the Persister performance testing tool that was
+introduced in the Queens release. The Cassandra Python Persister plugin
+depends on this feature for performance tuning.
+
+Use Cases
+---------
+
+- Use case 1: The developer instruments the defined performance metrics.
+
+  There are two approaches to the internal performance metrics. The first
+  approach is in-memory metering, similar to the Java implementation; data
+  collection starts when the Persister service starts up and is not
+  persisted across service restarts. The second approach is to treat such
+  measurements exactly the same as the "normal" metrics Monasca collects.
+  The advantage is that such metrics will be persisted, and REST APIs are
+  already available to retrieve them.
+
+  The list of Persister metrics includes:
+
+  1. Total number of metrics upsert requests received and completed on a
+     given Persister service instance in a given period of time
+  2. Total number of metrics upsert requests received and completed on a
+     process or thread in a given period of time (P2)
+  3. The average, min and max metric request processing time in a given
+     period of time for a given Persister service instance and
+     process/thread.
+
+- Use case 2: Retrieve persister performance metrics through the REST API.
+
+  The performance metrics can be retrieved using the list metrics API in
+  the Monasca API service.
+
+Proposed change
+===============
+
+1. Monasca Persister
+
+   - The Python Persister integrates with monasca-statsd to send count and
+     timer metrics
+   - Add statsd properties to the persister configuration
+
+2. The Persister performance benchmark tool adds support for retrieving the
+   metrics from the Monasca REST API in addition to the DropWizard admin
+   API.
+
+Alternatives
+------------
+
+None
+
+Data model impact
+-----------------
+
+None
+
+REST API impact
+---------------
+
+None
+
+Security impact
+---------------
+
+None
+
+Other end user impact
+---------------------
+
+None
+
+Performance Impact
+------------------
+
+TBD. The statsd calls that update counters and timers are expected to have
+a small performance impact.
+
+Other deployer impact
+---------------------
+
+No change in deployment of the services.
+
+Developer impact
+----------------
+
+None.
+
+Implementation
+==============
+
+Assignee(s)
+-----------
+
+Contributors are welcome!
+
+Primary assignee:
+  jgu
+
+Other contributors:
+
+
+Work Items
+----------
+
+Work items or tasks -- break the feature up into the things that need to be
+done to implement it. Those parts might end up being done by different
+people, but we're mostly trying to understand the timeline for
+implementation.
+
+
+Dependencies
+============
+
+None
+
+Testing
+=======
+
+Please discuss the important scenarios needed to test here, as well as
+specific edge cases we should be ensuring work correctly.
For each +scenario please specify if this requires specialized hardware, a full +openstack environment, or can be simulated inside the Monasca tree. + +Please discuss how the change will be tested. We especially want to know what +tempest tests will be added. It is assumed that unit test coverage will be +added so that doesn't need to be mentioned explicitly, but discussion of why +you think unit tests are sufficient and we don't need to add more tempest +tests would need to be included. + +Is this untestable in gate given current limitations (specific hardware / +software configurations available)? If so, are there mitigation plans (3rd +party testing, gate enhancements, etc). + + +Documentation Impact +==================== + +Which audiences are affected most by this change, and which documentation +titles on docs.openstack.org should be updated because of this change? Don't +repeat details discussed above, but reference them here in the context of +documentation for multiple audiences. For example, the Operations Guide targets +cloud operators, and the End User Guide would need to be updated if the change +offers a new feature available through the CLI or dashboard. If a config option +changes or is deprecated, note here that the documentation needs to be updated +to reflect this specification's change. + +References +========== + +Please add any useful references here. You are not required to have any +reference. Moreover, this specification should still make sense when your +references are unavailable. Examples of what you could include are: + +* Links to mailing list or IRC discussions + +* Links to notes from a summit session + +* Links to relevant research, if appropriate + +* Related specifications as appropriate (e.g. 
if it's an EC2 thing, link the EC2 docs)
+
+* Anything else you feel it is worthwhile to refer to
+
+
+History
+=======
+
+Optional section intended to be used each time the spec is updated to
+describe new design, API, or database schema updates. Useful to let the
+reader understand what has happened over time.
+
+.. list-table:: Revisions
+   :header-rows: 1
+
+   * - Release Name
+     - Description
+   * - Queens
+     - Introduced
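As a rough illustration of the counter and timer instrumentation proposed above for the Python Persister: the sketch below uses hypothetical stand-in classes, not the actual monasca-statsd API, to show the count-and-time semantics the spec calls for.

```python
import time

# Hypothetical stand-ins for statsd-style counter and timer objects; the
# real integration would go through monasca-statsd, whose API may differ.
class Counter:
    def __init__(self):
        self.value = 0

    def increment(self, n=1):
        self.value += n


class Timer:
    def __init__(self):
        self.samples_ms = []

    def record(self, elapsed_ms):
        self.samples_ms.append(elapsed_ms)


metrics_received = Counter()
flush_timer = Timer()


def persist_batch(batch):
    """Upsert a batch of metrics, instrumenting count and elapsed time."""
    start = time.time()
    metrics_received.increment(len(batch))
    # ... database upsert would happen here ...
    flush_timer.record((time.time() - start) * 1000.0)


persist_batch([{'name': 'cpu.idle_perc'}, {'name': 'mem.free_mb'}])
print(metrics_received.value)  # 2
```

From counts and timing samples like these, the average, min and max processing times listed in use case 1 can be derived per service instance or per process/thread.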