WIP Metrics retention policy enhancement

Support differentiable metrics retention policy based on metrics
type.  Also outline alternatives.

Change-Id: I915376827604bc692cd26b7ed00812c64ee2e3c0
story: 2001576
This commit is contained in:
James Gu 2018-02-22 17:49:49 -08:00 committed by Joseph Davis
parent 4aa92c0caa
commit 744f5dc639
2 changed files with 599 additions and 0 deletions

..
   This work is licensed under a Creative Commons Attribution 3.0 Unported
   License.

   http://creativecommons.org/licenses/by/3.0/legalcode
=======================
Metric Retention Policy
=======================

Story board: https://storyboard.openstack.org/#!/story/2001576

A metric retention policy must be in place to keep old metric data from
filling up disks. The retention period should be adjustable for different
types of metrics, e.g., monitoring vs. metering or aggregate vs. raw meters.

Problem description
===================

In a cloud of 200 compute hosts, up to one billion metrics can be generated
daily. The time series database disks will fill up in months, if not weeks,
if old metric data is not purged regularly. The retention requirement can
differ based on the type of the metrics and the usage model. For example, a
customer may want to preserve metering metrics for months or years, while
having no interest in monitoring metrics older than a week. Some customers'
billing systems may pull the metering data on a daily basis, which could
eliminate the need for longer retention of metering metrics. Monasca needs to
support a metric retention policy that can be tailored per metric or metric
type.
Use Cases
---------
- Use case 1a

  Installer sets a default TTL value in configuration.

- Use case 1b

  Installer loads a set of metric-to-TTL mappings, which is stored in the
  Monasca API data store.

- Use case 2

  Monasca API receives a new metric (regardless of source). The metric is
  matched against a dictionary to determine its TTL (or the default value is
  used if there is no match). The TTL is passed with the metric value on to
  the Persister for storage in the TSDB.

  Note that the use cases for monasca-agent to post metrics are unchanged;
  only the processing in the Monasca API changes.

- Use case 3

  Operator uses the Monasca CLI to specify (or modify) a TTL value for a
  metric match string. The match string can be specific, such as
  "cpu.user_perc", or a wildcard string, such as "image.*". The CLI posts the
  request to the Monasca TTL API, where it is validated and then stored in
  the database.

- Use case 4

  Operator uses the Monasca CLI to GET the dictionary of metric:TTL mappings.

- Use case 5 (optional)

  Operator uses the Monasca UI to accomplish use case 3 or 4.
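The match strings in use case 3 can be exact names or wildcards such as
"image.*"; a minimal sketch of how such matching might behave, using Python's
stdlib ``fnmatch`` (the function name and mapping shape are illustrative, not
part of any Monasca API):

```python
from fnmatch import fnmatchcase

def find_ttl(metric_name, ttl_mappings, default_ttl):
    """Return the TTL (in days) for a metric name.

    ttl_mappings maps match strings (exact names or wildcards such as
    "image.*") to TTL values; default_ttl is used when nothing matches.
    """
    for pattern, ttl in ttl_mappings.items():
        if fnmatchcase(metric_name, pattern):
            return ttl
    return default_ttl

mappings = {"cpu.user_perc": 1, "image.*": 30}
print(find_ttl("image.size", mappings, 7))   # wildcard match
print(find_ttl("mem.free_mb", mappings, 7))  # falls back to the default
```

The same idea extends to use case 1b, where the mappings are loaded from the
Monasca API data store instead of being defined inline.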
Proposed change
===============

1. Monasca API

Add a new API for managing the mapping of metrics to TTL values.

TBD - API structure

Add storage for the mapping in the MySQL database. This allows all instances
of the Monasca API to share the configuration dynamically. Create a schema
for storing the metric:TTL dictionary.
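The schema itself is still TBD; one possible shape for the metric:TTL
dictionary, sketched with sqlite3 purely for illustration (the real store
would be MySQL, and every table and column name here is hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE metric_retention (
        id INTEGER PRIMARY KEY,
        match_string TEXT NOT NULL UNIQUE,  -- exact name or wildcard
        dimensions TEXT NOT NULL DEFAULT '',
        retention_days INTEGER NOT NULL
    )
""")
conn.execute(
    "INSERT INTO metric_retention (match_string, dimensions, retention_days)"
    " VALUES (?, ?, ?)", ("cpu.user_perc", "", 1))
row = conn.execute(
    "SELECT retention_days FROM metric_retention"
    " WHERE match_string = 'cpu.user_perc'").fetchone()
print(row[0])
```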
2. Monasca Persister

The Persister reads the default retention policy setting from the service
configuration file, in the influxDbConfiguration and cassandraDbConfiguration
sections.

::

    # Retention policy may be left blank to indicate the default policy.
    # Unit is days.
    retentionPolicy: 7

It may be convenient to allow specifying a unit with the policy value, for
example "7d" for 7 days or "3m" for 3 months.
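If a unit suffix is allowed, the Persister would need to normalize it to
days. A hedged sketch, under the assumption that "d", "w", "m", and "y"
suffixes are supported and that a month is treated as 30 days:

```python
UNIT_DAYS = {"d": 1, "w": 7, "m": 30, "y": 365}  # assumed suffix grammar

def parse_retention(value):
    """Parse a retention setting like 7, "7", "7d", or "3m" into days."""
    text = str(value).strip().lower()
    if text and text[-1] in UNIT_DAYS:
        return int(text[:-1]) * UNIT_DAYS[text[-1]]
    return int(text)

print(parse_retention("7d"))  # 7
print(parse_retention("3m"))  # 90
```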
The Persister will read the TTL property from the incoming metric message. If
it is not set, the TTL value from the default retention policy will be used
instead.

It is expected that, with the addition of this metrics retention feature, the
default retentionPolicy value would be set to a low value, and that metrics
that are to be kept longer would be called out specifically through the
retention API with appropriate values set.

The TTL is set in the parameterized database query when persisting the
metrics into the time series database, for both Cassandra and InfluxDB.

TBD - exact call structures for each TSDB.

Note that this does mean that each storage back end would need customized
code in the Persister to support passing the TTL value. This may also be
possible for Elasticsearch, though that is not part of this initial spec.
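The fallback described above (per-metric TTL if present, otherwise the
configured default) amounts to a few lines; a sketch in which the "ttl" field
name on the metric message is an assumption:

```python
def effective_ttl(metric_message, default_ttl_days):
    """Use the TTL carried in the metric message if present,
    otherwise fall back to the configured default retention."""
    ttl = metric_message.get("ttl")  # field name is an assumption
    return ttl if ttl is not None else default_ttl_days

msg_with_ttl = {"name": "swift.usage", "value": 1.0, "ttl": 365}
msg_without = {"name": "cpu.user_perc", "value": 42.0}
print(effective_ttl(msg_with_ttl, 7))  # per-metric TTL wins
print(effective_ttl(msg_without, 7))   # default applies
```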
3. Monasca CLI (optional)

A new CLI feature could be created to simplify getting the list of TTL
mappings or posting an update to a TTL mapping. This would need Keystone
authentication, as does the existing 'monasca' CLI, and could be added to it.

TBD: whether the current monasca CLI could handle ingesting a JSON structure.

4. Monasca UI (optional)

A new feature could be added to the Monasca UI that would allow a Cloud
Operator to view and edit the list of TTL mappings. Bonus points for allowing
the UI to offer sample metrics and simulate the mapping on the page.
Alternatives
------------

The original proposal was to have monasca-transform, monasca-ceilometer, and
monasca-agent each keep a TTL default setting and have a property to allow
specifying a TTL per metric. This would have also required a change to the
Monasca API to add an optional TTL to the metric POST listener. While this
would have been simpler to implement in the Monasca API, the additional work
to change all the services that originate metrics made this alternative less
appealing.

Another alternative would be to implement a new Monasca retention API as
outlined, but not include dimensions for the metrics. This would allow a much
simpler data structure of key:value pairs, with the key being the unique
match string and the value the standardized TTL value. While the
implementation would be much simpler, the additional power of matching on
dimensions is felt to be beneficial.
Data model impact
-----------------

The Monasca API data model will need to be extended to store the
metric-to-TTL mappings.

TBD - schema
REST API impact
---------------

A new metric retention API endpoint would be added to the Monasca API:

- Post a new mapping
- Update a mapping
- Get a single mapping
- Delete a single mapping
- Get the list of mappings (for backup or verification; format compatible
  with posting a list)
- Post a list of mappings (for install or restore; does this wipe all other
  mappings?)

URLs::

    /v2.0/metrics-retention       (single metric:ttl entry)
    /v2.0/metrics-retention/list  (dict of all entries)

The communication from the Monasca API to the Persister would have the TTL
value added as a parameter.

NOTE: care should be taken in defining the REST API path, as Gnocchi uses
"/metric", which may be confusing to some users.
Open questions around dimensions (e.g. service:xyz or host:node1) and
conflicting matches::

    cpu.user*                         : 2
    cpu.user_perc                     : 1
    cpu.user_perc {dim: host:node1}   : 2
    cpu.user_perc {dim: tenant: coke} : 3

- When multiple match strings apply, does the longest TTL win, or should that
  be an option?
- Should the default be long or short? A short default (1 day for
  development, 7 for production) seems reasonable, with the metrics to keep
  then set longer; with that strategy, the longer TTL should win in case of
  conflict.
- The result must not depend on ordering, since the dictionary is not in a
  guaranteed order (adding a datestamp was suggested, but...).
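One order-independent answer to the "longest wins" question above is to
collect every matching rule and take the longest TTL; a sketch (the rule
shapes are illustrative, and dimension matching is left out):

```python
from fnmatch import fnmatchcase

def resolve_ttl(name, rules, default_ttl):
    """Collect every rule that matches and take the longest TTL,
    so the result does not depend on rule ordering."""
    matches = [ttl for pattern, ttl in rules if fnmatchcase(name, pattern)]
    return max(matches) if matches else default_ttl

rules = [("cpu.user*", 2), ("cpu.user_perc", 1)]
print(resolve_ttl("cpu.user_perc", rules, 7))  # both rules match; 2 wins
print(resolve_ttl("disk.used", rules, 7))      # no match; default applies
```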
JSON structure for PUT/GET to the retention API::

    {
        "match": "cpu.user_perc",
        "dimensions": "",
        "retentionPolicy": "7d"
    }

Special case: to delete a retention policy, give a retentionPolicy value of
null and the entry will be removed from the list.

::

    {
        "match": "cpu.user_time",
        "dimensions": "",
        "retentionPolicy": null
    }
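Applied to an in-memory mapping, the delete-by-null convention could behave
as follows (a sketch only; the real mappings live in the Monasca API
database, and dimensions are ignored here):

```python
def apply_retention_update(mappings, update):
    """Apply one PUT body to the match-string -> TTL mapping.
    A null (None) retentionPolicy removes the entry: the delete case."""
    if update["retentionPolicy"] is None:
        mappings.pop(update["match"], None)
    else:
        mappings[update["match"]] = update["retentionPolicy"]
    return mappings

m = {"cpu.user_time": "7d"}
apply_retention_update(
    m, {"match": "cpu.user_perc", "dimensions": "", "retentionPolicy": "7d"})
apply_retention_update(
    m, {"match": "cpu.user_time", "dimensions": "", "retentionPolicy": None})
print(m)  # only cpu.user_perc remains
```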
Each API method which is either added or changed should have the following:

* Specification change for the create metric API

  * Create metrics
  * Method type: POST
  * Normal HTTP response code(s): no change
  * Expected error HTTP response code(s): no change
  * URL: /v2.0/metrics
  * Parameters: no change
  * Request body: consists of a single metric object or an array of metric
    objects. A metric has the following properties:

    * name (string(255), required) - The name of the metric.
    * dimensions ({string(255): string(255)}, optional) - A dictionary
      consisting of (key, value) pairs used to uniquely identify a metric.
    * timestamp (string, required) - The timestamp in milliseconds from the
      Epoch.
    * value (float, required) - Value of the metric. Values with base-10
      exponents greater than 126 or less than -130 are truncated.
    * value_meta ({string(255): string}(2048), optional) - A dictionary
      consisting of (key, value) pairs used to add information about the
      value. Value_meta key/value combinations must be 2048 characters or
      less, including the 7 characters of '{"":""}' from every JSON string.
    * ttl (integer, optional) - Time to live in seconds.

* Example use case, including typical API samples for both data supplied by
  the caller and the response
Security impact
---------------

None. Security measures already in place for the Monasca API would remain.

Other end user impact
---------------------

None for most users, as access is restricted to Cloud Operators.

A Cloud Operator would have a new responsibility to configure retention for
the metrics.

A future discussion could be had about whether a tenant user should be
granted the ability to set their own retention policies, but generally the
Cloud Operator is responsible for ensuring there are sufficient resources to
meet the retention requirements.
Performance Impact
------------------

This feature has no direct impact on write throughput. However, it allows
the user to set a shorter retention period for monitoring metrics, which can
potentially improve read performance for queries that involve searching,
grouping, and filtering when there are fewer metrics in the table. It also
reduces the storage footprint.

Depending on how complex the metric retention match strings get, there could
be some performance impact on ingestion. TBD.
Other deployer impact
---------------------

No change in deployment of the services. The service could be deployed with
simply a default TTL value in configuration. If the operator desires, a
complete list of TTL values could be loaded as part of the installation
process once the Monasca API is running.

For planning, the user now has the option to specify a shorter retention
period for monitoring metrics, or even per metric or metric category. The
required disk size should be calculated based upon the retention policy
accordingly.
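A back-of-envelope sizing example under a retention policy: the one billion
points per day figure comes from the problem description, while the
bytes-per-point value is an assumed, compression-free illustration:

```python
def disk_gb(points_per_day, retention_days, bytes_per_point=20):
    """Rough on-disk estimate, ignoring compression and indexes.
    bytes_per_point is an assumed figure, not a measured TSDB number."""
    return points_per_day * retention_days * bytes_per_point / 1e9

# 1e9 monitoring points/day kept for 7 days vs. 90 days:
print(round(disk_gb(1e9, 7)))   # 140 (GB)
print(round(disk_gb(1e9, 90)))  # 1800 (GB)
```

The gap between the two results is the planning benefit of a short default
retention for monitoring metrics.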
Developer impact
----------------

Monasca agent plugin developers should be aware of the new TTL property now
available to them. It is an optional property that is only needed if a
different TTL value than the default retention policy in the Persister
service is needed.

Implementation
==============

Assignee(s)
-----------

Contributors are welcome!

Primary assignee:

Other contributors:

Work Items
----------

Work items or tasks -- break the feature up into the things that need to be
done to implement it. Those parts might end up being done by different
people, but we're mostly trying to understand the timeline for
implementation.
Dependencies
============

Dependent on retention policy support in the TSDB storage. Both Cassandra
and InfluxDB support specifying a retention policy.

Testing
=======

Please discuss the important scenarios needed to test here, as well as
specific edge cases we should be ensuring work correctly. For each scenario
please specify if this requires specialized hardware, a full OpenStack
environment, or can be simulated inside the Monasca tree.

Please discuss how the change will be tested. We especially want to know
what tempest tests will be added. It is assumed that unit test coverage will
be added so that doesn't need to be mentioned explicitly, but discussion of
why you think unit tests are sufficient and we don't need to add more
tempest tests would need to be included.

Is this untestable in gate given current limitations (specific hardware /
software configurations available)? If so, are there mitigation plans (3rd
party testing, gate enhancements, etc.)?
Documentation Impact
====================

Which audiences are affected most by this change, and which documentation
titles on docs.openstack.org should be updated because of this change? Don't
repeat details discussed above, but reference them here in the context of
documentation for multiple audiences. For example, the Operations Guide
targets cloud operators, and the End User Guide would need to be updated if
the change offers a new feature available through the CLI or dashboard. If a
config option changes or is deprecated, note here that the documentation
needs to be updated to reflect this specification's change.

References
==========

Please add any useful references here. You are not required to have any
reference. Moreover, this specification should still make sense when your
references are unavailable. Examples of what you could include are:

* Links to mailing list or IRC discussions
* Links to notes from a summit session
* Links to relevant research, if appropriate
* Related specifications as appropriate (e.g. if it's an EC2 thing, link the
  EC2 docs)
* Anything else you feel it is worthwhile to refer to

History
=======

Optional section intended to be used each time the spec is updated to
describe a new design, API, or any database schema update. Useful to let the
reader understand what has happened over time.

.. list-table:: Revisions
   :header-rows: 1

   * - Release Name
     - Description
   * - Queens
     - Introduced

..
   This work is licensed under a Creative Commons Attribution 3.0 Unported
   License.

   http://creativecommons.org/licenses/by/3.0/legalcode
=====================================================
Python Persister Performance Metrics Collection (WIP)
=====================================================

Story board: https://storyboard.openstack.org/#!/story/2001576

This defines the list of measurements for the metric upsert processing time
and throughput in the Python Persister, and provides a REST API to retrieve
those measurements.

Problem description
===================

The Java Persister, built on top of the DropWizard framework, provides a
list of internal performance-related metrics, e.g., the total number of
metric messages that have been processed since the last service start-up,
and the average, min, and max metric processing times. The Python Persister,
on the other hand, lacks such instrumentation. This presents a challenge to
the operator who wants to monitor, triage, and tune Persister performance,
and to the Persister performance testing tool that was introduced in the
Queens release. The Cassandra Python Persister plugin depends on this
feature for performance tuning.
Use Cases
---------

- Use case 1: The developer instruments the defined performance metrics.

  There are two approaches to the internal performance metrics. The first is
  in-memory metering, similar to the Java implementation: data collection
  starts when the Persister service starts up and is not persisted through a
  service restart. The second is to treat such measurements exactly the same
  as the "normal" metrics Monasca collects. The advantage is that such
  metrics will be persisted, and REST APIs are already available to retrieve
  them.

  The list of Persister metrics includes:

  1. Total number of metric upsert requests received and completed on a
     given Persister service instance in a given period of time
  2. Total number of metric upsert requests received and completed on a
     process or thread in a given period of time (P2)
  3. The average, min, and max metric request processing times in a given
     period of time for a given Persister service instance and
     process/thread

- Use case 2: Retrieve Persister performance metrics through the REST API.

  The performance metrics can be retrieved using the list metrics API in the
  Monasca API service.
Proposed change
===============

1. Monasca Persister

   - The Python Persister integrates with monasca-statsd to send count and
     timer metrics
   - The Persister conf adds properties for statsd

2. The Persister performance benchmark tool adds support to retrieve the
   metrics from the Monasca REST API source, in addition to the DropWizard
   admin API.
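The monasca-statsd wiring is not specified here, but the counters and timers
themselves reduce to a small aggregation structure; a sketch of the
count/min/max/average stats in the spirit of the DropWizard metrics the Java
Persister exposes (the class and method names are illustrative, not part of
any Monasca library):

```python
class PerfStats:
    """In-process count / min / max / average processing-time stats,
    akin to what a statsd counter plus timer would report."""

    def __init__(self):
        self.count = 0
        self.total = 0.0
        self.min = None
        self.max = None

    def record(self, seconds):
        """Record one metric upsert request's processing time."""
        self.count += 1
        self.total += seconds
        self.min = seconds if self.min is None else min(self.min, seconds)
        self.max = seconds if self.max is None else max(self.max, seconds)

    @property
    def avg(self):
        return self.total / self.count if self.count else 0.0

stats = PerfStats()
for elapsed in (0.010, 0.030, 0.020):
    stats.record(elapsed)
print(stats.count, stats.min, stats.max, round(stats.avg, 3))
```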
Alternatives
------------

None

Data model impact
-----------------

None

REST API impact
---------------

None

Security impact
---------------

None

Other end user impact
---------------------

None

Performance Impact
------------------

TBD. The statsd calls to update counters and timers are expected to have a
small performance impact.

Other deployer impact
---------------------

No change in deployment of the services.

Developer impact
----------------

None.
Implementation
==============

Assignee(s)
-----------

Contributors are welcome!

Primary assignee:
  jgu

Other contributors:

Work Items
----------

Work items or tasks -- break the feature up into the things that need to be
done to implement it. Those parts might end up being done by different
people, but we're mostly trying to understand the timeline for
implementation.

Dependencies
============

None
Testing
=======

Please discuss the important scenarios needed to test here, as well as
specific edge cases we should be ensuring work correctly. For each scenario
please specify if this requires specialized hardware, a full OpenStack
environment, or can be simulated inside the Monasca tree.

Please discuss how the change will be tested. We especially want to know
what tempest tests will be added. It is assumed that unit test coverage will
be added so that doesn't need to be mentioned explicitly, but discussion of
why you think unit tests are sufficient and we don't need to add more
tempest tests would need to be included.

Is this untestable in gate given current limitations (specific hardware /
software configurations available)? If so, are there mitigation plans (3rd
party testing, gate enhancements, etc.)?

Documentation Impact
====================

Which audiences are affected most by this change, and which documentation
titles on docs.openstack.org should be updated because of this change? Don't
repeat details discussed above, but reference them here in the context of
documentation for multiple audiences. For example, the Operations Guide
targets cloud operators, and the End User Guide would need to be updated if
the change offers a new feature available through the CLI or dashboard. If a
config option changes or is deprecated, note here that the documentation
needs to be updated to reflect this specification's change.

References
==========

Please add any useful references here. You are not required to have any
reference. Moreover, this specification should still make sense when your
references are unavailable. Examples of what you could include are:

* Links to mailing list or IRC discussions
* Links to notes from a summit session
* Links to relevant research, if appropriate
* Related specifications as appropriate (e.g. if it's an EC2 thing, link the
  EC2 docs)
* Anything else you feel it is worthwhile to refer to

History
=======

Optional section intended to be used each time the spec is updated to
describe a new design, API, or any database schema update. Useful to let the
reader understand what has happened over time.

.. list-table:: Revisions
   :header-rows: 1

   * - Release Name
     - Description
   * - Queens
     - Introduced