WIP Metrics retention policy enhancement

Support differentiable metrics retention policy based on metrics
type.  Also outline alternatives.

Change-Id: I915376827604bc692cd26b7ed00812c64ee2e3c0
story: 2001576
This commit is contained in:
James Gu 2018-02-22 17:49:49 -08:00 committed by Joseph Davis
parent 4aa92c0caa
commit 744f5dc639
2 changed files with 599 additions and 0 deletions

..
   This work is licensed under a Creative Commons Attribution 3.0 Unported
   License.

   http://creativecommons.org/licenses/by/3.0/legalcode
=======================
Metric Retention Policy
=======================

Story board: https://storyboard.openstack.org/#!/story/2001576

A metric retention policy must be in place to keep old metric data from
filling up disks. The retention period should be adjustable for different
types of metrics, e.g., monitoring vs. metering or aggregate vs. raw meters.

Problem description
===================

In a cloud of 200 compute hosts, up to one billion metrics can be generated
daily. The time series database disks will fill up in months, if not weeks,
if old metric data is not purged regularly. The retention requirement can
differ based on the type of the metrics and the usage model. For example, a
customer may want to preserve metering metrics for months or years, while
having no interest in monitoring metrics older than a week. Some customers'
billing systems may pull the metering data on a daily basis, which could
eliminate the need for longer retention of metering metrics. Monasca needs to
support a metric retention policy that can be tailored per metric or metric
type.
Use Cases
---------
- Use case 1a

  Installer sets a default TTL value in configuration.

- Use case 1b

  Installer loads a set of metric-to-TTL mappings, which is stored in the
  Monasca API data store.

- Use case 2

  Monasca API receives a new metric (regardless of source). The metric is
  matched against a dictionary to determine its TTL (or the default value is
  used if there is no match). The TTL is passed with the metric value on to
  the Persister for storage in the TSDB.

  Note that the use cases for monasca-agent to post metrics are unchanged;
  only the processing in the Monasca API changes.

- Use case 3

  Operator uses the Monasca CLI to specify (or modify) a TTL value for a
  metric match string. The match string can be specific, such as
  "cpu.user_perc", or a wildcard string, such as "image.*". The CLI posts the
  request to the Monasca TTL API, where it is validated and then stored in
  the database.

- Use case 4

  Operator uses the Monasca CLI to GET the dictionary of metric:TTL mappings.

- Use case 5 (optional)

  Operator uses the Monasca UI to accomplish use case 3 or 4.
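The match strings in use case 3 can be exact names or wildcards such as
"image.*"; a minimal sketch of how such matching might behave, using Python's
stdlib ``fnmatch`` (the function name and mapping shape are illustrative, not
part of any Monasca API):

```python
from fnmatch import fnmatchcase

def find_ttl(metric_name, ttl_mappings, default_ttl):
    """Return the TTL (in days) for a metric name.

    ttl_mappings maps match strings (exact names or wildcards such as
    "image.*") to TTL values; default_ttl is used when nothing matches.
    """
    for pattern, ttl in ttl_mappings.items():
        if fnmatchcase(metric_name, pattern):
            return ttl
    return default_ttl

mappings = {"cpu.user_perc": 1, "image.*": 30}
print(find_ttl("image.size", mappings, 7))   # wildcard match
print(find_ttl("mem.free_mb", mappings, 7))  # falls back to the default
```

The same idea extends to use case 1b, where the mappings are loaded from the
Monasca API data store instead of being defined inline.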
Proposed change
===============

1. Monasca API

Add a new API for managing the mapping of metrics to TTL values.

TBD - API structure

Add storage for the mapping in the MySQL database. This allows all instances
of the Monasca API to share the configuration dynamically. Create a schema
for storing the metric:TTL dictionary.
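The schema itself is still TBD; one possible shape for the metric:TTL
dictionary, sketched with sqlite3 purely for illustration (the real store
would be MySQL, and every table and column name here is hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE metric_retention (
        id INTEGER PRIMARY KEY,
        match_string TEXT NOT NULL UNIQUE,  -- exact name or wildcard
        dimensions TEXT NOT NULL DEFAULT '',
        retention_days INTEGER NOT NULL
    )
""")
conn.execute(
    "INSERT INTO metric_retention (match_string, dimensions, retention_days)"
    " VALUES (?, ?, ?)", ("cpu.user_perc", "", 1))
row = conn.execute(
    "SELECT retention_days FROM metric_retention"
    " WHERE match_string = 'cpu.user_perc'").fetchone()
print(row[0])
```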
2. Monasca Persister

The Persister reads the default retention policy setting from the service
configuration file, in the influxDbConfiguration and cassandraDbConfiguration
sections.

::

    # Retention policy may be left blank to indicate the default policy.
    # Unit is days.
    retentionPolicy: 7

It may be convenient to allow specifying a unit with the policy value, for
example "7d" for 7 days or "3m" for 3 months.
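If a unit suffix is allowed, the Persister would need to normalize it to
days. A hedged sketch, under the assumption that "d", "w", "m", and "y"
suffixes are supported and that a month is treated as 30 days:

```python
UNIT_DAYS = {"d": 1, "w": 7, "m": 30, "y": 365}  # assumed suffix grammar

def parse_retention(value):
    """Parse a retention setting like 7, "7", "7d", or "3m" into days."""
    text = str(value).strip().lower()
    if text and text[-1] in UNIT_DAYS:
        return int(text[:-1]) * UNIT_DAYS[text[-1]]
    return int(text)

print(parse_retention("7d"))  # 7
print(parse_retention("3m"))  # 90
```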
The Persister will read the TTL property from the incoming metric message. If
it is not set, the TTL value from the default retention policy will be used
instead.

It is expected that, with the addition of this metrics retention feature, the
default retentionPolicy value would be set to a low value, and that metrics
that are to be kept longer would be called out specifically through the
retention API with appropriate values set.

The TTL is set in the parameterized database query when persisting the
metrics into the time series database, for both Cassandra and InfluxDB.

TBD - exact call structures for each TSDB.

Note that this does mean that each storage back end would need customized
code in the Persister to support passing the TTL value. This may also be
possible for Elasticsearch, though that is not part of this initial spec.
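The fallback described above (per-metric TTL if present, otherwise the
configured default) amounts to a few lines; a sketch in which the "ttl" field
name on the metric message is an assumption:

```python
def effective_ttl(metric_message, default_ttl_days):
    """Use the TTL carried in the metric message if present,
    otherwise fall back to the configured default retention."""
    ttl = metric_message.get("ttl")  # field name is an assumption
    return ttl if ttl is not None else default_ttl_days

msg_with_ttl = {"name": "swift.usage", "value": 1.0, "ttl": 365}
msg_without = {"name": "cpu.user_perc", "value": 42.0}
print(effective_ttl(msg_with_ttl, 7))  # per-metric TTL wins
print(effective_ttl(msg_without, 7))   # default applies
```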
3. Monasca CLI (optional)

A new CLI feature could be created to simplify getting the list of TTL
mappings or posting an update to a TTL mapping. This would need Keystone
authentication, as does the existing 'monasca' CLI, and could be added to it.

TBD: whether the current monasca CLI could handle ingesting a JSON structure.

4. Monasca UI (optional)

A new feature could be added to the Monasca UI that would allow a Cloud
Operator to view and edit the list of TTL mappings. Bonus points for allowing
the UI to offer sample metrics and simulate the mapping on the page.
Alternatives
------------

The original proposal was to have monasca-transform, monasca-ceilometer, and
monasca-agent each keep a TTL default setting and have a property to allow
specifying a TTL per metric. This would have also required a change to the
Monasca API to add an optional TTL to the metric POST listener. While this
would have been simpler to implement in the Monasca API, the additional work
to change all the services that originate metrics made this alternative less
appealing.

Another alternative would be to implement a new Monasca retention API as
outlined, but not include dimensions for the metrics. This would allow a much
simpler data structure of key:value pairs, with the key being the unique
match string and the value the standardized TTL value. While the
implementation would be much simpler, the additional power of matching on
dimensions is felt to be beneficial.
Data model impact
-----------------

The Monasca API data model will need to be extended to store the
metric-to-TTL mappings.

TBD - schema
REST API impact
---------------

A new metric retention API endpoint would be added to the Monasca API:

- Post a new mapping
- Update a mapping
- Get a single mapping
- Delete a single mapping
- Get the list of mappings (for backup or verification; format compatible
  with posting a list)
- Post a list of mappings (for install or restore; does this wipe all other
  mappings?)

URLs::

    /v2.0/metrics-retention       (single metric:ttl entry)
    /v2.0/metrics-retention/list  (dict of all entries)

The communication from the Monasca API to the Persister would have the TTL
value added as a parameter.

NOTE: care should be taken in defining the REST API path, as Gnocchi uses
"/metric", which may be confusing to some users.
Open questions around dimensions (e.g. service:xyz or host:node1) and
conflicting matches::

    cpu.user*                         : 2
    cpu.user_perc                     : 1
    cpu.user_perc {dim: host:node1}   : 2
    cpu.user_perc {dim: tenant: coke} : 3

- When multiple match strings apply, does the longest TTL win, or should that
  be an option?
- Should the default be long or short? A short default (1 day for
  development, 7 for production) seems reasonable, with the metrics to keep
  then set longer; with that strategy, the longer TTL should win in case of
  conflict.
- The result must not depend on ordering, since the dictionary is not in a
  guaranteed order (adding a datestamp was suggested, but...).
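One order-independent answer to the "longest wins" question above is to
collect every matching rule and take the longest TTL; a sketch (the rule
shapes are illustrative, and dimension matching is left out):

```python
from fnmatch import fnmatchcase

def resolve_ttl(name, rules, default_ttl):
    """Collect every rule that matches and take the longest TTL,
    so the result does not depend on rule ordering."""
    matches = [ttl for pattern, ttl in rules if fnmatchcase(name, pattern)]
    return max(matches) if matches else default_ttl

rules = [("cpu.user*", 2), ("cpu.user_perc", 1)]
print(resolve_ttl("cpu.user_perc", rules, 7))  # both rules match; 2 wins
print(resolve_ttl("disk.used", rules, 7))      # no match; default applies
```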
JSON structure for PUT/GET to the retention API::

    {
        "match": "cpu.user_perc",
        "dimensions": "",
        "retentionPolicy": "7d"
    }

Special case: to delete a retention policy, give a retentionPolicy value of
null and the entry will be removed from the list.

::

    {
        "match": "cpu.user_time",
        "dimensions": "",
        "retentionPolicy": null
    }
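Applied to an in-memory mapping, the delete-by-null convention could behave
as follows (a sketch only; the real mappings live in the Monasca API
database, and dimensions are ignored here):

```python
def apply_retention_update(mappings, update):
    """Apply one PUT body to the match-string -> TTL mapping.
    A null (None) retentionPolicy removes the entry: the delete case."""
    if update["retentionPolicy"] is None:
        mappings.pop(update["match"], None)
    else:
        mappings[update["match"]] = update["retentionPolicy"]
    return mappings

m = {"cpu.user_time": "7d"}
apply_retention_update(
    m, {"match": "cpu.user_perc", "dimensions": "", "retentionPolicy": "7d"})
apply_retention_update(
    m, {"match": "cpu.user_time", "dimensions": "", "retentionPolicy": None})
print(m)  # only cpu.user_perc remains
```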
Each API method which is either added or changed should have the following:

* Specification change for the create metric API

  * Create metrics
  * Method type: POST
  * Normal HTTP response code(s): no change
  * Expected error HTTP response code(s): no change
  * URL: /v2.0/metrics
  * Parameters: no change
  * Request body: consists of a single metric object or an array of metric
    objects. A metric has the following properties:

    * name (string(255), required) - The name of the metric.
    * dimensions ({string(255): string(255)}, optional) - A dictionary
      consisting of (key, value) pairs used to uniquely identify a metric.
    * timestamp (string, required) - The timestamp in milliseconds from the
      Epoch.
    * value (float, required) - Value of the metric. Values with base-10
      exponents greater than 126 or less than -130 are truncated.
    * value_meta ({string(255): string}(2048), optional) - A dictionary
      consisting of (key, value) pairs used to add information about the
      value. Value_meta key/value combinations must be 2048 characters or
      less, including the 7 characters of '{"":""}' from every JSON string.
    * ttl (integer, optional) - Time to live in seconds.

* Example use case, including typical API samples for both data supplied by
  the caller and the response
Security impact
---------------

None. Security measures already in place for the Monasca API would remain.

Other end user impact
---------------------

None for most users, as access is restricted to Cloud Operators.

A Cloud Operator would have a new responsibility to configure retention for
the metrics.

A future discussion could be had about whether a tenant user should be
granted the ability to set their own retention policies, but generally the
Cloud Operator is responsible for ensuring there are sufficient resources to
meet the retention requirements.
Performance Impact
------------------

This feature has no direct impact on write throughput. However, it allows
the user to set a shorter retention period for monitoring metrics, which can
potentially improve read performance for queries that involve searching,
grouping, and filtering when there are fewer metrics in the table. It also
reduces the storage footprint.

Depending on how complex the metric retention match strings get, there could
be some performance impact on ingestion. TBD.
Other deployer impact
---------------------

No change in deployment of the services. The service could be deployed with
simply a default TTL value in configuration. If the operator desires, a
complete list of TTL values could be loaded as part of the installation
process once the Monasca API is running.

For planning, the user now has the option to specify a shorter retention
period for monitoring metrics, or even per metric or metric category. The
required disk size should be calculated based upon the retention policy
accordingly.
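A back-of-envelope sizing example under a retention policy: the one billion
points per day figure comes from the problem description, while the
bytes-per-point value is an assumed, compression-free illustration:

```python
def disk_gb(points_per_day, retention_days, bytes_per_point=20):
    """Rough on-disk estimate, ignoring compression and indexes.
    bytes_per_point is an assumed figure, not a measured TSDB number."""
    return points_per_day * retention_days * bytes_per_point / 1e9

# 1e9 monitoring points/day kept for 7 days vs. 90 days:
print(round(disk_gb(1e9, 7)))   # 140 (GB)
print(round(disk_gb(1e9, 90)))  # 1800 (GB)
```

The gap between the two results is the planning benefit of a short default
retention for monitoring metrics.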
Developer impact
----------------

Monasca agent plugin developers should be aware of the new TTL property now
available to them. It is an optional property that is only needed if a
different TTL value than the default retention policy in the Persister
service is needed.

Implementation
==============

Assignee(s)
-----------

Contributors are welcome!

Primary assignee:

Other contributors:

Work Items
----------

Work items or tasks -- break the feature up into the things that need to be
done to implement it. Those parts might end up being done by different
people, but we're mostly trying to understand the timeline for
implementation.
Dependencies
============

Dependent on retention policy support in the TSDB storage. Both Cassandra
and InfluxDB support specifying a retention policy.

Testing
=======

Please discuss the important scenarios needed to test here, as well as
specific edge cases we should be ensuring work correctly. For each scenario
please specify if this requires specialized hardware, a full OpenStack
environment, or can be simulated inside the Monasca tree.

Please discuss how the change will be tested. We especially want to know
what tempest tests will be added. It is assumed that unit test coverage will
be added so that doesn't need to be mentioned explicitly, but discussion of
why you think unit tests are sufficient and we don't need to add more
tempest tests would need to be included.

Is this untestable in gate given current limitations (specific hardware /
software configurations available)? If so, are there mitigation plans (3rd
party testing, gate enhancements, etc.)?
Documentation Impact
====================

Which audiences are affected most by this change, and which documentation
titles on docs.openstack.org should be updated because of this change? Don't
repeat details discussed above, but reference them here in the context of
documentation for multiple audiences. For example, the Operations Guide
targets cloud operators, and the End User Guide would need to be updated if
the change offers a new feature available through the CLI or dashboard. If a
config option changes or is deprecated, note here that the documentation
needs to be updated to reflect this specification's change.

References
==========

Please add any useful references here. You are not required to have any
reference. Moreover, this specification should still make sense when your
references are unavailable. Examples of what you could include are:

* Links to mailing list or IRC discussions
* Links to notes from a summit session
* Links to relevant research, if appropriate
* Related specifications as appropriate (e.g. if it's an EC2 thing, link the
  EC2 docs)
* Anything else you feel it is worthwhile to refer to

History
=======

Optional section intended to be used each time the spec is updated to
describe a new design, API, or any database schema update. Useful to let the
reader understand what has happened over time.

.. list-table:: Revisions
   :header-rows: 1

   * - Release Name
     - Description
   * - Queens
     - Introduced

..
   This work is licensed under a Creative Commons Attribution 3.0 Unported
   License.

   http://creativecommons.org/licenses/by/3.0/legalcode
=====================================================
Python Persister Performance Metrics Collection (WIP)
=====================================================

Story board: https://storyboard.openstack.org/#!/story/2001576

This defines the list of measurements for the metric upsert processing time
and throughput in the Python Persister, and provides a REST API to retrieve
those measurements.

Problem description
===================

The Java Persister, built on top of the DropWizard framework, provides a
list of internal performance-related metrics, e.g., the total number of
metric messages that have been processed since the last service start-up,
and the average, min, and max metric processing times. The Python Persister,
on the other hand, lacks such instrumentation. This presents a challenge to
the operator who wants to monitor, triage, and tune Persister performance,
and to the Persister performance testing tool that was introduced in the
Queens release. The Cassandra Python Persister plugin depends on this
feature for performance tuning.
Use Cases
---------

- Use case 1: The developer instruments the defined performance metrics.

  There are two approaches to the internal performance metrics. The first is
  in-memory metering, similar to the Java implementation: data collection
  starts when the Persister service starts up and is not persisted through a
  service restart. The second is to treat such measurements exactly the same
  as the "normal" metrics Monasca collects. The advantage is that such
  metrics will be persisted, and REST APIs are already available to retrieve
  them.

  The list of Persister metrics includes:

  1. Total number of metric upsert requests received and completed on a
     given Persister service instance in a given period of time
  2. Total number of metric upsert requests received and completed on a
     process or thread in a given period of time (P2)
  3. The average, min, and max metric request processing times in a given
     period of time for a given Persister service instance and
     process/thread

- Use case 2: Retrieve Persister performance metrics through the REST API.

  The performance metrics can be retrieved using the list metrics API in the
  Monasca API service.
Proposed change
===============

1. Monasca Persister

   - The Python Persister integrates with monasca-statsd to send count and
     timer metrics
   - The Persister conf adds properties for statsd

2. The Persister performance benchmark tool adds support to retrieve the
   metrics from the Monasca REST API source, in addition to the DropWizard
   admin API.
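The monasca-statsd wiring is not specified here, but the counters and timers
themselves reduce to a small aggregation structure; a sketch of the
count/min/max/average stats in the spirit of the DropWizard metrics the Java
Persister exposes (the class and method names are illustrative, not part of
any Monasca library):

```python
class PerfStats:
    """In-process count / min / max / average processing-time stats,
    akin to what a statsd counter plus timer would report."""

    def __init__(self):
        self.count = 0
        self.total = 0.0
        self.min = None
        self.max = None

    def record(self, seconds):
        """Record one metric upsert request's processing time."""
        self.count += 1
        self.total += seconds
        self.min = seconds if self.min is None else min(self.min, seconds)
        self.max = seconds if self.max is None else max(self.max, seconds)

    @property
    def avg(self):
        return self.total / self.count if self.count else 0.0

stats = PerfStats()
for elapsed in (0.010, 0.030, 0.020):
    stats.record(elapsed)
print(stats.count, stats.min, stats.max, round(stats.avg, 3))
```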
Alternatives
------------

None

Data model impact
-----------------

None

REST API impact
---------------

None

Security impact
---------------

None

Other end user impact
---------------------

None

Performance Impact
------------------

TBD. The statsd calls to update counters and timers are expected to have a
small performance impact.

Other deployer impact
---------------------

No change in deployment of the services.

Developer impact
----------------

None.
Implementation
==============

Assignee(s)
-----------

Contributors are welcome!

Primary assignee:
  jgu

Other contributors:

Work Items
----------

Work items or tasks -- break the feature up into the things that need to be
done to implement it. Those parts might end up being done by different
people, but we're mostly trying to understand the timeline for
implementation.

Dependencies
============

None
Testing
=======

Please discuss the important scenarios needed to test here, as well as
specific edge cases we should be ensuring work correctly. For each scenario
please specify if this requires specialized hardware, a full OpenStack
environment, or can be simulated inside the Monasca tree.

Please discuss how the change will be tested. We especially want to know
what tempest tests will be added. It is assumed that unit test coverage will
be added so that doesn't need to be mentioned explicitly, but discussion of
why you think unit tests are sufficient and we don't need to add more
tempest tests would need to be included.

Is this untestable in gate given current limitations (specific hardware /
software configurations available)? If so, are there mitigation plans (3rd
party testing, gate enhancements, etc.)?

Documentation Impact
====================

Which audiences are affected most by this change, and which documentation
titles on docs.openstack.org should be updated because of this change? Don't
repeat details discussed above, but reference them here in the context of
documentation for multiple audiences. For example, the Operations Guide
targets cloud operators, and the End User Guide would need to be updated if
the change offers a new feature available through the CLI or dashboard. If a
config option changes or is deprecated, note here that the documentation
needs to be updated to reflect this specification's change.

References
==========

Please add any useful references here. You are not required to have any
reference. Moreover, this specification should still make sense when your
references are unavailable. Examples of what you could include are:

* Links to mailing list or IRC discussions
* Links to notes from a summit session
* Links to relevant research, if appropriate
* Related specifications as appropriate (e.g. if it's an EC2 thing, link the
  EC2 docs)
* Anything else you feel it is worthwhile to refer to

History
=======

Optional section intended to be used each time the spec is updated to
describe a new design, API, or any database schema update. Useful to let the
reader understand what has happened over time.

.. list-table:: Revisions
   :header-rows: 1

   * - Release Name
     - Description
   * - Queens
     - Introduced