WIP Metrics retention policy enhancement
Support differentiable metrics retention policy based on metrics type. Also outline alternatives. Change-Id: I915376827604bc692cd26b7ed00812c64ee2e3c0 story: 2001576
This commit is contained in:
parent
4aa92c0caa
commit
744f5dc639
400
specs/rocky/approved/metrics-retention.rst
Normal file
@ -0,0 +1,400 @@
..
 This work is licensed under a Creative Commons Attribution 3.0 Unported
 License.

 http://creativecommons.org/licenses/by/3.0/legalcode

================================================
Metric Retention Policy
================================================

Story board: https://storyboard.openstack.org/#!/story/2001576

A metric retention policy must be in place to keep the disks from filling up.
The retention period should be adjustable for different types of metrics,
e.g., monitoring vs. metering or aggregate vs. raw meters.

Problem description
===================

In a cloud of 200 compute hosts, there can be up to one billion metrics
generated daily. If old metric data is not purged regularly, the time series
database disks will fill up within months, if not weeks. The retention
requirement can differ based on the type of the metrics and the usage
model. For example, a customer may want to preserve the metering metrics
for months or years, while having no interest in monitoring metrics that are
more than a week old. Some customers' billing systems may pull the metering
data on a daily basis, which could eliminate the need for longer retention of
metering metrics. Monasca needs to support a metric retention policy that can
be tailored per metric or metric type.

Use Cases
---------

x - Use case 1
x   Operator configures default metric retention in the persister configuration.

x   The default retention policy is applied if a metric doesn't specify its
x   retention policy. This default retention is generally a shorter period of
x   time and is targeted at the monitoring metrics.

x - Use case 2
x   Operator configures the retention policy for the roll up metrics in the
x   Monasca transform. Roll up metrics generally require a longer retention
x   period.

x - Use case 3
x   Operator configures the retention policy for Ceilometer metrics in the
x   pipeline and mapping configuration file. Metering metrics generally require
x   a longer retention period.

x - Use case 4
x   The metric agent plugin sets the retention policy when it generates a new
x   metric. This is mostly a means to override the default retention policy for
x   monitoring metrics.

- Use case 1a: Installer sets a default TTL value in the configuration.

- Use case 1b: Installer loads a set of metric-to-TTL mappings, which are
  stored in the Monasca API data store.

- Use case 2: Monasca API receives a new metric (regardless of source). The
  metric is looked up in the mapping dictionary to determine its TTL (the
  default value is used if there is no match). The TTL is passed with the
  metric value on to the Persister for storage in the TSDB.

  Note that the use cases for monasca-agent posting metrics are unchanged;
  only the processing in the Monasca API changes.

- Use case 3: Operator uses the Monasca CLI to specify (or modify) a TTL value
  for a metric match string. The match string could be specific, such as
  "cpu.user_perc", or a wildcard string, such as "image.*". The CLI posts the
  request to the Monasca TTL API, where it is validated and then stored in the
  database.

- Use case 4: Operator uses the Monasca CLI to GET the dictionary of
  metric:TTL mappings.

- Use case 5 (optional): Operator uses the Monasca UI to accomplish use case
  3 or 4.


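The lookup described in use cases 2 and 3 can be sketched as follows. This is a minimal illustration only, assuming shell-style wildcards (Python's ``fnmatch``) for match strings and assuming that an exact match takes precedence over a wildcard; the mapping contents, the default value, and the function name are hypothetical and not part of this spec.

```python
from fnmatch import fnmatchcase

# Hypothetical default and mappings (assumptions for illustration only).
DEFAULT_TTL = 7
TTL_MAPPINGS = {
    "cpu.user_perc": 1,   # specific match string
    "image.*": 90,        # wildcard match string
}


def resolve_ttl(metric_name, mappings=TTL_MAPPINGS, default=DEFAULT_TTL):
    """Return the TTL for a metric name, or the default if nothing matches."""
    # Assumption: an exact match takes precedence over wildcard patterns.
    if metric_name in mappings:
        return mappings[metric_name]
    matches = [ttl for pattern, ttl in mappings.items()
               if fnmatchcase(metric_name, pattern)]
    # Assumption: if several wildcard patterns match, keep the longest TTL.
    return max(matches) if matches else default
```

How conflicts between overlapping match strings are resolved is still an open question in this spec, so the precedence used here should be read as one possible choice.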
Proposed change
===============

1. Monasca API

   Add a new API for managing the mapping of metrics to TTL values.

   TBD - API structure

   Add storage for the mapping in the MySQL database. This is to allow
   all instances of Monasca API to share the configuration dynamically.
   Create a schema for storing the metric:TTL dictionary.

2. Monasca Persister

   The Persister reads the default retention policy setting from the
   influxDbConfiguration and cassandraDbConfiguration sections of the service
   configuration file.

   ::

     # Retention policy may be left blank to indicate default policy.
     # Unit is days
     retentionPolicy: 7

   It may be convenient to allow specifying a unit with the policy value, for
   example "7d" for 7 days or "3m" for 3 months.

   The Persister retrieves the TTL property from the incoming metric message.
   If it is not set, the TTL value from the default retention policy will be
   used instead.

   With the addition of this Metrics Retention feature, it is expected that
   the default retentionPolicy value would be set to a low value, and that
   metrics that are to be kept longer would be called out specifically through
   the Retention API with appropriate values set.

   The TTL is set in the parameterized database query when persisting the
   metrics into the time series database, for both Cassandra and InfluxDB.

   TBD - exact call structures for each TSDB.

   Note that this means each storage back end would need customized code in
   the Persister to support passing the TTL value. This may also be possible
   for ElasticSearch, though that is not part of this initial spec.

3. Monasca CLI (optional)

   A new CLI feature could be created to simplify getting the list of TTL
   mappings or posting an update to a TTL mapping. This would need Keystone
   authentication, as does the existing 'monasca' CLI, and could be added to it.

   TBD: whether the current monasca CLI could handle ingesting a JSON structure.

4. Monasca UI (optional)

   A new feature could be added to the Monasca UI that would allow a Cloud
   Operator to view and edit the list of TTL mappings.

   Bonus points for allowing the UI to have sample metrics and simulate the
   mapping on the page.

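The Persister behaviour described in item 2 can be sketched as below. This is only an illustration under stated assumptions: a bare number is read as days (as in the configuration comment), "m" is treated as a 30-day month, the per-metric TTL arrives in the message under a hypothetical ``ttl`` key expressed in seconds, and the CQL table and column names are invented. The exact call structures for each TSDB remain TBD in this spec.

```python
import re

# Assumed unit lengths in seconds; "m" is approximated as a 30-day month.
_UNIT_SECONDS = {"d": 86400, "m": 30 * 86400}
_POLICY_RE = re.compile(r"^(\d+)([dm]?)$")

# Hypothetical Cassandra statement: CQL accepts a bind marker after USING TTL.
# The table and column names are assumptions for illustration.
INSERT_WITH_TTL = (
    "INSERT INTO measurements (metric_id, time_stamp, value) "
    "VALUES (?, ?, ?) USING TTL ?"
)


def parse_retention(value):
    """Convert a retentionPolicy value such as 7, "7d" or "3m" to seconds."""
    match = _POLICY_RE.match(str(value).strip())
    if not match:
        raise ValueError("invalid retention policy: %r" % (value,))
    amount = int(match.group(1))
    unit = match.group(2) or "d"  # a bare number means days
    return amount * _UNIT_SECONDS[unit]


def effective_ttl(message, default_policy):
    """Use the TTL carried in the metric message, else the configured default."""
    ttl = message.get("ttl")  # hypothetical message key, in seconds
    if ttl is not None:
        return int(ttl)
    return parse_retention(default_policy)
```

For example, ``effective_ttl({}, "7d")`` falls back to the default policy, while a message carrying its own TTL overrides it.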
Alternatives
------------

The original proposal was to have monasca-transform, monasca-ceilometer, and
monasca-agent each keep a default TTL setting and have a property to allow
specifying a TTL per metric. This would also have required a change to the
Monasca API to add an optional TTL to the metric POST listener.

While this would have been simpler to implement in the Monasca API, the
additional work to change all the services that originate metrics made this
alternative less appealing.


Another alternative would be to implement a new Monasca Retention API as
outlined, but not include dimensions for the metrics. This would allow a much
simpler data structure of key:value pairs, with the key being the unique match
string and the value the standardized TTL value. While the implementation
would be much simpler, it is felt that the additional power of matching on
dimensions would be beneficial.


Data model impact
-----------------

The Monasca API data model will need to be extended to store the metric to
TTL mappings.

TBD - schema

REST API impact
---------------

A new metric retention API endpoint would be added to the Monasca API,
supporting the following operations:

- Post a new mapping
- Update a mapping
- Get a single mapping
- Delete a single mapping
- Get the list of mappings (for backup or verification), in a format
  compatible with posting a list
- Post a list of mappings (for install or restore). Open question: does this
  wipe all other mappings?

URLs:

- /v2.0/metrics-retention (single metric:TTL entry)
- /v2.0/metrics-retention/list (dict of all entries)


The communication from the Monasca API to the Persister would have the TTL
value added as a parameter.

NOTE: Care should be taken in defining the REST API path, as Gnocchi uses
"/metric", which may be confusing to some users.


Open questions on matching:

Should match entries support dimensions, such as service:xyz or host:node1?
For example:

cpu.user* : 2

cpu.user_perc : 1

cpu.user_perc {dim: host:node1} : 2

cpu.user_perc {dim: tenant: coke} : 3

- When several entries match, does the longest retention win, or is this an
  option?
- Should the default be long or short? Suggested: default short (1 day for
  dev, 7 for production), then set the desired metrics longer. If that
  strategy is used, go longer in case of conflict.
- The result should not depend on order, since a dictionary is not in
  guaranteed order (james suggested adding a datestamp but...)

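One way to answer the conflict question above is "longest retention wins", which also avoids any dependence on entry order. The sketch below illustrates that option only; the rule tuples mirror the example entries, while the matching semantics (shell-style wildcards, all listed dimensions must be present on the metric), the default value, and the function names are assumptions rather than decisions of this spec.

```python
from fnmatch import fnmatchcase

# Hypothetical rules mirroring the notes above: (name pattern, dimensions, TTL).
RULES = [
    ("cpu.user*", {}, 2),
    ("cpu.user_perc", {}, 1),
    ("cpu.user_perc", {"host": "node1"}, 2),
    ("cpu.user_perc", {"tenant": "coke"}, 3),
]


def rule_matches(rule, name, dimensions):
    """A rule matches if the name fits and every rule dimension is present."""
    pattern, required, _ttl = rule
    if not fnmatchcase(name, pattern):
        return False
    return all(dimensions.get(key) == value for key, value in required.items())


def longest_ttl(name, dimensions, rules=RULES, default=1):
    """Return the longest TTL among matching rules, or the default."""
    ttls = [rule[2] for rule in rules if rule_matches(rule, name, dimensions)]
    # max() makes the result independent of the order of the rules.
    return max(ttls) if ttls else default
```

With these rules, a ``cpu.user_perc`` metric for tenant ``coke`` gets a TTL of 3, while the same metric without dimensions gets 2 because the wildcard entry is longer than the exact one.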
JSON structure for PUT/GET to the Retention API

::

  {
    "match": "cpu.user_perc",
    "dimensions": "",
    "retentionPolicy": "7d"
  }

Special case: to delete a retention policy, give a retentionPolicy value of
null (None in Python) and it will be removed from the list.

::

  {
    "match": "cpu.user_time",
    "dimensions": "",
    "retentionPolicy": null
  }


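A client could build and check such a request body before sending it, as in the sketch below. The accepted retentionPolicy format (a number followed by "d" or "m") is an assumption based on the "7d" and "3m" examples in this spec, and the helper name is hypothetical. Python's ``None`` is serialized as JSON ``null``, which covers the delete case.

```python
import json
import re

# Assumed format: digits followed by "d" (days) or "m" (months).
_POLICY_RE = re.compile(r"^\d+[dm]$")


def build_mapping(match, retention_policy, dimensions=""):
    """Return the JSON body for one retention mapping.

    Passing retention_policy=None produces the delete request.
    """
    if not match:
        raise ValueError("match must be a non-empty string")
    if retention_policy is not None and not _POLICY_RE.match(retention_policy):
        raise ValueError("invalid retentionPolicy: %r" % (retention_policy,))
    return json.dumps({
        "match": match,
        "dimensions": dimensions,
        "retentionPolicy": retention_policy,
    })
```

Validating on the client side is optional; the API is still expected to validate the request before storing it.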
----

Each API method which is either added or changed should have the following:

* Specification change for the create metric API

  * Create metrics

  * Method type: POST

  * Normal http response code(s): No change

  * Expected error http response code(s): No change

  * URL: /v2.0/metrics

  * Parameters: No change

  * Request body: Consists of a single metric object or an array of metric
    objects. A metric has the following properties:

    * name (string(255), required) - The name of the metric.
    * dimensions ({string(255): string(255)}, optional) - A dictionary
      consisting of (key, value) pairs used to uniquely identify a metric.
    * timestamp (string, required) - The timestamp in milliseconds from the
      Epoch.
    * value (float, required) - Value of the metric. Values with base-10
      exponents greater than 126 or less than -130 are truncated.
    * value_meta ({string(255): string}(2048), optional) - A dictionary
      consisting of (key, value) pairs used to add information about the value.
      Value_meta key value combinations must be 2048 characters or less
      including '{"":""}' 7 characters total from every json string.
    * TTL - Time to live in seconds.

* Example use case including typical API samples for both data supplied
  by the caller and the response

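As a rough illustration of the request body described above, the sketch below builds one metric object with the optional TTL. The JSON key ``ttl`` is an assumption (the spec only names the property "TTL"), as are the helper name and the sample values; the timestamp is sent as a number of milliseconds here, which is also an assumption about the wire format.

```python
import json
import time


def build_metric(name, value, dimensions=None, ttl_seconds=None,
                 timestamp_ms=None):
    """Build one metric object for POST /v2.0/metrics."""
    if timestamp_ms is None:
        timestamp_ms = int(time.time() * 1000)
    metric = {
        "name": name,
        "dimensions": dimensions or {},
        "timestamp": timestamp_ms,
        "value": float(value),
    }
    # The TTL is optional; leave it out to get the default retention policy.
    if ttl_seconds is not None:
        metric["ttl"] = int(ttl_seconds)  # hypothetical key name
    return metric


# Example body for a metric that should be kept for 7 days.
body = json.dumps(build_metric("cpu.user_perc", 12.5,
                               {"hostname": "node1"},
                               ttl_seconds=7 * 86400,
                               timestamp_ms=1500000000000))
```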
Security impact
---------------

None. Security measures already in place for the Monasca API would remain.

Other end user impact
---------------------

None for most users, as access is restricted to Cloud Operators.
A Cloud Operator would have a new responsibility to configure retention for
the metrics.

A future discussion could be had about whether a tenant user should be granted
the ability to set their own retention policies, but generally the Cloud
Operator is responsible for ensuring there are sufficient resources to meet the
retention requirements.

Performance Impact
------------------

This feature has no direct impact on the write throughput. However, it allows
the user to enable a shorter retention period for monitoring metrics, which
can potentially improve read performance for queries that involve search,
grouping, and filtering, because there are fewer metrics in the table. It
also reduces the storage footprint.

Depending on how complex the metric retention match string gets, there could
be some performance impact. TBD

Other deployer impact
---------------------

No change in deployment of the services.
The service could be deployed with simply a default TTL value in configuration.
If the operator desires, a complete list of TTL values could be loaded at
install time, once the Monasca API is running.

For planning, the user now has the option to specify a shorter retention period
for monitoring metrics, or even per metric or metric category. The disk size
should be calculated based on the retention policy accordingly.

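The disk sizing mentioned above is simple arithmetic: measurements per day, times retention days, times bytes per measurement, summed over metric categories. The sketch below shows the calculation; the split between monitoring and metering metrics and the 20 bytes per stored measurement are purely hypothetical figures, and replication and compression are ignored.

```python
def total_disk_bytes(categories, bytes_per_measurement):
    """Estimate raw storage for a set of metric categories.

    categories maps a category name to (measurements_per_day, retention_days).
    """
    return sum(per_day * days * bytes_per_measurement
               for per_day, days in categories.values())


# Hypothetical split of one billion measurements per day.
plan = {
    "monitoring": (900_000_000, 7),    # short retention
    "metering": (100_000_000, 365),    # long retention
}
estimate = total_disk_bytes(plan, bytes_per_measurement=20)

# The same volume with a single one-year retention for everything.
single_policy = total_disk_bytes({"all": (1_000_000_000, 365)}, 20)
```

Under these assumed numbers the differentiated policy needs roughly 0.86 TB of raw storage, compared with 7.3 TB when every metric is kept for a year.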
Developer impact
----------------

Monasca agent plugin developers should be aware of the new TTL property
now available to them. It is an optional property that is only needed if a
different TTL value than the default retention policy in the Persister service
is needed.


Implementation
==============

Assignee(s)
-----------

Contributors are welcome!

Primary assignee:


Other contributors:


Work Items
----------

Work items or tasks -- break the feature up into the things that need to be
done to implement it. Those parts might end up being done by different people,
but we're mostly trying to understand the timeline for implementation.


Dependencies
============

Dependent on retention policy support in the TSDB storage. Both Cassandra
and InfluxDB support specifying a retention policy.


Testing
=======

~Please discuss the important scenarios needed to test here, as well as
specific edge cases we should be ensuring work correctly. For each
scenario please specify if this requires specialized hardware, a full
openstack environment, or can be simulated inside the Monasca tree.~

~Please discuss how the change will be tested. We especially want to know what
tempest tests will be added. It is assumed that unit test coverage will be
added so that doesn't need to be mentioned explicitly, but discussion of why
you think unit tests are sufficient and we don't need to add more tempest
tests would need to be included.~

~Is this untestable in gate given current limitations (specific hardware /
software configurations available)? If so, are there mitigation plans (3rd
party testing, gate enhancements, etc).~


Documentation Impact
====================

~Which audiences are affected most by this change, and which documentation
titles on docs.openstack.org should be updated because of this change? Don't
repeat details discussed above, but reference them here in the context of
documentation for multiple audiences. For example, the Operations Guide targets
cloud operators, and the End User Guide would need to be updated if the change
offers a new feature available through the CLI or dashboard. If a config option
changes or is deprecated, note here that the documentation needs to be updated
to reflect this specification's change.~

References
==========

~Please add any useful references here. You are not required to have any
reference. Moreover, this specification should still make sense when your
references are unavailable. Examples of what you could include are:~

* ~Links to mailing list or IRC discussions~

* ~Links to notes from a summit session~

* ~Links to relevant research, if appropriate~

* ~Related specifications as appropriate (e.g. if it's an EC2 thing, link the
  EC2 docs)~

* ~Anything else you feel it is worthwhile to refer to~


History
=======

Optional section intended to be used each time the spec is updated to describe
new design, API or any database schema updated. Useful to let reader understand
what's happened along the time.

.. list-table:: Revisions
   :header-rows: 1

   * - Release Name
     - Description
   * - Queens
     - Introduced
199
specs/rocky/approved/python-persister-metrics.rst
Normal file
@ -0,0 +1,199 @@
..
 This work is licensed under a Creative Commons Attribution 3.0 Unported
 License.

 http://creativecommons.org/licenses/by/3.0/legalcode

=====================================================
Python Persister Performance Metrics Collection (WIP)
=====================================================

Story board: https://storyboard.openstack.org/#!/story/2001576

This spec defines the list of measurements for the metric upsert processing
time and throughput in the Python Persister and provides a REST API to
retrieve those measurements.

Problem description
===================

The Java Persister, built on top of the DropWizard framework, provides a list
of internal performance-related metrics, e.g., the total number of metric
messages that have been processed since the last service start up, and the
average, min, and max metric processing time. The Python Persister, on the
other hand, lacks such instrumentation. This presents a challenge to the
operator who wants to monitor, triage, and tune the Persister performance, and
to the Persister performance testing tool that was introduced in the Queens
release. The Cassandra Python Persister plugin depends on this feature for
performance tuning.

Use Cases
---------

- Use case 1: The developer instruments the defined performance metrics.

  There are two approaches towards the internal performance metrics. The first
  approach is in-memory metering similar to the Java implementation. Data
  collection starts when the Persister service starts up and is not preserved
  across service restarts. The second approach is to treat such measurements
  exactly the same as the "normal" metrics Monasca collects. The advantage is
  that such metrics will be persisted and REST APIs are already available to
  retrieve them.

  The list of Persister metrics includes:

  1. Total number of metric upsert requests received and completed on a given
     Persister service instance in a given period of time
  2. Total number of metric upsert requests received and completed on a
     process or thread in a given period of time (P2)
  3. The average, min, and max metric request processing time in a given
     period of time for a given Persister service instance and process/thread.

- Use case 2: Persister performance metrics are retrieved through the REST API.

  The performance metrics can be retrieved using the list metrics API in the
  Monasca API service.

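The first approach in use case 1, in-memory metering, can be sketched with a small thread-safe accumulator that tracks the count and the average, min, and max processing time. The class and method names are hypothetical, and the values are lost on restart, as noted above for this approach.

```python
import threading
import time


class UpsertStats:
    """In-memory count and timing statistics for metric upsert requests."""

    def __init__(self):
        self._lock = threading.Lock()
        self.count = 0
        self.total_seconds = 0.0
        self.min_seconds = None
        self.max_seconds = None

    def record(self, elapsed_seconds):
        """Add one completed request and its processing time."""
        with self._lock:
            self.count += 1
            self.total_seconds += elapsed_seconds
            if self.min_seconds is None or elapsed_seconds < self.min_seconds:
                self.min_seconds = elapsed_seconds
            if self.max_seconds is None or elapsed_seconds > self.max_seconds:
                self.max_seconds = elapsed_seconds

    def average(self):
        """Return the mean processing time, or 0.0 before any request."""
        with self._lock:
            return self.total_seconds / self.count if self.count else 0.0

    def timed(self, func, *args, **kwargs):
        """Call func and record how long it took, even if it raises."""
        start = time.monotonic()
        try:
            return func(*args, **kwargs)
        finally:
            self.record(time.monotonic() - start)
```

A Persister could wrap each batch write with ``stats.timed(...)``; per-process or per-thread figures would need one accumulator per worker.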
Proposed change
===============

1. Monasca Persister

   - Python Persister integrates with monasca-statsd to send count and timer
     metrics
   - Persister conf to add properties for statsd

2. Persister performance benchmark tool adds support to retrieve the metrics
   from the Monasca REST API in addition to the DropWizard admin API.

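For orientation, the sketch below shows what count and timer metrics look like in the plain StatsD line format, sent as UDP datagrams. It does not use the monasca-statsd client, whose API and dimension extensions are not reproduced here; the metric names, helper names, and the ``localhost:8125`` address are assumptions for illustration.

```python
import socket


def format_counter(name, value=1):
    """Format a StatsD counter increment, e.g. "name:5|c"."""
    return "%s:%d|c" % (name, value)


def format_timer(name, milliseconds):
    """Format a StatsD timing sample in milliseconds, e.g. "name:12.5|ms"."""
    return "%s:%s|ms" % (name, milliseconds)


def send(datagram, host="localhost", port=8125):
    """Send one datagram over UDP; delivery is fire-and-forget."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(datagram.encode("utf-8"), (host, port))
    finally:
        sock.close()


# Hypothetical metric names for the Persister.
counter_line = format_counter("persister.metrics.upserted", 5)
timer_line = format_timer("persister.metrics.upsert_time", 12.5)
```

Because UDP is fire-and-forget, the cost to the Persister is limited to formatting and one send call per update, which is consistent with the small performance impact expected in this spec.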
Alternatives
------------

None

Data model impact
-----------------

None

REST API impact
---------------

None

Security impact
---------------

None

Other end user impact
---------------------

None

Performance Impact
------------------

TBD. The statsd calls to update the counter and timer are expected to have a
small performance impact.

Other deployer impact
---------------------

No change in deployment of the services.

Developer impact
----------------

None.

Implementation
==============

Assignee(s)
-----------

Contributors are welcome!

Primary assignee:
  jgu

Other contributors:


Work Items
----------

Work items or tasks -- break the feature up into the things that need to be
done to implement it. Those parts might end up being done by different people,
but we're mostly trying to understand the timeline for implementation.


Dependencies
============

None

Testing
=======

Please discuss the important scenarios needed to test here, as well as
specific edge cases we should be ensuring work correctly. For each
scenario please specify if this requires specialized hardware, a full
openstack environment, or can be simulated inside the Monasca tree.

Please discuss how the change will be tested. We especially want to know what
tempest tests will be added. It is assumed that unit test coverage will be
added so that doesn't need to be mentioned explicitly, but discussion of why
you think unit tests are sufficient and we don't need to add more tempest
tests would need to be included.

Is this untestable in gate given current limitations (specific hardware /
software configurations available)? If so, are there mitigation plans (3rd
party testing, gate enhancements, etc).


Documentation Impact
====================

Which audiences are affected most by this change, and which documentation
titles on docs.openstack.org should be updated because of this change? Don't
repeat details discussed above, but reference them here in the context of
documentation for multiple audiences. For example, the Operations Guide targets
cloud operators, and the End User Guide would need to be updated if the change
offers a new feature available through the CLI or dashboard. If a config option
changes or is deprecated, note here that the documentation needs to be updated
to reflect this specification's change.

References
==========

Please add any useful references here. You are not required to have any
reference. Moreover, this specification should still make sense when your
references are unavailable. Examples of what you could include are:

* Links to mailing list or IRC discussions

* Links to notes from a summit session

* Links to relevant research, if appropriate

* Related specifications as appropriate (e.g. if it's an EC2 thing, link the
  EC2 docs)

* Anything else you feel it is worthwhile to refer to


History
=======

Optional section intended to be used each time the spec is updated to describe
new design, API or any database schema updated. Useful to let reader understand
what's happened along the time.

.. list-table:: Revisions
   :header-rows: 1

   * - Release Name
     - Description
   * - Queens
     - Introduced