Metrics retention policy enhancement
Support differentiated metric retention policies based on metric type.

Change-Id: I915376827604bc692cd26b7ed00812c64ee2e3c0
story: 2001576
parent 4aa92c0caa, commit d3d4f84a0b
@ -0,0 +1,289 @@
..
 This work is licensed under a Creative Commons Attribution 3.0 Unported
 License.

 http://creativecommons.org/licenses/by/3.0/legalcode

================================================
Metric Retention Policy
================================================

Story board: https://storyboard.openstack.org/#!/story/2001576

A metric retention policy must be in place to keep the disks from filling up.
The retention period should be adjustable for different types of metrics,
e.g., monitoring vs. metering, or aggregated vs. raw meters.

Problem description
===================

In a cloud of 200 compute hosts, up to one billion metrics can be generated
daily. Unless old metric data is purged regularly, the time series database
disks will fill up within months, if not weeks. Retention requirements can
differ based on the type of metric and the usage model. For example, a
customer may want to preserve metering metrics for months or years while
having no interest in monitoring metrics that are more than a week old. Some
customers' billing systems may pull the metering data on a daily basis, which
could eliminate the need for longer retention of metering metrics. Monasca
needs to support a metric retention policy that can be tailored per metric or
per metric type.

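
The disk-fill argument can be made concrete with a rough estimate. The sketch
below is illustrative only: the figure of one billion metrics per day comes
from the text above, while the bytes stored per metric and the disk capacity
are assumptions chosen for the example, not measured values.

```python
def days_until_full(disk_bytes, metrics_per_day, bytes_per_metric):
    """Days before the disk fills up when nothing is ever purged."""
    return disk_bytes / (metrics_per_day * bytes_per_metric)


def steady_state_bytes(metrics_per_day, bytes_per_metric, retention_days):
    """Disk space in use once data older than the retention period expires."""
    return metrics_per_day * bytes_per_metric * retention_days


METRICS_PER_DAY = 1_000_000_000  # from the problem description
BYTES_PER_METRIC = 50            # assumption: on-disk bytes per measurement
DISK_BYTES = 10 * 10**12         # assumption: 10 TB of TSDB storage

if __name__ == "__main__":
    print(days_until_full(DISK_BYTES, METRICS_PER_DAY, BYTES_PER_METRIC))
    print(steady_state_bytes(METRICS_PER_DAY, BYTES_PER_METRIC, 7))
```

Under these assumed numbers the disk fills in 200 days without purging, while
a 7-day retention period caps usage at 350 GB.
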

Use Cases
---------

- Use case 1: The operator configures the default metric retention in the
  Persister configuration.

  The default retention policy is applied if a metric doesn't specify its
  own retention policy. This default retention is generally a shorter period
  of time and is targeted at the monitoring metrics.

- Use case 2: The operator configures the retention policy for the roll-up
  metrics in Monasca Transform. Roll-up metrics generally require a longer
  retention period.

- Use case 3: The operator configures the retention policy for Ceilometer
  metrics in the pipeline and mapping configuration file. Metering metrics
  generally require a longer retention period.

- Use case 4: The metric agent plugin sets the retention policy when it
  generates a new metric. This is mostly a means to override the default
  retention policy for monitoring metrics.


Proposed change
===============

**Posting to get preliminary feedback on the scope of this spec.**

1. Monasca API

   Add an optional metric property "TTL" to the create metrics API. TTL is the
   number of seconds before the metric expires. If set, the TTL property is
   included when posting a new metric message to Kafka.


2. Monasca Persister

   The Persister reads the default retention policy setting from the
   influxDbConfiguration and cassandraDbConfiguration sections of the service
   configuration file.

   ::

     # Retention policy may be left blank to indicate default policy.
     retentionPolicy: 7

   It may make more sense to move this property to the metricConfiguration
   section and to express it in seconds instead of days.

   The Persister retrieves the TTL property from the incoming metric message.
   If it is not set, the TTL value from the default retention policy is used
   instead.

   The TTL is set in the parameterized database query when persisting the
   metrics into the time series database, for both Cassandra and InfluxDB.


3. Monasca Ceilometer (aka Ceilosca)

   Add a TTL property to pipeline-api.yaml.

   ::

     - name: image_source
       interval: 30
       # expires after 90 days
       TTL: 7776000
       meters:
           - "image"
           - "image.size"
           - "image.update"
           - "image.upload"
           - "image.delete"
       sinks:
           - meter_sink

   The monasca-ceilometer implementation will parse the new property and set
   the TTL when posting a new metric message.


4. Monasca Transform

   Add a TTL property to transform_specs.json.

   ::

     {"aggregation_params_map":{"aggregation_pipeline":{"source":"streaming","usage":"fetch_quantity","setters":["rollup_quantity","set_aggregated_metric_name","set_aggregated_period"],"insert":["prepare_data","insert_data_pre_hourly"]},"aggregated_metric_name":"vm.mem.total_mb_agg","aggregation_period":"hourly","TTL":"7776000","aggregation_group_by_list": ["host", "metric_id", "tenant_id", "resource_uuid"],"usage_fetch_operation": "avg","filter_by_list": [],"setter_rollup_group_by_list":["tenant_id"],"setter_rollup_operation": "sum","dimension_list":["aggregation_period","host","project_id"],"pre_hourly_operation":"avg","pre_hourly_group_by_list":["default"]},"metric_group":"vm_mem_total_mb_project","metric_id":"vm_mem_total_mb_project"}

   The monasca-transform implementation will parse the new property and set
   the TTL when posting new rolled-up metric messages.

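
The flow described in items 1 to 4 can be sketched end to end as follows. This
is a minimal illustration rather than the Monasca implementation: the function
names, the envelope layout, the placement of ``TTL`` inside the metric body
and the ``measurements`` table are all assumptions made for the example. It
also assumes the ``TTL`` key in the transform spec is quoted so that the line
is valid JSON, and only the Cassandra case is shown.

```python
import json
import time

SECONDS_PER_DAY = 86_400


def ttl_from_transform_spec(spec_line):
    """Read the optional TTL (seconds) from one transform spec JSON line."""
    params = json.loads(spec_line).get("aggregation_params_map", {})
    ttl = params.get("TTL")
    return None if ttl is None else int(ttl)


def build_metric_envelope(metric, tenant_id, region, ttl=None):
    """Build the message a producer would post to Kafka (assumed layout).

    The TTL key is added only when the producer actually set one.
    """
    body = dict(metric)
    if ttl is not None:
        body["TTL"] = int(ttl)
    return json.dumps({
        "metric": body,
        "meta": {"tenantId": tenant_id, "region": region},
        "creation_time": int(time.time()),
    })


def resolve_ttl_seconds(envelope_json, default_retention_days):
    """Persister side: use the metric's TTL, else the configured default."""
    ttl = json.loads(envelope_json)["metric"].get("TTL")
    if ttl is None:
        return default_retention_days * SECONDS_PER_DAY
    return int(ttl)


# Hypothetical table and column names; the TTL is bound as a query parameter.
INSERT_CQL = (
    "INSERT INTO measurements (metric_id, time_stamp, value) "
    "VALUES (?, ?, ?) USING TTL ?"
)

if __name__ == "__main__":
    spec = '{"aggregation_params_map": {"aggregation_period": "hourly", "TTL": "7776000"}}'
    metric = {"name": "vm.mem.total_mb_agg", "timestamp": 1520000000000, "value": 42.0}
    rolled_up = build_metric_envelope(
        metric, "tenant-a", "region-a", ttl_from_transform_spec(spec))
    plain = build_metric_envelope(metric, "tenant-a", "region-a")
    print(resolve_ttl_seconds(rolled_up, 7))  # TTL taken from the spec
    print(resolve_ttl_seconds(plain, 7))      # falls back to the 7-day default
```

Keeping the fallback in one function on the Persister side means producers
that know nothing about TTL keep working unchanged, which matches the
property being optional.
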

Alternatives
------------

None

Data model impact
-----------------

None


REST API impact
---------------

Each API method which is either added or changed should have the following:

* Specification change for the create metric API

  * Create metrics

  * Method type: POST

  * Normal HTTP response code(s): no change

  * Expected error HTTP response code(s): no change

  * URL: /v2.0/metrics

  * Parameters: no change

  * Request body: Consists of a single metric object or an array of metric
    objects. A metric has the following properties:

    * name (string(255), required) - The name of the metric.
    * dimensions ({string(255): string(255)}, optional) - A dictionary
      consisting of (key, value) pairs used to uniquely identify a metric.
    * timestamp (string, required) - The timestamp in milliseconds from the
      Epoch.
    * value (float, required) - Value of the metric. Values with base-10
      exponents greater than 126 or less than -130 are truncated.
    * value_meta ({string(255): string}(2048), optional) - A dictionary
      consisting of (key, value) pairs used to add information about the value.
      Value_meta key value combinations must be 2048 characters or less
      including '{"":""}' 7 characters total from every json string.
    * TTL (optional) - The time to live of the metric, in seconds.

* Example use case including typical API samples for both data supplied
  by the caller and the response

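
A request body carrying the new property might look like the example below.
This is a sketch under stated assumptions: the metric name is borrowed from
the Ceilometer pipeline example, the dimension and value_meta entries are
made up, and the validation rule (TTL must be a positive integer) is an
assumption of this example rather than something the spec defines.

```python
import json


def validate_ttl(metric):
    """Return the TTL in seconds, or None when the property is absent.

    Assumed rule for this sketch: TTL must be a positive integer.
    """
    if "TTL" not in metric:
        return None
    ttl = metric["TTL"]
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(ttl, bool) or not isinstance(ttl, int) or ttl <= 0:
        raise ValueError("TTL must be a positive integer number of seconds")
    return ttl


# Example body for POST /v2.0/metrics; dimension and value_meta are made up.
example_body = {
    "name": "image.size",
    "dimensions": {"resource_id": "example-image-id"},
    "timestamp": 1520000000000,
    "value": 1024.0,
    "value_meta": {"unit": "B"},
    "TTL": 7776000,
}

if __name__ == "__main__":
    print(json.dumps(example_body, indent=2))
    print(validate_ttl(example_body))
```

A body without the ``TTL`` key stays valid, so existing clients are not
affected.
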

Security impact
---------------

None. Security measures already in place for the Monasca API would remain.

Other end user impact
---------------------

None

Performance Impact
------------------

This feature has no direct impact on write throughput. However, it allows
the user to set a shorter retention period for monitoring metrics, which can
potentially improve read performance for queries that involve search,
grouping and filtering, because there are fewer metrics in the table. It also
reduces the storage footprint.


Other deployer impact
---------------------

No change in deployment of the services.

For planning, the user now has the option to specify a shorter retention
period for monitoring metrics, or even a retention period per metric or
metric category. The disk size should be calculated according to the
retention policy.

Developer impact
----------------

Monasca agent plugin developers should be aware of the new TTL property
now available to them. It is an optional property that is only needed when a
metric requires a TTL value different from the default retention policy in
the Persister service.


Implementation
==============

Assignee(s)
-----------

Contributors are welcome!

Primary assignee:

Other contributors:

Work Items
----------

Work items or tasks -- break the feature up into the things that need to be
done to implement it. Those parts might end up being done by different people,
but we're mostly trying to understand the timeline for implementation.

Dependencies
============

This feature depends on retention policy support in the TSDB storage. Both
Cassandra and InfluxDB support specifying a retention policy.


Testing
=======

~Please discuss the important scenarios needed to test here, as well as
specific edge cases we should be ensuring work correctly. For each
scenario please specify if this requires specialized hardware, a full
openstack environment, or can be simulated inside the Monasca tree.~

~Please discuss how the change will be tested. We especially want to know what
tempest tests will be added. It is assumed that unit test coverage will be
added so that doesn't need to be mentioned explicitly, but discussion of why
you think unit tests are sufficient and we don't need to add more tempest
tests would need to be included.~

~Is this untestable in gate given current limitations (specific hardware /
software configurations available)? If so, are there mitigation plans (3rd
party testing, gate enhancements, etc).~

Documentation Impact
====================

~Which audiences are affected most by this change, and which documentation
titles on docs.openstack.org should be updated because of this change? Don't
repeat details discussed above, but reference them here in the context of
documentation for multiple audiences. For example, the Operations Guide targets
cloud operators, and the End User Guide would need to be updated if the change
offers a new feature available through the CLI or dashboard. If a config option
changes or is deprecated, note here that the documentation needs to be updated
to reflect this specification's change.~


References
==========

~Please add any useful references here. You are not required to have any
reference. Moreover, this specification should still make sense when your
references are unavailable. Examples of what you could include are:~

* ~Links to mailing list or IRC discussions~

* ~Links to notes from a summit session~

* ~Links to relevant research, if appropriate~

* ~Related specifications as appropriate (e.g. if it's an EC2 thing, link the
  EC2 docs)~

* ~Anything else you feel it is worthwhile to refer to~

History
=======

Optional section intended to be used each time the spec is updated to describe
new design, API or any database schema updated. Useful to let reader understand
what's happened along the time.

.. list-table:: Revisions
   :header-rows: 1

   * - Release Name
     - Description
   * - Queens
     - Introduced

@ -0,0 +1,199 @@
..
 This work is licensed under a Creative Commons Attribution 3.0 Unported
 License.

 http://creativecommons.org/licenses/by/3.0/legalcode

=====================================================
Python Persister Performance Metrics Collection (WIP)
=====================================================

Story board: https://storyboard.openstack.org/#!/story/2001576

This defines the list of measurements for the metric upsert processing time and
throughput in the Python Persister and provides a REST API to retrieve those
measurements.


Problem description
===================

The Java Persister, built on top of the DropWizard framework, provides a list
of internal performance-related metrics, e.g., the total number of metric
messages that have been processed since the last service start-up, and the
average, min and max metric processing times. The Python Persister, on the
other hand, lacks such instrumentation. This presents a challenge to the
operator who wants to monitor, triage, and tune the Persister performance, and
to the Persister performance testing tool that was introduced in the Queens
release. The Cassandra Python Persister plugin depends on this feature for
performance tuning.


Use Cases
---------

- Use case 1: The developer instruments the defined performance metrics.

  There are two approaches to the internal performance metrics. The first
  approach is in-memory metering, similar to the Java implementation. Data
  collection starts when the Persister service starts up, and the data is not
  preserved across service restarts. The second approach is to treat such
  measurements exactly the same as the "normal" metrics Monasca collects. The
  advantage is that such metrics are persisted and REST APIs are already
  available to retrieve them.

  The list of Persister metrics includes:

  1. Total number of metric upsert requests received and completed on a given
     Persister service instance in a given period of time
  2. Total number of metric upsert requests received and completed on a
     process or thread in a given period of time (P2)
  3. The average, min and max metric request processing time in a given
     period of time for a given Persister service instance and process/thread.

- Use case 2: Persister performance metrics are retrieved through the REST
  API.

  The performance metrics can be retrieved using the list metrics API in the
  Monasca API service.

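
The first approach from use case 1 (in-memory metering) could look roughly
like the sketch below. The class name ``UpsertStats`` and its methods are
hypothetical and not part of the Persister; the sketch only shows how the
listed count, average, min and max values can be tracked per process, and,
as noted above, such data is lost on restart.

```python
import threading


class UpsertStats:
    """In-memory counters for upsert requests and their processing time."""

    def __init__(self):
        self._lock = threading.Lock()
        self._count = 0
        self._total = 0.0
        self._min = None
        self._max = None

    def record(self, seconds):
        """Register one completed upsert that took `seconds` to process."""
        with self._lock:
            self._count += 1
            self._total += seconds
            self._min = seconds if self._min is None else min(self._min, seconds)
            self._max = seconds if self._max is None else max(self._max, seconds)

    def snapshot(self):
        """Return count, avg, min and max; avg is None before any record."""
        with self._lock:
            avg = self._total / self._count if self._count else None
            return {"count": self._count, "avg": avg,
                    "min": self._min, "max": self._max}


if __name__ == "__main__":
    stats = UpsertStats()
    for duration in (0.25, 0.5, 0.75):
        stats.record(duration)
    print(stats.snapshot())
```

The lock keeps the counters consistent when several threads of one Persister
process record measurements concurrently.
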

Proposed change
===============

1. Monasca Persister

   - The Python Persister integrates with monasca-statsd to send counter and
     timer metrics.
   - The Persister configuration adds properties for statsd.

2. The Persister performance benchmark tool adds support for retrieving the
   metrics from the Monasca REST API, in addition to the DropWizard admin API.

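
To illustrate what sending counter and timer metrics involves, the sketch
below emits plain statsd datagrams over UDP using only the standard library.
It is a stand-in for monasca-statsd, not its API: the ``UdpStatsd`` class, the
metric names, and the host and port defaults are assumptions made for the
example.

```python
import socket
import time
from contextlib import contextmanager


def format_counter(name, value=1):
    """Format a statsd counter datagram, e.g. 'name:1|c'."""
    return f"{name}:{value}|c"


def format_timer(name, milliseconds):
    """Format a statsd timer datagram, e.g. 'name:12.500|ms'."""
    return f"{name}:{milliseconds:.3f}|ms"


class UdpStatsd:
    """Minimal fire-and-forget statsd sender (hypothetical helper)."""

    def __init__(self, host="127.0.0.1", port=8125):
        self._addr = (host, port)
        self._sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

    def send(self, datagram):
        try:
            self._sock.sendto(datagram.encode("utf-8"), self._addr)
        except OSError:
            pass  # metrics are best effort; never fail the upsert path

    @contextmanager
    def timed(self, name):
        """Time the wrapped block and send the elapsed milliseconds."""
        start = time.monotonic()
        try:
            yield
        finally:
            elapsed_ms = (time.monotonic() - start) * 1000.0
            self.send(format_timer(name, elapsed_ms))


if __name__ == "__main__":
    statsd = UdpStatsd()
    with statsd.timed("persister.upsert_time"):
        time.sleep(0.01)  # placeholder for the real batch upsert
    statsd.send(format_counter("persister.upsert_count", 100))
```

Because the datagrams are sent without waiting for a reply, the cost added to
each upsert batch is limited to formatting a short string and one socket call.
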

Alternatives
------------

None

Data model impact
-----------------

None

REST API impact
---------------

None

Security impact
---------------

None

Other end user impact
---------------------

None

Performance Impact
------------------

TBD. The statsd calls that update the counters and timers are expected to have
a small performance impact.


Other deployer impact
---------------------

No change in deployment of the services.

Developer impact
----------------

None.

Implementation
==============

Assignee(s)
-----------

Contributors are welcome!

Primary assignee:
  jgu

Other contributors:

Work Items
----------

Work items or tasks -- break the feature up into the things that need to be
done to implement it. Those parts might end up being done by different people,
but we're mostly trying to understand the timeline for implementation.

Dependencies
============

None


Testing
=======

Please discuss the important scenarios needed to test here, as well as
specific edge cases we should be ensuring work correctly. For each
scenario please specify if this requires specialized hardware, a full
openstack environment, or can be simulated inside the Monasca tree.

Please discuss how the change will be tested. We especially want to know what
tempest tests will be added. It is assumed that unit test coverage will be
added so that doesn't need to be mentioned explicitly, but discussion of why
you think unit tests are sufficient and we don't need to add more tempest
tests would need to be included.

Is this untestable in gate given current limitations (specific hardware /
software configurations available)? If so, are there mitigation plans (3rd
party testing, gate enhancements, etc).

Documentation Impact
====================

Which audiences are affected most by this change, and which documentation
titles on docs.openstack.org should be updated because of this change? Don't
repeat details discussed above, but reference them here in the context of
documentation for multiple audiences. For example, the Operations Guide targets
cloud operators, and the End User Guide would need to be updated if the change
offers a new feature available through the CLI or dashboard. If a config option
changes or is deprecated, note here that the documentation needs to be updated
to reflect this specification's change.


References
==========

Please add any useful references here. You are not required to have any
reference. Moreover, this specification should still make sense when your
references are unavailable. Examples of what you could include are:

* Links to mailing list or IRC discussions

* Links to notes from a summit session

* Links to relevant research, if appropriate

* Related specifications as appropriate (e.g. if it's an EC2 thing, link the
  EC2 docs)

* Anything else you feel it is worthwhile to refer to

History
=======

Optional section intended to be used each time the spec is updated to describe
new design, API or any database schema updated. Useful to let reader understand
what's happened along the time.

.. list-table:: Revisions
   :header-rows: 1

   * - Release Name
     - Description
   * - Queens
     - Introduced