Metrics retention policy enhancement
Support differentiated metric retention policies based on metric type.

Change-Id: I915376827604bc692cd26b7ed00812c64ee2e3c0
story: 2001576
parent 4aa92c0caa
commit d3d4f84a0b
289
specs/rocky/approved/metrics-retention.rst
Normal file
@ -0,0 +1,289 @@
|
||||
..
|
||||
This work is licensed under a Creative Commons Attribution 3.0 Unported
|
||||
License.
|
||||
|
||||
http://creativecommons.org/licenses/by/3.0/legalcode
|
||||
|
||||
================================================
|
||||
Metric Retention Policy
|
||||
================================================
|
||||
|
||||
Story board: https://storyboard.openstack.org/#!/story/2001576
|
||||
|
||||
A metric retention policy must be in place to keep the disks from filling up.
The retention period should be adjustable for different types of metrics, e.g.,
monitoring vs. metering or aggregate vs. raw meters.
|
||||
|
||||
Problem description
|
||||
===================
|
||||
|
||||
In a cloud of 200 compute hosts, up to one billion metrics can be generated
daily. Unless old metric data is purged regularly, the time series database
disks will fill up within months, if not weeks. Retention requirements can
differ based on the type of metric and the usage model. For example, a customer
may want to preserve metering metrics for months or years while having no
interest in monitoring metrics that are more than a week old. Some customers'
billing systems may pull the metering data on a daily basis, which could
eliminate the need for longer retention of metering metrics. Monasca needs to
support a metric retention policy that can be tailored per metric or metric
type.
|
||||
|
||||
Use Cases
|
||||
---------
|
||||
|
||||
- Use case 1
|
||||
Operator configures default metric retention in the persister configuration.
|
||||
|
||||
The default retention policy is applied if a metric doesn't specify its own
retention policy. The default retention is generally a shorter period of time
and is targeted at monitoring metrics.
|
||||
|
||||
- Use case 2
|
||||
Operator configures the retention policy for the roll up metrics in the
|
||||
Monasca transform. Roll up metrics generally require a longer retention
|
||||
period.
|
||||
|
||||
- Use case 3
|
||||
Operator configures the retention policy for Ceilometer metrics in the
pipeline and mapping configuration file. Metering metrics generally require a
longer retention period.
|
||||
|
||||
- Use case 4
|
||||
The metric agent plugin sets the retention policy when it generates a new
metric. This is mostly a means to override the default retention policy for
monitoring metrics.
|
||||
|
||||
Proposed change
|
||||
===============
|
||||
|
||||
**Posting to get preliminary feedback on the scope of this spec.**
|
||||
|
||||
1. Monasca API
|
||||
Add an optional metric property "TTL" to the create metrics API. TTL is the
number of seconds before the metric is set to expire. If set, the TTL property
will be included when posting a new metric message to Kafka.
|
||||
|
||||
2. Monasca Persister
|
||||
The Persister reads the default retention policy setting from the service
configuration file, in the influxDbConfiguration and cassandraDbConfiguration
sections.
|
||||
::

    # Retention policy may be left blank to indicate default policy.
    retentionPolicy: 7
|
||||
|
||||
It may make more sense to move this property to the metricConfiguration section
and to express it in seconds instead of days.
|
||||
|
||||
The Persister will retrieve the TTL property from the incoming metric message.
If it is not set, the TTL value from the default retention policy will be used
instead.
|
||||
|
||||
The TTL is set in the parameterized database query when persisting the metrics
|
||||
into the time series database, including both Cassandra and InfluxDB.
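The fallback described above can be sketched roughly as follows. This is a
minimal illustration, not the Persister's actual code: the helper name
``effective_ttl``, the table and column names in the CQL string, and the
assumption that the default is still configured in days are all hypothetical.
Only the ``USING TTL`` clause itself is standard CQL.

```python
SECONDS_PER_DAY = 86400


def effective_ttl(metric_message, default_retention_days):
    """Return the TTL in seconds for one incoming metric message.

    The message's own TTL wins; otherwise fall back to the persister's
    default retention policy, assumed here to be configured in days.
    """
    ttl = metric_message.get("TTL")  # property name as proposed in this spec
    if ttl is None:
        return default_retention_days * SECONDS_PER_DAY
    return int(ttl)


# Hypothetical table and column names, for illustration only.
INSERT_CQL = (
    "INSERT INTO measurements (metric_id, time_stamp, value) "
    "VALUES (?, ?, ?) USING TTL ?"
)

monitoring = {"name": "cpu.idle_perc", "value": 97.5}
metering = {"name": "image.size", "value": 1024.0, "TTL": 7776000}

print(effective_ttl(monitoring, default_retention_days=7))
print(effective_ttl(metering, default_retention_days=7))
```

Binding the TTL as a query parameter keeps a single prepared statement usable
for metrics with different retention periods.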
|
||||
|
||||
3. Monasca Ceilometer (aka Ceilosca)
|
||||
Add TTL property in the pipeline-api.yaml
|
||||
::

    - name: image_source
      interval: 30
      # expires after 90 days
      TTL: 7776000
      meters:
          - "image"
          - "image.size"
          - "image.update"
          - "image.upload"
          - "image.delete"
      sinks:
          - meter_sink
|
||||
|
||||
The monasca-ceilometer implementation will parse the new property and set the
TTL when posting a new metric message.
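One possible shape for that parsing step is sketched below. It assumes the
pipeline file has already been loaded into a list of source dictionaries; the
function name ``ttl_by_meter`` and the second source are made up for the
example and are not part of monasca-ceilometer.

```python
def ttl_by_meter(sources):
    """Map each meter name to the TTL of the pipeline source that lists it.

    `sources` is the already-parsed list of pipeline source definitions.
    Sources without a TTL are skipped, so their meters keep the
    persister's default retention.
    """
    result = {}
    for source in sources:
        ttl = source.get("TTL")
        if ttl is None:
            continue
        for meter in source.get("meters", []):
            result[meter] = int(ttl)
    return result


sources = [
    {
        "name": "image_source",
        "interval": 30,
        "TTL": 7776000,
        "meters": ["image", "image.size"],
        "sinks": ["meter_sink"],
    },
    # Hypothetical source without a TTL, to show the fallback path.
    {"name": "other_source", "meters": ["cpu"], "sinks": ["meter_sink"]},
]

ttls = ttl_by_meter(sources)
print(ttls.get("image.size"))
print(ttls.get("cpu"))
```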
|
||||
|
||||
4. Monasca Transform
|
||||
Add TTL property in transform_specs.json
|
||||
::
|
||||
|
||||
    {"aggregation_params_map":{"aggregation_pipeline":{"source":"streaming","usage":"fetch_quantity","setters":["rollup_quantity","set_aggregated_metric_name","set_aggregated_period"],"insert":["prepare_data","insert_data_pre_hourly"]},"aggregated_metric_name":"vm.mem.total_mb_agg","aggregation_period":"hourly", "TTL":"7776000", "aggregation_group_by_list": ["host", "metric_id", "tenant_id", "resource_uuid"],"usage_fetch_operation": "avg","filter_by_list": [],"setter_rollup_group_by_list":["tenant_id"],"setter_rollup_operation": "sum","dimension_list":["aggregation_period","host","project_id"],"pre_hourly_operation":"avg","pre_hourly_group_by_list":["default"]},"metric_group":"vm_mem_total_mb_project","metric_id":"vm_mem_total_mb_project"}
|
||||
|
||||
Monasca-transform implementation will parse the new property and set the TTL
|
||||
when posting new rolled up metric messages.
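As a rough sketch of reading the new property, the snippet below parses a
shortened transform spec and extracts the TTL. The helper ``rollup_ttl`` is
hypothetical, and the spec is trimmed to the fields needed here. It assumes
the TTL is stored as a string of seconds, as in the example above.

```python
import json

# Shortened transform spec; only the fields used below are kept.
spec_line = (
    '{"aggregation_params_map": {"aggregated_metric_name": '
    '"vm.mem.total_mb_agg", "aggregation_period": "hourly", '
    '"TTL": "7776000"}, "metric_id": "vm_mem_total_mb_project"}'
)


def rollup_ttl(spec_json):
    """Return the TTL in seconds from one transform spec, or None if absent."""
    params = json.loads(spec_json)["aggregation_params_map"]
    ttl = params.get("TTL")
    return None if ttl is None else int(ttl)


print(rollup_ttl(spec_line))
```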
|
||||
|
||||
Alternatives
|
||||
------------
|
||||
|
||||
None
|
||||
|
||||
Data model impact
|
||||
-----------------
|
||||
|
||||
None
|
||||
|
||||
REST API impact
|
||||
---------------
|
||||
|
||||
The create metrics API method is changed as follows:
|
||||
|
||||
* Specification change for the create metric api
|
||||
|
||||
* Create metrics
|
||||
|
||||
* Method type: POST
|
||||
|
||||
* Normal http response code(s): No change
|
||||
|
||||
* Expected error http response code(s): no change
|
||||
|
||||
* URL: /v2.0/metrics
|
||||
|
||||
* Parameters: no change
|
||||
|
||||
* Request body: Consists of a single metric object or an array of metric
|
||||
objects. A metric has the following properties:
|
||||
|
||||
* name (string(255), required) - The name of the metric.
|
||||
* dimensions ({string(255): string(255)}, optional) - A dictionary
|
||||
consisting of (key, value) pairs used to uniquely identify a metric.
|
||||
* timestamp (string, required) - The timestamp in milliseconds from the
|
||||
Epoch.
|
||||
* value (float, required) - Value of the metric. Values with base-10
|
||||
exponents greater than 126 or less than -130 are truncated.
|
||||
* value_meta ({string(255): string}(2048), optional) - A dictionary
|
||||
consisting of (key, value) pairs used to add information about the value.
|
||||
Value_meta key value combinations must be 2048 characters or less
|
||||
including '{"":""}' 7 characters total from every json string.
|
||||
* TTL (optional) - Time to live in seconds. If omitted, the default retention
policy configured in the persister applies.
|
||||
|
||||
* Example use case including typical API samples for both data supplied
|
||||
by the caller and the response
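A request body carrying the proposed property might look like the one built
below. This is only a sketch: the metric name, dimension, and value are
illustrative, and the exact spelling of the ``TTL`` key follows this spec
rather than any released API.

```python
import json
import time

# One metric object with the proposed optional TTL property.
metric = {
    "name": "image.size",
    "dimensions": {"resource_id": "example-image"},  # illustrative dimension
    "timestamp": int(time.time() * 1000),  # milliseconds since the Epoch
    "value": 1024.0,
    "TTL": 90 * 24 * 60 * 60,  # expire after 90 days
}

# The request body may be a single metric object or an array of them.
body = json.dumps([metric])
print(body)
```

The body would be sent with POST to /v2.0/metrics; the response codes are
unchanged.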
|
||||
|
||||
Security impact
|
||||
---------------
|
||||
|
||||
None. Security measures already in place for the Monasca API would remain.
|
||||
|
||||
Other end user impact
|
||||
---------------------
|
||||
|
||||
None
|
||||
|
||||
Performance Impact
|
||||
------------------
|
||||
|
||||
This feature has no direct impact on write throughput. However, it allows the
user to set a shorter retention period for monitoring metrics, which can
potentially improve read performance for queries that involve search, grouping,
and filtering, because there are fewer metrics in the table. It also reduces
the storage footprint.
|
||||
|
||||
Other deployer impact
|
||||
---------------------
|
||||
|
||||
No change in deployment of the services.
|
||||
|
||||
For planning, the user now has the option to specify a shorter retention period
|
||||
for monitoring metrics or even per metric or metric category. The disk size
|
||||
should be calculated based upon the retention policy accordingly.
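A back-of-the-envelope calculation along these lines is sketched below. The
bytes per measurement and the split between monitoring and metering traffic
are assumptions chosen for illustration, not measured Monasca figures;
replication and compression are ignored.

```python
def disk_bytes(measurements_per_day, retention_days, bytes_per_measurement):
    """Steady-state storage for one metric category, before replication."""
    return measurements_per_day * retention_days * bytes_per_measurement


# Assumed per-measurement size and traffic split; replace with measured values.
BYTES_PER_MEASUREMENT = 50
monitoring = disk_bytes(900_000_000, 7, BYTES_PER_MEASUREMENT)
metering = disk_bytes(100_000_000, 90, BYTES_PER_MEASUREMENT)
uniform = disk_bytes(1_000_000_000, 90, BYTES_PER_MEASUREMENT)

print(f"tiered retention:  {(monitoring + metering) / 1e12:.3f} TB")
print(f"uniform retention: {uniform / 1e12:.3f} TB")
```

Under these assumptions, keeping only the metering metrics for 90 days needs
far less disk than keeping everything for 90 days.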
|
||||
|
||||
Developer impact
|
||||
----------------
|
||||
|
||||
Monasca agent plugin developers should be aware of the new TTL property
|
||||
now available to them. It is an optional property that is only needed if a
|
||||
different TTL value than the default retention policy in the persister service
|
||||
is needed.
|
||||
|
||||
|
||||
Implementation
|
||||
==============
|
||||
|
||||
Assignee(s)
|
||||
-----------
|
||||
|
||||
Contributors are welcome!
|
||||
|
||||
Primary assignee:
|
||||
|
||||
|
||||
Other contributors:
|
||||
|
||||
|
||||
Work Items
|
||||
----------
|
||||
|
||||
Work items or tasks -- break the feature up into the things that need to be
|
||||
done to implement it. Those parts might end up being done by different people,
|
||||
but we're mostly trying to understand the timeline for implementation.
|
||||
|
||||
|
||||
Dependencies
|
||||
============
|
||||
|
||||
Dependent on retention policy support in the TSDB storage. Both Cassandra
|
||||
and InfluxDB support specifying a retention policy.
|
||||
|
||||
Testing
|
||||
=======
|
||||
|
||||
~Please discuss the important scenarios needed to test here, as well as
|
||||
specific edge cases we should be ensuring work correctly. For each
|
||||
scenario please specify if this requires specialized hardware, a full
|
||||
openstack environment, or can be simulated inside the Monasca tree.~
|
||||
|
||||
~Please discuss how the change will be tested. We especially want to know what
|
||||
tempest tests will be added. It is assumed that unit test coverage will be
|
||||
added so that doesn't need to be mentioned explicitly, but discussion of why
|
||||
you think unit tests are sufficient and we don't need to add more tempest
|
||||
tests would need to be included.~
|
||||
|
||||
~Is this untestable in gate given current limitations (specific hardware /
|
||||
software configurations available)? If so, are there mitigation plans (3rd
|
||||
party testing, gate enhancements, etc).~
|
||||
|
||||
|
||||
Documentation Impact
|
||||
====================
|
||||
|
||||
~Which audiences are affected most by this change, and which documentation
|
||||
titles on docs.openstack.org should be updated because of this change? Don't
|
||||
repeat details discussed above, but reference them here in the context of
|
||||
documentation for multiple audiences. For example, the Operations Guide targets
|
||||
cloud operators, and the End User Guide would need to be updated if the change
|
||||
offers a new feature available through the CLI or dashboard. If a config option
|
||||
changes or is deprecated, note here that the documentation needs to be updated
|
||||
to reflect this specification's change.~
|
||||
|
||||
References
|
||||
==========
|
||||
|
||||
~Please add any useful references here. You are not required to have any
|
||||
reference. Moreover, this specification should still make sense when your
|
||||
references are unavailable. Examples of what you could include are:~
|
||||
|
||||
* ~Links to mailing list or IRC discussions~
|
||||
|
||||
* ~Links to notes from a summit session~
|
||||
|
||||
* ~Links to relevant research, if appropriate~
|
||||
|
||||
* ~Related specifications as appropriate (e.g. if it's an EC2 thing, link the
|
||||
EC2 docs)~
|
||||
|
||||
* ~Anything else you feel it is worthwhile to refer to~
|
||||
|
||||
|
||||
History
|
||||
=======
|
||||
|
||||
Optional section intended to be used each time the spec is updated to describe
|
||||
new design, API or any database schema updated. Useful to let reader understand
|
||||
what's happened along the time.
|
||||
|
||||
.. list-table:: Revisions
|
||||
:header-rows: 1
|
||||
|
||||
* - Release Name
|
||||
- Description
|
||||
* - Queens
|
||||
- Introduced
|
199
specs/rocky/approved/python-persister-metrics.rst
Normal file
@ -0,0 +1,199 @@
|
||||
..
|
||||
This work is licensed under a Creative Commons Attribution 3.0 Unported
|
||||
License.
|
||||
|
||||
http://creativecommons.org/licenses/by/3.0/legalcode
|
||||
|
||||
=====================================================
|
||||
Python Persister Performance Metrics Collection (WIP)
|
||||
=====================================================
|
||||
|
||||
Story board: https://storyboard.openstack.org/#!/story/2001576
|
||||
|
||||
This spec defines the list of measurements for metric upsert processing time
and throughput in the Python Persister and provides a REST API to retrieve
those measurements.
|
||||
|
||||
Problem description
|
||||
===================
|
||||
|
||||
The Java Persister, built on top of the DropWizard framework, provides a list
|
||||
of internal performance related metrics, e.g., the total number of metric
|
||||
messages that have been processed since the last service start up, the average,
|
||||
min and max metric processing time etc. The Python Persister, on the other
|
||||
hand, lacks such instrumentation. This presents a challenge to the operator
|
||||
who wants to monitor, triage, and tune the Persister performance and to the
|
||||
Persister performance testing tool that was introduced in the Queens release. The
|
||||
Cassandra Python Persister plugin depends on this feature for performance
|
||||
tuning.
|
||||
|
||||
Use Cases
|
||||
---------
|
||||
|
||||
- Use case 1: The developer instruments the defined performance metrics.
|
||||
|
||||
There are two approaches to the internal performance metrics. The first
approach is in-memory metering, similar to the Java implementation. The data
collection starts when the Persister service starts up and is not persisted
across service restarts. The second approach is to treat such measurements
exactly the same as the "normal" metrics Monasca collects. The advantage is
that such metrics will be persisted and REST APIs are already available to
retrieve them.
|
||||
The list of Persister metrics includes:
|
||||
|
||||
1. Total number of metric upsert requests received and completed on a given
Persister service instance in a given period of time
2. Total number of metric upsert requests received and completed on a
process or thread in a given period of time (P2)
3. The average, min, and max metric request processing time in a given period
of time for a given Persister service instance and process/thread.
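The measurements listed above could be gathered with something as small as
the following in-memory sketch. The class ``UpsertStats`` is hypothetical and
stands in for whichever approach is chosen. It does not use monasca-statsd,
and the built-in ``sum`` is used in place of a real upsert call.

```python
import time


class UpsertStats:
    """In-memory count and min/avg/max of upsert processing time."""

    def __init__(self):
        self.received = 0
        self.completed = 0
        self.total_seconds = 0.0
        self.min_seconds = None
        self.max_seconds = None

    def record(self, func, *args, **kwargs):
        """Run one upsert callable and record how long it took."""
        self.received += 1
        start = time.monotonic()
        result = func(*args, **kwargs)
        elapsed = time.monotonic() - start
        # Only reached when the call did not raise.
        self.completed += 1
        self.total_seconds += elapsed
        if self.min_seconds is None or elapsed < self.min_seconds:
            self.min_seconds = elapsed
        if self.max_seconds is None or elapsed > self.max_seconds:
            self.max_seconds = elapsed
        return result

    def average_seconds(self):
        """Mean processing time of completed requests, 0.0 if none."""
        return self.total_seconds / self.completed if self.completed else 0.0


stats = UpsertStats()
for batch in ([1, 2, 3], [4, 5]):
    stats.record(sum, batch)  # `sum` stands in for a real upsert call
print(stats.received, stats.completed, stats.average_seconds())
```

A failed upsert increments ``received`` but not ``completed``, which keeps the
two totals from item 1 distinguishable.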
|
||||
|
||||
- Use case 2: Retrieve Persister performance metrics through the REST API.
|
||||
|
||||
The performance metrics can be retrieved using the list metrics api in the
|
||||
Monasca API service.
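As an illustration, a list metrics request for one Persister measurement could
be built as follows. The host, port, metric name, and dimension are
placeholders invented for this sketch. The comma-separated ``key:value``
encoding of the ``dimensions`` query parameter follows the Monasca API
documentation.

```python
from urllib.parse import urlencode


def list_metrics_url(base_url, name, dimensions=None):
    """Build a GET /v2.0/metrics URL filtered by name and dimensions."""
    query = {"name": name}
    if dimensions:
        # Dimensions are passed as key1:value1,key2:value2.
        query["dimensions"] = ",".join(
            f"{key}:{value}" for key, value in sorted(dimensions.items())
        )
    return f"{base_url}/v2.0/metrics?{urlencode(query)}"


# Placeholder endpoint, metric name, and dimension.
url = list_metrics_url(
    "http://monasca.example:8070",
    "monasca.persister.upsert_time",
    {"hostname": "persister-1"},
)
print(url)
```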
|
||||
|
||||
Proposed change
|
||||
===============
|
||||
|
||||
1. Monasca Persister
|
||||
|
||||
- Python Persister integrates with monasca-statsd to send count and timer
|
||||
metrics
|
||||
- The Persister configuration adds properties for statsd
|
||||
|
||||
2. The Persister performance benchmark tool adds support for retrieving the
metrics from the Monasca REST API in addition to the DropWizard admin API.
|
||||
|
||||
Alternatives
|
||||
------------
|
||||
|
||||
None
|
||||
|
||||
Data model impact
|
||||
-----------------
|
||||
|
||||
None
|
||||
|
||||
REST API impact
|
||||
---------------
|
||||
|
||||
None
|
||||
|
||||
Security impact
|
||||
---------------
|
||||
|
||||
None
|
||||
|
||||
Other end user impact
|
||||
---------------------
|
||||
|
||||
None
|
||||
|
||||
Performance Impact
|
||||
------------------
|
||||
|
||||
TBD. The statsd calls to update counters and timers are expected to have a
small performance impact.
|
||||
|
||||
Other deployer impact
|
||||
---------------------
|
||||
|
||||
No change in deployment of the services.
|
||||
|
||||
Developer impact
|
||||
----------------
|
||||
|
||||
None.
|
||||
|
||||
Implementation
|
||||
==============
|
||||
|
||||
Assignee(s)
|
||||
-----------
|
||||
|
||||
Contributors are welcome!
|
||||
|
||||
Primary assignee:
|
||||
jgu
|
||||
|
||||
Other contributors:
|
||||
|
||||
|
||||
Work Items
|
||||
----------
|
||||
|
||||
Work items or tasks -- break the feature up into the things that need to be
|
||||
done to implement it. Those parts might end up being done by different people,
|
||||
but we're mostly trying to understand the timeline for implementation.
|
||||
|
||||
|
||||
Dependencies
|
||||
============
|
||||
|
||||
None
|
||||
|
||||
Testing
|
||||
=======
|
||||
|
||||
Please discuss the important scenarios needed to test here, as well as
|
||||
specific edge cases we should be ensuring work correctly. For each
|
||||
scenario please specify if this requires specialized hardware, a full
|
||||
openstack environment, or can be simulated inside the Monasca tree.
|
||||
|
||||
Please discuss how the change will be tested. We especially want to know what
|
||||
tempest tests will be added. It is assumed that unit test coverage will be
|
||||
added so that doesn't need to be mentioned explicitly, but discussion of why
|
||||
you think unit tests are sufficient and we don't need to add more tempest
|
||||
tests would need to be included.
|
||||
|
||||
Is this untestable in gate given current limitations (specific hardware /
|
||||
software configurations available)? If so, are there mitigation plans (3rd
|
||||
party testing, gate enhancements, etc).
|
||||
|
||||
|
||||
Documentation Impact
|
||||
====================
|
||||
|
||||
Which audiences are affected most by this change, and which documentation
|
||||
titles on docs.openstack.org should be updated because of this change? Don't
|
||||
repeat details discussed above, but reference them here in the context of
|
||||
documentation for multiple audiences. For example, the Operations Guide targets
|
||||
cloud operators, and the End User Guide would need to be updated if the change
|
||||
offers a new feature available through the CLI or dashboard. If a config option
|
||||
changes or is deprecated, note here that the documentation needs to be updated
|
||||
to reflect this specification's change.
|
||||
|
||||
References
|
||||
==========
|
||||
|
||||
Please add any useful references here. You are not required to have any
|
||||
reference. Moreover, this specification should still make sense when your
|
||||
references are unavailable. Examples of what you could include are:
|
||||
|
||||
* Links to mailing list or IRC discussions
|
||||
|
||||
* Links to notes from a summit session
|
||||
|
||||
* Links to relevant research, if appropriate
|
||||
|
||||
* Related specifications as appropriate (e.g. if it's an EC2 thing, link the
|
||||
EC2 docs)
|
||||
|
||||
* Anything else you feel it is worthwhile to refer to
|
||||
|
||||
|
||||
History
|
||||
=======
|
||||
|
||||
Optional section intended to be used each time the spec is updated to describe
|
||||
new design, API or any database schema updated. Useful to let reader understand
|
||||
what's happened along the time.
|
||||
|
||||
.. list-table:: Revisions
|
||||
:header-rows: 1
|
||||
|
||||
* - Release Name
|
||||
- Description
|
||||
* - Queens
|
||||
- Introduced
|