From d3d4f84a0b45d24f710a43c8d55451b8265107f2 Mon Sep 17 00:00:00 2001
From: James Gu
Date: Thu, 22 Feb 2018 17:49:49 -0800
Subject: [PATCH] Metrics retention policy enhancement

Support a differentiated metrics retention policy based on metrics type.

Change-Id: I915376827604bc692cd26b7ed00812c64ee2e3c0
story: 2001576
---
 specs/rocky/approved/metrics-retention.rst | 289 ++++++++++++++++++
 .../approved/python-persister-metrics.rst | 199 ++++++++++++
 2 files changed, 488 insertions(+)
 create mode 100644 specs/rocky/approved/metrics-retention.rst
 create mode 100644 specs/rocky/approved/python-persister-metrics.rst

diff --git a/specs/rocky/approved/metrics-retention.rst b/specs/rocky/approved/metrics-retention.rst
new file mode 100644
index 0000000..49619cf
--- /dev/null
+++ b/specs/rocky/approved/metrics-retention.rst
@@ -0,0 +1,289 @@
+..
+   This work is licensed under a Creative Commons Attribution 3.0 Unported
+   License.
+
+   http://creativecommons.org/licenses/by/3.0/legalcode
+
+=======================
+Metric Retention Policy
+=======================
+
+Story board: https://storyboard.openstack.org/#!/story/2001576
+
+A metric retention policy must be in place to keep the time series database
+disks from filling up. The retention period should be adjustable for
+different types of metrics, e.g., monitoring vs. metering or aggregate vs.
+raw meters.
+
+Problem description
+===================
+
+In a cloud of 200 compute hosts, up to one billion metrics can be generated
+daily. The time series database disks will fill up within months, if not
+weeks, unless old metric data is purged regularly. The retention requirement
+can differ based on the type of the metrics and the usage model. For
+example, a customer may want to preserve metering metrics for months or
+years, while having no interest in monitoring metrics more than a week old.
+Some customers' billing systems may pull the metering data on a daily basis,
+which could eliminate the need for longer retention of metering metrics.
+Monasca needs to support a metric retention policy that can be tailored per
+metric or metric type.
+
+Use Cases
+---------
+
+- Use case 1
+  Operator configures the default metric retention in the persister
+  configuration.
+
+  The default retention policy is applied if a metric doesn't specify its
+  own retention policy. This default retention is generally a shorter period
+  of time and is targeted at the monitoring metrics.
+
+- Use case 2
+  Operator configures the retention policy for the roll up metrics in the
+  Monasca transform. Roll up metrics generally require a longer retention
+  period.
+
+- Use case 3
+  Operator configures the retention policy for Ceilometer metrics in the
+  pipeline and mapping configuration file. Metering metrics generally
+  require a longer retention period.
+
+- Use case 4
+  The metric agent plugin sets the retention policy when it generates a new
+  metric. This is mostly a means to override the default retention policy
+  for monitoring metrics.
+
+Proposed change
+===============
+
+**Posting to get preliminary feedback on the scope of this spec.**
+
+1. Monasca API
+
+   Add an optional metric property "TTL" in the create metrics API. TTL is
+   the number of seconds until the metric expires. If set, the TTL property
+   will be included when posting a new metric message to Kafka.
+
+2. Monasca Persister
+
+   The persister reads the default retention policy setting from the service
+   configuration file, in the influxDbConfiguration and
+   cassandraDbConfiguration sections.
+   ::
+
+     # Retention policy may be left blank to indicate default policy.
+     retentionPolicy: 7
+
+   It may make more sense to move this property to the metricConfiguration
+   section and to use seconds instead of days as the unit.
+
+   The persister will retrieve the TTL property from the incoming metric
+   message.
+   If not set, the TTL value from the default retention policy will be used
+   instead.
+
+   The TTL is set in the parameterized database query when persisting the
+   metrics into the time series database, for both Cassandra and InfluxDB.
+
+3. Monasca Ceilometer (aka Ceilosca)
+
+   Add a TTL property in pipeline-api.yaml
+   ::
+
+     - name: image_source
+       interval: 30
+       # expires after 90 days
+       TTL: 7776000
+       meters:
+           - "image"
+           - "image.size"
+           - "image.update"
+           - "image.upload"
+           - "image.delete"
+       sinks:
+           - meter_sink
+
+   The Monasca-ceilometer implementation will parse the new property and set
+   the TTL when posting new metric messages.
+
+4. Monasca Transform
+
+   Add a TTL property in transform_specs.json
+   ::
+
+     {"aggregation_params_map":{"aggregation_pipeline":{"source":"streaming","usage":"fetch_quantity","setters":["rollup_quantity","set_aggregated_metric_name","set_aggregated_period"],"insert":["prepare_data","insert_data_pre_hourly"]},"aggregated_metric_name":"vm.mem.total_mb_agg","aggregation_period":"hourly","TTL":"7776000","aggregation_group_by_list": ["host", "metric_id", "tenant_id", "resource_uuid"],"usage_fetch_operation": "avg","filter_by_list": [],"setter_rollup_group_by_list":["tenant_id"],"setter_rollup_operation": "sum","dimension_list":["aggregation_period","host","project_id"],"pre_hourly_operation":"avg","pre_hourly_group_by_list":["default"]},"metric_group":"vm_mem_total_mb_project","metric_id":"vm_mem_total_mb_project"}
+
+   The Monasca-transform implementation will parse the new property and set
+   the TTL when posting new rolled up metric messages.
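The persister-side TTL fallback described in the Proposed change section can be sketched as follows. This is a minimal illustration only; the function and constant names are hypothetical, not the actual monasca-persister code, and the default is assumed to be the configured retention converted to seconds.

```python
# Hypothetical sketch of how the persister could resolve the effective TTL
# for an incoming metric message. DEFAULT_TTL_SECONDS stands in for the
# retentionPolicy setting from the persister configuration, assumed here to
# be expressed as 7 days converted to seconds.
DEFAULT_TTL_SECONDS = 7 * 24 * 3600


def effective_ttl(metric_message):
    """Return the metric's own TTL if present, else the configured default."""
    ttl = metric_message.get('TTL')
    return int(ttl) if ttl is not None else DEFAULT_TTL_SECONDS


# The resolved value would then be bound into the parameterized upsert,
# e.g. the TTL parameter of a Cassandra statement ending in "USING TTL ?".
```

A metering metric that carries ``TTL: 7776000`` would keep its 90-day expiry, while a monitoring metric without the property would fall back to the 7-day default.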
+
+Alternatives
+------------
+
+None
+
+Data model impact
+-----------------
+
+None
+
+REST API impact
+---------------
+
+The create metrics API is changed as follows:
+
+* Specification change for the create metrics API
+
+  * Create metrics
+
+    * Method type: POST
+
+    * Normal http response code(s): No change
+
+    * Expected error http response code(s): No change
+
+    * URL: /v2.0/metrics
+
+    * Parameters: No change
+
+    * Request body: Consists of a single metric object or an array of metric
+      objects. A metric has the following properties:
+
+      * name (string(255), required) - The name of the metric.
+      * dimensions ({string(255): string(255)}, optional) - A dictionary
+        consisting of (key, value) pairs used to uniquely identify a metric.
+      * timestamp (string, required) - The timestamp in milliseconds from
+        the Epoch.
+      * value (float, required) - Value of the metric. Values with base-10
+        exponents greater than 126 or less than -130 are truncated.
+      * value_meta ({string(255): string}(2048), optional) - A dictionary
+        consisting of (key, value) pairs used to add information about the
+        value. value_meta key/value combinations must be 2048 characters or
+        less, including the 7 characters of '{"":""}' from every JSON
+        string.
+      * TTL (integer, optional) - Time to live in seconds.
+
+Security impact
+---------------
+
+None. Security measures already in place for the Monasca API would remain.
+
+Other end user impact
+---------------------
+
+None
+
+Performance Impact
+------------------
+
+This feature has no direct impact on the write throughput. However, it
+allows the user to set a shorter retention period for monitoring metrics,
+which can potentially improve the read performance of queries that involve
+search, grouping and filtering when there are fewer metrics in the table.
+It also reduces the storage footprint.
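For rough capacity planning, the effect of the retention period on disk usage can be estimated from the daily volume cited in the problem description. The per-point byte cost below is an assumed placeholder, not a measured figure; actual cost varies by database and compression.

```python
# Back-of-envelope steady-state disk sizing under a retention policy.
# metrics_per_day comes from the problem description (200 compute hosts);
# bytes_per_point is an assumption, not a benchmarked value.
metrics_per_day = 1_000_000_000
bytes_per_point = 50          # assumed average on-disk cost per point
retention_days = 7            # e.g. a short monitoring retention

disk_bytes = metrics_per_day * bytes_per_point * retention_days
print(disk_bytes / 10**12)    # roughly 0.35 TB at steady state
```

With a 90-day metering retention the same arithmetic yields about 4.5 TB, which is why differentiating retention per metric type matters for sizing.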
+
+Other deployer impact
+---------------------
+
+No change in deployment of the services.
+
+For capacity planning, the user now has the option to specify a shorter
+retention period for monitoring metrics, or even per metric or metric
+category. The disk size should be calculated from the retention policy
+accordingly.
+
+Developer impact
+----------------
+
+Monasca agent plugin developers should be aware of the new TTL property now
+available to them. It is optional and only needed when a metric requires a
+TTL different from the default retention policy configured in the persister
+service.
+
+
+Implementation
+==============
+
+Assignee(s)
+-----------
+
+Contributors are welcome!
+
+Primary assignee:
+
+
+Other contributors:
+
+
+Work Items
+----------
+
+Work items or tasks -- break the feature up into the things that need to be
+done to implement it. Those parts might end up being done by different
+people, but we're mostly trying to understand the timeline for
+implementation.
+
+
+Dependencies
+============
+
+This feature depends on retention support in the TSDB storage. Both
+Cassandra and InfluxDB support specifying a retention policy.
+
+Testing
+=======
+
+~Please discuss the important scenarios needed to test here, as well as
+specific edge cases we should be ensuring work correctly. For each
+scenario please specify if this requires specialized hardware, a full
+openstack environment, or can be simulated inside the Monasca tree.~
+
+~Please discuss how the change will be tested. We especially want to know
+what tempest tests will be added. It is assumed that unit test coverage
+will be added so that doesn't need to be mentioned explicitly, but
+discussion of why you think unit tests are sufficient and we don't need to
+add more tempest tests would need to be included.~
+
+~Is this untestable in gate given current limitations (specific hardware /
+software configurations available)?
+If so, are there mitigation plans (3rd party testing, gate enhancements,
+etc.)?~
+
+
+Documentation Impact
+====================
+
+~Which audiences are affected most by this change, and which documentation
+titles on docs.openstack.org should be updated because of this change? Don't
+repeat details discussed above, but reference them here in the context of
+documentation for multiple audiences. For example, the Operations Guide
+targets cloud operators, and the End User Guide would need to be updated if
+the change offers a new feature available through the CLI or dashboard. If a
+config option changes or is deprecated, note here that the documentation
+needs to be updated to reflect this specification's change.~
+
+References
+==========
+
+~Please add any useful references here. You are not required to have any
+reference. Moreover, this specification should still make sense when your
+references are unavailable. Examples of what you could include are:~
+
+* ~Links to mailing list or IRC discussions~
+
+* ~Links to notes from a summit session~
+
+* ~Links to relevant research, if appropriate~
+
+* ~Related specifications as appropriate (e.g. if it's an EC2 thing, link
+  the EC2 docs)~
+
+* ~Anything else you feel it is worthwhile to refer to~
+
+
+History
+=======
+
+Optional section intended to be used each time the spec is updated to
+describe new design, API, or database schema updates. Useful to let the
+reader understand what has happened over time.
+
+.. list-table:: Revisions
+   :header-rows: 1
+
+   * - Release Name
+     - Description
+   * - Queens
+     - Introduced

diff --git a/specs/rocky/approved/python-persister-metrics.rst b/specs/rocky/approved/python-persister-metrics.rst
new file mode 100644
index 0000000..d8e216e
--- /dev/null
+++ b/specs/rocky/approved/python-persister-metrics.rst
@@ -0,0 +1,199 @@
+..
+   This work is licensed under a Creative Commons Attribution 3.0 Unported
+   License.
+
+   http://creativecommons.org/licenses/by/3.0/legalcode
+
+=====================================================
+Python Persister Performance Metrics Collection (WIP)
+=====================================================
+
+Story board: https://storyboard.openstack.org/#!/story/2001576
+
+This spec defines the list of measurements for the metric upsert processing
+time and throughput in the Python Persister, and provides a REST API to
+retrieve those measurements.
+
+Problem description
+===================
+
+The Java Persister, built on top of the DropWizard framework, provides a
+list of internal performance related metrics, e.g., the total number of
+metric messages that have been processed since the last service start up,
+and the average, min and max metric processing time. The Python Persister,
+on the other hand, lacks such instrumentation. This presents a challenge
+both to the operator who wants to monitor, triage, and tune the Persister
+performance, and to the Persister performance testing tool that was
+introduced in the Queens release. The Cassandra Python Persister plugin
+depends on this feature for performance tuning.
+
+Use Cases
+---------
+
+- Use case 1: The developer instruments the defined performance metrics.
+
+  There are two approaches to the internal performance metrics. The first
+  approach is in-memory metering, similar to the Java implementation; data
+  collection starts when the Persister service starts up and is not
+  persisted across service restarts. The second approach is to treat such
+  measurements exactly the same as the "normal" metrics Monasca collects.
+  The advantage is that such metrics will be persisted, and REST APIs are
+  already available to retrieve them.
+
+  The list of Persister metrics includes:
+
+  1. Total number of metrics upsert requests received and completed on a
+     given Persister service instance in a given period of time
+  2. Total number of metrics upsert requests received and completed on a
+     process or thread in a given period of time (P2)
+  3. The average, min and max metric request processing time in a given
+     period of time for a given Persister service instance and
+     process/thread.
+
+- Use case 2: Retrieve persister performance metrics through the REST API.
+
+  The performance metrics can be retrieved using the list metrics API in
+  the Monasca API service.
+
+Proposed change
+===============
+
+1. Monasca Persister
+
+   - The Python Persister integrates with monasca-statsd to send count and
+     timer metrics
+   - Add statsd properties to the persister configuration
+
+2. The Persister performance benchmark tool adds support for retrieving the
+   metrics from the Monasca REST API in addition to the DropWizard admin
+   API.
+
+Alternatives
+------------
+
+None
+
+Data model impact
+-----------------
+
+None
+
+REST API impact
+---------------
+
+None
+
+Security impact
+---------------
+
+None
+
+Other end user impact
+---------------------
+
+None
+
+Performance Impact
+------------------
+
+TBD. The statsd calls that update counters and timers are expected to have
+a small performance impact.
+
+Other deployer impact
+---------------------
+
+No change in deployment of the services.
+
+Developer impact
+----------------
+
+None.
+
+Implementation
+==============
+
+Assignee(s)
+-----------
+
+Contributors are welcome!
+
+Primary assignee:
+  jgu
+
+Other contributors:
+
+
+Work Items
+----------
+
+Work items or tasks -- break the feature up into the things that need to be
+done to implement it. Those parts might end up being done by different
+people, but we're mostly trying to understand the timeline for
+implementation.
+
+
+Dependencies
+============
+
+None
+
+Testing
+=======
+
+Please discuss the important scenarios needed to test here, as well as
+specific edge cases we should be ensuring work correctly.
For each +scenario please specify if this requires specialized hardware, a full +openstack environment, or can be simulated inside the Monasca tree. + +Please discuss how the change will be tested. We especially want to know what +tempest tests will be added. It is assumed that unit test coverage will be +added so that doesn't need to be mentioned explicitly, but discussion of why +you think unit tests are sufficient and we don't need to add more tempest +tests would need to be included. + +Is this untestable in gate given current limitations (specific hardware / +software configurations available)? If so, are there mitigation plans (3rd +party testing, gate enhancements, etc). + + +Documentation Impact +==================== + +Which audiences are affected most by this change, and which documentation +titles on docs.openstack.org should be updated because of this change? Don't +repeat details discussed above, but reference them here in the context of +documentation for multiple audiences. For example, the Operations Guide targets +cloud operators, and the End User Guide would need to be updated if the change +offers a new feature available through the CLI or dashboard. If a config option +changes or is deprecated, note here that the documentation needs to be updated +to reflect this specification's change. + +References +========== + +Please add any useful references here. You are not required to have any +reference. Moreover, this specification should still make sense when your +references are unavailable. Examples of what you could include are: + +* Links to mailing list or IRC discussions + +* Links to notes from a summit session + +* Links to relevant research, if appropriate + +* Related specifications as appropriate (e.g. 
if it's an EC2 thing, link the EC2 docs)
+
+* Anything else you feel it is worthwhile to refer to
+
+
+History
+=======
+
+Optional section intended to be used each time the spec is updated to
+describe new design, API, or database schema updates. Useful to let the
+reader understand what has happened over time.
+
+.. list-table:: Revisions
+   :header-rows: 1
+
+   * - Release Name
+     - Description
+   * - Queens
+     - Introduced
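As a rough illustration of the counter and timer instrumentation proposed above for the Python Persister: the sketch below uses hypothetical stand-in classes, not the actual monasca-statsd API, to show the count-and-time semantics the spec calls for.

```python
import time

# Hypothetical stand-ins for statsd-style counter and timer objects; the
# real integration would go through monasca-statsd, whose API may differ.
class Counter:
    def __init__(self):
        self.value = 0

    def increment(self, n=1):
        self.value += n


class Timer:
    def __init__(self):
        self.samples_ms = []

    def record(self, elapsed_ms):
        self.samples_ms.append(elapsed_ms)


metrics_received = Counter()
flush_timer = Timer()


def persist_batch(batch):
    """Upsert a batch of metrics, instrumenting count and elapsed time."""
    start = time.time()
    metrics_received.increment(len(batch))
    # ... database upsert would happen here ...
    flush_timer.record((time.time() - start) * 1000.0)


persist_batch([{'name': 'cpu.idle_perc'}, {'name': 'mem.free_mb'}])
print(metrics_received.value)  # 2
```

From counts and timing samples like these, the average, min and max processing times listed in use case 1 can be derived per service instance or per process/thread.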