# Monasca Transform Data Formats

There are two data formats used by Monasca Transform. The following sections describe the schema (Spark DataFrame [1] schema) for each of the two formats.

Note: These are internal formats used by Monasca Transform when aggregating data. If you are a user who wants to create a new aggregation pipeline using the existing framework, you don't need to know or care about these two formats.

As a developer, if you want to write new aggregation components, you may need to know how to extend the record store or instance usage data formats with any additional fields you need, or how to write new aggregation components that aggregate data from those additional fields.

## Source Metric

This is an example Monasca metric. A Monasca metric is first transformed into the `record_store` data format and later transformed/aggregated using reusable generic aggregation components to derive the `instance_usage` data format.

Example of a Monasca metric:

```json
{"metric":{"timestamp":1523323485360.6650390625,
           "name":"monasca.collection_time_sec",
           "dimensions":{"hostname":"devstack",
                         "component":"monasca-agent",
                         "service":"monitoring"},
           "value":0.0340659618,
           "value_meta":null},
 "meta":{"region":"RegionOne","tenantId":"d6bece1bbeff47c1b8734cd4e544dc02"},
 "creation_time":1523323489}
```

## Record Store Data Format

DataFrame schema:

| Column Name | Column Data Type | Description |
|---|---|---|
| event_quantity | pyspark.sql.types.DoubleType | mapped to metric.value |
| event_timestamp_unix | pyspark.sql.types.DoubleType | calculated as metric.timestamp/1000 from the source metric |
| event_timestamp_string | pyspark.sql.types.StringType | mapped to metric.timestamp from the source metric |
| event_type | pyspark.sql.types.StringType | placeholder for the future; mapped to metric.name from the source metric |
| event_quantity_name | pyspark.sql.types.StringType | mapped to metric.name from the source metric |
| resource_uuid | pyspark.sql.types.StringType | mapped to metric.dimensions.instanceId or metric.dimensions.resource_id from the source metric |
| tenant_id | pyspark.sql.types.StringType | mapped to metric.dimensions.tenant_id, metric.dimensions.tenantid or metric.dimensions.project_id |
| user_id | pyspark.sql.types.StringType | mapped to meta.userId |
| region | pyspark.sql.types.StringType | placeholder for the future; mapped to meta.region, defaults to event_processing_params.set_default_region_to (pre_transform_spec) |
| zone | pyspark.sql.types.StringType | placeholder for the future; mapped to meta.zone, defaults to event_processing_params.set_default_zone_to (pre_transform_spec) |
| host | pyspark.sql.types.StringType | mapped to metric.dimensions.hostname or metric.value_meta.host |
| project_id | pyspark.sql.types.StringType | mapped to the metric's tenant_id |
| event_date | pyspark.sql.types.StringType | "YYYY-mm-dd", extracted from metric.timestamp |
| event_hour | pyspark.sql.types.StringType | "HH", extracted from metric.timestamp |
| event_minute | pyspark.sql.types.StringType | "MM", extracted from metric.timestamp |
| event_second | pyspark.sql.types.StringType | "SS", extracted from metric.timestamp |
| metric_group | pyspark.sql.types.StringType | identifier for the transform spec group |
| metric_id | pyspark.sql.types.StringType | identifier for the transform spec |
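As a rough sketch, the record store schema above could be expressed as a PySpark StructType like the following. Field order and nullability here are assumptions; the authoritative schema is defined in the monasca-transform source:

```python
from pyspark.sql.types import StructType, StructField, DoubleType, StringType

# Illustrative record store schema; mirrors the table above.
record_store_schema = StructType([
    StructField("event_quantity", DoubleType(), True),
    StructField("event_timestamp_unix", DoubleType(), True),
    StructField("event_timestamp_string", StringType(), True),
    StructField("event_type", StringType(), True),
    StructField("event_quantity_name", StringType(), True),
    StructField("resource_uuid", StringType(), True),
    StructField("tenant_id", StringType(), True),
    StructField("user_id", StringType(), True),
    StructField("region", StringType(), True),
    StructField("zone", StringType(), True),
    StructField("host", StringType(), True),
    StructField("project_id", StringType(), True),
    StructField("event_date", StringType(), True),
    StructField("event_hour", StringType(), True),
    StructField("event_minute", StringType(), True),
    StructField("event_second", StringType(), True),
    StructField("metric_group", StringType(), True),
    StructField("metric_id", StringType(), True),
])
```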

## Instance Usage Data Format

DataFrame schema:

| Column Name | Column Data Type | Description |
|---|---|---|
| tenant_id | pyspark.sql.types.StringType | project_id, defaults to NA |
| user_id | pyspark.sql.types.StringType | user_id, defaults to NA |
| resource_uuid | pyspark.sql.types.StringType | resource_id, defaults to NA |
| geolocation | pyspark.sql.types.StringType | placeholder for the future, defaults to NA |
| region | pyspark.sql.types.StringType | placeholder for the future, defaults to NA |
| zone | pyspark.sql.types.StringType | placeholder for the future, defaults to NA |
| host | pyspark.sql.types.StringType | compute hostname, defaults to NA |
| project_id | pyspark.sql.types.StringType | project_id, defaults to NA |
| aggregated_metric_name | pyspark.sql.types.StringType | aggregated metric name, defaults to NA |
| firstrecord_timestamp_string | pyspark.sql.types.StringType | timestamp of the first metric used to derive this aggregated metric |
| lastrecord_timestamp_string | pyspark.sql.types.StringType | timestamp of the last metric used to derive this aggregated metric |
| usage_date | pyspark.sql.types.StringType | "YYYY-mm-dd" date |
| usage_hour | pyspark.sql.types.StringType | "HH" hour |
| usage_minute | pyspark.sql.types.StringType | "MM" minute |
| aggregation_period | pyspark.sql.types.StringType | "hourly" or "minutely" |
| firstrecord_timestamp_unix | pyspark.sql.types.DoubleType | epoch timestamp of the first metric used to derive this aggregated metric |
| lastrecord_timestamp_unix | pyspark.sql.types.DoubleType | epoch timestamp of the last metric used to derive this aggregated metric |
| quantity | pyspark.sql.types.DoubleType | aggregated metric quantity |
| record_count | pyspark.sql.types.DoubleType | number of source metrics used to derive this aggregated metric; for informational purposes only |
| processing_meta | pyspark.sql.types.MapType(pyspark.sql.types.StringType, pyspark.sql.types.StringType, True) | key-value pairs that store additional information, to aid processing |
| extra_data_map | pyspark.sql.types.MapType(pyspark.sql.types.StringType, pyspark.sql.types.StringType, True) | key-value pairs that store the group-by column key-value pairs |
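To show how the two formats relate, here is a hedged PySpark sketch of the kind of group-by that generic aggregation components perform when deriving instance usage rows from record store rows. The column choices and the `record_store_df` input data are illustrative, not the project's actual component code:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("instance-usage-sketch").getOrCreate()

# Assumed input: a DataFrame in the record store format (a single
# made-up row here, just so the example runs end to end).
record_store_df = spark.createDataFrame(
    [(0.034, 1523323485.36, "2018-04-10", "10", "devstack", "d6bece1b")],
    ["event_quantity", "event_timestamp_unix", "event_date",
     "event_hour", "host", "tenant_id"],
)

# Hourly aggregation: one instance usage row per (tenant_id, host, hour).
instance_usage_df = (
    record_store_df
    .groupBy("tenant_id", "host", "event_date", "event_hour")
    .agg(
        F.sum("event_quantity").alias("quantity"),
        F.min("event_timestamp_unix").alias("firstrecord_timestamp_unix"),
        F.max("event_timestamp_unix").alias("lastrecord_timestamp_unix"),
        F.count("event_quantity").cast("double").alias("record_count"),
    )
    .withColumnRenamed("event_date", "usage_date")
    .withColumnRenamed("event_hour", "usage_hour")
    .withColumn("aggregation_period", F.lit("hourly"))
)
```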

## References

[1] Spark SQL, DataFrames and Datasets Guide

[2] Spark DataTypes