# Monasca Transform Data Formats

There are two data formats used by Monasca Transform. The following sections describe the schema (Spark's DataFrame [1] schema) for each of the two formats.

Note: these are internal formats used by Monasca Transform when aggregating data. If you are a user who wants to create a new aggregation pipeline using the existing framework, you do not need to know or care about these two formats.

As a developer who wants to write new aggregation components, you may need to know how to extend the record store or instance usage data formats with additional fields, or how to write new aggregation components that aggregate data from those additional fields.
## Source Metric

A Monasca metric is first transformed into the record store data format, and later transformed/aggregated using reusable generic aggregation components to derive the instance usage data format.

Example of a Monasca metric:
```json
{
  "metric": {
    "timestamp": 1523323485360.6650390625,
    "name": "monasca.collection_time_sec",
    "dimensions": {
      "hostname": "devstack",
      "component": "monasca-agent",
      "service": "monitoring"
    },
    "value": 0.0340659618,
    "value_meta": null
  },
  "meta": {
    "region": "RegionOne",
    "tenantId": "d6bece1bbeff47c1b8734cd4e544dc02"
  },
  "creation_time": 1523323489
}
```
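To illustrate the transformation described above, here is a minimal pure-Python sketch of how a few record store fields are derived from a source metric. This is a simplified illustration, not the actual Monasca Transform code; `to_record_store_fields` is a hypothetical helper name.

```python
def to_record_store_fields(metric_envelope):
    """Hypothetical sketch: map a Monasca metric envelope to record store fields."""
    metric = metric_envelope["metric"]
    dimensions = metric.get("dimensions", {})
    return {
        "event_quantity": metric["value"],
        # metric.timestamp is in milliseconds; the record store keeps seconds
        "event_timestamp_unix": metric["timestamp"] / 1000,
        "event_quantity_name": metric["name"],
        # tenant_id falls back through several dimension names
        "tenant_id": (dimensions.get("tenant_id")
                      or dimensions.get("tenantid")
                      or dimensions.get("project_id")),
        "host": dimensions.get("hostname"),
    }

metric_envelope = {
    "metric": {"timestamp": 1523323485360.665,
               "name": "monasca.collection_time_sec",
               "dimensions": {"hostname": "devstack"},
               "value": 0.0340659618},
}
record = to_record_store_fields(metric_envelope)
```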
## Record Store Data Format

Data Frame schema (see Spark data types [2]):

| Column Name | Column Data Type | Description |
|---|---|---|
| event_quantity | `pyspark.sql.types.DoubleType` | mapped to `metric.value` |
| event_timestamp_unix | `pyspark.sql.types.DoubleType` | calculated as `metric.timestamp` / 1000 from the source metric |
| event_timestamp_string | `pyspark.sql.types.StringType` | mapped to `metric.timestamp` from the source metric |
| event_type | `pyspark.sql.types.StringType` | placeholder for the future. Mapped to `metric.name` from the source metric |
| event_quantity_name | `pyspark.sql.types.StringType` | mapped to `metric.name` from the source metric |
| event_status | `pyspark.sql.types.StringType` | placeholder for the future. Currently mapped to `metric.dimensions.state` from the source metric |
| event_version | `pyspark.sql.types.StringType` | placeholder for the future. Set to "1.0" |
| record_type | `pyspark.sql.types.StringType` | placeholder for the future. Set to "metrics" |
| resource_uuid | `pyspark.sql.types.StringType` | mapped to `metric.dimensions.instanceId` or `metric.dimensions.resource_id` from the source metric |
| tenant_id | `pyspark.sql.types.StringType` | mapped to `metric.dimensions.tenant_id`, `metric.dimensions.tenantid` or `metric.dimensions.project_id` |
| user_id | `pyspark.sql.types.StringType` | mapped to `meta.userId` |
| region | `pyspark.sql.types.StringType` | placeholder for the future. Mapped to `meta.region`, defaults to `event_processing_params.set_default_region_to` (`pre_transform_spec`) |
| zone | `pyspark.sql.types.StringType` | placeholder for the future. Mapped to `meta.zone`, defaults to `event_processing_params.set_default_zone_to` (`pre_transform_spec`) |
| host | `pyspark.sql.types.StringType` | mapped to `metric.dimensions.hostname` or `metric.value_meta.host` |
| project_id | `pyspark.sql.types.StringType` | mapped to `tenant_id` from the source metric |
| service_group | `pyspark.sql.types.StringType` | placeholder for the future. Mapped to `service_id` in `pre_transform_spec` |
| service_id | `pyspark.sql.types.StringType` | placeholder for the future. Mapped to `service_id` in `pre_transform_spec` |
| event_date | `pyspark.sql.types.StringType` | "YYYY-mm-dd". Extracted from `metric.timestamp` |
| event_hour | `pyspark.sql.types.StringType` | "HH". Extracted from `metric.timestamp` |
| event_minute | `pyspark.sql.types.StringType` | "MM". Extracted from `metric.timestamp` |
| event_second | `pyspark.sql.types.StringType` | "SS". Extracted from `metric.timestamp` |
| metric_group | `pyspark.sql.types.StringType` | identifier for the transform spec group |
| metric_id | `pyspark.sql.types.StringType` | identifier for the transform spec |
| namespace | `pyspark.sql.types.StringType` | mapped to `metric.dimensions.namespace` |
| pod_name | `pyspark.sql.types.StringType` | mapped to `metric.dimensions.pod_name` |
| app | `pyspark.sql.types.StringType` | mapped to `metric.dimensions.app` |
| container_name | `pyspark.sql.types.StringType` | mapped to `metric.dimensions.container_name` |
| interface | `pyspark.sql.types.StringType` | mapped to `metric.dimensions.interface` |
| deployment | `pyspark.sql.types.StringType` | mapped to `metric.dimensions.deployment` |
| daemon_set | `pyspark.sql.types.StringType` | mapped to `metric.dimensions.daemon_set` |
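The `event_date`, `event_hour`, `event_minute` and `event_second` fields are string time parts derived from the millisecond `metric.timestamp`. A minimal sketch of that derivation (a simplified illustration, not the actual implementation; `event_time_parts` is a hypothetical helper):

```python
from datetime import datetime, timezone

def event_time_parts(timestamp_ms):
    """Split a millisecond epoch timestamp into record store time-part strings."""
    dt = datetime.fromtimestamp(timestamp_ms / 1000, tz=timezone.utc)
    return {
        "event_date": dt.strftime("%Y-%m-%d"),   # "YYYY-mm-dd"
        "event_hour": dt.strftime("%H"),         # "HH"
        "event_minute": dt.strftime("%M"),       # "MM"
        "event_second": dt.strftime("%S"),       # "SS"
    }

# timestamp from the example metric above -> 2018-04-10 01:24:45 UTC
parts = event_time_parts(1523323485360.665)
```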
## Instance Usage Data Format

Data Frame schema (see Spark data types [2]):

| Column Name | Column Data Type | Description |
|---|---|---|
| tenant_id | `pyspark.sql.types.StringType` | project_id, defaults to "NA" |
| user_id | `pyspark.sql.types.StringType` | user_id, defaults to "NA" |
| resource_uuid | `pyspark.sql.types.StringType` | resource_id, defaults to "NA" |
| geolocation | `pyspark.sql.types.StringType` | placeholder for the future, defaults to "NA" |
| region | `pyspark.sql.types.StringType` | placeholder for the future, defaults to "NA" |
| zone | `pyspark.sql.types.StringType` | placeholder for the future, defaults to "NA" |
| host | `pyspark.sql.types.StringType` | compute hostname, defaults to "NA" |
| project_id | `pyspark.sql.types.StringType` | project_id, defaults to "NA" |
| aggregated_metric_name | `pyspark.sql.types.StringType` | aggregated metric name, defaults to "NA" |
| firstrecord_timestamp_string | `pyspark.sql.types.StringType` | timestamp of the first metric used to derive this aggregated metric |
| lastrecord_timestamp_string | `pyspark.sql.types.StringType` | timestamp of the last metric used to derive this aggregated metric |
| service_group | `pyspark.sql.types.StringType` | placeholder for the future, defaults to "NA" |
| service_id | `pyspark.sql.types.StringType` | placeholder for the future, defaults to "NA" |
| usage_date | `pyspark.sql.types.StringType` | "YYYY-mm-dd" date |
| usage_hour | `pyspark.sql.types.StringType` | "HH" hour |
| usage_minute | `pyspark.sql.types.StringType` | "MM" minute |
| aggregation_period | `pyspark.sql.types.StringType` | "hourly" or "minutely" |
| namespace | `pyspark.sql.types.StringType` | |
| pod_name | `pyspark.sql.types.StringType` | |
| app | `pyspark.sql.types.StringType` | |
| container_name | `pyspark.sql.types.StringType` | |
| interface | `pyspark.sql.types.StringType` | |
| deployment | `pyspark.sql.types.StringType` | |
| daemon_set | `pyspark.sql.types.StringType` | |
| firstrecord_timestamp_unix | `pyspark.sql.types.DoubleType` | epoch timestamp of the first metric used to derive this aggregated metric |
| lastrecord_timestamp_unix | `pyspark.sql.types.DoubleType` | epoch timestamp of the last metric used to derive this aggregated metric |
| quantity | `pyspark.sql.types.DoubleType` | aggregated metric quantity |
| record_count | `pyspark.sql.types.DoubleType` | number of source metrics that were used to derive this aggregated metric. For informational purposes only. |
| processing_meta | `pyspark.sql.types.MapType(StringType, StringType, True)` | key-value pairs that store additional information, to aid processing |
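To show how record store rows collapse into a single instance usage row, here is a simplified pure-Python sketch of a sum aggregation. The real pipeline does this with Spark DataFrames and the generic aggregation components; `aggregate_usage` is a hypothetical helper, not Monasca Transform API.

```python
def aggregate_usage(records, aggregation_period="hourly"):
    """Hypothetical sketch: collapse record store rows into one instance usage row."""
    timestamps = [r["event_timestamp_unix"] for r in records]
    return {
        "quantity": sum(r["event_quantity"] for r in records),
        "firstrecord_timestamp_unix": min(timestamps),
        "lastrecord_timestamp_unix": max(timestamps),
        "record_count": float(len(records)),  # DoubleType in the schema
        "aggregation_period": aggregation_period,
    }

records = [
    {"event_quantity": 2.0, "event_timestamp_unix": 1523323485.0},
    {"event_quantity": 3.0, "event_timestamp_unix": 1523323545.0},
]
usage = aggregate_usage(records)
```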
## References

[1] Spark SQL, DataFrames and Datasets Guide

[2] Spark DataTypes