ETSI NFVI software modification specification

https://storyboard.openstack.org/#!/story/2006557

Implement the interfacing between the VNFM and Fenix that is specified in the ETSI FEAT03 related documentation. Limit the current changes to instances and instance groups.

Problem description

This feature addresses the support for the coordination of the NFVI software modification process with the VNFs hosted on the NFVI in order to minimize impact on service availability.

Use Cases

Guarantee zero impact on the VNF service during Fenix infrastructure maintenance, upgrade and scaling workflow operations. This implies that the VNF and the VNFM support the ETSI specification and the Fenix interaction.

Proposed change

Implement APIs to set VNF specific instance and instance group variables.

The new APIs are used to change VNF project instance and instance group data in the Fenix database. These constraints might be set in the VNFD, or the VNF element manager can change them at any time according to the current VNF load level. Having the constraints makes it possible to optimize the infrastructure maintenance operation, as we can scale down the VNFs as much as possible and therefore maintain as many compute nodes in parallel as possible. An instance group can consist of the instances belonging to a certain anti-affinity group, but all instances need to be grouped, so that we know how many of them are needed at minimum and how many of them can be exposed to maintenance at the same time. If nothing else applies, a group means the instances of a certain flavor.
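The grouping rule above can be illustrated with a minimal sketch. It assumes each instance is described by a plain dictionary; the keys `server_group_id` and `flavor` are hypothetical names chosen for this illustration and are not part of the Fenix API.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def group_instances(instances: Iterable[dict]) -> Dict[str, List[str]]:
    """Put every instance into exactly one group.

    An instance that belongs to an anti-affinity group is grouped by it.
    Otherwise the flavor is used as the fallback group.
    The keys 'server_group_id' and 'flavor' are hypothetical.
    """
    groups: Dict[str, List[str]] = defaultdict(list)
    for inst in instances:
        key = inst.get("server_group_id") or "flavor:%s" % inst["flavor"]
        groups[key].append(inst["instance_id"])
    return dict(groups)
```

With this kind of grouping no instance is left without a group, which is the property the constraints rely on.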

Make an example workflow that supports the usage of these APIs. Workflow should implement one example rolling maintenance use case. Existing Fenix interaction towards VNFM will be utilized with small changes.

The variables common to the instance and the instance group can be overridden in the instance object. Both objects can be updated at any time. An update is taken into account in any action that is not currently ongoing. An existing timer would not be updated. These objects alone are not enough to optimize the infrastructure workflow. The existing Fenix interaction is also needed to keep the maintenance window as small as possible. It also allows upgrading the VNF with new infrastructure capabilities with no additional impact on VNF service availability, if done at the same time as the infrastructure upgrade.
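As a minimal sketch of the override rule, the function below treats the group value as a default that the instance object may replace. It assumes that `resource_mitigation` is the variable common to both objects, since it is the only key that appears in both structures in this specification; the function name is hypothetical.

```python
from typing import Any, Dict

# Assumption: only resource_mitigation is shared by both objects.
COMMON_KEYS = ("resource_mitigation",)


def effective_constraints(group: Dict[str, Any],
                          instance: Dict[str, Any]) -> Dict[str, Any]:
    """Return the instance constraints with group values as defaults."""
    merged = dict(instance)
    for key in COMMON_KEYS:
        # The instance value wins; the group value only fills a gap.
        if key not in merged and key in group:
            merged[key] = group[key]
    return merged
```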

The diagram below illustrates the existing Fenix workflow, where the application manager updates the instance and instance group constraints whenever instances are created or deleted. Constraints can also be updated at any time if the current level of VNF service allows a different number of instances.

seqdiag {
  activation = none;
  app-manager --> fenix [label = "Update instance and instance group constraints anytime and when created"]
  === --- ===
  infra-admin -> fenix [label = "Maintenance session n for hosts", note="Start the maintenance process"];
  fenix -> app-manager [label = "MAINTENANCE"];
  app-manager -> fenix [label = "ACK_MAINTENANCE"];
  fenix --> app-manager [label = "IN_SCALE", note="Optional down scale"];
  app-manager --> fenix [label = "Remove instance related constraints of scaled down instances. Update instance groups constraints to match scaling"]
  app-manager --> fenix [label = "ACK_IN_SCALE"]
  fenix --> app-manager [label = "PREPARE_MAINTENANCE", note="If there is not empty host Fenix makes one"]
  app-manager --> fenix [label = "ACK_PREPARE_MAINTENANCE"]
  fenix --> app-manager [label = "ADMIN_ACTION_DONE"]
  === Repeated for every compute ===
  fenix -> app-manager [label = "PLANNED_MAINTENANCE", note="If VM-s are on the host. Migrate or Live migrate"]
  app-manager -> fenix [label = "ACK_PLANNED_MAINTENANCE"]
  fenix --> app-manager [label = "ADMIN_ACTION_DONE"]
  fenix --> app-manager [label = "IN_MAINTENANCE"]
  ... Actual maintenance happens here ...
  fenix --> app-manager [label = "MAINTENANCE_COMPLETE"]
  === --- ===
  fenix --> app-manager [label = "MAINTENANCE_COMPLETE", note="Maintenance is done"]
  app-manager --> fenix [label = "Add instance constraints of instances possibly added when scaling up when maintenance is completed. Update instance groups constraints to match scaling"]
  app-manager --> fenix [label = "ACK_MAINTENANCE_COMPLETE", note="Up scale"]
}

Alternatives

N/A

Data model impact

Fenix database will need to have new tables to support instance and instance group objects.

REST API impact

All APIs return 200 OK on success. Error codes will be defined during implementation.

API PUT /v1/instance/{instance_id} is used to update instance object. API GET /v1/instance/{instance_id} is used to get instance object. PUT API should have this structure as input and GET API as return:

{
    "instance_id": "instance_UUID string",
    "project_id": "Project UUID string",
    "group_id": "group_UUID string",
    "instance_name": "Name string",
    "max_interruption_time": 120, # seconds
    # How long live migration can take
    "migration_type": "LIVE_MIGRATION",
    # LIVE_MIGRATION, MIGRATION or OWN_ACTION
    # Own action means creating a new instance and deleting the old one.
    # Note! The VNF needs to obey resource_mitigation with own action.
    # This affects the order of deleting the old and creating the new
    # instance, so that resources are not overcommitted.
    "resource_mitigation": "True", # True or False
    # Current instance needs double allocation when being migrated.
    # This is also true if the instance is first scaled out and only then
    # the old instance is removed. It must be True also if the VNF needed
    # to scale down, since we go over that scaled down capacity.
    "lead_time": 60 # seconds
    # How much lead time the VNF needs for the 'migration_type' operation.
    # The VNF needs to report back to Fenix as soon as it is ready, and at
    # the latest within this time. Reporting as fast as possible is crucial
    # for optimizing infrastructure upgrade/maintenance.
}
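A minimal sketch of how a client might validate this structure and prepare the PUT request is shown below. Only the path and the field names come from this specification; the function name and the validation rules (allowed `migration_type` values, non-negative times) are assumptions for illustration. No HTTP call is made.

```python
import json
from typing import Tuple

# Values listed in the structure above.
MIGRATION_TYPES = {"LIVE_MIGRATION", "MIGRATION", "OWN_ACTION"}


def build_instance_request(instance: dict) -> Tuple[str, str]:
    """Return (path, JSON body) for PUT /v1/instance/{instance_id}.

    Hypothetical helper: the checks are illustrative assumptions,
    not rules defined by Fenix.
    """
    if instance["migration_type"] not in MIGRATION_TYPES:
        raise ValueError("unknown migration_type: %s"
                         % instance["migration_type"])
    for key in ("max_interruption_time", "lead_time"):
        if instance[key] < 0:
            raise ValueError("%s must not be negative" % key)
    path = "/v1/instance/%s" % instance["instance_id"]
    return path, json.dumps(instance)
```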

API DELETE /v1/instance/{instance_id} is used to delete instance object.

API PUT /v1/instance_group/{group_id} is used to update instance group object:

{
    "group_id": "group_UUID string",
    "project_id": "Project UUID string",
    "group_name": "Name string",
    "anti_affinity_group": "True", # True or False
    "max_instances_per_host": 2, # 1..N
    # Describes how many instances can be on the same host with
    # anti_affinity_group: True
    # Already exists in OpenStack as 'max_server_per_host', but might not
    # exist in different clouds.
    "max_impacted_members": 2, # 1..N
    # Maximum number of instances that can be impacted
    # Note! This can be dynamic according to VNF load
    "recovery_time": 10, # seconds
    # Members affected by a previous action still count against
    # max_impacted_members until the recovery time has passed.
    # Note! This applies regardless of anti_affinity_group
    "resource_mitigation": "True" # True or False
    # Instances in the group need double allocation when affected.
    # This is true in migrations, but also if an instance is first scaled
    # out and only then the old instance is removed.
    # It must be True also if the VNF needed to scale down, since we go
    # over that scaled down capacity.
}
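One possible reading of how `max_impacted_members` and `recovery_time` interact is sketched below: a member whose action finished less than `recovery_time` seconds ago still counts as impacted. The function name and the list of finish timestamps are hypothetical; how Fenix tracks finished actions is not defined here.

```python
from typing import Iterable


def remaining_impact_budget(group: dict,
                            action_finish_times: Iterable[float],
                            now: float) -> int:
    """Return how many more group members may be impacted right now.

    action_finish_times holds the finish time (in seconds) of each
    previous action on a group member. This bookkeeping is an assumption.
    """
    still_recovering = sum(
        1 for finished in action_finish_times
        if now - finished < group["recovery_time"]
    )
    return max(0, group["max_impacted_members"] - still_recovering)
```

For example, with `max_impacted_members` 2 and `recovery_time` 10, one member that finished 5 seconds ago leaves room for one more member.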

API GET /v1/instance_group/{group_id} is used to get the instance group object. Compared to PUT, this structure also has the instance_ids:

{
    "group_id": "group_UUID string",
    "project_id": "Project UUID string",
    "group_name": "Name string",
    "anti_affinity_group": "True", # True or False
    "max_instances_per_host": 2, # 1..N
    # Describes how many instances can be on the same host with
    # anti_affinity_group: True
    # Already exists in OpenStack as 'max_server_per_host', but might not
    # exist in different clouds.
    "max_impacted_members": 2, # 1..N
    # Maximum number of instances that can be impacted
    # Note! This can be dynamic according to VNF load
    "recovery_time": 10, # seconds
    # Members affected by a previous action still count against
    # max_impacted_members until the recovery time has passed.
    # Note! This applies regardless of anti_affinity_group
    "resource_mitigation": "True", # True or False
    # Instances in the group need double allocation when affected.
    # This is true in migrations, but also if an instance is first scaled
    # out and only then the old instance is removed.
    # It must be True also if the VNF needed to scale down, since we go
    # over that scaled down capacity.
    "instance_ids": [] # List of instances belonging to this group
}

API DELETE /v1/instance_group/{group_id} is used to delete instance group object.

A new API is needed for the project instance specific reply.

This API will now be used to reply to the 'state' 'PREPARE_MAINTENANCE' and 'PLANNED_MAINTENANCE' notifications, which will be instance specific.

PUT /v1/maintenance/<session_id>/<project_id>/<instance_id>:

{
    "instance_action": "MIGRATE",
    "state": "ACK_PLANNED_MAINTENANCE"
}
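The request above could be assembled as in the following sketch. The path layout and the body keys come from this specification, and the two accepted reply states are taken from the workflow diagram; the function name is hypothetical and no HTTP call is made.

```python
from typing import Dict, Optional, Tuple

# Reply states for the instance specific notifications, as named in the
# workflow diagram of this specification.
INSTANCE_REPLY_STATES = {"ACK_PREPARE_MAINTENANCE", "ACK_PLANNED_MAINTENANCE"}


def build_instance_reply(session_id: str, project_id: str, instance_id: str,
                         state: str,
                         instance_action: Optional[str] = None
                         ) -> Tuple[str, Dict[str, str]]:
    """Return (path, body) for the instance specific reply PUT."""
    if state not in INSTANCE_REPLY_STATES:
        raise ValueError("unexpected reply state: %s" % state)
    path = "/v1/maintenance/%s/%s/%s" % (session_id, project_id, instance_id)
    body = {"state": state}
    if instance_action is not None:
        body["instance_action"] = instance_action
    return path, body
```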

Notifications impact

Event type maintenance.planned notification will need changes.

A new state value INSTANCE_ACTION_FALLBACK should be added to tell that live migration was not possible and that Fenix will force the migration to complete. After that, the normal INSTANCE_ACTION_DONE or INSTANCE_ACTION_FAILED will be expected.

instance_ids is currently limited to either a single instance_id or a link to get all affected instances. Now it should always be a single instance, except in state value MAINTENANCE or SCALE_IN. MAINTENANCE should always have the link to the Fenix API to get all instances that may be affected during the maintenance session. SCALE_IN can mention exactly one instance, as this may be needed to allow another pinned instance to have a target host with the needed resources. This can happen in a small edge deployment. An empty string indicates that the VNF can decide how it scales down. The workflow may then need several SCALE_IN notifications to finally have enough unused resources to proceed. A state with the value MAINTENANCE_COMPLETE should have an empty string as the instance_ids value. In this state the VNF should scale back to the instances it had at the beginning of the maintenance session.
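The rules for instance_ids can be summarized in a small sketch of how a receiver might interpret the field. The function name and the returned labels are hypothetical and only illustrate the cases described above.

```python
def classify_instance_ids(state: str, instance_ids: str) -> str:
    """Tell what instance_ids means for a given notification state.

    Returns one of the hypothetical labels 'link', 'single_instance',
    'vnf_decides' or 'none'.
    """
    if state == "MAINTENANCE":
        # Always a link to the Fenix API listing the affected instances.
        return "link"
    if state == "SCALE_IN":
        # Empty string: the VNF decides how it scales down.
        return "vnf_decides" if instance_ids == "" else "single_instance"
    if state == "MAINTENANCE_COMPLETE":
        if instance_ids != "":
            raise ValueError("MAINTENANCE_COMPLETE expects empty instance_ids")
        return "none"
    # Every other state refers to exactly one instance.
    if instance_ids == "":
        raise ValueError("%s expects a single instance id" % state)
    return "single_instance"
```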

Other end user impact

VNFD and EM need to support defining and updating instance and instance group variables.

Other deployer impact

VNFM needs to proxy the updating of instance and instance group variables.

Implementation

Assignee(s)

Primary assignee: Tomi Juvonen <tomi.juvonen@nokia.com>

Work Items

APIs to set instance and instance group objects
Example workflow
Testing
Documentation changes

Dependencies

There can be enhancements to other projects later on. However, the initially needed functionality can be handled completely inside Fenix.

Testing

There is a huge number of combinations of VNF deployments, and the variables used can be changed during operations. Fenix will support all these variables and their changes. A Fenix workflow is always an example and is limited in what it can support and is tested against. The main thing to test is that all variables and their changes are supported and validated. The testing of VNF deployments might be limited to the example use case supported by the example workflow.

Documentation Impact

Fenix documentation needs to be updated after the implementation is ready.
