
..
 This work is licensed under a Creative Commons Attribution 3.0 Unported
 License.

 http://creativecommons.org/licenses/by/3.0/legalcode

=============================================
ETSI NFVI software modification specification
=============================================

https://storyboard.openstack.org/#!/story/2006557

Implement the interfacing between the VNFM and Fenix that is specified in the
`ETSI FEAT03 related documentation`_. Limit the current changes to instances
and instance groups.

Problem description
===================

This feature addresses the support for the coordination of the NFVI software
modification process with the VNFs hosted on the NFVI in order to minimize
impact on service availability.

Use Cases
---------

Guarantee zero impact to the VNF service during the Fenix infrastructure
maintenance, upgrade and scaling workflow operations. This implies that the VNF
and VNFM support the ETSI specification and Fenix interaction.


Proposed change
===============

Implement APIs to set VNF specific instance and instance group variables.

The new APIs change VNF project instance and instance group data in the Fenix
database. These constraints might be set in the VNFD, or the VNF element
manager can change them at any time according to the current VNF load level.
Having the constraints makes it possible to optimize the infrastructure
maintenance operation: the VNFs can be scaled down as much as possible, so
that as many compute nodes as possible can be maintained in parallel.
An instance group can consist of instances belonging to a certain
anti-affinity group, but all instances need to be grouped, so that it is known
how many of them are needed at minimum and how many can be exposed to
maintenance at the same time. If nothing else, a group means the instances of
a certain flavor.

Make an example workflow that supports the usage of these APIs. The workflow
should implement one example rolling maintenance use case. The existing Fenix
interaction towards the VNFM will be utilized with small changes.

The variables common to instance and instance group can be overridden in the
instance object. Both objects can be updated at any time. An update is taken
into account in any action that is not currently ongoing. An existing timer
is not updated. These objects alone are not enough to optimize the
infrastructure workflow. The existing Fenix interaction is also needed to keep
the maintenance window as small as possible. This also allows upgrading the
VNF with new infrastructure capabilities, with no additional impact on VNF
service availability, if done at the same time as the infrastructure upgrade.
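
The override rule can be sketched as follows. This is only an illustration,
assuming that a common variable missing from the instance object falls back to
the group value. The function name ``effective_constraints`` and the plain
dictionary representation are hypothetical and not part of Fenix.

.. code-block:: python

    # Variables present in both objects in this specification.
    COMMON_VARIABLES = ("resource_mitigation",)


    def effective_constraints(instance, group):
        """Return the common variables with instance values taking priority.

        Assumption: a variable not set in the instance object is inherited
        from the instance group object.
        """
        merged = {}
        for key in COMMON_VARIABLES:
            if key in group:
                merged[key] = group[key]
            if key in instance:
                # The instance object overrides the group level value.
                merged[key] = instance[key]
        return merged

For example, an instance that sets ``resource_mitigation`` to ``"False"``
keeps that value even when its group has ``"True"``.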

The following diagram illustrates the existing Fenix workflow, where the
application manager updates instance and instance group constraints whenever
instances are created or deleted. Constraints can also be updated at any time
if the level of VNF service allows a different number of instances at
that time.

.. seqdiag::

    seqdiag {
        activation = none;
        app-manager --> fenix [label = "Update instance and instance group constraints anytime and when created"]
        === --- ===
        infra-admin  -> fenix [label = "Maintenance session \n for hosts", note="Start the maintenance process"];
        fenix -> app-manager [label = "MAINTENANCE"];
        app-manager -> fenix [label = "ACK_MAINTENANCE"];
        fenix --> app-manager [label = "SCALE_IN", note="Optional down scale"];
        app-manager --> fenix [label = "Remove instance related constraints of scaled down instances. Update instance group constraints to match scaling"]
        app-manager --> fenix [label = "ACK_SCALE_IN"]
        fenix --> app-manager [label = "PREPARE_MAINTENANCE", note="If there is no empty host, Fenix makes one"]
        app-manager --> fenix [label = "ACK_PREPARE_MAINTENANCE"]
        fenix --> app-manager [label = "ADMIN_ACTION_DONE"]
        === Repeated for every compute ===
        fenix -> app-manager [label = "PLANNED_MAINTENANCE", note="If VM-s are on the host. Migrate or Live migrate"]
        app-manager -> fenix [label = "ACK_PLANNED_MAINTENANCE"]
        fenix --> app-manager [label = "ADMIN_ACTION_DONE"]
        fenix --> app-manager [label = "IN_MAINTENANCE"]
        ... Actual maintenance happens here ...
        fenix --> app-manager [label = "MAINTENANCE_COMPLETE"]
        === --- ===
        fenix --> app-manager [label = "MAINTENANCE_COMPLETE", note="Maintenance is done"]
        app-manager --> fenix [label = "Add instance constraints of instances possibly added when scaling up when maintenance is completed. Update instance groups constraints to match scaling"]
        app-manager --> fenix [label = "ACK_MAINTENANCE_COMPLETE", note="Up scale"]

    }


Alternatives
------------

N/A

Data model impact
-----------------

Fenix database will need to have new tables to support instance and
instance group objects.

REST API impact
---------------

All APIs return 200 OK on success. Error codes will be defined during
implementation.

API PUT ``/v1/instance/{instance_id}`` is used to update the instance object.
API GET ``/v1/instance/{instance_id}`` is used to get the instance object.
The ``PUT`` API takes this structure as input and the ``GET`` API returns it::

    {
        "instance_id": "instance_UUID string",
        "project_id": "Project UUID string",
        "group_id": "group_UUID string",
        "instance_name": "Name string",
        "max_interruption_time": 120, # seconds
        # How long live migration can take
        "migration_type": "LIVE_MIGRATION",
        # LIVE_MIGRATION, MIGRATION or OWN_ACTION
        # Own action means creating a new instance and deleting the old one.
        # Note! VNF needs to obey resource_mitigation with own action.
        # This affects the order of deleting the old and creating the new
        # instance, so that the resources are not overcommitted.
        "resource_mitigation": "True", # True or False
        # Current instance needs double allocation when being migrated.
        # This is true also if the instance is first scaled out and only then
        # the old instance is removed. It must be True also if the VNF needed
        # to scale down, since we go over that scaled down capacity.
        "lead_time": 60 # seconds
        # How long a lead time the VNF needs for the 'migration_type'
        # operation. The VNF needs to report back to Fenix as soon as it is
        # ready, but at the latest within this time. Reporting as fast as
        # possible is crucial for optimizing infrastructure
        # upgrade/maintenance.
    }

API DELETE ``/v1/instance/{instance_id}`` is used to delete instance object.
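
As an illustration of how a client could prepare the ``PUT`` call for the
instance object structure above, the following sketch builds the request
without sending it. It uses only the Python standard library. The endpoint
``FENIX_URL`` and the helper name ``build_instance_put`` are hypothetical, and
authentication headers are left out.

.. code-block:: python

    import json
    import urllib.request

    # Hypothetical Fenix endpoint, used only for this illustration.
    FENIX_URL = "http://fenix.example.org:12347"

    MIGRATION_TYPES = ("LIVE_MIGRATION", "MIGRATION", "OWN_ACTION")


    def build_instance_put(instance):
        """Return an unsent PUT request for /v1/instance/{instance_id}."""
        if instance["migration_type"] not in MIGRATION_TYPES:
            raise ValueError(
                "unknown migration_type: %s" % instance["migration_type"])
        url = "%s/v1/instance/%s" % (FENIX_URL, instance["instance_id"])
        return urllib.request.Request(
            url,
            data=json.dumps(instance).encode("utf-8"),
            headers={"Content-Type": "application/json"},
            method="PUT",
        )

The returned object can be passed to ``urllib.request.urlopen`` once a real
endpoint and credentials are available.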

API PUT ``/v1/instance_group/{group_id}`` is used to update instance group
object::

    {
        "group_id": "group_UUID string",
        "project_id": "Project UUID string",
        "group_name": "Name string",
        "anti_affinity_group": "True", # True or False
        "max_instances_per_host": 2, # 1..N
        # Describes how many instances can be on the same host with
        # anti_affinity_group: True
        # Already exists in OpenStack as 'max_server_per_host', but might not
        # exist in different clouds.
        "max_impacted_members": 2, # 1..N
        # Maximum number of instances that can be impacted
        # Note! This can be dynamic to VNF load
        "recovery_time": 10, # seconds
        # max_impacted_members needs to also count the members of previous
        # actions until their recovery time has passed
        # Note! Regardless of anti_affinity
        "resource_mitigation": "True" # True or False
        # Instances in the group need double allocation when affected.
        # This is true in migrations, but also if an instance is first scaled
        # out and only then the old instance is removed.
        # It must be True also if the VNF needed to scale down, since we go
        # over that scaled down capacity.
    }

API GET ``/v1/instance_group/{group_id}`` is used to get the instance group
object. Compared to ``PUT``, this structure also has the ``instance_ids``::

    {
        "group_id": "group_UUID string",
        "project_id": "Project UUID string",
        "group_name": "Name string",
        "anti_affinity_group": "True", # True or False
        "max_instances_per_host": 2, # 1..N
        # Describes how many instances can be on the same host with
        # anti_affinity_group: True
        # Already exists in OpenStack as 'max_server_per_host', but might not
        # exist in different clouds.
        "max_impacted_members": 2, # 1..N
        # Maximum number of instances that can be impacted
        # Note! This can be dynamic to VNF load
        "recovery_time": 10, # seconds
        # max_impacted_members needs to also count the members of previous
        # actions until their recovery time has passed
        # Note! Regardless of anti_affinity
        "resource_mitigation": "True", # True or False
        # Instances in the group need double allocation when affected.
        # This is true in migrations, but also if an instance is first scaled
        # out and only then the old instance is removed.
        # It must be True also if the VNF needed to scale down, since we go
        # over that scaled down capacity.
        "instance_ids": [] # List of instances belonging to this group
    }


API DELETE ``/v1/instance_group/{group_id}`` is used to delete the instance
group object.
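
The interplay of ``max_impacted_members`` and ``recovery_time`` in the
instance group object can be sketched as below. This is a minimal
illustration of the rule in the comments above, not the Fenix workflow
implementation. The function name ``can_impact_another`` and its arguments
are hypothetical, and times are assumed to be plain seconds.

.. code-block:: python

    def can_impact_another(group, impacted_now, finished_at, now):
        """Tell whether one more group member may be impacted at ``now``.

        impacted_now: number of members with an action currently ongoing.
        finished_at: completion times (seconds) of earlier member actions.
        """
        # A member whose action finished less than recovery_time seconds ago
        # is still counted as impacted.
        recovering = sum(
            1 for done in finished_at
            if now - done < group["recovery_time"]
        )
        return impacted_now + recovering < group["max_impacted_members"]

With ``max_impacted_members`` 2 and ``recovery_time`` 10, one ongoing action
plus one action that finished 5 seconds ago means that no further member can
be impacted yet.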

A new API is needed for the project instance specific reply.

This API will now be used to reply to the ``state`` ``PREPARE_MAINTENANCE``
and ``PLANNED_MAINTENANCE`` notifications, which will be instance specific.

PUT ``/v1/maintenance/<session_id>/<project_id>/<instance_id>``::

    {
        "instance_action": "MIGRATE",
        "state": "ACK_PLANNED_MAINTENANCE"
    }
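
A reply for this API could be put together as in the sketch below. The helper
name ``build_instance_reply`` is hypothetical. The accepted ``state`` values
are an assumption based on the acknowledgements of the two instance specific
notifications in the workflow diagram.

.. code-block:: python

    import json

    # Assumed acknowledgement states for the instance specific notifications.
    INSTANCE_REPLY_STATES = (
        "ACK_PREPARE_MAINTENANCE",
        "ACK_PLANNED_MAINTENANCE",
    )


    def build_instance_reply(session_id, project_id, instance_id,
                             state, instance_action):
        """Return the (path, JSON body) pair of an instance specific reply."""
        if state not in INSTANCE_REPLY_STATES:
            raise ValueError("not an instance specific reply: %s" % state)
        path = "/v1/maintenance/%s/%s/%s" % (
            session_id, project_id, instance_id)
        body = {"instance_action": instance_action, "state": state}
        return path, json.dumps(body)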


Notifications impact
--------------------

The event type ``maintenance.planned`` notification will need changes.

A new ``state`` value ``INSTANCE_ACTION_FALLBACK`` should be added to tell that
live migration was not possible and Fenix will force the migration to complete.
After that the normal ``INSTANCE_ACTION_DONE`` or ``INSTANCE_ACTION_FAILED``
will be expected.

``instance_ids`` is currently limited to either a single ``instance_id`` or
a link to get all affected instances. From now on it should always be a single
instance, except when the ``state`` value is ``MAINTENANCE`` or ``SCALE_IN``.
``MAINTENANCE`` should always have the link to the Fenix API to get all
instances that may be affected during the maintenance session. ``SCALE_IN`` can
mention one exact instance, as removing it may be needed to allow another
pinned instance to have a target host with the needed resources. This can
happen in a small edge deployment. An empty string indicates that the VNF can
decide how it scales down. The workflow may then need several ``SCALE_IN``
notifications to finally have enough unused resources to proceed. ``state``
having the value ``MAINTENANCE_COMPLETE`` should have an empty string as the
``instance_ids`` value. In this ``state`` the VNF should scale back to the
instances it had at the beginning of the maintenance session.
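
On the receiving side, the ``instance_ids`` rules above could be interpreted
as in the following sketch. The function name ``interpret_instance_ids`` and
the returned labels are hypothetical, and the value is assumed to arrive as a
plain string.

.. code-block:: python

    def interpret_instance_ids(state, instance_ids):
        """Classify the ``instance_ids`` value of a maintenance.planned event.

        Returns a (kind, value) tuple where kind is one of the illustrative
        labels "link", "none", "vnf_decides" or "instance".
        """
        if state == "MAINTENANCE":
            # Link to the Fenix API listing all possibly affected instances.
            return ("link", instance_ids)
        if state == "MAINTENANCE_COMPLETE":
            # Empty string; the VNF scales back to its original instances.
            return ("none", None)
        if state == "SCALE_IN" and instance_ids == "":
            # The VNF can decide itself how it scales down.
            return ("vnf_decides", None)
        # Any other case refers to exactly one instance.
        return ("instance", instance_ids)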

Other end user impact
---------------------

The VNFD and EM need to support defining and updating instance and instance
group variables.

Other deployer impact
---------------------

The VNFM needs to proxy the updating of instance and instance group
variables.


Implementation
==============

Assignee(s)
-----------

Primary assignee:
  Tomi Juvonen <tomi.juvonen@nokia.com>

Work Items
----------

* APIs to set instance and instance group objects
* Example workflow
* Testing
* Documentation changes


Dependencies
============

There can be enhancements to other projects later on. However, the initially
needed functionality can be handled completely inside Fenix.


Testing
=======

There is a huge number of combinations of VNF deployments, and the variables
used can be changed during operations. Fenix will support all these variables
and their changes. A Fenix workflow is always an example, which limits what it
can support and be tested against. The main thing to test is that all
variables and their changes are supported and validated. The testing of VNF
deployments might be limited to the example use case supported by the example
workflow.


Documentation Impact
====================

Fenix documentation needs to be updated after the implementation is ready.


References
==========

.. _`ETSI FEAT03 related documentation`: https://nfvwiki.etsi.org/index.php?title=Feature_Tracking#FEAT03:_NFVI_software_modification