Merge "Enhance Placement Process"

This commit is contained in:
Zuul 2022-07-29 08:12:30 +00:00 committed by Gerrit Code Review
commit caabc47fb7
1 changed files with 523 additions and 0 deletions

View File

@ -0,0 +1,523 @@
..
 This work is licensed under a Creative Commons Attribution 3.0 Unported
 License.

 http://creativecommons.org/licenses/by/3.0/legalcode

=========================
Enhance Placement Process
=========================

.. Blueprints:

   - https://blueprints.launchpad.net/tacker/+spec/enhance-placement

This specification supports placement functionality for VNF instances
and enhances the applicability of Tacker to various systems.

Problem description
===================
Placement constraints are defined in ETSI NFV-SOL 003 v3.3.1
[#NFV-SOL003_331]_; the VNFM sends them to the NFVO so that the NFVO can
make resource placement decisions.
In VNF Lifecycle Management (LCM), there are cases in which VNFs cannot
be deployed because of placement constraints or a lack of resources in
the availability zone.
From a high-availability perspective, Tacker needs to support the
``fallbackBestEffort`` option in the Grant Request and to implement
availability zone reselection functions.

Proposed change
===============
1. Add fallbackBestEffort parameter
-----------------------------------
``fallbackBestEffort``, defined in ETSI NFV-SOL 003 v3.3.1
[#NFV-SOL003_331]_, indicates that when the specified resources cannot be
allocated based on the specified placement constraint, the NFVO looks for
an alternate best effort placement for those resources.

In this specification, Tacker will support the additional parameter
``fallbackBestEffort`` in the Grant Request (``PlacementConstraint``) and
in the Tacker configuration settings.
We will modify the Tacker conductor and ``tacker.conf``.
Details of the added definitions are described in :ref:`3. Add definitions
to the configuration file<configuration-file>`.

* GrantRequest.PlacementConstraint

  .. list-table::
     :widths: 15 10 30 30
     :header-rows: 1

     * - Attribute name
       - Data type
       - Cardinality
       - Description
     * - fallbackBestEffort
       - Boolean
       - 0..1
       - Indication if the constraint is handled with fall back best
         effort. Default value is "false".

.. note::

   If ``fallbackBestEffort`` is present in placement constraints and set
   to "true", the NFVO shall process the Affinity/Anti-Affinity constraint
   in a best effort manner, in which case, if specified resources cannot
   be allocated based on the specified placement constraint, the NFVO
   looks for an alternate best effort placement for the specified
   resources to be granted.

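For illustration, the following is a sketch of a placement constraint in
a Grant Request carrying ``fallbackBestEffort``; the resource identifiers
are examples only.

.. code-block::

  {
    "placementConstraints": [
      {
        "affinityOrAntiAffinity": "ANTI_AFFINITY",
        "scope": "ZONE",
        "resource": [
          {"idType": "GRANT", "resourceId": "res-vdu1"},
          {"idType": "GRANT", "resourceId": "res-vdu2"}
        ],
        "fallbackBestEffort": true
      }
    ]
  }
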
2. Availability zone reselection
--------------------------------
The VNFLCM v2 API (instantiate/heal/scale VNF) process can change the
availability zone to be used from the one notified by the NFVO if
necessary.
If the availability zone notified by the NFVO has insufficient
resources, the VNF is created/updated in a different availability zone.
The availability zone is reselected considering Affinity/Anti-Affinity,
and "stack create/update" is re-executed until it succeeds or there are
no more candidates.

1) Flowchart of availability zone reselection
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. blockdiag::

   blockdiag {
      orientation = portrait;
      edge_layout = flowchart;
      create [shape=flowchart.condition,label='stack create/update'];
      start [shape=beginpoint];
      lin [shape=flowchart.loopin,label='stack create/update\nin other AZ'];
      recreate [shape=flowchart.condition,label='stack create/update'];
      lout [shape=flowchart.loopout,label='no more AZ\ncandidates'];
      failure [shape=endpoint,label='end'];
      success [shape=endpoint,label='end'];
      class none [shape=none];
      n [class=none];
      start -> create [label='NFVO AZ']
      create -> lin [label=error]
      lin -> recreate [label='other AZ']
      recreate -> lout [label='error']
      lout -> failure [label='failure'];
      create -> success;
      recreate -> n [dir=none]
      n -> success [label='success'];
   }

The procedure consists of the following steps, as illustrated in the
flow above (a self-contained sketch follows these steps):

#. Execute "stack create/update" in the availability zone notified by
   the NFVO.
#. If an error occurs in 1, get the availability zone list.
   Details of the availability zone list are described in :ref:`2) Get and
   manage availability zone list<az-list>`.
#. Select an availability zone (excluding the availability zone where
   the error occurred) randomly in compliance with
   Affinity/Anti-Affinity from the availability zone list obtained in 2,
   and re-execute "stack create/update".
   Details of the reselection policy are described in :ref:`3) Availability
   zone reselection policy<reselection-policy>`.
#. If "stack create/update" in the availability zone reselected in 3
   fails, reselect the availability zone and repeat until
   "stack create/update" succeeds or until all availability zone
   candidates fail.
   Details of error detection are described in :ref:`4) Detection method
   of insufficient resource error<detection-method>`.

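The following is a minimal, self-contained sketch of this loop; the
stack driver is simulated and all names are illustrative, not Tacker
internals.

.. code-block::

  # A minimal sketch of the reselection loop; "AZ-1" is simulated as
  # lacking resources so that reselection is exercised.
  import random


  class InsufficientResource(Exception):
      def __init__(self, zone):
          self.zone = zone


  def create_stack(zones):
      # Simulated "stack create/update": AZ-1 always lacks resources.
      for vdu, zone in zones.items():
          if zone == "AZ-1":
              raise InsufficientResource(zone)
      return zones


  def deploy(vdus, nfvo_zones, az_list):
      zones, failed = dict(zip(vdus, nfvo_zones)), set()
      while True:
          try:
              return create_stack(zones)      # steps 1 and 4 (retry)
          except InsufficientResource as e:
              failed.add(e.zone)              # exclude the zone in error
          candidates = [z for z in az_list if z not in failed]  # step 2
          if len(candidates) < len(vdus):
              raise RuntimeError("no more AZ candidates")
          # step 3: random reselection (here, anti-affinity style)
          zones = dict(zip(vdus, random.sample(candidates, len(vdus))))


  print(deploy(["VDU-1", "VDU-2"], ["AZ-1", "AZ-2"],
               ["AZ-1", "AZ-2", "AZ-3"]))
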
.. _az-list:
2) Get and manage availability zone list
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
``Get availability zone list``
+ Extract all availability zones as candidates for reselection, without
  limiting the availability zones to be reselected.

  Although it is possible to extract only the availability zones
  permitted by the Grant as candidates for reselection, this is not
  adopted in this spec because it depends on the NFVO product
  specifications.

+ The concept of availability zone exists for Compute/Volume/Network,
  but this spec targets only Compute.

  The reason is that SOL003 v3.3.1 (Type: GrantInfo [#NFV-SOL003_331]_)
  specifies that the zoneId of GrantInfo, which is the data type of
  addResources, etc., is usually specified for a COMPUTE resource.

.. note::

   ``SOL003 v3.3.1 Type: GrantInfo``

   Reference to the identifier of the "ZoneInfo" structure in the
   "Grant" structure defining the resource zone into which this
   resource is to be placed. Shall be present for new resources if the
   zones concept is applicable to them (typically, Compute resources)
   and shall be absent for resources that have already been allocated.

+ Call the Compute API "Get Detailed Availability Zone Information"
  [#Compute-API]_ and get the availability zones from the "hosts"
  entries associated with "nova-compute" in the response.

  Compute endpoints are obtained in the following way (see the sketch
  after this list).

  1. Get the Keystone endpoint from
     VnfInstance.vimConnectionInfo.interfaceInfo.endpoint
  2. Call "List endpoints" [#Keystone-API_endpoints]_ and "List
     services" [#Keystone-API_services]_ of the Keystone API to map the
     obtained Compute service to its endpoint

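The following is a minimal sketch of this lookup, assuming a valid
Keystone token and plain HTTP calls; the function name and error
handling are illustrative only.

.. code-block::

  # A minimal sketch: resolve the Compute endpoint via Keystone, then
  # list availability zones that host a "nova-compute" service.
  import requests


  def get_compute_availability_zones(keystone_endpoint, token):
      headers = {"X-Auth-Token": token}

      # "List services": find the service IDs of type "compute".
      services = requests.get(
          f"{keystone_endpoint}/v3/services", headers=headers
      ).json()["services"]
      compute_ids = {s["id"] for s in services if s["type"] == "compute"}

      # "List endpoints": pick the public endpoint of the compute service.
      endpoints = requests.get(
          f"{keystone_endpoint}/v3/endpoints", headers=headers
      ).json()["endpoints"]
      compute_url = next(
          e["url"] for e in endpoints
          if e["service_id"] in compute_ids and e["interface"] == "public"
      )

      # "Get Detailed Availability Zone Information": keep only zones
      # whose "hosts" entries include "nova-compute".
      zones = requests.get(
          f"{compute_url}/os-availability-zone/detail", headers=headers
      ).json()["availabilityZoneInfo"]
      return [
          z["zoneName"] for z in zones
          if any("nova-compute" in svcs
                 for svcs in (z.get("hosts") or {}).values())
      ]
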
``Manage availability zone list``

+ Keep the list in memory until the availability zone reselection
  iterations are completed, and discard it after completion (it is not
  stored in the DB).

.. note::

   ``Error-Handling Retry consideration``

   Since the availability zone list is not saved and is retrieved
   again, there is no guarantee that the availability zones are
   reselected in the same order when a retry is executed.

.. _reselection-policy:
3) Availability zone reselection policy
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Availability zones in error are excluded from the reselection
candidates, and availability zones are reselected randomly in compliance
with the Affinity/Anti-Affinity of the PlacementConstraint.

The availability zone in error can be identified in the following way
(see the sketch after the note below).

1. Call the Heat API "Show stack details" after an error occurs in
   "stack create/update".
2. Identify the VDU where the error occurred due to insufficient
   resources from the stack_status_reason in the response of 1.
3. Identify the availability zone from the VDU identified in 2.

.. note::

   An insufficient-resource condition in an availability zone that failed
   during earlier reselection attempts may later be resolved, but that
   availability zone will not be reselected.
   In Scale/Heal operations, VDUs that have already been deployed are
   not re-created.

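The following is a minimal sketch of steps 1 and 2 above, assuming a
requests-style call to the Heat API; the endpoint handling and parsing
logic are illustrative only.

.. code-block::

  # A minimal sketch: call "Show stack details" and extract the failed
  # VDU name from stack_status_reason.
  import re

  import requests


  def find_failed_vdu(heat_endpoint, token, stack_name, stack_id):
      # 1. "Show stack details" after "stack create/update" fails.
      stack = requests.get(
          f"{heat_endpoint}/stacks/{stack_name}/{stack_id}",
          headers={"X-Auth-Token": token},
      ).json()["stack"]
      reason = stack.get("stack_status_reason", "")
      # 2. The failed VDU appears as "resources.<VDU-name>" in the
      #    ResourceInError message.
      match = re.search(r"resources\.([^.:\s]+)", reason)
      return match.group(1) if match else None
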
Availability zone reselection for each PlacementConstraint is as
follows; a code sketch of this policy follows the examples below.

Precondition: availability zones AZ-1/AZ-2/AZ-3 exist and VNF
VDU-1/VDU-2 are deployed.

+ PlacementConstraint is Anti-Affinity

  + Before reselection, the following attempt to deploy failed (AZ-1
    has insufficient resources)

    + VDU-1: AZ-1
    + VDU-2: AZ-2

  + Reselect as follows (excluding AZ-1, select AZ-2/AZ-3 in compliance
    with Anti-Affinity)

    + VDU-1: AZ-2
    + VDU-2: AZ-3

.. note::

   The above is an example, and it is possible that the reverse
   availability zones are selected for VDU-1 and VDU-2, but it is
   guaranteed that they will not be the same availability zone.

+ PlacementConstraint is Affinity

  + Before reselection, the following attempt to deploy failed (AZ-1
    has insufficient resources)

    + VDU-1: AZ-1
    + VDU-2: AZ-1

  + Reselect as follows (excluding AZ-1, select AZ-2/AZ-3 in compliance
    with Affinity)

    + VDU-1: AZ-2
    + VDU-2: AZ-2

.. note::

   The above is an example, and it is possible that the availability
   zone AZ-3 is selected for VDU-1 and VDU-2, but it is guaranteed
   that they will be the same availability zone.

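As a minimal sketch of this policy (the function and its signature are
illustrative, not Tacker internals), the reselection could look like:

.. code-block::

  # A minimal sketch: exclude failed zones, then select randomly in
  # compliance with Affinity/Anti-Affinity.
  import random


  def reselect_zones(vdus, az_candidates, failed_azs, policy):
      available = [az for az in az_candidates if az not in failed_azs]
      if policy == "affinity":
          if not available:
              raise RuntimeError("no more AZ candidates")
          zone = random.choice(available)  # all VDUs share one zone
          return {vdu: zone for vdu in vdus}
      # Anti-affinity: each VDU gets a distinct zone.
      if len(available) < len(vdus):
          raise RuntimeError("no more AZ candidates")
      return dict(zip(vdus, random.sample(available, len(vdus))))


  # Example matching the Anti-Affinity case above: AZ-1 failed, so
  # VDU-1/VDU-2 get distinct zones chosen from AZ-2/AZ-3.
  print(reselect_zones(["VDU-1", "VDU-2"],
                       ["AZ-1", "AZ-2", "AZ-3"], {"AZ-1"}, "anti-affinity"))
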
.. _detection-method:
4) Detection method of insufficient resource error
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
When "stack create/update" fails, it is detected from "Show stack details"
[#Heat-API]_ of Heat-API response whether the failure is due to
insufficient resources.
The error message that indicates insufficient resources is extracted
from the parameter "stack_status_reason" in the response.
.. note::

   In the case of insufficient resources, the error occurs after "stack
   create/update" returns an acceptance response, so the "Show stack
   details" response is used to detect the cause.

The following are examples of error messages stored in
"stack_status_reason" when resources are insufficient.

+ ex1) Set the flavor defined in "OS::Nova::Server" to a large value
  that cannot be deployed (not enough storage/not enough vCPUs/not
  enough memory).

  + Resource CREATE failed: ResourceInError: resources.<VDU-name>: Went
    to status ERROR due to "Message: No valid host was found. , Code:
    500"

+ ex2) Specify an extra-spec that cannot be satisfied for the flavor
  defined in "OS::Nova::Server".

  + Resource CREATE failed: ResourceInError: resources.<VDU-name>: Went
    to status ERROR due to "Message: Exceeded maximum number of retries.
    Exhausted all hosts available for retrying build failures for
    instance <server-UUID>., Code: 500"

Error messages that Tacker detects as insufficient resources are
specified by a regular expression in the configuration file.
Details of the added definitions are described in :ref:`3. Add definitions
to the configuration file<configuration-file>`.
By changing this regular expression in accordance with the operational
policy, it is possible to set a permissive policy that detects more error
messages as insufficient resources at the cost of more misdetection, or a
strict policy that detects only specific error messages as insufficient
resources (a usage sketch follows the examples below).

+ ex1) Regular expression for a permissive policy that detects more
  error messages as insufficient resources by tolerating misdetection

  + Resource CREATE failed:(.\*)

+ ex2) Regular expressions for a strict policy that detects only
  specific error messages as insufficient resources

  + Resource CREATE failed: ResourceInError: resources(.\*): Went to
    status ERROR due to "Message: No valid host was found.(.\*), Code:
    500"
  + Resource CREATE failed: ResourceInError: resources(.\*): Went to
    status ERROR due to "Message: Exceeded maximum number of retries.
    Exhausted all hosts available for retrying build failures for
    instance(.\*)., Code: 500"

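The following is a minimal sketch of how a configured pattern could be
applied to "stack_status_reason"; loading the pattern from ``tacker.conf``
is omitted, and the pattern shown is the permissive ex1 above.

.. code-block::

  # A minimal sketch of insufficient-resource detection using the
  # configured regular expression (ex1, the permissive policy).
  import re

  pattern = re.compile(r"Resource CREATE failed:(.*)")


  def is_insufficient_resource(stack_status_reason):
      return pattern.search(stack_status_reason) is not None


  reason = ('Resource CREATE failed: ResourceInError: resources.VDU1: '
            'Went to status ERROR due to "Message: No valid host was '
            'found. , Code: 500"')
  print(is_insufficient_resource(reason))  # True
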
5) AutoScalingGroup consideration
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In BaseHOT that includes AutoScalingGroup definitions, there is a
constraint that the VNFCs associated with a VDU under the
AutoScalingGroup cannot be set to Anti-Affinity for the availability
zone.
This is because the HOT specification only allows an availability zone
to be set per VDU under the AutoScalingGroup, not per VNFC.
This constraint applies not only at the time of reselection but also at
the time of initial execution.
Therefore, for BaseHOT that includes AutoScalingGroup definitions,
Tacker ignores the PlacementConstraint for the VNFCs associated with a
VDU and reselects a single availability zone for each VDU under the
AutoScalingGroup (i.e., it always behaves as Affinity).

top HOT:

.. code-block::

  resources:
    VDU1_scale_group:
      type: OS::Heat::AutoScalingGroup
      properties:
        min_size: 1
        max_size: 3
        desired_capacity: { get_param: [ nfv, VDU, VDU1, desired_capacity ] }
        resource:
          type: VDU1.yaml
          properties:
            flavor: { get_param: [ nfv, VDU, VDU1, computeFlavourId ] }
            image: { get_param: [ nfv, VDU, VDU1-VirtualStorage, vcImageId ] }
            zone: { get_param: [ nfv, VDU, VDU1, locationConstraints ] }
            net1: { get_param: [ nfv, CP, VDU1_CP1, network ] }
            net2: { get_param: [ nfv, CP, VDU1_CP2, network ] }
            subnet1: { get_param: [ nfv, CP, VDU1_CP1, fixed_ips, 0, subnet ] }
            subnet2: { get_param: [ nfv, CP, VDU1_CP2, fixed_ips, 0, subnet ] }
            net3: { get_resource: internalVL1 }
            net4: { get_resource: internalVL2 }
            net5: { get_resource: internalVL3 }

nested HOT:

.. code-block::

  resources:
    VDU1:
      type: OS::Nova::Server
      properties:
        flavor: { get_param: flavor }
        name: VDU1
        block_device_mapping_v2: [{"volume_id": { get_resource: VDU1-VirtualStorage }}]
        networks:
        - port:
            get_resource: VDU1_CP1
        - port:
            get_resource: VDU1_CP2
        - port:
            get_resource: VDU1_CP3
        - port:
            get_resource: VDU1_CP4
        - port:
            get_resource: VDU1_CP5
        availability_zone: { get_param: zone }

As shown above, the top HOT specifies a single "zone" (availability
zone) for each VDU under the AutoScalingGroup, so every VNFC associated
with a VDU under the AutoScalingGroup is placed in the same availability
zone.

.. _configuration-file:
3. Add definitions to the configuration file
--------------------------------------------
Add the following definitions to the ``tacker.conf`` file (an
illustrative sketch follows this list).

+ Boolean value of "GrantRequest.PlacementConstraint.fallbackBestEffort"

  Default value: "false"

+ Whether or not to reselect the availability zone

  Default value: do not reselect

+ Regular expression for detecting insufficient resource errors

  Default value: a regular expression for insufficient resource errors

  .. note::

     Choose a regular expression that can catch both "stack create" and
     "stack update" errors.

+ Maximum number of retries for reselection of the availability zone

  Default value: no upper limit

  .. note::

     Consider the case where there are a large number of availability
     zones and the availability zone reselection process takes too long.

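For illustration, the new definitions might look as follows in
``tacker.conf``; the section and option names are assumptions to be
fixed at implementation time.

.. code-block::

  # Hypothetical tacker.conf sketch; option names are illustrative only.
  [v2_vnfm]
  # Value set in GrantRequest.PlacementConstraint.fallbackBestEffort
  placement_fallback_best_effort = false
  # Whether to reselect the availability zone on insufficient resources
  enable_az_reselection = false
  # Regular expression for detecting insufficient resource errors
  az_resource_error_regex = Resource CREATE failed:(.*)
  # Maximum number of reselection retries (0 means no upper limit)
  az_reselection_max_retries = 0
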
Data model impact
-----------------
None
REST API impact
---------------
None
Security impact
---------------
None
Notifications impact
--------------------
None
Other end user impact
---------------------
None
Performance impact
------------------
None
Other deployer impact
---------------------
None
Developer impact
----------------
None
Implementation
==============
Assignee(s)
-----------
Primary assignee:
  Yuta Kazato <yuta.kazato.nw@hco.ntt.co.jp>
  Hirofumi Noguchi <hirofumi.noguchi.rs@hco.ntt.co.jp>

Other contributors:
  Hiroo Kitamura <hiroo.kitamura@ntt-at.co.jp>
  Ai Hamano <ai.hamano@ntt-at.co.jp>

Work Items
----------
* Implement availability zone reselection functions.
* Add new parameter ``fallbackBestEffort`` in GrantRequest API.
* Add new definitions to the Tacker configuration file ``tacker.conf``.
* Add new unit and functional tests.
* Add new examples to the Tacker User Guide.
Dependencies
============
* VNF Lifecycle Operation Granting interface
(Grant Lifecycle Operation) [#NFV-SOL003_331]_
* VNF Lifecycle Management interface
(Instantiate/Heal/Scale VNF) [#NFV-SOL003_331]_
Testing
========
Unit and functional test cases will be added for the new placement functionalities.
Documentation Impact
====================
New supported functions need to be added into the Tacker User Guide.
References
==========
.. [#NFV-SOL003_331] https://www.etsi.org/deliver/etsi_gs/NFV-SOL/001_099/003/03.03.01_60/gs_nfv-sol003v030301p.pdf
.. [#Compute-API] https://docs.openstack.org/api-ref/compute/?expanded=get-detailed-availability-zone-information-detail#availability-zones-os-availability-zone
.. [#Keystone-API_endpoints] https://docs.openstack.org/api-ref/identity/v3/?expanded=list-endpoints-detail#list-endpoints
.. [#Keystone-API_services] https://docs.openstack.org/api-ref/identity/v3/?expanded=list-services-detail#list-services
.. [#Heat-API] https://docs.openstack.org/api-ref/orchestration/v1/index.html?expanded=show-stack-details-detail#show-stack-details