Merge "Add notes for DCN operational availability modes"

Zuul 2019-02-25 09:47:03 +00:00 committed by Gerrit Code Review
commit b57c8c163c


@ -28,6 +28,101 @@ to be co-located at the same physical location or datacenter. See
Such an architecture is referred to as "Distributed Compute Node" or "DCN" for
short.
Supported failure modes and High Availability recommendations
-------------------------------------------------------------
Handling negative scenarios for DCN starts at deployment planning, for example
by choosing a particular SDN solution over provider networks to meet the
expected SLA.
Loss of control plane connectivity
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
A failure of the central control plane affects all DCN edge sites. There are no
autonomous control planes at the edge, so no OpenStack control plane API or CLI
operations can be executed locally in that case. For example, you cannot create
a snapshot of a Nova VM, issue an auth token, or delete an image or a VM.
.. note:: A single Controller service failure normally induces
   no downtime for edge sites and should be handled as for usual HA deployments.
Loss of an edge site
^^^^^^^^^^^^^^^^^^^^
Running Nova VM instances will keep running. If they stop, you need the
control plane back to recover the stopped or crashed workloads.
.. note:: A single Compute service failure normally affects only its edge site
   without additional downtime induced for neighbor edge sites or the central
   control plane.
OpenStack infrastructure services, like Nova Compute, will automatically
reconnect to the MariaDB database cluster and the RabbitMQ broker when the
control plane's uplink is back. Timed-out operations cannot be resumed, though,
and need to be retried manually.
It is recommended to maintain each DCN edge site as a separate Availability Zone
(AZ) for Nova/Neutron and Cinder services.
Improving resiliency for N/S and E/W traffic
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Reliability of the central control plane may be enhanced with an L3 HA network,
which only provides North-South routing. The East-West routing effectiveness of
edge networks may be improved by using DVR or highly available Open Virtual
Network (OVN). BGPVPN and its backend-specific choices are another option.
Network recommendations
^^^^^^^^^^^^^^^^^^^^^^^
Traditional provider networks with backbone routing at the edge may fulfill or
complement a custom distributed routing solution, like L3 Spine-Leaf topology.
.. note:: Neutron SDN backends that involve tunnelling may be sub-optimal for
   Edge DCN cases because of the known issues 1808594_ and 1808062_.
.. _1808594: https://bugs.launchpad.net/tripleo/+bug/1808594
.. _1808062: https://bugs.launchpad.net/tripleo/+bug/1808062
That said, when a network failure disconnects the edge from the central site,
there is no SLA for recovery time beyond what the provider networks or a
particular SDN choice can guarantee. For switched/routed/MPLS provider
networks, recovery may take from tens of milliseconds to a few seconds. Outage
thresholds are typically considered to be 15 seconds; these trace back to
various standards that are relevant here.
Config-drive/cloud-init details
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The simplest solution we recommend for DCN involves only provider networks
at the edge. In that case, it is also recommended to use config-drive or
another configuration mechanism instead of cloud-init. Otherwise, cloud-init
requires a ``169.254.169.254/32`` route so that the provider routers forward
its requests to the metadata service.
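As a rough illustration of the routing requirement above, the following Python
sketch checks whether a list of static routes contains an explicit route that
covers the metadata address. The route list, next hops and the function name
``metadata_routes`` are hypothetical, and ignoring the default route is an
assumption of this sketch rather than documented router behavior.

```python
import ipaddress

# Link-local address that instances query for metadata.
METADATA_IP = ipaddress.ip_address("169.254.169.254")


def metadata_routes(routes):
    """Return (network, next_hop) pairs whose destination covers METADATA_IP.

    Default routes (prefix length 0) are skipped: this sketch assumes that an
    explicit route is wanted. Results are ordered most specific first.
    """
    matches = []
    for destination, next_hop in routes:
        network = ipaddress.ip_network(destination)
        if network.prefixlen == 0:
            continue
        if METADATA_IP in network:
            matches.append((network, next_hop))
    return sorted(matches, key=lambda item: item[0].prefixlen, reverse=True)


# Hypothetical static routes of a provider router (documentation addresses).
example_routes = [
    ("0.0.0.0/0", "192.0.2.1"),
    ("198.51.100.0/24", "192.0.2.2"),
    ("169.254.169.254/32", "192.0.2.10"),
]

for network, next_hop in metadata_routes(example_routes):
    print(f"{network} via {next_hop}")
```

An empty result would suggest that cloud-init cannot reach the metadata
service through that router, which is the case config-drive avoids.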
IPv6 details
^^^^^^^^^^^^
IPv6 for tenants' workloads and infrastructure tunnels interconnecting
the central site and the edge is a viable option as well. IPv6 cannot be used for
provisioning networks though. Key benefits IPv6 may provide for DCN are:
* SLAAC, an EUI-64 form of autoconfiguration that calculates IPv6 addresses
  from MAC addresses and requires no DHCP services on the provider networks.
* Improved mobility for endpoints, like NFV APIs, which can roam across
  different links and edge sites without losing their connections and IP
  addresses.
* Better end-to-end performance, as shown by large content networks. This is
  largely because NAT, present in most end-to-end IPv4 connections, slows them
  down.
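The EUI-64 derivation mentioned in the SLAAC item above can be sketched in
Python as follows. This is a minimal illustration, not part of any deployment
tooling: the function name ``slaac_eui64_address``, the documentation prefix
and the MAC address are assumptions chosen for the example, and hosts
configured to use privacy or stable-opaque interface identifiers will not
produce this address.

```python
import ipaddress


def slaac_eui64_address(prefix, mac):
    """Combine a /64 prefix with the modified EUI-64 interface ID of a MAC."""
    network = ipaddress.IPv6Network(prefix)
    if network.prefixlen != 64:
        raise ValueError("EUI-64 based SLAAC expects a /64 prefix")

    octets = bytes.fromhex(mac.replace(":", "").replace("-", ""))
    if len(octets) != 6:
        raise ValueError("expected a 48-bit MAC address")

    # Flip the universal/local bit of the first octet and insert ff:fe
    # between the two halves of the MAC address.
    eui64 = bytes([octets[0] ^ 0x02]) + octets[1:3] + b"\xff\xfe" + octets[3:6]
    interface_id = int.from_bytes(eui64, "big")
    return ipaddress.IPv6Address(int(network.network_address) | interface_id)


# Hypothetical values: a documentation prefix and a locally administered MAC.
print(slaac_eui64_address("2001:db8:0:1::/64", "52:54:00:12:34:56"))
# prints 2001:db8:0:1:5054:ff:fe12:3456
```

Because the address depends only on the advertised prefix and the interface
MAC, no DHCP service is needed on the provider network to assign it.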
Storage recommendations
^^^^^^^^^^^^^^^^^^^^^^^
DCN is available with only ephemeral storage for Nova Compute services.
It is up to the edge cloud applications to be designed to provide enhanced
data availability, locality awareness and/or replication mechanisms.
Deploying from a centralized undercloud
---------------------------------------