Merge "Add notes for DCN operational availability modes"
commit
b57c8c163c
@ -28,6 +28,101 @@ to be co-located at the same physical location or datacenter. See
Such an architecture is referred to as "Distributed Compute Node" or "DCN" for
short.

Supported failure modes and High Availability recommendations
-------------------------------------------------------------

Handling failure scenarios for DCN starts at deployment planning, for example
by choosing a particular SDN solution over provider networks to meet the
expected SLA.

Loss of control plane connectivity
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

A failure of the central control plane affects all DCN edge sites. There are no
autonomous control planes at the edge, so no OpenStack control plane API or CLI
operations can be executed locally in that case. For example, you cannot create
a snapshot of a Nova VM, issue an auth token, or delete an image or a VM.

.. note:: A single Controller service failure normally induces
   no downtime for edge sites and should be handled as for usual HA deployments.

Loss of an edge site
^^^^^^^^^^^^^^^^^^^^

Nova VM instances that are already running keep running. If an instance stops
or crashes, the control plane must be reachable again before the workload can
be recovered.

.. note:: A single Compute service failure normally affects only its edge site
   without additional downtime induced for neighbor edge sites or the central
   control plane.

OpenStack infrastructure services, like Nova Compute, will automatically
reconnect to the MariaDB database cluster and the RabbitMQ broker when the
control plane's uplink is back. Operations that timed out cannot be resumed,
though, and must be retried manually.

It is recommended to maintain each DCN edge site as a separate Availability Zone
(AZ) for Nova/Neutron and Cinder services.

Improving resiliency for N/S and E/W traffic
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Reliability of the central control plane may be enhanced with an L3 HA network,
which only provides North-South routing. The East-West routing effectiveness of
edge networks may be improved by using DVR or highly available Open Virtual
Network (OVN). BGPVPN, with its backend-specific choices, is another option.

Network recommendations
^^^^^^^^^^^^^^^^^^^^^^^

Traditional provider networks with backbone routing at the edge may fulfill or
complement a custom distributed routing solution, such as an L3 Spine-Leaf
topology.

.. note:: Neutron SDN backends that involve tunnelling may be sub-optimal for
   Edge DCN cases because of the known issues 1808594_ and 1808062_.

.. _1808594: https://bugs.launchpad.net/tripleo/+bug/1808594
.. _1808062: https://bugs.launchpad.net/tripleo/+bug/1808062

That said, when a network failure disconnects the edge from the central site,
there is no SLA for recovery time beyond what the provider networks or a
particular SDN choice can guarantee. For switched/routed/MPLS provider
networks, recovery may take from tens of milliseconds to a few seconds. Outage
thresholds are typically considered to be 15 seconds. These figures trace back
to various standards that are relevant here.

Config-drive/cloud-init details
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The simplest solution we recommend for DCN involves only provider networks at
the edge. For that case, it is also recommended to use config-drive or a
configuration mechanism other than cloud-init. Otherwise, cloud-init requires a
``169.254.169.254/32`` route so that the provider routers forward its requests
to the metadata service.

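As a small illustration of why that route has to be configured explicitly, the
sketch below uses Python's standard ``ipaddress`` module to inspect the
metadata prefix. The helper name ``describe_route`` is hypothetical and exists
only for this example. The prefix is a single host inside the IPv4 link-local
range, which routers do not forward by default.

```python
import ipaddress

# The metadata service prefix mentioned in the text above.
METADATA_ROUTE = ipaddress.ip_network("169.254.169.254/32")


def describe_route(net):
    """Summarize a prefix: how many addresses it covers and whether it
    lies in the link-local range (hypothetical helper for illustration)."""
    return {
        "hosts": net.num_addresses,
        "link_local": net.is_link_local,
    }


if __name__ == "__main__":
    print(describe_route(METADATA_ROUTE))
```

With config-drive, the instance reads its configuration from an attached
device, so no such route is needed on the provider routers.
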
IPv6 details
^^^^^^^^^^^^

IPv6 for tenants' workloads and infrastructure tunnels interconnecting
the central site and the edge is a viable option as well. IPv6 cannot be used for
provisioning networks though. Key benefits IPv6 may provide for DCN are:

* SLAAC, an EUI-64 form of autoconfiguration that derives IPv6 addresses from
  MAC addresses and requires no DHCP services on the provider networks.
* Improved mobility for endpoints, like NFV APIs, to roam around different links
  and edge sites without losing their connections and IP addresses.
* Large content networks have shown end-to-end IPv6 to perform better than
  IPv4. This is largely due to the presence of NAT in most end-to-end IPv4
  connections, which slows them down.

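The EUI-64 derivation mentioned in the first bullet can be sketched in a few
lines of Python. This is a minimal illustration, not code taken from any
OpenStack component: the function name ``slaac_eui64_address``, the
documentation prefix ``2001:db8:0:1::/64``, and the sample MAC address are all
assumptions made for the example.

```python
import ipaddress


def slaac_eui64_address(prefix, mac):
    """Build the IPv6 address that EUI-64 autoconfiguration would derive
    from a /64 prefix and a 48-bit MAC address (illustrative helper)."""
    net = ipaddress.IPv6Network(prefix)
    if net.prefixlen != 64:
        raise ValueError("EUI-64 autoconfiguration expects a /64 prefix")

    octets = bytearray(int(part, 16) for part in mac.split(":"))
    if len(octets) != 6:
        raise ValueError("expected a 48-bit MAC address")

    # Flip the universal/local bit of the first octet.
    octets[0] ^= 0x02
    # Insert ff:fe between the two halves of the MAC address.
    interface_id = bytes(octets[:3]) + b"\xff\xfe" + bytes(octets[3:])

    # Combine the 64-bit prefix with the 64-bit interface identifier.
    return net[int.from_bytes(interface_id, "big")]


if __name__ == "__main__":
    print(slaac_eui64_address("2001:db8:0:1::/64", "fa:16:3e:12:34:56"))
    # prints 2001:db8:0:1:f816:3eff:fe12:3456
```

Because the address follows from the prefix and the MAC address alone, the
endpoint can configure itself without a DHCP service on the provider network.
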
Storage recommendations
^^^^^^^^^^^^^^^^^^^^^^^

Only ephemeral storage is available to Nova Compute services in DCN. It is up
to the edge cloud applications to provide enhanced data availability, locality
awareness, and/or replication mechanisms.

Deploying from a centralized undercloud
---------------------------------------