142 lines
5.7 KiB
ReStructuredText
142 lines
5.7 KiB
ReStructuredText
..
|
|
This work is licensed under a Creative Commons Attribution 3.0 Unported
|
|
License.
|
|
|
|
http://creativecommons.org/licenses/by/3.0/legalcode
|
|
|
|
Convention for heading levels in Neutron devref:
|
|
======= Heading 0 (reserved for the title in a document)
|
|
------- Heading 1
|
|
~~~~~~~ Heading 2
|
|
+++++++ Heading 3
|
|
''''''' Heading 4
|
|
(Avoid deeper levels because they do not render well.)
|
|
|
|
================================
|
|
Active/Passive High Availability
|
|
================================
|
|
|
|
Overview
|
|
--------
|
|
|
|
Initially it was assumed that there will only be a single kuryr-controller
|
|
instance in the Kuryr-Kubernetes deployment. While it simplified a lot of
|
|
controller code, it is obviously not a perfect situation. Having redundant
|
|
controllers can help with achieving higher availability and scalability of the
|
|
deployment.
|
|
|
|
Now with introduction of possibility to run Kuryr in Pods on Kubernetes cluster
|
|
HA is much easier to be implemented. The purpose of this document is to explain
|
|
how will it work in practice.
|
|
|
|
|
|
Proposed Solution
|
|
-----------------
|
|
|
|
There are two types of HA - Active/Passive and Active/Active. In this document
|
|
we'll focus on the former. A/P basically works as one of the instances being
|
|
the leader (doing all the exclusive tasks) and other instances waiting in
|
|
*standby* mode in case the leader *dies* to take over the leader role. As you
|
|
can see a *leader election* mechanism is required to make this work.
|
|
|
|
|
|
Leader election
|
|
~~~~~~~~~~~~~~~
|
|
|
|
The idea here is to use leader election mechanism based on Kubernetes
|
|
endpoints. The idea is neatly `explained on Kubernetes blog
|
|
<https://kubernetes.io/blog/2016/01/simple-leader-election-with-kubernetes/>`_.
|
|
Election is based on Endpoint resources, that hold annotation about current
|
|
leader and its leadership lease time. If leader dies, other instances of the
|
|
service are free to take over the record. Kubernetes API mechanisms will
|
|
provide update exclusion mechanisms to prevent race conditions.
|
|
|
|
This can be implemented by adding another *leader-elector* container to each
|
|
of kuryr-controller pods:
|
|
|
|
.. code:: yaml
|
|
|
|
- image: gcr.io/google_containers/leader-elector:0.5
|
|
name: leader-elector
|
|
args:
|
|
- "--election=kuryr-controller"
|
|
- "--http=0.0.0.0:${KURYR_CONTROLLER_HA_PORT:-16401}"
|
|
- "--election-namespace=kube-system"
|
|
- "--ttl=5s"
|
|
ports:
|
|
- containerPort: ${KURYR_CONTROLLER_HA_PORT:-16401}
|
|
protocol: TCP
|
|
|
|
This adds a new container to the pod. This container will do the
|
|
leader-election and expose the simple JSON API on port 16401 by default. This
|
|
API will be available to kuryr-controller container.
|
|
|
|
|
|
Kuryr Controller Implementation
|
|
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
|
|
The main issue with having multiple controllers is task division. All of the
|
|
controllers are watching the same endpoints and getting the same notifications,
|
|
but those notifications cannot be processed by multiple controllers at once,
|
|
because we end up with a huge race condition, where each controller creates
|
|
Neutron resources but only one succeeds to put the annotation on the Kubernetes
|
|
resource it is processing.
|
|
|
|
This is obviously unacceptable so as a first step we're implementing A/P HA,
|
|
where only the leader is working on the resources and the other instances wait
|
|
as standby. This will be implemented by periodically calling the leader-elector
|
|
API to check the current leader. On leader change:
|
|
|
|
* Pod losing the leadership will stop its Watcher. Please note that it will be
|
|
stopped gracefully, so all the ongoing operations will be completed.
|
|
* Pod gaining the leadership will start its Watcher. Please note that it will
|
|
get notified about all the previously created Kubernetes resources, but will
|
|
ignore them as they already have the annotations.
|
|
* Pods not affected by leadership change will continue to be in standby mode
|
|
with their Watchers stopped.
|
|
|
|
Please note that this means that in HA mode Watcher will not get started on
|
|
controller startup, but only when periodic task will notice that it is the
|
|
leader.
|
|
|
|
|
|
Issues
|
|
~~~~~~
|
|
|
|
There are certain issues related to orphaned OpenStack resources that we may
|
|
hit. Those can happen in two cases:
|
|
|
|
* Controller instance dies instantly during request processing. Some of
|
|
OpenStack resources were already created, but information about them was not
|
|
yet annotated onto Kubernetes resource. Therefore information is lost and we
|
|
end up with orphaned OpenStack resources. New leader will process the
|
|
Kubernetes resource by creating resources again.
|
|
* During leader transition (short period after a leader died, but before its
|
|
lease expired and periodic task on other controllers noticed that; this
|
|
shouldn't exceed 10s) some K8s resources are deleted. New leader will not
|
|
get the notification about the deletion and those will go unnoticed.
|
|
|
|
Both of this issues can be tackled by garbage-collector mechanism that will
|
|
periodically look over Kubernetes resources and delete OpenStack resources that
|
|
have no representation in annotations.
|
|
|
|
The latter of the issues can also be tackled by saving last seen
|
|
``resourceVersion`` of watched resources list when stopping the Watcher and
|
|
restarting watching from that point.
|
|
|
|
|
|
Future enhancements
|
|
~~~~~~~~~~~~~~~~~~~
|
|
|
|
It would be useful to implement the garbage collector and
|
|
``resourceVersion``-based protection mechanism described in section above.
|
|
|
|
Besides that to further improve the scalability, we should work on
|
|
Active/Active HA model, where work is divided evenly between all of the
|
|
kuryr-controller instances. This can be achieved e.g. by using
|
|
consistent hash ring to decide which instance will process which resource.
|
|
|
|
Potentially this can be extended with support for non-containerized deployments
|
|
by using Tooz and some other tool providing leader-election - like Consul or
|
|
Zookeeper.
|