Add Fenix architecture documentation

Story: 2004292
Task: #27875

Change-Id: Ifeed52ea6046372da6d44d3d58a38070a8d897da
Signed-off-by: Tomi Juvonen <tomi.juvonen@nokia.com>
Tomi Juvonen 2018-11-09 10:05:19 +02:00
parent 36af47855e
commit d52ea93ce0
5 changed files with 296 additions and 1 deletion

Binary file not shown (new image, 63 KiB).

Binary file not shown (new image, 40 KiB).

View File

@ -0,0 +1,101 @@
.. _architecture:
==================
Fenix Architecture
==================
Fenix is an engine designed to make rolling infrastructure maintenance and
upgrades possible with zero downtime for the applications running on top of it.
Interfaces are designed to be generic, so they can work with different clouds,
virtual machines and containers. The first use case is with OpenStack and VMs,
but the aim is to have a wider scope, including edge (Akraino) and Airship.
The key to Fenix providing zero downtime is its ability to communicate with an
application manager (VNFM). As the application is made aware of maintenance
affecting its instances, it can make sure they are safely running somewhere
else when it happens. The application also gets to know about new capabilities
coming with the infrastructure maintenance/upgrade and can plan its own upgrade
at the same time. As Fenix can also send scaling requests to applications, it
is possible to make upgrades without adding more resources.
Fenix has the ability to tell any infrastructure service when a host is down
for maintenance or back in use. This is useful for different purposes, like
enabling/disabling self-healing or billing. The same interface could also be
used for adding/removing hosts.
The design makes it possible to do everything with 'one click'. A generic API,
notifications and tracking in a database are provided by Fenix, together with
an example workflow and action plug-ins. To build for a specific cloud
deployment, one can provide Fenix with workflow and action plug-ins that fit
any use case one can think of.
Internal design
===============
Fenix design is pluggable:
.. image:: ../images/fenix-internal.png
:width: 1064 px
:scale: 75 %
:align: left
**fenix-api** is used to create maintenance workflow sessions and to provide
admins and project owners an API to communicate with Fenix.
**fenix-engine** runs the maintenance workflow sessions and keeps track of them
in the database.
**base workflow** provides the basic Fenix functionality that can be inherited
by the workflow plug-in used in each maintenance session.
**workflow plug-in** is the workflow for your maintenance session. Different
plug-ins can be implemented for different clouds and deployments.
**action plug-ins** are called by the workflow plug-in. It is possible to have
different types of plug-ins, and if there is more than one of a specific type,
one can also define the order in which they are executed:
* **pre** plug-in is run first
* **host** plug-in is run for each host
* **post** plug-in is run last
It is also possible to define 'metadata' to further indicate plug-in
specifics.
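To make the ordering concrete, here is a minimal sketch of how a workflow
plug-in could run its action plug-ins in this pre/host/post order. It is
illustrative only and uses its own 'ActionPlugin' class rather than the real
Fenix plug-in interface:

.. code-block:: python

    # Illustrative sketch only; the real Fenix plug-in interface differs.
    class ActionPlugin(object):
        """One action plug-in; 'order' decides execution order within a type."""
        def __init__(self, name, plugin_type, order, metadata=None):
            self.name = name
            self.type = plugin_type      # 'pre', 'host' or 'post'
            self.order = order
            self.metadata = metadata or {}

        def run(self, host=None):
            print("running %s plug-in %s on %s" % (self.type, self.name, host))

    def run_action_plugins(plugins, hosts):
        def of_type(plugin_type):
            return sorted((p for p in plugins if p.type == plugin_type),
                          key=lambda p: p.order)
        for plugin in of_type('pre'):        # 'pre' plug-ins run first
            plugin.run()
        for host in hosts:                   # 'host' plug-ins run for each host
            for plugin in of_type('host'):
                plugin.run(host=host)
        for plugin in of_type('post'):       # 'post' plug-ins run last
            plugin.run()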
Interface design
================
Fenix has an API and notifications that can be caught by different endpoint
interfaces by subscribing to the corresponding event alarms:
.. image:: ../images/fenix-interface.png
:width: 1054 px
:scale: 75 %
:align: left
The infrastructure admin has an API to trigger, query, update and delete
maintenance sessions. The admin can also receive the status of a maintenance
session through the 'maintenance.session' notification over
'oslo.notification'. It is also possible to get the same information by
subscribing to the corresponding event alarm. This is handy for getting the
event to one's own favorite API endpoint.
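As a rough sketch, an admin-side tool could catch these notifications with
'oslo.messaging'. The transport URL, the 'notifications' topic and the payload
keys below are assumptions for the example, not values mandated by Fenix:

.. code-block:: python

    # Sketch of an admin-side listener; URL, topic and payload keys are assumed.
    from oslo_config import cfg
    import oslo_messaging

    class MaintenanceSessionEndpoint(object):
        def info(self, ctxt, publisher_id, event_type, payload, metadata):
            if event_type == 'maintenance.session':
                print('session %s is now in state %s'
                      % (payload.get('session_id'), payload.get('state')))

    transport = oslo_messaging.get_notification_transport(
        cfg.CONF, url='rabbit://user:password@controller:5672/')
    targets = [oslo_messaging.Target(topic='notifications')]
    listener = oslo_messaging.get_notification_listener(
        transport, targets, [MaintenanceSessionEndpoint()], executor='threading')
    listener.start()
    listener.wait()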
A project/application having instances on top of the infrastructure under
maintenance can have a manager (VNFM) that communicates with the maintenance
session workflow. The manager can subscribe to project-specific
'maintenance.planned' event alarms to get information about the maintenance
session state affecting its instances. The subscription also tells the
workflow that the project has a manager capable of communicating with the
workflow. Otherwise, the workflow should have a default behavior towards
project instances, or fail if communication is mandatory in your cloud use
case. There is also a project-specific API to query the project's instances
under the current maintenance workflow session state and to answer back to the
workflow.
Any infrastructure service can also be made to support the 'maintenance.host'
notification. This notification tells whether a host is in maintenance or back
in normal use. This might be important for enabling/disabling self-healing or
billing. The notification can also be used to indicate when a host is added or
removed.

View File

@ -0,0 +1,190 @@
.. _baseworkflow:
==================
Fenix BaseWorkflow
==================
The BaseWorkFlow class implemented in '/fenix/workflow/workflow.py' is the one
you inherit when creating your own workflow. An example workflow 'default.py'
using it can be found in the workflow directory '/fenix/workflow/workflows'.
The class provides access to all maintenance session related data and the
ability to send Fenix notifications and to process the incoming API requests.
There is also a dictionary describing the generic workflow states that should be
supported:
.. code-block:: json

    {
        "MAINTENANCE": "maintenance",
        "SCALE_IN": "scale_in",
        "PREPARE_MAINTENANCE": "prepare_maintenance",
        "START_MAINTENANCE": "start_maintenance",
        "PLANNED_MAINTENANCE": "planned_maintenance",
        "MAINTENANCE_COMPLETE": "maintenance_complete",
        "MAINTENANCE_DONE": "maintenance_done",
        "MAINTENANCE_FAILED": "maintenance_failed"
    }
The key is the state name and the value is the name of the internal method
that you implement in your workflow to handle that state. When the method
returns, the class variable 'self.state' is checked to find the next method to
be called. So your state handler should change 'self.state' to whatever you
want to do next. The method should also call any action plug-ins and implement
other state related functionality, like sending notifications.
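As an illustrative skeleton only (the real 'default.py' is more involved, and
the exact base class import should be checked against the source), a workflow
plug-in maps each state to a method and moves forward by updating
'self.state':

.. code-block:: python

    # Illustrative skeleton of a workflow plug-in; see 'default.py' for a
    # real one.
    from fenix.workflow.workflow import BaseWorkFlow  # name as used in this guide

    class Workflow(BaseWorkFlow):

        def maintenance(self):
            # Handle the 'MAINTENANCE' state, then pick the next state.
            self.state = 'START_MAINTENANCE'

        def start_maintenance(self):
            # Handle the 'START_MAINTENANCE' state.
            self.state = 'PLANNED_MAINTENANCE'

        def maintenance_failed(self):
            # Stay idle until the admin continues or deletes the session.
            pass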
States
======
Here is what is supposed to be done in the different states, also describing
how the default workflow handles them.
MAINTENANCE
-----------
This is the initial state, entered right after the infrastructure admin has
created the maintenance session.
Here one should check whether all projects are subscribed to the AODH event
alarm for the event type 'maintenance.planned'. If a project supports this, we
can assume we can interact with its project manager (VNFM). If not, we should
have some default handling for its instances during the rolling maintenance,
or we should go to the state 'MAINTENANCE_FAILED' as we do not support that
kind of project. From here onwards, we assume projects support this
interaction, so we can better define the remaining states.
Next, we send the 'maintenance.planned' notification with state 'MAINTENANCE'
to each project. We wait for the replies for the duration of
'self.conf.project_maintenance_reply', or fail if some project did not reply.
After all projects are in the state 'ACK_MAINTENANCE', we can wait until the
time is 'self.session.maintenance_at' and then start the actual maintenance.
When it is time to start, we might call the 'pre' type action plug-ins to make
any actions needed before rolling forwards host by host. This might include
downloading the needed software changes and already making some actions on the
controllers, for example in a maintenance operation like an OpenStack upgrade.
If currently all the compute capacity is in use and we want to have an empty
compute host that we can maintain first, we should set 'self.state' to
'SCALE_IN' to scale down the application. If there is capacity, but no empty
host (assuming we want to maintain only empty hosts), we can set 'self.state'
to 'PREPARE_MAINTENANCE' to move instances around and get an empty host if
possible. In case we already had an empty host, we can directly set
'self.state' to 'START_MAINTENANCE' to start maintenance on that host.
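That decision could be sketched as a method in the workflow plug-in; the
helpers 'need_scale_in()' and 'get_empty_hosts()' are assumed names, not part
of Fenix:

.. code-block:: python

    # Sketch only; 'need_scale_in' and 'get_empty_hosts' are assumed helpers.
    def maintenance(self):
        if self.need_scale_in():
            # All capacity in use: ask the applications to scale down first.
            self.state = 'SCALE_IN'
        elif not self.get_empty_hosts():
            # Capacity exists, but no empty host: make one by moving instances.
            self.state = 'PREPARE_MAINTENANCE'
        else:
            # An empty host already exists: start maintaining it right away.
            self.state = 'START_MAINTENANCE'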
SCALE_IN
--------
We send the 'maintenance.planned' notification with state 'SCALE_IN' to each
project. We wait for the replies for the duration of
'self.conf.project_scale_in_reply', or fail if some project did not reply.
After all projects are in the state 'ACK_SCALE_IN', we can repeat the same
checks as in the state 'MAINTENANCE' to decide whether 'self.state' should be
'SCALE_IN', 'PREPARE_MAINTENANCE' or 'START_MAINTENANCE'. As always, on any
error we put 'self.state' to 'MAINTENANCE_FAILED'.
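The wait-for-replies pattern used here and in the other states could look
roughly like the sketch below; 'project_states()' and the per-project state
values are assumptions, only 'self.conf.project_scale_in_reply' comes from the
text above:

.. code-block:: python

    import time

    # Sketch only; 'project_states()' is an assumed helper returning a dict
    # like {'project_id': 'ACK_SCALE_IN'}.
    def wait_for_projects(self, wanted_state, timeout):
        deadline = time.time() + timeout
        while time.time() < deadline:
            states = self.project_states()
            if all(state == wanted_state for state in states.values()):
                return True
            time.sleep(1)
        return False

    # Usage in the 'SCALE_IN' handler:
    #     if not self.wait_for_projects('ACK_SCALE_IN',
    #                                   self.conf.project_scale_in_reply):
    #         self.state = 'MAINTENANCE_FAILED'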
PREPARE_MAINTENANCE
-------------------
As we have some logic to figure out a host that we can make empty, we can send
the 'maintenance.planned' notification with state 'PREPARE_MAINTENANCE' to
each project having instances on that host. We wait for the replies for the
duration of 'self.conf.project_maintenance_reply', or fail if some project did
not reply. After all affected projects are in the state
'ACK_PREPARE_MAINTENANCE', we can check the project and instance specific
answers and perform the given action, like 'migrate', to move instances away
from the host. After the action is done, we send 'maintenance.planned' for
each instance with the state 'INSTANCE_ACTION_DONE' and with the corresponding
'instance_id'.
Next, we should be able to set 'self.state' to 'START_MAINTENANCE'.
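The per-instance part of that could be sketched as below; the reply format and
the migration and notification helpers are assumed names, not the actual
workflow code:

.. code-block:: python

    # Sketch only; 'replies' maps instance_id to the action the project chose.
    def handle_prepare_maintenance_replies(self, host, replies):
        for instance_id, action in replies.items():
            if action == 'MIGRATE':
                self.migrate(instance_id)       # assumed helper wrapping Nova
            elif action == 'LIVE_MIGRATE':
                self.live_migrate(instance_id)  # assumed helper
            # Tell the project the instance-specific action is done.
            self.notify_project_instance(instance_id, 'INSTANCE_ACTION_DONE')
        self.state = 'START_MAINTENANCE'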
START_MAINTENANCE
-----------------
In case no hosts have been maintained yet, we can go through all empty compute
hosts in the maintenance session:
We send the 'maintenance.host' notification with state 'IN_MAINTENANCE' for
each host before we start to maintain it. Then we run the action plug-ins of
type 'host' in the order they are defined to run. After we are ready with the
maintenance actions, we send the 'maintenance.host' notification with state
'MAINTENANCE_COMPLETE'.
When all empty compute hosts are maintained, we can set 'self.state' to
'PLANNED_MAINTENANCE'.
In case all empty hosts were already maintained, we can pick an empty host
that we get after 'PLANNED_MAINTENANCE' has been run on some compute host:
We send the 'maintenance.host' notification with state 'IN_MAINTENANCE' before
we start to maintain the host. Then we run the action plug-ins of type 'host'
in the order they are defined to run. After we are ready with the maintenance
actions, we send the 'maintenance.host' notification with state
'MAINTENANCE_COMPLETE'.
When all empty compute hosts are maintained, we can set 'self.state' to
'PLANNED_MAINTENANCE', or if all compute hosts are maintained, we can set
'self.state' to 'MAINTENANCE_COMPLETE'.
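Put together, maintaining one empty host could look like this sketch; the
notification and bookkeeping helpers are assumed names:

.. code-block:: python

    # Sketch of maintaining a single, already empty host.
    def maintain_host(self, host):
        self.notify_host(host, 'IN_MAINTENANCE')        # assumed helper
        for plugin in self.host_action_plugins():       # ordered 'host' plug-ins
            plugin.run(host=host)
        self.notify_host(host, 'MAINTENANCE_COMPLETE')  # assumed helper
        self.mark_host_maintained(host)                 # assumed bookkeeping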
PLANNED_MAINTENANCE
-------------------
We find a host that has not been maintained yet and that contains instances.
After choosing the host, we can send the 'maintenance.planned' notification
with state 'PLANNED_MAINTENANCE' to each project having instances on the host.
After all affected projects are in the state 'ACK_PLANNED_MAINTENANCE', we can
check the project and instance specific answers and perform the given action,
like 'migrate', to move instances away from the host. After the action is
done, we send 'maintenance.planned' with the state 'INSTANCE_ACTION_DONE' and
the 'instance_id' of the instance whose action was completed. It might also be
that the project manager already did its own action to re-instantiate, so we
do not have to do any action.
When the project manager receives 'PLANNED_MAINTENANCE', it also knows that
instances will now be moved to an already maintained host. The payload also
carries 'metadata' that can indicate new capabilities the project is getting
when its instances are moving. It might be, for example::

    "metadata": {"openstack_version": "Queens"}

It might be nice to make the application (VNF) upgrade now, at the same time
when the instances are anyhow moved to a new compute host with new
capabilities.
Next, when all instances are moved and the host is empty, we can set
'self.state' to 'START_MAINTENANCE'.
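For illustration, the manager (VNFM) side could handle the alarm payload
roughly as below; the payload keys and helper functions are assumptions based
on the description above, not a formal schema:

.. code-block:: python

    def plan_vnf_upgrade():
        # Placeholder for the application's own upgrade logic.
        pass

    def answer_workflow(state):
        # Placeholder for answering back over the project-specific Fenix API.
        pass

    def on_maintenance_planned(payload):
        # Sketch of VNFM-side handling of a 'maintenance.planned' alarm payload.
        if payload.get('state') == 'PLANNED_MAINTENANCE':
            metadata = payload.get('metadata', {})
            if metadata.get('openstack_version'):
                # New capabilities come with the move; a good moment to plan
                # the VNF's own upgrade as well.
                plan_vnf_upgrade()
            answer_workflow('ACK_PLANNED_MAINTENANCE')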
MAINTENANCE_COMPLETE
--------------------
Now all instances have been moved to already maintained compute hosts and all
compute hosts are maintained. Next, we might run the 'post' type action
plug-ins to finalize the maintenance.
When this is done, we can send the 'maintenance.planned' notification with
state 'MAINTENANCE_COMPLETE' to each project. In case projects scaled down at
the beginning of the maintenance, they can now scale back to full operation.
After all projects are in the state 'ACK_MAINTENANCE_COMPLETE', we can change
'self.state' to 'MAINTENANCE_DONE'.
MAINTENANCE_DONE
----------------
This makes the maintenance session idle until the infrastructure admin deletes
it.
MAINTENANCE_FAILED
------------------
This makes the maintenance session idle until the infrastructure admin fixes
and continues the session, or deletes it.
Future
======
Currently, the infrastructure admin needs to poll the Fenix API to know the
session state. When the notification with the event type 'maintenance.session'
gets implemented, the infrastructure admin will receive the state whenever it
changes.

View File

@ -2,4 +2,8 @@
Users guide
===========
Users guide of fenix.
.. toctree::
   :maxdepth: 2

   architecture
   baseworkflow