Add Fenix architecture documentation

Story: 2004292
Task: #27875
Change-Id: Ifeed52ea6046372da6d44d3d58a38070a8d897da
Signed-off-by: Tomi Juvonen <tomi.juvonen@nokia.com>
.. _architecture:

==================
Fenix Architecture
==================

Fenix is an engine designed to make rolling infrastructure maintenance and
upgrades possible with zero downtime for the applications running on top of
it. Interfaces are designed to be generic, so they can work with different
clouds, virtual machines and containers. The first use case is with OpenStack
and VMs, but the aim is to have a wider scope, covering for example edge
(Akraino) and Airship.

The key to Fenix providing zero downtime is the ability to communicate with an
application manager (VNFM). As the application is aware of maintenance
affecting its instances, it can safely be running somewhere else when the
maintenance happens. The application also gets to know about new capabilities
coming with the infrastructure maintenance/upgrade and can plan its own
upgrade at the same time. As Fenix can also send scaling requests to
applications, it is possible to make upgrades without adding more resources.

Fenix can tell any infrastructure service when a host is down for maintenance
or back in use. This is useful for different purposes, like enabling/disabling
self-healing or billing. The same interface could also be used for
adding/removing hosts.

The design makes it possible to do everything with 'one-click'. A generic API,
notifications and tracking in a database are provided by Fenix together with
example workflow and action plug-ins. To build for a specific cloud
deployment, one can provide workflow and action plug-ins to Fenix to fit any
use case one can think of.

Internal design
===============

The Fenix design is pluggable:

.. image:: ../images/fenix-internal.png
    :width: 1064 px
    :scale: 75 %
    :align: left

**fenix-api** is used to create maintenance workflow sessions and to provide
admins and project owners an API to communicate with Fenix.

**fenix-engine** runs the maintenance workflow sessions and keeps track of
them in a database.

**base workflow** provides the basic Fenix functionality that can be inherited
by the workflow plug-in used in each maintenance session.

**workflow plug-in** is the workflow for your maintenance session. Different
plug-ins can be implemented for different clouds and deployments.

**action plug-ins** are called by the workflow plug-in. It is possible to have
different types of plug-ins, and if there is more than one of a specific type,
one can also define the order in which they are executed:

* **pre** plug-in is run first
* **host** plug-in is run for each host
* **post** plug-in is run last

It is also possible to define 'metadata' to further indicate plug-in
specifics.

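
As a rough illustration of this ordering, the following sketch runs
hypothetical plug-ins in pre/host/post order. The ``ActionPlugin`` class and
its ``run`` interface are invented for the example, not the actual Fenix
plug-in API:

```python
# Illustrative sketch of the pre/host/post execution order. The plug-in
# interface shown here is hypothetical, not the actual Fenix API.
class ActionPlugin:
    def __init__(self, name, plugin_type, order=0):
        self.name = name
        self.type = plugin_type   # 'pre', 'host' or 'post'
        self.order = order        # execution order within the same type


def run_action_plugins(plugins, hosts):
    """Run 'pre' plug-ins first, 'host' plug-ins per host, 'post' last."""
    def by_type(plugin_type):
        return sorted((p for p in plugins if p.type == plugin_type),
                      key=lambda p: p.order)

    executed = []
    for plugin in by_type('pre'):
        executed.append(plugin.name)
    for host in hosts:
        for plugin in by_type('host'):
            executed.append((plugin.name, host))
    for plugin in by_type('post'):
        executed.append(plugin.name)
    return executed
```

In a real deployment each step would invoke the plug-in's action instead of
just recording its name; the point is only the type ordering and the
per-type 'order' attribute.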
Interface design
================

Fenix has an API and notifications that can be caught by different endpoint
interfaces by subscribing to the corresponding event alarms:

.. image:: ../images/fenix-interface.png
    :width: 1054 px
    :scale: 75 %
    :align: left

The infrastructure admin has an API to trigger, query, update and delete
maintenance sessions. The admin can also receive the status of a maintenance
session through the 'maintenance.session' notification over
'oslo.notification'. It is also possible to get the same information by
subscribing to the corresponding event alarm. This is handy for getting the
event to your own favorite API endpoint.

A project/application having instances on top of the infrastructure under
maintenance can have a manager (VNFM) to communicate with the maintenance
session workflow. The manager can subscribe to project-specific
'maintenance.planned' event alarms to get information about the maintenance
session state affecting its instances. The subscription also tells the
workflow that the project has a manager capable of communicating with the
workflow. Otherwise, the workflow should have a default behavior towards
project instances, or fail if communication is mandatory in your cloud use
case. There is also a project-specific API to query its instances under the
current maintenance workflow session state and to answer back to the
workflow.

Any infrastructure service can also be made to support the 'maintenance.host'
notification. This notification tells whether a host is in maintenance or
back in normal use. This might be important for enabling/disabling
self-healing or billing. The notification can also be used to indicate when a
host is added or removed.

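
As an illustration, an infrastructure service could consume this notification
with an oslo.messaging-style endpoint along these lines. The payload fields
'host' and 'state' are assumptions made for the example, not the exact Fenix
payload:

```python
# Sketch of a service reacting to 'maintenance.host' notifications, shaped
# like an oslo.messaging notification endpoint (info() receives the event
# type and payload). The payload keys used here are illustrative.
class MaintenanceHostEndpoint:
    def __init__(self):
        self.hosts_in_maintenance = set()

    def info(self, ctxt, publisher_id, event_type, payload, metadata):
        if event_type != 'maintenance.host':
            return
        host = payload['host']
        if payload['state'] == 'IN_MAINTENANCE':
            # e.g. disable self-healing or billing for this host
            self.hosts_in_maintenance.add(host)
        elif payload['state'] == 'MAINTENANCE_COMPLETE':
            # host is back in normal use
            self.hosts_in_maintenance.discard(host)
```

Such an endpoint would normally be registered on a notification listener for
the relevant topic; here the class is shown stand-alone.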
.. _baseworkflow:

==================
Fenix BaseWorkflow
==================

The BaseWorkFlow class implemented in '/fenix/workflow/workflow.py' is the one
you inherit when creating your own workflow. An example workflow 'default.py'
using this can be found in the workflow directory '/fenix/workflow/workflows'.

The class provides access to all maintenance session related data and the
ability to send Fenix notifications and to process the incoming API requests.

There is also a dictionary describing the generic workflow states that should
be supported:

.. code-block:: json

    {
        "MAINTENANCE": "maintenance",
        "SCALE_IN": "scale_in",
        "PREPARE_MAINTENANCE": "prepare_maintenance",
        "START_MAINTENANCE": "start_maintenance",
        "PLANNED_MAINTENANCE": "planned_maintenance",
        "MAINTENANCE_COMPLETE": "maintenance_complete",
        "MAINTENANCE_DONE": "maintenance_done",
        "MAINTENANCE_FAILED": "maintenance_failed"
    }

The key is the state name and the value is the internal method that you
implement in your workflow to handle that state. When the method returns, the
class variable 'self.state' is checked to find the next method to be called.
So your state related method should change 'self.state' to whatever you want
to do next. The method should also implement the calling of any action
plug-ins and other state related functionality like sending notifications.

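
The idea can be sketched as a minimal dispatch loop. This is a simplified
illustration of the mechanism, not the actual BaseWorkFlow code, and only a
few of the states are wired up:

```python
# Simplified sketch of how the state dictionary can drive a workflow.
# The real BaseWorkFlow implements all states and much more.
class SketchWorkflow:
    # Subset of the state -> method-name mapping shown above.
    states_methods = {
        'MAINTENANCE': 'maintenance',
        'MAINTENANCE_DONE': 'maintenance_done',
        'MAINTENANCE_FAILED': 'maintenance_failed',
    }

    def __init__(self):
        self.state = 'MAINTENANCE'
        self.trace = []

    def run(self):
        # Call the method mapped to the current state; each state method
        # sets 'self.state' to choose what happens next.
        while self.state not in ('MAINTENANCE_DONE', 'MAINTENANCE_FAILED'):
            self.trace.append(self.state)
            getattr(self, self.states_methods[self.state])()
        self.trace.append(self.state)

    def maintenance(self):
        # ...checks, notifications and action plug-ins would go here...
        self.state = 'MAINTENANCE_DONE'
```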
States
======

Here is what is supposed to be done in the different states when also
utilizing the default workflow.

MAINTENANCE
-----------

This is the initial state right after the infrastructure admin has created
the maintenance session.

Here one should check whether all projects are subscribed to the AODH event
alarm for the event type 'maintenance.planned'. If a project supports this,
one can assume we can interact with that project's manager (VNFM). If not, we
should have some default handling for project instances during the rolling
maintenance, or we should decide to go to state 'MAINTENANCE_FAILED' as we do
not support that kind of project. From here onwards, we assume projects
support this interaction, so we can better define the other coming states.

Next, we send a 'maintenance.planned' notification with state 'MAINTENANCE'
to each project. We wait for the duration of
'self.conf.project_maintenance_reply' for the replies, or fail if some
project did not reply. After all projects are in state 'ACK_MAINTENANCE' we
can wait until the time is 'self.session.maintenance_at' and then start the
actual maintenance.

When it is time to start, we might call the action plug-ins of type 'pre' to
perform actions needed before rolling forwards host by host. This might
include downloading the needed software changes and already performing some
actions on controllers, for example in a maintenance operation like an
OpenStack upgrade.

If currently all the compute capacity is in use and we want to have an empty
compute host that we can maintain first, we should set 'self.state' to
'SCALE_IN' to scale down the application. If there is capacity, but no empty
host (assuming we want to do maintenance only on an empty host), we can set
'self.state' to 'PREPARE_MAINTENANCE' to move instances around to obtain an
empty host if possible. In case we already had an empty host, we can directly
set 'self.state' to 'START_MAINTENANCE' to start maintenance on that host.

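
The capacity decision at the end of this state could be sketched as follows.
The two boolean inputs are simplifications invented for the example; the real
checks in a workflow plug-in are cloud specific:

```python
# Hypothetical sketch of the state decision at the end of 'MAINTENANCE'.
def next_state(free_capacity, have_empty_host):
    """Choose the next workflow state from compute capacity facts.

    free_capacity:   is there room to empty one host by moving instances
    have_empty_host: is there already at least one empty compute host
    """
    if have_empty_host:
        return 'START_MAINTENANCE'    # maintain an empty host right away
    if free_capacity:
        return 'PREPARE_MAINTENANCE'  # migrate instances to empty a host
    return 'SCALE_IN'                 # ask projects to scale down first
```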
SCALE_IN
--------

We send a 'maintenance.planned' notification with state 'SCALE_IN' to each
project. We wait for the duration of 'self.conf.project_scale_in_reply' for
the replies, or fail if some project did not reply. After all projects are in
the state 'ACK_SCALE_IN' we can repeat the same checks as in state
'MAINTENANCE' to decide whether 'self.state' should be 'SCALE_IN',
'PREPARE_MAINTENANCE' or 'START_MAINTENANCE'. Again, on any error we always
set 'self.state' to 'MAINTENANCE_FAILED'.

PREPARE_MAINTENANCE
-------------------

As we have some logic to figure out a host that we can make empty, we can
send a 'maintenance.planned' notification with state 'PREPARE_MAINTENANCE' to
each project having instances on that host. We wait for the duration of
'self.conf.project_maintenance_reply' for the replies, or fail if some
project did not reply. After all affected projects are in state
'ACK_PREPARE_MAINTENANCE' we can check the project and instance specific
answers and perform the given action, like 'migrate', to move instances away
from the host. After the action is done we send 'maintenance.planned' for
each instance with the state 'INSTANCE_ACTION_DONE' and with the
corresponding 'instance_id'.

Next, we should be able to set 'self.state' to 'START_MAINTENANCE'.

START_MAINTENANCE
-----------------

In case no hosts have been maintained yet, we can go through all empty
compute hosts in the maintenance session:

We send a 'maintenance.host' notification with state 'IN_MAINTENANCE' for
each host before we start to maintain it. Then we run the action plug-ins of
type 'host' in the order they are defined to run. After we are done with the
maintenance actions we send a 'maintenance.host' notification with state
'MAINTENANCE_COMPLETE'.

When all empty compute hosts are maintained we can set 'self.state' to
'PLANNED_MAINTENANCE'.

In case all empty hosts were already maintained, we can pick an empty host
that we have after 'PLANNED_MAINTENANCE' has been run on some compute host:

We send a 'maintenance.host' notification with state 'IN_MAINTENANCE' before
we start to maintain the host. Then we run the action plug-ins of type 'host'
in the order they are defined to run. After we are done with the maintenance
actions we send a 'maintenance.host' notification with state
'MAINTENANCE_COMPLETE'.

When all empty compute hosts are maintained we can set 'self.state' to
'PLANNED_MAINTENANCE', or if all compute hosts are maintained we can set
'self.state' to 'MAINTENANCE_COMPLETE'.

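
The per-host pattern described above can be sketched as follows. The
'notify' and 'run_host_plugins' callables stand in for the real notification
and action plug-in machinery:

```python
# Sketch of the per-host notification pattern in 'START_MAINTENANCE'.
# 'notify' and 'run_host_plugins' are placeholders for the real calls.
def maintain_hosts(empty_hosts, notify, run_host_plugins):
    for host in empty_hosts:
        # Tell infrastructure services the host is going into maintenance.
        notify('maintenance.host', {'host': host, 'state': 'IN_MAINTENANCE'})
        # Run action plug-ins of type 'host', in their defined order.
        run_host_plugins(host)
        # Tell infrastructure services the host is back in normal use.
        notify('maintenance.host',
               {'host': host, 'state': 'MAINTENANCE_COMPLETE'})
```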
PLANNED_MAINTENANCE
-------------------

We find a host that has not been maintained yet and that contains instances.
After choosing the host, we can send a 'maintenance.planned' notification
with state 'PLANNED_MAINTENANCE' to each project having instances on the
host. After all affected projects are in state 'ACK_PLANNED_MAINTENANCE' we
can check the project and instance specific answers and perform the given
action, like 'migrate', to move instances away from the host. After the
action is done we send 'maintenance.planned' with the state
'INSTANCE_ACTION_DONE' and with the 'instance_id' of the instance the action
was completed for. It might also be that the project manager already
re-instantiated on its own, so we do not have to do any action.

When the project manager receives 'PLANNED_MAINTENANCE' it also knows that
instances will now be moved to an already maintained host. The payload also
carries 'metadata' that can indicate the new capabilities the project is
getting when its instances are moved. It might be for example::

    "metadata": {"openstack_version": "Queens"}

It might be convenient to upgrade the application (VNF) now at the same time,
when its instances are anyhow being moved to a new compute host with new
capabilities.

Next, when all instances are moved and the host is empty, we can set
'self.state' to 'START_MAINTENANCE'.

MAINTENANCE_COMPLETE
--------------------

Now all instances have been moved to already maintained compute hosts and all
compute hosts are maintained. Next, we might run action plug-ins of type
'post' to finalize the maintenance.

When this is done we can send a 'maintenance.planned' notification with state
'MAINTENANCE_COMPLETE' to each project. In case projects scaled down at the
beginning of the maintenance they can now scale back to full operation. After
all projects are in state 'ACK_MAINTENANCE_COMPLETE' we can change
'self.state' to 'MAINTENANCE_DONE'.

MAINTENANCE_DONE
----------------

This makes the maintenance session idle until the infrastructure admin
deletes it.

MAINTENANCE_FAILED
------------------

This makes the maintenance session idle until the infrastructure admin fixes
and continues the session, or deletes it.

Future
======

Currently, the infrastructure admin needs to poll the Fenix API to know the
session state. When the notification with the event type
'maintenance.session' gets implemented, the infrastructure admin will receive
the state whenever it changes.

Users guide
===========

Users guide of Fenix.

.. toctree::
   :maxdepth: 2

   architecture
   baseworkflow