diff --git a/doc/source/images/fenix-interface.png b/doc/source/images/fenix-interface.png
new file mode 100644
index 0000000..d857faa
Binary files /dev/null and b/doc/source/images/fenix-interface.png differ
diff --git a/doc/source/images/fenix-internal.png b/doc/source/images/fenix-internal.png
new file mode 100644
index 0000000..69e8e6c
Binary files /dev/null and b/doc/source/images/fenix-internal.png differ
diff --git a/doc/source/user/architecture.rst b/doc/source/user/architecture.rst
new file mode 100644
index 0000000..09fbc50
--- /dev/null
+++ b/doc/source/user/architecture.rst
@@ -0,0 +1,101 @@
.. _architecture:

==================
Fenix Architecture
==================

Fenix is an engine designed to make rolling infrastructure maintenance and
upgrades possible with zero downtime for the applications running on top of
it. Its interfaces are designed to be generic, so they can work with different
clouds, virtual machines and containers. The first use case is OpenStack and
VMs, but the aim is a wider scope, including edge (Akraino) and Airship.

The key to Fenix providing zero downtime is its ability to communicate with an
application manager (VNFM). As the application is aware of maintenance
affecting its instances, it can safely be running somewhere else when the
maintenance happens. The application also learns about the new capabilities
coming with an infrastructure maintenance/upgrade and can plan its own upgrade
at the same time. As Fenix can also send scaling requests to applications, it
is possible to make upgrades without adding more resources.

Fenix can tell any infrastructure service when a host is down for maintenance
or back in use. This is handy for different things, like enabling/disabling
self-healing or billing. The same interface could also be used for
adding/removing hosts.

The design makes it possible to do everything with 'one-click'.
A generic API,
notifications and tracking in a database are provided by Fenix, together with
example workflow and action plug-ins. To build for a specific cloud
deployment, one can provide workflow and action plug-ins to Fenix to fit
any use case one can think of.


Internal design
===============

The Fenix design is pluggable:

.. image:: ../images/fenix-internal.png
   :width: 1064 px
   :scale: 75 %
   :align: left


**fenix-api** is used to create maintenance workflow sessions and to provide
the admin and project owners an API for communicating with Fenix.

**fenix-engine** runs the maintenance workflow sessions and keeps track of
them in a database.

**base workflow** provides the basic Fenix functionality that can be inherited
by the workflow plug-in used in each maintenance session.

**workflow plug-in** is the workflow for your maintenance session. Different
plug-ins can be implemented for different clouds and deployments.

**action plug-ins** are called by the workflow plug-in. It is possible to have
different types of plug-ins, and if there is more than one of a specific type,
one can also define the order in which they are executed:

* **pre** plug-in is run first
* **host** plug-in is run for each host
* **post** plug-in is run last

It is also possible to define 'metadata' to further indicate plug-in
specifics.

Interface design
================

Fenix has an API and notifications that can be caught by different endpoint
interfaces by subscribing to the corresponding event alarms:

.. image:: ../images/fenix-interface.png
   :width: 1054 px
   :scale: 75 %
   :align: left

The infrastructure admin has an API to trigger, query, update and delete
maintenance sessions. The admin can also receive the status of a maintenance
session via the 'maintenance.session' notification through
'oslo.notification'. It is also possible to get the same information by
subscribing to the corresponding event alarm.
This is handy for getting the event to one's favorite API endpoint.

A project/application having instances on top of the infrastructure under
maintenance can have a manager (VNFM) to communicate with the maintenance
session workflow. The manager can subscribe to project-specific
'maintenance.planned' event alarms to get information about the maintenance
session state affecting its instances. The subscription also tells the
workflow that the project has a manager capable of communicating with the
workflow. Otherwise, the workflow should have a default behavior towards
project instances, or fail if communication is mandatory in your cloud use
case. There is also a project-specific API to query the state of the project's
instances under the current maintenance workflow session and to answer back to
the workflow.

Any infrastructure service can also be made to support the 'maintenance.host'
notification. This notification tells whether a host is in maintenance or back
in normal use. This might be important for enabling/disabling self-healing or
billing. The notification can also be used to indicate when a host is added or
removed.

diff --git a/doc/source/user/baseworkflow.rst b/doc/source/user/baseworkflow.rst
new file mode 100644
index 0000000..a29f2d9
--- /dev/null
+++ b/doc/source/user/baseworkflow.rst
@@ -0,0 +1,190 @@
.. _baseworkflow:

==================
Fenix BaseWorkflow
==================

The BaseWorkFlow class implemented in '/fenix/workflow/workflow.py' is the one
you inherit when creating your own workflow. An example workflow, 'default.py',
using this class can be found in the workflow directory
'/fenix/workflow/workflows'.

The class provides access to all maintenance session related data and the
ability to send Fenix notifications and to process the incoming API requests.

There is also a dictionary describing the generic workflow states that should
be supported:

.. code-block:: json

    {
        "MAINTENANCE": "maintenance",
        "SCALE_IN": "scale_in",
        "PREPARE_MAINTENANCE": "prepare_maintenance",
        "START_MAINTENANCE": "start_maintenance",
        "PLANNED_MAINTENANCE": "planned_maintenance",
        "MAINTENANCE_COMPLETE": "maintenance_complete",
        "MAINTENANCE_DONE": "maintenance_done",
        "MAINTENANCE_FAILED": "maintenance_failed"
    }

The key is the state name and the value is the internal method that you
implement in your workflow to handle that state. When the method returns, the
class variable 'self.state' is checked to find the next method to be called.
Your state-related method should therefore change 'self.state' to whatever you
want to do next. The method should also implement the calling of any action
plug-ins and other state-related functionality, like sending notifications.

States
======

Here is what is supposed to be done in the different states when utilizing the
default workflow.

MAINTENANCE
-----------

This is the initial state, right after the infrastructure admin has created
the maintenance session.

Here one should check whether all projects are subscribed to the AODH event
alarm for event type 'maintenance.planned'. If a project supports this, one
can assume we can interact with that project's manager (VNFM). If not, we
should have some default handling for the project's instances during rolling
maintenance, or we should decide to go to state 'MAINTENANCE_FAILED' as we do
not support that kind of project. From here onwards, we assume projects
support this interaction, so we can better define the other states.

Next, we send the 'maintenance.planned' notification with state 'MAINTENANCE'
to each project. We wait for a reply for the duration of
'self.conf.project_maintenance_reply', or fail if some project did not reply.
After all projects are in state 'ACK_MAINTENANCE' we can wait until the time
is 'self.session.maintenance_at' and then start the actual maintenance.
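The wait-for-replies step can be sketched as follows. This is a toy
illustration only, not the actual Fenix implementation: 'wait_for_replies',
'get_state' and the project names are made up for the example, and a real
workflow would read reply states from its database.

```python
import time


def wait_for_replies(projects, get_state, deadline, interval=0.0):
    """Wait until every project reports 'ACK_MAINTENANCE', or fail.

    'get_state' is a hypothetical callable returning a project's current
    reply state; 'deadline' is an absolute time.time() value derived from
    something like self.conf.project_maintenance_reply.
    """
    pending = set(projects)
    while pending:
        # Drop projects that have already acknowledged.
        pending = {p for p in pending if get_state(p) != "ACK_MAINTENANCE"}
        if pending and time.time() > deadline:
            # A real workflow would set self.state = "MAINTENANCE_FAILED".
            return False
        if pending:
            time.sleep(interval)
    return True


# Usage with a stub where every project has already replied:
replies = {"project-a": "ACK_MAINTENANCE", "project-b": "ACK_MAINTENANCE"}
print(wait_for_replies(replies, replies.get, time.time() + 1))  # True
```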

When it is time to start, we might call the 'pre' type action plug-ins to make
the actions needed before rolling forwards host by host. This might include
downloading the needed software changes and already making some actions on the
controllers, in case of a maintenance operation like an OpenStack upgrade.

If all the compute capacity is currently in use and we want an empty compute
host that we can maintain first, we should set 'self.state' to 'SCALE_IN' to
scale down the application. If there is capacity, but no empty host (assuming
we only want to maintain an empty host), we can set 'self.state' to
'PREPARE_MAINTENANCE' to move instances around to obtain an empty host if
possible. In case we already had an empty host, we can set 'self.state'
straight to 'START_MAINTENANCE' to start maintenance on that host.

SCALE_IN
--------

We send the 'maintenance.planned' notification with state 'SCALE_IN' to each
project. We wait for a reply for the duration of
'self.conf.project_scale_in_reply', or fail if some project did not reply.
After all projects are in the state 'ACK_SCALE_IN' we can repeat the same
checks as in state 'MAINTENANCE' to decide whether 'self.state' should be
'SCALE_IN', 'PREPARE_MAINTENANCE' or 'START_MAINTENANCE'. Again, on any error
we always set 'self.state' to 'MAINTENANCE_FAILED'.

PREPARE_MAINTENANCE
-------------------

As we have some logic to figure out the host that we can make empty, we can
send the 'maintenance.planned' notification with state 'PREPARE_MAINTENANCE'
to each project having instances on that host. We wait for a reply for the
duration of 'self.conf.project_maintenance_reply', or fail if some project did
not reply. After all affected projects are in state 'ACK_PREPARE_MAINTENANCE'
we can check the project- and instance-specific answers and perform the given
action, like 'migrate', to move instances away from the host.
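As an illustration, the per-instance answer checked here might look like the
following. The field names and action values are assumptions made for this
sketch, not the exact Fenix payload.

```python
# Hypothetical reply from a project manager (VNFM) as seen by the workflow:
# the project acknowledges the state and names an action per instance.
reply = {
    "state": "ACK_PREPARE_MAINTENANCE",
    "instance_actions": {
        # Instance IDs below are made up for the example.
        "7c1a2b3c-0000-0000-0000-000000000001": "MIGRATE",
        "7c1a2b3c-0000-0000-0000-000000000002": "OWN_ACTION",
    },
}

# The workflow would perform the given action for each instance; an
# 'OWN_ACTION' answer would mean the manager handles that instance itself.
to_migrate = [iid for iid, action in reply["instance_actions"].items()
              if action == "MIGRATE"]
print(to_migrate)
```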
After the action is done, we send
'maintenance.planned' for each instance with the state 'INSTANCE_ACTION_DONE'
and with the corresponding 'instance_id'.

Next, we should be able to set 'self.state' to 'START_MAINTENANCE'.

START_MAINTENANCE
-----------------

In case no hosts have been maintained yet, we can go through all empty compute
hosts in the maintenance session:

    We send the 'maintenance.host' notification with state 'IN_MAINTENANCE'
    for each host before we start to maintain it. Then we run the action
    plug-ins of type 'host' in the order they are defined to run. After we are
    done with the maintenance actions, we send the 'maintenance.host'
    notification with state 'MAINTENANCE_COMPLETE'.

    When all empty compute hosts are maintained, we can set 'self.state' to
    'PLANNED_MAINTENANCE'.

In case all empty hosts were already maintained, we can pick the empty host
that we have after 'PLANNED_MAINTENANCE' has been run on some compute host:

    We send the 'maintenance.host' notification with state 'IN_MAINTENANCE'
    before we start to maintain the host. Then we run the action plug-ins of
    type 'host' in the order they are defined to run. After we are done with
    the maintenance actions, we send the 'maintenance.host' notification with
    state 'MAINTENANCE_COMPLETE'.

    When all empty compute hosts are maintained, we can set 'self.state' to
    'PLANNED_MAINTENANCE', or if all compute hosts are maintained, we can set
    'self.state' to 'MAINTENANCE_COMPLETE'.

PLANNED_MAINTENANCE
-------------------

We find a host that has not been maintained yet and contains instances. After
choosing the host, we can send the 'maintenance.planned' notification with
state 'PLANNED_MAINTENANCE' to each project having instances on the host.
After all affected projects are in state 'ACK_PLANNED_MAINTENANCE' we can
check the project- and instance-specific answers and perform the given action,
like 'migrate', to move instances away from the host.
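The 'maintenance.planned' notification sent here might carry a payload along
the following lines. The exact field set and values are assumptions made for
illustration; consult the Fenix notification documentation for the real
payload.

```python
# Hypothetical 'maintenance.planned' payload for one project; all IDs and
# timestamps are made up for the example.
payload = {
    "state": "PLANNED_MAINTENANCE",
    "session_id": "9a7dcbbc-0000-0000-0000-000000000000",
    "project_id": "b7a9c0ea-0000-0000-0000-000000000000",
    "instance_ids": ["7c1a2b3c-0000-0000-0000-000000000001"],
    # Actions the project manager may choose from for its instances.
    "allowed_actions": ["MIGRATE", "LIVE_MIGRATE", "OWN_ACTION"],
    "reply_at": "2018-02-28T06:40:16",
    # New capabilities the project gets when its instances move.
    "metadata": {"openstack_version": "Queens"},
}

# A project manager would answer before 'reply_at' with one of the
# allowed actions per instance.
print(sorted(payload["allowed_actions"]))
```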
After the action is done, we send
'maintenance.planned' with the state 'INSTANCE_ACTION_DONE' and with the
'instance_id' of the instance whose action was completed. It might also be
that the project manager has already re-instantiated the instance on its own,
so we do not have to take any action.

When the project manager receives 'PLANNED_MAINTENANCE', it also knows that
instances will now be moved to an already maintained host. The payload also
carries 'metadata' that can indicate the new capabilities the project is
getting when its instances are moved. It might be, for example:

    "metadata": {"openstack_version": "Queens"}

It might be nice to make the application (VNF) upgrade now, at the same time
as the instances are anyhow moved to a new compute host with new capabilities.

Next, when all instances are moved and the host is empty, we can set
'self.state' to 'START_MAINTENANCE'.

MAINTENANCE_COMPLETE
--------------------

Now all instances have been moved to already maintained compute hosts and all
compute hosts are maintained. Next, we might run the 'post' type action
plug-ins to finalize the maintenance.

When this is done, we can send the 'maintenance.planned' notification with
state 'MAINTENANCE_COMPLETE' to each project. In case projects scaled down at
the beginning of the maintenance, they can now scale back to full operation.
After all projects are in state 'ACK_MAINTENANCE_COMPLETE' we can change
'self.state' to 'MAINTENANCE_DONE'.

MAINTENANCE_DONE
----------------

This makes the maintenance session idle until the infrastructure admin
deletes it.

MAINTENANCE_FAILED
------------------

This makes the maintenance session idle until the infrastructure admin fixes
and continues the session, or deletes it.


Future
======

Currently, the infrastructure admin needs to poll the Fenix API to know the
session state.
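Such polling might be sketched as follows. This is an illustrative helper, not
part of Fenix: 'get_state' stands in for whatever fetches the session state,
for example an HTTP GET against the maintenance session resource (the endpoint
shape is an assumption).

```python
import time


def wait_for_session(get_state, interval=0.0,
                     end_states=("MAINTENANCE_DONE", "MAINTENANCE_FAILED")):
    """Poll until the maintenance session reaches a terminal state.

    'get_state' is any callable returning the current session state string,
    e.g. one wrapping an API query for the session.
    """
    while True:
        state = get_state()
        if state in end_states:
            return state
        time.sleep(interval)


# Usage with a stub that simulates a session finishing:
states = iter(["START_MAINTENANCE", "PLANNED_MAINTENANCE",
               "MAINTENANCE_DONE"])
print(wait_for_session(lambda: next(states)))  # MAINTENANCE_DONE
```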
When the notification with the event type 'maintenance.session' gets
implemented, the infrastructure admin will receive the session state whenever
it changes.
diff --git a/doc/source/user/index.rst b/doc/source/user/index.rst
index 066e544..1c99617 100644
--- a/doc/source/user/index.rst
+++ b/doc/source/user/index.rst
@@ -2,4 +2,8 @@
 Users guide
 ===========
 
-Users guide of fenix.
+.. toctree::
+   :maxdepth: 2
+
+   architecture
+   baseworkflow