Infrastructure rolling maintenance and upgrade

Telecom operators have for years performed maintenance and upgrades in a rolling fashion. It is now time to achieve the same in OpenStack. A rolling upgrade keeps downtime to a minimum, both for the infrastructure and for the applications running on top of it.

Problem description

  • Infrastructure maintenance and upgrade need to be possible in a rolling fashion to minimize downtime for services and applications.
  • Maintenance and upgrade need to be manageable without adding more resources to the system, even when all compute capacity is in use.
  • It needs to be possible to know which hosts and instances have been maintained and which have not.
  • Generic messaging needs to be defined between the infrastructure and the application manager (VNFM).
  • It has to be possible to ask the application manager to scale down during non-busy hours to free up capacity for rolling maintenance and upgrade.
  • The application manager needs to know when a planned maintenance session is over, so it can scale back to full capacity.
  • The application manager needs to be aware of planned host maintenance, so that the application (VNF) is safely running somewhere else when the host goes down for maintenance.
  • Different infrastructure services need to be aware of a host being down for maintenance. This can be important for disabling automatic self-healing actions or billing. Generic messaging needs to be defined for this.
  • The application manager needs to know when its instances are about to move to an upgraded host, so it can also perform its own upgrade to take new capabilities into use.
  • The rolling maintenance framework needs to be pluggable to handle different maintenance and upgrade workflows and actions for hosts. This is also important for supporting different payloads and clouds.
  • The infrastructure admin needs to be able to have rolling maintenance done with one click.
  • The infrastructure admin needs to be able to follow the rolling maintenance status through an API and notifications.
  • It must be possible for each maintenance session to define the software packages and plug-ins needed to run the maintenance workflow properly.
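
The messaging requirements above can be illustrated with a minimal sketch of how an application manager might react to maintenance messages. Every name here (``SessionState``, ``MaintenanceMessage``, ``ApplicationManager`` and the reply strings) is a hypothetical placeholder chosen for illustration, not the actual Fenix or Tacker interface.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class SessionState(Enum):
    """Hypothetical message types; not taken from any real API."""
    SCALE_DOWN_REQUESTED = "scale_down_requested"
    HOST_MAINTENANCE_PLANNED = "host_maintenance_planned"
    SESSION_COMPLETE = "session_complete"


@dataclass
class MaintenanceMessage:
    """Assumed payload sent from the infrastructure to the application manager."""
    session_id: str
    state: SessionState
    host: Optional[str] = None
    instance_ids: List[str] = field(default_factory=list)


class ApplicationManager:
    """Toy application manager (VNFM) that tracks capacity and moved instances."""

    def __init__(self, full_capacity: int, reduced_capacity: int):
        self.full_capacity = full_capacity
        self.reduced_capacity = reduced_capacity
        self.capacity = full_capacity
        self.moved_instances: List[str] = []

    def handle(self, msg: MaintenanceMessage) -> str:
        """React to one message and return a reply for the infrastructure."""
        if msg.state is SessionState.SCALE_DOWN_REQUESTED:
            # Free capacity so hosts can be emptied without extra resources.
            self.capacity = self.reduced_capacity
            return "scaled_down"
        if msg.state is SessionState.HOST_MAINTENANCE_PLANNED:
            # Record the instances that must run elsewhere before the host goes down.
            self.moved_instances.extend(msg.instance_ids)
            return "ready_for_host_maintenance"
        if msg.state is SessionState.SESSION_COMPLETE:
            # Maintenance is over, so scale back to full capacity.
            self.capacity = self.full_capacity
            return "scaled_up"
        raise ValueError(f"unknown state: {msg.state}")


if __name__ == "__main__":
    manager = ApplicationManager(full_capacity=10, reduced_capacity=8)
    print(manager.handle(MaintenanceMessage("session-1", SessionState.SCALE_DOWN_REQUESTED)))
    print(manager.capacity)
```

The reply value stands in for the acknowledgement the infrastructure would presumably wait for before touching a host; a real implementation would also need timeouts and error replies.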

OpenStack projects used

All of the problems mentioned above are being addressed by the new Fenix project, which manages rolling maintenance and upgrade. More about its internals can be read in the project's own documentation and blueprints. Proof-of-concept code is already being tested in the OPNFV Doctor CI with a sample implementation. The Doctor maintenance design document describes the initial interaction needed. In addition, the OpenStack Vancouver summit presentation "How to gain VNF zero downtime during Infrastructure Maintenance and Upgrade" shows the way towards implementing Fenix.

As Fenix can interact with the application manager, there is a blueprint to support this interaction in Tacker. This would enable a complex test case to be built that tests the Fenix workflow using purely OpenStack components.

To disable self-healing actions for hosts under maintenance, Vitrage and Masakari could support the Fenix host maintenance notification.
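
As a rough illustration of that idea, a self-healing service could keep track of hosts that are under maintenance and skip recovery actions for them. The ``SelfHealingGuard`` class and its method names are assumptions made for this sketch; they do not describe how Vitrage or Masakari work.

```python
class SelfHealingGuard:
    """Hypothetical helper that suppresses healing for hosts in maintenance."""

    def __init__(self):
        self._hosts_in_maintenance = set()

    def on_host_maintenance(self, host: str, in_maintenance: bool) -> None:
        """Consume an (assumed) host maintenance notification."""
        if in_maintenance:
            self._hosts_in_maintenance.add(host)
        else:
            # discard() does nothing if the host was never marked.
            self._hosts_in_maintenance.discard(host)

    def should_heal(self, host: str) -> bool:
        """Return True only when a failure on this host is unexpected."""
        return host not in self._hosts_in_maintenance


if __name__ == "__main__":
    guard = SelfHealingGuard()
    guard.on_host_maintenance("compute-1", in_maintenance=True)
    print(guard.should_heal("compute-1"))
    print(guard.should_heal("compute-2"))
```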

As workflows can differ, there has already been some discussion with the Airship and Blazar projects. Blazar should create a blueprint that makes it possible to change application-specific reservations to support rolling maintenance. Airship could later look into implementing its own maintenance and upgrade process by utilizing Fenix.

Upgrade checks for different projects are a community goal for Stein. This is one step towards automated rolling upgrades.

Future work

Fenix blueprints indicate what is yet to be done for the basic Fenix engine. When this work is ready, one can concentrate on making the sample workflow plug-in for the rolling upgrade, the sample upgrade action plug-ins and the framework for testing them. Ideally, the framework use case would be the OpenStack and application (VNF) upgrade. This can then serve as an example for implementing one's own workflow and other plug-ins for a specific real-world use case.
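
To make the plug-in idea more concrete, the following is a minimal sketch of a rolling workflow that runs pluggable actions one host at a time. The ``ActionPlugin`` interface, the two sample actions and ``RollingWorkflow`` are hypothetical names invented for this illustration; the actual Fenix plug-in interfaces are defined by the project's own blueprints and may look quite different.

```python
from abc import ABC, abstractmethod
from typing import List


class ActionPlugin(ABC):
    """Hypothetical interface for one maintenance action on a host."""

    @abstractmethod
    def run(self, host: str) -> str:
        """Perform the action and return a short status string."""


class UpgradePackages(ActionPlugin):
    def run(self, host: str) -> str:
        # A real plug-in would install the packages defined for the session.
        return "packages upgraded"


class RebootHost(ActionPlugin):
    def run(self, host: str) -> str:
        return "rebooted"


class RollingWorkflow:
    """Sample workflow: take hosts out of service one by one."""

    def __init__(self, actions: List[ActionPlugin]):
        self.actions = actions

    def run(self, hosts: List[str]) -> List[str]:
        log = []
        for host in hosts:
            # Only one host is out of service at any time.
            log.append(f"{host}: disabled")
            for action in self.actions:
                log.append(f"{host}: {action.run(host)}")
            log.append(f"{host}: enabled")
        return log


if __name__ == "__main__":
    workflow = RollingWorkflow([UpgradePackages(), RebootHost()])
    for line in workflow.run(["compute-1", "compute-2"]):
        print(line)
```

Keeping the host loop in the workflow and the per-host steps in action plug-ins is one possible split; it lets a different cloud swap the actions without rewriting the rolling logic.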