ironic-specs/specs/approved/deployment-steps-framework.rst

..
 This work is licensed under a Creative Commons Attribution 3.0 Unported
 License.

 http://creativecommons.org/licenses/by/3.0/legalcode

==========================
Deployment Steps Framework
==========================

https://storyboard.openstack.org/#!/story/1753128

There is a desire for ironic to support customizable and extendable deployment
steps, which would provide the ability to prepare bare metal nodes (servers)
that better match the needs of the users who will be using the nodes.

In order to support that, we propose refactoring the existing deployment
code in ironic into a deployment steps framework, similar to the cleaning
steps framework.

Problem description
===================

Presently, ironic provides a way to prepare nodes prior to them being made
available for deployment (see `state diagram`_). This is done via `cleaning`_.
However, it is not always possible, efficient, or effective to perform some of
these preparations without knowing the requirements of the users of the
nodes. In addition, there may be operations that should only be done once the
users' requirements are known.

For example, during `cleaning`_, a node could be configured for RAID.
However, this might not be the desired RAID configuration that the user of the
node wants. Since the user's desires are only known at deployment time, a
mechanism that allows for custom RAID configuration during deployment is
preferred.

Features like custom RAID configuration, BIOS configuration, and custom
kernel boot parameters are a few use cases that would benefit from a way
of defining deployment steps at deploy time, in ironic.

It makes sense to provide support for this via deployment steps. This would
be conceptually similar to the cleaning steps supported by ironic already.

Proposed change
===============

This proposal is the first step in providing support for performing different
deployment operations based on the user's desires. (The `RFE to reconfigure
nodes on deploy using traits`_ is an example of a feature that depends on
this work.)

The proposed change is to implement a deployment steps (or ``deploy steps``)
framework that is very similar to the existing framework for automated and
manual `cleaning`_. (This was discussed and agreed upon in principle, at the
`OpenStack Dublin PTG`_.)

This change is internal to ironic. Users will not be able to affect the
deployment process any more than they can do today.

Conceptually, the clean steps model is a simple idea and operators are familiar
with it. Having similar deploy steps provides consistency and it will be easier
for operators to adopt, due to their familiarity with clean steps. It is also
powerful in that, at the end of the day (or year or two), a particular step
could be a clean step, a deploy step, or both.

This includes re-factoring of code to be used by both clean and deploy steps.

The existing deployment process will be implemented as a list of one (or more)
deploy steps.

What is a deploy step?
----------------------
Similar to clean steps, functions that are deploy steps will be decorated
with ``@deploy_step``, defined in ironic/drivers/base.py as follows::

 def deploy_step(priority, argsinfo=None):
    """Decorator for deployment steps.

    :param priority: an integer priority; used for determining the order in
        which the step is run in the deployment process. (See below,
        "When are deploy steps executed" for more details.)
    :param argsinfo: a dictionary of keyword arguments where key is the name of
        the argument and value is a dictionary as follows:

            ‘description’: <description>. Required. This should include
                           possible values.
            ‘required’: Boolean. Optional; default is False. True if this
                        argument is required.

An alternative is to have one decorator that allows specifying a function
to be a clean step and/or a deploy step, e.g.::

 @step(clean_priority=0, deploy_priority=0, argsinfo=None)

However, clean steps are abortable and deploy steps aren't (yet, see below),
and it is unclear whether other arguments might be added for the deploy step
decorator. Thus, it seems safer and simpler to have a separate decorator for
deploy steps.  (Having one decorator for both types of steps is left as a
future exercise.)

Although ironic allows cleaning to be aborted, ironic doesn't allow the
deployment to be aborted (although there is an `RFE to support abort in
deploy_wait`_). So it is outside the scope of this specification.

A deploy step can be implemented by any Interface, not just DeployInterface.

When are deploy steps executed?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Each deploy step has a priority; a non-negative integer. In this first phase,
the priorities will be hard-coded. There will be no way to turn off or change
these priorities.

The steps are executed from highest priority to lowest priority. Steps with
priorities of zero (0) are not executed. A step has to be finished, before the
next one is started.

Alternatives
------------

There may be other ways to provide support for customizable deployment
steps per user/instance, but there doesn't seem to be good reasons for
having a different design from that used for clean steps.

We could choose not to provide support for customized deploy steps on a per
user/instance basis. In that case, some of the current workarounds to overcome
this problem include:

* have groups of nodes configured in advance (using clean steps) for each
  required combination of configurations. This could lead to strange capacity
  planning issues.

* executing the desired configuration steps after each node is deployed.
  As these configuration steps are executed post-deploy, most of them need a
  reboot of the node, orchestration is needed to do these reboots properly,
  and this causes performance issues that are not acceptable in a production
  environment. This approach won't work for pre-deploy steps though, such as
  RAID for the boot disk.

* users can create their own images for each use case. But the limitation
  is that the number of images can grow exponentially, and that there is no
  ability to match a specific type of hardware with a specific image.

* use a customizable DeployInterface like the `ansible`_ deploy interface
  (although the `ansible`_ deploy interface is not recommended for production
  use). This may not be able to achieve the same level of access to the
  hardware or settings, to have the same effect.

Data model impact
-----------------

Similar to clean steps, a Node object will be updated with:

* a new ``deploy_step`` field: this is the current deploy step that is being
  executed or None if no steps have been executed yet. This will require an
  update to the DB.
* ``driver_internal_info['deploy_steps']``: the list of deploy steps to be
  executed.
* ``driver_internal_info['deploy_step_index']``: the index into the list of
  deploy steps (or None if no steps have been executed yet); this corresponds
  to node.deploy_step.

State Machine Impact
--------------------

No new state or transition will be added.

The state of the node will alternate from states.DEPLOYING (``deploying``) to
states.DEPLOYWAIT (``wait call-back``) for each asynchronous deploy step.

REST API impact
---------------

There will not be any new API methods.

GET /v1/nodes/*
~~~~~~~~~~~~~~~
The GET /v1/nodes/* requests that return information about nodes will
be modified to also return the node's ``deploy_step`` field and the
deploy-related information in the node's ``driver_internal_info`` field.

Similar to the ``clean_step`` field, the ``deploy_step`` field will be the
current deploy step being executed, or None if there is no deployment in
progress (or hasn't started yet).

If the deployment fails, the ``deploy_step`` field will show which step caused
the deployment to fail.

This change requires a new API version. For nodes that have not yet been
deployed using the deploy steps, the ``deploy_step`` field will be None, and
there won't be any deploy-related entries in the ``driver_internal_info``
field.

For older API versions, this ``deploy_step`` field will not be available,
although any deploy-related entries in the ``driver_internal_info`` field will
be shown.

Client (CLI) impact
-------------------
The only change (when the new API version is specified), is that the response
for a Node will include the new ``deploy_step`` field and during deployment,
the new deploy-step-related entries in the node's ``driver_internal_info``
field.

"ironic" CLI
~~~~~~~~~~~~
Even though this has been deprecated, responses will include the change
described above.

"openstack baremetal" CLI
~~~~~~~~~~~~~~~~~~~~~~~~~
Responses will inclde the change described above.

RPC API impact
--------------

None.

Driver API impact
-----------------

Similar to cleaning, these methods will be added to the
drivers.base.BaseInterface class::

    def get_deploy_steps(self, task):
        """Get a list of deploy steps this interface can perform on a node.

        :param task: a TaskManager object, useful for interfaces overriding this method
        :returns: a list of deploy step dictionaries
        """

    def execute_deploy_step(self, task, step):
        """Execute the deploy step on task.node.

        :param task: a TaskManager object
        :param step: The dictionary representing the step to execute
        :raises DeployStepFailed: if the step fails
        :returns: None if this method has completed synchronously, or
            states.DEPLOYWAIT if the step will continue to execute
            asynchronously.
        """

The actual deploy steps will be determined in the coding phase; we will start
with one big deploy step (to get the framework in) and then break that step up
into more steps -- determined by what makes sense given the existing code, and
the constraints (e.g. support for out-of-tree drivers, backwards compatibility
when a deploy step in release N is split into several steps in release N+1).

(This specification will be updated with the actual deploy steps, once that
is determined.)

Out-of-tree Interfaces
~~~~~~~~~~~~~~~~~~~~~~
Although the conductor will still support deployment the old way (without
deploy steps), this support will be deprecated and removed based on the
`standard deprecation policy
<https://governance.openstack.org/tc/reference/tags/assert_follows-standard-deprecation.html>`_.
(The deprecation period may be extended if there is a strong desire to do so
by the vendors; we're flexible.)

For out-of-tree interfaces that don't have deploy steps, the conductor will
emit (log) a deprecation warning, that the out-of-tree interface should be
updated to use deploy steps, and that all nodes that are being deployed
using the old way, need to be finished deploying, before an upgrade to the
release where there is no longer any more support for the old way.

Nova driver impact
------------------

None

Ramdisk impact
--------------

There should be no impact to the ramdisk (IPA).

In the future, when we allow configuration and specification of deploy steps
per node, we might provide support for collecting deploy steps from the
ramdisk, but that is out of scope for this first phase.

Security impact
---------------

None

Other end user impact
---------------------

None.

Scalability impact
------------------

None.

Performance Impact
------------------

None.

Other deployer impact
---------------------

None.

Developer impact
----------------

DeployInterfaces (and any other interfaces involved in the deployment process)
will need to be written with deploy steps in mind.


Implementation
==============

Assignee(s)
-----------

Primary assignee:
  * rloo (Ruby Loo)

Work Items
----------

Ironic:
  * Add support for deploy steps to base driver
  * rework the existing code into one or more deploy steps
  * Update the conductor to get the deploy steps and execute them

``python-ironicclient``:
  * Add support for node.deploy_step

Dependencies
============
None.

Testing
=======

* unit tests for all new code and changed behaviour
* CI jobs already test the deployment process; they should continue to work
  with these changes

Upgrades and Backwards Compatibility
====================================

* Old Interfaces will work with the new BaseInterface class because
  the code will cleanly fall back when an Interface does not support
  ``get_deploy_steps()``. A deprecation warning will be logged, and we will
  remove support for the old way according to the OpenStack policy for
  deprecations & removals.

* Likewise, an Interface implementation with ``get_deploy_steps()`` will work
  in an older version of Ironic.

* In a cold upgrade:

  * if the agent heartbeats and driver_internal_info['deploy_steps'] is empty,
    proceed the old way.
  * if a deployment is started by a conductor using deploy steps (new code),
    it means all the conductors are using the new code, so the deployment
    can continue on any conductor that supports the node

* In a rolling upgrade:

  * if the agent heartbeats and driver_internal_info['deploy_steps'] is empty,
    proceed the old way (similar to cold upgrade)
  * a new conductor will not use the deploy steps mechanism if it is pinned to
    the old release (via `pin_release_version` configuration option).
    if a deployment is started by a conductor using deploy steps (new code),
    it means that it is unpinned, and all the conductors are using the new
    code, so the deployment can continue on any conductor that supports the
    node.

Documentation Impact
====================

* api-ref: https://developer.openstack.org/api-ref/baremetal/ will be updated
  to include the new node.deploy_step field

References
==========

* `cleaning`_
* `OpenStack Dublin PTG`_ etherpad
* `RFE to reconfigure nodes on deploy using traits`_
* `RFE to support abort in deploy_wait`_
* `state diagram`_

.. _`cleaning`: https://docs.openstack.org/ironic/latest/admin/cleaning.html
.. _`OpenStack Dublin PTG`: https://etherpad.openstack.org/p/ironic-rocky-ptg-deploy-steps
.. _`RFE to reconfigure nodes on deploy using traits`: https://bugs.launchpad.net/ironic/+bug/1722275
.. _`RFE to support abort in deploy_wait`: https://bugs.launchpad.net/ironic/+bug/1498251
.. _`state diagram`: https://docs.openstack.org/ironic/latest/contributor/states.html
.. _`ansible`: https://docs.openstack.org/ironic/latest/admin/drivers/ansible.html