Obtaining Steps

https://storyboard.openstack.org/#!/story/1719925

https://storyboard.openstack.org/#!/story/1715419

In Ironic, we have a concept of steps [1] that are executed to achieve a task, using a blend of driver code running in the conductor and code operating inside the ironic-python-agent.

For this to be useful, we have to make what can be performed visible to the end user of the API. Presently, users can rely only upon documentation and on the state of the code, including any modules that could be loaded.

This issue is further compounded as the entire list of steps is a union of information identified from the ironic-conductor process managing the node and the ironic-python-agent process executing upon the node.

Note

This document is present in the backlog as there are implementation issues to this feature. Please see Gerrit change 606199 for more information.

Problem description

  • API users presently have to rely upon documentation of steps to know what is available.
  • Different steps may be available with different hardware managers.
  • With the increasing use of the Deploy Steps [2] framework, new steps should be anticipated to be added with new releases of Ironic.
  • The ironic-python-agent must be running to obtain a complete list of steps.

Proposed change

In order to keep this solution relatively lightweight, there are four fundamental changes that will be needed in order to facilitate visibility.

This does not seek to provide complete visibility by creating additional processes. Instead, it seeks to provide tools to collect data, with the limitation that we can only return the information that is currently available.

How to do it?

Step 1

The initial step is to provide an API endpoint that returns the currently available list of steps for a node, as visible to the conductor. The call path would be an API endpoint, to an RPC method, to a conductor manager method, which would then return the list of steps while tolerating the absence of ironic-python-agent.

Note

The ironic community consensus is that this feature should cache steps and return those cached steps as available to the user.
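
A minimal sketch of how a conductor-side lookup could tolerate a missing agent by relying on cached data, as the note above suggests. The names StepCache and list_steps are hypothetical and are not part of Ironic; the shape of the returned data is likewise an assumption made for illustration.

```python
from typing import Dict, List, Optional

Step = Dict[str, object]


class StepCache:
    """Hypothetical per-node cache of steps last reported by the agent."""

    def __init__(self) -> None:
        self._agent_steps: Dict[str, List[Step]] = {}

    def update(self, node_uuid: str, steps: List[Step]) -> None:
        # Called whenever a running agent reports its steps.
        self._agent_steps[node_uuid] = list(steps)

    def get(self, node_uuid: str) -> Optional[List[Step]]:
        # None means the agent has never reported steps for this node.
        return self._agent_steps.get(node_uuid)


def list_steps(node_uuid: str, conductor_steps: List[Step],
               cache: StepCache) -> List[dict]:
    """Return conductor steps, plus cached agent steps when present."""
    result = [{"source": "conductor", "steps": list(conductor_steps)}]
    cached = cache.get(node_uuid)
    if cached is not None:
        result.append({"source": "agent", "steps": cached})
    return result
```

With an empty cache, only the conductor entry is returned, so the call succeeds even when the agent has never run for the node.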

Step 2

Addition of a hold provision state verb and holding state.

Note

During a specific planning and discussion meeting to determine the path for a feature such as this, the ironic community reached a consensus on the call that a holding state would be useful, and could likely be implemented aside from the API functionality proposed in this backlog specification.

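=============  ===============  =======================================
Initial State  Temporary State  Possible next verbs
=============  ===============  =======================================
manageable     holding          manage, clean, provide, active, inspect
available      holding          active, manage, provide
=============  ===============  =======================================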

With the invocation of the state:

  • The machine is moved to the provisioning network.

Note

There is a slight issue with this transition: to be cleaned, the node would realistically need to be on the cleaning network. Operationally, changing the DHCP address is problematic, as we have learned with the rescue feature.

  • The deployment ramdisk is booted.
  • The ironic-python-agent would then be left in a running state, allowed to heartbeat (or be polled), and the API endpoint added in the prior step would fetch a complete list of steps that can be executed.

Alternatives

An alternative to this solution would be to provide an async API endpoint to perform the steps detailed in step 2, and cache the data which could then be retrieved by the user asynchronously. In this case, the user would have to poll the API to determine if the cached information has been updated.
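
The polling burden this alternative places on the user can be sketched as follows. This is only an illustration: poll_for_update is a hypothetical helper, and fetch stands in for whatever client call would retrieve the cached step data.

```python
import time
from typing import Callable, Optional


def poll_for_update(fetch: Callable[[], object], previous: object,
                    attempts: int = 10, interval: float = 1.0,
                    sleep: Callable[[float], None] = time.sleep
                    ) -> Optional[object]:
    """Poll until the fetched data differs from ``previous``.

    Returns the new data, or None if nothing changed within the
    allotted number of attempts.
    """
    for attempt in range(attempts):
        current = fetch()
        if current != previous:
            return current
        # Do not sleep after the final attempt.
        if attempt < attempts - 1:
            sleep(interval)
    return None
```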

The conundrum is that this would have to be constrained by states, which means we would still need to build state machine states around this to represent the current operation to users.

Data model impact

None

State Machine Impact

As noted above, we would add a new hold verb, which would allow transition back to the prior state. This hold verb would only be accessible from the manageable and available states.

In this holding state, API users would be able to request logical next steps, in-line with the present state, as detailed in the table above.
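
The transitions described above can be sketched as a small lookup. This is an illustrative model only, not Ironic's state machine code; HeldNode, hold, and next_verbs are hypothetical names, and the verb sets are copied from the table earlier in this document.

```python
from typing import NamedTuple, Set

# Verbs permitted from "holding", keyed by the state the node was in
# when "hold" was invoked (taken from the table in this spec).
HOLD_NEXT_VERBS = {
    "manageable": {"manage", "clean", "provide", "active", "inspect"},
    "available": {"active", "manage", "provide"},
}


class HeldNode(NamedTuple):
    provision_state: str
    prior_state: str  # remembered so the node can transition back


def hold(current_state: str) -> HeldNode:
    """Apply the hold verb; only manageable and available may hold."""
    if current_state not in HOLD_NEXT_VERBS:
        raise ValueError("hold is not allowed from %r" % current_state)
    return HeldNode("holding", current_state)


def next_verbs(node: HeldNode) -> Set[str]:
    """Return the verbs a held node may accept next."""
    return HOLD_NEXT_VERBS[node.prior_state]
```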

REST API impact

The node object returned would expose additional provision_state values; however, this is a known quantity with all state machine changes.

An additional provision state target verb, hold, will be added to trigger the state machine change.

An endpoint will be added to enable an API user to retrieve the list of known steps via the RPC interface and the conductor, using a GET request.

Note

Community consensus is that we should not be initiating a synchronous call to IPA to collect data, and that we should instead return cached data and somehow trigger the cache to be updated.

Example:

GET /v1/nodes/{node_ident}/steps[?type=(clean|deploy)]
{
  [{"source": "conductor",
    "deploy": [
      {
        "interface": "deploy",
        "step": "deploy",
        "priority": 100
      }
    ],
    "clean": [
      {
        "interface": "deploy",
        "step": "erase_devices",
        "reboot_requested": false,
        "priority": 10,
        "abortable": true
      },
      {
        "interface": "bios",
        "step": "apply_configuration",
        "args": {....},
        "priority": 0
      },
      {
        "interface": "raid",
        "step": "create_configuration",
        "args": {....},
        "priority": 0
      },
      {
        "interface": "raid",
        "step": "delete_configuration",
        "args": {....},
        "priority": 0
      }
    ]
  },
  {"source": "agent",
  ...
  }
  ]
}

If a specific type is requested, then the request shall only return the requested type of steps. If no type is defined, both sets will be returned to the caller.
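
The type filtering described above might look like the following sketch. It assumes the per-source layout from the example response and that the type value has already been validated; filter_steps is a hypothetical helper, not an existing Ironic function.

```python
from typing import Dict, List, Optional

STEP_TYPES = ("clean", "deploy")


def filter_steps(sources: List[Dict[str, object]],
                 step_type: Optional[str] = None) -> List[Dict[str, object]]:
    """Keep only the requested step type for every source.

    With no type given, both the clean and deploy sets are returned.
    """
    wanted = STEP_TYPES if step_type is None else (step_type,)
    filtered = []
    for entry in sources:
        kept = {"source": entry["source"]}
        for name in wanted:
            # A source that offers no steps of this type yields [].
            kept[name] = entry.get(name, [])
        filtered.append(kept)
    return filtered
```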

Normal response code: 200

Expected error codes:

* 400 with malformed request
* 503 upon conductor error
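
As a rough sketch of how these response codes could map onto a request handler, under the assumption that an invalid type parameter is what makes a request malformed. ConductorError and handle_get_steps are hypothetical names, and fetch_steps stands in for the RPC call to the conductor.

```python
from typing import Callable, Dict, Optional, Tuple


class ConductorError(Exception):
    """Hypothetical error raised when the conductor cannot answer."""


def handle_get_steps(query: Dict[str, str],
                     fetch_steps: Callable[[Optional[str]], object]
                     ) -> Tuple[int, object]:
    """Return an (HTTP status, body) pair for a steps request."""
    step_type = query.get("type")
    if step_type not in (None, "clean", "deploy"):
        # Malformed request: unknown step type.
        return 400, {"error": "invalid step type: %s" % step_type}
    try:
        steps = fetch_steps(step_type)
    except ConductorError as exc:
        # The conductor failed or could not be reached.
        return 503, {"error": str(exc)}
    return 200, steps
```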

Note

API micro-version will be incremented in accordance with standard procedure.

Client (CLI) impact

"ironic" CLI

None

"openstack baremetal" CLI

The openstack baremetal node steps and openstack baremetal node hold commands will be added to expose the functionality provided by this API.

RPC API impact

A new RPC method called get_steps will need to be added. It will accept a single argument indicating which class of steps the API user is requesting.

Driver API impact

None

Nova driver impact

None is required for this feature.

That being said, there is value in allowing a node that is being held, and is available for deployment, to be scheduled. This could be an optional enhancement that saves quite a bit of time in the deployment process. It could be enabled by allowing nova to consider a node in the holding state to be available for deployments, by also evaluating the target_provision_state of nodes in holding. This would be fairly tight coupling, but faster deployments are a frequent ask, and it is a route that we could take to enable such functionality in terms of "holding for deployment".
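
A sketch of the kind of check the nova driver might perform for this optional enhancement. This is purely illustrative: is_schedulable is a hypothetical function, and the assumption that a node held from available carries a target_provision_state of "available" is not established by this specification.

```python
from typing import Optional


def is_schedulable(provision_state: str,
                   target_provision_state: Optional[str]) -> bool:
    """Decide whether a node could be offered for deployment."""
    if provision_state == "available":
        return True
    # Assumption: a held node that would return to "available" can be
    # treated as deployable without waiting for it to leave holding.
    return (provision_state == "holding"
            and target_provision_state == "available")
```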

Ramdisk impact

None

Security impact

None

Other end user impact

None

Scalability impact

None

Performance Impact

None

Other deployer impact

None

Developer impact

None

Implementation

Assignee(s)

Primary assignee:

Julia Kreger (TheJulia) <juliaashleykreger@gmail.com>

Other contributors:

?

Work Items

  • Implement API to retrieve a list of steps.
  • Implement state machine changes to allow an idle agent instance to return cleaning step data.
  • Add API tests to ironic-tempest-plugin.
  • Update state machine documentation.
  • Add Admin documentation.
  • Update CLI documentation.

Dependencies

None

Testing

Basic API contract and state testing should be sufficient for this feature.

Upgrades and Backwards Compatibility

N/A. The existing rolling upgrades and RPC version pinning practice should be more than sufficient to support this feature.

Documentation Impact

Additional details will need to be added to the Admin guide. State documentation will need to be updated. Update client documentation for new state verb.

References


  1. Manual cleaning - https://specs.openstack.org/openstack/ironic-specs/specs/5.0/manual-cleaning.html

  2. Deploy Steps - https://specs.openstack.org/openstack/ironic-specs/specs/11.1/deployment-steps-framework.html