Stein Project Priorities

This is a list of the development priorities the Ironic team has set for the Stein cycle, ordered by relative size and dependencies. Note that this is not our complete backlog for the cycle; we still hope to review and land non-priority items.

The primary contact(s) listed is/are responsible for tracking the status of that work and herding cats to help get that work done. They are not the only contributor(s) to this work, and are not necessarily doing most of the coding! They are expected to be available on IRC and the ML for questions, and to report status on the whiteboard for the weekly IRC sync-up. The number of primary contacts is typically limited to 2-3 individuals to simplify communication. We expect at least one of them to have core privileges to simplify getting changes in.

As the time remaining in the Stein cycle is approximately 30 weeks from the Project Teams Gathering, the list of priorities has been split into two major pieces based upon an estimate of relative size. The overall goal is for the Smaller Goals items to be the focus within the first few months of the cycle, while the larger Epic Goals may receive some work early on, but will be targeted for later in the cycle.

Smaller Goals

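================================  ====================
Priority                          Primary Contacts
================================  ====================
Upgrade Checker                   TheJulia, rloo
Python3 First                     derek, TheJulia
iPXE/PXE interface split          TheJulia, stendulker
UEFI First                        hshiina
HTTPClient booting                TheJulia
Nova conductor_group awareness    jroll, TheJulia
Enhanced Checksum Support         jroll, kaifeng
DHCP-less/L3 virtual media boot   shekar, stendulker
================================  ====================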

Epic Goals

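==================================  ===========================
Goal                                Primary Contacts
==================================  ===========================
Deploy Templates                    mgoddard, dtantsur, rloo
Graphical Console                   mkrai, etingof
Federation Capabilities             TheJulia, dtantsur
Task execution improvements         etingof, TheJulia, mgoddard
No IPA to conductor communication   jroll, rloo
Getting steps                       TheJulia, dtantsur
Conductor role splitting            jroll, dtantsur
Neutron Event Processing            vdrok, mgoddard, hjensas
==================================  ===========================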

Inter-Project Goals

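===================================  ========================
Deployment state callbacks to nova   TheJulia, jroll
Smartnic Support                     TheJulia, mkrai, moshele
===================================  ========================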

Details

Upgrade Checker

This is an OpenStack Community goal for the Stein Cycle. For ironic this will mean a new command called ironic-status upgrade check. This command is intended to return an error for things that would be fatal for an upgrade such as new required configuration missing, or schema/data upgrades not yet performed. The story can be found at story 2003657.
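The kind of check described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the actual ironic-status implementation: the result codes, function names, option names, and schema revision strings below are all hypothetical.

```python
from enum import IntEnum


class Code(IntEnum):
    """Hypothetical result codes; higher values are more severe."""
    SUCCESS = 0
    WARNING = 1
    FAILURE = 2


def check_required_options(config, required):
    """Fail if any required configuration option is missing or empty."""
    missing = sorted(name for name in required if not config.get(name))
    if missing:
        return Code.FAILURE, "missing options: " + ", ".join(missing)
    return Code.SUCCESS, "all required options are set"


def check_schema_version(current, expected):
    """Fail if the database schema has not been upgraded yet."""
    if current != expected:
        return Code.FAILURE, f"schema at {current}, expected {expected}"
    return Code.SUCCESS, "schema is up to date"


def run_checks(checks):
    """Run every check; return the most severe code and all messages."""
    results = [check() for check in checks]
    worst = max((code for code, _ in results), default=Code.SUCCESS)
    return worst, [message for _, message in results]


# Hypothetical inputs: one option is missing, the schema is current.
config = {"option_a": "value"}
worst, messages = run_checks([
    lambda: check_required_options(config, ["option_a", "option_b"]),
    lambda: check_schema_version("rev-2", "rev-2"),
])
```

A command built along these lines could use the worst code as its exit status, so that a single fatal condition is enough to signal that the upgrade should not proceed.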

Python3 First

This is an OpenStack Community goal for the Stein Cycle. Most of this work has already been completed in ironic. Largely we need to change our tests so we are explicitly testing on Python3. We can't do this for every test at the moment, but we should be able to change most and still ensure the bulk of the code paths are covered by tests labeled with python2.

We would also like third party CI to begin to leverage Python3, with a goal of running approximately 50% of third party CI jobs on Python3 until we stop supporting Python2. The story can be found at story 2003230.

iPXE/PXE interface split

This is an older effort that has been restarted in the interest of supporting multiple architectures (such as AArch64, Power, and x86_64) in the same deployment.

As it turns out, Power's architecture expects the older PXELinux style templates that are written by our PXE boot interface. Additionally, while AArch64 can be booted using iPXE, no pre-built binaries are available.

As such, the choice between iPXE and PXE can no longer be global for the conductor; it needs to be specific to the node, and splitting the interfaces apart begins to make much more sense. The original specification can be found in the ipxe-boot interface specification. The story can be found at story 1628069.

UEFI First

2020 is an important year for Baremetal Operators, as Legacy boot mode support is anticipated to be removed from newer processors being shipped.

To ensure our success, we need to improve our testing and prepare for the time when UEFI is the only boot mode available for newer hardware. As a result, this will become a multi-cycle focus to enable the default boot mode to be changed to uefi in a future cycle. The story can be found at story 2003936.

HTTPClient Booting

While the community is interested in supporting HTTPClient based booting, we currently have a few steps to complete first, namely the iPXE/PXE interface split and improved UEFI testing.

The nature of this work is to enable an explicit HTTP booting scenario where the booting node does not leverage PXE. The story can be found at story 2003934.

Nova conductor_group awareness

This work is exclusively in the ironic virt driver in the openstack/nova repository. It would enable us to define a conductor_group that the nova-compute process uses to limit its view to the baremetal nodes it is responsible for. The story can be found at story 2003942.
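As a rough illustration of the intended effect, the sketch below filters a node list down to a single group. It is not code from the nova virt driver: the function name and the dictionary representation of a node are assumptions made for the example.

```python
def nodes_for_compute(nodes, conductor_group=None):
    """Return the nodes one compute service should consider its own.

    ``nodes`` is a list of dicts with a hypothetical ``conductor_group``
    key. When no group is configured, every node stays visible.
    """
    if conductor_group is None:
        return list(nodes)
    return [n for n in nodes if n.get("conductor_group") == conductor_group]


# Hypothetical inventory spread across two groups.
inventory = [
    {"uuid": "node-1", "conductor_group": "rack-a"},
    {"uuid": "node-2", "conductor_group": "rack-b"},
    {"uuid": "node-3", "conductor_group": "rack-a"},
]
rack_a = nodes_for_compute(inventory, "rack-a")
```

With a filter like this, each compute service only tracks the subset of nodes in its own group instead of the whole deployment.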

Enhanced Checksum Support

Ironic presently defaults to use of MD5 checksums for the image_checksum which is far from ideal. During the Rocky cycle, Glance has enhanced their support for checksum storage, which means we should enhance ours as well. The story can be found at story 2003938.
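To make the difference concrete, the following sketch computes an image digest with a selectable algorithm using Python's standard hashlib module. The function names and the default of SHA-256 are assumptions for illustration only; they do not describe how ironic or Glance store or name checksum fields.

```python
import hashlib
import hmac


def digest_chunks(chunks, algorithm="sha256"):
    """Hash an iterable of byte chunks with the named algorithm."""
    digest = hashlib.new(algorithm)
    for chunk in chunks:
        digest.update(chunk)
    return digest.hexdigest()


def digest_file(path, algorithm="sha256", chunk_size=1024 * 1024):
    """Hash a file in fixed-size chunks so large images fit in memory."""
    with open(path, "rb") as image:
        return digest_chunks(iter(lambda: image.read(chunk_size), b""),
                             algorithm)


def matches(actual, expected):
    """Compare two hex digests case-insensitively in constant time."""
    return hmac.compare_digest(actual.lower(), expected.lower())


# The same data hashed with the legacy and a stronger algorithm.
legacy = digest_chunks([b"image-", b"data"], "md5")
stronger = digest_chunks([b"image-", b"data"], "sha256")
```

Passing the algorithm name alongside the expected digest keeps MD5 available for existing images while allowing stronger algorithms for new ones.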

DHCP-less/L3 virtual media boot

Some operators and vendors wish to enable ironic to manage deployments where DHCP is not used in the deployment process. In order to do this, we need additional capabilities that allow information to be attached to a deployment ramdisk. The specification can be found at the L3 based deployments specification. The story can be found at story 1749193.

Deploy Templates

In the future, we want to take specific action based upon traits submitted to ironic from Nova describing the instance's expected state or behavior.

This will allow us to take actions and influence the deployment steps, and as such is a continuation of the Deploy Steps work from the Rocky cycle. The story can be found at story 1722275.

Graphical Console

We need a way to expose graphical (e.g. VNC) consoles to users from drivers that support it. We reached agreement on the specification in the Rocky cycle and have started to work through the patches to enable this. Our goal is to have a framework and preferably at least one vendor driver that supports graphical console connectivity. The specification can be found in the vnc graphical console specification. The story can be found at story 1567629.

Federation Capabilities

Edge computing is bringing a variety of cases where support for federation of ironic deployments can be useful and extremely powerful.

In order to better support this emerging use case, we want to try and agree on a viable path forward that meets several different use cases and requirements. The objective for this effort is an agreed upon specification. The story can be found at story 2001821.

Task execution improvements

We realize that our task execution and locking model is problematic, and while it does scale in some ways, it does not scale in other ways. This work will consist of worker execution improvements, an evaluation and possible implementation of different worker thread execution models, and careful improvement of locking. The story can be found at story 2003943.

No IPA to conductor communication

Larger operators need much more strict security in their deployments, where they wish to prevent all outbound network connectivity to the control plane. Presently the design model requires that nodes are able to reach ironic's API in order to perform heartbeat and lookup operations.

The concept with this is to optionally enable the conductor to drive the deployment by polling IPA using the already known IP address. That being said, this is realistically going to require Task execution improvements to be complete to help ensure that operators are able to have performant deployments. The specification can be found at change 212206. The story can be found at story 1526486.

Getting steps

One of the biggest frustrations that people have with our cleaning model is the lack of visibility into what steps they can execute. This is further compounded with deploy steps. We have ideas on this and we need to begin providing the mechanisms to raise that visibility.

This may also involve state machine states to enable the agent to sit in a holding pattern pending operator action.

The goal is ultimately to provide a CLI for the user to be able to understand the available steps that can be utilized. The story can be found at story 1715419.

Neutron Event Processing

Currently ironic has no way to determine when certain asynchronous events actually finish in neutron, and with what result. Nova, by contrast, uses a special neutron driver, which filters out notifications and posts some of them to a special nova API endpoint. We should do the same. The story can be found at story 1304673.

Conductor role splitting

The conductor presently does all of the work... But does it need to?

This is a question we should be asking ourselves as we evolve: can we optionally break the conductor into many pieces, to enable edge conductors or edge local boot management? The goal here is to try and obtain a matrix of distinct actions taken, which will hopefully further guide us as time moves on. The story can be found at story 2003940.

Smartnic Support

Smartnics complicate ironic because the node's power must be in a state that allows the configuration on the NIC to be changed before the NIC can be programmed.

While the effort to support this may ultimately result in enhancements to neutron in the form of Super-Agents to apply the configuration, we still need to understand the impact to our workflows and ensure that sufficient security is still present. The primary objective is to have a joint specification written in advance of the Berlin summit to reach consensus with the Neutron team as to the mechanics, information passing, and setting storage. The story can be found at story 2003346.

Deployment state callbacks to nova

One of the issues in ironic's nova virt driver is that no concept of callbacks exists. Due to this, the virt driver polls the ironic API endpoint repeatedly, which increases overall system load. In an ideal world, ironic would utilize a mechanism to indicate deployment state similar to how neutron informs nova that networking has been configured. The story can be found at story 2003939.
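The contrast between polling and callbacks can be sketched with a small in-process example. This is purely illustrative: the class and method names are hypothetical, and a real mechanism would deliver the notification over an API between the two services rather than through a local function call.

```python
class DeploymentNotifier:
    """Hypothetical stand-in for a deployment state callback channel."""

    def __init__(self):
        self._callbacks = {}

    def subscribe(self, node_id, callback):
        """Register interest in state changes for one node."""
        self._callbacks.setdefault(node_id, []).append(callback)

    def state_changed(self, node_id, state):
        """Push the new state to every subscriber of that node."""
        for callback in self._callbacks.get(node_id, []):
            callback(node_id, state)


# The consumer records events as they arrive instead of polling for them.
events = []
notifier = DeploymentNotifier()
notifier.subscribe("node-1", lambda node, state: events.append((node, state)))

notifier.state_changed("node-1", "deploying")
notifier.state_changed("node-2", "active")  # no subscriber, nothing recorded
notifier.state_changed("node-1", "active")
```

In this model the consumer does no work between state changes, whereas a polling loop issues a request on every interval whether or not anything has changed.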