We have a check in the code that is never true for manual power
actions because of what happens in the conductor manager. Remove it.
Conflicts:
ironic/conductor/utils.py
Change-Id: I50b7b78a41188c41e4944894851f1d12684f824a
(cherry picked from commit bc04a42a96)
Otherwise a reboot during fast-track will leave the newly booted
agent without an ability to request a token.
Change-Id: I963276efae5599bfed6cbb4df18e8dd3bd1b9839
(cherry picked from commit 2d4d375358)
We wipe these fields on some conditions, most notably when starting
the deployment. Make the removal of these fields always go through
the helpers in conductor/utils (and remove an unused one).
Change-Id: Idb952588bb8a6d5131764f29c6225762ba5d55cc
(cherry picked from commit 5909163924)
This change updates various agent cleaning functions to also
support deploy steps via a new step_type argument. It does not
yet make it possible to run in-band deploy steps.
The agent's get_clean_steps call has been modified to not fail
if there are no cached steps. This is going to be normal for
both cleaning and deploy in the future, and it is impossible
to hit now because of the way cleaning is started.
Change-Id: I789b130e7e490e23924338a024397973957272ac
Story: #2006963
In order to provide increased security, it is necessary
to hash the rescue password in advance of it being stored
into the database and to provide some sort of control for
hash strength.
This change IS incompatible with prior IPA versions with
regard to use of the rescue feature, but I fully expect
we will backport the change to IPA on to stable branches
and perform a release as it is a security improvement.
Change-Id: I1e118467a536229de6f7c245c1c48f0af38dcef2
Story: 2006777
Task: 27301
To avoid problems with FIPS 140-2 let's generate
the token using the secrets module instead of random.
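A minimal sketch of the idea, assuming a hypothetical helper name
(generate_agent_token) and a 32-byte token length, neither of which is
taken from this change:

```python
import secrets


def generate_agent_token(nbytes=32):
    # Hypothetical helper: draw the token from the operating system's
    # CSPRNG via the secrets module instead of the random module, which
    # is not intended for security-sensitive values.
    return secrets.token_urlsafe(nbytes)


token = generate_agent_token()
```

The length is illustrative only; any sufficiently long value from
secrets serves the same purpose.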
Change-Id: I90c3c94112d093e2309414b9902f58d31d925ad3
Story: 2007444
Task: 39104
This provides an embedded agent_token through out of band
means which ensures greater security for such deployments.
Also changes the redfish virtual media boot unit
tests, as the entirety of conductor utilities was mocked.
One downside of the pregenerated token is that the logic had to be
slightly revised, because the existence of a token with an older
client is a case that virtual media deployments will hit.
This was addressed, and tests updated as a result.
Story: 2007025
Task: 37819
Change-Id: Iaa2a52c2e53534cbf6998ad24f1f1184c0f222d8
In order to improve security of the lookup/heartbeat
endpoints, we need to generate and provide temporary tokens
to the initial callers, if supported, to facilitate the
verification of commands.
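The flow described above could look roughly like the sketch below. The
function names and the 'agent_secret_token' key are hypothetical and
only illustrate issuing a token to the first caller and verifying it
later; they are not taken from this patch:

```python
import hmac
import secrets


def issue_token(driver_internal_info):
    # Only the initial caller receives a token; if one is already
    # recorded, nothing is handed out again.
    if driver_internal_info.get('agent_secret_token'):
        return None
    token = secrets.token_urlsafe(32)
    driver_internal_info['agent_secret_token'] = token
    return token


def is_token_valid(driver_internal_info, presented):
    # Constant-time comparison of the stored and presented tokens.
    known = driver_internal_info.get('agent_secret_token')
    if not known or not presented:
        return False
    return hmac.compare_digest(known.encode('utf-8'),
                               presented.encode('utf-8'))


info = {}
first = issue_token(info)
second = issue_token(info)
```

Here the second lookup gets no token, so a later caller cannot learn
the value that was handed to the agent.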
This is the first patch in an entire series which ultimately
enables the endpoint communication to be better secured.
The idea behind this started in private story 2006634 which
is locked as a security related filing covering multiple
aspects of ironic/ironic-python-agent interaction centered
around misuse and generally exposed endpoints. That story
will remain marked as a private bug because it has several
different items covered, some of which did not prove to be
actually exploitable, but spawned stories 2006777, 2006773,
2007025, and is ultimately similar to Story 1526748.
Operationally this is a minimally invasive security
enhancement to lay the foundation to harden interactions
with the agent. This will take place over a series of
patches to both Ironic and the Ironic-Python-Agent.
Also see "Security of /heartbeat and /lookup endpoints"
in http://lists.openstack.org/pipermail/openstack-discuss/2019-November/010789.html
Story: 2007025
Task: 37818
Change-Id: I0118007cac3d6548e9d41c5e615a819150b6ef1a
With more than 4000 lines of code (more than 8000 of unit tests) the
current manager.py is barely manageable. In preparation for the new
deployment API, this change moves the free-standing deploy-related
functions to utils.py and the new deployments.py.
Change-Id: Ic3f369a7fa72d09263b0670ae78980913b024962
Story: #2006910
Since we've dropped support for Python 2.7, it's time to look at
the bright future that Python 3.x will bring and stop forcing
compatibility with older versions.
This patch removes the six library from requirements, not
looking back.
Change-Id: Ib546f16965475c32b2f8caabd560e2c7d382ac5a
Do not try to configure networks when powered on, unless it's a node
with a SmartNIC, in which case do power on before configuring networks.
A new helper is created based on existing code in agent.py.
Change-Id: I3a8fab7a39b604ed17a690fa9c31b3cd1dbdc6a7
Story: #1528920
Task: #37753
A malicious user with:
* API access normally reserved for the provisioning,
cleaning, rescue networks.
* Insight about a node, such as a MAC address, or baremetal node
UUID.
* Insight into the state of the node, such as the access provided
to Compute API users, or other Bare Metal API users.
Can submit an erroneous ``heartbeat`` to the ironic-api endpoint
with a ``callback_url`` that is not of the actual intended agent.
This can potentially cause a rescue, cleaning, or deployment
operation to be derailed, or at worst commands to be sent
to an endpoint the malicious user controls.
Story: 2006773
Task: 37295
Change-Id: I1a5e3c2b34d45c06fb74e82d0f30735ce9041914
configdrive can contain a vendor_data2.json file containing key/value
pairs injected by nova's vendordata mechanism[1].
This change lets Ironic accept a vendor_data key when configdrive is
provided as json, allowing parity with nova.
This change requires an openstacksdk release 0.37.0
[1] https://www.madebymikal.com/nova-vendordata-deployment-an-excessively-detailed-guide/
Change-Id: Id990b970619a113c5d5ead47fb550870d91b5e04
Task: 36756
Story: 2006597
Blueprint: nova-less-deploy
Some drivers use a periodic task to poll for completion of a deploy
or clean step. The iDRAC RAID driver is one example of this. In
https://review.opendev.org/#/c/676152, the agent heartbeat handler was
modified to resume deployment if not currently in the core deploy step.
This makes sense for the ilo driver, which does not poll for completion
of RAID configuration, but the iDRAC driver polls the lifecycle
controller's job queue, and expects to be able to resume deployment once
the job is complete. However, there is now a race between the agent
heartbeat as the node boots up, and the job queue poller.
This change adds new flags, cleaning_polling and deployment_polling,
which can be used by a driver to signal that they are polling for
completion of a deploy step, and that the agent heartbeat should not be
used for this purpose.
We also add here some more cleanup of the cleaning and deployment step
metadata in driver_internal_info, since if these fields are left in
place they may affect subsequent cleaning or deployment steps.
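As a rough sketch of how a heartbeat handler might consult these
flags: the function name is hypothetical, and storing the flags in
driver_internal_info is an assumption made for this example:

```python
def should_resume_on_heartbeat(driver_internal_info, step_type):
    # step_type is 'clean' or 'deploy'. When the driver has set the
    # matching polling flag, it resumes the operation itself and the
    # agent heartbeat must not do so.
    flags = {'clean': 'cleaning_polling', 'deploy': 'deployment_polling'}
    return not driver_internal_info.get(flags[step_type], False)


polling_driver = {'deployment_polling': True}
resume_for_polling = should_resume_on_heartbeat(polling_driver, 'deploy')
resume_for_default = should_resume_on_heartbeat({}, 'deploy')
```

With the flag set, only the driver's own poller resumes the
deployment, which avoids the race described above.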
Change-Id: I34591440ab993a80a0cc88be6e10e33f1ae4a660
Story: 2003817
Task: 36563
PXE is inherently unreliable and sometimes times out without an
obvious reason. It happens particularly often in resource constrained
environments, such as the CI. This change allows an operator to
set a timeout, after which the boot is retried.
The _add_node_filters call had to be refactored to avoid hitting
the complexity limit.
Change-Id: I34a11f52e8e98e5b64f2d21f7190468a9e4b030d
Story: #2005167
Task: #29901
We don't prevent cleaning from happening for nodes in maintenance mode.
However, cleaning cannot succeed in this case, as we disable processing
heartbeats. This change adds a new configuration option that will
cause such a node to enter CLEAN FAIL on the first heartbeat.
The same is done for deployment and automated cleaning during providing.
Finally, elevate the log level for such heartbeats from debug to warning,
as it may be a sign of a problem (especially if the new option is off).
Change-Id: I9f3ee44f39c448eb2609c5989acd36e7da844ef4
Story: #1563644
Task: #9171
In case of a failure during cleaning, ironic currently shuts the
node off. This is dangerous, e.g. when the cleaning step is a
firmware upgrade. This patch proposes to correct this behaviour
and leave the node on in case cleaning raises an exception.
Task: #30357
Story: #2005375
Change-Id: I5fe8b380c890eb9b9dcee33868ceda2a9bab9929
Add power state change callbacks of an instance to nova by
performing API requests. Whenever there is a change in the
power state of a physical instance (for example, a "power on"
or "power off" IPMI command is issued or the periodic
``_sync_power_states`` task detects a change in power state)
ironic will create and send a ``power-update`` external event
to nova using which nova will update the power state of the
instance in its database. By conveying the power state changes
to nova, ironic becomes the source of truth thus preventing
nova from forcing wrong power states on the instance during
the nova-ironic periodic sync. It also adds the possibility of
bringing up/down a physical instance through the ironic API
even if it was put down/up through the nova API.
Note that ironic only sends requests to nova if the target
power state is either "power on" or "power off". Other error
states will be ignored. In cases where the power state change
is originally coming from nova, the event will still be
created and sent to nova and on the nova side it will be a
no-op with a debug log saying the node is already powering on/off.
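The filtering described above can be sketched as follows. The helper
name, the event field names and the POWER_ON/POWER_OFF tag values are
assumptions made for illustration, not details stated in this commit:

```python
def build_power_update_event(instance_uuid, target_power_state):
    # Only "power on" and "power off" produce an event; any other
    # state (including error states) is ignored.
    tags = {'power on': 'POWER_ON', 'power off': 'POWER_OFF'}
    tag = tags.get(target_power_state)
    if tag is None:
        return None
    return {'name': 'power-update',
            'server_uuid': instance_uuid,
            'tag': tag}


event = build_power_update_event('fake-instance-uuid', 'power off')
ignored = build_power_update_event('fake-instance-uuid', 'error')
```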
NOTE: Although an exclusive lock (task_manager.upgrade_lock()
method) is used when calling the nova API to send events,
there can still be a race condition if the nova-ironic power sync
happens to run a nanosecond before the power state change
event is received from ironic in which case the nova state will
be forced on the node.
Credit for introducing ksa adapter: Eric Fried <openstack@fried.cc>
Depends-On: https://review.opendev.org/#/c/645611/
Part of blueprint nova-support-instance-power-update
Story: 2004969
Task: 29424
Change-Id: I6d105524e1645d9a40dfeae2850c33cf2d110826
Asynchronous out-of-band steps in a deploy template fail to
execute. This commit fixes that issue. Asynchronous steps can
set the 'skip_current_deploy_step' flag to False in
'driver_internal_info' to make sure that upon reboot the same step
is re-executed. They can also set the 'deployment_reboot' flag to True
in 'driver_internal_info' to signal that they have rebooted the node.
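A small sketch of a step setting both flags. The helper name is
hypothetical, and a plain dict stands in for the node's
driver_internal_info field:

```python
def mark_step_for_reexecution(driver_internal_info):
    # Work on a copy so the caller decides when to save the result.
    info = dict(driver_internal_info)
    # Re-run the same step after the reboot instead of skipping it.
    info['skip_current_deploy_step'] = False
    # Tell the conductor that the step itself rebooted the node.
    info['deployment_reboot'] = True
    return info


updated = mark_step_for_reexecution({'deploy_step_index': 0})
```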
Co-Authored-By: Mark Goddard <mark@stackhpc.com>
Change-Id: If6217afb5453c311d5ca71ba37458a9b97c18395
Story: 2006342
Task: 36095
There is enough step-related code in conductor.utils to warrant a
separate module.
Change-Id: I0126e860210bbc56991876f26e64d81d3d7d5c08
Story: 1722275
Task: 29902
Provides a facility to minimize the power state changes of
a baremetal node to save critical time during deployment
operations.
Story: #2004965
Task: #29408
Depends-On: https://review.openstack.org/636778
Change-Id: I7ebbaddb33b38c87246c10165339ac4bac0ac6fc
Extend the API with the ability to build config drives from meta_data,
network_data and user_data, where meta_data and network_data are JSON
objects, and user_data is either a JSON object, a JSON array or
raw contents as a string.
This change uses openstacksdk (which is already an indirect dependency)
for building config drives.
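The accepted shapes can be illustrated with a hypothetical validator.
The function name and the sample values are made up for this example;
only the three keys and their permitted types come from the
description above:

```python
def validate_configdrive(configdrive):
    # meta_data and network_data must be JSON objects (dicts).
    for key in ('meta_data', 'network_data'):
        value = configdrive.get(key)
        if value is not None and not isinstance(value, dict):
            raise ValueError('%s must be a JSON object' % key)
    # user_data may be a JSON object, a JSON array or a raw string.
    user_data = configdrive.get('user_data')
    if user_data is not None and not isinstance(user_data,
                                                (dict, list, str)):
        raise ValueError('user_data must be an object, array or string')
    return configdrive


configdrive = validate_configdrive({
    'meta_data': {'hostname': 'node-0'},
    'network_data': {'links': [], 'networks': [], 'services': []},
    'user_data': '#!/bin/sh\necho hello\n',
})
```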
Change-Id: Ie1f399a4cb6d4fe5afec79341d3bccc0f81204b2
Story: #2005083
Task: #29663
Adds the conductor-side logic required to map an instance's requested
traits to zero or more deploy templates. The steps defined in those
deploy templates are combined and added to deployment steps from the
driver interfaces, and used when provisioning the node.
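A simplified sketch of the mapping, assuming a template applies when
its name equals a requested trait. The function name, the sample trait
and the step contents are hypothetical:

```python
def get_deployment_steps(instance_traits, templates, driver_steps):
    # Start from the steps provided by the driver interfaces.
    steps = list(driver_steps)
    # Each requested trait maps to zero or more templates; here a
    # template matches when its name equals the trait.
    for trait in instance_traits:
        steps.extend(templates.get(trait, []))
    return steps


templates = {
    'CUSTOM_HYPERTHREADING_ON': [
        {'interface': 'bios', 'step': 'apply_configuration',
         'priority': 150},
    ],
}
driver_steps = [{'interface': 'deploy', 'step': 'deploy', 'priority': 100}]
steps = get_deployment_steps(
    ['CUSTOM_HYPERTHREADING_ON', 'CUSTOM_UNMATCHED'],
    templates, driver_steps)
```

A trait without a matching template contributes no steps, which is the
"zero" case above.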
The deploy steps for a node that come from deploy templates are
validated during node validation, and when deploying a node.
Change-Id: Ic4ac7926a1eaeb8b84d4f9f1af23bbe54554f250
Story: 1722275
Task: 28675
The power states already contain the word 'power', so currently
an error reads:
"Timed out after 30 secs waiting for power power off on node ..."
Change-Id: I2535c15172df475a7d08c5219c2b97690ea67a58
Extend Ironic to enable use of Smart NICs to implement
generic networking services for baremetal servers.
Extends the ramdisk, direct, iscsi and ansible deploy interfaces
to support the Smart NIC use case.
For the Smart NIC use case, the baremetal node must be powered on and
booted into the BIOS; the agent that runs on the Smart NIC must then be
alive before the required network changes are made.
Task: #26932
Story: #2003346
Change-Id: I00d6f13dd991074e4f45ada4d7cf4ccc0edbc7e1
When a timeout occurs when a node is in CLEANWAIT state, the conductor
puts it into the CLEANFAIL state. However, it tries to do that twice, and
our state machine doesn't support moving from a CLEANFAIL state to another
state via the 'fail' verb/event.
The code was changed so that it doesn't try to move it to CLEANFAIL twice,
and a check was added to prevent the node from being 'failed' from a CLEANFAIL
state.
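The guard can be pictured with a heavily reduced state machine. The
state strings, the transition table and the function names below are
stand-ins for illustration, not the real ironic definitions:

```python
CLEANWAIT = 'clean wait'
CLEANFAIL = 'clean failed'

# Reduced table: the 'fail' event is only valid from CLEANWAIT here.
FAIL_TRANSITIONS = {CLEANWAIT: CLEANFAIL}


def process_fail(state):
    if state not in FAIL_TRANSITIONS:
        raise ValueError("'fail' is not valid from state %r" % state)
    return FAIL_TRANSITIONS[state]


def handle_clean_timeout(state):
    # Guard: a node that is already in CLEANFAIL is left alone rather
    # than being 'failed' a second time.
    if state == CLEANFAIL:
        return state
    return process_fail(state)


after_timeout = handle_clean_timeout(CLEANWAIT)
after_second_call = handle_clean_timeout(after_timeout)
```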
Change-Id: Ieeb77dd28a5d3053588c46fe2a700b5e6ceabbd7
Story: 2004299
Task: 27855
Create the object for automated clean and add the logic
in the conductor to be able to enable clean for specific
nodes, when general automated clean is disabled.
Story: #2002161
Task: #24579
Change-Id: If0130082e16d1205fdf65d083854ef9849754f8b
This addresses nits from the two reviews related to the
deploy_steps framework:
- I5feac3856cc4b87a850180b7fd0b3b9805f9225f
- I1baeeaaa6ed521e4189958fd7624cd6c5de96707
It also updates the release note to:
- indicate that support for drivers with no deploy steps will
be removed in the Stein release (as opposed to the T* release),
based on discussions in [1].
- mention that node.deploy_step is available in REST API version 1.44.
[1] http://eavesdrop.openstack.org/meetings/ironic/2018/ironic.2018-07-09-15.00.log.html#l-64
Change-Id: I97ab00cab21814287d1b8344b3e4ca0c093fb6ad
Story: #1753128
Task: #22592
This adds a 'deploy_step' decorator. A deploy step must take as the
only positional argument, a TaskManager object.
A step can be executed synchronously or asynchronously. A step should
return None if the method has completed synchronously or
states.DEPLOYWAIT if the step will continue to execute asynchronously.
If the step executes asynchronously, it should issue a call to the
'continue_node_deploy' RPC, so the conductor can begin the next
deploy step.
Only steps with priorities greater than 0 are used.
These steps are ordered by priority from highest value to lowest
value. For steps with the same priority, they are ordered by driver
interface priority (see conductor.manager.DEPLOYING_INTERFACE_PRIORITY).
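The rules above can be sketched as follows. This is a simplified
stand-in, not the decorator added by this change: the attribute names,
the fake interfaces and the INTERFACE_PRIORITY values are hypothetical,
and the example assumes that a higher interface priority sorts first:

```python
# Hypothetical stand-in for DEPLOYING_INTERFACE_PRIORITY.
INTERFACE_PRIORITY = {'deploy': 2, 'power': 1}


def deploy_step(priority):
    # Mark a method as a deploy step with the given priority.
    def decorator(func):
        func._is_deploy_step = True
        func._deploy_step_priority = priority
        return func
    return decorator


class FakeDeploy:
    interface_type = 'deploy'

    @deploy_step(priority=100)
    def deploy(self, task):
        return None  # None means the step completed synchronously.

    @deploy_step(priority=0)
    def disabled_step(self, task):
        return None


class FakePower:
    interface_type = 'power'

    @deploy_step(priority=100)
    def prepare_power(self, task):
        return None


def collect_steps(interfaces):
    steps = []
    for iface in interfaces:
        for name in dir(iface):
            method = getattr(iface, name)
            if not getattr(method, '_is_deploy_step', False):
                continue
            priority = method._deploy_step_priority
            # Only steps with a priority greater than 0 are used.
            if priority > 0:
                steps.append({'interface': iface.interface_type,
                              'step': name, 'priority': priority})
    # Highest priority first; ties broken by interface priority.
    steps.sort(key=lambda s: (s['priority'],
                              INTERFACE_PRIORITY[s['interface']]),
               reverse=True)
    return steps


steps = collect_steps([FakePower(), FakeDeploy()])
```

Both enabled steps share priority 100 here, so the interface priority
decides the order, and the priority-0 step is dropped.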
All in-tree DeployInterfaces are converted to have one big deploy_step
(their existing deploy() method).
A new RPC method 'continue_node_deploy' (RPC API version 1.45) is used
by deploy steps to notify the conductor to continue node deployment
(e.g. execute the next deploy step).
Similar to cleaning, the conductor gets the node's deploy steps and
executes them, one at a time (one deploy step right now). The conductor
also handles out-of-tree drivers that don't have deploy steps yet; a
warning is logged in these cases.
Co-Authored-By: Ruby Loo <rloo@oath.com>
Change-Id: I5feac3856cc4b87a850180b7fd0b3b9805f9225f
Story: #1753128
Task: #22592
* removes any bits related to loading classic drivers from
the drivers factory code
* removes exceptions that only happen when classic drivers
can be loaded
* removes the BaseDriver, moves the useful functionality to
the BareDriver class
* /v1/drivers/?type=classic now always returns an empty list
* removes the migration updating classic drivers to hardware
types
The documentation will be updated separately.
Change-Id: I8ee58dfade87ae2a2544c5dcc27702c069f5089d
This makes the _notify_conductor_resume_clean method public by
removing the leading underscore from its name, and moves it into
conductor/utils.py.
The idrac hardware type's RAID configuration out-of-band cleaning
has been using it [1].
The method is going to be used by the RAID configuration support
that is being added to the iRMC hardware type [2].
[1] 580d4338e2/ironic/drivers/modules/drac/raid.py (L892)
[2] https://review.openstack.org/#/c/512979
Change-Id: Ifd10dd88d65306049119588e6088359a5d38c158
This patch implements setting and using the fault field.
For each case where maintenance is currently set to True, the fault is set
accordingly. A periodic task is added to check power state for nodes
in maintenance due to power failure; maintenance is cleared if the
power state of a node can be retrieved.
When a node is taken out of maintenance by user, the fault is
cleared (if there is any).
Story: #1596107
Task: #10469
Change-Id: Ic4ab20af9022a2d06bdac567e7a098f3ba08570a
Partial-Bug: #1596107
Also a few related errors based on some earlier investigation
may have been pulled in along the lines of E305.
Story: #2001985
Change-Id: Ifb2d3b481202fbd8cbb472e02de0f14f4d0809fd
* Adds 'bios' interface to 'BaseDriver'
* Adds BIOSInterface driver class
* Adds fake & no-bios drivers and entries
* Implements it for 'fake-hardware' hardware type
* Adds configuration parameters:
+ [DEFAULT]/enabled_bios_interfaces
+ [DEFAULT]/default_bios_interface
* Adds 'bios_interface' field to Node object
* Handle 'bios_interface' field in _convert_to_version
* Adds bios in CLEANING_INTERFACE_PRIORITY
Drivers can implement this interface to do BIOS
configuration.
Co-Authored-By: Yolanda Robla Mota <yroblamo@redhat.com>
Co-Authored-By: Luong Anh Tuan <tuanla@vn.fujitsu.com>
Change-Id: I7e57130242b6cab21b54e35dc3c0b7819bdc43c0
Story: #1712032