Adds a wait step to allow for finer-grained workflows
and to force interruptions, which may be needed in some
cases with specialized hardware.
Change-Id: Idc338b761ebe35a4635022a324ca5acbf29fc462
Adds a database retry decorator to capture and retry exceptions
rooted in SQLite locking. These locking errors stem from the fact
that, essentially, we can only have one distinct writer at a time,
and that writer is transaction oriented as well.
Unfortunately, with our green threads and API surface, we run into
cases where background operations (mainly periodic tasks) and API
surface transactions need to operate against the DB at the same
time. Because we cannot realistically give one task or another
exclusive control and access, we run into database locking errors.
So when we encounter a lock error, we retry.
Adds two additional configuration parameters to the database
configuration section, to allow this capability to be further
tuned, as file IO performance is *surely* a contributing factor
to our locking issues as we mostly see them with a loaded CI
system where other issues begin to crop up.
The new parameters are as follows:
* sqlite_retries, a boolean value allowing the retry logic
to be disabled. This can largely be ignored, but is available
as it was logical to include.
* sqlite_max_wait_for_retry, an integer value, default 30 seconds,
specifying how long to keep retrying SQLite database operations
which are failing due to a "database is locked" error.
The retry logic uses the tenacity library and performs an
exponential backoff. Setting the amount of time to a very large
number is not advisable; as such, the default of 30 seconds was
deemed reasonable.
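The retry behaviour described above can be sketched roughly as follows. This is a minimal standard-library illustration, not the code from this change: the actual patch uses tenacity, and the names `retry_on_locked`, `max_wait`, `base_delay`, and `enabled` are hypothetical stand-ins for the decorator and the two configuration options. It also assumes the lock error surfaces as `sqlite3.OperationalError`, whereas the real code sees it wrapped by the database layer.

```python
import functools
import sqlite3
import time


def retry_on_locked(max_wait=30.0, base_delay=0.1, enabled=True):
    """Retry a callable while SQLite reports "database is locked".

    max_wait, base_delay and enabled are illustrative names only.
    """
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            deadline = time.monotonic() + max_wait
            delay = base_delay
            while True:
                try:
                    return func(*args, **kwargs)
                except sqlite3.OperationalError as exc:
                    locked = "database is locked" in str(exc)
                    # Re-raise anything that is not a lock error, or
                    # everything when retries are disabled.
                    if not (enabled and locked):
                        raise
                    remaining = deadline - time.monotonic()
                    if remaining <= 0:
                        raise
                    # Exponential backoff, capped by the time left.
                    time.sleep(min(delay, remaining))
                    delay *= 2
        return wrapper
    return decorator
```

Because the delay doubles on each attempt, a very large maximum wait mostly adds a few long sleeps at the end, which is one way to read the advice against setting it too high.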
Change-Id: Ifeb92e9f23a94f2d96bb495fe63a71df9865fef3
Disables the internal heartbeat mechanism when ironic has been
configured to utilize a SQLite database backend.
This is done to lessen the possibility of a
"database is locked" error, which can occur when two
distinct threads attempt to write to the database
at the same time with open writers.
The process keepalive heartbeat was identified as
a major source of these write operations, as it was writing
every ten seconds by default, which would also collide with
periodic tasks.
Change-Id: I7b6d7a78ba2910f22673ad8e72e255f321d3fdff
Adds the logic and testing to allow vendor interface methods to be
called as steps, and allows the ipmitool send_raw vendor passthru
method to be called as a step.
Change-Id: I741a4173f1d150298008d3190e4c3998402a8b86
An issue previously existed where periodics would cause an open
transaction to exist with the database which would cause issues
when attempting to write to the database.
This issue has been fixed by copying the data retrieved from the
database before returning it to the calling method, thus detaching
it from the transaction so that an open transaction does not
remain.
Closes-Bug: #2027405
Change-Id: I6401193b04fd3be78c37433bfdd0ccbd92aac8da
* Updates API version to 1.85 to permit an ``unhold`` verb
* Adds the ``deploy hold`` and ``clean hold`` provision states
to the internal state machine.
* Adds documentation on steps to help provide greater clarity
to Ironic's users on how to utilize steps. It should be noted
that this documentation also includes the power state reserved
step names from the DPU functionality patch.
* Fixes the state machine diagram. Changes type to PNG as SVG
rendering is broken due to python libraries utilized for SVG
generation which do not work on more recent Python versions.
Change-Id: I34f58f4e77e7757b89247fd64f5fcde26f679453
We have started to notice an SAWarning from sqlalchemy indicating:
SAWarning: Cannot correctly sort tables; there are unresolvable
cycles between tables "allocations, nodes", which is usually
caused by mutually dependent foreign key constraints.
Foreign key constraints involving these tables will not be
considered; this warning may raise an error in a future release.
Hunting this down, it appears to be the two data consistency Foreign
Key constraints in the "allocations" table where an allocation would
try to have a conductor_affinity value mapped to conductors.id
and also have a direct association to a node, which *also* had the
same constraint.
And then similarly, mapping in reverse, asserting a FK constraint,
when nodes also had its own constraint back on allocations.
Sort of a circular loop.
Anyhow, removes it, and adds a db migration to remove the two
constraints.
Change-Id: I5596008e4971a29c635c45b24cb85db2d0d13ed3
When a node is inspected more than one time and the database is
configured as a storage backend, a new entry is made in the database
for each inspection result (node inventory). This patch handles this
behaviour as follows:
* By deleting previous inventory entries for the same node before
adding a new entry in the database.
* By retrieving the most recent node inventory from the database
when the database is queried.
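The two behaviours above can be sketched as follows. This is only an illustration using the standard-library sqlite3 module; the table name `node_inventory`, its columns, and the helper names are assumptions made for the example and are not taken from the patch.

```python
import sqlite3


def create_schema(conn):
    # Hypothetical table layout, for illustration only.
    conn.execute(
        "CREATE TABLE node_inventory ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "node_id INTEGER NOT NULL, "
        "inventory_data TEXT)"
    )


def save_inventory(conn, node_id, inventory_data):
    # Delete and insert in a single transaction so that at most one
    # inventory row exists per node once the transaction commits.
    with conn:
        conn.execute(
            "DELETE FROM node_inventory WHERE node_id = ?", (node_id,))
        conn.execute(
            "INSERT INTO node_inventory (node_id, inventory_data) "
            "VALUES (?, ?)", (node_id, inventory_data))


def get_latest_inventory(conn, node_id):
    # Ordering by id also copes with duplicate rows written before
    # the cleanup existed: the newest row wins.
    row = conn.execute(
        "SELECT inventory_data FROM node_inventory "
        "WHERE node_id = ? ORDER BY id DESC LIMIT 1",
        (node_id,)).fetchone()
    return row[0] if row else None
```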
Change-Id: Ic3df86f395601742d2fea2bcde62f7547067d8e4
Adds to the information collected by Redfish hardware inspection from
sushy, and stores it in the documented hardware inventory format.
Allows steps to be executed on child nodes, and adds
the reserved power_on, power_off, and reboot step names.
Change-Id: I4673214d2ed066aa8b95a35513b144668ade3e2b
The db field value version check, which is a preflight to
major upgrades (to detect if a prior upgrade was not completed),
was using model_query, which could orphan an open transaction in
the same process until the Python interpreter went and took out
the proverbial trash.
We now use an explicit session, which structurally ensures we close
any open transactions, allowing a metadata lock to be obtained
to perform a schema update.
Change-Id: Id51419bc50af5a756bb7b0ca451df1936dd6f904
Adds the parent node support and tests in one change
including all DB/Model/API changes along with RBAC and
basic API tests.
* Updates the API version to 1.83
* Adds parent_node and related index to the nodes table.
* Adds new API parameters to list by parent node relationship.
Depends-On: https://review.opendev.org/c/openstack/ironic/+/883967
Change-Id: I8d64fee7105718199986db4994e13352d639f04f
This patch deals with an overlooked situation in patch
d23f72ee50, in which the
`irmc_ipmi_succeed` flag was added to deal with the iRMC
firmware's IPMI incompatibility introduced in iRMC
firmware version S6 2.00 and later.
That flag is set and updated by the `irmc` power_interface
code, and the rest of the iRMC driver code uses that flag.
When `ipmitool` is set as power_interface, that flag
is neither set nor updated, and the rest of the iRMC driver code
fails to handle the IPMI incompatibility correctly.
This patch adds logic to check power_interface to
make the iRMC driver properly deal with the iRMC firmware's IPMI
incompatibility even when the `ipmitool` power_interface
is used.
Change-Id: Id353c4f5260a7c469779b50ad302f442223df5a0
In a recent change to address CVE-2023-2088, Cinder changed
the policy rules and behavior for unbinding, or "detaching",
a volume. This was because of a vulnerability in compute nodes
where a volume which was in use by a VM could be detached
outside of Nova, Nova wouldn't become aware the volume was
detached, and the volume could be accessible to the next VM.
This vulnerability doesn't apply to bare metal operations as
volumes are attached to whole baremetal nodes with Ironic.
We now generate and use a service token when interacting with
Cinder, which allows Cinder to recognize "this request is
coming from a fellow OpenStack service" and bypass
checking with Nova whether the "instance" is managed by Nova
or not. This allows the volumes to be attached and detached
as needed as part of the power operation flow and overall
set of lifecycle operations.
Related-Bug: 2004555
Closes-Bug: 2019892
Change-Id: Ib258bc9650496da989fc93b759b112d279c8b217
When enabling scope enforcement, the self_owned_node check could
generate a failure because the check internally can be touched
by both a project scoped and system scoped endpoint.
This change updates the tag in the policy so it doesn't prematurely
return an error to the API consumer.
Change-Id: I49e2f7f29eb98e5bb4e18614cea0aca726703f55
Currently, if an attempt is made to fetch MAC address information using
OOB inspection on a Redfish-managed node and the EthernetInterfaces
attribute is missing on the node, inspection fails due to a
MissingAttributeError exception being raised by sushy. This change
catches and handles this exception.
Change-Id: I6f16da05e19c7efc966128fdf79f13546f51b5a6
Prior to this fix, we have been unable to run the Metal3 CI job
with SQLAlchemy's internal autocommit setting enabled. However
that setting is deprecated and needs to be removed.
Investigating our DB queries and request patterns, we were able
to identify some queries which generally resulted in the
underlying task and lock being held longer because the output
was not actually returned, which is something we've generally
had to fix in some places previously. Doing some of these
changes did drastically reduce the number of errors encountered
with the Metal3 CI job, however it did not eliminate them
entirely.
Upon further investigation, we were able to determine that the
underlying issue arose when we had an external semi-random reader,
such as Metal3 polling endpoints. In that situation we could be
blocked from updating the database because, to open a write lock,
we need the active readers not to be interacting with the database.
With a random reader of sorts, the only realistic option we have is
to enable the Write Ahead Log[0]. We didn't have
to do this with SQLAlchemy previously because autocommit behavior
hid the complexities from us, but in order to move to SQLAlchemy
2.0, we do need to remove autocommit.
Additionally, adds two unit tests for the get_node_with_token RPC
method, which apparently we missed or lost somewhere along the
way. Also adds notes to two database interactions to suggest
we look at them in the future, as they may not be the most
efficient path forward.
[0]: https://www.sqlite.org/wal.html
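For reference, enabling the Write Ahead Log on a SQLite connection can be sketched like this. It uses the standard-library sqlite3 module directly rather than SQLAlchemy, so it only illustrates the pragma involved and is not the hook added by this change; `connect_with_wal` is a hypothetical helper name.

```python
import sqlite3


def connect_with_wal(path):
    """Open a SQLite database file and switch it to WAL journaling."""
    conn = sqlite3.connect(path)
    # The pragma returns the journal mode now in effect. For a regular
    # database file this is "wal"; in-memory databases cannot use WAL
    # and report "memory" instead.
    mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]
    return conn, mode
```

The journal mode is stored in the database file, so later connections to the same file also use WAL. In this mode readers do not block the writer, which is the property needed when an external client keeps polling the API.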
Change-Id: Iebcc15fe202910b942b58fc004d077740ec61912
The troubleshooting kernel command line option nomodeset
unfortunately changes the way framebuffer interactions work
with graphics devices, which in some cases can result in kernel
memory being used for graphics updates. When this happens on
some specific hardware common in rack mount servers with baseboard
management controllers, this can cause the memory bus to become
locked for a brief time while the graphics update is occurring.
This locked memory bus means disk IO can become blocked,
and network cards can overflow their buffers resulting in
packet loss on top of the latency incurred by the graphics
update executing.
As such, we've removed the nomodeset option from default usage and
added a note describing its removal to the documentation along
with a release note.
Change-Id: I9084d88c3ec6f13bd64b8707892758fa87dd7f86
This patch fixes a condition where iRMC driver interfaces would have
the FIPS enforcement logic check applied if the SNMP version was not
set to SNMP v3, even if the interfaces did not use SNMP.
With this patch, if FIPS is enabled, the iRMC driver enforces
SNMP version 3 only when an xxx_interface of the iRMC
driver actually uses SNMP.
Story: 2010713
Task: 47879
Change-Id: I774c459a5e11b7cd01f7a65754d5a2c7cc573476
We have seen duplicate IP address issues when leaving clean failed
nodes powered on. This patch allows operators to power down nodes
that enter the clean failed state.
Change-Id: Iecb402227485fe0ba787a262121c9d6a048b0e13
The current check is insufficient: it passes for Kubernetes shared
volumes, although hard-linking between them is not possible.
This patch changes the approach to trying a hard link and falling
back to copyfile instead.
The patch relies on optimizations in Python 3.8 and thus should not
be backported beyond the Zed series to avoid performance regression.
Change-Id: I929944685b3ac61b2f63d2549198a2d8a1c8fe35
Unused by Nova and unlike memory_mb/local_gb also by Ironic (actually,
our usage of local_gb is worth double-checking as well, but at the very
least it's referenced by inspection implementations).
Change-Id: Ie8b0d9f58f4dcd102c183c30ae7f5acf68a5e4c3
Currently, we silently fall back to ironic-inspector managing boot if
the boot interface cannot do it. What ironic-inspector does is set
the boot device to PXE and issue a reboot request. This was done
to keep backward compatibility with how inspection worked before managed
boot was introduced.
With in-band inspection migrating to Ironic proper, this "unmanaged"
mode becomes a more exotic case since it requires additional PXE
infrastructure. Additionally, the popularity of Redfish is rapidly
growing, and we support pre-populating ports when Redfish is used.
As such, the "unmanaged" mode should no longer be allowed by default.
This change prepares for the future flip of the default value by
issuing a deprecation warning if no explicit value is set for the option.
Depends-On: https://review.opendev.org/c/openstack/bifrost/+/877469
Change-Id: I6a13cf62b427c9e5c7d7d9ddc447d60f94592c9a