2461 Commits

Author SHA1 Message Date
Zuul
b769a8199a Merge "Add wait step" 2023-07-28 05:16:26 +00:00
Zuul
96b1718b42 Merge "Enable vendor interfaces to be called as steps" 2023-07-27 17:19:09 +00:00
Julia Kreger
8fc8372e74 Add wait step
Adds a wait step to allow for finer grained workflows
and forcing interruptions which may be needed in some
cases with specialized hardware.

Change-Id: Idc338b761ebe35a4635022a324ca5acbf29fc462
2023-07-24 22:42:20 +00:00
Julia Kreger
091edb0631 Retry SQLite DB write failures due to locks
Adds a database retry decorator to capture and retry exceptions
rooted in SQLite locking. These locking errors are rooted in
the fact that essentially, we can only have one distinct writer
at a time. This writer becomes transaction oriented as well.

Unfortunately with our green threads and API surface, we run into
cases where we have background operations (mainly, periodic tasks...)
and API surface transactions which need to operate against the DB
as well. Because we can't say one task or another (realistically
speaking) can have exclusive control and access, we run into
database locking errors.

So when we encounter a lock error, we retry.

Adds two additional configuration parameters to the database
configuration section, to allow this capability to be further
tuned, as file IO performance is *surely* a contributing factor
to our locking issues as we mostly see them with a loaded CI
system where other issues begin to crop up.

The new parameters are as follows:
* sqlite_retries, a boolean value allowing the retry logic
  to be disabled. This can largely be ignored, but is available
  as it was logical to include.
* sqlite_max_wait_for_retry, an integer value, default 30 seconds
  as to how long to wait for retrying SQLite database operations
  which are failing due to a "database is locked" error.

The retry logic uses the tenacity library, and performs an
exponential backoff. Setting the amount of time to a very large
number is not advisable, as such the default of 30 seconds was
deemed reasonable.
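
A minimal sketch of the retry-with-exponential-backoff idea described above. The commit itself uses the tenacity library; this stand-in uses only the standard library, and the decorator name, its parameters, and the initial delay are hypothetical. `max_wait` plays the role of `sqlite_max_wait_for_retry`.

```python
import functools
import sqlite3
import time


def retry_on_locked(max_wait=30.0, initial_delay=0.1, sleep=time.sleep):
    """Retry a callable while SQLite reports 'database is locked'.

    Hypothetical helper: the delay doubles after every failed attempt,
    and the error is re-raised once the next sleep would push the total
    time slept past max_wait seconds.
    """
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            delay = initial_delay
            waited = 0.0
            while True:
                try:
                    return func(*args, **kwargs)
                except sqlite3.OperationalError as exc:
                    # Only lock contention is retried; other errors propagate.
                    if 'database is locked' not in str(exc):
                        raise
                    if waited + delay > max_wait:
                        raise
                    sleep(delay)
                    waited += delay
                    delay *= 2
        return wrapper
    return decorator
```

Keeping `max_wait` small bounds how long a caller can be stalled, which matches the reasoning given for the 30 second default.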

Change-Id: Ifeb92e9f23a94f2d96bb495fe63a71df9865fef3
2023-07-18 13:14:45 +00:00
Julia Kreger
0099d1812d Don't actually heartbeat with sqlite
Disables internal heartbeat mechanism when ironic has been
configured to utilize a SQLite database backend.

This is done to lessen the possibility of a
"database is locked" error, which can occur when two
distinct threads attempt to write to the database
at the same time with open writers.

The process keepalive heartbeat process was identified as
a major source of these write operations as it was writing
every ten seconds by default, which would also collide with
periodic tasks.

Change-Id: I7b6d7a78ba2910f22673ad8e72e255f321d3fdff
2023-07-14 09:22:01 -07:00
Julia Kreger
76c075269d Enable vendor interfaces to be called as steps
Adds the logic and testing to handle vendor interfaces to be able
to be called as steps, as well as adds the ipmitool send_raw
vendor passthru method to be able to be called as a step.

Change-Id: I741a4173f1d150298008d3190e4c3998402a8b86
2023-07-13 07:40:53 -07:00
Julia Kreger
fb978dab1c DB: Fix result set locking with periodics
An issue previously existed where periodics would cause an open
transaction to exist with the database which would cause issues
when attempting to write to the database.

This issue has been fixed by assembling the data to return to
the calling method, such that an open transaction does not
remain, by copying the data retrieved from the database,
thus disjointing it from the transaction.
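
A minimal sketch of the copy-before-return idea, using the standard-library sqlite3 module rather than the SQLAlchemy code the change actually touches. The `nodes` table layout and the function name are hypothetical.

```python
import sqlite3


def get_node_summaries(conn):
    """Return plain dicts so no cursor outlives this call."""
    cursor = conn.execute('SELECT id, name FROM nodes ORDER BY id')
    try:
        # fetchall() drains the result set; the dicts built here are
        # independent copies with no tie to the cursor.
        return [{'id': row[0], 'name': row[1]} for row in cursor.fetchall()]
    finally:
        cursor.close()
```

Because the caller only ever sees copied data, a long-running periodic task can keep working with the result without holding the read open.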

Closes-Bug: #2027405
Change-Id: I6401193b04fd3be78c37433bfdd0ccbd92aac8da
2023-07-12 12:07:09 -07:00
Julia Kreger
c4e3100d5c Add hold steps
* Updates API version to 1.85 to permit an ``unhold`` verb
* Adds the ``deploy hold`` and ``clean hold`` provision states
  to the internal state machine.
* Adds documentation on steps to help provide greater clarity
  to Ironic's users on how to utilize steps. It should be noted
  this documentation also includes the power state reserved step
  names from the DPU functionality patch.
* Fixes the state machine diagram. Changes type to PNG as SVG
  rendering is broken due to python libraries utilized for SVG
  generation which do not work on more recent Python versions.

Change-Id: I34f58f4e77e7757b89247fd64f5fcde26f679453
2023-06-30 14:34:26 -07:00
Julia Kreger
402c32094b Handle SAWarning around allocations FK Constraints
We have started to notice an SAWarning from sqlalchemy indicating:

  SAWarning: Cannot correctly sort tables; there are unresolvable
      cycles between tables "allocations, nodes", which is usually
      caused by mutually dependent foreign key constraints.
      Foreign key constraints involving these tables will not be
      considered; this warning may raise an error in a future release.

Hunting this down, it appears to be the two data consistency Foreign
Key constraints in the "allocations" table where an allocation would
try to have a conductor_affinity value mapped to conductors.id
and also have a direct association to a node, which *also* had the
same constraint.

And then similarly, mapping in reverse, asserting a fk constraint,
when nodes also had its own constraint back on allocations.

Sort of a circular loop.

Anyhow, removes it, and adds a db migration to remove the two
constraints.

Change-Id: I5596008e4971a29c635c45b24cb85db2d0d13ed3
2023-06-26 14:27:59 -07:00
Zuul
ce1abd4007 Merge "Handle duplicate node inventory entries per node" 2023-06-14 16:33:52 +00:00
Mahnoor Asghar
fa2d6685f3 Handle duplicate node inventory entries per node
When a node is inspected more than one time and the database is
configured as a storage backend, a new entry is made in the database
for each inspection result (node inventory). This patch handles this
behaviour as follows:
* By deleting previous inventory entries for the same node before adding
  a new entry in the database.
* By retrieving the most recent node inventory from the database when the
  database is queried.
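
The two behaviours above can be sketched with the standard-library sqlite3 module. This is an illustration only: the `node_inventory` table, its columns, and both function names are hypothetical, and "most recent" is assumed here to mean the highest row id.

```python
import sqlite3


def save_inventory(conn, node_id, inventory_json):
    """Replace any earlier inventory rows for the node with a new one."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            'DELETE FROM node_inventory WHERE node_id = ?', (node_id,))
        conn.execute(
            'INSERT INTO node_inventory (node_id, data) VALUES (?, ?)',
            (node_id, inventory_json))


def get_inventory(conn, node_id):
    """Return the most recent inventory for the node, or None."""
    row = conn.execute(
        'SELECT data FROM node_inventory WHERE node_id = ? '
        'ORDER BY id DESC LIMIT 1', (node_id,)).fetchone()
    return row[0] if row else None
```

Ordering on read means a node that already accumulated duplicates before the fix still returns its newest entry.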

Change-Id: Ic3df86f395601742d2fea2bcde62f7547067d8e4
2023-06-07 08:08:37 -04:00
Zuul
97f7177495 Merge "execute on child node support" 2023-06-07 04:04:45 +00:00
Zuul
8ef69aaa6a Merge "Prepare [inspector]require_managed_boot to change to True in the future" 2023-06-05 14:36:59 +00:00
Zuul
964a82db18 Merge "Add to Redfish hardware inventory collection" 2023-06-01 10:25:14 +00:00
Zuul
2bd69444d9 Merge "[iRMC] Fix IPMI incompatibility handling error" 2023-05-30 13:20:39 +00:00
Mahnoor Asghar
b3d7ba88d2 Add to Redfish hardware inventory collection
Add to the information collected by Redfish hardware inspection from
sushy, and store it in the documented hardware inventory format

Change-Id: I651599b84e6b8901647960b719626489b000b65f
2023-05-30 05:58:00 -04:00
Zuul
32532eeda5 Merge "DPU modeling - parent_node DB/Model/API" 2023-05-24 23:18:33 +00:00
Julia Kreger
013ac0cb41 execute on child node support
Allows steps to be executed on child nodes, and adds
the reserved power_on, power_off, and reboot step names.

Change-Id: I4673214d2ed066aa8b95a35513b144668ade3e2b
2023-05-24 15:42:46 -07:00
Julia Kreger
93688e9531 Explicitly use a session for DB version check
The db field value version check, which is a preflight to
major upgrades (to detect if a prior upgrade was not completed)
was using model_query, which could orphan an open transaction in
the same process until the Python interpreter went and took out
the proverbial trash.

We now use an explicit session which structurally ensures we close
any open transactions which allows a metadata lock to be obtained
to perform a schema update.

Change-Id: Id51419bc50af5a756bb7b0ca451df1936dd6f904
2023-05-23 20:26:05 +00:00
Julia Kreger
3f5e25e182 DPU modeling - parent_node DB/Model/API
Adds the parent node support and tests in one change
including all DB/Model/API changes along with RBAC and
basic API tests.

* Updates the API version to 1.83
* Adds parent_node and related index to the nodes table.
* Adds new API parameters to list by parent node relationship.

Depends-On: https://review.opendev.org/c/openstack/ironic/+/883967
Change-Id: I8d64fee7105718199986db4994e13352d639f04f
2023-05-23 18:23:25 +00:00
Vanou Ishii
27bf209113 [iRMC] Fix IPMI incompatibility handling error
This patch deals with a situation overlooked in patch
d23f72ee50, in which the
`irmc_ipmi_succeed` flag was added to deal with the iRMC
firmware's IPMI incompatibility introduced in iRMC
firmware version S6 2.00 and later.
That flag is set and updated by the `irmc` power_interface
code, and the rest of the iRMC driver code uses that flag.

When `ipmitool` is set as power_interface, that flag
is neither set nor updated, and the rest of the iRMC driver
code fails to handle the IPMI incompatibility correctly.

This patch adds logic to check power_interface to
make iRMC driver properly deal with iRMC firmware's IPMI
incompatibility even when `ipmitool` power_interface
is used.

Change-Id: Id353c4f5260a7c469779b50ad302f442223df5a0
2023-05-22 07:11:06 -04:00
Zuul
b5eafe9069 Merge "Update docs: Ironic uses launchpad now" 2023-05-20 22:40:12 +00:00
Zuul
fe8134ea28 Merge "Fix self_owned_node policy check" 2023-05-19 22:23:36 +00:00
Zuul
f206eb1f65 Merge "[iRMC] Fix parse_driver_info bug enforcing SNMP v3 under FIPS mode" 2023-05-19 17:42:09 +00:00
Julia Kreger
9c0b4c90a1 Fix Cinder Integration fallout from CVE-2023-2088
In the recent change to cinder, to address CVE-2023-2088,
cinder changed the policy rules and behavior for unbinding,
or "detaching" a volume. This was because of a vulnerability
in compute nodes where a volume which was in use by a VM
could be detached outside of Nova, and nova wouldn't become
aware the volume was detached, and the volume could be accessible
to the next VM.

This vulnerability doesn't apply to bare metal operations as
volumes are attached to whole baremetal nodes with Ironic.

We now generate and use a service token when interacting with
Cinder which allows cinder to recognize "this request is
coming from a fellow OpenStack service", and bypass
checking with Nova whether the "instance" is managed by Nova
or not. This allows the volumes to be attached and detached
as needed as part of the power operation flow and overall
set of lifecycle operations.

Related-Bug: 2004555
Closes-Bug: 2019892

Change-Id: Ib258bc9650496da989fc93b759b112d279c8b217
2023-05-18 07:43:31 -07:00
Jay Faulkner
65b8895e8a Update docs: Ironic uses launchpad now
Ironic switched to launchpad. Ensure our docs point contributors to the
correct location.

Change-Id: Ifa75c75741dd4a584bc2cb972eb4726c4c48d064
2023-05-17 15:42:41 -07:00
Zuul
832275015a Merge "Support longer checksums for redfish firmware upgrade" 2023-05-09 23:45:15 +00:00
Julia Kreger
9da6dfd73d Fix self_owned_node policy check
When enabling scope enforcement, the self_owned_node check could
generate a failure because the check internally can be touched
by both a project scoped and system scoped endpoint.

This change changes the tag in the policy so it doesn't prematurely
return an error to the API consumer.

Change-Id: I49e2f7f29eb98e5bb4e18614cea0aca726703f55
2023-05-09 09:51:43 -07:00
Zuul
1d0818cba2 Merge "Remove use of nomodeset by default" 2023-05-09 06:29:42 +00:00
OpenStack Proposal Bot
3139460cd2 Imported Translations from Zanata
For more information about this automatic import see:
https://docs.openstack.org/i18n/latest/reviewing-translation-import.html

Change-Id: Ice56ac44161d27ede41fdf53024e62e49c572049
2023-05-09 04:29:51 +00:00
Zuul
47b778977c Merge "Handle MissingAttributeError when using OOB inspections to fetch MACs" 2023-05-08 15:08:21 +00:00
Julia Kreger
03cd9788e6 Support longer checksums for redfish firmware upgrade
Previously only SHA1 hashes were supported, now we support
SHA256 and SHA512 by length.
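
A minimal sketch of choosing the hash algorithm from the length of the supplied checksum, assuming hex-encoded digests (40, 64, and 128 characters for SHA1, SHA256, and SHA512). The function and table names are hypothetical and are not taken from the change itself.

```python
import hashlib

# Hex digest length -> hashlib algorithm name.
_ALGORITHM_BY_HEX_LENGTH = {40: 'sha1', 64: 'sha256', 128: 'sha512'}


def verify_checksum(data, expected_hex):
    """Check data (bytes) against a hex checksum of a supported length."""
    algorithm = _ALGORITHM_BY_HEX_LENGTH.get(len(expected_hex))
    if algorithm is None:
        raise ValueError(
            'Unsupported checksum length: %d' % len(expected_hex))
    actual = hashlib.new(algorithm, data).hexdigest()
    return actual == expected_hex.lower()
```

Inferring the algorithm from the length lets callers keep passing a single checksum value without a separate algorithm field.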

Change-Id: Iddb196faca4008837595a3d0923f55d0e9d2aea5
2023-05-03 07:34:37 -07:00
Jacob Anders
f10958a542 Handle MissingAttributeError when using OOB inspections to fetch MACs
Currently, if an attempt is made to fetch MAC address information using
OOB inspection on a Redfish-managed node and EthernetInterfaces
attribute is missing on the node, inspection fails due to a
MissingAttributeError exception being raised by sushy. This change
catches and handles this exception.

Change-Id: I6f16da05e19c7efc966128fdf79f13546f51b5a6
2023-05-02 22:09:26 +10:00
Julia Kreger
75b881bd31 Fix DB/Lock session handling issues
Prior to this fix, we have been unable to run the Metal3 CI job
with SQLAlchemy's internal autocommit setting enabled. However
that setting is deprecated and needs to be removed.

Investigating our DB queries and request patterns, we were able
to identify some queries which generally resulted in the
underlying task and lock being held longer because the output
was not actually returned, which is something we've generally
had to fix in some places previously. Doing some of these
changes did drastically reduce the number of errors encountered
with the Metal3 CI job, however it did not eliminate them
entirely.

Upon further investigation, we determined that the underlying
issue arose when we had an external semi-random reader, such as
Metal3 polling endpoints. In that situation we could be blocked
from updating the database because, to open a write lock, we need
the active readers not to be interacting with the database. With
a random reader of sorts, the only realistic option we have is to
enable the Write Ahead Log[0]. We didn't have to do this with
SQLAlchemy previously because autocommit behavior hid the
complexities from us, but in order to move to SQLAlchemy 2.0,
we do need to remove autocommit.
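
For illustration, this is how the Write Ahead Log can be switched on for a file-backed SQLite database with the standard-library sqlite3 module. It is a sketch only: the function name is hypothetical, and the change itself configures this through its own database layer rather than a raw connection.

```python
import sqlite3


def connect_with_wal(path):
    """Open a file-backed SQLite database and request WAL journaling.

    Returns the connection and the journal mode SQLite reports back,
    which is 'wal' when the request was honoured.
    """
    conn = sqlite3.connect(path)
    mode = conn.execute('PRAGMA journal_mode=WAL').fetchone()[0]
    return conn, mode
```

In WAL mode readers do not block the writer and the writer does not block readers, which is the property needed when an external client polls the API at arbitrary times.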

Additionally, adds two unit tests for get_node_with_token rpc
method, which apparently we missed or lost somewhere along the
way. Also, adds notes to two Database interactions to suggest
we look at them in the future as they may not be the most
efficient path forward.

[0]: https://www.sqlite.org/wal.html

Change-Id: Iebcc15fe202910b942b58fc004d077740ec61912
2023-05-01 15:35:33 -07:00
Zuul
c42c2efe95 Merge "Remove all references to the "cpus" property" 2023-04-27 23:10:00 +00:00
Julia Kreger
f2605e9281 Remove use of nomodeset by default
The troubleshooting kernel command line option nomodeset
unfortunately changes the way framebuffer interactions work
with graphics devices which in some cases can result in kernel
memory being used for graphics updates. When this happens on
some specific hardware common in rack mount servers with baseboard
management controllers, this can cause the memory bus to become
locked for a brief time while the graphics update is occurring.

This locked memory bus means disk IO can become blocked,
and network cards can overflow their buffers resulting in
packet loss on top of the latency incurred by the graphics
update executing.

As such, we've removed the nomodeset option from default usage and
added a note describing its removal to the documentation along
with a release note.

Change-Id: I9084d88c3ec6f13bd64b8707892758fa87dd7f86
2023-04-26 07:34:29 -07:00
Vanou Ishii
3f09bdcf95 [iRMC] Fix parse_driver_info bug enforcing SNMP v3 under FIPS mode
This patch fixes a condition where iRMC driver interfaces would have
the FIPS enforcement logic check applied if the SNMP version was not
set to SNMP v3, even if the interfaces did not use SNMP.

With this patch, if FIPS is enabled, iRMC driver enforces SNMP
version to be version 3 only when any xxx_interface of iRMC
driver actually uses SNMP.

Story: 2010713
Task: 47879
Change-Id: I774c459a5e11b7cd01f7a65754d5a2c7cc573476
2023-04-26 06:36:45 -04:00
Chris Krelle
510a612eed Add ability to power off nodes in clean failed
We have seen duplicate IP issues when leaving clean failed nodes
powered on. This patch allows operators to power down nodes that
enter clean failed state.

Change-Id: Iecb402227485fe0ba787a262121c9d6a048b0e13
2023-04-24 16:20:54 -07:00
Zuul
8ef9db1570 Merge "Always fall back from hard linking to copying files" 2023-04-10 15:17:49 +00:00
Zuul
0a4144a046 Merge "On rpc service stop, wait for node reservation release" 2023-04-05 18:48:25 +00:00
Dmitry Tantsur
59c6ad96ce Always fall back from hard linking to copying files
The current check is insufficient: it passes for Kubernetes shared
volumes, although hard-linking between them is not possible.
This patch changes the approach to trying a hard link and falling
back to copyfile instead.

The patch relies on optimizations in Python 3.8 and thus should not
be backported beyond the Zed series to avoid performance regression.

Change-Id: I929944685b3ac61b2f63d2549198a2d8a1c8fe35
2023-03-31 15:49:15 +02:00
Dmitry Tantsur
3e21560bf7 Remove all references to the "cpus" property
The property is unused by Nova and, unlike memory_mb/local_gb, also
unused by Ironic (actually, our usage of local_gb is worth double-checking
as well, but at the very least it's referenced by inspection
implementations).

Change-Id: Ie8b0d9f58f4dcd102c183c30ae7f5acf68a5e4c3
2023-03-28 11:53:26 +02:00
Zuul
abbd859b76 Merge "Enables boot modes switching with Anaconda deploy for ilo driver" 2023-03-27 14:07:23 +00:00
Zuul
eb3cbc027b Merge "Fixes Secureboot with Anaconda deploy" 2023-03-20 15:32:39 +00:00
Nisha Agarwal
6341003dac Enables boot modes switching with Anaconda deploy for ilo driver
Enables boot modes switching with Anaconda deploy for ilo driver

Story: 2010357
Task: 46530

Change-Id: I383cdd5c9d45b074d351ec98b1145fd68e2f3ac3
2023-03-17 09:00:47 +00:00
Zuul
d2a7afcc74 Merge "Fix auth_protocol and priv_protocol for SNMP v3" 2023-03-17 00:00:49 +00:00
Nisha Agarwal
c5e004a73e Fixes Secureboot with Anaconda deploy
Fixes Secureboot with Anaconda deploy with PXE and iPXE

Story: 2010356
Task: 46529

Change-Id: Id6262654bb5e41e02c7d90b9a9aaf395e7b6a088
2023-03-16 15:04:39 +00:00
Dmitry Tantsur
58388212bc Prepare [inspector]require_managed_boot to change to True in the future
Currently, we silently fall back to ironic-inspector managing boot if
the boot interface cannot do it. What ironic-inspector does is set
the boot device to PXE and issue a reboot request. This was done
to keep backward compatibility with how inspection worked before managed
boot was introduced.

With in-band inspection migrating to Ironic proper, this "unmanaged"
mode becomes a more exotic case since it requires additional PXE
infrastructure. Additionally, the popularity of Redfish is rapidly
growing, and we support pre-populating ports when Redfish is used.

As such, the "unmanaged" mode should no longer be allowed by default.
This change prepares for the future flip of the default value by
issuing a deprecation warning if no explicit value is set for the option.
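
A rough sketch of warning only when an option was left unset, using the standard-library warnings module. The real change works through Ironic's configuration handling; the sentinel, the function name, and the warning text here are all hypothetical.

```python
import warnings

# Sentinel meaning "the operator did not set the option explicitly".
_UNSET = object()


def resolve_require_managed_boot(configured=_UNSET):
    """Return the effective value, warning if the default is relied upon."""
    if configured is _UNSET:
        warnings.warn(
            'require_managed_boot is not set explicitly; its default '
            'is expected to change to True in the future.',
            DeprecationWarning, stacklevel=2)
        # Today's behaviour: unmanaged inspection is still allowed.
        return False
    return bool(configured)
```

Operators who set the option either way see no warning, so only deployments that depend on the current default are notified.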

Depends-On: https://review.opendev.org/c/openstack/bifrost/+/877469
Change-Id: I6a13cf62b427c9e5c7d7d9ddc447d60f94592c9a
2023-03-15 13:19:40 +01:00
Zuul
821ce8c319 Merge "Wipe Agent Token when cleaning timeout occurs" 2023-03-14 19:27:16 +00:00
Zuul
718d52c792 Merge "Clean out agent token even if power is already off" 2023-03-13 23:00:46 +00:00