265 Commits

Author SHA1 Message Date
Julia Kreger
4fb8163717 Fix boot mode detection for partition images
Previously, partition images were hard coded to be bios based
as opposed to consulting all of the values AND the node itself
before making the most appropriate determination. Now the agent
utilises the internal helper to properly determine the boot
mode when calling ironic-lib.

Story: 2008070
Task: 41265
Change-Id: Id5eeda69d5b9de2b393af414472d57b0d4380c43
2020-12-19 19:03:16 +00:00
Julia Kreger
246e0cf29e Change default ironic_lib invocation to flag local booting
The partition image support has been telling ironic-lib
that the machine will be local booted. While this is likely
harmless, and doesn't seem to break anything, we should have
it match moving forward just to be on the safe side so we don't
accidentally break things down the road.

Change-Id: I33e5d583964ef8c21aa04d7427bcd3957b89d449
2020-12-19 19:02:58 +00:00
Julia Kreger
a12a5744b6 Add fstab pointer to EFI partition
Adds support for the EFI partition to be appended to fstab so the
filesystem can be automounted and EFI loader updated should the
deployed operating system need to do so.

This should enable bootloaders to be upgraded by linux based
operating systems after the instance has been deployed when
a partition image was utilized for the initial deployment.

Change-Id: Iec28a8841cc01ec8b01a3f5cca070c934c7a2531
Story: 2008070
Task: 40754
2020-12-17 14:17:31 +00:00
Julia Kreger
f9870d5812 Prevent broken partition image UEFI deploys
Partition images can sometimes contain a /boot folder structure
and even the assets for EFI booting on that filesystem, which is a
good thing. The conundrum is that Ironic does not handle this
properly and potentially replaces the bootloader in this sequence
such that grub2-install is used instead of signed bootloader assets.

As such, we should be preserving the assets and using them from
a partition image much like we do when we have a wholedisk
image and can identify the assets.

Now we will preserve the EFI boot assets, copy them to the new EFI
boot partition, and call the EFI setup methods to manage the EFI
nvram.

Note, this change also splits the logic path out that performs the
end call of the EFI boot manager into a reusable method but does
not retool all of the testing as it is intertwined in the
install_grub2 testing.

Also adds some additional debug logging, as much of the bootloader
installation code has multiple fallback/cleanup points which makes
it difficult to debug from logs.

Story: 2008070
Task: 40753
Change-Id: If17d4b4c06df5504987e61a1fde6662e9acd6989
2020-12-14 14:37:14 +00:00
Julia Kreger
cb6c0059b5 Fix default disk label with partition images
Partition images through the agent have the unfortunate
side effect of being executed without full node context
by default. Luckily we've had a similar problem and
cache the node.

This patch changes the lookup from a default of msdos
partitions to use the cached node object.

Change-Id: I002816c9372fdf1cc32f3c67f420073551479fd9
2020-12-14 06:36:18 -08:00
Julia Kreger
7a83773fbc Option to enable bootloader config failure bypass
Some hardware is very well intentioned. However this intention
can result in the UEFI NVRAM table being full which prevents us
from adding new records to the table. We can't be sure what to
delete, so in this case some operators just need the ability to
tell ironic "it is okay if this fails, it will still work."

The added ``ignore_bootloader_failure`` option adds
this capability which can be set per-node either in the agent
configuration via the ramdisk image, or in the pxe_append_params
configuration parameter for the node itself with a
``ipa-ignore-bootloader-failure`` option in order to prevent
the failure from being raised.

Change-Id: If3c83fb2ea2025fce092d495a64f32077c70d2d6
Story: 2008386
Task: 41309
2020-12-10 06:42:48 -08:00
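The opt-in behaviour described in commit 7a83773fbc can be sketched roughly as follows. This is a minimal illustration, not the agent's code: the helper names (parse_cmdline, ignore_bootloader_failure, run_bootloader_install), the accepted truthy values, and the use of RuntimeError are all assumptions. Only the ``ipa-ignore-bootloader-failure`` parameter name comes from the commit message.

```python
def parse_cmdline(cmdline):
    """Split a kernel command line into a {key: value} dict (hypothetical helper)."""
    params = {}
    for token in cmdline.split():
        key, _, value = token.partition("=")
        params[key] = value
    return params


def ignore_bootloader_failure(cmdline):
    """Return True if the operator opted in via the kernel command line.

    The set of accepted truthy spellings is an assumption.
    """
    value = parse_cmdline(cmdline).get("ipa-ignore-bootloader-failure", "")
    return value.lower() in ("1", "true", "yes", "on")


def run_bootloader_install(install, ignore_failure):
    """Run install(); swallow its failure only when explicitly allowed.

    Returns True on success and False when a failure was ignored.
    """
    try:
        install()
    except RuntimeError:
        if not ignore_failure:
            raise
        return False
    return True
```

The failure is only suppressed when the flag is set, so the default behaviour (raise on error) is unchanged for operators who do not opt in.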
Fedor Tarasenko
694ea7425d Support using LABEL as identifier for rootfs
Add possibility to use disk LABEL to identify rootfs uuid for
Software RAID deployment

Change-Id: I77f36e70ddc539af0190db1c1abe0fb2c66f34b4
Story: 2008303
Task: 41188
2020-11-03 13:03:34 +03:00
Julia Kreger
6542a9cb04 Don't run os-prober from grub2-mkconfig
By default, grub2-mkconfig scans everything to look for other
environments and then load those into the grub configuration.

It makes sense, but on newer versions of grub2 in distribution
images, os-prober is taking an exceptionally long time in some
cases where more than one storage device exists with other
filesystems.

As a result of the os-prober execution by grub2-mkconfig, the
bootloader installation can completely time out and fail the
deployment. This is presently experienced with metalsmith on
centos8.

There are numerous sporadic reports of issues like this one,
where grub2-mkconfig hangs for some period of time, and this is
observable on Centos8.2 in our CI. While one report [0] mentions
this issue, another bug [1] has the dialog that actually helps us
frame the context as to what we likely should do.

Also, fixes the unit testing so we actually test if we're running
with grub2. :\

[0]: https://bugzilla.redhat.com/show_bug.cgi?id=1744693
[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1709682

Depends-On: https://review.opendev.org/#/c/748315
Change-Id: I14bf299afef3a1ddb2006fe5f182d7f0d249e734
2020-10-22 22:28:07 +00:00
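One way to keep grub2-mkconfig from running os-prober, as commit 6542a9cb04 describes, is to set GRUB's documented GRUB_DISABLE_OS_PROBER variable in the environment of the call. The sketch below only builds the command and environment. The function name is hypothetical, and whether the agent uses exactly this mechanism is an assumption, since the commit message does not say.

```python
import os


def mkconfig_invocation(output_path, base_env=None):
    """Build the command and environment for a grub2-mkconfig call.

    Hypothetical helper: it only prepares the arguments, nothing is executed.
    """
    # Copy the environment so the caller's mapping is not modified.
    env = dict(os.environ if base_env is None else base_env)
    # Ask GRUB's configuration scripts to skip os-prober.
    env["GRUB_DISABLE_OS_PROBER"] = "true"
    cmd = ["grub2-mkconfig", "-o", output_path]
    return cmd, env
```

The returned pair could then be handed to something like subprocess.run(cmd, env=env, check=True), ideally with a timeout so a hang still surfaces as an error.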
Dmitry Tantsur
420ebc0d73 Do not silently swallow errors in the write_image deploy step
Calling join() does not raise, we need to explicitly check the result.

Change-Id: I81d3d727af220c2b50358edab8139f07874611f0
Story: #2008240
Task: #41083
2020-10-09 11:24:12 +02:00
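The point made in commit 420ebc0d73, that join() returns normally even when the joined work failed, can be illustrated with plain Python threads. This is a generic sketch under that assumption: the Worker class, failing_write and run_and_check are hypothetical names and do not mirror the agent's command result classes.

```python
import threading


class Worker(threading.Thread):
    """Hypothetical worker that records its exception instead of losing it."""

    def __init__(self, work):
        super().__init__()
        self._work = work
        self.error = None

    def run(self):
        try:
            self._work()
        except Exception as exc:
            # Stored so that the caller can inspect it after join().
            self.error = exc


def failing_write():
    raise RuntimeError("image write failed")


def run_and_check(work):
    worker = Worker(work)
    worker.start()
    worker.join()  # returns normally even though the work may have raised
    if worker.error is not None:
        # The explicit check is what turns the stored failure into an error.
        raise worker.error
```

Without the check after join(), a failure inside the worker would go unnoticed by the caller.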
Zuul
35d2292aa4 Merge "Log a warning of target_boot_mode does not match current boot mode" 2020-10-07 17:01:51 +00:00
Dmitry Tantsur
1a67dddde7 Log a warning of target_boot_mode does not match current boot mode
This is not a normal situation and is likely to cause problems.

Change-Id: Id0668fd160ac0539d85997e985f8c43d9da75c90
2020-10-07 12:30:23 +02:00
Dmitry Tantsur
fc4e0eed6a Don't try to call GRUB when root UUID is not provided
We don't have a really working way to detect root UUID for whole
disk images at the moment, which results in an ignored traceback
every time install_bootloader is called with whole disk images in
UEFI mode. Avoid it by skipping GRUB2 if root UUID is unknown.

Change-Id: I84245538f59c664b72d1cafbca8d61be0978f489
2020-10-07 12:06:42 +02:00
Dmitry Tantsur
fe6b687968 When reporting that agent is busy, report the executed command
Also make this API return a proper HTTP code (409 instead of 500).

Change-Id: I5d86878b5ed6142ed2630adee78c0867c49b663f
2020-09-18 17:52:49 +02:00
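The change in commit fe6b687968 (name the running command and answer with 409 rather than 500) might look roughly like the sketch below. The class names, the message wording and the to_response helper are assumptions made for illustration, not the agent's actual API.

```python
class AgentError(Exception):
    """Hypothetical base error; unknown failures map to HTTP 500."""

    status_code = 500


class AgentIsBusy(AgentError):
    """Hypothetical error raised while another command is still executing."""

    status_code = 409  # Conflict: the request is valid but cannot run now

    def __init__(self, command_name):
        super().__init__(
            "Agent is busy executing command: %s" % command_name)
        self.command_name = command_name


def to_response(exc):
    """Translate an exception into a (status, body) pair."""
    return getattr(exc, "status_code", 500), {"error": str(exc)}
```

A caller that receives 409 can tell from the body which command is in progress and decide to wait and retry, instead of treating the response as an internal server error.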
Julia Kreger
d3c3d4dabe Update the cache if we don't have a root device hint
Or at least try to.

Some deployments just don't use root device hints, and this is okay.

However, other deployments need root device hints, and with fast
track mode in ramdisks, we created a situation where the node cache
could be updated by a human or software between the time the agent
was started, and the deployment was requested.

As a result, the agent has been updated to check if we have a hint
and if we don't, update the cache from the node lookup endpoint.

This is not needed when the inband deploy steps are executed, as
the process of updating the steps does force the node cache to be
updated.

Change-Id: I27201319f31cdc01605a3c5ae9ef4b4218e4a3f6
Story: 2008039
Task: 40701
2020-08-25 19:34:48 +00:00
Zuul
dc395c5837 Merge "More refactoring of the image module" 2020-07-27 07:15:42 +00:00
Zuul
9ca640a1c5 Merge "Prevent un-needed iscsi cleanup" 2020-07-25 13:54:51 +00:00
Riccardo Pittau
80e11811f5 More refactoring of the image module
Introducing new function _umount_all_partitions to reduce the size
of _install_grub2

Change-Id: I304468d57b10d677f2a9d58aec42a1bf414c6cba
2020-07-24 14:34:46 +02:00
Zuul
bfb395837d Merge "Adds poll mode deployment support" 2020-07-22 19:53:31 +00:00
Julia Kreger
2a56ee03b6 Prevent un-needed iscsi cleanup
When we added software raid support, we started calling bootloader
installation. As time went on, we enhanced that code path for
non-RAID cases in order to ensure that UEFI nvram was set up
for the instance to boot properly.

Somewhere in this process, we missed a possible failure case where
the iscsi client tgtadm may return failures. Obviously, the correct
path is to not call iscsi teardown if we don't need to.

Since it was always semi-opportunistic teardown, we can't blindly
catch any error, and if we started iSCSI and failed to tear the
connection down, we might want to still fail, so this change
moves the logic over to use a flag on the agent object, which
allows one extension to set the flag and the other to read it and
take action based upon that.

Change-Id: Id3b1ae5e59282f4109f6246d5614d44c93aefa7c
Story: 2007937
Task: 40395
2020-07-20 14:24:06 -07:00
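The flag-based approach from commit 2a56ee03b6 can be sketched as two extensions sharing state through the agent object. All class, method and attribute names here (Agent, IscsiExtension, ImageExtension, iscsi_started) are hypothetical, and the real iSCSI setup and teardown are replaced by placeholders.

```python
class Agent:
    """Hypothetical agent object shared by all extensions."""

    def __init__(self):
        self.iscsi_started = False


class IscsiExtension:
    def __init__(self, agent):
        self.agent = agent

    def start_iscsi_target(self):
        # Real target setup is omitted; only the flag handling is shown.
        self.agent.iscsi_started = True


class ImageExtension:
    def __init__(self, agent, teardown):
        self.agent = agent
        self._teardown = teardown  # callable that cleans up iSCSI

    def install_bootloader(self):
        # Tear down only if iSCSI was actually started. Errors from the
        # teardown are deliberately not caught, so a real failure still fails.
        if self.agent.iscsi_started:
            self._teardown()
            self.agent.iscsi_started = False
```

Because the teardown is skipped entirely when the flag is unset, deployments that never started iSCSI cannot fail on a cleanup they did not need.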
Riccardo Pittau
9d9a6bce5c Refactor part of image module
Shuffle some functions around and reduce size of _is_bootloader_loaded
moving logic out to a new function.

Change-Id: I9c10bf05186dcebb37f175d61bf4ac9ff86b6510
2020-07-07 10:44:50 +02:00
Dmitry Tantsur
ba3caa6c64 Increase the ESP partition size to 550 MiB when using software RAID
This has been a popular guidance, and diskimage-builder has recently
started following it.

Change-Id: I794c846fb191c15b0a30546bf64d624dfbde0fd4
2020-07-02 17:30:33 +02:00
Zuul
de7d5affe7 Merge "Mount all vfat partitions before calling grub2" 2020-07-02 10:37:04 +00:00
Arne Wiebalck
c5022790b3 Mount all vfat partitions before calling grub2
In order to ensure grub2 finds all files it needs, mount all
vfat partitions specified in the deployed image.

Story: #2007618
Task: #39629
Change-Id: Ie5b6e0abc3f266409562f9ecb26538126b667056
2020-06-30 18:31:58 +02:00
Dmitry Tantsur
00ad03b709 Fixes minor issues in the read() retries patch
Follow-up to commit c5b97eb781cf9851f9abe87a1500b4da55b8bde8.

Two things slipped through the cracks:
* ImageDownloadError was instantiated incorrectly, resulting in a wrong
  error message. This was uncovered by using assertRaisesRegex in tests.
* We allowed calling write(None). This was uncovered by avoiding sleep(4)
  in tests and enabling more failed calls before timeout.

Change-Id: If5e798c5461ea3e474a153574b0db2da96f2dfa8
2020-06-30 10:51:53 +02:00
Zuul
f97f8e2c06 Merge "Fix confusing logging when running asynchronous commands" 2020-06-29 22:40:02 +00:00
Dmitry Tantsur
0eee26ea66 Fix confusing logging when running asynchronous commands
We log them as completed when they start executing.

Also fix a problem in remove_large_keys that prevented items
with defaultdict from being logged.

Change-Id: I34a06cc85f55c693416f8c4c9877d55d6affafc9
2020-06-26 15:19:04 +02:00
Zuul
c94fb84497 Merge "Minor clean-up follow-up to timeout on read() fix" 2020-06-25 10:23:18 +00:00
Julia Kreger
7abda4eefe Minor clean-up follow-up to timeout on read() fix
Just some minor cleanup driven from the review process.

Change-Id: I0b3d73c251d6da6d85e11279990dcc36751e27e7
2020-06-24 10:02:28 -07:00
Julia Kreger
159ab9f0ce Add full download retries
Instead of just trying to get the connection and handler
for the download, let's retry the whole action of
downloading.

Change-Id: I9217792d32e6f33c70f146a9b7d3ef58c5644d8a
2020-06-23 20:27:41 +00:00
Julia Kreger
c5b97eb781 Add timeout operations to try and prevent hang on read()
Socket read operations can be blocking and may not timeout as
expected when thinking of timeouts at the beginning of a
socket request. This can occur when streaming file contents
down to the agent and there is a hard connectivity break.

In other words, we could be in a situation like:

- read(fd, len) - Gets data
- Select returns context to the program, we do things with data.
** hard connectivity break for next 90 seconds**
-  read(fd, len) - We drain the in-memory buffer side of the socket.
-  Select returns context, we do things with our remaining data
** Server retransmits **
** Server times out due to no ack **
** Server closes socket and issues a FIN,RST packet to the client **
** Connectivity restored, Client never got FIN,RST **
** Client socket still waiting for more data **
- read(fd, len) - No data returned
- Select returns, yet we have no data to act on as the buffer is
  empty OR the buffered data doesn't meet our required read len value.
  tl;dr noop
- read(fd, len) <-- We continue to try and read until the socket is
                    recognized as dead, which could be a long time.

NOTE: The above read()s are python's read() on contents being
      streamed. Lower level reads exist, but brains will hurt
      if we try to cover the dynamics at that level.

As such, we need to keep an eye on the last time we
received a packet, and use that to decide whether we have timed
out. Requests periodically yields back even when no data
has been received, in order to allow the caller to wall
clock the progress/status and take appropriate action.

When we exceed the timeout time value with our wall clock,
we will fail the download.

Change-Id: I7214fc9dbd903789c9e39ee809f05454aeb5a240
2020-06-23 13:25:09 -07:00
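The wall-clock check described in commit c5b97eb781 can be sketched as a generator wrapped around any chunk iterator that may yield empty values while it waits for data. This is an illustration under stated assumptions, not the agent's implementation: DownloadTimeout, read_with_deadline and the injectable clock parameter are hypothetical names.

```python
import time


class DownloadTimeout(Exception):
    """Hypothetical error raised when no data arrives for too long."""


def read_with_deadline(chunks, timeout, clock=time.monotonic):
    """Yield non-empty chunks; fail if none arrives within `timeout` seconds.

    `chunks` is any iterator that may yield empty values (b"" or None)
    while it is waiting for data.
    """
    last_data = clock()
    for chunk in chunks:
        if chunk:
            # Data arrived: remember when, then pass it to the caller.
            last_data = clock()
            yield chunk
        elif clock() - last_data > timeout:
            # Nothing useful for longer than the allowed window.
            raise DownloadTimeout(
                "no data received for more than %s seconds" % timeout)
```

The clock is a parameter only so that the timeout can be exercised in tests without sleeping. In normal use the default monotonic clock is enough.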
Kaifeng Wang
61c95554ff Adds poll mode deployment support
Adds a new poll extension to provide get_hardware_info and get_node_info
interfaces.

get_hardware_info will be used for node validation by ironic deploy
drivers.

get_node_info will be used for sending lookup data to IPA.

Standalone mode was assumed to be debug only, but that is no longer
the case once poll mode is introduced, so this slightly updates the
description and also prevents the mdns lookup when standalone is true.

Story: 1526486
Task: 28724

Change-Id: I5ad772a18cc4584585c5a7b6fb127547cece1998
2020-06-21 16:44:00 +08:00
Zuul
46bf7e0ef4 Merge "Add a deploy step for writing an image" 2020-06-20 00:00:10 +00:00
Dmitry Tantsur
6d7ec350ff Make get_partition_uuids work with whole disk images
We used to populate root UUID inside the message formatting function,
move it to actual prepare_image/cache_image calls.

Change-Id: Ifb22220dfd49633e8623dd76f7a6a128f5874b78
2020-06-17 14:38:58 +02:00
Zuul
d7cf7bd341 Merge "New extension call to return partition UUIDs" 2020-06-09 12:31:55 +00:00
Dmitry Tantsur
7e5fe1121e Make the install_bootloader command asynchronous
It does not return anything, so there is no point in it being
synchronous. Ironic always calls it with wait=True, so there is
no problem with backward compatibility either.

Change-Id: I44fec2e0cb54486328ce71263613d8592e384870
2020-06-08 15:10:05 +02:00
Dmitry Tantsur
9d4cf5532f Add a deploy step for writing an image
The new step just invokes the appropriate method of the standby extension.

Change-Id: Ic74f83ab2b7e58f8e4b46e0abfab79e221afeb3e
Story: 2006963
2020-06-02 15:23:54 +02:00
Dmitry Tantsur
6c1545b75b New extension call to return partition UUIDs
Currently we parse the success message from the write_image call.
This is inconvenient and incompatible with the deploy steps split.

Change-Id: I258dc1ff1ad1c9df5cbc26a7825d9e7ef2f3205b
Story: #2006963
2020-06-02 15:05:59 +02:00
Dmitry Tantsur
8adb7e1a04 Add timeout and retries when connection to an image server
If the server is stuck for any reason, the download will hang for
a potentially long time. Provide a timeout (defaults to 60 seconds)
and 2 retries on failure.

Change-Id: Ie53519266edd914fdbfa82fe52b4a55151e5ec5f
2020-04-24 10:34:40 +02:00
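The timeout-plus-retries behaviour from commit 8adb7e1a04 can be sketched with a small wrapper. The defaults (60 seconds, 2 retries) are taken from the commit message. Reading "2 retries" as two additional attempts after the first, the function name, the delay parameter and the choice of OSError as the retryable error are assumptions.

```python
import time


def fetch_with_retries(fetch, timeout=60, retries=2, delay=0):
    """Call fetch(timeout), retrying up to `retries` extra times on failure.

    `fetch` is any callable that opens the connection and honours the
    timeout it is given. Connection errors and timeouts are OSError
    subclasses in Python 3, so they are what gets retried here.
    """
    attempts = retries + 1
    for attempt in range(1, attempts + 1):
        try:
            return fetch(timeout)
        except OSError:
            if attempt == attempts:
                raise  # out of retries: surface the last error
            time.sleep(delay)
```

With an HTTP client, fetch could be a small function that passes the timeout through to the request call, so a stuck server leads to a bounded wait rather than an indefinite hang.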
Dmitry Tantsur
c0502649ba Add raid.apply_configuration deploy step
For compatibility with out-of-band RAID deploy steps, we need to have
one apply_configuration step, not a create/delete pair.

Change-Id: I55bbed96673c9fa247cafdac9a3ade3a6ff3f38d
Story: #2006963
2020-04-20 12:50:14 +02:00
Zuul
b9e320e76f Merge "Add an ability to run in-band deploy steps" 2020-04-09 09:31:49 +00:00
Arne Wiebalck
66c32784af Editing follow-up for UEFI Software RAID support
This is a follow-up to https://review.opendev.org/#/c/696156/

Change-Id: I0fd2c09045ff07a57374934c35d4a3a8467f5e99
Story: #2006379
Task: #37635
2020-04-06 18:03:25 +02:00
Mark Goddard
1b4ce47921 Add an ability to run in-band deploy steps
Mostly adaptation of cleaning methods.

Co-Authored-By: Dmitry Tantsur <dtantsur@redhat.com>
Change-Id: Ife0502391bbece46d619a20a825dfdb191d5c2b4
Story: 2006963
Task: 37791
2020-04-06 10:24:08 +02:00
Raphael Glon
9343348106 Software RAID: Add UEFI support
The proposed changes concern two steps:

First, when creating the RAID configuration, have a GPT partition
table type (this is not necessary, but more natural with UEFI).
Also, leave some space, either for the EFI partitions or the BIOS
boot partitions, outside the Software RAID.

Secondly, when installing the bootloader, make sure the correct
boot partitions are created or relocated.

Change-Id: Icf0a76b0de89e7a8494363ec91b2f1afda4faa3b
Story: #2006379
Task: #37635
2020-04-02 18:02:19 +02:00
Zuul
68a71513f0 Merge "Bump hacking to 3.0.0" 2020-03-31 12:36:11 +00:00
Riccardo Pittau
a332a19a57 Bump hacking to 3.0.0
Change-Id: I1032ea6a2e9d79aeaecb1458c319cbeb15ac1fff
2020-03-30 12:55:46 +02:00
Julia Kreger
916cd5c8de Rescan after restarting the md device
If an md device is restarted, there is a chance, depending
on the OS, that the partition may not be found upon start
of the md device.

Instead, we should always rescan after re-assembling the raid
device.

Story: 2007275
Task: 38712
Change-Id: I92bac20812940e04381a54ef2905ef5f6e293813
2020-03-29 14:47:41 +00:00
Julia Kreger
55b011cb1f Fix GPT partition tables after agent writes contents
Fixes errors that were being raised when the agent restarted the
raidset after directly writing out software raid images; the raidset
is restarted for device consistency and partition updates later
on in the code path of deployment.

Story: 2007455
Task: 39187
Change-Id: I9abf51eb77b262932e70329af5ce1593106a3171
2020-03-29 07:45:25 -07:00
Julia Kreger
bf0bb7a87a Improve debug logging around Raid/Bootloader
Change-Id: I7d34b918a859972a2d5650494824d3333016dd11
2020-03-28 08:55:32 -07:00
Zuul
d73d27afbd Merge "[trivial] Fix comment for Software RAID restart" 2020-03-25 10:57:30 +00:00
Arne Wiebalck
46c482d063 [trivial] Fix comment for Software RAID restart
The detection of the holder disks was moved elsewhere,
so the comment is misleading now.

Change-Id: If41b4270ab8fb1626979ca17134764e088e3cb65
2020-03-23 18:54:46 +01:00