ironic-python-agent

Author	SHA1	Message	Date
Dmitry Tantsur	403d2f06c6	Fix error message with UEFI-incompatible images It's somewhat confusing at the moment, since we're trying to find a UEFI partition by UUID "None". Don't search for partition if we don't know its UUID, and provide a better error message. Change-Id: Ief874084132797a445ddae8009264712a05facfd	2021-02-10 18:08:58 +01:00
Xinliang Liu	68a43b9da8	Fix UEFI boot entry creation for aarch64 Diskimage-builder installs grub with option '--removable'[1], thus for aarch64 there is no 'grubaa64.efi' file in the efi directory, only 'BOOTAA64.EFI': linaro@bm-ubuntu:~$ tree /boot/efi /boot/efi └── EFI └── BOOT └── BOOTAA64.EFI 2 directories, 1 file [1]: `8f12d9530e/diskimage_builder/elements/bootloader/finalise.d/50-bootloader (L158)` Task: #41698 Story: #2008560 Change-Id: I9fc55c068ea980beae273411db9d3568eec25eb8	2021-01-27 03:32:23 +00:00
Julia Kreger	4fb8163717	Fix boot mode detection for partition images Previously, partition images were hard coded to be bios based as opposed to consulting all of the values AND the node itself before making the most appropriate determination. Now the agent utilises the internal helper to properly determine the boot mode when calling ironic-lib. Story: 2008070 Task: 41265 Change-Id: Id5eeda69d5b9de2b393af414472d57b0d4380c43	2020-12-19 19:03:16 +00:00
Julia Kreger	246e0cf29e	Change default ironic_lib invocation to flag local booting The partition image support has been telling ironic-lib that the machine will be local booted. While this is likely harmless, and doesn't seem to break anything, we should have it match moving forward just to be on the safe side so we don't accidentally break things down the road. Change-Id: I33e5d583964ef8c21aa04d7427bcd3957b89d449	2020-12-19 19:02:58 +00:00
Julia Kreger	a12a5744b6	Add fstab pointer to EFI partition Adds support for the EFI partition to be appended to fstab so the filesystem can be automounted and EFI loader updated should the deployed operating system need to do so. This should enable bootloaders to be upgraded by linux based operating systems after the instance has been deployed when a partition image was utilized for the initial deployment. Change-Id: Iec28a8841cc01ec8b01a3f5cca070c934c7a2531 Story: 2008070 Task: 40754	2020-12-17 14:17:31 +00:00
Julia Kreger	f9870d5812	Prevent broken partition image UEFI deploys Partition images can sometimes contain a /boot folder structure and even the assets for EFI booting on that filesystem, which is a good thing. The conundrum is that Ironic does not handle this properly and potentially replaces the bootloader in this sequence such that grub2-install is used instead of signed bootloader assets. As such, we should be preserving the assets and using them from a partition image much like we do when we have a wholedisk image and can identify the assets. Now we will preserve the EFI boot assets, copy them to the new EFI boot partition, and call the EFI setup methods to manage the EFI nvram. Note, this change also splits the logic path out that performs the end call of the EFI boot manager into a reusable method but does not retool all of the testing as it is intertwined in the install_grub2 testing. Also adds some additional debug logging, as much of the bootloader installation code has multiple fallback/cleanup points which makes it difficult to debug from logs. Story: 2008070 Task: 40753 Change-Id: If17d4b4c06df5504987e61a1fde6662e9acd6989	2020-12-14 14:37:14 +00:00
Julia Kreger	cb6c0059b5	Fix default disk label with partition images Partition images through the agent have the unfortunate side effect of being executed without full node context by default. Luckily we've had a similar problem and cache the node. This patch changes the lookup from a default of msdos partitions to use the cached node object. Change-Id: I002816c9372fdf1cc32f3c67f420073551479fd9	2020-12-14 06:36:18 -08:00
Julia Kreger	7a83773fbc	Option to enable bootloader config failure bypass Some hardware is very well intentioned. However this intention can result in the UEFI NVRAM table being full which prevents us from adding new records to the table. We can't be sure what to delete, so in this case some operators just need the ability to tell ironic "it is okay if this fails, it will still work." The added ``ignore_bootloader_failure`` option adds this capability which can be set per-node either in the agent configuration via the ramdisk image, or in the pxe_append_params configuration parameter for the node itself with an ``ipa-ignore-bootloader-failure`` option in order to prevent the failure from being raised. Change-Id: If3c83fb2ea2025fce092d495a64f32077c70d2d6 Story: 2008386 Task: 41309	2020-12-10 06:42:48 -08:00
Fedor Tarasenko	694ea7425d	Support using LABEL as identifier for rootfs Add possibility to use disk LABEL to identify rootfs uuid for Software RAID deployment Change-Id: I77f36e70ddc539af0190db1c1abe0fb2c66f34b4 Story: 2008303 Task: 41188	2020-11-03 13:03:34 +03:00
Julia Kreger	6542a9cb04	Don't run os-prober from grub2-mkconfig By default, grub2-mkconfig scans everything to look for other environments and then loads those into the grub configuration. It makes sense, but on newer versions of grub2 in distribution images, os-prober is taking an exceptionally long time in some cases where more than one storage device exists with other filesystems. As a result of the os-prober execution by grub2-mkconfig, the bootloader installation can completely time out and fail the deployment. This is presently experienced with metalsmith on centos8. There are numerous sporadic reports of issues like this one, where grub2-mkconfig hangs for some period of time, and this is observable on Centos8.2 in our CI. While one report[0] mentions this issue, another bug [1] has the dialog that actually helps us frame the context as to what we likely should do. Also, fixes the unit testing so we actually test if we're running with grub2. :\ [0]: https://bugzilla.redhat.com/show_bug.cgi?id=1744693 [1]: https://bugzilla.redhat.com/show_bug.cgi?id=1709682 Depends-On: https://review.opendev.org/#/c/748315 Change-Id: I14bf299afef3a1ddb2006fe5f182d7f0d249e734	2020-10-22 22:28:07 +00:00
Dmitry Tantsur	420ebc0d73	Do not silently swallow errors in the write_image deploy step Calling join() does not raise, we need to explicitly check the result. Change-Id: I81d3d727af220c2b50358edab8139f07874611f0 Story: #2008240 Task: #41083	2020-10-09 11:24:12 +02:00
Zuul	35d2292aa4	Merge "Log a warning if target_boot_mode does not match current boot mode"	2020-10-07 17:01:51 +00:00
Dmitry Tantsur	1a67dddde7	Log a warning if target_boot_mode does not match current boot mode This is not a normal situation and is likely to cause problems. Change-Id: Id0668fd160ac0539d85997e985f8c43d9da75c90	2020-10-07 12:30:23 +02:00
Dmitry Tantsur	fc4e0eed6a	Don't try to call GRUB when root UUID is not provided We don't have a really working way to detect root UUID for whole disk images at the moment, which results in an ignored traceback every time install_bootloader is called with whole disk images in UEFI mode. Avoid it by skipping GRUB2 if root UUID is unknown. Change-Id: I84245538f59c664b72d1cafbca8d61be0978f489	2020-10-07 12:06:42 +02:00
Dmitry Tantsur	fe6b687968	When reporting that agent is busy, report the executed command Also make this API return a proper HTTP code (409 instead of 500). Change-Id: I5d86878b5ed6142ed2630adee78c0867c49b663f	2020-09-18 17:52:49 +02:00
Julia Kreger	d3c3d4dabe	Update the cache if we don't have a root device hint Or at least try to. Some deployments just don't use root device hints, and this is okay. However, other deployments need root device hints, and with fast track mode in ramdisks, we created a situation where the node cache could be updated by a human or software between the time the agent was started, and the deployment was requested. As a result, the agent has been updated to check if we have a hint and if we don't, update the cache from the node lookup endpoint. This is not needed when the inband deploy steps are executed, as the process of updating the steps does force the node cache to be updated. Change-Id: I27201319f31cdc01605a3c5ae9ef4b4218e4a3f6 Story: 2008039 Task: 40701	2020-08-25 19:34:48 +00:00
Zuul	dc395c5837	Merge "More refactoring of the image module"	2020-07-27 07:15:42 +00:00
Zuul	9ca640a1c5	Merge "Prevent un-needed iscsi cleanup"	2020-07-25 13:54:51 +00:00
Riccardo Pittau	80e11811f5	More refactoring of the image module Introducing new function _umount_all_partitions to reduce the size of _install_grub2 Change-Id: I304468d57b10d677f2a9d58aec42a1bf414c6cba	2020-07-24 14:34:46 +02:00
Zuul	bfb395837d	Merge "Adds poll mode deployment support"	2020-07-22 19:53:31 +00:00
Julia Kreger	2a56ee03b6	Prevent un-needed iscsi cleanup When we added software raid support, we started calling bootloader installation. As time went on, we enhanced that code path for non RAID cases in order to ensure that UEFI nvram was setup for the instance to boot properly. Somewhere in this process, we missed a possible failure case where the iscsi client tgtadm may return failures. Obviously, the correct path is to not call iscsi teardown if we don't need to. Since it was always semi-opportunistic teardown, we can't blindly catch any error, and if we started iSCSI and failed to tear the connection down, we might want to still fail, so this change moves the logic over to use a flag on the agent object, which one extension sets and the other reads in order to take action based upon it. Change-Id: Id3b1ae5e59282f4109f6246d5614d44c93aefa7c Story: 2007937 Task: 40395	2020-07-20 14:24:06 -07:00
Riccardo Pittau	9d9a6bce5c	Refactor part of image module Shuffle some functions around and reduce size of _is_bootloader_loaded moving logic out to a new function. Change-Id: I9c10bf05186dcebb37f175d61bf4ac9ff86b6510	2020-07-07 10:44:50 +02:00
Dmitry Tantsur	ba3caa6c64	Increase the ESP partition size to 550 MiB when using software RAID This has been a popular guidance, and diskimage-builder has recently started following it. Change-Id: I794c846fb191c15b0a30546bf64d624dfbde0fd4	2020-07-02 17:30:33 +02:00
Zuul	de7d5affe7	Merge "Mount all vfat partitions before calling grub2"	2020-07-02 10:37:04 +00:00
Arne Wiebalck	c5022790b3	Mount all vfat partitions before calling grub2 In order to ensure grub2 finds all files it needs, mount all vfat partitions specified in the deployed image. Story: #2007618 Task: #39629 Change-Id: Ie5b6e0abc3f266409562f9ecb26538126b667056	2020-06-30 18:31:58 +02:00
Dmitry Tantsur	00ad03b709	Fixes minor issues in the read() retries patch Follow-up to commit c5b97eb781cf9851f9abe87a1500b4da55b8bde8. Two things slipped through the cracks: * ImageDownloadError was instantiated incorrectly, resulting in a wrong error message. This was uncovered by using assertRaisesRegex in tests. * We allowed calling write(None). This was uncovered by avoiding sleep(4) in tests and enabling more failed calls before timeout. Change-Id: If5e798c5461ea3e474a153574b0db2da96f2dfa8	2020-06-30 10:51:53 +02:00
Zuul	f97f8e2c06	Merge "Fix confusing logging when running asynchronous commands"	2020-06-29 22:40:02 +00:00
Dmitry Tantsur	0eee26ea66	Fix confusing logging when running asynchronous commands We log them as completed when they start executing. Also fix a problem in remove_large_keys that prevented items with defaultdict from being logged. Change-Id: I34a06cc85f55c693416f8c4c9877d55d6affafc9	2020-06-26 15:19:04 +02:00
Zuul	c94fb84497	Merge "Minor clean-up follow-up to timeout on read() fix"	2020-06-25 10:23:18 +00:00
Julia Kreger	7abda4eefe	Minor clean-up follow-up to timeout on read() fix Just some minor cleanup driven from the review process. Change-Id: I0b3d73c251d6da6d85e11279990dcc36751e27e7	2020-06-24 10:02:28 -07:00
Julia Kreger	159ab9f0ce	Add full download retries Instead of just trying to get the connection and handler for the download, let's retry the whole action of downloading. Change-Id: I9217792d32e6f33c70f146a9b7d3ef58c5644d8a	2020-06-23 20:27:41 +00:00
Julia Kreger	c5b97eb781	Add timeout operations to try and prevent hang on read() Socket read operations can be blocking and may not timeout as expected when thinking of timeouts at the beginning of a socket request. This can occur when streaming file contents down to the agent and there is a hard connectivity break. In other words, we could be in a situation like: - read(fd, len) - Gets data - Select returns context to the program, we do things with data. hard connectivity break for next 90 seconds - read(fd, len) - We drain the in-memory buffer side of the socket. - Select returns context, we do things with our remaining data Server retransmits Server times out due to no ack Server closes socket and issues a FIN,RST packet to the client Connectivity restored, Client never got FIN,RST Client socket still waiting for more data - read(fd, len) - No data returned - Select returns, yet we have no data to act on as the buffer is empty OR the buffered data doesn't meet our required read len value. tl;dr noop - read(fd, len) <-- We continue to try and read until the socket is recognized as dead, which could be a long time. NOTE: The above read()s are python's read() on contents being streamed. Lower level reads exist, but brains will hurt if we try to cover the dynamics at that level. As such, we need to keep an eye on the last time we received a packet, and use that to decide whether we have timed out. Requests periodically yields back even when no data has been received, in order to allow the caller to wall clock the progress/status and take appropriate action. When we exceed the timeout time value with our wall clock, we will fail the download. Change-Id: I7214fc9dbd903789c9e39ee809f05454aeb5a240	2020-06-23 13:25:09 -07:00
Kaifeng Wang	61c95554ff	Adds poll mode deployment support Adds a new poll extension to provide get_hardware_info and get_node_info interfaces. get_hardware_info will be used for node validation by ironic deploy drivers. get_node_info will be used for sending lookup data to IPA. Standalone mode was assumed to be debug only, which is no longer the case once poll mode is introduced, so this slightly updates the description and also prevents the mdns lookup when standalone is true. Story: 1526486 Task: 28724 Change-Id: I5ad772a18cc4584585c5a7b6fb127547cece1998	2020-06-21 16:44:00 +08:00
Zuul	46bf7e0ef4	Merge "Add a deploy step for writing an image"	2020-06-20 00:00:10 +00:00
Dmitry Tantsur	6d7ec350ff	Make get_partition_uuids work with whole disk images We used to populate root UUID inside the message formatting function, move it to actual prepare_image/cache_image calls. Change-Id: Ifb22220dfd49633e8623dd76f7a6a128f5874b78	2020-06-17 14:38:58 +02:00
Zuul	d7cf7bd341	Merge "New extension call to return partition UUIDs"	2020-06-09 12:31:55 +00:00
Dmitry Tantsur	7e5fe1121e	Make the install_bootloader command asynchronous It does not return anything, so there is no point in it being synchronous. Ironic always calls it with wait=True, so there is no problem with backward compatibility either. Change-Id: I44fec2e0cb54486328ce71263613d8592e384870	2020-06-08 15:10:05 +02:00
Dmitry Tantsur	9d4cf5532f	Add a deploy step for writing an image The new step just invokes the appropriate method of the standby extension. Change-Id: Ic74f83ab2b7e58f8e4b46e0abfab79e221afeb3e Story: 2006963	2020-06-02 15:23:54 +02:00
Dmitry Tantsur	6c1545b75b	New extension call to return partition UUIDs Currently we parse the success message from the write_image call. This is inconvenient and incompatible with the deploy steps split. Change-Id: I258dc1ff1ad1c9df5cbc26a7825d9e7ef2f3205b Story: #2006963	2020-06-02 15:05:59 +02:00
Dmitry Tantsur	8adb7e1a04	Add timeout and retries when connecting to an image server If the server is stuck for any reason, the download will hang for a potentially long time. Provide a timeout (defaults to 60 seconds) and 2 retries on failure. Change-Id: Ie53519266edd914fdbfa82fe52b4a55151e5ec5f	2020-04-24 10:34:40 +02:00
Dmitry Tantsur	c0502649ba	Add raid.apply_configuration deploy step For compatibility with out-of-band RAID deploy steps, we need to have one apply_configuration step, not a create/delete pair. Change-Id: I55bbed96673c9fa247cafdac9a3ade3a6ff3f38d Story: #2006963	2020-04-20 12:50:14 +02:00
Zuul	b9e320e76f	Merge "Add an ability to run in-band deploy steps"	2020-04-09 09:31:49 +00:00
Arne Wiebalck	66c32784af	Editing follow-up for UEFI Software RAID support This is a follow-up to https://review.opendev.org/#/c/696156/ Change-Id: I0fd2c09045ff07a57374934c35d4a3a8467f5e99 Story: #2006379 Task: #37635	2020-04-06 18:03:25 +02:00
Mark Goddard	1b4ce47921	Add an ability to run in-band deploy steps Mostly adaptation of cleaning methods. Co-Authored-By: Dmitry Tantsur <dtantsur@redhat.com> Change-Id: Ife0502391bbece46d619a20a825dfdb191d5c2b4 Story: 2006963 Task: 37791	2020-04-06 10:24:08 +02:00
Raphael Glon	9343348106	Software RAID: Add UEFI support The proposed changes concern two steps: First, when creating the RAID configuration, have a GPT partition table type (this is not necessary, but more natural with UEFI). Also, leave some space, either for the EFI partitions or the BIOS boot partitions, outside the Software RAID. Secondly, when installing the bootloader, make sure the correct boot partitions are created or relocated. Change-Id: Icf0a76b0de89e7a8494363ec91b2f1afda4faa3b Story: #2006379 Task: #37635	2020-04-02 18:02:19 +02:00
Zuul	68a71513f0	Merge "Bump hacking to 3.0.0"	2020-03-31 12:36:11 +00:00
Riccardo Pittau	a332a19a57	Bump hacking to 3.0.0 Change-Id: I1032ea6a2e9d79aeaecb1458c319cbeb15ac1fff	2020-03-30 12:55:46 +02:00
Julia Kreger	916cd5c8de	Rescan after restarting the md device If an md device is restarted, there is a chance, depending on the OS, that the partition may not be found upon start of the md device. Instead, we should always rescan after re-assembling the raid device. Story: 2007275 Task: 38712 Change-Id: I92bac20812940e04381a54ef2905ef5f6e293813	2020-03-29 14:47:41 +00:00
Julia Kreger	55b011cb1f	Fix GPT partition tables after agent writes contents Fixes errors that were being raised upon restarting the agent directly written out software raid images as the raidset is restarted for device consistency and partition updates later on in the code path of deployment. Story: 2007455 Task: 39187 Change-Id: I9abf51eb77b262932e70329af5ce1593106a3171	2020-03-29 07:45:25 -07:00
Julia Kreger	bf0bb7a87a	Improve debug logging around Raid/Bootloader Change-Id: I7d34b918a859972a2d5650494824d3333016dd11	2020-03-28 08:55:32 -07:00

217 Commits