The download retry interval was previously five seconds, which is
not long enough to recover after a hard network connectivity break
where we may be reliant upon network port forwarding hold-down
timers or even routing protocol route propagation to recover
communication.
Previously the time value was 5 seconds, with 3 attempts, meaning
15 seconds total ignoring the error detection timeouts.
Now it is 10 seconds, with 10 attempts, meaning 100 seconds before
the error detection timeouts.
Change-Id: I6d11edc9a3156f2bdc21c3d432ecc7625d652699
Instead of just trying to get the connection and handler
for the download, let's retry the whole action of
downloading.
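
A minimal sketch of retrying the whole download action rather than only
the connection setup. The names download_with_retries and DownloadError
are hypothetical and are not taken from the agent's code; the defaults
simply mirror the 10 attempts and 10 second interval mentioned earlier
in this log.

```python
import time


class DownloadError(Exception):
    """Hypothetical error raised when a download attempt fails."""


def download_with_retries(download, attempts=10, interval=10,
                          sleep=time.sleep):
    """Call download() until it succeeds or the attempts run out.

    The callable covers the whole action (connect, stream, verify),
    so a failure at any stage triggers a fresh attempt.
    """
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return download()
        except DownloadError as exc:
            last_error = exc
            # Do not sleep after the final failed attempt.
            if attempt < attempts:
                sleep(interval)
    raise last_error
```

Passing sleep as a parameter is only a convenience so the loop can be
exercised in tests without real delays.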
Change-Id: I9217792d32e6f33c70f146a9b7d3ef58c5644d8a
Socket read operations can be blocking and may not timeout as
expected when thinking of timeouts at the beginning of a
socket request. This can occur when streaming file contents
down to the agent and there is a hard connectivity break.
In other words, we could be in a situation like:
- read(fd, len) - Gets data
- Select returns context to the program, we do things with data.
** hard connectivity break for next 90 seconds**
- read(fd, len) - We drain the in-memory buffer side of the socket.
- Select returns context, we do things with our remaining data
** Server retransmits **
** Server times out due to no ack **
** Server closes socket and issues a FIN,RST packet to the client **
** Connectivity restored, Client never got FIN,RST **
** Client socket still waiting for more data **
- read(fd, len) - No data returned
- Select returns, yet we have no data to act on as the buffer is
empty OR the buffered data doesn't meet our required read len value.
tl;dr noop
- read(fd, len) <-- We continue to try and read until the socket is
recognized as dead, which could be a long time.
NOTE: The above read()s are python's read() on contents being
streamed. Lower level reads exist, but brains will hurt
if we try to cover the dynamics at that level.
As such, we need to keep track of the last time we
received a packet, and use that to decide whether we have
timed out. Requests periodically yields back even when no data
has been received, in order to allow the caller to wall
clock the progress/status and take appropriate action.
When we exceed the timeout time value with our wall clock,
we will fail the download.
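
The wall clock check described above could look roughly like the
following sketch. It is not the agent's implementation: the names
read_with_stall_timeout and DownloadStalled are hypothetical, and the
idle periods where the stream yields back without data are modelled
here as empty chunks, which is an assumption.

```python
import time


class DownloadStalled(Exception):
    """Hypothetical error for a stream that stopped delivering data."""


def read_with_stall_timeout(chunks, timeout=60, clock=time.monotonic):
    """Yield non-empty chunks; fail if no data arrives for `timeout` s.

    `chunks` is any iterable that may yield empty chunks while idle.
    """
    last_data_time = clock()
    for chunk in chunks:
        if chunk:
            # Data arrived, so reset the wall clock reference.
            last_data_time = clock()
            yield chunk
        elif clock() - last_data_time > timeout:
            raise DownloadStalled(
                "no data received for more than %s seconds" % timeout)
```

The clock is injectable so the timeout path can be tested without
waiting; time.monotonic is used by default because it is not affected
by system clock adjustments.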
Change-Id: I7214fc9dbd903789c9e39ee809f05454aeb5a240
We used to populate the root UUID inside the message formatting
function; move it to the actual prepare_image/cache_image calls.
Change-Id: Ifb22220dfd49633e8623dd76f7a6a128f5874b78
It does not return anything, so there is no point in it being
synchronous. Ironic always calls it with wait=True, so there is
no problem with backward compatibility either.
Change-Id: I44fec2e0cb54486328ce71263613d8592e384870
Currently we parse the success message from the write_image call.
This is inconvenient and incompatible with the deploy steps split.
Change-Id: I258dc1ff1ad1c9df5cbc26a7825d9e7ef2f3205b
Story: #2006963
Currently, running the ipa-centos8-stable-ussuri image causes 100%
CPU usage while cleaning. The proposed change fixes this behavior and
significantly speeds up cleaning.
Change-Id: I2ba9a69f22b11830d8ff1bc346b17bf1a52f25b0
Story: #2007696
Task: #39809
For some reason the pep8 test started to complain, causing mayhem.
This patch fixes the issues and refactors the dmi_inspector
tests, moving pure data to a separate file.
Change-Id: Ia244a496acd80abad679f8ae9832d4f0471500e7
The issue with JSON output in lshw was fixed in version B.02.19.
This patch makes the memory calculation compatible with that
version and later versions that are included in recent distributions
(e.g. Ubuntu 20.04, Fedora 31).
Change-Id: Id5a30028b139c51cae6232cac73a50b917fea233
Story: 2007588
Task: 39527
If the server is stuck for any reason, the download will hang for
a potentially long time. Provide a timeout (defaults to 60 seconds)
and 2 retries on failure.
Change-Id: Ie53519266edd914fdbfa82fe52b4a55151e5ec5f
This function checks for /sys/firmware/efi. Some tests do not mock
isdir, so they fail on UEFI machines.
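
A sketch of how a test can mock isdir so the result does not depend on
the machine running it. The function is_uefi_booted and the helper
run_with_mocked_isdir are hypothetical stand-ins, not the agent's code;
only the /sys/firmware/efi path comes from the message above.

```python
import os
from unittest import mock


def is_uefi_booted():
    """Hypothetical stand-in for the function described above."""
    return os.path.isdir("/sys/firmware/efi")


def run_with_mocked_isdir(is_dir):
    """Call is_uefi_booted() with os.path.isdir replaced by a mock."""
    with mock.patch.object(os.path, "isdir",
                           return_value=is_dir) as fake_isdir:
        result = is_uefi_booted()
    # The real filesystem is never consulted; only the mock is called.
    fake_isdir.assert_called_once_with("/sys/firmware/efi")
    return result
```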
Change-Id: I088218ddb88717ac07669d0b97c6cd50208ede8c
For compatibility with out-of-band RAID deploy steps, we need to have
one apply_configuration step, not a create/delete pair.
Change-Id: I55bbed96673c9fa247cafdac9a3ade3a6ff3f38d
Story: #2006963
Now that we no longer support py27, we can use the standard library
unittest.mock module instead of the third party mock lib.
Change-Id: I5fdb2a02ee83c692d46cbe28266fcae033bec6f6
Signed-off-by: Sean McGinnis <sean.mcginnis@gmail.com>
DIB builds instance images with EFI partitions that only have the boot
flag, but not esp. According to parted documentation, boot is an alias
for esp on GPT, so accept it as well.
To avoid complexities when parsing parted output, the implementation
is switched to existing utils and ironic-lib functions.
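
As an illustration only, accepting either flag could be expressed as
below. This is a hypothetical helper, not the utils or ironic-lib
functions the change actually uses, and it assumes the flags arrive as
a comma-separated string and that the partition table is GPT.

```python
def is_efi_system_partition(flags):
    """Return True if the flags mark an EFI system partition.

    `flags` is a comma-separated string such as "boot, esp".
    On GPT, boot is treated as an alias for esp.
    """
    names = {flag.strip() for flag in flags.split(",")}
    return bool(names & {"boot", "esp"})
```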
Change-Id: I5f57535e5a89528c38d0879177b59db6c0f5c06e
Story: #2007455
Task: #39423
Currently we fail with HTTP 401 if both the known and the received
tokens are None. This prevents IPA from being updated before ironic.
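
A sketch of a token check that tolerates the case where neither side
has a token. The function name token_is_valid is hypothetical, and the
handling of the cases where only one token is present is an assumption;
the message above only describes the case where both are None.

```python
import hmac


def token_is_valid(known_token, received_token):
    """Return True if the request should be accepted."""
    if known_token is None and received_token is None:
        # No token on either side, so there is nothing to enforce.
        return True
    if known_token is None or received_token is None:
        # Assumption: a token on only one side is a mismatch.
        return False
    # Constant-time comparison avoids leaking how much matched.
    return hmac.compare_digest(known_token.encode("utf-8"),
                               received_token.encode("utf-8"))
```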
Story: #2007557
Task: #39419
Change-Id: I80249bd3468b581dc035d72156cbfa2f5f225a1b
The logic to determine the version when getting the ironic version
header is not influenced by the version parameter passed to the
function.
Change-Id: Ie52a82bf71a2277cea11fd2dedfd9c1e0001d95f
The proposed changes concern two steps:
First, when creating the RAID configuration, have a GPT partition
table type (this is not necessary, but more natural with UEFI).
Also, leave some space, either for the EFI partitions or the BIOS
boot partitions, outside the Software RAID.
Secondly, when installing the bootloader, make sure the correct
boot partitions are created or relocated.
Change-Id: Icf0a76b0de89e7a8494363ec91b2f1afda4faa3b
Story: #2006379
Task: #37635
Now that an operator can pick the devices that participate in RAID,
it no longer makes sense to verify all devices.
Change-Id: Id5d8d539183f0db4ba3c4132ce6bc9919f9cd1ea
Story: #2006369
Adds jitter and backoff behavior to the inspector data
collection command to prevent thundering herd sorts of
issues.
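
One common way to combine backoff with jitter is sketched below. The
message does not say which scheme the change uses, so the exponential
growth, the "full jitter" choice, the parameter names and the defaults
are all assumptions made for illustration.

```python
import random


def backoff_delays(attempts, base=1.0, factor=2.0, max_delay=60.0,
                   rng=random.random):
    """Yield one delay per attempt: exponential backoff with jitter."""
    for attempt in range(attempts):
        # The ceiling grows exponentially but is capped at max_delay.
        ceiling = min(max_delay, base * factor ** attempt)
        # Full jitter: pick a random point between 0 and the ceiling,
        # so many agents started together spread their requests out.
        yield ceiling * rng()
```

A caller would sleep for each yielded delay before the next attempt.
The rng parameter exists only so the sequence can be made
deterministic in tests.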
Change-Id: I00517010991cbe43d5958c7d76019ef6fe89c983
If an md device is restarted, there is a chance, depending
on the OS, that the partition may not be found upon start
of the md device.
Instead, we should always rescan after re-assembling the raid
device.
Story: 2007275
Task: 38712
Change-Id: I92bac20812940e04381a54ef2905ef5f6e293813
Fixes errors that were being raised upon restarting the agent
directly written out software raid images as the raidset is
restarted for device consistency and partition updates later
on in the code path of deployment.
Story: 2007455
Task: 39187
Change-Id: I9abf51eb77b262932e70329af5ce1593106a3171