Collects PCI class, revision, and bus information for the pci-devices
collector. Together with the vendor ID and device ID, these fields can
be used to construct device information similar to lspci output, which
is how the cyborg agent collects accelerator devices. Accelerator-device
based scheduling becomes possible once ironic has this information in
place.
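As a rough illustration of the fields involved, the sketch below reads
them from sysfs. The helper name, the root parameter, and the dictionary
keys are hypothetical and are not taken from the collector itself; only
the per-device sysfs attribute files (vendor, device, class, revision)
are standard Linux.

```python
import os


def list_pci_devices(root="/sys/bus/pci/devices"):
    """Return one dict of PCI attributes per device (hypothetical helper)."""
    devices = []
    for address in sorted(os.listdir(root)):
        # The directory name is the device's bus address.
        entry = {"bus": address}
        for attr in ("vendor", "device", "class", "revision"):
            path = os.path.join(root, address, attr)
            try:
                with open(path) as f:
                    # sysfs values are hex strings such as "0x8086\n".
                    entry[attr] = f.read().strip()
            except OSError:
                # Attribute missing or unreadable on this device.
                entry[attr] = None
        devices.append(entry)
    return devices
```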
Change-Id: I6c37c554f37dd5f1d21c8fd4fad2a4f44a3c75d7
Story: 2007971
Task: 40474
Eventlet, when monkey patching occurs, replaces the base
DNS resolver methods. This can lead to compatibility issues
and unexpected exceptions being raised during the process
of monkey patching, for example when there are no resolvers.
We don't really need monkey patching of DNS, and setting the
flag should make the inspector CI jobs happier, since they
neither need nor use DNS. In addition, tinycore may not be
setting a resolver configuration at all, which is the root of
the failure upon monkey patching that causes IPA to fail on
start in certain circumstances.
As a note, this has been performed on other projects due to
bugs. See Id9fe265d67f6e9ea5090bebcacae4a7a9150c5c2.
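A minimal sketch of the idea, assuming the flag in question is
eventlet's EVENTLET_NO_GREENDNS environment variable. The function name
is hypothetical, and where the variable is actually set in the tree is
not shown here.

```python
import os

# The variable must be set before eventlet is imported; eventlet reads
# it at import time to decide whether to replace the DNS resolver.
os.environ["EVENTLET_NO_GREENDNS"] = "yes"


def monkey_patch_without_dns():
    """Monkey patch with eventlet, leaving DNS resolution untouched."""
    try:
        import eventlet
    except ImportError:
        # eventlet is not installed; nothing to patch.
        return False
    eventlet.monkey_patch()
    return True
```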
Change-Id: Ib8f7b844b1bfffff16f88ebbb6ef5ddbe61d5a30
Story: 2007936
Task: 40394
When no root_device hint is set, an MDRAID partition can be incorrectly
selected as the root device which causes installation of the bootloader
to the physical disks behind the MDRAID volume to fail. See the notes
in the referenced Story for more detail.
This change adds a little more specificity to the listing of block
devices.
Change-Id: I66db457e71a0586723ee753bef961aec5bf58827
Story: 2007905
Task: 40303
When we added software RAID support, we started calling bootloader
installation. As time went on, we enhanced that code path for
non-RAID cases in order to ensure that UEFI NVRAM was set up
for the instance to boot properly.
Somewhere in this process, we missed a possible failure case where
the iscsi client tgtadm may return failures. Obviously, the correct
path is to not call iscsi teardown if we don't need to.
Since it was always a semi-opportunistic teardown, we can't blindly
catch any error, and if we started iSCSI and failed to tear the
connection down, we might still want to fail. This change therefore
moves the logic over to use a flag on the agent object: one
extension sets the flag, and the other reads it and takes
action based upon that.
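The flag hand-off described above can be sketched as follows. All class,
attribute, and method names here are hypothetical stand-ins and do not
reflect the actual extension code.

```python
class Agent:
    """Stand-in for the agent object shared by extensions."""

    def __init__(self):
        self.iscsi_started = False


class IscsiExtension:
    def __init__(self, agent):
        self.agent = agent

    def start_iscsi_target(self):
        # ... the iSCSI target would be started here ...
        self.agent.iscsi_started = True


class ImageExtension:
    def __init__(self, agent, teardown):
        self.agent = agent
        self._teardown = teardown

    def install_bootloader(self):
        if self.agent.iscsi_started:
            # Only tear down when iSCSI was really started. Errors are
            # deliberately not swallowed, so a failed teardown still fails.
            self._teardown()
            self.agent.iscsi_started = False
        # ... bootloader installation would happen here ...
        return "installed"
```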
Change-Id: Id3b1ae5e59282f4109f6246d5614d44c93aefa7c
Story: 2007937
Task: 40395
delete_configuration still fetches all devices as it needs to clean
ones with broken RAID.
Story: #2007907
Task: #40307
Change-Id: I4b0be2b0755108490f9cd3c4f3b71a5e036761a1
Shuffle some functions around and reduce the size of
_is_bootloader_loaded by moving logic out to a new function.
Change-Id: I9c10bf05186dcebb37f175d61bf4ac9ff86b6510
Caches hardware information collected during inspection
so that the initial lookup can occur without any delay.
Also adds logging to track how long inventory collection takes.
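A minimal sketch of caching plus timing, assuming a module-level cache
and a caller-supplied collect callable. Both names are hypothetical and
only illustrate the approach.

```python
import logging
import time

LOG = logging.getLogger(__name__)
_CACHED_INVENTORY = None


def get_inventory(collect, use_cache=True):
    """Return the cached inventory, collecting it on first use."""
    global _CACHED_INVENTORY
    if use_cache and _CACHED_INVENTORY is not None:
        return _CACHED_INVENTORY
    start = time.monotonic()
    _CACHED_INVENTORY = collect()
    # Log how long the (potentially slow) collection took.
    LOG.info("Inventory collected in %.2f seconds",
             time.monotonic() - start)
    return _CACHED_INVENTORY
```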
Co-Authored-By: Dmitry Tantsur <dtantsur@protonmail.com>
Change-Id: I3e0d237d37219e783d81913fa6cc490492b3f96a
In order to ensure grub2 finds all files it needs, mount all
vfat partitions specified in the deployed image.
Story: #2007618
Task: #39629
Change-Id: Ie5b6e0abc3f266409562f9ecb26538126b667056
Follow-up to commit c5b97eb781cf9851f9abe87a1500b4da55b8bde8.
Two things slipped through the cracks:
* ImageDownloadError was instantiated incorrectly, resulting in a wrong
error message. This was uncovered by using assertRaisesRegex in tests.
* We allowed calling write(None). This was uncovered by avoiding sleep(4)
in tests and enabling more failed calls before timeout.
Change-Id: If5e798c5461ea3e474a153574b0db2da96f2dfa8
We log them as completed when they start executing.
Also fix a problem in remove_large_keys that prevented items
with defaultdict from being logged.
Change-Id: I34a06cc85f55c693416f8c4c9877d55d6affafc9
The download retry interval was previously five seconds, which is
not long enough to recover after a hard network connectivity break
where we may be reliant upon network port forwarding hold-down
timers or even routing protocol route propagation to recover
communication.
Previously the time value was 5 seconds, with 3 attempts, meaning
15 seconds total ignoring the error detection timeouts.
Now it is 10 seconds, with 10 attempts, meaning 100 seconds before
the error detection timeouts.
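The arithmetic above, as a trivial check. The helper name is
hypothetical, and the figures ignore the error detection timeouts, as
the text does.

```python
def total_retry_wait(interval_seconds, attempts):
    """Total time spent waiting between attempts, in seconds."""
    return interval_seconds * attempts


old_total = total_retry_wait(5, 3)     # previous settings
new_total = total_retry_wait(10, 10)   # new settings
```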
Change-Id: I6d11edc9a3156f2bdc21c3d432ecc7625d652699
Instead of just retrying the step that gets the connection and
handler for the download, let's retry the whole action of
downloading.
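One way to picture this is a retry loop that wraps the entire download
callable rather than only the connection setup. The function name, its
parameters, and the choice of OSError as the retryable error are
assumptions made for illustration.

```python
import time


def download_with_retries(download, attempts=3, interval=5,
                          sleep=time.sleep):
    """Call download() until it succeeds or attempts are exhausted."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            # The whole download, not just the connection, is retried.
            return download()
        except OSError as exc:
            last_error = exc
            if attempt < attempts:
                sleep(interval)
    raise last_error
```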
Change-Id: I9217792d32e6f33c70f146a9b7d3ef58c5644d8a
Socket read operations can be blocking and may not timeout as
expected when thinking of timeouts at the beginning of a
socket request. This can occur when streaming file contents
down to the agent and there is a hard connectivity break.
In other words, we could be in a situation like:
- read(fd, len) - Gets data
- Select returns context to the program, we do things with data.
** hard connectivity break for next 90 seconds**
- read(fd, len) - We drain the in-memory buffer side of the socket.
- Select returns context, we do things with our remaining data
** Server retransmits **
** Server times out due to no ack **
** Server closes socket and issues a FIN,RST packet to the client **
** Connectivity restored, Client never got FIN,RST **
** Client socket still waiting for more data **
- read(fd, len) - No data returned
- Select returns, yet we have no data to act on as the buffer is
empty OR the buffered data doesn't meet our required read len value.
tl;dr noop
- read(fd, len) <-- We continue to try and read until the socket is
recognized as dead, which could be a long time.
NOTE: The above read()s are python's read() on the contents being
streamed. Lower level reads exist, but brains will hurt
if we try to cover the dynamics at that level.
As such, we need to keep track of the last time we received
a packet and use that to decide whether we have timed out.
Requests periodically yields back even when no data
has been received, which allows the caller to wall
clock the progress/status and take appropriate action.
When we exceed the timeout time value with our wall clock,
we will fail the download.
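The wall clock check described above can be sketched as follows,
assuming the stream is an iterable that yields an empty chunk when it
hands control back without data. The exception and function names are
hypothetical.

```python
import time


class DownloadTimeout(Exception):
    """Raised when no data has arrived within the timeout."""


def read_stream(chunks, timeout, clock=time.monotonic):
    """Collect chunks, failing if data stops for longer than timeout."""
    last_data = clock()
    received = []
    for chunk in chunks:
        now = clock()
        if chunk:
            # Data arrived, so reset the wall clock reference.
            last_data = now
            received.append(chunk)
        elif now - last_data > timeout:
            raise DownloadTimeout(
                "no data received for %s seconds" % (now - last_data))
    return b"".join(received)
```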
Change-Id: I7214fc9dbd903789c9e39ee809f05454aeb5a240
Adds a new poll extension to provide get_hardware_info and get_node_info
interfaces.
get_hardware_info will be used for node validation by ironic deploy
drivers.
get_node_info will be used for sending lookup data to IPA.
Standalone mode has been described as debug only, but that is no
longer the case once the poll mode is introduced. This change slightly
updates the description and also prevents the mDNS lookup when
standalone is true.
Story: 1526486
Task: 28724
Change-Id: I5ad772a18cc4584585c5a7b6fb127547cece1998
We used to populate the root UUID inside the message formatting
function; move it to the actual prepare_image/cache_image calls.
Change-Id: Ifb22220dfd49633e8623dd76f7a6a128f5874b78
It does not return anything, so there is no point in it being
synchronous. Ironic always calls it with wait=True, so there is
no problem with backward compatibility either.
Change-Id: I44fec2e0cb54486328ce71263613d8592e384870
Currently we parse the success message from the write_image call.
This is inconvenient and incompatible with the deploy steps split.
Change-Id: I258dc1ff1ad1c9df5cbc26a7825d9e7ef2f3205b
Story: #2006963
Currently, running the ipa-centos8-stable-ussuri image causes 100%
CPU usage while cleaning. The proposed change fixes this behavior and
significantly speeds up cleaning.
Change-Id: I2ba9a69f22b11830d8ff1bc346b17bf1a52f25b0
Story: #2007696
Task: #39809
For some reason the pep8 test started to complain, causing mayhem.
This patch fixes the issues and refactors the dmi_inspector
tests, moving pure data to a separate file.
Change-Id: Ia244a496acd80abad679f8ae9832d4f0471500e7
The issue with JSON output in lshw was fixed in version B.02.19.
This patch makes the memory calculation compatible with that
version and later versions that are included in recent distributions
(e.g. Ubuntu 20.04, Fedora 31).
Change-Id: Id5a30028b139c51cae6232cac73a50b917fea233
Story: 2007588
Task: 39527
If the server is stuck for any reason, the download will hang for
a potentially long time. Provide a timeout (defaults to 60 seconds)
and 2 retries on failure.
Change-Id: Ie53519266edd914fdbfa82fe52b4a55151e5ec5f
This function checks for /sys/firmware/efi. Some tests do not mock
isdir, so they fail on UEFI machines.
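A minimal sketch of the kind of mocking involved. boot_mode is a
hypothetical stand-in for the function under test; only the
/sys/firmware/efi path comes from the text above.

```python
import os
import unittest
from unittest import mock


def boot_mode():
    """Hypothetical stand-in: report the boot mode of this machine."""
    return "uefi" if os.path.isdir("/sys/firmware/efi") else "bios"


class BootModeTest(unittest.TestCase):

    # Patching isdir makes the result independent of the test host.
    @mock.patch("os.path.isdir", autospec=True, return_value=False)
    def test_bios(self, mock_isdir):
        self.assertEqual("bios", boot_mode())
        mock_isdir.assert_called_once_with("/sys/firmware/efi")

    @mock.patch("os.path.isdir", autospec=True, return_value=True)
    def test_uefi(self, mock_isdir):
        self.assertEqual("uefi", boot_mode())
```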
Change-Id: I088218ddb88717ac07669d0b97c6cd50208ede8c