The current way of prioritizing ID/DM_SERIAL_SHORT or ID/DM_SERIAL works
in most cases but the udev values seem to be unreliable.
Based on experience it looks like lsblk might be a better
source of truth than udev in regerards to serial number
information. This commit makes lsblk the default provider
of block device serial number information.
Story: 2010263
Task: 46161
Change-Id: I16039b46676f1a61b32ee7ca7e6d526e65829113
Extend the ability to skip disks to RAID devices
This allows users to specify the volume name of
a logical device in the skip list which is then not cleaned
or created again during the create/apply configuration phase
The volume name can be specified in target raid config provided
the change https://review.opendev.org/c/openstack/ironic-python-agent/+/853182/
passes
Story: 2010233
Change-Id: Ib9290a97519bc48e585e1bafb0b60cc14e621e0f
Use 'volume_name' field from 'target_raid_config' to create logical
disks if it is present
Do not allow two logical disks to have the same volume name
Change-Id: If3e4e9f8698ec3e0cb49717f8ed2087d2ba03f2c
In the event a device name is set to contain a raid device path,
it is possible for the Name and Events field values of mdadm's
detailed output to contain text which inadvertently gets captured and
mapped as component data for the "holder" devices of the RAID set.
This would cause invalid values to get passed to UEFI methods
which would cause a deployment to fail under these circumstances.
We now ignore the Name and Events fields in mdadm output.
Change-Id: If721dfe1caa5915326482969e55fbf4697538231
Fix minor issues suggested by dtantsur
Add an example of skip list specification to the documentation
A follow-up patch to I3bdad3cca8acb3e0a69ebb218216e8c8419e9d65
Change-Id: Ic94a33b7bc0572a1cc8f92b330474ec63a173e81
Introduce a field skip_block_devices in properties - this is a list of dictionaries
Create a helper function list_block_devices_check_skip_list
Update tests of erase_devices_express to use node when calling _list_erasable_devices
Add tests covering various options of the skip list definition
Use the helper function in get_os_install_device when node is cached
Story: 2009914
Change-Id: I3bdad3cca8acb3e0a69ebb218216e8c8419e9d65
The 5 lines of code were extracted from erase_devices_metadata to _list_erasable_devices, but now are duplicated in both functions.
The variable block_devices is not used in erase_devices_metadata.
Change-Id: I89f56c69d90fb0eb61907d6667266fbd57d333af
Certain filesystems are sometimes used in specialty computing
environments where a shared storage infrastructure or fabric exists.
These filesystems allow for multi-host shared concurrent read/write
access to the underlying block device by *not* locking the entire
device for exclusive use. Generally ranges of the disk are reserved
for each interacting node to write to, and locking schemes are used
to prevent collissions.
These filesystems are common for use cases where high availability
is required or ability for individual computers to collaborate on a
given workload is critical, such as a group of hypervisors supporting
virtual machines because it can allow for nearly seamless transfer
of workload from one machine to another.
Similar technologies are also used for cluster quorum and cluster
durable state sharing, however that is not specifically considered
in scope.
Where things get difficult is becuase the entire device is not
exclusively locked with the storage fabrics, and in some cases locking
is handled by a Distributed Lock Manager on the network, or via special
sector interactions amongst the cluster members which understand
and support the filesystem.
As a reult of this IO/Interaction model, an Ironic-Python-Agent
performing cleaning can effectively destroy the cluster just by
attempting to clean storage which it percieves as attached locally.
This is not IPA's fault, often this case occurs when a Storage
Administrator forgot to update LUN masking or volume settings on
a SAN as it relates to an individual host in the overall
computing environment. The net result of one node cleaning the
shared volume may include restoration from snapshot, backup
storage, or may ultimately cause permenant data loss, depending
on the environment and the usage of that environment.
Included in this patch:
- IBM GPFS - Can be used on a shared block device... apparently according
to IBM's documentation. The standard use of GPFS is more Ceph
like in design... however GPFS is also a specially licensed
commercial offering, so it is a red flag if this is
encountered, and should be investigated by the environment's
systems operator.
- Red Hat GFS2 - Is used with shared common block devices in clusters.
- VMware VMFS - Is used with shared SAN block devices, as well as
local block devices. With shared block devices,
ranges of the disk are locked instead of the whole
disk, and the ranges are mapped to virtual machine
disk interfaces.
It is unknown, due to lack of information, if this
will detect and prevent erasure of VMFS logical
extent volumes.
Co-Authored-by: Jay Faulkner <jay@jvf.cc>
Change-Id: Ic8cade008577516e696893fdbdabf70999c06a5b
Story: 2009978
Task: 44985
Currently, if smartctl is not found by IPA, it will silently skip ATA
secure erase and proceed to shred (if enabled). This is supposedly for
backwards compatibility, but is quite hard to diagnose.
This change adds a warning message to make it more obvious what is
happening.
TrivialFix
Change-Id: I03a381e99de79f201ec7e9a388777c3d48457e93
UDev prefix is DM_ not ID_ for them. On top of that, they don't have
short serials (or at least don't always have).
Change-Id: I5b6075fbff72201a2fd620f789978acceafc417b
The lsblk output is available in json format since version 2.27 of
util-linux [1]
https: //mirrors.edge.kernel.org/pub/linux/utils/util-linux/v2.27/v2.27-ReleaseNotes
Change-Id: I0c5812736b7a320cc4ecc333f80db70eb78cc76d
Removes multipath base devices from consideration by
default, and instead allows the device-mapper device
managed by multipath to be picked up and utilized
instead.
In effect, allowing us to ignore standby paths *and*
leverage multiple concurrent IO paths if so offered
via ALUA.
In reality, anyone who has previously built IPA with
multipath tooling might not have encountered issues
previously because they used Active/Active SAN storage
environments. They would have worked because the IO lock
would have been exchanged between controllers and paths.
However, Active/Passive environments will block passive
paths from access, ultimately preventing new locks from
being established without proper negotiation. Ultimately
requiring multipathing *and* the agent to be smart enough
to know to disqualify underlying paths to backend storage
volumes.
An additional benefit of this is active/active MPIO devices
will, as long as ``multipath`` is present inside the ramdisk,
no longer possibly result in duplicate IO wipes occuring
accross numerous devices.
Story: #2010003
Task: #45108
Resolves: rhbz#2076622
Resolves: rhbz#2070519
Change-Id: I0fd6356f036d5ff17510fb838eaf418164cdfc92
https://storyboard.openstack.org/#!/story/2008290 added support
for NVMe-native storage cleaning, greatly improving storage clean
times on NVMe-based nodes as well as reducing device wear.
This is a follow up change which aims to make further improvements
to cleaning efficiency in mixed NVMe-HDD environments. This is
achieved by combining NVMe-native cleaning methods on NVMe devices
with traditional metadata clean on non-NVMe devices.
Story: 2009264
Task: 43498
Change-Id: I445d8f4aaa6cd191d2e540032aed3148fdbff341
Depending on the how the stars align with partition images
being written to a remote system, we *may* end up with
*either* a Partition UUID value, or a Partition's UUID value.
Which are distinctly different.
This is becasue the value, when collected as a result of writing
an image to disk *falls* back and passes the value to enable
partition discovery and matching.
Later on, when we realized we ought to create an fstab entry,
we blindly re-used the value thinking it was, indeed, always
a Partition's UUID and not the Partition UUID. Obviously,
the label type is quite explicit, either UUID or PARTUUID
respectively, when initial ramdisk utilities such as dracut
are searching and mounting filesystems.
Adds capability to identify the correct label to utilize
based upon the current state of the block devices on disk.
Granted, we are likely only exposed to this because of IO
race conditions under high concurrecy load operations.
Normally this would only be seen on test VMs, but
systems being backed by a Storage Area Network *can*
exibit the same IO race conditions as virtual machines.
Change-Id: I953c936cbf8fad889108cbf4e50b1a15f511b38c
Resolves: rhbz#2058717
Story: #2009881
Task: 44623
Move the software RAID code path from grub2-install to
efibootmgr:
- remove the UEFI efibootmgr exception for software RAID
- create and populate the ESPs on the holder disks
- update the NVRAM with all ESPs (the component devices
of the ESP mirror, use unique labels to avoid unintentional
deduplication of entries in the NVRAM)
Story: #2009794
Change-Id: I7ed34e595215194a589c2f1cd0b39ff0336da8f1
Replace the execute wrapper from utils with execute from ironic-lib in
hardware.py
Adjust unit tests as needed.
Change-Id: I63a3b0407b2ca2246bd0e6624bfa0f748c0d73f7
We use basically the same function in two modules in the same way, let's
put that in a common place.
Change-Id: I4016e43f2cb102d4327bafcc8a2f90112a6f944a
Use add instead of update to re-read the partition table with partx.
See [1] for more details.
Co-authored-by: Arne Wiebalck <arne.wiebalck@cern.ch>
[1] https: //opendev.org/openstack/ironic-python-agent/commit/dc8c1f16f9a00e2bff21612d1a9cf0ea0f3addf0
Change-Id: I2336e22dadc790cfbde87904612fcaa3b8c501db
This patch fixes a race during software RAID creation:
we create the partition with parted, the kernel then
notifies udev, but we need to wait for udevd to create
the device files before calling mdadm to create the
md device.
Credits to jcosmao for finding this.
Change-Id: I642f28acc351cf50263e37dfbc8468bf59de2cc5
One debug message only specified "Skipping" without any details.
Another did not log the whole line from lsblk. Fix both.
Change-Id: I9f8f4edad88ba2df5abc6a45a74ebdb3c7afcf97
This means we do not have to rely on modprobe idempotency as
much and it's less code duplication, which is always nice.
Signed-off-by: Jonas Schäfer <jonas.schaefer@cloudandheat.com>
Change-Id: I996aba47bc54309e15e7d56e4a96b23b8deb5c9c
This exposes the MAC address of the first LAN channel with an assigned
IP address in the inventory data. This is useful for inventory
processes where the asset number is not discoverable from the software
side: the BMC MAC is going to be unique (at least within an
organization).
Change-Id: I8a4bee0c25743befd7f2033e4e0cba26895c8926
Add a clean step for network burn-in via fio. Get basic
run parameters from the node's driver_info.
Story: #2007523
Task: #42385
Change-Id: I2861696740b2de9ec38f7e9fc2c5e448c009d0bf
Add a clean step for disk burn-in via fio. Get basic
run parameters from the node's driver_info.
Story: #2007523
Task: #42384
Change-Id: I5f5e336bd629846b3d779fd0fc7a2060b385b035
Add a clean step for memory burn-in via stress-ng. Get basic
run parameters from the node's driver_info.
Story: #2007523
Task: #42383
Change-Id: I33a83968c9f87cf795ec7ec922bce98b52c5181c
Add a clean step for CPU burn-in via stress-ng. Get basic
run parameters from the node's driver_info.
Story: #2007523
Task: #42382
Change-Id: I14fd4164991fb94263757244f716b6bfe8edf875
Due to a regression in lshw introduced by
https://github.com/lyonel/lshw/pull/60, there are some versions in the
wild that do not return sizes for memory banks <32GiB. In those cases,
work around the problem by looking at the top-level size (if available)
to find the total size. Previously we assumed that we only needed the
top-level size when there was no list of memory banks.
The issue is fixed upstream by https://github.com/lyonel/lshw/pull/65,
but the erroneous patch is still present in the lshw-B.02.19.2-5.el8
package in CentOS 8.4 and 8.5.
Change-Id: I6eb5981d28b9ae368239af0c1d0ec32ff79d95b3
Story: #2008865
Task: 42395
Currently if one interface cannot be handled (e.g. it has empty MAC),
the whole collection fails. Ignore unsupported interfaces instead.
Change-Id: Ibdaad62b39c239d4f3fb3111c2fae9e31e877b28
For some drives, the partition e.g. `/dev/sda1` will not have the
'ro' file which can result in a metadata erasure failure but the base
device (`/dev/sda`) will have this file. Add an additional check
for the base device.
Change-Id: Ia01bdbf82cee6ce15fabdc42f9c23036df55b4c5
Story: 2008696
Task: 42004