293 Commits

Author SHA1 Message Date
Julia Kreger
beb7484858 Guard shared device/cluster filesystems
Certain filesystems are sometimes used in specialty computing
environments where a shared storage infrastructure or fabric exists.
These filesystems allow for multi-host shared concurrent read/write
access to the underlying block device by *not* locking the entire
device for exclusive use. Generally ranges of the disk are reserved
for each interacting node to write to, and locking schemes are used
to prevent collissions.

These filesystems are common for use cases where high availability
is required or ability for individual computers to collaborate on a
given workload is critical, such as a group of hypervisors supporting
virtual machines because it can allow for nearly seamless transfer
of workload from one machine to another.

Similar technologies are also used for cluster quorum and cluster
durable state sharing, however that is not specifically considered
in scope.

Where things get difficult is becuase the entire device is not
exclusively locked with the storage fabrics, and in some cases locking
is handled by a Distributed Lock Manager on the network, or via special
sector interactions amongst the cluster members which understand
and support the filesystem.

As a reult of this IO/Interaction model, an Ironic-Python-Agent
performing cleaning can effectively destroy the cluster just by
attempting to clean storage which it percieves as attached locally.
This is not IPA's fault, often this case occurs when a Storage
Administrator forgot to update LUN masking or volume settings on
a SAN as it relates to an individual host in the overall
computing environment. The net result of one node cleaning the
shared volume may include restoration from snapshot, backup
storage, or may ultimately cause permenant data loss, depending
on the environment and the usage of that environment.

Included in this patch:
- IBM GPFS - Can be used on a shared block device... apparently according
             to IBM's documentation. The standard use of GPFS is more Ceph
             like in design... however GPFS is also a specially licensed
             commercial offering, so it is a red flag if this is
             encountered, and should be investigated by the environment's
             systems operator.
- Red Hat GFS2 - Is used with shared common block devices in clusters.
- VMware VMFS - Is used with shared SAN block devices, as well as
                local block devices. With shared block devices,
                ranges of the disk are locked instead of the whole
                disk, and the ranges are mapped to virtual machine
                disk interfaces.
                It is unknown, due to lack of information, if this
                will detect and prevent erasure of VMFS logical
                extent volumes.

Co-Authored-by: Jay Faulkner <jay@jvf.cc>
Change-Id: Ic8cade008577516e696893fdbdabf70999c06a5b
Story: 2009978
Task: 44985
2022-07-19 13:24:03 -07:00
Zuul
ccf4ee31cf Merge "Gather details about bond interfaces if present" 2022-07-02 02:56:46 +00:00
Zuul
a9de7f80cc Merge "Use json for lsblk output" 2022-06-30 23:38:15 +00:00
Zuul
312e1527ab Merge "Warn when smartctl not found" 2022-06-27 12:10:31 +00:00
Mark Goddard
b68fa6b2e1 Warn when smartctl not found
Currently, if smartctl is not found by IPA, it will silently skip ATA
secure erase and proceed to shred (if enabled). This is supposedly for
backwards compatibility, but is quite hard to diagnose.

This change adds a warning message to make it more obvious what is
happening.

TrivialFix

Change-Id: I03a381e99de79f201ec7e9a388777c3d48457e93
2022-06-24 16:58:37 +01:00
Derek Higgins
7e4fe3bf6a Gather details about bond interfaces if present
If present gather information about bonded interfaces.

Story: #2010093
Task: #45637

Change-Id: I394187640b4788ebec21c3391d33ed728fb72ffa
2022-06-21 09:45:03 +01:00
Dmitry Tantsur
69e2254503 Fix discovering WWN/serial for devicemapper devices
UDev prefix is DM_ not ID_ for them. On top of that, they don't have
short serials (or at least don't always have).

Change-Id: I5b6075fbff72201a2fd620f789978acceafc417b
2022-06-14 19:06:53 +02:00
Riccardo Pittau
09ea41c83d Use json for lsblk output
The lsblk output is available in json format since version 2.27 of
util-linux [1]

https: //mirrors.edge.kernel.org/pub/linux/utils/util-linux/v2.27/v2.27-ReleaseNotes

Change-Id: I0c5812736b7a320cc4ecc333f80db70eb78cc76d
2022-06-14 17:50:05 +02:00
Julia Kreger
014d37743a Multipath Hardware path handling
Removes multipath base devices from consideration by
default, and instead allows the device-mapper device
managed by multipath to be picked up and utilized
instead.

In effect, allowing us to ignore standby paths *and*
leverage multiple concurrent IO paths if so offered
via ALUA.

In reality, anyone who has previously built IPA with
multipath tooling might not have encountered issues
previously because they used Active/Active SAN storage
environments. They would have worked because the IO lock
would have been exchanged between controllers and paths.
However, Active/Passive environments will block passive
paths from access, ultimately preventing new locks from
being established without proper negotiation. Ultimately
requiring multipathing *and* the agent to be smart enough
to know to disqualify underlying paths to backend storage
volumes.

An additional benefit of this is active/active MPIO devices
will, as long as ``multipath`` is present inside the ramdisk,
no longer possibly result in duplicate IO wipes occuring
accross numerous devices.

Story: #2010003
Task: #45108
Resolves: rhbz#2076622
Resolves: rhbz#2070519
Change-Id: I0fd6356f036d5ff17510fb838eaf418164cdfc92
2022-05-18 20:26:39 -03:00
Zuul
f08f70134d Merge "Improve efficiency of storage cleaning in mixed media envs" 2022-03-15 18:05:29 +00:00
Jacob Anders
c5f7f18bcb Improve efficiency of storage cleaning in mixed media envs
https://storyboard.openstack.org/#!/story/2008290 added support
for NVMe-native storage cleaning, greatly improving storage clean
times on NVMe-based nodes as well as reducing device wear.

This is a follow up change which aims to make further improvements
to cleaning efficiency in mixed NVMe-HDD environments. This is
achieved by combining NVMe-native cleaning methods on NVMe devices
with traditional metadata clean on non-NVMe devices.

Story: 2009264
Task: 43498
Change-Id: I445d8f4aaa6cd191d2e540032aed3148fdbff341
2022-03-15 19:00:25 +10:00
Julia Kreger
99ca1086db Create fstab entry with appropriate label
Depending on the how the stars align with partition images
being written to a remote system, we *may* end up with
*either* a Partition UUID value, or a Partition's UUID value.

Which are distinctly different.

This is becasue the value, when collected as a result of writing
an image to disk *falls* back and passes the value to enable
partition discovery and matching.

Later on, when we realized we ought to create an fstab entry,
we blindly re-used the value thinking it was, indeed, always
a Partition's UUID and not the Partition UUID. Obviously,
the label type is quite explicit, either UUID or PARTUUID
respectively, when initial ramdisk utilities such as dracut
are searching and mounting filesystems.

Adds capability to identify the correct label to utilize
based upon the current state of the block devices on disk.

Granted, we are likely only exposed to this because of IO
race conditions under high concurrecy load operations.
Normally this would only be seen on test VMs, but
systems being backed by a Storage Area Network *can*
exibit the same IO race conditions as virtual machines.

Change-Id: I953c936cbf8fad889108cbf4e50b1a15f511b38c
Resolves: rhbz#2058717
Story: #2009881
Task: 44623
2022-03-10 07:04:01 -08:00
Arne Wiebalck
62c5674a60 SoftwareRAID: Use efibootmgr (and drop grub2-install)
Move the software RAID code path from grub2-install to
efibootmgr:

- remove the UEFI efibootmgr exception for software RAID
- create and populate the ESPs on the holder disks
- update the NVRAM with all ESPs (the component devices
  of the ESP mirror, use unique labels to avoid unintentional
  deduplication of entries in the NVRAM)

Story: #2009794

Change-Id: I7ed34e595215194a589c2f1cd0b39ff0336da8f1
2022-01-26 14:43:40 +01:00
Riccardo Pittau
7b03fbbb36 Call execute from ironic-lib in hardware.py
Replace the execute wrapper from utils with execute from ironic-lib in
hardware.py

Adjust unit tests as needed.

Change-Id: I63a3b0407b2ca2246bd0e6624bfa0f748c0d73f7
2021-11-18 07:52:48 +01:00
Riccardo Pittau
a799dcc422 Move rescan device function to general utils
We use basically the same function in two modules in the same way, let's
put that in a common place.

Change-Id: I4016e43f2cb102d4327bafcc8a2f90112a6f944a
2021-11-10 15:34:37 +01:00
Riccardo Pittau
23e67b5fea Re-read the partition table with partx -a, part 2
Use add instead of update to re-read the partition table with partx.

See [1] for more details.

Co-authored-by: Arne Wiebalck <arne.wiebalck@cern.ch>

[1] https: //opendev.org/openstack/ironic-python-agent/commit/dc8c1f16f9a00e2bff21612d1a9cf0ea0f3addf0

Change-Id: I2336e22dadc790cfbde87904612fcaa3b8c501db
2021-11-09 13:03:14 +01:00
Arne Wiebalck
9d707e9f4b Software RAID: Call udev_settle before creation
This patch fixes a race during software RAID creation:
we create the partition with parted, the kernel then
notifies udev, but we need to wait for udevd to create
the device files before calling mdadm to create the
md device.

Credits to jcosmao for finding this.

Change-Id: I642f28acc351cf50263e37dfbc8468bf59de2cc5
2021-10-05 11:42:49 +02:00
Dmitry Tantsur
07ff3b8bbc Trivial: better debugging in list_all_block_devices
One debug message only specified "Skipping" without any details.
Another did not log the whole line from lsblk. Fix both.

Change-Id: I9f8f4edad88ba2df5abc6a45a74ebdb3c7afcf97
2021-08-27 12:19:28 +02:00
Zuul
438a1f4445 Merge "Move loading of IPMI module loading to a single point" 2021-08-23 16:14:14 +00:00
Zuul
71f54b7f98 Merge "Increase version of hacking and pycodestyle" 2021-08-11 10:02:24 +00:00
Jonas Schäfer
6441db61ce Move loading of IPMI module loading to a single point
This means we do not have to rely on modprobe idempotency as
much and it's less code duplication, which is always nice.

Signed-off-by: Jonas Schäfer <jonas.schaefer@cloudandheat.com>

Change-Id: I996aba47bc54309e15e7d56e4a96b23b8deb5c9c
2021-08-06 13:14:45 +02:00
Jonas Schäfer
61af712fe5 Expose BMC MAC address in inventory data
This exposes the MAC address of the first LAN channel with an assigned
IP address in the inventory data. This is useful for inventory
processes where the asset number is not discoverable from the software
side: the BMC MAC is going to be unique (at least within an
organization).

Change-Id: I8a4bee0c25743befd7f2033e4e0cba26895c8926
2021-08-06 13:14:45 +02:00
Riccardo Pittau
efbbc86f53 Increase version of hacking and pycodestyle
Fix H904 "Delay string interpolations at logging calls" errors

Change-Id: I331808d0132094faf739998a6984440787d3ebf8
2021-07-30 14:34:33 +02:00
Arne Wiebalck
cacdd9bab3 Burn-in: Add network step
Add a clean step for network burn-in via fio. Get basic
run parameters from the node's driver_info.

Story: #2007523
Task: #42385

Change-Id: I2861696740b2de9ec38f7e9fc2c5e448c009d0bf
2021-07-13 11:36:31 +02:00
Arne Wiebalck
20c5894bc2 Burn-in: Add disk step
Add a clean step for disk burn-in via fio. Get basic
run parameters from the node's driver_info.

Story: #2007523
Task: #42384

Change-Id: I5f5e336bd629846b3d779fd0fc7a2060b385b035
2021-05-21 16:33:11 +02:00
Zuul
823e0ed743 Merge "Burn-in: Add memory step" 2021-05-11 09:31:54 +00:00
Zuul
5c01ec4f6f Merge "Burn-in: Add CPU step" 2021-05-10 15:00:14 +00:00
Arne Wiebalck
5c222560f0 Burn-in: Add memory step
Add a clean step for memory burn-in via stress-ng. Get basic
run parameters from the node's driver_info.

Story: #2007523
Task: #42383

Change-Id: I33a83968c9f87cf795ec7ec922bce98b52c5181c
2021-05-01 10:36:58 +02:00
Arne Wiebalck
6702fcaa43 Burn-in: Add CPU step
Add a clean step for CPU burn-in via stress-ng. Get basic
run parameters from the node's driver_info.

Story: #2007523
Task: #42382

Change-Id: I14fd4164991fb94263757244f716b6bfe8edf875
2021-05-01 10:36:20 +02:00
Zuul
10c29cdc41 Merge "Fix getting memory size in some lshw output" 2021-04-30 12:24:44 +00:00
Zane Bitter
ed791d9778 Fix getting memory size in some lshw output
Due to a regression in lshw introduced by
https://github.com/lyonel/lshw/pull/60, there are some versions in the
wild that do not return sizes for memory banks <32GiB. In those cases,
work around the problem by looking at the top-level size (if available)
to find the total size. Previously we assumed that we only needed the
top-level size when there was no list of memory banks.

The issue is fixed upstream by https://github.com/lyonel/lshw/pull/65,
but the erroneous patch is still present in the lshw-B.02.19.2-5.el8
package in CentOS 8.4 and 8.5.

Change-Id: I6eb5981d28b9ae368239af0c1d0ec32ff79d95b3
Story: #2008865
Task: 42395
2021-04-29 14:41:11 -04:00
Zane Bitter
c56cd4abc0 Fix missing data in log messages
Change-Id: I5d08deed86d79a7ea0b7a1625122af595037dab5
2021-04-29 09:55:56 -04:00
Dmitry Tantsur
1ab405b509 Do not fail network interface collection on unsupported interface
Currently if one interface cannot be handled (e.g. it has empty MAC),
the whole collection fails. Ignore unsupported interfaces instead.

Change-Id: Ibdaad62b39c239d4f3fb3111c2fae9e31e877b28
2021-04-07 17:16:27 +02:00
Bernd Mueller
2a64413bb6 typo chanages -> changes
Change-Id: Ifb75a5f6f01bd98011464eb05f98d8db001dcd54
2021-03-24 13:53:32 +01:00
Bob Fournier
4afe4f6069 Check the base device if the read-only file cannot be read
For some drives, the partition e.g. `/dev/sda1` will not have the
'ro' file which can result in a metadata erasure failure but the base
device (`/dev/sda`) will have this file.  Add an additional check
for the base device.

Change-Id: Ia01bdbf82cee6ce15fabdc42f9c23036df55b4c5
Story: 2008696
Task: 42004
2021-03-09 07:05:27 -05:00
Riccardo Pittau
bff252c726 Remove default parameter from execute
The param check_exit_code from the processutils extension execute has
default already at [0]
See:
https://opendev.org/openstack/oslo.concurrency/src/branch/master/oslo_concurrency/processutils.py#L214

Change-Id: Iedff5325e0737556d5eb3da601c984ddfc633873
2021-03-02 16:19:32 +01:00
Jacob Anders
d2127e7ef4 Remove nvme-cli warning and delay on nvme-format
This change adds '-f' flag to nvme-cli calls during NVMe Secure Erase.
This removes nvme-cli output warning that the device is about to be
irreversibly deleted as well as the related 10 second delay which is
pointlessly increasing NVMe cleaning time.

Story: 2008290
Change-Id: I7b7b8b7d4f643b07d5c9dcf7ec35cf7ebedf44d1
2021-03-02 15:37:35 +10:00
Zuul
4a22c887f8 Merge "Use try_execute from ironic-lib" 2021-03-01 13:54:15 +00:00
Mohammed Naser
ab267aabdd Allow clean_configuration to run against full-device arrays
At the moment, it is not possible for Ironic to clean up a
RAID array that is built from an entire device.  This patch
allows it to do so by overriding the behaviour of attempting
to find the device name if the device names does not end with
a number and is a real block device.

Story: #2008663
Task: #41948
Change-Id: I66b0990acaec45b1635795563987b99f9fa04ac7
2021-02-27 17:24:16 -05:00
Riccardo Pittau
0459c61c8d Use try_execute from ironic-lib
Also adapt unit tests

Change-Id: I37d050877daabc9dc0a5821cf20a689652b26f34
2021-02-25 14:46:17 +01:00
Zuul
6ea3aff8d6 Merge "New deploy step for injecting arbitrary files" 2021-02-22 18:48:22 +00:00
Zuul
2979ee5314 Merge "Add support for using NVMe specific cleaning" 2021-02-19 12:13:55 +00:00
Jacob Anders
8bcf1be920 Add support for using NVMe specific cleaning
This change adds support for utilising NVMe specific cleaning tools
on supported devices. This will remove the neccessity of using shred to
securely delete the contents of a NVMe drive and enable using nvme-cli
tools instead, improving cleaning performance and reducing wear on the device.

Story: 2008290
Task: 41168
Change-Id: I2f63db9b739e53699bd5f164b79640927bf757d7
2021-02-18 22:51:34 +10:00
Riccardo Pittau
7d7940d904 Move some raid specific functions to raid_utils
To reduce size of the hardware module and separate the raid specific
code in raid_utils, we move some functions and adapt the tests.

Change-Id: I73f6cf118575b627e66727d88d5567377c1999a0
2021-02-17 10:11:13 +01:00
Dmitry Tantsur
59cb08fd28 New deploy step for injecting arbitrary files
This change adds a deploy step inject_files that adds a flexible
way to inject files into the instance.

Change-Id: I0e70a2cbc13744195c9493a48662e465ec010dbe
Story: #2008611
Task: #41794
2021-02-16 16:56:52 +01:00
Riccardo Pittau
fc1f2c73c6 Use variable for lsblk columns device info
Adjusted unit tests accordingly.

Also removed redundant parenthesis.

Change-Id: I8e2cac5172f009d5204f83bd83e1f27cfd721f09
2021-02-03 15:31:32 +01:00
Julia Kreger
cb6c0059b5 Fix default disk label with partition images
Partition images through the agent have the unfortunate
side effect of being executed without full node context
by default. Luckilly we've had a similar problem and
cache the node.

This patch changes the lookup from a default of msdos
partitions to use the cached node object.

Change-Id: I002816c9372fdf1cc32f3c67f420073551479fd9
2020-12-14 06:36:18 -08:00
Zuul
1a9491e651 Merge "Bring up VLAN interfaces and include in introspection report" 2020-12-02 13:59:28 +00:00
Zuul
22985da710 Merge "Make mdadm a soft requirement" 2020-11-23 19:37:59 +00:00
Dmitry Tantsur
ab8dee0386 Make mdadm a soft requirement
No point in requiring it for deployments that don't use software RAID.

Change-Id: I8b40f02cc81d3154f98fa3f2cbb4d3c7319291b8
2020-11-20 17:07:00 +01:00