1002 Commits (master)

Author SHA1 Message Date
Zuul 03f7a1d7ee Merge "Move GA metadata to deployed status" 2023-12-04 18:25:34 +00:00
Zuul 1332ebb7a7 Merge "Replace a file test from fsmond" 2023-12-04 14:03:11 +00:00
Heitor Matsui 6c863b3828 Move GA metadata to deployed status
This commit changes the GA metadata status on fresh install
to "deployed" given recent technical decision changes.

Test Plan:
PASS: build and install iso, verify the correct output with
      "software list"

Story: 2010676
Task: 49166

Change-Id: Idbab8655f9f2e4e080f389fa7823f5e6744c4c74
Signed-off-by: Heitor Matsui <heitorvieira.matsui@windriver.com>
2023-11-29 11:59:17 -03:00
Teresa Ho 36814db843 Increase timeout for runtime manifest
In management network reconfiguration for AIO-SX, the runtime manifest
executed during host unlock could take more than five minutes to complete.
This commit is to extend the timeout period from five minutes to eight
minutes.

Test Plan:
PASS: AIO-SX subcloud mgmt network reconfiguration

Story: 2010722
Task: 49133

Change-Id: I6bc0bacad86e82cc1385132f9cf10b56002f385e
Signed-off-by: Teresa Ho <teresa.ho@windriver.com>
2023-11-23 16:51:22 -05:00
Heitor Matsui e37f69765e Copy GA metadata file to USM location
This commit switches the GA metadata file copy from
/opt/patching/metadata to /opt/software/metadata.

Test Plan
PASS: build iso, install and verify that "software list"
      lists the GA release

Story: 2010676
Task: 49112

Change-Id: I75b8cd6ae41a9cf9b5af0225ebcaaf0d9e0ddb4e
Signed-off-by: Heitor Matsui <heitorvieira.matsui@windriver.com>
2023-11-21 12:31:19 -03:00
Erickson Silva de Oliveira 16181a2ce8 Replace a file test from fsmond
fsmond tries to create a test file in "/.fs-test" but
it is not possible because "/" is blocked by ostree.

So the fix is to replace this path from fsmond monitoring
with /sysroot/.fs_test.
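
The file test described above can be sketched as follows. This is a
minimal illustration in Python, not the fsmond implementation; the
function name file_test and the payload are assumptions made for the
example.

```python
import os


def file_test(path: str, payload: bytes = b"fs-test") -> bool:
    """Return True if a test file can be written, read back and removed."""
    try:
        with open(path, "wb") as handle:
            handle.write(payload)
        with open(path, "rb") as handle:
            ok = handle.read() == payload
        os.remove(path)
        return ok
    except OSError:
        # A read-only or blocked location (such as "/" under ostree)
        # ends up here and the test reports failure.
        return False
```

Calling file_test on a writable location returns True, while a path
that cannot be created returns False, which corresponds to the
"test failed" warning in the log.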

Below is a comparison of the logs:
  - Before change:
  ( 196) fsmon_service : Warn : File (/.fs-test) test failed

  - After change:
  ( 201) fsmon_service : Info : tests passed

Test Plan:
  - PASS: Build mtce package
  - PASS: Replace fsmond binary on AIO-SX
  - PASS: Check fsmond.log output

Closes-Bug: 2043712

Change-Id: Ib4bad73448735bce1dff598151fce86f867f4db7
Signed-off-by: Erickson Silva de Oliveira <Erickson.SilvadeOliveira@windriver.com>
2023-11-17 08:15:28 -03:00
Zuul 8d2883aa68 Merge "After executing PXE boot install, turn off IPv6 autoconf" 2023-11-14 20:20:46 +00:00
Andre Kantek 97052df958 After executing PXE boot install, turn off IPv6 autoconf
It was detected that after a PXE boot install IPv6 autoconf is turned
on due to an error in the network config file for the PXE interface.
Instead of applying the config to the interface, it configures
the loopback.

With autoconf left turned on, the interface can receive unwanted
address configuration that can cause errors during the ansible
playbook execution that follows.

Closes-Bug: 2043509

Change-Id: I48584dc6b92fca02205c4774c4624410b6a29ba8
Signed-off-by: Andre Kantek <andrefernandozanella.kantek@windriver.com>
2023-11-14 16:13:33 -03:00
Teresa Ho e616a4495d Use FQDN for MGMT network in kickstart
With the introduction of FQDN for MGMT network feature, the DNS lookup
of 'controller' resolves to 'controller.internal'.
The kickstart script uses the DNS lookup of controller to determine
whether the system is using IPv6 or IPv4, which now results in a string
instead of an IP address or a 0 return code. This causes a problem when
installing nodes in IPv4 when the management interface is configured
over vlan.

The fix is to use the FQDN controller.internal.
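
To illustrate why a hostname string breaks the address family check,
here is a small Python sketch. It is not the kickstart code (which is
a shell script); the function name ip_family and the sample addresses
are assumptions for the example.

```python
import ipaddress
from typing import Optional


def ip_family(lookup_result: str) -> Optional[int]:
    """Return 4 or 6 for an IP address string, or None for anything else."""
    try:
        return ipaddress.ip_address(lookup_result.strip()).version
    except ValueError:
        # A name such as 'controller.internal' is not an IP address,
        # so no address family can be derived from it.
        return None
```

A lookup that yields an address gives 4 or 6, while a lookup that
yields 'controller.internal' gives None, which is the failure mode
described above.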

Test plan:
PASS: Install IPv4 AIO-DX with mgmt vlan
PASS: Install IPv6 AIO-DX with mgmt vlan

Story: 2010722
Task: 48682
Closes-Bug: 2042953

Signed-off-by: Teresa Ho <teresa.ho@windriver.com>
Change-Id: I5377587c8bc8c62a62f03123cabef7366df3dd94
2023-11-09 16:12:13 +00:00
Eric MacDonald 79d8644b1e Add bmc reset delay in the reset progression command handler
This update solves two issues involving bmc reset.

Issue #1: A race condition can occur if the mtcAgent finds an
          unlocked-disabled or heartbeat failing node early in
          its startup sequence, say over a swact or an SM service
          restart and needs to issue a one-time-reset. If at that
          point it has not yet established access to the BMC then
          the one-time-reset request is skipped.

Issue #2: When issue #1 race condition does not occur before BMC
          access is established the mtcAgent will issue its one-time
          reset to a node. If this occurs as a result of a crashdump
          then this one-time reset can interrupt the collection of
          the vmcore crashdump file.

This update solves both of these issues by introducing a bmc reset
delay following the detection and in the handling of a failed node
that 'may' need to be reset to recover from being network isolated.

The delay prevents the crashdump from being interrupted and removes
the race condition by giving maintenance more time to establish bmc
access required to send the reset command.

To handle significantly long bmc reset delay values this update
cancels the posted 'in waiting' reset if the target recovers online
before the delay expires.
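
The cancel-or-reset decision can be sketched as below. This is a
simplified Python illustration, not the mtcAgent code; the function
name and the idea of passing the recovery time as a number of seconds
are assumptions. The real handler also considers uptime and mtcAlive
state, as the test plan notes.

```python
from typing import Optional

DEFAULT_BMC_RESET_DELAY_SECS = 300  # 5 minutes, per this commit message


def posted_reset_outcome(
    online_after_secs: Optional[float],
    delay_secs: float = DEFAULT_BMC_RESET_DELAY_SECS,
) -> str:
    """Decide what happens to a posted ('in waiting') BMC reset.

    online_after_secs is the time the node took to recover online,
    or None if it never came back.
    """
    if online_after_secs is not None and online_after_secs < delay_secs:
        return "cancelled"  # node recovered before the delay expired
    return "reset"  # delay expired first, so the reset is sent
```

With the default delay, a node that recovers after 120 seconds has its
reset cancelled; a node that never recovers is reset when the delay
expires.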

It is recommended to use a bmc reset delay that is longer than a
typical node reboot time. This is so that in the typical case, where
there is no crashdump happening, we don't reset the node late in its
nearly complete recovery. The number of seconds remaining in the
pending reset countdown is logged periodically.

It can take upwards of 2-3 minutes for a crashdump to complete.
To avoid the double reboot, in the typical case, the bmc reset delay
is set to 5 minutes which is longer than a typical boot time.
This means that if the node recovers online before the delay expires
then great, the reset wasn't needed and is cancelled.

However, if the node is truly isolated or the shutdown sequence
hangs then although the recovery is delayed a bit to accommodate
the crashdump case, the node is still recovered after the bmc reset
delay period. This could lead to a double reboot if the node
recovery-to-online time is longer than the bmc reset delay.

This update implements this change by adding a new 'reset send wait'
phase to the existing reset progression command handler.

Some consistency driven logging improvements were also implemented.

Test Plan:

PASS: Verify failed node crashdump is not interrupted by bmc reset.
PASS: Verify bmc is accessible after the bmc reset delay.
PASS: Verify handling of a node recovery case where the node does not
      come back before bmc_reset_delay timeout.
PASS: Verify posted reset is cancelled if the node goes online before
      the bmc reset delay and uptime shows less than 5 mins.
PASS: Verify reset is not cancelled if node comes back online without
      reboot before bmc reset delay and still seeing mtcAlive on one
      or more links. Handles the cluster-host only heartbeat loss case.
      The node is still rebooted with the bmc reset delay as backup.
PASS: Verify reset progression command handling, with and
      without reboot ACKs, with and without bmc
PASS: Verify reset delay defaults to 5 minutes
PASS: Verify reset delay change over a manual change and sighup
PASS: Verify bmc reset delay of 0, 10, 60, 120, 300 (default), 500
PASS: Verify host-reset when host is already rebooting
PASS: Verify host-reboot when host is already rebooting
PASS: Verify timing of retries and bmc reset timeout
PASS: Verify posted reset throttled log countdown

Failure Mode Cases:

PASS: Verify recovery handling of failed powered off node
PASS: Verify recovery handling of failed node that never comes online
PASS: Verify recovery handling when bmc is never accessible
PASS: Verify recovery handling cluster-host network heartbeat loss
PASS: Verify recovery handling management network heartbeat loss
PASS: Verify recovery handling both heartbeat loss
PASS: Verify mtcAgent restart handling finding unlocked disabled host

Regression:

PASS: Verify build and DX system install
PASS: Verify lock/unlock (soak 10 loops)
PASS: Verify host-reboot
PASS: Verify host-reset
PASS: Verify host-reinstall
PASS: Verify reboot graceful recovery (force and no force)
PASS: Verify transient heartbeat failure handling
PASS: Verify persistent heartbeat loss handling of mgmt and/or cluster networks
PASS: Verify SM peer reset handling when standby controller is rebooted
PASS: Verify logging and issue debug ability

Closes-Bug: 2042567
Closes-Bug: 2042571
Change-Id: I195661702b0d843d0bac19f3d1ae70195fdec308
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2023-11-02 20:58:00 +00:00
Zuul 3645d5db93 Merge "Update crashDumpMgr to source config from envfile" 2023-10-19 21:01:45 +00:00
Kyle MacLeod e81d0bf4e7 Prestaged ISO: copy ostree_repo to versioned platform-backup
This commit applies to the prestaged ISO install. The kickstart.cfg is
updated to copy the prestaged ostree_repo into release-specific
/opt/platform-backup/<release> location.

A minor change is also included in miniboot.cfg to sync the patching
metadata for prepatched ISOs. This fills a potential hole in the
patching metadata sync behaviour identified during testing.
Normally the patching metadata is synchronized from the system
controller down to the subcloud. For the prestaged ISO case, this change
is necessary to ensure the patching metadata is seeded from the
prepatched ISO created via gen-prestaged-iso.sh.

Test Plan
PASS:
- Build prestaged ISO, including container images and a patch
    - Install subcloud using prestaged ISO
    - Verify contents of /opt/platform-backup/<release> are properly
      populated.
    - Verify subcloud is installed using prestaged data from
      /opt/platform-backup/<release>
    - Verify that included container images are installed
- Build prestaged ISO using a pre-patched ISO. Install subcloud, ensure
  that patching metadata is properly synchronized on installation.

Out of scope failure:
- A new bug to be raised for the following:
    - Verify that the included patch is installed on the subcloud
      - It appears that this has never worked in Debian. The --patch
        option makes sense for a Debian installation, since the patches
        are contained in ostree commits. To fully support this
        functionality we need to implement a new mechanism to do a
        sw-patch upload and apply at some point during the installation.
      - Support for the gen-prestaged-iso.sh --patch option will be
        added in a future commit

Closes-Bug: 2039282
Signed-off-by: Kyle MacLeod <kyle.macleod@windriver.com>
Change-Id: I973f4704eae09634a0c3fe2f7fbc31ac1835fcf8
2023-10-16 10:04:35 -04:00
Zuul 61e01d2000 Merge "Fix kickstarts patching" 2023-10-11 15:10:19 +00:00
Salman Rana d24e48e490 Fix kickstarts patching
Ostree doesn't manage the /var filesystem. Anything
installed there during initial filesystem setup becomes
unpatchable [1]. As a result, the kickstart install dir
/var/www/pages/feed/rel-${platform_release}/kickstart
is not updated according to patch changes.
/var/www/pages/feed/rel-${platform_release}/kickstart
is currently only used for PXE boot installs.
Subcloud remote installations are using the miniboot.cfg
kickstart from the load-imported ISO
(we may want to change this in some future commit).

This commit adds kickstart update support to
pxeboot-feed.service (pxeboot_feed.sh) so that
/var/www/pages/feed/rel-${platform_release}/kickstarts
is refreshed based on the kickstart dir from
/ostree (i.e., the patched changes).
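
The refresh can be pictured with the Python sketch below. The actual
change lives in pxeboot_feed.sh (a shell script); the function name
refresh_feed and its two directory parameters are assumptions, and the
sketch simply replaces the feed directory with a copy of the source so
that added, modified and deleted files are all reflected.

```python
import shutil
from pathlib import Path


def refresh_feed(source_dir: str, feed_dir: str) -> None:
    """Make feed_dir an exact copy of source_dir."""
    feed = Path(feed_dir)
    if feed.exists():
        # Remove the stale copy first so deleted files do not linger.
        shutil.rmtree(feed)
    shutil.copytree(source_dir, feed)
```

Replacing the whole directory is the simplest way to cover the
add/modify/delete cases listed in the test plan; a production script
might prefer an incremental sync instead.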

[1] https://review.opendev.org/c/starlingx/ha/+/890918

Test Plan:
1. PASS: Verify Debian build and DC system install
         (virtual lab - disk and pxe installs)
2. PASS: Verify pxe install (DC remote install) with
         patched kickstart
3. PASS: Create a patch with changes to kickstart feed:
          - modify an existing kickstart
          - create a new kickstart file
          - delete an existing file
          - create a new kickstart sub-directory
          - modify centos subdir
         Verify patch apply, ensure that changes are
         correctly applied to:
         /var/www/pages/feed/rel-${platform_release}/kickstarts
4. PASS: Revert the patch from test #3 and ensure changes
         are correctly undone in the feed dir

Closes-Bug: 2034753

Change-Id: I74804bff23a74512db6a95fa514c84a1a6ea54a8
Signed-off-by: Salman Rana <salman.rana@windriver.com>
2023-10-11 14:40:38 +00:00
Enzo Candotti 23143abbca Update crashDumpMgr to source config from envfile
This commit updates the crashDumpMgr service in order to:
- Cleanup of current service naming and packaging to follow the
  standard Linux naming convention:
    - Repackage /etc/init.d/crashDumpMgr to
      /usr/sbin/crash-dump-manager
    - Rename crashDumpMgr.service to crash-dump-manager.service
- Add EnvironmentFile to crash-dump-manager service file to source
  configuration from /etc/default/crash-dump-manager.
- Update ExecStart of crash-dump-manager service to use parameters
  from EnvironmentFile
- Update crash-dump-manager service dependencies to run after
  config.service.
- Update logrotate configuration to support the retention policies of
  the maximum files. The “rotate 1” option was removed to permit
  crash-dump-manager to manage pruning old files.
- Modify the crash-dump-manager script to enable updates to the
  max_files parameter to a lower value. If there are currently more
  files than the new max_files value, the oldest files will be
  deleted the next time a crash dump file needs to be stored, thus
  adhering to the new max_files values.
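
The pruning behaviour in the last item can be sketched as follows.
This is an illustrative Python version, not the crash-dump-manager
script; the function name files_to_delete and the (name, mtime) tuple
representation are assumptions.

```python
from typing import List, Tuple


def files_to_delete(files: List[Tuple[str, float]], max_files: int) -> List[str]:
    """Return the names of the oldest files that exceed max_files.

    files is a list of (name, modification_time) pairs.
    """
    excess = len(files) - max_files
    if excess <= 0:
        return []
    oldest_first = sorted(files, key=lambda item: item[1])
    return [name for name, _mtime in oldest_first[:excess]]
```

Lowering max_files from 3 to 2 with three stored files would select
only the single oldest file for deletion.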

Test Plan:

PASS: Build ISO and perform a fresh install. Verify the new
crash-dump-manager service is enabled and working as expected.
PASS: Add and apply new crashdump service parameters and force a kernel
panic. Verify that after the reboot, the max_files, max_used,
min_available and max_size values are updated accordingly to the service
parameters values.
PASS: Verify that the crashdump files are rotated as expected.

Story: 2010893
Task: 48910

Change-Id: I4a81fcc6ba456a0d73067b77588ee4a125e44e62
Signed-off-by: Enzo Candotti <enzo.candotti@windriver.com>
2023-10-06 23:06:54 +00:00
Zuul df8989e2a1 Merge "Set longer shutdown time and fix power state error log" 2023-10-05 21:29:12 +00:00
Li Zhu bfbaba5731 Set longer shutdown time and fix power state error log
1. Extended the timeout to 14 minutes to accommodate the longer shutdown
   time.
2. Fixed the power state error log so that it logs the requested state
   instead of the current power_state.

Test Plan:

PASS: Verify logged version is 2.2
PASS: Verify success path with no FIT delay ; HP and ZT servers
PASS: Verify timing of the loop with timeout of 14 minutes
PASS: Verify shutdown timeout handling when shutdown exceeds 14
      minutes.
PASS: Verify install completes successfully when Power Off takes
      close to but less than 14 minutes
PASS: Verify power state failure log reports proper state

Closes-Bug: 2038484

Signed-off-by: Li Zhu <li.zhu@windriver.com>
Change-Id: Ic99a06dca9962fcae43b20e00d8ebcb127a80560
2023-10-05 17:12:19 -04:00
Zuul f6ab5912b3 Merge "Wipe all LVs during kickstart" 2023-09-27 16:27:51 +00:00
Gustavo Ornaghi Antunes 00b313de49 Wipe all LVs during kickstart
Backup and Restore does not complete because the manifest is
not applied when drbd-cephmon tries to become primary.
This occurs because the LVs are not wiped before
being removed, so leftover data prevents drbd-cephmon
from becoming primary and causes the manifest apply to fail.

To ensure that drbd-cephmon turns primary on first unlock,
LVs will be wiped before recreating them during kickstart
procedure.

Test Plan:
PASS: Backup and restore on AIO-DX
PASS: Install AIO-SX over the previous installation without
wiping the disks and check the install.log to verify
that the disks are wiped during kickstart.
PASS: Install AIO-DX, reinstall Controller-1, and check the
install.log to verify that the disks were wiped during kickstart.

Closes-Bug: #2031542

Change-Id: Ib00d77fbc9dfd62e9c94f418e29f2805f8a0c036
Signed-off-by: Gustavo Ornaghi Antunes <gustavo.ornaghiantunes@windriver.com>
2023-09-27 14:20:01 +00:00
Zuul 8c2e1c395a Merge "Remove machine-id generated from build from subcloud install" 2023-09-27 12:58:00 +00:00
Zuul 182547b31f Merge "Revert "Fix kickstarts patching"" 2023-09-27 00:18:14 +00:00
Bruce Jones 4e09b61d0d Revert "Fix kickstarts patching"
This reverts commit 0366f8552d.

Reason for revert: breaks sanity

Change-Id: Ie580ae328a80abfc2a1964157ac1b14b70dc98e9
2023-09-26 22:28:49 +00:00
Andre Kantek 7d88382c9e Remove machine-id generated from build from subcloud install
As it was done in the previous change for local installation
https://review.opendev.org/c/starlingx/metal/+/863322
This change removes the ISO embedded machine-id file to allow the
value regeneration after the first boot post install for subclouds
that use the redfish protocol when added in a system controller.

Test Plan
[PASS] install 2 subclouds from the system controller containing the
        patch and check that the values in /etc/machine-id and
        /var/lib/dbus/machine-id are unique for each subcloud

Closes-Bug: 2037434

Change-Id: If7a631b5769cb499956a7e5ee33e3361a6230452
Signed-off-by: Andre Kantek <andrefernandozanella.kantek@windriver.com>
2023-09-26 12:06:39 -03:00
Zuul 495bb4ab1a Merge "Fix kickstarts patching" 2023-09-22 14:16:29 +00:00
Zuul d4c61740a3 Merge "Add new configuration parameters to crashDumpMgr" 2023-09-21 17:02:13 +00:00
Salman Rana 0366f8552d Fix kickstarts patching
Ostree doesn't manage the /var filesystem. Anything
installed there during initial filesystem setup becomes
unpatchable [1]. As a result, the kickstart install dir
/var/www/pages/feed/rel-${platform_release}/kickstart
is not updated according to patch changes.

This commit changes the platform-kickstarts install paths
to a place that ostree handles,
/usr/share/platform-kickstarts/rel-${platform_release}
in this case and symlinks it to
/var/www/pages/feed/rel-${platform_release}/kickstarts.

[1] https://review.opendev.org/c/starlingx/ha/+/890918

Test Plan:
1. PASS: ISO install and verify symlink created:
         /var/www/pages/feed/rel-${platform_release}/kickstarts ->
         /usr/share/platform-kickstarts/rel-${platform_release}
2. PASS: Verify that the centos/ dir, kickstart.cfg & miniboot.cfg
         are installed to /usr/share/platform-kickstarts/
         rel-${platform_release}
3. PASS: Verify PATCH apply, ensure that changes are applied to
         /var/www/pages/feed/rel-${platform_release}/kickstarts
4. PASS: Manually remove/re-install the platform-kickstarts package
         and verify kickstarts dir and symlink

Closes-Bug: 2034753
Closes-Bug: 2035109

Change-Id: I307d28c086bb3d9f0e4d6792db44e55c99358a50
Signed-off-by: Salman Rana <salman.rana@windriver.com>
2023-09-18 15:37:31 -04:00
Enzo Candotti a120cc5fea Add new configuration parameters to crashDumpMgr
This commit updates crashDumpMgr in order to add three new parameters
and enhance the existing one.

1. Maximum Files: Added 'max-files' parameter to specify the maximum
   number of saved crash dump files. The default value is 4.
2. Maximum Size: Updated the 'max-size' parameter to support
   the 'unlimited' value. The default value is 5GiB.
3. Maximum Used: Included 'max-used' parameter to limit the maximum
   storage used by saved crash dump files. It supports 'unlimited'
   and has a default value of unlimited.
4. Minimum Available: Implemented 'min-available' parameter, enabling
   the definition of a minimum available storage threshold on the
   crash dump file system. The value is restricted to a minimum of
   1GB and defaults to 10%.

These enhancements refine the crash dump management process and
offer more control over storage usage and crash dump file retention.
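
How the three storage limits might combine is sketched below in
Python. This is an interpretation based on the test plan, not the
crashDumpMgr script: the function name should_store, the use of byte
counts, and the exact comparisons are assumptions. None stands for
'unlimited', and min_available is passed in bytes (the 10% default
would first be converted by the caller).

```python
from typing import Optional

GIB = 1024 ** 3


def should_store(
    dump_size: int,
    used: int,
    available: int,
    min_available: int,
    max_size: Optional[int] = 5 * GIB,  # default max-size per the commit
    max_used: Optional[int] = None,     # default max-used is unlimited
) -> bool:
    """Return True if a crash dump of dump_size bytes may be stored."""
    if max_size is not None and dump_size > max_size:
        return False  # dump is larger than max-size
    if max_used is not None and used + dump_size > max_used:
        return False  # storing it would exceed max-used
    if available - dump_size < min_available:
        return False  # storing it would leave less than min-available
    return True
```

For example, a 6 GiB dump is rejected under the default max-size but
accepted when max_size is None.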

Story: 2010893
Task: 48676

Test Plan:
1) max-files parameter:
  PASS: don't set max-files param. Ensure the default value is used.
  Create 5 directories inside /var/crash. Each of them contains
  dmesg.<date> and dump.<date>. run the crashDumpMgr script.
  Verify:
    PASS: the vmcore_first.tar.1.gz is created when the first
          directory is read.
    PASS: 4 more vmcore_<date>.tar files are created.
    PASS: There will be 1 vmcore_first.tar.1.gz and 4
          vmcore_<date>.tar inside /var/log/crash.
    PASS: There will be one summary file for each directory:
          <date>_dmesg.<date> inside /var/crash
2) max-size parameter
  PASS: don't set max-size param. Ensure the default value is used
        (5GiB).
  PASS: Set a fixed max-size param. Create a dump.<date> file greater
        than the max-size param. Run the crashDumpMgr script. Verify
        that the crash dump file is not generated and a log
        message is displayed.
3) max-used parameter:
  PASS: don't set max-used param. Ensure the default value is used
        (unlimited).
  PASS: Set a fixed max-used param. Create a dump.<date> file that
        will push the used space above the max-used
        param. Run the crashDumpMgr script. Verify that
        the crash dump file is not generated, a log message is
        displayed and the directory is deleted.
4) min-available parameter:
  PASS: don't set min-available param. Ensure the default value is
        used (10% of /var/log/crash).
  PASS: Set a fixed 'min-available' param. Generate a 'dump.<date>'
        file to simulate a situation where the remaining space is
        less than the 'min-available' parameter. Run the crashDumpMgr
        script and ensure that it does not create the crashdump file,
        displays a log message, and deletes the entry.
5) PASS: Since the crashDumpMgr.service file is not being modified,
         verify that the script takes the default values.

Note: All tests have also been conducted by generating a kernel panic
and ensuring the crashDumpMgr script follows the correct workflow.

Change-Id: I8948593469dae01f190fd1ea21da3d0852bd7814
Signed-off-by: Enzo Candotti <enzo.candotti@windriver.com>
2023-09-18 19:22:09 +00:00
John Kung 7a1fb6333c Fix tox-docs failing sphinx
A new version of sphinx was released May 29 2022
which requires a language setting in config otherwise
a warning (treated as error) causes the sphinx operation
to fail.

Updated the sphinx config file to correct the issue.

The sphinx behavioural change is mentioned here:
https://github.com/sphinx-doc/sphinx/issues/10062
https://github.com/sphinx-doc/sphinx/issues/10474

Partial-Bug: 1976377
Partial-Bug: 2033431

Change-Id: I882faa7ce199d8817598980b9dc5090b4e1af57d
Signed-off-by: John Kung <john.kung@windriver.com>
2023-08-29 16:50:22 -04:00
Guilherme Schons bf5162bc20 Add patch extract from load
This commit adds extraction of the patch files (metadata) from the
load being imported.

Test Plan:
    Passed: load from previous version imported as inactive
    Passed: load from new version imported

Story: 2010611
Task: 48546

Change-Id: I12a2c9f62523f6b08294f2538ad77b5c8338a751
Signed-off-by: Guilherme Schons <guilherme.dossantosschons@windriver.com>
2023-08-08 12:15:00 -03:00
Bin Qian 005544b651 support import previous compatible load
Add support for importing load with import.sh from the current load.
This makes it possible to always import a load using the higher-version
import.sh (in the case of load-import --inactive).

TCs:
    passed: from system controller running 23.09 load, import 22.12 load
    passed: regression import N+1 load

Story: 2010611
Task: 48371

Change-Id: I4aec6eaa89019d4852979c27a708e409f32e27b0
Signed-off-by: Bin Qian <bin.qian@windriver.com>
2023-07-25 19:10:12 +00:00
Roger Ferraz d8b704472b starlingx/metal README improvement
This story updates the README files of a few of the most used StarlingX
repos.

Test Plan: N/A

Story: 2010814
Task: 48378

Change-Id: I3323f10f9cd983a5ff12f846e4c14af8cebbbd2f
Signed-off-by: Roger Ferraz <rogerio.ferraz@encora.com>
2023-07-19 12:32:13 -03:00
Zuul 67e2c4aaef Merge "Add intel multi-drivers-switch kernel parameter support to kickstarts" 2023-07-11 19:55:39 +00:00
Eric MacDonald 61bad300f7 Add intel multi-drivers-switch kernel parameter support to kickstarts
This update adds support to the Debian kickstarts to search the
install kernel command line for the multi-drivers-switch= option.
If that option is found, then the full option with the specified
version, ex: multi-drivers-switch=2.54 , will be added to the
disk boot kernel command line options.
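
The search-and-propagate step can be sketched as follows. The real
logic is in the Debian kickstarts (shell); this Python version and
the names find_boot_option and disk_boot_options are illustrative
assumptions.

```python
from typing import List, Optional


def find_boot_option(cmdline: str, name: str) -> Optional[str]:
    """Return the full 'name=value' token from a kernel command line."""
    prefix = name + "="
    for token in cmdline.split():
        if token.startswith(prefix):
            return token
    return None


def disk_boot_options(install_cmdline: str, base_options: List[str]) -> List[str]:
    """Append multi-drivers-switch=<ver> to the disk boot options if present."""
    option = find_boot_option(install_cmdline, "multi-drivers-switch")
    if option is None:
        return list(base_options)
    return list(base_options) + [option]
```

If the install command line carries multi-drivers-switch=2.54, the
same token is appended to the disk boot options; otherwise the options
are returned unchanged.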

Test Plan:

PASS: Verify Build and Install SX system
PASS: Unit test of code block function over an install
PASS: Verify that if the multi-drivers-switch parameter exists on
      the node install command line then the same option is
      propagated to the disk boot command line.
PASS: Verify the opposite of the above is true.

Closes-Bug: 2026893
Change-Id: I648b16dbc5aa2a0a7b8368c1b89a5d46418ab1e5
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2023-07-11 19:35:36 +00:00
Zuul c585c1dd71 Merge "Support CentOS previous release in subcloud remote install" 2023-07-11 17:20:57 +00:00
Zuul 3becc11faa Merge "Add multi-drivers-switch param to pxeboot cfg" 2023-07-11 14:00:56 +00:00
Bin Qian 2412a68815 Add multi-drivers-switch param to pxeboot cfg
Accept and apply the intel driver version parameter to the pxeboot conf
file for nodes to be installed (including reinstall and upgrade).

TCs:
    Observed the pxeboot cfg file for a new host is configured
    with param multi-drivers-switch=<ver from service parameter>

Story: 2010651
Task: 48276

Signed-off-by: Bin Qian <bin.qian@windriver.com>
Change-Id: I6aebff98a5bb831de82f6da07ac53978b17f8caf
2023-07-11 13:47:51 +00:00
Kyle MacLeod 5f3c54297d Support CentOS previous release in subcloud remote install
This commit introduces support for installing CentOS-based previous
release (21.12) in Debian.

There are two main components in this commit:
1. Handle the label change for the backup partition:

Platform Backup in 21.12 vs 'platform_backup' in Debian
This is accomplished by ignoring the label/partlabel entirely when
searching for an existing backup partition. Instead, the partition
GUID is used to locate the partition. The GUID does not change
between distributions.
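
A lookup by GUID that ignores labels can be sketched as below. This
is a Python illustration only; the dictionary layout, the function
name find_backup_partition and the GUID value used in the usage note
are hypothetical, since the commit message does not give the actual
GUID.

```python
from typing import Dict, List, Optional


def find_backup_partition(
    partitions: List[Dict[str, str]], backup_guid: str
) -> Optional[str]:
    """Return the device whose GUID matches, ignoring label/partlabel."""
    wanted = backup_guid.lower()
    for part in partitions:
        if part.get("guid", "").lower() == wanted:
            return part["device"]
    return None
```

With a placeholder GUID such as 00000000-0000-0000-0000-000000000001,
the same device is found whether its label reads 'Platform Backup' or
'platform_backup'.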

2. Use pre-bundled CentOS kickstarts for subcloud installs in Debian

Since modifications are required to the CentOS kickstart files for the
above, we copy the relevant pre-bundled centos kickstarts (for miniboot
and prestaged ISO only) into a centos-specific directory under the
Debian /var/www/pages/feed/rel-${platform_release}/kickstart directory
structure, in order to be available for the gen-bootloader-iso-centos.sh
utility. These files are included in the platform-kickstarts .deb
package.

NOTES on how the pre-bundled files are created:
- We cannot use the files under bsp-file/kickstarts/*.cfg, since they
  are not valid for 21.12 release (e.g. they refer to /var/www)
- Instead, files were taken from a valid 21.12 release and manually
  merged with the pre-bundled files generated from this repo

GOING FORWARD:
Only the bundled files at kickstart/files/centos/*.cfg will be
maintained. At a later time, we may choose to remove the partial
kickstarts under bsp-files/kickstarts/*.cfg, since they are not used
anywhere.

Test Plan

PASS:
- Build full ISO, verify that the
  /var/www/pages/feed/rel-23.09/kickstart/centos directory is populated
  with the pre-bundled kickstart files
- Verify previous-release CentOS subcloud install/deployment under
  Debian (requires patched 22.12 load)
- Verify current-release subcloud install under Debian

Story: 2010611
Task: 48268

Signed-off-by: Kyle MacLeod <kyle.macleod@windriver.com>
Change-Id: I1b7f76212e222dea7c6e586e4e9492f8a86a955e
2023-06-30 13:06:35 -04:00
Kyle MacLeod 0510b0c1a7 Support gpg-verify=false for subcloud remote ostree pull
This commit supports the developer use-case of a system controller
ostree repo configured with gpg-verify=false. In such cases, the
subcloud ostree repo instances must also be configured with
gpg-verify=false, or the ostree pull will fail.

We detect the boot parameter 'instgpg=0', in which case we configure the
ostree repo with gpg-verify=false. The instgpg=0 parameter is also
detected by LAT /install, which handles the LAT side of the ostree
repo configuration.
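
The detection can be sketched in Python as follows. The kickstart
itself is a shell script; the function name gpg_verify_enabled is an
assumption, and the sketch only decides the setting rather than
writing the ostree repo configuration.

```python
def gpg_verify_enabled(cmdline: str) -> bool:
    """Return False when the boot parameter 'instgpg=0' is present."""
    # Compare whole tokens so that e.g. 'instgpg=01' does not match.
    return "instgpg=0" not in cmdline.split()
```

A False result corresponds to configuring the subcloud ostree repo
with gpg-verify=false.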

Test Plan:
PASS:
- Install subcloud with non-GPG signed ostree commits present on system
  controller. Ensure the ostree pull is successful on subcloud, with a
  successful install.
- Ensure normal subcloud installation is successful

Story: 2010611
Task: 48309

Signed-off-by: Kyle MacLeod <kyle.macleod@windriver.com>
Change-Id: I40a0823ed1fc868aa5d4fb7686f1648440664037
2023-06-28 16:51:54 -04:00
Zuul 4f280be013 Merge "Increase mtce host offline threshold to handle slow host shutdown" 2023-06-20 14:20:02 +00:00
Zuul fb1ab7114e Merge "Refactor from-load pxe setup in to-load kickstart" 2023-06-16 19:03:11 +00:00
Zuul 3d74a094b8 Merge "Fix ip -6 address netmask and workaround for multi-drivers-switch" 2023-06-16 18:53:56 +00:00
Eric MacDonald d863aea172 Increase mtce host offline threshold to handle slow host shutdown
Mtce polls/queries the remote host for mtcAlive messages
for 42 x 100 ms intervals over unlock or host failed cases.
Absence of mtcAlive during this (~5 sec) period indicates
the node is offline.

However, in the rare case where shutdown is slow, 5 seconds
is not long enough. Rare cases have been seen where 7 or 8
second wait time is required to properly declare offline.

To avoid the rare transient 200.004 host alarm over an
unlock operation, this update increases the mtce host
offline window from 5 to 10 seconds (approx) by modifying
the mtce configuration file offline threshold from 42 to 90.
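
The arithmetic and the restart-on-mtcAlive behaviour can be sketched
as below. This is a Python illustration, not the mtce code; the
function names and the list-of-booleans representation of poll results
are assumptions.

```python
from typing import Iterable


def offline_window_secs(threshold: int, poll_interval_ms: int = 100) -> float:
    """Length of the unchallenged offline window in seconds."""
    return threshold * poll_interval_ms / 1000


def is_offline(poll_results: Iterable[bool], threshold: int) -> bool:
    """Declare offline after `threshold` consecutive polls without mtcAlive.

    Each item in poll_results is True if mtcAlive was seen in that poll.
    """
    missed = 0
    for alive in poll_results:
        # Any mtcAlive restarts the count, which lengthens the window.
        missed = 0 if alive else missed + 1
        if missed >= threshold:
            return True
    return False
```

With 100 ms polls, a threshold of 42 gives a window of roughly 4 to 5
seconds and a threshold of 90 gives roughly 9 to 10 seconds, matching
the approximate figures quoted above.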

Test Plan:

PASS: Verify unchallenged failed to offline period to be ~10 secs
PASS: Verify algorithm restarts if there is mtcAlive received
      anytime during the polls/queries (challenge) window.
PASS: Verify challenge handling leads to a longer but
      successful offline declaration.
PASS: Verify above handling for both unlock and spontaneous
      failure handling cases.

Closes-Bug: 2024249
Change-Id: Ice41ed611b4ba71d9cf8edbfe98da4b65dcd05cf
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2023-06-16 18:14:08 +00:00
Kyle MacLeod 9e94f1834a Fix ip -6 address netmask and workaround for multi-drivers-switch
This commit fixes the missing support for bootstrap_address_prefix
in the miniboot ip -6 address add command. We check for the provided
prefix value parsed from the boot arguments and make sure that it is
applied if present. Note that the bootstrap_address_prefix is a
mandatory install value, so it will be provided. However, we leave the
capability for it to be missing, in order to de-risk this commit.
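
The optional-prefix handling can be sketched as follows. This Python
helper only builds the argument list for an `ip -6 address add`
command; the function name, the sample address and the device name
are assumptions, and the real change is in miniboot.cfg.

```python
from typing import List, Optional


def ip6_address_add_args(
    address: str, device: str, prefix: Optional[str] = None
) -> List[str]:
    """Build the 'ip -6 address add' arguments, applying the prefix if given."""
    target = f"{address}/{prefix}" if prefix else address
    return ["ip", "-6", "address", "add", target, "dev", device]
```

When bootstrap_address_prefix is supplied the address is emitted as
address/prefix; when it is missing the address is used as-is, which
preserves the earlier behaviour.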

Additionally, a workaround is included for full support of
multi-drivers-switch given in the boot arguments. When this argument is
given we parse out the kernel module version and use it to replace the
current kernel modules for ice/i40e/iavf with the modules of the given
version.

Test Plan
PASS:
- Replace miniboot.cfg at /var/miniboot/kickstart-override/miniboot.cfg
  on target lab system requiring multi-drivers-switch=cvl-2.54:
    - Using subcloud install-value
      extra_boot_params: multi-drivers-switch=cvl-2.54,
      verify that the subcloud switches to the legacy kernel modules
      and the subcloud is able to properly configure its IP address and
      perform the ostree pull operation from the system controller.
- Install subcloud with no extra_boot_params, verify that the
  bootstrap_address_prefix is properly applied. Verify no regression.

Closes-Bug: 2023407

Signed-off-by: Kyle MacLeod <kyle.macleod@windriver.com>
Change-Id: I4f3d8e2f240f2aa061de30014cf39dfb9b42a035
2023-06-16 12:17:25 -04:00
Zuul e8bbc8c6d3 Merge "Create pmon config file for sssd to run on storage" 2023-06-14 14:05:19 +00:00
Andy Ning 8da5f0fe19 Create pmon config file for sssd to run on storage
Currently sssd is neither configured nor running on storage nodes, so
LDAP users can't log in to them. This update creates an sssd pmon
config file so that sssd runs on storage nodes.

Test Plan:
PASS: System with storage nodes deployment
PASS: In storage nodes, verify that the following config file exists:
      /etc/pmon.d/sssd.conf
PASS: In storage nodes, verify that sssd is running by
      systemctl status sssd
PASS: In storage nodes, verify ldap users are accessible by
      getent passwd

Closes-Bug: 2023399
Depends-On: https://review.opendev.org/c/starlingx/stx-puppet/+/885878
Change-Id: I2e85873c3ddd18bab68365a58b5a8617eb1b2766
Signed-off-by: Andy Ning <andy.ning@windriver.com>
2023-06-12 09:35:18 -04:00
Kyle MacLeod 5f34e2843d Translate extra_boot_params into disk boot kernel options
The extra_boot_params install value is presented as a single boot
parameter in the initial miniboot ISO boot. This kickstart change
translates the install value into proper disk boot kernel options, so
that the provided extra_boot_params are applied as boot options for the
main /boot parameters in grub and syslinux.

Although the extra_boot_params value must be a single string, multiple
extra boot parameters can be specified by separating individual args
by a comma. Example: extra_boot_params=arg1=1,arg2=2. This change splits
the args by comma and ensures that the kernel boot options are separate
for the main boot.
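
The comma-splitting described above can be sketched in a few lines. This is a hedged Python illustration rather than the kickstart's own shell implementation; the function name is hypothetical, and skipping empty items (for example from a trailing comma) is an added assumption.

```python
def translate_extra_boot_params(value):
    """Turn a comma-separated extra_boot_params value into kernel options.

    Each comma-separated item becomes its own space-separated option.
    Empty items are dropped (an assumption, not stated in the commit).
    """
    return " ".join(arg for arg in value.split(",") if arg)


if __name__ == "__main__":
    for raw in ("arg1=1,arg2=2", "arg1=1", "arg1"):
        print(f"{raw} -> {translate_extra_boot_params(raw)}")
```

The three inputs in the usage loop are the same ones listed in the commit's test plan.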

Test Plan
PASS:
- Verify that extra_boot_params is parsed into separate kernel options
- Verify that disk kernel options are applied when subcloud is installed
  (i.e., the final install boots with the configured extra options)
- Verify comma-separated input values are translated into proper
  kernel options:
    - extra_boot_params=arg1=1,arg2=2 -> kernel options: arg1=1 arg2=2
    - extra_boot_params=arg1=1 -> kernel options: arg1=1
    - extra_boot_params=arg1 -> kernel options: arg1

Partial-Bug: 2023407
Depends-On: https://review.opendev.org/c/starlingx/distcloud/+/885758

Change-Id: I8ed10f7ffe8af51ae7b77eaa398b824347a0a998
Signed-off-by: Kyle MacLeod <kyle.macleod@windriver.com>
2023-06-09 12:12:47 -04:00
Zuul a1acf1a0d1 Merge "Fix prestage ISO install abort if previous subcloud install exists" 2023-05-23 21:55:34 +00:00
Zuul 13592cafa6 Merge "miniboot: Use release-specific prestage data, handle subcloud downgrade" 2023-05-23 18:29:33 +00:00
Kyle MacLeod d807f6b65e Fix prestage ISO install abort if previous subcloud install exists
This commit fixes the detection of www/pages/feed/rel-xx.x/install_uuid
via device '/dev/cgts-vg/var-lv'. A bug caused the same device to be
mounted every time, instead of using the proper device_list.

The code is also slightly refactored for simplification and clarity.

Test Plan
PASS:
- Generate ISO using gen-prestage-iso.sh without --force-install option
    - Verify installation failure (drop to boot prompt) if previous
      subcloud installation exists
    - Verify successful subcloud installation if no previous
      subcloud installation exists
- Generate ISO using gen-prestage-iso.sh with --force-install option
    - Verify successful installation regardless of whether a previous
      subcloud installation exists

Closes-Bug: 2020526
Change-Id: Ib83d72fa07335ffa29d365da7813b226c4ef310b
Signed-off-by: Kyle MacLeod <kyle.macleod@windriver.com>
2023-05-23 10:51:54 -04:00
Kyle MacLeod 018d06ccec miniboot: Use release-specific prestage data, handle subcloud downgrade
This commit handles the relocation of ostree_repo prestaging data from
/opt/platform-backup to /opt/platform-backup/<release>. The miniboot.cfg
kickstart now looks for prestaged data in the release-specific location.

We also handle the backup partition name change across CentOS/Debian. In
the case of a downgrade the CentOS miniboot kickstart code is updated to
use the partition GUID rather than LABEL or PARTLABEL. The GUID is
constant across all releases and is therefore a more reliable indicator
of the backup partition.

Tech debt: Replace the arbitrary sleep calls used when configuring
VLAN addressing with the more efficient wait_for_interface approach
for the VLAN links.
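
The difference between a fixed sleep and polling until the link is ready can be sketched as below. This is not the kickstart's wait_for_interface (a shell function whose details are not shown here); it is a Python illustration that assumes the link state is read from the Linux sysfs file /sys/class/net/<interface>/operstate, with the timeout and interval defaults chosen arbitrarily.

```python
import time
from pathlib import Path


def read_operstate(interface):
    """Read the interface state from sysfs; 'missing' if it does not exist."""
    try:
        return Path("/sys/class/net", interface, "operstate").read_text().strip()
    except OSError:
        return "missing"


def wait_for_interface(interface, timeout=30.0, interval=0.5,
                       read_state=read_operstate,
                       sleep=time.sleep, clock=time.monotonic):
    """Poll until the interface reports 'up' or the timeout expires.

    Returns True as soon as the link is up, so the caller waits only as
    long as needed instead of sleeping for a fixed, arbitrary period.
    """
    deadline = clock() + timeout
    while True:
        if read_state(interface) == "up":
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

The read_state, sleep, and clock parameters exist only so the loop can be exercised without real network devices.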

Test Plan
PASS:
- Boot with prestaged data under /opt/platform-backup/<release>/
  Ensure boot/install successfully uses prestaged data.
- Boot into older release under prestaged /opt/platform-backup/21.12
    - Test moving from 22.12 -> 21.12 and 21.12 -> 22.12
    - Ensure backup partition is found using GUID approach.
    - Ensure boot/install successfully uses prestaged data.
- Boot into both current and older release with no prestaged data
    - Test moving from 22.12 -> 21.12 and 21.12 -> 22.12
    - Ensure boot/install is successful.
- Boot subcloud with bootstrap_vlan, ensure that the wait_for_interface
  calls properly wait until the link is up.

Story: 2010611
Task: 47943
Depends-On: https://review.opendev.org/c/starlingx/distcloud/+/880789

Change-Id: I381b60285e9bfc375f01f45b79174b71da7f0565
Signed-off-by: Kyle MacLeod <kyle.macleod@windriver.com>
2023-05-18 13:46:30 -04:00