From e265015e05a56abd0012e2163de835915587b3ab Mon Sep 17 00:00:00 2001 From: Juanita Balaraj Date: Thu, 9 Jan 2025 20:45:31 +0000 Subject: [PATCH] Release Notes for StarlingX 10.0 Updated Patchset 4 comments Fixed build errors Updated Patchset 2 comments Added New features for Stx 10.0 in the Introduction Guide Change-Id: I39dae18ed7cfe94390ee1ff02af369d26d7227e8 Signed-off-by: Juanita Balaraj --- .../introduction/index-intro-27197f27ad41.rst | 12 +- doc/source/releasenotes/index.rst | 3329 ++++++++++++----- 2 files changed, 2456 insertions(+), 885 deletions(-) diff --git a/doc/source/introduction/index-intro-27197f27ad41.rst b/doc/source/introduction/index-intro-27197f27ad41.rst index 328db6863..12df84d62 100644 --- a/doc/source/introduction/index-intro-27197f27ad41.rst +++ b/doc/source/introduction/index-intro-27197f27ad41.rst @@ -68,13 +68,19 @@ Supporting projects and repositories: For additional information about project teams, refer to the `StarlingX wiki `_. +------------------------------ +New features in StarlingX 10.0 +------------------------------ + +.. include:: /releasenotes/index.rst + :start-after: start-new-features-r10 + :end-before: end-new-features-r10 + ----------------------------- New features in StarlingX 9.0 ----------------------------- -.. include:: /releasenotes/index.rst - :start-after: start-new-features-r9 - :end-before: end-new-features-r9 +**See**: https://docs.starlingx.io/r/stx.9.0/releasenotes/index.html#release-notes ----------------------------- New features in StarlingX 8.0 diff --git a/doc/source/releasenotes/index.rst b/doc/source/releasenotes/index.rst index bd6a41531..a1d3c1391 100644 --- a/doc/source/releasenotes/index.rst +++ b/doc/source/releasenotes/index.rst @@ -17,7 +17,7 @@ StarlingX is a fully integrated edge cloud software stack that provides everything needed to deploy an edge cloud on one, two, or up to 100 servers. This section describes the new capabilities, Known Limitations and Workarounds, -Defects fixed and deprecated information in StarlingX 9.0 Release. +Defects fixed and deprecated information in StarlingX 10.0. .. contents:: :local: @@ -27,465 +27,1151 @@ Defects fixed and deprecated information in StarlingX 9.0 Release. ISO image --------- -The pre-built ISO (Debian) for StarlingX Release 9.0 is located at the +The pre-built ISO (Debian) for StarlingX 10.0 is located at the ``StarlingX mirror`` repo: -https://mirror.starlingx.windriver.com/mirror/starlingx/release/9.0.0/debian/monolithic/outputs/iso/ +https://mirror.starlingx.windriver.com/mirror/starlingx/release/10.0.0/debian/monolithic/outputs/iso/ -------------------------------------- -Source Code for StarlingX Release 9.0 -------------------------------------- +------------------------------ +Source Code for StarlingX 10.0 +------------------------------ -The source code for StarlingX Release 9.0 is available on the r/stx.9.0 +The source code for StarlingX 10.0 is available on the r/stx.10.0 branch in the `StarlingX repositories `_. ---------- Deployment ---------- -To deploy StarlingX Release 9.0, see `Consuming StarlingX `_. +To deploy StarlingX 10.0, see `Consuming StarlingX `_. -For detailed installation instructions, see `StarlingX 9.0 Installation Guides `_. +For detailed installation instructions, see `StarlingX 10.0 Installation Guides `_. + +.. Ghada / Greg please confirm if all features listed here are required in Stx 10.0 ----------------------------- New Features and Enhancements ----------------------------- -.. 
start-new-features-r9 - The sections below provide a detailed list of new features and links to the associated user guides (if applicable). -********************* -Kubernetes up-version -********************* +.. start-new-features-r10 -In StarlingX 9.0, the Kubernetes version that is supported is in the range -of v1.24 to v1.27. +**************************** +Platform Component Upversion +**************************** + +The ``auto_update`` attribute supported for |prod| applications +enables apps to be automatically updated when a new app version tarball is +installed on a system. + +**See**: https://wiki.openstack.org/wiki/StarlingX/Containers/Applications/AppIntegration + +The following platform component versions have been updated in |prod| 10.0. + +- sriov-fec-operator 2.9.0 + +- kubernetes-power-manager 2.5.1 + +- kubevirt-app: 1.1.0 + +- security-profiles-operator 0.8.7 + +- nginx-ingress-controller + + - ingress-nginx 4.11.1 + + - secret-observer 0.1.1 + +- auditd 1.0.5 + +- snmp 1.0.3 + +- cert-manager 1.15.3 + +- ceph-csi-rbd 3.11.0 + +- node-interface-metrics-exporter 0.1.3 + +- node-feature-discovery 0.16.4 + +- app-rook-ceph + + - rook-ceph 1.13.7 + - rook-ceph-cluster 1.13.7 + - rook-ceph-floating-monitor 1.0.0 + - rook-ceph-provisioner 2.0.0 + +- dell-storage + + - csi-powerstore 2.10.0 + - csi-unity 2.10.0 + - csi-powerscale 2.10.0 + - csi-powerflex 2.10.1 + - csi-powermax 2.10.0 + - csm-replication 1.8.0 + - csm-observability 1.8.0 + - csm-resiliency 1.9.0 + +- portieris 0.13.16 + +- metrics-server 3.12.1 (0.7.1) + +- FluxCD helm-controller 1.0.1 (for Helm 3.12.2) + +- power-metrics + + - cadvisor 0.50.0 + + - telegraf 1.1.30 + +- security-profiles-operator 0.8.7 + +- vault + + - vault 1.14.0 + + - vault-manager 1.0.1 + +- oidc-auth-apps + + - oidc-auth-secret-observer secret-observer 0.1.6 1.0 + + - oidc-dex dex-0.18.0+STX.4 2.40.0 + + - oidc-oidc-client oidc-client 0.1.22 1.0 + +- platform-integ-apps + + - ceph-csi-cephfs 3.11.0 + + - ceph-pools-audit 0.2.0 + +- app-istio + + - istio-operator 1.22.1 + + - kiali-server 1.85.0 + +- harbor 1.12.4 + +- ptp-notification 2.0.55 + +- intel-device-plugins-operator + + - intel-device-plugins-operator 0.30.3 + + - intel-device-plugins-qat 0.30.1 + + - intel-device-plugins-gpu 0.30.0 + + - intel-device-plugins-dsa 0.30.1 + + - secret-observer 0.1.1 + +- node-interface-metrics-exporter 0.1.3 + +- oran-o2 2.0.4 + +- helm 3.14.4 for K8s 1.21 - 1.29 + +- Redfish Tool 1.1.8-1 + +**See**: :ref:`Application Reference ` + +******************** +Kubernetes Upversion +******************** + +|prod-long| Release |this-ver| supports Kubernetes 1.29.2. + +***************************************** +Distributed Cloud Scalability Improvement +***************************************** + +|prod| System Controller scalability has been improved in |prod| 10.0 with +both 5 thousand maximum managed nodes and maximum number of parallel operations. **************************************** -Platform Application Components Revision +Unified Software Delivery and Management **************************************** -.. Need updated versions for this section wherever applicable +In |prod| 10.0, the Software Patching functionality and the +Software Upgrades functionality have been re-designed into a single Unified +Software Management framework. There is now a single procedure for managing +the deployment of new software; regardless of whether the new software is a +new Patch Release or a new Major Release. 
The same APIs/CLIs are used, the +same procedures are used, the same |VIM| / Host Orchestration strategies are used +and the same Distributed Cloud / Subcloud Orchestration strategies are used; +regardless of whether the new software is a new Patch Release or a new Major +Release. -The following applications have been updated to a new version in StarlingX Release 9.0. -All platform application up-versions are updated to remain current and address -security vulnerabilities in older versions. - -- app-sriov-fec-operator: 2.7.1 - -- cert-manager: 1.11.1 - -- metric-server: 1.0.18 - -- nginx-ingress-controller: 1.9.3 - -- oidc-dex: 2.37.0 - -- vault: 1.14.8 - -- portieris: 0.13.10 - -- istio: 1.19.4 - -- kiali: 1.75.0 - -****************** -FluxCD Maintenance -****************** -FluxCD helm-controller is upgraded from v0.27.0 to v0.35.0 and is compatible -with Helm version up to v3.12.1 and Kubernetes v1.27.3. - -FluxCD source-controller is upgraded from v0.32.1 to v1.0.1 and is compatible -with Helm version up to v3.12.1 and Kubernetes v1.27.3. - -**************** -Helm Maintenance -**************** - -Helm has been upgraded to v3.12.2 in StarlingX Release 9.0. +**See**: :ref:`appendix-commands-replaced-by-usm-for-updates-and-upgrades-835629a1f5b8` +for a detailed list of deprecated commands and new commands. ******************************************* -Support for Silicom TimeSync Server Adaptor +Infrastructure Management Component Updates ******************************************* -The Silicom network adaptor provides local time sync support via a local |GNSS| -module which is based on the Intel Columbiaville device. +In |prod| 10.0, the new Unified Software Management framework +supports enhanced Patch Release packaging and enhanced Major Release deployments. -- ``cvl-4.10`` Silicom driver bundle - - ice driver: 1.10.1.2 - - i40e driver: 2.21.12 - - iavf driver: 4.6.1 +Patch Release packaging has been simplified to deliver new or modified Debian +packages, instead of the cryptic difference of OSTree builds done previously. +This allows for inspection and validation of Patch Release content prior to +deploying, and allows for future flexibility of Patch Release packaging. - .. note:: +Major Release deployments have been enhanced to fully leverage OSTree. An +OSTree deploy is now used to update the host software. The new software's +root filesystem can be installed on the host, while the host is still running +the software of the old root filesystem. The host is simply rebooted +into the new software's root filesystem. This provides a significant +improvement in both the upgrade duration and the upgrade service impact +(especially for |AIO-SX| systems), as previously upgrading hosts needed to have +disks/root-filesystems wiped and then software re-installed. - `cvl-4.10` is only recommended if the Silicom STS2 card is used. +**See** -********************************************* -Kubernetes Upgrade Optimization - AIO-Simplex -********************************************* +- :ref:`patch-release-deployment-before-bootstrap-and-commissioning-of-7d0a97144db8` -**Configure Kubernetes Multi-Version Upgrade Cloud Orchestration for AIO-SX** +- :ref:`manual-host-software-deployment-ee17ec6f71a4` -You can configure Kubernetes multi-version upgrade orchestration strategy using -the :command:`sw-manager` command. This feature is enabled from -|prod| |k8s-multi-ver-orch-strategy-release| and is supported only for the -|AIO-SX| system. 
+- :ref:`manual-removal-host-software-deployment-24f47e80e518` -**See**: :ref:`Configure Kubernetes Multi-Version Upgrade Cloud Orchestration for AIO-SX ` +- :ref:`manual-rollback-host-software-deployment-9295ce1e6e29` -**Manual Kubernetes Multi-Version Upgrade in AIO-SX** +*********************************************************** +Unified Software Management - Rollback Orchestration AIO-SX +*********************************************************** -|AIO-SX| now supports multi-version Kubernetes upgrades. In this model, -Kubernetes is upgraded by two or more versions after disabling applications and -then applications are enabled again. This is faster than upgrading Kubernetes -one version at a time. Also, the upgrade can be aborted and reverted to the -original version. This feature is supported only for |AIO-SX|. - -**See**: :ref:`Manual Kubernetes Multi-Version Upgrade in AIO-SX ` - -*********************************** -Platform Admin Network Introduction -*********************************** - -The newly introduced admin network is an optional network that is used to -monitor and control internal |prod| between the subclouds and system controllers -in a Distributed Cloud environment. This function is performed by the management -network in the absence of an admin network. However, the admin network is more -easily reconfigured to handle subnet and IP address network parameter changes -after initial configuration. - -In deployment configurations, static routes from the management or admin -interface of subclouds controller nodes to the system controller's management -subnet must be present. This ensures that the subcloud comes online after deployment. +|VIM| Patch Orchestration has been enhanced to support the abort and rollback of +a Patch Release software deployment. |VIM| Patch Orchestration rollback will +automate the abort and rollback steps across all hosts of a Cloud configuration. .. note:: - The admin network is optional. The default management network will be used - if it is not present. + In |prod| 10.0, |VIM| Patch Orchestration Rollback is only + supported for |AIO-SX| configurations. -You can manage an optional admin network on a subcloud for IP connectivity to -the system controller management network where the IP addresses of the admin -network can be changed. +In |prod-long| 10.0 |VIM| Patch Orchestration Rollback is only +supported if the Patch Release software deployment has been aborted or +failed prior to the 'software deploy activate' step. If the Patch Release +software deployment is at or beyond the 'software deploy activate' step, +then an install plus restore of the Cloud is required in order to rollback +the Patch Release deployment. + +**See**: :ref:`orchestrated-rollback-host-software-deployment-c6b12f13a8a1` + + +*********************************** +Enhancements to Full Debian Support +*********************************** + +The Kernel can be configured during runtime as [ standard <-> lowlatency ]. + +**See**: :ref:`Modify the Kernel using the CLI ` + +********************************************************* +Support for Kernel Live Patching (for possible scenarios) +********************************************************* + +|prod-long| supports live patching that enables fixing critical functions +without rebooting the system and enables systems to be functional and running. +The live-patching modules will be built into the upgraded |prod-long| binary +patch. + +The upgraded binary patch is generated as the in-service type (non-reboot-required). 
+The kernel modules will be matched with the correct kernel release version +during binary patch upgrading. + +The relevant kernel module can be found in the location: +'/lib/modules//extra/kpatch' + +During binary patch upgrading, the user space tool ``kpatch`` is +used for: + +- installing the kernel module to ${installdir} + +- loading(insmod) the kernel module for the running kernel + +- unloading(rmmod) the kernel module from the running kernel + +- uninstallling the kernel module from ${installdir} + +- listing the enabled live patch kernel module + +************************** +Subcloud Phased Deployment +************************** + +Subclouds can be deployed using individual phases. Therefore, instead of using +a single operation, a subcloud can be deployed by executing each phase individually. +Users have the flexibility to proactively abort the deployment based on their +needs. When the deployment is resumed, previously installed contents will be +still valid. + +**See**: :ref:`Install a Subcloud in Phases ` + +****************************** +Kubernetes Local Client Access +****************************** + +You can configure Kubernetes access for a user logged in to the active +controller either through SSH or by using the system console. + +**See**: :ref:`configure-kubernetes-local-client-access` + +******************************* +Kubernetes Remote Client Access +******************************* + +The access to the Kubernetes cluster from outside the controller can be done +using the remote CLI container or using the host directly. + +**See**: :ref:`configure-kubernetes-remote-client-access` + +************************************************** +IPv4/IPv6 Dual Stack support for Platform Networks +************************************************** + +Migration of a single stack deployment to dual stack network deployments will +not cause service disruptions. + +Dual-stack networking facilitates the simultaneous use of both IPv4 and IPv6 +addresses, or continue to use each IP version independently. To accomplish +this, platform networks can be associated with 1 or 2 address pools, one for +each IP version (IPv4 or IPv6). The first pool is linked to the network +upon creation and cannot be subsequently removed. The second pool can be added or +removed to transition the system between dual-stack and single-stack modes. + +**See**: :ref:`dual-stack-support-318550fd91b5` + +********************************* +Run Kata Containers in Kubernetes +********************************* + +There are two methods to run Kata Containers in Kubernetes: by runtime class or +by annotation. Runtime class is supported in Kubernetes since v1.12.0 or +higher, and it is the recommended method for running Kata Containers. + +**See**: :ref:`kata_container` + +*************************************************** +External DNS Alternative: Adding Local Host Entries +*************************************************** + +You can configure user-defined host entries for external resources that are not +maintained by |DNS| records resolvable by the external |DNS| server(s) (i.e. +``nameservers`` in ``system dns-show/dns-modify``). This functionality enables +the configuration of local host records, supplementing hosts resolvable by +external |DNS| server(s). 
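Conceptually, each user-defined host entry is simply a name-to-address record
(similar in spirit to an ``/etc/hosts`` line) for a resource that the external
|DNS| servers cannot resolve. The record below is illustrative only (the host
name and address are placeholders); the actual configuration commands are
described in the guide referenced below.

.. code-block:: none

   # Illustrative mapping only, not a configuration command:
   # a private mirror that the external DNS servers do not know about
   10.10.53.11   internal-mirror.example.com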
+ +**See**: :ref:`user-host-entries-configuration-9ad4c060eb15` + +******************************************* +Power Metrics Enablement - vRAN Integration +******************************************* + +|prod| 10.0 supports integrated enhanced power metrics tool with +reduced impact on vRAN field deployment. + +Power Metrics may increase the scheduling latency due to perf and |MSR| +readings. It was observed that there was a latency impact of around 3 µs on +average, plus spikes with significant increases in maximum latency values. +There was also an impact on the kernel processing time. Applications that +run with priorities at or above 50 in real-time kernel isolated CPUs should +allow kernel services to avoid unexpected system behavior. + +**See**: :ref:`install-power-metrics-application-a12de3db7478` + +****************************************** + Crash dump File Size Setting Enhancements +****************************************** + +The Linux kernel can be configured to perform a crash dump and reboot in +response to specific serious events. A crash dump event produces a +crash dump report with bundle of files that represent the state of the kernel at the +time of the event, which is useful for post-event root cause analysis. + +The crash dump files that are generated by Linux kdump are configured to be +generated during kernel panics (default) are managed by the crashDumpMgr utility. +The utility will save crash dump files but the current handling uses a fixed +configuration when saving files. In order to provide a more flexible system +handling the crashDumpMgr utility is enhanced to support the following +configuration parameters that will control the storage and rotation of crash +dump files. + +- Maximum Files: New configuration parameter for the number of saved crash + dump files (default 4). + +- Maximum Size: Limit the maximum size of an individual crash dump file + (support for unlimited, default 5GB). + +- Maximum Used: Limit the maximum storage used by saved crash dump files + (support for unlimited, default unlimited). + +- Minimum Available: Limit the minimum available storage on the crash dump + file system (restricted to minimum 1GB, default 10%). + +The service parameters must be specified using the following service hierarchy. +It is recommended to model the parameters after the platform coredump service +parameters for consistency. + +.. code-block:: none + + platform crashdump = + +**See**: :ref:`customize-crashdumpmanager-46e0d32891a0` + +.. Michel Desjardins please confirm if this is applicable? + +*********************************************** +Subcloud Install or Restore of Previous Release +*********************************************** + +|prod| |this-ver| system controller supports both |prod| 9.0 and +|prod| |this-ver| subclouds fresh install or restore. + +If the upgrade is from |prod| 9.0 to a higher release, the **prestage status** +and **prestage versions** fields in the output of the +:command:`dcmanager subcloud list` command will be empty, regardless of whether +the deployment status of the subcloud was ``prestage-complete`` before the upgrade. +These fields will only be updated with values if you run ``subcloud prestage`` +or ``prestage orchestration`` again. + +**See**: :ref:`Subclouds Previous Major Release Management ` + +**For non-prestaged subcloud remote installations** +The ISO imported via ``load-import --active`` should always be at the same patch +level as the system controller. 
This is to ensure that the subcloud boot image +aligns with the patch level of the load to be installed on the subcloud. + +**See**:`installing-a-subcloud-using-redfish-platform-management-service` + +**For prestaged remote subcloud installations** +The ISO imported via ``load-import --inactive`` should be at the same patch level +as the system controller. If the system controller is patched after subclouds +have been prestaged, it is recommended to repeat the prestaging for each +subcloud. This is to ensure that the subcloud boot image aligns with the patch +level of the load to be installed on the subcloud. +**See**: :ref:`prestaging-prereqs` + +**************************************** +WAD Users Access Right Control via Group +**************************************** + +You can configure an |LDAP| / |WAD| user with 'sys_protected' group or 'sudo all'. + +- an |LDAP| / |WAD| user in 'sys_protected' group on |prod-long| + + - is equivalent to the special 'sysadmin' bootstrap user + + - via "source /etc/platform/openrc" + + - has Keystone admin/admin identity and credentials, and + - has Kubernetes /etc/kubernetes/admin.conf credentials + + - only a small number of users have this capability + +- an |LDAP| / |WAD| user with 'sudo all' capability on |prod-long| + + - can perform the following |prod|-type operations: + - sw_patch to unauthenticated endpoint + - docker/crictl to communicate with the respective daemons + - using some utilities - like show-certs.sh, license-install (recovery only) + - IP configuration for local network setup + - password changes of Linux users (i.e. local LDAP) + - access to restricted files, including some logs + - manual reboots + +The local |LDAP| server by default serves both HTTPS on port 636 and HTTP on +port 389. + +The HTTPS server certificate is issued by cert-manager ClusterIssuer +``system-local-ca`` and is managed internally by cert-manager. The certificate +will be automatically renewed when the expiration date approaches. The +certificate is called ``system-openldap-local-certificate`` with its secret +having the same name ``system-openldap-local-certificate`` in the +``deployment`` namespace. The server certificate and private key files are +stored in the ``/etc/ldap/certs/`` system directory. **See**: -- :ref:`Common Components ` +- :ref:`local-ldap-certificates-4e1df1e39341` -- :ref:`Manage Subcloud Network Parameters ` +- :ref:`sssd-support-5fb6c4b0320b` -**************************************************** -L3 Firewalls for all |prod-long| Platform Interfaces -**************************************************** +- :ref:`create-ldap-linux-accounts` -|prod| incorporates default firewall rules for the platform networks (|OAM|, -management, cluster-host, pxeboot, admin, and storage). You can configure -additional Kubernetes Network Policies to augment or override the default rules. +**************************************************************************************** +Accessing Collect Command with 'sudo' privileges and membership in 'sys-protected' Group +**************************************************************************************** -**See**: +The |prod| 10.0 adds support to run ``Collect`` from any +local |LDAP| or Remote |WAD| user account with 'sudo' capability and a member +of the 'sys_protected' group. 
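As a minimal illustration (the account name below is hypothetical), such a user
runs the tool exactly as 'sysadmin' would; :command:`collect` prompts for a
password and produces a dated collect bundle on the host.

.. code-block:: none

   # Logged in as an LDAP/WAD user that has 'sudo' capability and is a
   # member of the 'sys_protected' group:
   jsmith@controller-0:~$ collect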
-- :ref:`Modify Firewall Options ` +The ``Collect`` tool continues support from the 'sysadmin' user account +and also being run from any other successfully created |LDAP| and |WAD| account +with 'sudo' capability and a member of the 'sys_protected' group. -- :ref:`Default Firewall Rules ` +For security reasons, no password 'sudo' continues to be unsupported. -**************************************************** -app-sriov-fec-operator upgrade to FEC operator 2.7.1 -**************************************************** +.. Eric McDonald please confirm if this is supported in Stx 10.0 -A new version of the FEC Operator v2.7.1 (for all Intel hardware accelerators) -is supported to include ``igb_uio`` along with making the accelerator resource -names configurable and enabling accelerator device configuration using -``igb_uio`` driver when secure boot is enabled in the BIOS. +******************************** +Support for Intel In-tree Driver +******************************** + +The system supports both in-tree and out-of-tree versions of the Intel ``ice``, +``i40e``, and ``iavf`` drivers. On initial installation, the system uses the +default out-of-tree driver version. You can switch between the in-tree and +out-of-tree driver versions. For further details: + +**See**: :ref:`intel-driver-version-c6e3fa384ff7` .. note:: - |FEC| operator is now running on the |prod| platform core. - -**See**: :ref:`Configure Intel Wireless FEC Accelerators using SR-IOV FEC operator ` - - -************************************** -Redundant System Clock Synchronization -************************************** - -The ``phc2sys`` application can be configured to accept multiple source clock -inputs. The quality of these sources are compared to user-defined priority -values and the best available source is selected to set the system time. - -The quality of the configured sources is continuously monitored by ``phc2sys`` -application and will select a new best source if the current source degrades -or if another source becomes higher quality. - -**See**: :ref:`Redundant System Clock Synchronization `. - -******************************************************* -Configure Intel E810 NICs using Intel Ethernet Operator -******************************************************* - -You can install and use **Intel Ethernet** operator to orchestrate and manage -the configuration and capabilities provided by Intel E810 Series network -interface cards (NICs). - -**See**: :ref:`Configure Intel E810 NICs using Intel Ethernet Operator `. - -**************** -AppArmor Support -**************** - -AppArmor is a Mandatory Access Control (MAC) system built on Linux's LSM (Linux -Security Modules) interface. In practice, the kernel queries AppArmor before -each system call to know whether the process is authorized to do the given -operation. Through this mechanism, AppArmor confines programs to a limited set -of resources. - -AppArmor helps administrators in running a more secure kubernetes deployment -by restricting what operations containers/pods are allowed, and/or provide better -auditing through system logs. The access needed by a container/pod is -configured through profiles tuned to allow access such as Linux capabilities, -network access, file permissions, etc. - -**See**: :ref:`About AppArmor `. - -***************** -Support for Vault -***************** - -This release re-introduces support for Vault as it was intermittently -unavailable in |prod|. 
The supported version vault: 1.14.8 or later / -vault-k8s: 1.2.1 / helm-chart: 0.25.0 after the helm-v3 up-version to 3.6+ - -|prod| integrates open source Vault containerized security application -(Optional) into the |prod| solution, that requires |PVCs| as a storage -backend to be enabled. - -**See**: :ref:`Vault Overview `. - -********************* -Support for Portieris -********************* - -|prod| now supports version 0.13.10. Portieris is an open source Kubernetes -admission controller which ensures only policy-compliant images, such as signed -images from trusted registries, can run. The Portieris application uses images -from the ``icr.io registry``. You must configure service parameters for the -``icr.io registry`` prior to applying the Portieris application, -see: :ref:`About Changing External Registries for StarlingX Installation `. -For Distributed Cloud deployments, the images must be present on the System -Controller registry. - -**See**: :ref:`Portieris Overview `. + The ice in-tree driver does not support SyncE/GNSS deployments. ************************** -Configurable Power Manager +Password Rules Enhancement ************************** -Configurable Power Manager focuses on containerized applications that use power -profiles individually by the core and/or the application. +You can check current password expiry settings by running the +:command:`chage -l ` command replacing ```` with the name +of the user whose password expiry settings you wish to view. -|prod| has the capability to regulate the frequency of the entire processor. -However, this control is primarily directed towards the classification of the -core, distinguishing between application and platform cores. Consequently, if a -user requires to control over an individual core, such as Core 10 in a -24-core CPU, adjustments must be applied to all cores collectively. In the -context of containerized operations, it becomes imperative to establish -personalized configurations. This entails assigning each container the -requisite power configuration. In essence, this involves providing specific and -individualized power configurations to each core or group of cores. +You can also change password expiry settings by running the +:command:`sudo chage -M ` command. -**See**: :ref:`Configurable Power Manager `. +Use the following new password rules as listed below: -****************************************************** -Technology Preview - Install Power Metrics Application -****************************************************** +1. There should be a minimum length of 12 characters. -The Power Metrics app deploys two containers, cAdvisor and Telegraf that -collect metrics about hardware usage. +2. The password must contain at least one letter, one number, and one special + character. -**See**: :ref:`Install Power Metrics Application `. +3. Do not reuse the past 5 passwords. - -******************************************************* -Install Node Feature Discovery (NFD) |prod| Application -******************************************************* - -Node Feature Discovery (NFD) version 0.15.0 detects hardware features available -on each node in a kubernetes cluster and advertises those features using -Kubernetes node labels. This procedure walks you through the process of -installing the |NFD| |prod| Application. - -**See**: :ref:`Install Node Feature Discovery Application `. 
- -**************************************************************************** -Partial Disk (Transparent) Encryption Support via Software Encryption (LUKS) -**************************************************************************** - -A new encrypted filesystem using Linux Unified Key Setup (LUKS) is created -automatically on all hosts to store security-sensitive files. This is mounted -at '/var/luks/stx/luks_fs' and the files kept in '/var/luks/stx/luks_fs/controller' -directory are replicated between controllers. - -************************************************************* -K8s API/CLI OIDC (Dex) Authentication with Local LDAP Backend -************************************************************* - -|prod| offers |LDAP| commands to create and manage |LDAP| Linux groups as part -of a StarlingX local |LDAP| server (serving the local StarlingX cluster and, -in the case of Distributed Cloud, the entire Distribute Cloud System). - -StarlingX provides procedures to configure the **oidc-auth-apps** |OIDC| -Identity Provider (Dex) system application to use the StarlingX local |LDAP| -server (in addition to, or in place of the already supported remote Windows -Active Directory) to authenticate users of the Kubernetes API. +4. The Password expiration period should be defined by users, but by default + it is set to 90 days. **See**: -- :ref:`Overview of LDAP Servers ` -- :ref:`Create LDAP Linux Groups ` -- :ref:`Configure Kubernetes Client Access ` +- :ref:`linux-accounts-password-3dcad436dce4` -************************ -Create LDAP Linux Groups -************************ +- :ref:`starlingx-system-accounts-system-account-password-rules` -|prod| offers |LDAP| commands to create and manage |LDAP| Linux groups as part -of the `ldapscripts` library. +- :ref:`system-account-password-rules` +******************************************************************************* +Management Network Reconfiguration after Deployment Completion Phase 1 |AIO-SX| +******************************************************************************* -***************************************** -StarlingX OpenStack now supports Antelope -***************************************** - -Currently stx-openstack has been updated and now deploys OpenStack services -based on the Antelope release. - -******************* -Pod Security Policy -******************* - -|PSP| ONLY applies if running on Kubernetes v1.24 or earlier. |PSP| is -deprecated as of Kubernetes v1.21 and is removed in Kubernetes v1.25. -Instead of using |PSP|, you can enforce similar restrictions on Pods using -:ref:`Pod Security Admission Controller `. - -Since it has been introduced |PSP| has had usability problems. The way |PSPs| -are applied to pods has proven confusing especially when trying to use them. -It is easy to accidentally grant broader permissions than intended, and -difficult to inspect which |PSPs| apply in a certain situation. Kubernetes -offers a built-in |PSA| controller that will replace |PSPs| in the future. - -************************************************* -|WAD| users sudo and local linux group assignment -************************************************* - -StarlingX 9.0 supports and provides procedures for centrally configured -Window Active Directory (WAD) Users with sudo access and local linux group -assignments; i.e. with only |WAD| configuration changes. +|prod| 10.0 supports changes to the management IP addresses +for a standalone |AIO-SX| and for an |AIO-SX| subcloud after the node is +completely deployed. 
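Before and after such a change, the platform networks and their address pools
can be reviewed from the CLI as a quick sanity check (shown for illustration
only; the full reconfiguration procedures are in the documents referenced
below):

.. code-block:: none

   ~(keystone_admin)$ system network-list
   ~(keystone_admin)$ system addrpool-list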
**See**: -- :ref:`Create LDAP Linux Accounts ` -- :ref:`Local LDAP Certificates ` -- :ref:`SSH User Authentication using Windows Active Directory ` +- :ref:`Manage Management Network Parameters for a Standalone AIO-SX ` + +- :ref:`Manage Subcloud Management Network Parameters ` + +**************************** +Networking Statistic Support +**************************** + +The Node Interface Metrics Exporter application is designed to fetch and +display node statistics in a Kubernetes environment. It deploys an Interface +Metrics Exporter DaemonSet on all nodes with the +``starlingx.io/interface-metrics=true node`` label. It uses the Netlink library +to gather data directly from the kernel, offering real-time insights into node +performance. + +**See**: :ref:`node-interface-metrics-exporter-application-d98b2707c7e9` + +***************************************************** +Add Existing Cloud as Subcloud Without Reinstallation +***************************************************** + +The subcloud enrollment feature converts a factory pre-installed system +or initially deployed as a standalone cloud system to a subcloud of a |DC|. +Factory pre-installation standalone systems must be installed locally in the +factory, and later deployed and configured on-site as a |DC| subcloud without +re-installing the system. + +**See**: :ref:`Enroll a Factory Installed Non Distributed Standalone System as a Subcloud ` + +******************************************** +Rook Support for freshly Installed StarlingX +******************************************** + +The new Rook Ceph application will be used for deploying the latest version of +Ceph via Rook. + +Rook Ceph is an orchestrator that provides a containerized solution for +Ceph Storage with a specialized Kubernetes Operator to automate the management +of the cluster. It is an alternative solution to the bare metal Ceph storage. +See https://rook.io/docs/rook/latest-release/Getting-Started/intro/ for more +details. + +The deployment model is the topology strategy that defines the storage backend +capabilities of the deployment. The deployment model dictates how the storage +solution will look like when defining rules for the placement of storage +cluster elements. + +Enhanced Availability for Ceph on AIO-DX +**************************************** + +Ceph on |AIO-DX| now works with 3 Ceph monitors providing High Availability and +enhancing uptime and resilience. + +Available Deployment Models +*************************** + +Each deployment model works with different deployment strategies and rules to +fit different needs. The following models available for the requirements of +your cluster are: + +- Controller Model (default) + +- Dedicated Model + +- Open Model + +**See**: :ref:`Deployment Models and Services for Rook Ceph ` + +Storage Backend +*************** + +Configuration of the storage backend defines the deployment models +characteristics and main configurations. + +Migration with Rook container based Ceph Installations +****************************************************** + +When you migrate an |AIO-SX| to an |AIO-DX| subcloud with Rook container-based +Ceph installations in |prod| 10.0, you would need to follow the +additional procedural steps below: + +.. rubric:: |proc| + +After you configure controller-1, follow the steps below: + +#. Add a new Ceph monitor on controller-1. + + .. code-block::none + + ~(keystone_admin)$ system host-fs-add controller-1 ceph= + +#. Add an |OSD| on controller-1. + + #. 
List host's disks and identify disks you want to use for Ceph |OSDs|. Ensure + you note the |UUIDs|. + + .. code-block::none + + ~(keystone_admin)$ system host-disk-list controller-1 + + #. Add disks as an |OSD| storage. + + .. code-block::none + + ~(keystone_admin)$ system host-stor-add controller-1 osd + + #. List |OSD| storage devices. + + .. code-block::none + + ~(keystone_admin)$ system host-stor-list controller-1 + +Unlock controller-1 and follow the steps below: + +#. Wait until Ceph is updated with two active monitors. To verify the updates, + run the :command:`ceph -s` command and ensure the output shows + `mon: 2 daemons, quorum a,b`. This confirms that both monitors are active. + + .. code-block::none + + ~(keystone_admin)$ ceph -s + cluster: + id: c55813c6-4ce5-470b-b9f5-e3c1fa0c35b1 + health: HEALTH_WARN + insufficient standby MDS daemons available + services: + mon: 2 daemons, quorum a,b (age 2m) + mgr: a(active, since 114s), standbys: b + mds: 1/1 daemons up + osd: 4 osds: 4 up (since 46s), 4 in (since 65s) + +#. Add the floating monitor. + + .. code-block::none + + ~(keystone_admin)$ system host-lock controller-1 + ~(keystone_admin)$ system controllerfs-add ceph-float= + ~(keystone_admin)$ system host-unlock controller-1 + + Wait for the controller to reset and come back up to an operational state. + +#. Re-apply the ``rook-ceph`` application. + + .. code-block::none + + ~(keystone_admin)$ system application-apply rook-ceph + +To Install and Uninstall Rook Ceph +********************************** + +**See**: + +- :ref:`Install Rook Ceph ` + +- :ref:`Uninstall Rook Ceph ` -******************************************* -Subcloud Error Root Cause Correction Action -******************************************* +Performance Configurations on Rook Ceph +*************************************** -This feature provides a root cause analysis of the subcloud -deployment / upgrade failure. This includes: +When using Rook Ceph it is important to consider resource allocation and +configuration adjustments to ensure optimal performance. Rook introduces +additional management overhead compared to a traditional bare-metal Ceph setup +and needs more infrastructure resources. -- existing 'deploy_status' that provides progress through phases of subcloud - deployment and, on error, the phase that failed +**See**: :ref:`performance-configurations-rook-ceph-9e719a652b02` -- introduces ``deploy_error_desc`` attribute that provides a summary of the - key deployment/upgrade errors +********************************************************************************** +Protecting against L2 Network Attackers - Securing local traffic on MGMT networks +********************************************************************************** -- Additional text that is added at the end of the 'deploy_error_desc' error - message, with information on: +A new security solution is introduced for |prod-long| inter-host management +network: - - trouble shooting commands +- Attackers with direct access to local |prod-long| L2 VLANs - - root cause of the errors and + - specifically protect LOCAL traffic on the MGMT network which is used for + private/internal infrastructure management of the |prod-long| cluster. 
- - suggested recovery action +- Protection against both passive and active attackers accessing private/internal + data, which could risk the security of the cluster -**See**: :ref:`Manage Subclouds Using the CLI ` + - passive attackers that are snooping traffic on L2 VLANs (MGMT), and + - active attackers attempting to connect to private internal endpoints on + |prod-long| L2 interfaces (MGMT) on |prod| hosts. -************************************ -Patch Orchestration Phase Operations -************************************ +IPsec is a set of communication rules or protocols for setting up secure +connections over a network. |prod| utilizes IPsec to protect local traffic +on the internal management network of multi-node systems. -The distributed cloud patch orchestration has the option to separate the upload -from the apply, remove, install and reboot operations. This facilitates -performing the upload operations outside of the system maintenance window -to reduce the total execution time during the patch activation that occurs -during the maintenance window. With the separation of operations, systems can -be prestaged with the updates prior to applying the changes to the system. +|prod| uses strongSwan as the IPsec implementation. strongSwan is an +opensource IPsec solution. See https://strongswan.org/ for more details. -**See**: :ref:`Distributed Cloud Guide ` +For the most part, IPsec on |prod| is transparent to users. -**************************************************** -Long Latency Between System Controller and Subclouds -**************************************************** +**See**: -Rehoming procedure of a subcloud that has been powered off for a long period of -time will differ from the regular rehoming procedure. Based on how long the -subcloud has been offline, the platform certificates will expire and will -need to be regenerated. +- :ref:`IPsec Overview ` -**See**: :ref:`Rehoming Subcloud with Expired Certificates ` +- :ref:`Configure and Enable IPsec ` -************** -GEO Redundancy -************** +- :ref:`IPSec Certificates ` -|prod| may be deployed across a geographically distributed set of regions. A -region consists of a local Kubernetes cluster with local redundancy and access -to high-bandwidth, low-latency networking between hosts within that region. +- :ref:`IPSec CLIs ` -|prod-long| Distributed Cloud GEO redundancy configuration supports the ability -to recover from a catastrophic event that requires subclouds to be rehomed away -from the failed system controller site to the available site(s) which have -enough spare capacity. This way, even if the failed site cannot be restored in -short time, the subclouds can still be rehomed to available peer system -controller(s) for centralized management. +********************************************************** +Vault application support for running on application cores +********************************************************** -In this release, the following items are addressed: +By default the Vault application's pods will run on platform cores. -* 1+1 GEO redundancy +"If ``static kube-cpu-mgr-policy`` is selected and when overriding the label +``app.starlingx.io/component`` for Vault namespace or pods, there are two requirements: - - Active-Active redundancy model - - Total number of subclouds should not exceed 1K +- The Vault server pods need to be restarted as directed by Hashicorp Vault + documentation. Restart each of the standby server pods in turn, then restart + the active server pod. 
-* Automated operations +- Ensure that sufficient hosts with worker function are available to run the + Vault server pods on application cores. - - Synchronization and liveness check between peer systems - - Alarm generation if peer system controller is down +**See**: -* Manual operations +- :ref:`Kubernetes CPU Manager Policies `. - - Batch rehoming from alive peer system controller +- :ref:`System backup, System and Storage Restore `. -**See**: :ref:`GEO Redundancy ` +- :ref:`Run Hashicorp Vault Restore Playbook Remotely `. -******************************** -Redfish Virtual Media Robustness -******************************** +- :ref:`Run Hashicorp Vault Restore Playbook Locally on the Controller `. -Redfish virtual media operations has been observed to frequently fail with -transient errors. While the conditions for those failures are not always known -(network, BMC timeouts, etc), it has been observed that if the Subcloud install -operation is retried, the operation is successful. +Restart the Vault Server pods +***************************** -To alleviate the transient conditions, the robustness of the Redfish virtual -media controller (RVMC) is improved by introducing additional error -handling and retry attempts. +The Vault server pods do not restart automatically. If the pods are to be +re-labelled to switch execution from platform to application cores, or vice-versa, +then the pods need to be restarted. -**See**: :ref:`Install a Subcloud Using Redfish Platform Management Service ` +Under Kubernetes the pods are restarted using the :command:`kubectl delete pod` +command. See, Hashicorp Vault documentation for the recommended procedure for +restarting server pods in |HA| configuration, +https://support.hashicorp.com/hc/en-us/articles/23744227055635-How-to-safely-restart-a-Vault-cluster-running-on-Kubernetes. -.. end-new-features-r9 +Ensure that sufficient hosts are available to run the server pods on application cores +************************************************************************************** + +The standard cluster with less than 3 worker nodes does not support Vault |HA| +on the application cores. In this configuration (less than three cluster hosts +with worker function): + +- When setting label app.starlingx.io/component=application with the Vault + app already applied in |HA| configuration (3 Vault server pods), ensure that + there are 3 nodes with worker function to support the |HA| configuration. + +- When applying Vault for the first time and with ``app.starlingx.io/component`` + set to "application": ensure that the server replicas is also set to 1 for + non-HA configuration. The replicas for Vault server are overriden both for + the Vault Helm chart and the Vault manager Helm chart: + + .. 
code-block:: none + + cat < vault_overrides.yaml + server: + extraLabels: + app.starlingx.io/component: application + ha: + replicas: 1 + injector: + extraLabels: + app.starlingx.io/component: application + EOF + + cat < vault-manager_overrides.yaml + manager: + extraLabels: + app.starlingx.io/component: application + server: + ha: + replicas: 1 + EOF + + $ system helm-override-update vault vault vault --values vault_overrides.yaml + + $ system helm-override-update vault vault-manager vault --values vault-manager_overrides.yaml + +****************************************************** +Component Based Upgrade and Update - VIM Orchestration +****************************************************** + +|VIM| Patch Orchestration in StarlingX 10.0 has been updated to interwork with +the new underlying Unified Software Management APIs. + +As before, |VIM| Patch Orchestration automates the patching of software across +all hosts of a Cloud configuration. All Cloud configurations are supported; +|AIO-SX|, |AIO-DX|, |AIO-DX| with worker nodes, Standard configuration with controller +storage and Standard configuration with dedicated storage. + +.. note:: + + This includes the automation of both applying a Patch and removing a Patch. + +**See** + +- :ref:`orchestrated-deployment-host-software-deployment-d234754c7d20` + +- :ref:`orchestrated-removal-host-software-deployment-3f542895daf8` . + +********************************************************** +Subcloud Remote Install, Upgrade and Prestaging Adaptation +********************************************************** + +StarlingX 10.0 supports software management upgrade/update process +that does not require re-installation. The procedure for upgrading a system is +simplified since the existing filesystem and associated release configuration +will remain intact in the versioned controlled paths (e.g. /opt/platform/config/). +In addition the /var and /etc directories is retained, indicating that +updates can be done directly as part of the software migration procedure. This +eliminates the need to perform a backup and restore procedure for |AIO-SX| +based systems. In addition, the rollback procedure can revert to the +existing versioned or saved configuration in the event an error occurs +if the system must be reverted to the older software release. + +With this change, prestaging for an upgrade will involve populating a new ostree +deployment directory in preparation for an atomic upgrade and pulling new container +image versions into the local container registry. Since the system is not +reinstalled, there is no requirement to save container images to a protected +partition during the prestaging process, the new container images can be +populated in the local container registry directly. + +**See**: :ref:`prestage-a-subcloud-using-dcmanager-df756866163f` + +******************************************************** +Update Default Certificate Configuration on Installation +******************************************************** + +You can configure default certificates during install for both standalone and +Distributed Cloud systems. + +**New bootstrap overrides for system-local-ca (Platform Issuer)** + +- You can customize the Platform Issuer (system-local-ca) used to sign the platform + certificates with an external Intermediate |CA| from bootstrap, using the new + bootstrap overrides. + + **See**: :ref:`Platform Issuer ` + + .. note:: + + It is recommended to configure these overrides. 
If it is not configured, + ``system-local-ca`` will be configured using a local auto-generated + Kubernetes Root |CA|. + +**REST API / Horizon GUI and Docker Registry certificates are issued during bootstrap** + +- The certificates for StarlingX REST APIs / Horizon GUI access and Local + Docker Registry will be automatically issued by ``system-local-ca`` during + bootstrap. They will be anchored to ``system-local-ca`` Root CA public + certificate, so only this certificate needs to be added in the user list of + trusted CAs. + +**HTTPS enabled by default for StarlingX REST API access** + +- The system is now configured by default with HTTPS enabled for access to + StarlingX API and the Horizon GUI. The certificate used to secure this will be + anchored to ``system-local-ca`` Root |CA| public certificate. + +**Playbook to update system-local-ca and re-sign the renamed platform certificates** + +- The ``migrate_platform_certificates_to_certmanager.yml`` playbook is renamed + to ``update_platform_certificates.yml``. + +**External certificates provided in bootstrap overrides can now be provided as +base64 strings, such that they can be securely stored with Ansible Vault** + +- The following bootstrap overrides for certificate data **CAN** be provided as + the certificate / key converted into single line base64 strings instead of the + filepath for the certificate / key: + + - ssl_ca_cert + + - k8s_root_ca_cert and k8s_root_ca_key + + - etcd_root_ca_cert and etcd_root_ca_key + + - system_root_ca_cert, system_local_ca_cert and system_local_ca_key + + .. note:: + + You can secure the certificate data in an encrypted bootstrap + overrides file using Ansible Vault. + + The base64 string can be obtained using the :command:`base64 -w0 ` + command. The string can be included in the overrides YAML file + (secured via Ansible Vault), then insecurely managed ``cert_file`` + can be removed from the system. + +*************************************************** +Dell CSI Driver Support - Test with Dell PowerStore +*************************************************** + +|prod| 10.0 supports a new system application to support +kubernetes CSM/CSI for Dell Storage Platforms. With this application the user +can communicate with Dell PowerScale, PowerMax, PowerFlex, PowerStore and +Unity XT Storage Platforms to provision |PVCs| and use them on kubernetes +stateful applications. + +**See**: :ref:`Dell Storage File System Provisioner ` +for details on installation and configurations. + +************************************************ +O-RAN O2 IMS and DMS Interface Compliancy Update +************************************************ + +With the new updates in Infrastructure Management Services (IMS) and +Deployment Management Services (DMS) the J-release for O-RAN O2, OAuth2 and mTLS +are mandatory options. It is fully compliant with latest O-RAN spec O2 IMS +interface R003 -v05.00 version and O2 DMS interface K8s profile - R003-v04.00 +version. Kubernetes Secrets are no longer required. + +The services implemented include: + +- O2 API with mTLS enabled + +- O2 API supported OAuth2.0 + +- Compliance with O2 IMS and DMS specs + +**See**: :ref:`oran-o2-application-b50a0c899e66` + +*************************************************** +Configure Liveness Probes for PTP Notification Pods +*************************************************** + +Helm overrides can be used to configure liveness probes for ``ptp-notification`` +containers. 
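The workflow follows the usual Helm override pattern. The override keys, chart
name, and namespace below are placeholders for illustration only; the
parameters that are actually supported are listed in the section referenced
below.

.. code-block:: none

   # Hypothetical values file; key names are illustrative only.
   cat <<EOF > ptp-liveness-overrides.yaml
   livenessProbe:
     initialDelaySeconds: 10
     periodSeconds: 30
   EOF

   ~(keystone_admin)$ system helm-override-update ptp-notification ptp-notification \
   notification --values ptp-liveness-overrides.yaml
   ~(keystone_admin)$ system application-apply ptp-notification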
+ +**See**: :ref:`configure-liveness-probes` + +************************* +Intel QAT and GPU Plugins +************************* + +The |QAT| and |GPU| applications provide a set of plugins developed by Intel +to facilitate the use of Intel hardware features in Kubernetes clusters. +These plugins are designed to enable and optimize the use of Intel-specific +hardware capabilities in a Kubernetes environment. + +Intel |GPU| plugin enables Kubernetes clusters to utilize Intel GPUs for +hardware acceleration of various workloads. + +Intel® QuickAssist Technology (Intel® QAT) accelerates cryptographic workloads +by offloading the data to hardware capable of optimizing those functions. + +The following QAT and GPU plugins are supported in |prod| 10.0. + +**See**: + +- :ref:`intel-device-plugins-operator-application-overview-c5de2a6212ae` + +- :ref:`gpu-device-plugin-configuration-615e2f6edfba` + +- :ref:`qat-device-plugin-configuration-616551306371` + +****************************************** +Support for Sapphire Rapids Integrated QAT +****************************************** + +Intel 4th generation Xeon Scalable Processor (Sapphire Rapids) support has been +introduced for the |prod| 10.0. + +- Drivers for QAT Gen 4 Intel Xeon Gold Scalable processor (Sapphire Rapids) + + - Intel Xeon Gold 6428N + +************************************************** +Sapphire Rapids Data Streaming Accelerator Support +************************************************** + +Intel® |DSA| is a high-performance data copy and transformation accelerator +integrated into Intel® processors starting with 4th Generation Intel® Xeon® +processors. It is targeted for optimizing streaming data movement and +transformation operations common with applications for high-performance +storage, networking, persistent memory, and various data processing +applications. + +**See**: :ref:`data-streaming-accelerator-db88a67c930c` + +************************* +DPDK Private Mode Support +************************* + +For the purpose of enabling and using ``needVhostNet``, |SRIOV| needs to be +configured on a worker host. + +**See**: :ref:`provisioning-sr-iov-interfaces-using-the-cli` + +****************************** +|SRIOV| |FEC| Operator Support +****************************** + +|FEC| Operator 2.9.0 is adopted based on Intel recommendations offering features +for various Intel hardware accelerators used for field deployments. + +**See**: :ref:`configure-sriov-fec-operator-to-enable-hw-accelerators-for-hosted-vran-containarized-workloads` + +****************************************************** +Support for Advanced VMs on Stx Platform with KubeVirt +****************************************************** + +The KubeVirt system application kubevirt-app-1.1.0 in |prod-long| includes: +KubeVirt, Containerized Data Importer (CDI) v1.58.0, and the Virtctl client tool. +|prod| 10.0 supports enhancements for this application, describes +the Kubevirt architecture with steps to install Kubevirt and provides examples +for effective implementation in your environment. + +**See**: + +- :ref:`index-kubevirt-f1bfd2a21152` + +*************************************************** +Support Harbor Registry (Harbor System Application) +*************************************************** + +Harbor registry is integrated as a System Application. End users can use Harbor, +running on |prod-long|, for holding and managing their container images. The +Harbor registry is currently not used by the platform. 
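To give a sense of typical end-user usage once Harbor is applied and a project
has been created (the registry FQDN and project name below are placeholders,
not platform defaults), images are managed with standard Docker commands:

.. code-block:: none

   # Authenticate, tag, and push an image to a Harbor project.
   docker login harbor.example.com
   docker tag busybox:1.36 harbor.example.com/myproject/busybox:1.36
   docker push harbor.example.com/myproject/busybox:1.36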
+ +Harbor is an open-source registry that secures artifacts with policies and +role-based access control, ensures images are scanned and free from +vulnerabilities, and signs images as trusted. Harbor has been evolved to a +complete |OCI| compliant cloud-native artifact registry. + +With Harbor V2.0, users can manage images, manifest lists, Helm charts, +|CNABs|, |OPAs| among others which all adhere to the |OCI| image specification. +It also allows for pulling, pushing, deleting, tagging, replicating, and +scanning such kinds of artifacts. Signing images and manifest list are also +possible now. + +.. note:: + + When using local |LDAP| for authentication of the Harbor system application, + you cannot use local |LDAP| groups for authorization; use only individual + local |LDAP| users for authorization. + +**See**: :ref:`harbor-as-system-app-1d1e3ec59823` + +************************** +Support for DTLS over SCTP +************************** + +DTLS (Datagram Transport Layer Security) v1.2 is supported in |prod| 10.0. + +1. The |SCTP| module is now autoloaded by default. + +2. The socket buffer size values have been upgraded: + + Old values (in Bytes): + + - net.core.rmem_max=425984 + + - net.core.wmem_max=212992 + + New Values (In Bytes): + + - net.core.rmem_max=10485760 + + - net.core.wmem_max=10485760 + +3. To enable each |SCTP| socket association to have its own buffer space, the + socket accounting policies have been updated as follows: + + - net.sctp.sndbuf_policy=1 + + - net.sctp.rcvbuf_policy=1 + + Old value: + + - net.sctp.auth_enable=0 + + New value: + + - net.sctp.auth_enable=1 + +*********************************************************** +Banner Information Automation during Subcloud Bootstrapping +*********************************************************** + +Users can now customize and automate banner information for subclouds during +system commissioning and installation. + +You can customize the pre-login message (issue) and post-login |MOTD| across +the entire |prod| cluster during system commissioning and installation. + +**See**: :ref:`Brand the Login Banner During Commissioning ` + +.. end-new-features-r10 ---------------- Hardware Updates @@ -504,228 +1190,748 @@ Fixed bugs ********** This release provides fixes for a number of defects. Refer to the StarlingX bug -database to review the R9.0 `Fixed Bugs `_. +database to review the R10.0 `Fixed Bugs `_. -.. All please confirm if any Limitations need to be removed / added for Stx 9.0. +.. All please confirm if any Limitations need to be removed / added for Stx 10.0. ---------------------------------- -Known Limitations and Workarounds ---------------------------------- +---------------------------------------- +Known Limitations and Procedural Changes +---------------------------------------- -The following are known limitations you may encounter with your |prod| Release -9.0 and earlier releases. Workarounds are suggested where applicable. +The following are known limitations you may encounter with your |prod| 10.0 +and earlier releases. Workarounds are suggested where applicable. .. note:: These limitations are considered temporary and will likely be resolved in a future release. -************************************************ -Suspend/Resume on VMs with SR-IOV (direct) Ports -************************************************ +.. 
contents:: |minitoc| + :local: + :depth: 1 -When using VMs with SR-IOV ports created with the -vnic-type=direct option -after a Suspend action, if one wants to Resume the instance it might come up -with all virtual NICs created but missing the IP Address of the vNIC connected -to the SR-IOV port. +************************************ +Ceph Daemon Crash and Health Warning +************************************ -**Workaround**: Manually Power-Off and Power-On (or Hard-Reboot) the instance -and the IP should be assigned correctly again (no information is lost). +After a Ceph daemon crash, an alarm is displayed to verify Ceph health. -.. Cole please +Run ``ceph -s`` to display the following message: -***************************************** -Error on Restoring OpenStack after Backup -***************************************** +.. code-block:: -The ansible command for restoring the app will fail with |prod-long| Release 9.0 -with an error message mentioning the absence of an Armada directory. + cluster: + id: + health: HEALTH_WARN + 1 daemons have recently crashed -**Workaround**: Manually change the backup tarball adding the Armada directory -using the following the steps: +One or more Ceph daemons have crashed, and the crash has not yet been +archived or acknowledged by the administrator. + +**Procedural Changes**: To archive the crash, clear the health check warning +and the alarm. + +1. List the timestamp/uuid crash-ids for all newcrash information: + + .. code-block:: none + + [sysadmin@controller-0 ~(keystone_admin)]$ ceph crash ls-new + +2. Display details of a saved crash. + + .. code-block:: none + + [sysadmin@controller-0 ~(keystone_admin)]$ ceph crash info + +3. Archive the crash so it no longer appears in ``ceph crash ls-new`` output. + + .. code-block:: none + + [sysadmin@controller-0 ~(keystone_admin)]$ ceph crash archive + +4. After archiving the crash, make sure the recent crash is not displayed. + + .. code-block:: none + + [sysadmin@controller-0 ~(keystone_admin)]$ ceph crash ls-new + +5. If more than one crash needs to be archived run the following command. + + .. code-block:: none + + [sysadmin@controller-0 ~(keystone_admin)]$ ceph crash archive-all + +******************************** +Rook Ceph Application Limitation +******************************** + +After applying Rook Ceph application in an |AIO-DX| configuration the +``800.001 - Storage Alarm Condition: HEALTH_WARN`` alarm may be triggered. + +**Procedural Changes**: Restart the pod of the monitor associated with the +slow operations detected by Ceph. Check ``ceph -s``. + + +********************************************************************* +Subcloud failed during rehoming while creating RootCA update strategy +********************************************************************* + +Subcloud rehoming may fail while creating the RootCA update strategy. + +**Proceudral Changes**: Delete the subcloud from the new System Controller and +rehome it again. + +************************************************** +RSA required to be the platform issuer private key +************************************************** + +The ``system-local-ca`` issuer needs to use RSA type certificate/key. The usage +of other types of private keys is currently not supported during bootstrap +or with the ``Update system-local-ca or Migrate Platform Certificates to use +Cert Manager`` procedures. + +**Proceudral Changes**: N/A. 
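+
+For example, before providing a certificate/key pair for ``system-local-ca`` in
+the bootstrap overrides, you can confirm that the key is an RSA key with
+``openssl``. The file names below are placeholders for your own certificate and
+key files:
+
+.. code-block:: none
+
+    # The certificate should report an RSA public key
+    $ openssl x509 -in system-local-ca.crt -noout -text | grep "Public Key Algorithm"
+            Public Key Algorithm: rsaEncryption
+
+    # The private key should load as a valid RSA key
+    $ openssl rsa -in system-local-ca.key -check -noout
+    RSA key ok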
+
+*****************************************************
+Host lock/unlock may interfere with application apply
+*****************************************************
+
+Host lock and unlock operations may interfere with applications that are in
+the applying state.
+
+**Procedural Changes**: Re-applying or removing / installing applications may be
+required. Application status can be checked using the :command:`system application-list`
+command.
+
+****************************************************
+Add / delete operations on pods may result in errors
+****************************************************
+
+Under some circumstances, add / delete operations on pods may result in pods
+staying in ContainerCreating/Terminating state and reporting
+``error getting ClusterInformation: connection is unauthorized: Unauthorized``.
+This error may also prevent users from locking the host.
+
+**Procedural Changes**: If this error occurs, run the
+:command:`kubectl describe pod -n ` command. The following
+message is displayed:
+
+``error getting ClusterInformation: connection is unauthorized: Unauthorized``
+
+.. note::
+
+    There is a known issue with the Calico CNI that may occur on rare
+    occasions if the Calico token required for communication with the
+    kube-apiserver becomes out of sync due to |NTP| skew or issues refreshing
+    the token.
+
+**Procedural Changes**: Delete the calico-node pod (causing it to automatically
+restart) using the following commands:
 
 .. code-block:: none
 
-    tar -xzf wr_openstack_backup_file.tgz # this will create a opt directory
-    cp -r opt/platform/fluxcd/ opt/platform/armada # copy fluxd to armada
-    tar -czf new_wr-openstack_backu.tgz opt/ # tar the opt directory into a new backup tarball
+    $ kubectl get pods -n kube-system --show-labels | grep calico
 
-*****************************************
-Subcloud Upgrade with Kubernetes Versions
-*****************************************
+    $ kubectl delete pods -n kube-system -l k8s-app=calico-node
 
-Subcloud Kubernetes versions are upgraded along with the System Controller.
-You can add a new subcloud while the System Controller is on intermediate
-versions of Kubernetes as long as the needed k8s images are available at the
-configured sources.
 
-**Workaround**: In a Distributed Cloud configuration, when upgrading from
-|prod-long| Release 7.0 the Kubernetes version is v1.23.1. The default
-version of the new install for Kubernetes is v1.24.4. Kubernetes must be
-upgraded one version at a time on the System Controller.
+******************************************
+Deploy does not fail after a system reboot
+******************************************
+
+A software deploy that is interrupted by a system reboot does not
+automatically transition to a failed state.
+
+**Procedural Changes**: Run the
+:command:`sudo software-deploy-set-failed --hostname/-h --confirm`
+utility to manually move the deploy and deploy host to a failed state when the
+deployment was interrupted by a failover, power loss, network outage, etc. You
+can only run this utility with root privileges on the active controller.
+
+The utility displays the current state, warns the user about the next steps to
+be taken if execution is to continue, and then displays the new states and the
+next operation to be executed.
+
+*********************************
+Rook-ceph application limitations
+*********************************
+
+This section documents known limitations you may encounter with the rook-ceph
+application, along with procedural changes that you can use to resolve each
+issue.
+ +**Remove all OSDs in a host** + +The procedure to remove |OSDs| will not work as expected when removing all +|OSDs| from a host. The Ceph cluster gets stuck in ``HEALTH_WARN`` state. .. note:: - New subclouds should not be added until the System Controller has been - upgraded to Kubernetes v1.24.4. -**************************************************** -AIO-SX Restore Fails during puppet-manifest-apply.sh -**************************************************** + Use the Procedural change only if the cluster is stuck in ``HEALTH_WARN`` + state after removing all |OSDs| on a host. -Restore fails using a backup file created after a fresh install. +**Procedural Changes**: -**Workaround**: During the restore process, after reinstalling the controllers, -the |OAM| interface must be configured with the same IP address protocol version -used during installation. +1. Check the cluster health status. + .. code-block::none -************************************************************************** -Subcloud Controller-0 is in a degraded state after upgrade and host unlock -************************************************************************** + $ ceph status -During an upgrade orchestration of the subcloud from |prod-long| Release 7.0 -to |prod-long| Release 8.0, and after host unlock, the subcloud is in a -``degraded`` state, and alarm 200.004 is raised, displaying -"controller-0 experienced a service-affecting failure. Auto-recovery in progress". +2. Check crushmap tree. -**Workaround**: You can recover the subcloud to the ``available`` state by -locking and unlocking controller-0 . + .. code-block::none -*********************************************************************** -Limitations when using Multiple Driver Versions for the same NIC Family -*********************************************************************** + $ ceph osd tree -The capability to support multiple NIC driver versions has the following -limitations: +3. Remove the host(s) that are empty in the command executed before -- Intel NIC family supports only: ice, i40e and iavf drivers + .. code-block::none -- Driver versions must respect the compatibility matrix between drivers + $ ceph crush remove -- Multiple driver versions cannot be loaded simultaneously and applies to the - entire system. +4. Check the cluster health status. -- Latest driver version will be loaded by default, unless specifically - configured to use a legacy driver version. + .. code-block::none -- Drivers used by the installer will always use the latest version, - therefore firmware compatibility must support basic NIC operations for each - version to facilitate installation + $ ceph status -- Host reboot is required to activate the configured driver versions +**Use the rook-ceph apply command when a host with OSD is in offline state** -- For Backup and Restore, the host must be rebooted a second time for - in order to activate the drivers versions. +The **rook-ceph apply** will not allocate the |OSDs| correctly if the host is +offline. -**Workaround**: NA +.. note:: -***************** -Quartzville Tools -***************** + Use either of the procedural changes below only if the |OSDs| are not + allocated in the Ceph cluster. -The following :command:`celo64e` and :command:`nvmupdate64e` commands are not -supported in |prod-long|, Release 8.0 due to a known issue in Quartzville -tools that crashes the host. +**Procedural Changes 1**: -**Workaround**: Reboot the host using the boot screen menu. +1. Check if the |OSD| is not in crushmap tree. 
-************************************************* -Controller SWACT unavailable after System Restore -************************************************* + .. code-block::none -After performing a restore of the system, the user is unable to swact the -controller. + $ ceph osd tree -**Workaround**: NA +2. Restart the rook-ceph operator pod. -************************************************************* -Intermittent Kubernetes Upgrade failure due to missing Images -************************************************************* + .. code-block::none -During a Kubernetes upgrade, the upgrade may intermittently fail when you run -:command:`system kube-host-upgrade control-plane` due to the -containerd cache being cleared. + $ kubectl -n rook-ceph scale deployment rook-ceph-operator --replicas=0 -**Workaround**: If the above failure is encountered, run the following commands -on the host encountering the failure: + $ kubectl -n rook-ceph scale deployment rook-ceph-operator --replicas=1 -.. rubric:: |proc| + .. note:: -#. Ensure the failure is due to missing images by running ``crictl images`` and - confirming the following are not present: + Wait for about 5 minutes to let the operator to try to recoever the |OSDs|. - .. code-block:: +3. Check if the |OSDS| have been added in crushmap tree. - registry.local:9001/k8s.gcr.io/kube-apiserver:v1.24.4 - registry.local:9001/k8s.gcr.io/kube-controller-manager:v1.24.4 - registry.local:9001/k8s.gcr.io/kube-scheduler:v1.24.4 - registry.local:9001/k8s.gcr.io/kube-proxy:v1.24.4 + .. code-block::none -#. Manually pull the image into containerd cache by running the following - commands, replacing ```` with your password for the admin - user. + $ ceph sd tree - .. code-block:: +**Procedural Changes 2**: - ~(keystone_admin)]$ crictl pull --creds admin: registry.local:9001/k8s.gcr.io/kube-apiserver:v1.24.4 - ~(keystone_admin)]$ crictl pull --creds admin: registry.local:9001/k8s.gcr.io/kube-controller-manager:v1.24.4 - ~(keystone_admin)]$ crictl pull --creds admin: registry.local:9001/k8s.gcr.io/kube-scheduler:v1.24.4 - ~(keystone_admin)]$ crictl pull --creds admin: registry.local:9001/k8s.gcr.io/kube-proxy:v1.24.4 +1. Check if the |OSD| is not in the crushmap tree OR it is in the crushmap tree + but not allocated in the correct location (within a host). -#. Ensure the images are present when running ``crictl images``. Rerun the - :command:`system kube-host-upgrade control-plane`` command. + .. code-block::none -*********************************** -Docker Network Bridge Not Supported -*********************************** + $ ceph osd tree -The Docker Network Bridge, previously created by default, is removed and no -longer supported in |prod-long| Release 8.0 as the default bridge IP address -collides with addresses already in use. +2. Lock the host -As a result, docker can no longer be used for running containers. This impacts -building docker images directly on the host. + .. code-block::none -**Workaround**: Create a Kubernetes pod that has network access, log in -to the container, and build the docker images. + $ system host-lock + Wait for the host to be locked. -************************************ -Impact of Kubenetes Upgrade to v1.24 -************************************ +3. Get the list from the |OSDs| inventory from the host. + + .. code-block::none + + $ system host-stor-list + +4. Remove the |OSDs| from the inventory. + + .. code-block::none + + $ system host-stor-delete + +5. Reapply the rook-ceph application. + + .. 
code-block:: none
+
+    $ system application-apply rook-ceph
+
+   Wait for |OSDs| prepare pods to be recreated.
+
+   .. code-block:: none
+
+     $ kubectl get pods -n rook-ceph -l app=rook-ceph-osd-prepare -w
+
+6. Add the |OSDs| in the inventory.
+
+   .. code-block:: none
+
+     $ system host-stor-add 
+
+7. Reapply the rook-ceph application.
+
+   .. code-block:: none
+
+     $ system application-apply rook-ceph
+
+   Wait for new |OSD| pods to be created and running.
+
+   .. code-block:: none
+
+     $ kubectl get pods -n rook-ceph -l app=rook-ceph-osd -w
+
+*****************************************************************************************************************
+Unable to set maximum VFs for NICs using out-of-tree ice driver v1.14.9.2 on systems with a large number of cores
+*****************************************************************************************************************
+
+On systems with a large number of cores (>= 32 physical cores / 64 threads),
+it is not possible to set the maximum number of |VFs| (32) for NICs using the
+out-of-tree ice driver v1.14.9.2.
+
+If the issue is encountered, the following error logs will be reported in kern.log:
+
+.. code-block:: none
+
+    [ 83.322344] ice 0000:51:00.1: Only 59 MSI-X interrupts available for SR-IOV. Not enough to support minimum of 2 MSI-X interrupts per VF for 32 VFs
+    [ 83.322362] ice 0000:51:00.1: Not enough resources for 32 VFs, err -28. Try with fewer number of VFs
+
+The impacted NICs are:
+
+- Intel E810
+
+- Silicom STS2
+
+**Procedural Changes**: Reduce the number of configured |VFs|. To determine the
+maximum number of supported |VFs|:
+
+- Check /sys/class/net//device/sriov_vf_total_msix.
+  Example:
+
+  .. code-block:: none
+
+    cat /sys/class/net/enp81s0f0/device/sriov_vf_total_msix
+    59
+
+- Calculate the maximum number of |VFs| as sriov_vf_total_msix / 2.
+  Example:
+
+  .. code-block:: none
+
+    max_VFs = 59/2 = 29
+
+*****************************************************************
+Critical alarm 800.001 after Backup and Restore on AIO-SX Systems
+*****************************************************************
+
+A Critical alarm 800.001 may be triggered after running the Restore
+Playbook. The alarm details are as follows:
+
+.. code-block:: none
+
+    ~(keystone_admin)]$ fm alarm-list
+    +-------+----------------------------------------------------------------------+--------------------------------------+----------+---------------+
+    | Alarm | Reason Text                                                          | Entity ID                            | Severity | Time Stamp    |
+    | ID    |                                                                      |                                      |          |               |
+    +-------+----------------------------------------------------------------------+--------------------------------------+----------+---------------+
+    | 800.  | Storage Alarm Condition: HEALTH_ERR. Please check 'ceph -s' for more | cluster=                             | critical | 2024-08-29T06 |
+    | 001   | details.                                                             | 96ebcfd4-3ea5-4114-b473-7fd0b4a65616 |          | :57:59.701792 |
+    |       |                                                                      |                                      |          |               |
+    +-------+----------------------------------------------------------------------+--------------------------------------+----------+---------------+
+
+**Procedural Changes**: To clear this alarm, run the following commands:
+
+.. note::
+
+   Applies only to |AIO-SX| systems.
+
+.. 
code-block:: none + + FS_NAME=kube-cephfs + METADATA_POOL_NAME=kube-cephfs-metadata + DATA_POOL_NAME=kube-cephfs-data + + # Ensure that the Ceph MDS is stopped + sudo rm -f /etc/pmon.d/ceph-mds.conf + sudo /etc/init.d/ceph stop mds + + # Recover MDS state from filesystem + ceph fs new ${FS_NAME} ${METADATA_POOL_NAME} ${DATA_POOL_NAME} --force + + # Try to recover from some common errors + sudo ceph fs reset ${FS_NAME} --yes-i-really-mean-it + + cephfs-journal-tool --rank=${FS_NAME}:0 event recover_dentries summary + cephfs-journal-tool --rank=${FS_NAME}:0 journal reset + cephfs-table-tool ${FS_NAME}:0 reset session + cephfs-table-tool ${FS_NAME}:0 reset snap + cephfs-table-tool ${FS_NAME}:0 reset inode + sudo /etc/init.d/ceph start mds + +******************************************************************************* +Error installing Rook Ceph on |AIO-DX| with host-fs-add before controllerfs-add +******************************************************************************* + +When you provision controller-0 manually prior to unlock, the following sequence +of commands fail: + +.. code-block:: none + + ~(keystone_admin)]$ system storage-backend-add ceph-rook --confirmed + ~(keystone_admin)]$ system host-fs-add controller-0 ceph=20 + ~(keystone_admin)]$ system controllerfs-add ceph-float=20 + +The following error occurs when you run the :command:`controllerfs-add` command: + +"Failed to create controller filesystem ceph-float: controllers have pending +LVG updates, please retry again later". + +**Procedural Changes**: To avoid this issue, run the commands in the following +sequence: + +.. code-block:: none + + ~(keystone_admin)]$ system storage-backend-add ceph-rook --confirmed + ~(keystone_admin)]$ controllerfs-add ceph-float=20 + ~(keystone_admin)]$ system host-fs-add controller-0 ceph=20 + +*********************************************************** +Intermittent installation of Rook-Ceph on Distributed Cloud +*********************************************************** + +While installing rook-ceph, if the installation fails, this is due to +``ceph-mgr-provision`` not being provisioned correctly. + +**Procedural Changes**: It is recommended to use +the :command:`system application-remove rook-ceph --force` to initiate rook-ceph +installation. + +******************************************************************** +Authorization based on Local LDAP Groups is not supported for Harbor +******************************************************************** + +When using Local |LDAP| for authentication of the new Harbor system application, +you cannot use Local |LDAP| Groups for authorization; you can only use individual +Local |LDAP| users for authorization. + +**Procedural Changes**: Use only individual Local LDAP users for specifying +authorization. + +*************************************************** +Vault application is not supported during Bootstrap +*************************************************** + +The Vault application cannot be configured during Bootstrap. + +**Procedural Changes**: + +The application must be configured after the platform nodes are unlocked / +enabled / available, a storage backend is configured, and ``platform-integ-apps`` +is applied. If Vault is to be run in |HA| configuration (3 vault server pods) +then at least three controller / worker nodes must be unlocked / enabled / available. 
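+
+For illustration only, the post-bootstrap sequence could look like the
+following sketch. It assumes the Vault application tarball has already been
+uploaded under the name ``vault``; application names and versions may differ
+on your system:
+
+.. code-block:: none
+
+    # Confirm that a storage backend is configured and platform-integ-apps is applied
+    ~(keystone_admin)]$ system storage-backend-list
+    ~(keystone_admin)]$ system application-show platform-integ-apps
+
+    # Apply the Vault application and monitor its status
+    ~(keystone_admin)]$ system application-apply vault
+    ~(keystone_admin)]$ system application-show vault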
+ +****************************************** +cert-manager cm-acme-http-solver pod fails +****************************************** + +On a multinode setup, when you deploy an acme issuer to issue a certificate, +the ``cm-acme-http-solver`` pod might fail and stays in "ImagePullBackOff" state +due to the following defect https://github.com/cert-manager/cert-manager/issues/5959. + +**Procedural Changes**: + +1. If you are using the namespace "test", create a docker-registry secret + "testkey" with local registry credentials in the "test" namespace. + + .. code-block:: none + + ~(keystone_admin)]$ kubectl create secret docker-registry testkey --docker-server=registry.local:9001 --docker-username=admin --docker-password=Password*1234 -n test + +2. Use the secret "testkey" in the issuer spec as follows: + + .. code-block:: none + + apiVersion: cert-manager.io/v1 + kind: Issuer + metadata: + name: stepca-issuer + namespace: test + spec: + acme: + server: https://test.com:8080/acme/acme/directory + skipTLSVerify: true + email: test@test.com + privateKeySecretRef: + name: stepca-issuer + solvers: + - http01: + ingress: + podTemplate: + spec: + imagePullSecrets: + - name: testkey + class: nginx + +************************************************************** +ptp-notification application is not supported during bootstrap +************************************************************** + +- Deployment of ``ptp-notification`` during bootstrap time is not supported due + to dependencies on the system |PTP| configuration which is handled + post-bootstrap. + + **Procedural Changes**: N/A. + +- The :command:`helm-chart-attribute-modify` command is not supported for + ``ptp-notification`` because the application consists of a single chart. + Disabling the chart would render ``ptp-notification`` non-functional. + See :ref:`sysconf-application-commands-and-helm-overrides` for details on + this command. + + **Procedural Changes**: N/A. + +****************************************** +Harbor cannot be deployed during bootstrap +****************************************** + +The Harbor application cannot be deployed during bootstrap due to the bootstrap +deployment dependencies such as early availability of storage class. + +**Procedural Changes**: N/A. + +******************** +Kubevirt Limitations +******************** + +The following limitations apply to Kubevirt in |prod| 10.0: + +- **Limitation**: Kubernetes does not provide CPU Manager detection. + + **Procedural Changes**: Add ``cpumanager`` to Kubevirt: + + .. code-block:: none + + apiVersion: kubevirt.io/v1 + kind: KubeVirt + metadata: + name: kubevirt + namespace: kubevirt + spec: + configuration: + developerConfiguration: + featureGates: + - LiveMigration + - Macvtap + - Snapshot + - CPUManager + + Check the label, using the following command: + + .. code-block:: none + + ~(keystone_admin)]$ kubectl describe node | grep cpumanager + + where `cpumanager=true` + +- **Limitation**: Huge pages do not show up under cat /proc/meminfo inside a + guest VM. Although, resources are being consumed on the host. For example, + if a VM is using 4GB of Huge pages, the host shows the same 4GB of huge + pages used. The huge page memory is exposed as normal memory to the VM. + + **Procedural Changes**: You need to configure Huge pages inside the guest + OS. + +See :ref:`Installation Guides ` for more details. + +- **Limitation**: Virtual machines using Persistent Volume Claim (PVC) must + have a shared ReadWriteMany (RWX) access mode to be live migrated. 
+
+  **Procedural Changes**: Ensure the |PVC| is created with the RWX access mode.
+
+  .. code-block::
+
+     $ virtctl image-upload --pvc-name=cirros-vm-disk-test-2 --pvc-size=500Mi --storage-class=cephfs --access-mode=ReadWriteMany --image-path=/home/sysadmin/Kubevirt-GA-testing/latest-manifest/kubevirt-GA-testing/cirros-0.5.1-x86_64-disk.img --uploadproxy-url=https://10.111.54.246 --insecure
+
+  .. note::
+
+     - Live migration is not allowed with a pod network binding of bridge
+       interface type ()
+
+     - Live migration requires ports 49152, 49153 to be available in the
+       virt-launcher pod. If these ports are explicitly specified in the
+       masquerade interface, live migration will not function.
+
+- For live migration with an |SRIOV| interface:
+
+  - specify networkData: in cloudinit, so that the VM does not lose its IP
+    configuration when it moves to another node
+
+  - specify the nameserver and internal |FQDNs| to connect to the cluster
+    metadata server, otherwise cloudinit will not work
+
+  - fix the MAC address, otherwise the MAC address will change when the VM
+    moves to another node and cause a problem establishing the link
+
+  Example:
+
+  .. code-block:: none
+
+     cloudInitNoCloud:
+       networkData: |
+         ethernets:
+           sriov-net1:
+             addresses:
+             - 128.224.248.152/23
+             gateway: 128.224.248.1
+             match:
+               macAddress: "02:00:00:00:00:01"
+             nameservers:
+               addresses:
+               - 10.96.0.10
+               search:
+               - default.svc.cluster.local
+               - svc.cluster.local
+               - cluster.local
+             set-name: sriov-link-enabled
+         version: 2
+
+- **Limitation**: Snapshot |CRDs| and controllers are not present by default
+  and need to be installed on |prod-long|.
+
+  **Procedural Changes**: To install snapshot |CRDs| and controllers on
+  Kubernetes, see:
+
+  - kubectl apply -f https://raw.githubusercontent.com/kubernetes-csi/external-snapshotter/master/client/config/crd/snapshot.storage.k8s.io_volumesnapshots.yaml
+
+  - kubectl apply -f https://raw.githubusercontent.com/kubernetes-csi/external-snapshotter/master/client/config/crd/snapshot.storage.k8s.io_volumesnapshotcontents.yaml
+
+  - kubectl apply -f https://raw.githubusercontent.com/kubernetes-csi/external-snapshotter/master/client/config/crd/snapshot.storage.k8s.io_volumesnapshotclasses.yaml
+
+  - kubectl apply -f https://raw.githubusercontent.com/kubernetes-csi/external-snapshotter/master/deploy/kubernetes/snapshot-controller/rbac-snapshot-controller.yaml
+
+  - kubectl apply -f https://raw.githubusercontent.com/kubernetes-csi/external-snapshotter/master/deploy/kubernetes/snapshot-controller/setup-snapshot-controller.yaml
+
+  Additionally, create a ``VolumeSnapshotClass`` for Cephfs and RBD:
+
+  .. code-block:: none
+
+     cat <<EOF > cephfs-storageclass.yaml
+     ---
+     apiVersion: snapshot.storage.k8s.io/v1
+     kind: VolumeSnapshotClass
+     metadata:
+       name: csi-cephfsplugin-snapclass
+     driver: cephfs.csi.ceph.com
+     parameters:
+       clusterID: 60ee9439-6204-4b11-9b02-3f2c2f0a4344
+       csi.storage.k8s.io/snapshotter-secret-name: ceph-pool-kube-cephfs-data
+       csi.storage.k8s.io/snapshotter-secret-namespace: default
+     deletionPolicy: Delete
+     EOF
+
+  .. code-block:: none
+
+     cat <<EOF > rbd-storageclass.yaml
+     ---
+     apiVersion: snapshot.storage.k8s.io/v1
+     kind: VolumeSnapshotClass
+     metadata:
+       name: csi-rbdplugin-snapclass
+     driver: rbd.csi.ceph.com
+     parameters:
+       clusterID: 60ee9439-6204-4b11-9b02-3f2c2f0a4344
+       csi.storage.k8s.io/snapshotter-secret-name: ceph-pool-kube-rbd
+       csi.storage.k8s.io/snapshotter-secret-namespace: default
+     deletionPolicy: Delete
+     EOF
+
+  .. 
note::
+
+     Get the cluster ID from: ``kubectl describe sc cephfs, rbd``
+
+- **Limitation**: Live migration is not possible when using a configmap as a
+  filesystem. Currently, virtual machine instances (VMIs) cannot be live
+  migrated because ``virtiofs`` does not support live migration.
+
+  **Procedural Changes**: N/A.
+
+- **Limitation**: Live migration is not possible when a VM is using a secret
+  exposed as a filesystem. Currently, virtual machine instances cannot be
+  live migrated because ``virtiofs`` does not support live migration.
+
+  **Procedural Changes**: N/A.
+
+- **Limitation**: Live migration will not work when a VM is using a
+  ServiceAccount exposed as a filesystem. Currently, VMIs cannot be live
+  migrated because ``virtiofs`` does not support live migration.
+
+  **Procedural Changes**: N/A.
+
+*************************************
+synce4l CLI options are not supported
+*************************************
+
+SyncE configuration using ``synce4l`` is not supported in |prod| 10.0.
+
+The ``synce4l`` service type in the :command:`ptp-instance-add` command is
+not supported in |prod-long| 10.0.
+
+**Procedural Changes**: N/A.
+
+***************************************************************************
+Kubernetes Pod Core Dump Handler may fail due to a missing Kubernetes token
+***************************************************************************
+
+In certain cases the Kubernetes Pod Core Dump Handler may fail due to a missing
+Kubernetes token, which disables per-pod coredump configuration and limits
+namespace access. If application coredumps are not being generated, verify
+whether the k8s-coredump token is empty in the configuration file
+``/etc/k8s-coredump-conf.json`` using the following command:
+
+.. code-block:: none
+
+    ~(keystone_admin)]$ sudo cat /etc/k8s-coredump-conf.json
+    {
+        "k8s_coredump_token": ""
+
+    }
+
+**Procedural Changes**: If the k8s-coredump token is empty in the configuration file and
+the kube-apiserver is verified to be responsive, users can re-execute the
+create-k8s-account.sh script in order to generate the appropriate token after a
+successful connection to kube-apiserver, using the following commands:
+
+.. code-block:: none
+
+    ~(keystone_admin)]$ sudo chmod +x /etc/k8s-coredump/create-k8s-account.sh
+
+    ~(keystone_admin)]$ sudo /etc/k8s-coredump/create-k8s-account.sh
+
+**Limitations from previous releases**
+
+*************************************
+Impact of Kubernetes Upgrade to v1.24
+*************************************
 
 In Kubernetes v1.24 support for the ``RemoveSelfLink`` feature gate was removed.
-In previous releases of |prod-long| this has been set to "false" for backward
+In previous releases of |prod| this has been set to "false" for backward
 compatibility, but this is no longer an option and it is now hardcoded to
 "true".
 
-**Workaround**: Any application that relies on this feature gate being disabled
-(i.e. assumes the existence of the "self link") must be updated before
+**Procedural Changes**: Any application that relies on this feature gate being disabled
+(i.e. assumes the existence of the "self link") must be updated before
 upgrading to Kubernetes v1.24.
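+
+As an illustrative check only (paths and resource names below are examples),
+you can search an application's manifests for references to the removed field,
+and confirm that the field is absent from API responses on a v1.24+ cluster:
+
+.. code-block:: none
+
+    # Any match indicates the application still assumes metadata.selfLink exists
+    $ grep -rn "selfLink" /path/to/application/charts
+
+    # On Kubernetes v1.24 and later the field is simply not returned
+    $ kubectl get pods -n kube-system -o jsonpath='{.items[0].metadata.selfLink}'
+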
- -******************************************************************* -Password Expiry Warning Message is not shown for LDAP user on login -******************************************************************* - -In |prod-long| Release 8.0, the password expiry warning message is not shown -for LDAP users on login when the password is nearing expiry. This is due to -the ``pam-sssd`` integration. - -**Workaround**: It is highly recommend that LDAP users maintain independent -notifications and update their passwords every 3 months. - -The expired password can be reset by a user with root privileges using -the following command: - -.. code-block::none - - ~(keystone_admin)]$ sudo ldapsetpasswd ldap-username - Password: - Changing password for user uid=ldap-username,ou=People,dc=cgcs,dc=local - New Password: - Retype New Password: - Successfully set password for user uid=ldap-username,ou=People,dc=cgcs,dc=local - ****************************************** Console Session Issues during Installation ****************************************** @@ -734,7 +1940,7 @@ After bootstrap and before unlocking the controller, if the console session time out (or the user logs out), ``systemd`` does not work properly. ``fm, sysinv and mtcAgent`` do not initialize. -**Workaround**: If the console times out or the user logs out between bootstrap +**Procedural Changes**: If the console times out or the user logs out between bootstrap and unlock of controller-0, then, to recover from this issue, you must re-install the ISO. @@ -742,37 +1948,16 @@ re-install the ISO. PTP O-RAN Spec Compliant Timing API Notification ************************************************ -.. Need the version for the .tgz tarball....Please confirm if this is applicable to stx 8.0? - -- The ptp-notification .tgz application tarball and the corresponding - notificationservice-base:stx8.0-v2.0.2 image are not backwards compatible - with applications using the ``v1 ptp-notification`` API and the corresponding - notificationclient-base:stx.8.0-v2.0.2 image. - - Backward compatibility will be provided in StarlingX Release 9.0. - - .. note:: - - For |O-RAN| Notification support (v2 API), deploy and use the - ``ptp-notification-.tgz`` application tarball. Instructions for this - can be found in the |prod-long| Release 8.0 documentation. - - **See**: - - - :ref:`install-ptp-notifications` - - - :ref:`integrate-the-application-with-notification-client-sidecar` - - The ``v1 API`` only supports monitoring a single ptp4l + phc2sys instance. - **Workaround**: Ensure the system is not configured with multiple instances + **Procedural Changes**: Ensure the system is not configured with multiple instances when using the v1 API. - The O-RAN Cloud Notification defines a /././sync API v2 endpoint intended to allow a client to subscribe to all notifications from a node. This endpoint - is not supported |prod-long| Release 8.0. + is not supported in StarlingX. - **Workaround**: A specific subscription for each resource type must be + **Procedural Changes**: A specific subscription for each resource type must be created instead. - ``v1 / v2`` @@ -781,8 +1966,7 @@ PTP O-RAN Spec Compliant Timing API Notification services can be queried/subscribed to. - v2: The API conforms to O-RAN.WG6.O-Cloud Notification API-v02.01 - with the following exceptions, that are not supported in |prod-long| - Release 8.0. + with the following exceptions, that are not supported in StarlingX. 
- O-RAN SyncE Lock-Status-Extended notifications @@ -790,9 +1974,7 @@ PTP O-RAN Spec Compliant Timing API Notification - O-RAN Custom cluster names - - /././sync endpoint - - **Workaround**: See the respective PTP-notification v1 and v2 document + **Procedural Changes**: See the respective PTP-notification v1 and v2 document subsections for further details. v1: https://docs.starlingx.io/api-ref/ptp-notification-armada-app/api_ptp_notifications_definition_v1.html @@ -806,18 +1988,7 @@ Upper case characters in host names cause issues with kubernetes labelling Upper case characters in host names cause issues with kubernetes labelling. -**Workaround**: Host names should be lower case. - -**************** -Debian Bootstrap -**************** - -On CentOS bootstrap worked even if **dns_servers** were not present in the -localhost.yml. This does not work for Debian bootstrap. - -**Workaround**: You need to configure the **dns_servers** parameter in the -localhost.yml, as long as no |FQDNs| were used in the bootstrap overrides in -the localhost.yml file for Debian bootstrap. +**Procedural Changes**: Host names should be lower case. *********************** Installing a Debian ISO @@ -829,7 +2000,7 @@ in emergency mode if the disks and disk partitions are not completely wiped before the install, especially if the server was previously running a CentOS ISO. -**Workaround**: When installing a lab for any Debian install, the disks must +**Procedural Changes**: When installing a system for any Debian install, the disks must first be completely wiped using the following procedure before starting an install. @@ -867,59 +2038,19 @@ the other parameters (``audit-log-path``, ``audit-log-maxsize``, ``audit-log-maxage`` and ``audit-log-maxbackup``) cannot be changed at runtime. -**Workaround**: NA +**Procedural Changes**: NA **See**: :ref:`kubernetes-operator-command-logging-663fce5d74e7`. -****************************************************************** -Installing subcloud with patches in Partial-Apply is not supported -****************************************************************** - -When a patch has been uploaded and applied, but not installed, it is in -a ``Partial-Apply`` state. If a remote subcloud is installed via Redfish -(miniboot) at this point, it will run the patched software. Any patches in this -state will be applied on the subcloud as it is installed. However, this is not -reflected in the output from the :command:`sw-patch query` command on the -subcloud. - -**Workaround**: For remote subcloud install operations using the Redfish -protocol, you should avoid installing any subclouds if there are System -Controller patches in the ``Partial-Apply`` state. - ****************************************** PTP is not supported on Broadcom 57504 NIC ****************************************** |PTP| is not supported on the Broadcom 57504 NIC. -**Workaround**: None. Do not configure |PTP| instances on the Broadcom 57504 +**Procedural Changes**: None. Do not configure |PTP| instances on the Broadcom 57504 NIC. -************************************* -Metrics Server Update across Upgrades -************************************* - -After a platform upgrade, the Metrics Server will NOT be automatically updated. 
- -**Workaround**: To update the Metrics Server, -**See**: :ref:`Install Metrics Server ` - -*********************************************************************************** -Horizon Drop-Down lists in Chrome and Firefox causes issues due to the new branding -*********************************************************************************** - -Drop-down menus in Horizon do not work due to the 'select' HTML element on Chrome -and Firefox. - -It is considered a 'replaced element' as it is generated by the browser and/or -operating system. This element has a limited range of customizable CSS -properties. - -**Workaround**: The system should be 100% usable even with this limitation. -Changing browser's and/or operating system's theme could solve display issues -in case they limit the legibility of the elements (i.e. white text and -white background). - ************************************************************************************************ Deploying an App using nginx controller fails with internal error after controller.name override ************************************************************************************************ @@ -950,39 +2081,7 @@ Example of Helm override: ~(keystone_admin)$ system application-apply nginx-ingress-controller -**Workaround**: NA - -************************************************ -Kata Container is not supported on StarlingX 8.0 -************************************************ - -Kata Containers that were supported on CentOS in earlier releases of |prod-long| -will not be supported on |prod-long| Release 8.0. - -*********************************************** -Vault is not supported on StarlingX Release 8.0 -*********************************************** - -The Vault application is not supported on |prod-long| Release 8.0. - -**Workaround**: NA - -*************************************************** -Portieris is not supported on StarlingX Release 8.0 -*************************************************** - -The Portieris application is not supported on |prod-long| Release 8.0. - -**Workaround**: NA - -***************************** -DCManager Patch Orchestration -***************************** - -.. warning:: - Patches must be applied or removed on the System Controller prior to using - the :command:`dcmanager patch-strategy` command to propagate changes to the - subclouds. +**Procedural Changes**: NA **************************************** Optimization with a Large number of OSDs @@ -995,35 +2094,14 @@ succeeds. 800.001 - Storage Alarm Condition: HEALTH_WARN. Please check 'ceph -s' -**Workaround**: To optimize your storage nodes with a large number of |OSDs|, it +**Procedural Changes**: To optimize your storage nodes with a large number of |OSDs|, it is recommended to use the following commands: .. code-block:: none - $ ceph osd pool set kube-rbd pg_num 256 - $ ceph osd pool set kube-rbd pgp_num 256 + ~(keystone_admin)]$ ceph osd pool set kube-rbd pg_num 256 + ~(keystone_admin)]$ ceph osd pool set kube-rbd pgp_num 256 -****************************************************************** -PTP tx_timestamp_timeout causes ptp4l port to transition to FAULTY -****************************************************************** - -NICs using the Intel Ice NIC driver may report the following in the `ptp4l`` -logs, which might coincide with a |PTP| port switching to ``FAULTY`` before -re-initializing. - -.. 
code-block:: none - - ptp4l[80330.489]: timed out while polling for tx timestamp - ptp4l[80330.CGTS-30543489]: increasing tx_timestamp_timeout may correct this issue, but it is likely caused by a driver bug - -This is due to a limitation of the Intel ICE driver. - -**Workaround**: The recommended workaround is to set the ``tx_timestamp_timeout`` -parameter to 700 (ms) in the ``ptp4l`` config using the following command. - -.. code-block:: none - - ~(keystone_admin)]$ system ptp-instance-parameter-add ptp-inst1 tx_timestamp_timeout=700 *************** BPF is disabled @@ -1064,24 +2142,9 @@ includes the following, but not limited to these packages. - i40e - ice -**Workaround**: It is recommended not to use BPF with real time kernel. +**Procedural Changes**: It is recommended not to use BPF with real-time kernel. If required it can still be used, for example, debugging only. -***************** -crashkernel Value -***************** - -**crashkernel=auto** is no longer supported by newer kernels, and hence the -v5.10 kernel will not support the "auto" value. - -**Workaround**: |prod-long| uses **crashkernel=2048m** instead of -**crashkernel=auto**. - -.. note:: - - |prod-long| Release 8.0 has increased the amount of reserved memory for - the crash/kdump kernel from 512 MiB to 2048 MiB. - *********************** Control Group parameter *********************** @@ -1094,9 +2157,11 @@ use case to linux-mm@kvack.org if you depend on this functionality." This parameter is used by a number of software packages in |prod-long|, including, but not limited to, **systemd, docker, containerd, libvirt** etc. -**Workaround**: NA. This is only a warning message about the future deprecation +**Procedural Changes**: NA. This is only a warning message about the future deprecation of an interface. +.. Chris F please confirm if this is applicable? + **************************************************** Kubernetes Taint on Controllers for Standard Systems **************************************************** @@ -1107,12 +2172,12 @@ controllers in Standard systems are intended ONLY for platform services. If application pods MUST run on controllers, a Kubernetes toleration of the taint can be specified in the application's pod specifications. -**Workaround**: Customer applications that need to run on controllers on +**Procedural Changes**: Customer applications that need to run on controllers on Standard systems will need to be enabled/configured for Kubernetes toleration in order to ensure the applications continue working after an upgrade from -|prod-long| Release 6.0 to |prod-long| future Releases. It is suggested to add -the Kubernetes toleration to your application prior to upgrading to |prod-long| -Release 8.0. +|prod| 6.0 to |prod-long| future Releases. It is suggested to add +the Kubernetes toleration to your application prior to upgrading to |prod| +8.0. You can specify toleration for a pod through the pod specification (PodSpec). For example: @@ -1125,7 +2190,7 @@ For example: .... spec tolerations: - - key: "node-role.kubernetes.io/control-plane" + - key: "node-role.kubernetes.io/master" operator: "Exists" effect: "NoSchedule" - key: "node-role.kubernetes.io/control-plane" @@ -1134,54 +2199,6 @@ For example: **See**: `Taints and Tolerations `__. 
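+
+As a quick check (node names below are examples and vary by deployment), you
+can list which nodes currently carry the control-plane taint before adding
+tolerations to your application:
+
+.. code-block:: none
+
+    # Show the taints configured on each node
+    ~(keystone_admin)]$ kubectl get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints
+
+    # Inspect a single controller in detail
+    ~(keystone_admin)]$ kubectl describe node controller-0 | grep -A 2 Taints
+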
-******************************************************** -New Kubernetes Taint on Controllers for Standard Systems -******************************************************** - -A new Kubernetes taint will be applied to controllers for Standard systems in -order to prevent application pods from being scheduled on controllers; since -controllers in Standard systems are intended ONLY for platform services. If -application pods MUST run on controllers, a Kubernetes toleration of the taint -can be specified in the application's pod specifications. You will also need to -change the nodeSelector / nodeAffinity to use the new label. - -**Workaround**: Customer applications that need to run on controllers on -Standard systems will need to be enabled/configured for Kubernetes toleration -in order to ensure the applications continue working after an upgrade to -|prod-long| Release 8.0 and |prod-long| future Releases. - -You can specify toleration for a pod through the pod specification (PodSpec). -For example: - -.. code-block:: none - - spec: - .... - template: - .... - spec - tolerations: - - key: "node-role.kubernetes.io/control-plane" - operator: "Exists" - effect: "NoSchedule" - -**See**: `Taints and Tolerations `__. - -************************************************************** -Ceph alarm 800.001 interrupts the AIO-DX upgrade orchestration -************************************************************** - -Upgrade orchestration fails on |AIO-DX| systems that have Ceph enabled. - -**Workaround**: Clear the Ceph alarm 800.001 by manually upgrading both -controllers and using the following command: - -.. code-block:: none - - ~(keystone_admin)]$ ceph mon enable-msgr2 - -Ceph alarm 800.001 is cleared. - *************************************************************** Storage Nodes are not considered part of the Kubernetes cluster *************************************************************** @@ -1191,18 +2208,7 @@ must only display controller and worker hosts that have control-plane and kubele components. Storage nodes do not have any of those components and so are not considered a part of the Kubernetes cluster. -**Workaround**: Do not include Storage nodes. - -*************************************************************************************** -Backup and Restore of ACC100 (Mount Bryce) configuration requires double unlock attempt -*************************************************************************************** - -After restoring from a previous backup with an Intel ACC100 processing -accelerator device, the first unlock attempt will be refused since this -specific kind of device will be updated in the same context. - -**Workaround**: A second attempt after few minutes will accept and unlock the -host. +**Procedural Changes**: Do not include Storage nodes as part of the Kubernetes upgrade. ************************************** Application Pods with SRIOV Interfaces @@ -1213,9 +2219,9 @@ label in their pod spec template. Pods with |SRIOV| interfaces may fail to start after a platform restore or Simplex upgrade and persist in the **Container Creating** state due to missing -PCI address information in the |CNI| configuration. +PCI address information in the CNI configuration. -**Workaround**: Application pods that require|SRIOV| should add the label +**Procedural Changes**: Application pods that require|SRIOV| should add the label **restart-on-reboot: "true"** to their pod spec template metadata. 
All pods with this label will be deleted and recreated after system initialization, therefore all pods must be restartable and managed by a Kubernetes controller @@ -1232,28 +2238,6 @@ Pod Spec template example: app: sriovdp restart-on-reboot: "true" - -*********************** -Management VLAN Failure -*********************** - -If the Management VLAN fails on the active System Controller, communication -failure 400.005 is detected, and alarm 280.001 is raised indicating -subclouds are offline. - -**Workaround**: System Controller will recover and subclouds are manageable -when the Management VLAN is restored. - -******************************** -Host Unlock During Orchestration -******************************** - -If a host unlock during orchestration takes longer than 30 minutes to complete, -a second reboot may occur. This is due to the delays, VIM tries to abort. The -abort operation triggers the second reboot. - -**Workaround**: NA - ************************************** Storage Nodes Recovery on Power Outage ************************************** @@ -1261,54 +2245,23 @@ Storage Nodes Recovery on Power Outage Storage nodes take 10-15 minutes longer to recover in the event of a full power outage. -**Workaround**: NA +**Procedural Changes**: NA -************************************* -Ceph OSD Recovery on an AIO-DX System -************************************* +********************************* +Ceph Recovery on an AIO-DX System +********************************* -In certain instances a Ceph OSD may not recover on an |AIO-DX| system -\(for example, if an OSD comes up after a controller reboot and a swact -occurs), and remains in the down state when viewed using the :command:`ceph -s` -command. +In certain instances Ceph may not recover on an |AIO-DX| system, and remains +in the down state when viewed using the +:command"`ceph -s` command; for example, if an |OSD| comes up after a controller +reboot and a swact occurs, or other possible causes for example, hardware +failure of the disk or the entire host, power outage, or switch down. -**Workaround**: Manual recovery of the OSD may be required. - -******************************************************** -Using Helm with Container-Backed Remote CLIs and Clients -******************************************************** - -If **Helm** is used within Container-backed Remote CLIs and Clients: - -- You will NOT see any helm installs from |prod| Platform's system - FluxCD applications. - - **Workaround**: Do not directly use **Helm** to manage |prod| Platform's - system FluxCD applications. Manage these applications using - :command:`system application` commands. - -- You will NOT see any helm installs from end user applications, installed - using **Helm** on the controller's local CLI. - - **Workaround**: It is recommended that you manage your **Helm** - applications only remotely; the controller's local CLI should only be used - for management of the |prod| Platform infrastructure. - -********************************************************************* -Remote CLI Containers Limitation for StarlingX Platform HTTPS Systems -********************************************************************* - -The python2 SSL lib has limitations with reference to how certificates are -validated. If you are using Remote CLI containers, due to a limitation in -the python2 SSL certificate validation, the certificate used for the 'ssl' -certificate should either have: - -#. CN=IPADDRESS and SAN=empty or, - -#. 
CN=FQDN and SAN=FQDN - -**Workaround**: Use CN=FQDN and SAN=FQDN as CN is a deprecated field in -the certificate. +**Procedural Changes**: There is no specific command or procedure that solves +the problem for all possible causes. Each case needs to be analyzed individually +to find the root cause of the problem and the solution. It is recommended to +contact Customer Support at, +`http://www.windriver.com/support `__. ******************************************************************* Cert-manager does not work with uppercase letters in IPv6 addresses @@ -1316,7 +2269,7 @@ Cert-manager does not work with uppercase letters in IPv6 addresses Cert-manager does not work with uppercase letters in IPv6 addresses. -**Workaround**: Replace the uppercase letters in IPv6 addresses with lowercase +**Procedural Changes**: Replace the uppercase letters in IPv6 addresses with lowercase letters. .. code-block:: none @@ -1343,41 +2296,41 @@ Kubernetes Root CA Certificates Kubernetes does not properly support **k8s_root_ca_cert** and **k8s_root_ca_key** being an Intermediate CA. -**Workaround**: Accept internally generated **k8s_root_ca_cert/key** or +**Procedural Changes**: Accept internally generated **k8s_root_ca_cert/key** or customize only with a Root CA certificate and key. ************************ Windows Active Directory ************************ +.. _general-limitations-and-workarounds-ul-x3q-j3x-dmb: + - **Limitation**: The Kubernetes API does not support uppercase IPv6 addresses. - **Workaround**: The issuer_url IPv6 address must be specified as lowercase. + **Procedural Changes**: The issuer_url IPv6 address must be specified as + lowercase. - **Limitation**: The refresh token does not work. - **Workaround**: If the token expires, manually replace the ID token. For - more information, see, :ref:`Configure Kubernetes Client Access `. + **Procedural Changes**: If the token expires, manually replace the ID token. For + more information, see, :ref:`Configure Kubernetes Client Access + `. - **Limitation**: TLS error logs are reported in the **oidc-dex** container on subclouds. These logs should not have any system impact. - **Workaround**: NA + **Procedural Changes**: NA -- **Limitation**: **stx-oidc-client** liveness probe sometimes reports - failures. These errors may not have system impact. - - **Workaround**: NA - -.. Stx LP Bug: https://bugs.launchpad.net/starlingx/+bug/1846418 +.. Stx LP Bug: https://bugs.launchpad.net/starlingx/+bug/1846418 Won't fix. +.. To be addressed in a future update. ************ BMC Password ************ -The BMC password cannot be updated. +The |BMC| password cannot be updated. -**Workaround**: In order to update the BMC password, de-provision the BMC, +**Procedural Changes**: In order to update the |BMC| password, de-provision the |BMC|, and then re-provision it again with the new password. **************************************** @@ -1387,7 +2340,7 @@ Application Fails After Host Lock/Unlock In some situations, application may fail to apply after host lock/unlock due to previously evicted pods. -**Workaround**: Use the :command:`kubectl delete` command to delete the evicted +**Procedural Changes**: Use the :command:`kubectl delete` command to delete the evicted pods and reapply the application. *************************************** @@ -1398,38 +2351,17 @@ If an application apply is in progress and a host is reset it will likely fail. A re-apply attempt may be required once the host recovers and the system is stable. 
-**Workaround**: Once the host recovers and the system is stable, a re-apply +**Procedural Changes**: Once the host recovers and the system is stable, a re-apply may be required. -******************************** -Pod Recovery after a Host Reboot -******************************** - -On occasions some pods may remain in an unknown state after a host is rebooted. - -**Workaround**: To recover these pods kill the pod. Also based on `https://github.com/kubernetes/kubernetes/issues/68211 `__ -it is recommended that applications avoid using a subPath volume configuration. - -**************************** -Rare Node Not Ready Scenario -**************************** - -In rare cases, an instantaneous loss of communication with the active -**kube-apiserver** may result in kubernetes reporting node\(s) as stuck in the -"Not Ready" state after communication has recovered and the node is otherwise -healthy. - -**Workaround**: A restart of the **kublet** process on the affected node\(s) -will resolve the issue. - ************************* Platform CPU Usage Alarms ************************* -Alarms may occur indicating platform cpu usage is \>90% if a large number of -pods are configured using liveness probes that run every second. +Alarms may occur indicating platform cpu usage is greater than 90% if a large +number of pods are configured using liveness probes that run every second. -**Workaround**: To mitigate either reduce the frequency for the liveness +**Procedural Changes**: To mitigate either reduce the frequency for the liveness probes or increase the number of platform cores. ******************* @@ -1439,25 +2371,20 @@ Pods Using isolcpus The isolcpus feature currently does not support allocation of thread siblings for cpu requests (i.e. physical thread +HT sibling). -**Workaround**: NA +**Procedural Changes**: For optimal results, if hyperthreading is enabled then +isolcpus should be allocated in multiples of two in order to ensure that both +|SMT| siblings are allocated to the same container. -***************************** -system host-disk-wipe command -***************************** - -The system host-disk-wipe command is not supported in this release. - -**Workaround**: NA - -************************************************************* +*********************************************************** Restrictions on the Size of Persistent Volume Claims (PVCs) -************************************************************* +*********************************************************** There is a limitation on the size of Persistent Volume Claims (PVCs) that can -be used for all StarlingX Platform Releases. +be used for all |prod| Releases. -**Workaround**: It is recommended that all PVCs should be a minimum size of -1GB. For more information, see, `https://bugs.launchpad.net/starlingx/+bug/1814595 `__. +**Procedural Changes**: It is recommended that all |PVCs| should be a minimum size of +1GB. For more information, see, +https://bugs.launchpad.net/starlingx/+bug/1814595. *************************************************************** Sub-Numa Cluster Configuration not Supported on Skylake Servers @@ -1465,7 +2392,7 @@ Sub-Numa Cluster Configuration not Supported on Skylake Servers Sub-Numa cluster configuration is not supported on Skylake servers. -**Workaround**: For servers with Skylake Gold or Platinum CPUs, Sub-NUMA +**Procedural Changes**: For servers with Skylake Gold or Platinum CPUs, Sub-|NUMA| clustering must be disabled in the BIOS. 
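+
+Referring back to the |PVC| size restriction noted above, the following is a
+minimal sketch of a claim that satisfies the recommended 1GB minimum; the
+claim name and storage class are illustrative assumptions and must be adapted
+to your deployment:
+
+.. code-block:: yaml
+
+   apiVersion: v1
+   kind: PersistentVolumeClaim
+   metadata:
+     name: demo-claim                  # illustrative name
+   spec:
+     accessModes:
+       - ReadWriteOnce
+     storageClassName: general         # illustrative storage class
+     resources:
+       requests:
+         storage: 1Gi                  # keep all claims at 1GB or larger
+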
***************************************************************** @@ -1473,10 +2400,10 @@ The ptp-notification-demo App is Not a System-Managed Application ***************************************************************** The ptp-notification-demo app is provided for demonstration purposes only. -Therefore, it is not supported on typical platform operations such as Backup -and Restore. +Therefore, it is not supported on typical platform operations such as Upgrades +and Backup and Restore. -**Workaround**: NA +**Procedural Changes**: NA ************************************************************************* Deleting image tags in registry.local may delete tags under the same name @@ -1485,46 +2412,357 @@ Deleting image tags in registry.local may delete tags under the same name When deleting image tags in the registry.local docker registry, you should be aware that the deletion of an **** will delete all tags under the specified that have the same 'digest' as the specified -. For more information, see, :ref:`Delete Image Tags in the Docker Registry `. +. For more information, see, :ref:`Delete Image Tags in +the Docker Registry `. -**Workaround**: NA +**Procedural Changes**: NA + +**************************************************************************** +Unable to create Kubernetes Upgrade Strategy for Subclouds using Horizon GUI +**************************************************************************** + +When creating a Kubernetes Upgrade Strategy for a +subcloud using the Horizon GUI, it fails and displays the following error: + +.. code-block:: none + + kube upgrade pre-check: Invalid kube version(s), left: (v1.24.4), right: + (1.24.4) + +**Procedural Changes**: Use the following steps to create the strategy: + +.. rubric:: |proc| + +#. Create a strategy for subcloud Kubernetes upgrade using the + :command:`dcmanager kube-upgrade-strategy create --to-version ` command. + +#. Apply the strategy using the Horizon GUI or the CLI using the command + :command:`dcmanager kube-upgrade-strategy apply`. + +:ref:`apply-a-kubernetes-upgrade-strategy-using-horizon-2bb24c72e947` + +********************************************** +Power Metrics Application in Real Time Kernels +********************************************** + +When executing Power Metrics application in Real +Time kernels, the overall scheduling latency may increase due to inter-core +interruptions caused by the MSR (Model-specific Registers) reading. + +Due to intensive workloads the kernel may not be able to handle the MSR +reading interruptions resulting in stalling data collection due to +not being scheduled on the affected core. + +**Procedural Changes**: N/A. + +*********************************************** +k8s-coredump only supports lowercase annotation +*********************************************** + +Creating K8s pod core dump fails when setting the +``starlingx.io/core_pattern`` parameter in upper case characters on the pod +manifest. This results in the pod being unable to find the target directory +and fails to create the coredump file. + +**Procedural Changes**: The ``starlingx.io/core_pattern`` parameter only accepts +lower case characters for the path and file name where the core dump is saved. + +**See**: :ref:`kubernetes-pod-coredump-handler-54d27a0fd2ec`. 
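+
+The following is a minimal sketch of a pod manifest that sets the annotation
+in lower case; the pod name, image, and core dump path are illustrative
+assumptions and must be adapted to your deployment:
+
+.. code-block:: yaml
+
+   apiVersion: v1
+   kind: Pod
+   metadata:
+     name: coredump-demo                            # illustrative name
+     annotations:
+       # path and file name must contain lower case characters only
+       starlingx.io/core_pattern: "/var/lib/coredump/core.%e.%p"
+   spec:
+     containers:
+     - name: app
+       image: registry.local:9001/example/app:1.0   # illustrative image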
+ +*********************** +NetApp Permission Error +*********************** + +When installing/upgrading to Trident 20.07.1 and later, and Kubernetes version +1.17 or higher, new volumes created will not be writable if: + +- The storageClass does not specify ``parameter.fsType`` + +- The pod using the requested |PVC| has an ``fsGroup`` enforced as part of a + Security constraint + +**Procedural Changes**: Specify ``parameter.fsType`` in the ``localhost.yml`` file under +``netapp_k8s_storageclasses`` parameters as below. + +The following example shows a minimal configuration in ``localhost.yml``: + +.. code-block:: + + ansible_become_pass: xx43U~a96DN*m.? + trident_setup_dir: /tmp/trident + netapp_k8s_storageclasses: + - metadata: + name: netapp-nas-backend + provisioner: netapp.io/trident + parameters: + backendType: "ontap-nas" + fsType: "nfs" + + netapp_k8s_snapshotstorageclasses: + - metadata: + name: csi-snapclass + +**See**: :ref:`Configure an External NetApp Deployment as the Storage Backend ` + +******************************** +Huge Page Limitation on Postgres +******************************** + +Debian postgres version supports huge pages, and by +default uses 1 huge page if it is available on the system, decreasing by 1 the +number of huge pages available. + +**Procedural Changes**: The huge page setting must be disabled by setting +``/etc/postgresql/postgresql.conf: "huge_pages = off"``. The postgres service +needs to be restarted using the Service Manager :command:`sudo sm-restart service postgres` +command. + +.. Warning:: + + The Procedural Changes is not persistent, therefore, if the host is rebooted + it will need to be applied again. This will be fixed in a future release. + +************************************************ +Password Expiry does not work on LDAP user login +************************************************ + +On Debian, the warning message is not being displayed for Active Directory users, +when a user logs in and the password is nearing expiry. Similarly, on login +when a user's password has already expired, the password change prompt is not +being displayed. + +**Procedural Changes**: It is recommended that users rely on Directory administration +tools for "Windows Active Directory" servers to handle password updates, +reminders and expiration. It is also recommended that passwords should be +updated every 3 months. + +.. note:: + + The expired password can be reset via Active Directory by IT administrators. + +*************************************** +Silicom TimeSync (STS) card limitations +*************************************** + +* Silicom and Intel based Time Sync NICs may not be deployed on the same system + due to conflicting time sync services and operations. + + |PTP| configuration for Silicom TimeSync (STS) cards is handled separately + from |prod| host |PTP| configuration and may result in configuration + conflicts if both are used at the same time. + + The sts-silicom application provides a dedicated ``phc2sys`` instance which + synchronizes the local system clock to the Silicom TimeSync (STS) card. Users + should ensure that ``phc2sys`` is not configured via |prod| |PTP| Host + Configuration when the sts-silicom application is in use. + + Additionally, if |prod| |PTP| Host Configuration is being used in parallel + for non-STS NICs, users should ensure that all ``ptp4l`` instances do not use + conflicting ``domainNumber`` values. 
+ +* When the Silicom TimeSync (STS) card is configured in timing mode using the + sts-silicom application, the card goes through an initialization process on + application apply and server reboots. The ports will bounce up and down + several times during the initialization process, causing network traffic + disruption. Therefore, configuring the platform networks on the Silicom + TimeSync (STS) card is not supported since it will cause platform + instability. + +**Procedural Changes**: N/A. + +*********************************** +N3000 Image in the containerd cache +*********************************** + +The |prod-long| system without an N3000 image in the containerd cache fails to +configure during a reboot cycle, and results in a failed / disabled node. + +The N3000 device requires a reset early in the startup sequence. The reset is +done by the n3000-opae image. The image is automatically downloaded on bootstrap +and is expected to be in the cache to allow the reset to succeed. If the image +is not in the cache for any reason, the image cannot be downloaded as +``registry.local`` is not up yet at this point in the startup. This will result +in the impacted host going through multiple reboot cycles and coming up in an +enabled/degraded state. To avoid this issue: + +1. Ensure that the docker filesystem is properly engineered to avoid the image + being automatically removed by the system if flagged as unused. + For instructions to resize the filesystem, see + :ref:`Increase Controller Filesystem Storage Allotments Using the CLI ` + +2. Do not manually prune the N3000 image. + +**Procedural Changes**: Use the procedure below. + +.. rubric:: |proc| + +#. Lock the node. + + .. code-block:: none + + ~(keystone_admin)]$ system host-lock controller-0 + +#. Pull the (N3000) required image into the ``containerd`` cache. + + .. code-block:: none + + ~(keystone_admin)]$ crictl pull registry.local:9001/docker.io/starlingx/n3000-opae:stx.8.0-v1.0.2 + +#. Unlock the node. + + .. code-block:: none + + ~(keystone_admin)]$ system host-unlock controller-0 + +.. Henrique please confirm if this is applicable in 10.0?? + +***************** +Quartzville Tools +***************** + +The following :command:`celo64e` and :command:`nvmupdate64e` commands are not +supported in StarlingX due to a known issue in Quartzville tools that crashes +the host. + +**Procedural Change**: Reboot the host using the boot screen menu. + +******************************************************************************************************* +``ptp4l`` error "timed out while polling for tx timestamp" reported for NICs using the Intel ice driver +******************************************************************************************************* + +NICs using the Intel® ice driver may report the following error in the ``ptp4l`` +logs, which results in a |PTP| port switching to ``FAULTY`` before +re-initializing. + +.. note:: + + |PTP| ports frequently switching to ``FAULTY`` may degrade the accuracy of + the |PTP| timing. + +.. code-block:: none + + ptp4l[80330.489]: timed out while polling for tx timestamp + ptp4l[80330.489]: increasing tx_timestamp_timeout may correct this issue, but it is likely caused by a driver bug + +.. note:: + + This is due to a limitation with the Intel® ice driver as the driver cannot + guarantee the time interval to return the timestamp to the ``ptp4l`` user + space process which results in the occasional timeout error message. 
+ +**Procedural Changes**: The Procedural Changes recommended by Intel is to increase the +``tx_timestamp_timeout`` parameter in the ``ptp4l`` config. The increased +timeout value gives more time for the ice driver to provide the timestamp to +the ``ptp4l`` user space process. Timeout values of 50ms and 700ms have been +validated. However, the user can use a different value if it is more suitable +for their system. + +.. code-block:: none + + ~(keystone_admin)]$ system ptp-instance-parameter-add tx_timestamp_timeout=700 + ~(keystone_admin)]$ system ptp-instance-apply + +.. note:: + + The ``ptp4l`` timeout error log may also be caused by other underlying + issues, such as NIC port instability. Therefore, it is recommended to + confirm the NIC port is stable before adjusting the timeout values. + +*************************************************** +Cert-manager accepts only short hand IPv6 addresses +*************************************************** + +Cert-manager accepts only short hand IPv6 addresses. + +**Procedural Changes**: You must use the following rules when defining IPv6 addresses +to be used by Cert-manager. + +- all letters must be in lower case + +- each group of hexadecimal values must not have any leading 0s + (use :12: instead of :0012:) + +- the longest sequence of consecutive all-zero fields must be short handed + with ``::`` + +- ``::`` must not be used to short hand an IPv6 address with 7 groups of hexadecimal + values, use :0: instead of ``::`` + +.. note:: + + Use the rules above to set the IPv6 address related to the management + and |OAM| network in the Ansible bootstrap overrides file, localhost.yml. + +.. code-block:: none + + apiVersion: cert-manager.io/v1 + kind: Certificate + metadata: + name: oidc-auth-apps-certificate + namespace: test + spec: + secretName: oidc-auth-apps-certificate + dnsNames: + - ahost.com + ipAddresses: + - fe80:12:903a:1c1a:e802::11e4 + issuerRef: + name: cloudplatform-interca-issuer + kind: Issuer + +.. Stx LP Bug: https://bugs.launchpad.net/starlingx/+bug/1846418 Won't fix. +.. To be addressed in a future update. + +.. All please confirm if all these have been removed from the StarlingX 10.0? ------------------ Deprecated Notices ------------------ -.. All please confirm if all these have been removed from the StarlingX 9.0 Release? +*************** +Bare metal Ceph +*************** -**************************** -Airship Armada is deprecated -**************************** +Host-based Ceph will be deprecated in a future release. Adoption +of Rook-Ceph is recommended for new deployments as some host-based Ceph +deployments may not be upgradable. -.. note:: +********************************************************* +No support for system_platform_certificate.subject_prefix +********************************************************* - Airship Armada is removed in stx.9.0 and replaced with FluxCD. All Armada - based applications have to be removed before you perform an - upgrade from |prod-long| Release 9.0 to |prod-long| Release 10.0. +|prod| 10.0 no longer supports system_platform_certificate.subject_prefix +This is an optional field to add a prefix to further identify the certificate, +for example, |prod| for instance. -.. note:: - Some application repositories may still have "armada" in the file path but - are now supported by FluxCD. See https://opendev.org/starlingx/?sort=recentupdate&language=&q=armada. 
+*************************************************** +Static Configuration for Hardware Accelerator Cards +*************************************************** -StarlingX Release 7.0 introduces FluxCD based applications that utilize FluxCD -Helm/source controller pods deployed in the flux-helm Kubernetes namespace. -Airship Armada support is now considered to be deprecated. The Armada pod will -continue to be deployed for use with any existing Armada based applications but -will be removed in StarlingX Release 8.0, once the stx-openstack Armada -application is fully migrated to FluxCD. +Static configuration for hardware accelerator cards is deprecated in +|prod| 10.0 and will be discontinued in future releases. +Use |FEC| operator instead. -************************************ -Cert-manager API Version deprecation -************************************ +**See** :ref:`Switch between Static Method Hardware Accelerator and SR-IOV FEC Operator ` -The upgrade of cert-manager from 0.15.0 to 1.7.1, deprecated support for -cert manager API versions cert-manager.io/v1alpha2 and cert-manager.io/v1alpha3. -When creating cert-manager |CRDs| (certificates, issuers, etc) with |prod-long| -Release 8.0, use cert-manager.io/v1. +**************************************** +N3000 FPGA Firmware Update Orchestration +**************************************** + +The N3000 |FPGA| Firmware Update Orchestration has been deprecated in |prod| +10.0. For more information, see :ref:`n3000-overview` for more +information. + +******************** +show-certs.sh Script +******************** + +The ``show-certs.sh`` script that is available when you ssh to a controller is +deprecated in |prod| 10.0. + +The new response format of the 'system certificate-list' RESTAPI / CLI now +provides the same information as provided by ``show-certs.sh``. *************** Kubernetes APIs @@ -1534,6 +2772,323 @@ Kubernetes APIs that will be removed in K8s 1.25 are listed below: **See**: https://kubernetes.io/docs/reference/using-api/deprecation-guide/#v1-25 +*********************** +ptp-notification v1 API +*********************** + +The ptp-notification v1 API can still be used in |prod| 10.0. +The v1 API will be removed in a future release and only the O-RAN Compliant +Notification API (ptp-notification v2 API) will be supported. + +.. note:: + + It is recommended that all new deployments use the O-RAN Compliant + Notification API (ptp-notification v2 API). + +------------------- +Removed in Stx 10.0 +------------------- + +``kube-ignore-isol-cpus`` is no longer supported in |prod| 10.0. + +******************* +Pod Security Policy +******************* + +Pod Security Policy (PSP) is removed in |prod| 10.0 and +K8s v1.25 and ONLY applies if running on K8s v1.24 or earlier. Instead of +using Pod Security Policy, you can enforce similar restrictions on Pods +using Pod Security Admission Controller (PSAC) supporting K8s v1.25. + +.. note:: + + Although |prod| 10.0 still supports K8s v1.24 which supports + |PSP|, |prod| 10.0 has removed the |prod| default |PSP| policies, + roles and role-bindings that made |PSP| usable in |prod|; It is important + to note that |prod| 10.0 is officially NOT supporting the use + of |PSP| in its Kubernetes deployment. + +.. important:: + + Upgrades + + - |PSP| should be removed on hosted application's and converted to + |PSA| Controller before the upgrade to |prod| 10.0. + +.. - On 'upgrade activate or complete' of the upgrade to |prod| +.. 10.0, ALL |PSP| policies and all previously auto-generated ClusterRoles +.. 
and ClusterRoleBindings associated with |PSP| policies will be removed. + + - Using the :command:`system application-update` command for Platform + applications will remove the use of roles or rolebindings dealing with + |PSP| policies. + + - |PSA| Controller mechanisms should be configured to enforce the constraints that + the previous PSP policies were enforcing. + +**See**: :ref:`Pod Security Admission Controller ` + +******************************* +System certificate CLI Commands +******************************* + +The following commands are removed in |prod| 10.0 and replaced +by: + +- ``system certificate-install -m ssl `` + has been replaced by an automatically installed 'system-restapi-gui-certificate' + CERTIFICATE (in the 'deployment' namespace) which can be modified using the + 'update_platform_certificates' Ansible playbook + +- ``system certificate-install -m openstack `` + has been replaced by 'system os-certificate-install ' + +- ``system certificate-install -m ssl_ca `` + +- ``system certificate-install -m docker_registry `` + has been replaced by an automatically installed 'system-registry-local-certificate' + CERTIFICATE (in the 'deployment' namespace) which can be modified using the + 'update_platform_certificates' Ansible playbook + +- ``system certificate-uninstall -m ssl_ca `` and + ``system certificate-uninstall -m ssl_ca `` + have been replaced by: + + - ``'system ca-certificate-install '`` + - ``'system ca-certificate-uninstall '`` + +.. _appendix-commands-replaced-by-usm-for-updates-and-upgrades-835629a1f5b8: + +------------------------------------------------------------------------ +Appendix A - Commands replaced by USM for Updates (Patches) and Upgrades +------------------------------------------------------------------------ + +.. toctree:: + :maxdepth: 1 + +********************************** +Manually Managing Software Patches +********************************** + +The ``sudo sw-patch`` commands for manually managing software patches have +been replaced by ``software`` commands as listed below: + +The following commands for manually managing software patches are **no** longer +supported: + +- sw-patch upload + +- sw-patch upload-dir + +- sw-patch query + +- sw-patch show + +- sw-patch apply + +- sw-patch query-hosts + +- sw-patch host-install + +- sw-patch host-install-async + +- sw-patch remove + +- sw-patch delete + +- sw-patch what-requires + +- sw-patch query-dependencies + +- sw-patch is-applied + +- sw-patch is-available + +- sw-patch install-local + +- sw-patch drop-host + +- sw-patch commit + +Software patching is now manually managed by the ``software`` commands +described in the :ref: ```` Manual Deployment - Host Software +Deployment procedure. + +- software upload + +- software upload-dir + +- software list + +- software delete + +- software show + +- software deploy precheck + +- software deploy start + +- software deploy show + +- software deploy host + +- software deploy host-rollback + +- software deploy localhost + +- software deploy host-list + +- software deploy activate + +- software deploy complete + +- software deploy delete + +************************ +Manual Software Upgrades +************************ + +The ``system load-delete/import/list/show``, +``system upgrade-start/show/activate/abort/abort-complete/complete`` and +``system host-upgrade/upgrade-list/downgrade`` commands for manually managing +software upgrades have been replaced by ``software`` commands. 
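+
+The sketch below shows, at a high level, how the replacement ``software``
+commands fit together for a manual upgrade. The release identifier and file
+names are illustrative assumptions only; the authoritative steps are in the
+'Manual Deployment - Host Software Deployment' procedure.
+
+.. code-block:: none
+
+   # Upload the new release (file names are illustrative)
+   ~(keystone_admin)]$ software upload starlingx-10.0.0.iso starlingx-10.0.0.sig
+
+   # Check readiness and start the deployment
+   ~(keystone_admin)]$ software deploy precheck starlingx-10.0.0
+   ~(keystone_admin)]$ software deploy start starlingx-10.0.0
+
+   # Deploy to each host, then activate and complete
+   ~(keystone_admin)]$ software deploy host controller-1
+   ~(keystone_admin)]$ software deploy activate
+   ~(keystone_admin)]$ software deploy complete
+   ~(keystone_admin)]$ software deploy delete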
+ +The following commands for manually managing software upgrades are **no** longer +supported: + +- system load-import + +- system load-list + +- system load-show + +- system load-delete + +- system upgrade-start + +- system upgrade-show + +- system host-upgrade + +- system host-upgrade-list + +- system upgrade-activate + +- system upgrade-complete + +- system upgrade-abort + +- system host-downgrade + +- system upgrade-abort-complete + +Software upgrade is now manually managed by the ``software`` commands described +in the ```` 'Manual Deployment - Host Software Deployment' procedure. + +- software upload + +- software upload-dir + +- software list + +- software delete + +- software show + +- software deploy precheck + +- software deploy start + +- software deploy show + +- software deploy host + +- software deploy localhost + +- software deploy host-list + +- software deploy activate + +- software deploy complete + +- software deploy delete + +- software deploy abort + +- software deploy host-rollback + +- software deploy activate-rollback + +********************************* +Orchestration of Software Patches +********************************* + +The ``sw-manager patch-strategy-create/apply/show/abort/delete`` commands for +managing the orchestration of software patches have been replaced by +``sw-manager sw-deploy-strategy-create/apply/show/abort/delete`` commands. + +The following commands for managing the orchestration of software patches are +**no** longer supported + +- sw-manager patch-strategy create ... ... + +- sw-manager patch-strategy show + +- sw-manager patch-strategy apply + +- sw-manager patch-strategy abort + +- sw-manager patch-strategy delete + +Orchestrated software patching is now managed by the +``sw-manager sw-deploy-strategy-create/apply/show/abort/delete`` commands +described in the ``Orchestrated Deployment - Host Software Deployment`` procedure. + +- sw-manager sw-deploy-strategy create ... ... + +- sw-manager sw-deploy-strategy show + +- sw-manager sw-deploy-strategy apply + +- sw-manager sw-deploy-strategy abort + +- sw-manager sw-deploy-strategy delete + +********************************** +Orchestration of Software Upgrades +********************************** + +The ``sw-manager patch-strategy-create/apply/show/abort/delete`` commands for +managing the orchestration of software upgrades have been replaced by +``sw-manager sw-deploy-strategy-create/apply/show/abort/delete`` commands. + +The following commands for managing the orchestration of software upgrades are +no longer supported. + +- sw-manager upgrade-strategy create ... ... + +- sw-manager upgrade-strategy show + +- sw-manager upgrade-strategy apply + +- sw-manager upgrade-strategy abort + +- sw-manager upgrade-strategy delete + +Orchestrated software upgrade is now managed by the +``sw-manager sw-deploy-strategy-create/apply/show/abort/delete`` commands +described in the ```` 'Orchestrated Deployment - Host Software Deployment' +procedure. + +- sw-manager sw-deploy-strategy create < ... ... + +- sw-manager sw-deploy-strategy show + +- sw-manager sw-deploy-strategy apply + +- sw-manager sw-deploy-strategy abort + +- sw-manager sw-deploy-strategy delete -------------------------------------- Release Information for other versions @@ -1541,12 +3096,22 @@ Release Information for other versions You can find details about a release on the specific release page. +.. To change the 9.0 link + .. 
list-table:: * - Version - Release Date - Notes - Status + * - StarlingX R10.0 + - 2025-02 + - https://docs.starlingx.io/stx.10.0/releasenotes/index.html + - Maintained + * - StarlingX R9.0 + - 2024-03 + - https://docs.starlingx.io/r/stx.9.0/releasenotes/index.html + - Maintained * - StarlingX R8.0 - 2023-02 - https://docs.starlingx.io/r/stx.8.0/releasenotes/index.html