From 2639a15aa16b0d25e1f5b089422ee639634ab26c Mon Sep 17 00:00:00 2001 From: Juanita Balaraj Date: Wed, 22 Oct 2025 13:14:12 +0000 Subject: [PATCH] StarlingX Release Notes - Stx 11.0 Updated Patchset 1 comments Updated Deprecated Information Updated New Features Updated Limitations / Workarounds for all domains Updated Hardware Changes Change-Id: I7efc23f10caf366b1fb21da53cb2fd927050bf24 Signed-off-by: Juanita Balaraj --- .../introduction/index-intro-27197f27ad41.rst | 14 +- doc/source/releasenotes/index.rst | 5011 +++++++++-------- 2 files changed, 2630 insertions(+), 2395 deletions(-) diff --git a/doc/source/introduction/index-intro-27197f27ad41.rst b/doc/source/introduction/index-intro-27197f27ad41.rst index 12df84d62..182a9b3f4 100644 --- a/doc/source/introduction/index-intro-27197f27ad41.rst +++ b/doc/source/introduction/index-intro-27197f27ad41.rst @@ -69,12 +69,20 @@ For additional information about project teams, refer to the `StarlingX wiki `_. ------------------------------ -New features in StarlingX 10.0 +New features in StarlingX 11.0 ------------------------------ .. include:: /releasenotes/index.rst - :start-after: start-new-features-r10 - :end-before: end-new-features-r10 + :start-after: start-new-features-r11 + :end-before: end-new-features-r11 + +.. To change this link + +------------------------------ +New features in StarlingX 10.0 +------------------------------ + +**See**: https://docs.starlingx.io/r/stx.10.0/releasenotes/index.html#release-notes ----------------------------- New features in StarlingX 9.0 diff --git a/doc/source/releasenotes/index.rst b/doc/source/releasenotes/index.rst index c7510f509..50cc6e04e 100644 --- a/doc/source/releasenotes/index.rst +++ b/doc/source/releasenotes/index.rst @@ -7,17 +7,17 @@ .. The Stx 10.0 RN is WIP and not ready for review. .. Removed appearances of Armada as its not supported -=================== -R10.0 Release Notes -=================== +============================ +StarlingX 11.0 Release Notes +============================ .. rubric:: |context| StarlingX is a fully integrated edge cloud software stack that provides everything needed to deploy an edge cloud on one, two, or up to 100 servers. -This section describes the new capabilities, Known Limitations and Workarounds, -Defects fixed and deprecated information in StarlingX 10.0. +This section describes the new capabilities, Known Limitations and Procedural Changes, +Defects fixed and deprecated information in StarlingX 11.0. .. contents:: :local: @@ -27,1151 +27,707 @@ Defects fixed and deprecated information in StarlingX 10.0. ISO image --------- -The pre-built ISO (Debian) for StarlingX 10.0 is located at the +The pre-built ISO (Debian) for StarlingX 11.0 is located at the ``StarlingX mirror`` repo: -https://mirror.starlingx.windriver.com/mirror/starlingx/release/10.0.0/debian/monolithic/outputs/iso/ - +https://mirror.starlingx.windriver.com/mirror/starlingx/release/11.0.0/debian/bullseye/amd64/monolithic/outputs/iso/ + ------------------------------ -Source Code for StarlingX 10.0 +Source Code for StarlingX 11.0 ------------------------------ -The source code for StarlingX 10.0 is available on the r/stx.10.0 +The source code for StarlingX 11.0 is available on the r/stx.11.0 branch in the `StarlingX repositories `_. ---------- Deployment ---------- -To deploy StarlingX 10.0, see `Consuming StarlingX `_. +To deploy StarlingX 11.0, see `Consuming StarlingX `_. -For detailed installation instructions, see `StarlingX 10.0 Installation Guides `_. 
+For detailed installation instructions, see `StarlingX 11.0 Installation Guides `_. -.. Ghada / Greg please confirm if all features listed here are required in Stx 10.0 +.. Greg / Ghada please confirm if all features listed here are correct in Stx 11.0? ------------------------------ -New Features and Enhancements ------------------------------ +------------------------------------- +New Features / Enhancements / Updates +------------------------------------- The sections below provide a detailed list of new features and links to the associated user guides (if applicable). -.. start-new-features-r10 +.. start-new-features-r11 **************************** Platform Component Upversion **************************** -The ``auto_update`` attribute supported for |prod| applications -enables apps to be automatically updated when a new app version tarball is -installed on a system. +The following platform component versions have been updated in |prod-long| Release +Stx 11.0 -**See**: https://wiki.openstack.org/wiki/StarlingX/Containers/Applications/AppIntegration +- kernel version 6.12.40 -The following platform component versions have been updated in |prod| 10.0. +Supported Kubernetes versions in |prod-long| 11.0: -- sriov-fec-operator 2.9.0 - -- kubernetes-power-manager 2.5.1 - -- kubevirt-app: 1.1.0 - -- security-profiles-operator 0.8.7 +- 1.29.2 +- 1.30.6 +- 1.31.5 +- 1.32.2 - nginx-ingress-controller - - ingress-nginx 4.12.1 + - ingress-nginx 4.13.3 - - secret-observer 0.1.1 +- cert-manager 1.17.2 -- auditd 1.0.5 +- platform-integ-apps 3.11.0 -- snmp 1.0.3 + - ceph-csi-rbd-3.13.1 + - ceph-csi-cephfs-3.13.1 + - ceph-pools-audit-1.0.1 -- cert-manager 1.15.3 + .. note:: -- ceph-csi-rbd 3.11.0 + The Ceph pools audit chart is now disabled by default. It can be + enabled through user-overrides based on user preference, if required. -- node-interface-metrics-exporter 0.1.3 +- rook-ceph -- node-feature-discovery 0.16.4 + - rook-ceph-1.16.6 + - rook-ceph-cluster-1.16.6 + - rook-ceph-provisioner-2.1.0 + - rook-ceph-floating-monitor-2.1.0 -- app-rook-ceph +- oidc-auth-apps 2.42.0 - - rook-ceph 1.13.7 - - rook-ceph-cluster 1.13.7 - - rook-ceph-floating-monitor 1.0.0 - - rook-ceph-provisioner 2.0.0 + - dex-0.23.0 + - secret-observer-0.1.8 + - oidc-client-0.1.24 -- dell-storage +- Helm chart metrics-server: 3.12.2 (deploys Metrics Server 0.7.2) - - csi-powerstore 2.10.0 - - csi-unity 2.10.0 - - csi-powerscale 2.10.0 - - csi-powerflex 2.10.1 - - csi-powermax 2.10.0 - - csm-replication 1.8.0 - - csm-observability 1.8.0 - - csm-resiliency 1.9.0 +- kubevirt-app 1.5.0 -- portieris 0.13.16 +- node-feature-discovery 0.17.3 -- metrics-server 3.12.1 (0.7.1) +- sriov-fec-operator 2.11.1 -- FluxCD helm-controller 1.0.1 (for Helm 3.12.2) - -- power-metrics - - - cadvisor 0.50.0 - - - telegraf 1.1.30 +- node-interface-metrics-exporter 0.1.4 - security-profiles-operator 0.8.7 -- vault +- dell-storage - - vault 1.14.0 + - csi-powerflex 2.13.0 + - csi-powermax 2.13.0 + - csi-powerscale 2.13.0 + - csi-powerstore 2.13.0 + - csi-unity 2.13.0 + - csm-observability 1.11.0 + - csm-replication 1.11.0 + - csm-resiliency 1.12.0 - - vault-manager 1.0.1 +- oran-o2 2.2.1 -- oidc-auth-apps +- snmp 1.0.5 - - oidc-auth-secret-observer secret-observer 0.1.7 1.0 +- auditd 1.0.3 - - oidc-dex dex-0.20.0 2.41.1 +- portieris 0.13.28 - - oidc-oidc-client oidc-client 0.1.23 1.0 + .. 
warning:: -- platform-integ-apps - - - ceph-csi-cephfs 3.11.0 - - - ceph-pools-audit 0.2.0 - -- app-istio - - - istio-operator 1.22.1 - - - kiali-server 1.85.0 - -- harbor 1.12.4 - -- ptp-notification 2.0.55 + Kubernetes upgrade fails if Portieris is applied. - intel-device-plugins-operator - - intel-device-plugins-operator 0.30.3 + - intel-device-plugins-operator-0.32.5 - - intel-device-plugins-qat 0.30.1 + - intel-device-plugins-qat-0.32.1 - - intel-device-plugins-gpu 0.30.0 + - intel-device-plugins-gpu-0.32.1 - - intel-device-plugins-dsa 0.30.1 + - intel-device-plugins-dsa-0.32.1 - - secret-observer 0.1.1 + - secret-observer-0.1-1 -- node-interface-metrics-exporter 0.1.3 +- kubernetes-power-manager 2.5.1 -- oran-o2 2.0.4 + .. note:: -- helm 3.14.4 for K8s 1.21 - 1.29 + Intel has stopped support for the ``kubernetes-power-manager`` application. + This is still being supported by |prod-long| and will be removed in a + future release. For more information, see :ref:`configurable-power-manager-04c24b536696`. -- Redfish Tool 1.1.8-1 + ``cpu_busy_cycles`` metric is deprecated and must be replaced with + ``cpu_c0_state_residency_percent`` for continued usage + (if the metrics are customized via helm overrides). + +- power-metrics + + - cadvisor 0.52.1 + - telegraf 1.34.4 + +- app-istio + + - Istio 1.26.2 + - Kiali 2.11.0 + +- FluxCD helm-controller 1.2.0 +- FluxCD source-controller 1.5.0 +- FluxCD notification-controller 1.5.0 +- FluxCD kustomize-controller 1.5.1 +- Helm 3.17.1 for Kubernetes 1.29-1.32 + +- volume-snapshot-controller + + - snapshot-controller 6.1.0 for K8s 1.29.2 + - snapshot-controller 6.3.3 for K8s 1.30.6 + - snapshot-controller 8.0.0 for K8s 1.31.5 - 1.32.2 + - snapshot-controller 8.1.0 for K8s 1.33.0 + +- ptp-notification 2.0.75 + +- app-netapp-storage (NetApp Trident CSI) 25.02.1 + +- Mellanox (OFED) ConnectX 24.10-2.1.8 + +- Mellanox ConnectX-6 DX firmware 22.43.2566 + +- ice: 2.3.10 + + - Intel E810 - Required NVM/firmware: 4.80 + - Intel E825 - Required NVM/firmware: 4.02 + - Intel E830 - Required NVM/firmware: 1.11 +- i40e: 2.28.9 / Required NVM/firmware: 9.20 **See**: :ref:`Application Reference ` -******************** -Kubernetes Upversion -******************** +************************ +OpenBao is not supported +************************ -|prod-long| Release |this-ver| supports Kubernetes 1.29.2. +.. warning:: -***************************************** -Distributed Cloud Scalability Improvement -***************************************** + OpenBao is not supported in |prod-long| Stx 11.0. Do not upload/apply this + application on a production system. -|prod| System Controller scalability has been improved in |prod| 10.0 with -both 5 thousand maximum managed nodes and maximum number of parallel operations. +.. Greg is this required in stx 11.0? -**************************************** -Unified Software Delivery and Management -**************************************** +************************************************************* +Secure Pod-to-Pod Communication of Inter-Host Network Traffic +************************************************************* -In |prod| 10.0, the Software Patching functionality and the -Software Upgrades functionality have been re-designed into a single Unified -Software Management framework. There is now a single procedure for managing -the deployment of new software; regardless of whether the new software is a -new Patch Release or a new Major Release. 
The same APIs/CLIs are used, the -same procedures are used, the same |VIM| / Host Orchestration strategies are used -and the same Distributed Cloud / Subcloud Orchestration strategies are used; -regardless of whether the new software is a new Patch Release or a new Major -Release. +To strengthen security across the |prod-long|, new measures have been +implemented to protect selected pod-to-pod network traffic from both passive +and active network attackers including those with access to the cluster host network. -**See**: :ref:`appendix-commands-replaced-by-usm-for-updates-and-upgrades-835629a1f5b8` -for a detailed list of deprecated commands and new commands. +On |prod|, inter-host pod-to-pod traffic for a service can be configured to be +protected by IPsec in tunnel mode over cluster host network. The configurations +are defined as IPsec policies and managed by the ipsec-policy-operator +Kubernetes system application. + +**See**: + +- :ref:`inter-host-pod-to-pod-security-overview-f44d8d3c7541` + +- :ref:`install-ipsec-policy-operator-system-application-95ae437a67e2` + +- :ref:`configure-ipsec-for-selected-inter-host-pod-to-pod-traffic-usi-8cb9b4342b5d` + +- :ref:`remove-ipsec-policy-operator-system-application-06e7f2e4cdfb` + +Threat Mitigation +***************** + +- Passive attackers: Defend against traffic snooping and unauthorized + data observation + +- Active attackers: Blocked from attempting unauthorized connections to + |prod| cluster hosts + +Secure Pod-to-Pod Communication +******************************* + +The |prod-long| now supports encryption of Calico-based inter-hosts +networking using IPsec, ensuring secure Pod-to-Pod traffic across the +cluster-host network. + +- Applies to application pod-to-pod traffic on the cluster-host network + +- Applications, and applications' pod-to-pod traffic can selectively be protected + +- Excludes SR-IOV |VF| interface traffic + +- Configuring IPSec policies on pod-to-pod traffic may degrade the CPU + performance. Ensure that adequate resources are available to support + sustained and peak inter-node traffic + +**See**: + +- :ref:`inter-host-pod-to-pod-security-overview-502afc38a15e` + +- :ref:`configure-ipsec-for-selected-inter-host-pod-to-pod-traffic-usi-8cb9b4342b5d` + +- :ref:`protect-inter-host-pod-to-pod-traffic-of-services-51ef3b65e272` + +- :ref:`turn-off-inter-host-pod-to-pod-traffic-protection-in-the-clust-5265939c5344` + +- :ref:`unprotect-inter-host-pod-to-pod-traffic-at-specific-ports-of-s-a294b80c1d67` + +- :ref:`unprotect-inter-host-pod-to-pod-traffic-of-specific-services-co-c0eca384959d` + + +Install IPsec Policy Operator System Application +************************************************ + +The ``ipsec-policy-operator`` system application is managed by the system +application framework and will be automatically uploaded once the system is ready. +Subsequently, the application can be installed by applying its manifest. 
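For example, once the application framework has auto-uploaded the tarball, the
application can be applied and verified with the standard application commands
(an illustrative sketch; application version and command output will vary by
system):

.. code-block:: none

    # Confirm the application was auto-uploaded by the framework
    ~(keystone_admin)$ system application-list | grep ipsec-policy-operator

    # Apply (install) the application manifest
    ~(keystone_admin)$ system application-apply ipsec-policy-operator

    # Check progress until the status reports 'applied'
    ~(keystone_admin)$ system application-show ipsec-policy-operator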
+ +**See**: + +- :ref:`configure-ipsec-for-selected-inter-host-pod-to-pod-traffic-usi-8cb9b4342b5d` + +- :ref:`install-ipsec-policy-operator-system-application-95ae437a67e2` + +- :ref:`remove-ipsec-policy-operator-system-application-06e7f2e4cdfb` + + +************************************************* +Platform Networks Address Reduction for |AIO-SX| +************************************************* + +To reduce the number of IP addresses required for Distributed Cloud |AIO-SX| +Subcloud deployments, platform networks are updated to allocate only a +single IP address per subcloud, removing the need for additional unit-specific +addresses that are no longer required. + +However, the platform network IP address must be assigned from a shared +subnet, allowing multiple subclouds to use the same network address range. +This enables more efficient IP management across large-scale deployments. +The |OAM| network serves as a reference model, as it already supports the necessary +capabilities and expected behavior for this configuration. + +**See**: + +- :ref:`network-addressing-requirements-2fac0035b878` + +- :ref:`manage-management-network-parameters-for-a-standalone-aiosx-18c7aaace64d` + +********************************************** +Intermediate CA Support for Kubernetes Root CA +********************************************** + +|prod-long| now supports the use of server certificates signed by an +Intermediate Certificate Authority (CA) for the external kube-apiserver endpoint. +This enhancement ensures that external access to the Kubernetes API can be +validated under the same root of trust as other platform certificates, +improving consistency and security across the system. + +Intermediate CA Support for External Connections to kube-apiserver +****************************************************************** + +External connections to ``kube-apiserver`` are now routed through HAProxy, which +listens on port 6443. HAProxy uses the REST API / GUI certificate issued by +system-local-ca, supporting Intermediate CAs, to perform |SSL| termination +with the external client. It then initiates a new |SSL| connection to kube-apiserver, +now operating on port 16443 behind the firewall, on behalf of the client. +External clients must recognize and trust the public certificate of +system-local-ca's Root CA. + +**See**: :ref:`kubernetes-certificates-f4196d7cae9c`. ******************************************* -Infrastructure Management Component Updates +Unified PTP Notification Overall Sync State ******************************************* -In |prod| 10.0, the new Unified Software Management framework -supports enhanced Patch Release packaging and enhanced Major Release deployments. +The overall sync state notification (`sync-state`) describes the health of the +timing chain on the local system. A locked state is reported when the system +has reference to an external time source (|GNSS| or |PTP|) and the system clock is +synchronized to that time source. -Patch Release packaging has been simplified to deliver new or modified Debian -packages, instead of the cryptic difference of OSTree builds done previously. -This allows for inspection and validation of Patch Release content prior to -deploying, and allows for future flexibility of Patch Release packaging. +**See**: :ref:`ptp-notification-status-conditions-6d6105fccf10` -Major Release deployments have been enhanced to fully leverage OSTree. An -OSTree deploy is now used to update the host software. 
The new software's -root filesystem can be installed on the host, while the host is still running -the software of the old root filesystem. The host is simply rebooted -into the new software's root filesystem. This provides a significant -improvement in both the upgrade duration and the upgrade service impact -(especially for |AIO-SX| systems), as previously upgrading hosts needed to have -disks/root-filesystems wiped and then software re-installed. +****************************************************************************************** +New Default/Static Platform API/CLI/GUI Access-Control Roles for Configurator and Operator +****************************************************************************************** -**See** +In |prod-long|, 5 different keystone roles are supported: ``admin``, ``reader``, +``configurator``, ``operator`` and ``member``. -- :ref:`patch-release-deployment-before-bootstrap-and-commissioning-of-7d0a97144db8` +In |prod-long| Release 11.0, the following new keystone roles are introduced: -- :ref:`manual-host-software-deployment-ee17ec6f71a4` +- configurator -- :ref:`manual-removal-host-software-deployment-24f47e80e518` +- operator + +**See**: :ref:`keystone-account-roles-64098d1abdc1` + +******************* +Multi-Node Upgrades +******************* + +In |prod-long| Release 11.0, the restriction on K8s multi-node orchestrated +upgrades has been removed. You can now perform upgrades across multiple nodes +in a single orchestration strategy. + +Example: Upgrading from v1.29.2 to v1.32.2 + +**See**: :ref:`About Kubernetes Upgrade Cloud Orchestration ` + +*************************** +PTP Netlink API Integration +*************************** + +The following new interface parameters have been added in |prod-long| Release 11.0: + +* ``ts2phc.pin_index = 1`` +* ``ts2phc.channel = 1`` + +**See**: :ref:`instance-specific-considerations-d9d9509c79dd` + +.. Greg is this required? + +******************* +Docker Size updates +******************* + +In StarlingX Release 11.0 the default Docker filesystem size is 30GB. +Resize the Docker filesystem on all controllers to a minimum 50GB or more prior to +upgrading the system using the following command: + +.. code-block:: none + + system host-fs-modify docker= + +A new deploy precheck script is added to ensure the docker filesystem size is +not less than 50GB. + +.. Greg is this required? + +************************** +VIM Rollback Orchestration +************************** + +|prod-long| Release 11.0 introduces expanded rollback capabilities to improve +system recovery during software deployments: + +Manual Rollback is supported across all configurations, including |AIO-SX|, +|AIO-DX|, Standard, and Standard with dedicated storage. + +VIM Orchestrated Rollback is supported on duplex configurations (AIO-DX, AIO-DX+, +Standard, and Standard with dedicated storage) for the following scenarios: + +- Rollback of Major Release software deployments + +- Rollback of Patch Release software deployments + +- Rollback of Patched Major Release deployments + +- Recovery from aborted or failed deployments + +These enhancements aim to streamline recovery workflows and reduce downtime +across a broader range of deployment scenarios. 
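As a quick illustration (not the full rollback procedure, which is covered in
the references below), the Unified Software Management CLI can be used to
confirm which releases are deployed and what state the deployment is in before
initiating a manual or orchestrated rollback:

.. code-block:: none

    # List software releases and their deploy states
    ~(keystone_admin)$ software list

    # Show the state of the current deployment (for example, a failed or
    # aborted deployment that is a candidate for rollback)
    ~(keystone_admin)$ software deploy show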
+ +**See**: + +- :ref:`orchestrated-rollback-host-software-deployment-c6b12f13a8a1` - :ref:`manual-rollback-host-software-deployment-9295ce1e6e29` -*********************************************************** -Unified Software Management - Rollback Orchestration AIO-SX -*********************************************************** +.. Greg is this required? -|VIM| Patch Orchestration has been enhanced to support the abort and rollback of -a Patch Release software deployment. |VIM| Patch Orchestration rollback will -automate the abort and rollback steps across all hosts of a Cloud configuration. - -.. note:: - - In |prod| 10.0, |VIM| Patch Orchestration Rollback is only - supported for |AIO-SX| configurations. - -In |prod-long| 10.0 |VIM| Patch Orchestration Rollback is only -supported if the Patch Release software deployment has been aborted or -failed prior to the 'software deploy activate' step. If the Patch Release -software deployment is at or beyond the 'software deploy activate' step, -then an install plus restore of the Cloud is required in order to rollback -the Patch Release deployment. - -**See**: :ref:`orchestrated-rollback-host-software-deployment-c6b12f13a8a1` - - -*********************************** -Enhancements to Full Debian Support -*********************************** - -The Kernel can be configured during runtime as [ standard <-> lowlatency ]. - -**See**: :ref:`Modify the Kernel using the CLI ` - -********************************************************* -Support for Kernel Live Patching (for possible scenarios) -********************************************************* - -|prod-long| supports live patching that enables fixing critical functions -without rebooting the system and enables systems to be functional and running. -The live-patching modules will be built into the upgraded |prod-long| binary -patch. - -The upgraded binary patch is generated as the in-service type (non-reboot-required). -The kernel modules will be matched with the correct kernel release version -during binary patch upgrading. - -The relevant kernel module can be found in the location: -'/lib/modules//extra/kpatch' - -During binary patch upgrading, the user space tool ``kpatch`` is -used for: - -- installing the kernel module to ${installdir} - -- loading(insmod) the kernel module for the running kernel - -- unloading(rmmod) the kernel module from the running kernel - -- uninstallling the kernel module from ${installdir} - -- listing the enabled live patch kernel module - -************************** -Subcloud Phased Deployment -************************** - -Subclouds can be deployed using individual phases. Therefore, instead of using -a single operation, a subcloud can be deployed by executing each phase individually. -Users have the flexibility to proactively abort the deployment based on their -needs. When the deployment is resumed, previously installed contents will be -still valid. - -**See**: :ref:`Install a Subcloud in Phases ` - -****************************** -Kubernetes Local Client Access -****************************** - -You can configure Kubernetes access for a user logged in to the active -controller either through SSH or by using the system console. - -**See**: :ref:`configure-kubernetes-local-client-access` - -******************************* -Kubernetes Remote Client Access -******************************* - -The access to the Kubernetes cluster from outside the controller can be done -using the remote CLI container or using the host directly. 
- -**See**: :ref:`configure-kubernetes-remote-client-access` - -************************************************** -IPv4/IPv6 Dual Stack support for Platform Networks -************************************************** - -Migration of a single stack deployment to dual stack network deployments will -not cause service disruptions. - -Dual-stack networking facilitates the simultaneous use of both IPv4 and IPv6 -addresses, or continue to use each IP version independently. To accomplish -this, platform networks can be associated with 1 or 2 address pools, one for -each IP version (IPv4 or IPv6). The first pool is linked to the network -upon creation and cannot be subsequently removed. The second pool can be added or -removed to transition the system between dual-stack and single-stack modes. - -**See**: :ref:`dual-stack-support-318550fd91b5` - -********************************* -Run Kata Containers in Kubernetes -********************************* - -There are two methods to run Kata Containers in Kubernetes: by runtime class or -by annotation. Runtime class is supported in Kubernetes since v1.12.0 or -higher, and it is the recommended method for running Kata Containers. - -**See**: :ref:`kata_container` - -*************************************************** -External DNS Alternative: Adding Local Host Entries -*************************************************** - -You can configure user-defined host entries for external resources that are not -maintained by |DNS| records resolvable by the external |DNS| server(s) (i.e. -``nameservers`` in ``system dns-show/dns-modify``). This functionality enables -the configuration of local host records, supplementing hosts resolvable by -external |DNS| server(s). - -**See**: :ref:`user-host-entries-configuration-9ad4c060eb15` - -******************************************* -Power Metrics Enablement - vRAN Integration -******************************************* - -|prod| 10.0 supports integrated enhanced power metrics tool with -reduced impact on vRAN field deployment. - -Power Metrics may increase the scheduling latency due to perf and |MSR| -readings. It was observed that there was a latency impact of around 3 µs on -average, plus spikes with significant increases in maximum latency values. -There was also an impact on the kernel processing time. Applications that -run with priorities at or above 50 in real-time kernel isolated CPUs should -allow kernel services to avoid unexpected system behavior. - -**See**: :ref:`install-power-metrics-application-a12de3db7478` - -****************************************** - Crash dump File Size Setting Enhancements -****************************************** - -The Linux kernel can be configured to perform a crash dump and reboot in -response to specific serious events. A crash dump event produces a -crash dump report with bundle of files that represent the state of the kernel at the -time of the event, which is useful for post-event root cause analysis. - -The crash dump files that are generated by Linux kdump are configured to be -generated during kernel panics (default) are managed by the crashDumpMgr utility. -The utility will save crash dump files but the current handling uses a fixed -configuration when saving files. In order to provide a more flexible system -handling the crashDumpMgr utility is enhanced to support the following -configuration parameters that will control the storage and rotation of crash -dump files. 
- -- Maximum Files: New configuration parameter for the number of saved crash - dump files (default 4). - -- Maximum Size: Limit the maximum size of an individual crash dump file - (support for unlimited, default 5GB). - -- Maximum Used: Limit the maximum storage used by saved crash dump files - (support for unlimited, default unlimited). - -- Minimum Available: Limit the minimum available storage on the crash dump - file system (restricted to minimum 1GB, default 10%). - -The service parameters must be specified using the following service hierarchy. -It is recommended to model the parameters after the platform coredump service -parameters for consistency. - -.. code-block:: none - - platform crashdump = - -**See**: :ref:`customize-crashdumpmanager-46e0d32891a0` - -.. Michel Desjardins please confirm if this is applicable? - -*********************************************** -Subcloud Install or Restore of Previous Release -*********************************************** - -|prod| |this-ver| system controller supports both |prod| 9.0 and -|prod| |this-ver| subclouds fresh install or restore. - -If the upgrade is from |prod| 9.0 to a higher release, the **prestage status** -and **prestage versions** fields in the output of the -:command:`dcmanager subcloud list` command will be empty, regardless of whether -the deployment status of the subcloud was ``prestage-complete`` before the upgrade. -These fields will only be updated with values if you run ``subcloud prestage`` -or ``prestage orchestration`` again. - -**See**: :ref:`Subclouds Previous Major Release Management ` - -**For non-prestaged subcloud remote installations** -The ISO imported via ``load-import --active`` should always be at the same patch -level as the system controller. This is to ensure that the subcloud boot image -aligns with the patch level of the load to be installed on the subcloud. - -**See**:`installing-a-subcloud-using-redfish-platform-management-service` - -**For prestaged remote subcloud installations** -The ISO imported via ``load-import --inactive`` should be at the same patch level -as the system controller. If the system controller is patched after subclouds -have been prestaged, it is recommended to repeat the prestaging for each -subcloud. This is to ensure that the subcloud boot image aligns with the patch -level of the load to be installed on the subcloud. -**See**: :ref:`prestaging-prereqs` - -**************************************** -WAD Users Access Right Control via Group -**************************************** - -You can configure an |LDAP| / |WAD| user with 'sys_protected' group or 'sudo all'. - -- an |LDAP| / |WAD| user in 'sys_protected' group on |prod-long| - - - is equivalent to the special 'sysadmin' bootstrap user - - - via "source /etc/platform/openrc" - - - has Keystone admin/admin identity and credentials, and - - has Kubernetes /etc/kubernetes/admin.conf credentials - - - only a small number of users have this capability - -- an |LDAP| / |WAD| user with 'sudo all' capability on |prod-long| - - - can perform the following |prod|-type operations: - - sw_patch to unauthenticated endpoint - - docker/crictl to communicate with the respective daemons - - using some utilities - like show-certs.sh, license-install (recovery only) - - IP configuration for local network setup - - password changes of Linux users (i.e. local LDAP) - - access to restricted files, including some logs - - manual reboots - -The local |LDAP| server by default serves both HTTPS on port 636 and HTTP on -port 389. 
- -The HTTPS server certificate is issued by cert-manager ClusterIssuer -``system-local-ca`` and is managed internally by cert-manager. The certificate -will be automatically renewed when the expiration date approaches. The -certificate is called ``system-openldap-local-certificate`` with its secret -having the same name ``system-openldap-local-certificate`` in the -``deployment`` namespace. The server certificate and private key files are -stored in the ``/etc/ldap/certs/`` system directory. - -**See**: - -- :ref:`local-ldap-certificates-4e1df1e39341` - -- :ref:`sssd-support-5fb6c4b0320b` - -- :ref:`create-ldap-linux-accounts` - -**************************************************************************************** -Accessing Collect Command with 'sudo' privileges and membership in 'sys-protected' Group -**************************************************************************************** - -The |prod| 10.0 adds support to run ``Collect`` from any -local |LDAP| or Remote |WAD| user account with 'sudo' capability and a member -of the 'sys_protected' group. - -The ``Collect`` tool continues support from the 'sysadmin' user account -and also being run from any other successfully created |LDAP| and |WAD| account -with 'sudo' capability and a member of the 'sys_protected' group. - -For security reasons, no password 'sudo' continues to be unsupported. - -.. Eric McDonald please confirm if this is supported in Stx 10.0 - -******************************** -Support for Intel In-tree Driver -******************************** - -The system supports both in-tree and out-of-tree versions of the Intel ``ice``, -``i40e``, and ``iavf`` drivers. On initial installation, the system uses the -default out-of-tree driver version. You can switch between the in-tree and -out-of-tree driver versions. For further details: - -**See**: :ref:`intel-driver-version-c6e3fa384ff7` - -.. note:: - - The ice in-tree driver does not support SyncE/GNSS deployments. - -************************** -Password Rules Enhancement -************************** - -You can check current password expiry settings by running the -:command:`chage -l ` command replacing ```` with the name -of the user whose password expiry settings you wish to view. - -You can also change password expiry settings by running the -:command:`sudo chage -M ` command. - -Use the following new password rules as listed below: - -1. There should be a minimum length of 12 characters. - -2. The password must contain at least one letter, one number, and one special - character. - -3. Do not reuse the past 5 passwords. - -4. The Password expiration period should be defined by users, but by default - it is set to 90 days. - -**See**: - -- :ref:`linux-accounts-password-3dcad436dce4` - -- :ref:`starlingx-system-accounts-system-account-password-rules` - -- :ref:`system-account-password-rules` - -******************************************************************************* -Management Network Reconfiguration after Deployment Completion Phase 1 |AIO-SX| -******************************************************************************* - -|prod| 10.0 supports changes to the management IP addresses -for a standalone |AIO-SX| and for an |AIO-SX| subcloud after the node is -completely deployed. 
- -**See**: - -- :ref:`Manage Management Network Parameters for a Standalone AIO-SX ` - -- :ref:`Manage Subcloud Management Network Parameters ` - -**************************** -Networking Statistic Support -**************************** - -The Node Interface Metrics Exporter application is designed to fetch and -display node statistics in a Kubernetes environment. It deploys an Interface -Metrics Exporter DaemonSet on all nodes with the -``starlingx.io/interface-metrics=true node`` label. It uses the Netlink library -to gather data directly from the kernel, offering real-time insights into node -performance. - -**See**: :ref:`node-interface-metrics-exporter-application-d98b2707c7e9` - -***************************************************** -Add Existing Cloud as Subcloud Without Reinstallation -***************************************************** - -The subcloud enrollment feature converts a factory pre-installed system -or initially deployed as a standalone cloud system to a subcloud of a |DC|. -Factory pre-installation standalone systems must be installed locally in the -factory, and later deployed and configured on-site as a |DC| subcloud without -re-installing the system. - -**See**: :ref:`Enroll a Factory Installed Non Distributed Standalone System as a Subcloud ` - -******************************************** -Rook Support for freshly Installed StarlingX -******************************************** - -The new Rook Ceph application will be used for deploying the latest version of -Ceph via Rook. - -Rook Ceph is an orchestrator that provides a containerized solution for -Ceph Storage with a specialized Kubernetes Operator to automate the management -of the cluster. It is an alternative solution to the bare metal Ceph storage. -See https://rook.io/docs/rook/latest-release/Getting-Started/intro/ for more -details. - -The deployment model is the topology strategy that defines the storage backend -capabilities of the deployment. The deployment model dictates how the storage -solution will look like when defining rules for the placement of storage -cluster elements. - -Enhanced Availability for Ceph on AIO-DX -**************************************** - -Ceph on |AIO-DX| now works with 3 Ceph monitors providing High Availability and -enhancing uptime and resilience. - -Available Deployment Models -*************************** - -Each deployment model works with different deployment strategies and rules to -fit different needs. The following models available for the requirements of -your cluster are: - -- Controller Model (default) - -- Dedicated Model - -- Open Model - -**See**: :ref:`Deployment Models and Services for Rook Ceph ` - -Storage Backend -*************** - -Configuration of the storage backend defines the deployment models -characteristics and main configurations. - -Migration with Rook container based Ceph Installations -****************************************************** - -When you migrate an |AIO-SX| to an |AIO-DX| subcloud with Rook container-based -Ceph installations in |prod| 10.0, you would need to follow the -additional procedural steps below: - -.. rubric:: |proc| - -After you configure controller-1, follow the steps below: - -#. Add a new Ceph monitor on controller-1. - - .. code-block::none - - ~(keystone_admin)$ system host-fs-add controller-1 ceph= - -#. Add an |OSD| on controller-1. - - #. List host's disks and identify disks you want to use for Ceph |OSDs|. Ensure - you note the |UUIDs|. - - .. 
code-block::none - - ~(keystone_admin)$ system host-disk-list controller-1 - - #. Add disks as an |OSD| storage. - - .. code-block::none - - ~(keystone_admin)$ system host-stor-add controller-1 osd - - #. List |OSD| storage devices. - - .. code-block::none - - ~(keystone_admin)$ system host-stor-list controller-1 - -Unlock controller-1 and follow the steps below: - -#. Wait until Ceph is updated with two active monitors. To verify the updates, - run the :command:`ceph -s` command and ensure the output shows - `mon: 2 daemons, quorum a,b`. This confirms that both monitors are active. - - .. code-block::none - - ~(keystone_admin)$ ceph -s - cluster: - id: c55813c6-4ce5-470b-b9f5-e3c1fa0c35b1 - health: HEALTH_WARN - insufficient standby MDS daemons available - services: - mon: 2 daemons, quorum a,b (age 2m) - mgr: a(active, since 114s), standbys: b - mds: 1/1 daemons up - osd: 4 osds: 4 up (since 46s), 4 in (since 65s) - -#. Add the floating monitor. - - .. code-block::none - - ~(keystone_admin)$ system host-lock controller-1 - ~(keystone_admin)$ system controllerfs-add ceph-float= - ~(keystone_admin)$ system host-unlock controller-1 - - Wait for the controller to reset and come back up to an operational state. - -#. Re-apply the ``rook-ceph`` application. - - .. code-block::none - - ~(keystone_admin)$ system application-apply rook-ceph - -To Install and Uninstall Rook Ceph -********************************** - -**See**: - -- :ref:`Install Rook Ceph ` - -- :ref:`Uninstall Rook Ceph ` - - -Performance Configurations on Rook Ceph +*************************************** +Upgrade / Rollback Process Optimization *************************************** -When using Rook Ceph it is important to consider resource allocation and -configuration adjustments to ensure optimal performance. Rook introduces -additional management overhead compared to a traditional bare-metal Ceph setup -and needs more infrastructure resources. +To accelerate recovery from failed operations during software updates and upgrades, +a new snapshot-based restore capability is introduced in |prod-long| Release +11.0. Unlike traditional backup and restore, this feature leverages OSTree +deployment management and LVM volume snapshots to revert the system to a +previously saved state without requiring a full reinstall. Snapshots will be +created for select LVM volumes, excluding directories such as /opt/backup, +/var/log, and /scratch, as outlined in the "Filesystem Summary" below. This +capability is currently limited to Simplex systems (AIO-SX). -**See**: :ref:`performance-configurations-rook-ceph-9e719a652b02` +.. 
list-table:: FileSystem Summary + :header-rows: 1 + :stub-columns: 1 -********************************************************************************** -Protecting against L2 Network Attackers - Securing local traffic on MGMT networks -********************************************************************************** + * - LVM Name + - Mount Path + - DRBD + - Versioned** + - Snapshot + * - root-lv + - /sysroot + - - + - N + - N* + * - var-lv + - /var + - - + - N + - Y + * - log-lv + - /var/log + - - + - N + - N + * - backup-lv + - /var/rootdirs/opt/backups + - - + - N + - N + * - ceph-mon-lv + - /var/lib/ceph/mon + - - + - N + - N + * - docker-lv + - /var/lib/docker + - - + - N + - Y + * - kubelet-lv + - /var/lib/kubelet + - - + - N + - Y + * - pgsql-lv + - /var/lib/postgresql + - drbd0 + - Y + - Y + * - rabbit-lv + - /var/lib/rabbitmq + - drbd1 + - Y + - Y + * - dockerdistribution-lv + - /var/lib/docker-distribution + - drbd8 + - N + - N + * - platform-lv + - /var/rootdirs/opt/platform + - drbd2 + - Y + - Y + * - etcd-lv + - /var/rootdirs/opt/etcd + - drbd7 + - Y + - Y + * - extension-lv + - /var/rootdirs/opt/extension + - drbd5 + - N + - N + * - dc-vault-lv + - /var/rootdirs/opt/dc-vault + - drbd6 + - Y + - N + * - scratch-lv + - /var/rootdirs/scratch + - - + - N + - N -A new security solution is introduced for |prod-long| inter-host management -network: +* Managed by OSTree -- Attackers with direct access to local |prod-long| L2 VLANs +** Versioned subpaths - - specifically protect LOCAL traffic on the MGMT network which is used for - private/internal infrastructure management of the |prod-long| cluster. +**See**: -- Protection against both passive and active attackers accessing private/internal - data, which could risk the security of the cluster - - - passive attackers that are snooping traffic on L2 VLANs (MGMT), and - - active attackers attempting to connect to private internal endpoints on - |prod-long| L2 interfaces (MGMT) on |prod| hosts. - -IPsec is a set of communication rules or protocols for setting up secure -connections over a network. |prod| utilizes IPsec to protect local traffic -on the internal management network of multi-node systems. - -|prod| uses strongSwan as the IPsec implementation. strongSwan is an -opensource IPsec solution. See https://strongswan.org/ for more details. - -For the most part, IPsec on |prod| is transparent to users. - -**See**: - -- :ref:`IPsec Overview ` - -- :ref:`Configure and Enable IPsec ` - -- :ref:`IPSec Certificates ` - -- :ref:`IPSec CLIs ` - -********************************************************** -Vault application support for running on application cores -********************************************************** - -By default the Vault application's pods will run on platform cores. - -"If ``static kube-cpu-mgr-policy`` is selected and when overriding the label -``app.starlingx.io/component`` for Vault namespace or pods, there are two requirements: - -- The Vault server pods need to be restarted as directed by Hashicorp Vault - documentation. Restart each of the standby server pods in turn, then restart - the active server pod. - -- Ensure that sufficient hosts with worker function are available to run the - Vault server pods on application cores. - -**See**: - -- :ref:`Kubernetes CPU Manager Policies `. - -- :ref:`System backup, System and Storage Restore `. - -- :ref:`Run Hashicorp Vault Restore Playbook Remotely `. - -- :ref:`Run Hashicorp Vault Restore Playbook Locally on the Controller `. 
- -Restart the Vault Server pods -***************************** - -The Vault server pods do not restart automatically. If the pods are to be -re-labelled to switch execution from platform to application cores, or vice-versa, -then the pods need to be restarted. - -Under Kubernetes the pods are restarted using the :command:`kubectl delete pod` -command. See, Hashicorp Vault documentation for the recommended procedure for -restarting server pods in |HA| configuration, -https://support.hashicorp.com/hc/en-us/articles/23744227055635-How-to-safely-restart-a-Vault-cluster-running-on-Kubernetes. - -Ensure that sufficient hosts are available to run the server pods on application cores -************************************************************************************** - -The standard cluster with less than 3 worker nodes does not support Vault |HA| -on the application cores. In this configuration (less than three cluster hosts -with worker function): - -- When setting label app.starlingx.io/component=application with the Vault - app already applied in |HA| configuration (3 Vault server pods), ensure that - there are 3 nodes with worker function to support the |HA| configuration. - -- When applying Vault for the first time and with ``app.starlingx.io/component`` - set to "application": ensure that the server replicas is also set to 1 for - non-HA configuration. The replicas for Vault server are overriden both for - the Vault Helm chart and the Vault manager Helm chart: - - .. code-block:: none - - cat < vault_overrides.yaml - server: - extraLabels: - app.starlingx.io/component: application - ha: - replicas: 1 - injector: - extraLabels: - app.starlingx.io/component: application - EOF - - cat < vault-manager_overrides.yaml - manager: - extraLabels: - app.starlingx.io/component: application - server: - ha: - replicas: 1 - EOF - - $ system helm-override-update vault vault vault --values vault_overrides.yaml - - $ system helm-override-update vault vault-manager vault --values vault-manager_overrides.yaml - -****************************************************** -Component Based Upgrade and Update - VIM Orchestration -****************************************************** - -|VIM| Patch Orchestration in StarlingX 10.0 has been updated to interwork with -the new underlying Unified Software Management APIs. - -As before, |VIM| Patch Orchestration automates the patching of software across -all hosts of a Cloud configuration. All Cloud configurations are supported; -|AIO-SX|, |AIO-DX|, |AIO-DX| with worker nodes, Standard configuration with controller -storage and Standard configuration with dedicated storage. - -.. note:: - - This includes the automation of both applying a Patch and removing a Patch. - -**See** +- :ref:`deploy-software-releases-using-the-cli-1ea02eb230e5` - :ref:`orchestrated-deployment-host-software-deployment-d234754c7d20` -- :ref:`orchestrated-removal-host-software-deployment-3f542895daf8` . +************************************ +Platform Real Time Kernel Robustness +************************************ -********************************************************** -Subcloud Remote Install, Upgrade and Prestaging Adaptation -********************************************************** +Stalld can be configured to use the ``queue_track`` backend, which is based +on |eBPF|. Stalld protects lower priority tasks from starvation. -StarlingX 10.0 supports software management upgrade/update process -that does not require re-installation. 
The procedure for upgrading a system is -simplified since the existing filesystem and associated release configuration -will remain intact in the versioned controlled paths (e.g. /opt/platform/config/). -In addition the /var and /etc directories is retained, indicating that -updates can be done directly as part of the software migration procedure. This -eliminates the need to perform a backup and restore procedure for |AIO-SX| -based systems. In addition, the rollback procedure can revert to the -existing versioned or saved configuration in the event an error occurs -if the system must be reverted to the older software release. +Unlike other backends, ``queue_track`` reduces CPU usage and more accurately +identifies which tasks can be excecuted even if they are currently blocked +waiting for a lock. -With this change, prestaging for an upgrade will involve populating a new ostree -deployment directory in preparation for an atomic upgrade and pulling new container -image versions into the local container registry. Since the system is not -reinstalled, there is no requirement to save container images to a protected -partition during the prestaging process, the new container images can be -populated in the local container registry directly. +**See**: :ref:`configure-stall-daemon-b38ece463e88`. -**See**: :ref:`prestage-a-subcloud-using-dcmanager-df756866163f` +***************************************** +Enable CONFIG_GENEVE Kernel Configuration +***************************************** -******************************************************** -Update Default Certificate Configuration on Installation -******************************************************** +|prod-long| Release 11.0 supports geneve.ko kernel module, controlled by the +CONFIG_GENEVE kernel config option. -You can configure default certificates during install for both standalone and -Distributed Cloud systems. +************************************************************************ +Cloud User Management GUI/CLI/RESTAPI Enhancements; Deletion Restriction +************************************************************************ -**New bootstrap overrides for system-local-ca (Platform Issuer)** +In |prod-long| Release 11.0, existing Local |LDAP| users in the sudo group do +not need to be migrated to the sys_admin group. -- You can customize the Platform Issuer (system-local-ca) used to sign the platform - certificates with an external Intermediate |CA| from bootstrap, using the new - bootstrap overrides. +Administrators may retain their existing configuration if required. However, +to better align with the platform's security and access control standards, +it is recommended to assign restricted sudo privileges through the sys_admin +group. - **See**: :ref:`Platform Issuer ` +Administrators may optionally update their configurations by transitioning +Local LDAP users from the sudo group to the sys_admin group. This can be done +using ONLY the following method: - .. note:: +``via pam_group & /etc/security/group.conf`` to map users into additional groups - It is recommended to configure these overrides. If it is not configured, - ``system-local-ca`` will be configured using a local auto-generated - Kubernetes Root |CA|. +**See**: :ref:`local-ldap-linux-user-accounts`. 
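As an illustration of the ``pam_group`` mapping described above, an
``/etc/security/group.conf`` entry of the following form maps a Local |LDAP|
user into the ``sys_admin`` group (the user name is hypothetical, and the exact
PAM service names used on the platform may differ):

.. code-block:: none

    # /etc/security/group.conf
    # format: services; ttys; users; times; groups
    # map the hypothetical LDAP user 'ldapops' into sys_admin at any time
    *;*;ldapops;Al0000-2400;sys_admin

    # pam_group must be enabled in the PAM auth stack, for example:
    # auth optional pam_group.so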
-**REST API / Horizon GUI and Docker Registry certificates are issued during bootstrap** +******************************* +In-tree and Out-of-tree drivers +******************************* -- The certificates for StarlingX REST APIs / Horizon GUI access and Local - Docker Registry will be automatically issued by ``system-local-ca`` during - bootstrap. They will be anchored to ``system-local-ca`` Root CA public - certificate, so only this certificate needs to be added in the user list of - trusted CAs. - -**HTTPS enabled by default for StarlingX REST API access** - -- The system is now configured by default with HTTPS enabled for access to - StarlingX API and the Horizon GUI. The certificate used to secure this will be - anchored to ``system-local-ca`` Root |CA| public certificate. - -**Playbook to update system-local-ca and re-sign the renamed platform certificates** - -- The ``migrate_platform_certificates_to_certmanager.yml`` playbook is renamed - to ``update_platform_certificates.yml``. - -**External certificates provided in bootstrap overrides can now be provided as -base64 strings, such that they can be securely stored with Ansible Vault** - -- The following bootstrap overrides for certificate data **CAN** be provided as - the certificate / key converted into single line base64 strings instead of the - filepath for the certificate / key: - - - ssl_ca_cert - - - k8s_root_ca_cert and k8s_root_ca_key - - - etcd_root_ca_cert and etcd_root_ca_key - - - system_root_ca_cert, system_local_ca_cert and system_local_ca_key - - .. note:: - - You can secure the certificate data in an encrypted bootstrap - overrides file using Ansible Vault. - - The base64 string can be obtained using the :command:`base64 -w0 ` - command. The string can be included in the overrides YAML file - (secured via Ansible Vault), then insecurely managed ``cert_file`` - can be removed from the system. - -*************************************************** -Dell CSI Driver Support - Test with Dell PowerStore -*************************************************** - -|prod| 10.0 supports a new system application to support -kubernetes CSM/CSI for Dell Storage Platforms. With this application the user -can communicate with Dell PowerScale, PowerMax, PowerFlex, PowerStore and -Unity XT Storage Platforms to provision |PVCs| and use them on kubernetes -stateful applications. - -**See**: :ref:`Dell Storage File System Provisioner ` -for details on installation and configurations. - -************************************************ -O-RAN O2 IMS and DMS Interface Compliancy Update -************************************************ - -With the new updates in Infrastructure Management Services (IMS) and -Deployment Management Services (DMS) the J-release for O-RAN O2, OAuth2 and mTLS -are mandatory options. It is fully compliant with latest O-RAN spec O2 IMS -interface R003 -v05.00 version and O2 DMS interface K8s profile - R003-v04.00 -version. Kubernetes Secrets are no longer required. - -The services implemented include: - -- O2 API with mTLS enabled - -- O2 API supported OAuth2.0 - -- Compliance with O2 IMS and DMS specs - -**See**: :ref:`oran-o2-application-b50a0c899e66` - -*************************************************** -Configure Liveness Probes for PTP Notification Pods -*************************************************** - -Helm overrides can be used to configure liveness probes for ``ptp-notification`` -containers. 
- -**See**: :ref:`configure-liveness-probes` - -************************* -Intel QAT and GPU Plugins -************************* - -The |QAT| and |GPU| applications provide a set of plugins developed by Intel -to facilitate the use of Intel hardware features in Kubernetes clusters. -These plugins are designed to enable and optimize the use of Intel-specific -hardware capabilities in a Kubernetes environment. - -Intel |GPU| plugin enables Kubernetes clusters to utilize Intel GPUs for -hardware acceleration of various workloads. - -Intel® QuickAssist Technology (Intel® QAT) accelerates cryptographic workloads -by offloading the data to hardware capable of optimizing those functions. - -The following QAT and GPU plugins are supported in |prod| 10.0. +In |prod-long| Release 11.0 only the out-of-tree versions of the Intel ``ice``, +``i40e``, and ``iavf`` drivers are supported. Switching between in-tree and +out-of-tree driver versions are not supported. **See**: -- :ref:`intel-device-plugins-operator-application-overview-c5de2a6212ae` +- :ref:`intel-driver-version-c6e3fa384ff7` -- :ref:`gpu-device-plugin-configuration-615e2f6edfba` +************************************ +CaaS Traffic Bandwidth Configuration +************************************ -- :ref:`qat-device-plugin-configuration-616551306371` +Previously, the ``max_tx_rate`` parameter was used to set the maximum transmission +rate for a |VF| interface, with the short form `-r`. With the introduction of the +``max_rx_rate`` parameter that is used to configure the maximum receiving +rate, both ``max_tx_rate`` and ``max_rx_rate`` can now be applied to define +bandwidth limits for platform interfaces. To align with naming conventions: -****************************************** -Support for Sapphire Rapids Integrated QAT -****************************************** +- ``-t`` short form for ``max_tx_rate`` parameter allows the configuration of + the maximum transmission rate for both |VF| and platform interfaces. -Intel 4th generation Xeon Scalable Processor (Sapphire Rapids) support has been -introduced for the |prod| 10.0. - -- Drivers for QAT Gen 4 Intel Xeon Gold Scalable processor (Sapphire Rapids) - - - Intel Xeon Gold 6428N - -************************************************** -Sapphire Rapids Data Streaming Accelerator Support -************************************************** - -Intel® |DSA| is a high-performance data copy and transformation accelerator -integrated into Intel® processors starting with 4th Generation Intel® Xeon® -processors. It is targeted for optimizing streaming data movement and -transformation operations common with applications for high-performance -storage, networking, persistent memory, and various data processing -applications. - -**See**: :ref:`data-streaming-accelerator-db88a67c930c` - -************************* -DPDK Private Mode Support -************************* - -For the purpose of enabling and using ``needVhostNet``, |SRIOV| needs to be -configured on a worker host. - -**See**: :ref:`provisioning-sr-iov-interfaces-using-the-cli` - -****************************** -|SRIOV| |FEC| Operator Support -****************************** - -|FEC| Operator 2.9.0 is adopted based on Intel recommendations offering features -for various Intel hardware accelerators used for field deployments. 
- -**See**: :ref:`configure-sriov-fec-operator-to-enable-hw-accelerators-for-hosted-vran-containarized-workloads` - -****************************************************** -Support for Advanced VMs on Stx Platform with KubeVirt -****************************************************** - -The KubeVirt system application kubevirt-app-1.1.0 in |prod-long| includes: -KubeVirt, Containerized Data Importer (CDI) v1.58.0, and the Virtctl client tool. -|prod| 10.0 supports enhancements for this application, describes -the Kubevirt architecture with steps to install Kubevirt and provides examples -for effective implementation in your environment. +- ``r`` short form for ``max_rx_rate`` parameter is used to set the maximum + receiving rate for platform interfaces. **See**: -- :ref:`index-kubevirt-f1bfd2a21152` +- :ref:`configuring-platform-network-bandwidth-using-cli-5425dde3ff23`. -*************************************************** -Support Harbor Registry (Harbor System Application) -*************************************************** +********************************** +Rook Ceph Updates and Enhancements +********************************** -Harbor registry is integrated as a System Application. End users can use Harbor, -running on |prod-long|, for holding and managing their container images. The -Harbor registry is currently not used by the platform. +Rook Ceph is an orchestrator that provides a containerized solution for Ceph +Storage with a specialized Kubernetes Operator to automate the management of +the cluster. It is an alternative solution for the bare-metal Ceph storage. +See https://rook.io/docs/rook/latest-release/Getting-Started/intro/ for more +details. -Harbor is an open-source registry that secures artifacts with policies and -role-based access control, ensures images are scanned and free from -vulnerabilities, and signs images as trusted. Harbor has been evolved to a -complete |OCI| compliant cloud-native artifact registry. +``ECblock`` pools are renamed: Both data and metadata pools for ``ECblock`` on +Rook Ceph changed names to comply with the new standards for upstream Rook Ceph. -With Harbor V2.0, users can manage images, manifest lists, Helm charts, -|CNABs|, |OPAs| among others which all adhere to the |OCI| image specification. -It also allows for pulling, pushing, deleting, tagging, replicating, and -scanning such kinds of artifacts. Signing images and manifest list are also -possible now. +- Data pool was renamed from ``ec-data-pool`` to ``kube-ecblock`` + +- Metadata pool was renamed from ``ec-metadata-pool`` to ``kube-ecblock-metadata`` + +**Ceph version upgrade** + +Ceph version is upgraded from 18.2.2 to 18.2.5, with minimal impact on the upgrade. + +**Rook Ceph OSDs Management** + +To add, remove or replace |OSDs| in a Rook Container-based Ceph, see +:ref:`host-delete-of-a-rook-ceph-cluster-member-d4892a8f4364` .. note:: - When using local |LDAP| for authentication of the Harbor system application, - you cannot use local |LDAP| groups for authorization; use only individual - local |LDAP| users for authorization. + Host-based Ceph is deprecated in |prod-long| Release 11.0. -**See**: :ref:`harbor-as-system-app-1d1e3ec59823` + For any new |prod-long| deployments Rook Ceph is mandatory in order to + prevent any service disruptions during migration procedures. -************************** -Support for DTLS over SCTP -************************** -DTLS (Datagram Transport Layer Security) v1.2 is supported in |prod| 10.0. 
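+As a quick post-update check, the renamed pools can be listed with the Ceph
+CLI. This is only an illustrative sketch; it assumes the ``ceph`` client is
+reachable from the active controller (for example, through the Rook Ceph
+toolbox pod), and the exact pool list will vary with your deployment:
+
+.. code-block:: none
+
+   # Confirm the running Ceph release and overall cluster health
+   $ ceph versions
+   $ ceph health
+
+   # Verify that the renamed ECblock pools are present
+   $ ceph osd pool ls | grep kube-ecblock
+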
+************************************ +User Management GUI/CLI Enhancements +************************************ -1. The |SCTP| module is now autoloaded by default. +For critical operations performed via the |prod| CLI or GUI such as delete +actions or operations that may impact services the system will display a +warning indicating that the operation is critical, irreversible, or +may affect service availability. Also the system will prompt the user to +confirm before proceeding with the execution of the operation. -2. The socket buffer size values have been upgraded: +A user confirmation request can optionally be used to safeguard critical +operations performed via the CLI. When the user CLI confirmation request is +enabled, CLI users are prompted to explicitly confirm a potentially critical +or destructive CLI command, before proceeding with the execution of the CLI +command. - Old values (in Bytes): +**See**: - - net.core.rmem_max=425984 +- :ref:`confirmation-support-8f0f2784db15` - - net.core.wmem_max=212992 +- :ref:`kubernetes-user-tutorials-configuring-container-backed-remote-clis-and-clients` - New Values (In Bytes): +*************************************************** +Optimized Platform Processing and Memory Usage- Ph1 +*************************************************** - - net.core.rmem_max=10485760 +|prod-long| Release 11.0 requires approximately 1 GB less memory, enabling +more efficient deployment in resource-constrained environments. - - net.core.wmem_max=10485760 +This feature is designed to optimize platform resource utilization, +specifically targeting processing and memory efficiency. This enables greater +flexibility for |prod-long| deployments in use cases with tighter footprint +constraints. -3. To enable each |SCTP| socket association to have its own buffer space, the - socket accounting policies have been updated as follows: +****************************************************** +Kubernetes Upgrade Procedure Optimization - Multi-Node +****************************************************** - - net.sctp.sndbuf_policy=1 +This feature enhances Kubernetes version upgrades across all |prod| +configurations including |AIO-DX|, |AIO-DX| with worker nodes, standard +configurations with controller storage, and standard configurations with +dedicated storage extending beyond |AIO-SX|. - - net.sctp.rcvbuf_policy=1 +The following enhancements are introduced: - Old value: +- Pre-caching of container images for all relevant versions during the upgrade's + preliminary phase. + +- The upgrade system now supports multi-node multi-K8s-version K8s upgrade + (both manual, and orchestrated), ie.: - - net.sctp.auth_enable=0 + - it supports multi-node upgrades of multiple Kubernetes versions in a + single manual upgrade + + - it supports multi-node upgrades of multiple Kubernetes versions in a + single orchestration - New value: +Previously, for multi-node environments, the Kubernetes upgrade process had +to be repeated end-to-end for each version in sequence. - - net.sctp.auth_enable=1 +Now, the upgrade system checks for kubelet version skew, allowing kubelet +components to run up to three minor versions behind the control plane. +This enhancement enables multi-version upgrades in a single cycle, eliminating +the need to upgrade kubelet through each intermediate version. As a result, +the overall number of upgrade steps is significantly reduced. 
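+
+For illustration only, a multi-version upgrade is still driven by the standard
+Kubernetes upgrade commands; the target version below is a placeholder, and the
+full per-configuration sequence is described in the guides referenced next:
+
+.. code-block:: none
+
+   # List the available and active Kubernetes versions
+   ~(keystone_admin)]$ system kube-version-list
+
+   # Start an upgrade towards the chosen target version
+   ~(keystone_admin)]$ system kube-upgrade-start v1.32.2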
-*********************************************************** -Banner Information Automation during Subcloud Bootstrapping -*********************************************************** -Users can now customize and automate banner information for subclouds during -system commissioning and installation. +**See**: -You can customize the pre-login message (issue) and post-login |MOTD| across -the entire |prod| cluster during system commissioning and installation. +- :ref:`configuring-kubernetes-multi-version-upgrade-orchestration-aio-b0b59a346466` -**See**: :ref:`Brand the Login Banner During Commissioning ` +- :ref:`manual-kubernetes-components-upgrade` -.. end-new-features-r10 +- :ref:`manual-kubernetes-multi-version-upgrade-in-aio-sx-13e05ba19840` + +.. end-new-features-r11 ---------------- Hardware Updates @@ -1190,7 +746,7 @@ Fixed bugs ********** This release provides fixes for a number of defects. Refer to the StarlingX bug -database to review the R10.0 `Fixed Bugs `_. +database to review the R11.0 `Fixed Bugs `_. .. All please confirm if any Limitations need to be removed / added for Stx 10.0. @@ -1198,7 +754,7 @@ database to review the R10.0 `Fixed Bugs cm_test_cert.yml + --- + apiVersion: cert-manager.io/v1 + kind: ClusterIssuer + metadata: + creationTimestamp: null + name: system-local-ca + spec: + ca: + secretName: system-local-ca + status: {} + --- + apiVersion: cert-manager.io/v1 + kind: Certificate + metadata: + creationTimestamp: null + name: stx-test-cm + namespace: cert-manager + spec: + commonName: stx-test-cm + issuerRef: + kind: ClusterIssuer + name: system-local-ca + secretName: stx-test-cm + status: {} + eof + +.. code-block:: none + + $ kubectl apply -f cm_test_cert.yml + + $ rm cm_test_cert.yml + + $ kubectl wait certificate -n cert-manager stx-test-cm --for=condition=Ready --timeout 20m + + # Verify that the TLS secret associated with the cert was created, using the following: + + $ kubectl get secret -n cert-manager stx-test-cm + +cert-manager cm-acme-http-solver pod fails +****************************************** + +On a multinode setup, when you deploy an acme issuer to issue a certificate, +the ``cm-acme-http-solver`` pod might fail and stays in "ImagePullBackOff" state +due to the following defect https://github.com/cert-manager/cert-manager/issues/5959. + +**Procedural Changes**: + +1. If you are using the namespace "test", create a docker-registry secret + "testkey" with local registry credentials in the "test" namespace. + + .. code-block:: none + + ~(keystone_admin)]$ kubectl create secret docker-registry testkey --docker-server=registry.local:9001 --docker-username=admin --docker-password=Password*1234 -n test + +2. Use the secret "testkey" in the issuer spec as follows: + + .. code-block:: none + + apiVersion: cert-manager.io/v1 + kind: Issuer + metadata: + name: stepca-issuer + namespace: test + spec: + acme: + server: https://test.com:8080/acme/acme/directory + skipTLSVerify: true + email: test@test.com + privateKeySecretRef: + name: stepca-issuer + solvers: + - http01: + ingress: + podTemplate: + spec: + imagePullSecrets: + - name: testkey + class: nginx + +Vault application is not supported during bootstrap +*************************************************** + +The Vault application cannot be configured during bootstrap. + +**Procedural Changes**: + +The application must be configured after the platform nodes are unlocked / +enabled / available, a storage backend is configured, and ``platform-integ-apps`` +is applied. 
If Vault is to be run in |HA| configuration (3 vault server pods) +then at least three controller / worker nodes must be unlocked / enabled / available. + +Vault application support for running on application cores +********************************************************** + +By default the Vault application's pods will run on platform cores. When +changing the core selection from platform cores to application cores the +following additional procedure is required for the vault application. + +**Procedural Changes**: + +"If ``static kube-cpu-mgr-policy`` is selected and when overriding the label +``app.starlingx.io/component`` for Vault namespace or pods, there are two +requirements: + +- The Vault server pods need to be restarted as directed by Hashicorp Vault + documentation. Restart each of the standby server pods in turn, then restart + the active server pod. + +- Ensure that sufficient hosts with worker function are available to run the + Vault server pods on application cores. + +**See**: :ref:`Kubernetes CPU Manager Policies `. + +Restart the Vault Server pods +============================= + +The Vault server pods do not restart automatically. + +**Procedural Changes**: If the pods are to be re-labelled to switch execution from platform to +application cores, or vice-versa, then the pods need to be restarted. + +Under kubernetes the pods are restarted using the :command:`kubectl delete pod` +command. See, Hashicorp Vault documentation for the recommended procedure for +restarting server pods in |HA| configuration, +https://support.hashicorp.com/hc/en-us/articles/23744227055635-How-to-safely-restart-a-Vault-cluster-running-on-Kubernetes. + +Ensure that sufficient hosts are available to run the server pods on application cores +====================================================================================== + +The standard cluster with less than 3 worker nodes does not support Vault |HA| +on the application cores. In this configuration (less than three cluster hosts +with worker function): + +**Procedural Changes**: + +- When setting label app.starlingx.io/component=application with the Vault + app already applied in |HA| configuration (3 vault server pods), ensure that + there are 3 nodes with worker function to support the |HA| configuration. + +- When applying Vault for the first time and with ``app.starlingx.io/component`` + set to "application": ensure that the server replicas is also set to 1 for + non-HA configuration. The replicas for Vault server are overriden both for + the Vault Helm chart and the Vault manager Helm chart: + + .. code-block:: none + + cat < vault_overrides.yaml + server: + extraLabels: + app.starlingx.io/component: application + ha: + replicas: 1 + injector: + extraLabels: + app.starlingx.io/component: application + EOF + + cat < vault-manager_overrides.yaml + manager: + extraLabels: + app.starlingx.io/component: application + server: + ha: + replicas: 1 + EOF + + $ system helm-override-update vault vault vault --values vault_overrides.yaml + + $ system helm-override-update vault vault-manager vault --values vault-manager_overrides.yaml + +Kubernetes upgrade fails if Portieris is applied +************************************************ + +Kubernetes upgrade fails if Portieris is applied prior to the upgrade. + +**Procedural Changes**: Remove the Portieris application prior to the +kubernetes upgrade. Perform the kubernetes upgrade and apply the Portieris +application. + +.. Greg is this limitation applicable to Stx? 
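+
+An illustrative sketch of that sequence, assuming the application is installed
+under the name ``portieris`` as reported by :command:`system application-list`:
+
+.. code-block:: none
+
+   # Remove Portieris before starting the Kubernetes upgrade
+   ~(keystone_admin)]$ system application-remove portieris
+
+   # ... perform the Kubernetes upgrade ...
+
+   # Re-apply Portieris once the upgrade has completed
+   ~(keystone_admin)]$ system application-apply portieris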
+ +Portieris Helm override 'caCert' is renamed and moved to Portieris Helm chart +***************************************************************************** + +The 'caCert' Helm override of portieris-certs Helm chart is moved to the +Portieris Helm chart as 'TrustedCACert'. + +**Procedural Changes**: Before upgrading from |prod-long| 10.0 to +11.0, if 'caCert' Helm override is applied to the portieris-certs Helm +chart to trust a custom CA certificate, apply the 'TrustedCACert' Helm +override to 'portieris' Helm chart to trust the certificate. See, :ref:`install-portieris` +for information on TrustedCACert Helm override. + +Authorization based on Local LDAP Groups is not supported for Harbor +******************************************************************** + +When using Local |LDAP| for authentication of the Harbor system application, +you cannot use Local |LDAP| Groups for authorization; you can only use individual +Local |LDAP| users for authorization. + +**Procedural Changes**: Use only individual Local LDAP users for specifying +authorization. + +Harbor cannot be deployed during bootstrap +****************************************** + +The Harbor application cannot be deployed during bootstrap due to the bootstrap +deployment dependencies such as early availability of storage class. + +**Procedural Changes**: N/A. + +Windows Active Directory +************************ + +.. _general-limitations-and-workarounds-ul-x3q-j3x-dmb: + +- **Limitation**: The Kubernetes API does not support uppercase IPv6 addresses. + + **Procedural Changes**: The issuer_url IPv6 address must be specified as + lowercase. + +- **Limitation**: The refresh token does not work. + + **Procedural Changes**: If the token expires, manually replace the ID token. For + more information, see, :ref:`Configure Kubernetes Client Access + `. + +- **Limitation**: TLS error logs are reported in the **oidc-dex** container + on subclouds. These logs should not have any system impact. + + **Procedural Changes**: NA + +.. Stx LP Bug: https://bugs.launchpad.net/starlingx/+bug/1846418 Won't fix. +.. To be addressed in a future update. + +Security Audit Logging for K8s API +********************************** + +A custom policy file can only be created at bootstrap in ``apiserver_extra_volumes``. +If a custom policy file was configured at bootstrap, then after bootstrap the +user has the option to configure the parameter ``audit-policy-file`` to either +this custom policy file (``/etc/kubernetes/my-audit-policy-file.yml``) or the +default policy file ``/etc/kubernetes/default-audit-policy.yaml``. If no +custom policy file was configured at bootstrap, then the user can only +configure the parameter ``audit-policy-file`` to the default policy file. + +Only the parameter ``audit-policy-file`` is configurable after bootstrap, so +the other parameters (``audit-log-path``, ``audit-log-maxsize``, +``audit-log-maxage`` and ``audit-log-maxbackup``) cannot be changed at +runtime. + +**Procedural Changes**: NA + +**See**: :ref:`kubernetes-operator-command-logging-663fce5d74e7`. + +************************** +**Networking Limitations** +************************** + +.. contents:: |minitoc| + :local: + :depth: 1 + +Controller-0/1 PXEboot Network Communication Failure(200.003) Alarm Raised After Upgrade +**************************************************************************************** + +Alarm triggered: Controller-0/1 PXE boot network communication failure +(Error Alarm 200.003) following system upgrade. + +**Procedural Changes**: + +1. 
Identify the PXEboot file. + + .. code-block:: + + grep -l "net:pxeboot" "/etc/network/interfaces.d/"/* 2>/dev/null + /etc/network/interfaces.d//ifcfg-enp0s8:9 + + If the label differs from ':2' (e.g., displays as ifcfg-enp0s8:9), proceed + with the following step + +2. Copy the file in the same directory. + + .. code-block:: + + cp /etc/network/interfaces.d//ifcfg-enp0s8:9 /etc/network/interfaces.d//ifcfg-enp0s8:2 + +3. Restart mtcClient. + + .. code-block:: + + systemctl restart mtcClient.service + +Wait up to one minute for the alarm to clear. Repeat this process for all nodes. + +Add / delete operations on pods results in errors +************************************************* + +Under some circumstances, add / delete operations on pods results in +`error getting ClusterInformation: connection is unauthorized: Unauthorized` +and also results in pods staying in ContainerCreating/Terminating state. This +error may also prevent users from locking a host. + +**Procedural Changes**: If this error occurs run the following +:command:`kubectl describe pod -n ` command. The following +message is displayed: + +`error getting ClusterInformation: connection is unauthorized: Unauthorized` + +**Limitation**: There is also a known issue with the Calico CNI that may occur +in rare occasions if the Calico token required for communication with the +kube-apiserver becomes out of sync due to |NTP| skew or issues refreshing the +token. + +**Procedural Changes**: Delete the calico-node pod (causing it to automatically +restart) using the following commands: + +.. code-block:: none + + $ kubectl get pods -n kube-system --show-labels | grep calico + + $ kubectl delete pods -n kube-system -l k8s-app=calico-node + +Application Pods with SRIOV Interfaces +************************************** + +Application Pods with |SRIOV| Interfaces require a **restart-on-reboot: "true"** +label in their pod spec template. + +Pods with |SRIOV| interfaces may fail to start after a platform restore or +Simplex upgrade and persist in the **Container Creating** state due to missing +PCI address information in the CNI configuration. + +**Procedural Changes**: Application pods that require|SRIOV| should add the label +**restart-on-reboot: "true"** to their pod spec template metadata. All pods with +this label will be deleted and recreated after system initialization, therefore +all pods must be restartable and managed by a Kubernetes controller +\(i.e. DaemonSet, Deployment or StatefulSet) for auto recovery. + +Pod Spec template example: + +.. code-block:: none + + template: + metadata: + labels: + tier: node + app: sriovdp + restart-on-reboot: "true" + + +PTP O-RAN Spec Compliant Timing API Notification +************************************************ + +- The ``v1 API`` only supports monitoring a single ptp4l + phc2sys instance. + + **Procedural Changes**: Ensure the system is not configured with multiple instances + when using the v1 API. + +- The O-RAN Cloud Notification defines a /././sync API v2 endpoint intended to + allow a client to subscribe to all notifications from a node. This endpoint + is not supported |prod-long| Release 9.0. + + **Procedural Changes**: A specific subscription for each resource type must be + created instead. + +- ``v1 / v2`` + + - v1: Support for monitoring a single ptp4l instance per host - no other + services can be queried/subscribed to. 
+ + - v2: The API conforms to O-RAN.WG6.O-Cloud Notification API-v02.01 + with the following exceptions, that are not supported in |prod-long| + Release 9.0. + + - O-RAN SyncE Lock-Status-Extended notifications + + - O-RAN SyncE Clock Quality Change notifications + + - O-RAN Custom cluster names + + - /././sync endpoint + + **Procedural Changes**: See the respective PTP-notification v1 and v2 document + subsections for further details. + + v1: https://docs.starlingx.io/api-ref/ptp-notification-armada-app/api_ptp_notifications_definition_v1.html + + v2: https://docs.starlingx.io/api-ref/ptp-notification-armada-app/api_ptp_notifications_definition_v2.html + +``ptp4l`` error "timed out while polling for tx timestamp" reported for NICs using the Intel ice driver +******************************************************************************************************* + +NICs using the Intel® ice driver may report the following error in the ``ptp4l`` +logs, which results in a |PTP| port switching to ``FAULTY`` before +re-initializing. + +.. note:: + + |PTP| ports frequently switching to ``FAULTY`` may degrade the accuracy of + the |PTP| timing. + +.. code-block:: none + + ptp4l[80330.489]: timed out while polling for tx timestamp + ptp4l[80330.489]: increasing tx_timestamp_timeout may correct this issue, but it is likely caused by a driver bug + +.. note:: + + This is due to a limitation with the Intel® ice driver as the driver cannot + guarantee the time interval to return the timestamp to the ``ptp4l`` user + space process which results in the occasional timeout error message. + +**Procedural Changes**: The Procedural Changes recommended by Intel is to increase the +``tx_timestamp_timeout`` parameter in the ``ptp4l`` config. The increased +timeout value gives more time for the ice driver to provide the timestamp to +the ``ptp4l`` user space process. Timeout values of 50ms and 700ms have been +validated. However, the user can use a different value if it is more suitable +for their system. + +.. code-block:: none + + ~(keystone_admin)]$ system ptp-instance-parameter-add tx_timestamp_timeout=700 + ~(keystone_admin)]$ system ptp-instance-apply + +.. note:: + + The ``ptp4l`` timeout error log may also be caused by other underlying + issues, such as NIC port instability. Therefore, it is recommended to + confirm the NIC port is stable before adjusting the timeout values. + +PTP is not supported on Broadcom 57504 NIC +****************************************** + +|PTP| is not supported on the Broadcom 57504 NIC. + +**Procedural Changes**: None. Do not configure |PTP| instances on the Broadcom 57504 +NIC. + +synce4l CLI options are not supported +************************************* + +The SyncE configuration using the ``synce4l`` is not supported in |prod-long| +Release 24.09. + +The service type of ``synce4l`` in the :command:`ptp-instance-add` command +is not supported in |prod-long| Release 24.09. + +**Procedural Changes**: N/A. + +ptp-notification application is not supported during bootstrap +************************************************************** + +- Deployment of ``ptp-notification`` during bootstrap time is not supported due + to dependencies on the system |PTP| configuration which is handled + post-bootstrap. + + **Procedural Changes**: N/A. + +- The :command:`helm-chart-attribute-modify` command is not supported for + ``ptp-notification`` because the application consists of a single chart. + Disabling the chart would render ``ptp-notification`` non-functional. 
+ + **Procedural Changes**: N/A. + +.. See :ref:`admin-application-commands-and-helm-overrides` for details on this command. + +The ptp-notification-demo App is Not a System-Managed Application +***************************************************************** + +The ptp-notification-demo app is provided for demonstration purposes only. +Therefore, it is not supported on typical platform operations such as Upgrades +and Backup and Restore. + +**Procedural Changes**: NA + +Silicom TimeSync (STS) card limitations +*************************************** + +* Silicom and Intel based Time Sync NICs may not be deployed on the same system + due to conflicting time sync services and operations. + + |PTP| configuration for Silicom TimeSync (STS) cards is handled separately + from |prod| host |PTP| configuration and may result in configuration + conflicts if both are used at the same time. + + The sts-silicom application provides a dedicated ``phc2sys`` instance which + synchronizes the local system clock to the Silicom TimeSync (STS) card. Users + should ensure that ``phc2sys`` is not configured via |prod| |PTP| Host + Configuration when the sts-silicom application is in use. + + Additionally, if |prod| |PTP| Host Configuration is being used in parallel + for non-STS NICs, users should ensure that all ``ptp4l`` instances do not use + conflicting ``domainNumber`` values. + +* When the Silicom TimeSync (STS) card is configured in timing mode using the + sts-silicom application, the card goes through an initialization process on + application apply and server reboots. The ports will bounce up and down + several times during the initialization process, causing network traffic + disruption. Therefore, configuring the platform networks on the Silicom + TimeSync (STS) card is not supported since it will cause platform + instability. + +**Procedural Changes**: N/A. + +N3000 Image in the containerd cache +*********************************** + +The |prod-long| system without an N3000 image in the containerd cache fails to +configure during a reboot cycle, and results in a failed / disabled node. + +The N3000 device requires a reset early in the startup sequence. The reset is +done by the n3000-opae image. The image is automatically downloaded on bootstrap +and is expected to be in the cache to allow the reset to succeed. If the image +is not in the cache for any reason, the image cannot be downloaded as +``registry.local`` is not up yet at this point in the startup. This will result +in the impacted host going through multiple reboot cycles and coming up in an +enabled/degraded state. To avoid this issue: + +1. Ensure that the docker filesystem is properly engineered to avoid the image + being automatically removed by the system if flagged as unused. + For instructions to resize the filesystem, see + :ref:`Increase Controller Filesystem Storage Allotments Using the CLI ` + +2. Do not manually prune the N3000 image. + +**Procedural Changes**: Use the procedure below. + +.. rubric:: |proc| + +#. Lock the node. + + .. code-block:: none + + ~(keystone_admin)]$ system host-lock controller-0 + +#. Pull the (N3000) required image into the ``containerd`` cache. + + .. code-block:: none + + ~(keystone_admin)]$ crictl pull registry.local:9001/docker.io/starlingx/n3000-opae:stx.8.0-v1.0.2 + +#. Unlock the node. + + .. 
code-block:: none + + ~(keystone_admin)]$ system host-unlock controller-0 + +Deploying an App using nginx controller fails with internal error after controller.name override +************************************************************************************************ + +An Helm override of controller.name to the nginx-ingress-controller app may +result in errors when creating ingress resources later on. + +Example of Helm override: + +.. code-block::none + + cat < values.yml + controller: + name: notcontroller + + EOF + + ~(keystone_admin)$ system helm-override-update nginx-ingress-controller ingress-nginx kube-system --values values.yml + +----------------+-----------------------+ + | Property | Value | + +----------------+-----------------------+ + | name | ingress-nginx | + | namespace | kube-system | + | user_overrides | controller: | + | | name: notcontroller | + | | | + +----------------+-----------------------+ + + ~(keystone_admin)$ system application-apply nginx-ingress-controller + +**Procedural Changes**: NA + +********************************* +**Distributed Cloud Limitations** +********************************* + +.. contents:: |minitoc| + :local: + :depth: 1 + +Subcloud Restore to N-1 Release with Additional Patches +******************************************************* + +If a subcloud is required to be restored to N-1 (Stx 10.0) release beyond the +N-1 ISO patch (prepatched ISO) level, use the following prestage and deploy steps: + +1. Restore the subcloud with install to the N-1 release using the following + command: + + .. code-block:: none + + $ dcmanager subcloud-backup restore --subcloud --with-install --release + + .. note:: + + The subcloud will be reinstalled with the N-1 (pre-patched) ISO + +2. Prestage the subcloud with additional N-1 patches if applicable after running + the following command: + + .. code-block:: none + + restore dcmanager subcloud prestage --for-sw-deploy --release + + Use ``dcmanager prestage-strategy create/apply`` command to prestage more + than one subcloud. + +3. Apply the N-1 patches on the subcloud using the following command: + + .. code-block:: none + + $ dcmanager sw-deploy-strategy create --release + + $ dcmanager sw-deploy-strategy apply + + Use ``--group`` option to create a strategy for more than one subcloud. + +Subcloud install or restore to the previous release +*************************************************** + +If the System Controller is on |prod| Release 11.0, subclouds +can be deployed or restored to either |prod| Release 10.0 or |prod| Release 11.0. + +The following operations have limited support for subclouds of the previous release: + +- Subcloud error reporting + +The following operations are not supported for subclouds of the previous release: + +- Orchestrated subcloud kubernetes upgrade + +**Procedural Changes**: N/A. + +**See**: :ref:`subclouds-previous-release-management-5e986615cb4b`. + +Subcloud Upgrade with Kubernetes Versions +***************************************** + +Before upgrading a cluster, ensure that the Kubernetes version is updated to +the latest one supported by the current (older) platform version. This step +is necessary because the new platform version only supports that specific +Kubernetes version. Orchestrated Kubernetes upgrades are not supported for N-1 +subclouds. Therefore, before upgrading the System Controller to Stx 11.0, +verify that both the System Controller and all existing subclouds are running +Kubernetes version v1.29.2; the latest version supported by Stx 10.0. 
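+
+For example, the active and available Kubernetes versions can be verified on
+the System Controller and on each subcloud before starting the platform
+upgrade (shown here as an illustrative check only):
+
+.. code-block:: none
+
+   ~(keystone_admin)]$ system kube-version-list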
+ +**Procedural Changes**: N/A. + +Enhanced Parallel Operations for Distributed Cloud +************************************************** + +- No parallel operation should be performed while the System Controller is being + patched. + +- Only one type of parallel operation can be performed at a time. For example, + subcloud prestaging or upgrade orchestration should be postponed while batch + subcloud deployment is still in progress. + +Examples of parallel operation: + +- any type of ``dcmanager orchestration`` (prestage, sw-deploy, kube-upgrade, + kube-rootca-update) + +- concurrent ``dcmanager subcloud add`` + +- ``dcmanager subcloud-backup/subcloud-backup restore`` with --group option + +**Procedural Changes**: N/A. + + +**************************************** +**Container-Infrastructure Limitations** +**************************************** + +.. contents:: |minitoc| + :local: + :depth: 1 + +Kubernetes Memory Manager Policies +********************************** + +The interaction between the ``kube-memory-mgr-policy=static`` +and the Topology Manager policy "restricted" can result in pods failing to be +scheduled or started even when there is sufficient memory. This +occurs due to the restrictive design of the NUMA-aware memory manager, which +prevents the same NUMA node from being used for both single and multi-NUMA +allocations. + +**Procedural Changes**: It is important for users to understand the +implications of these memory management policies and configure their systems +accordingly to avoid unexpected failures. + +For detailed configuration options and examples, refer to the Kubernetes +documentation at https://kubernetes.io/docs/tasks/administer-cluster/memory-manager/. + +Alarm 900.024 Raised When Uploading N-1 Patch Release to the System Controller +****************************************************************************** + +When uploading an N-1 patch release to the System Controller, alarm 900.024 +(Obsolete Patch) will be triggered. + +This behavior is specific to the System Controller and occurs only when +uploading an N-1 patch + +**Procedural Changes**: This warning can be safely ignored. + +Kubevirt Limitations +******************** + +The following limitations apply to Kubevirt in |Prod-long| Release 24.09: + +- **Limitation**: Kubernetes does not provide CPU Manager detection. + + **Procedural Changes**: Add ``cpumanager`` to Kubevirt: + + .. code-block:: none + + apiVersion: kubevirt.io/v1 + kind: KubeVirt + metadata: + name: kubevirt + namespace: kubevirt + spec: + configuration: + developerConfiguration: + featureGates: + - LiveMigration + - Macvtap + - Snapshot + - CPUManager + + Check the label, using the following command: + + .. code-block:: none + + ~(keystone_admin)]$ kubectl describe node | grep cpumanager + + where `cpumanager=true` + +- **Limitation**: Huge pages do not show up under cat /proc/meminfo inside a + guest VM. Although, resources are being consumed on the host. For example, + if a VM is using 4GB of Huge pages, the host shows the same 4GB of huge + pages used. The huge page memory is exposed as normal memory to the VM. + + **Procedural Changes**: You need to configure Huge pages inside the guest + OS. + +See the Installation Guides at https://docs.starlingx.io/ for more details. + +- **Limitation**: Virtual machines using Persistent Volume Claim (PVC) must + have a shared ReadWriteMany (RWX) access mode to be live migrated. + + **Procedural Changes**: Ensure |PVC| is created with RWX. + + .. 
code-block:: none
+
+     $ virtctl image-upload --pvc-name=cirros-vm-disk-test-2 --pvc-size=500Mi --storage-class=cephfs --access-mode=ReadWriteMany --image-path=/home/sysadmin/Kubevirt-GA-testing/latest-manifest/kubevirt-GA-testing/cirros-0.5.1-x86_64-disk.img --uploadproxy-url=https://10.111.54.246 --insecure
+
+  .. note::
+
+     - Live migration is not allowed with a pod network binding of bridge
+       interface type ()
+
+     - Live migration requires ports 49152, 49153 to be available in the
+       virt-launcher pod. If these ports are explicitly specified in the
+       masquerade interface, live migration will not function.
+
+- For live migration with |SRIOV| interface:
+
+  - specify networkData: in cloudinit, so when the VM moves to another node
+    it will not lose the IP config
+
+  - specify nameserver and internal |FQDNs| to connect to the cluster metadata
+    server, otherwise cloudinit will not work
+
+  - fix the MAC address, otherwise when the VM moves to another node the MAC
+    address will change and cause a problem establishing the link
+
+  Example:
+
+  .. code-block:: none
+
+     cloudInitNoCloud:
+       networkData: |
+         ethernets:
+           sriov-net1:
+             addresses:
+             - 128.224.248.152/23
+             gateway: 128.224.248.1
+             match:
+               macAddress: "02:00:00:00:00:01"
+             nameservers:
+               addresses:
+               - 10.96.0.10
+               search:
+               - default.svc.cluster.local
+               - svc.cluster.local
+               - cluster.local
+             set-name: sriov-link-enabled
+         version: 2
+
+- **Limitation**: Snapshot |CRDs| and controllers are not present by default
+  and need to be installed on |prod-long|.
+
+  **Procedural Changes**: To install snapshot |CRDs| and controllers on
+  Kubernetes, see:
+
+  - kubectl apply -f https://raw.githubusercontent.com/kubernetes-csi/external-snapshotter/master/client/config/crd/snapshot.storage.k8s.io_volumesnapshots.yaml
+
+  - kubectl apply -f https://raw.githubusercontent.com/kubernetes-csi/external-snapshotter/master/client/config/crd/snapshot.storage.k8s.io_volumesnapshotcontents.yaml
+
+  - kubectl apply -f https://raw.githubusercontent.com/kubernetes-csi/external-snapshotter/master/client/config/crd/snapshot.storage.k8s.io_volumesnapshotclasses.yaml
+
+  - kubectl apply -f https://raw.githubusercontent.com/kubernetes-csi/external-snapshotter/master/deploy/kubernetes/snapshot-controller/rbac-snapshot-controller.yaml
+
+  - kubectl apply -f https://raw.githubusercontent.com/kubernetes-csi/external-snapshotter/master/deploy/kubernetes/snapshot-controller/setup-snapshot-controller.yaml
+
+  Additionally, create ``VolumeSnapshotClass`` for Cephfs and RBD:
+
+  .. code-block:: none
+
+     cat <<EOF > cephfs-storageclass.yaml
+     ---
+     apiVersion: snapshot.storage.k8s.io/v1
+     kind: VolumeSnapshotClass
+     metadata:
+       name: csi-cephfsplugin-snapclass
+     driver: cephfs.csi.ceph.com
+     parameters:
+       clusterID: 60ee9439-6204-4b11-9b02-3f2c2f0a4344
+       csi.storage.k8s.io/snapshotter-secret-name: ceph-pool-kube-cephfs-data
+       csi.storage.k8s.io/snapshotter-secret-namespace: default
+     deletionPolicy: Delete
+     EOF
+
+  .. code-block:: none
+
+     cat <<EOF > rbd-storageclass.yaml
+     ---
+     apiVersion: snapshot.storage.k8s.io/v1
+     kind: VolumeSnapshotClass
+     metadata:
+       name: csi-rbdplugin-snapclass
+     driver: rbd.csi.ceph.com
+     parameters:
+       clusterID: 60ee9439-6204-4b11-9b02-3f2c2f0a4344
+       csi.storage.k8s.io/snapshotter-secret-name: ceph-pool-kube-rbd
+       csi.storage.k8s.io/snapshotter-secret-namespace: default
+     deletionPolicy: Delete
+     EOF
+
+  .. 
note:: + + Get the cluster ID from : ``kubectl describe sc cephfs, rbd`` + +- **Limitation**: Live migration is not possible when using configmap as a + filesystem. Currently, virtual machine instances (VMIs) cannot be live migrated as + ``virtiofs`` does not support live migration. + + **Procedural Changes**: N/A. + +- **Limitation**: Live migration is not possible when a VM is using secret + exposed as a filesystem. Currently, virtual machine instances cannot be + live migrated since ``virtiofs`` does not support live migration. + + **Procedural Changes**: N/A. + +- **Limitation**: Live migration will not work when a VM is using + ServiceAccount exposed as a file system. Currently, VMIs cannot be live + migrated since ``virtiofs`` does not support live migration. + + **Procedural Changes**: N/A. + +Docker Network Bridge Not Supported +*********************************** + +The Docker Network Bridge, previously created by default, is removed and no +longer supported in |prod-long| Release 9.0 as the default bridge IP address +collides with addresses already in use. + +As a result, docker can no longer be used for running containers. This impacts +building docker images directly on the host. + +**Procedural Changes**: Create a Kubernetes pod that has network access, log in +to the container, and build the docker images. + +Upper case characters in host names cause issues with kubernetes labelling +************************************************************************** + +Upper case characters in host names cause issues with kubernetes labelling. + +**Procedural Changes**: Host names should be lower case. + +Kubernetes Taint on Controllers for Standard Systems +**************************************************** + +In Standard systems, a Kubernetes taint is applied to controller nodes in order +to prevent application pods from being scheduled on those nodes; since +controllers in Standard systems are intended ONLY for platform services. +If application pods MUST run on controllers, a Kubernetes toleration of the +taint can be specified in the application's pod specifications. + +**Procedural Changes**: Customer applications that need to run on controllers on +Standard systems will need to be enabled/configured for Kubernetes toleration +in order to ensure the applications continue working after an upgrade from +|prod-long| Release 6.0 to |prod-long| future Releases. It is suggested to add +the Kubernetes toleration to your application prior to upgrading to |prod-long| +9.0 Release. + +You can specify toleration for a pod through the pod specification (PodSpec). +For example: + +.. code-block:: none + + spec: + .... + template: + .... + spec + tolerations: + - key: "node-role.kubernetes.io/master" + operator: "Exists" + effect: "NoSchedule" + - key: "node-role.kubernetes.io/control-plane" + operator: "Exists" + effect: "NoSchedule" + +**See**: `Taints and Tolerations `__. + +Application Fails After Host Lock/Unlock +**************************************** + +In some situations, application may fail to apply after host lock/unlock due to +previously evicted pods. + +**Procedural Changes**: Use the :command:`kubectl delete` command to delete the evicted +pods and reapply the application. + +Application Apply Failure if Host Reset +*************************************** + +If an application apply is in progress and a host is reset it will likely fail. +A re-apply attempt may be required once the host recovers and the system is +stable. 
+ +**Procedural Changes**: Once the host recovers and the system is stable, a re-apply +may be required. + +Platform CPU Usage Alarms +************************* + +Alarms may occur indicating platform cpu usage is greater than 90% if a large +number of pods are configured using liveness probes that run every second. + +**Procedural Changes**: To mitigate either reduce the frequency for the liveness +probes or increase the number of platform cores. + +Pods Using isolcpus +******************* + +The isolcpus feature currently does not support allocation of thread siblings +for cpu requests (i.e. physical thread +HT sibling). + +**Procedural Changes**: For optimal results, if hyperthreading is enabled then +isolcpus should be allocated in multiples of two in order to ensure that both +|SMT| siblings are allocated to the same container. + +**Procedural Changes**: N/A. + +Deleting image tags in registry.local may delete tags under the same name +************************************************************************* + +When deleting image tags in the registry.local docker registry, you should be +aware that the deletion of an **** will delete all tags +under the specified that have the same 'digest' as the specified +. For more information, see, :ref:`Delete Image Tags in +the Docker Registry `. + +**Procedural Changes**: NA + +********************************* +**Distributed Cloud Limitations** +********************************* + +.. contents:: |minitoc| + :local: + :depth: 1 + +Limitation for Day-2 Deployment Manager operations +************************************************** + +After completing Day-1 operations and initiating a +Day-2 update for the Host resource, a ``system config update`` strategy is +generated. Consequently, alarms indicating the presence of this strategy in the +system are triggered. + +If a new Day-2 update is executed immediately after another update, and before +the previous strategy is created, it may lead to unexpected results . + +Before proceeding with Day-2 operations, use the following Procedural Changes: + +**Procedural Changes**: Wait for any alarms related to the ``system config update`` +strategy to clear, which indicates the completion of the strategy. Once the +alarms are cleared execute a new Day-2 update using either reconfiguration or +playbook re-application to apply new changes that were not applied in the +previous update. + +*********************************** +**Software Management Limitations** +*********************************** + +.. contents:: |minitoc| + :local: + :depth: 1 + +Deploy does not fail after a system reboot +****************************************** + +Deploy does not fail after a system reboot. + +**Procedural Changes**: Run the +:command:`sudo software-deploy-set-failed --hostname/-h --confirm` +utility to manually move the deploy and deploy host to a failed state which is +caused by a failover, lost power, network outage etc. You can only run this +utility with root privileges on the active controller. + +The utility displays the current state and warns the user about the next steps +to be taken in case the user needs to continue executing the utility. It also +displays the new states and the next operation to be executed. 
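+
+For example (the hostname below is a placeholder; the utility must be run with
+root privileges on the active controller):
+
+.. code-block:: none
+
+   $ sudo software-deploy-set-failed --hostname controller-0 --confirm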
+ +ISO/SIG Upload to Central Cloud Fails when Using sudo +***************************************************** + +To upload a software patch or major release to the System Controller region +using the ``--os-region-name SystemController`` option, the upload command must be +authenticated with Keystone. + +**Procedural Changes**: Do not use sudo with the ``--os-region-name SystemController`` +option. For example, avoid using :command:`sudo software upload ` +command. + +.. note:: + + When using the ``-local`` option, you must provide the absolute path to the + release files. + +.. note:: + + When using software upload commands with ``--os-region-name SystemController`` + to upload a software patch or major release to the System Controller + region, Keystone authentication is required. + +.. important:: + + Do not use sudo in combination with the ``--os-region-name SystemController`` + option. For example, avoid using: + + .. code-block:: + + $ sudo software --os-region-name SystemController upload + + Instead, ensure the command is executed with proper authentication and + without sudo. + +For more information see, :ref:`upload-software-releases-using-the-cli-203af02d6457` + +RT Throttling Service not running after Lock/Unlock on Upgraded Subclouds +************************************************************************* + +During the upgrade process, the |USM| post-upgrade script modifies ``systemd`` +presets to define which services should be automatically enabled or disabled. +As part of this process, any user-enabled custom services may be set to +"disabled" after the upgrade completes. + +Since this change occurs post-upgrade, ``systemd`` will not automatically +re-enable the affected service during subsequent lock / unlock operations. +By default, |USM| disables custom services not explicitly listed in the ``systemd`` +presets. Since service definitions can vary between releases, |USM| relies on +these presets to determine enablement status per host during the upgrade. +If a custom service is not included in the presets, it will be marked as +disabled and remain inactive after lock / unlock even following a successful +upgrade. + +Log message during the upgrade: + +.. code-block:: + + controller-0 usm-initialize[3061]: info Removed + /etc/systemd/system/multi-user.target.wants/sysctl-rt-sched-apply.service + +**Procedural Changes**: Once the upgrade to |prod-long| Release 11.0 completes, +run the :command:`service-enable` and :command:`service-start` commands for all +custom / user services before issuing the first lock / unlock (or reboot). + +The enable and start commands for this service are required only once prior +to the initial lock / unlock operation. After this step is completed, there is +no further need to manually start or enable custom services, as the |USM| +post-upgrade script has already run during the upgrade process. + +sw-manager sw-deploy-strategy apply fails +***************************************** + +``sw-manager apply`` fails to apply the patch. + +.. note:: + + The Procedural Changes is applicable only if the ``sw-manager sw-deploy-strategy`` + fails with the following issues. + +1. To show the operation is in an aborted state due to a timeout, run the + following command. + + .. 
code-block:: none + + ~(keystone_admin)]$ sw-manager sw-deploy-strategy show + + Strategy Patch Strategy: + strategy-uuid: 2082ab5e-a387-4b6a-be23-50ac23317725 + controller-apply-type: serial + storage-apply-type: serial + worker-apply-type: serial + default-instance-action: stop-start + alarm-restrictions: strict + current-phase: abort + current-phase-completion: 100% + state: aborted + apply-result: timed-out + apply-reason: + abort-result: success + abort-reason: + +2. If step 1 fails with 'timed-out' results, check if the timeout has occurred + due to step-name 'wait-alarms-clear' using the command below. + + To display results 'wait for alarm' that has timed out and run the + following command. + + .. code-block:: none + + ~(keystone_admin)]$ sw-manager sw-deploy-strategy show --details + + step-name: wait-alarms-clear + timeout: 2400 seconds + start-date-time: 2024-03-27 19:21:15 + end-date-time: 2024-03-27 20:01:16 + result: timed-out + +3. To list the 750.006 alarm, use the following command. + + .. code-block:: none + + ~(keystone_admin)]$ fm alarm-list + + +----------+---------------------------+--------------------+----------+---------------+ + | Alarm ID | Reason Text | Entity ID | Severity | Time Stamp | + +----------+---------------------------+--------------------+----------+---------------+ + | 750.006 | A configuration change | platform-integ-apps| warning | 2024-03-27T| + | | requires a reapply of the | | | 19:21:15. | + | | platform-k8s_application= | | | 471422 | + | | integ-apps application. | | | | + +----------+---------------------------+--------------------+----------+---------------+ + +4. VIM orchestrated patch strategy failed with the 900.103 alarm being triggered. + + .. code-block:: none + + ~(keystone_admin)]$ fm alarm-list + + +----------+---------------------------+--------------------+----------+---------------+ + | Alarm ID | Reason Text | Entity ID | Severity | Time Stamp | + +----------+---------------------------+--------------------+----------+---------------+ + | 900.103 | Software patch auto-apply | orchestration=sw- | critical | 2024-03-26T03T| + | | failed | | | | + +----------+---------------------------+--------------------+----------+---------------+ + +**Procedural Changes - Option 1** + +1. Check the system for existing alarms using the :command:`fm alarm-list` + command. If the existing alarms can be ignored then use the + :command:`sw-manager sw-deploy-strategy create --alarm-restrictions relaxed` + command to ignore any alarms during patch orchestration + +2. If the alarm was not ignored in using the command in step 1 and the issue is + seen when you encounter patch apply failure, check if alarm '750.006' is + present on the system. + +3. Delete the failed strategy using the following command. + + .. code-block:: none + + ~(keystone_admin)]$ sw-manager sw-deploy-strategy delete + +4. Create a new strategy. + + .. code-block:: none + + ~(keystone_admin)]$ sw-manager sw-deploy-strategy create --alarm-restrictions relaxed + +5. Apply the strategy. + + .. code-block:: none + + ~(keystone_admin)]$ sw-manager sw-deploy-strategy apply + +**Procedural Changes - Option 2** + +1. Create a new strategy (alarm-restrictions are not relaxed). + + .. code-block:: none + + ~(keystone_admin)]$ sw-manager sw-deploy-strategy create + +2. Apply the strategy. + + .. 
code-block:: none + + ~(keystone_admin)]$ sw-manager sw-deploy-strategy apply + + When the ``sw-deploy-strategy`` is in progress, and when at + 'wait-alarms-clear' step (this can be found from 'sw-manager patch strategy show --details | grep "step-name"'), + check if alarm 750.006 is present, then execute the below steps. + +3. Execute the command. + + .. code-block:: none + + ~(keystone_admin)]$ system application-apply platform-integ-apps + + This will re-apply the application and clear the alarm '750.006'. + +4. If the alarm still persists after step 3, manually delete the alarm using + :command:`fm alarm-delete ` command. + +********************************* +**Platform Services Limitations** +********************************* + +.. contents:: |minitoc| + :local: + :depth: 1 + +Kubernetes Pod Core Dump Handler may fail due to a missing Kubernetes token +*************************************************************************** + +In certain cases the Kubernetes Pod Core Dump Handler may fail due to a missing +Kubernetes token resulting in disabling configuration of the coredump on a per +pod basis and limiting namespace access. If application coredumps are not being +generated, verify if the k8s-coredump token is empty on the configuration file: +``/etc/k8s-coredump-conf.json`` using the following command: + +.. code-block:: none + + ~(keystone_admin)]$ ~$ sudo cat /etc/k8s-coredump-conf.json + { + "k8s_coredump_token": "" + + } + +**Procedural Changes**: If the k8s-coredump token is empty in the configuration file and +the kube-apiserver is verified to be responsive, users can re-execute the +create-k8s-account.sh script in order to generate the appropriate token after a +successful connection to kube-apiserver using the following commands: + +.. code-block:: none + + ~(keystone_admin)]$ :/home/sysadmin$ sudo chmod +x /etc/k8s-coredump/create-k8s-account.sh + + ~(keystone_admin)]$ :/home/sysadmin$ sudo /etc/k8s-coredump/create-k8s-account.sh + +Uploaded Applications Show Incorrect Progress During Platform Upgrade +********************************************************************* + +The outputs of the ``system application-list`` and ``system application-show`` +commands may display status messages indicating that dependencies for uploaded +applications are missing even after those dependencies have been applied or +updated. + +.. note:: + + If the required dependencies are actually met, this does not prevent the + applications from being applied. + +**Procedural Changes**: N/A. + +Restart Required for containerd to Apply Config Changes for AIO-SX +****************************************************************** + +On |AIO-SX| systems, certain container images were removed from +the registry due to the image garbage collector and changes introduced during +the Kubernetes upgrade. This may impact workloads that rely on specific image +versions. + +**Procedural Changes**: Increasing the Docker filesystem size will help retain the +image in the ``containerd`` cache. Additionally, only for |AIO-SX| it is +recommended to restart ``containerd`` after the Kubernetes upgrade. +For more details, see "Docker Size updates". 
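+
+A minimal sketch of this mitigation on an |AIO-SX| controller. The 60 GB size
+is only an example, and restarting ``containerd`` through ``systemctl`` assumes
+it is managed as a host systemd service; adjust both to your configuration:
+
+.. code-block:: none
+
+   # Review and increase the docker filesystem size
+   ~(keystone_admin)]$ system host-fs-list controller-0
+   ~(keystone_admin)]$ system host-fs-modify controller-0 docker=60
+
+   # After the Kubernetes upgrade completes, restart containerd (AIO-SX only)
+   $ sudo systemctl restart containerd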
+ +Limitation Using Regular Expressions in some Parameters while Configuring Stalld +******************************************************************************** + +Stalld supports regular expressions in some parameters such as: + +- ignore_threads + +- ignore_processes + +For example, Stalld can be instructed to ignore all threads that start with +the keyword ``runner``; stalld --ignore_threads="runner.*"" + +**Procedural Changes**: In |prod-long| Release 24.09.300 the above functionality +is not available when using the ``system host-label api``, therefore, the user +will have to explicitly specify the threads to ignore. + +.. code-block:: none + + system host-label-assign controller-0 starlingx.io/stalld.ignore_threads="runnerA" + +BMC Password +************ + +The |BMC| password cannot be updated. + +**Procedural Changes**: In order to update the |BMC| password, de-provision the |BMC|, +and then re-provision it again with the new password. + +Configure Stalld +**************** + +It is recommended to configure Stalld during initial setup. If the workload is +high, runtime stalld configuration may not take effect till the node is rebooted + +**Procedural Changes**: Stalld should be configured during initial system setup. + +Sub-Numa Cluster Configuration not Supported on Skylake Servers +*************************************************************** + +Sub-Numa cluster configuration is not supported on Skylake servers. + +**Procedural Changes**: For servers with Skylake Gold or Platinum CPUs, Sub-|NUMA| +clustering must be disabled in the BIOS. + +Debian Bootstrap +**************** + +On CentOS bootstrap worked even if **dns_servers** were not present in the +localhost.yml. This does not work for Debian bootstrap. + +**Procedural Changes**: You need to configure the **dns_servers** parameter in the +localhost.yml, as long as no |FQDNs| were used in the bootstrap overrides in +the localhost.yml file for Debian bootstrap. + +Installing a Debian ISO +*********************** + +The disks and disk partitions need to be wiped before the install. +Installing a Debian ISO may fail with a message that the system is +in emergency mode if the disks and disk partitions are not +completely wiped before the install, especially if the server was +previously running a CentOS ISO. + +**Procedural Changes**: When installing a lab for any Debian install, the disks must +first be completely wiped using the following procedure before starting +an install. + +Use the following wipedisk commands to run before any Debian install for +each disk (eg: sda, sdb, etc): + +.. code-block:: none + + sudo wipedisk + # Show + sudo sgdisk -p /dev/sda + # Clear part table + sudo sgdisk -o /dev/sda + +.. note:: + + The above commands must be run before any Debian install. The above + commands must also be run if the same lab is used for CentOS installs after + the lab was previously running a Debian ISO. + +Metrics Server Update across Upgrades +************************************* + +After a platform upgrade, the Metrics Server will NOT be automatically updated. + +**Procedural Changes**: To update the Metrics Server, +**See**: :ref:`Install Metrics Server ` + +Backup and Restore Playbook fails due to self-triggered "backup in progress"/"restore in progress" flag +******************************************************************************************************* + +Backup and Restore causes the Playbook to fail +due to self-triggered "backup in progress" / "restore in progress" flag. 
+ +**Procedural Changes**: Retry the backup after manually removing the +flag /etc/platform/.backup_in_progress if it has been more than 10 minutes based +on the error message: + +.. code-block:: none + + "backup has already been started less than x minutes ago. + Wait to start a new backup or manually remove the backup flag in + /etc/platform/.backup_in_progress " + +For a "restore in progress" flag, reinstall and retry the restore operation. + +****************************** +**Optimized-Edge Limitations** +****************************** + +.. contents:: |minitoc| + :local: + :depth: 1 + +.. CGTS-84325 Guilherme - Greg is this applcable to Stx? + +Data Streaming Accelerator Error During a USM Upgrade +***************************************************** + +During the upgrade from |prod-long| Release 10.0 to 11.0, |DSA| init container +fails and will remain in CrashLoopBack until |DSA| is fully upgraded to +|prod-long| Release 11.0. + +The issue occurs because a new parameter ``driver_name`` is required to +configure the workqueues in ``idxd`` driver in kernel 6.12.40. This behavior +should not impact platform upgrade but |DSA| may not be configured until +``intel-device-plugins-operator`` is successfully upgraded. + +**Procedural Changes**: To overcome this behavior, the new parameters can be +added by applying the following Helm overrides before the upgrade. + +For example, create the following override file: + +.. code-block:: none + + $ cat << 'EOF' > dsa-override.yml + overrideConfig: + dsa.conf: | + [ + { + "dev":"dsaX", + "read_buffer_limit":0, + "groups":[ + { + "dev":"groupX.0", + "read_buffers_reserved":0, + "use_read_buffer_limit":0, + "read_buffers_allowed":8, + "grouped_workqueues":[ + { + "dev":"wqX.0", + "mode":"dedicated", + "size":16, + "group_id":0, + "priority":10, + "block_on_fault":1, + "type":"user", + "name":"dpdk_appX0", + "driver_name":"user", + "threshold":15 + } + ], + "grouped_engines":[ + { + "dev":"engineX.0", + "group_id":0 + }, + ] + }, + { + "dev":"groupX.1", + "read_buffers_reserved":0, + "use_read_buffer_limit":0, + "read_buffers_allowed":8, + "grouped_workqueues":[ + { + "dev":"wqX.1", + "mode":"dedicated", + "size":16, + "group_id":1, + "priority":10, + "block_on_fault":1, + "type":"user", + "name":"dpdk_appX1", + "driver_name":"user", + "threshold":15 + } + ], + "grouped_engines":[ + { + "dev":"engineX.1", + "group_id":1 + }, + ] + }, + { + "dev":"groupX.2", + "read_buffers_reserved":0, + "use_read_buffer_limit":0, + "read_buffers_allowed":8, + "grouped_workqueues":[ + { + "dev":"wqX.2", + "mode":"dedicated", + "size":16, + "group_id":2, + "priority":10, + "block_on_fault":1, + "type":"user", + "name":"dpdk_appX2", + "driver_name":"user", + "threshold":15 + } + ], + "grouped_engines":[ + { + "dev":"engineX.2", + "group_id":2 + }, + ] + }, + { + "dev":"groupX.3", + "read_buffers_reserved":0, + "use_read_buffer_limit":0, + "read_buffers_allowed":8, + "grouped_workqueues":[ + { + "dev":"wqX.3", + "mode":"dedicated", + "size":16, + "group_id":3, + "priority":10, + "block_on_fault":1, + "type":"user", + "name":"dpdk_appX3", + "driver_name":"user", + "threshold":15 + } + ], + "grouped_engines":[ + { + "dev":"engineX.3", + "group_id":3 + }, + ] + }, + ] + } + ] + EOF + +Then apply the override file: + +.. code-block:: none + + $ system helm-override-update intel-device-plugins-operator intel-device-plugins-dsa intel-device-plugins-operator --values dsa-override.yml + +Apply the ``intel-device-plugins-operator`` application. + +.. 
code-block:: none
+
+      $ system application-apply intel-device-plugins-operator
+
+Console Session Issues during Installation
+******************************************
+
+After bootstrap and before unlocking the controller, if the console session times
+out (or the user logs out), ``systemd`` does not work properly. ``fm``, ``sysinv``, and
+``mtcAgent`` do not initialize.
+
+**Procedural Changes**: If the console times out or the user logs out between bootstrap
+and unlock of controller-0, then, to recover from this issue, you must
+re-install the ISO.
+
+Power Metrics Application in Real Time Kernels
+**********************************************
+
+When executing the Power Metrics application on real-time kernels, the overall
+scheduling latency may increase due to inter-core interruptions caused by
+MSR (Model-Specific Register) reads.
+
+Under intensive workloads, the kernel may not be able to handle the MSR read
+interruptions, resulting in stalled data collection because the collector is
+not scheduled on the affected core.
+
+***********************
+**Storage Limitations**
+***********************
+
+.. contents:: |minitoc|
+   :local:
+   :depth: 1
+
+Limitations of the Rook Ceph Application During Upgrade from Version 1.13 on |AIO-DX|
+*************************************************************************************
+
+During the upgrade from v1.13 to v1.16 on an |AIO-DX| platform, mon quorum may
+be temporarily disrupted for a few minutes. Once the upgrade completes, all
+monitors are expected to come back online and quorum should re-establish
+successfully.
+
+**Procedural Changes**: N/A.
+
+Rook Ceph Application Limitation During Floating Monitor Removal
+****************************************************************
+
+On an |AIO-DX| system, removing the floating monitor using
+``system controllerfs-modify ceph-float --functions=""`` may lead to temporary
+system instability, including the possibility of uncontrolled swacts.
+
+**Procedural Changes**: To avoid this issue, ensure that all finalizers are
+removed from the floating monitor Rook Ceph chart after its deletion, using the
+following command:
+
+.. code-block:: none
+
+   $ kubectl patch hr rook-ceph-floating-monitor -p '{"metadata":{"finalizers":[]}}' --type=merge
+
+Host fails to lock during an upgrade
+************************************
+
+After adding multiple |OSDs| to the Ceph cluster simultaneously, some |OSDs|
+may remain in the 'configuring' state even though the cluster is healthy and
+the |OSD| is deployed. This is an intermittent issue that only occurs on
+systems with a Ceph storage backend configured with more than one |OSD| per
+host. This causes the :command:`system host-lock` command to fail with the
+following error:
+
+.. code-block:: none
+
+   $ system host-lock controller-
+   controller- : Rejected: Can not lock a controller with storage devices
+   in 'configuring' state.
+
+Since ``system host-lock`` on the controller fails and the |OSD| is still in
+the 'configuring' state, the upgrade is blocked from proceeding.
+
+**Procedural Changes**: Use the following steps to proceed with the upgrade.
+
+1. List the |OSD| IDs in the 'configuring' state using the following command:
+
+   .. code-block:: none
+
+      $ system host-stor-list 
+
+2. Identify the |OSD| using the following command:
+
+   .. code-block:: none
+
+      $ ceph osd find osd.
+
+3. If the |OSD| is found, manually update the database inventory using the ``stor uuid``:
+
+   ..
code-block:: none + + $ sudo -u postgres psql -U postgres -d sysinv -c "UPDATE i_istor SET state='configured' WHERE uuid='';"; + Ceph Daemon Crash and Health Warning ************************************ @@ -1261,7 +2595,6 @@ and the alarm. [sysadmin@controller-0 ~(keystone_admin)]$ ceph crash archive-all -******************************** Rook Ceph Application Limitation ******************************** @@ -1271,88 +2604,27 @@ After applying Rook Ceph application in an |AIO-DX| configuration the **Procedural Changes**: Restart the pod of the monitor associated with the slow operations detected by Ceph. Check ``ceph -s``. - -********************************************************************* -Subcloud failed during rehoming while creating RootCA update strategy -********************************************************************* - -Subcloud rehoming may fail while creating the RootCA update strategy. - -**Proceudral Changes**: Delete the subcloud from the new System Controller and -rehome it again. - -************************************************** -RSA required to be the platform issuer private key -************************************************** - -The ``system-local-ca`` issuer needs to use RSA type certificate/key. The usage -of other types of private keys is currently not supported during bootstrap -or with the ``Update system-local-ca or Migrate Platform Certificates to use -Cert Manager`` procedures. - -**Proceudral Changes**: N/A. - -***************************************************** -Host lock/unlock may interfere with application apply -***************************************************** +Avoid host lock/unlock during application apply +*********************************************** Host lock and unlock operations may interfere with applications that are in the applying state. -**Proceudral Changes**: Re-applying or removing / installing applications may be +**Procedural Changes**: Re-applying or removing / installing applications may be required. Application status can be checked using the :command:`system application-list` command. -**************************************************** -Add / delete operations on pods may result in errors -**************************************************** +Perform a host lock during application apply +******************************************** -Under some circumstances, add / delete operations on pods may result in pods -staying in ContainerCreating/Terminating state and reporting an -'error getting ClusterInformation: connection is unauthorized: Unauthorized`. -This error may also prevent users from locking the host. +Host lock and unlock operations may interfere with applications that are in +the applying state. -**Proceudral Changes**: If this error occurs run the following -:command:`kubectl describe pod -n ` command. The following -message is displayed: +**Procedural Changes**: Re-applying or removing / installing applications may be +required. Application status can be checked using the :command:`system application-list` +command. -`error getting ClusterInformation: connection is unauthorized: Unauthorized` - -.. note:: - - There is a known issue with the Calico CNI that may occur in rare - occasions if the Calico token required for communication with the - kube-apiserver becomes out of sync due to |NTP| skew or issues refreshing - the token. - -**Proceudral Changes**: Delete the calico-node pod (causing it to automatically -restart) using the following commands: - -.. 
code-block:: none - - $ kubectl get pods -n kube-system --show-labels | grep calico - - $ kubectl delete pods -n kube-system -l k8s-app=calico-node - - -****************************************** -Deploy does not fail after a system reboot -****************************************** - -Deploy does not fail after a system reboot. - -**Proceudral Changes**: Run the -:command:`sudo software-deploy-set-failed --hostname/-h --confirm` -utility to manually move the deploy and deploy host to a failed state which is -caused by a failover, lost power, network outage etc. You can only run this -utility with root privileges on the active controller. - -The utility displays the current state and warns the user about the next steps -to be taken in case the user needs to continue executing the utility. It also -displays the new states and the next operation to be executed. - -********************************* -Rook-ceph application limitations +Rook-ceph Application Limitations ********************************* This section documents the following known limitations you may encounter with @@ -1490,46 +2762,6 @@ offline. $ kubectl get pods -n rook-ceph -l app=rook-ceph-osd -w -***************************************************************************************************************** -Unable to set maximum VFs for NICs using out-of-tree ice driver v1.14.9.2 on systems with a large number of cores -***************************************************************************************************************** - -On systems with a large number of cores (>= 32 physical cores / 64 threads), -it is not possible to set the maximum number of |VFs| (32) for NICs using the -out-of-tree ice driver v1.14.9.2. - -If the issue is encountered, the following error logs will be reported in kern.log: - -.. code-block:: none - - [ 83.322344] ice 0000:51:00.1: Only 59 MSI-X interrupts available for SR-IOV. Not enough to support minimum of 2 MSI-X interrupts per VF for 32 VFs - [ 83.322362] ice 0000:51:00.1: Not enough resources for 32 VFs, err -28. Try with fewer number of VFs - -The impacted NICs are: - -- Intel E810 - -- Silicom STS2 - -**Procedural Changes**: Reduce the number of configured |VFs|. To determine the -maximum number of supported |VFs|: - -- Check /sys/class/net//device/sriov_vf_total_msix. - Example: - - .. code-block:: none - - cat /sys/class/net/enp81s0f0/device/sriov_vf_total_msix - 59 - -- Calculate the maximum number of |VFs| as sriov_vf_total_msix / 2. - Example: - - .. code-block:: none - - max_VFs = 59/2 = 29 - -***************************************************************** Critical alarm 800.001 after Backup and Restore on AIO-SX Systems ***************************************************************** @@ -1577,7 +2809,6 @@ Playbook. 
The alarm details are as follows: cephfs-table-tool ${FS_NAME}:0 reset inode sudo /etc/init.d/ceph start mds -******************************************************************************* Error installing Rook Ceph on |AIO-DX| with host-fs-add before controllerfs-add ******************************************************************************* @@ -1604,7 +2835,6 @@ sequence: ~(keystone_admin)]$ controllerfs-add ceph-float=20 ~(keystone_admin)]$ system host-fs-add controller-0 ceph=20 -*********************************************************** Intermittent installation of Rook-Ceph on Distributed Cloud *********************************************************** @@ -1615,475 +2845,16 @@ While installing rook-ceph, if the installation fails, this is due to the :command:`system application-remove rook-ceph --force` to initiate rook-ceph installation. -******************************************************************** -Authorization based on Local LDAP Groups is not supported for Harbor -******************************************************************** +Storage Nodes are not considered part of the Kubernetes cluster +*************************************************************** -When using Local |LDAP| for authentication of the new Harbor system application, -you cannot use Local |LDAP| Groups for authorization; you can only use individual -Local |LDAP| users for authorization. +When running the :command:`system kube-host-upgrade-list` command the output +must only display controller and worker hosts that have control-plane and kubelet +components. Storage nodes do not have any of those components and so are not +considered a part of the Kubernetes cluster. -**Procedural Changes**: Use only individual Local LDAP users for specifying -authorization. +**Procedural Changes**: Do not include Storage nodes as part of the Kubernetes upgrade. -*************************************************** -Vault application is not supported during Bootstrap -*************************************************** - -The Vault application cannot be configured during Bootstrap. - -**Procedural Changes**: - -The application must be configured after the platform nodes are unlocked / -enabled / available, a storage backend is configured, and ``platform-integ-apps`` -is applied. If Vault is to be run in |HA| configuration (3 vault server pods) -then at least three controller / worker nodes must be unlocked / enabled / available. - -****************************************** -cert-manager cm-acme-http-solver pod fails -****************************************** - -On a multinode setup, when you deploy an acme issuer to issue a certificate, -the ``cm-acme-http-solver`` pod might fail and stays in "ImagePullBackOff" state -due to the following defect https://github.com/cert-manager/cert-manager/issues/5959. - -**Procedural Changes**: - -1. If you are using the namespace "test", create a docker-registry secret - "testkey" with local registry credentials in the "test" namespace. - - .. code-block:: none - - ~(keystone_admin)]$ kubectl create secret docker-registry testkey --docker-server=registry.local:9001 --docker-username=admin --docker-password=Password*1234 -n test - -2. Use the secret "testkey" in the issuer spec as follows: - - .. 
code-block:: none - - apiVersion: cert-manager.io/v1 - kind: Issuer - metadata: - name: stepca-issuer - namespace: test - spec: - acme: - server: https://test.com:8080/acme/acme/directory - skipTLSVerify: true - email: test@test.com - privateKeySecretRef: - name: stepca-issuer - solvers: - - http01: - ingress: - podTemplate: - spec: - imagePullSecrets: - - name: testkey - class: nginx - -************************************************************** -ptp-notification application is not supported during bootstrap -************************************************************** - -- Deployment of ``ptp-notification`` during bootstrap time is not supported due - to dependencies on the system |PTP| configuration which is handled - post-bootstrap. - - **Procedural Changes**: N/A. - -- The :command:`helm-chart-attribute-modify` command is not supported for - ``ptp-notification`` because the application consists of a single chart. - Disabling the chart would render ``ptp-notification`` non-functional. - See :ref:`sysconf-application-commands-and-helm-overrides` for details on - this command. - - **Procedural Changes**: N/A. - -****************************************** -Harbor cannot be deployed during bootstrap -****************************************** - -The Harbor application cannot be deployed during bootstrap due to the bootstrap -deployment dependencies such as early availability of storage class. - -**Procedural Changes**: N/A. - -******************** -Kubevirt Limitations -******************** - -The following limitations apply to Kubevirt in |prod| 10.0: - -- **Limitation**: Kubernetes does not provide CPU Manager detection. - - **Procedural Changes**: Add ``cpumanager`` to Kubevirt: - - .. code-block:: none - - apiVersion: kubevirt.io/v1 - kind: KubeVirt - metadata: - name: kubevirt - namespace: kubevirt - spec: - configuration: - developerConfiguration: - featureGates: - - LiveMigration - - Macvtap - - Snapshot - - CPUManager - - Check the label, using the following command: - - .. code-block:: none - - ~(keystone_admin)]$ kubectl describe node | grep cpumanager - - where `cpumanager=true` - -- **Limitation**: Huge pages do not show up under cat /proc/meminfo inside a - guest VM. Although, resources are being consumed on the host. For example, - if a VM is using 4GB of Huge pages, the host shows the same 4GB of huge - pages used. The huge page memory is exposed as normal memory to the VM. - - **Procedural Changes**: You need to configure Huge pages inside the guest - OS. - -See :ref:`Installation Guides ` for more details. - -- **Limitation**: Virtual machines using Persistent Volume Claim (PVC) must - have a shared ReadWriteMany (RWX) access mode to be live migrated. - - **Procedural Changes**: Ensure |PVC| is created with RWX. - - .. code-block:: - - $ class=cephfs --access-mode=ReadWriteMany - - $ virtctl image-upload --pvc-name=cirros-vm-disk-test-2 --pvc-size=500Mi --storage-class=cephfs --access-mode=ReadWriteMany --image-path=/home/sysadmin/Kubevirt-GA-testing/latest-manifest/kubevirt-GA-testing/cirros-0.5.1-x86_64-disk.img --uploadproxy-url=https://10.111.54.246 -insecure - - .. note:: - - - Live migration is not allowed with a pod network binding of bridge - interface type () - - - Live migration requires ports 49152, 49153 to be available in the - virt-launcher pod. If these ports are explicitly specified in the - masquarade interface, live migration will not function. 
- -- For live migration with |SRIOV| interface: - - - specify networkData: in cloudinit, so when the VM moves to another node - it will not loose the IP config - - - specify nameserver and internal |FQDNs| to connect to cluster metadata - server otherwise cloudinit will not work - - - fix the MAC address otherwise when the VM moves to another node the MAC - address will change and cause a problem establishing the link - - Example: - - .. code-block:: none - - cloudInitNoCloud: - networkData: | - ethernets: - sriov-net1: - addresses: - - 128.224.248.152/23 - gateway: 128.224.248.1 - match: - macAddress: "02:00:00:00:00:01" - nameservers: - addresses: - - 10.96.0.10 - search: - - default.svc.cluster.local - - svc.cluster.local - - cluster.local - set-name: sriov-link-enabled - version: 2 - -- **Limitation**: Snapshot |CRDs| and controllers are not present by default - and needs to be installed on |prod-long|. - - **Procedural Changes**: To install snapshot |CRDs| and controllers on - Kubernetes, see: - - - kubectl apply -f https://raw.githubusercontent.com/kubernetes-csi/external-snapshotter/master/client/config/crd/snapshot.storage.k8s.io_volumesnapshots.yaml - - - kubectl apply -f https://raw.githubusercontent.com/kubernetes-csi/external-snapshotter/master/client/config/crd/snapshot.storage.k8s.io_volumesnapshotcontents.yaml - - - kubectl apply -f https://raw.githubusercontent.com/kubernetes-csi/external-snapshotter/master/client/config/crd/snapshot.storage.k8s.io_volumesnapshotclasses.yaml - - - kubectl apply -f https://raw.githubusercontent.com/kubernetes-csi/external-snapshotter/master/deploy/kubernetes/snapshot-controller/rbac-snapshot-controller.yaml - - - kubectl apply -f https://raw.githubusercontent.com/kubernetes-csi/external-snapshotter/master/deploy/kubernetes/snapshot-controller/setup-snapshot-controller.yaml - - Additionally, create ``VolumeSnapshotClass`` for Cephfs and RBD: - - .. code-block:: none - - cat <cephfs-storageclass.yaml - — - apiVersion: snapshot.storage.k8s.io/v1 - kind: VolumeSnapshotClass - metadata: - name: csi-cephfsplugin-snapclass - driver: cephfs.csi.ceph.com - parameters: - clusterID: 60ee9439-6204-4b11-9b02-3f2c2f0a4344 - csi.storage.k8s.io/snapshotter-secret-name: ceph-pool-kube-cephfs-data - csi.storage.k8s.io/snapshotter-secret-namespace: default deletionPolicy: Delete - - EOF - - .. code-block:: none - - cat <rbd-storageclass.yaml - — - apiVersion: snapshot.storage.k8s.io/v1 - kind: VolumeSnapshotClass - metadata: - name: csi-rbdplugin-snapclass - driver: rbd.csi.ceph.com - parameters: - clusterID: 60ee9439-6204-4b11-9b02-3f2c2f0a4344 - csi.storage.k8s.io/snapshotter-secret-name: ceph-pool-kube-rbd - csi.storage.k8s.io/snapshotter-secret-namespace: default deletionPolicy: Delete - EOF - - .. note:: - - Get the cluster ID from : ``kubectl describe sc cephfs, rbd`` - -- **Limitation**: Live migration is not possible when using configmap as a - filesystem. Currently, virtual machine instances (VMIs) cannot be live migrated as - ``virtiofs`` does not support live migration. - - **Procedural Changes**: N/A. - -- **Limitation**: Live migration is not possible when a VM is using secret - exposed as a filesystem. Currently, virtual machine instances cannot be - live migrated since ``virtiofs`` does not support live migration. - - **Procedural Changes**: N/A. - -- **Limitation**: Live migration will not work when a VM is using - ServiceAccount exposed as a file system. 
Currently, VMIs cannot be live - migrated since ``virtiofs`` does not support live migration. - - **Procedural Changes**: N/A. - -************************************* -synce4l CLI options are not supported -************************************* - -The SyncE configuration using the ``synce4l`` is not supported in |prod| -10.0. - -The service type of ``synce4l`` in the :command:`ptp-instance-add` command -is not supported in |prod-long| 10.0. - -**Procedural Changes**: N/A. - -*************************************************************************** -Kubernetes Pod Core Dump Handler may fail due to a missing Kubernetes token -*************************************************************************** - -In certain cases the Kubernetes Pod Core Dump Handler may fail due to a missing -Kubernetes token resulting in disabling configuration of the coredump on a per -pod basis and limiting namespace access. If application coredumps are not being -generated, verify if the k8s-coredump token is empty on the configuration file: -``/etc/k8s-coredump-conf.json`` using the following command: - -.. code-block:: none - - ~(keystone_admin)]$ ~$ sudo cat /etc/k8s-coredump-conf.json - { - "k8s_coredump_token": "" - - } - -**Procedural Changes**: If the k8s-coredump token is empty in the configuration file and -the kube-apiserver is verified to be responsive, users can re-execute the -create-k8s-account.sh script in order to generate the appropriate token after a -successful connection to kube-apiserver using the following commands: - -.. code-block:: none - - ~(keystone_admin)]$ :/home/sysadmin$ sudo chmod +x /etc/k8s-coredump/create-k8s-account.sh - - ~(keystone_admin)]$ :/home/sysadmin$ sudo /etc/k8s-coredump/create-k8s-account.sh - -**Limitations from previous releases** - -************************************* -Impact of Kubernetes Upgrade to v1.24 -************************************* - -In Kubernetes v1.24 support for the ``RemoveSelfLink`` feature gate was removed. -In previous releases of |prod| this has been set to "false" for backward -compatibility, but this is no longer an option and it is now hardcoded to "true". - -**Procedural Changes**: Any application that relies on this feature gate being disabled -(i.e. assumes the existance of the "self link") must be updated before -upgrading to Kubernetes v1.24. - -****************************************** -Console Session Issues during Installation -****************************************** - -After bootstrap and before unlocking the controller, if the console session times -out (or the user logs out), ``systemd`` does not work properly. ``fm, sysinv and -mtcAgent`` do not initialize. - -**Procedural Changes**: If the console times out or the user logs out between bootstrap -and unlock of controller-0, then, to recover from this issue, you must -re-install the ISO. - -************************************************ -PTP O-RAN Spec Compliant Timing API Notification -************************************************ - -- The ``v1 API`` only supports monitoring a single ptp4l + phc2sys instance. - - **Procedural Changes**: Ensure the system is not configured with multiple instances - when using the v1 API. - -- The O-RAN Cloud Notification defines a /././sync API v2 endpoint intended to - allow a client to subscribe to all notifications from a node. This endpoint - is not supported in StarlingX. - - **Procedural Changes**: A specific subscription for each resource type must be - created instead. 
- -- ``v1 / v2`` - - - v1: Support for monitoring a single ptp4l instance per host - no other - services can be queried/subscribed to. - - - v2: The API conforms to O-RAN.WG6.O-Cloud Notification API-v02.01 - with the following exceptions, that are not supported in StarlingX. - - - O-RAN SyncE Lock-Status-Extended notifications - - - O-RAN SyncE Clock Quality Change notifications - - - O-RAN Custom cluster names - - **Procedural Changes**: See the respective PTP-notification v1 and v2 document - subsections for further details. - - v1: https://docs.starlingx.io/api-ref/ptp-notification-armada-app/api_ptp_notifications_definition_v1.html - - v2: https://docs.starlingx.io/api-ref/ptp-notification-armada-app/api_ptp_notifications_definition_v2.html - - -************************************************************************** -Upper case characters in host names cause issues with kubernetes labelling -************************************************************************** - -Upper case characters in host names cause issues with kubernetes labelling. - -**Procedural Changes**: Host names should be lower case. - -*********************** -Installing a Debian ISO -*********************** - -The disks and disk partitions need to be wiped before the install. -Installing a Debian ISO may fail with a message that the system is -in emergency mode if the disks and disk partitions are not -completely wiped before the install, especially if the server was -previously running a CentOS ISO. - -**Procedural Changes**: When installing a system for any Debian install, the disks must -first be completely wiped using the following procedure before starting -an install. - -Use the following wipedisk commands to run before any Debian install for -each disk (eg: sda, sdb, etc): - -.. code-block:: none - - sudo wipedisk - # Show - sudo sgdisk -p /dev/sda - # Clear part table - sudo sgdisk -o /dev/sda - -.. note:: - - The above commands must be run before any Debian install. The above - commands must also be run if the same lab is used for CentOS installs after - the lab was previously running a Debian ISO. - -********************************** -Security Audit Logging for K8s API -********************************** - -A custom policy file can only be created at bootstrap in ``apiserver_extra_volumes``. -If a custom policy file was configured at bootstrap, then after bootstrap the -user has the option to configure the parameter ``audit-policy-file`` to either -this custom policy file (``/etc/kubernetes/my-audit-policy-file.yml``) or the -default policy file ``/etc/kubernetes/default-audit-policy.yaml``. If no -custom policy file was configured at bootstrap, then the user can only -configure the parameter ``audit-policy-file`` to the default policy file. - -Only the parameter ``audit-policy-file`` is configurable after bootstrap, so -the other parameters (``audit-log-path``, ``audit-log-maxsize``, -``audit-log-maxage`` and ``audit-log-maxbackup``) cannot be changed at -runtime. - -**Procedural Changes**: NA - -**See**: :ref:`kubernetes-operator-command-logging-663fce5d74e7`. - -****************************************** -PTP is not supported on Broadcom 57504 NIC -****************************************** - -|PTP| is not supported on the Broadcom 57504 NIC. - -**Procedural Changes**: None. Do not configure |PTP| instances on the Broadcom 57504 -NIC. 
- -************************************************************************************************ -Deploying an App using nginx controller fails with internal error after controller.name override -************************************************************************************************ - -An Helm override of controller.name to the nginx-ingress-controller app may -result in errors when creating ingress resources later on. - -Example of Helm override: - -.. code-block::none - - cat < values.yml - controller: - name: notcontroller - - EOF - - ~(keystone_admin)$ system helm-override-update nginx-ingress-controller ingress-nginx kube-system --values values.yml - +----------------+-----------------------+ - | Property | Value | - +----------------+-----------------------+ - | name | ingress-nginx | - | namespace | kube-system | - | user_overrides | controller: | - | | name: notcontroller | - | | | - +----------------+-----------------------+ - - ~(keystone_admin)$ system application-apply nginx-ingress-controller - -**Procedural Changes**: NA - -**************************************** Optimization with a Large number of OSDs **************************************** @@ -2102,8 +2873,197 @@ is recommended to use the following commands: ~(keystone_admin)]$ ceph osd pool set kube-rbd pg_num 256 ~(keystone_admin)]$ ceph osd pool set kube-rbd pgp_num 256 +Storage Nodes Recovery on Power Outage +************************************** + +Storage nodes take 10-15 minutes longer to recover in the event of a full +power outage. + +**Procedural Changes**: NA + +Ceph Recovery on an AIO-DX System +********************************* + +In certain instances Ceph may not recover on an |AIO-DX| system, and remains +in the down state when viewed using the +:command"`ceph -s` command; for example, if an |OSD| comes up after a controller +reboot and a swact occurs, or other possible causes for example, hardware +failure of the disk or the entire host, power outage, or switch down. + +**Procedural Changes**: There is no specific command or procedure that solves +the problem for all possible causes. Each case needs to be analyzed individually +to find the root cause of the problem and the solution. + +Restrictions on the Size of Persistent Volume Claims (PVCs) +*********************************************************** + +There is a limitation on the size of Persistent Volume Claims (PVCs) that can +be used for all |prod-long| Releases. + +**Procedural Changes**: It is recommended that all |PVCs| should be a minimum size of +1GB. For more information, see, +https://bugs.launchpad.net/starlingx/+bug/1814595. + +platform-integ-apps application update aborted after removing StarlingX 9.0 +*************************************************************************** + +When StarlingX 9.0 is removed, the ``platform-integ-apps`` application is +downgraded, and a message will be displayed: + +.. code-block:: none + + ceph-csi failure:release rbd-provisioner: Failed during apply :Helm upgrade + failed: cannot patch "rbd.csi.ceph.com" with kind CSIDriver: CSIDriver.storage.k8s.io + "rbd.csi.ceph.com" is invalid: spec.fsGroupPolicy: Invalid value: + "ReadWriteOnceWithFSType": field is immutable. + +**Procedural Changes**: To resolve this problem do the following: + +1. Remove the Container Storage Interface (CSI) drivers using the following + commands: + + .. code-block:: none + + ~(keystone_admin)]$ kubectl delete csidriver cephfs.csi.ceph.com + + ~(keystone_admin)]$ kubectl delete csidriver rbd.csi.ceph.com + +2. 
Update the application so that the correct version is installed. + + .. code-block:: none + + ~(keystone_admin)]$ system application-update /usr/local/share/applications/helm/platform-integ-apps-22.12-72.tgz + +NetApp Permission Error +*********************** + +When installing/upgrading to Trident 20.07.1 and later, and Kubernetes version +1.17 or higher, new volumes created will not be writable if: + +- The storageClass does not specify ``parameter.fsType`` + +- The pod using the requested |PVC| has an ``fsGroup`` enforced as part of a + Security constraint + +**Procedural Changes**: Specify ``parameter.fsType`` in the ``localhost.yml`` file under +``netapp_k8s_storageclasses`` parameters as below. + +The following example shows a minimal configuration in ``localhost.yml``: + +.. code-block:: + + ansible_become_pass: xx43U~a96DN*m.? + trident_setup_dir: /tmp/trident + netapp_k8s_storageclasses: + - metadata: + name: netapp-nas-backend + provisioner: netapp.io/trident + parameters: + backendType: "ontap-nas" + fsType: "nfs" + + netapp_k8s_snapshotstorageclasses: + - metadata: + name: csi-snapclass + +**See**: :ref:`Configure an External NetApp Deployment as the Storage Backend ` + +Restrictions on the Minimum Size of Persistent Volume Claims (PVCs) +******************************************************************* + +There is a limitation on the size of Persistent Volume Claims (PVCs) that can +be used for all |prod| Releases. + +**Procedural Changes**: It is recommended that all PVCs should be a minimum size of +1GB. For more information, see, `https://bugs.launchpad.net/starlingx/+bug/1814595 `__. + +Failure to clean up platform-integ-apps files/Helm release +********************************************************** + +The System Controller does not have Ceph configured, +so the ``platform-integ-apps`` is not installed and the images are not +automatically downloaded to registry.central when upgrading the platform. + +The missing images on the subclouds are: + +.. code-block:: none + + registry.central:9001/docker.io/openstackhelm/ceph-config-helper:ubuntu_focal_18.2.0-1-20231013 + registry.central:9001/quay.io/cephcsi/cephcsi:v3.10.1 + registry.central:9001/registry.k8s.io/sig-storage/csi-attacher:v4.4.2 + registry.central:9001/registry.k8s.io/sig-storage/csi-node-driver-registrar:v2.9.1 + registry.central:9001/registry.k8s.io/sig-storage/csi-provisioner:v3.6.2 + registry.central:9001/registry.k8s.io/sig-storage/csi-resizer:v1.9.2 + registry.central:9001/registry.k8s.io/sig-storage/csi-snapshotter:v6.3.2 + +If the System Controller does not have Ceph configured and the subclouds have +Ceph configured, then the images need to be manually uploaded to the +registry.central before starting the upgrade of the subclouds. + +To push the images to the registry.central, run the following commands on the +System Controller: + +.. 
code-block:: none + + # Change the variables according to the setup + REGISTRY_PREFIX="server:port/path" + REGISTRY_USERNAME="admin" + REGISTRY_PASSWORD="password" + + sudo docker login registry.local:9001 --username ${REGISTRY_USERNAME} --password ${REGISTRY_PASSWORD} + for image in\ + docker.io/openstackhelm/ceph-config-helper:ubuntu_focal_18.2.0-1-20231013 \ + registry.k8s.io/sig-storage/csi-attacher:v4.4.2 \ + registry.k8s.io/sig-storage/csi-node-driver-registrar:v2.9.1 \ + registry.k8s.io/sig-storage/csi-provisioner:v3.6.2 \ + registry.k8s.io/sig-storage/csi-resizer:v1.9.2 \ + registry.k8s.io/sig-storage/csi-snapshotter:v6.3.2 \ + quay.io/cephcsi/cephcsi:v3.10.1 + + do + sudo docker pull ${REGISTRY_PREFIX}/${image} + sudo docker tag ${REGISTRY_PREFIX}/${image} registry.local:9001/${image} + sudo docker push registry.local:9001/${image} + done + +**Procedural Changes**: In case the subcloud upgrade finishes without the correct images +pushed to the registry.central, it is still possible to recover the system +following the steps below. + +After pushing the images to the registry.central, each subcloud must be +recovered with the following steps (these commands should be run on the Subcloud): + +.. code-block:: + + source /etc/platform/openrc + + # Remove old app manually + sudo rm -rf /opt/platform/helm/22.12/platform-integ-apps; + sudo rm -rf /opt/platform/fluxcd/22.12/platform-integ-apps; + sudo -u postgres psql postgres -d sysinv -c "DELETE from kube_app WHERE name = 'platform-integ-apps';"; + sudo sm-restart service sysinv-inv && sudo sm-restart service sysinv-conductor; + sleep 15; # Wait services to restart + system application-upload /usr/local/share/applications/helm/platform-integ-apps-22.12-72.tgz; + sleep 15; # Wait upload to fail (It is expcected to fail here) + system application-delete platform-integ-apps; + system application-upload /usr/local/share/applications/helm/platform-integ-apps-22.12-72.tgz; + sleep 10; # Wait for the upload to succeed + system application-apply platform-integ-apps; + +.. note:: + + The images need to be pushed to the registry.central registry before + upgrading the subclouds. + + +******************************** +**Operating System Limitations** +******************************** + +.. contents:: |minitoc| + :local: + :depth: 1 -*************** BPF is disabled *************** @@ -2142,10 +3102,9 @@ includes the following, but not limited to these packages. - i40e - ice -**Procedural Changes**: It is recommended not to use BPF with real-time kernel. +**Procedural Changes**: It is recommended not to use BPF with real time kernel. If required it can still be used, for example, debugging only. -*********************** Control Group parameter *********************** @@ -2160,264 +3119,27 @@ but not limited to, **systemd, docker, containerd, libvirt** etc. **Procedural Changes**: NA. This is only a warning message about the future deprecation of an interface. -.. Chris F please confirm if this is applicable? +Subcloud Reconfig may fail due to missing inventory file +******************************************************** -**************************************************** -Kubernetes Taint on Controllers for Standard Systems -**************************************************** +The :command:`dcmanager subcloud reconfig` command +may fail due to a missing file `/var/opt/dc/ansible/_inventory.yml`. 
-In Standard systems, a Kubernetes taint is applied to controller nodes in order -to prevent application pods from being scheduled on those nodes; since -controllers in Standard systems are intended ONLY for platform services. -If application pods MUST run on controllers, a Kubernetes toleration of the -taint can be specified in the application's pod specifications. - -**Procedural Changes**: Customer applications that need to run on controllers on -Standard systems will need to be enabled/configured for Kubernetes toleration -in order to ensure the applications continue working after an upgrade from -|prod| 6.0 to |prod-long| future Releases. It is suggested to add -the Kubernetes toleration to your application prior to upgrading to |prod| -8.0. - -You can specify toleration for a pod through the pod specification (PodSpec). -For example: +**Procedural Changes**: Provide the floating OAM IP address of the subcloud using the +"--bootstrap-address" argument. For example: .. code-block:: none - spec: - .... - template: - .... - spec - tolerations: - - key: "node-role.kubernetes.io/master" - operator: "Exists" - effect: "NoSchedule" - - key: "node-role.kubernetes.io/control-plane" - operator: "Exists" - effect: "NoSchedule" + ~(keystone_admin)]$ dcmanager subcloud reconfig --sysadmin-password --deploy-config deployment-config.yaml --bootstrap-address -**See**: `Taints and Tolerations `__. +*************************** +**Horizon GUI Limitations** +*************************** -*************************************************************** -Storage Nodes are not considered part of the Kubernetes cluster -*************************************************************** +.. contents:: |minitoc| + :local: + :depth: 1 -When running the :command:`system kube-host-upgrade-list` command the output -must only display controller and worker hosts that have control-plane and kubelet -components. Storage nodes do not have any of those components and so are not -considered a part of the Kubernetes cluster. - -**Procedural Changes**: Do not include Storage nodes as part of the Kubernetes upgrade. - -************************************** -Application Pods with SRIOV Interfaces -************************************** - -Application Pods with |SRIOV| Interfaces require a **restart-on-reboot: "true"** -label in their pod spec template. - -Pods with |SRIOV| interfaces may fail to start after a platform restore or -Simplex upgrade and persist in the **Container Creating** state due to missing -PCI address information in the CNI configuration. - -**Procedural Changes**: Application pods that require|SRIOV| should add the label -**restart-on-reboot: "true"** to their pod spec template metadata. All pods with -this label will be deleted and recreated after system initialization, therefore -all pods must be restartable and managed by a Kubernetes controller -\(i.e. DaemonSet, Deployment or StatefulSet) for auto recovery. - -Pod Spec template example: - -.. code-block:: none - - template: - metadata: - labels: - tier: node - app: sriovdp - restart-on-reboot: "true" - -************************************** -Storage Nodes Recovery on Power Outage -************************************** - -Storage nodes take 10-15 minutes longer to recover in the event of a full -power outage. 
- -**Procedural Changes**: NA - -********************************* -Ceph Recovery on an AIO-DX System -********************************* - -In certain instances Ceph may not recover on an |AIO-DX| system, and remains -in the down state when viewed using the -:command"`ceph -s` command; for example, if an |OSD| comes up after a controller -reboot and a swact occurs, or other possible causes for example, hardware -failure of the disk or the entire host, power outage, or switch down. - -**Procedural Changes**: There is no specific command or procedure that solves -the problem for all possible causes. Each case needs to be analyzed individually -to find the root cause of the problem and the solution. It is recommended to -contact Customer Support at, -`http://www.windriver.com/support `__. - -******************************************************************* -Cert-manager does not work with uppercase letters in IPv6 addresses -******************************************************************* - -Cert-manager does not work with uppercase letters in IPv6 addresses. - -**Procedural Changes**: Replace the uppercase letters in IPv6 addresses with lowercase -letters. - -.. code-block:: none - - apiVersion: cert-manager.io/v1 - kind: Certificate - metadata: - name: oidc-auth-apps-certificate - namespace: test - spec: - secretName: oidc-auth-apps-certificate - dnsNames: - - ahost.com - ipAddresses: - - fe80::903a:1c1a:e802::11e4 - issuerRef: - name: cloudplatform-interca-issuer - kind: Issuer - -******************************* -Kubernetes Root CA Certificates -******************************* - -Kubernetes does not properly support **k8s_root_ca_cert** and **k8s_root_ca_key** -being an Intermediate CA. - -**Procedural Changes**: Accept internally generated **k8s_root_ca_cert/key** or -customize only with a Root CA certificate and key. - -************************ -Windows Active Directory -************************ - -.. _general-limitations-and-workarounds-ul-x3q-j3x-dmb: - -- **Limitation**: The Kubernetes API does not support uppercase IPv6 addresses. - - **Procedural Changes**: The issuer_url IPv6 address must be specified as - lowercase. - -- **Limitation**: The refresh token does not work. - - **Procedural Changes**: If the token expires, manually replace the ID token. For - more information, see, :ref:`Configure Kubernetes Client Access - `. - -- **Limitation**: TLS error logs are reported in the **oidc-dex** container - on subclouds. These logs should not have any system impact. - - **Procedural Changes**: NA - -.. Stx LP Bug: https://bugs.launchpad.net/starlingx/+bug/1846418 Won't fix. -.. To be addressed in a future update. - -************ -BMC Password -************ - -The |BMC| password cannot be updated. - -**Procedural Changes**: In order to update the |BMC| password, de-provision the |BMC|, -and then re-provision it again with the new password. - -**************************************** -Application Fails After Host Lock/Unlock -**************************************** - -In some situations, application may fail to apply after host lock/unlock due to -previously evicted pods. - -**Procedural Changes**: Use the :command:`kubectl delete` command to delete the evicted -pods and reapply the application. - -*************************************** -Application Apply Failure if Host Reset -*************************************** - -If an application apply is in progress and a host is reset it will likely fail. 
-A re-apply attempt may be required once the host recovers and the system is -stable. - -**Procedural Changes**: Once the host recovers and the system is stable, a re-apply -may be required. - -************************* -Platform CPU Usage Alarms -************************* - -Alarms may occur indicating platform cpu usage is greater than 90% if a large -number of pods are configured using liveness probes that run every second. - -**Procedural Changes**: To mitigate either reduce the frequency for the liveness -probes or increase the number of platform cores. - -******************* -Pods Using isolcpus -******************* - -The isolcpus feature currently does not support allocation of thread siblings -for cpu requests (i.e. physical thread +HT sibling). - -**Procedural Changes**: For optimal results, if hyperthreading is enabled then -isolcpus should be allocated in multiples of two in order to ensure that both -|SMT| siblings are allocated to the same container. - -*********************************************************** -Restrictions on the Size of Persistent Volume Claims (PVCs) -*********************************************************** - -There is a limitation on the size of Persistent Volume Claims (PVCs) that can -be used for all |prod| Releases. - -**Procedural Changes**: It is recommended that all |PVCs| should be a minimum size of -1GB. For more information, see, -https://bugs.launchpad.net/starlingx/+bug/1814595. - -*************************************************************** -Sub-Numa Cluster Configuration not Supported on Skylake Servers -*************************************************************** - -Sub-Numa cluster configuration is not supported on Skylake servers. - -**Procedural Changes**: For servers with Skylake Gold or Platinum CPUs, Sub-|NUMA| -clustering must be disabled in the BIOS. - -***************************************************************** -The ptp-notification-demo App is Not a System-Managed Application -***************************************************************** - -The ptp-notification-demo app is provided for demonstration purposes only. -Therefore, it is not supported on typical platform operations such as Upgrades -and Backup and Restore. - -**Procedural Changes**: NA - -************************************************************************* -Deleting image tags in registry.local may delete tags under the same name -************************************************************************* - -When deleting image tags in the registry.local docker registry, you should be -aware that the deletion of an **** will delete all tags -under the specified that have the same 'digest' as the specified -. For more information, see, :ref:`Delete Image Tags in -the Docker Registry `. - -**Procedural Changes**: NA - -**************************************************************************** Unable to create Kubernetes Upgrade Strategy for Subclouds using Horizon GUI **************************************************************************** @@ -2441,24 +3163,10 @@ subcloud using the Horizon GUI, it fails and displays the following error: :ref:`apply-a-kubernetes-upgrade-strategy-using-horizon-2bb24c72e947` -********************************************** -Power Metrics Application in Real Time Kernels -********************************************** - -When executing Power Metrics application in Real -Time kernels, the overall scheduling latency may increase due to inter-core -interruptions caused by the MSR (Model-specific Registers) reading. 
- -Due to intensive workloads the kernel may not be able to handle the MSR -reading interruptions resulting in stalling data collection due to -not being scheduled on the affected core. - **Procedural Changes**: N/A. -*********************************************** k8s-coredump only supports lowercase annotation *********************************************** - Creating K8s pod core dump fails when setting the ``starlingx.io/core_pattern`` parameter in upper case characters on the pod manifest. This results in the pod being unable to find the target directory @@ -2469,42 +3177,6 @@ lower case characters for the path and file name where the core dump is saved. **See**: :ref:`kubernetes-pod-coredump-handler-54d27a0fd2ec`. -*********************** -NetApp Permission Error -*********************** - -When installing/upgrading to Trident 20.07.1 and later, and Kubernetes version -1.17 or higher, new volumes created will not be writable if: - -- The storageClass does not specify ``parameter.fsType`` - -- The pod using the requested |PVC| has an ``fsGroup`` enforced as part of a - Security constraint - -**Procedural Changes**: Specify ``parameter.fsType`` in the ``localhost.yml`` file under -``netapp_k8s_storageclasses`` parameters as below. - -The following example shows a minimal configuration in ``localhost.yml``: - -.. code-block:: - - ansible_become_pass: xx43U~a96DN*m.? - trident_setup_dir: /tmp/trident - netapp_k8s_storageclasses: - - metadata: - name: netapp-nas-backend - provisioner: netapp.io/trident - parameters: - backendType: "ontap-nas" - fsType: "nfs" - - netapp_k8s_snapshotstorageclasses: - - metadata: - name: csi-snapclass - -**See**: :ref:`Configure an External NetApp Deployment as the Storage Backend ` - -******************************** Huge Page Limitation on Postgres ******************************** @@ -2522,226 +3194,72 @@ command. The Procedural Changes is not persistent, therefore, if the host is rebooted it will need to be applied again. This will be fixed in a future release. -************************************************ -Password Expiry does not work on LDAP user login -************************************************ - -On Debian, the warning message is not being displayed for Active Directory users, -when a user logs in and the password is nearing expiry. Similarly, on login -when a user's password has already expired, the password change prompt is not -being displayed. - -**Procedural Changes**: It is recommended that users rely on Directory administration -tools for "Windows Active Directory" servers to handle password updates, -reminders and expiration. It is also recommended that passwords should be -updated every 3 months. - -.. note:: - - The expired password can be reset via Active Directory by IT administrators. - -*************************************** -Silicom TimeSync (STS) card limitations -*************************************** - -* Silicom and Intel based Time Sync NICs may not be deployed on the same system - due to conflicting time sync services and operations. - - |PTP| configuration for Silicom TimeSync (STS) cards is handled separately - from |prod| host |PTP| configuration and may result in configuration - conflicts if both are used at the same time. - - The sts-silicom application provides a dedicated ``phc2sys`` instance which - synchronizes the local system clock to the Silicom TimeSync (STS) card. Users - should ensure that ``phc2sys`` is not configured via |prod| |PTP| Host - Configuration when the sts-silicom application is in use. 
- - Additionally, if |prod| |PTP| Host Configuration is being used in parallel - for non-STS NICs, users should ensure that all ``ptp4l`` instances do not use - conflicting ``domainNumber`` values. - -* When the Silicom TimeSync (STS) card is configured in timing mode using the - sts-silicom application, the card goes through an initialization process on - application apply and server reboots. The ports will bounce up and down - several times during the initialization process, causing network traffic - disruption. Therefore, configuring the platform networks on the Silicom - TimeSync (STS) card is not supported since it will cause platform - instability. - -**Procedural Changes**: N/A. - -*********************************** -N3000 Image in the containerd cache -*********************************** - -The |prod-long| system without an N3000 image in the containerd cache fails to -configure during a reboot cycle, and results in a failed / disabled node. - -The N3000 device requires a reset early in the startup sequence. The reset is -done by the n3000-opae image. The image is automatically downloaded on bootstrap -and is expected to be in the cache to allow the reset to succeed. If the image -is not in the cache for any reason, the image cannot be downloaded as -``registry.local`` is not up yet at this point in the startup. This will result -in the impacted host going through multiple reboot cycles and coming up in an -enabled/degraded state. To avoid this issue: - -1. Ensure that the docker filesystem is properly engineered to avoid the image - being automatically removed by the system if flagged as unused. - For instructions to resize the filesystem, see - :ref:`Increase Controller Filesystem Storage Allotments Using the CLI ` - -2. Do not manually prune the N3000 image. - -**Procedural Changes**: Use the procedure below. - -.. rubric:: |proc| - -#. Lock the node. - - .. code-block:: none - - ~(keystone_admin)]$ system host-lock controller-0 - -#. Pull the (N3000) required image into the ``containerd`` cache. - - .. code-block:: none - - ~(keystone_admin)]$ crictl pull registry.local:9001/docker.io/starlingx/n3000-opae:stx.8.0-v1.0.2 - -#. Unlock the node. - - .. code-block:: none - - ~(keystone_admin)]$ system host-unlock controller-0 - -.. Henrique please confirm if this is applicable in 10.0?? - -***************** Quartzville Tools ***************** The following :command:`celo64e` and :command:`nvmupdate64e` commands are not -supported in StarlingX due to a known issue in Quartzville tools that crashes -the host. +supported in |prod-long|, Release 9.0 due to a known issue in Quartzville +tools that crashes the host. **Procedural Change**: Reboot the host using the boot screen menu. -******************************************************************************************************* -``ptp4l`` error "timed out while polling for tx timestamp" reported for NICs using the Intel ice driver -******************************************************************************************************* +------------------------------ +Deprecated Notices in Stx 11.0 +------------------------------ -NICs using the Intel® ice driver may report the following error in the ``ptp4l`` -logs, which results in a |PTP| port switching to ``FAULTY`` before -re-initializing. +******************************* +In-tree and Out-of-tree drivers +******************************* -.. note:: +In |prod-long| Release 11.0 only the out-of-tree versions of the Intel ``ice`` +``i40e``, and ``iavf`` drivers are supported. 
+**See**: :ref:`intel-driver-version-c6e3fa384ff7`

-    |PTP| ports frequently switching to ``FAULTY`` may degrade the accuracy of
-    the |PTP| timing.
-
-.. code-block:: none
-
-   ptp4l[80330.489]: timed out while polling for tx timestamp
-   ptp4l[80330.489]: increasing tx_timestamp_timeout may correct this issue, but it is likely caused by a driver bug

+**************************************
+Kubernetes Root CA bootstrap overrides
+**************************************

-.. note::
+The overrides ``k8s_root_ca_cert``, ``k8s_root_ca_key``, and ``apiserver_cert_sans``
+will be deprecated in a future release. External connections to ``kube-apiserver``
+are now routed through a proxy that identifies itself using the REST API/GUI
+certificate issued by the platform issuer (system-local-ca).

-    This is due to a limitation with the Intel® ice driver as the driver cannot
-    guarantee the time interval to return the timestamp to the ``ptp4l`` user
-    space process which results in the occasional timeout error message.
+**See**: :ref:`ansible_bootstrap_configs_r7`

-**Procedural Changes**: The Procedural Changes recommended by Intel is to increase the
-``tx_timestamp_timeout`` parameter in the ``ptp4l`` config. The increased
-timeout value gives more time for the ice driver to provide the timestamp to
-the ``ptp4l`` user space process. Timeout values of 50ms and 700ms have been
-validated. However, the user can use a different value if it is more suitable
-for their system.
+************************
+kubernetes-power-manager
+************************

-.. code-block:: none
+Intel has stopped supporting the ``kubernetes-power-manager`` application. It is
+still supported by |prod-long|, but it will be removed in a future release.

-   ~(keystone_admin)]$ system ptp-instance-parameter-add tx_timestamp_timeout=700
-   ~(keystone_admin)]$ system ptp-instance-apply
+The ``cpu_busy_cycles`` metric is deprecated and must be replaced with
+``cpu_c0_state_residency_percent`` for continued use if the metrics are
+customized through Helm overrides.
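
+If the metrics have been customized through user Helm overrides, the existing
+overrides can be checked for the deprecated name before updating them. The
+commands below are a minimal sketch only; the chart name and namespace are
+placeholders and should be taken from the ``helm-override-list`` output:
+
+.. code-block:: none
+
+   ~(keystone_admin)]$ system helm-override-list kubernetes-power-manager
+   ~(keystone_admin)]$ system helm-override-show kubernetes-power-manager <chart name> <namespace> | grep cpu_busy_cycles
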
+For more information, see :ref:`configurable-power-manager-04c24b536696`.

-.. note::
-
-   The ``ptp4l`` timeout error log may also be caused by other underlying
-   issues, such as NIC port instability. Therefore, it is recommended to
-   confirm the NIC port is stable before adjusting the timeout values.
-
-***************************************************
-Cert-manager accepts only short hand IPv6 addresses
-***************************************************
-
-Cert-manager accepts only short hand IPv6 addresses.
-
-**Procedural Changes**: You must use the following rules when defining IPv6 addresses
-to be used by Cert-manager.
-
-- all letters must be in lower case
-
-- each group of hexadecimal values must not have any leading 0s
-  (use :12: instead of :0012:)
-
-- the longest sequence of consecutive all-zero fields must be short handed
-  with ``::``
-
-- ``::`` must not be used to short hand an IPv6 address with 7 groups of hexadecimal
-  values, use :0: instead of ``::``
-
-.. note::
-
-   Use the rules above to set the IPv6 address related to the management
-   and |OAM| network in the Ansible bootstrap overrides file, localhost.yml.
-
-.. code-block:: none
-
-   apiVersion: cert-manager.io/v1
-   kind: Certificate
-   metadata:
-     name: oidc-auth-apps-certificate
-     namespace: test
-   spec:
-     secretName: oidc-auth-apps-certificate
-     dnsNames:
-     - ahost.com
-     ipAddresses:
-     - fe80:12:903a:1c1a:e802::11e4
-     issuerRef:
-       name: cloudplatform-interca-issuer
-       kind: Issuer
-
-.. Stx LP Bug: https://bugs.launchpad.net/starlingx/+bug/1846418 Won't fix.
-.. To be addressed in a future update.
-
-.. All please confirm if all these have been removed from the StarlingX 10.0?
-
-------------------
-Deprecated Notices
-------------------

***************
Bare metal Ceph
***************

-Host-based Ceph will be deprecated in a future release. Adoption
-of Rook-Ceph is recommended for new deployments as some host-based Ceph
-deployments may not be upgradable.
-
-*********************************************************
-No support for system_platform_certificate.subject_prefix
-*********************************************************
-
-|prod| 10.0 no longer supports system_platform_certificate.subject_prefix
-This is an optional field to add a prefix to further identify the certificate,
-for example, |prod| for instance.
-
+Host-based Ceph is deprecated in |prod-long| Release 11.0. Adoption of
+Rook-Ceph is recommended for new deployments to avoid the service disruption
+involved in migrating from bare metal Ceph to Rook-Ceph.

***************************************************
Static Configuration for Hardware Accelerator Cards
***************************************************

Static configuration for hardware accelerator cards is deprecated in
-|prod| 10.0 and will be discontinued in future releases.
+|prod-long| Release 24.09.00 and will be discontinued in future releases.

Use |FEC| operator instead.

**See** :ref:`Switch between Static Method Hardware Accelerator and SR-IOV FEC Operator `
@@ -2750,8 +3268,8 @@ Use |FEC| operator instead.
N3000 FPGA Firmware Update Orchestration
****************************************

-The N3000 |FPGA| Firmware Update Orchestration has been deprecated in |prod|
-10.0. For more information, see :ref:`n3000-overview` for more
+The N3000 |FPGA| Firmware Update Orchestration has been deprecated in |prod-long|
+Release 24.09.00. See :ref:`n3000-overview` for more
 information.

********************
show-certs.sh Script
********************

The ``show-certs.sh`` script that is available when you ssh to a controller is
-deprecated in |prod| 10.0.
+deprecated in |prod-long| Release 11.0.

The new response format of the 'system certificate-list' RESTAPI / CLI now
provides the same information as provided by ``show-certs.sh``.

@@ -2768,17 +3286,17 @@ provides the same information as provided by ``show-certs.sh``.
Kubernetes APIs
***************

-Kubernetes APIs that will be removed in K8s 1.25 are listed below:
+Kubernetes APIs that will be removed in K8s 1.27 are listed below:

-**See**: https://kubernetes.io/docs/reference/using-api/deprecation-guide/#v1-25
+**See**: https://kubernetes.io/docs/reference/using-api/deprecation-guide/#v1-27

***********************
ptp-notification v1 API
***********************

-The ptp-notification v1 API can still be used in |prod| 10.0.
-The v1 API will be removed in a future release and only the O-RAN Compliant
-Notification API (ptp-notification v2 API) will be supported.
+The ptp-notification v1 API can still be used in |prod-long| Release 11.0.
+The v1 API will be removed in a future release and only the O-RAN +Compliant Notification API (ptp-notification v2 API) will be supported. .. note:: @@ -2786,319 +3304,24 @@ Notification API (ptp-notification v2 API) will be supported. Notification API (ptp-notification v2 API). ------------------- -Removed in Stx 10.0 +Removed in Stx 11.0 ------------------- -``kube-ignore-isol-cpus`` is no longer supported in |prod| 10.0. +****************** +MacVTap Interfaces +****************** -******************* -Pod Security Policy -******************* - -Pod Security Policy (PSP) is removed in |prod| 10.0 and -K8s v1.25 and ONLY applies if running on K8s v1.24 or earlier. Instead of -using Pod Security Policy, you can enforce similar restrictions on Pods -using Pod Security Admission Controller (PSAC) supporting K8s v1.25. - -.. note:: - - Although |prod| 10.0 still supports K8s v1.24 which supports - |PSP|, |prod| 10.0 has removed the |prod| default |PSP| policies, - roles and role-bindings that made |PSP| usable in |prod|; It is important - to note that |prod| 10.0 is officially NOT supporting the use - of |PSP| in its Kubernetes deployment. - -.. important:: - - Upgrades - - - |PSP| should be removed on hosted application's and converted to - |PSA| Controller before the upgrade to |prod| 10.0. - -.. - On 'upgrade activate or complete' of the upgrade to |prod| -.. 10.0, ALL |PSP| policies and all previously auto-generated ClusterRoles -.. and ClusterRoleBindings associated with |PSP| policies will be removed. - - - Using the :command:`system application-update` command for Platform - applications will remove the use of roles or rolebindings dealing with - |PSP| policies. - - - |PSA| Controller mechanisms should be configured to enforce the constraints that - the previous PSP policies were enforcing. - -**See**: :ref:`Pod Security Admission Controller ` - -******************************* -System certificate CLI Commands -******************************* - -The following commands are removed in |prod| 10.0 and replaced -by: - -- ``system certificate-install -m ssl `` - has been replaced by an automatically installed 'system-restapi-gui-certificate' - CERTIFICATE (in the 'deployment' namespace) which can be modified using the - 'update_platform_certificates' Ansible playbook - -- ``system certificate-install -m openstack `` - has been replaced by 'system os-certificate-install ' - -- ``system certificate-install -m ssl_ca `` - -- ``system certificate-install -m docker_registry `` - has been replaced by an automatically installed 'system-registry-local-certificate' - CERTIFICATE (in the 'deployment' namespace) which can be modified using the - 'update_platform_certificates' Ansible playbook - -- ``system certificate-uninstall -m ssl_ca `` and - ``system certificate-uninstall -m ssl_ca `` - have been replaced by: - - - ``'system ca-certificate-install '`` - - ``'system ca-certificate-uninstall '`` - -.. _appendix-commands-replaced-by-usm-for-updates-and-upgrades-835629a1f5b8: - ------------------------------------------------------------------------- -Appendix A - Commands replaced by USM for Updates (Patches) and Upgrades ------------------------------------------------------------------------- - -.. 
toctree:: - :maxdepth: 1 - -********************************** -Manually Managing Software Patches -********************************** - -The ``sudo sw-patch`` commands for manually managing software patches have -been replaced by ``software`` commands as listed below: - -The following commands for manually managing software patches are **no** longer -supported: - -- sw-patch upload - -- sw-patch upload-dir - -- sw-patch query - -- sw-patch show - -- sw-patch apply - -- sw-patch query-hosts - -- sw-patch host-install - -- sw-patch host-install-async - -- sw-patch remove - -- sw-patch delete - -- sw-patch what-requires - -- sw-patch query-dependencies - -- sw-patch is-applied - -- sw-patch is-available - -- sw-patch install-local - -- sw-patch drop-host - -- sw-patch commit - -Software patching is now manually managed by the ``software`` commands -described in the :ref:``Manual Host Software Deployment `` -procedure. - -- software upload - -- software upload-dir - -- software list - -- software delete - -- software show - -- software deploy precheck - -- software deploy start - -- software deploy show - -- software deploy host - -- software deploy host-rollback - -- software deploy localhost - -- software deploy host-list - -- software deploy activate - -- software deploy complete - -- software deploy delete - -************************ -Manual Software Upgrades -************************ - -The ``system load-delete/import/list/show``, -``system upgrade-start/show/activate/abort/abort-complete/complete`` and -``system host-upgrade/upgrade-list/downgrade`` commands for manually managing -software upgrades have been replaced by ``software`` commands. - -The following commands for manually managing software upgrades are **no** longer -supported: - -- system load-import - -- system load-list - -- system load-show - -- system load-delete - -- system upgrade-start - -- system upgrade-show - -- system host-upgrade - -- system host-upgrade-list - -- system upgrade-activate - -- system upgrade-complete - -- system upgrade-abort - -- system host-downgrade - -- system upgrade-abort-complete - -Software upgrade is now manually managed by the ``software`` commands described -in the :ref:`manual-host-software-deployment-ee17ec6f71a4` -procedure. - -- software upload - -- software upload-dir - -- software list - -- software delete - -- software show - -- software deploy precheck - -- software deploy start - -- software deploy show - -- software deploy host - -- software deploy localhost - -- software deploy host-list - -- software deploy activate - -- software deploy complete - -- software deploy delete - -- software deploy abort - -- software deploy host-rollback - -- software deploy activate-rollback - -********************************* -Orchestration of Software Patches -********************************* - -The ``sw-manager patch-strategy-create/apply/show/abort/delete`` commands for -managing the orchestration of software patches have been replaced by -``sw-manager sw-deploy-strategy-create/apply/show/abort/delete`` commands. - -The following commands for managing the orchestration of software patches are -**no** longer supported - -- sw-manager patch-strategy create ... ... 
- -- sw-manager patch-strategy show - -- sw-manager patch-strategy apply - -- sw-manager patch-strategy abort - -- sw-manager patch-strategy delete - -Orchestrated software patching is now managed by the -``sw-manager sw-deploy-strategy-create/apply/show/abort/delete`` commands -described in the :ref:`orchestrated-deployment-host-software-deployment-d234754c7d20` -procedure. - -- sw-manager sw-deploy-strategy create ... ... - -- sw-manager sw-deploy-strategy show - -- sw-manager sw-deploy-strategy apply - -- sw-manager sw-deploy-strategy abort - -- sw-manager sw-deploy-strategy delete - -********************************** -Orchestration of Software Upgrades -********************************** - -The ``sw-manager patch-strategy-create/apply/show/abort/delete`` commands for -managing the orchestration of software upgrades have been replaced by -``sw-manager sw-deploy-strategy-create/apply/show/abort/delete`` commands. - -The following commands for managing the orchestration of software upgrades are -no longer supported. - -- sw-manager upgrade-strategy create ... ... - -- sw-manager upgrade-strategy show - -- sw-manager upgrade-strategy apply - -- sw-manager upgrade-strategy abort - -- sw-manager upgrade-strategy delete - -Orchestrated software upgrade is now managed by the -``sw-manager sw-deploy-strategy-create/apply/show/abort/delete`` commands -described in the :ref:`orchestrated-deployment-host-software-deployment-d234754c7d20` -procedure. - -- sw-manager sw-deploy-strategy create < ... ... - -- sw-manager sw-deploy-strategy show - -- sw-manager sw-deploy-strategy apply - -- sw-manager sw-deploy-strategy abort - -- sw-manager sw-deploy-strategy delete +MacVTap interfaces for KubeVirt |VMs| are not supported in |prod-long| +Release 11.0 and future releases. -------------------------------------- Release Information for other versions -------------------------------------- -You can find details about a release on the specific release page. +You can find details about a release on the specific release page at: +https://wiki.openstack.org/wiki/StarlingX/Release_Plan#List_of_Releases. -.. To change the 9.0 link +.. To recheck the 11.0 link .. list-table:: @@ -3106,6 +3329,10 @@ You can find details about a release on the specific release page. - Release Date - Notes - Status + * - StarlingX R11.0 + - 2025-11 + - https://docs.starlingx.io/r/stx.11.0/releasenotes/index.html + - Maintained * - StarlingX R10.0 - 2025-02 - https://docs.starlingx.io/r/stx.10.0/releasenotes/index.html @@ -3113,19 +3340,19 @@ You can find details about a release on the specific release page. * - StarlingX R9.0 - 2024-03 - https://docs.starlingx.io/r/stx.9.0/releasenotes/index.html - - Maintained + - :abbr:`EOL (End of Life)` * - StarlingX R8.0 - 2023-02 - https://docs.starlingx.io/r/stx.8.0/releasenotes/index.html - - Maintained + - :abbr:`EOL (End of Life)` * - StarlingX R7.0 - 2022-07 - https://docs.starlingx.io/r/stx.7.0/releasenotes/index.html - - Maintained + - :abbr:`EOL (End of Life)` * - StarlingX R6.0 - 2021-12 - https://docs.starlingx.io/r/stx.6.0/releasenotes/index.html - - Maintained + - :abbr:`EOL (End of Life)` * - StarlingX R5.0.1 - 2021-09 - https://docs.starlingx.io/r/stx.5.0/releasenotes/index.html