2126 Commits

Author SHA1 Message Date
Igor Pires Soares
c2c9a70d97 Revert "Add instrumentation log for kubeadm join command"
This reverts commit b1ec6538c6daeb21c01f59b74196b87bf99040ec.

Reason for revert: Fix an issue where puppet doesn't stop when a command
fails. Also aims to fix kubeadm join failures when unlocking
controller-1 after Backup & Restore.

This also forces Zuul to use Python 3.9, since it was running tox checks with Python 3.12.

Closes-bug: 2090224
Change-Id: Iebb76674e45f82563d9901b44cf0afe44a436822
2024-11-29 23:16:49 +00:00
Wallysson Silva
74f90fc674 Escape $ character in dc configuration files
This ensures that a password containing a $ symbol can be read by
oslo_config. oslo_config supports variable substitution [1], so to
avoid substitution the $ must be escaped as $$.

[1]: https://docs.openstack.org/oslo.config/latest/configuration/format.html#substitution
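
A minimal sketch of the escaping rule described above. The helper name is hypothetical and is not taken from the change itself; it only illustrates doubling each $ before a value is written to a file that oslo_config will parse.

```python
def escape_for_oslo_config(value: str) -> str:
    """Double every '$' so oslo_config reads it as a literal character.

    Hypothetical helper for illustration; the actual change may apply
    the escaping elsewhere (for example in the puppet manifests).
    """
    return value.replace("$", "$$")


if __name__ == "__main__":
    # Made-up password, for illustration only.
    print(escape_for_oslo_config("St8rlingX$"))
```

A value without any $ passes through unchanged, so the helper is safe to apply to every password.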

Test Plan:
- PASS: bootstrap a system controller with keystone admin password
containing $, dc services should start

Closes-Bug: 2089783
Change-Id: Icdbfae04b663bb9373116ff4967d4d78f57625c6
Signed-off-by: Wallysson Silva <wallysson.silva@windriver.com>
2024-11-27 15:54:13 -03:00
Zuul
cf6171a7dd Merge "Fix CLI behavior change for Helm v > 3.3.1" 2024-11-26 21:41:50 +00:00
Zuul
890719a2d1 Merge "Remove sw-patch-agent.service from manifests" 2024-11-26 18:57:11 +00:00
Marcelo de Castro Loebens
c863d68029 Fix CLI behavior change for Helm v > 3.3.1
Helm changed the CLI behavior of the command
'helm repo add' in versions later than 3.3.1. For reference:
https://github.com/helm/helm/issues/8771 .

This caused the code inside platform::helm::repository to return an
error when repos are updated, which in turn makes the puppet manifest
fail.

The previous behavior can be achieved by using the flag 'force-update'
introduced by Helm. However, this is not backwards compatible, so its
usage is conditioned on the software version to decrease the chance of
issues during upgrades from stx 8.0.
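
The conditional use of the flag can be sketched as follows. This is an illustration under stated assumptions, not the manifest logic: the commit gates on the platform software version, while this sketch gates on a Helm version string purely to show the idea. The function names and the repository name/URL arguments are hypothetical.

```python
import re


def parse_version(text: str) -> tuple:
    """Extract (major, minor, patch) from strings such as 'v3.12.2'."""
    match = re.match(r"v?(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"unrecognized version: {text!r}")
    return tuple(int(part) for part in match.groups())


def helm_repo_add_args(name: str, url: str, helm_version: str) -> list:
    """Build the 'helm repo add' argument list.

    Assumption for illustration: the flag is appended only for Helm
    versions later than 3.3.1, where the CLI behavior changed.
    """
    args = ["helm", "repo", "add", name, url]
    if parse_version(helm_version) > (3, 3, 1):
        args.append("--force-update")
    return args
```

Comparing versions as integer tuples avoids the string-comparison trap where "3.12.0" sorts before "3.3.1".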

Test plan:
PASS: Bootstrap DC + SX subcloud.

PASS: Switch http_port to 80 (default is 8080).
      Verified that the puppet manifest executed successfully.
      Verified that helm repos were updated.

PASS: Switch https_enabled to false.
      Verified that Horizon is accessible using the port 80.

Story: 2011266
Task: 51407

Change-Id: I5dd0a5ac1914073a6e500cd9dde602aa63b874eb
Signed-off-by: Marcelo de Castro Loebens <Marcelo.DeCastroLoebens@windriver.com>
2024-11-26 14:08:45 +00:00
Bin Qian
1b6ead7eba fix compute node puppet failure after upgrade
Fix a small syntax error in the puppet manifest that causes the compute
node main manifest to fail after upgrade to stx-10.

Closes-bug: 2089591

TCs:
     passed: compute node unlock successfully with kubelet service
             running after upgrade to stx-10.
Change-Id: I73eb9b69647a0d8e05a8e4792fcec6deec6531fc
Signed-off-by: Bin Qian <bin.qian@windriver.com>
2024-11-25 19:36:05 +00:00
mmachado
020cfa9d63 Remove sw-patch-agent.service from manifests
sw-patch-agent service is to be disabled and must be removed
from patching and keystone manifests.

Depends-On: https://review.opendev.org/c/starlingx/config/+/936143

Test-Plan:
PASS: AIO-SX upgrade using sw-manager strategy
PASS: AIO-DX System Controller upgrade using strategy
PASS: subcloud upgrade using dcmanager strategy

Story: 2010676
Task: 51386

Change-Id: I201de8f2f2f4f16ad2d01933881a61f3ad41af7c
Signed-off-by: mmachado <mmachado@windriver.com>
2024-11-25 09:11:22 -03:00
Zuul
cd1a3ee190 Merge "Change log in check_grub_config to Facter::warn" 2024-11-22 18:33:42 +00:00
Zuul
52997b98a2 Merge "Configure systemd CPUShares/CPUQuota for Kubernetes services" 2024-11-22 17:30:34 +00:00
Kyale, Eliud
7cdae0548e Change log in check_grub_config to Facter::warn
Switch from Puppet::info to Facter::warn.
This is a different log API that doesn't rely on debug being enabled.

Test plan:

PASS - AIO-SX: iso install
PASS - AIO-DX: iso install
PASS - trigger update_grub_config.rb to test logging

Example log:

2024-11-22T17:22:52.792 Warning: 2024-11-22 17:22:52 +0000 Facter:
nohz_full=disabled is not presented in ...

Related-Bug : 2089028

Change-Id: I24598a2b54a6649a0f76b6a4b295eb1254a203dc
Signed-off-by: Kyale, Eliud <Eliud.Kyale@windriver.com>
2024-11-22 12:26:34 -05:00
Jim Gauld
602fa7a08a Configure systemd CPUShares/CPUQuota for Kubernetes services
This creates the systemd k8splatform.slice and this is configured
with 10x CPUShares. Kubernetes services are latency critical.
The following services are members of k8splatform.slice:
etcd.service, containerd.service, kubelet.service.

This also configures systemd CPUQuota:
- 75%*PlatformCPUS for kubelet.service
- 75%*PlatformCPUS for containerd.service
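
The quota arithmetic above can be sketched as follows, assuming "75%*PlatformCPUS" means 75% of one CPU multiplied by the number of platform CPUs. The function names and the drop-in layout are hypothetical and are not taken from the manifests.

```python
def cpu_quota_percent(platform_cpus: int, fraction: float = 0.75) -> int:
    """systemd CPUQuota is a percentage of a single CPU, so a quota
    spanning N CPUs at 75% each is N * 75 (it may exceed 100%)."""
    return round(platform_cpus * fraction * 100)


def render_dropin(platform_cpus: int) -> str:
    """Render a hypothetical systemd drop-in for a Kubernetes service.

    The Slice= line is an assumption about how membership in
    k8splatform.slice might be expressed.
    """
    return (
        "[Service]\n"
        "Slice=k8splatform.slice\n"
        f"CPUQuota={cpu_quota_percent(platform_cpus)}%\n"
    )


if __name__ == "__main__":
    print(render_dropin(2))
```

With two platform CPUs the rendered quota is 150%, which is why the value has to be recomputed when the number of platform CPUs is reconfigured.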

In general the process behaviour of containerd and etcd services
are auto-regulated by the load from kubelet. Usually these three
services are well behaved (highly interactive, wakeup, do
little work), mostly request driven.

In theory putting a quota on kubelet.service should be sufficient,
but there is occasionally a runaway log-flooding problem causing
containerd/containerd-shim to use too much. This is the reason
to also put a quota on containerd.service.

This adds systemd hung behavior mitigations for Kubernetes DropIn
files configuration, used after daemon-reload and restarting
services. This includes usage of new scripts:
- verify-systemd-running.sh

This is part of an overall set of adjustments that are required for systemd
cgroups CPUShares, CPUQuota, and AllowedCPUs for key system services.
This will improve latency of Kubernetes critical components, and
throttle less important services.

Partial-Bug: 2084714

TEST PLAN:
AIO-SX, AIO-DX, Standard, Storage, DC:
- PASS: Fresh install
- PASS: bootstrap: Verify that K8S services run under k8splatform.slice
        systemctl status k8splatform.slice
- PASS: unlock: Verify that K8S services run under k8splatform.slice
- PASS: reboot: Verify that K8S services run under k8splatform.slice

AIO-SX:
- PASS: reconfigure number platform cpus; unlock, verify updated
        CPUShares: kubelet, containerd
- PASS: ansible-playbook replay
- PASS: Platform USM Upgrade; verify systemd CPUShares settings
- FAIL: docker-stx-override.conf requires regeneration

AIO-SX, AIO-DX:
- PASS: BnR - verify CPUShares after restore
- PASS: K8S orchestrated Upgrade 1.24 - 1.29
- TODO: Platform USM Upgrade, including pre-activation rollback

Change-Id: Ica5821b620453678861656db4efe6d7382bccadb
Signed-off-by: Jim Gauld <James.Gauld@windriver.com>
2024-11-22 08:38:58 -05:00
Zuul
005359b81a Merge "Add instrumentation log for kubeadm join command" 2024-11-21 21:14:04 +00:00
Jim Gauld
b1ec6538c6 Add instrumentation log for kubeadm join command
This uses kube_command helper for logging instrumentation
of 'kubeadm join' command. This is useful in cases when the
join command fails or hits timeout. In the case of timeout,
we currently get no indication of progress or the actual failure.

The platform::kubernetes::kube_command helper function is
updated to have a new optional parameter 'unless', and the
'environment' parameter is modified to pass an array instead
of a string, with an empty array as the default.

Partial-Bug: 2084714

TEST CASES:
PASS: AIO-SX, AIO-DX, Standard, DC: Fresh install ISO.
      Verify we get file output logs in /var/log/puppet/<dir>/
      for kubeadm-join-command.log with verbose output.
PASS: AIO-DX: K8S Orchestrated upgrade

Change-Id: Id88d07a62d9bd8785227213d5e1b49fca5260084
Signed-off-by: Jim Gauld <James.Gauld@windriver.com>
2024-11-20 19:04:40 -05:00
Zuul
89df24b7e9 Merge "Show detailed grub update logs" 2024-11-20 21:59:11 +00:00
Kyale, Eliud
e01bfe5fed Show detailed grub update logs
Show detailed logs that indicate which kernel
arguments have been updated, in order to assist in
determining the reboot cause. Kernel argument changes
require a reboot, which affects performance and timing.

Test plan:

PASS - AIO-SX: iso install
PASS - AIO-SX: manually edit kernel parameters and trigger puppet audit
               observe logs and reboot

Closes-Bug : 2089028

Change-Id: I721cadf3dfb725bf3722eacca7a039cf3c4e31d1
Signed-off-by: Kyale, Eliud <Eliud.Kyale@windriver.com>
2024-11-20 02:21:41 -05:00
Zuul
67361c5a53 Merge "Tune postgresql memory and I/O settings for system controllers" 2024-11-19 16:58:45 +00:00
Jim Gauld
f5b83fe391 Configure toprc with additional fields P, NU, CGNAME
This configures top using toprc configuration file for
sysadmin and root. This enables fields: P, NU, CGNAME,
and shows full command arguments by default.

This improves the System Engineering debuggability of the
system since we can more easily see where a task is running.
We see what logical cpu a task is currently running with
'P' last cpu used and 'NU' numa node, and see what systemd
cgroup name with 'CGNAME'. This also helps diagnose tasks
running in pods since they belong to well named cgroups.

Partial-Bug: 2084714

TEST PLAN:
- PASS: Fresh install AIO-SX, AIO-DX, Standard, Storage, DC
- PASS: run 'top' as sysadmin and root user, verify
  we see 'P', 'NU', 'CGNAME' and command arguments.

Change-Id: I50d8f25336a980bcdcba4d94ea727fee9726a527
Signed-off-by: Jim Gauld <James.Gauld@windriver.com>
2024-11-15 01:31:39 -05:00
Gustavo Herzmann
d405647ecb Tune postgresql memory and I/O settings for system controllers
Reduce work_mem from 512MB to 32MB to better handle increased
connection counts in scale environments. The previous value was a
legacy setting from when Ceilometer's complex queries required higher
memory allocation.

Increase shared_buffers from 80MB to 256MB to improve database caching
performance for large queries.

Tune random_page_cost and effective_io_concurrency parameters to
optimize I/O performance for solid state drives.

Test Plan:
01. PASS - Build an ISO with the commit and install a system controller
    with it, verify that the install completes successfully and that
    PostgreSQL starts successfully with the new config.
02. PASS - Run basic load test to confirm there's no performance
    degradation.
03. PASS - Monitor resource-intensive Distributed Cloud operations in a
    scale environment (e.g. dcmanager-audit) and verify they complete
    successfully.
04. PASS - Check logs for any warnings, errors or slow queries.

Story: 2011106
Task: 51246

Change-Id: Idb43369b3e11590a50226cfaa0c903a091586de2
Signed-off-by: Gustavo Herzmann <gustavo.herzmann@windriver.com>
2024-11-13 18:15:00 -03:00
Zuul
3e0d07fc35 Merge "Fix .ceph_started flag creation" vf/caracal 2024-11-06 19:10:11 +00:00
Zuul
da3716f602 Merge "Select devices to disable APM" 2024-11-05 19:37:24 +00:00
Zuul
900256b0f2 Merge "Increase HAProxy USM timeout for "slow" Requests" 2024-11-05 19:37:18 +00:00
Fabiano Correa Mercer
ba5bb84892 Increase HAProxy USM timeout for "slow" Requests
The software upload command currently fails due to a timeout while
awaiting the HTTP response.
This issue commonly occurs when uploading larger patches, such as
a 1GB file.
For example:
software upload test.patch

This timeout issue may occur on the client-side request, which is
addressed in the following fix:
https://review.opendev.org/c/starlingx/update/+/934084

Additionally, an adjustment is needed on the HAProxy side to increase
the timeout for "slow" USM requests (PUT,POST,DEL + precheck requests)

Test Done:

PASSED software upload on AIO-SX with a 1GB patch file with timeout 1800s
PASSED software upload on AIO-SX with a 1GB patch file with timeout 300s
       Edited the file: /etc/haproxy/haproxy.cfg
       Changed the "timeout server" setting to 300 seconds for the
        backend: alt-usm-restapi-internal
       Restarted the HAProxy service:
         sudo sm-restart service haproxy
       Apply the patch:
         software upload large.patch
       Confirmed the command failed due to HAProxy timeout,
       with the error:
           Gateway Timeout

Depends-On: https://review.opendev.org/c/starlingx/update/+/934084

Story: 2010676
Task: 51265

Change-Id: I3a520455ceb150d753262d1542e5c4e608acec99
Signed-off-by: Fabiano Correa Mercer <fabiano.correamercer@windriver.com>
2024-11-05 16:06:27 -03:00
Hediberto C Silva
d8463f6031 Select devices to disable APM
This commit is a part of the solution to mitigate a known issue that
the Advanced Power Management (APM) disk settings impacted read
performance. These settings are dynamically set based on the enabled
StarlingX tuned service profiles.

On some specific hardware configurations (for example, PowerEdge XR11
with an integrated storage controller), degraded read performance was
observed where the Tuned Disk Monitor didn't detect high usage,
maintaining a limited and low APM level (default 20).
For write operations, a delay of about 60 seconds was noticed before
the highest disk performance was achieved.

Each unlocking will ensure the APM is disabled, but it can still be set
manually at runtime using: "sudo hdparm -B <apm_level> /dev/sda".

To ensure it is disabled for all devices, we need to provide the names,
as Tuned retrieves all devices from /sys/block and attempts to apply
the apm_level setting to each one. After failing to apply it
three times (for example, with DRBD block devices), Tuned will disable
the set_apm command for the others.

To populate this parameter, the disks' persistent names (by-path
values) from the inventory are needed.

For example:
 devices_udev_regex=(ID_PATH=pci-...-ata-1.0)|(ID_PATH=pci-...-ata-2.0)
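
A minimal sketch of how such a value could be assembled from a list of by-path names. The function name is hypothetical, and so are the example paths in the tests; the output format follows the example above.

```python
def build_devices_udev_regex(id_paths) -> str:
    """Join by-path names into one alternation, one group per disk.

    Hypothetical helper: the real value is generated from the
    inventory by the puppet manifests.
    """
    return "|".join(f"(ID_PATH={path})" for path in id_paths)
```

Because each disk contributes its own group, adding a disk to the inventory only appends one more alternative to the expression.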

Test Plan:
  PASS: All packages built successfully
  PASS: Fresh Install SX/DX/STD in virtual environments
  PASS: After unlocking, verify that APM and Tuned Disk Monitor
        are disabled
  PASS: After unlocking, verify that /etc/tuned/starlingx/tuned.conf
        is populated with the selected devices
  PASS: All previous tests using XR11 lab
  PASS: After the initial unlock, the virtual host is locked,
        powered off, a disk is added, powered on, and after a new
        unlock, the new disk is added to devices_udev_regex.

Partial-bug: 2086509

Depends-on: https://review.opendev.org/c/starlingx/config-files/+/933897

Change-Id: I7c71ecb05c5a406283af6da7af9bef08df0ded66
Signed-off-by: Hediberto C Silva <hediberto.cavalcantedasilva@windriver.com>
2024-11-04 17:08:25 -03:00
Victor Romano
9240927b58 Adjust certmon parameters for scalability
To reduce the time it takes to audit subclouds by certmon, the
number of subclouds that can be audited in parallel was increased
from 4 to 20 and the timeout to check if it's possible to establish
a connection was reduced from 10 to 5.
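
The effect of the two parameters can be sketched with a bounded worker pool and a short connection timeout. This is not certmon's code: all names are hypothetical, and treating the timeout as seconds is an assumption, since the message gives no unit.

```python
import socket
from concurrent.futures import ThreadPoolExecutor

MAX_PARALLEL_AUDITS = 20  # was 4 before this change
CONNECT_TIMEOUT = 5       # was 10 before this change; seconds assumed


def can_connect(host, port, timeout=CONNECT_TIMEOUT):
    """Return True if a TCP connection opens within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def audit_all(endpoints, check=can_connect, workers=MAX_PARALLEL_AUDITS):
    """Check a list of (host, port) endpoints, at most `workers` at a time."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(lambda endpoint: check(*endpoint), endpoints)
        return dict(zip(endpoints, results))
```

A larger pool shortens the total audit time, and a shorter timeout limits how long each unreachable subcloud occupies a worker.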

Test plan:
  - PASS: Lock/unlock a system controller with these changes and
    verify the config file was correctly updated and certmon is
    auditing subclouds successfully

Partial-bug: 2085540

Change-Id: I01be5a7b50598e6ba97878e71eb84f1472673deb
Signed-off-by: Victor Romano <victor.gluzromano@windriver.com>
2024-10-29 10:58:03 -03:00
Zuul
9bc64ca31c Merge "pxeboot should be provisioned after BnR" 2024-10-24 19:03:01 +00:00
Fabiano Correa Mercer
48d396dfdc pxeboot should be provisioned after BnR
The pxeboot-ip service was not provisioned after an AIO-SX BnR in
R8.0, even though pxeboot ip ( 169.254.202.1 ) was installed.
This issue does not occur in R9.0.
However, the SM.pp can be simplified to ensure provisioning
in all cases.

Tests done:
AIO-SX fresh install
AIO-DX fresh install
AIO-DX host-swact
AIO-SX BnR
AIO-DX BnR

Closes-Bug: 2085537

Change-Id: I4143a23e75e8e17444364cf6c707722e9e494fd3
Signed-off-by: Fabiano Correa Mercer <fabiano.correamercer@windriver.com>
2024-10-24 12:56:44 +00:00
Zuul
4587bc0fd8 Merge "Disable Kubernetes application audit via config option." 2024-10-22 12:41:22 +00:00
Zuul
372d7509d9 Merge "Fix Puppet NFV Cinder version configuration" 2024-10-21 18:08:02 +00:00
marantes
b48469793e Fix Puppet NFV Cinder version configuration
In addition to the information shared at [1], the default Cinder
version configuration has been updated to version 3, replacing the
deprecated version 2.

[1] https://review.opendev.org/c/starlingx/config/+/932563

TEST PLAN:

PASS - build-pkgs -c -p puppet-nfv
PASS - build-image
PASS - AIO-SX fresh install
PASS - Upload/Apply stx-openstack
PASS - 'openstack endpoint list' showing cinderv3

Depends-On: https://review.opendev.org/c/starlingx/config/+/932563
Partial-Bug: 2084683

Change-Id: I9f7cb76b763df14af767dce8569aea23c711b391
Signed-off-by: marantes <murillo.arantes@windriver.com>
2024-10-21 12:22:52 -03:00
Zuul
22ab44d300 Merge "Start IPsec daemon before SM during reboot" 2024-10-18 10:28:01 +00:00
Kaustubh Dhokte
f33aa6a578 Use systemctl to restart kubelet
This is a temporary change for the puppet class
platform::kubernetes::update_kubelet_config::runtime.
There is a known issue that pmon-restart kubelet does not
actually restart the kubelet after platform upgrades. This is because
of the missing kubelet.conf under /etc/pmon.d/. The issue is tracked
separately. Until that issue is fixed, systemctl kubelet restart
replaces pmon kubelet restart for the runtime kubelet reconfig
functionality.

Test:
PASS: Boot standard (2+2) lab. Run 'system kube-config-kubelet'.
      Kubelets on all four nodes are restarted.
PASS: Boot AIO-SX lab. Run 'system kube-config-kubelet'.
      Kubelet on the controller is restarted.
      Kubernetes cluster is healthy.
PASS: Remove kubelet.conf from /etc/pmon.d to emulate current problem
      Run "system kube-config-kubelet". Kubelet is restarted.
      Kubernetes cluster healthy.

Closes-Bug: 2084622

Change-Id: I6ffffb2fd56682dfc5da34aa4b867190c20f27b2
Signed-off-by: Kaustubh Dhokte <kaustubh.dhokte@windriver.com>
2024-10-17 19:08:35 +00:00
Andy Ning
3844a4c513 Start IPsec daemon before SM during reboot
Currently the IPsec daemon and SM have no ordering dependency in systemd,
so the IPsec daemon usually starts before SM during system boot and
stops before SM during system shutdown. This causes an issue: when
the active controller shuts down, the IPsec daemon stops earlier and all
IPsec connections to the active controller are terminated. Since the
mtcAgent (an SM service) is still running and working together with
hbsAgent to monitor the standby controller, it sends a reboot request to
the standby controller, via the pxeboot network, when loss of
connectivity on the mgmt network is detected by heartbeat.

The expected behaviour is, when the active controller shuts down, the
other controller should become active without rebooting.

This change fixes the issue by updating the IPsec starter systemd service
unit file so that the IPsec daemon starts before SM and thus stops after
SM (the inverse of the start order).

Test Plan:
PASS: Multiple nodes system (such as DX + 1 worker node) deployment,
      verify deployment is successful with all nodes in
      unlocked/enabled/available state.
PASS: Shut down active controller, verify the other controller becomes
      active without rebooting.
PASS: Power on the shut down controller, after it boots up, verify all
      nodes are in unlocked/enabled/available state, and there are no
      alarms.

Story: 2010940
Task: 51182

Change-Id: I94166fae927a98b9caaac163bd399533bfe52719
Signed-off-by: Andy Ning <andy.ning@windriver.com>
2024-10-16 12:02:11 -04:00
Edson Dias
cc4ee0f8bd Disable Kubernetes application audit via config option.
To facilitate application debugging in scenarios where the
audit task is not convenient, this commit adds a
configuration option, skip_k8s_application_audit, to the
application framework section of the sysinv.conf file,
making it possible to turn the audit task on or off.
If this option is enabled, the system skips the audit
task. The default value of skip_k8s_application_audit
is False.

Test Plan:
PASS: build-pkgs && build-image
PASS: AIO-SX fresh install
PASS: check if _k8s_application_audit is running
PASS: set skip_k8s_application_audit as false in
      sysinv.conf && restart sysinv conductor.
PASS: check if _k8s_application_audit has stopped
      running.

Depends-on: https://review.opendev.org/c/starlingx/config/+/932329

Story: 2011242
Task: 51176

Change-Id: I77708a7d8be4a9c3254a15e13979277d96f20f33
Signed-off-by: Edson Dias <edson.dias@windriver.com>
2024-10-16 09:01:01 -03:00
Steven Webster
d74a25a7e7 Use IP address over FQDN for dcmanager rabbit/db connections
In the past few months, _most_ StarlingX services have moved
from static IP addressing to FQDN resolution, in support of
the management network reconfig feature.

While doing DC scalability testing, a transient domain
resolution (controller.internal) issue was found after
adding approximately 250 subclouds to the system. It involved
the rabbitmq/RPC subsystem.

The error message returned was similar to:

"OSError: failed to resolve broker hostname"

The rabbitmq/amqp library calls a _connect() function,
which in turn calls the Python socket getaddrinfo().

Multiple attempts were made to reproduce the scenario in a
non-scaled lab by stressing the getaddrinfo(), getting
dnsmasq up to ~40 CPU usage, but the same error was not
returned.

Testing was done on the DC scale lab by manually changing the
rabbit and DB config files, and this confirmed that using the static
floating IP (avoiding domain name resolution altogether)
resolved the issue.

It was decided to revert the FQDN aspect of the dcmanager
and dcorch modules for now, as the management network
reconfiguration feature would not even apply to an
AIO-DX system controller at this time.  This may be
re-evaluated in the future at which point a deeper dive
into the rabbit/RPC usage should be considered.

Testing:

- Install an AIO-DX system controller and install a subcloud.
  Ensure the subcloud is managed and online.
- Ensure the dcmanager.conf and dcorch.conf files use an IP
  address in their transport_url and database connection
  parameters.
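
The last check above can be sketched with the standard library. The function name and the sample URLs in the tests are hypothetical; the sketch only decides whether the host part of a connection URL is an IP literal, in which case no name resolution is needed to connect.

```python
import ipaddress
from urllib.parse import urlsplit


def host_is_ip_literal(url: str) -> bool:
    """Return True if the URL's host is an IPv4 or IPv6 literal.

    Assumes a single-host URL of the form scheme://user:pass@host:port/...
    """
    host = urlsplit(url).hostname
    if host is None:
        return False
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False
```

A URL whose host is controller.internal returns False here, which is the case the change moves away from for dcmanager and dcorch.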

Depends-On: https://review.opendev.org/c/starlingx/config/+/932013

Story: 2010722
Task: 48447

Change-Id: Icd067441dd08321936eb03498ff65241fac0010e
2024-10-09 22:24:03 -04:00
Rei Oliveira
97dde7d666 Fix keystone access log for c1
This commit fixes keystone access logging to
/var/log/keystone/keystone.log on controller-1.
The INFO log level is set only during bootstrap, so at the moment it is
applied only to controller-0.

This adds the logic to puppet as well to ensure that controller-1 has
the right settings too.

Test plan:

PASS: Full build, install and bootstrap
PASS: Host-swact to controller-1. Run authenticated commands such as
      'system host-list' and verify that it gets logged to
      /var/log/keystone/keystone.log


Story: 2011106
Task: 51139

Signed-off-by: Rei Oliveira <Reinildes.JoseMateusOliveira@windriver.com>
Change-Id: I2fa902b09474214bafa268a185474a2df6e7aa97
2024-10-08 14:28:06 +00:00
Zuul
0106e1f454 Merge "kubelet not running after unmask and pmon-start" 2024-10-07 15:28:32 +00:00
Edson Dias
31daec3365 kubelet not running after unmask and pmon-start
Start the kubelet service if not running after
executing kubelet unmask and pmon-start during
Kubernetes upgrades.

This also replaces the Kubernetes health check
script used during Kubernetes upgrades in favor
of the new 'sysinv-k8s-health check' which is faster
at evaluating multiple health endpoints and has
enhanced logging.

Test Plan
PASS: Upgrade from previous release
      Upgrade Kubernetes from 1.24 to 1.25
PASS: Upgrade from previous release
      Multi-version Kubernetes upgrade from 1.24
      to 1.29
PASS: Upgrade from previous release
      Pause the kubeadm process to trigger a K8S
      upgrade abort.
      Upgrade Kubernetes from 1.24 to 1.25
      Check if the system successfully aborted
      the upgrade.
      Check kubectl command works without errors.

Closes-bug: 2083635

Co-Authored-By: Boovan Rajendran <boovan.rajendran@windriver.com>
Change-Id: Id735db17a4a398065e82fd392d9b7cfbbc212210
Signed-off-by: Edson Dias <edson.dias@windriver.com>
2024-10-04 17:03:24 -03:00
Zuul
67c78fe160 Merge "Modify QAT default configuration" 2024-10-04 19:59:42 +00:00
Felipe Sanches Zanoni
ba690c5ae3 Fix .ceph_started flag creation
The flag .ceph_started was being created in the ceph.pp manifest.
This flag enables SM and Pmon monitoring, but the Ceph processes might
not be initialized yet. This flag is created in ceph.sh script called
by MTC when enabling the host.

When configuring ceph backend at runtime, the MTC will not call the
ceph.sh script and the processes will not be fully started and the
flag will not be created. To solve this, the ceph.sh script will be
called at the end of puppet manifest runtime apply.

Test-Plan:
  PASS: On all setups, do a fresh install and check the flag is not
        being created during the puppet manifest apply.
        The flag must be created by the MTC.
  PASS: On all setups, do a fresh install without Ceph backend
        configured. Configure Ceph at runtime and check if the flag is
        created after Ceph is configured.

Partial-bug: https://bugs.launchpad.net/starlingx/+bug/2083056
Depends-on: https://review.opendev.org/c/starlingx/integ/+/930514

Change-Id: Ibbd7dbb41f00c2b2354eaa9a5bd8d383a3d63ac8
Signed-off-by: Felipe Sanches Zanoni <Felipe.SanchesZanoni@windriver.com>
2024-10-04 08:22:15 -03:00
Jorge Saffe
fcb9dbde24 Update Postgres Auth and Password Encryption
Currently, the default authentication and password encryption
method for PostgreSQL is 'md5'. However, it is necessary to
update this to a more secure method, such as 'scram-sha-256'.

The proposed solution addresses these updates using the
'puppetlabs-postgresql' Puppet module. Two new parameters
have been added to the hieradata to configure the password
encryption and authentication methods.
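
When inspecting pg_authid.rolpassword, the stored value itself shows which method produced it: md5 entries are the literal prefix "md5" followed by 32 hex digits, and SCRAM entries start with "SCRAM-SHA-256$". A minimal sketch of that check follows; the function name is hypothetical.

```python
def password_hash_method(rolpassword: str) -> str:
    """Classify a pg_authid.rolpassword value by its storage format."""
    if rolpassword.startswith("SCRAM-SHA-256$"):
        return "scram-sha-256"
    # md5 entries are 'md5' plus a 32-character hex digest (35 chars).
    if rolpassword.startswith("md5") and len(rolpassword) == 35:
        return "md5"
    return "unknown"
```

Roles whose passwords were set before the change keep their md5 entries until the password is set again, so this check is useful after upgrades as well as fresh installs.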

Test Plan:
- PASS Fresh Install SX env
   * Verify system status unlock/available

   * Login as admin user in psql
     (psql -U admin -h 127.0.0.1 -d sysinv)
   * Check postgres authorization configuration
     (SELECT * from pg_hba_file_rules;)
   * Check postgres password encryption configuration
     (SELECT rolname, rolpassword
      FROM pg_authid WHERE rolpassword IS NOT NULL;).

- PASS Fresh Install DX env
   * Verify system status unlock/available

   * Login as admin user in psql
     (psql -U admin -h 127.0.0.1 -d sysinv)
   * Check postgres authorization configuration
     (SELECT * from pg_hba_file_rules;)
   * Check postgres password encryption configuration
     (SELECT rolname, rolpassword
      FROM pg_authid WHERE rolpassword IS NOT NULL;).

   * Host swact to controller-1

   * Login as admin user in psql
     (psql -U admin -h 127.0.0.1 -d sysinv)
   * Check postgres authorization configuration
     (SELECT * from pg_hba_file_rules;)
   * Check postgres password encryption configuration
     (SELECT rolname, rolpassword
      FROM pg_authid WHERE rolpassword IS NOT NULL;).

   * collect logs (collect)
   * verify '/var/extra/database/' content

- PASS Fresh Install DC env
   * Verify system status unlock/available
   * Check postgres authorization configuration
     (SELECT * from pg_hba_file_rules;)
   * Check postgres password encryption configuration
     (SELECT rolname, rolpassword
      FROM pg_authid WHERE rolpassword IS NOT NULL;).

- PASS Backup and Restore SX - optimized
   * Verify system status unlock/available
   * Check postgres authorization configuration
     (SELECT * from pg_hba_file_rules;)
   * Check postgres password encryption configuration
     (SELECT rolname, rolpassword
      FROM pg_authid WHERE rolpassword IS NOT NULL;).

- PASS Upgrade SX
- PASS Upgrade SX-rollback
- PASS Upgrade DX
- PASS Upgrade DX-rollback

Closes-bug: 2069842
Depends-On: https://review.opendev.org/c/starlingx/integ/+/930638

Change-Id: I0e93ff924e5448454d7cb6ae356f074befa3dc33
Signed-off-by: Jorge Saffe <jorge.saffe@windriver.com>
2024-10-03 20:55:31 +00:00
Zuul
a2a91b2537 Merge "Remove the database admin role for postgres" 2024-10-02 16:50:28 +00:00
Ramesh Kumar Sivanandam
72e1e0c84d AIO-DX: Add super-admin.conf file on standby controller for K8s v1.29
A fresh installation of Kubernetes v1.29, as well as an upgrade from
v1.28 to v1.29 in a duplex system, creates the super-admin.conf file
on the active controller but not on the standby controller.

The super-admin.conf file should exist on both controller nodes for
redundancy purposes.

This change ensures that the super-admin.conf file is generated on
the standby controller during both a fresh install of K8s v1.29 and
the upgrade from v1.28 to v1.29.

Test Plan:
PASS: Install ISO with K8s 1.29 on AIO-DX and verify that the
      super-admin.conf is present on both controllers.
PASS: Install ISO with K8s 1.28 on AIO-DX, upgrade to 1.29 and verify
      that the super-admin.conf is present on both controllers.
PASS: Install ISO with K8s 1.28 on AIO-DX, set the controller-1 as
      active, upgrade to 1.29 and verify that the super-admin.conf
      is present on both controllers.
PASS: Verify that "sudo kubeadm certs check-expiration" command
      outputs the super-admin-conf details on both controllers.

Closes-Bug: 2081769

Change-Id: I58b6a995b37b70e8b3350311ca3c89e4a008f8b7
Signed-off-by: Ramesh Kumar Sivanandam <rameshkumar.sivanandam@windriver.com>
2024-09-26 15:25:40 -04:00
Jorge Saffe
84eaab9ab7 Remove the database admin role for postgres
PostgreSQL object-relational database admin role
is no longer needed. It can be removed safely.

Test Plan:
- PASS Fresh Install SX env
   * Verify system status unlock/available

- PASS Fresh Install DX env
   * Verify system status unlock/available

- PASS Upgrade SX
- PASS Upgrade SX-rollback
- PASS Upgrade DX
- PASS Upgrade DX-rollback

- PASS Fresh Install DC env
   * Verify system status unlock/available

Partial-bug: 2080971

Change-Id: I0c57f9bcab90ae0f987b828806c0b02e1200c2ca
Signed-off-by: Jorge Saffe <jorge.saffe@windriver.com>
2024-09-26 20:29:21 +02:00
Md Irshad Sheikh
da173de3b7 Modify QAT default configuration
This commit is to change default "ServicesEnabled" configuration
from asym;dc to sym;dc in the PF and VF configuration template files.
With the asym;dc configuration symmetric crypto is disabled,
so crypto-perf test failed and only the compression-perf test passed.
The crypto-perf test requires the symmetric crypto (sym) service
to be enabled, while the compression-perf test requires the dc service
to be enabled.

For testing details, please refer to the following link:
https://github.com/intel/intel-device-plugins-for-kubernetes/blob/release-0.30/cmd/qat_plugin/README.md#demos-and-testing

Test Plan:
PASSED: build-pkgs & build-image
PASSED: Check "systemctl status qat_service.service"
        Service should be up and running.
PASSED: Check the "systemctl is-enabled qat_service.service".
        Service should be enabled.
PASSED: Check the "/etc/init.d/qat_service status".
        The number of QAT VF endpoints should match the QAT
        supported sriov numvfs, i.e. 16.
PASSED: Check the number of PF and VF config files
        (Eg: 4xxx_dev0,4xxxvf_dev0.conf) in /etc directory. It
        should match the total QAT PFs and number of sriov numvfs.
        It should also have "ServicesEnabled = sym;dc".
PASSED: App apply after enabling the QAT plugin chart.
        After apply, QAT pod should be running.
PASSED: Check the description of the node after applying the app
        using command "kubectl describe node controller-0". It
        shows the Capacity:  qat.intel.com/sym-dc:32 and
        Allocatable:  qat.intel.com/sym-dc:  32
PASSED: Crypto-perf test
PASSED: Compression-perf test

Story: 2010604
Task: 51052

Change-Id: Ie3d8da6f7d2fb06e0c90b0ba3f93652482cc1277
Signed-off-by: Md Irshad Sheikh <mdirshad.sheikh@windriver.com>
2024-09-26 11:49:29 -04:00
Zuul
deb5736553 Merge "Fix Ceph being unresponsive on AIO-DX standalone controller" 2024-09-24 14:02:16 +00:00
Zuul
7d7b6f3664 Merge "Revert "Update Postgres Auth and Password Encryption"" 2024-09-18 18:58:56 +00:00
Jorge Saffe
db10c755a2 Revert "Update Postgres Auth and Password Encryption"
This reverts commit 2594c9a860258280a74e7f7248942cd0f9814c23.

Reason for revert: Changes are affecting DC installation/bootstrapping

Change-Id: Iface6ecb222f703219d05b04d7798d8f336b502a
2024-09-18 18:44:50 +00:00
Zuul
52cd886904 Merge "Update swanctl AppArmor profile for LUKS fs access" 2024-09-16 16:42:33 +00:00
Zuul
14255078d2 Merge "Prevent mtce runtime manifest from sighup'ing processes not running" 2024-09-16 15:48:19 +00:00
Zuul
145cb9744e Merge "Update Postgres Auth and Password Encryption" 2024-09-16 15:39:28 +00:00