The grub command-line parameters contain nohz_full, which needs
to be set to something other than a CPU list for kernel
isolation when the user decides to disable it by assigning the
disable-nohz-full label. In this manner, the disabled string can be
used to signal the kernel to restore this configuration to default.
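A minimal sketch of the idea, not the actual sysinv or puppet code: the
helper name apply_nohz_full and the way the "disabled" string is consumed
are assumptions made for illustration only.

```python
def apply_nohz_full(cmdline, value):
    """Return a kernel command line with nohz_full set or removed.

    Hypothetical helper: 'value' is either a CPU list such as "2-5" or the
    literal string "disabled", which is treated here as a request to drop
    the parameter so the kernel falls back to its default.
    """
    # Drop any existing nohz_full argument first.
    args = [a for a in cmdline.split() if not a.startswith("nohz_full=")]
    if value != "disabled":
        args.append("nohz_full=" + value)
    return " ".join(args)
```

With this sketch, passing "disabled" leaves no nohz_full entry on the
command line, which matches the first verification step in the test plan.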
Closes-Bug: 1981762
Test plan:
PASS: Disable nohz_full configuration in a standard worker by enabling
disable-nohz-full label.
PASS: Verify that /etc/default/grub doesn't contain the nohz_full
parameter.
PASS: Verify that /sys/devices/system/cpu/nohz_full was restored to
default.
PASS: Enable nohz_full configuration in a standard worker by removing
disable-nohz-full label.
PASS: Verify that /sys/devices/system/cpu/nohz_full has the cpulist.
PASS: Remove label and verify alarms and system overall health.
PASS: Add label and verify alarms and system overall health.
Signed-off-by: Iago Estrela <IagoFilipe.EstrelaBarros@windriver.com>
Change-Id: I00b68faae2612088b8e2fe7aeb3900babe638ca0
Correct update ca certs command for Debian.
Test Plan:
Verify: Bootstrap and adding a Subcloud on Debian
Verify: Bootstrap and adding a Subcloud on CentOS
Story: 2010119
Task: 45763
Signed-off-by: Li Zhu <li.zhu@windriver.com>
Change-Id: I4a9d2758ce012557fad4a19b49aa9b5bfe4f1680
When running "system host-device-show controller-0 0000:b2:00.0",
all firmware info is shown as "None".
It turns out that some services, such as "sysinv-fpga-agent" and
"sysinv-conf-watcher", are not initialized properly during initial boot.
Test Plan:
PASS: Modify the rules script that is used to start systemd unit files
so that it is compatible with the Debian command rules, for example by
specifying the name of the service with the "--name" option:
"--name=sysinv-fpga-agent". Then execute
"ls /usr/lib/systemd/system/sysinv* -l" after initial boot.
--> Command Response:
-rw-r--r-- 3 root root 81 Jan 1 1970 /usr/lib/systemd/system/sysinv-conf-watcher.path
-rw-r--r-- 3 root root 219 Jan 1 1970 /usr/lib/systemd/system/sysinv-conf-watcher.service
-rw-r--r-- 3 root root 389 Jan 1 1970 /usr/lib/systemd/system/sysinv-fpga-agent.service
Closes-Bug: 1981824
Signed-off-by: Mohammad Issa <mohammad.issa@windriver.com>
Change-Id: I1491e9607688738fedec188c872a815b074e8dfc
For DELETE requests with no request body or "Content-Length" field, the
expected return value for "hasattr(rest_state.request, 'json')" would be
False and for "rest_state.request.json" would be an empty string.
Instead, what occurs is that both "hasattr(rest_state.request, 'json')"
and "rest_state.request.json" are throwing JSON decode exceptions in
this case. This change adds tolerance for this exception, because not
all of the request conditions that may cause it to be thrown are
known.
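The behaviour described above follows from how Python 3 treats
properties: hasattr() only swallows AttributeError, so a property that
raises a JSON decode error propagates it. A minimal sketch of the
tolerance, using a hypothetical FakeRequest stand-in rather than the real
request class and a hypothetical helper name:

```python
import json


class FakeRequest:
    """Hypothetical stand-in whose .json property parses the body lazily."""

    def __init__(self, body):
        self.body = body

    @property
    def json(self):
        return json.loads(self.body)


def safe_request_json(request):
    """Return the parsed JSON body, or None if it is absent or unparsable."""
    try:
        return request.json
    except AttributeError:
        # The request object has no json attribute at all.
        return None
    except ValueError:
        # json.JSONDecodeError is a subclass of ValueError.
        return None
```

A caller can then treat None as "no body" instead of relying on hasattr().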
Test Plan:
PASS: Successfully deploy an AIO-SX using a Debian image with this
commit present and, using horizon interface, create and delete a data
network, checking that in log file "/var/log/sysinv-api.log" no
exception was thrown in the delete request.
Closes-Bug: 1980842
Signed-off-by: Joao Victor Portal <Joao.VictorPortal@windriver.com>
Change-Id: Ib9f72f3b2d4d60790fa8011674a013bcd141a61e
device-image-upload with --bmc or --retimer-included fails with the
error: "Expecting value: line 1 column 1 (char 0)"
Note: device-image-upload without --bmc or --retimer-included works.
The values for 'bmc' and 'retimer_included' (e.g. True or False)
are passed in as strings instead of booleans.
The unit test case passes the parameters in as strings; if the
strtobool conversion works, it will not catch an error.
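A minimal sketch of the string-to-boolean normalization, assuming the
upload parameters arrive as a dictionary. The helper names are
hypothetical, and the conversion is written out by hand here rather than
relying on distutils' strtobool, which is not available in newer Python
releases.

```python
TRUE_STRINGS = ("y", "yes", "t", "true", "on", "1")
FALSE_STRINGS = ("n", "no", "f", "false", "off", "0")


def to_bool(value):
    """Convert a truthy/falsy string (or a bool) to a real bool."""
    if isinstance(value, bool):
        return value
    text = str(value).strip().lower()
    if text in TRUE_STRINGS:
        return True
    if text in FALSE_STRINGS:
        return False
    raise ValueError("invalid truth value %r" % (value,))


def normalize_flags(data):
    """Return a copy of data with 'bmc' and 'retimer_included' as bools."""
    result = dict(data)
    for key in ("bmc", "retimer_included"):
        if key in result:
            result[key] = to_bool(result[key])
    return result
```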
Test Plan:
PASS: Intercept the data Dictionary and override values corresponding
to their keys with the correct values.
Then use this command to test:
"system device-image-upload VistaCreekBravoBMCFW_Release_WW16.1.bin
functional 8086 0b30 --bitstream-id 32 --bmc true
--retimer-included true"
--> Command Response:
+------------------+--------------------------------------+
| Property         | Value                                |
+------------------+--------------------------------------+
| uuid             | d54038e6-4174-48e0-943e-a7bdfa1caf07 |
| bitstream_type   | functional                           |
| pci_vendor       | 8086                                 |
| pci_device       | 0b30                                 |
| bitstream_id     | 35                                   |
| key_signature    | None                                 |
| revoke_key_id    | None                                 |
| name             | None                                 |
| description      | None                                 |
| image_version    | None                                 |
| applied          | False                                |
| applied_labels   |                                      |
| bmc              | True                                 |
| retimer_included | True                                 |
+------------------+--------------------------------------+
Closes-Bug: 1980405
Signed-off-by: Mohammad Issa <mohammad.issa@windriver.com>
Change-Id: I79ea7661b1bc383df5daa6f4f93efd4df0e7bb2b
Raise each exception from the _upload_file() method and propagate
it to the client-side error messages. This change doesn't add or remove
any behavior. It only updates the error message format in the log.
Closes-Bug: 1979689
Test Plan:
PASS: Patch local duplex system and trigger one exception from _upload_file()
method to verify the error message is propagated
PASS: Run 'system load-import bootimage.iso bootimage.sig' successfully and
verify in 'system load-list' with sufficient /scratch diskspace
Signed-off-by: Junfeng (Shawn) Li <junfeng.li@windriver.com>
Change-Id: I4cbf0ce3f0f9036e41d65d7c19a84d92f768ae32
Currently cgtsclient ignores the "-k/--insecure", "--ca-file",
"--cert-file" and "--key-file" options. In order for commands
such as "system host-list" to work over HTTPS, the OS_CACERT env
variable has to be set.
This change updates cgtsclient to accept and properly handle
the ignored options.
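A minimal sketch of how the four options could be parsed and turned into
TLS settings. This is not the cgtsclient implementation: the function
names are hypothetical, and the verify/cert pair simply mirrors the
convention used by the requests library.

```python
import argparse


def build_parser():
    """Build a parser that accepts the four TLS-related options."""
    parser = argparse.ArgumentParser(prog="system")
    parser.add_argument("-k", "--insecure", action="store_true")
    parser.add_argument("--ca-file")
    parser.add_argument("--cert-file")
    parser.add_argument("--key-file")
    return parser


def tls_settings(args):
    """Map parsed options to a verify/cert pair (assumed convention)."""
    # --insecure disables verification; otherwise use the CA file if given.
    verify = False if args.insecure else (args.ca_file or True)
    cert = None
    if args.cert_file and args.key_file:
        cert = (args.cert_file, args.key_file)
    elif args.cert_file:
        cert = args.cert_file
    return {"verify": verify, "cert": cert}
```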
Test Plan:
PASS: remote cli docker image build
PASS: from remote cli environment, successfully run the
"system host-list" commands with the 4 options over
HTTPS.
Closes-Bug: 1980417
Signed-off-by: Andy Ning <andy.ning@windriver.com>
Change-Id: Iae03ac60188157cb726e6e12ba2209eff6b7e1e1
When configuring a vlan interface, it is possible to create the
interface without entering a vlan id. In that case the vlan id is set to
None. If this interface is assigned to oam and the host is unlocked, a
problem occurs: after unlock, the device name is set to vlanNone and the
oam interface cannot come up normally.
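A minimal sketch of the kind of check that rejects a vlan interface
without a vlan id. The function name is hypothetical, and the 1-4094
range is the standard 802.1Q range, assumed here rather than taken from
the sysinv code.

```python
def validate_vlan_id(iftype, vlan_id):
    """Raise ValueError if a vlan interface has no usable vlan id."""
    if iftype != "vlan":
        return
    if vlan_id is None:
        raise ValueError("A vlan interface requires a vlan id.")
    if not 1 <= int(vlan_id) <= 4094:
        raise ValueError("vlan id must be between 1 and 4094.")
```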
Closes-Bug: 1979253
Test Plan:
PASS: Create vlan interface with vlan id and succeed in creation
PASS: Create vlan interface without vlan id and fail in creation
Signed-off-by: ohjoohyun <oh011798@gmail.com>
Change-Id: Ifd5527699cb8e8f874a54292f40656fc890f0a1a
Code that retrieves the registry credentials does not work properly
with Python3. This commit fixes that.
Test Plan:
- Verify successful bootstrap for both CentOS and Debian
using authenticated registry
Partial-bug: 1980391
Change-Id: I71cac14d8bdd63501fc804086cb8af429135bd92
Signed-off-by: Jerry Sun <jerry.sun@windriver.com>
The new version of the keystone client we are using in Debian does not
support instantiating a keystone_client without a session object.
This change uses the existing self._get_keystone_session function
to create the session object and uses that session object to
authenticate with the keystone_client. There are also minor
refactorings to avoid using the keystone client at bootstrap time.
Test plan on both centos and debian:
PASS: Run 'openstack user password set' to change the password and
verified that /opt/platform/puppet/22.06/hieradata/
secure_system.yaml gets updated with the new password, keyring
gets updated with new password and /var/log/sm-customer.log
shows that the vim service gets restarted
PASS: Verify that no keystone error messages are shown in sysinv.log
during bootstrap and that the keystone listeners are started.
PASS: When keystone admin user password changes, observe the
corresponding entity in keyring is updated, and vim is restarted
(by openstack::keystone::password::runtime)
Closes-Bug: 1979995
Signed-off-by: Rei Oliveira <Reinildes.JoseMateusOliveira@windriver.com>
Change-Id: Id4d4bf8072a853d4d3e016e9afa2cd984ee85694
The sysinv WSGI server crashes after receiving 250 parallel requests
on the route creation endpoint because of the limited number of threads
available to handle requests.
Closes-Bug: 1974194
Test plan:
PASS: Hit route creation endpoint 250 times in parallel and verify
WSGI Server didn't restart.
Signed-off-by: Iago Estrela <IagoFilipe.EstrelaBarros@windriver.com>
Change-Id: I89012d7f8c7693cd3dc078d9f67ddffb4308e254
If a system application-update is triggered that updates an armada app
to a fluxcd app (preceded by a helm release migration) and the
update fails, the application framework will try to perform a recover.
The recover will fail because fluxcd uses helm3 and armada uses helm2.
This will create resources in both helm2 and helm3, leaving the app
in an inconsistent state.
To prevent that from happening, recover is skipped if to_app and
from_app use different chart managers.
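A minimal sketch of the skip condition, assuming each app object exposes
its chart manager as a simple string. The App class and its
chart_manager attribute are hypothetical names used for illustration.

```python
from collections import namedtuple

# Hypothetical minimal app record; the real app objects carry far more.
App = namedtuple("App", ["name", "chart_manager"])


def should_recover(from_app, to_app):
    """Recover only when both app versions use the same chart manager."""
    return from_app.chart_manager == to_app.chart_manager
```

For an armada to fluxcd update, should_recover() returns False and the
recover step would be skipped.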
TEST PLAN:
PASS: recover skipped after update from armada to fluxcd without
migrating helmrelease
Closes-bug: 1980242
Signed-off-by: Lucas Cavalcante <lucasmedeiros.cavalcante@windriver.com>
Change-Id: I9061b75f443730e973b79cc93e955069951113ff
The ttys_dcd flag is currently being configured by the sysinv agent audit.
The API updates the database, but the flag is not configured instantly;
it must wait for an audit iteration. This change modifies this behavior
by running a puppet runtime manifest [1] from the API.
[1] https://review.opendev.org/c/starlingx/stx-puppet/+/845174
Closes-Bug: 1978009
Depends-On: https://review.opendev.org/c/starlingx/stx-puppet/+/845174
Test plan:
PASS: Bootstrap AIO-SX, host unlock and verify system alarms.
PASS: Set the flag via host-update and check if the serial console
was really configured.
PASS: Unset the flag and also verify serial console.
Signed-off-by: Iago Estrela <IagoFilipe.EstrelaBarros@windriver.com>
Change-Id: Ia3ef2a5e96905c3cd770601d1e78af368bf54a95
Align the license statements with other files in this repo. Remove the
proprietary statements which are inconsistent with the use of Apache-2.
Include the SPDX-License-Identifier.
Closes-Bug: 1979242
Change-Id: Ic20b9e896198dee37ccf12ef993e050ff9f53fc2
Signed-off-by: Michel Thebeau <Michel.Thebeau@windriver.com>
On Debian we observe that bootstrap fails intermittently.
From the data so far, this is observed only on VMs.
The error could not be reproduced locally, but the investigation
indicates that it is just a performance degradation.
Investigation notes showing that it is a performance degradation and
not a service crash are uploaded to the LP.
Temporarily increase the retries for waiting for sysinv-api to come up.
Jump from 15 seconds to 60 seconds to be defensive.
This will allow sanities to pass and integration effort to continue.
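A minimal sketch of a bounded wait loop of this kind, written in Python
purely for illustration. The actual wait is part of the bootstrap
tooling, and the function name, the one-second delay and the injected
sleep function are all assumptions.

```python
import time


def wait_for(check, retries=60, delay=1.0, sleep=time.sleep):
    """Poll check() up to 'retries' times, sleeping 'delay' between tries.

    Returns True as soon as check() succeeds, False if every attempt fails.
    """
    for _ in range(retries):
        if check():
            return True
        sleep(delay)
    return False
```

With a one-second delay, raising retries from 15 to 60 extends the
maximum wait from roughly 15 seconds to roughly 60 seconds.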
Tests on AIO-SX on Debian:
PASS: bootstrap
Partial-Bug: 1979717
Signed-off-by: Dan Voiculeasa <dan.voiculeasa@windriver.com>
Change-Id: I10b00eab467303771d25cc5c79760005c3966446
Details: 1. Added a condition check to prevent a host from being added to an AIO-SX system.
2. Updated host delete() to allow a host to be deleted on AIO-SX if it is not controller-0.
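The two checks can be sketched as follows. The function names and the
is_aio_sx flag are hypothetical, and the sketch assumes that deleting
controller-0 on AIO-SX remains rejected.

```python
def check_host_add(is_aio_sx):
    """Reject adding any host to an AIO-SX system."""
    if is_aio_sx:
        raise ValueError("Adding a host to an AIO-SX system is not allowed.")


def check_host_delete(is_aio_sx, hostname):
    """On AIO-SX, allow deletion of any host except controller-0."""
    if is_aio_sx and hostname == "controller-0":
        raise ValueError("controller-0 cannot be deleted on AIO-SX.")
```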
Closes-Bug: 1978134
Test Plan:
PASS: Unit test cases against non controller-0 AIO-SX host deletion
PASS: Patched local simplex system and ran reproduced steps
PASS: Migration from SX to DX
Signed-off-by: Junfeng Li <junfeng.li@windriver.com>
Change-Id: I3510bc43f13869ec76673ec50a879b463ee760f4
This change adds a method to apply 'update-k8s-feature-gates'
puppet class on both controllers during platform upgrade activate
phase.
Depends-On: https://review.opendev.org/c/starlingx/stx-puppet/+/845651
Test Plan: (CentOS)
On AIO-SX and AIO-DX:
PASS: Full platform upgrade successful
PASS: Check logs if puppet manifest is successfully
applied on both the controllers.
Story: 2009789
Task: 44765
Signed-off-by: Kaustubh Dhokte <kaustubh.dhokte@windriver.com>
Change-Id: I9cb494a72b7ad62476378f3512cb55c94596eb1e
This commit updates the per-mode version of Pod Security Admission
labels to "latest" for application namespaces such as cert-manager.
Pod Security Admission labels on namespaces are needed for pod
security admission controller to know how restrictive each
namespace is.
Pinning to a specific Kubernetes version, for example v1.23, allows
the behavior to remain consistent as policy changes happen over
Kubernetes releases. Keeping the version "latest" as the default
allows more flexibility when supporting multiple Kubernetes
versions.
This commit also updates the application namespaces' default label
levels from "baseline" to "privileged". This will cause no harm
if users do not wish to use the "beta" PSA feature that is enabled by
default in Kubernetes v1.23+.
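A minimal sketch of how the per-mode labels could be assembled for a
namespace. The label keys are the standard Pod Security Admission ones;
the function name and its defaults are assumptions for illustration.

```python
PSA_MODES = ("enforce", "audit", "warn")


def psa_labels(level="privileged", version="latest"):
    """Build the Pod Security Admission labels for one namespace."""
    labels = {}
    for mode in PSA_MODES:
        # Level label, e.g. pod-security.kubernetes.io/enforce
        labels["pod-security.kubernetes.io/%s" % mode] = level
        # Per-mode version label, e.g. .../enforce-version
        labels["pod-security.kubernetes.io/%s-version" % mode] = version
    return labels
```

Pinning instead of tracking "latest" would be psa_labels(version="v1.23").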
Test Plan:
PASS: In an installed system verify that the pod security admission
labels of the cert-manager namespace have been updated with the
per-mode version "latest".
PASS: Created namespaces where policies are applied via labels.
Privileged pods fail to get created in namespaces that are not
configured with privileged policy level.
PASS: Privileged pods get created in namespaces with no security
policy labels.
Story: 2009833
Task: 45632
Signed-off-by: Carmen Rata <carmen.rata@windriver.com>
Change-Id: I76d44873ac447bbc0e2d90643fedf38bef8ebd1a
These scripts perform helm override check, backup of helm overrides,
conversion of helm overrides and upgrade of the oidc-auth-apps
application.
The backup helm overrides and conversion files will be in named
sub-directories within /opt/oidc-auth-apps. The sub-directory names
reflect the procedure that creates them, but are otherwise arbitrary.
The configuration check ensures the end-user's helm-overrides can be
converted from old Dex to new Dex. The conversion rearranges the helm
overrides to fit into the new Dex values.yaml.
The 50-validate-oidc-auth-apps.py script contains the backup, check and
conversion code and will be run both on the 'from' release as a
pre-upgrade check, and on the 'to' release for backup and conversion.
The 70-upgrade-oidc-auth-apps.sh script is run at upgrade activate to
remove the old app and apply the new app with converted user overrides.
This invokes 50-validate-oidc-auth-apps.py on the active controller
during upgrade-activate as well.
Depends-On:
https://review.opendev.org/c/starlingx/oidc-auth-armada-app/+/845380
Test plan:
PASS: tox (python2, python36)
PASS: run upgrade script as postgres user
PASS: sequenced commands: check, backup and convert
PASS: conversion of sample configurations with app verification after
PASS: conversion without user overrides
PASS: backup helm-overrides to /opt/oidc-auth-apps as postgres user
PASS: check and conversion of helm override for documentation examples
PASS: push helm overrides back to database
TODO: upgrade-activate
TODO: end-to-end upgrade SX
In-progress: end-to-end upgrade DX
TODO: python3 Debian
Story: 2009838
Task: 45641
Change-Id: If67ae45826bd9ceae35e50d5536f5054dd3ae9dd
Signed-off-by: Michel Thebeau <Michel.Thebeau@windriver.com>
These upgrade scripts were necessary on the stx 5.0 to 6.0 upgrade
path, however, they are not necessary for the next 7.0 upgrade path.
Test Plan:
PASS: Fresh install on AIO-SX
PASS: AIO-SX Upgrade (6.0 -> 7.0)
PASS: AIO-DX Upgrade (6.0 -> 7.0)
PASS: Checked scripts are not present after upgrade completed
Story: 2009754
Task: 45634
Signed-off-by: Adriano Oliveira <adriano.oliveira@windriver.com>
Change-Id: I675e8916b2f661a65622dd1d1bf0d7aa9e9deafe
For certificates stored as kubernetes tls secrets, the alarm should be
cleared when the secret is deleted.
This changes the audit_for_deleted_certificates function to also check
for deleted secrets and subsequently clear the alarm and delete the
certificate snapshot information.
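A minimal sketch of the detection step only, assuming the audit has a set
of certificate snapshot names and the set of secret names that currently
exist. The function and parameter names are hypothetical; clearing the
alarm and removing the snapshot are left to the caller.

```python
def find_deleted_secrets(snapshot_names, existing_secret_names):
    """Return snapshot names whose backing secret no longer exists."""
    return sorted(set(snapshot_names) - set(existing_secret_names))
```

Each returned name would then have its alarm cleared and its snapshot
information deleted.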
Test plan:
PASS: Verify that when a certificate is deleted the alarm is cleared
from the system
PASS: Verify that deploying a soon-to-expire certificate results in
an alarm in 'fm alarm-list'
PASS: Verify that an existing certificate alarm is cleared up by
renewing the certificate to get a valid certificate
Closes-Bug: 1978730
Signed-off-by: Rei Oliveira <Reinildes.JoseMateusOliveira@windriver.com>
Change-Id: I6ed9248766b2abbbcc616e10d4575b4ae0471c9d
The ptp-notification application requires a privileged namespace in
order to deploy and operate.
This change moves the notification namespace from the baseline policy
group to the privileged policy group so that it can continue to operate
as it did prior to the addition of support for the Pod Security
Admission controller introduced in the upversion to k8s 1.23.
The privileged and baseline groups were defined in
https://review.opendev.org/c/starlingx/config/+/833487
Test-plan:
Pass: Update the privileged and baseline groups in common.py, restart
sysinv-conductor and verify that ptp-notification is able to properly
deploy.
Pass: Verify that the notification namespace has the expected
privileged labels.
Closes-Bug: 1978737
Signed-off-by: Cole Walker <cole.walker@windriver.com>
Change-Id: I5d24a8e81b32809f568a5953701cf2e0c474005e
In order to assure a response to the system health-query, when Ceph
storage-backend is configured and the ceph-api is unresponsive,
a Timeout is required.
This Timeout does not rely on the underlying ceph-api timeout as
the ceph-api may not timeout as expected.
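A minimal sketch of a caller-side timeout that does not depend on the
callee timing out on its own. It uses a worker thread from the standard
library; whether sysinv uses this mechanism or another one is not stated
here, and the function name is hypothetical.

```python
import concurrent.futures


def call_with_timeout(func, timeout, default=None):
    """Run func() in a worker thread and give up after 'timeout' seconds."""
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = executor.submit(func)
    try:
        return future.result(timeout=timeout)
    except concurrent.futures.TimeoutError:
        # The caller gets an answer even though func() is still running.
        return default
    finally:
        # Do not block on the worker; it is abandoned, not killed.
        executor.shutdown(wait=False)
```

Note that the worker thread is not cancelled when the timeout fires, so
this only bounds how long the caller waits for a response.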
Test Plan:
PASSED Verify system health-query response when Ceph is unhealthy
PASSED Verify system health-query response when Ceph is healthy
Closes-Bug: 1978726
Signed-off-by: John Kung <john.kung@windriver.com>
Change-Id: I4702c409e8ea45946ba94fab6a0989a90f2f6604
Checks for the existence of either 'status_code',
'status_int', or 'status' in the HTTP response object,
instantiating the HTTPException instance with the found
code, or raises a generic Exception if none is found.
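A minimal sketch of the attribute lookup, using a stand-in HTTPException
class rather than the real client exception. The lookup order follows the
order in which the attributes are listed above, which is an assumption
about priority.

```python
class HTTPException(Exception):
    """Stand-in for the client's HTTPException; stores the status code."""

    def __init__(self, code):
        super().__init__("HTTP error %s" % code)
        self.code = code


def build_http_exception(response):
    """Return an HTTPException built from whichever status field exists."""
    for attr in ("status_code", "status_int", "status"):
        if hasattr(response, attr):
            return HTTPException(getattr(response, attr))
    raise Exception("HTTP response has no recognizable status attribute")
```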
Test Plan:
PASS:
Tested code on system exhibiting the errors
referenced in the bug. The HTTP response is now
correctly caught and the correct HTTPException is
raised.
Closes-Bug: 1978499
Signed-off-by: Kyle MacLeod <kyle.macleod@windriver.com>
Change-Id: I53d1dd6583368de4f6713a838fe3304d362f1756
The change https://review.opendev.org/c/starlingx/config/+/838594
updated the certificate api-version from cert-manager.io/v1alpha2 to
cert-manager.io/v1, but did not make the necessary changes to the
certificate specs to work with the new version.
This change makes only the required changes to certificates specs to
work with the new version: cert-manager.io/v1
The spec organization[] should now be subject:organizations[]
See the difference here,
https://cert-manager.io/v0.13-docs/reference/api-docs/#cert-manager.io/v1alpha2.Certificate
and https://cert-manager.io/docs/reference/api-docs/#cert-manager.io/v1.CertificateSpec
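A minimal sketch of the spec rearrangement, treating the certificate spec
as a plain dictionary. The function name is hypothetical; only the move
from spec.organization to spec.subject.organizations is taken from the
description above.

```python
import copy


def to_v1_spec(spec):
    """Move v1alpha2 'organization' to v1 'subject.organizations'."""
    new_spec = copy.deepcopy(spec)
    organizations = new_spec.pop("organization", None)
    if organizations:
        subject = new_spec.setdefault("subject", {})
        subject["organizations"] = list(organizations)
    return new_spec
```

For example, a spec with organization ['system:masters'] becomes a spec
with subject.organizations ['system:masters'], and the input is left
unmodified.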
The organization 'system:masters' in the admin.conf certificate is
required to authorize the access for kubernetes-admin to cluster objects.
This authorization is specified in the 'cluster-admin'
clusterrolebinding. Without this change, all kubectl commands fail.
In v1, unlike in v1alpha2, CN is ignored by TLS clients during
authorization (https://cert-manager.io/docs/reference/api-docs/#cert-manager.io/v1.CertificateSpec)
if any subject alt name is set. My initial understanding here was that
the CN field value is being ignored due to
subject:organizations:['system:masters'] (in v1), as all the deployment
and daemonset pods were failing after "system kube-rootca-pods-update
--phase=trust-new-ca" (during rootCA update) with an authorization error
for the user 'kube-apiserver-kubelet-client'.
This forces the removal of organizations from the apiserver kubelet
client certificate as all deployments and daemonset pods authenticate
and authorize with the 'kube-apiserver-kubelet-client' user.
Without 'system:nodes' in the kubelet client certificate,
kube-scheduler and kube-controller-manager fail to authorize.
More Info: https://kubernetes.io/docs/reference/access-authn-authz/node/
Test Plan:
On CentOS AIO-SX:
PASS: Manual kubernetes RootCA update successful
PASS: Orchestrated kubernetes RootCA update successful.
PASS: All deployments, daemonsets and pods running as expected after
RootCA update.
Closes-Bug: 1978365
Signed-off-by: Kaustubh Dhokte <kaustubh.dhokte@windriver.com>
Change-Id: I767a70a07ab540510e4eb734cb4e282c9918840c
Ceph monitor quorum requires at least 2 monitors to be up when 3 are
configured in Standard or Storage setups. If one host that has a Ceph
monitor configured is locked, no other Ceph monitor host can be
locked.
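A minimal sketch of the lock check for the three-monitor case. The
function name and parameters are hypothetical, and the force override
reflects the force-lock cases in the test plan below.

```python
def can_lock_monitor_host(monitors_up, required_up=2, force=False):
    """Allow the lock only if enough monitors stay up afterwards."""
    if force:
        return True
    # Locking this host takes one monitor down.
    return monitors_up - 1 >= required_up
```

With 3 monitors up the first lock is allowed; with only 2 up a second
lock is refused unless it is forced.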
Test Plan:
PASS: AIO-SX CentOS lock/unlock.
PASS: AIO-DX CentOS lock/unlock standby controller.
PASS: Storage CentOS lock controller-1. Cannot lock storage-0.
PASS: Storage CentOS lock controller-1. Force lock storage-0.
PASS: Standard CentOS lock controller-1. Cannot lock compute-0.
PASS: Standard CentOS lock controller-1. Force lock compute-0.
Closes-Bug: #1978498
Signed-off-by: Felipe Sanches Zanoni <Felipe.SanchesZanoni@windriver.com>
Change-Id: If32aeea4712646430fdba06709aa3d4b9e05c51c