This story updates the README files of a few of the most used StarlingX
repos.
Test Plan: N/A
Story: 2010814
Task: 48355
Change-Id: If7d4825337a8057d3be540d96885c7956b857730
Signed-off-by: Roger Ferraz <rogerio.ferraz@encora.com>
This commit changes the file "/etc/platform/openrc" to allow its usage
by other users. The parameter "--no_credentials" was added for
this purpose. Also, the permissions of this openrc file were changed to
0644 to allow its usage by users with no privileges.
The typical use case is an LDAP user, with or without privileges, who
sources this openrc file and then sets the variables OS_USERNAME,
OS_PASSWORD and PS1 (which uses OS_USERNAME).
The test that checks whether the controller is the active one also
changed. Previously, it only tested whether the retrieved password was
empty. Since an empty password may now also be caused by a user with
insufficient privileges, the test now checks whether the executable file
"keyring_file" exists (it exists only on the active controller, which
is why a standby controller gets an empty password).
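The decision flow described above can be modelled roughly as follows. This is an illustrative Python sketch, not the actual openrc shell code: the function name, its parameters and the returned labels are all hypothetical, chosen only to mirror the cases exercised in the test plan.

```python
def openrc_outcome(keyring_tool_exists, no_credentials, password):
    """Model the outcome of sourcing openrc (all names are hypothetical).

    keyring_tool_exists: True only on the active controller.
    no_credentials: True when --no_credentials was passed.
    password: the password read from the keyring ("" if unreadable).
    """
    # The active-controller test no longer depends on the password.
    if not keyring_tool_exists:
        return "not-active-controller"
    # With --no_credentials the keyring is not needed at all.
    if no_credentials:
        return "sourced-without-credentials"
    # On the active controller an empty password now means that the
    # user lacks the privileges to read the keyring.
    if not password:
        return "insufficient-privileges"
    return "sourced-with-credentials"
```

Under this model, a standby controller always reports "not-active-controller", with or without the flag, which matches the standby test case in the test plan.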
Test Plan:
PASS: Successfully deploy an AIO-DX containing this change. Check that
the permissions of "/etc/platform/openrc" are 644, owner root, group
sys_protected.
PASS: In the deployed AIO-DX, create 2 users: user1 is not part of
groups sys_protected and root, user2 is part only of group
sys_protected.
PASS: In the active controller of AIO-DX, using users user1 and user2,
execute the following commands: for "source /etc/platform/openrc
--no_credentials" command, the result for all users is that the file is
sourced without errors; for "source /etc/platform/openrc; system
host-list", user1 gets a message saying it doesn't have privileges to
read keyring password and an error message for system command, while
user2 gets the commands executed without errors.
PASS: Repeat the test above for standby controller: for "source
/etc/platform/openrc --no_credentials" command, all users get a message
saying it should only be loaded from active controller; for "source
/etc/platform/openrc; system host-list", also a message is printed
saying it should only be loaded from active controller and an error
message appears for system command.
Partial-Bug: 2024627
Signed-off-by: Joao Victor Portal <Joao.VictorPortal@windriver.com>
Change-Id: I6ef2ca16a272d1fc7c4a24b9f5b48a9cb860450f
As part of the Debian migration, the sysinv procedure that checks DPDK
compatibility for each host interface was also updated to make it
customizable in case one would like to use a virtual switch other than
the delivered OVS with DPDK support [1].
For other virtual switches, which might or might not rely on DPDK, the
ELF target that sysinv uses to verify interface compatibility must be
customizable, and the query_pci_id script is already able to use custom
values [2].
This change adds to puppet the system configuration that will write, if
defined, the correct value for the ELF path. This platform parameter can
be overridden on the hiera data so puppet will update sysinv.conf
accordingly.
For now, when deploying StarlingX with vswitch_type=ovs-dpdk, we will
override it with the query_pci_id script default value (i.e., the
/usr/sbin/ovs-vswitchd ELF) using the respective sysinv puppet module
and leave it as an example for anyone who later uses a different
vswitch that requires this customization [3].
[1] https://review.opendev.org/c/starlingx/config/+/872979
[2] 2cd0b1e14a/sysinv/sysinv/sysinv/scripts/query_pci_id (L34)
[3] https://review.opendev.org/c/starlingx/config/+/887106
Test Plan:
PASS - Build puppet-manifest package
PASS - Build a custom stx ISO with the new package
PASS - Bootstrap AIO-SX virtual system (vswitch_type=none)
and ensure that neither the hiera data was modified nor
sysinv.conf was updated
PASS - Bootstrap AIO-SX virtual system (vswitch_type=ovs-dpdk)*
and ensure the hiera data was modified correctly and
sysinv.conf was updated accordingly
* A successful complete installation with ovs-dpdk is still blocked by
a bug that will be solved soon:
https://bugs.launchpad.net/starlingx/+bug/2008124
Story: 2010317
Task: 46389
Signed-off-by: Thales Elero Cervi <thaleselero.cervi@windriver.com>
Change-Id: Iaf31d3b5e2fc03b4783473e4329a780a516a9d43
This commit adds single quotes around the user password parameter value
to ensure that complex passwords are valid when the user option setup
script is executed by puppet bootstrap.
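To illustrate why the quoting matters, here is a small Python sketch using the standard shlex module. The script name "setup-user-option" and the sample password are hypothetical; the actual fix is applied in the puppet manifest, not in Python.

```python
import shlex


def build_command(script, password):
    """Build a shell command line with the password safely quoted."""
    # shlex.quote wraps the value in single quotes when it contains
    # shell metacharacters such as '$', '!' or '('.
    return f"{script} {shlex.quote(password)}"


# Hypothetical script name and password, for illustration only.
cmd = build_command("setup-user-option", "Pa$$w0rd!(")

# Parsing the command line the way a POSIX shell would yields the
# original password as one intact argument.
args = shlex.split(cmd)
```

Without the quotes, an unescaped "(" or "$" would be interpreted by the shell instead of being passed through as part of the password.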
Test Plan:
PASS: Full build, system install, bootstrap and unlock DC system, with
one subcloud bootstrapped and unlocked with active enabled
available status.
PASS: Add, bootstrap, manage and unlock a subcloud with a complex
password containing special characters, numbers, capital letters
and an open parenthesis at the end of the sentence.
Closes-Bug: 2025292
Change-Id: Ia5430084bf6b16c78594a2483f2b88ec9b18f36a
Signed-off-by: Manoel Benedito Neto <Manoel.BeneditoNeto@windriver.com>
In order to unify the implementation with the other platform firewalls,
the hard-coded values are set to 'undef' and will be provided by
sysinv in system.yaml.
The test below validates that the correct values are present in the OAM
firewall.
Test Plan:
[PASS] Install, Lock, Unlock AIO-SX
[PASS] Install, Lock, Unlock AIO-DX (as SystemController)
Story: 2010591
Task: 48255
Depends-On: https://review.opendev.org/c/starlingx/config/+/885585
Change-Id: Idc1f71f7ba762dc76529022acf4145db00686ec2
Signed-off-by: Andre Kantek <andrefernandozanella.kantek@windriver.com>
This script waits for the k8s control-plane component endpoints
(apiserver, scheduler, controller-manager, kubelet) to be up and
running at the end of platform::kubernetes::upgrade_abort.
Retry/timeout parameters are configured to wait up to 3 minutes.
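The retry/timeout behaviour can be sketched as below. This is a generic Python illustration rather than the script added by this change: the function name is hypothetical, and the split into 18 retries of 10 seconds is an assumption that merely adds up to the 3 minutes mentioned above.

```python
import time


def wait_until_healthy(check, retries=18, delay=10.0, sleep=time.sleep):
    """Poll check() until it returns True or the retries run out.

    The defaults (18 retries x 10 s = 180 s) are assumed values.
    The sleep function is injectable so the loop can be tested
    without actually waiting.
    """
    for _ in range(retries):
        if check():
            return True
        # Endpoint not healthy yet: wait before the next attempt.
        sleep(delay)
    # Corresponds to the timeout case reported in the log file.
    return False
```

A caller would pass a check that queries all four endpoints and returns True only when every one of them reports healthy.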
Test plan:
Pass: Verify the abort waits for all control-plane endpoints to be
healthy.
Pass: Verify /var/log/kubernetes/k8s-endpoints-health.log shows
'Timeout: Kubernetes control-plane endpoints not healthy' message
after the timeout is exceeded.
Story: 2010565
Task: 48203
Depends-On: https://review.opendev.org/c/starlingx/config/+/885582
Change-Id: I232b4746a3eb899ba87e706160547e8792489394
Signed-off-by: Boovan Rajendran <boovan.rajendran@windriver.com>
sssd is monitored by pmon. But currently the Restart option in its
systemd service file is set to on-failure. This sometimes causes
systemd and pmon to fight to restart the service when it fails. All
processes monitored by pmon should have Restart set to "no".
This change adds a systemd override file to set Restart to "no" for
the sssd service.
Test Plan:
PASS: Standard system deployment.
PASS: Check sssd Restart option using "systemctl cat sssd", verify
Restart option is set to "no", as follows:
# /etc/systemd/system/sssd.service.d/sssd-stx-override.conf
[Service]
# pmond monitors sssd service
Restart=no
PASS: Kill the sssd process, verify that pmon restarts it successfully
by tailing pmon.log, and verify sssd is running with the "systemctl
status sssd" command.
Closes-Bug: 2023421
Signed-off-by: Andy Ning <andy.ning@windriver.com>
Change-Id: I84521caf3745122492afe9ef4a251e42129b29b0
Local OpenLDAP and WAD servers are being used for k8s api and SSH
authentication. We need the ability to disallow SSH authentication
for selective users. As part of the solution, we create a Linux
group where all ldap users with "denied ssh access" will be added.
The group will be set for denied ssh access in the sshd configuration.
The sshd configuration change is part of a separate commit.
Test Plan:
PASS: Debian image gets successfully installed in AIO-SX system.
PASS: Verify the Linux group has been created.
PASS: Create an openldap user and add to the "deny ssh access" group.
Verify that the user cannot ssh.
PASS: Create a WAD group with the same name and gidNumber as the
Linux group for "deny ssh access". Create a WAD user in this group.
Validate that the new WAD user in the "deny ssh group" cannot ssh
to stx platform.
PASS: Remove the WAD user from the WAD "deny ssh access" group.
Validate that now the user can have ssh access to stx platform.
PASS: Remove the openldap user from the Linux "deny ssh access" group.
Validate that now the user can have ssh access to stx platform.
Story: 2010589
Task: 48234
Signed-off-by: Carmen Rata <carmen.rata@windriver.com>
Change-Id: Ib1229f21e207d66d39f8bcdb7acf0533ace527c1
The names of classes/defines should match the name that's
implied by their file path. Puppet throws an "unacceptable
location" warning whenever this condition is not satisfied.
Test Plan:
PASS: Build & install
PASS: AIO-SX Successful Bootstrap
PASS: AIO-SX Successful Unlock
PASS: Verified that 'unacceptable location' warnings are
no longer present in puppet.log
Story: 2010757
Task: 48026
Change-Id: I1cd3d09e90bfeb3d206b540717943ea1e6413444
Signed-off-by: Matheus Guilhermino <matheus.machadoguilhermino@windriver.com>
This ensures that the kernel boot args are correct.
When they are not correct, puppet will trigger a reboot
after unlocking to fix them.
TEST PLAN
PASS: AIO-SX backup and restore
* New backup will include /boot files
* Non-default kernel boot args will be kept
* No double reboot
* /proc/cmdline can be used to verify kernel boot args
PASS: AIO-SX backup and restore
* Remove new /boot files from backup
* Restore with modified backup
* Non-default kernel boot args will be lost
* No double reboot
* /proc/cmdline can be used to verify kernel boot args
Partial-Bug: 2023678
Change-Id: I5f0c91c0c8583f4a86148ddf0fadc03b18ff9c1a
Signed-off-by: Joshua Kraitberg <joshua.kraitberg@windriver.com>
Replaces "systemctl is-active" calls with a "pid file check" approach for
docker-distribution (docker-registry) and registry-token-server
services. These calls were causing unnecessary process restarts in
cases where systemd was halted due to contention on kernfs_mutex.
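A pid file check of this kind can be sketched in Python as follows. This is only an illustration of the technique on a POSIX system; the function name is hypothetical and the real status checks live in the service scripts, not in Python.

```python
import os


def is_running(pid_file):
    """Return True if the pid stored in pid_file belongs to a live process."""
    try:
        with open(pid_file) as handle:
            pid = int(handle.read().strip())
    except (OSError, ValueError):
        # Missing, unreadable or malformed pid file: treat as not running.
        return False
    if pid <= 0:
        return False
    try:
        # Signal 0 performs the existence check without sending a signal.
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        # The process exists but is owned by another user.
        return True
    return True
```

Unlike "systemctl is-active", this check does not talk to systemd, so it keeps answering even when systemd itself is stalled.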
Test Plan:
PASS: Verify docker-distribution status
PASS: Verify registry-token-server status
Partial-bug: 2016028
Change-Id: I2398d7f397ad14d2ff1ff6d40141ffad4f54f2e3
Signed-off-by: Davi Frossard <dbarrosf@windriver.com>
Currently sssd is not configured or running on storage nodes, so
LDAP users can't log in to storage nodes. This update configures sssd
and runs it on storage nodes (with a followup update).
Test Plan:
PASS: System with storage nodes deployment
PASS: In storage nodes, verify that the following config file exists:
/etc/sssd/sssd.conf
Closes-Bug: 2023399
Signed-off-by: Andy Ning <andy.ning@windriver.com>
Change-Id: I383c101e0f99be93e9da528411c6fa1fd8cde4c6
This change creates a class to update the admin firewall during
runtime operations
Test Plan:
[PASS] in subcloud mode, add/remove static routes in the mgmt network
[PASS] in subcloud mode, add/remove static routes in the admin network
Story: 2010591
Task: 48202
Change-Id: I3a4025cb8c6ff8d90ba36b49e2aaa12d0ec7057b
Signed-off-by: Andre Kantek <andrefernandozanella.kantek@windriver.com>
After system bootstrap, when service-parameter-apply is executed
for the first time, it verifies that the k8s configmaps linked to
service-parameters extra-volumes exist, then creates a flag
(configmap_initialization_flag) to skip this step on subsequent
runs. If the flag is not generated, the k8s custom configuration
script checks the k8s configmaps each time it is run.
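The run-once behaviour driven by the flag can be sketched as below. This is a minimal Python illustration, assuming the flag is a plain file on disk; the function name, the flag path and the check callback are hypothetical and do not reflect the actual script.

```python
import os


def ensure_configmaps_checked(flag_path, check_configmaps):
    """Run check_configmaps() once, then skip it while the flag exists.

    Returns True if the check ran, False if it was skipped.
    """
    if os.path.exists(flag_path):
        # Flag present: the configmaps were already verified.
        return False
    check_configmaps()
    # Create the flag only after the check succeeded, so a failed
    # check is retried on the next run.
    with open(flag_path, "w"):
        pass
    return True
```

If check_configmaps() raises, no flag is written, which corresponds to the case above where the configmaps are checked again on every run.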
Test Plan:
PASS: Fresh Install STD/DX
PASS: Apply K8s service-parameter.
PASS: Verify configmap initialization flag has been created.
Closes-bug: 2022983
Signed-off-by: Jorge Saffe <jorge.saffe@windriver.com>
Change-Id: Ie28247fd62945f90a9018a7ebb7942245ea5aeb4
This change creates the new class to update the management network
firewall in runtime. The class is meant to be applied by
sysinv-conductor when the route config is updated in system controller
hosts.
Test plan:
Setup: Distributed Cloud with AIO-DX as system controller.
[PASS] Add route in a management interface, check that the
corresponding network is present in the system controller's
firewall.
[PASS] Remove previously created route, check that the corresponding
network is no longer present in the system controller's
firewall.
Story: 2010591
Task: 48174
Signed-off-by: Lucas Ratusznei Fonseca <lucas.ratuszneifonseca@windriver.com>
Change-Id: I08fa9e2807f0c734c716c28c1996588167ee9d58
The default ntpd configuration enables network interface scanning, and
this causes ntpd to lose sync after about 2 days and 9 to 10 hours.
This fix disables ntpd interface scanning by adding the -U 0 option.
Note: this was detected on CentOS, but both CentOS and Debian will have
the same option added to maintain consistency.
Test Plan:
PASSED: Debian: check that ntpd -U 0 configuration is applied
PASSED: Debian: wait for more than 5 days and check that ntp sync is still working
Closes-Bug: 2017697
Change-Id: I1c2727b71d71bf03966c834c470bd225e2a95c81
Signed-off-by: Caio Bruchert <caio.bruchert@windriver.com>
This change adds a new log, /var/log/rss-memory.log, for
memory growth debugging.
The following entry into crontab will output daily at 01:00:
0 1 * * * /usr/bin/date >> /var/log/rss-memory.log;
/usr/bin/ps -e -o ppid,pid,nlwp,rss:10,vsz:10,
comm,cmd --sort=-rss >> /var/log/rss-memory.log
Test Plan:
- PASS: Build an image, install and bootstrap successfully
- PASS: Apply monitor pods so addon logs would be installed.
- PASS: Check that log entries are correctly displayed.
- PASS: Tested on controller, AIO, worker and storage hosts.
Closes-Bug: 2019007
Change-Id: I6f8e6208d203bcc77320ced3766af04dab977829
Signed-off-by: Cesar Bombonate <Cesar.PompeudeBarrosBombonate@windriver.com>
This change is to restore the etcd snapshot during k8s
upgrade abort.
During k8s upgrade abort we need to drain the node, remove the
static pod manifest files, stop the kubelet, containerd, docker
and etcd services, restore the etcd snapshot, restore the static pod
manifests, start the etcd, docker and containerd services, update the
bindmount and start the kubelet service.
The helper script 'kube-wait-control-plane-terminated.sh' is used to
wait, with a timeout, for the control plane pod processes to exit after
the static pod manifest files are removed, and to forcibly kill the
processes if the timeout expires.
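The wait-then-kill pattern used by the helper can be sketched in Python with the standard subprocess module. This is only an illustration of the pattern on a process handle; the function name is hypothetical and the real helper is a shell script operating on the control plane processes.

```python
import subprocess


def wait_or_kill(proc, timeout):
    """Wait for proc to exit; kill it if it outlives the timeout.

    proc is a subprocess.Popen object. Returns "exited" or "killed".
    """
    try:
        proc.wait(timeout=timeout)
        return "exited"
    except subprocess.TimeoutExpired:
        # Grace period is over: force termination and reap the process.
        proc.kill()
        proc.wait()
        return "killed"
```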
Test Plan:
AIO-SX: Perform k8s upgrade v1.24.4 -> v1.25.3
PASS: Create a test pod before the etcd backup and delete the pod
after taking the snapshot. Run the command "system kube-upgrade-abort"
and verify the test pod is running after etcd is restored successfully.
PASS: Verify the kubeadm and kubelet versions are restored successfully
to the from-version after k8s upgrade abort.
PASS: Verify static manifests are restored successfully after k8s
upgrade abort.
PASS: Verify all the pods are restored and running successfully.
PASS: Verify pod networking is still working.
Story: 2010565
Task: 48070
Change-Id: I2efda2c9f84346933a9b1277e95d95cd8d21c50f
Signed-off-by: Boovan Rajendran <boovan.rajendran@windriver.com>
This change adds the ability to install the firewall required by the
worker node into the calico configuration, since kubectl isn't available
there. It uses ansible ad-hoc commands to access the controller from
the worker and execute the command.
In all test cases below, the iptables/ip6tables content in the worker
node was verified.
Test Plan:
[PASS] Install worker node.
[PASS] Execute lock/unlock in the worker node.
[PASS] Reinstall worker node.
Story: 2010591
Task: 48067
Change-Id: I613b4ea710172c2bc7c6408bfa36430cbfe33fa2
Signed-off-by: Andre Kantek <andrefernandozanella.kantek@windriver.com>
- Remove "-a" flag from the command as it has been deprecated in the
original pf-bb-config source.
Test Status:
- PASS: Configure device using host-device-modify with 1 VF and unlock
the host.
- PASS: Create a test pod requesting 1 VF.
- PASS: Run dpdk test-bbdev application on 1 VF inside the pod.
Closes-Bug: 2020128
Story: 2010698
Task: 48063
Change-Id: I229b505755138495c79513e926f674c28797b79b
Signed-off-by: Nidhi Shivashankara Belur <nidhi.shivashankara.belur@intel.com>
This change allows us to call a puppet class to update
the bindmounts, restore the saved static manifest files, restart
kubelet and restart etcd during k8s upgrade abort.
This change also resolves the warning message
"Unrecognized escape sequence" that appears during kubelet upgrade.
Test plan:
Pass: Abort the k8s upgrade by 'system kube-upgrade-abort' command
and verify static manifest files are restored, bindmounts are updated,
kubelet and etcd restarted successfully.
Pass: Verify /etc/fstab content updated successfully after k8s upgrade
abort.
Story: 2010565
Task: 47822
Change-Id: If1b1bda88a898bc6360403a839e174fbc0d62008
Signed-off-by: Boovan Rajendran <boovan.rajendran@windriver.com>
This commit addresses a bug fix for the following scenario:
1. A user installs a subcloud with the communication between
subcloud and system controller assigned to the
management network.
2. The user decides they want to transition to the admin network,
which allows changes to the subnet information after install.
3. The user locks a host, creates a platform interface for the
admin network, then unlocks.
4. The user (after unlock) creates an address pool and admin
network, and assigns the network to the previously created
interface.
Because there is a requirement in StarlingX for the admin network
to be able to apply subnet changes (address pool, network) at
runtime, this scenario causes an issue because the admin-services
SM service-domain-member and service group are only actually
present in the SM database after an unlock. In the above
scenario, we logically create an admin interface but only assign
it to an 'admin' network after unlock.
This commit handles the above by ensuring the admin-services
service-domain-member and service-group are enabled in the case
that the system is a subcloud.
Test Plan:
1. Install a subcloud using the management network for communication
with a system controller. Ensure no alarms and that the
admin-services service group is active, with no admin-ip service
created.
Lock, create an 'admin' interface and
unlock. After unlock create and apply the admin address pool
and network. Ensure the subcloud can be updated to use the admin
network via dcmanager subcloud update. Ensure that the admin-ip
service is enabled-active.
2. Install a subcloud using the management network for communication
with a system controller. Lock, create an 'admin' interface,
create an 'admin' address pool and network, then unlock. Ensure
the subcloud can be updated to use the admin network via dcmanager
subcloud update.
3. Install a subcloud using the admin network for communication with
a system controller. Ensure the subcloud can become managed,
online, and in-sync.
4. Perform the steps 1-3 for both AIO-SX and AIO-DX.
Story: 2010319
Task: 46911
Signed-off-by: Steven Webster <steven.webster@windriver.com>
Change-Id: I692dcf4f7e8c280236d63984ffd02afbed0a3e1d
This change adds puppet classes to install the L3 firewall in cluster
nodes that can run kubernetes (controllers and workers). It uses the
hash2yaml function from the package puppet-hash2stuff; that change is
marked as a dependency for this task.
The story 2010591 is still under development, and for now we are only
applying the platform firewalls to the controller nodes.
With the change https://review.opendev.org/c/starlingx/config/+/881495
the new classes' config info is provided. In this first delivery the
firewall will not contain restrictive rules, focusing instead on
ensuring that the necessary GlobalNetworkPolicy and HostEndpoints are
correctly installed among the nodes.
Test Plan:
[PASS] install AIO-DX
[PASS] install Standard with DX+worker+storage nodes
Story: 2010591
Task: 47954
Depends-On: https://review.opendev.org/c/starlingx/integ/+/881497
Change-Id: I1d35abde612cdaf3ccb54a858618037382ff2636
Signed-off-by: Andre Kantek <andrefernandozanella.kantek@windriver.com>
In order to use the total available 1G hugepages space when the
vswitch_type parameter is set to 'none', the value huge_pages=off
needs to be included in /etc/postgresql/postgresql.conf since, by
default, postgres uses hugepages if available.
The postgresql.pp manifest is called on unlock.
Test Plan
PASS: AIO-SX: Successfully bootstrapped and unlocked
PASS: Verified that app_hp_avail_1G == app_hp_total_1G after
increasing huge page memory to the amount indicated by
app_hp_total_1G (total and available values match when
no applications are using huge pages).
PASS: Output of 'cat /proc/meminfo' matches output of
'system host-memory-list controller-0'
(HugePages_Free == app_hp_avail_1G).
Closes-bug: 2018324
Change-Id: Iab7b7518fdcfccd2761778ed6a875a42cd35c34c
Signed-off-by: Matheus Guilhermino <matheus.machadoguilhermino@windriver.com>
In commit 77e0c7c1 we removed an exec resource that called out to an
obsolete script. However, we neglected to remove a "require"
metaparameter which referenced the removed script. It's unclear how
this was missed since the previous change was tested in VirtualBox.
This causes a puppet error when trying to upgrade K8s:
Could not find resource 'Exec[update kubeadm-config]' in parameter
'require' (file:
/usr/share/puppet/modules/platform/manifests/kubernetes.pp, line: 813)
The fix is to remove the metaparameter.
TEST PLAN:
PASS: While running the dev branch on AIO-DX, upgrade K8s from 1.21
to 1.22.
(note, a workaround was required to deal with
https://bugs.launchpad.net/starlingx/+bug/2018247)
Partial-Bug: 2017696
Change-Id: I66c0e88f0f0a3acc3326391263123e60667561cc
Signed-off-by: Chris Friesen <chris.friesen@windriver.com>
Updating the rsa ssh host key based on:
https://github.blog/2023-03-23-we-updated-our-rsa-ssh-host-key/
Note: In the future, StarlingX should have a zuul job and
secret setup for all repos so we do not need to do this
for every repo.
Needed to rename the secret, because zuul fails if like-named
secrets have different values in different branches of the same
repo.
Partial-Bug: #2015246
Change-Id: I94d27934bbfafb174f8e8d48491e6089f47e6408
Signed-off-by: Davlet Panech <davlet.panech@windriver.com>
upgrade_k8s_config.sh has been deprecated and
removed due to lack of support for "flow" style YAML.
Deprecated functionality has been superseded
by better YAML-aware handling in sysinv.
This change also updates how we invoke kubeadm: we now call an explicit
version of kubeadm. The version called matches the version we are
upgrading to, in order to handle the format unsupported by previous
versions of kubeadm.
Test Plan
PASS:
- Manually update scripts on controllers and worker nodes based on
https://review.opendev.org/c/starlingx/integ/+/880390
- Perform manual upgrade from k8s v1.21.8 to v1.22.5
- Verify kubernetes successfully upgraded to v1.22.5
Test was performed in the lab with local changes
to verify the code.
Patch was not tested.
Closes-Bug: 2017696
Change-Id: I840eb566057be495fe0da3cae7604bf8055c0d4f
Signed-off-by: Gleb Aronsky <gleb.aronsky@windriver.com>
When the K8s custom config puppet script is executed during the restore
playbook, K8s updates fail when trying to validate cluster network
data. This happens whenever the OAM IP address is reconfigured (after
reinstall) with a different protocol version than the one used for the
K8s cluster host subnet.
The issue is related to the "advertise-address" parameter. It is not
predefined in the api-server extra-args during bootstrap, so k8s uses
the address of the host's default interface as the default value. In
this case, the host's default value is an IPv4 (IPv6) address while all
the other K8s cluster subnets are configured with IPv6 (IPv4) addresses.
K8s validation fails because STX defaults to a SingleStack mode. Only
dual-stack networks allow the assignment of IPv4 and IPv6 addresses to
pods and services.
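The mismatch described above can be illustrated with Python's standard ipaddress module. This is a sketch of the condition only, not the validation code used by K8s or sysinv; the function name and the sample addresses are hypothetical.

```python
import ipaddress


def same_ip_family(advertise_address, cluster_subnet):
    """Return True if the address and the subnet use the same IP version."""
    address = ipaddress.ip_address(advertise_address)
    subnet = ipaddress.ip_network(cluster_subnet)
    # version is 4 for IPv4 and 6 for IPv6.
    return address.version == subnet.version


# Hypothetical values: an IPv4 OAM address against an IPv6 cluster subnet.
mismatch = not same_ip_family("10.10.10.2", "fd04::/64")
```

In single-stack mode, a mismatch like this one is the condition that makes the validation fail.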
Test Plan:
PASS: Fresh Install AIO-SX.
PASS: Create a backup and reinstall server.
PASS: Reconfigure network OAM IF with a different IP family.
PASS: Restore system.
PASS: Verify advertise-address parameter.
PASS: Modify and Apply K8s service-parameter.
PASS: Fresh Install STD/DX
PASS: Modify and Apply K8s service-parameter.
PASS: Verify advertise-address parameter in both controllers.
Closes-bug: 2001715
Signed-off-by: Jorge Saffe <jorge.saffe@windriver.com>
Change-Id: I6f75f171d0a45abe2d5e047a31308dc97ce19eed
The kubernetes.pp class platform::kubernetes::upgrade_first_control_plane,
which does 'kubeadm upgrade apply', resulted in a versioned
kubelet-config ConfigMap. The pre-upgrade ConfigMap was left behind.
Having multiple ConfigMaps causes 'system kube-config-kubelet' to fail,
so reconfiguration was broken.
In historical releases, we had specified '--config
/etc/kubernetes/kubelet_override.yaml', so the kubelet garbage
collection eviction parameters became incorrect post k8s upgrade,
without a way to reconfigure.
This update will purge all kubelet-config ConfigMaps except the most
recent. This occurs immediately following the 'kubeadm upgrade apply'
step.
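The selection of ConfigMaps to purge can be sketched as follows. This Python illustration makes two assumptions that are not stated in this commit: that the versioned ConfigMaps are named "kubelet-config-<major>.<minor>", and that "most recent" can be approximated by the highest version number. The function name is hypothetical, and the actual purge is performed by the puppet class.

```python
def configmaps_to_purge(names, prefix="kubelet-config-"):
    """Return every versioned kubelet-config name except the newest.

    Assumes names of the form 'kubelet-config-<major>.<minor>'.
    """
    def version_key(name):
        # '1.22' -> (1, 22), so that 1.9 sorts before 1.21.
        return tuple(int(part) for part in name[len(prefix):].split("."))

    versioned = sorted(
        (name for name in names if name.startswith(prefix)),
        key=version_key,
    )
    # Keep the last (highest version) entry and purge the rest.
    return versioned[:-1]
```

Comparing numeric tuples rather than strings matters here, because a plain string sort would place "1.9" after "1.21".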
Testplan:
PASS: AIO-SX perform k8s upgrade, run 'system kube-config-kubelet'.
Verify only current version kubelet-config ConfigMap exists.
Closes-Bug: 2012975
Change-Id: I5e34299616690628267c07a744dc9923144e606d
Signed-off-by: Jim Gauld <James.Gauld@windriver.com>