It was observed that it is not necessarily true that all pod traffic
is tunneled in IPv4 installations. To solve that we are extending the
solution done for IPv6 to IPv4, which consists of adding the
cluster-pod network to the cluster-host firewall.
The problem showed itself when the stx-openstack application was
installed.
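Illustratively (the actual rule generation lives in the sysinv/puppet
firewall code), the change amounts to including the cluster-pod subnet
in the set of source networks accepted on the cluster-host interface;
a minimal sketch with example subnets:

  # Source networks accepted by the cluster-host firewall (example values).
  cluster_host_sources = ["192.168.206.0/24"]   # cluster-host subnet
  cluster_pod_subnet = "172.16.0.0/16"          # example cluster-pod subnet

  # Accept the cluster-pod network as well, since pod traffic is not
  # necessarily tunneled in IPv4 installations.
  if cluster_pod_subnet not in cluster_host_sources:
      cluster_host_sources.append(cluster_pod_subnet)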
Test Plan:
[PASS] observe stx-openstack installation proceed with the correction
Closes-Bug: 2023085
Change-Id: I572cd85e6638d879d8be1d9992ae852a805eca4b
Signed-off-by: Andre Kantek <andrefernandozanella.kantek@windriver.com>
This commit adds the subcloud networks to the firewall ingress
rules. The networks are obtained from the route table.
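A minimal sketch of the idea, with simplified stand-ins for the route
entries (not the actual sysinv data structures):

  import ipaddress

  def subcloud_networks_from_routes(routes):
      """Derive the subcloud networks for the ingress rules from
      route table entries (network/prefix pairs)."""
      return sorted({str(ipaddress.ip_network("%s/%s" % (r["network"], r["prefix"])))
                     for r in routes})

  # Example route entries as they might be read from the route table:
  routes = [{"network": "192.168.101.0", "prefix": 24},
            {"network": "192.168.102.0", "prefix": 24}]
  print(subcloud_networks_from_routes(routes))
  # ['192.168.101.0/24', '192.168.102.0/24']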
Test plan:
Setup: Distributed Cloud with AIO-DX as system controller.
[PASS] Add subcloud, check that the corresponding network is present in
the system controller's firewall.
[PASS] Remove subcloud, check that the corresponding network is no
longer present in the system controller's firewall.
Story: 2010591
Task: 48139
Depends-on: https://review.opendev.org/c/starlingx/stx-puppet/+/885303
Change-Id: Ia83c26c88914413026953fcef97af55fe65bd058
Signed-off-by: Lucas Ratusznei Fonseca <lucas.ratuszneifonseca@windriver.com>
This commit adds unit tests to the load-delete workflow, improving
test coverage and validating the flow for future changes.
Test Plan:
- PASS: Tox tests
Story: 2010611
Task: 48159
Signed-off-by: Guilherme Schons <guilherme.dossantosschons@windriver.com>
Change-Id: If6a27af3a76d4aff8e4168b72ad5892046fe9ba6
Pods of the Intel QAT device plugin will only be created on
nodes that support the Intel QAT drivers and carry the label
"intelqat: enabled".
With this commit, the sysinv agent checks the host QAT device
driver. Once a supported device is detected, the sysinv agent
sends a request to the sysinv conductor, and the conductor sets the
Kubernetes label "intelqat: enabled" on the specific node if the
file "/opt/platform/config/22.12/enabled_kube_plugins" exists and
"intelqat" is in the file.
To detect whether the host supports QAT devices or not, sysinv code
is added based on the following approaches:
1. The output of the lspci command is parsed to check whether PFs are
   listed for vendor 8086 and the 4940/4942 QAT devices on a Sapphire
   Rapids lab.
2. The output of the lspci command is parsed to check whether VFs are
   listed for other QAT devices, following the old code implementation,
   as no hardware was available to validate this.
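For illustration only, the PF detection (approach 1) could be sketched
roughly as below; the function name is hypothetical and the device IDs
are the ones listed above:

  import subprocess

  QAT_PF_DEVICE_IDS = ("4940", "4942")  # Sapphire Rapids QAT PF device IDs

  def host_has_qat_pf():
      """Return True if lspci lists an Intel (vendor 8086) QAT PF."""
      # '-nn' prints numeric IDs, e.g. '[8086:4940]'
      output = subprocess.check_output(["lspci", "-nn"], text=True)
      return any("[8086:%s]" % dev in line
                 for line in output.splitlines()
                 for dev in QAT_PF_DEVICE_IDS)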
TEST PLAN:
PASS: Verified the "intelqat: enabled" label using the
"kubectl get nodes controller-0 --show-labels" command.
PASS: Checked whether the daemonset pods are running or not, using
the command "kubectl get ds -A | grep 'qat-plugin'"
NOTE: No QAT hardware is available, so all testing was done using
commands after bypassing the driver-related checks in the code
Story: 2010604
Task: 47853
Signed-off-by: Md Irshad Sheikh <mdirshad.sheikh@windriver.com>
Change-Id: Ib5d4bafbf918f4c0e3ebcb5fa78f90d021e0ef20
Added support for a new "registry" pattern. Image settings
inside charts can now have the following pattern:
image:
  registry: <str>
  repository: <str>
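As an illustration of the intent (not the actual override-resolution
code in sysinv), the full image reference would be assembled from the
two fields roughly like this:

  def full_image_reference(image):
      """E.g. {'registry': 'docker.io', 'repository': 'library/nginx'}
      becomes 'docker.io/library/nginx'; without a registry the bare
      repository is used."""
      registry = image.get("registry")
      repository = image["repository"]
      return "%s/%s" % (registry, repository) if registry else repository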
Test Plan:
PASS: Upload and apply process successfully completed with
tarball changed to new pattern using "registry"
PASS: metrics-server, nginx-ingress-controller, vault and
sts-silicom upload and apply processes without "registry"
completed successfully.
Closes-bug: 2019730
Change-Id: Id5cadafedf9b85891700dffcede9b0b09ee64359
Signed-off-by: David Barbosa Bastos <david.barbosabastos@windriver.com>
When sysinv is restarted and there is a stuck application,
the _abort_operation function is called with a parameter
different from the expected one. The parameter needs to be
an instance of AppOperation.Application.
The function call was changed to pass the correct parameter, and
documentation was added to the _abort_operation function.
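A minimal, self-contained sketch of the corrected call shape (all
names except _abort_operation are hypothetical stand-ins, not the
actual sysinv code):

  class Application(object):            # stand-in for AppOperation.Application
      def __init__(self, name):
          self.name = name
          self.status = "applying"

  def _abort_operation(app, reason):
      # Expects the Application instance, so attributes such as
      # app.name and app.status are available when recording the abort.
      app.status = "apply-failed"
      print("application %s aborted: %s" % (app.name, reason))

  # Correct usage: pass the tracked instance, not just the app name string.
  _abort_operation(Application("kubevirt-app"), "sysinv restarted")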
Test Plan:
PASS: Restart sysinv successfully
PASS: Restart sysinv with stuck kubervirt app performed
successfully
PASS: Successfully lock and unlock the controller
PASS: Shows the name of the chart that caused the app to abort
PASS: Individually show the image that failed when trying to
apply the app
PASS: Command "system application-abort" executed and output
message "operation aborted by user" displayed in
application-list as expected
Closes-bug: 2022007
Signed-off-by: David Bastos <david.barbosabastos@windriver.com>
Change-Id: I948ec8f9700d188a5f8e099a4992853822735b95
When performing actions on STX-Openstack (apply, re-apply), sysinv
logs get filled with "DetachedInstanceError" messages. This was
already found previously and fixed by [1], but that fix does not cover
the STX-Openstack application: when the app is applied (or
re-applied), Neutron needs to fetch some information about the
interfaces' data networks, and that query throws the
DetachedInstanceError.
Checking the SQLAlchemy documentation [2], it looks like lazy load can
be allowed via a relationship parameter on the model, which fits this
case since the VirtualInterfaces (Interface) model needs to get
information from the InterfaceDataNetwork model.
This was also previously done for the InterfaceNetworks model in [3].
This change allows the lazy load of the InterfaceDataNetwork model so
no error messages are logged by sysinv.
[1] https://review.opendev.org/c/starlingx/config/+/826340
[2] https://docs.sqlalchemy.org/en/14/orm/loading_relationships.html
[3] https://review.opendev.org/c/starlingx/config/+/657645
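A minimal, self-contained sketch of the kind of relationship() change
described above; the models below are simplified stand-ins and the
loading strategy actually chosen in sysinv may differ:

  from sqlalchemy import Column, ForeignKey, Integer
  from sqlalchemy.orm import declarative_base, relationship

  Base = declarative_base()

  class InterfaceDataNetworks(Base):        # simplified stand-in model
      __tablename__ = 'interface_datanetworks'
      id = Column(Integer, primary_key=True)
      interface_id = Column(Integer, ForeignKey('interfaces.id'))

  class Interfaces(Base):                   # simplified stand-in model
      __tablename__ = 'interfaces'
      id = Column(Integer, primary_key=True)
      # Loading the related rows together with the parent (lazy='joined')
      # means the data is already present when the object is used outside
      # of its originating session, avoiding DetachedInstanceError.
      interface_datanetworks = relationship(
          'InterfaceDataNetworks', lazy='joined', backref='interface')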
Test Plan:
PASS - Build sysinv package
PASS - Build ISO with new sysinv package
PASS - Install and bootstrap an AIO-SX machine
PASS - Perform a lock/unlock on an AIO-SX machine
PASS - Apply STX-Openstack app
PASS - Re-apply STX-Openstack app
PASS - Visual inspection of sysinv.log shows nothing unusual
Closes-Bug: 1998512
Change-Id: Ic5b168e1a01dc53aa3f8658547c1f4776e681cdc
Signed-off-by: Lucas de Ataides <lucas.deataidesbarreto@windriver.com>
Removal of PSP support as part of the k8s 1.25/1.26 transition,
adding a check in the system health-query that there are
NO PSP policies present in the cluster. PodSecurityPolicy was
deprecated and is removed as of Kubernetes v1.25.
More information about the removal of PodSecurityPolicy is available
in the Kubernetes blog:
https://kubernetes.io/blog/2021/04/06/podsecuritypolicy-deprecation
-past-present-and-future/
The check should FAIL the scenario and log the error output in the
sysinv log with the PSP resource names existing in the cluster,
asking the user to remove them before the upgrade.
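A rough sketch of such a check (illustrative only; the actual
health-query implementation in sysinv may query the API differently):

  import subprocess

  def remaining_psp_names():
      """List the PSP resources still present in the cluster."""
      # '-o name' prints entries such as 'podsecuritypolicy.policy/privileged'
      out = subprocess.check_output(
          ["kubectl", "get", "podsecuritypolicies", "-o", "name"], text=True)
      return [line.split("/", 1)[-1] for line in out.splitlines() if line]

  def psp_health_ok():
      names = remaining_psp_names()
      if names:
          print("PSP policies exist, please remove them before upgrade: "
                + ", ".join(names))
          return False
      return True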
Test Plan
AIO-SX: Perform system health-query
PASS: ISO creation
PASS: bootstrap
PASS: RUN system health-query with no PSP policies
Output:
System Health:
All hosts are provisioned: [OK]
All hosts are unlocked/enabled: [OK]
All hosts have current configurations: [OK]
All hosts are patch current: [OK]
No alarms: [OK]
All kubernetes nodes are ready: [OK]
All kubernetes control plane pods are ready: [OK]
All PodSecurityPolicies are removed: [OK]
PASS: RUN system health-query with existing PSP policies
Output:
System Health:
All hosts are provisioned: [OK]
All hosts are unlocked/enabled: [OK]
All hosts have current configurations: [OK]
All hosts are patch current: [OK]
No alarms: [OK]
All kubernetes nodes are ready: [OK]
All kubernetes control plane pods are ready: [OK]
All PodSecurityPolicies are removed: [Fail]
PSP policies exists, please remove them before
upgrade: privileged, restricted
PASS: RUN system health-query-kube-upgrade with no PSP policies
PASS: RUN system health-query-kube-upgrade with existing PSP policies
All hosts are provisioned: [OK]
All hosts are unlocked/enabled: [OK]
All hosts have current configurations: [OK]
All hosts are patch current: [OK]
No alarms: [OK]
All kubernetes nodes are ready: [OK]
All kubernetes control plane pods are ready: [OK]
All PodSecurityPolicies are removed: [Fail]
PSP policies exists, please remove them before
upgrade: privileged, restricted
All kubernetes applications are in a valid state: [OK]
PASS: RUN system health-query-upgrade with no PSP policies
PASS: RUN system health-query-upgrade with existing PSP policies
All hosts are provisioned: [OK]
All hosts are unlocked/enabled: [OK]
All hosts have current configurations: [OK]
All hosts are patch current: [OK]
No alarms: [OK]
All kubernetes nodes are ready: [OK]
All kubernetes control plane pods are ready: [OK]
All PodSecurityPolicies are removed: [Fail]
PSP policies exists, please remove them before
upgrade: privileged, restricted
No imported load found. Unable to test further
Story: 2010590
Task: 48145
Change-Id: I3787bdce505c2d18f5312fc32e95c507d8916b3d
Signed-off-by: Rahul Roshan Kachchap <rahulroshan.kachchap@windriver.com>
New openldap users with sudo permissions are not able to exercise
their sudo capabilities for up to 15 minutes after creation
because the sudo rules refresh interval has the default value
of 15 minutes.
This commit reduces the sudo rules refresh interval to 5 minutes to
improve usability. This change will only be done for the local
openldap server.
The interval value was chosen to match the value of the
"ldap_enumeration_refresh_timeout" attribute, which specifies how
long SSSD has to wait before refreshing its cache of enumerated
records.
Due to performance considerations, it is not advisable to reduce the
sudo rules refresh time for WAD servers because of their large
number of users. Reducing the refresh interval can have a negative
performance impact.
This commit also adjusts the sudo rules search criteria.
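For illustration only, assuming the interval being tuned is the
standard SSSD "ldap_sudo_smart_refresh_interval" option (the actual
option and the template that renders it may differ), the local
openldap domain section of sssd.conf would end up with roughly:

  [domain/<local-openldap-domain>]
  ldap_sudo_smart_refresh_interval = 300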
Test Plan:
PASS: Successful install in AIO-SX system configuration.
PASS: Create a new openldap user with sudo permissions and verify
that sudo capabilities are available for the user within a
maximum of 300 sec (5 min).
PASS: Verify remote ssh connection for a new openldap user.
PASS: Verify SSSD sudo rules search works as expected for openldap
and WAD servers.
Story: 2010589
Task: 48164
Signed-off-by: Carmen Rata <carmen.rata@windriver.com>
Change-Id: Ieb8e23068b82e09c3feeec4c8317d32d45ff64e6
The recent PyYAML upversion issues would have been seen
in Zuul/Tox if this check had been enabled.
These changes have no impact on the runtime code.
This also disables the large import graph report at the end
of pylint executions, which makes it difficult to see pylint
errors.
Test Plan:
Pass: tox
Story: 2010642
Task: 48161
Signed-off-by: Al Bailey <al.bailey@windriver.com>
Change-Id: Ib0c27b2dfebea0ef82345da9a4935b79f00daa5a
After upversioning the python3-yaml package [1], the yaml.load()
method signature started requiring Loader as a mandatory argument [2].
This change updates the sysinv code entries where yaml.load() was
called without specifying a loader, so that they are now compatible
with the newest signature of the yaml load calls.
[1] https://review.opendev.org/c/starlingx/tools/+/881280
[2] https://pyyaml.org/wiki/PyYAMLDocumentation
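A minimal sketch of the adjustment (the loader class chosen at each
sysinv call site may vary):

  import yaml

  document = "key: value"

  # Old call, no longer valid with the upversioned PyYAML:
  #   data = yaml.load(document)
  # Updated call, explicitly selecting a loader:
  data = yaml.load(document, Loader=yaml.SafeLoader)
  # Equivalent shorthand when safe loading is sufficient:
  data = yaml.safe_load(document)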
TEST PLAN:
PASS - Build sysinv package
PASS - Build a new STX ISO
PASS - Deploy the new ISO (virtual AIO-SX)
Closes-Bug: 2021672
Signed-off-by: Thales Elero Cervi <thaleselero.cervi@windriver.com>
Change-Id: I36565dbb9ca65234e596578e2f2e4adeb3e1628a
Pods of the Intel GPU device plugin will only be created on
nodes that support Intel GPUs with the i915 driver and carry the
label "intelgpu: enabled".
With this commit, the sysinv agent checks the host GPU device
driver. Once a supported device is detected, the sysinv agent
sends a request to the sysinv conductor, and the conductor sets the
Kubernetes label "intelgpu: enabled" on the specific node if the
file "/opt/platform/config/22.12/enabled_kube_plugins" exists and
"intelgpu" is in the file.
TEST PLAN:
PASS: Verified the "intelgpu: enabled" label using the
"kubectl get nodes controller-0 --show-labels" command.
PASS: Checked whether the daemonset pods are running or not, using the
command "kubectl get ds -A | grep 'gpu-plugin'"
Note: No GPU hardware is available, so all testing was done using
commands.
Story: 2010604
Task: 47854
Depends-On: https://review.opendev.org/c/starlingx/ansible-playbooks/+/884395
Signed-off-by: Md Irshad Sheikh <mdirshad.sheikh@windriver.com>
Change-Id: I0f6498dc473ddadacfa361ffdcabdbce6c839eba
Adds a sysinv function to get the output from the migration scripts'
execution at the subcloud upgrade activating or starting step.
This allows the orchestrator to know the result of the migration
scripts for each subcloud.
Test plan:
PASS: Modify migration script involving action = 'start'
in order to make the starting upgrade step fail.
Run subcloud upgrade strategy.
Verify msg in /tmp/upgrade_fail_msg
PASS: Modify migration script involving action = 'activate'
in order to make the activating upgrade step fail.
Run upgrade subcloud strategy.
Verify msg in /tmp/upgrade_fail_msg
PASS: Create (modifying) a new app tarball
(platform-integ-apps) to make it recognizable but
not applicable. Run strategy and
verify msg in /tmp/upgrade_fail_msg
Story: 2010768
Task: 48078
Signed-off-by: fperez <fabrizio.perez@windriver.com>
Change-Id: I7e3547faeaa6c971afce7a6aa7445c8d71133558
This change adds firewall rules for non-DC installations that limit
the traffic allowed on the management, cluster-host, pxeboot,
and storage interfaces to the local platform network; any outside
traffic will not be accepted. The filtering only looks at the source
IP address to allow or deny ingress into the node.
For IPv4, DHCP's UDP port 67 is allowed without source address
filtering because the DHCP messages sent by the client come with
the source address "0.0.0.0".
For IPv6, the link-local network address (fe80::/64) is also added to
the source nets to allow the cases where the initial communication,
inside the local network, starts with a link-local source address
(e.g. the DHCPv6-Solicit message).
Also for IPv6, the cluster-pod network is added to the cluster-host
firewall; this is done because, unlike in IPv4 setups, the pod
traffic is not tunneled and runs over the cluster-host interface.
DC scenarios will be part of a future task.
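As a rough illustration of the resulting ingress policy (the subnets
below are examples only, not the generated rules):

  # Per-interface allowed source networks; packets whose source address
  # falls outside these networks are dropped.
  allowed_sources_v4 = {
      "mgmt":         ["192.168.204.0/24"],
      "cluster-host": ["192.168.206.0/24"],
      "pxeboot":      ["169.254.202.0/24"],
      "storage":      ["10.10.20.0/24"],
  }
  # IPv6 adds the link-local prefix and, on cluster-host, the
  # cluster-pod network because pod traffic is not tunneled.
  allowed_sources_v6 = {
      "mgmt":         ["fd00:204::/64", "fe80::/64"],
      "cluster-host": ["fd00:206::/64", "fe80::/64",
                       "fd00:efef::/64"],   # example cluster-pod network
  }
  # IPv4 exception: UDP port 67 (DHCP) is accepted without source
  # filtering, since client messages arrive from 0.0.0.0.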
In all test scenarios below, the presence of the correct
iptables/ip6tables rules was verified, together with traffic tests
using netcat run directly from the host network.
Test Plan
[PASS] Install Standard (controller+worker+storage) in IPv6
[PASS] Install Standard (controller+worker+storage) in IPv4
[PASS] Install AIO-DX in IPv4
[PASS] Install AIO-DX in IPv6
[PASS] Controller Lock/Unlock/Reinstall
[PASS] Worker Lock/Unlock/Reinstall
[PASS] Storage Lock/Unlock/Reinstall
[PASS] Change HTTP port during runtime
[PASS] IPv4 pod-to-pod communication between nodes
[PASS] IPv6 pod-to-pod communication between nodes
[PASS] Validate WRA installation
[PASS] Validate WRO installation
Story: 2010591
Task: 48088
Change-Id: I453c7cd8fcb9e63eb9a5c9d321ca6b504ea21e0d
Signed-off-by: Andre Kantek <andrefernandozanella.kantek@windriver.com>
There was no lifecycle hook during recovery.
Thus, in platform-integ-apps it was not possible
to identify when a recovery happened.
To solve this, a new operation constant was created and
the lifecycle hook was triggered inside _perform_app_recover().
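A minimal sketch of the idea (the constant and hook names below are
hypothetical placeholders, not the actual sysinv identifiers):

  APP_RECOVER_OP = 'recover'            # hypothetical operation constant

  def _perform_app_recover(app, lifecycle_hook):
      # ... recovery logic (re-applying the old version) would run here ...
      # Trigger the lifecycle hook so plugins such as platform-integ-apps
      # can react to the recovery.
      lifecycle_hook(app, operation=APP_RECOVER_OP)

  def example_hook(app, operation):
      print("lifecycle hook for %s, operation=%s" % (app, operation))

  _perform_app_recover("platform-integ-apps", example_hook)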
Test Plan:
PASS: When forcing recovery, it was possible to observe the
lifecycle with the "recover" operation (SX/DX/Storage)
Story: 2010688
Task: 48081
Change-Id: I44447ca2246a8461d98f8ea64e2e16c127c357a6
Signed-off-by: Erickson Silva de Oliveira <Erickson.SilvadeOliveira@windriver.com>
This commit restores the openstack/common/context.py file that was
deleted in commit 237ddb63830b2a660ab624c84a3c2fe7614e4d4a; this is
necessary for the 22.12 to 23.09 upgrade.
AIO-DX: Pass upgrade 22.12 -> 23.09
Story: 2010651
Task: 47915
Change-Id: I6a3c91ccf47b9785e6d510f53a3f7093a1f804f8
Signed-off-by: Caio Cesar Ferreira <Caio.CesarFerreira@windriver.com>
In an SX with the ceph backend configured and platform-integ-apps not
applied, an unhandled error is shown when the command "system
modify --system_mode=duplex" is executed. After this error, the
system_mode in the sysinv db is changed to "duplex" and cannot be
changed back to simplex, blocking further system CLI operations.
An SX with the ceph backend configured must have platform-integ-apps
applied in order to migrate to DX. With the fix, there is a
validation checking that the system storage backend is ready for a
migration, raising a proper error if it is not. In addition, if
the system is not ready, there are no changes in the sysinv db.
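A simplified sketch of the validation described above (the helper and
argument names are illustrative, not the actual sysinv API):

  class SystemModeError(Exception):
      pass

  def check_ready_for_duplex(ceph_configured, platform_integ_apps_status):
      """Reject the simplex-to-duplex change when ceph is configured but
      platform-integ-apps is not applied, before touching the sysinv db."""
      if ceph_configured and platform_integ_apps_status != "applied":
          raise SystemModeError(
              "Cannot modify system mode to duplex: platform-integ-apps "
              "must be applied first (current status: %s)"
              % platform_integ_apps_status)

  check_ready_for_duplex(True, "applied")      # passes
  # check_ready_for_duplex(True, "uploaded")   # raises before any db change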
Test-Plan:
PASS: AIO-SX fresh install with ceph backend
PASS: Use system modify --system_mode=duplex with
platform-integ-apps not applied (uploaded or not)
PASS: Verify the error message "Cannot modify system mode..."
is printed on CLI
PASS: Verify system_mode is still 'simplex' in the sysinv db
PASS: Apply platform-integ-apps, use the same command, proceed
with the migration and validate that it was successful
PASS: AIO-SX fresh install without Ceph backend + successful
migration to DX
Closes-Bug: 2013069
Signed-off-by: Gabriel de Araújo Cabral <gabriel.cabral@windriver.com>
Change-Id: Ib05c675b54c2b9fcfa3e7f92affec9ca37245df2
The cordon and uncordon commands are not necessary to execute during
a k8s upgrade in AIO-SX. This change makes the cordon and uncordon
system commands optional.
Test Plan:
PASS: Fresh install ISO as AIO-SX. Successfully upgraded from 1.23.1
to 1.24 using the manual K8s upgrade without executing cordon
and uncordon system commands. Successfully upgraded from 1.24
to 1.25 with the cordon and uncordon system commands.
After executing the cordon system command some of the pods went to
the 'Pending' status. After executing the uncordon system command
the pending pods returned to the 'Running' status.
Story: 2010565
Task: 48042
Change-Id: Ia4b7b8345d33cb6662c6de6fbb13d6314e4c109f
Signed-off-by: Ramesh Kumar Sivanandam <rameshkumar.sivanandam@windriver.com>
WAD groups discovered by SSSD and imported in the stx platform
need to have Linux IDs so that WAD users in these groups can perform
privileged operations according to the group permissions.
An example would be the "sys_protected" group. In order to allow
WAD users in the "sys_protected" group to execute privileged
operations with the stx platform applications, the same way a native
stx platform user would do, the "sys_protected" group needs to be
assigned the GID number "345" when discovered with SSSD.
This commit configures SSSD to achieve that, because by default
the WAD users/groups are mapped to Linux users/groups on the stx
platform using Windows Security Identifiers (SIDs).
On the WAD server, the "sys_protected" WAD group's Posix schema
attribute "gidNumber" would have been populated already as "345",
before the SSSD connects to WAD server. Similarly, the WAD user's
"uidNumber" attribute needs to be populated in the WAD server.
This commit also optimizes the SSSD sudo rules search.
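For illustration only, assuming this is done through SSSD's standard
ID-mapping switch (the exact attributes set in the generated sssd.conf
may differ), disabling SID-based mapping for the WAD domain so that
the Posix uidNumber/gidNumber attributes are used would look roughly
like:

  [domain/<wad-domain>]
  ldap_id_mapping = False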
Test Plan:
PASS: Successful install in AIO-SX system configuration.
PASS: The Linux uid and gid configuration for users and groups
respectively is configured correctly in sssd.conf.
PASS: SSSD service is successfully started.
PASS: Verify SSSD caches WAD users and groups and they have
the Linux IDs set correctly.
PASS: Verify remote ssh connection for discovered WAD ldap users.
PASS: Verify WAD users in "sys_protected" WAD group can perform
privileged operations like "source /etc/platform/openrc".
PASS: SSSD sudo rules search works as expected and the sudo rules
are discovered.
Story: 2010589
Task: 48010
Signed-off-by: Carmen Rata <carmen.rata@windriver.com>
Change-Id: I452b1097c607cd270bd56f03f7eba0d1f21f325c
It was detected that the IPv6 firewall was selecting ICMP instead of
ICMPv6 for IPv6 installations. Each protocol has a different number,
and that number is used in ip6tables.
This change selects the correct protocol for IPv6.
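A small sketch of the distinction (the protocol numbers come from the
IANA registry; the selection logic in the actual code may be expressed
differently):

  # IP protocol numbers: ICMP is 1 (IPv4 only), ICMPv6 is 58.
  ICMP_PROTOCOL = {4: "icmp", 6: "ipv6-icmp"}

  def icmp_protocol_for(ip_version):
      """Protocol name to use in iptables/ip6tables rules."""
      return ICMP_PROTOCOL[ip_version]

  assert icmp_protocol_for(6) == "ipv6-icmp"   # ip6tables must match ICMPv6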
Test Plan:
[PASS] Install AIO-DX IPv6
Story: 2010591
Task: 48007
Change-Id: I2be10999a362328fc730728052a9eb34c364af4d
Signed-off-by: Andre Kantek <andrefernandozanella.kantek@windriver.com>
Update the application rollback strategy for proper compatibility
with Flux.
This fixes a bug where applications fail to downgrade on DX and
Standard systems due to an incompatibility between Flux-based apps
and the previously existing rollback approach used for Armada.
The deprecated code used for rolling back Armada-based applications was
removed. That code was still being called and causing an exception to be
raised due to Armada related chart attributes not being available
anymore.
The rollback to a previous version is now done by applying that
version using "kubectl apply -k <manifest dir>". That way Flux is able
to detect the version we are rolling back to and properly applies it.
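A minimal sketch of that rollback step (illustrative; the real code
derives the manifest directory from the application record):

  import subprocess

  def rollback_app_version(manifest_dir):
      """Re-apply the kustomize manifests of the version being rolled
      back to, letting Flux reconcile the app to that version."""
      subprocess.check_call(["kubectl", "apply", "-k", manifest_dir])

  # Hypothetical usage with an example manifest directory:
  # rollback_app_version("/opt/platform/fluxcd/cert-manager/1.0-64")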
Test Plan:
PASS: build-pkgs -a
PASS: build-image
PASS: full AIO-DX install
PASS: update cert-manager-1.0-64.tgz to cert-manager-1.0-65.tgz then
check with "helm release -A -a" if chart and app versions were
properly updated.
PASS: downgrade cert-manager-1.0-65.tgz to cert-manager-1.0-64.tgz then
check with "helm release -A -a" if chart and app versions were
properly downgraded.
PASS: full AIO-SX install
Closes-Bug: 2019259
Signed-off-by: Igor Soares <Igor.PiresSoares@windriver.com>
Change-Id: Ice1e4d58ff228aea1d4d530e4679ee07263d83f9
This change adds the necessary sysinv.puppet class to generate the
hiera data for puppet. It associates the network type firewall
with the desired platform interface.
Since this is an ongoing feature, the new firewalls will only contain
filtering by protocol, allowing all TCP, UDP, and ICMP to ingress
the node (in practice there will be no blocks).
It also skips the DC setups, as they will be handled in a later phase.
The OAM network firewall already exists using another implementation
(directly in puppet) and for now it will not be handled by this
implementation. A future task will be opened to unify the solutions.
If the change 881496 in stx-puppet is merged later, we would generate
the hiera data but there would be no class in puppet to read it. That
would not break the operation, but I'm adding a Depends-On to prevent
this situation.
The unit tests already cover all the code, using mock and database
entries.
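A rough sketch of the kind of hiera data being generated (keys,
interfaces, and structure are hypothetical examples, not the actual
hiera schema):

  firewall_hiera = {
      'platform::firewall::mgmt::config': {          # hypothetical key
          'interface': 'enp0s8',                     # example interface
          'protocols': ['tcp', 'udp', 'icmp'],       # no blocks for now
      },
      'platform::firewall::cluster_host::config': {  # hypothetical key
          'interface': 'enp0s9',
          'protocols': ['tcp', 'udp', 'icmp'],
      },
  }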
Test Plan:
[PASS] install AIO-SX
[PASS] install Standard with DX+Worker+Storage nodes
[PASS] modify http_port and check the firewall is reapplied
in runtime
Story: 2010591
Task: 47956
Depends-On: https://review.opendev.org/c/starlingx/stx-puppet/+/881496
Change-Id: Ia693cd36873188ce4a7f8daba4a909c6e7a27813
Signed-off-by: Andre Kantek <andrefernandozanella.kantek@windriver.com>
This commit updates the extract playbooks plugin to search for the
packages in the patches directory first and, if not found, in the
feed directory.
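A minimal sketch of that lookup order (the directory paths would come
from the plugin configuration; the names here are placeholders):

  import os

  def find_package(name, patches_dir, feed_dir):
      """Return the package path from the patches directory when present,
      otherwise fall back to the feed directory (or None)."""
      for directory in (patches_dir, feed_dir):
          candidate = os.path.join(directory, name)
          if os.path.exists(candidate):
              return candidate
      return None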
Test Plan:
PASSED: Playbooks extracted from patches directory (CentOS ISO
pre-patched)
PASSED: Playbooks extracted from feed directory (CentOS ISO with no
patches)
Story: 2010611
Task: 47967
Depends-On: https://review.opendev.org/c/starlingx/config/+/882856
Signed-off-by: Guilherme Schons <guilherme.dossantosschons@windriver.com>
Change-Id: I67b2a816312cbbc3ab2a16e8f9756b722ca7d052