Restarting sysinv while an application is applying results in an
incorrect status after the reset. For example, the cert-manager status
is reset to 'apply-failed' instead of 'uploaded'.
When sysinv is restarted, app operations that are in progress are
reset. When apps were decoupled from sysinv [1], a requirement to
have the app metadata loaded was introduced.
Tests on AIO-SX:
PASS: deploy, unlocked enabled available
PASS: forced 'cert-manager' to be 'applying', forced sysinv conductor
restart, observed status was reset to 'uploaded'.
[1]: https://review.opendev.org/c/starlingx/config/+/774292/10/sysinv/sysinv/sysinv/sysinv/conductor/kube_app.py#333
Partial-Bug: 2003198
Signed-off-by: Dan Voiculeasa <dan.voiculeasa@windriver.com>
Change-Id: Ibefc6362c7a7f03571be3cf35b6592cf0c68bca3
An error is seen in the AppFramework during upgrades from release N to
N+2.
Specifically during [1], cert-manager is not properly removed,
preventing the new version apply:
ERROR sysinv.conductor.kube_app [-] Unsupported armada request: remove.
When [2] was introduced, one request for app removal to armada/fluxcd
was renamed in one place from APP_DELETE_OP to APP_REMOVE_OP. This
needs to be corrected for the armada case to support N to N+2 upgrades.
Allow the armada operation to be named APP_DELETE_OP as it was before.
Tested on AIO-SX upgrade from stx.6.0 to master(soon to be stx.8.0).
Had other patches applied to the system, but will address those
issues later.
PASS: cert-manager updated during upgrade script 64-.
[1]: 09981f9d90/controllerconfig/controllerconfig/upgrade-scripts/64-upgrade-cert-manager.sh (L168)
[2]: https://review.opendev.org/c/starlingx/config/+/866200/5/sysinv/sysinv/sysinv/sysinv/conductor/kube_app.py#3346
Story: 2009303
Task: 47135
Signed-off-by: Dan Voiculeasa <dan.voiculeasa@windriver.com>
Change-Id: Ia57980de2acac7d510e01903c16596b90bee3b4c
Anticipate failure for corner cases in which application apply
operations time out due to another operation in progress.
Helm resource statuses are parsed in order to match that
specific case and report a failure before the timeout is reached.
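A minimal sketch of the kind of status check described above, assuming
release data shaped like `helm list -o json` output. The helper name
and input shape are illustrative, not the actual sysinv implementation:

```python
# Sketch: flag releases whose status shows another Helm operation is
# still in progress, so the apply can fail fast instead of waiting for
# the timeout. Input mirrors `helm list -o json`; names are examples.
PENDING_STATUSES = {"pending-install", "pending-upgrade",
                    "pending-rollback", "uninstalling"}

def blocked_releases(releases):
    """Return names of releases stuck in a pending Helm operation."""
    return [r["name"] for r in releases
            if r.get("status") in PENDING_STATUSES]
```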
Test Plan:
PASS: AIO-SX full build and deployment
PASS: Apply app with no exceptions
Closes-Bug: 2002311
Signed-off-by: Igor Soares <igor.piressoares@windriver.com>
Change-Id: Idd145fe10a9b6b5705f42a2726a42143aa46faed
Add the name of the last chart applied to FluxCD application apply output
and to the log.
The last chart applied is based on the most recent successful status for
a given release.
Test Plan:
PASS: AIO-SX full deployment
PASS: platform-integ-apps removal and apply
PASS: cert-manager removal and apply
Story: 2009138
Task: 47062
Signed-off-by: Igor Soares <Igor.PiresSoares@windriver.com>
Change-Id: I85fa375e11fda78d95ff34857e51cef59eb0fdb4
Because of the FluxCD upversion described in [1], we no longer need the
recovery logic that flips spec.suspend. Flux is expected to properly
reconcile the resources.
Remove the recovery logic concerned with flipping spec.suspend.
Remove the optimizations that triggered reconciliation by flipping
spec.suspend.
Disclaimer for tests:
1) This was applied on top of [1].
2) cert-manager, nginx-ingress-controller, platform-integ-apps had the
reconciliation interval decreased to 1m to allow Flux to manage the
resources by itself in a reasonable time interval.
There will be future commits per app updating reconciliation interval.
Tests on AIO-SX:
PASS: bootstrap
PASS: unlocked enabled available
PASS: apps applied
PASS: inspect flux pod logs for errors
PASS: re-test known trigger for 1996747 and 1995748
PASS: re-test known trigger 1997368
[1]: https://review.opendev.org/c/starlingx/ansible-playbooks/+/866820/
Depends-On: https://review.opendev.org/c/starlingx/ansible-playbooks/+/866820/
Related-Bug: 1995748
Related-Bug: 1996747
Related-Bug: 1997368
Partial-Bug: 1999032
Signed-off-by: Dan Voiculeasa <dan.voiculeasa@windriver.com>
Change-Id: I932d85d8b366479b2c1d2c88a0acf7fad219b131
At the moment, flux doesn't delete helm releases if they
have a running operation (eg. HR in 'pending-install' status).
That causes the app remove operation to get stuck and time out because
some resources fail to terminate, most commonly after a failed
application apply operation.
Until this flux behaviour gets changed, we need to uninstall
these 'stuck' releases in helm directly.
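The direct-uninstall fallback could look roughly like the following
sketch, which only builds the `helm uninstall` command lines; release
and namespace names are examples, and the real logic lives in sysinv's
app removal path:

```python
# Sketch: build one `helm uninstall` argv per release stuck in a
# pending state, which Flux refuses to delete. The real implementation
# runs these against the cluster; here we only construct the commands.
PENDING_STATUSES = {"pending-install", "pending-upgrade",
                    "pending-rollback"}

def uninstall_cmds(releases):
    """One `helm uninstall` argv per stuck release."""
    return [["helm", "uninstall", r["name"], "-n", r["namespace"]]
            for r in releases if r.get("status") in PENDING_STATUSES]
```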
Test Plan:
PASS Cause a HR to get stuck by tainting the node before applying
the app, then successfully remove the app
Closes-Bug: 1998384
Signed-off-by: Leonardo Fagundes Luz Serrano <Leonardo.FagundesLuzSerrano@windriver.com>
Change-Id: I7466d61f79129b8f70f8a97f8968549f5823d811
Add support to preserve the app attributes from the old version
when updating app to a new version.
The key "maintain_attributes" in the application metadata file
indicates whether the app attributes will be reused during the
update. The user can specify --reuse-attributes <true/false>
to override the metadata preference specified by the application.
The database column which stores the app attributes is called
system_overrides (table helm_overrides). When attributes are mentioned
in the code, they refer to the property stored in the system_overrides
column in the database. That property is shown to the user as
attributes. The naming confusion will be fixed later.
Test Plan:
PASS: Update app without specifying --reuse-attributes
PASS: Update app without specifying --reuse-attributes; app metadata
      defaults to maintain_attributes=true
PASS: Update app specifying --reuse-attributes false
PASS: Update app specifying --reuse-attributes true
PASS: Disabled helm chart stays disabled with update
Closes-Bug: https://bugs.launchpad.net/starlingx/+bug/1998499
Signed-off-by: Fabricio Henrique Ramos <fabriciohenrique.ramos@windriver.com>
Change-Id: I0f9c5c7314deb10f89853c9e5c8e15daf99580ed
Add some robustness to the app framework. It is observed that the
framework can reach a state where helm charts are not uploaded to the
HelmRepository. This leads to the app framework waiting for the
HelmRepository reconciliation to be fired. Currently the
reconciliation interval is set to 60 minutes for every app checked.
The issue becomes obvious when updating the app to use newer HelmCharts.
HelmChart observed status is '''chart pull error: failed to get chart
version for remote reference: no chart name found''' which is a
string the recovery logic will attempt to recover from.
Update recovery logic to trigger a HelmRepository reconciliation
before a HelmChart reconciliation.
Skip CentOS testing because we use the same fluxcd and kubernetes.
The only difference is the python kubernetes library, but the
implementation does not use any new API calls.
Tests on AIO-SX Debian:
PASS: AIO-SX unlocked enabled available
PASS: inspect logs to see HelmRepository
reconciliation is triggered by the recovery logic.
Closes-Bug: 1995748
Signed-off-by: Dan Voiculeasa <dan.voiculeasa@windriver.com>
Change-Id: I34ae586a5a267b636164d011b5fa5d44ce8c9a6c
It is observed that when a helm release is in a pending state, another
helm release can't be started by FluxCD. FluxCD will not take the
steps to apply the newer helm release, but will just error out.
This prevents us from applying a new helm release over a release with
pods stuck in Pending state (just an example).
When the specific message for helm operation in progress is detected,
attempt to recover by moving the older releases to failed state.
The move is inspired by [1].
To do so, patch the helm secret for the specific release.
As an optimization, trigger the FluxCD HelmRelease reconciliation right
after.
One future optimization we can do is run an audit to delete the helm
releases for which metadata status is a pending operation, but release
data is failed (resource that we patched in this commit).
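Helm 3 stores release state in a Secret whose 'release' field is,
after the Kubernetes base64 layer is removed, base64-encoded gzipped
JSON. A sketch of the patch step under that assumption (the actual
change applies the result back through the Kubernetes API):

```python
import base64
import gzip
import json

def mark_release_failed(encoded_release):
    """Given the helm-layer payload (base64 of gzipped JSON, i.e. the
    Secret's 'release' field after the Kubernetes base64 layer is
    stripped), set the release status to 'failed' and re-encode."""
    raw = gzip.decompress(base64.b64decode(encoded_release))
    release = json.loads(raw)
    release.setdefault("info", {})["status"] = "failed"
    out = gzip.compress(json.dumps(release).encode("utf-8"))
    return base64.b64encode(out)
```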
Refactor the HelmRelease resource reconciliation trigger into a
smaller helper.
There are upstream references related to this bug, see [2] and [3].
Tests on Debian AIO-SX:
PASS: unlocked enabled available
PASS: platform-integ-apps applied
after reproducing error:
PASS: inspect sysinv logs, see recovery is attempted
PASS: inspect fluxcd logs, see that HelmRelease reconciliation is
triggered as part of recovery
[1]: https://github.com/porter-dev/porter/pull/1685/files
[2]: https://github.com/helm/helm/issues/8987
[3]: https://github.com/helm/helm/issues/4558
Closes-Bug: 1997368
Signed-off-by: Dan Voiculeasa <dan.voiculeasa@windriver.com>
Change-Id: I36116ce8d298cc97194062b75db64541661ce84d
A kubernetes custom resource may not have been created by the time
recovery logic queries it. In such cases the API returns a None value.
Enhance [1].
Fix corner case by accounting for None values.
Also fix another potential corner case for 'conditions' attribute.
Tests:
PASS: tested some inputs for extract_helm_chart_status function
[1]: https://review.opendev.org/c/starlingx/config/+/864543
Depends-On: https://review.opendev.org/c/starlingx/config/+/864543
Closes-Bug: 1996747
Signed-off-by: Dan Voiculeasa <dan.voiculeasa@windriver.com>
Change-Id: I8c8bf46e28655b6833db8c8e72030a656186922b
Add some robustness to the app framework. It is observed that the
framework can reach a bad state where a HelmRelease apply is attempted
but fails because the HelmChart is already in a failed state.
When applying a HelmRelease FluxCD doesn't trigger a HelmChart
reconciliation, thus HelmChart is in failed state and HelmRelease is
stuck waiting for it.
The framework times out after 1 hour of waiting for FluxCD to finish
the apply operation, which never happens.
Implement logic to trigger a HelmChart reconciliation.
Implement optimization to trigger HelmRelease reconciliaton right
after HelmChart reconciliation, to skip the time FluxCD waits until
periodic reconciliation is triggered.
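Flux watches the well-known reconcile.fluxcd.io/requestedAt annotation
to trigger an immediate reconciliation. A sketch of the merge-patch
body only (the real code applies it with the Kubernetes custom-objects
API; the helper name is illustrative):

```python
from datetime import datetime, timezone

RECONCILE_ANNOTATION = "reconcile.fluxcd.io/requestedAt"

def reconcile_patch(now=None):
    """Build the merge-patch body that asks Flux to reconcile a
    HelmChart/HelmRelease now, instead of waiting for the periodic
    interval. A fresh timestamp value triggers a new reconciliation."""
    stamp = (now or datetime.now(timezone.utc)).isoformat()
    return {"metadata": {"annotations": {RECONCILE_ANNOTATION: stamp}}}
```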
Skip CentOS testing because we use the same fluxcd and kubernetes.
The only difference is the python kubernetes library, but the
implementation does not use any new API calls.
Tests on AIO-SX Debian:
PASS: AIO-SX unlocked enabled available
PASS: Can apply app when apply started with a HelmChart
for a platform-integ-apps in Failed state
PASS: Can apply app when apply started with both a HelmChart and a
HelmRelease for a platform-integ-apps in Failed state
Closes-Bug: 1996747
Signed-off-by: Dan Voiculeasa <dan.voiculeasa@windriver.com>
Change-Id: Iea19c5458bb1c6d739dfe0ef94af09c85e90cd9a
On Debian, the application-update command is not functional.
Investigation showed that the pkg_resources cache logic needs to be
updated. The cache logic currently works only when paths are not links
(realpath = path).
When the paths are links, as is the case on Debian (ostree links
/var/rootdirs/opt to /opt), pkg_resources data is created for both the
real and the provided path, preventing the app framework cache logic
from functioning correctly. sys.modules is not purged correctly, so
the stevedore 'distribution' location cannot be computed correctly,
resulting in an exception that prevents app updates.
There are 2 affected structures used from pkg_resources.working_set:
'entry_keys' and 'entries'.
On Debian we observe this behavior (two cases):
1) After freshly launching sysinv, if the plugins path was already
present in the system (/var/stx_app/plugins/ populated), an entry is
created in pkg_resources.working_set.entries only for the provided
path, not for the real path.
But entry_keys are created for both the real and the provided path.
2) After one initial purge, we observe the app framework only uses
the real path for 'entry_keys' and 'entries'.
[
Personal note:
Based on the above two observations I will assume that at python3
interpreter initialization, after scanning the .pth files in
/var/stx_app/plugins, pkg_resources is used in such a way that the
cache is built for the provided path rather than the real path;
otherwise we would have seen only the real path used in the cache.
Probably something else is used instead of what we do in the
activate_plugins function.
]
Delete the cached pkg_resources.working_set structures for both the
real and the provided path.
Switch to iterating 'entry_keys' using real path.
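The purge described above can be sketched as follows; a plain dict
stands in for the real pkg_resources.working_set object, so the names
and shapes here are illustrative only:

```python
import os

def purge_plugin_cache(working_set, path):
    """Drop cached entries for both the provided path and its realpath
    from structures shaped like pkg_resources.working_set.entries and
    .entry_keys. On Debian the two differ because /opt is an ostree
    link, so purging only one of them leaves stale cache behind."""
    for p in {path, os.path.realpath(path)}:
        if p in working_set["entries"]:
            working_set["entries"].remove(p)
        working_set["entry_keys"].pop(p, None)
    return working_set
```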
Tests on Debian:
PASS: AIO-SX unlocked enabled available
PASS: application-update works
Skip tests on CentOS: this is just a refactor since plugin realpath
doesn't change. Confirmed that 'realpath = provided path' for one of
the plugins in /opt/platform.
Story: 2009966
Task: 46719
Signed-off-by: Dan Voiculeasa <dan.voiculeasa@windriver.com>
Change-Id: I6a4626d8769a3db7f74193b6bffdea24c86d4df7
On Debian, with the new Python 3.9 and Stevedore 3, EntryPoint objects
no longer have the module path attached to them. Because of this,
sys.modules is used to determine the path, which means that when
deactivating the plugins, the operation for removing caches for a
specific location (purge_cache_by_location) needs to happen before the
sys.modules cache is removed.
Parsing all the modules is also required; this means the .pth files
installed at runtime must still exist.
Reorder the plugin deactivation steps.
This doesn't seem to have an impact on CentOS.
Tests on CentOS:
PASS: Patched live June 2022 StarlingX system.
Could remove, delete, upload, apply app.
Tested with platform-integ-apps.
PASS: No warnings/errors seen during log inspection.
Tests on Debian:
PASS: Patched live StarlingX system
Could remove, delete, upload, apply app.
Tested with platform-integ-apps.
PASS: AIO-SX unlocked enabled available
PASS: No warnings/errors seen during log inspection.
Story: 2009966
Task: 46718
Signed-off-by: Dan Voiculeasa <dan.voiculeasa@windriver.com>
Change-Id: I44103347be1548ef177c233d92f0b8309ea3e490
Recent changes [1] to the AppImageParser _find_images_in_dict and
generate_download_images_list methods caused this code to break with
both AttributeError and TypeError when the stx-openstack application
is being uploaded.
This change includes extra protection against these types of errors
and re-establishes the flow for generating the stx-openstack image
list based on its overrides.
It also adds a new image resource to the TestKubeAppImageParser unit
tests, using an Openstack resource extracted while debugging the
original error. It should prevent this issue from happening again for
future changes to the AppImageParser logic.
The original change to generate_download_images_list, for example, would
fail the test:
* TestKubeAppImageParser.test_generate_download_images_list
[1] https://review.opendev.org/c/starlingx/config/+/858762
Test Plan:
PASS - Locally execute unit tests: TestKubeAppImageParser
PASS - Build the sysinv package with this change
PASS - Upload stx-openstack app
PASS - Apply stx-openstack app
Closes-Bug: 1991115
Signed-off-by: Thales Elero Cervi <thaleselero.cervi@windriver.com>
Change-Id: I8a1384bfefd12f8a893249853cbeae3a9d3661e0
In support of the STS silicom application, this
commit adds support for a new image format, which
may be found in the application charts (e.g. values.yaml).
For the STS application, the format is as follows:
Images:
Tsyncd: quay.io/silicom/tsyncd:2.1.2.9
TsyncExtts: quay.io/silicom/tsync_extts:1.0.0
Phc2Sys: quay.io/silicom/phc2sys:3.1.1
GrpcTsyncd: quay.io/silicom/grpc-tsyncd:2.1.2.9
Gpsd: quay.io/silicom/gpsd:3.23.1
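Extracting such entries amounts to a recursive walk of the chart
values, collecting strings that look like image references. A sketch
under that reading (the pattern and helper are illustrative, not the
actual AppImageParser code):

```python
import re

# Rough pattern for registry/name:tag references such as
# quay.io/silicom/tsyncd:2.1.2.9. Illustrative only.
IMAGE_RE = re.compile(r"^[\w.\-]+(/[\w.\-]+)+:[\w.\-]+$")

def find_images(values):
    """Recursively walk a chart values structure and collect strings
    that look like container image references."""
    images = []
    if isinstance(values, dict):
        for v in values.values():
            images.extend(find_images(v))
    elif isinstance(values, (list, tuple)):
        for v in values:
            images.extend(find_images(v))
    elif isinstance(values, str) and IMAGE_RE.match(values):
        images.append(values)
    return images
```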
Testing:
- Apply the app-sts-silicom application. Ensure images
can be extracted and downloaded from the helm charts.
- Ensure the application is applied with no errors
Story: 2010213
Task: 45955
Signed-off-by: Steven Webster <steven.webster@windriver.com>
Change-Id: Iebe94fb77780e516697c2d98efb296aff415b22f
This commit adds listeners to monitor the change of keystone
service users' passwords, apply puppet runtime manifest to
update the service configuration and restart the related
services.
Tests passed:
1. Update keyring users' passwords
2. Change keystone users' passwords with the OpenStack CLI
3. Verified the configuration was updated
4. Verified the services show no auth failures
5. No host swact during the apply; no FM alarm created at the
end of the process
Note:
1. The password synchronization between keyring and keystone
is not included in this review.
2. The update of the secure static hieradata is not included
in this change due to upgrade concerns; users need to update
the hieradata manually. E.g. the subcloud_rehome playbook
will add a task to migrate the passwords in the hieradata
during subcloud rehoming.
3. The unit tests will be delivered by another task in this
story.
Depends-On: https://review.opendev.org/c/starlingx/stx-puppet/+/853708
Story: 2010230
Task: 46074
Signed-off-by: Yuxing Jiang <Yuxing.Jiang@windriver.com>
Change-Id: I1a2dbc8b1e0bd03c2086895818729b2283b0fb96
Currently, there is no path inside the appfwk to get an app
from 'remove-failed' state to any other state.
This commit makes it so that using remove --force
will prevent the app from being put in remove-failed
if the operation fails.
Instead, the app is put in 'uploaded' state
and a progress message warning about this is set.
remove --force can also be used to recover the app
from the remove-failed state for a subsequent delete.
Test Plan:
PASS: remove (without -f) results in remove-failed
state in case of an error
PASS: remove --force results in uploaded state
instead of remove-failed in case of an error
and the progress message is set.
(tested for apply-failed and remove-failed)
PASS: remove --force does not set the warning
progress message when the remove succeeds
Related-Bug: 1987115
Signed-off-by: Leonardo Fagundes Luz Serrano <Leonardo.FagundesLuzSerrano@windriver.com>
Change-Id: Iba659c05bf9abd28b0319e6c438141f9aa1c9240
Fixed application-remove cmd putting app in 'remove-failed' state
when used to remove an app which doesn't have any resources
in kubernetes.
(eg.: application-apply failed to download docker images)
Added some missing error message logging.
Test Plan:
PASS: remove cmd changes app state from 'apply-failed' to 'uploaded'
when apply cmd failed to download docker images
Closes-Bug: 1987115
Signed-off-by: Leonardo Fagundes Luz Serrano <Leonardo.FagundesLuzSerrano@windriver.com>
Change-Id: I30191f9b90c40f6432cf75e141d12319046486a6
This commit fixes the uninitialized variable error when doing a
platform backup with an armada app in update-failed state.
TEST PLAN
PASS Create a backup with all apps in valid state
PASS Create a backup just with FluxCD apps (no armada app) [1]
PASS Create a backup with rook-ceph on upload-failed state [2]
[1] Logs: https://paste.opendev.org/show/b8q5kg0XDfuUkQnHjo0X/
[2] Logs: https://paste.opendev.org/show/bG7toHBT558djkThJ2p1/
Closes-Bug: #1982488
Signed-off-by: Thiago Brito <thiago.brito@windriver.com>
Change-Id: If8adec53023bec727b695a9849ce8aef08455e0f
Code that retrieves the registry credentials does not work properly
with Python3. This commit fixes that.
Test Plan:
- Verify successful bootstrap for both CentOS and Debian
using authenticated registry
Partial-bug: 1980391
Change-Id: I71cac14d8bdd63501fc804086cb8af429135bd92
Signed-off-by: Jerry Sun <jerry.sun@windriver.com>
If a system application-update is triggered updating an armada app
to a fluxcd app (preceded by a helm release migration) and the update
fails, the application framework will try to perform a recovery.
The recovery will fail because fluxcd uses helm3 and armada uses
helm2. This will create resources in both helm2 and helm3, leaving
the app in an inconsistent state.
To prevent that from happening, the recovery is skipped if to_app and
from_app use different chart managers.
TEST PLAN:
PASS: recover skipped after update from armada to fluxcd without
migrating helmrelease
Closes-bug: 1980242
Signed-off-by: Lucas Cavalcante <lucasmedeiros.cavalcante@windriver.com>
Change-Id: I9061b75f443730e973b79cc93e955069951113ff
On [1], during a refactoring activity to export some variables to
constants, the IDE made some code substitutions in a file that wasn't
meant to be changed. This commit fixes it and unblocks the bootstrap
failure.
[1] https://review.opendev.org/c/starlingx/config/+/844340
Closes-Bug: #1977471
Signed-off-by: Thiago Brito <thiago.brito@windriver.com>
Change-Id: Id8f02057bb335f21314e9c06642fe501b786fe80
This commit fixes the problem of changing the lighttpd port in 2 parts:
- Added a callback to the puppet runtime manifest apply to call a method
that fixes the address on the custom resource if it was already created
by puppet
- Modified the FluxCDKustomizeOperator to change the helmrepository.yaml
when the app is uploaded/applied/reapplied, and removed that
responsibility from kube_app.py.
TEST PLAN
PASS change the http port and check if the HelmRepository resource was
updated on kubernetes
PASS check if the resource definition on base/helmrepository.yaml was
uploaded and that helmrepository-orig.yaml was created
PASS upload the snmp-app and verify that the default port for its
helmrepository.yaml was updated and helmrepository-orig.yaml was
created.
Closes-Bug: #1977471
Signed-off-by: Thiago Brito <thiago.brito@windriver.com>
Change-Id: I4ac50bc7dabdb589a1774f2c13dba1f5d16432c5
This update adds the FluxCD complement to the ArmadaManifestOperator to
allow runtime adjustments to which helm releases are enabled based on
platform conditions.
Changes include:
- Add FluxCDKustomizeOperator to support helm_release_resource_delete()
and platform_mode_kustomize_updates() to allow runtime updates to the
top-level kustomization.yaml file that controls helm releases.
- Add a GenericFluxCDKustomizeOperator for apps that don't provide a
kustomize plugin.
- Addition of stevedore plugin support using the namespace
systemconfig.fluxcd.kustomize_ops
- Refactor helm.py to have two separate functions for generating helm
  overrides, one for Armada and the other for FluxCD, so that Armada
  support can easily be removed in the future.
- Armada provided an --enable-chart-cleanup option when it stops
managing helm releases. To provide similar functionality the
FluxCDKustomizeOperator will manage a helmrelease_cleanup.yaml file
and remove HelmRelease CRDs after application applies.
- Refactor _find_manifest() in kube_app.py and supporting functions in
  utils.py to provide more meaningful feedback when the required
  application elements (Armada or FluxCD) are not present
- Update sysinv-helm command to generate system application overrides
for Armada and for FluxCD apps.
- Update get_custom_resource() and apply_custom_resource() to remove
the 'cert' references as these are generic use functions.
Test Plan:
PASS: CentOS - Build/Install/Bootstrap/Unlock AIO-SX
PASS: CentOS - Verify application upload/apply/remove/delete of an
Armada app and a FluxCD app
PASS: CentOS - Use helm-chart-attribute-modify to enable and disable
charts and confirm that after application re-apply that
the desired helm releases are deployed
PASS: Debian - Build/Install/Bootstrap/Unlock AIO-SX
PASS: Debian - Verify application upload/apply/remove/delete of an
Armada app. Debian FluxCD app not enabled yet.
Change-Id: I346324b382ad3106777df61781c8b2af326e26c8
Closes-Bug: #1974095
Signed-off-by: Robert Church <robert.church@windriver.com>
A runtime error (dictionary changed size during iteration)
is seen when applying a fluxcd application on Debian.
This commit resolves it by casting the dictionary of charts to a list.
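The fix pattern can be sketched as follows; the charts structure here
is a simplified stand-in for the real one:

```python
def drop_disabled(charts):
    """Mutating a dict while iterating over it raises RuntimeError in
    Python 3 ('dictionary changed size during iteration'); taking a
    list() snapshot of the items first makes the deletion safe."""
    for name, meta in list(charts.items()):
        if not meta.get("enabled", True):
            del charts[name]
    return charts
```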
Test Plan:
PASS: Test istio application in Debian and Centos
PASS: Test other fluxcd applications(cert-manager, nginx) in Debian
and Centos
Story: 2009138
Task: 45386
Change-Id: I4c6a2c413ff8ce6e997f22fa6f21224ae9c802dd
Signed-off-by: Teresa Ho <teresa.ho@windriver.com>
Due to some upstream errors specified in 60c75cf2, code was provided to
verify that apps which label their pods with app.kubernetes.io/name get
special treatment on AIO-SX to verify that they are running before
reporting that the application apply is complete.
This original code does not account for the pods that run jobs that go
to a 'Completed' state and cause the apply to be stuck and eventually
timeout.
This update will now check for pods in the 'Completed' state and will
also partially re-factor the code from 60c75cf2 to provide better
readability, maintainability, and increased logging.
Test Plan:
PASS - Build/Install AIO-SX + AIO-DX
PASS - Bootstrap and unlock AIO-SX + AIO-DX
PASS - Test FluxCD apps with the appropriate labeling and confirm
running/completed conditions
Change-Id: I00fa35a2eef5f0d18def1a5233540503d7c1f212
Closes-Bug: #1971053
Signed-off-by: Robert Church <robert.church@windriver.com>
Move stx application plugin directory to /var/stx_app from
python system path (/usr/lib64/python)
TCs list in https://review.opendev.org/c/starlingx/integ/+/825346
Also successfully apply cert-manager and nginx-ingress-controller
Depends-on: https://review.opendev.org/c/starlingx/integ/+/825346
Story: 2009101
Task: 44312
Change-Id: Ia648df877bd7049b01ca89e2d071973f91d9f470
Signed-off-by: Bin Qian <bin.qian@windriver.com>
During a batch application apply on a large number of subclouds, the
massive image download may congest the MGMT network. This commit sets
up exponential backoff logic to prevent the retries from happening at
the same time.
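The backoff idea can be sketched as exponential growth with full
jitter; the base and cap values here are illustrative, not the ones
used by the commit:

```python
import random

def backoff_delay(attempt, base=2.0, cap=300.0):
    """Exponential backoff with full jitter: each retry waits a random
    amount up to base * 2**attempt, capped, so a fleet of subclouds
    does not hammer the MGMT network in lockstep."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))
```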
Test passed:
Apply a large application in parallel on a large number of subclouds.
Story: 2009725
Task: 45178
Signed-off-by: Yuxing Jiang <yuxing.jiang@windriver.com>
Change-Id: Ia7e5ab8fdcfcf148a0e769d0bea21ae577822143
In this commit we added the following enhancements to
the FluxCD functionality:
1. Dynamically change the host in the default helm repository with the
system controller network address.
2. We will see this issue
https://github.com/fluxcd/helm-controller/issues/81
on AIO-SX if there are issues during chart install.
Basically, the status of helmrelease ends up with ready but
the pods are not actually ready/running.
This is due to helm upstream issues
https://github.com/helm/helm/issues/3173,
https://github.com/helm/helm/issues/5814,
https://github.com/helm/helm/issues/8660.
To solve this we need to check if the pods of the
applied helm charts are ready/running using
the kubernetes python client after the helmrelease is
in a ready state.
3. Check for the 'failed' state of the helmreleases and
update the app accordingly
4. Move the Timeout counter before starting the fluxcd
operations to prevent some infinite loops
Test Plan:
PASS: Deployed a SX with the 'cluster_host_subnet' changed from the
default one and checked if the helm repositories were different
as expected
PASS: Apply nginx fluxcd app 1.1-24 and verified that the app status
is 'applied' when all the pods are in running state
PASS: Apply vault fluxcd app 1.0-27 and verified that the app status
is 'applied' when all the pods are in running state
PASS: Platform Upgrade from latest release to current release
Task: 44912
Story: 2009138
Change-Id: I207b5b55a4b504a1c8ecdb239036a3d122294a0d
Signed-off-by: Mihnea Saracin <Mihnea.Saracin@windriver.com>
All system application-[upload/apply/remove/delete/abort/update]
commands need to be able to operate on both an Armada
tarball and a FluxCD tarball.
An app tarball will be considered in FluxCD format if it
contains a directory named "fluxcd-manifests".
And if the tarball does not have this structure
it will be checked further if it's in Armada format.
This first commit introduces full support for the
application-[upload/apply] on FluxCD tarballs, the other operations
are partially supported and improvements should be made
on upcoming reviews.
Tested upload/apply operations on the following apps:
- vault
- nginx
- auditd
Task: 44830
Story: 2009138
Change-Id: I7a34571de1f990e843a9d01375a9dd7732201c0c
Signed-off-by: Mihnea Saracin <Mihnea.Saracin@windriver.com>
When the ghcr docker service parameter is missing sysinv fails to
download images and spams the logs with an exception. This
exception occurs within another except block and hides the true
error.
This fix changes the variable used in an exception message
when an image download from public/private registry fails. If
the exception occurs, pub_img_tag is not assigned and the
UnboundLocalError issue occurs due to the usage of pub_img_tag in
the exception handler.
This fix improves on the original fix in
88cacfc5d5
to cover this scenario.
Test Plan:
PASS: Verify that the exceptions do not occur when the patch
for service parameters is not present
Closes-Bug: 1951014
Change-Id: I77db038d2f33fd42b9b8963fc6b31ebb0c425f6c
Change-Id: I720f63878f6672eccbe1a109f762966d73eef154
Signed-off-by: Shrikumar Sharma <shrikumar.sharma@windriver.com>
Adding a callback to execute the puppet class
'platform::keystone::password::runtime'
Test Plan:
PASS: Verify the command 'openstack user password set' ran in
openstack /var/log/sysinv.log
PASS: Verify the password change trigger reaches the subcloud
when the environment variable OS REGION NAME is changed
to 'SystemController'
PASS: Verify the patch apply using sw-patch and sw-manager
patch-strategy commands after changing openstack (keystone)
password
Regression:
PASS: Verify system patch install
PASS: Verify feature logging
PASS: Verify controllers, compute nodes after a force reboot
PASS: Verify bm type ipmi, redfish and dynamic
PASS: Verify push docker image to local registry
PASS: Verify uploading charts via helm upload
PASS: Verify host operations with custom kubectl app
PASS: Verify isolated 2 peers to big pod (HT non-AIO)
PASS: Verify system core dumps and crashes
PASS: Verify system health pre session for pods, alarms, system apps
PASS: Verify horizon host inventory display
PASS: Verify lock unlock host
PASS: Verify swact controller platform
PASS: Verify pod to pod connection
PASS: Verify pod to service connection
PASS: Verify host to service connection
Story: 2009194
Task: 43249
Signed-off-by: Alexandre Horst <alexandre.horst@windriver.com>
Change-Id: I1a2d3ff0c99ca29f63db9ca0a3f0d78b59b8d819
Starting with Kubernetes 1.20, the "kubectl cp" command will actually
error out if the item being copied doesn't exist. Prior to this it
failed silently.
It turns out that our application upload code was relying on the silent
failure.
In order to make it work for applications without plugins we need to
make it explicit in the code that it's not a fatal error if there are
no overrides.
Change-Id: Ifb70907c84b26bf6c2e19a72a60110a20bcb399b
Closes-Bug: 1948327
Signed-off-by: Chris Friesen <chris.friesen@windriver.com>
This changes the variable used in an exception message
when an image download from a public/private registry
fails. Previously, if the exception occurred, the variable
being used had not yet been assigned.
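The failure mode and the fix pattern can be sketched as follows;
function and variable names are illustrative, not the actual sysinv
code:

```python
def download_image(img_tag, pull):
    """Sketch of the fix: assign the name used by the exception
    handler before the try block, so a failure early in the download
    cannot raise UnboundLocalError while formatting the error."""
    target_tag = img_tag  # assigned before the try, safe in the handler
    try:
        target_tag = pull(img_tag)
        return target_tag, None
    except Exception as exc:
        # Without the pre-assignment above, referencing target_tag
        # here would itself raise UnboundLocalError and hide the
        # original download error.
        return None, "Image download failed for %s: %s" % (target_tag,
                                                           exc)
```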
Closes-Bug: 1942199
Signed-off-by: Hugo Brito <hugo.brito@windriver.com>
Change-Id: I8d393ae6a6d92fe649d3594cf32ab22988fcde48
yaml.load might not properly detect the encoding.
This error was observed during bootstrap when the first app
(nginx-ingress) had its values.yaml file loaded as yaml.
The encoding was detected as being ascii when it was not.
A UnicodeDecodeError is thrown: 'ascii' codec can't decode
byte 0xe2 in position.
The fix is to explicitly tell yaml the encoding of the file.
As the 'encoding' parameter exists only for open in Python3, but not
Python2, switch to io.open to be backward compatible.
As a precaution, the same change was made to ruamel yaml.
Story: 2006796
Task: 42798
Signed-off-by: Dan Voiculeasa <dan.voiculeasa@windriver.com>
Change-Id: Ib101c95392cb453121baecadebb4f75b81216477
(cherry picked from commit 156cc123e4)
b64encode and b64decode return a str in Python2 and
bytes in Python3. This is a problem when using http/url/rest
libraries: runtime errors are raised, and mixing str and bytes when
formatting text might introduce an unwanted "b" for bytes, which can
lead to potential issues when sent over the network.
To keep compatibility, use oslo_serialization to force the
return type to str.
There is one place where the urlsafe base64 variant is specifically
used to send and receive (the REST API for application upload).
One of the tests is that platform-integ-apps applies, which exercises
part of the changes.
The cert-mon and DC parts will be exercised when DC is available on
the f/centos8 branch.
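The str/bytes normalization can be sketched with plain stdlib base64;
these helpers mirror the idea behind oslo_serialization's text-forcing
wrappers, not their exact API:

```python
import base64

def encode_as_text(data):
    """Base64-encode and always return str (text), regardless of
    whether b64encode yields str (Python 2) or bytes (Python 3)."""
    if isinstance(data, str):
        data = data.encode("utf-8")
    return base64.b64encode(data).decode("utf-8")

def decode_as_text(encoded):
    """Base64-decode and always return str, never bytes, so the
    result can be safely mixed into formatted text."""
    return base64.b64decode(encoded).decode("utf-8")
```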
branch.
Story: 2006796
Task: 42797
Signed-off-by: Dan Voiculeasa <dan.voiculeasa@windriver.com>
Change-Id: I48b1c6c80363458945c6bc1a9cf7e16c743a7bd6
(cherry picked from commit 8a7c4b15c7)
The filter function returns a list in python2 but
an iterator in python3. Replace the filter call with
its list comprehension equivalent to enable
compatibility with Python 3.
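A minimal illustration of the pattern (the data shape is an example,
not the actual sysinv call site):

```python
def enabled_names(charts):
    """List-comprehension equivalent of
    filter(lambda c: c.get("enabled"), charts); on Python 3 filter()
    returns a lazy iterator, which breaks callers that expect a list
    (indexing, len(), repeated iteration)."""
    return [c["name"] for c in charts if c.get("enabled")]
```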
Story: 2006796
Task: 42676
Signed-off-by: Charles Short <charles.short@windriver.com>
Change-Id: Id5daa43348c61c2e71589f79a116300fc0143540
(cherry picked from commit 09d02ac7bf)
Fix filemode to be compatible with python3. This is based off
of Iaa667fdf3c66802c9ad32eaf83bd011d03a5febc
Story: 2006796
Task: 42771
Signed-off-by: Charles Short <charles.short@windriver.com>
Change-Id: I96e16360fef1ad060e3d585e28e14913be4dfb9b
(cherry picked from commit 53c039b236)
getsitepackages() returns a list containing all global site-packages
directories. However, the method behaves differently in python3
compared to python2. In python2, the first element of the array is
/usr/local/python2.7, while in python3 the first element of the
array is /usr/local/lib/python3.6, which does not exist when a user
is trying to upload applications.
To get around this we check which version is running via the six
library. If python2 is running then use the old site-packages list,
otherwise use the python3 version. This can be removed after python2
support goes away. This code has been tested on both Centos8 and
Debian.
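The selection logic amounts to a version switch; this sketch uses
sys.version_info where the original code used six, and the directory
lists are illustrative placeholders:

```python
import sys

def pick_site_dirs(dirs_py2, dirs_py3):
    """Choose the site-packages list appropriate for the running
    interpreter: the historic python2 list, or the python3 one.
    The original change used six for this check."""
    return dirs_py2 if sys.version_info[0] == 2 else dirs_py3
```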
Story: 2006796
Task: 42730
Signed-off-by: Charles Short <charles.short@windriver.com>
Change-Id: I0cae10787b9c6c0f41aa3dfedb350c382cb97fbc
(cherry picked from commit ba679dc473)