Currently, /var/log/sm-customer.log and /var/log/sm-customer.alarm
have permission 644. To comply with the CIS benchmark requirements,
the permissions should be set to 640.
This change updates the permissions of /var/log/sm-customer.log and
/var/log/sm-customer.alarm to 640.
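As a rough illustration of the permission change (a sketch only; the
source does not show where the mode is actually set, and the helper
names below are hypothetical), the following applies mode 640 to a file
and checks that it grants nothing beyond that mode:

```python
import os
import stat

CIS_LOG_MODE = 0o640  # owner read/write, group read, no access for others


def set_log_mode(path, mode=CIS_LOG_MODE):
    """Apply `mode` to `path` and return the resulting permission bits."""
    os.chmod(path, mode)
    return stat.S_IMODE(os.stat(path).st_mode)


def is_compliant(path, mode=CIS_LOG_MODE):
    """Return True if `path` grants no permission bit outside `mode`."""
    current = stat.S_IMODE(os.stat(path).st_mode)
    return current & ~mode == 0
```

A file at 644 fails is_compliant() because the "other" read bit is set;
after set_log_mode() it passes.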
Test Plan:
PASS: Build ISO and deploy AIO-SX and AIO-DX.
PASS: Verify that the permissions of /var/log/sm-customer.log and
/var/log/sm-customer.alarm are set to 640.
PASS: AIO-DX: Perform a 'collect' and verify that the extracted
contents contain sm-customer.log and sm-customer.alarm.
PASS: AIO-DX: Verify that sm-customer.log is updated when running
'system host-swact'
Story: 2011241
Task: 51367
Change-Id: I6f37d88bf7e26356f5f11104d9bce7fee6512307
Signed-off-by: Jagatguru Prasad Mishra <jagatguruprasad.mishra@windriver.com>
For the reasons described in the depends-on update, this update aligns
SM with the new /var/persist/mtc/.node_locked file path.
Test Plan:
PEND: Verify all test cases in the depends-on update.
PASS: Verify SM does not activate a locked controller.
- With one controller locked and the other unlocked, power off
both nodes and then power on only the locked controller.
SM should remain in the initial state.
PASS: Verify swact to unlocked controller after above test case.
Depends-On: https://review.opendev.org/c/starlingx/metal/+/939646
Closes-Bug: 2095212
Change-Id: I93070f100860a176de3ee34565e74b1f23e0090d
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
sm service and sm-api service CPUShares are reduced to 512 out of 1024
since they exhibit severe CPU hog behaviour for extended periods
particularly during host configuration.
NOTE: CPUQuota may not be used for these services because they require
more than the number of platform CPUs during host configuration.
This is part of an overall set of adjustments required for systemd
cgroups CPUShares, CPUQuota, and AllowedCPUs for key system services.
This will improve the latency of Kubernetes critical components and
throttle less important services.
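The CPUShares check in the test plan can be automated along these lines.
This is a minimal sketch that parses KEY=VALUE text in the format printed
by `systemctl show`; the function names are hypothetical and the sample
input is illustrative, not captured from a real system:

```python
def parse_systemctl_show(output):
    """Parse `systemctl show` KEY=VALUE lines into a dict."""
    props = {}
    for line in output.splitlines():
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def cpu_shares(output):
    """Return CPUShares as an int, or None if absent or not numeric."""
    value = parse_systemctl_show(output).get("CPUShares")
    return int(value) if value is not None and value.isdigit() else None


# Illustrative input only.
sample = "Id=sm.service\nCPUShares=512\n"
print(cpu_shares(sample))  # 512
```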
Partial-Bug: 2084714
TEST PLAN:
AIO-SX, AIO-DX, Standard, Storage, DC:
- PASS: Fresh install
- PASS: verify systemd parameters for sm and sm-api
Example:
systemctl show sm.service | grep -e CPUShares
systemctl show sm-api.service | grep -e CPUShares
AIO-SX, AIO-DX:
- PASS: BnR
- PASS: K8S orchestrated Upgrade 1.24 - 1.29
Change-Id: If72dd952aec5331db8146caefe8441b96475211d
Signed-off-by: Jim Gauld <James.Gauld@windriver.com>
This reverts commit dc3b09d35094068566a8ee6a009eeb52bd78bf8d.
Reason for revert: The change - https://bugs.launchpad.net/starlingx/+bug/2083632 is causing Upgrade failures
Change-Id: I457eda43c65ac198cec60259b03fa1ccfc7215d9
The logging configuration update for the SM service logs
is invalid and conflicts with the postgres logging
facility (local0), causing the SM logs to appear in
both sm-service.log and postgres.log. Reverting to
facility (local3) for SM service logs to fix the issue.
Change that introduced the bug - https://review.opendev.org/c/starlingx/ha/+/923772
Test Plan: Verified the changes on simplex and duplex configurations.
PASS: AIO-Simplex - Verified service logs are written to sm-service.log
Verified that service logs are not seen in sm.log
and postgres.log
PASS: AIO-Duplex - Verified service logs are written to sm-service.log
Verified that service logs are not in sm.log and
postgres.log
PASS: AIO-Duplex - Verified that service logs are written to
sm-service.log file after a controlled swact and
not seen in sm.log and postgres.log
Depends-On: https://review.opendev.org/c/starlingx/config-files/+/931332
Change-Id: Ifcf03b33081d5908618e6698ef8a21e110ef35c9
Signed-off-by: Sandhya Kalisetty <sandhya.kalisetty@windriver.com>
This reverts commit 685ada4f20d7cf302c339485ca92f547e99225e6.
Reason for revert: Update should not have merged with a single core reviewer +2 and without a dependency declaration with https://review.opendev.org/c/starlingx/config-files/+/931332
Change-Id: I6653004b58f5c7f87648ed0ff866a39a08525775
The logging configuration update for the SM service logs
is invalid and conflicts with the postgres logging
facility (local0), causing the SM logs to appear in
both sm-service.log and postgres.log. Reverting to
facility (local3) for SM service logs to fix the issue.
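To see why sharing a facility mixes the two log streams: syslog messages
carry a PRI value of facility code * 8 + severity (RFC 5424), and routing
rules typically select on the facility. The sketch below shows that
local0 and local3 produce distinct PRI values; how the actual syslog
configuration maps facilities to files is not shown in this commit and
is assumed here.

```python
# Facility codes from the syslog protocol (RFC 5424): local0..local7 are 16..23.
FACILITIES = {f"local{i}": 16 + i for i in range(8)}


def encode_pri(facility, severity):
    """Return the syslog PRI value: facility code * 8 + severity (0-7)."""
    if not 0 <= severity <= 7:
        raise ValueError("severity must be in 0..7")
    return FACILITIES[facility] * 8 + severity


def facility_of(pri):
    """Recover the local facility name from a PRI value."""
    code = pri // 8
    for name, value in FACILITIES.items():
        if value == code:
            return name
    raise ValueError(f"not a local facility: {code}")


INFO = 6  # severity "informational"
print(encode_pri("local0", INFO))  # 134
print(encode_pri("local3", INFO))  # 158
```

Two daemons that both log to local0 are indistinguishable to a
facility-based rule, which matches the symptom of SM logs appearing in
postgres.log.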
Test Plan: Verified the changes on simplex and duplex configurations.
PASS: AIO-Simplex - Verified service logs are written to sm-service.log
Verified that service logs are not seen in sm.log
and postgres.log
PASS: AIO-Duplex - Verified service logs are written to sm-service.log
Verified that service logs are not in sm.log and
postgres.log
PASS: AIO-Duplex - Verified that service logs are written to
sm-service.log file after a controlled swact and
not seen in sm.log and postgres.log
Change-Id: I03b3b033e38801711d1e6ab6831913c5e5b067a4
Signed-off-by: Sandhya Kalisetty <sandhya.kalisetty@windriver.com>
Decrease the debounce time for SX due to its impact on SX upgrade.
In AIO-SX systems, the default debounce time may lead to undesired
behavior, unnecessarily increasing recovery time and potentially
affecting upgrade duration as well.
This commit changes the debounce time value, reducing the extended
recovery period for failed service groups.
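For readers unfamiliar with the term, a debounce holds back a state
change until the new state has been observed for a minimum time. The
sketch below is a generic debouncer, not SM's implementation; the class
name, state names, and time values are all hypothetical.

```python
class Debouncer:
    """Report a new state only after it has held for `debounce_s` seconds."""

    def __init__(self, initial, debounce_s):
        self.reported = initial      # last state made visible to callers
        self.debounce_s = debounce_s
        self._candidate = initial    # most recent raw observation
        self._since = None           # time the candidate was first seen

    def update(self, observed, now):
        """Feed a raw observation at time `now`; return the reported state."""
        if observed == self.reported:
            # Matches what is already reported: cancel any pending change.
            self._candidate, self._since = observed, None
        elif observed != self._candidate or self._since is None:
            # A new candidate state: start timing it.
            self._candidate, self._since = observed, now
        elif now - self._since >= self.debounce_s:
            # Candidate has been stable long enough: report it.
            self.reported, self._since = observed, None
        return self.reported
```

With a shorter debounce_s, a recovered service group is reported sooner,
which is the effect this change is after on simplex systems.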
Test plan:
PASS: Install simplex. Modify the /etc/init.d script for controller
services to prevent the service from starting. Restart the
service and verify that after the service group fails, the
time to check the service group again is reduced.
PASS: Enable SM debug mode to monitor behavior. Verify that these
changes do not negatively impact the reporting of service
failures.
PASS: Install duplex. Add changes and force failures using init.d
scripts for services. Verify that the behavior remains
unaffected.
Change-Id: I6316e51c26b8548be1d2f3d8955925e2502266d3
Signed-off-by: fperez <fabrizio.perez@windriver.com>
Currently the sm.log file contains many types of logs, which makes
root-cause analysis very difficult when a failure or error occurs. As
part of the logging enhancements, this change introduces a new log file,
sm-service.log, and redirects all service and service group logs to it
while retaining the original log format.
Test Plan: Verified the changes on simplex and duplex configurations.
PASS: AIO-Simplex - Verified service logs redirected to sm-service.log
Verified that service logs are not seen in sm.log
PASS: AIO-Duplex - Verified service logs redirected to sm-service.log
Verified that service logs are not in sm.log
PASS: AIO-Duplex - Verified that service logs are redirected to
sm-service.log file after a controlled swact
PASS: AIO-Duplex - Verified that sm-service logs are rotated as per the
configuration
Story: 2010533
Task: 50708
Change-Id: Ibef9443d0e866b15dbfb7532e611b672c7aeac93
Signed-off-by: Sandhya Kalisetty <sandhya.kalisetty@windriver.com>
When the SM on one controller detects a process stall on its peer
via the maintenance heartbeat cluster, it attempts recovery by
signaling maintenance to reset the peer. It does so by creating the
/var/run/.sm_reset_peer flag file.
Normally, if the peer's BMC is provisioned, the mtcClient accepts the
reset request, performs the reset, and then removes the
/var/run/.sm_reset_peer flag file.
Once SM has created this flag file, it waits up to 30 seconds for
mtcClient to remove it, which is the signal that the reset was performed.
However, if the peer's BMC is not provisioned, the mtcClient cannot
perform the reset. In this case SM times out and moves on but leaves
the file present.
While the flag file remains present, the mtcClient is poised to reset
the peer controller as soon as the peer's BMC is provisioned.
To avoid that future unexpected reset, this update has SM remove the
/var/run/.sm_reset_peer flag file if it times out.
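The handshake and the new timeout cleanup can be sketched as follows.
This is an illustration under stated assumptions, not SM's code: the
function name, the polling approach, and the poll interval are
hypothetical; only the flag file idea and the 30 second wait come from
the description above.

```python
import os
import time


def signal_peer_reset(flag_path, timeout_s=30.0, poll_s=0.5):
    """Create the flag file and wait for the consumer to remove it.

    Returns True if the consumer removed the file within the timeout.
    Returns False if the wait timed out; the flag file is then gone.
    """
    # Create the flag file (the reset request).
    with open(flag_path, "w"):
        pass

    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if not os.path.exists(flag_path):
            return True
        time.sleep(poll_s)

    # Timed out: withdraw the request so it cannot trigger a reset later.
    try:
        os.remove(flag_path)
    except FileNotFoundError:
        # The consumer removed it between the last check and now.
        pass
    return False
```

Catching FileNotFoundError covers the race listed in the test plan,
where the file disappears at the moment of the timeout.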
Test Plan: This update only affects the timeout handling case
PASS: Verify flag file is removed on timeout case.
PASS: Verify on timeout case that the mtcClient sees the file as gone.
PASS: Verify flag file remove operation is handled gracefully if
the file is already gone. Handle race condition case where
mtcClient removes it at timeout but before SM tries to remove.
Closes-Bug: 2076454
Change-Id: I694f1107e86fc90c7f007918fb59e754482717fe
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
Move the following sm database files to /etc/sm/<release>
- sm.db
- sm.hb.db
Update the paths in the source code to point to the new database location.
Add a static_assert to fail the build if the SW_VERSION macro is not defined.
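The resulting layout can be pictured with a small sketch. The function
name is hypothetical and "24.09" is only a placeholder release string;
the real change derives the release from the SW_VERSION macro at compile
time, whereas this sketch checks at run time.

```python
import posixpath

SM_DB_FILES = ("sm.db", "sm.hb.db")


def sm_db_paths(sw_version, base="/etc/sm"):
    """Return the release-specific paths of the SM database files."""
    if not sw_version:
        # Mirrors the intent of the static_assert: refuse to build
        # paths when no release version is defined.
        raise ValueError("software version must be defined")
    return [posixpath.join(base, sw_version, name) for name in SM_DB_FILES]


print(sm_db_paths("24.09"))
# ['/etc/sm/24.09/sm.db', '/etc/sm/24.09/sm.hb.db']
```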
Test plan:
PASS - AIO-SX: iso install
verify database files are in the correct location
- /var/lib/sm/...
- /etc/sm/<rel>/...
- /var/run/sm/...
ensure sm is running smoothly after controller-0 unlock
PASS - AIO-DX: iso install up to controller-1 unlock
user host-swact
uncontrolled swact by powering off active controller
Story: 2010676
Task: 50649
Change-Id: I2195c420438135c9b109060de13765b0897d7dc9
Signed-off-by: Kyale, Eliud <Eliud.Kyale@windriver.com>
Add dcagent-api to SM database for high availability management.
Test plan:
- PASS: Verify the dcagent-api service is present on SM db and it
can be activated with sm-configure and sm-provision.
Story: 2011106
Task: 50562
Change-Id: I831f7478ed3138c6f5886c505b83d5b6ff90a769
Signed-off-by: Victor Romano <victor.gluzromano@windriver.com>
Adjusting the ceph-mon service to call the scripts with a new
parameter. The old one was just 'mon' and the new one must be
'mon.controller'.
This change is needed for the AIO-DX to support Ceph with 3 ceph
monitors running. It prevents the ceph-init-wrapper script from
managing the other fixed monitors.
A new service called storage-networking is added to improve the Ceph
resiliency for AIO-DX when there is no networking communication
between the controllers, avoiding a split-brain condition.
Test-Plan:
PASS: Fresh install and check that only the floating monitor is being
disabled/enabled on the controllers after a swact. The other monitors
should not be affected by a swact.
Story: 2011122
Task: 50128
Depends-On: https://review.opendev.org/c/starlingx/integ/+/914913
Change-Id: I5731e2ac3cf726eed765645b95c392ea55b1c94f
Signed-off-by: Hediberto C Silva <hediberto.cavalcantedasilva@windriver.com>
Currently SM troubleshoot does not collect ip6tables statistics, only
iptables. This change adds this information for debugging.
Test Plan:
[PASS] execute "collect all" and check that the sm.info file contains
the output of ip6tables -nvL
Closes-Bug: 2071381
Change-Id: I62dccf4a1c031449fa56ff9463857819e8bc79f9
Signed-off-by: Andre Kantek <andrefernandozanella.kantek@windriver.com>
During the first unlock after a fresh install, sysinv-inv starts
the WSGIService, which uses the FQDN "controller.internal".
For this reason, DNSMasq must be ready to resolve the IP
address.
Test done:
AIO-SX fresh install
AIO-DX fresh install
AIO-DX host-swact
Story: 2010722
Task: 50221
Depends-On: https://review.opendev.org/c/starlingx/config/+/920694
Change-Id: If255441f12da370bd48641d7c521aea5f3012af2
Signed-off-by: Fabiano Correa Mercer <fabiano.correamercer@windriver.com>
This commit updates the name of the services for the new
Rook Ceph.
drbd-rookmon -> drbd-rook
rookmon-fs -> rook-fs
rook-mon-exit -> rook-mon-exit (no change)
Test Plan:
PASS: AIO-DX -> Standby controller locked and ceph-rook as
storage-backend + controller-fs add ceph-float=<size> +
checking if everything is created correctly: lv, drbd and
SM services.
PASS: Perform host-swact after the above test and confirm
primary/secondary DRBD change on 'drbd-ceph'.
PASS: AIO-DX -> Standby controller locked + controllerfs-delete
ceph + checking if everything is deleted correctly: lv, drbd
and SM services
Story: 2011117
Task: 50097
Change-Id: Ib896ae271f4e649853af950aebe33111948e639e
Co-Authored-By: Robert Church <robert.church@windriver.com>
Signed-off-by: Gabriel de Araújo Cabral <gabriel.cabral@windriver.com>
This commit sets a dependency between the ipsec-config service and the
management-ipv4/ipv6 services. The disable action may be performed on
the ipsec-config service only after the dependent management-ipv4/ipv6
services are in the disabled state.
This fix is needed due to the verification step present on monitor
function for ipsec-config service. This function checks if the
floating IP is present on system ip tables and swanctl configuration
and verifies if the conditions are satisfied for active and standby
controllers.
The floating IP is expected to be present on the active controller,
where ipsec-config is in the enabled-active state, and not present
on the standby controller, where ipsec-config is in the disabled state.
The floating IP is added and removed by the management-ipv4/ipv6
services in their start and stop actions. With the previous service
dependency configuration, this caused an error during the ipsec-config
audit-disabled action, and a 400.001 alarm was raised on the system.
Therefore, this commit fixes this service dependency relation between
ipsec-config and management-ip services.
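The ordering rule described above can be sketched as a simple gate. This
is not how sm-db encodes dependencies; the table layout and function
name are hypothetical, and treating an unconfigured address family as
non-blocking is an assumption made for the single-stack case.

```python
# service -> services that must already be disabled before it is disabled
DISABLE_DEPENDENCIES = {
    "ipsec-config": {"management-ipv4", "management-ipv6"},
}


def can_disable(service, states, deps=DISABLE_DEPENDENCIES):
    """Return True when every service that `service` waits on is disabled.

    `states` maps service name -> state string. Services missing from
    `states` are assumed not to block (hypothetical handling of an
    address family that is not configured).
    """
    return all(
        states.get(dep, "disabled") == "disabled"
        for dep in deps.get(service, ())
    )


print(can_disable("ipsec-config", {"management-ipv4": "enabled-active"}))  # False
print(can_disable("ipsec-config", {"management-ipv4": "disabled"}))        # True
```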
Test Plan:
PASS: Full build, system install, bootstrap and unlock of a DX system
with unlocked enabled available state. No 400.001 alarms present
on system.
PASS: On a DX system with unlocked enabled available state, perform a
host-swact on controller-0. Observe that ipsec-config service
changes its state on controllers, from disabled to enabled-active
on active controller and from enabled-active to disabled on
standby controller. No errors are reported on daemon-ocf.log
related to ipsec-config or management-ip services. No 400.001
alarms present on system.
PASS: On a DX system with unlocked enabled available state, perform a
host-lock and host-unlock on controller-1. Observe that system
boots with ipsec-config service on disabled state. No errors are
reported on daemon-ocf.log related to ipsec-config or
management-ip services. No 400.001 alarms present on system.
Story: 2010940
Task: 50196
Change-Id: Idd2f487b4589e1f66d79ea5c4f13c36e67c302be
Signed-off-by: Manoel Benedito Neto <Manoel.BeneditoNeto@windriver.com>
This change increases the action timeout for some SM services.
The reason for the increase is that, when IPsec for the mgmt network is
enabled, CPU usage goes up 3-4 times during system installation, causing
actions for some SM services to time out or fail, which in turn triggers
uncontrolled swacts.
New timeouts are set for the following services. They are set to 4
times the original values. This is based on performance
measurements indicating that CPU usage could go up 3-4 times during
system installation. With these new values, multiple installation tests
succeeded with no uncontrolled swact seen. There will be follow-up
tuning that potentially decreases these numbers.
ceph-mon
sysinv-inv
horizon
rabbit
Test Plan:
PASS: DX deployment, verify deployment is successful, both controllers
are in unlocked/enabled/available states after deployment
completes.
PASS: Multi-node system deployment, verify deployment is successful,
all hosts are in unlocked/enabled/available states after
deployment completes.
PASS: Swact controllers multiple times, verify swact is successful and
system is stable.
PASS: Lock/unlock hosts multiple times with either controller-0 or
controller-1 as active controller, verify the locked and unlocked
host comes back normally, and system is stable.
Story: 2010940
Task: 50187
Change-Id: I0ecd9cc82415b5a232040b6707c1f945c4f16d08
Signed-off-by: Andy Ning <andy.ning@windriver.com>
Add dcorch-engine-worker service into SM database for its management
in HA.
Depends-On: https://review.opendev.org/c/starlingx/distcloud/+/917792
Story: 2011106
Task: 50016
Change-Id: I6cdbf6a754af9339fc7db8aa453a4c49e8277613
Signed-off-by: lzhu1 <li.zhu@windriver.com>
This commit adds the ipsec-config service to sm-db. This service is
responsible for managing the swanctl configuration by creating symbolic
links between swanctl.conf and different conf files.
Test Plan:
PASS: Build a new debian iso containing the changes.
PASS: Bootstrap, install and unlock a DX system with unlocked enabled
available status and IPsec enabled. Observe that ipsec-config
service data is present in the sm-db tables.
Story: 2010940
Task: 49998
Depends-On: https://review.opendev.org/c/starlingx/config/+/916841
Change-Id: Ia1544134b7d4d49897153c064b996a1f67b7599b
Signed-off-by: Manoel Benedito Neto <Manoel.BeneditoNeto@windriver.com>
StarlingX stopped supporting CentOS builds after release 7.0.
This update strips CentOS from our code base. It also removes
references to the failed OpenSUSE feature.
Story: 2011110
Task: 49952
Change-Id: I1bed2fde10326ecb75b45376efea8480e0f23675
Signed-off-by: Scott Little <scott.little@windriver.com>
This change splits the IP service for each platform network into ipv4
and ipv6 to support dual-stack. Single-stack (when there is only ipv4
or ipv6) is still supported.
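A sketch of the split, assuming service names of the form
<network>-ipv4 and <network>-ipv6 (this log mentions
management-ipv4/ipv6; the naming for other networks is an assumption).
The function name and the addresses are illustrative only.

```python
import ipaddress


def ip_services(network, addresses):
    """Return the per-family IP service names needed for `network`."""
    versions = sorted({ipaddress.ip_address(a).version for a in addresses})
    return [f"{network}-ipv{v}" for v in versions]


# Single-stack: one service.
print(ip_services("management", ["192.168.204.2"]))
# ['management-ipv4']

# Dual-stack: one service per address family.
print(ip_services("management", ["192.168.204.2", "fd01::2"]))
# ['management-ipv4', 'management-ipv6']
```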
Test Plan:
[PASS] install, lock, unlock and swact for the following setups:
- AIO-SX (IPv4 and IPv6)
- AIO-DX (IPv4 and IPv6)
- Standard (IPv4 and IPv6)
- DC (SysCtrl=AIO-DX, subcloud=AIO-SX)
[PASS] Add dual-stack configuration and validate services operation
with lock, unlock and swact:
- AIO-SX (IPv4 and IPv6)
- AIO-DX (IPv4 and IPv6)
- Standard (IPv4 and IPv6)
- DC (SysCtrl=AIO-DX, subcloud=AIO-SX), using the admin network
Story: 2011027
Task: 49761
Change-Id: Ic6451cae04769409babd2d6507c3677d1cce5617
Signed-off-by: Andre Kantek <andrefernandozanella.kantek@windriver.com>
Service Management (SM) sometimes selects and activates services on a
locked controller following a dead office recovery.
This update adds a node locked check to SM's enable handler that
blocks enable if the node locked file is present, much like the existing
goenabled check in the same function blocks enable if the goenabled
file is not present.
The enable gate file is /etc/mtc/tmp/.node_locked on the local host.
Maintenance manages the presence or absence of this file based on
the node's administrative state.
This update also cleans up some extra whitespace in the changed file.
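The two gates in the enable handler can be pictured as below. This is a
sketch, not the SM source: the function name, parameter names, and
return shape are hypothetical, and both file paths are passed in rather
than hard-coded. The is_simplex bypass reflects the test plan item that
SM ignores the node locked flag file on AIO SX.

```python
import os


def enable_allowed(goenabled_path, node_locked_path, is_simplex=False):
    """Return (allowed, reason) for an enable request.

    Enable requires the goenabled file to exist and, except on simplex
    systems, the node locked file to be absent.
    """
    if not os.path.exists(goenabled_path):
        return False, "goenabled file not present"
    if not is_simplex and os.path.exists(node_locked_path):
        return False, "node locked file present"
    return True, "ok"
```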
Test Plan:
PASS: Verify system build.
PASS: Verify AIO SX install.
PASS: Verify AIO DX install.
PASS: Verify Standard DX system install with worker and storage.
For Both 'AIO DX' and 'Standard DX with worker and storage':
PASS: Verify SM does not activate on a locked DX controller.
PASS: ... DOR case
PASS: ... Uncontrolled Swact case
PASS: Verify Standard DX behavior over DOR with one locked controller
while the only unlocked controller does not recover.
PASS: Verify behavior after above test case once the only unlocked
controller does recover.
PASS: Verify lock of the standby controller and its sm logs
PASS: Verify manually creating the new Nv locked file on the active
controller will cause SM to go disabled and shut down all
services on that controller.
... If there is another unlocked controller then verify it
takes over as an uncontrolled swact.
... If there is no unlocked standby controller then verify SM
remains shutdown until the manually created Nv node locked
file is removed. At which point SM proceeds to activate
services on that controller again.
PASS: Verify SM ignores the node locked flag file for AIO SX systems.
PASS: Verify lock/unlock of AIO SX controller.
PASS: Verify original reported issue is resolved for AIO DX systems.
Regression:
PASS: Verify controlled swact with unlocked enabled standby.
PASS: Verify uncontrolled swact with unlocked enabled standby.
PASS: Verify standby controller lock/unlock soak loop (10).
PASS: Verify swact loop soak (10).
PASS: Verify no crash or core dumps.
PASS: Verify SM logging
Closes-Bug: 2051578
Change-Id: If8e27ef30d62096fa77c3868f4d460b18e10ade2
(cherry picked from commit 23d0d8ab2f3225f10594547c5f8a67c409f815a0)
Service Management (SM) sometimes selects and activates services on a
locked controller following a dead office recovery.
This update adds a node locked check to SM's enable handler that
blocks enable if the node locked file is present, much like the existing
goenabled check in the same function blocks enable if the goenabled
file is not present.
The enable gate file is /etc/mtc/tmp/.node_locked on the local host.
Maintenance manages the presence or absence of this file based on
the node's administrative state.
This update also cleans up some extra whitespace in the changed file.
Test Plan:
PASS: Verify system build.
PASS: Verify AIO DX install.
PASS: Verify Standard DX system install with worker and storage.
For Both 'AIO DX' and 'Standard DX with worker and storage':
PASS: Verify SM does not activate on a locked controller.
PASS: ... DOR case
PASS: ... Uncontrolled Swact case
PASS: Verify Standard DX behavior over DOR with one locked controller
while the only unlocked controller does not recover.
PASS: Verify behavior after above test case once the only unlocked
controller does recover.
PASS: Verify lock of the standby controller and its sm logs
PASS: Verify manually creating the new Nv locked file on the active
controller will cause SM to go disabled and shut down all
services on that controller.
... If there is another unlocked controller then verify it
takes over as an uncontrolled swact.
... If there is no unlocked standby controller then verify SM
remains shutdown until the manually created Nv node locked
file is removed. At which point SM proceeds to activate
services on that controller again.
Regression:
PASS: Verify controlled swact with unlocked enabled standby.
PASS: Verify uncontrolled swact with unlocked enabled standby.
PASS: Verify standby controller lock/unlock soak loop (10).
PASS: Verify swact loop soak (10).
PASS: Verify no crash or core dumps.
Closes-Bug: 2051578
Change-Id: I0f0e3d199586513ddce484fdcc056e1b2562b45f
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
haproxy uses DNS resolution.
Add a service dependency to the sm database
to ensure that the dnsmasq service is started before haproxy
and that dnsmasq is disabled after haproxy is disabled.
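The start and stop ordering implied by this dependency is a topological
order and its reverse. The sketch below uses Python's standard graphlib
module (Python 3.9+); the dictionary is a stand-in for illustration and
does not reflect how the sm database stores dependencies.

```python
from graphlib import TopologicalSorter

# service -> services it depends on (these must be enabled first)
DEPENDS_ON = {"haproxy": {"dnsmasq"}}


def enable_order(depends_on):
    """Dependencies first, dependents last."""
    return list(TopologicalSorter(depends_on).static_order())


def disable_order(depends_on):
    """Dependents first, dependencies last."""
    return list(reversed(enable_order(depends_on)))


print(enable_order(DEPENDS_ON))   # ['dnsmasq', 'haproxy']
print(disable_order(DEPENDS_ON))  # ['haproxy', 'dnsmasq']
```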
Test plan:
PASS - AIO-SX: iso install
PASS - AIO-SX: reboot test
PASS - AIO-DX: iso install
PASS - AIO-DX: swact test
Closes-Bug: #2043506
Change-Id: I494faebfe67843d34819f66a0a2fbd977657bb6b
Signed-off-by: Kyale, Eliud <Eliud.Kyale@windriver.com>
Add support for aarch64 in sm_trap_thread_log.
Test Plan:
PASS: build-pkgs on x86-64 host
PASS: build-image on x86-64 host
PASS: build-pkgs on arm64 host
PASS: build-image on arm64 host
PASS: Deploy AIO-SX on x86-64 targets and check sm service
PASS: Deploy AIO-SX on arm64 targets and check sm service
PASS: Deploy AIO-DX on arm64 targets and check sm service
PASS: Deploy std (2+2+2) on arm64 targets and check sm service
Story: 2010739
Task: 48017
Change-Id: Iebea29e6df900f63d0dce24cf1a139f60c1cf6f8
Signed-off-by: Jackie Huang <jackie.huang@windriver.com>
The includes path in the Makefile is hardcoded with x86_64.
Use dpkg-architecture to check the host arch and replace
the hardcoded name.
Test Plan:
PASS: build-pkgs on x86-64 host
PASS: build-image on x86-64 host
PASS: build-pkgs on arm64 host
PASS: build-image on arm64 host
PASS: Deploy AIO-SX on x86-64 targets and check sm service
PASS: Deploy AIO-SX on arm64 targets and check sm service
PASS: Deploy AIO-DX on arm64 targets and check sm service
PASS: Deploy std (2+2+2) on arm64 targets and check sm service
Story: 2010739
Task: 48017
Change-Id: Ie22477b7ec7df63377f666186d95201cd16f5809
Signed-off-by: Jackie Huang <jackie.huang@windriver.com>
This avoids waiting for the hbs cluster query when sending the SM alive
pulse. When a hbs cluster query or alive pulse is being sent, do not
queue the subsequent alive pulse, as the request currently being sent is
good enough to update the hbs agent.
Also move the function that retrieves the socket address from inside the
query sending procedure to initialization. This avoids calling
getaddrinfo while sending, which indirectly calls malloc, which in turn
invokes malloc_atfork, potentially a blocking call.
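The "do not queue" rule can be sketched as an in-flight flag. This is a
single-threaded illustration with hypothetical names; SM's actual
threading and send mechanics are not shown in this commit message.

```python
class PulseSender:
    """Skip an alive pulse while a request to the agent is in flight."""

    def __init__(self, send):
        self._send = send          # callable that starts an asynchronous send
        self._in_flight = False
        self.skipped = 0           # pulses dropped because one was outstanding

    def request_pulse(self):
        """Send a pulse unless a request is outstanding; never queue."""
        if self._in_flight:
            self.skipped += 1
            return False
        self._in_flight = True
        self._send()
        return True

    def on_send_complete(self):
        """Called when the outstanding request has finished."""
        self._in_flight = False
```

Dropping rather than queuing is safe here because, as the description
says, the request already being sent is enough to update the hbs agent.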
Test Plan:
This change is expected to help only in extreme situations; regression passed.
Closes-bug: 2025504
Change-Id: I520b42f0330b670e301279c2e42670d40361adc5
Signed-off-by: Bin Qian <bin.qian@windriver.com>