289 Commits

Author SHA1 Message Date
Zuul
a6e53a5343 Merge "Update to use new mtce /var/persist/mtc/.node_locked flag file" 2025-02-20 20:04:21 +00:00
Zuul
5f8acd2bb5 Merge "Update permission of sm-customer logs" 2025-02-10 14:37:46 +00:00
Jagatguru Prasad Mishra
25380004f6 Update permission of sm-customer logs
Currently, /var/log/sm-customer.log and /var/log/sm-customer.alarm
have permission 644. To comply with the CIS benchmark requirements,
the permissions should be set to 640.

This change updates the permissions of /var/log/sm-customer.log and
/var/log/sm-customer.alarm to 640.

Test Plan:
PASS: Build ISO and deploy AIO-SX and AIO-DX.
PASS: Verify that permission of /var/log/sm-customer.log and
      /var/log/sm-customer.alarm files are set to 640.
PASS: AIO-DX: Perform a 'collect' and verify that the extracted
      contents contains sm-customer.log and sm-customer.alarm.
PASS: AIO-DX: Verify that sm-customer.log is updated when running
      'system host-swact'
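
The permission check in the test plan above can be sketched as follows. This is a minimal illustration, not the change itself: the helper name `has_mode` is hypothetical, and the demo uses a temporary file instead of the real /var/log/sm-customer.log so that it runs anywhere on a POSIX system.

```python
import os
import stat
import tempfile


def has_mode(path: str, expected: int = 0o640) -> bool:
    """Return True if the permission bits of path equal expected."""
    return stat.S_IMODE(os.stat(path).st_mode) == expected


# Demonstrate on a temporary file rather than the real log files.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    demo_path = tmp.name

os.chmod(demo_path, 0o644)
before = has_mode(demo_path)  # old mode 644 does not satisfy the check

os.chmod(demo_path, 0o640)
after = has_mode(demo_path)   # owner rw, group r, others none

os.unlink(demo_path)
```

On a deployed system the same helper would be pointed at the two sm-customer files named in the commit message.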

Story: 2011241
Task: 51367

Change-Id: I6f37d88bf7e26356f5f11104d9bce7fee6512307
Signed-off-by: Jagatguru Prasad Mishra <jagatguruprasad.mishra@windriver.com>
2025-02-03 18:05:50 +00:00
Eric MacDonald
a78aa6796c Update to use new mtce /var/persist/mtc/.node_locked flag file
For reasons described in the depends-on update, this update changes
to align with the new /var/persist/mtc/.node_locked file path.

Test Plan:

PEND: Verify all test cases in the depends on update
PASS: Verify SM does not activate a locked controller.
      - with one locked controller and the other unlocked power-off
        both nodes and then only power on the locked controller.
        SM should remain in initial state.
PASS: Verify swact to unlocked controller after above test case.

Depends-On: https://review.opendev.org/c/starlingx/metal/+/939646

Closes-Bug: 2095212
Change-Id: I93070f100860a176de3ee34565e74b1f23e0090d
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2025-01-25 22:52:50 +00:00
Jim Gauld
348021f752 Configure systemd CPUShares for sm and sm-api services
sm service and sm-api service CPUShares are reduced to 512 out of 1024
since they exhibit severe CPU hog behaviour for extended periods
particularly during host configuration.

NOTE: CPUQuota may not be used for these services because they require
more than the number of Platform CPUs during host configuration.

This is part of an overall set of adjustments required for systemd
cgroups CPUShares, CPUQuota, and AllowedCPUs for key system services.
This will improve the latency of Kubernetes critical components and
throttle less important services.

Partial-Bug: 2084714

TEST PLAN:
AIO-SX, AIO-DX, Standard, Storage, DC:
- PASS: Fresh install
- PASS: verify systemd parameters for sm and sm-api
  Example:
    systemctl show sm.service | grep -e CPUShares
    systemctl show sm-api.service | grep -e CPUShares
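
The verification step above can also be scripted. The sketch below parses the `Key=Value` lines that `systemctl show` prints and reads the CPUShares value. The function name `parse_systemctl_show` and the sample text are hypothetical; the sample is not captured output from a real system, it only mirrors the 512 value stated in this commit.

```python
def parse_systemctl_show(text: str) -> dict:
    """Parse `systemctl show` style output (one Key=Value per line)."""
    props = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:  # skip lines without an '=' separator
            props[key.strip()] = value.strip()
    return props


# Hypothetical sample input, not real command output.
sample = "Id=sm.service\nCPUShares=512\n"

props = parse_systemctl_show(sample)
cpu_shares = int(props["CPUShares"])
```

In practice the text would come from running the two `systemctl show` commands listed in the example.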

AIO-SX, AIO-DX:
- PASS: BnR
- PASS: K8S orchestrated Upgrade 1.24 - 1.29

Change-Id: If72dd952aec5331db8146caefe8441b96475211d
Signed-off-by: Jim Gauld <James.Gauld@windriver.com>
2024-11-15 03:16:25 -05:00
Zuul
69c47ef5b8 Merge "Revert "SM service logs are being duplicated into postgres.log"" 2024-10-25 15:07:03 +00:00
Sandhya Kalisetty
d6a9b2af98 Revert "SM service logs are being duplicated into postgres.log"
This reverts commit dc3b09d35094068566a8ee6a009eeb52bd78bf8d.

Reason for revert: The change - https://bugs.launchpad.net/starlingx/+bug/2083632 is causing Upgrade failures

Change-Id: I457eda43c65ac198cec60259b03fa1ccfc7215d9
2024-10-25 14:46:03 +00:00
Zuul
5074a07b7c Merge "SM service logs are being duplicated into postgres.log" 2024-10-23 17:52:48 +00:00
Sandhya Kalisetty
dc3b09d350 SM service logs are being duplicated into postgres.log
The logging configuration update for the SM service logs
is invalid and conflicts with the postgres logging
facility (local0), causing the SM logs to appear in
both sm-service.log and postgres.log. Reverting to
facility (local3) for SM service logs to fix the issue.
Change that introduced the bug - https://review.opendev.org/c/starlingx/ha/+/923772
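
To illustrate why a shared facility duplicates logs: a syslog message carries a PRI value computed as facility * 8 + severity, and syslog daemons commonly select destinations by facility. The sketch below assumes routing by facility only; the helper name `syslog_pri` is illustrative, and the constants come from Python's `logging.handlers.SysLogHandler`.

```python
from logging.handlers import SysLogHandler


def syslog_pri(facility: int, severity: int) -> int:
    """Encode a syslog PRI value: facility in the high bits, severity in the low 3."""
    return (facility << 3) | severity


# postgres logs on local0; SM service logs move back to local3.
postgres_pri = syslog_pri(SysLogHandler.LOG_LOCAL0, SysLogHandler.LOG_INFO)
sm_pri = syslog_pri(SysLogHandler.LOG_LOCAL3, SysLogHandler.LOG_INFO)

# A rule that selects on facility can tell the two apart only when the
# facility codes differ.
distinct_facilities = (postgres_pri >> 3) != (sm_pri >> 3)
```

With both sources on local0, a facility-based rule for postgres.log would match the SM service messages as well.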

Test Plan: Verified the changes on simplex and duplex configurations.

PASS: AIO-Simplex - Verified service logs are written to sm-service.log
                    Verified that service logs are not seen in sm.log
                    and postgres.log
PASS: AIO-Duplex -  Verified service logs are written to sm-service.log
                    Verified that service logs are not in sm.log and
                    postgres.log
PASS: AIO-Duplex -  Verified that service logs are written to
                    sm-service.log file after a controlled swact and
                    not seen in sm.log and postgres.log

Depends-On= https://review.opendev.org/c/starlingx/config-files/+/931332
Change-Id: Ifcf03b33081d5908618e6698ef8a21e110ef35c9
Signed-off-by: Sandhya Kalisetty <sandhya.kalisetty@windriver.com>
2024-10-23 16:55:37 +00:00
Zuul
1e85089040 Merge "Revert "SM service logs are being duplicated into postgres.log"" 2024-10-22 15:53:13 +00:00
Eric MacDonald
3c1833c71b Revert "SM service logs are being duplicated into postgres.log"
This reverts commit 685ada4f20d7cf302c339485ca92f547e99225e6.

Reason for revert: Update should not have merged with a single core reviewer +2 and without a dependency declaration with https://review.opendev.org/c/starlingx/config-files/+/931332

Change-Id: I6653004b58f5c7f87648ed0ff866a39a08525775
2024-10-22 15:35:01 +00:00
Zuul
7e62c075f2 Merge "SM service logs are being duplicated into postgres.log" 2024-10-22 14:52:56 +00:00
Sandhya Kalisetty
685ada4f20 SM service logs are being duplicated into postgres.log
The logging configuration update for the SM service logs
is invalid and conflicts with the postgres logging
facility (local0), causing the SM logs to appear in
both sm-service.log and postgres.log. Reverting to
facility (local3) for SM service logs to fix the issue.

Test Plan: Verified the changes on simplex and duplex configurations.

PASS: AIO-Simplex - Verified service logs are written to sm-service.log
                    Verified that service logs are not seen in sm.log
                    and postgres.log
PASS: AIO-Duplex -  Verified service logs are written to sm-service.log
                    Verified that service logs are not in sm.log and
                    postgres.log
PASS: AIO-Duplex -  Verified that service logs are written to
                    sm-service.log file after a controlled swact and
                    not seen in sm.log and postgres.log
Change-Id: I03b3b033e38801711d1e6ab6831913c5e5b067a4
Signed-off-by: Sandhya Kalisetty <sandhya.kalisetty@windriver.com>
2024-10-22 10:19:19 -04:00
fperez
a68d0b3f08 Decrease debounce_time_in_ms for sx
Decrease debounce time for sx due to its impact in Sx upgrade.
In AIO SX systems, the default debounce time may lead to undesired
behavior, unnecessarily increasing recovery time and potentially
affecting upgrade duration as well.

This commit changes the debounce time value, reducing the extended
recovery period for failed service groups.

Test plan:
PASS: Install simplex. Modify the /etc/init.d script for controller
      services to prevent the service from starting. Restart the
      service and verify that after the service group fails, the
      time to check the service group again is reduced.
PASS: Enable SM debug mode to monitor behavior. Verify that these
      changes do not negatively impact the reporting of services
      failures.
PASS: Install duplex. Add changes and force failures using init.d
      scripts for services. Verify that the behavior remains
      unaffected

Change-Id: I6316e51c26b8548be1d2f3d8955925e2502266d3
Signed-off-by: fperez <fabrizio.perez@windriver.com>
2024-10-16 15:10:23 -03:00
Zuul
f61c10c4b6 Merge "Introduce new log file for SM service and service group logs" 2024-08-20 18:29:49 +00:00
Sandhya Kalisetty
80103b108c Introduce new log file for SM service and service group logs
Currently sm.log file has many types of logs. It makes the root-cause
analysis very difficult in case of any failure or error. As part of
logging enhancements, introduced a new log file sm-service.log to
redirect all the service and service group logs while retaining
the original log format.

Test Plan: Verified the changes on simplex and duplex configurations.

PASS: AIO-Simplex - Verified service logs redirected to sm-service.log
                    Verified that service logs are not seen in sm.log
PASS: AIO-Duplex -  Verified service logs redirected to sm-service.log
                    Verified that service logs are not in sm.log
PASS: AIO-Duplex -  Verified that service logs are redirected to
                    sm-service.log file after a controlled swact
PASS: AIO-Duplex -  Verified that sm-service logs are rotated as per the
                    configuration
Story: 2010533
Task: 50708
Change-Id: Ibef9443d0e866b15dbfb7532e611b672c7aeac93
Signed-off-by: Sandhya Kalisetty <sandhya.kalisetty@windriver.com>
2024-08-15 14:25:56 -04:00
Zuul
d953bf64d2 Merge "Remove /var/run/.sm_reset_peer flag file after its 30 second timeout" 2024-08-09 20:12:57 +00:00
Eric MacDonald
f0bc7f2ba8 Remove /var/run/.sm_reset_peer flag file after its 30 second timeout
When the SM on one controller detects a process stall on its peer
via the maintenance heartbeat cluster, it attempts recovery by
signaling maintenance to reset the peer. It signals this by
creating the /var/run/.sm_reset_peer flag file.

Normally, if the peer's BMC is provisioned, the mtcClient accepts the
reset request, performs the reset, and then removes the
/var/run/.sm_reset_peer flag file.

Once SM has created this flag file, it waits up to 30 seconds for
mtcClient to remove it, which signals that the reset was performed.

However, if the peer's BMC is not provisioned, the mtcClient cannot
perform the reset. In this case SM times out and moves on but leaves
the file present.

While the flag file (signal) remains present, the mtcClient is
poised to reset the peer controller once the peer's BMC is provisioned.

To avoid that unexpected future reset, this update has SM remove the
/var/run/.sm_reset_peer flag file if it times out.
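
The handshake and the timeout cleanup described above can be sketched as follows. This is an illustration under stated assumptions, not SM's implementation: the function name `request_peer_reset`, the polling approach, and the poll interval are hypothetical, and the demo uses a temporary directory with a short timeout instead of /var/run and 30 seconds.

```python
import os
import tempfile
import time


def request_peer_reset(flag_path: str, timeout_s: float = 30.0,
                       poll_s: float = 0.5) -> bool:
    """Create the flag file and wait for the other side to remove it.

    Returns True if the flag disappeared within the timeout. On timeout
    the flag is removed here so that it cannot trigger a reset later.
    """
    open(flag_path, "w").close()
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if not os.path.exists(flag_path):
            return True
        time.sleep(poll_s)
    try:
        os.remove(flag_path)
    except FileNotFoundError:
        # The other side removed it right at the timeout; nothing to do.
        pass
    return False


# Demo: nobody removes the flag, so the request times out and cleans up.
with tempfile.TemporaryDirectory() as tmp_dir:
    flag = os.path.join(tmp_dir, ".sm_reset_peer")
    acknowledged = request_peer_reset(flag, timeout_s=0.2, poll_s=0.05)
    flag_left_behind = os.path.exists(flag)
```

Catching `FileNotFoundError` covers the race listed in the test plan, where mtcClient removes the file at the timeout but before SM tries to remove it.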

Test Plan: This update only affects the timeout handling case

PASS: Verify flag file is removed on timeout case.
PASS: Verify on timeout case that the mtcClient sees the file as gone.

PASS: Verify flag file remove operation is handled gracefully if
      the file is already gone. Handle race condition case where
      mtcClient removes it at timeout but before SM tries to remove.

Closes-Bug: 2076454
Change-Id: I694f1107e86fc90c7f007918fb59e754482717fe
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2024-08-09 18:35:22 +00:00
Bin Qian
8720e4a8eb Add dcorch-usm-api-proxy service
With [1], this change creates new dcorch-usm-api-proxy SM managed
service.

Story: 201676
Task: 50683

TCs:
    Observe new service runs.
    Also see TCs in [1]

[1] https://review.opendev.org/c/starlingx/distcloud/+/924969

depends-on: https://review.opendev.org/c/starlingx/distcloud/+/924969
Change-Id: I38da9bd3f05f32314a6e61fef17d642dc8ead864
Signed-off-by: Bin Qian <Bin.Qian@windriver.com>
2024-08-07 16:03:09 +00:00
Zuul
159225103a Merge "Adjusting SM ceph-mon service parameters" 2024-08-06 18:06:26 +00:00
Zuul
8fe9a9df28 Merge "Move sm database files for ostree compatibility" 2024-07-26 12:59:42 +00:00
Kyale, Eliud
7e15abae2d Move sm database files for ostree compatibility
Move the following sm database files to /etc/sm/<release>
- sm.db
- sm.hb.db

Update the paths in the source code to point to the new database location.
Add a static_assert to fail the build if the SW_VERSION macro is not defined.
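
A small sketch of the resulting layout, assuming the release string is simply appended to /etc/sm as a directory. The function name `sm_db_paths` and the release value used in the demo are hypothetical, and the runtime check only stands in for the compile-time static_assert mentioned above.

```python
from pathlib import PurePosixPath


def sm_db_paths(sw_version: str) -> dict:
    """Build the per-release paths of the two SM database files."""
    if not sw_version:
        # Stand-in for the compile-time check on SW_VERSION.
        raise ValueError("sw_version must be defined")
    base = PurePosixPath("/etc/sm") / sw_version
    return {name: str(base / name) for name in ("sm.db", "sm.hb.db")}


# "99.99" is a placeholder release string, not a real StarlingX version.
paths = sm_db_paths("99.99")
```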

Test plan:

PASS - AIO-SX: iso install
               verify database files are in the correct location
                - /var/lib/sm/...
                - /etc/sm/<rel>/...
                - /var/run/sm/...
               ensure sm is running smoothly after controller-0 unlock

PASS - AIO-DX: iso install up to controller-1 unlock
               user host-swact
               uncontrolled swact by powering off active controller

Story: 2010676
Task: 50649

Change-Id: I2195c420438135c9b109060de13765b0897d7dc9
Signed-off-by: Kyale, Eliud <Eliud.Kyale@windriver.com>
2024-07-25 16:27:42 -04:00
Victor Romano
ce94627c74 SM management for dcagent-audit service
Add dcagent-api to SM database for high availability management.

Test plan:
  - PASS: Verify the dcagent-api service is present on SM db and it
          can be activated with sm-configure and sm-provision.

Story: 2011106
Task: 50562

Change-Id: I831f7478ed3138c6f5886c505b83d5b6ff90a769
Signed-off-by: Victor Romano <victor.gluzromano@windriver.com>
2024-07-24 15:54:25 -03:00
Hediberto C Silva
33a25bf1d0 Adjusting SM ceph-mon service parameters
Adjusting the ceph-mon service to call the scripts with a new
parameter. The old one was just 'mon' and the new one must be
'mon.controller'.

This change is needed for the AIO-DX to support Ceph with 3 ceph
monitors running. It prevents the ceph-init-wrapper script from
managing the other fixed monitors.

A new service called storage-networking is added to improve the Ceph
resiliency for AIO-DX when there is no networking communication
between the controllers, avoiding a split-brain condition.

Test-Plan:
  PASS: Fresh install and check if only the floating monitor is being
disabled/enabled on the controllers after a swact. The other monitors
should not be affected by a swact.

Story: 2011122
Task: 50128

Depends-On: https://review.opendev.org/c/starlingx/integ/+/914913

Change-Id: I5731e2ac3cf726eed765645b95c392ea55b1c94f
Signed-off-by: Hediberto C Silva <hediberto.cavalcantedasilva@windriver.com>
2024-06-27 16:43:00 -03:00
Andre Kantek
de48edd316 Collect ip6tables for logging purposes
Currently SM troubleshoot does not collect ip6tables statistics, only
iptables. This change adds this information for debugging.

Test Plan:
[PASS] execute "collect all" and check that the sm.info file contains
       the output of ip6tables -nvL

Closes-Bug: 2071381

Change-Id: I62dccf4a1c031449fa56ff9463857819e8bc79f9
Signed-off-by: Andre Kantek <andrefernandozanella.kantek@windriver.com>
2024-06-27 19:02:33 +00:00
Zuul
3a0fa03806 Merge "Sysinv-inv depends on DNSMasq" 2024-06-03 19:41:47 +00:00
Fabiano Correa Mercer
094ee57df8 Sysinv-inv depends on DNSMasq
During the first unlock after a fresh install, the sysinv-inv starts
the WSGIService that uses the FQDN: "controller.internal".
For this reason, DNSMasq must be ready to resolve the IP
address.


Test done:
AIO-SX fresh install
AIO-DX fresh install
AIO-DX host-swact

Story: 2010722
Task: 50221

Depends-On: https://review.opendev.org/c/starlingx/config/+/920694

Change-Id: If255441f12da370bd48641d7c521aea5f3012af2
Signed-off-by: Fabiano Correa Mercer <fabiano.correamercer@windriver.com>
2024-06-03 10:52:27 -03:00
Zuul
bbb077583d Merge "Update Rook Ceph service names in SM" 2024-05-31 15:42:45 +00:00
gcabral
8ba47d9196 Update Rook Ceph service names in SM
This commit updates the name of the services for the new
Rook Ceph.

 drbd-rookmon -> drbd-rook
 rookmon-fs -> rook-fs
 rook-mon-exit -> rook-mon-exit (no change)

Test Plan:
 PASS: AIO-DX -> Standby controller locked and ceph-rook as
       storage-backend + controller-fs add ceph-float=<size> +
       checking if everything is created correctly: lv, drbd and
       SM services.
 PASS: Perform host-swact after the above test and confirm
       primary/secondary DRBD change on 'drbd-ceph'.
 PASS: AIO-DX -> Standby controller locked + controllerfs-delete
       ceph + checking if everything is deleted correctly: lv, drbd
       and SM services

Story: 2011117
Task: 50097

Change-Id: Ib896ae271f4e649853af950aebe33111948e639e
Co-Authored-By: Robert Church <robert.church@windriver.com>
Signed-off-by: Gabriel de Araújo Cabral <gabriel.cabral@windriver.com>
2024-05-28 11:10:44 +00:00
Zuul
b08d386c89 Merge "Increase action timeout for some SM services" 2024-05-27 18:52:47 +00:00
Manoel Benedito Neto
e3382f7289 Fix ipsec-config service dependency
This commit sets a dependency between ipsec-config service and
management-ipv4/ipv6 services. The disable action may be performed on
ipsec-config service after management-ipv4/ipv6 dependent services are
on disabled state.

This fix is needed due to the verification step present on monitor
function for ipsec-config service. This function checks if the
floating IP is present on system ip tables and swanctl configuration
and verifies if the conditions are satisfied for active and standby
controllers.

It is expected that floating IP is present on the active controller
system where ipsec-config is on enabled-active state and not present
on standby controller system where ipsec-config is on disabled state.
The floating IP is added and removed by management-ipv4/ipv6 service
per their start and stop actions. In the previous service dependency
configuration, this would cause an error during ipsec-config audit-
disabled action and 400.001 alarm was present on system.

Therefore, this commit fixes this service dependency relation between
ipsec-config and management-ip services.

Test Plan:
PASS: Full build, system install, bootstrap and unlock of a DX system
      with unlocked enabled available state. No 400.001 alarms present
      on system.
PASS: On a DX system with unlocked enabled available state, perform a
      host-swact on controller-0. Observe that ipsec-config service
      changes its state on controllers, from disabled to enabled-active
      on active controller and from enabled-active to disabled on
      standby controller. No errors are reported on daemon-ocf.log
      related to ipsec-config or management-ip services. No 400.001
      alarms present on system.
PASS: On a DX system with unlocked enabled available state, perform a
      host-lock and host-unlock on controller-1. Observe that system
      boots with ipsec-config service on disabled state. No errors are
      reported on daemon-ocf.log related to ipsec-config or
      management-ip services. No 400.001 alarms present on system.

Story: 2010940
Task: 50196

Change-Id: Idd2f487b4589e1f66d79ea5c4f13c36e67c302be
Signed-off-by: Manoel Benedito Neto <Manoel.BeneditoNeto@windriver.com>
2024-05-27 11:41:46 -03:00
Andy Ning
9d35cc3248 Increase action timeout for some SM services
This change increased action timeout for some SM services.

The reason for the increase is that, when IPsec for the mgmt network is
enabled, CPU usage goes up 3-4 times during system installation, causing
actions for some SM services to time out or fail, which in turn triggers
uncontrolled swacts.

New timeouts are set for the following services. They are set to 4
times the original values. This is based on performance
measurements that indicate CPU usage could go up 3-4 times during
system installation. With these new values, multiple installation tests
succeeded without any uncontrolled swact. Follow-up tuning
may decrease these numbers.

ceph-mon
sysinv-inv
horizon
rabbit

Test Plan:
PASS: DX deployment, verify deployment is successful, both controllers
      are in unlocked| enabled| available states after deployment
      completes.
PASS: Multi-node system deployment, verify deployment is successful,
      all hosts are in unlocked| enabled| available states after
      deployment completes.
PASS: Swact controllers multiple times, verify swact is successful and
      system is stable.
PASS: Lock/unlock hosts multiple times with either controller-0 or
      controller-1 as active controller, verify the locked and
      unlocked host comes back normally, and system is stable.

Story: 2010940
Task: 50187

Change-Id: I0ecd9cc82415b5a232040b6707c1f945c4f16d08
Signed-off-by: Andy Ning <andy.ning@windriver.com>
2024-05-24 13:44:19 -04:00
Li Zhu
09d63cb790 SM management for dcorch-engine-worker service
Add dcorch-engine-worker service into SM database for its management
in HA.

Depends-On: https://review.opendev.org/c/starlingx/distcloud/+/917792

Story: 2011106
Task: 50016

Change-Id: I6cdbf6a754af9339fc7db8aa453a4c49e8277613
Signed-off-by: lzhu1 <li.zhu@windriver.com>
2024-05-15 20:50:53 +00:00
Manoel Benedito Neto
683fd05d50 Add and configure IPsec Config Service
This commit adds the ipsec-config service to sm-db. This service is
responsible for managing the swanctl configuration by creating symbolic
links between swanctl.conf and different conf files.

Test Plan:
PASS: Build a new debian iso containing the changes.
PASS: Bootstrap, install and unlock a DX system with unlocked enable
      available status and IPsec enabled. Observe that ipsec-config
      service data is present on sm-db tables.

Story: 2010940
Task: 49998

Depends-On: https://review.opendev.org/c/starlingx/config/+/916841

Change-Id: Ia1544134b7d4d49897153c064b996a1f67b7599b
Signed-off-by: Manoel Benedito Neto <Manoel.BeneditoNeto@windriver.com>
2024-05-03 12:11:59 -04:00
Zuul
f7fb00619b Merge "Remove CentOS/OpenSUSE build support" 2024-04-29 12:56:35 +00:00
Scott Little
cb7602eafa Remove CentOS/OpenSUSE build support
StarlingX stopped supporting CentOS builds after release 7.0.
This update will strip CentOS from our code base.  It will also remove
references to the failed OpenSUSE feature as well.

Story: 2011110
Task: 49952
Change-Id: I1bed2fde10326ecb75b45376efea8480e0f23675
Signed-off-by: Scott Little <scott.little@windriver.com>
2024-04-26 14:10:39 -04:00
Andre Kantek
64278ce1e6 Split IP services in IPv4 and IPv6 for dual-stack support
This change splits the IP service for each platform network into ipv4
and ipv6 to support dual-stack. It still supports single-stack (when
there is only ipv4 or ipv6).

Test Plan:
[PASS] install, lock, unlock and swact for the following setups:
       - AIO-SX (IPv4 and IPv6)
       - AIO-DX (IPv4 and IPv6)
       - Standard (IPv4 and IPv6)
       - DC (SysCtrl=AIO-DX, subcloud=AIO-SX)
[PASS] Add dual-stack configuration and validate services operation
       with lock, unlock and swact:
       - AIO-SX (IPv4 and IPv6)
       - AIO-DX (IPv4 and IPv6)
       - Standard (IPv4 and IPv6)
       - DC (SysCtrl=AIO-DX, subcloud=AIO-SX), using the admin network

Story: 2011027
Task: 49761

Change-Id: Ic6451cae04769409babd2d6507c3677d1cce5617
Signed-off-by: Andre Kantek <andrefernandozanella.kantek@windriver.com>
2024-04-16 14:15:04 -03:00
Eric MacDonald
91fa44188c Add node locked gate to SM enable for DX systems
Service Management (SM) sometimes selects and activates services on a
locked controller following a dead office recovery.

This update adds a node locked check to SM's enable handler to
block enable if present much like the existing goenabled check
blocks enable if not present in the same function.

The enable gate file is /etc/mtc/tmp/.node_locked on the local host.

Maintenance manages the presence or absence of this file based on
the node's administrative state.
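
The gate logic described above can be sketched as a pair of file-presence checks. This is a minimal illustration, not SM's enable handler: the function name `enable_allowed`, the `is_simplex` parameter, and the demo file names are hypothetical, and the demo uses a temporary directory instead of the real flag file locations. The simplex exemption follows the test plan below, which states that SM ignores the node locked flag file on AIO SX.

```python
import os
import tempfile


def enable_allowed(goenabled_path: str, node_locked_path: str,
                   is_simplex: bool = False) -> bool:
    """Decide whether enable may proceed on this host."""
    # Existing gate: block enable while the goenabled flag is absent.
    if not os.path.exists(goenabled_path):
        return False
    # New gate: block enable while the node locked flag is present,
    # except on simplex systems where the flag is ignored.
    if not is_simplex and os.path.exists(node_locked_path):
        return False
    return True


with tempfile.TemporaryDirectory() as tmp_dir:
    goenabled = os.path.join(tmp_dir, "goenabled")       # hypothetical name
    node_locked = os.path.join(tmp_dir, ".node_locked")

    open(goenabled, "w").close()
    unlocked_dx = enable_allowed(goenabled, node_locked)

    open(node_locked, "w").close()
    locked_dx = enable_allowed(goenabled, node_locked)
    locked_sx = enable_allowed(goenabled, node_locked, is_simplex=True)
```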

This update also cleans up some extra whitespace in the changed file.

Test Plan:

PASS: Verify system build.
PASS: Verify AIO SX install.
PASS: Verify AIO DX install.
PASS: Verify Standard DX system install with worker and storage.

For Both 'AIO DX' and 'Standard DX with worker and storage':

PASS: Verify SM does not activate on a locked DX controller.
PASS: ... DOR case
PASS: ... Uncontrolled Swact case
PASS: Verify Standard DX behavior over DOR with one locked controller
      while the only unlocked controller does not recover.
PASS: Verify behavior after above test case once the only unlocked
      controller does recover.
PASS: Verify lock of the standby controller and its sm logs
PASS: Verify manually creating the new Nv locked file on the active
      controller will cause SM to go disabled and shut down all
      services on that controller.
      ... If there is another unlocked controller then verify it
          takes over as an uncontrolled swact.
      ... If there is no unlocked standby controller then verify SM
          remains shutdown until the manually created Nv node locked
          file is removed. At which point SM proceeds to activate
          services on that controller again.

PASS: Verify SM ignores the node locked flag file for AIO SX systems.
PASS: Verify lock/unlock of AIO SX controller.
PASS: Verify original reported issue is resolved for AIO DX systems.

Regression:

PASS: Verify controlled swact with unlocked enabled standby.
PASS: Verify uncontrolled swact with unlocked enabled standby.
PASS: Verify standby controller lock/unlock soak loop (10).
PASS: Verify swact loop soak (10).
PASS: Verify no crash or core dumps.
PASS: Verify SM logging

Closes-Bug: 2051578
Change-Id: If8e27ef30d62096fa77c3868f4d460b18e10ade2
(cherry picked from commit 23d0d8ab2f3225f10594547c5f8a67c409f815a0)
2024-02-26 22:15:03 +00:00
Zuul
338161f443 Merge "Revert "Add node locked gate to SM enable"" 2024-02-23 14:26:53 +00:00
Eric MacDonald
1e62ab86f1 Revert "Add node locked gate to SM enable"
This reverts commit 23d0d8ab2f3225f10594547c5f8a67c409f815a0.

Reason for revert: Breaks AIO SX Enable

Change-Id: I662b8732e723f4ce5b748ef00a184ae5b8db523c
2024-02-23 14:06:10 +00:00
Zuul
031c2e223d Merge "Add node locked gate to SM enable" 2024-02-16 16:11:22 +00:00
Zuul
9367d45672 Merge "Avoid potential blocking of heartbeat thread" 2024-02-14 21:35:21 +00:00
Eric MacDonald
23d0d8ab2f Add node locked gate to SM enable
Service Management (SM) sometimes selects and activates services on a
locked controller following a dead office recovery.

This update adds a node locked check to SM's enable handler to
block enable if present much like the existing goenabled check
blocks enable if not present in the same function.

The enable gate file is /etc/mtc/tmp/.node_locked on the local host.

Maintenance manages the presence or absence of this file based on
the node's administrative state.

This update also cleans up some extra whitespace in the changed file.

Test Plan:

PASS: Verify system build.
PASS: Verify AIO DX install.
PASS: Verify Standard DX system install with worker and storage.

For Both 'AIO DX' and 'Standard DX with worker and storage':

PASS: Verify SM does not activate on a locked controller.
PASS: ... DOR case
PASS: ... Uncontrolled Swact case
PASS: Verify Standard DX behavior over DOR with one locked controller
      while the only unlocked controller does not recover.
PASS: Verify behavior after above test case once the only unlocked
      controller does recover.
PASS: Verify lock of the standby controller and its sm logs
PASS: Verify manually creating the new Nv locked file on the active
      controller will cause SM to go disabled and shut down all
      services on that controller.
      ... If there is another unlocked controller then verify it
          takes over as an uncontrolled swact.
      ... If there is no unlocked standby controller then verify SM
          remains shutdown until the manually created Nv node locked
          file is removed. At which point SM proceeds to activate
          services on that controller again.

Regression:

PASS: Verify controlled swact with unlocked enabled standby.
PASS: Verify uncontrolled swact with unlocked enabled standby.
PASS: Verify standby controller lock/unlock soak loop (10).
PASS: Verify swact loop soak (10).
PASS: Verify no crash or core dumps.

Closes-Bug: 2051578
Change-Id: I0f0e3d199586513ddce484fdcc056e1b2562b45f
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2024-02-14 13:01:03 +00:00
Zuul
2fd5ebc6e6 Merge "sm-common: add support for arm64" 2024-01-17 16:02:51 +00:00
Zuul
56b60d15a5 Merge "sm: fix the hardcoded includes for arm64" 2024-01-17 15:52:16 +00:00
Kyale, Eliud
0db57d60be Add service dependency haproxy dnsmasq
haproxy uses DNS resolution.
Add a service dependency to the sm database
to ensure that the dnsmasq service is started before haproxy
and that dnsmasq is disabled after haproxy is disabled.
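
The ordering that such a dependency implies can be sketched with a topological sort. This illustrates the idea only and is not how SM evaluates its database; the dictionary below is a hypothetical in-memory stand-in for the dependency rows, and `graphlib` requires Python 3.9 or later.

```python
from graphlib import TopologicalSorter

# Map each service to the set of services it depends on.
depends_on = {
    "haproxy": {"dnsmasq"},
    "dnsmasq": set(),
}

# Dependencies come first when enabling ...
enable_order = list(TopologicalSorter(depends_on).static_order())
# ... and last when disabling.
disable_order = list(reversed(enable_order))
```

Here `enable_order` places dnsmasq before haproxy, and `disable_order` is the reverse.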

Test plan:

PASS - AIO-SX: iso install
PASS - AIO-SX: reboot test
PASS - AIO-DX: iso install
PASS - AIO-DX: swact test

Closes-Bug: #2043506

Change-Id: I494faebfe67843d34819f66a0a2fbd977657bb6b
Signed-off-by: Kyale, Eliud <Eliud.Kyale@windriver.com>
2024-01-16 09:32:53 -05:00
Jackie Huang
35d8d23563 sm-common: add support for arm64
Add support for aarch64 in sm_trap_thread_log.

Test Plan:
PASS: build-pkgs on x86-64 host
PASS: build-image on x86-64 host
PASS: build-pkgs on arm64 host
PASS: build-image on arm64 host
PASS: Deploy AIO-SX on x86-64 targets and check sm service
PASS: Deploy AIO-SX on arm64 targets and check sm service
PASS: Deploy AIO-DX on arm64 targets and check sm service
PASS: Deploy std (2+2+2) on arm64 targets and check sm service

Story: 2010739
Task: 48017

Change-Id: Iebea29e6df900f63d0dce24cf1a139f60c1cf6f8
Signed-off-by: Jackie Huang <jackie.huang@windriver.com>
2023-11-28 16:26:23 +08:00
Jackie Huang
15a8ffeee0 sm: fix the hardcoded includes for arm64
The includes path in Makefile is hardcoded with x86_64,
use dpkg-architecture to check the host arch and replace
the hardcoded name.

Test Plan:
PASS: build-pkgs on x86-64 host
PASS: build-image on x86-64 host
PASS: build-pkgs on arm64 host
PASS: build-image on arm64 host
PASS: Deploy AIO-SX on x86-64 targets and check sm service
PASS: Deploy AIO-SX on arm64 targets and check sm service
PASS: Deploy AIO-DX on arm64 targets and check sm service
PASS: Deploy std (2+2+2) on arm64 targets and check sm service

Story: 2010739
Task: 48017

Change-Id: Ie22477b7ec7df63377f666186d95201cd16f5809
Signed-off-by: Jackie Huang <jackie.huang@windriver.com>
2023-11-28 16:26:23 +08:00
Bin Qian
d91b069daf Avoid potential blocking of heartbeat thread
This avoids waiting for an hbs cluster query before sending the SM alive
pulse. When an hbs cluster query or alive pulse is being sent, do not
queue the subsequent alive pulse, as the request currently being sent is
good enough to update the hbs agent.
Also move the function that retrieves the socket address out of the query
sending procedure and into initialization. This avoids calling getaddrinfo
on the send path, since it indirectly calls malloc, which invokes
malloc_atfork and can potentially block.
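
The "do not queue while a request is in flight" rule can be sketched as a small gate. This is an assumption-laden illustration rather than the SM heartbeat code: the class name `AlivePulseGate` and its methods are hypothetical.

```python
import threading


class AlivePulseGate:
    """Allow at most one outstanding request; extra requests are dropped."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._in_flight = False

    def try_begin(self) -> bool:
        """Return True if the caller may send; False if one is in flight."""
        with self._lock:
            if self._in_flight:
                return False
            self._in_flight = True
            return True

    def finish(self) -> None:
        """Mark the outstanding request as completed."""
        with self._lock:
            self._in_flight = False


gate = AlivePulseGate()
first = gate.try_begin()   # accepted: nothing is in flight
second = gate.try_begin()  # dropped, not queued
gate.finish()
third = gate.try_begin()   # accepted again after completion
```

Dropping rather than queuing keeps the caller from waiting behind earlier requests, which matches the intent stated in the commit message.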

TCs:
   This is expected to help only in extreme situations; regression passed.

Closes-bug: 2025504

Change-Id: I520b42f0330b670e301279c2e42670d40361adc5
Signed-off-by: Bin Qian <bin.qian@windriver.com>
2023-11-17 21:01:08 +00:00
Zuul
4b800442ed Merge "Use FQDN for MGMT network" 2023-11-02 20:22:20 +00:00