The related filter files need to change because the memcached package
is only installed on the controller node.
Basic multinode deployment test passes.
Story: 2004108
Task: 27517
Change-Id: Id21d3db9d172398f6c23f86e5dd5f9e5a249c6b8
Signed-off-by: zhipengl <zhipengs.liu@intel.com>
There are 2 duplicated LICENSE files in mtce-control, mtce-compute,
and mtce-storage. Additionally, LICENSE was not placed in the root
directory of src RPM, so this patch is made as an enhancement or fix.
After this change, license file location and code structure in all 4
modules (mtce-common, mtce-compute, mtce-storage and mtce-control)
will be the same.
Test method: make a clean build and check src RPM and binary RPM
to assure there is only one LICENSE in correct place.
Story: 2004186
Task: 27676
Change-Id: Id71a7450e8b45438c5d15976ae8e853b9ba8f4f5
Signed-off-by: Yong Hu <yong.hu@intel.com>
The new flake8 version 3.6.0 introduces new warnings that cause
the check and gate jobs to fail. Locking down the flake8 version
to avoid these surprises in the future. We can later increment
the flake8 version and fix the new warnings in a controlled
manner.
Change-Id: I0cec95de9fe9d58536752038b94939962919d166
Partial-Bug: 1799721
Signed-off-by: Bart Wensley <barton.wensley@windriver.com>
The warnings might be treated as errors when the build system uses a
compiler, for example gcc/g++ 8.2.1, with the "-O2 -Wall -Wextra -Werror"
options.
Story: 2004134
Task: 27591
Change-Id: I576a8c0305a4c32772fbc750ef39c73334b19336
Signed-off-by: Yong Hu <yong.hu@intel.com>
Rename files and folders in mtce-compute, mtce-control, and
mtce-storage. Also update the package names in the
bsp-files/filter_out_* scripts accordingly.
Story: 2004079
Task: 27485
Change-Id: Ic1e9bd4bb8d72f30ddcc2a2bfc602a1a34e583da
Signed-off-by: Yong Hu <yong.hu@intel.com>
* Add pointers in the main doc to api-ref and releasenotes pages
* Add publish-stx-api-ref and publish-stx-releasenotes jobs
* Add search at bottom of api-ref and relnotes pages to trigger the jobs
Change-Id: Ib41f10ce72eb283d4edbeb1ecc0543403295d7bf
Signed-off-by: Abraham Arce <abraham.arce.moreno@intel.com>
Also set the theme to alabaster until starlingxdocs is ready
Change-Id: I6a113b9fddb64792b5454b3ef0ef866ef9f74fc6
Signed-off-by: Dean Troyer <dtroyer@gmail.com>
This is part one of a two-part HA Improvements feature that introduces
the collection of heartbeat health at the system level.
The full feature is intended to provide service management (SM)
with the last 2 seconds of maintenance's heartbeat health view that
is reflective of each controller's connectivity to each host
including its peer controller.
The heartbeat cluster summary information is additional information
for SM to draw on when needing to make a choice of which controller
is healthier, if/when to switch over and to ultimately avoid split
brain scenarios in a two controller system.
Feature Behavior: A common heartbeat cluster data structure is
introduced and published to the sysroot for SM. The heartbeat
service populates and maintains a local copy of this structure
with data that reflects the responsiveness for each monitored
network of all the monitored hosts for the last 20 heartbeat
periods. Mtce sends the current cluster summary to SM upon request.
General flow of cluster feature wrt hbsAgent:
  hbs_cluster_init: general data init
  hbs_cluster_nums: set controller and network numbers
  forever:
    select:
      hbs_cluster_add / hbs_cluster_del: - add/del hosts from mtcAgent
      hbs_sm_handler -> hbs_cluster_send: - send cluster to SM
    heartbeating:
      hbs_cluster_append: add controller cluster to pulse request
      hbs_cluster_update: get controller cluster data from pulse responses
    hbs_cluster_save: save other controller cluster view in cluster vault
    hbs_cluster_log: log cluster state changes (clog)
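The rolling history described above can be illustrated with a minimal
sketch, assuming one fixed-length window of 20 heartbeat periods per
monitored host and network. The class and method names (ClusterHistory,
add_host, del_host, record, summary) are hypothetical and chosen for
illustration only; they are not the hbsAgent data structure, which is
implemented in C++.

```python
from collections import defaultdict, deque

HISTORY_PERIODS = 20  # from the text: the last 20 heartbeat periods


class ClusterHistory:
    """Hypothetical rolling record of heartbeat responsiveness."""

    def __init__(self, periods=HISTORY_PERIODS):
        # (network, host) -> deque of booleans, oldest entry first
        self._history = defaultdict(lambda: deque(maxlen=periods))

    def add_host(self, network, host):
        # Touching the key creates an empty window for the host.
        self._history[(network, host)]

    def del_host(self, network, host):
        self._history.pop((network, host), None)

    def record(self, network, host, responded):
        # Append one period; the deque drops the oldest entry when full.
        self._history[(network, host)].append(bool(responded))

    def summary(self):
        # Number of responsive periods per (network, host) in the window.
        return {key: sum(window) for key, window in self._history.items()}
```

A bounded deque keeps the window size fixed without any explicit
trimming, which is one simple way to hold "the last N periods".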
Test Plan:
PASS: Verify compute system install
PASS: Verify storage system install
PASS: Verify cluster data ; all members of structure
PASS: Verify storage-0 state management
PASS: Verify add of second controller
PASS: Verify add of storage-0 node
PASS: Verify behavior over Swact
PASS: Verify lock/unlock of second controller ; overall behavior
PASS: Verify lock/unlock of storage-0 ; overall behavior
PASS: Verify lock/unlock of storage-1 ; overall behavior
PASS: Verify lock/unlock of compute nodes ; overall behavior
PASS: Verify heartbeat failure and recovery of compute node
PASS: Verify heartbeat failure and recovery of storage-0
PASS: Verify heartbeat failure and recovery of controller
PASS: Verify delete of controller node
PASS: Verify delete of storage-0
PASS: Verify delete of compute node
PASS: Verify cluster when controller-1 active / controller-0 disabled
PASS: Verify MNFA and recovery handling
PASS: Verify handling in presence of multiple failure conditions
PASS: Verify hbsAgent memory leak soak test with continuous SM query.
PASS: Verify active controller-1 infra network failure behavior.
PASS: Verify inactive controller-1 infra network failure behavior.
Change-Id: I4154287f6dcf5249be5ab3180f2752ab47c5da3c
Story: 2003576
Task: 24907
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
Make sure the minimally sized physical volume for the cgts-vg volume
group is sized correctly when the --kubernetes option is used with
config_controller. Include the kubernetes specific logical volumes that
will be present in a kubernetes configuration.
Change-Id: Ifc6d2c0a5dbb880f8e1ff73b01bc05cb8ab22855
Partial-Bug: #1794567
Signed-off-by: Robert Church <robert.church@windriver.com>
Maintenance is seen to intermittently fail Swact requests when
it fails to get a response from SM 500 msecs after having issued
the request successfully.
A recent instrumentation update went in which verified that the
http request was being launched properly even in the failure cases.
Seems the 500 msec timeout might not be long enough to account
for SM's scheduling/handling.
This update increases the receive retry delay from 50 msec to 1 second.
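As a rough illustration of a receive loop with a configurable retry
delay, here is a minimal sketch. The function name, the poll callback
and the retry count are assumptions made for this example; the actual
maintenance code is C++ and is not reproduced here.

```python
import time


def wait_for_response(poll, retries=10, delay_secs=1.0, sleep=time.sleep):
    """Poll until a response arrives or the retries run out.

    poll() is a hypothetical callback that returns the response, or
    None when nothing has arrived yet.
    """
    for attempt in range(retries):
        response = poll()
        if response is not None:
            return response
        # Wait before the next attempt, but not after the last one.
        if attempt < retries - 1:
            sleep(delay_secs)
    return None
```

With a longer delay per retry, the same number of retries covers a
longer overall wait for the peer to schedule and handle the request.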
Change-Id: I29d6ba03094843a2af9d8720dd074572d76a31a4
Related-Bug: https://bugs.launchpad.net/starlingx/+bug/1791381
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
The UEFI grub config file used for the PXE Boot Server contains a mix
of menuentry and submenu entries. A submenu opens a new context and
global variables cannot be used within it; see
https://bugs.launchpad.net/ubuntu/+source/grub2/+bug/1175127. This
was causing pxe_root to be empty for submenu selections. Moving
'1) UEFI Boot from hard drive' into a submenu fixes this problem and
also removes the need to save the root variable. Saving it was
previously required because, when a user selected boot from hard drive
and the hard drive did not exist, root would no longer be set to the
pxeboot server and selecting the other menu entries would fail.
Remove the '1)' prefix in the 'Boot from hard drive' menu option in
both Legacy and UEFI boot as this is not used by other menus.
Closes-Bug: 1794863
Change-Id: I5bc62039bfb68477e9cd0166ce17877693037640
Signed-off-by: Kristine Bujold <kristine.bujold@windriver.com>
I added the api-ref-sysinv-v1-metal.rst file, which is content
for the API reference manual to the api-ref/source dir. This
represents the converted old-style files to the newer OpenStack
supported RST files.
I updated the index.rst file to include the new .rst file so that
the api-ref document can build.
Updated the actual output using Platform http://10.10.10.2:6385/v1
and querying.
Change-Id: I943dd136b2b083743e053db5070d4da7fbe78685
Signed-off-by: Scott Rifenbark <srifenbark@gmail.com>
The minimum backup partition size is calculated as the sum of the
sizes of the glance and database partitions, plus a 20GB overhead.
This fix increases the default backup partition size to 40GB to
be in line with the above calculation, considering the database
and the glance partitions are 10GB each by default.
This also increases the minimum disk requirements from 120GB
to 130GB.
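The calculation above can be written out as a short sketch. The
function and constant names are hypothetical; the sizes are the ones
given in the text.

```python
BACKUP_OVERHEAD_GB = 20  # overhead named in the text


def min_backup_size_gb(glance_gb=10, database_gb=10):
    """Minimum backup partition size: glance + database + overhead."""
    return glance_gb + database_gb + BACKUP_OVERHEAD_GB


# With the default 10GB glance and database partitions this gives 40.
default_backup_gb = min_backup_size_gb()
```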
Change-Id: I5cfc329870a84a6245d868b4c4990829e702e886
Closes-bug: 1793543
Signed-off-by: Stefan Dinescu <stefan.dinescu@windriver.com>
This decouples the build and packaging of guest-server and guest-agent
from mtce, by splitting the guest component into the stx-nfv repo.
This leaves existing C++ code, scripts, and resource files untouched,
so there is no functional change. Code refactoring is beyond the scope
of this update.
Makefiles were modified to include devel headers directories
/usr/include/mtce-common and /usr/include/mtce-daemon.
This ensures there is no contamination with other system headers.
The cgts-mtce-common package is renamed and split into:
- repo stx-metal: mtce-common, mtce-common-dev
- repo stx-metal: mtce
- repo stx-nfv: mtce-guest
- repo stx-ha: updates package dependencies to mtce-pmon for
service-mgmt, sm, and sm-api
mtce-common:
- contains common and daemon shared source utility code
mtce-common-dev:
- based on mtce-common, contains devel package required to build
mtce-guest and mtce
- contains common library archives and headers
mtce:
- contains components: alarm, fsmon, fsync, heartbeat, hostw, hwmon,
maintenance, mtclog, pmon, public, rmon
mtce-guest:
- contains guest component guest-server, guest-agent
Story: 2002829
Task: 22748
Change-Id: I9c7a9b846fd69fd566b31aa3f12a043c08f19f1f
Signed-off-by: Jim Gauld <james.gauld@windriver.com>
Currently, the management interface can be shared with infrastructure
only over a VLAN. This update supports both the management and
infrastructure networks sharing a single interface.
Story: 2003087
Task: 23171
Depends-On: https://review.openstack.org/#/c/601156
Change-Id: Ie97dbd1260f5c98d7401b0e48361ebd87f060f65
Signed-off-by: Teresa Ho <teresa.ho@windriver.com>
* Use build-openstack-docs-pti job template for docs
* Use build-openstack-releasenotes job for release notes
(We can't use the OpenStack releasenotes template as it includes
publish jobs, stx needs its own)
* Add newnote tox environment as convenience for creating new release
notes, re-using the releasenotes venv.
* Create a release summary note.
Change-Id: I5a610cfe271707fd704248ede0db75be6d031121
Signed-off-by: Dean Troyer <dtroyer@gmail.com>
Allocation of space on the root disk was done in kickstart, combined
with config_controller. Various checks for disk and volume group
space made it difficult to add new filesystem partitions.
To simplify this, all checks are now merged under disk size checks.
Kickstart files now create the partitions of the same size for
rootfs and log partitions, no matter the disk size, with the
only variable size being the cgts-vg size.
Thus, any future changes will only need to consider this
size for ensuring enough space is present.
The limit between small and big disks is 240GB, with a minimum
disk size of 120GB.
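A minimal sketch of a single disk size check using the two limits
stated above. The function name and the choice to treat a disk of
exactly 240GB as "big" are assumptions made for this example; the
text does not say which side of the limit that case falls on.

```python
MIN_DISK_GB = 120  # minimum disk size from the text
BIG_DISK_GB = 240  # limit between small and big disks from the text


def classify_root_disk(size_gb):
    """Return 'small' or 'big', or raise if the disk is too small."""
    if size_gb < MIN_DISK_GB:
        raise ValueError(
            "disk is %dGB, minimum is %dGB" % (size_gb, MIN_DISK_GB))
    # Assumption: a disk of exactly BIG_DISK_GB counts as big.
    return "big" if size_gb >= BIG_DISK_GB else "small"
```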
Depends-On: https://review.openstack.org/#/c/600743/
Change-Id: I37ecc8eb5468811d1ca3a71f8e2a0629525e8fad
Closes-bug: 1791170
Signed-off-by: Stefan Dinescu <stefan.dinescu@windriver.com>
Maintenance is seen to intermittently fail Swact requests early
after initial system provisioning, without logging an error
reason, only to always succeed later on.
The issue is difficult to reproduce so this update adds extra
logging to this code path and implements a speculative fix.
The event_base_loop calls' non-zero return code is never being
logged. The libevent documentation states that this API will
return 1 while the target has not yet provided any data.
Theory is, because the call is local, that normally it returns
with data even on the first dispatch case. However, during early
system configuration, when the system is busy, that first dispatch
does not complete immediately like it normally does later on.
Speculation is, instead it returns a 1 stating retry but the
existing code path treats that as a failure.
This update modifies the code to return a PASS if the command
dispatch returns a 1 while the error case of -1 gets enhanced
logging and continues to be treated as a failure.
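The return code handling described above can be sketched as follows.
This is an illustration only: the real code is C++ calling libevent's
event_base_loop, and the names dispatch_status, PASS and FAIL, as well
as the treatment of a 0 return as a pass, are assumptions made for
this example.

```python
PASS = 0
FAIL = 1


def dispatch_status(loop_rc, log=print):
    """Map a dispatch return code to PASS or FAIL as the text describes."""
    if loop_rc == 0:
        # Assumption: a zero return means the dispatch completed.
        return PASS
    if loop_rc == 1:
        # No data yet: log it, but do not treat it as a failure.
        log("dispatch returned 1 (no data yet); treating as pass")
        return PASS
    # -1 or anything unexpected: log with detail and fail.
    log("dispatch failed with rc=%d" % loop_rc)
    return FAIL
```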
Test Plan:
PASS: Swact 5 times
PASS: Lock/Unlock Host
PASS: Large System DOR
Related-Bug: https://bugs.launchpad.net/starlingx/+bug/1791381
Change-Id: I19b22e07d3224b2e9dd3f3569ecbe9aed7d9402f
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
The current maintenance heartbeat failure action handling is to Fail
and Gracefully Recover the host. This means that maintenance will
ensure that a heartbeat failed host is rebooted/reset before it is
recovered but will avoid rebooting it a second time if its recovered
uptime indicates that it has already rebooted.
This update expands that single action handling behavior to support
three new actions. In doing so it adds a new configuration service
parameter called heartbeat_failure_action. The customer can configure
this new parameter with any one of the following 4 actions in order of
decreasing impact.
fail    - Host is failed and gracefully recovered.
        - Current Network specific alarms continue to be raised/cleared.
          Note: Prior to this update this was standard system behavior.
degrade - Host is only degraded while it is failing heartbeat.
        - Current Network specific alarms continue to be raised/cleared.
        - heartbeat degrade reason is cleared as are the alarms when
          heartbeat responses resume.
alarm   - The only indication of a heartbeat failure is by alarm.
        - Same set of alarms as in above action cases
        - Only in this case no degrade, no failure, no reboot/reset
none    - Heartbeat is disabled ; no multicast heartbeat message is sent.
        - All existing heartbeat alarms are cleared.
        - The heartbeat soak as part of the enable sequence is bypassed.
The selected action is a system wide setting.
The selected setting also applies to Multi-Node Failure Avoidance.
The default action is the legacy action Fail.
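The four actions and their described effects can be summarized in a
small sketch. The enum, the helper names and the dictionary keys are
hypothetical; only the parameter values (fail, degrade, alarm, none)
and the default come from the text above.

```python
from enum import Enum


class HeartbeatFailureAction(Enum):
    FAIL = "fail"
    DEGRADE = "degrade"
    ALARM = "alarm"
    NONE = "none"


DEFAULT_ACTION = HeartbeatFailureAction.FAIL  # legacy behavior


def parse_action(value=None):
    """Convert a configured string into an action, using the default."""
    if value is None:
        return DEFAULT_ACTION
    return HeartbeatFailureAction(value.strip().lower())


def effects(action):
    """Summarize what each action does on a heartbeat failure."""
    enabled = action is not HeartbeatFailureAction.NONE
    return {
        "heartbeat_sent": enabled,
        "alarms_raised": enabled,
        "host_degraded": action is HeartbeatFailureAction.DEGRADE,
        "host_failed": action is HeartbeatFailureAction.FAIL,
    }
```

An unknown string raises ValueError from the Enum lookup, which keeps
a misspelled setting from silently falling back to another action.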
This update also
1. Removes redundant inservice failure alarm for MNFA case in support
of degrade only action. Keeping it would make that alarm handling
case unnecessarily complicated.
2. Removes the no longer used 'hbs calibration' code (cleanup).
3. Performs a small amount of heartbeat logging cleanup.
Test Plan:
PASS: fail: Verify MNFA and recovery
PASS: fail: Verify Single Host heartbeat failure and recovery
PASS: fail: Verify Single Host heartbeat failure and recovery (from none)
PASS: degrade: Verify MNFA and recovery
PASS: degrade: Verify Single Host heartbeat failure and recovery
PASS: degrade: Verify Single Host heartbeat failure and recovery (from alarm)
PASS: alarm: Verify MNFA and recovery
PASS: alarm: Verify Single Host heartbeat failure and recovery
PASS: alarm: Verify Single Host heartbeat failure and recovery (from degrade)
PASS: none: Verify heartbeat disable, fail ignore and no recovery
PASS: none: Verify Single Host heartbeat ignore and no recovery
PASS: none: Verify Single Host heartbeat ignore and no recovery (from fail)
PASS: Verify action change behavior from none to alarm with active MNFA
PASS: Verify action change behavior from alarm to degrade with active MNFA
PASS: Verify action change behavior from degrade to none with active MNFA
PASS: Verify action change behavior from none to fail with active MNFA
PASS: Verify action change behavior from fail to none with active MNFA
PASS: Verify action change behavior from degrade to fail then MNFA timeout
PASS: Verify all heartbeat action change customer logs
PASS: Verify heartbeat stats clear over action change
PASS: Verify LO DOR (several large labs - compute and storage systems)
PASS: Verify recovery from failure of active controller
PASS: Verify 3 host failure behavior with MNFA threshold at 3 (action:fail)
PASS: Verify 2 host failure behavior with MNFA threshold at 3 (action:fail)
Depends-On: https://review.openstack.org/601264
Change-Id: Iede5cdbb1c923898fd71b3a95d5289182f4287b4
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
Filter out platform-util-controller from compute and storage nodes.
Story: 2002826
Task: 26228
Depends-On: https://review.openstack.org/600868
Change-Id: I3d8fc737c9a59caef7658d4139b302cee3841592
Signed-off-by: Jack Ding <jack.ding@windriver.com>
Use flake8 as the pep8 tool.
Enable check and gate jobs for pep8 (voting).
Fix below flake8 issues:
E127 continuation line over-indented for visual indent
E211 whitespace before '('
E222 multiple spaces after operator
E302 expected 2 blank lines, found 1
E501 line too long (101 > 79 characters)
E502 the backslash is redundant between brackets
F401 'platform' imported but unused
W391 blank line at end of file
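For illustration, a short snippet written to avoid the warnings listed
above. The function names are made up for this example and do not come
from the changed files.

```python
import os  # F401: every import is used below


def join_path(base, name):
    # E211: no whitespace before '('; E222: one space after '='
    full = os.path.join(base, name)
    return full


# E302: two blank lines precede each top-level definition
def describe(base, name):
    # E501/E502: a long call is wrapped inside its brackets, so no
    # backslash is needed; E127: the continuation aligns with the bracket
    return "{} -> {}".format(name,
                             join_path(base, name))
```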
Change-Id: Idfb953e52c8ee35c2adefdf0e4143a381c7f49e2
Story: 2003426
Task: 24596
Signed-off-by: Sun Austin <austin.sun@intel.com>