The nodepool openstack provider implementation has a default launch
timeout of 3600 seconds (one hour). This is problematic for us because
a node request will be attempted three times by a provider before being
passed on to the next provider. This means we may wait up to three hours
per failed provider to launch a single node.
Fix this by setting a timeout of 10 minutes on the three providers that
didn't already set a lower timeout (raxflex and the two vexxhost
providers). Looking at Grafana graphs for time to ready, this should be
plenty of time under normal booting conditions and reduces the time we
wait from 3 hours per provider to 30 minutes.
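As a rough sketch of the knob involved (the provider name and
surrounding keys here are illustrative, not the exact production
entries):

  providers:
    - name: raxflex-sjc3
      driver: openstack
      # override the driver's 3600 second default so stuck boots fail fast
      launch-timeout: 600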
Change-Id: I0a4af9f5519ff2b64c737b1822590fbb2608e8bb
The openmetal provider nodes have Intel VMX flags set and raxflex
provider nodes have AMD SVM flags set. Both should be capable of nested
virt (assuming nested virt works at all), so let's add the nested-virt
labels to these clouds.
In the openmetal case we have the ability to directly gather debugging
info ourselves, and in the raxflex case we know who to contact when
things go wrong.
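For illustration, adding such a label to a pool looks roughly like the
following (label and flavor names are placeholders):

  pools:
    - name: main
      labels:
        - name: ubuntu-noble-nested-virt
          diskimage: ubuntu-noble
          flavor-name: example-8GB-flavor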
Change-Id: Icc7c9cbafaef93f3ccec7010c82af1d36e02533c
Under heavy load, we're occasionally seeing ssh-keyscan time out
with no reply after a minute. Double the time Nodepool is willing to
wait for SSH to connect, in the hope that this reduces these occurrences.
Change-Id: Icfc3cc4a29854455684d88be41ba3e70e8507f3a
Looking at `openstack limits show --absolute` we seem to currently
have sufficient memory quota to take this up to 32x 8GB RAM
instances (quotas for CPU and total instances are higher so RAM is
the limiting factor at the moment).
Things seem to be going fine at 20 nodes for the past few hours, so
let's go ahead and dial it up to maximum for the weekend and see how
it works out before we let our Rackspace contacts know we're ready
for additional quota.
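For reference, the arithmetic and the pool setting being raised (pool
name is an assumption):

  pools:
    - name: main
      # 32 instances x 8GB RAM = 256GB, which fits in the current quota
      max-servers: 32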
Change-Id: I1d61da4ca2a1a5963e0cdcff99fdf4f41aea5920
The opendevzuul-network01 network should route to the global
Internet via NAT, so make that clear to Nodepool.
Change-Id: I8e0038fe82b0fc968d80ab656808b7744c160132
Our expected Nodepool images are present in the raxflex-sjc3
provider now, so raise max-servers to 1 in order to get some sample
builds we can analyze for obvious issues like connectivity.
Depend on the subnet resizing change, since that might otherwise be
disruptive if it were deployed while jobs are running there.
Depends-On: https://review.opendev.org/927813
Change-Id: If8b39b53608188e5881a273e6d092981a5871e84
Put the raxflex-sjc3 provider in our Nodepool configuration, but
with booting disabled by zeroing max-servers so we can make sure
image uploads are working first.
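The pattern is roughly the following (cloud, region, image, and pool
details here are illustrative, not the full production entry):

  providers:
    - name: raxflex-sjc3
      driver: openstack
      cloud: raxflex
      region-name: SJC3
      diskimages:
        - name: ubuntu-jammy
      pools:
        - name: main
          max-servers: 0  # image uploads proceed, but nothing boots yet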
Change-Id: Id3d2e73b73c35af52dbc13579e773329e4f9ad68
Now that noble is our default nodeset, let's split the min-ready value
of 10 nodes for jammy in half between jammy and noble. When jammy
becomes less common we'll set noble to the full 10.
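In the top-level labels section that amounts to something like this
(entries trimmed to the relevant keys):

  labels:
    - name: ubuntu-jammy
      min-ready: 5
    - name: ubuntu-noble
      min-ready: 5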
Change-Id: I1e3b83c18234e1d15a8eb76a63989b29d4925908
This should be the last piece of cleanup to remove CentOS 8 Stream from
Nodepool. We basically tell the builders to forget about these images,
which should result in cleanup of all the related records on disk and
in ZooKeeper.
Change-Id: I71421ce9a10438549ef21441349be84b5d7bd38b
This will stop nodepool from trying to manage centos-8-stream images in
our cloud providers. It will also remove the label from nodepool and
zuul entirely, making this node type unusable. This will produce
NODE_FAILURE errors but jobs were already failing 100% of the time due
to the lack of valid package mirrors.
The followup change will stop our image builds (though they are paused
now) which will clean up the images on disk on our builders.
Depends-On: https://review.opendev.org/c/opendev/base-jobs/+/922653
Change-Id: I1bdb2441a7b8ce5c651e5e865005f80828cd6f47
We are trying to phase out this node type. We don't need to have a ready
node sitting around at all times for it.
Change-Id: I74da8de9b9776f2f33e921f3566e5f1c134be88d
In preparation for centos-8-stream cleanup we want to ensure we are not
going to automatically boot more nodes that we will then need to clean up.
Followup changes will more completely remove the node from nodepool.
Change-Id: I4ea6b7ab449124325cf22129663f86ef7117a5b9
Build images and boot ubuntu-noble everywhere we do for
ubuntu-jammy. Drop the kernel boot parameter override we use on
Jammy since it's the default in the kernel versions included in Noble
now.
Change-Id: I3b9d01a111e66290cae16f7f4f58ba0c6f2cacd8
This is the last step in cleaning centos-7 out of nodepool. The previous
change will have cleaned up uploads and now we can stop building the
images entirely.
Change-Id: Ie81d6d516cd6cd42ae9797025a39521ceede7b71
This removal of centos-7 image uploads should cause Nodepool to clean up
the existing images in the clouds. Once that is done we can completely
remove the image builds in a followup change.
We are performing this cleanup because CentOS 7 is near its EOL and
cleaning it up will create room on nodepool builders and our mirrors for
other more modern test platforms.
Depends-On: https://review.opendev.org/c/opendev/base-jobs/+/912786
Change-Id: I48f6845bc7c97e0a8feb75fc0d540bdbe067e769
The cloud name is used to look up cloud credentials in clouds.yaml, but
it is also used to determine names for things like mirrors within jobs.
As a result, changing this value can impact running jobs, since you need
to update DNS for mirrors (and possibly launch new mirrors) first. Add a
warning to help avoid problems like this in the future.
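The warning is just a comment next to the value in question, roughly
(provider and cloud names here are placeholders):

  providers:
    - name: example-provider
      # WARNING: this cloud name is also used to derive mirror hostnames
      # in jobs; update mirror DNS (and possibly launch new mirrors)
      # before changing it.
      cloud: example-cloud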
Change-Id: I9854ad47553370e6cc9ede843be3303dfa1f9f34
This reverts commit eca3bde9cbba1b680f4f813a421ceb2d5803cf96.
This was successful, but we want to make the change without altering
the cloud name. So switch this back, and separately we will update
the config of the rax cloud.
Change-Id: I8cdbd7777a2da866e54ef9210aff2f913a7a0211
Switch the Rackspace region with the smallest quota to uploading
images and booting server instances with our account's API key
instead of its password, in preparation for their MFA transition. If
this works as expected, we'll make a similar switch for the
remaining two regions.
Change-Id: I97887063c735c96d200ce2cbd8950bbec0ef7240
Depends-On: https://review.opendev.org/911164
This drops min-ready for centos-7 to 0 and removes use of some centos 7
jobs from puppet-midonet. We will clean up those removed jobs in a
followup change to openstack-zuul-jobs.
We also remove x/collected-openstack-plugins from zuul. This repo uses
centos 7 nodesets that we want to clean up and it last merged a change
in 2019. That change was written by the infra team as part of global
cleanups. I think we can remove it from zuul for now, and if interest
returns it can be re-added and fixed up.
Change-Id: I06f8b0243d2083aacb44fe12c0c850991ce3ef63
This should be landed after the parent change has landed and nodepool
has successfully deleted all debian-buster image uploads from our cloud
providers. At this point it should be safe to remove the image builds
entirely.
Change-Id: I7fae65204ca825665c2e168f85d3630686d0cc75
Debian buster has been replaced by bullseye and bookworm, both of which
are releases we have images for. It is time to remove the unused debian
buster images as a result.
This change follows the process in nodepool docs for removing a provider
[0] (which isn't quite what we are doing) to properly remove images so
that they can be deleted by nodepool before we remove nodepool's
knowledge of them. The followup change will remove the image builds from
nodepool.
[0] https://zuul-ci.org/docs/nodepool/latest/operation.html#removing-a-provider
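In config terms the sequence is roughly as follows (provider name is a
placeholder):

  # Step 1 (this change): drop debian-buster from each provider's
  # diskimages list so Nodepool deletes its existing uploads.
  providers:
    - name: example-provider
      diskimages:
        - name: ubuntu-jammy   # other images remain listed
        # debian-buster entry removed here
  # Step 2 (followup change): once the uploads are gone, delete the
  # top-level diskimages build entry for debian-buster as well.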
Depends-On: https://review.opendev.org/c/opendev/base-jobs/+/910015
Change-Id: I37cb3779944ff9eb1b774ecaf6df3c6929596155
This is in preparation for the removal of this distro release from
Nodepool. Setting this value to 0 will prevent nodepool from
automatically booting new nodes under this label once we clean up any
existing nodes.
Change-Id: I90b6c84a92a0ebc4f40ac3a632667c8338d477f1
This should be landed after the parent change has landed and nodepool
has successfully deleted all opensuse-15 image uploads from our cloud
providers. At this point it should be safe to remove the image builds
entirely.
Change-Id: Icc870ce04b0f0b26df673f85dd6380234979906f
These images are old opensuse 15.2 and there doesn't seem to be interest
in keeping them running (very few jobs ever ran on them, rarely
successfully, and no one is trying to update to 15.5 or 15.6).
This change follows the process in nodepool docs for removing a provider
[0] (which isn't quite what we are doing) to properly remove images so
that they can be deleted by nodepool before we remove nodepool's
knowledge of them. The followup change will remove the image builds from
nodepool.
[0] https://zuul-ci.org/docs/nodepool/latest/operation.html#removing-a-provider
Depends-On: https://review.opendev.org/c/opendev/base-jobs/+/909773
Change-Id: Id9373762ed5de5c7c5131811cec989c2e6e51910
This is in preparation for the followup changes that will drop opensuse
nodes and images entirely. We set min-ready to 0 first so that we can
manually delete any running nodes before cleaning things up further.
Change-Id: I6cae355fd99dd90b5e48f804ca0d63b641c5da11
This removes the fedora image builds from nodepool. At this point
Nodepool should no longer have any knowledge of fedora.
There is potential for other cleanups for things like dib elements, but
leaving those in place doesn't hurt much.
Change-Id: I3e6984bc060e9d21f7ad851f3a64db8bb555b38a
This will stop providing the node label entirely and should result in
nodepool cleaning up the existing uploads of these images in our cloud
providers. It does not remove the diskimages for fedora; that will
happen next.
Change-Id: Ic1361ff4e159509103a6436c88c9f3b5ca447777
In preparation for fedora node label removal we set min-ready to 0. This
is the first step to removing the images entirely.
Change-Id: I8c2a91cc43a0dbc633857a2733d66dc935ce32fa
Looking at our graphs, we're still spiking up into the 30-60
concurrent building range at times, which seems to result in some
launches exceeding the already lengthy timeout and wasting quota,
but when things do manage to boot we effectively utilize most of
max-servers nicely. The variability is because max-concurrency is
the maximum number of in-flight node requests the launcher will
accept for a provider, but the number of nodes in a request can be
quite large sometimes.
Raise max-servers back to its earlier value reflecting our available
quota in this provider, but halve the max-concurrency so we don't
try to boot so many at a time.
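Sketched as config, with the provider name and the halved
max-concurrency value as assumptions:

  providers:
    - name: rax-ord
      # limit in-flight node requests, not total servers
      max-concurrency: 50
      pools:
        - name: main
          max-servers: 195  # restored to reflect available quota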
Change-Id: I683cdf92edeacd7ccf7b550c5bf906e75dfc90e8
This region seems to take a very long time to launch nodes when we
have a burst of requests for them, like a thundering herd sort of
behavior causing launch times to increase substantially. We have a
lot of capacity in this region though, so want to boot as many
instances as we can here. Attempt to reduce the effect by limiting
the number of instances nodepool will launch at the same time.
Also, mitigate the higher timeout for this provider by not retrying
launch failures, so that we won't ever lock a request for multiples
of the timeout.
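The no-retry part presumably maps to the provider-level launch-retries
option; a sketch with the provider name assumed from context:

  providers:
    - name: rax-ord
      launch-retries: 1  # single attempt, so a request is never locked
                         # for multiple launch-timeout periods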
Change-Id: I179ab22df37b2f996288820074ec69b8e0a202a5
We're still seeing a lot of timeouts waiting for instances to become
active in this provider, and are observing fairly long delays
between API calls at times. Increase the launch wait from 10 to 15
minutes, and increase the minimum delay between API calls by an
order of magnitude from 0.001 to 0.01 seconds.
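Expressed as provider settings, assuming the launch wait here is the
launch-timeout option (values in seconds):

  providers:
    - name: rax-ord
      launch-timeout: 900  # raised from 600 (10 -> 15 minutes)
      rate: 0.01           # minimum delay between API calls, up from 0.001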
Change-Id: Ib13ff03629481009a838a581d98d50accbf81de2
Reduce the max-servers in rax-ord from 195 to 100, and revert the
boot-timeout from the 300 we tried back down to 120 like the others.
We're continuing to see server create calls taking longer to report
active than nodepool is willing to wait, but also may be witnessing
the results of API rate limiting or systemic slowness. Reducing the
number of instances we attempt to boot there may give us a clearer
picture of whether that's the case.
Change-Id: Ife7035ba64b457d964c8497da0d9872e41769123
For a while we've been seeing a lot of "Timeout waiting for instance
creation" in Rackspace's ORD region, but checking behind the
launcher it appears these instances do eventually boot, so we're
wasting significant resources discarding quota we never use.
Increase the timeout for this from 2 minutes to 5, but only in this
region as 2 minutes appears to be sufficient in the others.
Change-Id: I1cf91a606eefc4aa65507f491a20182770b99f09
This seems to have been overlooked when the label was added to other
launchers, and is contributing to NODE_FAILURE results for some
jobs, particularly now that fedora-latest is relying on it.
Change-Id: Ifc0e5452ac0cf275463f6f1cfbe0d7fe350e3323
openEuler 20.03-LTS-SP2 went out of date in May 2022. 22.03 LTS
is the newest LTS version; it was released in March 2022 and
will be maintained for 2 years. This patch upgrades the LTS
version. It'll be used in Devstack, Kolla-Ansible and so on
in CI jobs.
Change-Id: I23f2b397bc7f1d8c2a959e0e90f5058cf3bf104d
This distro release reached its EOL December 31, 2021. We are removing
it from our CI system as people should really stop testing on it. They
can use CentOS 8 Stream or other alternatives instead.
Depends-On: https://review.opendev.org/c/opendev/base-jobs/+/827181
Change-Id: I13e8185b7839371a9f9043b715dc39c6baf907d5
This is in preparation for removing this label. This distro is no longer
supported and users will need to find alternatives.
Change-Id: I57b363671809afe415a376b0894041438140bdae
This removes the label, nodes, and images for opensuse-tumbleweed across
our cloud providers. We also update grafana to stop graphing stats for
the label.
Depends-On: https://review.opendev.org/c/opendev/base-jobs/+/824068
Change-Id: Ic311af5d667c01c1845251270fd2fdda7d99ebcb
This is in preparation to remove the image and label entirely. Nothing
seems to use the image so clean it up.
Change-Id: I5ab3a0627874e302289deb442f80a782509df2c3