68 Commits

Clark Boylan
29f15bb14c Set launch-timeout on nodepool providers
The nodepool openstack provider implementation has a launch timeout
default of 3600 seconds or one hour. This is problematic for us because
a node request will be attempted three times by a provider before being
passed onto the next provider. This means we may wait up to three hours
per failed provider to launch a single node.

Fix this by setting a timeout of 10 minutes on the three providers that
didn't already set a lower timeout (raxflex and the two vexxhost
providers). Looking at Grafana graphs of time-to-ready, this should be
plenty of time under normal booting conditions, and it reduces the
worst-case wait from 3 hours per provider to 30 minutes.

Change-Id: I0a4af9f5519ff2b64c737b1822590fbb2608e8bb
2024-09-24 15:25:56 -07:00
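The change above tunes the OpenStack driver's launch-timeout option on a per-provider basis. A minimal sketch of what such a provider entry looks like; the provider and cloud names here are illustrative, not the exact entries from the launcher configs:

```yaml
# Hypothetical provider entry; names and region are assumptions.
providers:
  - name: vexxhost-example
    cloud: vexxhost
    region-name: ca-ymq-1
    # Fail an individual launch attempt after 10 minutes instead of the
    # 3600-second default, so three failed attempts cost ~30 minutes,
    # not ~3 hours, before the request passes to the next provider.
    launch-timeout: 600
```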
Clark Boylan
963bb0e3d1 Add nested virt labels to raxflex and openmetal providers
The openmetal provider nodes have Intel VMX flags set and raxflex
provider nodes have AMD SVM flags set. Both should be capable of nested
virt (assuming nested virt works at all), so let's add these labels to
these clouds.

In the openmetal case we have the ability to gather debugging info
directly ourselves, and in the raxflex case we know whom to contact
when things go wrong.

Change-Id: Icc7c9cbafaef93f3ccec7010c82af1d36e02533c
2024-09-12 10:58:35 -07:00
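Exposing a nested-virt label in a provider pool looks roughly like this; the label, image, and provider names below are assumptions for illustration, not the exact ones added:

```yaml
# Sketch of a pool offering a nested-virt label (names are assumed).
providers:
  - name: openmetal-example
    pools:
      - name: main
        labels:
          - name: nested-virt-ubuntu-jammy   # hypothetical label name
            diskimage: ubuntu-jammy
            min-ram: 8000
```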
Jeremy Stanley
687fb1700b Increase the boot timeout for Rackspace Flex nodes
Under heavy load, we're occasionally seeing ssh-keyscan time out
with no reply after a minute. Double the time Nodepool is willing to
wait for SSH to connect, in the hope that this reduces these
occurrences.

Change-Id: Icfc3cc4a29854455684d88be41ba3e70e8507f3a
2024-09-09 16:44:25 +00:00
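A hedged sketch of the knob involved, assuming the commit adjusted the OpenStack driver's boot-timeout (the value shown presumes a doubling of a 60-second setting and is illustrative):

```yaml
providers:
  - name: raxflex-sjc3
    # Allow twice as long for SSH to answer after the server goes
    # active (assumed doubling from 60 to 120 seconds).
    boot-timeout: 120
```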
Jeremy Stanley
ff719df1e0 Increase raxflex-sjc3 max-servers to 32
Looking at `openstack limits show --absolute` we seem to currently
have sufficient memory quota to take this up to 32x 8GB RAM
instances (quotas for CPU and total instances are higher so RAM is
the limiting factor at the moment).

Things seem to be going fine at 20 nodes for the past few hours, so
let's go ahead and dial it up to maximum for the weekend and see how
it works out before we let our Rackspace contacts know we're ready
for additional quota.

Change-Id: I1d61da4ca2a1a5963e0cdcff99fdf4f41aea5920
2024-09-06 14:25:47 +00:00
Dr. Jens Harbott
572b7f4ae5 Bump max-servers for raxflex-sjc3 to 20
Initial tests with a single server have been successful, so let's
increase the load a bit.

Change-Id: I5a17cd428f5f31fef47b2737db1d5344fb0a8aae
2024-09-06 09:31:01 +02:00
Jeremy Stanley
cf6cbbd980 Add default network for Rackspace Flex in nodepool
The opendevzuul-network01 network should route to the global
Internet via NAT, so make that clear to Nodepool.

Change-Id: I8e0038fe82b0fc968d80ab656808b7744c160132
2024-09-05 18:13:36 +00:00
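Pointing a pool at a specific tenant network is done with the pool-level networks list; a minimal sketch using the network named in the commit (the pool name is an assumption):

```yaml
providers:
  - name: raxflex-sjc3
    pools:
      - name: main
        # Attach instances to this NAT-routed tenant network.
        networks:
          - opendevzuul-network01
```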
Jeremy Stanley
f17b1e682c Try booting nodes in Rackspace Flex
Our expected Nodepool images are present in the raxflex-sjc3
provider now, so raise max-servers to 1 in order to get some sample
builds we can analyze for obvious issues like connectivity.

Depend on the subnet resizing change, since that might otherwise be
disruptive if it were deployed while jobs are running there.

Depends-On: https://review.opendev.org/927813
Change-Id: If8b39b53608188e5881a273e6d092981a5871e84
2024-09-05 12:55:21 +00:00
Jeremy Stanley
cfc9d60ff3 Add Nodepool images to Rackspace Flex
Put the raxflex-sjc3 provider in our Nodepool configuration, but
with booting disabled by zeroing max-servers so we can make sure
image uploads are working first.

Change-Id: Id3d2e73b73c35af52dbc13579e773329e4f9ad68
2024-09-04 18:36:27 +00:00
Clark Boylan
3c76958809 Split min ready nodes between jammy and noble
Now that noble is our default nodeset, let's split the min-ready value
of 10 nodes for jammy in half between jammy and noble. When jammy
becomes less common we'll give noble the full 10.

Change-Id: I1e3b83c18234e1d15a8eb76a63989b29d4925908
2024-08-22 16:11:44 -07:00
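The split corresponds to the top-level labels section of the launcher config; the values follow the commit (10 for jammy becoming 5 and 5), while the surrounding layout is a sketch:

```yaml
labels:
  - name: ubuntu-jammy
    min-ready: 5   # was 10 before the split
  - name: ubuntu-noble
    min-ready: 5
```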
Clark Boylan
ebb5b0cb7e Stop building centos 8 stream images
This should be the last piece of cleanup to remove CentOS 8 Stream from
Nodepool. We basically tell the builders to forget about these images,
which should result in cleanup of all the related records on disk and
in ZooKeeper.

Change-Id: I71421ce9a10438549ef21441349be84b5d7bd38b
2024-07-30 14:01:53 -07:00
Clark Boylan
66fbfb941e Remove centos-8-stream image uploads and labels
This will stop nodepool from trying to manage centos-8-stream images in
our cloud providers. It will also remove the label from nodepool and
zuul entirely, making this node type unusable. This will produce
NODE_FAILURE errors, but jobs were already failing 100% of the time due
to the lack of valid package mirrors.

The followup change will stop our image builds (though they are paused
now) which will clean up the images on disk on our builders.

Depends-On: https://review.opendev.org/c/opendev/base-jobs/+/922653
Change-Id: I1bdb2441a7b8ce5c651e5e865005f80828cd6f47
2024-07-30 14:01:38 -07:00
Clark Boylan
97d35a8dd5 Set xenial min ready to 0
We are trying to phase out this node type. We don't need to have a ready
node sitting around at all times for it.

Change-Id: I74da8de9b9776f2f33e921f3566e5f1c134be88d
2024-07-23 17:27:39 -07:00
Clark Boylan
3ff9c96e3b Reduce centos-8-stream min ready to 0
In preparation for centos-8-stream cleanup we want to ensure we are not
going to automatically boot more nodes than we would then need to clean
up. Followup changes will more completely remove the node from nodepool.

Change-Id: I4ea6b7ab449124325cf22129663f86ef7117a5b9
2024-07-23 13:48:39 -07:00
Tony Breeds
20a0a5707f nodepool: Switch "common job platform" from bionic to jammy
Bionic isn't that common anymore, so switch the min-ready nodes to
Jammy.

Change-Id: I66f85c5b462bcae91f14195214194714aca13618
2024-06-18 11:50:29 +10:00
Tony Breeds
5b7316cff8 Switch nodepool over to the latest infra-root keyfile
Change-Id: If745d190d6a5586fbf23815b10b8411af3993828
2024-05-31 12:57:50 -05:00
Jeremy Stanley
059f2785e5 Add Ubuntu 24.04 LTS (ubuntu-noble) nodes
Build images and boot ubuntu-noble everywhere we do for
ubuntu-jammy. Drop the kernel boot parameter override we use on
Jammy since it's default in the kernel versions included in Noble
now.

Change-Id: I3b9d01a111e66290cae16f7f4f58ba0c6f2cacd8
2024-05-21 19:37:55 +00:00
Zuul
371ec90145 Merge "Add warning to nodepool configs about changing cloud name" 2024-04-17 12:29:38 +00:00
Clark Boylan
aabaf95b49 Remove centos-7 nodepool image builds
This is the last step in cleaning centos-7 out of nodepool. The previous
change will have cleaned up uploads and now we can stop building the
images entirely.

Change-Id: Ie81d6d516cd6cd42ae9797025a39521ceede7b71
2024-03-13 08:30:16 -07:00
Clark Boylan
b8c53b9c03 Remove centos-7 image uploads from Nodepool
This removal of centos-7 image uploads should cause Nodepool to clean up
the existing images in the clouds. Once that is done we can completely
remove the image builds in a followup change.

We are performing this cleanup because CentOS 7 is near its EOL and
cleaning it up will create room on nodepool builders and our mirrors for
other more modern test platforms.

Depends-On: https://review.opendev.org/c/opendev/base-jobs/+/912786
Change-Id: I48f6845bc7c97e0a8feb75fc0d540bdbe067e769
2024-03-13 08:21:46 -07:00
Clark Boylan
774ad69f33 Add warning to nodepool configs about changing cloud name
The cloud name is used to lookup cloud credentials in clouds.yaml, but
it is also used to determine names for things like mirrors within jobs.
As a result changing this value can impact running jobs as you need to
update DNS for mirrors (and possibly launch new mirrors) first. Add a
warning to help avoid problems like this in the future.

Change-Id: I9854ad47553370e6cc9ede843be3303dfa1f9f34
2024-03-07 11:28:17 -08:00
James E. Blair
f5c200181a Revert "Try switching Rackspace DFW to an API key"
This reverts commit eca3bde9cbba1b680f4f813a421ceb2d5803cf96.

This was successful, but we want to make the change without altering
the cloud name.  So switch this back, and separately we will update
the config of the rax cloud.

Change-Id: I8cdbd7777a2da866e54ef9210aff2f913a7a0211
2024-03-07 08:46:25 -08:00
Jeremy Stanley
eca3bde9cb Try switching Rackspace DFW to an API key
Switch the Rackspace region with the smallest quota to uploading
images and booting server instances with our account's API key
instead of its password, in preparation for their MFA transition. If
this works as expected, we'll make a similar switch for the
remaining two regions.

Change-Id: I97887063c735c96d200ce2cbd8950bbec0ef7240
Depends-On: https://review.opendev.org/911164
2024-03-06 15:06:34 +00:00
Clark Boylan
56c5fefcf6 CentOS 7 removal prep changes
This drops min-ready for centos-7 to 0 and removes use of some centos 7
jobs from puppet-midonet. We will clean up those removed jobs in a
followup change to openstack-zuul-jobs.

We also remove x/collected-openstack-plugins from zuul. This repo uses
centos 7 nodesets that we want to clean up, and it last merged a change
in 2019; that change was written by the infra team as part of global
cleanups. I think we can remove it from zuul for now, and if interest
resumes it can be re-added and fixed up.

Change-Id: I06f8b0243d2083aacb44fe12c0c850991ce3ef63
2024-03-04 10:25:58 -08:00
Clark Boylan
c41bc6e5c2 Remove debian-buster image builds from nodepool
This should be landed after the parent change has landed and nodepool
has successfully deleted all debian-buster image uploads from our cloud
providers. At that point it should be safe to remove the image builds
entirely.

Change-Id: I7fae65204ca825665c2e168f85d3630686d0cc75
2024-02-23 13:23:22 -08:00
Clark Boylan
feff36e424 Drop debian-buster image uploads from nodepool
Debian buster has been replaced by bullseye and bookworm, both of which
are releases we have images for. It is time to remove the unused debian
buster images as a result.

This change follows the process in nodepool docs for removing a provider
[0] (which isn't quite what we are doing) to properly remove images so
that they can be deleted by nodepool before we remove nodepool's
knowledge of them. The followup change will remove the image builds from
nodepool.

[0] https://zuul-ci.org/docs/nodepool/latest/operation.html#removing-a-provider

Depends-On: https://review.opendev.org/c/opendev/base-jobs/+/910015
Change-Id: I37cb3779944ff9eb1b774ecaf6df3c6929596155
2024-02-23 13:19:49 -08:00
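The two-step removal pattern these commits follow can be sketched as YAML edits: first drop the per-provider references so nodepool deletes the cloud uploads, and only in a followup delete the builder-side diskimage definition. The structure below is a simplified sketch, not the exact file layout:

```yaml
# Step 1 (this change): remove per-provider references such as
#   providers[*].diskimages: [{name: debian-buster}]
# and any labels pointing at the image, then wait for nodepool to
# delete the uploads from each cloud.
#
# Step 2 (followup change): remove the builder-side definition:
diskimages:
  - name: debian-buster   # delete this entry only once uploads are gone
```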
Clark Boylan
8eb9cb661e Set debian-buster min servers to 0
This is in preparation for the removal of this distro release from
Nodepool. Setting this value to 0 will prevent nodepool from
automatically booting new nodes under this label while we clean up any
existing nodes.

Change-Id: I90b6c84a92a0ebc4f40ac3a632667c8338d477f1
2024-02-23 08:41:20 -08:00
Clark Boylan
211fe14946 Remove opensuse-15 image builds from nodepool
This should be landed after the parent change has landed and nodepool
has successfully deleted all opensuse-15 image uploads from our cloud
providers. At that point it should be safe to remove the image builds
entirely.

Change-Id: Icc870ce04b0f0b26df673f85dd6380234979906f
2024-02-22 10:27:37 -08:00
Clark Boylan
5635e67866 Drop opensuse image uploads from nodepool
These images are old openSUSE 15.2 images and there doesn't seem to be
interest in keeping them running (very few jobs ever ran on them, rarely
successfully, and no one is trying to update to 15.5 or 15.6).

This change follows the process in nodepool docs for removing a provider
[0] (which isn't quite what we are doing) to properly remove images so
that they can be deleted by nodepool before we remove nodepool's
knowledge of them. The followup change will remove the image builds from
nodepool.

[0] https://zuul-ci.org/docs/nodepool/latest/operation.html#removing-a-provider

Depends-On: https://review.opendev.org/c/opendev/base-jobs/+/909773
Change-Id: Id9373762ed5de5c7c5131811cec989c2e6e51910
2024-02-22 10:25:15 -08:00
Clark Boylan
b8b984e5b6 Set opensuse-15 min-ready to 0
This is in preparation for the followup changes that will drop opensuse
nodes and images entirely. We set min-ready to 0 first so that we can
manually delete any running nodes before cleaning things up further.

Change-Id: I6cae355fd99dd90b5e48f804ca0d63b641c5da11
2024-02-21 09:32:56 -08:00
Clark Boylan
3b9c5d2f07 Remove fedora image builds
This removes the fedora image builds from nodepool. At this point
Nodepool should no longer have any knowledge of fedora.

There is potential for other cleanups for things like dib elements, but
leaving those in place doesn't hurt much.

Change-Id: I3e6984bc060e9d21f7ad851f3a64db8bb555b38a
2023-09-06 09:16:34 -07:00
Clark Boylan
d83736575e Remove fedora-35 and fedora-36 from nodepool providers
This will stop providing the node labels entirely and should result in
nodepool cleaning up the existing uploads for these images in our cloud
providers. It does not remove the diskimages for fedora; that will
happen next.

Change-Id: Ic1361ff4e159509103a6436c88c9f3b5ca447777
2023-09-06 09:12:33 -07:00
Clark Boylan
8d32d45da2 Set fedora labels min-ready to 0
In preparation for fedora node label removal we set min-ready to 0. This
is the first step to removing the images entirely.

Change-Id: I8c2a91cc43a0dbc633857a2733d66dc935ce32fa
2023-09-06 09:07:13 -07:00
Dr. Jens Harbott
5aa792f1ae Start booting bookworm nodes
Image builds have been successful.

Change-Id: If286eb3e1a75c643f67f3d6d3d7e2d31c205ac1b
2023-07-03 18:47:46 +02:00
Jeremy Stanley
8f916dc736 Restore rax-ord quota but lower max-concurrency
Looking at our graphs, we're still spiking up into the 30-60
concurrent building range at times, which seems to result in some
launches exceeding the already lengthy timeout and wasting quota,
but when things do manage to boot we effectively utilize most of
max-servers nicely. The variability is because max-concurrency is
the maximum number of in-flight node requests the launcher will
accept for a provider, but the number of nodes in a request can be
quite large sometimes.

Raise max-servers back to its earlier value reflecting our available
quota in this provider, but halve the max-concurrency so we don't
try to boot so many at a time.

Change-Id: I683cdf92edeacd7ccf7b550c5bf906e75dfc90e8
2023-03-16 19:53:55 +00:00
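Provider-level request concurrency is capped with the launcher's max-concurrency option. A sketch of the tuning described above; max-servers 195 is taken from the neighboring commits, but the halved max-concurrency value is hypothetical since the commit doesn't state a number:

```yaml
providers:
  - name: rax-ord
    # Maximum in-flight node requests this launcher will accept for
    # the provider; halved per the commit (value illustrative).
    max-concurrency: 25
    pools:
      - name: main
        max-servers: 195   # restored to the earlier quota-based value
```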
Jeremy Stanley
d0481326bf Limit rax-ord launch concurrency and don't retry
This region seems to take a very long time to launch nodes when we
have a burst of requests for them, like a thundering herd sort of
behavior causing launch times to increase substantially. We have a
lot of capacity in this region though, so want to boot as many
instances as we can here. Attempt to reduce the effect by limiting
the number of instances nodepool will launch at the same time.

Also, mitigate the higher timeout for this provider by not retrying
launch failures, so that we won't ever lock a request for multiples
of the timeout.

Change-Id: I179ab22df37b2f996288820074ec69b8e0a202a5
2023-03-10 18:09:33 +00:00
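Disabling launch retries is a one-line provider setting, assuming the commit used the OpenStack driver's launch-retries option:

```yaml
providers:
  - name: rax-ord
    # Attempt each launch once rather than the default three tries,
    # so a request is never locked for multiples of the timeout.
    launch-retries: 1
```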
Jeremy Stanley
bc7d946ca2 Wait longer for rax-ord nodes and ease up API rate
We're still seeing a lot of timeouts waiting for instances to become
active in this provider, and are observing fairly long delays
between API calls at times. Increase the launch wait from 10 to 15
minutes, and increase the minimum delay between API calls by an
order of magnitude from 0.001 to 0.01 seconds.

Change-Id: Ib13ff03629481009a838a581d98d50accbf81de2
2023-03-08 14:39:38 +00:00
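Both values mentioned map onto provider options; the rate numbers come straight from the commit, while the 15-minute wait is shown as launch-timeout on the assumption that's the timeout being raised:

```yaml
providers:
  - name: rax-ord
    rate: 0.01           # minimum delay between API calls, up from 0.001
    launch-timeout: 900  # 15 minutes, up from 10 (assumed option name)
```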
Jeremy Stanley
6f5c773b6e Try halving max-servers for rax-ord region
Reduce the max-servers in rax-ord from 195 to 100, and revert the
boot-timeout from the 300 we tried back down to 120 like the others.
We're continuing to see server create calls taking longer to report
active than nodepool is willing to wait, but also may be witnessing
the results of API rate limiting or systemic slowness. Reducing the
number of instances we attempt to boot there may give us a clearer
picture of whether that's the case.

Change-Id: Ife7035ba64b457d964c8497da0d9872e41769123
2023-03-07 18:39:00 +00:00
Jeremy Stanley
a177d641f2 Increase boot-timeout for rax-ord
For a while we've been seeing a lot of "Timeout waiting for instance
creation" in Rackspace's ORD region, but checking behind the
launcher it appears these instances do eventually boot, so we're
wasting significant resources discarding quota we never use.
Increase the timeout for this from 2 minutes to 5, but only in this
region as 2 minutes appears to be sufficient in the others.

Change-Id: I1cf91a606eefc4aa65507f491a20182770b99f09
2023-03-06 16:56:45 +00:00
Jeremy Stanley
b597a94b2e Add missing fedora-36 label to nl01
This seems to have been overlooked when the label was added to other
launchers, and is contributing to NODE_FAILURE results for some
jobs, particularly now that fedora-latest is relying on it.

Change-Id: Ifc0e5452ac0cf275463f6f1cfbe0d7fe350e3323
2022-09-19 15:05:52 +00:00
Ian Wienand
36b9c302e5 nodepool: Add Fedora 36
Add Fedora 36 builds

Change-Id: I64fac34945ea5c6ec91ddd442281fcaba2c53271
2022-08-22 11:25:09 +10:00
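Adding a new distro is mostly a builder-side diskimage definition plus provider labels and uploads; a minimal hedged sketch of the definition, with the element list abbreviated and illustrative:

```yaml
diskimages:
  - name: fedora-36
    elements:            # illustrative diskimage-builder element list
      - fedora-container
      - vm
      - nodepool-base
    env-vars:
      DIB_RELEASE: '36'  # release selector consumed by the elements
```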
Neil Hanlon
705e8420a2 Add rockylinux 9 to nodepool
Change-Id: Iedf1c8eb2898cfc5771a5a695a53a39f9396edc9
2022-08-05 08:58:27 -04:00
wangxiyuan
30fca1376d Bump openEuler to 22.03 LTS
openEuler 20.03-LTS-SP2 went out of support in May 2022. 22.03 LTS
is the newest LTS version; it was released in March 2022 and will be
maintained for 2 years. This patch upgrades the LTS version. It'll be
used in Devstack, Kolla-ansible, and so on in CI jobs.

Change-Id: I23f2b397bc7f1d8c2a959e0e90f5058cf3bf104d
2022-08-03 14:40:34 +08:00
Dr. Jens Harbott
1bdccd42e5 Start launching Jammy images
The first image was built successfully, so we can start launching them.

Change-Id: Ie84d1700b6f4f7696e14dfe01bc887e422163d7e
2022-04-26 13:53:29 +02:00
Neil Hanlon
2ccc5241c8 Add rockylinux-8 to nodepool configuration
Signed-off-by: Neil Hanlon <neil@rockylinux.org>
Change-Id: Ic3344bc47ca56c27f7ec3427a0865bd6ce3349d3
Depends-On: https://review.opendev.org/c/openstack/project-config/+/829405
Depends-On: https://review.opendev.org/c/openstack/project-config/+/829712
2022-02-17 09:12:00 -05:00
Ian Wienand
dd58c496f8 Remove Fedora 34
The dependent changes should remove the last references to Fedora 34
nodes.

Depends-On: https://review.opendev.org/c/openstack/devstack/+/827576
Depends-On: https://review.opendev.org/c/openstack/devstack/+/827578
Depends-On: https://review.opendev.org/c/zuul/nodepool/+/827577

Change-Id: Ie14ea374808e5518588925de3a476f0bc6ff2ccf
2022-02-11 07:55:21 +11:00
Clark Boylan
dce378a6b4 Remove centos-8
This distro release reached its EOL December 31, 2021. We are removing
it from our CI system as people should really stop testing on it. They
can use CentOS 8 Stream or other alternatives instead.

Depends-On: https://review.opendev.org/c/opendev/base-jobs/+/827181
Change-Id: I13e8185b7839371a9f9043b715dc39c6baf907d5
2022-02-02 09:48:36 +11:00
Clark Boylan
58e4543789 Set centos-8 min-ready to 0
This is in preparation for removing this label. This distro is no longer
supported and users will need to find alternatives.

Change-Id: I57b363671809afe415a376b0894041438140bdae
2022-01-31 13:31:24 -08:00
Clark Boylan
07e9803134 Remove opensuse-tumbleweed from nodepool
This removes the label, nodes, and images for opensuse-tumbleweed across
our cloud providers. We also update grafana to stop graphing stats for
the label.

Depends-On: https://review.opendev.org/c/opendev/base-jobs/+/824068
Change-Id: Ic311af5d667c01c1845251270fd2fdda7d99ebcb
2022-01-10 14:02:55 -08:00
Clark Boylan
05afaaa9ea Set tumbleweed min-ready to 0
This is in preparation to remove the image and label entirely. Nothing
seems to use the image so clean it up.

Change-Id: I5ab3a0627874e302289deb442f80a782509df2c3
2022-01-10 14:02:55 -08:00
Ian Wienand
dc65009eaf nodepool: Remove . from openEuler name
We have a bug uploading images with "." in the name.  Work around this
for now by avoiding it.

Change-Id: I20e1a926d02a632450b8114d84a0fa738b7ec639
2021-12-17 07:39:45 +11:00