This was the old timeout; then some refactoring happened and we ended
up with the openstacksdk timeout of one hour. Since then Nodepool has
added the ability to configure the timeout, so we set it back to the
original six hours.
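For reference, in the openstack provider config that looks roughly like
the following (a sketch only; the image-upload-timeout option name and
the provider name are assumptions here, value in seconds):

  providers:
    - name: rax-iad                # illustrative provider
      driver: openstack
      image-upload-timeout: 21600  # six hours, instead of openstacksdk's one hour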
This removes the fedora image builds from nodepool. At this point
Nodepool should no longer have any knowledge of fedora.
There is potential for other cleanups for things like dib elements, but
leaving those in place doesn't hurt much.
This will stop providing the node label entirely and should result in
nodepool cleaning up the existing uploads for these images in our cloud
providers. It does not remove the diskimages for fedora; that will be
handled in a follow-up change.
The bindep fallback list includes a libvirt-python package for all
RPM-based distros, but it appears that OpenSuse Leap has recently
dropped this (likely as part of removing Python 2.7 related
packages). Exclude the package on that platform so that the
opensuse-15 job will stop failing.
In order to reduce the load on our builder nodes and reduce the strain
on our providers' image stores, build most images only once per week.
Exceptions are ubuntu-jammy, our most often used distro image, which we
keep rebuilding daily, and some other more frequently used images built
every 2 days.
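Per-image rebuild frequency is controlled by the diskimage rebuild-age
setting (in seconds); a rough sketch with illustrative image names and
values:

  diskimages:
    - name: ubuntu-jammy
      rebuild-age: 86400     # daily
    - name: debian-bookworm  # illustrative
      rebuild-age: 172800    # every 2 days
    - name: opensuse-15      # illustrative
      rebuild-age: 604800    # weekly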
Enable uploads for all images again for rax-iad. We have configured the
nodepool-builders to run with only 1 upload thread, so we will have at
most two parallel uploads (one per builder).
This is a partial revert of d50921e66b.
We want to slowly re-enable image uploads for rax-iad, starting with a
single image, choosing the one that is used most often.
Manual cleanup of approximately 1200 images in this region, some as
much as 4 years old, has completed. Start attempting uploads again
to see if they'll complete now.
This reverts commit 71d1f02164.
We're getting Glance task timeout errors when trying to upload new
images into rax-iad, which seems to be resulting in rapidly leaking
images and may be creating an ever-worsening feedback loop. Let's
pause uploads for now since they're not working anyway, and
hopefully that will allow us to clean up the mess that's been
created more rapidly as well.
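Pausing uploads is a per-provider diskimage setting; a minimal sketch
(image name illustrative):

  providers:
    - name: rax-iad
      diskimages:
        - name: ubuntu-jammy
          pause: true   # stop new uploads while the leaked images are cleaned up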
Some deployment projects (e.g. OpenStack-Helm)
test their code as an "all-in-one" deployment,
i.e. a test job deploys all OpenStack components
on a single node on top of a minimalistic K8s
cluster running on that same node. This requires
more than 8GB of memory to make the jobs
reliable. We add these new *-32GB labels only in
the Vexxhost ca-ymq-1 region because the
v3-standard-8 flavor in this region has 32GB of
RAM.
At the same time we should not rely on this kind
of node too heavily, since the number of such
nodes is very limited. It is highly recommended
to redesign the test jobs so they use multinode
nodesets instead.
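The wiring for such a label looks roughly like this (label, pool and
image names are illustrative; the flavor is the point):

  labels:
    - name: ubuntu-jammy-32GB        # hypothetical label name
      min-ready: 0
  providers:
    - name: vexxhost-ca-ymq-1        # illustrative provider name
      pools:
        - name: main
          labels:
            - name: ubuntu-jammy-32GB
              diskimage: ubuntu-jammy
              flavor-name: v3-standard-8   # 32GB of RAM in this region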
dns-root-data has been demoted to a "Recommends" dependency of unbound,
which we don't install. Sadly the default unbound configuration is
broken without it.
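One way to address that is to install the package explicitly, e.g. in
the relevant element's package-installs.yaml (element path assumed):

  # <element>/package-installs.yaml
  unbound:
  dns-root-data: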
The cirros project has released new images; add them to our cache prior
to actually using them in the CI. We can remove the old images once the
migration is completed and not too many stable branches still use them.
Comparing the size of these images to the total size of our node
images, the impact of caching them should be small relative to the
benefit in CI stability.
Signed-off-by: Dr. Jens Harbott <email@example.com>
This reverts commit 4df959c449.
Reason for revert: the VEXXHOST provider has informed us that they have
performed some optimizations and we can now enable this pool again.
The jobs running with nested-virt labels on this
provider have been impacted by mirror issues over
the last couple of weeks, at least for jobs
running on compute nodes.
Until the issue is understood, let's disable this
pool.
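One simple way to do that, sketched with illustrative names, is to drop
the pool's max-servers to zero so no new nodes are launched there:

  providers:
    - name: vexxhost-ca-ymq-1   # illustrative
      pools:
        - name: nested-virt     # illustrative pool name
          max-servers: 0        # temporarily stop launching nodes here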
Kolla wants to have testing parity between Ubuntu and Debian, so add a
nested-virt-debian-bullseye label to nodepool matching the existing
Ubuntu labels.
Looking at our graphs, we're still spiking up into the 30-60
concurrent building range at times, which seems to result in some
launches exceeding the already lengthy timeout and wasting quota,
but when things do manage to boot we effectively utilize most of
max-servers nicely. The variability is because max-concurrency is
the maximum number of in-flight node requests the launcher will
accept for a provider, but the number of nodes in a request can be
quite large sometimes.
Raise max-servers back to its earlier value reflecting our available
quota in this provider, but halve the max-concurrency so we don't
try to boot so many at a time.
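A sketch of the resulting provider settings (provider name and numbers
illustrative):

  providers:
    - name: rax-ord
      max-servers: 195      # back up to our available quota
      max-concurrency: 8    # halved; caps in-flight node requests, not nodes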
This region seems to take a very long time to launch nodes when we
have a burst of requests for them, like a thundering herd sort of
behavior causing launch times to increase substantially. We have a
lot of capacity in this region though, so want to boot as many
instances as we can here. Attempt to reduce the effect by limiting
the number of instances nodepool will launch at the same time.
Also, mitigate the higher timeout for this provider by not retrying
launch failures, so that we won't ever lock a request for multiples
of the timeout.
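In provider config terms this is roughly the following (values
illustrative, and assuming launch-retries is the knob for the retry
behaviour):

  providers:
    - name: rax-ord          # illustrative
      max-concurrency: 16    # limit simultaneous launches
      launch-retries: 1      # don't hold a failed request for multiples of the timeout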
We're still seeing a lot of timeouts waiting for instances to become
active in this provider, and are observing fairly long delays
between API calls at times. Increase the launch wait from 10 to 15
minutes, and increase the minimum delay between API calls by an
order of magnitude from 0.001 to 0.01 seconds.
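In provider config terms (assuming boot-timeout and rate are the
options being described; values in seconds):

  providers:
    - name: rax-ord       # illustrative
      boot-timeout: 900   # 15 minutes, up from 600
      rate: 0.01          # minimum delay between API calls, up from 0.001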
Reduce the max-servers in rax-ord from 195 to 100, and revert the
boot-timeout from the 300 we tried back down to 120 like the others.
We're continuing to see server create calls taking longer to report
active than nodepool is willing to wait, but also may be witnessing
the results of API rate limiting or systemic slowness. Reducing the
number of instances we attempt to boot there may give us a clearer
picture of whether that's the case.
This reverts commit 4a2253aac3.
We've made some modifications to the nova installation in this cloud
which should prevent nodes other than the mirror from launching on its
hypervisor. This should protect it from OOMs.
For a while we've been seeing a lot of "Timeout waiting for instance
creation" in Rackspace's ORD region, but checking behind the
launcher it appears these instances do eventually boot, so we're
wasting significant resources discarding quota we never use.
Increase the timeout for this from 2 minutes to 5, but only in this
region as 2 minutes appears to be sufficient in the others.
The mirror server spontaneously powered off again. It's been booted
back up, but let's take the region out of service until someone has
a chance to investigate the reason and hopefully fix it so that it
doesn't keep happening.
This reverts commit f45f51fdd7.
The mirror server in the inmotion iad3 region is down. Don't boot
nodes there for now, since jobs run on them will almost certainly
fail. This can be reverted once the mirror is back in service.
According to Ic8b3e790fe332cf68bad7aaa3d5f85229600380b review
comments, OpenSearch indexing indicates jobs aren't often using
CirrOS 0.3.4, 0.3.5, 0.4.0 or 0.5.1 images any longer. If jobs
occasionally use them and have to retrieve them from the Internet,
that's fine; we really only need to cache images which are used
frequently. Remove the rest in order to shrink our node images.
Once the builders have a chance to clear out all uploaded images,
this will remove the remaining references in Nodepool. Then
system-config cleanup can proceed.
The mirror in our Limestone Networks donor environment is now
unreachable, but we ceased using this region years ago due to
persistent networking trouble and the admin hasn't been around for
roughly as long, so it's probably time to go ahead and say goodbye to
it.
In preparation for cleanup of credentials in system-config, first
remove the configuration here, but leave the nodepool provider with an
empty diskimages list so that it has a chance to clean up after itself.
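A minimal sketch of what remains (provider name illustrative):

  providers:
    - name: limestone-regionone   # illustrative; keep the provider entry
      diskimages: []              # no images; existing uploads get cleaned up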
The package-maps install of tox is only defined for gentoo, and it came
in with the original image build parts. We don't need it any more.
I didn't trace down 10-pip-packages, but it hasn't been doing anything
for a long time, since we removed pip-and-virtualenv. We can remove it
too.
I can not see the install done in 40-install-tox being used anywhere.
It came in with If5397d731e9fb04431482529aed23cd9fdaecc1d but I can't
see the venv actually referenced anywhere. I think this has all been
replaced by the ensure-tox role (or, indeed, jobs migrating away from
tox). Remove it.
This came in via Ie1a0aba57390c9c0b269b4cbb076090ae1de73a9 many years
ago, when it was copied from old puppet. I can't see that we need to
be installing this for any infra reason.
I guess there is a small possibility things are relying on this, but
they would be better off installing it themselves anyway.
We don't need to pull in the Python 2 python-xml or python-dev
packages.
python3 is always installed by DIB (it needs python3 on the image to
run elements), so we don't explicitly need to pull that in.
The RedHat platforms vary as to whether they come pre-installed with
curl or curl-minimal, and if curl-minimal is installed, it causes
conflicts when you try to install "curl" without removing it first.
pkg-map is not designed to deal with this at all; it can't say "curl |
curl-minimal". But all our base images come with curl, because we're
using cache-url which uses it.
So, in short, drop it here to avoid this conflict.
As noted inline, this has a /27 subnet allocation, so that is the real
upper limit on hosts we can simultaneously talk to.
A /27 is 32 addresses; after the network, broadcast and gateway
addresses that leaves 29 usable floating IPs, minus one for the mirror
node. We would max out the 160 CPUs with our standard 8-vCPU instances
(we can fit 20 * 8) before we got to this anyway, but I think it's good
to just have a note on it.
I had left the -regionone suffix off this, making its naming
inconsistent. This adds it.
Since this cloud is in its bringup phase, I will put the builder in
emergency, clear out the images for the "linaro" provider and then
apply this by hand, so that we don't have old ZK nodes lying around.
We can then merge this to make it consistent.
Drop the linaro-us cloud from nb04 uploads and the launcher; it is
replaced by the new linaro cloud. The region is not running any nodes
any more.
nb03 is still in the inventory, but shutdown and in emergency. We can
remove the config here and cleanup will follow.
Reorganise these labels into x86_64, arm64 and vexxhost-specific
groups. I think grouping by arch is a more logical arrangement for the
usual operations on this file.