When image data are imported, if there are holes in the sequence
numbers, ZooKeeper may register a collision after nodepool-builder
builds or uploads a new image. This is because ZooKeeper stores
a sequence node counter in the parent node, and we lose that
information when exporting/importing. Newly built images can end
up with the same sequence numbers as imported images. To avoid this,
re-create missing sequence nodes so that the import state more
closely matches the export state.
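As a rough illustration (not the actual change), placeholder children
could be created for the missing numbers, since ZooKeeper derives the
next sequence number from the parent znode's child version::

    def recreate_missing_sequence_nodes(zk, parent, present):
        # zk: a connected kazoo.client.KazooClient
        # parent: path of the parent of the sequence nodes
        # present: set of sequence numbers that survived the import
        # Creating the missing children advances the parent's counter
        # past the holes, so new builds cannot collide with imports.
        for seq in range(max(present) + 1):
            if seq in present:
                continue
            zk.ensure_path("%s/%010d" % (parent, seq))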
Change-Id: I0b96ebecc53dcf47324b8a009af749a3c04e574c
Users of the "nodepool" command don't need to see the component
registry logs at info level (which output at least a line for each
connected component). Set the minimum level to warning to avoid
that.
The component registry may still be useful for command-line use
in the future, so we leave it in place rather than disabling it
entirely.
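A minimal sketch of the idea (the logger name here is an assumption,
not taken from the source)::

    import logging

    # Raise the minimum level for the registry's logger when running the
    # command-line client so per-component info lines are suppressed.
    logging.getLogger('nodepool.ComponentRegistry').setLevel(logging.WARNING)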
Change-Id: I8c0937d7304ddc536773cf74fc40bbf6e79918d4
If a label is removed from the configuration while a node still
exists, the periodic cleanup performed by the metastatic driver
will raise an AttributeError exception when trying to access
the grace_time attribute, since the label pointer is None.
To address this, treat a missing label as if it has a grace time
of zero seconds.
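A hypothetical helper restating the fix (names are illustrative)::

    def backing_node_grace_time(label):
        # A label that has been removed from the configuration is treated
        # as having a grace time of zero seconds rather than dereferencing
        # None and raising AttributeError.
        if label is None:
            return 0
        return label.grace_time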
This adds a test which simulates the issue.
This also adds an extra log entry for when a metastatic backing
node slot is deallocated, so that we log both the allocation
and the deallocation.
Change-Id: I0c104a2fe9874e2cd30e2bf2f2227569a73be243
Icc9adcbfae7a37c335ce4586741cb27d6e0f5c66 pins the openstacksdk
requirement to <0.99, but this fails when we try to use the zuul
checkout of master openstacksdk in the siblings job.
For now, we just need to drop it from the siblings job.
Change-Id: I6e755ef7d2fc95f7475315ed1be1c5d13e9d0f0b
The OpenStack SDK/CLI team made an experimental "release candidate"
with 0.99.0, and OpenDev observed errors with the format of network
specifications in API calls. Pin to an earlier version for now, in
order to avoid this and any other as-yet-undiscovered problems in
that version.
Change-Id: Icc9adcbfae7a37c335ce4586741cb27d6e0f5c66
Now that the component we registered is a "pool", change the call
sites to use "launcher_pools" instead of "launchers". This may
reduce some ambiguity.
(s/launcher/pool/ might still be ambiguous since it may not be clear
whether we're talking about our own pools or other pools; thus the
choice of "launcher_pool" for the variable name.)
Also, remove a redundant test assertion.
Change-Id: I865883cdb115bf72a3bd034d9290f60666d64b66
This lets users configure providers which should fulfill requests
before other providers. This facilitates using a less expensive
cloud before using a more expensive one.
The default priority is 100, so that providers can be configured
either above or below the default while still using only positive
integers (in order to avoid confusion).
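As a sketch only (the attribute name and the assumption that lower
values are served first are illustrative)::

    DEFAULT_PRIORITY = 100

    def providers_in_priority_order(providers):
        # Providers without an explicit priority fall back to the default
        # of 100, leaving room to configure values above or below it
        # while staying positive.
        return sorted(providers,
                      key=lambda p: getattr(p, 'priority', DEFAULT_PRIORITY))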
Change-Id: I969ea821e10a7773a0a8d135a4f13407319362ee
The limit of 5 metadata items is 8 years old and outdated for modern
OpenStack. Instead of trying to guess what the OpenStack limits are
(which may depend on quota settings), remove all the checks and let
OpenStack reject the image at upload time.
Change-Id: Ifa2e429db3bac2e3cad73dce09e01c901ea133c4
Signed-off-by: Joshua Watt <JPEWhacker@gmail.com>
This uses a cache and lets us update metadata about components
and act on changes quickly (as compared to the current launcher
registry which doesn't have provision for live updates).
This removes the launcher registry, so operators should take care
to update all launchers within a short period of time since the
functionality to yield to a specific provider depends on it.
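A rough sketch of the cached-registry idea (the path and structure are
assumptions, not the actual implementation)::

    def watch_components(zk, root):
        # zk: a connected kazoo.client.KazooClient
        # A child watch keeps a local view of the registered components
        # current, so changes can be acted on quickly instead of waiting
        # for a periodic re-read.
        cache = {'components': []}

        @zk.ChildrenWatch(root)
        def _update(children):
            cache['components'] = sorted(children)

        return cache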
Change-Id: I6409db0edf022d711f4e825e2b3eb487e7a79922
We have made many improvements to connection handling in Zuul.
Bring those back to Nodepool by copying over the zuul/zk directory
which has our base ZK connection classes.
This will enable us to bring other Zuul classes over, such as the
component registry.
The existing connection-related code is removed and the remaining
model-style code is moved to nodepool.zk.zookeeper. Almost every
file imported the model as nodepool.zk, so import adjustments are
made to compensate while keeping the code more or less as-is.
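For example, an import adjustment of roughly this form (illustrative;
individual files may differ)::

    # before the move:
    #     from nodepool import zk
    # after the move, aliasing keeps existing call sites working:
    from nodepool.zk import zookeeper as zk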
Change-Id: I9f793d7bbad573cb881dfcfdf11e3013e0f8e4a3
Mention why the host-key-checking feature exists, so that users will
be aware of possible errors which may arise if they choose to
disable it. Also clarify why having the launcher and nodes on
different networks may make it necessary to disable the behavior.
Change-Id: I769080c5330bb7e6336f315eb0237324f0fda758
The quota cache may not be a valid dictionary when
invalidateQuotaCache() is called (e.g. when 'ignore-provider-quota' is
used in OpenStack). In that case, don't attempt to treat the None as a
dictionary as this raises a TypeError exception.
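A sketch of the guard (the attribute name is an assumption)::

    def invalidateQuotaCache(self):
        # When 'ignore-provider-quota' is in use there may be no cache
        # dictionary at all, so return early instead of raising TypeError.
        if not self._quota_cache:
            return
        self._quota_cache.clear()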
This bug was preventing Quota errors from OpenStack from causing
nodepool to retry the node request when ignore-provider-quota is True,
because the OpenStack handler calls invalidateQuotaCache() before
raising the QuotaException. Since invalidateQuotaCache() was raising
TypeError, it prevented the QuotaException from being raised and the
node allocation failed outright.
A test has been added to verify that nodepool will now retry node
allocations as intended when OpenStack returns a quota error.
This fixes that bug, but does change the behavior of the OpenStack
driver when ignore-provider-quota is True and OpenStack returns a
Quota error.
Change-Id: I1916c56c4f07c6a5d53ce82f4c1bb32bddbd7d63
Signed-off-by: Joshua Watt <JPEWhacker@gmail.com>
The following adjustments improve performance at large scale:
* Save the quota object earlier in the state machine. This is
still using cached data, but it's not necessary to re-run this
each time.
* Flatten iterators returned from cached methods. Some of our
methods intended to cache heavy list responses were in fact
only caching the iterator, and re-iterating would end up
re-running the request. This change does two things: it causes
the iteration to happen within the rate limit calculator so we
have a better idea of how long it actually took, and it causes
the actual data to be put in the cache so that we don't re-run
the request (see the sketch after this list).
* Don't create a second instance object after creating the
instance. The create call returns an instance object in the
form that we expect already. Avoid creating a second one which
incurs another DescribeInstances call.
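A sketch of the second point, with hypothetical names::

    def _cached_instance_listing(self):
        # Iterate the paginated response inside the rate limiter and keep
        # the materialized list, so the measured time reflects the real
        # request and re-iterating the cached value does not re-issue it.
        if self._instance_cache is None:
            with self.rate_limiter:
                self._instance_cache = list(self._describe_instances())
        return self._instance_cache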
Change-Id: I73bc099b450879917ab923fb7371f8006b113d68
If a host is offline, nodepool throws an exception without mentioning
the hostname:
ERROR nodepool.driver.static.StaticNodeProvider: Couldn't sync node:
Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/nodepool/driver/static/provider.py", line 427, in cleanupLeakedResources
self.syncNodeCount(registered, node, pool)
File "/usr/local/lib/python3.7/site-packages/nodepool/driver/static/provider.py", line 320, in syncNodeCount
register_cnt, self.provider.name, pool, node)
File "/usr/local/lib/python3.7/site-packages/nodepool/driver/static/provider.py", line 210, in registerNodeFromConfig
nodeutils.set_node_ip(node)
File "/usr/local/lib/python3.7/site-packages/nodepool/nodeutils.py", line 48, in set_node_ip
addrinfo = socket.getaddrinfo(node.hostname, node.connection_port)[0]
File "/usr/local/lib/python3.7/socket.py", line 752, in getaddrinfo
for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
socket.gaierror: [Errno -2] Name or service not known
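A sketch of the kind of fix this implies (illustrative, not the exact
change; the exception type is arbitrary)::

    import socket

    def set_node_ip(node):
        try:
            addrinfo = socket.getaddrinfo(node.hostname,
                                          node.connection_port)[0]
        except socket.gaierror as e:
            # Surface the hostname so the offline host can be identified
            # from the log.
            raise Exception("Unable to resolve host %s: %s"
                            % (node.hostname, e))
        # ... assign the resolved address to the node as before ...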
Change-Id: I8d9a5b382c4905ce14022f6d6f02f6d323799700
Because we cache the instance listing with a TTL, it is possible
for the refresh method (which is intended to return the current
version of the supplied instance object) to encounter a listing
which does not yet include a newly created instance. In this case
it should just return the supplied argument rather than None, which
could cause callers to error out.
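A sketch of the behavior (method and helper names are assumptions)::

    def refresh(self, obj):
        # If the TTL-cached listing does not yet contain the newly created
        # instance, return the supplied object instead of None.
        for instance in self._list_instances():
            if instance.external_id == obj.external_id:
                return instance
        return obj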
Change-Id: I30c04b0f7bdb6f61d2f2fce2037a0959c16dcba9
The RateLimiter utility used by several drivers measures time from
the exit of one context manager to the beginning of the next. This
means that if the API call itself takes substantial time, it will
be added to the expected delay. In other words, if the expected rate
is one API call every two seconds, and the API call itself takes
one second, the actual rate will be one call every three seconds.
To more closely approximate the overall expected rate, measure from
start to start, so that the duration of the API call itself is
included.
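A simplified sketch of the measurement change (not the Nodepool class
itself)::

    import time

    class StartToStartRateLimiter:
        def __init__(self, calls_per_second):
            self.delta = 1.0 / calls_per_second
            self.last_start = None

        def __enter__(self):
            # Delay is computed from the start of one call to the start of
            # the next, so the API call's own duration counts toward the
            # interval.
            now = time.monotonic()
            if self.last_start is not None:
                remaining = self.delta - (now - self.last_start)
                if remaining > 0:
                    time.sleep(remaining)
            self.last_start = time.monotonic()
            return self

        def __exit__(self, exc_type, exc, tb):
            return False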
Change-Id: Ia62a6bfa6a3e6cac65f0c20179edcfbc94a5dcc5
This adds config options to enforce default resource (cpu, mem) limits on
k8s pod labels. With this, we can ensure all pod nodes have resource
information set on them, which allows max-cores and max-ram quotas to be
accounted for on k8s pod nodes, so those config options are added as well.
Tenant quotas can then also be considered for pod nodes.
Change-Id: Ida121c20b32828bba65a319318baef25b562aef2
This is a follow-on to I3279c3b5cb8cf26d390835fd0a7049bc43ec40b5
As discussed in the referenced issue, the blank return and
AttributeError here are the correct way to determine a recently deleted
image. We don't need to catch the ClientError as that won't be raised
any more. Getting a value for the state would indicate it wasn't
deleted.
This bumps the moto base requirement to ensure we have this behaviour.
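A hypothetical helper restating that reasoning (not the driver code)::

    def image_is_deleted(image):
        # A recently deleted image yields no state (or raises
        # AttributeError); any real state value means it still exists.
        try:
            state = image.state
        except AttributeError:
            return True
        return not state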
Change-Id: I2d5b0ccb9802aa0d4c81555a17f40fe8b8595ebd
The AWS driver adapter was not actually deleting the image build
from AWS when the builder requested it (it was simply missing
the implementation of the method).
This went unnoticed because as long as a launcher is running, it
will act to clean up leaked images. But it's still better to
have the builder do so synchronously.
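A sketch of the missing method (attribute names are assumptions)::

    def deleteImage(self, external_id):
        # Deregister the AMI synchronously when the builder requests that
        # the image build be deleted.
        with self.rate_limiter:
            self.ec2_client.deregister_image(ImageId=external_id)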
Add a test which verifies the behavior.
Change-Id: I0bea930847ded45574bf53ae098b9926d4924107
Per the notes inline, the recent 3.1.6 moto release has changed the
way deleted images are returned. Catching the exception we now see
works, and we can revisit if we find better solutions.
Change-Id: I3279c3b5cb8cf26d390835fd0a7049bc43ec40b5
When building Ubuntu 22.04 (Jammy), we need ``ar`` as the extractor because
dpkg-deb on bullseye doesn't support the required compression algorithm.
Make sure that it is installed in the Docker image.
Signed-off-by: Dr. Jens Harbott <harbott@osism.tech>
Change-Id: Icb0e40827c9f8ac583fa143545e6bed9641bf613