When image data are imported, if there are holes in the sequence
numbers, ZooKeeper may register a collision after nodepool-builder
builds or uploads a new image. This is because ZooKeeper stores
a sequence node counter in the parent node, and we lose that
information when exporting/importing. Newly built images can end
up with the same sequence numbers as imported images. To avoid this,
re-create missing sequence nodes so that the import state more
closely matches the export state.
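A minimal sketch of the idea, assuming a connected kazoo client and a
helper name invented for illustration (the counter lives in the parent
znode and only advances when children are created):

  # Hedged sketch, not the actual import code: for each gap in the
  # imported sequence numbers, create a placeholder child so the
  # parent's counter advances past it.
  def recreate_missing_sequence_nodes(zk, parent, imported_seqs):
      # zk: kazoo.client.KazooClient; imported_seqs: set of ints
      if not imported_seqs:
          return
      for seq in range(max(imported_seqs) + 1):
          if seq in imported_seqs:
              continue
          path = "%s/%010d" % (parent, seq)
          zk.create(path, b"", makepath=True)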
Change-Id: I0b96ebecc53dcf47324b8a009af749a3c04e574c
This uses a cache and lets us update metadata about components
and act on changes quickly (as compared to the current launcher
registry which doesn't have provision for live updates).
This removes the launcher registry, so operators should take care
to update all launchers within a short period of time since the
functionality to yield to a specific provider depends on it.
Change-Id: I6409db0edf022d711f4e825e2b3eb487e7a79922
We have made many improvements to connection handling in Zuul.
Bring those back to Nodepool by copying over the zuul/zk directory
which has our base ZK connection classes.
This will enable us to bring other Zuul classes over, such as the
component registry.
The existing connection-related code is removed and the remaining
model-style code is moved to nodepool.zk.zookeeper. Almost every
file imported the model as nodepool.zk, so import adjustments are
made to compensate while keeping the code more or less as-is.
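The typical mechanical adjustment looks roughly like this
(illustrative; individual files may vary slightly):

  # before
  from nodepool import zk
  # after
  from nodepool.zk import zookeeper as zk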
Change-Id: I9f793d7bbad573cb881dfcfdf11e3013e0f8e4a3
The quota cache may not be a valid dictionary when
invalidateQuotaCache() is called (e.g. when 'ignore-provider-quota' is
used in OpenStack). In that case, don't attempt to treat the None as a
dictionary as this raises a TypeError exception.
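A sketch of the guard, with attribute and method names assumed for
illustration rather than copied from the driver:

  class QuotaCacheMixin:
      _current_nodepool_quota = None

      def invalidateQuotaCache(self):
          if self._current_nodepool_quota is None:
              # e.g. 'ignore-provider-quota' is set and the cache was
              # never populated; there is nothing to invalidate.
              return
          self._current_nodepool_quota['timestamp'] = 0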
This bug was preventing Quota errors from OpenStack from causing
nodepool to retry the node request when ignore-provider-quota is True,
because the OpenStack handler calls invalidateQuotaCache() before
raising the QuotaException. Since invalidateQuotaCache() was raising
TypeError, it prevented the QuotaException from being raised and the
node allocation failed outright.
A test has been added to verify that nodepool and OpenStack will now
retry node allocations as intended.
This fixes that bug, but it does change the behavior of the OpenStack
driver when ignore-provider-quota is True and the cloud returns a
quota error.
Change-Id: I1916c56c4f07c6a5d53ce82f4c1bb32bddbd7d63
Signed-off-by: Joshua Watt <JPEWhacker@gmail.com>
These changes are all in service of being able to better understand
AWS driver log messages:
* Use annotated loggers in the statemachine provider framework
so that we see the request, node, and provider information (see the
sketch after these lists)
* Have the statemachine framework pass annotated loggers to the
state machines themselves so that the above information is available
for log messages on individual API calls
* Add optional performance information to the rate limit handler
(delay and API call duration)
* Add some additional log entries to the AWS adapter
Also:
* Suppress boto logging by default in unit tests (it is verbose and
usually not helpful)
* Add coverage of node deletion in the AWS driver tests
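As a sketch of the annotated-logger idea (class and field names here
are illustrative, not the exact nodepool helpers):

  import logging

  class RequestLoggerAdapter(logging.LoggerAdapter):
      def process(self, msg, kwargs):
          # Prefix every message with the request/node/provider context.
          prefix = '[req: %s node: %s provider: %s]' % (
              self.extra.get('request'), self.extra.get('node'),
              self.extra.get('provider'))
          return '%s %s' % (prefix, msg), kwargs

  log = RequestLoggerAdapter(
      logging.getLogger("nodepool.driver.statemachine"),
      {'request': '100-0000000001', 'node': '0000000042',
       'provider': 'aws-us-east-1'})
  log.debug("Launching instance")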
Change-Id: I0e6b4ad72d1af7f776da73c5dd2a50b40f60e4a2
The Azure SDK for Python uses threads to manage async operations.
Every time a virtual machine is created, a new thread is spawned
to wait for it to finish (whether we actually end up polling it or
not). This gives the Azure driver significant scalability limits
compared to other drivers, possibly supporting only about half as many
simultaneous nodes.
To address this, switch to using a very simple requests-based
REST client I'm calling Azul. The consistency of the Azure API
makes this simple. As a bonus, we can use the excellent Azure
REST API documentation directly, rather than mapping attribute
names through the Python SDK (which has subtle differences).
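An illustrative sketch of the requests-based approach (not the actual
Azul client; the URL shape and api-version parameter follow the public
Azure Resource Manager REST documentation):

  import requests

  def azure_get(token, subscription_id, resource_path, api_version):
      # e.g. resource_path = 'providers/Microsoft.Compute/locations'
      url = ('https://management.azure.com/subscriptions/%s/%s'
             % (subscription_id, resource_path))
      resp = requests.get(
          url,
          params={'api-version': api_version},
          headers={'Authorization': 'Bearer %s' % token})
      resp.raise_for_status()
      return resp.json()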
A new fake Azure test fixture is also created in order to make
the current unit test a more thorough exercise of the code.
Finally, the "zuul-private-key" attribute is misnamed since we
have a policy of a one-way dependency from Zuul -> Nodepool. Its
name is updated to match the GCE driver ("key") and moved to the
cloud-image section so that different images may be given different
keys.
Change-Id: I87bfa65733b2a71b294ebe2cf0d3404d0e4333c5
Require TLS ZooKeeper connections before making the 4.0 release.
Change-Id: I69acdcec0deddfdd191f094f13627ec1618142af
Depends-On: https://review.opendev.org/776696
Currently nodepool has one thread per server creation or
deletion. Each of those waits for the cloud by regularly getting the
server list and checking if their instance is active or gone. On a
busy nodepool this leads to severe thread contention when the server
list gets large and/or there are many parallel creations/deletions in
progress.
This can be improved by offloading the waiting to a single thread that
regularly retrieves the server list and compares that to the list of
waiting server creates/deletes. The calling threads then wait until
the central thread wakes them up to proceed with their task. The
waiting threads wait for the event outside of the GIL and thus no
longer contribute to the thread contention problem.
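A rough sketch of the pattern (names invented for illustration): the
calling threads block on an Event while a single watcher thread polls
the server list and wakes them:

  import threading
  import time

  class ServerWatcher(threading.Thread):
      def __init__(self, list_servers, interval=5.0):
          super().__init__(daemon=True)
          self._list_servers = list_servers  # callable returning servers
          self._interval = interval
          self._deleted_waiters = {}         # server id -> threading.Event
          self._lock = threading.Lock()

      def wait_for_deletion(self, server_id, timeout=None):
          event = threading.Event()
          with self._lock:
              self._deleted_waiters[server_id] = event
          # Event.wait() releases the GIL, so many waiters stay cheap.
          return event.wait(timeout)

      def run(self):
          while True:
              existing = {s['id'] for s in self._list_servers()}
              with self._lock:
                  for sid, event in list(self._deleted_waiters.items()):
                      if sid not in existing:
                          event.set()
                          del self._deleted_waiters[sid]
              time.sleep(self._interval)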
An alternative approach would be to redesign the code to use fewer
threads, but that would be a much more complex change. Thus this change
keeps the many-threads approach but makes their waiting much more
lightweight, which showed a substantial improvement during load testing
in a test environment.
Change-Id: I5525f2558a4f08a455f72e6b5479f27684471dc7
This ensures that we don't wait forever for tests to complete tasks.
This is particularly useful if you've disabled the global test timeout.
Change-Id: I0141e62826c3594ed20605cac25e39091d1514e2
We broke nodepool configuration with
I3795fee1530045363e3f629f0793cbe6e95c23ca by defining labels in the
OpenStack provider that were not in the top-level label list.
The added check here would have found such a case.
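The added check is conceptually along these lines (a simplified sketch
over the parsed YAML, not the voluptuous schema itself):

  def find_undefined_labels(config):
      top_level = {l['name'] for l in config.get('labels', [])}
      errors = []
      for provider in config.get('providers', []):
          for pool in provider.get('pools', []):
              for label in pool.get('labels', []):
                  if label['name'] not in top_level:
                      errors.append('provider %s uses undefined label %s'
                                    % (provider['name'], label['name']))
      return errors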
The validate() function is reworked slightly; previously it would
return various exceptions from the tools it was calling (YAML,
voluptuous, etc.). Now that we have more testing (and I'd imagine we
could add even more, similar validations), we'd have to keep adding
exception types. Just make the function return a value; this also
makes sure the regular exit paths are taken from the caller in
nodepoolcmd.py, rather than dying with an exception at an arbitrary
point.
A unit test is added.
Co-Authored-By: Mohammed Naser <mnaser@vexxhost.com>
Change-Id: I5455f5d7eb07abea34c11a3026d630dee62f2185
The builder intentionally does not attempt to shut down the uploader
threads since that could take an unreasonable amount of time. This
causes a race in our tests where we can shut down the ZooKeeper
connection while the upload thread is still in progress, which can
cause the test to fail with a ZooKeeper error. This adds uploader
thread cleanup for the builder used in tests.
Change-Id: I25d4b52e17501e5dc6543adef585dd3b86bd70f9
Only a single test actually depends on having more than a single
upload thread active, so this is just wasteful. Reduce the default
to 1 and add an option to useBuilder() that tests may use to alter
the value.
Change-Id: I07ec96000a81153b51b79bfb0daee1586491bcc5
This change allows you to specify a dib-cmd parameter for disk images,
which overrides the default call to "disk-image-create". This lets
you choose which disk-image-create binary is called for each
configured disk image.
It is inspired by a couple of things:
The "--fake" argument to nodepool-builder has always been a bit of a
wart; a case of testing-only functionality leaking across into the
production code. It would be clearer if the tests used exposed
methods to configure themselves to use the fake builder.
Because disk-image-create is called from the $PATH, it is more
difficult to use nodepool from a virtualenv. You cannot just run
"nodepool-builder"; you have to ". activate" the virtualenv before
running the daemon so that the path is set to find the virtualenv's
disk-image-create.
In addressing activation issues by automatically choosing the
in-virtualenv binary in Ie0e24fa67b948a294aa46f8164b077c8670b4025, it
was pointed out that others are already using wrappers in various ways
where preferring the co-installed virtualenv version would break.
With this, such users can ensure they call the "disk-image-create"
binary they want. We can then make a change to prefer the
co-installed version without fear of breaking.
In theory, there's no reason why a totally separate
"/custom/venv/bin/disk-image-create" would not be valid if you
required a customised dib for some reason for just one image. This is
not currently possible; even modulo PATH hacks, etc., all images will
use the same binary to build. It is for this flexibility that I think
this is best placed at the diskimage level, rather than as, say, a
global setting for the whole builder instance.
Thus add a dib-cmd option for diskimages. In the testing case, this
points to the fake-image-create script, and the --fake command-line
option and related bits are removed.
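An illustrative sketch of the builder-side selection (the attribute
name is an assumption; the configured value may carry arguments, so it
is split shell-style):

  import shlex

  def build_dib_command(diskimage, default='disk-image-create'):
      configured = getattr(diskimage, 'dib_cmd', None)
      return shlex.split(configured) if configured else [default]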
It should have no backwards-compatibility effects; documentation and a
release note are added.
Change-Id: I6677e11823df72f8c69973c83039a987b67eb2af
In my local tox runs, the temporary files of each test case get
deleted after the run. However kube_config maintains a static list of
temporary files it knows about and tries to re-use them in subsequent
test runs, which causes the test to fail [1]. Fix this by telling
kube_config to clean up its temporary files in the cleanup phase.
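A sketch of the hook in the test base class (it leans on a private
helper of the kubernetes client, so treat the exact name as an
assumption):

  import testtools
  from kubernetes.config import kube_config

  class BaseTestCase(testtools.TestCase):
      def setUp(self):
          super().setUp()
          # Drop kube_config's module-level record of temp files so
          # later runs do not try to re-use paths that were deleted.
          self.addCleanup(kube_config._cleanup_temp_files)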
[1] Trace
Traceback (most recent call last):
File "/home/tobias/src/nodepool/nodepool/tests/unit/test_builder.py", line 239, in test_image_rotation_invalid_external_name
build001, image001 = self._test_image_rebuild_age(expire=172800)
File "/home/tobias/src/nodepool/nodepool/tests/unit/test_builder.py", line 186, in _test_image_rebuild_age
self.useBuilder(configfile)
File "/home/tobias/src/nodepool/nodepool/tests/__init__.py", line 539, in useBuilder
BuilderFixture(configfile, cleanup_interval, securefile)
File "/home/tobias/src/nodepool/.tox/py37/lib/python3.7/site-packages/testtools/testcase.py", line 756, in useFixture
reraise(*exc_info)
File "/home/tobias/src/nodepool/.tox/py37/lib/python3.7/site-packages/testtools/_compat3x.py", line 16, in reraise
raise exc_obj.with_traceback(exc_tb)
File "/home/tobias/src/nodepool/.tox/py37/lib/python3.7/site-packages/testtools/testcase.py", line 731, in useFixture
fixture.setUp()
File "/home/tobias/src/nodepool/nodepool/tests/__init__.py", line 318, in setUp
self.builder.start()
File "/home/tobias/src/nodepool/nodepool/builder.py", line 1304, in start
self._config = self._getAndValidateConfig()
File "/home/tobias/src/nodepool/nodepool/builder.py", line 1279, in _getAndValidateConfig
config = nodepool_config.loadConfig(self._config_path)
File "/home/tobias/src/nodepool/nodepool/config.py", line 246, in loadConfig
driver.reset()
File "/home/tobias/src/nodepool/nodepool/driver/openshift/__init__.py", line 29, in reset
config.load_kube_config(persist_config=True)
File "/home/tobias/src/nodepool/.tox/py37/lib/python3.7/site-packages/kubernetes/config/kube_config.py", line 540, in load_kube_config
loader.load_and_set(config)
File "/home/tobias/src/nodepool/.tox/py37/lib/python3.7/site-packages/kubernetes/config/kube_config.py", line 422, in load_and_set
self._load_cluster_info()
File "/home/tobias/src/nodepool/.tox/py37/lib/python3.7/site-packages/kubernetes/config/kube_config.py", line 385, in _load_cluster_info
file_base_path=self._config_base_path).as_file()
File "/home/tobias/src/nodepool/.tox/py37/lib/python3.7/site-packages/kubernetes/config/kube_config.py", line 112, in as_file
raise ConfigException("File does not exists: %s" % self._file)
kubernetes.config.config_exception.ConfigException: File does not exists: /tmp/tmplafutg0j/tmpmiti10bn
Ran 2 tests in 4.524s (+0.175s)
FAILED (id=20, failures=1)
Change-Id: Idce8ca9bed49162874af24b224e573121e250385
If, during a long DIB image build, we lose the ZooKeeper session,
it's likely that the CleanupWorker thread could have run and removed
the ZK record for the build (its state would be BUILDING and unlocked,
indicating something went wrong). In that scenario, when the DIB
process finishes (possibly writing out DIB files), those files will
never get cleaned up since the ZK record would now be gone. If we fail to update
the ZK record at the end of the build, just delete the leaked DIB files
immediately after the build.
Change-Id: I5cb58318efe51b5b0c3413b7a01f02a50215a8b6
This reverts commit ccf40a462a.
The previous version would not work properly when daemonized
because there was no stdout. This version maintains stdout and
uses select/poll with non-blocking stdout to capture the output
to a log file.
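Roughly, the capture loop works like this (a simplified sketch, not
the exact builder code):

  import os
  import select
  import subprocess

  def run_and_log(cmd, log_path):
      proc = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                              stderr=subprocess.STDOUT)
      fd = proc.stdout.fileno()
      os.set_blocking(fd, False)
      with open(log_path, 'ab') as log:
          while True:
              readable, _, _ = select.select([fd], [], [], 1.0)
              if readable:
                  data = os.read(fd, 4096)
                  if data:
                      log.write(data)
                      log.flush()
                  else:
                      break  # EOF: the child closed its stdout
      return proc.wait()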
Depends-On: https://review.openstack.org/634266
Change-Id: I7f0617b91e071294fe6051d14475ead1d7df56b7
This change adds an experimental AWS driver. It lacks some of the deeper
features of the openstack driver, such as quota management and image
building, but is highly functional for running tests on a static AMI.
Note that the test base had to be refactored to allow fixtures to be
customized in a more flexible way.
Change-Id: I313f9da435dfeb35591e37ad0bec921c8b5bc2b5
Co-Authored-By: Tristan Cacqueray <tdecacqu@redhat.com>
Co-Authored-By: David Moreau-Simard <dmsimard@redhat.com>
Co-Authored-By: Clint Byrum <clint@fewbar.com>
A builder thread can wedge if the build process wedges. Add a timeout
to the subprocess. Since it was the call to readline() that would block,
we change the process to have DIB write directly to the log. This allows
us to set a timeout in the Popen.wait() call; on timeout we also kill
the dib subprocess.
The timeout value can be controlled in the diskimage configuration and
defaults to 8 hours.
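A minimal sketch of the timeout handling (the default mirrors the
8 hour value above; names are illustrative):

  import subprocess

  def run_dib(cmd, log_path, timeout=8 * 60 * 60):
      with open(log_path, 'ab') as log:
          # DIB writes straight to the log file, so the parent never
          # blocks on readline().
          proc = subprocess.Popen(cmd, stdout=log, stderr=subprocess.STDOUT)
          try:
              return proc.wait(timeout=timeout)
          except subprocess.TimeoutExpired:
              proc.kill()
              proc.wait()
              raise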
Change-Id: I188e8a74dc39b55a4b50ade5c1a96832fea76a7d
This change implements an OpenShift resource provider. The driver currently
supports project requests and pod requests to enable both the
containers-as-machines and native-container workflows.
Depends-On: https://review.openstack.org/608610
Change-Id: Id3770f2b22b80c2e3666b9ae5e1b2fc8092ed67c
We currently update the node statistics on every node launch or
delete. This cannot use caching at the moment because when the
statistics are updated we might push slightly outdated data; if there
is then no further update for a long time, we end up with broken
gauges. We already get update events from the node cache, so we can
use those to centrally trigger node statistics updates.
This is combined with leader election so there is only a single
launcher that keeps the statistics up to date. This ensures that the
statistics are not garbled by several launchers pushing their own
slightly different views into the stats.
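The leader-election part can be sketched with kazoo's Election recipe
(illustrative; the path and callback are assumptions):

  from kazoo.recipe.election import Election

  def run_stats_leader(zk_client, hostname, update_stats):
      election = Election(zk_client, "/nodepool/stats-election",
                          identifier=hostname)
      # run() blocks until this launcher wins the election, then calls
      # the callback; only one launcher at a time pushes the gauges.
      election.run(update_stats)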
As a side effect this reduces the runtime of a test that creates 200
nodes from 100s to 70s on my local machine.
Change-Id: I77c6edc1db45b5b45be1812cf19eea66fdfab014
We currently only need to set up the zNode caches in the
launcher. Within the command-line client and the builders this is just
unnecessary work.
Change-Id: I03aa2a11b75cab3932e4b45c5e964811a7e0b3d4
Pipelines buffer stats and then send them out in more reasonably sized
chunks, helping to avoid small UDP packets going missing in a flood of
stats. Use this in stats.py.
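With the statsd client library this amounts to something like
(illustrative stat names):

  import statsd

  client = statsd.StatsClient('localhost', 8125)
  pipe = client.pipeline()
  pipe.incr('nodepool.launch.ready')
  pipe.timing('nodepool.launch.time', 1234)
  pipe.send()  # one batched UDP payload instead of many tiny packets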
This needs a slight change to the assertedStats handler to extract the
combined stats. This function is ported from Zuul, where we updated it
to handle pipeline stats (Id4f6f5a6cd66581a81299ed5c67a5c49c95c9b52), so
it is not really new code.
Change-Id: I3f68450c7164d1cf0f1f57f9a31e5dca2f72bc43
OpenStack Client Config has been pulled into openstacksdk. As part of
this work OSCC internals were dropped and aliased into the sdk lib. This
move broke patching of the clouds.yaml file location for nodepool tests.
We quickly work around this by using the new location for the value to
be overridden in openstacksdk.
Change-Id: I55ad4333ffddec8eeb023e345156e96773504400
This updates the builder to store individual build logs in dedicated
files, one per build, named for the image and build id. Old logs are
automatically pruned. By default, they are stored in
/var/log/nodepool/builds, but this can be changed.
This removes the need to specially configure a logging handler for the
image build logs.
Change-Id: Ia7415d2fbbb320f8eddc4e46c3a055414df5f997
The pep8 rules used in nodepool are somewhat broken. In preparation for
using the pep8 ruleset from zuul, we need to fix the findings up front.
Change-Id: I9fb2a80db7671c590cdb8effbd1a1102aaa3aff8
The secure config file has largely been unused and ignored for v3.
This adds support for reading ZooKeeper credentials from the secure
file. Note that actually specifying authentication credentials is
left for future work, but this adds the framework necessary for that.
ZooKeeper creds can be in both the normal config file and the secure
file. If specified in both, the data in the secure configuration wins.
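Illustratively, the precedence amounts to something like this (key
name as in nodepool.yaml; the real loader is more involved):

  def zookeeper_servers(config, secure):
      # Values from the secure file win over the normal config file.
      return (secure.get('zookeeper-servers')
              or config.get('zookeeper-servers'))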
Change-Id: I5d9c12c00f5e85ef258128337cdb99809f86b8ed
We had NodeRequestLock objects which are not the same type or data as
the request.lock objects. This is confusing. NodeRequestLock objects
hold znode state info for a lock object in zk and request.lock objects
are the actual zk Lock() for the earlier znode.
Make this more clear by renaming the NodeRequestLock object
NodeRequestLockStats.
Similarly there are many places where we pass 'lock' as a parameter
when we don't mean the actual request.lock object but instead the
request.lock.id object. Update these to be more clearly 'lock_id'
instead of 'lock'.
Finally we also have cases where we want the 'lock_id' but pass 'id',
which shadows a python builtin, so let's just avoid it and call these
things 'lock_id'.
Change-Id: Id5c4fe71266efa7f8362fa9c8be7ab27c60059fd
This change is a follow-up to the drivers spec and it makes the fake provider
a real driver. The fakeprovider module is merged into the fake provider and
the get_one_cloud config loader is simplified.
Change-Id: I3f8ae12ea888e7c2a13f246ea5f85d4a809e8c8d
This change moves OpenStack-related code to a driver. To avoid circular
imports, this change also moves the StatsReporter to the stats module so
that the handlers don't have to import the launcher.
Change-Id: I319ce8780aa7e81b079c3f31d546b89eca6cf5f4
Story: 2001044
Task: 4614
This change adds a generic Provider meta class to the common
driver module to support multiple implementations. It also renames
some methods to better match other drivers' use-cases, e.g.:
* listServers into listNodes
* cleanupServer into cleanupNode
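A minimal illustrative sketch of such a base class (the real one
carries more methods):

  import abc

  class Provider(metaclass=abc.ABCMeta):
      @abc.abstractmethod
      def listNodes(self):
          pass

      @abc.abstractmethod
      def cleanupNode(self, node_id):
          pass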
Change-Id: I6fab952db372312f12e57c6212f6ebde59a1a6b3
Story: 2001044
Task: 4612
Because python3 is more strict about string / bytes we need to start
encoding / decoding things as utf8.
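The typical pattern looks like this (helper names invented for
illustration):

  import json

  def to_znode_bytes(data):
      return json.dumps(data).encode('utf8')

  def from_znode_bytes(raw):
      return json.loads(raw.decode('utf8'))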
Change-Id: Id1102be142bef5eeb96de69aaa8f653bdb1903d8
Signed-off-by: Paul Belanger <pabelanger@redhat.com>
Since we are working towards python3 support, let's rename nodepool.py
to launcher.py to make relative imports nicer, otherwise we'd have to
use:
from . import foo
Change-Id: Ic38b6a8c2bf25d53625e159cb135b71d383b700c
Signed-off-by: Paul Belanger <pabelanger@redhat.com>
When debugging a test case with pydevd (e.g. as used by PyCharm), it
injects additional threads. These also need to be whitelisted;
otherwise wait_for_threads will block forever.
Change-Id: I491c2fd404bddfbe17cb912557ef56ff9134ac4b
We weren't doing anything with statsd in tests. Port over the
fake statsd from Zuul and use it to verify that we emit some
stats.
Fix parts of the stats emission that were broken.
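The fixture is essentially a tiny UDP listener along these lines (a
simplified sketch of the Zuul fixture, not a verbatim copy):

  import socket
  import threading

  class FakeStatsd(threading.Thread):
      def __init__(self):
          super().__init__(daemon=True)
          self.sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
          self.sock.bind(('127.0.0.1', 0))
          self.port = self.sock.getsockname()[1]
          self.stats = []

      def run(self):
          while True:
              data, _ = self.sock.recvfrom(4096)
              self.stats.append(data)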
Change-Id: I027e67b928bd28372bef8ab147c7ed5841009caf
Before os-client-config and shade, we would include cloud credentials
in nodepool.yaml. But now the time has come to remove these
settings in favor of using a local clouds.yaml file.
Change-Id: Ie7af6dcd56dc48787f280816de939d07800e9d11
Signed-off-by: Paul Belanger <pabelanger@redhat.com>
After some discussion, it was decided to create a 2nd thread
specifically to clean up our nodes, which could be less aggressive than
our DeleteNodeWorker interval. This will reduce the pressure we place
on clouds when looking for leaked nodes.
Change-Id: I3f1a482eaa43ea7943cfa5d8b74530cd34d251b3
Signed-off-by: Paul Belanger <pabelanger@redhat.com>