If a provider (or its configuration) is sufficiently broken that
the provider manager is unable to start, then the launcher will
go into a loop where it attempts to restart all providers in the
system until it succeeds. During this time, no pool managers are
running, which means all requests are ignored by this launcher.
Nodepool continuously reloads its configuration file, and in case
of an error, the expected behavior is to continue running and allow
the user to correct the configuration and retry after a short delay.
We also expect providers on a launcher to be independent of each
other so that if one fails, the others continue working.
However, since we neither exit nor process node requests when a
provider manager fails to start, an error with one provider can
cause all providers to stop handling requests with very little
feedback to the operator.
To address this, if a provider manager fails to start, the launcher
will now behave as if the provider were absent from the config file.
It will still emit the error to the log, and it will continuously
attempt to start the provider so that if the error condition abates,
the provider will start.
If there are no providers online for a label, then as long as any
provider in the system is running, node requests will still be
handled: they will be declined (and possibly failed) while the broken
provider is offline.
If the system contains only a single provider and it is broken, then
no requests will be handled (or failed), which is the current
behavior, and likely still the most desirable outcome in that case.
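A minimal sketch of the intended main-loop behavior; the helper names
(start_provider, active_providers) are illustrative and not the actual
nodepool internals:

    import logging

    log = logging.getLogger("nodepool.launcher")

    def update_provider_managers(config, active_providers, start_provider):
        """Start any configured providers that are not yet running.

        A provider whose manager fails to start is logged and skipped,
        so the launcher behaves as if it were absent from the config
        file; it is retried on the next pass through the main loop.
        """
        for name, provider in config.providers.items():
            if name in active_providers:
                continue
            try:
                active_providers[name] = start_provider(provider)
            except Exception:
                log.exception("Error starting provider %s; "
                              "ignoring it until it recovers", name)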
Change-Id: If652e8911993946cee67c4dba5e6f88e55ac7099
Having python files with the exec bit and a shebang defined in
/usr/lib/python-*/site-packages/ is not acceptable in an RPM package.
Instead of carrying a patch in the nodepool RPM packaging, it is
better to fix this directly upstream.
Change-Id: I5a01e21243f175d28c67376941149e357cdacd26
Now that there is no more TaskManager class, nor anything using
one, the use_taskmanager flag is vestigial. Clean it up so that we
don't have to pass it around to things anymore.
Change-Id: I7c1f766f948ad965ee5f07321743fbaebb54288a
In order to support static node pre-registration, we need to give
the provider manager the opportunity to register/deregister any
nodes in its configuration file when it starts (on startup or when
the config changes). It will need a ZooKeeper connection to do this.
The OpenStack driver will ignore this parameter.
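A hedged sketch of the new parameter; the class and method names are
illustrative, not the exact nodepool interfaces:

    class StaticNodeProvider:
        def start(self, zk_conn):
            # The static driver uses the ZooKeeper connection to
            # register/deregister the nodes listed in its config.
            self.zk = zk_conn

    class OpenStackProvider:
        def start(self, zk_conn=None):
            # The OpenStack driver ignores this parameter.
            pass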
Change-Id: Idd00286b2577921b3fe5b55e8f13a27f2fbde5d6
This change adds a plugin interface so that drivers can be loaded
dynamically. Instead of importing each driver in the launcher,
provider_manager and config, the Drivers class discovers and loads
drivers from the driver directory.
This change also adds a reset() method to the driver Config interface to
reset the os_client_config reference when reloading the OpenStack driver.
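A rough sketch of the discovery mechanism, assuming each driver lives
in a package under a drivers directory (names are illustrative):

    import importlib
    import os

    class Drivers:
        """Load driver packages found in the driver directory."""
        drivers = {}

        @classmethod
        def load(cls, drivers_dir):
            for name in sorted(os.listdir(drivers_dir)):
                if not os.path.isdir(os.path.join(drivers_dir, name)):
                    continue
                # assume drivers are importable as nodepool.driver.<name>
                cls.drivers[name] = importlib.import_module(
                    'nodepool.driver.%s' % name)

        @classmethod
        def get(cls, name):
            return cls.drivers[name]

A driver's Config object would additionally implement reset() to drop
cached state (such as the os_client_config reference) on reload.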
Change-Id: Ia347aa2501de0e05b2a7dd014c4daf1b0a4e0fb5
This change is a follow-up to the drivers spec and it makes the fake provider
a real driver. The fakeprovider module is merged into the fake provider and
the get_one_cloud config loader is simplified.
Change-Id: I3f8ae12ea888e7c2a13f246ea5f85d4a809e8c8d
This change moves OpenStack related code to a driver. To avoid circular
import, this change also moves the StatsReporter to the stats module
so that the handlers don't have to import the launcher.
Change-Id: I319ce8780aa7e81b079c3f31d546b89eca6cf5f4
Story: 2001044
Task: 4614
This change adds a generic Provider metaclass to the common driver
module to support multiple implementations. It also renames some
methods to better match other drivers' use cases, e.g.:
* listServers into listNodes
* cleanupServer into cleanupNode
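Roughly, the common interface looks like the following sketch (the
exact method set shown is illustrative):

    import abc

    class Provider(metaclass=abc.ABCMeta):
        """Generic interface implemented by every driver's provider."""

        @abc.abstractmethod
        def listNodes(self):
            """Return the backend resources this provider knows about."""

        @abc.abstractmethod
        def cleanupNode(self, node_id):
            """Delete a single backend resource."""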
Change-Id: I6fab952db372312f12e57c6212f6ebde59a1a6b3
Story: 2001044
Task: 4612
Currently, we get OOTB groups per provider and per image.
It would be nice to also have groups per label type, for running
plays against a particular label.
Change-Id: Ib4173fc0c15184444a91dc402bb306d34f295106
The docs say we support this, but the code doesn't.
Also, self._cloud_image.name == self._label._cloud_image is
essentially a foreign key. That's hard to read at the call site, so
just use self._cloud_image.
We have a cloud id if it's a disk image, so wrap that in a dict. Pass
the other one through unmodified so that we'll search for it.
We also don't have any codepaths using image_name, nor a reason to
distinguish.
Change-Id: I4aa9bd8e7c578ae63d05df453b9886c710a092c0
For example, a cloud may get better performance from a cinder volume
than from the local compute drive. As a result, give nodepool the
option to choose whether the server should boot from volume or not.
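For illustration, the driver would pass the new option through to
shade roughly like this (boot_from_volume and volume_size are existing
shade create_server arguments; the surrounding names are made up):

    import shade

    cloud = shade.openstack_cloud(cloud='mycloud')  # cloud name is illustrative

    def launch(label, image, flavor):
        return cloud.create_server(
            name='node',
            image=image,
            flavor=flavor,
            boot_from_volume=label.boot_from_volume,  # new nodepool option
            volume_size=label.volume_size,
            wait=True)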
Change-Id: I3faefe99096fef1fe28816ac0a4b28c05ff7f0ec
Depends-On: If58cd96b0b9ce4569120d60fbceb2c23b2f7641d
Signed-off-by: Paul Belanger <pabelanger@redhat.com>
Currently, if the ssh connection fails, we are blind to what the
possible failures are. As a result, attempt to fetch the server
console log to help debug the failure.
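Something along these lines, assuming a generic ssh helper (connect)
that raises on failure; get_server_console() is the shade call used to
fetch the log:

    import logging

    log = logging.getLogger("nodepool.launcher")

    def connect_or_dump_console(cloud, server, connect):
        try:
            return connect(server)
        except Exception:
            try:
                console = cloud.get_server_console(server)
                log.error("SSH to %s failed; server console log:\n%s",
                          server['id'], console)
            except Exception:
                log.exception("Unable to fetch console log for %s",
                              server['id'])
            raise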
This is the continuation of I39ec1fe591d6602a3d494ac79ffa6d2203b5676b
but for the feature/zuulv3 branch. This was done to avoid merge
conflicts on the recent changes to nodepool.yaml layout.
Change-Id: I75ccb6d01956fb6052473f44cce8f097a56dd16a
Signed-off-by: Paul Belanger <pabelanger@redhat.com>
The current syntax is not python3 compatible, so we look to shade to
help accomplish our sorting.
Change-Id: Iadb39f976840fd2af6e0bd7b08bd3b01169e37a1
Signed-off-by: Paul Belanger <pabelanger@redhat.com>
The syntax for imports has changed for python3; let's use the new
syntax.
Change-Id: Ia985424bf23b44e492f51182179d2e476cdcccbb
Signed-off-by: Paul Belanger <pabelanger@redhat.com>
It's possible that it's easier for a nodepool user to just specify a
name or id of a flavor in their config instead of the combo of min-ram
and name-filter.
In order to not have two name-related items, and to avoid the pure
flavor-name case using a term called "name-filter", change
name-filter to flavor-name, and introduce the semantics that if
flavor-name is given by itself, it will look for an exact match on
flavor name or id, and if it's given with min-ram it will behave as
name-filter did already.
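A sketch of the resulting selection logic (label.flavor_name and
label.min_ram mirror the config options but are not the exact nodepool
attribute names):

    def find_flavor(cloud, label):
        flavors = sorted(cloud.list_flavors(), key=lambda f: f['ram'])
        if label.min_ram:
            # with min-ram, flavor-name behaves as name-filter did:
            # pick the smallest flavor with enough RAM whose name
            # contains flavor-name (if one was given)
            for f in flavors:
                if f['ram'] >= label.min_ram and (
                        not label.flavor_name or
                        label.flavor_name in f['name']):
                    return f
        else:
            # flavor-name alone: exact match on flavor name or id
            for f in flavors:
                if label.flavor_name in (f['name'], f['id']):
                    return f
        raise Exception("Unable to find a suitable flavor")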
Change-Id: I8b98314958d03818ceca5abf4e3b537c8998f248
This was a temporary measure to keep production nodepool from
deleting nodes created by v3 nodepool. We don't need to carry
it over.
This is an alternative to: https://review.openstack.org/449375
Change-Id: Ib24395e30a118c0ea57f8958a8dca4407fe1b55b
The nodepool_id feature may need to be removed. I've kept it to simplify
merging both now and if we do it again later.
A couple of the tests are disabled and need reworking in a subsequent
commit.
Change-Id: I948f9f69ad911778fabb1c498aebd23acce8c89c
Nova has an API call that can fetch the list of available AZs. Use it to
provide a default list so that we can provide sane choices to the
scheduler related to multi-node requests rather than just letting nova
pick on a per-request basis.
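For example, something like the following, using shade's
list_availability_zone_names() wrapper around that Nova call (the
surrounding function is illustrative):

    def get_azs(cloud, configured_azs):
        """Return the configured AZ list, or a default from the cloud."""
        if configured_azs:
            return configured_azs
        return cloud.list_availability_zone_names()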
Change-Id: I1418ab8a513280318bc1fe6e59301fda5cf7b890
This was an unused setting which was left over from when we supported
snapshots.
Change-Id: I940eaa57f5dad8761752d767c0dfa80f2a25c787
Signed-off-by: Paul Belanger <pabelanger@redhat.com>
Before os-client-config and shade, we would include cloud credentials
in nodepool.yaml. But now the time has come to remove these settings
in favor of using a local clouds.yaml file.
Change-Id: Ie7af6dcd56dc48787f280816de939d07800e9d11
Signed-off-by: Paul Belanger <pabelanger@redhat.com>
When we first started putting nodepool metadata into the server record
in OpenStack, we json encoded the data so that we could store a dict
into a field that only takes strings. We were also going to teach the
ansible OpenStack Inventory about this so that it could read the data
out of the groups list. However, ansible was not keen on accepting
an "attempt to json decode values in the metadata" patch, since
json-encoded values are not actually part of the interface OpenStack
expects - which means one of our goals, ansible inventory groups
based on nodepool information, is no longer really a thing.
We could push harder on that, but we actually don't need the functionality
we're getting from the json encoding. The OpenStack Inventory has
supported comma separated lists of groups since before day one. And the
other nodepool info we're storing stores and fetches just as easily
with 4 different top level keys as it does in a json dict - and is
easier to read and deal with when just looking at server records.
Finally, nova has a 255 byte limit on the size of the value that can be
stored, so we cannot grow the information in the nodepool dict
indefinitely anyway.
Migrate the stored data into nodepool_ variables and a comma separated
list for groups. Consume both forms, so that people upgrading will not
lose track of existing stock of nodes.
Finally, we don't use snapshot_id anymore - so remove it.
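A sketch of the two forms (the old 'nodepool' json key and the exact
nodepool_ key names shown are illustrative):

    import json

    def build_metadata(info, groups):
        # new form: flat nodepool_ keys and a comma separated group list
        meta = {'groups': ','.join(groups)}
        for key, value in info.items():
            meta['nodepool_%s' % key] = str(value)
        return meta

    def read_metadata(meta):
        # consume both forms so existing nodes are not lost on upgrade
        if 'nodepool' in meta:
            info = json.loads(meta['nodepool'])
            groups = info.pop('groups', [])
        else:
            groups = [g for g in meta.get('groups', '').split(',') if g]
            info = {k[len('nodepool_'):]: v
                    for k, v in meta.items() if k.startswith('nodepool_')}
        return info, groups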
Change-Id: I2c06dc7c2faa19e27d1fb1d9d6df78da45ffa6dd
Currently, while testing zuulv3, we want to share the
infracloud-chocolate provider between 2 nodepool servers. The current
issue is, if we launch nodes from zuulv3-dev.o.o, nodepool.o.o will
detect the nodes as leaked and delete them.
A way to solve this is to create a per-provider 'nodepool-id' so that
an admin can configure 2 separate nodepool servers to share the same
tenant. The big reason for doing this is so we don't have to stand
up a duplicate nodepool-builder and upload duplicate images.
Change-Id: I03a95ce7b8bf06199de7f46fd3d0f82407bec8f5
Signed-off-by: Paul Belanger <pabelanger@redhat.com>
Let's not use mock for testing launch failures. Instead, add an
attribute to FakeProviderManager that tells it how many times
successive calls to createServer() should fail.
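Roughly (the attribute name and return value are illustrative, not the
real fake provider):

    class FakeProviderManager:
        def __init__(self):
            # tests set this to the number of successive createServer()
            # calls that should raise before launches succeed again
            self.createServer_fails = 0

        def createServer(self, name, **kwargs):
            if self.createServer_fails:
                self.createServer_fails -= 1
                raise Exception("Expected createServer failure for %s" % name)
            return {'id': 'fake-server-id', 'name': name}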
Change-Id: Iba6f8f89de84b06d2c858b0ee69bc65c37ef3cf0
ProviderManager is a TaskManager, and TaskManagers are intended
to serialize API requests to a single cloud from multiple threads.
Currently each worker in the builder has its own set of
ProviderManagers. That means that we are performing cloud API calls
in parallel. That's probably okay since we perform very few of them,
mostly image uploads and deletes. And in fact, we probably want
to avoid blocking on image uploads.
However, there is a thread associated with each of these
ProviderManagers, and even though they are idle, in aggregate they
add up to a significant CPU cost.
This makes the use of a TaskManager by a ProviderManager optional
and sets the builder not to use it in order to avoid spawning these
useless threads.
Change-Id: Iaf6498c34a38c384b85d3ab568c43dab0bcdd3d5
We recently added the ability for diskimage-builder to generate
checksum files. This means nodepool can validate DIBs and then pass
the contents to shade, saving shade from calculating the checksums.
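For illustration, assuming diskimage-builder wrote <image>.md5 and
<image>.sha256 files next to the image (md5/sha256 are existing
create_image arguments in shade; the function itself is a sketch):

    def upload_image(cloud, name, filename):
        def read_checksum(path):
            try:
                with open(path) as f:
                    return f.read().split()[0]
            except IOError:
                return None

        return cloud.create_image(
            name, filename=filename,
            md5=read_checksum(filename + '.md5'),
            sha256=read_checksum(filename + '.sha256'),
            wait=True)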
Change-Id: I4cd44bb83beb4839c2c2346af081638e61899d4d
Signed-off-by: Paul Belanger <pabelanger@redhat.com>
Recent shade allows users to pass in image and flavor to create_server
by name. This results in a potential extra lookup to find the image and
flavor. Since nodepool is not using shade caching, this is causing our
nodepool-level caching to be subverted. Although getting nodepool to
use shade caching is an eventual project, that's bad scope creep for now.
Just pass in the objects themselves, which gets shade to not attempt to
look for them. In the case where we have an image_id - put it into a
dict so that shade treats it as an object passed in and not a thing that
needs to be treated like a name_or_id.
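In other words, roughly (the surrounding function is illustrative):

    def launch(cloud, flavor, image_id=None, image_name=None):
        if image_id:
            # wrap the glance id so shade treats it as an object rather
            # than a name_or_id that needs another lookup
            image = dict(id=image_id)
        else:
            # no id available: pass the name through and let shade search
            image = image_name
        return cloud.create_server(name='node', image=image,
                                   flavor=flavor, wait=True)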
Depends-On: I4938037decf51001ab5789ee383f6c7ed34889b1
Change-Id: Ic70b19ad5baf25413e20a658163ca718dce63bee
As we depend more and more on glean to help bootstrap a node, it is
possible for new clouds added to nodepool.yaml to be missing the
setting, which results in broken nodes and multiple configuration
updates.
As a result, we now default config-drive to true to make it easier to
bring nodes online.
Change-Id: I4e214ba7bc43a59ddffb4bfb50576ab3b96acf69
Signed-off-by: Paul Belanger <pabelanger@redhat.com>
It should not happen in a neutron setup that we have leaked floating
ips. However, sometimes it seems that it happens around startup. It's
also safe in a neutron context to just clean the unattached ones. So
assume that sometimes clouds get into weird states and just clean them.
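A minimal sketch of the cleanup using shade's floating IP calls; here
a floating IP with no fixed_ip_address is treated as unattached:

    def cleanup_leaked_floating_ips(cloud):
        for fip in cloud.list_floating_ips():
            if not fip.get('fixed_ip_address'):
                cloud.delete_floating_ip(fip['id'])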
Change-Id: I1a30efb3b7994381592c2391881711d6b1f32dff
Depends-On: I93b0c7d0b0eefdfe0fb1cd4a66cdbba9baabeb09
* Builders were interfering with the gear shutdown procedure
by overriding the use of the 'running' variable on gear workers.
Instead, just rely on the built-in shutdown process in the gear
worker class.
* Have the builder shutdown provider managers as well.
* Correctly handle signals in the builder.
* Have the nodepool daemon shut down its gearman client.
* Use a condition object so that we can interrupt the main loop
sleep and exit faster.
Both the builder and the daemon now exit cleanly on CTRL-C when
run in the foreground.
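The interruptible main-loop sleep mentioned above looks roughly like
this (class and method names are illustrative):

    import threading

    class NodePoolDaemon:
        def __init__(self, interval=10):
            self._interval = interval
            self._wake_condition = threading.Condition()
            self._stopped = False

        def stop(self):
            with self._wake_condition:
                self._stopped = True
                self._wake_condition.notify_all()

        def run(self):
            while not self._stopped:
                self._run_one_pass()
                with self._wake_condition:
                    if not self._stopped:
                        # returns early when stop() is called instead of
                        # sleeping out the full interval
                        self._wake_condition.wait(self._interval)

        def _run_one_pass(self):
            pass  # stand-in for the real main-loop work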
Change-Id: Iefd5ef7df74e701725f4bafe4df51b8276088fe5
With OSC and shade patches, we lost the ability to run nodepoold
in the foreground with fakes. This restores that ability.
The shade integration unit tests are updated to use the string
'real' rather than 'fake' in config files, as they are trying to
avoid actually using the nodepool fakes, and the use of the string
'fake' is what triggers their use in many cases.
Change-Id: Ia5d3c3d5462bc03edafcc1567d1bab299ea5d40f
It's not a big deal because we cache this - but we don't care at all
about the extra flavor specs, so skip fetching them for each of the
flavors.
Change-Id: Iff73bdbe598fcf7556eafc484325f79452975a4f
We need to know which networks are public/private, which we already have
in nodepool, but were not passing in to the OCC constructor. We also
need to be able to indicate which network should be the target of NAT in
the case of multiple private networks, which can be done via
nat_destination and the new networks list argument support in OCC.
Finally, 'use_neutron' is purely the purview of shade now, so remove it.
Depends-On: I0d469339ba00486683fcd3ce2995002fa0a576d1
Change-Id: I70e6191d60e322a93127abf4105ca087b785130e
This restores some logic that was inadvertently removed in the
shade transition, without which, we issue an extra delete keypair
API call for every server delete.
Change-Id: Ib1f50c23d61c1d874f2b235fd57d2a2b0defd6c5
We wrote shade as an extraction of the logic we had in nodepool, and
have since expanded it to support more clouds. It's time to start
using it in nodepool, since that will allow us to add more clouds
and also to handle a wider variety of them.
Making a patch series was too tricky because of the way fakes and
threading work, so this is everything in one stab.
Depends-On: I557694b3931d81a3524c781ab5dabfb5995557f5
Change-Id: I423716d619aafb2eca5c1748bc65b38603a97b6a
Co-Authored-By: James E. Blair <jeblair@linux.vnet.ibm.com>
Co-Authored-By: David Shrewsbury <shrewsbury.dave@gmail.com>
Co-Authored-By: Yolanda Robla <yolanda.robla-mota@hpe.com>
With the dependent change, shade now stores inner
exceptions if they occur. Wrap our use of shade
with a context manager that logs the inner exceptions
in nodepool's own logging context.
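A sketch of such a context manager; the attribute holding the inner
exception info on shade's exception is assumed here and may differ:

    import contextlib
    import logging
    import traceback

    log = logging.getLogger("nodepool")

    @contextlib.contextmanager
    def shade_inner_exceptions():
        try:
            yield
        except Exception as e:
            # assumed attribute: a sys.exc_info() tuple stored by shade
            inner = getattr(e, 'inner_exception', None)
            if inner:
                log.error("Inner exception:\n%s",
                          ''.join(traceback.format_exception(*inner)))
            raise

Call sites would then wrap their shade calls, e.g.
"with shade_inner_exceptions(): cloud.create_server(...)".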
Change-Id: I6be2422aa0352ee9f0ff7429ee6e66384c2b5d57
Depends-On: I33269743a8f62b863569130aba3cc9b5a8539aa0
At the moment, grepping through logs to determine what's happening with
timeouts on a provider is difficult because for some errors the cause of
the timeout is on a different line than the provider in question.
Give each timeout a specific named exception, and then when we catch the
exceptions, log them specifically with node id, provider and then the
additional descriptive text from the timeout exception. This should
allow for easy grepping through logs to find specific instances of
types of timeouts - or of all timeouts. Also add a corresponding success
debug log so that comparative greps/counts are also easy.
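The shape of the change, roughly (the exception class names are
illustrative):

    import logging

    log = logging.getLogger("nodepool.launcher")

    class TimeoutException(Exception):
        pass

    class ServerDeleteTimeoutException(TimeoutException):
        """Timed out waiting for a server delete to complete."""

    class IPTimeoutException(TimeoutException):
        """Timed out waiting for the server to get an IP address."""

    def wait_and_log(node_id, provider, action, waiter):
        try:
            waiter()
            log.debug("Node %s in %s: %s succeeded",
                      node_id, provider, action)
        except TimeoutException as e:
            log.error("Node %s in %s: %s", node_id, provider, e)
            raise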
Change-Id: I889bd9b5d92f77ce9ff86415c775fe1cd9545bbc
In case there is useful debug information in the server fault message,
log it so that we can try to track down why servers go away.
Change-Id: I33fd51cbfc110fdb1ccfa6bc30a421d527f2e928
Newer clouds are always going to return true for has-extension
os-floating-ips because nova has moved to microversions and away from
extension lists.
Change-Id: I0232db76216468ca9b56343501d02902aaa21963