If a Jenkins is in shutdown mode or offline, ignore it for the purposes
of launching nodes. Node updates (used/complete) for that Jenkins will
still be processed.
This should allow another Jenkins to gracefully accept the increased
load if one goes offline.
Also, log the IP address when spinning up a node.
Change-Id: I3a8720dd5aaf154ca91cdc36136decad52eb6afa
The current code has parse errors with more complex Gearman function
names (which can show up due to the way Jenkins constructs Maven
jobs).
Also, switch the calculation to examine only queued jobs (total -
running) instead of trying to calculate a worker shortage (total -
workers). The latter doesn't deal well with multiple jobs that
require workers of the same image (it incorrectly treats them as
independent). By examining only queued jobs, the actual relationship
between multiple jobs that require workers from the same image is
captured: if, together, such jobs exceed the available workers, we
will see jobs sitting in the queue.
In other words, the overall picture is now that nodepool should
have at least enough ready+building nodes to accommodate the number
of jobs for a given worker/image that Gearman is waiting to run.
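A minimal sketch of that calculation, assuming Gearman status tuples of
(name, total, running, workers) as returned by the text-protocol
"status" command; the label extraction below is a simplification, not
nodepool's actual parser:

    def queued_jobs_by_label(status_rows):
        # status_rows: iterable of (name, total, running, workers) tuples.
        demand = {}
        for name, total, running, _workers in status_rows:
            if not name.startswith('build:'):
                continue
            label = name.split(':')[-1]   # simplified label extraction
            queued = total - running      # jobs waiting, regardless of workers
            if queued > 0:
                demand[label] = demand.get(label, 0) + queued
        return demand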
Change-Id: Ibc2990ed2c7aea37bd4c94e5387c80ef840afa83
Use information from Gearman to determine the immediate load
requirements of the system and spin up as many nodes as required
to meet the demand. Use the existing information about the
min and max servers to determine the ratio of servers to spin
up from each provider.
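A hedged sketch of the ratio idea: spread the demand for a label across
providers in proportion to their configured max-servers. Plain dicts
stand in for nodepool's real config objects, and rounding is handled
only crudely here:

    def allocate(demand, max_servers_by_provider):
        # demand: number of nodes needed for one label.
        # max_servers_by_provider: {provider_name: max_servers}
        capacity = sum(max_servers_by_provider.values()) or 1
        allocation = {}
        remaining = demand
        for name, max_servers in sorted(max_servers_by_provider.items()):
            share = int(round(demand * max_servers / float(capacity)))
            share = min(share, remaining)
            allocation[name] = share
            remaining -= share
        # Any rounding remainder is simply left unallocated in this sketch.
        return allocation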
Replaces the several fake server scripts with one script that
implements statsd, zmq, and gearman to ease testing.
Change-Id: Ic0dedc7ef2760ff664912f771377e02967ad5633
* nodepool/nodepool.py: Make the initial ssh timeout configurable
(retain the default of 60 seconds). Looking at the logs, there is a very
high occurrence of SSH timeouts from our providers. Making this timeout
configurable will allow us to adjust it if necessary.
Additionally, fix the comparison between old and new provider configs;
some items were not compared when they should have been.
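In spirit, the comparison fix is about making sure every relevant field
takes part in the equality check. A hypothetical sketch (the field
names are examples, not nodepool's exact attribute list):

    PROVIDER_FIELDS = ('username', 'password', 'auth_url', 'project_id',
                       'max_servers', 'boot_timeout')

    def provider_changed(old, new):
        # True if any field differs between the old and new provider config.
        return any(getattr(old, f) != getattr(new, f)
                   for f in PROVIDER_FIELDS)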
Change-Id: I51df708cb24e93e87c2fedf36d1f9de2131c76bd
Some of us have hacking installed globally, which means nodepool's
flake8 run produces a lot of spurious warnings. Suppress them.
Change-Id: Ie869a92fa423dc022c5c37c102f5a9071ccaf1b0
When responding to a build complete event, don't do anything to
the node if it is in the HOLD state.
Change-Id: I37e458198bfcd08472d07ca9c206c1b4551f3341
It lists servers that exist in the provider accounts but are unknown
to nodepool. Useful for identifying resource leaks.
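The idea, roughly sketched (inputs are plain collections rather than
nodepool's real objects):

    def find_aliens(provider_server_ids, db_server_ids):
        # Servers the provider reports but the nodepool db does not know about.
        return sorted(set(provider_server_ids) - set(db_server_ids))

    # e.g. find_aliens(['a1', 'a2', 'a3'], ['a1', 'a3']) == ['a2']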
Change-Id: Iaf71d6320d6ec7691f301208e09974cad2177ad5
Move the daemon command to nodepoold.
Refactor config handling a bit in NodePool to make the config
objects just contain information by default (though things
such as database handles and managers may get added to them
later as needed).
Start with the list and image-list commands.
Change-Id: If2ba7bca7ab4ef922787176af87ad5de31ae4b3e
Log stdout/stderr from the image build process. Use the provider
and image name in the log selector so that admins can route
appropriately (or at least grep).
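A sketch of the naming scheme, assuming the standard logging module;
the exact logger prefix is illustrative:

    import logging

    def image_build_logger(provider_name, image_name):
        # e.g. "nodepool.image.build.rackspace.precise"
        return logging.getLogger(
            'nodepool.image.build.%s.%s' % (provider_name, image_name))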
Change-Id: I7bc74ebfca3184340b51b083695b3441f0924e83
The logic around when to delete an image was just completely wrong.
Also, Rackspace sometimes returns a deleted image when we request
it, while HP returns a 404. Handle both of those situations.
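A hedged sketch of handling both behaviours; novaclient's NotFound
exception is real, but the helper and the status check are
illustrative:

    from novaclient import exceptions

    def image_is_gone(client, image_id):
        try:
            image = client.images.get(image_id)
        except exceptions.NotFound:
            return True                   # 404 once deleted (HP-style)
        return image.status == 'DELETED'  # returned but deleted (Rackspace)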
Change-Id: I4b6d620a750bd39a1d3b89e6eb51baf37694f8a7
Add a new node state, TEST, and if a test job name is supplied
put the node in the TEST state, and run that job with the node
name as a parameter. If the job succeeds, move it into READY
and relabel it with the appropriate label (from the image name).
If it fails, immediately delete the node.
If it never runs, it will eventually be cleaned up by the
periodic cleanup task.
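A rough, self-contained sketch of that lifecycle; the Node class and
state names here are stand-ins mirroring the description above, not
nodepool's real model:

    TEST, READY, DELETE = 'test', 'ready', 'delete'

    class Node(object):
        def __init__(self, name):
            self.name = name
            self.state = TEST
            self.label = None

    def handle_test_result(node, image_name, succeeded):
        if succeeded:
            node.state = READY
            node.label = image_name   # relabel based on the image name
        else:
            node.state = DELETE       # failed test: delete immediately
        return node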
Change-Id: I5ba1ea8cdc832b13a760edaee841487afe7d7ce4
This mostly assists local dev, but is a good idea anyway.
Move the ZMQ port in the test script so it does not conflict
with the default port of the Jenkins ZMQ plugin (another local
dev convenience).
Change-Id: I68f7fc31fe7e2a819568a2f40626641dee240387
If a target, provider, or image did not exist in the config but
was still in the db, the stats function would encounter a KeyError.
This makes sure we can still report stats for lingering resources.
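The fix amounts to looking up config entries defensively; a tiny
hypothetical example:

    def max_servers_for(config_providers, provider_name):
        # config_providers: {name: provider_config}; the provider may have
        # been removed from the config while servers still exist in the db.
        provider = config_providers.get(provider_name)
        return provider.max_servers if provider else 0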
Change-Id: Iade002917dbcb2931bb4f9ff009516d24c47e743
Move them to /opt/nodepool-scripts, a world-readable location, so
that they can be run as any user.
Change-Id: I007e341fbe17067c164d3712fcfb7e744bdd80e9
From 1 hour to 10 minutes. If it isn't deleted by then, it will
be removed by the next pass of the cleanup process.
Change-Id: I6dd1693d14fd215117ddbed8440ff4abe02c374c
This is used to serialize all access to an individual provider
(nova client). One ProviderManager is created for every provider
defined in the configuration. Any actions that require interaction
with nova submit a task to the manager which processes them serially
with an appropriate delay to ensure that rate limits are not hit.
This solves not only rate-limit problems, but also ends multi-threaded
access to a single novaclient Client object.
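A condensed sketch of that pattern: one worker thread per provider
pulls tasks off a queue, runs them, and sleeps between calls. Class
and attribute names are illustrative, not nodepool's exact
implementation:

    import queue
    import threading
    import time

    class Task(object):
        def __init__(self, func):
            self.func = func
            self.done = threading.Event()
            self.result = None

        def run(self):
            self.result = self.func()
            self.done.set()

        def wait(self):
            self.done.wait()
            return self.result

    class ProviderManager(threading.Thread):
        def __init__(self, rate=1.0):
            super(ProviderManager, self).__init__()
            self.daemon = True
            self.queue = queue.Queue()
            self.rate = rate   # minimum delay between provider calls

        def submitTask(self, func):
            task = Task(func)
            self.queue.put(task)
            return task

        def run(self):
            while True:
                task = self.queue.get()
                task.run()             # all provider calls happen in this thread
                time.sleep(self.rate)  # crude rate limiting

Callers submit a callable and block on task.wait() for the result, so
access to the underlying client is serialized through a single thread.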
Change-Id: I0cdaa747dac08cdbe4719cb6c9c220678b7a0320
Novaclient instances (via their internal requests.Session object)
do not correctly clean up after themselves. This visibly manifests
in the file descriptors for sockets not being closed.
A simple solution to this problem that also gains some efficiency
is to cache the novaclient objects for each provider. Based on
limited examination and research, I believe they are thread-safe.
The underlying requests library certainly is expected to be.
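A small sketch of the caching, with the client construction left to a
caller-supplied factory since novaclient constructor signatures vary
by version:

    import threading

    _clients = {}
    _clients_lock = threading.Lock()

    def get_client(provider_name, factory):
        # Return the cached client for this provider, creating it on first
        # use with the supplied factory (e.g. a novaclient Client constructor).
        with _clients_lock:
            if provider_name not in _clients:
                _clients[provider_name] = factory()
            return _clients[provider_name]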
Change-Id: I541a0783fabef368449ef6dc8c3cf766d3560bfa
We can burst and create a lot of threads, each of which will check out
a SQLAlchemy connection from the pool. This accommodates that.
We have a natural limit on the number of db connections -- we will
never use more than the total number of servers managed. So in that
case, just don't set an overflow limit for the db connection pool.
This means that it will stabilize on 5 open connections and burst
to as many as needed.
Also, ensure that the connection is returned to the pool in the
context manager's exit method, and set the session to None so that
it cannot be re-used (an easy way to make sure it can't be used
except as a context manager).
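A minimal sketch of the pool settings described above, assuming
SQLAlchemy; the database URI is a placeholder and assumes the MySQL
driver is installed:

    from sqlalchemy import create_engine

    engine = create_engine(
        'mysql://nodepool@localhost/nodepool',  # placeholder dburi
        pool_size=5,       # stabilizes at five open connections
        max_overflow=-1)   # no cap: burst to as many as the threads need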
Change-Id: Ie4628326b6b84fb0979e4eceed546404c4e30637
The existing db session strategy was inherited from a bunch of
shell scripts that ran once in a single thread and exited.
The surprising thing is that it worked at all. This change
replaces that "strategy" with one where each thread clearly
begins a new session as a context manager and passes that around
to functions that need the DB. A thread-local session is used
for convenience and extra safety.
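A rough sketch of that pattern using SQLAlchemy's scoped_session;
nodepool's real database layer has more to it, and the sqlite URI is
just a placeholder:

    import contextlib
    from sqlalchemy import create_engine
    from sqlalchemy.orm import scoped_session, sessionmaker

    engine = create_engine('sqlite://')                  # placeholder dburi
    Session = scoped_session(sessionmaker(bind=engine))

    @contextlib.contextmanager
    def getSession():
        session = Session()   # thread-local: each thread gets its own
        try:
            yield session
            session.commit()
        except Exception:
            session.rollback()
            raise
        finally:
            Session.remove()  # release the connection back to the pool

    # Each thread does its work inside "with getSession() as session:"
    # and passes the session to functions that need the DB.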
This also adds a fake provider that will produce fake images and
servers quickly without needing a real nova or jenkins. This was
used to develop the database change.
Also some minor logging changes and very brief developer docs.
Change-Id: I45e6564cb061f81d79c47a31e17f5d85cd1d9306
This is effectively a required db field; without it, the watermark
calculation can be wrong until it's filled in, so make sure it's
there to start.
Also some minor logging changes.
Change-Id: Idc5a9cd40fe330f7a1aea4a5513267ee3c254f60
And some other minor changes gleaned from production testing.
Remove the scripts dir because it is no longer needed.
Change-Id: I7ffe3ed8d2a1be294637ac18bc3eaefede97d401