Commit Graph

168 Commits (22961c5ba5c6573b615a9bf7cfbb78c365aa07aa)

Author SHA1 Message Date
James E. Blair 22961c5ba5 Add tests for the allocator
And fix a bug in it that caused too-small allocations in some
circumstances.

The main demonstration of this failure was:
  nodepool.tests.test_allocator.TwoLabels.test_allocator(two_nodes)

Which allocated 1,2 instead of 2,2.  But the following tests also
failed:

  nodepool.tests.test_allocator.TwoProvidersTwoLabels.test_allocator(four_nodes_over_quota)
  nodepool.tests.test_allocator.TwoProvidersTwoLabels.test_allocator(four_nodes)
  nodepool.tests.test_allocator.TwoProvidersTwoLabels.test_allocator(three_nodes)
  nodepool.tests.test_allocator.TwoProvidersTwoLabels.test_allocator(four_nodes_at_quota)
  nodepool.tests.test_allocator.TwoProvidersTwoLabels.test_allocator(one_node)

Change-Id: Idba0e52b2775132f52386785b3d5f0974c5e0f8e
2014-03-31 09:20:16 -07:00
James E. Blair db5602a91e Add ready-script and multi-node support
Write information about the node group to /etc/nodepool, along
with an ssh key generated specifically for the node group.

Add an optional script that is run on each node (and sub-node) for
a label right before a node is placed in the ready state.  This
script can use the data in /etc/nodepool to setup access between
the nodes in the group.

Change-Id: Id0771c62095cccf383229780d1c4ddcf0ab42c1b
2014-03-31 09:20:15 -07:00
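
A ready-script can use the data written to /etc/nodepool to wire the
node group together.  A minimal sketch in Python, assuming hypothetical
file names (the exact files are defined by this change, not shown here):

  # ready.py - illustrative ready-script, run before the node goes READY.
  import subprocess

  def read_lines(path):
      with open(path) as f:
          return [line.strip() for line in f if line.strip()]

  # Assumed file names written by nodepool:
  sub_nodes = read_lines('/etc/nodepool/sub_nodes')
  key = '/etc/nodepool/id_rsa'

  # Verify the primary can reach each sub-node with the group key.
  for ip in sub_nodes:
      subprocess.check_call(
          ['ssh', '-i', key, '-o', 'StrictHostKeyChecking=no',
           'root@%s' % ip, 'true'])
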
Jenkins ad7b9a849b Merge "Fix the allocation distribution" 2014-03-28 23:22:34 +00:00
James E. Blair da96ca9ff4 Fix the allocation distribution
The new simpler method for calculating the weight of the targets
is a little too simple and can miss allocating nodes.  Make the
weight change as the algorithm walks through the target list
to ensure that everything is allocated somewhere.

Change-Id: I98f72c69cf2793aa012f330219cd850a5d4ceab2
2014-03-28 15:51:59 -07:00
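
The idea of the fix, sketched generically (not the actual allocator
code): recompute each target's weight against what is still
unallocated, so rounding can never lose a node:

  def distribute(total, weights):
      # Recomputing the weight at each step against the remaining
      # amount guarantees the grants sum to ``total``.
      grants = []
      remaining = total
      weight_left = float(sum(weights))
      for w in weights:
          grant = int(round(remaining * (w / weight_left)))
          grants.append(grant)
          remaining -= grant
          weight_left -= w
      return grants

  assert sum(distribute(2, [1, 1])) == 2   # never 1,0
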
James E. Blair 7bfda82c6b Fix image/label name typo in stats
Change-Id: Ibd75b0ef169d6a3cc2a1ebfbea54139dfa28dedc
2014-03-28 15:12:11 -07:00
James E. Blair 9d4e56ff57 Add 'labels' as a configuration primitive
Labels replace images as the basic identity for nodes.  Rather than
having nodes of a particular image, we now have nodes of a particular
label.  A label describes how a node is created from an image, which
providers can supply such nodes, and how many should be kept ready.

This makes configuration simpler (by not specifying which images
are associated with which targets and simply assuming an even
distribution, the target section is _much_ smaller and _much_ less
repetitive).  It also facilitates describing how nodes of
potentially different configurations (e.g., number of subnodes) can
be created from the same image.

Change-Id: I35b80d6420676d405439cbeca49f4b0a6f8d3822
2014-03-28 09:12:27 -07:00
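
A hypothetical config fragment showing the shape of the new primitive
(field names are illustrative, not the exact schema):

  labels:
    - name: devstack-precise
      image: devstack-precise   # how the node is created from an image
      min-ready: 10             # how many to keep ready
      providers:                # which providers can supply such nodes
        - name: provider1
        - name: provider2
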
Jenkins 9dd3ced2b1 Merge "Raise min_demand due to slow node boot times" 2014-03-28 15:53:18 +00:00
Jenkins 72f7333765 Merge "Stop waiting for resources in ERROR state" 2014-03-28 01:05:56 +00:00
James E. Blair 71e1419f61 Stop waiting for resources in ERROR state
As soon as a resource changes to the ERROR state, stop waiting for
it.  Return it to the caller, where it will be deleted.

Change-Id: I128bc4344b238b96e5696cce87f608fb2cdffa6e
2014-03-27 14:02:30 -07:00
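
The behavior as a minimal wait-loop sketch (illustrative, not the
actual provider code):

  import time

  def wait_for_server(get_status, server_id, timeout=3600):
      # Return as soon as the resource is ACTIVE *or* ERROR; the
      # caller inspects the result and deletes ERROR'd resources.
      deadline = time.time() + timeout
      while time.time() < deadline:
          status = get_status(server_id)
          if status in ('ACTIVE', 'ERROR'):
              return status
          time.sleep(2)
      raise Exception('timeout waiting for %s' % server_id)
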
James E. Blair e206893f27 Add the ability to create subnodes
An image can specify that it should be created with a number of
subnodes.  That number of nodes of the same image type will also
be created and associated with each primary node of that image
type.

Adjust the allocator to accommodate the expected loss of capacity
associated with subnodes.

If a node has subnodes, wait until they are all in the ready state
before declaring a node ready.

Change-Id: Ia4b315b1ed2999da96aab60c5c02ea2ce7667494
2014-03-27 12:57:40 -07:00
James E. Blair 30bf0ecb87 Add SubNodes and the ability to delete them
There's no way to create subnodes yet.  But this change introduces
the class/table and a method to cleanly delete them if they exist.

The actual server deletion and the waiting for that to complete
are separated out in the provider manager, so that we can kick
off a bunch of subnode deletes and then wait for them all to
complete in one thread.

All existing calls to cleanupServer are augmented with a new
waitForServerDeletion call to handle the separation.

Change-Id: Iba9d5a0a61cccc07d914e60a24777c6451dca7ea
2014-03-27 12:57:40 -07:00
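
The split lets one thread fan out many deletes and then wait for all
of them.  Roughly (the method names come from the message; the node
attributes are assumed):

  def delete_subnodes(manager, node):
      # Fan out: issue every subnode delete without blocking...
      for subnode in node.subnodes:
          manager.cleanupServer(subnode.external_id)
      # ...then wait for each deletion to finish in this one thread.
      for subnode in node.subnodes:
          manager.waitForServerDeletion(subnode.external_id)
          subnode.delete()
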
Jenkins 14b308b20b Merge "Include provider names in timeout messages" 2014-03-26 20:45:37 +00:00
Clark Boylan 841750fdea Depend on hacking for its dependencies.
flake8 does not pin its pep8 and pyflakes dependencies which makes it
possible for new sets of rules to suddenly apply whenever either of
those projects pushes a new release. Fix this by depending on hacking
instead of flake8, pep8, and pyflakes directly. This will keep nodepool
in sync with the rest of openstack even if it doesn't use the hacking
rules (H*).

Change-Id: Ice9198e9439ebcac15e76832835e78f72344425c
2014-03-26 12:39:41 -07:00
James E. Blair fdc3616927 Include provider names in timeout messages
Also, remove the extra "waiting for".

Change-Id: I5842daecb6b193eb5d6d2d2662dfab89ac8f7344
2014-03-25 17:56:29 -07:00
Paul Belanger e2fff8cd15 Set paramiko version > 1.9.0
Ubuntu 12.04 package version for paramiko is 1.7.7.1, which lacks
the additional arguments for exec_command:

  TypeError: exec_command() got an unexpected keyword argument 'get_pty'

1.10.0 was the first version to add the get_pty flag.

Change-Id: I3b4d8a6d8a1d10ab002a79824feab8937d160244
Signed-off-by: Paul Belanger <paul.belanger@polybeacon.com>
2014-03-23 19:36:41 -04:00
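
The failing call is paramiko's SSHClient.exec_command() with the
get_pty keyword, which only exists from 1.10.0 on.  A minimal sketch
(host and user are assumptions):

  import paramiko

  client = paramiko.SSHClient()
  client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
  client.connect('node.example.com', username='jenkins')

  # get_pty requires paramiko >= 1.10.0; older releases raise the
  # TypeError quoted above.
  stdin, stdout, stderr = client.exec_command('sudo true', get_pty=True)
  print(stdout.read())
  client.close()
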
James E. Blair 6ab771728e Delete created keypairs if nova boot fails
If the server fails to boot (perhaps due to quota issues) during
an image update, and nodepool created a key for that server, the key
won't be deleted, because the existing keypair delete is done as
part of deleting the server.  Handle the special case of never
having actually booted a server and delete the keypair explicitly
in that case.

Change-Id: I0607b77ef2d52cbb8a81feb5e9c502b080a51dbe
2014-03-20 10:24:33 -07:00
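
The shape of the fix, sketched against a novaclient-style API (hedged;
the real code path differs):

  def boot_with_cleanup(client, name, image, flavor, key_name=None):
      created_key = False
      if key_name is None:
          key_name = 'nodepool-%s' % name   # assumed naming scheme
          client.keypairs.create(key_name)
          created_key = True
      try:
          return client.servers.create(name, image, flavor,
                                       key_name=key_name)
      except Exception:
          # The server never existed, so the delete-the-server path
          # will never clean the keypair up; delete it explicitly.
          if created_key:
              client.keypairs.delete(key_name)
          raise
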
Joe Gordon 4cae69fd01 Raise min_demand due to slow node boot times
Booting a node takes up to 16 minutes, so keep more nodes in ready
state.

Change-Id: I0ae647c658feffabc499c96b0a9ed11855202c4b
2014-03-17 14:28:04 -07:00
Jenkins bd6f5cdf54 Merge "Roll up node stats" 2014-03-14 21:59:03 +00:00
Fengqian Gao 366746aeb0 Keep py3.X compatibility for urllib/urllib2
Use six.moves.urllib instead of urllib and
six.moves.urllib.request instead of urllib2.

Partial-Bug: #1280105

Change-Id: Id122f7be5aa3e0dd213bfa86f9be86d10d72b4a6
2014-02-25 16:51:03 +08:00
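
The substitution pattern, which imports the same names under both
Python 2 and 3:

  # Before (Python 2 only):
  #   import urllib
  #   import urllib2
  # After (2 and 3 via six):
  from six.moves import urllib

  urllib.parse.urlencode({'q': 'nodepool'})     # was urllib.urlencode
  urllib.request.urlopen('http://localhost/')   # was urllib2.urlopen
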
James E. Blair 7e0c42d035 Roll up node stats
As the number of providers and targets grows, the number of stats
that graphite has to sum in order to produce the summary graphs that
we use grows.  Instead of asking graphite to summarize something like
400 metrics (takes about 3 seconds), have nodepool directly produce
the metrics that we are going to use.

Change-Id: I2a7403af2512ace0cbe795f2ec17ebcd9b90dd09
2014-02-24 17:29:19 -08:00
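
The idea, generically: aggregate inside nodepool and emit a handful of
pre-summed keys instead of the ~400 per-provider/per-target series.  A
sketch (the real metric names are not shown here):

  from collections import defaultdict

  def rollup(nodes, send):
      # One gauge per (label, state) pair, already summed, rather than
      # leaving graphite to add up provider x target x label series.
      counts = defaultdict(int)
      for node in nodes:
          counts[(node.label, node.state)] += 1
      for (label, state), count in counts.items():
          send('nodepool.nodes.%s.%s' % (label, state), count)
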
Jenkins 839646ecbe Merge "Add fedora support" 2014-02-24 21:49:37 +00:00
Jenkins e8d67fead2 Merge "Retry ssh connections on auth failure." 2014-02-24 21:48:53 +00:00
Jenkins dd51a2fb58 Merge "Preserve HOLD state when job starts." 2014-02-21 23:47:49 +00:00
James E. Blair 08a6254348 Keep current and previous snapshot images
The previous logic around how to keep images was not accomplishing
anything particularly useful.

Instead:
  * Delete images that are not configured or have no corresponding
    base image.
  * Keep the current and previous READY images.
  * Otherwise, delete any images that have been in their current
    state for more than 8 hours.

Also, correct the image-update command which no longer needs to
join a thread.  Also, fix up some poorly exercised parts of the
fake provider.

Change-Id: Iba921f26d971e56692b9104f9d7c531d955d17b4
2014-02-21 13:08:26 -08:00
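
The new retention rules as a decision sketch (a hypothetical helper
mirroring the three bullets above; ready_images is assumed to be
ordered oldest-first):

  EIGHT_HOURS = 8 * 60 * 60

  def should_delete(image, configured, ready_images, now):
      # Not configured (or no corresponding base image): delete.
      if image.name not in configured:
          return True
      # Keep the current and previous READY images.
      if image in ready_images[-2:]:
          return False
      # Otherwise delete anything stuck in its state for > 8 hours.
      return now - image.state_time > EIGHT_HOURS
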
Clark Boylan 2f6dfbd59b Preserve HOLD state when job starts.
Previously when a job started on a node nodepool always changed that
node to state USED. Now preserve the old state value if it was
previously HOLD.

Change-Id: Ia4f736ae20b8b24ec079aa024fad404019725bcb
2014-02-20 17:45:49 -08:00
James E. Blair f73676bc31 Fix more str!=int bugs
When we return ids after creating servers or images, coerce them
to strings so they compare correctly when we wait for them.

Change-Id: I6d4575f9a392b6028bcec4ad57299b7f467cb764
2014-02-20 16:10:08 -08:00
James E. Blair 55be19f0f2 Remove unhelpful log message
Since we're caching lists, we would actually expect a server not
to appear within the first two iterations of the loop, so remove
a log message that says it isn't there (which now shows up quite
often).

Change-Id: Ifbede6a141809e9fa40b910de2aabbd44f252fe5
2014-02-20 15:03:56 -08:00
James E. Blair b818085c97 Coerce all ids from novaclient to str
Nova might return int or str ids for various objects.  Our db
stores them all as strings (since that supports the superset).
Since we convert all novaclient info into simply list/dict
datastructures anyway, convert all the ids to str at the same
time to make comparisons elsewhere easier.

Change-Id: Ic90c07ec906865e975decee190c2e5a27ef7ef6d
2014-02-20 14:48:47 -08:00
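
The coercion happens once, where novaclient objects are flattened into
plain dicts, so every later comparison is str == str.  Roughly:

  def make_server_dict(server):
      # Nova may hand back int or str ids; the db stores str.
      return {
          'id': str(server.id),
          'name': server.name,
          'status': server.status,
      }
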
James E. Blair 9bd9567ccb Node deletion related fixes
* Less verbose logging (these messages show up a lot).
* Handle the case where a db record disappears during cleanup but
  before the individual node cleanup has started.
* Fix a missed API cleanup where deleteNode was still being
  called with a node object instead of the id.

Change-Id: I2025ff19a51cfacff64dd8345eaf120bf3473ac2
2014-02-20 14:33:47 -08:00
James E. Blair 34ce3e2da4 Use the task manager to get extensions and flavors
Since these calls can now come from any thread, these should now
be run through the task manager to serialize them and make sure
that we don't run into novaclient thread-safety issues.

Change-Id: I46ab44b93d56ad1ce289bf837511b9373d3284ee
2014-02-20 14:22:38 -08:00
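
Serializing through a task manager means worker threads enqueue a task
and block on its result while a single thread talks to novaclient.  A
generic sketch of the pattern (not nodepool's actual TaskManager):

  import threading
  from six.moves import queue

  class TaskManager(threading.Thread):
      def __init__(self):
          super(TaskManager, self).__init__()
          self.daemon = True
          self.queue = queue.Queue()

      def run(self):
          while True:
              func, args, done, box = self.queue.get()
              box['result'] = func(*args)   # one API call at a time
              done.set()

      def submit(self, func, *args):
          done, box = threading.Event(), {}
          self.queue.put((func, args, done, box))
          done.wait()
          return box['result']

  # e.g.: flavors = manager.submit(client.flavors.list)

(Error handling is omitted; a failed task should still set the event
and re-raise in the caller.)
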
James E. Blair aa90b344da Delete all building nodes on daemon start
Nodepool can not (currently) resume building a node if the daemon
is interrupted while that is happening.  At least have it clean
up nodes that are in the 'building' state when it starts.

Change-Id: I66124c598b01919d3fd8b6158c482d65508c6dae
2014-02-20 09:07:12 -08:00
James E. Blair ef799c4236 Perform all deletes in threads
Have all node deletions (like node launches) handled by a thread,
including ones started by the periodic cleanup.  This will make
the system more responsive when providers are unable to delete
nodes quickly, as well as when large numbers of nodes are deleted
by an operator, or when the system is restarted while many node
deletions are in progress.

Additionally, make the 'nodepool delete' command merely update the
db with the expectation that the next run of the cleanup cron will
spawn deletes.  Add a '--now' option so that an operator may still
delete nodes synchronously, for instance when the daemon is not
running.

Change-Id: I20ce2873172cb1906e7c5832ed2100e23f86e74e
2014-02-20 08:59:42 -08:00
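
The command-line behavior, roughly (names are illustrative):

  def delete_command(pool, node_id, now=False):
      node = pool.db.getNode(node_id)   # assumed db helper
      if now:
          pool.deleteNode(node)         # synchronous: daemon may be down
      else:
          node.state = 'delete'         # next cleanup run spawns a thread
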
James E. Blair 7e21d6a991 Check server status in batch
And cache the results for 5 seconds.  This should remove a huge
amount of GETs for individual server status, instead replacing
them with a single request that fetches the status of all servers.
All individual build or delete threads will use the cached result
from the most recent server list, and if it is out of date, they
will trigger an update.

The main benefit of using the individual requests was that they
also provided progress information.  Since that just goes into
the logs and no one cares, we can certainly do without it.

Also includes a minor constant replacement.

Change-Id: I995c3f39e5c3cddc6f1b2ce6b91bcd178ef2fbb0
2014-02-20 08:59:34 -08:00
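
A sketch of the 5-second list cache: every build or delete thread reads
the shared list, and a stale read triggers one refresh instead of one
GET per server (locking omitted for brevity):

  import time

  class ServerListCache(object):
      TTL = 5.0

      def __init__(self, client):
          self.client = client
          self.servers = []
          self.fetched = 0.0

      def list(self):
          if time.time() - self.fetched > self.TTL:
              # One GET for everyone instead of one per server.
              self.servers = self.client.servers.list()
              self.fetched = time.time()
          return self.servers

      def status(self, server_id):
          for s in self.list():
              if str(s.id) == str(server_id):
                  return s.status
          return None
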
Derek Higgins 7afd8ff3b1 Add fedora support
The previous loop iterating over usernames doesn't work if another
user is added: we move onto the second username once ssh comes up
and root fails to run a command, then hit a timeout on the second
user and never attempt the third.

Instead, take an approach where we continue trying root until it
either works, or ssh succeeds but a command can't be run.  At that
stage we can try any other users that may be configured on the
image, with a short timeout (we know ssh has come up; if it hadn't,
ssh'ing as root would have timed out).

Change-Id: Id05aa186e8496d19f41d9c723260e2151deb45c9
2014-02-20 10:26:10 +00:00
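
The revised login logic, sketched (connect and can_run are
hypothetical helpers):

  def find_login(connect, users, timeout=600):
      # connect(user, timeout) returns an ssh client or None.
      # Phase 1: retry root with the full timeout until ssh answers.
      ssh = connect('root', timeout)
      if ssh is not None and ssh.can_run('echo ok'):
          return 'root'
      # Phase 2: ssh is known to be up, so a short timeout suffices
      # for the remaining configured users.
      for user in users:
          ssh = connect(user, 60)
          if ssh is not None and ssh.can_run('echo ok'):
              return user
      raise Exception('no usable login user')
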
Jenkins 1fb3147b17 Merge "Make nodepool more robust to offline clouds." 2014-02-20 01:26:53 +00:00
Robert Collins 9368953229 Make nodepool more robust to offline clouds.
When a cloud is offline we cannot query its flavors or extensions,
and without those we cannot use a provider manager.  For these
attributes, making the properties lazy-initialize fixes the problem
(we may make multiple queries, but the lookup is idempotent so
locking is not needed).

Callers that trigger flavor or extension lookups have to be able to
cope with a failure propagating up - I believe I've manually found
all the places.

The catchall in _getFlavors would mask the problem and lead to
the manager being incorrectly initialized, so I have removed that.

Startup will no longer trigger cloud connections in the main thread,
it will all be deferred to worker threads such as ImageUpdate,
periodic check etc.

Additionally I've added some belts-and-braces catches to the two
key methods - launchImage and updateImage - which, while they don't
directly interact with a provider manager, do access the provider
definition; I think that can lead to occasional skew between the
DB and the configuration.  I'm not /sure/ they are needed, but
I'd rather be safe.

Change-Id: I7e8e16d5d4266c9424e4c27ebcc36ed7738bc86f
Fixes-Bug: #1281319
2014-02-19 17:12:33 -08:00
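
Lazy initialization as a sketch: defer the flavor/extension queries
until first use, so constructing a manager for an offline cloud cannot
fail, and a failed query simply propagates to the worker thread that
asked:

  class ProviderManager(object):
      def __init__(self, client):
          self.client = client
          self._flavors = None

      @property
      def flavors(self):
          # First access queries the cloud; repeat queries are
          # idempotent, so no locking is needed.
          if self._flavors is None:
              self._flavors = self.client.flavors.list()
          return self._flavors
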
Dan Prince 1963731f7d Retry ssh connections on auth failure.
Some cloud instance types (Fedora for example) create
the ssh user after sshd comes online. This allows
our ssh connection retry loop to handle this scenario
gracefully.

Change-Id: Ie345dea50fc2983112cd2e72826a708583d2712a
2014-02-19 16:07:40 -05:00
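
A minimal sketch of the retry: paramiko raises AuthenticationException
until the user exists, so keep trying until a deadline:

  import time
  import paramiko

  def connect_retry(host, user, timeout=60):
      deadline = time.time() + timeout
      while True:
          try:
              client = paramiko.SSHClient()
              client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
              client.connect(host, username=user)
              return client
          except paramiko.AuthenticationException:
              # sshd is up but the user isn't created yet; retry.
              if time.time() > deadline:
                  raise
              time.sleep(2)
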
James E. Blair 2d45897802 Fix typo in allocation
This was causing the target distribution for all images to be
calculated according to the values of the last image processed
by the previous loop.

Change-Id: I3d5190c78849b77933e18c5cf6a9c7443945b6cd
2014-02-19 10:47:58 -08:00
Jenkins eeffbc86ac Merge "Allow usage of server IDs as well as names." 2014-02-19 01:47:52 +00:00
James E. Blair 91f7af41d3 Make jenkins get info task synchronous
This is called within the main loop, which means that if there are
a lot of jenkins tasks pending, the main loop waits until they are
all finished.  Instead, make this call synchronous.  It should be
lightweight, so the additional load on jenkins should be negligible.

Also increase the sleep in the main loop to 10 seconds to mitigate
this somewhat.

Change-Id: I16e27fbd9bfef0617e35df08f4fd17bc2ead67b0
2014-02-18 16:57:40 -08:00
Bob Ball c47547d321 Allow usage of server IDs as well as names.
RAX have some hidden images useful for building a xenserver host.
Ensure nodepool can use these by referring to them by image UUID.

Change-Id: Idfefc60e762740f4cffa437933007942a0920970
2014-02-12 16:55:24 +00:00
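
The lookup pattern, sketched against a novaclient-style images manager
(hedged; the real change may differ): try the human-readable name
first, then fall back to treating the value as a UUID:

  from novaclient import exceptions

  def find_image(client, name_or_id):
      try:
          return client.images.find(name=name_or_id)
      except exceptions.NotFound:
          # Not a visible name; hidden RAX images resolve by UUID.
          return client.images.get(name_or_id)
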
James E. Blair 6878447060 Revert delete-rework branch
Revert "Log state names not numbers"

This reverts commit 1c7c954aab.

Revert "Also log provider name when debugging deletes."

This reverts commit 3c32477bc7.

Revert "Run per-provider cleanup threads."

This reverts commit 3f2100cf6f.

Revert "Decouple cron names from config file names."

This reverts commit 9505671e66.

Revert "Move cron definition out of the inner loop."

This reverts commit f91940bf31.

Revert "Move cron loading below provider loading."

This reverts commit 84061a038c.

Revert "Teach periodicCleanup how to do one provider."

This reverts commit e856646bd7.

Revert "Use the nonblocking cleanupServer."

This reverts commit c2f854c99f.

Revert "Split out the logic for deleting a nodedb node."

This reverts commit 5a696ca992.

Revert "Make cleanupServer optionally nonblocking."

This reverts commit 423bed124e.

Revert "Consolidate duplicate logging messages."

This reverts commit b8a8736ac5.

Revert "Log how long nodes have been in DELETE state."

This reverts commit daf427f463.

Revert "Cleanup nodes in state DELETE immediately."

This reverts commit a84540f332.

Change-Id: Iaf1eae8724f0a9fe1c14e70896fa699629624d28
2014-02-05 14:54:49 -08:00
James E. Blair 1c7c954aab Log state names not numbers
Change-Id: Ieefdcc8c910b5fd8e50c9c08793f1d80c76f623c
2014-02-05 10:32:57 -08:00
Robert Collins 3c32477bc7 Also log provider name when debugging deletes.
This helps us identify why a node might be in state for a long time.

Change-Id: I81dc7e4a033dff46111e98be957cbe5d1ac3872c
2014-02-05 11:47:34 +13:00
Jenkins 8e045b0fba Merge "Expose paramiko's get_pty parameter." 2014-02-03 23:09:18 +00:00
Jenkins 2c6383ad7f Merge "Revert "Provide diagnostics when task rate limiting."" 2014-02-03 23:05:28 +00:00
Jenkins d7a0b8a7fe Merge "Revert "Default to a ratelimit of 2/second for API calls"" 2014-02-03 23:05:18 +00:00
Robert Collins 2f052a01e3 Include check in fake.yaml.
Anyone starting from fake.yaml should have this.

Change-Id: Ieaa8a075cfff5259fabdcdada71c04cddb53cf9f
2014-02-03 20:22:07 +00:00
Robert Collins 3f2100cf6f Run per-provider cleanup threads.
Change-Id: I4c1c54190a5e254dd946ef308a11bb81907bb16c
2014-02-03 20:22:07 +00:00
Robert Collins 9505671e66 Decouple cron names from config file names.
This permits creating multiple cron entries for one config entry,
which per-provider cleanups require.

Change-Id: Ied9db4a512d8efef355e911d6e5630697d6c38c9
2014-02-03 20:22:06 +00:00