nodepool

Author	SHA1	Message	Date
Paul Belanger	1b904ec697	Create snapshots when min-ready is >= 0 Currently, if min-ready is 0 a snapshot will not be created (nodepool considers the image to be disabled). Now if min-ready is greater than or equal to 0, nodepool will create the snapshot. The reason for the change is to allow jenkins slaves to be offline waiting for a new job to be submitted. Once a new job is submitted, nodepool will properly launch a slave node from the snapshot. Additionally, min-ready is now optional and defaults to 2. If min-ready is -1 the snapshot will not be created (label becomes disabled). Closes-Bug: #1299172 Change-Id: I7094a76b09266c00c0290d84ae0a39b6c2d16215 Signed-off-by: Paul Belanger <paul.belanger@polybeacon.com>	2014-03-31 15:23:58 -04:00
Jenkins	933a7e80dc	Merge "Fix update-image command"	2014-03-31 18:06:57 +00:00
James E. Blair	4a60b2846e	Fix update-image command The updateImage method signature changed to require name strings instead of objects. Change-Id: Ic8aefe86e59d2e36db903fe3735b4907a6c2bf2a	2014-03-31 10:53:02 -07:00
James E. Blair	ddcf1543fe	Fix missing attribute error in subnodes If statsd was enabled, we would hit an undefined attribute error because subnodes have no target. Instead, pass through the target name of the parent node and use that for statsd reporting. Change-Id: Ic7a04a85775a23f954ea565e8c82976b52b218c7	2014-03-31 09:22:00 -07:00
James E. Blair	92b9842951	Add a test for subnodes Some misc changes related to running this: * Set log/stdout/err capture vars as in ZUUL * Give the main loop a configurable sleep value so tests can run faster * Fix a confusing typo in the node.yaml config Additionally, a better method for waiting for test completion is added which permits us to use assert statements in the tests. Change-Id: Icddd2afcd816dbd5ab955fa4ab5011ac8def8faf	2014-03-31 09:22:00 -07:00
James E. Blair	fca89ee0a0	Add a very basic functional test It starts the daemon with a simple config file and ensures that it spins up a node. A timeout is added to the zmq listener so that its run loop can be stopped by the 'stopped' flag. And the shutdown procedure for nodepool is altered so that it sets those flags and waits for those threads to join before proceeding. The previous method could occasionally cause assertion errors (from C, therefore core dumps) due to zmq concurrency issues. Change-Id: I7019a80c9dbf0396c8ddc874a3f4f0c2e977dcfa	2014-03-31 09:22:00 -07:00
James E. Blair	852c6b0b96	Add per-test database fixture And test it. Change-Id: I49fb5f58127ed2a1c80282b55e30336da725b75c	2014-03-31 09:22:00 -07:00
James E. Blair	faef2431a7	Finish initial docs Finish the initial sections defined in the documentation index. Add sphinxcontrib-programoutput to document command line utils. Add py27 to the list of default tox targets. Change-Id: I254534032e0706e410647b023249fe3af4f3a35f	2014-03-31 09:21:56 -07:00
James E. Blair	22961c5ba5	Add tests for the allocator And fix a bug in it that caused too-small allocations in some circumstances. The main demonstration of this failure was: nodepool.tests.test_allocator.TwoLabels.test_allocator(two_nodes) Which allocated 1,2 instead of 2,2. But the following tests also failed: nodepool.tests.test_allocator.TwoProvidersTwoLabels.test_allocator(four_nodes_over_quota) nodepool.tests.test_allocator.TwoProvidersTwoLabels.test_allocator(four_nodes) nodepool.tests.test_allocator.TwoProvidersTwoLabels.test_allocator(three_nodes) nodepool.tests.test_allocator.TwoProvidersTwoLabels.test_allocator(four_nodes_at_quota) nodepool.tests.test_allocator.TwoProvidersTwoLabels.test_allocator(one_node) Change-Id: Idba0e52b2775132f52386785b3d5f0974c5e0f8e	2014-03-31 09:20:16 -07:00
James E. Blair	db5602a91e	Add ready-script and multi-node support Write information about the node group to /etc/nodepool, along with an ssh key generated specifically for the node group. Add an optional script that is run on each node (and sub-node) for a label right before a node is placed in the ready state. This script can use the data in /etc/nodepool to setup access between the nodes in the group. Change-Id: Id0771c62095cccf383229780d1c4ddcf0ab42c1b	2014-03-31 09:20:15 -07:00
Jenkins	ad7b9a849b	Merge "Fix the allocation distribution"	2014-03-28 23:22:34 +00:00
James E. Blair	da96ca9ff4	Fix the allocation distribution The new simpler method for calculating the weight of the targets is a little too simple and can miss allocating nodes. Make the weight change as the algorithm walks through the target list to ensure that everything is allocated somewhere. Change-Id: I98f72c69cf2793aa012f330219cd850a5d4ceab2	2014-03-28 15:51:59 -07:00
James E. Blair	7bfda82c6b	Fix image/label name typo in stats Change-Id: Ibd75b0ef169d6a3cc2a1ebfbea54139dfa28dedc	2014-03-28 15:12:11 -07:00
James E. Blair	9d4e56ff57	Add 'labels' as a configuration primitive Labels replace images as the basic identity for nodes. Rather than having nodes of a particular image, we now have nodes of a particular label. A label describes how a node is created from an image, which providers can supply such nodes, and how many should be kept ready. This makes configuration simpler (by not specifying which images are associated with which targets and simply assuming an even distribution, the target section is _much_ smaller and _much_ less repetitive). It also facilitates describing how nodes of potentially different configurations (e.g., number of subnodes) can be created from the same image. Change-Id: I35b80d6420676d405439cbeca49f4b0a6f8d3822	2014-03-28 09:12:27 -07:00
Jenkins	9dd3ced2b1	Merge "Raise min_demand due to slow node boot times"	2014-03-28 15:53:18 +00:00
Jenkins	72f7333765	Merge "Stop waiting for resources in ERROR state"	2014-03-28 01:05:56 +00:00
James E. Blair	71e1419f61	Stop waiting for resources in ERROR state As soon as a resource changes to the ERROR state, stop waiting for it. Return it to the caller, where it will be deleted. Change-Id: I128bc4344b238b96e5696cce87f608fb2cdffa6e	2014-03-27 14:02:30 -07:00
James E. Blair	e206893f27	Add the ability to create subnodes An image can specify that it should be created with a number of subnodes. That number of nodes of the same image type will also be created and associated with each primary node of that image type. Adjust the allocator to accommodate the expected loss of capacity associated with subnodes. If a node has subnodes, wait until they are all in the ready state before declaring a node ready. Change-Id: Ia4b315b1ed2999da96aab60c5c02ea2ce7667494	2014-03-27 12:57:40 -07:00
James E. Blair	30bf0ecb87	Add SubNodes and the ability to delete them There's no way to create subnodes yet. But this change introduces the class/table and a method to cleanly delete them if they exist. The actual server deletion and the waiting for that to complete are separated out in the provider manager, so that we can kick off a bunch of subnode deletes and then wait for them all to complete in one thread. All existing calls to cleanupServer are augmented with a new waitForServerDeletion call to handle the separation. Change-Id: Iba9d5a0a61cccc07d914e60a24777c6451dca7ea	2014-03-27 12:57:40 -07:00
Jenkins	14b308b20b	Merge "Include provider names in timeout messages"	2014-03-26 20:45:37 +00:00
Clark Boylan	841750fdea	Depend on hacking for its dependencies. flake8 does not pin its pep8 and pyflakes dependencies which makes it possible for new sets of rules to suddenly apply whenever either of those projects pushes a new release. Fix this by depending on hacking instead of flake8, pep8, and pyflakes directly. This will keep nodepool in sync with the rest of openstack even if it doesn't use the hacking rules (H*). Change-Id: Ice9198e9439ebcac15e76832835e78f72344425c	2014-03-26 12:39:41 -07:00
James E. Blair	fdc3616927	Include provider names in timeout messages Also, remove the extra "waiting for". Change-Id: I5842daecb6b193eb5d6d2d2662dfab89ac8f7344	2014-03-25 17:56:29 -07:00
Paul Belanger	e2fff8cd15	Set paramiko version > 1.9.0 Ubuntu 12.04 package version for paramiko is 1.7.7.1, which lacks the additional arguments for exec_command: TypeError: exec_command() got an unexpected keyword argument 'get_pty' 1.10.0 was the first version to add get_pty flag. Change-Id: I3b4d8a6d8a1d10ab002a79824feab8937d160244 Signed-off-by: Paul Belanger <paul.belanger@polybeacon.com>	2014-03-23 19:36:41 -04:00
James E. Blair	6ab771728e	Delete created keypairs if nova boot fails If the server fails to boot (perhaps due to quota issues) during an image update, if nodepool created a key for that server it won't be deleted because the existing keypair delete is done as part of deleting the server. Handle the special case of never having actually booted a server and delete the keypair explicitly in that case. Change-Id: I0607b77ef2d52cbb8a81feb5e9c502b080a51dbe	2014-03-20 10:24:33 -07:00
Joe Gordon	4cae69fd01	Raise min_demand due to slow node boot times Booting a node takes up to 16 minutes, so keep more nodes in ready state. Change-Id: I0ae647c658feffabc499c96b0a9ed11855202c4b	2014-03-17 14:28:04 -07:00
Jenkins	bd6f5cdf54	Merge "Roll up node stats"	2014-03-14 21:59:03 +00:00
Fengqian Gao	366746aeb0	Keep py3.X compatibility for urllib/urllib2 Use six.moves.urllib instead of urllib and six.moves.urllib.request instead of urllib2. Partial-Bug: #1280105 Change-Id: Id122f7be5aa3e0dd213bfa86f9be86d10d72b4a6	2014-02-25 16:51:03 +08:00
James E. Blair	7e0c42d035	Roll up node stats As the number of providers and targets grows, the number of stats that graphite has to sum in order to produce the summary graphs that we use grows. Instead of asking graphite to summarize something like 400 metrics (takes about 3 seconds), have nodepool directly produce the metrics that we are going to use. Change-Id: I2a7403af2512ace0cbe795f2ec17ebcd9b90dd09	2014-02-24 17:29:19 -08:00
Jenkins	839646ecbe	Merge "Add fedora support"	2014-02-24 21:49:37 +00:00
Jenkins	e8d67fead2	Merge "Retry ssh connections on auth failure."	2014-02-24 21:48:53 +00:00
Jenkins	dd51a2fb58	Merge "Preserve HOLD state when job starts."	2014-02-21 23:47:49 +00:00
James E. Blair	08a6254348	Keep current and previous snapshot images The previous logic around how to keep images was not accomplishing anything particularly useful. Instead: * Delete images that are not configured or have no corresponding base image. * Keep the current and previous READY images. * Otherwise, delete any images that have been in their current state for more than 8 hours. Also, correct the image-update command which no longer needs to join a thread. Also, fix up some poorly exercised parts of the fake provider. Change-Id: Iba921f26d971e56692b9104f9d7c531d955d17b4	2014-02-21 13:08:26 -08:00
Clark Boylan	2f6dfbd59b	Preserve HOLD state when job starts. Previously when a job started on a node nodepool always changed that node to state USED. Now preserve the old state value if it was previously HOLD. Change-Id: Ia4f736ae20b8b24ec079aa024fad404019725bcb	2014-02-20 17:45:49 -08:00
James E. Blair	f73676bc31	Fix more str!=int bugs When we return ids after creating servers or images, coerce them to strings so they compare correctly when we wait for them. Change-Id: I6d4575f9a392b6028bcec4ad57299b7f467cb764	2014-02-20 16:10:08 -08:00
James E. Blair	55be19f0f2	Remove unhelpful log message Since we're caching lists, we would actually expect a server not to appear within the first two iterations of the loop, so remove a log message that says it isn't there (which now shows up quite often). Change-Id: Ifbede6a141809e9fa40b910de2aabbd44f252fe5	2014-02-20 15:03:56 -08:00
James E. Blair	b818085c97	Coerce all ids from novaclient to str Nova might return int or str ids for various objects. Our db stores them all as strings (since that supports the superset). Since we convert all novaclient info into simply list/dict datastructures anyway, convert all the ids to str at the same time to make comparisons elsewhere easier. Change-Id: Ic90c07ec906865e975decee190c2e5a27ef7ef6d	2014-02-20 14:48:47 -08:00
James E. Blair	9bd9567ccb	Node deletion related fixes * Less verbose logging (these messages show up a lot). * Handle the case where a db record disappears during cleanup but before the individual node cleanup has started. * Fix a missed API cleanup where deleteNode was still being called with a node object instead of the id. Change-Id: I2025ff19a51cfacff64dd8345eaf120bf3473ac2	2014-02-20 14:33:47 -08:00
James E. Blair	34ce3e2da4	Use the task manager to get extensions and flavors Since these calls can now come from any thread, these should now be run through the task manager to serialize them and make sure that we don't run into novaclient thread-safety issues. Change-Id: I46ab44b93d56ad1ce289bf837511b9373d3284ee	2014-02-20 14:22:38 -08:00
James E. Blair	aa90b344da	Delete all building nodes on daemon start Nodepool can not (currently) resume building a node if the daemon is interrupted while that is happening. At least have it clean up nodes that are in the 'building' state when it starts. Change-Id: I66124c598b01919d3fd8b6158c482d65508c6dae	2014-02-20 09:07:12 -08:00
James E. Blair	ef799c4236	Perform all deletes in threads Have all node deletions (like node launches) handled by a thread, including ones started by the periodic cleanup. This will make the system more responsive when providers are unable to delete nodes quickly, as well as when large numbers of nodes are deleted by an operator, or when the system is restarted while many node deletions are in progress. Additionally, make the 'nodepool delete' command merely update the db with the expectation that the next run of the cleanup cron will spawn deletes. Add a '--now' option so that an operator may still delete nodes synchronously, for instance when the daemon is not running. Change-Id: I20ce2873172cb1906e7c5832ed2100e23f86e74e	2014-02-20 08:59:42 -08:00
James E. Blair	7e21d6a991	Check server status in batch And cache the results for 5 seconds. This should remove a huge amount of GETs for individual server status, instead replacing them with a single request that fetches the status of all servers. All individual build or delete threads will use the cached result from the most recent server list, and if it is out of date, they will trigger an update. The main benefit of using the individual requests was that they also provided progress information. Since that just goes into the logs and no one cares, we can certainly do without it. Also includes a minor constant replacement. Change-Id: I995c3f39e5c3cddc6f1b2ce6b91bcd178ef2fbb0	2014-02-20 08:59:34 -08:00
Derek Higgins	7afd8ff3b1	Add fedora support The previous loop iterating usernames doesn't work if another user is added, as we move on to the second username once ssh comes up and root fails to run a command, and then hit a timeout on the second user and never attempt the 3rd. Instead take an approach where we continue trying root until it either works or ssh succeeds but a command can't be run. At this stage we can try any other users that may be configured on the image, with a short timeout (as we know ssh has come up; if it hadn't, ssh'ing as root would have timed out). Change-Id: Id05aa186e8496d19f41d9c723260e2151deb45c9	2014-02-20 10:26:10 +00:00
Jenkins	1fb3147b17	Merge "Make nodepool more robust to offline clouds."	2014-02-20 01:26:53 +00:00
Robert Collins	9368953229	Make nodepool more robust to offline clouds. When a cloud is offline we cannot query its flavors or extensions, and without those we cannot use a provider manager. Making these attributes properties that lazy-initialize will fix the problem (we may make multiple queries, but it is idempotent so locking is not needed). Callers that trigger flavor or extension lookups have to be able to cope with a failure propagating up - I've manually found all the places I think. The catchall in _getFlavors would mask the problem and lead to the manager being incorrectly initialized, so I have removed that. Startup will no longer trigger cloud connections in the main thread, it will all be deferred to worker threads such as ImageUpdate, periodic check etc. Additionally I've added some belts-and-braces catches to the two key methods - launchImage and updateImage which while they don't directly interact with a provider manager do access the provider definition, which I think can lead to occasional skew between the DB and the configuration - I'm not /sure/ they are needed, but I'd rather be safe. Change-Id: I7e8e16d5d4266c9424e4c27ebcc36ed7738bc86f Fixes-Bug: #1281319	2014-02-19 17:12:33 -08:00
Dan Prince	1963731f7d	Retry ssh connections on auth failure. Some cloud instance types (Fedora for example) create the ssh user after sshd comes online. This allows our ssh connection retry loop to handle this scenario gracefully. Change-Id: Ie345dea50fc2983112cd2e72826a708583d2712a	2014-02-19 16:07:40 -05:00
James E. Blair	2d45897802	Fix typo in allocation This was causing the target distribution for all images to be calculated according to the current values of the last image through the previous loop. Change-Id: I3d5190c78849b77933e18c5cf6a9c7443945b6cd	2014-02-19 10:47:58 -08:00
Jenkins	eeffbc86ac	Merge "Allow usage of server IDs as well as names."	2014-02-19 01:47:52 +00:00
James E. Blair	91f7af41d3	Make jenkins get info task synchronous This is called within the main loop which means if there are a lot of jenkins tasks pending, the main loop waits until they are all finished. Instead, make this synchronous. It should be lightweight so the additional load on jenkins should be negligible. Also increase the sleep in the main loop to 10 seconds to mitigate this somewhat. Change-Id: I16e27fbd9bfef0617e35df08f4fd17bc2ead67b0	2014-02-18 16:57:40 -08:00
Bob Ball	c47547d321	Allow usage of server IDs as well as names. RAX have some hidden images useful for building a xenserver host. Ensure nodepool can use these by referring to them by image UUID. Change-Id: Idfefc60e762740f4cffa437933007942a0920970	2014-02-12 16:55:24 +00:00
James E. Blair	6878447060	Revert delete-rework branch Revert "Log state names not numbers" This reverts commit 1c7c954aab8022e786b74beb4013c6d402fddedb. Revert "Also log provider name when debugging deletes." This reverts commit 3c32477bc73839b946403adf0750f2e9b09ba855. Revert "Run per-provider cleanup threads." This reverts commit 3f2100cf6fab5c51e22f806060440182c24c50eb. Revert "Decouple cron names from config file names." This reverts commit 9505671e66a976e9b0ee8d13a9fb677a0409f39a. Revert "Move cron definition out of the inner loop." This reverts commit f91940bf31b9cf4c187980a76d49bd9955e2c53e. Revert "Move cron loading below provider loading." This reverts commit 84061a038c9f1684ba654d81ce675b0eb3b70957. Revert "Teach periodicCleanup how to do one provider." This reverts commit e856646bd7df2acf2883ba181b0d685b87249f37. Revert "Use the nonblocking cleanupServer." This reverts commit c2f854c99f9a983bcd6f1390f6c20838cf67d525. Revert "Split out the logic for deleting a nodedb node." This reverts commit 5a696ca99231bf72827503b06f08f0c3e91e8ae1. Revert "Make cleanupServer optionally nonblocking." This reverts commit 423bed124e90bc7773017a7884cc23c428f58265. Revert "Consolidate duplicate logging messages." This reverts commit b8a8736ac54dac25d89fd9e0f01eb64d1b035b78. Revert "Log how long nodes have been in DELETE state." This reverts commit daf427f463706e99a90e7d29c48c19566cc710f9. Revert "Cleanup nodes in state DELETE immediately." This reverts commit a84540f33214eb31228e964bb193402903568754. Change-Id: Iaf1eae8724f0a9fe1c14e70896fa699629624d28	2014-02-05 14:54:49 -08:00
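
The min-ready rule described in commit 1b904ec697 can be sketched as below. This is only an illustration of the behaviour stated in that commit message, not nodepool's actual code: the function name `snapshot_enabled`, the constant `DEFAULT_MIN_READY`, and the use of a plain dict for a label are all hypothetical. The commit message only mentions -1 as the disabling value; treating every negative value as disabled is an assumption.

```python
# Hypothetical sketch of the min-ready rule; not taken from nodepool.
DEFAULT_MIN_READY = 2  # default stated in the commit message


def snapshot_enabled(label):
    """Return True if a snapshot should be created for this label.

    `label` is assumed to be a dict parsed from the label's config.
    A missing 'min-ready' falls back to the default of 2.
    Assumption: any negative value (not only -1) disables the label.
    """
    min_ready = label.get("min-ready", DEFAULT_MIN_READY)
    return min_ready >= 0


if __name__ == "__main__":
    for cfg in ({}, {"min-ready": 0}, {"min-ready": -1}):
        print(cfg, snapshot_enabled(cfg))
```

With this rule, a label with min-ready set to 0 still gets a snapshot, so a node can be launched from it once a job is submitted, while -1 disables the label.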


2226 Commits