The previous logic around how to keep images was not accomplishing
anything particularly useful.
Instead:
* Delete images that are not configured or have no corresponding
base image.
* Keep the current and previous READY images.
* Otherwise, delete any images that have been in their current
state for more than 8 hours (see the sketch below).
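A minimal sketch of these rules (the record shapes and helper
names here are hypothetical; the real code differs):

    import time

    KEEP_SECONDS = 8 * 60 * 60

    def should_delete_image(image, configured, has_base, ready_newest_first):
        # image: dict-like snapshot record with 'name' and 'state_time'.
        # configured: set of image names from the config file.
        # has_base: whether a corresponding base image still exists.
        # ready_newest_first: READY snapshots for this provider/image.
        if image['name'] not in configured or not has_base:
            return True
        if image in ready_newest_first[:2]:
            return False  # keep the current and previous READY images
        return time.time() - image['state_time'] > KEEP_SECONDS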
Also correct the image-update command, which no longer needs to
join a thread, and fix up some poorly exercised parts of the
fake provider.
Change-Id: Iba921f26d971e56692b9104f9d7c531d955d17b4
Previously, when a job started on a node, nodepool always changed
that node's state to USED. Now preserve the old state value if it
was previously HOLD.
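Roughly (a sketch using stand-ins for the real state constants):

    HOLD, USED = 'hold', 'used'  # stand-ins for nodepool's constants

    def set_node_used(node):
        # Preserve an operator hold; otherwise mark the node USED.
        if node.state != HOLD:
            node.state = USED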
Change-Id: Ia4f736ae20b8b24ec079aa024fad404019725bcb
When we return ids after creating servers or images, coerce them
to strings so they compare correctly when we wait for them.
Change-Id: I6d4575f9a392b6028bcec4ad57299b7f467cb764
Since we're caching lists, we would actually expect a server not
to appear within the first two iterations of the loop, so remove
a log message that says it isn't there (which now shows up quite
often).
Change-Id: Ifbede6a141809e9fa40b910de2aabbd44f252fe5
Nova might return int or str ids for various objects. Our db
stores them all as strings (since that supports the superset).
Since we convert all novaclient info into simple list/dict
data structures anyway, convert all the ids to str at the same
time to make comparisons elsewhere easier.
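For example (a sketch of such a conversion helper):

    def make_server_dict(server):
        # Flatten a novaclient Server into plain data, coercing the id
        # to str so it compares cleanly with values stored in the db.
        return {'id': str(server.id),
                'name': server.name,
                'status': server.status}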
Change-Id: Ic90c07ec906865e975decee190c2e5a27ef7ef6d
* Less verbose logging (these messages show up a lot).
* Handle the case where a db record disappears during cleanup but
before the individual node cleanup has started.
* Fix a missed API cleanup where deleteNode was still being
called with a node object instead of the id.
Change-Id: I2025ff19a51cfacff64dd8345eaf120bf3473ac2
Since these calls can now come from any thread, they should be run
through the task manager to serialize them and make sure that we
don't run into novaclient thread-safety issues.
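The pattern is roughly the following (a sketch, not nodepool's
actual TaskManager):

    import queue
    import threading

    class Task:
        def __init__(self, func):
            self.func = func
            self.done = threading.Event()
            self.result = None

        def run(self):
            self.result = self.func()
            self.done.set()

        def wait(self):
            self.done.wait()
            return self.result

    class TaskManager(threading.Thread):
        # All API calls run on this one thread, so the non-thread-safe
        # client object is never used concurrently.
        def __init__(self):
            super().__init__(daemon=True)
            self.tasks = queue.Queue()

        def run(self):
            while True:
                self.tasks.get().run()

        def submit(self, func):
            task = Task(func)
            self.tasks.put(task)
            return task.wait()  # caller blocks until its turn completes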
Change-Id: I46ab44b93d56ad1ce289bf837511b9373d3284ee
Nodepool cannot (currently) resume building a node if the daemon
is interrupted while that is happening. At least have it clean
up nodes that are in the 'building' state when it starts.
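Something like (hypothetical session API, stand-in constants):

    BUILDING, DELETE = 'building', 'delete'  # stand-ins for the real constants

    def cleanup_interrupted_builds(session):
        # Builds can't be resumed after a daemon restart, so flag any
        # node still in 'building' for deletion at startup.
        for node in session.getNodes():
            if node.state == BUILDING:
                node.state = DELETE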
Change-Id: I66124c598b01919d3fd8b6158c482d65508c6dae
Have all node deletions (like node launches) handled by a thread,
including ones started by the periodic cleanup. This will make
the system more responsive when providers are unable to delete
nodes quickly, as well as when large numbers of nodes are deleted
by an operator, or when the system is restarted while many node
deletions are in progress.
Additionally, make the 'nodepool delete' command merely update the
db with the expectation that the next run of the cleanup cron will
spawn deletes. Add a '--now' option so that an operator may still
delete nodes synchronously, for instance when the daemon is not
running.
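A sketch of the resulting CLI behaviour (method names here are
hypothetical):

    DELETE = 'delete'  # stand-in for the real state constant

    def delete(pool, node_id, now=False):
        # Default: just mark the node in the db; the next cleanup cron
        # run spawns a delete thread for it. With --now, delete
        # synchronously (e.g. when the daemon is not running).
        with pool.getDB().getSession() as session:
            node = session.getNode(node_id)
            if now:
                pool.deleteNode(session, node)
            else:
                node.state = DELETE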
Change-Id: I20ce2873172cb1906e7c5832ed2100e23f86e74e
And cache the results for 5 seconds. This should remove a huge
amount of GETs for individual server status, instead replacing
them with a single request that fetches the status of all servers.
All individual build or delete threads will use the cached result
from the most recent server list, and if it is out of date, they
will trigger an update.
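In outline (a sketch against a novaclient-style client):

    import threading
    import time

    class ServerListCache:
        # One 'list servers' call serves every build/delete thread; a
        # thread that finds the cache stale triggers the refresh itself.
        TTL = 5  # seconds

        def __init__(self, client):
            self._client = client
            self._lock = threading.Lock()
            self._servers = []
            self._fetched = 0.0

        def list(self):
            with self._lock:
                if time.time() - self._fetched > self.TTL:
                    self._servers = self._client.servers.list()
                    self._fetched = time.time()
                return self._servers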
The main benefit of using the individual requests was that they
also provided progress information. Since that just goes into
the logs and no one cares, we can certainly do without it.
Also includes a minor constant replacement.
Change-Id: I995c3f39e5c3cddc6f1b2ce6b91bcd178ef2fbb0
The previous loop iterating over usernames doesn't work if another
user is added: we move on to the second username once ssh comes up
and root fails to run a command, then hit a timeout on the second
user and never attempt the third.
Instead, take an approach where we continue trying root until it
either works or ssh succeeds but a command can't be run. At that
stage we can try any other users that may be configured on the
image, with a short timeout (since we know ssh has come up; if it
hadn't, ssh'ing as root would have timed out).
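A sketch of the revised loop (ssh_run() is a hypothetical helper
that raises ConnectionError while sshd is down, and CommandError
once connected if the command fails):

    import time

    class CommandError(Exception):
        pass

    def find_ssh_user(host, usernames, connect_timeout=600, short_timeout=60):
        deadline = time.monotonic() + connect_timeout
        while time.monotonic() < deadline:
            try:
                ssh_run(host, usernames[0], 'echo ready')
                return usernames[0]
            except ConnectionError:
                time.sleep(5)   # sshd not up yet; keep trying root
            except CommandError:
                break           # sshd is up but root can't run commands
        for username in usernames[1:]:
            try:
                ssh_run(host, username, 'echo ready', timeout=short_timeout)
                return username
            except (ConnectionError, CommandError):
                continue
        return None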
Change-Id: Id05aa186e8496d19f41d9c723260e2151deb45c9
When a cloud is offline we cannot query its flavors or extensions,
and without those we cannot use a provider manager. For these
attributes, making the properties lazy-initialize fixes the
problem (we may make multiple queries, but the query is
idempotent, so locking is not needed).
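The lazy-initialization pattern, sketched against a
novaclient-style client:

    class ProviderManager:
        def __init__(self, client):
            self._client = client
            self._flavors = None

        @property
        def flavors(self):
            # Lazy: an offline cloud no longer breaks construction, and
            # the lookup is idempotent, so no locking is needed even if
            # two threads race and both fetch.
            if self._flavors is None:
                self._flavors = self._client.flavors.list()  # may raise
            return self._flavors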
Callers that trigger flavor or extension lookups have to be able
to cope with a failure propagating up; I think I've manually found
all the places.
The catchall in _getFlavors would mask the problem and lead to
the manager being incorrectly initialized, so I have removed that.
Startup will no longer trigger cloud connections in the main
thread; that work is all deferred to worker threads such as
ImageUpdate, the periodic check, etc.
Additionally I've added some belts-and-braces catches to the two
key methods, launchImage and updateImage, which, while they don't
directly interact with a provider manager, do access the provider
definition; I think that can lead to occasional skew between the
DB and the configuration. I'm not /sure/ these catches are needed,
but I'd rather be safe.
Change-Id: I7e8e16d5d4266c9424e4c27ebcc36ed7738bc86f
Fixes-Bug: #1281319
Some cloud instance types (Fedora for example) create
the ssh user after sshd comes online. This allows
our ssh connection retry loop to handle this scenario
gracefully.
Change-Id: Ie345dea50fc2983112cd2e72826a708583d2712a
This was causing the target distribution for all images to be
calculated according to the values of the last image to pass
through the previous loop.
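The bug was the classic leaked-loop-variable pattern, roughly
(hypothetical names):

    images = ['precise', 'fedora']
    targets = ['jenkins01', 'jenkins02']

    for image in images:
        pass  # ... per-image work ...

    for target in targets:
        # BUG: 'image' still holds the last value from the loop above,
        # so every target was computed against that one image.
        print(target, image)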
Change-Id: I3d5190c78849b77933e18c5cf6a9c7443945b6cd
This is called within the main loop, which means that if there are
a lot of jenkins tasks pending, the main loop waits until they are
all finished. Instead, make this call synchronous. It should be
lightweight, so the additional load on jenkins should be negligible.
Also increase the sleep in the main loop to 10 seconds to mitigate
this somewhat.
Change-Id: I16e27fbd9bfef0617e35df08f4fd17bc2ead67b0
RAX have some hidden images useful for building a xenserver host.
Ensure nodepool can use these by referring to them by image UUID.
Change-Id: Idfefc60e762740f4cffa437933007942a0920970
Revert "Log state names not numbers"
This reverts commit 1c7c954aab8022e786b74beb4013c6d402fddedb.
Revert "Also log provider name when debugging deletes."
This reverts commit 3c32477bc73839b946403adf0750f2e9b09ba855.
Revert "Run per-provider cleanup threads."
This reverts commit 3f2100cf6fab5c51e22f806060440182c24c50eb.
Revert "Decouple cron names from config file names."
This reverts commit 9505671e66a976e9b0ee8d13a9fb677a0409f39a.
Revert "Move cron definition out of the inner loop."
This reverts commit f91940bf31b9cf4c187980a76d49bd9955e2c53e.
Revert "Move cron loading below provider loading."
This reverts commit 84061a038c9f1684ba654d81ce675b0eb3b70957.
Revert "Teach periodicCleanup how to do one provider."
This reverts commit e856646bd7df2acf2883ba181b0d685b87249f37.
Revert "Use the nonblocking cleanupServer."
This reverts commit c2f854c99f9a983bcd6f1390f6c20838cf67d525.
Revert "Split out the logic for deleting a nodedb node."
This reverts commit 5a696ca99231bf72827503b06f08f0c3e91e8ae1.
Revert "Make cleanupServer optionally nonblocking."
This reverts commit 423bed124e90bc7773017a7884cc23c428f58265.
Revert "Consolidate duplicate logging messages."
This reverts commit b8a8736ac54dac25d89fd9e0f01eb64d1b035b78.
Revert "Log how long nodes have been in DELETE state."
This reverts commit daf427f463706e99a90e7d29c48c19566cc710f9.
Revert "Cleanup nodes in state DELETE immediately."
This reverts commit a84540f33214eb31228e964bb193402903568754.
Change-Id: Iaf1eae8724f0a9fe1c14e70896fa699629624d28
This permits creating multiple cron entries for one config entry,
which per-provider cleanups require.
Change-Id: Ied9db4a512d8efef355e911d6e5630697d6c38c9
This makes creating multiple Cron entries for one config entry
cleaner, which we need in order to do per-provider cleanups.
Change-Id: Ic5fe8a57fec7aaca644da43f4c88209fbc5488dd
This permits (once the cron changes are wired in) per-provider
threads, rather than a single global thread.
Change-Id: Ie666d03ee9493373de494c3e6664e4e7bf144dcb
In this patch we change from doing one server deletion entirely at
a time to doing all the deletion requests first and then all the
was-it-deleted checks. This gives the cloud all our requested work
at once.
Note that we still do lookups from cleanupServer; future work might
tune this more to minimise them. The reason we need to do a scan after
submitting all the deletions is to avoid waiting the whole poll period
between periodic cleanups before we notice a not-quite-immediate
delete actually happened.
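In outline (delete_server() and server_exists() are hypothetical
helpers):

    def periodic_cleanup(servers):
        # Scatter: submit every deletion first so the cloud has all
        # the requested work at once.
        pending = [s['id'] for s in servers]
        for sid in pending:
            delete_server(sid)
        # Gather: sweep back over the requests so a fast delete is
        # noticed now rather than a whole poll period later; return
        # the ids still present for the next pass.
        return [sid for sid in pending if server_exists(sid)]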
Change-Id: I8f52b24ed911c1e6b8e04b8fe561100f8552fee6
In order to move to scatter-gather, we need to be able to run just the
tail of deleteNode independently.
Change-Id: If03753e98b423552c8fd6be4d2f5b6b543f3a1af
cleanupServer is used from both the CLI and the daemon, but in the
daemon we don't want to block - a single slow to delete server should
not block deleting all other servers. Instead we want to pass over all
the servers we have pending deletion, submitting them for deletion,
and after that loop back and check that they were in fact deleted. Or
not.
This change also fixes a minor bug where calling cleanupServer on a
server that was concurrently deleted from the provider (e.g. by the
CLI) would result in a traceback and deferral to the next loop of
periodic cleanup before the node record was deleted.
Change-Id: If40930238d82c1103b83fe7f46ef6f6e86efd624
The 900 second delay before deleting them from periodic cleanup just
causes excessive waste when we've restarted nodepool.
To avoid concurrent sqlalchemy access (particularly mutations) to
node objects racing with other per-node operations, this delegates
all deletions to the periodic cleanup.
This is done because with the high latency on API command submission
due to low rate limits with providers (e.g. 2 requests/second) it's
likely that a node deleted during busy periods will not have its API
request *submitted* for up to 5 minutes. As such, the 5 minute cleanup
thread will actually submit the API requests with little or no
increase in latency. And of course that frequency could be increased,
permitting faster cleanup in low-load situations.
Change-Id: Iad7004b0d3389df1e1b687209fd5f6ed27b03239
This is needed for Fedora - we could just force it on everywhere, or
someone could add glue to control it on a per-image basis. However,
anyone testing commands interactively will have a tty, so forcing it
on should lead to fewer surprises. I've made the patch expose the
option so that making it more specific in future is simple.
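With paramiko, for example, requesting a pty is a flag on
exec_command; a sketch of exposing that as an option (nodepool's
own ssh wrapper differs, this just shows the mechanism):

    import paramiko

    def ssh_exec(host, username, command, get_pty=True):
        # Force a pty by default (some images, e.g. Fedora, need one);
        # expose the flag so it can be made per-image later.
        client = paramiko.SSHClient()
        client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
        client.connect(host, username=username)
        _, stdout, _ = client.exec_command(command, get_pty=get_pty)
        return stdout.read()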
Change-Id: I41173e06201421d5d77d23155d925fe060e478fc
The debug output was wrong, showing first the ip and then
the node id.
The output is something like:
Node id: 192.168.1.1 is running, ip: 508, testing ssh
The correct output should be:
Node id: 508 is running, ip: 192.168.1.1, testing ssh
Change-Id: Ie6e630284cb0c3c5961deae579f35515edd57c3e
This reverts commit a01956ed70132029c4770ba4e7a3aafa4d14b4d1.
This outputs a huge amount of not very useful information.
Change-Id: Ie72a207ced8bb64ae1a01c88ca396f9df633e79c