From d5783f7a8ddd93869a5a08138bd9c08f3403b00a Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?=C5=81ukasz=20Ole=C5=9B?=
Date: Fri, 5 Sep 2014 17:26:07 +0200
Subject: [PATCH] 100 nodes support

Change-Id: Ibb8fafac3b6def746f5b731031e652f819b92926
Blueprint: 100-nodes-support
---
 specs/6.0/100-nodes-support.rst | 219 ++++++++++++++++++++++++++++++++
 1 file changed, 219 insertions(+)
 create mode 100644 specs/6.0/100-nodes-support.rst

diff --git a/specs/6.0/100-nodes-support.rst b/specs/6.0/100-nodes-support.rst
new file mode 100644
index 00000000..9ddbcb69
--- /dev/null
+++ b/specs/6.0/100-nodes-support.rst
@@ -0,0 +1,219 @@
+..
+ This work is licensed under a Creative Commons Attribution 3.0 Unported
+ License.
+
+ http://creativecommons.org/licenses/by/3.0/legalcode
+
+==========================================
+100 nodes support (fuel only)
+==========================================
+
+https://blueprints.launchpad.net/fuel/+spec/100-nodes-support
+
+Fuel is an enterprise tool for deploying OpenStack, so it should be
+able to deploy large clusters. Fuel should also be fast and responsive:
+it does not run any processor-intensive tasks, so there is no reason
+for it to be slow.
+
+Problem description
+===================
+
+* For a large number of nodes, Fuel (nailgun, astute) becomes slow.
+* The probability of provisioning failures also increases.
+* MySQL DB works only as active/standby, which has very poor performance.
+
+Proposed change
+===============
+
+For nailgun
+-----------
+
+The first step is to write tests that expose the places in the code that
+are not optimal. Some of the slow parts are already known.
+Such tests should include (all in fake mode):
+
+* list 100 nodes
+* get cluster with 100 nodes
+* add 100 nodes to environment
+* remove 100 nodes from environment
+* run network verification for environment with 100 nodes
+* change settings in environment with 100 nodes
+* change network configuration in environment with 100 nodes
+* run deploy in environment with 100 nodes
+* run provision in environment with 100 nodes
+* ...
+
+To find the code that is actually slow, run all of the tests listed above,
+measure the execution time of each one, and compare it with the
+specification.
+Then run the operations under a profiler, and analyse and fix all
+bottlenecks, non-optimal code, etc.
+The following tools may be used to measure and profile the code:
+
+* cProfile - Python module
+* osprofiler - Python module
+* Rally - testing framework
+
+For fuelclient
+--------------
+
+There should not be any performance bottlenecks in fuelclient, because it
+only parses JSON data. There should still be tests for fuelclient, covering
+at least:
+
+* list nodes
+* add nodes to environment
+* list environment with pending changes for 100 nodes
+* upload nodes from disk
+
+For astute
+----------
+
+Testing astute is harder because it involves interaction with hardware
+and other services such as cobbler, tftp, and dhcp. One problem is already
+known and can be addressed now. The remaining problems can be identified
+after testing on real hardware.
+
+The known problem is related to the network and storage capabilities of the
+Fuel Master node. When 100 nodes simultaneously try to fetch images and
+packages during provisioning, the Master node cannot handle the load. Astute
+should detect this situation and handle it.
+The user should also be able to tune astute manually, for example to
+configure it to provision 10 nodes at a time. This will increase the total
+provisioning time but will make provisioning more resilient.
+A configuration option should set the number of nodes to deploy in one run.
+
+Currently, if provisioning fails on one of the nodes, astute stops
+the whole process. This is not an optimal solution for larger deployments.
+Some nodes may fail because of random failures, and provisioning should still
+continue in this case.
+Provisioning will not be restarted for failed nodes. These nodes will be
+removed from the cluster. The user can re-add them to the cluster after a
+successful deployment.
+There should be a configuration option to set the percentage of nodes that
+can fail during provisioning.
+However, if, for example, all controllers fail to provision, provisioning
+should be stopped.
+The user should be notified about each failure.
+
+For UI
+------
+
+Our tests show that UI speed is acceptable for 100 nodes. In the future, for
+1000 nodes, it will require some speed improvements.
+
+For puppet manifests library
+----------------------------
+
+Configure HAProxy MySQL backends as active/active.
+There is a patch https://review.openstack.org/#/c/124549/ addressing this
+change, but it requires additional research and load testing.
+
+Alternatives
+------------
+
+None
+
+Data model impact
+-----------------
+
+Depends on the bottlenecks found, but an impact is unlikely.
+
+REST API impact
+---------------
+
+No API changes. All optimizations have to be backward compatible.
+
+Upgrade impact
+--------------
+
+Only if the database is changed, which is unlikely.
+
+Security impact
+---------------
+
+None
+
+Notifications impact
+--------------------
+
+If any nodes fail, the user should be informed about it.
+
+Other end user impact
+---------------------
+
+None
+
+Performance Impact
+------------------
+
+After the blueprint is implemented, Fuel should be able to deploy 100 nodes.
+Active/active load balancing for MySQL connections should improve DB
+operations.
+
+Other deployer impact
+---------------------
+
+Deployment rules will change: some nodes are now allowed to fail.
+
+Developer impact
+----------------
+
+None
+
+Implementation
+==============
+
+Assignee(s)
+-----------
+
+Primary assignee:
+  loles@mirantis.com
+  ksambor@mirantis.com
+
+Work Items
+----------
+
+The blueprint will be implemented in several stages:
+
+* In the first stage, all tests will be written.
+* In the next stage, all known and discovered bottlenecks will be fixed.
+* After this, the tests will be run in a virtual environment that can create
+  100 nodes.
+* Finally, the tests will be run in a lab with 100 physical nodes. This test
+  should show us all astute bottlenecks.
+* To prevent bottlenecks from being reintroduced in later releases, all tests
+  will be integrated with our CI infrastructure.
+* Additional integration with OSProfiler. It can help find bottlenecks
+  in production systems.
+* Additional integration with Rally. It will help to test Fuel in a real-life
+  environment.
+* Additional Neutron load testing with Rally in HA for active/active MySQL.
+  Even if active/active fails the testing, we could at least experiment with
+  tuning the related parameters and provide some output to the community.
+
+Dependencies
+============
+
+None
+
+Testing
+=======
+
+When all bottlenecks are fixed, load tests will be added to the CI
+infrastructure, so that non-optimal code is noticed immediately.
+
+Documentation Impact
+====================
+
+Deployment rules will change, and this should be documented. New notifications
+should be described. Active/active mode for MySQL should be documented.
+
+References
+==========
+
+* https://github.com/stackforge/osprofiler
+* https://github.com/stackforge/rally
+* https://docs.google.com/a/mirantis.com/document/d/1GJHr4AHw2qA2wYgngoeN2C-6Dhb7wd1Nm1Q9lkhGCag
+* https://docs.google.com/a/mirantis.com/document/d/1O2G-fTXlEWh0dAbRCtbrFtPVefc5GvEEOhgBIsU_eP0
+* http://lists.openstack.org/pipermail/openstack-operators/2014-September/005162.html