100 nodes support

Change-Id: Ibb8fafac3b6def746f5b731031e652f819b92926
Blueprint: 100-nodes-support
2014-09-05 17:26:07 +02:00 · 2014-09-05 17:26:07 +02:00 · d5783f7a8d
commit d5783f7a8d
parent c4e758890a
1 changed files with 219 additions and 0 deletions
--- a/specs/6.0/100-nodes-support.rst
+++ b/specs/6.0/100-nodes-support.rst
@@ -0,0 +1,219 @@
 ..
 This work is licensed under a Creative Commons Attribution 3.0 Unported
 License.
 http://creativecommons.org/licenses/by/3.0/legalcode
 ==========================================
 100 nodes support (fuel only)
 ==========================================
 https://blueprints.launchpad.net/fuel/+spec/100-nodes-support
 Fuel is an enterprise tool for deploying OpenStack, so it should be able to
 deploy large clusters. Fuel should also be fast and responsive: it does not
 run any CPU-intensive tasks, so there is no reason for it to be slow.
 Problem description
 ===================
 * With a large number of nodes, Fuel (nailgun, astute) becomes slow.
 * The probability of provisioning failures also increases.
 * The MySQL DB works only in active/standby mode, which has very poor
   performance.
 Proposed change
 ===============
 For nailgun
 -----------
 As a first step, it is necessary to write tests that reveal the places in
 the code which are not optimal. Some of the slow parts are already known.
 Such tests should include (all in fake mode):
 * list 100 nodes
 * get cluster with 100 nodes
 * add 100 nodes to environment
 * remove 100 nodes from environment
 * run network verification for environment with 100 nodes
 * change settings in environment with 100 nodes
 * change network configuration in environment with 100 nodes
 * run deploy in environment with 100 nodes
 * run provision in environment with 100 nodes
 * ...
 To detect the specific code that is slow, run all of the tests listed above,
 measure their execution time, and compare it to the specification to see
 which of them are actually slow.
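The measure-and-compare step could be sketched roughly as follows. This is only an illustration: the budget values, the operation names and the `fake_list_nodes` stand-in are assumptions, since the spec does not state concrete time limits or test entry points.

```python
import time

# Hypothetical time budgets in seconds; the spec gives no concrete numbers.
BUDGETS = {"list_nodes": 0.5, "get_cluster": 0.5}


def measure(operation, repeat=5):
    """Return the best wall-clock time (seconds) over `repeat` calls."""
    best = float("inf")
    for _ in range(repeat):
        start = time.perf_counter()
        operation()
        best = min(best, time.perf_counter() - start)
    return best


def find_slow(operations, budgets):
    """Return {name: elapsed} for every operation that exceeds its budget."""
    slow = {}
    for name, operation in operations.items():
        elapsed = measure(operation)
        if elapsed > budgets[name]:
            slow[name] = elapsed
    return slow


def fake_list_nodes(count=100):
    # Stand-in for a request against nailgun running in fake mode.
    return [{"id": i, "status": "discover"} for i in range(count)]
```

Calling `find_slow({"list_nodes": fake_list_nodes}, BUDGETS)` returns an empty dict when every operation stays within its budget, so the same check could later serve as a pass/fail gate.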
 Run the operations under a profiler and then analyse and fix all bottlenecks,
 non-optimal code, etc.
 To measure and profile the code, the following tools may be used:
 * cProfile - Python module
 * osprofiler - Python module
 * rally - testing framework
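As an illustration of the first tool, a single operation can be run under the standard-library cProfile module as sketched below. The functions `build_node_list` and `serialize_node` are hypothetical stand-ins, not nailgun code; only the `cProfile` and `pstats` calls are real.

```python
import cProfile
import io
import pstats


def serialize_node(node_id):
    # Hypothetical serializer; real code would read from the database.
    return {"id": node_id, "name": "node-%d" % node_id, "roles": ["compute"]}


def build_node_list(count=100):
    # Stand-in for a handler that returns a list of 100 nodes.
    return [serialize_node(i) for i in range(count)]


def profile_call(func, *args, **kwargs):
    """Run `func` under cProfile; return (result, text report)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    # Show the ten entries with the highest cumulative time.
    stats.sort_stats("cumulative").print_stats(10)
    return result, stream.getvalue()
```

Printing the report returned by `profile_call(build_node_list, 100)` lists the most expensive calls first, which is usually enough to spot an obvious bottleneck before reaching for heavier tooling.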
 For fuelclient
 --------------
 There should not be any performance bottlenecks in fuelclient, because it
 only parses JSON data. Nevertheless, there should be tests for fuelclient,
 which should at least include:
 * list nodes
 * add nodes to environment
 * list environment with pending changes for 100 nodes
 * upload nodes from disk
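A client-side test of this kind mostly exercises JSON parsing and filtering. The sketch below shows the idea for the "pending changes" case; the payload shape and the `pending_addition` field are assumptions made for illustration, not a description of the actual API response.

```python
import json


def make_payload(count=100):
    # Hypothetical response body; field names are assumed for illustration.
    nodes = [
        {"id": i, "cluster": 1, "pending_addition": i % 2 == 0}
        for i in range(count)
    ]
    return json.dumps(nodes)


def pending_nodes(payload):
    """Parse a JSON node list and return ids of nodes with pending changes."""
    return [n["id"] for n in json.loads(payload) if n["pending_addition"]]
```

A test can then assert on both the result and the elapsed time for a 100-node payload.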
 For astute
 -----------
 Testing astute is harder because it includes interaction with hardware
 and other services like cobbler, tftp, dhcp. There is one known problem
 which can be addressed now. The rest of the problems can be identified after
 testing on real hardware.
 One known problem is connected with the network/storage capabilities of the
 Fuel Master node. During provisioning, 100 nodes simultaneously try to fetch
 images and packages, and the Master node cannot handle such a high load.
 Astute should detect this situation and handle it.
 The user should also be able to tune astute manually, for example to
 configure it to provision 10 nodes at a time. This will increase provisioning
 time but will make provisioning more resilient.
 There should be a configuration option to set the number of nodes to deploy
 in one run.
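The batching idea can be sketched as follows. This is a language-neutral illustration written in Python, not astute code; `batch_size` stands for the proposed configuration option and `provision` for whatever callable actually provisions a group of nodes, both of which are hypothetical names.

```python
def batches(nodes, batch_size):
    """Yield consecutive slices of `nodes` with at most `batch_size` items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(nodes), batch_size):
        yield nodes[start:start + batch_size]


def provision_in_batches(nodes, batch_size, provision):
    """Call `provision` on each batch in turn; return the number of batches."""
    count = 0
    for batch in batches(nodes, batch_size):
        # Each batch finishes before the next one starts, which caps the
        # number of nodes fetching images from the master node at once.
        provision(batch)
        count += 1
    return count
```

With 100 nodes and a batch size of 10, `provision` is called ten times, trading a longer total run for a lower peak load on the master node.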
 Currently, if provisioning fails on one of the nodes, astute stops the whole
 process. This is not an optimal solution for larger deployments. Some nodes
 may fail because of random failures; provisioning should still continue in
 this case.
 Provisioning will not be restarted for failed nodes. These nodes will be
 removed from the cluster. The user can re-add these nodes to the cluster
 after a successful deployment.
 There should be a configuration option to set the percentage of nodes which
 can fail during provisioning.
 If, for example, all controllers fail to provision, provisioning should be
 stopped.
 The user should be notified about each failure.
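The failure rules above could be combined into a single decision function, sketched below in Python. The node dictionary shape, the `max_failed_percent` option name and the exact comparison (strictly greater than the limit stops provisioning) are assumptions; the spec only states that such an option should exist and that losing all controllers must stop the run.

```python
def should_continue(nodes, failed_ids, max_failed_percent):
    """Decide whether provisioning may continue after some nodes failed.

    nodes: list of dicts with "id" and "roles" keys (assumed shape).
    failed_ids: set of ids of nodes that failed to provision.
    max_failed_percent: hypothetical configuration option, 0..100.
    """
    if not nodes:
        return True
    failed_percent = 100.0 * len(failed_ids) / len(nodes)
    if failed_percent > max_failed_percent:
        return False
    # Losing every controller makes the cluster unusable, so stop even if
    # the overall failure percentage is still within the limit.
    controllers = [n for n in nodes if "controller" in n["roles"]]
    if controllers and all(n["id"] in failed_ids for n in controllers):
        return False
    return True
```

For a 100-node cluster with three controllers and a 5 percent limit, two failed compute nodes let provisioning continue, while three failed controllers stop it even though only 3 percent of nodes failed.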
 For UI
 ------
 Our tests show that UI speed is acceptable for 100 nodes. In the future, for
 1000 nodes, it will require some speed improvements.
 For puppet manifests library
 ----------------------------
 Configure HAProxy MySQL backends as active/active.
 There is a patch, https://review.openstack.org/#/c/124549/, addressing this
 change, but it requires additional research and load testing.
 Alternatives
 ------------
 None
 Data model impact
 -----------------
 Depends on bottlenecks found, but unlikely.
 REST API impact
 ---------------
 No API changes. All optimizations have to be backward compatible.
 Upgrade impact
 --------------
 Only if the database is changed, which is unlikely.
 Security impact
 ---------------
 None
 Notifications impact
 --------------------
 If there are failed nodes, the user should be informed about them.
 Other end user impact
 ---------------------
 None
 Performance Impact
 ------------------
 After the blueprint is implemented, Fuel should be able to deploy 100 nodes.
 Active/active load balancing for MySQL connections should improve the
 performance of DB operations.
 Other deployer impact
 ---------------------
 Deployment rules will change: some nodes are now allowed to fail during
 provisioning without stopping the whole process.
 Developer impact
 ----------------
 None
 Implementation
 ==============
 Assignee(s)
 -----------
 Primary assignee:
  loles@mirantis.com
  ksambor@mirantis.com
 Work Items
 ----------
 The blueprint will be implemented in several stages:
 * In the first stage, all tests will be written.
 * In the next stage, all known and discovered bottlenecks will be fixed.
 * After this, the tests will be run in a virtual environment which can
   create 100 nodes.
 * At the end, the tests will be run in a lab with 100 physical nodes. This
   test should show us all astute bottlenecks.
 * To prevent reintroducing bottlenecks in later releases, all tests will be
   integrated with our CI infrastructure.
 * Additional integration with OSProfiler. It can help find bottlenecks in
   production systems.
 * Additional integration with Rally. It will help to test Fuel in a real,
   live environment.
 * Additional Neutron load testing with Rally in HA for active/active MySQL.
   Even if active/active fails the testing, we could at least experiment with
   tuning the related parameters and provide some output to the community.
 Dependencies
 ============
 None
 Testing
 =======
 When all bottlenecks are fixed, load tests will be added to the CI
 infrastructure, so that non-optimal code is noticed immediately.
 Documentation Impact
 ====================
 Deployment rules will change, and this should be documented. New
 notifications should be described. Active/active mode for MySQL should be
 documented.
 References
 ==========
 * https://github.com/stackforge/osprofiler
 * https://github.com/stackforge/rally
 * https://docs.google.com/a/mirantis.com/document/d/1GJHr4AHw2qA2wYgngoeN2C-6Dhb7wd1Nm1Q9lkhGCag
 * https://docs.google.com/a/mirantis.com/document/d/1O2G-fTXlEWh0dAbRCtbrFtPVefc5GvEEOhgBIsU_eP0
 * http://lists.openstack.org/pipermail/openstack-operators/2014-September/005162.html