From d5783f7a8ddd93869a5a08138bd9c08f3403b00a Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?=C5=81ukasz=20Ole=C5=9B?= <loles@mirantis.com>
Date: Fri, 5 Sep 2014 17:26:07 +0200
Subject: [PATCH] 100 nodes support

Change-Id: Ibb8fafac3b6def746f5b731031e652f819b92926
Blueprint: 100-nodes-support
---
 specs/6.0/100-nodes-support.rst | 219 ++++++++++++++++++++++++++++++++
 1 file changed, 219 insertions(+)
 create mode 100644 specs/6.0/100-nodes-support.rst

diff --git a/specs/6.0/100-nodes-support.rst b/specs/6.0/100-nodes-support.rst
new file mode 100644
index 00000000..9ddbcb69
--- /dev/null
+++ b/specs/6.0/100-nodes-support.rst
@@ -0,0 +1,219 @@
+..
+ This work is licensed under a Creative Commons Attribution 3.0 Unported
+ License.
+
+ http://creativecommons.org/licenses/by/3.0/legalcode
+
+==========================================
+100 nodes support (fuel only)
+==========================================
+
+https://blueprints.launchpad.net/fuel/+spec/100-nodes-support
+
+Fuel is an enterprise tool for deploying OpenStack, so it should be
+able to deploy large clusters. Fuel should also be fast and responsive.
+It does not run any processor-intensive tasks, so there is no reason
+for it to be slow.
+
+Problem description
+===================
+
+* For a large number of nodes, Fuel (nailgun, astute) becomes slow.
+* The probability of provisioning failures also increases.
+* MySQL DB works only as active/standby, which has very poor performance.
+
+Proposed change
+===============
+
+For nailgun
+-----------
+
+As a first step, it is necessary to write tests that reveal the places in
+the code that are not optimal. Some of the slow parts are already known.
+Such tests should include (all in fake mode):
+
+* list 100 nodes
+* get cluster with 100 nodes
+* add 100 nodes to environment
+* remove 100 nodes from environment
+* run network verification for environment with 100 nodes
+* change settings in environment with 100 nodes
+* change network configuration in environment with 100 nodes
+* run deploy in environment with 100 nodes
+* run provision in environment with 100 nodes
+* ...
+
+To detect specific code that is slow, it is necessary to run all of the
+tests listed above, measure their execution time, and compare it to the
+specification in order to see which of them are actually slow.
+Then run those operations under a profiler, and analyse and fix all
+bottlenecks, non-optimal code, etc.
+The following tools may be used to measure and profile the code:
+
+* cProfile - Python module
+* osprofiler - Python module
+* rally - testing framework
+
+For fuelclient
+--------------
+
+There should not be any performance bottlenecks in fuelclient, because it
+only parses JSON data. There should be tests for fuelclient, which should
+at least include:
+
+* list nodes
+* add nodes to environment
+* list environment with pending changes for 100 nodes
+* upload nodes from disk
+
+For astute
+-----------
+
+Testing astute is harder because it includes interaction with hardware
+and other services like cobbler, tftp, dhcp. There is one known problem
+which can be addressed now. The rest of the problems can be identified after
+testing on real hardware.
+
+One known problem is related to the network/storage capabilities of the Fuel
+Master node. When 100 nodes simultaneously try to fetch images and packages
+during provisioning, the Master node cannot handle such a high load. Astute
+should detect this situation and handle it.
+The user should also be able to manually tune how astute works, for example
+by configuring it to provision 10 nodes at a time. This will increase
+provisioning time but will make the process more resilient.
+A configuration option should set the number of nodes to deploy in one run.
+
+Currently, if provisioning fails on one of the nodes, astute stops
+the whole process. This is not an optimal solution for larger deployments.
+Some nodes may fail because of random failures; provisioning should still
+continue in this case.
+Provisioning will not be restarted for failed nodes. These nodes will be
+removed from the cluster. The user can re-add them to the cluster after a
+successful deployment.
+There should be a configuration option to set the percentage of nodes that can
+fail during provisioning.
+If, for example, all controllers fail to provision, provisioning
+should be stopped.
+The user should be notified about each failure.
+
+For UI
+------
+
+Our tests show that UI speed is acceptable for 100 nodes. In the future, for
+1000 nodes, it will require some speed improvements.
+
+For puppet manifests library
+----------------------------
+
+Configure HAProxy MySQL backends as active/active.
+There is a patch, https://review.openstack.org/#/c/124549/, addressing this
+change, but it requires additional research and load testing.
+
+Alternatives
+------------
+
+None
+
+Data model impact
+-----------------
+
+Depends on the bottlenecks found, but changes are unlikely.
+
+REST API impact
+---------------
+
+No API changes. All optimizations have to be backward compatible.
+
+Upgrade impact
+--------------
+
+Only if the database is changed, which is unlikely.
+
+Security impact
+---------------
+
+None
+
+Notifications impact
+--------------------
+
+If there are failed nodes, the user should be informed about them.
+
+Other end user impact
+---------------------
+
+None
+
+Performance Impact
+------------------
+
+After the blueprint is implemented, Fuel should be able to deploy 100 nodes.
+Active/active load balancing for MySQL connections should improve the
+performance of DB operations.
+
+Other deployer impact
+---------------------
+
+Rules will change: provisioning now continues even if some nodes fail.
+
+Developer impact
+----------------
+
+None
+
+Implementation
+==============
+
+Assignee(s)
+-----------
+
+Primary assignee:
+  loles@mirantis.com
+  ksambor@mirantis.com
+
+Work Items
+----------
+
+The blueprint will be implemented in several stages:
+
+* In the first stage, all tests will be written.
+* In the next stage, all known and discovered bottlenecks will be fixed.
+* After this, tests will be run in a virtual environment which can create
+  100 nodes.
+* At the end, tests will be run in a lab with 100 physical nodes. This test
+  should show us all astute bottlenecks.
+* To prevent reintroducing bottlenecks in future releases, all tests
+  will be integrated with our CI infrastructure.
+* Additional integration with OSProfiler. It can help find bottlenecks
+  in production systems.
+* Additional integration with Rally. It will help to test Fuel in a real, live
+  environment.
+* Additional Neutron load testing with Rally in HA for active/active MySQL.
+  Even if active/active fails the testing, we can at least experiment with
+  tuning the related parameters and provide some output to the community.
+
+Dependencies
+============
+
+None
+
+Testing
+=======
+
+When all bottlenecks are fixed, load tests will be added to the CI
+infrastructure, so that non-optimal code is noticed immediately.
+
+Documentation Impact
+====================
+
+Deployment rules will change; this should be documented. New notifications
+should be described. Active/active mode for MySQL should be documented.
+
+References
+==========
+
+* https://github.com/stackforge/osprofiler
+* https://github.com/stackforge/rally
+* https://docs.google.com/a/mirantis.com/document/d/1GJHr4AHw2qA2wYgngoeN2C-6Dhb7wd1Nm1Q9lkhGCag
+* https://docs.google.com/a/mirantis.com/document/d/1O2G-fTXlEWh0dAbRCtbrFtPVefc5GvEEOhgBIsU_eP0
+* http://lists.openstack.org/pipermail/openstack-operators/2014-September/005162.html