
..
 This work is licensed under a Creative Commons Attribution 3.0 Unported
 License.

 http://creativecommons.org/licenses/by/3.0/legalcode

==========================================
100 nodes support (fuel only)
==========================================

https://blueprints.launchpad.net/fuel/+spec/100-nodes-support

Fuel is an enterprise tool for deploying OpenStack, so it should be able to
deploy large clusters. Fuel should also be fast and responsive: it does not
run any processor-intensive tasks, so there is no reason for it to be slow.

Problem description
===================

* With a large number of nodes, Fuel (nailgun, astute) becomes slow.
* The probability of provisioning failures also increases.
* The MySQL DB works only in active/standby mode, which has very poor
  performance.

Proposed change
===============

For nailgun
-----------

As a first step, it is necessary to write tests that expose the parts of the
code that are not optimal. Some of the slow parts are already known.
These tests should include (all in fake mode):

* list 100 nodes
* get cluster with 100 nodes
* add 100 nodes to environment
* remove 100 nodes from environment
* run network verification for environment with 100 nodes
* change settings in environment with 100 nodes
* change network configuration in environment with 100 nodes
* run deploy in environment with 100 nodes
* run provision in environment with 100 nodes
* ...

To detect specific code that is slow, run all of the tests listed above,
measure their execution time, and compare it to the specification to see which
of them are actually slow.
Then run those operations under a profiler, and analyse and fix all
bottlenecks, non-optimal code, etc.
The following tools may be used to measure and profile the code:

* cProfile - Python module
* osprofiler - Python module
* rally - testing framework
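
The measure-then-profile approach described above can be sketched with the
standard library alone. This is a minimal illustration, not nailgun code:
``list_nodes`` is a hypothetical stand-in for a real operation, and
``TIME_LIMIT`` is an assumed threshold that does not come from this
specification.

```python
import cProfile
import io
import pstats
import time


def list_nodes(count):
    """Hypothetical stand-in for an operation such as listing 100 nodes."""
    return [{"id": i, "name": "node-%d" % i} for i in range(count)]


def timed(func, *args):
    """Call func once and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


def profile(func, *args, top=5):
    """Run func under cProfile and return (result, text report)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    # Sort by cumulative time so the most expensive call chains come first.
    stats.sort_stats("cumulative").print_stats(top)
    return result, stream.getvalue()


TIME_LIMIT = 1.0  # seconds; assumed threshold, not taken from the spec

# Step 1: measure and compare against the limit.
nodes, elapsed = timed(list_nodes, 100)
is_slow = elapsed > TIME_LIMIT

# Step 2: profile the operation to locate the bottleneck.
_, report = profile(list_nodes, 100)
```

In practice only the operations flagged as slow in the first step would need
to be profiled in the second.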

For fuelclient
--------------

There should not be any performance bottlenecks in fuelclient, because it
only parses JSON data. There should be tests for fuelclient, which should
at least include:

* list nodes
* add nodes to environment
* list environment with pending changes for 100 nodes
* upload nodes from disk

For astute
-----------

Testing astute is harder because it involves interaction with hardware
and other services such as cobbler, tftp, and dhcp. There is one known problem
which can be addressed now. The rest of the problems can be identified after
testing on real hardware.

One known problem is related to the network and storage capabilities of the
Fuel Master node. When 100 nodes simultaneously try to fetch images and
packages during provisioning, the Master node cannot handle the load. Astute
should detect this situation and handle it.
The user should also be able to tune astute manually, for example by
configuring it to provision 10 nodes at a time. This will increase
provisioning time but will make provisioning more resilient.
There should be a configuration option to set the number of nodes to deploy in
one run.
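
The batching behaviour described above could look roughly like the following
sketch. It is written in Python purely for illustration and is not astute
code; ``split_into_batches``, ``provision_in_batches`` and ``fake_provision``
are hypothetical names, and the batch size stands in for the proposed
configuration option.

```python
def split_into_batches(nodes, batch_size):
    """Split the node list into consecutive batches of at most batch_size."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [nodes[i:i + batch_size]
            for i in range(0, len(nodes), batch_size)]


def provision_in_batches(nodes, batch_size, provision_batch):
    """Provision one batch at a time and collect per-node statuses.

    provision_batch is a callable that takes a list of nodes and returns
    a dict mapping each node to its status (assumed interface).
    """
    results = {}
    for batch in split_into_batches(nodes, batch_size):
        # Only batch_size nodes fetch images and packages at once.
        results.update(provision_batch(batch))
    return results


def fake_provision(batch):
    """Hypothetical provisioner that always succeeds."""
    return {node: "provisioned" for node in batch}


node_names = ["node-%d" % i for i in range(100)]
statuses = provision_in_batches(node_names, 10, fake_provision)
```

With 100 nodes and a batch size of 10, the Master node serves at most 10
nodes at any moment, at the cost of running 10 sequential rounds.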

Currently, if provisioning fails on one of the nodes, astute stops the whole
process. This is not optimal for larger deployments. Some nodes may fail
because of random failures, and provisioning should still continue in this
case.
Provisioning will not be restarted for failed nodes. These nodes will be
removed from the cluster. The user can re-add these nodes to the cluster after
a successful deployment.
There should be a configuration option to set the percentage of nodes that can
fail during provisioning.
If, for example, all controllers fail to provision, provisioning should be
stopped.
The user should be notified about each failure.
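
One possible reading of these failure rules is sketched below in Python, for
illustration only. The function name, the ``max_failed_percent`` parameter,
the node dictionary shape and the treatment of ``controller`` as the only
critical role are all assumptions; the specification does not define them.

```python
def should_stop_provisioning(nodes, failed_ids, max_failed_percent,
                             critical_roles=("controller",)):
    """Decide whether provisioning should be aborted.

    nodes: list of dicts with "id" and "role" keys (assumed shape).
    failed_ids: ids of the nodes that failed to provision.
    max_failed_percent: hypothetical option, allowed share of failures.
    """
    if not nodes:
        return False
    failed = set(failed_ids)
    failed_percent = 100.0 * len(failed) / len(nodes)
    if failed_percent > max_failed_percent:
        return True
    for role in critical_roles:
        role_ids = {n["id"] for n in nodes if n["role"] == role}
        # Stop if every node holding a critical role has failed.
        if role_ids and role_ids <= failed:
            return True
    return False


# Example cluster: 3 controllers and 97 compute nodes.
cluster = ([{"id": i, "role": "controller"} for i in range(3)] +
           [{"id": i, "role": "compute"} for i in range(3, 100)])
```

Under these assumptions, five failed compute nodes with a 10 percent limit
let provisioning continue, while the loss of all three controllers stops it
regardless of the percentage.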

For UI
------

Our tests show that UI speed is acceptable for 100 nodes. In the future, for
1000 nodes, the UI will require some speed improvements.

For puppet manifests library
----------------------------

Configure HAProxy MySQL backends as active/active.
There is a patch https://review.openstack.org/#/c/124549/ addressing this
change, but it requires additional research and load testing.

Alternatives
------------

None

Data model impact
-----------------

Depends on the bottlenecks found, but changes are unlikely.

REST API impact
---------------

No API changes. All optimizations have to be backward compatible.

Upgrade impact
--------------

Only if the database is changed, which is unlikely.

Security impact
---------------

None

Notifications impact
--------------------

If any nodes fail, the user should be informed about it.

Other end user impact
---------------------

None

Performance Impact
------------------

After the blueprint is implemented, Fuel should be able to deploy 100 nodes.
Active/active load balancing for MySQL connections should improve DB
operations.

Other deployer impact
---------------------

Deployment rules will change: provisioning can now continue even if some
nodes fail.

Developer impact
----------------

None

Implementation
==============

Assignee(s)
-----------

Primary assignee:
  loles@mirantis.com
  ksambor@mirantis.com

Work Items
----------

The blueprint will be implemented in several stages:

* In the first stage, all tests will be written.
* In the next stage, all known and discovered bottlenecks will be fixed.
* After this, the tests will be run in a virtual environment that can create
  100 nodes.
* Finally, the tests will be run in a lab with 100 physical nodes. This test
  should reveal all astute bottlenecks.
* To prevent reintroducing bottlenecks in future releases, all tests
  will be integrated with our CI infrastructure.
* Additional integration with OSProfiler. It can help find bottlenecks
  in production systems.
* Additional integration with Rally. It will help to test Fuel in a real live
  environment.
* Additional Neutron load testing with Rally in HA for active/active MySQL.
  Even if active/active fails the testing, we could at least experiment with
  tuning the related parameters and share the results with the community.

Dependencies
============

None

Testing
=======

When all bottlenecks are fixed, load tests will be added to the CI
infrastructure so that non-optimal code is noticed immediately.

Documentation Impact
====================

Deployment rules will change, and this should be documented. New notifications
should be described. Active/active mode for MySQL should be documented.

References
==========

* https://github.com/stackforge/osprofiler
* https://github.com/stackforge/rally
* https://docs.google.com/a/mirantis.com/document/d/1GJHr4AHw2qA2wYgngoeN2C-6Dhb7wd1Nm1Q9lkhGCag
* https://docs.google.com/a/mirantis.com/document/d/1O2G-fTXlEWh0dAbRCtbrFtPVefc5GvEEOhgBIsU_eP0
* http://lists.openstack.org/pipermail/openstack-operators/2014-September/005162.html