100 nodes support
Change-Id: Ibb8fafac3b6def746f5b731031e652f819b92926
Blueprint: 100-nodes-support
parent
c4e758890a
commit
d5783f7a8d
219
specs/6.0/100-nodes-support.rst
Normal file
@ -0,0 +1,219 @@
..
 This work is licensed under a Creative Commons Attribution 3.0 Unported
 License.

 http://creativecommons.org/licenses/by/3.0/legalcode

==========================================
100 nodes support (fuel only)
==========================================

https://blueprints.launchpad.net/fuel/+spec/100-nodes-support

Fuel is an enterprise tool for deploying OpenStack, so it should be able to
deploy large clusters. Fuel should also be fast and responsive: it does not
run any processor-intensive tasks, so there is no reason for it to be slow.

Problem description
===================

* With a large number of nodes, Fuel (nailgun, astute) becomes slow.
* The probability of a provisioning failure also increases.
* The MySQL DB works only in active/standby mode, which performs very poorly.

Proposed change
===============

For nailgun
-----------

The first step is to write tests that reveal the places in the code that are
not optimal. Some of the slow parts are already known.
Such tests should include (all in fake mode):

* list 100 nodes
* get cluster with 100 nodes
* add 100 nodes to environment
* remove 100 nodes from environment
* run network verification for environment with 100 nodes
* change settings in environment with 100 nodes
* change network configuration in environment with 100 nodes
* run deploy in environment with 100 nodes
* run provision in environment with 100 nodes
* ...

To detect the specific code that is slow, run all of the tests above, measure
their execution time, and compare it to the specification to see which of them
are actually slow. Then run those operations under a profiler and analyse and
fix all bottlenecks, non-optimal code, etc.

The following tools may be used to measure and profile the code:

* cProfile - Python module
* osprofiler - Python module
* rally - testing framework
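
The measure-then-profile approach described above could be sketched roughly as
follows. This is only an illustration using the standard ``cProfile`` and
``pstats`` modules: ``list_nodes`` is a hypothetical stand-in for a nailgun
operation in fake mode, and ``TIME_BUDGET_SECONDS`` is an assumed value,
because this spec does not state concrete time limits.

```python
import cProfile
import io
import pstats
import time


def list_nodes(count):
    """Hypothetical stand-in for a nailgun operation run in fake mode."""
    return [{"id": i, "name": "node-%d" % i} for i in range(count)]


def measure(operation, *args):
    """Call the operation once and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = operation(*args)
    return result, time.perf_counter() - start


def profile(operation, *args, top=5):
    """Run the call under cProfile and return (result, text report).

    The report lists the `top` most expensive entries by cumulative time.
    """
    profiler = cProfile.Profile()
    result = profiler.runcall(operation, *args)
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    stats.sort_stats("cumulative").print_stats(top)
    return result, stream.getvalue()


# Assumed budget for illustration; the spec does not give a number.
TIME_BUDGET_SECONDS = 1.0

nodes, elapsed = measure(list_nodes, 100)
within_budget = elapsed <= TIME_BUDGET_SECONDS

# Only operations that exceed their budget need to be profiled in practice;
# here the profiler is run unconditionally to show the report format.
_, report = profile(list_nodes, 100)
```

A test that exceeds its budget points at an operation worth profiling, and the
report then shows where the time is spent.
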

For fuelclient
--------------

The fuelclient should not have any performance bottlenecks because it only
parses JSON data. There should still be tests for fuelclient, which should at
least include:

* list nodes
* add nodes to environment
* list environment with pending changes for 100 nodes
* upload nodes from disk

For astute
----------

Testing astute is harder because it involves interaction with hardware and
other services such as cobbler, tftp and dhcp. One known problem can be
addressed now. The remaining problems can be identified after testing on real
hardware.

The known problem is related to the network and storage capacity of the Fuel
Master node. When 100 nodes simultaneously try to fetch images and packages
during provisioning, the Master node cannot handle the load. Astute should
detect this situation and handle it.
The user should also be able to tune astute's behaviour manually, for example
by configuring it to provision 10 nodes at a time. This increases the
provisioning time but makes the process more resilient.
There should be a configuration option that sets the number of nodes to
deploy in one run.
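
The batching idea can be illustrated with a short sketch. It is written in
Python purely for illustration (astute itself is a separate service and would
need its own implementation). The option name ``MAX_NODES_PER_RUN`` and the
``provision_batch`` callable are hypothetical and are not defined by this spec.

```python
def batches(nodes, batch_size):
    """Yield consecutive slices of at most batch_size nodes."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(nodes), batch_size):
        yield nodes[start:start + batch_size]


def provision_in_batches(nodes, batch_size, provision_batch):
    """Provision nodes one batch at a time and collect the results.

    provision_batch is a hypothetical callable that provisions one batch
    and returns one result per node; the next batch starts only after the
    previous one has returned.
    """
    results = []
    for batch in batches(nodes, batch_size):
        results.extend(provision_batch(batch))
    return results


# Hypothetical option name; the value 10 is the example used in the text.
MAX_NODES_PER_RUN = 10

node_ids = list(range(100))
sizes = [len(batch) for batch in batches(node_ids, MAX_NODES_PER_RUN)]
```

With 100 nodes and a batch size of 10, the Master node serves at most 10
nodes at any moment, at the cost of running ten consecutive rounds.
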

Currently, if provisioning fails on one of the nodes, astute stops the whole
process. This is not optimal for larger deployments: some nodes may fail
because of random failures, and provisioning should still continue in that
case.
Provisioning will not be restarted for failed nodes. These nodes will be
removed from the cluster. The user can re-add them to the cluster after a
successful deployment.
There should be a configuration option that sets the percentage of nodes that
may fail during provisioning.
If, for example, all controllers fail to provision, provisioning should be
stopped.
The user should be notified about each failure.
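
One possible reading of these rules is sketched below, again in Python for
illustration only. The function name, the ``max_failed_percent`` parameter and
the node-to-role mapping are assumptions made for this example. The spec
mentions the "all controllers failed" case only as an example, so a real
implementation may need further stop conditions.

```python
def should_continue(nodes, failed_ids, max_failed_percent):
    """Decide whether provisioning may continue after some failures.

    nodes maps a node id to its role, e.g. {1: "controller", 2: "compute"}.
    failed_ids is an iterable of ids of nodes that failed to provision.
    """
    failed = set(failed_ids)
    if not failed:
        return True
    controllers = {nid for nid, role in nodes.items() if role == "controller"}
    # Stop if every controller has failed, regardless of the percentage.
    if controllers and controllers <= failed:
        return False
    failed_percent = 100.0 * len(failed) / len(nodes)
    return failed_percent <= max_failed_percent


# Example cluster: nodes 1-3 are controllers, nodes 4-100 are computes.
cluster = {i: ("controller" if i <= 3 else "compute") for i in range(1, 101)}

few_computes_failed = should_continue(cluster, [4, 5], 5)
all_controllers_failed = should_continue(cluster, [1, 2, 3], 5)
too_many_failed = should_continue(cluster, range(4, 14), 5)
```

In this example two failed compute nodes (2%) stay within a 5% limit, while
losing all three controllers or ten compute nodes (10%) stops provisioning.
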

For UI
------

Our tests show that UI speed is acceptable for 100 nodes. In the future,
supporting 1000 nodes will require some speed improvements.

For puppet manifests library
----------------------------

Configure the HAProxy MySQL backends as active/active.
There is a patch, https://review.openstack.org/#/c/124549/, that addresses
this change, but it requires additional research and load testing.

Alternatives
------------

None

Data model impact
-----------------

Depends on the bottlenecks found, but unlikely.

REST API impact
---------------

No API changes. All optimizations have to be backward compatible.

Upgrade impact
--------------

Only if the database is changed, which is unlikely.

Security impact
---------------

None

Notifications impact
--------------------

If any nodes fail, the user should be informed about it.

Other end user impact
---------------------

None

Performance Impact
------------------

After the blueprint is implemented, Fuel should be able to deploy 100 nodes.
Active/active load balancing for MySQL connections should improve DB
operations.

Other deployer impact
---------------------

Deployment rules will change: some nodes are now allowed to fail.

Developer impact
----------------

None

Implementation
==============

Assignee(s)
-----------

Primary assignee:
  loles@mirantis.com
  ksambor@mirantis.com

Work Items
----------

The blueprint will be implemented in several stages:

* In the first stage, all tests will be written.
* In the next stage, all known and discovered bottlenecks will be fixed.
* After this, the tests will be run in a virtual environment that can create
  100 nodes.
* At the end, the tests will be run in a lab with 100 physical nodes. This
  test should show us all astute bottlenecks.
* To prevent bottlenecks from being reintroduced in later releases, all tests
  will be integrated with our CI infrastructure.
* Additional integration with OSProfiler. It can help find bottlenecks in
  production systems.
* Additional integration with Rally. It will help to test Fuel in a real-life
  environment.
* Additional Neutron load testing with Rally in HA mode for active/active
  MySQL. Even if active/active fails the testing, we could at least experiment
  with tuning the related parameters and provide some output to the community.

Dependencies
============

None

Testing
=======

When all bottlenecks are fixed, load tests will be added to the CI
infrastructure so that non-optimal code is noticed immediately.

Documentation Impact
====================

Deployment rules will change, and this should be documented. New notifications
should be described. Active/active mode for MySQL should be documented.

References
==========

* https://github.com/stackforge/osprofiler
* https://github.com/stackforge/rally
* https://docs.google.com/a/mirantis.com/document/d/1GJHr4AHw2qA2wYgngoeN2C-6Dhb7wd1Nm1Q9lkhGCag
* https://docs.google.com/a/mirantis.com/document/d/1O2G-fTXlEWh0dAbRCtbrFtPVefc5GvEEOhgBIsU_eP0
* http://lists.openstack.org/pipermail/openstack-operators/2014-September/005162.html