..
 This work is licensed under a Creative Commons Attribution 3.0 Unported
 License.

 http://creativecommons.org/licenses/by/3.0/legalcode

==========================================
200 nodes support
==========================================

https://blueprints.launchpad.net/fuel/+spec/200-nodes-support

This blueprint continues the "100 nodes support" blueprint [1] from
release 6.0.

Problem description
===================

As the number of nodes grows, the probability that a node fails during the
provisioning or deployment stage increases. If nodes fail to provision, the
deployment cannot continue. For large environments, network verification also
takes a long time and may time out.

Proposed change
===============

For nailgun
-----------

In the previous release, some performance tests [2] were added to nailgun to
reveal bottlenecks, and the biggest issues were fixed. During this release
more tests will be added. For example:

Integration tests:

* add 100 nodes, deploy, add 100 nodes, deploy
* add 100 nodes, deploy cluster, stop deployment, deploy cluster

Unit tests:

* Tests for handler ProvisionSelectedNodes
* Tests for handler NodeGroupCollectionHandler
* Tests for handler NodeCollectionNICsDefaultHandler
* Check how NotificationCollectionHandler works with a large number of
  notifications

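As a rough illustration of the kind of load check listed above, the sketch
below times a listing operation over many notifications and fails when an
assumed time budget is exceeded. ``list_notifications``, the notification
count and the time budget are hypothetical stand-ins chosen for this example;
they are not taken from nailgun.

```python
import time
import unittest


def list_notifications(notifications, limit=None):
    """Hypothetical stand-in for the logic behind a collection handler."""
    if limit is not None:
        return notifications[:limit]
    return list(notifications)


class NotificationLoadTest(unittest.TestCase):
    # Both values are assumptions made for this sketch.
    NOTIFICATION_COUNT = 10000
    TIME_BUDGET_SECONDS = 1.0

    def test_listing_many_notifications_fits_time_budget(self):
        notifications = [
            {"id": i, "message": "node failed to provision"}
            for i in range(self.NOTIFICATION_COUNT)
        ]

        started = time.perf_counter()
        result = list_notifications(notifications)
        elapsed = time.perf_counter() - started

        # The handler must return everything and stay within the budget.
        self.assertEqual(len(result), self.NOTIFICATION_COUNT)
        self.assertLess(elapsed, self.TIME_BUDGET_SECONDS)
```

A real test would call the handler through the test client instead of a local
function; the point here is only the shape of a time-bounded assertion.
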
Execution of the ClusterChangeHandler handler, which takes too much time,
will be moved to the background because it is hard to optimize.

Graphs will be added to the CI job to show how performance changes between
commits.

For astute
----------

One known problem is related to the network and storage capacity of the Fuel
Master node. When 200 nodes simultaneously try to fetch images and packages
during provisioning, the Master node cannot handle the load. Astute should
detect this situation and handle it.
The user should also be able to tune Astute's behaviour manually, for example
by configuring it to provision 50 nodes at a time. This increases the
provisioning time but makes the process more robust.
There should be a configuration option that sets the number of nodes to
deploy in one run.

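The chunking idea can be sketched as follows. This is a minimal illustration
in Python, not Astute code: ``chunked``, ``provision_in_chunks`` and the
``provision_chunk`` callable are hypothetical names, and the chunk size of 50
is the example value mentioned above.

```python
from typing import Callable, Iterator, List, Sequence


def chunked(nodes: Sequence[str], chunk_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most ``chunk_size`` nodes."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be a positive integer")
    for start in range(0, len(nodes), chunk_size):
        yield list(nodes[start:start + chunk_size])


def provision_in_chunks(
    nodes: Sequence[str],
    chunk_size: int,
    provision_chunk: Callable[[List[str]], List[str]],
) -> List[str]:
    """Provision nodes one chunk at a time and collect the failed node ids.

    ``provision_chunk`` is a hypothetical callable that provisions a single
    chunk and returns the ids of the nodes that failed.
    """
    failed: List[str] = []
    for chunk in chunked(nodes, chunk_size):
        # Only one chunk loads the master node at any given moment.
        failed.extend(provision_chunk(chunk))
    return failed
```

With 200 nodes and a chunk size of 50, the master node serves four groups one
after another instead of all nodes at once.
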
Some nodes may fail because of random failures; provisioning should still
continue in this case.
Provisioning will not be restarted for failed nodes. These nodes will have
their status set to error. The user can re-provision these nodes after a
successful deployment.
There should be a configuration option that sets the percentage of nodes that
may fail during provisioning.
The user should be notified about each failure.
The same applies to the deployment stage.

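The failure threshold could be evaluated along these lines. This is a sketch
under assumptions: the function name and the ``max_failed_percent`` parameter
are hypothetical, and treating the limit as inclusive is a choice made for
this example rather than something the blueprint specifies.

```python
def provisioning_may_continue(
    total_nodes: int, failed_nodes: int, max_failed_percent: float
) -> bool:
    """Return True while the share of failed nodes is within the limit.

    The limit is treated as inclusive, which is an assumption of this sketch.
    """
    if total_nodes <= 0:
        raise ValueError("total_nodes must be positive")
    if not 0 <= failed_nodes <= total_nodes:
        raise ValueError("failed_nodes must be between 0 and total_nodes")

    failed_percent = 100.0 * failed_nodes / total_nodes
    return failed_percent <= max_failed_percent
```

For a 10-node cluster with a 10 percent limit, one failed node is tolerated
and a second failure stops the run.
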
Another problem is related to network verification, which takes a long time
for 100 nodes. Currently, connectivity between nodes is checked on one node
at a time. The check should be parallelized to make it faster, but it must
remain backward compatible.

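One way to parallelize the check while keeping the old behaviour available is
sketched below. It is an illustration only: ``verify_networks`` and the
``check_node`` callable are hypothetical, and using a default of one worker to
preserve sequential behaviour is an assumption about what backward
compatibility could mean here.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable


def verify_networks(
    nodes: Iterable[str],
    check_node: Callable[[str], bool],
    max_workers: int = 1,
) -> Dict[str, bool]:
    """Run ``check_node`` for every node, at most ``max_workers`` at a time.

    ``check_node`` is a hypothetical callable that returns True when the
    connectivity check passes for the given node. With ``max_workers=1`` the
    nodes are checked one at a time, as before.
    """
    node_list = list(nodes)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in the same order as node_list.
        results = list(pool.map(check_node, node_list))
    return dict(zip(node_list, results))
```

The result has the same shape for any worker count, so callers written for
the sequential check do not need to change.
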
Alternatives
------------

None

Data model impact
-----------------

Changes are unlikely, but this depends on the bottlenecks found.

REST API impact
---------------

No API changes. All optimizations must be backward compatible.

Upgrade impact
--------------

Only if the database is changed, which is unlikely.

Security impact
---------------

None

Notifications impact
--------------------

If any nodes fail, the user should be informed about it.

Other end user impact
---------------------

None

Performance Impact
------------------

After the blueprint is implemented, Fuel should be able to deploy 200 nodes.

Other deployer impact
---------------------

The deployment rules will change: some nodes are now allowed to fail.

Developer impact
----------------

None

Implementation
==============

Assignee(s)
-----------

Primary assignee:
  loles@mirantis.com

Work Items
----------

The blueprint will be implemented in several stages:

* Allow provisioning to run in chunks
* Improve network verification performance
* Allow some nodes to fail during provisioning and deployment
* Write new nailgun performance tests

Dependencies
============

None

Testing
=======

More load tests will be added to the CI infrastructure,
so that non-optimal code is noticed immediately.

Acceptance criteria
-------------------

* Nailgun performance jobs on CI are passing
* A 10-node cluster deployment succeeds even when one node fails to provision
* No more than 50 nodes are provisioned simultaneously when the default
  settings are used
* Network verification does not time out when testing 200 nodes

Documentation Impact
====================

Changes to provisioning and deployment should be documented.

References
==========

1. https://blueprints.launchpad.net/fuel/+spec/100-nodes-support
2. https://github.com/stackforge/fuel-web/tree/master/nailgun/nailgun/test/performance