Merge "Document MaxFailPercentage"

2020-07-31 12:47:17 +00:00 · 2020-07-31 12:47:17 +00:00 · dff4356eea
parent 64314a02c1 35f32dc468
commit dff4356eea
2 changed files with 47 additions and 0 deletions
--- a/deploy-guide/source/features/index.rst
+++ b/deploy-guide/source/features/index.rst
@@ -20,6 +20,7 @@ Documentation on additional features for |project|.
   distributed_compute_node
   distributed_multibackend_storage
   extra_config
+   tolerated_failure
   high_availability
   instance_ha
   ipsec
--- a/deploy-guide/source/features/tolerated_failure.rst
+++ b/deploy-guide/source/features/tolerated_failure.rst
@@ -0,0 +1,46 @@
+Tolerate deployment failures
+============================
+
+Large-scale deployments frequently run into infrastructure problems such
+as network outages, wrong configurations applied on hardware, or hard
+drive issues.
+
+It is frustrating to deploy hundreds of nodes and have only a few of them
+fail. In most large-scale use cases, deployers do not care about
+these nodes, as long as the cloud can already be used with the successfully
+deployed servers.
+
+For that purpose, |project| lets you specify a percentage value,
+per role, that defines how many failures are tolerated.
+
+Example: we deploy 50 compute nodes with the role "Compute". With the
+following environment, the deployment will run to completion even if up to 5
+nodes fail to deploy::
+
+  parameter_defaults:
+    ComputeMaxFailPercentage: 10
+
+At the end of the deployment, a report is printed. If nodes failed to
+deploy, they are listed like this::
+
+  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+  ~~~~~~~~~~~~~~~~~~~~~~~~~~ State Information ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+  ~~~~~~~~~ Number of nodes which did not deploy successfully: 3 ~~~~~~~~~~~~~~
+   This or these node(s) failed to deploy: compute3, compute24, compute29
+  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+If one or more nodes failed to deploy, the tripleoclient return code
+won't be 0 and an error will be printed with a Python traceback. Very often
+the problem can be found in the Ansible logs by searching for the nodes which
+didn't deploy successfully.
+
+If you want to target all the compute nodes in your deployment and you have
+more than one role to deploy computes, then you'll probably want to allocate
+one value per role and distribute it based on your expectations and needs.
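+
+As a minimal sketch, assuming the parameter name always follows the
+``<RoleName>MaxFailPercentage`` pattern seen with ``ComputeMaxFailPercentage``,
+a deployment with two compute roles could set one value per role. The
+``ComputeHCI`` role and both percentages are illustrative assumptions::
+
+  parameter_defaults:
+    # Tolerate up to 10% failed nodes in the "Compute" role.
+    ComputeMaxFailPercentage: 10
+    # Hypothetical second compute role with a stricter tolerance.
+    ComputeHCIMaxFailPercentage: 5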
+
+.. Warning::
+
+   For now, this only works for the execution of the deployment steps
+   from config-download playbooks. Minor updates, major upgrades, fast-forward
+   upgrades and baremetal provisioning operations aren't supported yet, but
+   will certainly be in the future.