Node reinstallation

implements blueprint mos-node-reinstallation Change-Id: I0fa9c39832a0e458fbb897456370872058197763
2015-06-05 21:24:59 +03:00 · 2015-06-05 21:24:59 +03:00 · a2f6232f2e
commit a2f6232f2e
parent b4b03b8730
1 changed files with 204 additions and 0 deletions
--- a/specs/7.0/node-reinstallation.rst
+++ b/specs/7.0/node-reinstallation.rst
@ -0,0 +1,204 @@
+..
+ This work is licensed under a Creative Commons Attribution 3.0 Unported
+ License.
+
+ http://creativecommons.org/licenses/by/3.0/legalcode
+
+==========================================
+MOS Node Reinstallation
+==========================================
+
+https://blueprints.launchpad.net/fuel/+spec/mos-node-reinstallation
+
+Node reinstallation allows fully/partially recover failed nodes
+using the standard fuel processes 'provision' and 'deploy'.
+Full reinstallation - purge all data from reinstalled node
+Partial reinstallation - some data can be preserved OS will be
+reinstalled.
+In case when only system should be reinstalled from scratch
+(partially) Partition Preservation feature should be enabled.
+
+(https://blueprints.launchpad.net/fuel/+spec/partition-preservation)
+
+Problem description
+===================
+
+Currently fuel does not fully support functioning node reinstallation.
+Slave nodes can't be restored after fail. Including but not limited to
+MongoDB failures, Galera failures, update failures, upgrade failures, etc.
+
+
+Proposed change
+===============
+
+Reinstallation feature includes multiple changes which should be implemented.
+
+
+* Partition Preservation (will be implemented separately)
+  (https://blueprints.launchpad.net/fuel/+spec/partition-preservation).
+
+* Node renaming (https://blueprints.launchpad.net/fuel/+spec/node-naming).
+
+* MongoDB recovery in case of failure (assumed should be fixed in 7.0).
+
+* Swift ring sync during redeploy (assumed should be fixed in 7.0).
+
+
+  Reinstallation process:
+
+  1) Nailgun shouldn't serialize recovering controller as primary.
+     Nailgun should always serialize recovering controller as regular
+     controller. Same is applied for other roles that have primary one.
+
+  2) Partition preservation manipulation should be prepared
+     before node will be reprovisioned.
+
+  3) The last step is provision and deploy.
+
+
+Alternatives
+------------
+
+None
+
+Data model impact
+-----------------
+
+None
+
+REST API impact
+---------------
+
+API part will not change. Reinstallation process will use standard
+API calls - provision and deploy
+
+API changes will be in partition preservation
+(https://blueprints.launchpad.net/fuel/+spec/partition-preservation).
+
+Node renaming
+(https://blueprints.launchpad.net/fuel/+spec/node-naming).
+
+
+Upgrade impact
+--------------
+
+None
+
+Security impact
+---------------
+
+None
+
+Notifications impact
+--------------------
+
+None
+
+Other end user impact
+---------------------
+
+None
+
+Performance Impact
+------------------
+
+Reinstallation process using partition preservation should improve
+deployment stage. Swift, Mysql, Mongodb services synchronization
+time should be shorter.
+In case compute node should be reinstalled using partition
+preservation method VM images migration not required.
+
+None
+
+Plugin impact
+-------------
+
+None
+
+Other deployer impact
+---------------------
+
+None
+
+Developer impact
+----------------
+
+None
+
+Implementation
+==============
+
+Assignee(s)
+-----------
+
+:Primary Assignee: Ivan Ponomarev
+
+:QA: Dmitriy Kruglov
+
+:Nandatory design review: Vladimir Kuklin
+
+
+Work Items
+----------
+
+#. Nailgun shouldn't serialize recovered controller as primary
+   Nailgun should be able reinstall slave node and using the same name
+   to return slave node back to the cluster.
+
+
+Dependencies
+============
+
+No strict dependencies
+
+Testing
+=======
+
+* Manual testing and acceptance criteria:
+
+ - It is possible to perform a full reinstallation (all data is purged) of a
+   failed slave node to recover  to previous working state
+ - It is possible to perform a partial reinstallation (some data is preserved)
+   of a failed slave node to recover to previous working state
+
+Scenarios to automate
+
+Reinstall single compute:
+
+1. Do reinstallation of the compute
+2. Run Network check
+3. Run OSTF tests set
+4. list nova services and verify that the 'nova-compute' service is enabled
+   and is running on the reinstalled node
+
+Reinstall single controller:
+
+1. Do reinstallation of the controller
+2. Run Network check
+3. Run OSTF tests set
+4. Verify that the reinstalled controller is in pacemaker cluster and has
+   'online' status
+5. Verify that the reinstalled controller is in rabbitmq cluster and running
+6. Verify that the reinstalled controller is in Halera cluster
+
+Reinstallation of full cluster:
+
+1. Do reinstallation of whole cluster
+2. Run Network check
+3. Run OSTF tests set
+4. Verify that the reinstalled controller is in pacemaker cluster and has
+   'online' status
+5. Verify that the reinstalled controller is in rabbitmq cluster and running
+6. Verify that the reinstalled controller is in Halera cluster
+7. list nova services and verify that the 'nova-compute' service is enabled
+
+
+
+Documentation Impact
+====================
+
+Reinstallation documentation will be added to the User Guide section
+
+References
+==========
+