Remove the blueprints from airship-in-a-bottle

Airship-specs is the new home for this sort of thing.

Change-Id: I9ef6b4c8892f215044e77af5a3dde18380d56e95
parent ed0d96cafd
commit 2f33e6c4b4
@@ -1,28 +0,0 @@
..
  Copyright 2018 AT&T Intellectual Property.
  All Rights Reserved.

  Licensed under the Apache License, Version 2.0 (the "License"); you may
  not use this file except in compliance with the License. You may obtain
  a copy of the License at

  http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
  WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
  License for the specific language governing permissions and limitations
  under the License.

.. _blueprints:

Blueprints
==========

Designs for features of the UCP.

.. toctree::
   :maxdepth: 2

   deployment-grouping-baremetal
   node-teardown
@@ -1,553 +0,0 @@
..
  Copyright 2018 AT&T Intellectual Property.
  All Rights Reserved.

  Licensed under the Apache License, Version 2.0 (the "License"); you may
  not use this file except in compliance with the License. You may obtain
  a copy of the License at

  http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
  WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
  License for the specific language governing permissions and limitations
  under the License.

.. _deployment-grouping-baremetal:

Deployment Grouping for Baremetal Nodes
=======================================
One of the primary functionalities of the Undercloud Platform is the deployment
of baremetal nodes as part of site deployment and upgrade. This blueprint aims
to define how deployment strategies can be applied to the workflow during these
actions.

Overview
--------
When Shipyard is invoked for a deploy_site or update_site action, there are
three primary stages:

1. Preparation and Validation
2. Baremetal and Network Deployment
3. Software Deployment

During the Baremetal and Network Deployment stage, the deploy_site or
update_site workflow (and perhaps other workflows in the future) invokes
Drydock to verify the site, prepare the site, prepare the nodes, and deploy the
nodes. Each of these steps is described in the `Drydock Orchestrator Readme`_.

.. _Drydock Orchestrator Readme: https://git.openstack.org/cgit/openstack/airship-drydock/plain/drydock_provisioner/orchestrator/readme.md

The prepare nodes and deploy nodes steps each involve intensive and potentially
time-consuming operations on the target nodes, orchestrated by Drydock and
MAAS. These steps need to be approached and managed such that grouping,
ordering, and criticality of success of nodes can be managed in support of
fault-tolerant site deployments and updates.

For the purposes of this document, `phase of deployment` refers to the prepare
nodes and deploy nodes steps of the Baremetal and Network Deployment stage.

Some factors that advise this solution:

1. Limits to the amount of parallelization that can occur due to a centralized
   MAAS system.
2. Faults in the hardware, preventing operational nodes.
3. Miswiring or misconfiguration of network hardware.
4. Incorrect site design causing a mismatch against the hardware.
5. Criticality of particular nodes to the realization of the site design.
6. Desired configurability within the framework of the UCP declarative site
   design.
7. Improved visibility into the current state of node deployment.
8. A desire to begin the deployment of nodes before the finish of the
   preparation of nodes -- i.e. start deploying nodes as soon as they are ready
   to be deployed. Note: this design will not achieve new forms of task
   parallelization within Drydock; that is recognized as desired future
   functionality.
Solution
--------
Updates supporting this solution will require changes to Shipyard for the
changed workflows, and to Drydock for the desired node targeting and for
retrieval of diagnostic and result information.

Deployment Strategy Document (Shipyard)
---------------------------------------
To accommodate the needed changes, this design introduces a new
DeploymentStrategy document into the site design, to be read and utilized
by the workflows for update_site and deploy_site.

Groups
~~~~~~
Groups are named sets of nodes that will be deployed together. The fields of a
group are:

name
  Required. The identifying name of the group.

critical
  Required. Indicates if this group is required to continue to additional
  phases of deployment.

depends_on
  Required, may be an empty list. Group names that must be successful before
  this group can be processed.

selectors
  Required, may be an empty list. A list of identifying information to indicate
  the nodes that are members of this group.

success_criteria
  Optional. Criteria that must evaluate to be true before a group is considered
  successfully complete with a phase of deployment.

Criticality
'''''''''''
- Field: critical
- Valid values: true | false

Each group is required to indicate true or false for the `critical` field.
This drives the behavior after the deployment of baremetal nodes. If any
group that is marked `critical: true` fails to meet that group's success
criteria, the workflow should halt after the deployment of baremetal nodes. A
group that cannot be processed due to a parent dependency failing will be
considered failed, regardless of the success criteria.

Dependencies
''''''''''''
- Field: depends_on
- Valid values: [] or a list of group names

Each group specifies a list of depends_on groups, or an empty list. All
identified groups must complete successfully for the phase of deployment before
the current group is allowed to be processed by the current phase.

- A failure (based on success criteria) of a group prevents any groups
  dependent upon the failed group from being attempted.
- Circular dependencies will be rejected as invalid during document validation.
- There is no guarantee of ordering among groups that have their dependencies
  met. Any group that is ready for deployment based on declared dependencies
  will execute. Execution of groups is serialized - two groups will not deploy
  at the same time.

Selectors
'''''''''
- Field: selectors
- Valid values: [] or a list of selectors

The list of selectors indicates the nodes that will be included in a group.
Each selector has four available filtering criteria: node_names, node_tags,
node_labels, and rack_names. Each selector is an intersection of its criteria,
while the list of selectors is a union of the individual selectors.

- Omitting a criterion from a selector, or using an empty list, means that
  criterion is ignored.
- Having a completely empty list of selectors, or a selector that has no
  criteria specified, indicates ALL nodes.
- A collection of selectors that results in no nodes being identified will be
  processed as if 100% of nodes successfully deployed (avoiding division by
  zero), but would fail a minimum or maximum nodes criterion (it still counts
  as 0 nodes).
- There is no validation against the same node being in multiple groups;
  however, the workflow will not resubmit nodes that have already completed or
  failed in this deployment to Drydock twice, since it keeps track of each node
  uniquely. The success or failure of those nodes excluded from submission to
  Drydock will still be used for the success criteria calculation.

E.g.::

  selectors:
    - node_names:
        - node01
        - node02
      rack_names:
        - rack01
      node_tags:
        - control
    - node_names:
        - node04
      node_labels:
        - ucp_control_plane: enabled

Will indicate (not really SQL, just for illustration)::

  SELECT nodes
  WHERE node_name in ('node01', 'node02')
        AND rack_name in ('rack01')
        AND node_tags in ('control')
  UNION
  SELECT nodes
  WHERE node_name in ('node04')
        AND node_label in ('ucp_control_plane: enabled')
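To make the intersection/union semantics concrete, the following is a minimal
Python sketch of the same logic. It assumes nodes are plain dictionaries with
``name``, ``rack``, ``tags``, and ``labels`` keys, and it treats multiple tags
in one selector as "the node must carry all of them"; these shapes and names
are illustrative assumptions, not Drydock's actual implementation.

.. code:: python

  def node_matches_selector(node, selector):
      """AND of the selector's criteria; omitted or empty criteria are ignored."""
      names = selector.get('node_names') or []
      racks = selector.get('rack_names') or []
      tags = selector.get('node_tags') or []
      labels = selector.get('node_labels') or []
      if names and node['name'] not in names:
          return False
      if racks and node['rack'] not in racks:
          return False
      if tags and not set(tags) <= set(node['tags']):
          return False
      # node_labels is a list of single-entry mappings, as in the example above
      for label in labels:
          for key, value in label.items():
              if node['labels'].get(key) != value:
                  return False
      return True


  def nodes_for_group(all_nodes, selectors):
      """OR across selectors; an empty selector list means ALL nodes."""
      if not selectors:
          return {node['name'] for node in all_nodes}
      return {node['name'] for node in all_nodes
              if any(node_matches_selector(node, s) for s in selectors)}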
Success Criteria
''''''''''''''''
- Field: success_criteria
- Valid values: for possible values, see below

Each group optionally contains success criteria, which are used to indicate if
the deployment of that group is successful. The values that may be specified:

percent_successful_nodes
  The calculated success rate of nodes completing the deployment phase.

  E.g.: 75 would mean that 3 of 4 nodes must complete the phase successfully.

  This is useful for groups that have larger numbers of nodes, and do not
  have critical minimums or are not sensitive to an arbitrary number of nodes
  not working.

minimum_successful_nodes
  An integer indicating how many nodes must complete the phase to be considered
  successful.

maximum_failed_nodes
  An integer indicating a number of nodes that are allowed to have failed the
  deployment phase and still consider that group successful.

When no criteria are specified, it means that no checks are done - processing
continues as if nothing is wrong.

When more than one criterion is specified, each is evaluated separately - if
any fail, the group is considered failed.
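A short sketch of this evaluation is shown below, under the assumption that the
workflow already has the set of nodes selected for the group and the set that
completed the phase successfully; the function name and signature are
illustrative only, not existing Shipyard code.

.. code:: python

  def group_phase_succeeded(success_criteria, selected_nodes, successful_nodes):
      """Apply each specified criterion separately; all present criteria must pass.

      With no nodes selected, the percentage counts as 100% (avoiding division
      by zero), but minimum_successful_nodes would still fail, matching the
      selector notes above.
      """
      if not success_criteria:
          return True  # no criteria: no checks are done
      total = len(selected_nodes)
      succeeded = len(set(successful_nodes) & set(selected_nodes))
      failed = total - succeeded

      pct_needed = success_criteria.get('percent_successful_nodes')
      if pct_needed is not None:
          pct = 100.0 if total == 0 else 100.0 * succeeded / total
          if pct < pct_needed:
              return False

      min_needed = success_criteria.get('minimum_successful_nodes')
      if min_needed is not None and succeeded < min_needed:
          return False

      max_failed = success_criteria.get('maximum_failed_nodes')
      if max_failed is not None and failed > max_failed:
          return False

      return True


  # Example: 3 of 4 nodes succeeded, 75% required -> True
  group_phase_succeeded({'percent_successful_nodes': 75},
                        ['n1', 'n2', 'n3', 'n4'], ['n1', 'n2', 'n3'])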
Example Deployment Strategy Document
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
This example shows a deployment strategy with 5 groups: control-nodes,
compute-nodes-1, compute-nodes-2, monitoring-nodes, and ntp-node.

::

  ---
  schema: shipyard/DeploymentStrategy/v1
  metadata:
    schema: metadata/Document/v1
    name: deployment-strategy
    layeringDefinition:
      abstract: false
      layer: global
    storagePolicy: cleartext
  data:
    groups:
      - name: control-nodes
        critical: true
        depends_on:
          - ntp-node
        selectors:
          - node_names: []
            node_labels: []
            node_tags:
              - control
            rack_names:
              - rack03
        success_criteria:
          percent_successful_nodes: 90
          minimum_successful_nodes: 3
          maximum_failed_nodes: 1
      - name: compute-nodes-1
        critical: false
        depends_on:
          - control-nodes
        selectors:
          - node_names: []
            node_labels: []
            rack_names:
              - rack01
            node_tags:
              - compute
        success_criteria:
          percent_successful_nodes: 50
      - name: compute-nodes-2
        critical: false
        depends_on:
          - control-nodes
        selectors:
          - node_names: []
            node_labels: []
            rack_names:
              - rack02
            node_tags:
              - compute
        success_criteria:
          percent_successful_nodes: 50
      - name: monitoring-nodes
        critical: false
        depends_on: []
        selectors:
          - node_names: []
            node_labels: []
            node_tags:
              - monitoring
            rack_names:
              - rack03
              - rack02
              - rack01
      - name: ntp-node
        critical: true
        depends_on: []
        selectors:
          - node_names:
              - ntp01
            node_labels: []
            node_tags: []
            rack_names: []
        success_criteria:
          minimum_successful_nodes: 1
The ordering of groups, as defined by the dependencies (``depends_on``
fields)::

     __________       __________________
    | ntp-node |     | monitoring-nodes |
     ----------       ------------------
         |
     ____V__________
    | control-nodes |
     ---------------
         |_________________________
         |                         |
     ____V____________       _____V___________
    | compute-nodes-1 |     | compute-nodes-2 |
     -----------------       -----------------

Given this, the order of execution could be:

- ntp-node > monitoring-nodes > control-nodes > compute-nodes-1 > compute-nodes-2
- ntp-node > control-nodes > compute-nodes-2 > compute-nodes-1 > monitoring-nodes
- monitoring-nodes > ntp-node > control-nodes > compute-nodes-1 > compute-nodes-2
- and many more ... the only guarantee is that ntp-node will run some time
  before control-nodes, which will run some time before both of the
  compute-nodes groups. Monitoring-nodes can run at any time.

Also of note are the various combinations of selectors and the varied use of
success criteria.

Deployment Configuration Document (Shipyard)
--------------------------------------------
The existing deployment-configuration document that is used by the workflows
will also be modified: its existing deployment_strategy field will provide
the name of the DeploymentStrategy document that will be used.

The default value for the name of the DeploymentStrategy document will be
``deployment-strategy``.

Drydock Changes
---------------

API and CLI
~~~~~~~~~~~
- A new API needs to be provided that accepts a node filter (i.e. a selector,
  above) and returns the list of node names that result from analysis of the
  design. Input to this API will also need to include a design reference.

- Drydock needs to provide a "tree" output of tasks rooted at the requested
  parent task. This will provide the needed success/failure status for nodes
  that have been prepared/deployed.
Documentation
~~~~~~~~~~~~~
Drydock documentation will be updated to match the introduction of the new
APIs.


Shipyard Changes
----------------

API and CLI
~~~~~~~~~~~
- The commit configdocs API will need to be enhanced to look up the
  DeploymentStrategy by using the DeploymentConfiguration.
- The DeploymentStrategy document will need to be validated to ensure there are
  no circular dependencies in the groups' declared dependencies (perhaps
  NetworkX_).
- A new API endpoint (and matching CLI) is desired to retrieve the status of
  nodes as known to Drydock/MAAS, including their MAAS status. The existing
  node list API in Drydock provides a JSON output that can be utilized for this
  purpose.

Workflow
~~~~~~~~
The deploy_site and update_site workflows will be modified to utilize the
DeploymentStrategy.

- The deployment configuration step will be enhanced to also read the
  deployment strategy and pass the information on a new XCom for use by the
  baremetal nodes step (see below).
- The prepare nodes and deploy nodes steps will be combined to perform both as
  part of the resolution of an overall ``baremetal nodes`` step.
  The baremetal nodes step will introduce functionality that reads in the
  deployment strategy (from the prior XCom), and can orchestrate the calls to
  Drydock to enact the grouping, ordering, and success evaluation.
  Note that Drydock will serialize tasks; there is no parallelization of
  prepare/deploy at this time.

Needed Functionality
''''''''''''''''''''

- function to formulate the ordered groups based on dependencies (perhaps
  NetworkX_; see the sketch after this list)
- function to evaluate success/failure against the success criteria for a group
  based on the result list of succeeded or failed nodes.
- function to mark groups as success or failure (including failed due to
  dependency failure), as well as keep track of the (if any) successful and
  failed nodes.
- function to get a group that is ready to execute, or 'Done' when all groups
  are either complete or failed.
- function to formulate the node filter for Drydock based on a group's
  selectors.
- function to orchestrate processing of groups, moving to the next group (or
  being done) when a prior group completes or fails.
- function to summarize the succeeded/failed nodes for a group (primarily for
  reporting to the logs at this time).
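The "perhaps NetworkX" suggestion can be sketched as follows: build a directed
graph from each group's ``depends_on`` list, reject cycles at validation time,
and produce one valid serialized ordering. This is a minimal illustration under
those assumptions; the function name and the choice to raise ``ValueError`` are
not prescribed by this design.

.. code:: python

  import networkx as nx


  def ordered_group_names(groups):
      """groups: the parsed DeploymentStrategy 'groups' list (name, depends_on, ...)."""
      graph = nx.DiGraph()
      for group in groups:
          graph.add_node(group['name'])
          for parent in group.get('depends_on', []):
              # edge parent -> child: the parent must finish before the child starts
              graph.add_edge(parent, group['name'])
      if not nx.is_directed_acyclic_graph(graph):
          raise ValueError('Circular dependency among DeploymentStrategy groups')
      # Any topological order is acceptable; group execution is serialized anyway.
      return list(nx.topological_sort(graph))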
Process
'''''''
The baremetal nodes step (preparation and deployment of nodes) will proceed as
follows:

1. Each group's selector will be sent to Drydock to determine the list of
   nodes that are a part of that group.

   - An overall status will be kept for each unique node (not started |
     prepared | success | failure).
   - When sending a task to Drydock for processing, the nodes associated with
     that group will be sent as a simple `node_name` node filter. This will
     allow for this list to exclude nodes that have a status that is not
     congruent for the task being performed.

     - prepare nodes valid status: not started
     - deploy nodes valid status: prepared

2. In a processing loop, groups that are ready to be processed based on their
   dependencies (and the success criteria of the groups they are dependent
   upon) will be selected for processing until there are no more groups that
   can be processed. The processing will consist of preparing and then
   deploying the group.

   - The selected group will be prepared and then deployed before selecting
     another group for processing.
   - Any nodes that failed as part of that group will be excluded from
     subsequent deployment or preparation of that node for this deployment.

     - Excluding nodes that are already processed addresses groups that have
       overlapping lists of nodes due to the groups' selectors, and prevents
       sending them to Drydock for re-processing.
     - Evaluation of the success criteria will use the full set of nodes
       identified by the selector. This means that if a node was previously
       successfully deployed, that same node will count as "successful" when
       evaluating the success criteria.

   - The success criteria will be evaluated after the group's prepare step and
     after its deploy step. A failure to meet the success criteria in a prepare
     step will cause the deploy step for that group to be skipped (and marked
     as failed).
   - Any nodes that fail during the prepare step will not be used in the
     corresponding deploy step.
   - Upon completion (success, partial success, or failure) of a prepare step,
     the nodes that were sent for preparation will be marked in the unique list
     of nodes (above) with their appropriate status: prepared or failure.
   - Upon completion of a group's deployment step, the node statuses will be
     updated to their current status: success or failure.

3. Before the end of the baremetal nodes step, following all eligible group
   processing, a report will be logged to indicate the success/failure of
   groups and the status of the individual nodes. Note that it is possible for
   individual nodes to be left in `not started` state if they were only part of
   groups that were never allowed to process due to dependencies and success
   criteria.

4. At the end of the baremetal nodes step, any failure (due to timeout,
   dependency failure, or success criteria failure) of a group marked as
   critical will trigger an Airflow Exception, resulting in a failed
   deployment.

Notes:

- The timeout values specified for the prepare nodes and deploy nodes steps
  will be used to put bounds on the individual calls to Drydock. A failure
  based on these values will be treated as a failure for the group; we need to
  be vigilant about whether this can lead to indeterminate states for nodes
  that interfere with further processing (e.g. a task timed out, but the
  requested work still continued to completion).
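As an illustration of the "group that is ready to execute" selection used by
the processing loop in step 2 above, the sketch below works purely on group
names and a result map, and also marks groups failed when a parent fails, per
the dependency rules. The function name and data shapes are placeholders, not
existing Shipyard code.

.. code:: python

  def next_ready_group(groups, results):
      """Return a group whose dependencies all succeeded, or None when done.

      `results` maps group name -> 'succeeded' | 'failed'. Groups whose parents
      failed are marked failed immediately (failed due to dependency) and are
      never returned for processing.
      """
      progressed = True
      while progressed:
          progressed = False
          for group in groups:
              if group['name'] in results:
                  continue  # already processed or already marked failed
              parents = group.get('depends_on', [])
              if any(results.get(p) == 'failed' for p in parents):
                  results[group['name']] = 'failed'
                  progressed = True
              elif all(results.get(p) == 'succeeded' for p in parents):
                  return group
      return None  # 'Done': every group is either complete or failed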
Example Processing
''''''''''''''''''
Using the defined deployment strategy in the above example, the following is
an example of how it may process::

  Start
    |
    | prepare ntp-node <SUCCESS>
    | deploy ntp-node <SUCCESS>
    V
    | prepare control-nodes <SUCCESS>
    | deploy control-nodes <SUCCESS>
    V
    | prepare monitoring-nodes <SUCCESS>
    | deploy monitoring-nodes <SUCCESS>
    V
    | prepare compute-nodes-2 <SUCCESS>
    | deploy compute-nodes-2 <SUCCESS>
    V
    | prepare compute-nodes-1 <SUCCESS>
    | deploy compute-nodes-1 <SUCCESS>
    |
  Finish (success)

If there were a failure in preparing the ntp-node, the following would be the
result::

  Start
    |
    | prepare ntp-node <FAILED>
    | deploy ntp-node <FAILED, due to prepare failure>
    V
    | prepare control-nodes <FAILED, due to dependency>
    | deploy control-nodes <FAILED, due to dependency>
    V
    | prepare monitoring-nodes <SUCCESS>
    | deploy monitoring-nodes <SUCCESS>
    V
    | prepare compute-nodes-2 <FAILED, due to dependency>
    | deploy compute-nodes-2 <FAILED, due to dependency>
    V
    | prepare compute-nodes-1 <FAILED, due to dependency>
    | deploy compute-nodes-1 <FAILED, due to dependency>
    |
  Finish (failed due to critical group failed)

If a failure occurred during the deploy of compute-nodes-2, the following would
result::

  Start
    |
    | prepare ntp-node <SUCCESS>
    | deploy ntp-node <SUCCESS>
    V
    | prepare control-nodes <SUCCESS>
    | deploy control-nodes <SUCCESS>
    V
    | prepare monitoring-nodes <SUCCESS>
    | deploy monitoring-nodes <SUCCESS>
    V
    | prepare compute-nodes-2 <SUCCESS>
    | deploy compute-nodes-2 <FAILED>
    V
    | prepare compute-nodes-1 <SUCCESS>
    | deploy compute-nodes-1 <SUCCESS>
    |
  Finish (success with some nodes/groups failed)

Schemas
~~~~~~~
A new schema will need to be provided by Shipyard to validate the
DeploymentStrategy document.

Documentation
~~~~~~~~~~~~~
The Shipyard action documentation will need to include details defining the
DeploymentStrategy document (mostly as defined here), as well as the update to
the DeploymentConfiguration document to contain the name of the
DeploymentStrategy document.


.. _NetworkX: https://networkx.github.io/documentation/networkx-1.9/reference/generated/networkx.algorithms.dag.topological_sort.html
@@ -1,559 +0,0 @@
..
  Copyright 2018 AT&T Intellectual Property.
  All Rights Reserved.

  Licensed under the Apache License, Version 2.0 (the "License"); you may
  not use this file except in compliance with the License. You may obtain
  a copy of the License at

  http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
  WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
  License for the specific language governing permissions and limitations
  under the License.

.. _node-teardown:

Undercloud Node Teardown
========================

When redeploying a physical host (server) using the Undercloud Platform (UCP),
it is necessary to trigger a sequence of steps to prevent undesired behaviors
when the server is redeployed. This blueprint intends to document the
interaction that must occur between UCP components to tear down a server.
Overview
--------
Shipyard is the entrypoint for UCP actions, including the need to redeploy a
server. The first part of redeploying a server is the graceful teardown of the
software running on the server; specifically, Kubernetes and etcd are of
critical concern. It is the duty of Shipyard to orchestrate the teardown of the
server, followed by steps to deploy the desired new configuration. This design
covers only the first portion - node teardown.

Shipyard node teardown Process
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#. (Existing) Shipyard receives a request to redeploy_server, specifying a
   target server.
#. (Existing) Shipyard performs preflight, design reference lookup, and
   validation steps.
#. (New) Shipyard invokes Promenade to decommission the node.
#. (New) Shipyard invokes Drydock to destroy the node - setting a node
   filter to restrict to a single server.
#. (New) Shipyard invokes Promenade to remove the node from the Kubernetes
   cluster.

Assumption:
The node_id is the hostname of the server, and is also the identifier that both
Drydock and Promenade use to identify the appropriate parts - hosts and k8s
nodes. This convention is set by the join script produced by Promenade.

Drydock Destroy Node
--------------------
The API/interface for destroy node already exists. The implementation within
Drydock needs to be developed. This interface will need to accept both the
specified node_id and the design_id to retrieve from Deckhand.

Using the provided node_id (hardware node) and the design_id, Drydock will
reset the hardware to a re-provisionable state.

By default, all local storage should be wiped (per datacenter policy for
wiping before re-use).

An option to allow for only the OS disk to be wiped should be supported, such
that other local storage is left intact and could be remounted without data
loss, e.g.: --preserve-local-storage

The target node should be shut down.

The target node should be removed from the provisioner (e.g. MAAS).

Responses
~~~~~~~~~
The responses from this functionality should follow the pattern set by prepare
nodes and other Drydock functionality. The Drydock status responses used for
all async invocations will be utilized for this functionality.
Promenade Decommission Node
---------------------------
Performs steps that will result in the specified node being cleanly
disassociated from Kubernetes, and ready for the server to be destroyed.
Users of the decommission node API should be aware of the long timeouts that
may occur while awaiting Promenade's completion of the appropriate steps.
At this time, Promenade is a stateless service and doesn't use any database
storage. As such, requests to Promenade are synchronous.

.. code:: json

  POST /nodes/{node_id}/decommission

  {
    rel : "design",
    href: "deckhand+https://{{deckhand_url}}/revisions/{{revision_id}}/rendered-documents",
    type: "application/x-yaml"
  }

Such that the design reference body is the design indicated when the
redeploy_server action is invoked through Shipyard.

Query Parameters:

- drain-node-timeout: A whole number timeout in seconds to be used for the
  drain node step (default: none). In the case of no value being provided,
  the drain node step will use its default.
- drain-node-grace-period: A whole number in seconds indicating the
  grace period that will be provided to the drain node step (default: none).
  If no value is specified, the drain node step will use its default.
- clear-labels-timeout: A whole number timeout in seconds to be used for the
  clear labels step (default: none). If no value is specified, clear labels
  will use its own default.
- remove-etcd-timeout: A whole number timeout in seconds to be used for the
  remove etcd from nodes step (default: none). If no value is specified,
  remove-etcd will use its own default.
- etcd-ready-timeout: A whole number in seconds indicating how long the
  decommission node request should allow for etcd clusters to become stable
  (default: 600).
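To illustrate how a workflow step might drive the API defined above, here is a
hedged sketch using the requests library. The path, body, and query parameters
come from this blueprint; the Promenade service URL, the function name, and the
outer client timeout are assumptions for illustration only.

.. code:: python

  import requests

  PROMENADE_URL = 'http://promenade-api.ucp.svc.cluster.local'  # assumed URL


  def decommission_node(node_id, design_href, drain_node_timeout=3600):
      """Synchronously ask Promenade to decommission a node (long-running call)."""
      response = requests.post(
          '{}/nodes/{}/decommission'.format(PROMENADE_URL, node_id),
          params={
              'drain-node-timeout': drain_node_timeout,
              'etcd-ready-timeout': 600,
          },
          json={
              'rel': 'design',
              'href': design_href,  # deckhand+https://.../rendered-documents
              'type': 'application/x-yaml',
          },
          timeout=7200,  # the request is synchronous; allow for the long steps
      )
      response.raise_for_status()
      return response.json()  # a UCP Status response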
Process
~~~~~~~
Acting upon the node specified by the invocation and the design reference
details:

#. Drain the Kubernetes node.
#. Clear the Kubernetes labels on the node.
#. Remove etcd nodes from their clusters (if impacted).

   - If the node being decommissioned contains etcd nodes, Promenade will
     attempt to gracefully have those nodes leave the etcd cluster.

#. Ensure that etcd cluster(s) are in a stable state.

   - Polls for status every 30 seconds up to the etcd-ready-timeout, or until
     the cluster meets the defined minimum functionality for the site.
   - A new document, promenade/EtcdClusters/v1, will specify details about
     the etcd clusters deployed in the site, including: identifiers,
     credentials, and thresholds for minimum functionality.
   - This process should ignore the node being torn down in any calculation
     of health.

#. Shut down the kubelet.

   - If this is not possible because the node is in a state of disarray such
     that it cannot schedule the daemonset to run, this step may fail, but
     should not hold up the process, as the Drydock dismantling of the node
     will shut the kubelet down.

Responses
~~~~~~~~~
All responses will be in the form of the UCP Status response.

- Success: Code: 200, reason: Success

  Indicates that all steps were successful.

- Failure: Code: 404, reason: NotFound

  Indicates that the target node is not discoverable by Promenade.

- Failure: Code: 500, reason: DisassociateStepFailure

  The details section should detail the successes and failures further. Any
  4xx series errors from the individual steps would manifest as a 500 here.
Promenade Drain Node
--------------------
Drains the Kubernetes node for the target node. This will ensure that this node
is no longer the target of any pod scheduling, and evicts or deletes the
running pods. In the case of nodes running DaemonSet-managed pods, or pods
that would prevent a drain from occurring, Promenade may be required to provide
the `ignore-daemonsets` option or `force` option to attempt to drain the node
as fully as possible.

By default, the drain node will utilize a grace period for pods of 1800
seconds and a total timeout of 3600 seconds (1 hour). Clients of this
functionality should be prepared for a long timeout.

.. code:: json

  POST /nodes/{node_id}/drain

Query Parameters:

- timeout: a whole number in seconds (default = 3600). This value is the total
  timeout for the kubectl drain command.
- grace-period: a whole number in seconds (default = 1800). This value is the
  grace period used by kubectl drain. The grace period must be less than the
  timeout.

.. note::

  This POST has no message body.

Example command being used for drain (reference only):
`kubectl drain --force --timeout 3600s --grace-period 1800 --ignore-daemonsets --delete-local-data n1`
https://git.openstack.org/cgit/openstack/airship-promenade/tree/promenade/templates/roles/common/usr/local/bin/promenade-teardown

Responses
~~~~~~~~~
All responses will be in the form of the UCP Status response.

- Success: Code: 200, reason: Success

  Indicates that the drain node has successfully concluded, and that no pods
  are currently running.

- Failure: Status response, code: 400, reason: BadRequest

  A request was made with parameters that cannot work - e.g. grace-period is
  set to a value larger than the timeout value.

- Failure: Status response, code: 404, reason: NotFound

  The specified node is not discoverable by Promenade.

- Failure: Status response, code: 500, reason: DrainNodeError

  There was a processing exception raised while trying to drain a node. The
  details section should indicate the underlying cause if it can be
  determined.
Promenade Clear Labels
----------------------
Removes the labels that have been added to the target Kubernetes node.

.. code:: json

  POST /nodes/{node_id}/clear-labels

Query Parameters:

- timeout: A whole number in seconds allowed for the pods to settle/move
  following removal of labels. (Default = 1800)

.. note::

  This POST has no message body.

Responses
~~~~~~~~~
All responses will be in the form of the UCP Status response.

- Success: Code: 200, reason: Success

  All labels have been removed from the specified Kubernetes node.

- Failure: Code: 404, reason: NotFound

  The specified node is not discoverable by Promenade.

- Failure: Code: 500, reason: ClearLabelsError

  There was a failure to clear labels that prevented completion. The details
  section should provide more information about the cause of this failure.
Promenade Remove etcd Node
--------------------------
Checks if the specified node contains any etcd nodes. If so, this API will
trigger those etcd nodes to leave the associated etcd cluster.

.. code:: json

  POST /nodes/{node_id}/remove-etcd

  {
    rel : "design",
    href: "deckhand+https://{{deckhand_url}}/revisions/{{revision_id}}/rendered-documents",
    type: "application/x-yaml"
  }

Query Parameters:

- timeout: A whole number in seconds allowed for the removal of etcd nodes
  from the target node. (Default = 1800)

Responses
~~~~~~~~~
All responses will be in the form of the UCP Status response.

- Success: Code: 200, reason: Success

  All etcd nodes have been removed from the specified node.

- Failure: Code: 404, reason: NotFound

  The specified node is not discoverable by Promenade.

- Failure: Code: 500, reason: RemoveEtcdError

  There was a failure to remove etcd from the target node that prevented
  completion within the specified timeout, or etcd prevented removal of the
  node because it would result in the cluster being broken. The details
  section should provide more information about the cause of this failure.
Promenade Check etcd
--------------------
Retrieves the current interpreted state of etcd.

.. code:: json

  GET /etcd-cluster-health-statuses?design_ref={the design ref}

Where the design_ref parameter is required for appropriate operation, and is in
the same format as used for the join-scripts API.

Query Parameters:

- design_ref: (Required) the design reference to be used to discover etcd
  instances.

Responses
~~~~~~~~~
All responses will be in the form of the UCP Status response.

- Success: Code: 200, reason: Success

  The status of each etcd in the site will be returned in the details section.
  Valid values for status are: Healthy, Unhealthy

https://github.com/att-comdev/ucp-integration/blob/master/docs/source/api-conventions.rst#status-responses

.. code:: json

  { "...": "... standard status response ...",
    "details": {
      "errorCount": {{n}},
      "messageList": [
        { "message": "Healthy",
          "error": false,
          "kind": "HealthMessage",
          "name": "{{the name of the etcd service}}"
        },
        { "message": "Unhealthy",
          "error": false,
          "kind": "HealthMessage",
          "name": "{{the name of the etcd service}}"
        },
        { "message": "Unable to access Etcd",
          "error": true,
          "kind": "HealthMessage",
          "name": "{{the name of the etcd service}}"
        }
      ]
    },
    "...": "..."
  }

- Failure: Code: 400, reason: MissingDesignRef

  Returned if the design_ref parameter is not specified.

- Failure: Code: 404, reason: NotFound

  Returned if the specified etcd could not be located.

- Failure: Code: 500, reason: EtcdNotAccessible

  Returned if the specified etcd responded with an invalid health response
  (not just simply unhealthy - that's a 200).
Promenade Shutdown Kubelet
--------------------------
Shuts down the kubelet on the specified node. This is accomplished by Promenade
setting the label `promenade-decomission: enabled` on the node, which will
trigger a newly-developed daemonset to run something like:
`systemctl disable kubelet && systemctl stop kubelet`.
This daemonset will effectively sit dormant until nodes have the appropriate
label added, and then perform the kubelet teardown.

.. code:: json

  POST /nodes/{node_id}/shutdown-kubelet

.. note::

  This POST has no message body.

Responses
~~~~~~~~~
All responses will be in the form of the UCP Status response.

- Success: Code: 200, reason: Success

  The kubelet has been successfully shut down.

- Failure: Code: 404, reason: NotFound

  The specified node is not discoverable by Promenade.

- Failure: Code: 500, reason: ShutdownKubeletError

  The specified node's kubelet failed to shut down. The details section of the
  status response should contain reasonable information about the source of
  this failure.
Promenade Delete Node from Cluster
----------------------------------
Updates the Kubernetes cluster, removing the specified node. Promenade should
check that the node is drained/cordoned and has no labels other than
`promenade-decomission: enabled`. If either of these checks fails, the API
should respond with a 409 Conflict response.

.. code:: json

  POST /nodes/{node_id}/remove-from-cluster

.. note::

  This POST has no message body.

Responses
~~~~~~~~~
All responses will be in the form of the UCP Status response.

- Success: Code: 200, reason: Success

  The specified node has been removed from the Kubernetes cluster.

- Failure: Code: 404, reason: NotFound

  The specified node is not discoverable by Promenade.

- Failure: Code: 409, reason: Conflict

  The specified node cannot be deleted due to checks that the node is
  drained/cordoned and has no labels (other than possibly
  `promenade-decomission: enabled`).

- Failure: Code: 500, reason: DeleteNodeError

  The specified node cannot be removed from the cluster due to an error from
  Kubernetes. The details section of the status response should contain more
  information about the failure.
Shipyard Tag Releases
---------------------
Shipyard will need to mark Deckhand revisions with tags when there are
successful deploy_site or update_site actions, to be able to determine the last
known good design. This is related to issue 16 for Shipyard, which captures the
same need.

.. note::

  Repeated from https://github.com/att-comdev/shipyard/issues/16

  When multiple configdocs commits have been done since the last deployment,
  there is no ready means to determine what's being done to the site. Shipyard
  should reject deploy site or update site requests that have had multiple
  commits since the last site true-up action. An option to override this guard
  should be allowed for the actions in the form of a parameter to the action.

  The configdocs API should provide a way to see what's been changed since the
  last site true-up, not just the last commit of configdocs. This might be
  accommodated by new Deckhand tags like the 'commit' tag, but for
  'site true-up' or similar, applied by the deploy and update site commands.

The design for issue 16 includes the bare-minimum marking of Deckhand
revisions. This design is as follows:

Scenario
~~~~~~~~
Multiple commits occur between site actions (deploy_site, update_site) - those
actions that attempt to bring a site into compliance with a site design.
When this occurs, the current system of being able to only see what has changed
between the committed and the buffer versions (configdocs diff) is insufficient
to be able to investigate what has changed since the last successful (or
unsuccessful) site action.
To accommodate this, Shipyard needs several enhancements.

Enhancements
~~~~~~~~~~~~

#. Deckhand revision tags for site actions

   Using the tagging facility provided by Deckhand, Shipyard will tag the end
   of site actions.
   Upon completing a site action successfully, tag the revision being used with
   the tag site-action-success, and a body of dag_id:<dag_id>.

   Upon completing a site action unsuccessfully, tag the revision being used
   with the tag site-action-failure, and a body of dag_id:<dag_id>.

   The completion tags should only be applied upon failure if the site action
   gets past document validation successfully (i.e. gets to the point where it
   can start making changes via the other UCP components).

   This could result in a single revision having both site-action-success and
   site-action-failure if a later re-invocation of a site action is successful.

#. Check for intermediate committed revisions

   Upon running a site action, before tagging the revision with the site action
   tag(s), the DAG needs to check to see if there are committed revisions that
   do not have an associated site-action tag. If there are any committed
   revisions since the last site action other than the current revision being
   used (between them), then the action should not be allowed to proceed (stop
   before triggering validations). For the calculation of intermediate
   committed revisions, assume revision 0 if there are no revisions with a
   site-action tag (null case). A sketch of this check appears after this list.

   If the action is invoked with a parameter of
   allow-intermediate-commits=true, then this check should log that the
   intermediate committed revisions check is being skipped and not take any
   other action.

#. Support action parameter of allow-intermediate-commits=true|false

   In the CLI for create action, the --param option supports adding parameters
   to actions. The parameters passed should be relayed by the CLI to the API
   and ultimately to the invocation of the DAG. The DAG as noted above will
   check for the presence of allow-intermediate-commits=true. This needs to be
   tested to work.

#. Shipyard needs to support retrieving configdocs and rendered documents for
   the last successful site action, and the last site action (successful or not
   successful)::

     --successful-site-action
     --last-site-action

   These options would be mutually exclusive of --buffer or --committed.

#. Shipyard diff (shipyard get configdocs)

   Needs to support an option to do the diff of the buffer vs. the last
   successful site action and the last site action (successful or not
   successful).

   Currently there are no options to select which versions to diff (always
   buffer vs. committed).

   Support::

     --base-version=committed | successful-site-action | last-site-action (Default = committed)
     --diff-version=buffer | committed | successful-site-action | last-site-action (Default = buffer)

   Equivalent query parameters need to be implemented in the API.
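The intermediate-commit guard from enhancement 2 can be sketched against plain
data (committed revision ids and the tags attached to each) rather than any
specific Deckhand client, which this blueprint does not prescribe; the function
name and data shapes are illustrative assumptions.

.. code:: python

  SITE_ACTION_TAGS = {'site-action-success', 'site-action-failure'}


  def has_intermediate_commits(committed_revisions, tags_by_revision, current_revision):
      """True if a committed revision sits between the last site action and now.

      `committed_revisions` is a list of revision ids; `tags_by_revision` maps a
      revision id to its list of tags. With no site-action tags anywhere, the
      last site action is assumed to be revision 0 (the null case above).
      """
      last_site_action = 0
      for revision in committed_revisions:
          if SITE_ACTION_TAGS & set(tags_by_revision.get(revision, [])):
              last_site_action = max(last_site_action, revision)
      return any(last_site_action < revision < current_revision
                 for revision in committed_revisions)


  # The DAG would run this check before tagging; when the action was created
  # with allow-intermediate-commits=true it only logs that the check is skipped.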
Because the implementation of this design will result in the tagging of
successful site actions, Shipyard will be able to determine the correct
revision to use while attempting to tear down a node.

If the request to tear down a node indicates a revision that doesn't exist, the
command to do so (e.g. redeploy_server) should not continue, but rather fail
due to a missing precondition.

The invocation of the Promenade and Drydock steps in this design will utilize
the appropriate tag based on the request (default is successful-site-action) to
determine the revision of the Deckhand documents used as the design-ref.

Shipyard redeploy_server Action
-------------------------------
The redeploy_server action currently accepts a target node. Additional
supported parameters are needed:

#. preserve-local-storage=true, which will instruct Drydock to only wipe the
   OS drive; any other local storage will not be wiped. This would allow for
   the drives to be remounted to the server upon re-provisioning. The default
   behavior is that local storage is not preserved.

#. target-revision=committed | successful-site-action | last-site-action

   This will indicate which revision of the design will be used as the
   reference for what should be re-provisioned after the teardown.
   The default is successful-site-action, which is the closest representation
   to the last-known-good state.

These should be accepted as parameters to the action API/CLI and modify the
behavior of the redeploy_server DAG.
@@ -52,7 +52,6 @@ Conventions and Standards
    :maxdepth: 3

    conventions
-   blueprints/blueprints
    dev-getting-started