
Merge "Remove the blueprints from airship-in-a-bottle"

changes/07/587207/7
Zuul 9 months ago
commit f68ca30bd7

docs/source/blueprints/blueprints.rst  (+0, -28)

@@ -1,28 +0,0 @@
-..
-      Copyright 2018 AT&T Intellectual Property.
-      All Rights Reserved.
-
-      Licensed under the Apache License, Version 2.0 (the "License"); you may
-      not use this file except in compliance with the License. You may obtain
-      a copy of the License at
-
-          http://www.apache.org/licenses/LICENSE-2.0
-
-      Unless required by applicable law or agreed to in writing, software
-      distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
-      WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
-      License for the specific language governing permissions and limitations
-      under the License.
-
-.. _blueprints:
-
-Blueprints
-==========
-
-Designs for features of the UCP.
-
-.. toctree::
-   :maxdepth: 2
-
-   deployment-grouping-baremetal
-   node-teardown

docs/source/blueprints/deployment-grouping-baremetal.rst  (+0, -553)

@@ -1,553 +0,0 @@
-..
-      Copyright 2018 AT&T Intellectual Property.
-      All Rights Reserved.
-
-      Licensed under the Apache License, Version 2.0 (the "License"); you may
-      not use this file except in compliance with the License. You may obtain
-      a copy of the License at
-
-          http://www.apache.org/licenses/LICENSE-2.0
-
-      Unless required by applicable law or agreed to in writing, software
-      distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
-      WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
-      License for the specific language governing permissions and limitations
-      under the License.
-
-.. _deployment-grouping-baremetal:
-
-Deployment Grouping for Baremetal Nodes
-=======================================
-One of the primary functionalities of the Undercloud Platform is the deployment
-of baremetal nodes as part of site deployment and upgrade. This blueprint aims
-to define how deployment strategies can be applied to the workflow during these
-actions.
-
-Overview
---------
-When Shipyard is invoked for a deploy_site or update_site action, there are
-three primary stages:
-
-1. Preparation and Validation
-2. Baremetal and Network Deployment
-3. Software Deployment
-
-During the Baremetal and Network Deployment stage, the deploy_site or
-update_site workflow (and perhaps other workflows in the future) invokes
-Drydock to verify the site, prepare the site, prepare the nodes, and deploy the
-nodes. Each of these steps is described in the `Drydock Orchestrator Readme`_.
-
-.. _Drydock Orchestrator Readme: https://git.openstack.org/cgit/openstack/airship-drydock/plain/drydock_provisioner/orchestrator/readme.md
-
-The prepare nodes and deploy nodes steps each involve intensive and potentially
-time-consuming operations on the target nodes, orchestrated by Drydock and
-MAAS. These steps need to be approached and managed such that grouping,
-ordering, and criticality of success of nodes can be managed in support of
-fault-tolerant site deployments and updates.
-
-For the purposes of this document, `phase of deployment` refers to the prepare
-nodes and deploy nodes steps of the Baremetal and Network Deployment stage.
-
-Some factors that inform this solution:
-
-1. Limits to the amount of parallelization that can occur due to a centralized
-   MAAS system.
-2. Faults in the hardware, preventing operational nodes.
-3. Miswiring or misconfiguration of network hardware.
-4. Incorrect site design causing a mismatch against the hardware.
-5. Criticality of particular nodes to the realization of the site design.
-6. Desired configurability within the framework of the UCP declarative site
-   design.
-7. Improved visibility into the current state of node deployment.
-8. A desire to begin the deployment of nodes before the finish of the
-   preparation of nodes -- i.e. start deploying nodes as soon as they are ready
-   to be deployed. Note: This design will not achieve new forms of
-   task parallelization within Drydock; this is recognized as a desired
-   functionality.
-
-Solution
---------
-Updates supporting this solution will require changes to Shipyard for changed
-workflows, and to Drydock for the desired node targeting and for retrieval of
-diagnostic and result information.
-
-Deployment Strategy Document (Shipyard)
----------------------------------------
-To accommodate the needed changes, this design introduces a new
-DeploymentStrategy document into the site design, to be read and utilized
-by the workflows for update_site and deploy_site.
-
-Groups
-~~~~~~
-Groups are named sets of nodes that will be deployed together. The fields of a
-group are:
-
-name
-  Required. The identifying name of the group.
-
-critical
-  Required. Indicates if this group is required to continue to additional
-  phases of deployment.
-
-depends_on
-  Required, may be an empty list. Group names that must be successful before
-  this group can be processed.
-
-selectors
-  Required, may be an empty list. A list of identifying information to indicate
-  the nodes that are members of this group.
-
-success_criteria
-  Optional. Criteria that must evaluate to true before a group is considered
-  successfully complete for a phase of deployment.
-
-Criticality
-'''''''''''
-- Field: critical
-- Valid values: true | false
-
-Each group is required to indicate true or false for the `critical` field.
-This drives the behavior after the deployment of baremetal nodes. If any
-group that is marked as `critical: true` fails to meet its success
-criteria, the workflow should halt after the deployment of baremetal nodes. A
-group that cannot be processed due to a parent dependency failing will be
-considered failed, regardless of the success criteria.
-
-Dependencies
-''''''''''''
-- Field: depends_on
-- Valid values: [] or a list of group names
-
-Each group specifies a list of depends_on groups, or an empty list. All
-identified groups must complete successfully for the phase of deployment before
-the current group is allowed to be processed by the current phase.
-
-- A failure (based on success criteria) of a group prevents any groups
-  dependent upon the failed group from being attempted.
-- Circular dependencies will be rejected as invalid during document validation.
-- There is no guarantee of ordering among groups that have their dependencies
-  met. Any group that is ready for deployment based on declared dependencies
-  will execute. Execution of groups is serialized - two groups will not deploy
-  at the same time.
-
-Selectors
-'''''''''
-- Field: selectors
-- Valid values: [] or a list of selectors
-
-The list of selectors indicates the nodes that will be included in a group.
-Each selector has four available filtering values: node_names, node_tags,
-node_labels, and rack_names. Each selector is an intersection of these
-criteria, while the list of selectors is a union of the individual selectors.
-
-- Omitting a criterion from a selector, or using an empty list, means that
-  criterion is ignored.
-- Having a completely empty list of selectors, or a selector that has no
-  criteria specified, indicates ALL nodes.
-- A collection of selectors that results in no nodes being identified will be
-  processed as if 100% of nodes successfully deployed (avoiding division by
-  zero), but would fail the minimum or maximum nodes criteria (it still counts
-  as 0 nodes).
-- There is no validation against the same node being in multiple groups;
-  however, the workflow will not resubmit nodes that have already completed or
-  failed in this deployment to Drydock twice, since it keeps track of each node
-  uniquely. The success or failure of those nodes excluded from submission to
-  Drydock will still be used for the success criteria calculation.
-
-E.g.::
-
-  selectors:
-    - node_names:
-        - node01
-        - node02
-      rack_names:
-        - rack01
-      node_tags:
-        - control
-    - node_names:
-        - node04
-      node_labels:
-        - ucp_control_plane: enabled
-
-This will indicate (not really SQL, just for illustration)::
-
-    SELECT nodes
-    WHERE node_name in ('node01', 'node02')
-          AND rack_name in ('rack01')
-          AND node_tags in ('control')
-    UNION
-    SELECT nodes
-    WHERE node_name in ('node04')
-          AND node_label in ('ucp_control_plane: enabled')
-
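To make the union/intersection semantics concrete, the following is a minimal
Python sketch of selector matching. It is an illustration only - the ``Node``
structure, helper names, and label handling are assumptions, not the actual
Drydock or Shipyard data model.

.. code:: python

  # Sketch only: a group's selector list is a union; the criteria inside a
  # single selector are intersected. Node is an illustrative stand-in.
  from dataclasses import dataclass, field


  @dataclass
  class Node:
      name: str
      rack: str
      tags: set = field(default_factory=set)
      labels: dict = field(default_factory=dict)


  def _criterion(values, candidates):
      # An omitted or empty criterion is ignored (treated as a match).
      return not values or bool(set(values) & candidates)


  def _labels(wanted, labels):
      # wanted is a list of {key: value} entries; empty means ignored.
      return not wanted or any(labels.get(k) == v
                               for entry in wanted for k, v in entry.items())


  def selector_matches(node, sel):
      # Intersection: every criterion named in the selector must match.
      return (_criterion(sel.get("node_names", []), {node.name})
              and _criterion(sel.get("rack_names", []), {node.rack})
              and _criterion(sel.get("node_tags", []), node.tags)
              and _labels(sel.get("node_labels", []), node.labels))


  def group_nodes(nodes, selectors):
      # Union: a node belongs to the group if any selector matches it.
      # An empty selector list selects ALL nodes.
      if not selectors:
          return {n.name for n in nodes}
      return {n.name for n in nodes
              if any(selector_matches(n, s) for s in selectors)}

With the example selectors above, ``group_nodes`` would return node01 and
node02 only if they are in rack01 and tagged control, plus node04 if it
carries the ucp_control_plane: enabled label - mirroring the SQL illustration.
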
-Success Criteria
-''''''''''''''''
-- Field: success_criteria
-- Valid values: see below
-
-Each group optionally contains success criteria, which are used to indicate
-whether the deployment of that group is successful. The values that may be
-specified:
-
-percent_successful_nodes
-  The calculated success rate of nodes completing the deployment phase.
-
-  E.g.: 75 would mean that 3 of 4 nodes must complete the phase successfully.
-
-  This is useful for groups that have larger numbers of nodes, and do not
-  have critical minimums or are not sensitive to an arbitrary number of nodes
-  not working.
-
-minimum_successful_nodes
-  An integer indicating how many nodes must complete the phase to be considered
-  successful.
-
-maximum_failed_nodes
-  An integer indicating a number of nodes that are allowed to have failed the
-  deployment phase and still consider that group successful.
-
-When no criteria are specified, no checks are done - processing
-continues as if nothing is wrong.
-
-When more than one criterion is specified, each is evaluated separately - if
-any fail, the group is considered failed.
-
-
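For illustration, here is a small sketch (an assumed helper, not the Shipyard
implementation) of how these checks could be combined, including the
empty-group behavior described under Selectors:

.. code:: python

  # Sketch only: evaluate a group's success_criteria for one phase.
  def evaluate_success_criteria(criteria, group_nodes, successful_nodes):
      # No criteria specified means no checks are done.
      if not criteria:
          return True

      total = len(group_nodes)
      succeeded = len(set(successful_nodes) & set(group_nodes))
      failed = total - succeeded
      # An empty group counts as 100% successful but as zero successful nodes.
      percent = 100.0 if total == 0 else 100.0 * succeeded / total

      checks = []
      if "percent_successful_nodes" in criteria:
          checks.append(percent >= criteria["percent_successful_nodes"])
      if "minimum_successful_nodes" in criteria:
          checks.append(succeeded >= criteria["minimum_successful_nodes"])
      if "maximum_failed_nodes" in criteria:
          checks.append(failed <= criteria["maximum_failed_nodes"])
      # Each criterion is evaluated separately; any failure fails the group.
      return all(checks)
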
-Example Deployment Strategy Document
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-This example shows a deployment strategy with 5 groups: control-nodes,
-compute-nodes-1, compute-nodes-2, monitoring-nodes, and ntp-node.
-
-::
-
-  ---
-  schema: shipyard/DeploymentStrategy/v1
-  metadata:
-    schema: metadata/Document/v1
-    name: deployment-strategy
-    layeringDefinition:
-        abstract: false
-        layer: global
-    storagePolicy: cleartext
-  data:
-    groups:
-      - name: control-nodes
-        critical: true
-        depends_on:
-          - ntp-node
-        selectors:
-          - node_names: []
-            node_labels: []
-            node_tags:
-              - control
-            rack_names:
-              - rack03
-        success_criteria:
-          percent_successful_nodes: 90
-          minimum_successful_nodes: 3
-          maximum_failed_nodes: 1
-      - name: compute-nodes-1
-        critical: false
-        depends_on:
-          - control-nodes
-        selectors:
-          - node_names: []
-            node_labels: []
-            rack_names:
-              - rack01
-            node_tags:
-              - compute
-        success_criteria:
-          percent_successful_nodes: 50
-      - name: compute-nodes-2
-        critical: false
-        depends_on:
-          - control-nodes
-        selectors:
-          - node_names: []
-            node_labels: []
-            rack_names:
-              - rack02
-            node_tags:
-              - compute
-        success_criteria:
-          percent_successful_nodes: 50
-      - name: monitoring-nodes
-        critical: false
-        depends_on: []
-        selectors:
-          - node_names: []
-            node_labels: []
-            node_tags:
-              - monitoring
-            rack_names:
-              - rack03
-              - rack02
-              - rack01
-      - name: ntp-node
-        critical: true
-        depends_on: []
-        selectors:
-          - node_names:
-              - ntp01
-            node_labels: []
-            node_tags: []
-            rack_names: []
-        success_criteria:
-          minimum_successful_nodes: 1
-
-The ordering of groups, as defined by the dependencies (``depends_on``
-fields)::
-
-   __________     __________________
-  | ntp-node |   | monitoring-nodes |
-   ----------     ------------------
-       |
-   ____V__________
-  | control-nodes |
-   ---------------
-       |_________________________
-           |                     |
-     ______V__________     ______V__________
-    | compute-nodes-1 |   | compute-nodes-2 |
-     -----------------     -----------------
-
-Given this, the order of execution could be:
-
-- ntp-node > monitoring-nodes > control-nodes > compute-nodes-1 > compute-nodes-2
-- ntp-node > control-nodes > compute-nodes-2 > compute-nodes-1 > monitoring-nodes
-- monitoring-nodes > ntp-node > control-nodes > compute-nodes-1 > compute-nodes-2
-- and many more ... the only guarantee is that ntp-node will run some time
-  before control-nodes, which will run some time before both of the
-  compute-nodes groups. The monitoring-nodes group can run at any time.
-
-Also of note are the various combinations of selectors and the varied use of
-success criteria.
325
-
326
-Deployment Configuration Document (Shipyard)
327
---------------------------------------------
328
-The existing deployment-configuration document that is used by the workflows
329
-will also be modified to use the existing deployment_strategy field to provide
330
-the name of the deployment-straegy document that will be used.
331
-
332
-The default value for the name of the DeploymentStrategy document will be
333
-``deployment-strategy``.
334
-
-Drydock Changes
----------------
-
-API and CLI
-~~~~~~~~~~~
-- A new API needs to be provided that accepts a node filter (i.e. a selector,
-  as above) and returns the list of node names that result from analysis of the
-  design. Input to this API will also need to include a design reference.
-
-- Drydock needs to provide a "tree" output of tasks rooted at the requested
-  parent task. This will provide the needed success/failure status for nodes
-  that have been prepared/deployed.
-
-Documentation
-~~~~~~~~~~~~~
-Drydock documentation will be updated to match the introduction of the new
-APIs.
-
-
-Shipyard Changes
-----------------
-
-API and CLI
-~~~~~~~~~~~
-- The commit configdocs API will need to be enhanced to look up the
-  DeploymentStrategy by using the DeploymentConfiguration.
-- The DeploymentStrategy document will need to be validated to ensure there are
-  no circular dependencies in the groups' declared dependencies (perhaps using
-  NetworkX_; see the sketch below).
-- A new API endpoint (and matching CLI) is desired to retrieve the status of
-  nodes as known to Drydock/MAAS and their MAAS status. The existing node list
-  API in Drydock provides a JSON output that can be utilized for this purpose.
-
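A possible shape for that circular-dependency validation, assuming NetworkX is
the chosen library (a sketch, not the final implementation):

.. code:: python

  # Sketch of DeploymentStrategy dependency validation using NetworkX.
  import networkx as nx


  def validate_group_dependencies(groups):
      """groups: the data.groups list from a DeploymentStrategy document.
      Raises ValueError on unknown or circular depends_on references."""
      known = {group["name"] for group in groups}
      graph = nx.DiGraph()
      graph.add_nodes_from(known)
      for group in groups:
          for parent in group.get("depends_on", []):
              if parent not in known:
                  raise ValueError("group %r depends on unknown group %r"
                                   % (group["name"], parent))
              graph.add_edge(parent, group["name"])
      if not nx.is_directed_acyclic_graph(graph):
          raise ValueError("circular group dependencies detected: %s"
                           % (nx.find_cycle(graph),))
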
-Workflow
-~~~~~~~~
-The deploy_site and update_site workflows will be modified to utilize the
-DeploymentStrategy.
-
-- The deployment configuration step will be enhanced to also read the
-  deployment strategy and pass the information on a new xcom for use by the
-  baremetal nodes step (see below).
-- The prepare nodes and deploy nodes steps will be combined to perform both as
-  part of the resolution of an overall ``baremetal nodes`` step.
-  The baremetal nodes step will introduce functionality that reads in the
-  deployment strategy (from the prior xcom), and can orchestrate the calls to
-  Drydock to enact the grouping, ordering, and success evaluation.
-  Note that Drydock will serialize tasks; there is no parallelization of
-  prepare/deploy at this time.
-
-Needed Functionality
-''''''''''''''''''''
-
-- function to formulate the ordered groups based on dependencies (perhaps
-  using NetworkX_; see the sketch following this list)
-- function to evaluate success/failure against the success criteria for a group
-  based on the result list of succeeded or failed nodes.
-- function to mark groups as success or failure (including failed due to
-  dependency failure), as well as keep track of the (if any) successful and
-  failed nodes.
-- function to get a group that is ready to execute, or 'Done' when all groups
-  are either complete or failed.
-- function to formulate the node filter for Drydock based on a group's
-  selectors.
-- function to orchestrate processing groups, moving to the next group (or being
-  done) when a prior group completes or fails.
-- function to summarize the success/failed nodes for a group (primarily for
-  reporting to the logs at this time).
-
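The ordering and ready-group functions could be realized with a dependency
graph, for example (again a sketch under the NetworkX assumption, not the
eventual Shipyard code):

.. code:: python

  # Sketch: derive group ordering from depends_on and pick the next ready group.
  import networkx as nx


  def build_group_graph(groups):
      graph = nx.DiGraph()
      for group in groups:
          graph.add_node(group["name"])
          for parent in group.get("depends_on", []):
              graph.add_edge(parent, group["name"])
      return graph


  def next_ready_group(graph, succeeded, failed):
      """Return a group whose parents have all succeeded, or None when every
      group is either finished or blocked by a failed parent."""
      for name in nx.topological_sort(graph):
          if name in succeeded or name in failed:
              continue
          parents = set(graph.predecessors(name))
          if parents & failed:
              failed.add(name)      # failed due to dependency failure
              continue
          if parents <= succeeded:
              return name
      return None

The orchestration function would call ``next_ready_group`` repeatedly,
updating the ``succeeded`` and ``failed`` sets after each group's success
criteria are evaluated.
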
-Process
-'''''''
-The baremetal nodes step (preparation and deployment of nodes) will proceed as
-follows:
-
-1. Each group's selector will be sent to Drydock to determine the list of
-   nodes that are a part of that group.
-
-   - An overall status will be kept for each unique node (not started |
-     prepared | success | failure).
-   - When sending a task to Drydock for processing, the nodes associated with
-     that group will be sent as a simple `node_name` node filter. This allows
-     the list to exclude nodes whose status is not congruent with the task
-     being performed.
-
-     - prepare nodes valid status: not started
-     - deploy nodes valid status: prepared
-
-2. In a processing loop, groups that are ready to be processed based on their
-   dependencies (and the success criteria of groups they are dependent upon)
-   will be selected for processing until there are no more groups that can be
-   processed. The processing will consist of preparing and then deploying the
-   group (a sketch of this loop follows the notes below).
-
-   - The selected group will be prepared and then deployed before selecting
-     another group for processing.
-   - Any nodes that failed as part of that group will be excluded from
-     subsequent deployment or preparation of that node for this deployment.
-
-     - Excluding nodes that are already processed addresses groups that have
-       overlapping lists of nodes due to the groups' selectors, and prevents
-       sending them to Drydock for re-processing.
-     - Evaluation of the success criteria will use the full set of nodes
-       identified by the selector. This means that if a node was previously
-       successfully deployed, that same node will count as "successful" when
-       evaluating the success criteria.
-
-   - The success criteria will be evaluated after the group's prepare step and
-     after the deploy step. A failure to meet the success criteria in a prepare
-     step will cause the deploy step for that group to be skipped (and marked
-     as failed).
-   - Any nodes that fail during the prepare step will not be used in the
-     corresponding deploy step.
-   - Upon completion (success, partial success, or failure) of a prepare step,
-     the nodes that were sent for preparation will be marked in the unique list
-     of nodes (above) with their appropriate status: prepared or failure.
-   - Upon completion of a group's deployment step, the nodes' status will be
-     updated to their current status: success or failure.
-
-3. Before the end of the baremetal nodes step, following all eligible group
-   processing, a report will be logged to indicate the success/failure of
-   groups and the status of the individual nodes. Note that it is possible for
-   individual nodes to be left in the `not started` state if they were only
-   part of groups that were never allowed to process due to dependencies and
-   success criteria.
-
-4. At the end of the baremetal nodes step, any failures due to timeout,
-   dependency failure, or success criteria failure in groups marked as
-   critical will trigger an Airflow Exception, resulting in a failed
-   deployment.
-
-Notes:
-
-- The timeout values specified for the prepare nodes and deploy nodes steps
-  will be used to put bounds on the individual calls to Drydock. A failure
-  based on these values will be treated as a failure for the group; we need to
-  be vigilant about whether this will lead to indeterminate states for nodes
-  that interfere with further processing (e.g. timed out, but the requested
-  work still continued to completion).
-
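Putting the pieces together, the processing loop described above might look
roughly like the following sketch (reusing the helpers sketched earlier; the
Drydock client calls are hypothetical stand-ins for the real task submissions):

.. code:: python

  # Sketch of the baremetal nodes step's group-processing loop.
  def process_groups(groups_by_name, graph, drydock, node_status):
      succeeded, failed = set(), set()
      while True:
          name = next_ready_group(graph, succeeded, failed)
          if name is None:
              break
          group = groups_by_name[name]
          members = group["nodes"]        # resolved earlier via the selectors
          criteria = group.get("success_criteria")

          # Prepare only nodes not already handled in this deployment.
          targets = [n for n in members if node_status[n] == "not started"]
          prepared = drydock.prepare_nodes(targets)        # hypothetical call
          for n in targets:
              node_status[n] = "prepared" if n in prepared else "failure"

          # Criteria are evaluated against the full set of group members.
          ok = evaluate_success_criteria(
              criteria, members,
              [n for n in members if node_status[n] in ("prepared", "success")])
          if not ok:
              failed.add(name)     # deploy step is skipped and marked failed
              continue

          to_deploy = [n for n in members if node_status[n] == "prepared"]
          deployed = drydock.deploy_nodes(to_deploy)       # hypothetical call
          for n in to_deploy:
              node_status[n] = "success" if n in deployed else "failure"

          ok = evaluate_success_criteria(
              criteria, members,
              [n for n in members if node_status[n] == "success"])
          (succeeded if ok else failed).add(name)
      return succeeded, failed

After the loop, the workflow would log the summary report and raise an Airflow
exception if any group in ``failed`` is marked critical.
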
-Example Processing
-''''''''''''''''''
-Using the deployment strategy defined in the above example, the following is
-an example of how it may process::
-
-  Start
-  |
-  | prepare ntp-node           <SUCCESS>
-  | deploy ntp-node            <SUCCESS>
-  V
-  | prepare control-nodes      <SUCCESS>
-  | deploy control-nodes       <SUCCESS>
-  V
-  | prepare monitoring-nodes   <SUCCESS>
-  | deploy monitoring-nodes    <SUCCESS>
-  V
-  | prepare compute-nodes-2    <SUCCESS>
-  | deploy compute-nodes-2     <SUCCESS>
-  V
-  | prepare compute-nodes-1    <SUCCESS>
-  | deploy compute-nodes-1     <SUCCESS>
-  |
-  Finish (success)
-
-If there were a failure in preparing the ntp-node, the following would be the
-result::
-
-  Start
-  |
-  | prepare ntp-node           <FAILED>
-  | deploy ntp-node            <FAILED, due to prepare failure>
-  V
-  | prepare control-nodes      <FAILED, due to dependency>
-  | deploy control-nodes       <FAILED, due to dependency>
-  V
-  | prepare monitoring-nodes   <SUCCESS>
-  | deploy monitoring-nodes    <SUCCESS>
-  V
-  | prepare compute-nodes-2    <FAILED, due to dependency>
-  | deploy compute-nodes-2     <FAILED, due to dependency>
-  V
-  | prepare compute-nodes-1    <FAILED, due to dependency>
-  | deploy compute-nodes-1     <FAILED, due to dependency>
-  |
-  Finish (failed due to critical group failure)
-
-If a failure occurred during the deploy of compute-nodes-2, the following would
-result::
-
-  Start
-  |
-  | prepare ntp-node           <SUCCESS>
-  | deploy ntp-node            <SUCCESS>
-  V
-  | prepare control-nodes      <SUCCESS>
-  | deploy control-nodes       <SUCCESS>
-  V
-  | prepare monitoring-nodes   <SUCCESS>
-  | deploy monitoring-nodes    <SUCCESS>
-  V
-  | prepare compute-nodes-2    <SUCCESS>
-  | deploy compute-nodes-2     <FAILED>
-  V
-  | prepare compute-nodes-1    <SUCCESS>
-  | deploy compute-nodes-1     <SUCCESS>
-  |
-  Finish (success with some nodes/groups failed)
-
-Schemas
-~~~~~~~
-A new schema will need to be provided by Shipyard to validate the
-DeploymentStrategy document.
-
-Documentation
-~~~~~~~~~~~~~
-The Shipyard action documentation will need to include details defining the
-DeploymentStrategy document (mostly as defined here), as well as the update to
-the DeploymentConfiguration document to contain the name of the
-DeploymentStrategy document.
-
-
-.. _NetworkX: https://networkx.github.io/documentation/networkx-1.9/reference/generated/networkx.algorithms.dag.topological_sort.html

docs/source/blueprints/node-teardown.rst  (+0, -559)

@@ -1,559 +0,0 @@
-..
-      Copyright 2018 AT&T Intellectual Property.
-      All Rights Reserved.
-
-      Licensed under the Apache License, Version 2.0 (the "License"); you may
-      not use this file except in compliance with the License. You may obtain
-      a copy of the License at
-
-          http://www.apache.org/licenses/LICENSE-2.0
-
-      Unless required by applicable law or agreed to in writing, software
-      distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
-      WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
-      License for the specific language governing permissions and limitations
-      under the License.
-
-.. _node-teardown:
-
-Undercloud Node Teardown
-========================
-
-When redeploying a physical host (server) using the Undercloud Platform (UCP),
-it is necessary to trigger a sequence of steps to prevent undesired behaviors
-when the server is redeployed. This blueprint intends to document the
-interaction that must occur between UCP components to tear down a server.
-
-Overview
---------
-Shipyard is the entrypoint for UCP actions, including the need to redeploy a
-server. The first part of redeploying a server is the graceful teardown of the
-software running on the server; specifically, Kubernetes and etcd are of
-critical concern. It is the duty of Shipyard to orchestrate the teardown of the
-server, followed by steps to deploy the desired new configuration. This design
-covers only the first portion - node teardown.
-
-Shipyard Node Teardown Process
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-#. (Existing) Shipyard receives a request to redeploy_server, specifying a
-   target server.
-#. (Existing) Shipyard performs preflight, design reference lookup, and
-   validation steps.
-#. (New) Shipyard invokes Promenade to decommission the node.
-#. (New) Shipyard invokes Drydock to destroy the node - setting a node
-   filter to restrict to a single server.
-#. (New) Shipyard invokes Promenade to remove the node from the Kubernetes
-   cluster (a sketch of this sequence appears below).
-
-Assumption:
-node_id is the hostname of the server, and is also the identifier that both
-Drydock and Promenade use to identify the appropriate parts - hosts and k8s
-nodes. This convention is set by the join script produced by Promenade.
-
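As a rough illustration of that sequence (the client objects and method names
below are hypothetical stand-ins for the actual workflow steps):

.. code:: python

  # Sketch of the redeploy_server teardown sequence; not the actual step code.
  def teardown_node(node_id, design_ref, promenade, drydock):
      # 1. Cleanly disassociate the node from Kubernetes
      #    (drain, clear labels, remove etcd, shut down kubelet).
      promenade.decommission_node(node_id, design_ref)

      # 2. Tear down the hardware, restricting the Drydock task to this node.
      drydock.destroy_node(design_ref, node_filter={"node_names": [node_id]})

      # 3. Remove the node from the Kubernetes cluster itself.
      promenade.remove_from_cluster(node_id)
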
-Drydock Destroy Node
---------------------
-The API/interface for destroy node already exists. The implementation within
-Drydock needs to be developed. This interface will need to accept both the
-specified node_id and the design_id to retrieve from Deckhand.
-
-Using the provided node_id (hardware node) and the design_id, Drydock will
-reset the hardware to a re-provisionable state.
-
-By default, all local storage should be wiped (per datacenter policy for
-wiping before re-use).
-
-An option to allow for only the OS disk to be wiped should be supported, such
-that other local storage is left intact and could be remounted without data
-loss, e.g.: --preserve-local-storage
-
-The target node should be shut down.
-
-The target node should be removed from the provisioner (e.g. MAAS).
-
-Responses
-~~~~~~~~~
-The responses from this functionality should follow the pattern set by prepare
-nodes and other Drydock functionality. The Drydock status responses used for
-all async invocations will be utilized for this functionality.
-
-Promenade Decommission Node
----------------------------
-Performs steps that will result in the specified node being cleanly
-disassociated from Kubernetes, and ready for the server to be destroyed.
-Users of the decommission node API should be aware of the long timeout values
-that may occur while waiting for Promenade to complete the appropriate steps.
-At this time, Promenade is a stateless service and doesn't use any database
-storage. As such, requests to Promenade are synchronous.
-
-.. code:: json
-
-  POST /nodes/{node_id}/decommission
-
-  {
-    "rel": "design",
-    "href": "deckhand+https://{{deckhand_url}}/revisions/{{revision_id}}/rendered-documents",
-    "type": "application/x-yaml"
-  }
-
-Such that the design reference body is the design indicated when the
-redeploy_server action is invoked through Shipyard.
-
-Query Parameters:
-
--  drain-node-timeout: A whole number timeout in seconds to be used for the
-   drain node step (default: none). In the case of no value being provided,
-   the drain node step will use its default.
--  drain-node-grace-period: A whole number in seconds indicating the
-   grace period that will be provided to the drain node step (default: none).
-   If no value is specified, the drain node step will use its default.
--  clear-labels-timeout: A whole number timeout in seconds to be used for the
-   clear labels step (default: none). If no value is specified, clear labels
-   will use its own default.
--  remove-etcd-timeout: A whole number timeout in seconds to be used for the
-   remove etcd from nodes step (default: none). If no value is specified,
-   remove-etcd will use its own default.
--  etcd-ready-timeout: A whole number in seconds indicating how long the
-   decommission node request should allow for etcd clusters to become stable
-   (default: 600).
-
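An example invocation of this API (illustrative only; the Promenade base URL,
revision number, and authentication handling are assumptions):

.. code:: python

  # Sketch: invoke the decommission API with a design reference body.
  import requests

  design_ref = {
      "rel": "design",
      "href": "deckhand+https://deckhand-api/revisions/42/rendered-documents",
      "type": "application/x-yaml",
  }

  resp = requests.post(
      "https://promenade-api/nodes/node01/decommission",  # base URL is assumed
      params={"drain-node-timeout": 3600, "etcd-ready-timeout": 600},
      json=design_ref,
      timeout=7200,       # decommissioning can take a long time
  )
  resp.raise_for_status()
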
-Process
-~~~~~~~
-Acting upon the node specified by the invocation and the design reference
-details:
-
-#. Drain the Kubernetes node.
-#. Clear the Kubernetes labels on the node.
-#. Remove etcd nodes from their clusters (if impacted).
-
-   -  If the node being decommissioned contains etcd nodes, Promenade will
-      attempt to gracefully have those nodes leave the etcd cluster.
-
-#. Ensure that etcd cluster(s) are in a stable state (see the sketch after
-   this list).
-
-   -  Polls for status every 30 seconds up to the etcd-ready-timeout, or until
-      the cluster meets the defined minimum functionality for the site.
-   -  A new document, promenade/EtcdClusters/v1, will specify details about
-      the etcd clusters deployed in the site, including: identifiers,
-      credentials, and thresholds for minimum functionality.
-   -  This process should ignore the node being torn down in any calculation
-      of health.
-
-#. Shut down the kubelet.
-
-   -  If this is not possible because the node is in a state of disarray such
-      that it cannot schedule the daemonset to run, this step may fail, but
-      should not hold up the process, as the Drydock dismantling of the node
-      will shut the kubelet down.
-
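The etcd stability wait in step 4 amounts to a poll loop along these lines (a
sketch; the real check would consult the promenade/EtcdClusters/v1 thresholds
and ignore the node being torn down):

.. code:: python

  # Sketch of the etcd stability wait used during decommissioning.
  import time


  def wait_for_etcd_stable(check_health, etcd_ready_timeout=600, interval=30):
      """check_health() should return True once every etcd cluster meets the
      minimum functionality defined for the site."""
      deadline = time.monotonic() + etcd_ready_timeout
      while time.monotonic() < deadline:
          if check_health():
              return True
          time.sleep(interval)
      return False
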
-Responses
-~~~~~~~~~
-All responses will be in the form of the UCP Status response.
-
--  Success: Code: 200, reason: Success
-
-   Indicates that all steps were successful.
-
--  Failure: Code: 404, reason: NotFound
-
-   Indicates that the target node is not discoverable by Promenade.
-
--  Failure: Code: 500, reason: DisassociateStepFailure
-
-   The details section should detail the successes and failures further. Any
-   4xx series errors from the individual steps would manifest as a 500 here.
-
-Promenade Drain Node
---------------------
-Drains the Kubernetes node for the target node. This will ensure that this node
-is no longer the target of any pod scheduling, and evicts or deletes the
-running pods. In the case of nodes running DaemonSet-managed pods, or pods
-that would prevent a drain from occurring, Promenade may be required to provide
-the `ignore-daemonsets` option or `force` option to attempt to drain the node
-as fully as possible.
-
-By default, the drain node step will utilize a grace period for pods of 1800
-seconds and a total timeout of 3600 seconds (1 hour). Clients of this
-functionality should be prepared for a long timeout.
-
-.. code:: json
-
-  POST /nodes/{node_id}/drain
-
-Query Parameters:
-
--  timeout: a whole number in seconds (default = 3600). This value is the total
-   timeout for the kubectl drain command.
--  grace-period: a whole number in seconds (default = 1800). This value is the
-   grace period used by kubectl drain. The grace period must be less than the
-   timeout.
-
-.. note::
-
-   This POST has no message body.
-
-Example command used for drain (reference only):
-`kubectl drain --force --timeout 3600s --grace-period 1800 --ignore-daemonsets --delete-local-data n1`
-https://git.openstack.org/cgit/openstack/airship-promenade/tree/promenade/templates/roles/common/usr/local/bin/promenade-teardown
-
-Responses
-~~~~~~~~~
-All responses will be in the form of the UCP Status response.
-
--  Success: Code: 200, reason: Success
-
-   Indicates that the drain node step has successfully concluded, and that no
-   pods are currently running.
-
--  Failure: Status response, code: 400, reason: BadRequest
-
-   A request was made with parameters that cannot work - e.g. grace-period is
-   set to a value larger than the timeout value.
-
--  Failure: Status response, code: 404, reason: NotFound
-
-   The specified node is not discoverable by Promenade.
-
--  Failure: Status response, code: 500, reason: DrainNodeError
-
-   There was a processing exception raised while trying to drain a node. The
-   details section should indicate the underlying cause if it can be
-   determined.
-
-Promenade Clear Labels
-----------------------
-Removes the labels that have been added to the target Kubernetes node.
-
-.. code:: json
-
-  POST /nodes/{node_id}/clear-labels
-
-Query Parameters:
-
--  timeout: A whole number in seconds allowed for the pods to settle/move
-   following removal of labels. (Default = 1800)
-
-.. note::
-
-   This POST has no message body.
-
-Responses
-~~~~~~~~~
-All responses will be in the form of the UCP Status response.
-
--  Success: Code: 200, reason: Success
-
-   All labels have been removed from the specified Kubernetes node.
-
--  Failure: Code: 404, reason: NotFound
-
-   The specified node is not discoverable by Promenade.
-
--  Failure: Code: 500, reason: ClearLabelsError
-
-   There was a failure to clear labels that prevented completion. The details
-   section should provide more information about the cause of this failure.
-
-Promenade Remove etcd Node
---------------------------
-Checks if the specified node contains any etcd nodes. If so, this API will
-trigger those etcd nodes to leave the associated etcd cluster.
-
-.. code:: json
-
-  POST /nodes/{node_id}/remove-etcd
-
-  {
-    "rel": "design",
-    "href": "deckhand+https://{{deckhand_url}}/revisions/{{revision_id}}/rendered-documents",
-    "type": "application/x-yaml"
-  }
-
-Query Parameters:
-
--  timeout: A whole number in seconds allowed for the removal of etcd nodes
-   from the target node. (Default = 1800)
-
-Responses
-~~~~~~~~~
-All responses will be in the form of the UCP Status response.
-
--  Success: Code: 200, reason: Success
-
-   All etcd nodes have been removed from the specified node.
-
--  Failure: Code: 404, reason: NotFound
-
-   The specified node is not discoverable by Promenade.
-
--  Failure: Code: 500, reason: RemoveEtcdError
-
-   There was a failure to remove etcd from the target node that prevented
-   completion within the specified timeout, or etcd prevented removal of
-   the node because it would result in the cluster being broken. The details
-   section should provide more information about the cause of this failure.
-
-
-Promenade Check etcd
---------------------
-Retrieves the current interpreted state of etcd.
-
-.. code:: json
-
-  GET /etcd-cluster-health-statuses?design_ref={the design ref}
-
-Where the design_ref parameter is required for appropriate operation, and is in
-the same format as used for the join-scripts API.
-
-Query Parameters:
-
--  design_ref: (Required) the design reference to be used to discover etcd
-   instances.
-
-Responses
-~~~~~~~~~
-All responses will be in the form of the UCP Status response.
-
--  Success: Code: 200, reason: Success
-
-   The status of each etcd in the site will be returned in the details section.
-   Valid values for status are: Healthy, Unhealthy
-
-https://github.com/att-comdev/ucp-integration/blob/master/docs/source/api-conventions.rst#status-responses
-
-.. code:: json
-
-  { "...": "... standard status response ...",
-    "details": {
-      "errorCount": {{n}},
-      "messageList": [
-        { "message": "Healthy",
-          "error": false,
-          "kind": "HealthMessage",
-          "name": "{{the name of the etcd service}}"
-        },
-        { "message": "Unhealthy",
-          "error": false,
-          "kind": "HealthMessage",
-          "name": "{{the name of the etcd service}}"
-        },
-        { "message": "Unable to access Etcd",
-          "error": true,
-          "kind": "HealthMessage",
-          "name": "{{the name of the etcd service}}"
-        }
-      ]
-    }
-    ...
-  }
-
--  Failure: Code: 400, reason: MissingDesignRef
-
-   Returned if the design_ref parameter is not specified.
-
--  Failure: Code: 404, reason: NotFound
-
-   Returned if the specified etcd could not be located.
-
--  Failure: Code: 500, reason: EtcdNotAccessible
-
-   Returned if the specified etcd responded with an invalid health response
-   (not just simply unhealthy - that's a 200).
-
-
-Promenade Shutdown Kubelet
---------------------------
-Shuts down the kubelet on the specified node. This is accomplished by Promenade
-setting the label `promenade-decomission: enabled` on the node, which will
-trigger a newly-developed daemonset to run something like:
-`systemctl disable kubelet && systemctl stop kubelet`.
-This daemonset will effectively sit dormant until nodes have the appropriate
-label added, and then perform the kubelet teardown.
-
-.. code:: json
-
-  POST /nodes/{node_id}/shutdown-kubelet
-
-.. note::
-
-   This POST has no message body.
-
-Responses
-~~~~~~~~~
-All responses will be in the form of the UCP Status response.
-
--  Success: Code: 200, reason: Success
-
-   The kubelet has been successfully shut down.
-
--  Failure: Code: 404, reason: NotFound
-
-   The specified node is not discoverable by Promenade.
-
--  Failure: Code: 500, reason: ShutdownKubeletError
-
-   The specified node's kubelet failed to shut down. The details section of the
-   status response should contain reasonable information about the source of
-   this failure.
-
-Promenade Delete Node from Cluster
-----------------------------------
-Updates the Kubernetes cluster, removing the specified node. Promenade should
-check that the node is drained/cordoned and has no labels other than
-`promenade-decomission: enabled`. If either of these checks fails, the API
-should respond with a 409 Conflict response.
-
-.. code:: json
-
-  POST /nodes/{node_id}/remove-from-cluster
-
-.. note::
-
-   This POST has no message body.
-
-Responses
-~~~~~~~~~
-All responses will be in the form of the UCP Status response.
-
--  Success: Code: 200, reason: Success
-
-   The specified node has been removed from the Kubernetes cluster.
-
--  Failure: Code: 404, reason: NotFound
-
-   The specified node is not discoverable by Promenade.
-
--  Failure: Code: 409, reason: Conflict
-
-   The specified node cannot be deleted due to checks that the node is
-   drained/cordoned and has no labels (other than possibly
-   `promenade-decomission: enabled`).
-
--  Failure: Code: 500, reason: DeleteNodeError
-
-   The specified node cannot be removed from the cluster due to an error from
-   Kubernetes. The details section of the status response should contain more
-   information about the failure.
-
-
-Shipyard Tag Releases
----------------------
-Shipyard will need to mark Deckhand revisions with tags when there are
-successful deploy_site or update_site actions, to be able to determine the last
-known good design. This is related to issue 16 for Shipyard, which shares the
-same need.
-
-.. note::
-
-   Repeated from https://github.com/att-comdev/shipyard/issues/16
-
-   When multiple configdocs commits have been done since the last deployment,
-   there is no ready means to determine what's being done to the site. Shipyard
-   should reject deploy site or update site requests that have had multiple
-   commits since the last site true-up action. An option to override this guard
-   should be allowed for the actions in the form of a parameter to the action.
-
-   The configdocs API should provide a way to see what's been changed since the
-   last site true-up, not just the last commit of configdocs. This might be
-   accommodated by new deckhand tags like the 'commit' tag, but for
-   'site true-up' or similar applied by the deploy and update site commands.
-
-The design for issue 16 includes the bare-minimum marking of Deckhand
-revisions. This design is as follows:
-
-Scenario
-~~~~~~~~
-Multiple commits occur between site actions (deploy_site, update_site) - those
-actions that attempt to bring a site into compliance with a site design.
-When this occurs, the current system of being able to only see what has changed
-between the committed and the buffer versions (configdocs diff) is insufficient
-to be able to investigate what has changed since the last successful (or
-unsuccessful) site action.
-To accommodate this, Shipyard needs several enhancements.
-
-Enhancements
-~~~~~~~~~~~~
-
-#. Deckhand revision tags for site actions
-
-   Using the tagging facility provided by Deckhand, Shipyard will tag the end
-   of site actions.
-
-   Upon completing a site action successfully, tag the revision being used with
-   the tag site-action-success, and a body of dag_id:<dag_id>.
-
-   Upon completing a site action unsuccessfully, tag the revision being used
-   with the tag site-action-failure, and a body of dag_id:<dag_id>.
-
-   The completion tags should only be applied upon failure if the site action
-   gets past document validation successfully (i.e. gets to the point where it
-   can start making changes via the other UCP components).
-
-   This could result in a single revision having both site-action-success and
-   site-action-failure if a later re-invocation of a site action is successful.
-
-#. Check for intermediate committed revisions
-
-   Upon running a site action, before tagging the revision with the site action
-   tag(s), the DAG needs to check to see if there are committed revisions that
-   do not have an associated site-action tag. If there are any committed
-   revisions since the last site action other than the current revision being
-   used (between them), then the action should not be allowed to proceed (stop
-   before triggering validations). For the calculation of intermediate
-   committed revisions, assume revision 0 if there are no revisions with a
-   site-action tag (null case). A sketch of this check follows this list.
-
-   If the action is invoked with a parameter of
-   allow-intermediate-commits=true, then this check should log that the
-   intermediate committed revisions check is being skipped and not take any
-   other action.
-
-#. Support an action parameter of allow-intermediate-commits=true|false
-
-   In the CLI for create action, the --param option supports adding parameters
-   to actions. The parameters passed should be relayed by the CLI to the API
-   and ultimately to the invocation of the DAG. The DAG as noted above will
-   check for the presence of allow-intermediate-commits=true. This needs to be
-   tested to work.
-
-#. Shipyard needs to support retrieving configdocs and rendered documents for
-   the last successful site action, and the last site action (successful or not
-   successful):
-
-   --successful-site-action
-   --last-site-action
-
-   These options would be mutually exclusive of --buffer or --committed.
-
-#. Shipyard diff (shipyard get configdocs)
-
-   Needs to support an option to do the diff of the buffer vs. the last
-   successful site action and the last site action (successful or not
-   successful).
-
-   Currently there are no options to select which versions to diff (always
-   buffer vs. committed).
-
-   Support:
-
-   --base-version=committed | successful-site-action | last-site-action (Default = committed)
-   --diff-version=buffer | committed | successful-site-action | last-site-action (Default = buffer)
-
-   Equivalent query parameters need to be implemented in the API.
-
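A sketch of the intermediate-committed-revisions check from item 2 above (the
revision/tag structure and the committed-tag name shown here are assumptions
about what a Deckhand client would return):

.. code:: python

  # Sketch: decide whether a site action may proceed, per item 2 above.
  def intermediate_commits_ok(revisions, current_revision_id,
                              allow_intermediate_commits=False):
      """revisions: list of {'id': int, 'tags': [str, ...]} entries."""
      if allow_intermediate_commits:
          # The check is skipped; the workflow should log that fact.
          return True

      site_action_tags = {"site-action-success", "site-action-failure"}
      last_site_action = max(
          (rev["id"] for rev in revisions
           if site_action_tags & set(rev["tags"])),
          default=0)   # assume revision 0 when no site-action tag exists

      intermediate = [
          rev["id"] for rev in revisions
          if "committed" in rev["tags"]            # tag name is illustrative
          and last_site_action < rev["id"] < current_revision_id]
      return not intermediate
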
-Because the implementation of this design will result in the tagging of
-successful site actions, Shipyard will be able to determine the correct
-revision to use while attempting to tear down a node.
-
-If the request to tear down a node indicates a revision that doesn't exist, the
-command to do so (e.g. redeploy_server) should not continue, but rather fail
-due to a missing precondition.
-
-The invocation of the Promenade and Drydock steps in this design will utilize
-the appropriate tag based on the request (default is successful-site-action) to
-determine the revision of the Deckhand documents used as the design-ref.
-
-Shipyard redeploy_server Action
--------------------------------
-The redeploy_server action currently accepts a target node. Additional
-supported parameters are needed:
-
-#. preserve-local-storage=true, which will instruct Drydock to only wipe the
-   OS drive; any other local storage will not be wiped. This would allow
-   for the drives to be remounted to the server upon re-provisioning. The
-   default behavior is that local storage is not preserved.
-
-#. target-revision=committed | successful-site-action | last-site-action,
-   which will indicate which revision of the design will be used as the
-   reference for what should be re-provisioned after the teardown.
-   The default is successful-site-action, which is the closest representation
-   to the last-known-good state.
-
-These should be accepted as parameters to the action API/CLI and modify the
-behavior of the redeploy_server DAG.

docs/source/index.rst  (+0, -1)

@@ -52,7 +52,6 @@ Conventions and Standards
    :maxdepth: 3

    conventions
-   blueprints/blueprints
    dev-getting-started

