
A fresh way of looking at step retrieval

This is an attempt at a fresh, simplistic, current state
driven way to obtain possible steps.

Change-Id: Iee540569380365f945f7e072c12e0c5739128e42
Julia Kreger 6 months ago
parent
commit
b2407ddcff
1 changed files with 343 additions and 0 deletions

+ 343
- 0
specs/backlog/obtaining-steps.rst

@@ -0,0 +1,343 @@
1
+..
2
+ This work is licensed under a Creative Commons Attribution 3.0 Unported
3
+ License.
4
+
5
+ http://creativecommons.org/licenses/by/3.0/legalcode
6
+
7
+===============
8
+Obtaining Steps
9
+===============
10
+
11
+https://storyboard.openstack.org/#!/story/1719925
12
+
13
+https://storyboard.openstack.org/#!/story/1715419
14
+
15
+In Ironic, we have a concept of steps [1]_ to be executed to achieve a task
16
+utilizing a blend of driver code running in the conductor and code operating
17
+inside of the
18
+`ironic-python-agent <https://git.openstack.org/cgit/openstack/ironic-python-agent>`_.
19
+
20
+For this to be useful, we have to show the end user of the API which steps
21
+are available to be performed. Presently, users can only rely upon
22
+documentation and upon reading the code, including any modules that could be
23
+loaded.
24
+
25
+This issue is further compounded as the entire list of steps is a union
26
+of information identified from the ``ironic-conductor`` process managing the
27
+node and the ``ironic-python-agent`` process executing upon the node.
28
+
29
+.. Note::
30
+   This document is in the backlog because there are implementation issues
31
+   with this feature. Please see Gerrit change
32
+   `606199 <https://review.openstack.org/#/c/606199/4>`_ for more information.
33
+
34
+Problem description
35
+===================
36
+
37
+* API users presently have to rely upon documentation of steps to know
38
+  what is available.
39
+
40
+* Different steps may be available with different hardware managers.
41
+
42
+* With the increasing use of the Deploy Steps [2]_ framework, new steps
43
+  should be anticipated to be added with new releases of Ironic.
44
+
45
+* The ``ironic-python-agent`` must be running to obtain a complete list
46
+  of steps.
47
+
48
+Proposed change
49
+===============
50
+
51
+In order to keep this solution relatively lightweight, there are two
52
+fundamental changes that will be needed in order to facilitate visibility.
53
+
54
+This does not seek to provide complete visibility by creating additional
55
+processes. Instead, it seeks to provide tools to collect data, with the
56
+limitation that we can only return the information that is currently
57
+available.
58
+
59
+How to do it?
60
+-------------
61
+
62
+Step 1
63
+~~~~~~
64
+
65
+The initial step is to provide an API endpoint that returns the list of
66
+steps currently known to the conductor for a node.
67
+The request would flow from the API endpoint, to an RPC method, to a
68
+conductor manager method, which would then return the list of steps while
69
+tolerating the absence of ``ironic-python-agent``.
70
+
71
+.. Note::
72
+   The ironic community consensus is that this feature should cache steps
73
+   and return those cached steps as available to the user.
74
+
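As a minimal sketch of the behavior described in Step 1, the conductor-side method could merge the steps it already knows about with whatever agent steps were last cached, and simply omit the agent entry when nothing has been cached. The function name ``collect_steps``, its arguments, and the shape of the cache are assumptions made for illustration; they are not part of Ironic's existing code.

```python
def collect_steps(conductor_steps, cached_agent_steps=None):
    """Return step listings grouped by source.

    Hypothetical helper. Both arguments are dicts mapping a step type
    ("clean" or "deploy") to a list of step dicts. ``cached_agent_steps``
    is None when no agent data has been cached for the node.
    """
    result = [{"source": "conductor", **conductor_steps}]
    if cached_agent_steps is not None:
        # Only report agent steps when a cached copy exists; the absence
        # of ironic-python-agent is tolerated rather than treated as an error.
        result.append({"source": "agent", **cached_agent_steps})
    return result


conductor = {
    "deploy": [{"interface": "deploy", "step": "deploy", "priority": 100}],
    "clean": [],
}
without_agent = collect_steps(conductor)
with_agent = collect_steps(
    conductor,
    {"clean": [{"interface": "deploy", "step": "erase_devices", "priority": 10}]},
)
```

With no cached agent data the result contains only the ``conductor`` entry, which matches the requirement that the call succeed while the agent is not running.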
75
+Step 2
76
+~~~~~~
77
+
78
+Addition of a ``hold`` provision state verb and ``holding`` state.
79
+
80
+.. Note::
81
+   During a specific planning and discussion meeting to determine the path
82
+   for a feature such as this, the ironic community reached a consensus on
83
+   the call that a holding state would be useful, and could likely be
84
+   implemented aside from the API functionality proposed in this backlog
85
+   specification.
86
+
87
+ +-----------------+-------------------+---------------------------------+
88
+ | *Initial State* | *Temporary State* | *Possible next verbs*           |
89
+ +-----------------+-------------------+---------------------------------+
90
+ | manageable      | holding           | manage, clean, provide, active, |
91
+ |                 |                   | inspect                         |
92
+ +-----------------+-------------------+---------------------------------+
93
+ | available       | holding           | active, manage, provide         |
94
+ +-----------------+-------------------+---------------------------------+
95
+
96
+
97
+With the invocation of the state:
98
+
99
+* The machine is moved to the provisioning network.
100
+
101
+.. Note::
102
+   There is a slight issue with this transition: to be cleaned, the node
103
+   would realistically need to be on the cleaning network. Operationally,
104
+   changing the DHCP address is problematic, as we have learned with the
105
+   rescue feature.
106
+
107
+* The deployment ramdisk is booted.
108
+* The ``ironic-python-agent`` would then be left in a running
109
+  state, allowed to heartbeat (or be polled), and the API
110
+  endpoint added in the prior step would fetch a complete
111
+  list of steps that can be executed.
112
+
113
+Alternatives
114
+------------
115
+
116
+An alternative to this solution would be to provide an async API endpoint
117
+to perform the steps detailed in step 2, and cache the data which could then
118
+be retrieved by the user asynchronously. In this case, the user would have
119
+to poll the API to determine if the cached information has been updated.
120
+
121
+The conundrum is that this would have to be constrained by states, which
122
+means we would still need to build state machine states around this to
123
+represent the current operation to users.
124
+
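The polling that the asynchronous alternative would require of users can be sketched as follows. This is only an illustration under stated assumptions: ``wait_for_cache_update`` is a hypothetical client-side helper, ``fetch`` stands in for whatever call retrieves the cached steps, and detecting an update by comparing against the previously seen value is our assumption rather than anything this spec defines.

```python
import time


def wait_for_cache_update(fetch, previous, attempts=10, delay=1.0,
                          sleep=time.sleep):
    """Poll ``fetch`` until it returns something other than ``previous``.

    Hypothetical helper. Returns the new value, or None if the cached
    data did not change within ``attempts`` polls.
    """
    for _ in range(attempts):
        current = fetch()
        if current != previous:
            return current
        sleep(delay)
    return None


# Simulated API responses: the cache changes on the third poll.
responses = iter([["erase_devices"], ["erase_devices"],
                  ["erase_devices", "create_configuration"]])
updated = wait_for_cache_update(lambda: next(responses), ["erase_devices"],
                                attempts=5, delay=0, sleep=lambda _: None)
unchanged = wait_for_cache_update(lambda: ["erase_devices"], ["erase_devices"],
                                  attempts=3, delay=0, sleep=lambda _: None)
```

The need for every client to carry a loop like this is part of why this alternative is less attractive than exposing an explicit state.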
125
+Data model impact
126
+-----------------
127
+
128
+None
129
+
130
+State Machine Impact
131
+--------------------
132
+
133
+As noted above, we would add a new ``hold`` verb and a ``holding`` state,
134
+from which a node can transition back to its prior state. The ``hold`` verb
135
+would only be accessible from the ``manageable`` and ``available`` states.
136
+
137
+In this holding state, API users would be able to request logical next steps,
138
+in-line with the present state, as detailed in the table above.
139
+
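A rough sketch of these transition rules, assuming the verbs listed in the table earlier in this spec, might look like the following. The ``HeldNode`` class and both constants are hypothetical names used only for illustration; Ironic's real state machine is defined elsewhere and is not reproduced here.

```python
# States from which the ``hold`` verb is accepted (per this spec).
HOLD_ALLOWED_FROM = {"manageable", "available"}

# Verbs accepted while holding, keyed by the state the node was held from.
NEXT_VERBS = {
    "manageable": {"manage", "clean", "provide", "active", "inspect"},
    "available": {"active", "manage", "provide"},
}


class HeldNode:
    """Hypothetical stand-in for a node, tracking only provision state."""

    def __init__(self, provision_state):
        self.provision_state = provision_state
        self.prior_state = None

    def hold(self):
        if self.provision_state not in HOLD_ALLOWED_FROM:
            raise ValueError(
                "hold is not allowed from %s" % self.provision_state)
        # Remember where the node came from so that the permitted next
        # verbs can be derived while it is holding.
        self.prior_state = self.provision_state
        self.provision_state = "holding"

    def allowed_verbs(self):
        if self.provision_state != "holding":
            return set()
        return NEXT_VERBS[self.prior_state]


node = HeldNode("available")
node.hold()
```

After ``node.hold()``, ``node.allowed_verbs()`` yields the verbs from the ``available`` row of the table, while calling ``hold()`` on a node in any other state raises ``ValueError``.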
140
+REST API impact
141
+---------------
142
+
143
+The node object returned would expose additional ``provision_state`` states,
144
+however this is a known quantity with all state machine impacts.
145
+
146
+An additional provision state target verb, ``hold``, would be added to
147
+trigger the state machine change.
148
+
149
+An endpoint will be added to enable an API user to retrieve the list of
150
+known steps via the RPC interface and the conductor. It will be invoked
151
+with a GET request.
152
+
153
+.. Note::
154
+   Community consensus is that we should not be initiating a synchronous call
155
+   to IPA to collect data, that we should instead return cached data and
156
+   somehow trigger the cache to be updated.
157
+
158
+Example::
159
+
160
+   GET /v1/nodes/{node_ident}/steps[?type=(clean|deploy)]
161
+   {
162
+     [{"source": "conductor",
163
+       "deploy": [
164
+         {
165
+           "interface": "deploy",
166
+           "step": "deploy",
167
+           "priority": 100
168
+         }
169
+       ],
170
+       "clean": [
171
+         {
172
+           "interface": "deploy",
173
+           "step": "erase_devices",
174
+           "reboot_requested": false,
175
+           "priority": 10,
176
+           "abortable": true
177
+         },
178
+         {
179
+           "interface": "bios",
180
+           "step": "apply_configuration",
181
+           "args": {....},
182
+           "priority": 0
183
+         },
184
+         {
185
+           "interface": "raid",
186
+           "step": "create_configuration",
187
+           "args": {....},
188
+           "priority": 0
189
+         },
190
+         {
191
+           "interface": "raid",
192
+           "step": "delete_configuration",
193
+           "args": {....},
194
+           "priority": 0
195
+         }
196
+       ]
197
+     },
198
+     {"source": "agent",
199
+     ...
200
+     }
201
+     ]
202
+   }
203
+
204
+If a specific ``type`` is requested, then the request shall only return the
205
+requested type of steps. If no type is defined, both sets will be returned
206
+to the caller.
207
+
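The ``type`` filtering described above could be sketched as below. ``filter_steps`` and ``VALID_TYPES`` are hypothetical names, and raising ``ValueError`` for an unknown type is our stand-in for the 400 response; how the real API layer would validate the parameter is not defined by this spec.

```python
VALID_TYPES = ("clean", "deploy")


def filter_steps(entries, step_type=None):
    """Restrict per-source step listings to one step type.

    Hypothetical helper. ``entries`` is a list of dicts shaped like the
    example response: a "source" key plus one list per step type.
    """
    if step_type is None:
        # No type requested: both sets are returned to the caller.
        return entries
    if step_type not in VALID_TYPES:
        # The API layer would translate this into a 400 response.
        raise ValueError("unknown step type: %s" % step_type)
    return [
        {"source": entry["source"], step_type: entry.get(step_type, [])}
        for entry in entries
    ]


entries = [
    {"source": "conductor",
     "deploy": [{"step": "deploy"}],
     "clean": [{"step": "erase_devices"}]},
    {"source": "agent", "clean": []},
]
deploy_only = filter_steps(entries, "deploy")
```

A source that has no steps of the requested type is reported with an empty list rather than dropped, which keeps the response shape stable for callers; that choice is also an assumption.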
208
+Normal response code: 200
209
+Expected error codes::
210
+
211
+  * 400 with malformed request
212
+  * 503 upon conductor error
213
+
214
+.. NOTE::
215
+   API micro-version will be incremented in accordance with standard
216
+   procedure.
217
+
218
+
219
+Client (CLI) impact
220
+-------------------
221
+
222
+"ironic" CLI
223
+~~~~~~~~~~~~
224
+None
225
+
226
+"openstack baremetal" CLI
227
+~~~~~~~~~~~~~~~~~~~~~~~~~
228
+
229
+The ``openstack baremetal node steps`` and ``openstack baremetal node hold``
230
+commands will be added to give CLI users access to the functionality
+exposed by this API.
231
+
232
+RPC API impact
233
+--------------
234
+
235
+A new RPC method called ``get_steps`` will need to be added.
236
+It will support a single argument indicating which class of
237
+steps is being requested by the API user.
238
+
239
+Driver API impact
240
+-----------------
241
+
242
+None
243
+
244
+Nova driver impact
245
+------------------
246
+
247
+None is required for this feature.
248
+
249
+That being said, there is value in allowing a node that is being held to be
250
+scheduled for a deployment. As such, it could be an optional
251
+enhancement which could save quite a bit of time in a deployment process.
252
+This could be enabled by allowing nova to consider a node in the ``holding``
253
+state to be available for deployments by also evaluating the
254
+``target_provision_state`` for nodes in ``holding``. It would be
255
+fairly tight coupling, but a frequent ask is for faster deployments,
256
+and it would be a route that we could take to enable such
257
+functionality in terms of "holding for deployment".
258
+
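To make the optional enhancement concrete, a scheduling check along these lines is one possibility. This is purely a sketch: ``is_schedulable`` is a hypothetical function, and the assumption that a held node advertises ``available`` in ``target_provision_state`` when it is being held for deployment is ours; this spec does not define what value that field would carry.

```python
def is_schedulable(provision_state, target_provision_state=None):
    """Decide whether a node could be offered for deployment.

    Hypothetical check. Assumes a node held for deployment reports a
    ``target_provision_state`` of "available" while in ``holding``.
    """
    if provision_state == "available":
        return True
    # Nodes that are holding are only considered when their target state
    # indicates they are being held for deployment (an assumption).
    return (provision_state == "holding"
            and target_provision_state == "available")


held_for_deploy = is_schedulable("holding", "available")
held_for_other = is_schedulable("holding", "manageable")
```

The tight coupling mentioned above is visible here: the scheduler has to understand both the new state and the meaning of the target state for nodes in it.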
259
+Ramdisk impact
260
+--------------
261
+
262
+None
263
+
264
+Security impact
265
+---------------
266
+
267
+None
268
+
269
+Other end user impact
270
+---------------------
271
+
272
+None
273
+
274
+Scalability impact
275
+------------------
276
+
277
+None
278
+
279
+Performance Impact
280
+------------------
281
+
282
+None
283
+
284
+Other deployer impact
285
+---------------------
286
+
287
+None
288
+
289
+Developer impact
290
+----------------
291
+
292
+None
293
+
294
+Implementation
295
+==============
296
+
297
+Assignee(s)
298
+-----------
299
+
300
+Primary assignee:
301
+  Julia Kreger (TheJulia) <juliaashleykreger@gmail.com>
302
+
303
+Other contributors:
304
+  ?
305
+
306
+Work Items
307
+----------
308
+
309
+* Implement API to retrieve a list of steps.
310
+* Implement State machine changes to allow an idle agent instance to return
311
+  cleaning step data.
312
+* Add API tests to ironic-tempest-plugin.
313
+* Update state machine documentation.
314
+* Add Admin documentation.
315
+* Update CLI documentation.
316
+
317
+Dependencies
318
+============
319
+
320
+None
321
+
322
+Testing
323
+=======
324
+
325
+Basic API contract and state testing should be sufficient for this feature.
326
+
327
+Upgrades and Backwards Compatibility
328
+====================================
329
+
330
+N/A. The existing rolling upgrades and RPC version pinning practice should
331
+be more than sufficient to support this feature.
332
+
333
+Documentation Impact
334
+====================
335
+
336
+Additional details will need to be added to the Admin guide.
337
+State documentation will need to be updated.
338
+Update client documentation for new state verb.
339
+
340
+References
341
+==========
342
+.. [1] Manual cleaning - https://specs.openstack.org/openstack/ironic-specs/specs/5.0/manual-cleaning.html
343
+.. [2] Deploy Steps - https://specs.openstack.org/openstack/ironic-specs/specs/11.1/deployment-steps-framework.html
