
Split service into API and Worker

Change-Id: Iaeb99ab1954a1d5303c9bd10b81f7f8d6aa7e731
Anton Arefiev, 2 years ago · commit cf953cade1
1 changed file with 349 additions and 0 deletions: specs/splitting-service-on-API-and-worker.rst

..
 This work is licensed under a Creative Commons Attribution 3.0 Unported
 License.

 http://creativecommons.org/licenses/by/3.0/legalcode

============================================
Splitting Inspector into an API and a Worker
============================================

https://bugs.launchpad.net/ironic-inspector/+bug/1525218

This work is part of the `High Availability for Ironic Inspector`_ spec. One
of the items needed to achieve inspector HA and scalability is splitting the
single inspector service into API and worker services. This spec focuses on
a detailed description of the essential part of that work: the internal
communication between these services.


Problem description
===================

**inspector** is a monolithic service consisting of the API, the background
processing, the firewall and the DHCP management. As a result, inspector
isn't capable of dealing well with a sizeable number of **ironic** bare
metal nodes and doesn't fit large-scale deployments. Introducing new
services to solve these issues also brings some complexity, as it requires
a mechanism for internal communication between the services.

Proposed change
===============

Node introspection is a sequence of asynchronous tasks. A task can be
described as an FSM transition of the **inspector** state machine [1]_,
triggered by events such as the ones below (a sketch of these transitions
follows the list):

    * ``starting(wait) -> waiting``
    * ``waiting(process) -> processing``
    * ``waiting(timeout) -> error``
    * ``processing(finish) -> finished``

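As an illustration only (inspector's real state machine is defined in [1]_),
these transitions could be modelled as a plain lookup table; the ``advance``
helper below is hypothetical, not inspector code::

    # Illustrative sketch: the states and events are taken from the
    # list above, the helper itself is not part of inspector.
    TRANSITIONS = {
        ('starting', 'wait'): 'waiting',
        ('waiting', 'process'): 'processing',
        ('waiting', 'timeout'): 'error',
        ('processing', 'finish'): 'finished',
    }

    def advance(state, event):
        """Return the state reached from `state` when `event` fires."""
        try:
            return TRANSITIONS[(state, event)]
        except KeyError:
            raise ValueError('invalid transition: %s(%s)' % (state, event))
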
API requests that are executed in the background can be considered
asynchronous tasks. It is these tasks that allow splitting the service
into the *API* and *Worker* parts, the former creating tasks and the latter
consuming them. The communication between these service parts requires a
medium, the *queue*, and together these three subjects comprise the
`messaging queue paradigm`_. OpenStack projects use an open standard
for messaging middleware known as AMQP_. The oslo.messaging_ library
builds on this middleware and enables services that run on multiple
servers to talk to each other.

Each inspector worker provides a pool of worker threads that get state
transition requests from the API service via the queue. The API service
invokes methods on workers, and each invocation eventually becomes a task.
In other words, there is the ``client`` role, carried out by the API
service, and the ``server`` role, carried out by the worker threads.
Servers make oslo.messaging_ ``RPC`` interfaces available to clients.

Client - inspector API
----------------------

The **inspector** API will implement a simple oslo.messaging_ client, which
will connect to the messaging transport and send messages carrying state
transition events.

There are two ways in which a method can be invoked, see [2]_ (a client
sketch follows the list):

    * cast - the method is invoked asynchronously and no result is returned
      to the caller.

    * call - the method is invoked synchronously and the result is returned
      to the caller.

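A minimal sketch of such a client using oslo.messaging_; the topic name is
an assumption, while ``inspect`` is one of the casts listed in the table
below::

    import oslo_messaging as messaging
    from oslo_config import cfg

    # The topic is hypothetical; workers subscribe to the same one.
    transport = messaging.get_rpc_transport(cfg.CONF)
    target = messaging.Target(topic='inspector.conductor')
    client = messaging.RPCClient(transport, target)

    # cast: fire-and-forget, returns immediately without a result.
    client.cast({}, 'inspect', node_id='<node uuid>')
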
*inspector* endpoints which invoke RPC:

.. table::

    +---------------------------------+----------+---------------------------+--------------------------+
    | Method                          | RPC type |          API              |        Worker            |
    +=================================+==========+===========================+==========================+
    |                                 |          | check provision state,    | add lookup attributes,   |
    |                                 |          | validate power interface, | update pxe filters,      |
    |  POST /introspection/<node_id>  |   cast   | set `starting` state,     | set pxe as boot device,  |
    |                                 |          | <RPC> cast `inspect`      | reboot node,             |
    |                                 |          |                           | set `waiting` state      |
    +---------------------------------+----------+---------------------------+--------------------------+
    |                                 |          | node lookup,              | set `processing` state,  |
    |                                 |          | check provision state,    | run processing hooks,    |
    |    POST /continue               |   cast   | <RPC> cast `process`      | apply rules,             |
    |                                 |          |                           | update pxe filters,      |
    |                                 |          |                           | save introspection data, |
    |                                 |          |                           | power off node,          |
    |                                 |          |                           | set `finished` state     |
    +---------------------------------+----------+---------------------------+--------------------------+
    | POST /introspection/<node_id>   |          | find node in cache,       | force power off,         |
    |      /abort                     |   cast   | <RPC> cast `abort`        | update pxe filters,      |
    |                                 |          |                           | set `error` state        |
    +---------------------------------+----------+---------------------------+--------------------------+
    |                                 |          | find node in cache,       | get introspection data,  |
    |                                 |          | <RPC> cast `reapply`      | set `reapplying` state,  |
    |   POST /introspection/<id>/data |   cast   |                           | run processing hooks,    |
    |        /unprocessed             |          |                           | save introspection data, |
    |                                 |          |                           | apply rules,             |
    |                                 |          |                           | set `finished` state     |
    +---------------------------------+----------+---------------------------+--------------------------+

The resulting workflow for introspection looks like::

 Client           API            Worker           Node           Ironic
   +               +               +                +               +
   | <HTTP>Start   |               |                |               |
   +--inspection--->               |                |               |
   |               X Validate power|                |               |
   |               X interface,    |                |               |
   |               X initiate task |                |               |
   |               X for inspection|                |               |
   |               X               |                |               |
   |               X  <RPC> Cast   |                |               |
   X               +- inspection--->                |               |
   X               |               X Update pxe     |               |
   X               |               X filters, set   |               |
   X               |               X lookup attrs   |               |
   X               |               X                |               |
   X               |               X <HTTP> Set pxe |               |
   X               |               +-boot dev,reboot+--------------->
   X               |               |                |     Reboot    |
   X               |               |                <---------------+
   X               |               |    DHCP, boot, X               |
   X               |               |   Collect data X               |
   X               |               |                X               |
   X               |Send inspection data to inspector               |
   X               <---------------+----------------+               |
   X               X Node lookup,  |                |               |
   X               X verify collect|                |               |
   X               X failures      |                |               |
   X               X               |                |               |
   X               X   <RPC> Cast  |                |               |
   X               +-process data-->                |               |
   X               |               X Run process    |               |
   X               |               X hooks, apply   |               |
   X               |               X rules, update  |               |
   X               |               X filters        |               |
   X               |               X     <HTTP> Set power off       |
   X               |               +----------------+--------------->
   X               |               |                |  Power off    |
   X               +               +                <---------------+


Server - inspector worker
-------------------------

An RPC server exposes **inspector** endpoints containing a set of methods,
which may be invoked remotely by clients over a given transport. The
transport driver will be loaded according to the user's messaging
configuration. See [3]_ for more details on the configuration options.

An **inspector** worker will implement a separate ``oslo.service`` process
with its own pool of green threads. The worker will periodically consume
and handle messages from the clients, as sketched below.

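A minimal sketch of such a worker; the endpoint class, its method bodies and
the topic name are assumptions, while the ``eventlet`` executor provides the
pool of green threads mentioned above::

    import oslo_messaging as messaging
    from oslo_config import cfg

    class WorkerEndpoint(object):
        """Hypothetical endpoint exposing the casts from the table above."""

        def inspect(self, context, node_id):
            # update pxe filters, set pxe boot device, reboot the node, ...
            pass

        def process(self, context, node_id, introspection_data):
            # run processing hooks, apply rules, power off the node, ...
            pass

    transport = messaging.get_rpc_transport(cfg.CONF)
    target = messaging.Target(topic='inspector.conductor', server='worker-1')
    server = messaging.get_rpc_server(transport, target, [WorkerEndpoint()],
                                      executor='eventlet')
    server.start()
    server.wait()
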
RPC reliability
---------------

For each message sent by the client via cast (asynchronously), an
acknowledgement is sent back immediately and the message is removed from
the queue. As a result, there is no guarantee that a worker will handle
the introspection task.

This model, known as `at-most-once delivery`, doesn't guarantee message
processing for asynchronous tasks if the processing worker dies. Supporting
HA may require some additional functionality to confirm that a task message
was processed.

If a worker dies (the connection is closed or lost) while processing
inspection data, the task request message will disappear and the
introspection task will hang in the ``processing`` state until the
`timeout` happens.

Alternatives
------------

Implement our own Publisher/Consumer functionality with the Kombu_ library.
This approach has some benefits:

 * supports `at-least-once delivery` semantics.
   For each message retrieved and handled by a consumer, an acknowledgement
   is sent back to the message producer. In case this acknowledgement is not
   received within a certain amount of time, the message is resent::

     API               Worker thread
      +                      +
      |                      |
      +--------------------->+
      |                      |
      |             +--------+
      |             |        |
      |      Process|        |
      |      Request|        |
      |             |        |
      |             +------->+
      |         ACK          |
      +<---------------------+
      |                      |
      +                      +

   If a consumer dies without sending an ack, the message is not considered
   processed, and if there are other consumers online at the same time, the
   message will be redelivered and reprocessed. A consumer sketch follows.

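   A minimal sketch of such a consumer with Kombu_; the connection URL, the
   queue name and the task handler are illustrative, and the ack is only
   sent after the message has been handled::

     import kombu

     connection = kombu.Connection('amqp://guest:guest@localhost//')
     tasks = kombu.Queue('inspector.tasks')

     def process(body):
         # hypothetical stand-in for the real introspection task handler
         print('processing %s' % body)

     def handle_task(body, message):
         process(body)
         message.ack()   # acknowledge only after successful processing

     # If handle_task raises before ack(), the broker keeps the message
     # and may redeliver it to another consumer.
     with connection.Consumer(tasks, callbacks=[handle_task],
                              accept=['json']):
         connection.drain_events(timeout=5)
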
On the other hand, this approach has considerable drawbacks:

 * Implementing our own Publisher/Consumer.
   It means the complexity of supporting new functionality and the lack of
   supported backends, such as 0MQ_, compared to oslo.messaging.

 * Worse deployer UX.
   The message backend configuration in *inspector* would differ from other
   services (including *ironic*), which brings some pain to deployers.

Data model impact
-----------------

None

HTTP API impact
---------------

The `/continue` endpoint will return `ACCEPTED` (202) instead of `OK` (200).

Client (CLI) impact
-------------------

None

Ironic python agent impact
--------------------------

None

Performance and scalability impact
----------------------------------

The proposed change will allow users to scale **ironic-inspector**, both the
API and the Worker, horizontally after some more work in the future; for
more details, refer to `High Availability for Ironic Inspector`_.

Security impact
---------------

The newly introduced services require additional protection. The messaging
service used as the transport layer, e.g. RabbitMQ_, should rely on
transport-level cryptography; see [4]_ for more details.

Deployer impact
---------------

The newly introduced message bus layer will require a message broker to
connect the inspector API and workers. The most popular broker
implementation used in OpenStack installations is RabbitMQ_; see [5]_ for
more details.

To achieve resiliency, multiple API service and worker service instances
should be deployed on multiple physical hosts.

There are also new configuration options being added; see [3]_.


Developer impact
----------------

Developers will need to take the new architecture and the **inspector** API
and Worker communication details into account when adding new features that
need to be handled as background tasks.

Upgrades and Backwards Compatibility
------------------------------------

The current **inspector** service is a single process, so deployers might
need to add more services: the newly added *inspector* Worker and the
messaging transport backend (RabbitMQ). The ``ironic-inspector`` console
script could be changed to run both the API and Worker services with an
``in-memory`` backend for the messaging transport. This allows running
``ironic-inspector`` in a backwards-compatible manner: both services on a
single host without a message broker.

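For illustration, assuming the broker-less mode is based on oslo.messaging's
in-memory ``fake`` driver (an assumption, not a decided implementation)::

    import oslo_messaging as messaging
    from oslo_config import cfg

    # 'fake://' selects the in-memory driver: the API and the Worker must
    # then run in the same process, since no broker is involved.
    transport = messaging.get_rpc_transport(cfg.CONF, url='fake://')
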
Implementation
==============

Assignee(s)
-----------

* aarefiev (Anton Arefiev)

Work Items
----------

* Add base service functionality;
* Introduce Client/Server workers;
* Implement API/Worker managers;
* Split the service into API and Worker;
* Implement support for these services in Devstack;
* Use WSGI [6]_ to implement the API service.

Dependencies
============

None

Testing
=======

All new functionality will be tested with both functional and unit tests.
The already running Tempest tests, as well as the Grenade upgrade tests,
will also cover the added features.

Functional tests run both the Inspector API and the Worker with an
``in-memory`` backend.

Having all the work items done will eventually allow setting up a
multi-node devstack and testing Inspector in cluster mode.


References
==========

.. [1] `Inspection states <http://git.openstack.org/cgit/openstack/ironic-inspector/tree/ironic_inspector/introspection_state.py>`_

.. [2] `RPC Client <https://docs.openstack.org/developer/oslo.messaging/rpcclient.html>`_

.. [3] `oslo.messaging configuration options <https://docs.openstack.org/developer/oslo.messaging/opts.html>`_

.. [4] `RabbitMQ security <https://docs.openstack.org/security-guide/messaging/security.html>`_

.. [5] `RabbitMQ HA <https://docs.openstack.org/ha-guide/shared-messaging.html>`_

.. [6] `TC Pike WSGI Goal <https://governance.openstack.org/tc/goals/pike/deploy-api-in-wsgi.html>`_

.. _Kombu: http://docs.celeryproject.org/projects/kombu/en/latest/

.. _High Availability for Ironic Inspector: https://specs.openstack.org/openstack/ironic-inspector-specs/specs/HA_inspector.html

.. _oslo.messaging: https://docs.openstack.org/developer/oslo.messaging/index.html

.. _RabbitMQ: https://www.rabbitmq.com

.. _HAProxy: http://www.haproxy.org

.. _0MQ: https://docs.openstack.org/developer/oslo.messaging/zmq_driver.html

.. _`messaging queue paradigm`: https://en.wikipedia.org/wiki/Message_queue

.. _AMQP: http://www.amqp.org/sites/amqp.org/files/amqp.pdf
