..
 This work is licensed under a Creative Commons Attribution 3.0 Unported
 License.

 http://creativecommons.org/licenses/by/3.0/legalcode

=======================================
High Availability for Ironic Inspector
=======================================

Ironic inspector is a service that allows bare metal nodes to be
introspected dynamically; it currently isn't redundant.  The goal of
this blueprint is to suggest *conceptual changes* to the inspector
service that would make the inspector redundant while maintaining both
the current inspection feature set and API.

Problem description
===================

Inspector is a compound service consisting of the inspector API
service, the firewall and the DHCP (PXE) service.  Currently, all
three components run as a single instance on a shared host per
OpenStack deployment.  A failure of the host or any of the services
renders introspection unavailable and prevents the cloud administrator
from enrolling new hardware or from booting already enrolled bare
metal nodes.  Furthermore, Inspector isn't designed to cope well with
the amount of hardware required for Ironic bare metal usage at large
scale.  With a site size of 10k bare metal nodes in mind, we aim for
the inspector to sustain a batch load of a couple of hundred
introspection/enroll requests interleaved with a couple of minutes of
silence, while maintaining a couple of thousand firewall blacklist
entries.  We refer to this use case as *bare metal to tenant*.

Below we describe the current Inspector service architecture and the
consequences of an Inspector process instance failure.

Introspection process
---------------------

Node introspection is a sequence of asynchronous steps, controlled by
the inspector API service, that take various amounts of time to
finish.  One could describe these steps as states of a transition
system, advanced by events as follows:

* ``starting``: the initial state; the system is advanced into this
  state by receiving an introspect API request.  Introspection
  configuration and set-up steps are performed while in this state.
* ``waiting``: the introspection image is booting on the node.  The
  system advances to this state automatically.
* ``processing``: the introspection image has booted and collected the
  necessary information from the node.  This information is being
  processed by plug-ins to validate node status.  The system is
  advanced to this state upon receiving the ``continue`` REST API
  request.
* ``finished``: introspection is done and the node is powered off.
  The system is advanced to this state automatically.

In case of an API service failure, nodes between the ``starting`` and
``finished`` states will lose their state and may require manual
intervention to recover.  No more nodes can be processed either,
because the API service runs as a single instance per deployment.

Firewall configuration
----------------------

To minimize interference with normally deployed nodes, inspector
deploys temporary firewall rules so that only nodes being inspected
can access its PXE boot service.  This is implemented as a blacklist
containing the MAC addresses of nodes known to ironic but not being
inspected by inspector; a blacklist is required because a node's MAC
address isn't known before it boots for the first time.

Depending on the point at which the API service fails while the
firewall and DHCP services are intact, the firewall configuration may
get out of sync and may lead to interference with normal node booting:

* firewall chain set-up (init phase): Inspector's dnsmasq service is
  exposed to all nodes
* firewall synchronization periodic task: new nodes added to Ironic
  aren't blacklisted
* node introspection finished: the node won't be blacklisted

On the other hand, no boot interference is expected when running all
services (inspector, firewall and DHCP) on the same host, as all
services are lost together.  Losing the API service during the
clean-up periodic task should not matter, as the nodes concerned will
be kept blacklisted during the service downtime.

DHCP (PXE) service
------------------

The inspector service doesn't manage the DHCP service directly;
rather, it just requires that DHCP is properly set up and shares the
host with the API service and the firewall.  We'd nevertheless like to
briefly describe the consequences of the DHCP service failing.

In case of a DHCP service failure, inspected nodes won't be able to
boot the introspection ramdisk and will eventually fail to get
inspected because of a timeout.  The nodes may loop retrying to boot,
depending on their firmware configuration.

A fail-over of DHCP from the active to the back-up host (`dnsmasq
<http://www.thekelleys.org.uk/dnsmasq/doc.html>`_ usually) would
manifest as nodes under introspection timing out while booting, or as
nodes that have already booted (and hold an address lease) getting
into an address conflict with another booting node.  There's not much
that can be done about the former besides retrying.  To prevent the
latter, the DHCP configuration used for introspection should use
disjoint address pools served by the DHCP instances, as recommended in
the `IP address allocation between servers
<https://tools.ietf.org/html/draft-ietf-dhc-failover-12#section-5.4>`_
section of the DHCP Failover Protocol draft.  We also recommend
setting the ``dhcp-sequential-ip`` option in the dnsmasq configuration
file to avoid conflicts within the address pools.  See the `related
bug report <https://bugzilla.redhat.com/show_bug.cgi?id=1301659#c20>`_
for more details on the issue.  Since introspection is ephemeral,
synchronizing leases between the DHCP instances isn't necessary,
provided that restarting an introspection isn't an issue.

Other Inspector parts
---------------------

* periodic introspection status clean-up, removing old inspection data
  and finishing timed-out introspections
* synchronizing the set of nodes with ironic
* limiting the node power-on rate with a shared lock and a timeout

Proposed change
===============

In considering the problem of high availability, we propose a solution
that consists of a distributed, shared-nothing, active-active
implementation of all services that comprise the ironic inspector.
From the user's point of view, we suggest serving the API through a
*load balancer*, such as `HAProxy <http://www.haproxy.org/>`_, in
order to maintain a single entry point for the API service (e.g. a
floating IP address).

HA Node Introspection decomposition
-----------------------------------

Node introspection being a state transition system, we focus on
*decentralizing* it.  We therefore replicate the introspection state
of each node through the distributed store to all inspector process
instances.  We suggest that both the automatic state advancing
requests and the API state advancing requests are performed
asynchronously by independent workers.

HA Worker
---------

Each inspector process provides a pool of asynchronous workers that
get state transition requests from a queue.  We use separate
``queue.get`` and ``queue.consume`` calls to avoid losing state
transition requests due to worker failures.  This, however, introduces
*at-least-once* delivery semantics for the requests.  We therefore
rely on the `transition-function`_ to handle the request delivery
gracefully.  We suggest two kinds of state-transition handling with
regard to the at-least-once delivery semantics:

Strict (non-reentrant-task) Transition Specification
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

* `Getting a request`_
* `Calculating new node state`_
* `Updating node state`_
* `Executing a task`_
* `Consuming a request`_

Reentrant Task Transition Specification
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

* `Getting a request`_
* `Calculating new node state`_
* `Executing a task`_
* `Updating node state`_
* `Consuming a request`_

A strict transition protecting a state change may lead to a situation
in which the recorded introspection state no longer corresponds to the
actual state of the node --- if a worker partitions right after having
successfully executed the task but just before consuming the request
from the queue.  As a consequence, the transition request, not having
been consumed, will be encountered again by (another) worker.  One can
refer to this behavior as a *reentrancy glitch* or *déjà vu*.

Since the goal is to protect the inspected node from going through the
same task again, we rely on the state transition system to handle this
situation by navigating to the ``error`` state instead.

Removing a node
^^^^^^^^^^^^^^^

The `Ironic synchronization periodic task`_ puts node delete requests
on the queue.  Workers perform the following steps to handle them:

* `Getting a request`_
* worker removes the node from the store
* `Consuming a request`_

A failure of the store to remove the node isn't a concern here, as the
periodic task will try again later.  It is therefore safe to always
consume the request here.

Shutting Down HA Inspector Processes
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

All inspector process instances register a ``SIGTERM`` callback.  To
notify inspector worker threads, the ``SIGTERM`` callback sets the
``sigterm_flag`` upon signal delivery.  The flag is process-local and
its purpose is to allow inspector processes to perform a
controlled/graceful shutdown.  For this mechanism to work, potentially
blocking operations (such as ``queue.get``) have to be used with a
configurable timeout value within the workers.  All sleep calls
throughout the process instance should be interruptible, possibly
implemented as ``sigterm_flag.wait(sleep_time)`` or similar.
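
A minimal sketch of this mechanism, assuming a process-wide
``threading.Event`` plays the role of the ``sigterm_flag`` (the names
are illustrative, not part of the current code base):

.. code-block:: python

    import signal
    import threading

    # Process-local flag shared by all worker threads.
    sigterm_flag = threading.Event()

    def _handle_sigterm(signum, frame):
        # Only flip the flag; workers notice it at their next check.
        sigterm_flag.set()

    signal.signal(signal.SIGTERM, _handle_sigterm)

    def interruptible_sleep(seconds):
        """Sleep that returns early if a shutdown was requested."""
        return sigterm_flag.wait(timeout=seconds)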

Getting a request
^^^^^^^^^^^^^^^^^

* any worker instance may execute any request the queue contains
* worker gets a state transition or node delete request from the queue
* if the ``SIGTERM`` flag is set, the worker stops
* if ``queue.get`` timed out (the task is ``None``), the worker polls
  the queue again
* worker locks the bare metal node related to the request
* if locking failed, the worker polls the queue again without
  consuming the request

Calculating new node state
^^^^^^^^^^^^^^^^^^^^^^^^^^

* worker instantiates a state transition system instance for the
  current node state
* if instantiating failed (e.g. no such node in the store), the worker
  performs `Retrying a request`_
* worker advances the state transition system
* if the state machine is jammed (illegal state transition request),
  the worker performs `Consuming a request`_

Updating node state
^^^^^^^^^^^^^^^^^^^

The introspection state is kept in the store, visible to all worker
instances.

* worker saves the node state in the store
* if saving the node state in the store failed (e.g. the node has been
  removed), the worker performs `Retrying a request`_

Executing a task
^^^^^^^^^^^^^^^^

* worker performs the task bound to the transition request
* if the task result is a transition request, the worker puts it on
  the queue

Consuming a request
^^^^^^^^^^^^^^^^^^^

* worker consumes the state transition request from the queue
* worker releases the related node lock
* worker continues from the beginning

Retrying a request
^^^^^^^^^^^^^^^^^^

* worker releases the node lock
* worker continues from the beginning without consuming the request,
  so that it is retried later
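
A condensed sketch of the worker loop described above, combining the
steps for the strict (non-reentrant) case; the ``queue``, ``store``
and ``dlm`` objects and their methods are assumptions used for
illustration only:

.. code-block:: python

    def worker_loop(queue, store, dlm, sigterm_flag, get_timeout=3.0):
        while not sigterm_flag.is_set():
            request = queue.get(timeout=get_timeout)
            if request is None:
                continue  # queue.get timed out; poll again

            lock = dlm.get_lock(request.node_id)
            if not lock.acquire(blocking=False):
                continue  # node busy; leave the request for later

            try:
                machine = store.load_state_machine(request.node_id)
                if machine is None:
                    continue  # node gone; retry the request later
                if not machine.can_advance(request.event):
                    queue.consume(request)  # jammed; drop the request
                    continue
                # Strict handling: persist the new state first, then
                # execute the task, then consume the request.
                store.save_state(request.node_id,
                                 machine.advance(request.event))
                follow_up = machine.run_task(request.event)
                if follow_up is not None:
                    queue.put(follow_up)
                queue.consume(request)
            finally:
                lock.release()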

Introspection State-Transition System
-------------------------------------

Node introspection state is managed by a worker-local instance of a
state transition system.  The state transition function is as follows.

.. compound::

   .. _transition-function:

   .. table:: Transition function

      +----------------+-----------------------+------------------------------------+
      | State          | Event                 | Target                             |
      +================+=======================+====================================+
      | N/A            | Inspect               | Starting                           |
      +----------------+-----------------------+------------------------------------+
      | Starting*      | Inspect               | Starting                           |
      +----------------+-----------------------+------------------------------------+
      | Starting*      | S~                    | Waiting                            |
      +----------------+-----------------------+------------------------------------+
      | Waiting        | S~                    | Waiting                            |
      +----------------+-----------------------+------------------------------------+
      | Waiting        | Timeout               | Error                              |
      +----------------+-----------------------+------------------------------------+
      | Waiting        | Abort                 | Error                              |
      +----------------+-----------------------+------------------------------------+
      | Waiting        | Continue!             | Processing                         |
      +----------------+-----------------------+------------------------------------+
      | Processing     | Continue!             | Error                              |
      +----------------+-----------------------+------------------------------------+
      | Processing     | F~                    | Finished                           |
      +----------------+-----------------------+------------------------------------+
      | Finished+      | Inspect               | Starting                           |
      +----------------+-----------------------+------------------------------------+
      | Finished+      | Abort                 | Error                              |
      +----------------+-----------------------+------------------------------------+
      | Error+         | Inspect               | Starting                           |
      +----------------+-----------------------+------------------------------------+

   .. table:: Legend

      +------------+-----------------------------+
      | Expression | Meaning                     |
      +============+=============================+
      | State*     | the initial state           |
      +------------+-----------------------------+
      | State+     | the terminal/accepting state|
      +------------+-----------------------------+
      | State~     | the automatic event         |
      |            | originating in State        |
      +------------+-----------------------------+
      | Event!     | strict/non-reentrant        |
      |            | transition event            |
      +------------+-----------------------------+

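A minimal sketch of the transition function as a lookup table, with
the automatic events modelled as ``AUTO`` and the strict ``Continue!``
event routed to ``error`` on repetition; state and event names follow
the table above, while the code itself is illustrative only:

.. code-block:: python

    STARTING, WAITING, PROCESSING, FINISHED, ERROR = (
        'starting', 'waiting', 'processing', 'finished', 'error')
    INSPECT, AUTO, TIMEOUT, ABORT, CONTINUE = (
        'inspect', 'auto', 'timeout', 'abort', 'continue')

    # (state, event) -> target state; a missing key means the machine
    # is jammed and the request should simply be consumed.
    TRANSITIONS = {
        (None, INSPECT): STARTING,
        (STARTING, INSPECT): STARTING,
        (STARTING, AUTO): WAITING,
        (WAITING, AUTO): WAITING,
        (WAITING, TIMEOUT): ERROR,
        (WAITING, ABORT): ERROR,
        (WAITING, CONTINUE): PROCESSING,
        (PROCESSING, CONTINUE): ERROR,   # reentrancy glitch -> error
        (PROCESSING, AUTO): FINISHED,
        (FINISHED, INSPECT): STARTING,
        (FINISHED, ABORT): ERROR,
        (ERROR, INSPECT): STARTING,
    }

    def advance(state, event):
        """Return the new state, or raise if the transition is illegal."""
        try:
            return TRANSITIONS[(state, event)]
        except KeyError:
            raise RuntimeError('jammed: %s cannot handle %s' % (state, event))
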
.. _timer-decomposition:

HA Singleton Periodic task decomposition
----------------------------------------

The ironic inspector service houses a couple of periodic tasks.  At
any point, at most a single "instance" of each periodic task flavor
should be running, no matter how many process instances there are.
For this purpose, the processes form a distributed periodic task
management party.

Process instances register a ``SIGTERM`` callback that, when the
signal is delivered, makes the process instance leave the party and
set the ``reset_flag``.

The process instances install a watch on the party.  Upon party
shrinkage, the processes reset their periodic task (if they have one
set) by triggering the ``reset_flag``, and participate in a new
distributed periodic task management leader election.  Party growth
isn't of concern to the processes.

Because the task is reset upon party shrinkage, a custom flag has to
be used to stop the periodic task instead of the ``sigterm_flag``;
otherwise, setting the ``sigterm_flag`` because of a party change
would stop the whole service.

The leader process executes the periodic task loop.  Upon an exception
or partitioning (mind the `partitioning-concerns`_), the leader stops
by flipping the ``sigterm_flag`` in order for the inspector service to
stop.  The periodic task loop is stopped eventually as it performs
``reset_flag.wait(period)`` instead of sleeping.

The periodic task management should happen in a separate asynchronous
thread instance, one per periodic task.  Losing the leader due to its
error (or partitioning) isn't a concern --- a new one will eventually
be elected and only a couple of periodic task runs will be wasted
(including those that died together with the leader).
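
A rough sketch of the intended wiring using the Tooz coordination API;
the backend URL, group name and the ``run_cleanup`` callable are
illustrative assumptions, not settled implementation details:

.. code-block:: python

    import threading
    import uuid

    from tooz import coordination

    reset_flag = threading.Event()

    def periodic_cleanup(period, run_cleanup):
        # Executed only while this process is the elected leader.
        while not reset_flag.wait(period):
            run_cleanup()

    def start_party(backend_url, period, run_cleanup):
        member_id = uuid.uuid4().hex.encode()
        coord = coordination.get_coordinator(backend_url, member_id)
        coord.start()
        try:
            coord.create_group(b'inspector-periodic').get()
        except coordination.GroupAlreadyExist:
            pass
        coord.join_group(b'inspector-periodic').get()

        def on_shrinkage(event):
            # Reset the local task and take part in a new election.
            reset_flag.set()

        def on_elected(event):
            reset_flag.clear()
            threading.Thread(target=periodic_cleanup,
                             args=(period, run_cleanup)).start()

        coord.watch_leave_group(b'inspector-periodic', on_shrinkage)
        coord.watch_elected_as_leader(b'inspector-periodic', on_elected)
        # coord.run_watchers() and coord.heartbeat() have to be driven
        # periodically elsewhere for the callbacks to fire.
        return coord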

HA Periodic clean-up decomposition
----------------------------------

Clean-up should be implemented as independent HA singleton periodic
tasks with a configurable time period, one for the introspection
timeout task and one for the ironic synchronization task.

Introspection timeout periodic task
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

To finish introspections that are timing out:

* select nodes for which the introspection is timing out
* for each such node, put a request to time out the introspection on
  the queue for a worker to process

Ironic synchronization periodic task
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

To remove nodes no longer tracked by Ironic:

* select nodes that are kept by Inspector but not kept by Ironic
* for each such node, put a request to delete the node on the queue
  for a worker to process

HA Reboot Throttle Decomposition
--------------------------------

As a workaround for some hardware, the reboot request rate should be
limited.  For this purpose, a single distributed lock instance should
be utilized.  At any point in time, only a single worker may hold the
lock while performing the reboot (power-on) task.  Upon acquiring the
lock, the reboot state transition sleeps in an interruptible fashion
for a configurable quantum of time.  If the sleep was indeed
interrupted, the worker should raise an exception, stopping the reboot
procedure and the worker itself.  This interruption should happen as
part of the graceful shutdown mechanism.  It should be implemented
utilizing the same ``SIGTERM`` flag/event workers use to check for
pending shutdown: ``sigterm_flag.wait(timeout=quantum)``.
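
A sketch of the throttle around the power-on call, using a Tooz
distributed lock; the lock name, ``quantum`` default and ``power_on``
callable are illustrative assumptions:

.. code-block:: python

    def throttled_reboot(coord, sigterm_flag, power_on, quantum=10.0):
        # A single distributed lock instance shared by all workers.
        with coord.get_lock(b'inspector-reboot-throttle'):
            # Pace reboots: sleep interruptibly while holding the lock.
            if sigterm_flag.wait(timeout=quantum):
                raise RuntimeError('shutdown requested; aborting reboot')
            power_on()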

Process partitioning isn't a concern here because each worker sleeps
while holding the lock.  Partitioning therefore slows down the reboot
pace by, at most, the amount of time the lock takes to expire.  It
should be possible to disable the reboot throttle altogether through
the configuration.

HA Firewall decomposition
-------------------------

The PXE boot environment is configured and active on all inspector
hosts.  The firewall protection of the PXE environment is also active
on all inspector hosts, blocking each host's PXE service.  At any
given point in time, at most one inspector host's PXE service is
available, and it is available to all inspected nodes.

Building blocks
^^^^^^^^^^^^^^^

The general policy is allow-all, and each node that is not being
inspected has a block exception to the general policy.  Due to its
size, the black-list is maintained locally on all inspector hosts,
pulling items from ironic periodically or asynchronously from a
pub--sub channel.

Nodes that are being introspected are white-listed in a separate set
of firewall rules.  Nodes that are being discovered for the first time
fall through the black-list due to its general allow-all policy.

Nodes that the HA firewall is supposed to allow access to the PXE
service are kept in a distributed store or obtained asynchronously
from a pub--sub channel.  Process instance workers add (remove)
firewall rules to (from) the distributed store as necessary, or
announce the changes on the pub--sub channels.  Firewall rules are
``(port_ID, port_MAC)`` tuples to be white-/black-listed.

Process instances use custom chains to implement the firewall: the
white-list chain and the black-list chain.  Failing through the
white-list chain, a packet "proceeds" to the black-list chain.
Failing through the black-list chain, a packet is allowed to access
the PXE service port.  A node port rule may be present in both the
white-list and the black-list chain at the same time while the node is
being introspected.
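
A sketch of how the two chains could be wired together with iptables
to protect the dnsmasq DHCP port; the chain names, interface and
``subprocess`` invocation style are illustrative assumptions, and the
leader-only *pass* rule described in the next subsection is omitted:

.. code-block:: python

    import subprocess

    def iptables(*args):
        subprocess.check_call(('iptables',) + args)

    def init_chains(interface='br-inspector'):
        # Custom chains: white-list first, then black-list.
        iptables('-N', 'inspector-wl')
        iptables('-N', 'inspector-bl')
        # Divert DHCP traffic for the PXE service into the white-list chain.
        iptables('-I', 'INPUT', '-i', interface, '-p', 'udp',
                 '--dport', '67', '-j', 'inspector-wl')
        # A packet failing through the white-list chain proceeds to the
        # black-list chain; failing through that, it reaches dnsmasq.
        iptables('-A', 'inspector-wl', '-j', 'inspector-bl')

    def whitelist(mac):
        # Nodes under introspection are explicitly accepted.
        iptables('-I', 'inspector-wl', '-m', 'mac',
                 '--mac-source', mac, '-j', 'ACCEPT')

    def blacklist(mac):
        # Nodes known to ironic but not being inspected are dropped.
        iptables('-A', 'inspector-bl', '-m', 'mac',
                 '--mac-source', mac, '-j', 'DROP')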

HA Decomposition
^^^^^^^^^^^^^^^^

On start-up, the processes poll Ironic to build their black-list
chains for the first time, and either set up a *local* periodic Ironic
black-list synchronization task or set callbacks on the black-list
pub--sub channel.

Process instances form a distributed firewall management party that
they watch for changes.  Process instances register a ``SIGTERM``
callback that, when the signal is delivered, makes the process
instance leave the party and reset the firewall, completely blocking
its PXE service.

Upon party shrinkage, processes reset their firewall white-list chain,
the *pass* rule in the black-list chain, and the rule-set watch
(should they have one set), and participate in a distributed firewall
management leader election.  Party growth isn't of concern to the
processes.

The leader process's black-list chain contains the *pass* rule while
the other processes' black-list chains don't.  Having been elected,
the leader process builds the white-list and registers a watch on the
distributed store or a white-list pub--sub channel callback in order
to keep the white-list firewall chain up to date.  Other process
instances don't maintain a white-list chain; that chain is empty for
them.

Upon any exception (or process instance partitioning), a process
resets its firewall to completely protect its PXE service.

Notes
^^^^^

Periodic white-list store polling and the white-list pub--sub channel
callbacks are complementary, optional facilities that enhance the
responsiveness of the firewall; the user may enable one, the other, or
both simultaneously as necessary.  The same holds for the black-list
Ironic polling and the black-list pub--sub channel callbacks.

To assemble the blacklist of MAC addresses, the processes may need to
poll the ironic service periodically for node information.  A
cache/proxy of this information might optionally be kept to reduce the
load on Ironic.

The firewall management should be implemented as a separate
asynchronous thread in each inspector process instance.  Losing the
firewall due to a leader failure isn't a concern --- a new leader will
eventually be elected.  Some nodes being introspected may, however,
time out in the waiting state and fail the introspection.

Periodic Ironic--firewall node synchronization and white-list store
polling should be implemented as independent threads with a
configurable time period, ``0<=period<=30s``, ideally
``0<=period<=15s``, so the window between introducing a node to ironic
and blacklisting it in the inspector firewall is kept below the user's
resolution.

As an optimization, the implementation may consider offloading the MAC
address rules of node ports from the firewall chains into `IP sets
<http://ipset.netfilter.org/changelog.html>`_.

HA HTTP API Decomposition
-------------------------

We assume a load balancer (HAProxy) shielding the user from the
inspector service.  All the inspector API process instances should
export the same REST API.  Each API request should be handled in a
separate asynchronous thread instance (as is the case now with the
`Flask <https://pypi.python.org/pypi/Flask>`_ framework).  At any
point in time, any of the process instances may serve any request.

.. _partitioning-concerns:

Partitioning concerns
---------------------

Upon a connection exception or worker process partitioning, the
affected entity should retry establishing the connection before
announcing failure.  The retry count and timeout should be
configurable for each of the ironic, database, distributed store, lock
and queue services.  The timeout should be interruptible, possibly
implemented as waiting for the appropriate termination/``SIGTERM``
flag, e.g. ``sigterm_flag.wait(timeout)``.  Should the retrying fail,
the affected entity brings down the worker inspector service
altogether by setting the flag, to avoid damage to resources --- most
of the time, other worker service entities would be equally affected
by the partition anyway.  The user may consider restarting the
affected worker service process instance once the partitioning issue
is resolved.
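
A sketch of the retrying behaviour described above; the retry count,
the ``connect`` callable and the exception type are illustrative
assumptions:

.. code-block:: python

    def connect_with_retries(connect, sigterm_flag, retries=5, timeout=10.0):
        """Retry a connection, waiting interruptibly between attempts."""
        for attempt in range(retries):
            try:
                return connect()
            except ConnectionError:
                # Interruptible wait; gives up early on SIGTERM.
                if sigterm_flag.wait(timeout=timeout):
                    break
        # Retrying failed: bring the whole worker service down so no
        # further damage is done to the resources being managed.
        sigterm_flag.set()
        raise RuntimeError('connection could not be re-established')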

Partitioning of HTTP API service instances isn't a concern as those
are stateless and accessed through a load balancer.

Alternatives
------------

HA Worker Decomposition
^^^^^^^^^^^^^^^^^^^^^^^

We've briefly examined the `TaskFlow
<https://wiki.openstack.org/wiki/taskflow>`_ library as an alternate
tasking mechanism.  Currently, TaskFlow supports only `directed
acyclic graphs as the dependency structure
<https://bugs.launchpad.net/taskflow/+bug/1527690>`_ between
particular steps.  The inspector service, however, has to support
restarting the introspection of a particular node, which brings loops
into the graph; see the `transition-function`_.  Moreover, TaskFlow
does not `support propagating external events
<https://bugs.launchpad.net/taskflow/+bug/1527678>`_ to a running
flow, such as the ``continue`` call from the bare metal node.  Because
of that, the overall introspection state of a particular node would
have to be maintained explicitly if TaskFlow were adopted.  TaskFlow,
too, requires tasks to be reentrant/idempotent.

HA Firewall decomposition
^^^^^^^^^^^^^^^^^^^^^^^^^

The firewall facility can be replaced by Neutron once it adopts
`enhancements to subnet DHCP options
<https://review.openstack.org/#/c/247027/>`_ and `allows serving DHCP
to unknown hosts <https://review.openstack.org/#/c/255240/>`_.  We're
keeping Inspector's firewall facility for users that are interested in
stand-alone deployments.

Data model impact
-----------------

Queue
^^^^^

A state transition request item is introduced; it should contain these
attributes (as an ``oslo.versionedobjects`` object):

* node ID
* transition event

A clean-up request item is introduced for removing a node.  Attributes
comprising the request:

* node ID
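
A sketch of the request objects using oslo.versionedobjects; the field
names follow the attributes above, while the class names are
illustrative assumptions:

.. code-block:: python

    from oslo_versionedobjects import base
    from oslo_versionedobjects import fields

    @base.VersionedObjectRegistry.register
    class StateTransitionRequest(base.VersionedObject):
        VERSION = '1.0'
        fields = {
            'node_id': fields.StringField(),
            'event': fields.StringField(),
        }

    @base.VersionedObjectRegistry.register
    class NodeCleanUpRequest(base.VersionedObject):
        VERSION = '1.0'
        fields = {
            'node_id': fields.StringField(),
        }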

Pub--sub channels
^^^^^^^^^^^^^^^^^

Two channels are introduced: the firewall white-list and black-list
channels.  The message format is as follows:

* add/remove
* port ID, MAC address

Store
^^^^^

A node state column is introduced to the node table.

HTTP API impact
---------------

The API service is provided by dedicated processes.

Client (CLI) impact
-------------------

None planned.

Performance and scalability impact
----------------------------------

We hope this change brings the desired redundancy and scaling to the
inspector service.  We do, however, expect the change to have a
negative network utilization impact, as the introspection tasks
require a queue and a DLM to coordinate.

The inspector firewall facility requires periodic polling of the
ironic service inventory in each inspector instance.  We therefore
expect an increased load on the ironic service.

Firewall facility leader partitioning causes a boot service outage for
the duration of the election.  Some nodes may therefore time out while
booting.

Each time the firewall leader updates the host's firewall, node
information is polled from the ironic service.  This may introduce
delays in firewall availability.  If a node being introspected is
removed from the ironic service, the change will not propagate to
Inspector until the introspection finishes.

Security impact
---------------

New services are introduced that might require hardening and
protection:

* load balancer
* distributed locking facility
* queue
* pub--sub channels

Deployer impact
---------------

Inspector Service Configuration
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

* the distributed locking facility, queue, firewall pub--sub channels
  and load balancer introduce new configuration options, especially
  URLs/hosts and credentials
* worker pool size, an integer, ``0<size;
  size.default==processor.count``
* worker ``queue.get(timeout); 0.0s<timeout; timeout.default==3.0s``
* clean-up period ``0.0s<period; period.default==30s``
* clean-up introspection report expiration threshold ``0.0s<threshold;
  threshold.default==86400.0s``
* clean-up introspection time-out threshold ``0.0s<threshold<=900.0s``
* ironic firewall black-list synchronization polling period
  ``0.0s<=period<=30.0s; period.default==15.0s; period==0.0`` to
  disable
* firewall white-list store watcher polling period
  ``0.0s<=period<=30.0s; period.default==15.0s; period==0.0`` to
  disable
* bare metal reboot throttle, ``0.0s<=value; value.default==0.0s``,
  the default disabling this feature altogether
* for each of the ironic service, database, distributed locking
  facility and the queue, a connection retry count and connection
  retry timeout should be configured
* all inspector hosts should share the same configuration, save for
  the update situation

New services and minimal Topology
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

* a floating IP address shared by the load balancers
* load balancers, wired for redundancy
* WSGI HTTP API instances (httpd), addressed by the load balancers in
  a round-robin fashion
* 3 inspector hosts, each running a worker process instance, a dnsmasq
  instance and iptables
* distributed synchronization facility hosts, wired for redundancy,
  accessed by all inspector workers
* queue hosts, wired for redundancy, accessed by all API instances and
  workers
* a database cluster, wired for redundancy, accessed by all API
  instances and workers
* NTP set up and configured for all the services

Please note that all inspector hosts require access to the PXE LAN for
bare metal nodes to boot.

Serviceability considerations
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Considering a service update, we suggest the following procedure be
adopted for each inspector host, one at a time:

HTTP API services:

* remove the selected host from the load balancer service
* stop the HTTP API service on the host
* upgrade the service and configuration files
* start the HTTP API service on the host
* enroll the host with the load balancer service

Worker services (for each worker host):

* stop the worker service instance on the host
* update the worker service and configuration files
* start the worker service

Shutting down the inspector worker service may hang for some time due
to worker threads executing a long synchronous procedure or waiting in
the ``queue.get(timeout)`` method while polling for a new task.

This approach may lead to introspection (task) failures for nodes that
are being handled on the inspector host under update.  Changes to the
transition function (new states, etc.) especially may induce
introspection errors.  Ideally, the update should therefore happen
with no ongoing introspections.  Failed node introspections may be
restarted.

A couple of periodic task "instances" may be lost due to the updated
leader partitioning each time a host is updated.  The HA firewall may
be lost for the leader election period each time a host is updated;
the expected delay should be less than 10 seconds so that booting of
inspected nodes isn't affected.

Upgrade from non-HA Inspector Service
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Because the non-HA inspector service is a single-process entity, and
because the HA services aren't internally backwards compatible with it
(so they cannot take over running node inspections), the non-HA
service has to be stopped first, while no inspections are ongoing, to
perform an upgrade.  Data migration is necessary before the upgrade.
As the new services require the queue and the DLM for their operation,
those have to be introduced before the upgrade.  The worker services
have to be started before the HTTP API services.  Having started, the
HTTP API services have to be introduced to the load balancer.

Developer impact
----------------

None planned.

Implementation
==============

We consider the following implementations for the facilities we rely
on:

* load balancer: HAProxy
* queue: Oslo messaging
* pub--sub firewall channels: Oslo messaging
* store: a database service
* distributed synchronization facility: Tooz
* HTTP API service: WSGI and httpd

Assignee(s)
-----------

* `vetrisko <https://launchpad.net/~vetrisko>`_; primary
* `divius <https://launchpad.net/~divius>`_

Work Items
----------

* replace current locking with Tooz DLM
* introduce the state machine
* split the API service and introduce conductors and the queue
* split cleaning into separate timeout and synchronization handlers
  and introduce leader election to these periodic procedures
* introduce leader election to the firewall facility
* introduce the pub--sub channels to the firewall facility

Dependencies
============

We require proper inspector `grenade testing
<https://wiki.openstack.org/wiki/Grenade>`_ before landing HA so that
we avoid breaking users as much as possible.

Testing
=======

All work items should be tested as separate patches, with both
functional and unit tests as well as upgrade tests with Grenade.

Having landed all the required work items, it should be possible to
test Inspector with a focus on redundancy and scaling.

References
==========

During the analysis process we considered these blueprints:

* `Abort introspection
  <https://blueprints.launchpad.net/ironic-inspector/+spec/abort-introspection>`_
* `Node States
  <https://blueprints.launchpad.net/ironic-inspector/+spec/node-states>`_
* `Node Locking <https://review.openstack.org/#/c/244750/5>`_
* `Oslo.messaging at-least-once semantics
  <https://review.openstack.org/#/c/256342/>`_

RFEs:

* `TaskFlow: flow suspend&continue
  <https://bugs.launchpad.net/taskflow/+bug/1527678>`_
* `TaskFlow: non-DAG flow patterns
  <https://bugs.launchpad.net/taskflow/+bug/1527690>`_
* `HA for Ironic Inspector
  <https://bugs.launchpad.net/ironic-inspector/+bug/1525218>`_
* `Safe queue for Tooz
  <https://bugs.launchpad.net/python-tooz/+bug/1528490>`_
* `Watchable store for Tooz
  <https://bugs.launchpad.net/python-tooz/+bug/1528495>`_
* `Enhanced Network/Subnet DHCP Options
  <https://review.openstack.org/#/c/247027/>`_
* `Neutron DHCP serve unknown hosts
  <https://review.openstack.org/#/c/255240/>`_

Community sources:

* `DLM options discussion
  <https://etherpad.openstack.org/p/mitaka-cross-project-dlm>`_
* `TaskFlow with external events and Non-DAG flows
  <http://lists.openstack.org/pipermail/openstack-dev/2015-November/080622.html>`_
* Joshua Harlow's comment that `Tooz should implement the
  at-least-once semantics not Oslo.messaging
  <https://review.openstack.org/#/c/256342/7/specs/mitaka/at-least-once-guarantee.rst@305>`_

RFCs:

* `DHCP Failover Protocol: IP address allocation between servers
  <https://tools.ietf.org/html/draft-ietf-dhc-failover-12#section-5.4>`_

Tools:

* `IP Sets <http://ipset.netfilter.org/changelog.html>`_
* `Dnsmasq <http://www.thekelleys.org.uk/dnsmasq/doc.html>`_
