
..
 This work is licensed under a Creative Commons Attribution 3.0 Unported
 License.

 http://creativecommons.org/licenses/by/3.0/legalcode

=======================================
 High Availability for Ironic Inspector
=======================================

Ironic inspector is a service that allows bare metal nodes to be
introspected dynamically; it currently isn't redundant.  The goal of
this blueprint is to suggest *conceptual changes* to the inspector
service that would make it redundant while maintaining both the
current inspection feature set and API.

Problem description
===================

Inspector is a compound service consisting of the inspector API
service, the firewall and the DHCP (PXE) service.  Currently, all
three components run as a single instance on a shared host per
OpenStack deployment.  A failure of the host or of any of the services
renders introspection unavailable and prevents the cloud administrator
from enrolling new hardware or from booting already enrolled bare
metal nodes.  Furthermore, Inspector isn't designed to cope well with
the amount of hardware required for Ironic bare metal usage at large
scale.  With a site size of 10k bare metal nodes in mind, we aim at
the inspector sustaining a batch load of a couple of hundred
introspection/enroll requests interleaved with a couple of minutes of
silence, while maintaining a couple of thousand firewall blacklist
items.  We refer to this use case as *bare metal to tenant*.

Below we describe the current Inspector service architecture and the
consequences of some Inspector process instance failures.

Introspection process
---------------------

Node introspection is a sequence of asynchronous steps, controlled by
the inspector API service, that take various amounts of time to
finish.  One could describe these steps as states of a transition
system, advanced by events as follows:

* ``starting`` the initial state; the system is advanced into this
  state by receiving an introspect API request.  Introspection
  configuration and set-up steps are performed while in this state.
* ``waiting`` introspection image is booting on the node.  The system
  advances to this state automatically.
* ``processing`` introspection image has booted and collected
  necessary information from the node.  This information is being
  processed by plug-ins to validate node status.  The system is
  advanced to this state having received the ``continue`` REST API
  request.
* ``finished`` introspection is done, node powered-off.  The system
  is advanced to this state automatically.

In case of an API service failure, nodes between the ``starting`` and
``finished`` states will lose their state and may require manual
intervention to recover.  No more nodes can be processed either,
because the API service runs as a single instance per deployment.

Firewall configuration
----------------------

To minimize interference with normally deployed nodes, inspector
deploys temporary firewall rules so only nodes being inspected can
access its PXE boot service.  It is implemented as a blacklist
containing MAC addresses of nodes kept by ironic service but not by
inspector.  This is required because the MAC address isn't known
before a node boots for the first time.
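
The blacklist described above amounts to a set difference.  The
following is a minimal sketch, assuming MAC addresses are available as
plain strings; the function name ``blacklisted_macs`` is hypothetical
and not part of inspector.

.. code-block:: python

   def blacklisted_macs(ironic_macs, inspected_macs):
       """Return MACs kept by ironic but not under introspection."""

       def normalize(mac):
           # Compare addresses case-insensitively.
           return mac.strip().lower()

       known = {normalize(mac) for mac in ironic_macs}
       inspected = {normalize(mac) for mac in inspected_macs}
       # Unknown MACs are absent from the result, so new nodes can boot.
       return known - inspected

A node that has never booted has no entry in either set, so it is not
blacklisted and can reach the PXE service.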

Depending on the point at which the API service fails while the
firewall and DHCP services are intact, the firewall configuration may
get out of sync, which may lead to interference with normal node
booting:

* firewall chain set-up (init phase): Inspector's dnsmasq service is
  exposed to all nodes
* firewall synchronization periodic task: new nodes added to Ironic
  aren't blacklisted
* node introspection finished: the node won't be blacklisted

On the other hand, no boot interference is expected if all services
(inspector, firewall and DHCP) run on the same host, as all services
are lost together.  Losing the API service during the clean-up
periodic task should not matter, as the nodes concerned will be kept
blacklisted during the service downtime.

DHCP (PXE) service
------------------

The inspector service doesn't manage the DHCP service directly;
rather, it just requires that DHCP is properly set up and shares the
host with the API service and the firewall.  We'd nevertheless like to
briefly describe the consequences of the DHCP service failing.

In case of a DHCP service failure, inspected nodes won't be able to
boot the introspection ramdisk and will eventually fail to get
inspected because of a timeout.  The nodes may loop retrying to boot,
depending on their firmware configuration.

A fail-over of DHCP from the active to the back-up host (usually
`dnsmasq <http://www.thekelleys.org.uk/dnsmasq/doc.html>`_) would
manifest as booting nodes under introspection timing out, or as nodes
already booted (holding an address lease) getting into an address
conflict with another booting node.  There's not much to help the
former situation besides retrying.  To prevent the latter, the
configuration of the DHCP service for introspection should consider
disjoint address pools served by the DHCP instances, as recommended in
the `IP address allocation between servers
<https://tools.ietf.org/html/draft-ietf-dhc-failover-12#section-5.4>`_
section of the DHCP Failover Protocol Internet-Draft.  We also
recommend using the ``dhcp-sequential-ip`` option in the dnsmasq
configuration file to avoid conflicts within the address pools.  See
the `related bug report
<https://bugzilla.redhat.com/show_bug.cgi?id=1301659#c20>`_ for more
details on the issue.  Introspection being ephemeral, synchronizing
the leases between the DHCP instances isn't necessary as long as
restarting introspection isn't an issue.
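
The disjoint address pools can be derived mechanically.  The sketch
below is only an illustration under stated assumptions: the helper
``split_pool`` is hypothetical and the addresses come from the
documentation range ``192.0.2.0/24``.

.. code-block:: python

   import ipaddress


   def split_pool(first, last, instances):
       """Split the inclusive range [first, last] into disjoint ranges."""
       start = int(ipaddress.IPv4Address(first))
       end = int(ipaddress.IPv4Address(last))
       total = end - start + 1
       if instances < 1 or total < instances:
           raise ValueError("pool too small for the number of instances")
       size, extra = divmod(total, instances)
       ranges = []
       for index in range(instances):
           # The first `extra` instances get one address more.
           count = size + (1 if index < extra else 0)
           ranges.append((str(ipaddress.IPv4Address(start)),
                          str(ipaddress.IPv4Address(start + count - 1))))
           start += count
       return ranges

For two DHCP instances and the pool ``192.0.2.100`` to
``192.0.2.199``, the sketch yields ``192.0.2.100`` to ``192.0.2.149``
and ``192.0.2.150`` to ``192.0.2.199``; each pair could then be used
as the ``dhcp-range`` of one dnsmasq instance.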

Other Inspector parts
---------------------

* periodic introspection status clean-up, removing old inspection data
  and finishing timed-out introspections
* synchronizing set of nodes with ironic
* limiting node power-on rate with a shared lock and a timeout

Proposed change
===============

In considering the problem of high availability, we are proposing a
solution that consists of a distributed, shared-nothing, active-active
implementation of all services that comprise the ironic inspector.
From the user's point of view, we suggest that the API service be
served through a *load balancer*, such as `HAProxy
<http://www.haproxy.org/>`_, in order to maintain a single entry point
for the API service (e.g. a floating IP address).

HA Node Introspection decomposition
-----------------------------------

Node introspection being a state transition system, we focus on
*decentralizing* it.  We therefore replicate the current introspection
state of a particular node through the distributed store to all
inspector process instances.  We suggest that both the automatic state
advancing requests and the API state advancing requests be performed
asynchronously by independent workers.

HA Worker
---------

Each inspector process provides a pool of asynchronous workers that
get state transition requests from a queue.  We use separate
``queue.get`` and ``queue.consume`` calls to avoid losing state
transition requests due to worker failures.  This, however, introduces
*at-least-once* delivery semantics for the requests.  We therefore
rely on the `transition-function`_ to handle repeated request delivery
gracefully.  We suggest two kinds of state-transition handling with
regard to the at-least-once delivery semantics:

Strict (non-reentrant-task) Transition Specification
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

* `Getting a request`_
* `Calculating new node state`_
* `Updating node state`_
* `Executing a task`_
* `Consuming a request`_

Reentrant Task Transition Specification
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
* `Getting a request`_
* `Calculating new node state`_
* `Executing a task`_
* `Updating node state`_
* `Consuming a request`_

A strict transition protects a state change, but it may leave the
recorded introspection state out of step with the real node: this
happens if a worker partitions right after having successfully
executed the task but just before consuming the request from the
queue.  As a consequence, the transition request, not having been
consumed, will be encountered again by (another) worker.  One can
refer to this behavior as a *reentrancy glitch* or *Déjà vu*.

Since the goal is to protect the inspected node from going through the
same task again, we rely on the state transition system to handle this
situation by navigating to the ``error`` state instead.
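
The difference between the two step orderings can be sketched as
follows.  This is an illustration only: ``PeekQueue`` is a
hypothetical in-memory stand-in for a queue with separate ``get`` and
``consume`` calls, and the store is a plain dictionary.

.. code-block:: python

   import collections


   class PeekQueue:
       """Toy queue with separate get (peek) and consume (remove)."""

       def __init__(self, items=()):
           self._items = collections.deque(items)

       def get(self):
           return self._items[0] if self._items else None

       def consume(self):
           self._items.popleft()


   def handle_strict(requests, store, task):
       node, new_state = requests.get()
       store[node] = new_state  # state is saved before the task runs
       task(node)               # a crash here leaves the request queued
       requests.consume()


   def handle_reentrant(requests, store, task):
       node, new_state = requests.get()
       task(node)               # the task may run again after a crash
       store[node] = new_state  # state is saved only after success
       requests.consume()

With the strict ordering, a redelivered request finds the node already
in the new state, which the transition function can detect and turn
into the ``error`` state.  With the reentrant ordering, the node state
is unchanged and the task simply runs again.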

Removing a node
^^^^^^^^^^^^^^^

`Ironic synchronization periodic task`_ puts node delete requests on
the queue.  Workers perform the following steps to handle them:

* `Getting a request`_
* worker removes the node from the store
* `Consuming a request`_

A failure of the store to remove the node isn't a concern here as the
periodic task will try again later.  It is therefore safe to always
consume the request here.

Shutting Down HA Inspector Processes
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

All inspector process instances register a ``SIGTERM`` callback. To
notify inspector worker threads, the ``SIGTERM`` callback sets the
``sigterm_flag`` upon the signal delivery. The flag is process-local
and its purpose is to allow inspector processes to perform a
controlled/graceful shutdown. For this mechanism to work, potentially
blocking operations (such as ``queue.get``) have to be used with a
configurable timeout value within the workers. All sleep calls
throughout the process instance should be interruptible, possibly
implemented as ``sigterm_flag.wait(sleep_time)`` or similar.
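
A minimal sketch of this shutdown mechanism, assuming Python threads
and using the standard-library ``queue.Queue`` as a stand-in for the
real request queue (unlike the proposed queue, its ``get`` also
removes the item).  The names ``on_sigterm``, ``install_handler`` and
``worker`` are hypothetical.

.. code-block:: python

   import queue
   import signal
   import threading

   sigterm_flag = threading.Event()  # process-local shutdown flag


   def on_sigterm(signum, frame):
       # Only set the flag; workers notice it at their next check.
       sigterm_flag.set()


   def install_handler():
       # Must be called from the main thread of the process.
       signal.signal(signal.SIGTERM, on_sigterm)


   def worker(requests, handled, get_timeout=0.1):
       while not sigterm_flag.is_set():
           try:
               # Blocking call bounded by a configurable timeout.
               request = requests.get(timeout=get_timeout)
           except queue.Empty:
               continue
           handled.append(request)

Because every blocking call is bounded by ``get_timeout``, a worker
stops at most one timeout after the flag has been set.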

Getting a request
^^^^^^^^^^^^^^^^^

* any worker instance may execute any request the queue contains
* worker gets a state transition or node delete request from the queue
* if the ``SIGTERM`` flag is set, worker stops
* if ``queue.get`` timed out (the request is ``None``), worker polls
  the queue again
* worker locks the BM node related to the request
* if locking failed, worker polls the queue again without consuming
  the request

Calculating new node state
^^^^^^^^^^^^^^^^^^^^^^^^^^

* worker instantiates a state transition system instance for current
  node state
* if instantiating failed (e.g. no such node in the store) worker
  performs `Retrying a request`_
* worker advances the state transition system
* if the state machine is jammed (illegal state transition request)
  worker performs `Consuming a request`_

Updating node state
^^^^^^^^^^^^^^^^^^^

The introspection state is kept in the store, visible to all worker
instances.

* worker saves node state in the store
* if saving node state in the store failed (such as node has been
  removed) worker performs `Retrying a request`_

Executing a task
^^^^^^^^^^^^^^^^

* worker performs the task bound to the transition request
* if the task result is a transition request worker puts it on the
  queue

Consuming a request
^^^^^^^^^^^^^^^^^^^

* worker consumes the state transition request from the queue
* worker releases related node lock
* worker continues from the beginning

Retrying a request
^^^^^^^^^^^^^^^^^^

* worker releases node lock
* worker continues from the beginning not consuming the request to
  retry later
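
The worker steps described in this section can be combined into a
single decision function.  The sketch below is an assumption-laden
illustration: local ``threading.Lock`` objects stand in for the
distributed node locks, the store is a dictionary, and ``TRANSITIONS``
is a tiny hypothetical excerpt of the transition function.

.. code-block:: python

   import threading

   # Hypothetical excerpt; the full function is given in the
   # "Introspection State-Transition System" section.
   TRANSITIONS = {("waiting", "continue"): "processing"}


   def process_one(request, store, locks):
       """Return 'consume' or 'retry' for one (node, event) request."""
       node, event = request
       lock = locks.setdefault(node, threading.Lock())
       if not lock.acquire(blocking=False):
           return "retry"        # node is locked by another worker
       try:
           if node not in store:
               return "retry"    # no such node in the store
           target = TRANSITIONS.get((store[node], event))
           if target is None:
               return "consume"  # jammed state machine: drop request
           store[node] = target
           return "consume"
       finally:
           lock.release()

A ``retry`` result leaves the request on the queue for a later
attempt, while ``consume`` removes it.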

Introspection State-Transition System
-------------------------------------

Node introspection state is managed by a worker-local instance of a
state transition system.  The state transition function is as follows.

.. compound::

   .. _transition-function:

   .. table:: Transition function

      +----------------+-----------------------+------------------------------------+
      | State          | Event                 | Target                             |
      +================+=======================+====================================+
      | N/A            | Inspect               | Starting                           |
      +----------------+-----------------------+------------------------------------+
      | Starting*      | Inspect               | Starting                           |
      +----------------+-----------------------+------------------------------------+
      | Starting*      | S~                    | Waiting                            |
      +----------------+-----------------------+------------------------------------+
      | Waiting        | S~                    | Waiting                            |
      +----------------+-----------------------+------------------------------------+
      | Waiting        | Timeout               | Error                              |
      +----------------+-----------------------+------------------------------------+
      | Waiting        | Abort                 | Error                              |
      +----------------+-----------------------+------------------------------------+
      | Waiting        | Continue!             | Processing                         |
      +----------------+-----------------------+------------------------------------+
      | Processing     | Continue!             | Error                              |
      +----------------+-----------------------+------------------------------------+
      | Processing     | F~                    | Finished                           |
      +----------------+-----------------------+------------------------------------+
      | Finished+      | Inspect               | Starting                           |
      +----------------+-----------------------+------------------------------------+
      | Finished+      | Abort                 | Error                              |
      +----------------+-----------------------+------------------------------------+
      | Error+         | Inspect               | Starting                           |
      +----------------+-----------------------+------------------------------------+

   .. table:: Legend

      +------------+-----------------------------+
      | Expression | Meaning                     |
      +============+=============================+
      | State*     | the initial state           |
      +------------+-----------------------------+
      | State+     | the terminal/accepting state|
      +------------+-----------------------------+
      | State~     | the automatic event         |
      |            | originating in State        |
      +------------+-----------------------------+
      | Event!     | strict/non-reentrant        |
      |            | transition event            |
      +------------+-----------------------------+
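
The transition function above maps directly onto a lookup table.  The
following sketch copies the state and event names verbatim from the
table (``None`` stands for the ``N/A`` state); the names
``TRANSITIONS``, ``IllegalTransition`` and ``advance`` are
hypothetical.

.. code-block:: python

   TRANSITIONS = {
       (None, "Inspect"): "Starting",
       ("Starting", "Inspect"): "Starting",
       ("Starting", "S~"): "Waiting",
       ("Waiting", "S~"): "Waiting",
       ("Waiting", "Timeout"): "Error",
       ("Waiting", "Abort"): "Error",
       ("Waiting", "Continue"): "Processing",
       ("Processing", "Continue"): "Error",
       ("Processing", "F~"): "Finished",
       ("Finished", "Inspect"): "Starting",
       ("Finished", "Abort"): "Error",
       ("Error", "Inspect"): "Starting",
   }


   class IllegalTransition(Exception):
       """Raised when the state machine is jammed."""


   def advance(state, event):
       """Return the target state for an event received in a state."""
       try:
           return TRANSITIONS[(state, event)]
       except KeyError:
           message = "no transition for %r in state %r" % (event, state)
           raise IllegalTransition(message) from None

A redelivered strict ``Continue`` event arrives in the ``Processing``
state and therefore leads to ``Error``, which is how the table guards
against running the processing task twice.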

.. _timer-decomposition:

HA Singleton Periodic task decomposition
----------------------------------------

Ironic inspector service houses a couple of periodic tasks. At any
point, up to a single "instance" of a periodic task flavor should be
running, no matter the process instances count. For this purpose, the
processes form a periodic task distributed management party.

Process instances register a ``SIGTERM`` callback that, the signal
being delivered, makes the process instance leave the party and switch
the ``reset_flag``.

The process instances install a watch on the party.  Upon party
shrinkage, the processes reset their periodic task (if they have one
set) by triggering the ``reset_flag``, and participate in a new
distributed periodic task management leader election.  Party growth
isn't of concern to the processes.

Because the task has to be reset upon party shrinkage, a custom flag
has to be used instead of the ``sigterm_flag`` to stop the periodic
task.  Otherwise, setting the ``sigterm_flag`` because of a party
change would stop the whole service.

The leader process executes the periodic task loop.  Upon an exception
or partitioning (see `partitioning-concerns`_), the leader sets the
``sigterm_flag`` so that the inspector service stops.  The periodic
task loop stops eventually because it performs
``reset_flag.wait(period)`` instead of sleeping.

The periodic task management should happen in a separate asynchronous
thread instance, one per periodic task.  Losing leader due to its
error (or partitioning) isn't a concern --- a new one will eventually
be elected and a couple of periodic task runs will be wasted
(including those that died together with the leader).
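
The interruptible periodic loop can be sketched with a
``threading.Event``.  Leader election and the party watch are left
out; the function name ``periodic_loop`` is hypothetical.

.. code-block:: python

   import threading


   def periodic_loop(task, period, reset_flag):
       """Run task every `period` seconds until reset_flag is set."""
       runs = 0
       while not reset_flag.is_set():
           task()
           runs += 1
           # Interruptible sleep: returns early once the flag is set.
           reset_flag.wait(period)
       return runs

Setting ``reset_flag`` from another thread (for example from the party
watch callback) ends the loop without waiting for the full period.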

HA Periodic clean-up decomposition
----------------------------------

Clean-up should be implemented as independent HA singleton periodic
tasks with configurable time period, one for each of the introspection
timeout and ironic synchronization tasks.

Introspection timeout periodic task
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

To finish introspections that are timing-out:

* select nodes for which the introspection is timing out
* for each node:

  * put a request to time-out the introspection on the queue for a
    worker to process

Ironic synchronization periodic task
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

To remove nodes no longer tracked by Ironic:

* select nodes that are kept by Inspector but not kept by Ironic
* for each node:

  * put a request to delete the node on the queue for a worker to
    process
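
Both periodic tasks reduce to selecting nodes and putting requests on
the queue.  A minimal sketch, assuming introspection start times are
available as a mapping of node ID to timestamp and using the
standard-library ``queue.Queue`` as a stand-in for the real queue; the
function names and the request tuples are hypothetical.

.. code-block:: python

   import queue


   def enqueue_timeouts(started_at, now, timeout, requests):
       """Request a time-out for every introspection running too long."""
       for node_id, started in sorted(started_at.items()):
           if now - started > timeout:
               requests.put((node_id, "timeout"))


   def enqueue_deletes(inspector_nodes, ironic_nodes, requests):
       """Request deletion of nodes kept by Inspector but not Ironic."""
       stale = set(inspector_nodes) - set(ironic_nodes)
       for node_id in sorted(stale):
           requests.put((node_id, "delete"))

The tasks themselves never touch node state; they only produce
requests that the workers process later.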

HA Reboot Throttle Decomposition
--------------------------------

As a workaround for some hardware, reboot request rate should be
limited. For this purpose, a single distributed lock instance should
be utilized. At any point in time, only a single worker may hold the
lock while performing the reboot (power-on) task. Upon acquiring the
lock, the reboot state transition sleeps in an interruptible fashion
for a configurable quantum of time. If the sleep was indeed
interrupted, the worker should raise an exception stopping the reboot
procedure and the worker itself. This interruption should happen as
part of the graceful shutdown mechanism. This should be implemented
utilizing the same ``SIGTERM`` flag/event workers use to check for
pending shutdown: ``sigterm_flag.wait(timeout=quantum)``.

Process partitioning isn't a concern here because all workers sleep
while holding the lock. Partitioning therefore slows down the reboot
pace by the amount of time a lock takes to expire.  It should be
possible to disable the reboot throttle altogether through the
configuration.
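
A minimal sketch of the throttle, assuming a local ``threading.Lock``
in place of the distributed lock and assuming the sleep happens before
the power-on call (the ordering is not fixed by this spec).  The names
``throttled_power_on`` and ``ShutdownRequested`` are hypothetical.

.. code-block:: python

   import threading


   class ShutdownRequested(Exception):
       """Raised when the throttle sleep is interrupted by shutdown."""


   def throttled_power_on(power_on, lock, sigterm_flag, quantum):
       if quantum <= 0:
           # Throttle disabled through the configuration.
           power_on()
           return
       with lock:  # only one worker may reboot a node at a time
           # Interruptible sleep; True means shutdown was requested.
           if sigterm_flag.wait(timeout=quantum):
               raise ShutdownRequested()
           power_on()

The lock is released on both paths, so an interrupted worker does not
block the remaining workers.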

HA Firewall decomposition
-------------------------

The PXE boot environment is configured and active on all inspector
hosts. The firewall protection of the PXE environment is active on all
inspector hosts, blocking the hosts' PXE service.  At any given point
in time, at most one inspector host's PXE service is available, and it
is available to all inspected nodes.

Building blocks
^^^^^^^^^^^^^^^

The general policy is allow-all, and each node that is not being
inspected has a block-exception to the general policy.  Due to its
size, the black-list is maintained locally on all inspector hosts,
pulling items from ironic periodically or asynchronously from a
pub--sub channel.

Nodes that are being introspected are white-listed in a separate set
of firewall rules.  Nodes that are being discovered for the first time
fall through the black-list due to the general allow-all black-list
policy.

Nodes that the HA firewall is supposed to allow access to the PXE
service are kept in a distributed store or obtained asynchronously
from a pub--sub channel.  Process instance workers add (subtract)
firewall rules to (from) the distributed store as necessary or
announce the changes on the pub--sub channels.  Firewall rules are
``(port_ID, port_MAC)`` tuples to be white-/black-listed.

Process instances use custom chains to implement the firewall: the
white-list chain and the black-list chain.  Falling through the
white-list chain, a packet "proceeds" to the black-list chain.
Falling through the black-list chain, a packet is allowed to access
the PXE service port.  A node port rule may be present in both the
white-list and the black-list chain at the same time if the node is
being introspected.

HA Decomposition
^^^^^^^^^^^^^^^^

On start-up, the processes poll Ironic to build their black-list
chains for the first time and set up a *local* periodic Ironic
black-list synchronisation task or set callbacks on the black-list
pub--sub channel.

Process instances form a distributed firewall management party that
they watch for changes.  Process instances register a ``SIGTERM``
callback that, the signal being delivered, makes the process instance
leave the party and reset the firewall, completely blocking their PXE
service.

Upon the party shrinkage, processes reset their firewall white-list
chain, the *pass* rule in the black-list chain, and the rule set watch
(should they have one set) and participate in a distributed firewall
management leader election.  Party growth isn't of concern to the
processes.

The leader process's black-list chain contains the *pass* rule while
the other processes' black-list chains don't.  Having been elected,
the leader process builds the white-list and registers a watch on the
distributed store or a white-list pub--sub channel callback in order
to keep the white-list firewall chain up-to-date.  Other process
instances don't maintain a white-list chain; that chain is empty for
them.

Upon any exception (or process instance partitioning), a process
resets its firewall to completely protect its PXE service.
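
The chain evaluation order can be modelled as a small decision
function.  This is a sketch of the logic only, not of any firewall
tooling; the name ``pxe_verdict`` and the plain sets of MAC addresses
are assumptions made for illustration.

.. code-block:: python

   def pxe_verdict(mac, white_list, black_list, has_pass_rule):
       """Decide whether a packet from `mac` reaches the PXE service."""
       if mac in white_list:
           return "ACCEPT"  # node is being introspected
       if mac in black_list:
           return "DROP"    # node known to ironic, not inspected
       # Falling through both chains: only the leader holds the pass
       # rule, every other host keeps its PXE service blocked.
       return "ACCEPT" if has_pass_rule else "DROP"

On a non-leader host the white-list is empty and ``has_pass_rule`` is
false, so every packet is dropped; on the leader, an unknown MAC
address falls through both chains and is accepted, which is what
allows first-time discovery.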

Notes
^^^^^

Periodic white-list store polling and the white-list pub--sub channel
callbacks are mutually optional facilities to enhance the
responsiveness of the firewall, and the user may prefer enabling one
or the other or both simultaneously as necessary.  The same holds for
the black-list Ironic polling and the black-list pub--sub channel
callbacks.

To assemble the blacklist of MAC addresses, the processes may need to
poll the ironic service periodically for node information.  A
cache/proxy of this information might be kept optionally to reduce the
load on Ironic.

The firewall management should be implemented as a separate
asynchronous thread in each inspector process instance. Firewall being
lost due to the leader failure isn't a concern --- new leader will be
eventually elected.  Some nodes being introspected may experience a
timeout in the waiting state and fail the introspection though.

Periodic Ironic--firewall node synchronization and white-list store
polling should be implemented as independent threads with configurable
time period, ``0<=period<=30s``, ideally ``0<=period<=15s`` so the
window between introducing a node to ironic and blacklisting it in
inspector firewall is kept below user's resolution.

As an optimization, the implementation may consider offloading the MAC
address rules of node ports from firewall chains into `IP sets
<http://ipset.netfilter.org/changelog.html>`_.

HA HTTP API Decomposition
-------------------------

We assume a Load Balancer (HAProxy) shielding the user from the
inspector service. All the inspector API process instances should
export the same REST API. Each API Request should be handled in a
separate asynchronous thread instance (as is the case now with the
`Flask <https://pypi.python.org/pypi/Flask>`_ framework). At any point
in time, any of the process instances may serve any request.

.. _partitioning-concerns:

Partitioning concerns
---------------------

Upon a connection exception or worker process partitioning, the
affected entity should retry establishing the connection before
announcing failure.  The retry count and timeout should be
configurable for each of the ironic, database, distributed store, lock
and queue services.  The timeout should be interruptible, possibly
implemented as waiting for the appropriate termination/``SIGTERM``
flag, e.g. ``sigterm_flag.wait(timeout)``.  Should the retrying fail,
the affected entity stops the inspector worker service altogether by
setting the flag, to avoid damage to resources; most of the time,
other worker service entities would be equally affected by the
partition anyway.  The user may consider restarting the affected
worker service process instance once the partitioning issue is
resolved.
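
The retry policy can be sketched as follows, assuming connection
problems surface as Python's built-in ``ConnectionError``; the
function name ``call_with_retries`` and its parameters are
hypothetical.

.. code-block:: python

   import threading


   def call_with_retries(operation, retry_count, retry_timeout,
                         sigterm_flag):
       """Call operation, retrying on connection errors."""
       attempts = 0
       while True:
           try:
               return operation()
           except ConnectionError:
               attempts += 1
               if attempts > retry_count:
                   # Retries exhausted: stop the whole worker service.
                   sigterm_flag.set()
                   raise
               # Interruptible pause between attempts.
               if sigterm_flag.wait(retry_timeout):
                   raise  # shutdown was requested meanwhile

Waiting on the flag rather than sleeping keeps the graceful shutdown
responsive even while a service is unreachable.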

Partitioning of HTTP API service instances isn't a concern as those
are stateless and accessed through a load balancer.

Alternatives
------------

HA Worker Decomposition
^^^^^^^^^^^^^^^^^^^^^^^

We've briefly examined the `TaskFlow
<https://wiki.openstack.org/wiki/taskflow>`_ library as an alternative
tasking mechanism.  Currently, TaskFlow supports only `directed
acyclic graphs as dependency structure
<https://bugs.launchpad.net/taskflow/+bug/1527690>`_ between
particular steps.  The inspector service, however, has to support
restarting the introspection of a particular node, which brings loops
into the graph; see `transition-function`_.  Moreover, TaskFlow does
not `support external event propagating
<https://bugs.launchpad.net/taskflow/+bug/1527678>`_ to a running
flow, such as the ``continue`` call from the bare metal node.  Because
of that, the overall state of the introspection of a particular node
has to be maintained explicitly if TaskFlow is adopted.  TaskFlow,
too, requires tasks to be reentrant/idempotent.

HA Firewall decomposition
^^^^^^^^^^^^^^^^^^^^^^^^^

The firewall facility can be replaced by Neutron once it adopts
`enhancements to subnet DHCP options
<https://review.openstack.org/#/c/247027/>`_ and `allows serving DHCP
to unknown hosts <https://review.openstack.org/#/c/255240/>`_.  We're
keeping Inspector's firewall facility for users that are interested in
stand-alone deployments.

Data model impact
-----------------

Queue
^^^^^

A state transition request item is introduced; it should contain these
attributes (as an oslo.versionedobjects object):

* node ID
* transition event

A clean-up request item is introduced removing a node. Attributes
comprising the request:

* node ID

Pub--sub channels
^^^^^^^^^^^^^^^^^

Two channels are introduced: firewall white-list and black-list.  The
message format is as follows:

* add/remove
* port ID, MAC address
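
The request items and channel messages listed above can be sketched as
plain data classes.  This is only an illustration: standard-library
``dataclasses`` stand in for the versioned objects the spec calls for,
and all class and field names are hypothetical.

.. code-block:: python

   import dataclasses


   @dataclasses.dataclass(frozen=True)
   class TransitionRequest:
       node_id: str
       event: str  # transition event, e.g. "Continue"


   @dataclasses.dataclass(frozen=True)
   class DeleteRequest:
       node_id: str


   @dataclasses.dataclass(frozen=True)
   class FirewallMessage:
       action: str  # "add" or "remove"
       port_id: str
       mac: str

       def __post_init__(self):
           if self.action not in ("add", "remove"):
               raise ValueError("action must be 'add' or 'remove'")

The same ``FirewallMessage`` shape would serve both the white-list and
the black-list channel, since the channel itself identifies the list.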

Store
^^^^^

Node state column is introduced to the node table.

HTTP API impact
---------------

API service is provided by dedicated processes.

Client (CLI) impact
-------------------

None planned.

Performance and scalability impact
----------------------------------

We hope this change brings in desired redundancy and scaling for the
inspector service.  We however expect the change to have a negative
network utilization impact as the introspection task requires a queue
and a DLM to coordinate.

The inspector firewall facility requires periodic polling of the
ironic service inventory in each inspector instance.  Therefore we
expect increased load on the ironic service.

Partitioning of the firewall facility leader causes a boot service
outage for the election period.  Some nodes may therefore time out
while booting.

Each time the firewall leader updates the host's firewall, node
information is polled from the ironic service.  This may introduce
delays in firewall availability.  If a node being introspected is
removed from the ironic service, the change will not propagate to
Inspector until the introspection finishes.

Security impact
---------------

New services introduced that might require hardening and protection:

* load balancer
* distributed locking facility
* queue
* pub--sub channels

Deployer impact
---------------

Inspector Service Configuration
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

* distributed locking facility, queue, firewall pub--sub channels and
  load balancer introduce new configuration options, especially
  URLs/hosts and credentials
* worker pool size, integral, ``0<size;
  size.default==processor.count``
* worker ``queue.get(timeout); 0.0s<timeout; timeout.default==3.0s``
* clean-up period  ``0.0s<period; period.default==30s``
* clean-up introspection report expiration threshold ``0.0s<threshold;
  threshold.default==86400.0s``
* clean-up introspection time-out threshold ``0.0s<threshold<=900.0s``
* ironic firewall black-list synchronization polling period
  ``0.0s<=period<=30.0s; period.default==15.0s; period==0.0`` to disable
* firewall white-list store watcher polling period
  ``0.0s<=period<=30.0s; period.default==15.0s; period==0.0`` to
  disable
* bare metal reboot throttle, ``0.0s<=value; value.default==0.0s``
  disabling this feature altogether
* for each of the ironic service, database, distributed locking
  facility and the queue, a connection retry count and connection
  retry timeout should be configured
* all inspector hosts should share same configuration, save for the
  update situation

New services and minimal Topology
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

* floating IP address shared by load balancers
* load balancers, wired for redundancy
* WSGI HTTP API instances (httpd), addressed by load balancers in a
  round-robin fashion
* 3 inspector hosts each running a worker process instance, dnsmasq
  instance and iptables
* distributed synchronization facility hosts, wired for redundancy,
  accessed by all inspector workers
* queue hosts, wired for redundancy, accessed by all API instances and
  workers
* database cluster, wired for redundancy, accessed by all API
  instances and workers
* NTP set up and configured for all the services

Please note, all inspector hosts require access to the PXE LAN for
bare metal nodes to boot.

Serviceability considerations
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Considering service updates, we suggest the following procedure be
adopted for each inspector host, one at a time:

HTTP API services:

* remove selected host from the load balancer service
* stop the HTTP API service on the host
* upgrade the service and configuration files
* start the HTTP API service on the host
* enroll the host to the load balancer service

Worker services:

* for each worker host:

  * stop the worker service instance on the host
  * update the worker service and configuration files
  * start the worker service

Shutting down the inspector worker service may hang for some time due
to worker threads executing a long synchronous procedure or waiting in
the ``queue.get(timeout)`` method while polling for new task.

This approach may lead to introspection (task) failures for nodes that
are being handled on inspector host under update.  Especially changes
of the transition function (new states etc) may induce introspection
errors.  Ideally, the update should therefore happen with no ongoing
introspections.  Failed node introspections may be restarted.

A couple of periodic task "instances" may be lost due to the updated
leader partitioning each time a host is updated.  HA firewall may be
lost for the leader election period each time a host is updated,
expected delay should be less than 10 seconds so that booting of
inspected nodes isn't affected.

Upgrade from non-HA Inspector Service
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Because the non-HA inspector service is a single-process entity and
because the HA services aren't internally backwards compatible with it
(to allow taking-over running node inspections), to perform an
upgrade, the non-HA service has to be stopped first while no
inspections are ongoing.  Data migration is necessary before the
upgrade.  As the new services require the queue and the DLM for their
operation those have to be introduced before the upgrade.  The worker
services have to be started before HTTP API services.  Having started,
the HTTP API services have to be introduced to the load balancer.

Developer impact
----------------

None planned.

Implementation
==============

We consider the following implementations for the facilities we rely
on:

* load balancer: HAProxy
* queue: Oslo messaging
* pub--sub firewall channels: Oslo messaging
* store: a database service
* distributed synchronization facility: Tooz
* HTTP API service: WSGI and httpd

Assignee(s)
-----------

* `vetrisko <https://launchpad.net/~vetrisko>`_; primary
* `divius  <https://launchpad.net/~divius>`_

Work Items
----------

* replace current locking with Tooz DLM
* introduce state machine
* split API service and introduce conductors and queue
* split cleaning into separate timeout and synchronization handlers
  and introduce leader-election to these periodic procedures
* introduce leader-election to the firewall facility
* introduce the pub--sub channels to the firewall facility

Dependencies
============

We require proper inspector `grenade testing
<https://wiki.openstack.org/wiki/Grenade>`_ before landing HA so we
avoid breaking users as much as possible.

Testing
=======

All work items should be tested as separate patches both with
functional and unit tests as well as upgrade tests with Grenade.

Having landed all the required work items it should be possible to
test Inspector with focus on redundancy and scaling.

References
==========

During the analysis process we considered these blueprints:

* `Abort introspection
  <https://blueprints.launchpad.net/ironic-inspector/+spec/abort-introspection>`_
* `Node States
  <https://blueprints.launchpad.net/ironic-inspector/+spec/node-states>`_
* `Node Locking <https://review.openstack.org/#/c/244750/5>`_
* `Oslo.messaging at-least-once semantics
  <https://review.openstack.org/#/c/256342/>`_

RFEs:

* `TaskFlow: flow suspend&continue
  <https://bugs.launchpad.net/taskflow/+bug/1527678>`_
* `TaskFlow: non-DAG flow patterns
  <https://bugs.launchpad.net/taskflow/+bug/1527690>`_
* `HA for Ironic Inspector
  <https://bugs.launchpad.net/ironic-inspector/+bug/1525218>`_
* `Safe queue for Tooz
  <https://bugs.launchpad.net/python-tooz/+bug/1528490>`_
* `Watchable store for Tooz
  <https://bugs.launchpad.net/python-tooz/+bug/1528495>`_
* `Enhanced Network/Subnet DHCP Options
  <https://review.openstack.org/#/c/247027/>`_
* `Neutron DHCP serve unknown hosts
  <https://review.openstack.org/#/c/255240/>`_

Community sources:

* `DLM options discussion
  <https://etherpad.openstack.org/p/mitaka-cross-project-dlm>`_
* `TaskFlow with external events and Non-DAG flows
  <http://lists.openstack.org/pipermail/openstack-dev/2015-November/080622.html>`_
* Joshua Harlow's comment that `Tooz should implement the
  at-least-once semantics not Oslo.messaging
  <https://review.openstack.org/#/c/256342/7/specs/mitaka/at-least-once-guarantee.rst@305>`_

RFCs:

* `DHCP Failover Protocol: IP address allocation between servers <https://tools.ietf.org/html/draft-ietf-dhc-failover-12#section-5.4>`_

Tools:

* `IP Sets <http://ipset.netfilter.org/changelog.html>`_
* `Dnsmasq <http://www.thekelleys.org.uk/dnsmasq/doc.html>`_