Split service into API and Worker
Change-Id: Iaeb99ab1954a1d5303c9bd10b81f7f8d6aa7e731
This commit is contained in:
parent
d425662f8b
commit
cf953cade1
|
@ -0,0 +1,349 @@
|
||||||
|
..
|
||||||
|
This work is licensed under a Creative Commons Attribution 3.0 Unported
|
||||||
|
License.
|
||||||
|
|
||||||
|
http://creativecommons.org/licenses/by/3.0/legalcode
|
||||||
|
|
||||||
|
============================================
|
||||||
|
Splitting Inspector into an API and a Worker
|
||||||
|
============================================
|
||||||
|
|
||||||
|
https://bugs.launchpad.net/ironic-inspector/+bug/1525218
|
||||||
|
|
||||||
|
This work is part of the `High Availability for Ironic Inspector`_ spec. One
|
||||||
|
of the items to achieve inspector HA and scalability is splitting inspector
|
||||||
|
single service into API and worker services. This spec focuses on detailed
|
||||||
|
description of the essential part of the mentioned work - internal
|
||||||
|
communication between mentioned services.
|
||||||
|
|
||||||
|
|
||||||
|
Problem description
|
||||||
|
===================
|
||||||
|
|
||||||
|
**inspector** is a monolithic service consisting of the API, processing
|
||||||
|
background, the firewall and the DHCP managment. As result inspector isn't
|
||||||
|
capable of dealing well with the sizeable amount of **ironic** bare metal
|
||||||
|
nodes and doesn't fit large-scale deployments. Introducing new services to
|
||||||
|
solve this issues also brings some complexity as it requires a mechanism for
|
||||||
|
internal services communication.
|
||||||
|
|
||||||
|
Proposed change
|
||||||
|
===============
|
||||||
|
|
||||||
|
Node introspection is a sequence of asynchronous tasks. A task could be
|
||||||
|
described as an FSM transition of the **inspector** state machine [1]_,
|
||||||
|
triggered by events such as:
|
||||||
|
|
||||||
|
* ``starting(wait) -> waiting``
|
||||||
|
* ``waiting(process) -> processing``
|
||||||
|
* ``waiting(timeout) -> error``
|
||||||
|
* ``processing(finish) -> finished``
|
||||||
|
|
||||||
|
API request that are executed in the background can be considered as
|
||||||
|
asynchronous tasks. It is these tasks that allow splitting the service
|
||||||
|
into the *API* and *Worker* parts, the former creating tasks, the latter
|
||||||
|
consuming those. The communication of these service parts requires a
|
||||||
|
medium, the *queue*, and together these three subjects comprise the
|
||||||
|
`messaging queue paradigm`_. OpenStack projects use an open standard
|
||||||
|
for messaging middleware known as AMQP_. This messaging middleware,
|
||||||
|
oslo.messaging_, enables services that run on multiple servers to
|
||||||
|
talk to each other.
|
||||||
|
|
||||||
|
Each inspector worker provides a pool of worker threads that get state
|
||||||
|
transition requests from the API service via the queue. An API service
|
||||||
|
invokes methods on workers and eventually becomes a task. In other words,
|
||||||
|
there is the ``client`` role, carried out by the API service, and the
|
||||||
|
``server`` role, carried out by the worker thread respectively. Servers make
|
||||||
|
oslo.messaging_ ``RPC`` interfaces available to clients.
|
||||||
|
|
||||||
|
Client - inspector API
|
||||||
|
----------------------
|
||||||
|
|
||||||
|
**inspector** API will implement a simple oslo.messaging_ client, which will
|
||||||
|
connect to the messaging transport and send messages with state transition
|
||||||
|
event.
|
||||||
|
|
||||||
|
There are two ways that a method can be invoked, see [2]_:
|
||||||
|
* cast - the method is invoked asynchronously and no result is returned to
|
||||||
|
the caller.
|
||||||
|
|
||||||
|
* call - the method is invoked synchronously and the result is returned to
|
||||||
|
the caller.
|
||||||
|
|
||||||
|
*inspector* endpoints which invokes RPC:
|
||||||
|
|
||||||
|
.. table::
|
||||||
|
|
||||||
|
+---------------------------------+----------+---------------------------+--------------------------+
|
||||||
|
| Method | RPC type | API | Worker |
|
||||||
|
+=================================+==========+===========================+==========================+
|
||||||
|
| | | check provision state, | add lookup attributes, |
|
||||||
|
| | | validate power interface, | update pxe filters, |
|
||||||
|
| POST /introspection/<node_id> | cast | set `starting` state, | set pxe as boot device, |
|
||||||
|
| | | <RPC> cast `inspect` | reboot node, |
|
||||||
|
| | | | set `waiting` state |
|
||||||
|
+---------------------------------+----------+---------------------------+--------------------------+
|
||||||
|
| | | node lookup, | set `processing` state |
|
||||||
|
| | | check provision state, | run processing hook, |
|
||||||
|
| POST /continue | cast | <RPC> cast `process` | apply rules, |
|
||||||
|
| | | | update pxe filters, |
|
||||||
|
| | | | save introspection data, |
|
||||||
|
| | | | power off node, |
|
||||||
|
| | | | set `finished` state |
|
||||||
|
+---------------------------------+----------+---------------------------+--------------------------+
|
||||||
|
| POST /introspection/<node_id> | | find node in cache, | force power off, |
|
||||||
|
| /abort | cast | <RPC> cast `abort` | update pxe filters, |
|
||||||
|
| | | | set `error` state |
|
||||||
|
+---------------------------------+----------+---------------------------+--------------------------+
|
||||||
|
| | | find node in cache, | get introspection data, |
|
||||||
|
| | | <RPC> cast `reapply` | set `reapplying` state, |
|
||||||
|
| POST /introspection/<id>/data | cast | | run processing hooks, |
|
||||||
|
| /unprocessed | | | save introspection data, |
|
||||||
|
| | | | apply rules, |
|
||||||
|
| | | | set `finished` state |
|
||||||
|
+---------------------------------+----------+---------------------------+--------------------------+
|
||||||
|
|
||||||
|
The resulting workflow for introspection looks like::
|
||||||
|
|
||||||
|
Client API Worker Node Ironic
|
||||||
|
+ + + + +
|
||||||
|
| <HTTP>Start | | | |
|
||||||
|
+--inspection---> | | |
|
||||||
|
| X Validate power| | |
|
||||||
|
| X interface, | | |
|
||||||
|
| X initiate task | | |
|
||||||
|
| X for inspection| | |
|
||||||
|
| X | | |
|
||||||
|
| X <RPC> Cast | | |
|
||||||
|
X +- inspection---> | |
|
||||||
|
X | X Update pxe | |
|
||||||
|
X | X filters, set | |
|
||||||
|
X | X lookup attrs | |
|
||||||
|
X | X | |
|
||||||
|
X | X <HTTP> Set pxe | |
|
||||||
|
X | +-boot dev,reboot+--------------->
|
||||||
|
X | | | Reboot |
|
||||||
|
X | | <---------------+
|
||||||
|
X | | DHCP, boot, X |
|
||||||
|
X | | Collect data X |
|
||||||
|
X | | X |
|
||||||
|
X |Send inspection data to inspector |
|
||||||
|
X <---------------+----------------+ |
|
||||||
|
X X Node lookup, | | |
|
||||||
|
X X verify collect| | |
|
||||||
|
X X failures | | |
|
||||||
|
X X | | |
|
||||||
|
X X <RPC> Cast | | |
|
||||||
|
X +-process data--> | |
|
||||||
|
X | X Run process | |
|
||||||
|
X | X hooks, apply | |
|
||||||
|
X | X rules, update | |
|
||||||
|
X | X filters | |
|
||||||
|
X | X <HTTP> Set power off |
|
||||||
|
X | +----------------+--------------->
|
||||||
|
X | | | Power off |
|
||||||
|
X + + <------------- +
|
||||||
|
|
||||||
|
|
||||||
|
Server - inspector worker
|
||||||
|
-------------------------
|
||||||
|
|
||||||
|
An RPC server exposes **inspector** endpoints containing a set of methods,
|
||||||
|
which may be invoked remotely by clients over a given transport. Transport
|
||||||
|
driver will be loaded according to the users messaging configuration. See
|
||||||
|
[3]_ for more details on configuration options.
|
||||||
|
|
||||||
|
An **inspector** worker will implement a separate ``oslo.service`` process
|
||||||
|
with its own pool of green threads. The worker will periodically consume and
|
||||||
|
handle messages from the clients.
|
||||||
|
|
||||||
|
RPC reliability
|
||||||
|
---------------
|
||||||
|
|
||||||
|
For each message sent by the client via cast (asynchronously), an
|
||||||
|
acknowledgement is sent back immediately and the message is removed from
|
||||||
|
the queue. As result there is no guarantees that worker will handle the
|
||||||
|
introspection task.
|
||||||
|
|
||||||
|
This model, known as `at-most-once-delivery` doesn't guarantee message
|
||||||
|
processing for asynchronous tasks if proceed worker dies. Supporting HA
|
||||||
|
may require some additional functionality to confirm that task message
|
||||||
|
was processed.
|
||||||
|
|
||||||
|
If a worker dies (connection is closed or lost) during processing inspection
|
||||||
|
data, the task request message will disappear and the introspection task will
|
||||||
|
hang in ``processing`` state till `timeout` happens.
|
||||||
|
|
||||||
|
Alternatives
|
||||||
|
------------
|
||||||
|
|
||||||
|
Implement our own Publisher/Consumer functionality with Kombu_ library. This
|
||||||
|
approach has some benefits:
|
||||||
|
|
||||||
|
* support `at-least-once-delivery` semantic.
|
||||||
|
For each message retrieved and handled by a consumer, an acknowledgement is
|
||||||
|
sent back to the message producer. In case this acknowledgement is not
|
||||||
|
received after a certain time amount, the message is resent::
|
||||||
|
|
||||||
|
API Worker thread
|
||||||
|
+ +
|
||||||
|
| |
|
||||||
|
+--------------------->+
|
||||||
|
| |
|
||||||
|
| +--------+
|
||||||
|
| | |
|
||||||
|
| Process| |
|
||||||
|
| Request| |
|
||||||
|
| | |
|
||||||
|
| +------->+
|
||||||
|
| ACK |
|
||||||
|
+<---------------------+
|
||||||
|
| |
|
||||||
|
+ +
|
||||||
|
|
||||||
|
If a consumer dies without sending an ack, the message wasn't processed and
|
||||||
|
if there are other consumers online at the same time, message will be
|
||||||
|
reprocessed.
|
||||||
|
|
||||||
|
On the other hand, these approach has considerable drawbacks:
|
||||||
|
|
||||||
|
* Implementing own Publisher/Consumer.
|
||||||
|
It means complexity of supporting new functionality, lack of supported
|
||||||
|
backends, compared to oslo.messaging, like 0MQ_.
|
||||||
|
|
||||||
|
* Worse deployer's UX.
|
||||||
|
Message backend configuration in *inspector* will differ from other
|
||||||
|
services (including *ironic*), which brings some pain to deployers.
|
||||||
|
|
||||||
|
Data model impact
|
||||||
|
-----------------
|
||||||
|
|
||||||
|
None
|
||||||
|
|
||||||
|
HTTP API impact
|
||||||
|
---------------
|
||||||
|
|
||||||
|
Endpoint `/continue` will return `ACCEPTED` instead of `OK`.
|
||||||
|
|
||||||
|
Client (CLI) impact
|
||||||
|
-------------------
|
||||||
|
|
||||||
|
None
|
||||||
|
|
||||||
|
Ironic python agent impact
|
||||||
|
--------------------------
|
||||||
|
|
||||||
|
None
|
||||||
|
|
||||||
|
Performance and scalability impact
|
||||||
|
----------------------------------
|
||||||
|
|
||||||
|
Proposed change will allow users to scale **ironic-inspector**, both API
|
||||||
|
and Worker, horizontally after some more work in future, for more details
|
||||||
|
refer to `High Availability for Ironic Inspector`_.
|
||||||
|
|
||||||
|
Security impact
|
||||||
|
---------------
|
||||||
|
|
||||||
|
The newly introduced services require additional protection. The messaging
|
||||||
|
service, which would be used as the transport layer e.g RabbitMQ_, should rely
|
||||||
|
on a transport-level cryptography, see [4]_ for more details.
|
||||||
|
|
||||||
|
Deployer impact
|
||||||
|
---------------
|
||||||
|
|
||||||
|
The newly introduced message bus layer will require some message broker to
|
||||||
|
connect the inspector API and workers. The most popular broker implementation
|
||||||
|
used in OpenStack installations is RabbitMQ_, see [5]_ for more details.
|
||||||
|
|
||||||
|
To achieve resiliency, multiple API service and worker service instances
|
||||||
|
should be deployed on multiple physical hosts.
|
||||||
|
|
||||||
|
There are also new configuration options being added, see [3]_
|
||||||
|
|
||||||
|
|
||||||
|
Developer impact
|
||||||
|
----------------
|
||||||
|
|
||||||
|
Developers will need to consider new architecture and **inspector** API and
|
||||||
|
Worker communication details when adding new features which are required to be
|
||||||
|
handled as background tasks.
|
||||||
|
|
||||||
|
Upgrades and Backwards Compatibility
|
||||||
|
------------------------------------
|
||||||
|
|
||||||
|
The current **inspector** service is a single process, so deployers might need
|
||||||
|
to add more services, newly added *inspector* Worker, the messaging transport
|
||||||
|
backend (RabbitMQ) . Console script ``ironic-inspector`` could be changed
|
||||||
|
to run both API and Worker services with ``in-memory`` backend for messaging
|
||||||
|
transport. Which allows to run ``ironic-inspector`` in backward compatibility
|
||||||
|
manner - run both services on single host without the message broker.
|
||||||
|
|
||||||
|
Implementation
|
||||||
|
==============
|
||||||
|
|
||||||
|
Assignee(s)
|
||||||
|
-----------
|
||||||
|
|
||||||
|
* aarefiev (Anton Arefiev)
|
||||||
|
|
||||||
|
Work Items
|
||||||
|
----------
|
||||||
|
|
||||||
|
* Add base service functionality;
|
||||||
|
* Introduce Client/Servers workers;
|
||||||
|
* Implement API/Worker managers;
|
||||||
|
* Split service into API and Worker;
|
||||||
|
* Implement support for these services in Devstack;
|
||||||
|
* Use WSGI [6]_ to implement the API service.
|
||||||
|
|
||||||
|
Dependencies
|
||||||
|
============
|
||||||
|
|
||||||
|
None
|
||||||
|
|
||||||
|
Testing
|
||||||
|
=======
|
||||||
|
|
||||||
|
All new functionality would be tested both with functional and unit tests.
|
||||||
|
Already running Tempest tests as well as upgrade tests with Grenade will
|
||||||
|
also cover added features.
|
||||||
|
|
||||||
|
Functional tests run both Inspector API and Worker with an ``in-memory``
|
||||||
|
backend.
|
||||||
|
|
||||||
|
Having all the work items done will allow to setup multi-node devstack and
|
||||||
|
test Inspector in cluster mode eventually.
|
||||||
|
|
||||||
|
|
||||||
|
References
|
||||||
|
==========
|
||||||
|
|
||||||
|
|
||||||
|
.. [1] `Inspection states <http://git.openstack.org/cgit/openstack/ironic-inspector/tree/ironic_inspector/introspection_state.py>`_
|
||||||
|
|
||||||
|
.. [2] `RPC Client <https://docs.openstack.org/developer/oslo.messaging/rpcclient.html>`_
|
||||||
|
|
||||||
|
.. [3] `oslo.messaging configuration options <https://docs.openstack.org/developer/oslo.messaging/opts.html>`_
|
||||||
|
|
||||||
|
.. [4] `RabbitMQ security <https://docs.openstack.org/security-guide/messaging/security.html>`_
|
||||||
|
|
||||||
|
.. [5] `RabbitMQ HA <https://docs.openstack.org/ha-guide/shared-messaging.html>`_
|
||||||
|
|
||||||
|
.. [6] `TC Pike WSGI Goal <https://governance.openstack.org/tc/goals/pike/deploy-api-in-wsgi.html>`_
|
||||||
|
|
||||||
|
.. _Kombu: http://docs.celeryproject.org/projects/kombu/en/latest/
|
||||||
|
|
||||||
|
.. _High Availability for Ironic Inspector: https://specs.openstack.org/openstack/ironic-inspector-specs/specs/HA_inspector.html
|
||||||
|
|
||||||
|
.. _oslo.messaging: https://docs.openstack.org/developer/oslo.messaging/index.html
|
||||||
|
|
||||||
|
.. _RabbitMQ: https://www.rabbitmq.com
|
||||||
|
|
||||||
|
.. _HAProxy: http://www.haproxy.org
|
||||||
|
|
||||||
|
.. _0MQ: https://docs.openstack.org/developer/oslo.messaging/zmq_driver.html
|
||||||
|
|
||||||
|
.. _`messaging queue paradigm`: https://en.wikipedia.org/wiki/Message_queue
|
||||||
|
|
||||||
|
.. _AMQP: http://www.amqp.org/sites/amqp.org/files/amqp.pdf
|
Loading…
Reference in New Issue