..
 This work is licensed under a Creative Commons Attribution 3.0 Unported
 License.

 http://creativecommons.org/licenses/by/3.0/legalcode

==========================================
Ironic Shards
==========================================

https://blueprints.launchpad.net/nova/+spec/ironic-shards

.. note:: The deprecation of the ``[ironic]peer_list`` config option,
          explained below in `Config changes and Deprecations`_,
          landed in 2023.2 (Bobcat). The rest of the feature was reverted
          due to a late-discovered bug and is being re-submitted in 2024.1
          (Caracal).

Problem description
===================

Nova's Ironic driver involves a single nova-compute service managing
many compute nodes, where each compute node record maps to an Ironic node.
Some deployments support 1000s of ironic nodes, but a single nova-compute
service is unable to manage 1000s of nodes and 1000s of instances.

Currently we support setting a partition key, where nova-compute only
cares about a subset of ironic nodes, those associated with a specific
conductor group. However, some conductor groups can be very large,
served by many ironic-conductor services.

To help with this, Nova has attempted to dynamically spread ironic
nodes between a set of nova-compute peers. While this works some of
the time, there are some major limitations:

* when one nova-compute is down, only unassigned ironic nodes can
  move to another nova-compute service
* i.e. when one nova-compute is down, all ironic nodes with nova instances
  associated with the down nova-compute service are unable to be
  managed, e.g. a reboot will fail
* moreover, when the old nova-compute comes back up, which might take
  some time, there are lots of bugs as the hash ring slowly rebalances.
  In part because every nova-compute fetches all nodes, in a large enough
  cloud, this can take over 24 hours.

This spec is about tweaking the way we shard Ironic compute nodes.
We need to stop violating deep assumptions in the compute manager
code by moving to more static ironic node partitions.

Use Cases
---------

Any users of the ironic driver that have more than one
nova-compute service per conductor group should move to an
active-passive failover mode.

The new static sharding will be of particular interest for clouds
with ironic conductor groups that are greater than around
1000 baremetal nodes.

.. NOTE: many parts of this story work today but
   need better documentation:

   * understanding the current scale limit of around 500-1000 ironic
     nodes per nova-compute, and the best way to scale beyond that
   * sharding ironic-conductors and nova-computes using
     ironic conductor groups.
     Note: conductor groups have a specific use in Ironic
     and this is not it, but it works for some users.
   * active-passive failover for nova-compute services
     running the ironic driver.
     Note: the time to start up a new process after a
     failover is way too high, particularly at larger
     scales without conductor groups.

Proposed change
===============

We add a new configuration option:

* ``[ironic] shard_key``

By default, there will be no ``shard_key`` set, and we will continue to
expose all ironic nodes from a single nova-compute process.
Mostly, this is to keep things simple for smaller deployments,
i.e. when you have fewer than 500 ironic nodes.

When the operator sets a ``shard_key``, the nova-compute process will
use the ``shard_key`` when querying a list of nodes in Ironic.
We must never try to list all Ironic nodes when
the Ironic shard key is defined in the config.

When we look up a specific ironic node via a node uuid or
instance uuid, we should not restrict that to either the shard key
or conductor group.

Similar to checking the instance uuid is still present on the Ironic
node before performing an action, or ensuring there is no instance uuid
before provisioning, we should also check the node is in the correct
shard (and conductor group) before doing anything with that Ironic node.

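The pre-action check described above could look roughly like the following.
This is only a sketch under stated assumptions: ``IronicNode``,
``node_in_scope`` and ``check_before_action`` are hypothetical names used for
illustration, not existing Nova or Ironic code, and the node's ``shard``
attribute name is assumed.

.. code-block:: python

    from dataclasses import dataclass
    from typing import Optional


    @dataclass(frozen=True)
    class IronicNode:
        """Hypothetical, minimal stand-in for an Ironic node record."""
        uuid: str
        shard: Optional[str] = None
        conductor_group: str = ""
        instance_uuid: Optional[str] = None


    def node_in_scope(node, shard_key=None, conductor_group=None):
        """Return True if this nova-compute is configured to manage the node.

        An option left unset (None) places no restriction on that attribute.
        """
        if shard_key is not None and node.shard != shard_key:
            return False
        if (conductor_group is not None
                and node.conductor_group != conductor_group):
            return False
        return True


    def check_before_action(node, expected_instance_uuid, shard_key=None,
                            conductor_group=None):
        """Raise ValueError unless it is safe for this service to act."""
        # Existing style of check: the instance uuid must be what we expect
        # (None when we are about to provision onto a free node).
        if node.instance_uuid != expected_instance_uuid:
            raise ValueError("instance uuid on node %s changed" % node.uuid)
        # New check: the node must still be in our shard and conductor group.
        if not node_in_scope(node, shard_key, conductor_group):
            raise ValueError("node %s is outside our shard" % node.uuid)

Note that the node passed in would come from a lookup by uuid that is not
filtered by shard or conductor group; the scope check is applied afterwards.
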
Config changes and Deprecations
-------------------------------

We will keep the option to target a specific conductor group,
but this option will be renamed from ``partition_key`` to
``conductor_group``. This is additive to the ``shard_key`` above: the
target ironic nodes are those in both the correct ``shard_key`` and the
correct ``conductor_group``, when both are configured.

We will deprecate the use of the ``peer_list``.
We should log a warning when the hash ring is being used,
i.e. when it has more than one member added to the hash ring.

In addition, we need the logic that tries to move Compute Nodes
to never work unless the ``peer_list`` is larger than one. More details
in the data model impact section.

When deleting a ComputeNode object, we need to have the driver
confirm that it is safe. In the case of Ironic we will check to see if
the configured Ironic has a node with that uuid, searching across all
conductor groups and all shard keys. When the ComputeNode object is not
deleted, we should not delete the entry in placement.

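As an illustration of the deletion guard, here is a minimal sketch. Both
function names and the ``ironic_node_exists`` callable are hypothetical; the
callable stands in for an Ironic lookup by uuid that applies no shard or
conductor group filter.

.. code-block:: python

    def safe_to_delete_compute_node(node_uuid, ironic_node_exists):
        """Return True only if Ironic no longer has a node with this uuid.

        ``ironic_node_exists`` is a hypothetical callable that searches
        across all conductor groups and all shard keys.
        """
        return not ironic_node_exists(node_uuid)


    def split_orphan_candidates(candidate_uuids, ironic_node_exists):
        """Split candidates into (to_delete, to_keep) lists.

        Nodes in ``to_keep`` must also keep their entry in placement.
        """
        to_delete, to_keep = [], []
        for uuid in candidate_uuids:
            if safe_to_delete_compute_node(uuid, ironic_node_exists):
                to_delete.append(uuid)
            else:
                to_keep.append(uuid)
        return to_delete, to_keep
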
nova-manage move ironic node
----------------------------

We will create a new nova-manage command::

    nova-manage ironic-compute-node-move <ironic-node-uuid> \
        --service <destination-service>

This command will do the following:

* Find the ComputeNode object for this ironic-node-uuid
* Error if the ComputeNode type does not match the ironic driver.
* Find the related Service object for the above ComputeNode
  (i.e. the host)
* Error if the service object is not reported as down, and
  has not also been put into maintenance. We do not require
  forced down, because we might only be moving a subset of
  nodes associated with this nova-compute service.
* Check the Service object for the destination service host exists
* Find all non-deleted instances for this (host,node)
* Error if there is more than 1 non-deleted instance found.
  It is OK if we find zero or 1 instances.
* In one DB transaction:
  move the ComputeNode object to the destination service host and
  move the Instance (if there is one) to the destination service host

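The checks listed above can be sketched against simple in-memory objects.
Everything here is an assumption made for illustration: the dataclasses are
hypothetical stand-ins for the Nova objects, ``disabled`` stands in for
"put into maintenance", and the source-service check reflects one reading of
the rule (the service must be both down and in maintenance).

.. code-block:: python

    from dataclasses import dataclass


    @dataclass
    class Service:
        host: str
        is_up: bool = True
        disabled: bool = False  # assumed stand-in for "in maintenance"


    @dataclass
    class ComputeNode:
        uuid: str
        host: str
        hypervisor_type: str = "ironic"


    @dataclass
    class Instance:
        uuid: str
        host: str
        node: str
        deleted: bool = False


    class MoveError(Exception):
        """Raised when one of the safety checks fails."""


    def move_ironic_compute_node(node_uuid, dest_host, nodes, services,
                                 instances):
        """Run the safety checks, then re-home the node and its instance."""
        node = next((n for n in nodes if n.uuid == node_uuid), None)
        if node is None:
            raise MoveError("no ComputeNode for %s" % node_uuid)
        if node.hypervisor_type != "ironic":
            raise MoveError("ComputeNode is not an ironic node")

        source = services[node.host]
        if source.is_up or not source.disabled:
            raise MoveError("source must be down and in maintenance")
        if dest_host not in services:
            raise MoveError("destination service %s not found" % dest_host)

        found = [i for i in instances
                 if i.host == node.host and i.node == node.uuid
                 and not i.deleted]
        if len(found) > 1:
            raise MoveError("more than one instance on %s" % node_uuid)

        # In Nova, both updates would happen in one DB transaction.
        node.host = dest_host
        for inst in found:
            inst.host = dest_host
        return node, found

All checks run before any state is changed, so a failed check leaves the
node and its instance on the original service.
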
The above tool is expected to be used as part of the wider process
of migrating from the old ``peer_list`` to the new shard key. There are
two key scenarios (although the tool may help operators recover from
other issues as well):

* moving from a ``peer_list`` to a single nova-compute
* moving from ``peer_list`` to ``shard_key``, while keeping multiple
  nova-compute processes (for a single conductor group)

Migrate from peer_list to single nova-compute
---------------------------------------------

Small deployments (i.e. fewer than 500 ironic nodes)
are recommended to move from a ``peer_list`` of, for example,
three nova-compute services, to a single nova-compute service.
On failure of the nova-compute service, operators can either manually start
the process on a new host, or use an automatic active-passive HA scheme.

The process would look something like this:

* ironic and nova both default to an empty shard key,
  such that all ironic nodes are in the same default shard
* start a new nova-compute service running the ironic driver,
  ideally with a synthetic value for ``[DEFAULT]host``, e.g. ``ironic``.
  This will log warnings about the need to use the nova-compute
  migration tool before being able to manage any nodes
* stop all existing nova-compute services
* mark them as forced-down via the API
* Now loop around all ironic nodes and call this, assuming your
  nova-compute service has its host value of just ``ironic``:
  ``nova-manage ironic-compute-node-move <uuid> --service ironic``

The periodic tasks in the new nova-compute service will gradually
pick up the new ComputeNodes, and will start being able to receive
commands such as reboot for all the moved instances.

You could instead start the new nova-compute service after
having migrated all the ironic compute nodes, but that would
lead to higher downtime during the migration.

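The "loop around all ironic nodes" step above could be scripted along these
lines. This is a sketch only: ``build_move_commands`` is a hypothetical
helper, the uuids are made up, and the command line follows the syntax
proposed in this spec, which may differ from what is finally implemented.

.. code-block:: python

    import shlex


    def build_move_commands(node_uuids, service_host):
        """Return one argv list per ironic node uuid."""
        return [
            ["nova-manage", "ironic-compute-node-move", uuid,
             "--service", service_host]
            for uuid in node_uuids
        ]


    if __name__ == "__main__":
        # Made-up uuids; in practice these come from the Ironic node list.
        uuids = ["1be26c0b-03f2-4d2e-ae87-c02d7f33c123",
                 "4bc9a7e2-6d1b-4f0e-9f1b-1a2b3c4d5e6f"]
        for argv in build_move_commands(uuids, "ironic"):
            # Print rather than execute, so the plan can be reviewed first.
            print(shlex.join(argv))

Building argument lists instead of shell strings avoids quoting problems if
the commands are later passed to ``subprocess.run``.
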
Migrate from peer_list to shard_key
-----------------------------------

The process to move from the hash ring based ``peer_list`` to the static
``shard_key`` from ironic is very similar to the above process:

* Set the shard_key on all your ironic nodes, such that you can spread
  the nodes out between your nova-compute processes.
* Start your new nova-compute processes, one for each ``shard_key``,
  possibly setting a synthetic ``[DEFAULT]host`` value that matches the
  shard key, e.g. ``my_shard_key``.
* Shut down all the older nova-compute processes with
  ``[ironic]peer_list`` set
* Mark those older services as in maintenance via the Nova API
* For each shard_key in Ironic, work out which service host you have mapped
  each one to above, then run this for each ironic node uuid in the shard:
  ``nova-manage ironic-compute-node-move <uuid> --service my_shard_key``
* Delete the old services via the Nova API, now that there are no instances
  or compute nodes on those services

While you could start the new nova-compute services after the migration,
that would lead to a slightly longer downtime.

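Working out which service host each node should move to can be planned
before running any commands. The sketch below assumes two hypothetical
inputs prepared by the operator: a mapping of ironic node uuid to shard key,
and a mapping of shard key to the ``[DEFAULT]host`` of the nova-compute
service started for that shard.

.. code-block:: python

    from collections import defaultdict


    def plan_shard_moves(node_shards, shard_to_host):
        """Group node uuids by the destination service host of their shard.

        Nodes whose shard has no service mapped are returned separately,
        so they can be fixed before any move is attempted.
        """
        plan = defaultdict(list)
        unmapped = []
        for uuid, shard in sorted(node_shards.items()):
            host = shard_to_host.get(shard)
            if host is None:
                unmapped.append(uuid)
            else:
                plan[host].append(uuid)
        return dict(plan), unmapped

Each entry in the returned plan then corresponds to one batch of
``nova-manage ironic-compute-node-move`` calls with the same ``--service``
value.
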
Adding new compute nodes
------------------------

In general, there is no change when adding nodes into existing
shards.

Similarly, you can add a new nova-compute process for a new shard
and then start to fill that up with nodes.

Move an ironic node between shards
----------------------------------

When removing nodes from ironic at the end of their life, or
adding large numbers of new nodes, you may need to rebalance
the shards.

To move some ironic nodes, you need to move the nodes in
groups associated with a specific nova-compute process.
For each nova-compute and the associated ironic nodes you
want to move to a different shard you need to:

* Shut down the affected nova-compute process
* Put the nova-compute service into maintenance
* In the Ironic API, update the shard key on the Ironic node
* Now move each ironic node to the correct new nova-compute
  process for the shard key it was moved into:
  ``nova-manage ironic-compute-node-move <uuid> --service my_shard_key``
* Now unset maintenance mode for the nova-compute,
  and start that service back up

Move shards between nova-compute services
-----------------------------------------

To move a shard between nova-compute services, you need to
replace the nova-compute process with a new one:

* ensure the destination nova-compute is configured with the
  shard you want to move, and is running
* stop the nova-compute process currently serving the shard
* force-down the service via the API
* for each ironic node uuid in the shard call nova-manage
  to move it to the new nova-compute process

Alternatives
------------

We could require nova-compute processes to be explicitly forced down,
before allowing nova-manage to move the ironic nodes about,
in a similar way to evacuate.
But this creates problems when trying to re-balance shards as you
remove nodes at the end of their life.

We could consider a list of shard keys, rather than a single shard key
per nova-compute. But for this first version, we have chosen the simpler
path, which appears to have few limitations.

We could attempt to keep fixing the hash ring recovery within the ironic
driver, but it's very unclear what will break next due to all the deep
assumptions made about the nova-compute process. The specific assumptions
include:

* when nova-compute breaks, it's usually the hypervisor hardware that
  has broken, which includes all the nova servers running on that.
* all locking and management of a nova server object is done by the
  currently assigned nova-compute node, and this is only ever changed
  by explicit move operations like resize, migrate, live-migration
  and evacuate. As such we can use simple local locks to ensure
  concurrent operations don't conflict, along with DB state checking.

Data model impact
-----------------

A key thing we need to ensure is that ComputeNode objects are only
automatically moved between service objects when in legacy hash ring mode.
Currently, this only happens for unassigned ComputeNodes.

In this new explicit shard mode, only nova-manage is able to move
ComputeNode objects. In addition, nova-manage will also move associated
instances. However, similar to evacuate, this will only be allowed
when the currently associated service is forced down.

Note, this applies when a nova-compute finds a ComputeNode that it should
own, but the Nova database says it's already owned by a different service.
In this scenario, we should log a warning to the operator
to ensure they have migrated that ComputeNode from its old location
before this nova-compute service is able to manage it.

In addition, we should ensure we only delete a ComputeNode object
when the driver explicitly says it's safe to delete. In the case of
the Ironic driver, we should ensure the node no longer exists in
Ironic, being sure to search across all shards.

This is all closely related to this spec on robustifying
the Compute Node and Service object relationship:
https://review.opendev.org/c/openstack/nova-specs/+/853837

REST API impact
---------------

None

Security impact
---------------

None

Notifications impact
--------------------

None

Other end user impact
---------------------

Users will experience a more reliable Ironic and Nova integration.

Performance Impact
------------------

It should help users more easily support large ironic deployments
integrated with Nova.

Other deployer impact
---------------------

We will rename the ``partition_key`` configuration to be explicitly
``conductor_group``.

We will deprecate the peer list key. When we start up and see
anything set, we emit a warning about the bugs in using this
legacy auto sharding, and recommend moving to the explicit sharding.

There is a new ``shard_key`` config, as described above.

There is a new nova-manage CLI command to move Ironic compute nodes
on forced-down nova-compute services to a new one.

Developer impact
----------------

None

Upgrade impact
--------------

For those currently using ``peer_list``, we need to document how they
can move to the new sharding approach.

Implementation
==============

Assignee(s)
-----------

Primary assignee:
  JayF

Other contributors:
  johnthetubaguy

Feature Liaison
---------------

Feature liaison: None

Work Items
----------

* rename conductor group partition key config
* deprecate peer_list config, with warning log messages
* add compute node move and delete protections, when peer_list not used
* add new shard_key config, limit ironic node list using shard_key
* add nova-manage tool to move ironic nodes between compute services
* document operational processes around above nova-manage tool

Dependencies
============

The deprecation of the peer list can happen right away.

But the new sharding depends on the Ironic shard key getting added:
https://review.opendev.org/c/openstack/ironic-specs/+/861803

Ideally we add this into Nova after robustify compute node has landed:
https://review.opendev.org/c/openstack/nova/+/842478

Testing
=======

We need some functional tests for the nova-manage command to ensure
all of the safety guards work as expected.

We need to ensure a tempest test exists which has multiple shards, with
only one shard containing valid, functional Ironic nodes. Then, ensure
that instances are only scheduled to the valid nodes.

Documentation Impact
====================

A lot of docs are needed for the Ironic driver on the operational
procedures around the ``shard_key``.

References
==========

None

History
=======

.. list-table:: Revisions
   :header-rows: 1

   * - Release Name
     - Description
   * - 2023.1 Antelope
     - Introduced
   * - 2023.2 Bobcat
     - Re-proposed, Partially implemented
   * - 2024.1 Caracal
     - Re-proposed