Re-Propose "Ironic Shards" for Bobcat/2023.2

Change-Id: Ifc8a1a77f0af93c314fe0cc54ccb1d36e0f8dc30
..
This work is licensed under a Creative Commons Attribution 3.0 Unported
License.
http://creativecommons.org/licenses/by/3.0/legalcode
==========================================
Ironic Shards
==========================================
https://blueprints.launchpad.net/nova/+spec/ironic-shards
Problem description
===================
Nova's Ironic driver involves a single nova-compute service managing
many compute nodes, where each compute node record maps to an Ironic node.
Some deployments support 1000s of ironic nodes, but a single nova-compute
service is unable to manage 1000s of nodes and 1000s of instances.
Currently we support setting a partition key, where nova-compute only
cares about a subset of ironic nodes, those associated with a specific
conductor group. However, some conductor groups can be very large,
served by many ironic-conductor services.
To help with this, Nova has attempted to dynamically spread ironic
nodes between a set of nova-compute peers. While this works some of
the time, there are some major limitations:
* when one nova-compute is down, only unassigned ironic nodes can
move to another nova-compute service
* i.e. when one nova-compute is down, all ironic nodes with nova instances
associated with the down nova-compute service cannot be
managed, e.g. a reboot will fail
* moreover, when the old nova-compute comes back up, which might take
some time, there are lots of bugs as the hash ring slowly rebalances.
In part because every nova-compute fetches all nodes, this rebalancing
can take over 24 hours in a large enough cloud.
This spec is about tweaking the way we shard Ironic compute nodes.
We need to stop violating deep assumptions in the compute manager
code by moving to more static ironic node partitions.
Use Cases
---------
Any users of the ironic driver that have more than one
nova-compute service per conductor group should move to an
active-passive failover mode.
The new static sharding will be of particular interest for clouds
with ironic conductor groups that are greater than around
1000 baremetal nodes.
.. NOTE: many parts of this story work today but
need better documentation:
* understanding the current scale limit of around 500-1000 ironic
nodes per nova-compute, and the best way to scale beyond that
* sharding ironic-conductors and nova-computes using
ironic conductor groups.
Note: conductor groups have a specific use in Ironic
and this is not it, but it works for some users.
* active-passive failover for nova-compute services
running the ironic driver.
Note: the time to start up a new process after a
failover is way too high, particularly at larger
scales without conductor groups.
Proposed change
===============
We add a new configuration option:
* [ironic] shard_key
By default, there will be no shard_key set, and we will continue to
expose all ironic nodes from a single nova-compute process.
Mostly, this is to keep things simple for smaller deployments,
i.e. when you have fewer than 500 ironic nodes.
When the operator sets a shard_key, the nova-compute process will
use the shard_key when querying a list of nodes in Ironic.
We must never try to list all Ironic nodes when
the Ironic shard key is defined in the config.
When we look up a specific ironic node via a node uuid or
instance uuid, we should not restrict that to either the shard key
or conductor group.
Similar to checking that the instance uuid is still present on the Ironic
node before performing an action, or ensuring there is no instance uuid
before provisioning, we should also check that the node is in the correct
shard (and conductor group) before doing anything with that Ironic node.
Config changes and Deprecations
-------------------------------
We will keep the option to target a specific conductor group,
but this option will be renamed from partition_key to conductor_group.
This is additive to the shard_key above: the target ironic nodes are
those in both the correct `shard_key` and the correct `conductor_group`,
when both are configured.
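
For example, a nova-compute process that should manage only one shard
of one conductor group might be configured like this (the values shown
are purely illustrative)::

  [ironic]
  conductor_group = rack-a
  shard_key = shard-1
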
We will deprecate the use of the `peer_list`.
We should log a warning when the hash ring is being used,
i.e. when it has more than one member added to the hash ring.
In addition, the logic that tries to move ComputeNodes must
never run unless the peer_list has more than one member. More details
are in the data model impact section.
When deleting a ComputeNode object, we need to have the driver
confirm that it is safe. In the case of Ironic we will check to see if
the configured Ironic has a node with that uuid, searching across all
conductor groups and all shard keys. When the ComputeNode object is not
deleted, we should not delete the entry in placement.
nova-manage move ironic node
----------------------------
We will create a new nova-manage command::
nova-manage ironic-compute-node-move <ironic-node-uuid> \
--service <destination-service>
This command will do the following (an example invocation is shown after the list):
* Find the ComputeNode object for this ironic-node-uuid
* Error if the ComputeNode type does not match the ironic driver.
* Find the related Service object for the above ComputeNode
(i.e. the host)
* Error if the service object is not reported as down, and
has not also been put into maintenance. We do not require
forced down, because we might only be moving a subset of
nodes associated with this nova-compute service.
* Check the Service object for the destination service host exists
* Find all non-deleted instances for this (host,node)
* Error if more than one non-deleted instance is found.
It is OK if we find zero or one instance.
* In one DB transaction:
move the ComputeNode object to the destination service host and
move the Instance (if there is one) to the destination service host
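
For example, moving a single ironic node to a nova-compute service
whose `[DEFAULT]host` is `ironic-shard-1` (both values below are
illustrative) could look like::

  # check the destination service host exists
  openstack compute service list --service nova-compute --host ironic-shard-1

  # move the compute node (and its instance, if any) to that service
  nova-manage ironic-compute-node-move 123e4567-e89b-12d3-a456-426614174000 \
      --service ironic-shard-1
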
The above tool is expected to be used as part of this wider process
of migrating from the old peer_list to the new shard key. There are
two key scenarios (although the tool may help operators recover from
other issues as well):
* moving from a peer_list to a single nova-compute
* moving from peer_list to shard_key, while keeping multiple nova-compute
processes (for a single conductor group)
Migrate from peer_list to single nova-compute
---------------------------------------------
Small deployments (i.e. less than 500 ironic nodes)
are recommended to move from a peer_list of, for example,
three nova-compute services, to a single nova-compute service.
On failure of the nova-compute service, operators can either manually start
the processes on a new host, or use an automatic active-passive HA scheme.
The process would look something like this:
* ironic and nova both default to an empty shard key,
such that all ironic nodes are in the same default shard
* start a new nova-compute service running the ironic driver,
ideally with a synthetic value for `[DEFAULT]host`, e.g. `ironic`.
This will log warnings about the need to use the nova-compute
migration tool before being able to manage any nodes
* stop all existing nova-compute services
* mark them as forced-down via the API
* Now loop over all ironic nodes and call this, assuming your
nova-compute service has a host value of just `ironic`:
`nova-manage ironic-compute-node-move <uuid> --service ironic`
The periodic tasks in the new nova-compute service will gradually
pick up the new ComputeNodes, and will start being able to receive
commands such as reboot for all the moved instances.
You could instead start the new nova-compute service only after
having migrated all the ironic compute nodes, but that would
lead to higher downtime during the migration.
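
A minimal sketch of the forced-down and node-move steps above, assuming
the new service uses the host value `ironic` and the old services were
`old-compute-1` and `old-compute-2` (all names illustrative)::

  # mark the stopped services as forced down via the API
  openstack compute service set --down old-compute-1 nova-compute
  openstack compute service set --down old-compute-2 nova-compute

  # move every ironic node (and any instance on it) to the new service
  for uuid in $(openstack baremetal node list -f value -c UUID); do
      nova-manage ironic-compute-node-move "$uuid" --service ironic
  done
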
Migrate from peer_list to shard_key
-----------------------------------
The process to move from the hash ring based peer_list to the static
shard_key from ironic is very similar to the above process:
* Set the shard_key on all your ironic nodes, such that you can spread
the nodes out between your nova-compute processes.
* Start your new nova-compute processes, one for each `shard_key`,
possibly setting a synthetic `[DEFAULT]host` value that matches the
shard_key (e.g. `my_shard_key`).
* Shut down all the older nova-compute processes with `[ironic]peer_list` set
* Mark those older services as in maintenance via the Nova API
* For each shard_key in Ironic, work out which service host you have mapped
each one to above, then run this for each ironic node uuid in the shard:
`nova-manage ironic-compute-node-move <uuid> --service my_shard_key`
* Delete the old services via the Nova API, now that there are no instances
or compute nodes on those services
While you could start the new nova-compute services after the migration,
that would lead to a slightly longer downtime.
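
As a rough sketch of the per-node steps above, assuming the baremetal
CLI gains a `--shard` option once the Ironic shard key work lands (see
Dependencies), with `my_shard_key` as an illustrative value::

  # tag the node with its shard; --shard is assumed to be added by the
  # Ironic shard key change this spec depends on
  openstack baremetal node set <ironic-node-uuid> --shard my_shard_key

  # then move each node uuid in that shard to the matching service host
  nova-manage ironic-compute-node-move <ironic-node-uuid> --service my_shard_key
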
Adding new compute nodes
------------------------
In general, there is no change when adding nodes into existing
shards.
Similarly, you can add a new nova-compute process for a new shard
and then start to fill that up with nodes.
Move an ironic node between shards
----------------------------------
When removing nodes from ironic at the end of their life, or
adding large numbers of new nodes, you may need to rebalance
the shards.
To move some ironic nodes, you need to move the nodes in
groups associated with a specific nova-compute process.
For each nova-compute and the associated ironic nodes you
want to move to a different shard you need to:
* Shutdown the affected nova-compute process
* Put the nova-compute services into maintenance
* In Ironic API update the shard key on the Ironic node
* Now move each ironic node to the correct new nova-compute
process for the shard key it was moved into:
`nova-manage ironic-compute-node-move <uuid> --service my_shard_key`
* Now unset maintenance mode for the nova-compute,
and start that service back up
Move shards between nova-compute services
-----------------------------------------
To move a shard between nova-compute services, you need to
replace the nova-compute process with a new one:
* ensure the destination nova-compute is configured with the
shard you want to move, and is running
* stop the nova-compute process currently serving the shard
* force-down the service via the API
* for each ironic node uuid in the shard call nova-manage
to move it to the new nova-compute process
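
A sketch of this, with `old-shard-host` and `new-shard-host` as
illustrative service host names (the nodes keep their shard key, only
the owning nova-compute service changes)::

  # the old process is stopped, then marked as forced down via the API
  openstack compute service set --down old-shard-host nova-compute

  # each ironic node uuid in the shard is then moved to the new service
  nova-manage ironic-compute-node-move <ironic-node-uuid> --service new-shard-host
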
Alternatives
------------
We could require nova-compute processes to be explicitly forced down,
before allowing nova-manage to move the ironic nodes about,
in a similar way to evacuate.
But this creates problems when trying to re-balance shards as you
remove nodes at the end of their life.
We could consider a list of shard keys, rather than a single shard key
per nova-compute. But for this first version, we have chosen the simpler
path, that appears to have few limitations.
We could attempt to keep fixing the hash ring recovery within the ironic
driver, but it is very unclear what will break next due to all the deep
assumptions made about the nova-compute process. The specific assumptions
include:
* when nova-compute breaks, it is usually the hypervisor hardware that
has broken, which affects all the nova servers running on it.
* all locking and management of a nova server object is done by the
currently assigned nova-compute node, and this is only ever changed
by explicit move operations like resize, migrate, live-migration
and evacuate. As such we can use simple local locks to ensure
concurrent operations don't conflict, along with DB state checking.
Data model impact
-----------------
A key thing we need to ensure is that ComputeNode objects are only
automatically moved between service objects when in legacy hash ring mode.
Currently, this only happens for unassigned ComputeNodes.
In this new explicit shard mode, only nova-manage is able to move
ComputeNode objects. In addition, nova-manage will also move associated
instances. However, similar to evacuate, this will only be allowed
when the currently associated service is forced down.
Note, this applies when a nova-compute finds a ComputeNode that it should
own, but the Nova database says it is already owned by a different service.
In this scenario, we should log a warning to the operator
to ensure they have migrated that ComputeNode from its old location
before this nova-compute service is able to manage it.
In addition, we should ensure we only delete a ComputeNode object
when the driver explicitly says it is safe to delete. In the case of
the Ironic driver, we should ensure the node no longer exists in
Ironic, being sure to search across all shards.
This is all closely related to this spec on robustifying
the ComputeNode and Service object relationship:
https://review.opendev.org/c/openstack/nova-specs/+/853837
REST API impact
---------------
None
Security impact
---------------
None
Notifications impact
--------------------
None
Other end user impact
---------------------
Users will experience a more reliable Ironic and Nova integration.
Performance Impact
------------------
It should help users more easily support large ironic deployments
integrated with Nova.
Other deployer impact
---------------------
We will rename the "partition_key" configuration to be explicitly
"conductor_group".
We will deprecate the peer_list key. When we start up and see
anything set, we emit a warning about the bugs in using this
legacy auto sharding, and recommend moving to the explicit sharding.
There is a new `shard_key` config, as described above.
There is a new nova-manage CLI command to move Ironic compute nodes
on forced-down nova-compute services to a new one.
Developer impact
----------------
None
Upgrade impact
--------------
For those currently using peer_list, we need to document how they
can move to the new sharding approach.
Implementation
==============
Assignee(s)
-----------
Primary assignee:
JayF
Other contributors:
johnthetubaguy
Feature Liaison
---------------
Feature liaison: None
Work Items
----------
* rename conductor group partition key config
* deprecate peer_list config, with warning log messages
* add compute node move and delete protections, when peer_list not used
* add new shard_key config, limit ironic node list using shard_key
* add nova-manage tool to move ironic nodes between compute services
* document operational processes around above nova-manage tool
Dependencies
============
The deprecation of the peer list can happen right away.
But the new sharding depends on the Ironic shard key getting added:
https://review.opendev.org/c/openstack/ironic-specs/+/861803
Ideally we add this into Nova after robustify compute node has landed:
https://review.opendev.org/c/openstack/nova/+/842478
Testing
=======
We need some functional tests for the nova-manage command to ensure
all of the safety guards work as expected.
Documentation Impact
====================
A lot of docs are needed for the Ironic driver on the operational
procedures around the shard_key.
References
==========
None
History
=======
.. list-table:: Revisions
   :header-rows: 1

   * - Release Name
     - Description
   * - 2023.1 Antelope
     - Introduced
   * - 2023.2 Bobcat
     - Re-proposed