Merge "Re-Propose "Ironic Shards" for Bobcat/2023.2"
This commit is contained in:
commit
e0ef37d5d0
|
@ -0,0 +1,421 @@
|
||||||
|
..
 This work is licensed under a Creative Commons Attribution 3.0 Unported
 License.

 http://creativecommons.org/licenses/by/3.0/legalcode

==========================================
Ironic Shards
==========================================

https://blueprints.launchpad.net/nova/+spec/ironic-shards

Problem description
===================

Nova's Ironic driver involves a single nova-compute service managing
many compute nodes, where each compute node record maps to an Ironic node.
Some deployments support 1000s of ironic nodes, but a single nova-compute
service is unable to manage 1000s of nodes and 1000s of instances.

Currently we support setting a partition key, where nova-compute only
cares about a subset of ironic nodes: those associated with a specific
conductor group. However, some conductor groups can be very large,
served by many ironic-conductor services.

To help with this, Nova has attempted to dynamically spread ironic
nodes between a set of nova-compute peers. While this works some of
the time, there are some major limitations:

* when one nova-compute is down, only unassigned ironic nodes can
  move to another nova-compute service
* i.e. when one nova-compute is down, all ironic nodes with nova instances
  associated with the down nova-compute service cannot be
  managed, i.e. a reboot will fail
* moreover, when the old nova-compute comes back up, which might take
  some time, there are lots of bugs as the hash ring slowly rebalances,
  in part because every nova-compute fetches all nodes; in a large
  enough cloud, this can take over 24 hours.

This spec is about changing the way we shard Ironic compute nodes.
We need to stop violating deep assumptions in the compute manager
code by moving to more static ironic node partitions.

Use Cases
---------

Any users of the ironic driver that have more than one
nova-compute service per conductor group should move to an
active-passive failover mode.

The new static sharding will be of particular interest for clouds
with ironic conductor groups that contain more than around
1000 baremetal nodes.

.. NOTE: many parts of this story work today but
   need better documentation:

   * understanding the current scale limit of around 500-1000 ironic
     nodes per nova-compute, and the best way to scale beyond that
   * sharding ironic-conductors and nova-computes using
     ironic conductor groups.
     Note: conductor groups have a specific use in Ironic
     and this is not it, but it works for some users.
   * active-passive failover for nova-compute services
     running the ironic driver.
     Note: the time to start up a new process after a
     failover is far too high, particularly at larger
     scales without conductor groups.

Proposed change
===============

We add a new configuration option:

* [ironic] shard_key

By default, there will be no shard_key set, and we will continue to
expose all ironic nodes from a single nova-compute process.
Mostly, this is to keep things simple for smaller deployments,
i.e. when you have fewer than 500 ironic nodes.

When the operator sets a shard_key, the nova-compute process will
use the shard_key when querying the list of nodes in Ironic.
We must never try to list all Ironic nodes when
the Ironic shard key is defined in the config.

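As an illustrative sketch (the ``rack1`` shard value and the synthetic
host name are hypothetical choices made by the operator), the nova.conf
for one such nova-compute process might look like::

    [DEFAULT]
    # synthetic host name, so the service identity is independent
    # of the physical machine running this process
    host = ironic-rack1

    [ironic]
    # proposed option: only manage ironic nodes in this shard
    shard_key = rack1
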
When we look up a specific ironic node via a node uuid or
instance uuid, we should not restrict that to either the shard key
or conductor group.

Similar to checking the instance uuid is still present on the Ironic
node before performing an action, or ensuring there is no instance uuid
before provisioning, we should also check the node is in the correct
shard (and conductor group) before doing anything with that Ironic node.

Config changes and Deprecations
-------------------------------

We will keep the option to target a specific conductor group,
but this option will be renamed from partition_key to conductor_group.
This is additive to the shard_key above: when both are configured,
the target ironic nodes are those in both the correct `shard_key`
and the correct `conductor_group`.

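For example, a nova-compute that should manage only the intersection of
one conductor group and one shard might be configured like this (the
group and shard values are illustrative)::

    [ironic]
    # renamed from the deprecated partition_key option
    conductor_group = dc1-groupA
    # only nodes in both this shard and the above group are managed
    shard_key = rack1
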
We will deprecate the use of the `peer_list`.
We should log a warning when the hash ring is being used,
i.e. when more than one member has been added to the hash ring.

In addition, we need the logic that tries to move Compute Nodes
to never run unless the peer_list is larger than one. More details
in the data model impact section.

When deleting a ComputeNode object, we need to have the driver
confirm that it is safe. In the case of Ironic we will check whether
the configured Ironic has a node with that uuid, searching across all
conductor groups and all shard keys. When the ComputeNode object is not
deleted, we should not delete the entry in placement.

nova-manage move ironic node
----------------------------

We will create a new nova-manage command::

    nova-manage ironic-compute-node-move <ironic-node-uuid> \
        --service <destination-service>

This command will do the following:

* Find the ComputeNode object for this ironic-node-uuid
* Error if the ComputeNode type does not match the ironic driver.
* Find the related Service object for the above ComputeNode
  (i.e. the host)
* Error if the service object is not reported as down, and
  has not also been put into maintenance. We do not require
  forced down, because we might only be moving a subset of
  nodes associated with this nova-compute service.
* Check the Service object for the destination service host exists
* Find all non-deleted instances for this (host, node)
* Error if there is more than one non-deleted instance found.
  It is OK if we find zero or one instances.
* In one DB transaction:
  move the ComputeNode object to the destination service host and
  move the Instance (if there is one) to the destination service host

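For example, an operator would typically stop the source nova-compute
process, disable the service so it is in maintenance, wait for it to be
reported down, and only then move a node, along these lines (host names
are illustrative)::

    openstack compute service set --disable ironic-old nova-compute
    nova-manage ironic-compute-node-move <ironic-node-uuid> \
        --service ironic-new
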
The above tool is expected to be used as part of the wider process
of migrating from the old peer_list to the new shard key. There are
two key scenarios (although the tool may help operators recover from
other issues as well):

* moving from a peer_list to a single nova-compute
* moving from peer_list to shard_key, while keeping multiple nova-compute
  processes (for a single conductor group)

Migrate from peer_list to single nova-compute
---------------------------------------------

Small deployments (i.e. fewer than 500 ironic nodes)
are recommended to move from a peer_list of, for example,
three nova-compute services, to a single nova-compute service.
On failure of the nova-compute service, operators can either manually start
the processes on a new host, or use an automatic active-passive HA scheme.

The process would look something like this:

* ironic and nova both use an empty shard key by default,
  such that all ironic nodes are in the same default shard
* start a new nova-compute service running the ironic driver,
  ideally with a synthetic value for `[DEFAULT]host` e.g. `ironic`.
  This will log warnings about the need to use the nova-compute
  migration tool before being able to manage any nodes
* stop all existing nova-compute services
* mark them as forced-down via the API
* now loop around all ironic nodes and call this, assuming your
  nova-compute service has its host value of just `ironic`
  (see the sketch after this list):
  `nova-manage ironic-compute-node-move <uuid> --service ironic`

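A minimal sketch of that final loop, assuming the `openstack baremetal`
CLI is available and the new service host is named `ironic`::

    for node in $(openstack baremetal node list -f value -c UUID); do
        nova-manage ironic-compute-node-move "$node" --service ironic
    done
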
The periodic tasks in the new nova-compute service will gradually
pick up the new ComputeNodes, and will start being able to receive
commands such as reboot for all the moved instances.

You could instead start the new nova-compute service only after
having migrated all the ironic compute nodes, but that would
lead to higher downtime during the migration.

Migrate from peer_list to shard_key
-----------------------------------

The process to move from the hash ring based peer_list to the static
shard_key from ironic is very similar to the above process:

* Set the shard_key on all your ironic nodes, such that you can spread
  the nodes out between your nova-compute processes
* Start your new nova-compute processes, one for each `shard_key`,
  possibly setting a synthetic `[DEFAULT]host` value that matches the
  `my_shard_key`
* Shutdown all the older nova-compute processes with `[ironic]peer_list` set
* Mark those older services as in maintenance via the Nova API
* For each shard_key in Ironic, work out which service host you have mapped
  it to above, then run this for each ironic node uuid in the shard
  (see the sketch after this list):
  `nova-manage ironic-compute-node-move <uuid> --service my_shard_key`
* Delete the old services via the Nova API, now that there are no instances
  or compute nodes on those services

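A sketch of that per-shard loop, assuming the Ironic CLI gains a
`--shard` list filter (per the dependent Ironic spec) and that each
nova-compute host name matches its shard value::

    SHARD=my_shard_key
    for node in $(openstack baremetal node list --shard "$SHARD" \
            -f value -c UUID); do
        nova-manage ironic-compute-node-move "$node" --service "$SHARD"
    done
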
While you could start the new nova-compute services after the migration,
that would lead to a slightly longer downtime.

Adding new compute nodes
------------------------

In general, there is no change when adding nodes into existing
shards.

Similarly, you can add a new nova-compute process for a new shard
and then start to fill that up with nodes.

Move an ironic node between shards
----------------------------------

When removing nodes from ironic at the end of their life, or
adding large numbers of new nodes, you may need to rebalance
the shards.

To move some ironic nodes, you need to move the nodes in
groups associated with a specific nova-compute process.
For each nova-compute and the associated ironic nodes you
want to move to a different shard, you need to:

* Shutdown the affected nova-compute process
* Put the nova-compute services into maintenance
* In the Ironic API, update the shard key on the Ironic node
* Now move each ironic node to the correct new nova-compute
  process for the shard key it was moved into
  (see the sketch after this list):
  `nova-manage ironic-compute-node-move <uuid> --service my_shard_key`
* Now unset maintenance mode for the nova-compute,
  and start that service back up

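A per-node sketch of the middle steps, assuming the Ironic CLI grows a
shard setter (per the dependent Ironic spec), with illustrative shard
and uuid values::

    # update the shard key on the Ironic node
    openstack baremetal node set $NODE_UUID --shard rack2
    # move the matching ComputeNode to that shard's nova-compute
    nova-manage ironic-compute-node-move $NODE_UUID --service rack2
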
Move shards between nova-compute services
-----------------------------------------

To move a shard between nova-compute services, you need to
replace the nova-compute process with a new one:

* ensure the destination nova-compute is configured with the
  shard you want to move, and is running
* stop the nova-compute process currently serving the shard
* force-down the service via the API
* for each ironic node uuid in the shard, call nova-manage
  to move it to the new nova-compute process
  (see the sketch after this list)

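A sketch of those steps, with illustrative host names for the old and
new services; the `--shard` list filter assumes the dependent Ironic
spec lands::

    # force the old service down, then move every node in the shard
    openstack compute service set --down ironic-old nova-compute
    for node in $(openstack baremetal node list --shard rack1 \
            -f value -c UUID); do
        nova-manage ironic-compute-node-move "$node" --service ironic-new
    done
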
Alternatives
------------

We could require nova-compute processes to be explicitly forced down
before allowing nova-manage to move the ironic nodes about,
in a similar way to evacuate.
But this creates problems when trying to re-balance shards as you
remove nodes at the end of their life.

We could consider a list of shard keys, rather than a single shard key
per nova-compute. But for this first version, we have chosen the simpler
path, which appears to have few limitations.

We could attempt to keep fixing the hash ring recovery within the ironic
driver, but it is very unclear what will break next due to all the deep
assumptions made about the nova-compute process. The specific assumptions
include:

* when nova-compute breaks, it is usually the hypervisor hardware that
  has broken, which includes all the nova servers running on it.
* all locking and management of a nova server object is done by the
  currently assigned nova-compute node, and this is only ever changed
  by explicit move operations like resize, migrate, live-migration
  and evacuate. As such we can use simple local locks to ensure
  concurrent operations don't conflict, along with DB state checking.

Data model impact
-----------------

A key thing we need to ensure is that ComputeNode objects are only
automatically moved between service objects when in legacy hash ring mode.
Currently, this only happens for unassigned ComputeNodes.

In this new explicit shard mode, only nova-manage is able to move
ComputeNode objects. In addition, nova-manage will also move associated
instances. However, similar to evacuate, this will only be allowed
when the currently associated service is forced down.

Note, this applies when a nova-compute finds a ComputeNode that it should
own, but the Nova database says it is already owned by a different service.
In this scenario, we should log a warning telling the operator
to ensure they have migrated that ComputeNode from its old location
before this nova-compute service is able to manage it.

In addition, we should ensure we only delete a ComputeNode object
when the driver explicitly says it is safe to delete. In the case of
the Ironic driver, we should ensure the node no longer exists in
Ironic, being sure to search across all shards.

This is all very related to this spec on robustifying
the Compute Node and Service object relationship:
https://review.opendev.org/c/openstack/nova-specs/+/853837

REST API impact
---------------

None

Security impact
---------------

None

Notifications impact
--------------------

None

Other end user impact
---------------------

Users will experience a more reliable Ironic and Nova integration.

Performance Impact
------------------

It should help users more easily support large ironic deployments
integrated with Nova.

Other deployer impact
---------------------

We will rename the "partition_key" configuration option to the more
explicit "conductor_group".

We will deprecate the peer_list option. When we start up and see
anything set, we emit a warning about the bugs in using this
legacy auto sharding, and recommend moving to the explicit sharding.

There is a new `shard_key` config option, as described above.

There is a new nova-manage CLI command to move Ironic compute nodes
on forced-down nova-compute services to a new one.

Developer impact
----------------

None

Upgrade impact
--------------

For those currently using peer_list, we need to document how they
can move to the new sharding approach.

Implementation
==============

Assignee(s)
-----------

Primary assignee:
  JayF

Other contributors:
  johnthetubaguy

Feature Liaison
---------------

Feature liaison: None

Work Items
----------

* rename the conductor group partition key config
* deprecate the peer_list config, with warning log messages
* add compute node move and delete protections, when peer_list is not used
* add the new shard_key config, limiting the ironic node list using shard_key
* add the nova-manage tool to move ironic nodes between compute services
* document operational processes around the above nova-manage tool

Dependencies
============

The deprecation of the peer_list can happen right away.

But the new sharding depends on the Ironic shard key getting added:
https://review.opendev.org/c/openstack/ironic-specs/+/861803

Ideally we add this into Nova after the robustify compute node work has
landed: https://review.opendev.org/c/openstack/nova/+/842478

Testing
=======

We need some functional tests for the nova-manage command to ensure
all of the safety guards work as expected.

Documentation Impact
====================

A lot of docs are needed for the Ironic driver on the operational
procedures around the shard_key.

References
==========

None

History
=======

.. list-table:: Revisions
   :header-rows: 1

   * - Release Name
     - Description
   * - 2023.1 Antelope
     - Introduced
   * - 2023.2 Bobcat
     - Re-proposed