Re-re-propose ironic-multiple-compute-hosts

This re-writes the ironic-multiple-compute-hosts spec to use a hash
ring, rather than messing around with how we schedule.

Change-Id: I51de94e3fbe301aeed35a6456ed0b7350aefa317
Jim Rollenhagen
2016-08-02 17:48:17 -04:00
parent ffed7f72c5
commit 341143e32b


@@ -41,71 +41,71 @@ be able to scale to 10^5 nodes.
Proposed change
===============

-In general, a nova-compute running the Ironic virt driver should only
-register as a single row in the compute_nodes table, rather than many rows.
-
-Nova's scheduler should schedule only to a nova-compute host; the host will
-choose an Ironic node itself, from the nodes that match the request
-(explained further below). Once an instance is placed on a given
-nova-compute service host, that host will always manage other requests for
-that instance (delete, etc). So the instance count scheduler filter can just
-be used here to equally distribute instances between compute hosts. This
-reduces the failure domain to failing actions for existing instances on a
-compute host, if a compute host happens to fail.
-
-The Ironic virt driver should be modified to call an Ironic endpoint with
-the request spec for the instance. This endpoint will reserve a node, and
-return that node. The virt driver will then deploy the instance to this
-node. When the instance is destroyed, the reservation should also be
-destroyed.
-
-This endpoint will take parameters related to the request spec, and is
-being worked on on the Ironic side [0]. This has not yet been solidified,
-but it will have, at a minimum, all fields in the flavor object. This might
-look something like::
-
-  {
-      "memory_mb": 1024,
-      "vcpus": 8,
-      "vcpu_weight": null,
-      "root_gb": 20,
-      "ephemeral_gb": 10,
-      "swap": 2,
-      "rxtx_factor": 1.0,
-      "extra_specs": {
-          "capabilities": "supports_uefi,has_gpu",
-      },
-      "image": {
-          "id": "some-uuid",
-          "properties": {...},
-      },
-  }
+We'll lift some hash ring code from ironic (to be put into oslo soon), to be
+used to do consistent hashing of ironic nodes among multiple nova-compute
+services. The hash ring is used within the driver itself, and is refreshed
+at each resource tracker run.
+
+get_available_nodes() will now return a subset of nodes, determined by the
+following rules:
+
+* any node with an instance managed by the compute service
+* any node that is mapped to the compute service on the hash ring
+* no nodes with instances managed by another compute service
+
+The virt driver finds all compute services that are running the ironic
+driver by joining the services table and the compute_nodes table. Since
+there won't be any records in the compute_nodes table for a service that is
+starting for the first time, the virt driver also adds its own compute
+service into this list. The hostnames in this list are what is used to
+instantiate the hash ring.
+
+As nova-compute services are brought up or down, the ring will re-balance.
+It's important to note that this re-balance does not occur at the same time
+on all compute services, so for some amount of time, an ironic node may be
+managed by more than one compute service. In other words, there may be two
+compute_nodes records for a single ironic node, each with a different host
+value. For scheduling purposes, this is okay, because either compute service
+is capable of actually spawning an instance on the node (the ironic service
+doesn't know about this hashing). This will cause capacity reporting (e.g.
+nova hypervisor-stats) to over-report capacity for this time. Once all
+compute services in the cluster have done a resource tracker run and
+re-balanced the hash ring, this will be back to normal.
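
As a rough sketch of how the ring could be instantiated from those hostnames
(illustrative only, not the ironic implementation; ``HashRing`` and
``build_ring`` are names invented for this example), it might look something
like::

  import bisect
  import hashlib


  class HashRing(object):
      """Toy hash ring standing in for the code lifted from ironic."""

      def __init__(self, hosts, partitions_per_host=32):
          # hosts: iterable of nova-compute service hostnames.
          self._ring = {}  # position on the ring -> hostname
          for host in hosts:
              for i in range(partitions_per_host):
                  self._ring[self._hash('%s-%d' % (host, i))] = host
          self._positions = sorted(self._ring)

      @staticmethod
      def _hash(key):
          return int(hashlib.md5(key.encode('utf-8')).hexdigest(), 16)

      def get_hosts(self, key):
          """Return the hostname(s) responsible for key (a node UUID)."""
          position = bisect.bisect(self._positions, self._hash(key))
          position %= len(self._positions)  # wrap around the end of the ring
          return [self._ring[self._positions[position]]]


  def build_ring(peer_hostnames, my_hostname):
      # Peers come from joining the services and compute_nodes tables; a
      # freshly started service has no compute_nodes records yet, so it
      # always adds its own hostname to the membership list.
      return HashRing(set(peer_hostnames) | {my_hostname})

Because only the partitions owned by a changed host move, bringing a compute
service up or down should re-map only a fraction of the ironic nodes rather
than the whole set.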

+It's also important to note that, due to the way nodes with instances are
+handled, if an instance is deleted while the compute service is down, that
+node will be removed from the compute_nodes table when the service comes
+back up (as each service will see an instance on the node object, and assume
+another compute service manages that instance). The ironic node will remain
+active and orphaned. Once the periodic task to reap deleted instances runs,
+the ironic node will be torn down and the node will again be reported in the
+compute_nodes table.
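
A similarly hedged sketch of the node selection rules above, where
``hash_ring`` is any object exposing a ``get_hosts()`` lookup (such as the
toy ring sketched earlier) and ``instance_host_by_node`` stands in for the
driver's instance lookup, could look like::

  MY_HOST = 'compute-1.example.com'  # stands in for CONF.host (example)

  def get_available_nodes(node_uuids, instance_host_by_node, hash_ring):
      """Return the ironic node UUIDs this compute service should report."""
      chosen = []
      for node_uuid in node_uuids:
          instance_host = instance_host_by_node.get(node_uuid)
          node_hosts = hash_ring.get_hosts(node_uuid)
          if instance_host == MY_HOST:
              # Rule 1: we already manage an instance on this node.
              chosen.append(node_uuid)
          elif instance_host is None and MY_HOST in node_hosts:
              # Rule 2: no instance, and the hash ring maps the node to us.
              chosen.append(node_uuid)
          # Rule 3: an instance managed by another compute service means we
          # skip the node. This is also why a node whose instance was
          # deleted while its compute service was down drops out of
          # compute_nodes until the deleted instance is reaped.
      return chosen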

-We will report (total ironic capacity) into the resource tracker for each
-compute host. This will end up over-reporting total available capacity to
-Nova, however it is the least wrong option here. Other (worse) options are:
-
-* Report (total ironic capacity)/(number of compute hosts) from each compute
-  host. This is more "right", but has the possibility of a compute host
-  reporting (usage) > (max capacity), and therefore becoming unable to
-  perform new build actions.
-
-* Report some arbitrary incorrect number for total capacity, and try to make
-  the scheduler ignore it. This reports numbers more incorrectly, and also
-  takes more code and has more room for error.
+It's all very eventually consistent, with a potentially long time to reach
+that consistency.
+
+There's no configuration to enable this mode; it's always running. For
+deployments that continue to use only one compute service, this has the same
+behavior as today.

Alternatives
------------

-Do what we do today, with active/passive failover.
-
-Doing active/passive failover well is not an easy task, and doesn't account
-for all possible failures. This also does not follow Nova's prescribed model
-for compute failure. Furthermore, the resource tracker initialization is
-slow with many Ironic nodes, and so a cold failover could take minutes.
-
-Resource providers [1] may be another viable alternative, but we shouldn't
-have a hard dependency on that.
+Do what we do today, with active/passive failover. Doing active/passive
+failover well is not an easy task, and doesn't account for all possible
+failures. This also does not follow Nova's prescribed model for compute
+failure. Furthermore, the resource tracker initialization is slow with many
+Ironic nodes, and so a cold failover could take minutes.
+
+Another alternative is to make nova's scheduler only choose a compute
+service running the ironic driver (essentially at random) and let the
+scheduling to a given node be determined between the virt driver and ironic.
+The downsides here are that operators no longer have a pluggable scheduler
+(unless we build one in ironic), and we'll have to do lots of work to ensure
+there aren't scheduling races between the compute services.

Data model impact
-----------------
@@ -142,13 +142,7 @@ smaller and improve the performance of the resource tracker loop.
Other deployer impact
---------------------

-A version of Ironic that supports the reservation endpoint must be deployed
-before a version of Nova with this change is deployed. If this is not the
-case, the previous behavior should be used. We'll need to properly deprecate
-the old behavior behind a config option, as deployers will need to configure
-different scheduler filters and host managers than the current
-recommendation for this to work correctly. We should investigate if this can
-be done gracefully without a new config option, however I'm not sure it's
-possible.
+None.

Developer impact
----------------
@@ -166,43 +160,42 @@ Primary assignee:
  jim-rollenhagen (jroll)

Other contributors:
-  devananda
+  dansmith
  jaypipes

Work Items
----------

-* Change the Ironic driver to be a 1:1 host:node mapping.
-* Change the Ironic driver to get reservations from Ironic.
+* Import the hash ring code into Nova.
+* Use the hash ring in the virt driver to shard nodes among compute daemons.

Dependencies
============

-This depends on a new endpoint in Ironic. [0]
+None.

Testing
=======

-This should be tested by being the default configuration.
+This code will run in the default devstack configuration.
+
+We also plan to add a CI job that runs the ironic driver with multiple
+compute hosts, but this likely won't happen until Ocata.

Documentation Impact
====================

-Deployer documentation will need updates to specify how this works, since it
-is different than most drivers.
+Maybe an ops guide update, however I'd like to leave that for next cycle
+until we're pretty sure this is stable.

References
==========

-[0] https://review.openstack.org/#/c/204641/
-[1] https://review.openstack.org/#/c/225546/

History
=======
@@ -216,3 +209,4 @@ History
  - Introduced but no changes merged.
* - Newton
  - Re-proposed.
+  - Completely re-written to use a hash ring.