Re-re-propose ironic-multiple-compute-hosts
This re-writes the ironic-multiple-compute-hosts spec to use a hash ring,
rather than messing around with how we schedule.

Change-Id: I51de94e3fbe301aeed35a6456ed0b7350aefa317
@@ -41,71 +41,71 @@ be able to scale to 10^5 nodes.
 Proposed change
 ===============

-In general, a nova-compute running the Ironic virt driver should only
-register as a single row in the compute_nodes table, rather than many rows.
+We'll lift some hash ring code from ironic (to be put into oslo
+soon), to be used to do consistent hashing of ironic nodes among
+multiple nova-compute services. The hash ring is used within the
+driver itself, and is refreshed at each resource tracker run.

-Nova's scheduler should schedule only to a nova-compute host; the host will
-choose an Ironic node itself, from the nodes that match the request (explained
-further below). Once an instance is placed on a given nova-compute service
-host, that host will always manage other requests for that instance (delete,
-etc). So the instance count scheduler filter can just be used here to equally
-distribute instances between compute hosts. This reduces the failure domain to
-failing actions for existing instances on a compute host, if a compute host
-happens to fail.
+get_available_nodes() will now return a subset of nodes,
+determined by the following rules:

-The Ironic virt driver should be modified to call an Ironic endpoint with
-the request spec for the instance. This endpoint will reserve a node, and
-return that node. The virt driver will then deploy the instance to this node.
-When the instance is destroyed, the reservation should also be destroyed.
+* any node with an instance managed by the compute service
+* any node that is mapped to the compute service on the hash ring
+* no nodes with instances managed by another compute service

-This endpoint will take parameters related to the request spec, and is being
-worked on the Ironic side here.[0] This has not yet been solidified, but it
-will have, at a minimum, all fields in the flavor object. This might look
-something like::
+The virt driver finds all compute services that are running the
+ironic driver by joining the services table and the compute_nodes
+table. Since there won't be any records in the compute_nodes table
+for a service that is starting for the first time, the virt driver
+also adds its own compute service into this list. The list of all
+hostnames in this list is what is used to instantiate the hash ring.

-  {
-      "memory_mb": 1024,
-      "vcpus": 8,
-      "vcpu_weight": null,
-      "root_gb": 20,
-      "ephemeral_gb": 10,
-      "swap": 2,
-      "rxtx_factor": 1.0,
-      "extra_specs": {
-          "capabilities": "supports_uefi,has_gpu",
-      },
-      "image": {
-          "id": "some-uuid",
-          "properties": {...},
-      },
-  }
+As nova-compute services are brought up or down, the ring will
+re-balance. It's important to note that this re-balance does not
+occur at the same time on all compute services, so for some amount
+of time, an ironic node may be managed by more than one compute
+service. In other words, there may be two compute_nodes records
+for a single ironic node, with a different host value. For
+scheduling purposes, this is okay, because either compute service
+is capable of actually spawning an instance on the node (because the
+ironic service doesn't know about this hashing). This will cause
+capacity reporting (e.g. nova hypervisor-stats) to over-report
+capacity for this time. Once all compute services in the cluster
+have done a resource tracker run and re-balanced the hash ring,
+this will be back to normal.

+It's also important to note that, due to the way nodes with instances
+are handled, if an instance is deleted while the compute service is
+down, that node will be removed from the compute_nodes table when
+the service comes back up (as each service will see an instance on
+the node object, and assume another compute service manages that
+instance). The ironic node will remain active and orphaned. Once
+the periodic task to reap deleted instances runs, the ironic node
+will be torn down and the node will again be reported in the
+compute_nodes table.

-We will report (total ironic capacity) into the resource tracker for each
-compute host. This will end up over-reporting total available capacity to Nova,
-however is the least wrong option here. Other (worse) options are:
+It's all very eventually consistent, with a potentially long time
+to eventual.

-* Report (total ironic capacity)/(number of compute hosts) from each compute
-  host. This is more "right", but has the possibility of a compute host
-  reporting (usage) > (max capacity), and therefore becoming unable to perform
-  new build actions.
+There's no configuration to enable this mode; it's always running. For
+deployments that continue to use only one compute service, this has the
+same behavior as today.

-* Report some arbitrary incorrect number for total capacity, and try to make
-  the scheduler ignore it. This reports numbers more incorrectly, and also
-  takes more code and has more room for error.

 Alternatives
 ------------

-Do what we do today, with active/passive failover.
+Do what we do today, with active/passive failover. Doing active/passive
+failover well is not an easy task, and doesn't account for all possible
+failures. This also does not follow Nova's prescribed model for compute
+failure. Furthermore, the resource tracker initialization is slow with many
+Ironic nodes, and so a cold failover could take minutes.

-Doing active/passive failover well is not an easy task, and doesn't account for
-all possible failures. This also does not follow Nova's prescribed model for
-compute failure. Furthermore, the resource tracker initialization is slow
-with many Ironic nodes, and so a cold failover could take minutes.
-
-Resource providers[1] may be another viable alternative, but we shouldn't
-have a hard dependency on that.
+Another alternative is to make nova's scheduler only choose a compute service
+running the ironic driver (essentially at random) and let the scheduling to
+a given node be determined between the virt driver and ironic. The downsides
+here are that operators no longer have a pluggable scheduler (unless we build
+one in ironic), and we'll have to do lots of work to ensure there aren't
+scheduling races between the compute services.

 Data model impact
 -----------------
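
To make the sharding described in the hunk above concrete, here is a minimal
sketch of a consistent hash ring plus the three get_available_nodes() rules.
This is illustrative Python only; the names (HashRing, get_available_nodes,
_hash) and the toy md5-based ring are assumptions for the sketch, not the
actual ironic/oslo hash ring code or the nova driver's real interface, and the
list of service hostnames is taken as an input rather than read from the
services/compute_nodes join::

    # Toy consistent hash ring; hypothetical names, for illustration only.
    import bisect
    import hashlib


    def _hash(key):
        # Map a string to an integer position on the ring.
        return int(hashlib.md5(key.encode('utf-8')).hexdigest(), 16)


    class HashRing(object):
        """Maps ironic node UUIDs to nova-compute hostnames."""

        def __init__(self, hosts, replicas=16):
            # Several "virtual" points per host spread load evenly and limit
            # how many nodes move when a host joins or leaves the ring.
            self._ring = sorted(
                (_hash('%s-%d' % (host, i)), host)
                for host in hosts
                for i in range(replicas))
            self._keys = [key for key, _host in self._ring]

        def get_host(self, node_uuid):
            # Walk clockwise from the node's position to the next host point.
            index = bisect.bisect(self._keys, _hash(node_uuid)) % len(self._ring)
            return self._ring[index][1]


    def get_available_nodes(my_host, service_hosts, node_uuids, instance_hosts):
        """Return the ironic nodes this compute service should manage.

        instance_hosts maps node uuid -> host owning its instance, for nodes
        that currently have an instance.
        """
        # The ring is rebuilt from every compute service running the ironic
        # driver, plus this host in case it has no compute_nodes record yet.
        ring = HashRing(set(service_hosts) | {my_host})
        available = []
        for node in node_uuids:
            owner = instance_hosts.get(node)
            if owner == my_host:
                # Rule 1: keep any node whose instance we already manage.
                available.append(node)
            elif owner is not None:
                # Rule 3: skip nodes whose instances another service manages.
                continue
            elif ring.get_host(node) == my_host:
                # Rule 2: otherwise take the nodes the hash ring maps to us.
                available.append(node)
        return available


    if __name__ == '__main__':
        hosts = ['compute1', 'compute2', 'compute3']
        nodes = ['node-%d' % i for i in range(6)]
        instances = {'node-0': 'compute2'}  # node-0 already hosts an instance
        for host in hosts:
            print(host, get_available_nodes(host, hosts, nodes, instances))

Because a host only owns the ring segments adjacent to its own points, adding
or removing a nova-compute service changes the owner of only a fraction of the
nodes, which is what keeps the periodic re-balance described above cheap and
eventually consistent.
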
@@ -142,13 +142,7 @@ smaller and improve the performance of the resource tracker loop.
 Other deployer impact
 ---------------------

-A version of Ironic that supports the reservation endpoint must be deployed
-before a version of Nova with this change is deployed. If this is not the
-case, the previous behavior should be used. We'll need to properly deprecate
-the old behavior behind a config option, as deployers will need to configure
-different scheduler filters and host managers than the current recommendation
-for this to work correctly. We should investigate if this can be done
-gracefully without a new config option, however I'm not sure it's possible.
+None.

 Developer impact
 ----------------
@@ -166,43 +160,42 @@ Primary assignee:
   jim-rollenhagen (jroll)

 Other contributors:
-  devananda
+  dansmith
   jaypipes

 Work Items
 ----------

-* Change the Ironic driver to be a 1:1 host:node mapping.
+* Import the hash ring code into Nova.

-* Change the Ironic driver to get reservations from Ironic.
+* Use the hash ring in the virt driver to shard nodes among compute daemons.


 Dependencies
 ============

-This depends on a new endpoint in Ironic.[0]
+None.


 Testing
 =======

-This should be tested by being the default configuration.
+This code will run in the default devstack configuration.

+We also plan to add a CI job that runs the ironic driver with multiple
+compute hosts, but this likely won't happen until Ocata.


 Documentation Impact
 ====================

-Deployer documentation will need updates to specify how this works, since it
-is different than most drivers.
+Maybe an ops guide update, however I'd like to leave that for next cycle until
+we're pretty sure this is stable.


 References
 ==========

-[0] https://review.openstack.org/#/c/204641/

-[1] https://review.openstack.org/#/c/225546/


 History
 =======
@@ -216,3 +209,4 @@ History
   - Introduced but no changes merged.
 * - Newton
   - Re-proposed.
+  - Completely re-written to use a hash ring.