Robustify Compute Node Hostnames backlog spec

This is mostly a brain-dump spec on the topic of compute hosts and
how fragile they are in terms of hostname handling. There has long
been a requirement that computes can NEVER change hostnames, and we
have few tools to even detect the problem before we corrupt the
database if it happens.

Here I have documented some of the things we could do to make that
more robust, should we choose to do so. This is based on a recent
near-catastrophe and thus reflects things that would have avoided
pain in a real scenario.

Per discussion at the PTG, I am adding this as a backlog spec, to be
an overarching guide for multiple smaller specs to provide more
detailed progress towards the goals described here.

Change-Id: I72fa3f605cfcf7c3dd0ff4c791be7df8f19f058b
..
 This work is licensed under a Creative Commons Attribution 3.0 Unported
 License.

 http://creativecommons.org/licenses/by/3.0/legalcode

========================================
Robustify Compute Node Hostname Handling
========================================
Include the URL of your launchpad blueprint:
https://blueprints.launchpad.net/nova/+spec/example
Nova has long had a dependency on an unchanging hostname on the
compute nodes. This spec aims to address this limitation, at least
from the perspective of being able to detect an accidental change and
avoiding catastrophe in the database that can currently result from a
hostname change, whether intentional or not.
Problem description
===================
Currently nova uses the hostname of the compute (specifically
``CONF.host``) for a variety of things (see the sketch after this
list):
#. As the routing key for communicating with a compute node over RPC
#. As the link between the instance, service and compute node objects
in the database
#. For neutron to bind ports to the proper hostname (and in some
cases, it must match the equivalent setting in the neutron agent
config)
#. For cinder to export a volume to the proper host
#. As the resource provider name in placement (this actually comes
from libvirt's notion of the hostname, not ``CONF.host``)
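
All of these keys ultimately come from the same configured value. As a
rough illustration only (this is not nova code, and the variable names
are made up), ``CONF.host`` defaults to the system hostname, so an
unintended rename silently changes every one of these keys at once:

.. code-block:: python

   # Illustration only -- not the actual nova code.
   import socket

   host = socket.gethostname()        # what CONF.host defaults to

   rpc_server = 'compute.%s' % host   # roughly, the RPC routing key
   service_host = host                # the services.host column in the DB
   neutron_binding_host = host        # host_id used when binding ports
   cinder_connector_host = host       # host in the cinder volume connector

   print(rpc_server, service_host, neutron_binding_host,
         cinder_connector_host)
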
If the hostname of the compute node changes, all of these links
break. Upon starting the compute node with the changed name, we will
be unable to find a ``nova-compute`` ``Service`` record in the
database that matches, and will create a new one. After that, we will
fail to find the matching ``ComputeNode`` record and create a new one
of those, with a new UUID. Instances that refer to both the old
compute and service records will not be associated with the running
host, and thus become unmanageable through the API. Further, new
instances created on the compute node after the rename will be able to
claim resources that were already promised to the orphaned instances
(such as PCI devices and VCPUs), since the tracking of those resources
is associated with the old compute node record.
If the orphaned instances are relatively static, the first indication
that something has gone wrong may come long after the actual rename, by
which point reality has forked: instances running on one compute node
refer to two different compute node records and are thus accounted for
in two separate places.

Further, neutron, cinder, and placement resources will hold the old
information for instances that existed before the rename and the new
information for instances created after it, which requires
reconciliation. This situation may also prevent restarting old
instances if the old hostname is no longer reachable.

Use Cases
---------
* As an operator, I want to make sure my database does not get
corrupted due to a temporary or permanent DNS change or outage.
* As an operator, I may need to change the name of a compute node as
my network evolves over many years.
* As a deployment tool writer, I want to make sure that changes in
tooling and libraries never cause data loss or database corruption.
Proposed change
===============
There are multiple things we can do here to robustify Nova's handling
of this data. Each one increases safety, but we do not have to do all
of them.
Ensure a stable compute node UUID
---------------------------------
For non-ironic virt drivers, whenever we generate a compute node UUID,
we should write it to a file on the local disk. Whenever we start, we
should look for that UUID file and use its contents, and under no
circumstances should we generate another one. To allow deployment tools
to pre-generate this file, we should also honor it when starting for
the first time and creating the ComputeNode record in the database.
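
A minimal sketch of that read-or-create behavior follows; the file
location and helper name are assumptions for illustration, not the
final design:

.. code-block:: python

   import os
   import uuid

   # Hypothetical location; a deployment tool may pre-create this file.
   COMPUTE_ID_FILE = '/var/lib/nova/compute_id'

   def get_local_compute_uuid():
       """Return the persistent compute node UUID, generating it at most once.

       If a deployment tool pre-created the file, that value is used and
       no new UUID is ever generated.
       """
       if os.path.exists(COMPUTE_ID_FILE):
           with open(COMPUTE_ID_FILE) as f:
               return f.read().strip()
       node_uuid = str(uuid.uuid4())
       with open(COMPUTE_ID_FILE, 'w') as f:
           f.write(node_uuid)
       return node_uuid
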
We would put the actual lookup of the compute node UUID in the
``get_available_nodes()`` method of the virt driver (or create a new
UUID-specific one). Ironic would override this with its current
implementation that returns UUIDs based on the state of Ironic and the
hash ring. Thus only non-Ironic computes would read and write the
persistent UUID file.
Single-host virt drivers like libvirt would be able to tolerate a
system hostname change, updating ``ComputeNode.hypervisor_hostname``
without breaking things.
Link ComputeNode records with Service records by id
---------------------------------------------------
Currently the ComputeNode and Service records are associated in the
database purely by the hostname string. This means that they can become
disassociated, and the string linkage is also not ideal from a
performance standpoint. Other data structures are linked to the
ComputeNode by id, and thus are not re-associated merely because a
hostname happens to match.
This relationship used to exist, but was `removed`_ in the Kilo
timeframe. I believe this was due to the desire to make the process
less focused on the service object and more on the compute node
(potentially because of Ironic) although the breaking of that tight
relationship has serious downsides as well. I think we can keep the
tight binding for single-host computes where it makes sense.
At startup, ``nova-compute`` should resolve its ComputeNode object via
the persistent UUID, find the associated Service, and fail to start if
the hostname does not match ``CONF.host``. Since the hostname is also
used by external services, we should not just "fix it" silently, as
those other links would be broken as well. Failing fast at least avoids
opening the window for silent data corruption.
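
A sketch of what that startup check could look like, with simple
stand-ins for nova's ComputeNode and Service objects rather than the
real APIs:

.. code-block:: python

   # Sketch only: dataclasses stand in for nova's real objects.
   from dataclasses import dataclass

   @dataclass
   class Service:
       id: int
       host: str

   @dataclass
   class ComputeNode:
       uuid: str
       service_id: int

   class HostnameMismatch(Exception):
       pass

   def assert_hostname_unchanged(conf_host, node, service):
       """Fail hard if the service owning our node has a different host.

       ``node`` was resolved via the persistent UUID and ``service`` is
       the record it points at. Aborting here avoids creating duplicate
       records and breaking external references in neutron, cinder, and
       placement.
       """
       if node is not None and service.host != conf_host:
           raise HostnameMismatch(
               'Compute node %s belongs to service host %r but this host '
               'is %r; refusing to start'
               % (node.uuid, service.host, conf_host))
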
Link Instances to a ComputeNode by id
-------------------------------------
Currently instance records are linked to their Service and ComputeNode
objects purely by hostname. We should link them to a ComputeNode by
its id. Since we need the Service in order to get the RPC routing key
or for hostname resolution when talking to external services, we
should find that based on the Instance->ComputeNode->Service id
relationship.
We already link PCI allocations for instances to the compute node by
id, even though the instance itself is linked via hostname. This
discrepancy makes it easy to get one out of sync with the other.
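
As a rough illustration of the direction (table and column names here
are hypothetical, not a proposed migration), the id-based link might
look something like this in SQLAlchemy:

.. code-block:: python

   # Hypothetical schema sketch; not an actual nova migration.
   import sqlalchemy as sa

   metadata = sa.MetaData()

   compute_nodes = sa.Table(
       'compute_nodes', metadata,
       sa.Column('id', sa.Integer, primary_key=True),
       sa.Column('uuid', sa.String(36), unique=True),
       sa.Column('service_id', sa.Integer),   # proposed link to services.id
   )

   instances = sa.Table(
       'instances', metadata,
       sa.Column('id', sa.Integer, primary_key=True),
       sa.Column('uuid', sa.String(36), unique=True),
       sa.Column('host', sa.String(255)),     # legacy hostname-based link
       # proposed: authoritative id-based link, mirroring how PCI
       # allocations are already tracked
       sa.Column('compute_node_id', sa.Integer,
                 sa.ForeignKey('compute_nodes.id'), nullable=True),
   )
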
Potential Changes in the future
-------------------------------
If the above changes are made, we open ourselves to the future
possibility of supporting:
#. Renaming service objects through the API if a compute host really
needs to have its hostname changed. This will require changes to
the other services at the same time, but nova would at least have a
single source of truth for the hostname, making it feasible.
#. If we do all of this, Nova could potentially be confident enough
about an intentional rename that it could update port bindings, cinder
volume attachments, and placement resources to make the rename
seamless.
#. Moving to the use of the service UUID as the RPC routing key, if
desired.
#. Dropping quite a few duplicated string fields from our database.
Alternatives
------------
We can always do nothing. Compute hostnames have been unchangeable
forever, and the status quo is "don't do that or it will break" which
is certainly something we could continue to rely on.
We could implement part of this (i.e. the persistent ComputeNode UUID)
without the rest of the database changes. This would allow us to
detect the situation and abort, but without (the work required to get)
the benefits of a more robust database schema that could potentially
also support voluntary renames.
Data model impact
-----------------
Most of the impact here is to the data models for Instance,
ComputeNode, and Service. Other models that reference compute hostnames
may also make sense to change (although it's also reasonable to punt on
those entirely or defer them to a later phase). Examples:
* Migration
* InstanceFault
* InstanceActionEvent
* TaskLog
* ConsoleAuthToken
Further, host aggregates use the service name for membership.
Migrating those to database IDs is not possible, since IDs overlap
across multiple cells. We could migrate those to UUIDs, or simply
ignore this case and assume that any *actual* rename operation in the
future would involve API operations to fix aggregates (which is doable,
unlike changing the host of things like Instance).

REST API impact
---------------
No specific REST API impact for this, other than the potential for
enabling a mutable Service hostname in the future.
Security impact
---------------
No impact.
Notifications impact
--------------------
No impact.
Other end user impact
---------------------
Not visible to end users.
Performance Impact
------------------
Theoretically, some benefit comes from integer-based linkages between
these objects, which are currently linked by strings. Eventually we
could also eliminate a lot of duplicated strings from our DB schema,
reducing its footprint.
There will definitely be a one-time performance impact due to the
online data migration(s) required to move to the more robust schema.
Other deployer impact
---------------------
This is really all an (eventual) benefit to the deployer.
Developer impact
----------------
There will be some churn in the database models during the
transition. Looking up the hostname of an instance will require
Instance->ComputeNode->Service, but this can probably be hidden with
helpers in the Instance object such that not much has to change in the
actual workflow.
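
For example, a helper along these lines (hypothetical, not existing
nova code) could keep ``instance.host`` working for callers while the
value actually comes from the service record:

.. code-block:: python

   # Hypothetical helper; not existing nova code.
   from types import SimpleNamespace

   class Instance:
       def __init__(self, compute_node):
           # compute_node is the record resolved via the new id-based
           # link, with its Service already loaded.
           self._compute_node = compute_node

       @property
       def host(self):
           # Callers keep reading instance.host; the value now comes
           # from the Instance->ComputeNode->Service chain instead of a
           # duplicated string column.
           return self._compute_node.service.host

   node = SimpleNamespace(service=SimpleNamespace(host='compute-1'))
   assert Instance(node).host == 'compute-1'
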
Upgrade impact
--------------
There will be some substantial online data migrations required to get
things into the new schema, and the benefits will only be achievable
in a subsequent release once everything is converted.
Implementation
==============
Assignee(s)
-----------
Primary assignee:
danms
Work Items
----------
* Persist the compute node UUID to disk when we generate it. Read the
compute node UUID from that location if it exists before we look to
see if we need to generate, create, or find an existing node record.
* Change the compute startup procedures to abort if we detect a
mismatch
* Make the schema changes to link database models by id. The
ComputeNode and Service objects/tables still have the id fields that
we can re-enable without even needing a schema change on those.
* Make the data models honor the ID-based linkages, if present
* Write an online data migration to construct those links on existing
databases (a rough sketch of the shape follows this list)
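
A rough sketch of the shape such a migration could take, using
in-memory stand-ins instead of real database queries (all names here
are illustrative):

.. code-block:: python

   # Sketch only: dicts stand in for database rows.
   def migrate_instance_compute_links(instances, compute_nodes_by_host,
                                      max_count):
       """Backfill the id-based link from the legacy host string.

       Returns (found, done) in the usual batched online-data-migration
       style so it can be run repeatedly until nothing is left to do.
       """
       todo = [i for i in instances
               if i.get('compute_node_id') is None][:max_count]
       done = 0
       for inst in todo:
           node = compute_nodes_by_host.get(inst['host'])
           if node is None:
               continue   # orphaned record; leave it for the operator
           inst['compute_node_id'] = node['id']
           done += 1
       return len(todo), done
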
Later, there will be work items to:
* Drop the legacy columns
* Potentially implement an actual service rename procedure
Dependencies
============
There should be no external dependencies for the base of this work,
but there is a dependency on the release cycle, which affects how
quickly we can implement this and drop the old way of doing it.
Testing
=======
Unit and functional testing for the actual compute node startup
behavior should be fine. Existing integration testing should ensure
that we haven't broken any of the runtime behavior. Grenade jobs will
exercise the data migration, and we can implement some ``nova-status``
upgrade checks to help validate things in those upgrade jobs.
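
One such check might look roughly like this; the check name and the
counting helper are hypothetical, and it assumes the
``oslo.upgradecheck`` result types that ``nova-status`` builds on:

.. code-block:: python

   # Sketch of a possible upgrade check; the counting helper is passed
   # in because the real query does not exist yet.
   from oslo_upgradecheck import upgradecheck

   def check_instance_compute_links(count_unlinked_instances):
       missing = count_unlinked_instances()
       if missing:
           return upgradecheck.Result(
               upgradecheck.Code.WARNING,
               '%d instances are not yet linked to a compute node by '
               'id; run the online data migrations first.' % missing)
       return upgradecheck.Result(upgradecheck.Code.SUCCESS)
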
Documentation Impact
====================
There will need to be some documentation about the persistent compute
node UUID file for deployers and tool authors. Ideally, the only
visible result of this would be some additional failure modes if the
compute service detects an unexpected rename, so some documentation of
what that looks like and what to do about it would be helpful.
References
==========
TODO(danms): There are probably bugs we can reference about compute
node renames being not possible, or problematic if/when they happen.
.. _removed: https://specs.openstack.org/openstack/nova-specs/specs/kilo/implemented/detach-service-from-computenode.html
History
=======
.. list-table:: Revisions
   :header-rows: 1

   * - Release Name
     - Description
   * - Antelope
     - Introduced