Robustify Compute Node Hostnames backlog spec

This is mostly a brain-dump spec on the topic of compute hosts and how
fragile they are in terms of hostname handling. There has long been a
requirement that computes can NEVER change hostnames, and we have few
tools to even detect the problem before we corrupt the database if it
happens. Here I have documented some of the things we could do to make
that more robust, should we choose to do so. This is based on a recent
near-catastrophe and thus reflects things that would have avoided pain
in a real scenario.

Per discussion at the PTG, I am adding this as a backlog spec, to be
an overarching guide for multiple smaller specs to provide more
detailed progress towards the goals described here.

Change-Id: I72fa3f605cfcf7c3dd0ff4c791be7df8f19f058b

@@ -24,4 +24,8 @@ Template:

 Approved (but not implemented) backlog specs:

-None currently
+.. toctree::
+   :glob:
+   :maxdepth: 1
+
+   approved/*

317 specs/backlog/approved/robustify-compute-hostnames.rst Normal file

@@ -0,0 +1,317 @@

..
 This work is licensed under a Creative Commons Attribution 3.0 Unported
 License.

 http://creativecommons.org/licenses/by/3.0/legalcode

========================================
Robustify Compute Node Hostname Handling
========================================

Include the URL of your launchpad blueprint:

https://blueprints.launchpad.net/nova/+spec/example

Nova has long had a dependency on an unchanging hostname on the
compute nodes. This spec aims to address this limitation, at least
from the perspective of detecting an accidental change and avoiding
the database catastrophe that can currently result from a hostname
change, whether intentional or not.

Problem description
===================

Currently nova uses the hostname of the compute (specifically
``CONF.host``) for a variety of things:

#. As the routing key for communicating with a compute node over RPC
#. As the link between the instance, service and compute node objects
   in the database
#. For neutron to bind ports to the proper hostname (and in some
   cases, it must match the equivalent setting in the neutron agent
   config)
#. For cinder to export a volume to the proper host
#. As the resource provider name in placement (this actually comes
   from libvirt's notion of the hostname, not ``CONF.host``)
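
Most of these uses key off a single configuration value whose default
follows the system hostname. A simplified sketch of the option
definition (not nova's literal code, but the default is the important
part):

.. code-block:: python

    import socket

    from oslo_config import cfg

    # Simplified sketch: the real option definition lives in nova's
    # config modules and carries more metadata. The key point is that
    # the default tracks whatever the system hostname happens to be.
    opts = [
        cfg.StrOpt('host',
                   default=socket.gethostname(),
                   help='Hostname used as the RPC routing key and as '
                        'the database link for this compute host'),
    ]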

If the hostname of the compute node changes, all of these links
break. Upon starting the compute node with the changed name, we will
be unable to find a matching ``nova-compute`` ``Service`` record in
the database, and will create a new one. After that, we will fail to
find the matching ``ComputeNode`` record and create a new one of
those, with a new UUID. Instances that refer to the old compute and
service records will no longer be associated with the running host,
and thus become unmanageable through the API. Further, new instances
created on the compute node after the rename will be able to claim
resources that have been promised to the orphaned instances (such as
PCI devices and VCPUs), because the tracking of those resources is
associated with the old compute node record.

If the orphaned instances are relatively static, the first indication
that something has gone wrong may come long after the actual rename,
at which point reality has forked: there are instances running on one
compute node that refer to two different compute node records, and
are thus accounted for in two separate locations.

Further, neutron, cinder, and placement resources will have the old
information for pre-rename instances and new information for
instances created after the rename, which requires
reconciliation. This situation may also prevent restarting old
instances if the old hostname is no longer reachable.

Use Cases
---------

* As an operator, I want to make sure my database does not get
  corrupted due to a temporary or permanent DNS change or outage.
* As an operator, I may need to change the name of a compute node as
  my network evolves over many years.
* As a deployment tool writer, I want to make sure that changes in
  tooling and libraries never cause data loss or database corruption.

Proposed change
===============

There are multiple things we can do here to robustify Nova's handling
of this data. Each one increases safety, but we do not have to do all
of them.

Ensure a stable compute node UUID
---------------------------------

For non-Ironic virt drivers, whenever we generate a compute node
UUID, we should write it to a file on the local disk. Whenever we
start, we should look for that UUID file and use its contents, and
under no circumstances should we generate another one. To allow
deployment tools to pre-generate the UUID, we should honor an
existing file even on a first start, using its contents when creating
the ComputeNode record in the database.
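
A minimal sketch of that read-or-create behavior, assuming a
hypothetical ``/var/lib/nova/compute_id`` location (the real path and
integration point are implementation details):

.. code-block:: python

    import os
    import uuid

    # Hypothetical location; a deployment tool may pre-create this file.
    COMPUTE_ID_FILE = '/var/lib/nova/compute_id'

    def get_or_create_node_uuid():
        """Return the persistent compute node UUID for this host.

        If the file exists (written by us previously, or pre-generated
        by a deployment tool), its contents always win; we never
        generate a second UUID for the same host.
        """
        if os.path.exists(COMPUTE_ID_FILE):
            with open(COMPUTE_ID_FILE) as f:
                return str(uuid.UUID(f.read().strip()))
        node_uuid = str(uuid.uuid4())
        with open(COMPUTE_ID_FILE, 'w') as f:
            f.write(node_uuid)
        return node_uuid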

We would put the actual lookup of the compute node UUID in the
``get_available_nodes()`` method of the virt driver (or create a new
UUID-specific one). Ironic would override this with its current
implementation, which returns UUIDs based on the state of Ironic and
the hash ring. Thus only non-Ironic computes would read and write the
persistent UUID file.
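
As an illustration, a new UUID-specific driver hook might look like
the following, building on the earlier sketch; the method name and
return shape are assumptions, not a settled interface:

.. code-block:: python

    from nova.virt import driver

    class LibvirtDriver(driver.ComputeDriver):
        # Hypothetical UUID-specific hook; illustrative only.
        def get_available_node_uuids(self, refresh=False):
            # Single-host drivers return the one persistent UUID from
            # local disk (see get_or_create_node_uuid() above),
            # regardless of the current system hostname. Ironic would
            # instead override this to return UUIDs from its node
            # inventory and hash ring.
            return [get_or_create_node_uuid()]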

Single-host virt drivers like libvirt would be able to tolerate a
system hostname change, updating ``ComputeNode.hypervisor_hostname``
without breaking things.

Link ComputeNode records with Service records by id
---------------------------------------------------

Currently the ComputeNode and Service records are associated in the
database purely by the hostname string. This means that they can
become disassociated, and it is also not ideal from a performance
standpoint. Some other data structures are already linked against
ComputeNode by id, and thus do not automatically re-associate just
because a hostname happens to match.
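
As a sketch of what re-establishing the linkage could look like at
the model level (an abbreviated, self-contained SQLAlchemy example;
the column names are assumptions, not the final schema):

.. code-block:: python

    import sqlalchemy as sa
    from sqlalchemy.ext import declarative

    BASE = declarative.declarative_base()

    class ComputeNode(BASE):
        """Abbreviated model showing only the proposed linkage."""
        __tablename__ = 'compute_nodes'
        id = sa.Column(sa.Integer, primary_key=True)
        uuid = sa.Column(sa.String(36), unique=True)
        # Tight id-based binding to the service record, replacing the
        # string matching on the hostname column.
        service_id = sa.Column(sa.Integer,
                               sa.ForeignKey('services.id'),
                               nullable=True)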

This relationship used to exist, but was `removed`_ in the Kilo
timeframe. I believe this was due to the desire to make the process
less focused on the service object and more on the compute node
(potentially because of Ironic), although the breaking of that tight
relationship has serious downsides as well. I think we can keep the
tight binding for single-host computes where it makes sense.

At startup, ``nova-compute`` should resolve its ComputeNode object
via the persistent UUID, find the associated Service, and fail to
start if the hostname does not match ``CONF.host``. Since the
hostname is used by external services, we should not just "fix it"
silently, as those other links would be broken as well. Refusing to
start at least avoids opening the window for silent data corruption.
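
A minimal sketch of that startup check; the object and exception
names mirror existing nova conventions, but the exact calls here are
illustrative:

.. code-block:: python

    from nova import conf
    from nova import exception
    from nova import objects

    CONF = conf.CONF

    def assert_hostname_unchanged(context, node_uuid):
        """Abort startup if our record belongs to a different host."""
        node = objects.ComputeNode.get_by_uuid(context, node_uuid)
        if node.host != CONF.host:
            raise exception.InvalidConfiguration(
                'Compute node %s is bound to host %s but CONF.host is '
                '%s; refusing to start to avoid corrupting related '
                'records' % (node_uuid, node.host, CONF.host))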

Link Instances to a ComputeNode by id
-------------------------------------

Currently instance records are linked to their Service and
ComputeNode objects purely by hostname. We should link them to a
ComputeNode by its id. Since we need the Service in order to get the
RPC routing key, or for hostname resolution when talking to external
services, we should find it via the Instance->ComputeNode->Service id
relationship.

We already link PCI allocations for instances to the compute node by
id, even though the instance itself is linked via hostname. This
discrepancy makes it easy to get one out of sync with the other.

Potential changes in the future
-------------------------------

If the above changes are made, we open ourselves to the future
possibility of supporting:

#. Renaming service objects through the API if a compute host really
   needs to have its hostname changed. This will require changes to
   the other services at the same time, but nova would at least have a
   single source of truth for the hostname, making it feasible.
#. If we do all of this, Nova could potentially be confident enough
   that a rename is intentional that it could update port bindings,
   cinder volume attachments, and placement resources to make it
   seamless.
#. Moving to the use of the service UUID as the RPC routing key, if
   desired.
#. Dropping quite a few duplicate string fields from our database.

Alternatives
------------

We can always do nothing. Compute hostnames have been unchangeable
forever, and the status quo is "don't do that or it will break",
which is certainly something we could continue to rely on.

We could implement part of this (i.e., the persistent ComputeNode
UUID) without the rest of the database changes. This would allow us
to detect the situation and abort, but without (the work required to
get) the benefits of a more robust database schema that could
potentially also support voluntary renames.

Data model impact
-----------------

Most of the impact here is to the data models for Instance,
ComputeNode, and Service. Other models that reference compute
hostnames may also make sense to change (although it is also
reasonable to punt on those entirely, or defer them to a later
phase). Examples:

* Migration
* InstanceFault
* InstanceActionEvent
* TaskLog
* ConsoleAuthToken

Further, host aggregates use the service name for membership.
Migrating those to database IDs is not possible, since IDs overlap
across multiple cells. We could migrate them to UUIDs, or simply
ignore this case and assume that any *actual* rename operation in the
future would involve API operations to fix aggregates (which is
doable, unlike changing the host of things like Instance).

REST API impact
---------------

No specific REST API impact for this, other than the potential for
enabling a mutable Service hostname in the future.

Security impact
---------------

No impact.

Notifications impact
--------------------

No impact.

Other end user impact
---------------------

Not visible to end users.

Performance Impact
------------------

Theoretically, some benefit comes from integer-based linkages between
these objects, which are currently linked by strings. Eventually we
could remove a lot of duplicated strings from our DB schema and
reduce its footprint.

There will definitely be a one-time performance impact due to the
online data migration(s) required to move to the more robust schema.

Other deployer impact
---------------------

This is really all an (eventual) benefit to the deployer.

Developer impact
----------------

There will be some churn in the database models during the
transition. Looking up the hostname of an instance will require
Instance->ComputeNode->Service, but this can probably be hidden with
helpers in the Instance object such that not much has to change in
the actual workflow.
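
For example, one such helper could be a property; this is a sketch
only, and both the property name and the ``compute_node_id`` field it
relies on are assumptions about the proposed schema:

.. code-block:: python

    from nova import objects
    from nova.objects import base

    class Instance(base.NovaObject):
        # Hypothetical helper; 'compute_node_id' is the proposed
        # linkage field and does not exist today.
        @property
        def service_host(self):
            """Resolve the hostname via Instance->ComputeNode->Service."""
            node = objects.ComputeNode.get_by_id(self._context,
                                                 self.compute_node_id)
            service = objects.Service.get_by_id(self._context,
                                                node.service_id)
            return service.host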

Upgrade impact
--------------

There will be some substantial online data migrations required to get
things into the new schema, and the benefits will only be achievable
in a subsequent release, once everything is converted.

Implementation
==============

Assignee(s)
-----------

Primary assignee:
  danms

Work Items
----------

* Persist the compute node UUID to disk when we generate it. Read the
  compute node UUID from that location, if it exists, before we look
  to see if we need to generate, create, or find an existing node
  record.
* Change the compute startup procedures to abort if we detect a
  mismatch.
* Make the schema changes to link database models by id. The
  ComputeNode and Service objects/tables still have the id fields,
  which we can re-enable without even needing a schema change on
  those tables.
* Make the data models honor the ID-based linkages, if present.
* Write an online data migration to construct those links on existing
  databases.
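
The last item could follow nova's usual batched online data migration
pattern, where a function processes up to ``max_count`` rows per
invocation and reports progress. A sketch under that assumption, with
illustrative names:

.. code-block:: python

    from nova import objects

    def backfill_computenode_service_links(context, max_count):
        """Backfill ComputeNode.service_id for up to max_count rows.

        Returns (found, done), the contract used by batched
        'nova-manage db online_data_migrations' style migrations.
        """
        # get_unlinked() is an illustrative finder for nodes that
        # still lack a service_id; the real query would live in the
        # DB API layer.
        nodes = objects.ComputeNodeList.get_unlinked(context,
                                                     limit=max_count)
        done = 0
        for node in nodes:
            # Resolve the service via the legacy hostname match one
            # last time, then persist the id-based link.
            service = objects.Service.get_by_compute_host(context,
                                                          node.host)
            node.service_id = service.id
            node.save()
            done += 1
        return len(nodes), done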

Later, there will be work items to:

* Drop the legacy columns
* Potentially implement an actual service rename procedure

Dependencies
============

There should be no external dependencies for the base of this work,
but there is a dependency on the release cycle, which affects how
quickly we can implement this and drop the old way of doing it.

Testing
=======

Unit and functional testing for the actual compute node startup
behavior should be fine. Existing integration testing should ensure
that we have not broken any of the runtime behavior. Grenade jobs
will test the data migration, and we can implement some
``nova-status`` upgrade check items to help validate things in those
upgrade jobs.

Documentation Impact
====================

There will need to be some documentation about the persistent compute
node UUID file for deployers and tool authors. Ideally, the only
visible result of this would be some additional failure modes if the
compute service detects an unexpected rename, so some documentation
of what that looks like and what to do about it would be helpful.

References
==========

TODO(danms): There are probably bugs we can reference about compute
node renames being not possible, or problematic if/when they happen.

.. _removed: https://specs.openstack.org/openstack/nova-specs/specs/kilo/implemented/detach-service-from-computenode.html

History
=======

.. list-table:: Revisions
   :header-rows: 1

   * - Release Name
     - Description
   * - Antelope
     - Introduced