Link Compute Objects by ID
https://blueprints.launchpad.net/nova/+spec/compute-object-ids
Nova has long had a dependency on an unchanging hostname on the compute nodes. This spec aims to address this limitation, at least from the perspective of being able to detect an accidental change and avoiding catastrophe in the database that can currently result from a hostname change, whether intentional or not.
As a continuation of the effort to robustify compute hostnames, this spec describes the next phase which involves strengthening the linkage between the primary database objects managed by the compute nodes.
Problem description
The ComputeNode, Service, and
Instance objects form the primary data model for our
compute nodes. Instances run on compute nodes, which are managed by
services. We rely on this hierarchy to know where instances are
(physically) as well as which RPC endpoint to send messages to for
management. Currently, the linkage between all three objects is a
relatively loose, string-based association using the hostname of the
compute node and/or the CONF.host values. This not only
makes an actual/intentional rename very difficult, but also risks
breaking critical links as a result of an accidental one.
Use Cases
As an operator I want an accidental or transient hostname rename to not cause corruption of my Nova data structures.
As a developer, I want a stronger association between the primary objects in the data model for robustness and performance reasons.
Proposed change
We already have a service_id field on our
ComputeNode object. We should resume populating that when
we create a new ComputeNode and we should fix existing
records during ComputeManager.init_host(), similar to how
we added checks for hostname discrepancies in the earlier phase of this
effort.
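As a rough illustration of the startup fix-up described above, the following sketch links existing compute node records to the service that manages them. The `Service` and `ComputeNode` dataclasses and the `link_nodes_to_service()` helper are hypothetical stand-ins for illustration, not Nova's actual objects or the code that will run in `ComputeManager.init_host()`.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Service:
    # Hypothetical stand-in for the Service object.
    id: int
    host: str


@dataclass
class ComputeNode:
    # Hypothetical stand-in for the ComputeNode object.
    id: int
    host: str
    hypervisor_hostname: str
    service_id: Optional[int] = None


def link_nodes_to_service(service: Service,
                          nodes: List[ComputeNode]) -> List[ComputeNode]:
    """Set service_id on nodes of this host that do not have it yet.

    Returns the nodes that were changed so the caller can persist them.
    """
    changed = []
    for node in nodes:
        if node.host != service.host:
            # Not managed by this service; leave it alone.
            continue
        if node.service_id is None:
            node.service_id = service.id
            changed.append(node)
    return changed
```

Only records that are missing the link are touched, so repeating the fix-up on every startup is harmless once the data has been migrated.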
We will need to add a compute_id field to the
Instance object, which will require a schema migration.
This field will need to remain nullable, and will be NULL
for instances before scheduling, as well as instances in
SHELVED_OFFLOADED state. The compute_id field
can be populated at the same time we currently set
Instance.node, and similar to ComputeNode
records above, we can migrate existing records during
ComputeManager._init_instance(). In order to ensure that we
keep the node and compute_id fields in sync, the
Instance.create() and Instance.update()
methods will perform a check to ensure that the former is never changed
without the latter also being changed. This check will (by the nature of
those two @remotable methods) be run on the conductor
nodes, and will only enforce the requirement if the version of the
objects is new enough.
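The consistency check described above could look something like the sketch below. The function name, the exception, and the use of version tuples are assumptions made for illustration; the real check would live inside the object's remotable methods and use Nova's own versioning.

```python
class NodeComputeIdMismatch(Exception):
    """Hypothetical error: node was changed without compute_id."""


def check_node_compute_id_sync(changed_fields, object_version, min_version):
    """Refuse a change to 'node' that does not also change 'compute_id'.

    changed_fields: set of field names modified in this create/update.
    object_version: version tuple of the object sent by the caller.
    min_version: first version tuple that knows about compute_id
                 (an assumed value supplied by the caller).
    """
    if object_version < min_version:
        # Older senders do not know about compute_id, so do not enforce.
        return
    if "node" in changed_fields and "compute_id" not in changed_fields:
        raise NodeComputeIdMismatch(
            "node changed without a matching compute_id change")
```

Skipping enforcement for older object versions is what allows not-yet-upgraded compute nodes to keep working during a rolling upgrade.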
Many of the times we update Instance.node, we do so from
a Migration object, using either source_node
for reverted migrations or dest_node for successful ones.
Thus, our handling of migrations will need some work as well, which is
described in the subsection below.
It is important to note that this spec defines one part of a two-part effort. The setup described here will require a subsequent step to change how we look up these objects to use the new relationships once all the data has been migrated.
Migration handling
Currently we update Instance.node from a
Migration object in a number of places. In most of these,
it is being performed on the node where the instance will
remain. For those cases, we will get the ComputeNode object
from the resource tracker (still by name, from the
Migration object) and use it to set the new field. Aside
from saving a loosely-coupled DB lookup each time we need it, this has
the additional benefit of double-checking that the node specified
(loosely, by name) in the Migration object is the (or a)
correct one for the current host.
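A minimal sketch of that lookup, assuming the resource tracker can be modeled as a mapping of node name to compute node ID. The `compute_id_for_node()` helper and `UnknownNode` exception are hypothetical names, not part of Nova.

```python
class UnknownNode(Exception):
    """Hypothetical error: the named node is not tracked on this host."""


def compute_id_for_node(tracked_nodes, nodename):
    """Return the compute node ID for a node named in a Migration.

    tracked_nodes: dict of node name -> ComputeNode ID for the nodes
                   this host manages (stand-in for the resource tracker).
    """
    try:
        return tracked_nodes[nodename]
    except KeyError:
        # The name in the Migration does not match any node on this
        # host, which is the discrepancy the lookup is meant to catch.
        raise UnknownNode(nodename) from None
```

Because the mapping only contains nodes owned by the current host, a stale or wrong name in the `Migration` object fails loudly instead of silently linking the instance to the wrong record.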
The only place where we currently update Instance.node
from a location that is not the host where the Instance is
staying is during the early part of resize, where
_resize_instance() runs on the sending host with
information provided by the destination. In this case, we will modify
the Migration object to have one additional
dest_compute_id field, which will be filled by the
destination host with its known-correct value, to be used by the sending
host when it modifies Instance.node (and
Instance.compute_id) to be the values for the new host.
Upgrade Concerns
Since the Instance and Migration objects
will be growing new fields, older nodes will not be populating these
fields when migrating between old and new nodes. In the case of
Instance, the compute_id field will not be
actually used until a later release when we know it has been populated.
The dest_compute_id field in Migration will be
used if present, and if not, a fallback to finding the node's ID will
rely on a call to ComputeNode.get_by_host_and_nodename(),
which is "easy" since the Migration has all the fields
necessary to make that call.
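The fallback described above can be sketched as follows. The `Migration` dataclass is a simplified stand-in, and `lookup` is an assumed callable that plays the role of `ComputeNode.get_by_host_and_nodename()` and returns the node's ID.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Migration:
    # Simplified stand-in for the Migration object.
    dest_compute: str
    dest_node: str
    dest_compute_id: Optional[int] = None


def resolve_dest_compute_id(migration: Migration,
                            lookup: Callable[[str, str], int]) -> int:
    """Prefer the ID filled in by the destination, else look it up by name."""
    if migration.dest_compute_id is not None:
        return migration.dest_compute_id
    # An older destination did not populate the field; fall back to the
    # host and node names that the Migration already carries.
    return lookup(migration.dest_compute, migration.dest_node)
```

The name-based lookup only runs when the new field is absent, so it can be dropped once all nodes are known to populate `dest_compute_id`.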
Alternatives
This is not required for proper operation, so we could choose to do nothing.
We could also choose to keep the string-based association, strengthened by Foreign Key relationships.
For the Migration changes, we could also make the
destination compute ID be a new RPC parameter that is passed from the
destination compute back to the source to avoid needing to change the
Migration object. However, that brings with it more upgrade
concerns.
We could also use the ComputeNode.uuid on the
Migration object instead of the ID. There is no real reason
to do that because cross-cell migration already creates two migration
objects, one per cell. It would also perform worse and would not be a
1:1 mapping of the field we need to set on the instance, which would
mean another DB lookup as well.
Data model impact
All changes will be confined to the Cell database:
- Instance will grow a compute_id field
- Migration will grow a dest_compute_id field
- Consistency checks for both of these will need to be added to the object lifecycle operations.
- ComputeNode's existing service_id field will be populated
- Both will be populated during new record creation, and for existing records at runtime during nova-compute startup.
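To make the shape of the schema change concrete, here is a small sketch using an in-memory SQLite database. The simplified table layouts and the `add_link_columns()` helper are assumptions for illustration only; Nova's real migration would be written with its own schema migration tooling and column types.

```python
import sqlite3


def add_link_columns(conn):
    """Add the new nullable ID columns to simplified stand-in tables."""
    conn.execute("ALTER TABLE instances ADD COLUMN compute_id INTEGER")
    conn.execute("ALTER TABLE migrations ADD COLUMN dest_compute_id INTEGER")


conn = sqlite3.connect(":memory:")
# Heavily simplified versions of the tables, for illustration only.
conn.execute("CREATE TABLE instances (id INTEGER PRIMARY KEY, node TEXT)")
conn.execute("CREATE TABLE migrations (id INTEGER PRIMARY KEY, dest_node TEXT)")
conn.execute("INSERT INTO instances (node) VALUES ('cmp1')")

add_link_columns(conn)

# Existing rows keep working; the new column is NULL until it is
# populated at runtime.
row = conn.execute("SELECT node, compute_id FROM instances").fetchone()
```

Because the new columns are nullable, adding them does not require any existing row to be rewritten at migration time.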
REST API impact
None
Security impact
None
Notifications impact
None
Other end user impact
None
Performance Impact
While not the primary intent, a follow-on effort to this will enable querying these objects by integer ID relation instead of by string, which should be both faster and lower impact on the database server.
Other deployer impact
No additional deployer impact other than a tiny amount of online data migration traffic on the next startup after upgrade, as well as improved performance and robustness going forward once the effort is completed.
Developer impact
Developers will need to re-learn the relationships between the objects, which will be based on IDs instead of hostnames.
Upgrade impact
No real upgrade impact here, other than what is already expected. A simple database migration will be added, with no specific requirements about ordering or simultaneous code change. Compute nodes will migrate existing records during the first post-upgrade restart.
Implementation
Assignee(s)
- Primary assignee: danms
Work Items
- Start populating ComputeNode.service_id on creation
- Migrate existing ComputeNode objects on startup (init_host())
- Add a migration to add the Instance.compute_id and Migration.dest_compute_id fields
- Start populating Migration.dest_compute_id for migrations
- Start populating Instance.compute_id on completion of scheduling and migrations.
- Migrate existing Instance objects on startup (_init_instance())
Dependencies
None
Testing
Unit and Functional tests will be added to verify that new and existing objects are properly linked and migrated.
Documentation Impact
No documentation changes required.
References
- This is part of a larger multi-cycle effort to robustify compute hostnames.
- This follows the first robustification stage, completed in 2023.1
History
| Release Name | Description |
|---|---|
| 2023.2 Bobcat | Introduced |