Add troubleshooting doc about rebuilding the placement db

This has come up a few times via support questions from operators
whose nova cell database is out of sync with the placement database,
resulting in a mismatch between compute node and resource provider
uuids; they just want to wipe the placement database and rebuild it
from the current data in nova. This provides a document with the
high-level steps to do that.

Change-Id: Ie4fed22615f60e132a887fe541771c447fae1082
Author: Matt Riedemann
Date:   2019-12-11 10:40:44 -05:00
Commit: 1a17fe8aab (parent 4c8f3990c6)
3 changed files with 59 additions and 0 deletions


@@ -15,6 +15,7 @@ you how to troubleshoot Compute.
:maxdepth: 1
troubleshooting/orphaned-allocations.rst
troubleshooting/rebuild-placement-db.rst
Compute service logging


@@ -0,0 +1,56 @@
Rebuild placement DB
====================

Problem
-------

You have somehow changed a nova cell database and the ``compute_nodes`` table
entries are now reporting different uuids to the placement service, but
placement already has ``resource_providers`` table entries with the same names
as those computes, so the resource providers in placement and the compute
nodes in the nova database are no longer synchronized. This can happen, for
example, when the nova cell database is restored from a backup in which the
compute hosts have not changed but their ``compute_nodes`` records have
different uuids.

Nova reports compute node inventory to placement by creating a resource
provider, named after the ``hypervisor_hostname`` and keyed by the uuid from
the ``compute_nodes`` table, in the placement ``resource_providers`` table,
which has a unique constraint on both the name (the hostname in this case) and
the uuid. Trying to create a new resource provider with a new uuid but the
same name as an existing provider results in a 409 Conflict error from
placement, such as in `bug 1817833`_.

.. _bug 1817833: https://bugs.launchpad.net/nova/+bug/1817833
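
To confirm the mismatch, compare the provider names and uuids known to
placement with the compute node hostnames and uuids in the nova cell
database. A minimal sketch, assuming the ``osc-placement`` plugin is installed
and a MySQL cell database named ``nova_cell1`` (both are deployment-specific):

.. code-block:: console

   # Provider names and uuids as placement sees them.
   $ openstack resource provider list -c uuid -c name

   # Compute node hostnames and uuids as nova sees them; the cell database
   # name used here (nova_cell1) depends on your deployment.
   $ mysql -e 'SELECT hypervisor_hostname, uuid FROM nova_cell1.compute_nodes WHERE deleted = 0;'

If the same hostname shows up in both lists with different uuids, you are
hitting the conflict described above.
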
Solution
--------

.. warning:: This is likely a last resort, to be used only when *all* computes
             and resource providers are out of sync and it is simpler to just
             rebuild the placement database from the current state of nova. This may,
however, not work when using placement for more advanced features
such as :neutron-doc:`ports with minimum bandwidth guarantees </admin/config-qos-min-bw>`
or `accelerators <https://docs.openstack.org/cyborg/latest/>`_.
Obviously testing first in a pre-production environment is ideal.

These are the steps at a high level (an illustrative console sequence follows
the list):

#. Make a backup of the existing placement database in case these steps fail
and you need to start over.
#. Recreate the placement database and run the schema migrations to
initialize the placement database.
#. Either restart the ``nova-compute`` services or wait for the
   :oslo.config:option:`update_resources_interval` to pass so that the
   services report their resource providers and inventory to placement.
#. Run the :ref:`nova-manage placement heal_allocations <heal_allocations_cli>`
command to report allocations to placement for the existing instances in
nova.
#. Run the :ref:`nova-manage placement sync_aggregates <sync_aggregates_cli>`
command to synchronize nova host aggregates to placement resource provider
aggregates.
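
The following console sequence illustrates the steps above; it assumes a MySQL
backend, a placement database named ``placement`` and systemd-managed
services, all of which are deployment-specific, so adjust names, credentials
and service units accordingly:

.. code-block:: console

   # 1. Back up the existing placement database.
   $ mysqldump placement > placement-backup.sql

   # 2. Recreate the placement database and run the schema migrations.
   $ mysql -e 'DROP DATABASE placement; CREATE DATABASE placement;'
   $ placement-manage db sync

   # 3. Restart the nova-compute services (or wait for the
   #    update_resources_interval to pass) so they re-create their resource
   #    providers and inventory; the service unit name varies by distribution.
   $ systemctl restart openstack-nova-compute

   # 4. Report allocations to placement for the existing instances.
   $ nova-manage placement heal_allocations --verbose

   # 5. Mirror nova host aggregates to placement resource provider aggregates.
   $ nova-manage placement sync_aggregates --verbose
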
Once complete, test your deployment as usual, e.g. by running Tempest
integration and/or Rally tests, or by creating, migrating and deleting a
server.
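
As a quick smoke test, assuming the ``osc-placement`` plugin is installed and
``<flavor>``, ``<image>`` and ``<network>`` are placeholders for resources in
your deployment:

.. code-block:: console

   # Each compute host should have a resource provider again.
   $ openstack resource provider list

   # Each existing server should have allocations again after heal_allocations.
   $ openstack resource provider allocation show <server-uuid>

   # Creating and deleting a test server exercises the scheduler and placement.
   $ openstack server create --flavor <flavor> --image <image> --network <network> test-server
   $ openstack server delete test-server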


@@ -643,6 +643,8 @@ Placement
* - 255
- An unexpected error occurred.
.. _sync_aggregates_cli:
``nova-manage placement sync_aggregates [--verbose]``
Mirrors compute host aggregates to resource provider aggregates
in the Placement service. Requires the :oslo.config:group:`api_database`