Add troubleshooting doc about rebuilding the placement db

This has come up a few times via support questions from operators
whose nova cell database is out of sync with the placement database,
resulting in a mismatch between compute node and resource provider
uuids; they just want to wipe the placement database and rebuild it
from the current data in nova. This provides a document with the
high-level steps to do that.

Change-Id: Ie4fed22615f60e132a887fe541771c447fae1082
Author: Matt Riedemann
Date:   2019-12-11 10:40:44 -05:00
Commit: 1a17fe8aab (parent 4c8f3990c6)
3 changed files with 59 additions and 0 deletions


@@ -15,6 +15,7 @@ you how to troubleshoot Compute.
:maxdepth: 1
troubleshooting/orphaned-allocations.rst
troubleshooting/rebuild-placement-db.rst
Compute service logging


@@ -0,0 +1,56 @@
Rebuild placement DB
====================

Problem
-------

You have somehow changed a nova cell database and the ``compute_nodes`` table
entries are now reporting different uuids to the placement service, but
placement already has ``resource_providers`` table entries with the same names
as those computes, so the resource providers in placement and the compute
nodes in the nova database are no longer synchronized. This can happen, for
example, when the nova cell database is restored from a backup in which the
compute hosts have not changed but their ``compute_nodes`` records have
different uuids.

Nova reports compute node inventory to placement by creating a resource
provider, named after the ``hypervisor_hostname`` and keyed by the uuid from
the ``compute_nodes`` table, in the placement ``resource_providers`` table,
which has a unique constraint on both the name (the hostname in this case) and
the uuid. Trying to create a new resource provider with a new uuid but the
same name as an existing provider results in a 409 Conflict error from
placement, such as in `bug 1817833`_.

.. _bug 1817833: https://bugs.launchpad.net/nova/+bug/1817833
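
To confirm the mismatch, compare the provider names and uuids known to
placement with the compute node hostnames and uuids in the nova cell
database. A minimal sketch, assuming the ``osc-placement`` plugin is installed
and a MySQL cell database named ``nova_cell1`` (both are deployment-specific):

.. code-block:: console

   # Provider names and uuids as placement sees them.
   $ openstack resource provider list -c uuid -c name

   # Compute node hostnames and uuids as nova sees them; the cell database
   # name used here (nova_cell1) depends on your deployment.
   $ mysql -e 'SELECT hypervisor_hostname, uuid FROM nova_cell1.compute_nodes WHERE deleted = 0;'

If the same hostname shows up in both lists with different uuids, you are
hitting the conflict described above.
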
Solution
--------

.. warning:: This is likely a last resort, to be used only when *all* computes
             and resource providers are out of sync and it is simpler to just
             rebuild the placement database from the current state of nova. This may,
however, not work when using placement for more advanced features
such as :neutron-doc:`ports with minimum bandwidth guarantees </admin/config-qos-min-bw>`
or `accelerators <https://docs.openstack.org/cyborg/latest/>`_.
Obviously testing first in a pre-production environment is ideal.

These are the steps at a high level (an illustrative console sequence follows
the list):

#. Make a backup of the existing placement database in case these steps fail
and you need to start over.
#. Recreate the placement database and run the schema migrations to
initialize the placement database.
#. Either restart the ``nova-compute`` services or wait for the
   :oslo.config:option:`update_resources_interval` to pass so that the
   services report their resource providers and inventory to placement.
#. Run the :ref:`nova-manage placement heal_allocations <heal_allocations_cli>`
command to report allocations to placement for the existing instances in
nova.
#. Run the :ref:`nova-manage placement sync_aggregates <sync_aggregates_cli>`
command to synchronize nova host aggregates to placement resource provider
aggregates.
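
The following console sequence illustrates the steps above; it assumes a MySQL
backend, a placement database named ``placement`` and systemd-managed
services, all of which are deployment-specific, so adjust names, credentials
and service units accordingly:

.. code-block:: console

   # 1. Back up the existing placement database.
   $ mysqldump placement > placement-backup.sql

   # 2. Recreate the placement database and run the schema migrations.
   $ mysql -e 'DROP DATABASE placement; CREATE DATABASE placement;'
   $ placement-manage db sync

   # 3. Restart the nova-compute services (or wait for the
   #    update_resources_interval to pass) so they re-create their resource
   #    providers and inventory; the service unit name varies by distribution.
   $ systemctl restart openstack-nova-compute

   # 4. Report allocations to placement for the existing instances.
   $ nova-manage placement heal_allocations --verbose

   # 5. Mirror nova host aggregates to placement resource provider aggregates.
   $ nova-manage placement sync_aggregates --verbose
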
Once complete, test your deployment as usual, e.g. by running Tempest
integration and/or Rally tests, or by creating, migrating and deleting a
server.
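
As a quick smoke test, assuming the ``osc-placement`` plugin is installed and
``<flavor>``, ``<image>`` and ``<network>`` are placeholders for resources in
your deployment:

.. code-block:: console

   # Each compute host should have a resource provider again.
   $ openstack resource provider list

   # Each existing server should have allocations again after heal_allocations.
   $ openstack resource provider allocation show <server-uuid>

   # Creating and deleting a test server exercises the scheduler and placement.
   $ openstack server create --flavor <flavor> --image <image> --network <network> test-server
   $ openstack server delete test-server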


@@ -643,6 +643,8 @@ Placement
* - 255
- An unexpected error occurred.
.. _sync_aggregates_cli:
``nova-manage placement sync_aggregates [--verbose]``
Mirrors compute host aggregates to resource provider aggregates
in the Placement service. Requires the :oslo.config:group:`api_database`