[Doc] Best practices for effectively tolerating down cells

Adds a section in the admin guide with the config options related to
down cells.

Related to blueprint handling-down-cell

Change-Id: I6a6cc71e83896aaccd5dd98bc2ea024d6f22d528
Surya Seetharaman 2019-02-20 15:33:04 +01:00
parent a37a035c9d
commit 57eb9424b9
3 changed files with 98 additions and 5 deletions


@ -0,0 +1,88 @@
==================
CellsV2 Management
==================
This section describes recommended practices and tips for running and
maintaining CellsV2, aimed at admins and operators. For details on the
basic concept of CellsV2 and its layout, see the main :doc:`/user/cellsv2-layout`
page.
.. _handling-cell-failures:
Handling cell failures
----------------------
For an explanation of how ``nova-api`` handles cell failures, see the
`Handling Down Cells <https://developer.openstack.org/api-guide/compute/down_cells.html>`__
section of the Compute API guide. Below are some recommended practices and
considerations for effectively tolerating cell failures.
Configuration considerations
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Because nova decides whether a cell is reachable based on timeouts, set the
following options to values that suit your requirements.
#. :oslo.config:option:`database.max_retries` is 10 by default, meaning that
   every time a cell becomes unreachable, nova retries the connection 10 times
   before declaring the cell "down".
#. :oslo.config:option:`database.retry_interval` is 10 seconds and
   :oslo.config:option:`oslo_messaging_rabbit.rabbit_retry_interval` is 1 second
   by default, meaning that every time a cell becomes unreachable, nova retries
   every 10 seconds for a database problem or every 1 second for a message
   queue problem.
#. Nova also has a timeout value called ``CELL_TIMEOUT``, which is hardcoded to
   60 seconds. This is the total time ``nova-api`` waits before returning
   partial results for the "down" cells.
The values of the above settings will affect the time required for nova to decide
if a cell is unreachable and then take the necessary actions like returning
partial results.
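
As a rough illustration of how these defaults add up, the sketch below
multiplies the retry count by the retry interval to estimate how long retries
alone can take for an unreachable cell database. It is plain arithmetic on the
defaults quoted above, not code from nova. The function name
``retry_window_seconds`` is hypothetical, and the estimate assumes one
fixed-length wait per retry and ignores the time each connection attempt takes.

.. code-block:: python

   # Defaults quoted in the list above.
   DB_MAX_RETRIES = 10      # database.max_retries
   DB_RETRY_INTERVAL = 10   # database.retry_interval, in seconds
   CELL_TIMEOUT = 60        # hardcoded in nova, in seconds


   def retry_window_seconds(max_retries, retry_interval):
       """Return the time spent waiting between retries, in seconds.

       Hypothetical helper: assumes one fixed-length wait per retry and
       ignores how long each connection attempt itself takes.
       """
       return max_retries * retry_interval


   db_window = retry_window_seconds(DB_MAX_RETRIES, DB_RETRY_INTERVAL)
   print(db_window)                 # 100
   print(db_window > CELL_TIMEOUT)  # True

Under these assumptions, the default database retry window (100 seconds) is
longer than ``CELL_TIMEOUT`` (60 seconds). Lowering ``max_retries`` or
``retry_interval`` shortens the window accordingly.
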
The operator can also control the results of certain actions, such as listing
servers and services, through the
:oslo.config:option:`api.list_records_by_skipping_down_cells` config option.
If it is true, results from the unreachable cells are skipped. If it is false,
the request fails with an API error in situations where partial constructs
cannot be computed.
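
The effect of this option can be pictured with the following toy model. It is
not nova's implementation: ``list_records``, ``CellUnreachable``, and the
dictionary of per-cell results are all hypothetical names, with ``None``
standing in for a cell that did not respond. It models only the case where no
partial construct can be built for the down cell.

.. code-block:: python

   class CellUnreachable(Exception):
       """Hypothetical error standing in for the API error."""


   def list_records(results_by_cell, skip_down_cells):
       """Merge per-cell results; None marks an unreachable cell."""
       merged = []
       for cell, records in sorted(results_by_cell.items()):
           if records is None:
               if skip_down_cells:
                   continue  # leave the down cell out of the listing
               raise CellUnreachable(cell)
           merged.extend(records)
       return merged


   results = {"cell1": ["server-a", "server-b"], "cell2": None}
   # With skipping enabled, only records from reachable cells are returned.
   print(list_records(results, skip_down_cells=True))

Calling ``list_records(results, skip_down_cells=False)`` on the same input
raises ``CellUnreachable`` instead of returning a shortened list.
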
Disabling down cells
~~~~~~~~~~~~~~~~~~~~
While the temporary outage in the infrastructure is being fixed, the affected
cells can be disabled so that they are no longer scheduling candidates.
To disable a cell, use :command:`nova-manage cell_v2 update_cell
--cell_uuid <cell_uuid> --disable`; the same command accepts ``--enable`` to
re-enable it. See the :ref:`man-page-cells-v2` man page for details on command
usage.
Known issues
~~~~~~~~~~~~
#. **Services and Performance:** If a cell is down while nova services are
   starting, the services may hang because they cannot connect to all the cell
   databases required for certain calculations and initializations. For
   example, if :oslo.config:option:`upgrade_levels.compute` is set to ``auto``,
   the ``nova-api`` service hangs on startup if at least one cell is
   unreachable. This is because it needs to connect to all the cells to gather
   each compute service's version and determine the compute version cap to
   use. The current workaround is to pin
   :oslo.config:option:`upgrade_levels.compute` to a particular version such as
   "rocky" so that the service can start. See `bug 1815697
   <https://bugs.launchpad.net/nova/+bug/1815697>`__ for more details. Also note
   that, in general, operations that need to reach all the cells may be slow
   while cells are unreachable, because of the configurable timeout and retry
   values described above.
#. **Counting Quotas:** The current approach to counting quotas queries each
   cell database for the resources used and aggregates them, which makes it
   sensitive to temporary cell outages. While a cell is unavailable, the
   resource usage recorded in that cell database cannot be counted, so nova
   behaves as though more quota is available than there should be. That is,
   if a tenant has used all of their quota and part of it is in cell A, and
   cell A goes offline temporarily, that tenant will suddenly be able to
   allocate more resources than their limit (once cell A returns, the tenant
   will have more resources allocated than their allowed quota). In the future,
   this will be solved by using the placement and nova_api databases for
   counting quotas, removing the dependency on individual cells. See
   `counting quotas from placement`_ for more details.
.. _counting quotas from placement: https://specs.openstack.org/openstack/nova-specs/specs/stein/approved/count-quota-usage-from-placement.html
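
The quota scenario in the second item can be reproduced with a small model.
This is a sketch only: ``count_usage``, the per-cell dictionaries, and the
numbers are hypothetical, and ``None`` marks a cell whose database cannot be
queried.

.. code-block:: python

   def count_usage(usage_by_cell):
       """Sum usage over reachable cells; None marks a down cell."""
       return sum(used for used in usage_by_cell.values() if used is not None)


   LIMIT = 10  # tenant quota, in instances (illustrative number)

   # All cells up: 6 instances in cell A and 4 in cell B exhaust the quota.
   healthy = {"A": 6, "B": 4}
   print(LIMIT - count_usage(healthy))   # 0

   # Cell A goes down: its 6 instances are no longer counted.
   degraded = {"A": None, "B": 4}
   print(LIMIT - count_usage(degraded))  # 6

If the tenant uses that apparent headroom while cell A is down, they end up
with 16 instances against a limit of 10 once cell A returns.
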


@ -22,6 +22,7 @@ operating system, and exposes functionality over a web-based API.
adv-config.rst
arch.rst
availability-zones.rst
cells.rst
configuration/index.rst
configuring-migrations.rst
cpu-topologies.rst


@ -121,10 +121,11 @@ always be much smaller than the number of instances.
There are availability implications with this change since something like a
'nova list' which might query multiple cells could end up with a partial result
if there is a database failure in a cell. A database failure within a cell
would cause larger issues than a partial list result so the expectation is that
it would be addressed quickly and cellsv2 will handle it by indicating in the
response that the data may not be complete.
if there is a database failure in a cell. See :doc:`/admin/cells` for more on
the recommended practices in such situations. A database failure
within a cell would cause larger issues than a partial list result so the
expectation is that it would be addressed quickly and cellsv2 will handle it by
indicating in the response that the data may not be complete.
Since this is very similar to what we have with current cells, in terms of
organization of resources, we have decided to call this "cellsv2" for
@ -819,4 +820,7 @@ FAQs
See the `Handling Down Cells
<https://developer.openstack.org/api-guide/compute/down_cells.html>`__
section of the Compute API guide for more information on the partial
constructs.
constructs.
For administrative considerations, see
:ref:`Handling cell failures <handling-cell-failures>`.