Merge "Re-propose nova-audit spec for Victoria"

This commit is contained in:
Zuul 2020-05-26 15:34:05 +00:00 committed by Gerrit Code Review
commit 1fcaf9c44f
1 changed files with 306 additions and 0 deletions

View File

@ -0,0 +1,306 @@
..
This work is licensed under a Creative Commons Attribution 3.0 Unported
License.
http://creativecommons.org/licenses/by/3.0/legalcode
=================================================
Add a nova-audit service for periodic maintenance
=================================================
https://blueprints.launchpad.net/nova/+spec/nova-audit
Nova is a distributed system, which means that things fail in strange
ways and data stored across multiple systems gets out of sync with the
actual state of reality. Hosts and instances come and go, along with
network connectivity, the message bus and database. Recently we have
gained a number of "heal $thing" routines that operators can run
either periodically or on demand to synchronize the states of various
services and data stores to resolve or prevent problems. The number of
these tasks is already overwhelming for the average operator, and
tracking new tasks each cycle is not realistic [1]_.
Problem description
===================
As described above, we have an increasing number of maintenance tasks
that need to be run in various scenarios. In most cases, these tasks
are idempotent and safe to run even when nothing is wrong. Operators
need a single mechanism for performing these maintenance tasks and
healing activities that can be run periodically in the background with
minimal impact to runtime performance, other than to hopefully fix
problems related to inconsistencies before they become acute enough to
get an human involved.
Use Cases
---------
As an operator, I would like Nova to heal itself whenever possible to
minimize the number of support incidents requiring human intervention.
As a user, I would like Nova to heal itself whenever possible to avoid
having to involve support for transient issues, which may be
impossible or expensive, especially during off-hour periods.
Proposed change
===============
We already have a number of these maintenance activities codified in
one-shot commands [2]_ that can be run on-demand once a problem has been
identified. Since most of them are not harmful or overly expensive, we
should be able to run those things periodically to attempt to fix
problems automatically before the operator gets involved.
This spec proposes a new binary called ``nova-audit`` to encapsulate
these tasks. Ideally it should be usable in multiple ways:
- As a singleton daemon that periodically runs tasks at various
intervals according to their potential impact on the system and
need.
- As a one-shot "fix stuff" command that can be run from cron or
otherwise scheduled or executed.
- As a daemon or one-shot command that purely audits potential
problems, but makes no changes.
A new config section of ``[audit]`` would be added with timers and
default values for each task.
Current heal/sync/fix/cleanup tasks we have that could be integrated:
``heal_allocations``
--------------------
This task checks the consistency of allocations in Placement for
instances in Nova. It has a runtime performance impact on both
Placement and the Nova database. Many instances means this should
probably check one instance per cycle, but potentially a short cycle
time.
``audit_allocations``
---------------------
This task checks for orphaned allocations in Placement for instances in
Nova and will delete them if specified by the configuration.
It has a runtime performance impact on both Placement and the Nova
database. Many instances means this should probably check one instance
per cycle, but potentially a short cycle time.
Today, the command is named ``nova-manage placement audit`` but it
might be a good idea to name it more specifically inside the context of
the ``nova-audit`` command and service.
``sync_aggregates``
-------------------
This task checks that host aggregates match between Nova and
Placement. It is required for some scheduler activities, but not all
cases. It has a runtime performance impact on both Placement and the
Nova database. Many hosts means this should probably check one
aggregate per cycle. Aggregates generally change infrequently, so a
long cycle time of an hour or more is probably reasonable.
``map_instances``
-----------------
This task checks that instances have a suitable mapping to a cell. It
has a runtime performance impact on the Nova database. Many instances
means this should probably check one instance per cycle, with a
relatively short cycle time. It may also be better to check one cell
at a time, very infrequently such as once per day.
``discover_hosts``
------------------
This task ensures that newly-registered hypervisor hosts are mapped to
the appropriate cell. This has a runtime impact on the Nova database,
but there is an efficient way to query for unmapped hosts, so this can
run relatively frequently, such as every ten minutes.
.. note:: There is already a mechanism by which to run this
periodically in the scheduler service, which should be
deprecated and replaced by ``nova-audit``.
``archive_deleted_rows``
------------------------
This task archives deleted data from the main database tables into the
shadow tables. It has a runtime performance impact on the Nova
database, both negative (while running) and positive (after
running). Some people never run this, so a cycle time of once per day
or week should be fine. This also needs a parameter to limit the scope
of archived changes to a date range, defaulting to some multiple of
the cycle time.
.. note:: This (and others) may need a configuration element to
control its execution only between certain hours or days.
``purge``
---------
This task removes data from the shadow tables entirely. It has a
runtime performance impact on the Nova database, but it is just
deleting data from tables accessed only during the
``archive_deleted_rows`` operation. In reality, this should probably
be run directly after the archival process, potentially with a
different age scope.
``heal_instance_mappings`` (proposed)
-------------------------------------
This task scans for orphaned instance mappings in the API database
that have no build request or matching instance in a cell. It has a
runtime performance impact on the Nova API and cell databases, but
only looks for mappings with no cell id. It is bounded by the number
of in-flight instance builds plus the number of orphans, which should
be small. Thus it should be fine to run this relatively frequently,
such as every ten minutes.
Alternatives
------------
We could obviously do nothing. People are managing the complexity
today, so we could simply choose to let them continue.
We could eliminate the daemon and scheduling nature of the proposal
and just provide a very unified interface to running these commands --
a single place to find all the periodic maintenance tasks separate
from the setup sort of things that ``nova-manage`` does.
We could integrate this into ``nova-manage`` itself, under a
"maintenance" subcommand or similar.
Data model impact
-----------------
None.
REST API impact
---------------
None.
Security impact
---------------
None.
Notifications impact
--------------------
None. You could argue that notifications sent about audit activity
would be useful, but doing so would require more setup and
configuration of this utility, as well as connectivity and credentials
to the message bus. We could implement that later if there is a need.
Other end user impact
---------------------
None.
Performance Impact
------------------
There will be some runtime performance impact due to the background
nature of the audit and any cleanup that happens. Mitigation is to not
run it, tune the intervals to be longer, or run it in single-shot mode
when desired.
Other deployer impact
---------------------
Deployers will have to learn about and deploy a new
command/service. This will hopefully be completely offeset by the
reduced complexity of managing and maintaining Nova in the longer
term.
Developer impact
----------------
New maintenance tasks that are added will need to be done in an
idempotent and efficient way and according to whatever interface for
these commands is defined.
Upgrade impact
--------------
A new binary will be added, which will have some impact on
upgrades. Any existing periodic maintenance jobs that call ``nova-manage``
for various tasks will need to convert over to the new command. The
interfaces we have for existing things in ``nova-manage`` can be
deprecated but maintained for an extended period to avoid breaking
existing deployments.
.. note:: Specific tasks like ``db archive_deleted_rows`` may make
sense to continue to exist in ``nova-manage`` as well.
Implementation
==============
Assignee(s)
-----------
Primary assignee:
melwitt
Other contributors:
danms
Feature Liaison
---------------
Feature liaison:
melwitt
Work Items
----------
* Create a new ``nova-audit`` command and define scheduling
mechanisms and internal interfaces.
* Create the new config section and items.
* Implement connectors to integrate the existing tasks we have into
the new command.
* Modify the ``nova-next`` job to run the audit command in single-shot
mode after the tempest run, ideally removing the existing
archive/purge invocation.
Dependencies
============
None.
Testing
=======
Unit and functional testing of the daemon and internal architecture,
and the continued requirement for testing of the actual tasks. A
single-shot run in the ``nova-next`` job as we currently do today for
archive/purge.
Documentation Impact
====================
Operator documentation about the new command, how to deploy it, and
per-knob documentation about the impacts and suggested intervals.
References
==========
.. [1] Proposed new ``heal_instance_mappings`` command for Ussuri: https://review.opendev.org/#/c/655908/
.. [2] Commands in ``nova-manage``: https://docs.openstack.org/nova/latest/cli/nova-manage.html
History
=======
.. list-table:: Revisions
:header-rows: 1
* - Release Name
- Description
* - Ussuri
- Introduced
* - Victoria
- Re-proposed and added the ``audit_allocations`` task to include
the current ``nova-manage placement audit`` functionality in
``nova-audit``.