Default live_migration_progress_timeout to off

live_migration_progress_timeout aims to time out a live-migration well
before the live_migration_completion_timeout limit, by detecting when
no progress appears to have been made copying memory between the
hosts. However, it turns out there are several problems with the way we
monitor progress. In production and in stress testing, setting
live_migration_progress_timeout > 0 has caused random timeout failures
for live-migrations that take longer than
live_migration_progress_timeout.

One problem is that block_migrations appear to show no progress, as it
seems we only look for progress in copying memory. Also, the way we
query QEMU via libvirt breaks when there are multiple iterations of
memory copying.

We need to revisit this bug and either fix the progress mechanism or
remove all the code that checks for progress (including the
automatic trigger for post-copy). But in the meantime, let's default to
having no timeout, and warn users that have overridden this
configuration by deprecating the live_migration_progress_timeout
configuration option.

For users concerned about live-migration timeout errors, I have
cleaned up the configuration option descriptions, so they have a better
chance of stopping the live-migration timeout errors they may come
across.

Related-Bug: #1644248

Change-Id: I1a1143ddf8da5fb9706cf53dbfd6cbe84e606ae1
John Garbutt 2017-02-06 17:42:59 +00:00
parent f40467b0eb
commit 510fe1353d
2 changed files with 37 additions and 4 deletions
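
The false timeouts described in the commit message can be illustrated with a small simulation. This is a minimal sketch, not Nova's actual monitoring code: it assumes the check tracks a low watermark of remaining memory and aborts when that watermark has not been beaten within the timeout. The function name, the sample format, and the sample values are all hypothetical.

```python
def progress_abort_time(samples, progress_timeout):
    """Return the sample time at which the check would abort, else None.

    samples: iterable of (seconds_elapsed, memory_remaining) pairs
             (hypothetical format, for illustration only).
    progress_timeout: seconds allowed without a new low watermark;
                      0 disables the check.
    """
    if progress_timeout == 0:
        return None
    watermark = None       # lowest "remaining" value seen so far
    last_progress = None   # time the watermark was last lowered
    for now, remaining in samples:
        if last_progress is None:
            last_progress = now
        if remaining > 0 and (watermark is None or remaining < watermark):
            watermark = remaining
            last_progress = now
        elif now - last_progress > progress_timeout:
            return now
    return None


# First copy iteration runs from t=0 to t=100. A second iteration then
# starts with more memory remaining, so the old watermark (100) is not
# beaten even though data is still being copied.
SAMPLES = [(0, 4096), (50, 2048), (100, 100),
           (150, 2048), (200, 1024), (250, 512), (300, 256)]
```

With these samples, a timeout of 150 aborts the migration at t=300 even though the second iteration is steadily shrinking, while a timeout of 0 never aborts.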

@@ -340,8 +340,14 @@ Please refer to the libvirt documentation for further details.
Maximum permitted downtime, in milliseconds, for live migration
switchover.
-Will be rounded up to a minimum of %dms. Use a large value if guest liveness
-is unimportant.
+Will be rounded up to a minimum of %dms. You can increase this value
+if you want to allow live-migrations to complete faster, or avoid
+live-migration timeout errors by allowing the guest to be paused for
+longer during the live-migration switch over.
Related options:
* live_migration_completion_timeout
""" % LIVE_MIGRATION_DOWNTIME_MIN),
# TODO(hieulq): Need to add min argument by moving from
# LIVE_MIGRATION_DOWNTIME_STEPS_MIN constant.
@@ -373,16 +379,27 @@ data before aborting the operation.
Value is per GiB of guest RAM + disk to be transferred, with lower bound of
a minimum of 2 GiB. Should usually be larger than downtime delay * downtime
steps. Set to 0 to disable timeouts.
Default is 800.
Related options:
* live_migration_downtime
* live_migration_downtime_steps
* live_migration_downtime_delay
"""),
cfg.IntOpt('live_migration_progress_timeout',
-default=150,
+default=0,
+deprecated_for_removal=True,
+deprecated_reason="Serious bugs found in this feature.",
mutable=True,
help="""
Time to wait, in seconds, for migration to make forward progress in
transferring data before aborting the operation.
Set to 0 to disable timeouts.
This is deprecated, and now disabled by default because we have found serious
bugs in this feature that caused false live-migration timeout failures. This
feature will be removed or replaced in a future release.
"""),
cfg.BoolOpt('live_migration_permit_post_copy',
default=False,

@@ -0,0 +1,16 @@
---
issues:
- |
The live-migration progress timeout controlled by the configuration option
``[libvirt]/live_migration_progress_timeout`` has been discovered to
frequently cause live-migrations to fail with a progress timeout error,
even though the live-migration is still making good progress.
To minimize problems caused by these checks we have changed the default
to 0, which means do not trigger a timeout.
To modify when a live-migration will fail with a timeout error, please now
look at ``[libvirt]/live_migration_completion_timeout`` and
``[libvirt]/live_migration_downtime``.
deprecations:
- |
``[libvirt]/live_migration_progress_timeout`` has been deprecated as this
feature has been found not to work. See bug 1644248 for more details.
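
For operators tuning ``live_migration_completion_timeout`` as suggested above, the sketch below shows roughly how the help text's "per GiB of guest RAM + disk, with a lower bound of 2 GiB" rule translates into a total time budget. It assumes a plain multiplication of the option value by the data size in GiB; the function name and parameters are hypothetical, and Nova's actual calculation may differ in detail.

```python
def effective_completion_timeout(completion_timeout, ram_gib, disk_gib=0):
    """Estimate the total completion timeout in seconds.

    completion_timeout: configured per-GiB value in seconds; 0 disables.
    ram_gib: guest RAM to transfer, in GiB.
    disk_gib: disk to transfer, in GiB (non-zero for block migration).
    """
    if completion_timeout == 0:
        return 0  # timeouts disabled
    # The help text states a lower bound of 2 GiB on the data size.
    data_gib = max(2, ram_gib + disk_gib)
    return completion_timeout * data_gib
```

For example, with a per-GiB value of 800, a 1 GiB guest is treated as 2 GiB and gets 1600 seconds, while a guest with 4 GiB of RAM and 20 GiB of disk gets 19200 seconds.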