kernel: Do not call quiet_vmstat from IRQ context

We received a bug report indicating that the "Dirty" field in
/proc/meminfo was increasing without bounds, to the point that the
number of dirty file pages would eventually reach the limit enforced by
the vm.dirty_bytes threshold (which is set to 800_000_000 bytes in
StarlingX), causing any task attempting to carry out disk I/O to
block.

Upon further debugging, we noticed that this issue occurred on nohz_full
CPUs where a user application was carrying out disk I/O by writing to
and rotating log files. The issue was reproducible very reliably with
the preempt-rt patch set.

This commit addresses the issue by reverting commit 62cb1188ed86
("sched/idle: Move quiet_vmstat() into the NOHZ code"), which was
merged in the v4.15-rc1 time frame. The revert, in effect, moves the
quiet_vmstat() call from hard IRQ context back to the start of the idle
loop. Please see the patch description for a more detailed overview.

Note that this commit does not introduce a novel change, as the
4.14.298-rt140 kernel, released on 2022-11-04, does not contain the
reverted commit either; this should preclude the need for regression
testing in terms of functionality and performance.

I would like to acknowledge the extensive help and guidance provided by
Jim Somerville <jim.somerville@windriver.com> during the debugging and
investigation of this issue.

Verification

- The issue was reproduced on an older CentOS-based StarlingX system,
  running a StarlingX/linux-yocto preempt-rt kernel based on
  v5.10.112-rt61, by running a test application for about 4-5 hours. In
  this configuration, the issue becomes apparent within about an hour,
  when the Dirty field in /proc/meminfo reaches the
  vm.dirty_background_bytes sysctl threshold (set to 600_000_000 bytes
  in StarlingX). By the end of the test, the Dirty field was very close
  to the vm.dirty_bytes sysctl threshold (800_000_000 bytes).

  Afterwards, a kernel patched with this commit no longer reproduced
  the issue when running the same test application for ~12.5 hours.
  (Note that the second test had the Meltdown/Spectre mitigations
  enabled by accident, but we are confident that this does not affect
  the test results.) The Dirty value in /proc/meminfo stayed around
  180_000 KiB for the duration of the test. A test re-run with the
  Meltdown/Spectre mitigations disabled, for a duration of 1.75 hours,
  had similar results.

  The test application that reproduces this issue writes to and rotates
  log files rapidly, with a usleep(0) call between every log file
  rotation. The issue reproduces most reliably on nohz_full CPUs with
  the preempt-rt kernel.

- A Debian-based StarlingX ISO image was successfully built with this
  commit.

- The ISO image was successfully installed into a qemu/KVM-based virtual
  machine using the All-in-One Simplex, low-latency profile, and the
  Ansible bootstrap procedure was successful.

- The issue was confirmed to no longer occur with this commit, by
  running multiple concurrent instances of a simplified test application
  for about 30 minutes (on the installation resulting from the
  Debian-based StarlingX ISO image built with this commit). Without a
  patched kernel, the issue becomes apparent within 10 minutes of test
  runtime in this configuration.

Closes-Bug: 2002039
Change-Id: I818d8bd751f4b1941a26530a99a4a635e98d5c54
Signed-off-by: M. Vefa Bicakci <vefa.bicakci@windriver.com>
This commit is contained in:
M. Vefa Bicakci 2023-01-05 15:47:33 +00:00
parent f29ab14ca8
commit 436c7067d0
4 changed files with 242 additions and 0 deletions


@@ -0,0 +1,120 @@
From d636082d0ae322ebd744b88ccdc3a8a501826494 Mon Sep 17 00:00:00 2001
From: "M. Vefa Bicakci" <vefa.bicakci@windriver.com>
Date: Wed, 4 Jan 2023 20:36:53 -0500
Subject: [PATCH] Revert "sched/idle: Move quiet_vmstat() into the NOHZ code"

This reverts commit 62cb1188ed86a9cf082fd2f757d4dd9b54741f24.

We received a bug report indicating that the "Dirty" field in
/proc/meminfo was increasing without bounds, to the point that the
number of dirty file pages would eventually reach the limit enforced by
the vm.dirty_bytes threshold (which is set to 800_000_000 bytes in
StarlingX) and cause any task attempting to carry out disk I/O to get
blocked.

Upon further debugging, we noticed that this issue occurred on nohz_full
CPUs where a user application was carrying out disk I/O by writing to
and rotating log files, with a usleep(0) call between every log file
rotation. The issue was reproducible very reliably with the preempt-rt
patch set.

The reverted commit moved the quiet_vmstat() call from the entry of the
idle loop (the do_idle() function) to the tick_nohz_stop_tick()
function. However, tick_nohz_stop_tick() is called very frequently from
hard IRQ context via the following call chain:

  irq_exit_rcu
    tick_irq_exit (has a condition to check for nohz_full CPUs)
      tick_nohz_irq_exit
        tick_nohz_full_update_tick
          tick_nohz_stop_sched_tick
            tick_nohz_stop_tick
              quiet_vmstat

The check for nohz_full CPUs in tick_irq_exit() explains why this issue
occurred more reliably on nohz_full CPUs.

Calling quiet_vmstat() from hard IRQ context is problematic.
quiet_vmstat() makes the following calls to update vm_node_stat as well
as other statistics such as vm_zone_stat and vm_numa_stat. Recall that
an element in the vm_node_stat array tracks the number of dirty file
pages:

  quiet_vmstat
    refresh_cpu_vm_stats
      fold_diff (updates vm_node_stat and other statistics)

However, __mod_node_page_state() (and related functions) also update
vm_node_stat, and although it is called with interrupts disabled in most
contexts (via spin_lock_irqsave), there are instances where it is called
with interrupts enabled (as evidenced by instrumenting the function with
counters that count the number of times it was called with and without
interrupts disabled). Also, the fact that __mod_node_page_state() and
its sibling function __mod_zone_page_state() should not be called with
interrupts enabled is evidenced by the following comment in mm/vmstat.c
above __mod_zone_page_state():

  For use when we know that interrupts are disabled, or when we know
  that preemption is disabled and that particular counter cannot be
  updated from interrupt context.

Furthermore, recall that the preempt-rt patch set makes most spinlocks
sleeping locks and changes the implementation of spin_lock_irqsave() in
such a way that IRQs are *not* disabled by spin_lock_irqsave(). With the
preempt-rt patch set, this corresponds to a significant increase in the
number of calls to __mod_node_page_state() with interrupts *enabled*.
This in turn significantly increases the possibility of incorrectly
modifying global statistics variables such as the ones in the
vm_node_stat array.

To avoid this issue, we revert commit 62cb1188ed86 ("sched/idle: Move
quiet_vmstat() into the NOHZ code") and thereby move the quiet_vmstat()
call back to the idle loop's entry point, where it is *not* called from
hard IRQ context. With this revert applied, the issue is no longer
reproducible.

I would like to acknowledge the extensive help and guidance provided by
Jim Somerville <jim.somerville@windriver.com> during the debugging and
investigation of this issue.

Signed-off-by: M. Vefa Bicakci <vefa.bicakci@windriver.com>
---
 kernel/sched/idle.c      | 1 +
 kernel/time/tick-sched.c | 2 --
 2 files changed, 1 insertion(+), 2 deletions(-)
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index 2593a733c084..a4122590ca54 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -271,6 +271,7 @@ static void do_idle(void)
 	 */
 	__current_set_polling();
+	quiet_vmstat();
 	tick_nohz_idle_enter();
 
 	while (!need_resched()) {
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index a800fd555499..9935576b295d 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -24,7 +24,6 @@
 #include <linux/irq_work.h>
 #include <linux/posix-timers.h>
 #include <linux/context_tracking.h>
-#include <linux/mm.h>
 
 #include <asm/irq_regs.h>
 
@@ -812,7 +811,6 @@ static void tick_nohz_stop_tick(struct tick_sched *ts, int cpu)
 	 */
 	if (!ts->tick_stopped) {
 		calc_load_nohz_start();
-		quiet_vmstat();
 		ts->last_tick = hrtimer_get_expires(&ts->sched_timer);
 		ts->tick_stopped = 1;
-- 
2.25.1


@@ -35,3 +35,4 @@
 0037-xfs-drop-unused-ioend-private-merge-and-setfilesize-.patch
 0038-xfs-drop-unnecessary-setfilesize-helper.patch
 0039-samples-bpf-use-kprobe-and-urandom_read_iter.patch
+0040-Revert-sched-idle-Move-quiet_vmstate-into-the-NOHZ-c.patch


@@ -0,0 +1,120 @@
From ce6ac6352c2d5d319a852959b0d3169a0cf23d78 Mon Sep 17 00:00:00 2001
From: "M. Vefa Bicakci" <vefa.bicakci@windriver.com>
Date: Wed, 4 Jan 2023 20:41:54 -0500
Subject: [PATCH] Revert "sched/idle: Move quiet_vmstat() into the NOHZ code"

This reverts commit 62cb1188ed86a9cf082fd2f757d4dd9b54741f24.

We received a bug report indicating that the "Dirty" field in
/proc/meminfo was increasing without bounds, to the point that the
number of dirty file pages would eventually reach the limit enforced by
the vm.dirty_bytes threshold (which is set to 800_000_000 bytes in
StarlingX) and cause any task attempting to carry out disk I/O to get
blocked.

Upon further debugging, we noticed that this issue occurred on nohz_full
CPUs where a user application was carrying out disk I/O by writing to
and rotating log files, with a usleep(0) call between every log file
rotation. The issue was reproducible very reliably with the preempt-rt
patch set.

The reverted commit moved the quiet_vmstat() call from the entry of the
idle loop (the do_idle() function) to the tick_nohz_stop_tick()
function. However, tick_nohz_stop_tick() is called very frequently from
hard IRQ context via the following call chain:

  irq_exit_rcu
    tick_irq_exit (has a condition to check for nohz_full CPUs)
      tick_nohz_irq_exit
        tick_nohz_full_update_tick
          tick_nohz_stop_sched_tick
            tick_nohz_stop_tick
              quiet_vmstat

The check for nohz_full CPUs in tick_irq_exit() explains why this issue
occurred more reliably on nohz_full CPUs.

Calling quiet_vmstat() from hard IRQ context is problematic.
quiet_vmstat() makes the following calls to update vm_node_stat as well
as other statistics such as vm_zone_stat and vm_numa_stat. Recall that
an element in the vm_node_stat array tracks the number of dirty file
pages:

  quiet_vmstat
    refresh_cpu_vm_stats
      fold_diff (updates vm_node_stat and other statistics)

However, __mod_node_page_state() (and related functions) also update
vm_node_stat, and although it is called with interrupts disabled in most
contexts (via spin_lock_irqsave), there are instances where it is called
with interrupts enabled (as evidenced by instrumenting the function with
counters that count the number of times it was called with and without
interrupts disabled). Also, the fact that __mod_node_page_state() and
its sibling function __mod_zone_page_state() should not be called with
interrupts enabled is evidenced by the following comment in mm/vmstat.c
above __mod_zone_page_state():

  For use when we know that interrupts are disabled, or when we know
  that preemption is disabled and that particular counter cannot be
  updated from interrupt context.

Furthermore, recall that the preempt-rt patch set makes most spinlocks
sleeping locks and changes the implementation of spin_lock_irqsave() in
such a way that IRQs are *not* disabled by spin_lock_irqsave(). With the
preempt-rt patch set, this corresponds to a significant increase in the
number of calls to __mod_node_page_state() with interrupts *enabled*.
This in turn significantly increases the possibility of incorrectly
modifying global statistics variables such as the ones in the
vm_node_stat array.

To avoid this issue, we revert commit 62cb1188ed86 ("sched/idle: Move
quiet_vmstat() into the NOHZ code") and thereby move the quiet_vmstat()
call back to the idle loop's entry point, where it is *not* called from
hard IRQ context. With this revert applied, the issue is no longer
reproducible.

I would like to acknowledge the extensive help and guidance provided by
Jim Somerville <jim.somerville@windriver.com> during the debugging and
investigation of this issue.

Signed-off-by: M. Vefa Bicakci <vefa.bicakci@windriver.com>
---
 kernel/sched/idle.c      | 1 +
 kernel/time/tick-sched.c | 2 --
 2 files changed, 1 insertion(+), 2 deletions(-)
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index 2593a733c084..a4122590ca54 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -271,6 +271,7 @@ static void do_idle(void)
 	 */
 	__current_set_polling();
+	quiet_vmstat();
 	tick_nohz_idle_enter();
 
 	while (!need_resched()) {
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 92fb738813f3..81ef5f83a0d8 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -24,7 +24,6 @@
 #include <linux/irq_work.h>
 #include <linux/posix-timers.h>
 #include <linux/context_tracking.h>
-#include <linux/mm.h>
 
 #include <asm/irq_regs.h>
 
@@ -812,7 +811,6 @@ static void tick_nohz_stop_tick(struct tick_sched *ts, int cpu)
 	 */
 	if (!ts->tick_stopped) {
 		calc_load_nohz_start();
-		quiet_vmstat();
 		ts->last_tick = hrtimer_get_expires(&ts->sched_timer);
 		ts->tick_stopped = 1;
-- 
2.25.1


@@ -34,3 +34,4 @@
 0036-xfs-drop-unused-ioend-private-merge-and-setfilesize-.patch
 0037-xfs-drop-unnecessary-setfilesize-helper.patch
 0038-samples-bpf-use-kprobe-and-urandom_read_iter.patch
+0039-Revert-sched-idle-Move-quiet_vmstate-into-the-NOHZ-c.patch