Commit 53bcf5c328d8d89d7357d9d8bdd58ad9fb06a377

Authored by Vlastimil Babka
Committed by Greg Kroah-Hartman
1 parent a78e877e9a

mm, vmscan: prevent kswapd livelock due to pfmemalloc-throttled process being killed

commit 9e5e3661727eaf960d3480213f8e87c8d67b6956 upstream.

Charles Shirron and Paul Cassella from Cray Inc have reported kswapd
stuck in a busy loop with nothing left to balance, but
kswapd_try_to_sleep() failing to sleep.  Their analysis found the cause
to be a combination of several factors:

1. A process is waiting in throttle_direct_reclaim() on pgdat->pfmemalloc_wait

2. The process has been killed (by OOM in this case), but has not yet been
   scheduled to remove itself from the waitqueue and die.

3. kswapd checks for throttled processes in prepare_kswapd_sleep():

        if (waitqueue_active(&pgdat->pfmemalloc_wait)) {
                wake_up(&pgdat->pfmemalloc_wait);
                return false; // kswapd will not go to sleep
        }

   However, for a process that was already killed, wake_up() does not remove
   the process from the waitqueue, since try_to_wake_up() checks its state
   first and returns false when the process is no longer waiting.

4. kswapd is running on the only CPU that the process is allowed to run
   on (through cpus_allowed, or possibly a single-CPU system).

5. A CONFIG_PREEMPT_NONE=y kernel is used. If there's nothing to balance, kswapd
   encounters no voluntary preemption points and repeatedly fails
   prepare_kswapd_sleep(), blocking the process from running and removing
   itself from the waitqueue, which would let kswapd sleep.
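
The interaction of factors 1-5 can be sketched in user space. The
following is only a simplified model, not kernel code: the sim_* names
are hypothetical, the waitqueue is reduced to a single entry, and it
assumes the throttled process never gets to run between kswapd's
retries (factors 4 and 5).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical user-space model; none of the sim_* names are kernel APIs. */
enum sim_state { SIM_SLEEPING, SIM_RUNNABLE };

struct sim_waiter {
	enum sim_state state;	/* scheduler state of the throttled process */
	bool on_queue;		/* still linked on the modelled pfmemalloc_wait? */
};

/* Mimics try_to_wake_up(): succeeds only if the task was actually sleeping. */
static bool sim_try_to_wake_up(struct sim_waiter *w)
{
	if (w->state != SIM_SLEEPING)
		return false;
	w->state = SIM_RUNNABLE;
	return true;
}

/* Mimics the wake up in factor 3: unlink only after a successful wake. */
static void sim_wake_up(struct sim_waiter *w)
{
	if (w->on_queue && sim_try_to_wake_up(w))
		w->on_queue = false;
}

/* The pre-patch check: refuse to sleep while the queue is non-empty. */
static bool sim_old_prepare_sleep(struct sim_waiter *w)
{
	if (w->on_queue) {
		sim_wake_up(w);
		return false;	/* kswapd will not go to sleep */
	}
	return true;
}

/*
 * Count how often kswapd retries before it is allowed to sleep, giving
 * up after max_tries. The waiter is never scheduled in between.
 */
static int sim_old_retries(struct sim_waiter *w, int max_tries)
{
	int tries = 0;

	assert(max_tries >= 0);
	while (tries < max_tries && !sim_old_prepare_sleep(w))
		tries++;
	return tries;
}
```

In this model a waiter that is still sleeping is woken and unlinked on
the first retry, so the next check succeeds. A waiter that was already
made runnable by the kill stays linked, and the retry count reaches
whatever limit is passed in.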

So, the source of the problem is that we prevent kswapd from going to
sleep as long as there are processes waiting on the pfmemalloc_wait queue,
and a process waiting on a queue is guaranteed to be removed from the
queue only when it gets scheduled.  This was done to make sure that no
process is left sleeping on pfmemalloc_wait when kswapd itself goes to
sleep.

However, it isn't necessary to postpone kswapd sleep until the
pfmemalloc_wait queue actually empties.  To prevent processes from being
left sleeping, it's actually enough to guarantee that all processes
waiting on pfmemalloc_wait queue have been woken up by the time we put
kswapd to sleep.

This patch therefore fixes this issue by substituting 'wake_up' with
'wake_up_all' and removing 'return false' in the code snippet from
prepare_kswapd_sleep() above.  Note that if any process puts itself in
the queue after this waitqueue_active() check, or after the wake up
itself, it means that the process will also wake up kswapd - and since
we are under prepare_to_wait(), the wake up won't be missed.  Also we
update the comment in prepare_kswapd_sleep() to hopefully more clearly
describe the races it is preventing.
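
The effect of the change can be sketched with a small user-space model.
This is an illustration under simplifying assumptions, not the kernel
implementation: the model_* names are hypothetical, the waitqueue is a
plain array, and the "balanced" argument stands in for the result of
pgdat_balanced().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-space model; none of the model_* names are kernel APIs. */
enum model_state { MODEL_SLEEPING, MODEL_RUNNABLE };

struct model_waiter {
	enum model_state state;	/* scheduler state of the throttled process */
	bool on_queue;		/* still linked on the modelled waitqueue? */
};

/* True if any waiter is still linked, like waitqueue_active(). */
static bool model_queue_active(const struct model_waiter *q, size_t n)
{
	for (size_t i = 0; i < n; i++)
		if (q[i].on_queue)
			return true;
	return false;
}

/*
 * Mimics wake_up_all(): every sleeping waiter is made runnable and
 * unlinked; a waiter that is already runnable (e.g. killed) stays linked
 * until it gets to run and removes itself.
 */
static void model_wake_up_all(struct model_waiter *q, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		if (q[i].on_queue && q[i].state == MODEL_SLEEPING) {
			q[i].state = MODEL_RUNNABLE;
			q[i].on_queue = false;
		}
	}
}

/*
 * The post-patch check: wake everyone, then let the balance test alone
 * decide. A non-empty queue no longer forces a "do not sleep" result.
 */
static bool model_new_prepare_sleep(struct model_waiter *q, size_t n,
				    bool balanced)
{
	assert(q != NULL || n == 0);
	if (model_queue_active(q, n))
		model_wake_up_all(q, n);
	return balanced;
}
```

With one killed (already runnable) waiter and one sleeping waiter on the
queue, the model returns the balance result on the first call: the
sleeping waiter has been woken, and the killed one is simply left linked
until it is scheduled.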

Fixes: 5515061d22f0 ("mm: throttle direct reclaimers if PF_MEMALLOC reserves are low and swap is backed by network storage")
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

Showing 1 changed file with 13 additions and 11 deletions

@@ -2904,18 +2904,20 @@
 		return false;
 
 	/*
-	 * There is a potential race between when kswapd checks its watermarks
-	 * and a process gets throttled. There is also a potential race if
-	 * processes get throttled, kswapd wakes, a large process exits therby
-	 * balancing the zones that causes kswapd to miss a wakeup. If kswapd
-	 * is going to sleep, no process should be sleeping on pfmemalloc_wait
-	 * so wake them now if necessary. If necessary, processes will wake
-	 * kswapd and get throttled again
+	 * The throttled processes are normally woken up in balance_pgdat() as
+	 * soon as pfmemalloc_watermark_ok() is true. But there is a potential
+	 * race between when kswapd checks the watermarks and a process gets
+	 * throttled. There is also a potential race if processes get
+	 * throttled, kswapd wakes, a large process exits thereby balancing the
+	 * zones, which causes kswapd to exit balance_pgdat() before reaching
+	 * the wake up checks. If kswapd is going to sleep, no process should
+	 * be sleeping on pfmemalloc_wait, so wake them now if necessary. If
+	 * the wake up is premature, processes will wake kswapd and get
+	 * throttled again. The difference from wake ups in balance_pgdat() is
+	 * that here we are under prepare_to_wait().
 	 */
-	if (waitqueue_active(&pgdat->pfmemalloc_wait)) {
-		wake_up(&pgdat->pfmemalloc_wait);
-		return false;
-	}
+	if (waitqueue_active(&pgdat->pfmemalloc_wait))
+		wake_up_all(&pgdat->pfmemalloc_wait);
 
 	return pgdat_balanced(pgdat, order, classzone_idx);
 }