22 Jul, 2018

1 commit

  • commit 3ee7e8697d5860b173132606d80a9cd35e7113ee upstream.

    syzbot is reporting NULL pointer dereference at wb_workfn() [1] due to
    wb->bdi->dev being NULL. And Dmitry confirmed that wb->state was
    WB_shutting_down after wb->bdi->dev became NULL. This indicates that
    bdi_unregister() failed to call wb_shutdown() on one of the wb objects.

    The problem is in cgwb_bdi_unregister() which does cgwb_kill() and thus
    drops bdi's reference to wb structures before going through the list of
    wbs again and calling wb_shutdown() on each of them. This way the loop
    iterating through all wbs can easily miss a wb if that wb has already
    passed through cgwb_remove_from_bdi_list() called from wb_shutdown()
    from cgwb_release_workfn(), and as a result the bdi is fully shut down
    although wb_workfn() for this wb structure is still running. In fact
    there are also other ways cgwb_bdi_unregister() can race with
    cgwb_release_workfn(), leading e.g. to use-after-free issues:

    CPU1                                                CPU2
                                                        cgwb_bdi_unregister()
                                                          cgwb_kill(*slot);

    cgwb_release()
      queue_work(cgwb_release_wq, &wb->release_work);
    cgwb_release_workfn()
                                                          wb = list_first_entry(&bdi->wb_list, ...)
                                                          spin_unlock_irq(&cgwb_lock);
      wb_shutdown(wb);
      ...
      kfree_rcu(wb, rcu);
                                                          wb_shutdown(wb); -> oops use-after-free

    We solve these issues by synchronizing writeback structure shutdown from
    cgwb_bdi_unregister() with cgwb_release_workfn() using a new mutex (see
    the sketch at the end of this entry). That way we also no longer need the
    synchronization using WB_shutting_down, as the mutex provides it for the
    CONFIG_CGROUP_WRITEBACK case, and without CONFIG_CGROUP_WRITEBACK
    wb_shutdown() can be called only once, from bdi_unregister().

    Reported-by: syzbot
    Acked-by: Tejun Heo
    Signed-off-by: Jan Kara
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Jan Kara
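    A rough sketch of the synchronization described above follows; the mutex
    name (cgwb_release_mutex) and the surrounding details are illustrative
    rather than the exact upstream diff:

    static void cgwb_bdi_unregister(struct backing_dev_info *bdi)
    {
            struct bdi_writeback *wb;

            spin_lock_irq(&cgwb_lock);
            /* ... cgwb_kill() every cgwb hanging off this bdi ... */
            spin_unlock_irq(&cgwb_lock);

            mutex_lock(&bdi->cgwb_release_mutex);
            spin_lock_irq(&cgwb_lock);
            while (!list_empty(&bdi->wb_list)) {
                    wb = list_first_entry(&bdi->wb_list, struct bdi_writeback,
                                          bdi_node);
                    spin_unlock_irq(&cgwb_lock);
                    /* cannot race with a shutdown from cgwb_release_workfn() */
                    wb_shutdown(wb);
                    spin_lock_irq(&cgwb_lock);
            }
            spin_unlock_irq(&cgwb_lock);
            mutex_unlock(&bdi->cgwb_release_mutex);
    }

    static void cgwb_release_workfn(struct work_struct *work)
    {
            struct bdi_writeback *wb = container_of(work, struct bdi_writeback,
                                                    release_work);

            mutex_lock(&wb->bdi->cgwb_release_mutex);
            wb_shutdown(wb);        /* serialized against unregistration above */
            mutex_unlock(&wb->bdi->cgwb_release_mutex);

            /* ... css puts, wb_exit(), kfree_rcu(wb, rcu) ... */
    }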
     

26 Jun, 2018

1 commit

  • commit f183464684190bacbfb14623bd3e4e51b7575b4c upstream.

    cgwb_release() punts the actual release to cgwb_release_workfn() on
    system_wq. Depending on the number of cgroups or block devices, there
    can be a lot of cgwb_release_workfn() in flight at the same time.

    We're periodically seeing close to 256 kworkers getting stuck with the
    following stack trace, and over time the entire system gets stuck.

    [] _synchronize_rcu_expedited.constprop.72+0x2fc/0x330
    [] synchronize_rcu_expedited+0x24/0x30
    [] bdi_unregister+0x53/0x290
    [] release_bdi+0x89/0xc0
    [] wb_exit+0x85/0xa0
    [] cgwb_release_workfn+0x54/0xb0
    [] process_one_work+0x150/0x410
    [] worker_thread+0x6d/0x520
    [] kthread+0x12c/0x160
    [] ret_from_fork+0x29/0x40
    [] 0xffffffffffffffff

    The events leading to the lockup are...

    1. A lot of cgwb_release_workfn() is queued at the same time and all
    system_wq kworkers are assigned to execute them.

    2. They all end up calling synchronize_rcu_expedited(). One of them
    wins and tries to perform the expedited synchronization.

    3. However, that involves queueing rcu_exp_work to system_wq and
    waiting for it. Because #1 is holding all available kworkers on
    system_wq, rcu_exp_work can't be executed. cgwb_release_workfn()
    is waiting for synchronize_rcu_expedited() which in turn is waiting
    for cgwb_release_workfn() to free up some of the kworkers.

    We shouldn't be scheduling hundreds of cgwb_release_workfn() at the
    same time. There's nothing to be gained from that. This patch updates
    the cgwb release path to use a dedicated percpu workqueue with
    @max_active of 1 (see the sketch at the end of this entry).

    While this resolves the problem at hand, it might be a good idea to
    isolate rcu_exp_work to its own workqueue too, as it can be used from
    various paths and is prone to this sort of indirect A-A deadlock.

    Signed-off-by: Tejun Heo
    Cc: "Paul E. McKenney"
    Cc: stable@vger.kernel.org
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Tejun Heo
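    A minimal sketch of the dedicated release workqueue described above
    (init-hook placement and flags simplified):

    /* dedicated queue with max_active == 1 so cgwb release work cannot tie
     * up every kworker on system_wq */
    static struct workqueue_struct *cgwb_release_wq;

    static int __init cgwb_init(void)
    {
            cgwb_release_wq = alloc_workqueue("cgwb_release", 0, 1);
            if (!cgwb_release_wq)
                    return -ENOMEM;
            return 0;
    }
    subsys_initcall(cgwb_init);

    static void cgwb_release(struct percpu_ref *refcnt)
    {
            struct bdi_writeback *wb = container_of(refcnt, struct bdi_writeback,
                                                    refcnt);

            /* punt to the dedicated queue instead of system_wq */
            queue_work(cgwb_release_wq, &wb->release_work);
    }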
     

16 May, 2018

1 commit

  • commit 8236b0ae31c837d2b3a2565c5f8d77f637e824cc upstream.

    syzbot is reporting hung tasks at wait_on_bit(WB_shutting_down) in
    wb_shutdown() [1]. This seems to be because commit 5318ce7d46866e1d ("bdi:
    Shutdown writeback on all cgwbs in cgwb_bdi_destroy()") forgot to call
    wake_up_bit(WB_shutting_down) after clear_bit(WB_shutting_down).

    Introduce a helper function clear_and_wake_up_bit() and use it, in order
    to avoid similar errors in the future (see the sketch at the end of this
    entry).

    [1] https://syzkaller.appspot.com/bug?id=b297474817af98d5796bc544e1bb806fc3da0e5e

    Signed-off-by: Tetsuo Handa
    Reported-by: syzbot
    Fixes: 5318ce7d46866e1d ("bdi: Shutdown writeback on all cgwbs in cgwb_bdi_destroy()")
    Cc: Tejun Heo
    Reviewed-by: Jan Kara
    Suggested-by: Linus Torvalds
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Tetsuo Handa
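    The helper pairs the clear with the wake-up so the two cannot be
    separated again; it looks along these lines, and wb_shutdown() then ends
    with clear_and_wake_up_bit(WB_shutting_down, &wb->state) instead of a
    bare clear_bit():

    static inline void clear_and_wake_up_bit(int bit, void *word)
    {
            clear_bit_unlock(bit, word);
            /* barrier so waiters observe the cleared bit before being woken */
            smp_mb__after_atomic();
            wake_up_bit(word, bit);
    }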
     

12 Sep, 2017

1 commit


21 Apr, 2017

5 commits


23 Mar, 2017

6 commits

  • Rename cgwb_bdi_destroy() to cgwb_bdi_unregister() as it gets called
    from bdi_unregister() which is not necessarily called from bdi_destroy()
    and thus the name is somewhat misleading.

    Acked-by: Tejun Heo
    Signed-off-by: Jan Kara
    Signed-off-by: Jens Axboe

    Jan Kara
     
  • Currently we wait for all cgwbs to get released in cgwb_bdi_destroy()
    (called from bdi_unregister()). That is however unnecessary now that
    cgwb->bdi is a proper refcounted reference (thus the bdi cannot get
    released before all cgwbs are released) and now that cgwb_bdi_destroy()
    shuts down writeback directly.

    Acked-by: Tejun Heo
    Signed-off-by: Jan Kara
    Signed-off-by: Jens Axboe

    Jan Kara
     
  • Currently we wait for all cgwbs to get freed in cgwb_bdi_destroy(),
    which also means that writeback has been shut down on them. Since this
    wait is going away, directly shut down writeback on cgwbs from
    cgwb_bdi_destroy() to avoid live writeback structures after
    bdi_unregister() has finished. To make that safe with concurrent
    shutdown from cgwb_release_workfn(), we also have to make sure
    wb_shutdown() returns only after the bdi_writeback structure is really
    shut down.

    Acked-by: Tejun Heo
    Signed-off-by: Jan Kara
    Signed-off-by: Jens Axboe

    Jan Kara
     
  • Currently the root bdi_writeback structure is added to bdi->wb_list in
    bdi_init() and never removed. That is different from all other
    bdi_writeback structures which get added to the list when created and
    removed from it before wb_shutdown().

    So move the list addition of the root bdi_writeback to bdi_register()
    and the list removal of all bdi_writeback structures to wb_shutdown()
    (see the sketch at the end of this entry). That way a bdi_writeback
    structure is on bdi->wb_list if and only if it can handle writeback, and
    it will make it easier for us to handle shutdown of all bdi_writeback
    structures in bdi_unregister().

    Acked-by: Tejun Heo
    Signed-off-by: Jan Kara
    Signed-off-by: Jens Axboe

    Jan Kara
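    A hedged sketch of the resulting list handling (the helper names are
    invented for illustration; upstream inlines this into bdi_register() and
    wb_shutdown(), and the lock differs without CONFIG_CGROUP_WRITEBACK):

    /* root wb joins bdi->wb_list only once the bdi is registered ... */
    static void bdi_add_root_wb(struct backing_dev_info *bdi)
    {
            spin_lock_irq(&cgwb_lock);
            list_add_tail_rcu(&bdi->wb.bdi_node, &bdi->wb_list);
            spin_unlock_irq(&cgwb_lock);
    }

    /* ... and every wb, root or not, leaves the list in wb_shutdown() */
    static void bdi_remove_wb(struct bdi_writeback *wb)
    {
            spin_lock_irq(&cgwb_lock);
            list_del_rcu(&wb->bdi_node);
            spin_unlock_irq(&cgwb_lock);
    }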
     
  • Make wb->bdi a proper refcounted reference to the bdi for all
    bdi_writeback structures except for the one embedded inside struct
    backing_dev_info (see the sketch at the end of this entry). That will
    allow us to simplify bdi unregistration.

    Acked-by: Tejun Heo
    Signed-off-by: Jan Kara
    Signed-off-by: Jens Axboe

    Jan Kara
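    Sketched below (the helper is invented for illustration): every wb other
    than the root wb embedded in the bdi takes a reference on its bdi, so
    the bdi cannot be freed while any of its wb's is still alive:

    static void wb_set_bdi(struct bdi_writeback *wb, struct backing_dev_info *bdi)
    {
            wb->bdi = bdi;
            if (wb != &bdi->wb)             /* avoid a bdi -> itself cycle */
                    bdi_get(bdi);           /* paired with bdi_put() on wb release */
    }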
     
  • congested->bdi pointer is used only to be able to remove the congested
    structure from bdi->cgwb_congested_tree on structure release. Moreover
    the pointer can become NULL when we unregister the bdi. Rename the field
    to __bdi and add a comment to make it more explicit that this is
    internal stuff of the memcg writeback code and that people should not
    use the field, as such use will likely be race prone.

    We do not bother with converting congested->bdi to a proper refcounted
    reference. It would be slightly ugly to special-case bdi->wb.congested
    to avoid effectively a cyclic reference of the bdi to itself, and the
    reference gets cleared from bdi_unregister() anyway, making it
    impossible to reference a freed bdi.

    Acked-by: Tejun Heo
    Signed-off-by: Jan Kara
    Signed-off-by: Jens Axboe

    Jan Kara
     

09 Mar, 2017

2 commits

  • bdi_writeback_congested structures get created for each blkcg and bdi
    regardless of whether the bdi is registered or not. When they are
    created for an unregistered bdi and the request queue (and thus the bdi)
    is then destroyed while a blkg still holds a reference to the
    bdi_writeback_congested structure, this structure ends up referencing
    the freed bdi and the last wb_congested_put() will try to remove the
    structure from the already freed bdi.

    With commit 165a5e22fafb "block: Move bdi_unregister() to
    del_gendisk()", SCSI started to destroy bdis without calling
    bdi_unregister() first (previously it was calling bdi_unregister() even
    for unregistered bdis) and thus the code detaching
    bdi_writeback_congested in cgwb_bdi_destroy() was not triggered and we
    started hitting this use-after-free bug. It is enough to boot a KVM
    instance with virtio-scsi device to trigger this behavior.

    Fix the problem by detaching bdi_writeback_congested structures in
    bdi_exit() instead of bdi_unregister(). This is also more logical as
    they can get attached to bdi regardless whether it ever got registered
    or not.

    Fixes: 165a5e22fafb ("block: Move bdi_unregister() to del_gendisk()")
    Signed-off-by: Jan Kara
    Tested-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Jan Kara
     
  • SCSI can call device_add_disk() several times for one request queue when
    a device is unbound and bound again, creating a new gendisk each time.
    This will lead to the bdi being repeatedly registered and unregistered.
    This was not a big problem until commit 165a5e22fafb "block: Move
    bdi_unregister() to del_gendisk()" since the bdi was only registered
    repeatedly (bdi_register() handles repeated calls fine, only we ended up
    leaking a reference to the gendisk due to overwriting bdi->owner) but
    unregistered only in blk_cleanup_queue(), which didn't get called
    repeatedly. After 165a5e22fafb we were doing correct bdi_register() -
    bdi_unregister() cycles; however, bdi_unregister() is not prepared for
    that. So make sure bdi_unregister() cleans up the bdi in such a way that
    it is prepared for a possible following bdi_register() call.

    An easy way to provoke this behavior is to enable
    CONFIG_DEBUG_TEST_DRIVER_REMOVE and use scsi_debug driver to create a
    scsi disk which immediately hangs without this fix.

    Fixes: 165a5e22fafb ("block: Move bdi_unregister() to del_gendisk()")
    Signed-off-by: Jan Kara
    Tested-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Jan Kara
     

23 Feb, 2017

1 commit

  • To make the code clearer, use rb_entry() instead of container_of() to
    deal with the rbtree (see the sketch at the end of this entry).

    Link: http://lkml.kernel.org/r/671275de093d93ddc7c6f77ddc0d357149691a39.1484306840.git.geliangtang@gmail.com
    Signed-off-by: Geliang Tang
    Cc: Jens Axboe
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Geliang Tang
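    For illustration (variable names assumed), the conversion looks like:

    struct rb_node *rbn = rb_first(&bdi->cgwb_congested_tree);
    struct bdi_writeback_congested *congested;

    /* before: open-coded container_of() on the rb_node member */
    congested = container_of(rbn, struct bdi_writeback_congested, rb_node);

    /* after: rb_entry() makes the rbtree usage explicit; it is defined as
     * container_of(), so behavior is unchanged */
    congested = rb_entry(rbn, struct bdi_writeback_congested, rb_node);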
     

09 Feb, 2017

1 commit

  • When !CONFIG_CGROUP_WRITEBACK, a bdi has a single bdi_writeback_congested
    at bdi->wb_congested. cgwb_bdi_init() allocates it with kzalloc() and
    doesn't do further initialization. This usually works fine as the
    reference count gets bumped to 1 by wb_init() and the put from
    wb_exit() releases it.

    However, when wb_init() fails, it puts the wb base ref automatically
    freeing the wb and the explicit kfree() in cgwb_bdi_init() error path
    ends up trying to free the same pointer the second time causing a
    double-free.

    Fix it by explicitly initializing the refcnt to 1 and putting the base
    ref from cgwb_bdi_destroy() (see the sketch at the end of this entry).

    Signed-off-by: Tejun Heo
    Reported-by: Dmitry Vyukov
    Fixes: a13f35e87140 ("writeback: don't embed root bdi_writeback_congested in bdi_writeback")
    Cc: stable@vger.kernel.org # v4.2+
    Signed-off-by: Jens Axboe

    Tejun Heo
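    A sketch of the fixed !CONFIG_CGROUP_WRITEBACK init path (argument lists
    are approximations for the code of that era; cgwb_bdi_destroy() then
    drops the base ref instead of relying on an implicit one):

    static int cgwb_bdi_init(struct backing_dev_info *bdi)
    {
            int err;

            bdi->wb_congested = kzalloc(sizeof(*bdi->wb_congested), GFP_KERNEL);
            if (!bdi->wb_congested)
                    return -ENOMEM;
            atomic_set(&bdi->wb_congested->refcnt, 1);      /* explicit base ref */

            err = wb_init(&bdi->wb, bdi, 1, GFP_KERNEL);
            if (err) {
                    /* drop the base ref instead of kfree(): frees exactly once */
                    wb_congested_put(bdi->wb_congested);
                    return err;
            }
            return 0;
    }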
     

02 Feb, 2017

1 commit

  • Instead of storing backing_dev_info inside struct request_queue,
    allocate it dynamically, reference count it, and free it when the last
    reference is dropped. Currently only request_queue holds the reference
    but in the following patch we add other users referencing
    backing_dev_info.

    Signed-off-by: Jan Kara
    Signed-off-by: Jens Axboe

    Jan Kara
     

08 Nov, 2016

1 commit


05 Aug, 2016

1 commit

  • The name for a bdi of a gendisk is derived from the gendisk's devt.
    However, since the gendisk is destroyed before the bdi it leaves a
    window where a new gendisk could dynamically reuse the same devt while a
    bdi with the same name is still live. Arrange for the bdi to hold a
    reference against its "owner" disk device while it is registered (see
    the sketch at the end of this entry). Otherwise we can hit sysfs
    duplicate name collisions like the following:

    WARNING: CPU: 10 PID: 2078 at fs/sysfs/dir.c:31 sysfs_warn_dup+0x64/0x80
    sysfs: cannot create duplicate filename '/devices/virtual/bdi/259:1'

    Hardware name: HP ProLiant DL580 Gen8, BIOS P79 05/06/2015
    0000000000000286 0000000002c04ad5 ffff88006f24f970 ffffffff8134caec
    ffff88006f24f9c0 0000000000000000 ffff88006f24f9b0 ffffffff8108c351
    0000001f0000000c ffff88105d236000 ffff88105d1031e0 ffff8800357427f8
    Call Trace:
    [] dump_stack+0x63/0x87
    [] __warn+0xd1/0xf0
    [] warn_slowpath_fmt+0x5f/0x80
    [] sysfs_warn_dup+0x64/0x80
    [] sysfs_create_dir_ns+0x7e/0x90
    [] kobject_add_internal+0xaa/0x320
    [] ? vsnprintf+0x34e/0x4d0
    [] kobject_add+0x75/0xd0
    [] ? mutex_lock+0x12/0x2f
    [] device_add+0x125/0x610
    [] device_create_groups_vargs+0xd8/0x100
    [] device_create_vargs+0x1c/0x20
    [] bdi_register+0x8c/0x180
    [] bdi_register_dev+0x27/0x30
    [] add_disk+0x175/0x4a0

    Cc:
    Reported-by: Yi Zhang
    Tested-by: Yi Zhang
    Signed-off-by: Dan Williams

    Fixed up missing 0 return in bdi_register_owner().

    Signed-off-by: Jens Axboe

    Dan Williams
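    A hedged sketch of the ownership described above (helper names invented;
    upstream adds bdi_register_owner() and drops the reference in
    bdi_unregister()):

    static void bdi_set_owner(struct backing_dev_info *bdi, struct device *owner)
    {
            WARN_ON_ONCE(bdi->owner);
            bdi->owner = owner;
            get_device(owner);              /* pin the disk device */
    }

    static void bdi_release_owner(struct backing_dev_info *bdi)
    {
            if (bdi->owner) {
                    put_device(bdi->owner); /* devt may now be reused safely */
                    bdi->owner = NULL;
            }
    }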
     

29 Jul, 2016

1 commit

  • This moves the LRU lists from the zone to the node and related data such
    as counters, tracing, congestion tracking and writeback tracking.

    Unfortunately, due to reclaim and compaction retry logic, it is
    necessary to account for the number of LRU pages on both zone and node
    logic. Most reclaim logic is based on the node counters but the retry
    logic uses the zone counters which do not distinguish inactive and
    active sizes. It would be possible to leave the LRU counters on a
    per-zone basis but it's a heavier calculation across multiple cache
    lines that is much more frequent than the retry checks.

    Other than the LRU counters, this is mostly a mechanical patch but note
    that it introduces a number of anomalies. For example, the scans are
    per-zone but using per-node counters. We also mark a node as congested
    when a zone is congested. This causes weird problems that are fixed
    later but is easier to review.

    In the event that there is excessive overhead on 32-bit systems due to
    the nodes being on the LRU, there are two potential solutions:

    1. Long-term isolation of highmem pages when reclaim is lowmem

    When pages are skipped, they are immediately added back onto the LRU
    list. If lowmem reclaim persisted for long periods of time, the same
    highmem pages get continually scanned. The idea would be that lowmem
    keeps those pages on a separate list until a reclaim for highmem pages
    arrives that splices the highmem pages back onto the LRU. It potentially
    could be implemented similar to the UNEVICTABLE list.

    That would reduce the skip rate, with the potential corner case being
    that highmem pages have to be scanned and reclaimed to free lowmem slab
    pages.

    2. Linear scan lowmem pages if the initial LRU shrink fails

    This will break LRU ordering but may be preferable and faster during
    memory pressure than skipping LRU pages.

    Link: http://lkml.kernel.org/r/1467970510-21195-4-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Johannes Weiner
    Acked-by: Vlastimil Babka
    Cc: Hillf Danton
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

21 May, 2016

1 commit

  • wait_iff_congested has been used to throttle the allocator before it
    retried another round of direct reclaim to allow the writeback to make
    some progress and prevent reclaim from looping over dirty/writeback
    pages without making any progress.

    We used to do congestion_wait before commit 0e093d99763e ("writeback: do
    not sleep on the congestion queue if there are no congested BDIs or if
    significant congestion is not being encountered in the current zone")
    but that led to undesirable stalls and sleeping for the full timeout
    even when the BDI wasn't congested. Hence wait_iff_congested was used
    instead.

    But it seems that even wait_iff_congested doesn't work as expected. We
    might have a small file LRU list with all pages dirty/writeback and yet
    the bdi is not congested, so this is just a cond_resched in the end and
    can end up triggering premature OOM.

    This patch replaces the unconditional wait_iff_congested by
    congestion_wait which is executed only if we _know_ that the last round
    of direct reclaim didn't make any progress and dirty+writeback pages are
    more than a half of the reclaimable pages on the zone which might be
    usable for our target allocation. This shouldn't reintroduce stalls
    fixed by 0e093d99763e because congestion_wait is called only when we are
    getting hopeless when sleeping is a better choice than OOM with many
    pages under IO.

    We have to preserve logic introduced by commit 373ccbe59270 ("mm,
    vmstat: allow WQ concurrency to discover memory reclaim doesn't make any
    progress") into the __alloc_pages_slowpath now that wait_iff_congested
    is not used anymore. As the only remaining user of wait_iff_congested
    is shrink_inactive_list, we can remove the WQ specific short sleep from
    wait_iff_congested because the sleep needs to be done only once in
    the allocation retry cycle.

    [mhocko@suse.com: high_zoneidx->ac_classzone_idx to evaluate memory reserves properly]
    Link: http://lkml.kernel.org/r/1463051677-29418-2-git-send-email-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: Hillf Danton
    Cc: David Rientjes
    Cc: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: Mel Gorman
    Cc: Tetsuo Handa
    Cc: Vladimir Davydov
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

01 Apr, 2016

1 commit


18 Mar, 2016

1 commit

  • Most of the mm subsystem uses pr_<level> so make it consistent (see the
    sketch at the end of this entry).

    Miscellanea:

    - Realign arguments
    - Add missing newline to format
    - kmemleak-test.c has a "kmemleak: " prefix added to the
    "Kmemleak testing" logging message via pr_fmt

    Signed-off-by: Joe Perches
    Acked-by: Tejun Heo [percpu]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Perches
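    The pr_fmt mechanism mentioned for kmemleak-test.c works roughly like
    this (function name and placement are illustrative):

    /* defined before any includes so printk.h picks it up for every pr_*() */
    #define pr_fmt(fmt) "kmemleak: " fmt

    #include <linux/printk.h>

    static void kmemleak_test_banner(void)
    {
            pr_info("Kmemleak testing\n");  /* logs "kmemleak: Kmemleak testing" */
    }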
     

12 Feb, 2016

1 commit

  • We need to use post-decrement to get percpu_counter_destroy() called on
    &wb->stat[0]. Moreover, the pre-decrement would cause infinite
    out-of-bounds accesses if the setup code failed at i==0 (see the sketch
    at the end of this entry).

    Signed-off-by: Rasmus Villemoes
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rasmus Villemoes
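    For illustration (the wrapper function is invented; upstream this is the
    error path inside wb_init()):

    static void wb_destroy_stats_up_to(struct bdi_writeback *wb, int i)
    {
            /* i is the index at which percpu_counter_init() failed */
            while (i--)             /* post-decrement: also reaches stat[0] */
                    percpu_counter_destroy(&wb->stat[i]);
            /* the old "while (--i)" skipped stat[0], and for i == 0 it
             * wrapped around and walked off the end of the array */
    }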
     

06 Feb, 2016

1 commit

  • Jan Stancek has reported that the system occasionally hangs after the
    "oom01" testcase from LTP triggers OOM. Guessing from the result that
    there is a kworker thread doing memory allocation and that the values
    between "Node 0 Normal free:" and "Node 0 Normal:" differ when hanging,
    vmstat is not up-to-date for some reason.

    According to commit 373ccbe59270 ("mm, vmstat: allow WQ concurrency to
    discover memory reclaim doesn't make any progress"), it was meant to
    force the kworker thread to take a short sleep, but by error it used
    schedule_timeout(1). We missed that schedule_timeout() in state
    TASK_RUNNING doesn't do anything.

    Fix it by using schedule_timeout_uninterruptible(1) (see the sketch at
    the end of this entry), which forces the kworker thread to take a short
    sleep in order to make sure that vmstat is up-to-date.

    Fixes: 373ccbe59270 ("mm, vmstat: allow WQ concurrency to discover memory reclaim doesn't make any progress")
    Signed-off-by: Tetsuo Handa
    Reported-by: Jan Stancek
    Acked-by: Michal Hocko
    Cc: Tejun Heo
    Cc: Cristopher Lameter
    Cc: Joonsoo Kim
    Cc: Arkadiusz Miskiewicz
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tetsuo Handa
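    The change itself is a one-liner, roughly:

    /* before (buggy): the task is TASK_RUNNING, so this returns immediately
     * and no sleep ever happens */
    schedule_timeout(1);

    /* after: set TASK_UNINTERRUPTIBLE first so a real one-jiffy sleep
     * occurs and the WQ concurrency logic can see the worker sleeping */
    schedule_timeout_uninterruptible(1);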
     

15 Jan, 2016

1 commit

  • A later patch will need this symbol in files other than memcontrol.c, so
    export it now and replace mem_cgroup_root_css at the same time.

    Signed-off-by: Johannes Weiner
    Acked-by: Michal Hocko
    Acked-by: David S. Miller
    Reviewed-by: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

13 Dec, 2015

1 commit

  • Tetsuo Handa has reported that the system might basically livelock in
    OOM condition without triggering the OOM killer.

    The issue is caused by internal dependency of the direct reclaim on
    vmstat counter updates (via zone_reclaimable) which are performed from
    the workqueue context. If all the current workers get assigned to an
    allocation request, though, they will be looping inside the allocator
    trying to reclaim memory but zone_reclaimable can see stalled numbers so
    it will consider a zone reclaimable even though it has been scanned way
    too much. The WQ concurrency logic will not consider this situation as a
    congested workqueue, because it relies on the assumption that a worker
    would have to sleep in such a situation. This also means that it doesn't
    try to spawn new workers or invoke the rescuer thread if one is assigned
    to the queue.

    In order to fix this issue we need to do two things. First we have to
    let the WQ concurrency code know that we are in trouble, so we have to
    do a short sleep. In order to prevent issues handled by 0e093d99763e
    ("writeback: do not sleep on the congestion queue if there are no
    congested BDIs or if significant congestion is not being encountered in
    the current zone") we limit the sleep only to worker threads, which are
    the ones of interest anyway.

    The second thing to do is to create a dedicated workqueue for vmstat and
    mark it WQ_MEM_RECLAIM to note that it participates in the reclaim and
    to have a spare worker thread for it (see the sketch at the end of this
    entry).

    Signed-off-by: Michal Hocko
    Reported-by: Tetsuo Handa
    Cc: Tejun Heo
    Cc: Cristopher Lameter
    Cc: Joonsoo Kim
    Cc: Arkadiusz Miskiewicz
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
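    A sketch of the dedicated vmstat workqueue (flags abbreviated, init-hook
    name illustrative):

    static struct workqueue_struct *vmstat_wq;

    static int __init vmstat_wq_init(void)
    {
            /* WQ_MEM_RECLAIM guarantees a rescuer thread, so vmstat updates
             * can make progress even when all kworkers are stuck in reclaim */
            vmstat_wq = alloc_workqueue("vmstat", WQ_FREEZABLE | WQ_MEM_RECLAIM, 0);
            if (!vmstat_wq)
                    return -ENOMEM;
            return 0;
    }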
     

07 Nov, 2015

1 commit

  • …d avoiding waking kswapd

    __GFP_WAIT has been used to identify atomic context in callers that hold
    spinlocks or are in interrupts. They are expected to be high priority and
    have access one of two watermarks lower than "min" which can be referred
    to as the "atomic reserve". __GFP_HIGH users get access to the first
    lower watermark and can be called the "high priority reserve".

    Over time, callers had a requirement to not block when fallback options
    were available. Some have abused __GFP_WAIT, leading to a situation where
    an optimistic allocation with a fallback option can access atomic
    reserves.

    This patch uses __GFP_ATOMIC to identify callers that are truly atomic,
    cannot sleep and have no alternative. High priority users continue to use
    __GFP_HIGH. __GFP_DIRECT_RECLAIM identifies callers that can sleep and
    are willing to enter direct reclaim. __GFP_KSWAPD_RECLAIM identifies
    callers that want to wake kswapd for background reclaim. __GFP_WAIT is
    redefined as a caller that is willing to enter direct reclaim and wake
    kswapd for background reclaim.

    This patch then converts a number of sites

    o __GFP_ATOMIC is used by callers that are high priority and have memory
    pools for those requests. GFP_ATOMIC uses this flag.

    o Callers that have a limited mempool to guarantee forward progress clear
    __GFP_DIRECT_RECLAIM but keep __GFP_KSWAPD_RECLAIM. bio allocations fall
    into this category where kswapd will still be woken but atomic reserves
    are not used as there is a one-entry mempool to guarantee progress.

    o Callers that are checking if they are non-blocking should use the
    helper gfpflags_allow_blocking() where possible (see the sketch at the
    end of this entry). This is because checking for __GFP_WAIT as was done
    historically can now trigger false positives. Some exceptions like
    dm-crypt.c exist where the code intent is clearer if __GFP_DIRECT_RECLAIM
    is used instead of the helper due to flag manipulations.

    o Callers that built their own GFP flags instead of starting with GFP_KERNEL
    and friends now also need to specify __GFP_KSWAPD_RECLAIM.

    The first key hazard to watch out for is callers that removed __GFP_WAIT
    and were depending on access to atomic reserves for inconspicuous
    reasons. In some cases it may be appropriate for them to use __GFP_HIGH.

    The second key hazard is callers that assembled their own combination of
    GFP flags instead of starting with something like GFP_KERNEL. They may
    now wish to specify __GFP_KSWAPD_RECLAIM. It's almost certainly harmless
    if it's missed in most cases as other activity will wake kswapd.

    Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Acked-by: Michal Hocko <mhocko@suse.com>
    Acked-by: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Christoph Lameter <cl@linux.com>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Vitaly Wool <vitalywool@gmail.com>
    Cc: Rik van Riel <riel@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

    Mel Gorman
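    A sketch of the conversion pattern for callers that check whether they
    may block (function and variable names are illustrative):

    static void example_path(gfp_t gfp_mask)
    {
            /* historically: if (!(gfp_mask & __GFP_WAIT)) ... */
            if (!gfpflags_allow_blocking(gfp_mask)) {
                    /* cannot sleep: take the non-blocking fallback */
                    return;
            }
            /* may enter direct reclaim / sleep from here on */
            cond_resched();
    }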
     

21 Oct, 2015

1 commit

  • a20135ffbc44 ("writeback: don't drain bdi_writeback_congested on bdi
    destruction") added rbtree_postorder_for_each_entry_safe() which is
    used to remove all entries; however, according to Cody, the iterator
    isn't safe against operations which may rebalance the tree. Fix it by
    switching to repeatedly removing rb_first() until the tree is empty (see
    the sketch at the end of this entry).

    Signed-off-by: Tejun Heo
    Reported-by: Cody P Schafer
    Fixes: a20135ffbc44 ("writeback: don't drain bdi_writeback_congested on bdi destruction")
    Link: http://lkml.kernel.org/g/1443997973-1700-1-git-send-email-dev@codyps.com
    Signed-off-by: Jens Axboe

    Tejun Heo
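    Inside cgwb_bdi_destroy() the loop becomes, roughly:

    struct rb_node *rbn;

    /* erasing the leftmost node is safe however rb_erase() rebalances */
    while ((rbn = rb_first(&bdi->cgwb_congested_tree))) {
            struct bdi_writeback_congested *congested =
                    rb_entry(rbn, struct bdi_writeback_congested, rb_node);

            rb_erase(rbn, &bdi->cgwb_congested_tree);
            congested->bdi = NULL;  /* mark it no longer linked to the bdi */
    }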
     

15 Oct, 2015

1 commit

  • bdi's are initialized in two steps, bdi_init() and bdi_register(), but
    destroyed in a single step by bdi_destroy() which, for a bdi embedded
    in a request_queue, is called during blk_cleanup_queue() which makes
    the queue invisible and starts the draining of remaining usages.

    A request_queue's user can access the congestion state of the embedded
    bdi as long as it holds a reference to the queue. As such, it may
    access the congested state of a queue which finished
    blk_cleanup_queue() but hasn't reached blk_release_queue() yet.
    Because the congested state was embedded in backing_dev_info which in
    turn is embedded in request_queue, accessing the congested state after
    bdi_destroy() was called was fine. The bdi was destroyed but the
    memory region for the congested state remained accessible till the
    queue got released.

    a13f35e87140 ("writeback: don't embed root bdi_writeback_congested in
    bdi_writeback") changed the situation. Now, the root congested state
    which is expected to be pinned while request_queue remains accessible
    is separately reference counted and the base ref is put during
    bdi_destroy(). This means that the root congested state may go away
    prematurely while the queue is between bdi_destroy() and
    blk_cleanup_queue(), which was detected by Andrey's KASAN tests.

    The root cause of this problem is that bdi doesn't distinguish the two
    steps of destruction, unregistration and release, and now the root
    congested state actually requires a separate release step. To fix the
    issue, this patch separates out bdi_unregister() and bdi_exit() from
    bdi_destroy(). bdi_unregister() is called from blk_cleanup_queue()
    and bdi_exit() from blk_release_queue(). bdi_destroy() is now just a
    simple wrapper calling the two steps back-to-back (see the sketch at the
    end of this entry).

    While at it, the prototype of bdi_destroy() is moved right below
    bdi_setup_and_register() so that the counterpart operations are
    located together.

    Signed-off-by: Tejun Heo
    Fixes: a13f35e87140 ("writeback: don't embed root bdi_writeback_congested in bdi_writeback")
    Cc: stable@vger.kernel.org # v4.2+
    Reported-and-tested-by: Andrey Konovalov
    Link: http://lkml.kernel.org/g/CAAeHK+zUJ74Zn17=rOyxacHU18SgCfC6bsYW=6kCY5GXJBwGfQ@mail.gmail.com
    Reviewed-by: Jan Kara
    Reviewed-by: Jeff Moyer
    Signed-off-by: Jens Axboe

    Tejun Heo
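    The resulting wrapper, as described above, is essentially:

    void bdi_destroy(struct backing_dev_info *bdi)
    {
            bdi_unregister(bdi);    /* unregistration: detach visible state */
            bdi_exit(bdi);          /* release: drop the root congested base ref */
    }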
     

13 Oct, 2015

1 commit

  • bdi_for_each_wb() is used in several places to wake up or issue
    writeback work items to all wb's (bdi_writeback's) on a given bdi.
    The iteration is performed by walking bdi->cgwb_tree; however, the
    tree only indexes wb's which are currently active.

    For example, when a memcg gets associated with a different blkcg, the
    old wb is removed from the tree so that the new one can be indexed.
    The old wb starts dying from then on but will linger till all its
    inodes are drained. As these dying wb's may still host dirty inodes,
    writeback operations which affect all wb's must include them.
    bdi_for_each_wb() skipping dying wb's led to sync(2) missing and
    failing to sync the inodes belonging to those wb's.

    This patch adds an RCU protected @bdi->wb_list which lists all wb's
    belonging to that bdi (see the sketch at the end of this entry). wb's are
    added on creation and removed on release rather than on the start of
    destruction. bdi_for_each_wb() usages are replaced with
    list_for_each[_continue]_rcu() iterations over @bdi->wb_list and
    bdi_for_each_wb() and its helpers are removed.

    v2: Updated as per Jan. last_wb ref leak in bdi_split_work_to_wbs()
    fixed and unnecessary list head severing in cgwb_bdi_destroy()
    removed.

    Signed-off-by: Tejun Heo
    Reported-and-tested-by: Artem Bityutskiy
    Fixes: ebe41ab0c79d ("writeback: implement bdi_for_each_wb()")
    Link: http://lkml.kernel.org/g/1443012552.19983.209.camel@gmail.com
    Cc: Jan Kara
    Signed-off-by: Jens Axboe

    Tejun Heo
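    A hedged sketch of walking the new list (simplified; callers that need
    to sleep pin the wb before leaving the RCU section):

    struct bdi_writeback *wb;

    rcu_read_lock();
    list_for_each_entry_rcu(wb, &bdi->wb_list, bdi_node) {
            /* wake up or queue writeback work for @wb, dying or not */
    }
    rcu_read_unlock();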
     

11 Sep, 2015

1 commit

  • Pull blk-cg updates from Jens Axboe:
    "A bit later in the cycle, but this has been in the block tree for a a
    while. This is basically four patchsets from Tejun, that improve our
    buffered cgroup writeback. It was dependent on the other cgroup
    changes, but they went in earlier in this cycle.

    Series 1 is set of 5 patches that has cgroup writeback updates:

    - bdi_writeback iteration fix which could lead to some wb's being
    skipped or repeated during e.g. sync under memory pressure.

    - Simplification of wb work wait mechanism.

    - Writeback tracepoints updated to report cgroup.

    Series 2 is a set of updates for the CFQ cgroup writeback handling:

    cfq has always charged all async IOs to the root cgroup. It didn't
    have much choice as writeback didn't know about cgroups and there
    was no way to tell who to blame for a given writeback IO.
    writeback finally grew support for cgroups and now tags each
    writeback IO with the appropriate cgroup to charge it against.

    This patchset updates cfq so that it follows the blkcg each bio is
    tagged with. Async cfq_queues are now shared across cfq_group,
    which is per-cgroup, instead of per-request_queue cfq_data. This
    makes all IOs follow the weight based IO resource distribution
    implemented by cfq.

    - Switched from GFP_ATOMIC to GFP_NOWAIT as suggested by Jeff.

    - Other misc review points addressed, acks added and rebased.

    Series 3 is the blkcg policy cleanup patches:

    This patchset contains assorted cleanups for blkcg_policy methods
    and blk[c]g_policy_data handling.

    - alloc/free added for blkg_policy_data. exit dropped.

    - alloc/free added for blkcg_policy_data.

    - blk-throttle's async percpu allocation is replaced with direct
    allocation.

    - all methods now take blk[c]g_policy_data instead of blkcg_gq or
    blkcg.

    And finally, series 4 is a set of patches cleaning up the blkcg stats
    handling:

    blkcg's stats have always been somewhat of a mess. This patchset
    tries to improve the situation a bit.

    - The following patches were added to consolidate the blkcg entry point
    and blkg creation. This in itself is an improvement and helps
    collecting common stats on bio issue.

    - per-blkg stats now accounted on bio issue rather than request
    completion so that bio based and request based drivers can behave
    the same way. The issue was spotted by Vivek.

    - cfq-iosched implements custom recursive stats and blk-throttle
    implements custom per-cpu stats. This patchset make blkcg core
    support both by default.

    - cfq-iosched and blk-throttle keep track of the same stats
    multiple times. Unify them"

    * 'for-4.3/blkcg' of git://git.kernel.dk/linux-block: (45 commits)
    blkcg: use CGROUP_WEIGHT_* scale for io.weight on the unified hierarchy
    blkcg: s/CFQ_WEIGHT_*/CFQ_WEIGHT_LEGACY_*/
    blkcg: implement interface for the unified hierarchy
    blkcg: misc preparations for unified hierarchy interface
    blkcg: separate out tg_conf_updated() from tg_set_conf()
    blkcg: move body parsing from blkg_conf_prep() to its callers
    blkcg: mark existing cftypes as legacy
    blkcg: rename subsystem name from blkio to io
    blkcg: refine error codes returned during blkcg configuration
    blkcg: remove unnecessary NULL checks from __cfqg_set_weight_device()
    blkcg: reduce stack usage of blkg_rwstat_recursive_sum()
    blkcg: remove cfqg_stats->sectors
    blkcg: move io_service_bytes and io_serviced stats into blkcg_gq
    blkcg: make blkg_[rw]stat_recursive_sum() to be able to index into blkcg_gq
    blkcg: make blkcg_[rw]stat per-cpu
    blkcg: add blkg_[rw]stat->aux_cnt and replace cfq_group->dead_stats with it
    blkcg: consolidate blkg creation in blkcg_bio_issue_check()
    blk-throttle: improve queue bypass handling
    blkcg: move root blkg lookup optimization from throtl_lookup_tg() to __blkg_lookup()
    blkcg: inline [__]blkg_lookup()
    ...

    Linus Torvalds
     

19 Aug, 2015

1 commit

  • blkio interface has become messy over time and is currently the
    largest. In addition to the inconsistent naming scheme, it has
    multiple stat files which report more or less the same thing, a number
    of debug stat files which expose internal details which shouldn't have
    been part of the public interface in the first place, recursive and
    non-recursive stats and leaf and non-leaf knobs.

    Both recursive vs. non-recursive and leaf vs. non-leaf distinctions
    don't make any sense on the unified hierarchy as only leaf cgroups can
    contain processes. cgroups is going through a major interface
    revision with the unified hierarchy involving significant fundamental
    usage changes and given that a significant portion of the interface
    doesn't make sense anymore, it's a good time to reorganize the
    interface.

    As the first step, this patch renames the external visible subsystem
    name from "blkio" to "io". This is more concise, matches the other
    two major subsystem names, "cpu" and "memory", and better suited as
    blkcg will be involved in anything writeback related too whether an
    actual block device is involved or not.

    As the subsystem legacy_name is set to "blkio", the only userland
    visible change outside the unified hierarchy is that blkcg is reported
    as "io" instead of "blkio" in the subsystem initialized message during
    boot. On the unified hierarchy, blkcg now appears as "io".

    Signed-off-by: Tejun Heo
    Cc: Li Zefan
    Cc: Johannes Weiner
    Cc: cgroups@vger.kernel.org
    Signed-off-by: Jens Axboe

    Tejun Heo
     

18 Aug, 2015

1 commit

  • There's a small consistency problem between the inode and writeback
    naming. Writeback calls the "for IO" inode queues b_io and
    b_more_io, but the inode calls these the "writeback list" or
    i_wb_list. This makes it hard to add a new "under writeback" list to
    the inode, or call it an "under IO" list on the bdi, because either
    way we'll have writeback on IO and IO on writeback and it'll just be
    confusing. I'm getting confused just writing this!

    So, rename the inode "for IO" list variable to i_io_list so we can
    add a new "writeback list" in a subsequent patch.

    Signed-off-by: Dave Chinner
    Signed-off-by: Josef Bacik
    Reviewed-by: Jan Kara
    Reviewed-by: Christoph Hellwig
    Tested-by: Dave Chinner

    Dave Chinner
     

02 Jul, 2015

2 commits

  • 52ebea749aae ("writeback: make backing_dev_info host cgroup-specific
    bdi_writebacks") made bdi (backing_dev_info) host per-cgroup wb's
    (bdi_writeback's). As the congested state needs to be per-wb and
    referenced from blkcg side and multiple wbs, the patch made all
    non-root cong's (bdi_writeback_congested's) reference counted and
    indexed on bdi.

    When a bdi is destroyed, cgwb_bdi_destroy() tries to drain all
    non-root cong's; however, this can hang indefinitely because cong's can
    also be referenced from blkcg_gq's which are destroyed after bdi
    destruction is complete.

    This patch fixes the bug by updating bdi destruction to not wait for
    cong's to drain. A cong is unlinked from bdi->cgwb_congested_tree on
    bdi destruction regardless of its reference count as the bdi may go
    away any point after destruction. wb_congested_put() checks whether
    the cong is already unlinked on release.

    Signed-off-by: Tejun Heo
    Reported-by: Jon Christopherson
    Link: https://bugzilla.kernel.org/show_bug.cgi?id=100681
    Fixes: 52ebea749aae ("writeback: make backing_dev_info host cgroup-specific bdi_writebacks")
    Tested-by: Jon Christopherson
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • 52ebea749aae ("writeback: make backing_dev_info host cgroup-specific
    bdi_writebacks") made bdi (backing_dev_info) host per-cgroup wb's
    (bdi_writeback's). As the congested state needs to be per-wb and
    referenced from blkcg side and multiple wbs, the patch made all
    non-root cong's (bdi_writeback_congested's) reference counted and
    indexed on bdi.

    When a bdi is destroyed, cgwb_bdi_destroy() tries to drain all
    non-root cong's; however, this can hang indefinitely because cong's can
    also be referenced from blkcg_gq's which are destroyed after bdi
    destruction is complete.

    To fix the bug, bdi destruction will be updated to not wait for cong's
    to drain, which naturally means that cong's may outlive the associated
    bdi. This is fine for non-root cong's but is problematic for the root
    cong's which are embedded in their bdi's as they may end up getting
    dereferenced after the containing bdi's are freed.

    This patch makes root cong's behave the same as non-root cong's. They
    are no longer embedded in their bdi's but allocated separately during
    bdi initialization, indexed and reference counted the same way.

    * As cong handling is the same for all wb's, wb->congested
    initialization is moved into wb_init().

    * When !CONFIG_CGROUP_WRITEBACK, there was no indexing or refcnting.
    bdi->wb_congested is now a pointer pointing to the root cong
    allocated during bdi init and minimal refcnting operations are
    implemented.

    * The above makes root wb init paths diverge depending on
    CONFIG_CGROUP_WRITEBACK. root wb init is moved to cgwb_bdi_init().

    This patch in itself shouldn't cause any consequential behavior
    differences but prepares for the actual fix.

    Signed-off-by: Tejun Heo
    Reported-by: Jon Christopherson
    Link: https://bugzilla.kernel.org/show_bug.cgi?id=100681
    Tested-by: Jon Christopherson

    Added include to backing-dev.h for kfree() definition.

    Signed-off-by: Jens Axboe

    Tejun Heo
     

26 Jun, 2015

1 commit

  • Pull cgroup writeback support from Jens Axboe:
    "This is the big pull request for adding cgroup writeback support.

    This code has been in development for a long time, and it has been
    simmering in for-next for a good chunk of this cycle too. This is one
    of those problems that has been talked about for at least half a
    decade, finally there's a solution and code to go with it.

    Also see last weeks writeup on LWN:

    http://lwn.net/Articles/648292/"

    * 'for-4.2/writeback' of git://git.kernel.dk/linux-block: (85 commits)
    writeback, blkio: add documentation for cgroup writeback support
    vfs, writeback: replace FS_CGROUP_WRITEBACK with SB_I_CGROUPWB
    writeback: do foreign inode detection iff cgroup writeback is enabled
    v9fs: fix error handling in v9fs_session_init()
    bdi: fix wrong error return value in cgwb_create()
    buffer: remove unusued 'ret' variable
    writeback: disassociate inodes from dying bdi_writebacks
    writeback: implement foreign cgroup inode bdi_writeback switching
    writeback: add lockdep annotation to inode_to_wb()
    writeback: use unlocked_inode_to_wb transaction in inode_congested()
    writeback: implement unlocked_inode_to_wb transaction and use it for stat updates
    writeback: implement [locked_]inode_to_wb_and_lock_list()
    writeback: implement foreign cgroup inode detection
    writeback: make writeback_control track the inode being written back
    writeback: relocate wb[_try]_get(), wb_put(), inode_{attach|detach}_wb()
    mm: vmscan: disable memcg direct reclaim stalling if cgroup writeback support is in use
    writeback: implement memcg writeback domain based throttling
    writeback: reset wb_domain->dirty_limit[_tstmp] when memcg domain size changes
    writeback: implement memcg wb_domain
    writeback: update wb_over_bg_thresh() to use wb_domain aware operations
    ...

    Linus Torvalds