25 Sep, 2020

4 commits

  • Replace the two negative flags that are always used together with a
    single positive flag that indicates the writeback capability instead
    of two related non-capabilities. Also remove the pointless wrappers
    that just check the flag.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Jan Kara
    Reviewed-by: Johannes Thumshirn
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • Replace BDI_CAP_NO_ACCT_WB with a positive BDI_CAP_WRITEBACK_ACCT to
    make the checks more obvious. Also remove the pointless
    bdi_cap_account_writeback wrapper that just obfuscates the check.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Jan Kara
    Reviewed-by: Johannes Thumshirn
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • BDI_CAP_STABLE_WRITES is one of the few bits of information in the
    backing_dev_info shared between the block drivers and the writeback code.
    To help untangle the dependency, replace it with a queue flag and a
    superblock flag derived from it. This also helps with the case of e.g.
    a file system requiring stable writes due to its own checksumming, but
    not forcing it on other users of the block device like the swap code.

    One downside is that we can't support the stable_pages_required bdi
    attribute in sysfs anymore. It is replaced with a queue attribute, which
    is also writable for easier testing.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Jan Kara
    Reviewed-by: Johannes Thumshirn
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • Set up a readahead size by default, as very few users have a good
    reason to change it. This means coda, ecryptfs, and orangefs now
    set up the values where they were previously missing, while ubifs,
    mtd and vboxsf explicitly set it to 0 to avoid readahead.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Jan Kara
    Acked-by: David Sterba [btrfs]
    Acked-by: Richard Weinberger [ubifs, mtd]
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

09 Jul, 2020

2 commits


10 May, 2020

5 commits

  • The name is only printed for a bdi that is not registered in writeback.
    Use the device name there instead, as it is more useful anyway for the
    unlikely case that the warning triggers.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Jan Kara
    Reviewed-by: Greg Kroah-Hartman
    Reviewed-by: Bart Van Assche
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • Merge the _node vs normal version and drop the superfluous gfp_t argument.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Jan Kara
    Reviewed-by: Greg Kroah-Hartman
    Reviewed-by: Bart Van Assche
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • Split out a new bdi_set_owner helper to set the owner, and move the policy
    for creating the bdi name back into genhd.c, where it belongs.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Jan Kara
    Reviewed-by: Greg Kroah-Hartman
    Reviewed-by: Bart Van Assche
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • bdi_register_va is only used by super.c, which can't be modular.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Jan Kara
    Reviewed-by: Greg Kroah-Hartman
    Reviewed-by: Bart Van Assche
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • Cache a copy of the name for the life time of the backing_dev_info
    structure so that we can reference it even after unregistering.

    Fixes: 68f23b89067f ("memcg: fix a crash in wb_workfn when a device disappears")
    Reported-by: Yufen Yu
    Signed-off-by: Christoph Hellwig
    Reviewed-by: Jan Kara
    Reviewed-by: Bart Van Assche
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

07 May, 2020

1 commit

  • bdi_dev_name is not a fast path function, move it out of line. This
    prepares for using it from modular callers without having to export
    an implementation detail like bdi_unknown_name.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Jan Kara
    Reviewed-by: Greg Kroah-Hartman
    Reviewed-by: Bart Van Assche
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

02 Apr, 2020

1 commit

  • blkcg->cgwb_refcnt is used to delay blkcg offlining so that blkgs
    don't get offlined while there are active cgwbs on them. However, it
    ends up making offlining unordered, sometimes causing parents to be
    offlined before children.

    To fix it, we want child blkcgs to pin the parents' online states,
    turning the refcnt into a more generic online pinning mechanism.

    In preparation:

    * blkcg->cgwb_refcnt -> blkcg->online_pin
    * blkcg_cgwb_get/put() -> blkcg_pin/unpin_online()
    * Take them out of CONFIG_CGROUP_WRITEBACK

    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
     

01 Feb, 2020

1 commit

  • Without memcg, there is a one-to-one mapping between the bdi and
    bdi_writeback structures. In this world, things are fairly
    straightforward; the first thing bdi_unregister() does is to shut down
    the bdi_writeback structure (or wb), and part of that shutdown ensures
    that no other work is queued against the wb, and that the wb is fully
    drained.

    With memcg, however, there is a one-to-many relationship between the bdi
    and bdi_writeback structures; that is, there are multiple wb objects
    which can all point to a single bdi. There is a refcount which prevents
    the bdi object from being released (and hence, unregistered). So in
    theory, the bdi_unregister() *should* only get called once its refcount
    goes to zero (bdi_put will drop the refcount, and when it is zero,
    release_bdi gets called, which calls bdi_unregister).

    Unfortunately, del_gendisk() in block/genhd.c never got the memo about
    the Brave New memcg World, and calls bdi_unregister directly. It does
    this without informing the file system, or the memcg code, or anything
    else. This causes the root wb associated with the bdi to be
    unregistered, but none of the memcg-specific wbs are shut down. So when
    one of these wbs is woken up to do delayed work, it tries to
    dereference its wb->bdi->dev to fetch the device name, but
    unfortunately bdi->dev is now NULL, thanks to the bdi_unregister()
    called by del_gendisk(). As a result, *boom*.

    Fortunately, it looks like the rest of the writeback path is perfectly
    happy with bdi->dev and bdi->owner being NULL, so the simplest fix is to
    create a bdi_dev_name() function which can handle bdi->dev being NULL.
    This also allows us to bulletproof the writeback tracepoints to prevent
    them from dereferencing a NULL pointer and crashing the kernel if one is
    tracing with memcg's enabled, and an iSCSI device dies or a USB storage
    stick is pulled.

    The most common way of triggering this will be hotremoval of a device
    while writeback with memcg enabled is going on. It was triggering
    several times a day in a heavily loaded production environment.

    Google Bug Id: 145475544

    Link: https://lore.kernel.org/r/20191227194829.150110-1-tytso@mit.edu
    Link: http://lkml.kernel.org/r/20191228005211.163952-1-tytso@mit.edu
    Signed-off-by: Theodore Ts'o
    Cc: Chris Mason
    Cc: Tejun Heo
    Cc: Jens Axboe
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Theodore Ts'o
     

06 Oct, 2019

1 commit

  • A removable block device, such as an NVMe drive or an SSD connected over
    Thunderbolt, can be hot-removed at any time, including while the system
    is suspended. When the device is hot-removed during suspend and the
    system gets resumed, the kernel first resumes devices and then thaws
    userspace, including freezable workqueues. What happens in that case is
    that the NVMe driver notices that the device is unplugged and removes it
    from the system. This ends up calling bdi_unregister() for the gendisk,
    which then schedules wb_workfn() to be run one more time.

    However, since bdi_wq is still frozen, the flush_delayed_work() call in
    wb_shutdown() blocks forever, halting the system resume process. The
    user sees this as a hang, as nothing happens anymore.

    Triggering sysrq-w reveals this:

    Workqueue: nvme-wq nvme_remove_dead_ctrl_work [nvme]
    Call Trace:
    ? __schedule+0x2c5/0x630
    ? wait_for_completion+0xa4/0x120
    schedule+0x3e/0xc0
    schedule_timeout+0x1c9/0x320
    ? resched_curr+0x1f/0xd0
    ? wait_for_completion+0xa4/0x120
    wait_for_completion+0xc3/0x120
    ? wake_up_q+0x60/0x60
    __flush_work+0x131/0x1e0
    ? flush_workqueue_prep_pwqs+0x130/0x130
    bdi_unregister+0xb9/0x130
    del_gendisk+0x2d2/0x2e0
    nvme_ns_remove+0xed/0x110 [nvme_core]
    nvme_remove_namespaces+0x96/0xd0 [nvme_core]
    nvme_remove+0x5b/0x160 [nvme]
    pci_device_remove+0x36/0x90
    device_release_driver_internal+0xdf/0x1c0
    nvme_remove_dead_ctrl_work+0x14/0x30 [nvme]
    process_one_work+0x1c2/0x3f0
    worker_thread+0x48/0x3e0
    kthread+0x100/0x140
    ? current_work+0x30/0x30
    ? kthread_park+0x80/0x80
    ret_from_fork+0x35/0x40

    This is not limited to NVMe, so the exact same issue can be reproduced
    by hot-removing an SSD (over Thunderbolt) while the system is suspended.

    Prevent this from happening by removing WQ_FREEZABLE from bdi_wq.

    Reported-by: AceLan Kao
    Link: https://marc.info/?l=linux-kernel&m=138695698516487
    Link: https://bugzilla.kernel.org/show_bug.cgi?id=204385
    Link: https://lore.kernel.org/lkml/20191002122136.GD2819@lahna.fi.intel.com/#t
    Acked-by: Rafael J. Wysocki
    Signed-off-by: Mika Westerberg
    Signed-off-by: Jens Axboe

    Mika Westerberg
     

27 Aug, 2019

2 commits

  • Separate out wb_get_lookup(), which doesn't try to create a wb if there
    isn't already one, from wb_get_create(). This will be used by later
    patches.

    Reviewed-by: Jan Kara
    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • There currently is no way to universally identify and look up a bdi
    without holding a reference and pointer to it. This patch adds a
    non-recycling bdi->id and implements bdi_get_by_id() which looks up
    bdis by their ids. This will be used by memcg foreign inode flushing.

    I left bdi_list alone for simplicity, and because while rb_tree does
    support rcu assignment, it doesn't seem to guarantee a lossless walk
    when the walk is racing against tree rebalance operations.

    Reviewed-by: Jan Kara
    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
     

03 Jun, 2019

1 commit

  • When calling debugfs functions, there is no need to ever check the
    return value. The function can work or not, but the code logic should
    never do something different based on this.

    And as the return value does not matter at all, no need to save the
    dentry in struct backing_dev_info, so delete it.

    Cc: Andrew Morton
    Cc: Anders Roxell
    Cc: Arnd Bergmann
    Cc: Michal Hocko
    Cc: linux-mm@kvack.org
    Reviewed-by: Sebastian Andrzej Siewior
    Signed-off-by: Greg Kroah-Hartman

    Greg Kroah-Hartman
     

21 May, 2019

1 commit

  • Add SPDX license identifiers to all files which:

    - Have no license information of any form

    - Have EXPORT_.*_SYMBOL_GPL inside which was used in the
    initial scan/conversion to ignore the file

    These files fall under the project license, GPL v2 only. The resulting SPDX
    license identifier is:

    GPL-2.0-only

    Signed-off-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     

23 Jan, 2019

1 commit

  • sync_inodes_sb() can race against cgwb (cgroup writeback) membership
    switches and fail to writeback some inodes. For example, if an inode
    switches to another wb while sync_inodes_sb() is in progress, the new
    wb might not be visible to bdi_split_work_to_wbs() at all or the inode
    might jump from a wb which hasn't issued writebacks yet to one which
    already has.

    This patch adds backing_dev_info->wb_switch_rwsem to synchronize cgwb
    switch path against sync_inodes_sb() so that sync_inodes_sb() is
    guaranteed to see all the target wbs and inodes can't jump wbs to
    escape syncing.

    v2: Fixed misplaced rwsem init. Spotted by Jiufei.

    Signed-off-by: Tejun Heo
    Reported-by: Jiufei Xue
    Link: http://lkml.kernel.org/r/dc694ae2-f07f-61e1-7097-7c8411cee12d@gmail.com
    Acked-by: Jan Kara
    Signed-off-by: Jens Axboe

    Tejun Heo
     

01 Sep, 2018

1 commit

  • Currently, blkcg destruction relies on a sequence of events:
    1. Destruction starts. blkcg_css_offline() is called and blkgs
    release their reference to the blkcg. This immediately destroys
    the cgwbs (writeback).
    2. With blkgs giving up their reference, the blkcg ref count should
    become zero and eventually call blkcg_css_free() which finally
    frees the blkcg.

    Jiufei Xue reported that there is a race between blkcg_bio_issue_check()
    and cgroup_rmdir(). To remedy this, blkg destruction becomes contingent
    on the completion of all writeback associated with the blkcg. A count of
    the number of cgwbs is maintained and once that goes to zero, blkg
    destruction can follow. This should prevent premature blkg destruction
    related to writeback.

    The new process for blkcg cleanup is as follows:
    1. Destruction starts. blkcg_css_offline() is called which offlines
    writeback. Blkg destruction is delayed on the cgwb_refcnt count to
    avoid punting potentially large amounts of outstanding writeback
    to root while maintaining any ongoing policies. Here, the base
    cgwb_refcnt is put back.
    2. When the cgwb_refcnt becomes zero, blkcg_destroy_blkgs() is called
    and handles destruction of blkgs. This is where the css reference
    held by each blkg is released.
    3. Once the blkcg ref count goes to zero, blkcg_css_free() is called.
    This finally frees the blkg.

    It seems in the past blk-throttle didn't do the most understandable
    things with taking data from a blkg while associating with current. So,
    the simplification and unification of what blk-throttle is doing caused
    this.

    Fixes: 08e18eab0c579 ("block: add bi_blkg to the bio for cgroups")
    Reviewed-by: Josef Bacik
    Signed-off-by: Dennis Zhou
    Cc: Jiufei Xue
    Cc: Joseph Qi
    Cc: Tejun Heo
    Cc: Josef Bacik
    Cc: Jens Axboe
    Signed-off-by: Jens Axboe

    Dennis Zhou (Facebook)
     

23 Aug, 2018

2 commits

  • The irqsave variant of refcount_dec_and_lock handles irqsave/restore when
    taking/releasing the spin lock. With this variant, the calls to
    local_irq_save/restore are no longer required.

    [bigeasy@linutronix.de: s@atomic_dec_and_lock@refcount_dec_and_lock@g]
    Link: http://lkml.kernel.org/r/20180703200141.28415-5-bigeasy@linutronix.de
    Signed-off-by: Anna-Maria Gleixner
    Signed-off-by: Sebastian Andrzej Siewior
    Acked-by: Peter Zijlstra (Intel)
    Cc: Jens Axboe
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Anna-Maria Gleixner
     
  • refcount_t type and corresponding API should be used instead of atomic_t
    when the variable is used as a reference counter. This permits avoiding
    accidental refcounter overflows that might lead to use-after-free
    situations.

    Link: http://lkml.kernel.org/r/20180703200141.28415-4-bigeasy@linutronix.de
    Signed-off-by: Sebastian Andrzej Siewior
    Reviewed-by: Andrew Morton
    Acked-by: Peter Zijlstra (Intel)
    Suggested-by: Peter Zijlstra
    Cc: Jens Axboe
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sebastian Andrzej Siewior
     

23 Jun, 2018

1 commit

  • syzbot is reporting a NULL pointer dereference at wb_workfn() [1] due to
    wb->bdi->dev being NULL. And Dmitry confirmed that wb->state was
    WB_shutting_down after wb->bdi->dev became NULL. This indicates that
    bdi_unregister() failed to call wb_shutdown() on one of the wb objects.

    The problem is in cgwb_bdi_unregister() which does cgwb_kill() and thus
    drops bdi's reference to wb structures before going through the list of
    wbs again and calling wb_shutdown() on each of them. This way the loop
    iterating through all wbs can easily miss a wb if that wb has already
    passed through cgwb_remove_from_bdi_list() called from wb_shutdown()
    from cgwb_release_workfn(), and as a result fully shut down the bdi although
    wb_workfn() for this wb structure is still running. In fact there are
    also other ways cgwb_bdi_unregister() can race with
    cgwb_release_workfn() leading e.g. to use-after-free issues:

    CPU1                                      CPU2
    cgwb_bdi_unregister()
      cgwb_kill(*slot);
                                              cgwb_release()
                                                queue_work(cgwb_release_wq,
                                                           &wb->release_work);
                                              cgwb_release_workfn()
      wb = list_first_entry(&bdi->wb_list, ...)
      spin_unlock_irq(&cgwb_lock);
                                                wb_shutdown(wb);
                                                ...
                                                kfree_rcu(wb, rcu);
      wb_shutdown(wb); -> oops use-after-free

    We solve these issues by synchronizing writeback structure shutdown from
    cgwb_bdi_unregister() with cgwb_release_workfn() using a new mutex. That
    way we also no longer need the synchronization using WB_shutting_down, as
    the mutex provides it for the CONFIG_CGROUP_WRITEBACK case, and without
    CONFIG_CGROUP_WRITEBACK wb_shutdown() can be called only once from
    bdi_unregister().

    Reported-by: syzbot
    Acked-by: Tejun Heo
    Signed-off-by: Jan Kara
    Signed-off-by: Jens Axboe

    Jan Kara
     

08 Jun, 2018

1 commit

  • mem_cgroup_cgwb_list is a very simple wrapper and it will never be used
    outside of code under CONFIG_CGROUP_WRITEBACK, so use memcg->cgwb_list
    directly.

    Link: http://lkml.kernel.org/r/1524406173-212182-1-git-send-email-wanglong19@meituan.com
    Signed-off-by: Wang Long
    Reviewed-by: Jan Kara
    Acked-by: Tejun Heo
    Acked-by: Michal Hocko
    Reviewed-by: Andrew Morton
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wang Long
     

24 May, 2018

1 commit

  • cgwb_release() punts the actual release to cgwb_release_workfn() on
    system_wq. Depending on the number of cgroups or block devices, there
    can be a lot of cgwb_release_workfn() in flight at the same time.

    We're periodically seeing close to 256 kworkers getting stuck with the
    following stack trace, and over time the entire system gets stuck.

    [] _synchronize_rcu_expedited.constprop.72+0x2fc/0x330
    [] synchronize_rcu_expedited+0x24/0x30
    [] bdi_unregister+0x53/0x290
    [] release_bdi+0x89/0xc0
    [] wb_exit+0x85/0xa0
    [] cgwb_release_workfn+0x54/0xb0
    [] process_one_work+0x150/0x410
    [] worker_thread+0x6d/0x520
    [] kthread+0x12c/0x160
    [] ret_from_fork+0x29/0x40
    [] 0xffffffffffffffff

    The events leading to the lockup are...

    1. A lot of cgwb_release_workfn() is queued at the same time and all
    system_wq kworkers are assigned to execute them.

    2. They all end up calling synchronize_rcu_expedited(). One of them
    wins and tries to perform the expedited synchronization.

    3. However, that involves queueing rcu_exp_work to system_wq and
    waiting for it. Because #1 is holding all available kworkers on
    system_wq, rcu_exp_work can't be executed. cgwb_release_workfn()
    is waiting for synchronize_rcu_expedited() which in turn is waiting
    for cgwb_release_workfn() to free up some of the kworkers.

    We shouldn't be scheduling hundreds of cgwb_release_workfn() at the
    same time. There's nothing to be gained from that. This patch
    updates cgwb release path to use a dedicated percpu workqueue with
    @max_active of 1.

    While this resolves the problem at hand, it might be a good idea to
    isolate rcu_exp_work to its own workqueue too as it can be used from
    various paths and is prone to this sort of indirect A-A deadlocks.

    Signed-off-by: Tejun Heo
    Cc: "Paul E. McKenney"
    Cc: stable@vger.kernel.org
    Signed-off-by: Jens Axboe

    Tejun Heo
     

03 May, 2018

2 commits

  • syzbot is reporting use after free bug in debugfs_remove() [1].

    This is because fault injection made the memory allocation for
    debugfs_create_file() from bdi_debug_register() from bdi_register_va()
    fail, yet we continued with setting WB_registered. But when
    debugfs_remove(bdi->debug_dir) is called from bdi_debug_unregister()
    from bdi_unregister() from release_bdi() (because WB_registered was set
    by bdi_register_va()), IS_ERR_OR_NULL(bdi->debug_dir) == false despite
    debugfs_remove(bdi->debug_dir) having already been called from
    bdi_register_va().

    Fix this by making IS_ERR_OR_NULL(bdi->debug_dir) == true.

    [1] https://syzkaller.appspot.com/bug?id=5ab4efd91a96dcea9b68104f159adf4af2a6dfc1

    Signed-off-by: Tetsuo Handa
    Reported-by: syzbot
    Fixes: 97f07697932e6faf ("bdi: convert bdi_debug_register to int")
    Cc: weiping zhang
    Reviewed-by: Greg Kroah-Hartman
    Reviewed-by: Jan Kara
    Signed-off-by: Jens Axboe

    Tetsuo Handa
     
  • syzbot is reporting hung tasks at wait_on_bit(WB_shutting_down) in
    wb_shutdown() [1]. This seems to be because commit 5318ce7d46866e1d ("bdi:
    Shutdown writeback on all cgwbs in cgwb_bdi_destroy()") forgot to call
    wake_up_bit(WB_shutting_down) after clear_bit(WB_shutting_down).

    Introduce a helper function clear_and_wake_up_bit() and use it, in order
    to avoid similar errors in the future.

    [1] https://syzkaller.appspot.com/bug?id=b297474817af98d5796bc544e1bb806fc3da0e5e

    Signed-off-by: Tetsuo Handa
    Reported-by: syzbot
    Fixes: 5318ce7d46866e1d ("bdi: Shutdown writeback on all cgwbs in cgwb_bdi_destroy()")
    Cc: Tejun Heo
    Reviewed-by: Jan Kara
    Suggested-by: Linus Torvalds
    Signed-off-by: Jens Axboe

    Tetsuo Handa
     

12 Apr, 2018

1 commit

  • memcg reclaim may alter pgdat->flags based on the state of LRU lists in
    the cgroup and its children. PGDAT_WRITEBACK may force kswapd to sleep in
    congested_wait(), and PGDAT_DIRTY may force kswapd to write back filesystem
    pages. But the worst here is PGDAT_CONGESTED, since it may force all
    direct reclaims to stall in wait_iff_congested(). Note that only kswapd
    has the power to clear any of these bits. This might just never happen if
    cgroup limits are configured that way. So all direct reclaims will stall
    as long as we have some congested bdi in the system.

    Leave all pgdat->flags manipulations to kswapd. kswapd scans the whole
    pgdat, only kswapd can clear pgdat->flags once node is balanced, thus
    it's reasonable to leave all decisions about node state to kswapd.

    Why only kswapd? Why not allow global direct reclaim to change these
    flags? It is because currently only kswapd can clear these flags. I'm
    less worried about the case when PGDAT_CONGESTED is falsely not set, and
    more worried about the case when it is falsely set. If a direct reclaimer
    sets PGDAT_CONGESTED, do we have a guarantee that after the congestion
    problem is sorted out, kswapd will be woken up and clear the flag? It
    seems like there is no such guarantee. E.g. direct reclaimers may
    eventually balance pgdat and kswapd simply won't wake up (see
    wakeup_kswapd()).

    Moving pgdat->flags manipulation to kswapd means that cgroup2 reclaim
    now loses its congestion throttling mechanism. Add per-cgroup
    congestion state and throttle cgroup2 reclaimers if memcg is in
    congestion state.

    Currently there is no need in per-cgroup PGDAT_WRITEBACK and PGDAT_DIRTY
    bits since they alter only kswapd behavior.

    The problem could be easily demonstrated by creating heavy congestion in
    one cgroup:

    echo "+memory" > /sys/fs/cgroup/cgroup.subtree_control
    mkdir -p /sys/fs/cgroup/congester
    echo 512M > /sys/fs/cgroup/congester/memory.max
    echo $$ > /sys/fs/cgroup/congester/cgroup.procs
    /* generate a lot of dirty data on slow HDD */
    while true; do dd if=/dev/zero of=/mnt/sdb/zeroes bs=1M count=1024; done &
    ....
    while true; do dd if=/dev/zero of=/mnt/sdb/zeroes bs=1M count=1024; done &

    and some job in another cgroup:

    mkdir /sys/fs/cgroup/victim
    echo 128M > /sys/fs/cgroup/victim/memory.max

    # time cat /dev/sda > /dev/null
    real 10m15.054s
    user 0m0.487s
    sys 1m8.505s

    According to the tracepoint in wait_iff_congested(), the 'cat' spent 50%
    of the time sleeping there.

    With the patch, 'cat' doesn't waste time anymore:

    # time cat /dev/sda > /dev/null
    real 5m32.911s
    user 0m0.411s
    sys 0m56.664s

    [aryabinin@virtuozzo.com: congestion state should be per-node]
    Link: http://lkml.kernel.org/r/20180406135215.10057-1-aryabinin@virtuozzo.com
    [aryabinin@virtuozzo.com: make congestion state per-cgroup-per-node instead of just per-cgroup]
    Link: http://lkml.kernel.org/r/20180406180254.8970-2-aryabinin@virtuozzo.com
    Link: http://lkml.kernel.org/r/20180323152029.11084-5-aryabinin@virtuozzo.com
    Signed-off-by: Andrey Ryabinin
    Reviewed-by: Shakeel Butt
    Acked-by: Johannes Weiner
    Cc: Mel Gorman
    Cc: Tejun Heo
    Cc: Michal Hocko
    Cc: Steven Rostedt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Ryabinin
     

07 Apr, 2018

1 commit

  • Merge updates from Andrew Morton:

    - a few misc things

    - ocfs2 updates

    - the v9fs maintainers have been missing for a long time. I've taken
    over v9fs patch slinging.

    - most of MM

    * emailed patches from Andrew Morton : (116 commits)
    mm,oom_reaper: check for MMF_OOM_SKIP before complaining
    mm/ksm: fix interaction with THP
    mm/memblock.c: cast constant ULLONG_MAX to phys_addr_t
    headers: untangle kmemleak.h from mm.h
    include/linux/mmdebug.h: make VM_WARN* non-rvals
    mm/page_isolation.c: make start_isolate_page_range() fail if already isolated
    mm: change return type to vm_fault_t
    mm, oom: remove 3% bonus for CAP_SYS_ADMIN processes
    mm, page_alloc: wakeup kcompactd even if kswapd cannot free more memory
    kernel/fork.c: detect early free of a live mm
    mm: make counting of list_lru_one::nr_items lockless
    mm/swap_state.c: make bool enable_vma_readahead and swap_vma_readahead() static
    block_invalidatepage(): only release page if the full page was invalidated
    mm: kernel-doc: add missing parameter descriptions
    mm/swap.c: remove @cold parameter description for release_pages()
    mm/nommu: remove description of alloc_vm_area
    zram: drop max_zpage_size and use zs_huge_class_size()
    zsmalloc: introduce zs_huge_class_size()
    mm: fix races between swapoff and flush dcache
    fs/direct-io.c: minor cleanups in do_blockdev_direct_IO
    ...

    Linus Torvalds
     

06 Apr, 2018

1 commit

  • ...instead of open coding file operations followed by custom ->open()
    callbacks for each attribute.

    [andriy.shevchenko@linux.intel.com: add tags, fix compilation issue]
    Link: http://lkml.kernel.org/r/20180217144253.58604-1-andriy.shevchenko@linux.intel.com
    Link: http://lkml.kernel.org/r/20180214154644.54505-1-andriy.shevchenko@linux.intel.com
    Signed-off-by: Andy Shevchenko
    Reviewed-by: Matthew Wilcox
    Reviewed-by: Andrew Morton
    Reviewed-by: Sergey Senozhatsky
    Acked-by: Christoph Lameter
    Cc: Tejun Heo
    Cc: Dennis Zhou
    Cc: Minchan Kim
    Cc: Nitin Gupta
    Cc: Sergey Senozhatsky
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Shevchenko
     

01 Mar, 2018

1 commit


22 Dec, 2017

1 commit

  • This reverts commit a0747a859ef6d3cc5b6cd50eb694499b78dd0025.

    It breaks some booting for some users, and more than a week
    into this, there's still no good fix. Revert this commit
    for now until a solution has been found.

    Reported-by: Laura Abbott
    Reported-by: Bruno Wolff III
    Signed-off-by: Jens Axboe

    Jens Axboe
     

20 Nov, 2017

2 commits


06 Oct, 2017

1 commit


12 Sep, 2017

1 commit


21 Apr, 2017

3 commits