25 Sep, 2020

4 commits

  • Replace the two negative flags that are always used together with a
    single positive flag that indicates the writeback capability instead
    of two related non-capabilities. Also remove the pointless wrappers
    that just check the flag.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Jan Kara
    Reviewed-by: Johannes Thumshirn
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • Replace BDI_CAP_NO_ACCT_WB with a positive BDI_CAP_WRITEBACK_ACCT to
    make the checks more obvious. Also remove the pointless
    bdi_cap_account_writeback wrapper that just obfuscates the check.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Jan Kara
    Reviewed-by: Johannes Thumshirn
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • BDI_CAP_STABLE_WRITES is one of the few bits of information in the
    backing_dev_info shared between the block drivers and the writeback code.
    To help untangle the dependency, replace it with a queue flag and a
    superblock flag derived from it. This also helps with the case of e.g.
    a file system requiring stable writes due to its own checksumming, but
    not forcing it on other users of the block device like the swap code.

    One downside is that we can't support the stable_pages_required bdi
    attribute in sysfs anymore. It is replaced with a queue attribute, which
    is also writable for easier testing.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Jan Kara
    Reviewed-by: Johannes Thumshirn
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • Set up a readahead size by default, as very few users have a good
    reason to change it. This means coda, ecryptfs, and orangefs now
    set up the values where they were previously missing, while ubifs,
    mtd and vboxsf explicitly set it to 0 to avoid readahead.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Jan Kara
    Acked-by: David Sterba [btrfs]
    Acked-by: Richard Weinberger [ubifs, mtd]
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

09 Jul, 2020

2 commits


10 May, 2020

5 commits

  • The name is only printed for a bdi that is not registered in writeback.
    Use the device name there instead, as it is more useful anyway for the
    unlikely case that the warning triggers.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Jan Kara
    Reviewed-by: Greg Kroah-Hartman
    Reviewed-by: Bart Van Assche
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • Merge the _node vs normal version and drop the superfluous gfp_t argument.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Jan Kara
    Reviewed-by: Greg Kroah-Hartman
    Reviewed-by: Bart Van Assche
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • Split out a new bdi_set_owner helper to set the owner, and move the policy
    for creating the bdi name back into genhd.c, where it belongs.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Jan Kara
    Reviewed-by: Greg Kroah-Hartman
    Reviewed-by: Bart Van Assche
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • bdi_register_va is only used by super.c, which can't be modular.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Jan Kara
    Reviewed-by: Greg Kroah-Hartman
    Reviewed-by: Bart Van Assche
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • Cache a copy of the name for the life time of the backing_dev_info
    structure so that we can reference it even after unregistering.

    Fixes: 68f23b89067f ("memcg: fix a crash in wb_workfn when a device disappears")
    Reported-by: Yufen Yu
    Signed-off-by: Christoph Hellwig
    Reviewed-by: Jan Kara
    Reviewed-by: Bart Van Assche
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

07 May, 2020

1 commit

  • bdi_dev_name is not a fast path function, move it out of line. This
    prepares for using it from modular callers without having to export
    an implementation detail like bdi_unknown_name.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Jan Kara
    Reviewed-by: Greg Kroah-Hartman
    Reviewed-by: Bart Van Assche
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

02 Apr, 2020

1 commit

  • blkcg->cgwb_refcnt is used to delay blkcg offlining so that blkgs
    don't get offlined while there are active cgwbs on them. However, it
    ends up making offlining unordered, sometimes causing parents to be
    offlined before children.

    To fix it, we want child blkcgs to pin the parents' online states,
    turning the refcnt into a more generic online pinning mechanism.

    In preparation:

    * blkcg->cgwb_refcnt -> blkcg->online_pin
    * blkcg_cgwb_get/put() -> blkcg_pin/unpin_online()
    * Take them out of CONFIG_CGROUP_WRITEBACK

    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
     

01 Feb, 2020

1 commit

  • Without memcg, there is a one-to-one mapping between the bdi and
    bdi_writeback structures. In this world, things are fairly
    straightforward; the first thing bdi_unregister() does is to shut down
    the bdi_writeback structure (or wb), and part of that shutdown ensures
    that no other work is queued against the wb, and that the wb is fully
    drained.

    With memcg, however, there is a one-to-many relationship between the bdi
    and bdi_writeback structures; that is, there are multiple wb objects
    which can all point to a single bdi. There is a refcount which prevents
    the bdi object from being released (and hence, unregistered). So in
    theory, the bdi_unregister() *should* only get called once its refcount
    goes to zero (bdi_put will drop the refcount, and when it is zero,
    release_bdi gets called, which calls bdi_unregister).

    Unfortunately, del_gendisk() in block/genhd.c never got the memo about
    the Brave New memcg World, and calls bdi_unregister directly. It does
    this without informing the file system, or the memcg code, or anything
    else. This causes the root wb associated with the bdi to be
    unregistered, but none of the memcg-specific wbs are shut down. So when
    one of these wbs is woken up to do delayed work, it tries to
    dereference its wb->bdi->dev to fetch the device name, but
    unfortunately bdi->dev is now NULL, thanks to the bdi_unregister()
    called by del_gendisk(). As a result, *boom*.

    Fortunately, it looks like the rest of the writeback path is perfectly
    happy with bdi->dev and bdi->owner being NULL, so the simplest fix is to
    create a bdi_dev_name() function which can handle bdi->dev being NULL.
    This also allows us to bulletproof the writeback tracepoints to prevent
    them from dereferencing a NULL pointer and crashing the kernel if one is
    tracing with memcg's enabled, and an iSCSI device dies or a USB storage
    stick is pulled.

    The most common way of triggering this will be hotremoval of a device
    while writeback with memcg enabled is going on. It was triggering
    several times a day in a heavily loaded production environment.

    Google Bug Id: 145475544

    Link: https://lore.kernel.org/r/20191227194829.150110-1-tytso@mit.edu
    Link: http://lkml.kernel.org/r/20191228005211.163952-1-tytso@mit.edu
    Signed-off-by: Theodore Ts'o
    Cc: Chris Mason
    Cc: Tejun Heo
    Cc: Jens Axboe
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Theodore Ts'o
     

06 Oct, 2019

1 commit

  • A removable block device, such as an NVMe drive or an SSD connected over
    Thunderbolt, can be hot-removed at any time, including while the system
    is suspended. When the device is hot-removed during suspend and the
    system gets resumed, the kernel first resumes devices and then thaws
    userspace, including freezable workqueues. What happens in that case is
    that the NVMe driver notices that the device is unplugged and removes it
    from the system. This ends up calling bdi_unregister() for the gendisk,
    which then schedules wb_workfn() to be run one more time.

    However, since bdi_wq is still frozen, the flush_delayed_work() call in
    wb_shutdown() blocks forever, halting the system resume process. The
    user sees this as a hang, as nothing happens anymore.

    Triggering sysrq-w reveals this:

    Workqueue: nvme-wq nvme_remove_dead_ctrl_work [nvme]
    Call Trace:
    ? __schedule+0x2c5/0x630
    ? wait_for_completion+0xa4/0x120
    schedule+0x3e/0xc0
    schedule_timeout+0x1c9/0x320
    ? resched_curr+0x1f/0xd0
    ? wait_for_completion+0xa4/0x120
    wait_for_completion+0xc3/0x120
    ? wake_up_q+0x60/0x60
    __flush_work+0x131/0x1e0
    ? flush_workqueue_prep_pwqs+0x130/0x130
    bdi_unregister+0xb9/0x130
    del_gendisk+0x2d2/0x2e0
    nvme_ns_remove+0xed/0x110 [nvme_core]
    nvme_remove_namespaces+0x96/0xd0 [nvme_core]
    nvme_remove+0x5b/0x160 [nvme]
    pci_device_remove+0x36/0x90
    device_release_driver_internal+0xdf/0x1c0
    nvme_remove_dead_ctrl_work+0x14/0x30 [nvme]
    process_one_work+0x1c2/0x3f0
    worker_thread+0x48/0x3e0
    kthread+0x100/0x140
    ? current_work+0x30/0x30
    ? kthread_park+0x80/0x80
    ret_from_fork+0x35/0x40

    This is not limited to NVMe, so the exact same issue can be reproduced
    by hot-removing an SSD (over Thunderbolt) while the system is suspended.

    Prevent this from happening by removing WQ_FREEZABLE from bdi_wq.

    Reported-by: AceLan Kao
    Link: https://marc.info/?l=linux-kernel&m=138695698516487
    Link: https://bugzilla.kernel.org/show_bug.cgi?id=204385
    Link: https://lore.kernel.org/lkml/20191002122136.GD2819@lahna.fi.intel.com/#t
    Acked-by: Rafael J. Wysocki
    Signed-off-by: Mika Westerberg
    Signed-off-by: Jens Axboe

    Mika Westerberg
     

27 Aug, 2019

2 commits

  • Separate out wb_get_lookup(), which doesn't try to create a wb if there
    isn't already one, from wb_get_create(). This will be used by later
    patches.

    Reviewed-by: Jan Kara
    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • There currently is no way to universally identify and look up a bdi
    without holding a reference and pointer to it. This patch adds a
    non-recycling bdi->id and implements bdi_get_by_id() which looks up
    bdis by their ids. This will be used by memcg foreign inode flushing.

    I left bdi_list alone for simplicity, and because while rb_tree does
    support rcu assignment, it doesn't seem to guarantee a lossless walk
    when the walk is racing against tree rebalance operations.

    Reviewed-by: Jan Kara
    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
     

03 Jun, 2019

1 commit

  • When calling debugfs functions, there is no need to ever check the
    return value. The function can work or not, but the code logic should
    never do something different based on this.

    And as the return value does not matter at all, no need to save the
    dentry in struct backing_dev_info, so delete it.

    Cc: Andrew Morton
    Cc: Anders Roxell
    Cc: Arnd Bergmann
    Cc: Michal Hocko
    Cc: linux-mm@kvack.org
    Reviewed-by: Sebastian Andrzej Siewior
    Signed-off-by: Greg Kroah-Hartman

    Greg Kroah-Hartman
     

21 May, 2019

1 commit

  • Add SPDX license identifiers to all files which:

    - Have no license information of any form

    - Have EXPORT_.*_SYMBOL_GPL inside which was used in the
    initial scan/conversion to ignore the file

    These files fall under the project license, GPL v2 only. The resulting SPDX
    license identifier is:

    GPL-2.0-only

    Signed-off-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     

23 Jan, 2019

1 commit

  • sync_inodes_sb() can race against cgwb (cgroup writeback) membership
    switches and fail to writeback some inodes. For example, if an inode
    switches to another wb while sync_inodes_sb() is in progress, the new
    wb might not be visible to bdi_split_work_to_wbs() at all or the inode
    might jump from a wb which hasn't issued writebacks yet to one which
    already has.

    This patch adds backing_dev_info->wb_switch_rwsem to synchronize cgwb
    switch path against sync_inodes_sb() so that sync_inodes_sb() is
    guaranteed to see all the target wbs and inodes can't jump wbs to
    escape syncing.

    v2: Fixed misplaced rwsem init. Spotted by Jiufei.

    Signed-off-by: Tejun Heo
    Reported-by: Jiufei Xue
    Link: http://lkml.kernel.org/r/dc694ae2-f07f-61e1-7097-7c8411cee12d@gmail.com
    Acked-by: Jan Kara
    Signed-off-by: Jens Axboe

    Tejun Heo
     

01 Sep, 2018

1 commit

  • Currently, blkcg destruction relies on a sequence of events:
    1. Destruction starts. blkcg_css_offline() is called and blkgs
    release their reference to the blkcg. This immediately destroys
    the cgwbs (writeback).
    2. With blkgs giving up their reference, the blkcg ref count should
    become zero and eventually call blkcg_css_free() which finally
    frees the blkcg.

    Jiufei Xue reported that there is a race between blkcg_bio_issue_check()
    and cgroup_rmdir(). To remedy this, blkg destruction becomes contingent
    on the completion of all writeback associated with the blkcg. A count of
    the number of cgwbs is maintained and once that goes to zero, blkg
    destruction can follow. This should prevent premature blkg destruction
    related to writeback.

    The new process for blkcg cleanup is as follows:
    1. Destruction starts. blkcg_css_offline() is called which offlines
    writeback. Blkg destruction is delayed on the cgwb_refcnt count to
    avoid punting potentially large amounts of outstanding writeback
    to root while maintaining any ongoing policies. Here, the base
    cgwb_refcnt is put back.
    2. When the cgwb_refcnt becomes zero, blkcg_destroy_blkgs() is called
    and handles destruction of blkgs. This is where the css reference
    held by each blkg is released.
    3. Once the blkcg ref count goes to zero, blkcg_css_free() is called.
    This finally frees the blkg.

    It seems in the past blk-throttle didn't do the most understandable
    things with taking data from a blkg while associating with current. So,
    the simplification and unification of what blk-throttle is doing caused
    this.

    Fixes: 08e18eab0c579 ("block: add bi_blkg to the bio for cgroups")
    Reviewed-by: Josef Bacik
    Signed-off-by: Dennis Zhou
    Cc: Jiufei Xue
    Cc: Joseph Qi
    Cc: Tejun Heo
    Cc: Josef Bacik
    Cc: Jens Axboe
    Signed-off-by: Jens Axboe

    Dennis Zhou (Facebook)
     

23 Aug, 2018

2 commits

  • The irqsave variant of refcount_dec_and_lock handles irqsave/restore when
    taking/releasing the spin lock. With this variant, the calls to
    local_irq_save/restore are no longer required.

    [bigeasy@linutronix.de: s@atomic_dec_and_lock@refcount_dec_and_lock@g]
    Link: http://lkml.kernel.org/r/20180703200141.28415-5-bigeasy@linutronix.de
    Signed-off-by: Anna-Maria Gleixner
    Signed-off-by: Sebastian Andrzej Siewior
    Acked-by: Peter Zijlstra (Intel)
    Cc: Jens Axboe
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Anna-Maria Gleixner
     
  • refcount_t type and corresponding API should be used instead of atomic_t
    when the variable is used as a reference counter. This permits avoiding
    accidental refcounter overflows that might lead to use-after-free
    situations.

    Link: http://lkml.kernel.org/r/20180703200141.28415-4-bigeasy@linutronix.de
    Signed-off-by: Sebastian Andrzej Siewior
    Reviewed-by: Andrew Morton
    Acked-by: Peter Zijlstra (Intel)
    Suggested-by: Peter Zijlstra
    Cc: Jens Axboe
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sebastian Andrzej Siewior
     

23 Jun, 2018

1 commit

  • syzbot is reporting a NULL pointer dereference at wb_workfn() [1] due to
    wb->bdi->dev being NULL. And Dmitry confirmed that wb->state was
    WB_shutting_down after wb->bdi->dev became NULL. This indicates that
    bdi_unregister() failed to call wb_shutdown() on one of the wb objects.

    The problem is in cgwb_bdi_unregister() which does cgwb_kill() and thus
    drops bdi's reference to wb structures before going through the list of
    wbs again and calling wb_shutdown() on each of them. This way the loop
    iterating through all wbs can easily miss a wb if that wb has already
    passed through cgwb_remove_from_bdi_list() called from wb_shutdown()
    from cgwb_release_workfn(), and as a result fully shut down the bdi although
    wb_workfn() for this wb structure is still running. In fact there are
    also other ways cgwb_bdi_unregister() can race with
    cgwb_release_workfn() leading e.g. to use-after-free issues:

    CPU1                                      CPU2
    cgwb_bdi_unregister()
      cgwb_kill(*slot);
                                              cgwb_release()
                                                queue_work(cgwb_release_wq,
                                                           &wb->release_work);
                                              cgwb_release_workfn()
      wb = list_first_entry(&bdi->wb_list, ...)
      spin_unlock_irq(&cgwb_lock);
                                                wb_shutdown(wb);
                                                ...
                                                kfree_rcu(wb, rcu);
      wb_shutdown(wb); -> oops use-after-free

    We solve these issues by synchronizing writeback structure shutdown from
    cgwb_bdi_unregister() with cgwb_release_workfn() using a new mutex. That
    way we also no longer need the synchronization using WB_shutting_down, as
    the mutex provides it for the CONFIG_CGROUP_WRITEBACK case, and without
    CONFIG_CGROUP_WRITEBACK wb_shutdown() can be called only once from
    bdi_unregister().

    Reported-by: syzbot
    Acked-by: Tejun Heo
    Signed-off-by: Jan Kara
    Signed-off-by: Jens Axboe

    Jan Kara
     

08 Jun, 2018

1 commit

  • mem_cgroup_cgwb_list is a very simple wrapper and it will never be used
    outside of code under CONFIG_CGROUP_WRITEBACK, so use memcg->cgwb_list
    directly.

    Link: http://lkml.kernel.org/r/1524406173-212182-1-git-send-email-wanglong19@meituan.com
    Signed-off-by: Wang Long
    Reviewed-by: Jan Kara
    Acked-by: Tejun Heo
    Acked-by: Michal Hocko
    Reviewed-by: Andrew Morton
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wang Long
     

24 May, 2018

1 commit

  • cgwb_release() punts the actual release to cgwb_release_workfn() on
    system_wq. Depending on the number of cgroups or block devices, there
    can be a lot of cgwb_release_workfn() in flight at the same time.

    We're periodically seeing close to 256 kworkers getting stuck with the
    following stack trace, and over time the entire system gets stuck.

    [] _synchronize_rcu_expedited.constprop.72+0x2fc/0x330
    [] synchronize_rcu_expedited+0x24/0x30
    [] bdi_unregister+0x53/0x290
    [] release_bdi+0x89/0xc0
    [] wb_exit+0x85/0xa0
    [] cgwb_release_workfn+0x54/0xb0
    [] process_one_work+0x150/0x410
    [] worker_thread+0x6d/0x520
    [] kthread+0x12c/0x160
    [] ret_from_fork+0x29/0x40
    [] 0xffffffffffffffff

    The events leading to the lockup are...

    1. A lot of cgwb_release_workfn() is queued at the same time and all
    system_wq kworkers are assigned to execute them.

    2. They all end up calling synchronize_rcu_expedited(). One of them
    wins and tries to perform the expedited synchronization.

    3. However, that involves queueing rcu_exp_work to system_wq and
    waiting for it. Because #1 is holding all available kworkers on
    system_wq, rcu_exp_work can't be executed. cgwb_release_workfn()
    is waiting for synchronize_rcu_expedited() which in turn is waiting
    for cgwb_release_workfn() to free up some of the kworkers.

    We shouldn't be scheduling hundreds of cgwb_release_workfn() at the
    same time. There's nothing to be gained from that. This patch
    updates cgwb release path to use a dedicated percpu workqueue with
    @max_active of 1.

    While this resolves the problem at hand, it might be a good idea to
    isolate rcu_exp_work to its own workqueue too as it can be used from
    various paths and is prone to this sort of indirect A-A deadlocks.

    Signed-off-by: Tejun Heo
    Cc: "Paul E. McKenney"
    Cc: stable@vger.kernel.org
    Signed-off-by: Jens Axboe

    Tejun Heo
     

03 May, 2018

2 commits

  • syzbot is reporting use after free bug in debugfs_remove() [1].

    This is because fault injection made the memory allocation for
    debugfs_create_file() from bdi_debug_register() from bdi_register_va()
    fail, yet we continued with setting WB_registered. But when
    debugfs_remove(bdi->debug_dir) is called from bdi_debug_unregister()
    from bdi_unregister() from release_bdi() (because WB_registered was set
    by bdi_register_va()), IS_ERR_OR_NULL(bdi->debug_dir) == false despite
    debugfs_remove(bdi->debug_dir) having already been called from
    bdi_register_va().

    Fix this by making IS_ERR_OR_NULL(bdi->debug_dir) == true.

    [1] https://syzkaller.appspot.com/bug?id=5ab4efd91a96dcea9b68104f159adf4af2a6dfc1

    Signed-off-by: Tetsuo Handa
    Reported-by: syzbot
    Fixes: 97f07697932e6faf ("bdi: convert bdi_debug_register to int")
    Cc: weiping zhang
    Reviewed-by: Greg Kroah-Hartman
    Reviewed-by: Jan Kara
    Signed-off-by: Jens Axboe

    Tetsuo Handa
     
  • syzbot is reporting hung tasks at wait_on_bit(WB_shutting_down) in
    wb_shutdown() [1]. This seems to be because commit 5318ce7d46866e1d ("bdi:
    Shutdown writeback on all cgwbs in cgwb_bdi_destroy()") forgot to call
    wake_up_bit(WB_shutting_down) after clear_bit(WB_shutting_down).

    Introduce a helper function clear_and_wake_up_bit() and use it, in order
    to avoid similar errors in the future.

    [1] https://syzkaller.appspot.com/bug?id=b297474817af98d5796bc544e1bb806fc3da0e5e

    Signed-off-by: Tetsuo Handa
    Reported-by: syzbot
    Fixes: 5318ce7d46866e1d ("bdi: Shutdown writeback on all cgwbs in cgwb_bdi_destroy()")
    Cc: Tejun Heo
    Reviewed-by: Jan Kara
    Suggested-by: Linus Torvalds
    Signed-off-by: Jens Axboe

    Tetsuo Handa
     

12 Apr, 2018

1 commit

  • memcg reclaim may alter pgdat->flags based on the state of LRU lists in
    the cgroup and its children. PGDAT_WRITEBACK may force kswapd to sleep in
    congested_wait(), and PGDAT_DIRTY may force kswapd to write back filesystem
    pages. But the worst here is PGDAT_CONGESTED, since it may force all
    direct reclaims to stall in wait_iff_congested(). Note that only kswapd
    has the power to clear any of these bits. This might just never happen if
    cgroup limits are configured that way. So all direct reclaims will stall
    as long as we have some congested bdi in the system.

    Leave all pgdat->flags manipulations to kswapd. kswapd scans the whole
    pgdat, only kswapd can clear pgdat->flags once node is balanced, thus
    it's reasonable to leave all decisions about node state to kswapd.

    Why only kswapd? Why not allow global direct reclaim to change these
    flags? It is because currently only kswapd can clear these flags. I'm
    less worried about the case when PGDAT_CONGESTED is falsely not set, and
    more worried about the case when it is falsely set. If a direct reclaimer
    sets PGDAT_CONGESTED, do we have a guarantee that after the congestion
    problem is sorted out, kswapd will be woken up and clear the flag? It
    seems like there is no such guarantee. E.g. direct reclaimers may
    eventually balance pgdat and kswapd simply won't wake up (see
    wakeup_kswapd()).

    Moving pgdat->flags manipulation to kswapd means that cgroup2 reclaim
    now loses its congestion throttling mechanism. Add per-cgroup
    congestion state and throttle cgroup2 reclaimers if memcg is in
    congestion state.

    Currently there is no need in per-cgroup PGDAT_WRITEBACK and PGDAT_DIRTY
    bits since they alter only kswapd behavior.

    The problem could be easily demonstrated by creating heavy congestion in
    one cgroup:

    echo "+memory" > /sys/fs/cgroup/cgroup.subtree_control
    mkdir -p /sys/fs/cgroup/congester
    echo 512M > /sys/fs/cgroup/congester/memory.max
    echo $$ > /sys/fs/cgroup/congester/cgroup.procs
    /* generate a lot of dirty data on slow HDD */
    while true; do dd if=/dev/zero of=/mnt/sdb/zeroes bs=1M count=1024; done &
    ....
    while true; do dd if=/dev/zero of=/mnt/sdb/zeroes bs=1M count=1024; done &

    and some job in another cgroup:

    mkdir /sys/fs/cgroup/victim
    echo 128M > /sys/fs/cgroup/victim/memory.max

    # time cat /dev/sda > /dev/null
    real 10m15.054s
    user 0m0.487s
    sys 1m8.505s

    According to the tracepoint in wait_iff_congested(), the 'cat' spent 50%
    of the time sleeping there.

    With the patch, 'cat' doesn't waste time anymore:

    # time cat /dev/sda > /dev/null
    real 5m32.911s
    user 0m0.411s
    sys 0m56.664s

    [aryabinin@virtuozzo.com: congestion state should be per-node]
    Link: http://lkml.kernel.org/r/20180406135215.10057-1-aryabinin@virtuozzo.com
    [aryabinin@virtuozzo.com: make congestion state per-cgroup-per-node instead of just per-cgroup]
    Link: http://lkml.kernel.org/r/20180406180254.8970-2-aryabinin@virtuozzo.com
    Link: http://lkml.kernel.org/r/20180323152029.11084-5-aryabinin@virtuozzo.com
    Signed-off-by: Andrey Ryabinin
    Reviewed-by: Shakeel Butt
    Acked-by: Johannes Weiner
    Cc: Mel Gorman
    Cc: Tejun Heo
    Cc: Michal Hocko
    Cc: Steven Rostedt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Ryabinin
     

07 Apr, 2018

1 commit

  • Merge updates from Andrew Morton:

    - a few misc things

    - ocfs2 updates

    - the v9fs maintainers have been missing for a long time. I've taken
    over v9fs patch slinging.

    - most of MM

    * emailed patches from Andrew Morton : (116 commits)
    mm,oom_reaper: check for MMF_OOM_SKIP before complaining
    mm/ksm: fix interaction with THP
    mm/memblock.c: cast constant ULLONG_MAX to phys_addr_t
    headers: untangle kmemleak.h from mm.h
    include/linux/mmdebug.h: make VM_WARN* non-rvals
    mm/page_isolation.c: make start_isolate_page_range() fail if already isolated
    mm: change return type to vm_fault_t
    mm, oom: remove 3% bonus for CAP_SYS_ADMIN processes
    mm, page_alloc: wakeup kcompactd even if kswapd cannot free more memory
    kernel/fork.c: detect early free of a live mm
    mm: make counting of list_lru_one::nr_items lockless
    mm/swap_state.c: make bool enable_vma_readahead and swap_vma_readahead() static
    block_invalidatepage(): only release page if the full page was invalidated
    mm: kernel-doc: add missing parameter descriptions
    mm/swap.c: remove @cold parameter description for release_pages()
    mm/nommu: remove description of alloc_vm_area
    zram: drop max_zpage_size and use zs_huge_class_size()
    zsmalloc: introduce zs_huge_class_size()
    mm: fix races between swapoff and flush dcache
    fs/direct-io.c: minor cleanups in do_blockdev_direct_IO
    ...

    Linus Torvalds
     

06 Apr, 2018

1 commit

  • ...instead of open coding file operations followed by custom ->open()
    callbacks for each attribute.

    [andriy.shevchenko@linux.intel.com: add tags, fix compilation issue]
    Link: http://lkml.kernel.org/r/20180217144253.58604-1-andriy.shevchenko@linux.intel.com
    Link: http://lkml.kernel.org/r/20180214154644.54505-1-andriy.shevchenko@linux.intel.com
    Signed-off-by: Andy Shevchenko
    Reviewed-by: Matthew Wilcox
    Reviewed-by: Andrew Morton
    Reviewed-by: Sergey Senozhatsky
    Acked-by: Christoph Lameter
    Cc: Tejun Heo
    Cc: Dennis Zhou
    Cc: Minchan Kim
    Cc: Nitin Gupta
    Cc: Sergey Senozhatsky
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Shevchenko
     

01 Mar, 2018

1 commit


22 Dec, 2017

1 commit

  • This reverts commit a0747a859ef6d3cc5b6cd50eb694499b78dd0025.

    It breaks some booting for some users, and more than a week
    into this, there's still no good fix. Revert this commit
    for now until a solution has been found.

    Reported-by: Laura Abbott
    Reported-by: Bruno Wolff III
    Signed-off-by: Jens Axboe

    Jens Axboe
     

20 Nov, 2017

2 commits


06 Oct, 2017

1 commit


12 Sep, 2017

1 commit


21 Apr, 2017

3 commits