22 Jul, 2018

1 commit

  • commit 3ee7e8697d5860b173132606d80a9cd35e7113ee upstream.

    syzbot is reporting NULL pointer dereference at wb_workfn() [1] due to
    wb->bdi->dev being NULL. And Dmitry confirmed that wb->state was
    WB_shutting_down after wb->bdi->dev became NULL. This indicates that
    bdi_unregister() failed to call wb_shutdown() on one of the wb objects.

    The problem is in cgwb_bdi_unregister() which does cgwb_kill() and thus
    drops bdi's reference to wb structures before going through the list of
    wbs again and calling wb_shutdown() on each of them. This way the loop
    iterating through all wbs can easily miss a wb if that wb has already
    passed through cgwb_remove_from_bdi_list() called from wb_shutdown()
    from cgwb_release_workfn(), and as a result the bdi is fully shut down
    although wb_workfn() for this wb structure is still running. In fact
    there are also other ways cgwb_bdi_unregister() can race with
    cgwb_release_workfn(), leading e.g. to use-after-free issues:

    CPU1                                                CPU2
                                                        cgwb_bdi_unregister()
                                                          cgwb_kill(*slot);

    cgwb_release()
      queue_work(cgwb_release_wq, &wb->release_work);
    cgwb_release_workfn()
                                                          wb = list_first_entry(&bdi->wb_list, ...)
                                                          spin_unlock_irq(&cgwb_lock);
      wb_shutdown(wb);
      ...
      kfree_rcu(wb, rcu);
                                                          wb_shutdown(wb); -> oops use-after-free

    We solve these issues by synchronizing writeback structure shutdown from
    cgwb_bdi_unregister() with cgwb_release_workfn() using a new mutex (see
    the sketch at the end of this entry). That way we also no longer need the
    synchronization using WB_shutting_down, as the mutex provides it for the
    CONFIG_CGROUP_WRITEBACK case, and without CONFIG_CGROUP_WRITEBACK
    wb_shutdown() can be called only once, from bdi_unregister().

    Reported-by: syzbot
    Acked-by: Tejun Heo
    Signed-off-by: Jan Kara
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Jan Kara
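    A rough sketch of the synchronization described above follows; the mutex
    name (cgwb_release_mutex) and the surrounding details are illustrative
    rather than the exact upstream diff:

    static void cgwb_bdi_unregister(struct backing_dev_info *bdi)
    {
            struct bdi_writeback *wb;

            spin_lock_irq(&cgwb_lock);
            /* ... cgwb_kill() every cgwb hanging off this bdi ... */
            spin_unlock_irq(&cgwb_lock);

            mutex_lock(&bdi->cgwb_release_mutex);
            spin_lock_irq(&cgwb_lock);
            while (!list_empty(&bdi->wb_list)) {
                    wb = list_first_entry(&bdi->wb_list, struct bdi_writeback,
                                          bdi_node);
                    spin_unlock_irq(&cgwb_lock);
                    /* cannot race with a shutdown from cgwb_release_workfn() */
                    wb_shutdown(wb);
                    spin_lock_irq(&cgwb_lock);
            }
            spin_unlock_irq(&cgwb_lock);
            mutex_unlock(&bdi->cgwb_release_mutex);
    }

    static void cgwb_release_workfn(struct work_struct *work)
    {
            struct bdi_writeback *wb = container_of(work, struct bdi_writeback,
                                                    release_work);

            mutex_lock(&wb->bdi->cgwb_release_mutex);
            wb_shutdown(wb);        /* serialized against unregistration above */
            mutex_unlock(&wb->bdi->cgwb_release_mutex);

            /* ... css puts, wb_exit(), kfree_rcu(wb, rcu) ... */
    }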
     

26 Jun, 2018

1 commit

  • commit f183464684190bacbfb14623bd3e4e51b7575b4c upstream.

    cgwb_release() punts the actual release to cgwb_release_workfn() on
    system_wq. Depending on the number of cgroups or block devices, there
    can be a lot of cgwb_release_workfn() in flight at the same time.

    We're periodically seeing close to 256 kworkers getting stuck with the
    following stack trace, and over time the entire system gets stuck.

    [] _synchronize_rcu_expedited.constprop.72+0x2fc/0x330
    [] synchronize_rcu_expedited+0x24/0x30
    [] bdi_unregister+0x53/0x290
    [] release_bdi+0x89/0xc0
    [] wb_exit+0x85/0xa0
    [] cgwb_release_workfn+0x54/0xb0
    [] process_one_work+0x150/0x410
    [] worker_thread+0x6d/0x520
    [] kthread+0x12c/0x160
    [] ret_from_fork+0x29/0x40
    [] 0xffffffffffffffff

    The events leading to the lockup are...

    1. A lot of cgwb_release_workfn() is queued at the same time and all
    system_wq kworkers are assigned to execute them.

    2. They all end up calling synchronize_rcu_expedited(). One of them
    wins and tries to perform the expedited synchronization.

    3. However, that involves queueing rcu_exp_work to system_wq and
    waiting for it. Because #1 is holding all available kworkers on
    system_wq, rcu_exp_work can't be executed. cgwb_release_workfn()
    is waiting for synchronize_rcu_expedited() which in turn is waiting
    for cgwb_release_workfn() to free up some of the kworkers.

    We shouldn't be scheduling hundreds of cgwb_release_workfn() at the
    same time. There's nothing to be gained from that. This patch updates
    the cgwb release path to use a dedicated percpu workqueue with
    @max_active of 1 (see the sketch at the end of this entry).

    While this resolves the problem at hand, it might be a good idea to
    isolate rcu_exp_work to its own workqueue too, as it can be used from
    various paths and is prone to this sort of indirect A-A deadlock.

    Signed-off-by: Tejun Heo
    Cc: "Paul E. McKenney"
    Cc: stable@vger.kernel.org
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Tejun Heo
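    A minimal sketch of the dedicated release workqueue described above
    (init-hook placement and flags simplified):

    /* dedicated queue with max_active == 1 so cgwb release work cannot tie
     * up every kworker on system_wq */
    static struct workqueue_struct *cgwb_release_wq;

    static int __init cgwb_init(void)
    {
            cgwb_release_wq = alloc_workqueue("cgwb_release", 0, 1);
            if (!cgwb_release_wq)
                    return -ENOMEM;
            return 0;
    }
    subsys_initcall(cgwb_init);

    static void cgwb_release(struct percpu_ref *refcnt)
    {
            struct bdi_writeback *wb = container_of(refcnt, struct bdi_writeback,
                                                    refcnt);

            /* punt to the dedicated queue instead of system_wq */
            queue_work(cgwb_release_wq, &wb->release_work);
    }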
     

16 May, 2018

1 commit

  • commit 8236b0ae31c837d2b3a2565c5f8d77f637e824cc upstream.

    syzbot is reporting hung tasks at wait_on_bit(WB_shutting_down) in
    wb_shutdown() [1]. This seems to be because commit 5318ce7d46866e1d ("bdi:
    Shutdown writeback on all cgwbs in cgwb_bdi_destroy()") forgot to call
    wake_up_bit(WB_shutting_down) after clear_bit(WB_shutting_down).

    Introduce a helper function clear_and_wake_up_bit() and use it, in order
    to avoid similar errors in the future (see the sketch at the end of this
    entry).

    [1] https://syzkaller.appspot.com/bug?id=b297474817af98d5796bc544e1bb806fc3da0e5e

    Signed-off-by: Tetsuo Handa
    Reported-by: syzbot
    Fixes: 5318ce7d46866e1d ("bdi: Shutdown writeback on all cgwbs in cgwb_bdi_destroy()")
    Cc: Tejun Heo
    Reviewed-by: Jan Kara
    Suggested-by: Linus Torvalds
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Tetsuo Handa
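    The helper pairs the clear with the wake-up so the two cannot be
    separated again; it looks along these lines, and wb_shutdown() then ends
    with clear_and_wake_up_bit(WB_shutting_down, &wb->state) instead of a
    bare clear_bit():

    static inline void clear_and_wake_up_bit(int bit, void *word)
    {
            clear_bit_unlock(bit, word);
            /* barrier so waiters observe the cleared bit before being woken */
            smp_mb__after_atomic();
            wake_up_bit(word, bit);
    }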
     

12 Sep, 2017

1 commit


21 Apr, 2017

5 commits


23 Mar, 2017

6 commits

  • Rename cgwb_bdi_destroy() to cgwb_bdi_unregister() as it gets called
    from bdi_unregister() which is not necessarily called from bdi_destroy()
    and thus the name is somewhat misleading.

    Acked-by: Tejun Heo
    Signed-off-by: Jan Kara
    Signed-off-by: Jens Axboe

    Jan Kara
     
  • Currently we wait for all cgwbs to get released in cgwb_bdi_destroy()
    (called from bdi_unregister()). That is however unnecessary now that
    cgwb->bdi is a proper refcounted reference (thus the bdi cannot get
    released before all cgwbs are released) and now that cgwb_bdi_destroy()
    shuts down writeback directly.

    Acked-by: Tejun Heo
    Signed-off-by: Jan Kara
    Signed-off-by: Jens Axboe

    Jan Kara
     
  • Currently we wait for all cgwbs to get freed in cgwb_bdi_destroy(),
    which also means that writeback has been shut down on them. Since this
    wait is going away, directly shut down writeback on cgwbs from
    cgwb_bdi_destroy() to avoid live writeback structures after
    bdi_unregister() has finished. To make that safe with concurrent
    shutdown from cgwb_release_workfn(), we also have to make sure
    wb_shutdown() returns only after the bdi_writeback structure is really
    shut down.

    Acked-by: Tejun Heo
    Signed-off-by: Jan Kara
    Signed-off-by: Jens Axboe

    Jan Kara
     
  • Currently the root bdi_writeback structure is added to bdi->wb_list in
    bdi_init() and never removed. That is different from all other
    bdi_writeback structures which get added to the list when created and
    removed from it before wb_shutdown().

    So move the list addition of the root bdi_writeback to bdi_register()
    and the list removal of all bdi_writeback structures to wb_shutdown()
    (see the sketch at the end of this entry). That way a bdi_writeback
    structure is on bdi->wb_list if and only if it can handle writeback, and
    it will make it easier for us to handle shutdown of all bdi_writeback
    structures in bdi_unregister().

    Acked-by: Tejun Heo
    Signed-off-by: Jan Kara
    Signed-off-by: Jens Axboe

    Jan Kara
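    A hedged sketch of the resulting list handling (the helper names are
    invented for illustration; upstream inlines this into bdi_register() and
    wb_shutdown(), and the lock differs without CONFIG_CGROUP_WRITEBACK):

    /* root wb joins bdi->wb_list only once the bdi is registered ... */
    static void bdi_add_root_wb(struct backing_dev_info *bdi)
    {
            spin_lock_irq(&cgwb_lock);
            list_add_tail_rcu(&bdi->wb.bdi_node, &bdi->wb_list);
            spin_unlock_irq(&cgwb_lock);
    }

    /* ... and every wb, root or not, leaves the list in wb_shutdown() */
    static void bdi_remove_wb(struct bdi_writeback *wb)
    {
            spin_lock_irq(&cgwb_lock);
            list_del_rcu(&wb->bdi_node);
            spin_unlock_irq(&cgwb_lock);
    }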
     
  • Make wb->bdi a proper refcounted reference to the bdi for all
    bdi_writeback structures except for the one embedded inside struct
    backing_dev_info (see the sketch at the end of this entry). That will
    allow us to simplify bdi unregistration.

    Acked-by: Tejun Heo
    Signed-off-by: Jan Kara
    Signed-off-by: Jens Axboe

    Jan Kara
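    Sketched below (the helper is invented for illustration): every wb other
    than the root wb embedded in the bdi takes a reference on its bdi, so
    the bdi cannot be freed while any of its wb's is still alive:

    static void wb_set_bdi(struct bdi_writeback *wb, struct backing_dev_info *bdi)
    {
            wb->bdi = bdi;
            if (wb != &bdi->wb)             /* avoid a bdi -> itself cycle */
                    bdi_get(bdi);           /* paired with bdi_put() on wb release */
    }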
     
  • congested->bdi pointer is used only to be able to remove the congested
    structure from bdi->cgwb_congested_tree on structure release. Moreover
    the pointer can become NULL when we unregister the bdi. Rename the field
    to __bdi and add a comment to make it more explicit that this is
    internal stuff of the memcg writeback code and that people should not
    use the field, as such use will likely be race prone.

    We do not bother with converting congested->bdi to a proper refcounted
    reference. It would be slightly ugly to special-case bdi->wb.congested
    to avoid effectively a cyclic reference of the bdi to itself, and the
    reference gets cleared from bdi_unregister() anyway, making it
    impossible to reference a freed bdi.

    Acked-by: Tejun Heo
    Signed-off-by: Jan Kara
    Signed-off-by: Jens Axboe

    Jan Kara
     

09 Mar, 2017

2 commits

  • bdi_writeback_congested structures get created for each blkcg and bdi
    regardless of whether the bdi is registered or not. When they are
    created for an unregistered bdi and the request queue (and thus the bdi)
    is then destroyed while a blkg still holds a reference to the
    bdi_writeback_congested structure, this structure ends up referencing
    the freed bdi and the last wb_congested_put() will try to remove the
    structure from the already freed bdi.

    With commit 165a5e22fafb "block: Move bdi_unregister() to
    del_gendisk()", SCSI started to destroy bdis without calling
    bdi_unregister() first (previously it was calling bdi_unregister() even
    for unregistered bdis) and thus the code detaching
    bdi_writeback_congested in cgwb_bdi_destroy() was not triggered and we
    started hitting this use-after-free bug. It is enough to boot a KVM
    instance with virtio-scsi device to trigger this behavior.

    Fix the problem by detaching bdi_writeback_congested structures in
    bdi_exit() instead of bdi_unregister(). This is also more logical as
    they can get attached to bdi regardless whether it ever got registered
    or not.

    Fixes: 165a5e22fafb ("block: Move bdi_unregister() to del_gendisk()")
    Signed-off-by: Jan Kara
    Tested-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Jan Kara
     
  • SCSI can call device_add_disk() several times for one request queue when
    a device is unbound and bound again, creating a new gendisk each time.
    This will lead to the bdi being repeatedly registered and unregistered.
    This was not a big problem until commit 165a5e22fafb "block: Move
    bdi_unregister() to del_gendisk()" since the bdi was only registered
    repeatedly (bdi_register() handles repeated calls fine, only we ended up
    leaking a reference to the gendisk due to overwriting bdi->owner) but
    unregistered only in blk_cleanup_queue(), which didn't get called
    repeatedly. After 165a5e22fafb we were doing correct bdi_register() -
    bdi_unregister() cycles; however, bdi_unregister() is not prepared for
    that. So make sure bdi_unregister() cleans up the bdi in such a way that
    it is prepared for a possible following bdi_register() call.

    An easy way to provoke this behavior is to enable
    CONFIG_DEBUG_TEST_DRIVER_REMOVE and use scsi_debug driver to create a
    scsi disk which immediately hangs without this fix.

    Fixes: 165a5e22fafb ("block: Move bdi_unregister() to del_gendisk()")
    Signed-off-by: Jan Kara
    Tested-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Jan Kara
     

23 Feb, 2017

1 commit

  • To make the code clearer, use rb_entry() instead of container_of() to
    deal with the rbtree (see the sketch at the end of this entry).

    Link: http://lkml.kernel.org/r/671275de093d93ddc7c6f77ddc0d357149691a39.1484306840.git.geliangtang@gmail.com
    Signed-off-by: Geliang Tang
    Cc: Jens Axboe
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Geliang Tang
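    For illustration (variable names assumed), the conversion looks like:

    struct rb_node *rbn = rb_first(&bdi->cgwb_congested_tree);
    struct bdi_writeback_congested *congested;

    /* before: open-coded container_of() on the rb_node member */
    congested = container_of(rbn, struct bdi_writeback_congested, rb_node);

    /* after: rb_entry() makes the rbtree usage explicit; it is defined as
     * container_of(), so behavior is unchanged */
    congested = rb_entry(rbn, struct bdi_writeback_congested, rb_node);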
     

09 Feb, 2017

1 commit

  • When !CONFIG_CGROUP_WRITEBACK, a bdi has a single bdi_writeback_congested
    at bdi->wb_congested. cgwb_bdi_init() allocates it with kzalloc() and
    doesn't do further initialization. This usually works fine as the
    reference count gets bumped to 1 by wb_init() and the put from
    wb_exit() releases it.

    However, when wb_init() fails, it puts the wb base ref automatically
    freeing the wb and the explicit kfree() in cgwb_bdi_init() error path
    ends up trying to free the same pointer the second time causing a
    double-free.

    Fix it by explicitly initializing the refcnt to 1 and putting the base
    ref from cgwb_bdi_destroy() (see the sketch at the end of this entry).

    Signed-off-by: Tejun Heo
    Reported-by: Dmitry Vyukov
    Fixes: a13f35e87140 ("writeback: don't embed root bdi_writeback_congested in bdi_writeback")
    Cc: stable@vger.kernel.org # v4.2+
    Signed-off-by: Jens Axboe

    Tejun Heo
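    A sketch of the fixed !CONFIG_CGROUP_WRITEBACK init path (argument lists
    are approximations for the code of that era; cgwb_bdi_destroy() then
    drops the base ref instead of relying on an implicit one):

    static int cgwb_bdi_init(struct backing_dev_info *bdi)
    {
            int err;

            bdi->wb_congested = kzalloc(sizeof(*bdi->wb_congested), GFP_KERNEL);
            if (!bdi->wb_congested)
                    return -ENOMEM;
            atomic_set(&bdi->wb_congested->refcnt, 1);      /* explicit base ref */

            err = wb_init(&bdi->wb, bdi, 1, GFP_KERNEL);
            if (err) {
                    /* drop the base ref instead of kfree(): frees exactly once */
                    wb_congested_put(bdi->wb_congested);
                    return err;
            }
            return 0;
    }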
     

02 Feb, 2017

1 commit

  • Instead of storing backing_dev_info inside struct request_queue,
    allocate it dynamically, reference count it, and free it when the last
    reference is dropped. Currently only request_queue holds the reference
    but in the following patch we add other users referencing
    backing_dev_info.

    Signed-off-by: Jan Kara
    Signed-off-by: Jens Axboe

    Jan Kara
     

08 Nov, 2016

1 commit


05 Aug, 2016

1 commit

  • The name for a bdi of a gendisk is derived from the gendisk's devt.
    However, since the gendisk is destroyed before the bdi it leaves a
    window where a new gendisk could dynamically reuse the same devt while a
    bdi with the same name is still live. Arrange for the bdi to hold a
    reference against its "owner" disk device while it is registered (see
    the sketch at the end of this entry). Otherwise we can hit sysfs
    duplicate name collisions like the following:

    WARNING: CPU: 10 PID: 2078 at fs/sysfs/dir.c:31 sysfs_warn_dup+0x64/0x80
    sysfs: cannot create duplicate filename '/devices/virtual/bdi/259:1'

    Hardware name: HP ProLiant DL580 Gen8, BIOS P79 05/06/2015
    0000000000000286 0000000002c04ad5 ffff88006f24f970 ffffffff8134caec
    ffff88006f24f9c0 0000000000000000 ffff88006f24f9b0 ffffffff8108c351
    0000001f0000000c ffff88105d236000 ffff88105d1031e0 ffff8800357427f8
    Call Trace:
    [] dump_stack+0x63/0x87
    [] __warn+0xd1/0xf0
    [] warn_slowpath_fmt+0x5f/0x80
    [] sysfs_warn_dup+0x64/0x80
    [] sysfs_create_dir_ns+0x7e/0x90
    [] kobject_add_internal+0xaa/0x320
    [] ? vsnprintf+0x34e/0x4d0
    [] kobject_add+0x75/0xd0
    [] ? mutex_lock+0x12/0x2f
    [] device_add+0x125/0x610
    [] device_create_groups_vargs+0xd8/0x100
    [] device_create_vargs+0x1c/0x20
    [] bdi_register+0x8c/0x180
    [] bdi_register_dev+0x27/0x30
    [] add_disk+0x175/0x4a0

    Cc:
    Reported-by: Yi Zhang
    Tested-by: Yi Zhang
    Signed-off-by: Dan Williams

    Fixed up missing 0 return in bdi_register_owner().

    Signed-off-by: Jens Axboe

    Dan Williams
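    A hedged sketch of the ownership described above (helper names invented;
    upstream adds bdi_register_owner() and drops the reference in
    bdi_unregister()):

    static void bdi_set_owner(struct backing_dev_info *bdi, struct device *owner)
    {
            WARN_ON_ONCE(bdi->owner);
            bdi->owner = owner;
            get_device(owner);              /* pin the disk device */
    }

    static void bdi_release_owner(struct backing_dev_info *bdi)
    {
            if (bdi->owner) {
                    put_device(bdi->owner); /* devt may now be reused safely */
                    bdi->owner = NULL;
            }
    }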
     

29 Jul, 2016

1 commit

  • This moves the LRU lists from the zone to the node and related data such
    as counters, tracing, congestion tracking and writeback tracking.

    Unfortunately, due to reclaim and compaction retry logic, it is
    necessary to account for the number of LRU pages on both zone and node
    logic. Most reclaim logic is based on the node counters but the retry
    logic uses the zone counters which do not distinguish inactive and
    active sizes. It would be possible to leave the LRU counters on a
    per-zone basis but it's a heavier calculation across multiple cache
    lines that is much more frequent than the retry checks.

    Other than the LRU counters, this is mostly a mechanical patch but note
    that it introduces a number of anomalies. For example, the scans are
    per-zone but using per-node counters. We also mark a node as congested
    when a zone is congested. This causes weird problems that are fixed
    later but is easier to review.

    In the event that there is excessive overhead on 32-bit systems due to
    the nodes being on the LRU, there are two potential solutions:

    1. Long-term isolation of highmem pages when reclaim is lowmem

    When pages are skipped, they are immediately added back onto the LRU
    list. If lowmem reclaim persisted for long periods of time, the same
    highmem pages get continually scanned. The idea would be that lowmem
    keeps those pages on a separate list until a reclaim for highmem pages
    arrives that splices the highmem pages back onto the LRU. It potentially
    could be implemented similar to the UNEVICTABLE list.

    That would reduce the skip rate, with the potential corner case being
    that highmem pages have to be scanned and reclaimed to free lowmem slab
    pages.

    2. Linear scan lowmem pages if the initial LRU shrink fails

    This will break LRU ordering but may be preferable and faster during
    memory pressure than skipping LRU pages.

    Link: http://lkml.kernel.org/r/1467970510-21195-4-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Johannes Weiner
    Acked-by: Vlastimil Babka
    Cc: Hillf Danton
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

21 May, 2016

1 commit

  • wait_iff_congested has been used to throttle the allocator before it
    retried another round of direct reclaim to allow the writeback to make
    some progress and prevent reclaim from looping over dirty/writeback
    pages without making any progress.

    We used to do congestion_wait before commit 0e093d99763e ("writeback: do
    not sleep on the congestion queue if there are no congested BDIs or if
    significant congestion is not being encountered in the current zone")
    but that led to undesirable stalls and sleeping for the full timeout
    even when the BDI wasn't congested. Hence wait_iff_congested was used
    instead.

    But it seems that even wait_iff_congested doesn't work as expected. We
    might have a small file LRU list with all pages dirty/writeback and yet
    the bdi is not congested, so this is just a cond_resched in the end and
    can end up triggering premature OOM.

    This patch replaces the unconditional wait_iff_congested by
    congestion_wait which is executed only if we _know_ that the last round
    of direct reclaim didn't make any progress and dirty+writeback pages are
    more than a half of the reclaimable pages on the zone which might be
    usable for our target allocation. This shouldn't reintroduce stalls
    fixed by 0e093d99763e because congestion_wait is called only when we are
    getting hopeless when sleeping is a better choice than OOM with many
    pages under IO.

    We have to preserve logic introduced by commit 373ccbe59270 ("mm,
    vmstat: allow WQ concurrency to discover memory reclaim doesn't make any
    progress") into the __alloc_pages_slowpath now that wait_iff_congested
    is not used anymore. As the only remaining user of wait_iff_congested
    is shrink_inactive_list, we can remove the WQ specific short sleep from
    wait_iff_congested because the sleep needs to be done only once in
    the allocation retry cycle.

    [mhocko@suse.com: high_zoneidx->ac_classzone_idx to evaluate memory reserves properly]
    Link: http://lkml.kernel.org/r/1463051677-29418-2-git-send-email-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: Hillf Danton
    Cc: David Rientjes
    Cc: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: Mel Gorman
    Cc: Tetsuo Handa
    Cc: Vladimir Davydov
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

01 Apr, 2016

1 commit


18 Mar, 2016

1 commit

  • Most of the mm subsystem uses pr_<level> so make it consistent (see the
    sketch at the end of this entry).

    Miscellanea:

    - Realign arguments
    - Add missing newline to format
    - kmemleak-test.c has a "kmemleak: " prefix added to the
    "Kmemleak testing" logging message via pr_fmt

    Signed-off-by: Joe Perches
    Acked-by: Tejun Heo [percpu]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Perches
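    The pr_fmt mechanism mentioned for kmemleak-test.c works roughly like
    this (function name and placement are illustrative):

    /* defined before any includes so printk.h picks it up for every pr_*() */
    #define pr_fmt(fmt) "kmemleak: " fmt

    #include <linux/printk.h>

    static void kmemleak_test_banner(void)
    {
            pr_info("Kmemleak testing\n");  /* logs "kmemleak: Kmemleak testing" */
    }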
     

12 Feb, 2016

1 commit

  • We need to use post-decrement to get percpu_counter_destroy() called on
    &wb->stat[0]. Moreover, the pre-decrement would cause infinite
    out-of-bounds accesses if the setup code failed at i==0 (see the sketch
    at the end of this entry).

    Signed-off-by: Rasmus Villemoes
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rasmus Villemoes
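    For illustration (the wrapper function is invented; upstream this is the
    error path inside wb_init()):

    static void wb_destroy_stats_up_to(struct bdi_writeback *wb, int i)
    {
            /* i is the index at which percpu_counter_init() failed */
            while (i--)             /* post-decrement: also reaches stat[0] */
                    percpu_counter_destroy(&wb->stat[i]);
            /* the old "while (--i)" skipped stat[0], and for i == 0 it
             * wrapped around and walked off the end of the array */
    }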
     

06 Feb, 2016

1 commit

  • Jan Stancek has reported that the system occasionally hangs after the
    "oom01" testcase from LTP triggers OOM. Guessing from the result that
    there is a kworker thread doing memory allocation and that the values
    between "Node 0 Normal free:" and "Node 0 Normal:" differ when hanging,
    vmstat is not up-to-date for some reason.

    According to commit 373ccbe59270 ("mm, vmstat: allow WQ concurrency to
    discover memory reclaim doesn't make any progress"), it was meant to
    force the kworker thread to take a short sleep, but by error it used
    schedule_timeout(1). We missed that schedule_timeout() in state
    TASK_RUNNING doesn't do anything.

    Fix it by using schedule_timeout_uninterruptible(1) (see the sketch at
    the end of this entry), which forces the kworker thread to take a short
    sleep in order to make sure that vmstat is up-to-date.

    Fixes: 373ccbe59270 ("mm, vmstat: allow WQ concurrency to discover memory reclaim doesn't make any progress")
    Signed-off-by: Tetsuo Handa
    Reported-by: Jan Stancek
    Acked-by: Michal Hocko
    Cc: Tejun Heo
    Cc: Cristopher Lameter
    Cc: Joonsoo Kim
    Cc: Arkadiusz Miskiewicz
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tetsuo Handa
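    The change itself is a one-liner, roughly:

    /* before (buggy): the task is TASK_RUNNING, so this returns immediately
     * and no sleep ever happens */
    schedule_timeout(1);

    /* after: set TASK_UNINTERRUPTIBLE first so a real one-jiffy sleep
     * occurs and the WQ concurrency logic can see the worker sleeping */
    schedule_timeout_uninterruptible(1);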
     

15 Jan, 2016

1 commit

  • A later patch will need this symbol in files other than memcontrol.c, so
    export it now and replace mem_cgroup_root_css at the same time.

    Signed-off-by: Johannes Weiner
    Acked-by: Michal Hocko
    Acked-by: David S. Miller
    Reviewed-by: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

13 Dec, 2015

1 commit

  • Tetsuo Handa has reported that the system might basically livelock in
    OOM condition without triggering the OOM killer.

    The issue is caused by internal dependency of the direct reclaim on
    vmstat counter updates (via zone_reclaimable) which are performed from
    the workqueue context. If all the current workers get assigned to an
    allocation request, though, they will be looping inside the allocator
    trying to reclaim memory but zone_reclaimable can see stalled numbers so
    it will consider a zone reclaimable even though it has been scanned way
    too much. The WQ concurrency logic will not consider this situation as a
    congested workqueue, because it relies on the assumption that a worker
    would have to sleep in such a situation. This also means that it doesn't
    try to spawn new workers or invoke the rescuer thread if one is assigned
    to the queue.

    In order to fix this issue we need to do two things. First we have to
    let the WQ concurrency code know that we are in trouble, so we have to
    do a short sleep. In order to prevent issues handled by 0e093d99763e
    ("writeback: do not sleep on the congestion queue if there are no
    congested BDIs or if significant congestion is not being encountered in
    the current zone") we limit the sleep only to worker threads, which are
    the ones of interest anyway.

    The second thing to do is to create a dedicated workqueue for vmstat and
    mark it WQ_MEM_RECLAIM to note that it participates in the reclaim and
    to have a spare worker thread for it (see the sketch at the end of this
    entry).

    Signed-off-by: Michal Hocko
    Reported-by: Tetsuo Handa
    Cc: Tejun Heo
    Cc: Cristopher Lameter
    Cc: Joonsoo Kim
    Cc: Arkadiusz Miskiewicz
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
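    A sketch of the dedicated vmstat workqueue (flags abbreviated, init-hook
    name illustrative):

    static struct workqueue_struct *vmstat_wq;

    static int __init vmstat_wq_init(void)
    {
            /* WQ_MEM_RECLAIM guarantees a rescuer thread, so vmstat updates
             * can make progress even when all kworkers are stuck in reclaim */
            vmstat_wq = alloc_workqueue("vmstat", WQ_FREEZABLE | WQ_MEM_RECLAIM, 0);
            if (!vmstat_wq)
                    return -ENOMEM;
            return 0;
    }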
     

07 Nov, 2015

1 commit

  • …d avoiding waking kswapd

    __GFP_WAIT has been used to identify atomic context in callers that hold
    spinlocks or are in interrupts. They are expected to be high priority and
    have access one of two watermarks lower than "min" which can be referred
    to as the "atomic reserve". __GFP_HIGH users get access to the first
    lower watermark and can be called the "high priority reserve".

    Over time, callers had a requirement to not block when fallback options
    were available. Some have abused __GFP_WAIT, leading to a situation where
    an optimistic allocation with a fallback option can access atomic
    reserves.

    This patch uses __GFP_ATOMIC to identify callers that are truly atomic,
    cannot sleep and have no alternative. High priority users continue to use
    __GFP_HIGH. __GFP_DIRECT_RECLAIM identifies callers that can sleep and
    are willing to enter direct reclaim. __GFP_KSWAPD_RECLAIM identifies
    callers that want to wake kswapd for background reclaim. __GFP_WAIT is
    redefined as a caller that is willing to enter direct reclaim and wake
    kswapd for background reclaim.

    This patch then converts a number of sites

    o __GFP_ATOMIC is used by callers that are high priority and have memory
    pools for those requests. GFP_ATOMIC uses this flag.

    o Callers that have a limited mempool to guarantee forward progress clear
    __GFP_DIRECT_RECLAIM but keep __GFP_KSWAPD_RECLAIM. bio allocations fall
    into this category where kswapd will still be woken but atomic reserves
    are not used as there is a one-entry mempool to guarantee progress.

    o Callers that are checking if they are non-blocking should use the
    helper gfpflags_allow_blocking() where possible (see the sketch at the
    end of this entry). This is because checking for __GFP_WAIT as was done
    historically can now trigger false positives. Some exceptions like
    dm-crypt.c exist where the code intent is clearer if __GFP_DIRECT_RECLAIM
    is used instead of the helper due to flag manipulations.

    o Callers that built their own GFP flags instead of starting with GFP_KERNEL
    and friends now also need to specify __GFP_KSWAPD_RECLAIM.

    The first key hazard to watch out for is callers that removed __GFP_WAIT
    and were depending on access to atomic reserves for inconspicuous
    reasons. In some cases it may be appropriate for them to use __GFP_HIGH.

    The second key hazard is callers that assembled their own combination of
    GFP flags instead of starting with something like GFP_KERNEL. They may
    now wish to specify __GFP_KSWAPD_RECLAIM. It's almost certainly harmless
    if it's missed in most cases as other activity will wake kswapd.

    Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Acked-by: Michal Hocko <mhocko@suse.com>
    Acked-by: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Christoph Lameter <cl@linux.com>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Vitaly Wool <vitalywool@gmail.com>
    Cc: Rik van Riel <riel@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

    Mel Gorman
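    A sketch of the conversion pattern for callers that check whether they
    may block (function and variable names are illustrative):

    static void example_path(gfp_t gfp_mask)
    {
            /* historically: if (!(gfp_mask & __GFP_WAIT)) ... */
            if (!gfpflags_allow_blocking(gfp_mask)) {
                    /* cannot sleep: take the non-blocking fallback */
                    return;
            }
            /* may enter direct reclaim / sleep from here on */
            cond_resched();
    }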
     

21 Oct, 2015

1 commit

  • a20135ffbc44 ("writeback: don't drain bdi_writeback_congested on bdi
    destruction") added rbtree_postorder_for_each_entry_safe() which is
    used to remove all entries; however, according to Cody, the iterator
    isn't safe against operations which may rebalance the tree. Fix it by
    switching to repeatedly removing rb_first() until the tree is empty (see
    the sketch at the end of this entry).

    Signed-off-by: Tejun Heo
    Reported-by: Cody P Schafer
    Fixes: a20135ffbc44 ("writeback: don't drain bdi_writeback_congested on bdi destruction")
    Link: http://lkml.kernel.org/g/1443997973-1700-1-git-send-email-dev@codyps.com
    Signed-off-by: Jens Axboe

    Tejun Heo
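    Inside cgwb_bdi_destroy() the loop becomes, roughly:

    struct rb_node *rbn;

    /* erasing the leftmost node is safe however rb_erase() rebalances */
    while ((rbn = rb_first(&bdi->cgwb_congested_tree))) {
            struct bdi_writeback_congested *congested =
                    rb_entry(rbn, struct bdi_writeback_congested, rb_node);

            rb_erase(rbn, &bdi->cgwb_congested_tree);
            congested->bdi = NULL;  /* mark it no longer linked to the bdi */
    }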
     

15 Oct, 2015

1 commit

  • bdi's are initialized in two steps, bdi_init() and bdi_register(), but
    destroyed in a single step by bdi_destroy() which, for a bdi embedded
    in a request_queue, is called during blk_cleanup_queue() which makes
    the queue invisible and starts the draining of remaining usages.

    A request_queue's user can access the congestion state of the embedded
    bdi as long as it holds a reference to the queue. As such, it may
    access the congested state of a queue which finished
    blk_cleanup_queue() but hasn't reached blk_release_queue() yet.
    Because the congested state was embedded in backing_dev_info which in
    turn is embedded in request_queue, accessing the congested state after
    bdi_destroy() was called was fine. The bdi was destroyed but the
    memory region for the congested state remained accessible till the
    queue got released.

    a13f35e87140 ("writeback: don't embed root bdi_writeback_congested in
    bdi_writeback") changed the situation. Now, the root congested state
    which is expected to be pinned while request_queue remains accessible
    is separately reference counted and the base ref is put during
    bdi_destroy(). This means that the root congested state may go away
    prematurely while the queue is between bdi_destroy() and
    blk_cleanup_queue(), which was detected by Andrey's KASAN tests.

    The root cause of this problem is that bdi doesn't distinguish the two
    steps of destruction, unregistration and release, and now the root
    congested state actually requires a separate release step. To fix the
    issue, this patch separates out bdi_unregister() and bdi_exit() from
    bdi_destroy(). bdi_unregister() is called from blk_cleanup_queue()
    and bdi_exit() from blk_release_queue(). bdi_destroy() is now just a
    simple wrapper calling the two steps back-to-back (see the sketch at the
    end of this entry).

    While at it, the prototype of bdi_destroy() is moved right below
    bdi_setup_and_register() so that the counterpart operations are
    located together.

    Signed-off-by: Tejun Heo
    Fixes: a13f35e87140 ("writeback: don't embed root bdi_writeback_congested in bdi_writeback")
    Cc: stable@vger.kernel.org # v4.2+
    Reported-and-tested-by: Andrey Konovalov
    Link: http://lkml.kernel.org/g/CAAeHK+zUJ74Zn17=rOyxacHU18SgCfC6bsYW=6kCY5GXJBwGfQ@mail.gmail.com
    Reviewed-by: Jan Kara
    Reviewed-by: Jeff Moyer
    Signed-off-by: Jens Axboe

    Tejun Heo
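    The resulting wrapper, as described above, is essentially:

    void bdi_destroy(struct backing_dev_info *bdi)
    {
            bdi_unregister(bdi);    /* unregistration: detach visible state */
            bdi_exit(bdi);          /* release: drop the root congested base ref */
    }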
     

13 Oct, 2015

1 commit

  • bdi_for_each_wb() is used in several places to wake up or issue
    writeback work items to all wb's (bdi_writeback's) on a given bdi.
    The iteration is performed by walking bdi->cgwb_tree; however, the
    tree only indexes wb's which are currently active.

    For example, when a memcg gets associated with a different blkcg, the
    old wb is removed from the tree so that the new one can be indexed.
    The old wb starts dying from then on but will linger till all its
    inodes are drained. As these dying wb's may still host dirty inodes,
    writeback operations which affect all wb's must include them.
    bdi_for_each_wb() skipping dying wb's led to sync(2) missing and
    failing to sync the inodes belonging to those wb's.

    This patch adds an RCU protected @bdi->wb_list which lists all wb's
    belonging to that bdi (see the sketch at the end of this entry). wb's are
    added on creation and removed on release rather than on the start of
    destruction. bdi_for_each_wb() usages are replaced with
    list_for_each[_continue]_rcu() iterations over @bdi->wb_list and
    bdi_for_each_wb() and its helpers are removed.

    v2: Updated as per Jan. last_wb ref leak in bdi_split_work_to_wbs()
    fixed and unnecessary list head severing in cgwb_bdi_destroy()
    removed.

    Signed-off-by: Tejun Heo
    Reported-and-tested-by: Artem Bityutskiy
    Fixes: ebe41ab0c79d ("writeback: implement bdi_for_each_wb()")
    Link: http://lkml.kernel.org/g/1443012552.19983.209.camel@gmail.com
    Cc: Jan Kara
    Signed-off-by: Jens Axboe

    Tejun Heo
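    A hedged sketch of walking the new list (simplified; callers that need
    to sleep pin the wb before leaving the RCU section):

    struct bdi_writeback *wb;

    rcu_read_lock();
    list_for_each_entry_rcu(wb, &bdi->wb_list, bdi_node) {
            /* wake up or queue writeback work for @wb, dying or not */
    }
    rcu_read_unlock();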
     

11 Sep, 2015

1 commit

  • Pull blk-cg updates from Jens Axboe:
    "A bit later in the cycle, but this has been in the block tree for a a
    while. This is basically four patchsets from Tejun, that improve our
    buffered cgroup writeback. It was dependent on the other cgroup
    changes, but they went in earlier in this cycle.

    Series 1 is set of 5 patches that has cgroup writeback updates:

    - bdi_writeback iteration fix which could lead to some wb's being
    skipped or repeated during e.g. sync under memory pressure.

    - Simplification of wb work wait mechanism.

    - Writeback tracepoints updated to report cgroup.

    Series 2 is a set of updates for the CFQ cgroup writeback handling:

    cfq has always charged all async IOs to the root cgroup. It didn't
    have much choice as writeback didn't know about cgroups and there
    was no way to tell who to blame for a given writeback IO.
    writeback finally grew support for cgroups and now tags each
    writeback IO with the appropriate cgroup to charge it against.

    This patchset updates cfq so that it follows the blkcg each bio is
    tagged with. Async cfq_queues are now shared across cfq_group,
    which is per-cgroup, instead of per-request_queue cfq_data. This
    makes all IOs follow the weight based IO resource distribution
    implemented by cfq.

    - Switched from GFP_ATOMIC to GFP_NOWAIT as suggested by Jeff.

    - Other misc review points addressed, acks added and rebased.

    Series 3 is the blkcg policy cleanup patches:

    This patchset contains assorted cleanups for blkcg_policy methods
    and blk[c]g_policy_data handling.

    - alloc/free added for blkg_policy_data. exit dropped.

    - alloc/free added for blkcg_policy_data.

    - blk-throttle's async percpu allocation is replaced with direct
    allocation.

    - all methods now take blk[c]g_policy_data instead of blkcg_gq or
    blkcg.

    And finally, series 4 is a set of patches cleaning up the blkcg stats
    handling:

    blkcg's stats have always been somewhat of a mess. This patchset
    tries to improve the situation a bit.

    - The following patches were added to consolidate the blkcg entry point
    and blkg creation. This in itself is an improvement and helps
    collecting common stats on bio issue.

    - per-blkg stats now accounted on bio issue rather than request
    completion so that bio based and request based drivers can behave
    the same way. The issue was spotted by Vivek.

    - cfq-iosched implements custom recursive stats and blk-throttle
    implements custom per-cpu stats. This patchset make blkcg core
    support both by default.

    - cfq-iosched and blk-throttle keep track of the same stats
    multiple times. Unify them"

    * 'for-4.3/blkcg' of git://git.kernel.dk/linux-block: (45 commits)
    blkcg: use CGROUP_WEIGHT_* scale for io.weight on the unified hierarchy
    blkcg: s/CFQ_WEIGHT_*/CFQ_WEIGHT_LEGACY_*/
    blkcg: implement interface for the unified hierarchy
    blkcg: misc preparations for unified hierarchy interface
    blkcg: separate out tg_conf_updated() from tg_set_conf()
    blkcg: move body parsing from blkg_conf_prep() to its callers
    blkcg: mark existing cftypes as legacy
    blkcg: rename subsystem name from blkio to io
    blkcg: refine error codes returned during blkcg configuration
    blkcg: remove unnecessary NULL checks from __cfqg_set_weight_device()
    blkcg: reduce stack usage of blkg_rwstat_recursive_sum()
    blkcg: remove cfqg_stats->sectors
    blkcg: move io_service_bytes and io_serviced stats into blkcg_gq
    blkcg: make blkg_[rw]stat_recursive_sum() to be able to index into blkcg_gq
    blkcg: make blkcg_[rw]stat per-cpu
    blkcg: add blkg_[rw]stat->aux_cnt and replace cfq_group->dead_stats with it
    blkcg: consolidate blkg creation in blkcg_bio_issue_check()
    blk-throttle: improve queue bypass handling
    blkcg: move root blkg lookup optimization from throtl_lookup_tg() to __blkg_lookup()
    blkcg: inline [__]blkg_lookup()
    ...

    Linus Torvalds
     

19 Aug, 2015

1 commit

  • blkio interface has become messy over time and is currently the
    largest. In addition to the inconsistent naming scheme, it has
    multiple stat files which report more or less the same thing, a number
    of debug stat files which expose internal details which shouldn't have
    been part of the public interface in the first place, recursive and
    non-recursive stats and leaf and non-leaf knobs.

    Both recursive vs. non-recursive and leaf vs. non-leaf distinctions
    don't make any sense on the unified hierarchy as only leaf cgroups can
    contain processes. cgroups is going through a major interface
    revision with the unified hierarchy involving significant fundamental
    usage changes and given that a significant portion of the interface
    doesn't make sense anymore, it's a good time to reorganize the
    interface.

    As the first step, this patch renames the external visible subsystem
    name from "blkio" to "io". This is more concise, matches the other
    two major subsystem names, "cpu" and "memory", and better suited as
    blkcg will be involved in anything writeback related too whether an
    actual block device is involved or not.

    As the subsystem legacy_name is set to "blkio", the only userland
    visible change outside the unified hierarchy is that blkcg is reported
    as "io" instead of "blkio" in the subsystem initialized message during
    boot. On the unified hierarchy, blkcg now appears as "io".

    Signed-off-by: Tejun Heo
    Cc: Li Zefan
    Cc: Johannes Weiner
    Cc: cgroups@vger.kernel.org
    Signed-off-by: Jens Axboe

    Tejun Heo
     

18 Aug, 2015

1 commit

  • There's a small consistency problem between the inode and writeback
    naming. Writeback calls the "for IO" inode queues b_io and
    b_more_io, but the inode calls these the "writeback list" or
    i_wb_list. This makes it hard to add a new "under writeback" list to
    the inode, or call it an "under IO" list on the bdi, because either
    way we'll have writeback on IO and IO on writeback and it'll just be
    confusing. I'm getting confused just writing this!

    So, rename the inode "for IO" list variable to i_io_list so we can
    add a new "writeback list" in a subsequent patch.

    Signed-off-by: Dave Chinner
    Signed-off-by: Josef Bacik
    Reviewed-by: Jan Kara
    Reviewed-by: Christoph Hellwig
    Tested-by: Dave Chinner

    Dave Chinner
     

02 Jul, 2015

2 commits

  • 52ebea749aae ("writeback: make backing_dev_info host cgroup-specific
    bdi_writebacks") made bdi (backing_dev_info) host per-cgroup wb's
    (bdi_writeback's). As the congested state needs to be per-wb and
    referenced from blkcg side and multiple wbs, the patch made all
    non-root cong's (bdi_writeback_congested's) reference counted and
    indexed on bdi.

    When a bdi is destroyed, cgwb_bdi_destroy() tries to drain all
    non-root cong's; however, this can hang indefinitely because cong's can
    also be referenced from blkcg_gq's which are destroyed after bdi
    destruction is complete.

    This patch fixes the bug by updating bdi destruction to not wait for
    cong's to drain. A cong is unlinked from bdi->cgwb_congested_tree on
    bdi destruction regardless of its reference count as the bdi may go
    away any point after destruction. wb_congested_put() checks whether
    the cong is already unlinked on release.

    Signed-off-by: Tejun Heo
    Reported-by: Jon Christopherson
    Link: https://bugzilla.kernel.org/show_bug.cgi?id=100681
    Fixes: 52ebea749aae ("writeback: make backing_dev_info host cgroup-specific bdi_writebacks")
    Tested-by: Jon Christopherson
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • 52ebea749aae ("writeback: make backing_dev_info host cgroup-specific
    bdi_writebacks") made bdi (backing_dev_info) host per-cgroup wb's
    (bdi_writeback's). As the congested state needs to be per-wb and
    referenced from blkcg side and multiple wbs, the patch made all
    non-root cong's (bdi_writeback_congested's) reference counted and
    indexed on bdi.

    When a bdi is destroyed, cgwb_bdi_destroy() tries to drain all
    non-root cong's; however, this can hang indefinitely because cong's can
    also be referenced from blkcg_gq's which are destroyed after bdi
    destruction is complete.

    To fix the bug, bdi destruction will be updated to not wait for cong's
    to drain, which naturally means that cong's may outlive the associated
    bdi. This is fine for non-root cong's but is problematic for the root
    cong's which are embedded in their bdi's as they may end up getting
    dereferenced after the containing bdi's are freed.

    This patch makes root cong's behave the same as non-root cong's. They
    are no longer embedded in their bdi's but allocated separately during
    bdi initialization, indexed and reference counted the same way.

    * As cong handling is the same for all wb's, wb->congested
    initialization is moved into wb_init().

    * When !CONFIG_CGROUP_WRITEBACK, there was no indexing or refcnting.
    bdi->wb_congested is now a pointer pointing to the root cong
    allocated during bdi init and minimal refcnting operations are
    implemented.

    * The above makes root wb init paths diverge depending on
    CONFIG_CGROUP_WRITEBACK. root wb init is moved to cgwb_bdi_init().

    This patch in itself shouldn't cause any consequential behavior
    differences but prepares for the actual fix.

    Signed-off-by: Tejun Heo
    Reported-by: Jon Christopherson
    Link: https://bugzilla.kernel.org/show_bug.cgi?id=100681
    Tested-by: Jon Christopherson

    Added include to backing-dev.h for kfree() definition.

    Signed-off-by: Jens Axboe

    Tejun Heo
     

26 Jun, 2015

1 commit

  • Pull cgroup writeback support from Jens Axboe:
    "This is the big pull request for adding cgroup writeback support.

    This code has been in development for a long time, and it has been
    simmering in for-next for a good chunk of this cycle too. This is one
    of those problems that has been talked about for at least half a
    decade, finally there's a solution and code to go with it.

    Also see last weeks writeup on LWN:

    http://lwn.net/Articles/648292/"

    * 'for-4.2/writeback' of git://git.kernel.dk/linux-block: (85 commits)
    writeback, blkio: add documentation for cgroup writeback support
    vfs, writeback: replace FS_CGROUP_WRITEBACK with SB_I_CGROUPWB
    writeback: do foreign inode detection iff cgroup writeback is enabled
    v9fs: fix error handling in v9fs_session_init()
    bdi: fix wrong error return value in cgwb_create()
    buffer: remove unusued 'ret' variable
    writeback: disassociate inodes from dying bdi_writebacks
    writeback: implement foreign cgroup inode bdi_writeback switching
    writeback: add lockdep annotation to inode_to_wb()
    writeback: use unlocked_inode_to_wb transaction in inode_congested()
    writeback: implement unlocked_inode_to_wb transaction and use it for stat updates
    writeback: implement [locked_]inode_to_wb_and_lock_list()
    writeback: implement foreign cgroup inode detection
    writeback: make writeback_control track the inode being written back
    writeback: relocate wb[_try]_get(), wb_put(), inode_{attach|detach}_wb()
    mm: vmscan: disable memcg direct reclaim stalling if cgroup writeback support is in use
    writeback: implement memcg writeback domain based throttling
    writeback: reset wb_domain->dirty_limit[_tstmp] when memcg domain size changes
    writeback: implement memcg wb_domain
    writeback: update wb_over_bg_thresh() to use wb_domain aware operations
    ...

    Linus Torvalds