10 Aug, 2016

1 commit

  • I've found a funny live-lock between raid10 barriers during resync and
    memory controller hard limits. Inside mpage_readpages() the task holds on
    to its plug bio, which blocks the barrier in raid10. Its memory cgroup has
    no free memory, so the task goes into the reclaimer, but all reclaimable
    pages are dirty and cannot be written because raid10 is rebuilding and is
    stuck on the barrier.

    The common flush of such IO in schedule() never happens, because the
    caller doesn't go to sleep.

    The lock is 'live' because changing the memory limit or killing the tasks
    which hold the stuck bio unblocks the whole thing.

    That was what happened on 3.18.x, but I see no difference in the upstream
    logic. Theoretically this might happen even without a memory cgroup.

    Signed-off-by: Konstantin Khlebnikov
    Signed-off-by: Jens Axboe

    Konstantin Khlebnikov
     

05 Aug, 2016

1 commit

  • Currently we take care to handle I_DIRTY_TIME in vfs_fsync() and
    queue_io() so that inodes which have only dirty timestamps are properly
    written out on fsync(2) and sync(2). However, there are other call sites -
    most notably those going through write_inode_now() - which expect the
    inode to be clean after WB_SYNC_ALL writeback. This is not currently true,
    as we do not clear I_DIRTY_TIME in __writeback_single_inode() in all
    cases, even for WB_SYNC_ALL writeback. This then resulted in the
    following oops because bdev_write_inode() did not clean the inode and
    writeback code later stumbled over a dirty inode with detached wb.

    general protection fault: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN
    Modules linked in:
    CPU: 3 PID: 32 Comm: kworker/u10:1 Not tainted 4.6.0-rc3+ #349
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
    Workqueue: writeback wb_workfn (flush-11:0)
    task: ffff88006ccf1840 ti: ffff88006cda8000 task.ti: ffff88006cda8000
    RIP: 0010:[] []
    locked_inode_to_wb_and_lock_list+0xa2/0x750
    RSP: 0018:ffff88006cdaf7d0 EFLAGS: 00010246
    RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff88006ccf2050
    RDX: 0000000000000000 RSI: 000000114c8a8484 RDI: 0000000000000286
    RBP: ffff88006cdaf820 R08: ffff88006ccf1840 R09: 0000000000000000
    R10: 000229915090805f R11: 0000000000000001 R12: ffff88006a72f5e0
    R13: dffffc0000000000 R14: ffffed000d4e5eed R15: ffffffff8830cf40
    FS: 0000000000000000(0000) GS:ffff88006d500000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 0000000003301bf8 CR3: 000000006368f000 CR4: 00000000000006e0
    DR0: 0000000000001ec9 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000600
    Stack:
    ffff88006a72f680 ffff88006a72f768 ffff8800671230d8 03ff88006cdaf948
    ffff88006a72f668 ffff88006a72f5e0 ffff8800671230d8 ffff88006cdaf948
    ffff880065b90cc8 ffff880067123100 ffff88006cdaf970 ffffffff8188e12e
    Call Trace:
    [< inline >] inode_to_wb_and_lock_list fs/fs-writeback.c:309
    [] writeback_sb_inodes+0x4de/0x1250 fs/fs-writeback.c:1554
    [] __writeback_inodes_wb+0x104/0x1e0 fs/fs-writeback.c:1600
    [] wb_writeback+0x7ce/0xc90 fs/fs-writeback.c:1709
    [< inline >] wb_do_writeback fs/fs-writeback.c:1844
    [] wb_workfn+0x2f9/0x1000 fs/fs-writeback.c:1884
    [] process_one_work+0x78e/0x15c0 kernel/workqueue.c:2094
    [] worker_thread+0xdb/0xfc0 kernel/workqueue.c:2228
    [] kthread+0x23f/0x2d0 drivers/block/aoe/aoecmd.c:1303
    [] ret_from_fork+0x22/0x50 arch/x86/entry/entry_64.S:392
    Code: 05 94 4a a8 06 85 c0 0f 85 03 03 00 00 e8 07 15 d0 ff 41 80 3e
    00 0f 85 64 06 00 00 49 8b 9c 24 88 01 00 00 48 89 d8 48 c1 e8 03
    80 3c 28 00 0f 85 17 06 00 00 48 8b 03 48 83 c0 50 48 39 c3
    RIP [< inline >] wb_get include/linux/backing-dev-defs.h:212
    RIP [] locked_inode_to_wb_and_lock_list+0xa2/0x750
    fs/fs-writeback.c:281
    RSP
    ---[ end trace 986a4d314dcb2694 ]---

    Fix the problem by making sure __writeback_single_inode() also writes an
    inode that has only dirty timestamps when running in WB_SYNC_ALL mode.

    Reported-by: Dmitry Vyukov
    Tested-by: Laurent Dufour
    Signed-off-by: Jan Kara
    Signed-off-by: Jens Axboe

    Jan Kara
     

29 Jul, 2016

1 commit

  • There are now a number of accounting oddities, such as mapped file pages
    being accounted for on the node while the total number of file pages is
    accounted on the zone. This can be coped with to some extent, but it's
    confusing, so this patch moves the relevant file-based accounting. Due to
    throttling logic in the page allocator for reliable OOM detection, it is
    still necessary to track dirty and writeback pages on a per-zone basis.

    [mgorman@techsingularity.net: fix NR_ZONE_WRITE_PENDING accounting]
    Link: http://lkml.kernel.org/r/1468404004-5085-5-git-send-email-mgorman@techsingularity.net
    Link: http://lkml.kernel.org/r/1467970510-21195-20-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Cc: Hillf Danton
    Acked-by: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: Minchan Kim
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

27 Jul, 2016

2 commits

  • The per-sb inode writeback list tracks inodes currently under writeback
    to facilitate efficient sync processing. In particular, it ensures that
    sync only needs to walk through a list of inodes that were cleaned by
    the sync.

    Add a couple of tracepoints to help identify when inodes are added to and
    removed from the writeback lists. Piggyback on the writeback lazytime
    tracepoint template, as it already tracks the relevant inode information.

    Link: http://lkml.kernel.org/r/1466594593-6757-3-git-send-email-bfoster@redhat.com
    Signed-off-by: Brian Foster
    Reviewed-by: Jan Kara
    Cc: Dave Chinner
    cc: Josef Bacik
    Cc: Holger Hoffstätte
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Brian Foster
     
  • wait_sb_inodes() currently does a walk of all inodes in the filesystem
    to find dirty ones to wait on during sync. This is highly inefficient
    and wastes a lot of CPU when there are lots of clean cached inodes that
    we don't need to wait on.

    To avoid this "all inode" walk, we need to track inodes that are
    currently under writeback that we need to wait for. We do this by
    adding inodes to a writeback list on the sb when the mapping is first
    tagged as having pages under writeback. wait_sb_inodes() can then walk
    this list of "inodes under IO" and wait specifically just for the inodes
    that the current sync(2) needs to wait for.

    Define a couple of helpers to add/remove an inode to/from the writeback
    list and call them when the overall mapping is tagged for or cleared from
    writeback. Update wait_sb_inodes() to walk only the inodes under
    writeback due to the sync.
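
    A minimal sketch of what such a pair of helpers can look like, assuming
    an sb-level list head (s_inodes_wb), a dedicated spinlock
    (s_inode_wblist_lock) and a per-inode list node (i_wb_list); the field
    names are illustrative and follow the merged code only as far as I
    recall:

    void sb_mark_inode_writeback(struct inode *inode)
    {
            struct super_block *sb = inode->i_sb;
            unsigned long flags;

            /* take the lock only on the "first page under writeback" transition */
            if (list_empty(&inode->i_wb_list)) {
                    spin_lock_irqsave(&sb->s_inode_wblist_lock, flags);
                    if (list_empty(&inode->i_wb_list))
                            list_add_tail(&inode->i_wb_list, &sb->s_inodes_wb);
                    spin_unlock_irqrestore(&sb->s_inode_wblist_lock, flags);
            }
    }

    void sb_clear_inode_writeback(struct inode *inode)
    {
            struct super_block *sb = inode->i_sb;
            unsigned long flags;

            /* called when the last page under writeback on the mapping clears */
            if (!list_empty(&inode->i_wb_list)) {
                    spin_lock_irqsave(&sb->s_inode_wblist_lock, flags);
                    list_del_init(&inode->i_wb_list);
                    spin_unlock_irqrestore(&sb->s_inode_wblist_lock, flags);
            }
    }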

    With this change, filesystem sync times are significantly reduced for
    fs' with largely populated inode caches and otherwise no other work to
    do. For example, on a 16xcpu 2GHz x86-64 server, 10TB XFS filesystem
    with a ~10m entry inode cache, sync times are reduced from ~7.3s to less
    than 0.1s when the filesystem is fully clean.

    Link: http://lkml.kernel.org/r/1466594593-6757-2-git-send-email-bfoster@redhat.com
    Signed-off-by: Dave Chinner
    Signed-off-by: Josef Bacik
    Signed-off-by: Brian Foster
    Reviewed-by: Jan Kara
    Tested-by: Holger Hoffstätte
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Chinner
     

01 Jul, 2016

1 commit

  • Asynchronous wb switching of inodes takes an additional ref count on an
    inode to make sure the inode remains valid until the switchover is
    completed.

    However, anyone calling ihold() must already have a ref count on the
    inode, and in this case inode->i_count may already be zero:

    ------------[ cut here ]------------
    WARNING: CPU: 1 PID: 917 at fs/inode.c:397 ihold+0x2b/0x30
    CPU: 1 PID: 917 Comm: kworker/u4:5 Not tainted 4.7.0-rc2+ #49
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs
    01/01/2011
    Workqueue: writeback wb_workfn (flush-8:16)
    0000000000000000 ffff88007ca0fb58 ffffffff805990af 0000000000000000
    0000000000000000 ffff88007ca0fb98 ffffffff80268702 0000018d000004e2
    ffff88007cef40e8 ffff88007c9b89a8 ffff880079e3a740 0000000000000003
    Call Trace:
    [] dump_stack+0x4d/0x6e
    [] __warn+0xc2/0xe0
    [] warn_slowpath_null+0x18/0x20
    [] ihold+0x2b/0x30
    [] inode_switch_wbs+0x11c/0x180
    [] wbc_detach_inode+0x170/0x1a0
    [] writeback_sb_inodes+0x21c/0x530
    [] wb_writeback+0xee/0x1e0
    [] wb_workfn+0xd7/0x280
    [] ? try_to_wake_up+0x1b1/0x2b0
    [] process_one_work+0x129/0x300
    [] worker_thread+0x126/0x480
    [] ? __schedule+0x1c7/0x561
    [] ? process_one_work+0x300/0x300
    [] kthread+0xc4/0xe0
    [] ? kfree+0xc8/0x100
    [] ret_from_fork+0x1f/0x40
    [] ? __kthread_parkme+0x70/0x70
    ---[ end trace aaefd2fd9f306bc4 ]---

    Signed-off-by: Tahsin Erdogan
    Acked-by: Tejun Heo
    Reviewed-by: Jan Kara
    Signed-off-by: Jens Axboe

    Tahsin Erdogan
     

21 May, 2016

1 commit

  • When a writeback operation cannot make forward progress because the memory
    allocation requests needed for doing I/O cannot be satisfied (e.g.
    under an OOM-livelock situation), we can observe a flood of order-0 page
    allocation failure messages caused by the complete depletion of the
    memory reserves.

    This is caused by unconditionally allocating "struct wb_writeback_work"
    objects using GFP_ATOMIC from PF_MEMALLOC context.

    __alloc_pages_nodemask() {
      __alloc_pages_slowpath() {
        __alloc_pages_direct_reclaim() {
          __perform_reclaim() {
            current->flags |= PF_MEMALLOC;
            try_to_free_pages() {
              do_try_to_free_pages() {
                wakeup_flusher_threads() {
                  wb_start_writeback() {
                    kzalloc(sizeof(*work), GFP_ATOMIC) {
                      /* ALLOC_NO_WATERMARKS via PF_MEMALLOC */
                    }
                  }
                }
              }
            }
            current->flags &= ~PF_MEMALLOC;
          }
        }
      }
    }

    Since I/O is stalled, allocating writeback requests forever will
    eventually deplete the memory reserves. Fortunately, wb_start_writeback()
    can fall back to wb_wakeup() when allocating a "struct wb_writeback_work"
    fails, so we don't need to allow wb_start_writeback() to use the memory
    reserves.
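
    A sketch of the resulting hunk in wb_start_writeback(); the GFP flags
    below reflect the fix as I recall it and should be treated as
    illustrative:

    work = kzalloc(sizeof(*work),
                   GFP_NOWAIT | __GFP_NOMEMALLOC | __GFP_NOWARN);
    if (!work) {
            /*
             * No access to the memory reserves and no warning flood; on
             * failure just wake the flusher thread, which writes back
             * dirty pages without needing a work item.
             */
            wb_wakeup(wb);
            return;
    }
    /* otherwise fill in *work and queue it as before */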

    Mem-Info:
    active_anon:289393 inactive_anon:2093 isolated_anon:29
    active_file:10838 inactive_file:113013 isolated_file:859
    unevictable:0 dirty:108531 writeback:5308 unstable:0
    slab_reclaimable:5526 slab_unreclaimable:7077
    mapped:9970 shmem:2159 pagetables:2387 bounce:0
    free:3042 free_pcp:0 free_cma:0
    Node 0 DMA free:6968kB min:44kB low:52kB high:64kB active_anon:6056kB inactive_anon:176kB active_file:712kB inactive_file:744kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15988kB managed:15904kB mlocked:0kB dirty:756kB writeback:0kB mapped:736kB shmem:184kB slab_reclaimable:48kB slab_unreclaimable:208kB kernel_stack:160kB pagetables:144kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:9708 all_unreclaimable? yes
    lowmem_reserve[]: 0 1732 1732 1732
    Node 0 DMA32 free:5200kB min:5200kB low:6500kB high:7800kB active_anon:1151516kB inactive_anon:8196kB active_file:42640kB inactive_file:451076kB unevictable:0kB isolated(anon):116kB isolated(file):3564kB present:2080640kB managed:1775332kB mlocked:0kB dirty:433368kB writeback:21232kB mapped:39144kB shmem:8452kB slab_reclaimable:22056kB slab_unreclaimable:28100kB kernel_stack:20976kB pagetables:9404kB unstable:0kB bounce:0kB free_pcp:120kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:2701604 all_unreclaimable? no
    lowmem_reserve[]: 0 0 0 0
    Node 0 DMA: 25*4kB (UME) 16*8kB (UME) 3*16kB (UE) 5*32kB (UME) 2*64kB (UM) 2*128kB (ME) 2*256kB (ME) 1*512kB (E) 1*1024kB (E) 2*2048kB (ME) 0*4096kB = 6964kB
    Node 0 DMA32: 925*4kB (UME) 140*8kB (UME) 5*16kB (ME) 5*32kB (M) 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 5060kB
    Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
    Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
    126847 total pagecache pages
    0 pages in swap cache
    Swap cache stats: add 0, delete 0, find 0/0
    Free swap = 0kB
    Total swap = 0kB
    524157 pages RAM
    0 pages HighMem/MovableOnly
    76348 pages reserved
    0 pages hwpoisoned
    Out of memory: Kill process 4450 (file_io.00) score 998 or sacrifice child
    Killed process 4450 (file_io.00) total-vm:4308kB, anon-rss:100kB, file-rss:1184kB, shmem-rss:0kB
    kthreadd: page allocation failure: order:0, mode:0x2200020
    file_io.00: page allocation failure: order:0, mode:0x2200020
    CPU: 0 PID: 4457 Comm: file_io.00 Not tainted 4.5.0-rc7+ #45
    Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/31/2013
    Call Trace:
    warn_alloc_failed+0xf7/0x150
    __alloc_pages_nodemask+0x23f/0xa60
    alloc_pages_current+0x87/0x110
    new_slab+0x3a1/0x440
    ___slab_alloc+0x3cf/0x590
    __slab_alloc.isra.64+0x18/0x1d
    kmem_cache_alloc+0x11c/0x150
    wb_start_writeback+0x39/0x90
    wakeup_flusher_threads+0x7f/0xf0
    do_try_to_free_pages+0x1f9/0x410
    try_to_free_pages+0x94/0xc0
    __alloc_pages_nodemask+0x566/0xa60
    alloc_pages_current+0x87/0x110
    __page_cache_alloc+0xaf/0xc0
    pagecache_get_page+0x88/0x260
    grab_cache_page_write_begin+0x21/0x40
    xfs_vm_write_begin+0x2f/0xf0
    generic_perform_write+0xca/0x1c0
    xfs_file_buffered_aio_write+0xcc/0x1f0
    xfs_file_write_iter+0x84/0x140
    __vfs_write+0xc7/0x100
    vfs_write+0x9d/0x190
    SyS_write+0x50/0xc0
    entry_SYSCALL_64_fastpath+0x12/0x6a
    Mem-Info:
    active_anon:293335 inactive_anon:2093 isolated_anon:0
    active_file:10829 inactive_file:110045 isolated_file:32
    unevictable:0 dirty:109275 writeback:822 unstable:0
    slab_reclaimable:5489 slab_unreclaimable:10070
    mapped:9999 shmem:2159 pagetables:2420 bounce:0
    free:3 free_pcp:0 free_cma:0
    Node 0 DMA free:12kB min:44kB low:52kB high:64kB active_anon:6060kB inactive_anon:176kB active_file:708kB inactive_file:756kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15988kB managed:15904kB mlocked:0kB dirty:756kB writeback:0kB mapped:736kB shmem:184kB slab_reclaimable:48kB slab_unreclaimable:7160kB kernel_stack:160kB pagetables:144kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:9844 all_unreclaimable? yes
    lowmem_reserve[]: 0 1732 1732 1732
    Node 0 DMA32 free:0kB min:5200kB low:6500kB high:7800kB active_anon:1167280kB inactive_anon:8196kB active_file:42608kB inactive_file:439424kB unevictable:0kB isolated(anon):0kB isolated(file):128kB present:2080640kB managed:1775332kB mlocked:0kB dirty:436344kB writeback:3288kB mapped:39260kB shmem:8452kB slab_reclaimable:21908kB slab_unreclaimable:33120kB kernel_stack:20976kB pagetables:9536kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:11073180 all_unreclaimable? yes
    lowmem_reserve[]: 0 0 0 0
    Node 0 DMA: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 0kB
    Node 0 DMA32: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 0kB
    Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
    Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
    123086 total pagecache pages
    0 pages in swap cache
    Swap cache stats: add 0, delete 0, find 0/0
    Free swap = 0kB
    Total swap = 0kB
    524157 pages RAM
    0 pages HighMem/MovableOnly
    76348 pages reserved
    0 pages hwpoisoned
    SLUB: Unable to allocate memory on node -1 (gfp=0x2088020)
    cache: kmalloc-64, object size: 64, buffer size: 64, default order: 0, min order: 0
    node 0: slabs: 3218, objs: 205952, free: 0
    file_io.00: page allocation failure: order:0, mode:0x2200020
    CPU: 0 PID: 4457 Comm: file_io.00 Not tainted 4.5.0-rc7+ #45

    Assuming that somebody will find a better solution, let's apply this
    patch for now to stop the bleeding, as this problem frequently prevents
    me from testing the OOM-livelock condition.

    Link: http://lkml.kernel.org/r/20160318131136.GE7152@quack.suse.cz
    Signed-off-by: Tetsuo Handa
    Acked-by: Michal Hocko
    Cc: Jan Kara
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tetsuo Handa
     

05 Apr, 2016

1 commit

  • The PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced a *long*
    time ago with the promise that one day it would be possible to implement
    the page cache with chunks bigger than PAGE_SIZE.

    This promise never materialized, and it is unlikely that it ever will.

    We have many places where PAGE_CACHE_SIZE is assumed to be equal to
    PAGE_SIZE, and it's a constant source of confusion whether the
    PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
    especially on the border between fs and mm.

    Globally switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause too much
    breakage to be doable.

    Let's stop pretending that pages in page cache are special. They are
    not.

    The changes are pretty straight-forward:

    - << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <nothing>;

    - >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <nothing>;

    - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};

    - page_cache_get() -> get_page();

    - page_cache_release() -> put_page();

    This patch contains automated changes generated with coccinelle using
    script below. For some reason, coccinelle doesn't patch header files.
    I've called spatch for them manually.

    The only adjustment after coccinelle is a revert of the changes to the
    PAGE_CACHE_ALIGN definition: we are going to drop it later.

    There are a few places in the code that coccinelle didn't reach. I'll
    fix them manually in a separate patch. Comments and documentation will
    also be addressed in a separate patch.

    virtual patch

    @@
    expression E;
    @@
    - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
    + E

    @@
    expression E;
    @@
    - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
    + E

    @@
    @@
    - PAGE_CACHE_SHIFT
    + PAGE_SHIFT

    @@
    @@
    - PAGE_CACHE_SIZE
    + PAGE_SIZE

    @@
    @@
    - PAGE_CACHE_MASK
    + PAGE_MASK

    @@
    expression E;
    @@
    - PAGE_CACHE_ALIGN(E)
    + PAGE_ALIGN(E)

    @@
    expression E;
    @@
    - page_cache_get(E)
    + get_page(E)

    @@
    expression E;
    @@
    - page_cache_release(E)
    + put_page(E)

    Signed-off-by: Kirill A. Shutemov
    Acked-by: Michal Hocko
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

20 Mar, 2016

2 commits

  • When cgroup writeback is in use, there can be multiple wb's
    (bdi_writeback's) per bdi and an inode may switch among them
    dynamically. In a couple of places the wrong wb was used, leading to
    operations being performed on the wrong list under the wrong lock and
    corrupting the io lists.

    * writeback_single_inode() was taking a @wb parameter and using it to
    remove the inode from the io lists if it became clean after writeback.
    The callers of this function were always passing in the root wb
    regardless of the actual wb that the inode was associated with, which
    could also change while writeback was in progress.

    Fix it by dropping the @wb parameter and using
    inode_to_wb_and_lock_list() to determine and lock the associated wb.

    * After writeback_sb_inodes() writes out an inode, it re-locks @wb and
    inode to remove it from or move it to the right io list. It assumes
    that the inode is still associated with @wb; however, the inode may
    have switched to another wb while writeback was in progress.

    Fix it by using inode_to_wb_and_lock_list() to determine and lock
    the associated wb after writeback is complete. As the function
    requires the original @wb->list_lock locked for the next iteration,
    in the unlikely case where the inode has changed association, switch
    the locks.

    Kudos to Tahsin for pinpointing these subtle breakages.

    Signed-off-by: Tejun Heo
    Fixes: d10c80955265 ("writeback: implement foreign cgroup inode bdi_writeback switching")
    Link: http://lkml.kernel.org/g/CAAeU0aMYeM_39Y2+PaRvyB1nqAPYZSNngJ1eBRmrxn7gKAt2Mg@mail.gmail.com
    Reported-and-diagnosed-by: Tahsin Erdogan
    Tested-by: Tahsin Erdogan
    Cc: stable@vger.kernel.org # v4.2+
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • locked_inode_to_wb_and_lock_list() wb_get()'s the wb associated with
    the target inode, unlocks inode, locks the wb's list_lock and verifies
    that the inode is still associated with the wb. To prevent the wb
    going away between dropping inode lock and acquiring list_lock, the wb
    is pinned while inode lock is held. The wb reference is put right
    after acquiring list_lock citing that the wb won't be dereferenced
    anymore.

    This isn't true. If the inode is still associated with the wb, the
    inode holds a reference and it's safe to return the wb; however, if the
    inode has been switched, the wb still needs to be unlocked, which is a
    dereference and can lead to a use-after-free if it races with wb
    destruction.

    Fix it by putting the reference after releasing list_lock.
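
    A sketch of the corrected retry loop, simplified from the description
    above rather than copied from the tree; the point is that on the
    mismatch path the reference is now dropped only after list_lock is
    released:

    static struct bdi_writeback *
    locked_inode_to_wb_and_lock_list(struct inode *inode)
    {
            while (true) {
                    struct bdi_writeback *wb = inode_to_wb(inode);

                    wb_get(wb);             /* pin across dropping i_lock */
                    spin_unlock(&inode->i_lock);
                    spin_lock(&wb->list_lock);

                    if (likely(wb == inode->i_wb)) {
                            wb_put(wb);     /* @inode already holds a ref */
                            return wb;
                    }

                    /*
                     * The inode switched wb's in between.  Unlock first,
                     * then drop the reference: the unlock itself
                     * dereferences @wb, so putting the ref earlier can
                     * race with wb destruction.
                     */
                    spin_unlock(&wb->list_lock);
                    wb_put(wb);
                    cpu_relax();
                    spin_lock(&inode->i_lock);
            }
    }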

    Signed-off-by: Tejun Heo
    Fixes: 87e1d789bf55 ("writeback: implement [locked_]inode_to_wb_and_lock_list()")
    Cc: stable@vger.kernel.org # v4.2+
    Tested-by: Tahsin Erdogan
    Signed-off-by: Jens Axboe

    Tejun Heo
     

04 Mar, 2016

1 commit

  • If cgroup writeback is in use, inodes can be scheduled for
    asynchronous wb switching. Before 5ff8eaac1636 ("writeback: keep
    superblock pinned during cgroup writeback association switches"), this
    could race with umount, leading to the super_block being destroyed while
    inodes were pinned for wb switching. 5ff8eaac1636 fixed it by bumping
    s_active while wb switches are in flight; however, this allowed
    in-flight wb switches to make umounts asynchronous when userland
    expected them to be synchronous - e.g. an fsck run immediately after
    umount may fail because the device is still busy.

    This patch removes the problematic super_block pinning and instead
    makes generic_shutdown_super() flush in-flight wb switches. wb
    switches are now executed on a dedicated isw_wq so that they can be
    flushed and isw_nr_in_flight keeps track of the number of in-flight wb
    switches so that flushing can be avoided in most cases.
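
    A sketch of the umount-side flush under the scheme described above,
    assuming isw_wq is the dedicated workqueue and isw_nr_in_flight the
    atomic in-flight counter:

    void cgroup_writeback_umount(void)
    {
            /*
             * Called from generic_shutdown_super().  Skip the workqueue
             * flush in the common case where no wb switches are pending.
             */
            if (atomic_read(&isw_nr_in_flight))
                    flush_workqueue(isw_wq);
    }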

    v2: Move cgroup_writeback_umount() further below and add an MS_ACTIVE
    check in inode_switch_wbs() as Jan and Al suggested.

    Signed-off-by: Tejun Heo
    Reported-by: Tahsin Erdogan
    Cc: Jan Kara
    Cc: Al Viro
    Link: http://lkml.kernel.org/g/CAAeU0aNCq7LGODvVGRU-oU_o-6enii5ey0p1c26D1ZzYwkDc5A@mail.gmail.com
    Fixes: 5ff8eaac1636 ("writeback: keep superblock pinned during cgroup writeback association switches")
    Cc: stable@vger.kernel.org #v4.5
    Reviewed-by: Jan Kara
    Tested-by: Tahsin Erdogan
    Signed-off-by: Jens Axboe

    Tejun Heo
     

17 Feb, 2016

1 commit

  • If cgroup writeback is in use, an inode is associated with a cgroup
    for writeback. If the inode's main dirtier changes to another cgroup,
    the association gets updated asynchronously. Nothing was pinning the
    superblock while such switches were in progress, and the superblock could
    go away while async switching was pending or in progress, leading to
    crashes like the following.

    kernel BUG at fs/jbd2/transaction.c:319!
    invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC
    CPU: 1 PID: 29158 Comm: kworker/1:10 Not tainted 4.5.0-rc3 #51
    Hardware name: Google Google, BIOS Google 01/01/2011
    Workqueue: events inode_switch_wbs_work_fn
    task: ffff880213dbbd40 ti: ffff880209264000 task.ti: ffff880209264000
    RIP: 0010:[] [] start_this_handle+0x382/0x3e0
    RSP: 0018:ffff880209267c30 EFLAGS: 00010202
    ...
    Call Trace:
    [] jbd2__journal_start+0xf4/0x190
    [] __ext4_journal_start_sb+0x4e/0x70
    [] ext4_evict_inode+0x12c/0x3d0
    [] evict+0xbb/0x190
    [] iput+0x130/0x190
    [] inode_switch_wbs_work_fn+0x343/0x4c0
    [] process_one_work+0x129/0x300
    [] worker_thread+0x126/0x480
    [] kthread+0xc4/0xe0
    [] ret_from_fork+0x3f/0x70

    Fix it by bumping s_active while cgroup association switching is in
    flight.

    Signed-off-by: Tejun Heo
    Reported-and-tested-by: Tahsin Erdogan
    Link: http://lkml.kernel.org/g/CAAeU0aNCq7LGODvVGRU-oU_o-6enii5ey0p1c26D1ZzYwkDc5A@mail.gmail.com
    Fixes: d10c80955265 ("writeback: implement foreign cgroup inode bdi_writeback switching")
    Cc: stable@vger.kernel.org #v4.5+
    Signed-off-by: Jens Axboe

    Tejun Heo
     

16 Jan, 2016

1 commit

  • In earlier versions, mem_cgroup_css_from_page() could return non-root
    css on a legacy hierarchy which can go away and required rcu locking;
    however, the eventual version simply returns the root cgroup if memcg is
    on a legacy hierarchy and thus doesn't need rcu locking around or in it.
    Remove spurious rcu lockings.

    Signed-off-by: Tejun Heo
    Reported-by: Johannes Weiner
    Cc: Michal Hocko
    Cc: Vladimir Davydov
    Cc: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tejun Heo
     

10 Nov, 2015

1 commit

  • Fix kernel-doc warnings in fs/fs-writeback.c by moving a #define macro to
    after the function's opening brace. Also #undef this macro at the end of
    the function.

    ../fs/fs-writeback.c:1984: warning: Excess function parameter 'inode' description in 'I_DIRTY_INODE'
    ../fs/fs-writeback.c:1984: warning: Excess function parameter 'flags' description in 'I_DIRTY_INODE'
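
    The shape of the fix, sketched on __mark_inode_dirty() (which is, as far
    as I recall, the function these warnings point at): with the #define
    sitting between the kernel-doc comment and the function, kernel-doc
    attributes the comment to the macro, so the macro is moved inside the
    function body and #undef'ed at its end:

    /**
     * __mark_inode_dirty - internal function to mark an inode dirty
     * @inode: inode to mark
     * @flags: what kind of dirty (e.g. I_DIRTY_SYNC)
     * ...
     */
    void __mark_inode_dirty(struct inode *inode, int flags)
    {
    #define I_DIRTY_INODE (I_DIRTY_SYNC | I_DIRTY_DATASYNC)
            /* ... function body unchanged ... */
    #undef I_DIRTY_INODE
    }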

    Signed-off-by: Randy Dunlap
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Randy Dunlap
     

06 Nov, 2015

1 commit

  • filemap_fdatawait() is a function that waits for on-going writeback to
    complete and also consumes and clears the error status of the mapping
    that was set during writeback.

    The latter functionality is critical for applications to detect writeback
    error with system calls like fsync(2)/fdatasync(2).

    However, filemap_fdatawait() is also used by sync(2) and the FIFREEZE
    ioctl, which don't check the error status of individual mappings.

    As a result, fsync() may not be able to detect writeback error if events
    happen in the following order:

    Application                        System admin
    ----------------------------------------------------------
    write data on page cache
                                       Run sync command
                                       writeback completes with error
                                       filemap_fdatawait() clears error
    fsync returns success
    (but the data is not on disk)

    This patch adds filemap_fdatawait_keep_errors() for call sites where
    writeback error is not handled so that they don't clear error status.
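
    A sketch of the new helper, assuming the wait itself is factored into an
    internal __filemap_fdatawait_range() that does not touch the mapping's
    error bits (names illustrative):

    void filemap_fdatawait_keep_errors(struct address_space *mapping)
    {
            loff_t i_size = i_size_read(mapping->host);

            if (i_size == 0)
                    return;

            /*
             * Wait for pages under writeback, but unlike filemap_fdatawait()
             * do not consume the error status, so AS_EIO/AS_ENOSPC remain
             * set for a later fsync()/fdatasync() to report.
             */
            __filemap_fdatawait_range(mapping, 0, i_size - 1);
    }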

    Signed-off-by: Jun'ichi Nomura
    Acked-by: Andi Kleen
    Reviewed-by: Tejun Heo
    Cc: Fengguang Wu
    Cc: Dave Chinner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Junichi Nomura
     

28 Oct, 2015

1 commit

  • bdi_split_work_to_wbs() uses list_for_each_entry_rcu_continue()
    to walk @bdi->wb_list. To set up the initial iteration
    condition, it uses list_entry_rcu() to calculate the entry
    pointer corresponding to the list head; however, this isn't an
    actual RCU dereference and using list_entry_rcu() for it ended
    up breaking a proposed list_entry_rcu() change because it was
    feeding a non-lvalue pointer into the macro.

    Don't use the RCU variant for simple pointer offsetting. Use
    list_entry() instead.
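
    A sketch of the fixed cursor setup in bdi_split_work_to_wbs(), assuming
    the wb's are linked through a bdi_node member on @bdi->wb_list; priming
    the cursor with the list head is plain pointer arithmetic, so plain
    list_entry() is used and only the actual walk goes through the RCU
    helpers:

    struct bdi_writeback *wb = list_entry(&bdi->wb_list,
                                          struct bdi_writeback, bdi_node);

    list_for_each_entry_continue_rcu(wb, &bdi->wb_list, bdi_node) {
            /* ... issue or wake writeback work on @wb ... */
    }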

    Reported-by: Ingo Molnar
    Signed-off-by: Tejun Heo
    Cc: Darren Hart
    Cc: David Howells
    Cc: Dipankar Sarma
    Cc: Eric Dumazet
    Cc: Frederic Weisbecker
    Cc: Josh Triplett
    Cc: Lai Jiangshan
    Cc: Linus Torvalds
    Cc: Mathieu Desnoyers
    Cc: Oleg Nesterov
    Cc: Patrick Marlier
    Cc: Paul McKenney
    Cc: Peter Zijlstra
    Cc: Steven Rostedt
    Cc: Thomas Gleixner
    Cc: pranith kumar
    Link: http://lkml.kernel.org/r/20151027051939.GA19355@mtj.duckdns.org
    Signed-off-by: Ingo Molnar

    Tejun Heo
     

13 Oct, 2015

2 commits

  • bdi_for_each_wb() is used in several places to wake up or issue
    writeback work items to all wb's (bdi_writeback's) on a given bdi.
    The iteration is performed by walking bdi->cgwb_tree; however, the
    tree only indexes wb's which are currently active.

    For example, when a memcg gets associated with a different blkcg, the
    old wb is removed from the tree so that the new one can be indexed.
    The old wb starts dying from then on but will linger till all its
    inodes are drained. As these dying wb's may still host dirty inodes,
    writeback operations which affect all wb's must include them.
    bdi_for_each_wb() skipping dying wb's led to sync(2) missing and
    failing to sync the inodes belonging to those wb's.

  • This patch adds an RCU-protected @bdi->wb_list which lists all wb's
    belonging to that bdi. wb's are added on creation and removed on
    release rather than on the start of destruction. bdi_for_each_wb()
    usages are replaced with list_for_each[_continue]_rcu() iterations
    over @bdi->wb_list and bdi_for_each_wb() and its helpers are removed.

    v2: Updated as per Jan. last_wb ref leak in bdi_split_work_to_wbs()
    fixed and unnecessary list head severing in cgwb_bdi_destroy()
    removed.

    Signed-off-by: Tejun Heo
    Reported-and-tested-by: Artem Bityutskiy
    Fixes: ebe41ab0c79d ("writeback: implement bdi_for_each_wb()")
    Link: http://lkml.kernel.org/g/1443012552.19983.209.camel@gmail.com
    Cc: Jan Kara
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • wakeup_dirtytime_writeback() walks and wakes up all wb's of all bdi's;
    unfortunately, it was always waking up bdi->wb instead of the wb being
    walked. Fix it.
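
    The shape of the fix, sketched for the per-bdi inner loop (iterator
    details elided):

    /* for each wb hanging off the bdi being walked ... */
    if (!list_empty(&wb->b_dirty_time))
            wb_wakeup(wb);          /* previously: wb_wakeup(&bdi->wb) */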

    Signed-off-by: Tejun Heo
    Fixes: 001fe6f617b1 ("writeback: make wakeup_dirtytime_writeback() handle multiple bdi_writeback's")
    Reviewed-by: Jan Kara
    Signed-off-by: Jens Axboe

    Tejun Heo
     

20 Sep, 2015

1 commit

  • Commit 505a666ee3fc ("writeback: plug writeback in wb_writeback() and
    writeback_inodes_wb()") has us holding a plug during writeback_sb_inodes,
    which increases the merge rate when relatively contiguous small files
    are written by the filesystem. It helps both on flash and spindles.

    For an fs_mark workload creating 4K files in parallel across 8 drives,
    this commit improves performance ~9% more by unplugging before calling
    cond_resched(). cond_resched() doesn't trigger an implicit unplug, so
    explicitly getting the IO down to the device before scheduling reduces
    latencies for anyone waiting on clean pages.

    It also cuts down on how often we use kblockd to unplug, which means
    less work bouncing from one workqueue to another.
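
    A sketch of the change in writeback_sb_inodes(); blk_flush_plug() stands
    for "push the current task's plugged IO to the device" and the exact
    call site is illustrative:

    if (need_resched()) {
            /*
             * cond_resched() does not trigger an implicit unplug (we are
             * not actually sleeping), so push the plugged IO out to the
             * device before giving up the CPU.
             */
            blk_flush_plug(current);
            cond_resched();
    }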

    Many more details about how we got here:

    https://lkml.org/lkml/2015/9/11/570

    Signed-off-by: Chris Mason
    Signed-off-by: Linus Torvalds

    Chris Mason
     

13 Sep, 2015

1 commit

  • We had to revert the plugging in writeback_sb_inodes() because the
    wb->list_lock is held, but we could easily plug at a higher level before
    taking that lock, and unplug after releasing it. This does that.

    Chris will run performance numbers, just to verify that this approach is
    comparable to the alternative (we could just drop and re-take the lock
    around the blk_finish_plug() rather than these two commits).

    I'd have preferred waiting for actual performance numbers before picking
    one approach over the other, but I don't want to release rc1 with the
    known "sleeping function called from invalid context" issue, so I'll
    pick this cleanup version for now. But if the numbers show that we
    really want to plug just at the writeback_sb_inodes() level, and we
    should just play ugly games with the spinlock, we'll switch to that.
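
    A sketch of the resulting structure in wb_writeback(): the plug is
    started before wb->list_lock is taken and finished after it is dropped,
    so no plugging work happens while the spinlock is held:

    static long wb_writeback(struct bdi_writeback *wb,
                             struct wb_writeback_work *work)
    {
            long nr_pages = work->nr_pages;
            struct blk_plug plug;

            blk_start_plug(&plug);          /* before taking the lock */
            spin_lock(&wb->list_lock);
            /* ... queue io and call writeback_sb_inodes() as before ... */
            spin_unlock(&wb->list_lock);
            blk_finish_plug(&plug);         /* after the lock is dropped */

            return nr_pages - work->nr_pages;
    }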

    Cc: Chris Mason
    Cc: Josef Bacik
    Cc: Dave Chinner
    Cc: Neil Brown
    Cc: Jan Kara
    Cc: Christoph Hellwig
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

12 Sep, 2015

1 commit

  • This reverts commit d353d7587d02116b9732d5c06615aed75a4d3a47.

    Doing the block layer plug/unplug inside writeback_sb_inodes() is
    broken, because that function is actually called with a spinlock held:
    wb->list_lock, as pointed out by Chris Mason.

    Chris suggested just dropping and re-taking the spinlock around the
    blk_finish_plug() call (the plugging itself can happen under the
    spinlock), and that would technically work, but is just disgusting.

    We do something fairly similar - but not quite as disgusting because we
    at least have a better reason for it - in writeback_single_inode(), so
    it's not like the caller can depend on the lock being held over the
    call, but in this case there just isn't any good reason for that
    "release and re-take the lock" pattern.

    [ In general, we should really strive to avoid the "release and retake"
    pattern for locks, because in the general case it can easily cause
    subtle bugs when the caller caches any state around the call that
    might be invalidated by dropping the lock even just temporarily. ]

    But in this case, the plugging should be easy to just move up to the
    callers before the spinlock is taken, which should even improve the
    effectiveness of the plug. So there is really no good reason to play
    games with locking here.

    I'll send off a test-patch so that Dave Chinner can verify that that
    plug movement works. In the meantime this just reverts the problematic
    commit and adds a comment to the function so that we hopefully don't
    make this mistake again.

    Reported-by: Chris Mason
    Cc: Josef Bacik
    Cc: Dave Chinner
    Cc: Neil Brown
    Cc: Jan Kara
    Cc: Christoph Hellwig
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

11 Sep, 2015

1 commit

  • Pull blk-cg updates from Jens Axboe:
    "A bit later in the cycle, but this has been in the block tree for a a
    while. This is basically four patchsets from Tejun, that improve our
    buffered cgroup writeback. It was dependent on the other cgroup
    changes, but they went in earlier in this cycle.

    Series 1 is set of 5 patches that has cgroup writeback updates:

    - bdi_writeback iteration fix which could lead to some wb's being
    skipped or repeated during e.g. sync under memory pressure.

    - Simplification of wb work wait mechanism.

    - Writeback tracepoints updated to report cgroup.

    Series 2 is a set of updates for the CFQ cgroup writeback handling:

    cfq has always charged all async IOs to the root cgroup. It didn't
    have much choice as writeback didn't know about cgroups and there
    was no way to tell who to blame for a given writeback IO.
    writeback finally grew support for cgroups and now tags each
    writeback IO with the appropriate cgroup to charge it against.

    This patchset updates cfq so that it follows the blkcg each bio is
    tagged with. Async cfq_queues are now shared across cfq_group,
    which is per-cgroup, instead of per-request_queue cfq_data. This
    makes all IOs follow the weight based IO resource distribution
    implemented by cfq.

    - Switched from GFP_ATOMIC to GFP_NOWAIT as suggested by Jeff.

    - Other misc review points addressed, acks added and rebased.

    Series 3 is the blkcg policy cleanup patches:

    This patchset contains assorted cleanups for blkcg_policy methods
    and blk[c]g_policy_data handling.

    - alloc/free added for blkg_policy_data. exit dropped.

    - alloc/free added for blkcg_policy_data.

    - blk-throttle's async percpu allocation is replaced with direct
    allocation.

    - all methods now take blk[c]g_policy_data instead of blkcg_gq or
    blkcg.

    And finally, series 4 is a set of patches cleaning up the blkcg stats
    handling:

    blkcg's stats have always been somewhat of a mess. This patchset
    tries to improve the situation a bit.

    - The following patches were added to consolidate the blkcg entry point
    and blkg creation. This in itself is an improvement and helps
    collecting common stats on bio issue.

    - per-blkg stats are now accounted on bio issue rather than request
    completion so that bio based and request based drivers can behave
    the same way. The issue was spotted by Vivek.

    - cfq-iosched implements custom recursive stats and blk-throttle
    implements custom per-cpu stats. This patchset makes blkcg core
    support both by default.

    - cfq-iosched and blk-throttle keep track of the same stats
    multiple times. Unify them"

    * 'for-4.3/blkcg' of git://git.kernel.dk/linux-block: (45 commits)
    blkcg: use CGROUP_WEIGHT_* scale for io.weight on the unified hierarchy
    blkcg: s/CFQ_WEIGHT_*/CFQ_WEIGHT_LEGACY_*/
    blkcg: implement interface for the unified hierarchy
    blkcg: misc preparations for unified hierarchy interface
    blkcg: separate out tg_conf_updated() from tg_set_conf()
    blkcg: move body parsing from blkg_conf_prep() to its callers
    blkcg: mark existing cftypes as legacy
    blkcg: rename subsystem name from blkio to io
    blkcg: refine error codes returned during blkcg configuration
    blkcg: remove unnecessary NULL checks from __cfqg_set_weight_device()
    blkcg: reduce stack usage of blkg_rwstat_recursive_sum()
    blkcg: remove cfqg_stats->sectors
    blkcg: move io_service_bytes and io_serviced stats into blkcg_gq
    blkcg: make blkg_[rw]stat_recursive_sum() to be able to index into blkcg_gq
    blkcg: make blkcg_[rw]stat per-cpu
    blkcg: add blkg_[rw]stat->aux_cnt and replace cfq_group->dead_stats with it
    blkcg: consolidate blkg creation in blkcg_bio_issue_check()
    blk-throttle: improve queue bypass handling
    blkcg: move root blkg lookup optimization from throtl_lookup_tg() to __blkg_lookup()
    blkcg: inline [__]blkg_lookup()
    ...

    Linus Torvalds
     

06 Sep, 2015

1 commit

  • Pull vfs updates from Al Viro:
    "In this one:

    - d_move fixes (Eric Biederman)

    - UFS fixes (me; locking is mostly sane now, a bunch of bugs in error
    handling ought to be fixed)

    - switch of sb_writers to percpu rwsem (Oleg Nesterov)

    - superblock scalability (Josef Bacik and Dave Chinner)

    - swapon(2) race fix (Hugh Dickins)"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (65 commits)
    vfs: Test for and handle paths that are unreachable from their mnt_root
    dcache: Reduce the scope of i_lock in d_splice_alias
    dcache: Handle escaped paths in prepend_path
    mm: fix potential data race in SyS_swapon
    inode: don't softlockup when evicting inodes
    inode: rename i_wb_list to i_io_list
    sync: serialise per-superblock sync operations
    inode: convert inode_sb_list_lock to per-sb
    inode: add hlist_fake to avoid the inode hash lock in evict
    writeback: plug writeback at a high level
    change sb_writers to use percpu_rw_semaphore
    shift percpu_counter_destroy() into destroy_super_work()
    percpu-rwsem: kill CONFIG_PERCPU_RWSEM
    percpu-rwsem: introduce percpu_rwsem_release() and percpu_rwsem_acquire()
    percpu-rwsem: introduce percpu_down_read_trylock()
    document rwsem_release() in sb_wait_write()
    fix the broken lockdep logic in __sb_start_write()
    introduce __sb_writers_{acquired,release}() helpers
    ufs_inode_get{frag,block}(): get rid of 'phys' argument
    ufs_getfrag_block(): tidy up a bit
    ...

    Linus Torvalds
     

26 Aug, 2015

1 commit

  • e79729123f63 ("writeback: don't issue wb_writeback_work if clean")
    updated the writeback path to avoid kicking writeback work items if there
    are no inodes to be written out; unfortunately, the avoidance logic
    was too aggressive and broke sync_inodes_sb().

    * sync_inodes_sb() must write out I_DIRTY_TIME inodes but I_DIRTY_TIME
    inodes don't contribute to bdi/wb_has_dirty_io() tests and were
    being skipped over.

    * inodes are taken off wb->b_dirty/io/more_io lists after writeback
    starts on them. sync_inodes_sb() skipping wait_sb_inodes() when
    bdi_has_dirty_io() breaks it by making it return while writebacks
    are in-flight.

    This patch fixes the breakages by

    * Removing bdi_has_dirty_io() shortcut from bdi_split_work_to_wbs().
    The callers are already testing the condition.

    * Removing bdi_has_dirty_io() shortcut from sync_inodes_sb() so that
    it always calls into bdi_split_work_to_wbs() and wait_sb_inodes().

    * Making bdi_split_work_to_wbs() consider the b_dirty_time list for
    WB_SYNC_ALL writebacks.

    Kudos to Eryu, Dave and Jan for tracking down the issue.

    Signed-off-by: Tejun Heo
    Fixes: e79729123f63 ("writeback: don't issue wb_writeback_work if clean")
    Link: http://lkml.kernel.org/g/20150812101204.GE17933@dhcp-13-216.nay.redhat.com
    Reported-and-bisected-by: Eryu Guan
    Cc: Dave Chinner
    Cc: Jan Kara
    Cc: Ted Ts'o
    Signed-off-by: Jens Axboe

    Tejun Heo
     

19 Aug, 2015

4 commits

  • The following tracepoints are updated to report the cgroup used during
    cgroup writeback.

    * writeback_write_inode[_start]
    * writeback_queue
    * writeback_exec
    * writeback_start
    * writeback_written
    * writeback_wait
    * writeback_nowork
    * writeback_wake_background
    * wbc_writepage
    * writeback_queue_io
    * bdi_dirty_ratelimit
    * balance_dirty_pages
    * writeback_sb_inodes_requeue
    * writeback_single_inode[_start]

    Note that writeback_bdi_register is separated out from writeback_class
    as reporting cgroup doesn't make sense to it. Tracepoints which take
    bdi are updated to take bdi_writeback instead.

    Signed-off-by: Tejun Heo
    Suggested-by: Jan Kara
    Reviewed-by: Jan Kara
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • Signed-off-by: Tejun Heo
    Suggested-by: Jan Kara
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • wb_writeback_work->single_wait/done are used for the wait mechanism
    for synchronous wb_work (wb_writeback_work) items which are issued
    when bdi_split_work_to_wbs() fails to allocate memory for asynchronous
    wb_work items; however, there's no reason to use a separate wait
    mechanism for this. bdi_split_work_to_wbs() can simply use an on-stack
    fallback wb_work item and a separate wb_completion to wait for it.

    This patch removes wb_work->single_wait/done and the related code and
    makes bdi_split_work_to_wbs() use an on-stack fallback wb_work and
    wb_completion instead.

    Signed-off-by: Tejun Heo
    Suggested-by: Jan Kara
    Reviewed-by: Jan Kara
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • wb's (bdi_writeback's) are currently keyed by memcg ID; however, in an
    earlier implementation, wb's were keyed by blkcg ID.
    bdi_for_each_wb() walks bdi->cgwb_tree in the ascending ID order and
    allows iterations to start from an arbitrary ID which is used to
    interrupt and resume iterations.

    Unfortunately, while changing wb to be keyed by memcg ID instead of
    blkcg, bdi_for_each_wb() was missed and is still assuming that wb's
    are keyed by blkcg ID. This doesn't affect iterations which don't get
    interrupted but bdi_split_work_to_wbs() makes use of iteration
    resuming on allocation failures and thus may incorrectly skip or
    repeat wb's.

    Fix it by changing bdi_for_each_wb() to take memcg IDs instead of
    blkcg IDs and updating bdi_split_work_to_wbs() accordingly.

    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
     

18 Aug, 2015

4 commits

  • There's a small consistency problem between the inode and writeback
    naming. Writeback calls the "for IO" inode queues b_io and
    b_more_io, but the inode calls these the "writeback list" or
    i_wb_list. This makes it hard to add a new "under writeback" list to
    the inode, or to call it an "under IO" list on the bdi, because either
    way we'll have writeback on IO and IO on writeback and it'll just be
    confusing. I'm getting confused just writing this!

    So, rename the inode "for IO" list variable to i_io_list so we can
    add a new "writeback list" in a subsequent patch.

    Signed-off-by: Dave Chinner
    Signed-off-by: Josef Bacik
    Reviewed-by: Jan Kara
    Reviewed-by: Christoph Hellwig
    Tested-by: Dave Chinner

    Dave Chinner
     
  • When competing sync(2) calls walk the same filesystem, they need to
    walk the list of inodes on the superblock to find all the inodes
    that we need to wait for IO completion on. However, when multiple
    wait_sb_inodes() calls do this at the same time, they contend on the
    the inode_sb_list_lock and the contention causes system wide
    slowdowns. In effect, concurrent sync(2) calls can take longer and
    burn more CPU than if they were serialised.

    Stop the worst of the contention by adding a per-sb mutex to wrap
    around wait_sb_inodes() so that we only execute one sync(2) IO
    completion walk per superblock at a time and hence avoid
    contention being triggered by concurrent sync(2) calls.
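
    A sketch of the serialisation, assuming the new per-sb mutex is called
    s_sync_lock:

    static void wait_sb_inodes(struct super_block *sb)
    {
            /*
             * Only one sync(2) IO-completion walk per superblock at a
             * time; concurrent sync(2) callers queue up here instead of
             * all hammering the inode list lock at once.
             */
            mutex_lock(&sb->s_sync_lock);

            /* ... walk sb->s_inodes and wait on mappings under writeback ... */

            mutex_unlock(&sb->s_sync_lock);
    }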

    Signed-off-by: Dave Chinner
    Signed-off-by: Josef Bacik
    Reviewed-by: Jan Kara
    Reviewed-by: Christoph Hellwig
    Tested-by: Dave Chinner

    Dave Chinner
     
  • The process of reducing contention on per-superblock inode lists
    starts with moving the locking to match the per-superblock inode
    list. This takes the global lock out of the picture and reduces the
    contention problems to within a single filesystem. This doesn't get
    rid of contention as the locks still have global CPU scope, but it
    does isolate operations on different superblocks from each other.

    Signed-off-by: Dave Chinner
    Signed-off-by: Josef Bacik
    Reviewed-by: Jan Kara
    Reviewed-by: Christoph Hellwig
    Tested-by: Dave Chinner

    Dave Chinner
     
  • Doing writeback on lots of little files causes terrible IOPS storms
    because of the per-mapping writeback plugging we do. This
    essentially causes immediate dispatch of IO for each mapping,
    regardless of the context in which writeback is occurring.

    IOWs, running a concurrent write-lots-of-small-4k-files workload using
    fsmark on XFS results in a huge number of IOPS being issued for data
    writes. Metadata writes are sorted and plugged at a high level by
    XFS, so they aggregate nicely into large IOs. However, data writeback IOs
    are dispatched in individual 4k IOs, even when the blocks of two
    consecutively written files are adjacent.

    Test VM: 8p, 8GB RAM, 4xSSD in RAID0, 100TB sparse XFS filesystem,
    metadata CRCs enabled.

    Kernel: 3.10-rc5 + xfsdev + my 3.11 xfs queue (~70 patches)

    Test:

    $ ./fs_mark -D 10000 -S0 -n 10000 -s 4096 -L 120 -d
    /mnt/scratch/0 -d /mnt/scratch/1 -d /mnt/scratch/2 -d
    /mnt/scratch/3 -d /mnt/scratch/4 -d /mnt/scratch/5 -d
    /mnt/scratch/6 -d /mnt/scratch/7

    Result:

                  wall     sys      create rate     Physical write IO
                  time     CPU      (avg files/s)    IOPS    Bandwidth
                  -----    -----    ------------    ------   ---------
    unpatched     6m56s    15m47s   24,000+/-500    26,000   130MB/s
    patched       5m06s    13m28s   32,800+/-600     1,500   180MB/s
    improvement   -26.44%  -14.68%    +36.67%       -94.23%  +38.46%

    If I use zero-length files, this workload runs at about 500 IOPS, so
    plugging drops the data IOs from roughly 25,500/s to 1,000/s.
    3 lines of code, 35% better throughput for 15% less CPU.

    The benefits of plugging at this layer are likely to be higher for
    spinning media as the IO patterns for this workload are going to make a
    much bigger difference on high IO latency devices.....

    Signed-off-by: Dave Chinner
    Signed-off-by: Josef Bacik
    Reviewed-by: Jan Kara
    Tested-by: Dave Chinner
    Reviewed-by: Christoph Hellwig

    Dave Chinner
     

24 Jul, 2015

1 commit


18 Jun, 2015

1 commit

  • Currently, even when a filesystem doesn't set the FS_CGROUP_WRITEBACK
    flag, if the filesystem uses wbc_init_bio() and wbc_account_io(), the
    foreign inode detection and migration logic still ends up activating
    cgroup writeback which is unexpected. This patch ensures that the
    foreign inode detection logic stays disabled when inode_cgwb_enabled()
    is false by not associating writeback_control's with bdi_writeback's.

    This also avoids unnecessary operations in wbc_init_bio(),
    wbc_account_io() and wbc_detach_inode() for filesystems which don't
    support cgroup writeback.

    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
     

02 Jun, 2015

6 commits

  • For the purpose of foreign inode detection, wb's (bdi_writeback's) are
    identified by the associated memcg ID. As we create a separate wb for
    each memcg, this is enough to identify the active wb's; however, when
    blkcg is enabled or disabled higher up in the hierarchy, the mapping
    between memcg and blkcg changes which in turn creates a new wb to
    service the new mapping. The old wb is unlinked from index and
    released after all references are drained. The foreign inode
    detection logic can't detect this condition because both the old and
    new wb's point to the same memcg and thus never decides to move inodes
    attached to the old wb to the new one.

    This patch adds logic to initiate switching immediately in
    wbc_attach_and_unlock_inode() if the associated wb is dying. We could
    make the usual foreign-detection logic distinguish the different
    wb's mapped to the memcg, but the dying wb is never going to be in
    active service again and there's no point in tracking the usage history
    and reaching the switch verdict after enough data points are collected.
    It's already known that the wb has to be switched.
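
    A sketch of the immediate-switch check in wbc_attach_and_unlock_inode(),
    assuming a wb_dying() predicate and the inode_switch_wbs() helper
    introduced by the switching patches:

    /*
     * A dying wb indicates that the memcg-blkcg mapping has changed and a
     * new wb is already serving the memcg.  Switch immediately instead of
     * waiting for the foreign-inode detection to collect data points.
     */
    if (unlikely(wb_dying(wbc->wb)))
            inode_switch_wbs(inode, wbc->wb_id);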

    Signed-off-by: Tejun Heo
    Cc: Jens Axboe
    Cc: Jan Kara
    Cc: Wu Fengguang
    Cc: Greg Thelen
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • As concurrent write sharing of an inode is expected to be very rare
    and memcg only tracks page ownership on a first-use basis, severely
    confining the usefulness of such sharing, cgroup writeback tracks
    ownership per-inode. While the support for concurrent write sharing
    of an inode is deemed unnecessary, an inode being written to by
    different cgroups at different points in time is a lot more common,
    and, more importantly, charging only by first-use can too readily lead
    to grossly incorrect behaviors (a single foreign page can lead to
    gigabytes of writeback being incorrectly attributed).

    To resolve this issue, cgroup writeback detects the majority dirtier
    of an inode and transfers the ownership to it. The previous patches
    implemented the foreign condition detection mechanism and laid the
    groundwork. This patch implements the actual switching.

    With the previously implemented [unlocked_]inode_to_wb_and_list_lock()
    and wb stat transaction, grabbing wb->list_lock, inode->i_lock and
    mapping->tree_lock gives us full exclusion against all wb operations
    on the target inode. inode_switch_wb_work_fn() grabs all the locks
    and transfers the inode atomically along with its RECLAIMABLE and
    WRITEBACK stats.

    Signed-off-by: Tejun Heo
    Cc: Jens Axboe
    Cc: Jan Kara
    Cc: Wu Fengguang
    Cc: Greg Thelen
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • With the previous three patches, all operations which acquire wb from
    an inode are either under one of inode->i_lock, mapping->tree_lock or
    wb->list_lock, or protected by the unlocked_inode_to_wb transaction. This
    will be depended upon by foreign inode wb switching.

    This patch adds lockdep assertion to inode_to_wb() so that usages
    outside the above list locks can be caught easily. There are three
    exceptions.

    * locked_inode_to_wb_and_lock_list() is holding wb->list_lock but the
    wb may not be the inode's. Ensuring that is the function's role
    after all. Updated to deref inode->i_wb directly.

    * inode_wb_stat_unlocked_begin() is usually protected by combination
    of !I_WB_SWITCH and rcu_read_lock(). Updated to deref inode->i_wb
    directly.

    * inode_congested() wants to test whether inode->i_wb is set before
    starting the transaction. Added inode_to_wb_is_valid() which tests
    inode->i_wb directly.
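
    A sketch of such an assertion in inode_to_wb(); the exact lockdep
    expression is illustrative:

    static inline struct bdi_writeback *inode_to_wb(const struct inode *inode)
    {
    #ifdef CONFIG_LOCKDEP
            /* any one of the three locks legitimizes the dereference */
            WARN_ON_ONCE(debug_locks &&
                         (!lockdep_is_held(&inode->i_lock) &&
                          !lockdep_is_held(&inode->i_mapping->tree_lock) &&
                          !lockdep_is_held(&inode->i_wb->list_lock)));
    #endif
            return inode->i_wb;
    }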

    v5: might_lock() removed. It annotates that the lock is grabbed with
    irqs enabled, which isn't the case here, and was triggering lockdep
    warnings spuriously.

    v4: might_lock() added to unlocked_inode_to_wb_begin().

    v3: inode_congested() conversion added.

    v2: locked_inode_to_wb_and_lock_list() was missing in the first
    version.

    Signed-off-by: Tejun Heo
    Cc: Jens Axboe
    Cc: Jan Kara
    Cc: Wu Fengguang
    Cc: Greg Thelen
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • Similar to wb stat updates, inode_congested() accesses the associated
    wb of an inode locklessly, which will break with foreign inode wb
    switching. This path updates inode_congested() to use unlocked inode
    wb access transaction introduced by the previous patch.

    Combined with the previous two patches, this makes all wb list and
    access operations to be protected by either of inode->i_lock,
    wb->list_lock, or mapping->tree_lock while wb switching is in
    progress.

    Signed-off-by: Tejun Heo
    Cc: Jens Axboe
    Cc: Jan Kara
    Cc: Wu Fengguang
    Cc: Greg Thelen
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • The mechanism for detecting whether an inode should switch its wb
    (bdi_writeback) association is now in place. This patch builds the
    framework for the actual switching.

    This patch adds a new inode flag I_WB_SWITCHING, which has two
    functions. First, the easy one, it ensures that there's only one
    switch in progress for a given inode. Second, it's used as a
    mechanism to synchronize wb stat updates.

    The two stats, WB_RECLAIMABLE and WB_WRITEBACK, aren't event counters
    but track the current number of dirty pages and pages under writeback
    respectively. As such, when an inode is moved from one wb to another,
    the inode's portion of those stats have to be transferred together;
    unfortunately, this is a bit tricky as those stat updates are percpu
    operations which are performed without holding any lock in some
    places.

    This patch solves the problem in a similar way to memcg. Each such
    lockless stat update is wrapped in a transaction surrounded by
    unlocked_inode_to_wb_begin/end(). During normal operation, they map
    to rcu_read_lock/unlock(); however, if I_WB_SWITCHING is asserted,
    mapping->tree_lock is grabbed across the transaction.

    In turn, the switching path sets I_WB_SWITCHING and waits for an RCU
    grace period to pass before actually starting to switch, which
    guarantees that all stat update paths are synchronizing against
    mapping->tree_lock.

    This patch still doesn't implement the actual switching.
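
    A sketch of the transaction helpers as described above (the merged flag
    is spelled I_WB_SWITCH if I recall correctly; treat the names and the
    memory-ordering details as illustrative):

    static inline struct bdi_writeback *
    unlocked_inode_to_wb_begin(struct inode *inode, bool *lockedp)
    {
            rcu_read_lock();

            /* paired with a store_release in the switching path */
            *lockedp = smp_load_acquire(&inode->i_state) & I_WB_SWITCH;
            if (unlikely(*lockedp))
                    spin_lock_irq(&inode->i_mapping->tree_lock);

            return inode->i_wb;
    }

    static inline void unlocked_inode_to_wb_end(struct inode *inode, bool locked)
    {
            if (unlikely(locked))
                    spin_unlock_irq(&inode->i_mapping->tree_lock);

            rcu_read_unlock();
    }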

    v3: Updated on top of the recent cancel_dirty_page() updates.
    unlocked_inode_to_wb_begin() now nests inside
    mem_cgroup_begin_page_stat() to match the locking order.

    v2: The i_wb access transaction will be used for !stat accesses too.
    Function names and comments updated accordingly.

    s/inode_wb_stat_unlocked_{begin|end}/unlocked_inode_to_wb_{begin|end}/
    s/switch_wb/switch_wbs/

    Signed-off-by: Tejun Heo
    Cc: Jens Axboe
    Cc: Jan Kara
    Cc: Wu Fengguang
    Cc: Greg Thelen
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • cgroup writeback currently assumes that the inode-to-wb association
    doesn't change; however, with the planned foreign inode wb switching
    mechanism, the association will change dynamically.

    When an inode needs to be put on one of the IO lists of its wb, the
    current code simply calls inode_to_wb() and locks the returned wb;
    however, with the planned wb switching, the association may change
    before locking the wb and may even get released.

    This patch implements [locked_]inode_to_wb_and_lock_list() which pins
    the associated wb while holding i_lock, releases it, acquires
    wb->list_lock and verifies that the association hasn't changed
    inbetween. As the association will be protected by both locks among
    other things, this guarantees that the wb is the inode's associated wb
    until the list_lock is released.

    Signed-off-by: Tejun Heo
    Cc: Jens Axboe
    Cc: Jan Kara
    Cc: Wu Fengguang
    Cc: Greg Thelen
    Signed-off-by: Jens Axboe

    Tejun Heo