23 Jan, 2020

1 commit

  • commit 6d9e8c651dd979aa666bee15f086745f3ea9c4b3 upstream.

    Patch series "use div64_ul() instead of div_u64() if the divisor is
    unsigned long".

    We were first inspired by commit b0ab99e7736a ("sched: Fix possible divide
    by zero in avg_atom() calculation"), and while reviewing the recently
    analyzed mm code we found this suspicious place.

    201         if (min) {
    202                 min *= this_bw;
    203                 do_div(min, tot_bw);
    204         }

    And we also disassembled and confirmed it:

    /usr/src/debug/kernel-4.9.168-016.ali3000/linux-4.9.168-016.ali3000.alios7.x86_64/mm/page-writeback.c: 201
    0xffffffff811c37da : xor %r10d,%r10d
    0xffffffff811c37dd : test %rax,%rax
    0xffffffff811c37e0 : je 0xffffffff811c3800
    /usr/src/debug/kernel-4.9.168-016.ali3000/linux-4.9.168-016.ali3000.alios7.x86_64/mm/page-writeback.c: 202
    0xffffffff811c37e2 : imul %r8,%rax
    /usr/src/debug/kernel-4.9.168-016.ali3000/linux-4.9.168-016.ali3000.alios7.x86_64/mm/page-writeback.c: 203
    0xffffffff811c37e6 : mov %r9d,%r10d ---> truncates it to 32 bits here
    0xffffffff811c37e9 : xor %edx,%edx
    0xffffffff811c37eb : div %r10
    0xffffffff811c37ee : imul %rbx,%rax
    0xffffffff811c37f2 : shr $0x2,%rax
    0xffffffff811c37f6 : mul %rcx
    0xffffffff811c37f9 : shr $0x2,%rdx
    0xffffffff811c37fd : mov %rdx,%r10

    This series uses div64_ul() instead of div_u64() if the divisor is
    unsigned long, to avoid truncation to 32-bit on 64-bit platforms.

    This patch (of 3):

    The variables 'min' and 'max' are divided by 'tot_bw', an unsigned long
    that do_div() truncates to 32 bits; a divisor that tests as non-zero can
    thus be truncated to zero for the division. Fix this issue by using
    div64_ul() instead.
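
    To make the failure mode concrete, here is a minimal userspace model of
    the truncation on a 64-bit platform (div_u64_model()/div64_ul_model()
    are illustrative stand-ins for the kernel helpers, not the kernel code
    itself):

    #include <stdint.h>
    #include <stdio.h>

    /* div_u64() takes a u32 divisor, so an unsigned long divisor is
     * truncated on 64-bit; div64_ul() keeps the full width. */
    static uint64_t div_u64_model(uint64_t dividend, uint32_t divisor)
    {
            return dividend / divisor;
    }

    static uint64_t div64_ul_model(uint64_t dividend, unsigned long divisor)
    {
            return dividend / divisor;
    }

    int main(void)
    {
            unsigned long divisor = 0x100000000UL; /* non-zero, but 0 as u32 */

            /* (uint32_t)divisor == 0: div_u64_model() would divide by zero
             * here, exactly the hazard in the do_div() call above. */
            printf("truncated divisor: %u\n", (uint32_t)divisor);
            printf("full-width result: %llu\n",
                   (unsigned long long)div64_ul_model(1000, divisor));
            return 0;
    }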

    Link: http://lkml.kernel.org/r/20200102081442.8273-2-wenyang@linux.alibaba.com
    Fixes: 693108a8a667 ("writeback: make bdi->min/max_ratio handling cgroup writeback aware")
    Signed-off-by: Wen Yang
    Reviewed-by: Andrew Morton
    Cc: Qian Cai
    Cc: Tejun Heo
    Cc: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Wen Yang
     

27 Aug, 2019

1 commit

  • There's an inherent mismatch between memcg and writeback. The former
    tracks ownership per-page while the latter per-inode. This was a
    deliberate design decision because honoring per-page ownership in the
    writeback path is complicated, may lead to higher CPU and IO overheads,
    and was deemed unnecessary given that write-sharing an inode across
    different cgroups isn't a common use-case.

    Combined with inode majority-writer ownership switching, this works
    well enough in most cases but there are some pathological cases. For
    example, let's say there are two cgroups A and B which keep writing to
    different but confined parts of the same inode. B owns the inode and
    A's memory is limited far below B's. A's dirty ratio can rise enough
    to trigger balance_dirty_pages() sleeps but B's can be low enough to
    avoid triggering background writeback. A will be slowed down without
    a way to make writeback of the dirty pages happen.

    This patch implements foreign dirty recording and a foreign flushing mechanism so
    that when a memcg encounters a condition as above it can trigger
    flushes on bdi_writebacks which can clean its pages. Please see the
    comment on top of mem_cgroup_track_foreign_dirty_slowpath() for
    details.

    A reproducer follows.

    write-range.c::

    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <fcntl.h>
    #include <errno.h>

    static const char *usage = "write-range FILE START SIZE\n";

    int main(int argc, char **argv)
    {
            int fd;
            unsigned long start, size, end, pos;
            char *endp;
            char buf[4096];

            if (argc < 4) {
                    fprintf(stderr, usage);
                    return 1;
            }

            fd = open(argv[1], O_WRONLY);
            if (fd < 0) {
                    perror("open");
                    return 1;
            }

            start = strtoul(argv[2], &endp, 0);
            if (*endp != '\0') {
                    fprintf(stderr, usage);
                    return 1;
            }

            size = strtoul(argv[3], &endp, 0);
            if (*endp != '\0') {
                    fprintf(stderr, usage);
                    return 1;
            }

            end = start + size;

            while (1) {
                    for (pos = start; pos < end; ) {
                            long bread, bwritten = 0;

                            if (lseek(fd, pos, SEEK_SET) < 0) {
                                    perror("lseek");
                                    return 1;
                            }

                            bread = read(0, buf, sizeof(buf) < end - pos ?
                                         sizeof(buf) : end - pos);
                            if (bread < 0) {
                                    perror("read");
                                    return 1;
                            }
                            if (bread == 0)
                                    return 0;

                            while (bwritten < bread) {
                                    long this;

                                    this = write(fd, buf + bwritten,
                                                 bread - bwritten);
                                    if (this < 0) {
                                            perror("write");
                                            return 1;
                                    }

                                    bwritten += this;
                                    pos += bwritten;
                            }
                    }
            }
    }

    repro.sh::

    #!/bin/bash

    set -e
    set -x

    sysctl -w vm.dirty_expire_centisecs=300000
    sysctl -w vm.dirty_writeback_centisecs=300000
    sysctl -w vm.dirtytime_expire_seconds=300000
    echo 3 > /proc/sys/vm/drop_caches

    TEST=/sys/fs/cgroup/test
    A=$TEST/A
    B=$TEST/B

    mkdir -p $A $B
    echo "+memory +io" > $TEST/cgroup.subtree_control
    echo $((1<<30)) > $A/memory.high
    echo $((32<<30)) > $B/memory.high

    rm -f testfile
    touch testfile
    fallocate -l 4G testfile

    echo "Starting B"

    (echo $BASHPID > $B/cgroup.procs
     pv -q --rate-limit 70M < /dev/urandom | ./write-range testfile $((2<<30)) $((2<<30))) &

    sleep 5

    echo "Starting A"

    (echo $BASHPID > $A/cgroup.procs
     pv < /dev/urandom | ./write-range testfile 0 $((2<<30)))
    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
     

13 Jul, 2019

1 commit

  • account_page_dirtied() is only used by our set_page_dirty() helpers and
    should not be used anywhere else.

    Link: http://lkml.kernel.org/r/20190605183702.30572-1-hch@lst.de
    Signed-off-by: Christoph Hellwig
    Reviewed-by: Andrew Morton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     

21 May, 2019

1 commit

  • Add SPDX license identifiers to all files which:

    - Have no license information of any form

    - Have EXPORT_.*_SYMBOL_GPL inside which was used in the
    initial scan/conversion to ignore the file

    These files fall under the project license, GPL v2 only. The resulting SPDX
    license identifier is:

    GPL-2.0-only

    Signed-off-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     

15 May, 2019

1 commit

  • Recently there have been some hung tasks on our server due to
    wait_on_page_writeback(), and we want to know the details of this
    PG_writeback, i.e. which device this page is being written back to. But
    that is not so convenient to obtain.

    I think it would be better to introduce a tracepoint for diagnosing the
    writeback details.

    Link: http://lkml.kernel.org/r/1556274402-19018-1-git-send-email-laoar.shao@gmail.com
    Signed-off-by: Yafang Shao
    Reviewed-by: Andrew Morton
    Cc: Jan Kara
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yafang Shao
     

06 Mar, 2019

1 commit

  • Many kernel-doc comments in mm/ have the return value descriptions
    either misformatted or omitted altogether, which makes the kernel-doc
    script unhappy:

    $ make V=1 htmldocs
    ...
    ./mm/util.c:36: info: Scanning doc for kstrdup
    ./mm/util.c:41: warning: No description found for return value of 'kstrdup'
    ./mm/util.c:57: info: Scanning doc for kstrdup_const
    ./mm/util.c:66: warning: No description found for return value of 'kstrdup_const'
    ./mm/util.c:75: info: Scanning doc for kstrndup
    ./mm/util.c:83: warning: No description found for return value of 'kstrndup'
    ...

    Fixing the formatting and adding the missing return value descriptions
    eliminates ~100 such warnings.
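
    For illustration, a kernel-doc comment that satisfies the script needs
    a "Return:" section along these lines (kstrdup() shown; the exact
    wording in the patch may differ):

    /**
     * kstrdup - allocate space for and copy an existing string
     * @s: the string to duplicate
     * @gfp: the GFP mask used in the kmalloc() call when allocating memory
     *
     * Return: newly allocated copy of @s or %NULL in case of error
     */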

    Link: http://lkml.kernel.org/r/1549549644-4903-4-git-send-email-rppt@linux.ibm.com
    Signed-off-by: Mike Rapoport
    Reviewed-by: Andrew Morton
    Cc: Jonathan Corbet
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     

29 Dec, 2018

1 commit

  • write_cache_pages() is used in both background and integrity writeback
    scenarios by various filesystems. Background writeback is mostly
    concerned with cleaning a certain number of dirty pages based on various
    mm heuristics. It may not write the full set of dirty pages or wait for
    I/O to complete. Integrity writeback is responsible for persisting a set
    of dirty pages before the writeback job completes. For example, an
    fsync() call must perform integrity writeback to ensure data is on disk
    before the call returns.

    write_cache_pages() unconditionally breaks out of its processing loop in
    the event of a ->writepage() error. This is fine for background
    writeback, which has no strict requirements and will eventually come
    around again. This can cause problems for integrity writeback on
    filesystems that might need to clean up state associated with failed page
    writeouts. For example, XFS performs internal delayed allocation
    accounting before returning a ->writepage() error, where applicable. If
    the current writeback happens to be associated with an unmount and
    write_cache_pages() completes the writeback prematurely due to error, the
    filesystem is unmounted in an inconsistent state if dirty+delalloc pages
    still exist.

    To handle this problem, update write_cache_pages() to always process the
    full set of pages for integrity writeback regardless of ->writepage()
    errors. Save the first encountered error and return it to the caller once
    complete. This facilitates XFS (or any other fs that expects integrity
    writeback to process the entire set of dirty pages) to clean up its
    internal state completely in the event of persistent mapping errors.
    Background writeback continues to exit on the first error encountered.
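
    A minimal userspace model of that policy (write_pages() is an
    illustrative stand-in for the write_cache_pages() loop, not the kernel
    code):

    #include <stdio.h>

    /* Integrity writeback (sync_all) visits every page and reports the
     * first error; background writeback stops at the first error. */
    static int write_pages(const int *page_err, int npages, int sync_all,
                           int *visited)
    {
            int ret = 0, i;

            *visited = 0;
            for (i = 0; i < npages; i++) {
                    int error = page_err[i]; /* stand-in for ->writepage() */

                    (*visited)++;
                    if (error) {
                            if (!ret)
                                    ret = error; /* keep the first error */
                            if (!sync_all)
                                    break; /* background: bail out early */
                    }
            }
            return ret;
    }

    int main(void)
    {
            const int errs[] = { 0, -5, 0, -12, 0 };
            int visited, ret;

            ret = write_pages(errs, 5, 1, &visited);
            printf("integrity: ret=%d visited=%d\n", ret, visited);  /* -5, 5 */
            ret = write_pages(errs, 5, 0, &visited);
            printf("background: ret=%d visited=%d\n", ret, visited); /* -5, 2 */
            return 0;
    }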

    [akpm@linux-foundation.org: fix typo in comment]
    Link: http://lkml.kernel.org/r/20181116134304.32440-1-bfoster@redhat.com
    Signed-off-by: Brian Foster
    Reviewed-by: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Brian Foster
     

29 Oct, 2018

1 commit

  • Pull XArray conversion from Matthew Wilcox:
    "The XArray provides an improved interface to the radix tree data
    structure, providing locking as part of the API, specifying GFP flags
    at allocation time, eliminating preloading, less re-walking the tree,
    more efficient iterations and not exposing RCU-protected pointers to
    its users.

    This patch set

    1. Introduces the XArray implementation

    2. Converts the pagecache to use it

    3. Converts memremap to use it

    The page cache is the most complex and important user of the radix
    tree, so converting it was most important. Converting the memremap
    code removes the only other user of the multiorder code, which allows
    us to remove the radix tree code that supported it.

    I have 40+ followup patches to convert many other users of the radix
    tree over to the XArray, but I'd like to get this part in first. The
    other conversions haven't been in linux-next and aren't suitable for
    applying yet, but you can see them in the xarray-conv branch if you're
    interested"

    * 'xarray' of git://git.infradead.org/users/willy/linux-dax: (90 commits)
    radix tree: Remove multiorder support
    radix tree test: Convert multiorder tests to XArray
    radix tree tests: Convert item_delete_rcu to XArray
    radix tree tests: Convert item_kill_tree to XArray
    radix tree tests: Move item_insert_order
    radix tree test suite: Remove multiorder benchmarking
    radix tree test suite: Remove __item_insert
    memremap: Convert to XArray
    xarray: Add range store functionality
    xarray: Move multiorder_check to in-kernel tests
    xarray: Move multiorder_shrink to kernel tests
    xarray: Move multiorder account test in-kernel
    radix tree test suite: Convert iteration test to XArray
    radix tree test suite: Convert tag_tagged_items to XArray
    radix tree: Remove radix_tree_clear_tags
    radix tree: Remove radix_tree_maybe_preload_order
    radix tree: Remove split/join code
    radix tree: Remove radix_tree_update_node_t
    page cache: Finish XArray conversion
    dax: Convert page fault handlers to XArray
    ...

    Linus Torvalds
     

27 Oct, 2018

1 commit

  • We've recently seen a workload on XFS filesystems with a repeatable
    deadlock between background writeback and a multi-process application
    doing concurrent writes and fsyncs to a small range of a file.

    range_cyclic
    writeback               Process 1               Process 2

    xfs_vm_writepages
      write_cache_pages
        writeback_index = 2
        cycled = 0
        ....
        find page 2 dirty
        lock Page 2
        ->writepage
          page 2 writeback
          page 2 clean
          page 2 added to bio
        no more pages
                            write()
                              locks page 1
                              dirties page 1
                              locks page 2
                              dirties page 1
                              fsync()
                              ....
                              xfs_vm_writepages
                              write_cache_pages
                                start index 0
                                find page 1 towrite
                                lock Page 1
                                ->writepage
                                  page 1 writeback
                                  page 1 clean
                                  page 1 added to bio
                                find page 2 towrite
                                lock Page 2
                                page 2 is writeback
                                <blocks>
                                                    write()
                                                      locks page 1
                                                      dirties page 1
                                                      fsync()
                                                      ....
                                                    xfs_vm_writepages
                                                    write_cache_pages
                                                      start index 0

        !done && !cycled
          sets index to 0, restarts lookup
        find page 1 dirty
        find page 1 towrite
        lock Page 1
          page 1 is writeback
          <blocks>

                                                    lock Page 1
                                                    <blocks>

    DEADLOCK because:

    - process 1 needs page 2 writeback to complete to make
    enough progress to issue IO pending for page 1
    - writeback needs page 1 writeback to complete so process 2
    can progress and unlock the page it is blocked on, then it
    can issue the IO pending for page 2
    - process 2 can't make progress until process 1 issues IO
    for page 1

    The underlying cause of the problem here is that range_cyclic writeback is
    processing pages in descending index order as we hold higher index pages
    in a structure controlled from above write_cache_pages(). The
    write_cache_pages() caller needs to be able to submit these pages for IO
    before write_cache_pages restarts writeback at mapping index 0 to avoid
    wcp inverting the page lock/writeback wait order.

    generic_writepages() is not susceptible to this bug as it has no private
    context held across write_cache_pages() - filesystems using this
    infrastructure always submit pages in ->writepage immediately and so there
    is no problem with range_cyclic going back to mapping index 0.

    However:
    mpage_writepages() has a private bio context,
    exofs_writepages() has page_collect
    fuse_writepages() has fuse_fill_wb_data
    nfs_writepages() has nfs_pageio_descriptor
    xfs_vm_writepages() has xfs_writepage_ctx

    All of these ->writepages implementations can hold pages under writeback
    in their private structures until write_cache_pages() returns, and hence
    they are all susceptible to this deadlock.

    Also worth noting is that ext4 has its own bastardised version of
    write_cache_pages() and so it /may/ have an equivalent deadlock. I looked
    at the code long enough to understand that it has a similar retry loop for
    range_cyclic writeback reaching the end of the file and then promptly ran
    away before my eyes bled too much. I'll leave it for the ext4 developers
    to determine whether their code actually has this deadlock and how to fix
    it if it does.

    There are a few ways I can see to avoid this deadlock. There are probably
    more, but these are the first I've thought of:

    1. get rid of range_cyclic altogether

    2. range_cyclic always stops at EOF, and we start again from
    writeback index 0 on the next call into write_cache_pages()

    2a. wcp also returns EAGAIN to ->writepages implementations to
    indicate range cyclic has hit EOF. writepages implementations can
    then flush the current context and call wcp again to continue. i.e.
    lift the retry into the ->writepages implementation

    3. range_cyclic uses trylock_page() rather than lock_page(), and it
    skips pages it can't lock without blocking. It will already do this
    for pages under writeback, so this seems like a no-brainer

    3a. all non-WB_SYNC_ALL writeback uses trylock_page() to avoid
    blocking as per pages under writeback.

    I don't think #1 is an option - range_cyclic prevents frequently
    dirtied lower file offsets from starving background writeback of
    rarely touched higher file offsets.

    #2 is simple, and I don't think it will have any impact on
    performance as going back to the start of the file implies an
    immediate seek. We'll have exactly the same number of seeks if we
    switch writeback to another inode, and then come back to this one
    later and restart from index 0.

    #2a is pretty much "status quo without the deadlock". Moving the
    retry loop up into the wcp caller means we can issue IO on the
    pending pages before calling wcp again, and so avoid locking or
    waiting on pages in the wrong order. I'm not convinced we need to do
    this given that we get the same thing from #2 on the next writeback
    call from the writeback infrastructure.

    #3 is really just a band-aid - it doesn't fix the access/wait
    inversion problem, just prevents it from becoming a deadlock
    situation. I'd prefer we fix the inversion, not sweep it under the
    carpet like this.

    #3a is really an optimisation that just so happens to include the
    band-aid fix of #3.

    So it seems that the simplest way to fix this issue is to implement
    solution #2.
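
    A compact userspace model of solution #2 (the names here are
    illustrative, not the kernel code): each call walks from the saved
    index to EOF and never wraps within the call; the wrap to index 0
    happens on the next call instead.

    #include <stdio.h>

    struct mapping_model {
            long writeback_index; /* where the next cyclic pass starts */
            long eof_index;       /* last page of the file */
    };

    static void writeback_pass(struct mapping_model *m)
    {
            long index = m->writeback_index;

            if (index > m->eof_index)
                    index = 0; /* previous pass hit EOF: start over */

            for (; index <= m->eof_index; index++)
                    printf("write page %ld\n", index);

            /* No mid-call restart from index 0 - that was the deadlock. */
            m->writeback_index = index;
    }

    int main(void)
    {
            struct mapping_model m = { .writeback_index = 2, .eof_index = 4 };

            writeback_pass(&m); /* pages 2..4 */
            writeback_pass(&m); /* pages 0..4, on a separate call */
            return 0;
    }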

    Link: http://lkml.kernel.org/r/20181005054526.21507-1-david@fromorbit.com
    Signed-off-by: Dave Chinner
    Reviewed-by: Jan Kara
    Cc: Nicholas Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Chinner
     

21 Oct, 2018

1 commit


30 Aug, 2018

1 commit


18 Aug, 2018

1 commit

  • Commit 93f78d882865 ("writeback: move backing_dev_info->bdi_stat[] into
    bdi_writeback") replaced BDI_DIRTIED with WB_DIRTIED in
    account_page_redirty(). Update comment to track that change.

    BDI_DIRTIED => WB_DIRTIED
    BDI_WRITTEN => WB_WRITTEN

    Link: http://lkml.kernel.org/r/20180625171526.173483-1-gthelen@google.com
    Signed-off-by: Greg Thelen
    Reviewed-by: Jan Kara
    Acked-by: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Greg Thelen
     

21 Apr, 2018

1 commit

  • lock_page_memcg()/unlock_page_memcg() use spin_lock_irqsave/restore() if
    the page's memcg is undergoing move accounting, which occurs when a
    process leaves its memcg for a new one that has
    memory.move_charge_at_immigrate set.

    unlocked_inode_to_wb_begin,end() use spin_lock_irq/spin_unlock_irq() if
    the given inode is switching writeback domains. Switches occur when
    enough writes are issued from a new domain.

    This existing pattern is thus suspicious:

      lock_page_memcg(page);
      unlocked_inode_to_wb_begin(inode, &locked);
      ...
      unlocked_inode_to_wb_end(inode, locked);
      unlock_page_memcg(page);

    If both inode switch and process memcg migration are both in-flight then
    unlocked_inode_to_wb_end() will unconditionally enable interrupts while
    still holding the lock_page_memcg() irq spinlock. This suggests the
    possibility of deadlock if an interrupt occurs before unlock_page_memcg().

    truncate
    __cancel_dirty_page
      lock_page_memcg
      unlocked_inode_to_wb_begin
      unlocked_inode_to_wb_end
        <interrupts mistakenly enabled>
                                    <interrupt>
                                    end_page_writeback
                                      test_clear_page_writeback
                                      lock_page_memcg
                                        <deadlock>
      unlock_page_memcg

    Due to configuration limitations this deadlock is not currently possible
    because we don't mix cgroup writeback (a cgroupv2 feature) and
    memory.move_charge_at_immigrate (a cgroupv1 feature).

    If the kernel is hacked to always claim inode switching and memcg
    moving_account, then this script triggers lockup in less than a minute:

    cd /mnt/cgroup/memory
    mkdir a b
    echo 1 > a/memory.move_charge_at_immigrate
    echo 1 > b/memory.move_charge_at_immigrate
    (
            echo $BASHPID > a/cgroup.procs
            while true; do
                    dd if=/dev/zero of=/mnt/big bs=1M count=256
            done
    ) &
    while true; do
            sync
    done &
    sleep 1h &
    SLEEP=$!
    while true; do
            echo $SLEEP > a/cgroup.procs
            echo $SLEEP > b/cgroup.procs
    done

    The deadlock does not seem possible, so it's debatable whether there's
    any reason to modify the kernel. I suggest we should, to prevent future
    surprises. And Wang Long said "this deadlock occurs three times in our
    environment", so there's more reason to apply this, even to stable.
    Stable 4.4 has minor conflicts applying this patch. For a clean 4.4 patch
    see "[PATCH for-4.4] writeback: safer lock nesting"
    https://lkml.org/lkml/2018/4/11/146
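
    A sketch of the safer nesting the patch introduces, as a userspace
    model (struct wb_lock_cookie mirrors the upstream name; the irq flag is
    simulated): begin saves the interrupt state and end restores exactly
    that state, instead of unconditionally re-enabling interrupts.

    #include <stdbool.h>
    #include <stdio.h>

    static bool irqs_enabled = true; /* stand-in for the CPU irq flag */

    struct wb_lock_cookie {
            bool locked;
            unsigned long flags;
    };

    static void wb_begin(struct wb_lock_cookie *cookie, bool need_lock)
    {
            cookie->locked = need_lock;
            if (need_lock) {
                    cookie->flags = irqs_enabled; /* local_irq_save() */
                    irqs_enabled = false;
            }
    }

    static void wb_end(struct wb_lock_cookie *cookie)
    {
            if (cookie->locked)
                    irqs_enabled = cookie->flags; /* local_irq_restore() */
    }

    int main(void)
    {
            struct wb_lock_cookie cookie;

            irqs_enabled = false; /* outer lock_page_memcg() disabled irqs */
            wb_begin(&cookie, true);
            wb_end(&cookie);
            printf("irqs re-enabled under the outer lock? %s\n",
                   irqs_enabled ? "yes (bug)" : "no (fixed)");
            return 0;
    }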

    [gthelen@google.com: v4]
    Link: http://lkml.kernel.org/r/20180411084653.254724-1-gthelen@google.com
    [akpm@linux-foundation.org: comment tweaks, struct initialization simplification]
    Change-Id: Ibb773e8045852978f6207074491d262f1b3fb613
    Link: http://lkml.kernel.org/r/20180410005908.167976-1-gthelen@google.com
    Fixes: 682aa8e1a6a1 ("writeback: implement unlocked_inode_to_wb transaction and use it for stat updates")
    Signed-off-by: Greg Thelen
    Reported-by: Wang Long
    Acked-by: Wang Long
    Acked-by: Michal Hocko
    Reviewed-by: Andrew Morton
    Cc: Johannes Weiner
    Cc: Tejun Heo
    Cc: Nicholas Piggin
    Cc: [v4.2+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Greg Thelen
     

12 Apr, 2018

1 commit

  • Remove the address_space ->tree_lock and use the xa_lock newly added to
    the radix_tree_root. Rename the address_space ->page_tree to ->i_pages,
    since we don't really care that it's a tree.

    [willy@infradead.org: fix nds32, fs/dax.c]
    Link: http://lkml.kernel.org/r/20180406145415.GB20605@bombadil.infradead.org
    Link: http://lkml.kernel.org/r/20180313132639.17387-9-willy@infradead.org
    Signed-off-by: Matthew Wilcox
    Acked-by: Jeff Layton
    Cc: Darrick J. Wong
    Cc: Dave Chinner
    Cc: Ryusuke Konishi
    Cc: Will Deacon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     

30 Nov, 2017

1 commit

  • This reverts commit 0f6d24f87856 ("mm/page-writeback.c: print a warning
    if the vm dirtiness settings are illogical") because it causes false
    positive warnings during OOM situations as noticed by Tetsuo Handa:

    Node 0 active_anon:3525940kB inactive_anon:8372kB active_file:216kB inactive_file:1872kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:2504kB dirty:52kB writeback:0kB shmem:8660kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 636928kB writeback_tmp:0kB unstable:0kB all_unreclaimable? yes
    Node 0 DMA free:14848kB min:284kB low:352kB high:420kB active_anon:992kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15988kB managed:15904kB mlocked:0kB kernel_stack:0kB pagetables:24kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
    lowmem_reserve[]: 0 2687 3645 3645
    Node 0 DMA32 free:53004kB min:49608kB low:62008kB high:74408kB active_anon:2712648kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:3129216kB managed:2773132kB mlocked:0kB kernel_stack:96kB pagetables:5096kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
    lowmem_reserve[]: 0 0 958 958
    Node 0 Normal free:17140kB min:17684kB low:22104kB high:26524kB active_anon:812300kB inactive_anon:8372kB active_file:1228kB inactive_file:1868kB unevictable:0kB writepending:52kB present:1048576kB managed:981224kB mlocked:0kB kernel_stack:3520kB pagetables:8552kB bounce:0kB free_pcp:120kB local_pcp:120kB free_cma:0kB
    lowmem_reserve[]: 0 0 0 0
    [...]
    Out of memory: Kill process 8459 (a.out) score 999 or sacrifice child
    Killed process 8459 (a.out) total-vm:4180kB, anon-rss:88kB, file-rss:0kB, shmem-rss:0kB
    oom_reaper: reaped process 8459 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
    vm direct limit must be set greater than background limit.

    The problem is that both thresh and bg_thresh will be 0 if
    available_memory is less than 4 pages when evaluating
    global_dirtyable_memory.

    While this might be worked around, the whole point of the warning is
    dubious at best. We do rely on admins to do sensible things when
    changing tunable knobs. Dirty memory writeback knobs are not special in
    that regard, so revert the warning rather than adding more hacks to
    work around it.

    Debugged by Yafang Shao.

    Link: http://lkml.kernel.org/r/20171127091939.tahb77nznytcxw55@dhcp22.suse.cz
    Fixes: 0f6d24f87856 ("mm/page-writeback.c: print a warning if the vm dirtiness settings are illogical")
    Signed-off-by: Michal Hocko
    Reported-by: Tetsuo Handa
    Cc: Yafang Shao
    Cc: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

22 Nov, 2017

1 commit

  • In preparation for unconditionally passing the struct timer_list pointer to
    all timer callbacks, switch to using the new timer_setup() and from_timer()
    to pass the timer pointer explicitly.
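
    For reference, a userspace model of the from_timer() idea (the struct
    and callback below are illustrative; in the kernel, from_timer() is
    container_of() over the embedded timer_list):

    #include <stddef.h>
    #include <stdio.h>

    struct timer_list { long expires; };

    #define container_of(ptr, type, member) \
            ((type *)((char *)(ptr) - offsetof(type, member)))

    struct wb_domain_model {
            struct timer_list period_timer;
            const char *name;
    };

    /* The callback now receives the timer pointer and recovers its
     * enclosing object, instead of a casted unsigned long argument. */
    static void timer_fn(struct timer_list *t)
    {
            struct wb_domain_model *dom =
                    container_of(t, struct wb_domain_model, period_timer);

            printf("timer fired for %s\n", dom->name);
    }

    int main(void)
    {
            struct wb_domain_model dom = { .name = "global_wb_domain" };

            timer_fn(&dom.period_timer);
            return 0;
    }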

    Cc: Jens Axboe
    Cc: Michal Hocko
    Cc: Andrew Morton
    Cc: Jan Kara
    Cc: Johannes Weiner
    Cc: Nicholas Piggin
    Cc: Vladimir Davydov
    Cc: Matthew Wilcox
    Cc: Jeff Layton
    Cc: linux-block@vger.kernel.org
    Cc: linux-mm@kvack.org
    Signed-off-by: Kees Cook

    Kees Cook
     

16 Nov, 2017

8 commits

  • The parameter `struct bdi_writeback *wb` is not used in the
    function body. Remove it.

    Link: http://lkml.kernel.org/r/1509685485-15278-1-git-send-email-wanglong19@meituan.com
    Signed-off-by: Wang Long
    Reviewed-by: Jan Kara
    Acked-by: Tejun Heo
    Cc: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wang Long
     
  • Every pagevec_init user claims the pages being released are hot even in
    cases where it is unlikely the pages are hot. As no one cares about the
    hotness of pages being released to the allocator, just ditch the
    parameter.

    No performance impact is expected as the overhead is marginal. The
    parameter is removed simply because it is a bit stupid to have a useless
    parameter copied everywhere.

    Link: http://lkml.kernel.org/r/20171018075952.10627-6-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Cc: Andi Kleen
    Cc: Dave Chinner
    Cc: Dave Hansen
    Cc: Jan Kara
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Patch series "Speed up page cache truncation", v1.

    When rebasing our enterprise distro to a newer kernel (from 4.4 to 4.12)
    we noticed a regression in the bonnie++ benchmark when deleting files.
    Eventually we have tracked this down to a fact that page cache
    truncation got slower by about 10%. There were both gains and losses in
    the above interval of kernels but we have been able to identify that
    commit 83929372f629 ("filemap: prepare find and delete operations for
    huge pages") caused about 10% regression on its own.

    After some investigation it didn't seem easily possible to fix the
    regression while maintaining the THP in page cache functionality so
    we've decided to optimize the page cache truncation path instead to make
    up for the change. This series is a result of that effort.

    Patch 1 is an easy speedup of cancel_dirty_page(). Patches 2-6 refactor
    page cache truncation code so that it is easier to batch radix tree
    operations. Patch 7 implements batching of deletes from the radix tree
    which more than makes up for the original regression.

    This patch (of 7):

    cancel_dirty_page() does quite some work even for clean pages (fetching
    of mapping, locking of memcg, atomic bit op on page flags) so it
    accounts for ~2.5% of cost of truncation of a clean page. That is not
    much but still dumb for something we don't need at all. Check whether a
    page is actually dirty and avoid any work if not.
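
    The resulting fast path is essentially this (a sketch close to the
    upstream inline wrapper):

    static inline void cancel_dirty_page(struct page *page)
    {
            /* Avoid the mapping fetch, memcg locking and atomic bit op
             * entirely for pages that are not dirty. */
            if (PageDirty(page))
                    __cancel_dirty_page(page);
    }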

    Link: http://lkml.kernel.org/r/20171010151937.26984-2-jack@suse.cz
    Signed-off-by: Jan Kara
    Acked-by: Mel Gorman
    Reviewed-by: Andi Kleen
    Cc: Dave Hansen
    Cc: Dave Chinner
    Cc: "Kirill A. Shutemov"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
  • In preparation for unconditionally passing the struct timer_list pointer
    to all timer callbacks, switch to using the new timer_setup() and
    from_timer() to pass the timer pointer explicitly.

    Link: http://lkml.kernel.org/r/20171016225913.GA99214@beast
    Signed-off-by: Kees Cook
    Reviewed-by: Jan Kara
    Cc: Johannes Weiner
    Cc: Vladimir Davydov
    Cc: Matthew Wilcox
    Cc: Jeff Layton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kees Cook
     
  • All users of pagevec_lookup() and pagevec_lookup_range() now pass
    PAGEVEC_SIZE as a desired number of pages. Just drop the argument.

    Link: http://lkml.kernel.org/r/20171009151359.31984-15-jack@suse.cz
    Signed-off-by: Jan Kara
    Reviewed-by: Daniel Jordan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
  • Use pagevec_lookup_range_tag() in write_cache_pages() as it is
    interested only in pages from given range. Remove unnecessary code
    resulting from this.

    Link: http://lkml.kernel.org/r/20171009151359.31984-12-jack@suse.cz
    Signed-off-by: Jan Kara
    Reviewed-by: Daniel Jordan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
  • The vm direct limit setting must be set greater than the vm background
    limit setting. Otherwise print a warning to help the operator figure
    out that the vm dirtiness settings are in an illogical state.

    Link: http://lkml.kernel.org/r/1506592464-30962-1-git-send-email-laoar.shao@gmail.com
    Signed-off-by: Yafang Shao
    Cc: Jan Kara
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yafang Shao
     
  • "mapping" parameter to balance_dirty_pages() is not used anymore.

    Fixes: dfb8ae567835 ("writeback: let balance_dirty_pages() work on the matching cgroup bdi_writeback")
    Link: http://lkml.kernel.org/r/20170927221311.23263-1-tahsin@google.com
    Signed-off-by: Tahsin Erdogan
    Acked-by: Michal Hocko
    Acked-by: Johannes Weiner
    Cc: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tahsin Erdogan
     

14 Oct, 2017

1 commit


09 Oct, 2017

1 commit

  • After periodic writeback is disabled by writing 0 to
    dirty_writeback_centisecs, the handler wb_workfn() will not be
    entered again until the dirty background limit is reached, a sync
    syscall is executed, not enough free memory is available, or
    vmscan is triggered.

    So periodic writeback can't be re-enabled by writing a non-zero
    value to dirty_writeback_centisecs.
    As it can be disabled via sysctl, it should be possible to enable
    it via sysctl as well.
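
    A userspace model of the behaviour being fixed (all names below are
    illustrative, not the kernel's): the periodic worker only re-arms its
    timer while the interval is non-zero, so the transition from 0 back to
    non-zero has to kick the worker explicitly.

    #include <stdio.h>

    static int interval_cs;  /* models dirty_writeback_centisecs */
    static int timer_armed;

    static void worker(void)
    {
            timer_armed = 0;
            /* ... write back dirty pages ... */
            if (interval_cs)
                    timer_armed = 1; /* re-arm for the next period */
    }

    static void sysctl_write(int new_cs)
    {
            int old_cs = interval_cs;

            interval_cs = new_cs;
            if (!old_cs && new_cs)
                    worker(); /* the fix: 0 -> non-zero kicks the worker */
    }

    int main(void)
    {
            interval_cs = 500;
            worker();          /* periodic mode: re-arms itself */
            sysctl_write(0);
            worker();          /* last run: does not re-arm */
            sysctl_write(500); /* without the kick, nothing re-arms */
            printf("timer armed again: %s\n", timer_armed ? "yes" : "no");
            return 0;
    }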

    Reviewed-by: Jan Kara
    Signed-off-by: Yafang Shao
    Signed-off-by: Jens Axboe

    Yafang Shao
     

03 Oct, 2017

2 commits


07 Sep, 2017

1 commit

  • global_page_state is error prone, as a recent bug report pointed out [1].
    It only returns proper values for zone-based counters, as the enum it
    gets suggests. We already have global_node_page_state, so let's rename
    global_page_state to global_zone_page_state to be more explicit here.
    All existing users seem to be correct:

    $ git grep "global_page_state(NR_" | sed 's@.*(\(NR_[A-Z_]*\)).*@\1@' | sort | uniq -c
    2 NR_BOUNCE
    2 NR_FREE_CMA_PAGES
    11 NR_FREE_PAGES
    1 NR_KERNEL_STACK_KB
    1 NR_MLOCK
    2 NR_PAGETABLE

    This patch shouldn't introduce any functional change.

    [1] http://lkml.kernel.org/r/201707260628.v6Q6SmaS030814@www262.sakura.ne.jp

    Link: http://lkml.kernel.org/r/20170801134256.5400-2-hannes@cmpxchg.org
    Signed-off-by: Michal Hocko
    Signed-off-by: Johannes Weiner
    Cc: Tetsuo Handa
    Cc: Josef Bacik
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

19 Aug, 2017

1 commit

  • Jaegeuk and Brad report a NULL pointer crash when writeback ending tries
    to update the memcg stats:

    BUG: unable to handle kernel NULL pointer dereference at 00000000000003b0
    IP: test_clear_page_writeback+0x12e/0x2c0
    [...]
    RIP: 0010:test_clear_page_writeback+0x12e/0x2c0
    Call Trace:

    end_page_writeback+0x47/0x70
    f2fs_write_end_io+0x76/0x180 [f2fs]
    bio_endio+0x9f/0x120
    blk_update_request+0xa8/0x2f0
    scsi_end_request+0x39/0x1d0
    scsi_io_completion+0x211/0x690
    scsi_finish_command+0xd9/0x120
    scsi_softirq_done+0x127/0x150
    __blk_mq_complete_request_remote+0x13/0x20
    flush_smp_call_function_queue+0x56/0x110
    generic_smp_call_function_single_interrupt+0x13/0x30
    smp_call_function_single_interrupt+0x27/0x40
    call_function_single_interrupt+0x89/0x90
    RIP: 0010:native_safe_halt+0x6/0x10

    (gdb) l *(test_clear_page_writeback+0x12e)
    0xffffffff811bae3e is in test_clear_page_writeback (./include/linux/memcontrol.h:619).
    614             mod_node_page_state(page_pgdat(page), idx, val);
    615             if (mem_cgroup_disabled() || !page->mem_cgroup)
    616                     return;
    617             mod_memcg_state(page->mem_cgroup, idx, val);
    618             pn = page->mem_cgroup->nodeinfo[page_to_nid(page)];
    619             this_cpu_add(pn->lruvec_stat->count[idx], val);
    620     }
    621
    622     unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order,
    623                                                 gfp_t gfp_mask,

    The issue is that writeback doesn't hold a page reference and the page
    might get freed after PG_writeback is cleared (and the mapping is
    unlocked) in test_clear_page_writeback(). The stat functions looking up
    the page's node or zone are safe, as those attributes are static across
    allocation and free cycles. But page->mem_cgroup is not, and it will
    get cleared if we race with truncation or migration.

    It appears this race window has been around for a while, but it was
    less likely to trigger when the memcg stats were updated first thing
    after PG_writeback was cleared. Recent changes reshuffled this code to update
    the global node stats before the memcg ones, though, stretching the race
    window out to an extent where people can reproduce the problem.

    Update test_clear_page_writeback() to look up and pin page->mem_cgroup
    before clearing PG_writeback, then not use that pointer afterward. It
    is a partial revert of 62cccb8c8e7a ("mm: simplify lock_page_memcg()")
    but leaves the pageref-holding callsites that aren't affected alone.
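
    The resulting pattern is roughly this (a sketch of the fix; upstream
    changed lock_page_memcg() to return the pinned memcg and added
    __unlock_page_memcg()):

    struct mem_cgroup *memcg;

    memcg = lock_page_memcg(page);      /* pins page->mem_cgroup */
    ret = TestClearPageWriteback(page);
    /* ... update stats against the pinned memcg, never through
     * page->mem_cgroup, which may already be cleared ... */
    __unlock_page_memcg(memcg);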

    Link: http://lkml.kernel.org/r/20170809183825.GA26387@cmpxchg.org
    Fixes: 62cccb8c8e7a ("mm: simplify lock_page_memcg()")
    Signed-off-by: Johannes Weiner
    Reported-by: Jaegeuk Kim
    Tested-by: Jaegeuk Kim
    Reported-by: Bradley Bolen
    Tested-by: Brad Bolen
    Cc: Vladimir Davydov
    Cc: Michal Hocko
    Cc: [4.6+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

13 Jul, 2017

1 commit

  • Currently the writeback statistics code uses percpu counters to hold
    various statistics. Furthermore we have 2 families of functions - those
    which disable local irq and those which don't and whose names begin
    with a double underscore. However, they both end up calling
    __add_wb_stat, which in turn calls percpu_counter_add_batch, which is
    already irq-safe.

    Exploiting this fact allows us to eliminate the __wb_* functions, since
    they don't add any further protection than we already have.
    Furthermore, refactor the wb_* functions to call __add_wb_stat directly
    without the irq-disabling dance. This will likely result in better
    runtime for code which modifies the stat counters.

    While at it also document why percpu_counter_add_batch is in fact
    preempt- and irq-safe, since at least 3 people got confused.
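
    The resulting helpers are essentially this (a sketch close to the
    upstream code):

    static inline void __add_wb_stat(struct bdi_writeback *wb,
                                     enum wb_stat_item item, s64 amount)
    {
            percpu_counter_add_batch(&wb->stat[item], amount, WB_STAT_BATCH);
    }

    static inline void inc_wb_stat(struct bdi_writeback *wb,
                                   enum wb_stat_item item)
    {
            /* No local_irq_save() dance: percpu_counter_add_batch()
             * uses this_cpu_add(), which is already preempt/irq-safe. */
            __add_wb_stat(wb, item, 1);
    }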

    Link: http://lkml.kernel.org/r/1498029937-27293-1-git-send-email-nborisov@suse.com
    Signed-off-by: Nikolay Borisov
    Acked-by: Tejun Heo
    Reviewed-by: Jan Kara
    Cc: Josef Bacik
    Cc: Mel Gorman
    Cc: Jeff Layton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nikolay Borisov
     

08 Jul, 2017

1 commit

  • Pull Writeback error handling fixes from Jeff Layton:
    "The main rationale for all of these changes is to tighten up writeback
    error reporting to userland. There are many ways now that writeback
    errors can be lost, such that fsync/fdatasync/msync return 0 when
    writeback actually failed.

    This pile contains a small set of cleanups and writeback error
    handling fixes that I was able to break off from the main pile (#2).

    Two of the patches in this pile are trivial. The exceptions are the
    patch to fix up error handling in write_one_page, and the patch to
    make JFS pay attention to write_one_page errors"

    * tag 'for-linus-v4.13-1' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux:
    fs: remove call_fsync helper function
    mm: clean up error handling in write_one_page
    JFS: do not ignore return code from write_one_page()
    mm: drop "wait" parameter from write_one_page()

    Linus Torvalds
     

07 Jul, 2017

1 commit

  • lruvecs are at the intersection of the NUMA node and memcg, which is the
    scope for most paging activity.

    Introduce a convenient accounting infrastructure that maintains
    statistics per node, per memcg, and the lruvec itself.

    Then convert over accounting sites for statistics that are already
    tracked in both nodes and memcgs and can be easily switched.

    [hannes@cmpxchg.org: fix crash in the new cgroup stat keeping code]
    Link: http://lkml.kernel.org/r/20170531171450.GA10481@cmpxchg.org
    [hannes@cmpxchg.org: don't track uncharged pages at all]
    Link: http://lkml.kernel.org/r/20170605175254.GA8547@cmpxchg.org
    [hannes@cmpxchg.org: add missing free_percpu()]
    Link: http://lkml.kernel.org/r/20170605175354.GB8547@cmpxchg.org
    [linux@roeck-us.net: hexagon: fix build error caused by include file order]
    Link: http://lkml.kernel.org/r/20170617153721.GA4382@roeck-us.net
    Link: http://lkml.kernel.org/r/20170530181724.27197-6-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Signed-off-by: Guenter Roeck
    Acked-by: Vladimir Davydov
    Cc: Josef Bacik
    Cc: Michal Hocko
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

06 Jul, 2017

2 commits

  • Don't try to check PageError since that's potentially racy and not
    necessarily going to be set after writepage errors out.

    Instead, check the mapping for an error after writepage returns. That
    should also help us detect errors that occurred if the VM tried to
    clean the page earlier due to memory pressure.
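
    The pattern in write_one_page() then becomes roughly this (a sketch;
    filemap_check_errors() is the existing helper that collects
    AS_EIO/AS_ENOSPC from the mapping):

    if (clear_page_dirty_for_io(page)) {
            get_page(page);
            ret = mapping->a_ops->writepage(page, &wbc);
            if (ret == 0)
                    wait_on_page_writeback(page);
            put_page(page);
    } else {
            unlock_page(page);
    }

    if (!ret)
            ret = filemap_check_errors(mapping);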

    Signed-off-by: Jeff Layton
    Reviewed-by: Jan Kara

    Jeff Layton
     
  • The callers of write_one_page() all set the "wait" parameter to 1, so
    drop it.

    Also, make it clear that this function will not set any sort of AS_*
    error, and that the caller must do so if necessary. No existing caller
    uses this on normal files, so none of them need it.

    Also, add __must_check here since, in general, the callers need to handle
    an error here in some fashion.

    Link: http://lkml.kernel.org/r/20170525103303.6524-1-jlayton@redhat.com
    Signed-off-by: Jeff Layton
    Reviewed-by: Ross Zwisler
    Reviewed-by: Jan Kara
    Reviewed-by: Matthew Wilcox
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Andrew Morton

    Jeff Layton
     

09 May, 2017

1 commit

  • Pull ext4 updates from Ted Ts'o:

    - add GETFSMAP support

    - some performance improvements for very large file systems and for
    random write workloads into a preallocated file

    - bug fixes and cleanups.

    * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
    jbd2: cleanup write flags handling from jbd2_write_superblock()
    ext4: mark superblock writes synchronous for nobarrier mounts
    ext4: inherit encryption xattr before other xattrs
    ext4: replace BUG_ON with WARN_ONCE in ext4_end_bio()
    ext4: avoid unnecessary transaction stalls during writeback
    ext4: preload block group descriptors
    ext4: make ext4_shutdown() static
    ext4: support GETFSMAP ioctls
    vfs: add common GETFSMAP ioctl definitions
    ext4: evict inline data when writing to memory map
    ext4: remove ext4_xattr_check_entry()
    ext4: rename ext4_xattr_check_names() to ext4_xattr_check_entries()
    ext4: merge ext4_xattr_list() into ext4_listxattr()
    ext4: constify static data that is never modified
    ext4: trim return value and 'dir' argument from ext4_insert_dentry()
    jbd2: fix dbench4 performance regression for 'nobarrier' mounts
    jbd2: Fix lockdep splat with generic/270 test
    mm: retry writepages() on ENOMEM when doing an data integrity writeback

    Linus Torvalds
     

04 May, 2017

3 commits

  • The memory controllers stat function names are awkwardly long and
    arbitrarily different from the zone and node stat functions.

    The current interface is named:

    mem_cgroup_read_stat()
    mem_cgroup_update_stat()
    mem_cgroup_inc_stat()
    mem_cgroup_dec_stat()
    mem_cgroup_update_page_stat()
    mem_cgroup_inc_page_stat()
    mem_cgroup_dec_page_stat()

    This patch renames it to match the corresponding node stat functions:

    memcg_page_state() [node_page_state()]
    mod_memcg_state() [mod_node_state()]
    inc_memcg_state() [inc_node_state()]
    dec_memcg_state() [dec_node_state()]
    mod_memcg_page_state() [mod_node_page_state()]
    inc_memcg_page_state() [inc_node_page_state()]
    dec_memcg_page_state() [dec_node_page_state()]

    Link: http://lkml.kernel.org/r/20170404220148.28338-4-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Acked-by: Vladimir Davydov
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • The current duplication is a high-maintenance mess, and it's painful to
    add new items or query memcg state from the rest of the VM.

    This increases the size of the stat array marginally, but we should aim
    to track all these stats on a per-cgroup level anyway.

    Link: http://lkml.kernel.org/r/20170404220148.28338-3-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Acked-by: Vladimir Davydov
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Use setup_deferrable_timer() instead of init_timer_deferrable() to
    simplify the code.

    Link: http://lkml.kernel.org/r/e8e3d4280a34facbc007346f31df833cec28801e.1488070291.git.geliangtang@gmail.com
    Signed-off-by: Geliang Tang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Geliang Tang
     

28 Apr, 2017

1 commit

  • Currently, file system's writepages() function must not fail with an
    ENOMEM, since if they do, it's possible for buffered data to be lost.
    This is because on a data integrity writeback writepages() gets called
    but once, and if it returns ENOMEM, if you're lucky the error will get
    reflected back to the userspace process calling fsync(). If you
    aren't lucky, the user is unmounting the file system, and the dirty
    pages will simply be lost.

    For this reason, file system code generally will use GFP_NOFS, and in
    some cases, will retry the allocation in a loop, on the theory that
    "kernel livelocks are temporary; data loss is forever".
    Unfortunately, this can indeed cause livelocks, since inside the
    writepages() call, the file system is holding various mutexes, and
    these mutexes may prevent the OOM killer from killing its targeted
    victim if it is also holding on to those mutexes.

    A better solution would be to allow writepages() to call the memory
    allocator with flags that give greater latitude to the allocator to
    fail, and then release its locks and return ENOMEM, and in the case of
    background writeback, the writes can be retried at a later time. In
    the case of data-integrity writeback, retry after waiting a brief
    amount of time.
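
    The retry loop in do_writepages() ends up close to this sketch (only
    WB_SYNC_ALL, i.e. data-integrity writeback, retries on ENOMEM, after a
    short congestion wait):

    int do_writepages(struct address_space *mapping,
                      struct writeback_control *wbc)
    {
            int ret;

            if (wbc->nr_to_write <= 0)
                    return 0;
            while (1) {
                    if (mapping->a_ops->writepages)
                            ret = mapping->a_ops->writepages(mapping, wbc);
                    else
                            ret = generic_writepages(mapping, wbc);
                    if (ret != -ENOMEM || wbc->sync_mode != WB_SYNC_ALL)
                            break;
                    cond_resched();
                    congestion_wait(BLK_RW_ASYNC, HZ / 50);
            }
            return ret;
    }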

    Signed-off-by: Theodore Ts'o

    Theodore Ts'o