26 Jan, 2019

1 commit

  • [ Upstream commit 3fa750dcf29e8606e3969d13d8e188cc1c0f511d ]

    write_cache_pages() is used in both background and integrity writeback
    scenarios by various filesystems. Background writeback is mostly
    concerned with cleaning a certain number of dirty pages based on various
    mm heuristics. It may not write the full set of dirty pages or wait for
    I/O to complete. Integrity writeback is responsible for persisting a set
    of dirty pages before the writeback job completes. For example, an
    fsync() call must perform integrity writeback to ensure data is on disk
    before the call returns.

    write_cache_pages() unconditionally breaks out of its processing loop
    in the event of a ->writepage() error. This is fine for background
    writeback, which has no strict requirements and will eventually come
    around again. It can, however, cause problems for integrity writeback on
    filesystems that might need to clean up state associated with failed page
    writeouts. For example, XFS performs internal delayed allocation
    accounting before returning a ->writepage() error, where applicable. If
    the current writeback happens to be associated with an unmount and
    write_cache_pages() completes the writeback prematurely due to error, the
    filesystem is unmounted in an inconsistent state if dirty+delalloc pages
    still exist.

    To handle this problem, update write_cache_pages() to always process the
    full set of pages for integrity writeback regardless of ->writepage()
    errors. Save the first encountered error and return it to the caller once
    complete. This facilitates XFS (or any other fs that expects integrity
    writeback to process the entire set of dirty pages) to clean up its
    internal state completely in the event of persistent mapping errors.
    Background writeback continues to exit on the first error encountered.
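
    A simplified sketch of the resulting control flow (not the verbatim
    patch; next_dirty_page() is a hypothetical stand-in for the real
    pagevec lookup, and locking/tagging details are elided):

    static int write_cache_pages_sketch(struct address_space *mapping,
                                        struct writeback_control *wbc,
                                        writepage_t writepage, void *data)
    {
        int ret = 0;   /* first error seen, returned to the caller */
        int error;
        struct page *page;

        while ((page = next_dirty_page(mapping)) != NULL) {
            error = writepage(page, wbc, data);
            if (error && !ret)
                ret = error;   /* remember only the first error */

            /*
             * Background (WB_SYNC_NONE) writeback still bails on the
             * first error; integrity (WB_SYNC_ALL) writeback keeps
             * going so the filesystem sees every dirty page.
             */
            if (error && wbc->sync_mode == WB_SYNC_NONE)
                break;
        }
        return ret;
    }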

    [akpm@linux-foundation.org: fix typo in comment]
    Link: http://lkml.kernel.org/r/20181116134304.32440-1-bfoster@redhat.com
    Signed-off-by: Brian Foster
    Reviewed-by: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Brian Foster
     

24 Apr, 2018

1 commit

  • commit 2e898e4c0a3897ccd434adac5abb8330194f527b upstream.

    lock_page_memcg()/unlock_page_memcg() use spin_lock_irqsave/restore() if
    the page's memcg is undergoing move accounting, which occurs when a
    process leaves its memcg for a new one that has
    memory.move_charge_at_immigrate set.

    unlocked_inode_to_wb_begin,end() use spin_lock_irq/spin_unlock_irq() if
    the given inode is switching writeback domains. Switches occur when
    enough writes are issued from a new domain.

    This existing pattern is thus suspicious:

        lock_page_memcg(page);
        unlocked_inode_to_wb_begin(inode, &locked);
        ...
        unlocked_inode_to_wb_end(inode, locked);
        unlock_page_memcg(page);

    If an inode switch and a process memcg migration are both in flight,
    unlocked_inode_to_wb_end() will unconditionally enable interrupts while
    still holding the lock_page_memcg() irq spinlock. This suggests the
    possibility of deadlock if an interrupt occurs before unlock_page_memcg().

        truncate
          __cancel_dirty_page
            lock_page_memcg
            unlocked_inode_to_wb_begin
            unlocked_inode_to_wb_end
              <interrupts mistakenly enabled>
                                          <interrupt>
                                          end_page_writeback
                                            test_clear_page_writeback
                                              lock_page_memcg
                                                <deadlock>
            unlock_page_memcg

    Due to configuration limitations this deadlock is not currently possible
    because we don't mix cgroup writeback (a cgroupv2 feature) and
    memory.move_charge_at_immigrate (a cgroupv1 feature).

    If the kernel is hacked to always claim inode switching and memcg
    moving_account, then this script triggers lockup in less than a minute:

    cd /mnt/cgroup/memory
    mkdir a b
    echo 1 > a/memory.move_charge_at_immigrate
    echo 1 > b/memory.move_charge_at_immigrate
    (
      echo $BASHPID > a/cgroup.procs
      while true; do
        dd if=/dev/zero of=/mnt/big bs=1M count=256
      done
    ) &
    while true; do
      sync
    done &
    sleep 1h &
    SLEEP=$!
    while true; do
      echo $SLEEP > a/cgroup.procs
      echo $SLEEP > b/cgroup.procs
    done

    The deadlock does not seem possible today, so it's debatable whether
    there's any reason to modify the kernel. I suggest we do so to prevent
    future surprises. And Wang Long said "this deadlock occurs three times
    in our environment", so there's more reason to apply this, even to
    stable. Stable 4.4 has minor conflicts applying this patch. For a clean
    4.4 patch see "[PATCH for-4.4] writeback: safer lock nesting"
    https://lkml.org/lkml/2018/4/11/146
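
    The core of the fix, as a sketch: replace the bare "bool locked" with
    a cookie that also records the saved interrupt state, so the end
    helper restores rather than unconditionally re-enables interrupts.
    Simplified from the patch (memory barriers elided):

    struct wb_lock_cookie {
        bool locked;
        unsigned long flags;
    };

    static struct bdi_writeback *
    unlocked_inode_to_wb_begin(struct inode *inode,
                               struct wb_lock_cookie *cookie)
    {
        rcu_read_lock();
        cookie->locked = inode->i_state & I_WB_SWITCH;
        if (unlikely(cookie->locked))
            /* irqsave, not irq: the caller may already hold an
             * IRQ-disabling lock such as lock_page_memcg() */
            spin_lock_irqsave(&inode->i_mapping->tree_lock,
                              cookie->flags);
        return inode->i_wb;
    }

    static void unlocked_inode_to_wb_end(struct inode *inode,
                                         struct wb_lock_cookie *cookie)
    {
        if (unlikely(cookie->locked))
            spin_unlock_irqrestore(&inode->i_mapping->tree_lock,
                                   cookie->flags);
        rcu_read_unlock();
    }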

    [gthelen@google.com: v4]
    Link: http://lkml.kernel.org/r/20180411084653.254724-1-gthelen@google.com
    [akpm@linux-foundation.org: comment tweaks, struct initialization simplification]
    Change-Id: Ibb773e8045852978f6207074491d262f1b3fb613
    Link: http://lkml.kernel.org/r/20180410005908.167976-1-gthelen@google.com
    Fixes: 682aa8e1a6a1 ("writeback: implement unlocked_inode_to_wb transaction and use it for stat updates")
    Signed-off-by: Greg Thelen
    Reported-by: Wang Long
    Acked-by: Wang Long
    Acked-by: Michal Hocko
    Reviewed-by: Andrew Morton
    Cc: Johannes Weiner
    Cc: Tejun Heo
    Cc: Nicholas Piggin
    Cc: [v4.2+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    [natechancellor: Adjust context due to lack of b93b016313b3b]
    Signed-off-by: Nathan Chancellor
    Signed-off-by: Greg Kroah-Hartman

    Greg Thelen
     

07 Sep, 2017

1 commit

  • global_page_state is error prone as a recent bug report pointed out [1].
    It only returns proper values for zone based counters as the enum it
    gets suggests. We already have global_node_page_state so let's rename
    global_page_state to global_zone_page_state to be more explicit here.
    All existing users seem to be correct:

    $ git grep "global_page_state(NR_" | sed 's@.*(\(NR_[A-Z_]*\)).*@\1@' | sort | uniq -c
    2 NR_BOUNCE
    2 NR_FREE_CMA_PAGES
    11 NR_FREE_PAGES
    1 NR_KERNEL_STACK_KB
    1 NR_MLOCK
    2 NR_PAGETABLE

    This patch shouldn't introduce any functional change.
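
    For illustration, usage after the rename (the zone API keeps the old
    semantics under a more explicit name, while node-based counters go
    through the node API):

    unsigned long free  = global_zone_page_state(NR_FREE_PAGES);
    unsigned long dirty = global_node_page_state(NR_FILE_DIRTY);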

    [1] http://lkml.kernel.org/r/201707260628.v6Q6SmaS030814@www262.sakura.ne.jp

    Link: http://lkml.kernel.org/r/20170801134256.5400-2-hannes@cmpxchg.org
    Signed-off-by: Michal Hocko
    Signed-off-by: Johannes Weiner
    Cc: Tetsuo Handa
    Cc: Josef Bacik
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

19 Aug, 2017

1 commit

  • Jaegeuk and Brad report a NULL pointer crash when writeback ending tries
    to update the memcg stats:

    BUG: unable to handle kernel NULL pointer dereference at 00000000000003b0
    IP: test_clear_page_writeback+0x12e/0x2c0
    [...]
    RIP: 0010:test_clear_page_writeback+0x12e/0x2c0
    Call Trace:

    end_page_writeback+0x47/0x70
    f2fs_write_end_io+0x76/0x180 [f2fs]
    bio_endio+0x9f/0x120
    blk_update_request+0xa8/0x2f0
    scsi_end_request+0x39/0x1d0
    scsi_io_completion+0x211/0x690
    scsi_finish_command+0xd9/0x120
    scsi_softirq_done+0x127/0x150
    __blk_mq_complete_request_remote+0x13/0x20
    flush_smp_call_function_queue+0x56/0x110
    generic_smp_call_function_single_interrupt+0x13/0x30
    smp_call_function_single_interrupt+0x27/0x40
    call_function_single_interrupt+0x89/0x90
    RIP: 0010:native_safe_halt+0x6/0x10

    (gdb) l *(test_clear_page_writeback+0x12e)
    0xffffffff811bae3e is in test_clear_page_writeback (./include/linux/memcontrol.h:619).
    614 mod_node_page_state(page_pgdat(page), idx, val);
    615 if (mem_cgroup_disabled() || !page->mem_cgroup)
    616 return;
    617 mod_memcg_state(page->mem_cgroup, idx, val);
    618 pn = page->mem_cgroup->nodeinfo[page_to_nid(page)];
    619 this_cpu_add(pn->lruvec_stat->count[idx], val);
    620 }
    621
    622 unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order,
    623 gfp_t gfp_mask,

    The issue is that writeback doesn't hold a page reference and the page
    might get freed after PG_writeback is cleared (and the mapping is
    unlocked) in test_clear_page_writeback(). The stat functions looking up
    the page's node or zone are safe, as those attributes are static across
    allocation and free cycles. But page->mem_cgroup is not, and it will
    get cleared if we race with truncation or migration.

    It appears this race window has been around for a while, but it was
    less likely to trigger when the memcg stats were updated first thing
    after PG_writeback was cleared. Recent changes reshuffled this code to
    update the global node stats before the memcg ones, though, stretching
    the race window out to an extent where people can reproduce the problem.

    Update test_clear_page_writeback() to look up and pin page->mem_cgroup
    before clearing PG_writeback, then not use that pointer afterward. It
    is a partial revert of 62cccb8c8e7a ("mm: simplify lock_page_memcg()")
    but leaves the pageref-holding callsites that aren't affected alone.
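
    A sketch of the reordered function (simplified; bdi and zone
    accounting is elided, and in the actual patch lock_page_memcg() was
    changed to return the pinned memcg):

    int test_clear_page_writeback(struct page *page)
    {
        struct mem_cgroup *memcg;
        int ret;

        memcg = lock_page_memcg(page);   /* pin before clearing */
        ret = TestClearPageWriteback(page);
        if (ret) {
            /*
             * page->mem_cgroup may be cleared by a racing truncation
             * or migration, but the memcg pinned above stays valid.
             */
            dec_memcg_state(memcg, NR_WRITEBACK);
            dec_node_page_state(page, NR_WRITEBACK);
        }
        __unlock_page_memcg(memcg);      /* drop the pin, not the page */
        return ret;
    }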

    Link: http://lkml.kernel.org/r/20170809183825.GA26387@cmpxchg.org
    Fixes: 62cccb8c8e7a ("mm: simplify lock_page_memcg()")
    Signed-off-by: Johannes Weiner
    Reported-by: Jaegeuk Kim
    Tested-by: Jaegeuk Kim
    Reported-by: Bradley Bolen
    Tested-by: Brad Bolen
    Cc: Vladimir Davydov
    Cc: Michal Hocko
    Cc: [4.6+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

13 Jul, 2017

1 commit

  • Currently the writeback statistics code uses percpu counters to hold
    various statistics. Furthermore, we have 2 families of functions -
    those which disable local irqs and those which don't and whose names
    begin with a double underscore. However, they both end up calling
    __add_wb_stat, which in turn calls percpu_counter_add_batch, which is
    already irq-safe.

    Exploiting this fact allows us to eliminate the __wb_* functions, since
    they don't add any protection beyond what we already have.
    Furthermore, refactor the wb_* functions to call __add_wb_stat directly,
    without the irq-disabling dance. This will likely improve the runtime
    of code which modifies the stat counters.

    While at it also document why percpu_counter_add_batch is in fact
    preempt and irq-safe since at least 3 people got confused.
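
    A sketch of what remains after the cleanup (close to the upstream
    helpers, but simplified):

    static inline void __add_wb_stat(struct bdi_writeback *wb,
                                     enum wb_stat_item item, s64 amount)
    {
        /* percpu_counter_add_batch() is already preempt- and irq-safe,
         * so no local_irq_save()/restore() dance is needed around it */
        percpu_counter_add_batch(&wb->stat[item], amount, WB_STAT_BATCH);
    }

    static inline void inc_wb_stat(struct bdi_writeback *wb,
                                   enum wb_stat_item item)
    {
        __add_wb_stat(wb, item, 1);
    }

    static inline void dec_wb_stat(struct bdi_writeback *wb,
                                   enum wb_stat_item item)
    {
        __add_wb_stat(wb, item, -1);
    }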

    Link: http://lkml.kernel.org/r/1498029937-27293-1-git-send-email-nborisov@suse.com
    Signed-off-by: Nikolay Borisov
    Acked-by: Tejun Heo
    Reviewed-by: Jan Kara
    Cc: Josef Bacik
    Cc: Mel Gorman
    Cc: Jeff Layton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nikolay Borisov
     

08 Jul, 2017

1 commit

  • Pull Writeback error handling fixes from Jeff Layton:
    "The main rationale for all of these changes is to tighten up writeback
    error reporting to userland. There are many ways now that writeback
    errors can be lost, such that fsync/fdatasync/msync return 0 when
    writeback actually failed.

    This pile contains a small set of cleanups and writeback error
    handling fixes that I was able to break off from the main pile (#2).

    Two of the patches in this pile are trivial. The exceptions are the
    patch to fix up error handling in write_one_page, and the patch to
    make JFS pay attention to write_one_page errors"

    * tag 'for-linus-v4.13-1' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux:
    fs: remove call_fsync helper function
    mm: clean up error handling in write_one_page
    JFS: do not ignore return code from write_one_page()
    mm: drop "wait" parameter from write_one_page()

    Linus Torvalds
     

07 Jul, 2017

1 commit

  • lruvecs are at the intersection of the NUMA node and memcg, which is the
    scope for most paging activity.

    Introduce a convenient accounting infrastructure that maintains
    statistics per node, per memcg, and the lruvec itself.

    Then convert over accounting sites for statistics that are already
    tracked in both nodes and memcgs and can be easily switched.
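
    A sketch of the resulting three-level update (simplified from the
    new helpers; the percpu plumbing follows the upstream layout):

    static inline void mod_lruvec_page_state(struct page *page,
                                             enum node_stat_item idx,
                                             int val)
    {
        struct mem_cgroup_per_node *pn;

        /* always account at the node level */
        mod_node_page_state(page_pgdat(page), idx, val);

        if (mem_cgroup_disabled() || !page->mem_cgroup)
            return;

        /* memcg-wide counter */
        mod_memcg_state(page->mem_cgroup, idx, val);

        /* per-node-per-memcg (lruvec) counter */
        pn = page->mem_cgroup->nodeinfo[page_to_nid(page)];
        this_cpu_add(pn->lruvec_stat->count[idx], val);
    }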

    [hannes@cmpxchg.org: fix crash in the new cgroup stat keeping code]
    Link: http://lkml.kernel.org/r/20170531171450.GA10481@cmpxchg.org
    [hannes@cmpxchg.org: don't track uncharged pages at all]
    Link: http://lkml.kernel.org/r/20170605175254.GA8547@cmpxchg.org
    [hannes@cmpxchg.org: add missing free_percpu()]
    Link: http://lkml.kernel.org/r/20170605175354.GB8547@cmpxchg.org
    [linux@roeck-us.net: hexagon: fix build error caused by include file order]
    Link: http://lkml.kernel.org/r/20170617153721.GA4382@roeck-us.net
    Link: http://lkml.kernel.org/r/20170530181724.27197-6-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Signed-off-by: Guenter Roeck
    Acked-by: Vladimir Davydov
    Cc: Josef Bacik
    Cc: Michal Hocko
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

06 Jul, 2017

2 commits

  • Don't try to check PageError since that's potentially racy and not
    necessarily going to be set after writepage errors out.

    Instead, check the mapping for an error after writepage returns. That
    should also help us detect errors that occurred if the VM tried to
    clean the page earlier due to memory pressure.
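
    The shape of the check, as a sketch (assuming filemap_check_errors()
    is visible to the caller):

    ret = mapping->a_ops->writepage(page, &wbc);
    if (!ret)
        /* picks up errors from this call *and* from earlier
         * memory-pressure writeback that PageError would miss */
        ret = filemap_check_errors(mapping);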

    Signed-off-by: Jeff Layton
    Reviewed-by: Jan Kara

    Jeff Layton
     
  • All callers of write_one_page() pass 1 for its "wait" argument, so
    drop the argument.

    Also, make it clear that this function will not set any sort of AS_*
    error, and that the caller must do so if necessary. No existing caller
    uses this on normal files, so none of them need it.

    Also, add __must_check here since, in general, the callers need to handle
    an error here in some fashion.
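
    The resulting interface, per this change (the wait is now
    unconditional and the return value must be checked):

    int __must_check write_one_page(struct page *page);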

    Link: http://lkml.kernel.org/r/20170525103303.6524-1-jlayton@redhat.com
    Signed-off-by: Jeff Layton
    Reviewed-by: Ross Zwisler
    Reviewed-by: Jan Kara
    Reviewed-by: Matthew Wilcox
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Andrew Morton

    Jeff Layton
     

09 May, 2017

1 commit

  • Pull ext4 updates from Ted Ts'o:

    - add GETFSMAP support

    - some performance improvements for very large file systems and for
    random write workloads into a preallocated file

    - bug fixes and cleanups.

    * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
    jbd2: cleanup write flags handling from jbd2_write_superblock()
    ext4: mark superblock writes synchronous for nobarrier mounts
    ext4: inherit encryption xattr before other xattrs
    ext4: replace BUG_ON with WARN_ONCE in ext4_end_bio()
    ext4: avoid unnecessary transaction stalls during writeback
    ext4: preload block group descriptors
    ext4: make ext4_shutdown() static
    ext4: support GETFSMAP ioctls
    vfs: add common GETFSMAP ioctl definitions
    ext4: evict inline data when writing to memory map
    ext4: remove ext4_xattr_check_entry()
    ext4: rename ext4_xattr_check_names() to ext4_xattr_check_entries()
    ext4: merge ext4_xattr_list() into ext4_listxattr()
    ext4: constify static data that is never modified
    ext4: trim return value and 'dir' argument from ext4_insert_dentry()
    jbd2: fix dbench4 performance regression for 'nobarrier' mounts
    jbd2: Fix lockdep splat with generic/270 test
    mm: retry writepages() on ENOMEM when doing an data integrity writeback

    Linus Torvalds
     

04 May, 2017

3 commits

  • The memory controller's stat function names are awkwardly long and
    arbitrarily different from the zone and node stat functions.

    The current interface is named:

    mem_cgroup_read_stat()
    mem_cgroup_update_stat()
    mem_cgroup_inc_stat()
    mem_cgroup_dec_stat()
    mem_cgroup_update_page_stat()
    mem_cgroup_inc_page_stat()
    mem_cgroup_dec_page_stat()

    This patch renames it to match the corresponding node stat functions:

    memcg_page_state() [node_page_state()]
    mod_memcg_state() [mod_node_state()]
    inc_memcg_state() [inc_node_state()]
    dec_memcg_state() [dec_node_state()]
    mod_memcg_page_state() [mod_node_page_state()]
    inc_memcg_page_state() [inc_node_page_state()]
    dec_memcg_page_state() [dec_node_page_state()]

    Link: http://lkml.kernel.org/r/20170404220148.28338-4-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Acked-by: Vladimir Davydov
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • The current duplication is a high-maintenance mess, and it's painful to
    add new items or query memcg state from the rest of the VM.

    This increases the size of the stat array marginally, but we should aim
    to track all these stats on a per-cgroup level anyway.

    Link: http://lkml.kernel.org/r/20170404220148.28338-3-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Acked-by: Vladimir Davydov
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Use setup_deferrable_timer() instead of init_timer_deferrable() to
    simplify the code.
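
    The pattern being replaced, roughly (shown on a generic deferrable
    timer; the exact call sites are in the patch):

    /* before: three steps */
    init_timer_deferrable(&timer);
    timer.function = timer_fn;
    timer.data = (unsigned long)arg;

    /* after: one helper */
    setup_deferrable_timer(&timer, timer_fn, (unsigned long)arg);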

    Link: http://lkml.kernel.org/r/e8e3d4280a34facbc007346f31df833cec28801e.1488070291.git.geliangtang@gmail.com
    Signed-off-by: Geliang Tang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Geliang Tang
     

28 Apr, 2017

1 commit

  • Currently, a file system's writepages() function must not fail with an
    ENOMEM, since if it does, it's possible for buffered data to be lost.
    This is because on a data integrity writeback writepages() gets called
    only once, and if it returns ENOMEM, if you're lucky the error will get
    reflected back to the userspace process calling fsync(). If you
    aren't lucky, the user is unmounting the file system, and the dirty
    pages will simply be lost.

    For this reason, file system code generally will use GFP_NOFS, and in
    some cases, will retry the allocation in a loop, on the theory that
    "kernel livelocks are temporary; data loss is forever".
    Unfortunately, this can indeed cause livelocks, since inside the
    writepages() call, the file system is holding various mutexes, and
    these mutexes may prevent the OOM killer from killing its targeted
    victim if it is also holding on to those mutexes.

    A better solution would be to allow writepages() to call the memory
    allocator with flags that give the allocator greater latitude to fail,
    and then release its locks and return ENOMEM; in the case of
    background writeback, the writes can be retried at a later time. In
    the case of data-integrity writeback, retry after waiting a brief
    amount of time.
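
    A sketch of the retry loop this adds to do_writepages(), simplified
    from the patch:

    int do_writepages(struct address_space *mapping,
                      struct writeback_control *wbc)
    {
        int ret;

        if (wbc->nr_to_write <= 0)
            return 0;
        while (1) {
            if (mapping->a_ops->writepages)
                ret = mapping->a_ops->writepages(mapping, wbc);
            else
                ret = generic_writepages(mapping, wbc);
            /* only integrity writeback retries an ENOMEM */
            if (ret != -ENOMEM || wbc->sync_mode != WB_SYNC_ALL)
                break;
            cond_resched();
            congestion_wait(BLK_RW_ASYNC, HZ / 50);
        }
        return ret;
    }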

    Signed-off-by: Theodore Ts'o

    Theodore Ts'o
     

02 Mar, 2017

1 commit

  • Instead of including the full <linux/signal.h>, we are going to include
    the types-only <linux/signal_types.h> header in <linux/sched.h>, to
    further decouple the scheduler header from the signal headers.

    This means that various files which relied on the full <linux/signal.h>
    need to be updated to gain an explicit dependency on it.

    Update the code that relies on sched.h's inclusion of the <linux/signal.h>
    header.

    Acked-by: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Ingo Molnar
     

28 Feb, 2017

1 commit

  • Fix typos and add the following to the scripts/spelling.txt:

    comsume||consume
    comsumer||consumer
    comsuming||consuming

    I see some variable names with this pattern, but this commit is only
    touching comment blocks to avoid unexpected impact.

    Link: http://lkml.kernel.org/r/1481573103-11329-19-git-send-email-yamada.masahiro@socionext.com
    Signed-off-by: Masahiro Yamada
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Masahiro Yamada
     

25 Feb, 2017

1 commit

  • The likely/unlikely profiler noticed that the unlikely statement in
    wb_domain_writeout_inc() is constantly wrong. This is due to the "not"
    (!) being outside the unlikely statement. It is likely that
    dom->period_time will be set, but unlikely that it won't be. Move the
    "not" into the unlikely statement.
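
    Illustrated on the affected check (names from the commit):

    /* before: the "!" sits outside, so the hint annotates the wrong
     * condition - it claims a set period_time is unlikely */
    if (!unlikely(dom->period_time))
        dom->period_time = wp_next_time(jiffies);

    /* after: the hint annotates what is actually rare - period_time
     * being unset */
    if (unlikely(!dom->period_time))
        dom->period_time = wp_next_time(jiffies);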

    Link: http://lkml.kernel.org/r/20170206120035.3c2e2b91@gandalf.local.home
    Signed-off-by: Steven Rostedt (VMware)
    Reviewed-by: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Steven Rostedt (VMware)
     

02 Feb, 2017

1 commit

  • We will want to have struct backing_dev_info allocated separately from
    struct request_queue. As the first step add pointer to backing_dev_info
    to request_queue and convert all users touching it. No functional
    changes in this patch.

    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jan Kara
    Signed-off-by: Jens Axboe

    Jan Kara
     

15 Dec, 2016

1 commit

  • This is an exceptionally complicated function with just one caller
    (tag_pages_for_writeback). We devote a large portion of the runtime of
    the test suite to testing this one function which has one caller. By
    introducing the new function radix_tree_iter_tag_set(), we can eliminate
    all of the complexity while keeping the performance. The caller can now
    use a fairly standard radix_tree_for_each() loop, and it doesn't need to
    worry about tricksy things like 'start' wrapping.

    The test suite continues to spend a large amount of time investigating
    this function, but now it's testing the underlying primitives such as
    radix_tree_iter_resume() and the radix_tree_for_each_tagged() iterator
    which are also used by other parts of the kernel.
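
    A sketch of the simplified caller (condensed from the patched
    tag_pages_for_writeback(); the batch size is illustrative):

    void tag_pages_for_writeback(struct address_space *mapping,
                                 pgoff_t start, pgoff_t end)
    {
        struct radix_tree_iter iter;
        void **slot;
        unsigned int tagged = 0;

        spin_lock_irq(&mapping->tree_lock);
        radix_tree_for_each_tagged(slot, &mapping->page_tree, &iter,
                                   start, PAGECACHE_TAG_DIRTY) {
            if (iter.index > end)
                break;
            radix_tree_iter_tag_set(&mapping->page_tree, &iter,
                                    PAGECACHE_TAG_TOWRITE);
            if (++tagged % 2048)
                continue;
            /* yield the lock periodically on large ranges */
            slot = radix_tree_iter_resume(slot, &iter);
            cond_resched_lock(&mapping->tree_lock);
        }
        spin_unlock_irq(&mapping->tree_lock);
    }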

    Link: http://lkml.kernel.org/r/1480369871-5271-57-git-send-email-mawilcox@linuxonhyperv.com
    Signed-off-by: Matthew Wilcox
    Tested-by: Kirill A. Shutemov
    Cc: Konstantin Khlebnikov
    Cc: Ross Zwisler
    Cc: Matthew Wilcox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     

08 Oct, 2016

2 commits

  • File pages use a set of radix tree tags (DIRTY, TOWRITE, WRITEBACK,
    etc.) to accelerate finding the pages with a specific tag in the radix
    tree during inode writeback. But for anonymous pages in the swap cache,
    there is no inode writeback. So there is no need to find the pages with
    some writeback tags in the radix tree. It is not necessary to touch
    radix tree writeback tags for pages in the swap cache.

    Per Rik van Riel's suggestion, a new flag AS_NO_WRITEBACK_TAGS is
    introduced for address spaces which don't need to update the writeback
    tags. The flag is set for swap caches. It may be used for DAX file
    systems, etc.
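
    A sketch of how the flag is consumed (names follow the patch; the
    call sites are condensed):

    static inline int mapping_use_writeback_tags(struct address_space *m)
    {
        return !test_bit(AS_NO_WRITEBACK_TAGS, &m->flags);
    }

    /* set once, e.g. when initializing a swap address_space: */
    mapping_set_no_writeback_tags(mapping);

    /* then __test_set_page_writeback() and friends can skip the
     * tree_lock-protected tag updates entirely: */
    if (mapping && mapping_use_writeback_tags(mapping)) {
        spin_lock_irqsave(&mapping->tree_lock, flags);
        /* ... radix tree writeback tag manipulation ... */
        spin_unlock_irqrestore(&mapping->tree_lock, flags);
    }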

    With this patch, the swap out bandwidth improved 22.3% (from ~1.2GB/s
    to ~1.48GB/s) in the vm-scalability swap-w-seq test case with 8 processes.
    The test is done on a Xeon E5 v3 system. The swap device used is a RAM
    simulated PMEM (persistent memory) device. The improvement comes from
    the reduced contention on the swap cache radix tree lock. To test
    sequential swapping out, the test case uses 8 processes, which
    sequentially allocate and write to the anonymous pages until RAM and
    part of the swap device is used up.

    Details of the comparison are as follows (base vs. base+patch, with
    %stddev in parentheses):

    2506952 (± 2%)   +28.1%   3212076 (± 7%)   vm-scalability.throughput
    1207402 (± 7%)   +22.3%   1476578 (± 6%)   vmstat.swap.so
      10.86 (±12%)   -23.4%      8.31 (±16%)   perf-profile.cycles-pp._raw_spin_lock_irq.__add_to_swap_cache.add_to_swap_cache.add_to_swap.shrink_page_list
      10.82 (±13%)   -33.1%      7.24 (±14%)   perf-profile.cycles-pp._raw_spin_lock_irqsave.__remove_mapping.shrink_page_list.shrink_inactive_list.shrink_zone_memcg
      10.36 (±11%)  -100.0%      0.00          perf-profile.cycles-pp._raw_spin_lock_irqsave.__test_set_page_writeback.bdev_write_page.__swap_writepage.swap_writepage
      10.52 (±12%)  -100.0%      0.00          perf-profile.cycles-pp._raw_spin_lock_irqsave.test_clear_page_writeback.end_page_writeback.page_endio.pmem_rw_page

    Link: http://lkml.kernel.org/r/1472578089-5560-1-git-send-email-ying.huang@intel.com
    Signed-off-by: "Huang, Ying"
    Acked-by: Rik van Riel
    Cc: Hugh Dickins
    Cc: Shaohua Li
    Cc: Minchan Kim
    Cc: Mel Gorman
    Cc: Tejun Heo
    Cc: Wu Fengguang
    Cc: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Ying
     
  • throttle_vm_writeout() was introduced back in 2005 to fix OOMs caused
    by excessive pageout activity during reclaim. Too many pages could be
    put under writeback, so LRUs would be full of unreclaimable pages until
    the IO completed, and in turn the OOM killer could be invoked.

    There have been some important changes introduced since then in the
    reclaim path though. Writers are throttled by balance_dirty_pages when
    initiating the buffered IO and later during the memory pressure, the
    direct reclaim is throttled by wait_iff_congested if the node is
    considered congested by dirty pages on LRUs and the underlying bdi is
    congested by the queued IO. The kswapd is throttled as well if it
    encounters pages marked for immediate reclaim or under writeback, which
    signals that there are too many pages under writeback already.
    Finally, should_reclaim_retry does congestion_wait if the reclaim cannot
    make any progress and there are too many dirty/writeback pages.

    Another important aspect is that we do not issue any IO from the direct
    reclaim context anymore. In a heavy parallel load this could queue a
    lot of IO which would be very scattered and thus inefficient, which
    would just make the problem worse.

    These three mechanisms should throttle and keep the amount of IO in a
    steady state even under heavy IO and memory pressure, so yet another
    throttling point doesn't really seem helpful. Quite the contrary:
    Mikulas Patocka has reported that swap backed by dm-crypt doesn't work
    properly because the swapout IO cannot make sufficient progress, as the
    writeout path depends on the dm_crypt worker, which has to allocate
    memory to perform the encryption. In order to guarantee forward
    progress it relies on the mempool allocator. mempool_alloc(), however,
    prefers to use the underlying (usually page) allocator before it grabs
    objects from the pool. Such an allocation can dive into the memory
    reclaim and consequently into throttle_vm_writeout. If there are too
    many dirty pages or pages under writeback, it will get throttled even
    though it is in fact a flusher meant to clear pending pages.

    kworker/u4:0 D ffff88003df7f438 10488 6 2 0x00000000
    Workqueue: kcryptd kcryptd_crypt [dm_crypt]
    Call Trace:
    schedule+0x3c/0x90
    schedule_timeout+0x1d8/0x360
    io_schedule_timeout+0xa4/0x110
    congestion_wait+0x86/0x1f0
    throttle_vm_writeout+0x44/0xd0
    shrink_zone_memcg+0x613/0x720
    shrink_zone+0xe0/0x300
    do_try_to_free_pages+0x1ad/0x450
    try_to_free_pages+0xef/0x300
    __alloc_pages_nodemask+0x879/0x1210
    alloc_pages_current+0xa1/0x1f0
    new_slab+0x2d7/0x6a0
    ___slab_alloc+0x3fb/0x5c0
    __slab_alloc+0x51/0x90
    kmem_cache_alloc+0x27b/0x310
    mempool_alloc_slab+0x1d/0x30
    mempool_alloc+0x91/0x230
    bio_alloc_bioset+0xbd/0x260
    kcryptd_crypt+0x114/0x3b0 [dm_crypt]

    Let's just drop throttle_vm_writeout altogether. It is not very
    helpful anymore.

    I have tried to test for a potential writeback IO runaway similar to
    the one described in the original patch which introduced this
    throttling [1]: a small virtual machine (512MB RAM, 4 CPUs, 2G of swap
    space and a disk image on a rather slow NFS in sync mode on the host)
    with 8 parallel writers, each writing 1G worth of data. As soon as the
    pagecache fills up and the direct reclaim hits, I start an anon memory
    consumer in a loop (allocating 300M and exiting after populating it) in
    the background, to make the memory pressure even stronger and to
    disrupt the steady state of the IO. The direct reclaim is throttled
    because of the congestion, and kswapd hits congestion_wait due to
    nr_immediate, but throttle_vm_writeout never triggers the sleep
    throughout the test. Dirty+writeback stay close to nr_dirty_threshold,
    with some fluctuations caused by the anon consumer.

    [1] https://www2.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.9-rc1/2.6.9-rc1-mm3/broken-out/vm-pageout-throttling.patch
    Link: http://lkml.kernel.org/r/1471171473-21418-1-git-send-email-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Reported-by: Mikulas Patocka
    Cc: Marcelo Tosatti
    Cc: NeilBrown
    Cc: Ondrej Kozina
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

07 Sep, 2016

1 commit

  • Install the callbacks via the state machine and let the core invoke
    the callbacks on the already online CPUs.

    Signed-off-by: Sebastian Andrzej Siewior
    Cc: Peter Zijlstra
    Cc: Jens Axboe
    Cc: linux-mm@kvack.org
    Cc: rt@linutronix.de
    Cc: Tejun Heo
    Link: http://lkml.kernel.org/r/20160818125731.27256-6-bigeasy@linutronix.de
    Signed-off-by: Thomas Gleixner

    Sebastian Andrzej Siewior
     

29 Jul, 2016

8 commits

  • If per-zone LRU accounting is available then there is no point
    approximating whether reclaim and compaction should retry based on pgdat
    statistics. This is effectively a revert of "mm, vmstat: remove zone
    and node double accounting by approximating retries" with the difference
    that inactive/active stats are still available. This preserves the
    history of why the approximation was tried and why it had to be
    reverted to handle OOM kills on 32-bit systems.

    Link: http://lkml.kernel.org/r/1469110261-7365-4-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Johannes Weiner
    Acked-by: Minchan Kim
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • With the reintroduction of per-zone LRU stats, highmem_file_pages is
    redundant so remove it.

    [mgorman@techsingularity.net: wrong stat is being accumulated in highmem_dirtyable_memory]
    Link: http://lkml.kernel.org/r/20160725092324.GM10438@techsingularity.net
    Link: http://lkml.kernel.org/r/1469110261-7365-3-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Johannes Weiner
    Cc: Minchan Kim
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • When I tested vmscale in mmtest on 32-bit, I found the benchmark slowed
    down significantly (elapsed time rose from 26.48s to 38.08s):

                   base      node
                1 global-1
    User          12.98     16.04
    System       147.61    166.42
    Elapsed       26.48     38.08

    With vmstat, I found the average IO wait was much higher than in the
    base kernel.

    The reason was that highmem_dirtyable_memory accumulates free pages and
    highmem_file_pages from the HIGHMEM to the MOVABLE zones, which was
    wrong. With that, dirty_thresh in throttle_vm_writeout is always 0, so
    it calls congestion_wait frequently once writeback starts.

    With this patch, it is largely recovered:

                   base      node       fix
                1 global-1
    User          12.98     16.04     13.78
    System       147.61    166.42    143.92
    Elapsed       26.48     38.08     29.64

    Link: http://lkml.kernel.org/r/1468404004-5085-4-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Minchan Kim
    Signed-off-by: Mel Gorman
    Acked-by: Johannes Weiner
    Acked-by: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • The number of LRU pages, dirty pages and writeback pages must be
    accounted for on both zones and nodes because of the reclaim retry
    logic, compaction retry logic and highmem calculations all depending on
    per-zone stats.

    Many lowmem allocations are immune from OOM kill due to a check in
    __alloc_pages_may_oom for (ac->high_zoneidx < ZONE_NORMAL) since commit
    03668b3ceb0c ("oom: avoid oom killer for lowmem allocations"). The
    exception is costly high-order allocations or allocations that cannot
    fail. If the __alloc_pages_may_oom avoids OOM-kill for low-order lowmem
    allocations then it would fall through to __alloc_pages_direct_compact.

    This patch will blindly retry reclaim for zone-constrained allocations
    in should_reclaim_retry up to MAX_RECLAIM_RETRIES. This is not ideal
    but without per-zone stats there are not many alternatives. The impact
    is that zone-constrained allocations may be delayed before the OOM
    killer is considered.

    As there is no guarantee enough memory can ever be freed to satisfy
    compaction, this patch avoids retrying compaction for zone-constrained
    allocations.

    In combination, that means that the per-node stats can be used when
    deciding whether to continue reclaim using a rough approximation. While
    it is possible this will make the wrong decision on occasion, it will
    not infinite loop as the number of reclaim attempts is capped by
    MAX_RECLAIM_RETRIES.

    The final step is calculating the number of dirtyable highmem pages.
    As those calculations only care about the global count of file pages
    in highmem, this patch uses a global counter instead of per-zone stats,
    which is sufficient.

    In combination, this allows the per-zone LRU and dirty state counters to
    be removed.

    [mgorman@techsingularity.net: fix acct_highmem_file_pages()]
    Link: http://lkml.kernel.org/r/1468853426-12858-4-git-send-email-mgorman@techsingularity.net
    Link: http://lkml.kernel.org/r/1467970510-21195-35-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Suggested-by: Michal Hocko
    Acked-by: Hillf Danton
    Cc: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Rik van Riel
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • As reclaim is now node-based, it follows that page write activity due to
    page reclaim should also be accounted for on the node. For consistency,
    also account page writes and page dirtying on a per-node basis.

    After this patch, there are a few remaining zone counters that may appear
    strange but are fine. NUMA stats are still per-zone as this is a
    user-space interface that tools consume. NR_MLOCK, NR_SLAB_*,
    NR_PAGETABLE, NR_KERNEL_STACK and NR_BOUNCE are all allocations that
    potentially pin low memory and cannot trivially be reclaimed on demand.
    This information is still useful for debugging a page allocation failure
    warning.

    Link: http://lkml.kernel.org/r/1467970510-21195-21-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Cc: Hillf Danton
    Acked-by: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: Minchan Kim
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • There are now a number of accounting oddities, such as mapped file
    pages being accounted for on the node while the total number of file
    pages is accounted on the zone. This can be coped with to some extent
    but it's confusing, so this patch moves the relevant file-based
    accounting. Due to throttling logic in the page allocator for reliable
    OOM detection, it is still necessary to track dirty and writeback pages
    on a per-zone basis.

    [mgorman@techsingularity.net: fix NR_ZONE_WRITE_PENDING accounting]
    Link: http://lkml.kernel.org/r/1468404004-5085-5-git-send-email-mgorman@techsingularity.net
    Link: http://lkml.kernel.org/r/1467970510-21195-20-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Cc: Hillf Danton
    Acked-by: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: Minchan Kim
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Historically dirty pages were spread among zones but now that LRUs are
    per-node it is more appropriate to consider dirty pages in a node.

    Link: http://lkml.kernel.org/r/1467970510-21195-17-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Signed-off-by: Johannes Weiner
    Acked-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Cc: Hillf Danton
    Cc: Joonsoo Kim
    Cc: Minchan Kim
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • This moves the LRU lists from the zone to the node and related data such
    as counters, tracing, congestion tracking and writeback tracking.

    Unfortunately, due to reclaim and compaction retry logic, it is
    necessary to account for the number of LRU pages at both the zone and
    the node level. Most reclaim logic is based on the node counters, but
    the retry logic uses the zone counters, which do not distinguish
    inactive and active sizes. It would be possible to leave the LRU
    counters on a per-zone basis, but that is a heavier calculation across
    multiple cache lines performed much more frequently than the retry
    checks.

    Other than the LRU counters, this is mostly a mechanical patch but note
    that it introduces a number of anomalies. For example, the scans are
    per-zone but using per-node counters. We also mark a node as congested
    when a zone is congested. This causes weird problems that are fixed
    later but is easier to review.

    In the event that there is excessive overhead on 32-bit systems due to
    LRUs being per-node, there are two potential solutions:

    1. Long-term isolation of highmem pages when reclaim is lowmem

    When pages are skipped, they are immediately added back onto the LRU
    list. If lowmem reclaim persisted for long periods of time, the same
    highmem pages get continually scanned. The idea would be that lowmem
    keeps those pages on a separate list until a reclaim for highmem pages
    arrives that splices the highmem pages back onto the LRU. It potentially
    could be implemented similar to the UNEVICTABLE list.

    That would reduce the skip rate, with the potential corner case that
    highmem pages have to be scanned and reclaimed to free lowmem slab pages.

    2. Linear scan lowmem pages if the initial LRU shrink fails

    This will break LRU ordering but may be preferable and faster during
    memory pressure than skipping LRU pages.

    Link: http://lkml.kernel.org/r/1467970510-21195-4-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Johannes Weiner
    Acked-by: Vlastimil Babka
    Cc: Hillf Danton
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

27 Jul, 2016

2 commits

  • Here's a basic implementation of huge pages support for shmem/tmpfs.

    It's all pretty straightforward:

    - shmem_getpage() allocates a huge page if it can, and tries to insert
    it into the radix tree with shmem_add_to_page_cache();

    - shmem_add_to_page_cache() puts the page onto the radix tree if
    there's space for it;

    - shmem_undo_range() removes huge pages if they are fully within the
    range. Partial truncation of a huge page zeroes out that part of the
    THP.

    This has a visible effect on fallocate(FALLOC_FL_PUNCH_HOLE)
    behaviour: as we don't really create a hole in this case,
    lseek(SEEK_HOLE) may give inconsistent results depending on which
    pages happened to be allocated;

    - no need to change shmem_fault(): core mm will map a compound page as
    huge if the VMA is suitable;

    Link: http://lkml.kernel.org/r/1466021202-61880-30-git-send-email-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • wait_sb_inodes() currently does a walk of all inodes in the filesystem
    to find dirty ones to wait on during sync. This is highly inefficient
    and wastes a lot of CPU when there are lots of clean cached inodes that
    we don't need to wait on.

    To avoid this "all inode" walk, we need to track inodes that are
    currently under writeback that we need to wait for. We do this by
    adding inodes to a writeback list on the sb when the mapping is first
    tagged as having pages under writeback. wait_sb_inodes() can then walk
    this list of "inodes under IO" and wait specifically just for the inodes
    that the current sync(2) needs to wait for.

    Define a couple of helpers to add/remove an inode from the writeback
    list and call them when the overall mapping is tagged for or cleared
    from writeback. Update wait_sb_inodes() to walk only the inodes under
    writeback due to the sync.
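
    A sketch of the two helpers (simplified; per-superblock locking as
    in the patch):

    void sb_mark_inode_writeback(struct inode *inode)
    {
        struct super_block *sb = inode->i_sb;
        unsigned long flags;

        if (list_empty(&inode->i_wb_list)) {
            spin_lock_irqsave(&sb->s_inode_wblist_lock, flags);
            if (list_empty(&inode->i_wb_list))
                list_add_tail(&inode->i_wb_list, &sb->s_inodes_wb);
            spin_unlock_irqrestore(&sb->s_inode_wblist_lock, flags);
        }
    }

    void sb_clear_inode_writeback(struct inode *inode)
    {
        struct super_block *sb = inode->i_sb;
        unsigned long flags;

        if (!list_empty(&inode->i_wb_list)) {
            spin_lock_irqsave(&sb->s_inode_wblist_lock, flags);
            list_del_init(&inode->i_wb_list);
            spin_unlock_irqrestore(&sb->s_inode_wblist_lock, flags);
        }
    }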

    With this change, filesystem sync times are significantly reduced for
    filesystems with largely populated inode caches and otherwise no other
    work to do. For example, on a 16-CPU 2GHz x86-64 server with a 10TB
    XFS filesystem and a ~10m entry inode cache, sync times are reduced
    from ~7.3s to less than 0.1s when the filesystem is fully clean.

    Link: http://lkml.kernel.org/r/1466594593-6757-2-git-send-email-bfoster@redhat.com
    Signed-off-by: Dave Chinner
    Signed-off-by: Josef Bacik
    Signed-off-by: Brian Foster
    Reviewed-by: Jan Kara
    Tested-by: Holger Hoffstätte
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Chinner
     

30 May, 2016

1 commit

  • As vm.dirty_[background_]bytes can't be applied verbatim to multiple
    cgroup writeback domains, they get converted to percentages in
    domain_dirty_limits() and applied the same way as
    vm.dirty_[background_]ratio. However, if the specified byte count is
    lower than 1% of available memory, the calculated ratios become zero
    and the writeback domain gets throttled constantly.

    Fix it by using per-PAGE_SIZE units instead of percentages for the
    ratio calculations. Also, the updated DIV_ROUND_UP() usages should now
    yield 1/4096 (0.0244%) as the minimum ratio as long as the specified
    byte count is above zero.
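
    The conversion, sketched (variable names loosely follow
    domain_dirty_limits(); available_memory and global_avail are in
    pages):

    /* carry ratios in units of 1/PAGE_SIZE instead of percent */
    unsigned long ratio    = (vm_dirty_ratio * PAGE_SIZE) / 100;
    unsigned long bg_ratio = (dirty_background_ratio * PAGE_SIZE) / 100;

    if (vm_dirty_bytes)
        /* at least 1/PAGE_SIZE (0.0244% for 4K pages) once bytes > 0 */
        ratio = min(DIV_ROUND_UP(vm_dirty_bytes, global_avail),
                    PAGE_SIZE);
    if (dirty_background_bytes)
        bg_ratio = min(DIV_ROUND_UP(dirty_background_bytes, global_avail),
                       PAGE_SIZE);

    thresh    = (ratio * available_memory) / PAGE_SIZE;
    bg_thresh = (bg_ratio * available_memory) / PAGE_SIZE;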

    Signed-off-by: Tejun Heo
    Reported-by: Miao Xie
    Link: http://lkml.kernel.org/g/57333E75.3080309@huawei.com
    Cc: stable@vger.kernel.org # v4.2+
    Fixes: 9fc3a43e1757 ("writeback: separate out domain_dirty_limits()")
    Reviewed-by: Jan Kara

    Adjusted comment based on Jan's suggestion.
    Signed-off-by: Jens Axboe

    Tejun Heo
     

21 May, 2016

1 commit

  • When nfsd is exporting a filesystem over NFS which is then NFS-mounted
    on the local machine there is a risk of deadlock. This happens when
    there are lots of dirty pages in the NFS filesystem and they cause NFSD
    to be throttled, either in throttle_vm_writeout() or in
    balance_dirty_pages().

    To avoid this problem the PF_LESS_THROTTLE flag is set for NFSD threads
    and it provides a 25% increase to the limits that affect NFSD. Any
    process writing to an NFS filesystem will be throttled well before the
    number of dirty NFS pages reaches the limit imposed on NFSD, so NFSD
    will not deadlock on pages that it needs to write out. At least it
    shouldn't.

    All processes are allowed a small excess margin to avoid performing too
    many calculations: ratelimit_pages.

    ratelimit_pages is set so that if a thread on every CPU uses the entire
    margin, the total will only go 3% over the limit, and this is much less
    than the 25% bonus that PF_LESS_THROTTLE provides, so this margin
    shouldn't be a problem. But it is.

    The "total memory" that these 3% and 25% are calculated against are not
    really total memory but are "global_dirtyable_memory()" which doesn't
    include anonymous memory, just free memory and page-cache memory.

    The "ratelimit_pages" number is based on whatever the
    global_dirtyable_memory was on the last CPU hot-plug, which might not be
    what you expect, but is probably close to the total freeable memory.

    The throttle threshold uses the global_dirtyable_memory at the moment
    when the throttling happens, which could be much less than at the last
    CPU hotplug. So if lots of anonymous memory has been allocated, thus
    pushing out lots of page-cache pages, then NFSD might end up being
    throttled due to dirty NFS pages because the "25%" bonus it gets is
    calculated against a rather small amount of dirtyable memory, while the
    "3%" margin that other processes are allowed to dirty without penalty is
    calculated against a much larger number.

    To remove this possibility of deadlock we need to make sure that the
    margin granted to PF_LESS_THROTTLE exceeds that rate-limit margin.
    Simply adding ratelimit_pages isn't enough as that should be multiplied
    by the number of cpus.

    So add "global_wb_domain.dirty_limit / 32" as that more accurately
    reflects the current total over-shoot margin. This ensures that the
    number of dirty NFS pages never gets so high that nfsd will be throttled
    waiting for them to be written.
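
    The change itself is small; as I read the patch, the bonus in
    domain_dirty_limits() becomes:

    if (tsk->flags & PF_LESS_THROTTLE || rt_task(tsk)) {
        bg_thresh += bg_thresh / 4 + global_wb_domain.dirty_limit / 32;
        thresh    += thresh / 4    + global_wb_domain.dirty_limit / 32;
    }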

    Link: http://lkml.kernel.org/r/87futgowwv.fsf@notabene.neil.brown.name
    Signed-off-by: NeilBrown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    NeilBrown
     

20 May, 2016

1 commit

  • ZONE_MOVABLE can be treated as highmem, so we need to consider it for
    an accurate calculation of dirty pages. And, in following patches,
    ZONE_CMA will be introduced and can be treated as highmem, too. So,
    instead of manually adding the stat for ZONE_MOVABLE, loop over all
    zones, check whether each zone can be treated as highmem, and if so
    add its stat.
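
    A sketch of the generalized loop in highmem_dirtyable_memory()
    (simplified from the patch):

    unsigned long x = 0;
    int node, i;

    for_each_node_state(node, N_HIGH_MEMORY) {
        for (i = 0; i < MAX_NR_ZONES; i++) {
            struct zone *z = &NODE_DATA(node)->node_zones[i];

            /* covers ZONE_HIGHMEM, ZONE_MOVABLE, and later ZONE_CMA */
            if (is_highmem(z))
                x += zone_dirtyable_memory(z);
        }
    }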

    Signed-off-by: Joonsoo Kim
    Reviewed-by: Aneesh Kumar K.V
    Cc: Rik van Riel
    Cc: Johannes Weiner
    Cc: Mel Gorman
    Cc: Laura Abbott
    Cc: Minchan Kim
    Cc: Marek Szyprowski
    Cc: Michal Nazarewicz
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     

06 May, 2016

1 commit

  • Commit 947e9762a8dd ("writeback: update wb_over_bg_thresh() to use
    wb_domain aware operations") unintentionally changed this function's
    meaning from "are there more dirty pages than the background writeback
    threshold" to "are there more dirty pages than the writeback threshold".
    The background writeback threshold is typically half of the writeback
    threshold, so this had the effect of raising the number of dirty pages
    required to cause a writeback worker to perform background writeout.

    This can cause a very severe performance regression when a BDI uses
    BDI_CAP_STRICTLIMIT because balance_dirty_pages() and the writeback worker
    can now disagree on whether writeback should be initiated.

    For example, in a system having 1GB of RAM, a single spinning disk, and a
    "pass-through" FUSE filesystem mounted over the disk, application code
    mmapped a 128MB file on the disk and was randomly dirtying pages in that
    mapping.

    Because FUSE uses strictlimit and has a default max_ratio of only 1%, in
    balance_dirty_pages, thresh is ~200, bg_thresh is ~100, and the
    dirty_freerun_ceiling is the average of those, ~150. So, it pauses the
    dirtying processes when we have 151 dirty pages and wakes up a background
    writeback worker. But the worker tests the wrong threshold (200 instead of
    100), so it does not initiate writeback and just returns.

    Thus, balance_dirty_pages keeps looping, sleeping and then waking up the
    worker who will do nothing. It remains stuck in this state until the few
    dirty pages that we have finally expire and we write them back for that
    reason. Then the whole process repeats, resulting in near-zero throughput
    through the FUSE BDI.

    The fix is to call the parameterized variant of wb_calc_thresh, so
    that the worker will do writeback if the bg_thresh is exceeded, which
    was the behavior before the referenced commit.
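
    The fix, roughly (in wb_over_bg_thresh(); the buggy version passed
    the full dirty threshold here instead of the background one):

    if (wb_stat(wb, WB_RECLAIMABLE) >
        wb_calc_thresh(gdtc->wb, gdtc->bg_thresh))
        return true;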

    Fixes: 947e9762a8dd ("writeback: update wb_over_bg_thresh() to use wb_domain aware operations")
    Signed-off-by: Howard Cochran
    Acked-by: Tejun Heo
    Signed-off-by: Miklos Szeredi
    Cc: # v4.2+
    Tested-by: Sedat Dilek
    Signed-off-by: Jens Axboe

    Howard Cochran
     

05 Apr, 2016

1 commit

  • PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced a *long* time
    ago with the promise that one day it would be possible to implement the
    page cache with bigger chunks than PAGE_SIZE.

    This promise never materialized. And unlikely ever will.

    We have many places where PAGE_CACHE_SIZE is assumed to be equal to
    PAGE_SIZE. And it's a constant source of confusion on whether
    PAGE_CACHE_* or PAGE_* constants should be used in a particular case,
    especially on the border between fs and mm.

    Globally switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause too much
    breakage to be doable.

    Let's stop pretending that pages in the page cache are special. They
    are not.

    The changes are pretty straightforward:

    - << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <nothing>;

    - >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <nothing>;

    - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};

    - page_cache_get() -> get_page();

    - page_cache_release() -> put_page();

    This patch contains automated changes generated with coccinelle using
    the script below. For some reason, coccinelle doesn't patch header
    files. I've called spatch for them manually.

    The only adjustment after coccinelle is a revert of the changes to the
    PAGE_CACHE_ALIGN definition: we are going to drop it later.

    There are a few places in the code where coccinelle didn't reach.
    I'll fix them manually in a separate patch. Comments and documentation
    also will be addressed in a separate patch.

    virtual patch

    @@
    expression E;
    @@
    - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
    + E

    @@
    expression E;
    @@
    - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
    + E

    @@
    @@
    - PAGE_CACHE_SHIFT
    + PAGE_SHIFT

    @@
    @@
    - PAGE_CACHE_SIZE
    + PAGE_SIZE

    @@
    @@
    - PAGE_CACHE_MASK
    + PAGE_MASK

    @@
    expression E;
    @@
    - PAGE_CACHE_ALIGN(E)
    + PAGE_ALIGN(E)

    @@
    expression E;
    @@
    - page_cache_get(E)
    + get_page(E)

    @@
    expression E;
    @@
    - page_cache_release(E)
    + put_page(E)

    Signed-off-by: Kirill A. Shutemov
    Acked-by: Michal Hocko
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

16 Mar, 2016

1 commit

  • There are several users that nest lock_page_memcg() inside lock_page()
    to prevent page->mem_cgroup from changing. But the page lock prevents
    pages from moving between cgroups, so that is unnecessary overhead.

    Remove lock_page_memcg() in contexts with locked pages and fix the
    debug code in the page stat functions to be okay with the page lock.

    Signed-off-by: Johannes Weiner
    Acked-by: Vladimir Davydov
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner