12 Oct, 2016

1 commit

  • mapping->flags currently encodes two different things into a single flag.
    It contains sticky gfp_mask for page cache allocations and AS_ codes used
    to report errors/enospace and other states which are mapping specific.
    Condensing the two semantically unrelated things saves a few bytes, but it
    also complicates other things. For one, the gfp flags space is
    reduced and in fact we are already running out of available bits. It can
    be assumed that more gfp flags will be necessary later on.

    To avoid growing struct address_space (at least on x86_64), we can stick
    the new field right after private_lock, because there is a hole there.

    struct address_space {
    struct inode * host; /* 0 8 */
    struct radix_tree_root page_tree; /* 8 16 */
    spinlock_t tree_lock; /* 24 4 */
    atomic_t i_mmap_writable; /* 28 4 */
    struct rb_root i_mmap; /* 32 8 */
    struct rw_semaphore i_mmap_rwsem; /* 40 40 */
    /* --- cacheline 1 boundary (64 bytes) was 16 bytes ago --- */
    long unsigned int nrpages; /* 80 8 */
    long unsigned int nrexceptional; /* 88 8 */
    long unsigned int writeback_index; /* 96 8 */
    const struct address_space_operations * a_ops; /* 104 8 */
    long unsigned int flags; /* 112 8 */
    spinlock_t private_lock; /* 120 4 */

    /* XXX 4 bytes hole, try to pack */

    /* --- cacheline 2 boundary (128 bytes) --- */
    struct list_head private_list; /* 128 16 */
    void * private_data; /* 144 8 */

    /* size: 152, cachelines: 3, members: 14 */
    /* sum members: 148, holes: 1, sum holes: 4 */
    /* last cacheline: 24 bytes */
    };
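
    As a hedged sketch of where this is heading (field and helper names as used
    by this series; treat the exact code as illustrative rather than the literal
    patch), a dedicated gfp_t member turns the accessors into plain field
    accesses instead of bit games on mapping->flags:

        /* gfp_t is 4 bytes, so it fits the 4-byte hole after private_lock */
        static inline gfp_t mapping_gfp_mask(struct address_space *mapping)
        {
                return mapping->gfp_mask;
        }

        static inline void mapping_set_gfp_mask(struct address_space *m, gfp_t mask)
        {
                m->gfp_mask = mask;
        }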

    Link: http://lkml.kernel.org/r/20160912114852.GI14524@dhcp22.suse.cz
    Signed-off-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

11 Oct, 2016

1 commit

  • Pull misc vfs updates from Al Viro:
    "Assorted misc bits and pieces.

    There are several single-topic branches left after this (rename2
    series from Miklos, current_time series from Deepa Dinamani, xattr
    series from Andreas, uaccess stuff from me) and I'd prefer to
    send those separately"

    * 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (39 commits)
    proc: switch auxv to use of __mem_open()
    hpfs: support FIEMAP
    cifs: get rid of unused arguments of CIFSSMBWrite()
    posix_acl: uapi header split
    posix_acl: xattr representation cleanups
    fs/aio.c: eliminate redundant loads in put_aio_ring_file
    fs/internal.h: add const to ns_dentry_operations declaration
    compat: remove compat_printk()
    fs/buffer.c: make __getblk_slow() static
    proc: unsigned file descriptors
    fs/file: more unsigned file descriptors
    fs: compat: remove redundant check of nr_segs
    cachefiles: Fix attempt to read i_blocks after deleting file [ver #2]
    cifs: don't use memcpy() to copy struct iov_iter
    get rid of separate multipage fault-in primitives
    fs: Avoid premature clearing of capabilities
    fs: Give dentry to inode_change_ok() instead of inode
    fuse: Propagate dentry down to inode_change_ok()
    ceph: Propagate dentry down to inode_change_ok()
    xfs: Propagate dentry down to inode_change_ok()
    ...

    Linus Torvalds
     

08 Oct, 2016

2 commits

  • After using the offset of the swap entry as the key of the swap cache,
    page_index() becomes exactly the same as page_file_index(). So
    page_file_index() is removed and its callers are changed to use
    page_index() instead.

    Link: http://lkml.kernel.org/r/1473270649-27229-2-git-send-email-ying.huang@intel.com
    Signed-off-by: "Huang, Ying"
    Cc: Trond Myklebust
    Cc: Anna Schumaker
    Cc: "Kirill A. Shutemov"
    Cc: Michal Hocko
    Cc: Dave Hansen
    Cc: Johannes Weiner
    Cc: Dan Williams
    Cc: Joonsoo Kim
    Cc: Ross Zwisler
    Cc: Eric Dumazet
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Ying
     
  • File pages use a set of radix tree tags (DIRTY, TOWRITE, WRITEBACK,
    etc.) to accelerate finding the pages with a specific tag in the radix
    tree during inode writeback. But for anonymous pages in the swap cache,
    there is no inode writeback. So there is no need to find the pages with
    some writeback tags in the radix tree. It is not necessary to touch
    radix tree writeback tags for pages in the swap cache.

    Per Rik van Riel's suggestion, a new flag AS_NO_WRITEBACK_TAGS is
    introduced for address spaces which don't need to update the writeback
    tags. The flag is set for swap caches. It may be used for DAX file
    systems, etc.
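
    As a hedged sketch (helper names as used by this series; shown only to
    illustrate how the flag is consumed), writeback-tag updates can be gated
    on a small predicate:

        static inline void mapping_set_no_writeback_tags(struct address_space *mapping)
        {
                set_bit(AS_NO_WRITEBACK_TAGS, &mapping->flags);
        }

        static inline int mapping_use_writeback_tags(struct address_space *mapping)
        {
                return !test_bit(AS_NO_WRITEBACK_TAGS, &mapping->flags);
        }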

    With this patch, swap-out bandwidth improved by 22.3% (from ~1.2 GB/s to
    ~1.48 GB/s) in the vm-scalability swap-w-seq test case with 8 processes.
    The test is done on a Xeon E5 v3 system. The swap device used is a RAM
    simulated PMEM (persistent memory) device. The improvement comes from
    the reduced contention on the swap cache radix tree lock. To test
    sequential swapping out, the test case uses 8 processes, which
    sequentially allocate and write to the anonymous pages until RAM and
    part of the swap device is used up.

    Details of the comparison are as follows (each row: base value ± stddev,
    %change, patched value ± stddev, metric):
    2506952 ± 2% +28.1% 3212076 ± 7% vm-scalability.throughput
    1207402 ± 7% +22.3% 1476578 ± 6% vmstat.swap.so
    10.86 ± 12% -23.4% 8.31 ± 16% perf-profile.cycles-pp._raw_spin_lock_irq.__add_to_swap_cache.add_to_swap_cache.add_to_swap.shrink_page_list
    10.82 ± 13% -33.1% 7.24 ± 14% perf-profile.cycles-pp._raw_spin_lock_irqsave.__remove_mapping.shrink_page_list.shrink_inactive_list.shrink_zone_memcg
    10.36 ± 11% -100.0% 0.00 ± -1% perf-profile.cycles-pp._raw_spin_lock_irqsave.__test_set_page_writeback.bdev_write_page.__swap_writepage.swap_writepage
    10.52 ± 12% -100.0% 0.00 ± -1% perf-profile.cycles-pp._raw_spin_lock_irqsave.test_clear_page_writeback.end_page_writeback.page_endio.pmem_rw_page

    Link: http://lkml.kernel.org/r/1472578089-5560-1-git-send-email-ying.huang@intel.com
    Signed-off-by: "Huang, Ying"
    Acked-by: Rik van Riel
    Cc: Hugh Dickins
    Cc: Shaohua Li
    Cc: Minchan Kim
    Cc: Mel Gorman
    Cc: Tejun Heo
    Cc: Wu Fengguang
    Cc: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Ying
     

28 Sep, 2016

1 commit

  • * the only remaining callers of "short" fault-ins are just as happy with the
    generic variants (both in lib/iov_iter.c); switch them to the multipage
    variants and kill the "short" ones
    * rename the multipage variants to the now-available plain names.
    * get rid of the compat macro defining iov_iter_fault_in_multipage_readable by
    expanding it in its only user.

    Signed-off-by: Al Viro

    Al Viro
     

26 Sep, 2016

1 commit

  • When building XFS with -Werror, it now fails with:

    include/linux/pagemap.h: In function 'fault_in_multipages_readable':
    include/linux/pagemap.h:602:16: error: variable 'c' set but not used [-Werror=unused-but-set-variable]
    volatile char c;
    ^

    This is a regression caused by commit e23d4159b109 ("fix
    fault_in_multipages_...() on architectures with no-op access_ok()").
    Fix it by re-adding the "(void)c" trick that was previously used to make
    the compiler think the variable is used.
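
    For illustration only, a minimal user-space reproduction of the idiom (not
    the kernel code itself); compiled with -Wall -Werror, removing the (void)c
    line brings the warning back:

        /* force a read of *uaddr without "using" the value */
        static int probe_readable(const char *uaddr)
        {
                volatile char c;

                c = *uaddr;     /* the read we actually care about */
                (void)c;        /* tell the compiler the value is intentionally unused */
                return 0;
        }

        int main(void)
        {
                char buf[1] = { 42 };

                return probe_readable(buf);
        }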

    Signed-off-by: Dave Chinner
    Cc: Al Viro
    Signed-off-by: Linus Torvalds

    Dave Chinner
     

21 Sep, 2016

1 commit

  • Switching iov_iter fault-in to multipages variants has exposed an old
    bug in underlying fault_in_multipages_...(); they break if the range
    passed to them wraps around. Normally access_ok() done by callers will
    prevent such (and it's a guaranteed EFAULT - ERR_PTR() values fall into
    such a range and they should not point to any valid objects).

    However, on architectures where userland and kernel live in different
    MMU contexts (e.g. s390) access_ok() is a no-op and on those a range
    with a wraparound can reach fault_in_multipages_...().

    Since any wraparound means EFAULT there, the fix is trivial - turn
    those

        while (uaddr <= end)
                ...

    into

        if (unlikely(uaddr > end))
                return -EFAULT;
        do
                ...
        while (uaddr <= end);

    Tested-by: Jan Stancek
    Cc: stable@vger.kernel.org # v3.5+
    Signed-off-by: Al Viro
    Signed-off-by: Linus Torvalds

    Al Viro
     

08 Aug, 2016

1 commit

  • Commit abf545484d31 changed it from an 'rw' flags type to the
    newer ops based interface, but now we're effectively leaking
    some bdev internals to the rest of the kernel. Since we only
    care about whether it's a read or a write at that level, just
    pass in a bool 'is_write' parameter instead.

    Then we can also move op_is_write() and friends back under
    CONFIG_BLOCK protection.
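
    For context, a hedged sketch of the resulting method signature (as far as
    the editor can tell from this description; verify against the tree):

        struct block_device_operations {
                /* ... */
                int (*rw_page)(struct block_device *bdev, sector_t sector,
                               struct page *page, bool is_write);
                /* ... */
        };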

    Reviewed-by: Mike Christie
    Signed-off-by: Jens Axboe

    Jens Axboe
     

05 Aug, 2016

1 commit

  • The rw_page users were not converted to use bio/req ops. As a result
    bdev_write_page is not passing down REQ_OP_WRITE and the IOs will
    be sent down as reads.

    Signed-off-by: Mike Christie
    Fixes: 4e1b2d52a80d ("block, fs, drivers: remove REQ_OP compat defs and related code")

    Modified by me to:

    1) Drop op_flags passing into ->rw_page(), as we don't use it.
    2) Make op_is_write() and friends safe to use for !CONFIG_BLOCK

    Signed-off-by: Jens Axboe

    Mike Christie
     

27 Jul, 2016

1 commit

  • Vladimir has noticed that we might declare memcg oom even during
    readahead because read_pages only uses GFP_KERNEL (with mapping_gfp
    restriction) while __do_page_cache_readahead uses
    page_cache_alloc_readahead which adds __GFP_NORETRY to prevent
    OOMs. This gfp mask discrepancy is really unfortunate and easily
    fixable. Drop page_cache_alloc_readahead() which only has one user and
    outsource the gfp_mask logic into readahead_gfp_mask and propagate this
    mask from __do_page_cache_readahead down to read_pages.

    This alone would have only very limited impact as most filesystems are
    implementing ->readpages and the common implementation mpage_readpages
    does GFP_KERNEL (with mapping_gfp restriction) again. We can tell it to
    use readahead_gfp_mask instead as this function is called only during
    readahead as well. The same applies to read_cache_pages.

    ext4 has its own ext4_mpage_readpages but the path which has pages !=
    NULL can use the same gfp mask. Btrfs, cifs, f2fs and orangefs are
    doing a very similar pattern to mpage_readpages so the same can be
    applied to them as well.
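
    As a hedged sketch of the helper described above (modulo the exact set of
    "fail quietly" bits used upstream), readahead_gfp_mask() just layers the
    readahead-friendly flags on top of the mapping's own mask:

        static inline gfp_t readahead_gfp_mask(struct address_space *x)
        {
                /* mapping_gfp restriction plus no-retry/no-warn for readahead */
                return mapping_gfp_mask(x) | __GFP_NORETRY | __GFP_NOWARN;
        }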

    [akpm@linux-foundation.org: coding-style fixes]
    [mhocko@suse.com: restrict gfp mask in mpage_alloc]
    Link: http://lkml.kernel.org/r/20160610074223.GC32285@dhcp22.suse.cz
    Link: http://lkml.kernel.org/r/1465301556-26431-1-git-send-email-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Cc: Vladimir Davydov
    Cc: Chris Mason
    Cc: Steve French
    Cc: Theodore Ts'o
    Cc: Jan Kara
    Cc: Mike Marshall
    Cc: Jaegeuk Kim
    Cc: Changman Lee
    Cc: Chao Yu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

21 May, 2016

1 commit

  • copy_page_to_iter_iovec() is currently the only user of
    fault_in_pages_writeable(), and it definitely can use fragments from
    high order pages.

    Make sure fault_in_pages_writeable() is only touching two adjacent pages
    at most, as claimed.

    Signed-off-by: Eric Dumazet
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric Dumazet
     

20 May, 2016

1 commit

  • Many developers already know that the reference count field of struct
    page is _count and that it is an atomic_t. They may try to handle it
    directly, and this could defeat the purpose of the page reference count
    tracepoints. To prevent direct _count modification, this patch renames it
    to _refcount and adds a warning comment to the code. After that, developers
    who need to handle the reference count will see that the field should not
    be accessed directly.

    [akpm@linux-foundation.org: fix comments, per Vlastimil]
    [akpm@linux-foundation.org: Documentation/vm/transhuge.txt too]
    [sfr@canb.auug.org.au: sync ethernet driver changes]
    Signed-off-by: Joonsoo Kim
    Signed-off-by: Stephen Rothwell
    Cc: Vlastimil Babka
    Cc: Hugh Dickins
    Cc: Johannes Berg
    Cc: "David S. Miller"
    Cc: Sunil Goutham
    Cc: Chris Metcalf
    Cc: Manish Chopra
    Cc: Yuval Mintz
    Cc: Tariq Toukan
    Cc: Saeed Mahameed
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     

05 Apr, 2016

3 commits

  • All users gone. We can remove these macros.

    Signed-off-by: Kirill A. Shutemov
    Acked-by: Michal Hocko
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Mostly direct substitution with occasional adjustment or removing
    outdated comments.

    Signed-off-by: Kirill A. Shutemov
    Acked-by: Michal Hocko
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced a *long* time
    ago with the promise that one day it would be possible to implement the
    page cache with bigger chunks than PAGE_SIZE.

    This promise never materialized, and it is unlikely that it ever will.

    We have many places where PAGE_CACHE_SIZE is assumed to be equal to
    PAGE_SIZE, and it is a constant source of confusion whether a
    PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
    especially on the border between fs and mm.

    Globally switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause too much
    breakage to be doable.

    Let's stop pretending that pages in the page cache are special. They are
    not.

    The changes are pretty straight-forward:

    - <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;

    - <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;

    - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};

    - page_cache_get() -> get_page();

    - page_cache_release() -> put_page();
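
    A before/after fragment illustrating how mechanical the conversion is
    (the snippet is illustrative, not taken from any particular file):

        /* before */
        index = pos >> PAGE_CACHE_SHIFT;
        offset = pos & ~PAGE_CACHE_MASK;
        page_cache_get(page);
        page_cache_release(page);

        /* after */
        index = pos >> PAGE_SHIFT;
        offset = pos & ~PAGE_MASK;
        get_page(page);
        put_page(page);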

    This patch contains automated changes generated with coccinelle using
    script below. For some reason, coccinelle doesn't patch header files.
    I've called spatch for them manually.

    The only adjustment after coccinelle is a revert of the changes to the
    PAGE_CACHE_ALIGN definition: we are going to drop it later.

    There are a few places in the code that coccinelle didn't reach. I'll
    fix them manually in a separate patch. Comments and documentation will
    also be addressed in a separate patch.

    virtual patch

    @@
    expression E;
    @@
    - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
    + E

    @@
    expression E;
    @@
    - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
    + E

    @@
    @@
    - PAGE_CACHE_SHIFT
    + PAGE_SHIFT

    @@
    @@
    - PAGE_CACHE_SIZE
    + PAGE_SIZE

    @@
    @@
    - PAGE_CACHE_MASK
    + PAGE_MASK

    @@
    expression E;
    @@
    - PAGE_CACHE_ALIGN(E)
    + PAGE_ALIGN(E)

    @@
    expression E;
    @@
    - page_cache_get(E)
    + get_page(E)

    @@
    expression E;
    @@
    - page_cache_release(E)
    + put_page(E)

    Signed-off-by: Kirill A. Shutemov
    Acked-by: Michal Hocko
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

18 Mar, 2016

1 commit

  • The success of CMA allocation largely depends on the success of
    migration, and a key factor in that is the page reference count. Until
    now, page references have been manipulated by directly calling atomic
    functions, so we cannot track who manipulates them and where. That makes
    it hard to find the actual reason for a CMA allocation failure. CMA
    allocation should be guaranteed to succeed, so finding the offending
    place is really important.

    In this patch, call sites where the page reference is manipulated are
    converted to newly introduced wrapper functions. This is a preparation
    step for adding a tracepoint to each page reference manipulation
    function. With this facility, we can easily find the reason for a CMA
    allocation failure. There is no functional change in this patch.

    In addition, this patch also converts reference read sites. It will
    help a second step that renames page._count to something else and
    prevents later attempts to access it directly (suggested by Andrew).
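
    A hedged sketch of what such wrappers look like (names as introduced by
    the page_ref series; the tracepoint hooks from the follow-up patches are
    omitted here):

        static inline int page_ref_count(struct page *page)
        {
                return atomic_read(&page->_count);
        }

        static inline void page_ref_inc(struct page *page)
        {
                /* a later patch adds a page_ref tracepoint call here */
                atomic_inc(&page->_count);
        }

        static inline int page_ref_dec_and_test(struct page *page)
        {
                /* likewise instrumented once the tracepoints land */
                return atomic_dec_and_test(&page->_count);
        }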

    Signed-off-by: Joonsoo Kim
    Acked-by: Michal Nazarewicz
    Acked-by: Vlastimil Babka
    Cc: Minchan Kim
    Cc: Mel Gorman
    Cc: "Kirill A. Shutemov"
    Cc: Sergey Senozhatsky
    Cc: Steven Rostedt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     

16 Mar, 2016

1 commit

  • Now that migration doesn't clear page->mem_cgroup of live pages anymore,
    it's safe to make lock_page_memcg() and the memcg stat functions take
    pages, and spare the callers from memcg objects.

    [akpm@linux-foundation.org: fix warnings]
    Signed-off-by: Johannes Weiner
    Suggested-by: Vladimir Davydov
    Acked-by: Vladimir Davydov
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

23 Jan, 2016

1 commit

  • Add find_get_entries_tag() to the family of functions that include
    find_get_entries(), find_get_pages() and find_get_pages_tag(). This is
    needed for DAX dirty page handling because we need a list of both page
    offsets and radix tree entries ('indices' and 'entries' in this
    function) that are marked with the PAGECACHE_TAG_TOWRITE tag.
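
    A hedged usage sketch (argument order follows the existing
    find_get_entries()-style signatures in this family; treat it as
    illustrative):

        pgoff_t indices[PAGEVEC_SIZE];
        struct page *entries[PAGEVEC_SIZE];
        unsigned int nr;

        /* collect TOWRITE-tagged entries together with their offsets */
        nr = find_get_entries_tag(mapping, start, PAGECACHE_TAG_TOWRITE,
                                  PAGEVEC_SIZE, entries, indices);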

    Signed-off-by: Ross Zwisler
    Reviewed-by: Jan Kara
    Cc: "H. Peter Anvin"
    Cc: "J. Bruce Fields"
    Cc: "Theodore Ts'o"
    Cc: Alexander Viro
    Cc: Andreas Dilger
    Cc: Dave Chinner
    Cc: Ingo Molnar
    Cc: Jeff Layton
    Cc: Matthew Wilcox
    Cc: Thomas Gleixner
    Cc: Dan Williams
    Cc: Matthew Wilcox
    Cc: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ross Zwisler
     

16 Jan, 2016

2 commits

  • This patch adds an implementation of split_huge_page() for the new
    refcounting.

    Unlike the previous implementation, the new split_huge_page() can fail if
    somebody holds a GUP pin on the page. It also means that a pin on a page
    will prevent it from being split under you. This makes the situation in
    many places much cleaner.

    The basic scheme of split_huge_page():

    - Check that the sum of mapcounts of all subpages is equal to
    page_count() plus one (the caller's pin). Fail with -EBUSY otherwise.
    This way we can avoid useless PMD splits.

    - Freeze the page counters by splitting all PMDs and setting up migration
    PTEs.

    - Re-check the sum of mapcounts against page_count(). The page's counts
    are stable now. Return -EBUSY if the page is pinned.

    - Split compound page.

    - Unfreeze the page by removing migration entries.

    Signed-off-by: Kirill A. Shutemov
    Tested-by: Sasha Levin
    Tested-by: Aneesh Kumar K.V
    Acked-by: Jerome Marchand
    Cc: Vlastimil Babka
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Naoya Horiguchi
    Cc: Steve Capper
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Christoph Lameter
    Cc: David Rientjes

    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • lock_page() must operate on the whole compound page. It doesn't make
    much sense to lock part of a compound page. Change the code to use the
    head page's PG_locked if a tail page is passed.

    This patch also gets rid of the custom helper functions
    __set_page_locked() and __clear_page_locked(). They are replaced with
    helpers generated by __SETPAGEFLAG/__CLEARPAGEFLAG. Passing a tail page
    to these helpers would trigger VM_BUG_ON().

    SLUB uses PG_locked as a bit spin lock. IIUC, tail pages should never
    appear there. VM_BUG_ON() is added to make sure that this assumption is
    correct.

    [akpm@linux-foundation.org: fix fs/cifs/file.c]
    Signed-off-by: Kirill A. Shutemov
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Vlastimil Babka
    Cc: Christoph Lameter
    Cc: Naoya Horiguchi
    Cc: Steve Capper
    Cc: "Aneesh Kumar K.V"
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Jerome Marchand
    Cc: Jérôme Glisse
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

07 Nov, 2015

1 commit

  • There are many places which use mapping_gfp_mask to restrict a more
    generic gfp mask that is used for allocations which are not directly
    related to the page cache but are performed in the same context.

    Let's introduce a helper function which makes the restriction explicit and
    easier to track. This patch doesn't introduce any functional changes.
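
    The helper itself is tiny; as a sketch (this should match its eventual
    form in pagemap.h, but verify against the tree):

        static inline gfp_t mapping_gfp_constraint(struct address_space *mapping,
                                                   gfp_t gfp_mask)
        {
                /* restrict the caller's mask to what the mapping allows */
                return mapping_gfp_mask(mapping) & gfp_mask;
        }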

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Michal Hocko
    Suggested-by: Andrew Morton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

05 Jul, 2015

1 commit

  • Pull more vfs updates from Al Viro:
    "Assorted VFS fixes and related cleanups (IMO the most interesting in
    that part are f_path-related things and Eric's descriptor-related
    stuff). UFS regression fixes (it got broken last cycle). 9P fixes.
    fs-cache series, DAX patches, Jan's file_remove_suid() work"

    [ I'd say this is much more than "fixes and related cleanups". The
    file_table locking rule change by Eric Dumazet is a rather big and
    fundamental update even if the patch isn't huge. - Linus ]

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (49 commits)
    9p: cope with bogus responses from server in p9_client_{read,write}
    p9_client_write(): avoid double p9_free_req()
    9p: forgetting to cancel request on interrupted zero-copy RPC
    dax: bdev_direct_access() may sleep
    block: Add support for DAX reads/writes to block devices
    dax: Use copy_from_iter_nocache
    dax: Add block size note to documentation
    fs/file.c: __fget() and dup2() atomicity rules
    fs/file.c: don't acquire files->file_lock in fd_install()
    fs:super:get_anon_bdev: fix race condition could cause dev exceed its upper limitation
    vfs: avoid creation of inode number 0 in get_next_ino
    namei: make set_root_rcu() return void
    make simple_positive() public
    ufs: use dir_pages instead of ufs_dir_pages()
    pagemap.h: move dir_pages() over there
    remove the pointless include of lglock.h
    fs: cleanup slight list_entry abuse
    xfs: Correctly lock inode when removing suid and file capabilities
    fs: Call security_ops->inode_killpriv on truncate
    fs: Provide function telling whether file_remove_privs() will do anything
    ...

    Linus Torvalds
     

24 Jun, 2015

1 commit


02 Jun, 2015

1 commit

  • When modifying PG_dirty on cached file pages, update the new
    MEM_CGROUP_STAT_DIRTY counter. This is done in the same places where
    global NR_FILE_DIRTY is managed. The new memcg stat is visible in the
    per memcg memory.stat cgroupfs file. The most recent past attempt at
    this was http://thread.gmane.org/gmane.linux.kernel.cgroups/8632

    The new accounting supports future efforts to add per cgroup dirty
    page throttling and writeback. It also helps an administrator break
    down a container's memory usage and provides evidence to understand
    memcg oom kills (the new dirty count is included in memcg oom kill
    messages).

    The ability to move page accounting between memcg
    (memory.move_charge_at_immigrate) makes this accounting more
    complicated than the global counter. The existing
    mem_cgroup_{begin,end}_page_stat() lock is used to serialize move
    accounting with stat updates.
    Typical update operation:

        memcg = mem_cgroup_begin_page_stat(page)
        if (TestSetPageDirty()) {
            [...]
            mem_cgroup_update_page_stat(memcg)
        }
        mem_cgroup_end_page_stat(memcg)

    Summary of mem_cgroup_end_page_stat() overhead:
    - Without CONFIG_MEMCG it's a no-op
    - With CONFIG_MEMCG and no inter memcg task movement, it's just
    rcu_read_lock()
    - With CONFIG_MEMCG and inter memcg task movement, it's
    rcu_read_lock() + spin_lock_irqsave()

    A memcg parameter is added to several routines because their callers
    now grab mem_cgroup_begin_page_stat(), which returns the memcg later
    needed by mem_cgroup_update_page_stat().

    Because mem_cgroup_begin_page_stat() may disable interrupts, some
    adjustments are needed:
    - move __mark_inode_dirty() from __set_page_dirty() to its caller.
    __mark_inode_dirty() locking does not want interrupts disabled.
    - use spin_lock_irqsave(tree_lock) rather than spin_lock_irq() in
    __delete_from_page_cache(), replace_page_cache_page(),
    invalidate_complete_page2(), and __remove_mapping().

    text data bss dec hex filename
    8925147 1774832 1785856 12485835 be84cb vmlinux-!CONFIG_MEMCG-before
    8925339 1774832 1785856 12486027 be858b vmlinux-!CONFIG_MEMCG-after
    +192 text bytes
    8965977 1784992 1785856 12536825 bf4bf9 vmlinux-CONFIG_MEMCG-before
    8966750 1784992 1785856 12537598 bf4efe vmlinux-CONFIG_MEMCG-after
    +773 text bytes

    Performance tests run on v4.0-rc1-36-g4f671fe2f952. Lower is better for
    all metrics, they're all wall clock or cycle counts. The read and write
    fault benchmarks just measure fault time, they do not include I/O time.

    * CONFIG_MEMCG not set:
    baseline patched
    kbuild 1m25.030000(+-0.088% 3 samples) 1m25.426667(+-0.120% 3 samples)
    dd write 100 MiB 0.859211561 +-15.10% 0.874162885 +-15.03%
    dd write 200 MiB 1.670653105 +-17.87% 1.669384764 +-11.99%
    dd write 1000 MiB 8.434691190 +-14.15% 8.474733215 +-14.77%
    read fault cycles 254.0(+-0.000% 10 samples) 253.0(+-0.000% 10 samples)
    write fault cycles 2021.2(+-3.070% 10 samples) 1984.5(+-1.036% 10 samples)

    * CONFIG_MEMCG=y root_memcg:
    baseline patched
    kbuild 1m25.716667(+-0.105% 3 samples) 1m25.686667(+-0.153% 3 samples)
    dd write 100 MiB 0.855650830 +-14.90% 0.887557919 +-14.90%
    dd write 200 MiB 1.688322953 +-12.72% 1.667682724 +-13.33%
    dd write 1000 MiB 8.418601605 +-14.30% 8.673532299 +-15.00%
    read fault cycles 266.0(+-0.000% 10 samples) 266.0(+-0.000% 10 samples)
    write fault cycles 2051.7(+-1.349% 10 samples) 2049.6(+-1.686% 10 samples)

    * CONFIG_MEMCG=y non-root_memcg:
    baseline patched
    kbuild 1m26.120000(+-0.273% 3 samples) 1m25.763333(+-0.127% 3 samples)
    dd write 100 MiB 0.861723964 +-15.25% 0.818129350 +-14.82%
    dd write 200 MiB 1.669887569 +-13.30% 1.698645885 +-13.27%
    dd write 1000 MiB 8.383191730 +-14.65% 8.351742280 +-14.52%
    read fault cycles 265.7(+-0.172% 10 samples) 267.0(+-0.000% 10 samples)
    write fault cycles 2070.6(+-1.512% 10 samples) 2084.4(+-2.148% 10 samples)

    As expected anon page faults are not affected by this patch.

    tj: Updated to apply on top of the recent cancel_dirty_page() changes.

    Signed-off-by: Sha Zhengju
    Signed-off-by: Greg Thelen
    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Greg Thelen
     

30 Dec, 2014

1 commit

  • Commit 2457aec63745 ("mm: non-atomically mark page accessed during page
    cache allocation where possible") has added a separate parameter for
    specifying gfp mask for radix tree allocations.

    Not only is this less than optimal from the API point of view because it
    is error prone, it is also currently buggy: grab_cache_page_write_begin
    uses GFP_KERNEL for the radix tree, and if fgp_flags doesn't contain
    FGP_NOFS (mostly controlled by the filesystem via the AOP_FLAG_NOFS flag)
    but mapping_gfp_mask has __GFP_FS cleared, then the radix tree allocation
    won't obey the restriction and might recurse into the filesystem and
    cause deadlocks. Unfortunately this is the case for most filesystems,
    because only ext4 and gfs2 are using AOP_FLAG_NOFS.

    Let's simply remove the radix_gfp_mask parameter because the allocation
    context is the same for both the page cache and the radix tree. Just make
    sure that the radix tree gets only the sane subset of the mask (e.g. do
    not pass __GFP_WRITE).

    Long term it is more preferable to convert remaining users of
    AOP_FLAG_NOFS to use mapping_gfp_mask instead and simplify this
    interface even further.

    Reported-by: Dave Chinner
    Signed-off-by: Michal Hocko
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

10 Oct, 2014

1 commit

  • Ballooned pages are now detected using PageBalloon(), so the fake mapping
    is no longer required. This patch links ballooned pages to the balloon
    device using the page->private field instead of page->mapping. It also
    embeds balloon_dev_info directly into struct virtio_balloon.

    Signed-off-by: Konstantin Khlebnikov
    Cc: Rafael Aquini
    Cc: Andrey Ryabinin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     

09 Oct, 2014

1 commit

  • Pull NFS client updates from Trond Myklebust:
    "Highlights include:

    Stable fixes:
    - fix an NFSv4.1 state renewal regression
    - fix open/lock state recovery error handling
    - fix lock recovery when CREATE_SESSION/SETCLIENTID_CONFIRM fails
    - fix statd when reconnection fails
    - don't wake tasks during connection abort
    - don't start reboot recovery if lease check fails
    - fix duplicate proc entries

    Features:
    - pNFS block driver fixes and clean ups from Christoph
    - More code cleanups from Anna
    - Improve mmap() writeback performance
    - Replace use of PF_TRANS with a more generic mechanism for avoiding
    deadlocks in nfs_release_page"

    * tag 'nfs-for-3.18-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (66 commits)
    NFSv4.1: Fix an NFSv4.1 state renewal regression
    NFSv4: fix open/lock state recovery error handling
    NFSv4: Fix lock recovery when CREATE_SESSION/SETCLIENTID_CONFIRM fails
    NFS: Fabricate fscache server index key correctly
    SUNRPC: Add missing support for RPC_CLNT_CREATE_NO_RETRANS_TIMEOUT
    NFSv3: Fix missing includes of nfs3_fs.h
    NFS/SUNRPC: Remove other deadlock-avoidance mechanisms in nfs_release_page()
    NFS: avoid waiting at all in nfs_release_page when congested.
    NFS: avoid deadlocks with loop-back mounted NFS filesystems.
    MM: export page_wakeup functions
    SCHED: add some "wait..on_bit...timeout()" interfaces.
    NFS: don't use STABLE writes during writeback.
    NFSv4: use exponential retry on NFS4ERR_DELAY for async requests.
    rpc: Add -EPERM processing for xs_udp_send_request()
    rpc: return sent and err from xs_sendpages()
    lockd: Try to reconnect if statd has moved
    SUNRPC: Don't wake tasks during connection abort
    Fixing lease renewal
    nfs: fix duplicate proc entries
    pnfs/blocklayout: Fix a 64-bit division/remainder issue in bl_map_stripe
    ...

    Linus Torvalds
     

08 Oct, 2014

1 commit

  • Pull "trivial tree" updates from Jiri Kosina:
    "Usual pile from trivial tree everyone is so eagerly waiting for"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (39 commits)
    Remove MN10300_PROC_MN2WS0038
    mei: fix comments
    treewide: Fix typos in Kconfig
    kprobes: update jprobe_example.c for do_fork() change
    Documentation: change "&" to "and" in Documentation/applying-patches.txt
    Documentation: remove obsolete pcmcia-cs from Changes
    Documentation: update links in Changes
    Documentation: Docbook: Fix generated DocBook/kernel-api.xml
    score: Remove GENERIC_HAS_IOMAP
    gpio: fix 'CONFIG_GPIO_IRQCHIP' comments
    tty: doc: Fix grammar in serial/tty
    dma-debug: modify check_for_stack output
    treewide: fix errors in printk
    genirq: fix reference in devm_request_threaded_irq comment
    treewide: fix synchronize_rcu() in comments
    checkstack.pl: port to AArch64
    doc: queue-sysfs: minor fixes
    init/do_mounts: better syntax description
    MIPS: fix comment spelling
    powerpc/simpleboot: fix comment
    ...

    Linus Torvalds
     

25 Sep, 2014

2 commits

  • This will allow NFS to wait for PG_private to be cleared and,
    particularly, to send a wake-up when it is.

    Signed-off-by: NeilBrown
    Acked-by: Andrew Morton
    Signed-off-by: Trond Myklebust

    NeilBrown
     
  • In commit c1221321b7c25b53204447cff9949a6d5a7ddddc ("sched: Allow
    wait_on_bit_action() functions to support a timeout") I suggested that a
    "wait_on_bit_timeout()" interface would not meet my need. This isn't
    true - I was just over-engineering.

    Including a 'private' field in wait_bit_key instead of a focused
    "timeout" field was just premature generalization. If some other
    use is ever found, it can be generalized or added later.

    So this patch renames "private" to "timeout", with the meaning "stop
    waiting when jiffies reaches or passes timeout", and adds two of the many
    possible wait..bit..timeout() interfaces:

    wait_on_page_bit_killable_timeout(), which is the one I want to use,
    and out_of_line_wait_on_bit_timeout() which is a reasonably general
    example. Others can be added as needed.
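
    A hedged usage sketch of the page-level interface (per the description
    above, waiting stops once jiffies reaches or passes the timeout value, or
    earlier on a fatal signal):

        /* wait for PG_private to clear, but not past the given deadline */
        ret = wait_on_page_bit_killable_timeout(page, PG_private,
                                                jiffies + HZ);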

    Acked-by: Peter Zijlstra (Intel)
    Signed-off-by: NeilBrown
    Acked-by: Ingo Molnar
    Signed-off-by: Trond Myklebust

    NeilBrown
     

26 Aug, 2014

1 commit


07 Aug, 2014

1 commit

  • Add a comment describing the circumstances in which
    __lock_page_or_retry() will or will not release the mmap_sem when
    returning 0.

    Add comments to lock_page_or_retry()'s callers (filemap_fault(),
    do_swap_page()) noting the impact on VM_FAULT_RETRY returns.

    Add comments up the call tree, particularly replacing the false "We
    return with mmap_sem still held" comments.

    Signed-off-by: Paul Cassella
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Cassella
     

24 Jul, 2014

1 commit

  • I triggered VM_BUG_ON() in vma_address() when I tried to migrate an
    anonymous hugepage with mbind() in kernel v3.16-rc3. This is because the
    pgoff calculation in rmap_walk_anon() fails to take compound_order() into
    account and so ends up with an incorrect value.

    This patch introduces page_to_pgoff(), which gets the page's offset in
    PAGE_CACHE_SIZE.

    Kirill pointed out that the page cache tree should natively handle
    hugepages, and that in order to make hugetlbfs fit it, page->index of a
    hugetlbfs page should be in PAGE_CACHE_SIZE units. That is beyond this
    patch, but page_to_pgoff() contains the point to be fixed in a single
    function.
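
    A hedged sketch of the helper as described (the hugetlb branch scales
    page->index by the compound order; exact guards may differ in the tree):

        static inline pgoff_t page_to_pgoff(struct page *page)
        {
                if (unlikely(PageHeadHuge(page)))
                        return page->index << compound_order(page);
                else
                        return page->index >> (PAGE_CACHE_SHIFT - PAGE_SHIFT);
        }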

    Signed-off-by: Naoya Horiguchi
    Acked-by: Kirill A. Shutemov
    Cc: Joonsoo Kim
    Cc: Hugh Dickins
    Cc: Rik van Riel
    Cc: Hillf Danton
    Cc: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     

05 Jun, 2014

3 commits

  • aops->write_begin may allocate a new page and make it visible only to
    have mark_page_accessed called almost immediately after. Once the page is
    visible the atomic operations are necessary, which is noticeable overhead
    when writing to an in-memory filesystem like tmpfs but should also be
    noticeable with fast storage. The objective of the patch is to initialise
    the accessed information with non-atomic operations before the page is
    visible.

    The bulk of filesystems directly or indirectly use
    grab_cache_page_write_begin or find_or_create_page for the initial
    allocation of a page cache page. This patch adds an init_page_accessed()
    helper which behaves like the first call to mark_page_accessed() but may
    called before the page is visible and can be done non-atomically.

    The primary APIs of concern in this case are the following, and they are
    used by most filesystems.

    find_get_page
    find_lock_page
    find_or_create_page
    grab_cache_page_nowait
    grab_cache_page_write_begin

    All of them are very similar in detail, so the patch creates a core
    helper, pagecache_get_page(), which takes a flags parameter that affects
    its behaviour, such as whether the page should be marked accessed or not.
    The old API is preserved but is basically a thin wrapper around this core
    function.
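
    As a hedged sketch of the resulting shape (FGP flag names from this
    series; the real pagecache_get_page() also takes gfp parameters whose
    exact number changed over time, so the trailing 0s are illustrative), the
    old lookup helpers reduce to thin wrappers:

        static inline struct page *find_get_page(struct address_space *mapping,
                                                 pgoff_t offset)
        {
                return pagecache_get_page(mapping, offset, 0, 0);
        }

        static inline struct page *find_lock_page(struct address_space *mapping,
                                                  pgoff_t offset)
        {
                return pagecache_get_page(mapping, offset, FGP_LOCK, 0);
        }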

    Each of the filesystems is then updated to avoid calling
    mark_page_accessed when it is known that the VM interfaces have already
    done the job. There is a slight snag in that the timing of
    mark_page_accessed() has now changed, so in rare cases it's possible a
    page gets to the end of the LRU as PageReferenced where previously it
    might have been repromoted. This is expected to be rare but it's worth
    filesystem people thinking about in case they see a problem with the
    timing change. It is also the case that some filesystems may be marking
    pages accessed that previously did not, but it makes sense that
    filesystems have consistent behaviour in this regard.

    The test case used to evaluate this is a simple dd of a large file done
    multiple times with the file deleted on each iterations. The size of the
    file is 1/10th physical memory to avoid dirty page balancing. In the
    async case it will be possible that the workload completes without even
    hitting the disk and will have variable results but highlight the impact
    of mark_page_accessed for async IO. The sync results are expected to be
    more stable. The exception is tmpfs where the normal case is for the "IO"
    to not hit the disk.

    The test machine was single socket and UMA to avoid any scheduling or NUMA
    artifacts. Throughput and wall times are presented for sync IO, only wall
    times are shown for async as the granularity reported by dd and the
    variability is unsuitable for comparison. As async results were variable
    due to writeback timings, I'm only reporting the maximum figures. The sync
    results were stable enough to make the mean and stddev uninteresting.

    The performance results are reported based on a run with no profiling.
    Profile data is based on a separate run with oprofile running.

    async dd
    3.15.0-rc3 3.15.0-rc3
    vanilla accessed-v2
    ext3 Max elapsed 13.9900 ( 0.00%) 11.5900 ( 17.16%)
    tmpfs Max elapsed 0.5100 ( 0.00%) 0.4900 ( 3.92%)
    btrfs Max elapsed 12.8100 ( 0.00%) 12.7800 ( 0.23%)
    ext4 Max elapsed 18.6000 ( 0.00%) 13.3400 ( 28.28%)
    xfs Max elapsed 12.5600 ( 0.00%) 2.0900 ( 83.36%)

    The XFS figure is a bit strange as it managed to avoid a worst case by
    sheer luck but the average figures looked reasonable.

    samples percentage
    ext3 86107 0.9783 vmlinux-3.15.0-rc4-vanilla mark_page_accessed
    ext3 23833 0.2710 vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed
    ext3 5036 0.0573 vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed
    ext4 64566 0.8961 vmlinux-3.15.0-rc4-vanilla mark_page_accessed
    ext4 5322 0.0713 vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed
    ext4 2869 0.0384 vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed
    xfs 62126 1.7675 vmlinux-3.15.0-rc4-vanilla mark_page_accessed
    xfs 1904 0.0554 vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed
    xfs 103 0.0030 vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed
    btrfs 10655 0.1338 vmlinux-3.15.0-rc4-vanilla mark_page_accessed
    btrfs 2020 0.0273 vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed
    btrfs 587 0.0079 vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed
    tmpfs 59562 3.2628 vmlinux-3.15.0-rc4-vanilla mark_page_accessed
    tmpfs 1210 0.0696 vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed
    tmpfs 94 0.0054 vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed

    [akpm@linux-foundation.org: don't run init_page_accessed() against an uninitialised pointer]
    Signed-off-by: Mel Gorman
    Cc: Johannes Weiner
    Cc: Vlastimil Babka
    Cc: Jan Kara
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Theodore Ts'o
    Cc: "Paul E. McKenney"
    Cc: Oleg Nesterov
    Cc: Rik van Riel
    Cc: Peter Zijlstra
    Tested-by: Prabhakar Lad
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • cold is a bool, so make it one. Make the likely case the "if" part of
    the block instead of the "else", as according to the optimisation manual
    this is preferred.

    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel
    Cc: Johannes Weiner
    Cc: Vlastimil Babka
    Cc: Jan Kara
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Theodore Ts'o
    Cc: "Paul E. McKenney"
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • page_endio() takes care of updating all the appropriate page flags once
    I/O has finished to a page. Switch to using mapping_set_error() instead
    of setting AS_EIO directly; this will handle thin-provisioned devices
    correctly.
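
    For reference, a sketch of mapping_set_error() as it stood at the time
    (before the later errseq_t rework), which is what page_endio() switches
    to; note the separate AS_ENOSPC path, which is what matters for
    thin-provisioned devices:

        static inline void mapping_set_error(struct address_space *mapping, int error)
        {
                if (unlikely(error)) {
                        if (error == -ENOSPC)
                                set_bit(AS_ENOSPC, &mapping->flags);
                        else
                                set_bit(AS_EIO, &mapping->flags);
                }
        }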

    Signed-off-by: Matthew Wilcox
    Cc: Dave Chinner
    Cc: Dheeraj Reddy
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     

04 Apr, 2014

4 commits

  • This patch removes read_cache_page_async() which wasn't really needed
    anywhere and simplifies the code around it a bit.

    read_cache_page_async() is useful when we want to read a page into the
    cache without waiting for it to complete. This happens when the
    appropriate callback 'filler' doesn't complete its read operation and
    releases the page lock immediately, and instead queues a different
    completion routine to do that. This never actually happened anywhere in
    the code.

    read_cache_page_async() had 3 different callers:

    - read_cache_page() which is the sync version, it would just wait for
    the requested read to complete using wait_on_page_read().

    - JFFS2 would call it from jffs2_gc_fetch_page(), but the filler
    function it supplied doesn't do any async reads, and would complete
    before the filler function returns - making it actually a sync read.

    - CRAMFS would call it using the read_mapping_page_async() wrapper, with
    a similar story to JFFS2 - the filler function doesn't do anything that
    resembles an async read and would always complete before the filler
    function returns.

    To sum it up, the code in mm/filemap.c never took advantage of having
    read_cache_page_async(). While there are filler callbacks that do async
    reads (such as the block one), we always called them via the synchronous
    read_cache_page().

    This patch adds a mandatory wait for read to complete when adding a new
    page to the cache, and removes read_cache_page_async() and its wrappers.

    Signed-off-by: Sasha Levin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sasha Levin
     
  • Reclaim will be leaving shadow entries in the page cache radix tree upon
    evicting the real page. As those pages are found from the LRU, an
    iput() can lead to the inode being freed concurrently. At this point,
    reclaim must no longer install shadow pages because the inode freeing
    code needs to ensure the page tree is really empty.

    Add an address_space flag, AS_EXITING, that the inode freeing code sets
    under the tree lock before doing the final truncate. Reclaim will check
    for this flag before installing shadow pages.
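
    A hedged sketch of the accompanying helpers (as they appear in pagemap.h;
    shown here for context):

        static inline void mapping_set_exiting(struct address_space *mapping)
        {
                set_bit(AS_EXITING, &mapping->flags);
        }

        static inline int mapping_exiting(struct address_space *mapping)
        {
                return test_bit(AS_EXITING, &mapping->flags);
        }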

    Signed-off-by: Johannes Weiner
    Reviewed-by: Rik van Riel
    Reviewed-by: Minchan Kim
    Cc: Andrea Arcangeli
    Cc: Bob Liu
    Cc: Christoph Hellwig
    Cc: Dave Chinner
    Cc: Greg Thelen
    Cc: Hugh Dickins
    Cc: Jan Kara
    Cc: KOSAKI Motohiro
    Cc: Luigi Semenzato
    Cc: Mel Gorman
    Cc: Metin Doslu
    Cc: Michel Lespinasse
    Cc: Ozgun Erdogan
    Cc: Peter Zijlstra
    Cc: Roman Gushchin
    Cc: Ryan Mallon
    Cc: Tejun Heo
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • shmem mappings already contain exceptional entries where swap slot
    information is remembered.

    To be able to store eviction information for regular page cache, prepare
    every site dealing with the radix trees directly to handle entries other
    than pages.

    The common lookup functions will filter out non-page entries and return
    NULL for page cache holes, just as before. But provide a raw version of
    the API which returns non-page entries as well, and switch shmem over to
    use it.
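
    A hedged sketch of the split (function name as introduced by this series;
    the exceptional-entry test is the existing radix tree helper):

        struct page *page;

        page = find_get_entry(mapping, index);  /* raw: may return a shadow entry */
        if (radix_tree_exceptional_entry(page))
                page = NULL;                    /* filtered view, as find_get_page() gives */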

    Signed-off-by: Johannes Weiner
    Reviewed-by: Rik van Riel
    Reviewed-by: Minchan Kim
    Cc: Andrea Arcangeli
    Cc: Bob Liu
    Cc: Christoph Hellwig
    Cc: Dave Chinner
    Cc: Greg Thelen
    Cc: Hugh Dickins
    Cc: Jan Kara
    Cc: KOSAKI Motohiro
    Cc: Luigi Semenzato
    Cc: Mel Gorman
    Cc: Metin Doslu
    Cc: Michel Lespinasse
    Cc: Ozgun Erdogan
    Cc: Peter Zijlstra
    Cc: Roman Gushchin
    Cc: Ryan Mallon
    Cc: Tejun Heo
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • The radix tree hole searching code is only used for the page cache, for
    example by the readahead code trying to get a picture of the area
    surrounding a fault.

    It sufficed to rely on the radix tree definition of holes, which is
    "empty tree slot". But this is about to change, as shadow page
    descriptors will be stored in the page cache after the actual pages get
    evicted from memory.

    Move the functions over to mm/filemap.c and make them native page cache
    operations, where they can later be adapted to handle the new definition
    of "page cache hole".

    Signed-off-by: Johannes Weiner
    Reviewed-by: Rik van Riel
    Reviewed-by: Minchan Kim
    Acked-by: Mel Gorman
    Cc: Andrea Arcangeli
    Cc: Bob Liu
    Cc: Christoph Hellwig
    Cc: Dave Chinner
    Cc: Greg Thelen
    Cc: Hugh Dickins
    Cc: Jan Kara
    Cc: KOSAKI Motohiro
    Cc: Luigi Semenzato
    Cc: Metin Doslu
    Cc: Michel Lespinasse
    Cc: Ozgun Erdogan
    Cc: Peter Zijlstra
    Cc: Roman Gushchin
    Cc: Ryan Mallon
    Cc: Tejun Heo
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner