08 Jun, 2018

8 commits

  • shmem/tmpfs uses a pseudo vma to allocate pages with the correct
    NUMA policy.

    The pseudo vma doesn't have vm_page_prot set. We are going to encode
    the encryption KeyID in vm_page_prot, and having garbage there causes
    problems.

    Zero out all unused fields in the pseudo vma.
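
    The fix is essentially of this shape (a sketch; the helper name and
    the surviving field assignments follow this message, not the patch
    verbatim):

    static void shmem_pseudo_vma_init(struct vm_area_struct *vma,
            struct shmem_inode_info *info, pgoff_t index)
    {
        /* Zero everything so vm_page_prot (and the rest) carry no
         * garbage, then set only what the allocation path consumes. */
        memset(vma, 0, sizeof(*vma));
        vma->vm_pgoff = index + info->vfs_inode.i_ino;
        vma->vm_policy = mpol_shared_policy_lookup(&info->policy, index);
    }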

    Link: http://lkml.kernel.org/r/20180531135602.20321-1-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Reviewed-by: Andrew Morton
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Use the new return type vm_fault_t for fault handlers. For now, this
    is just documenting that the function returns a VM_FAULT value rather
    than an errno. Once all instances are converted, vm_fault_t will
    become a distinct type.

    See commit 1c8f422059ae ("mm: change return type to vm_fault_t")

    vmf_error() is the newly introduced inline function in 4.17-rc6.
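
    A hedged sketch of the conversion pattern (the handler body and helper
    are hypothetical; vm_fault_t and vmf_error() are the real interfaces):

    static vm_fault_t example_fault(struct vm_fault *vmf)
    {
        int err = do_fault_work(vmf);   /* hypothetical helper, returns errno */

        if (err)
            return vmf_error(err);      /* errno -> VM_FAULT_* code */
        return VM_FAULT_NOPAGE;
    }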

    Link: http://lkml.kernel.org/r/20180521202410.GA17912@jordon-HP-15-Notebook-PC
    Signed-off-by: Souptick Joarder
    Reviewed-by: Matthew Wilcox
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Souptick Joarder
     
  • tmpfs uses the helper d_find_alias() to find a dentry from a decoded
    inode, but d_find_alias() skips unhashed dentries, so unlinked files
    cannot be decoded from a file handle.

    This can be reproduced using xfstests test program open_by_handle:

    $ open_by_handle -c /tmp/testdir
    $ open_by_handle -dk /tmp/testdir
    open_by_handle(/tmp/testdir/file000000) returned 116 incorrectly on an unlinked open file!

    To fix this, if d_find_alias() can't find a hashed alias, call
    d_find_any_alias() to return an unhashed one.
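
    The fix boils down to a fallback of this shape (a sketch; the local
    helper name is an assumption):

    static struct dentry *shmem_find_alias(struct inode *inode)
    {
        /* Prefer a hashed alias, but accept an unhashed one so that
         * unlinked-but-open files can still be decoded. */
        struct dentry *alias = d_find_alias(inode);

        return alias ?: d_find_any_alias(inode);
    }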

    Link: http://lkml.kernel.org/r/CAOQ4uxg+qSLP0KwdW+h1tcPqOCQd+_pGZVXiePQB1TXCMBMctQ@mail.gmail.com
    Signed-off-by: Amir Goldstein
    Reviewed-by: NeilBrown
    Cc: Hugh Dickins
    Cc: Jeff Layton
    Cc: "J. Bruce Fields"
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Amir Goldstein
     
  • Since tmpfs gained THP support in 4.8, hugetlbfs is no longer the
    only filesystem with huge page support. tmpfs can use huge pages via
    THP when mounted with the "huge=" mount option.

    When applications use huge pages on hugetlbfs, they just need to check
    the filesystem magic number, but that is not enough for tmpfs. Make
    stat.st_blksize return the huge page size if tmpfs is mounted with an
    appropriate "huge=" option, to give applications a hint to optimize
    their behavior with THP.

    Some applications may not behave wisely with THP. For example, QEMU
    may mmap a file at a non-huge-page-aligned hint address with
    MAP_FIXED, which results in no pages being PMD mapped even though THP
    is used. Some applications may mmap a file at a non-huge-page-aligned
    offset. Both behaviors make THP pointless.

    statfs.f_bsize still returns 4KB for tmpfs since THP could be split,
    and tmpfs may also silently fall back to 4KB pages if there are not
    enough huge pages. Furthermore, a different f_bsize would make the
    max_blocks and free_blocks calculations harder without much benefit.
    Returning the huge page size via stat.st_blksize sounds good enough.

    Since PUD-sized huge pages are not yet supported for THP, it just
    returns HPAGE_PMD_SIZE.
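
    A sketch of the resulting getattr path (the is_huge_enabled()
    predicate name is an assumption; the point is only that st_blksize is
    switched for an appropriately mounted tmpfs):

    static int shmem_getattr(const struct path *path, struct kstat *stat,
            u32 request_mask, unsigned int query_flags)
    {
        struct inode *inode = path->dentry->d_inode;
        struct shmem_sb_info *sb_info = SHMEM_SB(inode->i_sb);

        generic_fillattr(inode, stat);
        if (is_huge_enabled(sb_info))   /* e.g. huge=always or huge=force */
            stat->blksize = HPAGE_PMD_SIZE;
        return 0;
    }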

    Hugh said:

    : Sorry, I have no enthusiasm for this patch; but do I feel strongly
    : enough to override you and everyone else to NAK it? No, I don't feel
    : that strongly, maybe st_blksize isn't worth arguing over.
    :
    : We did look at struct stat when designing huge tmpfs, to see if there
    : were any fields that should be adjusted for it; but concluded none.
    : Yes, it would sometimes be nice to have a quickly accessible indicator
    : for when tmpfs has been mounted huge (scanning /proc/mounts for options
    : can be tiresome, agreed); but since tmpfs tries to supply huge (or not)
    : pages transparently, no difference seemed right.
    :
    : So, because st_blksize is a not very useful field of struct stat, with
    : "size" in the name, we're going to put HPAGE_PMD_SIZE in there instead
    : of PAGE_SIZE, if the tmpfs was mounted with one of the huge "huge"
    : options (force or always, okay; within_size or advise, not so much).
    : Though HPAGE_PMD_SIZE is no more its "preferred I/O size" or "blocksize
    : for file system I/O" than PAGE_SIZE was.
    :
    : Which we can expect to speed up some applications and disadvantage
    : others, depending on how they interpret st_blksize: just like if we
    : changed it in the same way on non-huge tmpfs. (Did I actually try
    : changing st_blksize early on, and find it broke something? If so, I've
    : now forgotten what, and a search through commit messages didn't find
    : it; but I guess we'll find out soon enough.)
    :
    : If there were an mstat() syscall, returning a field "preferred
    : alignment", then we could certainly agree to put HPAGE_PMD_SIZE in
    : there; but in stat()'s st_blksize? And what happens when (in future)
    : mm maps this or that hard-disk filesystem's blocks with a pmd mapping -
    : should that filesystem then advertise a bigger st_blksize, despite the
    : same disk layout as before? What happens with DAX?
    :
    : And this change is not going to help the QEMU suboptimality that
    : brought you here (or does QEMU align mmaps according to st_blksize?).
    : QEMU ought to work well with kernels without this change, and kernels
    : with this change; and I hope it can easily deal with both by avoiding
    : that use of MAP_FIXED which prevented the kernel's intended alignment.

    [akpm@linux-foundation.org: remove unneeded `else']
    Link: http://lkml.kernel.org/r/1524665633-83806-1-git-send-email-yang.shi@linux.alibaba.com
    Signed-off-by: Yang Shi
    Suggested-by: Christoph Hellwig
    Reviewed-by: Christoph Hellwig
    Acked-by: Kirill A. Shutemov
    Cc: Hugh Dickins
    Cc: Michal Hocko
    Cc: Alexander Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yang Shi
     
  • With the addition of memfd hugetlbfs support, we now have the situation
    where memfd depends on TMPFS -or- HUGETLBFS. Previously, memfd was only
    supported on tmpfs, so it made sense that the code resided in shmem.c.
    In the current code, memfd is only functional if TMPFS is defined. If
    HUGETLBFS is defined and TMPFS is not defined, then memfd
    functionality will not be available for hugetlbfs. This does not
    cause BUGs, just a lack of potentially desired functionality.

    Code is restructured in the following way:
    - include/linux/memfd.h is a new file containing memfd specific
    definitions previously contained in shmem_fs.h.
    - mm/memfd.c is a new file containing memfd specific code previously
    contained in shmem.c.
    - memfd specific code is removed from shmem_fs.h and shmem.c.
    - A new config option MEMFD_CREATE is added that is defined if TMPFS
    or HUGETLBFS is defined.

    No functional changes are made to the code: restructuring only.

    Link: http://lkml.kernel.org/r/20180415182119.4517-4-mike.kravetz@oracle.com
    Signed-off-by: Mike Kravetz
    Reviewed-by: Khalid Aziz
    Cc: Andrea Arcangeli
    Cc: David Herrmann
    Cc: Hugh Dickins
    Cc: Marc-André Lureau
    Cc: Matthew Wilcox
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     
  • In preparation for memfd code restructure, update comments, definitions
    and function names dealing with file sealing to indicate that tmpfs and
    hugetlbfs are the supported filesystems. Also, change file pointer
    checks in memfd_file_seals_ptr to use defined interfaces instead of
    directly referencing file_operation structs.
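
    A hedged sketch of the resulting check, built on the defined
    interfaces shmem_file() and is_file_hugepages() rather than on
    file_operations comparisons:

    static unsigned int *memfd_file_seals_ptr(struct file *file)
    {
        if (shmem_file(file))
            return &SHMEM_I(file_inode(file))->seals;

    #ifdef CONFIG_HUGETLBFS
        if (is_file_hugepages(file))
            return &HUGETLBFS_I(file_inode(file))->seals;
    #endif

        return NULL;    /* not a sealable memfd file */
    }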

    Link: http://lkml.kernel.org/r/20180415182119.4517-3-mike.kravetz@oracle.com
    Signed-off-by: Mike Kravetz
    Reviewed-by: Khalid Aziz
    Cc: Andrea Arcangeli
    Cc: David Herrmann
    Cc: Hugh Dickins
    Cc: Marc-André Lureau
    Cc: Matthew Wilcox
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     
  • Patch series "restructure memfd code", v4.

    This patch (of 3):

    In preparation for the memfd code restructure, clean up sparse
    warnings. Most changes required adding __rcu annotations. The routine
    find_swap_entry was modified to properly dereference radix tree
    entries.
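
    For reference, the RCU-safe way to walk and dereference radix tree
    slots looks roughly like this (loop body elided; the deref/retry
    helpers are the real API):

    void **slot;
    struct radix_tree_iter iter;

    rcu_read_lock();
    radix_tree_for_each_slot(slot, &mapping->i_pages, &iter, 0) {
        void *entry = radix_tree_deref_slot(slot);

        if (radix_tree_deref_retry(entry)) {
            slot = radix_tree_iter_retry(&iter);
            continue;
        }
        /* ... inspect entry ... */
    }
    rcu_read_unlock();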

    Link: http://lkml.kernel.org/r/20180415182119.4517-2-mike.kravetz@oracle.com
    Signed-off-by: Mike Kravetz
    Signed-off-by: Matthew Wilcox
    Cc: Hugh Dickins
    Cc: Andrea Arcangeli
    Cc: Michal Hocko
    Cc: Marc-André Lureau
    Cc: David Herrmann
    Cc: Khalid Aziz
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     
  • Patch series "mm, memcontrol: Implement memory.swap.events", v2.

    This patchset implements memory.swap.events which contains max and fail
    events so that userland can monitor and respond to swap running out.

    This patch (of 2):

    get_swap_page() is always followed by mem_cgroup_try_charge_swap().
    This patch moves mem_cgroup_try_charge_swap() into get_swap_page() and
    makes get_swap_page() call the function even after swap allocation
    failure.

    This simplifies the callers and consolidates memcg related logic and
    will ease adding swap related memcg events.
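
    A sketch of the resulting tail of get_swap_page() (structure
    approximate; the hook records a failed allocation and returns 0 for
    it, so put_swap_page() only runs for a charge failure on a real
    slot):

    swp_entry_t get_swap_page(struct page *page)
    {
        swp_entry_t entry;

        /* ... try to allocate slot(s); entry.val == 0 on failure ... */

        /* Called even after allocation failure, keeping memcg logic
         * (and, in the next patch, failure events) in one place. */
        if (mem_cgroup_try_charge_swap(page, entry)) {
            put_swap_page(page, entry);     /* undo the allocation */
            entry.val = 0;
        }
        return entry;
    }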

    Link: http://lkml.kernel.org/r/20180416230934.GH1911913@devbig577.frc2.facebook.com
    Signed-off-by: Tejun Heo
    Reviewed-by: Andrew Morton
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Vladimir Davydov
    Cc: Roman Gushchin
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tejun Heo
     

12 Apr, 2018

1 commit

  • Remove the address_space ->tree_lock and use the xa_lock newly added to
    the radix_tree_root. Rename the address_space ->page_tree to ->i_pages,
    since we don't really care that it's a tree.
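
    The mechanical shape of the conversion (sketch):

    /* before */
    spin_lock_irq(&mapping->tree_lock);
    radix_tree_insert(&mapping->page_tree, index, page);
    spin_unlock_irq(&mapping->tree_lock);

    /* after */
    xa_lock_irq(&mapping->i_pages);
    radix_tree_insert(&mapping->i_pages, index, page);
    xa_unlock_irq(&mapping->i_pages);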

    [willy@infradead.org: fix nds32, fs/dax.c]
    Link: http://lkml.kernel.org/r/20180406145415.GB20605@bombadil.infradead.org
    Link: http://lkml.kernel.org/r/20180313132639.17387-9-willy@infradead.org
    Signed-off-by: Matthew Wilcox
    Acked-by: Jeff Layton
    Cc: Darrick J. Wong
    Cc: Dave Chinner
    Cc: Ryusuke Konishi
    Cc: Will Deacon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     

06 Apr, 2018

1 commit

  • This patch makes do_swap_page() not need to be aware of two different
    swap readahead algorithms. Just unify the cluster-based and vma-based
    readahead function calls.
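
    The unified entry point ends up as a simple dispatch (sketch; the
    callee names follow the cluster/vma naming used by the series):

    struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
            struct vm_fault *vmf)
    {
        return swap_use_vma_readahead() ?
            swap_vma_readahead(entry, gfp_mask, vmf) :
            swap_cluster_readahead(entry, gfp_mask, vmf);
    }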

    Link: http://lkml.kernel.org/r/1509520520-32367-3-git-send-email-minchan@kernel.org
    Link: http://lkml.kernel.org/r/20180220085249.151400-3-minchan@kernel.org
    Signed-off-by: Minchan Kim
    Reviewed-by: Andrew Morton
    Cc: Hugh Dickins
    Cc: Huang Ying
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     

23 Mar, 2018

1 commit

  • shmem_unused_huge_shrink() gets called from the reclaim path. Waiting
    for the page lock there may lead to deadlock.

    There was a bug report that may be attributed to this:

    http://lkml.kernel.org/r/alpine.LRH.2.11.1801242349220.30642@mail.ewheeler.net

    Replace lock_page() with trylock_page() and skip the page if we failed
    to lock it. We will get to the page on the next scan.

    We can test for PageTransHuge() outside the page lock, as we only
    need protection against splitting the page under us. Holding a pin on
    the page is enough for this.
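
    The change is essentially (sketch; the label name is assumed):

    /* reclaim context: never sleep on the page lock */
    if (!trylock_page(page))
        goto next;      /* skip it; we'll get to it on the next scan */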

    Link: http://lkml.kernel.org/r/20180316210830.43738-1-kirill.shutemov@linux.intel.com
    Fixes: 779750d20b93 ("shmem: split huge pages beyond i_size under memory pressure")
    Signed-off-by: Kirill A. Shutemov
    Reported-by: Eric Wheeler
    Acked-by: Michal Hocko
    Reviewed-by: Andrew Morton
    Cc: Tetsuo Handa
    Cc: Hugh Dickins
    Cc: [4.8+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

01 Feb, 2018

3 commits

  • Adapt add_seals()/get_seals() to work with hugetlbfs-backed memory.

    Teach memfd_create() to allow sealing operations on MFD_HUGETLB.
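
    A minimal userspace sketch of what this enables (error handling
    omitted; the name string is arbitrary):

    #define _GNU_SOURCE
    #include <sys/mman.h>
    #include <fcntl.h>
    #include <unistd.h>

    int fd = memfd_create("guest-ram", MFD_HUGETLB | MFD_ALLOW_SEALING);
    ftruncate(fd, 2UL << 20);   /* one 2M huge page */
    fcntl(fd, F_ADD_SEALS, F_SEAL_GROW | F_SEAL_SHRINK | F_SEAL_SEAL);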

    Link: http://lkml.kernel.org/r/20171107122800.25517-6-marcandre.lureau@redhat.com
    Signed-off-by: Marc-André Lureau
    Reviewed-by: Mike Kravetz
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Michal Hocko
    Cc: David Herrmann
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Marc-André Lureau
     
  • Those functions are called for memfd files, backed by shmem or hugetlb
    (the next patches will handle hugetlb).

    Link: http://lkml.kernel.org/r/20171107122800.25517-3-marcandre.lureau@redhat.com
    Signed-off-by: Marc-André Lureau
    Reviewed-by: Mike Kravetz
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Michal Hocko
    Cc: David Herrmann
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Marc-André Lureau
     
  • Patch series "memfd: add sealing to hugetlb-backed memory", v3.

    Recently, Mike Kravetz added hugetlbfs support to memfd. However, he
    didn't add sealing support. One of the reasons to use memfd is to have
    shared memory sealing when doing IPC or sharing memory with another
    process with some extra safety. qemu uses shared memory & hugepages
    with vhost-user (used by dpdk), so it is reasonable to use memfd now
    instead for convenience and security reasons.

    This patch (of 9):

    The functions are called through shmem_fcntl() only, and there is no
    danger in removing the EXPORTs, as the routines only work with shmem
    file structs.

    Link: http://lkml.kernel.org/r/20171107122800.25517-2-marcandre.lureau@redhat.com
    Signed-off-by: Marc-André Lureau
    Reviewed-by: Mike Kravetz
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Michal Hocko
    Cc: David Herrmann
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Marc-André Lureau
     

28 Nov, 2017

1 commit

  • This is a pure automated search-and-replace of the internal kernel
    superblock flags.

    The s_flags are now called SB_*, with the names and the values for the
    moment mirroring the MS_* flags that they're equivalent to.

    Note how the MS_xyz flags are the ones passed to the mount system call,
    while the SB_xyz flags are what we then use in sb->s_flags.

    The script to do this was:

    # places to look in; re security/*: it generally should *not* be
    # touched (that stuff parses mount(2) arguments directly), but
    # there are two places where we really deal with superblock flags.
    FILES="drivers/mtd drivers/staging/lustre fs ipc mm \
    include/linux/fs.h include/uapi/linux/bfs_fs.h \
    security/apparmor/apparmorfs.c security/apparmor/include/lib.h"
    # the list of MS_... constants
    SYMS="RDONLY NOSUID NODEV NOEXEC SYNCHRONOUS REMOUNT MANDLOCK \
    DIRSYNC NOATIME NODIRATIME BIND MOVE REC VERBOSE SILENT \
    POSIXACL UNBINDABLE PRIVATE SLAVE SHARED RELATIME KERNMOUNT \
    I_VERSION STRICTATIME LAZYTIME SUBMOUNT NOREMOTELOCK NOSEC BORN \
    ACTIVE NOUSER"

    SED_PROG=
    for i in $SYMS; do SED_PROG="$SED_PROG -e s/MS_$i/SB_$i/g"; done

    # we want files that contain at least one of MS_...,
    # with fs/namespace.c and fs/pnode.c excluded.
    L=$(for i in $SYMS; do git grep -w -l MS_$i $FILES; done| sort|uniq|grep -v '^fs/namespace.c'|grep -v '^fs/pnode.c')

    for f in $L; do sed -i $f $SED_PROG; done

    Requested-by: Al Viro
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

18 Nov, 2017

1 commit

  • Fix the following warning by removing the unused variable:

    mm/shmem.c:3205:27: warning: variable 'info' set but not used [-Wunused-but-set-variable]

    Link: http://lkml.kernel.org/r/1510774029-30652-1-git-send-email-clabbe@baylibre.com
    Signed-off-by: Corentin Labbe
    Acked-by: Michal Hocko
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Corentin Labbe
     

16 Nov, 2017

5 commits

  • Pull drm updates from Dave Airlie:
    "This is the main drm pull request for v4.15.

    Core:
    - Atomic object lifetime fixes
    - Atomic iterator improvements
    - Sparse/smatch fixes
    - Legacy kms ioctls to be interruptible
    - EDID override improvements
    - fb/gem helper cleanups
    - Simple outreachy patches
    - Documentation improvements
    - Fix dma-buf rcu races
    - DRM mode object leasing for improving VR use cases.
    - vgaarb improvements for non-x86 platforms.

    New driver:
    - tve200: Faraday Technology TVE200 block.

    This "TV Encoder" encodes a ITU-T BT.656 stream and can be found in
    the StorLink SL3516 (later Cortina Systems CS3516) as well as the
    Grain Media GM8180.

    New bridges:
    - SiI9234 support

    New panels:
    - S6E63J0X03, OTM8009A, Seiko 43WVF1G, 7" rpi touch panel, Toshiba
    LT089AC19000, Innolux AT043TN24

    i915:
    - Remove Coffeelake from alpha support
    - Cannonlake workarounds
    - Infoframe refactoring for DisplayPort
    - VBT updates
    - DisplayPort vswing/emph/buffer translation refactoring
    - CCS fixes
    - Restore GPU clock boost on missed vblanks
    - Scatter list updates for userptr allocations
    - Gen9+ transition watermarks
    - Display IPC (Isochronous Priority Control)
    - Private PAT management
    - GVT: improved error handling and pci config sanitizing
    - Execlist refactoring
    - Transparent Huge Page support
    - User defined priorities support
    - HuC/GuC firmware refactoring
    - DP MST fixes
    - eDP power sequencing fixes
    - Use RCU instead of stop_machine
    - PSR state tracking support
    - Eviction fixes
    - BDW DP aux channel timeout fixes
    - LSPCON fixes
    - Cannonlake PLL fixes

    amdgpu:
    - Per VM BO support
    - Powerplay cleanups
    - CI powerplay support
    - PASID mgr for kfd
    - SR-IOV fixes
    - initial GPU reset for vega10
    - Prime mmap support
    - TTM updates
    - Clock query interface for Raven
    - Fence to handle ioctl
    - UVD encode ring support on Polaris
    - Transparent huge page DMA support
    - Compute LRU pipe tweaks
    - BO flag to allow buffers to opt out of implicit sync
    - CTX priority setting API
    - VRAM lost infrastructure plumbing

    qxl:
    - fix flicker since atomic rework

    amdkfd:
    - Further improvements from internal AMD tree
    - Usermode events
    - Drop radeon support

    nouveau:
    - Pascal temperature sensor support
    - Improved BAR2 handling
    - MMU rework to support Pascal MMU

    exynos:
    - Improved HDMI/mixer support
    - HDMI audio interface support

    tegra:
    - Prep work for tegra186
    - Cleanup/fixes

    msm:
    - Preemption support for a5xx
    - Display fixes for 8x96 (snapdragon 820)
    - Async cursor plane fixes
    - FW loading rework
    - GPU debugging improvements

    vc4:
    - Prep for DSI panels
    - fix T-format tiling scanout
    - New madvise ioctl

    Rockchip:
    - LVDS support

    omapdrm:
    - omap4 HDMI CEC support

    etnaviv:
    - GPU performance counters groundwork

    sun4i:
    - refactor driver load + TCON backend
    - HDMI improvements
    - A31 support
    - Misc fixes

    udl:
    - Probe/EDID read fixes.

    tilcdc:
    - Misc fixes.

    pl111:
    - Support more variants

    adv7511:
    - Improve EDID handling.
    - HDMI CEC support

    sii8620:
    - Add remote control support"

    * tag 'drm-for-v4.15' of git://people.freedesktop.org/~airlied/linux: (1480 commits)
    drm/rockchip: analogix_dp: Use mutex rather than spinlock
    drm/mode_object: fix documentation for object lookups.
    drm/i915: Reorder context-close to avoid calling i915_vma_close() under RCU
    drm/i915: Move init_clock_gating() back to where it was
    drm/i915: Prune the reservation shared fence array
    drm/i915: Idle the GPU before shinking everything
    drm/i915: Lock llist_del_first() vs llist_del_all()
    drm/i915: Calculate ironlake intermediate watermarks correctly, v2.
    drm/i915: Disable lazy PPGTT page table optimization for vGPU
    drm/i915/execlists: Remove the priority "optimisation"
    drm/i915: Filter out spurious execlists context-switch interrupts
    drm/amdgpu: use irq-safe lock for kiq->ring_lock
    drm/amdgpu: bypass lru touch for KIQ ring submission
    drm/amdgpu: Potential uninitialized variable in amdgpu_vm_update_directories()
    drm/amdgpu: potential uninitialized variable in amdgpu_vce_ring_parse_cs()
    drm/amd/powerplay: initialize a variable before using it
    drm/amd/powerplay: suppress KASAN out of bounds warning in vega10_populate_all_memory_levels
    drm/amd/amdgpu: fix evicted VRAM bo adjudgement condition
    drm/vblank: Tune drm_crtc_accurate_vblank_count() WARN down to a debug
    drm/rockchip: add CONFIG_OF dependency for lvds
    ...

    Linus Torvalds
     
  • In preparation to enabling -Wimplicit-fallthrough, mark switch cases
    where we are expecting to fall through.

    Link: http://lkml.kernel.org/r/20171020191058.GA24427@embeddedor.com
    Signed-off-by: Gustavo A. R. Silva
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gustavo A. R. Silva
     
  • shmem_inode_cachep was created with SLAB_PANIC flag and
    shmem_init_inodecache() never returns non-zero, so convert this
    function to return void.
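
    The function reduces to something like this sketch; with SLAB_PANIC,
    failure never returns here, hence void:

    static void shmem_init_inodecache(void)
    {
        shmem_inode_cachep = kmem_cache_create("shmem_inode_cache",
                sizeof(struct shmem_inode_info),
                0, SLAB_PANIC|SLAB_ACCOUNT, shmem_init_inode);
    }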

    Link: http://lkml.kernel.org/r/20170909124542.GA35224@bogon.didichuxing.com
    Signed-off-by: weiping zhang
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    weiping zhang
     
  • Every pagevec_init user claims the pages being released are hot even in
    cases where it is unlikely the pages are hot. As no one cares about the
    hotness of pages being released to the allocator, just ditch the
    parameter.

    No performance impact is expected as the overhead is marginal. The
    parameter is removed simply because it is a bit stupid to have a useless
    parameter copied everywhere.
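
    The conversion at call sites is mechanical (sketch):

    struct pagevec pvec;

    pagevec_init(&pvec);    /* was: pagevec_init(&pvec, cold) */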

    Link: http://lkml.kernel.org/r/20171018075952.10627-6-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Cc: Andi Kleen
    Cc: Dave Chinner
    Cc: Dave Hansen
    Cc: Jan Kara
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • During truncation, the mapping has already been checked for shmem and
    dax so it's known that workingset_update_node is required.

    This patch avoids the checks on mapping for each page being truncated.
    In all other cases, a lookup helper is used to determine if
    workingset_update_node() needs to be called. The one danger is that the
    API is slightly harder to use as calling workingset_update_node directly
    without checking for dax or shmem mappings could lead to surprises.
    However, the API rarely needs to be used and hopefully the comment is
    enough to give people the hint.

    sparsetruncate (tiny)
    4.14.0-rc4 4.14.0-rc4
    oneirq-v1r1 pickhelper-v1r1
    Min Time 141.00 ( 0.00%) 140.00 ( 0.71%)
    1st-qrtle Time 142.00 ( 0.00%) 141.00 ( 0.70%)
    2nd-qrtle Time 142.00 ( 0.00%) 142.00 ( 0.00%)
    3rd-qrtle Time 143.00 ( 0.00%) 143.00 ( 0.00%)
    Max-90% Time 144.00 ( 0.00%) 144.00 ( 0.00%)
    Max-95% Time 147.00 ( 0.00%) 145.00 ( 1.36%)
    Max-99% Time 195.00 ( 0.00%) 191.00 ( 2.05%)
    Max Time 230.00 ( 0.00%) 205.00 ( 10.87%)
    Amean Time 144.37 ( 0.00%) 143.82 ( 0.38%)
    Stddev Time 10.44 ( 0.00%) 9.00 ( 13.74%)
    Coeff Time 7.23 ( 0.00%) 6.26 ( 13.41%)
    Best99%Amean Time 143.72 ( 0.00%) 143.34 ( 0.26%)
    Best95%Amean Time 142.37 ( 0.00%) 142.00 ( 0.26%)
    Best90%Amean Time 142.19 ( 0.00%) 141.85 ( 0.24%)
    Best75%Amean Time 141.92 ( 0.00%) 141.58 ( 0.24%)
    Best50%Amean Time 141.69 ( 0.00%) 141.31 ( 0.27%)
    Best25%Amean Time 141.38 ( 0.00%) 140.97 ( 0.29%)

    As you'd expect, the gain is marginal but it can be detected. The
    differences in bonnie are all within the noise which is not surprising
    given the impact on the microbenchmark.

    radix_tree_update_node_t is a callback for some radix operations that
    optionally passes in a private field. The only user of the callback is
    workingset_update_node and as it no longer requires a mapping, the
    private field is removed.

    Link: http://lkml.kernel.org/r/20171018075952.10627-3-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Johannes Weiner
    Reviewed-by: Jan Kara
    Cc: Andi Kleen
    Cc: Dave Chinner
    Cc: Dave Hansen
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

07 Oct, 2017

1 commit

  • We are planning to use our own tmpfs mnt in i915 in place of the
    shm_mnt, such that we can control the mount options, in particular
    huge=, which we require to support huge-gtt-pages. So rather than roll
    our own version of __shmem_file_setup, it would be preferred if we could
    just give shmem our mnt, and let it do the rest.
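
    The interface this adds looks like the following (the i915-side mount
    field is a hypothetical name):

    struct file *shmem_file_setup_with_mnt(struct vfsmount *mnt,
            const char *name, loff_t size, unsigned long flags);

    /* i915 can then do, roughly: */
    filp = shmem_file_setup_with_mnt(i915->mm.gemfs, "i915", size, 0);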

    Signed-off-by: Matthew Auld
    Cc: Joonas Lahtinen
    Cc: Chris Wilson
    Cc: Dave Hansen
    Cc: Kirill A. Shutemov
    Cc: Hugh Dickins
    Cc: linux-mm@kvack.org
    Acked-by: Andrew Morton
    Acked-by: Kirill A. Shutemov
    Reviewed-by: Joonas Lahtinen
    Link: https://patchwork.freedesktop.org/patch/msgid/20171006145041.21673-2-matthew.auld@intel.com
    Signed-off-by: Chris Wilson
    Link: https://patchwork.freedesktop.org/patch/msgid/20171006221833.32439-1-chris@chris-wilson.co.uk

    Matthew Auld
     

14 Sep, 2017

1 commit

  • GFP_TEMPORARY was introduced by commit e12ba74d8ff3 ("Group short-lived
    and reclaimable kernel allocations") along with __GFP_RECLAIMABLE. Its
    primary motivation was to allow users to tell that an allocation is
    short lived and so the allocator can try to place such allocations close
    together and prevent long term fragmentation. As much as this sounds
    like a reasonable semantic, it becomes much less clear when to use the
    highlevel GFP_TEMPORARY allocation flag. How long is temporary? Can the
    context holding that memory sleep? Can it take locks? It seems there is
    no good answer for those questions.

    The current implementation of GFP_TEMPORARY is basically GFP_KERNEL |
    __GFP_RECLAIMABLE, which in itself is tricky because basically none of
    the existing callers provides a way to reclaim the allocated memory.
    So this is rather misleading and hard to evaluate for any benefits.

    I have checked some random users and none of them has added the flag
    with a specific justification. I suspect most of them just copied from
    other existing users and others just thought it might be a good idea to
    use without any measuring. This suggests that GFP_TEMPORARY just
    motivates for cargo cult usage without any reasoning.

    I believe that our gfp flags are quite complex already and especially
    those with highlevel semantic should be clearly defined to prevent from
    confusion and abuse. Therefore I propose dropping GFP_TEMPORARY and
    replace all existing users to simply use GFP_KERNEL. Please note that
    SLAB users with shrinkers will still get __GFP_RECLAIMABLE heuristic and
    so they will be placed properly for memory fragmentation prevention.

    I can see reasons we might want some gfp flag to reflect short-term
    allocations, but I propose starting from a clear semantic definition
    and only then adding users with proper justification.

    This was brought up before LSF this year by Matthew [1] and it turned
    out that GFP_TEMPORARY really doesn't have a clear semantic. It seems
    to be a heuristic without any measured advantage for most (if not all)
    of its current users. The follow-up discussion revealed that opinions
    on what might be a temporary allocation differ a lot between
    developers. So rather than trying to tweak existing users into a
    semantic which they haven't expected, I propose to simply remove the
    flag and start from scratch if we really need a semantic for short
    term allocations.

    [1] http://lkml.kernel.org/r/20170118054945.GD18349@bombadil.infradead.org
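
    The replacement itself is mechanical (sketch):

    ptr = kmalloc(size, GFP_KERNEL);    /* was: kmalloc(size, GFP_TEMPORARY) */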

    [akpm@linux-foundation.org: fix typo]
    [akpm@linux-foundation.org: coding-style fixes]
    [sfr@canb.auug.org.au: drm/i915: fix up]
    Link: http://lkml.kernel.org/r/20170816144703.378d4f4d@canb.auug.org.au
    Link: http://lkml.kernel.org/r/20170728091904.14627-1-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Signed-off-by: Stephen Rothwell
    Acked-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Cc: Matthew Wilcox
    Cc: Neil Brown
    Cc: "Theodore Ts'o"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

07 Sep, 2017

5 commits

  • The swap readahead is an important mechanism to reduce the swap in
    latency. Although pure sequential memory access pattern isn't very
    popular for anonymous memory, the space locality is still considered
    valid.

    In the original swap readahead implementation, consecutive blocks in
    the swap device are readahead based on a global space locality
    estimation. But consecutive blocks in the swap device just reflect
    the order of page reclaiming and don't necessarily reflect the access
    pattern in virtual memory. And different tasks in the system may have
    different access patterns, which makes the global space locality
    estimation incorrect.

    In this patch, when a page fault occurs, the virtual pages near the
    fault address are readahead instead of the swap slots near the fault
    swap slot in the swap device. This avoids readahead of unrelated swap
    slots. At the same time, the swap readahead is changed to work
    per-VMA instead of globally, so that the different access patterns of
    different VMAs can be distinguished and a different readahead policy
    applied accordingly. The original core readahead detection and
    scaling algorithm is reused, because it is an effective algorithm for
    detecting space locality.

    The tests and results are as follows.

    Common test condition
    =====================

    Test Machine: Xeon E5 v3 (2 sockets, 72 threads, 32G RAM)
    Swap device: NVMe disk

    Micro-benchmark with combined access pattern
    ============================================

    vm-scalability, sequential swap test case, 4 processes to eat 50G
    virtual memory space, repeat the sequential memory writing until 300
    seconds. The first round writing will trigger swap out, the following
    rounds will trigger sequential swap in and out.

    At the same time, run vm-scalability random swap test case in
    background, 8 processes to eat 30G virtual memory space, repeat the
    random memory write until 300 seconds. This will trigger random swap-in
    in the background.

    This is a combined workload with sequential and random memory accesses
    at the same time. The result (for the sequential workload) is as
    follows:

                        Base          Optimized
                        ----          ---------
    throughput          345413 KB/s   414029 KB/s (+19.9%)
    latency.average     97.14 us      61.06 us (-37.1%)
    latency.50th        2 us          1 us
    latency.60th        2 us          1 us
    latency.70th        98 us         2 us
    latency.80th        160 us        2 us
    latency.90th        260 us        217 us
    latency.95th        346 us        369 us
    latency.99th        1.34 ms       1.09 ms
    ra_hit%             52.69%        99.98%

    The original swap readahead algorithm is confused by the background
    random access workload, so its readahead hit rate is lower. The
    VMA-based readahead algorithm works much better.

    Linpack
    =======

    The test memory size is bigger than RAM to trigger swapping.

                        Base          Optimized
                        ----          ---------
    elapsed_time        393.49 s      329.88 s (-16.2%)
    ra_hit%             86.21%        98.82%

    The scores of the base and optimized kernels show no visible change.
    But the elapsed time is reduced and the readahead hit rate improved,
    so the optimized kernel runs better in the startup and teardown
    stages. And the high absolute value of the readahead hit rate shows
    that space locality is still valid in some practical workloads.

    Link: http://lkml.kernel.org/r/20170807054038.1843-4-ying.huang@intel.com
    Signed-off-by: "Huang, Ying"
    Cc: Johannes Weiner
    Cc: Minchan Kim
    Cc: Rik van Riel
    Cc: Shaohua Li
    Cc: Hugh Dickins
    Cc: Fengguang Wu
    Cc: Tim Chen
    Cc: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Ying
     
  • This patch came out of discussions in this e-mail thread:
    http://lkml.kernel.org/r/1499357846-7481-1-git-send-email-mike.kravetz%40oracle.com

    The Oracle JVM team is developing a new garbage collection model.
    This new model requires multiple mappings of the same anonymous
    memory. One straightforward way to accomplish this is with
    memfd_create. They can use the returned fd to create multiple
    mappings of the same memory.

    The JVM today has an option to use (static hugetlb) huge pages. If this
    option is specified, they would like to use the same garbage collection
    model requiring multiple mappings to the same memory. Using hugetlbfs,
    it is possible to explicitly mount a filesystem and specify file paths
    in order to get an fd that can be used for multiple mappings. However,
    this introduces additional system admin work and coordination.

    Ideally they would like to get a hugetlbfs fd without requiring explicit
    mounting of a filesystem. Today, mmap and shmget can make use of
    hugetlbfs without explicitly mounting a filesystem. The patch adds this
    functionality to memfd_create.

    Add a new flag MFD_HUGETLB to memfd_create() that will specify the file
    to be created resides in the hugetlbfs filesystem. This is the generic
    hugetlbfs filesystem not associated with any specific mount point. As
    with other system calls that request hugetlbfs backed pages, there is
    the ability to encode huge page size in the flag arguments.

    hugetlbfs does not support sealing operations, therefore specifying
    MFD_ALLOW_SEALING with MFD_HUGETLB will result in EINVAL.

    Of course, the memfd_create man page would need updating if this type
    of functionality moves forward.
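
    A minimal userspace sketch of the multiple-mappings use case (sizes
    must be huge page multiples; error handling omitted):

    #define _GNU_SOURCE
    #include <sys/mman.h>
    #include <unistd.h>

    size_t len = 2UL << 20;     /* one 2M huge page */
    int fd = memfd_create("jvm-heap", MFD_HUGETLB);
    ftruncate(fd, len);
    /* two independent mappings of the same hugetlb-backed memory */
    void *a = mmap(NULL, len, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
    void *b = mmap(NULL, len, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);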

    Link: http://lkml.kernel.org/r/1502149672-7759-2-git-send-email-mike.kravetz@oracle.com
    Signed-off-by: Mike Kravetz
    Acked-by: Michal Hocko
    Cc: Hugh Dickins
    Cc: Andrea Arcangeli
    Cc: "Kirill A . Shutemov"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     
  • shmem_mfill_zeropage_pte is the low level routine that implements the
    userfaultfd UFFDIO_ZEROPAGE command. Since for shmem mappings zero
    pages are always allocated and accounted, the new method is a slight
    extension of the existing shmem_mcopy_atomic_pte.
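
    The two entry points end up side by side (prototypes per this
    description; the zeropage variant simply drops the source arguments):

    int shmem_mcopy_atomic_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd,
            struct vm_area_struct *dst_vma, unsigned long dst_addr,
            unsigned long src_addr, struct page **pagep);

    int shmem_mfill_zeropage_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd,
            struct vm_area_struct *dst_vma, unsigned long dst_addr);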

    Link: http://lkml.kernel.org/r/1497939652-16528-4-git-send-email-rppt@linux.vnet.ibm.com
    Signed-off-by: Mike Rapoport
    Cc: "Kirill A. Shutemov"
    Cc: Andrea Arcangeli
    Cc: Hillf Danton
    Cc: Hugh Dickins
    Cc: Pavel Emelyanov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     
  • shmem_acct_block and the update of used_blocks follow one another in
    all the places they are used. Combine these two into a helper
    function.
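
    A sketch of the combined helper (name per the series; details
    approximate):

    static bool shmem_inode_acct_block(struct inode *inode, long pages)
    {
        struct shmem_inode_info *info = SHMEM_I(inode);
        struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb);

        if (shmem_acct_block(info->flags, pages))
            return false;
        if (sbinfo->max_blocks) {
            if (percpu_counter_compare(&sbinfo->used_blocks,
                    sbinfo->max_blocks - pages) > 0)
                goto unacct;
            percpu_counter_add(&sbinfo->used_blocks, pages);
        }
        return true;

    unacct:
        shmem_unacct_blocks(info->flags, pages);
        return false;
    }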

    Link: http://lkml.kernel.org/r/1497939652-16528-3-git-send-email-rppt@linux.vnet.ibm.com
    Signed-off-by: Mike Rapoport
    Cc: "Kirill A. Shutemov"
    Cc: Andrea Arcangeli
    Cc: Hillf Danton
    Cc: Hugh Dickins
    Cc: Pavel Emelyanov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     
  • Patch series "userfaultfd: enable zeropage support for shmem".

    These patches enable support for UFFDIO_ZEROPAGE for shared memory.

    The first two patches are not strictly related to userfaultfd, they are
    just minor refactoring to reduce amount of code duplication.

    This patch (of 7):

    Currently we update inode and shmem_inode_info before verifying that
    used_blocks will not exceed max_blocks. In case it will, we undo the
    update. Let's switch the order and move the verification of the blocks
    count before the inode and shmem_inode_info update.

    Link: http://lkml.kernel.org/r/1497939652-16528-2-git-send-email-rppt@linux.vnet.ibm.com
    Signed-off-by: Mike Rapoport
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: "Kirill A. Shutemov"
    Cc: Hillf Danton
    Cc: Pavel Emelyanov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     

26 Aug, 2017

1 commit

  • /sys/kernel/mm/transparent_hugepage/shmem_enabled controls whether we
    want to allocate huge pages when allocating pages for the private
    in-kernel shmem mount.

    Unfortunately, as Dan noticed, I've screwed it up and the only way to
    make the kernel allocate huge pages for the mount is to use "force"
    there. All other values will be effectively ignored.

    Link: http://lkml.kernel.org/r/20170822144254.66431-1-kirill.shutemov@linux.intel.com
    Fixes: 5a6e75f8110c ("shmem: prepare huge= mount option and sysfs knob")
    Signed-off-by: Kirill A. Shutemov
    Reported-by: Dan Carpenter
    Cc: stable [4.8+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

11 Aug, 2017

1 commit

  • We saw many list corruption warnings on shmem shrinklist:

    WARNING: CPU: 18 PID: 177 at lib/list_debug.c:59 __list_del_entry+0x9e/0xc0
    list_del corruption. prev->next should be ffff9ae5694b82d8, but was ffff9ae5699ba960
    Modules linked in: intel_rapl sb_edac edac_core x86_pkg_temp_thermal coretemp iTCO_wdt iTCO_vendor_support crct10dif_pclmul crc32_pclmul ghash_clmulni_intel raid0 dcdbas shpchp wmi hed i2c_i801 ioatdma lpc_ich i2c_smbus acpi_cpufreq tcp_diag inet_diag sch_fq_codel ipmi_si ipmi_devintf ipmi_msghandler igb ptp crc32c_intel pps_core i2c_algo_bit i2c_core dca ipv6 crc_ccitt
    CPU: 18 PID: 177 Comm: kswapd1 Not tainted 4.9.34-t3.el7.twitter.x86_64 #1
    Hardware name: Dell Inc. PowerEdge C6220/0W6W6G, BIOS 2.2.3 11/07/2013
    Call Trace:
    dump_stack+0x4d/0x66
    __warn+0xcb/0xf0
    warn_slowpath_fmt+0x4f/0x60
    __list_del_entry+0x9e/0xc0
    shmem_unused_huge_shrink+0xfa/0x2e0
    shmem_unused_huge_scan+0x20/0x30
    super_cache_scan+0x193/0x1a0
    shrink_slab.part.41+0x1e3/0x3f0
    shrink_slab+0x29/0x30
    shrink_node+0xf9/0x2f0
    kswapd+0x2d8/0x6c0
    kthread+0xd7/0xf0
    ret_from_fork+0x22/0x30

    WARNING: CPU: 23 PID: 639 at lib/list_debug.c:33 __list_add+0x89/0xb0
    list_add corruption. prev->next should be next (ffff9ae5699ba960), but was ffff9ae5694b82d8. (prev=ffff9ae5694b82d8).
    Modules linked in: intel_rapl sb_edac edac_core x86_pkg_temp_thermal coretemp iTCO_wdt iTCO_vendor_support crct10dif_pclmul crc32_pclmul ghash_clmulni_intel raid0 dcdbas shpchp wmi hed i2c_i801 ioatdma lpc_ich i2c_smbus acpi_cpufreq tcp_diag inet_diag sch_fq_codel ipmi_si ipmi_devintf ipmi_msghandler igb ptp crc32c_intel pps_core i2c_algo_bit i2c_core dca ipv6 crc_ccitt
    CPU: 23 PID: 639 Comm: systemd-udevd Tainted: G W 4.9.34-t3.el7.twitter.x86_64 #1
    Hardware name: Dell Inc. PowerEdge C6220/0W6W6G, BIOS 2.2.3 11/07/2013
    Call Trace:
    dump_stack+0x4d/0x66
    __warn+0xcb/0xf0
    warn_slowpath_fmt+0x4f/0x60
    __list_add+0x89/0xb0
    shmem_setattr+0x204/0x230
    notify_change+0x2ef/0x440
    do_truncate+0x5d/0x90
    path_openat+0x331/0x1190
    do_filp_open+0x7e/0xe0
    do_sys_open+0x123/0x200
    SyS_open+0x1e/0x20
    do_syscall_64+0x61/0x170
    entry_SYSCALL64_slow_path+0x25/0x25

    The problem is that shmem_unused_huge_shrink() moves entries from the
    global sbinfo->shrinklist to its local lists and then releases the
    spinlock. However, a parallel shmem_setattr() could access one of these
    entries directly and add it back to the global shrinklist if it is
    removed, with the spinlock held.

    The logic itself looks solid since an entry could be either in a local
    list or the global list, otherwise it is removed from one of them by
    list_del_init(). So probably the race condition is that, one CPU is in
    the middle of INIT_LIST_HEAD() but the other CPU calls list_empty()
    which returns true too early then the following list_add_tail() sees a
    corrupted entry.

    list_empty_careful() is designed to fix this situation.
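
    The fix, in sketch form (field names per this message):

    spin_lock(&sbinfo->shrinklist_lock);
    /*
     * _careful: defend against the unlocked list_del_init() in
     * shmem_unused_huge_shrink() racing with us.
     */
    if (list_empty_careful(&info->shrinklist)) {
        list_add_tail(&info->shrinklist, &sbinfo->shrinklist);
        sbinfo->shrinklist_len++;
    }
    spin_unlock(&sbinfo->shrinklist_lock);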

    [akpm@linux-foundation.org: add comments]
    Link: http://lkml.kernel.org/r/20170803054630.18775-1-xiyou.wangcong@gmail.com
    Fixes: 779750d20b93 ("shmem: split huge pages beyond i_size under memory pressure")
    Signed-off-by: Cong Wang
    Acked-by: Linus Torvalds
    Acked-by: Kirill A. Shutemov
    Cc: Hugh Dickins
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cong Wang
     

11 Jul, 2017

1 commit

  • PR_SET_THP_DISABLE has a rather subtle semantic. It doesn't affect
    any existing mapping because it only updates mm->def_flags, which is
    a template for new mappings.

    The mappings created after prctl(PR_SET_THP_DISABLE) have VM_NOHUGEPAGE
    flag set. This can be quite surprising for all those applications which
    do not do prctl(); fork() & exec() and want to control their own THP
    behavior.

    Another usecase when the immediate semantic of the prctl might be useful
    is a combination of pre- and post-copy migration of containers with
    CRIU. In this case CRIU populates a part of a memory region with data
    that was saved during the pre-copy stage. Afterwards, the region is
    registered with userfaultfd and CRIU expects to get page faults for the
    parts of the region that were not yet populated. However, khugepaged
    collapses the pages and the expected page faults do not occur.

    In more general case, the prctl(PR_SET_THP_DISABLE) could be used as a
    temporary mechanism for enabling/disabling THP process wide.

    Implementation wise, a new MMF_DISABLE_THP flag is added. This flag
    is tested when the decision whether to use huge pages is taken,
    either during page fault or at the time of THP collapse.

    It should be noted that the new implementation makes
    PR_SET_THP_DISABLE a master override of any per-VMA setting, which
    was not the case previously.
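
    From userspace the knob is simply (sketch):

    #include <sys/prctl.h>

    prctl(PR_SET_THP_DISABLE, 1, 0, 0, 0);  /* now disables THP process-wide */
    /* ... */
    prctl(PR_SET_THP_DISABLE, 0, 0, 0, 0);  /* re-enable */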

    Fixes: a0715cc22601 ("mm, thp: add VM_INIT_DEF_MASK and PRCTL_THP_DISABLE")
    Link: http://lkml.kernel.org/r/1496415802-30944-1-git-send-email-rppt@linux.vnet.ibm.com
    Signed-off-by: Michal Hocko
    Signed-off-by: Mike Rapoport
    Cc: Vlastimil Babka
    Cc: Andrea Arcangeli
    Cc: Arnd Bergmann
    Cc: "Kirill A. Shutemov"
    Cc: Pavel Emelyanov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

07 Jul, 2017

3 commits

  • Track the following reclaim counters for every memory cgroup: PGREFILL,
    PGSCAN, PGSTEAL, PGACTIVATE, PGDEACTIVATE, PGLAZYFREE and PGLAZYFREED.

    These values are exposed using the memory.stats interface of cgroup v2.

    The meaning of each value is the same as for global counters, available
    using /proc/vmstat.

    Also, for consistency, rename mem_cgroup_count_vm_event() to
    count_memcg_event_mm().
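
    The rename is mechanical at call sites (sketch, with PGMAJFAULT as an
    example event):

    count_memcg_event_mm(vma->vm_mm, PGMAJFAULT);
    /* was: mem_cgroup_count_vm_event(vma->vm_mm, PGMAJFAULT) */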

    Link: http://lkml.kernel.org/r/1494530183-30808-1-git-send-email-guro@fb.com
    Signed-off-by: Roman Gushchin
    Suggested-by: Johannes Weiner
    Acked-by: Michal Hocko
    Acked-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Cc: Tejun Heo
    Cc: Li Zefan
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
  • Now, get_swap_page takes a struct page and allocates swap space
    according to the page size (ie, normal or THP), so it would be
    cleaner to introduce put_swap_page, the counterpart of get_swap_page.
    It calls the right swap slot free function depending on the page's
    size.
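
    The counterpart ends up as a small dispatcher (sketch per this
    description):

    void put_swap_page(struct page *page, swp_entry_t entry)
    {
        if (!PageTransHuge(page))
            swapcache_free(entry);          /* a single slot */
        else
            swapcache_free_cluster(entry);  /* a whole THP cluster */
    }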

    [ying.huang@intel.com: minor cleanup and fix]
    Link: http://lkml.kernel.org/r/20170515112522.32457-3-ying.huang@intel.com
    Signed-off-by: Minchan Kim
    Signed-off-by: "Huang, Ying"
    Acked-by: Johannes Weiner
    Cc: Andrea Arcangeli
    Cc: Ebru Akagunduz
    Cc: Hugh Dickins
    Cc: Kirill A. Shutemov
    Cc: Michal Hocko
    Cc: Rik van Riel
    Cc: Shaohua Li
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • Patch series "THP swap: Delay splitting THP during swapping out", v11.

    This patchset is to optimize the performance of Transparent Huge Page
    (THP) swap.

    Recently, the performance of storage devices improved so fast that we
    cannot saturate the disk bandwidth with a single logical CPU when
    doing page swap out, even on a high-end server machine, because the
    performance of storage devices improved faster than that of a single
    logical CPU. And it seems that the trend will not change in the near
    future. On the other hand, THP becomes more and more popular because
    of increased memory sizes. So it becomes necessary to optimize THP
    swap performance.

    The advantages of the THP swap support include:

    - Batch the swap operations for the THP to reduce lock
    acquiring/releasing, including allocating/freeing the swap space,
    adding/deleting to/from the swap cache, and writing/reading the swap
    space, etc. This will help improve the performance of the THP swap.

    - The THP swap space read/write will be 2M sequential IO. It is
    particularly helpful for swap reads, which are usually 4k random
    IO. This will improve the performance of the THP swap too.

    - It will help with memory fragmentation, especially when the THP is
    heavily used by the applications. The 2M continuous pages will be
    freed up after THP swapping out.

    - It will improve the THP utilization on systems with swap turned
    on, because the speed at which khugepaged collapses normal pages
    into a THP is quite slow. After the THP is split during swap out, it
    will take quite a long time for the normal pages to collapse back
    into a THP after being swapped in. The high THP utilization helps
    the efficiency of the page based memory management too.

    There are some concerns regarding THP swap in, mainly because possible
    enlarged read/write IO size (for swap in/out) may put more overhead on
    the storage device. To deal with that, the THP swap in should be turned
    on only when necessary. For example, it can be selected via
    "always/never/madvise" logic, to be turned on globally, turned off
    globally, or turned on only for VMA with MADV_HUGEPAGE, etc.

    This patchset is the first step for THP swap support. The plan is to
    delay splitting the THP step by step, finally avoiding splitting the
    THP during swap out and swapping the THP out/in as a whole.

    As the first step, in this patchset, the splitting huge page is delayed
    from almost the first step of swapping out to after allocating the swap
    space for the THP and adding the THP into the swap cache. This will
    reduce lock acquiring/releasing for the locks used for the swap cache
    management.

    With the patchset, the swap out throughput improves 15.5% (from about
    3.73GB/s to about 4.31GB/s) in the vm-scalability swap-w-seq test case
    with 8 processes. The test is done on a Xeon E5 v3 system. The swap
    device used is a RAM simulated PMEM (persistent memory) device. To test
    the sequential swapping out, the test case creates 8 processes, which
    sequentially allocate and write to the anonymous pages until the RAM and
    part of the swap device is used up.

    This patch (of 5):

    In this patch, splitting huge page is delayed from almost the first step
    of swapping out to after allocating the swap space for the THP
    (Transparent Huge Page) and adding the THP into the swap cache. This
    will batch the corresponding operation, thus improve THP swap out
    throughput.

    This is the first step of the THP swap optimization. The plan is to
    delay splitting the THP step by step and finally avoid splitting the
    THP altogether.

    In this patch, one swap cluster is used to hold the contents of each THP
    swapped out. So, the size of the swap cluster is changed to that of the
    THP (Transparent Huge Page) on x86_64 architecture (512). For other
    architectures which want such THP swap optimization,
    ARCH_USES_THP_SWAP_CLUSTER needs to be selected in the Kconfig file for
    the architecture. In effect, this will enlarge the swap cluster size
    by 2 times on x86_64, which may make it harder to find a free cluster
    when the swap space becomes fragmented, and so may in theory reduce
    continuous swap space allocation and sequential writes. The
    performance test in 0day shows no regressions caused by this.

    In the future of THP swap optimization, some information of the swapped
    out THP (such as compound map count) will be recorded in the
    swap_cluster_info data structure.

    The mem cgroup swap accounting functions are enhanced to support
    charging or uncharging a swap cluster backing a THP as a whole.

    The swap cluster allocate/free functions are added to allocate/free a
    swap cluster for a THP. A fairly simple algorithm is used for swap
    cluster allocation, that is, only the first swap device in the
    priority list will be tried for allocating the swap cluster. The
    function will fail if the attempt is not successful, and the caller
    will fall back to allocating a single swap slot instead. This works
    well enough for normal cases. If the difference in the number of
    free swap clusters among multiple swap devices is significant, it is
    possible that some THPs are split earlier than necessary. For
    example, this could be caused by a big size difference among multiple
    swap devices.

    The swap cache functions are enhanced to support adding/deleting a
    THP to/from the swap cache as a set of (HPAGE_PMD_NR) sub-pages.
    This may be enhanced in the future with a multi-order radix tree.
    But because we will split the THP soon during swapping out, that
    optimization doesn't make much sense for this first step.

    The THP splitting functions are enhanced to support splitting a THP
    in the swap cache during swapping out. The page lock will be held
    during allocating the swap cluster, adding the THP into the swap
    cache and splitting the THP. So in code paths other than swapping
    out, if the THP needs to be split, PageSwapCache(THP) will always be
    false.

    The swap cluster is only available for SSD, so the THP swap optimization
    in this patchset has no effect for HDD.

    [ying.huang@intel.com: fix two issues in THP optimize patch]
    Link: http://lkml.kernel.org/r/87k25ed8zo.fsf@yhuang-dev.intel.com
    [hannes@cmpxchg.org: extensive cleanups and simplifications, reduce code size]
    Link: http://lkml.kernel.org/r/20170515112522.32457-2-ying.huang@intel.com
    Signed-off-by: "Huang, Ying"
    Signed-off-by: Johannes Weiner
    Suggested-by: Andrew Morton [for config option]
    Acked-by: Kirill A. Shutemov [for changes in huge_memory.c and huge_mm.h]
    Cc: Andrea Arcangeli
    Cc: Ebru Akagunduz
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Tejun Heo
    Cc: Hugh Dickins
    Cc: Shaohua Li
    Cc: Minchan Kim
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Ying
     

04 Jul, 2017

1 commit

  • Pull scheduler updates from Ingo Molnar:
    "The main changes in this cycle were:

    - Add the SYSTEM_SCHEDULING bootup state to move various scheduler
    debug checks earlier into the bootup. This turns silent and
    sporadically deadly bugs into nice, deterministic splats. Fix some
    of the splats that triggered. (Thomas Gleixner)

    - A round of restructuring and refactoring of the load-balancing and
    topology code (Peter Zijlstra)

    - Another round of consolidating ~20 of incremental scheduler code
    history: this time in terms of wait-queue nomenclature. (I didn't
    get much feedback on these renaming patches, and we can still
    easily change any names I might have misplaced, so if anyone hates
    a new name, please holler and I'll fix it.) (Ingo Molnar)

    - sched/numa improvements, fixes and updates (Rik van Riel)

    - Another round of x86/tsc scheduler clock code improvements, in hope
    of making it more robust (Peter Zijlstra)

    - Improve NOHZ behavior (Frederic Weisbecker)

    - Deadline scheduler improvements and fixes (Luca Abeni, Daniel
    Bristot de Oliveira)

    - Simplify and optimize the topology setup code (Lauro Ramos
    Venancio)

    - Debloat and decouple scheduler code some more (Nicolas Pitre)

    - Simplify code by making better use of llist primitives (Byungchul
    Park)

    - ... plus other fixes and improvements"

    * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (103 commits)
    sched/cputime: Refactor the cputime_adjust() code
    sched/debug: Expose the number of RT/DL tasks that can migrate
    sched/numa: Hide numa_wake_affine() from UP build
    sched/fair: Remove effective_load()
    sched/numa: Implement NUMA node level wake_affine()
    sched/fair: Simplify wake_affine() for the single socket case
    sched/numa: Override part of migrate_degrades_locality() when idle balancing
    sched/rt: Move RT related code from sched/core.c to sched/rt.c
    sched/deadline: Move DL related code from sched/core.c to sched/deadline.c
    sched/cpuset: Only offer CONFIG_CPUSETS if SMP is enabled
    sched/fair: Spare idle load balancing on nohz_full CPUs
    nohz: Move idle balancer registration to the idle path
    sched/loadavg: Generalize "_idle" naming to "_nohz"
    sched/core: Drop the unused try_get_task_struct() helper function
    sched/fair: WARN() and refuse to set buddy when !se->on_rq
    sched/debug: Fix SCHED_WARN_ON() to return a value on !CONFIG_SCHED_DEBUG as well
    sched/wait: Disambiguate wq_entry->task_list and wq_head->task_list naming
    sched/wait: Move bit_wait_table[] and related functionality from sched/core.c to sched/wait_bit.c
    sched/wait: Split out the wait_bit*() APIs from into
    sched/wait: Re-adjust macro line continuation backslashes in
    ...

    Linus Torvalds
     

20 Jun, 2017

2 commits

  • So I've noticed a number of instances where it was not obvious from the
    code whether ->task_list was for a wait-queue head or a wait-queue entry.

    Furthermore, there's a number of wait-queue users where the lists are
    not for 'tasks' but other entities (poll tables, etc.), in which case
    the 'task_list' name is actively confusing.

    To clear this all up, name the wait-queue head and entry list structure
    fields unambiguously:

    struct wait_queue_head::task_list => ::head
    struct wait_queue_entry::task_list => ::entry

    For example, this code:

    rqw->wait.task_list.next != &wait->task_list

    ... was pretty unclear (to me) what it's doing, while now it's written this way:

    rqw->wait.head.next != &wait->entry

    ... which makes it pretty clear that we are iterating a list until we see the head.

    Other examples are:

    list_for_each_entry_safe(pos, next, &x->task_list, task_list) {
    list_for_each_entry(wq, &fence->wait.task_list, task_list) {

    ... where it's unclear (to me) what we are iterating, and during review it's
    hard to tell whether it's trying to walk a wait-queue entry (which would be
    a bug), while now it's written as:

    list_for_each_entry_safe(pos, next, &x->head, entry) {
    list_for_each_entry(wq, &fence->wait.head, entry) {

    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • Rename:

    wait_queue_t => wait_queue_entry_t

    'wait_queue_t' was always a slight misnomer: its name implies that it's a "queue",
    but in reality it's a queue *entry*. The 'real' queue is the wait queue head,
    which had to carry the name.

    Start sorting this out by renaming it to 'wait_queue_entry_t'.

    This also allows the real structure name 'struct __wait_queue' to
    lose its double underscore and become 'struct wait_queue_entry',
    which is the more canonical nomenclature for such data types.

    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Ingo Molnar
     


04 Mar, 2017

1 commit

  • Pull vfs 'statx()' update from Al Viro.

    This adds the new extended stat() interface that internally subsumes our
    previous stat interfaces, and allows user mode to specify in more detail
    what kind of information it wants.

    It also allows for some explicit synchronization information to be
    passed to the filesystem, which can be relevant for network filesystems:
    is the cached value ok, or do you need open/close consistency, or what?

    From David Howells.

    Andreas Dilger points out that the first version of the extended statx
    interface was posted June 29, 2010:

    https://www.spinics.net/lists/linux-fsdevel/msg33831.html

    * 'rebased-statx' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    statx: Add a system call to make enhanced file info available

    Linus Torvalds
     

03 Mar, 2017

1 commit

    Add a system call to make extended file information available, including
    the file creation time and some attribute flags, where available through the
    underlying filesystem.

    The getattr inode operation is altered to take two additional arguments: a
    u32 request_mask and an unsigned int flags that indicates the
    synchronisation mode. This change is propagated to the vfs_getattr*()
    functions.

    Functions like vfs_stat() are now inline wrappers around new functions
    vfs_statx() and vfs_statx_fd() to reduce stack usage.
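
    In rough outline (a sketch of the shape of the change, not the
    verbatim patch), the interface becomes:

    /* ->getattr() grows a request mask and query flags: */
    int (*getattr)(const struct path *path, struct kstat *stat,
                   u32 request_mask, unsigned int query_flags);

    /* ...and the old entry points become thin wrappers, e.g.: */
    static inline int vfs_stat(const char __user *filename,
                               struct kstat *stat)
    {
            return vfs_statx(AT_FDCWD, filename, AT_STATX_SYNC_AS_STAT,
                             stat, STATX_BASIC_STATS);
    }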

    ========
    OVERVIEW
    ========

    The idea was initially proposed as a set of xattrs that could be retrieved
    with getxattr(), but the general preference proved to be for a new syscall
    with an extended stat structure.

    A number of requests were gathered for features to be included. The
    following have been included:

    (1) Make the fields a consistent size on all arches and make them large.

    (2) Spare space, request flags and information flags are provided for
    future expansion.

    (3) Better support for the y2038 problem [Arnd Bergmann] (tv_sec is an
    __s64).

    (4) Creation time: The SMB protocol carries the creation time, which could
    be exported by Samba, which will in turn help CIFS make use of
    FS-Cache as that can be used for coherency data (stx_btime).

    This is also specified in NFSv4 as a recommended attribute and could
    be exported by NFSD [Steve French].

    (5) Lightweight stat: Ask for just those details of interest, and allow a
    netfs (such as NFS) to approximate anything not of interest, possibly
    without going to the server [Trond Myklebust, Ulrich Drepper, Andreas
    Dilger] (AT_STATX_DONT_SYNC).

    (6) Heavyweight stat: Force a netfs to go to the server, even if it thinks
    its cached attributes are up to date [Trond Myklebust]
    (AT_STATX_FORCE_SYNC).

    And the following have been left out for future extension:

    (7) Data version number: Could be used by userspace NFS servers [Aneesh
    Kumar].

    Can also be used to modify fill_post_wcc() in NFSD which retrieves
    i_version directly, but has just called vfs_getattr(). It could get
    it from the kstat struct if it used vfs_xgetattr() instead.

    (There's disagreement on the exact semantics of a single field, since
    not all filesystems do this the same way).

    (8) BSD stat compatibility: Including more fields from the BSD stat such
    as creation time (st_btime) and inode generation number (st_gen)
    [Jeremy Allison, Bernd Schubert].

    (9) Inode generation number: Useful for FUSE and userspace NFS servers
    [Bernd Schubert].

    (This was asked for, but later deemed unnecessary given the
    open-by-handle capability, and it caused disagreement as to whether
    it's a security hole or not).

    (10) Extra coherency data may be useful in making backups [Andreas Dilger].

    (No particular data were offered, but things like last backup
    timestamp, the data version number and the DOS archive bit would come
    into this category).

    (11) Allow the filesystem to indicate what it can/cannot provide: A
    filesystem can now say it doesn't support a standard stat feature if
    that isn't available, so if, for instance, inode numbers or UIDs don't
    exist or are fabricated locally...

    (This requires a separate system call - I have an fsinfo() call idea
    for this).

    (12) Store a 16-byte volume ID in the superblock that can be returned in
    struct xstat [Steve French].

    (Deferred to fsinfo).

    (13) Include granularity fields in the time data to indicate the
    granularity of each of the times (NFSv4 time_delta) [Steve French].

    (Deferred to fsinfo).

    (14) FS_IOC_GETFLAGS value. These could be translated to BSD's st_flags.
    Note that the Linux IOC flags are a mess and filesystems such as Ext4
    define flags that aren't in linux/fs.h, so translation in the kernel
    may be a necessity (or, possibly, we provide the filesystem type too).

    (Some attributes are made available in stx_attributes, but the general
    feeling was that the IOC flags were too ext[234]-specific and shouldn't
    be exposed through statx this way).

    (15) Mask of features available on file (e.g. ACLs, seclabel) [Brad Boyer,
    Michael Kerrisk].

    (Deferred, probably to fsinfo. Finding out if there's an ACL or
    seclabel might require extra filesystem operations).

    (16) Femtosecond-resolution timestamps [Dave Chinner].

    (A __reserved field has been left in the statx_timestamp struct for
    this - if there proves to be a need).

    (17) A set multiple attributes syscall to go with this.

    ===============
    NEW SYSTEM CALL
    ===============

    The new system call is:

    int ret = statx(int dfd,
                    const char *filename,
                    unsigned int flags,
                    unsigned int mask,
                    struct statx *buffer);

    The dfd, filename and flags parameters indicate the file to query, in a
    similar way to fstatat(). There is no equivalent of lstat() as that can be
    emulated with statx() by passing AT_SYMLINK_NOFOLLOW in flags. There is
    also no equivalent of fstat() as that can be emulated by passing a NULL
    filename to statx() with the fd of interest in dfd.
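
    For illustration, here is a minimal userspace sketch. It uses a raw
    syscall(2), since no libc wrapper existed at the time, and the
    xstatx() helper name is made up for this example:

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <fcntl.h>              /* AT_FDCWD, AT_SYMLINK_NOFOLLOW */
    #include <unistd.h>
    #include <sys/syscall.h>
    #include <linux/stat.h>         /* struct statx, STATX_* */

    /* Hypothetical wrapper name; not part of the patch. */
    static int xstatx(int dfd, const char *filename, unsigned int flags,
                      unsigned int mask, struct statx *buffer)
    {
            return syscall(__NR_statx, dfd, filename, flags, mask, buffer);
    }

    int main(void)
    {
            struct statx stx;

            /* lstat()-like: query a path without following a final symlink. */
            if (xstatx(AT_FDCWD, "/etc/passwd", AT_SYMLINK_NOFOLLOW,
                       STATX_BASIC_STATS, &stx) == 0)
                    printf("size=%llu\n", (unsigned long long)stx.stx_size);
            return 0;
    }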

    Whether or not statx() synchronises the attributes with the backing store
    can be controlled by OR'ing a value into the flags argument (this typically
    only affects network filesystems):

    (1) AT_STATX_SYNC_AS_STAT tells statx() to behave as stat() does in this
    respect.

    (2) AT_STATX_FORCE_SYNC will require a network filesystem to synchronise
    its attributes with the server - which might require data writeback to
    occur to get the timestamps correct.

    (3) AT_STATX_DONT_SYNC will suppress synchronisation with the server in a
    network filesystem. The resulting values should be considered
    approximate.
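
    Reusing the hypothetical xstatx() wrapper sketched earlier, the
    trade-off looks like this (the NFS path is illustrative):

    /* Fast, possibly stale: take whatever the NFS client has cached. */
    xstatx(AT_FDCWD, "/mnt/nfs/file", AT_STATX_DONT_SYNC,
           STATX_BASIC_STATS, &stx);

    /* Slow, authoritative: force a round trip to the server. */
    xstatx(AT_FDCWD, "/mnt/nfs/file", AT_STATX_FORCE_SYNC,
           STATX_BASIC_STATS, &stx);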

    mask is a bitmask indicating the fields in struct statx that are of
    interest to the caller. The user should set this to STATX_BASIC_STATS to
    get the basic set returned by stat(). It should be noted that asking for
    more information may entail extra I/O operations.

    buffer points to the destination for the data. This must be 256 bytes in
    size.

    ======================
    MAIN ATTRIBUTES RECORD
    ======================

    The following structures are defined in which to return the main attribute
    set:

    struct statx_timestamp {
            __s64   tv_sec;
            __s32   tv_nsec;
            __s32   __reserved;
    };

    struct statx {
            __u32   stx_mask;
            __u32   stx_blksize;
            __u64   stx_attributes;
            __u32   stx_nlink;
            __u32   stx_uid;
            __u32   stx_gid;
            __u16   stx_mode;
            __u16   __spare0[1];
            __u64   stx_ino;
            __u64   stx_size;
            __u64   stx_blocks;
            __u64   __spare1[1];
            struct statx_timestamp  stx_atime;
            struct statx_timestamp  stx_btime;
            struct statx_timestamp  stx_ctime;
            struct statx_timestamp  stx_mtime;
            __u32   stx_rdev_major;
            __u32   stx_rdev_minor;
            __u32   stx_dev_major;
            __u32   stx_dev_minor;
            __u64   __spare2[14];
    };

    The defined bits in request_mask and stx_mask are:

    STATX_TYPE          Want/got stx_mode & S_IFMT
    STATX_MODE          Want/got stx_mode & ~S_IFMT
    STATX_NLINK         Want/got stx_nlink
    STATX_UID           Want/got stx_uid
    STATX_GID           Want/got stx_gid
    STATX_ATIME         Want/got stx_atime{,_ns}
    STATX_MTIME         Want/got stx_mtime{,_ns}
    STATX_CTIME         Want/got stx_ctime{,_ns}
    STATX_INO           Want/got stx_ino
    STATX_SIZE          Want/got stx_size
    STATX_BLOCKS        Want/got stx_blocks
    STATX_BASIC_STATS   [The stuff in the normal stat struct]
    STATX_BTIME         Want/got stx_btime{,_ns}
    STATX_ALL           [All currently available stuff]

    stx_btime is the file creation time, stx_mask is a bitmask indicating the
    data provided, and the __spare*[] arrays are where as-yet undefined fields
    can be placed.
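
    In practice a caller should therefore check stx_mask before trusting
    a field; for example (continuing the xstatx() sketch):

    /* Ask for the birth time, but only trust it if the fs provided it. */
    if (xstatx(AT_FDCWD, "somefile", 0, STATX_BTIME, &stx) == 0 &&
        (stx.stx_mask & STATX_BTIME))
            printf("btime: %lld\n", (long long)stx.stx_btime.tv_sec);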

    Time fields are structures with separate seconds and nanoseconds fields
    plus a reserved field in case we want to add even finer resolution. Note
    that times will be negative if before 1970; in such a case, the nanosecond
    fields will also be negative if not zero.

    The bits defined in the stx_attributes field convey information about a
    file, how it is accessed, where it is and what it does. The following
    attributes map to FS_*_FL flags and have the same numerical values:

    STATX_ATTR_COMPRESSED       File is compressed by the fs
    STATX_ATTR_IMMUTABLE        File is marked immutable
    STATX_ATTR_APPEND           File is append-only
    STATX_ATTR_NODUMP           File is not to be dumped
    STATX_ATTR_ENCRYPTED        File requires key to decrypt in fs

    Within the kernel, the supported flags are listed by:

    KSTAT_ATTR_FS_IOC_FLAGS

    [Are any other IOC flags of sufficient general interest to be exposed
    through this interface?]

    New flags include:

    STATX_ATTR_AUTOMOUNT Object is an automount trigger

    These are for the use of GUI tools that might want to mark files specially,
    depending on what they are.
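
    Such a tool can test the bits directly; for instance (a sketch with
    made-up GUI helpers):

    if (stx.stx_attributes & STATX_ATTR_AUTOMOUNT)
            show_automount_badge();         /* hypothetical UI hook */
    if (stx.stx_attributes & STATX_ATTR_IMMUTABLE)
            grey_out_delete_action();       /* hypothetical UI hook */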

    Fields in struct statx come in a number of classes:

    (0) stx_dev_*, stx_blksize.

    These are local system information and are always available.

    (1) stx_mode, stx_nlink, stx_uid, stx_gid, stx_[amc]time, stx_ino,
    stx_size, stx_blocks.

    These will be returned whether the caller asks for them or not. The
    corresponding bits in stx_mask will be set to indicate whether they
    actually have valid values.

    If the caller didn't ask for them, then they may be approximated. For
    example, NFS won't waste any time updating them from the server,
    except as a byproduct of updating something requested.

    If the values don't actually exist for the underlying object (such as
    UID or GID on a DOS file), then the bit won't be set in the stx_mask,
    even if the caller asked for the value. In such a case, the returned
    value will be a fabrication.

    Note that there are instances where the type might not be valid, for
    instance Windows reparse points.

    (2) stx_rdev_*.

    This will be set only if stx_mode indicates we're looking at a
    blockdev or a chardev; otherwise it will be 0.

    (3) stx_btime.

    Similar to (1), except this will be set to 0 if it doesn't exist.

    =======
    TESTING
    =======

    The following test program can be used to test the statx system call:

    samples/statx/test-statx.c

    Just compile and run, passing it paths to the files you want to examine.
    The file is built automatically if CONFIG_SAMPLES is enabled.

    Here's some example output. Firstly, an NFS directory that crosses to
    another FSID. Note that the AUTOMOUNT attribute is set because transiting
    this directory will cause d_automount to be invoked by the VFS.

    [root@andromeda ~]# /tmp/test-statx -A /warthog/data
    statx(/warthog/data) = 0
    results=7ff
    Size: 4096 Blocks: 8 IO Block: 1048576 directory
    Device: 00:26 Inode: 1703937 Links: 125
    Access: (3777/drwxrwxrwx) Uid: 0 Gid: 4041
    Access: 2016-11-24 09:02:12.219699527+0000
    Modify: 2016-11-17 10:44:36.225653653+0000
    Change: 2016-11-17 10:44:36.225653653+0000
    Attributes: 0000000000001000 (-------- -------- -------- -------- -------- -------- ---m---- --------)

    Secondly, the result of automounting on that directory.

    [root@andromeda ~]# /tmp/test-statx /warthog/data
    statx(/warthog/data) = 0
    results=7ff
    Size: 4096 Blocks: 8 IO Block: 1048576 directory
    Device: 00:27 Inode: 2 Links: 125
    Access: (3777/drwxrwxrwx) Uid: 0 Gid: 4041
    Access: 2016-11-24 09:02:12.219699527+0000
    Modify: 2016-11-17 10:44:36.225653653+0000
    Change: 2016-11-17 10:44:36.225653653+0000

    Signed-off-by: David Howells
    Signed-off-by: Al Viro

    David Howells