08 Dec, 2018

3 commits

  • commit dcf7fe9d89763a28e0f43975b422ff141fe79e43 upstream.

    Set the page dirty if VM_WRITE is not set, because in that case the pte
    won't be marked dirty and the page would be reclaimed without writepage
    (i.e. swapout in the shmem case).

    This was found by source review. Most apps (certainly including QEMU)
    only use UFFDIO_COPY on PROT_READ|PROT_WRITE mappings or the app can't
    modify the memory in the first place. This is for correctness and it
    could help the non-cooperative use case to avoid unexpected data loss.
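
    A hedged sketch of the fix being described (simplified from the
    shmem_mfill_atomic_pte() path; not the verbatim diff):

    _dst_pte = mk_pte(page, dst_vma->vm_page_prot);
    if (dst_vma->vm_flags & VM_WRITE) {
        _dst_pte = pte_mkwrite(pte_mkdirty(_dst_pte));
    } else {
        /*
         * The pte will never be dirtied by hardware on a !VM_WRITE
         * mapping, so dirty the page explicitly or reclaim could
         * drop it without calling writepage.
         */
        set_page_dirty(page);
    }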

    Link: http://lkml.kernel.org/r/20181126173452.26955-6-aarcange@redhat.com
    Reviewed-by: Hugh Dickins
    Cc: stable@vger.kernel.org
    Fixes: 4c27fe4c4c84 ("userfaultfd: shmem: add shmem_mcopy_atomic_pte for userfaultfd support")
    Reported-by: Hugh Dickins
    Signed-off-by: Andrea Arcangeli
    Cc: "Dr. David Alan Gilbert"
    Cc: Jann Horn
    Cc: Mike Kravetz
    Cc: Mike Rapoport
    Cc: Peter Xu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Andrea Arcangeli
     
  • commit e2a50c1f64145a04959df2442305d57307e5395a upstream.

    With MAP_SHARED: recheck the i_size after taking the PT lock, to
    serialize against truncate with the PT lock. Delete the page from the
    pagecache if the i_size_read check fails.

    With MAP_PRIVATE: check the i_size after the PT lock before mapping
    anonymous memory or zeropages into the MAP_PRIVATE shmem mapping.

    A mostly irrelevant cleanup: just as the delete_from_page_cache()
    pagecache removal is done after dropping the PT lock, drop the PT lock (a
    spinlock) before taking the sleepable page lock.
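
    A hedged sketch of the MAP_SHARED ordering described above (illustrative
    only; variable names approximate mm/shmem.c of that era):

    dst_pte = pte_offset_map_lock(dst_mm, dst_pmd, dst_addr, &ptl);
    max_off = DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE);
    if (unlikely(offset >= max_off)) {
        /* lost the race with truncate */
        pte_unmap_unlock(dst_pte, ptl);      /* drop the spinlock first... */
        delete_from_page_cache(page);        /* ...then the sleepable pagecache removal */
        goto out_release;
    }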

    Link: http://lkml.kernel.org/r/20181126173452.26955-5-aarcange@redhat.com
    Fixes: 4c27fe4c4c84 ("userfaultfd: shmem: add shmem_mcopy_atomic_pte for userfaultfd support")
    Signed-off-by: Andrea Arcangeli
    Reviewed-by: Mike Rapoport
    Reviewed-by: Hugh Dickins
    Reported-by: Jann Horn
    Cc:
    Cc: "Dr. David Alan Gilbert"
    Cc: Mike Kravetz
    Cc: Peter Xu
    Cc: stable@vger.kernel.org
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Andrea Arcangeli
     
  • commit 9e368259ad988356c4c95150fafd1a06af095d98 upstream.

    Patch series "userfaultfd shmem updates".

    Jann found two bugs in the userfaultfd shmem MAP_SHARED backend: the
    lack of the VM_MAYWRITE check and the lack of i_size checks.

    Then looking into the above we also fixed the MAP_PRIVATE case.

    Hugh by source review also found a data loss source if UFFDIO_COPY is
    used on shmem MAP_SHARED PROT_READ mappings (the production usages
    incidentally run with PROT_READ|PROT_WRITE, so the data loss couldn't
    happen in those production usages like with QEMU).

    The whole patchset is marked for stable.

    We verified that QEMU postcopy live migration with the guest running on
    shmem MAP_PRIVATE runs as well after the fix of shmem MAP_PRIVATE as it
    did before. Regardless of whether it's shmem or hugetlbfs, MAP_PRIVATE or
    MAP_SHARED, QEMU unconditionally invokes a hole punch if the guest mapping
    is file-backed, and a MADV_DONTNEED as well (needed to get rid of the
    MAP_PRIVATE COWs and for the anonymous backend).

    This patch (of 5):

    We internally used EFAULT to communicate with the caller; switch to
    ENOENT so that EFAULT can be used as a non-internal return value.

    Link: http://lkml.kernel.org/r/20181126173452.26955-2-aarcange@redhat.com
    Fixes: 4c27fe4c4c84 ("userfaultfd: shmem: add shmem_mcopy_atomic_pte for userfaultfd support")
    Signed-off-by: Andrea Arcangeli
    Reviewed-by: Mike Rapoport
    Reviewed-by: Hugh Dickins
    Cc: Mike Kravetz
    Cc: Jann Horn
    Cc: Peter Xu
    Cc: "Dr. David Alan Gilbert"
    Cc:
    Cc: stable@vger.kernel.org
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Andrea Arcangeli
     

06 Dec, 2018

2 commits

  • commit c1cb20d43728aa9b5393bd8d489bc85c142949b2 upstream.

    We changed the key of the swap cache tree from swp_entry_t.val to
    swp_offset. We need to do so in shmem_replace_page() as well.

    Hugh said:
    "shmem_replace_page() has been wrong since the day I wrote it: good
    enough to work on swap "type" 0, which is all most people ever use
    (especially those few who need shmem_replace_page() at all), but
    broken once there are any non-0 swp_type bits set in the higher order
    bits"

    Link: http://lkml.kernel.org/r/20181121215442.138545-1-yuzhao@google.com
    Fixes: f6ab1f7f6b2d ("mm, swap: use offset of swap entry as key of swap cache")
    Signed-off-by: Yu Zhao
    Reviewed-by: Matthew Wilcox
    Acked-by: Hugh Dickins
    Cc: [4.9+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Yu Zhao
     
  • commit aaa52e340073b7f4593b3c4ddafcafa70cf838b5 upstream.

    Huge tmpfs testing on a shortish file mapped into a pmd-rounded extent
    hit shmem_evict_inode()'s WARN_ON(inode->i_blocks) followed by
    clear_inode()'s BUG_ON(inode->i_data.nrpages) when the file was later
    closed and unlinked.

    khugepaged's collapse_shmem() was forgetting to update mapping->nrpages
    on the rollback path, after it had added but then needed to undo some
    holes.

    There is indeed an irritating asymmetry between shmem_charge(), whose
    callers want it to increment nrpages after successfully accounting
    blocks, and shmem_uncharge(), when __delete_from_page_cache() already
    decremented nrpages itself: oh well, just add a comment on that to them
    both.

    And shmem_recalc_inode() is supposed to be called when the accounting is
    expected to be in balance (so it can deduce from imbalance that reclaim
    discarded some pages): so change shmem_charge() to update nrpages
    earlier (though it's rare for the difference to matter at all).

    Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1811261523450.2275@eggly.anvils
    Fixes: 800d8c63b2e98 ("shmem: add huge pages support")
    Fixes: f3f0e1d2150b2 ("khugepaged: add support of collapse for tmpfs/shmem pages")
    Signed-off-by: Hugh Dickins
    Acked-by: Kirill A. Shutemov
    Cc: Jerome Glisse
    Cc: Konstantin Khlebnikov
    Cc: Matthew Wilcox
    Cc: [4.8+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Hugh Dickins
     

01 Dec, 2018

1 commit

  • [ Upstream commit 1a413646931cb14442065cfc17561e50f5b5bb44 ]

    Other filesystems such as ext4, f2fs and ubifs all return ENXIO when
    lseek (SEEK_DATA or SEEK_HOLE) requests a negative offset.

    man 2 lseek says

    : EINVAL whence is not valid. Or: the resulting file offset would be
    : negative, or beyond the end of a seekable device.
    :
    : ENXIO whence is SEEK_DATA or SEEK_HOLE, and the file offset is beyond
    : the end of the file.

    Make tmpfs return ENXIO under these circumstances as well. After this,
    tmpfs also passes xfstests's generic/448.
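
    A small user-space check of the behaviour described above (hypothetical
    test path; assumes /tmp is a tmpfs mount):

    #define _GNU_SOURCE        /* for SEEK_DATA / SEEK_HOLE */
    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("/tmp/seek-test", O_RDWR | O_CREAT, 0600);
        if (fd < 0)
            return 1;
        /* A negative offset with SEEK_DATA is expected to fail with ENXIO. */
        if (lseek(fd, -1, SEEK_DATA) == (off_t)-1)
            printf("lseek failed with %s\n", errno == ENXIO ? "ENXIO" : "another errno");
        close(fd);
        return 0;
    }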

    [akpm@linux-foundation.org: rewrite changelog]
    Link: http://lkml.kernel.org/r/1540434176-14349-1-git-send-email-yuyufen@huawei.com
    Signed-off-by: Yufen Yu
    Reviewed-by: Andrew Morton
    Cc: Al Viro
    Cc: Hugh Dickins
    Cc: William Kucharski
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Yufen Yu
     

29 Sep, 2018

1 commit

  • commit b45d71fb89ab8adfe727b9d0ee188ed58582a647 upstream.

    Directories and inodes don't necessarily need to be in the same lockdep
    class. For example, hugetlbfs splits them out too to prevent false
    positives in lockdep. Annotate correctly after new inode creation. If it
    is a directory inode, it will be put into a different class.

    This should fix a lockdep splat reported by syzbot:

    > ======================================================
    > WARNING: possible circular locking dependency detected
    > 4.18.0-rc8-next-20180810+ #36 Not tainted
    > ------------------------------------------------------
    > syz-executor900/4483 is trying to acquire lock:
    > 00000000d2bfc8fe (&sb->s_type->i_mutex_key#9){++++}, at: inode_lock
    > include/linux/fs.h:765 [inline]
    > 00000000d2bfc8fe (&sb->s_type->i_mutex_key#9){++++}, at:
    > shmem_fallocate+0x18b/0x12e0 mm/shmem.c:2602
    >
    > but task is already holding lock:
    > 0000000025208078 (ashmem_mutex){+.+.}, at: ashmem_shrink_scan+0xb4/0x630
    > drivers/staging/android/ashmem.c:448
    >
    > which lock already depends on the new lock.
    >
    > -> #2 (ashmem_mutex){+.+.}:
    > __mutex_lock_common kernel/locking/mutex.c:925 [inline]
    > __mutex_lock+0x171/0x1700 kernel/locking/mutex.c:1073
    > mutex_lock_nested+0x16/0x20 kernel/locking/mutex.c:1088
    > ashmem_mmap+0x55/0x520 drivers/staging/android/ashmem.c:361
    > call_mmap include/linux/fs.h:1844 [inline]
    > mmap_region+0xf27/0x1c50 mm/mmap.c:1762
    > do_mmap+0xa10/0x1220 mm/mmap.c:1535
    > do_mmap_pgoff include/linux/mm.h:2298 [inline]
    > vm_mmap_pgoff+0x213/0x2c0 mm/util.c:357
    > ksys_mmap_pgoff+0x4da/0x660 mm/mmap.c:1585
    > __do_sys_mmap arch/x86/kernel/sys_x86_64.c:100 [inline]
    > __se_sys_mmap arch/x86/kernel/sys_x86_64.c:91 [inline]
    > __x64_sys_mmap+0xe9/0x1b0 arch/x86/kernel/sys_x86_64.c:91
    > do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290
    > entry_SYSCALL_64_after_hwframe+0x49/0xbe
    >
    > -> #1 (&mm->mmap_sem){++++}:
    > __might_fault+0x155/0x1e0 mm/memory.c:4568
    > _copy_to_user+0x30/0x110 lib/usercopy.c:25
    > copy_to_user include/linux/uaccess.h:155 [inline]
    > filldir+0x1ea/0x3a0 fs/readdir.c:196
    > dir_emit_dot include/linux/fs.h:3464 [inline]
    > dir_emit_dots include/linux/fs.h:3475 [inline]
    > dcache_readdir+0x13a/0x620 fs/libfs.c:193
    > iterate_dir+0x48b/0x5d0 fs/readdir.c:51
    > __do_sys_getdents fs/readdir.c:231 [inline]
    > __se_sys_getdents fs/readdir.c:212 [inline]
    > __x64_sys_getdents+0x29f/0x510 fs/readdir.c:212
    > do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290
    > entry_SYSCALL_64_after_hwframe+0x49/0xbe
    >
    > -> #0 (&sb->s_type->i_mutex_key#9){++++}:
    > lock_acquire+0x1e4/0x540 kernel/locking/lockdep.c:3924
    > down_write+0x8f/0x130 kernel/locking/rwsem.c:70
    > inode_lock include/linux/fs.h:765 [inline]
    > shmem_fallocate+0x18b/0x12e0 mm/shmem.c:2602
    > ashmem_shrink_scan+0x236/0x630 drivers/staging/android/ashmem.c:455
    > ashmem_ioctl+0x3ae/0x13a0 drivers/staging/android/ashmem.c:797
    > vfs_ioctl fs/ioctl.c:46 [inline]
    > file_ioctl fs/ioctl.c:501 [inline]
    > do_vfs_ioctl+0x1de/0x1720 fs/ioctl.c:685
    > ksys_ioctl+0xa9/0xd0 fs/ioctl.c:702
    > __do_sys_ioctl fs/ioctl.c:709 [inline]
    > __se_sys_ioctl fs/ioctl.c:707 [inline]
    > __x64_sys_ioctl+0x73/0xb0 fs/ioctl.c:707
    > do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290
    > entry_SYSCALL_64_after_hwframe+0x49/0xbe
    >
    > other info that might help us debug this:
    >
    > Chain exists of:
    > &sb->s_type->i_mutex_key#9 --> &mm->mmap_sem --> ashmem_mutex
    >
    > Possible unsafe locking scenario:
    >
    >        CPU0                    CPU1
    >        ----                    ----
    >   lock(ashmem_mutex);
    >                                lock(&mm->mmap_sem);
    >                                lock(ashmem_mutex);
    >   lock(&sb->s_type->i_mutex_key#9);
    >
    > *** DEADLOCK ***
    >
    > 1 lock held by syz-executor900/4483:
    > #0: 0000000025208078 (ashmem_mutex){+.+.}, at:
    > ashmem_shrink_scan+0xb4/0x630 drivers/staging/android/ashmem.c:448

    Link: http://lkml.kernel.org/r/20180821231835.166639-1-joel@joelfernandes.org
    Signed-off-by: Joel Fernandes (Google)
    Reported-by: syzbot
    Reviewed-by: NeilBrown
    Suggested-by: NeilBrown
    Cc: Matthew Wilcox
    Cc: Peter Zijlstra
    Cc: Hugh Dickins
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Greg Kroah-Hartman

    Joel Fernandes (Google)
     

29 Mar, 2018

1 commit

  • commit b3cd54b257ad95d344d121dc563d943ca39b0921 upstream.

    shmem_unused_huge_shrink() gets called from the reclaim path. Waiting for
    the page lock may lead to deadlock there.

    There was a bug report that may be attributed to this:

    http://lkml.kernel.org/r/alpine.LRH.2.11.1801242349220.30642@mail.ewheeler.net

    Replace lock_page() with trylock_page() and skip the page if we failed
    to lock it. We will get to the page on the next scan.

    We can test for PageTransHuge() outside the page lock as we only need
    protection against the page being split under us. Holding a pin on the
    page is enough for this.
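
    The shape of the change, as a hedged sketch (label name illustrative):

    /* Called from reclaim context: never sleep waiting for the page lock. */
    if (!trylock_page(page))
        goto next;                  /* skip; the next scan will retry this page */

    /* ... split the huge page, then ... */
    unlock_page(page);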

    Link: http://lkml.kernel.org/r/20180316210830.43738-1-kirill.shutemov@linux.intel.com
    Fixes: 779750d20b93 ("shmem: split huge pages beyond i_size under memory pressure")
    Signed-off-by: Kirill A. Shutemov
    Reported-by: Eric Wheeler
    Acked-by: Michal Hocko
    Reviewed-by: Andrew Morton
    Cc: Tetsuo Handa
    Cc: Hugh Dickins
    Cc: [4.8+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Kirill A. Shutemov
     

14 Sep, 2017

1 commit

  • GFP_TEMPORARY was introduced by commit e12ba74d8ff3 ("Group short-lived
    and reclaimable kernel allocations") along with __GFP_RECLAIMABLE. Its
    primary motivation was to allow users to tell that an allocation is
    short lived and so the allocator can try to place such allocations close
    together and prevent long term fragmentation. As much as this sounds
    like a reasonable semantic it becomes much less clear when to use the
    highlevel GFP_TEMPORARY allocation flag. How long is temporary? Can the
    context holding that memory sleep? Can it take locks? It seems there is
    no good answer for those questions.

    The current implementation of GFP_TEMPORARY is basically GFP_KERNEL |
    __GFP_RECLAIMABLE, which in itself is tricky because basically none of
    the existing callers provides a way to reclaim the allocated memory. So
    this is rather misleading and hard to evaluate for any benefits.

    I have checked some random users and none of them has added the flag
    with a specific justification. I suspect most of them just copied from
    other existing users and others just thought it might be a good idea to
    use without any measuring. This suggests that GFP_TEMPORARY just
    motivates for cargo cult usage without any reasoning.

    I believe that our gfp flags are quite complex already and especially
    those with high-level semantics should be clearly defined to prevent
    confusion and abuse. Therefore I propose dropping GFP_TEMPORARY and
    replacing all existing users with plain GFP_KERNEL. Please note that
    SLAB users with shrinkers will still get __GFP_RECLAIMABLE heuristic and
    so they will be placed properly for memory fragmentation prevention.

    I can see reasons we might want some gfp flag to reflect short-term
    allocations but I propose starting from a clear semantic definition and
    only then add users with proper justification.

    This has been brought up before LSF this year by Matthew [1] and it
    turned out that GFP_TEMPORARY really doesn't have a clear semantic. It
    seems to be a heuristic without any measured advantage for most (if not
    all) its current users. The follow up discussion has revealed that
    opinions on what might be temporary allocation differ a lot between
    developers. So rather than trying to tweak existing users into a
    semantic which they haven't expected I propose to simply remove the flag
    and start from scratch if we really need a semantic for short term
    allocations.

    [1] http://lkml.kernel.org/r/20170118054945.GD18349@bombadil.infradead.org

    [akpm@linux-foundation.org: fix typo]
    [akpm@linux-foundation.org: coding-style fixes]
    [sfr@canb.auug.org.au: drm/i915: fix up]
    Link: http://lkml.kernel.org/r/20170816144703.378d4f4d@canb.auug.org.au
    Link: http://lkml.kernel.org/r/20170728091904.14627-1-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Signed-off-by: Stephen Rothwell
    Acked-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Cc: Matthew Wilcox
    Cc: Neil Brown
    Cc: "Theodore Ts'o"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

07 Sep, 2017

5 commits

  • The swap readahead is an important mechanism to reduce the swap in
    latency. Although pure sequential memory access pattern isn't very
    popular for anonymous memory, the space locality is still considered
    valid.

    In the original swap readahead implementation, the consecutive blocks in
    the swap device are read ahead based on the global space locality
    estimation. But the consecutive blocks in the swap device just reflect
    the order of page reclaiming and don't necessarily reflect the access
    pattern in virtual memory. And the different tasks in the system may have
    different access patterns, which makes the global space locality
    estimation incorrect.

    In this patch, when a page fault occurs, the virtual pages near the fault
    address are read ahead instead of the swap slots near the fault swap slot
    in the swap device. This avoids reading ahead unrelated swap slots. At
    the same time, the swap readahead is changed to work on a per-VMA basis
    instead of globally, so that the different access patterns of different
    VMAs can be distinguished and a different readahead policy applied
    accordingly. The original core readahead detection and scaling algorithm
    is reused, because it is an effective algorithm for detecting space
    locality.

    The test and results are as follows.

    Common test condition
    =====================

    Test Machine: Xeon E5 v3 (2 sockets, 72 threads, 32G RAM)
    Swap device: NVMe disk

    Micro-benchmark with combined access pattern
    ============================================

    vm-scalability, sequential swap test case, 4 processes to eat 50G
    virtual memory space, repeat the sequential memory writing until 300
    seconds. The first round writing will trigger swap out, the following
    rounds will trigger sequential swap in and out.

    At the same time, run vm-scalability random swap test case in
    background, 8 processes to eat 30G virtual memory space, repeat the
    random memory write until 300 seconds. This will trigger random swap-in
    in the background.

    This is a combined workload with sequential and random memory accessing
    at the same time. The result (for sequential workload) is as follow,

                        Base            Optimized
                        ----            ---------
    throughput          345413 KB/s     414029 KB/s (+19.9%)
    latency.average     97.14 us        61.06 us (-37.1%)
    latency.50th        2 us            1 us
    latency.60th        2 us            1 us
    latency.70th        98 us           2 us
    latency.80th        160 us          2 us
    latency.90th        260 us          217 us
    latency.95th        346 us          369 us
    latency.99th        1.34 ms         1.09 ms
    ra_hit%             52.69%          99.98%

    The original swap readahead algorithm is confused by the background
    random access workload, so its readahead hit rate is lower. The VMA-based
    readahead algorithm works much better.

    Linpack
    =======

    The test memory size is bigger than RAM to trigger swapping.

                        Base            Optimized
                        ----            ---------
    elapsed_time        393.49 s        329.88 s (-16.2%)
    ra_hit%             86.21%          98.82%

    The scores of the base and optimized kernels show no visible change. But
    the elapsed time is reduced and the readahead hit rate improved, so the
    optimized kernel runs better in the startup and teardown stages. And the
    absolute value of the readahead hit rate is high, which shows that the
    space locality is still valid in some practical workloads.

    Link: http://lkml.kernel.org/r/20170807054038.1843-4-ying.huang@intel.com
    Signed-off-by: "Huang, Ying"
    Cc: Johannes Weiner
    Cc: Minchan Kim
    Cc: Rik van Riel
    Cc: Shaohua Li
    Cc: Hugh Dickins
    Cc: Fengguang Wu
    Cc: Tim Chen
    Cc: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Ying
     
  • This patch came out of discussions in this e-mail thread:
    http://lkml.kernel.org/r/1499357846-7481-1-git-send-email-mike.kravetz%40oracle.com

    The Oracle JVM team is developing a new garbage collection model. This
    new model requires multiple mappings of the same anonymous memory. One
    straightforward way to accomplish this is with memfd_create. They can
    use the returned fd to create multiple mappings of the same memory.

    The JVM today has an option to use (static hugetlb) huge pages. If this
    option is specified, they would like to use the same garbage collection
    model requiring multiple mappings to the same memory. Using hugetlbfs,
    it is possible to explicitly mount a filesystem and specify file paths
    in order to get an fd that can be used for multiple mappings. However,
    this introduces additional system admin work and coordination.

    Ideally they would like to get a hugetlbfs fd without requiring explicit
    mounting of a filesystem. Today, mmap and shmget can make use of
    hugetlbfs without explicitly mounting a filesystem. The patch adds this
    functionality to memfd_create.

    Add a new flag MFD_HUGETLB to memfd_create() that will specify the file
    to be created resides in the hugetlbfs filesystem. This is the generic
    hugetlbfs filesystem not associated with any specific mount point. As
    with other system calls that request hugetlbfs backed pages, there is
    the ability to encode huge page size in the flag arguments.

    hugetlbfs does not support sealing operations, therefore specifying
    MFD_ALLOW_SEALING with MFD_HUGETLB will result in EINVAL.

    Of course, the memfd_create man page would need updating if this type of
    functionality moves forward.
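
    A user-space sketch of the new flag (assumes a libc exposing
    memfd_create(); the fallback defines cover older headers):

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <sys/mman.h>

    #ifndef MFD_ALLOW_SEALING
    #define MFD_ALLOW_SEALING 0x0002U
    #endif
    #ifndef MFD_HUGETLB
    #define MFD_HUGETLB 0x0004U
    #endif

    int main(void)
    {
        /* hugetlbfs-backed anonymous file, no explicit mount required */
        int fd = memfd_create("jvm-heap", MFD_HUGETLB);
        if (fd < 0)
            perror("memfd_create(MFD_HUGETLB)");

        /* hugetlbfs does not support sealing: this combination fails with EINVAL */
        if (memfd_create("sealed", MFD_HUGETLB | MFD_ALLOW_SEALING) < 0)
            perror("memfd_create(MFD_HUGETLB | MFD_ALLOW_SEALING)");
        return 0;
    }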

    Link: http://lkml.kernel.org/r/1502149672-7759-2-git-send-email-mike.kravetz@oracle.com
    Signed-off-by: Mike Kravetz
    Acked-by: Michal Hocko
    Cc: Hugh Dickins
    Cc: Andrea Arcangeli
    Cc: "Kirill A . Shutemov"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     
  • shmem_mfill_zeropage_pte is the low level routine that implements the
    userfaultfd UFFDIO_ZEROPAGE command. Since for shmem mappings zero
    pages are always allocated and accounted, the new method is a slight
    extension of the existing shmem_mcopy_atomic_pte.

    Link: http://lkml.kernel.org/r/1497939652-16528-4-git-send-email-rppt@linux.vnet.ibm.com
    Signed-off-by: Mike Rapoport
    Cc: "Kirill A. Shutemov"
    Cc: Andrea Arcangeli
    Cc: Hillf Danton
    Cc: Hugh Dickins
    Cc: Pavel Emelyanov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     
    shmem_acct_block() and the update of used_blocks follow one another in
    all the places they are used. Combine these two into a helper function.
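
    A hedged sketch of the combined helper (name and shape from memory of this
    series, simplified):

    static bool shmem_inode_acct_block(struct inode *inode, long pages)
    {
        struct shmem_inode_info *info = SHMEM_I(inode);
        struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb);

        if (shmem_acct_block(info->flags, pages))
            return false;
        if (sbinfo->max_blocks) {
            if (percpu_counter_compare(&sbinfo->used_blocks,
                                       sbinfo->max_blocks - pages) > 0)
                goto unacct;
            percpu_counter_add(&sbinfo->used_blocks, pages);
        }
        return true;

    unacct:
        shmem_unacct_blocks(info->flags, pages);
        return false;
    }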

    Link: http://lkml.kernel.org/r/1497939652-16528-3-git-send-email-rppt@linux.vnet.ibm.com
    Signed-off-by: Mike Rapoport
    Cc: "Kirill A. Shutemov"
    Cc: Andrea Arcangeli
    Cc: Hillf Danton
    Cc: Hugh Dickins
    Cc: Pavel Emelyanov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     
  • Patch series "userfaultfd: enable zeropage support for shmem".

    These patches enable support for UFFDIO_ZEROPAGE for shared memory.

    The first two patches are not strictly related to userfaultfd, they are
    just minor refactoring to reduce amount of code duplication.

    This patch (of 7):

    Currently we update inode and shmem_inode_info before verifying that
    used_blocks will not exceed max_blocks. In case it will, we undo the
    update. Let's switch the order and move the verification of the blocks
    count before the inode and shmem_inode_info update.

    Link: http://lkml.kernel.org/r/1497939652-16528-2-git-send-email-rppt@linux.vnet.ibm.com
    Signed-off-by: Mike Rapoport
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: "Kirill A. Shutemov"
    Cc: Hillf Danton
    Cc: Pavel Emelyanov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     

26 Aug, 2017

1 commit

    /sys/kernel/mm/transparent_hugepage/shmem_enabled controls whether we
    want to allocate huge pages when allocating pages for the private
    in-kernel shmem mount.

    Unfortunately, as Dan noticed, I've screwed it up and the only way to make
    the kernel allocate huge pages for the mount is to use "force" there. All
    other values are effectively ignored.

    Link: http://lkml.kernel.org/r/20170822144254.66431-1-kirill.shutemov@linux.intel.com
    Fixes: 5a6e75f8110c ("shmem: prepare huge= mount option and sysfs knob")
    Signed-off-by: Kirill A. Shutemov
    Reported-by: Dan Carpenter
    Cc: stable [4.8+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

11 Aug, 2017

1 commit

  • We saw many list corruption warnings on shmem shrinklist:

    WARNING: CPU: 18 PID: 177 at lib/list_debug.c:59 __list_del_entry+0x9e/0xc0
    list_del corruption. prev->next should be ffff9ae5694b82d8, but was ffff9ae5699ba960
    Modules linked in: intel_rapl sb_edac edac_core x86_pkg_temp_thermal coretemp iTCO_wdt iTCO_vendor_support crct10dif_pclmul crc32_pclmul ghash_clmulni_intel raid0 dcdbas shpchp wmi hed i2c_i801 ioatdma lpc_ich i2c_smbus acpi_cpufreq tcp_diag inet_diag sch_fq_codel ipmi_si ipmi_devintf ipmi_msghandler igb ptp crc32c_intel pps_core i2c_algo_bit i2c_core dca ipv6 crc_ccitt
    CPU: 18 PID: 177 Comm: kswapd1 Not tainted 4.9.34-t3.el7.twitter.x86_64 #1
    Hardware name: Dell Inc. PowerEdge C6220/0W6W6G, BIOS 2.2.3 11/07/2013
    Call Trace:
    dump_stack+0x4d/0x66
    __warn+0xcb/0xf0
    warn_slowpath_fmt+0x4f/0x60
    __list_del_entry+0x9e/0xc0
    shmem_unused_huge_shrink+0xfa/0x2e0
    shmem_unused_huge_scan+0x20/0x30
    super_cache_scan+0x193/0x1a0
    shrink_slab.part.41+0x1e3/0x3f0
    shrink_slab+0x29/0x30
    shrink_node+0xf9/0x2f0
    kswapd+0x2d8/0x6c0
    kthread+0xd7/0xf0
    ret_from_fork+0x22/0x30

    WARNING: CPU: 23 PID: 639 at lib/list_debug.c:33 __list_add+0x89/0xb0
    list_add corruption. prev->next should be next (ffff9ae5699ba960), but was ffff9ae5694b82d8. (prev=ffff9ae5694b82d8).
    Modules linked in: intel_rapl sb_edac edac_core x86_pkg_temp_thermal coretemp iTCO_wdt iTCO_vendor_support crct10dif_pclmul crc32_pclmul ghash_clmulni_intel raid0 dcdbas shpchp wmi hed i2c_i801 ioatdma lpc_ich i2c_smbus acpi_cpufreq tcp_diag inet_diag sch_fq_codel ipmi_si ipmi_devintf ipmi_msghandler igb ptp crc32c_intel pps_core i2c_algo_bit i2c_core dca ipv6 crc_ccitt
    CPU: 23 PID: 639 Comm: systemd-udevd Tainted: G W 4.9.34-t3.el7.twitter.x86_64 #1
    Hardware name: Dell Inc. PowerEdge C6220/0W6W6G, BIOS 2.2.3 11/07/2013
    Call Trace:
    dump_stack+0x4d/0x66
    __warn+0xcb/0xf0
    warn_slowpath_fmt+0x4f/0x60
    __list_add+0x89/0xb0
    shmem_setattr+0x204/0x230
    notify_change+0x2ef/0x440
    do_truncate+0x5d/0x90
    path_openat+0x331/0x1190
    do_filp_open+0x7e/0xe0
    do_sys_open+0x123/0x200
    SyS_open+0x1e/0x20
    do_syscall_64+0x61/0x170
    entry_SYSCALL64_slow_path+0x25/0x25

    The problem is that shmem_unused_huge_shrink() moves entries from the
    global sbinfo->shrinklist to its local lists and then releases the
    spinlock. However, a parallel shmem_setattr() could access one of these
    entries directly and add it back to the global shrinklist if it is
    removed, with the spinlock held.

    The logic itself looks solid since an entry could be either in a local
    list or the global list, otherwise it is removed from one of them by
    list_del_init(). So probably the race condition is that one CPU is in
    the middle of INIT_LIST_HEAD() while the other CPU calls list_empty(),
    which returns true too early, and then the following list_add_tail() sees
    a corrupted entry.

    list_empty_careful() is designed to fix this situation.
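
    A hedged sketch of where the careful check sits (shmem_setattr() path,
    simplified):

    spin_lock(&sbinfo->shrinklist_lock);
    /*
     * list_empty_careful() also checks ->prev, so an entry that another
     * CPU is still in the middle of list_del_init()'ing is not reported
     * as empty too early.
     */
    if (list_empty_careful(&info->shrinklist)) {
        list_add_tail(&info->shrinklist, &sbinfo->shrinklist);
        sbinfo->shrinklist_len++;
    }
    spin_unlock(&sbinfo->shrinklist_lock);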

    [akpm@linux-foundation.org: add comments]
    Link: http://lkml.kernel.org/r/20170803054630.18775-1-xiyou.wangcong@gmail.com
    Fixes: 779750d20b93 ("shmem: split huge pages beyond i_size under memory pressure")
    Signed-off-by: Cong Wang
    Acked-by: Linus Torvalds
    Acked-by: Kirill A. Shutemov
    Cc: Hugh Dickins
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cong Wang
     

11 Jul, 2017

1 commit

    PR_SET_THP_DISABLE has a rather subtle semantic. It doesn't affect any
    existing mapping because it only updates mm->def_flags, which is a
    template for new mappings.

    The mappings created after prctl(PR_SET_THP_DISABLE) have VM_NOHUGEPAGE
    flag set. This can be quite surprising for all those applications which
    do not do prctl(); fork() & exec() and want to control their own THP
    behavior.

    Another usecase when the immediate semantic of the prctl might be useful
    is a combination of pre- and post-copy migration of containers with
    CRIU. In this case CRIU populates a part of a memory region with data
    that was saved during the pre-copy stage. Afterwards, the region is
    registered with userfaultfd and CRIU expects to get page faults for the
    parts of the region that were not yet populated. However, khugepaged
    collapses the pages and the expected page faults do not occur.

    In the more general case, the prctl(PR_SET_THP_DISABLE) could be used as
    a temporary mechanism for enabling/disabling THP process wide.

    Implementation-wise, a new MMF_DISABLE_THP flag is added. This flag is
    tested when the decision whether to use huge pages is taken, either during
    a page fault or at the time of THP collapse.

    It should be noted that the new implementation makes PR_SET_THP_DISABLE a
    master override of any per-VMA setting, which was not the case previously.
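
    A small user-space illustration of the process-wide toggle (requires a
    kernel with this change for the "affects existing mappings" semantic):

    #include <stdio.h>
    #include <sys/prctl.h>

    int main(void)
    {
        /* Disable THP for this process, existing and future mappings alike. */
        if (prctl(PR_SET_THP_DISABLE, 1, 0, 0, 0))
            perror("prctl(PR_SET_THP_DISABLE)");

        /* Returns 1 while THP is disabled for the process. */
        printf("THP disabled: %d\n", prctl(PR_GET_THP_DISABLE, 0, 0, 0, 0));
        return 0;
    }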

    Fixes: a0715cc22601 ("mm, thp: add VM_INIT_DEF_MASK and PRCTL_THP_DISABLE")
    Link: http://lkml.kernel.org/r/1496415802-30944-1-git-send-email-rppt@linux.vnet.ibm.com
    Signed-off-by: Michal Hocko
    Signed-off-by: Mike Rapoport
    Cc: Vlastimil Babka
    Cc: Andrea Arcangeli
    Cc: Arnd Bergmann
    Cc: "Kirill A. Shutemov"
    Cc: Pavel Emelyanov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

07 Jul, 2017

3 commits

  • Track the following reclaim counters for every memory cgroup: PGREFILL,
    PGSCAN, PGSTEAL, PGACTIVATE, PGDEACTIVATE, PGLAZYFREE and PGLAZYFREED.

    These values are exposed using the memory.stat interface of cgroup v2.

    The meaning of each value is the same as for global counters, available
    using /proc/vmstat.

    Also, for consistency, rename mem_cgroup_count_vm_event() to
    count_memcg_event_mm().

    Link: http://lkml.kernel.org/r/1494530183-30808-1-git-send-email-guro@fb.com
    Signed-off-by: Roman Gushchin
    Suggested-by: Johannes Weiner
    Acked-by: Michal Hocko
    Acked-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Cc: Tejun Heo
    Cc: Li Zefan
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
    Now, get_swap_page takes a struct page and allocates swap space according
    to the page size (i.e. normal or THP), so it would be cleaner to introduce
    put_swap_page as the counterpart of get_swap_page. It then calls the right
    swap slot free function depending on the page's size.

    [ying.huang@intel.com: minor cleanup and fix]
    Link: http://lkml.kernel.org/r/20170515112522.32457-3-ying.huang@intel.com
    Signed-off-by: Minchan Kim
    Signed-off-by: "Huang, Ying"
    Acked-by: Johannes Weiner
    Cc: Andrea Arcangeli
    Cc: Ebru Akagunduz
    Cc: Hugh Dickins
    Cc: Kirill A. Shutemov
    Cc: Michal Hocko
    Cc: Rik van Riel
    Cc: Shaohua Li
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • Patch series "THP swap: Delay splitting THP during swapping out", v11.

    This patchset is to optimize the performance of Transparent Huge Page
    (THP) swap.

    Recently, the performance of storage devices has improved so fast that we
    cannot saturate the disk bandwidth with a single logical CPU when doing
    page swap-out, even on a high-end server machine, because the performance
    of the storage device improved faster than that of a single logical CPU.
    And it seems that the trend will not change in the near
    future. On the other hand, the THP becomes more and more popular
    because of increased memory size. So it becomes necessary to optimize
    THP swap performance.

    The advantages of the THP swap support include:

    - Batch the swap operations for the THP to reduce lock
    acquiring/releasing, including allocating/freeing the swap space,
    adding/deleting to/from the swap cache, and writing/reading the swap
    space, etc. This will help improve the performance of the THP swap.

    - The THP swap space read/write will be 2M sequential IO. It is
    particularly helpful for the swap read, which is usually 4k random
    IO. This will improve the performance of the THP swap too.

    - It will help with memory fragmentation, especially when the THP is
    heavily used by the applications. The 2M continuous pages will be
    freed up after THP swapping out.

    - It will improve the THP utilization on the system with the swap
    turned on. Because the speed for khugepaged to collapse the normal
    pages into the THP is quite slow. After the THP is split during the
    swapping out, it will take quite long time for the normal pages to
    collapse back into the THP after being swapped in. The high THP
    utilization helps the efficiency of the page based memory management
    too.

    There are some concerns regarding THP swap in, mainly because possible
    enlarged read/write IO size (for swap in/out) may put more overhead on
    the storage device. To deal with that, the THP swap in should be turned
    on only when necessary. For example, it can be selected via
    "always/never/madvise" logic, to be turned on globally, turned off
    globally, or turned on only for VMA with MADV_HUGEPAGE, etc.

    This patchset is the first step for the THP swap support. The plan is
    to delay splitting THP step by step, finally avoid splitting THP during
    the THP swapping out and swap out/in the THP as a whole.

    As the first step, in this patchset, the splitting huge page is delayed
    from almost the first step of swapping out to after allocating the swap
    space for the THP and adding the THP into the swap cache. This will
    reduce lock acquiring/releasing for the locks used for the swap cache
    management.

    With the patchset, the swap out throughput improves 15.5% (from about
    3.73GB/s to about 4.31GB/s) in the vm-scalability swap-w-seq test case
    with 8 processes. The test is done on a Xeon E5 v3 system. The swap
    device used is a RAM simulated PMEM (persistent memory) device. To test
    the sequential swapping out, the test case creates 8 processes, which
    sequentially allocate and write to the anonymous pages until the RAM and
    part of the swap device is used up.

    This patch (of 5):

    In this patch, splitting huge page is delayed from almost the first step
    of swapping out to after allocating the swap space for the THP
    (Transparent Huge Page) and adding the THP into the swap cache. This
    will batch the corresponding operation, thus improve THP swap out
    throughput.

    This is the first step for the THP swap optimization. The plan is to
    delay splitting the THP step by step and avoid splitting the THP
    finally.

    In this patch, one swap cluster is used to hold the contents of each THP
    swapped out. So, the size of the swap cluster is changed to that of the
    THP (Transparent Huge Page) on x86_64 architecture (512). For other
    architectures which want such THP swap optimization,
    ARCH_USES_THP_SWAP_CLUSTER needs to be selected in the Kconfig file for
    the architecture. In effect, this will enlarge swap cluster size by 2
    times on x86_64. Which may make it harder to find a free cluster when
    the swap space becomes fragmented. So that, this may reduce the
    continuous swap space allocation and sequential write in theory. The
    performance test in 0day shows no regressions caused by this.

    In the future of THP swap optimization, some information of the swapped
    out THP (such as compound map count) will be recorded in the
    swap_cluster_info data structure.

    The mem cgroup swap accounting functions are enhanced to support charge
    or uncharge a swap cluster backing a THP as a whole.

    The swap cluster allocate/free functions are added to allocate/free a
    swap cluster for a THP. A fairly simple algorithm is used for swap
    cluster allocation: only the first swap device in the priority list is
    tried for allocating the swap cluster. The function will fail if the
    attempt is not successful, and the caller will fall back to allocating a
    single swap slot instead. This works well enough for normal cases. If
    the difference of the number of the free swap clusters among multiple
    swap devices is significant, it is possible that some THPs are split
    earlier than necessary. For example, this could be caused by big size
    difference among multiple swap devices.

    The swap cache functions are enhanced to support adding/deleting a THP
    to/from the swap cache as a set of (HPAGE_PMD_NR) sub-pages. This may be
    enhanced in the future with multi-order radix tree. But because we will
    split the THP soon during swapping out, that optimization doesn't make
    much sense for this first step.

    The THP splitting functions are enhanced to support splitting a THP in
    the swap cache during swap-out. The page lock will be held while
    allocating the swap cluster, adding the THP into the swap cache and
    splitting the THP. So in code paths other than swap-out, if the THP needs
    to be split, PageSwapCache(THP) will always be false.

    The swap cluster is only available for SSD, so the THP swap optimization
    in this patchset has no effect for HDD.

    [ying.huang@intel.com: fix two issues in THP optimize patch]
    Link: http://lkml.kernel.org/r/87k25ed8zo.fsf@yhuang-dev.intel.com
    [hannes@cmpxchg.org: extensive cleanups and simplifications, reduce code size]
    Link: http://lkml.kernel.org/r/20170515112522.32457-2-ying.huang@intel.com
    Signed-off-by: "Huang, Ying"
    Signed-off-by: Johannes Weiner
    Suggested-by: Andrew Morton [for config option]
    Acked-by: Kirill A. Shutemov [for changes in huge_memory.c and huge_mm.h]
    Cc: Andrea Arcangeli
    Cc: Ebru Akagunduz
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Tejun Heo
    Cc: Hugh Dickins
    Cc: Shaohua Li
    Cc: Minchan Kim
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Ying
     

04 Jul, 2017

1 commit

  • Pull scheduler updates from Ingo Molnar:
    "The main changes in this cycle were:

    - Add the SYSTEM_SCHEDULING bootup state to move various scheduler
    debug checks earlier into the bootup. This turns silent and
    sporadically deadly bugs into nice, deterministic splats. Fix some
    of the splats that triggered. (Thomas Gleixner)

    - A round of restructuring and refactoring of the load-balancing and
    topology code (Peter Zijlstra)

    - Another round of consolidating ~20 of incremental scheduler code
    history: this time in terms of wait-queue nomenclature. (I didn't
    get much feedback on these renaming patches, and we can still
    easily change any names I might have misplaced, so if anyone hates
    a new name, please holler and I'll fix it.) (Ingo Molnar)

    - sched/numa improvements, fixes and updates (Rik van Riel)

    - Another round of x86/tsc scheduler clock code improvements, in hope
    of making it more robust (Peter Zijlstra)

    - Improve NOHZ behavior (Frederic Weisbecker)

    - Deadline scheduler improvements and fixes (Luca Abeni, Daniel
    Bristot de Oliveira)

    - Simplify and optimize the topology setup code (Lauro Ramos
    Venancio)

    - Debloat and decouple scheduler code some more (Nicolas Pitre)

    - Simplify code by making better use of llist primitives (Byungchul
    Park)

    - ... plus other fixes and improvements"

    * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (103 commits)
    sched/cputime: Refactor the cputime_adjust() code
    sched/debug: Expose the number of RT/DL tasks that can migrate
    sched/numa: Hide numa_wake_affine() from UP build
    sched/fair: Remove effective_load()
    sched/numa: Implement NUMA node level wake_affine()
    sched/fair: Simplify wake_affine() for the single socket case
    sched/numa: Override part of migrate_degrades_locality() when idle balancing
    sched/rt: Move RT related code from sched/core.c to sched/rt.c
    sched/deadline: Move DL related code from sched/core.c to sched/deadline.c
    sched/cpuset: Only offer CONFIG_CPUSETS if SMP is enabled
    sched/fair: Spare idle load balancing on nohz_full CPUs
    nohz: Move idle balancer registration to the idle path
    sched/loadavg: Generalize "_idle" naming to "_nohz"
    sched/core: Drop the unused try_get_task_struct() helper function
    sched/fair: WARN() and refuse to set buddy when !se->on_rq
    sched/debug: Fix SCHED_WARN_ON() to return a value on !CONFIG_SCHED_DEBUG as well
    sched/wait: Disambiguate wq_entry->task_list and wq_head->task_list naming
    sched/wait: Move bit_wait_table[] and related functionality from sched/core.c to sched/wait_bit.c
    sched/wait: Split out the wait_bit*() APIs from <linux/wait.h> into <linux/wait_bit.h>
    sched/wait: Re-adjust macro line continuation backslashes in <linux/wait.h>
    ...

    Linus Torvalds
     

20 Jun, 2017

2 commits

  • So I've noticed a number of instances where it was not obvious from the
    code whether ->task_list was for a wait-queue head or a wait-queue entry.

    Furthermore, there's a number of wait-queue users where the lists are
    not for 'tasks' but other entities (poll tables, etc.), in which case
    the 'task_list' name is actively confusing.

    To clear this all up, name the wait-queue head and entry list structure
    fields unambiguously:

    struct wait_queue_head::task_list => ::head
    struct wait_queue_entry::task_list => ::entry

    For example, this code:

    rqw->wait.task_list.next != &wait->task_list

    ... was pretty unclear (to me) what it's doing, while now it's written this way:

    rqw->wait.head.next != &wait->entry

    ... which makes it pretty clear that we are iterating a list until we see the head.

    Other examples are:

    list_for_each_entry_safe(pos, next, &x->task_list, task_list) {
    list_for_each_entry(wq, &fence->wait.task_list, task_list) {

    ... where it's unclear (to me) what we are iterating, and during review it's
    hard to tell whether it's trying to walk a wait-queue entry (which would be
    a bug), while now it's written as:

    list_for_each_entry_safe(pos, next, &x->head, entry) {
    list_for_each_entry(wq, &fence->wait.head, entry) {

    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • Rename:

    wait_queue_t => wait_queue_entry_t

    'wait_queue_t' was always a slight misnomer: its name implies that it's a "queue",
    but in reality it's a queue *entry*. The 'real' queue is the wait queue head,
    which had to carry the name.

    Start sorting this out by renaming it to 'wait_queue_entry_t'.

    This also allows the real structure name 'struct __wait_queue' to
    lose its double underscore and become 'struct wait_queue_entry',
    which is the more canonical nomenclature for such data types.

    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Ingo Molnar
     

04 Mar, 2017

1 commit

  • Pull vfs 'statx()' update from Al Viro.

    This adds the new extended stat() interface that internally subsumes our
    previous stat interfaces, and allows user mode to specify in more detail
    what kind of information it wants.

    It also allows for some explicit synchronization information to be
    passed to the filesystem, which can be relevant for network filesystems:
    is the cached value ok, or do you need open/close consistency, or what?

    From David Howells.

    Andreas Dilger points out that the first version of the extended statx
    interface was posted June 29, 2010:

    https://www.spinics.net/lists/linux-fsdevel/msg33831.html

    * 'rebased-statx' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    statx: Add a system call to make enhanced file info available

    Linus Torvalds
     

03 Mar, 2017

1 commit

  • Add a system call to make extended file information available, including
    file creation and some attribute flags where available through the
    underlying filesystem.

    The getattr inode operation is altered to take two additional arguments: a
    u32 request_mask and an unsigned int flags that indicate the
    synchronisation mode. This change is propagated to the vfs_getattr*()
    function.

    Functions like vfs_stat() are now inline wrappers around new functions
    vfs_statx() and vfs_statx_fd() to reduce stack usage.

    ========
    OVERVIEW
    ========

    The idea was initially proposed as a set of xattrs that could be retrieved
    with getxattr(), but the general preference proved to be for a new syscall
    with an extended stat structure.

    A number of requests were gathered for features to be included. The
    following have been included:

    (1) Make the fields a consistent size on all arches and make them large.

    (2) Spare space, request flags and information flags are provided for
    future expansion.

    (3) Better support for the y2038 problem [Arnd Bergmann] (tv_sec is an
    __s64).

    (4) Creation time: The SMB protocol carries the creation time, which could
    be exported by Samba, which will in turn help CIFS make use of
    FS-Cache as that can be used for coherency data (stx_btime).

    This is also specified in NFSv4 as a recommended attribute and could
    be exported by NFSD [Steve French].

    (5) Lightweight stat: Ask for just those details of interest, and allow a
    netfs (such as NFS) to approximate anything not of interest, possibly
    without going to the server [Trond Myklebust, Ulrich Drepper, Andreas
    Dilger] (AT_STATX_DONT_SYNC).

    (6) Heavyweight stat: Force a netfs to go to the server, even if it thinks
    its cached attributes are up to date [Trond Myklebust]
    (AT_STATX_FORCE_SYNC).

    And the following have been left out for future extension:

    (7) Data version number: Could be used by userspace NFS servers [Aneesh
    Kumar].

    Can also be used to modify fill_post_wcc() in NFSD which retrieves
    i_version directly, but has just called vfs_getattr(). It could get
    it from the kstat struct if it used vfs_xgetattr() instead.

    (There's disagreement on the exact semantics of a single field, since
    not all filesystems do this the same way).

    (8) BSD stat compatibility: Including more fields from the BSD stat such
    as creation time (st_btime) and inode generation number (st_gen)
    [Jeremy Allison, Bernd Schubert].

    (9) Inode generation number: Useful for FUSE and userspace NFS servers
    [Bernd Schubert].

    (This was asked for but later deemed unnecessary with the
    open-by-handle capability available and caused disagreement as to
    whether it's a security hole or not).

    (10) Extra coherency data may be useful in making backups [Andreas Dilger].

    (No particular data were offered, but things like last backup
    timestamp, the data version number and the DOS archive bit would come
    into this category).

    (11) Allow the filesystem to indicate what it can/cannot provide: A
    filesystem can now say it doesn't support a standard stat feature if
    that isn't available, so if, for instance, inode numbers or UIDs don't
    exist or are fabricated locally...

    (This requires a separate system call - I have an fsinfo() call idea
    for this).

    (12) Store a 16-byte volume ID in the superblock that can be returned in
    struct xstat [Steve French].

    (Deferred to fsinfo).

    (13) Include granularity fields in the time data to indicate the
    granularity of each of the times (NFSv4 time_delta) [Steve French].

    (Deferred to fsinfo).

    (14) FS_IOC_GETFLAGS value. These could be translated to BSD's st_flags.
    Note that the Linux IOC flags are a mess and filesystems such as Ext4
    define flags that aren't in linux/fs.h, so translation in the kernel
    may be a necessity (or, possibly, we provide the filesystem type too).

    (Some attributes are made available in stx_attributes, but the general
    feeling was that the IOC flags were too ext[234]-specific and shouldn't
    be exposed through statx this way).

    (15) Mask of features available on file (eg: ACLs, seclabel) [Brad Boyer,
    Michael Kerrisk].

    (Deferred, probably to fsinfo. Finding out if there's an ACL or
    seclabal might require extra filesystem operations).

    (16) Femtosecond-resolution timestamps [Dave Chinner].

    (A __reserved field has been left in the statx_timestamp struct for
    this - if there proves to be a need).

    (17) A set multiple attributes syscall to go with this.

    ===============
    NEW SYSTEM CALL
    ===============

    The new system call is:

    int ret = statx(int dfd,
    const char *filename,
    unsigned int flags,
    unsigned int mask,
    struct statx *buffer);

    The dfd, filename and flags parameters indicate the file to query, in a
    similar way to fstatat(). There is no equivalent of lstat() as that can be
    emulated with statx() by passing AT_SYMLINK_NOFOLLOW in flags. There is
    also no equivalent of fstat() as that can be emulated by passing a NULL
    filename to statx() with the fd of interest in dfd.

    Whether or not statx() synchronises the attributes with the backing store
    can be controlled by OR'ing a value into the flags argument (this typically
    only affects network filesystems):

    (1) AT_STATX_SYNC_AS_STAT tells statx() to behave as stat() does in this
    respect.

    (2) AT_STATX_FORCE_SYNC will require a network filesystem to synchronise
    its attributes with the server - which might require data writeback to
    occur to get the timestamps correct.

    (3) AT_STATX_DONT_SYNC will suppress synchronisation with the server in a
    network filesystem. The resulting values should be considered
    approximate.

    mask is a bitmask indicating the fields in struct statx that are of
    interest to the caller. The user should set this to STATX_BASIC_STATS to
    get the basic set returned by stat(). It should be noted that asking for
    more information may entail extra I/O operations.

    buffer points to the destination for the data. This must be 256 bytes in
    size.
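
    A minimal usage sketch (later glibc versions provide a statx() wrapper in
    <sys/stat.h>; older systems would need the raw syscall):

    #define _GNU_SOURCE
    #include <fcntl.h>          /* AT_FDCWD, AT_SYMLINK_NOFOLLOW */
    #include <stdio.h>
    #include <sys/stat.h>       /* statx(), struct statx, STATX_* */

    int main(int argc, char **argv)
    {
        struct statx stx;

        if (statx(AT_FDCWD, argc > 1 ? argv[1] : ".",
                  AT_SYMLINK_NOFOLLOW, STATX_BASIC_STATS | STATX_BTIME, &stx))
            return 1;
        printf("size=%llu mask=%#x\n",
               (unsigned long long)stx.stx_size, stx.stx_mask);
        return 0;
    }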

    ======================
    MAIN ATTRIBUTES RECORD
    ======================

    The following structures are defined in which to return the main attribute
    set:

    struct statx_timestamp {
    __s64 tv_sec;
    __s32 tv_nsec;
    __s32 __reserved;
    };

    struct statx {
    __u32 stx_mask;
    __u32 stx_blksize;
    __u64 stx_attributes;
    __u32 stx_nlink;
    __u32 stx_uid;
    __u32 stx_gid;
    __u16 stx_mode;
    __u16 __spare0[1];
    __u64 stx_ino;
    __u64 stx_size;
    __u64 stx_blocks;
    __u64 __spare1[1];
    struct statx_timestamp stx_atime;
    struct statx_timestamp stx_btime;
    struct statx_timestamp stx_ctime;
    struct statx_timestamp stx_mtime;
    __u32 stx_rdev_major;
    __u32 stx_rdev_minor;
    __u32 stx_dev_major;
    __u32 stx_dev_minor;
    __u64 __spare2[14];
    };

    The defined bits in request_mask and stx_mask are:

    STATX_TYPE Want/got stx_mode & S_IFMT
    STATX_MODE Want/got stx_mode & ~S_IFMT
    STATX_NLINK Want/got stx_nlink
    STATX_UID Want/got stx_uid
    STATX_GID Want/got stx_gid
    STATX_ATIME Want/got stx_atime{,_ns}
    STATX_MTIME Want/got stx_mtime{,_ns}
    STATX_CTIME Want/got stx_ctime{,_ns}
    STATX_INO Want/got stx_ino
    STATX_SIZE Want/got stx_size
    STATX_BLOCKS Want/got stx_blocks
    STATX_BASIC_STATS [The stuff in the normal stat struct]
    STATX_BTIME Want/got stx_btime{,_ns}
    STATX_ALL [All currently available stuff]

    stx_btime is the file creation time, stx_mask is a bitmask indicating the
    data provided and __spares*[] are where as-yet undefined fields can be
    placed.

    Time fields are structures with separate seconds and nanoseconds fields
    plus a reserved field in case we want to add even finer resolution. Note
    that times will be negative if before 1970; in such a case, the nanosecond
    fields will also be negative if not zero.

    The bits defined in the stx_attributes field convey information about a
    file, how it is accessed, where it is and what it does. The following
    attributes map to FS_*_FL flags and are the same numerical value:

    STATX_ATTR_COMPRESSED File is compressed by the fs
    STATX_ATTR_IMMUTABLE File is marked immutable
    STATX_ATTR_APPEND File is append-only
    STATX_ATTR_NODUMP File is not to be dumped
    STATX_ATTR_ENCRYPTED File requires key to decrypt in fs

    Within the kernel, the supported flags are listed by:

    KSTAT_ATTR_FS_IOC_FLAGS

    [Are any other IOC flags of sufficient general interest to be exposed
    through this interface?]

    New flags include:

    STATX_ATTR_AUTOMOUNT Object is an automount trigger

    These are for the use of GUI tools that might want to mark files specially,
    depending on what they are.

    Fields in struct statx come in a number of classes:

    (0) stx_dev_*, stx_blksize.

    These are local system information and are always available.

    (1) stx_mode, stx_nlinks, stx_uid, stx_gid, stx_[amc]time, stx_ino,
    stx_size, stx_blocks.

    These will be returned whether the caller asks for them or not. The
    corresponding bits in stx_mask will be set to indicate whether they
    actually have valid values.

    If the caller didn't ask for them, then they may be approximated. For
    example, NFS won't waste any time updating them from the server,
    unless as a byproduct of updating something requested.

    If the values don't actually exist for the underlying object (such as
    UID or GID on a DOS file), then the bit won't be set in the stx_mask,
    even if the caller asked for the value. In such a case, the returned
    value will be a fabrication.

    Note that there are instances where the type might not be valid, for
    instance Windows reparse points.

    (2) stx_rdev_*.

    This will be set only if stx_mode indicates we're looking at a
    blockdev or a chardev, otherwise will be 0.

    (3) stx_btime.

    Similar to (1), except this will be set to 0 if it doesn't exist.

    =======
    TESTING
    =======

    The following test program can be used to test the statx system call:

    samples/statx/test-statx.c

    Just compile and run, passing it paths to the files you want to examine.
    The file is built automatically if CONFIG_SAMPLES is enabled.

    Here's some example output. Firstly, an NFS directory that crosses to
    another FSID. Note that the AUTOMOUNT attribute is set because transiting
    this directory will cause d_automount to be invoked by the VFS.

    [root@andromeda ~]# /tmp/test-statx -A /warthog/data
    statx(/warthog/data) = 0
    results=7ff
    Size: 4096 Blocks: 8 IO Block: 1048576 directory
    Device: 00:26 Inode: 1703937 Links: 125
    Access: (3777/drwxrwxrwx) Uid: 0 Gid: 4041
    Access: 2016-11-24 09:02:12.219699527+0000
    Modify: 2016-11-17 10:44:36.225653653+0000
    Change: 2016-11-17 10:44:36.225653653+0000
    Attributes: 0000000000001000 (-------- -------- -------- -------- -------- -------- ---m---- --------)

    Secondly, the result of automounting on that directory.

    [root@andromeda ~]# /tmp/test-statx /warthog/data
    statx(/warthog/data) = 0
    results=7ff
    Size: 4096 Blocks: 8 IO Block: 1048576 directory
    Device: 00:27 Inode: 2 Links: 125
    Access: (3777/drwxrwxrwx) Uid: 0 Gid: 4041
    Access: 2016-11-24 09:02:12.219699527+0000
    Modify: 2016-11-17 10:44:36.225653653+0000
    Change: 2016-11-17 10:44:36.225653653+0000

    Signed-off-by: David Howells
    Signed-off-by: Al Viro

    David Howells
     

25 Feb, 2017

2 commits

  • Running my likely/unlikely profiler, I discovered that the unlikely()
    annotation on the info->seals test in shmem_write_begin() is always
    incorrect. This is because shmem_get_inode() sets F_SEAL_SEAL in
    info->seals by default, and that seal is unlikely to have been cleared
    by the time shmem_write_begin() is called. Thus, the if statement is
    very likely to be taken.

    But as the if statement block only cares about F_SEAL_WRITE and
    F_SEAL_GROW, change the test to check only those two bits, as sketched
    below.
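
    Sketch of the narrowed test (the idea, not necessarily the exact hunk):

    /* Before: effectively always true, since F_SEAL_SEAL is set by default */
    if (unlikely(info->seals)) { ... }

    /* After: annotate only the seals this write path actually cares about */
    if (unlikely(info->seals & (F_SEAL_GROW | F_SEAL_WRITE))) { ... }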

    Link: http://lkml.kernel.org/r/20170203105656.7aec6237@gandalf.local.home
    Signed-off-by: Steven Rostedt (VMware)
    Acked-by: Hugh Dickins
    Cc: David Herrmann
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Steven Rostedt (VMware)
     
  • ->fault(), ->page_mkwrite(), and ->pfn_mkwrite() calls do not need to
    take a vma and vmf parameter when the vma already resides in vmf.

    Remove the vma parameter to simplify things.
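
    In terms of the vm_operations_struct callbacks, the shape of the change
    is roughly the following (a sketch of the signatures, not the full
    patch):

    /* before */
    int (*fault)(struct vm_area_struct *vma, struct vm_fault *vmf);

    /* after: the vma is obtained as vmf->vma inside the handler */
    int (*fault)(struct vm_fault *vmf);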

    [arnd@arndb.de: fix ARM build]
    Link: http://lkml.kernel.org/r/20170125223558.1451224-1-arnd@arndb.de
    Link: http://lkml.kernel.org/r/148521301778.19116.10840599906674778980.stgit@djiang5-desk3.ch.intel.com
    Signed-off-by: Dave Jiang
    Signed-off-by: Arnd Bergmann
    Reviewed-by: Ross Zwisler
    Cc: Theodore Ts'o
    Cc: Darrick J. Wong
    Cc: Matthew Wilcox
    Cc: Dave Hansen
    Cc: Christoph Hellwig
    Cc: Jan Kara
    Cc: Dan Williams
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Jiang
     

23 Feb, 2017

8 commits

  • If the atomic copy_user fails because of a real dangling userland
    pointer, we won't go back into the shmem method, so when the method
    returns it must not leave anything charged up, except the page itself.

    Link: http://lkml.kernel.org/r/20161216144821.5183-37-aarcange@redhat.com
    Signed-off-by: Andrea Arcangeli
    Cc: "Dr. David Alan Gilbert"
    Cc: Hillf Danton
    Cc: Michael Rapoport
    Cc: Mike Kravetz
    Cc: Mike Rapoport
    Cc: Pavel Emelyanov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • Use the non-atomic version of __SetPageUptodate while the page is still
    private and not visible to lookup operations. Using the non-atomic
    version after the page is already visible to lookups is unsafe, as a
    concurrent lock_page() operation could be modifying page->flags while it
    runs (the rule is sketched after the trace below).

    This solves a lockup in find_lock_entry with the userfaultfd_shmem
    selftest.

    userfaultfd_shm D14296 691 1 0x00000004
    Call Trace:
    schedule+0x3d/0x90
    schedule_timeout+0x228/0x420
    io_schedule_timeout+0xa4/0x110
    __lock_page+0x12d/0x170
    find_lock_entry+0xa4/0x190
    shmem_getpage_gfp+0xb9/0xc30
    shmem_fault+0x70/0x1c0
    __do_fault+0x21/0x150
    handle_mm_fault+0xec9/0x1490
    __do_page_fault+0x20d/0x520
    trace_do_page_fault+0x61/0x270
    do_async_page_fault+0x19/0x80
    async_page_fault+0x25/0x30
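
    The rule being applied, sketched (not the exact shmem hunk):

    /* Page still private to this thread: the non-atomic setter is fine. */
    __SetPageUptodate(page);

    /* ... the page is then added to the page cache and becomes visible ... */

    /* From here on, lock_page() and friends update page->flags concurrently,
     * so only the atomic SetPageUptodate() would be safe. */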

    Link: http://lkml.kernel.org/r/20170116180408.12184-2-aarcange@redhat.com
    Signed-off-by: Andrea Arcangeli
    Reported-by: Mike Rapoport
    Acked-by: Hillf Danton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • A VM_BUG_ON was triggered while running the shmem selftest.

    Link: http://lkml.kernel.org/r/20161216144821.5183-36-aarcange@redhat.com
    Signed-off-by: Andrea Arcangeli
    Cc: "Dr. David Alan Gilbert"
    Cc: Hillf Danton
    Cc: Michael Rapoport
    Cc: Mike Kravetz
    Cc: Mike Rapoport
    Cc: Pavel Emelyanov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • When processing a page fault for a not-present page in a shared memory
    area, check the VMA to determine whether faults are to be handled by
    userfaultfd. If so, delegate the page fault to handle_userfault.
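
    Schematically, in the shmem fault path (a sketch; the exact condition
    and how the VM_FAULT_* code is plumbed back to the caller differ in the
    real code):

    if (vma && userfaultfd_missing(vma)) {
            /* Hand the missing-page fault to the userfaultfd monitor. */
            *fault_type = handle_userfault(vmf, VM_UFFD_MISSING);
            return 0;
    }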

    Link: http://lkml.kernel.org/r/20161216144821.5183-33-aarcange@redhat.com
    Signed-off-by: Mike Rapoport
    Signed-off-by: Andrea Arcangeli
    Cc: "Dr. David Alan Gilbert"
    Cc: Hillf Danton
    Cc: Michael Rapoport
    Cc: Mike Kravetz
    Cc: Pavel Emelyanov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     
  • It resolves this build error:

    All errors (new ones prefixed by >>):

    mm/shmem.c: In function 'shmem_mcopy_atomic_pte':
    >> mm/shmem.c:2228:2: error: implicit declaration of function 'update_mmu_cache' [-Werror=implicit-function-declaration]
    update_mmu_cache(dst_vma, dst_addr, dst_pte);

    microblaze may also have to be updated to define it in asm/pgtable.h
    like the other arches; then this header inclusion can be removed.

    Link: http://lkml.kernel.org/r/20161216144821.5183-31-aarcange@redhat.com
    Signed-off-by: Andrea Arcangeli
    Cc: "Dr. David Alan Gilbert"
    Cc: Hillf Danton
    Cc: Michael Rapoport
    Cc: Mike Kravetz
    Cc: Mike Rapoport
    Cc: Pavel Emelyanov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • Currently userfault relies on vma_is_anonymous and vma_is_hugetlb to
    ensure compatibility of a VMA with userfault. Introduction of
    vma_is_shmem allows detection of tmpfs-backed VMAs, so that they may be
    used with userfaultfd. The current implementation presumes that
    vma_is_shmem is used only by slow-path routines in userfaultfd, so
    vma_is_shmem is not made inline in order to preserve the few remaining
    free bits in vm_flags.
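
    The helper itself amounts to a pointer comparison against the shmem
    vm_operations (kept out of line for the reason above); roughly:

    bool vma_is_shmem(struct vm_area_struct *vma)
    {
            return vma->vm_ops == &shmem_vm_ops;
    }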

    Link: http://lkml.kernel.org/r/20161216144821.5183-30-aarcange@redhat.com
    Signed-off-by: Mike Rapoport
    Signed-off-by: Andrea Arcangeli
    Cc: "Dr. David Alan Gilbert"
    Cc: Hillf Danton
    Cc: Michael Rapoport
    Cc: Mike Kravetz
    Cc: Pavel Emelyanov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     
  • shmem_mcopy_atomic_pte is the low level routine that implements the
    userfaultfd UFFDIO_COPY command. It is based on the existing
    mcopy_atomic_pte routine with modifications for shared memory pages.
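
    Its declaration, as consumed by the userfaultfd UFFDIO_COPY path, is
    along these lines:

    int shmem_mcopy_atomic_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd,
                               struct vm_area_struct *dst_vma,
                               unsigned long dst_addr, unsigned long src_addr,
                               struct page **pagep);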

    Link: http://lkml.kernel.org/r/20161216144821.5183-29-aarcange@redhat.com
    Signed-off-by: Mike Rapoport
    Signed-off-by: Andrea Arcangeli
    Cc: "Dr. David Alan Gilbert"
    Cc: Hillf Danton
    Cc: Michael Rapoport
    Cc: Mike Kravetz
    Cc: Pavel Emelyanov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     
  • Callers of shmem_mapping() are interested in whether the mapping is swap
    backed - except for uprobes, which is interested in whether it should
    use shmem_read_mapping_page(). All these callers are better served by a
    shmem_mapping() which checks for shmem_aops, than the current version
    which goes through several indirections to find where the inode lives -
    and has the surprising effect that a private mmap of /dev/zero satisfies
    both vma_is_anonymous() and shmem_mapping(), when that device node is on
    devtmpfs. I don't think anything in the tree suffers from that
    surprise, but it caught me out, and is better fixed.
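
    With this change the check reduces to an address_space operations
    comparison, roughly:

    bool shmem_mapping(struct address_space *mapping)
    {
            return mapping->a_ops == &shmem_aops;
    }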

    Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1612052148530.13021@eggly.anvils
    Signed-off-by: Hugh Dickins
    Acked-by: Johannes Weiner
    Cc: Oleg Nesterov
    Cc: "Kirill A. Shutemov"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

04 Feb, 2017

1 commit

  • Syzkaller fuzzer managed to trigger this:

    BUG: sleeping function called from invalid context at mm/shmem.c:852
    in_atomic(): 1, irqs_disabled(): 0, pid: 529, name: khugepaged
    3 locks held by khugepaged/529:
    #0: (shrinker_rwsem){++++..}, at: [] shrink_slab.part.59+0x121/0xd30 mm/vmscan.c:451
    #1: (&type->s_umount_key#29){++++..}, at: [] trylock_super+0x20/0x100 fs/super.c:392
    #2: (&(&sbinfo->shrinklist_lock)->rlock){+.+.-.}, at: [] spin_lock include/linux/spinlock.h:302 [inline]
    #2: (&(&sbinfo->shrinklist_lock)->rlock){+.+.-.}, at: [] shmem_unused_huge_shrink+0x28e/0x1490 mm/shmem.c:427
    CPU: 2 PID: 529 Comm: khugepaged Not tainted 4.10.0-rc5+ #201
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
    Call Trace:
    shmem_undo_range+0xb20/0x2710 mm/shmem.c:852
    shmem_truncate_range+0x27/0xa0 mm/shmem.c:939
    shmem_evict_inode+0x35f/0xca0 mm/shmem.c:1030
    evict+0x46e/0x980 fs/inode.c:553
    iput_final fs/inode.c:1515 [inline]
    iput+0x589/0xb20 fs/inode.c:1542
    shmem_unused_huge_shrink+0xbad/0x1490 mm/shmem.c:446
    shmem_unused_huge_scan+0x10c/0x170 mm/shmem.c:512
    super_cache_scan+0x376/0x450 fs/super.c:106
    do_shrink_slab mm/vmscan.c:378 [inline]
    shrink_slab.part.59+0x543/0xd30 mm/vmscan.c:481
    shrink_slab mm/vmscan.c:2592 [inline]
    shrink_node+0x2c7/0x870 mm/vmscan.c:2592
    shrink_zones mm/vmscan.c:2734 [inline]
    do_try_to_free_pages+0x369/0xc80 mm/vmscan.c:2776
    try_to_free_pages+0x3c6/0x900 mm/vmscan.c:2982
    __perform_reclaim mm/page_alloc.c:3301 [inline]
    __alloc_pages_direct_reclaim mm/page_alloc.c:3322 [inline]
    __alloc_pages_slowpath+0xa24/0x1c30 mm/page_alloc.c:3683
    __alloc_pages_nodemask+0x544/0xae0 mm/page_alloc.c:3848
    __alloc_pages include/linux/gfp.h:426 [inline]
    __alloc_pages_node include/linux/gfp.h:439 [inline]
    khugepaged_alloc_page+0xc2/0x1b0 mm/khugepaged.c:750
    collapse_huge_page+0x182/0x1fe0 mm/khugepaged.c:955
    khugepaged_scan_pmd+0xfdf/0x12a0 mm/khugepaged.c:1208
    khugepaged_scan_mm_slot mm/khugepaged.c:1727 [inline]
    khugepaged_do_scan mm/khugepaged.c:1808 [inline]
    khugepaged+0xe9b/0x1590 mm/khugepaged.c:1853
    kthread+0x326/0x3f0 kernel/kthread.c:227
    ret_from_fork+0x31/0x40 arch/x86/entry/entry_64.S:430

    The iput() from atomic context was a bad idea: if somebody else calls
    iput() after our igrab() and we are left with the last inode reference,
    our iput() would lead to inode eviction and therefore sleeping.

    This patch should fix the situation.
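
    The general pattern for avoiding this (a generic sketch with hypothetical
    names, not the literal shmem hunk) is to keep the reference while the
    spinlock is held and drop it only after the lock is released, since the
    final iput() can trigger eviction and sleep:

    struct inode *to_put = NULL;

    spin_lock(&some_shrink_lock);          /* hypothetical lock */
    /* ... scan; decide we are finished with this inode ... */
    to_put = inode;                        /* defer the reference drop */
    spin_unlock(&some_shrink_lock);

    if (to_put)
            iput(to_put);                  /* may evict and therefore sleep */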

    Link: http://lkml.kernel.org/r/20170131093141.GA15899@node.shutemov.name
    Signed-off-by: Kirill A. Shutemov
    Reported-by: Dmitry Vyukov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

18 Dec, 2016

1 commit

  • …/linux/kernel/git/mszeredi/vfs

    Pull partial readlink cleanups from Miklos Szeredi.

    This is the uncontroversial part of the readlink cleanup patch-set that
    simplifies the default readlink handling.

    Miklos and Al are still discussing the rest of the series.

    * git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs:
    vfs: make generic_readlink() static
    vfs: remove ".readlink = generic_readlink" assignments
    vfs: default to generic_readlink()
    vfs: replace calling i_op->readlink with vfs_readlink()
    proc/self: use generic_readlink
    ecryptfs: use vfs_get_link()
    bad_inode: add missing i_op initializers

    Linus Torvalds