24 Mar, 2019

1 commit

  • [ Upstream commit 414fd080d125408cb15d04ff4907e1dd8145c8c7 ]

    For a dax pmd, pmd_trans_huge() returns false but pmd_huge() returns true
    on x86, so the function works only as long as hugetlb is configured.
    However, dax does not depend on hugetlb.

    Link: http://lkml.kernel.org/r/20190111034033.601-1-yuzhao@google.com
    Signed-off-by: Yu Zhao
    Reviewed-by: Jan Kara
    Cc: Dan Williams
    Cc: Huang Ying
    Cc: Matthew Wilcox
    Cc: Keith Busch
    Cc: "Michael S . Tsirkin"
    Cc: John Hubbard
    Cc: Wei Yang
    Cc: Mike Rapoport
    Cc: Andrea Arcangeli
    Cc: "Kirill A . Shutemov"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Yu Zhao
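    A minimal sketch of the distinction described in the entry above (illustrative
    only, not the upstream diff): a dax PMD is a devmap entry, so recognizing it
    explicitly avoids relying on pmd_huge(), which compiles to 0 when hugetlb is
    not configured. The follow_huge_or_devmap_pmd() helper is hypothetical.

        pmd_t pmdval = READ_ONCE(*pmd);         /* snapshot the entry once */

        if (pmd_none(pmdval))
                return no_page_table(vma, flags);
        if (pmd_devmap(pmdval) || pmd_trans_huge(pmdval))
                /* dax or THP mapping: handled without consulting hugetlb */
                return follow_huge_or_devmap_pmd(vma, address, pmd, flags);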
     

24 Aug, 2018

1 commit

  • Use new return type vm_fault_t for fault handler. For now, this is just
    documenting that the function returns a VM_FAULT value rather than an
    errno. Once all instances are converted, vm_fault_t will become a
    distinct type.

    Ref-> commit 1c8f422059ae ("mm: change return type to vm_fault_t")

    The aim is to change the return type of finish_fault() and
    handle_mm_fault() to vm_fault_t. As part of that cleanup, the return
    types of all other recursively called functions have been changed to
    vm_fault_t as well.

    The places from which handle_mm_fault() is invoked will be changed to
    vm_fault_t in a separate patch.

    vmf_error() is the inline function newly introduced in 4.17-rc6.

    [akpm@linux-foundation.org: don't shadow outer local `ret' in __do_huge_pmd_anonymous_page()]
    Link: http://lkml.kernel.org/r/20180604171727.GA20279@jordon-HP-15-Notebook-PC
    Signed-off-by: Souptick Joarder
    Reviewed-by: Matthew Wilcox
    Reviewed-by: Andrew Morton
    Cc: Matthew Wilcox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Souptick Joarder
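    As a hedged illustration of the conversion described in the entry above (the
    handler and lookup helper are hypothetical; only the return-type pattern is
    the point): a fault handler returns vm_fault_t and uses vmf_error() to
    translate an -errno into a VM_FAULT_* code.

        #include <linux/err.h>
        #include <linux/mm.h>

        /* Hypothetical helper returning a page or an ERR_PTR() encoded errno. */
        static struct page *example_lookup_page(struct vm_fault *vmf);

        static vm_fault_t example_vm_fault(struct vm_fault *vmf)
        {
                struct page *page = example_lookup_page(vmf);

                if (IS_ERR(page))
                        return vmf_error(PTR_ERR(page)); /* -errno -> VM_FAULT_* */

                vmf->page = page;
                return 0;       /* success: a VM_FAULT_* value, not an errno */
        }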
     

15 Jul, 2018

1 commit

  • syzbot has noticed that a specially crafted library can easily hit
    VM_BUG_ON in __mm_populate

    kernel BUG at mm/gup.c:1242!
    invalid opcode: 0000 [#1] SMP
    CPU: 2 PID: 9667 Comm: a.out Not tainted 4.18.0-rc3 #644
    Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 05/19/2017
    RIP: 0010:__mm_populate+0x1e2/0x1f0
    Code: 55 d0 65 48 33 14 25 28 00 00 00 89 d8 75 21 48 83 c4 20 5b 41 5c 41 5d 41 5e 41 5f 5d c3 e8 75 18 f1 ff 0f 0b e8 6e 18 f1 ff 0b 31 db eb c9 e8 93 06 e0 ff 0f 1f 00 55 48 89 e5 53 48 89 fb
    Call Trace:
    vm_brk_flags+0xc3/0x100
    vm_brk+0x1f/0x30
    load_elf_library+0x281/0x2e0
    __ia32_sys_uselib+0x170/0x1e0
    do_fast_syscall_32+0xca/0x420
    entry_SYSENTER_compat+0x70/0x7f

    The reason is that the length of the new brk is not page aligned when we
    try to populate it. There is no reason to bug on that though.
    do_brk_flags already aligns the length properly, so the mapping is
    expanded as it should be. All we need is to tell mm_populate about it.
    Besides that, there is absolutely no reason to BUG_ON in the first
    place. The worst thing that could happen is that the last page wouldn't
    get populated, and that is far from putting the system into an
    inconsistent state.

    Fix the issue by moving the length sanitization code from do_brk_flags
    up to vm_brk_flags. The only other caller of do_brk_flags is the brk
    syscall entry, and it makes sure to provide the proper length, so there
    is no need for sanitization and we can use do_brk_flags without it.

    Also remove the bogus BUG_ONs.

    [osalvador@techadventures.net: fix up vm_brk_flags s@request@len@]
    Link: http://lkml.kernel.org/r/20180706090217.GI32658@dhcp22.suse.cz
    Signed-off-by: Michal Hocko
    Reported-by: syzbot
    Tested-by: Tetsuo Handa
    Reviewed-by: Oscar Salvador
    Cc: Zi Yan
    Cc: "Aneesh Kumar K.V"
    Cc: Dan Williams
    Cc: "Kirill A. Shutemov"
    Cc: Michael S. Tsirkin
    Cc: Al Viro
    Cc: "Huang, Ying"
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
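    A simplified sketch of the shape of the fix described above -- sanitize the
    length once in vm_brk_flags() so that do_brk_flags() and mm_populate() both
    see the same page-aligned size (error handling trimmed; not the literal
    patch):

        int vm_brk_flags(unsigned long addr, unsigned long request, unsigned long flags)
        {
                struct mm_struct *mm = current->mm;
                unsigned long len = PAGE_ALIGN(request);        /* sanitize up front */
                bool populate;
                LIST_HEAD(uf);
                int ret;

                if (len < request)              /* PAGE_ALIGN() wrapped around */
                        return -ENOMEM;
                if (!len)
                        return 0;

                if (down_write_killable(&mm->mmap_sem))
                        return -EINTR;

                ret = do_brk_flags(addr, len, flags, &uf);
                populate = (mm->def_flags & VM_LOCKED) != 0;
                up_write(&mm->mmap_sem);
                userfaultfd_unmap_complete(mm, &uf);

                if (populate && !ret)
                        mm_populate(addr, len); /* aligned, so no BUG_ON needed */
                return ret;
        }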
     

09 Jun, 2018

2 commits

  • Pull libnvdimm updates from Dan Williams:
    "This adds a user for the new 'bytes-remaining' updates to
    memcpy_mcsafe() that you already received through Ingo via the
    x86-dax-for-linus pull.

    Not included here, but still targeting this cycle, is support for
    handling memory media errors (poison) consumed via userspace dax
    mappings.

    Summary:

    - DAX broke a fundamental assumption of truncate of file mapped
    pages. The truncate path assumed that it is safe to disconnect a
    pinned page from a file and let the filesystem reclaim the physical
    block. With DAX the page is equivalent to the filesystem block.
    Introduce dax_layout_busy_page() to enable filesystems to wait for
    pinned DAX pages to be released. Without this wait a filesystem
    could allocate blocks under active device-DMA to a new file.

    - DAX arranges for the block layer to be bypassed and uses
    dax_direct_access() + copy_to_iter() to satisfy read(2) calls.
    However, the memcpy_mcsafe() facility is available through the pmem
    block driver. In order to safely handle media errors, via the DAX
    block-layer bypass, introduce copy_to_iter_mcsafe().

    - Fix cache management policy relative to the ACPI NFIT Platform
    Capabilities Structure to properly elide cache flushes when they
    are not necessary. The table indicates whether CPU caches are
    power-fail protected. Clarify that a deep flush is always performed
    on REQ_{FUA,PREFLUSH} requests"

    * tag 'libnvdimm-for-4.18' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (21 commits)
    dax: Use dax_write_cache* helpers
    libnvdimm, pmem: Do not flush power-fail protected CPU caches
    libnvdimm, pmem: Unconditionally deep flush on *sync
    libnvdimm, pmem: Complete REQ_FLUSH => REQ_PREFLUSH
    acpi, nfit: Remove ecc_unit_size
    dax: dax_insert_mapping_entry always succeeds
    libnvdimm, e820: Register all pmem resources
    libnvdimm: Debug probe times
    linvdimm, pmem: Preserve read-only setting for pmem devices
    x86, nfit_test: Add unit test for memcpy_mcsafe()
    pmem: Switch to copy_to_iter_mcsafe()
    dax: Report bytes remaining in dax_iomap_actor()
    dax: Introduce a ->copy_to_iter dax operation
    uio, lib: Fix CONFIG_ARCH_HAS_UACCESS_MCSAFE compilation
    xfs, dax: introduce xfs_break_dax_layouts()
    xfs: prepare xfs_break_layouts() for another layout type
    xfs: prepare xfs_break_layouts() to be called with XFS_MMAPLOCK_EXCL
    mm, fs, dax: handle layout changes to pinned dax mappings
    mm: fix __gup_device_huge vs unmap
    mm: introduce MEMORY_DEVICE_FS_DAX and CONFIG_DEV_PAGEMAP_OPS
    ...

    Linus Torvalds
     
  • Dan Williams
     

08 Jun, 2018

2 commits

  • mmap_sem will be read locked when calling follow_pmd_mask(). But this
    cannot prevent PMD from being changed for all cases when PTL is
    unlocked, for example, from pmd_trans_huge() to pmd_none() via
    MADV_DONTNEED. So it is possible for the pmd_present() check in
    follow_pmd_mask() to encounter an invalid PMD. This may cause an
    incorrect VM_BUG_ON() or an infinite loop. Fix this by reading the PMD
    entry into a local variable with READ_ONCE() and checking the local
    variable and pmd_none() in the retry loop.

    As Kirill pointed out, with the PTL unlocked, *pmd may be changed under
    us, so reading it directly again and again may trigger subtle bugs. So
    although using *pmd directly for anything other than the pmd_present()
    check may be safe, it is still better to read *pmd once and check the
    local variable multiple times.

    Replacing all direct uses of *pmd with the local variable while the PTL
    is unlocked was suggested by Kirill.

    Link: http://lkml.kernel.org/r/20180419083514.1365-1-ying.huang@intel.com
    Signed-off-by: "Huang, Ying"
    Reviewed-by: Zi Yan
    Cc: "Kirill A. Shutemov"
    Cc: Al Viro
    Cc: "Aneesh Kumar K.V"
    Cc: Dan Williams
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Ying
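    A minimal sketch of the pattern described above, simplified from the shape of
    follow_pmd_mask() (error and migration handling trimmed): read the PMD once
    and test only the local copy.

        pmd_t pmdval = READ_ONCE(*pmd);         /* snapshot once; PTL not held */

        if (pmd_none(pmdval))                   /* e.g. zapped under us by MADV_DONTNEED */
                return no_page_table(vma, flags);
        if (!pmd_present(pmdval))               /* e.g. a migration entry */
                return no_page_table(vma, flags);
        /* ... every further check uses pmdval, never *pmd directly ... */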
     
    Currently, PTE special support is turned on in per-architecture header
    files. Most of the time, it is defined in
    arch/*/include/asm/pgtable.h, depending (or not) on some other
    per-architecture static definition.

    This patch introduces a new configuration variable to manage this
    directly in the Kconfig files. It will later replace
    __HAVE_ARCH_PTE_SPECIAL.

    Here are notes for some architectures where the definition of
    __HAVE_ARCH_PTE_SPECIAL is not obvious:

    arm
    __HAVE_ARCH_PTE_SPECIAL which is currently defined in
    arch/arm/include/asm/pgtable-3level.h which is included by
    arch/arm/include/asm/pgtable.h when CONFIG_ARM_LPAE is set.
    So select ARCH_HAS_PTE_SPECIAL if ARM_LPAE.

    powerpc
    __HAVE_ARCH_PTE_SPECIAL is defined in 2 files:
    - arch/powerpc/include/asm/book3s/64/pgtable.h
    - arch/powerpc/include/asm/pte-common.h
    The first one is included if (PPC_BOOK3S & PPC64) while the second is
    included in all the other cases.
    So select ARCH_HAS_PTE_SPECIAL all the time.

    sparc:
    __HAVE_ARCH_PTE_SPECIAL is defined if defined(__sparc__) &&
    defined(__arch64__) which are defined through the compiler in
    sparc/Makefile if !SPARC32 which I assume to be if SPARC64.
    So select ARCH_HAS_PTE_SPECIAL if SPARC64

    There is no functional change introduced by this patch.

    Link: http://lkml.kernel.org/r/1523433816-14460-2-git-send-email-ldufour@linux.vnet.ibm.com
    Signed-off-by: Laurent Dufour
    Suggested-by: Jerome Glisse
    Reviewed-by: Jerome Glisse
    Acked-by: David Rientjes
    Cc: Michal Hocko
    Cc: "Aneesh Kumar K . V"
    Cc: Michael Ellerman
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Jonathan Corbet
    Cc: Catalin Marinas
    Cc: Will Deacon
    Cc: Yoshinori Sato
    Cc: Rich Felker
    Cc: David S. Miller
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: Vineet Gupta
    Cc: Palmer Dabbelt
    Cc: Albert Ou
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Cc: David Rientjes
    Cc: Robin Murphy
    Cc: Christophe LEROY
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Laurent Dufour
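    A hedged illustration of what the switch means for generic code (the
    HAVE_PTE_SPECIAL convenience macro mirrors the one mm/memory.c used at the
    time; illustrative only):

        /* Old world: per-arch macro.  New world: Kconfig symbol. */
        #if defined(CONFIG_ARCH_HAS_PTE_SPECIAL) || defined(__HAVE_ARCH_PTE_SPECIAL)
        # define HAVE_PTE_SPECIAL 1     /* pte_special()/pte_mkspecial() are real */
        #else
        # define HAVE_PTE_SPECIAL 0     /* pte_special() is a stub returning 0 */
        #endif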
     

22 May, 2018

1 commit

  • get_user_pages_fast() for device pages is missing the typical validation
    that all page references have been taken while the mapping was valid.
    Without this validation truncate operations can not reliably coordinate
    against new page reference events like O_DIRECT.

    Cc:
    Fixes: 3565fce3a659 ("mm, x86: get_user_pages() for dax mappings")
    Reported-by: Jan Kara
    Reviewed-by: Jan Kara
    Signed-off-by: Dan Williams

    Dan Williams
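    A sketch of the "typical validation" the entry refers to, as it appears
    elsewhere on the gup-fast path (simplified; pte, ptep and page come from the
    surrounding page-table walk): take the reference, then re-check that the
    entry did not change underneath.

        if (!page_cache_get_speculative(page))
                return 0;                       /* could not take the reference */

        if (unlikely(pte_val(pte) != pte_val(*ptep))) {
                put_page(page);                 /* mapping changed: undo and bail */
                return 0;
        }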
     

18 May, 2018

1 commit

  • proc_pid_cmdline_read() and environ_read() directly access the target
    process' VM to retrieve the command line and environment. If this
    process remaps these areas onto a file via mmap(), the requesting
    process may experience various issues such as extra delays if the
    underlying device is slow to respond.

    Let's simply refuse to access file-backed areas in these functions.
    For this we add a new FOLL_ANON gup flag that is passed to all calls
    to access_remote_vm(). The code already takes care of such failures
    (including unmapped areas). Accesses via /proc/pid/mem were not
    changed though.

    This was assigned CVE-2018-1120.

    Note for stable backports: the patch may apply to kernels prior to 4.11
    but silently miss one location; it must be checked that no call to
    access_remote_vm() keeps zero as the last argument.

    Reported-by: Qualys Security Advisory
    Cc: Linus Torvalds
    Cc: Andy Lutomirski
    Cc: Oleg Nesterov
    Cc: stable@vger.kernel.org
    Signed-off-by: Willy Tarreau
    Signed-off-by: Linus Torvalds

    Willy Tarreau
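    A hedged sketch of how such a flag can be enforced in the per-VMA permission
    check on the gup path (simplified; vma_is_anonymous() is the existing
    helper):

        if ((gup_flags & FOLL_ANON) && !vma_is_anonymous(vma))
                return -EFAULT;         /* refuse file-backed areas for these readers */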
     

14 Apr, 2018

2 commits

    __get_user_pages_fast handles errors differently from
    get_user_pages_fast: the former always returns the number of pages
    pinned, while the latter might return a negative error code.

    Link: http://lkml.kernel.org/r/1522962072-182137-6-git-send-email-mst@redhat.com
    Signed-off-by: Michael S. Tsirkin
    Reviewed-by: Andrew Morton
    Cc: Kirill A. Shutemov
    Cc: Huang Ying
    Cc: Jonathan Corbet
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Thorsten Leemhuis
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michael S. Tsirkin
     
  • get_user_pages_fast is supposed to be a faster drop-in equivalent of
    get_user_pages. As such, callers expect it to return a negative return
    code when passed an invalid address, and never expect it to return 0
    when passed a positive number of pages, since its documentation says:

    * Returns number of pages pinned. This may be fewer than the number
    * requested. If nr_pages is 0 or negative, returns 0. If no pages
    * were pinned, returns -errno.

    When get_user_pages_fast falls back on get_user_pages this is exactly
    what happens. Unfortunately the implementation is inconsistent: it
    returns 0 if passed a kernel address, confusing callers: for example,
    the following is pretty common but does not appear to do the right thing
    with a kernel address:

        ret = get_user_pages_fast(addr, 1, writeable, &page);
        if (ret < 0)
                return ret;

    Change get_user_pages_fast to return -EFAULT when supplied a kernel
    address to make it match expectations.

    All callers have been audited for consistency with the documented
    semantics.

    Link: http://lkml.kernel.org/r/1522962072-182137-4-git-send-email-mst@redhat.com
    Fixes: 5b65c4677a57 ("mm, x86/mm: Fix performance regression in get_user_pages_fast()")
    Signed-off-by: Michael S. Tsirkin
    Reported-by: syzbot+6304bf97ef436580fede@syzkaller.appspotmail.com
    Reviewed-by: Andrew Morton
    Cc: Kirill A. Shutemov
    Cc: Huang Ying
    Cc: Jonathan Corbet
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Thorsten Leemhuis
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michael S. Tsirkin
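    A sketch of the intended behaviour (not the literal patch; the era-appropriate
    three-argument access_ok() is shown, and whether VERIFY_READ or VERIFY_WRITE
    applies depends on the call):

        if (nr_pages <= 0)
                return 0;               /* documented: 0 or negative nr_pages -> 0 */

        if (unlikely(!access_ok(VERIFY_READ, (void __user *)start,
                                nr_pages << PAGE_SHIFT)))
                return -EFAULT;         /* kernel or otherwise invalid address */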
     

06 Apr, 2018

1 commit

  • - Fixed style error: 8 spaces -> 1 tab.
    - Fixed style warning: Corrected misleading indentation.

    Link: http://lkml.kernel.org/r/20180302210254.31888-1-marioleinweber@web.de
    Signed-off-by: Mario Leinweber
    Reviewed-by: Andrew Morton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mario Leinweber
     

10 Mar, 2018

1 commit

    KVM is hanging during postcopy live migration with userfaultfd because
    get_user_pages_unlocked() is not capable of handling FOLL_NOWAIT.

    Earlier, FOLL_NOWAIT was only ever passed to get_user_pages().

    Specifically, faultin_page() (the callee of the get_user_pages_unlocked()
    caller) doesn't know that if FAULT_FLAG_RETRY_NOWAIT was set in the page
    fault flags, then when VM_FAULT_RETRY is returned the mmap_sem wasn't
    actually released (even if nonblocking is not NULL). So it sets
    *nonblocking to zero, and the caller won't release the mmap_sem, thinking
    it was already released, when in fact it wasn't because of FOLL_NOWAIT.

    Link: http://lkml.kernel.org/r/20180302174343.5421-2-aarcange@redhat.com
    Fixes: ce53053ce378c ("kvm: switch get_user_page_nowait() to get_user_pages_unlocked()")
    Signed-off-by: Andrea Arcangeli
    Reported-by: Dr. David Alan Gilbert
    Tested-by: Dr. David Alan Gilbert
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
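    A hedged sketch of the shape of the fix in faultin_page() (simplified): only
    clear *nonblocking when the mmap_sem really was dropped, i.e. not when
    FAULT_FLAG_RETRY_NOWAIT was set.

        if (ret & VM_FAULT_RETRY) {
                if (nonblocking && !(fault_flags & FAULT_FLAG_RETRY_NOWAIT))
                        *nonblocking = 0;       /* mmap_sem was genuinely released */
                return -EBUSY;
        }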
     

07 Feb, 2018

1 commit

  • Pull libnvdimm updates from Ross Zwisler:

    - Require struct page by default for filesystem DAX to remove a number
    of surprising failure cases. This includes failures with direct I/O,
    gdb and fork(2).

    - Add support for the new Platform Capabilities Structure added to the
    NFIT in ACPI 6.2a. This new table tells us whether the platform
    supports flushing of CPU and memory controller caches on unexpected
    power loss events.

    - Revamp vmem_altmap and dev_pagemap handling to clean up code and
    better support future PCI P2P uses.

    - Deprecate the ND_IOCTL_SMART_THRESHOLD command whose payload has
    become out-of-sync with recent versions of the NVDIMM_FAMILY_INTEL
    spec, and instead rely on the generic ND_CMD_CALL approach used by
    the two other IOCTL families, NVDIMM_FAMILY_{HPE,MSFT}.

    - Enhance nfit_test so we can test some of the new things added in
    version 1.6 of the DSM specification. This includes testing firmware
    download and simulating the Last Shutdown State (LSS) status.

    * tag 'libnvdimm-for-4.16' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (37 commits)
    libnvdimm, namespace: remove redundant initialization of 'nd_mapping'
    acpi, nfit: fix register dimm error handling
    libnvdimm, namespace: make min namespace size 4K
    tools/testing/nvdimm: force nfit_test to depend on instrumented modules
    libnvdimm/nfit_test: adding support for unit testing enable LSS status
    libnvdimm/nfit_test: add firmware download emulation
    nfit-test: Add platform cap support from ACPI 6.2a to test
    libnvdimm: expose platform persistence attribute for nd_region
    acpi: nfit: add persistent memory control flag for nd_region
    acpi: nfit: Add support for detect platform CPU cache flush on power loss
    device-dax: Fix trailing semicolon
    libnvdimm, btt: fix uninitialized err_lock
    dax: require 'struct page' by default for filesystem dax
    ext2: auto disable dax instead of failing mount
    ext4: auto disable dax instead of failing mount
    mm, dax: introduce pfn_t_special()
    mm: Fix devm_memremap_pages() collision handling
    mm: Fix memory size alignment in devm_memremap_pages_release()
    memremap: merge find_dev_pagemap into get_dev_pagemap
    memremap: change devm_memremap_pages interface to use struct dev_pagemap
    ...

    Linus Torvalds
     

01 Feb, 2018

1 commit


09 Jan, 2018

1 commit


16 Dec, 2017

1 commit

  • This reverts commits 5c9d2d5c269c, c7da82b894e9, and e7fe7b5cae90.

    We'll probably need to revisit this, but basically we should not
    complicate the get_user_pages_fast() case, and checking the actual page
    table protection key bits will require more care anyway, since the
    protection keys depend on the exact state of the VM in question.

    Particularly when doing a "remote" page lookup (i.e. in somebody else's
    VM, not your own), you need to be much more careful than this was. Dave
    Hansen says:

    "So, the underlying bug here is that we now do a get_user_pages_remote()
    and then go ahead and do the p*_access_permitted() checks against the
    current PKRU. This was introduced recently with the addition of the
    new p??_access_permitted() calls.

    We have checks in the VMA path for the "remote" gups and we avoid
    consulting PKRU for them. This got missed in the pkeys selftests
    because I did a ptrace read, but not a *write*. I also didn't
    explicitly test it against something where a COW needed to be done"

    It's also not entirely clear that it makes sense to check the protection
    key bits at this level at all. But one possible eventual solution is to
    make the get_user_pages_fast() case just abort if it sees protection key
    bits set, which makes us fall back to the regular get_user_pages() case,
    which then has a vma and can do the check there if we want to.

    We'll see.

    Somewhat related to this all: what we _do_ want to do some day is to
    check the PAGE_USER bit - it should obviously always be set for user
    pages, but it would be a good check to have back. Because we have no
    generic way to test for it, we lost it as part of moving over from the
    architecture-specific x86 GUP implementation to the generic one in
    commit e585513b76f7 ("x86/mm/gup: Switch GUP to the generic
    get_user_page_fast() implementation").

    Cc: Peter Zijlstra
    Cc: Dan Williams
    Cc: Dave Hansen
    Cc: Kirill A. Shutemov
    Cc: "Jérôme Glisse"
    Cc: Andrew Morton
    Cc: Al Viro
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

03 Dec, 2017

3 commits

    The only caller that doesn't pass true for notify_drop is get_user_pages(),
    and it passes NULL in locked. The only place where we check notify_drop is
        if (notify_drop && lock_dropped && *locked)
    and lock_dropped can become true only if we have locked != NULL.
    In other words, the second part of the condition will be false when
    called by get_user_pages().

    Just get rid of the argument and turn the condition into
        if (lock_dropped && *locked)

    Signed-off-by: Al Viro

    Al Viro
     
    Equivalent transformation - the only place in __get_user_pages_locked()
    where we look at the notify_drop argument is
        if (notify_drop && lock_dropped && *locked) {
                up_read(&mm->mmap_sem);
                *locked = 0;
        }
    at the very end. Changing notify_drop from false to true won't change
    behaviour unless *locked is non-zero. The caller is
        ret = __get_user_pages_locked(current, mm, start, nr_pages, pages, NULL,
                                      &locked, false, gup_flags | FOLL_TOUCH);
        if (locked)
                up_read(&mm->mmap_sem);
    so in that case the original kernel would have done up_read() right after
    the return from __get_user_pages_locked(), while the modified one would've
    done it right before the return.

    Signed-off-by: Al Viro

    Al Viro
     
  • Signed-off-by: Al Viro

    Al Viro
     

30 Nov, 2017

2 commits

  • Patch series "introduce get_user_pages_longterm()", v2.

    Here is a new get_user_pages api for cases where a driver intends to
    keep an elevated page count indefinitely. This is distinct from usages
    like iov_iter_get_pages where the elevated page counts are transient.
    The iov_iter_get_pages cases immediately turn around and submit the
    pages to a device driver which will put_page when the i/o operation
    completes (under kernel control).

    In the longterm case userspace is responsible for dropping the page
    reference at some undefined point in the future. This is untenable for
    filesystem-dax case where the filesystem is in control of the lifetime
    of the block / page and needs reasonable limits on how long it can wait
    for pages in a mapping to become idle.

    Fixing filesystems to actually wait for dax pages to be idle before
    blocks from a truncate/hole-punch operation are repurposed is saved for
    a later patch series.

    Also, allowing longterm registration of dax mappings is a future patch
    series that introduces a "map with lease" semantic where the kernel can
    revoke a lease and force userspace to drop its page references.

    I have also tagged these for -stable to purposely break cases that might
    assume that longterm memory registrations for filesystem-dax mappings
    were supported by the kernel. The behavior regression this policy
    change implies is one of the reasons we maintain the "dax enabled.
    Warning: EXPERIMENTAL, use at your own risk" notification when mounting
    a filesystem in dax mode.

    It is worth noting the device-dax interface does not suffer the same
    constraints since it does not support file space management operations
    like hole-punch.

    This patch (of 4):

    Until there is a solution to the dma-to-dax vs truncate problem it is
    not safe to allow long-standing memory registrations against
    filesystem-dax vmas. Device-dax vmas do not have this problem and are
    explicitly allowed.

    This is temporary until a "memory registration with layout-lease"
    mechanism can be implemented for the affected sub-systems (RDMA and
    V4L2).

    [akpm@linux-foundation.org: use kcalloc()]
    Link: http://lkml.kernel.org/r/151068939435.7446.13560129395419350737.stgit@dwillia2-desk3.amr.corp.intel.com
    Fixes: 3565fce3a659 ("mm, x86: get_user_pages() for dax mappings")
    Signed-off-by: Dan Williams
    Suggested-by: Christoph Hellwig
    Cc: Doug Ledford
    Cc: Hal Rosenstock
    Cc: Inki Dae
    Cc: Jan Kara
    Cc: Jason Gunthorpe
    Cc: Jeff Moyer
    Cc: Joonyoung Shim
    Cc: Kyungmin Park
    Cc: Mauro Carvalho Chehab
    Cc: Mel Gorman
    Cc: Ross Zwisler
    Cc: Sean Hefty
    Cc: Seung-Woo Kim
    Cc: Vlastimil Babka
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
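    A minimal sketch of the policy this patch introduces (simplified; the real
    helper also deals with allocating the vmas array): reject long-term pins when
    a vma is filesystem-dax, while device-dax vmas remain allowed.

        /* After get_user_pages() has pinned 'rc' pages into pages[]/vmas[]. */
        for (i = 0; i < rc; i++) {
                if (vma_is_fsdax(vmas[i])) {    /* fs-dax: the fs owns block lifetime */
                        int j;

                        for (j = 0; j < rc; j++)
                                put_page(pages[j]);     /* undo every pin we took */
                        return -EOPNOTSUPP;
                }
        }
        return rc;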
     
  • The 'access_permitted' helper is used in the gup-fast path and goes
    beyond the simple _PAGE_RW check to also:

    - validate that the mapping is writable from a protection keys
    standpoint

    - validate that the pte has _PAGE_USER set, since all fault paths where
    pte_write() is checked must be referencing user memory.

    Link: http://lkml.kernel.org/r/151043111604.2842.8051684481794973100.stgit@dwillia2-desk3.amr.corp.intel.com
    Signed-off-by: Dan Williams
    Cc: Dave Hansen
    Cc: Kirill A. Shutemov
    Cc: "Jérôme Glisse"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
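    A hedged sketch of the shape of the x86 check (protection-key handling
    omitted; the helper name is illustrative):

        static inline bool example_pte_access_permitted(pte_t pte, bool write)
        {
                unsigned long need = _PAGE_PRESENT | _PAGE_USER; /* user mapping only */

                if (write)
                        need |= _PAGE_RW;

                return (pte_flags(pte) & need) == need;          /* pkey check omitted */
        }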
     

20 Oct, 2017

1 commit


13 Sep, 2017

1 commit

  • The 0-day test bot found a performance regression that was tracked down to
    switching x86 to the generic get_user_pages_fast() implementation:

    http://lkml.kernel.org/r/20170710024020.GA26389@yexl-desktop

    The regression was caused by the fact that we now use local_irq_save() +
    local_irq_restore() in get_user_pages_fast() to disable interrupts.
    The x86 implementation used local_irq_disable() + local_irq_enable().

    The fix is to make get_user_pages_fast() use local_irq_disable(),
    leaving local_irq_save() for __get_user_pages_fast(), which can be called
    with interrupts already disabled.

    Numbers for pinning a gigabyte of memory, one page a time, 20 repeats:

    Before: Average: 14.91 ms, stddev: 0.45 ms
    After: Average: 10.76 ms, stddev: 0.18 ms

    Signed-off-by: Kirill A. Shutemov
    Cc: Andrew Morton
    Cc: Huang Ying
    Cc: Jonathan Corbet
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Thorsten Leemhuis
    Cc: linux-mm@kvack.org
    Fixes: e585513b76f7 ("x86/mm/gup: Switch GUP to the generic get_user_page_fast() implementation")
    Link: http://lkml.kernel.org/r/20170908215603.9189-3-kirill.shutemov@linux.intel.com
    Signed-off-by: Ingo Molnar

    Kirill A. Shutemov
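    A sketch of the resulting split (simplified; gup_pgd_range() is the internal
    walker in mm/gup.c):

        /* get_user_pages_fast(): interrupts are known to be enabled here. */
        local_irq_disable();
        gup_pgd_range(addr, end, write, pages, &nr);
        local_irq_enable();

        /* __get_user_pages_fast(): the caller may already have IRQs off. */
        local_irq_save(flags);
        gup_pgd_range(start, end, write, pages, &nr);
        local_irq_restore(flags);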
     

09 Sep, 2017

2 commits

    Platforms with an advanced system bus (like CAPI or CCIX) allow device
    memory to be accessible from the CPU in a cache-coherent fashion. Add a
    new type of ZONE_DEVICE to represent such memory. The use cases are the
    same as for un-addressable device memory, but without all the corner cases.

    Link: http://lkml.kernel.org/r/20170817000548.32038-19-jglisse@redhat.com
    Signed-off-by: Jérôme Glisse
    Cc: Aneesh Kumar
    Cc: Paul E. McKenney
    Cc: Benjamin Herrenschmidt
    Cc: Dan Williams
    Cc: Ross Zwisler
    Cc: Balbir Singh
    Cc: David Nellans
    Cc: Evgeny Baskakov
    Cc: Johannes Weiner
    Cc: John Hubbard
    Cc: Kirill A. Shutemov
    Cc: Mark Hairgrove
    Cc: Michal Hocko
    Cc: Sherry Cheung
    Cc: Subhash Gutti
    Cc: Vladimir Davydov
    Cc: Bob Liu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jérôme Glisse
     
  • When THP migration is being used, memory management code needs to handle
    pmd migration entries properly. This patch uses !pmd_present() or
    is_swap_pmd() (depending on whether pmd_none() needs separate code or
    not) to check pmd migration entries at the places where a pmd entry is
    present.

    Since pmd-related code uses split_huge_page(), split_huge_pmd(),
    pmd_trans_huge(), pmd_trans_unstable(), or
    pmd_none_or_trans_huge_or_clear_bad(), this patch:

    1. adds pmd migration entry split code in split_huge_pmd(),

    2. takes care of pmd migration entries whenever pmd_trans_huge() is present,

    3. makes pmd_none_or_trans_huge_or_clear_bad() pmd migration entry aware.

    Since split_huge_page() uses split_huge_pmd() and pmd_trans_unstable()
    is equivalent to pmd_none_or_trans_huge_or_clear_bad(), we do not change
    them.

    Until this commit, a pmd entry should be:
    1. pointing to a pte page,
    2. is_swap_pmd(),
    3. pmd_trans_huge(),
    4. pmd_devmap(), or
    5. pmd_none().

    Signed-off-by: Zi Yan
    Cc: Kirill A. Shutemov
    Cc: "H. Peter Anvin"
    Cc: Anshuman Khandual
    Cc: Dave Hansen
    Cc: David Nellans
    Cc: Ingo Molnar
    Cc: Mel Gorman
    Cc: Minchan Kim
    Cc: Naoya Horiguchi
    Cc: Thomas Gleixner
    Cc: Vlastimil Babka
    Cc: Andrea Arcangeli
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zi Yan
     

07 Sep, 2017

1 commit

  • These functions are the only bits of generic code that use
    {pud,pmd}_pfn() without checking for CONFIG_TRANSPARENT_HUGEPAGE. This
    works fine on x86, the only arch with devmap support, since the *_pfn()
    functions are always defined there, but this isn't true for every
    architecture.

    Link: http://lkml.kernel.org/r/20170626063833.11094-1-oohall@gmail.com
    Signed-off-by: Oliver O'Halloran
    Cc: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oliver O'Halloran
     

07 Jul, 2017

5 commits

    When speculatively taking references to a hugepage using
    page_cache_add_speculative() in gup_huge_pmd(), it is assumed that the
    page returned by pmd_page() is the head page. Although normally true,
    this assumption doesn't hold when the hugepage comprises successive
    page table entries, such as when using the contiguous bit on arm64 at
    the PTE or PMD level.

    This can be addressed by ensuring that the page passed to
    page_cache_add_speculative() is the real head or by de-referencing the
    head page within the function.

    We take the first approach to keep the usage pattern aligned with
    page_cache_get_speculative() where users already pass the appropriate
    page, i.e., the de-referenced head.

    Apply the same logic to fix gup_huge_[pud|pgd]() as well.

    [punit.agrawal@arm.com: fix arm64 ltp failure]
    Link: http://lkml.kernel.org/r/20170619170145.25577-5-punit.agrawal@arm.com
    Link: http://lkml.kernel.org/r/20170522133604.11392-3-punit.agrawal@arm.com
    Signed-off-by: Punit Agrawal
    Acked-by: Steve Capper
    Cc: Michal Hocko
    Cc: "Kirill A. Shutemov"
    Cc: Aneesh Kumar K.V
    Cc: Catalin Marinas
    Cc: Will Deacon
    Cc: Naoya Horiguchi
    Cc: Mark Rutland
    Cc: Hillf Danton
    Cc: Mike Kravetz
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Punit Agrawal
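    A sketch of the first approach (passing the real head page), simplified from
    the gup_huge_pmd() pattern with the refs accounting and re-check trimmed:

        head = pmd_page(orig);                            /* real compound head */
        page = head + ((addr & ~PMD_MASK) >> PAGE_SHIFT); /* tail we start from */

        if (!page_cache_add_speculative(head, refs))      /* pin the head, not a tail */
                return 0;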
     
  • When operating on hugepages with DEBUG_VM enabled, the GUP code checks
    the compound head for each tail page prior to calling
    page_cache_add_speculative. This is broken, because on the fast-GUP
    path (where we don't hold any page table locks) we can be racing with a
    concurrent invocation of split_huge_page_to_list.

    split_huge_page_to_list deals with this race by using page_ref_freeze to
    freeze the page and force concurrent GUPs to fail whilst the component
    pages are modified. This modification includes clearing the
    compound_head field for the tail pages, so checking this prior to a
    successful call to page_cache_add_speculative can lead to false
    positives: In fact, page_cache_add_speculative *already* has this check
    once the page refcount has been successfully updated, so we can simply
    remove the broken calls to VM_BUG_ON_PAGE.

    Link: http://lkml.kernel.org/r/20170522133604.11392-2-punit.agrawal@arm.com
    Signed-off-by: Will Deacon
    Signed-off-by: Punit Agrawal
    Acked-by: Steve Capper
    Acked-by: Kirill A. Shutemov
    Cc: Aneesh Kumar K.V
    Cc: Catalin Marinas
    Cc: Naoya Horiguchi
    Cc: Mark Rutland
    Cc: Hillf Danton
    Cc: Michal Hocko
    Cc: Mike Kravetz
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Will Deacon
     
    Architectures like ppc64 support hugepage sizes that are not mapped to
    any of the page table levels. Instead, they add an alternate page
    table entry format called hugepage directory (hugepd). A hugepd indicates
    that the page table entry maps to a set of hugetlb pages. Add support
    for this in the generic follow_page_mask code. We already support this
    format in the generic gup code.

    The default implementation prints a warning and returns NULL. We will
    add ppc64 support in later patches.

    Link: http://lkml.kernel.org/r/1494926612-23928-7-git-send-email-aneesh.kumar@linux.vnet.ibm.com
    Signed-off-by: Aneesh Kumar K.V
    Cc: Anshuman Khandual
    Cc: Naoya Horiguchi
    Cc: Michael Ellerman
    Cc: Benjamin Herrenschmidt
    Cc: Mike Kravetz
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Aneesh Kumar K.V
     
    ppc64 supports pgd hugetlb entries. Add code to handle hugetlb pgd
    entries in follow_page_mask so that ppc64 can switch to it to handle
    hugetlb entries.

    Link: http://lkml.kernel.org/r/1494926612-23928-5-git-send-email-aneesh.kumar@linux.vnet.ibm.com
    Signed-off-by: Anshuman Khandual
    Signed-off-by: Aneesh Kumar K.V
    Cc: Naoya Horiguchi
    Cc: Michael Ellerman
    Cc: Benjamin Herrenschmidt
    Cc: Mike Kravetz
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Anshuman Khandual
     
    This makes the code easier to read. No functional changes in this
    patch. In a followup patch, we will update follow_page_mask to handle
    the hugetlb hugepd format so that archs like ppc64 can switch to the
    generic version. This split helps in doing that nicely.

    Link: http://lkml.kernel.org/r/1494926612-23928-3-git-send-email-aneesh.kumar@linux.vnet.ibm.com
    Signed-off-by: Aneesh Kumar K.V
    Reviewed-by: Naoya Horiguchi
    Cc: Anshuman Khandual
    Cc: Michael Ellerman
    Cc: Benjamin Herrenschmidt
    Cc: Mike Kravetz
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Aneesh Kumar K.V
     

22 Jun, 2017

1 commit


19 Jun, 2017

1 commit

    The stack guard page is a useful feature to reduce the risk of stack
    smashing into a different mapping. We have been using a single page gap,
    which is sufficient to prevent having a stack adjacent to a different
    mapping. But this seems to be insufficient in light of the stack usage
    in userspace. E.g. glibc uses alloca() allocations as large as 64kB in
    many commonly used functions. Others use constructs like gid_t
    buffer[NGROUPS_MAX], which is 256kB, or stack strings with MAX_ARG_STRLEN.

    This will become especially dangerous for suid binaries and the default
    unlimited stack size limit, because those applications can be tricked
    into consuming a large portion of the stack, and a single glibc call
    could jump over the guard page. These attacks are not theoretical,
    unfortunately.

    Make those attacks less probable by increasing the stack guard gap
    to 1MB (on systems with 4k pages; but make it depend on the page size
    because systems with larger base pages might cap stack allocations in
    the PAGE_SIZE units) which should cover larger alloca() and VLA stack
    allocations. It is obviously not a full fix because the problem is
    somehow inherent, but it should reduce attack space a lot.

    One could argue that the gap size should be configurable from userspace,
    but that can be done later when somebody finds that the new 1MB is wrong
    for some special case applications. For now, add a kernel command line
    option (stack_guard_gap) to specify the stack gap size (in page units).

    Implementation wise, first delete all the old code for stack guard page:
    because although we could get away with accounting one extra page in a
    stack vma, accounting a larger gap can break userspace - case in point,
    a program run with "ulimit -S -v 20000" failed when the 1MB gap was
    counted for RLIMIT_AS; similar problems could come with RLIMIT_MLOCK
    and strict non-overcommit mode.

    Instead of keeping the gap inside the stack vma, maintain the stack guard
    gap as a gap between vmas: using vm_start_gap() in place of vm_start
    (or vm_end_gap() in place of vm_end if VM_GROWSUP) in just those few
    places which need to respect the gap - mainly arch_get_unmapped_area(),
    and the vma tree's subtree_gap support for that.

    Original-patch-by: Oleg Nesterov
    Original-patch-by: Michal Hocko
    Signed-off-by: Hugh Dickins
    Acked-by: Michal Hocko
    Tested-by: Helge Deller # parisc
    Signed-off-by: Linus Torvalds

    Hugh Dickins
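    A hedged sketch of the vm_start_gap() idea for the VM_GROWSDOWN case (the
    VM_GROWSUP mirror is analogous; the function name is prefixed to mark it as
    illustrative):

        extern unsigned long stack_guard_gap;   /* settable via stack_guard_gap= */

        static inline unsigned long example_vm_start_gap(struct vm_area_struct *vma)
        {
                unsigned long vm_start = vma->vm_start;

                if (vma->vm_flags & VM_GROWSDOWN) {
                        vm_start -= stack_guard_gap;    /* keep the gap below the stack */
                        if (vm_start > vma->vm_start)   /* underflow: clamp to zero */
                                vm_start = 0;
                }
                return vm_start;
        }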
     

13 Jun, 2017

1 commit

    This patch provides all the callbacks required by the generic
    get_user_pages_fast() code, switches x86 over to it, and removes
    the platform-specific implementation.

    Signed-off-by: Kirill A. Shutemov
    Cc: Andrew Morton
    Cc: Andy Lutomirski
    Cc: Andy Lutomirski
    Cc: Borislav Petkov
    Cc: Brian Gerst
    Cc: Dave Hansen
    Cc: Denys Vlasenko
    Cc: H. Peter Anvin
    Cc: Josh Poimboeuf
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-arch@vger.kernel.org
    Cc: linux-mm@kvack.org
    Link: http://lkml.kernel.org/r/20170606113133.22974-2-kirill.shutemov@linux.intel.com
    Signed-off-by: Ingo Molnar

    Kirill A. Shutemov
     

03 Jun, 2017

1 commit

  • KVM uses get_user_pages() to resolve its stage2 faults. KVM sets the
    FOLL_HWPOISON flag causing faultin_page() to return -EHWPOISON when it
    finds a VM_FAULT_HWPOISON. KVM handles these hwpoison pages as a
    special case. (check_user_page_hwpoison())

    When huge pages are involved, this doesn't work so well.
    get_user_pages() calls follow_hugetlb_page(), which stops early if it
    receives VM_FAULT_HWPOISON from hugetlb_fault(), eventually returning
    -EFAULT to the caller. The step to map this to -EHWPOISON based on the
    FOLL_ flags is missing. The hwpoison special case is skipped, and
    -EFAULT is returned to user-space, causing Qemu or kvmtool to exit.

    Instead, move this VM_FAULT_ to errno mapping code into a header file
    and use it from faultin_page() and follow_hugetlb_page().

    With this, KVM works as expected.

    This isn't a problem for arm64 today as we haven't enabled
    MEMORY_FAILURE, but I can't see any reason this doesn't happen on x86
    too, so I think this should be a fix. This doesn't apply earlier than
    stable's v4.11.1 due to all sorts of cleanup.

    [james.morse@arm.com: add vm_fault_to_errno() call to faultin_page(), as suggested]
    Link: http://lkml.kernel.org/r/20170525171035.16359-1-james.morse@arm.com
    [akpm@linux-foundation.org: coding-style fixes]
    Link: http://lkml.kernel.org/r/20170524160900.28786-1-james.morse@arm.com
    Signed-off-by: James Morse
    Acked-by: Punit Agrawal
    Acked-by: Naoya Horiguchi
    Cc: "Kirill A . Shutemov"
    Cc: [4.11.1+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    James Morse
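    A sketch of the shape of the shared mapping helper (close to, but not claimed
    to be, the exact upstream definition):

        static inline int example_vm_fault_to_errno(int vm_fault, int foll_flags)
        {
                if (vm_fault & VM_FAULT_OOM)
                        return -ENOMEM;
                if (vm_fault & (VM_FAULT_HWPOISON | VM_FAULT_HWPOISON_LARGE))
                        return (foll_flags & FOLL_HWPOISON) ? -EHWPOISON : -EFAULT;
                if (vm_fault & (VM_FAULT_SIGBUS | VM_FAULT_SIGSEGV))
                        return -EFAULT;
                return 0;
        }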
     

04 May, 2017

1 commit

  • MIPS just got changed to only accept a pointer argument for access_ok(),
    causing one warning in drivers/scsi/pmcraid.c. I tried changing x86 the
    same way and found the same warning in __get_user_pages_fast() and
    nowhere else in the kernel during randconfig testing:

    mm/gup.c: In function '__get_user_pages_fast':
    mm/gup.c:1578:6: error: passing argument 1 of '__chk_range_not_ok' makes pointer from integer without a cast [-Werror=int-conversion]

    It would probably be a good idea to enforce type-safety in general, so
    let's change this file to not cause a warning if we do that.

    I don't know why the warning did not appear on MIPS.

    Fixes: 2667f50e8b81 ("mm: introduce a general RCU get_user_pages_fast()")
    Link: http://lkml.kernel.org/r/20170421162659.3314521-1-arnd@arndb.de
    Signed-off-by: Arnd Bergmann
    Cc: Alexander Viro
    Acked-by: Ingo Molnar
    Cc: Michal Hocko
    Cc: "Kirill A. Shutemov"
    Cc: Lorenzo Stoakes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Arnd Bergmann
     

23 Apr, 2017

1 commit

  • This reverts commit 2947ba054a4dabbd82848728d765346886050029.

    Dan Williams reported dax-pmem kernel warnings with the following signature:

    WARNING: CPU: 8 PID: 245 at lib/percpu-refcount.c:155 percpu_ref_switch_to_atomic_rcu+0x1f5/0x200
    percpu ref (dax_pmem_percpu_release [dax_pmem])
    Cc: Andrew Morton
    Cc: Borislav Petkov
    Cc: Catalin Marinas
    Cc: Kirill A. Shutemov
    Cc: Linus Torvalds
    Cc: Michal Hocko
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: Thomas Gleixner
    Cc: aneesh.kumar@linux.vnet.ibm.com
    Cc: dann.frazier@canonical.com
    Cc: dave.hansen@intel.com
    Cc: steve.capper@linaro.org
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Ingo Molnar
     

18 Mar, 2017

2 commits

    This patch provides all the callbacks required by the generic
    get_user_pages_fast() code, switches x86 over to it, and removes
    the platform-specific implementation.

    Signed-off-by: Kirill A. Shutemov
    Cc: Andrew Morton
    Cc: Aneesh Kumar K . V
    Cc: Borislav Petkov
    Cc: Catalin Marinas
    Cc: Dann Frazier
    Cc: Dave Hansen
    Cc: H. Peter Anvin
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: Steve Capper
    Cc: Thomas Gleixner
    Cc: linux-arch@vger.kernel.org
    Cc: linux-mm@kvack.org
    Link: http://lkml.kernel.org/r/20170316213906.89528-1-kirill.shutemov@linux.intel.com
    [ Minor readability edits. ]
    Signed-off-by: Ingo Molnar

    Kirill A. Shutemov
     
  • This is a preparation patch for the transition of x86 to the generic GUP_fast()
    implementation.

    On x86, get_user_pages_fast() does a couple of sanity checks to see if we can
    call __get_user_pages_fast() for the range.

    This kind of wrapping protection should be useful for the generic code too.

    Signed-off-by: Kirill A. Shutemov
    Cc: Andrew Morton
    Cc: Aneesh Kumar K . V
    Cc: Borislav Petkov
    Cc: Catalin Marinas
    Cc: Dann Frazier
    Cc: Dave Hansen
    Cc: H. Peter Anvin
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: Steve Capper
    Cc: Thomas Gleixner
    Cc: linux-arch@vger.kernel.org
    Cc: linux-mm@kvack.org
    Link: http://lkml.kernel.org/r/20170316152655.37789-7-kirill.shutemov@linux.intel.com
    [ Small readability edits. ]
    Signed-off-by: Ingo Molnar

    Kirill A. Shutemov