19 Jun, 2017

1 commit

  • The stack guard page is a useful feature that reduces the risk of stack
    smashing into a different mapping. We have been using a single-page gap,
    which is sufficient to prevent the stack from becoming adjacent to a
    different mapping. But this seems to be insufficient in light of the
    stack usage in userspace, e.g. glibc uses alloca() allocations as large
    as 64kB in many commonly used functions. Others use constructs like
    gid_t buffer[NGROUPS_MAX], which is 256kB, or stack strings with
    MAX_ARG_STRLEN.

    This becomes especially dangerous for suid binaries, and with the default
    of no limit on the stack size, because those applications can be tricked
    into consuming a large portion of the stack so that a single glibc call
    could jump over the guard page. These attacks are not theoretical,
    unfortunately.

    Make those attacks less probable by increasing the stack guard gap to 1MB
    (on systems with 4k pages; but make it depend on the page size, because
    systems with larger base pages might cap stack allocations in PAGE_SIZE
    units), which should cover larger alloca() and VLA stack allocations. It
    is obviously not a full fix because the problem is somewhat inherent, but
    it should reduce the attack space a lot.

    One could argue that the gap size should be configurable from userspace,
    but that can be done later when somebody finds that the new 1MB is wrong
    for some special case applications. For now, add a kernel command line
    option (stack_guard_gap) to specify the stack gap size (in page units).

    Implementation-wise, first delete all the old code for the stack guard
    page: although we could get away with accounting one extra page in a
    stack vma, accounting a larger gap can break userspace - case in point,
    a program run with "ulimit -S -v 20000" failed when the 1MB gap was
    counted towards RLIMIT_AS; similar problems could come with RLIMIT_MLOCK
    and strict non-overcommit mode.

    Instead of keeping the gap inside the stack vma, maintain the stack guard
    gap as a gap between vmas: use vm_start_gap() in place of vm_start (or
    vm_end_gap() in place of vm_end if VM_GROWSUP) in just those few places
    which need to respect the gap - mainly arch_get_unmapped_area(), and the
    vma tree's subtree_gap support for that.
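
    A minimal sketch of such a gap-aware helper (kernel context assumed; the
    actual in-tree helper may differ in detail):

    static inline unsigned long vm_start_gap(struct vm_area_struct *vma)
    {
            unsigned long vm_start = vma->vm_start;

            /* Expose a start address lowered by stack_guard_gap so that
             * gap-respecting callers keep other mappings out of the gap. */
            if (vma->vm_flags & VM_GROWSDOWN) {
                    vm_start -= stack_guard_gap;
                    if (vm_start > vma->vm_start)   /* guard against underflow */
                            vm_start = 0;
            }
            return vm_start;
    }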

    Original-patch-by: Oleg Nesterov
    Original-patch-by: Michal Hocko
    Signed-off-by: Hugh Dickins
    Acked-by: Michal Hocko
    Tested-by: Helge Deller # parisc
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

04 May, 2017

1 commit

  • Commit 091d0d55b286 ("shm: fix null pointer deref when userspace
    specifies invalid hugepage size") had replaced MAP_HUGE_MASK with
    SHM_HUGE_MASK. Though both of them contain the same numeric value of
    0x3f, the MAP_HUGE_MASK flag is the more appropriate one in this
    context. Hence change it back.
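
    For illustration, this is how a huge page size is encoded into and
    decoded from mmap(2) flags with that mask (a self-contained sketch; the
    two #defines simply mirror the uapi values instead of pulling in kernel
    headers):

    #include <stdio.h>

    #define MAP_HUGE_SHIFT  26      /* as in the uapi mman headers */
    #define MAP_HUGE_MASK   0x3f

    int main(void)
    {
            /* Encode a 2MB huge page size (log2(2MB) == 21) into the flags. */
            unsigned long flags = 21UL << MAP_HUGE_SHIFT;

            /* The kernel recovers the page-size exponent with the same mask. */
            unsigned long page_shift = (flags >> MAP_HUGE_SHIFT) & MAP_HUGE_MASK;

            printf("requested huge page size: %lu kB\n",
                   (1UL << page_shift) / 1024);
            return 0;
    }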

    Link: http://lkml.kernel.org/r/20170404045635.616-1-khandual@linux.vnet.ibm.com
    Signed-off-by: Anshuman Khandual
    Reviewed-by: Matthew Wilcox
    Acked-by: Balbir Singh
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Anshuman Khandual
     

03 Mar, 2017

1 commit

  • Pull vfs pile two from Al Viro:

    - orangefs fix

    - series of fs/namei.c cleanups from me

    - VFS stuff coming from overlayfs tree

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    orangefs: Use RCU for destroy_inode
    vfs: use helper for calling f_op->fsync()
    mm: use helper for calling f_op->mmap()
    vfs: use helpers for calling f_op->{read,write}_iter()
    vfs: pass type instead of fn to do_{loop,iter}_readv_writev()
    vfs: extract common parts of {compat_,}do_readv_writev()
    vfs: wrap write f_ops with file_{start,end}_write()
    vfs: deny copy_file_range() for non regular files
    vfs: deny fallocate() on directory
    vfs: create vfs helper vfs_tmpfile()
    namei.c: split unlazy_walk()
    namei.c: fold the check for DCACHE_OP_REVALIDATE into d_revalidate()
    lookup_fast(): clean up the logics around the fallback to non-rcu mode
    namei: fold unlazy_link() into its sole caller

    Linus Torvalds
     

02 Mar, 2017

1 commit


25 Feb, 2017

5 commits

  • If the madvise(2) advice will result in the underlying vma being split
    and the number of areas mapped by the process will exceed
    /proc/sys/vm/max_map_count as a result, return ENOMEM instead of EAGAIN.

    EAGAIN is returned by madvise(2) when a kernel resource, such as slab,
    is temporarily unavailable. It indicates that userspace should retry
    the advice in the near future. This is important for advice such as
    MADV_DONTNEED, which is often used by malloc implementations to free
    memory back to the system: we really do still want to free memory back
    when madvise(2) returns EAGAIN, because the slab allocations (for vmas,
    anon_vmas, or mempolicies) simply could not be satisfied at that moment.

    Hitting /proc/sys/vm/max_map_count is not a temporary failure, however,
    so return ENOMEM to indicate this is a more serious issue. A follow-up
    patch to the man page will specify this behavior.
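
    A userspace sketch of how a malloc-style caller might now distinguish
    the two errors (illustrative only; free_range() is a hypothetical
    helper, not an existing API):

    #define _GNU_SOURCE
    #include <errno.h>
    #include <stdio.h>
    #include <sys/mman.h>

    /* Hypothetical helper: retry MADV_DONTNEED on transient failures only. */
    static int free_range(void *addr, size_t len)
    {
            int retries = 3;

            while (madvise(addr, len, MADV_DONTNEED) != 0) {
                    if (errno == EAGAIN && retries-- > 0)
                            continue;       /* temporary, e.g. slab allocation failed */
                    return -errno;          /* e.g. ENOMEM: max_map_count exceeded */
            }
            return 0;
    }

    int main(void)
    {
            size_t len = 1UL << 20;
            void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

            if (p == MAP_FAILED) {
                    perror("mmap");
                    return 1;
            }
            printf("free_range: %d\n", free_range(p, len));
            return 0;
    }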

    Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1701241431120.42507@chino.kir.corp.google.com
    Signed-off-by: David Rientjes
    Cc: Jonathan Corbet
    Cc: Johannes Weiner
    Cc: Jerome Marchand
    Cc: "Kirill A. Shutemov"
    Cc: Michael Kerrisk
    Cc: Anshuman Khandual
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • When a non-cooperative userfaultfd monitor copies pages in the
    background, it may encounter regions that were already unmapped. The
    addition of UFFD_EVENT_UNMAP allows the uffd monitor to track changes
    in the virtual memory layout precisely.

    Since there might be different uffd contexts for the affected VMAs, we
    should first create a temporary representation of the unmap event for
    each uffd context and then deliver the notifications, one by one, to
    the appropriate userfault file descriptors.

    The event notification occurs after the mmap_sem has been released.
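
    For reference, a hedged sketch of how the monitor side might consume
    such an event (structure and constant names as in linux/userfaultfd.h;
    the userfaultfd setup with the unmap-event feature enabled is assumed
    to have been done elsewhere):

    #include <stdio.h>
    #include <unistd.h>
    #include <linux/userfaultfd.h>

    /* Drain one message from an already-initialized userfault fd. */
    static void handle_one_event(int uffd)
    {
            struct uffd_msg msg;

            if (read(uffd, &msg, sizeof(msg)) != sizeof(msg))
                    return;

            switch (msg.event) {
            case UFFD_EVENT_UNMAP:
                    /* Stop background copies into [start, end): it is gone. */
                    printf("unmapped: %llx-%llx\n",
                           (unsigned long long)msg.arg.remove.start,
                           (unsigned long long)msg.arg.remove.end);
                    break;
            case UFFD_EVENT_PAGEFAULT:
                    /* Resolve with UFFDIO_COPY / UFFDIO_ZEROPAGE as usual. */
                    break;
            default:
                    break;
            }
    }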

    [arnd@arndb.de: fix nommu build]
    Link: http://lkml.kernel.org/r/20170203165141.3665284-1-arnd@arndb.de
    [mhocko@suse.com: fix nommu build]
    Link: http://lkml.kernel.org/r/20170202091503.GA22823@dhcp22.suse.cz
    Link: http://lkml.kernel.org/r/1485542673-24387-3-git-send-email-rppt@linux.vnet.ibm.com
    Signed-off-by: Mike Rapoport
    Signed-off-by: Michal Hocko
    Signed-off-by: Arnd Bergmann
    Acked-by: Hillf Danton
    Cc: Andrea Arcangeli
    Cc: "Dr. David Alan Gilbert"
    Cc: Mike Kravetz
    Cc: Pavel Emelyanov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     
  • Patch series "userfaultfd: non-cooperative: better tracking for mapping
    changes", v2.

    These patches try to address issues I've encountered during integration
    of userfaultfd with CRIU.

    The previously added userfaultfd events for fork(), madvise() and
    mremap() unfortunately do not cover all possible changes to a process's
    virtual memory layout that the uffd monitor needs to see.

    When one or more VMAs are removed from the process mm, the external uffd
    monitor has no way to detect those changes and will attempt to fill the
    removed regions with userfaultfd_copy.

    Another problematic event is the exit() of the process. Here again, the
    external uffd monitor will try to use userfaultfd_copy, although the mm
    owning the memory is already gone.

    The first patch in the series is a minor cleanup and it's not strictly
    related to the rest of the series.

    Patches 2 and 3 below add UFFD_EVENT_UNMAP and UFFD_EVENT_EXIT to allow
    the uffd monitor to track changes in the memory layout of a process.

    Patches 4 and 5 amend the error codes returned by userfaultfd_copy so
    that the uffd monitor is able to cope with races that might occur
    between the delivery of unmap and exit events and outstanding
    userfaultfd_copy's.

    This patch (of 5):

    Commit dc0ef0df7b6a ("mm: make mmap_sem for write waits killable for mm
    syscalls") replaced the call to vm_munmap in the munmap syscall with an
    open-coded version, to allow different waits on mmap_sem in the munmap
    syscall and in vm_munmap.

    Now both functions use down_write_killable, so we can restore the call
    to vm_munmap from the munmap system call.

    Link: http://lkml.kernel.org/r/1485542673-24387-2-git-send-email-rppt@linux.vnet.ibm.com
    Signed-off-by: Mike Rapoport
    Acked-by: Hillf Danton
    Cc: Andrea Arcangeli
    Cc: "Dr. David Alan Gilbert"
    Cc: Mike Kravetz
    Cc: Pavel Emelyanov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     
  • mmap_init() is no longer associated with the VMA slab, so fix it up.

    Link: http://lkml.kernel.org/r/1485182601-9294-1-git-send-email-iamyooon@gmail.com
    Signed-off-by: seokhoon.yoon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    seokhoon.yoon
     
  • ->fault(), ->page_mkwrite(), and ->pfn_mkwrite() calls do not need to
    take both a vma and a vmf parameter when the vma is already available
    in vmf.

    Remove the vma parameter to simplify things.
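
    A hypothetical driver-style handler after the change, reaching the vma
    through the fault structure (sketch only; the my_drv_* names are made
    up and not part of this patch):

    #include <linux/mm.h>

    static int my_drv_fault(struct vm_fault *vmf)
    {
            /* The vma is reachable through vmf now, not a separate argument. */
            struct vm_area_struct *vma = vmf->vma;

            pr_debug("fault at %#lx, vma [%#lx, %#lx)\n",
                     vmf->address, vma->vm_start, vma->vm_end);

            /* ... look up or allocate the page for vmf->pgoff, install it ... */
            return VM_FAULT_SIGBUS;         /* placeholder result in this sketch */
    }

    static const struct vm_operations_struct my_drv_vm_ops = {
            .fault = my_drv_fault,
    };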

    [arnd@arndb.de: fix ARM build]
    Link: http://lkml.kernel.org/r/20170125223558.1451224-1-arnd@arndb.de
    Link: http://lkml.kernel.org/r/148521301778.19116.10840599906674778980.stgit@djiang5-desk3.ch.intel.com
    Signed-off-by: Dave Jiang
    Signed-off-by: Arnd Bergmann
    Reviewed-by: Ross Zwisler
    Cc: Theodore Ts'o
    Cc: Darrick J. Wong
    Cc: Matthew Wilcox
    Cc: Dave Hansen
    Cc: Christoph Hellwig
    Cc: Jan Kara
    Cc: Dan Williams
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Jiang
     

23 Feb, 2017

1 commit

  • On 32-bit powerpc the ELF PLT sections of binaries (built with
    --bss-plt, or with a toolchain which defaults to it) look like this:

    [17] .sbss NOBITS 0002aff8 01aff8 000014 00 WA 0 0 4
    [18] .plt NOBITS 0002b00c 01aff8 000084 00 WAX 0 0 4
    [19] .bss NOBITS 0002b090 01aff8 0000a4 00 WA 0 0 4

    Which results in an ELF load header:

    Type Offset VirtAddr PhysAddr FileSiz MemSiz Flg Align
    LOAD 0x019c70 0x00029c70 0x00029c70 0x01388 0x014c4 RWE 0x10000

    This is all correct: the load region containing the PLT is marked as
    executable. Note that the PLT starts at 0002b00c but the file mapping
    ends at 0002aff8, so the PLT falls in the zero-fill section described by
    the load header, and after a page boundary.

    Unfortunately the generic ELF loader ignores the X bit in the load
    headers when it creates the 0 filled non-file backed mappings. It
    assumes all of these mappings are RW BSS sections, which is not the case
    for PPC.

    gcc/ld has an option (--secure-plt) to not do this, but it is said to
    incur a small performance penalty.

    Currently, to support 32-bit binaries with the PLT in BSS, the kernel
    maps the *entire brk area* with executable rights for all binaries,
    even --secure-plt ones.

    Stop doing that.

    Teach the ELF loader to check the X bit in the relevant load header and
    create 0 filled anonymous mappings that are executable if the load
    header requests that.

    Test program showing the difference in /proc/$PID/maps:

    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main() {
            char buf[16*1024];
            char *p = malloc(123); /* make "[heap]" mapping appear */
            int fd = open("/proc/self/maps", O_RDONLY);
            int len = read(fd, buf, sizeof(buf));
            write(1, buf, len);
            printf("%p\n", p);
            return 0;
    }

    Compiled using: gcc -mbss-plt -m32 -Os test.c -o test

    Unpatched ppc64 kernel:
    00100000-00120000 r-xp 00000000 00:00 0 [vdso]
    0fe10000-0ffd0000 r-xp 00000000 fd:00 67898094 /usr/lib/libc-2.17.so
    0ffd0000-0ffe0000 r--p 001b0000 fd:00 67898094 /usr/lib/libc-2.17.so
    0ffe0000-0fff0000 rw-p 001c0000 fd:00 67898094 /usr/lib/libc-2.17.so
    10000000-10010000 r-xp 00000000 fd:00 100674505 /home/user/test
    10010000-10020000 r--p 00000000 fd:00 100674505 /home/user/test
    10020000-10030000 rw-p 00010000 fd:00 100674505 /home/user/test
    10690000-106c0000 rwxp 00000000 00:00 0 [heap]
    f7f70000-f7fa0000 r-xp 00000000 fd:00 67898089 /usr/lib/ld-2.17.so
    f7fa0000-f7fb0000 r--p 00020000 fd:00 67898089 /usr/lib/ld-2.17.so
    f7fb0000-f7fc0000 rw-p 00030000 fd:00 67898089 /usr/lib/ld-2.17.so
    ffa90000-ffac0000 rw-p 00000000 00:00 0 [stack]
    0x10690008

    Patched ppc64 kernel:
    00100000-00120000 r-xp 00000000 00:00 0 [vdso]
    0fe10000-0ffd0000 r-xp 00000000 fd:00 67898094 /usr/lib/libc-2.17.so
    0ffd0000-0ffe0000 r--p 001b0000 fd:00 67898094 /usr/lib/libc-2.17.so
    0ffe0000-0fff0000 rw-p 001c0000 fd:00 67898094 /usr/lib/libc-2.17.so
    10000000-10010000 r-xp 00000000 fd:00 100674505 /home/user/test
    10010000-10020000 r--p 00000000 fd:00 100674505 /home/user/test
    10020000-10030000 rw-p 00010000 fd:00 100674505 /home/user/test
    10180000-101b0000 rw-p 00000000 00:00 0 [heap]
    ^^^^ this has changed
    f7c60000-f7c90000 r-xp 00000000 fd:00 67898089 /usr/lib/ld-2.17.so
    f7c90000-f7ca0000 r--p 00020000 fd:00 67898089 /usr/lib/ld-2.17.so
    f7ca0000-f7cb0000 rw-p 00030000 fd:00 67898089 /usr/lib/ld-2.17.so
    ff860000-ff890000 rw-p 00000000 00:00 0 [stack]
    0x10180008

    The patch was originally posted in 2012 by Jason Gunthorpe
    and apparently ignored:

    https://lkml.org/lkml/2012/9/30/138

    Lightly run-tested.

    Link: http://lkml.kernel.org/r/20161215131950.23054-1-dvlasenk@redhat.com
    Signed-off-by: Jason Gunthorpe
    Signed-off-by: Denys Vlasenko
    Acked-by: Kees Cook
    Acked-by: Michael Ellerman
    Tested-by: Jason Gunthorpe
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: "Aneesh Kumar K.V"
    Cc: Oleg Nesterov
    Cc: Florian Weimer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Denys Vlasenko
     

20 Feb, 2017

1 commit


25 Dec, 2016

1 commit


08 Oct, 2016

6 commits

  • The old code was always doing:

    vma->vm_end = next->vm_end
    vma_rb_erase(next) // in __vma_unlink
    vma->vm_next = next->vm_next // in __vma_unlink
    next = vma->vm_next
    vma_gap_update(next)

    The new code still does the above for remove_next == 1 and 2, but for
    remove_next == 3 it has been changed and it does:

    next->vm_start = vma->vm_start
    vma_rb_erase(vma) // in __vma_unlink
    vma_gap_update(next)

    In the latter case, while unlinking "vma", validate_mm_rb() is told to
    ignore the "vma" that is being removed, but next->vm_start was reduced
    instead. So for the new case, to avoid a false positive from
    validate_mm_rb(), it should be "next" that is ignored when "vma" is
    being unlinked.

    "vma" and "next" in the above refer to their pre-swap() values.

    Link: http://lkml.kernel.org/r/1474492522-2261-4-git-send-email-aarcange@redhat.com
    Signed-off-by: Andrea Arcangeli
    Tested-by: Shaun Tancheff
    Cc: Rik van Riel
    Cc: Hugh Dickins
    Cc: Mel Gorman
    Cc: Jan Vorlicek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • There are three cases, not two.

    Link: http://lkml.kernel.org/r/1474492522-2261-3-git-send-email-aarcange@redhat.com
    Signed-off-by: Andrea Arcangeli
    Cc: Rik van Riel
    Cc: Hugh Dickins
    Cc: Mel Gorman
    Cc: Jan Vorlicek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • If next were NULL we could not reach such a code path.

    Link: http://lkml.kernel.org/r/1474309513-20313-2-git-send-email-aarcange@redhat.com
    Signed-off-by: Andrea Arcangeli
    Cc: Rik van Riel
    Cc: Hugh Dickins
    Cc: Mel Gorman
    Cc: Jan Vorlicek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • The rmap_walk can access vm_page_prot (and potentially vm_flags in the
    pte/pmd manipulations). So it is not safe to wait for the caller to
    update vm_page_prot/vm_flags after vma_merge has returned, potentially
    having removed the "next" vma and extended the "current" vma over the
    next->vm_start,vm_end range, but still with the "current" vma's
    vm_page_prot, after releasing the rmap locks.

    The vm_page_prot/vm_flags must be transferred from the "next" vma to
    the current vma while vma_merge still holds the rmap locks.

    The side effect of this race condition is pte corruption during
    migration: remove_migration_ptes, when run on an address of the "next"
    vma that got removed, used the vm_page_prot of the current vma.

    migrate                                    mprotect
    -------                                    --------
    migrating in "next" vma
                                               vma_merge() # removes "next" vma and
                                                           # extends "current" vma
                                                           # current vma is not with
                                                           # vm_page_prot updated
    remove_migration_ptes
    read vm_page_prot of current "vma"
    establish pte with wrong permissions
                                               vm_set_page_prot(vma) # too late!
                                               change_protection in the old vma range
                                               only, next range is not updated

    This caused segmentation faults and potentially memory corruption in
    heavy mprotect loads with some light page migration caused by compaction
    in the background.

    Hugh Dickins pointed out the comment about the odd case 8 in vma_merge,
    which confirms that case 8 is the only buggy case where the race can
    trigger; in all other vma_merge cases the above cannot happen.

    This fix removes the oddness factor from case 8 and it converts it from:

        AAAA
    PPPPNNNNXXXX -> PPPPNNNNNNNN

    to:

        AAAA
    PPPPNNNNXXXX -> PPPPXXXXXXXX

    XXXX has the right vma properties for the whole merged vma returned by
    vma_adjust, so it solves the problem fully. It has the added benefit
    that the callers could stop updating vma properties when vma_merge
    succeeds; however, the callers are not updated by this patch (there are
    bits like VM_SOFTDIRTY that still need special care for the whole range,
    as the vma merging ignores them, but as long as they're not processed by
    rmap walks and instead are accessed with the mmap_sem held at least for
    reading, they are fine not to be updated within vma_adjust before
    releasing the rmap_locks).

    Link: http://lkml.kernel.org/r/1474309513-20313-1-git-send-email-aarcange@redhat.com
    Signed-off-by: Andrea Arcangeli
    Reported-by: Aditya Mandaleeka
    Cc: Rik van Riel
    Cc: Hugh Dickins
    Cc: Mel Gorman
    Cc: Jan Vorlicek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • mm->highest_vm_end doesn't need any update.

    After finally removing the oddness from vma_merge case 8 that was
    causing:

    1) constant risk of trouble whenever anybody would check vma fields
    from rmap_walks, like it happened when page migration was
    introduced and it read the vma->vm_page_prot from a rmap_walk

    2) the callers of vma_merge to re-initialize any value different from
    the current vma, instead of vma_merge() more reliably returning a
    vma that already matches all fields passed as parameter

    .. it is also worth taking the opportunity to clean up superfluous code
    in vma_adjust() that, if not removed, adds to the poor readability of
    the function.

    Link: http://lkml.kernel.org/r/1474492522-2261-5-git-send-email-aarcange@redhat.com
    Signed-off-by: Andrea Arcangeli
    Cc: Rik van Riel
    Cc: Hugh Dickins
    Cc: Mel Gorman
    Cc: Jan Vorlicek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • vma->vm_page_prot is read locklessly from the rmap_walk and may be
    updated concurrently; this change prevents the risk of reading
    intermediate values.

    Link: http://lkml.kernel.org/r/1474660305-19222-1-git-send-email-aarcange@redhat.com
    Signed-off-by: Andrea Arcangeli
    Cc: Rik van Riel
    Cc: Hugh Dickins
    Cc: Mel Gorman
    Cc: Jan Vorlicek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     

04 Oct, 2016

1 commit

  • Pull x86 vdso updates from Ingo Molnar:
    "The main changes in this cycle centered around adding support for
    32-bit compatible C/R of the vDSO on 64-bit kernels, by Dmitry
    Safonov"

    * 'x86-vdso-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    x86/vdso: Use CONFIG_X86_X32_ABI to enable vdso prctl
    x86/vdso: Only define map_vdso_randomized() if CONFIG_X86_64
    x86/vdso: Only define prctl_map_vdso() if CONFIG_CHECKPOINT_RESTORE
    x86/signal: Add SA_{X32,IA32}_ABI sa_flags
    x86/ptrace: Down with test_thread_flag(TIF_IA32)
    x86/coredump: Use pr_reg size, rather that TIF_IA32 flag
    x86/arch_prctl/vdso: Add ARCH_MAP_VDSO_*
    x86/vdso: Replace calculate_addr in map_vdso() with addr
    x86/vdso: Unmap vdso blob on vvar mapping failure

    Linus Torvalds
     

15 Sep, 2016

1 commit

  • Add an API to change the vdso blob type with arch_prctl.
    As this is useful only for the needs of CRIU, expose
    this interface under CONFIG_CHECKPOINT_RESTORE.
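
    Rough usage from a restore tool might look like this (a sketch: the
    ARCH_MAP_VDSO_* constants come from asm/prctl.h on sufficiently new
    kernel headers, and the destination address here is arbitrary):

    #include <stdio.h>
    #include <unistd.h>
    #include <sys/syscall.h>
    #include <asm/prctl.h>          /* ARCH_MAP_VDSO_32, ARCH_MAP_VDSO_64, ... */

    int main(void)
    {
            unsigned long addr = 0x700000000000UL;  /* arbitrary destination */

            /* Ask the kernel to map the 32-bit vDSO blob at the given address. */
            if (syscall(SYS_arch_prctl, ARCH_MAP_VDSO_32, addr) != 0) {
                    perror("arch_prctl(ARCH_MAP_VDSO_32)");
                    return 1;
            }
            return 0;
    }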

    Signed-off-by: Dmitry Safonov
    Acked-by: Andy Lutomirski
    Cc: 0x7f454c46@gmail.com
    Cc: oleg@redhat.com
    Cc: linux-mm@kvack.org
    Cc: gorcunov@openvz.org
    Cc: xemul@virtuozzo.com
    Link: http://lkml.kernel.org/r/20160905133308.28234-4-dsafonov@virtuozzo.com
    Signed-off-by: Thomas Gleixner

    Dmitry Safonov
     

26 Aug, 2016

1 commit

  • The ARMv8 architecture allows execute-only user permissions by clearing
    the PTE_UXN and PTE_USER bits. However, the kernel running on a CPU
    implementation without User Access Override (ARMv8.2 onwards) can still
    access such a page, so execute-only page permission does not protect
    against read(2)/write(2) etc. accesses. Systems requiring such
    protection must enable features like SECCOMP.

    This patch changes the arm64 __P100 and __S100 protection_map[] macros
    to the new __PAGE_EXECONLY attributes. A side effect is that
    pte_user() no longer triggers for __PAGE_EXECONLY since PTE_USER isn't
    set. To work around this, the check is done on the PTE_NG bit via the
    pte_ng() macro. VM_READ is also checked now for page faults.
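
    A userspace sketch of the execute-only semantics and the caveat above
    (on CPUs without UAO the write(2) below, which makes the kernel read
    the page on the program's behalf, can still succeed, while a direct
    load from the program itself would fault):

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/mman.h>

    int main(void)
    {
            long page = sysconf(_SC_PAGESIZE);
            char *p = mmap(NULL, page, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

            if (p == MAP_FAILED)
                    return 1;
            memset(p, 0, page);

            /* Drop read/write: the mapping becomes execute-only (--x).
             * A direct load such as "char c = *p;" would now SIGSEGV. */
            if (mprotect(p, page, PROT_EXEC) != 0)
                    return 1;

            /* write(2) asks the kernel to read the page on our behalf; on
             * CPUs without UAO this may still succeed, which is exactly
             * the limitation described above. */
            if (write(STDOUT_FILENO, p, 16) < 0)
                    perror("write from exec-only page");
            else
                    fprintf(stderr, "kernel could still read the exec-only page\n");
            return 0;
    }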

    Reviewed-by: Will Deacon
    Signed-off-by: Catalin Marinas
    Signed-off-by: Will Deacon

    Catalin Marinas
     

03 Aug, 2016

1 commit

  • The vm_brk() alignment calculations should refuse to overflow. The ELF
    loader was depending on this, but it has been fixed now. No other
    unsafe callers have been found.

    Link: http://lkml.kernel.org/r/1468014494-25291-3-git-send-email-keescook@chromium.org
    Signed-off-by: Kees Cook
    Reported-by: Hector Marco-Gisbert
    Cc: Ismael Ripoll Ripoll
    Cc: Alexander Viro
    Cc: "Kirill A. Shutemov"
    Cc: Oleg Nesterov
    Cc: Chen Gang
    Cc: Michal Hocko
    Cc: Konstantin Khlebnikov
    Cc: Andrea Arcangeli
    Cc: Andrey Ryabinin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kees Cook
     

29 Jul, 2016

1 commit

  • There's one case when vma_adjust() expands the vma, overlapping with
    *two* next vmas. See case 6 of mprotect, described in the comment to
    vma_merge().

    To handle this (and only this) situation we iterate twice over the main
    part of the function. See "goto again".

    Vegard reported[1] that he sees an out-of-bounds access complaint from
    KASAN if anon_vma_clone() on the *second* iteration fails.

    This happens because we free 'next' vma by the end of first iteration
    and don't have a way to undo this if anon_vma_clone() fails on the
    second iteration.

    The solution is to do all required allocations upfront, before we touch
    vmas.

    The allocation on the second iteration is only required if the first
    two vmas don't have an anon_vma, but the third does. So we need, in
    total, one anon_vma_clone() call.

    It's easy to adjust 'exporter' to the third vma for such a case.

    [1] http://lkml.kernel.org/r/1469514843-23778-1-git-send-email-vegard.nossum@oracle.com

    Link: http://lkml.kernel.org/r/1469625255-126641-1-git-send-email-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Reported-by: Vegard Nossum
    Cc: Rik van Riel
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

27 Jul, 2016

3 commits

  • Provide a shmem_get_unmapped_area method in file_operations, called at
    mmap time to decide the mapping address. It could be conditional on
    CONFIG_TRANSPARENT_HUGEPAGE, but save #ifdefs in other places by making
    it unconditional.

    shmem_get_unmapped_area() first calls the usual mm->get_unmapped_area
    (which we treat as a black box, highly dependent on architecture and
    config and executable layout). Lots of conditions, and in most cases it
    just goes with the address that it chose; but when our huge stars are
    rightly aligned, yet that did not provide a suitable address, it goes
    back to ask for a larger arena, within which to align the mapping
    suitably.

    There have to be some direct calls to shmem_get_unmapped_area(), not via
    the file_operations: because of the way shmem_zero_setup() is called to
    create a shmem object late in the mmap sequence, when MAP_SHARED is
    requested with MAP_ANONYMOUS or /dev/zero. Though this only matters
    when /proc/sys/vm/shmem_huge has been set.
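
    In file_operations terms the wiring looks roughly like this (a
    kernel-context sketch of the hook described above, not a verbatim
    excerpt from the patch):

    /* Called back from get_unmapped_area() at mmap time to pick an address. */
    unsigned long shmem_get_unmapped_area(struct file *file, unsigned long addr,
                                          unsigned long len, unsigned long pgoff,
                                          unsigned long flags);

    static const struct file_operations shmem_file_operations = {
            .mmap              = shmem_mmap,
            .get_unmapped_area = shmem_get_unmapped_area,
            /* ... */
    };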

    Link: http://lkml.kernel.org/r/1466021202-61880-29-git-send-email-kirill.shutemov@linux.intel.com
    Signed-off-by: Hugh Dickins
    Signed-off-by: Kirill A. Shutemov

    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • As with anon THP, we only mlock file huge pages if we can prove that
    the page is not mapped with PTEs. This way we can avoid an mlock leak
    into a non-mlocked vma on split.

    We rely on PageDoubleMap() under lock_page() to check whether the page
    may be PTE-mapped. PG_double_map is set by page_add_file_rmap() when
    the page is mapped with PTEs.

    Link: http://lkml.kernel.org/r/1466021202-61880-21-git-send-email-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • vma_adjust_trans_huge() splits the pmd if it crosses a VMA boundary.
    During the split we munlock the huge page, which requires an rmap walk;
    rmap wants to take the lock on its own.

    Let's move vma_adjust_trans_huge() outside i_mmap_rwsem to fix this.

    Link: http://lkml.kernel.org/r/1466021202-61880-19-git-send-email-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

08 Jul, 2016

1 commit

  • Add possibility for 32-bit user-space applications to move
    the vDSO mapping.

    Previously, when a user-space app called mremap() for the vDSO
    address, in the syscall return path it would land on the previous
    address of the vDSO page, resulting in a segmentation violation.

    Now it lands fine and returns to userspace with a remapped vDSO.

    This will also fix the context.vdso pointer for 64-bit, which does
    not affect the user of vDSO after mremap() currently, but this
    may change in the future.

    As suggested by Andy, return -EINVAL for mremap() that would
    split the vDSO image: that operation cannot possibly result in
    a working system so reject it.
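
    A rough 64-bit userspace demonstration of the now-permitted operation
    (a sketch with minimal error handling: find the [vdso] range in
    /proc/self/maps, reserve a destination, and move the whole image there):

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>

    int main(void)
    {
            unsigned long start = 0, end = 0;
            char line[256];
            FILE *f = fopen("/proc/self/maps", "r");

            while (f && fgets(line, sizeof(line), f)) {
                    if (strstr(line, "[vdso]")) {
                            sscanf(line, "%lx-%lx", &start, &end);
                            break;
                    }
            }
            if (f)
                    fclose(f);
            if (!start)
                    return 1;

            /* Reserve a destination, then move the *whole* vDSO onto it;
             * a partial move would now be rejected with -EINVAL. */
            void *dst = mmap(NULL, end - start, PROT_NONE,
                             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            if (dst == MAP_FAILED)
                    return 1;

            void *moved = mremap((void *)start, end - start, end - start,
                                 MREMAP_MAYMOVE | MREMAP_FIXED, dst);
            if (moved == MAP_FAILED) {
                    perror("mremap vdso");
                    return 1;
            }
            printf("vdso moved from %#lx to %p\n", start, moved);
            return 0;
    }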

    Renamed and moved the text_mapping structure declaration inside
    map_vdso(), as it is used only there and it now complements the
    vvar_mapping variable.

    There is still a problem for remapping the vDSO in glibc
    applications: the linker relocates addresses for syscalls
    on the vDSO page, so you need to relink with the new
    addresses.

    Without that the next syscall through glibc may fail:

    Program received signal SIGSEGV, Segmentation fault.
    #0 0xf7fd9b80 in __kernel_vsyscall ()
    #1 0xf7ec8238 in _exit () from /usr/lib32/libc.so.6

    Signed-off-by: Dmitry Safonov
    Acked-by: Andy Lutomirski
    Cc: 0x7f454c46@gmail.com
    Cc: Borislav Petkov
    Cc: Brian Gerst
    Cc: Denys Vlasenko
    Cc: H. Peter Anvin
    Cc: Josh Poimboeuf
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-mm@kvack.org
    Link: http://lkml.kernel.org/r/20160628113539.13606-2-dsafonov@virtuozzo.com
    Signed-off-by: Ingo Molnar

    Dmitry Safonov
     

28 May, 2016

1 commit

  • The do_brk() and vm_brk() return value was "unsigned long" and returned
    the starting address on success, and an error value on failure. The
    reasons are entirely historical, and go back to it basically behaving
    like the mmap() interface does.

    However, nobody actually wanted that interface, and it causes totally
    pointless IS_ERR_VALUE() confusion.

    What every single caller actually wants is just the simpler integer
    return of zero for success and negative error number on failure.

    So just convert to that much clearer and more common calling convention,
    and get rid of all the IS_ERR_VALUE() uses wrt vm_brk().

    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

24 May, 2016

4 commits

  • Now that all the callers handle vm_brk failure, we can change it to
    wait for mmap_sem in a killable fashion, to help the oom_reaper not get
    blocked just because vm_brk gets blocked behind mmap_sem readers.

    Signed-off-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Cc: "Kirill A. Shutemov"
    Cc: Oleg Nesterov
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Almost all current users of vm_munmap ignore the return value and so
    do not handle potential errors. This means that some VMAs might stay
    behind. This patch doesn't try to solve those potential problems.
    Quite the contrary, it adds a new failure mode by using
    down_write_killable in vm_munmap. This should be safer than other
    failure modes, though, because the process is guaranteed to die as soon
    as it leaves the kernel and exit_mmap will clean up the whole address
    space.

    This will help in the OOM conditions when the oom victim might be stuck
    waiting for the mmap_sem for write which in turn can block oom_reaper
    which relies on the mmap_sem for read to make a forward progress and
    reclaim the address space of the victim.

    Signed-off-by: Michal Hocko
    Cc: Oleg Nesterov
    Cc: "Kirill A. Shutemov"
    Cc: Konstantin Khlebnikov
    Cc: Andrea Arcangeli
    Cc: Alexander Viro
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • All the callers of vm_mmap seem to check for failure already and bail
    out in one way or another on error, which means that we can change it
    to use the killable version of vm_mmap_pgoff and return -EINTR if the
    current task gets killed while waiting for mmap_sem. This also means
    that vm_mmap_pgoff can be killable by default and the additional
    parameter can be dropped.

    This will help in the OOM conditions when the oom victim might be stuck
    waiting for the mmap_sem for write which in turn can block oom_reaper
    which relies on the mmap_sem for read to make a forward progress and
    reclaim the address space of the victim.

    Please note that load_elf_binary ignores the vm_mmap error for the
    current->personality & MMAP_PAGE_ZERO case, but that shouldn't be a
    problem because the address is not used anywhere and we never return to
    userspace if we got killed.

    Signed-off-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Cc: "Kirill A. Shutemov"
    Cc: Mel Gorman
    Cc: Oleg Nesterov
    Cc: Andrea Arcangeli
    Cc: Al Viro
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • This is a follow-up to the oom_reaper work [1]. As the async OOM
    killing depends on taking mmap_sem for read, we would really appreciate
    it if a holder for write didn't stand in the way. This patchset changes
    many of the down_write calls to be killable, to help those cases when
    the writer is blocked waiting for readers to release the lock, and so
    help __oom_reap_task to process the oom victim.

    Most of the patches are really trivial because the lock is held from
    shallow syscall paths where we can return EINTR trivially and allow the
    current task to die (note that EINTR will never get to userspace as the
    task has a fatal signal pending). Others seem to be easy as well, as
    the callers are already handling fatal errors and bail out and return
    to userspace, which should be sufficient to handle the failure
    gracefully. I am not familiar with all those code paths so a deeper
    review is really appreciated.

    As this work is touching more areas which are not directly connected I
    have tried to keep the CC list as small as possible and people who I
    believed would be familiar are CCed only to the specific patches (all
    should have received the cover though).

    This patchset is based on linux-next and it depends on
    down_write_killable for rw_semaphores which got merged into tip
    locking/rwsem branch and it is merged into this next tree. I guess it
    would be easiest to route these patches via mmotm because of the
    dependency on the tip tree but if respective maintainers prefer other
    way I have no objections.

    I haven't covered all the down_write(mm->mmap_sem) instances here

    $ git grep "down_write(.*\)" next/master | wc -l
    98
    $ git grep "down_write(.*\)" | wc -l
    62

    I have tried to cover those which should be relatively easy to review in
    this series because this alone should be a nice improvement. Other
    places can be changed on top.

    [0] http://lkml.kernel.org/r/1456752417-9626-1-git-send-email-mhocko@kernel.org
    [1] http://lkml.kernel.org/r/1452094975-551-1-git-send-email-mhocko@kernel.org
    [2] http://lkml.kernel.org/r/1456750705-7141-1-git-send-email-mhocko@kernel.org

    This patch (of 18):

    This is the first step in making mmap_sem write waiters killable. It
    focuses on the trivial ones which take the lock early after entering
    the syscall and do not change any state before that.

    Therefore it is very easy to change them to use down_write_killable and
    immediately return with -EINTR. This will allow the waiter to pass away
    without blocking the mmap_sem, which might be required to make forward
    progress. E.g. the oom reaper will need the lock for reading to
    dismantle the OOM victim's address space.

    The only tricky function in this patch is vm_mmap_pgoff, which has many
    call sites via vm_mmap. To reduce the risk, keep vm_mmap with the
    original non-killable semantics for now.

    vm_munmap callers do not bother checking the return value, so open code
    it into the munmap syscall path for now for simplicity.
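
    The per-call-site conversion is mostly mechanical; a hedged kernel-style
    sketch of the pattern (the lock was still called mm->mmap_sem at the
    time, and the helper name is hypothetical):

    /* Hypothetical syscall-path helper showing the converted locking pattern. */
    static int modify_address_space(struct mm_struct *mm)
    {
            int ret;

            if (down_write_killable(&mm->mmap_sem))
                    return -EINTR;  /* fatal signal pending: back out, task will die */

            ret = 0;                /* ... the actual address-space change elided ... */

            up_write(&mm->mmap_sem);
            return ret;
    }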

    Signed-off-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Cc: Mel Gorman
    Cc: "Kirill A. Shutemov"
    Cc: Konstantin Khlebnikov
    Cc: Hugh Dickins
    Cc: Andrea Arcangeli
    Cc: David Rientjes
    Cc: Dave Hansen
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

21 May, 2016

1 commit

  • Since commit 84638335900f ("mm: rework virtual memory accounting")
    RLIMIT_DATA limits both brk() and private mmap(), but this is disabled
    by default because of incompatibility with older versions of valgrind.

    Valgrind always sets the limit to zero and fails if RLIMIT_DATA is
    enabled. Fortunately it changes only rlim_cur and keeps rlim_max for
    reverting the limit back when needed.

    This patch checks current usage also against rlim_max if rlim_cur is
    zero. This is safe because the task can anyway increase rlim_cur up to
    rlim_max. The size of brk is still checked against rlim_cur, so this
    part is completely compatible - zero rlim_cur forbids brk() but allows
    private mmap().
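
    For reference, the two values involved as seen from userspace (a small
    runnable sketch):

    #include <stdio.h>
    #include <sys/resource.h>

    int main(void)
    {
            struct rlimit r;

            if (getrlimit(RLIMIT_DATA, &r) != 0) {
                    perror("getrlimit");
                    return 1;
            }
            /* A valgrind-style setup lowers rlim_cur to 0 but leaves rlim_max
             * alone; with this patch mmap() usage is also checked against
             * rlim_max when rlim_cur is zero, so such setups keep working. */
            printf("RLIMIT_DATA soft=%llu hard=%llu\n",
                   (unsigned long long)r.rlim_cur,
                   (unsigned long long)r.rlim_max);
            return 0;
    }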

    Link: http://lkml.kernel.org/r/56A28613.5070104@de.ibm.com
    Signed-off-by: Konstantin Khlebnikov
    Acked-by: Linus Torvalds
    Cc: Cyrill Gorcunov
    Cc: Christian Borntraeger
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     

20 May, 2016

1 commit


21 Mar, 2016

1 commit

  • Pull x86 protection key support from Ingo Molnar:
    "This tree adds support for a new memory protection hardware feature
    that is available in upcoming Intel CPUs: 'protection keys' (pkeys).

    There's a background article at LWN.net:

    https://lwn.net/Articles/643797/

    The gist is that protection keys allow the encoding of
    user-controllable permission masks in the pte. So instead of having a
    fixed protection mask in the pte (which needs a system call to change
    and works on a per-page basis), the user can map a (handful of)
    protection mask variants and can change the masks at runtime relatively
    cheaply, without having to change every single page in the affected
    virtual memory range.

    This allows the dynamic switching of the protection bits of large
    amounts of virtual memory, via user-space instructions. It also
    allows more precise control of MMU permission bits: for example the
    executable bit is separate from the read bit (see more about that
    below).

    This tree adds the MM infrastructure and low level x86 glue needed for
    that, plus it adds a high level API to make use of protection keys -
    if a user-space application calls:

    mmap(..., PROT_EXEC);

    or

    mprotect(ptr, sz, PROT_EXEC);

    (note PROT_EXEC-only, without PROT_READ/WRITE), the kernel will notice
    this special case, and will set a special protection key on this
    memory range. It also sets the appropriate bits in the Protection
    Keys User Rights (PKRU) register so that the memory becomes unreadable
    and unwritable.

    So using protection keys the kernel is able to implement 'true'
    PROT_EXEC on x86 CPUs: without protection keys PROT_EXEC implies
    PROT_READ as well. Unreadable executable mappings have security
    advantages: they cannot be read via information leaks to figure out
    ASLR details, nor can they be scanned for ROP gadgets - and they
    cannot be used by exploits for data purposes either.

    We know of no user-space code that relies on pure PROT_EXEC mappings
    today, but binary loaders could start making use of this new feature
    to map binaries and libraries in a more secure fashion.

    There is other pending pkeys work that offers more high level system
    call APIs to manage protection keys - but those are not part of this
    pull request.

    Right now there's a Kconfig that controls this feature
    (CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS) that is default enabled
    (like most x86 CPU feature enablement code that has no runtime
    overhead), but it's not user-configurable at the moment. If there's
    any serious problem with this then we can make it configurable and/or
    flip the default"

    * 'mm-pkeys-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (38 commits)
    x86/mm/pkeys: Fix mismerge of protection keys CPUID bits
    mm/pkeys: Fix siginfo ABI breakage caused by new u64 field
    x86/mm/pkeys: Fix access_error() denial of writes to write-only VMA
    mm/core, x86/mm/pkeys: Add execute-only protection keys support
    x86/mm/pkeys: Create an x86 arch_calc_vm_prot_bits() for VMA flags
    x86/mm/pkeys: Allow kernel to modify user pkey rights register
    x86/fpu: Allow setting of XSAVE state
    x86/mm: Factor out LDT init from context init
    mm/core, x86/mm/pkeys: Add arch_validate_pkey()
    mm/core, arch, powerpc: Pass a protection key in to calc_vm_flag_bits()
    x86/mm/pkeys: Actually enable Memory Protection Keys in the CPU
    x86/mm/pkeys: Add Kconfig prompt to existing config option
    x86/mm/pkeys: Dump pkey from VMA in /proc/pid/smaps
    x86/mm/pkeys: Dump PKRU with other kernel registers
    mm/core, x86/mm/pkeys: Differentiate instruction fetches
    x86/mm/pkeys: Optimize fault handling in access_error()
    mm/core: Do not enforce PKEY permissions on remote mm access
    um, pkeys: Add UML arch_*_access_permitted() methods
    mm/gup, x86/mm/pkeys: Check VMAs and PTEs for protection keys
    x86/mm/gup: Simplify get_user_pages() PTE bit handling
    ...
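
    As a concrete illustration of the per-key permission masks mentioned
    above: PKRU carries two bits per key, access-disable and write-disable,
    which a small helper can compute (illustrative only, not a kernel or
    glibc API):

    #include <stdio.h>

    /* Illustrative only: PKRU has two bits per key, bit 2*key disables
     * access (AD) and bit 2*key+1 disables writes (WD). */
    static unsigned int pkru_bits(int key, int disable_access, int disable_write)
    {
            return ((unsigned int)disable_access << (2 * key)) |
                   ((unsigned int)disable_write  << (2 * key + 1));
    }

    int main(void)
    {
            /* Make key 1 read-only and key 2 completely inaccessible. */
            unsigned int pkru = pkru_bits(1, 0, 1) | pkru_bits(2, 1, 1);

            printf("PKRU value to load: %#x\n", pkru);   /* prints 0x38 */
            return 0;
    }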

    Linus Torvalds
     

18 Mar, 2016

3 commits

  • Kernel style prefers a single string over split strings when the string is
    'user-visible'.

    Miscellanea:

    - Add a missing newline
    - Realign arguments

    Signed-off-by: Joe Perches
    Acked-by: Tejun Heo [percpu]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Perches
     
  • Currently we have two copies of the same code which implements memory
    overcommitment logic. Let's move it into mm/util.c and hence avoid
    duplication. No functional changes here.

    Signed-off-by: Andrey Ryabinin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Ryabinin
     
  • The max_map_count sysctl is unrelated to the scheduler. Move its bits
    from include/linux/sched/sysctl.h to include/linux/mm.h.

    Signed-off-by: Andrey Ryabinin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Ryabinin
     

07 Mar, 2016

1 commit


19 Feb, 2016

1 commit

  • Grazvydas Ignotas has reported a regression in remap_file_pages()
    emulation.

    Testcase:
    #define _GNU_SOURCE
    #include <assert.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/mman.h>

    #define SIZE (4096 * 3)

    int main(int argc, char **argv)
    {
            unsigned long *p;
            long i;

            p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
                     MAP_SHARED | MAP_ANONYMOUS, -1, 0);
            if (p == MAP_FAILED) {
                    perror("mmap");
                    return -1;
            }

            for (i = 0; i < SIZE / 4096; i++)
                    p[i * 4096 / sizeof(*p)] = i;

            if (remap_file_pages(p, 4096, 0, 1, 0)) {
                    perror("remap_file_pages");
                    return -1;
            }

            if (remap_file_pages(p, 4096 * 2, 0, 1, 0)) {
                    perror("remap_file_pages");
                    return -1;
            }

            assert(p[0] == 1);

            munmap(p, SIZE);

            return 0;
    }

    The second remap_file_pages() fails with -EINVAL.

    The reason is that the remap_file_pages() emulation assumes that the
    target vma covers the whole area we want to over-map. That assumption
    is broken by the first remap_file_pages() call: it splits the area into
    two vmas.

    The solution is to check the next adjacent vmas and whether they map
    the same file with the same flags.

    Fixes: c8d78c1823f4 ("mm: replace remap_file_pages() syscall with emulation")
    Signed-off-by: Kirill A. Shutemov
    Reported-by: Grazvydas Ignotas
    Tested-by: Grazvydas Ignotas
    Cc: [4.0+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov