08 Oct, 2016

6 commits

  • The old code was always doing:

    vma->vm_end = next->vm_end
    vma_rb_erase(next) // in __vma_unlink
    vma->vm_next = next->vm_next // in __vma_unlink
    next = vma->vm_next
    vma_gap_update(next)

    The new code still does the above for remove_next == 1 and 2, but for
    remove_next == 3 it has been changed to do:

    next->vm_start = vma->vm_start
    vma_rb_erase(vma) // in __vma_unlink
    vma_gap_update(next)

    In the latter case, while unlinking "vma", validate_mm_rb() is told to
    ignore the "vma" that is being removed, but it is next->vm_start that
    was reduced. So for the new case, to avoid a false positive from
    validate_mm_rb(), it is "next" that must be ignored while "vma" is
    being unlinked.

    "vma" and "next" in the above are to be read with their pre-swap()
    meanings.

    Link: http://lkml.kernel.org/r/1474492522-2261-4-git-send-email-aarcange@redhat.com
    Signed-off-by: Andrea Arcangeli
    Tested-by: Shaun Tancheff
    Cc: Rik van Riel
    Cc: Hugh Dickins
    Cc: Mel Gorman
    Cc: Jan Vorlicek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • There are three cases, not two.

    Link: http://lkml.kernel.org/r/1474492522-2261-3-git-send-email-aarcange@redhat.com
    Signed-off-by: Andrea Arcangeli
    Cc: Rik van Riel
    Cc: Hugh Dickins
    Cc: Mel Gorman
    Cc: Jan Vorlicek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • If next were NULL we couldn't reach such a code path.

    Link: http://lkml.kernel.org/r/1474309513-20313-2-git-send-email-aarcange@redhat.com
    Signed-off-by: Andrea Arcangeli
    Cc: Rik van Riel
    Cc: Hugh Dickins
    Cc: Mel Gorman
    Cc: Jan Vorlicek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • The rmap_walk can access vm_page_prot (and potentially vm_flags in the
    pte/pmd manipulations). So it's not safe to wait for the caller to
    update vm_page_prot/vm_flags after vma_merge has returned, potentially
    having removed the "next" vma and extended the "current" vma over the
    next->vm_start,vm_end range but still with the "current" vma's
    vm_page_prot, after releasing the rmap locks.

    The vm_page_prot/vm_flags must be transferred from the "next" vma to the
    current vma while vma_merge still holds the rmap locks.

    The side effect of this race condition is pte corruption during
    migration: remove_migration_ptes, when run on an address of the "next"
    vma that got removed, used the vm_page_prot of the current vma.

    migrate                                 mprotect
    ------------                            -------------
    migrating in "next" vma
                                            vma_merge() # removes "next" vma and
                                                        # extends "current" vma
                                                        # current vma is not with
                                                        # vm_page_prot updated
    remove_migration_ptes
    read vm_page_prot of current "vma"
    establish pte with wrong permissions
                                            vm_set_page_prot(vma) # too late!
                                            change_protection in the old vma range
                                            only, next range is not updated

    This caused segmentation faults and potentially memory corruption in
    heavy mprotect loads with some light page migration caused by compaction
    in the background.

    Hugh Dickins pointed out the comment about the Odd case 8 in vma_merge,
    which confirms that case 8 is the only buggy one where the race can
    trigger; in all other vma_merge cases the above cannot happen.

    This fix removes the oddness factor from case 8 and it converts it from:

        AAAA
    PPPPNNNNXXXX -> PPPPNNNNNNNN

    to:

        AAAA
    PPPPNNNNXXXX -> PPPPXXXXXXXX

    XXXX has the right vma properties for the whole merged vma returned by
    vma_adjust, so it solves the problem fully. It has the added benefit
    that the callers could stop updating vma properties when vma_merge
    succeeds; however, the callers are not updated by this patch (there are
    bits like VM_SOFTDIRTY that still need special care for the whole range,
    as the vma merging ignores them, but as long as they're not processed by
    rmap walks and instead are accessed with the mmap_sem held at least for
    reading, they are fine not being updated within vma_adjust before
    releasing the rmap_locks).

    Link: http://lkml.kernel.org/r/1474309513-20313-1-git-send-email-aarcange@redhat.com
    Signed-off-by: Andrea Arcangeli
    Reported-by: Aditya Mandaleeka
    Cc: Rik van Riel
    Cc: Hugh Dickins
    Cc: Mel Gorman
    Cc: Jan Vorlicek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • mm->highest_vm_end doesn't need any update.

    After finally removing the oddness from vma_merge case 8 that was
    causing:

    1) a constant risk of trouble whenever anybody checked vma fields from
    rmap_walks, as happened when page migration was introduced and read
    vma->vm_page_prot from a rmap_walk

    2) the callers of vma_merge having to re-initialize any value that
    differed from the current vma, instead of vma_merge() more reliably
    returning a vma that already matches all fields passed as parameters

    .. it is also worth taking the opportunity to clean up superfluous code
    in vma_adjust() which, if left in place, makes the function harder to
    read.

    Link: http://lkml.kernel.org/r/1474492522-2261-5-git-send-email-aarcange@redhat.com
    Signed-off-by: Andrea Arcangeli
    Cc: Rik van Riel
    Cc: Hugh Dickins
    Cc: Mel Gorman
    Cc: Jan Vorlicek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • vma->vm_page_prot is read locklessly from the rmap_walk; it may be
    updated concurrently, and this prevents the risk of reading
    intermediate values.
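
    A minimal sketch of the implied pattern, assuming the standard
    WRITE_ONCE()/READ_ONCE() helpers (illustrative, not the exact hunks of
    this patch; "newprot" is just a placeholder value):

    /* writer side, with mmap_sem held for write: publish the new
     * protection in a single store so a lockless reader never observes a
     * torn or intermediate value */
    WRITE_ONCE(vma->vm_page_prot, newprot);

    /* lockless reader side, e.g. from an rmap walk: load it exactly once */
    pgprot_t prot = READ_ONCE(vma->vm_page_prot);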

    Link: http://lkml.kernel.org/r/1474660305-19222-1-git-send-email-aarcange@redhat.com
    Signed-off-by: Andrea Arcangeli
    Cc: Rik van Riel
    Cc: Hugh Dickins
    Cc: Mel Gorman
    Cc: Jan Vorlicek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     

04 Oct, 2016

1 commit

  • Pull x86 vdso updates from Ingo Molnar:
    "The main changes in this cycle centered around adding support for
    32-bit compatible C/R of the vDSO on 64-bit kernels, by Dmitry
    Safonov"

    * 'x86-vdso-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    x86/vdso: Use CONFIG_X86_X32_ABI to enable vdso prctl
    x86/vdso: Only define map_vdso_randomized() if CONFIG_X86_64
    x86/vdso: Only define prctl_map_vdso() if CONFIG_CHECKPOINT_RESTORE
    x86/signal: Add SA_{X32,IA32}_ABI sa_flags
    x86/ptrace: Down with test_thread_flag(TIF_IA32)
    x86/coredump: Use pr_reg size, rather that TIF_IA32 flag
    x86/arch_prctl/vdso: Add ARCH_MAP_VDSO_*
    x86/vdso: Replace calculate_addr in map_vdso() with addr
    x86/vdso: Unmap vdso blob on vvar mapping failure

    Linus Torvalds
     

15 Sep, 2016

1 commit

  • Add an API to change the vDSO blob type with arch_prctl(). As this is
    useful only for the needs of CRIU, expose this interface under
    CONFIG_CHECKPOINT_RESTORE.
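
    A hedged user-space sketch of how such an interface can be exercised
    (ARCH_MAP_VDSO_64 and its value are an assumption taken from
    asm/prctl.h; the target address is arbitrary, and a process that
    already has a vDSO mapped may simply get an error here):

    #include <stdio.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    #ifndef ARCH_MAP_VDSO_64
    #define ARCH_MAP_VDSO_64 0x2003   /* assumption: value from asm/prctl.h */
    #endif

    int main(void)
    {
            unsigned long addr = 0x700000000000UL;   /* hypothetical hint */

            if (syscall(SYS_arch_prctl, ARCH_MAP_VDSO_64, addr) < 0) {
                    perror("arch_prctl(ARCH_MAP_VDSO_64)");
                    return 1;
            }
            printf("vDSO mapped near %#lx\n", addr);
            return 0;
    }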

    Signed-off-by: Dmitry Safonov
    Acked-by: Andy Lutomirski
    Cc: 0x7f454c46@gmail.com
    Cc: oleg@redhat.com
    Cc: linux-mm@kvack.org
    Cc: gorcunov@openvz.org
    Cc: xemul@virtuozzo.com
    Link: http://lkml.kernel.org/r/20160905133308.28234-4-dsafonov@virtuozzo.com
    Signed-off-by: Thomas Gleixner

    Dmitry Safonov
     

26 Aug, 2016

1 commit

  • The ARMv8 architecture allows execute-only user permissions by clearing
    the PTE_UXN and PTE_USER bits. However, the kernel running on a CPU
    implementation without User Access Override (ARMv8.2 onwards) can still
    access such a page, so execute-only page permission does not protect
    against read(2)/write(2) etc. accesses. Systems requiring such
    protection must enable features like SECCOMP.

    This patch changes the arm64 __P100 and __S100 protection_map[] macros
    to the new __PAGE_EXECONLY attributes. A side effect is that
    pte_user() no longer triggers for __PAGE_EXECONLY since PTE_USER isn't
    set. To work around this, the check is done on the PTE_NG bit via the
    pte_ng() macro. VM_READ is also checked now for page faults.
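
    A minimal user-space illustration of requesting such a mapping (a
    sketch, not from the patch; on arm64 with this change the mapping gets
    the __PAGE_EXECONLY attributes, i.e. PTE_USER and PTE_UXN clear):

    #include <stdio.h>
    #include <sys/mman.h>

    int main(void)
    {
            void *p = mmap(NULL, 4096, PROT_EXEC,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

            if (p == MAP_FAILED) {
                    perror("mmap(PROT_EXEC)");
                    return 1;
            }
            printf("execute-only mapping at %p\n", p);
            return 0;
    }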

    Reviewed-by: Will Deacon
    Signed-off-by: Catalin Marinas
    Signed-off-by: Will Deacon

    Catalin Marinas
     

03 Aug, 2016

1 commit

  • The vm_brk() alignment calculations should refuse to overflow. The ELF
    loader was depending on this, but it has been fixed now. No other
    unsafe callers have been found.
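
    A minimal sketch of the kind of check this implies, assuming the usual
    PAGE_ALIGN() round-up (the function name is hypothetical and the body
    illustrative, not necessarily the exact hunk):

    static int do_brk_checked(unsigned long addr, unsigned long request)
    {
            unsigned long len = PAGE_ALIGN(request);

            if (len < request)      /* the round-up wrapped past the top */
                    return -ENOMEM;
            if (!len)
                    return 0;       /* nothing to do for an empty request */

            /* ... proceed with the real work on "addr"/"len" ... */
            return 0;
    }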

    Link: http://lkml.kernel.org/r/1468014494-25291-3-git-send-email-keescook@chromium.org
    Signed-off-by: Kees Cook
    Reported-by: Hector Marco-Gisbert
    Cc: Ismael Ripoll Ripoll
    Cc: Alexander Viro
    Cc: "Kirill A. Shutemov"
    Cc: Oleg Nesterov
    Cc: Chen Gang
    Cc: Michal Hocko
    Cc: Konstantin Khlebnikov
    Cc: Andrea Arcangeli
    Cc: Andrey Ryabinin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kees Cook
     

29 Jul, 2016

1 commit

  • There's one case when vma_adjust() expands the vma, overlapping with
    *two* next vmas. See case 6 of mprotect, described in the comment to
    vma_merge().

    To handle this (and only this) situation we iterate twice over main part
    of the function. See "goto again".

    Vegard reported[1] that he sees out-of-bounds access complain from
    KASAN, if anon_vma_clone() on the *second* iteration fails.

    This happens because we free 'next' vma by the end of first iteration
    and don't have a way to undo this if anon_vma_clone() fails on the
    second iteration.

    The solution is to do all required allocations upfront, before we touch
    vmas.

    The allocation on the second iteration is only required if the first
    two vmas don't have an anon_vma, but the third does. So we need, in
    total, one anon_vma_clone() call.

    It's easy to adjust 'exporter' to the third vma for such a case.

    [1] http://lkml.kernel.org/r/1469514843-23778-1-git-send-email-vegard.nossum@oracle.com

    Link: http://lkml.kernel.org/r/1469625255-126641-1-git-send-email-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Reported-by: Vegard Nossum
    Cc: Rik van Riel
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

27 Jul, 2016

3 commits

  • Provide a shmem_get_unmapped_area method in file_operations, called at
    mmap time to decide the mapping address. It could be conditional on
    CONFIG_TRANSPARENT_HUGEPAGE, but save #ifdefs in other places by making
    it unconditional.

    shmem_get_unmapped_area() first calls the usual mm->get_unmapped_area
    (which we treat as a black box, highly dependent on architecture and
    config and executable layout). Lots of conditions, and in most cases it
    just goes with the address that it chose; but when our huge stars are
    rightly aligned, yet that did not provide a suitable address, go back to
    ask for a larger arena, within which to align the mapping suitably.

    There have to be some direct calls to shmem_get_unmapped_area(), not via
    the file_operations: because of the way shmem_zero_setup() is called to
    create a shmem object late in the mmap sequence, when MAP_SHARED is
    requested with MAP_ANONYMOUS or /dev/zero. Though this only matters
    when /proc/sys/vm/shmem_huge has been set.

    Link: http://lkml.kernel.org/r/1466021202-61880-29-git-send-email-kirill.shutemov@linux.intel.com
    Signed-off-by: Hugh Dickins
    Signed-off-by: Kirill A. Shutemov

    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • As with anon THP, we only mlock file huge pages if we can prove that
    the page is not mapped with PTEs. This way we can avoid an mlock leak
    into a non-mlocked vma on split.

    We rely on PageDoubleMap() under lock_page() to check if the page may
    be PTE mapped. PG_double_map is set by page_add_file_rmap() when the
    page is mapped with PTEs.

    Link: http://lkml.kernel.org/r/1466021202-61880-21-git-send-email-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • vma_adjust_trans_huge() splits the pmd if it crosses a VMA boundary.
    During the split we munlock the huge page, which requires an rmap walk.
    rmap wants to take the lock on its own.

    Let's move vma_adjust_trans_huge() outside i_mmap_rwsem to fix this.

    Link: http://lkml.kernel.org/r/1466021202-61880-19-git-send-email-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

08 Jul, 2016

1 commit

  • Add possibility for 32-bit user-space applications to move
    the vDSO mapping.

    Previously, when a user-space app called mremap() for the vDSO
    address, in the syscall return path it would land on the previous
    address of the vDSO page, resulting in a segmentation violation.

    Now it lands fine and returns to userspace with a remapped vDSO.

    This will also fix the context.vdso pointer for 64-bit, which does
    not affect the user of vDSO after mremap() currently, but this
    may change in the future.

    As suggested by Andy, return -EINVAL for mremap() that would
    split the vDSO image: that operation cannot possibly result in
    a working system so reject it.

    Renamed and moved the text_mapping structure declaration inside
    map_vdso(), as it is used only there, and now it complements the
    vvar_mapping variable.

    There is still a problem for remapping the vDSO in glibc
    applications: the linker relocates addresses for syscalls
    on the vDSO page, so you need to relink with the new
    addresses.

    Without that the next syscall through glibc may fail:

    Program received signal SIGSEGV, Segmentation fault.
    #0 0xf7fd9b80 in __kernel_vsyscall ()
    #1 0xf7ec8238 in _exit () from /usr/lib32/libc.so.6
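
    A hedged user-space sketch of moving the vDSO with mremap() (it locates
    the blob via /proc/self/maps; the fixed destination address is a
    hypothetical choice, and on 32-bit glibc the relocation caveat above
    still applies):

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>

    int main(void)
    {
            unsigned long start = 0, end = 0;
            char line[256];
            FILE *f = fopen("/proc/self/maps", "r");

            if (!f)
                    return 1;
            while (fgets(line, sizeof(line), f)) {
                    if (strstr(line, "[vdso]")) {
                            sscanf(line, "%lx-%lx", &start, &end);
                            break;
                    }
            }
            fclose(f);
            if (!start)
                    return 1;

            /* the whole blob must move at once: a partial remap is
             * rejected with -EINVAL, as described above */
            void *new = mremap((void *)start, end - start, end - start,
                               MREMAP_MAYMOVE | MREMAP_FIXED,
                               (void *)0x40000000);
            if (new == MAP_FAILED) {
                    perror("mremap(vdso)");
                    return 1;
            }
            printf("vDSO moved from %#lx to %p\n", start, new);
            return 0;
    }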

    Signed-off-by: Dmitry Safonov
    Acked-by: Andy Lutomirski
    Cc: 0x7f454c46@gmail.com
    Cc: Borislav Petkov
    Cc: Brian Gerst
    Cc: Denys Vlasenko
    Cc: H. Peter Anvin
    Cc: Josh Poimboeuf
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-mm@kvack.org
    Link: http://lkml.kernel.org/r/20160628113539.13606-2-dsafonov@virtuozzo.com
    Signed-off-by: Ingo Molnar

    Dmitry Safonov
     

28 May, 2016

1 commit

  • The do_brk() and vm_brk() return value was "unsigned long" and returned
    the starting address on success, and an error value on failure. The
    reasons are entirely historical, and go back to it basically behaving
    like the mmap() interface does.

    However, nobody actually wanted that interface, and it causes totally
    pointless IS_ERR_VALUE() confusion.

    What every single caller actually wants is just the simpler integer
    return of zero for success and negative error number on failure.

    So just convert to that much clearer and more common calling convention,
    and get rid of all the IS_ERR_VALUE() uses wrt vm_brk().
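
    A small sketch of the calling-convention change as seen by callers
    (assumed shape, not a specific call site; "base", "len" and "out" are
    placeholders):

    unsigned long addr;
    int error;

    /* before: address-or-error encoded in an unsigned long */
    addr = vm_brk(base, len);
    if (IS_ERR_VALUE(addr))
            goto out;

    /* after: plain int, 0 on success or a negative errno */
    error = vm_brk(base, len);
    if (error)
            goto out;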

    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

24 May, 2016

4 commits

  • Now that all the callers handle vm_brk failure we can change it to wait
    for mmap_sem killable, to help the oom_reaper not get blocked just
    because vm_brk gets blocked behind mmap_sem readers.

    Signed-off-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Cc: "Kirill A. Shutemov"
    Cc: Oleg Nesterov
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Almost all current users of vm_munmap are ignoring the return value and
    so they do not handle potential errors. This means that some VMAs
    might stay behind. This patch doesn't try to solve those potential
    problems. Quite the contrary, it adds a new failure mode by using
    down_write_killable in vm_munmap. This should be safer than other
    failure modes, though, because the process is guaranteed to die as soon
    as it leaves the kernel and exit_mmap will clean up the whole address
    space.

    This will help in the OOM conditions when the oom victim might be stuck
    waiting for the mmap_sem for write which in turn can block oom_reaper
    which relies on the mmap_sem for read to make a forward progress and
    reclaim the address space of the victim.

    Signed-off-by: Michal Hocko
    Cc: Oleg Nesterov
    Cc: "Kirill A. Shutemov"
    Cc: Konstantin Khlebnikov
    Cc: Andrea Arcangeli
    Cc: Alexander Viro
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • All the callers of vm_mmap seem to check for the failure already and
    bail out in one way or another on the error which means that we can
    change it to use killable version of vm_mmap_pgoff and return -EINTR if
    the current task gets killed while waiting for mmap_sem. This also
    means that vm_mmap_pgoff can be killable by default and drop the
    additional parameter.

    This will help in the OOM conditions when the oom victim might be stuck
    waiting for the mmap_sem for write which in turn can block oom_reaper
    which relies on the mmap_sem for read to make a forward progress and
    reclaim the address space of the victim.

    Please note that load_elf_binary is ignoring the vm_mmap error for the
    current->personality & MMAP_PAGE_ZERO case, but that shouldn't be a
    problem because the address is not used anywhere and we never return to
    the userspace if we got killed.

    Signed-off-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Cc: "Kirill A. Shutemov"
    Cc: Mel Gorman
    Cc: Oleg Nesterov
    Cc: Andrea Arcangeli
    Cc: Al Viro
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • This is a follow-up work for the oom_reaper [1]. As the async OOM
    killing depends on mmap_sem for read, we would really appreciate it if
    a holder for write didn't stand in the way. This patchset changes many
    of the down_write calls to be killable, to help those cases when the
    writer is blocked waiting for readers to release the lock and so help
    __oom_reap_task to process the oom victim.

    Most of the patches are really trivial because the lock is held from
    shallow syscall paths where we can return EINTR trivially and allow the
    current task to die (note that EINTR will never get to userspace as the
    task has a fatal signal pending). Others seem to be easy as well, as
    the callers are already handling fatal errors and bail out and return
    to userspace, which should be sufficient to handle the failure
    gracefully. I am not familiar with all those code paths so a deeper
    review is really appreciated.

    As this work is touching more areas which are not directly connected I
    have tried to keep the CC list as small as possible and people who I
    believed would be familiar are CCed only to the specific patches (all
    should have received the cover though).

    This patchset is based on linux-next and it depends on
    down_write_killable for rw_semaphores which got merged into tip
    locking/rwsem branch and it is merged into this next tree. I guess it
    would be easiest to route these patches via mmotm because of the
    dependency on the tip tree, but if the respective maintainers prefer
    another way I have no objections.

    I haven't covered all the down_write(mm->mmap_sem) instances here

    $ git grep "down_write(.*\)" next/master | wc -l
    98
    $ git grep "down_write(.*\)" | wc -l
    62

    I have tried to cover those which should be relatively easy to review in
    this series because this alone should be a nice improvement. Other
    places can be changed on top.

    [0] http://lkml.kernel.org/r/1456752417-9626-1-git-send-email-mhocko@kernel.org
    [1] http://lkml.kernel.org/r/1452094975-551-1-git-send-email-mhocko@kernel.org
    [2] http://lkml.kernel.org/r/1456750705-7141-1-git-send-email-mhocko@kernel.org

    This patch (of 18):

    This is the first step in making mmap_sem write waiters killable. It
    focuses on the trivial ones which take the lock early after entering
    the syscall and do not change any state before that.

    Therefore it is very easy to change them to use down_write_killable and
    immediately return with -EINTR. This will allow the waiter to pass away
    without blocking the mmap_sem which might be required to make a forward
    progress. E.g. the oom reaper will need the lock for reading to
    dismantle the OOM victim address space.

    The only tricky function in this patch is vm_mmap_pgoff which has many
    call sites via vm_mmap. To reduce the risk keep vm_mmap with the
    original non-killable semantic for now.

    vm_munmap callers do not bother checking the return value so open code
    it into the munmap syscall path for now for simplicity.
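
    A minimal sketch of the conversion pattern, assuming a shallow syscall
    path of the kind described above (illustrative only, not a specific
    hunk from the series):

    #include <linux/errno.h>
    #include <linux/mm_types.h>
    #include <linux/rwsem.h>

    static int example_mm_syscall_path(struct mm_struct *mm)
    {
            /* previously an uninterruptible down_write(&mm->mmap_sem) */
            if (down_write_killable(&mm->mmap_sem))
                    return -EINTR;  /* fatal signal pending: stop waiting
                                     * so the oom_reaper can make progress */

            /* ... modify the address space under the lock ... */

            up_write(&mm->mmap_sem);
            return 0;
    }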

    Signed-off-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Cc: Mel Gorman
    Cc: "Kirill A. Shutemov"
    Cc: Konstantin Khlebnikov
    Cc: Hugh Dickins
    Cc: Andrea Arcangeli
    Cc: David Rientjes
    Cc: Dave Hansen
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

21 May, 2016

1 commit

  • Since commit 84638335900f ("mm: rework virtual memory accounting")
    RLIMIT_DATA limits both brk() and private mmap(), but this is disabled
    by default because of incompatibility with older versions of valgrind.

    Valgrind always sets the limit to zero and fails if RLIMIT_DATA is
    enabled. Fortunately it changes only rlim_cur and keeps rlim_max, so
    the limit can be reverted when needed.

    This patch checks current usage also against rlim_max if rlim_cur is
    zero. This is safe because the task can anyway increase rlim_cur up to
    rlim_max. The size of brk is still checked against rlim_cur, so this
    part is completely compatible - zero rlim_cur forbids brk() but allows
    private mmap().
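
    A user-space sketch of the behaviour described above (assumed for
    illustration; rlim_max is typically unlimited):

    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/resource.h>
    #include <unistd.h>

    int main(void)
    {
            struct rlimit rl;

            getrlimit(RLIMIT_DATA, &rl);
            rl.rlim_cur = 0;               /* what older valgrind does */
            setrlimit(RLIMIT_DATA, &rl);   /* rlim_max stays untouched */

            /* private mmap() is now checked against rlim_max ... */
            void *p = mmap(NULL, 1 << 20, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            printf("mmap: %s\n", p == MAP_FAILED ? "failed" : "ok");

            /* ... while growing the brk is still forbidden by rlim_cur == 0 */
            printf("sbrk: %s\n", sbrk(1 << 20) == (void *)-1 ? "failed" : "ok");
            return 0;
    }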

    Link: http://lkml.kernel.org/r/56A28613.5070104@de.ibm.com
    Signed-off-by: Konstantin Khlebnikov
    Acked-by: Linus Torvalds
    Cc: Cyrill Gorcunov
    Cc: Christian Borntraeger
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     

20 May, 2016

1 commit


21 Mar, 2016

1 commit

  • Pull x86 protection key support from Ingo Molnar:
    "This tree adds support for a new memory protection hardware feature
    that is available in upcoming Intel CPUs: 'protection keys' (pkeys).

    There's a background article at LWN.net:

    https://lwn.net/Articles/643797/

    The gist is that protection keys allow the encoding of
    user-controllable permission masks in the pte. So instead of having a
    fixed protection mask in the pte (which needs a system call to change
    and works on a per page basis), the user can map a (handful of)
    protection mask variants and can change the masks runtime relatively
    cheaply, without having to change every single page in the affected
    virtual memory range.

    This allows the dynamic switching of the protection bits of large
    amounts of virtual memory, via user-space instructions. It also
    allows more precise control of MMU permission bits: for example the
    executable bit is separate from the read bit (see more about that
    below).

    This tree adds the MM infrastructure and low level x86 glue needed for
    that, plus it adds a high level API to make use of protection keys -
    if a user-space application calls:

    mmap(..., PROT_EXEC);

    or

    mprotect(ptr, sz, PROT_EXEC);

    (note PROT_EXEC-only, without PROT_READ/WRITE), the kernel will notice
    this special case, and will set a special protection key on this
    memory range. It also sets the appropriate bits in the Protection
    Keys User Rights (PKRU) register so that the memory becomes unreadable
    and unwritable.

    So using protection keys the kernel is able to implement 'true'
    PROT_EXEC on x86 CPUs: without protection keys PROT_EXEC implies
    PROT_READ as well. Unreadable executable mappings have security
    advantages: they cannot be read via information leaks to figure out
    ASLR details, nor can they be scanned for ROP gadgets - and they
    cannot be used by exploits for data purposes either.

    We know about no user-space code that relies on pure PROT_EXEC
    mappings today, but binary loaders could start making use of this new
    feature to map binaries and libraries in a more secure fashion.

    There is other pending pkeys work that offers more high level system
    call APIs to manage protection keys - but those are not part of this
    pull request.

    Right now there's a Kconfig that controls this feature
    (CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS) that is default enabled
    (like most x86 CPU feature enablement code that has no runtime
    overhead), but it's not user-configurable at the moment. If there's
    any serious problem with this then we can make it configurable and/or
    flip the default"

    * 'mm-pkeys-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (38 commits)
    x86/mm/pkeys: Fix mismerge of protection keys CPUID bits
    mm/pkeys: Fix siginfo ABI breakage caused by new u64 field
    x86/mm/pkeys: Fix access_error() denial of writes to write-only VMA
    mm/core, x86/mm/pkeys: Add execute-only protection keys support
    x86/mm/pkeys: Create an x86 arch_calc_vm_prot_bits() for VMA flags
    x86/mm/pkeys: Allow kernel to modify user pkey rights register
    x86/fpu: Allow setting of XSAVE state
    x86/mm: Factor out LDT init from context init
    mm/core, x86/mm/pkeys: Add arch_validate_pkey()
    mm/core, arch, powerpc: Pass a protection key in to calc_vm_flag_bits()
    x86/mm/pkeys: Actually enable Memory Protection Keys in the CPU
    x86/mm/pkeys: Add Kconfig prompt to existing config option
    x86/mm/pkeys: Dump pkey from VMA in /proc/pid/smaps
    x86/mm/pkeys: Dump PKRU with other kernel registers
    mm/core, x86/mm/pkeys: Differentiate instruction fetches
    x86/mm/pkeys: Optimize fault handling in access_error()
    mm/core: Do not enforce PKEY permissions on remote mm access
    um, pkeys: Add UML arch_*_access_permitted() methods
    mm/gup, x86/mm/pkeys: Check VMAs and PTEs for protection keys
    x86/mm/gup: Simplify get_user_pages() PTE bit handling
    ...

    Linus Torvalds
     

18 Mar, 2016

3 commits

  • Kernel style prefers a single string over split strings when the string is
    'user-visible'.

    Miscellanea:

    - Add a missing newline
    - Realign arguments

    Signed-off-by: Joe Perches
    Acked-by: Tejun Heo [percpu]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Perches
     
  • Currently we have two copies of the same code which implements memory
    overcommitment logic. Let's move it into mm/util.c and hence avoid
    duplication. No functional changes here.

    Signed-off-by: Andrey Ryabinin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Ryabinin
     
  • The max_map_count sysctl is unrelated to the scheduler. Move its bits
    from include/linux/sched/sysctl.h to include/linux/mm.h.

    Signed-off-by: Andrey Ryabinin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Ryabinin
     

07 Mar, 2016

1 commit


19 Feb, 2016

3 commits

  • Grazvydas Ignotas has reported a regression in remap_file_pages()
    emulation.

    Testcase:

    #define _GNU_SOURCE
    #include <assert.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/mman.h>

    #define SIZE (4096 * 3)

    int main(int argc, char **argv)
    {
            unsigned long *p;
            long i;

            p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
                     MAP_SHARED | MAP_ANONYMOUS, -1, 0);
            if (p == MAP_FAILED) {
                    perror("mmap");
                    return -1;
            }

            for (i = 0; i < SIZE / 4096; i++)
                    p[i * 4096 / sizeof(*p)] = i;

            if (remap_file_pages(p, 4096, 0, 1, 0)) {
                    perror("remap_file_pages");
                    return -1;
            }

            if (remap_file_pages(p, 4096 * 2, 0, 1, 0)) {
                    perror("remap_file_pages");
                    return -1;
            }

            assert(p[0] == 1);

            munmap(p, SIZE);

            return 0;
    }

    The second remap_file_pages() fails with -EINVAL.

    The reason is that the remap_file_pages() emulation assumes that the
    target vma covers the whole area we want to over-map. That assumption
    is broken by the first remap_file_pages() call: it splits the area into
    two vmas.

    The solution is to check the next adjacent vmas and see whether they
    map the same file with the same flags.

    Fixes: c8d78c1823f4 ("mm: replace remap_file_pages() syscall with emulation")
    Signed-off-by: Kirill A. Shutemov
    Reported-by: Grazvydas Ignotas
    Tested-by: Grazvydas Ignotas
    Cc: [4.0+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Protection keys provide new page-based protection in hardware.
    But, they have an interesting attribute: they only affect data
    accesses and never affect instruction fetches. That means that
    if we set up some memory which is set as "access-disabled" via
    protection keys, we can still execute from it.

    This patch uses protection keys to set up mappings to do just that.
    If a user calls:

    mmap(..., PROT_EXEC);
    or
    mprotect(ptr, sz, PROT_EXEC);

    (note PROT_EXEC-only without PROT_READ/WRITE), the kernel will
    notice this, and set a special protection key on the memory. It
    also sets the appropriate bits in the Protection Keys User Rights
    (PKRU) register so that the memory becomes unreadable and
    unwritable.

    I haven't found any userspace that does this today. With this
    facility in place, we expect userspace to move to use it
    eventually. Userspace _could_ start doing this today. Any
    PROT_EXEC calls get converted to PROT_READ inside the kernel, and
    would transparently be upgraded to "true" PROT_EXEC with this
    code. IOW, userspace never has to do any PROT_EXEC runtime
    detection.

    This feature provides enhanced protection against leaking
    executable memory contents. This helps thwart attacks which are
    attempting to find ROP gadgets on the fly.

    But, the security provided by this approach is not comprehensive.
    The PKRU register which controls access permissions is a normal
    user register writable from unprivileged userspace. An attacker
    who can execute the 'wrpkru' instruction can easily disable the
    protection provided by this feature.

    The protection key that is used for execute-only support is
    permanently dedicated at compile time. This is fine for now
    because there is currently no API to set a protection key other
    than this one.

    Despite there being a constant PKRU value across the entire
    system, we do not set it unless this feature is in use in a
    process. That is to preserve the PKRU XSAVE 'init state',
    which can lead to faster context switches.

    PKRU *is* a user register and the kernel is modifying it. That
    means that code doing:

    pkru = rdpkru()
    pkru |= 0x100;
    mmap(..., PROT_EXEC);
    wrpkru(pkru);

    could lose the bits in PKRU that enforce execute-only
    permissions. To avoid this, we suggest avoiding ever calling
    mmap() or mprotect() when the PKRU value is expected to be
    unstable.

    Signed-off-by: Dave Hansen
    Reviewed-by: Thomas Gleixner
    Cc: Andrea Arcangeli
    Cc: Andrew Morton
    Cc: Andy Lutomirski
    Cc: Andy Lutomirski
    Cc: Aneesh Kumar K.V
    Cc: Borislav Petkov
    Cc: Borislav Petkov
    Cc: Brian Gerst
    Cc: Chen Gang
    Cc: Dan Williams
    Cc: Dave Chinner
    Cc: Dave Hansen
    Cc: David Hildenbrand
    Cc: Denys Vlasenko
    Cc: H. Peter Anvin
    Cc: Kees Cook
    Cc: Kirill A. Shutemov
    Cc: Konstantin Khlebnikov
    Cc: Linus Torvalds
    Cc: Mel Gorman
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Piotr Kwapulinski
    Cc: Rik van Riel
    Cc: Stephen Smalley
    Cc: Vladimir Murzin
    Cc: Will Deacon
    Cc: keescook@google.com
    Cc: linux-kernel@vger.kernel.org
    Cc: linux-mm@kvack.org
    Link: http://lkml.kernel.org/r/20160212210240.CB4BB5CA@viggo.jf.intel.com
    Signed-off-by: Ingo Molnar

    Dave Hansen
     
  • This plumbs a protection key through calc_vm_flag_bits(). We
    could have done this in calc_vm_prot_bits(), but I did not feel
    super strongly which way to go. It was pretty arbitrary which
    one to use.

    Signed-off-by: Dave Hansen
    Reviewed-by: Thomas Gleixner
    Cc: Andrea Arcangeli
    Cc: Andrew Morton
    Cc: Andy Lutomirski
    Cc: Arve Hjønnevåg
    Cc: Benjamin Herrenschmidt
    Cc: Borislav Petkov
    Cc: Brian Gerst
    Cc: Chen Gang
    Cc: Dan Williams
    Cc: Dave Chinner
    Cc: Dave Hansen
    Cc: David Airlie
    Cc: Denys Vlasenko
    Cc: Eric W. Biederman
    Cc: Geliang Tang
    Cc: Greg Kroah-Hartman
    Cc: H. Peter Anvin
    Cc: Kirill A. Shutemov
    Cc: Konstantin Khlebnikov
    Cc: Leon Romanovsky
    Cc: Linus Torvalds
    Cc: Masahiro Yamada
    Cc: Maxime Coquelin
    Cc: Mel Gorman
    Cc: Michael Ellerman
    Cc: Oleg Nesterov
    Cc: Paul Gortmaker
    Cc: Paul Mackerras
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: Riley Andrews
    Cc: Vladimir Davydov
    Cc: devel@driverdev.osuosl.org
    Cc: linux-api@vger.kernel.org
    Cc: linux-arch@vger.kernel.org
    Cc: linux-kernel@vger.kernel.org
    Cc: linux-mm@kvack.org
    Cc: linuxppc-dev@lists.ozlabs.org
    Link: http://lkml.kernel.org/r/20160212210231.E6F1F0D6@viggo.jf.intel.com
    Signed-off-by: Ingo Molnar

    Dave Hansen
     

18 Feb, 2016

1 commit


16 Feb, 2016

1 commit


06 Feb, 2016

2 commits

  • The sequence vma_lock_anon_vma() - vma_unlock_anon_vma() isn't safe if
    the anon_vma appeared between lock and unlock. We have to check
    anon_vma first or call anon_vma_prepare() to be sure that it's there.
    There are only a few users of these legacy helpers. Let's get rid of
    them.

    This patch fixes the anon_vma lock imbalance in validate_mm(). A write
    lock isn't required here, a read lock is enough.

    It also reorders expand_downwards/expand_upwards: security_mmap_addr()
    and the wrapping-around check don't have to be under the anon_vma lock.
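
    A minimal sketch of the rule this enforces (illustrative fragment, not
    the actual hunks): make sure the anon_vma exists before taking its
    lock, so lock and unlock can never disagree about whether one was
    there.

    if (unlikely(anon_vma_prepare(vma)))
            return -ENOMEM;
    anon_vma_lock_write(vma->anon_vma);

    /* ... e.g. expand the stack vma ... */

    anon_vma_unlock_write(vma->anon_vma);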

    Link: https://lkml.kernel.org/r/CACT4Y+Y908EjM2z=706dv4rV6dWtxTLK9nFg9_7DhRMLppBo2g@mail.gmail.com
    Signed-off-by: Konstantin Khlebnikov
    Reported-by: Dmitry Vyukov
    Acked-by: Kirill A. Shutemov
    Cc: Andrea Arcangeli
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     
  • The mmap_sem held for reading in validate_mm called from expand_stack
    is not enough to prevent the augmented rbtree rb_subtree_gap
    information from changing under us, because expand_stack may be running
    concurrently from other threads which also hold the mmap_sem for
    reading.

    The augmented rbtree is updated with vma_gap_update under the
    page_table_lock, so use it in browse_rb() too to avoid false positives.

    Signed-off-by: Andrea Arcangeli
    Reported-by: Dmitry Vyukov
    Tested-by: Dmitry Vyukov
    Cc: Konstantin Khlebnikov
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     

04 Feb, 2016

1 commit

  • This patch provides a way of working around a slight regression
    introduced by commit 84638335900f ("mm: rework virtual memory
    accounting").

    Before that commit RLIMIT_DATA had control only over the size of the
    brk region. But that change has caused problems with all existing
    versions of valgrind, because they set RLIMIT_DATA to zero.

    This patch fixes the rlimit check (the limit is actually in bytes, not
    pages) and by default turns it into a warning which prints at the first
    VmData misuse:

    "mmap: top (795): VmData 516096 exceed data ulimit 512000. Will be forbidden soon."

    Behavior is controlled by the boot param ignore_rlimit_data=y/n and by
    the sysfs file /sys/module/kernel/parameters/ignore_rlimit_data. For
    now it is set to "y".

    [akpm@linux-foundation.org: tweak kernel-parameters.txt text]
    Signed-off-by: Konstantin Khlebnikov
    Link: http://lkml.kernel.org/r/20151228211015.GL2194@uranus
    Reported-by: Christian Borntraeger
    Cc: Cyrill Gorcunov
    Cc: Linus Torvalds
    Cc: Vegard Nossum
    Cc: Peter Zijlstra
    Cc: Vladimir Davydov
    Cc: Andy Lutomirski
    Cc: Quentin Casasnovas
    Cc: Kees Cook
    Cc: Willy Tarreau
    Cc: Pavel Emelyanov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     

29 Jan, 2016

1 commit


16 Jan, 2016

1 commit

  • Dmitry Vyukov has reported[1] a possible deadlock (triggered by his
    syzkaller fuzzer):

    Possible unsafe locking scenario:

    CPU0                                    CPU1
    ----                                    ----
    lock(&hugetlbfs_i_mmap_rwsem_key);
                                            lock(&mapping->i_mmap_rwsem);
                                            lock(&hugetlbfs_i_mmap_rwsem_key);
    lock(&mapping->i_mmap_rwsem);

    Both traces point to mm_take_all_locks() as a source of the problem.
    It doesn't take care of the ordering of hugetlbfs_i_mmap_rwsem_key (aka
    mapping->i_mmap_rwsem for a hugetlb mapping) vs. i_mmap_rwsem.

    huge_pmd_share() does memory allocation under hugetlbfs_i_mmap_rwsem_key
    and the allocator can take i_mmap_rwsem if it hits reclaim. So we need
    to take i_mmap_rwsem from all hugetlb VMAs before taking i_mmap_rwsem
    from the rest of the VMAs.

    The patch also documents locking order for hugetlbfs_i_mmap_rwsem_key.

    [1] http://lkml.kernel.org/r/CACT4Y+Zu95tBs-0EvdiAKzUOsb4tczRRfCRTpLr4bg_OP9HuVg@mail.gmail.com
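
    A sketch of the ordering this imposes (illustrative fragment; the
    helper names follow mm/mmap.c, but the loops are simplified):

    struct vm_area_struct *vma;

    /* hugetlb i_mmap_rwsem instances first ... */
    for (vma = mm->mmap; vma; vma = vma->vm_next)
            if (vma->vm_file && vma->vm_file->f_mapping &&
                is_vm_hugetlb_page(vma))
                    vm_lock_mapping(mm, vma->vm_file->f_mapping);

    /* ... then every other file mapping, so the two classes always nest
     * in the same order */
    for (vma = mm->mmap; vma; vma = vma->vm_next)
            if (vma->vm_file && vma->vm_file->f_mapping &&
                !is_vm_hugetlb_page(vma))
                    vm_lock_mapping(mm, vma->vm_file->f_mapping);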

    Signed-off-by: Kirill A. Shutemov
    Reported-by: Dmitry Vyukov
    Reviewed-by: Michal Hocko
    Cc: Peter Zijlstra
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

15 Jan, 2016

3 commits

  • When inspecting some vague code inside the prctl(PR_SET_MM_MEM) call
    (which tests the RLIMIT_DATA value to figure out if we're allowed to
    assign new @start_brk, @brk, @start_data, @end_data in mm_struct) it
    became clear that RLIMIT_DATA, in the form it's implemented now,
    doesn't do anything useful, because most user-space libraries use the
    mmap() syscall for dynamic memory allocations.

    Linus suggested converting the RLIMIT_DATA rlimit into something
    suitable for anonymous memory accounting. But in this patch we go
    further, and the changes are bundled together as:

    * keep vma counting if CONFIG_PROC_FS=n, will be used for limits
    * replace mm->shared_vm with better defined mm->data_vm
    * account anonymous executable areas as executable
    * account file-backed growsdown/up areas as stack
    * drop struct file* argument from vm_stat_account
    * enforce RLIMIT_DATA for size of data areas

    This way code looks cleaner: now code/stack/data classification depends
    only on vm_flags state:

    VM_EXEC & ~VM_WRITE -> code (VmExe + VmLib in proc)
    VM_GROWSUP | VM_GROWSDOWN -> stack (VmStk)
    VM_WRITE & ~VM_SHARED & !stack -> data (VmData)

    The rest (VmSize - VmData - VmStk - VmExe - VmLib) could be called
    "shared", but that might be a strange beast like a readonly-private or
    VM_IO area.

    - RLIMIT_AS limits whole address space "VmSize"
    - RLIMIT_STACK limits stack "VmStk" (but each vma individually)
    - RLIMIT_DATA now limits "VmData"

    Signed-off-by: Konstantin Khlebnikov
    Signed-off-by: Cyrill Gorcunov
    Cc: Quentin Casasnovas
    Cc: Vegard Nossum
    Acked-by: Linus Torvalds
    Cc: Willy Tarreau
    Cc: Andy Lutomirski
    Cc: Kees Cook
    Cc: Vladimir Davydov
    Cc: Pavel Emelyanov
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     
  • Address Space Layout Randomization (ASLR) provides a barrier to
    exploitation of user-space processes in the presence of security
    vulnerabilities by making it more difficult to find desired code/data
    which could help an attack. This is done by adding a random offset to
    the location of regions in the process address space, with a greater
    range of potential offset values corresponding to better protection/a
    larger search-space for brute force, but also to greater potential for
    fragmentation.

    The offset added to the mmap_base address, which provides the basis for
    the majority of the mappings for a process, is set once on process exec
    in arch_pick_mmap_layout() and is done via hard-coded per-arch values,
    which reflect, hopefully, the best compromise for all systems. The
    trade-off between increased entropy in the offset value generation and
    the corresponding increased variability in address space fragmentation
    is not absolute, however, and some platforms may tolerate higher amounts
    of entropy. This patch introduces both new Kconfig values and a sysctl
    interface which may be used to change the amount of entropy used for
    offset generation on a system.

    The direct motivation for this change was in response to the
    libstagefright vulnerabilities that affected Android, specifically to
    information provided by Google's project zero at:

    http://googleprojectzero.blogspot.com/2015/09/stagefrightened.html

    The attack presented therein, by Google's project zero, specifically
    targeted the limited randomness used to generate the offset added to the
    mmap_base address in order to craft a brute-force-based attack.
    Concretely, the attack was against the mediaserver process, which was
    limited to respawning every 5 seconds, on an arm device. The hard-coded
    8 bits used resulted in an average expected success rate of defeating
    the mmap ASLR after just over 10 minutes (128 tries at 5 seconds a
    piece). With this patch, and an accompanying increase in the entropy
    value to 16 bits, the same attack would take an average expected time of
    over 45 hours (32768 tries), which makes it both less feasible and more
    likely to be noticed.

    The introduced Kconfig and sysctl options are limited by per-arch
    minimum and maximum values, the minimum of which was chosen to match the
    current hard-coded value and the maximum of which was chosen so as to
    give the greatest flexibility without generating an invalid mmap_base
    address, generally 3-4 bits less than the number of bits in the
    user-space accessible virtual address space.

    When deciding whether or not to change the default value, a system
    developer should consider that the mmap_base address could be placed
    anywhere up to 2^(value) bits away from the non-randomized location,
    which would introduce variable-sized areas above and below the mmap_base
    address such that the maximum vm_area_struct size may be reduced,
    preventing very large allocations.

    This patch (of 4):

    ASLR only uses as few as 8 bits to generate the random offset for the
    mmap base address on 32 bit architectures. This value was chosen to
    prevent a poorly chosen value from dividing the address space in such a
    way as to prevent large allocations. This may not be an issue on all
    platforms. Allow the specification of a minimum number of bits so that
    platforms desiring greater ASLR protection may determine where to place
    the trade-off.

    Signed-off-by: Daniel Cashman
    Cc: Russell King
    Acked-by: Kees Cook
    Cc: Ingo Molnar
    Cc: Jonathan Corbet
    Cc: Don Zickus
    Cc: Eric W. Biederman
    Cc: Heinrich Schuchardt
    Cc: Josh Poimboeuf
    Cc: Kirill A. Shutemov
    Cc: Naoya Horiguchi
    Cc: Andrea Arcangeli
    Cc: Mel Gorman
    Cc: Thomas Gleixner
    Cc: David Rientjes
    Cc: Mark Salyzyn
    Cc: Jeff Vander Stoep
    Cc: Nick Kralevich
    Cc: Catalin Marinas
    Cc: Will Deacon
    Cc: "H. Peter Anvin"
    Cc: Hector Marco-Gisbert
    Cc: Borislav Petkov
    Cc: Ralf Baechle
    Cc: Heiko Carstens
    Cc: Martin Schwidefsky
    Cc: Benjamin Herrenschmidt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daniel Cashman
     
  • The following flag comparison in mmap_region makes no sense:

    if (!(vm_flags & MAP_FIXED))
            return -ENOMEM;

    The condition is always false and thus the above "return -ENOMEM" is
    never executed. The vm_flags must not be compared with MAP_FIXED flag.
    The vm_flags may only be compared with VM_* flags. MAP_FIXED has the
    same value as VM_MAYREAD.

    Hitting the rlimit is a slow path and find_vma_intersection should
    realize that there is no overlapping VMA for the !MAP_FIXED case pretty
    quickly.

    Signed-off-by: Piotr Kwapulinski
    Acked-by: Michal Hocko
    Cc: Oleg Nesterov
    Cc: Chris Metcalf
    Reviewed-by: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Piotr Kwapulinski