21 Mar, 2016

1 commit

  • Pull x86 protection key support from Ingo Molnar:
    "This tree adds support for a new memory protection hardware feature
    that is available in upcoming Intel CPUs: 'protection keys' (pkeys).

    There's a background article at LWN.net:

    https://lwn.net/Articles/643797/

    The gist is that protection keys allow the encoding of
    user-controllable permission masks in the pte. So instead of having a
    fixed protection mask in the pte (which needs a system call to change
    and works on a per page basis), the user can map a (handful of)
    protection mask variants and can change the masks at runtime relatively
    cheaply, without having to change every single page in the affected
    virtual memory range.

    This allows the dynamic switching of the protection bits of large
    amounts of virtual memory, via user-space instructions. It also
    allows more precise control of MMU permission bits: for example the
    executable bit is separate from the read bit (see more about that
    below).

    This tree adds the MM infrastructure and low level x86 glue needed for
    that, plus it adds a high level API to make use of protection keys -
    if a user-space application calls:

    mmap(..., PROT_EXEC);

    or

    mprotect(ptr, sz, PROT_EXEC);

    (note PROT_EXEC-only, without PROT_READ/WRITE), the kernel will notice
    this special case, and will set a special protection key on this
    memory range. It also sets the appropriate bits in the Protection
    Keys User Rights (PKRU) register so that the memory becomes unreadable
    and unwritable.

    So using protection keys the kernel is able to implement 'true'
    PROT_EXEC on x86 CPUs: without protection keys PROT_EXEC implies
    PROT_READ as well. Unreadable executable mappings have security
    advantages: they cannot be read via information leaks to figure out
    ASLR details, nor can they be scanned for ROP gadgets - and they
    cannot be used by exploits for data purposes either.
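
    As a rough user-space sketch of that special case (assuming a
    pkeys-capable CPU and a kernel with this tree merged; on older
    hardware PROT_EXEC still implies PROT_READ and the mapping stays
    readable):

    #include <stdio.h>
    #include <sys/mman.h>

    int main(void)
    {
            /* PROT_EXEC only -- deliberately no PROT_READ/PROT_WRITE */
            void *p = mmap(NULL, 4096, PROT_EXEC,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            if (p == MAP_FAILED) {
                    perror("mmap");
                    return 1;
            }
            printf("execute-only mapping at %p\n", p);
            /* with pkeys enforced, a data read such as *(char *)p
             * would now be killed with SIGSEGV */
            munmap(p, 4096);
            return 0;
    }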

    We know of no user-space code that relies on pure PROT_EXEC
    mappings today, but binary loaders could start making use of this new
    feature to map binaries and libraries in a more secure fashion.

    There is other pending pkeys work that offers more high level system
    call APIs to manage protection keys - but those are not part of this
    pull request.

    Right now there's a Kconfig that controls this feature
    (CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS) that is default enabled
    (like most x86 CPU feature enablement code that has no runtime
    overhead), but it's not user-configurable at the moment. If there's
    any serious problem with this then we can make it configurable and/or
    flip the default"

    * 'mm-pkeys-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (38 commits)
    x86/mm/pkeys: Fix mismerge of protection keys CPUID bits
    mm/pkeys: Fix siginfo ABI breakage caused by new u64 field
    x86/mm/pkeys: Fix access_error() denial of writes to write-only VMA
    mm/core, x86/mm/pkeys: Add execute-only protection keys support
    x86/mm/pkeys: Create an x86 arch_calc_vm_prot_bits() for VMA flags
    x86/mm/pkeys: Allow kernel to modify user pkey rights register
    x86/fpu: Allow setting of XSAVE state
    x86/mm: Factor out LDT init from context init
    mm/core, x86/mm/pkeys: Add arch_validate_pkey()
    mm/core, arch, powerpc: Pass a protection key in to calc_vm_flag_bits()
    x86/mm/pkeys: Actually enable Memory Protection Keys in the CPU
    x86/mm/pkeys: Add Kconfig prompt to existing config option
    x86/mm/pkeys: Dump pkey from VMA in /proc/pid/smaps
    x86/mm/pkeys: Dump PKRU with other kernel registers
    mm/core, x86/mm/pkeys: Differentiate instruction fetches
    x86/mm/pkeys: Optimize fault handling in access_error()
    mm/core: Do not enforce PKEY permissions on remote mm access
    um, pkeys: Add UML arch_*_access_permitted() methods
    mm/gup, x86/mm/pkeys: Check VMAs and PTEs for protection keys
    x86/mm/gup: Simplify get_user_pages() PTE bit handling
    ...

    Linus Torvalds
     

18 Mar, 2016

3 commits

  • Kernel style prefers a single string over split strings when the string is
    'user-visible' (a small illustration follows the notes below).

    Miscellanea:

    - Add a missing newline
    - Realign arguments
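
    A small illustration of the rule, using printf as a stand-in for
    printk (the messages are hypothetical, not from the patch):

    #include <stdio.h>

    int main(void)
    {
            /* preferred: a single string, so the full message can be
             * grepped for in the sources */
            printf("mmap: data limit exceeded\n");

            /* discouraged: split string; grepping for the complete
             * message finds nothing */
            printf("mmap: data "
                   "limit exceeded\n");
            return 0;
    }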

    Signed-off-by: Joe Perches
    Acked-by: Tejun Heo [percpu]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Perches
     
  • Currently we have two copies of the same code which implements memory
    overcommitment logic. Let's move it into mm/util.c and hence avoid
    duplication. No functional changes here.

    Signed-off-by: Andrey Ryabinin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Ryabinin
     
  • The max_map_count sysctl is unrelated to the scheduler. Move its bits from
    include/linux/sched/sysctl.h to include/linux/mm.h.

    Signed-off-by: Andrey Ryabinin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Ryabinin
     

07 Mar, 2016

1 commit


19 Feb, 2016

3 commits

  • Grazvydas Ignotas has reported a regression in remap_file_pages()
    emulation.

    Testcase:
    #define _GNU_SOURCE
    #include <assert.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/mman.h>

    #define SIZE (4096 * 3)

    int main(int argc, char **argv)
    {
            unsigned long *p;
            long i;

            p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
                     MAP_SHARED | MAP_ANONYMOUS, -1, 0);
            if (p == MAP_FAILED) {
                    perror("mmap");
                    return -1;
            }

            for (i = 0; i < SIZE / 4096; i++)
                    p[i * 4096 / sizeof(*p)] = i;

            if (remap_file_pages(p, 4096, 0, 1, 0)) {
                    perror("remap_file_pages");
                    return -1;
            }

            if (remap_file_pages(p, 4096 * 2, 0, 1, 0)) {
                    perror("remap_file_pages");
                    return -1;
            }

            assert(p[0] == 1);

            munmap(p, SIZE);

            return 0;
    }

    The second remap_file_pages() fails with -EINVAL.

    The reason is that the remap_file_pages() emulation assumes that the
    target vma covers the whole area we want to remap. That assumption is
    broken by the first remap_file_pages() call: it splits the area into
    two vmas.

    The solution is to check the adjacent vmas and see whether they map the
    same file with the same flags.

    Fixes: c8d78c1823f4 ("mm: replace remap_file_pages() syscall with emulation")
    Signed-off-by: Kirill A. Shutemov
    Reported-by: Grazvydas Ignotas
    Tested-by: Grazvydas Ignotas
    Cc: [4.0+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Protection keys provide new page-based protection in hardware.
    But, they have an interesting attribute: they only affect data
    accesses and never affect instruction fetches. That means that
    if we set up some memory which is set as "access-disabled" via
    protection keys, we can still execute from it.

    This patch uses protection keys to set up mappings to do just that.
    If a user calls:

    mmap(..., PROT_EXEC);
    or
    mprotect(ptr, sz, PROT_EXEC);

    (note PROT_EXEC-only without PROT_READ/WRITE), the kernel will
    notice this, and set a special protection key on the memory. It
    also sets the appropriate bits in the Protection Keys User Rights
    (PKRU) register so that the memory becomes unreadable and
    unwritable.

    I haven't found any userspace that does this today. With this
    facility in place, we expect userspace to move to use it
    eventually. Userspace _could_ start doing this today. Any
    PROT_EXEC calls get converted to PROT_READ inside the kernel, and
    would transparently be upgraded to "true" PROT_EXEC with this
    code. IOW, userspace never has to do any PROT_EXEC runtime
    detection.

    This feature provides enhanced protection against leaking
    executable memory contents. This helps thwart attacks which are
    attempting to find ROP gadgets on the fly.

    But, the security provided by this approach is not comprehensive.
    The PKRU register which controls access permissions is a normal
    user register writable from unprivileged userspace. An attacker
    who can execute the 'wrpkru' instruction can easily disable the
    protection provided by this feature.

    The protection key that is used for execute-only support is
    permanently dedicated at compile time. This is fine for now
    because there is currently no API to set a protection key other
    than this one.

    Despite there being a constant PKRU value across the entire
    system, we do not set it unless this feature is in use in a
    process. That is to preserve the PKRU XSAVE 'init state',
    which can lead to faster context switches.

    PKRU *is* a user register and the kernel is modifying it. That
    means that code doing:

            pkru = rdpkru();
            pkru |= 0x100;
            mmap(..., PROT_EXEC);
            wrpkru(pkru);

    could lose the bits in PKRU that enforce execute-only
    permissions. To avoid this, we suggest never calling mmap() or
    mprotect() while the PKRU value is expected to be unstable.
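
    A hedged sketch of that guidance, using the x86 PKRU intrinsics from
    <immintrin.h> (assumes a pkeys-capable toolchain and -mpku; 0x100 is
    just the example bit from the text above):

    #include <immintrin.h>
    #include <sys/mman.h>

    void set_pkru_bit_safely(void)
    {
            /* Don't cache PKRU across mmap()/mprotect(): the kernel may
             * rewrite it for PROT_EXEC-only mappings, and writing back a
             * stale copy would clobber the execute-only bits. Map first,
             * then read-modify-write a fresh snapshot. */
            mmap(NULL, 4096, PROT_EXEC, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

            unsigned int pkru = _rdpkru_u32();
            _wrpkru(pkru | 0x100);
    }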

    Signed-off-by: Dave Hansen
    Reviewed-by: Thomas Gleixner
    Cc: Andrea Arcangeli
    Cc: Andrew Morton
    Cc: Andy Lutomirski
    Cc: Aneesh Kumar K.V
    Cc: Borislav Petkov
    Cc: Brian Gerst
    Cc: Chen Gang
    Cc: Dan Williams
    Cc: Dave Chinner
    Cc: Dave Hansen
    Cc: David Hildenbrand
    Cc: Denys Vlasenko
    Cc: H. Peter Anvin
    Cc: Kees Cook
    Cc: Kirill A. Shutemov
    Cc: Konstantin Khlebnikov
    Cc: Linus Torvalds
    Cc: Mel Gorman
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Piotr Kwapulinski
    Cc: Rik van Riel
    Cc: Stephen Smalley
    Cc: Vladimir Murzin
    Cc: Will Deacon
    Cc: keescook@google.com
    Cc: linux-kernel@vger.kernel.org
    Cc: linux-mm@kvack.org
    Link: http://lkml.kernel.org/r/20160212210240.CB4BB5CA@viggo.jf.intel.com
    Signed-off-by: Ingo Molnar

    Dave Hansen
     
  • This plumbs a protection key through calc_vm_flag_bits(). We
    could have done this in calc_vm_prot_bits(), but I did not feel
    super strongly which way to go. It was pretty arbitrary which
    one to use.

    Signed-off-by: Dave Hansen
    Reviewed-by: Thomas Gleixner
    Cc: Andrea Arcangeli
    Cc: Andrew Morton
    Cc: Andy Lutomirski
    Cc: Arve Hjønnevåg
    Cc: Benjamin Herrenschmidt
    Cc: Borislav Petkov
    Cc: Brian Gerst
    Cc: Chen Gang
    Cc: Dan Williams
    Cc: Dave Chinner
    Cc: Dave Hansen
    Cc: David Airlie
    Cc: Denys Vlasenko
    Cc: Eric W. Biederman
    Cc: Geliang Tang
    Cc: Greg Kroah-Hartman
    Cc: H. Peter Anvin
    Cc: Kirill A. Shutemov
    Cc: Konstantin Khlebnikov
    Cc: Leon Romanovsky
    Cc: Linus Torvalds
    Cc: Masahiro Yamada
    Cc: Maxime Coquelin
    Cc: Mel Gorman
    Cc: Michael Ellerman
    Cc: Oleg Nesterov
    Cc: Paul Gortmaker
    Cc: Paul Mackerras
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: Riley Andrews
    Cc: Vladimir Davydov
    Cc: devel@driverdev.osuosl.org
    Cc: linux-api@vger.kernel.org
    Cc: linux-arch@vger.kernel.org
    Cc: linux-kernel@vger.kernel.org
    Cc: linux-mm@kvack.org
    Cc: linuxppc-dev@lists.ozlabs.org
    Link: http://lkml.kernel.org/r/20160212210231.E6F1F0D6@viggo.jf.intel.com
    Signed-off-by: Ingo Molnar

    Dave Hansen
     

18 Feb, 2016

1 commit


16 Feb, 2016

1 commit


06 Feb, 2016

2 commits

  • The sequence vma_lock_anon_vma() - vma_unlock_anon_vma() isn't safe if
    an anon_vma appears between the lock and the unlock. We have to check
    anon_vma first, or call anon_vma_prepare(), to be sure that it's there.
    There are only a few users of these legacy helpers. Let's get rid of
    them.

    This patch fixes an anon_vma lock imbalance in validate_mm(). A write
    lock isn't required here; a read lock is enough.

    It also reorders expand_downwards()/expand_upwards():
    security_mmap_addr() and the wrap-around check don't have to be under
    the anon_vma lock.

    Link: https://lkml.kernel.org/r/CACT4Y+Y908EjM2z=706dv4rV6dWtxTLK9nFg9_7DhRMLppBo2g@mail.gmail.com
    Signed-off-by: Konstantin Khlebnikov
    Reported-by: Dmitry Vyukov
    Acked-by: Kirill A. Shutemov
    Cc: Andrea Arcangeli
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     
  • Holding mmap_sem for reading in validate_mm(), called from
    expand_stack(), is not enough to prevent the augmented rbtree's
    rb_subtree_gap information from changing under us, because
    expand_stack() may be running concurrently in other threads which also
    hold mmap_sem for reading.

    The augmented rbtree is updated with vma_gap_update() under the
    page_table_lock, so take that lock in browse_rb() too to avoid false
    positives.

    Signed-off-by: Andrea Arcangeli
    Reported-by: Dmitry Vyukov
    Tested-by: Dmitry Vyukov
    Cc: Konstantin Khlebnikov
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     

04 Feb, 2016

1 commit

  • This patch provides a way of working around a slight regression
    introduced by commit 84638335900f ("mm: rework virtual memory
    accounting").

    Before that commit, RLIMIT_DATA controlled only the size of the brk
    region. But the change caused problems with all existing versions of
    valgrind, because they set RLIMIT_DATA to zero.

    This patch fixes the rlimit check (the limit is actually in bytes, not
    pages) and by default turns it into a warning printed on the first
    VmData misuse:

    "mmap: top (795): VmData 516096 exceed data ulimit 512000. Will be forbidden soon."

    Behavior is controlled by the boot parameter ignore_rlimit_data=y/n and
    by the sysfs knob /sys/module/kernel/parameters/ignore_rlimit_data. For
    now it is set to "y".

    [akpm@linux-foundation.org: tweak kernel-parameters.txt text]
    Signed-off-by: Konstantin Khlebnikov
    Link: http://lkml.kernel.org/r/20151228211015.GL2194@uranus
    Reported-by: Christian Borntraeger
    Cc: Cyrill Gorcunov
    Cc: Linus Torvalds
    Cc: Vegard Nossum
    Cc: Peter Zijlstra
    Cc: Vladimir Davydov
    Cc: Andy Lutomirski
    Cc: Quentin Casasnovas
    Cc: Kees Cook
    Cc: Willy Tarreau
    Cc: Pavel Emelyanov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     

29 Jan, 2016

1 commit


16 Jan, 2016

1 commit

  • Dmitry Vyukov has reported[1] a possible deadlock (triggered by his
    syzkaller fuzzer):

    Possible unsafe locking scenario:

            CPU0                                    CPU1
            ----                                    ----
       lock(&hugetlbfs_i_mmap_rwsem_key);
                                           lock(&mapping->i_mmap_rwsem);
                                           lock(&hugetlbfs_i_mmap_rwsem_key);
       lock(&mapping->i_mmap_rwsem);

    Both traces point to mm_take_all_locks() as the source of the problem.
    It doesn't take care of the ordering of hugetlbfs_i_mmap_rwsem_key (aka
    mapping->i_mmap_rwsem for hugetlb mappings) vs. i_mmap_rwsem.

    huge_pmd_share() does memory allocation under
    hugetlbfs_i_mmap_rwsem_key, and the allocator can take i_mmap_rwsem if
    it hits reclaim. So we need to take i_mmap_rwsem from all hugetlb VMAs
    before taking i_mmap_rwsem from the rest of the VMAs.

    The patch also documents the locking order for
    hugetlbfs_i_mmap_rwsem_key.

    [1] http://lkml.kernel.org/r/CACT4Y+Zu95tBs-0EvdiAKzUOsb4tczRRfCRTpLr4bg_OP9HuVg@mail.gmail.com

    Signed-off-by: Kirill A. Shutemov
    Reported-by: Dmitry Vyukov
    Reviewed-by: Michal Hocko
    Cc: Peter Zijlstra
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

15 Jan, 2016

4 commits

  • While inspecting the vague code inside the prctl(PR_SET_MM_MEM) call
    (which tests the RLIMIT_DATA value to figure out whether we're allowed
    to assign new @start_brk, @brk, @start_data and @end_data in
    mm_struct), it became clear that RLIMIT_DATA, in the form it's
    implemented now, doesn't do anything useful, because most user-space
    libraries use the mmap() syscall for dynamic memory allocations.

    Linus suggested converting the RLIMIT_DATA rlimit into something
    suitable for anonymous memory accounting. But in this patch we go
    further, and the changes are bundled together as:

    * keep vma counting if CONFIG_PROC_FS=n, will be used for limits
    * replace mm->shared_vm with better defined mm->data_vm
    * account anonymous executable areas as executable
    * account file-backed growsdown/up areas as stack
    * drop struct file* argument from vm_stat_account
    * enforce RLIMIT_DATA for size of data areas

    This way the code looks cleaner: code/stack/data classification now
    depends only on the vm_flags state:

    VM_EXEC & ~VM_WRITE -> code (VmExe + VmLib in proc)
    VM_GROWSUP | VM_GROWSDOWN -> stack (VmStk)
    VM_WRITE & ~VM_SHARED & !stack -> data (VmData)

    The rest (VmSize - VmData - VmStk - VmExe - VmLib) could be called
    "shared", but that might be a strange beast, like a readonly-private or
    VM_IO area.

    - RLIMIT_AS limits whole address space "VmSize"
    - RLIMIT_STACK limits stack "VmStk" (but each vma individually)
    - RLIMIT_DATA now limits "VmData"
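
    A user-space sketch of the new semantics (hypothetical sizes; with the
    default ignore_rlimit_data=y the kernel only warns, so the mmap() below
    may still succeed):

    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/resource.h>

    int main(void)
    {
            /* cap VmData at 1 MiB */
            struct rlimit rl = { .rlim_cur = 1 << 20, .rlim_max = 1 << 20 };
            setrlimit(RLIMIT_DATA, &rl);

            /* a 4 MiB private writable mapping now counts as VmData, so
             * with enforcement enabled this fails with ENOMEM */
            void *p = mmap(NULL, 4 << 20, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            if (p == MAP_FAILED)
                    perror("mmap");
            return 0;
    }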

    Signed-off-by: Konstantin Khlebnikov
    Signed-off-by: Cyrill Gorcunov
    Cc: Quentin Casasnovas
    Cc: Vegard Nossum
    Acked-by: Linus Torvalds
    Cc: Willy Tarreau
    Cc: Andy Lutomirski
    Cc: Kees Cook
    Cc: Vladimir Davydov
    Cc: Pavel Emelyanov
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     
  • Address Space Layout Randomization (ASLR) provides a barrier to
    exploitation of user-space processes in the presence of security
    vulnerabilities by making it more difficult to find desired code/data
    which could help an attack. This is done by adding a random offset to
    the location of regions in the process address space, with a greater
    range of potential offset values corresponding to better protection/a
    larger search-space for brute force, but also to greater potential for
    fragmentation.

    The offset added to the mmap_base address, which provides the basis for
    the majority of the mappings for a process, is set once on process exec
    in arch_pick_mmap_layout() and is done via hard-coded per-arch values,
    which reflect, hopefully, the best compromise for all systems. The
    trade-off between increased entropy in the offset value generation and
    the corresponding increased variability in address space fragmentation
    is not absolute, however, and some platforms may tolerate higher amounts
    of entropy. This patch introduces both new Kconfig values and a sysctl
    interface which may be used to change the amount of entropy used for
    offset generation on a system.

    The direct motivation for this change was in response to the
    libstagefright vulnerabilities that affected Android, specifically to
    information provided by Google's project zero at:

    http://googleprojectzero.blogspot.com/2015/09/stagefrightened.html

    The attack presented therein, by Google's project zero, specifically
    targeted the limited randomness used to generate the offset added to the
    mmap_base address in order to craft a brute-force-based attack.
    Concretely, the attack was against the mediaserver process, which was
    limited to respawning every 5 seconds, on an ARM device. The hard-coded
    8 bits used resulted in an average expected success rate of defeating
    the mmap ASLR after just over 10 minutes (128 tries at 5 seconds
    apiece). With this patch, and an accompanying increase in the entropy
    value to 16 bits, the same attack would take an average expected time
    of over 45 hours (32768 tries), which makes it both less feasible and
    more likely to be noticed.

    The introduced Kconfig and sysctl options are limited by per-arch
    minimum and maximum values. The minimum was chosen to match the current
    hard-coded value, and the maximum was chosen so as to give the greatest
    flexibility without generating an invalid mmap_base address: generally
    3-4 bits less than the number of bits in the user-space accessible
    virtual address space.

    When deciding whether or not to change the default value, a system
    developer should consider that the mmap_base address could be placed
    anywhere up to 2^(value) pages away from the non-randomized location,
    which would introduce variable-sized areas above and below the
    mmap_base address such that the maximum vm_area_struct size may be
    reduced, preventing very large allocations.

    This patch (of 4):

    ASLR only uses as few as 8 bits to generate the random offset for the
    mmap base address on 32 bit architectures. This value was chosen to
    prevent a poorly chosen value from dividing the address space in such a
    way as to prevent large allocations. This may not be an issue on all
    platforms. Allow the specification of a minimum number of bits so that
    platforms desiring greater ASLR protection may determine where to place
    the trade-off.
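
    A small sketch reading the sysctl this series introduces (the
    vm.mmap_rnd_bits path is the one added by the series; its legal range
    is arch-dependent):

    #include <stdio.h>

    int main(void)
    {
            FILE *f = fopen("/proc/sys/vm/mmap_rnd_bits", "r");
            int bits;

            if (f && fscanf(f, "%d", &bits) == 1)
                    printf("mmap base entropy: %d bits\n", bits);
            if (f)
                    fclose(f);
            return 0;
    }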

    Signed-off-by: Daniel Cashman
    Cc: Russell King
    Acked-by: Kees Cook
    Cc: Ingo Molnar
    Cc: Jonathan Corbet
    Cc: Don Zickus
    Cc: Eric W. Biederman
    Cc: Heinrich Schuchardt
    Cc: Josh Poimboeuf
    Cc: Kirill A. Shutemov
    Cc: Naoya Horiguchi
    Cc: Andrea Arcangeli
    Cc: Mel Gorman
    Cc: Thomas Gleixner
    Cc: David Rientjes
    Cc: Mark Salyzyn
    Cc: Jeff Vander Stoep
    Cc: Nick Kralevich
    Cc: Catalin Marinas
    Cc: Will Deacon
    Cc: "H. Peter Anvin"
    Cc: Hector Marco-Gisbert
    Cc: Borislav Petkov
    Cc: Ralf Baechle
    Cc: Heiko Carstens
    Cc: Martin Schwidefsky
    Cc: Benjamin Herrenschmidt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daniel Cashman
     
  • The following flag comparison in mmap_region() makes no sense:

            if (!(vm_flags & MAP_FIXED))
                    return -ENOMEM;

    The condition is always false, so the "return -ENOMEM" above is never
    executed. vm_flags must not be compared with the MAP_FIXED flag; it may
    only be compared with VM_* flags. MAP_FIXED merely happens to have the
    same value as VM_MAYREAD.

    Hitting the rlimit is a slow path, and find_vma_intersection() should
    realize pretty quickly that there is no overlapping VMA in the
    !MAP_FIXED case.
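
    A hedged illustration of the aliasing (the constants are copied from
    the 4.4-era x86 headers as assumptions, not included from kernel
    headers):

    #include <stdio.h>

    #define MAP_FIXED_VALUE  0x10UL   /* uapi mmap flag */
    #define VM_MAYREAD_VALUE 0x10UL   /* kernel-internal vm_flags bit */

    int main(void)
    {
            /* VM_MAYREAD is set on practically every VMA ... */
            unsigned long vm_flags = VM_MAYREAD_VALUE;

            /* ... so the buggy test really checks VM_MAYREAD, and the
             * negated condition is (almost) always false */
            printf("!(vm_flags & MAP_FIXED) = %d\n",
                   !(vm_flags & MAP_FIXED_VALUE));
            return 0;
    }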

    Signed-off-by: Piotr Kwapulinski
    Acked-by: Michal Hocko
    Cc: Oleg Nesterov
    Cc: Chris Metcalf
    Reviewed-by: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Piotr Kwapulinski
     
  • Simplify may_expand_vm().

    [akpm@linux-foundation.org: further simplification, per Naoya Horiguchi]
    Signed-off-by: Chen Gang
    Cc: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chen Gang
     

12 Jan, 2016

1 commit

  • Requiring special mappings to give a list of struct pages is
    inflexible: it prevents sane use of IO memory in a special
    mapping, it's inefficient (it requires arch code to initialize a
    list of struct pages, and it requires the mm core to walk the
    entire list just to figure out how long it is), and it prevents
    arch code from doing anything fancy when a special mapping fault
    occurs.

    Add a .fault method as an alternative to filling in a .pages
    array.

    Looks-OK-to: Andrew Morton
    Signed-off-by: Andy Lutomirski
    Reviewed-by: Kees Cook
    Cc: Andy Lutomirski
    Cc: Borislav Petkov
    Cc: Dave Hansen
    Cc: Fenghua Yu
    Cc: H. Peter Anvin
    Cc: Linus Torvalds
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Quentin Casasnovas
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/a26d1677c0bc7e774c33f469451a78ca31e9e6af.1451446564.git.luto@kernel.org
    Signed-off-by: Ingo Molnar

    Andy Lutomirski
     

06 Nov, 2015

8 commits

  • The cost of faulting in all memory to be locked can be very high when
    working with large mappings. If only portions of the mapping will be
    used, this can incur a high penalty for locking.

    For the example of a large file, this is the usage pattern for a large
    statistical language model (and probably applies to other statistical
    or graphical models as well). For the security example, consider any
    application transacting in data that cannot be swapped out (credit card
    data, medical records, etc).

    This patch introduces the ability to request that pages are not
    pre-faulted, but are placed on the unevictable LRU when they are finally
    faulted in. The VM_LOCKONFAULT flag will be used together with VM_LOCKED
    and has no effect when set without VM_LOCKED. Setting the VM_LOCKONFAULT
    flag for a VMA will cause pages faulted into that VMA to be added to the
    unevictable LRU when they are faulted or if they are already present, but
    will not cause any missing pages to be faulted in.

    Exposing this new lock state means that we cannot overload the meaning of
    the FOLL_POPULATE flag any longer. Prior to this patch it was used to
    mean that the VMA for a fault was locked. This means we need the new
    FOLL_MLOCK flag to communicate the locked state of a VMA. FOLL_POPULATE
    will now only control if the VMA should be populated and in the case of
    VM_LOCKONFAULT, it will not be set.
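
    The user-visible half of this series is the mlock2() syscall with
    MLOCK_ONFAULT; a hedged sketch, assuming a kernel with the syscall and
    a libc that defines SYS_mlock2 (glibc gained a wrapper only later):

    #define _GNU_SOURCE
    #include <sys/mman.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    #ifndef MLOCK_ONFAULT
    #define MLOCK_ONFAULT 0x01      /* uapi value from this series */
    #endif

    int main(void)
    {
            char *buf = mmap(NULL, 1 << 20, PROT_READ | PROT_WRITE,
                             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

            /* lock without pre-faulting: nothing is resident yet */
            syscall(SYS_mlock2, buf, 1 << 20, MLOCK_ONFAULT);

            /* only this first page is faulted in and mlocked */
            buf[0] = 1;
            return 0;
    }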

    Signed-off-by: Eric B Munson
    Acked-by: Kirill A. Shutemov
    Acked-by: Vlastimil Babka
    Cc: Michal Hocko
    Cc: Jonathan Corbet
    Cc: Catalin Marinas
    Cc: Geert Uytterhoeven
    Cc: Guenter Roeck
    Cc: Heiko Carstens
    Cc: Michael Kerrisk
    Cc: Ralf Baechle
    Cc: Shuah Khan
    Cc: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric B Munson
     
  • Make the __install_special_mapping() argument order match the
    caller's, so that the caller can pass its register arguments straight
    through to the callee untouched.

    On most architectures, arguments (at least the first five) are passed
    in registers, so this change will benefit most architectures.

    With -O2, __install_special_mapping() may be inlined on most
    architectures, but with -Os it should not be. So this change can yield
    slightly better performance for -Os, at least.

    Signed-off-by: Chen Gang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chen Gang
     
  • When fget() fails we can return -EBADF directly.

    Signed-off-by: Chen Gang
    Acked-by: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chen Gang
     
  • It is still a little better to remove it, although it should be
    optimized away under "-O2".

    Signed-off-by: Chen Gang
    Acked-by: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chen Gang
     
  • Cosmetic, but expand_upwards() and expand_downwards() overuse
    vma->vm_mm; a local variable makes sense, imho.

    Signed-off-by: Oleg Nesterov
    Acked-by: Hugh Dickins
    Cc: Andrey Konovalov
    Cc: Davidlohr Bueso
    Cc: "Kirill A. Shutemov"
    Cc: Sasha Levin
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • "mm->locked_vm += grow" and vm_stat_account() in acct_stack_growth() are
    not safe; multiple threads using the same ->mm can do this at the same
    time trying to expans different vma's under down_read(mmap_sem). This
    means that one of the "locked_vm += grow" changes can be lost and we can
    miss munlock_vma_pages_all() later.

    Move this code into the caller(s) under mm->page_table_lock. All other
    updates to ->locked_vm hold mmap_sem for writing.

    Signed-off-by: Oleg Nesterov
    Acked-by: Hugh Dickins
    Cc: Andrey Konovalov
    Cc: Davidlohr Bueso
    Cc: "Kirill A. Shutemov"
    Cc: Sasha Levin
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • linux/mm.h provides the offset_in_page() macro. Let's use this
    already-defined macro instead of open-coding (addr & ~PAGE_MASK).

    Signed-off-by: Alexander Kuleshov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexander Kuleshov
     
  • Before the main loop, vma is already NULL. There is no need to set it
    to NULL again.

    Signed-off-by: Chen Gang
    Reviewed-by: Oleg Nesterov
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chen Gang
     

23 Sep, 2015

1 commit

  • For VM_PFNMAP and VM_MIXEDMAP we use vm_ops->pfn_mkwrite instead of
    vm_ops->page_mkwrite to be notified of write accesses. This means we
    want vma->vm_page_prot to be write-protected if the VMA provides this
    vm_ops.

    A theoretical scenario that will cause these missed events: on a
    writable mapping with vm_ops->pfn_mkwrite, but without
    vm_ops->page_mkwrite, a read fault is followed by a write access to the
    pfn. A writable pte will be set up on the read fault, and the write
    fault will then not be generated.

    I found this while examining Dave's complaint on generic/080:

    http://lkml.kernel.org/g/20150831233803.GO3902@dastard

    although I don't think it's the reason.

    It shouldn't be a problem for ext2/ext4 as they provide both pfn_mkwrite
    and page_mkwrite.

    [akpm@linux-foundation.org: add local vm_ops to avoid 80-cols mess]
    Signed-off-by: Kirill A. Shutemov
    Cc: Yigal Korman
    Acked-by: Boaz Harrosh
    Cc: Matthew Wilcox
    Cc: Jan Kara
    Cc: Dave Chinner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

18 Sep, 2015

1 commit

  • Revert commit 6dc296e7df4c "mm: make sure all file VMAs have ->vm_ops
    set".

    Will Deacon reports that it "causes some mmap regressions in LTP, which
    appears to use a MAP_PRIVATE mmap of /dev/zero as a way to get anonymous
    pages in some of its tests (specifically mmap10 [1])".

    William Shuman reports Oracle crashes.

    So revert the patch while we work out what to do.

    Reported-by: William Shuman
    Reported-by: Will Deacon
    Cc: Kirill A. Shutemov
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     

11 Sep, 2015

2 commits

  • We rely on vma->vm_ops == NULL to detect an anonymous VMA (see
    vma_is_anonymous()), but some drivers don't set ->vm_ops.

    As a result we can end up with an anonymous page in a private file
    mapping. That should not lead to serious misbehaviour, but is
    nevertheless wrong.

    Let's fix this by setting up a dummy ->vm_ops for a file mapping if
    f_op->mmap() didn't set its own.

    The patch also adds a sanity check into __vma_link_rb(). It will help
    catch broken VMAs which are inserted directly into mm_struct via
    insert_vm_struct().

    Signed-off-by: Kirill A. Shutemov
    Reviewed-by: Oleg Nesterov
    Cc: "H. Peter Anvin"
    Cc: Andy Lutomirski
    Cc: Dave Hansen
    Cc: Ingo Molnar
    Cc: Minchan Kim
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Add the additional "vm_flags_t vm_flags" argument to do_mmap_pgoff(),
    rename it to do_mmap(), and re-introduce do_mmap_pgoff() as a simple
    wrapper on top of do_mmap(). Perhaps we should update the callers of
    do_mmap_pgoff() and kill it later.

    This way mpx_mmap() can simply call do_mmap(vm_flags => VM_MPX) without
    playing with vm internals.

    After this change mmap_region() has a single user outside of mmap.c,
    arch/tile/mm/elf.c:arch_setup_additional_pages(). It would be nice to
    change arch/tile/ and unexport mmap_region().

    [kirill@shutemov.name: fix build]
    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Oleg Nesterov
    Acked-by: Dave Hansen
    Tested-by: Dave Hansen
    Signed-off-by: Kirill A. Shutemov
    Cc: "H. Peter Anvin"
    Cc: Andy Lutomirski
    Cc: Ingo Molnar
    Cc: Minchan Kim
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

09 Sep, 2015

4 commits

  • There's no point in initializing vma->vm_pgoff if the insertion attempt
    will fail anyway. Run the checks before performing the initialization.

    Signed-off-by: Chen Gang
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chen Gang
     
  • __split_vma() doesn't need the out_err label, nor does err need to be
    initialized.

    copy_vma() can return NULL directly when kmem_cache_alloc() fails.

    Signed-off-by: Chen Gang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chen Gang
     
  • Test-case:

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/mman.h>
    #include <assert.h>

    void *find_vdso_vaddr(void)
    {
            FILE *perl;
            char buf[32] = {};

            perl = popen("perl -e 'open STDIN,qq|/proc/@{[getppid]}/maps|;"
                         "/^(.*?)-.*vdso/ && print hex $1 while <>'", "r");
            fread(buf, sizeof(buf), 1, perl);
            pclose(perl);

            return (void *)atol(buf);
    }

    #define PAGE_SIZE 4096

    void *get_unmapped_area(void)
    {
            void *p = mmap(0, PAGE_SIZE, PROT_NONE,
                           MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
            assert(p != MAP_FAILED);
            munmap(p, PAGE_SIZE);
            return p;
    }

    char save[2][PAGE_SIZE];

    int main(void)
    {
            void *vdso = find_vdso_vaddr();
            void *page[2];

            assert(vdso);
            memcpy(save, vdso, sizeof(save));
            // force another fault on the next check
            assert(madvise(vdso, 2 * PAGE_SIZE, MADV_DONTNEED) == 0);

            page[0] = mremap(vdso,
                             PAGE_SIZE, PAGE_SIZE,
                             MREMAP_FIXED | MREMAP_MAYMOVE,
                             get_unmapped_area());
            page[1] = mremap(vdso + PAGE_SIZE,
                             PAGE_SIZE, PAGE_SIZE,
                             MREMAP_FIXED | MREMAP_MAYMOVE,
                             get_unmapped_area());

            assert(page[0] != MAP_FAILED && page[1] != MAP_FAILED);
            printf("match: %d %d\n",
                   !memcmp(save[0], page[0], PAGE_SIZE),
                   !memcmp(save[1], page[1], PAGE_SIZE));

            return 0;
    }

    fails without this patch. Before the previous commit it got the wrong
    page; now it segfaults (which is imho better).

    This is because copy_vma() wrongly assumes that vma->vm_file == NULL is
    irrelevant until the first fault, which will use do_anonymous_page().
    This is obviously wrong for a special mapping.

    Signed-off-by: Oleg Nesterov
    Acked-by: Kirill A. Shutemov
    Cc: Andy Lutomirski
    Cc: Hugh Dickins
    Cc: Pavel Emelyanov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Test-case:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/mman.h>
    #include <assert.h>

    void *find_vdso_vaddr(void)
    {
            FILE *perl;
            char buf[32] = {};

            perl = popen("perl -e 'open STDIN,qq|/proc/@{[getppid]}/maps|;"
                         "/^(.*?)-.*vdso/ && print hex $1 while <>'", "r");
            fread(buf, sizeof(buf), 1, perl);
            pclose(perl);

            return (void *)atol(buf);
    }

    #define PAGE_SIZE 4096

    int main(void)
    {
            void *vdso = find_vdso_vaddr();
            assert(vdso);

            // of course they should differ, and they do so far
            printf("vdso pages differ: %d\n",
                   !!memcmp(vdso, vdso + PAGE_SIZE, PAGE_SIZE));

            // split into 2 vma's
            assert(mprotect(vdso, PAGE_SIZE, PROT_READ) == 0);

            // force another fault on the next check
            assert(madvise(vdso, 2 * PAGE_SIZE, MADV_DONTNEED) == 0);

            // now they no longer differ, the 2nd vm_pgoff is wrong
            printf("vdso pages differ: %d\n",
                   !!memcmp(vdso, vdso + PAGE_SIZE, PAGE_SIZE));

            return 0;
    }

    Output:

    vdso pages differ: 1
    vdso pages differ: 0

    This is because split_vma() correctly updates ->vm_pgoff, but the logic
    in insert_vm_struct() and special_mapping_fault() is absolutely broken,
    so the fault at vdso + PAGE_SIZE returns the 1st page. The same happens
    if you simply unmap the 1st page.

    special_mapping_fault() does:

            pgoff = vmf->pgoff - vma->vm_pgoff;

    and this is _only_ correct if vma->vm_start mmaps the first page from
    the ->vm_private_data array.

    vdso, or any other user of install_special_mapping(), is not anonymous;
    it has "backing storage" even if it is just an array of pages. So we
    actually need to make vm_pgoff work as an offset into this array.

    Note: this also allows us to fix another problem: currently gdb can't
    access "[vvar]" memory, because in this case special_mapping_fault()
    doesn't work. Now that we can use ->vm_pgoff we can implement
    ->access() and fix this.

    Signed-off-by: Oleg Nesterov
    Acked-by: Kirill A. Shutemov
    Cc: Andy Lutomirski
    Cc: Hugh Dickins
    Cc: Pavel Emelyanov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

05 Sep, 2015

1 commit

  • vma->vm_userfaultfd_ctx is yet another vma parameter that vma_merge()
    must be aware of, so that we can merge vmas back the way they
    originally were before arming the userfaultfd on some memory range.

    Signed-off-by: Andrea Arcangeli
    Acked-by: Pavel Emelyanov
    Cc: Sanidhya Kashyap
    Cc: zhang.zhanghailiang@huawei.com
    Cc: "Kirill A. Shutemov"
    Cc: Andres Lagar-Cavilla
    Cc: Dave Hansen
    Cc: Paolo Bonzini
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Andy Lutomirski
    Cc: Hugh Dickins
    Cc: Peter Feiner
    Cc: "Dr. David Alan Gilbert"
    Cc: Johannes Weiner
    Cc: "Huangpeng (Peter)"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     

10 Jul, 2015

1 commit

  • Today proc and sysfs do not contain any executable files. Several
    applications today mount proc or sysfs without noexec and nosuid and
    then depend on there being no executable files on proc or sysfs.
    Having any executable files show up on proc or sysfs would cause a
    user-space-visible regression, and most likely security problems.

    Therefore commit to never allowing executables on proc and sysfs by
    adding a new flag to mark them as filesystems without executables and
    enforce that flag.

    Test the flag where MNT_NOEXEC is tested today, so that the only
    user-visible effect will be that executables will be treated as if the
    execute bit is cleared.

    The filesystems proc and sysfs do not currently incorporate any
    executable files, so this does not result in any user-visible effects.

    This makes it unnecessary to tightly vet changes to proc and sysfs for
    added executable files, or changes to chattr that would modify existing
    files, as no matter what the individual files say they will not be
    treated as executable files by the vfs.

    Not having to vet changes so closely is important, as without this we
    are only one proc_create call (or another goof-up in the implementation
    of notify_change) away from having problematic executables on proc.
    Those mistakes are all too easy to make and would create a situation
    where there are security issues or where the assumptions of some
    program have to be broken (causing userspace regressions).

    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

25 Jun, 2015

1 commit

  • The simple check for a zero-length memory mapping can be performed
    earlier, so that in the case of a zero-length mapping some unnecessary
    code is not executed at all. It does not make the code less readable,
    and it saves some CPU cycles.
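
    A trivial check of the behaviour this path handles (POSIX requires
    EINVAL for a zero-length mmap, which the kernel now returns before
    doing any further work):

    #include <assert.h>
    #include <errno.h>
    #include <sys/mman.h>

    int main(void)
    {
            void *p = mmap(NULL, 0, PROT_READ,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            assert(p == MAP_FAILED && errno == EINVAL);
            return 0;
    }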

    Signed-off-by: Piotr Kwapulinski
    Acked-by: Michal Hocko
    Acked-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Piotr Kwapulinski
     

16 Apr, 2015

1 commit

  • The creators of the C language gave us the while keyword. Let's use
    that instead of synthesizing it from if+goto.

    Made possible by 6597d783397a ("mm/mmap.c: replace find_vma_prepare()
    with clearer find_vma_links()").

    [akpm@linux-foundation.org: fix 80-col overflows]
    Signed-off-by: Rasmus Villemoes
    Cc: "Kirill A. Shutemov"
    Cc: Sasha Levin
    Cc: Cyrill Gorcunov
    Cc: Roman Gushchin
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rasmus Villemoes