06 Mar, 2019

1 commit

  • commit 0a1d52994d440e21def1c2174932410b4f2a98a1 upstream.

    security_mmap_addr() does a capability check with current_cred(), but
    we can reach this code from contexts like a VFS write handler where
    current_cred() must not be used.

    This can be abused on systems without SMAP to make NULL pointer
    dereferences exploitable again.
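
    A minimal sketch of the direction of the fix, assuming the check sits in the
    stack-expansion path; the exact code may differ:

        /* don't allow expanding below mmap_min_addr; no credential check needed */
        address &= PAGE_MASK;
        if (address < mmap_min_addr)
                return -EPERM;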

    Fixes: 8869477a49c3 ("security: protect from stack expansion into low vm addresses")
    Cc: stable@kernel.org
    Signed-off-by: Jann Horn
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Jann Horn
     

13 Oct, 2018

1 commit

  • Daniel Micay reports that attempting to use MAP_FIXED_NOREPLACE in an
    application causes that application to randomly crash. The existing check
    for handling MAP_FIXED_NOREPLACE looks up the first VMA that either
    overlaps or follows the requested region, and then bails out if that VMA
    overlaps *the start* of the requested region. It does not bail out if the
    VMA only overlaps another part of the requested region.

    Fix it by checking that the found VMA only starts at or after the end of
    the requested region, in which case there is no overlap.
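
    A hedged sketch of the corrected check (simplified from the mmap path):
    find_vma() returns the first vma ending above addr, so the request only
    conflicts if that vma starts before the end of the requested range:

        if (flags & MAP_FIXED_NOREPLACE) {
                struct vm_area_struct *vma = find_vma(mm, addr);

                /* overlap only if the found vma starts below addr + len */
                if (vma && vma->vm_start < addr + len)
                        return -EEXIST;
        }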

    Test case:

    user@debian:~$ cat mmap_fixed_simple.c
    #include <errno.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #ifndef MAP_FIXED_NOREPLACE
    #define MAP_FIXED_NOREPLACE 0x100000
    #endif

    int main(void) {
            char *p;

            errno = 0;
            p = mmap((void*)0x10001000, 0x4000, PROT_NONE,
                     MAP_PRIVATE|MAP_ANONYMOUS|MAP_FIXED_NOREPLACE, -1, 0);
            printf("p1=%p err=%m\n", p);

            errno = 0;
            p = mmap((void*)0x10000000, 0x2000, PROT_READ,
                     MAP_PRIVATE|MAP_ANONYMOUS|MAP_FIXED_NOREPLACE, -1, 0);
            printf("p2=%p err=%m\n", p);

            char cmd[100];
            sprintf(cmd, "cat /proc/%d/maps", getpid());
            system(cmd);

            return 0;
    }
    user@debian:~$ gcc -o mmap_fixed_simple mmap_fixed_simple.c
    user@debian:~$ ./mmap_fixed_simple
    p1=0x10001000 err=Success
    p2=0x10000000 err=Success
    10000000-10002000 r--p 00000000 00:00 0
    10002000-10005000 ---p 00000000 00:00 0
    564a9a06f000-564a9a070000 r-xp 00000000 fe:01 264004
    /home/user/mmap_fixed_simple
    564a9a26f000-564a9a270000 r--p 00000000 fe:01 264004
    /home/user/mmap_fixed_simple
    564a9a270000-564a9a271000 rw-p 00001000 fe:01 264004
    /home/user/mmap_fixed_simple
    564a9a54a000-564a9a56b000 rw-p 00000000 00:00 0 [heap]
    7f8eba447000-7f8eba5dc000 r-xp 00000000 fe:01 405885
    /lib/x86_64-linux-gnu/libc-2.24.so
    7f8eba5dc000-7f8eba7dc000 ---p 00195000 fe:01 405885
    /lib/x86_64-linux-gnu/libc-2.24.so
    7f8eba7dc000-7f8eba7e0000 r--p 00195000 fe:01 405885
    /lib/x86_64-linux-gnu/libc-2.24.so
    7f8eba7e0000-7f8eba7e2000 rw-p 00199000 fe:01 405885
    /lib/x86_64-linux-gnu/libc-2.24.so
    7f8eba7e2000-7f8eba7e6000 rw-p 00000000 00:00 0
    7f8eba7e6000-7f8eba809000 r-xp 00000000 fe:01 405876
    /lib/x86_64-linux-gnu/ld-2.24.so
    7f8eba9e9000-7f8eba9eb000 rw-p 00000000 00:00 0
    7f8ebaa06000-7f8ebaa09000 rw-p 00000000 00:00 0
    7f8ebaa09000-7f8ebaa0a000 r--p 00023000 fe:01 405876
    /lib/x86_64-linux-gnu/ld-2.24.so
    7f8ebaa0a000-7f8ebaa0b000 rw-p 00024000 fe:01 405876
    /lib/x86_64-linux-gnu/ld-2.24.so
    7f8ebaa0b000-7f8ebaa0c000 rw-p 00000000 00:00 0
    7ffcc99fa000-7ffcc9a1b000 rw-p 00000000 00:00 0 [stack]
    7ffcc9b44000-7ffcc9b47000 r--p 00000000 00:00 0 [vvar]
    7ffcc9b47000-7ffcc9b49000 r-xp 00000000 00:00 0 [vdso]
    ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0
    [vsyscall]
    user@debian:~$ uname -a
    Linux debian 4.19.0-rc6+ #181 SMP Wed Oct 3 23:43:42 CEST 2018 x86_64 GNU/Linux
    user@debian:~$

    As you can see, the first page of the mapping at 0x10001000 was clobbered.

    Link: http://lkml.kernel.org/r/20181010152736.99475-1-jannh@google.com
    Fixes: a4ff8e8620d3 ("mm: introduce MAP_FIXED_NOREPLACE")
    Signed-off-by: Jann Horn
    Reported-by: Daniel Micay
    Acked-by: Michal Hocko
    Acked-by: John Hubbard
    Acked-by: Kees Cook
    Acked-by: Vlastimil Babka
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Greg Kroah-Hartman

    Jann Horn
     

23 Aug, 2018

2 commits

  • oom_reaper used to rely on the oom_lock since e2fe14564d33 ("oom_reaper:
    close race with exiting task"). We do not really need the lock anymore
    though. 212925802454 ("mm: oom: let oom_reap_task and exit_mmap run
    concurrently") has removed serialization with the exit path based on the
    mm reference count and so we do not really rely on the oom_lock anymore.

    Tetsuo was arguing that at least MMF_OOM_SKIP should be set under the lock
    to prevent races where the page allocator fails to get the freed (reaped)
    memory in __alloc_pages_may_oom but sees the flag later on and moves on to
    another victim. Although this is possible in principle, let's wait for it
    to actually happen in real life before we make the locking more complex
    again.

    Therefore remove the oom_lock for oom_reaper paths (both exit_mmap and
    oom_reap_task_mm). The reaper serializes with exit_mmap by mmap_sem +
    MMF_OOM_SKIP flag. There is no synchronization with out_of_memory path
    now.

    [mhocko@kernel.org: oom_reap_task_mm should return false when __oom_reap_task_mm did]
    Link: http://lkml.kernel.org/r/20180724141747.GP28386@dhcp22.suse.cz
    Link: http://lkml.kernel.org/r/20180719075922.13784-1-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Suggested-by: David Rientjes
    Acked-by: David Rientjes
    Cc: Tetsuo Handa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • There are several blockable mmu notifiers which might sleep in
    mmu_notifier_invalidate_range_start and that is a problem for the
    oom_reaper because it needs to guarantee a forward progress so it cannot
    depend on any sleepable locks.

    Currently we simply back off and mark an oom victim with blockable mmu
    notifiers as done after a short sleep. That can result in selecting a new
    oom victim prematurely because the previous one still hasn't torn its
    memory down yet.

    We can do much better though. Even if mmu notifiers use sleepable locks,
    there is no reason to automatically assume those locks are held. Moreover,
    the majority of notifiers only care about a portion of the address space,
    and there is absolutely no reason to fail when we are unmapping an
    unrelated range. Many notifiers really do block and wait for HW, which is
    harder to handle, and for those we still have to bail out.

    This patch handles the low hanging fruit.
    __mmu_notifier_invalidate_range_start gets a blockable flag and callbacks
    are not allowed to sleep if the flag is set to false. This is achieved by
    using trylock instead of the sleepable lock for most callbacks and
    continuing as long as we do not block down the call chain.
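
    A hedged sketch of the callback pattern this enables (the driver, its lock
    and the function names are illustrative, not from the patch; the signature
    follows the description above):

    struct example_dev {                     /* hypothetical driver state */
            struct mmu_notifier mn;
            struct mutex lock;
    };

    static int example_invalidate_range_start(struct mmu_notifier *mn,
                                              struct mm_struct *mm,
                                              unsigned long start,
                                              unsigned long end,
                                              bool blockable)
    {
            struct example_dev *dev = container_of(mn, struct example_dev, mn);

            if (blockable)
                    mutex_lock(&dev->lock);
            else if (!mutex_trylock(&dev->lock))
                    return -EAGAIN;          /* tell the caller to retry later */

            /* ... invalidate device mappings in [start, end) ... */

            mutex_unlock(&dev->lock);
            return 0;
    }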

    I think we can improve that even further because there is a common pattern
    to do a range lookup first and then do something about that. The first
    part can be done without a sleeping lock in most cases AFAICS.

    The oom_reaper end then simply retries if there is at least one notifier
    which couldn't make any progress in !blockable mode. A retry loop is
    already implemented to wait for the mmap_sem and this is basically the
    same thing.

    The simplest way for driver developers to test this code path is to wrap
    userspace code which uses these notifiers into a memcg and set the hard
    limit to hit the oom. This can be done e.g. after the test faults in all
    the mmu notifier managed memory and set the hard limit to something really
    small. Then we are looking for a proper process tear down.

    [akpm@linux-foundation.org: coding style fixes]
    [akpm@linux-foundation.org: minor code simplification]
    Link: http://lkml.kernel.org/r/20180716115058.5559-1-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: Christian König # AMD notifiers
    Acked-by: Leon Romanovsky # mlx and umem_odp
    Reported-by: David Rientjes
    Cc: "David (ChunMing) Zhou"
    Cc: Paolo Bonzini
    Cc: Alex Deucher
    Cc: David Airlie
    Cc: Jani Nikula
    Cc: Joonas Lahtinen
    Cc: Rodrigo Vivi
    Cc: Doug Ledford
    Cc: Jason Gunthorpe
    Cc: Mike Marciniszyn
    Cc: Dennis Dalessandro
    Cc: Sudeep Dutt
    Cc: Ashutosh Dixit
    Cc: Dimitri Sivanich
    Cc: Boris Ostrovsky
    Cc: Juergen Gross
    Cc: "Jérôme Glisse"
    Cc: Andrea Arcangeli
    Cc: Felix Kuehling
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

18 Aug, 2018

1 commit

  • This patch is reworked from an earlier patch that Dan has posted:
    https://patchwork.kernel.org/patch/10131727/

    VM_MIXEDMAP is used by dax to tell mm paths like vm_normal_page() that
    the memory page it is dealing with is not typical memory from the linear
    map. The get_user_pages_fast() path, since it does not resolve the vma,
    is already using {pte,pmd}_devmap() as a stand-in for VM_MIXEDMAP, so we
    use that as a VM_MIXEDMAP replacement in some locations. In the cases
    where there is no pte to consult we fall back to using vma_is_dax() to
    detect the VM_MIXEDMAP special case.
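
    A hedged illustration of the substitution (the helper is made up, not from
    the patch):

    static bool example_is_dax_backed(struct vm_area_struct *vma, pte_t pte)
    {
            /* with a pte in hand, the devmap bit stands in for VM_MIXEDMAP */
            if (pte_present(pte) && pte_devmap(pte))
                    return true;

            /* otherwise ask the vma itself */
            return vma_is_dax(vma);
    }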

    Now that we have explicit driver pfn_t-flag opt-in/opt-out for
    get_user_pages() support for DAX we can stop setting VM_MIXEDMAP. This
    also means we no longer need to worry about safely manipulating vm_flags
    in a future where we support dynamically changing the dax mode of a
    file.

    DAX should also now be supported with madvise_behavior(), vma_merge(),
    and copy_page_range().

    This patch has been tested against ndctl unit test. It has also been
    tested against xfstests commit: 625515d using fake pmem created by
    memmap and no additional issues have been observed.

    Link: http://lkml.kernel.org/r/152847720311.55924.16999195879201817653.stgit@djiang5-desk3.ch.intel.com
    Signed-off-by: Dave Jiang
    Acked-by: Dan Williams
    Cc: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Jiang
     

27 Jul, 2018

1 commit

  • vma_is_anonymous() relies on ->vm_ops being NULL to detect anonymous
    VMA. This is unreliable as ->mmap may not set ->vm_ops.

    False-positive vma_is_anonymous() may lead to crashes:

    next ffff8801ce5e7040 prev ffff8801d20eca50 mm ffff88019c1e13c0
    prot 27 anon_vma ffff88019680cdd8 vm_ops 0000000000000000
    pgoff 0 file ffff8801b2ec2d00 private_data 0000000000000000
    flags: 0xff(read|write|exec|shared|mayread|maywrite|mayexec|mayshare)
    ------------[ cut here ]------------
    kernel BUG at mm/memory.c:1422!
    invalid opcode: 0000 [#1] SMP KASAN
    CPU: 0 PID: 18486 Comm: syz-executor3 Not tainted 4.18.0-rc3+ #136
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google
    01/01/2011
    RIP: 0010:zap_pmd_range mm/memory.c:1421 [inline]
    RIP: 0010:zap_pud_range mm/memory.c:1466 [inline]
    RIP: 0010:zap_p4d_range mm/memory.c:1487 [inline]
    RIP: 0010:unmap_page_range+0x1c18/0x2220 mm/memory.c:1508
    Call Trace:
    unmap_single_vma+0x1a0/0x310 mm/memory.c:1553
    zap_page_range_single+0x3cc/0x580 mm/memory.c:1644
    unmap_mapping_range_vma mm/memory.c:2792 [inline]
    unmap_mapping_range_tree mm/memory.c:2813 [inline]
    unmap_mapping_pages+0x3a7/0x5b0 mm/memory.c:2845
    unmap_mapping_range+0x48/0x60 mm/memory.c:2880
    truncate_pagecache+0x54/0x90 mm/truncate.c:800
    truncate_setsize+0x70/0xb0 mm/truncate.c:826
    simple_setattr+0xe9/0x110 fs/libfs.c:409
    notify_change+0xf13/0x10f0 fs/attr.c:335
    do_truncate+0x1ac/0x2b0 fs/open.c:63
    do_sys_ftruncate+0x492/0x560 fs/open.c:205
    __do_sys_ftruncate fs/open.c:215 [inline]
    __se_sys_ftruncate fs/open.c:213 [inline]
    __x64_sys_ftruncate+0x59/0x80 fs/open.c:213
    do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290
    entry_SYSCALL_64_after_hwframe+0x49/0xbe

    Reproducer:

    #include
    #include
    #include
    #include
    #include
    #include
    #include
    #include
    #include
    #include
    #include

    #define KCOV_INIT_TRACE _IOR('c', 1, unsigned long)
    #define KCOV_ENABLE _IO('c', 100)
    #define KCOV_DISABLE _IO('c', 101)
    #define COVER_SIZE (1024<<10)

    [... body of the reproducer lost to markup stripping ...]
        return 0;
    }

    This can be fixed by assigning anonymous VMAs their own vm_ops and not
    relying on it being NULL.

    If ->mmap() failed to set ->vm_ops, mmap_region() will set it to
    dummy_vm_ops. This way we will have non-NULL ->vm_ops for all VMAs.
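
    A hedged sketch of the idea; the real series routes this through helper
    functions and the exact placement may differ:

        /* a do-nothing vm_operations_struct for vmas whose ->mmap() set none */
        static const struct vm_operations_struct dummy_vm_ops = {};

        /* in mmap_region(), after file->f_op->mmap() returns: */
        if (!vma->vm_ops)
                vma->vm_ops = &dummy_vm_ops;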

    Link: http://lkml.kernel.org/r/20180724121139.62570-4-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Reported-by: syzbot+3f84280d52be9b7083cc@syzkaller.appspotmail.com
    Acked-by: Linus Torvalds
    Reviewed-by: Andrew Morton
    Cc: Dmitry Vyukov
    Cc: Oleg Nesterov
    Cc: Andrea Arcangeli
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

22 Jul, 2018

3 commits

  • Like vm_area_dup(), it initializes the anon_vma_chain head, and the
    basic mm pointer.

    The rest of the fields end up being different for different users,
    although the plan is to also initialize the 'vm_ops' field to a dummy
    entry.

    Signed-off-by: Linus Torvalds

    Linus Torvalds
     
  • .. and re-initialize the anon_vma_chain head.

    This removes some boiler-plate from the users, and also makes it clear
    why it didn't need to use the 'zalloc()' version.

    Signed-off-by: Linus Torvalds

    Linus Torvalds
     
  • The vm_area_struct is one of the most fundamental memory management
    objects, but the management of it is entirely open-coded everywhere,
    ranging from allocation and freeing (using kmem_cache_[z]alloc and
    kmem_cache_free) to initializing all the fields.

    We want to unify this in order to end up having some unified
    initialization of the vmas, and the first step to this is to at least
    have basic allocation functions.

    Right now those functions are literally just wrappers around the
    kmem_cache_*() calls. This is a purely mechanical conversion:

    # new vma:
    kmem_cache_zalloc(vm_area_cachep, GFP_KERNEL) -> vm_area_alloc()

    # copy old vma
    kmem_cache_alloc(vm_area_cachep, GFP_KERNEL) -> vm_area_dup(old)

    # free vma
    kmem_cache_free(vm_area_cachep, vma) -> vm_area_free(vma)

    to the point where the old vma passed in to the vm_area_dup() function
    isn't even used yet (because I've left all the old manual initialization
    alone).
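
    As a rough sketch, the helpers are thin wrappers of roughly this shape
    (simplified; the real functions may take additional arguments such as the
    mm):

    struct vm_area_struct *vm_area_alloc(void)
    {
            return kmem_cache_zalloc(vm_area_cachep, GFP_KERNEL);
    }

    struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
    {
            /* 'orig' is not consulted yet, as noted above */
            return kmem_cache_alloc(vm_area_cachep, GFP_KERNEL);
    }

    void vm_area_free(struct vm_area_struct *vma)
    {
            kmem_cache_free(vm_area_cachep, vma);
    }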

    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

15 Jul, 2018

1 commit

  • syzbot has noticed that a specially crafted library can easily hit
    VM_BUG_ON in __mm_populate

    kernel BUG at mm/gup.c:1242!
    invalid opcode: 0000 [#1] SMP
    CPU: 2 PID: 9667 Comm: a.out Not tainted 4.18.0-rc3 #644
    Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 05/19/2017
    RIP: 0010:__mm_populate+0x1e2/0x1f0
    Code: 55 d0 65 48 33 14 25 28 00 00 00 89 d8 75 21 48 83 c4 20 5b 41 5c 41 5d 41 5e 41 5f 5d c3 e8 75 18 f1 ff 0f 0b e8 6e 18 f1 ff 0b 31 db eb c9 e8 93 06 e0 ff 0f 1f 00 55 48 89 e5 53 48 89 fb
    Call Trace:
    vm_brk_flags+0xc3/0x100
    vm_brk+0x1f/0x30
    load_elf_library+0x281/0x2e0
    __ia32_sys_uselib+0x170/0x1e0
    do_fast_syscall_32+0xca/0x420
    entry_SYSENTER_compat+0x70/0x7f

    The reason is that the length of the new brk is not page aligned when we
    try to populate it. There is no reason to BUG on that though.
    do_brk_flags already aligns the length properly so the mapping is
    expanded as it should. All we need is to tell mm_populate about it.
    Besides that, there is absolutely no reason to BUG_ON in the first
    place. The worst thing that could happen is that the last page wouldn't
    get populated, and that is far from putting the system into an
    inconsistent state.

    Fix the issue by moving the length sanitization code from do_brk_flags
    up to vm_brk_flags. The only other caller of do_brk_flags is the brk
    syscall entry, and it makes sure to provide a proper length, so there
    is no need for sanitization and we can use do_brk_flags without it.

    Also remove the bogus BUG_ONs.
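
    A hedged sketch of where the sanitization ends up (fragment of
    vm_brk_flags(), simplified):

        len = PAGE_ALIGN(request);
        if (len < request)               /* overflow */
                return -ENOMEM;
        if (!len)
                return 0;

        ret = do_brk_flags(addr, len, flags, &uf);
        /* ... drop mmap_sem, etc. ... */
        if (populate && !ret)
                mm_populate(addr, len);  /* same page-aligned length */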

    [osalvador@techadventures.net: fix up vm_brk_flags s@request@len@]
    Link: http://lkml.kernel.org/r/20180706090217.GI32658@dhcp22.suse.cz
    Signed-off-by: Michal Hocko
    Reported-by: syzbot
    Tested-by: Tetsuo Handa
    Reviewed-by: Oscar Salvador
    Cc: Zi Yan
    Cc: "Aneesh Kumar K.V"
    Cc: Dan Williams
    Cc: "Kirill A. Shutemov"
    Cc: Michael S. Tsirkin
    Cc: Al Viro
    Cc: "Huang, Ying"
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

08 Jun, 2018

1 commit

  • Use new return type vm_fault_t for fault handler in struct
    vm_operations_struct. For now, this is just documenting that the
    function returns a VM_FAULT value rather than an errno. Once all
    instances are converted, vm_fault_t will become a distinct type.

    See commit 1c8f422059ae ("mm: change return type to vm_fault_t")
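
    An illustrative shape of a converted handler (names are made up; only the
    vm_fault_t return type is the point here):

    static vm_fault_t example_fault(struct vm_fault *vmf)
    {
            struct page *page = example_lookup_page(vmf->pgoff);  /* hypothetical */

            if (!page)
                    return VM_FAULT_SIGBUS;

            vmf->page = page;
            return 0;
    }

    static const struct vm_operations_struct example_vm_ops = {
            .fault = example_fault,
    };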

    Link: http://lkml.kernel.org/r/20180512063745.GA26866@jordon-HP-15-Notebook-PC
    Signed-off-by: Souptick Joarder
    Reviewed-by: Matthew Wilcox
    Reviewed-by: Andrew Morton
    Cc: Joe Perches
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Cc: Dan Williams
    Cc: David Rientjes
    Cc: Mike Kravetz
    Cc: Naoya Horiguchi
    Cc: Aneesh Kumar K.V
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Souptick Joarder
     

05 Jun, 2018

1 commit

  • Pull documentation updates from Jonathan Corbet:
    "There's been a fair amount of work in the docs tree this time around,
    including:

    - Extensive RST conversions and organizational work in the
    memory-management docs thanks to Mike Rapoport.

    - An update of Documentation/features from Andrea Parri and a script
    to keep it updated.

    - Various LICENSES updates from Thomas, along with a script to check
    SPDX tags.

    - Work to fix dangling references to documentation files; this
    involved a fair number of one-liner comment changes outside of
    Documentation/

    ... and the usual list of documentation improvements, typo fixes, etc"

    * tag 'docs-4.18' of git://git.lwn.net/linux: (103 commits)
    Documentation: document hung_task_panic kernel parameter
    docs/admin-guide/mm: add high level concepts overview
    docs/vm: move ksm and transhuge from "user" to "internals" section.
    docs: Use the kerneldoc comments for memalloc_no*()
    doc: document scope NOFS, NOIO APIs
    docs: update kernel versions and dates in tables
    docs/vm: transhuge: split userspace bits to admin-guide/mm/transhuge
    docs/vm: transhuge: minor updates
    docs/vm: transhuge: change sections order
    Documentation: arm: clean up Marvell Berlin family info
    Documentation: gpio: driver: Fix a typo and some odd grammar
    docs: ranoops.rst: fix location of ramoops.txt
    scripts/documentation-file-ref-check: rewrite it in perl with auto-fix mode
    docs: uio-howto.rst: use a code block to solve a warning
    mm, THP, doc: Add document for thp_swpout/thp_swpout_fallback
    w1: w1_io.c: fix a kernel-doc warning
    Documentation/process/posting: wrap text at 80 cols
    docs: admin-guide: add cgroup-v2 documentation
    Revert "Documentation/features/vm: Remove arch support status file for 'pte_special'"
    Documentation: refcount-vs-atomic: Update reference to LKMM doc.
    ...

    Linus Torvalds
     

20 May, 2018

1 commit

  • Commit be83bbf80682 ("mmap: introduce sane default mmap limits") was
    introduced to catch problems in various ad-hoc character device drivers
    doing mmap and getting the size limits wrong. In the process, it used
    "known good" limits for the normal cases of mapping regular files and
    block device drivers.

    It turns out that the "s_maxbytes" limit was less "known good" than I
    thought. In particular, /proc doesn't set it, but exposes one regular
    file to mmap: /proc/vmcore. As a result, that file got limited to the
    default MAX_INT s_maxbytes value.

    This went unnoticed for a while, because apparently the only thing that
    needs it is the s390 kernel zfcpdump, but there might be other tools
    that use this too.

    Vasily suggested just changing s_maxbytes for all of /proc, which isn't
    wrong, but makes me nervous at this stage. So instead, just make the
    new mmap limit always be MAX_LFS_FILESIZE for regular files, which won't
    affect anything else. It wasn't the regular file case I was worried
    about.

    I'd really prefer for maxsize to have been per-inode, but that is not
    how things are today.
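
    A hedged sketch of the resulting policy, using the helper name from the
    earlier limit commit (details simplified and possibly not exact):

    static inline u64 file_mmap_size_max(struct file *file, struct inode *inode)
    {
            /* regular files and block devices get the full LFS range */
            if (S_ISREG(inode->i_mode) || S_ISBLK(inode->i_mode))
                    return MAX_LFS_FILESIZE;

            /* everything else stays limited to what fits in an unsigned long */
            return ULONG_MAX;
    }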

    Fixes: be83bbf80682 ("mmap: introduce sane default mmap limits")
    Reported-by: Vasily Gorbik
    Cc: Al Viro
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

12 May, 2018

3 commits

  • Merge misc fixes from Andrew Morton:
    "13 fixes"

    * emailed patches from Andrew Morton :
    rbtree: include rcu.h
    scripts/faddr2line: fix error when addr2line output contains discriminator
    ocfs2: take inode cluster lock before moving reflinked inode from orphan dir
    mm, oom: fix concurrent munlock and oom reaper unmap, v3
    mm: migrate: fix double call of radix_tree_replace_slot()
    proc/kcore: don't bounds check against address 0
    mm: don't show nr_indirectly_reclaimable in /proc/vmstat
    mm: sections are not offlined during memory hotremove
    z3fold: fix reclaim lock-ups
    init: fix false positives in W+X checking
    lib/find_bit_benchmark.c: avoid soft lockup in test_find_first_bit()
    KASAN: prohibit KASAN+STRUCTLEAK combination
    MAINTAINERS: update Shuah's email address

    Linus Torvalds
     
  • Since exit_mmap() is done without the protection of mm->mmap_sem, it is
    possible for the oom reaper to concurrently operate on an mm until
    MMF_OOM_SKIP is set.

    This allows munlock_vma_pages_all() to concurrently run while the oom
    reaper is operating on a vma. Since munlock_vma_pages_range() depends
    on clearing VM_LOCKED from vm_flags before actually doing the munlock to
    determine if any other vmas are locking the same memory, the check for
    VM_LOCKED in the oom reaper is racy.

    This is especially noticeable on architectures such as powerpc where
    clearing a huge pmd requires serialize_against_pte_lookup(). If the pmd
    is zapped by the oom reaper during follow_page_mask() after the check
    for pmd_none() is bypassed, this ends up dereferencing a NULL ptl or
    causing a kernel oops.

    Fix this by manually freeing all possible memory from the mm before
    doing the munlock and then setting MMF_OOM_SKIP. The oom reaper can not
    run on the mm anymore so the munlock is safe to do in exit_mmap(). It
    also matches the logic that the oom reaper currently uses for
    determining when to set MMF_OOM_SKIP itself, so there's no new risk of
    excessive oom killing.

    This fixes CVE-2018-1000200.

    Link: http://lkml.kernel.org/r/alpine.DEB.2.21.1804241526320.238665@chino.kir.corp.google.com
    Fixes: 212925802454 ("mm: oom: let oom_reap_task and exit_mmap run concurrently")
    Signed-off-by: David Rientjes
    Suggested-by: Tetsuo Handa
    Acked-by: Michal Hocko
    Cc: Andrea Arcangeli
    Cc: [4.14+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • The internal VM "mmap()" interfaces are based on the mmap target doing
    everything using page indexes rather than byte offsets, because
    traditionally (ie 32-bit) we had the situation that the byte offset
    didn't fit in a register. So while the mmap virtual address was limited
    by the word size of the architecture, the backing store was not.

    So we're basically passing "pgoff" around as a page index, in order to
    be able to describe backing store locations that are much bigger than
    the word size (think files larger than 4GB etc).

    But while this all makes a ton of sense conceptually, we've been dogged
    by various drivers that don't really understand this, and internally
    work with byte offsets, and then try to work with the page index by
    turning it into a byte offset with "pgoff << PAGE_SHIFT".

    Which obviously can overflow.

    Adding the size of the mapping to it to get the byte offset of the end
    of the backing store just exacerbates the problem, and if you then use
    this overflow-prone value to check various limits of your device driver
    mmap capability, you're just setting yourself up for problems.

    The correct thing for drivers to do is to do their limit math in page
    indices, the way the interface is designed. Because the generic mmap
    code _does_ test that the index doesn't overflow, since that's what the
    mmap code really cares about.
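
    As a hedged example of the recommended driver-side pattern (the device and
    its size helper are made up), do the comparison in pages so nothing can
    overflow:

    static int example_mmap(struct file *file, struct vm_area_struct *vma)
    {
            unsigned long pages = vma_pages(vma);          /* mapping length in pages */
            unsigned long max_pages = example_dev_size() >> PAGE_SHIFT; /* hypothetical */

            if (vma->vm_pgoff >= max_pages || pages > max_pages - vma->vm_pgoff)
                    return -EINVAL;

            /* ... remap the device memory ... */
            return 0;
    }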

    HOWEVER.

    Finding and fixing various random drivers is a sisyphean task, so let's
    just see if we can just make the core mmap() code do the limiting for
    us. Realistically, the only "big" backing stores we need to care about
    are regular files and block devices, both of which are known to do this
    properly, and which have nice well-defined limits for how much data they
    can access.

    So let's special-case just those two known cases, and then limit other
    random mmap users to a backing store that still fits in "unsigned long".
    Realistically, that's not much of a limit at all on 64-bit, and on
    32-bit architectures the only worry might be the GPU drivers, which can
    have big physical address spaces.

    To make it possible for drivers like that to say that they are 64-bit
    clean, this patch does repurpose the "FMODE_UNSIGNED_OFFSET" bit in the
    file flags to allow drivers to mark their file descriptors as safe in
    the full 64-bit mmap address space.

    [ The timing for doing this is less than optimal, and this should really
    go in a merge window. But realistically, this needs wide testing more
    than it needs anything else, and being main-line is the only way to do
    that.

    So the earlier the better, even if it's outside the proper development
    cycle - Linus ]

    Cc: Kees Cook
    Cc: Dan Carpenter
    Cc: Al Viro
    Cc: Willy Tarreau
    Cc: Dave Airlie
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

25 Apr, 2018

1 commit

  • commit ce9962bf7e22bb3891655c349faff618922d4a73

    0day reported warnings at boot on 32-bit systems without NX support:

    attempted to set unsupported pgprot: 8000000000000025 bits: 8000000000000000 supported: 7fffffffffffffff
    WARNING: CPU: 0 PID: 1 at
    arch/x86/include/asm/pgtable.h:540 handle_mm_fault+0xfc1/0xfe0:
    check_pgprot at arch/x86/include/asm/pgtable.h:535
    (inlined by) pfn_pte at arch/x86/include/asm/pgtable.h:549
    (inlined by) do_anonymous_page at mm/memory.c:3169
    (inlined by) handle_pte_fault at mm/memory.c:3961
    (inlined by) __handle_mm_fault at mm/memory.c:4087
    (inlined by) handle_mm_fault at mm/memory.c:4124

    The problem is that due to the recent commit which removed auto-massaging
    of page protections, filtering page permissions at PTE creation time is no
    longer done, so vma->vm_page_prot is passed unfiltered to PTE creation.

    Filter the page protections before they are installed in vma->vm_page_prot.

    Fixes: fb43d6cb91 ("x86/mm: Do not auto-massage page protections")
    Reported-by: Fengguang Wu
    Signed-off-by: Dave Hansen
    Signed-off-by: Thomas Gleixner
    Acked-by: Ingo Molnar
    Cc: Andrea Arcangeli
    Cc: Juergen Gross
    Cc: Kees Cook
    Cc: Josh Poimboeuf
    Cc: Peter Zijlstra
    Cc: David Woodhouse
    Cc: Hugh Dickins
    Cc: linux-mm@kvack.org
    Cc: Linus Torvalds
    Cc: Borislav Petkov
    Cc: Andy Lutomirski
    Cc: Greg Kroah-Hartman
    Cc: Nadav Amit
    Cc: Dan Williams
    Cc: Arjan van de Ven
    Link: https://lkml.kernel.org/r/20180420222028.99D72858@viggo.jf.intel.com

    Dave Hansen
     

17 Apr, 2018

2 commits

  • Mike Rapoport says:

    These patches convert files in Documentation/vm to ReST format, add an
    initial index and link it to the top level documentation.

    There are no content changes in the documentation, except a few spelling
    fixes. The relatively large diffstat stems from the indentation and
    paragraph wrapping changes.

    I've tried to keep the formatting as consistent as possible, but I may have
    missed some places that needed markup and added some markup where it was
    not necessary.

    [jc: significant conflicts in vm/hmm.rst]

    Jonathan Corbet
     
  • Signed-off-by: Mike Rapoport
    Signed-off-by: Jonathan Corbet

    Mike Rapoport
     

12 Apr, 2018

1 commit

  • Patch series "mm: introduce MAP_FIXED_NOREPLACE", v2.

    This started as a follow-up discussion [3][4] resulting in the
    runtime failure caused by the hardening patch [5] which removes MAP_FIXED
    from the elf loader, because MAP_FIXED is inherently dangerous as it
    might silently clobber an existing underlying mapping (e.g. the stack).
    The reason for the failure is that some architectures enforce an
    alignment for the given address hint when MAP_FIXED is not used (e.g. for
    shared or file backed mappings).

    One way around this would be excluding those archs which do alignment
    tricks from the hardening [6]. The patch is really trivial, but it was
    objected to, rightfully so, on the grounds that this screams for a more
    generic solution. We basically want a non-destructive MAP_FIXED.

    The first patch introduced MAP_FIXED_NOREPLACE which enforces the given
    address but unlike MAP_FIXED it fails with EEXIST if the given range
    conflicts with an existing one. The flag is introduced as a completely
    new one rather than a MAP_FIXED extension because of the backward
    compatibility. We really want a never-clobber semantic even on older
    kernels which do not recognize the flag. Unfortunately mmap sucks
    wrt flags evaluation because we do not EINVAL on unknown flags. On
    those kernels we would simply use the traditional hint based semantic so
    the caller can still get a different address (which sucks) but at least
    not silently corrupt an existing mapping. I do not see a good way
    around that, short of not exposing the new semantics to userspace at all.

    It seems there are users who would like to have something like that.
    Jemalloc has been mentioned by Michael Ellerman [7]

    Florian Weimer has mentioned the following:
    : glibc ld.so currently maps DSOs without hints. This means that the kernel
    : will map right next to each other, and the offsets between them a completely
    : predictable. We would like to change that and supply a random address in a
    : window of the address space. If there is a conflict, we do not want the
    : kernel to pick a non-random address. Instead, we would try again with a
    : random address.

    John Hubbard has mentioned CUDA example
    : a) Searches /proc/<pid>/maps for a "suitable" region of available
    : VA space. "Suitable" generally means it has to have a base address
    : within a certain limited range (a particular device model might
    : have odd limitations, for example), it has to be large enough, and
    : alignment has to be large enough (again, various devices may have
    : constraints that lead us to do this).
    :
    : This is of course subject to races with other threads in the process.
    :
    : Let's say it finds a region starting at va.
    :
    : b) Next it does:
    : p = mmap(va, ...)
    :
    : *without* setting MAP_FIXED, of course (so va is just a hint), to
    : attempt to safely reserve that region. If p != va, then in most cases,
    : this is a failure (almost certainly due to another thread getting a
    : mapping from that region before we did), and so this layer now has to
    : call munmap(), before returning a "failure: retry" to upper layers.
    :
    : IMPROVEMENT: --> if instead, we could call this:
    :
    : p = mmap(va, ... MAP_FIXED_NOREPLACE ...)
    :
    : , then we could skip the munmap() call upon failure. This
    : is a small thing, but it is useful here. (Thanks to Piotr
    : Jaroszynski and Mark Hairgrove for helping me get that detail
    : exactly right, btw.)
    :
    : c) After that, CUDA suballocates from p, via:
    :
    : q = mmap(sub_region_start, ... MAP_FIXED ...)
    :
    : Interestingly enough, "freeing" is also done via MAP_FIXED, and
    : setting PROT_NONE to the subregion. Anyway, I just included (c) for
    : general interest.

    Atomic address range probing in the multithreaded programs in general
    sounds like an interesting thing to me.

    The second patch simply replaces MAP_FIXED use in the elf loader by
    MAP_FIXED_NOREPLACE. I believe other places which rely on MAP_FIXED
    should follow. Actually, real MAP_FIXED usages should be documented
    properly and they should be more of an exception.

    [1] http://lkml.kernel.org/r/20171116101900.13621-1-mhocko@kernel.org
    [2] http://lkml.kernel.org/r/20171129144219.22867-1-mhocko@kernel.org
    [3] http://lkml.kernel.org/r/20171107162217.382cd754@canb.auug.org.au
    [4] http://lkml.kernel.org/r/1510048229.12079.7.camel@abdul.in.ibm.com
    [5] http://lkml.kernel.org/r/20171023082608.6167-1-mhocko@kernel.org
    [6] http://lkml.kernel.org/r/20171113094203.aofz2e7kueitk55y@dhcp22.suse.cz
    [7] http://lkml.kernel.org/r/87efp1w7vy.fsf@concordia.ellerman.id.au

    This patch (of 2):

    MAP_FIXED is used quite often to enforce mapping at a particular range.
    The main problem of this flag is, however, that it is inherently dangerous
    because it unmaps existing mappings covered by the requested range. This
    can cause silent memory corruptions, some of them even with serious
    security implications. While the current semantic might be really
    desirable in many cases, there are others which would want to enforce the
    given range but would rather see a failure than a silent memory corruption
    on a clashing range. Please note that there is no guarantee that a given
    range is obeyed by the mmap even when it is free - e.g. arch specific code
    is allowed to apply an alignment.

    Introduce a new MAP_FIXED_NOREPLACE flag for mmap to achieve this
    behavior. It has the same semantic as MAP_FIXED wrt. the given address
    request, with a single exception: it fails with EEXIST if the requested
    address is already covered by an existing mapping. We still rely on
    get_unmapped_area to handle all the arch specific MAP_FIXED treatment and
    check for a conflicting vma after it returns.

    The flag is introduced as a completely new one rather than a MAP_FIXED
    extension because of the backward compatibility. We really want a
    never-clobber semantic even on older kernels which do not recognize the
    flag. Unfortunately mmap sucks wrt. flags evaluation because we do not
    EINVAL on unknown flags. On those kernels we would simply use the
    traditional hint based semantic so the caller can still get a different
    address (which sucks) but at least not silently corrupt an existing
    mapping. I do not see a good way around that.
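
    A hedged userspace usage sketch, including the old-kernel caveat just
    described:

    void *hint = (void *)0x70000000;     /* example address, nothing special */
    size_t len = 2 * 1024 * 1024;
    void *p = mmap(hint, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED_NOREPLACE, -1, 0);

    if (p == MAP_FAILED && errno == EEXIST) {
            /* the range clashes with an existing mapping; pick another hint */
    } else if (p != MAP_FAILED && p != hint) {
            /* an older kernel ignored the flag; undo and fall back */
            munmap(p, len);
    }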

    [mpe@ellerman.id.au: fix whitespace]
    [fail on clashing range with EEXIST as per Florian Weimer]
    [set MAP_FIXED before round_hint_to_min as per Khalid Aziz]
    Link: http://lkml.kernel.org/r/20171213092550.2774-2-mhocko@kernel.org
    Reviewed-by: Khalid Aziz
    Signed-off-by: Michal Hocko
    Acked-by: Michael Ellerman
    Cc: Khalid Aziz
    Cc: Russell King - ARM Linux
    Cc: Andrea Arcangeli
    Cc: Florian Weimer
    Cc: John Hubbard
    Cc: Matthew Wilcox
    Cc: Abdul Haleem
    Cc: Joel Stanley
    Cc: Kees Cook
    Cc: Michal Hocko
    Cc: Jason Evans
    Cc: David Goldblatt
    Cc: Edward Tomasz Napierała
    Cc: Anshuman Khandual
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

06 Apr, 2018

1 commit

  • The documentation for ignore_rlimit_data says that it will print a
    warning at first misuse. Yet it doesn't seem to do that.

    Fix the code to print the warning even when we allow the process to
    continue.

    Link: http://lkml.kernel.org/r/1517935505-9321-1-git-send-email-dwmw@amazon.co.uk
    Signed-off-by: David Woodhouse
    Acked-by: Konstantin Khlebnikov
    Cc: Cyrill Gorcunov
    Cc: Vladimir Davydov
    Cc: Pavel Emelyanov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Woodhouse
     

03 Apr, 2018

1 commit

  • Using this helper allows us to avoid the in-kernel calls to the
    sys_mmap_pgoff() syscall. The ksys_ prefix denotes that this function is
    meant as a drop-in replacement for the syscall. In particular, it uses the
    same calling convention as sys_mmap_pgoff().
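
    A hedged sketch of the shape of the change (simplified):

    unsigned long ksys_mmap_pgoff(unsigned long addr, unsigned long len,
                                  unsigned long prot, unsigned long flags,
                                  unsigned long fd, unsigned long pgoff);

    SYSCALL_DEFINE6(mmap_pgoff, unsigned long, addr, unsigned long, len,
                    unsigned long, prot, unsigned long, flags,
                    unsigned long, fd, unsigned long, pgoff)
    {
            return ksys_mmap_pgoff(addr, len, prot, flags, fd, pgoff);
    }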

    This patch is part of a series which removes in-kernel calls to syscalls.
    On this basis, the syscall entry path can be streamlined. For details, see
    http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net

    Cc: Andrew Morton
    Cc: linux-mm@kvack.org
    Signed-off-by: Dominik Brodowski

    Dominik Brodowski
     

15 Dec, 2017

1 commit

  • David Rientjes has reported the following memory corruption while the
    oom reaper tries to unmap the victim's address space

    BUG: Bad page map in process oom_reaper pte:6353826300000000 pmd:00000000
    addr:00007f50cab1d000 vm_flags:08100073 anon_vma:ffff9eea335603f0 mapping: (null) index:7f50cab1d
    file: (null) fault: (null) mmap: (null) readpage: (null)
    CPU: 2 PID: 1001 Comm: oom_reaper
    Call Trace:
    unmap_page_range+0x1068/0x1130
    __oom_reap_task_mm+0xd5/0x16b
    oom_reaper+0xff/0x14c
    kthread+0xc1/0xe0

    Tetsuo Handa has noticed that the synchronization inside exit_mmap is
    insufficient. We only synchronize with the oom reaper if
    tsk_is_oom_victim, which is not true if the final __mmput is called from
    a different context than the oom victim exit path. This can trivially
    happen from the context of any task which has grabbed an mm reference
    (e.g. to read a /proc/<pid>/ file which requires mm, etc.).

    The race would look like this

    oom_reaper                        oom_victim task
                                      mmget_not_zero
                                      do_exit
                                        mmput
    __oom_reap_task_mm                mmput
                                        __mmput
                                          exit_mmap
                                            remove_vma
    unmap_page_range

    Fix this issue by providing a new mm_is_oom_victim() helper which
    operates on the mm struct rather than a task. Any context which
    operates on a remote mm struct should use this helper in place of
    tsk_is_oom_victim. The flag is set in mark_oom_victim and never cleared
    so it is stable in the exit_mmap path.
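
    A hedged sketch of the helper; the flag name follows the description above:

    static inline bool mm_is_oom_victim(struct mm_struct *mm)
    {
            return test_bit(MMF_OOM_VICTIM, &mm->flags);
    }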

    Debugged by Tetsuo Handa.

    Link: http://lkml.kernel.org/r/20171210095130.17110-1-mhocko@kernel.org
    Fixes: 212925802454 ("mm: oom: let oom_reap_task and exit_mmap run concurrently")
    Signed-off-by: Michal Hocko
    Reported-by: David Rientjes
    Acked-by: David Rientjes
    Cc: Tetsuo Handa
    Cc: Andrea Argangeli
    Cc: [4.14]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

30 Nov, 2017

1 commit

  • Patch series "device-dax: fix unaligned munmap handling"

    When device-dax is operating in huge-page mode we want it to behave like
    hugetlbfs and fail attempts to split vmas into unaligned ranges. It
    would be messy to teach the munmap path about device-dax alignment
    constraints in the same (hstate) way that hugetlbfs communicates this
    constraint. Instead, these patches introduce a new ->split() vm
    operation.

    This patch (of 2):

    The device-dax interface has similar constraints as hugetlbfs in that it
    requires the munmap path to unmap in huge page aligned units. Rather
    than add more custom vma handling code in __split_vma() introduce a new
    vm operation to perform this vma specific check.
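
    A hedged sketch of what such a hook could look like for a huge-page-only
    mapping (names are illustrative):

    static int example_split(struct vm_area_struct *vma, unsigned long addr)
    {
            /* refuse to split at anything but a huge page boundary */
            if (!IS_ALIGNED(addr, PMD_SIZE))
                    return -EINVAL;
            return 0;
    }

    static const struct vm_operations_struct example_vm_ops = {
            .split = example_split,
            /* .fault, .huge_fault, ... */
    };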

    Link: http://lkml.kernel.org/r/151130418135.4029.6783191281930729710.stgit@dwillia2-desk3.amr.corp.intel.com
    Fixes: dee410792419 ("/dev/dax, core: file operations and dax-mmap")
    Signed-off-by: Dan Williams
    Cc: Jeff Moyer
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
     

03 Nov, 2017

1 commit

  • The mmap(2) syscall suffers from the ABI anti-pattern of not validating
    unknown flags. However, proposals like MAP_SYNC need a mechanism to
    define new behavior that is known to fail on older kernels without the
    support. Define a new MAP_SHARED_VALIDATE flag pattern that is
    guaranteed to fail on all legacy mmap implementations.

    It is worth noting that the original proposal was for a standalone
    MAP_VALIDATE flag. However, when that could not be supported by all
    archs Linus observed:

    I see why you *think* you want a bitmap. You think you want
    a bitmap because you want to make MAP_VALIDATE be part of MAP_SYNC
    etc, so that people can do

    ret = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED
    | MAP_SYNC, fd, 0);

    and "know" that MAP_SYNC actually takes.

    And I'm saying that whole wish is bogus. You're fundamentally
    depending on special semantics, just make it explicit. It's already
    not portable, so don't try to make it so.

    Rename that MAP_VALIDATE as MAP_SHARED_VALIDATE, make it have a value
    of 0x3, and make people do

    ret = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED_VALIDATE
    | MAP_SYNC, fd, 0);

    and then the kernel side is easier too (none of that random garbage
    playing games with looking at the "MAP_VALIDATE bit", but just another
    case statement in that map type thing.

    Boom. Done.

    Similar to ->fallocate() we also want the ability to validate the
    support for new flags on a per ->mmap() 'struct file_operations'
    instance basis. Towards that end arrange for flags to be generically
    validated against a mmap_supported_flags exported by 'struct
    file_operations'. By default all existing flags are implicitly
    supported, but new flags require MAP_SHARED_VALIDATE and
    per-instance-opt-in.
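
    A hedged sketch of the per-instance opt-in from a filesystem's point of
    view, using the field name from the description above:

    static const struct file_operations example_fops = {
            .mmap                 = example_mmap,
            .mmap_supported_flags = MAP_SYNC,   /* new flags beyond the legacy set */
            /* ... */
    };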

    Cc: Jan Kara
    Cc: Arnd Bergmann
    Cc: Andy Lutomirski
    Cc: Andrew Morton
    Suggested-by: Christoph Hellwig
    Suggested-by: Linus Torvalds
    Reviewed-by: Ross Zwisler
    Signed-off-by: Dan Williams
    Signed-off-by: Jan Kara
    Signed-off-by: Dan Williams

    Dan Williams
     

09 Sep, 2017

1 commit

  • Allow interval trees to quickly check for overlaps to avoid unnecessary
    tree lookups in interval_tree_iter_first().

    As of this patch, all interval tree flavors will require using a
    'rb_root_cached' such that we can have the leftmost node easily
    available. While most users will make use of this feature, those with
    special functions (in addition to the generic insert, delete, search
    calls) will avoid using the cached option as they can do funky things
    with insertions -- for example, vma_interval_tree_insert_after().
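
    A hedged sketch of the cached flavor for a generic interval tree user:

    struct rb_root_cached root = RB_ROOT_CACHED;
    struct interval_tree_node node = { .start = 0x1000, .last = 0x1fff };
    struct interval_tree_node *hit;

    interval_tree_insert(&node, &root);

    /* the cached leftmost node lets non-overlapping queries bail out early */
    hit = interval_tree_iter_first(&root, 0x1800, 0x2800);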

    [jglisse@redhat.com: fix deadlock from typo vm_lock_anon_vma()]
    Link: http://lkml.kernel.org/r/20170808225719.20723-1-jglisse@redhat.com
    Link: http://lkml.kernel.org/r/20170719014603.19029-12-dave@stgolabs.net
    Signed-off-by: Davidlohr Bueso
    Signed-off-by: Jérôme Glisse
    Acked-by: Christian König
    Acked-by: Peter Zijlstra (Intel)
    Acked-by: Doug Ledford
    Acked-by: Michael S. Tsirkin
    Cc: David Airlie
    Cc: Jason Wang
    Cc: Christian Benvenuti
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     

07 Sep, 2017

3 commits

  • This is purely required because exit_aio() may block and exit_mmap() may
    never start, if the oom_reap_task cannot start running on a mm with
    mm_users == 0.

    At the same time if the OOM reaper doesn't wait at all for the memory of
    the current OOM candidate to be freed by exit_mmap->unmap_vmas, it would
    generate a spurious OOM kill.

    If it wasn't because of the exit_aio or similar blocking functions in
    the last mmput, it would be enough to change the oom_reap_task() in the
    case it finds mm_users == 0, to wait for a timeout or to wait for
    __mmput to set MMF_OOM_SKIP itself, but it's not just exit_mmap the
    problem here so the concurrency of exit_mmap and oom_reap_task is
    apparently warranted.

    It's a non standard runtime, exit_mmap() runs without mmap_sem, and
    oom_reap_task runs with the mmap_sem for reading as usual (kind of
    MADV_DONTNEED).

    The race between the two is solved with a combination of
    tsk_is_oom_victim() (serialized by task_lock) and MMF_OOM_SKIP
    (serialized by a dummy down_write/up_write cycle on the same lines of
    the ksm_exit method).

    If oom_reap_task() may be running concurrently during exit_mmap,
    exit_mmap will wait for it to finish in down_write (before taking down mm
    structures that would make the oom_reap_task fail with a use after free).

    If exit_mmap comes first, oom_reap_task() will skip the mm if
    MMF_OOM_SKIP is already set and in turn all memory is already freed and
    furthermore the mm data structures may already have been taken down by
    free_pgtables.

    [aarcange@redhat.com: incremental one liner]
    Link: http://lkml.kernel.org/r/20170726164319.GC29716@redhat.com
    [rientjes@google.com: remove unused mmput_async]
    Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1708141733130.50317@chino.kir.corp.google.com
    [aarcange@redhat.com: microoptimization]
    Link: http://lkml.kernel.org/r/20170817171240.GB5066@redhat.com
    Link: http://lkml.kernel.org/r/20170726162912.GA29716@redhat.com
    Fixes: 26db62f179d1 ("oom: keep mm of the killed task available")
    Signed-off-by: Andrea Arcangeli
    Signed-off-by: David Rientjes
    Reported-by: David Rientjes
    Tested-by: David Rientjes
    Reviewed-by: Michal Hocko
    Cc: Tetsuo Handa
    Cc: Oleg Nesterov
    Cc: Hugh Dickins
    Cc: "Kirill A. Shutemov"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • A __split_vma is not a worthy event to report, and it's definitely not an
    unmap, so it would be incorrect to report an unmap for the whole region to
    the userfaultfd manager if a __split_vma fails.

    So only call userfaultfd_unmap_prep after the __vma_splitting is over
    and do_munmap cannot fail anymore.

    Also add unlikely because it's better to optimize for the vast majority
    of apps that aren't using userfaultfd in a non cooperative way. Ideally
    we should also find a way to eliminate the branch entirely if
    CONFIG_USERFAULTFD=n, but it would complicate things so stick to
    unlikely for now.

    Link: http://lkml.kernel.org/r/20170802165145.22628-5-aarcange@redhat.com
    Signed-off-by: Andrea Arcangeli
    Cc: "Dr. David Alan Gilbert"
    Cc: Alexey Perevalov
    Cc: Maxime Coquelin
    Cc: Mike Kravetz
    Cc: Mike Rapoport
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • global_page_state is error prone as a recent bug report pointed out [1].
    It only returns proper values for zone based counters as the enum it
    gets suggests. We already have global_node_page_state so let's rename
    global_page_state to global_zone_page_state to be more explicit here.
    All existing users seem to be correct:

    $ git grep "global_page_state(NR_" | sed 's@.*(\(NR_[A-Z_]*\)).*@\1@' | sort | uniq -c
    2 NR_BOUNCE
    2 NR_FREE_CMA_PAGES
    11 NR_FREE_PAGES
    1 NR_KERNEL_STACK_KB
    1 NR_MLOCK
    2 NR_PAGETABLE

    This patch shouldn't introduce any functional change.

    [1] http://lkml.kernel.org/r/201707260628.v6Q6SmaS030814@www262.sakura.ne.jp

    Link: http://lkml.kernel.org/r/20170801134256.5400-2-hannes@cmpxchg.org
    Signed-off-by: Michal Hocko
    Signed-off-by: Johannes Weiner
    Cc: Tetsuo Handa
    Cc: Josef Bacik
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

15 Jul, 2017

1 commit

  • Jörn Engel noticed that the expand_upwards() function might not return
    -ENOMEM in case the requested address is (unsigned long)-PAGE_SIZE and
    the architecture didn't define TASK_SIZE as a multiple of PAGE_SIZE.

    Affected architectures are arm, frv, m68k, blackfin, h8300 and xtensa
    which all define TASK_SIZE as 0xffffffff, but since none of those have
    an upwards-growing stack we currently have no actual issue.

    Nevertheless let's fix this just in case any of the architectures with
    an upward-growing stack (currently parisc, metag and partly ia64) define
    TASK_SIZE similarly.

    Link: http://lkml.kernel.org/r/20170702192452.GA11868@p100.box
    Fixes: bd726c90b6b8 ("Allow stack to grow up to address space limit")
    Signed-off-by: Helge Deller
    Reported-by: Jörn Engel
    Cc: Hugh Dickins
    Cc: Oleg Nesterov
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Helge Deller
     

11 Jul, 2017

3 commits

  • Use rlimit() helper instead of manually writing whole chain from current
    task to rlim_cur.
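
    A hedged before/after of the substitution (the particular resource is just
    an example):

    /* before */
    limit = current->signal->rlim[RLIMIT_STACK].rlim_cur;

    /* after */
    limit = rlimit(RLIMIT_STACK);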

    Link: http://lkml.kernel.org/r/20170705172811.8027-1-k.opasiak@samsung.com
    Signed-off-by: Krzysztof Opasiak
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Krzysztof Opasiak
     
  • expand_stack(vma) fails if address < stack_guard_gap even if there is no
    vma->vm_prev. I don't think this makes sense, and we didn't do this
    before the recent commit 1be7107fbe18 ("mm: larger stack guard gap,
    between vmas").

    We do not need a gap in this case, any address is fine as long as
    security_mmap_addr() doesn't object.

    This also simplifies the code, we know that address >= prev->vm_end and
    thus underflow is not possible.
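
    A hedged, simplified sketch of the resulting check in expand_downwards():

        /* enforce the stack guard gap only when a previous vma actually exists */
        prev = vma->vm_prev;
        if (prev && address - prev->vm_end < stack_guard_gap)
                return -ENOMEM;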

    Link: http://lkml.kernel.org/r/20170628175258.GA24881@redhat.com
    Signed-off-by: Oleg Nesterov
    Acked-by: Michal Hocko
    Cc: Hugh Dickins
    Cc: Larry Woodman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Commit 1be7107fbe18 ("mm: larger stack guard gap, between vmas") has
    introduced a regression in some Rust and Java environments which are
    trying to implement their own stack guard page. They are punching a new
    MAP_FIXED mapping inside the existing stack vma.

    This will confuse expand_{downwards,upwards} into thinking that the
    stack expansion would in fact get us too close to an existing non-stack
    vma which is a correct behavior wrt safety. It is a real regression on
    the other hand.

    Let's work around the problem by considering a PROT_NONE mapping as a part
    of the stack. This is a gross hack, but overflowing into such a mapping
    would trap anyway, and we can only hope that userspace knows what it is
    doing and handles it properly.

    Fixes: 1be7107fbe18 ("mm: larger stack guard gap, between vmas")
    Link: http://lkml.kernel.org/r/20170705182849.GA18027@dhcp22.suse.cz
    Signed-off-by: Michal Hocko
    Debugged-by: Vlastimil Babka
    Cc: Ben Hutchings
    Cc: Willy Tarreau
    Cc: Oleg Nesterov
    Cc: Rik van Riel
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

09 Jul, 2017

1 commit

  • Pull ARM updates from Russell King:

    - add support for ftrace-with-registers, which is needed for kgraft and
    other ftrace tools

    - support for mremap() for the sigpage/vDSO so that checkpoint/restore
    can work

    - add timestamps to each line of the register dump output

    - remove the unused KTHREAD_SIZE from nommu

    - align the ARM bitops APIs with the generic API (using unsigned long
    pointers rather than void pointers)

    - make the configuration of userspace Thumb support an expert option so
    that we can default it on, and avoid some hard to debug userspace
    crashes

    * 'for-linus' of git://git.armlinux.org.uk/~rmk/linux-arm:
    ARM: 8684/1: NOMMU: Remove unused KTHREAD_SIZE definition
    ARM: 8683/1: ARM32: Support mremap() for sigpage/vDSO
    ARM: 8679/1: bitops: Align prototypes to generic API
    ARM: 8678/1: ftrace: Adds support for CONFIG_DYNAMIC_FTRACE_WITH_REGS
    ARM: make configuration of userspace Thumb support an expert option
    ARM: 8673/1: Fix __show_regs output timestamps

    Linus Torvalds
     

07 Jul, 2017

1 commit

  • The protection map is only modified by per-arch init code so it can be
    protected from writes after the init code runs.

    This change was extracted from PaX where it's part of KERNEXEC.

    Link: http://lkml.kernel.org/r/20170510174441.26163-1-danielmicay@gmail.com
    Signed-off-by: Daniel Micay
    Acked-by: Kees Cook
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daniel Micay
     

22 Jun, 2017

2 commits

  • Fix expand_upwards() on architectures with an upward-growing stack (parisc,
    metag and partly IA-64) to allow the stack to reliably grow exactly up to
    the address space limit given by TASK_SIZE.

    Signed-off-by: Helge Deller
    Acked-by: Hugh Dickins
    Signed-off-by: Linus Torvalds

    Helge Deller
     
  • Trinity gets kernel BUG at mm/mmap.c:1963! in about 3 minutes of
    mmap testing. That's the VM_BUG_ON(gap_end < gap_start) at the
    end of unmapped_area_topdown(). Linus points out how MAP_FIXED
    (which does not have to respect our stack guard gap intentions)
    could result in gap_end below gap_start there. Fix that, and
    the similar case in its alternative, unmapped_area().

    Cc: stable@vger.kernel.org
    Fixes: 1be7107fbe18 ("mm: larger stack guard gap, between vmas")
    Reported-by: Dave Jones
    Debugged-by: Linus Torvalds
    Signed-off-by: Hugh Dickins
    Acked-by: Michal Hocko
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

21 Jun, 2017

1 commit

  • CRIU restores application mappings at the same place where they
    were before checkpointing. That means that we need to move the vDSO
    and sigpage during restore to exactly the same place where
    they were before C/R.

    Make mremap() code update mm->context.{sigpage,vdso} pointers
    during VMA move. Sigpage is used for landing after handling
    a signal - if the pointer is not updated during moving, the
    application might crash on any signal after mremap().

    vDSO pointer on ARM32 is used only for setting auxv at this moment,
    update it during mremap() in case of future usage.

    Without those updates, CRIU on ARM32 is currently not reliable.
    Historically, we fail checkpointing if we find a vDSO page on ARM32
    and suggest the user disable CONFIG_VDSO.
    But that's not correct - that idea comes from x86, where signal processing
    ends in the vDSO blob. For arm32 it's the sigpage, which is not disabled
    with `CONFIG_VDSO=n'.

    Looks like C/R was working by luck - because userspace on ARM32 at
    this moment always sets SA_RESTORER.

    Signed-off-by: Dmitry Safonov
    Acked-by: Andy Lutomirski
    Cc: linux-arm-kernel@lists.infradead.org
    Cc: Will Deacon
    Cc: Thomas Gleixner
    Cc: Cyrill Gorcunov
    Cc: Pavel Emelyanov
    Cc: Christopher Covington
    Signed-off-by: Russell King

    Dmitry Safonov
     

19 Jun, 2017

1 commit

  • Stack guard page is a useful feature to reduce the risk of stack smashing
    into a different mapping. We have been using a single page gap which
    is sufficient to prevent having stack adjacent to a different mapping.
    But this seems to be insufficient in the light of the stack usage in
    userspace. E.g. glibc uses as large as 64kB alloca() in many commonly
    used functions. Others use constructs like gid_t buffer[NGROUPS_MAX]
    which is 256kB or stack strings with MAX_ARG_STRLEN.

    This becomes especially dangerous for suid binaries and the default
    unlimited stack size limit, because those applications can be
    tricked into consuming a large portion of the stack, and a single glibc
    call could jump over the guard page. These attacks are not theoretical,
    unfortunately.

    Make those attacks less probable by increasing the stack guard gap
    to 1MB (on systems with 4k pages; but make it depend on the page size
    because systems with larger base pages might cap stack allocations in
    the PAGE_SIZE units) which should cover larger alloca() and VLA stack
    allocations. It is obviously not a full fix because the problem is
    somehow inherent, but it should reduce attack space a lot.

    One could argue that the gap size should be configurable from userspace,
    but that can be done later when somebody finds that the new 1MB is wrong
    for some special case applications. For now, add a kernel command line
    option (stack_guard_gap) to specify the stack gap size (in page units).

    Implementation wise, first delete all the old code for stack guard page:
    because although we could get away with accounting one extra page in a
    stack vma, accounting a larger gap can break userspace - case in point,
    a program run with "ulimit -S -v 20000" failed when the 1MB gap was
    counted for RLIMIT_AS; similar problems could come with RLIMIT_MLOCK
    and strict non-overcommit mode.

    Instead of keeping the gap inside the stack vma, maintain the stack guard
    gap as a gap between vmas: using vm_start_gap() in place of vm_start
    (or vm_end_gap() in place of vm_end if VM_GROWSUP) in just those few
    places which need to respect the gap - mainly arch_get_unmapped_area(),
    and the vma tree's subtree_gap support for that.
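
    A hedged sketch of the vm_start_gap() helper described above:

    static inline unsigned long vm_start_gap(struct vm_area_struct *vma)
    {
            unsigned long vm_start = vma->vm_start;

            if (vma->vm_flags & VM_GROWSDOWN) {
                    vm_start -= stack_guard_gap;
                    if (vm_start > vma->vm_start)   /* guard against underflow */
                            vm_start = 0;
            }
            return vm_start;
    }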

    Original-patch-by: Oleg Nesterov
    Original-patch-by: Michal Hocko
    Signed-off-by: Hugh Dickins
    Acked-by: Michal Hocko
    Tested-by: Helge Deller # parisc
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

04 May, 2017

1 commit

  • Commit 091d0d55b286 ("shm: fix null pointer deref when userspace
    specifies invalid hugepage size") had replaced MAP_HUGE_MASK with
    SHM_HUGE_MASK. Though both of them contain the same numeric value of
    0x3f, MAP_HUGE_MASK flag sounds more appropriate than the other one in
    the context. Hence change it back.
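
    For context, a hedged sketch of the decoding this mask is used for in the
    mmap path (the surrounding code is omitted):

    /* the huge page size is encoded as log2(size) in the high mmap flag bits */
    unsigned long huge_page_shift = (flags >> MAP_HUGE_SHIFT) & MAP_HUGE_MASK;
    /* e.g. 21 for 2MB pages, 30 for 1GB pages */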

    Link: http://lkml.kernel.org/r/20170404045635.616-1-khandual@linux.vnet.ibm.com
    Signed-off-by: Anshuman Khandual
    Reviewed-by: Matthew Wilcox
    Acked-by: Balbir Singh
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Anshuman Khandual