06 Jan, 2017

1 commit

  • commit bfedb589252c01fa505ac9f6f2a3d5d68d707ef4 upstream.

    During exec dumpable is cleared if the file that is being executed is
    not readable by the user executing the file. A bug in
    ptrace_may_access allows reading the file if the executable happens to
    enter into a subordinate user namespace (aka clone(CLONE_NEWUSER),
    unshare(CLONE_NEWUSER), or setns(fd, CLONE_NEWUSER).

    This problem is fixed with only necessary userspace breakage by adding
    a user namespace owner to mm_struct, captured at the time of exec, so
    it is clear in which user namespace CAP_SYS_PTRACE must be present in
    to be able to safely give read permission to the executable.

    The function ptrace_may_access is modified to verify that the ptracer
    has CAP_SYS_ADMIN in task->mm->user_ns instead of task->cred->user_ns.
    This ensures that if the task changes it's cred into a subordinate
    user namespace it does not become ptraceable.

    The function ptrace_attach is modified to only set PT_PTRACE_CAP when
    CAP_SYS_PTRACE is held over task->mm->user_ns. The intent of
    PT_PTRACE_CAP is to be a flag to note that whatever permission changes
    the task might go through the tracer has sufficient permissions for
    it not to be an issue. task->cred->user_ns is always the same
    as or descendent of mm->user_ns. Which guarantees that having
    CAP_SYS_PTRACE over mm->user_ns is the worst case for the tasks
    credentials.

    To prevent regressions mm->dumpable and mm->user_ns are not considered
    when a task has no mm. As simply failing ptrace_may_attach causes
    regressions in privileged applications attempting to read things
    such as /proc//stat

    Acked-by: Kees Cook
    Tested-by: Cyrill Gorcunov
    Fixes: 8409cca70561 ("userns: allow ptrace from non-init user namespaces")
    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: Greg Kroah-Hartman

    Eric W. Biederman
     

07 Dec, 2016

1 commit

  • The shmem hole punching with fallocate(FALLOC_FL_PUNCH_HOLE) does not
    want to race with generating new pages by faulting them in.

    However, the wait-queue used to delay the page faulting has a serious
    problem: the wait queue head (in shmem_fallocate()) is allocated on the
    stack, and the code expects that "wake_up_all()" will make sure that all
    the queue entries are gone before the stack frame is de-allocated.

    And that is not at all necessarily the case.

    Yes, a normal wake-up sequence will remove the wait-queue entry that
    caused the wakeup (see "autoremove_wake_function()"), but the key
    wording there is "that caused the wakeup". When there are multiple
    possible wakeup sources, the wait queue entry may well stay around.

    And _particularly_ in a page fault path, we may be faulting in new pages
    from user space while we also have other things going on, and there may
    well be other pending wakeups.

    So despite the "wake_up_all()", it's not at all guaranteed that all list
    entries are removed from the wait queue head on the stack.

    Fix this by introducing a new wakeup function that removes the list
    entry unconditionally, even if the target process had already woken up
    for other reasons. Use that "synchronous" function to set up the
    waiters in shmem_fault().

    This problem has never been seen in the wild afaik, but Dave Jones has
    reported it on and off while running trinity. We thought we fixed the
    stack corruption with the blk-mq rq_list locking fix (commit
    7fe311302f7d: "blk-mq: update hardware and software queues for sleeping
    alloc"), but it turns out there was _another_ stack corruptor hiding
    in the trinity runs.

    Vegard Nossum (also running trinity) was able to trigger this one fairly
    consistently, and made us look once again at the shmem code due to the
    faults often being in that area.

    Reported-and-tested-by: Vegard Nossum .
    Reported-by: Dave Jones
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

03 Dec, 2016

2 commits

  • Boris Zhmurov has reported RCU stalls during the kswapd reclaim:

    INFO: rcu_sched detected stalls on CPUs/tasks:
    23-...: (22 ticks this GP) idle=92f/140000000000000/0 softirq=2638404/2638404 fqs=23
    (detected by 4, t=6389 jiffies, g=786259, c=786258, q=42115)
    Task dump for CPU 23:
    kswapd1 R running task 0 148 2 0x00000008
    Call Trace:
    shrink_node+0xd2/0x2f0
    kswapd+0x2cb/0x6a0
    mem_cgroup_shrink_node+0x160/0x160
    kthread+0xbd/0xe0
    __switch_to+0x1fa/0x5c0
    ret_from_fork+0x1f/0x40
    kthread_create_on_node+0x180/0x180

    a closer code inspection has shown that we might indeed miss all the
    scheduling points in the reclaim path if no pages can be isolated from
    the LRU list. This is a pathological case but other reports from Donald
    Buczek have shown that we might indeed hit such a path:

    clusterd-989 [009] .... 118023.654491: mm_vmscan_direct_reclaim_end: nr_reclaimed=193
    kswapd1-86 [001] dN.. 118023.987475: mm_vmscan_lru_isolate: isolate_mode=0 classzone=0 order=0 nr_requested=32 nr_scanned=4239830 nr_taken=0 file=1
    kswapd1-86 [001] dN.. 118024.320968: mm_vmscan_lru_isolate: isolate_mode=0 classzone=0 order=0 nr_requested=32 nr_scanned=4239844 nr_taken=0 file=1
    kswapd1-86 [001] dN.. 118024.654375: mm_vmscan_lru_isolate: isolate_mode=0 classzone=0 order=0 nr_requested=32 nr_scanned=4239858 nr_taken=0 file=1
    kswapd1-86 [001] dN.. 118024.987036: mm_vmscan_lru_isolate: isolate_mode=0 classzone=0 order=0 nr_requested=32 nr_scanned=4239872 nr_taken=0 file=1
    kswapd1-86 [001] dN.. 118025.319651: mm_vmscan_lru_isolate: isolate_mode=0 classzone=0 order=0 nr_requested=32 nr_scanned=4239886 nr_taken=0 file=1
    kswapd1-86 [001] dN.. 118025.652248: mm_vmscan_lru_isolate: isolate_mode=0 classzone=0 order=0 nr_requested=32 nr_scanned=4239900 nr_taken=0 file=1
    kswapd1-86 [001] dN.. 118025.984870: mm_vmscan_lru_isolate: isolate_mode=0 classzone=0 order=0 nr_requested=32 nr_scanned=4239914 nr_taken=0 file=1
    [...]
    kswapd1-86 [001] dN.. 118084.274403: mm_vmscan_lru_isolate: isolate_mode=0 classzone=0 order=0 nr_requested=32 nr_scanned=4241133 nr_taken=0 file=1

    this is minute long snapshot which didn't take a single page from the
    LRU. It is not entirely clear why only 1303 pages have been scanned
    during that time (maybe there was a heavy IRQ activity interfering).

    In any case it looks like we can really hit long periods without
    scheduling on non preemptive kernels so an explicit cond_resched() in
    shrink_node_memcg which is independent on the reclaim operation is due.

    Link: http://lkml.kernel.org/r/20161202095841.16648-1-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Reported-by: Boris Zhmurov
    Tested-by: Boris Zhmurov
    Reported-by: Donald Buczek
    Reported-by: "Christopher S. Aker"
    Reported-by: Paul Menzel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Commit 0a6b76dd23fa ("mm: workingset: make shadow node shrinker memcg
    aware") has made the workingset shadow nodes shrinker memcg aware. The
    implementation is not correct though because memcg_kmem_enabled() might
    become true while we are doing a global reclaim when the sc->memcg might
    be NULL which is exactly what Marek has seen:

    BUG: unable to handle kernel NULL pointer dereference at 0000000000000400
    IP: [] mem_cgroup_node_nr_lru_pages+0x20/0x40
    PGD 0
    Oops: 0000 [#1] SMP
    CPU: 0 PID: 60 Comm: kswapd0 Tainted: G O 4.8.10-12.pvops.qubes.x86_64 #1
    task: ffff880011863b00 task.stack: ffff880011868000
    RIP: mem_cgroup_node_nr_lru_pages+0x20/0x40
    RSP: e02b:ffff88001186bc70 EFLAGS: 00010293
    RAX: 0000000000000000 RBX: ffff88001186bd20 RCX: 0000000000000002
    RDX: 000000000000000c RSI: 0000000000000000 RDI: 0000000000000000
    RBP: ffff88001186bc70 R08: 28f5c28f5c28f5c3 R09: 0000000000000000
    R10: 0000000000006c34 R11: 0000000000000333 R12: 00000000000001f6
    R13: ffffffff81c6f6a0 R14: 0000000000000000 R15: 0000000000000000
    FS: 0000000000000000(0000) GS:ffff880013c00000(0000) knlGS:ffff880013d00000
    CS: e033 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 0000000000000400 CR3: 00000000122f2000 CR4: 0000000000042660
    Call Trace:
    count_shadow_nodes+0x9a/0xa0
    shrink_slab.part.42+0x119/0x3e0
    shrink_node+0x22c/0x320
    kswapd+0x32c/0x700
    kthread+0xd8/0xf0
    ret_from_fork+0x1f/0x40
    Code: 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 3b 35 dd eb b1 00 55 48 89 e5 73 2c 89 d2 31 c9 31 c0 4c 63 ce 48 0f a3 ca 73 13 8b b4 cf 00 04 00 00 41 89 c8 4a 03 84 c6 80 00 00 00 83 c1
    RIP mem_cgroup_node_nr_lru_pages+0x20/0x40
    RSP
    CR2: 0000000000000400
    ---[ end trace 100494b9edbdfc4d ]---

    This patch fixes the issue by checking sc->memcg rather than
    memcg_kmem_enabled() which is sufficient because shrink_slab makes sure
    that only memcg aware shrinkers will get non-NULL memcgs and only if
    memcg_kmem_enabled is true.

    Fixes: 0a6b76dd23fa ("mm: workingset: make shadow node shrinker memcg aware")
    Link: http://lkml.kernel.org/r/20161201132156.21450-1-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Reported-by: Marek Marczykowski-Górecki
    Tested-by: Marek Marczykowski-Górecki
    Acked-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Acked-by: Balbir Singh
    Cc: [4.6+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

01 Dec, 2016

5 commits

  • Hugetlb pages have ->index in size of the huge pages (PMD_SIZE or
    PUD_SIZE), not in PAGE_SIZE as other types of pages. This means we
    cannot user page_to_pgoff() to check whether we've got the right page
    for the radix-tree index.

    Let's introduce page_to_index() which would return radix-tree index for
    given page.

    We will be able to get rid of this once hugetlb will be switched to
    multi-order entries.

    Fixes: fc127da085c2 ("truncate: handle file thp")
    Link: http://lkml.kernel.org/r/20161123093053.mjbnvn5zwxw5e6lk@black.fi.intel.com
    Signed-off-by: Kirill A. Shutemov
    Reported-by: Doug Nelson
    Tested-by: Doug Nelson
    Reviewed-by: Naoya Horiguchi
    Cc: [4.8+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Gcc revision 241896 implements use-after-scope detection. Will be
    available in gcc 7. Support it in KASAN.

    Gcc emits 2 new callbacks to poison/unpoison large stack objects when
    they go in/out of scope. Implement the callbacks and add a test.

    [dvyukov@google.com: v3]
    Link: http://lkml.kernel.org/r/1479998292-144502-1-git-send-email-dvyukov@google.com
    Link: http://lkml.kernel.org/r/1479226045-145148-1-git-send-email-dvyukov@google.com
    Signed-off-by: Dmitry Vyukov
    Acked-by: Andrey Ryabinin
    Cc: Alexander Potapenko
    Cc: [4.0+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dmitry Vyukov
     
  • kasan_global struct is part of compiler/runtime ABI. gcc revision
    241983 has added a new field to kasan_global struct. Update kernel
    definition of kasan_global struct to include the new field.

    Without this patch KASAN is broken with gcc 7.

    Link: http://lkml.kernel.org/r/1479219743-28682-1-git-send-email-dvyukov@google.com
    Signed-off-by: Dmitry Vyukov
    Acked-by: Andrey Ryabinin
    Cc: Alexander Potapenko
    Cc: [4.0+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dmitry Vyukov
     
  • The following program triggers BUG() in munlock_vma_pages_range():

    // autogenerated by syzkaller (http://github.com/google/syzkaller)
    #include

    int main()
    {
    mmap((void*)0x20105000ul, 0xc00000ul, 0x2ul, 0x2172ul, -1, 0);
    mremap((void*)0x201fd000ul, 0x4000ul, 0xc00000ul, 0x3ul, 0x203f0000ul);
    return 0;
    }

    The test-case constructs the situation when munlock_vma_pages_range()
    finds PTE-mapped THP-head in the middle of page table and, by mistake,
    skips HPAGE_PMD_NR pages after that.

    As result, on the next iteration it hits the middle of PMD-mapped THP
    and gets upset seeing mlocked tail page.

    The solution is only skip HPAGE_PMD_NR pages if the THP was mlocked
    during munlock_vma_page(). It would guarantee that the page is
    PMD-mapped as we never mlock PTE-mapeed THPs.

    Fixes: e90309c9f772 ("thp: allow mlocked THP again")
    Link: http://lkml.kernel.org/r/20161115132703.7s7rrgmwttegcdh4@black.fi.intel.com
    Signed-off-by: Kirill A. Shutemov
    Reported-by: Dmitry Vyukov
    Cc: Konstantin Khlebnikov
    Cc: Andrey Ryabinin
    Cc: syzkaller
    Cc: Andrea Arcangeli
    Cc: [4.5+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Commit b46e756f5e47 ("thp: extract khugepaged from mm/huge_memory.c")
    moved code from huge_memory.c to khugepaged.c. Some of this code should
    be compiled only when CONFIG_SYSFS is enabled but the condition around
    this code was not moved into khugepaged.c.

    The result is a compilation error when CONFIG_SYSFS is disabled:

    mm/built-in.o: In function `khugepaged_defrag_store': khugepaged.c:(.text+0x2d095): undefined reference to `single_hugepage_flag_store'
    mm/built-in.o: In function `khugepaged_defrag_show': khugepaged.c:(.text+0x2d0ab): undefined reference to `single_hugepage_flag_show'

    This commit adds the #ifdef CONFIG_SYSFS around the code related to
    sysfs.

    Link: http://lkml.kernel.org/r/20161114203448.24197-1-jeremy.lefaure@lse.epita.fr
    Signed-off-by: Jérémy Lefaure
    Acked-by: Kirill A. Shutemov
    Acked-by: Hillf Danton
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jérémy Lefaure
     

30 Nov, 2016

1 commit

  • Linus found there still is a race in mremap after commit 5d1904204c99
    ("mremap: fix race between mremap() and page cleanning").

    As described by Linus:
    "the issue is that another thread might make the pte be dirty (in the
    hardware walker, so no locking of ours will make any difference)
    *after* we checked whether it was dirty, but *before* we removed it
    from the page tables"

    Fix it by moving the check after we removed it from the page table.

    Suggested-by: Linus Torvalds
    Signed-off-by: Aaron Lu
    Signed-off-by: Linus Torvalds

    Aaron Lu
     

18 Nov, 2016

1 commit

  • Prior to 3.15, there was a race between zap_pte_range() and
    page_mkclean() where writes to a page could be lost. Dave Hansen
    discovered by inspection that there is a similar race between
    move_ptes() and page_mkclean().

    We've been able to reproduce the issue by enlarging the race window with
    a msleep(), but have not been able to hit it without modifying the code.
    So, we think it's a real issue, but is difficult or impossible to hit in
    practice.

    The zap_pte_range() issue is fixed by commit 1cf35d47712d("mm: split
    'tlb_flush_mmu()' into tlb flushing and memory freeing parts"). And
    this patch is to fix the race between page_mkclean() and mremap().

    Here is one possible way to hit the race: suppose a process mmapped a
    file with READ | WRITE and SHARED, it has two threads and they are bound
    to 2 different CPUs, e.g. CPU1 and CPU2. mmap returned X, then thread
    1 did a write to addr X so that CPU1 now has a writable TLB for addr X
    on it. Thread 2 starts mremaping from addr X to Y while thread 1
    cleaned the page and then did another write to the old addr X again.
    The 2nd write from thread 1 could succeed but the value will get lost.

    thread 1 thread 2
    (bound to CPU1) (bound to CPU2)

    1: write 1 to addr X to get a
    writeable TLB on this CPU

    2: mremap starts

    3: move_ptes emptied PTE for addr X
    and setup new PTE for addr Y and
    then dropped PTL for X and Y

    4: page laundering for N by doing
    fadvise FADV_DONTNEED. When done,
    pageframe N is deemed clean.

    5: *write 2 to addr X

    6: tlb flush for addr X

    7: munmap (Y, pagesize) to make the
    page unmapped

    8: fadvise with FADV_DONTNEED again
    to kick the page off the pagecache

    9: pread the page from file to verify
    the value. If 1 is there, it means
    we have lost the written 2.

    *the write may or may not cause segmentation fault, it depends on
    if the TLB is still on the CPU.

    Please note that this is only one specific way of how the race could
    occur, it didn't mean that the race could only occur in exact the above
    config, e.g. more than 2 threads could be involved and fadvise() could
    be done in another thread, etc.

    For anonymous pages, they could race between mremap() and page reclaim:
    THP: a huge PMD is moved by mremap to a new huge PMD, then the new huge
    PMD gets unmapped/splitted/pagedout before the flush tlb happened for
    the old huge PMD in move_page_tables() and we could still write data to
    it. The normal anonymous page has similar situation.

    To fix this, check for any dirty PTE in move_ptes()/move_huge_pmd() and
    if any, did the flush before dropping the PTL. If we did the flush for
    every move_ptes()/move_huge_pmd() call then we do not need to do the
    flush in move_pages_tables() for the whole range. But if we didn't, we
    still need to do the whole range flush.

    Alternatively, we can track which part of the range is flushed in
    move_ptes()/move_huge_pmd() and which didn't to avoid flushing the whole
    range in move_page_tables(). But that would require multiple tlb
    flushes for the different sub-ranges and should be less efficient than
    the single whole range flush.

    KBuild test on my Sandybridge desktop doesn't show any noticeable change.
    v4.9-rc4:
    real 5m14.048s
    user 32m19.800s
    sys 4m50.320s

    With this commit:
    real 5m13.888s
    user 32m19.330s
    sys 4m51.200s

    Reported-by: Dave Hansen
    Signed-off-by: Aaron Lu
    Signed-off-by: Linus Torvalds

    Aaron Lu
     

12 Nov, 2016

9 commits

  • Limit the number of kmemleak false positives by including
    .data.ro_after_init in memory scanning. To achieve this we need to add
    symbols for start and end of the section to the linker scripts.

    The problem was been uncovered by commit 56989f6d8568 ("genetlink: mark
    families as __ro_after_init").

    Link: http://lkml.kernel.org/r/1478274173-15218-1-git-send-email-jakub.kicinski@netronome.com
    Reviewed-by: Catalin Marinas
    Signed-off-by: Jakub Kicinski
    Cc: Arnd Bergmann
    Cc: Cong Wang
    Cc: Johannes Berg
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jakub Kicinski
     
  • While testing OBJFREELIST_SLAB integration with pagealloc, we found a
    bug where kmem_cache(sys) would be created with both CFLGS_OFF_SLAB &
    CFLGS_OBJFREELIST_SLAB. When it happened, critical allocations needed
    for loading drivers or creating new caches will fail.

    The original kmem_cache is created early making OFF_SLAB not possible.
    When kmem_cache(sys) is created, OFF_SLAB is possible and if pagealloc
    is enabled it will try to enable it first under certain conditions.
    Given kmem_cache(sys) reuses the original flag, you can have both flags
    at the same time resulting in allocation failures and odd behaviors.

    This fix discards allocator specific flags from memcg before calling
    create_cache.

    The bug exists since 4.6-rc1 and affects testing debug pagealloc
    configurations.

    Fixes: b03a017bebc4 ("mm/slab: introduce new slab management type, OBJFREELIST_SLAB")
    Link: http://lkml.kernel.org/r/1478553075-120242-1-git-send-email-thgarnie@google.com
    Signed-off-by: Greg Thelen
    Signed-off-by: Thomas Garnier
    Tested-by: Thomas Garnier
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Vladimir Davydov
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Greg Thelen
     
  • Starting from 4.9-rc1 kernel, I started noticing some test failures of
    sendfile(2) and splice(2) (sendfile0N and splice01 from LTP) when
    testing on sub-page block size filesystems (tested both XFS and ext4),
    these syscalls start to return EIO in the tests. e.g.

    sendfile02 1 TFAIL : sendfile02.c:133: sendfile(2) failed to return expected value, expected: 26, got: -1
    sendfile02 2 TFAIL : sendfile02.c:133: sendfile(2) failed to return expected value, expected: 24, got: -1
    sendfile02 3 TFAIL : sendfile02.c:133: sendfile(2) failed to return expected value, expected: 22, got: -1
    sendfile02 4 TFAIL : sendfile02.c:133: sendfile(2) failed to return expected value, expected: 20, got: -1

    This is because that in sub-page block size cases, we don't need the
    whole page to be uptodate, only the part we care about is uptodate is OK
    (if fs has ->is_partially_uptodate defined).

    But page_cache_pipe_buf_confirm() doesn't have the ability to check the
    partially-uptodate case, it needs the whole page to be uptodate. So it
    returns EIO in this case.

    This is a regression introduced by commit 82c156f85384 ("switch
    generic_file_splice_read() to use of ->read_iter()"). Prior to the
    change, generic_file_splice_read() doesn't allow partially-uptodate page
    either, so it worked fine.

    Fix it by skipping the partially-uptodate check if we're working on a
    pipe in do_generic_file_read(), so we read the whole page from disk as
    long as the page is not uptodate.

    I think the other way to fix it is to add the ability to check & allow
    partially-uptodate page to page_cache_pipe_buf_confirm(), but that is
    much harder to do and seems gain little.

    Link: http://lkml.kernel.org/r/1477986187-12717-1-git-send-email-guaneryu@gmail.com
    Signed-off-by: Eryu Guan
    Reviewed-by: Jan Kara
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eryu Guan
     
  • Error paths in hugetlb_cow() and hugetlb_no_page() may free a newly
    allocated huge page.

    If a reservation was associated with the huge page, alloc_huge_page()
    consumed the reservation while allocating. When the newly allocated
    page is freed in free_huge_page(), it will increment the global
    reservation count. However, the reservation entry in the reserve map
    will remain.

    This is not an issue for shared mappings as the entry in the reserve map
    indicates a reservation exists. But, an entry in a private mapping
    reserve map indicates the reservation was consumed and no longer exists.
    This results in an inconsistency between the reserve map and the global
    reservation count. This 'leaks' a reserved huge page.

    Create a new routine restore_reserve_on_error() to restore the reserve
    entry in these specific error paths. This routine makes use of a new
    function vma_add_reservation() which will add a reserve entry for a
    specific address/page.

    In general, these error paths were rarely (if ever) taken on most
    architectures. However, powerpc contained arch specific code that that
    resulted in an extra fault and execution of these error paths on all
    private mappings.

    Fixes: 67961f9db8c4 ("mm/hugetlb: fix huge page reserve accounting for private mappings)
    Link: http://lkml.kernel.org/r/1476933077-23091-2-git-send-email-mike.kravetz@oracle.com
    Signed-off-by: Mike Kravetz
    Reported-by: Jan Stancek
    Tested-by: Jan Stancek
    Reviewed-by: Aneesh Kumar K.V
    Acked-by: Hillf Danton
    Cc: Naoya Horiguchi
    Cc: Michal Hocko
    Cc: Kirill A . Shutemov
    Cc: Dave Hansen
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     
  • When memory_failure() runs on a thp tail page after pmd is split, we
    trigger the following VM_BUG_ON_PAGE():

    page:ffffd7cd819b0040 count:0 mapcount:0 mapping: (null) index:0x1
    flags: 0x1fffc000400000(hwpoison)
    page dumped because: VM_BUG_ON_PAGE(!page_count(p))
    ------------[ cut here ]------------
    kernel BUG at /src/linux-dev/mm/memory-failure.c:1132!

    memory_failure() passed refcount and page lock from tail page to head
    page, which is not needed because we can pass any subpage to
    split_huge_page().

    Fixes: 61f5d698cc97 ("mm: re-enable THP")
    Link: http://lkml.kernel.org/r/1477961577-7183-1-git-send-email-n-horiguchi@ah.jp.nec.com
    Signed-off-by: Naoya Horiguchi
    Cc: [4.5+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • When root activates a swap partition whose header has the wrong
    endianness, nr_badpages elements of badpages are swabbed before
    nr_badpages has been checked, leading to a buffer overrun of up to 8GB.

    This normally is not a security issue because it can only be exploited
    by root (more specifically, a process with CAP_SYS_ADMIN or the ability
    to modify a swap file/partition), and such a process can already e.g.
    modify swapped-out memory of any other userspace process on the system.

    Link: http://lkml.kernel.org/r/1477949533-2509-1-git-send-email-jann@thejh.net
    Signed-off-by: Jann Horn
    Acked-by: Kees Cook
    Acked-by: Jerome Marchand
    Acked-by: Johannes Weiner
    Cc: "Kirill A. Shutemov"
    Cc: Vlastimil Babka
    Cc: Hugh Dickins
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jann Horn
     
  • CMA allocation request size is represented by size_t that gets truncated
    when same is passed as int to bitmap_find_next_zero_area_off.

    We observe that during fuzz testing when cma allocation request is too
    high, bitmap_find_next_zero_area_off still returns success due to the
    truncation. This leads to kernel crash, as subsequent code assumes that
    requested memory is available.

    Fail cma allocation in case the request breaches the corresponding cma
    region size.

    Link: http://lkml.kernel.org/r/1478189211-3467-1-git-send-email-shashim@codeaurora.org
    Signed-off-by: Shiraz Hashim
    Cc: Catalin Marinas
    Cc: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shiraz Hashim
     
  • If shmem_alloc_page() does not set PageLocked and PageSwapBacked, then
    shmem_replace_page() needs to do so for itself. Without this, it puts
    newpage on the wrong lru, re-unlocks the unlocked newpage, and system
    descends into "Bad page" reports and freeze; or if CONFIG_DEBUG_VM=y, it
    hits an earlier VM_BUG_ON_PAGE(!PageLocked), depending on config.

    But shmem_replace_page() is not a common path: it's only called when
    swapin (or swapoff) finds the page was already read into an unsuitable
    zone: usually all zones are suitable, but gem objects for a few drm
    devices (gma500, omapdrm, crestline, broadwater) require zone DMA32 if
    there's more than 4GB of ram.

    Fixes: 800d8c63b2e9 ("shmem: add huge pages support")
    Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1611062003510.11253@eggly.anvils
    Signed-off-by: Hugh Dickins
    Acked-by: Kirill A. Shutemov
    Cc: [4.8.x]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Commit 63f53dea0c98 ("mm: warn about allocations which stall for too
    long") by error embedded "\n" in the format string, resulting in strange
    output.

    [ 722.876655] kworker/0:1: page alloction stalls for 160001ms, order:0
    [ 722.876656] , mode:0x2400000(GFP_NOIO)
    [ 722.876657] CPU: 0 PID: 6966 Comm: kworker/0:1 Not tainted 4.8.0+ #69

    Link: http://lkml.kernel.org/r/1476026219-7974-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp
    Signed-off-by: Tetsuo Handa
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tetsuo Handa
     

02 Nov, 2016

1 commit


01 Nov, 2016

1 commit

  • The stack frame size could grow too large when the plugin used long long
    on 32-bit architectures when the given function had too many basic blocks.

    The gcc warning was:

    drivers/pci/hotplug/ibmphp_ebda.c: In function 'ibmphp_access_ebda':
    drivers/pci/hotplug/ibmphp_ebda.c:409:1: warning: the frame size of 1108 bytes is larger than 1024 bytes [-Wframe-larger-than=]

    This switches latent_entropy from u64 to unsigned long.

    Thanks to PaX Team and Emese Revfy for the patch.

    Signed-off-by: Kees Cook

    Kees Cook
     

28 Oct, 2016

10 commits

  • Merge misc fixes from Andrew Morton:
    "20 fixes"

    * emailed patches from Andrew Morton :
    drivers/misc/sgi-gru/grumain.c: remove bogus 0x prefix from printk
    cris/arch-v32: cryptocop: print a hex number after a 0x prefix
    ipack: print a hex number after a 0x prefix
    block: DAC960: print a hex number after a 0x prefix
    fs: exofs: print a hex number after a 0x prefix
    lib/genalloc.c: start search from start of chunk
    mm: memcontrol: do not recurse in direct reclaim
    CREDITS: update credit information for Martin Kepplinger
    proc: fix NULL dereference when reading /proc//auxv
    mm: kmemleak: ensure that the task stack is not freed during scanning
    lib/stackdepot.c: bump stackdepot capacity from 16MB to 128MB
    latent_entropy: raise CONFIG_FRAME_WARN by default
    kconfig.h: remove config_enabled() macro
    ipc: account for kmem usage on mqueue and msg
    mm/slab: improve performance of gathering slabinfo stats
    mm: page_alloc: use KERN_CONT where appropriate
    mm/list_lru.c: avoid error-path NULL pointer deref
    h8300: fix syscall restarting
    kcov: properly check if we are in an interrupt
    mm/slab: fix kmemcg cache creation delayed issue

    Linus Torvalds
     
  • On 4.0, we saw a stack corruption from a page fault entering direct
    memory cgroup reclaim, calling into btrfs_releasepage(), which then
    tried to allocate an extent and recursed back into a kmem charge ad
    nauseam:

    [...]
    btrfs_releasepage+0x2c/0x30
    try_to_release_page+0x32/0x50
    shrink_page_list+0x6da/0x7a0
    shrink_inactive_list+0x1e5/0x510
    shrink_lruvec+0x605/0x7f0
    shrink_zone+0xee/0x320
    do_try_to_free_pages+0x174/0x440
    try_to_free_mem_cgroup_pages+0xa7/0x130
    try_charge+0x17b/0x830
    memcg_charge_kmem+0x40/0x80
    new_slab+0x2d9/0x5a0
    __slab_alloc+0x2fd/0x44f
    kmem_cache_alloc+0x193/0x1e0
    alloc_extent_state+0x21/0xc0
    __clear_extent_bit+0x2b5/0x400
    try_release_extent_mapping+0x1a3/0x220
    __btrfs_releasepage+0x31/0x70
    btrfs_releasepage+0x2c/0x30
    try_to_release_page+0x32/0x50
    shrink_page_list+0x6da/0x7a0
    shrink_inactive_list+0x1e5/0x510
    shrink_lruvec+0x605/0x7f0
    shrink_zone+0xee/0x320
    do_try_to_free_pages+0x174/0x440
    try_to_free_mem_cgroup_pages+0xa7/0x130
    try_charge+0x17b/0x830
    mem_cgroup_try_charge+0x65/0x1c0
    handle_mm_fault+0x117f/0x1510
    __do_page_fault+0x177/0x420
    do_page_fault+0xc/0x10
    page_fault+0x22/0x30

    On later kernels, kmem charging is opt-in rather than opt-out, and that
    particular kmem allocation in btrfs_releasepage() is no longer being
    charged and won't recurse and overrun the stack anymore.

    But it's not impossible for an accounted allocation to happen from the
    memcg direct reclaim context, and we needed to reproduce this crash many
    times before we even got a useful stack trace out of it.

    Like other direct reclaimers, mark tasks in memcg reclaim PF_MEMALLOC to
    avoid recursing into any other form of direct reclaim. Then let
    recursive charges from PF_MEMALLOC contexts bypass the cgroup limit.

    Link: http://lkml.kernel.org/r/20161025141050.GA13019@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Vladimir Davydov
    Cc: Tejun Heo
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Commit 68f24b08ee89 ("sched/core: Free the stack early if
    CONFIG_THREAD_INFO_IN_TASK") may cause the task->stack to be freed
    during kmemleak_scan() execution, leading to either a NULL pointer fault
    (if task->stack is NULL) or kmemleak accessing already freed memory.

    This patch uses the new try_get_task_stack() API to ensure that the task
    stack is not freed during kmemleak stack scanning.

    Addresses https://bugzilla.kernel.org/show_bug.cgi?id=173901.

    Fixes: 68f24b08ee89 ("sched/core: Free the stack early if CONFIG_THREAD_INFO_IN_TASK")
    Link: http://lkml.kernel.org/r/1476266223-14325-1-git-send-email-catalin.marinas@arm.com
    Signed-off-by: Catalin Marinas
    Reported-by: CAI Qian
    Tested-by: CAI Qian
    Acked-by: Michal Hocko
    Cc: Andy Lutomirski
    Cc: CAI Qian
    Cc: Hillf Danton
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Catalin Marinas
     
  • On large systems, when some slab caches grow to millions of objects (and
    many gigabytes), running 'cat /proc/slabinfo' can take up to 1-2
    seconds. During this time, interrupts are disabled while walking the
    slab lists (slabs_full, slabs_partial, and slabs_free) for each node,
    and this sometimes causes timeouts in other drivers (for instance,
    Infiniband).

    This patch optimizes 'cat /proc/slabinfo' by maintaining a counter for
    total number of allocated slabs per node, per cache. This counter is
    updated when a slab is created or destroyed. This enables us to skip
    traversing the slabs_full list while gathering slabinfo statistics, and
    since slabs_full tends to be the biggest list when the cache is large,
    it results in a dramatic performance improvement. Getting slabinfo
    statistics now only requires walking the slabs_free and slabs_partial
    lists, and those lists are usually much smaller than slabs_full.

    We tested this after growing the dentry cache to 70GB, and the
    performance improved from 2s to 5ms.

    Link: http://lkml.kernel.org/r/1472517876-26814-1-git-send-email-aruna.ramakrishna@oracle.com
    Signed-off-by: Aruna Ramakrishna
    Acked-by: David Rientjes
    Cc: Mike Kravetz
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Aruna Ramakrishna
     
  • Recent changes to printk require KERN_CONT uses to continue logging
    messages. So add KERN_CONT where necessary.

    [akpm@linux-foundation.org: coding-style fixes]
    Fixes: 4bcc595ccd80 ("printk: reinstate KERN_CONT for printing continuation lines")
    Link: http://lkml.kernel.org/r/c7df37c8665134654a17aaeb8b9f6ace1d6db58b.1476239034.git.joe@perches.com
    Reported-by: Mark Rutland
    Signed-off-by: Joe Perches
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Perches
     
  • As described in https://bugzilla.kernel.org/show_bug.cgi?id=177821:

    After some analysis it seems to be that the problem is in alloc_super().
    In case list_lru_init_memcg() fails it goes into destroy_super(), which
    calls list_lru_destroy().

    And in list_lru_init() we see that in case memcg_init_list_lru() fails,
    lru->node is freed, but not set NULL, which then leads list_lru_destroy()
    to believe it is initialized and call memcg_destroy_list_lru().
    memcg_destroy_list_lru() in turn can access lru->node[i].memcg_lrus,
    which is NULL.

    [akpm@linux-foundation.org: add comment]
    Signed-off-by: Alexander Polakov
    Acked-by: Vladimir Davydov
    Cc: Al Viro
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexander Polakov
     
  • There is a bug report that SLAB makes extreme load average due to over
    2000 kworker thread.

    https://bugzilla.kernel.org/show_bug.cgi?id=172981

    This issue is caused by kmemcg feature that try to create new set of
    kmem_caches for each memcg. Recently, kmem_cache creation is slowed by
    synchronize_sched() and futher kmem_cache creation is also delayed since
    kmem_cache creation is synchronized by a global slab_mutex lock. So,
    the number of kworker that try to create kmem_cache increases quietly.

    synchronize_sched() is for lockless access to node's shared array but
    it's not needed when a new kmem_cache is created. So, this patch rules
    out that case.

    Fixes: 801faf0db894 ("mm/slab: lockless decision to grow cache")
    Link: http://lkml.kernel.org/r/1475734855-4837-1-git-send-email-iamjoonsoo.kim@lge.com
    Reported-by: Doug Smythies
    Tested-by: Doug Smythies
    Signed-off-by: Joonsoo Kim
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • No, KASAN may not be able to co-exist with HOTPLUG_MEMORY at runtime,
    but for build testing there is no reason not to allow them together.

    This hopefully means better build coverage and fewer embarrasing silly
    problems like the one fixed by commit 9db4f36e82c2 ("mm: remove unused
    variable in memory hotplug") in the future.

    Cc: Stephen Rothwell
    Cc: Andrey Ryabinin
    Cc: Alexander Potapenko
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     
  • When I removed the per-zone bitlock hashed waitqueues in commit
    9dcb8b685fc3 ("mm: remove per-zone hashtable of bitlock waitqueues"), I
    removed all the magic hotplug memory initialization of said waitqueues
    too.

    But when I actually _tested_ the resulting build, I stupidly assumed
    that "allmodconfig" would enable memory hotplug. And it doesn't,
    because it enables KASAN instead, which then disables hotplug memory
    support.

    As a result, my build test of the per-zone waitqueues was totally
    broken, and I didn't notice that the compiler warns about the now unused
    iterator variable 'i'.

    I guess I should be happy that that seems to be the worst breakage from
    my clearly horribly failed test coverage.

    Reported-by: Stephen Rothwell
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     
  • The per-zone waitqueues exist because of a scalability issue with the
    page waitqueues on some NUMA machines, but it turns out that they hurt
    normal loads, and now with the vmalloced stacks they also end up
    breaking gfs2 that uses a bit_wait on a stack object:

    wait_on_bit(&gh->gh_iflags, HIF_WAIT, TASK_UNINTERRUPTIBLE)

    where 'gh' can be a reference to the local variable 'mount_gh' on the
    stack of fill_super().

    The reason the per-zone hash table breaks for this case is that there is
    no "zone" for virtual allocations, and trying to look up the physical
    page to get at it will fail (with a BUG_ON()).

    It turns out that I actually complained to the mm people about the
    per-zone hash table for another reason just a month ago: the zone lookup
    also hurts the regular use of "unlock_page()" a lot, because the zone
    lookup ends up forcing several unnecessary cache misses and generates
    horrible code.

    As part of that earlier discussion, we had a much better solution for
    the NUMA scalability issue - by just making the page lock have a
    separate contention bit, the waitqueue doesn't even have to be looked at
    for the normal case.

    Peter Zijlstra already has a patch for that, but let's see if anybody
    even notices. In the meantime, let's fix the actual gfs2 breakage by
    simplifying the bitlock waitqueues and removing the per-zone issue.

    Reported-by: Andreas Gruenbacher
    Tested-by: Bob Peterson
    Acked-by: Mel Gorman
    Cc: Peter Zijlstra
    Cc: Andy Lutomirski
    Cc: Steven Whitehouse
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

25 Oct, 2016

1 commit

  • This patch unexports the low-level __get_user_pages() function.

    Recent refactoring of the get_user_pages* functions allow flags to be
    passed through get_user_pages() which eliminates the need for access to
    this function from its one user, kvm.

    We can see that the two calls to get_user_pages() which replace
    __get_user_pages() in kvm_main.c are equivalent by examining their call
    stacks:

    get_user_page_nowait():
    get_user_pages(start, 1, flags, page, NULL)
    __get_user_pages_locked(current, current->mm, start, 1, page, NULL, NULL,
    false, flags | FOLL_TOUCH)
    __get_user_pages(current, current->mm, start, 1,
    flags | FOLL_TOUCH | FOLL_GET, page, NULL, NULL)

    check_user_page_hwpoison():
    get_user_pages(addr, 1, flags, NULL, NULL)
    __get_user_pages_locked(current, current->mm, addr, 1, NULL, NULL, NULL,
    false, flags | FOLL_TOUCH)
    __get_user_pages(current, current->mm, addr, 1, flags | FOLL_TOUCH, NULL,
    NULL, NULL)

    Signed-off-by: Lorenzo Stoakes
    Acked-by: Paolo Bonzini
    Acked-by: Michal Hocko
    Signed-off-by: Linus Torvalds

    Lorenzo Stoakes
     

23 Oct, 2016

1 commit

  • Pull vmap stack fixes from Ingo Molnar:
    "This is fallout from CONFIG_HAVE_ARCH_VMAP_STACK=y on x86: stack
    accesses that used to be just somewhat questionable are now totally
    buggy.

    These changes try to do it without breaking the ABI: the fields are
    left there, they are just reporting zero, or reporting narrower
    information (the maps file change)"

    * 'mm-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    mm: Change vm_is_stack_for_task() to vm_is_stack_for_current()
    fs/proc: Stop trying to report thread stacks
    fs/proc: Stop reporting eip and esp in /proc/PID/stat
    mm/numa: Remove duplicated include from mprotect.c

    Linus Torvalds
     

20 Oct, 2016

1 commit

  • Asking for a non-current task's stack can't be done without races
    unless the task is frozen in kernel mode. As far as I know,
    vm_is_stack_for_task() never had a safe non-current use case.

    The __unused annotation is because some KSTK_ESP implementations
    ignore their parameter, which IMO is further justification for this
    patch.

    Signed-off-by: Andy Lutomirski
    Acked-by: Thomas Gleixner
    Cc: Al Viro
    Cc: Andrew Morton
    Cc: Borislav Petkov
    Cc: Brian Gerst
    Cc: Jann Horn
    Cc: Kees Cook
    Cc: Linus Torvalds
    Cc: Linux API
    Cc: Peter Zijlstra
    Cc: Tycho Andersen
    Link: http://lkml.kernel.org/r/4c3f68f426e6c061ca98b4fc7ef85ffbb0a25b0c.1475257877.git.luto@kernel.org
    Signed-off-by: Ingo Molnar

    Andy Lutomirski
     

19 Oct, 2016

5 commits

  • Merge the gup_flags cleanups from Lorenzo Stoakes:
    "This patch series adjusts functions in the get_user_pages* family such
    that desired FOLL_* flags are passed as an argument rather than
    implied by flags.

    The purpose of this change is to make the use of FOLL_FORCE explicit
    so it is easier to grep for and clearer to callers that this flag is
    being used. The use of FOLL_FORCE is an issue as it overrides missing
    VM_READ/VM_WRITE flags for the VMA whose pages we are reading
    from/writing to, which can result in surprising behaviour.

    The patch series came out of the discussion around commit 38e088546522
    ("mm: check VMA flags to avoid invalid PROT_NONE NUMA balancing"),
    which addressed a BUG_ON() being triggered when a page was faulted in
    with PROT_NONE set but having been overridden by FOLL_FORCE.
    do_numa_page() was run on the assumption the page _must_ be one marked
    for NUMA node migration as an actual PROT_NONE page would have been
    dealt with prior to this code path, however FOLL_FORCE introduced a
    situation where this assumption did not hold.

    See

    https://marc.info/?l=linux-mm&m=147585445805166

    for the patch proposal"

    Additionally, there's a fix for an ancient bug related to FOLL_FORCE and
    FOLL_WRITE by me.

    [ This branch was rebased recently to add a few more acked-by's and
    reviewed-by's ]

    * gup_flag-cleanups:
    mm: replace access_process_vm() write parameter with gup_flags
    mm: replace access_remote_vm() write parameter with gup_flags
    mm: replace __access_remote_vm() write parameter with gup_flags
    mm: replace get_user_pages_remote() write/force parameters with gup_flags
    mm: replace get_user_pages() write/force parameters with gup_flags
    mm: replace get_vaddr_frames() write/force parameters with gup_flags
    mm: replace get_user_pages_locked() write/force parameters with gup_flags
    mm: replace get_user_pages_unlocked() write/force parameters with gup_flags
    mm: remove write/force parameters from __get_user_pages_unlocked()
    mm: remove write/force parameters from __get_user_pages_locked()
    mm: remove gup_flags FOLL_WRITE games from __get_user_pages()

    Linus Torvalds
     
  • This removes the 'write' argument from access_process_vm() and replaces
    it with 'gup_flags' as use of this function previously silently implied
    FOLL_FORCE, whereas after this patch callers explicitly pass this flag.

    We make this explicit as use of FOLL_FORCE can result in surprising
    behaviour (and hence bugs) within the mm subsystem.

    Signed-off-by: Lorenzo Stoakes
    Acked-by: Jesper Nilsson
    Acked-by: Michal Hocko
    Acked-by: Michael Ellerman
    Signed-off-by: Linus Torvalds

    Lorenzo Stoakes
     
  • Signed-off-by: Wei Yongjun
    Cc: Dave Hansen
    Cc: linux-mm@kvack.org
    Cc: Andrew Morton
    Cc: Mel Gorman
    Cc: "Kirill A. Shutemov"
    Link: http://lkml.kernel.org/r/1476719259-6214-1-git-send-email-weiyj.lk@gmail.com
    Signed-off-by: Thomas Gleixner

    Wei Yongjun
     
  • This removes the 'write' argument from access_remote_vm() and replaces
    it with 'gup_flags' as use of this function previously silently implied
    FOLL_FORCE, whereas after this patch callers explicitly pass this flag.

    We make this explicit as use of FOLL_FORCE can result in surprising
    behaviour (and hence bugs) within the mm subsystem.

    Signed-off-by: Lorenzo Stoakes
    Acked-by: Michal Hocko
    Signed-off-by: Linus Torvalds

    Lorenzo Stoakes
     
  • This removes the 'write' argument from __access_remote_vm() and replaces
    it with 'gup_flags' as use of this function previously silently implied
    FOLL_FORCE, whereas after this patch callers explicitly pass this flag.

    We make this explicit as use of FOLL_FORCE can result in surprising
    behaviour (and hence bugs) within the mm subsystem.

    Signed-off-by: Lorenzo Stoakes
    Acked-by: Michal Hocko
    Signed-off-by: Linus Torvalds

    Lorenzo Stoakes