24 Apr, 2018

5 commits

  • commit 2e898e4c0a3897ccd434adac5abb8330194f527b upstream.

    lock_page_memcg()/unlock_page_memcg() use spin_lock_irqsave/restore() if
    the page's memcg is undergoing move accounting, which occurs when a
    process leaves its memcg for a new one that has
    memory.move_charge_at_immigrate set.

    unlocked_inode_to_wb_begin,end() use spin_lock_irq/spin_unlock_irq() if
    the given inode is switching writeback domains. Switches occur when
    enough writes are issued from a new domain.

    This existing pattern is thus suspicious:

      lock_page_memcg(page);
      unlocked_inode_to_wb_begin(inode, &locked);
      ...
      unlocked_inode_to_wb_end(inode, locked);
      unlock_page_memcg(page);

    If an inode switch and a process memcg migration are both in-flight then
    unlocked_inode_to_wb_end() will unconditionally enable interrupts while
    still holding the lock_page_memcg() irq spinlock. This suggests the
    possibility of deadlock if an interrupt occurs before unlock_page_memcg():

    truncate
      __cancel_dirty_page
        lock_page_memcg
          unlocked_inode_to_wb_begin
          unlocked_inode_to_wb_end
            <interrupts mistakenly enabled>
                                    <interrupt>
                                    end_page_writeback
                                      test_clear_page_writeback
                                        lock_page_memcg
                                          <deadlock>
          unlock_page_memcg
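    In plain spinlock terms, the unsafe nesting looks roughly like the
    sketch below (an illustration only, with stand-in locks, not the
    kernel's actual code):

    #include <linux/spinlock.h>

    /* outer_lock stands in for the memcg move_lock, inner_lock for the
     * inode's writeback switch lock */
    static DEFINE_SPINLOCK(outer_lock);
    static DEFINE_SPINLOCK(inner_lock);

    static void unsafe_nesting(void)
    {
            unsigned long flags;

            spin_lock_irqsave(&outer_lock, flags);  /* like lock_page_memcg() during move accounting */

            spin_lock_irq(&inner_lock);     /* like unlocked_inode_to_wb_begin() during a switch */
            /* ... */
            spin_unlock_irq(&inner_lock);   /* unconditionally re-enables IRQs: the bug */

            /* an interrupt taken here can try to take outer_lock -> deadlock */
            spin_unlock_irqrestore(&outer_lock, flags);
    }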

    Due to configuration limitations this deadlock is not currently possible
    because we don't mix cgroup writeback (a cgroupv2 feature) and
    memory.move_charge_at_immigrate (a cgroupv1 feature).

    If the kernel is hacked to always claim inode switching and memcg
    moving_account, then this script triggers a lockup in less than a minute:

    cd /mnt/cgroup/memory
    mkdir a b
    echo 1 > a/memory.move_charge_at_immigrate
    echo 1 > b/memory.move_charge_at_immigrate
    (
      echo $BASHPID > a/cgroup.procs
      while true; do
        dd if=/dev/zero of=/mnt/big bs=1M count=256
      done
    ) &
    while true; do
      sync
    done &
    sleep 1h &
    SLEEP=$!
    while true; do
      echo $SLEEP > a/cgroup.procs
      echo $SLEEP > b/cgroup.procs
    done

    The deadlock does not seem possible, so it's debatable if there's any
    reason to modify the kernel. I suggest we should, to prevent future
    surprises. And Wang Long said "this deadlock occurs three times in our
    environment", so there's more reason to apply this, even to stable.
    Stable 4.4 has minor conflicts applying this patch. For a clean 4.4 patch
    see "[PATCH for-4.4] writeback: safer lock nesting"
    https://lkml.org/lkml/2018/4/11/146

    [gthelen@google.com: v4]
    Link: http://lkml.kernel.org/r/20180411084653.254724-1-gthelen@google.com
    [akpm@linux-foundation.org: comment tweaks, struct initialization simplification]
    Change-Id: Ibb773e8045852978f6207074491d262f1b3fb613
    Link: http://lkml.kernel.org/r/20180410005908.167976-1-gthelen@google.com
    Fixes: 682aa8e1a6a1 ("writeback: implement unlocked_inode_to_wb transaction and use it for stat updates")
    Signed-off-by: Greg Thelen
    Reported-by: Wang Long
    Acked-by: Wang Long
    Acked-by: Michal Hocko
    Reviewed-by: Andrew Morton
    Cc: Johannes Weiner
    Cc: Tejun Heo
    Cc: Nicholas Piggin
    Cc: [v4.2+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    [natechancellor: Adjust context due to lack of b93b016313b3b]
    Signed-off-by: Nathan Chancellor
    Signed-off-by: Greg Kroah-Hartman

    Greg Thelen
     
  • commit abc1be13fd113ddef5e2d807a466286b864caed3 upstream.

    f2fs specifies the __GFP_ZERO flag for allocating some of its pages.
    Unfortunately, the page cache also uses the mapping's GFP flags for
    allocating radix tree nodes. It always masks off the __GFP_HIGHMEM
    flag, and masks off __GFP_ZERO in some paths, but not all. That causes
    radix tree nodes to be allocated with a NULL list_head, which causes
    backtraces like:

    __list_del_entry+0x30/0xd0
    list_lru_del+0xac/0x1ac
    page_cache_tree_insert+0xd8/0x110

    The __GFP_DMA and __GFP_DMA32 flags would also be able to sneak through
    if they are ever used. Fix them all by using GFP_RECLAIM_MASK at the
    innermost location, and remove it from earlier in the callchain.
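    The shape of the fix, as a hedged sketch (the real patch touches
    several call sites; GFP_RECLAIM_MASK is the mm-internal mask of
    reclaim-related bits):

    /* when borrowing the mapping's GFP mask for radix tree node
     * allocation, keep only the reclaim-related bits so flags meant for
     * the data pages (__GFP_ZERO, __GFP_HIGHMEM, __GFP_DMA, ...) cannot
     * leak into node allocation */
    gfp_t node_gfp = mapping_gfp_mask(mapping) & GFP_RECLAIM_MASK;

    error = radix_tree_maybe_preload(node_gfp);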

    Link: http://lkml.kernel.org/r/20180411060320.14458-2-willy@infradead.org
    Fixes: 449dd6984d0e ("mm: keep page cache radix tree nodes in check")
    Signed-off-by: Matthew Wilcox
    Reported-by: Chris Fries
    Debugged-by: Minchan Kim
    Acked-by: Johannes Weiner
    Acked-by: Michal Hocko
    Reviewed-by: Jan Kara
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Matthew Wilcox
     
  • commit a9f2a846f0503e7d729f552e3ccfe2279010fe94 upstream.

    cache_reap() is initially scheduled in start_cpu_timer() via
    schedule_delayed_work_on(). But then the next iterations are scheduled
    via schedule_delayed_work(), i.e. using WORK_CPU_UNBOUND.

    Thus since commit ef557180447f ("workqueue: schedule WORK_CPU_UNBOUND
    work on wq_unbound_cpumask CPUs") there is no guarantee the future
    iterations will run on the originally intended cpu, although it's still
    preferred. I was able to demonstrate this with
    /sys/module/workqueue/parameters/debug_force_rr_cpu. IIUC, it may also
    happen due to migrating timers in nohz context. As a result, some cpus
    would be calling cache_reap() more frequently and others never.

    This patch uses schedule_delayed_work_on() with the current cpu when
    scheduling the next iteration.
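    The change is essentially a one-liner; a hedged sketch of how
    cache_reap() re-arms itself after the fix (REAPTIMEOUT_AC is the
    existing reap interval in mm/slab.c):

    /* re-arm the per-cpu reap work on the CPU it just ran on, instead of
     * letting schedule_delayed_work() pick any CPU allowed by
     * wq_unbound_cpumask */
    schedule_delayed_work_on(smp_processor_id(), work,
                             round_jiffies_relative(REAPTIMEOUT_AC));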

    Link: http://lkml.kernel.org/r/20180411070007.32225-1-vbabka@suse.cz
    Fixes: ef557180447f ("workqueue: schedule WORK_CPU_UNBOUND work on wq_unbound_cpumask CPUs")
    Signed-off-by: Vlastimil Babka
    Acked-by: Pekka Enberg
    Acked-by: Christoph Lameter
    Cc: Joonsoo Kim
    Cc: David Rientjes
    Cc: Tejun Heo
    Cc: Lai Jiangshan
    Cc: John Stultz
    Cc: Thomas Gleixner
    Cc: Stephen Boyd
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Vlastimil Babka
     
  • commit c719547f032d4610c7a20900baacae26d0b1ff3e upstream.

    The private field of the mm_walk struct points to an hmm_vma_walk struct,
    not to the hmm_range struct that is wanted. Fix the code to fetch the
    proper struct pointer.
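    A hedged sketch of the corrected dereference inside the page table
    walk callbacks (the range member name is as used in mm/hmm.c at the
    time):

    /* walk->private carries the walk state, which in turn holds the range */
    struct hmm_vma_walk *hmm_vma_walk = walk->private;
    struct hmm_range *range = hmm_vma_walk->range;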

    Link: http://lkml.kernel.org/r/20180323005527.758-6-jglisse@redhat.com
    Signed-off-by: Jérôme Glisse
    Cc: Evgeny Baskakov
    Cc: Ralph Campbell
    Cc: Mark Hairgrove
    Cc: John Hubbard
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Jérôme Glisse
     
  • commit a38c015f3156895b07e71d4e4414289f8a3b2745 upstream.

    When using KSM with use_zero_pages, we replace anonymous pages
    containing only zeroes with actual zero pages, which are not anonymous.
    We need to do proper accounting of the mm counters, otherwise we will
    get wrong values in /proc and a BUG message in dmesg when tearing down
    the mm.
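    A hedged sketch of the accounting fix: when the anonymous page is
    replaced by the shared zero page, the anonymous RSS counter has to
    drop (the actual patch does this in KSM's replace_page()):

    /* the anonymous page goes away and is replaced by the non-anonymous
     * zero page, so account for it */
    dec_mm_counter(mm, MM_ANONPAGES);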

    Link: http://lkml.kernel.org/r/1522931274-15552-1-git-send-email-imbrenda@linux.vnet.ibm.com
    Fixes: e86c59b1b1 ("mm/ksm: improve deduplication of zero pages with colouring")
    Signed-off-by: Claudio Imbrenda
    Reviewed-by: Andrew Morton
    Cc: Andrea Arcangeli
    Cc: Minchan Kim
    Cc: Kirill A. Shutemov
    Cc: Hugh Dickins
    Cc: Christian Borntraeger
    Cc: Gerald Schaefer
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Claudio Imbrenda
     

19 Apr, 2018

1 commit

  • commit c61611f70958d86f659bca25c02ae69413747a8d upstream.

    get_user_pages_fast is supposed to be a faster drop-in equivalent of
    get_user_pages. As such, callers expect it to return a negative return
    code when passed an invalid address, and never expect it to return 0
    when passed a positive number of pages, since its documentation says:

    * Returns number of pages pinned. This may be fewer than the number
    * requested. If nr_pages is 0 or negative, returns 0. If no pages
    * were pinned, returns -errno.

    When get_user_pages_fast falls back on get_user_pages this is exactly
    what happens. Unfortunately the implementation is inconsistent: it
    returns 0 if passed a kernel address, confusing callers. For example,
    the following is pretty common but does not appear to do the right thing
    with a kernel address:

    ret = get_user_pages_fast(addr, 1, writeable, &page);
    if (ret < 0)
            return ret;

    Change get_user_pages_fast to return -EFAULT when supplied a kernel
    address to make it match expectations.

    All callers have been audited for consistency with the documented
    semantics.
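    A hedged sketch of the guard the change amounts to in the generic gup
    path (using the three-argument access_ok() of that era; details
    simplified):

    /* reject kernel addresses up front with -EFAULT instead of silently
     * returning 0 for a positive nr_pages */
    if (unlikely(!access_ok(write ? VERIFY_WRITE : VERIFY_READ,
                            (void __user *)start, nr_pages << PAGE_SHIFT)))
            return -EFAULT;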

    Link: http://lkml.kernel.org/r/1522962072-182137-4-git-send-email-mst@redhat.com
    Fixes: 5b65c4677a57 ("mm, x86/mm: Fix performance regression in get_user_pages_fast()")
    Signed-off-by: Michael S. Tsirkin
    Reported-by: syzbot+6304bf97ef436580fede@syzkaller.appspotmail.com
    Reviewed-by: Andrew Morton
    Cc: Kirill A. Shutemov
    Cc: Huang Ying
    Cc: Jonathan Corbet
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Thorsten Leemhuis
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Michael S. Tsirkin
     

08 Apr, 2018

1 commit

  • commit 47504ee04b9241548ae2c28be7d0b01cff3b7aa6 upstream.

    Percpu memory using the vmalloc area based chunk allocator lazily
    populates chunks by first requesting the full virtual address space
    required for the chunk and subsequently adding pages as allocations come
    through. To ensure atomic allocations can succeed, a workqueue item is
    used to maintain a minimum number of empty pages. In certain scenarios,
    such as reported in [1], it is possible that physical memory becomes
    quite scarce which can result in either a rather long time spent trying
    to find free pages or worse, a kernel panic.

    This patch adds support for __GFP_NORETRY and __GFP_NOWARN, passing them
    through to the underlying allocators. This should prevent any
    unnecessary panics potentially caused by the workqueue item. The gfp is
    passed around as additional flags rather than as a full set of flags.
    The next patch will change these to caller-passed semantics.
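    A hedged sketch of the idea for the background balance path (names
    simplified; the real patch threads a gfp_t parameter through the
    percpu allocator internals):

    /* the workqueue refill is best effort: don't retry hard and don't
     * warn when memory is tight, just try again later */
    const gfp_t gfp = GFP_KERNEL | __GFP_NORETRY | __GFP_NOWARN;

    page = alloc_pages_node(nid, gfp, 0);
    if (!page)
            return -ENOMEM;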

    V2:
    Added const modifier to gfp flags in the balance path.
    Removed an extra whitespace.

    [1] https://lkml.org/lkml/2018/2/12/551

    Signed-off-by: Dennis Zhou
    Suggested-by: Daniel Borkmann
    Reported-by: syzbot+adb03f3f0bb57ce3acda@syzkaller.appspotmail.com
    Acked-by: Christoph Lameter
    Signed-off-by: Tejun Heo
    Signed-off-by: Greg Kroah-Hartman

    Dennis Zhou
     

29 Mar, 2018

7 commits

  • commit 1c610d5f93c709df56787f50b3576704ac271826 upstream.

    Commit 726d061fbd36 ("mm: vmscan: kick flushers when we encounter dirty
    pages on the LRU") added flusher invocation to shrink_inactive_list()
    when many dirty pages on the LRU are encountered.

    However, shrink_inactive_list() doesn't wake up flushers for legacy
    cgroup reclaim, so the next commit bbef938429f5 ("mm: vmscan: remove old
    flusher wakeup from direct reclaim path") removed the only source of
    flusher's wake up in legacy mem cgroup reclaim path.

    This leads to premature OOM if there are too many dirty pages in the cgroup:
    # mkdir /sys/fs/cgroup/memory/test
    # echo $$ > /sys/fs/cgroup/memory/test/tasks
    # echo 50M > /sys/fs/cgroup/memory/test/memory.limit_in_bytes
    # dd if=/dev/zero of=tmp_file bs=1M count=100
    Killed

    dd invoked oom-killer: gfp_mask=0x14000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=0

    Call Trace:
    dump_stack+0x46/0x65
    dump_header+0x6b/0x2ac
    oom_kill_process+0x21c/0x4a0
    out_of_memory+0x2a5/0x4b0
    mem_cgroup_out_of_memory+0x3b/0x60
    mem_cgroup_oom_synchronize+0x2ed/0x330
    pagefault_out_of_memory+0x24/0x54
    __do_page_fault+0x521/0x540
    page_fault+0x45/0x50

    Task in /test killed as a result of limit of /test
    memory: usage 51200kB, limit 51200kB, failcnt 73
    memory+swap: usage 51200kB, limit 9007199254740988kB, failcnt 0
    kmem: usage 296kB, limit 9007199254740988kB, failcnt 0
    Memory cgroup stats for /test: cache:49632KB rss:1056KB rss_huge:0KB shmem:0KB
    mapped_file:0KB dirty:49500KB writeback:0KB swap:0KB inactive_anon:0KB
    active_anon:1168KB inactive_file:24760KB active_file:24960KB unevictable:0KB
    Memory cgroup out of memory: Kill process 3861 (bash) score 88 or sacrifice child
    Killed process 3876 (dd) total-vm:8484kB, anon-rss:1052kB, file-rss:1720kB, shmem-rss:0kB
    oom_reaper: reaped process 3876 (dd), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

    Wake up flushers in legacy cgroup reclaim too.
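    A hedged sketch of the resulting logic in shrink_inactive_list()
    (single-argument wakeup_flusher_threads() as in kernels of that era):

    /* if the whole isolated batch was dirty but not yet queued for
     * writeback, kick the flusher threads -- for cgroup (legacy) reclaim
     * as well as for global reclaim */
    if (stat.nr_unqueued_dirty == nr_taken)
            wakeup_flusher_threads(WB_REASON_VMSCAN);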

    Link: http://lkml.kernel.org/r/20180315164553.17856-1-aryabinin@virtuozzo.com
    Fixes: bbef938429f5 ("mm: vmscan: remove old flusher wakeup from direct reclaim path")
    Signed-off-by: Andrey Ryabinin
    Tested-by: Shakeel Butt
    Acked-by: Michal Hocko
    Cc: Mel Gorman
    Cc: Tejun Heo
    Cc: Johannes Weiner
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Andrey Ryabinin
     
  • commit f59f1caf72ba00d519c793c3deb32cd3be32edc2 upstream.

    This reverts commit b92df1de5d28 ("mm: page_alloc: skip over regions of
    invalid pfns where possible"). The commit is meant to be a boot init
    speed up skipping the loop in memmap_init_zone() for invalid pfns.

    But with some specific memory mappings on x86_64 (or, more generally,
    theoretically anywhere except on arm with CONFIG_HAVE_ARCH_PFN_VALID),
    the implementation also skips valid pfns, which is plain wrong and causes
    'kernel BUG at mm/page_alloc.c:1389!'

    crash> log | grep -e BUG -e RIP -e Call.Trace -e move_freepages_block -e rmqueue -e freelist -A1
    kernel BUG at mm/page_alloc.c:1389!
    invalid opcode: 0000 [#1] SMP
    --
    RIP: 0010: move_freepages+0x15e/0x160
    --
    Call Trace:
    move_freepages_block+0x73/0x80
    __rmqueue+0x263/0x460
    get_page_from_freelist+0x7e1/0x9e0
    __alloc_pages_nodemask+0x176/0x420
    --

    crash> page_init_bug -v | grep RAM
    1000 - 9bfff System RAM (620.00 KiB)
    100000 - 430bffff System RAM ( 1.05 GiB = 1071.75 MiB = 1097472.00 KiB)
    4b0c8000 - 4bf9cfff System RAM ( 14.83 MiB = 15188.00 KiB)
    4bfac000 - 646b1fff System RAM (391.02 MiB = 400408.00 KiB)
    7b788000 - 7b7fffff System RAM (480.00 KiB)
    100000000 - 67fffffff System RAM ( 22.00 GiB)

    crash> page_init_bug | head -6
    7b788000 - 7b7fffff System RAM (480.00 KiB)
    1fffff00000000 0 1 DMA32 4096 1048575
    505736 505344 505855
    0 0 0 DMA 1 4095
    1fffff00000400 0 1 DMA32 4096 1048575
    BUG, zones differ!

    crash> kmem -p 77fff000 78000000 7b5ff000 7b600000 7b787000 7b788000
    PAGE PHYSICAL MAPPING INDEX CNT FLAGS
    ffffea0001e00000 78000000 0 0 0 0
    ffffea0001ed7fc0 7b5ff000 0 0 0 0
    ffffea0001ed8000 7b600000 0 0 0 0 <<<<
    ffffea0001ede1c0 7b787000 0 0 0 0
    ffffea0001ede200 7b788000 0 0 1 1fffff00000000

    Link: http://lkml.kernel.org/r/20180316143855.29838-1-neelx@redhat.com
    Fixes: b92df1de5d28 ("mm: page_alloc: skip over regions of invalid pfns where possible")
    Signed-off-by: Daniel Vacek
    Acked-by: Ard Biesheuvel
    Acked-by: Michal Hocko
    Reviewed-by: Andrew Morton
    Cc: Vlastimil Babka
    Cc: Mel Gorman
    Cc: Pavel Tatashin
    Cc: Paul Burton
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Daniel Vacek
     
  • commit b3cd54b257ad95d344d121dc563d943ca39b0921 upstream.

    shmem_unused_huge_shrink() gets called from reclaim path. Waiting for
    page lock may lead to deadlock there.

    There was a bug report that may be attributed to this:

    http://lkml.kernel.org/r/alpine.LRH.2.11.1801242349220.30642@mail.ewheeler.net

    Replace lock_page() with trylock_page() and skip the page if we failed
    to lock it. We will get to the page on the next scan.

    We can test for PageTransHuge() outside the page lock as we only
    need protection against splitting the page under us. Holding a pin on
    the page is enough for this.
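    A hedged sketch of the pattern this fix (and the following
    deferred_split_scan() fix) uses, since reclaim context must not sleep
    on a page lock:

    /* opportunistically lock the page; if someone else holds the lock,
     * skip the page now and revisit it on the next shrinker scan */
    if (!trylock_page(page))
            continue;       /* next page in the list */

    /* ... do the split/shrink work under the page lock ... */

    unlock_page(page);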

    Link: http://lkml.kernel.org/r/20180316210830.43738-1-kirill.shutemov@linux.intel.com
    Fixes: 779750d20b93 ("shmem: split huge pages beyond i_size under memory pressure")
    Signed-off-by: Kirill A. Shutemov
    Reported-by: Eric Wheeler
    Acked-by: Michal Hocko
    Reviewed-by: Andrew Morton
    Cc: Tetsuo Handa
    Cc: Hugh Dickins
    Cc: [4.8+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Kirill A. Shutemov
     
  • commit fa41b900c30b45fab03783724932dc30cd46a6be upstream.

    deferred_split_scan() gets called from reclaim path. Waiting for page
    lock may lead to deadlock there.

    Replace lock_page() with trylock_page() and skip the page if we failed
    to lock it. We will get to the page on the next scan.

    Link: http://lkml.kernel.org/r/20180315150747.31945-1-kirill.shutemov@linux.intel.com
    Fixes: 9a982250f773 ("thp: introduce deferred_split_huge_page()")
    Signed-off-by: Kirill A. Shutemov
    Acked-by: Michal Hocko
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Kirill A. Shutemov
     
  • commit fece2029a9e65b9a990831afe2a2b83290cbbe26 upstream.

    khugepaged is not yet able to convert PTE-mapped huge pages back to
    PMD-mapped ones. We do not collapse such pages. See the check in
    khugepaged_scan_pmd().

    But if between khugepaged_scan_pmd() and __collapse_huge_page_isolate()
    somebody managed to instantiate THP in the range and then split the PMD
    back to PTEs we would have a problem --
    VM_BUG_ON_PAGE(PageCompound(page)) will get triggered.

    It's possible since we drop mmap_sem during collapse to re-take for
    write.

    Replace the VM_BUG_ON() with graceful collapse fail.
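    A hedged sketch of the replacement in __collapse_huge_page_isolate()
    (SCAN_PAGE_COMPOUND is the result code the patch introduces for this
    case):

    /* instead of VM_BUG_ON_PAGE(PageCompound(page), page): the PMD may
     * have been split back to PTEs behind our back, so just give up on
     * collapsing this range */
    if (PageCompound(page)) {
            result = SCAN_PAGE_COMPOUND;
            goto out;
    }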

    Link: http://lkml.kernel.org/r/20180315152353.27989-1-kirill.shutemov@linux.intel.com
    Fixes: b1caa957ae6d ("khugepaged: ignore pmd tables with THP mapped with ptes")
    Signed-off-by: Kirill A. Shutemov
    Cc: Laura Abbott
    Cc: Jerome Marchand
    Cc: Vlastimil Babka
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Kirill A. Shutemov
     
  • commit 63489f8e821144000e0bdca7e65a8d1cc23a7ee7 upstream.

    A vma with vm_pgoff large enough to overflow a loff_t type when
    converted to a byte offset can be passed via the remap_file_pages system
    call. The hugetlbfs mmap routine uses the byte offset to calculate
    reservations and file size.

    A sequence such as:

    mmap(0x20a00000, 0x600000, 0, 0x66033, -1, 0);
    remap_file_pages(0x20a00000, 0x600000, 0, 0x20000000000000, 0);

    will result in the following when the task exits or the file is closed:

    kernel BUG at mm/hugetlb.c:749!
    Call Trace:
    hugetlbfs_evict_inode+0x2f/0x40
    evict+0xcb/0x190
    __dentry_kill+0xcb/0x150
    __fput+0x164/0x1e0
    task_work_run+0x84/0xa0
    exit_to_usermode_loop+0x7d/0x80
    do_syscall_64+0x18b/0x190
    entry_SYSCALL_64_after_hwframe+0x3d/0xa2

    The overflowed pgoff value causes hugetlbfs to try to set up a mapping
    with a negative range (end < start) that leaves invalid state which
    causes the BUG.

    The previous overflow fix to this code was incomplete and did not take
    the remap_file_pages system call into account.
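    A hedged sketch of the kind of guard the fix adds in the hugetlbfs
    mmap path (assuming a 64-bit unsigned long; the real patch uses a
    dedicated mask and also handles 32-bit quirks):

    /* vm_pgoff is a page index; shifting it by PAGE_SHIFT must still fit
     * in a signed loff_t, otherwise reject the mapping up front */
    if (vma->vm_pgoff >> (BITS_PER_LONG - 1 - PAGE_SHIFT))
            return -EINVAL;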

    [mike.kravetz@oracle.com: v3]
    Link: http://lkml.kernel.org/r/20180309002726.7248-1-mike.kravetz@oracle.com
    [akpm@linux-foundation.org: include mmdebug.h]
    [akpm@linux-foundation.org: fix -ve left shift count on sh]
    Link: http://lkml.kernel.org/r/20180308210502.15952-1-mike.kravetz@oracle.com
    Fixes: 045c7a3f53d9 ("hugetlbfs: fix offset overflow in hugetlbfs mmap")
    Signed-off-by: Mike Kravetz
    Reported-by: Nic Losby
    Acked-by: Michal Hocko
    Cc: "Kirill A . Shutemov"
    Cc: Yisheng Xie
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Mike Kravetz
     
  • commit 2e517d681632326ed98399cb4dd99519efe3e32c upstream.

    Dave Jones reported fs_reclaim lockdep warnings.

    ============================================
    WARNING: possible recursive locking detected
    4.15.0-rc9-backup-debug+ #1 Not tainted
    --------------------------------------------
    sshd/24800 is trying to acquire lock:
    (fs_reclaim){+.+.}, at: [] fs_reclaim_acquire.part.102+0x5/0x30

    but task is already holding lock:
    (fs_reclaim){+.+.}, at: [] fs_reclaim_acquire.part.102+0x5/0x30

    other info that might help us debug this:
    Possible unsafe locking scenario:

    CPU0
    ----
    lock(fs_reclaim);
    lock(fs_reclaim);

    *** DEADLOCK ***

    May be due to missing lock nesting notation

    2 locks held by sshd/24800:
    #0: (sk_lock-AF_INET6){+.+.}, at: [] tcp_sendmsg+0x19/0x40
    #1: (fs_reclaim){+.+.}, at: [] fs_reclaim_acquire.part.102+0x5/0x30

    stack backtrace:
    CPU: 3 PID: 24800 Comm: sshd Not tainted 4.15.0-rc9-backup-debug+ #1
    Call Trace:
    dump_stack+0xbc/0x13f
    __lock_acquire+0xa09/0x2040
    lock_acquire+0x12e/0x350
    fs_reclaim_acquire.part.102+0x29/0x30
    kmem_cache_alloc+0x3d/0x2c0
    alloc_extent_state+0xa7/0x410
    __clear_extent_bit+0x3ea/0x570
    try_release_extent_mapping+0x21a/0x260
    __btrfs_releasepage+0xb0/0x1c0
    btrfs_releasepage+0x161/0x170
    try_to_release_page+0x162/0x1c0
    shrink_page_list+0x1d5a/0x2fb0
    shrink_inactive_list+0x451/0x940
    shrink_node_memcg.constprop.88+0x4c9/0x5e0
    shrink_node+0x12d/0x260
    try_to_free_pages+0x418/0xaf0
    __alloc_pages_slowpath+0x976/0x1790
    __alloc_pages_nodemask+0x52c/0x5c0
    new_slab+0x374/0x3f0
    ___slab_alloc.constprop.81+0x47e/0x5a0
    __slab_alloc.constprop.80+0x32/0x60
    __kmalloc_track_caller+0x267/0x310
    __kmalloc_reserve.isra.40+0x29/0x80
    __alloc_skb+0xee/0x390
    sk_stream_alloc_skb+0xb8/0x340
    tcp_sendmsg_locked+0x8e6/0x1d30
    tcp_sendmsg+0x27/0x40
    inet_sendmsg+0xd0/0x310
    sock_write_iter+0x17a/0x240
    __vfs_write+0x2ab/0x380
    vfs_write+0xfb/0x260
    SyS_write+0xb6/0x140
    do_syscall_64+0x1e5/0xc05
    entry_SYSCALL64_slow_path+0x25/0x25

    This warning is caused by commit d92a8cfcb37e ("locking/lockdep:
    Rework FS_RECLAIM annotation") which replaced the use of
    lockdep_{set,clear}_current_reclaim_state() in __perform_reclaim()
    and lockdep_trace_alloc() in slab_pre_alloc_hook() with
    fs_reclaim_acquire()/ fs_reclaim_release().

    Since __kmalloc_reserve() from __alloc_skb() adds __GFP_NOMEMALLOC |
    __GFP_NOWARN to gfp_mask, and all reclaim paths simply propagate
    __GFP_NOMEMALLOC, fs_reclaim_acquire() in slab_pre_alloc_hook() ends up
    trying to grab the 'fake' lock again while __perform_reclaim() already
    holds the 'fake' lock.

    The

    /* this guy won't enter reclaim */
    if ((current->flags & PF_MEMALLOC) && !(gfp_mask & __GFP_NOMEMALLOC))
            return false;

    test which causes slab_pre_alloc_hook() to try to grab the 'fake' lock
    was added by commit cf40bd16fdad ("lockdep: annotate reclaim context
    (__GFP_NOFS)"). But that test is outdated because a PF_MEMALLOC thread
    won't enter reclaim regardless of __GFP_NOMEMALLOC after commit
    341ce06f69ab ("page allocator: calculate the alloc_flags for allocation
    only once") added the PF_MEMALLOC safeguard (

    /* Avoid recursion of direct reclaim */
    if (p->flags & PF_MEMALLOC)
            goto nopage;

    in __alloc_pages_slowpath()).

    Thus, let's fix the outdated test by removing the __GFP_NOMEMALLOC check
    and allowing __need_fs_reclaim() to return false.
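    A hedged sketch of the check in __need_fs_reclaim() after the fix:

    /* this guy won't enter reclaim, regardless of __GFP_NOMEMALLOC */
    if (current->flags & PF_MEMALLOC)
            return false;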

    Link: http://lkml.kernel.org/r/201802280650.FJC73911.FOSOMLJVFFQtHO@I-love.SAKURA.ne.jp
    Fixes: d92a8cfcb37ecd13 ("locking/lockdep: Rework FS_RECLAIM annotation")
    Signed-off-by: Tetsuo Handa
    Reported-by: Dave Jones
    Tested-by: Dave Jones
    Cc: Peter Zijlstra
    Cc: Nick Piggin
    Cc: Ingo Molnar
    Cc: Nikolay Borisov
    Cc: Michal Hocko
    Cc: [4.14+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Tetsuo Handa
     

15 Mar, 2018

1 commit

  • commit 379b03b7fa05f7db521b7732a52692448a3c34fe upstream.

    This is just a cleanup. It aids handling the special end case in the
    next commit.

    [akpm@linux-foundation.org: make it work against current -linus, not against -mm]
    [akpm@linux-foundation.org: make it work against current -linus, not against -mm some more]
    Link: http://lkml.kernel.org/r/1ca478d4269125a99bcfb1ca04d7b88ac1aee924.1520011944.git.neelx@redhat.com
    Signed-off-by: Daniel Vacek
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Cc: Mel Gorman
    Cc: Pavel Tatashin
    Cc: Paul Burton
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Daniel Vacek
     

03 Mar, 2018

1 commit

  • [ Upstream commit 1f704fd0d14043e76e80f6b8b2251b9b2cedcca6 ]

    A semaphore is acquired before this check, so we must release it before
    leaving.
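    A hedged sketch of the shape of the fix in get_vaddr_frames() (the
    'out' label is assumed to be the existing exit path that does
    up_read(&mm->mmap_sem)):

    if (vma_is_fsdax(vma)) {
            ret = -EOPNOTSUPP;
            goto out;       /* release mmap_sem instead of returning directly */
    }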

    Link: http://lkml.kernel.org/r/20171211211009.4971-1-christophe.jaillet@wanadoo.fr
    Fixes: b7f0554a56f2 ("mm: fail get_vaddr_frames() for filesystem-dax mappings")
    Signed-off-by: Christophe JAILLET
    Acked-by: Michal Hocko
    Cc: Dan Williams
    Cc: Christian Borntraeger
    Cc: David Sterba
    Cc: Greg Kroah-Hartman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Christophe JAILLET
     

28 Feb, 2018

1 commit

  • commit 7ba716698cc53f8d5367766c93c538c7da6c68ce upstream.

    It was reported by Sergey Senozhatsky that if THP (Transparent Huge
    Page) and frontswap (via zswap) are both enabled, when memory goes low
    so that swap is triggered, segfaults and memory corruption will occur in
    random user space applications, as follows:

    kernel: urxvt[338]: segfault at 20 ip 00007fc08889ae0d sp 00007ffc73a7fc40 error 6 in libc-2.26.so[7fc08881a000+1ae000]
    #0 0x00007fc08889ae0d _int_malloc (libc.so.6)
    #1 0x00007fc08889c2f3 malloc (libc.so.6)
    #2 0x0000560e6004bff7 _Z14rxvt_wcstoutf8PKwi (urxvt)
    #3 0x0000560e6005e75c n/a (urxvt)
    #4 0x0000560e6007d9f1 _ZN16rxvt_perl_interp6invokeEP9rxvt_term9hook_typez (urxvt)
    #5 0x0000560e6003d988 _ZN9rxvt_term9cmd_parseEv (urxvt)
    #6 0x0000560e60042804 _ZN9rxvt_term6pty_cbERN2ev2ioEi (urxvt)
    #7 0x0000560e6005c10f _Z17ev_invoke_pendingv (urxvt)
    #8 0x0000560e6005cb55 ev_run (urxvt)
    #9 0x0000560e6003b9b9 main (urxvt)
    #10 0x00007fc08883af4a __libc_start_main (libc.so.6)
    #11 0x0000560e6003f9da _start (urxvt)

    After bisection, it was found the first bad commit is bd4c82c22c36 ("mm,
    THP, swap: delay splitting THP after swapped out").

    The root cause is as follows:

    When pages are written to the swap device during swap-out in
    swap_writepage(), zswap (frontswap) tries to compress them to
    improve performance. But zswap (frontswap) treats a THP as a normal
    page, so only the head page is saved. After swap-in, the tail pages
    are not restored to their original contents, causing memory
    corruption in the applications.

    This is fixed by refusing to save a page in the frontswap store functions
    if the page is a THP, so that the THP will be swapped out to the swap
    device instead.
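    A hedged sketch of the guard added to the zswap frontswap store
    backend:

    /* THP isn't supported here; refuse it so the whole huge page falls
     * back to the real swap device */
    if (PageTransHuge(page)) {
            ret = -EINVAL;
            goto reject;
    }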

    Another choice is to split the THP if frontswap is enabled. But it turns
    out that frontswap enablement isn't flexible enough. For example, if
    CONFIG_ZSWAP=y (it cannot be a module), frontswap will be enabled even if
    zswap itself isn't enabled.

    Frontswap has multiple backends; to make it easy for a backend to add THP
    support, the THP check is put in the backends' frontswap store functions
    instead of in the general interfaces.

    Link: http://lkml.kernel.org/r/20180209084947.22749-1-ying.huang@intel.com
    Fixes: bd4c82c22c367e068 ("mm, THP, swap: delay splitting THP after swapped out")
    Signed-off-by: "Huang, Ying"
    Reported-by: Sergey Senozhatsky
    Tested-by: Sergey Senozhatsky
    Suggested-by: Minchan Kim [put THP checking in backend]
    Cc: Konrad Rzeszutek Wilk
    Cc: Dan Streetman
    Cc: Seth Jennings
    Cc: Tetsuo Handa
    Cc: Shaohua Li
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Cc: Mel Gorman
    Cc: Shakeel Butt
    Cc: Boris Ostrovsky
    Cc: Juergen Gross
    Cc: [4.14]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Huang Ying
     

25 Feb, 2018

3 commits

  • commit 698d0831ba87b92ae10b15e8203cfd59f5a59a35 upstream.

    Kai Heng Feng has noticed that BUG_ON(PageHighMem(pg)) triggers in
    drivers/media/common/saa7146/saa7146_core.c since 19809c2da28a ("mm,
    vmalloc: use __GFP_HIGHMEM implicitly").

    saa7146_vmalloc_build_pgtable uses vmalloc_32 and it is reasonable to
    expect that the resulting page is not in highmem. The above commit
    aimed to add __GFP_HIGHMEM only for those requests which do not specify
    any zone modifier gfp flag. vmalloc_32 relies on GFP_VMALLOC32 which
    should do the right thing. Except it has been missed that GFP_VMALLOC32
    is an alias for GFP_KERNEL on 32b architectures. Thanks to Matthew for
    noticing this.

    Fix the problem by unconditionally setting GFP_DMA32 in GFP_VMALLOC32
    for !64b arches (as a bailout). This should do the right thing and use
    ZONE_NORMAL, which is always below 4G on 32b systems.
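    A hedged sketch of GFP_VMALLOC32 after the fix (simplified from the
    real #ifdef chain in mm/vmalloc.c):

    #if defined(CONFIG_64BIT) && defined(CONFIG_ZONE_DMA32)
    #define GFP_VMALLOC32  (GFP_DMA32 | GFP_KERNEL)
    #elif defined(CONFIG_64BIT)
    #define GFP_VMALLOC32  (GFP_DMA | GFP_KERNEL)
    #else
    /* 32b: GFP_DMA32 falls back to ZONE_NORMAL, which stays below 4G */
    #define GFP_VMALLOC32  (GFP_DMA32 | GFP_KERNEL)
    #endif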

    Debugged by Matthew Wilcox.

    [akpm@linux-foundation.org: coding-style fixes]
    Link: http://lkml.kernel.org/r/20180212095019.GX21609@dhcp22.suse.cz
    Fixes: 19809c2da28a ("mm, vmalloc: use __GFP_HIGHMEM implicitly”)
    Signed-off-by: Michal Hocko
    Reported-by: Kai Heng Feng
    Cc: Matthew Wilcox
    Cc: Laura Abbott
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Michal Hocko
     
  • [ Upstream commit 7f6f60a1ba52538c16f26930bfbcfe193d9d746a ]

    earlyprintk=efi,keep does not work any more: it triggers the warning
    WARN_ON(system_state != SYSTEM_BOOTING) in mm/early_ioremap.c, and boot
    then just hangs because the warning itself goes through the earlyprintk
    implementation code.

    This is caused by a newly introduced intermediate state in:

    69a78ff226fe ("init: Introduce SYSTEM_SCHEDULING state")

    early_ioremap() is fine in both the SYSTEM_BOOTING and SYSTEM_SCHEDULING
    states, so the original condition should be updated accordingly.
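    A hedged sketch of the relaxed assertion (the check now only fires
    once the system is fully up and running):

    /* before: WARN_ON(system_state != SYSTEM_BOOTING); */
    WARN_ON(system_state >= SYSTEM_RUNNING);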

    Signed-off-by: Dave Young
    Acked-by: Thomas Gleixner
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: bp@suse.de
    Cc: linux-efi@vger.kernel.org
    Cc: linux-mm@kvack.org
    Link: http://lkml.kernel.org/r/20171209041610.GA3249@dhcp-128-65.nay.redhat.com
    Signed-off-by: Ingo Molnar
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Dave Young
     
  • commit bb422a738f6566f7439cd347d54e321e4fe92a9f upstream.

    Syzbot caught an oops at unregister_shrinker() because the combination of
    commit 1d3d4437eae1bb29 ("vmscan: per-node deferred work") and fault
    injection made register_shrinker() fail, and the caller of
    register_shrinker() did not check for failure.

    ----------
    [ 554.881422] FAULT_INJECTION: forcing a failure.
    [ 554.881422] name failslab, interval 1, probability 0, space 0, times 0
    [ 554.881438] CPU: 1 PID: 13231 Comm: syz-executor1 Not tainted 4.14.0-rc8+ #82
    [ 554.881443] Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
    [ 554.881445] Call Trace:
    [ 554.881459] dump_stack+0x194/0x257
    [ 554.881474] ? arch_local_irq_restore+0x53/0x53
    [ 554.881486] ? find_held_lock+0x35/0x1d0
    [ 554.881507] should_fail+0x8c0/0xa40
    [ 554.881522] ? fault_create_debugfs_attr+0x1f0/0x1f0
    [ 554.881537] ? check_noncircular+0x20/0x20
    [ 554.881546] ? find_next_zero_bit+0x2c/0x40
    [ 554.881560] ? ida_get_new_above+0x421/0x9d0
    [ 554.881577] ? find_held_lock+0x35/0x1d0
    [ 554.881594] ? __lock_is_held+0xb6/0x140
    [ 554.881628] ? check_same_owner+0x320/0x320
    [ 554.881634] ? lock_downgrade+0x990/0x990
    [ 554.881649] ? find_held_lock+0x35/0x1d0
    [ 554.881672] should_failslab+0xec/0x120
    [ 554.881684] __kmalloc+0x63/0x760
    [ 554.881692] ? lock_downgrade+0x990/0x990
    [ 554.881712] ? register_shrinker+0x10e/0x2d0
    [ 554.881721] ? trace_event_raw_event_module_request+0x320/0x320
    [ 554.881737] register_shrinker+0x10e/0x2d0
    [ 554.881747] ? prepare_kswapd_sleep+0x1f0/0x1f0
    [ 554.881755] ? _down_write_nest_lock+0x120/0x120
    [ 554.881765] ? memcpy+0x45/0x50
    [ 554.881785] sget_userns+0xbcd/0xe20
    (...snipped...)
    [ 554.898693] kasan: CONFIG_KASAN_INLINE enabled
    [ 554.898724] kasan: GPF could be caused by NULL-ptr deref or user memory access
    [ 554.898732] general protection fault: 0000 [#1] SMP KASAN
    [ 554.898737] Dumping ftrace buffer:
    [ 554.898741] (ftrace buffer empty)
    [ 554.898743] Modules linked in:
    [ 554.898752] CPU: 1 PID: 13231 Comm: syz-executor1 Not tainted 4.14.0-rc8+ #82
    [ 554.898755] Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
    [ 554.898760] task: ffff8801d1dbe5c0 task.stack: ffff8801c9e38000
    [ 554.898772] RIP: 0010:__list_del_entry_valid+0x7e/0x150
    [ 554.898775] RSP: 0018:ffff8801c9e3f108 EFLAGS: 00010246
    [ 554.898780] RAX: dffffc0000000000 RBX: 0000000000000000 RCX: 0000000000000000
    [ 554.898784] RDX: 0000000000000000 RSI: ffff8801c53c6f98 RDI: ffff8801c53c6fa0
    [ 554.898788] RBP: ffff8801c9e3f120 R08: 1ffff100393c7d55 R09: 0000000000000004
    [ 554.898791] R10: ffff8801c9e3ef70 R11: 0000000000000000 R12: 0000000000000000
    [ 554.898795] R13: dffffc0000000000 R14: 1ffff100393c7e45 R15: ffff8801c53c6f98
    [ 554.898800] FS: 0000000000000000(0000) GS:ffff8801db300000(0000) knlGS:0000000000000000
    [ 554.898804] CS: 0010 DS: 002b ES: 002b CR0: 0000000080050033
    [ 554.898807] CR2: 00000000dbc23000 CR3: 00000001c7269000 CR4: 00000000001406e0
    [ 554.898813] DR0: 0000000020000000 DR1: 0000000020000000 DR2: 0000000000000000
    [ 554.898816] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600
    [ 554.898818] Call Trace:
    [ 554.898828] unregister_shrinker+0x79/0x300
    [ 554.898837] ? perf_trace_mm_vmscan_writepage+0x750/0x750
    [ 554.898844] ? down_write+0x87/0x120
    [ 554.898851] ? deactivate_super+0x139/0x1b0
    [ 554.898857] ? down_read+0x150/0x150
    [ 554.898864] ? check_same_owner+0x320/0x320
    [ 554.898875] deactivate_locked_super+0x64/0xd0
    [ 554.898883] deactivate_super+0x141/0x1b0
    ----------

    Since allowing register_shrinker() callers to call unregister_shrinker()
    even when register_shrinker() failed can simplify the error recovery path,
    this patch makes unregister_shrinker() a no-op when register_shrinker()
    failed. Also, reset shrinker->nr_deferred in case unregister_shrinker()
    is erroneously called twice.
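    A hedged sketch of the resulting unregister_shrinker()
    (shrinker->nr_deferred is only non-NULL once registration succeeded):

    void unregister_shrinker(struct shrinker *shrinker)
    {
            if (!shrinker->nr_deferred)
                    return;                 /* register_shrinker() never succeeded */
            down_write(&shrinker_rwsem);
            list_del(&shrinker->list);
            up_write(&shrinker_rwsem);
            kfree(shrinker->nr_deferred);
            shrinker->nr_deferred = NULL;   /* guard against a second call */
    }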

    Signed-off-by: Tetsuo Handa
    Signed-off-by: Aliaksei Karaliou
    Reported-by: syzbot
    Cc: Glauber Costa
    Cc: Al Viro
    Signed-off-by: Al Viro
    Signed-off-by: Greg Kroah-Hartman

    Tetsuo Handa
     

22 Feb, 2018

7 commits

  • commit fd0e786d9d09024f67bd71ec094b110237dc3840 upstream.

    In the following commit:

    ce0fa3e56ad2 ("x86/mm, mm/hwpoison: Clear PRESENT bit for kernel 1:1 mappings of poison pages")

    ... we added code to memory_failure() to unmap the page from the
    kernel 1:1 virtual address space to avoid speculative access to the
    page logging additional errors.

    But memory_failure() may not always succeed in taking the page offline,
    especially if the page belongs to the kernel. This can happen if
    there are too many corrected errors on a page and either mcelog(8)
    or drivers/ras/cec.c asks to take a page offline.

    Since we remove the 1:1 mapping early in memory_failure(), we can
    end up with the page unmapped, but still in use. On the next access
    the kernel crashes :-(

    There are also various debug paths that call memory_failure() to simulate
    occurrence of an error. Since there is no actual error in memory, we
    don't need to map out the page for those cases.

    Revert most of the previous attempt and keep the solution local to
    arch/x86/kernel/cpu/mcheck/mce.c. Unmap the page only when:

    1) there is a real error
    2) memory_failure() succeeds.

    All of this only applies to 64-bit systems. 32-bit kernel doesn't map
    all of memory into kernel space. It isn't worth adding the code to unmap
    the piece that is mapped because nobody would run a 32-bit kernel on a
    machine that has recoverable machine checks.

    Signed-off-by: Tony Luck
    Cc: Andrew Morton
    Cc: Andy Lutomirski
    Cc: Borislav Petkov
    Cc: Brian Gerst
    Cc: Dave
    Cc: Denys Vlasenko
    Cc: Josh Poimboeuf
    Cc: Linus Torvalds
    Cc: Naoya Horiguchi
    Cc: Peter Zijlstra
    Cc: Robert (Persistent Memory)
    Cc: Thomas Gleixner
    Cc: linux-mm@kvack.org
    Cc: stable@vger.kernel.org #v4.14
    Fixes: ce0fa3e56ad2 ("x86/mm, mm/hwpoison: Clear PRESENT bit for kernel 1:1 mappings of poison pages")
    Signed-off-by: Ingo Molnar
    Signed-off-by: Greg Kroah-Hartman

    Tony Luck
     
  • commit af27d9403f5b80685b79c88425086edccecaf711 upstream.

    We get a warning about some slow configurations in randconfig kernels:

    mm/memory.c:83:2: error: #warning Unfortunate NUMA and NUMA Balancing config, growing page-frame for last_cpupid. [-Werror=cpp]

    The warning is reasonable by itself, but gets in the way of randconfig
    build testing, so I'm hiding it whenever CONFIG_COMPILE_TEST is set.

    The warning was added in 2013 in commit 75980e97dacc ("mm: fold
    page->_last_nid into page->flags where possible").

    Cc: stable@vger.kernel.org
    Signed-off-by: Arnd Bergmann
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Arnd Bergmann
     
  • commit f335195adf043168ee69d78ea72ac3e30f0c57ce upstream.

    Commit 4675ff05de2d ("kmemcheck: rip it out") has removed the code but
    for some reason the SPDX header stayed in place. This looks like a rebase
    mistake in the mmotm tree or a merge mistake. Let's drop those
    leftovers as well.

    Signed-off-by: Michal Hocko
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Michal Hocko
     
  • commit 4675ff05de2d76d167336b368bd07f3fef6ed5a6 upstream.

    Fix up makefiles, remove references, and git rm kmemcheck.

    Link: http://lkml.kernel.org/r/20171007030159.22241-4-alexander.levin@verizon.com
    Signed-off-by: Sasha Levin
    Cc: Steven Rostedt
    Cc: Vegard Nossum
    Cc: Pekka Enberg
    Cc: Michal Hocko
    Cc: Eric W. Biederman
    Cc: Alexander Potapenko
    Cc: Tim Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Levin, Alexander (Sasha Levin)
     
  • commit d8be75663cec0069b85f80191abd2682ce4a512f upstream.

    Now that kmemcheck is gone, we don't need the NOTRACK flags.

    Link: http://lkml.kernel.org/r/20171007030159.22241-5-alexander.levin@verizon.com
    Signed-off-by: Sasha Levin
    Cc: Alexander Potapenko
    Cc: Eric W. Biederman
    Cc: Michal Hocko
    Cc: Pekka Enberg
    Cc: Steven Rostedt
    Cc: Tim Hansen
    Cc: Vegard Nossum
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Levin, Alexander (Sasha Levin)
     
  • commit 75f296d93bcebcfe375884ddac79e30263a31766 upstream.

    Convert all allocations that used a NOTRACK flag to stop using it.

    Link: http://lkml.kernel.org/r/20171007030159.22241-3-alexander.levin@verizon.com
    Signed-off-by: Sasha Levin
    Cc: Alexander Potapenko
    Cc: Eric W. Biederman
    Cc: Michal Hocko
    Cc: Pekka Enberg
    Cc: Steven Rostedt
    Cc: Tim Hansen
    Cc: Vegard Nossum
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Levin, Alexander (Sasha Levin)
     
  • commit 4950276672fce5c241857540f8561c440663673d upstream.

    Patch series "kmemcheck: kill kmemcheck", v2.

    As discussed at LSF/MM, kill kmemcheck.

    KASan is a replacement that is able to work without the limitation of
    kmemcheck (single CPU, slow). KASan is already upstream.

    We are also not aware of any users of kmemcheck (or users who don't
    consider KASan as a suitable replacement).

    The only objection was that since KASAN wasn't supported by all GCC
    versions provided by distros at that time we should hold off for 2
    years, and try again.

    Now that 2 years have passed, and all distros provide gcc that supports
    KASAN, kill kmemcheck again for the very same reasons.

    This patch (of 4):

    Remove kmemcheck annotations, and calls to kmemcheck from the kernel.

    [alexander.levin@verizon.com: correctly remove kmemcheck call from dma_map_sg_attrs]
    Link: http://lkml.kernel.org/r/20171012192151.26531-1-alexander.levin@verizon.com
    Link: http://lkml.kernel.org/r/20171007030159.22241-2-alexander.levin@verizon.com
    Signed-off-by: Sasha Levin
    Cc: Alexander Potapenko
    Cc: Eric W. Biederman
    Cc: Michal Hocko
    Cc: Pekka Enberg
    Cc: Steven Rostedt
    Cc: Tim Hansen
    Cc: Vegard Nossum
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Levin, Alexander (Sasha Levin)
     

13 Feb, 2018

1 commit

  • [ Upstream commit edbe69ef2c90fc86998a74b08319a01c508bd497 ]

    This patch effectively reverts commit 9f1c2674b328 ("net: memcontrol:
    defer call to mem_cgroup_sk_alloc()").

    Moving mem_cgroup_sk_alloc() to the inet_csk_accept() completely breaks
    memcg socket memory accounting, as packets received before memcg
    pointer initialization are not accounted and are causing refcounting
    underflow on socket release.

    Actually the use-after-free problem was fixed by
    commit c0576e397508 ("net: call cgroup_sk_alloc() earlier in
    sk_clone_lock()") for the cgroup pointer.

    So, let's revert it and call mem_cgroup_sk_alloc() just before
    cgroup_sk_alloc(). This is safe, as we hold a reference to the socket
    we're cloning, and it holds a reference to the memcg.
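    A hedged sketch of the resulting ordering in sk_clone_lock()
    (simplified):

    /* charge the memcg before the cgroup data is set up for the clone */
    mem_cgroup_sk_alloc(newsk);
    cgroup_sk_alloc(&newsk->sk_cgrp_data);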

    Also, let's drop BUG_ON(mem_cgroup_is_root()) check from
    mem_cgroup_sk_alloc(). I see no reasons why bumping the root
    memcg counter is a good reason to panic, and there are no realistic
    ways to hit it.

    Signed-off-by: Roman Gushchin
    Cc: Eric Dumazet
    Cc: David S. Miller
    Cc: Johannes Weiner
    Cc: Tejun Heo
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Roman Gushchin
     

04 Feb, 2018

1 commit

  • [ Upstream commit bde5f6bc68db51128f875a756e9082a6c6ff7b4c ]

    kmemleak_scan() will scan the struct page array for each node, which can
    be really large, resulting in a soft lockup. We have seen a soft lockup
    when doing a scan while compiling a kernel:

    watchdog: BUG: soft lockup - CPU#53 stuck for 22s! [bash:10287]
    [...]
    Call Trace:
    kmemleak_scan+0x21a/0x4c0
    kmemleak_write+0x312/0x350
    full_proxy_write+0x5a/0xa0
    __vfs_write+0x33/0x150
    vfs_write+0xad/0x1a0
    SyS_write+0x52/0xc0
    do_syscall_64+0x61/0x1a0
    entry_SYSCALL64_slow_path+0x25/0x25

    Fix this by adding a cond_resched() every MAX_SCAN_SIZE bytes scanned.
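    A hedged sketch of the chunked scan loop after the fix (MAX_SCAN_SIZE
    is a chunk size introduced by the patch):

    /* scan a large block in MAX_SCAN_SIZE chunks, yielding between chunks */
    while (start < end) {
            size_t size = min_t(size_t, end - start, MAX_SCAN_SIZE);

            scan_block(start, start + size, NULL);
            start += size;
            cond_resched();
    }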

    Link: http://lkml.kernel.org/r/1511439788-20099-1-git-send-email-xieyisheng1@huawei.com
    Signed-off-by: Yisheng Xie
    Suggested-by: Catalin Marinas
    Acked-by: Catalin Marinas
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Yisheng Xie
     

31 Jan, 2018

1 commit

  • commit b050e3769c6b4013bb937e879fc43bf1847ee819 upstream.

    Since commit 97a16fc82a7c ("mm, page_alloc: only enforce watermarks for
    order-0 allocations"), __zone_watermark_ok() check for high-order
    allocations will shortcut per-migratetype free list checks for
    ALLOC_HARDER allocations, and return true as long as there's free page
    of any migratetype. The intention is that ALLOC_HARDER can allocate
    from MIGRATE_HIGHATOMIC free lists, while normal allocations can't.

    However, as a side effect, the watermark check will then also return
    true when there are pages only on the MIGRATE_ISOLATE list, or (prior to
    CMA conversion to ZONE_MOVABLE) on the MIGRATE_CMA list. Since the
    allocation cannot actually obtain isolated pages, and might not be able
    to obtain CMA pages, this can result in a false positive.

    The condition should be rare and perhaps the outcome is not a fatal one.
    Still, it's better if the watermark check is correct. There also
    shouldn't be a performance tradeoff here.
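    A hedged sketch of the corrected part of __zone_watermark_ok():
    instead of returning true for ALLOC_HARDER as soon as any free page
    exists, the highatomic free list is checked explicitly:

    /* ALLOC_HARDER may dip into the highatomic reserve, but it cannot use
     * isolated (or, before the ZONE_MOVABLE conversion, CMA) pages */
    if (alloc_harder &&
        !list_empty(&area->free_list[MIGRATE_HIGHATOMIC]))
            return true;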

    Link: http://lkml.kernel.org/r/20171102125001.23708-1-vbabka@suse.cz
    Fixes: 97a16fc82a7c ("mm, page_alloc: only enforce watermarks for order-0 allocations")
    Signed-off-by: Vlastimil Babka
    Acked-by: Mel Gorman
    Cc: Joonsoo Kim
    Cc: Rik van Riel
    Cc: David Rientjes
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Vlastimil Babka
     

24 Jan, 2018

1 commit

  • commit 0d665e7b109d512b7cae3ccef6e8654714887844 upstream.

    Tetsuo reported random crashes under memory pressure on 32-bit x86
    system and tracked down to change that introduced
    page_vma_mapped_walk().

    The root cause of the issue is the faulty pointer math in check_pte().
    As ->pte may point to an arbitrary page, we have to check that the pages
    belong to the same section before doing the math. Otherwise it may lead
    to weird results.

    It wasn't noticed until now as mem_map[] is virtually contiguous on
    flatmem and vmemmap sparsemem, where pointer arithmetic just works for
    all 'struct page' pointers. But with classic sparsemem it doesn't,
    because each section's memmap is allocated separately, so consecutive
    pfns crossing two sections might have struct pages at completely
    unrelated addresses.

    Let's restructure code a bit and replace pointer arithmetic with
    operations on pfns.
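    A hedged sketch of the approach (not the actual helper;
    hpage_nr_pages() was the then-current name for the page count of a
    possibly-huge page):

    /* compare pfns instead of doing 'struct page *' arithmetic, which is
     * only valid within a single sparsemem section */
    unsigned long pfn = pte_pfn(*pvmw->pte);
    unsigned long page_pfn = page_to_pfn(pvmw->page);

    if (pfn < page_pfn || pfn - page_pfn >= hpage_nr_pages(pvmw->page))
            return false;   /* the pte does not map into this page */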

    Signed-off-by: Kirill A. Shutemov
    Reported-and-tested-by: Tetsuo Handa
    Acked-by: Michal Hocko
    Fixes: ace71a19cec5 ("mm: introduce page_vma_mapped_walk()")
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Kirill A. Shutemov
     

10 Jan, 2018

2 commits

  • commit d09cfbbfa0f761a97687828b5afb27b56cbf2e19 upstream.

    In commit 83e3c48729d9 ("mm/sparsemem: Allocate mem_section at runtime
    for CONFIG_SPARSEMEM_EXTREME=y") mem_section is allocated at runtime to
    save memory.

    It allocates the first dimension of the array with sizeof(struct
    mem_section), but that costs extra memory; it should be sizeof(struct
    mem_section *).

    Fix it.
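    A hedged sketch of the corrected allocation (memblock_virt_alloc() was
    the boot-time allocator used there):

    /* the runtime mem_section is an array of pointers to section roots,
     * so size it by the pointer size, not by sizeof(struct mem_section) */
    unsigned long size = NR_SECTION_ROOTS * sizeof(struct mem_section *);

    mem_section = memblock_virt_alloc(size, align);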

    Link: http://lkml.kernel.org/r/1513932498-20350-1-git-send-email-bhe@redhat.com
    Fixes: 83e3c48729 ("mm/sparsemem: Allocate mem_section at runtime for CONFIG_SPARSEMEM_EXTREME=y")
    Signed-off-by: Baoquan He
    Tested-by: Dave Young
    Acked-by: Kirill A. Shutemov
    Cc: Kirill A. Shutemov
    Cc: Ingo Molnar
    Cc: Andy Lutomirski
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Atsushi Kumagai
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Baoquan He
     
  • commit 4991c09c7c812dba13ea9be79a68b4565bb1fa4e upstream.

    While testing on a large CPU system, we detected the following RCU stall
    many times over the span of the workload. The problem is solved by
    adding a cond_resched() in the change_pmd_range() function.

    INFO: rcu_sched detected stalls on CPUs/tasks:
    154-....: (670 ticks this GP) idle=022/140000000000000/0 softirq=2825/2825 fqs=612
    (detected by 955, t=6002 jiffies, g=4486, c=4485, q=90864)
    Sending NMI from CPU 955 to CPUs 154:
    NMI backtrace for cpu 154
    CPU: 154 PID: 147071 Comm: workload Not tainted 4.15.0-rc3+ #3
    NIP: c0000000000b3f64 LR: c0000000000b33d4 CTR: 000000000000aa18
    REGS: 00000000a4b0fb44 TRAP: 0501 Not tainted (4.15.0-rc3+)
    MSR: 8000000000009033 CR: 22422082 XER: 00000000
    CFAR: 00000000006cf8f0 SOFTE: 1
    GPR00: 0010000000000000 c00003ef9b1cb8c0 c0000000010cc600 0000000000000000
    GPR04: 8e0000018c32b200 40017b3858fd6e00 8e0000018c32b208 40017b3858fd6e00
    GPR08: 8e0000018c32b210 40017b3858fd6e00 8e0000018c32b218 40017b3858fd6e00
    GPR12: ffffffffffffffff c00000000fb25100
    NIP [c0000000000b3f64] plpar_hcall9+0x44/0x7c
    LR [c0000000000b33d4] pSeries_lpar_flush_hash_range+0x384/0x420
    Call Trace:
    flush_hash_range+0x48/0x100
    __flush_tlb_pending+0x44/0xd0
    hpte_need_flush+0x408/0x470
    change_protection_range+0xaac/0xf10
    change_prot_numa+0x30/0xb0
    task_numa_work+0x2d0/0x3e0
    task_work_run+0x130/0x190
    do_notify_resume+0x118/0x120
    ret_from_except_lite+0x70/0x74
    Instruction dump:
    60000000 f8810028 7ca42b78 7cc53378 7ce63b78 7d074378 7d284b78 7d495378
    e9410060 e9610068 e9810070 44000022 e9810028 f88c0000 f8ac0008

    Link: http://lkml.kernel.org/r/20171214140551.5794-1-khandual@linux.vnet.ibm.com
    Signed-off-by: Anshuman Khandual
    Suggested-by: Nicholas Piggin
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Anshuman Khandual
     

25 Dec, 2017

3 commits

  • commit 629a359bdb0e0652a8227b4ff3125431995fec6e upstream.

    Since commit:

    83e3c48729d9 ("mm/sparsemem: Allocate mem_section at runtime for CONFIG_SPARSEMEM_EXTREME=y")

    we allocate the mem_section array dynamically in sparse_memory_present_with_active_regions(),
    but some architectures, like arm64, don't call the routine to initialize sparsemem.

    Let's move the initialization into memory_present(); it should cover all
    architectures.

    Reported-and-tested-by: Sudeep Holla
    Tested-by: Bjorn Andersson
    Signed-off-by: Kirill A. Shutemov
    Acked-by: Will Deacon
    Cc: Andrew Morton
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-mm@kvack.org
    Fixes: 83e3c48729d9 ("mm/sparsemem: Allocate mem_section at runtime for CONFIG_SPARSEMEM_EXTREME=y")
    Link: http://lkml.kernel.org/r/20171107083337.89952-1-kirill.shutemov@linux.intel.com
    Signed-off-by: Ingo Molnar
    Cc: Dan Rue
    Cc: Naresh Kamboju
    Signed-off-by: Greg Kroah-Hartman

    Kirill A. Shutemov
     
  • commit 3382290ed2d5e275429cef510ab21889d3ccd164 upstream.

    [ Note, this is a Git cherry-pick of the following commit:

    506458efaf15 ("locking/barriers: Convert users of lockless_dereference() to READ_ONCE()")

    ... for easier x86 PTI code testing and back-porting. ]

    READ_ONCE() now has an implicit smp_read_barrier_depends() call, so it
    can be used instead of lockless_dereference() without any change in
    semantics.
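    A hedged before/after example of the conversion this commit performs
    throughout the tree:

    /* before */
    struct foo *p = lockless_dereference(gp);

    /* after: READ_ONCE() now implies the data-dependency barrier */
    struct foo *q = READ_ONCE(gp);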

    Signed-off-by: Will Deacon
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/1508840570-22169-4-git-send-email-will.deacon@arm.com
    Signed-off-by: Ingo Molnar
    Signed-off-by: Greg Kroah-Hartman

    Will Deacon
     
  • commit 83e3c48729d9ebb7af5a31a504f3fd6aff0348c4 upstream.

    Size of the mem_section[] array depends on the size of the physical address space.

    In preparation for boot-time switching between paging modes on x86-64
    we need to make the allocation of mem_section[] dynamic, because otherwise
    we waste a lot of RAM: with CONFIG_NODE_SHIFT=10, mem_section[] size is 32kB
    for 4-level paging and 2MB for 5-level paging mode.

    The patch allocates the array on the first call to sparse_memory_present_with_active_regions().

    Signed-off-by: Kirill A. Shutemov
    Cc: Andrew Morton
    Cc: Andy Lutomirski
    Cc: Borislav Petkov
    Cc: Cyrill Gorcunov
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-mm@kvack.org
    Link: http://lkml.kernel.org/r/20170929140821.37654-2-kirill.shutemov@linux.intel.com
    Signed-off-by: Ingo Molnar
    Signed-off-by: Greg Kroah-Hartman

    Kirill A. Shutemov
     

20 Dec, 2017

1 commit

  • commit 4837fe37adff1d159904f0c013471b1ecbcb455e upstream.

    David Rientjes has reported the following memory corruption while the
    oom reaper tries to unmap the victim's address space:

    BUG: Bad page map in process oom_reaper pte:6353826300000000 pmd:00000000
    addr:00007f50cab1d000 vm_flags:08100073 anon_vma:ffff9eea335603f0 mapping: (null) index:7f50cab1d
    file: (null) fault: (null) mmap: (null) readpage: (null)
    CPU: 2 PID: 1001 Comm: oom_reaper
    Call Trace:
    unmap_page_range+0x1068/0x1130
    __oom_reap_task_mm+0xd5/0x16b
    oom_reaper+0xff/0x14c
    kthread+0xc1/0xe0

    Tetsuo Handa has noticed that the synchronization inside exit_mmap is
    insufficient. We only synchronize with the oom reaper if
    tsk_is_oom_victim() is true, which is not the case if the final __mmput
    is called from a different context than the oom victim's exit path. This
    can trivially happen from the context of any task which has grabbed a
    reference to the mm (e.g. to read a /proc/<pid>/ file which requires the
    mm, etc.).

    The race would look like this

    oom_reaper                      oom_victim task
                                    mmget_not_zero
                                    do_exit
                                      mmput
    __oom_reap_task_mm              mmput
                                      __mmput
                                        exit_mmap
                                          remove_vma
      unmap_page_range

    Fix this issue by providing a new mm_is_oom_victim() helper which
    operates on the mm struct rather than a task. Any context which
    operates on a remote mm struct should use this helper in place of
    tsk_is_oom_victim. The flag is set in mark_oom_victim and never cleared
    so it is stable in the exit_mmap path.
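    A hedged sketch of the helper and its use in exit_mmap()
    (MMF_OOM_VICTIM is the mm flag the patch introduces):

    static inline bool mm_is_oom_victim(struct mm_struct *mm)
    {
            return test_bit(MMF_OOM_VICTIM, &mm->flags);
    }

    /* in exit_mmap(): wait for the oom reaper to finish before page
     * tables and VMAs are torn down */
    if (unlikely(mm_is_oom_victim(mm))) {
            down_write(&mm->mmap_sem);
            up_write(&mm->mmap_sem);
    }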

    Debugged by Tetsuo Handa.

    Link: http://lkml.kernel.org/r/20171210095130.17110-1-mhocko@kernel.org
    Fixes: 212925802454 ("mm: oom: let oom_reap_task and exit_mmap run concurrently")
    Signed-off-by: Michal Hocko
    Reported-by: David Rientjes
    Acked-by: David Rientjes
    Cc: Tetsuo Handa
    Cc: Andrea Argangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Michal Hocko
     

14 Dec, 2017

2 commits

  • [ Upstream commit 11066386efa692f77171484c32ea30f6e5a0d729 ]

    When slub_debug=O is set, it is possible for the debug flags of an
    "unmergeable" slab cache to be cleared in kmem_cache_open(). That makes
    the "unmergeable" cache become "mergeable" in sysfs_slab_add().

    These caches will generate their "unique IDs" by create_unique_id(), but
    it is possible to create identical unique IDs. In my experiment,
    sgpool-128, names_cache, biovec-256 generate the same ID ":Ft-0004096" and
    the kernel reports "sysfs: cannot create duplicate filename
    '/kernel/slab/:Ft-0004096'".

    To repeat my experiment, set disable_higher_order_debug=1,
    CONFIG_SLUB_DEBUG_ON=y in kernel-4.14.

    Fix this issue by setting unmergeable=1 if slub_debug=O and the
    default slub_debug contains any no-merge flags.
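    A hedged sketch of the check in sysfs_slab_add()
    (DEBUG_METADATA_FLAGS is SLUB's mask of the debug flags that
    slub_debug=O may strip):

    /* if slub_debug=O stripped debug metadata from this cache while the
     * default slub_debug requests it, the generated name could collide
     * with a merged cache, so treat the cache as unmergeable */
    if (!unmergeable && disable_higher_order_debug &&
        (slub_debug & DEBUG_METADATA_FLAGS))
            unmergeable = 1;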

    call path:
    kmem_cache_create()
    __kmem_cache_alias() -> we set SLAB_NEVER_MERGE flags here
    create_cache()
    __kmem_cache_create()
    kmem_cache_open() -> clear DEBUG_METADATA_FLAGS
    sysfs_slab_add() -> the slab cache is mergeable now

    sysfs: cannot create duplicate filename '/kernel/slab/:Ft-0004096'
    ------------[ cut here ]------------
    WARNING: CPU: 0 PID: 1 at fs/sysfs/dir.c:31 sysfs_warn_dup+0x60/0x7c
    Modules linked in:
    CPU: 0 PID: 1 Comm: swapper/0 Tainted: G W 4.14.0-rc7ajb-00131-gd4c2e9f-dirty #123
    Hardware name: linux,dummy-virt (DT)
    task: ffffffc07d4e0080 task.stack: ffffff8008008000
    PC is at sysfs_warn_dup+0x60/0x7c
    LR is at sysfs_warn_dup+0x60/0x7c
    pc : lr : pstate: 60000145
    Call trace:
    sysfs_warn_dup+0x60/0x7c
    sysfs_create_dir_ns+0x98/0xa0
    kobject_add_internal+0xa0/0x294
    kobject_init_and_add+0x90/0xb4
    sysfs_slab_add+0x90/0x200
    __kmem_cache_create+0x26c/0x438
    kmem_cache_create+0x164/0x1f4
    sg_pool_init+0x60/0x100
    do_one_initcall+0x38/0x12c
    kernel_init_freeable+0x138/0x1d4
    kernel_init+0x10/0xfc
    ret_from_fork+0x10/0x18

    Link: http://lkml.kernel.org/r/1510365805-5155-1-git-send-email-miles.chen@mediatek.com
    Signed-off-by: Miles Chen
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Miles Chen
     
  • [ Upstream commit 1aedcafbf32b3f232c159b14cd0d423fcfe2b861 ]

    Use BUG_ON(in_interrupt()) in zs_map_object(). This is not a new
    BUG_ON(), it's always been there, but was recently changed to
    VM_BUG_ON(). There are several problems with that. First, we use
    per-CPU mappings both in zsmalloc and in zram, and an interrupt may
    easily corrupt those buffers. Second, and more importantly, we believe
    it's possible to start leaking sensitive information. Consider the
    following case:

    -> process P
         swap out
          zram
           per-cpu mapping CPU1
            compress page A
    -> IRQ

         swap out
          zram
           per-cpu mapping CPU1
            compress page B
            write page from per-cpu mapping CPU1 to zsmalloc pool
           iret

    -> process P
            write page from per-cpu mapping CPU1 to zsmalloc pool  [*]
            return

    * so we store overwritten data that actually belongs to another
    page (task) and potentially contains sensitive data. And when
    process P will page fault it's going to read (swap in) that
    other task's data.

    Link: http://lkml.kernel.org/r/20170929045140.4055-1-sergey.senozhatsky@gmail.com
    Signed-off-by: Sergey Senozhatsky
    Acked-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Sergey Senozhatsky
     

10 Dec, 2017

1 commit

  • [ Upstream commit 5b65c4677a57a1d4414212f9995aa0e46a21ff80 ]

    The 0-day test bot found a performance regression that was tracked down to
    switching x86 to the generic get_user_pages_fast() implementation:

    http://lkml.kernel.org/r/20170710024020.GA26389@yexl-desktop

    The regression was caused by the fact that we now use local_irq_save() +
    local_irq_restore() in get_user_pages_fast() to disable interrupts.
    In the x86 implementation, local_irq_disable() + local_irq_enable() were used.

    The fix is to make get_user_pages_fast() use local_irq_disable(),
    leaving local_irq_save() for __get_user_pages_fast() that can be called
    with interrupts disabled.
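    A hedged sketch of the resulting split in the generic gup code
    (fragments only; gup_pgd_range() is the common page table walker):

    unsigned long flags;

    /* __get_user_pages_fast(): may run with IRQs already disabled, so
     * save and restore the current state */
    local_irq_save(flags);
    gup_pgd_range(addr, end, write, pages, &nr);
    local_irq_restore(flags);

    /* get_user_pages_fast(): always entered with IRQs enabled, so a
     * plain disable/enable pair suffices */
    local_irq_disable();
    gup_pgd_range(addr, end, write, pages, &nr);
    local_irq_enable();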

    Numbers for pinning a gigabyte of memory, one page at a time, 20 repeats:

    Before: Average: 14.91 ms, stddev: 0.45 ms
    After: Average: 10.76 ms, stddev: 0.18 ms

    Signed-off-by: Kirill A. Shutemov
    Cc: Andrew Morton
    Cc: Huang Ying
    Cc: Jonathan Corbet
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Thorsten Leemhuis
    Cc: linux-mm@kvack.org
    Fixes: e585513b76f7 ("x86/mm/gup: Switch GUP to the generic get_user_page_fast() implementation")
    Link: http://lkml.kernel.org/r/20170908215603.9189-3-kirill.shutemov@linux.intel.com
    Signed-off-by: Ingo Molnar
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Kirill A. Shutemov