24 Jul, 2017

1 commit

  • TI-Feature: ti_linux_base_lsk
    TI-Tree: git@git.ti.com:ti-linux-kernel/ti-linux-kernel.git
    TI-Branch: ti-linux-4.9.y

    * 'ti-linux-4.9.y' of git.ti.com:ti-linux-kernel/ti-linux-kernel: (78 commits)
    4.9.39
    kvm: vmx: allow host to access guest MSR_IA32_BNDCFGS
    kvm: vmx: Check value written to IA32_BNDCFGS
    kvm: x86: Guest BNDCFGS requires guest MPX support
    kvm: vmx: Do not disable intercepts for BNDCFGS
    tracing: Use SOFTIRQ_OFFSET for softirq dectection for more accurate results
    PM / QoS: return -EINVAL for bogus strings
    PM / wakeirq: Convert to SRCU
    sched/topology: Fix overlapping sched_group_mask
    sched/topology: Optimize build_group_mask()
    sched/topology: Fix building of overlapping sched-groups
    sched/fair, cpumask: Export for_each_cpu_wrap()
    Revert "sched/core: Optimize SCHED_SMT"
    crypto: caam - fix signals handling
    crypto: caam - properly set IV after {en,de}crypt
    crypto: sha1-ssse3 - Disable avx2
    crypto: atmel - only treat EBUSY as transient if backlog
    crypto: talitos - Extend max key length for SHA384/512-HMAC and AEAD
    mm: fix overflow check in expand_upwards()
    selftests/capabilities: Fix the test_execve test
    ...

    Signed-off-by: Dan Murphy

    Dan Murphy
     

21 Jul, 2017

3 commits

  • commit 37511fb5c91db93d8bd6e3f52f86e5a7ff7cfcdf upstream.

    Jörn Engel noticed that the expand_upwards() function might not return
    -ENOMEM if the requested address is (unsigned long)-PAGE_SIZE and the
    architecture didn't define TASK_SIZE as a multiple of PAGE_SIZE.

    Affected architectures are arm, frv, m68k, blackfin, h8300 and xtensa
    which all define TASK_SIZE as 0xffffffff, but since none of those have
    an upwards-growing stack we currently have no actual issue.

    Nevertheless, let's fix this just in case any of the architectures with
    an upward-growing stack (currently parisc, metag and partly ia64) defines
    TASK_SIZE similarly.
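
    As an illustration only (this is not the kernel patch; the 4 kB
    PAGE_SIZE and the 0xffffffff TASK_SIZE used here are assumptions), a
    small userspace program shows the unsigned wrap being guarded against:

    #include <stdio.h>

    #define PAGE_SIZE 4096UL
    #define PAGE_MASK (~(PAGE_SIZE - 1))
    #define TASK_SIZE 0xffffffffUL          /* not a multiple of PAGE_SIZE */

    int main(void)
    {
            unsigned long address = (unsigned long)-PAGE_SIZE;

            /* Aligning one page above the request wraps around to 0 ... */
            unsigned long next = (address + PAGE_SIZE) & PAGE_MASK;

            /* ... so a plain "next >= TASK_SIZE" limit check never fires
             * and the caller would not get -ENOMEM without an explicit
             * wrap check. */
            printf("next = %#lx, below TASK_SIZE: %d\n",
                   next, next < TASK_SIZE);
            return 0;
    }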

    Link: http://lkml.kernel.org/r/20170702192452.GA11868@p100.box
    Fixes: bd726c90b6b8 ("Allow stack to grow up to address space limit")
    Signed-off-by: Helge Deller
    Reported-by: Jörn Engel
    Cc: Hugh Dickins
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Helge Deller
     
  • commit 2c80cd57c74339889a8752b20862a16c28929c3a upstream.

    list_lru_count_node() iterates over all memcgs to get the total number of
    entries on the node, but it can race with memcg_drain_all_list_lrus(),
    which migrates the entries from a dead cgroup to another. This can cause
    list_lru_count_node() to return an incorrect number of entries.

    Fix this by keeping track of entries per node and simply return it in
    list_lru_count_node().

    Link: http://lkml.kernel.org/r/1498707555-30525-1-git-send-email-stummala@codeaurora.org
    Signed-off-by: Sahitya Tummala
    Acked-by: Vladimir Davydov
    Cc: Jan Kara
    Cc: Alexander Polakov
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Sahitya Tummala
     
  • commit bbf29ffc7f963bb894f84f0580c70cfea01c3892 upstream.

    Reinette reported the following crash:

    BUG: Bad page state in process log2exe pfn:57600
    page:ffffea00015d8000 count:0 mapcount:0 mapping: (null) index:0x20200
    flags: 0x4000000000040019(locked|uptodate|dirty|swapbacked)
    raw: 4000000000040019 0000000000000000 0000000000020200 00000000ffffffff
    raw: ffffea00015d8020 ffffea00015d8020 0000000000000000 0000000000000000
    page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set
    bad because of flags: 0x1(locked)
    Modules linked in: rfcomm 8021q bnep intel_rapl x86_pkg_temp_thermal coretemp efivars btusb btrtl btbcm pwm_lpss_pci snd_hda_codec_hdmi btintel pwm_lpss snd_hda_codec_realtek snd_soc_skl snd_hda_codec_generic snd_soc_skl_ipc spi_pxa2xx_platform snd_soc_sst_ipc snd_soc_sst_dsp i2c_designware_platform i2c_designware_core snd_hda_ext_core snd_soc_sst_match snd_hda_intel snd_hda_codec mei_me snd_hda_core mei snd_soc_rt286 snd_soc_rl6347a snd_soc_core efivarfs
    CPU: 1 PID: 354 Comm: log2exe Not tainted 4.12.0-rc7-test-test #19
    Hardware name: Intel corporation NUC6CAYS/NUC6CAYB, BIOS AYAPLCEL.86A.0027.2016.1108.1529 11/08/2016
    Call Trace:
    bad_page+0x16a/0x1f0
    free_pages_check_bad+0x117/0x190
    free_hot_cold_page+0x7b1/0xad0
    __put_page+0x70/0xa0
    madvise_free_huge_pmd+0x627/0x7b0
    madvise_free_pte_range+0x6f8/0x1150
    __walk_page_range+0x6b5/0xe30
    walk_page_range+0x13b/0x310
    madvise_free_page_range.isra.16+0xad/0xd0
    madvise_free_single_vma+0x2e4/0x470
    SyS_madvise+0x8ce/0x1450

    If somebody frees the page under us and we hold the last reference to
    it, put_page() would attempt to free the page before unlocking it.

    The fix is a trivial reorder of operations.

    Dave said:
    "I came up with the exact same patch. For posterity, here's the test
    case, generated by syzkaller and trimmed down by Reinette:

    https://www.sr71.net/~dave/intel/log2.c

    And the config that helps detect this:

    https://www.sr71.net/~dave/intel/config-log2"

    Fixes: b8d3c4c3009d ("mm/huge_memory.c: don't split THP page when MADV_FREE syscall is called")
    Link: http://lkml.kernel.org/r/20170628101249.17879-1-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Reported-by: Reinette Chatre
    Acked-by: Dave Hansen
    Acked-by: Michal Hocko
    Acked-by: Minchan Kim
    Cc: Huang Ying
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Kirill A. Shutemov
     

07 Jul, 2017

1 commit

  • TI-Feature: ti_linux_base_lsk
    TI-Tree: git@git.ti.com:ti-linux-kernel/ti-linux-kernel.git
    TI-Branch: ti-linux-4.9.y

    * 'ti-linux-4.9.y' of git.ti.com:ti-linux-kernel/ti-linux-kernel: (172 commits)
    Linux 4.9.36
    KVM: nVMX: Fix exception injection
    KVM: x86: zero base3 of unusable segments
    KVM: x86/vPMU: fix undefined shift in intel_pmu_refresh()
    KVM: x86: fix emulation of RSM and IRET instructions
    arm64: fix NULL dereference in have_cpu_die()
    mtd: nand: brcmnand: Check flash #WP pin status before nand erase/program
    i2c: brcmstb: Fix START and STOP conditions
    brcmfmac: avoid writing channel out of allocated array
    infiniband: hns: avoid gcc-7.0.1 warning for uninitialized data
    objtool: Fix another GCC jump table detection issue
    clk: scpi: don't add cpufreq device if the scpi dvfs node is disabled
    cpufreq: s3c2416: double free on driver init error path
    iommu/amd: Fix interrupt remapping when disable guest_mode
    iommu/amd: Fix incorrect error handling in amd_iommu_bind_pasid()
    iommu/dma: Don't reserve PCI I/O windows
    iommu: Handle default domain attach failure
    iommu/vt-d: Don't over-free page table directories
    ocfs2: o2hb: revert hb threshold to keep compatible
    x86/mm: Fix flush_tlb_page() on Xen
    ...

    Signed-off-by: LCPD Auto Merger

    LCPD Auto Merger
     

05 Jul, 2017

3 commits

  • commit 029c54b09599573015a5c18dbe59cbdf42742237 upstream.

    Existing code that uses vmalloc_to_page() may assume that any address
    for which is_vmalloc_addr() returns true may be passed into
    vmalloc_to_page() to retrieve the associated struct page.

    This is not an unreasonable assumption to make, but on architectures
    that have CONFIG_HAVE_ARCH_HUGE_VMAP=y, it no longer holds, and we need
    to ensure that vmalloc_to_page() does not go off into the weeds trying
    to dereference huge PUDs or PMDs as table entries.

    Given that vmalloc() and vmap() themselves never create huge mappings or
    deal with compound pages at all, there is no correct answer in this
    case, so return NULL instead, and issue a warning.

    When reading /proc/kcore on arm64, you will hit an oops as soon as you
    hit the huge mappings used for the various segments that make up the
    mapping of vmlinux. With this patch applied, you will no longer hit the
    oops, but the kcore contents will be incorrect (these regions will be
    zeroed out).

    We are fixing this for kcore specifically, so it avoids vread() for
    those regions. At least one other problematic user exists, i.e.,
    /dev/kmem, but that is currently broken on arm64 for other reasons.
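
    A hedged sketch of the calling pattern described above (kernel-style,
    not a specific in-tree caller; the function name is illustrative):

    static struct page *page_of_vmalloc_addr(const void *addr)
    {
            if (!is_vmalloc_addr(addr))
                    return NULL;
            /* With CONFIG_HAVE_ARCH_HUGE_VMAP, this may now return NULL
             * (and warn) if addr is covered by a huge PMD/PUD mapping. */
            return vmalloc_to_page(addr);
    }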

    Link: http://lkml.kernel.org/r/20170609082226.26152-1-ard.biesheuvel@linaro.org
    Signed-off-by: Ard Biesheuvel
    Acked-by: Mark Rutland
    Reviewed-by: Laura Abbott
    Cc: Michal Hocko
    Cc: zhong jiang
    Cc: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    [ardb: non-trivial backport to v4.9]
    Signed-off-by: Ard Biesheuvel
    Signed-off-by: Greg Kroah-Hartman

    Ard Biesheuvel
     
  • commit 3c226c637b69104f6b9f1c6ec5b08d7b741b3229 upstream.

    In do_huge_pmd_numa_page(), we attempt to handle a migrating thp pmd by
    waiting until the pmd is unlocked before we return and retry. However,
    we can race with migrate_misplaced_transhuge_page():

    // do_huge_pmd_numa_page                // migrate_misplaced_transhuge_page()
    // Holds 0 refs on page                 // Holds 2 refs on page

    vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd);
    /* ... */
    if (pmd_trans_migrating(*vmf->pmd)) {
        page = pmd_page(*vmf->pmd);
        spin_unlock(vmf->ptl);
                                            ptl = pmd_lock(mm, pmd);
                                            if (page_count(page) != 2) {
                                                /* roll back */
                                            }
                                            /* ... */
                                            mlock_migrate_page(new_page, page);
                                            /* ... */
                                            spin_unlock(ptl);
                                            put_page(page);
                                            put_page(page); // page freed here
        wait_on_page_locked(page);
        goto out;
    }

    This can result in the freed page having its waiters flag set
    unexpectedly, which trips the PAGE_FLAGS_CHECK_AT_PREP checks in the
    page alloc/free functions. This has been observed on arm64 KVM guests.

    We can avoid this by having do_huge_pmd_numa_page() take a reference on
    the page before dropping the pmd lock, mirroring what we do in
    __migration_entry_wait().

    When we hit the race, migrate_misplaced_transhuge_page() will see the
    reference and abort the migration, as it may do today in other cases.
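
    A hedged sketch of the resulting logic in do_huge_pmd_numa_page()
    (simplified, not the verbatim hunk):

    if (unlikely(pmd_trans_migrating(*vmf->pmd))) {
            page = pmd_page(*vmf->pmd);
            if (!get_page_unless_zero(page))        /* migration already won */
                    goto out_unlock;
            spin_unlock(vmf->ptl);
            wait_on_page_locked(page);
            put_page(page);                         /* drop our reference */
            goto out;
    }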

    Fixes: b8916634b77bffb2 ("mm: Prevent parallel splits during THP migration")
    Link: http://lkml.kernel.org/r/1497349722-6731-2-git-send-email-will.deacon@arm.com
    Signed-off-by: Mark Rutland
    Signed-off-by: Will Deacon
    Acked-by: Steve Capper
    Acked-by: Kirill A. Shutemov
    Acked-by: Vlastimil Babka
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Mark Rutland
     
  • commit 460bcec84e11c75122ace5976214abbc596eb91b upstream.

    We got need_resched() warnings in swap_cgroup_swapoff() because
    swap_cgroup_ctrl[type].length is particularly large.

    Reschedule when needed.

    Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1704061315270.80559@chino.kir.corp.google.com
    Signed-off-by: David Rientjes
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Cc: Vladimir Davydov
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Cc: Ben Hutchings
    Signed-off-by: Greg Kroah-Hartman

    David Rientjes
     

30 Jun, 2017

1 commit


28 Jun, 2017

1 commit


24 Jun, 2017

5 commits

  • commit f4cb767d76cf7ee72f97dd76f6cfa6c76a5edc89 upstream.

    Trinity gets kernel BUG at mm/mmap.c:1963! in about 3 minutes of
    mmap testing. That's the VM_BUG_ON(gap_end < gap_start) at the
    end of unmapped_area_topdown(). Linus points out how MAP_FIXED
    (which does not have to respect our stack guard gap intentions)
    could result in gap_end below gap_start there. Fix that, and
    the similar case in its alternative, unmapped_area().

    Fixes: 1be7107fbe18 ("mm: larger stack guard gap, between vmas")
    Reported-by: Dave Jones
    Debugged-by: Linus Torvalds
    Signed-off-by: Hugh Dickins
    Acked-by: Michal Hocko
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Hugh Dickins
     
  • commit bd726c90b6b8ce87602208701b208a208e6d5600 upstream.

    Fix expand_upwards() on architectures with an upward-growing stack (parisc,
    metag and partly IA-64) to allow the stack to reliably grow exactly up to
    the address space limit given by TASK_SIZE.

    Signed-off-by: Helge Deller
    Acked-by: Hugh Dickins
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Helge Deller
     
  • commit 1be7107fbe18eed3e319a6c3e83c78254b693acb upstream.

    Stack guard page is a useful feature to reduce a risk of stack smashing
    into a different mapping. We have been using a single page gap which
    is sufficient to prevent having stack adjacent to a different mapping.
    But this seems to be insufficient in the light of the stack usage in
    userspace. E.g. glibc uses as large as 64kB alloca() in many commonly
    used functions. Others use constructs like gid_t buffer[NGROUPS_MAX]
    which is 256kB or stack strings with MAX_ARG_STRLEN.

    This will become especially dangerous for suid binaries with the default
    unlimited stack size limit, because those applications can be tricked
    into consuming a large portion of the stack, and a single glibc call
    could then jump over the guard page. These attacks are not theoretical,
    unfortunately.

    Make those attacks less probable by increasing the stack guard gap
    to 1MB (on systems with 4k pages; but make it depend on the page size
    because systems with larger base pages might cap stack allocations in
    the PAGE_SIZE units) which should cover larger alloca() and VLA stack
    allocations. It is obviously not a full fix because the problem is
    somewhat inherent, but it should reduce the attack space a lot.

    One could argue that the gap size should be configurable from userspace,
    but that can be done later when somebody finds that the new 1MB is wrong
    for some special case applications. For now, add a kernel command line
    option (stack_guard_gap) to specify the stack gap size (in page units).

    Implementation-wise, first delete all the old code for the stack guard page:
    because although we could get away with accounting one extra page in a
    stack vma, accounting a larger gap can break userspace - case in point,
    a program run with "ulimit -S -v 20000" failed when the 1MB gap was
    counted for RLIMIT_AS; similar problems could come with RLIMIT_MLOCK
    and strict non-overcommit mode.

    Instead of keeping gap inside the stack vma, maintain the stack guard
    gap as a gap between vmas: using vm_start_gap() in place of vm_start
    (or vm_end_gap() in place of vm_end if VM_GROWSUP) in just those few
    places which need to respect the gap - mainly arch_get_unmapped_area(),
    and the vma tree's subtree_gap support for that.
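
    A hedged sketch of the vm_start_gap() idea for the common
    downward-growing case (simplified; the _sketch name is illustrative,
    the real helpers live in include/linux/mm.h):

    static inline unsigned long vm_start_gap_sketch(struct vm_area_struct *vma)
    {
            unsigned long vm_start = vma->vm_start;

            if (vma->vm_flags & VM_GROWSDOWN) {
                    vm_start -= stack_guard_gap;
                    if (vm_start > vma->vm_start)   /* underflowed past 0 */
                            vm_start = 0;
            }
            return vm_start;
    }

    To pick a different gap, boot with e.g. stack_guard_gap=256 for a
    256-page gap.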

    Original-patch-by: Oleg Nesterov
    Original-patch-by: Michal Hocko
    Signed-off-by: Hugh Dickins
    Acked-by: Michal Hocko
    Tested-by: Helge Deller # parisc
    Signed-off-by: Linus Torvalds
    [wt: backport to 4.11: adjust context]
    [wt: backport to 4.9: adjust context ; kernel doc was not in admin-guide]
    Signed-off-by: Willy Tarreau
    Signed-off-by: Greg Kroah-Hartman

    Hugh Dickins
     
  • commit ef70762948dde012146926720b70e79736336764 upstream.

    I saw need_resched() warnings when swapping on a large swapfile (TBs)
    because continuously allocating many pages in swap_cgroup_prepare() took
    too long.

    We already cond_resched when freeing page in swap_cgroup_swapoff(). Do
    the same for the page allocation.
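
    A hedged sketch of the pattern in swap_cgroup_prepare() (simplified;
    only the resched point is the interesting part):

    for (idx = 0; idx < length; idx++) {
            page = alloc_page(GFP_KERNEL | __GFP_ZERO);
            if (!page)
                    goto not_enough_page;
            ctrl->map[idx] = page;

            if (!(idx % SWAP_CLUSTER_MAX))  /* give the scheduler a turn */
                    cond_resched();
    }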

    Link: http://lkml.kernel.org/r/20170604200109.17606-1-yuzhao@google.com
    Signed-off-by: Yu Zhao
    Acked-by: Michal Hocko
    Acked-by: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Yu Zhao
     
  • commit 7258ae5c5a2ce2f5969e8b18b881be40ab55433d upstream.

    memory_failure() chooses a recovery action function based on the page
    flags. For huge pages it uses the tail page flags which don't have
    anything interesting set, resulting in:

    > Memory failure: 0x9be3b4: Unknown page state
    > Memory failure: 0x9be3b4: recovery action for unknown page: Failed

    Instead, save a copy of the head page's flags if this is a huge page;
    this means that if there are no relevant flags for this tail page, we use
    the head page's flags instead. This results in the me_huge_page() recovery
    action being called:

    > Memory failure: 0x9b7969: recovery action for huge page: Delayed

    For hugepages that have not yet been allocated, this allows the hugepage
    to be dequeued.
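
    A hedged sketch of the idea (simplified; variable names as used in
    memory_failure()):

    /* Classify by flags that actually carry information: for a huge page
     * the tail's flags are mostly empty, so use the head page's flags. */
    if (PageHuge(p))
            page_flags = hpage->flags;
    else
            page_flags = p->flags;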

    Fixes: 524fca1e7356 ("HWPOISON: fix misjudgement of page_action() for errors on mlocked pages")
    Link: http://lkml.kernel.org/r/20170524130204.21845-1-james.morse@arm.com
    Signed-off-by: James Morse
    Tested-by: Punit Agrawal
    Acked-by: Punit Agrawal
    Acked-by: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    James Morse
     

17 Jun, 2017

2 commits

  • [ Upstream commit 4f40c6e5627ea73b4e7c615c59631f38cc880885 ]

    After much waiting I finally reproduced a KASAN issue, only to find my
    trace-buffer empty of useful information because it got spooled out :/

    Make kasan_report honour the /proc/sys/kernel/traceoff_on_warning
    interface.

    Link: http://lkml.kernel.org/r/20170125164106.3514-1-aryabinin@virtuozzo.com
    Signed-off-by: Peter Zijlstra (Intel)
    Signed-off-by: Andrey Ryabinin
    Acked-by: Alexander Potapenko
    Cc: Dmitry Vyukov
    Cc: Steven Rostedt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Peter Zijlstra
     
  • [ Upstream commit 253fd0f02040a19c6fe80e4171659fa3482a422d ]

    Syzkaller fuzzer managed to trigger this:

    BUG: sleeping function called from invalid context at mm/shmem.c:852
    in_atomic(): 1, irqs_disabled(): 0, pid: 529, name: khugepaged
    3 locks held by khugepaged/529:
    #0: (shrinker_rwsem){++++..}, at: [] shrink_slab.part.59+0x121/0xd30 mm/vmscan.c:451
    #1: (&type->s_umount_key#29){++++..}, at: [] trylock_super+0x20/0x100 fs/super.c:392
    #2: (&(&sbinfo->shrinklist_lock)->rlock){+.+.-.}, at: [] spin_lock include/linux/spinlock.h:302 [inline]
    #2: (&(&sbinfo->shrinklist_lock)->rlock){+.+.-.}, at: [] shmem_unused_huge_shrink+0x28e/0x1490 mm/shmem.c:427
    CPU: 2 PID: 529 Comm: khugepaged Not tainted 4.10.0-rc5+ #201
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
    Call Trace:
    shmem_undo_range+0xb20/0x2710 mm/shmem.c:852
    shmem_truncate_range+0x27/0xa0 mm/shmem.c:939
    shmem_evict_inode+0x35f/0xca0 mm/shmem.c:1030
    evict+0x46e/0x980 fs/inode.c:553
    iput_final fs/inode.c:1515 [inline]
    iput+0x589/0xb20 fs/inode.c:1542
    shmem_unused_huge_shrink+0xbad/0x1490 mm/shmem.c:446
    shmem_unused_huge_scan+0x10c/0x170 mm/shmem.c:512
    super_cache_scan+0x376/0x450 fs/super.c:106
    do_shrink_slab mm/vmscan.c:378 [inline]
    shrink_slab.part.59+0x543/0xd30 mm/vmscan.c:481
    shrink_slab mm/vmscan.c:2592 [inline]
    shrink_node+0x2c7/0x870 mm/vmscan.c:2592
    shrink_zones mm/vmscan.c:2734 [inline]
    do_try_to_free_pages+0x369/0xc80 mm/vmscan.c:2776
    try_to_free_pages+0x3c6/0x900 mm/vmscan.c:2982
    __perform_reclaim mm/page_alloc.c:3301 [inline]
    __alloc_pages_direct_reclaim mm/page_alloc.c:3322 [inline]
    __alloc_pages_slowpath+0xa24/0x1c30 mm/page_alloc.c:3683
    __alloc_pages_nodemask+0x544/0xae0 mm/page_alloc.c:3848
    __alloc_pages include/linux/gfp.h:426 [inline]
    __alloc_pages_node include/linux/gfp.h:439 [inline]
    khugepaged_alloc_page+0xc2/0x1b0 mm/khugepaged.c:750
    collapse_huge_page+0x182/0x1fe0 mm/khugepaged.c:955
    khugepaged_scan_pmd+0xfdf/0x12a0 mm/khugepaged.c:1208
    khugepaged_scan_mm_slot mm/khugepaged.c:1727 [inline]
    khugepaged_do_scan mm/khugepaged.c:1808 [inline]
    khugepaged+0xe9b/0x1590 mm/khugepaged.c:1853
    kthread+0x326/0x3f0 kernel/kthread.c:227
    ret_from_fork+0x31/0x40 arch/x86/entry/entry_64.S:430

    The iput() from atomic context was a bad idea: if after igrab() somebody
    else calls iput() and we are left with the last inode reference, our iput()
    would lead to inode eviction and therefore sleeping.

    This patch should fix the situation.

    Link: http://lkml.kernel.org/r/20170131093141.GA15899@node.shutemov.name
    Signed-off-by: Kirill A. Shutemov
    Reported-by: Dmitry Vyukov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Kirill A. Shutemov
     

14 Jun, 2017

1 commit

  • commit 93407472a21b82f39c955ea7787e5bc7da100642 upstream.

    Replace all open-coded 1 << inode->i_blkbits and (1 << inode->i_blkbits)
    uses in the fs branch with the new i_blocksize() helper.

    This patch also fixes multiple checkpatch warnings: WARNING: Prefer
    'unsigned int' to bare use of 'unsigned'

    Thanks to Andrew Morton for suggesting a more appropriate function
    instead of a macro.
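
    The helper is essentially the following (a one-line wrapper; see the
    patch for the exact definition):

    static inline unsigned int i_blocksize(const struct inode *inode)
    {
            return 1 << inode->i_blkbits;
    }

    so call sites change from "1 << inode->i_blkbits" to
    "i_blocksize(inode)".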

    [geliangtang@gmail.com: truncate: use i_blocksize()]
    Link: http://lkml.kernel.org/r/9c8b2cd83c8f5653805d43debde9fa8817e02fc4.1484895804.git.geliangtang@gmail.com
    Link: http://lkml.kernel.org/r/1481319905-10126-1-git-send-email-fabf@skynet.be
    Signed-off-by: Fabian Frederick
    Signed-off-by: Geliang Tang
    Cc: Alexander Viro
    Cc: Ross Zwisler
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Fabian Frederick
     

08 Jun, 2017

2 commits

  • Add memblock_cap_memory_range() which will remove all the memblock regions
    except the memory range specified in the arguments. In addition, rework is
    done on memblock_mem_limit_remove_map() to re-implement it using
    memblock_cap_memory_range().

    This function, like memblock_mem_limit_remove_map(), will not remove
    memblocks with the MEMBLOCK_NOMAP attribute, as they may be mapped and
    accessed later as "device memory."
    See the commit a571d4eb55d8 ("mm/memblock.c: add new infrastructure to
    address the mem limit issue").

    This function is used, in a succeeding patch in the arm64 kdump support
    series, to limit the range of usable memory, or System RAM, on the crash
    dump kernel.
    (Please note that the "mem=" parameter is of little use for this purpose.)
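
    A hedged usage sketch for the kdump case described above (the range
    variables are illustrative):

    /* Keep only the crash kernel's usable window as System RAM; regions
     * marked MEMBLOCK_NOMAP are left in place so they can still be mapped
     * as device memory later. */
    memblock_cap_memory_range(crashk_base, crashk_size);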

    Signed-off-by: AKASHI Takahiro
    Reviewed-by: Will Deacon
    Acked-by: Catalin Marinas
    Acked-by: Dennis Chen
    Cc: linux-mm@kvack.org
    Cc: Andrew Morton
    Reviewed-by: Ard Biesheuvel
    Signed-off-by: Catalin Marinas

    AKASHI Takahiro
     
    This function, in combination with memblock_mark_nomap(), will be used
    in a later kdump patch for arm64 when it temporarily isolates some range
    of memory from the other memory blocks in order to create a specific
    kernel mapping at boot time.

    Signed-off-by: AKASHI Takahiro
    Reviewed-by: Ard Biesheuvel
    Signed-off-by: Catalin Marinas

    AKASHI Takahiro
     

07 Jun, 2017

6 commits

  • commit aa2efd5ea4041754da4046c3d2e7edaac9526258 upstream.

    Currently when trace is enabled (e.g. slub_debug=T,kmalloc-128) the
    trace messages are mostly output at KERN_INFO. However the trace code
    also calls print_section() to hexdump the head of a free object. This
    is hard coded to use KERN_ERR, meaning the console is deluged with trace
    messages even if we've asked for quiet.

    Fix this the obvious way by adding a level parameter to
    print_section(), allowing calls from the trace code to use the same
    trace level as other trace messages.
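
    A hedged sketch of the interface change (prototype and call sites
    abbreviated; the section names are illustrative):

    static void print_section(char *level, char *text, u8 *addr,
                              unsigned int length);

    print_section(KERN_INFO, "Object ", addr, length);     /* trace path */
    print_section(KERN_ERR, "Padding ", addr, length);     /* error paths */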

    Link: http://lkml.kernel.org/r/20170113154850.518-1-daniel.thompson@linaro.org
    Signed-off-by: Daniel Thompson
    Acked-by: Christoph Lameter
    Acked-by: David Rientjes
    Cc: Pekka Enberg
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Daniel Thompson
     
  • commit 478fe3037b2278d276d4cd9cd0ab06c4cb2e9b32 upstream.

    memcg_propagate_slab_attrs() abuses the sysfs attribute file functions
    to propagate settings from the root kmem_cache to a newly created
    kmem_cache. It does that with:

    attr->show(root, buf);
    attr->store(new, buf, strlen(buf));

    Aside from being lazy and absurd hackery, this is broken because it does
    not check the return value of the show() function.

    Some of the show() functions return 0 w/o touching the buffer. That
    means in such a case the store function is called with the stale content
    of the previous show(). That causes nonsense like invoking
    kmem_cache_shrink() on a newly created kmem_cache. In the worst case it
    would cause handing in an uninitialized buffer.

    This should be rewritten properly by adding a propagate() callback to
    those slub_attributes which must be propagated and avoid that insane
    conversion to and from ASCII, but that's too large for a hot fix.

    Check at least the return value of the show() function, so calling
    store() with stale content is prevented.
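
    A hedged sketch of the guard (simplified):

    ssize_t len;

    len = attr->show(root_cache, buf);
    if (len > 0)                    /* only propagate real content */
            attr->store(s, buf, len);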

    Steven said:
    "It can cause a deadlock with get_online_cpus() that has been uncovered
    by recent cpu hotplug and lockdep changes that Thomas and Peter have
    been doing.

    Possible unsafe locking scenario:

    CPU0                            CPU1
    ----                            ----
    lock(cpu_hotplug.lock);
                                    lock(slab_mutex);
                                    lock(cpu_hotplug.lock);
    lock(slab_mutex);

    *** DEADLOCK ***"

    Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1705201244540.2255@nanos
    Signed-off-by: Thomas Gleixner
    Reported-by: Steven Rostedt
    Acked-by: David Rientjes
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Peter Zijlstra
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Joonsoo Kim
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     
  • commit a7306c3436e9c8e584a4b9fad5f3dc91be2a6076 upstream.

    "err" needs to be left set to -EFAULT if split_huge_page succeeds.
    Otherwise if "err" gets clobbered with zero and write_protect_page
    fails, try_to_merge_one_page() will succeed instead of returning -EFAULT
    and then try_to_merge_with_ksm_page() will continue thinking kpage is a
    PageKsm when in fact it's still an anonymous page. Eventually it'll
    crash in page_add_anon_rmap.

    This has been reproduced on Fedora25 kernel but I can reproduce with
    upstream too.

    The bug was introduced in commit f765f540598a ("ksm: prepare to new THP
    semantics") introduced in v4.5.

    page:fffff67546ce1cc0 count:4 mapcount:2 mapping:ffffa094551e36e1 index:0x7f0f46673
    flags: 0x2ffffc0004007c(referenced|uptodate|dirty|lru|active|swapbacked)
    page dumped because: VM_BUG_ON_PAGE(!PageLocked(page))
    page->mem_cgroup:ffffa09674bf0000
    ------------[ cut here ]------------
    kernel BUG at mm/rmap.c:1222!
    CPU: 1 PID: 76 Comm: ksmd Not tainted 4.9.3-200.fc25.x86_64 #1
    RIP: do_page_add_anon_rmap+0x1c4/0x240
    Call Trace:
    page_add_anon_rmap+0x18/0x20
    try_to_merge_with_ksm_page+0x50b/0x780
    ksm_scan_thread+0x1211/0x1410
    ? prepare_to_wait_event+0x100/0x100
    ? try_to_merge_with_ksm_page+0x780/0x780
    kthread+0xd9/0xf0
    ? kthread_park+0x60/0x60
    ret_from_fork+0x25/0x30

    Fixes: f765f54059 ("ksm: prepare to new THP semantics")
    Link: http://lkml.kernel.org/r/20170513131040.21732-1-aarcange@redhat.com
    Signed-off-by: Andrea Arcangeli
    Reported-by: Federico Simoncelli
    Acked-by: Kirill A. Shutemov
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Andrea Arcangeli
     
  • commit 864b9a393dcb5aed09b8fd31b9bbda0fdda99374 upstream.

    We have seen an early OOM killer invocation on ppc64 systems with
    crashkernel=4096M:

    kthreadd invoked oom-killer: gfp_mask=0x16040c0(GFP_KERNEL|__GFP_COMP|__GFP_NOTRACK), nodemask=7, order=0, oom_score_adj=0
    kthreadd cpuset=/ mems_allowed=7
    CPU: 0 PID: 2 Comm: kthreadd Not tainted 4.4.68-1.gd7fe927-default #1
    Call Trace:
    dump_stack+0xb0/0xf0 (unreliable)
    dump_header+0xb0/0x258
    out_of_memory+0x5f0/0x640
    __alloc_pages_nodemask+0xa8c/0xc80
    kmem_getpages+0x84/0x1a0
    fallback_alloc+0x2a4/0x320
    kmem_cache_alloc_node+0xc0/0x2e0
    copy_process.isra.25+0x260/0x1b30
    _do_fork+0x94/0x470
    kernel_thread+0x48/0x60
    kthreadd+0x264/0x330
    ret_from_kernel_thread+0x5c/0xa4

    Mem-Info:
    active_anon:0 inactive_anon:0 isolated_anon:0
    active_file:0 inactive_file:0 isolated_file:0
    unevictable:0 dirty:0 writeback:0 unstable:0
    slab_reclaimable:5 slab_unreclaimable:73
    mapped:0 shmem:0 pagetables:0 bounce:0
    free:0 free_pcp:0 free_cma:0
    Node 7 DMA free:0kB min:0kB low:0kB high:0kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:52428800kB managed:110016kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:320kB slab_unreclaimable:4672kB kernel_stack:1152kB pagetables:0kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
    lowmem_reserve[]: 0 0 0 0
    Node 7 DMA: 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB 0*8192kB 0*16384kB = 0kB
    0 total pagecache pages
    0 pages in swap cache
    Swap cache stats: add 0, delete 0, find 0/0
    Free swap = 0kB
    Total swap = 0kB
    819200 pages RAM
    0 pages HighMem/MovableOnly
    817481 pages reserved
    0 pages cma reserved
    0 pages hwpoisoned

    the reason is that the managed memory is too low (only 110MB) while the
    rest of the 50GB is still waiting for the deferred initialization to
    be done. update_defer_init estimates the initial memory to initialize
    to 2GB at least but it doesn't consider any memory allocated in that
    range. In this particular case we've had

    Reserving 4096MB of memory at 128MB for crashkernel (System RAM: 51200MB)

    so the low 2GB is mostly depleted.

    Fix this by considering memblock allocations in the initial static
    initialization estimation. Move the max_initialise to
    reset_deferred_meminit and implement a simple memblock_reserved_memory
    helper which iterates all reserved blocks and sums the size of all that
    start below the given address. The cumulative size is then added on top
    of the initial estimation. This is still not ideal because
    reset_deferred_meminit doesn't consider holes and so reservation might
    be above the initial estimation, which we ignore, but let's make the
    logic simpler until we really need to handle more complicated cases.
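
    A hedged sketch of the helper's intent (simplified; the name and exact
    bounds handling are illustrative):

    static phys_addr_t reserved_memory_below(phys_addr_t limit)
    {
            struct memblock_region *r;
            phys_addr_t total = 0;

            for_each_memblock(reserved, r)
                    if (r->base < limit)
                            total += r->size;

            return total;   /* added on top of the static 2GB estimate */
    }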

    Fixes: 3a80a7fa7989 ("mm: meminit: initialise a subset of struct pages if CONFIG_DEFERRED_STRUCT_PAGE_INIT is set")
    Link: http://lkml.kernel.org/r/20170531104010.GI27783@dhcp22.suse.cz
    Signed-off-by: Michal Hocko
    Acked-by: Mel Gorman
    Tested-by: Srikar Dronamraju
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Michal Hocko
     
  • commit 70feee0e1ef331b22cc51f383d532a0d043fbdcc upstream.

    Kefeng reported that when running the following test, the mlock count in
    meminfo will increase permanently:

    [1] testcase
    linux:~ # cat test_mlockal
    grep Mlocked /proc/meminfo
    for j in `seq 0 10`
    do
            for i in `seq 4 15`
            do
                    ./p_mlockall >> log &
            done
            sleep 0.2
    done
    # wait some time to let the mlock counter decrease; 5s may not be enough
    sleep 5
    grep Mlocked /proc/meminfo

    linux:~ # cat p_mlockall.c
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/mman.h>

    #define SPACE_LEN 4096

    int main(int argc, char **argv)
    {
            int ret;
            void *adr = malloc(SPACE_LEN);
            if (!adr)
                    return -1;

            ret = mlockall(MCL_CURRENT | MCL_FUTURE);
            printf("mlockall ret = %d\n", ret);

            ret = munlockall();
            printf("munlockall ret = %d\n", ret);

            free(adr);
            return 0;
    }

    In __munlock_pagevec() we should decrement NR_MLOCK for each page where
    we clear the PageMlocked flag. Commit 1ebb7cc6a583 ("mm: munlock: batch
    NR_MLOCK zone state updates") has introduced a bug where we don't
    decrement NR_MLOCK for pages where we clear the flag, but fail to
    isolate them from the lru list (e.g. when the pages are on some other
    cpu's percpu pagevec). Since PageMlocked stays cleared, the NR_MLOCK
    accounting gets permanently disrupted by this.

    Fix it by counting the number of pages whose PageMlocked flag is cleared.

    Fixes: 1ebb7cc6a583 ("mm: munlock: batch NR_MLOCK zone state updates")
    Link: http://lkml.kernel.org/r/1495678405-54569-1-git-send-email-xieyisheng1@huawei.com
    Signed-off-by: Yisheng Xie
    Reported-by: Kefeng Wang
    Tested-by: Kefeng Wang
    Cc: Vlastimil Babka
    Cc: Joern Engel
    Cc: Mel Gorman
    Cc: Michel Lespinasse
    Cc: Hugh Dickins
    Cc: Rik van Riel
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Xishi Qiu
    Cc: zhongjiang
    Cc: Hanjun Guo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Yisheng Xie
     
  • commit 30809f559a0d348c2dfd7ab05e9a451e2384962e upstream.

    On failing to migrate a page, soft_offline_huge_page() performs the
    necessary update to the hugepage ref-count.

    But when !hugepage_migration_supported(), unmap_and_move_hugepage()
    also decrements the page ref-count for the hugepage. The combined
    behaviour leaves the ref-count in an inconsistent state.

    This leads to soft lockups when running the overcommitted hugepage test
    from mce-tests suite.

    Soft offlining pfn 0x83ed600 at process virtual address 0x400000000000
    soft offline: 0x83ed600: migration failed 1, type 1fffc00000008008 (uptodate|head)
    INFO: rcu_preempt detected stalls on CPUs/tasks:
    Tasks blocked on level-0 rcu_node (CPUs 0-7): P2715
    (detected by 7, t=5254 jiffies, g=963, c=962, q=321)
    thugetlb_overco R running task 0 2715 2685 0x00000008
    Call trace:
    dump_backtrace+0x0/0x268
    show_stack+0x24/0x30
    sched_show_task+0x134/0x180
    rcu_print_detail_task_stall_rnp+0x54/0x7c
    rcu_check_callbacks+0xa74/0xb08
    update_process_times+0x34/0x60
    tick_sched_handle.isra.7+0x38/0x70
    tick_sched_timer+0x4c/0x98
    __hrtimer_run_queues+0xc0/0x300
    hrtimer_interrupt+0xac/0x228
    arch_timer_handler_phys+0x3c/0x50
    handle_percpu_devid_irq+0x8c/0x290
    generic_handle_irq+0x34/0x50
    __handle_domain_irq+0x68/0xc0
    gic_handle_irq+0x5c/0xb0

    Address this by changing the putback_active_hugepage() in
    soft_offline_huge_page() to putback_movable_pages().

    This only triggers on systems that enable memory failure handling
    (ARCH_SUPPORTS_MEMORY_FAILURE) but not hugepage migration
    (!ARCH_ENABLE_HUGEPAGE_MIGRATION).

    I imagine this wasn't triggered as there aren't many systems running
    this configuration.

    [akpm@linux-foundation.org: remove dead comment, per Naoya]
    Link: http://lkml.kernel.org/r/20170525135146.32011-1-punit.agrawal@arm.com
    Reported-by: Manoj Iyer
    Tested-by: Manoj Iyer
    Suggested-by: Naoya Horiguchi
    Signed-off-by: Punit Agrawal
    Cc: Joonsoo Kim
    Cc: Wanpeng Li
    Cc: Christoph Lameter
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Punit Agrawal
     

20 May, 2017

1 commit

  • commit 62be1511b1db8066220b18b7d4da2e6b9fdc69fb upstream.

    Patch series "more robust PF_MEMALLOC handling"

    This series aims to unify the setting and clearing of PF_MEMALLOC, which
    prevents recursive reclaim. There are some places that clear the flag
    unconditionally from current->flags, which may result in clearing a
    pre-existing flag. This already resulted in a bug report that Patch 1
    fixes (without the new helpers, to make backporting easier). Patch 2
    introduces the new helpers, modelled after existing memalloc_noio_* and
    memalloc_nofs_* helpers, and converts mm core to use them. Patches 3
    and 4 convert non-mm code.

    This patch (of 4):

    __alloc_pages_direct_compact() sets PF_MEMALLOC to prevent deadlock
    during page migration by lock_page() (see the comment in
    __unmap_and_move()). Then it unconditionally clears the flag, which can
    clear a pre-existing PF_MEMALLOC flag and result in recursive reclaim.
    This was not a problem until commit a8161d1ed609 ("mm, page_alloc:
    restructure direct compaction handling in slowpath"), because direct
    compaction was called only after direct reclaim, which was skipped when
    the PF_MEMALLOC flag was set.

    Even now it's only a theoretical issue, as the new callsite of
    __alloc_pages_direct_compact() is reached only for costly orders and
    when gfp_pfmemalloc_allowed() is true, which means either
    __GFP_NOMEMALLOC is in gfp_flags or in_interrupt() is true. There is no
    such known context, but let's play it safe and make
    __alloc_pages_direct_compact() robust for cases where PF_MEMALLOC is
    already set.
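
    A hedged sketch of the pattern applied in __alloc_pages_direct_compact()
    (this patch predates the memalloc_noreclaim_* helpers mentioned above):

    unsigned int noreclaim_flag = current->flags & PF_MEMALLOC;

    current->flags |= PF_MEMALLOC;
    /* ... run compaction, which may take page locks during migration ... */
    current->flags = (current->flags & ~PF_MEMALLOC) | noreclaim_flag;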

    Fixes: a8161d1ed609 ("mm, page_alloc: restructure direct compaction handling in slowpath")
    Link: http://lkml.kernel.org/r/20170405074700.29871-2-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka
    Reported-by: Andrey Ryabinin
    Acked-by: Michal Hocko
    Acked-by: Hillf Danton
    Cc: Mel Gorman
    Cc: Johannes Weiner
    Cc: Boris Brezillon
    Cc: Chris Leech
    Cc: "David S. Miller"
    Cc: Eric Dumazet
    Cc: Josef Bacik
    Cc: Lee Duncan
    Cc: Michal Hocko
    Cc: Richard Weinberger
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Vlastimil Babka
     

27 Apr, 2017

1 commit

  • commit fc280fe871449ead4bdbd1665fa52c7c01c64765 upstream.

    Commit 6afcf8ef0ca0 ("mm, compaction: fix NR_ISOLATED_* stats for pfn
    based migration") moved the dec_node_page_state() call (along with the
    page_is_file_cache() call) to after putback_lru_page().

    But page_is_file_cache() can change after putback_lru_page() is called,
    so it should be called before putback_lru_page(), as it was before that
    patch, to prevent NR_ISOLATE_* stats from going negative.

    Without this fix, non-CONFIG_SMP kernels end up hanging in the
    while(too_many_isolated()) { congestion_wait() } loop in
    shrink_active_list() due to the negative stats.
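
    A hedged sketch of the reorder (simplified):

    /* Sample the page type and adjust NR_ISOLATED_* before the page goes
     * back on the LRU, where its type can change under us. */
    dec_node_page_state(page, NR_ISOLATED_ANON + page_is_file_cache(page));
    putback_lru_page(page);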

    Mem-Info:
    active_anon:32567 inactive_anon:121 isolated_anon:1
    active_file:6066 inactive_file:6639 isolated_file:4294967295
    ^^^^^^^^^^
    unevictable:0 dirty:115 writeback:0 unstable:0
    slab_reclaimable:2086 slab_unreclaimable:3167
    mapped:3398 shmem:18366 pagetables:1145 bounce:0
    free:1798 free_pcp:13 free_cma:0

    Fixes: 6afcf8ef0ca0 ("mm, compaction: fix NR_ISOLATED_* stats for pfn based migration")
    Link: http://lkml.kernel.org/r/1492683865-27549-1-git-send-email-rabin.vincent@axis.com
    Signed-off-by: Rabin Vincent
    Acked-by: Michal Hocko
    Cc: Ming Ling
    Cc: Minchan Kim
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Rabin Vincent
     

21 Apr, 2017

3 commits

  • commit 13583c3d3224508582ec03d881d0b68dd3ee8e10 upstream.

    Creating a lot of cgroups at the same time might stall all worker
    threads with kmem cache creation works, because kmem cache creation is
    done with the slab_mutex held. The problem was amplified by commits
    801faf0db894 ("mm/slab: lockless decision to grow cache") in case of
    SLAB and 81ae6d03952c ("mm/slub.c: replace kick_all_cpus_sync() with
    synchronize_sched() in kmem_cache_shrink()") in case of SLUB, which
    increased the maximal time the slab_mutex can be held.

    To prevent that from happening, let's use a special ordered single
    threaded workqueue for kmem cache creation. This shouldn't introduce
    any functional changes regarding how kmem caches are created, as the
    work function holds the global slab_mutex during its whole runtime
    anyway, making it impossible to run more than one work at a time. By
    using a single threaded workqueue, we just avoid creating a thread per
    each work. Ordering is required to avoid a situation when a cgroup's
    work is put off indefinitely because there are other cgroups to serve,
    in other words to guarantee fairness.
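
    A hedged sketch of the mechanism (the workqueue and work names are
    illustrative):

    static struct workqueue_struct *memcg_kmem_cache_wq;

    /* at init time: one ordered, single-threaded queue, so cache-creation
     * works run strictly one at a time without tying up the worker pool */
    memcg_kmem_cache_wq = alloc_ordered_workqueue("memcg_kmem_cache", 0);

    /* whenever a cgroup needs a new kmem cache */
    queue_work(memcg_kmem_cache_wq, &cw->work);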

    Link: https://bugzilla.kernel.org/show_bug.cgi?id=172981
    Link: http://lkml.kernel.org/r/20161004131417.GC1862@esperanza
    Signed-off-by: Vladimir Davydov
    Reported-by: Doug Smythies
    Acked-by: Michal Hocko
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Cc: Ben Hutchings
    Signed-off-by: Greg Kroah-Hartman

    Vladimir Davydov
     
  • commit 85d492f28d056c40629fc25d79f54da618a29dc4 upstream.

    On a 64K page system, zsmalloc has 257 classes, so 8 class bits are not
    enough. With that, it corrupts the system when zsmalloc stores
    65536-byte data (i.e., class index 256), so this patch increases the
    class bits as a simple fix suitable for stable backport. We should
    clean up this mess soon.

    index      size
        0        32
        1       288
       ..
       ..
      204     52256
      256     65536

    Fixes: 3783689a1 ("zsmalloc: introduce zspage structure")
    Link: http://lkml.kernel.org/r/1492042622-12074-3-git-send-email-minchan@kernel.org
    Signed-off-by: Minchan Kim
    Cc: Sergey Senozhatsky
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Minchan Kim
     
  • commit 58ceeb6bec86d9140f9d91d71a710e963523d063 upstream.

    Both MADV_DONTNEED and MADV_FREE are handled with down_read(mmap_sem).

    It's critical to not clear pmd intermittently while handling MADV_FREE
    to avoid race with MADV_DONTNEED:

    CPU0:                                   CPU1:
                                            madvise_free_huge_pmd()
                                             pmdp_huge_get_and_clear_full()
    madvise_dontneed()
     zap_pmd_range()
      pmd_trans_huge(*pmd) == 0 (without ptl)
      // skip the pmd
                                             set_pmd_at();
                                             // pmd is re-established

    It results in MADV_DONTNEED skipping the pmd, leaving it not cleared.
    It violates the MADV_DONTNEED interface and can result in userspace
    misbehaviour.

    Basically it's the same race as with numa balancing in
    change_huge_pmd(), but a bit simpler to mitigate: we don't need to
    preserve dirty/young flags here due to MADV_FREE functionality.

    [kirill.shutemov@linux.intel.com: Urgh... Power is special again]
    Link: http://lkml.kernel.org/r/20170303102636.bhd2zhtpds4mt62a@black.fi.intel.com
    Link: http://lkml.kernel.org/r/20170302151034.27829-4-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Acked-by: Minchan Kim
    Cc: Minchan Kim
    Cc: Andrea Arcangeli
    Cc: Hillf Danton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Kirill A. Shutemov
     

12 Apr, 2017

2 commits

  • commit cf01fb9985e8deb25ccf0ea54d916b8871ae0e62 upstream.

    In the case that compat_get_bitmap fails we do not want to copy the
    bitmap to the user as it will contain uninitialized stack data and leak
    sensitive data.

    Signed-off-by: Chris Salls
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Chris Salls
     
  • commit 1f06b81aea5ecba2c1f8afd87e0ba1b9f8f90160 upstream.

    Fixes: 11fb998986a72a ("mm: move most file-based accounting to the node")
    Link: http://lkml.kernel.org/r/1490377730.30219.2.camel@beget.ru
    Signed-off-by: Alexander Polakov
    Acked-by: Michal Hocko
    Cc: Mel Gorman
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Alexander Polakov
     

08 Apr, 2017

3 commits

  • commit 0cefabdaf757a6455d75f00cb76874e62703ed18 upstream.

    Commit 0a6b76dd23fa ("mm: workingset: make shadow node shrinker memcg
    aware") enabled cgroup-awareness in the shadow node shrinker, but forgot
    to also enable cgroup-awareness in the list_lru the shadow nodes sit on.

    Consequently, all shadow nodes are sitting on a global (per-NUMA node)
    list, while the shrinker applies the limits according to the amount of
    cache in the cgroup it is shrinking. The result is excessive pressure on
    the shadow nodes from cgroups that have very little cache.

    Enable memcg-mode on the shadow node LRUs, such that per-cgroup limits
    are applied to per-cgroup lists.

    Fixes: 0a6b76dd23fa ("mm: workingset: make shadow node shrinker memcg aware")
    Link: http://lkml.kernel.org/r/20170322005320.8165-1-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Acked-by: Vladimir Davydov
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Johannes Weiner
     
  • commit c9d398fa237882ea07167e23bcfc5e6847066518 upstream.

    I found the race condition which triggers the following bug when
    move_pages() and soft offline are called on a single hugetlb page
    concurrently.

    Soft offlining page 0x119400 at 0x700000000000
    BUG: unable to handle kernel paging request at ffffea0011943820
    IP: follow_huge_pmd+0x143/0x190
    PGD 7ffd2067
    PUD 7ffd1067
    PMD 0
    [61163.582052] Oops: 0000 [#1] SMP
    Modules linked in: binfmt_misc ppdev virtio_balloon parport_pc pcspkr i2c_piix4 parport i2c_core acpi_cpufreq ip_tables xfs libcrc32c ata_generic pata_acpi virtio_blk 8139too crc32c_intel ata_piix serio_raw libata virtio_pci 8139cp virtio_ring virtio mii floppy dm_mirror dm_region_hash dm_log dm_mod [last unloaded: cap_check]
    CPU: 0 PID: 22573 Comm: iterate_numa_mo Tainted: P OE 4.11.0-rc2-mm1+ #2
    Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
    RIP: 0010:follow_huge_pmd+0x143/0x190
    RSP: 0018:ffffc90004bdbcd0 EFLAGS: 00010202
    RAX: 0000000465003e80 RBX: ffffea0004e34d30 RCX: 00003ffffffff000
    RDX: 0000000011943800 RSI: 0000000000080001 RDI: 0000000465003e80
    RBP: ffffc90004bdbd18 R08: 0000000000000000 R09: ffff880138d34000
    R10: ffffea0004650000 R11: 0000000000c363b0 R12: ffffea0011943800
    R13: ffff8801b8d34000 R14: ffffea0000000000 R15: 000077ff80000000
    FS: 00007fc977710740(0000) GS:ffff88007dc00000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: ffffea0011943820 CR3: 000000007a746000 CR4: 00000000001406f0
    Call Trace:
    follow_page_mask+0x270/0x550
    SYSC_move_pages+0x4ea/0x8f0
    SyS_move_pages+0xe/0x10
    do_syscall_64+0x67/0x180
    entry_SYSCALL64_slow_path+0x25/0x25
    RIP: 0033:0x7fc976e03949
    RSP: 002b:00007ffe72221d88 EFLAGS: 00000246 ORIG_RAX: 0000000000000117
    RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fc976e03949
    RDX: 0000000000c22390 RSI: 0000000000001400 RDI: 0000000000005827
    RBP: 00007ffe72221e00 R08: 0000000000c2c3a0 R09: 0000000000000004
    R10: 0000000000c363b0 R11: 0000000000000246 R12: 0000000000400650
    R13: 00007ffe72221ee0 R14: 0000000000000000 R15: 0000000000000000
    Code: 81 e4 ff ff 1f 00 48 21 c2 49 c1 ec 0c 48 c1 ea 0c 4c 01 e2 49 bc 00 00 00 00 00 ea ff ff 48 c1 e2 06 49 01 d4 f6 45 bc 04 74 90 8b 7c 24 20 40 f6 c7 01 75 2b 4c 89 e7 8b 47 1c 85 c0 7e 2a
    RIP: follow_huge_pmd+0x143/0x190 RSP: ffffc90004bdbcd0
    CR2: ffffea0011943820
    ---[ end trace e4f81353a2d23232 ]---
    Kernel panic - not syncing: Fatal exception
    Kernel Offset: disabled

    This bug is triggered when pmd_present() returns true for non-present
    hugetlb, so fixing the present check in follow_huge_pmd() prevents it.
    Using pmd_present() to determine present/non-present for hugetlb is not
    correct, because pmd_present() checks multiple bits (not only
    _PAGE_PRESENT) for historical reasons and can misjudge hugetlb state.

    Fixes: e66f17ff7177 ("mm/hugetlb: take page table lock in follow_huge_pmd()")
    Link: http://lkml.kernel.org/r/1490149898-20231-1-git-send-email-n-horiguchi@ah.jp.nec.com
    Signed-off-by: Naoya Horiguchi
    Acked-by: Hillf Danton
    Cc: Hugh Dickins
    Cc: Michal Hocko
    Cc: "Kirill A. Shutemov"
    Cc: Mike Kravetz
    Cc: Christian Borntraeger
    Cc: Gerald Schaefer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Naoya Horiguchi
     
  • commit 553af430e7c981e6e8fa5007c5b7b5773acc63dd upstream.

    Huge pages are accounted as single units in the memcg's "file_mapped"
    counter. Account the correct number of base pages, like we do in the
    corresponding node counter.

    Link: http://lkml.kernel.org/r/20170322005111.3156-1-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Reviewed-by: Kirill A. Shutemov
    Acked-by: Michal Hocko
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Johannes Weiner
     

26 Mar, 2017

1 commit

  • commit 320661b08dd6f1746d5c7ab4eb435ec64b97cd45 upstream.

    The update to pcpu_nr_empty_pop_pages in pcpu_alloc() is currently done
    without holding pcpu_lock. This can lead to bad updates to the variable.
    Add the missing lock calls.
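
    A hedged sketch of the fix in the pcpu_alloc() path (the nr_occupied
    name is illustrative):

    unsigned long flags;

    spin_lock_irqsave(&pcpu_lock, flags);
    pcpu_nr_empty_pop_pages -= nr_occupied;
    spin_unlock_irqrestore(&pcpu_lock, flags);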

    Fixes: b539b87fed37 ("percpu: implmeent pcpu_nr_empty_pop_pages and chunk->nr_populated")
    Signed-off-by: Tahsin Erdogan
    Signed-off-by: Tejun Heo
    Signed-off-by: Greg Kroah-Hartman

    Tahsin Erdogan
     

22 Mar, 2017

1 commit

  • [ Upstream commit 89e364db71fb5e7fc8d93228152abfa67daf35fa ]

    synchronize_sched() is a heavy operation and calling it for each cache
    owned by a memory cgroup being destroyed may take quite some time. What
    is worse, it's currently called under the slab_mutex, stalling all works
    doing cache creation/destruction.

    Actually, there isn't much point in calling synchronize_sched() for each
    cache - it's enough to call it just once - after setting cpu_partial for
    all caches and before shrinking them. This way, we can also move it out
    of the slab_mutex, which we have to hold for iterating over the slab
    cache list.

    Link: https://bugzilla.kernel.org/show_bug.cgi?id=172991
    Link: http://lkml.kernel.org/r/0a10d71ecae3db00fb4421bcd3f82bcc911f4be4.1475329751.git.vdavydov.dev@gmail.com
    Signed-off-by: Vladimir Davydov
    Reported-by: Doug Smythies
    Acked-by: Joonsoo Kim
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Vladimir Davydov
     

15 Mar, 2017

2 commits

  • commit 40e952f9d687928b32db20226f085ae660a7237c upstream.

    mem_cgroup_free() indirectly calls wb_domain_exit() which is not
    prepared to deal with a struct wb_domain object that hasn't executed
    wb_domain_init(). For instance, the following warning message is
    printed by lockdep if alloc_percpu() fails in mem_cgroup_alloc():

    INFO: trying to register non-static key.
    the code is fine but needs lockdep annotation.
    turning off the locking correctness validator.
    CPU: 1 PID: 1950 Comm: mkdir Not tainted 4.10.0+ #151
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
    Call Trace:
    dump_stack+0x67/0x99
    register_lock_class+0x36d/0x540
    __lock_acquire+0x7f/0x1a30
    lock_acquire+0xcc/0x200
    del_timer_sync+0x3c/0xc0
    wb_domain_exit+0x14/0x20
    mem_cgroup_free+0x14/0x40
    mem_cgroup_css_alloc+0x3f9/0x620
    cgroup_apply_control_enable+0x190/0x390
    cgroup_mkdir+0x290/0x3d0
    kernfs_iop_mkdir+0x58/0x80
    vfs_mkdir+0x10e/0x1a0
    SyS_mkdirat+0xa8/0xd0
    SyS_mkdir+0x14/0x20
    entry_SYSCALL_64_fastpath+0x18/0xad

    Add __mem_cgroup_free() which skips wb_domain_exit(). This is used by
    both mem_cgroup_free() and the mem_cgroup_alloc() cleanup path.
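
    A hedged sketch of the resulting split (simplified):

    static void __mem_cgroup_free(struct mem_cgroup *memcg)
    {
            /* free per-node info, per-cpu stats and the memcg itself,
             * without touching the writeback domain */
    }

    static void mem_cgroup_free(struct mem_cgroup *memcg)
    {
            memcg_wb_domain_exit(memcg);    /* only safe after full init */
            __mem_cgroup_free(memcg);
    }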

    Fixes: 0b8f73e104285 ("mm: memcontrol: clean up alloc, online, offline, free functions")
    Link: http://lkml.kernel.org/r/20170306192122.24262-1-tahsin@google.com
    Signed-off-by: Tahsin Erdogan
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Tahsin Erdogan
     
  • commit 6ebb4a1b848fe75323135f93e72c78f8780fd268 upstream.

    The following test case triggers BUG() in munlock_vma_pages_range():

    #include <stdlib.h>
    #include <unistd.h>
    #include <fcntl.h>
    #include <sys/mman.h>

    int main(int argc, char *argv[])
    {
            int fd;

            system("mount -t tmpfs -o huge=always none /mnt");
            fd = open("/mnt/test", O_CREAT | O_RDWR);
            ftruncate(fd, 4UL << 20);
            mmap(NULL, 4UL << 20, PROT_READ | PROT_WRITE,
                 MAP_SHARED | MAP_FIXED | MAP_LOCKED, fd, 0);
            mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                 MAP_SHARED | MAP_LOCKED, fd, 0);
            munlockall();
            return 0;
    }

    The second mmap() creates a PTE-mapping of the first huge page in the
    file. It makes the kernel munlock the page, as we never keep PTE-mapped
    pages mlocked.

    On munlockall(), when we handle the vma created by the first mmap(),
    munlock_vma_page() returns page_mask == 0, as the page is not mlocked
    anymore. On the next iteration follow_page_mask() returns the tail page, but
    page_mask is HPAGE_NR_PAGES - 1. It makes us skip to the first tail
    page of the next huge page and step on
    VM_BUG_ON_PAGE(PageMlocked(page)).

    The fix is to not use the page_mask from follow_page_mask() at all. It has
    no use for us.

    Link: http://lkml.kernel.org/r/20170302150252.34120-1-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Kirill A. Shutemov