20 Apr, 2016

1 commit

  • Unable to handle kernel paging request at virtual address 0af37d40
    pgd = d4dec000
    [0af37d40] *pgd=00000000
    Internal error: Oops: 5 [#1] PREEMPT SMP ARM
    [] (_raw_spin_lock) from [] (list_lru_count_one+0x14/0x28)
    [] (list_lru_count_one) from [] (super_cache_count+0x40/0xa0)
    [] (super_cache_count) from [] (debug_shrinker_show+0x50/0x90)
    [] (debug_shrinker_show) from [] (seq_read+0x1ec/0x48c)
    [] (seq_read) from [] (__vfs_read+0x20/0xd0)
    [] (__vfs_read) from [] (vfs_read+0x7c/0x104)
    [] (vfs_read) from [] (SyS_read+0x44/0x9c)
    [] (SyS_read) from [] (ret_fast_syscall+0x0/0x3c)
    Code: e1a04000 e3a00001 ebd66b39 f594f000 (e1943f9f)
    ---[ end trace 60c74014a63a9688 ]---
    Kernel panic - not syncing: Fatal exception

    shrink_control.nid is used but not initialized. Set shrink_control.nid
    to traverse the NUMA nodes to fix this issue; a sketch of the fix's
    shape follows.
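
    A sketch of the shape of such a fix (illustrative, not the verbatim
    patch; count_all_nodes is a hypothetical helper):

    #include <linux/gfp.h>
    #include <linux/nodemask.h>
    #include <linux/shrinker.h>

    /* Hypothetical helper: query a shrinker's object count with
     * shrink_control.nid set for each node rather than left
     * uninitialized. */
    static unsigned long count_all_nodes(struct shrinker *shrinker)
    {
        struct shrink_control sc = {
            .gfp_mask = GFP_KERNEL,
            .nr_to_scan = 0,    /* count only, do not reclaim */
        };
        unsigned long total = 0;
        int nid;

        for_each_node(nid) {
            sc.nid = nid;       /* the previously missing initialization */
            total += shrinker->count_objects(shrinker, &sc);
        }
        return total;
    }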

    Signed-off-by: Xiaowen Liu

    Xiaowen Liu
     

19 Jan, 2016

1 commit

  • Conflicts:
    arch/arm/boot/dts/Makefile
    arch/arm/boot/dts/imx6qdl-sabreauto.dtsi
    arch/arm/boot/dts/imx6qdl-sabresd.dtsi
    arch/arm/boot/dts/imx6qp-sabresd.dts
    arch/arm/boot/dts/imx6sl-evk.dts
    arch/arm/boot/dts/imx6sl.dtsi
    arch/arm/boot/dts/imx6sx-14x14-arm2.dts
    arch/arm/boot/dts/imx6sx-19x19-arm2.dts
    arch/arm/boot/dts/imx6sx-sabreauto.dts
    arch/arm/boot/dts/imx6sx-sdb-btwifi.dts
    arch/arm/boot/dts/imx6sx-sdb.dtsi
    arch/arm/boot/dts/imx6sx.dtsi
    arch/arm/boot/dts/imx6ul-14x14-evk.dts
    arch/arm/boot/dts/imx6ul-9x9-evk.dts
    arch/arm/boot/dts/imx6ul-evk-btwifi.dtsi
    arch/arm/boot/dts/imx6ul-pinfunc.h
    arch/arm/boot/dts/imx6ul.dtsi
    arch/arm/boot/dts/imx7d-12x12-lpddr3-arm2.dts
    arch/arm/boot/dts/imx7d-pinfunc.h
    arch/arm/boot/dts/imx7d-sdb-epdc.dtsi
    arch/arm/boot/dts/imx7d-sdb-m4.dtsi
    arch/arm/boot/dts/imx7d-sdb-reva-touch.dts
    arch/arm/boot/dts/imx7d-sdb-reva.dts
    arch/arm/boot/dts/imx7d-sdb.dts
    arch/arm/boot/dts/imx7d.dtsi
    arch/arm/configs/imx_v7_defconfig
    arch/arm/configs/imx_v7_mfg_defconfig
    arch/arm/mach-imx/clk-imx6q.c
    arch/arm/mach-imx/clk.h
    arch/arm/mach-imx/cpuidle-imx7d.c
    arch/arm/mach-imx/ddr3_freq_imx7d.S
    arch/arm/mach-imx/gpcv2.c
    arch/arm/mach-imx/imx7d_low_power_idle.S
    arch/arm/mach-imx/lpddr3_freq_imx.S
    arch/arm/mach-imx/mach-imx7d.c
    arch/arm/mach-imx/pm-imx7.c
    arch/arm/mach-imx/suspend-imx7.S
    drivers/ata/ahci_imx.c
    drivers/cpufreq/imx6q-cpufreq.c
    drivers/dma/imx-sdma.c
    drivers/dma/pxp/pxp_dma_v2.c
    drivers/input/touchscreen/ads7846.c
    drivers/media/platform/mxc/capture/ov5640_mipi.c
    drivers/media/platform/mxc/output/mxc_pxp_v4l2.c
    drivers/mmc/core/core.c
    drivers/mmc/core/sd.c
    drivers/mtd/spi-nor/fsl-quadspi.c
    drivers/mxc/gpu-viv/Kbuild
    drivers/mxc/gpu-viv/config
    drivers/mxc/gpu-viv/hal/kernel/arch/gc_hal_kernel_context.c
    drivers/mxc/gpu-viv/hal/kernel/arch/gc_hal_kernel_context.h
    drivers/mxc/gpu-viv/hal/kernel/arch/gc_hal_kernel_hardware.c
    drivers/mxc/gpu-viv/hal/kernel/arch/gc_hal_kernel_hardware.h
    drivers/mxc/gpu-viv/hal/kernel/arch/gc_hal_kernel_recorder.c
    drivers/mxc/gpu-viv/hal/kernel/archvg/gc_hal_kernel_hardware_command_vg.c
    drivers/mxc/gpu-viv/hal/kernel/archvg/gc_hal_kernel_hardware_command_vg.h
    drivers/mxc/gpu-viv/hal/kernel/archvg/gc_hal_kernel_hardware_vg.c
    drivers/mxc/gpu-viv/hal/kernel/archvg/gc_hal_kernel_hardware_vg.h
    drivers/mxc/gpu-viv/hal/kernel/gc_hal_kernel.c
    drivers/mxc/gpu-viv/hal/kernel/gc_hal_kernel.h
    drivers/mxc/gpu-viv/hal/kernel/gc_hal_kernel_command.c
    drivers/mxc/gpu-viv/hal/kernel/gc_hal_kernel_command_vg.c
    drivers/mxc/gpu-viv/hal/kernel/gc_hal_kernel_db.c
    drivers/mxc/gpu-viv/hal/kernel/gc_hal_kernel_debug.c
    drivers/mxc/gpu-viv/hal/kernel/gc_hal_kernel_event.c
    drivers/mxc/gpu-viv/hal/kernel/gc_hal_kernel_heap.c
    drivers/mxc/gpu-viv/hal/kernel/gc_hal_kernel_interrupt_vg.c
    drivers/mxc/gpu-viv/hal/kernel/gc_hal_kernel_mmu.c
    drivers/mxc/gpu-viv/hal/kernel/gc_hal_kernel_mmu_vg.c
    drivers/mxc/gpu-viv/hal/kernel/gc_hal_kernel_power.c
    drivers/mxc/gpu-viv/hal/kernel/gc_hal_kernel_precomp.h
    drivers/mxc/gpu-viv/hal/kernel/gc_hal_kernel_security.c
    drivers/mxc/gpu-viv/hal/kernel/gc_hal_kernel_vg.c
    drivers/mxc/gpu-viv/hal/kernel/gc_hal_kernel_vg.h
    drivers/mxc/gpu-viv/hal/kernel/gc_hal_kernel_video_memory.c
    drivers/mxc/gpu-viv/hal/kernel/inc/gc_hal.h
    drivers/mxc/gpu-viv/hal/kernel/inc/gc_hal_base.h
    drivers/mxc/gpu-viv/hal/kernel/inc/gc_hal_driver.h
    drivers/mxc/gpu-viv/hal/kernel/inc/gc_hal_driver_vg.h
    drivers/mxc/gpu-viv/hal/kernel/inc/gc_hal_dump.h
    drivers/mxc/gpu-viv/hal/kernel/inc/gc_hal_eglplatform.h
    drivers/mxc/gpu-viv/hal/kernel/inc/gc_hal_eglplatform_type.h
    drivers/mxc/gpu-viv/hal/kernel/inc/gc_hal_engine.h
    drivers/mxc/gpu-viv/hal/kernel/inc/gc_hal_engine_vg.h
    drivers/mxc/gpu-viv/hal/kernel/inc/gc_hal_enum.h
    drivers/mxc/gpu-viv/hal/kernel/inc/gc_hal_kernel_buffer.h
    drivers/mxc/gpu-viv/hal/kernel/inc/gc_hal_mem.h
    drivers/mxc/gpu-viv/hal/kernel/inc/gc_hal_options.h
    drivers/mxc/gpu-viv/hal/kernel/inc/gc_hal_profiler.h
    drivers/mxc/gpu-viv/hal/kernel/inc/gc_hal_raster.h
    drivers/mxc/gpu-viv/hal/kernel/inc/gc_hal_rename.h
    drivers/mxc/gpu-viv/hal/kernel/inc/gc_hal_security_interface.h
    drivers/mxc/gpu-viv/hal/kernel/inc/gc_hal_statistics.h
    drivers/mxc/gpu-viv/hal/kernel/inc/gc_hal_types.h
    drivers/mxc/gpu-viv/hal/kernel/inc/gc_hal_version.h
    drivers/mxc/gpu-viv/hal/kernel/inc/gc_hal_vg.h
    drivers/mxc/gpu-viv/hal/os/linux/kernel/allocator/default/gc_hal_kernel_allocator_array.h
    drivers/mxc/gpu-viv/hal/os/linux/kernel/allocator/default/gc_hal_kernel_allocator_dmabuf.c
    drivers/mxc/gpu-viv/hal/os/linux/kernel/allocator/freescale/gc_hal_kernel_allocator_array.h
    drivers/mxc/gpu-viv/hal/os/linux/kernel/allocator/freescale/gc_hal_kernel_allocator_cma.c
    drivers/mxc/gpu-viv/hal/os/linux/kernel/gc_hal_kernel_allocator.c
    drivers/mxc/gpu-viv/hal/os/linux/kernel/gc_hal_kernel_allocator.h
    drivers/mxc/gpu-viv/hal/os/linux/kernel/gc_hal_kernel_debug.h
    drivers/mxc/gpu-viv/hal/os/linux/kernel/gc_hal_kernel_debugfs.c
    drivers/mxc/gpu-viv/hal/os/linux/kernel/gc_hal_kernel_debugfs.h
    drivers/mxc/gpu-viv/hal/os/linux/kernel/gc_hal_kernel_device.c
    drivers/mxc/gpu-viv/hal/os/linux/kernel/gc_hal_kernel_device.h
    drivers/mxc/gpu-viv/hal/os/linux/kernel/gc_hal_kernel_iommu.c
    drivers/mxc/gpu-viv/hal/os/linux/kernel/gc_hal_kernel_linux.c
    drivers/mxc/gpu-viv/hal/os/linux/kernel/gc_hal_kernel_linux.h
    drivers/mxc/gpu-viv/hal/os/linux/kernel/gc_hal_kernel_math.c
    drivers/mxc/gpu-viv/hal/os/linux/kernel/gc_hal_kernel_mutex.h
    drivers/mxc/gpu-viv/hal/os/linux/kernel/gc_hal_kernel_os.c
    drivers/mxc/gpu-viv/hal/os/linux/kernel/gc_hal_kernel_os.h
    drivers/mxc/gpu-viv/hal/os/linux/kernel/gc_hal_kernel_platform.h
    drivers/mxc/gpu-viv/hal/os/linux/kernel/gc_hal_kernel_probe.c
    drivers/mxc/gpu-viv/hal/os/linux/kernel/gc_hal_kernel_security_channel.c
    drivers/mxc/gpu-viv/hal/os/linux/kernel/gc_hal_kernel_sync.c
    drivers/mxc/gpu-viv/hal/os/linux/kernel/gc_hal_kernel_sync.h
    drivers/mxc/gpu-viv/hal/os/linux/kernel/platform/freescale/gc_hal_kernel_platform_imx6q14.c
    drivers/mxc/gpu-viv/hal/os/linux/kernel/platform/freescale/gc_hal_kernel_platform_imx6q14.config
    drivers/mxc/hdmi-cec/mxc_hdmi-cec.c
    drivers/mxc/ipu3/ipu_common.c
    drivers/mxc/mlb/mxc_mlb.c
    drivers/net/ethernet/freescale/fec_main.c
    drivers/net/wireless/bcmdhd/dhd_linux.c
    drivers/net/wireless/bcmdhd/dhd_sdio.c
    drivers/scsi/scsi_error.c
    drivers/spi/spi-imx.c
    drivers/thermal/imx_thermal.c
    drivers/tty/serial/imx.c
    drivers/usb/chipidea/udc.c
    drivers/usb/gadget/configfs.c
    drivers/video/fbdev/mxc/mipi_dsi.c
    drivers/video/fbdev/mxc/mipi_dsi.h
    drivers/video/fbdev/mxc/mipi_dsi_samsung.c
    drivers/video/fbdev/mxc/mxc_edid.c
    drivers/video/fbdev/mxc/mxc_epdc_fb.c
    drivers/video/fbdev/mxc/mxc_epdc_v2_fb.c
    drivers/video/fbdev/mxc/mxc_ipuv3_fb.c
    drivers/video/fbdev/mxc/mxcfb_hx8369_wvga.c
    drivers/video/fbdev/mxsfb.c
    firmware/imx/sdma/sdma-imx6q.bin.ihex
    include/trace/events/cpufreq_interactive.h

    guoyin.chen
     

16 Dec, 2015

1 commit


20 Nov, 2015

7 commits

  • Userspace processes often have multiple allocators that each do
    anonymous mmaps to get memory. When examining memory usage of
    individual processes or systems as a whole, it is useful to be
    able to break down the various heaps that were allocated by
    each layer and examine their size, RSS, and physical memory
    usage.

    This patch adds a user pointer to the shared union in
    vm_area_struct that points to a null terminated string inside
    the user process containing a name for the vma. vmas that
    point to the same address will be merged, but vmas that
    point to equivalent strings at different addresses will
    not be merged.

    Userspace can set the name for a region of memory by calling
    prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME, start, len, (unsigned long)name);
    Setting the name to NULL clears it.
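
    A minimal userspace sketch of this call (the constant values are
    assumptions taken from this patch's uapi additions; use the real
    header when available):

    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/prctl.h>

    #ifndef PR_SET_VMA
    #define PR_SET_VMA 0x53564d41           /* assumed value */
    #define PR_SET_VMA_ANON_NAME 0          /* assumed value */
    #endif

    int main(void)
    {
        size_t len = 1 << 20;
        /* The kernel stores the user pointer, so the string must stay
         * valid for the lifetime of the mapping. */
        static const char name[] = "myheap";
        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        if (p == MAP_FAILED)
            return 1;
        /* Afterwards /proc/self/maps shows this region as [anon:myheap]. */
        if (prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME,
                  (unsigned long)p, len, (unsigned long)name))
            perror("prctl");
        return 0;
    }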

    The names of named anonymous vmas are shown in /proc/pid/maps
    as [anon:<name>] and in /proc/pid/smaps in a new "Name" field
    that is only present for named vmas. If the userspace pointer
    is no longer valid, all or part of the name will be replaced
    with "<fault>".

    The idea to store a userspace pointer to reduce the complexity
    within mm (at the expense of the complexity of reading
    /proc/pid/mem) came from Dave Hansen. This results in no
    runtime overhead in the mm subsystem other than comparing
    the anon_name pointers when considering vma merging. The pointer
    is stored in a union with fields that are only used on file-backed
    mappings, so it does not increase memory usage.

    Includes a fix from Jed Davis for a typo in
    prctl_set_vma_anon_name which could attempt to set the name
    across two vmas at the same time and might thereby corrupt the
    vma list. Fix it to use tmp instead of end to limit the name
    setting to a single vma at a time.

    Change-Id: I9aa7b6b5ef536cd780599ba4e2fba8ceebe8b59f
    Signed-off-by: Dmitry Shmidt

    Colin Cross
     
  • Add a userspace visible knob to tell the VM to keep an extra amount
    of memory free, by increasing the gap between each zone's min and
    low watermarks.

    This is useful for realtime applications that call system
    calls and have a bound on the number of allocations that happen
    in any short time period. In this application, extra_free_kbytes
    would be left at an amount equal to or larger than the
    maximum number of allocations that happen in any burst.

    It may also be useful to reduce the memory use of virtual
    machines (temporarily?), in a way that does not cause memory
    fragmentation like ballooning does.
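
    A minimal sketch of using the tunable (assuming the sysctl is
    exposed as /proc/sys/vm/extra_free_kbytes, matching its name):

    #include <stdio.h>

    int main(void)
    {
        /* Keep an extra 64 MB free beyond the zones' min watermarks. */
        FILE *f = fopen("/proc/sys/vm/extra_free_kbytes", "w");

        if (!f) {
            perror("fopen");
            return 1;
        }
        fprintf(f, "%d\n", 64 * 1024);
        return fclose(f) == 0 ? 0 : 1;
    }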

    [ccross]
    Revived for use on old kernels where no other solution exists.
    The tunable will be removed on kernels that do better at avoiding
    direct reclaim.

    Change-Id: I765a42be8e964bfd3e2886d1ca85a29d60c3bb3e
    Signed-off-by: Rik van Riel
    Signed-off-by: Colin Cross

    Rik van Riel
     
    This patch adds a debugfs file called "shrinker". When read, it calls
    all the shrinkers in the system with nr_to_scan set to zero and prints
    the result. These results are the number of objects the shrinkers have
    available and can thus be used as an indication of the total memory
    that would be available to the system if a shrink occurred.
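
    Assuming debugfs is mounted in the usual place, the file can be read
    like any other (a sketch):

    #include <stdio.h>

    int main(void)
    {
        char line[256];
        /* Path assumes the conventional debugfs mount point. */
        FILE *f = fopen("/sys/kernel/debug/shrinker", "r");

        if (!f) {
            perror("fopen");
            return 1;
        }
        while (fgets(line, sizeof(line), f))
            fputs(line, stdout);        /* per-shrinker object counts */
        fclose(f);
        return 0;
    }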

    Change-Id: Ied0ee7caff3d2fc1cb4bb839aaafee81b5b0b143
    Signed-off-by: Rebecca Schultz Zavin

    Rebecca Schultz Zavin
     
  • By default the kernel tries to keep half as much memory free at each
    order as it does for one order below. This can be too aggressive when
    running without swap.

    Change-Id: I5efc1a0b50f41ff3ac71e92d2efd175dedd54ead
    Signed-off-by: Arve Hjønnevåg

    Arve Hjønnevåg
     
  • Pass correct argument to subsys_cgroup_allow_attach(), which
    expects 'struct cgroup_subsys_state *' argument but we pass
    'struct cgroup *' instead which doesn't seem right.

    This fixes following 'incompatible pointer type' compiler warning:
    ----------
    CC mm/memcontrol.o
    mm/memcontrol.c: In function ‘mem_cgroup_allow_attach’:
    mm/memcontrol.c:5052:2: warning: passing argument 1 of ‘subsys_cgroup_allow_attach’ from incompatible pointer type [enabled by default]
    In file included from include/linux/memcontrol.h:22:0,
    from mm/memcontrol.c:29:
    include/linux/cgroup.h:953:5: note: expected ‘struct cgroup_subsys_state *’ but argument is of type ‘struct cgroup *’
    ----------

    Signed-off-by: Amit Pundir

    Amit Pundir
     
  • Use the 'allow_attach' handler for the 'mem' cgroup to allow
    non-root processes to add arbitrary processes to a 'mem' cgroup
    if it has the CAP_SYS_NICE capability set.

    Bug: 18260435
    Change-Id: If7d37bf90c1544024c4db53351adba6a64966250
    Signed-off-by: Rom Lemarchand

    Rom Lemarchand
     
  • NOT FOR STAGING
    This patch re-adds the original shmem_set_file to mm/shmem.c
    and converts ashmem.c back to using it.

    CC: Brian Swetland
    CC: Colin Cross
    CC: Arve Hjønnevåg
    CC: Dima Zavin
    CC: Robert Love
    CC: Greg KH
    Signed-off-by: John Stultz

    John Stultz
     

10 Nov, 2015

2 commits

  • commit 47aee4d8e314384807e98b67ade07f6da476aa75 upstream.

    Use is_zero_pfn() on pteval only after pte_present() check on pteval
    (It might be a better idea to introduce is_zero_pte() which checks
    pte_present() first).

    Otherwise, when working on a swap or migration entry, if pte_pfn's
    result happens to equal zero_pfn, we lose the user's data in
    __collapse_huge_page_copy(). So if you're unlucky, the application
    segfaults and finally you could see the message below on exit:

    BUG: Bad rss-counter state mm:ffff88007f099300 idx:2 val:3

    Fixes: ca0984caa823 ("mm: incorporate zero pages into transparent huge pages")
    Signed-off-by: Minchan Kim
    Reviewed-by: Andrea Arcangeli
    Acked-by: Kirill A. Shutemov
    Cc: Mel Gorman
    Acked-by: Vlastimil Babka
    Cc: Hugh Dickins
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Minchan Kim
     
  • commit 296291cdd1629c308114504b850dc343eabc2782 upstream.

    Currently a simple program below issues a sendfile(2) system call which
    takes about 62 days to complete in my test KVM instance.

    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/sendfile.h>

    int fd;
    off_t off = 0;

    fd = open("file", O_RDWR | O_TRUNC | O_SYNC | O_CREAT, 0644);
    ftruncate(fd, 2);
    lseek(fd, 0, SEEK_END);
    sendfile(fd, fd, &off, 0xfffffff);

    Now, you should not ask the kernel to do something as stupid as
    copying 256MB in 2-byte chunks and calling fsync(2) after each chunk,
    but if you do, the sysadmin should have a way to stop you.

    We actually do have a check for fatal_signal_pending() in
    generic_perform_write() which triggers in this path. However, because
    we always succeed in writing something before the check is done, we
    return a value > 0 from generic_perform_write() and thus the
    information about the signal gets lost.

    Fix the problem by doing the signal check before writing anything. That
    way generic_perform_write() returns -EINTR, the error gets propagated up
    and the sendfile loop terminates early.
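
    A sketch of the check's shape at the top of the
    generic_perform_write() loop (illustrative, not the verbatim patch):

    /* Bail out before copying anything in this iteration, so a fatal
     * signal maps to -EINTR instead of a short positive return value
     * that hides the signal from the caller. */
    if (fatal_signal_pending(current)) {
        status = -EINTR;
        break;
    }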

    Signed-off-by: Jan Kara
    Reported-by: Dmitry Vyukov
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Jan Kara
     

27 Oct, 2015

1 commit

  • commit 424cdc14138088ada1b0e407a2195b2783c6e5ef upstream.

    page_counter_memparse() returns pages for the threshold, while
    mem_cgroup_usage() returns bytes for memory usage. Convert the
    threshold to bytes.
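
    The conversion amounts to a single shift (shape assumed from the
    description):

    /* the parsed threshold is in pages; usage is reported in bytes */
    threshold <<= PAGE_SHIFT;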

    Fixes: 3e32cb2e0a12b6915 ("memcg: rename cgroup_event to mem_cgroup_event").
    Signed-off-by: Shaohua Li
    Cc: Johannes Weiner
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Shaohua Li
     

23 Oct, 2015

3 commits

  • commit 03a2d2a3eafe4015412cf4e9675ca0e2d9204074 upstream.

    Commit description is copied from the original post of this bug:

    http://comments.gmane.org/gmane.linux.kernel.mm/135349

    Kernels after v3.9 use kmalloc_size(INDEX_NODE + 1) to get the next
    larger cache size than the size index INDEX_NODE mapping. In kernels
    3.9 and earlier we used malloc_sizes[INDEX_L3 + 1].cs_size.

    However, sometimes we can't get the right output we expected via
    kmalloc_size(INDEX_NODE + 1), causing a BUG().

    The mapping table in the latest kernel is like:
        index = { 0,  1,   2,  3,  4,  5,  6, n}
        size  = { 0, 96, 192,  8, 16, 32, 64, 2^n}

    The mapping table before 3.10 is like this:
        index = {  0,  1,  2,   3,   4,   5,   6, n}
        size  = { 32, 64, 96, 128, 192, 256, 512, 2^(n+3)}

    The problem on my mips64 machine is as follows:

    (1) When DEBUG_SLAB && DEBUG_PAGEALLOC && DEBUG_LOCK_ALLOC &&
    DEBUG_SPINLOCK are configured, sizeof(struct kmem_cache_node) will be
    150, and the macro INDEX_NODE turns out to be 2:
        #define INDEX_NODE kmalloc_index(sizeof(struct kmem_cache_node))

    (2) Then the result of kmalloc_size(INDEX_NODE + 1) is 8.

    (3) Then "if (size >= kmalloc_size(INDEX_NODE + 1))" will lead to
    "size = PAGE_SIZE".

    (4) Then the "if (size >= (PAGE_SIZE >> 3))" test will be satisfied
    and "flags |= CFLGS_OFF_SLAB" will be set.

    (5) The "if (flags & CFLGS_OFF_SLAB)" test will be satisfied and we
    will go to "cachep->slabp_cache = kmalloc_slab(slab_size, 0u)", whose
    result may be NULL during kernel bootup.

    (6) Finally, "BUG_ON(ZERO_OR_NULL_PTR(cachep->slabp_cache));" triggers
    the BUG (the report's log output is omitted here; maybe only mips64
    has this problem).

    This patch fixes the problem of kmalloc_size(INDEX_NODE + 1) and removes
    the BUG by adding a 'size >= 256' check to guarantee that all necessary
    small sized slabs are initialized regardless of the sequence of slab
    sizes in the mapping table.

    Fixes: e33660165c90 ("slab: Use common kmalloc_index/kmalloc_size...")
    Signed-off-by: Joonsoo Kim
    Reported-by: Liuhailong
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Joonsoo Kim
     
  • commit 2f84a8990ebbe235c59716896e017c6b2ca1200f upstream.

    SunDong reported the following on

    https://bugzilla.kernel.org/show_bug.cgi?id=103841

    I think I find a linux bug, I have the test cases is constructed. I
    can stable recurring problems in fedora22(4.0.4) kernel version,
    arch for x86_64. I construct transparent huge page, when the parent
    and child process with MAP_SHARE, MAP_PRIVATE way to access the same
    huge page area, it has the opportunity to lead to huge page copy on
    write failure, and then it will munmap the child corresponding mmap
    area, but then the child mmap area with VM_MAYSHARE attributes, child
    process munmap this area can trigger VM_BUG_ON in set_vma_resv_flags
    functions (vma->vm_flags & VM_MAYSHARE).

    There were a number of problems with the report (e.g. it's hugetlbfs that
    triggers this, not transparent huge pages) but it was fundamentally
    correct in that a VM_BUG_ON in set_vma_resv_flags() can be triggered that
    looks like this

    vma ffff8804651fd0d0 start 00007fc474e00000 end 00007fc475e00000
    next ffff8804651fd018 prev ffff8804651fd188 mm ffff88046b1b1800
    prot 8000000000000027 anon_vma (null) vm_ops ffffffff8182a7a0
    pgoff 0 file ffff88106bdb9800 private_data (null)
    flags: 0x84400fb(read|write|shared|mayread|maywrite|mayexec|mayshare|dontexpand|hugetlb)
    ------------
    kernel BUG at mm/hugetlb.c:462!
    SMP
    Modules linked in: xt_pkttype xt_LOG xt_limit [..]
    CPU: 38 PID: 26839 Comm: map Not tainted 4.0.4-default #1
    Hardware name: Dell Inc. PowerEdge R810/0TT6JF, BIOS 2.7.4 04/26/2012
    set_vma_resv_flags+0x2d/0x30

    The VM_BUG_ON is correct because private and shared mappings have
    different reservation accounting but the warning clearly shows that the
    VMA is shared.

    When a private COW fails to allocate a new page then only the process
    that created the VMA gets the page -- all the children unmap the page.
    If the children access that data in the future then they get killed.

    The problem is that the same file is mapped shared and private. During
    the COW, the allocation fails, the VMAs are traversed to unmap the other
    private pages but a shared VMA is found and the bug is triggered. This
    patch identifies such VMAs and skips them.

    Signed-off-by: Mel Gorman
    Reported-by: SunDong
    Reviewed-by: Michal Hocko
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Naoya Horiguchi
    Cc: David Rientjes
    Reviewed-by: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Mel Gorman
     
  • commit 3aaa76e125c1dd58c9b599baa8c6021896874c12 upstream.

    Since commit bcc54222309c ("mm: hugetlb: introduce page_huge_active")
    each hugetlb page maintains its active flag to avoid a race condition
    between multiple calls of isolate_huge_page(), but current kernel
    doesn't set the flag on a hugepage allocated by migration because the
    proper putback routine isn't called. This means that users could
    still encounter the race referred to by bcc54222309c in this special
    case, so this patch fixes it.

    Fixes: bcc54222309c ("mm: hugetlb: introduce page_huge_active")
    Signed-off-by: Naoya Horiguchi
    Cc: Michal Hocko
    Cc: Andi Kleen
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Naoya Horiguchi
     

30 Sep, 2015

2 commits

  • commit c54839a722a02818677bcabe57e957f0ce4f841d upstream.

    reclaim_clean_pages_from_list() assumes that shrink_page_list() returns
    the number of pages removed from the candidate list. But
    shrink_page_list() puts back mlocked pages without passing them to the
    caller and without counting them as nr_reclaimed. This increases
    nr_isolated.

    To fix this, this patch changes shrink_page_list() to pass unevictable
    pages back to the caller, and the caller will take care of those pages.

    Minchan said:

    It fixes two issues.

    1. With unevictable page, cma_alloc will be successful.

    Exactly speaking, cma_alloc of current kernel will fail due to
    unevictable pages.

    2. fix leaking of NR_ISOLATED counter of vmstat

    With it, too_many_isolated works. Otherwise, it could hang until
    the process gets SIGKILL.

    Signed-off-by: Jaewon Kim
    Acked-by: Minchan Kim
    Cc: Mel Gorman
    Acked-by: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Jaewon Kim
     
  • commit 2f064f3485cd29633ad1b3cfb00cc519509a3d72 upstream.

    Commit c48a11c7ad26 ("netvm: propagate page->pfmemalloc to skb") added
    checks for page->pfmemalloc to __skb_fill_page_desc():

    if (page->pfmemalloc && !page->mapping)
        skb->pfmemalloc = true;

    It assumes page->mapping == NULL implies that page->pfmemalloc can be
    trusted. However, __delete_from_page_cache() can set page->mapping
    to NULL and leave the page->index value alone. Due to being in a
    union, a non-zero page->index will be interpreted as a true
    page->pfmemalloc.

    So the assumption is invalid if the networking code can see such a page.
    And it seems it can. We have encountered this with an NFS-over-loopback
    setup where such a page is attached to a new skbuf. There is no copying
    going on in this case, so the page confuses __skb_fill_page_desc, which
    interprets the index as the pfmemalloc flag, and the network stack drops
    packets that have been allocated using the reserves unless they are to
    be queued on sockets handling the swapping (which is the case here).
    That leads to hangs when the NFS client waits for a response from the
    server that has been dropped and thus never arrives.

    The struct page is already heavily packed, so rather than finding
    another hole to put it in, let's do a trick instead. We can reuse the
    index again but define it to an impossible value (-1UL). This is the
    page index, so it should never take a value that large. Replace all
    direct users of page->pfmemalloc by page_is_pfmemalloc, which will
    hide this nastiness from unspoiled eyes.
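
    A sketch of the helper's shape implied by the description (not
    necessarily the verbatim upstream code):

    static inline bool page_is_pfmemalloc(struct page *page)
    {
        /* -1UL can never be a real page index, so its presence in the
         * union unambiguously encodes pfmemalloc. */
        return page->index == -1UL;
    }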

    The information will get lost if somebody wants to use page->index
    obviously but that was the case before and the original code expected
    that the information should be persisted somewhere else if that is
    really needed (e.g. what SLAB and SLUB do).

    [akpm@linux-foundation.org: fix blooper in slub]
    Fixes: c48a11c7ad26 ("netvm: propagate page->pfmemalloc to skb")
    Signed-off-by: Michal Hocko
    Debugged-by: Vlastimil Babka
    Debugged-by: Jiri Bohac
    Cc: Eric Dumazet
    Cc: David Miller
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Michal Hocko
     

14 Sep, 2015

2 commits

  • commit 036138080a4376e5f3e5d0cca8ac99084c5cf06e upstream.

    Hugetlbfs pages will get a refcount in get_any_page() or
    madvise_hwpoison() when soft offlining through madvise. The refcount
    held by the soft offline path should be released if we fail to
    isolate hugetlbfs pages.

    Fix it by reducing the refcount for both isolation success and failure.

    Signed-off-by: Wanpeng Li
    Acked-by: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Wanpeng Li
     
  • commit 4f32be677b124a49459e2603321c7a5605ceb9f8 upstream.

    After trying to drain pages from the pagevec/pageset, we try to get the
    reference count of the page again; however, the reference count is not
    reduced if the page is still not on the LRU list.

    Fix it by adding a put_page() to drop the page reference that was
    taken in __get_any_page().

    Signed-off-by: Wanpeng Li
    Acked-by: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Wanpeng Li
     

17 Aug, 2015

1 commit

  • commit ecf5fc6e9654cd7a268c782a523f072b2f1959f9 upstream.

    Nikolay has reported a hang when a memcg reclaim got stuck with the
    following backtrace:

    PID: 18308 TASK: ffff883d7c9b0a30 CPU: 1 COMMAND: "rsync"
    #0 __schedule at ffffffff815ab152
    #1 schedule at ffffffff815ab76e
    #2 schedule_timeout at ffffffff815ae5e5
    #3 io_schedule_timeout at ffffffff815aad6a
    #4 bit_wait_io at ffffffff815abfc6
    #5 __wait_on_bit at ffffffff815abda5
    #6 wait_on_page_bit at ffffffff8111fd4f
    #7 shrink_page_list at ffffffff81135445
    #8 shrink_inactive_list at ffffffff81135845
    #9 shrink_lruvec at ffffffff81135ead
    #10 shrink_zone at ffffffff811360c3
    #11 shrink_zones at ffffffff81136eff
    #12 do_try_to_free_pages at ffffffff8113712f
    #13 try_to_free_mem_cgroup_pages at ffffffff811372be
    #14 try_charge at ffffffff81189423
    #15 mem_cgroup_try_charge at ffffffff8118c6f5
    #16 __add_to_page_cache_locked at ffffffff8112137d
    #17 add_to_page_cache_lru at ffffffff81121618
    #18 pagecache_get_page at ffffffff8112170b
    #19 grow_dev_page at ffffffff811c8297
    #20 __getblk_slow at ffffffff811c91d6
    #21 __getblk_gfp at ffffffff811c92c1
    #22 ext4_ext_grow_indepth at ffffffff8124565c
    #23 ext4_ext_create_new_leaf at ffffffff81246ca8
    #24 ext4_ext_insert_extent at ffffffff81246f09
    #25 ext4_ext_map_blocks at ffffffff8124a848
    #26 ext4_map_blocks at ffffffff8121a5b7
    #27 mpage_map_one_extent at ffffffff8121b1fa
    #28 mpage_map_and_submit_extent at ffffffff8121f07b
    #29 ext4_writepages at ffffffff8121f6d5
    #30 do_writepages at ffffffff8112c490
    #31 __filemap_fdatawrite_range at ffffffff81120199
    #32 filemap_flush at ffffffff8112041c
    #33 ext4_alloc_da_blocks at ffffffff81219da1
    #34 ext4_rename at ffffffff81229b91
    #35 ext4_rename2 at ffffffff81229e32
    #36 vfs_rename at ffffffff811a08a5
    #37 SYSC_renameat2 at ffffffff811a3ffc
    #38 sys_renameat2 at ffffffff811a408e
    #39 sys_rename at ffffffff8119e51e
    #40 system_call_fastpath at ffffffff815afa89

    Dave Chinner has properly pointed out that this is a deadlock in the
    reclaim code because ext4 doesn't submit pages which are marked by
    PG_writeback right away.

    The heuristic was introduced by commit e62e384e9da8 ("memcg: prevent OOM
    with too many dirty pages") and it was applied only when may_enter_fs
    was specified. The code has been changed by c3b94f44fcb0 ("memcg:
    further prevent OOM with too many dirty pages") which has removed the
    __GFP_FS restriction with a reasoning that we do not get into the fs
    code. But this is not sufficient apparently because the fs doesn't
    necessarily submit pages marked PG_writeback for IO right away.

    ext4_bio_write_page calls io_submit_add_bh but that doesn't necessarily
    submit the bio. Instead it tries to map more pages into the bio and
    mpage_map_one_extent might trigger memcg charge which might end up
    waiting on a page which is marked PG_writeback but hasn't been submitted
    yet so we would end up waiting for something that never finishes.

    Fix this issue by replacing __GFP_IO by may_enter_fs check (for case 2)
    before we go to wait on the writeback. The page fault path, which is
    the only path that triggers memcg oom killer since 3.12, shouldn't
    require GFP_NOFS and so we shouldn't reintroduce the premature OOM
    killer issue which was originally addressed by the heuristic.

    As per Dave Chinner, xfs has been doing a similar thing since 2.6.15
    already, so ext4 is not the only affected filesystem. Moreover he notes:

    : For example: IO completion might require unwritten extent conversion
    : which executes filesystem transactions and GFP_NOFS allocations. The
    : writeback flag on the pages can not be cleared until unwritten
    : extent conversion completes. Hence memory reclaim cannot wait on
    : page writeback to complete in GFP_NOFS context because it is not
    : safe to do so, memcg reclaim or otherwise.

    Cc: stable@vger.kernel.org # 3.9+
    [tytso@mit.edu: corrected the control flow]
    Fixes: c3b94f44fcb0 ("memcg: further prevent OOM with too many dirty pages")
    Reported-by: Nikolay Borisov
    Signed-off-by: Michal Hocko
    Signed-off-by: Hugh Dickins
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Michal Hocko
     

04 Aug, 2015

2 commits

  • commit 6b7339f4c31ad69c8e9c0b2859276e22cf72176d upstream.

    Reading page fault handler code I've noticed that under the right
    circumstances the kernel would map anonymous pages into file mappings:
    if the VMA doesn't have vm_ops->fault() and the VMA wasn't fully
    populated on ->mmap(), the kernel would handle a page fault to a
    not-populated pte with do_anonymous_page().

    Let's change page fault handler to use do_anonymous_page() only on
    anonymous VMA (->vm_ops == NULL) and make sure that the VMA is not
    shared.

    For file mappings without vm_ops->fault(), or a shared VMA without
    vm_ops, a page fault on a pte_none() entry would lead to SIGBUS.

    Signed-off-by: Kirill A. Shutemov
    Acked-by: Oleg Nesterov
    Cc: Andrew Morton
    Cc: Willy Tarreau
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Kirill A. Shutemov
     
  • commit 641844f5616d7c6597309f560838f996466d7aac upstream.

    Currently the initial value of order in dissolve_free_huge_page is 64 or
    32, which leads to the following warning in static checker:

    mm/hugetlb.c:1203 dissolve_free_huge_pages()
    warn: potential right shift more than type allows '9,18,64'

    This is a potential risk of an infinite loop, because 1 << order (== 0)
    is used in a for-loop like this:

    for (pfn = start_pfn; pfn < end_pfn; pfn += 1 << order)
        ...

    So this patch fixes it by using global minimum_order calculated at boot time.

    text    data    bss     dec     hex    filename
    28313   469     84236   113018  1b97a  mm/hugetlb.o
    28256   473     84236   112965  1b945  mm/hugetlb.o (patched)

    Fixes: c8721bbbdd36 ("mm: memory-hotplug: enable memory hotplug to handle hugepage")
    Reported-by: Dan Carpenter
    Signed-off-by: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Naoya Horiguchi
     

22 Jul, 2015

3 commits

  • commit 0867a57c4f80a566dda1bac975b42fcd857cb489 upstream.

    Since commit 077fcf116c8c ("mm/thp: allocate transparent hugepages on
    local node"), we handle THP allocations on page fault in a special way -
    for non-interleave memory policies, the allocation is only attempted on
    the node local to the current CPU, if the policy's nodemask allows the
    node.

    This is motivated by the assumption that THP benefits cannot offset the
    cost of remote accesses, so it's better to fallback to base pages on the
    local node (which might still be available, while huge pages are not due
    to fragmentation) than to allocate huge pages on a remote node.

    The nodemask check prevents us from violating e.g. MPOL_BIND policies
    where the local node is not among the allowed nodes. However, the
    current implementation can still give surprising results for the
    MPOL_PREFERRED policy when the preferred node is different than the
    current CPU's local node.

    In such case we should honor the preferred node and not use the local
    node, which is what this patch does. If hugepage allocation on the
    preferred node fails, we fall back to base pages and don't try other
    nodes, with the same motivation as is done for the local node hugepage
    allocations. The patch also moves the MPOL_INTERLEAVE check around to
    simplify the hugepage specific test.

    The difference can be demonstrated using in-tree transhuge-stress test
    on the following 2-node machine where half memory on one node was
    occupied to show the difference.

    > numactl --hardware
    available: 2 nodes (0-1)
    node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 24 25 26 27 28 29 30 31 32 33 34 35
    node 0 size: 7878 MB
    node 0 free: 3623 MB
    node 1 cpus: 12 13 14 15 16 17 18 19 20 21 22 23 36 37 38 39 40 41 42 43 44 45 46 47
    node 1 size: 8045 MB
    node 1 free: 7818 MB
    node distances:
    node   0   1
      0:  10  21
      1:  21  10

    Before the patch:
    > numactl -p0 -C0 ./transhuge-stress
    transhuge-stress: 2.197 s/loop, 0.276 ms/page, 7249.168 MiB/s 7962 succeed, 0 failed, 1786 different pages

    > numactl -p0 -C12 ./transhuge-stress
    transhuge-stress: 2.962 s/loop, 0.372 ms/page, 5376.172 MiB/s 7962 succeed, 0 failed, 3873 different pages

    Number of successful THP allocations corresponds to free memory on node 0 in
    the first case and node 1 in the second case, i.e. -p parameter is ignored and
    cpu binding "wins".

    After the patch:
    > numactl -p0 -C0 ./transhuge-stress
    transhuge-stress: 2.183 s/loop, 0.274 ms/page, 7295.516 MiB/s 7962 succeed, 0 failed, 1760 different pages

    > numactl -p0 -C12 ./transhuge-stress
    transhuge-stress: 2.878 s/loop, 0.361 ms/page, 5533.638 MiB/s 7962 succeed, 0 failed, 1750 different pages

    > numactl -p1 -C0 ./transhuge-stress
    transhuge-stress: 4.628 s/loop, 0.581 ms/page, 3440.893 MiB/s 7962 succeed, 0 failed, 3918 different pages

    The -p parameter is respected regardless of cpu binding.

    > numactl -C0 ./transhuge-stress
    transhuge-stress: 2.202 s/loop, 0.277 ms/page, 7230.003 MiB/s 7962 succeed, 0 failed, 1750 different pages

    > numactl -C12 ./transhuge-stress
    transhuge-stress: 3.020 s/loop, 0.379 ms/page, 5273.324 MiB/s 7962 succeed, 0 failed, 3916 different pages

    Without -p parameter, hugepage restriction to CPU-local node works as before.

    Fixes: 077fcf116c8c ("mm/thp: allocate transparent hugepages on local node")
    Signed-off-by: Vlastimil Babka
    Cc: Aneesh Kumar K.V
    Acked-by: David Rientjes
    Cc: Kirill A. Shutemov
    Cc: Andrea Arcangeli
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Vlastimil Babka
     
  • commit 8a8c35fadfaf55629a37ef1a8ead1b8fb32581d2 upstream.

    Beginning at commit d52d3997f843 ("ipv6: Create percpu rt6_info"), the
    following INFO splat is logged:

    ===============================
    [ INFO: suspicious RCU usage. ]
    4.1.0-rc7-next-20150612 #1 Not tainted
    -------------------------------
    kernel/sched/core.c:7318 Illegal context switch in RCU-bh read-side critical section!
    other info that might help us debug this:
    rcu_scheduler_active = 1, debug_locks = 0
    3 locks held by systemd/1:
    #0: (rtnl_mutex){+.+.+.}, at: [] rtnetlink_rcv+0x1f/0x40
    #1: (rcu_read_lock_bh){......}, at: [] ipv6_add_addr+0x62/0x540
    #2: (addrconf_hash_lock){+...+.}, at: [] ipv6_add_addr+0x184/0x540
    stack backtrace:
    CPU: 0 PID: 1 Comm: systemd Not tainted 4.1.0-rc7-next-20150612 #1
    Hardware name: TOSHIBA TECRA A50-A/TECRA A50-A, BIOS Version 4.20 04/17/2014
    Call Trace:
    dump_stack+0x4c/0x6e
    lockdep_rcu_suspicious+0xe7/0x120
    ___might_sleep+0x1d5/0x1f0
    __might_sleep+0x4d/0x90
    kmem_cache_alloc+0x47/0x250
    create_object+0x39/0x2e0
    kmemleak_alloc_percpu+0x61/0xe0
    pcpu_alloc+0x370/0x630

    Additional backtrace lines are truncated. In addition, the above splat
    is followed by several "BUG: sleeping function called from invalid
    context at mm/slub.c:1268" outputs. As suggested by Martin KaFai Lau,
    these are the clue to the fix. Routine kmemleak_alloc_percpu() always
    uses GFP_KERNEL for its allocations, whereas it should follow the gfp
    from its callers.

    Reviewed-by: Catalin Marinas
    Reviewed-by: Kamalesh Babulal
    Acked-by: Martin KaFai Lau
    Signed-off-by: Larry Finger
    Cc: Martin KaFai Lau
    Cc: Catalin Marinas
    Cc: Tejun Heo
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Larry Finger
     
  • commit c5f3b1a51a591c18c8b33983908e7fdda6ae417e upstream.

    The kmemleak scanning thread can run for minutes. Callbacks like
    kmemleak_free() are allowed during this time, the race being taken care
    of by the object->lock spinlock. Such lock also prevents a memory block
    from being freed or unmapped while it is being scanned by blocking the
    kmemleak_free() -> ... -> __delete_object() function until the lock is
    released in scan_object().

    When a kmemleak error occurs (e.g. it fails to allocate its metadata),
    kmemleak_enabled is set and __delete_object() is no longer called on
    freed objects. If kmemleak_scan is running at the same time,
    kmemleak_free() no longer waits for the object scanning to complete,
    allowing the corresponding memory block to be freed or unmapped (in the
    case of vfree()). This leads to kmemleak_scan potentially triggering a
    page fault.

    This patch separates the kmemleak_free() enabling/disabling from the
    overall kmemleak_enabled knob so that we can defer the disabling of the
    object freeing tracking until the scanning thread has completed. The
    kmemleak_free_part() callback is deliberately ignored by this patch
    since it is only called during boot, before the scanning thread starts.

    Signed-off-by: Catalin Marinas
    Reported-by: Vignesh Radhakrishnan
    Tested-by: Vignesh Radhakrishnan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Catalin Marinas
     

18 Jun, 2015

1 commit

  • It appears that, at some point last year, XFS made directory handling
    changes which bring it into lockdep conflict with shmem_zero_setup():
    it is surprising that mmap() can clone an inode while holding mmap_sem,
    but that has been so for many years.

    Since those few lockdep traces that I've seen all implicated selinux,
    I'm hoping that we can use the __shmem_file_setup(,,,S_PRIVATE) which
    v3.13's commit c7277090927a ("security: shmem: implement kernel private
    shmem inodes") introduced to avoid LSM checks on kernel-internal inodes:
    the mmap("/dev/zero") cloned inode is indeed a kernel-internal detail.

    This also covers the !CONFIG_SHMEM use of ramfs to support /dev/zero
    (and MAP_SHARED|MAP_ANONYMOUS). I thought there were also drivers
    which cloned an inode in mmap(), but if so, I cannot locate them now.

    Reported-and-tested-by: Prarit Bhargava
    Reported-and-tested-by: Daniel Wagner
    Reported-and-tested-by: Morten Stevens
    Signed-off-by: Hugh Dickins
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

11 Jun, 2015

4 commits

  • If zs_create_pool()->create_handle_cache()->kmem_cache_create() or
    pool->name allocation fails, zs_create_pool()->destroy_handle_cache()
    will dereference the NULL pool->handle_cachep.

    Modify destroy_handle_cache() to avoid this.

    Signed-off-by: Sergey Senozhatsky
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sergey Senozhatsky
     
  • On -rt, the VM_BUG_ON(!irqs_disabled()) triggers inside the memcg
    swapout path because the spin_lock_irq(&mapping->tree_lock) in the
    caller doesn't actually disable the hardware interrupts - which is fine,
    because on -rt the tophalves run in process context and so we are still
    safe from preemption while updating the statistics.

    Remove the VM_BUG_ON() but keep the comment of what we rely on.

    Signed-off-by: Johannes Weiner
    Reported-by: Clark Williams
    Cc: Fernando Lopez-Lezcano
    Cc: Steven Rostedt
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • When trimming memcg consumption excess (see memory.high), we call
    try_to_free_mem_cgroup_pages without checking if we are allowed to sleep
    in the current context, which can result in a deadlock. Fix this.

    Fixes: 241994ed8649 ("mm: memcontrol: default hierarchy interface for memory")
    Signed-off-by: Vladimir Davydov
    Cc: Johannes Weiner
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • Izumi found the following oops when hot re-adding a node:

    BUG: unable to handle kernel paging request at ffffc90008963690
    IP: __wake_up_bit+0x20/0x70
    Oops: 0000 [#1] SMP
    CPU: 68 PID: 1237 Comm: rs:main Q:Reg Not tainted 4.1.0-rc5 #80
    Hardware name: FUJITSU PRIMEQUEST2800E/SB, BIOS PRIMEQUEST 2000 Series BIOS Version 1.87 04/28/2015
    task: ffff880838df8000 ti: ffff880017b94000 task.ti: ffff880017b94000
    RIP: 0010:[] [] __wake_up_bit+0x20/0x70
    RSP: 0018:ffff880017b97be8 EFLAGS: 00010246
    RAX: ffffc90008963690 RBX: 00000000003c0000 RCX: 000000000000a4c9
    RDX: 0000000000000000 RSI: ffffea101bffd500 RDI: ffffc90008963648
    RBP: ffff880017b97c08 R08: 0000000002000020 R09: 0000000000000000
    R10: 0000000000000000 R11: 0000000000000000 R12: ffff8a0797c73800
    R13: ffffea101bffd500 R14: 0000000000000001 R15: 00000000003c0000
    FS: 00007fcc7ffff700(0000) GS:ffff880874800000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: ffffc90008963690 CR3: 0000000836761000 CR4: 00000000001407e0
    Call Trace:
    unlock_page+0x6d/0x70
    generic_write_end+0x53/0xb0
    xfs_vm_write_end+0x29/0x80 [xfs]
    generic_perform_write+0x10a/0x1e0
    xfs_file_buffered_aio_write+0x14d/0x3e0 [xfs]
    xfs_file_write_iter+0x79/0x120 [xfs]
    __vfs_write+0xd4/0x110
    vfs_write+0xac/0x1c0
    SyS_write+0x58/0xd0
    system_call_fastpath+0x12/0x76
    Code: 5d c3 66 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 89 e5 48 83 ec 20 65 48 8b 04 25 28 00 00 00 48 89 45 f8 31 c0 48 8d 47 48 39 47 48 48 c7 45 e8 00 00 00 00 48 c7 45 f0 00 00 00 00 48
    RIP [] __wake_up_bit+0x20/0x70
    RSP
    CR2: ffffc90008963690

    Reproduce method (re-add a node):
    Hot-add nodeA --> remove nodeA --> hot-add nodeA (panic)

    This seems to be a use-after-free problem, and the root cause is
    that zone->wait_table was not set to *NULL* after freeing it in
    try_offline_node().

    When hot re-adding a node, we reuse its pgdat, and so the zone
    structs too; when adding pages to the target zone, it will init the
    zone first (including the wait_table) if the zone is not initialized.
    The judgement of whether a zone is initialized is based on
    zone->wait_table:

    static inline bool zone_is_initialized(struct zone *zone)
    {
        return !!zone->wait_table;
    }

    So if we do not set zone->wait_table to *NULL* after freeing it, the
    memory hotplug routine will skip the init of the new zone when the
    node is hot re-added, and the wait_table will still point to the
    freed memory. We will then access the invalid address when trying to
    wake up waiters after the I/O operation on the page is done, as in
    the oops above. A sketch of the fix's shape follows.
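
    A sketch of the fix's shape in try_offline_node() (illustrative; the
    actual freeing call is elided):

    /* ... the zone's wait_table has just been freed ... */
    zone->wait_table = NULL;    /* zone_is_initialized() is false again,
                                   so hot re-add reinitializes the zone */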

    Signed-off-by: Gu Zheng
    Reported-by: Taku Izumi
    Reviewed-by: Yasuaki Ishimatsu
    Cc: KAMEZAWA Hiroyuki
    Cc: Tang Chen
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gu Zheng
     

29 May, 2015

1 commit

  • bdi_unregister() now contains very little functionality.

    It contains a "WARN_ON" if bdi->dev is NULL. This warning is of no
    real consequence as bdi->dev isn't needed by anything else in the function,
    and it triggers if
    blk_cleanup_queue() -> bdi_destroy()
    is called before bdi_unregister, which happens since
    Commit: 6cd18e711dd8 ("block: destroy bdi before blockdev is unregistered.")

    So this isn't wanted.

    It also calls bdi_set_min_ratio(). This needs to be called after
    writes through the bdi have all been flushed, and before the bdi is destroyed.
    Calling it early is better than calling it late as it frees up a global
    resource.

    Calling it immediately after bdi_wb_shutdown() in bdi_destroy()
    perfectly fits these requirements.

    So bdi_unregister() can be discarded, with the important content
    moved to bdi_destroy(), as can the writeback_bdi_unregister event,
    which is already unused.

    Reported-by: Mike Snitzer
    Cc: stable@vger.kernel.org (v4.0)
    Fixes: c4db59d31e39 ("fs: don't reassign dirty inodes to default_backing_dev_info")
    Fixes: 6cd18e711dd8 ("block: destroy bdi before blockdev is unregistered.")
    Acked-by: Peter Zijlstra (Intel)
    Acked-by: Dan Williams
    Tested-by: Nicholas Moulin
    Signed-off-by: NeilBrown
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    NeilBrown
     

15 May, 2015

3 commits

  • NUMA balancing is meant to be disabled by default on UMA machines but
    the check is using nr_node_ids (highest node) instead of
    num_online_nodes (online nodes).

    The consequences are that a UMA machine with a node ID of 1 or higher
    will enable NUMA balancing. This will incur useless overhead due to
    minor faults, with the impact depending on the workload. This is the
    impact on the stats when running a kernel build on a single-node
    machine whose node ID happened to be 1:

                             vanilla  patched
    NUMA base PTE updates    5113158        0
    NUMA huge PMD updates        643        0
    NUMA page range updates  5442374        0
    NUMA hint faults         2109622        0
    NUMA hint local faults   2109622        0
    NUMA hint local percent      100      100
    NUMA pages migrated            0        0

    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: [3.8+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • I had an issue:

    Unable to handle kernel NULL pointer dereference at virtual address 0000082a
    pgd = cc970000
    [0000082a] *pgd=00000000
    Internal error: Oops: 5 [#1] PREEMPT SMP ARM
    PC is at get_pageblock_flags_group+0x5c/0xb0
    LR is at unset_migratetype_isolate+0x148/0x1b0
    pc : [] lr : [] psr: 80000093
    sp : c7029d00 ip : 00000105 fp : c7029d1c
    r10: 00000001 r9 : 0000000a r8 : 00000004
    r7 : 60000013 r6 : 000000a4 r5 : c0a357e4 r4 : 00000000
    r3 : 00000826 r2 : 00000002 r1 : 00000000 r0 : 0000003f
    Flags: Nzcv IRQs off FIQs on Mode SVC_32 ISA ARM Segment user
    Control: 10c5387d Table: 2cb7006a DAC: 00000015
    Backtrace:
    get_pageblock_flags_group+0x0/0xb0
    unset_migratetype_isolate+0x0/0x1b0
    undo_isolate_page_range+0x0/0xdc
    __alloc_contig_range+0x0/0x34c
    alloc_contig_range+0x0/0x18

    This issue arises because, when calling unset_migratetype_isolate()
    to unset a part of CMA memory, it tries to access the buddy page to
    get its status:

    if (order >= pageblock_order) {
        page_idx = page_to_pfn(page) & ((1 << MAX_ORDER) - 1);
        buddy_idx = __find_buddy_index(page_idx, order);
        buddy = page + (buddy_idx - page_idx);

        if (!is_migrate_isolate_page(buddy)) {

    But the beginning address of this part of CMA memory is very close
    to a part of memory that is reserved at boot time (not in the buddy
    system), so add a check before accessing it; a sketch of such a
    guard follows.
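
    A sketch of such a guard (illustrative; pfn_valid_within() is the
    usual helper for holes inside a zone):

    /* Only inspect the buddy if its pfn is backed by a valid
     * struct page. */
    if (pfn_valid_within(page_to_pfn(buddy)) &&
        !is_migrate_isolate_page(buddy)) {
        /* existing unset/merge logic continues here */
    }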

    [akpm@linux-foundation.org: use conventional code layout]
    Signed-off-by: Hui Zhu
    Suggested-by: Laura Abbott
    Suggested-by: Joonsoo Kim
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hui Zhu
     
    Not all kmem allocations should be accounted to memcg. The following
    patch gives an example where accounting a certain type of allocation
    to memcg can effectively result in a memory leak. This patch adds
    the __GFP_NOACCOUNT flag which, if passed to kmalloc and friends,
    will force the allocation to go through the root cgroup. It will be
    used by the next patch.
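
    Usage is a plain gfp modifier (sketch):

    /* never charged to the current task's memcg */
    ptr = kmalloc(size, GFP_KERNEL | __GFP_NOACCOUNT);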

    Note, since with kmemleak enabled each kmalloc implies yet another
    allocation from the kmemleak_object cache, we add __GFP_NOACCOUNT to
    gfp_kmemleak_mask.

    Alternatively, we could introduce a per kmem cache flag disabling
    accounting for all allocations of a particular kind, but (a) we would not
    be able to bypass accounting for kmalloc then and (b) a kmem cache with
    this flag set could not be merged with a kmem cache without this flag,
    which would increase the number of global caches and therefore
    fragmentation even if the memory cgroup controller is not used.

    Despite its generic name, currently __GFP_NOACCOUNT disables accounting
    only for kmem allocations while user page allocations are always charged.
    To catch abusing of this flag, a warning is issued on an attempt of
    passing it to mem_cgroup_try_charge.

    Signed-off-by: Vladimir Davydov
    Cc: Tejun Heo
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Greg Thelen
    Cc: Greg Kroah-Hartman
    Cc: [4.0.x]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     

09 May, 2015

1 commit

  • Pull block fixes from Jens Axboe:
    "A collection of fixes since the merge window;

    - fix for a double elevator module release, from Chao Yu. Ancient bug.

    - the splice() MORE flag fix from Christophe Leroy.

    - a fix for NVMe, fixing a patch that went in in the merge window.
    From Keith.

    - two fixes for blk-mq CPU hotplug handling, from Ming Lei.

    - bdi vs blockdev lifetime fix from Neil Brown, fixing an oops in md.

    - two blk-mq fixes from Shaohua, fixing a race on queue stop and a
    bad merge issue with FUA writes.

    - division-by-zero fix for writeback from Tejun.

    - a block bounce page accounting fix, making sure we inc/dec after
    bouncing so that pre/post IO pages match up. From Wang YanQing"

    * 'for-linus' of git://git.kernel.dk/linux-block:
    splice: sendfile() at once fails for big files
    blk-mq: don't lose requests if a stopped queue restarts
    blk-mq: fix FUA request hang
    block: destroy bdi before blockdev is unregistered.
    block:bounce: fix call inc_|dec_zone_page_state on different pages confuse value of NR_BOUNCE
    elevator: fix double release of elevator module
    writeback: use |1 instead of +1 to protect against div by zero
    blk-mq: fix CPU hotplug handling
    blk-mq: fix race between timeout and CPU hotplug
    NVMe: Fix VPD B0 max sectors translation

    Linus Torvalds
     

06 May, 2015

4 commits

  • Hwpoison injector checks PageLRU of the raw target page to find out
    whether the page is an appropriate target, but current code now filters
    out thp tail pages, which prevents us from testing for such cases via this
    interface. So let's check hpage instead of p.

    Signed-off-by: Naoya Horiguchi
    Acked-by: Dean Nelson
    Cc: Andi Kleen
    Cc: Andrea Arcangeli
    Cc: Hidetoshi Seto
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
    Hwpoison injection via debugfs:hwpoison/corrupt-pfn takes a refcount
    of the target page. But current code doesn't release it if the
    target page is not supposed to be injected, which results in a
    memory leak. This patch simply adds the refcount-releasing code.

    Signed-off-by: Naoya Horiguchi
    Acked-by: Dean Nelson
    Cc: Andi Kleen
    Cc: Andrea Arcangeli
    Cc: Hidetoshi Seto
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • If multiple soft offline events hit one free page/hugepage concurrently,
    soft_offline_page() can handle the free page/hugepage multiple times,
    which makes the num_poisoned_pages counter increase more than once.
    This patch fixes this wrong counting by checking TestSetPageHWPoison
    for normal pages and by checking the return value of
    dequeue_hwpoisoned_huge_page() for hugepages.

    Signed-off-by: Naoya Horiguchi
    Acked-by: Dean Nelson
    Cc: Andi Kleen
    Cc: [3.14+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • Currently memory_failure() calls shake_page() to sweep pages out from
    pcplists only when the victim page is 4kB LRU page or thp head page.
    But we should do this for a thp tail page too.

    Consider that a memory error hits a thp tail page whose head page is on
    a pcplist when memory_failure() runs. Then, the current kernel skips
    the shake_page() part, so hwpoison_user_mappings() returns without calling
    split_huge_page() nor try_to_unmap() because PageLRU of the thp head is
    still cleared due to the skip of shake_page().

    As a result, me_huge_page() runs for the thp, which is broken behavior.

    One effect is a leak of the thp. Another is a failure to isolate the
    memory error, so later access to the error address causes another
    MCE, which kills the processes which used the thp.

    This patch fixes this problem by calling shake_page() for the thp
    tail case.

    Fixes: 385de35722c9 ("thp: allow a hwpoisoned head page to be put back to LRU")
    Signed-off-by: Naoya Horiguchi
    Reviewed-by: Andi Kleen
    Acked-by: Dean Nelson
    Cc: Andrea Arcangeli
    Cc: Hidetoshi Seto
    Cc: Jin Dongming
    Cc: [3.4+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi