12 Aug, 2016

7 commits

  • The following oops occurs after a pgdat is hotadded:

    Unable to handle kernel paging request for data at address 0x00c30001
    Faulting instruction address: 0xc00000000022f8f4
    Oops: Kernel access of bad area, sig: 11 [#1]
    SMP NR_CPUS=2048 NUMA pSeries
    Modules linked in: ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 xt_conntrack ebtable_nat ebtable_broute bridge stp llc ebtable_filter ebtables ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw ip6table_filter ip6_tables iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_security iptable_raw iptable_filter nls_utf8 isofs sg virtio_balloon uio_pdrv_genirq uio ip_tables xfs libcrc32c sr_mod cdrom sd_mod virtio_net ibmvscsi scsi_transport_srp virtio_pci virtio_ring virtio dm_mirror dm_region_hash dm_log dm_mod
    CPU: 0 PID: 0 Comm: swapper/0 Tainted: G W 4.8.0-rc1-device #110
    task: c000000000ef3080 task.stack: c000000000f6c000
    NIP: c00000000022f8f4 LR: c00000000022f948 CTR: 0000000000000000
    REGS: c000000000f6fa50 TRAP: 0300 Tainted: G W (4.8.0-rc1-device)
    MSR: 800000010280b033 CR: 84002028 XER: 20000000
    CFAR: d000000001d2013c DAR: 0000000000c30001 DSISR: 40000000 SOFTE: 0
    NIP refresh_cpu_vm_stats+0x1a4/0x2f0
    LR refresh_cpu_vm_stats+0x1f8/0x2f0
    Call Trace:
    refresh_cpu_vm_stats+0x1f8/0x2f0 (unreliable)

    Add per_cpu_nodestats initialization to the hotplug codepath.
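
    A minimal sketch of the idea, assuming the v4.8 names for the per-node
    stats (pgdat->per_cpu_nodestats, struct per_cpu_nodestat): allocate the
    counters when the pgdat is hot-added, as boot-time pgdats already get:

        /* in hotadd_new_pgdat(), once the node structure exists */
        pgdat->per_cpu_nodestats = alloc_percpu(struct per_cpu_nodestat);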

    Link: http://lkml.kernel.org/r/1470931473-7090-1-git-send-email-arbab@linux.vnet.ibm.com
    Signed-off-by: Reza Arbab
    Cc: Mel Gorman
    Cc: Paul Mackerras
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Reza Arbab
     
  • mm/oom_kill.c: In function `task_will_free_mem':
    mm/oom_kill.c:767: warning: `ret' may be used uninitialized in this function

    If __task_will_free_mem() is never called inside the for_each_process()
    loop, ret will not be initialized.
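
    A sketch of the hazard and the fix, loosely following mm/oom_kill.c
    (initialize the flag before the loop, since the loop body may never
    run):

        bool ret = true;        /* the fix: was uninitialized */

        for_each_process(p) {
            if (!process_shares_mm(p, mm))
                continue;       /* may skip every iteration */
            ret = __task_will_free_mem(p);
            if (!ret)
                break;
        }
        return ret;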

    Fixes: 1af8bb43269563e4 ("mm, oom: fortify task_will_free_mem()")
    Link: http://lkml.kernel.org/r/1470255599-24841-1-git-send-email-geert@linux-m68k.org
    Signed-off-by: Geert Uytterhoeven
    Acked-by: Tetsuo Handa
    Acked-by: Michal Hocko
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Geert Uytterhoeven
     
  • It's quite unlikely that the user will have so little memory that the
    per-CPU quarantines won't fit into the given fraction of the available
    memory. Even in that case he won't be able to do anything with the
    information given in the warning, so remove the warning.

    Link: http://lkml.kernel.org/r/1470929182-101413-1-git-send-email-glider@google.com
    Signed-off-by: Alexander Potapenko
    Acked-by: Andrey Ryabinin
    Cc: Dmitry Vyukov
    Cc: Andrey Konovalov
    Cc: Christoph Lameter
    Cc: Joonsoo Kim
    Cc: Kuthonuzo Luruo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexander Potapenko
     
  • Since commit 73f576c04b94 ("mm: memcontrol: fix cgroup creation failure
    after many small jobs") swap entries do not pin memcg->css.refcnt
    directly. Instead, they pin memcg->id.ref. So we should adjust the
    reference counters accordingly when moving swap charges between cgroups.
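
    A simplified sketch of the adjustment in
    mem_cgroup_move_swap_account(); helper names follow the commit
    description and may differ in detail:

        if (swap_cgroup_cmpxchg(entry, old_id, new_id) == old_id) {
            mem_cgroup_swap_statistics(from, false);
            mem_cgroup_swap_statistics(to, true);
            mem_cgroup_id_get(to);      /* was: css_get(&to->css) */
            return 0;
        }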

    Fixes: 73f576c04b941 ("mm: memcontrol: fix cgroup creation failure after many small jobs")
    Link: http://lkml.kernel.org/r/9ce297c64954a42dc90b543bc76106c4a94f07e8.1470219853.git.vdavydov@virtuozzo.com
    Signed-off-by: Vladimir Davydov
    Acked-by: Michal Hocko
    Acked-by: Johannes Weiner
    Cc: [3.19+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • An offline memory cgroup might have anonymous memory or shmem left
    charged to it and no swap. Since only swap entries pin the id of an
    offline cgroup, such a cgroup will have no id and so an attempt to
    swapout its anon/shmem will not store memory cgroup info in the swap
    cgroup map. As a result, memcg->swap or memcg->memsw will never get
    uncharged from it and any of its ascendants.

    Fix this by always charging swapout to the first ancestor cgroup that
    hasn't released its id yet.
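
    A sketch of the helper mentioned below, mem_cgroup_id_get_online(),
    simplified: walk up the hierarchy until an ancestor whose id is still
    alive is found, and take a reference on it:

        static struct mem_cgroup *mem_cgroup_id_get_online(struct mem_cgroup *memcg)
        {
            while (!atomic_inc_not_zero(&memcg->id.ref)) {
                /* the root cgroup's id can never be released */
                if (WARN_ON_ONCE(memcg == root_mem_cgroup))
                    break;
                memcg = parent_mem_cgroup(memcg);
                if (!memcg)
                    memcg = root_mem_cgroup;
            }
            return memcg;
        }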

    [hannes@cmpxchg.org: add comment to mem_cgroup_swapout]
    [vdavydov@virtuozzo.com: use WARN_ON_ONCE() in mem_cgroup_id_get_online()]
    Link: http://lkml.kernel.org/r/20160803123445.GJ13263@esperanza
    Fixes: 73f576c04b941 ("mm: memcontrol: fix cgroup creation failure after many small jobs")
    Link: http://lkml.kernel.org/r/5336daa5c9a32e776067773d9da655d2dc126491.1470219853.git.vdavydov@virtuozzo.com
    Signed-off-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: [3.19+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • meminfo_proc_show() and si_mem_available() are using the wrong helpers
    for calculating the size of the LRUs. The user-visible impact is that
    there appears to be an abnormally high number of unevictable pages.

    Link: http://lkml.kernel.org/r/20160805105805.GR2799@techsingularity.net
    Signed-off-by: Mel Gorman
    Cc: Dave Chinner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • When memory hotplug offlines a movable node, the node's free hugepages
    are dissolved, but the pool size is not updated, so
    /proc/sys/vm/nr_hugepages will report an incorrect value.

    Fix it by reducing max_huge_pages when the node is offlined.

    n-horiguchi@ah.jp.nec.com said:

    : dissolve_free_huge_page intends to break a hugepage into buddy, and the
    : destination hugepage is supposed to be allocated from the pool of the
    : destination node, so the system-wide pool size is reduced. So adding
    : h->max_huge_pages-- makes sense to me.
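
    A simplified sketch of dissolve_free_huge_page() with the fix applied
    (locking omitted):

        if (PageHuge(page) && !page_count(page)) {
            struct hstate *h = page_hstate(page);
            int nid = page_to_nid(page);

            list_del(&page->lru);
            h->free_huge_pages--;
            h->free_huge_pages_node[nid]--;
            h->max_huge_pages--;        /* the fix: shrink the pool size too */
            update_and_free_page(h, page);
        }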

    Link: http://lkml.kernel.org/r/1470624546-902-1-git-send-email-zhongjiang@huawei.com
    Signed-off-by: zhong jiang
    Cc: Mike Kravetz
    Acked-by: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    zhong jiang
     

11 Aug, 2016

6 commits

  • With debugobjects enabled and using SLAB_DESTROY_BY_RCU, when a
    kmem_cache_node is destroyed the call_rcu() may trigger a slab
    allocation to fill the debug object pool
    (__debug_object_init:fill_pool).

    Everywhere except during kmem_cache_destroy(), discard_slab() is
    performed outside of the kmem_cache_node->list_lock, which avoids a
    lockdep warning about potential recursion (see the sketch after the
    backtrace):

    =============================================
    [ INFO: possible recursive locking detected ]
    4.8.0-rc1-gfxbench+ #1 Tainted: G U
    ---------------------------------------------
    rmmod/8895 is trying to acquire lock:
    (&(&n->list_lock)->rlock){-.-...}, at: [] get_partial_node.isra.63+0x47/0x430

    but task is already holding lock:
    (&(&n->list_lock)->rlock){-.-...}, at: [] __kmem_cache_shutdown+0x54/0x320

    other info that might help us debug this:
    Possible unsafe locking scenario:
    CPU0
    ----
    lock(&(&n->list_lock)->rlock);
    lock(&(&n->list_lock)->rlock);

    *** DEADLOCK ***
    May be due to missing lock nesting notation
    5 locks held by rmmod/8895:
    #0: (&dev->mutex){......}, at: driver_detach+0x42/0xc0
    #1: (&dev->mutex){......}, at: driver_detach+0x50/0xc0
    #2: (cpu_hotplug.dep_map){++++++}, at: get_online_cpus+0x2d/0x80
    #3: (slab_mutex){+.+.+.}, at: kmem_cache_destroy+0x3c/0x220
    #4: (&(&n->list_lock)->rlock){-.-...}, at: __kmem_cache_shutdown+0x54/0x320

    stack backtrace:
    CPU: 6 PID: 8895 Comm: rmmod Tainted: G U 4.8.0-rc1-gfxbench+ #1
    Hardware name: Gigabyte Technology Co., Ltd. H87M-D3H/H87M-D3H, BIOS F11 08/18/2015
    Call Trace:
    __lock_acquire+0x1646/0x1ad0
    lock_acquire+0xb2/0x200
    _raw_spin_lock+0x36/0x50
    get_partial_node.isra.63+0x47/0x430
    ___slab_alloc.constprop.67+0x1a7/0x3b0
    __slab_alloc.isra.64.constprop.66+0x43/0x80
    kmem_cache_alloc+0x236/0x2d0
    __debug_object_init+0x2de/0x400
    debug_object_activate+0x109/0x1e0
    __call_rcu.constprop.63+0x32/0x2f0
    call_rcu+0x12/0x20
    discard_slab+0x3d/0x40
    __kmem_cache_shutdown+0xdb/0x320
    shutdown_cache+0x19/0x60
    kmem_cache_destroy+0x1ae/0x220
    i915_gem_load_cleanup+0x14/0x40 [i915]
    i915_driver_unload+0x151/0x180 [i915]
    i915_pci_remove+0x14/0x20 [i915]
    pci_device_remove+0x34/0xb0
    __device_release_driver+0x95/0x140
    driver_detach+0xb6/0xc0
    bus_remove_driver+0x53/0xd0
    driver_unregister+0x27/0x50
    pci_unregister_driver+0x25/0x70
    i915_exit+0x1a/0x1e2 [i915]
    SyS_delete_module+0x193/0x1f0
    entry_SYSCALL_64_fastpath+0x1c/0xac
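
    A sketch of the fix, simplified from free_partial(): collect empty
    slabs on a local list while holding list_lock, and discard them only
    after the lock is dropped:

        LIST_HEAD(discard);
        struct page *page, *h;

        spin_lock_irq(&n->list_lock);
        list_for_each_entry_safe(page, h, &n->partial, lru) {
            if (!page->inuse) {
                remove_partial(n, page);
                list_add(&page->lru, &discard);
            }
        }
        spin_unlock_irq(&n->list_lock);

        list_for_each_entry_safe(page, h, &discard, lru)
            discard_slab(s, page);      /* call_rcu() may now allocate safely */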

    Fixes: 52b4b950b507 ("mm: slab: free kmem_cache_node after destroy sysfs file")
    Link: http://lkml.kernel.org/r/1470759070-18743-1-git-send-email-chris@chris-wilson.co.uk
    Reported-by: Dave Gordon
    Signed-off-by: Chris Wilson
    Reviewed-by: Vladimir Davydov
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Dmitry Safonov
    Cc: Daniel Vetter
    Cc: Dave Gordon
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chris Wilson
     
  • In page_remove_file_rmap(.) we have the following check:

    VM_BUG_ON_PAGE(compound && !PageTransHuge(page), page);

    This is meant to check for either HugeTLB pages or THP when a compound
    page is passed in.

    Unfortunately, if one disables CONFIG_TRANSPARENT_HUGEPAGE, then
    PageTransHuge(.) will always return false, provoking BUGs when one runs
    the libhugetlbfs test suite.

    This patch replaces PageTransHuge() with PageHead(), which works for
    both HugeTLB and THP.
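
    The change, sketched in place:

        /* was: VM_BUG_ON_PAGE(compound && !PageTransHuge(page), page); */
        VM_BUG_ON_PAGE(compound && !PageHead(page), page);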

    Fixes: dd78fedde4b9 ("rmap: support file thp")
    Link: http://lkml.kernel.org/r/1470838217-5889-1-git-send-email-steve.capper@arm.com
    Signed-off-by: Steve Capper
    Acked-by: Kirill A. Shutemov
    Cc: Huang Shijie
    Cc: Will Deacon
    Cc: Catalin Marinas
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Steve Capper
     
  • PageTransCompound() doesn't distinguish THP from any other type of
    compound page. This can lead to a false-positive VM_BUG_ON() in
    page_add_file_rmap() if it is called on a compound page from a
    driver[1].

    I think we can exclude such cases by checking if the page belongs to a
    mapping.

    The VM_BUG_ON_PAGE() is downgraded to VM_WARN_ON_ONCE(). This path
    should not cause any harm to a non-THP page, but it is good to know if
    we step on anything else.

    [1] http://lkml.kernel.org/r/c711e067-0bff-a6cb-3c37-04dfe77d2db1@redhat.com

    Link: http://lkml.kernel.org/r/20160810161345.GA67522@black.fi.intel.com
    Signed-off-by: Kirill A. Shutemov
    Reported-by: Laura Abbott
    Tested-by: Laura Abbott
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Some node thresholds depend on the number of managed pages in the
    node. When memory goes online or offline, that number can change, and
    the thresholds need to be adjusted accordingly.

    Add recalculation at the appropriate places and clean up the related
    functions for better maintainability.

    Link: http://lkml.kernel.org/r/1470724248-26780-2-git-send-email-iamjoonsoo.kim@lge.com
    Signed-off-by: Joonsoo Kim
    Acked-by: Mel Gorman
    Cc: Vlastimil Babka
    Cc: Johannes Weiner
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • Before recalculating min_unmapped_pages, we need to zero
    min_unmapped_pages itself rather than min_slab_pages.
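
    Sketched in place, per the description (the surrounding sysctl handler
    is omitted):

        /* was: pgdat->min_slab_pages = 0; */
        pgdat->min_unmapped_pages = 0;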

    Fixes: a5f5f91da6 ("mm: convert zone_reclaim to node_reclaim")
    Link: http://lkml.kernel.org/r/1470724248-26780-1-git-send-email-iamjoonsoo.kim@lge.com
    Signed-off-by: Joonsoo Kim
    Acked-by: Mel Gorman
    Cc: Vlastimil Babka
    Cc: Johannes Weiner
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • The newly introduced shmem_huge_enabled() function has two definitions,
    but neither of them is visible if CONFIG_SYSFS is disabled, leading to a
    build error:

    mm/khugepaged.o: In function `khugepaged':
    khugepaged.c:(.text.khugepaged+0x3ca): undefined reference to `shmem_huge_enabled'

    This changes the #ifdef guards around the definition to match those that
    are used in the header file.
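
    Sketched, assuming the header guards the declaration with
    CONFIG_TRANSPARENT_HUGE_PAGECACHE alone:

        /* was: #if defined(CONFIG_TRANSPARENT_HUGE_PAGECACHE) && defined(CONFIG_SYSFS) */
        #if defined(CONFIG_TRANSPARENT_HUGE_PAGECACHE)
        bool shmem_huge_enabled(struct vm_area_struct *vma)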

    Fixes: e496cf3d7821 ("thp: introduce CONFIG_TRANSPARENT_HUGE_PAGECACHE")
    Link: http://lkml.kernel.org/r/20160809123638.1357593-1-arnd@arndb.de
    Signed-off-by: Arnd Bergmann
    Acked-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Arnd Bergmann
     

10 Aug, 2016

1 commit

  • To distinguish non-slab pages charged to kmemcg we mark them PageKmemcg,
    which sets page->_mapcount to -512. Currently, we set/clear PageKmemcg
    in __alloc_pages_nodemask()/free_pages_prepare() for any page allocated
    with __GFP_ACCOUNT, including those that aren't actually charged to any
    cgroup, i.e. allocated from the root cgroup context. To avoid overhead
    in case cgroups are not used, we only do that if memcg_kmem_enabled() is
    true. The latter is set iff there are kmem-enabled memory cgroups
    (online or offline). The root cgroup is not considered kmem-enabled.

    As a result, if a page is allocated with __GFP_ACCOUNT for the root
    cgroup when there are kmem-enabled memory cgroups and is freed after all
    kmem-enabled memory cgroups were removed, e.g.

    # no memory cgroups have been created yet, create one
    mkdir /sys/fs/cgroup/memory/test
    # run something allocating pages with __GFP_ACCOUNT, e.g.
    # a program using pipe
    dmesg | tail
    # remove the memory cgroup
    rmdir /sys/fs/cgroup/memory/test

    we'll get a bad page state bug complaining about page->_mapcount != -1:

    BUG: Bad page state in process swapper/0 pfn:1fd945c
    page:ffffea007f651700 count:0 mapcount:-511 mapping: (null) index:0x0
    flags: 0x1000000000000000()

    To avoid that, let's mark with PageKmemcg only those pages that are
    actually charged to and hence pin a non-root memory cgroup.
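
    A simplified sketch of the idea (function names are approximate): set
    the flag only after a successful charge to a non-root memcg:

        memcg = get_mem_cgroup_from_mm(current->mm);
        if (!mem_cgroup_is_root(memcg)) {
            ret = memcg_kmem_charge_memcg(page, gfp, order, memcg);
            if (!ret)
                __SetPageKmemcg(page);  /* only actually-charged pages */
        }
        css_put(&memcg->css);
        return ret;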

    Fixes: 4949148ad433 ("mm: charge/uncharge kmemcg from generic page allocator paths")
    Reported-and-tested-by: Eric Dumazet
    Signed-off-by: Vladimir Davydov
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     

09 Aug, 2016

1 commit

  • Pull usercopy protection from Kees Cook:
    "Tbhis implements HARDENED_USERCOPY verification of copy_to_user and
    copy_from_user bounds checking for most architectures on SLAB and
    SLUB"

    * tag 'usercopy-v4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
    mm: SLUB hardened usercopy support
    mm: SLAB hardened usercopy support
    s390/uaccess: Enable hardened usercopy
    sparc/uaccess: Enable hardened usercopy
    powerpc/uaccess: Enable hardened usercopy
    ia64/uaccess: Enable hardened usercopy
    arm64/uaccess: Enable hardened usercopy
    ARM: uaccess: Enable hardened usercopy
    x86/uaccess: Enable hardened usercopy
    mm: Hardened usercopy
    mm: Implement stack frame object validation
    mm: Add is_migrate_cma_page

    Linus Torvalds
     

08 Aug, 2016

2 commits


06 Aug, 2016

1 commit

  • Pull block fixes from Jens Axboe:
    "Here's the second round of block updates for this merge window.

    It's a mix of fixes for changes that went in previously in this round,
    and fixes in general. This pull request contains:

    - Fixes for loop from Christoph

    - A bdi vs gendisk lifetime fix from Dan, worth two cookies.

    - A blk-mq timeout fix, when on frozen queues. From Gabriel.

    - Writeback fix from Jan, ensuring that __writeback_single_inode()
    does the right thing.

    - Fix for bio->bi_rw usage in f2fs from me.

    - Error path deadlock fix in blk-mq sysfs registration from me.

    - Floppy O_ACCMODE fix from Jiri.

    - Fix to the new bio op methods from Mike.

    One more followup will be coming here, ensuring that we don't
    propagate the block types outside of block. That, and a rename of
    bio->bi_rw is coming right after -rc1 is cut.

    - Various little fixes"

    * 'for-linus' of git://git.kernel.dk/linux-block:
    mm/block: convert rw_page users to bio op use
    loop: make do_req_filebacked more robust
    loop: don't try to use AIO for discards
    blk-mq: fix deadlock in blk_mq_register_disk() error path
    Include: blkdev: Removed duplicate 'struct request;' declaration.
    Fixup direct bi_rw modifiers
    block: fix bdi vs gendisk lifetime mismatch
    blk-mq: Allow timeouts to run while queue is freezing
    nbd: fix race in ioctl
    block: fix use-after-free in seq file
    f2fs: drop bio->bi_rw manual assignment
    block: add missing group association in bio-cloning functions
    blkcg: kill unused field nr_undestroyed_grps
    writeback: Write dirty times for WB_SYNC_ALL writeback
    floppy: fix open(O_ACCMODE) for ioctl-only open

    Linus Torvalds
     

05 Aug, 2016

8 commits

  • Pull more powerpc updates from Michael Ellerman:
    "These were delayed for various reasons, so I let them sit in next a
    bit longer, rather than including them in my first pull request.

    Fixes:
    - Fix early access to cpu_spec relocation from Benjamin Herrenschmidt
    - Fix incorrect event codes in power9-event-list from Madhavan Srinivasan
    - Move register_process_table() out of ppc_md from Michael Ellerman

    Use jump_label for [cpu|mmu]_has_feature():
    - Add mmu_early_init_devtree() from Michael Ellerman
    - Move disable_radix handling into mmu_early_init_devtree() from Michael Ellerman
    - Do hash device tree scanning earlier from Michael Ellerman
    - Do radix device tree scanning earlier from Michael Ellerman
    - Do feature patching before MMU init from Michael Ellerman
    - Check features don't change after patching from Michael Ellerman
    - Make MMU_FTR_RADIX a MMU family feature from Aneesh Kumar K.V
    - Convert mmu_has_feature() to returning bool from Michael Ellerman
    - Convert cpu_has_feature() to returning bool from Michael Ellerman
    - Define radix_enabled() in one place & use static inline from Michael Ellerman
    - Add early_[cpu|mmu]_has_feature() from Michael Ellerman
    - Convert early cpu/mmu feature check to use the new helpers from Aneesh Kumar K.V
    - jump_label: Make it possible for arches to invoke jump_label_init() earlier from Kevin Hao
    - Call jump_label_init() in apply_feature_fixups() from Aneesh Kumar K.V
    - Remove mfvtb() from Kevin Hao
    - Move cpu_has_feature() to a separate file from Kevin Hao
    - Add kconfig option to use jump labels for cpu/mmu_has_feature() from Michael Ellerman
    - Add option to use jump label for cpu_has_feature() from Kevin Hao
    - Add option to use jump label for mmu_has_feature() from Kevin Hao
    - Catch usage of cpu/mmu_has_feature() before jump label init from Aneesh Kumar K.V
    - Annotate jump label assembly from Michael Ellerman

    TLB flush enhancements from Aneesh Kumar K.V:
    - radix: Implement tlb mmu gather flush efficiently
    - Add helper for finding SLBE LLP encoding
    - Use hugetlb flush functions
    - Drop multiple definition of mm_is_core_local
    - radix: Add tlb flush of THP ptes
    - radix: Rename function and drop unused arg
    - radix/hugetlb: Add helper for finding page size
    - hugetlb: Add flush_hugetlb_tlb_range
    - remove flush_tlb_page_nohash

    Add new ptrace regsets from Anshuman Khandual and Simon Guo:
    - elf: Add powerpc specific core note sections
    - Add the function flush_tmregs_to_thread
    - Enable in transaction NT_PRFPREG ptrace requests
    - Enable in transaction NT_PPC_VMX ptrace requests
    - Enable in transaction NT_PPC_VSX ptrace requests
    - Adapt gpr32_get, gpr32_set functions for transaction
    - Enable support for NT_PPC_CGPR
    - Enable support for NT_PPC_CFPR
    - Enable support for NT_PPC_CVMX
    - Enable support for NT_PPC_CVSX
    - Enable support for TM SPR state
    - Enable NT_PPC_TM_CTAR, NT_PPC_TM_CPPR, NT_PPC_TM_CDSCR
    - Enable support for NT_PPPC_TAR, NT_PPC_PPR, NT_PPC_DSCR
    - Enable support for EBB registers
    - Enable support for Performance Monitor registers"

    * tag 'powerpc-4.8-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (48 commits)
    powerpc/mm: Move register_process_table() out of ppc_md
    powerpc/perf: Fix incorrect event codes in power9-event-list
    powerpc/32: Fix early access to cpu_spec relocation
    powerpc/ptrace: Enable support for Performance Monitor registers
    powerpc/ptrace: Enable support for EBB registers
    powerpc/ptrace: Enable support for NT_PPPC_TAR, NT_PPC_PPR, NT_PPC_DSCR
    powerpc/ptrace: Enable NT_PPC_TM_CTAR, NT_PPC_TM_CPPR, NT_PPC_TM_CDSCR
    powerpc/ptrace: Enable support for TM SPR state
    powerpc/ptrace: Enable support for NT_PPC_CVSX
    powerpc/ptrace: Enable support for NT_PPC_CVMX
    powerpc/ptrace: Enable support for NT_PPC_CFPR
    powerpc/ptrace: Enable support for NT_PPC_CGPR
    powerpc/ptrace: Adapt gpr32_get, gpr32_set functions for transaction
    powerpc/ptrace: Enable in transaction NT_PPC_VSX ptrace requests
    powerpc/ptrace: Enable in transaction NT_PPC_VMX ptrace requests
    powerpc/ptrace: Enable in transaction NT_PRFPREG ptrace requests
    powerpc/process: Add the function flush_tmregs_to_thread
    elf: Add powerpc specific core note sections
    powerpc/mm: remove flush_tlb_page_nohash
    powerpc/mm/hugetlb: Add flush_hugetlb_tlb_range
    ...

    Linus Torvalds
     
  • __next_mem_range_rev() causes a NULL dereference error, and fails to
    return the type_a->regions[0] info, if its type_b parameter is NULL.

    Fix this by checking type_b before dereferencing it, and by
    initializing idx_b to 0 in that case.

    The approach was tested by separately dumping all types of region via
    __memblock_dump_all() and the fixed __next_mem_range_rev() to the
    UART; the results are okay after checking the logs.
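
    A sketch of the fix at the start of the reverse walk (simplified from
    __next_mem_range_rev()):

        if (*idx == (u64)ULLONG_MAX) {
            idx_a = type_a->cnt - 1;
            if (type_b != NULL)
                idx_b = type_b->cnt;
            else
                idx_b = 0;
        }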

    Link: http://lkml.kernel.org/r/57A0320D.6070102@zoho.com
    Signed-off-by: zijun_hu
    Tested-by: zijun_hu
    Acked-by: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    zijun_hu
     
  • With m68k-linux-gnu-gcc-4.1:

    include/linux/slub_def.h:126: warning: `fixup_red_left' declared inline after being called
    include/linux/slub_def.h:126: warning: previous declaration of `fixup_red_left' was here

    Commit c146a2b98eb5 ("mm, kasan: account for object redzone in SLUB's
    nearest_obj()") made fixup_red_left() global, but forgot to remove the
    inline keyword.
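
    The change, sketched against mm/slub.c:

        /* was: inline void *fixup_red_left(struct kmem_cache *s, void *p) */
        void *fixup_red_left(struct kmem_cache *s, void *p)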

    Fixes: c146a2b98eb5898e ("mm, kasan: account for object redzone in SLUB's nearest_obj()")
    Link: http://lkml.kernel.org/r/1470256262-1586-1-git-send-email-geert@linux-m68k.org
    Signed-off-by: Geert Uytterhoeven
    Cc: Alexander Potapenko
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Geert Uytterhoeven
     
  • Paul Mackerras and Reza Arbab reported that machines with memoryless
    nodes fail when vmstats are refreshed. Paul reported an oops as follows:

    Unable to handle kernel paging request for data at address 0xff7a10000
    Faulting instruction address: 0xc000000000270cd0
    Oops: Kernel access of bad area, sig: 11 [#1]
    SMP NR_CPUS=2048 NUMA PowerNV
    Modules linked in:
    CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.7.0-kvm+ #118
    task: c000000ff0680010 task.stack: c000000ff0704000
    NIP: c000000000270cd0 LR: c000000000270ce8 CTR: 0000000000000000
    REGS: c000000ff0707900 TRAP: 0300 Not tainted (4.7.0-kvm+)
    MSR: 9000000102009033 CR: 846b6824 XER: 20000000
    CFAR: c000000000008768 DAR: 0000000ff7a10000 DSISR: 42000000 SOFTE: 1
    NIP refresh_zone_stat_thresholds+0x80/0x240
    LR refresh_zone_stat_thresholds+0x98/0x240
    Call Trace:
    refresh_zone_stat_thresholds+0xb8/0x240 (unreliable)

    Both supplied potential fixes, but one potentially missed checks and
    the other had redundant initialisations. This version initialises
    per_cpu_nodestats on a per-pgdat basis instead of on a per-zone basis.

    Link: http://lkml.kernel.org/r/20160804092404.GI2799@techsingularity.net
    Signed-off-by: Mel Gorman
    Reported-by: Paul Mackerras
    Reported-by: Reza Arbab
    Tested-by: Reza Arbab
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • s/accomodate/accommodate/

    Link: http://lkml.kernel.org/r/20160804121824.18100-1-kuleshovmail@gmail.com
    Signed-off-by: Alexander Kuleshov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexander Kuleshov
     
  • At present, memory online and offline will simply fail when KASAN is
    enabled. So add a condition to restrict memory hotplug when KASAN is
    enabled.

    Link: http://lkml.kernel.org/r/1470063651-29519-1-git-send-email-zhongjiang@huawei.com
    Signed-off-by: zhong jiang
    Cc: Andrey Ryabinin
    Cc: Alexander Potapenko
    Cc: Dmitry Vyukov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    zhong jiang
     
  • The rw_page users were not converted to use bio/req ops. As a result
    bdev_write_page is not passing down REQ_OP_WRITE and the IOs will
    be sent down as reads.

    Signed-off-by: Mike Christie
    Fixes: 4e1b2d52a80d ("block, fs, drivers: remove REQ_OP compat defs and related code")

    Modified by me to:

    1) Drop op_flags passing into ->rw_page(), as we don't use it.
    2) Make op_is_write() and friends safe to use for !CONFIG_BLOCK

    Signed-off-by: Jens Axboe

    Mike Christie
     
  • The name for a bdi of a gendisk is derived from the gendisk's devt.
    However, since the gendisk is destroyed before the bdi it leaves a
    window where a new gendisk could dynamically reuse the same devt while a
    bdi with the same name is still live. Arrange for the bdi to hold a
    reference against its "owner" disk device while it is registered.
    Otherwise we can hit sysfs duplicate name collisions like the following:

    WARNING: CPU: 10 PID: 2078 at fs/sysfs/dir.c:31 sysfs_warn_dup+0x64/0x80
    sysfs: cannot create duplicate filename '/devices/virtual/bdi/259:1'

    Hardware name: HP ProLiant DL580 Gen8, BIOS P79 05/06/2015
    0000000000000286 0000000002c04ad5 ffff88006f24f970 ffffffff8134caec
    ffff88006f24f9c0 0000000000000000 ffff88006f24f9b0 ffffffff8108c351
    0000001f0000000c ffff88105d236000 ffff88105d1031e0 ffff8800357427f8
    Call Trace:
    [] dump_stack+0x63/0x87
    [] __warn+0xd1/0xf0
    [] warn_slowpath_fmt+0x5f/0x80
    [] sysfs_warn_dup+0x64/0x80
    [] sysfs_create_dir_ns+0x7e/0x90
    [] kobject_add_internal+0xaa/0x320
    [] ? vsnprintf+0x34e/0x4d0
    [] kobject_add+0x75/0xd0
    [] ? mutex_lock+0x12/0x2f
    [] device_add+0x125/0x610
    [] device_create_groups_vargs+0xd8/0x100
    [] device_create_vargs+0x1c/0x20
    [] bdi_register+0x8c/0x180
    [] bdi_register_dev+0x27/0x30
    [] add_disk+0x175/0x4a0
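
    The helper mentioned below, sketched (simplified): register the bdi
    under the disk's devt and pin the owner device until the bdi is
    unregistered:

        int bdi_register_owner(struct backing_dev_info *bdi, struct device *owner)
        {
            int rc = bdi_register(bdi, NULL, "%u:%u",
                                  MAJOR(owner->devt), MINOR(owner->devt));
            if (rc)
                return rc;
            bdi->owner = owner;
            get_device(owner);  /* dropped again when the bdi is unregistered */
            return 0;
        }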

    Cc:
    Reported-by: Yi Zhang
    Tested-by: Yi Zhang
    Signed-off-by: Dan Williams

    Fixed up missing 0 return in bdi_register_owner().

    Signed-off-by: Jens Axboe

    Dan Williams
     

04 Aug, 2016

1 commit

  • If CONFIG_TRANSPARENT_HUGE_PAGECACHE=n, HPAGE_PMD_NR evaluates to
    BUILD_BUG_ON(), and may cause (e.g. with gcc 4.1.2):

    mm/built-in.o: In function `shmem_alloc_hugepage':
    shmem.c:(.text+0x17570): undefined reference to `__compiletime_assert_1365'

    To fix this, move the assignment to hindex after the check for huge
    pages support.
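
    Sketched, simplified from shmem_alloc_hugepage():

        pgoff_t hindex;     /* no longer initialized at its declaration */

        if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGE_PAGECACHE))
            return NULL;

        /* now dead code when huge pages are compiled out, so the
           BUILD_BUG_ON() inside HPAGE_PMD_NR is eliminated */
        hindex = round_down(index, HPAGE_PMD_NR);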

    Fixes: 800d8c63b2e9 ("shmem: add huge pages support")
    Signed-off-by: Geert Uytterhoeven
    Acked-by: Kirill A. Shutemov
    Cc: Andrew Morton
    Signed-off-by: Linus Torvalds

    Geert Uytterhoeven
     

03 Aug, 2016

13 commits

  • Merge yet more updates from Andrew Morton:

    - the rest of ocfs2

    - various hotfixes, mainly MM

    - quite a bit of misc stuff - drivers, fork, exec, signals, etc.

    - printk updates

    - firmware

    - checkpatch

    - nilfs2

    - more kexec stuff than usual

    - rapidio updates

    - w1 things

    * emailed patches from Andrew Morton : (111 commits)
    ipc: delete "nr_ipc_ns"
    kcov: allow more fine-grained coverage instrumentation
    init/Kconfig: add clarification for out-of-tree modules
    config: add android config fragments
    init/Kconfig: ban CONFIG_LOCALVERSION_AUTO with allmodconfig
    relay: add global mode support for buffer-only channels
    init: allow blacklisting of module_init functions
    w1:omap_hdq: fix regression
    w1: add helper macro module_w1_family
    w1: remove need for ida and use PLATFORM_DEVID_AUTO
    rapidio/switches: add driver for IDT gen3 switches
    powerpc/fsl_rio: apply changes for RIO spec rev 3
    rapidio: modify for rev.3 specification changes
    rapidio: change inbound window size type to u64
    rapidio/idt_gen2: fix locking warning
    rapidio: fix error handling in mbox request/release functions
    rapidio/tsi721_dma: advance queue processing from transfer submit call
    rapidio/tsi721: add messaging mbox selector parameter
    rapidio/tsi721: add PCIe MRRS override parameter
    rapidio/tsi721_dma: add channel mask and queue size parameters
    ...

    Linus Torvalds
     
  • The vm_brk() alignment calculations should refuse to overflow. The ELF
    loader was depending on this, but it has been fixed now. No other
    unsafe callers have been found.

    Link: http://lkml.kernel.org/r/1468014494-25291-3-git-send-email-keescook@chromium.org
    Signed-off-by: Kees Cook
    Reported-by: Hector Marco-Gisbert
    Cc: Ismael Ripoll Ripoll
    Cc: Alexander Viro
    Cc: "Kirill A. Shutemov"
    Cc: Oleg Nesterov
    Cc: Chen Gang
    Cc: Michal Hocko
    Cc: Konstantin Khlebnikov
    Cc: Andrea Arcangeli
    Cc: Andrey Ryabinin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kees Cook
     
  • There was only one use of __initdata_refok and __exit_refok

    __init_refok was used 46 times against 82 for __ref.

    Those definitions are obsolete since commit 312b1485fb50 ("Introduce new
    section reference annotations tags: __ref, __refdata, __refconst")

    This patch removes the following compatibility definitions and replaces
    them treewide.

    /* compatibility defines */
    #define __init_refok __ref
    #define __initdata_refok __refdata
    #define __exit_refok __ref

    I can also provide separate patches if necessary.
    (One patch per tree and check in 1 month or 2 to remove old definitions)

    [akpm@linux-foundation.org: coding-style fixes]
    Link: http://lkml.kernel.org/r/1466796271-3043-1-git-send-email-fabf@skynet.be
    Signed-off-by: Fabian Frederick
    Cc: Ingo Molnar
    Cc: Sam Ravnborg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fabian Frederick
     
  • We must call shrink_slab() for each memory cgroup on both global and
    memcg reclaim in shrink_node_memcg(). Commit d71df22b55099 accidentally
    changed that so that now shrink_slab() is only called with memcg != NULL
    on memcg reclaim. As a result, memcg-aware shrinkers (including
    dentry/inode) are never invoked on global reclaim. Fix that.

    Fixes: b2e18757f2c9 ("mm, vmscan: begin reclaiming pages on a per-node basis")
    Link: http://lkml.kernel.org/r/1470056590-7177-1-git-send-email-vdavydov@virtuozzo.com
    Signed-off-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Mel Gorman
    Cc: Hillf Danton
    Cc: Vlastimil Babka
    Cc: Joonsoo Kim
    Cc: Minchan Kim
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • If the total amount of memory assigned to quarantine is less than the
    amount of memory assigned to per-cpu quarantines, |new_quarantine_size|
    may overflow. Instead, set it to zero.
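
    The fix, sketched using the WARN_ONCE() return value as noted below:

        if (WARN_ONCE(new_quarantine_size < percpu_quarantines,
                      "Too little memory, disabling global KASAN quarantine.\n"))
            new_quarantine_size = 0;
        else
            new_quarantine_size -= percpu_quarantines;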

    [akpm@linux-foundation.org: cleanup: use WARN_ONCE return value]
    Link: http://lkml.kernel.org/r/1470063563-96266-1-git-send-email-glider@google.com
    Fixes: 55834c59098d ("mm: kasan: initial memory quarantine implementation")
    Signed-off-by: Alexander Potapenko
    Reported-by: Dmitry Vyukov
    Cc: Andrey Ryabinin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexander Potapenko
     
  • Currently we just dump the stack in case of a double-free bug. Let's
    dump all the info about the object that we have.

    [aryabinin@virtuozzo.com: change double free message per Alexander]
    Link: http://lkml.kernel.org/r/1470153654-30160-1-git-send-email-aryabinin@virtuozzo.com
    Link: http://lkml.kernel.org/r/1470062715-14077-6-git-send-email-aryabinin@virtuozzo.com
    Signed-off-by: Andrey Ryabinin
    Cc: Alexander Potapenko
    Cc: Dmitry Vyukov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Ryabinin
     
  • The state of an object is currently tracked in two places: shadow
    memory, and the ->state field in struct kasan_alloc_meta. We can get
    rid of the latter, which will save us a little bit of memory. It also
    allows us to move the free stack into struct kasan_alloc_meta without
    increasing memory consumption. So now we should always know when the
    object was last freed, which may be useful for long-delayed
    use-after-free bugs.

    As a side effect this fixes following UBSAN warning:
    UBSAN: Undefined behaviour in mm/kasan/quarantine.c:102:13
    member access within misaligned address ffff88000d1efebc for type 'struct qlist_node'
    which requires 8 byte alignment

    Link: http://lkml.kernel.org/r/1470062715-14077-5-git-send-email-aryabinin@virtuozzo.com
    Reported-by: kernel test robot
    Signed-off-by: Andrey Ryabinin
    Cc: Alexander Potapenko
    Cc: Dmitry Vyukov
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Ryabinin
     
  • The size of a slab object is already stored in cache->object_size.

    Note that kmalloc() internally rounds up the size of an allocation, so
    object_size may not be equal to alloc_size, but usually we don't need
    to know the exact size of an allocated object. In case we do need that
    information, we can still figure it out from the report: the dump of
    shadow memory allows us to identify the end of the allocated memory,
    and thereby the exact allocation size.

    Link: http://lkml.kernel.org/r/1470062715-14077-4-git-send-email-aryabinin@virtuozzo.com
    Signed-off-by: Andrey Ryabinin
    Cc: Alexander Potapenko
    Cc: Dmitry Vyukov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Ryabinin
     
  • SLUB doesn't require disabled interrupts to call ___cache_free().

    Link: http://lkml.kernel.org/r/1470062715-14077-3-git-send-email-aryabinin@virtuozzo.com
    Signed-off-by: Andrey Ryabinin
    Acked-by: Alexander Potapenko
    Cc: Dmitry Vyukov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Ryabinin
     
  • Currently we call quarantine_reduce() for any ___GFP_KSWAPD_RECLAIM
    (implied by __GFP_RECLAIM) allocation, i.e. on almost every
    allocation. quarantine_reduce() is sometimes a heavy operation, and
    calling it with interrupts disabled may trigger a hard lockup:

    NMI watchdog: Watchdog detected hard LOCKUP on cpu 2irq event stamp: 1411258
    Call Trace:
    dump_stack+0x68/0x96
    watchdog_overflow_callback+0x15b/0x190
    __perf_event_overflow+0x1b1/0x540
    perf_event_overflow+0x14/0x20
    intel_pmu_handle_irq+0x36a/0xad0
    perf_event_nmi_handler+0x2c/0x50
    nmi_handle+0x128/0x480
    default_do_nmi+0xb2/0x210
    do_nmi+0x1aa/0x220
    end_repeat_nmi+0x1a/0x1e
    <> __kernel_text_address+0x86/0xb0
    print_context_stack+0x7b/0x100
    dump_trace+0x12b/0x350
    save_stack_trace+0x2b/0x50
    set_track+0x83/0x140
    free_debug_processing+0x1aa/0x420
    __slab_free+0x1d6/0x2e0
    ___cache_free+0xb6/0xd0
    qlist_free_all+0x83/0x100
    quarantine_reduce+0x177/0x1b0
    kasan_kmalloc+0xf3/0x100

    Run quarantine_reduce() only when direct reclaim is allowed.
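
    Sketched, assuming the stock gfp helper:

        /* in kasan_kmalloc() and friends */
        if (gfpflags_allow_blocking(flags))
            quarantine_reduce();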

    Fixes: 55834c59098d("mm: kasan: initial memory quarantine implementation")
    Link: http://lkml.kernel.org/r/1470062715-14077-2-git-send-email-aryabinin@virtuozzo.com
    Signed-off-by: Andrey Ryabinin
    Reported-by: Dave Jones
    Acked-by: Alexander Potapenko
    Cc: Dmitry Vyukov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Ryabinin
     
  • Once an object is put into quarantine, we no longer own it, i.e. the
    object could leave the quarantine and be reallocated. So having the
    set_track() call after the quarantine_put() may corrupt slab objects.

    BUG kmalloc-4096 (Not tainted): Poison overwritten
    -----------------------------------------------------------------------------
    Disabling lock debugging due to kernel taint
    INFO: 0xffff8804540de850-0xffff8804540de857. First byte 0xb5 instead of 0x6b
    ...
    INFO: Freed in qlist_free_all+0x42/0x100 age=75 cpu=3 pid=24492
    __slab_free+0x1d6/0x2e0
    ___cache_free+0xb6/0xd0
    qlist_free_all+0x83/0x100
    quarantine_reduce+0x177/0x1b0
    kasan_kmalloc+0xf3/0x100
    kasan_slab_alloc+0x12/0x20
    kmem_cache_alloc+0x109/0x3e0
    mmap_region+0x53e/0xe40
    do_mmap+0x70f/0xa50
    vm_mmap_pgoff+0x147/0x1b0
    SyS_mmap_pgoff+0x2c7/0x5b0
    SyS_mmap+0x1b/0x30
    do_syscall_64+0x1a0/0x4e0
    return_from_SYSCALL_64+0x0/0x7a
    INFO: Slab 0xffffea0011503600 objects=7 used=7 fp=0x (null) flags=0x8000000000004080
    INFO: Object 0xffff8804540de848 @offset=26696 fp=0xffff8804540dc588
    Redzone ffff8804540de840: bb bb bb bb bb bb bb bb ........
    Object ffff8804540de848: 6b 6b 6b 6b 6b 6b 6b 6b b5 52 00 00 f2 01 60 cc kkkkkkkk.R....`.

    Similarly, poisoning after the quarantine_put() leads to false positive
    use-after-free reports:

    BUG: KASAN: use-after-free in anon_vma_interval_tree_insert+0x304/0x430 at addr ffff880405c540a0
    Read of size 8 by task trinity-c0/3036
    CPU: 0 PID: 3036 Comm: trinity-c0 Not tainted 4.7.0-think+ #9
    Call Trace:
    dump_stack+0x68/0x96
    kasan_report_error+0x222/0x600
    __asan_report_load8_noabort+0x61/0x70
    anon_vma_interval_tree_insert+0x304/0x430
    anon_vma_chain_link+0x91/0xd0
    anon_vma_clone+0x136/0x3f0
    anon_vma_fork+0x81/0x4c0
    copy_process.part.47+0x2c43/0x5b20
    _do_fork+0x16d/0xbd0
    SyS_clone+0x19/0x20
    do_syscall_64+0x1a0/0x4e0
    entry_SYSCALL64_slow_path+0x25/0x25

    Fix this by putting an object in the quarantine after all other
    operations.
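
    A simplified sketch of the resulting order in kasan_slab_free():

        kasan_poison_shadow(object, rounded_up_size, KASAN_KMALLOC_FREE);
        set_track(&get_alloc_info(cache, object)->free_track, GFP_NOWAIT);
        /* last step: after this the object may be reused at any time */
        quarantine_put(get_free_info(cache, object), cache);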

    Fixes: 80a9201a5965 ("mm, kasan: switch SLUB to stackdepot, enable memory quarantine for SLUB")
    Link: http://lkml.kernel.org/r/1470062715-14077-1-git-send-email-aryabinin@virtuozzo.com
    Signed-off-by: Andrey Ryabinin
    Reported-by: Dave Jones
    Reported-by: Vegard Nossum
    Reported-by: Sasha Levin
    Acked-by: Alexander Potapenko
    Cc: Dmitry Vyukov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Ryabinin
     
  • We've had a report about soft lockups caused by lock bouncing in the
    soft reclaim path:

    BUG: soft lockup - CPU#0 stuck for 22s! [kav4proxy-kavic:3128]
    RIP: 0010:[] [] _raw_spin_lock+0x18/0x20
    Call Trace:
    mem_cgroup_soft_limit_reclaim+0x25a/0x280
    shrink_zones+0xed/0x200
    do_try_to_free_pages+0x74/0x320
    try_to_free_pages+0x112/0x180
    __alloc_pages_slowpath+0x3ff/0x820
    __alloc_pages_nodemask+0x1e9/0x200
    alloc_pages_vma+0xe1/0x290
    do_wp_page+0x19f/0x840
    handle_pte_fault+0x1cd/0x230
    do_page_fault+0x1fd/0x4c0
    page_fault+0x25/0x30

    There are no memcgs created so there cannot be any in the soft limit
    excess obviously:

    [...]
    memory 0 1 1

    so all this just seems to be mem_cgroup_largest_soft_limit_node trying
    to get spin_lock_irq(&mctz->lock) just to find out that the soft limit
    excess tree is empty. This is just a pointless waste of cycles and
    cache line bouncing during heavy parallel reclaim on large machines.
    The particular machine wasn't very healthy and was most probably
    suffering from a memory leak, which caused the memory reclaim to
    thrash heavily. But bouncing on the lock certainly didn't help...

    Fix this with an optimistic lockless check, bailing out early if the
    tree is empty. This is theoretically racy, but that shouldn't matter
    all that much. First of all, the soft limit is a best-effort feature,
    it is slowly being deprecated, and its usage should be really scarce.
    Bouncing on a lock without a good reason is surely a much bigger
    problem, especially on large CPU machines.
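
    The check, sketched (simplified from mem_cgroup_soft_limit_reclaim()):

        mctz = soft_limit_tree_node(pgdat->node_id);
        if (!mctz || RB_EMPTY_ROOT(&mctz->rb_root))
            return 0;   /* nothing in excess: skip mctz->lock entirely */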

    Link: http://lkml.kernel.org/r/1470073277-1056-1-git-send-email-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Zhong Jiang has reported a BUG_ON from huge_pte_alloc hitting when he
    runs his database load with memory online and offline running in
    parallel. The reason is that huge_pmd_share might detect a shared pmd
    which is currently under migration and so carries a migration pte,
    which is !pte_huge.

    There doesn't seem to be any easy way to prevent the race, and in fact
    seeing the migration swap entry is not harmful. Both callers of
    huge_pte_alloc are prepared to handle it. copy_hugetlb_page_range
    will copy the swap entry and make it COW if needed. hugetlb_fault
    will back off, so the page fault is retried if the page is still
    under migration, and hugetlb_fault waits for the migration to
    complete.

    That means the BUG_ON is wrong and we should update it. Let's simply
    check that all present ptes are pte_huge instead.
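
    The updated assertion, sketched in place:

        /* was: BUG_ON(pte && !pte_huge(*pte)); */
        BUG_ON(pte && pte_present(*pte) && !pte_huge(*pte));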

    Link: http://lkml.kernel.org/r/20160721074340.GA26398@dhcp22.suse.cz
    Signed-off-by: Michal Hocko
    Reported-by: zhongjiang
    Acked-by: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko