27 Aug, 2016

4 commits

  • For DAX inodes we need to be careful to never have page cache pages in
    the mapping->page_tree. This radix tree should be composed only of DAX
    exceptional entries and zero pages.

    ltp's readahead02 test was triggering a warning because we were trying
    to insert a DAX exceptional entry but found that a page cache page had
    already been inserted into the tree. This page was being inserted into
    the radix tree in response to a readahead(2) call.

    Readahead doesn't make sense for DAX inodes, but we don't want it to
    report a failure either. Instead, we just return success and don't do
    any work.

    Link: http://lkml.kernel.org/r/20160824221429.21158-1-ross.zwisler@linux.intel.com
    Signed-off-by: Ross Zwisler
    Reported-by: Jeff Moyer
    Cc: Dan Williams
    Cc: Dave Chinner
    Cc: Dave Hansen
    Cc: Jan Kara
    Cc: [4.5+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ross Zwisler
     
  • A bugfix in v4.8-rc2 introduced a harmless warning when
    CONFIG_MEMCG_SWAP is disabled but CONFIG_MEMCG is enabled:

    mm/memcontrol.c:4085:27: error: 'mem_cgroup_id_get_online' defined but not used [-Werror=unused-function]
    static struct mem_cgroup *mem_cgroup_id_get_online(struct mem_cgroup *memcg)

    This moves the function inside the #ifdef block that hides the
    calling function, to avoid the warning.

    Fixes: 1f47b61fb407 ("mm: memcontrol: fix swap counter leak on swapout from offline cgroup")
    Link: http://lkml.kernel.org/r/20160824113733.2776701-1-arnd@arndb.de
    Signed-off-by: Arnd Bergmann
    Acked-by: Michal Hocko
    Acked-by: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Arnd Bergmann
     
    The current wording of the COMPACTION Kconfig help text doesn't
    emphasise that disabling COMPACTION can cripple the page allocator,
    which relies quite heavily on compaction for high-order requests, and
    that an unexpected OOM can happen without it. Make sure we are vocal
    about that.

    Link: http://lkml.kernel.org/r/20160823091726.GK23577@dhcp22.suse.cz
    Signed-off-by: Michal Hocko
    Cc: Markus Trippelsdorf
    Cc: Mel Gorman
    Cc: Joonsoo Kim
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
    While adding proper userfaultfd_wp support, with bits in the pagetable
    and swap entry to avoid false-positive WP userfaults through swap/fork/
    KSM/etc., I've been adding a framework that mostly mirrors soft dirty.

    In doing so I noticed one place where I had to add uffd_wp support to
    pagetables that soft_dirty wasn't covering, and I think it should have
    been.

    Example: in the THP migration code migrate_misplaced_transhuge_page()
    pmd_mkdirty is called unconditionally after mk_huge_pmd.

    entry = mk_huge_pmd(new_page, vma->vm_page_prot);
    entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);

    That sets soft dirty too (it's a false positive for soft dirty; the
    soft dirty bit could be more fine-grained and transfer the bit the way
    uffd_wp will: pmd/pte_uffd_wp() enforces the invariant that when it's
    set, pmd/pte_write is not set).

    However, in the THP split there's no unconditional pmd_mkdirty after
    mk_huge_pmd, and pte_swp_mksoft_dirty isn't called after the migration
    entry is created. The code sets the dirty bit in the struct page
    instead of setting it in the pagetable (which is fully equivalent as
    far as the real dirty bit is concerned, since the whole point of
    pagetable bits is to be eventually flushed out to the page, but not
    equivalent for the soft-dirty bit, which gets lost in translation).

    This was found by code review only and totally untested as I'm working
    to actually replace soft dirty and I don't have time to test potential
    soft dirty bugfixes as well :).

    Transfer the soft_dirty from pmd to pte during THP splits.

    This fix avoids losing the soft_dirty bit and avoids userland memory
    corruption in the checkpoint.

    Fixes: eef1b3ba053aa6 ("thp: implement split_huge_pmd()")
    Link: http://lkml.kernel.org/r/1471610515-30229-2-git-send-email-aarcange@redhat.com
    Signed-off-by: Andrea Arcangeli
    Acked-by: Pavel Emelyanov
    Cc: "Kirill A. Shutemov"
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     

23 Aug, 2016

2 commits

  • When running with a local patch which moves the '_stext' symbol to the
    very beginning of the kernel text area, I got the following panic with
    CONFIG_HARDENED_USERCOPY:

    usercopy: kernel memory exposure attempt detected from ffff88103dfff000 () (4096 bytes)
    ------------[ cut here ]------------
    kernel BUG at mm/usercopy.c:79!
    invalid opcode: 0000 [#1] SMP
    ...
    CPU: 0 PID: 4800 Comm: cp Not tainted 4.8.0-rc3.after+ #1
    Hardware name: Dell Inc. PowerEdge R720/0X3D66, BIOS 2.5.4 01/22/2016
    task: ffff880817444140 task.stack: ffff880816274000
    RIP: 0010:[] __check_object_size+0x76/0x413
    RSP: 0018:ffff880816277c40 EFLAGS: 00010246
    RAX: 000000000000006b RBX: ffff88103dfff000 RCX: 0000000000000000
    RDX: 0000000000000000 RSI: ffff88081f80dfa8 RDI: ffff88081f80dfa8
    RBP: ffff880816277c90 R08: 000000000000054c R09: 0000000000000000
    R10: 0000000000000005 R11: 0000000000000006 R12: 0000000000001000
    R13: ffff88103e000000 R14: ffff88103dffffff R15: 0000000000000001
    FS: 00007fb9d1750800(0000) GS:ffff88081f800000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 00000000021d2000 CR3: 000000081a08f000 CR4: 00000000001406f0
    Stack:
    ffff880816277cc8 0000000000010000 000000043de07000 0000000000000000
    0000000000001000 ffff880816277e60 0000000000001000 ffff880816277e28
    000000000000c000 0000000000001000 ffff880816277ce8 ffffffff8136c3a6
    Call Trace:
    [] copy_page_to_iter_iovec+0xa6/0x1c0
    [] copy_page_to_iter+0x16/0x90
    [] generic_file_read_iter+0x3e3/0x7c0
    [] ? xfs_file_buffered_aio_write+0xad/0x260 [xfs]
    [] ? down_read+0x12/0x40
    [] xfs_file_buffered_aio_read+0x51/0xc0 [xfs]
    [] xfs_file_read_iter+0x62/0xb0 [xfs]
    [] __vfs_read+0xdf/0x130
    [] vfs_read+0x8e/0x140
    [] SyS_read+0x55/0xc0
    [] do_syscall_64+0x67/0x160
    [] entry_SYSCALL64_slow_path+0x25/0x25
    RIP: 0033:[] 0x7fb9d0c33c00
    RSP: 002b:00007ffc9c262f28 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
    RAX: ffffffffffffffda RBX: fffffffffff8ffff RCX: 00007fb9d0c33c00
    RDX: 0000000000010000 RSI: 00000000021c3000 RDI: 0000000000000004
    RBP: 00000000021c3000 R08: 0000000000000000 R09: 00007ffc9c264d6c
    R10: 00007ffc9c262c50 R11: 0000000000000246 R12: 0000000000010000
    R13: 00007ffc9c2630b0 R14: 0000000000000004 R15: 0000000000010000
    Code: 81 48 0f 44 d0 48 c7 c6 90 4d a3 81 48 c7 c0 bb b3 a2 81 48 0f 44 f0 4d 89 e1 48 89 d9 48 c7 c7 68 16 a3 81 31 c0 e8 f4 57 f7 ff 0b 48 8d 90 00 40 00 00 48 39 d3 0f 83 22 01 00 00 48 39 c3
    RIP [] __check_object_size+0x76/0x413
    RSP

    The checked object's range [ffff88103dfff000, ffff88103e000000) is
    valid, so there shouldn't have been a BUG. The hardened usercopy code
    got confused because the range's ending address is the same as the
    kernel's text starting address at 0xffff88103e000000. The overlap check
    is slightly off.

    Fixes: f5509cc18daa ("mm: Hardened usercopy")
    Signed-off-by: Josh Poimboeuf
    Signed-off-by: Kees Cook

    Josh Poimboeuf
     
  • check_bogus_address() checked for pointer overflow using this expression,
    where 'ptr' has type 'const void *':

    ptr + n < ptr

    Since pointer wraparound is undefined behavior, gcc at -O2 by default
    treats it like the following, which would not behave as intended:

    (long)n < 0

    Fortunately, this doesn't currently happen for kernel code because kernel
    code is compiled with -fno-strict-overflow. But the expression should be
    fixed anyway to use well-defined integer arithmetic, since it could be
    treated differently by different compilers in the future or could be
    reported by tools checking for undefined behavior.

    Signed-off-by: Eric Biggers
    Signed-off-by: Kees Cook

    Eric Biggers
     

12 Aug, 2016

7 commits

  • The following oops occurs after a pgdat is hotadded:

    Unable to handle kernel paging request for data at address 0x00c30001
    Faulting instruction address: 0xc00000000022f8f4
    Oops: Kernel access of bad area, sig: 11 [#1]
    SMP NR_CPUS=2048 NUMA pSeries
    Modules linked in: ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 xt_conntrack ebtable_nat ebtable_broute bridge stp llc ebtable_filter ebtables ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw ip6table_filter ip6_tables iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_security iptable_raw iptable_filter nls_utf8 isofs sg virtio_balloon uio_pdrv_genirq uio ip_tables xfs libcrc32c sr_mod cdrom sd_mod virtio_net ibmvscsi scsi_transport_srp virtio_pci virtio_ring virtio dm_mirror dm_region_hash dm_log dm_mod
    CPU: 0 PID: 0 Comm: swapper/0 Tainted: G W 4.8.0-rc1-device #110
    task: c000000000ef3080 task.stack: c000000000f6c000
    NIP: c00000000022f8f4 LR: c00000000022f948 CTR: 0000000000000000
    REGS: c000000000f6fa50 TRAP: 0300 Tainted: G W (4.8.0-rc1-device)
    MSR: 800000010280b033 CR: 84002028 XER: 20000000
    CFAR: d000000001d2013c DAR: 0000000000c30001 DSISR: 40000000 SOFTE: 0
    NIP refresh_cpu_vm_stats+0x1a4/0x2f0
    LR refresh_cpu_vm_stats+0x1f8/0x2f0
    Call Trace:
    refresh_cpu_vm_stats+0x1f8/0x2f0 (unreliable)

    Add per_cpu_nodestats initialization to the hotplug codepath.

    Link: http://lkml.kernel.org/r/1470931473-7090-1-git-send-email-arbab@linux.vnet.ibm.com
    Signed-off-by: Reza Arbab
    Cc: Mel Gorman
    Cc: Paul Mackerras
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Reza Arbab
     
  • mm/oom_kill.c: In function `task_will_free_mem':
    mm/oom_kill.c:767: warning: `ret' may be used uninitialized in this function

    If __task_will_free_mem() is never called inside the for_each_process()
    loop, ret will not be initialized.

    Fixes: 1af8bb43269563e4 ("mm, oom: fortify task_will_free_mem()")
    Link: http://lkml.kernel.org/r/1470255599-24841-1-git-send-email-geert@linux-m68k.org
    Signed-off-by: Geert Uytterhoeven
    Acked-by: Tetsuo Handa
    Acked-by: Michal Hocko
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Geert Uytterhoeven
     
    It's quite unlikely that the user will have so little memory that the
    per-CPU quarantines won't fit into the given fraction of the available
    memory. Even in that case, they won't be able to do anything with the
    information given in the warning.

    Link: http://lkml.kernel.org/r/1470929182-101413-1-git-send-email-glider@google.com
    Signed-off-by: Alexander Potapenko
    Acked-by: Andrey Ryabinin
    Cc: Dmitry Vyukov
    Cc: Andrey Konovalov
    Cc: Christoph Lameter
    Cc: Joonsoo Kim
    Cc: Kuthonuzo Luruo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexander Potapenko
     
  • Since commit 73f576c04b94 ("mm: memcontrol: fix cgroup creation failure
    after many small jobs") swap entries do not pin memcg->css.refcnt
    directly. Instead, they pin memcg->id.ref. So we should adjust the
    reference counters accordingly when moving swap charges between cgroups.

    Fixes: 73f576c04b941 ("mm: memcontrol: fix cgroup creation failure after many small jobs")
    Link: http://lkml.kernel.org/r/9ce297c64954a42dc90b543bc76106c4a94f07e8.1470219853.git.vdavydov@virtuozzo.com
    Signed-off-by: Vladimir Davydov
    Acked-by: Michal Hocko
    Acked-by: Johannes Weiner
    Cc: [3.19+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • An offline memory cgroup might have anonymous memory or shmem left
    charged to it and no swap. Since only swap entries pin the id of an
    offline cgroup, such a cgroup will have no id and so an attempt to
    swapout its anon/shmem will not store memory cgroup info in the swap
    cgroup map. As a result, memcg->swap or memcg->memsw will never get
    uncharged from it or from any of its ancestors.

    Fix this by always charging swapout to the first ancestor cgroup that
    hasn't released its id yet.

    [hannes@cmpxchg.org: add comment to mem_cgroup_swapout]
    [vdavydov@virtuozzo.com: use WARN_ON_ONCE() in mem_cgroup_id_get_online()]
    Link: http://lkml.kernel.org/r/20160803123445.GJ13263@esperanza
    Fixes: 73f576c04b941 ("mm: memcontrol: fix cgroup creation failure after many small jobs")
    Link: http://lkml.kernel.org/r/5336daa5c9a32e776067773d9da655d2dc126491.1470219853.git.vdavydov@virtuozzo.com
    Signed-off-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: [3.19+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • meminfo_proc_show() and si_mem_available() are using the wrong helpers
    for calculating the size of the LRUs. The user-visible impact is that
    there appears to be an abnormally high number of unevictable pages.

    Link: http://lkml.kernel.org/r/20160805105805.GR2799@techsingularity.net
    Signed-off-by: Mel Gorman
    Cc: Dave Chinner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
    When memory hotplug takes a movable node offline, its free hugepages
    are freed, but the pool size is not updated, so
    /proc/sys/vm/nr_hugepages becomes incorrect.

    Fix it by reducing max_huge_pages when the node is offlined.

    n-horiguchi@ah.jp.nec.com said:

    : dissolve_free_huge_page intends to break a hugepage into buddy, and the
    : destination hugepage is supposed to be allocated from the pool of the
    : destination node, so the system-wide pool size is reduced. So adding
    : h->max_huge_pages-- makes sense to me.

    Link: http://lkml.kernel.org/r/1470624546-902-1-git-send-email-zhongjiang@huawei.com
    Signed-off-by: zhong jiang
    Cc: Mike Kravetz
    Acked-by: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    zhong jiang
     

11 Aug, 2016

6 commits

  • With debugobjects enabled and using SLAB_DESTROY_BY_RCU, when a
    kmem_cache_node is destroyed the call_rcu() may trigger a slab
    allocation to fill the debug object pool (__debug_object_init:fill_pool).

    Everywhere but during kmem_cache_destroy(), discard_slab() is performed
    outside of the kmem_cache_node->list_lock and avoids a lockdep warning
    about potential recursion:

    =============================================
    [ INFO: possible recursive locking detected ]
    4.8.0-rc1-gfxbench+ #1 Tainted: G U
    ---------------------------------------------
    rmmod/8895 is trying to acquire lock:
    (&(&n->list_lock)->rlock){-.-...}, at: [] get_partial_node.isra.63+0x47/0x430

    but task is already holding lock:
    (&(&n->list_lock)->rlock){-.-...}, at: [] __kmem_cache_shutdown+0x54/0x320

    other info that might help us debug this:
    Possible unsafe locking scenario:
    CPU0
    ----
    lock(&(&n->list_lock)->rlock);
    lock(&(&n->list_lock)->rlock);

    *** DEADLOCK ***
    May be due to missing lock nesting notation
    5 locks held by rmmod/8895:
    #0: (&dev->mutex){......}, at: driver_detach+0x42/0xc0
    #1: (&dev->mutex){......}, at: driver_detach+0x50/0xc0
    #2: (cpu_hotplug.dep_map){++++++}, at: get_online_cpus+0x2d/0x80
    #3: (slab_mutex){+.+.+.}, at: kmem_cache_destroy+0x3c/0x220
    #4: (&(&n->list_lock)->rlock){-.-...}, at: __kmem_cache_shutdown+0x54/0x320

    stack backtrace:
    CPU: 6 PID: 8895 Comm: rmmod Tainted: G U 4.8.0-rc1-gfxbench+ #1
    Hardware name: Gigabyte Technology Co., Ltd. H87M-D3H/H87M-D3H, BIOS F11 08/18/2015
    Call Trace:
    __lock_acquire+0x1646/0x1ad0
    lock_acquire+0xb2/0x200
    _raw_spin_lock+0x36/0x50
    get_partial_node.isra.63+0x47/0x430
    ___slab_alloc.constprop.67+0x1a7/0x3b0
    __slab_alloc.isra.64.constprop.66+0x43/0x80
    kmem_cache_alloc+0x236/0x2d0
    __debug_object_init+0x2de/0x400
    debug_object_activate+0x109/0x1e0
    __call_rcu.constprop.63+0x32/0x2f0
    call_rcu+0x12/0x20
    discard_slab+0x3d/0x40
    __kmem_cache_shutdown+0xdb/0x320
    shutdown_cache+0x19/0x60
    kmem_cache_destroy+0x1ae/0x220
    i915_gem_load_cleanup+0x14/0x40 [i915]
    i915_driver_unload+0x151/0x180 [i915]
    i915_pci_remove+0x14/0x20 [i915]
    pci_device_remove+0x34/0xb0
    __device_release_driver+0x95/0x140
    driver_detach+0xb6/0xc0
    bus_remove_driver+0x53/0xd0
    driver_unregister+0x27/0x50
    pci_unregister_driver+0x25/0x70
    i915_exit+0x1a/0x1e2 [i915]
    SyS_delete_module+0x193/0x1f0
    entry_SYSCALL_64_fastpath+0x1c/0xac

    Fixes: 52b4b950b507 ("mm: slab: free kmem_cache_node after destroy sysfs file")
    Link: http://lkml.kernel.org/r/1470759070-18743-1-git-send-email-chris@chris-wilson.co.uk
    Reported-by: Dave Gordon
    Signed-off-by: Chris Wilson
    Reviewed-by: Vladimir Davydov
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Dmitry Safonov
    Cc: Daniel Vetter
    Cc: Dave Gordon
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chris Wilson
     
  • In page_remove_file_rmap(.) we have the following check:

    VM_BUG_ON_PAGE(compound && !PageTransHuge(page), page);

    This is meant to check for either HugeTLB pages or THP when a compound
    page is passed in.

    Unfortunately, if one disables CONFIG_TRANSPARENT_HUGEPAGE, then
    PageTransHuge(.) will always return false, provoking BUGs when one runs
    the libhugetlbfs test suite.

    This patch replaces PageTransHuge() with PageHead(), which works for
    both HugeTLB and THP.

    Fixes: dd78fedde4b9 ("rmap: support file thp")
    Link: http://lkml.kernel.org/r/1470838217-5889-1-git-send-email-steve.capper@arm.com
    Signed-off-by: Steve Capper
    Acked-by: Kirill A. Shutemov
    Cc: Huang Shijie
    Cc: Will Deacon
    Cc: Catalin Marinas
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Steve Capper
     
    PageTransCompound() doesn't distinguish THP from any other type of
    compound page. This can lead to a false-positive VM_BUG_ON() in
    page_add_file_rmap() if it is called on a compound page from a
    driver[1].

    I think we can exclude such cases by checking whether the page belongs
    to a mapping.

    The VM_BUG_ON_PAGE() is downgraded to VM_WARN_ON_ONCE(). This path
    should not cause any harm to non-THP pages, but it's good to know if
    we step on anything else.

    [1] http://lkml.kernel.org/r/c711e067-0bff-a6cb-3c37-04dfe77d2db1@redhat.com

    Link: http://lkml.kernel.org/r/20160810161345.GA67522@black.fi.intel.com
    Signed-off-by: Kirill A. Shutemov
    Reported-by: Laura Abbott
    Tested-by: Laura Abbott
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
    Some node thresholds depend on the number of managed pages in the
    node. When memory goes on/offline, that number can change, so the
    thresholds need to be adjusted.

    Add recalculation at the appropriate places and clean up the related
    functions for better maintenance.

    Link: http://lkml.kernel.org/r/1470724248-26780-2-git-send-email-iamjoonsoo.kim@lge.com
    Signed-off-by: Joonsoo Kim
    Acked-by: Mel Gorman
    Cc: Vlastimil Babka
    Cc: Johannes Weiner
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
    When resetting min_unmapped_pages, the code initializes
    min_slab_pages instead; it needs to initialize min_unmapped_pages.

    Fixes: a5f5f91da6 (mm: convert zone_reclaim to node_reclaim)
    Link: http://lkml.kernel.org/r/1470724248-26780-1-git-send-email-iamjoonsoo.kim@lge.com
    Signed-off-by: Joonsoo Kim
    Acked-by: Mel Gorman
    Cc: Vlastimil Babka
    Cc: Johannes Weiner
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • The newly introduced shmem_huge_enabled() function has two definitions,
    but neither of them is visible if CONFIG_SYSFS is disabled, leading to a
    build error:

    mm/khugepaged.o: In function `khugepaged':
    khugepaged.c:(.text.khugepaged+0x3ca): undefined reference to `shmem_huge_enabled'

    This changes the #ifdef guards around the definition to match those that
    are used in the header file.

    Fixes: e496cf3d7821 ("thp: introduce CONFIG_TRANSPARENT_HUGE_PAGECACHE")
    Link: http://lkml.kernel.org/r/20160809123638.1357593-1-arnd@arndb.de
    Signed-off-by: Arnd Bergmann
    Acked-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Arnd Bergmann
     

10 Aug, 2016

1 commit

  • To distinguish non-slab pages charged to kmemcg we mark them PageKmemcg,
    which sets page->_mapcount to -512. Currently, we set/clear PageKmemcg
    in __alloc_pages_nodemask()/free_pages_prepare() for any page allocated
    with __GFP_ACCOUNT, including those that aren't actually charged to any
    cgroup, i.e. allocated from the root cgroup context. To avoid overhead
    in case cgroups are not used, we only do that if memcg_kmem_enabled() is
    true. The latter is set iff there are kmem-enabled memory cgroups
    (online or offline). The root cgroup is not considered kmem-enabled.

    As a result, if a page is allocated with __GFP_ACCOUNT for the root
    cgroup when there are kmem-enabled memory cgroups and is freed after all
    kmem-enabled memory cgroups were removed, e.g.

    # no memory cgroups has been created yet, create one
    mkdir /sys/fs/cgroup/memory/test
    # run something allocating pages with __GFP_ACCOUNT, e.g.
    # a program using pipe
    dmesg | tail
    # remove the memory cgroup
    rmdir /sys/fs/cgroup/memory/test

    we'll get a bad page state bug complaining about page->_mapcount != -1:

    BUG: Bad page state in process swapper/0 pfn:1fd945c
    page:ffffea007f651700 count:0 mapcount:-511 mapping: (null) index:0x0
    flags: 0x1000000000000000()

    To avoid that, let's mark with PageKmemcg only those pages that are
    actually charged to and hence pin a non-root memory cgroup.

    Fixes: 4949148ad433 ("mm: charge/uncharge kmemcg from generic page allocator paths")
    Reported-and-tested-by: Eric Dumazet
    Signed-off-by: Vladimir Davydov
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     

09 Aug, 2016

1 commit

  • Pull usercopy protection from Kees Cook:
    "This implements HARDENED_USERCOPY verification of copy_to_user and
    copy_from_user bounds checking for most architectures on SLAB and
    SLUB"

    * tag 'usercopy-v4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
    mm: SLUB hardened usercopy support
    mm: SLAB hardened usercopy support
    s390/uaccess: Enable hardened usercopy
    sparc/uaccess: Enable hardened usercopy
    powerpc/uaccess: Enable hardened usercopy
    ia64/uaccess: Enable hardened usercopy
    arm64/uaccess: Enable hardened usercopy
    ARM: uaccess: Enable hardened usercopy
    x86/uaccess: Enable hardened usercopy
    mm: Hardened usercopy
    mm: Implement stack frame object validation
    mm: Add is_migrate_cma_page

    Linus Torvalds
     

08 Aug, 2016

2 commits


06 Aug, 2016

1 commit

  • Pull block fixes from Jens Axboe:
    "Here's the second round of block updates for this merge window.

    It's a mix of fixes for changes that went in previously in this round,
    and fixes in general. This pull request contains:

    - Fixes for loop from Christoph

    - A bdi vs gendisk lifetime fix from Dan, worth two cookies.

    - A blk-mq timeout fix, when on frozen queues. From Gabriel.

    - Writeback fix from Jan, ensuring that __writeback_single_inode()
    does the right thing.

    - Fix for bio->bi_rw usage in f2fs from me.

    - Error path deadlock fix in blk-mq sysfs registration from me.

    - Floppy O_ACCMODE fix from Jiri.

    - Fix to the new bio op methods from Mike.

    One more followup will be coming here, ensuring that we don't
    propagate the block types outside of block. That, and a rename of
    bio->bi_rw is coming right after -rc1 is cut.

    - Various little fixes"

    * 'for-linus' of git://git.kernel.dk/linux-block:
    mm/block: convert rw_page users to bio op use
    loop: make do_req_filebacked more robust
    loop: don't try to use AIO for discards
    blk-mq: fix deadlock in blk_mq_register_disk() error path
    Include: blkdev: Removed duplicate 'struct request;' declaration.
    Fixup direct bi_rw modifiers
    block: fix bdi vs gendisk lifetime mismatch
    blk-mq: Allow timeouts to run while queue is freezing
    nbd: fix race in ioctl
    block: fix use-after-free in seq file
    f2fs: drop bio->bi_rw manual assignment
    block: add missing group association in bio-cloning functions
    blkcg: kill unused field nr_undestroyed_grps
    writeback: Write dirty times for WB_SYNC_ALL writeback
    floppy: fix open(O_ACCMODE) for ioctl-only open

    Linus Torvalds
     

05 Aug, 2016

8 commits

  • Pull more powerpc updates from Michael Ellerman:
    "These were delayed for various reasons, so I let them sit in next a
    bit longer, rather than including them in my first pull request.

    Fixes:
    - Fix early access to cpu_spec relocation from Benjamin Herrenschmidt
    - Fix incorrect event codes in power9-event-list from Madhavan Srinivasan
    - Move register_process_table() out of ppc_md from Michael Ellerman

    Use jump_label use for [cpu|mmu]_has_feature():
    - Add mmu_early_init_devtree() from Michael Ellerman
    - Move disable_radix handling into mmu_early_init_devtree() from Michael Ellerman
    - Do hash device tree scanning earlier from Michael Ellerman
    - Do radix device tree scanning earlier from Michael Ellerman
    - Do feature patching before MMU init from Michael Ellerman
    - Check features don't change after patching from Michael Ellerman
    - Make MMU_FTR_RADIX a MMU family feature from Aneesh Kumar K.V
    - Convert mmu_has_feature() to returning bool from Michael Ellerman
    - Convert cpu_has_feature() to returning bool from Michael Ellerman
    - Define radix_enabled() in one place & use static inline from Michael Ellerman
    - Add early_[cpu|mmu]_has_feature() from Michael Ellerman
    - Convert early cpu/mmu feature check to use the new helpers from Aneesh Kumar K.V
    - jump_label: Make it possible for arches to invoke jump_label_init() earlier from Kevin Hao
    - Call jump_label_init() in apply_feature_fixups() from Aneesh Kumar K.V
    - Remove mfvtb() from Kevin Hao
    - Move cpu_has_feature() to a separate file from Kevin Hao
    - Add kconfig option to use jump labels for cpu/mmu_has_feature() from Michael Ellerman
    - Add option to use jump label for cpu_has_feature() from Kevin Hao
    - Add option to use jump label for mmu_has_feature() from Kevin Hao
    - Catch usage of cpu/mmu_has_feature() before jump label init from Aneesh Kumar K.V
    - Annotate jump label assembly from Michael Ellerman

    TLB flush enhancements from Aneesh Kumar K.V:
    - radix: Implement tlb mmu gather flush efficiently
    - Add helper for finding SLBE LLP encoding
    - Use hugetlb flush functions
    - Drop multiple definition of mm_is_core_local
    - radix: Add tlb flush of THP ptes
    - radix: Rename function and drop unused arg
    - radix/hugetlb: Add helper for finding page size
    - hugetlb: Add flush_hugetlb_tlb_range
    - remove flush_tlb_page_nohash

    Add new ptrace regsets from Anshuman Khandual and Simon Guo:
    - elf: Add powerpc specific core note sections
    - Add the function flush_tmregs_to_thread
    - Enable in transaction NT_PRFPREG ptrace requests
    - Enable in transaction NT_PPC_VMX ptrace requests
    - Enable in transaction NT_PPC_VSX ptrace requests
    - Adapt gpr32_get, gpr32_set functions for transaction
    - Enable support for NT_PPC_CGPR
    - Enable support for NT_PPC_CFPR
    - Enable support for NT_PPC_CVMX
    - Enable support for NT_PPC_CVSX
    - Enable support for TM SPR state
    - Enable NT_PPC_TM_CTAR, NT_PPC_TM_CPPR, NT_PPC_TM_CDSCR
    - Enable support for NT_PPPC_TAR, NT_PPC_PPR, NT_PPC_DSCR
    - Enable support for EBB registers
    - Enable support for Performance Monitor registers"

    * tag 'powerpc-4.8-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (48 commits)
    powerpc/mm: Move register_process_table() out of ppc_md
    powerpc/perf: Fix incorrect event codes in power9-event-list
    powerpc/32: Fix early access to cpu_spec relocation
    powerpc/ptrace: Enable support for Performance Monitor registers
    powerpc/ptrace: Enable support for EBB registers
    powerpc/ptrace: Enable support for NT_PPPC_TAR, NT_PPC_PPR, NT_PPC_DSCR
    powerpc/ptrace: Enable NT_PPC_TM_CTAR, NT_PPC_TM_CPPR, NT_PPC_TM_CDSCR
    powerpc/ptrace: Enable support for TM SPR state
    powerpc/ptrace: Enable support for NT_PPC_CVSX
    powerpc/ptrace: Enable support for NT_PPC_CVMX
    powerpc/ptrace: Enable support for NT_PPC_CFPR
    powerpc/ptrace: Enable support for NT_PPC_CGPR
    powerpc/ptrace: Adapt gpr32_get, gpr32_set functions for transaction
    powerpc/ptrace: Enable in transaction NT_PPC_VSX ptrace requests
    powerpc/ptrace: Enable in transaction NT_PPC_VMX ptrace requests
    powerpc/ptrace: Enable in transaction NT_PRFPREG ptrace requests
    powerpc/process: Add the function flush_tmregs_to_thread
    elf: Add powerpc specific core note sections
    powerpc/mm: remove flush_tlb_page_nohash
    powerpc/mm/hugetlb: Add flush_hugetlb_tlb_range
    ...

    Linus Torvalds
     
    __next_mem_range_rev() causes a NULL dereference and fails to return
    type_a->regions[0] info if its type_b parameter is NULL.

    Fix this by checking type_b before dereferencing it and by
    initializing idx_b to 0.

    The fix was tested by dumping all types of region via
    __memblock_dump_all() and via the fixed __next_mem_range_rev()
    separately to UART; checking the logs shows the result is okay.

    Link: http://lkml.kernel.org/r/57A0320D.6070102@zoho.com
    Signed-off-by: zijun_hu
    Tested-by: zijun_hu
    Acked-by: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    zijun_hu
     
  • With m68k-linux-gnu-gcc-4.1:

    include/linux/slub_def.h:126: warning: `fixup_red_left' declared inline after being called
    include/linux/slub_def.h:126: warning: previous declaration of `fixup_red_left' was here

    Commit c146a2b98eb5 ("mm, kasan: account for object redzone in SLUB's
    nearest_obj()") made fixup_red_left() global, but forgot to remove the
    inline keyword.

    Fixes: c146a2b98eb5898e ("mm, kasan: account for object redzone in SLUB's nearest_obj()")
    Link: http://lkml.kernel.org/r/1470256262-1586-1-git-send-email-geert@linux-m68k.org
    Signed-off-by: Geert Uytterhoeven
    Cc: Alexander Potapenko
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Geert Uytterhoeven
     
  • Paul Mackerras and Reza Arbab reported that machines with memoryless
    nodes fail when vmstats are refreshed. Paul reported an oops as follows

    Unable to handle kernel paging request for data at address 0xff7a10000
    Faulting instruction address: 0xc000000000270cd0
    Oops: Kernel access of bad area, sig: 11 [#1]
    SMP NR_CPUS=2048 NUMA PowerNV
    Modules linked in:
    CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.7.0-kvm+ #118
    task: c000000ff0680010 task.stack: c000000ff0704000
    NIP: c000000000270cd0 LR: c000000000270ce8 CTR: 0000000000000000
    REGS: c000000ff0707900 TRAP: 0300 Not tainted (4.7.0-kvm+)
    MSR: 9000000102009033 CR: 846b6824 XER: 20000000
    CFAR: c000000000008768 DAR: 0000000ff7a10000 DSISR: 42000000 SOFTE: 1
    NIP refresh_zone_stat_thresholds+0x80/0x240
    LR refresh_zone_stat_thresholds+0x98/0x240
    Call Trace:
    refresh_zone_stat_thresholds+0xb8/0x240 (unreliable)

    Both supplied potential fixes, but one potentially missed checks and
    the other had redundant initialisations. This version initialises
    per_cpu_nodestats on a per-pgdat basis instead of a per-zone basis.

    Link: http://lkml.kernel.org/r/20160804092404.GI2799@techsingularity.net
    Signed-off-by: Mel Gorman
    Reported-by: Paul Mackerras
    Reported-by: Reza Arbab
    Tested-by: Reza Arbab
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • s/accomodate/accommodate/

    Link: http://lkml.kernel.org/r/20160804121824.18100-1-kuleshovmail@gmail.com
    Signed-off-by: Alexander Kuleshov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexander Kuleshov
     
  • At present, memory online and offline will fail when KASAN is
    enabled, so add a check that disallows memory hotplug while KASAN is
    active.

    Link: http://lkml.kernel.org/r/1470063651-29519-1-git-send-email-zhongjiang@huawei.com
    Signed-off-by: zhong jiang
    Cc: Andrey Ryabinin
    Cc: Alexander Potapenko
    Cc: Dmitry Vyukov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    zhong jiang
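
    One way to express such a limit is a hotplug notifier that vetoes the
    transition while KASAN is active. The sketch below models that veto;
    the constants and the kasan_active flag are illustrative stand-ins,
    not the kernel's actual encodings:

    ```c
    #include <assert.h>

    #define NOTIFY_OK  0
    #define NOTIFY_BAD 1   /* illustrative values, not the kernel's */

    enum mem_action { MEM_GOING_ONLINE, MEM_GOING_OFFLINE };

    static int kasan_active = 1;   /* stand-in for CONFIG_KASAN being on */

    /* With KASAN active, refuse every online/offline transition. */
    static int kasan_mem_notifier(enum mem_action action)
    {
        (void)action;              /* all transitions are refused alike */
        return kasan_active ? NOTIFY_BAD : NOTIFY_OK;
    }
    ```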
     
  • The rw_page users were not converted to use bio/req ops. As a result
    bdev_write_page is not passing down REQ_OP_WRITE and the IOs will
    be sent down as reads.

    Signed-off-by: Mike Christie
    Fixes: 4e1b2d52a80d ("block, fs, drivers: remove REQ_OP compat defs and related code")

    Modified by me to:

    1) Drop op_flags passing into ->rw_page(), as we don't use it.
    2) Make op_is_write() and friends safe to use for !CONFIG_BLOCK

    Signed-off-by: Jens Axboe

    Mike Christie
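
    The failure mode can be sketched: unless the write wrapper tags the
    operation explicitly, the op handed to the page path defaults to a
    read. Names below are illustrative stand-ins for bdev_write_page()
    and ->rw_page():

    ```c
    #include <assert.h>

    enum req_op { REQ_OP_READ = 0, REQ_OP_WRITE = 1 };

    /* Stand-in for ->rw_page(): records which op it was handed. */
    static enum req_op last_op = REQ_OP_READ;

    static int rw_page_demo(enum req_op op)
    {
        last_op = op;
        return 0;
    }

    /* The fix in miniature: the write wrapper passes REQ_OP_WRITE down
     * explicitly; before the conversion the op defaulted to a read. */
    static int bdev_write_page_demo(void)
    {
        return rw_page_demo(REQ_OP_WRITE);
    }
    ```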
     
  • The name for a bdi of a gendisk is derived from the gendisk's devt.
    However, since the gendisk is destroyed before the bdi it leaves a
    window where a new gendisk could dynamically reuse the same devt while a
    bdi with the same name is still live. Arrange for the bdi to hold a
    reference against its "owner" disk device while it is registered.
    Otherwise we can hit sysfs duplicate name collisions like the following:

    WARNING: CPU: 10 PID: 2078 at fs/sysfs/dir.c:31 sysfs_warn_dup+0x64/0x80
    sysfs: cannot create duplicate filename '/devices/virtual/bdi/259:1'

    Hardware name: HP ProLiant DL580 Gen8, BIOS P79 05/06/2015
    0000000000000286 0000000002c04ad5 ffff88006f24f970 ffffffff8134caec
    ffff88006f24f9c0 0000000000000000 ffff88006f24f9b0 ffffffff8108c351
    0000001f0000000c ffff88105d236000 ffff88105d1031e0 ffff8800357427f8
    Call Trace:
    [] dump_stack+0x63/0x87
    [] __warn+0xd1/0xf0
    [] warn_slowpath_fmt+0x5f/0x80
    [] sysfs_warn_dup+0x64/0x80
    [] sysfs_create_dir_ns+0x7e/0x90
    [] kobject_add_internal+0xaa/0x320
    [] ? vsnprintf+0x34e/0x4d0
    [] kobject_add+0x75/0xd0
    [] ? mutex_lock+0x12/0x2f
    [] device_add+0x125/0x610
    [] device_create_groups_vargs+0xd8/0x100
    [] device_create_vargs+0x1c/0x20
    [] bdi_register+0x8c/0x180
    [] bdi_register_dev+0x27/0x30
    [] add_disk+0x175/0x4a0

    Cc:
    Reported-by: Yi Zhang
    Tested-by: Yi Zhang
    Signed-off-by: Dan Williams

    Fixed up missing 0 return in bdi_register_owner().

    Signed-off-by: Jens Axboe

    Dan Williams
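
    The ownership scheme can be modeled with a plain reference count: the
    bdi pins its owner device for its registered lifetime, so the devt
    (and the sysfs name derived from it) cannot be recycled underneath
    it. The get/put pair below is a simplified stand-in for the kernel's
    device refcounting:

    ```c
    #include <assert.h>
    #include <stddef.h>

    struct device { int refcount; };
    struct bdi { struct device *owner; };

    static void get_device_demo(struct device *d) { d->refcount++; }
    static void put_device_demo(struct device *d) { d->refcount--; }

    /* bdi_register_owner() in miniature: take a reference on the disk's
     * device so its name/devt stay reserved while the bdi is live. */
    static int bdi_register_owner_demo(struct bdi *bdi, struct device *owner)
    {
        bdi->owner = owner;
        get_device_demo(owner);
        return 0;               /* the missing 0 return noted above */
    }

    static void bdi_unregister_demo(struct bdi *bdi)
    {
        if (bdi->owner) {
            put_device_demo(bdi->owner);  /* release only at unregister */
            bdi->owner = NULL;
        }
    }
    ```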
     

04 Aug, 2016

1 commit

  • If CONFIG_TRANSPARENT_HUGE_PAGECACHE=n, HPAGE_PMD_NR evaluates to
    BUILD_BUG_ON(), and may cause a build failure (e.g. with gcc 4.12):

    mm/built-in.o: In function `shmem_alloc_hugepage':
    shmem.c:(.text+0x17570): undefined reference to `__compiletime_assert_1365'

    To fix this, move the assignment to hindex after the check for huge
    pages support.

    Fixes: 800d8c63b2e9 ("shmem: add huge pages support")
    Signed-off-by: Geert Uytterhoeven
    Acked-by: Kirill A. Shutemov
    Cc: Andrew Morton
    Signed-off-by: Linus Torvalds

    Geert Uytterhoeven
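
    The reordering can be shown with a simplified model: the huge-page
    index is derived only after the huge path is confirmed, so the
    HPAGE_PMD_NR-based expression is never evaluated on the !huge path.
    The constant and names here are stand-ins, not shmem's:

    ```c
    #include <assert.h>

    #define HPAGE_NR 512   /* stand-in for HPAGE_PMD_NR */

    static long alloc_index(long index, int huge)
    {
        long hindex;

        /* check for huge page support first ... */
        if (!huge)
            return index;           /* plain page: no huge-page math */

        /* ... and only then derive the huge-aligned index */
        hindex = index & ~((long)HPAGE_NR - 1);  /* round down */
        return hindex;
    }
    ```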
     

03 Aug, 2016

7 commits

  • Merge yet more updates from Andrew Morton:

    - the rest of ocfs2

    - various hotfixes, mainly MM

    - quite a bit of misc stuff - drivers, fork, exec, signals, etc.

    - printk updates

    - firmware

    - checkpatch

    - nilfs2

    - more kexec stuff than usual

    - rapidio updates

    - w1 things

    * emailed patches from Andrew Morton : (111 commits)
    ipc: delete "nr_ipc_ns"
    kcov: allow more fine-grained coverage instrumentation
    init/Kconfig: add clarification for out-of-tree modules
    config: add android config fragments
    init/Kconfig: ban CONFIG_LOCALVERSION_AUTO with allmodconfig
    relay: add global mode support for buffer-only channels
    init: allow blacklisting of module_init functions
    w1:omap_hdq: fix regression
    w1: add helper macro module_w1_family
    w1: remove need for ida and use PLATFORM_DEVID_AUTO
    rapidio/switches: add driver for IDT gen3 switches
    powerpc/fsl_rio: apply changes for RIO spec rev 3
    rapidio: modify for rev.3 specification changes
    rapidio: change inbound window size type to u64
    rapidio/idt_gen2: fix locking warning
    rapidio: fix error handling in mbox request/release functions
    rapidio/tsi721_dma: advance queue processing from transfer submit call
    rapidio/tsi721: add messaging mbox selector parameter
    rapidio/tsi721: add PCIe MRRS override parameter
    rapidio/tsi721_dma: add channel mask and queue size parameters
    ...

    Linus Torvalds
     
  • The vm_brk() alignment calculations should refuse to overflow. The ELF
    loader was depending on this, but it has since been fixed. No other
    unsafe callers have been found.

    Link: http://lkml.kernel.org/r/1468014494-25291-3-git-send-email-keescook@chromium.org
    Signed-off-by: Kees Cook
    Reported-by: Hector Marco-Gisbert
    Cc: Ismael Ripoll Ripoll
    Cc: Alexander Viro
    Cc: "Kirill A. Shutemov"
    Cc: Oleg Nesterov
    Cc: Chen Gang
    Cc: Michal Hocko
    Cc: Konstantin Khlebnikov
    Cc: Andrea Arcangeli
    Cc: Andrey Ryabinin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kees Cook
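
    An overflow-refusing page alignment in the spirit of the hardening
    above: if rounding len up to a page boundary would wrap around, fail
    instead of returning a small bogus length. Names and the page size
    are illustrative:

    ```c
    #include <assert.h>
    #include <limits.h>

    #define PAGE_SZ 4096UL   /* illustrative page size */

    /* Round len up to a page boundary, refusing to wrap. */
    static int page_align_checked(unsigned long len, unsigned long *out)
    {
        if (len > ULONG_MAX - (PAGE_SZ - 1))
            return -1;               /* would overflow: refuse */
        *out = (len + PAGE_SZ - 1) & ~(PAGE_SZ - 1);
        return 0;
    }
    ```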
     
  • There was only one use of __initdata_refok and __exit_refok.

    __init_refok was used 46 times, versus 82 uses of __ref.

    These definitions are obsolete since commit 312b1485fb50 ("Introduce new
    section reference annotations tags: __ref, __refdata, __refconst").

    This patch removes the following compatibility definitions and replaces
    them treewide.

    /* compatibility defines */
    #define __init_refok __ref
    #define __initdata_refok __refdata
    #define __exit_refok __ref

    I can also provide separate patches if necessary.
    (One patch per tree and check in 1 month or 2 to remove old definitions)

    [akpm@linux-foundation.org: coding-style fixes]
    Link: http://lkml.kernel.org/r/1466796271-3043-1-git-send-email-fabf@skynet.be
    Signed-off-by: Fabian Frederick
    Cc: Ingo Molnar
    Cc: Sam Ravnborg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fabian Frederick
     
  • We must call shrink_slab() for each memory cgroup on both global and
    memcg reclaim in shrink_node_memcg(). Commit d71df22b55099 accidentally
    changed that so that now shrink_slab() is only called with memcg != NULL
    on memcg reclaim. As a result, memcg-aware shrinkers (including
    dentry/inode) are never invoked on global reclaim. Fix that.

    Fixes: b2e18757f2c9 ("mm, vmscan: begin reclaiming pages on a per-node basis")
    Link: http://lkml.kernel.org/r/1470056590-7177-1-git-send-email-vdavydov@virtuozzo.com
    Signed-off-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Mel Gorman
    Cc: Hillf Danton
    Cc: Vlastimil Babka
    Cc: Joonsoo Kim
    Cc: Minchan Kim
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
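
    A model of the fixed behavior: the reclaim walk calls the shrinker
    for every cgroup whether reclaim is global (target NULL) or targeted
    at one memcg. The loop is a stand-in for mem_cgroup_iter(), and the
    targeted case is flattened to a simple id match for illustration:

    ```c
    #include <assert.h>
    #include <stddef.h>

    static int shrink_calls;

    static void shrink_slab_demo(int memcg_id)
    {
        (void)memcg_id;
        shrink_calls++;
    }

    /* Fixed: global reclaim (target == NULL) still visits and shrinks
     * every cgroup; the bug skipped shrink_slab() in that case. */
    static void shrink_node_demo(const int *target, const int *cgroups, int n)
    {
        for (int i = 0; i < n; i++) {
            if (target == NULL || cgroups[i] == *target)
                shrink_slab_demo(cgroups[i]);
        }
    }
    ```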
     
  • If the total amount of memory assigned to quarantine is less than the
    amount of memory assigned to per-cpu quarantines, |new_quarantine_size|
    may overflow. In that case, set it to zero instead.

    [akpm@linux-foundation.org: cleanup: use WARN_ONCE return value]
    Link: http://lkml.kernel.org/r/1470063563-96266-1-git-send-email-glider@google.com
    Fixes: 55834c59098d ("mm: kasan: initial memory quarantine implementation")
    Signed-off-by: Alexander Potapenko
    Reported-by: Dmitry Vyukov
    Cc: Andrey Ryabinin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexander Potapenko
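
    The clamp described above in miniature: if the per-cpu share exceeds
    the total quarantine budget, the unsigned subtraction would wrap to a
    huge value; warn (WARN_ONCE in the kernel, a plain fprintf here) and
    fall back to zero instead. Names are illustrative:

    ```c
    #include <assert.h>
    #include <stdio.h>

    static unsigned long quarantine_size(unsigned long total,
                                         unsigned long percpu)
    {
        if (percpu > total) {
            fprintf(stderr, "quarantine: per-cpu share exceeds total\n");
            return 0;                /* avoid unsigned wraparound */
        }
        return total - percpu;
    }
    ```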
     
  • Currently we just dump the stack in case of a double-free bug.
    Let's dump all the info we have about the object instead.

    [aryabinin@virtuozzo.com: change double free message per Alexander]
    Link: http://lkml.kernel.org/r/1470153654-30160-1-git-send-email-aryabinin@virtuozzo.com
    Link: http://lkml.kernel.org/r/1470062715-14077-6-git-send-email-aryabinin@virtuozzo.com
    Signed-off-by: Andrey Ryabinin
    Cc: Alexander Potapenko
    Cc: Dmitry Vyukov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Ryabinin
     
  • The state of an object is currently tracked in two places: shadow
    memory and the ->state field in struct kasan_alloc_meta. We can get
    rid of the latter, which saves a little bit of memory. It also
    allows us to move the free stack into struct kasan_alloc_meta
    without increasing memory consumption, so we now always know when an
    object was last freed. This may be useful for long-delayed
    use-after-free bugs.

    As a side effect this fixes following UBSAN warning:
    UBSAN: Undefined behaviour in mm/kasan/quarantine.c:102:13
    member access within misaligned address ffff88000d1efebc for type 'struct qlist_node'
    which requires 8 byte alignment

    Link: http://lkml.kernel.org/r/1470062715-14077-5-git-send-email-aryabinin@virtuozzo.com
    Reported-by: kernel test robot
    Signed-off-by: Andrey Ryabinin
    Cc: Alexander Potapenko
    Cc: Dmitry Vyukov
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Ryabinin