04 Mar, 2014

5 commits

  • Jan Stancek reports manual page migration encountering allocation
    failures after some pages when there is still plenty of memory free, and
    bisected the problem down to commit 81c0a2bb515f ("mm: page_alloc: fair
    zone allocator policy").

    The problem is that GFP_THISNODE obeys the zone fairness allocation
    batches on one hand, but doesn't reset them and wake kswapd on the other
    hand. After a few of those allocations, the batches are exhausted and
    the allocations fail.

    Fixing this means either having GFP_THISNODE wake up kswapd, or
    GFP_THISNODE not participating in zone fairness at all. The latter
    seems safer as an acute bugfix; we can clean up later.
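
    A minimal sketch of that second option, assuming the 3.12-era
    __alloc_pages_nodemask()/ALLOC_FAIR structure (flag handling shown from
    memory, so treat the details as illustrative):

    int alloc_flags = ALLOC_WMARK_LOW | ALLOC_CPUSET;

    /*
     * GFP_THISNODE (__GFP_THISNODE | __GFP_NORETRY | __GFP_NOWARN)
     * callers never wake kswapd, so they must not consume the per-zone
     * fairness batches either -- otherwise the batches are drained but
     * never reset, and the allocations start failing.
     */
    if ((gfp_mask & GFP_THISNODE) != GFP_THISNODE)
            alloc_flags |= ALLOC_FAIR;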

    Reported-by: Jan Stancek
    Signed-off-by: Johannes Weiner
    Acked-by: Rik van Riel
    Acked-by: Mel Gorman
    Cc: [3.12+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Daniel Borkmann reported a VM_BUG_ON assertion failing:

    ------------[ cut here ]------------
    kernel BUG at mm/mlock.c:528!
    invalid opcode: 0000 [#1] SMP
    Modules linked in: ccm arc4 iwldvm [...]
    video
    CPU: 3 PID: 2266 Comm: netsniff-ng Not tainted 3.14.0-rc2+ #8
    Hardware name: LENOVO 2429BP3/2429BP3, BIOS G4ET37WW (1.12 ) 05/29/2012
    task: ffff8801f87f9820 ti: ffff88002cb44000 task.ti: ffff88002cb44000
    RIP: 0010:[] [] munlock_vma_pages_range+0x2e0/0x2f0
    Call Trace:
    do_munmap+0x18f/0x3b0
    vm_munmap+0x41/0x60
    SyS_munmap+0x22/0x30
    system_call_fastpath+0x1a/0x1f
    RIP munlock_vma_pages_range+0x2e0/0x2f0
    ---[ end trace a0088dcf07ae10f2 ]---

    because munlock_vma_pages_range() thinks it's unexpectedly in the middle
    of a THP page. This can be reproduced with the default config on 3.11+
    kernels. A reproducer can be found in the kernel's selftest directory
    for networking [1] by running ./psock_tpacket.

    The problem is that an order=2 compound page (allocated by
    alloc_one_pg_vec_page()) is part of the munlocked VM_MIXEDMAP vma (mapped
    by packet_mmap()) and is mistaken for a THP page, assumed to be order=9.

    The checks for THP in munlock came with commit ff6a6da60b89 ("mm:
    accelerate munlock() treatment of THP pages"), i.e. since 3.9, but did
    not trigger a bug. It just makes munlock_vma_pages_range() skip such
    compound pages until the next 512-pages-aligned page, when it encounters
    a head page. This is however not a problem for vma's where mlocking has
    no effect anyway, but it can distort the accounting.

    Since commit 7225522bb429 ("mm: munlock: batch non-THP page isolation
    and munlock+putback using pagevec") this can trigger a VM_BUG_ON in the
    PageTransHuge() check.

    This patch fixes the issue by adding VM_MIXEDMAP flag to VM_SPECIAL, a
    list of flags that make vma's non-mlockable and non-mergeable. The
    reasoning is that VM_MIXEDMAP vma's are similar to VM_PFNMAP, which is
    already on the VM_SPECIAL list, and both are intended for non-LRU pages
    where mlocking makes no sense anyway. Related LKML discussion can be
    found in [2].

    [1] tools/testing/selftests/net/psock_tpacket
    [2] https://lkml.org/lkml/2014/1/10/427
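
    The core of the fix is a one-line addition to the VM_SPECIAL definition
    in include/linux/mm.h; the sketch below is written from memory rather
    than copied from the patch:

    /* flags that make a vma non-mergeable and exempt from mlocking */
    #define VM_SPECIAL (VM_IO | VM_DONTEXPAND | VM_PFNMAP | VM_MIXEDMAP)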

    Signed-off-by: Vlastimil Babka
    Signed-off-by: Daniel Borkmann
    Reported-by: Daniel Borkmann
    Tested-by: Daniel Borkmann
    Cc: Thomas Hellstrom
    Cc: John David Anglin
    Cc: HATAYAMA Daisuke
    Cc: Konstantin Khlebnikov
    Cc: Carsten Otte
    Cc: Jared Hulbert
    Tested-by: Hannes Frederic Sowa
    Cc: Kirill A. Shutemov
    Acked-by: Rik van Riel
    Cc: Andrea Arcangeli
    Cc: [3.11.x+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • Sometimes the cleanup after memcg hierarchy testing gets stuck in
    mem_cgroup_reparent_charges(), unable to bring non-kmem usage down to 0.

    There may turn out to be several causes, but a major cause is this: the
    workitem to offline the parent can get run before the workitem to offline
    the child; the parent's mem_cgroup_reparent_charges() circles around
    waiting for the child's pages to be reparented to its lrus, but it is
    holding cgroup_mutex, which prevents the child from reaching its own
    mem_cgroup_reparent_charges().

    Further testing showed that an ordered workqueue for cgroup_destroy_wq
    is not always good enough: percpu_ref_kill_and_confirm's call_rcu_sched
    stage on the way can mess up the order before reaching the workqueue.

    Instead, when offlining a memcg, call mem_cgroup_reparent_charges() on
    all its children (and grandchildren, in the correct order) to have their
    charges reparented first.
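
    A sketch of that idea, assuming the 3.14-era css descendant iterators
    (the surrounding code is illustrative, not the exact mm/memcontrol.c
    diff):

    static void mem_cgroup_css_offline(struct cgroup_subsys_state *css)
    {
            struct mem_cgroup *memcg = mem_cgroup_from_css(css);
            struct cgroup_subsys_state *iter;

            /*
             * Reparent the deepest descendants first, finishing with
             * @css itself, so the parent never spins waiting for a
             * child whose own offline work cannot run.
             */
            css_for_each_descendant_post(iter, css)
                    mem_cgroup_reparent_charges(mem_cgroup_from_css(iter));
    }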

    Fixes: e5fca243abae ("cgroup: use a dedicated workqueue for cgroup destruction")
    Signed-off-by: Filipe Brandenburger
    Signed-off-by: Hugh Dickins
    Reviewed-by: Tejun Heo
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Cc: [v3.10+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Filipe Brandenburger
     
  • Commit 0eef615665ed ("memcg: fix css reference leak and endless loop in
    mem_cgroup_iter") got the interaction with commit d8ad30559715 ("mm/memcg:
    iteration skip memcgs not yet fully initialized"), which landed a few
    commits earlier, slightly wrong, and we didn't notice at the time.

    It's elusive, and harder to get than the original, but for a couple of
    days before rc1, I several times saw an endless loop similar to that
    supposedly being fixed.

    This time it was a tighter loop in __mem_cgroup_iter_next(): we can get
    here when our root has already been offlined, and the ordering of
    conditions was such that we then just cycled around forever.

    Fixes: 0eef615665ed ("memcg: fix css reference leak and endless loop in mem_cgroup_iter").
    Signed-off-by: Hugh Dickins
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Cc: Greg Thelen
    Cc: [3.12+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Commit bf6bddf1924e ("mm: introduce compaction and migration for
    ballooned pages") introduces page_count(page) into memory compaction
    which dereferences page->first_page if PageTail(page).

    This results in a very rare NULL pointer dereference on the
    aforementioned page_count(page). Indeed, anything that does
    compound_head(), including page_count() is susceptible to racing with
    prep_compound_page() and seeing a NULL or dangling page->first_page
    pointer.

    This patch uses Andrea's implementation of compound_trans_head() that
    deals with such a race and makes it the default compound_head()
    implementation. This includes a read memory barrier that ensures that,
    if PageTail(page) is still true, we return a head page that is neither
    NULL nor dangling. The patch then adds a store memory barrier to
    prep_compound_page() to ensure page->first_page is set.

    This is the safest way to ensure we see the head page that we are
    expecting; PageTail(page) is already in the unlikely() path, and the
    memory barriers are unfortunately required.

    Hugetlbfs is the exception: we don't enforce a store memory barrier
    during init since no race is possible.
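
    A sketch of the barrier pairing being described, modelled on Andrea's
    compound_trans_head() (simplified and written from memory, so treat it
    as illustrative):

    static inline struct page *compound_head(struct page *page)
    {
            if (unlikely(PageTail(page))) {
                    struct page *head = page->first_page;

                    /*
                     * Pairs with the smp_wmb() in prep_compound_page():
                     * if PageTail() is still set after reading first_page,
                     * the head we read is neither NULL nor dangling.
                     */
                    smp_rmb();
                    if (likely(PageTail(page)))
                            return head;
            }
            return page;
    }

    /* ...and in prep_compound_page(), for each tail page: */
            p->first_page = page;
            smp_wmb();      /* publish first_page before the tail flag */
            __SetPageTail(p);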

    Signed-off-by: David Rientjes
    Cc: Holger Kiehl
    Cc: Christoph Lameter
    Cc: Rafael Aquini
    Cc: Vlastimil Babka
    Cc: Michal Hocko
    Cc: Mel Gorman
    Cc: Andrea Arcangeli
    Cc: Rik van Riel
    Cc: "Kirill A. Shutemov"
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

26 Feb, 2014

3 commits

  • Kirill has reported the following:

    Task in /test killed as a result of limit of /test
    memory: usage 10240kB, limit 10240kB, failcnt 51
    memory+swap: usage 10240kB, limit 10240kB, failcnt 0
    kmem: usage 0kB, limit 18014398509481983kB, failcnt 0
    Memory cgroup stats for /test:

    BUG: sleeping function called from invalid context at kernel/cpu.c:68
    in_atomic(): 1, irqs_disabled(): 0, pid: 66, name: memcg_test
    2 locks held by memcg_test/66:
    #0: (memcg_oom_lock#2){+.+...}, at: [] pagefault_out_of_memory+0x14/0x90
    #1: (oom_info_lock){+.+...}, at: [] mem_cgroup_print_oom_info+0x2a/0x390
    CPU: 2 PID: 66 Comm: memcg_test Not tainted 3.14.0-rc1-dirty #745
    Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS Bochs 01/01/2011
    Call Trace:
    __might_sleep+0x16a/0x210
    get_online_cpus+0x1c/0x60
    mem_cgroup_read_stat+0x27/0xb0
    mem_cgroup_print_oom_info+0x260/0x390
    dump_header+0x88/0x251
    ? trace_hardirqs_on+0xd/0x10
    oom_kill_process+0x258/0x3d0
    mem_cgroup_oom_synchronize+0x656/0x6c0
    ? mem_cgroup_charge_common+0xd0/0xd0
    pagefault_out_of_memory+0x14/0x90
    mm_fault_error+0x91/0x189
    __do_page_fault+0x48e/0x580
    do_page_fault+0xe/0x10
    page_fault+0x22/0x30

    The complaint is that mem_cgroup_read_stat() may sleep (it calls
    get_online_cpus()) and so cannot be called from atomic context, but
    mem_cgroup_print_oom_info() holds a spinlock while calling it. Change
    oom_info_lock to a mutex.

    This was introduced by 947b3dd1a84b ("memcg, oom: lock
    mem_cgroup_print_oom_info").
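
    A minimal sketch of the change (lock name as given in the changelog;
    surrounding code shown from memory):

    /* was: static DEFINE_SPINLOCK(oom_info_lock); */
    static DEFINE_MUTEX(oom_info_lock);

    void mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct task_struct *p)
    {
            mutex_lock(&oom_info_lock);     /* was spin_lock() */
            /* ... dump per-memcg stats; mem_cgroup_read_stat() may sleep ... */
            mutex_unlock(&oom_info_lock);   /* was spin_unlock() */
    }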

    Signed-off-by: Michal Hocko
    Reported-by: "Kirill A. Shutemov"
    Cc: Johannes Weiner
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Masayoshi Mizuma reported an application hanging under the memcg limit.
    It happens on a write-protection fault to the huge zero page.

    If we successfully allocate a huge page to replace zero page but hit the
    memcg limit we need to split the zero page with split_huge_page_pmd()
    and fallback to small pages.

    The other part of the problem is that VM_FAULT_OOM has special meaning
    in the do_huge_pmd_wp_page() context. __handle_mm_fault() expects the
    page to be split if it sees VM_FAULT_OOM and it will retry page fault
    handling. This causes an infinite loop if the page was not split.

    do_huge_pmd_wp_zero_page_fallback() can return VM_FAULT_OOM if it failed
    to allocate one small page, so fallback to small pages will not help.

    The solution for this part is to replace VM_FAULT_OOM with
    VM_FAULT_FALLBACK if a fallback is required.
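
    A sketch of the caller-side contract this relies on, assuming the
    3.14-era THP write-protect path in __handle_mm_fault() (illustrative
    rather than the literal diff):

    if (pmd_trans_huge(orig_pmd)) {
            /* ... other THP handling elided ... */
            ret = do_huge_pmd_wp_page(mm, vma, address, pmd, orig_pmd);
            /*
             * VM_FAULT_FALLBACK means "retry with small pages"; treating
             * VM_FAULT_OOM the same way looped forever when the zero-page
             * pmd had not actually been split.
             */
            if (!(ret & VM_FAULT_FALLBACK))
                    return ret;
            /* fall through to pte-level fault handling */
    }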

    Signed-off-by: Kirill A. Shutemov
    Reported-by: Masayoshi Mizuma
    Reviewed-by: Michal Hocko
    Cc: Johannes Weiner
    Cc: Andrea Arcangeli
    Cc: David Rientjes
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • It seems we forget to release the page after detecting a HW error.

    Signed-off-by: Kirill A. Shutemov
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Andi Kleen
    Cc: Matthew Wilcox
    Cc: Dave Hansen
    Cc: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

21 Feb, 2014

1 commit

  • Pull cgroup fixes from Tejun Heo:
    "Quite a few fixes this time.

    Three locking fixes, all marked for -stable. A couple error path
    fixes and some misc fixes. Hugh found a bug in memcg offlining
    sequence and we thought we could fix that from cgroup core side but
    that turned out to be insufficient and got reverted. A different fix
    has been applied to -mm"

    * 'for-3.14-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
    cgroup: update cgroup_enable_task_cg_lists() to grab siglock
    Revert "cgroup: use an ordered workqueue for cgroup destruction"
    cgroup: protect modifications to cgroup_idr with cgroup_mutex
    cgroup: fix locking in cgroup_cfts_commit()
    cgroup: fix error return from cgroup_create()
    cgroup: fix error return value in cgroup_mount()
    cgroup: use an ordered workqueue for cgroup destruction
    nfs: include xattr.h from fs/nfs/nfs3proc.c
    cpuset: update MAINTAINERS entry
    arm, pm, vmpressure: add missing slab.h includes

    Linus Torvalds
     

18 Feb, 2014

1 commit

  • Pull powerpc fixes from Ben Herrenschmidt:
    "Here are some more powerpc fixes for 3.14

    The main one is a nasty issue with the NUMA balancing support which
    requires a small generic change and the addition of a new accessor to
    set _PAGE_NUMA. Both have been reviewed and acked by Mel and Rik.

    The changelog should have plenty of details but basically, without
    this fix, we get random user segfaults and/or corruptions due to
    missing TLB/hash flushes. Aneesh series of 3 patches fixes it.

    We have some vDSO vs. perf fixes from Anton, some small EEH fixes
    from Gavin, a ppc32 regression vs the stack overflow detector, and a
    fix for the way we handle PCIe host bridge speed settings on pseries
    (which is needed for proper operations of AMD graphics cards on
    Power8)"

    * 'merge' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc:
    powerpc/eeh: Disable EEH on reboot
    powerpc/eeh: Cleanup on eeh_subsystem_enabled
    powerpc/powernv: Rework EEH reset
    powerpc: Use unstripped VDSO image for more accurate profiling data
    powerpc: Link VDSOs at 0x0
    mm: Use ptep/pmdp_set_numa() for updating _PAGE_NUMA bit
    mm: Dirty accountable change only apply to non prot numa case
    powerpc/mm: Add new "set" flag argument to pte/pmd update function
    powerpc/pseries: Add Gen3 definitions for PCIE link speed
    powerpc/pseries: Fix regression on PCI link speed
    powerpc: Set the correct ksp_limit on ppc32 when switching to irq stack

    Linus Torvalds
     

17 Feb, 2014

2 commits

  • Archs like ppc64 don't do a TLB flush in the set_pte/pmd functions when
    using a hash table MMU, for various reasons (the flush is handled as part
    of the PTE modification when necessary).

    ppc64 thus doesn't implement flush_tlb_range for hash based MMUs.

    Additionally, ppc64 requires the TLB flushing to be batched within ptl locks.

    The reason to do that is to ensure that the hash page table is in sync with
    the Linux page table.

    We track the hpte index in the Linux pte, and if we clear it without flushing
    the hash and drop the ptl lock, another cpu can update the pte and we can end
    up with a duplicate entry in the hash table, which is fatal.

    We also want to keep set_pte_at() simple by not requiring it to do a hash
    flush, for performance reasons. We do that by assuming that set_pte_at() is
    never *ever* called on a PTE that is already valid.

    This was the case until the NUMA code went in which broke that assumption.

    Fix that by introducing a new pair of helpers to set _PAGE_NUMA in a
    way similar to ptep/pmdp_set_wrprotect(), with a generic implementation
    using set_pte_at() and a powerpc specific one using the appropriate
    mechanism needed to keep the hash table in sync.
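
    A sketch of the generic helper (the powerpc variant instead uses its
    pte_update()-style primitive so the hash table stays in sync); details
    written from memory, so treat them as illustrative:

    static inline void ptep_set_numa(struct mm_struct *mm, unsigned long addr,
                                     pte_t *ptep)
    {
            pte_t ptent = *ptep;

            ptent = pte_mknuma(ptent);
            set_pte_at(mm, addr, ptep, ptent);
    }

    /* pmdp_set_numa() is the same pattern using pmd_mknuma()/set_pmd_at() */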

    Acked-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Signed-off-by: Aneesh Kumar K.V
    Signed-off-by: Benjamin Herrenschmidt

    Aneesh Kumar K.V
     
  • So move it within the if loop

    Acked-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Signed-off-by: Aneesh Kumar K.V
    Signed-off-by: Benjamin Herrenschmidt

    Aneesh Kumar K.V
     

11 Feb, 2014

3 commits

  • mce-test detected a test failure when injecting an error into a THP tail
    page. This is because we take the page refcount of the tail page in
    madvise_hwpoison() while the fix in commit a3e0f9e47d5e
    ("mm/memory-failure.c: transfer page count from head page to tail page
    after split thp") assumes that we always take refcount on the head page.

    When a real memory error happens we take refcount on the head page where
    memory_failure() is called without MF_COUNT_INCREASED set, so it seems
    to me that testing memory error on thp tail page using madvise makes
    little sense.

    This patch keeps the refcount transfer only for the !MF_COUNT_INCREASED
    case, which is where such testing is valid.

    [akpm@linux-foundation.org: s/&&/&/]
    Signed-off-by: Naoya Horiguchi
    Cc: Andi Kleen
    Cc: Wanpeng Li
    Cc: Chen Gong
    Cc: [3.9+: a3e0f9e47d5e]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • Vladimir reported the following issue:

    Commit c65c1877bd68 ("slub: use lockdep_assert_held") requires
    remove_partial() to be called with n->list_lock held, but free_partial()
    called from kmem_cache_close() on cache destruction does not follow this
    rule, leading to a warning:

    WARNING: CPU: 0 PID: 2787 at mm/slub.c:1536 __kmem_cache_shutdown+0x1b2/0x1f0()
    Modules linked in:
    CPU: 0 PID: 2787 Comm: modprobe Tainted: G W 3.14.0-rc1-mm1+ #1
    Hardware name:
    0000000000000600 ffff88003ae1dde8 ffffffff816d9583 0000000000000600
    0000000000000000 ffff88003ae1de28 ffffffff8107c107 0000000000000000
    ffff880037ab2b00 ffff88007c240d30 ffffea0001ee5280 ffffea0001ee52a0
    Call Trace:
    __kmem_cache_shutdown+0x1b2/0x1f0
    kmem_cache_destroy+0x43/0xf0
    xfs_destroy_zones+0x103/0x110 [xfs]
    exit_xfs_fs+0x38/0x4e4 [xfs]
    SyS_delete_module+0x19a/0x1f0
    system_call_fastpath+0x16/0x1b

    His solution was to add a spinlock in order to quiet lockdep. Although
    there would be no contention on the added lock, it also requires
    disabling interrupts, which will have a larger impact on the system.

    Instead of adding a spinlock to a location where it is not needed for
    lockdep, make a __remove_partial() function that does not test whether
    the list_lock is held, as no one can hold it while the cache is being
    freed.

    Also add a __add_partial() function that does not do the lock
    validation either, as it is not needed during creation of the cache.
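
    A sketch of the resulting split (mm/slub.c shapes from memory, so
    illustrative): the __-prefixed helpers do the list work, while the
    locked wrappers keep the lockdep assertion for all normal callers.

    static inline void __remove_partial(struct kmem_cache_node *n,
                                        struct page *page)
    {
            list_del(&page->lru);
            n->nr_partial--;
    }

    static void remove_partial(struct kmem_cache_node *n, struct page *page)
    {
            lockdep_assert_held(&n->list_lock);
            __remove_partial(n, page);
    }

    /*
     * free_partial() during cache destruction calls __remove_partial()
     * directly: nobody else can reach the dying cache, so no lock is taken.
     */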

    Signed-off-by: Steven Rostedt
    Reported-by: Vladimir Davydov
    Suggested-by: David Rientjes
    Acked-by: David Rientjes
    Acked-by: Vladimir Davydov
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Steven Rostedt
     
  • Commit c65c1877bd68 ("slub: use lockdep_assert_held") incorrectly
    required that add_full() and remove_full() hold n->list_lock. The lock
    is only taken when kmem_cache_debug(s), since that's the only time it
    actually does anything.

    Require that the lock only be taken under such a condition.
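
    A sketch of the resulting shape (from memory, illustrative): the full
    list is only maintained for debug caches, so the assertion is only
    reached when that path is actually taken.

    static void add_full(struct kmem_cache *s,
                         struct kmem_cache_node *n, struct page *page)
    {
            if (!(s->flags & SLAB_STORE_USER))
                    return;

            lockdep_assert_held(&n->list_lock);
            list_add(&page->lru, &n->full);
    }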

    Reported-by: Larry Finger
    Tested-by: Larry Finger
    Tested-by: Paul E. McKenney
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Signed-off-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

10 Feb, 2014

2 commits

  • Pull vfs fixes from Al Viro:
    "A couple of fixes, both -stable fodder. The O_SYNC bug is fairly
    old..."

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    fix a kmap leak in virtio_console
    fix O_SYNC|O_APPEND syncing the wrong range on write()

    Linus Torvalds
     
  • It actually goes back to 2004 ([PATCH] Concurrent O_SYNC write support)
    when sync_page_range() had been introduced; generic_file_write{,v}()
    correctly synced

        pos_after_write - written .. pos_after_write - 1

    but generic_file_aio_write() synced

        pos_before_write .. pos_before_write + written - 1

    instead. Which is not the same thing with O_APPEND, obviously. A couple
    of years later the correct variant was killed off when everything
    switched to use of generic_file_aio_write().

    All users of generic_file_aio_write() are affected, and the same bug
    has been copied into other instances of ->aio_write().

    The fix is trivial; the only subtle point is that generic_write_sync()
    ought to be inlined to avoid calculations useless for the majority of
    calls.
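
    A conceptual sketch of the corrected range calculation (not the literal
    fs patch); vfs_fsync_range() takes an inclusive byte range:

    /* after a successful write of 'written' bytes ending at pos_after_write */
    if (written > 0 &&
        ((file->f_flags & O_DSYNC) || IS_SYNC(file->f_mapping->host)))
            err = vfs_fsync_range(file,
                                  pos_after_write - written, /* start */
                                  pos_after_write - 1,       /* end, inclusive */
                                  (file->f_flags & __O_SYNC) ? 0 : 1);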

    Signed-off-by: Al Viro

    Al Viro
     

09 Feb, 2014

1 commit

  • Pull x86 fixes from Peter Anvin:
    "Quite a varied little collection of fixes. Most of them are
    relatively small or isolated; the biggest one is Mel Gorman's fixes
    for TLB range flushing.

    A couple of AMD-related fixes (including not crashing when given an
    invalid microcode image) and fix a crash when compiled with gcov"

    * 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    x86, microcode, AMD: Unify valid container checks
    x86, hweight: Fix BUG when booting with CONFIG_GCOV_PROFILE_ALL=y
    x86/efi: Allow mapping BGRT on x86-32
    x86: Fix the initialization of physnode_map
    x86, cpu hotplug: Fix stack frame warning in check_irq_vectors_for_cpu_disable()
    x86/intel/mid: Fix X86_INTEL_MID dependencies
    arch/x86/mm/srat: Skip NUMA_NO_NODE while parsing SLIT
    mm, x86: Revisit tlb_flushall_shift tuning for page flushes except on IvyBridge
    x86: mm: change tlb_flushall_shift for IvyBridge
    x86/mm: Eliminate redundant page table walk during TLB range flushing
    x86/mm: Clean up inconsistencies when flushing TLB ranges
    mm, x86: Account for TLB flushes only when debugging
    x86/AMD/NB: Fix amd_set_subcaches() parameter type
    x86/quirks: Add workaround for AMD F16h Erratum792
    x86, doc, kconfig: Fix dud URL for Microcode data

    Linus Torvalds
     

08 Feb, 2014

1 commit


07 Feb, 2014

3 commits

  • During an aio stress test, we observed the following lockdep warning.
    This means AIO+numa_balancing is currently deadlockable.

    The problem is that aio_migratepage() disables interrupts, but
    __set_page_dirty_nobuffers() unintentionally enables them again.

    Generally, all such helper functions should use spin_lock_irqsave()
    instead of spin_lock_irq() because they don't know their caller's
    context at all.

    other info that might help us debug this:
    Possible unsafe locking scenario:

    CPU0
    ----
    lock(&(&ctx->completion_lock)->rlock);
    <Interrupt>
      lock(&(&ctx->completion_lock)->rlock);

    *** DEADLOCK ***

    dump_stack+0x19/0x1b
    print_usage_bug+0x1f7/0x208
    mark_lock+0x21d/0x2a0
    mark_held_locks+0xb9/0x140
    trace_hardirqs_on_caller+0x105/0x1d0
    trace_hardirqs_on+0xd/0x10
    _raw_spin_unlock_irq+0x2c/0x50
    __set_page_dirty_nobuffers+0x8c/0xf0
    migrate_page_copy+0x434/0x540
    aio_migratepage+0xb1/0x140
    move_to_new_page+0x7d/0x230
    migrate_pages+0x5e5/0x700
    migrate_misplaced_page+0xbc/0xf0
    do_numa_page+0x102/0x190
    handle_pte_fault+0x241/0x970
    handle_mm_fault+0x265/0x370
    __do_page_fault+0x172/0x5a0
    do_page_fault+0x1a/0x70
    page_fault+0x28/0x30
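
    A sketch of the fix pattern in __set_page_dirty_nobuffers()
    (mapping->tree_lock as in 3.14; illustrative):

    unsigned long flags;

    /* preserve the caller's IRQ state instead of blindly re-enabling IRQs */
    spin_lock_irqsave(&mapping->tree_lock, flags);
    /* ... update radix-tree tags and dirty accounting ... */
    spin_unlock_irqrestore(&mapping->tree_lock, flags);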

    Signed-off-by: KOSAKI Motohiro
    Cc: Larry Woodman
    Cc: Rik van Riel
    Cc: Johannes Weiner
    Acked-by: David Rientjes
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • swapoff clears swap_info's SWP_USED flag prematurely and frees its
    resources after that. A concurrent swapon will reuse this swap_info
    while its previous resources are not completely cleared.

    These late freed resources are:
    - p->percpu_cluster
    - swap_cgroup_ctrl[type]
    - block_device setting
    - inode->i_flags &= ~S_SWAPFILE

    This patch clears the SWP_USED flag after all its resources are freed,
    so that swapon can reuse this swap_info by alloc_swap_info() safely.

    [akpm@linux-foundation.org: tidy up code comment]
    Signed-off-by: Weijie Yang
    Acked-by: Hugh Dickins
    Cc: Krzysztof Kozlowski
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Weijie Yang
     
  • This is a patch to improve swap readahead algorithm. It's from Hugh and
    I slightly changed it.

    Hugh's original changelog:

    swapin readahead does a blind readahead, whether or not the swapin is
    sequential. This may be ok on harddisk, because large reads have
    relatively small costs, and if the readahead pages are unneeded they can
    be reclaimed easily - though, what if their allocation forced reclaim of
    useful pages? But on SSD devices large reads are more expensive than
    small ones: if the readahead pages are unneeded, reading them in caused
    significant overhead.

    This patch adds very simplistic random read detection. Stealing the
    PageReadahead technique from Konstantin Khlebnikov's patch, avoiding the
    vma/anon_vma sophistications of Shaohua Li's patch, swapin_nr_pages()
    simply looks at readahead's current success rate, and narrows or widens
    its readahead window accordingly. There is little science to its
    heuristic: it's about as stupid as can be whilst remaining effective.

    The table below shows elapsed times (in centiseconds) when running a
    single repetitive swapping load across a 1000MB mapping in 900MB ram
    with 1GB swap (the harddisk tests had taken painfully too long when I
    used mem=500M, but SSD shows similar results for that).

    Vanilla is the 3.6-rc7 kernel on which I started; Shaohua denotes his
    Sep 3 patch in mmotm and linux-next; HughOld denotes my Oct 1 patch
    which Shaohua showed to be defective; HughNew this Nov 14 patch, with
    page_cluster as usual at default of 3 (8-page reads); HughPC4 this same
    patch with page_cluster 4 (16-page reads); HughPC0 with page_cluster 0
    (1-page reads: no readahead).

    HDD for swapping to harddisk, SSD for swapping to VertexII SSD. Seq for
    sequential access to the mapping, cycling five times around; Rand for
    the same number of random touches. Anon for a MAP_PRIVATE anon mapping;
    Shmem for a MAP_SHARED anon mapping, equivalent to tmpfs.

    One weakness of Shaohua's vma/anon_vma approach was that it did not
    optimize Shmem: seen below. Konstantin's approach was perhaps mistuned,
    50% slower on Seq: did not compete and is not shown below.

    HDD         Vanilla  Shaohua  HughOld  HughNew  HughPC4  HughPC0
    Seq Anon      73921    76210    75611    76904    78191   121542
    Seq Shmem     73601    73176    73855    72947    74543   118322
    Rand Anon    895392   831243   871569   845197   846496   841680
    Rand Shmem  1058375  1053486   827935   764955   764376   756489

    SSD         Vanilla  Shaohua  HughOld  HughNew  HughPC4  HughPC0
    Seq Anon      24634    24198    24673    25107    21614    70018
    Seq Shmem     24959    24932    25052    25703    22030    69678
    Rand Anon     43014    26146    28075    25989    26935    25901
    Rand Shmem    45349    45215    28249    24268    24138    24332

    These tests are, of course, two extremes of a very simple case: under
    heavier mixed loads I've not yet observed any consistent improvement or
    degradation, and wider testing would be welcome.

    Shaohua Li:

    Testing shows Vanilla is slightly better in the sequential workload than
    Hugh's patch. I observed that with Hugh's patch the readahead size is
    sometimes shrunk too fast (from 8 to 1 immediately) in the sequential
    workload if there is no hit. And in such a case, continuing to do
    readahead is actually good.

    I didn't prepare a sophisticated algorithm for the sequential workload
    because so far we can't guarantee that sequentially accessed pages are
    swapped out sequentially. So I slightly changed Hugh's heuristic - don't
    shrink the readahead size too fast.
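
    A sketch of the resulting heuristic, assuming a swapin_readahead_hits
    counter that is bumped whenever a read-ahead page is actually faulted
    in (simplified from memory; details illustrative):

    static unsigned long swapin_nr_pages(unsigned long offset)
    {
            static unsigned long prev_offset;
            static atomic_t last_readahead_pages;
            unsigned int pages, max_pages, last_ra;

            max_pages = 1 << ACCESS_ONCE(page_cluster);
            if (max_pages <= 1)
                    return 1;

            /* size the window from how many readahead pages were hit */
            pages = atomic_xchg(&swapin_readahead_hits, 0) + 2;
            if (pages == 2) {
                    /* no hits to judge by: only read ahead if sequential */
                    if (offset != prev_offset + 1 && offset != prev_offset - 1)
                            pages = 1;
                    prev_offset = offset;
            } else {
                    unsigned int roundup = 4;

                    while (roundup < pages)
                            roundup <<= 1;
                    pages = roundup;
            }
            if (pages > max_pages)
                    pages = max_pages;

            /* don't shrink the readahead size too fast */
            last_ra = atomic_read(&last_readahead_pages) / 2;
            if (pages < last_ra)
                    pages = last_ra;
            atomic_set(&last_readahead_pages, pages);

            return pages;
    }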

    Here is my test result (unit second, 3 runs average):
              Vanilla   Hugh    New
    Seq           356    370    360
    Random       4525   2447   2444

    Attached graph is the swapin/swapout throughput I collected with 'vmstat
    2'. The first part is running a random workload (till around 1200 of
    the x-axis) and the second part is running a sequential workload.
    swapin and swapout throughput are almost identical in steady state in
    both workloads, which is the expected behavior, while in Vanilla swapin
    is much bigger than swapout, especially in the random workload (because
    of the wrong readahead).

    Original patches by: Shaohua Li and Konstantin Khlebnikov.

    [fengguang.wu@intel.com: swapin_nr_pages() can be static]
    Signed-off-by: Hugh Dickins
    Signed-off-by: Shaohua Li
    Signed-off-by: Fengguang Wu
    Cc: Rik van Riel
    Cc: Wu Fengguang
    Cc: Minchan Kim
    Cc: Konstantin Khlebnikov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shaohua Li
     

04 Feb, 2014

1 commit

  • arch/arm/mach-tegra/pm.c, kernel/power/console.c and mm/vmpressure.c
    were somehow getting slab.h indirectly through cgroup.h which in turn
    was getting it indirectly through xattr.h. A scheduled cgroup change
    drops xattr.h inclusion from cgroup.h and breaks compilation of these
    three files. Add explicit slab.h includes to the three files.

    A pending cgroup patch depends on this change and it'd be great if
    this can be routed through cgroup/for-3.14-fixes branch.

    Signed-off-by: Tejun Heo
    Acked-by: Stephen Warren
    Cc: Thierry Reding
    Cc: linux-tegra@vger.kernel.org
    Cc: "Rafael J. Wysocki"
    Cc: linux-pm@vger.kernel.org
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Balbir Singh
    Cc: KAMEZAWA Hiroyuki
    Cc: cgroups@vger.kernel.org

    Tejun Heo
     

03 Feb, 2014

1 commit

  • Pull SLAB changes from Pekka Enberg:
    "Random bug fixes that have accumulated in my inbox over the past few
    months"

    * 'slab/next' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/linux:
    mm: Fix warning on make htmldocs caused by slab.c
    mm: slub: work around unneeded lockdep warning
    mm: sl[uo]b: fix misleading comments
    slub: Fix possible format string bug.
    slub: use lockdep_assert_held
    slub: Fix calculation of cpu slabs
    slab.h: remove duplicate kmalloc declaration and fix kernel-doc warnings

    Linus Torvalds
     

31 Jan, 2014

9 commits

  • This patch fixes the following warnings from make htmldocs:
    Warning(/mm/slab.c:1956): No description found for parameter 'page'
    Warning(/mm/slab.c:1956): Excess function parameter 'slabp' description in 'slab_destroy'

    The incorrect function parameter "slabp" was documented instead of "page".

    Acked-by: Christoph Lameter
    Signed-off-by: Masanari Iida
    Signed-off-by: Pekka Enberg

    Masanari Iida
     
  • The slub code does some setup during early boot in
    early_kmem_cache_node_alloc() with some local data. There is no
    possible way that another CPU can see this data, so the slub code
    doesn't unnecessarily lock it. However, some new lockdep asserts
    check to make sure that add_partial() _always_ has the list_lock
    held.

    Just add the locking, even though it is technically unnecessary.
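
    A sketch of the workaround in early_kmem_cache_node_alloc()
    (illustrative):

    /*
     * Taken purely to satisfy lockdep_assert_held() in add_partial():
     * no other CPU can see this kmem_cache_node yet, so there is no
     * actual race being protected against.
     */
    spin_lock(&n->list_lock);
    add_partial(n, page, DEACTIVATE_TO_HEAD);
    spin_unlock(&n->list_lock);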

    Cc: Peter Zijlstra
    Cc: Russell King
    Acked-by: David Rientjes
    Signed-off-by: Dave Hansen
    Signed-off-by: Pekka Enberg

    Dave Hansen
     
  • Commit 842e2873697e ("memcg: get rid of kmem_cache_dup()") introduced a
    mutex for memcg_create_kmem_cache() to protect the tmp_name buffer that
    holds the memcg name. It failed to unlock the mutex if this buffer
    could not be allocated.

    This patch fixes the issue by appropriately unlocking the mutex if the
    allocation fails.
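
    A sketch of the fixed error path (the static mutex guarding the
    tmp_name buffer is per the changelog; other details are illustrative):

    mutex_lock(&mutex);             /* protects the static tmp_name buffer */
    if (!tmp_name) {
            tmp_name = kmalloc(PATH_MAX, GFP_KERNEL);
            if (!tmp_name)
                    goto out;       /* previously: return NULL with the mutex held */
    }
    /* ... build the name and create the per-memcg cache ... */
    out:
    mutex_unlock(&mutex);
    return new;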

    Signed-off-by: Vladimir Davydov
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Cc: Glauber Costa
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • A bonus of 3% of system memory is sometimes excessive in comparison to
    other processes.

    With commit a63d83f427fb ("oom: badness heuristic rewrite"), the OOM
    killer tries to avoid killing privileged tasks by subtracting 3% of
    overall memory (system or cgroup) from their per-task consumption. But
    as a result, all root tasks that consume less than 3% of overall memory
    are considered equal, and so it only takes 33+ privileged tasks pushing
    the system out of memory for the OOM killer to do something stupid and
    kill dhclient or other root-owned processes. For example, on a 32G
    machine it can't tell the difference between the 1M agetty and the 10G
    fork bomb member.

    The changelog describes this 3% boost as the equivalent to the global
    overcommit limit being 3% higher for privileged tasks, but this is not
    the same as discounting 3% of overall memory from _every privileged task
    individually_ during OOM selection.

    Replace the 3% of system memory bonus with a 3% of current memory usage
    bonus.

    By giving root tasks a bonus that is proportional to their actual size,
    they remain comparable even when relatively small. In the example
    above, the OOM killer will discount the 1M agetty's 256 badness points
    down to 179, and the 10G fork bomb's 262144 points down to 183500 points
    and make the right choice, instead of discounting both to 0 and killing
    agetty because it's first in the task list.
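
    A sketch of the arithmetic change in oom_badness() (the CAP_SYS_ADMIN
    check is how root is detected there; the "before" line is shown
    conceptually rather than as the literal old code):

    if (has_capability_noaudit(p, CAP_SYS_ADMIN)) {
            /* before (conceptually): points -= totalpages * 3 / 100;  */
            /* after: a bonus proportional to the task's own usage     */
            points -= points * 3 / 100;
    }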

    Signed-off-by: David Rientjes
    Reported-by: Johannes Weiner
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • Commit abca7c496584 ("mm: fix slab->page _count corruption when using
    slub") notes that we can not _set_ a page->counters directly, except
    when using a real double-cmpxchg. Doing so can lose updates to
    ->_count.

    That is an absolute rule:

    You may not *set* page->counters except via a cmpxchg.

    Commit abca7c496584 fixed this for the folks who have the slub
    cmpxchg_double code turned off at compile time, but it left the bad case
    alone. It can still be reached, and the same bug triggered in two
    cases:

    1. Turning on slub debugging at runtime, which is available on
       the distro kernels that I looked at.
    2. On 64-bit CPUs with no CMPXCHG16B (some early AMD x86-64
       cpus, evidently)

    There are at least 3 ways we could fix this:

    1. Take all of the existing calls to cmpxchg_double_slab() and
       __cmpxchg_double_slab() and convert them to take an old, new
       and target 'struct page'.
    2. Do (1), but with the newly-introduced 'slub_data'.
    3. Do some magic inside the two cmpxchg...slab() functions to
       pull the counters out of new_counters and only set those
       fields in page->{inuse,frozen,objects}.

    I've done (2) as well, but it's a bunch more code. This patch is an
    attempt at (3). This was the most straightforward and foolproof way
    that I could think to do this.

    This would also technically allow us to get rid of the ugly

    #if defined(CONFIG_HAVE_CMPXCHG_DOUBLE) && \
    defined(CONFIG_HAVE_ALIGNED_STRUCT_PAGE)

    in 'struct page', but leaving it alone has the added benefit that
    'counters' stays 'unsigned' instead of 'unsigned long', so all the
    copies that the slub code does stay a bit smaller.
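
    A sketch of approach (3) above (helper name illustrative): decode the
    new value through a temporary struct page and copy only the slub-owned
    fields, leaving _count untouched.

    static inline void set_page_slub_counters(struct page *page,
                                              unsigned long counters_new)
    {
            struct page tmp;

            tmp.counters = counters_new;
            /*
             * page->counters overlaps frozen/inuse/objects *and* _count;
             * assigning ->counters wholesale can lose concurrent updates
             * to _count, so only write the fields slub actually owns.
             */
            page->frozen  = tmp.frozen;
            page->inuse   = tmp.inuse;
            page->objects = tmp.objects;
    }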

    Signed-off-by: Dave Hansen
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Matt Mackall
    Cc: Pravin B Shelar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Hansen
     
  • As a result of commit 5606e3877ad8 ("mm: numa: Migrate on reference
    policy"), /proc/<pid>/numa_maps prints the mempolicy for any <vma> as
    "prefer:N" for the local node, N, of the process reading the file.

    This should only be printed when the mempolicy of <vma> is
    MPOL_PREFERRED for node N.

    If the process is actually only using the default mempolicy for local
    node allocation, make sure "default" is printed as expected.

    Signed-off-by: David Rientjes
    Reported-by: Robert Lippert
    Cc: Peter Zijlstra
    Acked-by: Mel Gorman
    Cc: Ingo Molnar
    Cc: [3.7+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • Add my copyright to the zsmalloc source code which I maintain.

    Signed-off-by: Minchan Kim
    Cc: Nitin Gupta
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • This patch moves zsmalloc under mm directory.

    Before that, the description below explains why we have needed a custom
    allocator.

    Zsmalloc is a new slab-based memory allocator for storing compressed
    pages. It is designed for low fragmentation and high allocation success
    rate on large object, but
    Acked-by: Nitin Gupta
    Reviewed-by: Konrad Rzeszutek Wilk
    Cc: Bob Liu
    Cc: Greg Kroah-Hartman
    Cc: Hugh Dickins
    Cc: Jens Axboe
    Cc: Luigi Semenzato
    Cc: Mel Gorman
    Cc: Pekka Enberg
    Cc: Rik van Riel
    Cc: Seth Jennings
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • Pull core block IO changes from Jens Axboe:
    "The major piece in here is the immutable bio_ve series from Kent, the
    rest is fairly minor. It was supposed to go in last round, but
    various issues pushed it to this release instead. The pull request
    contains:

    - Various smaller blk-mq fixes from different folks. Nothing major
    here, just minor fixes and cleanups.

    - Fix for a memory leak in the error path in the block ioctl code
    from Christian Engelmayer.

    - Header export fix from CaiZhiyong.

    - Finally the immutable biovec changes from Kent Overstreet. This
    enables some nice future work on making arbitrarily sized bios
    possible, and splitting more efficient. Related fixes to immutable
    bio_vecs:

    - dm-cache immutable fixup from Mike Snitzer.
    - btrfs immutable fixup from Muthu Kumar.

    - bio-integrity fix from Nic Bellinger, which is also going to stable"

    * 'for-3.14/core' of git://git.kernel.dk/linux-block: (44 commits)
    xtensa: fixup simdisk driver to work with immutable bio_vecs
    block/blk-mq-cpu.c: use hotcpu_notifier()
    blk-mq: for_each_* macro correctness
    block: Fix memory leak in rw_copy_check_uvector() handling
    bio-integrity: Fix bio_integrity_verify segment start bug
    block: remove unrelated header files and export symbol
    blk-mq: uses page->list incorrectly
    blk-mq: use __smp_call_function_single directly
    btrfs: fix missing increment of bi_remaining
    Revert "block: Warn and free bio if bi_end_io is not set"
    block: Warn and free bio if bi_end_io is not set
    blk-mq: fix initializing request's start time
    block: blk-mq: don't export blk_mq_free_queue()
    block: blk-mq: make blk_sync_queue support mq
    block: blk-mq: support draining mq queue
    dm cache: increment bi_remaining when bi_end_io is restored
    block: fixup for generic bio chaining
    block: Really silence spurious compiler warnings
    block: Silence spurious compiler warnings
    block: Kill bio_pair_split()
    ...

    Linus Torvalds
     

30 Jan, 2014

7 commits

  • In the original bootmem wrapper for memblock, we had limit checking.

    Add it to memblock_virt_alloc to address the arm and x86 booting crashes.

    Signed-off-by: Yinghai Lu
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Reported-by: Kevin Hilman
    Tested-by: Kevin Hilman
    Reported-by: Olof Johansson
    Tested-by: Olof Johansson
    Reported-by: Konrad Rzeszutek Wilk
    Tested-by: Konrad Rzeszutek Wilk
    Cc: Dave Hansen
    Cc: Santosh Shilimkar
    Cc: "Strashko, Grygorii"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yinghai Lu
     
  • Commit 63d0f0a3c7e1 ("mm/readahead.c:do_readhead(): don't check for
    ->readpage") unintentionally made do_readahead return 0 for all valid
    files regardless of whether readahead was supported, rather than the
    expected -EINVAL. This gets forwarded on to userspace, and results in
    sys_readahead appearing to succeed in cases that don't make sense (e.g.
    when called on pipes or sockets). This issue is detected by the LTP
    readahead01 testcase.

    As the exact return value of force_page_cache_readahead is currently
    never used, we can simplify it to return only 0 or -EINVAL (when
    readpage or readpages is missing). With that in place we can simply
    forward on the return value of force_page_cache_readahead in
    do_readahead.

    This patch performs said change, restoring the expected semantics.
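
    A sketch of the resulting mm/readahead.c shape (from memory;
    illustrative):

    static ssize_t do_readahead(struct address_space *mapping, struct file *filp,
                                pgoff_t index, unsigned long nr)
    {
            if (!mapping || !mapping->a_ops)
                    return -EINVAL;

            /*
             * force_page_cache_readahead() now returns 0 or -EINVAL itself,
             * e.g. -EINVAL when neither ->readpage nor ->readpages exists.
             */
            return force_page_cache_readahead(mapping, filp, index, nr);
    }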

    Signed-off-by: Mark Rutland
    Acked-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mark Rutland
     
  • Commit 309381feaee5 ("mm: dump page when hitting a VM_BUG_ON using
    VM_BUG_ON_PAGE") added a bunch of VM_BUG_ON_PAGE() calls.

    But, most of the ones in the slub code are for _temporary_ 'struct
    page's which are declared on the stack and likely have lots of gunk in
    them. Dumping their contents out will just confuse folks looking at
    bad_page() output. Plus, if we try to page_to_pfn() on them or
    something, we'll probably oops anyway.

    Turn them back into VM_BUG_ON()s.

    Signed-off-by: Dave Hansen
    Cc: Sasha Levin
    Cc: "Kirill A. Shutemov"
    Cc: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Hansen
     
  • On kmem_cache_create_memcg() error path we set 'err', but leave 's' (the
    new cache ptr) undefined. The latter can be NULL if we could not
    allocate the cache, or pointing to a freed area if we failed somewhere
    later while trying to initialize it. Initially we checked 'err'
    immediately before exiting the function and returned NULL if it was set
    ignoring the value of 's':

    out_unlock:
            ...
            if (err) {
                    /* report error */
                    return NULL;
            }
            return s;

    Recently this check was, in fact, broken by commit f717eb3abb5e ("slab:
    do not panic if we fail to create memcg cache"), which turned it to:

    out_unlock:
            ...
            if (err && !memcg) {
                    /* report error */
                    return NULL;
            }
            return s;

    As a result, if we are failing creating a cache for a memcg, we will
    skip the check and return 's' that can contain crap. Obviously, commit
    f717eb3abb5e intended not to return crap on error allocating a cache for
    a memcg, but only to remove the error reporting in this case, so the
    check should look like this:

    out_unlock:
            ...
            if (err) {
                    if (!memcg)
                            return NULL;
                    /* report error */
                    return NULL;
            }
            return s;

    [rientjes@google.com: despaghettification]
    [vdavydov@parallels.com: patch monkeying]
    Signed-off-by: David Rientjes
    Signed-off-by: Vladimir Davydov
    Signed-off-by: Dave Jones
    Reported-by: Dave Jones
    Acked-by: Pekka Enberg
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Jones
     
  • A few printk(KERN_*'s have snuck in there.

    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • The command line parsing takes place before jump labels are initialised
    which generates a warning if numa_balancing= is specified and
    CONFIG_JUMP_LABEL is set.

    On older kernels before commit c4b2c0c5f647 ("static_key: WARN on usage
    before jump_label_init was called") the kernel would have crashed. This
    patch enables automatic numa balancing later in the initialisation
    process if numa_balancing= is specified.

    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel
    Cc: stable
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • The VM is currently heavily tuned to avoid swapping. Whether that is
    good or bad is a separate discussion, but as long as the VM won't swap
    to make room for dirty cache, we can not consider anonymous pages when
    calculating the amount of dirtyable memory, the baseline to which
    dirty_background_ratio and dirty_ratio are applied.

    A simple workload that occupies a significant size (40+%, depending on
    memory layout, storage speeds etc.) of memory with anon/tmpfs pages and
    uses the remainder for a streaming writer demonstrates this problem. In
    that case, the actual cache pages are a small fraction of what is
    considered dirtyable overall, which results in a relatively large
    portion of the cache pages being dirtied. As kswapd starts rotating
    these, random tasks enter direct reclaim and stall on IO.

    Only consider free pages and file pages dirtyable.

    Signed-off-by: Johannes Weiner
    Reported-by: Tejun Heo
    Tested-by: Tejun Heo
    Reviewed-by: Rik van Riel
    Cc: Mel Gorman
    Cc: Wu Fengguang
    Reviewed-by: Michal Hocko
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner