29 Apr, 2014

1 commit

  • BUG_ON() is a big hammer, and should be used _only_ if there is some
    major corruption that you cannot possibly recover from, making it
    imperative that the current process (and possibly the whole machine) be
    terminated with extreme prejudice.

    The trivial sanity check in the vmacache code is *not* such a fatal
    error. Recovering from it is absolutely trivial, and using BUG_ON()
    just makes it harder to debug for no actual advantage.

    To make matters worse, the placement of the BUG_ON() (only if the range
    check matched) actually makes it harder to hit the sanity check to begin
    with, so _if_ there is a bug (and we just got a report from Srivatsa
    Bhat that this can indeed trigger), it is harder to debug not just
    because the machine is possibly dead, but because we don't have better
    coverage.

    BUG_ON() must *die*. Maybe we should add a checkpatch warning for it,
    because it is simply just about the worst thing you can ever do if you
    hit some "this cannot happen" situation.

    Reported-by: Srivatsa S. Bhat
    Cc: Davidlohr Bueso
    Cc: Andrew Morton
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

26 Apr, 2014

1 commit

  • The mmu-gather operation 'tlb_flush_mmu()' has done two things: the
    actual tlb flush operation, and the batched freeing of the pages that
    the TLB entries pointed at.

    This splits the operation into separate phases, so that the forced
    batched flushing done by zap_pte_range() can now do the actual TLB flush
    while still holding the page table lock, but delay the batched freeing
    of all the pages to after the lock has been dropped.

    This in turn allows us to avoid a race condition between
    set_page_dirty() (as called by zap_pte_range() when it finds a dirty
    shared memory pte) and page_mkclean(): because we now flush all the
    dirty page data from the TLB's while holding the pte lock,
    page_mkclean() will be held up walking the (recently cleaned) page
    tables until after the TLB entries have been flushed from all CPU's.
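
    The resulting shape, roughly (the phase names shown, tlb_flush_mmu_tlbonly()
    and tlb_flush_mmu_free(), are the ones mainline uses for this split; the
    surrounding mmu_gather details are elided):

    void tlb_flush_mmu(struct mmu_gather *tlb)
    {
            tlb_flush_mmu_tlbonly(tlb);     /* phase 1: flush the hardware TLBs */
            tlb_flush_mmu_free(tlb);        /* phase 2: free the batched pages */
    }

    /* ...which lets zap_pte_range() force a flush under the page table lock: */
    if (force_flush)
            tlb_flush_mmu_tlbonly(tlb);     /* still holding the ptl */
    pte_unmap_unlock(start_pte, ptl);
    if (force_flush)
            tlb_flush_mmu_free(tlb);        /* page freeing after the unlock */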

    Reported-by: Benjamin Herrenschmidt
    Tested-by: Dave Hansen
    Acked-by: Hugh Dickins
    Cc: Peter Zijlstra
    Cc: Russell King - ARM Linux
    Cc: Tony Luck
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

23 Apr, 2014

1 commit

  • fixup_user_fault() is used by the futex code when the direct user access
    fails, and the futex code wants it to either map in the page in a usable
    form or return an error. It relied on handle_mm_fault() to map the
    page, and correctly checked the error return from that, but while that
    does map the page, it doesn't actually guarantee that the page will be
    mapped with sufficient permissions to be then accessed.

    So do the appropriate tests of the vma access rights by hand.

    [ Side note: arguably handle_mm_fault() could just do that itself, but
    we have traditionally done it in the caller, because some callers -
    notably get_user_pages() - have been able to access pages even when
    they are mapped with PROT_NONE. Maybe we should re-visit that design
    decision, but in the meantime this is the minimal patch. ]
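
    A sketch of the check added to fixup_user_fault() (signatures as of that
    kernel generation; error handling trimmed down):

    int fixup_user_fault(struct task_struct *tsk, struct mm_struct *mm,
                         unsigned long address, unsigned int fault_flags)
    {
            struct vm_area_struct *vma;
            vm_flags_t vm_flags;
            int ret;

            /* which access right does the caller actually need? */
            vm_flags = (fault_flags & FAULT_FLAG_WRITE) ? VM_WRITE : VM_READ;

            vma = find_extend_vma(mm, address);
            if (!vma || address < vma->vm_start)
                    return -EFAULT;

            /* the vma must grant that access, or faulting the page in is pointless */
            if (!(vm_flags & vma->vm_flags))
                    return -EFAULT;

            ret = handle_mm_fault(mm, vma, address, fault_flags);
            if (ret & VM_FAULT_ERROR)
                    return -EFAULT;         /* OOM/SIGBUS mapping elided */
            return 0;
    }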

    Found by Dave Jones running his trinity tool.

    Reported-by: Dave Jones
    Acked-by: Hugh Dickins
    Cc: stable@vger.kernel.org
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

19 Apr, 2014

4 commits

  • Sasha Levin has reported two THP BUGs[1][2]. I believe both of them
    have the same root cause. Let's look at them one by one.

    The first bug[1] is "kernel BUG at mm/huge_memory.c:1829!". It's
    BUG_ON(mapcount != page_mapcount(page)) in __split_huge_page(). From my
    testing I see that page_mapcount() is higher than mapcount here.

    I think it happens due to a race between zap_huge_pmd() and
    page_check_address_pmd(). page_check_address_pmd() misses the PMD which
    is under zap:

    CPU0:  zap_huge_pmd()
             pmdp_get_and_clear()

    CPU1:  __split_huge_page()
             anon_vma_interval_tree_foreach()
               __split_huge_page_splitting()
                 page_check_address_pmd()
                   mm_find_pmd()
                     /*
                      * We check if PMD present without taking ptl: no
                      * serialization against zap_huge_pmd(). We miss this PMD,
                      * it's not accounted to 'mapcount' in __split_huge_page().
                      */
                     pmd_present(pmd) == 0

             BUG_ON(mapcount != page_mapcount(page)) // CRASH!!!

    CPU0:    page_remove_rmap(page)
               atomic_add_negative(-1, &page->_mapcount)

    The second bug[2] is "kernel BUG at mm/huge_memory.c:1371!".
    It's VM_BUG_ON_PAGE(!PageHead(page), page) in zap_huge_pmd().

    This happens in a similar way:

    CPU0:  zap_huge_pmd()
             pmdp_get_and_clear()
             page_remove_rmap(page)
               atomic_add_negative(-1, &page->_mapcount)

    CPU1:  __split_huge_page()
             anon_vma_interval_tree_foreach()
               __split_huge_page_splitting()
                 page_check_address_pmd()
                   mm_find_pmd()
                     pmd_present(pmd) == 0 /* The same comment as above */
             /*
              * No crash this time since we already decremented page->_mapcount
              * in zap_huge_pmd().
              */
             BUG_ON(mapcount != page_mapcount(page))

             /*
              * We split the compound page here into small pages without
              * serialization against zap_huge_pmd()
              */
             __split_huge_page_refcount()

    CPU0:    VM_BUG_ON_PAGE(!PageHead(page), page); // CRASH!!!

    So my understanding is that the problem is the pmd_present() check in
    mm_find_pmd(), done without taking the page table lock.

    The bug was introduced by my commit 117b0791ac42. Sorry for that. :(

    Let's open-code mm_find_pmd() in page_check_address_pmd() and do the
    check under the page table lock.

    Note that __page_check_address() does the same for PTE entries
    if sync != 0.
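
    For reference, the core of the fix, roughly: take the PMD's page table
    lock first and only then test pmd_present(), so a concurrent
    zap_huge_pmd() cannot slip through:

    pgd = pgd_offset(mm, address);
    if (!pgd_present(*pgd))
            return NULL;
    pud = pud_offset(pgd, address);
    if (!pud_present(*pud))
            return NULL;
    pmd = pmd_offset(pud, address);

    *ptl = pmd_lock(mm, pmd);       /* lock first ...                 */
    if (!pmd_present(*pmd))         /* ... then check, under the lock */
            goto unlock;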

    I've stress tested split and zap code paths for 36+ hours by now and
    don't see crashes with the patch applied. Before it took
    [2] https://lkml.kernel.org/g/

    Signed-off-by: Kirill A. Shutemov
    Reported-by: Sasha Levin
    Tested-by: Sasha Levin
    Cc: Bob Liu
    Cc: Andrea Arcangeli
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Michel Lespinasse
    Cc: Dave Jones
    Cc: Vlastimil Babka
    Cc: [3.13+]

    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Fix new kernel-doc warning in mm/filemap.c:

    Warning(mm/filemap.c:2600): Excess function parameter 'ppos' description in '__generic_file_aio_write'

    Signed-off-by: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Randy Dunlap
     
  • The soft lockup in freeing gigantic hugepages, fixed in commit
    55f67141a892 ("mm: hugetlb: fix softlockup when a large number of
    hugepages are freed"), can also happen in
    return_unused_surplus_pages(), so let's fix it there as well.
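
    The fix is the same pattern as in 55f67141a892: a periodic
    cond_resched_lock() inside the freeing loop. Roughly (the surrounding
    loop is approximate):

    /* in return_unused_surplus_pages(), with hugetlb_lock held */
    while (nr_pages--) {
            if (!free_pool_huge_page(h, &node_states[N_MEMORY], 1))
                    break;
            /* drop hugetlb_lock, reschedule if needed, then retake it */
            cond_resched_lock(&hugetlb_lock);
    }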

    Signed-off-by: Masayoshi Mizuma
    Signed-off-by: Naoya Horiguchi
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Aneesh Kumar
    Cc: KOSAKI Motohiro
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mizuma, Masayoshi
     
  • This code seems to be called with preemption enabled, so it must use
    mod_zone_page_state() instead.
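
    For context: __mod_zone_page_state() assumes preemption (or interrupts)
    are already disabled, while mod_zone_page_state() takes care of that
    itself. Illustration only; the zone, item and delta below are
    placeholders, not the actual call site:

    /* not safe if the caller may be preempted: */
    __mod_zone_page_state(zone, NR_FREE_PAGES, nr_pages);

    /* preemption-safe variant: */
    mod_zone_page_state(zone, NR_FREE_PAGES, nr_pages);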

    Signed-off-by: Christoph Lameter
    Reported-by: Grygorii Strashko
    Tested-by: Grygorii Strashko
    Cc: Tejun Heo
    Cc: Santosh Shilimkar
    Cc: Ingo Molnar
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

14 Apr, 2014

2 commits

  • Some versions of gcc even warn about it:

    mm/shmem.c: In function ‘shmem_file_aio_read’:
    mm/shmem.c:1414: warning: ‘error’ may be used uninitialized in this function

    If the loop is aborted during the first iteration by one of the first
    two break statements, error will be uninitialized.

    Introduced by commit 6e58e79db8a1 ("introduce copy_page_to_iter, kill
    loop over iovec in generic_file_aio_read()").
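
    One minimal way to address it is to give 'error' a defined value before
    the loop (a sketch; the actual patch may differ):

    error = 0;      /* an early 'break' now reports success, not garbage */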

    Signed-off-by: Geert Uytterhoeven
    Acked-by: Al Viro
    Signed-off-by: Linus Torvalds

    Geert Uytterhoeven
     
  • Pull slab changes from Pekka Enberg:
    "The biggest change is byte-sized freelist indices which reduces slab
    freelist memory usage:

    https://lkml.org/lkml/2013/12/2/64"

    * 'slab/next' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/linux:
    mm: slab/slub: use page->list consistently instead of page->lru
    mm/slab.c: cleanup outdated comments and unify variables naming
    slab: fix wrongly used macro
    slub: fix high order page allocation problem with __GFP_NOFAIL
    slab: Make allocations with GFP_ZERO slightly more efficient
    slab: make more slab management structure off the slab
    slab: introduce byte sized index for the freelist of a slab
    slab: restrict the number of objects in a slab
    slab: introduce helper functions to get/set free object
    slab: factor out calculate nr objects in cache_estimate

    Linus Torvalds
     

13 Apr, 2014

2 commits

  • Pull vfs updates from Al Viro:
    "The first vfs pile, with deep apologies for being very late in this
    window.

    Assorted cleanups and fixes, plus a large preparatory part of iov_iter
    work. There's a lot more of that, but it'll probably go into the next
    merge window - it *does* shape up nicely, removes a lot of
    boilerplate, gets rid of locking inconsistencies between aio_write and
    splice_write and I hope to get Kent's direct-io rewrite merged into
    the same queue, but some of the stuff after this point is having
    (mostly trivial) conflicts with the things already merged into
    mainline and with some I want more testing.

    This one passes LTP and xfstests without regressions, in addition to
    usual beating. BTW, readahead02 in ltp syscalls testsuite has started
    giving failures since "mm/readahead.c: fix readahead failure for
    memoryless NUMA nodes and limit readahead pages" - might be a false
    positive, might be a real regression..."

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (63 commits)
    missing bits of "splice: fix racy pipe->buffers uses"
    cifs: fix the race in cifs_writev()
    ceph_sync_{,direct_}write: fix an oops on ceph_osdc_new_request() failure
    kill generic_file_buffered_write()
    ocfs2_file_aio_write(): switch to generic_perform_write()
    ceph_aio_write(): switch to generic_perform_write()
    xfs_file_buffered_aio_write(): switch to generic_perform_write()
    export generic_perform_write(), start getting rid of generic_file_buffer_write()
    generic_file_direct_write(): get rid of ppos argument
    btrfs_file_aio_write(): get rid of ppos
    kill the 5th argument of generic_file_buffered_write()
    kill the 4th argument of __generic_file_aio_write()
    lustre: don't open-code kernel_recvmsg()
    ocfs2: don't open-code kernel_recvmsg()
    drbd: don't open-code kernel_recvmsg()
    constify blk_rq_map_user_iov() and friends
    lustre: switch to kernel_sendmsg()
    ocfs2: don't open-code kernel_sendmsg()
    take iov_iter stuff to mm/iov_iter.c
    process_vm_access: tidy up a bit
    ...

    Linus Torvalds
     
  • Pull audit updates from Eric Paris.

    * git://git.infradead.org/users/eparis/audit: (28 commits)
    AUDIT: make audit_is_compat depend on CONFIG_AUDIT_COMPAT_GENERIC
    audit: renumber AUDIT_FEATURE_CHANGE into the 1300 range
    audit: do not cast audit_rule_data pointers pointlesly
    AUDIT: Allow login in non-init namespaces
    audit: define audit_is_compat in kernel internal header
    kernel: Use RCU_INIT_POINTER(x, NULL) in audit.c
    sched: declare pid_alive as inline
    audit: use uapi/linux/audit.h for AUDIT_ARCH declarations
    syscall_get_arch: remove useless function arguments
    audit: remove stray newline from audit_log_execve_info() audit_panic() call
    audit: remove stray newlines from audit_log_lost messages
    audit: include subject in login records
    audit: remove superfluous new- prefix in AUDIT_LOGIN messages
    audit: allow user processes to log from another PID namespace
    audit: anchor all pid references in the initial pid namespace
    audit: convert PPIDs to the inital PID namespace.
    pid: get pid_t ppid of task in init_pid_ns
    audit: rename the misleading audit_get_context() to audit_take_context()
    audit: Add generic compat syscall support
    audit: Add CONFIG_HAVE_ARCH_AUDITSYSCALL
    ...

    Linus Torvalds
     

12 Apr, 2014

1 commit


11 Apr, 2014

1 commit

  • 'struct page' has two list_head fields: 'lru' and 'list'. Conveniently,
    they are unioned together. This means that code can use them
    interchangeably, which gets horribly confusing, as with this nugget from
    slab.c:

    > list_del(&page->lru);
    > if (page->active == cachep->num)
    > list_add(&page->list, &n->slabs_full);

    This patch makes the slab and slub code use page->lru universally instead
    of mixing ->list and ->lru.

    So, the new rule is: page->lru is what you use if you want to keep
    your page on a list. Don't like the fact that it's not called ->list?
    Too bad.
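
    With that rule applied, the nugget above becomes:

    list_del(&page->lru);
    if (page->active == cachep->num)
            list_add(&page->lru, &n->slabs_full);   /* was &page->list */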

    Signed-off-by: Dave Hansen
    Acked-by: Christoph Lameter
    Acked-by: David Rientjes
    Cc: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Pekka Enberg

    Dave Hansen
     

09 Apr, 2014

1 commit

  • Page reclaim force-scans / swaps anonymous pages when file cache drops
    below the high watermark of a zone in order to prevent what little cache
    remains from thrashing.

    However, on bigger machines the high watermark value can be quite large
    and when the workload is dominated by a static anonymous/shmem set, the
    file set might just be a small window of used-once cache. In such
    situations, the VM starts swapping heavily when instead it should be
    recycling the no longer used cache.

    This is a longer-standing problem, but it's more likely to trigger after
    commit 81c0a2bb515f ("mm: page_alloc: fair zone allocator policy")
    because file pages can no longer accumulate in a single zone and are
    dispersed into smaller fractions among the available zones.

    To resolve this, do not force scan anon when file pages are low but
    instead rely on the scan/rotation ratios to make the right prediction.
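
    For reference, the heuristic being relaxed sits in get_scan_count() and
    looks roughly like this (shape approximate):

    if (global_reclaim(sc)) {
            free = zone_page_state(zone, NR_FREE_PAGES);
            if (unlikely(file + free <= high_wmark_pages(zone))) {
                    /*
                     * force anon scanning, even if the anon/shmem set is
                     * the workload's static working set
                     */
                    scan_balance = SCAN_ANON;
                    goto out;
            }
    }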

    Signed-off-by: Johannes Weiner
    Acked-by: Rafael Aquini
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Cc: Suleiman Souhlal
    Cc: [3.12+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

08 Apr, 2014

26 commits

  • Merge second patch-bomb from Andrew Morton:
    - the rest of MM
    - zram updates
    - zswap updates
    - exit
    - procfs
    - exec
    - wait
    - crash dump
    - lib/idr
    - rapidio
    - adfs, affs, bfs, ufs
    - cris
    - Kconfig things
    - initramfs
    - small amount of IPC material
    - percpu enhancements
    - early ioremap support
    - various other misc things

    * emailed patches from Andrew Morton : (156 commits)
    MAINTAINERS: update Intel C600 SAS driver maintainers
    fs/ufs: remove unused ufs_super_block_third pointer
    fs/ufs: remove unused ufs_super_block_second pointer
    fs/ufs: remove unused ufs_super_block_first pointer
    fs/ufs/super.c: add __init to init_inodecache()
    doc/kernel-parameters.txt: add early_ioremap_debug
    arm64: add early_ioremap support
    arm64: initialize pgprot info earlier in boot
    x86: use generic early_ioremap
    mm: create generic early_ioremap() support
    x86/mm: sparse warning fix for early_memremap
    lglock: map to spinlock when !CONFIG_SMP
    percpu: add preemption checks to __this_cpu ops
    vmstat: use raw_cpu_ops to avoid false positives on preemption checks
    slub: use raw_cpu_inc for incrementing statistics
    net: replace __this_cpu_inc in route.c with raw_cpu_inc
    modules: use raw_cpu_write for initialization of per cpu refcount.
    mm: use raw_cpu ops for determining current NUMA node
    percpu: add raw_cpu_ops
    slub: fix leak of 'name' in sysfs_slab_add
    ...

    Linus Torvalds
     
  • This patch creates a generic implementation of early_ioremap() support
    based on the existing x86 implementation. early_ioremap() is useful for
    early boot code which needs to temporarily map I/O or memory regions
    before normal mapping functions such as ioremap() are available.

    Some architectures have optional MMU. In the no-MMU case, the remap
    functions simply return the passed in physical address and the unmap
    functions do nothing.
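
    Typical use by early boot code looks roughly like this (declarations come
    from the new generic header; phys_addr and size are caller-specific):

    void __iomem *regs;

    regs = early_ioremap(phys_addr, size);  /* temporary, fixmap-backed mapping */
    if (regs) {
            /* ... read a firmware table / poke a device during early boot ... */
            early_iounmap(regs, size);      /* must be undone before ioremap() is usable */
    }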

    Signed-off-by: Mark Salter
    Acked-by: Catalin Marinas
    Acked-by: H. Peter Anvin
    Cc: Borislav Petkov
    Cc: Dave Young
    Cc: Will Deacon
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mark Salter
     
  • Statistics are not critical to the operation of the allocation but
    should also not cause too much overhead.

    When __this_cpu_inc() is altered to check that preemption is disabled,
    these updates trigger the check. Use raw_cpu_inc() to avoid the checks.
    Using this_cpu_ops may
    cause interrupt disable/enable sequences on various arches which may
    significantly impact allocator performance.
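
    The slub statistics helper then reads, roughly:

    static inline void stat(const struct kmem_cache *s, enum stat_item si)
    {
    #ifdef CONFIG_SLUB_STATS
            /*
             * The rmw is racy on a preemptible kernel, which is acceptable for
             * statistics; raw_cpu_inc() skips the preemption check and any irq
             * disable/enable an arch might otherwise need.
             */
            raw_cpu_inc(s->cpu_slab->stat[si]);
    #endif
    }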

    [akpm@linux-foundation.org: add comment]
    Signed-off-by: Christoph Lameter
    Cc: Fengguang Wu
    Cc: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • The failure paths of sysfs_slab_add don't release the allocation of
    'name' made by create_unique_id() a few lines above the context of the
    diff below. Create a common exit path to make it more obvious what
    needs freeing.
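
    The tail of sysfs_slab_add() then looks roughly like this (labels and
    surrounding steps approximate):

    err = sysfs_create_group(&s->kobj, &slab_attr_group);
    if (err)
            goto out_del_kobj;
    kobject_uevent(&s->kobj, KOBJ_ADD);
    if (!unmergeable)
            sysfs_slab_alias(s, s->name);   /* set up the first alias */
    out:
            if (!unmergeable)
                    kfree(name);    /* common exit frees the id from create_unique_id() */
            return err;
    out_del_kobj:
            kobject_del(&s->kobj);
            goto out;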

    [vdavydov@parallels.com: free the name only if !unmergeable]
    Signed-off-by: Dave Jones
    Signed-off-by: Vladimir Davydov
    Cc: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Jones
     
  • Currently, we try to arrange sysfs entries for memcg caches in the same
    manner as for global caches. Apart from turning /sys/kernel/slab into a
    mess when there are a lot of kmem-active memcgs created, it actually
    does not work properly - we won't create more than one link to a memcg
    cache in case its parent is merged with another cache. For instance, if
    A is a root cache merged with another root cache B, we will have the
    following sysfs setup:

    X
    A -> X
    B -> X

    where X is some unique id (see create_unique_id()). Now if memcgs M and
    N start to allocate from cache A (or B, which is the same), we will get:

    X
    X:M
    X:N
    A -> X
    B -> X
    A:M -> X:M
    A:N -> X:N

    Since B is an alias for A, we won't get entries B:M and B:N, which is
    confusing.

    It is more logical to have entries for memcg caches under the
    corresponding root cache's sysfs directory. This would allow us to keep
    the sysfs layout clean and avoid inconsistencies like the one described
    above.

    This patch does the trick. It creates a "cgroup" kset in each root
    cache kobject to keep its children caches there.
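
    Mechanically, roughly: each root cache's kobject grows a child kset named
    "cgroup", and memcg caches register under it instead of directly under
    /sys/kernel/slab (field name and error path approximate):

    #ifdef CONFIG_MEMCG_KMEM
            if (is_root_cache(s)) {
                    /* container for this root cache's per-memcg children */
                    s->memcg_kset = kset_create_and_add("cgroup", NULL, &s->kobj);
                    if (!s->memcg_kset) {
                            err = -ENOMEM;
                            goto out_del_kobj;
                    }
            }
    #endif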

    Signed-off-by: Vladimir Davydov
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Cc: David Rientjes
    Cc: Pekka Enberg
    Cc: Glauber Costa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • Otherwise, kzalloc() called from a memcg won't clear the whole object.

    Signed-off-by: Vladimir Davydov
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Cc: David Rientjes
    Cc: Pekka Enberg
    Cc: Glauber Costa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • Currently we destroy children caches at the very beginning of
    kmem_cache_destroy(). This is wrong, because the root cache will not
    necessarily be destroyed in the end - if it has aliases (refcount > 0),
    kmem_cache_destroy() will simply decrement its refcount and return. In
    this case, at best we will get a bunch of warnings in dmesg, like this
    one:

    kmem_cache_destroy kmalloc-32:0: Slab cache still has objects
    CPU: 1 PID: 7139 Comm: modprobe Tainted: G B W 3.13.0+ #117
    Call Trace:
    dump_stack+0x49/0x5b
    kmem_cache_destroy+0xdf/0xf0
    kmem_cache_destroy_memcg_children+0x97/0xc0
    kmem_cache_destroy+0xf/0xf0
    xfs_mru_cache_uninit+0x21/0x30 [xfs]
    exit_xfs_fs+0x2e/0xc44 [xfs]
    SyS_delete_module+0x198/0x1f0
    system_call_fastpath+0x16/0x1b

    At worst - if kmem_cache_destroy() races with an allocation from a
    memcg cache - the kernel will panic.

    This patch fixes this by moving the destruction of children caches to
    after the check whether the cache has aliases. Plus, it forbids
    destroying a root cache if it still has children caches, because each
    child cache keeps a reference to its parent.

    Signed-off-by: Vladimir Davydov
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Cc: David Rientjes
    Cc: Pekka Enberg
    Cc: Glauber Costa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • Currently, memcg_unregister_cache(), which deletes the cache being
    destroyed from the memcg_slab_caches list, is called after
    __kmem_cache_shutdown() (see kmem_cache_destroy()), which starts to
    destroy the cache.

    As a result, one can access a partially destroyed cache while traversing
    a memcg_slab_caches list, which can have deadly consequences (for
    instance, cache_show() called for each cache on a memcg_slab_caches list
    from mem_cgroup_slabinfo_read() will dereference pointers to already
    freed data).

    To fix this, let's move memcg_unregister_cache() to before the cache
    destruction process begins, issuing memcg_register_cache() again on
    failure.

    Signed-off-by: Vladimir Davydov
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Cc: David Rientjes
    Cc: Pekka Enberg
    Cc: Glauber Costa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • Memcg-awareness turned kmem_cache_create() into a dirty interweaving of
    memcg-only and except-for-memcg calls. To clean this up, let's move the
    code responsible for memcg cache creation to a separate function.

    Signed-off-by: Vladimir Davydov
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Cc: David Rientjes
    Cc: Pekka Enberg
    Cc: Glauber Costa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • This patch cleans up the memcg cache creation path as follows:

    - Move memcg cache name creation to a separate function to be called
    from kmem_cache_create_memcg(). This allows us to get rid of the mutex
    protecting the temporary buffer used for the name formatting, because
    the whole cache creation path is protected by the slab_mutex.

    - Get rid of memcg_create_kmem_cache(). This function serves as a proxy
    to kmem_cache_create_memcg(). After separating the cache name creation
    path, it would be reduced to a function call, so let's inline it.

    Signed-off-by: Vladimir Davydov
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Cc: David Rientjes
    Cc: Pekka Enberg
    Cc: Glauber Costa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • When a kmem cache is created (kmem_cache_create_memcg()), we first try to
    find a compatible cache that already exists and can handle requests from
    the new cache, i.e. has the same object size, alignment, ctor, etc. If
    there is such a cache, we do not create any new caches, instead we simply
    increment the refcount of the cache found and return it.

    Currently we do this procedure not only when creating root caches, but
    also for memcg caches. However, there is no point in that, because, as
    every memcg cache has exactly the same parameters as its parent and cache
    merging cannot be turned off in runtime (only on boot by passing
    "slub_nomerge"), the root caches of any two potentially mergeable memcg
    caches should be merged already, i.e. it must be the same root cache, and
    therefore we couldn't even get to the memcg cache creation, because it
    already exists.

    The only exception is boot caches - they are explicitly forbidden to be
    merged by setting their refcount to -1. There are currently only two of
    them - kmem_cache and kmem_cache_node, which are used in slab internals (I
    do not count kmalloc caches as their refcount is set to 1 immediately
    after creation). Since they are preliminarily prevented from merging, I
    guess we should avoid merging their children too.

    So let's remove the useless code responsible for merging memcg caches.

    Signed-off-by: Vladimir Davydov
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Cc: David Rientjes
    Cc: Pekka Enberg
    Cc: Glauber Costa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • Fix following trivial checkpatch error:

    ERROR: return is not a function, parentheses are not required
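
    The class of change is as trivial as it sounds (the value shown is
    illustrative):

    /* before */
    return (-ENOMEM);
    /* after */
    return -ENOMEM;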

    Signed-off-by: SeongJae Park
    Acked-by: Seth Jennings
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    SeongJae Park
     
  • Cai Liu reported that zbud pool page counting has a problem when
    multiple swap devices are used, because it counts only one swap device
    instead of all of them, so zswap cannot control writeback properly. The
    result is unnecessary writeback, or no writeback when we really should
    write back.

    IOW, it made zswap crazy.

    Another problem in zswap is:

    For example, assume we use two swap devices A and B with different
    priorities, and A charged 19% of zswap a long time ago. A is full now,
    so the VM starts to use B, which has recently charged 1%. That means
    zswap's charge (19% + 1%) is considered full by default. Then, if the
    VM wants to swap out more pages to B, zbud_reclaim_page() will evict
    one of the pages in B's pool, and this repeats continuously. It is a
    complete LRU-inversion problem, and swap thrashing on B follows.

    This patch makes zswap handle multiple swap devices by creating *a*
    single zbud pool that is shared by all of them, so every zswap page
    stays ordered in one LRU and both problems above are prevented.
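
    A heavily simplified sketch of the idea; the helper name is hypothetical
    and the ops structure name is assumed, only the single shared pool is
    the point:

    /* one zbud pool shared by every swap device, so its LRU spans all of them */
    static struct zbud_pool *zswap_pool;

    static int __init zswap_setup_pool(void)        /* hypothetical helper */
    {
            zswap_pool = zbud_create_pool(GFP_KERNEL, &zswap_zbud_ops);
            return zswap_pool ? 0 : -ENOMEM;
    }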

    Signed-off-by: Minchan Kim
    Reported-by: Cai Liu
    Suggested-by: Weijie Yang
    Cc: Seth Jennings
    Reviewed-by: Bob Liu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • zswap used zsmalloc before and now uses zbud, but some comments still
    say it uses zsmalloc. Fix these trivial problems.

    Signed-off-by: SeongJae Park
    Cc: Seth Jennings
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    SeongJae Park
     
  • Signed-off-by: SeongJae Park
    Cc: Seth Jennings
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    SeongJae Park
     
  • A new dump_page() routine was recently added, and marked
    EXPORT_SYMBOL_GPL. dump_page() was also added to the VM_BUG_ON_PAGE()
    macro, and so the end result is that non-GPL code can no longer call
    get_page() and a few other routines.

    This only happens if the kernel was compiled with CONFIG_DEBUG_VM.

    Change dump_page() to be EXPORT_SYMBOL.

    Longer explanation:

    Prior to commit 309381feaee5 ("mm: dump page when hitting a VM_BUG_ON
    using VM_BUG_ON_PAGE"), it was possible to build MIT-licensed (non-GPL)
    drivers on Fedora. Fedora is semi-unique, in that it sets
    CONFIG_DEBUG_VM.

    Because Fedora sets CONFIG_DEBUG_VM, they end up pulling in dump_page(),
    via VM_BUG_ON_PAGE, via get_page(). As one of the authors of NVIDIA's
    new, open source, "UVM-Lite" kernel module, I originally chose to use
    the kernel's get_page() routine from within nvidia_uvm_page_cache.c,
    because get_page() has always seemed to be very clearly intended for use
    by non-GPL driver code.

    So I'm hoping that making get_page() widely accessible again will not be
    too controversial. We did check with Fedora first, and they responded
    (https://bugzilla.redhat.com/show_bug.cgi?id=1074710#c3) that we should
    try to get upstream changed, before asking Fedora to change. Their
    reasoning seems beneficial to Linux: leaving CONFIG_DEBUG_VM set allows
    Fedora to help catch mm bugs.
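
    The change itself is a one-liner:

    -EXPORT_SYMBOL_GPL(dump_page);
    +EXPORT_SYMBOL(dump_page);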

    Signed-off-by: John Hubbard
    Cc: Sasha Levin
    Cc: Josh Boyer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    John Hubbard
     
  • Commit f9acc8c7b35a ("readahead: sanify file_ra_state names") left
    ra_submit with a single function call.

    Move ra_submit to internal.h and inline it to save some stack. Thanks
    to Andrew Morton for commenting on different versions.
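
    After the move, mm/internal.h carries roughly:

    /* Submit IO for the read-ahead request in file_ra_state. */
    static inline unsigned long ra_submit(struct file_ra_state *ra,
                    struct address_space *mapping, struct file *filp)
    {
            return __do_page_cache_readahead(mapping, filp,
                                             ra->start, ra->size, ra->async_size);
    }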

    Signed-off-by: Fabian Frederick
    Suggested-by: Andrew Morton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fabian Frederick
     
  • When I decrease the value of nr_hugepages in procfs a lot, a softlockup
    happens. It is because there is no chance of a context switch during
    this process.

    On the other hand, when I allocate a large number of hugepages, there is
    some chance of a context switch, so the softlockup doesn't happen during
    that process. It is therefore necessary to add a context switch point to
    the freeing process, just as in the allocating process, to avoid the
    softlockup.

    When I freed 12 TB of hugepages with kernel-2.6.32-358.el6, the freeing
    process occupied a CPU for over 150 seconds and the following softlockup
    message appeared twice or more.

    $ echo 6000000 > /proc/sys/vm/nr_hugepages
    $ cat /proc/sys/vm/nr_hugepages
    6000000
    $ grep ^Huge /proc/meminfo
    HugePages_Total: 6000000
    HugePages_Free: 6000000
    HugePages_Rsvd: 0
    HugePages_Surp: 0
    Hugepagesize: 2048 kB
    $ echo 0 > /proc/sys/vm/nr_hugepages

    BUG: soft lockup - CPU#16 stuck for 67s! [sh:12883] ...
    Pid: 12883, comm: sh Not tainted 2.6.32-358.el6.x86_64 #1
    Call Trace:
    free_pool_huge_page+0xb8/0xd0
    set_max_huge_pages+0x128/0x190
    hugetlb_sysctl_handler_common+0x113/0x140
    hugetlb_sysctl_handler+0x1e/0x20
    proc_sys_call_handler+0x97/0xd0
    proc_sys_write+0x14/0x20
    vfs_write+0xb8/0x1a0
    sys_write+0x51/0x90
    __audit_syscall_exit+0x265/0x290
    system_call_fastpath+0x16/0x1b

    I have not confirmed this problem with upstream kernels because I am not
    able to prepare a machine equipped with 12TB of memory right now.
    However, I confirmed that the required time was directly proportional to
    the number of hugepages being freed.

    I measured the required times on a smaller machine. It showed that
    130-145 hugepages were freed per millisecond.

    Amount of decreasing      Required time    Decreasing rate
    hugepages                 (msec)           (pages/msec)
    ------------------------------------------------------------
    10,000 pages == 20GB      70 - 74          135-142
    30,000 pages == 60GB      208 - 229        131-144

    At this decreasing rate, freeing 6TB worth of hugepages will trigger the
    softlockup with the default threshold of 20 seconds.

    Signed-off-by: Masayoshi Mizuma
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Wanpeng Li
    Cc: Aneesh Kumar
    Cc: KOSAKI Motohiro
    Cc: Naoya Horiguchi
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mizuma, Masayoshi
     
  • Replace ((phys_addr_t)(x) << PAGE_SHIFT) by pfn macro.
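
    The macro in question is presumably PFN_PHYS() from <linux/pfn.h>, which
    is defined as exactly that expression; the variable names below are
    illustrative:

    /* before */
    limit = (phys_addr_t)max_low_pfn << PAGE_SHIFT;
    /* after */
    limit = PFN_PHYS(max_low_pfn);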

    Signed-off-by: Fabian Frederick
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fabian Frederick
     
  • This is a small cleanup.

    Signed-off-by: Emil Medve
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Emil Medve
     
  • There's only one caller of set_page_dirty_balance() and that will call it
    with page_mkwrite == 0.

    The page_mkwrite argument was unused since commit b827e496c893 "mm: close
    page_mkwrite races".
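
    The visible effect is just a slimmer prototype, roughly:

    /* before */
    void set_page_dirty_balance(struct page *page, int page_mkwrite);
    /* after: the always-zero flag is gone */
    void set_page_dirty_balance(struct page *page);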

    Signed-off-by: Miklos Szeredi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miklos Szeredi
     
  • A BUG_ON(!PageLocked) was triggered in mlock_vma_page() by Sasha Levin
    fuzzing with trinity. The call site try_to_unmap_cluster() does not lock
    the pages other than its check_page parameter (which is already locked).

    The BUG_ON in mlock_vma_page() is not documented and its purpose is
    somewhat unclear, but apparently it serializes against page migration,
    which could otherwise fail to transfer the PG_mlocked flag. This would
    not be fatal, as the page would be eventually encountered again, but
    NR_MLOCK accounting would become distorted nevertheless. This patch adds
    a comment to the BUG_ON in mlock_vma_page() and munlock_vma_page() to that
    effect.

    The call site try_to_unmap_cluster() is fixed so that for page !=
    check_page, trylock_page() is attempted (to avoid possible deadlocks as we
    already have check_page locked) and mlock_vma_page() is performed only
    upon success. If the page lock cannot be obtained, the page is left
    without PG_mlocked, which is again not a problem in the whole unevictable
    memory design.
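
    The fixed call site looks roughly like this:

    if (page == check_page) {
            /* the caller already holds this page's lock */
            mlock_vma_page(page);
    } else if (trylock_page(page)) {
            /*
             * Trylock only: check_page is already locked by us, so sleeping
             * on another page lock could deadlock. Losing the race just
             * leaves the page un-mlocked until it is encountered again.
             */
            mlock_vma_page(page);
            unlock_page(page);
    }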

    Signed-off-by: Vlastimil Babka
    Signed-off-by: Bob Liu
    Reported-by: Sasha Levin
    Cc: Wanpeng Li
    Cc: Michel Lespinasse
    Cc: KOSAKI Motohiro
    Acked-by: Rik van Riel
    Cc: David Rientjes
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Cc: Joonsoo Kim
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • On NUMA systems, a node may start thrashing cache or even swap anonymous
    pages while there are still free pages on remote nodes.

    This is a result of commits 81c0a2bb515f ("mm: page_alloc: fair zone
    allocator policy") and fff4068cba48 ("mm: page_alloc: revert NUMA aspect
    of fair allocation policy").

    Before those changes, the allocator would first try all allowed zones,
    including those on remote nodes, before waking any kswapds. But now,
    the allocator fastpath doubles as the fairness pass, which in turn can
    only consider the local node to prevent remote spilling based on
    exhausted fairness batches alone. Remote nodes are only considered in
    the slowpath, after the kswapds are woken up. But if remote nodes still
    have free memory, kswapd should not be woken to rebalance the local node
    or it may thrash cache or swap prematurely.

    Fix this by adding one more unfair pass over the zonelist that is
    allowed to spill to remote nodes after the local fairness pass fails but
    before entering the slowpath and waking the kswapds.

    This also gets rid of the GFP_THISNODE exemption from the fairness
    protocol because the unfair pass is no longer tied to kswapd, which
    GFP_THISNODE is not allowed to wake up.

    However, because remote spills can be more frequent now - we prefer them
    over local kswapd reclaim - the allocation batches on remote nodes could
    underflow more heavily. When resetting the batches, use
    atomic_long_read() directly instead of zone_page_state() to calculate the
    delta as the latter filters negative counter values.

    Signed-off-by: Johannes Weiner
    Acked-by: Rik van Riel
    Acked-by: Mel Gorman
    Cc: [3.12+]

    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • mem_cgroup_newpage_charge is used only for charging anonymous memory so
    it is better to rename it to mem_cgroup_charge_anon.

    mem_cgroup_cache_charge is used for file backed memory so rename it to
    mem_cgroup_charge_file.
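
    The renamed entry points keep their signatures, roughly:

    /* anonymous memory */
    int mem_cgroup_charge_anon(struct page *page, struct mm_struct *mm,
                               gfp_t gfp_mask);
    /* file-backed memory (page cache, shmem) */
    int mem_cgroup_charge_file(struct page *page, struct mm_struct *mm,
                               gfp_t gfp_mask);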

    Signed-off-by: Michal Hocko
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Some callsites pass a memcg directly, some callsites pass an mm that
    then has to be translated to a memcg. This makes for a terrible
    function interface.

    Just push the mm-to-memcg translation into the respective callsites and
    always pass a memcg to mem_cgroup_try_charge().
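
    A sketch of the resulting call pattern at a former mm-based call site
    (the exact signature of mem_cgroup_try_charge() is assumed here, not
    quoted from the patch):

    struct mem_cgroup *memcg;
    int ret;

    /* the mm-to-memcg translation now lives at the call site */
    memcg = get_mem_cgroup_from_mm(mm);
    ret = mem_cgroup_try_charge(memcg, gfp_mask, nr_pages, oom);
    css_put(&memcg->css);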

    [mhocko@suse.cz: add charge mm helper]
    Signed-off-by: Johannes Weiner
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • __mem_cgroup_try_charge duplicates get_mem_cgroup_from_mm for charges
    which came without a memcg. The only reason seems to be a tiny
    optimization when css_tryget is not called if the charge can be consumed
    from the stock. Nevertheless css_tryget is very cheap since it has been
    reworked to use per-cpu counting so this optimization doesn't give us
    anything these days.

    So let's drop the code duplication so that the code is more readable.

    Signed-off-by: Michal Hocko
    Signed-off-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko