09 Jan, 2015

5 commits

  • We are supposed to take one css reference for each memory page and
    each swap entry accounted to a memory cgroup. However, during task
    charge migration we take a reference to the destination cgroup twice
    for each swap entry: first in mem_cgroup_do_precharge()->try_charge()
    and then in mem_cgroup_move_swap_account(), permanently leaking the
    destination cgroup.

    The hunk taking the second reference seems to be a leftover from the
    pre-00501b531c472 ("mm: memcontrol: rewrite charge API") era. Remove it
    to fix the leak.

    Fixes: e8ea14cc6ead ("mm: memcontrol: take a css reference for each charged page")
    Signed-off-by: Vladimir Davydov
    Cc: Johannes Weiner
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • Commit 3e32cb2e0a12 ("mm: memcontrol: lockless page counters")
    accidentally switched the soft limit default from infinity to zero,
    which turns all memcgs with even a single page into soft limit excessors
    and engages soft limit reclaim on all of them during global memory
    pressure. This makes global reclaim generally more aggressive, but also
    inverts the meaning of existing soft limit configurations where unset
    soft limits are usually more generous than set ones.
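
    As a standalone illustration of why the default matters (a toy
    userspace model, not the kernel code -- the real limit lives in a
    page_counter, and the fix restores an effectively unlimited default):

    #include <limits.h>
    #include <stdio.h>

    /* Toy model of a memcg with a soft limit, counted in pages. */
    struct memcg_model {
            unsigned long pages;            /* pages currently charged */
            unsigned long soft_limit;       /* soft limit, in pages */
    };

    static int is_soft_limit_excessor(const struct memcg_model *m)
    {
            return m->pages > m->soft_limit;
    }

    int main(void)
    {
            struct memcg_model zero_default = { .pages = 1, .soft_limit = 0 };
            struct memcg_model inf_default = { .pages = 1, .soft_limit = ULONG_MAX };

            /* With a zero default, a single charged page already makes the
             * group a soft limit excessor; with an "infinity" default it
             * only becomes one after the admin actually sets a limit. */
            printf("zero default excessor: %d\n", is_soft_limit_excessor(&zero_default));
            printf("infinite default excessor: %d\n", is_soft_limit_excessor(&inf_default));
            return 0;
    }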

    Signed-off-by: Johannes Weiner
    Acked-by: Michal Hocko
    Acked-by: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • These Kconfig options have been obsolete since commit e30825f1869a
    ("mm/debug-pagealloc: prepare boottime configurable") was merged, so
    remove them.

    [pebolle@tiscali.nl: find obsolete Kconfig options]
    Signed-off-by: Joonsoo Kim
    Cc: Paul Bolle
    Cc: Mel Gorman
    Cc: Johannes Weiner
    Cc: Minchan Kim
    Cc: Dave Hansen
    Cc: Michal Nazarewicz
    Cc: Jungsoo Son
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • Tejun, while reviewing the code, spotted the following race condition
    between the dirtying and truncation of a page:

    __set_page_dirty_nobuffers()        __delete_from_page_cache()
      if (TestSetPageDirty(page))
                                          page->mapping = NULL
                                          if (PageDirty())
                                            dec_zone_page_state(page, NR_FILE_DIRTY);
                                            dec_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
      if (page->mapping)
        account_page_dirtied(page)
          __inc_zone_page_state(page, NR_FILE_DIRTY);
          __inc_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);

    which results in an imbalance of NR_FILE_DIRTY and BDI_RECLAIMABLE.

    Dirtiers usually lock out truncation, either by holding the page lock
    directly, or in case of zap_pte_range(), by pinning the mapcount with
    the page table lock held. The notable exception to this rule, though,
    is do_wp_page(), for which this race exists. However, do_wp_page()
    already waits for a locked page to unlock before setting the dirty bit,
    in order to prevent a race where clear_page_dirty() misses the page bit
    in the presence of dirty ptes. Upgrade that wait to a fully locked
    set_page_dirty() to also cover the situation explained above.

    Afterwards, the code in set_page_dirty() dealing with a truncation race
    is no longer needed. Remove it.
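
    The imbalance is easy to replay deterministically in a toy model of
    the two paths above (a userspace sketch, not the kernel code; the
    function names only mirror the kernel ones):

    #include <stdbool.h>
    #include <stdio.h>

    /* Toy model of the race above: one page, one dirty counter. */
    struct toy_page {
            bool dirty;
            void *mapping;          /* NULL once the page is truncated */
    };

    static int nr_file_dirty;       /* stands in for NR_FILE_DIRTY */

    /* Truncation side; in the kernel it runs with the page locked. */
    static void toy_delete_from_page_cache(struct toy_page *page)
    {
            page->mapping = NULL;
            if (page->dirty)
                    nr_file_dirty--;        /* undo the dirty accounting */
    }

    /* Old, unlocked dirtier, split into the two halves that can be
     * interleaved with truncation as in the diagram above. */
    static void toy_set_dirty_step1(struct toy_page *page)
    {
            page->dirty = true;             /* TestSetPageDirty() */
    }

    static void toy_set_dirty_step2(struct toy_page *page)
    {
            if (page->mapping)              /* mapping already gone: skipped */
                    nr_file_dirty++;        /* account_page_dirtied() */
    }

    /* New, locked dirtier: with the page lock held for the whole
     * operation (the same lock truncation holds), the two halves can no
     * longer be separated by a concurrent truncation. */
    static void toy_set_dirty_locked(struct toy_page *page)
    {
            page->dirty = true;
            if (page->mapping)
                    nr_file_dirty++;
    }

    int main(void)
    {
            struct toy_page a = { .mapping = &a };
            struct toy_page b = { .mapping = &b };

            /* Replay the bad interleaving: dirty bit set, truncation runs,
             * then the accounting check -- the counter goes negative. */
            toy_set_dirty_step1(&a);
            toy_delete_from_page_cache(&a);
            toy_set_dirty_step2(&a);
            printf("unlocked interleaving: NR_FILE_DIRTY = %d\n", nr_file_dirty);

            /* With the locked version the same sequence stays balanced. */
            nr_file_dirty = 0;
            toy_set_dirty_locked(&b);
            toy_delete_from_page_cache(&b);
            printf("locked dirtier:        NR_FILE_DIRTY = %d\n", nr_file_dirty);
            return 0;
    }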

    Reported-by: Tejun Heo
    Signed-off-by: Johannes Weiner
    Acked-by: Kirill A. Shutemov
    Reviewed-by: Jan Kara
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • A constantly forking task causes unbounded growth of the anon_vma
    chain. Each new child allocates a new level of anon_vmas and links
    its vma to all previous levels, because pages might be inherited from
    any level.

    This patch adds a heuristic which decides to reuse an existing
    anon_vma instead of forking a new one. It adds a counter,
    anon_vma->degree, which counts linked vmas and directly descending
    anon_vmas, and reuses the anon_vma if the counter is lower than two.
    As a result, each anon_vma has either a vma or at least two
    descending anon_vmas. In such trees half of the nodes are leaves with
    live vmas, so the number of anon_vmas is no more than twice the
    number of vmas.

    This heuristic reuses anon_vmas as rarely as possible, because each
    reuse adds false aliasing among vmas, and the rmap walker then has to
    scan more ptes when it searches for where a page might be mapped.

    Link: http://lkml.kernel.org/r/20120816024610.GA5350@evergreen.ssec.wisc.edu
    Fixes: 5beb49305251 ("mm: change anon_vma linking to fix multi-process server scalability issue")
    [akpm@linux-foundation.org: fix typo, per Rik]
    Signed-off-by: Konstantin Khlebnikov
    Reported-by: Daniel Forrest
    Tested-by: Michal Hocko
    Tested-by: Jerome Marchand
    Reviewed-by: Michal Hocko
    Reviewed-by: Rik van Riel
    Cc: [2.6.34+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     

07 Jan, 2015

1 commit

  • Jay Foad reports that the address sanitizer test (asan) sometimes gets
    confused by a stack pointer that ends up being outside the stack vma
    that is reported by /proc/maps.

    This happens due to an interaction between RLIMIT_STACK and the guard
    page: when we do the guard page check, we ignore the potential error
    from the stack expansion, which effectively results in a missing guard
    page, since the expected stack expansion won't have been done.

    And since /proc/maps explicitly ignores the guard page (commit
    d7824370e263: "mm: fix up some user-visible effects of the stack guard
    page"), the stack pointer ends up being outside the reported stack area.

    This is the minimal patch: it just propagates the error. It also
    effectively makes the guard page part of the stack limit, which in
    turn means that the actual real stack is one page less than the stack
    limit.

    Let's see if anybody notices. We could teach acct_stack_growth() to
    allow an extra page for a grow-up/grow-down stack in the rlimit test,
    but I don't want to add more complexity if it isn't needed.

    Reported-and-tested-by: Jay Foad
    Cc: stable@vger.kernel.org
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

30 Dec, 2014

1 commit

  • Commit 2457aec63745 ("mm: non-atomically mark page accessed during page
    cache allocation where possible") has added a separate parameter for
    specifying gfp mask for radix tree allocations.

    Not only is this less than optimal from an API point of view because
    it is error prone, it is also currently buggy: grab_cache_page_write_begin()
    uses GFP_KERNEL for the radix tree, so if fgp_flags doesn't contain
    FGP_NOFS (mostly controlled by the filesystem via the AOP_FLAG_NOFS
    flag) but the mapping_gfp_mask has __GFP_FS cleared, then the radix
    tree allocation won't obey the restriction and might recurse into the
    filesystem and cause deadlocks. Unfortunately this is the case for
    most filesystems, because only ext4 and gfs2 use AOP_FLAG_NOFS.

    Let's simply remove the radix_gfp_mask parameter, because the
    allocation context is the same for both the page cache and the radix
    tree. Just make sure that the radix tree gets only the sane subset of
    the mask (e.g. do not pass __GFP_WRITE).

    Longer term it would be preferable to convert the remaining users of
    AOP_FLAG_NOFS to use mapping_gfp_mask instead and simplify this
    interface even further.
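
    A minimal model of the masking described above (illustrative flag
    values only; __GFP_FS and __GFP_WRITE are the real kernel flag names,
    but their values and the helper below are invented for the example):

    #include <stdio.h>

    /* Illustrative flag bits -- not the kernel's actual values. */
    #define __GFP_FS        0x01u   /* may recurse into the filesystem */
    #define __GFP_IO        0x02u   /* may start I/O */
    #define __GFP_WRITE     0x04u   /* page-cache-only hint, meaningless for the radix tree */

    /* Derive the radix tree mask from the single page cache mask, as the
     * patch suggests: same allocation context, minus page-cache-only bits. */
    static unsigned int radix_tree_gfp(unsigned int page_cache_gfp)
    {
            return page_cache_gfp & ~__GFP_WRITE;
    }

    int main(void)
    {
            /* A filesystem that clears __GFP_FS in its mapping mask. */
            unsigned int mapping_gfp = __GFP_IO | __GFP_WRITE;

            /* Before the fix, the radix tree could be handed a full
             * GFP_KERNEL-style mask with __GFP_FS set and deadlock. */
            printf("page cache gfp: %#x\n", mapping_gfp);
            printf("radix tree gfp: %#x (no __GFP_FS, no __GFP_WRITE)\n",
                   radix_tree_gfp(mapping_gfp));
            return 0;
    }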

    Reported-by: Dave Chinner
    Signed-off-by: Michal Hocko
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

23 Dec, 2014

1 commit

  • This reverts commit c8475d144abb1e62958cc5ec281d2a9e161c1946.

    There are several bug reports [1][2] which point to this commit as a
    potential cause [3].

    Let's revert it until we figure out what's going on.

    [1] https://lkml.org/lkml/2014/11/14/342
    [2] https://lkml.org/lkml/2014/12/22/213
    [3] https://lkml.org/lkml/2014/12/9/741

    Signed-off-by: Kirill A. Shutemov
    Reported-by: Sasha Levin
    Acked-by: Davidlohr Bueso
    Cc: Hugh Dickins
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra (Intel)
    Cc: Rik van Riel
    Cc: Srikar Dronamraju
    Cc: Mel Gorman
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

21 Dec, 2014

1 commit

  • Pull ACCESS_ONCE cleanup preparation from Christian Borntraeger:
    "kernel: Provide READ_ONCE and ASSIGN_ONCE

    As discussed on LKML http://marc.info/?i=54611D86.4040306%40de.ibm.com
    ACCESS_ONCE might fail with specific compilers for non-scalar
    accesses.

    Here is a set of patches to tackle that problem.

    The first patch introduces READ_ONCE and ASSIGN_ONCE. If the data
    structure is larger than the machine word size, memcpy is used and a
    warning is emitted. The next patches fix up several in-tree users of
    ACCESS_ONCE on non-scalar types.

    This does not yet contain a patch that forces ACCESS_ONCE to work only
    on scalar types. This is targeted for the next merge window, as
    linux-next already contains new offenders regarding ACCESS_ONCE vs.
    non-scalar types"

    * tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/borntraeger/linux:
    s390/kvm: REPLACE barrier fixup with READ_ONCE
    arm/spinlock: Replace ACCESS_ONCE with READ_ONCE
    arm64/spinlock: Replace ACCESS_ONCE READ_ONCE
    mips/gup: Replace ACCESS_ONCE with READ_ONCE
    x86/gup: Replace ACCESS_ONCE with READ_ONCE
    x86/spinlock: Replace ACCESS_ONCE with READ_ONCE
    mm: replace ACCESS_ONCE with READ_ONCE or barriers
    kernel: Provide READ_ONCE and ASSIGN_ONCE

    Linus Torvalds
     

20 Dec, 2014

1 commit

  • Pull vfs pile #3 from Al Viro:
    "Assorted fixes and patches from the last cycle"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    [regression] chunk lost from bd9b51
    vfs: make mounts and mountstats honor root dir like mountinfo does
    vfs: cleanup show_mountinfo
    init: fix read-write root mount
    unfuck binfmt_misc.c (broken by commit e6084d4)
    vm_area_operations: kill ->migrate()
    new helper: iter_is_iovec()
    move_extent_per_page(): get rid of unused w_flags
    lustre: get rid of playing with ->fs
    btrfs: filp_open() returns ERR_PTR() on failure, not NULL...

    Linus Torvalds
     

19 Dec, 2014

4 commits

  • Currently, the functions in zsmalloc.c are not arranged in a readable
    and reasonable sequence. As more and more functions are added, we may
    run into the following inconvenience. For example:

    Current functions:

    void zs_init()
    {
    }

    static void get_maxobj_per_zspage()
    {
    }

    Suppose I want to add a func_1() which is called from zs_init(), and
    this newly added func_1() uses get_maxobj_per_zspage(), which is
    defined below zs_init():

    void func_1()
    {
            get_maxobj_per_zspage();
    }

    void zs_init()
    {
            func_1();
    }

    static void get_maxobj_per_zspage()
    {
    }

    This will not compile, so we must add a declaration:

    static void get_maxobj_per_zspage();

    before func_1() if we do not move get_maxobj_per_zspage() above
    func_1().

    In addition, putting the module_[init|exit] functions at the bottom
    of the file follows the usual convention.

    So, this patch adjusts the function order as follows:

    /* helper functions */
    ...
    obj_location_to_handle()
    ...

    /* Some exported functions */
    ...

    zs_map_object()
    zs_unmap_object()

    zs_malloc()
    zs_free()

    zs_init()
    zs_exit()

    Signed-off-by: Ganesh Mahendran
    Cc: Nitin Gupta
    Acked-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ganesh Mahendran
     
  • Belatedly document the changes in commit f0c6d4d295e4 ("mm: introduce
    do_shared_fault() and drop do_fault()").

    Cc: Andi Kleen
    Cc: Bob Liu
    Cc: Dave Hansen
    Cc: "Kirill A. Shutemov"
    Cc: Matthew Wilcox
    Cc: Mel Gorman
    Cc: Naoya Horiguchi
    Cc: Rik van Riel
    Cc: Sasha Levin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • When the system boots up, the dmesg log shows the memory statistics
    along with the total reserved memory, as below:

    Memory: 458840k/458840k available, 65448k reserved, 0K highmem

    When CMA is enabled, the total reserved memory remains the same even
    though the CMA memory is not really reserved: in /proc/meminfo the
    CMA memory shows up as part of free memory. This creates confusion.
    This patch corrects the problem by properly subtracting the CMA
    reserved memory from the total reserved memory in the dmesg log.

    Below is the dmesg snapshot from an arm based device with 512MB RAM and
    12MB single CMA region.

    Before this change:
    Memory: 458840k/458840k available, 65448k reserved, 0K highmem

    After this change:
    Memory: 458840k/458840k available, 53160k reserved, 12288k cma-reserved, 0K highmem

    Signed-off-by: Pintu Kumar
    Signed-off-by: Vishnu Pratap Singh
    Acked-by: Michal Nazarewicz
    Cc: Rafael Aquini
    Cc: Jerome Marchand
    Cc: Marek Szyprowski
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pintu Kumar
     
  • When nodes is true, nsc->mask2 has already been filtered by nsc->mask1,
    which has already factored in node_states[N_MEMORY].

    Signed-off-by: Zhihui Zhang
    Cc: Mel Gorman
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zhihui Zhang
     

18 Dec, 2014

2 commits

  • ACCESS_ONCE does not work reliably on non-scalar types. For example,
    gcc 4.6 and 4.7 might remove the volatile tag for such accesses
    during the SRA (scalar replacement of aggregates) step
    (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=58145).

    Let's change the gup code to access the page table elements with
    READ_ONCE, which performs the accesses as scalars.

    mm_find_pmd() is tricky, because m68k and sparc (32-bit) define pmd_t
    as an array of longs. That code only requires that the pmd_present
    and pmd_trans_huge checks are done on the same value, so a barrier is
    sufficient.

    A similar case is in handle_pte_fault. On ppc44x the word size is
    32 bits but a pte is 64 bits, and a barrier is OK there as well.
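
    For scalar types, the effect of READ_ONCE can be modelled in
    userspace with a volatile cast (a sketch of the idea only; the
    in-kernel macro additionally handles non-scalar types, which this toy
    version does not):

    #include <stdio.h>

    /*
     * Userspace sketch of the idea behind READ_ONCE for scalars: force a
     * single access through a volatile-qualified pointer so the compiler
     * cannot re-read, fuse or tear it.
     */
    #define TOY_READ_ONCE(x) (*(const volatile __typeof__(x) *)&(x))

    static int shared_flag;

    int main(void)
    {
            /* One access, taken as a scalar, exactly where it is written. */
            int snapshot = TOY_READ_ONCE(shared_flag);

            printf("flag snapshot: %d\n", snapshot);
            return 0;
    }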

    Signed-off-by: Christian Borntraeger
    Cc: linux-mm@kvack.org
    Acked-by: Paul E. McKenney

    Christian Borntraeger
     
  • Dave Hansen reports that commit fb7332a9fedf ("mmu_gather: move minimal
    range calculations into generic code") caused a performance problem:

    "tlb_finish_mmu() goes up about 9x in the profiles (~0.4%->3.6%) and
    tlb_flush_mmu_free() takes about 3.1% of CPU time with the patch
    applied, but does not show up at all on the commit before"

    and the reason is that Will moved the test for whether we need to flush
    from tlb_flush_mmu() into tlb_flush_mmu_tlbonly(). But that meant that
    tlb_flush_mmu_free() basically lost that check.

    Move it back into tlb_flush_mmu() where it belongs, so that it covers
    both tlb_flush_mmu_tlbonly() _and_ tlb_flush_mmu_free().
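
    Schematically, the fix puts the early-out back at the common entry
    point so that neither half runs when nothing was batched (a
    simplified userspace model, not the kernel's mmu_gather code):

    #include <stdio.h>

    /* Simplified stand-in for struct mmu_gather: a batch of unmapped pages. */
    struct toy_gather {
            int batched_pages;      /* how much work is pending */
    };

    static void toy_flush_tlb_only(struct toy_gather *tlb)
    {
            printf("flushing TLB for %d pages\n", tlb->batched_pages);
    }

    static void toy_flush_free(struct toy_gather *tlb)
    {
            printf("freeing %d batched pages\n", tlb->batched_pages);
            tlb->batched_pages = 0;
    }

    /* The check lives here, at the common entry point, so both halves are
     * skipped when there is nothing to do -- the shape the fix restores. */
    static void toy_flush_mmu(struct toy_gather *tlb)
    {
            if (!tlb->batched_pages)
                    return;
            toy_flush_tlb_only(tlb);
            toy_flush_free(tlb);
    }

    int main(void)
    {
            struct toy_gather tlb = { .batched_pages = 0 };

            toy_flush_mmu(&tlb);            /* no-op: prints nothing */
            tlb.batched_pages = 32;
            toy_flush_mmu(&tlb);            /* does both halves once */
            return 0;
    }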

    Reported-and-tested-by: Dave Hansen
    Acked-by: Will Deacon
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

17 Dec, 2014

3 commits

  • The only instance this method has ever grown was the one in kernfs -
    one that calls the ->migrate() of another vm_ops if it exists.

    Signed-off-by: Al Viro

    Al Viro
     
  • Signed-off-by: Al Viro

    Al Viro
     
  • Pull nfsd updates from Bruce Fields:
    "A comparatively quieter cycle for nfsd this time, but still with two
    larger changes:

    - RPC server scalability improvements from Jeff Layton (using RCU
    instead of a spinlock to find idle threads).

    - server-side NFSv4.2 ALLOCATE/DEALLOCATE support from Anna
    Schumaker, enabling fallocate on new clients"

    * 'for-3.19' of git://linux-nfs.org/~bfields/linux: (32 commits)
    nfsd4: fix xdr4 count of server in fs_location4
    nfsd4: fix xdr4 inclusion of escaped char
    sunrpc/cache: convert to use string_escape_str()
    sunrpc: only call test_bit once in svc_xprt_received
    fs: nfsd: Fix signedness bug in compare_blob
    sunrpc: add some tracepoints around enqueue and dequeue of svc_xprt
    sunrpc: convert to lockless lookup of queued server threads
    sunrpc: fix potential races in pool_stats collection
    sunrpc: add a rcu_head to svc_rqst and use kfree_rcu to free it
    sunrpc: require svc_create callers to pass in meaningful shutdown routine
    sunrpc: have svc_wake_up only deal with pool 0
    sunrpc: convert sp_task_pending flag to use atomic bitops
    sunrpc: move rq_cachetype field to better optimize space
    sunrpc: move rq_splice_ok flag into rq_flags
    sunrpc: move rq_dropme flag into rq_flags
    sunrpc: move rq_usedeferral flag to rq_flags
    sunrpc: move rq_local field to rq_flags
    sunrpc: add a generic rq_flags field to svc_rqst and move rq_secure to it
    nfsd: minor off by one checks in __write_versions()
    sunrpc: release svc_pool_map reference when serv allocation fails
    ...

    Linus Torvalds
     

16 Dec, 2014

1 commit

  • Pull drm updates from Dave Airlie:
    "Highlights:

    - AMD KFD driver merge

    This is the AMD HSA interface for exposing a lowlevel interface for
    GPGPU use. They have an open source userspace built on top of this
    interface, and the code looks as good as it was going to get out of
    tree.

    - Initial atomic modesetting work

    The need for an atomic modesetting interface to allow userspace to
    try and send a complete set of modesetting state to the driver has
    arisen, and been suffering from neglect this past year. No more,
    the start of the common code and changes for msm driver to use it
    are in this tree. Ongoing work to get the userspace ioctl finished
    and the code clean will probably wait until next kernel.

    - DisplayID 1.3 and tiled monitor exposed to userspace.

    Tiled monitor property is now exposed for userspace to make use of.

    - Rockchip drm driver merged.

    - imx gpu driver moved out of staging

    Other stuff:

    - core:
    panel - MIPI DSI + new panels.
    expose suggested x/y properties for virtual GPUs

    - i915:
    Initial Skylake (SKL) support
    gen3/4 reset work
    start of dri1/ums removal
    infoframe tracking
    fixes for lots of things.

    - nouveau:
    tegra k1 voltage support
    GM204 modesetting support
    GT21x memory reclocking work

    - radeon:
    CI dpm fixes
    GPUVM improvements
    Initial DPM fan control

    - rcar-du:
    HDMI support added
    removed some support for old boards
    slave encoder driver for Analog Devices adv7511

    - exynos:
    Exynos4415 SoC support

    - msm:
    a4xx gpu support
    atomic helper conversion

    - tegra:
    iommu support
    universal plane support
    ganged-mode DSI support

    - sti:
    HDMI i2c improvements

    - vmwgfx:
    some late fixes.

    - qxl:
    use suggested x/y properties"

    * 'drm-next' of git://people.freedesktop.org/~airlied/linux: (969 commits)
    drm: sti: fix module compilation issue
    drm/i915: save/restore GMBUS freq across suspend/resume on gen4
    drm: sti: correctly cleanup CRTC and planes
    drm: sti: add HQVDP plane
    drm: sti: add cursor plane
    drm: sti: enable auxiliary CRTC
    drm: sti: fix delay in VTG programming
    drm: sti: prepare sti_tvout to support auxiliary crtc
    drm: sti: use drm_crtc_vblank_{on/off} instead of drm_vblank_{on/off}
    drm: sti: fix hdmi avi infoframe
    drm: sti: remove event lock while disabling vblank
    drm: sti: simplify gdp code
    drm: sti: clear all mixer control
    drm: sti: remove gpio for HDMI hot plug detection
    drm: sti: allow to change hdmi ddc i2c adapter
    drm/doc: Document drm_add_modes_noedid() usage
    drm/i915: Remove '& 0xffff' from the mask given to WA_REG()
    drm/i915: Invert the mask and val arguments in wa_add() and WA_REG()
    drm: Zero out DRM object memory upon cleanup
    drm/i915/bdw: Fix the write setting up the WIZ hashing mode
    ...

    Linus Torvalds
     

15 Dec, 2014

1 commit


14 Dec, 2014

19 commits

  • There are actually two issues this patch addresses. Let me start with
    the one I tried to solve in the beginning.

    So, in the checkpoint-restore project (criu) we try to dump a task's
    state and restore it back exactly as it was. One piece of that state
    is the aio rings set up with the io_setup() call. There are (almost)
    no problems dumping them; the problem is restoring them -- if I dump
    a task with an aio ring originally mapped at address A, I want to
    restore it back at exactly the same address A. Unfortunately,
    io_setup() does not allow for that -- it mmaps the ring at whatever
    place mm finds appropriate (it calls do_mmap_pgoff() with a zero
    address and without the MAP_FIXED flag).

    To make restore possible I'm going to mremap() the freshly created
    ring to the address A (under which it was seen before the dump). The
    problem is that the ring's virtual address is passed back to
    user-space as the context ID, and this ID is then used as the search
    key by all the other io_foo() calls. Reworking this ID to be just
    some integer doesn't seem to work, as the value is already used by
    libaio as a pointer through which the library accesses memory for aio
    meta-data.

    So, to make restore work we need to make sure that

    a) the ring is mapped at the desired virtual address
    b) kioctx->user_id matches this value

    Having said that, the patch makes mremap() on an aio region update
    the kioctx's user_id and mmap_base values.

    This is where the second issue I mentioned at the beginning of this
    mail comes in. If (regardless of the C/R dances I do) someone creates
    an io context with io_setup(), then mremap()s the ring and then
    destroys the context, the kill_ioctx() routine will call munmap() on
    the wrong (old) address. This will result in a) the aio ring
    remaining in memory and b) some other vma being unexpectedly
    unmapped.

    What do you think?
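
    The remapping step the restore side relies on looks roughly like this
    in userspace (a sketch of the mremap() usage only; the real ring
    comes from io_setup(), which is not shown, and the target address
    here is just a fresh mapping so the example is self-contained):

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/mman.h>

    int main(void)
    {
            size_t len = 4096;
            /* Stand-in for the ring mapping io_setup() created somewhere. */
            void *ring = mmap(NULL, len, PROT_READ | PROT_WRITE,
                              MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            /* Stand-in for address "A" the checkpointed task saw. */
            void *target = mmap(NULL, len, PROT_NONE,
                                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            void *moved;

            if (ring == MAP_FAILED || target == MAP_FAILED) {
                    perror("mmap");
                    return EXIT_FAILURE;
            }

            /* Move the mapping to the desired address. Before the patch
             * the kernel kept using the old address for the aio ring
             * internally; with it, the kioctx follows the mapping. */
            moved = mremap(ring, len, len, MREMAP_MAYMOVE | MREMAP_FIXED, target);
            if (moved == MAP_FAILED) {
                    perror("mremap");
                    return EXIT_FAILURE;
            }
            printf("ring moved from %p to %p\n", ring, moved);
            return EXIT_SUCCESS;
    }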

    Signed-off-by: Pavel Emelyanov
    Acked-by: Dmitry Monakhov
    Signed-off-by: Benjamin LaHaise

    Pavel Emelyanov
     
  • kmemleak will add allocations as objects to a pool. The memory allocated
    for each object in this pool is periodically searched for pointers to
    other allocated objects. This only works for memory that is mapped into
    the kernel's virtual address space, which happens not to be the case for
    most CMA regions.

    Furthermore, CMA regions are typically used to store data transferred to
    or from a device and therefore don't contain pointers to other objects.

    Without this, the kernel crashes on the first execution of the
    scan_gray_list() because it tries to access highmem. Perhaps a more
    appropriate fix would be to reject any object that can't map to a kernel
    virtual address?

    [akpm@linux-foundation.org: add comment]
    [akpm@linux-foundation.org: fix comment, per Catalin]
    [sfr@canb.auug.org.au: include linux/io.h for phys_to_virt()]
    Signed-off-by: Thierry Reding
    Cc: Michal Nazarewicz
    Cc: Marek Szyprowski
    Cc: Joonsoo Kim
    Cc: "Aneesh Kumar K.V"
    Cc: Catalin Marinas
    Signed-off-by: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Thierry Reding
     
  • If we fail to allocate from the current node's stock, we look for free
    objects on other nodes before calling the page allocator (see
    get_any_partial()). While checking other nodes we respect cpuset
    constraints by calling cpuset_zone_allowed, enforcing the hardwall
    check. As a result, we will fall back to the page allocator even if
    there are some pages cached on other nodes, but the current cpuset
    doesn't have those nodes set. However, the page allocator uses the
    softwall check for kernel allocations, so it may allocate from one of
    the other nodes in this case.

    Therefore we should use the softwall cpuset check in
    get_any_partial() to conform to the cpuset check in the page
    allocator.

    Signed-off-by: Vladimir Davydov
    Acked-by: Zefan Li
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • fallback_alloc() is called on kmalloc if the preferred node doesn't
    have free or partial slabs and there are no pages on the node's free
    list (GFP_THISNODE allocations fail). Before invoking the reclaimer
    it tries to locate a free or partial slab on other allowed nodes'
    lists. While iterating over the preferred node's zonelist it skips
    those zones for which the hardwall cpuset check returns false. That
    means that for a task bound to a specific node using cpusets,
    fallback_alloc() will always ignore free slabs on other nodes and go
    directly to the reclaimer, which, however, may allocate from other
    nodes if cpuset.mem_hardwall is unset (the default). As a result, the
    lists of free slabs on other nodes may grow without bound, which is
    bad, because inactive slabs are only evicted by cache_reap() at a
    very slow rate and cannot be dropped forcefully.

    To reproduce the issue, run a process that will walk over a directory tree
    with lots of files inside a cpuset bound to a node that constantly
    experiences memory pressure. Look at num_slabs vs active_slabs growth as
    reported by /proc/slabinfo.

    To avoid this we should use the softwall cpuset check in
    fallback_alloc().

    Signed-off-by: Vladimir Davydov
    Acked-by: Zefan Li
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • When zbud is initialized through the zpool wrapper, pool->ops, which
    points to user-defined operations, is always set regardless of
    whether they were specified by the upper layer. This causes
    zbud_reclaim_page() to iterate its page-eviction loop without any
    gain.

    This patch sets the user-defined ops only when needed, so that
    zbud_reclaim_page() can bail out of the reclamation loop early when
    no user-defined operations are specified.

    Signed-off-by: Heesub Shin
    Acked-by: Dan Streetman
    Cc: Seth Jennings
    Cc: Sunae Seo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Heesub Shin
     
  • free_percpu() tests whether its argument is NULL and then returns
    immediately. Thus the test around the call is not needed.

    This issue was detected by using the Coccinelle software.

    Signed-off-by: Markus Elfring
    Cc: Seth Jennings
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Markus Elfring
     
  • zswap_cpu_init()/zswap_comp_exit()/zswap_entry_cache_create() are
    only called by __init init_zswap().

    Signed-off-by: Mahendran Ganesh
    Cc: Seth Jennings
    Cc: Dan Streetman
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mahendran Ganesh
     
  • In zs_create_pool(), we allocate more memory than sizeof(struct zs_pool):
    ovhd_size = roundup(sizeof(*pool), PAGE_SIZE);

    This patch allocates exactly the needed size.

    Signed-off-by: Ganesh Mahendran
    Acked-by: Minchan Kim
    Cc: Nitin Gupta
    Cc: Dan Streetman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ganesh Mahendran
     
  • In zs_create_pool(), prev_class is assigned (ZS_SIZE_CLASSES - 1)
    times, and prev_class only references the previous size_class, so
    the repeated assignments are unnecessary.

    This patch assigns *prev_class* only when a new size_class structure
    is allocated, and uses prev_class to check whether the first class
    has been allocated.

    [akpm@linux-foundation.org: remove now-unused ZS_SIZE_CLASSES]
    Signed-off-by: Ganesh Mahendran
    Cc: Minchan Kim
    Cc: Nitin Gupta
    Reviewed-by: Dan Streetman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ganesh Mahendran
     
  • I sent a patch [1] for an unnecessary check in zsmalloc, and Minchan
    Kim found that zsmalloc does not even support allocating an object of
    size ZS_MAX_ALLOC_SIZE in some situations.

    For example, in a system with 64KB PAGE_SIZE and 32-bit physical
    addresses:

    ZS_MIN_ALLOC_SIZE is 32 bytes, calculated as
      MAX(32, (ZS_MAX_PAGES_PER_ZSPAGE << PAGE_SHIFT >> OBJ_INDEX_BITS))
    ZS_MAX_ALLOC_SIZE is 64KB (in the current code, PAGE_SIZE)
    ZS_SIZE_CLASS_DELTA is 256 bytes

    So:
      ZS_SIZE_CLASSES = (ZS_MAX_ALLOC_SIZE - ZS_MIN_ALLOC_SIZE) / ZS_SIZE_CLASS_DELTA + 1
                      = 256

    In zs_create_pool(), the largest object size that can be allocated is:
      ZS_MIN_ALLOC_SIZE + (ZS_SIZE_CLASSES - 1) * ZS_SIZE_CLASS_DELTA = 32 + 255 * 256 = 65312

    We can see that 65312 < 65536 (ZS_MAX_ALLOC_SIZE), so we can NOT
    allocate objects of size ZS_MAX_ALLOC_SIZE (65536), which we promise
    upper layers we can do.

    [1] http://lkml.iu.edu/hypermail/linux/kernel/1411.2/03835.html
    [2] http://lkml.iu.edu/hypermail/linux/kernel/1411.2/04534.html

    This patch fixes the issue by dynamically calculating zs_size_classes
    when the module is loaded and allocating a buffer of size
    ZS_MAX_ALLOC_SIZE, so that the largest object (of size
    ZS_MAX_ALLOC_SIZE) can be stored in it.
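
    The arithmetic above can be checked directly (a standalone program
    reproducing the 64KB PAGE_SIZE example; the constants mirror the
    values quoted in this message, not the zsmalloc headers):

    #include <stdio.h>

    int main(void)
    {
            const unsigned long page_size = 64 * 1024;      /* 64KB PAGE_SIZE */
            const unsigned long min_alloc = 32;             /* ZS_MIN_ALLOC_SIZE */
            const unsigned long max_alloc = page_size;      /* ZS_MAX_ALLOC_SIZE */
            const unsigned long delta = 256;                /* ZS_SIZE_CLASS_DELTA */
            const unsigned long nr_classes = (max_alloc - min_alloc) / delta + 1;
            const unsigned long biggest = min_alloc + (nr_classes - 1) * delta;

            printf("size classes: %lu\n", nr_classes);      /* 256 */
            printf("largest class size: %lu\n", biggest);   /* 65312 */
            printf("ZS_MAX_ALLOC_SIZE:  %lu\n", max_alloc); /* 65536 */
            printf("can store a max-sized object: %s\n",
                   biggest >= max_alloc ? "yes" : "no");    /* no -> the bug */
            return 0;
    }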

    [akpm@linux-foundation.org: restore ZS_SIZE_CLASSES to fix bisectability]
    Signed-off-by: Mahendran Ganesh
    Suggested-by: Minchan Kim
    Cc: Nitin Gupta
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mahendran Ganesh
     
  • kunmap_atomic() should be given the virtual address obtained from
    kmap_atomic(). However, some pieces of code in zsmalloc pass a
    modified address, not the one returned by kmap_atomic(), to
    kunmap_atomic().

    It happens to work because zsmalloc only modifies the address within
    the PAGE_SIZE boundary, which the current kmap_atomic() implementation
    tolerates. But it is still fragile against potential changes to
    kmap_atomic(), so let's correct it.

    I hit a subtle bug when I implemented a new zsmalloc feature
    (compaction) due to mishandling of a link (the link crossed a page
    boundary). Although it was totally my mistake, it took a while to
    find the cause because an unpredictable kmapped address was being
    unmapped, causing an almost random crash.
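
    A tiny model of the rule being enforced (a userspace sketch;
    kmap_atomic()/kunmap_atomic() are the real kernel API, while the
    map/unmap pair below only mimics the contract that unmap must be
    handed exactly the pointer map returned):

    #include <assert.h>
    #include <stdio.h>

    /* Stand-ins for kmap_atomic()/kunmap_atomic(): unmap must be given
     * the pointer that map returned, not an address derived from it. */
    static void *last_mapped;

    static void *toy_map(void *page)
    {
            last_mapped = page;
            return page;
    }

    static void toy_unmap(void *addr)
    {
            assert(addr == last_mapped);    /* fragile if given base + offset */
            last_mapped = NULL;
    }

    int main(void)
    {
            char page[4096];
            char *base = toy_map(page);

            /* Work at an offset inside the mapping... */
            base[128] = 42;
            /* ...but unmap with the address the map call handed back. */
            toy_unmap(base);
            printf("page[128] = %d\n", page[128]);
            return 0;
    }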

    Signed-off-by: Minchan Kim
    Cc: Nitin Gupta
    Cc: Sergey Senozhatsky
    Cc: Dan Streetman
    Cc: Seth Jennings
    Cc: Jerome Marchand
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • Mahendran Ganesh reported that zpool-enabled zsmalloc should not call
    zpool_unregister_driver() from zs_init() if cpu notifier registration has
    failed, because error handling is performed before we register the driver
    via zpool_register_driver() call.

    Factor out cpu notifier registration and unregistration code and fix
    zs_init() error handling.

    link: http://lkml.iu.edu//hypermail/linux/kernel/1411.1/04156.html
    [akpm@linux-foundation.org: squash bogus gcc warning]
    [akpm@linux-foundation.org: use __init and __exit]
    Signed-off-by: Sergey Senozhatsky
    Reported-by: Mahendran Ganesh
    Cc: Minchan Kim
    Cc: Nitin Gupta
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sergey Senozhatsky
     
  • zsmalloc has many size_classes to reduce fragmentation, and they come
    in 16-byte steps, for example 16, 32, 48, etc., if PAGE_SIZE is 4096.
    zsmalloc also has the constraint that each zspage consists of at most
    4 pages.

    In this situation, we can see an interesting aspect. Consider the
    size_classes for 1488, 1472, ..., 1376. To prevent external
    fragmentation, they all use 4 pages per zspage, and so each of them
    can contain at most 11 objects.

    16384 (4096 * 4) = 1488 * 11 + remains
    16384 (4096 * 4) = 1472 * 11 + remains
    16384 (4096 * 4) = ...
    16384 (4096 * 4) = 1376 * 11 + remains

    That means they all have the same characteristics and there is no
    need to distinguish between them. If we use one size_class for all of
    them, we can reduce fragmentation and save some memory: since both
    the 1488- and 1472-byte classes can only fit 11 objects into 4 pages,
    and a 1472-byte object fits into a 1488-byte slot, merging these
    classes to always use 1488-byte objects reduces the total number of
    size classes. And reducing the total number of size classes reduces
    overall fragmentation, because a wider range of compressed pages can
    fit into a single size class, leaving fewer unused objects in each
    class.

    For this purpose, this patch implements size_class merging. If a
    size_class has the same pages_per_zspage and the same number of
    objects per zspage as the previous size_class, we don't create a new
    size_class; instead, we reuse the previous size_class with the same
    characteristics. This way, the example sizes above (1488, 1472, ...,
    1376) share just one size_class, so we get much better memory
    utilization.
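
    The grouping can be reproduced with a short standalone calculation
    for 4KB pages and 16-byte classes, mirroring the numbers in the
    example above (the page-count heuristic is a rough approximation of
    zsmalloc's class setup, not the zsmalloc code itself):

    #include <stdio.h>

    #define PAGE_SIZE               4096
    #define MAX_PAGES_PER_ZSPAGE    4

    /* Smallest number of pages (up to 4) that minimizes per-zspage
     * waste, roughly what zsmalloc's class setup aims for. */
    static int pages_per_zspage(int size)
    {
            int best_pages = 1, best_waste = PAGE_SIZE, pages;

            for (pages = 1; pages <= MAX_PAGES_PER_ZSPAGE; pages++) {
                    int waste = (pages * PAGE_SIZE) % size;

                    if (waste < best_waste) {
                            best_waste = waste;
                            best_pages = pages;
                    }
            }
            return best_pages;
    }

    int main(void)
    {
            int size;

            /* Classes from 1376 to 1488 bytes all end up with 4 pages per
             * zspage and 11 objects per zspage, so one merged class with
             * 1488-byte objects can serve all of them. */
            for (size = 1376; size <= 1488; size += 16) {
                    int pages = pages_per_zspage(size);
                    int objs = pages * PAGE_SIZE / size;

                    printf("size %4d: %d pages/zspage, %2d objs/zspage\n",
                           size, pages, objs);
            }
            return 0;
    }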

    Below is the result of my simple test.

    TEST ENV: EXT4 on zram, mounted with the discard option
    WORKLOAD: untar the kernel source code, then remove directories in
    descending order of size (drivers arch fs sound include net
    Documentation firmware kernel tools)

    Each line shows orig_data_size, compr_data_size, mem_used_total,
    fragmentation overhead (mem_used - compr_data_size) and the overhead
    ratio (overhead to compr_data_size), respectively, after the untar
    and remove operations are executed.

    * untar-nomerge.out

    orig_size compr_size used_size overhead overhead_ratio
    525.88MB 199.16MB 210.23MB 11.08MB 5.56%
    288.32MB 97.43MB 105.63MB 8.20MB 8.41%
    177.32MB 61.12MB 69.40MB 8.28MB 13.55%
    146.47MB 47.32MB 56.10MB 8.78MB 18.55%
    124.16MB 38.85MB 48.41MB 9.55MB 24.58%
    103.93MB 31.68MB 40.93MB 9.25MB 29.21%
    84.34MB 22.86MB 32.72MB 9.86MB 43.13%
    66.87MB 14.83MB 23.83MB 9.00MB 60.70%
    60.67MB 11.11MB 18.60MB 7.49MB 67.48%
    55.86MB 8.83MB 16.61MB 7.77MB 88.03%
    53.32MB 8.01MB 15.32MB 7.31MB 91.24%

    * untar-merge.out

    orig_size compr_size used_size overhead overhead_ratio
    526.23MB 199.18MB 209.81MB 10.64MB 5.34%
    288.68MB 97.45MB 104.08MB 6.63MB 6.80%
    177.68MB 61.14MB 66.93MB 5.79MB 9.47%
    146.83MB 47.34MB 52.79MB 5.45MB 11.51%
    124.52MB 38.87MB 44.30MB 5.43MB 13.96%
    104.29MB 31.70MB 36.83MB 5.13MB 16.19%
    84.70MB 22.88MB 27.92MB 5.04MB 22.04%
    67.11MB 14.83MB 19.26MB 4.43MB 29.86%
    60.82MB 11.10MB 14.90MB 3.79MB 34.17%
    55.90MB 8.82MB 12.61MB 3.79MB 42.97%
    53.32MB 8.01MB 11.73MB 3.73MB 46.53%

    As you can see from the results above, the merged version has better
    utilization (overhead ratio, 5th column) and uses less memory
    (mem_used_total, 3rd column).

    Signed-off-by: Joonsoo Kim
    Cc: Minchan Kim
    Cc: Nitin Gupta
    Cc: Jerome Marchand
    Cc: Sergey Senozhatsky
    Reviewed-by: Dan Streetman
    Cc: Luigi Semenzato
    Cc:
    Cc: "seungho1.park"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • Remove unused mem_cgroup_lru_names_not_uptodate() and move BUILD_BUG_ON()
    to the beginning of memcg_stat_show().

    This was partially found by using a static code analysis program called
    cppcheck.

    Signed-off-by: Rickard Strandqvist
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rickard Strandqvist
     
  • Suppose task @t that belongs to a memory cgroup @memcg is going to
    allocate an object from a kmem cache @c. The copy of @c corresponding to
    @memcg, @mc, is empty. Then if kmem_cache_alloc races with the memory
    cgroup destruction we can access the memory cgroup's copy of the cache
    after it was destroyed:

    CPU0                                CPU1
    ----                                ----
    [ current=@t
      @mc->memcg_params->nr_pages=0 ]

    kmem_cache_alloc(@c):
      call memcg_kmem_get_cache(@c);
      proceed to allocation from @mc:
        alloc a page for @mc:
          ...
                                        move @t from @memcg
                                        destroy @memcg:
                                          mem_cgroup_css_offline(@memcg):
                                            memcg_unregister_all_caches(@memcg):
                                              kmem_cache_destroy(@mc)

        add page to @mc

    We could fix this issue by taking a reference to a per-memcg cache, but
    that would require adding a per-cpu reference counter to per-memcg caches,
    which would look cumbersome.

    Instead, let's take a reference to a memory cgroup, which already has a
    per-cpu reference counter, in the beginning of kmem_cache_alloc to be
    dropped in the end, and move per memcg caches destruction from css offline
    to css free. As a side effect, per-memcg caches will be destroyed not one
    by one, but all at once when the last page accounted to the memory cgroup
    is freed. This doesn't sound like a high price to pay for code
    readability, though.

    Note, this patch does add some overhead to the kmem_cache_alloc hot path,
    but it is pretty negligible - it's just a function call plus a per cpu
    counter decrement, which is comparable to what we already have in
    memcg_kmem_get_cache. Besides, it's only relevant if there are memory
    cgroups with kmem accounting enabled. I don't think we can find a way to
    handle this race w/o it, because alloc_page called from kmem_cache_alloc
    may sleep so we can't flush all pending kmallocs w/o reference counting.
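
    The essence of the scheme is plain reference counting around the
    allocation, so that destruction is deferred past any allocation in
    flight (a single-threaded userspace model of the get/put discipline;
    the kernel uses the css percpu refcounter, not this code):

    #include <stdatomic.h>
    #include <stdio.h>

    /* Toy stand-in for a memory cgroup whose per-memcg caches live until
     * the last reference is dropped. */
    struct toy_memcg {
            atomic_int refs;
    };

    static void memcg_get(struct toy_memcg *memcg)
    {
            atomic_fetch_add(&memcg->refs, 1);
    }

    static void memcg_put(struct toy_memcg *memcg)
    {
            if (atomic_fetch_sub(&memcg->refs, 1) == 1)
                    printf("last reference dropped: destroy per-memcg caches\n");
    }

    int main(void)
    {
            struct toy_memcg memcg = { .refs = 1 };     /* "online" reference */

            /* A kmem_cache_alloc()-shaped path pins the memcg first... */
            memcg_get(&memcg);
            /* ...so a concurrent cgroup removal only drops the base ref: */
            memcg_put(&memcg);
            printf("allocation still using the per-memcg cache safely\n");
            /* The caches go away when the allocation drops its reference. */
            memcg_put(&memcg);
            return 0;
    }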

    Signed-off-by: Vladimir Davydov
    Acked-by: Christoph Lameter
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • test_mem_cgroup_node_reclaimable() is used only when MAX_NUMNODES > 1, so
    move it into the compiler if statement

    [akpm@linux-foundation.org: clean up layout]
    Signed-off-by: Michele Curti
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michele Curti
     
  • A random seek IO benchmark appeared to regress because of a change to
    readahead, but the real problem was the benchmark. To ensure the IO
    requests accessed the disk, it used fadvise(FADV_DONTNEED) on a block
    boundary (512K), but the hint is ignored by the kernel. This is
    necessarily obvious behaviour. As much as I dislike comment patches, the
    explanation for this behaviour predates current git history. Clarify why
    it behaves like this in case someone "fixes" fadvise or readahead for the
    wrong reasons.

    Signed-off-by: Mel Gorman
    Cc: Michael Kerrisk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Read memory barriers must follow the read operations.
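
    The rule in its usual flag-and-data form (a generic userspace
    illustration using C11 fences instead of the kernel's
    smp_wmb()/smp_rmb(), which follow the same pattern):

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>

    static int payload;             /* plain data */
    static atomic_int ready;        /* publication flag */

    static void *producer(void *arg)
    {
            (void)arg;
            payload = 42;                           /* write the data... */
            atomic_thread_fence(memory_order_release);
            atomic_store_explicit(&ready, 1, memory_order_relaxed); /* ...then the flag */
            return NULL;
    }

    static void *consumer(void *arg)
    {
            (void)arg;
            while (!atomic_load_explicit(&ready, memory_order_relaxed))
                    ;                               /* read the flag first... */
            /* ...the read barrier follows that read, and only then is the
             * data it guards read. */
            atomic_thread_fence(memory_order_acquire);
            printf("payload = %d\n", payload);
            return NULL;
    }

    int main(void)
    {
            pthread_t p, c;

            pthread_create(&p, NULL, producer, NULL);
            pthread_create(&c, NULL, consumer, NULL);
            pthread_join(p, NULL);
            pthread_join(c, NULL);
            return 0;
    }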

    Signed-off-by: Dmitry Vyukov
    Cc: Eric Dumazet
    Acked-by: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dmitry Vyukov
     
  • After the previous patch we can remove the PT_TRACE_EXIT check in
    oom_scan_process_thread(). It was added to handle the case where the
    coredumping was "frozen" by ptrace, but it doesn't really work. If
    nothing else, we would need to check all threads which could share
    the same ->mm to make it more or less correct.

    Signed-off-by: Oleg Nesterov
    Cc: Cong Wang
    Cc: David Rientjes
    Acked-by: Michal Hocko
    Cc: "Rafael J. Wysocki"
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov