20 Dec, 2014

1 commit

  • Pull vfs pile #3 from Al Viro:
    "Assorted fixes and patches from the last cycle"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    [regression] chunk lost from bd9b51
    vfs: make mounts and mountstats honor root dir like mountinfo does
    vfs: cleanup show_mountinfo
    init: fix read-write root mount
    unfuck binfmt_misc.c (broken by commit e6084d4)
    vm_area_operations: kill ->migrate()
    new helper: iter_is_iovec()
    move_extent_per_page(): get rid of unused w_flags
    lustre: get rid of playing with ->fs
    btrfs: filp_open() returns ERR_PTR() on failure, not NULL...

    Linus Torvalds
     

19 Dec, 2014

4 commits

  • Currently, the functions in zsmalloc.c are not arranged in a readable
    and reasonable sequence. As more and more functions are added, this
    becomes inconvenient. For example:

    Current functions:

    void zs_init()
    {
    }

    static void get_maxobj_per_zspage()
    {
    }

    Now suppose I want to add a func_1() that is called from zs_init(), and
    this newly added func_1() uses get_maxobj_per_zspage(), which is
    defined below zs_init().

    void func_1()
    {
        get_maxobj_per_zspage();
    }

    void zs_init()
    {
        func_1();
    }

    static void get_maxobj_per_zspage()
    {
    }

    This will cause a compile error, so we must add a forward declaration:

    static void get_maxobj_per_zspage();

    before func_1() if we do not put get_maxobj_per_zspage() before
    func_1().

    In addition, putting the module_[init|exit] functions at the bottom of
    the file follows the usual convention.

    So, this patch adjusts the function order as follows:

    /* helper functions */
    ...
    obj_location_to_handle()
    ...

    /* Some exported functions */
    ...

    zs_map_object()
    zs_unmap_object()

    zs_malloc()
    zs_free()

    zs_init()
    zs_exit()

    Signed-off-by: Ganesh Mahendran
    Cc: Nitin Gupta
    Acked-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ganesh Mahendran
     
  • Belatedly document the changes in commit f0c6d4d295e4 ("mm: introduce
    do_shared_fault() and drop do_fault()").

    Cc: Andi Kleen
    Cc: Bob Liu
    Cc: Dave Hansen
    Cc: "Kirill A. Shutemov"
    Cc: Matthew Wilcox
    Cc: Mel Gorman
    Cc: Naoya Horiguchi
    Cc: Rik van Riel
    Cc: Sasha Levin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • When the system boots up, the dmesg log shows the memory statistics
    along with the total reserved memory, as below:

    Memory: 458840k/458840k available, 65448k reserved, 0K highmem

    When CMA is enabled, the total reserved memory shown still remains the
    same, and the CMA memory is not called out as reserved. Yet when we
    look at /proc/meminfo, the CMA memory is counted as part of free
    memory. This creates confusion. This patch corrects the problem by
    properly subtracting the CMA reserved memory from the total reserved
    memory in the dmesg log.

    Below is a dmesg snapshot from an ARM-based device with 512MB RAM and a
    single 12MB CMA region.

    Before this change:
    Memory: 458840k/458840k available, 65448k reserved, 0K highmem

    After this change:
    Memory: 458840k/458840k available, 53160k reserved, 12288k cma-reserved, 0K highmem

    Signed-off-by: Pintu Kumar
    Signed-off-by: Vishnu Pratap Singh
    Acked-by: Michal Nazarewicz
    Cc: Rafael Aquini
    Cc: Jerome Marchand
    Cc: Marek Szyprowski
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pintu Kumar
     
  • When nodes is true, nsc->mask2 has already been filtered by nsc->mask1,
    which has already factored in node_states[N_MEMORY].

    Signed-off-by: Zhihui Zhang
    Cc: Mel Gorman
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zhihui Zhang
     

18 Dec, 2014

1 commit

  • Dave Hansen reports that commit fb7332a9fedf ("mmu_gather: move minimal
    range calculations into generic code") caused a performance problem:

    "tlb_finish_mmu() goes up about 9x in the profiles (~0.4%->3.6%) and
    tlb_flush_mmu_free() takes about 3.1% of CPU time with the patch
    applied, but does not show up at all on the commit before"

    and the reason is that Will moved the test for whether we need to flush
    from tlb_flush_mmu() into tlb_flush_mmu_tlbonly(). But that meant that
    tlb_flush_mmu_free() basically lost that check.

    Move it back into tlb_flush_mmu() where it belongs, so that it covers
    both tlb_flush_mmu_tlbonly() _and_ tlb_flush_mmu_free().
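
    A minimal sketch of the restored structure, assuming the "need to
    flush" state is the gather's end marker as in the generic mmu_gather
    code (not necessarily the exact upstream hunk):

    void tlb_flush_mmu(struct mmu_gather *tlb)
    {
            /* Early out restored here: if nothing was gathered, skip both
             * the TLB flush and the page-freeing work below. */
            if (!tlb->end)
                    return;

            tlb_flush_mmu_tlbonly(tlb);
            tlb_flush_mmu_free(tlb);
    }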

    Reported-and-tested-by: Dave Hansen
    Acked-by: Will Deacon
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

17 Dec, 2014

3 commits

  • The only instance this method has ever grown was one in kernfs - one
    that calls ->migrate() of another vm_ops if it exists.

    Signed-off-by: Al Viro

    Al Viro
     
  • Signed-off-by: Al Viro

    Al Viro
     
  • Pull nfsd updates from Bruce Fields:
    "A comparatively quieter cycle for nfsd this time, but still with two
    larger changes:

    - RPC server scalability improvements from Jeff Layton (using RCU
    instead of a spinlock to find idle threads).

    - server-side NFSv4.2 ALLOCATE/DEALLOCATE support from Anna
    Schumaker, enabling fallocate on new clients"

    * 'for-3.19' of git://linux-nfs.org/~bfields/linux: (32 commits)
    nfsd4: fix xdr4 count of server in fs_location4
    nfsd4: fix xdr4 inclusion of escaped char
    sunrpc/cache: convert to use string_escape_str()
    sunrpc: only call test_bit once in svc_xprt_received
    fs: nfsd: Fix signedness bug in compare_blob
    sunrpc: add some tracepoints around enqueue and dequeue of svc_xprt
    sunrpc: convert to lockless lookup of queued server threads
    sunrpc: fix potential races in pool_stats collection
    sunrpc: add a rcu_head to svc_rqst and use kfree_rcu to free it
    sunrpc: require svc_create callers to pass in meaningful shutdown routine
    sunrpc: have svc_wake_up only deal with pool 0
    sunrpc: convert sp_task_pending flag to use atomic bitops
    sunrpc: move rq_cachetype field to better optimize space
    sunrpc: move rq_splice_ok flag into rq_flags
    sunrpc: move rq_dropme flag into rq_flags
    sunrpc: move rq_usedeferral flag to rq_flags
    sunrpc: move rq_local field to rq_flags
    sunrpc: add a generic rq_flags field to svc_rqst and move rq_secure to it
    nfsd: minor off by one checks in __write_versions()
    sunrpc: release svc_pool_map reference when serv allocation fails
    ...

    Linus Torvalds
     

16 Dec, 2014

1 commit

  • Pull drm updates from Dave Airlie:
    "Highlights:

    - AMD KFD driver merge

    This is the AMD HSA interface for exposing a lowlevel interface for
    GPGPU use. They have an open source userspace built on top of this
    interface, and the code looks as good as it was going to get out of
    tree.

    - Initial atomic modesetting work

    The need for an atomic modesetting interface to allow userspace to
    try and send a complete set of modesetting state to the driver has
    arisen, and been suffering from neglect this past year. No more,
    the start of the common code and changes for msm driver to use it
    are in this tree. Ongoing work to get the userspace ioctl finished
    and the code clean will probably wait until next kernel.

    - DisplayID 1.3 and tiled monitor exposed to userspace.

    Tiled monitor property is now exposed for userspace to make use of.

    - Rockchip drm driver merged.

    - imx gpu driver moved out of staging

    Other stuff:

    - core:
    panel - MIPI DSI + new panels.
    expose suggested x/y properties for virtual GPUs

    - i915:
    Initial Skylake (SKL) support
    gen3/4 reset work
    start of dri1/ums removal
    infoframe tracking
    fixes for lots of things.

    - nouveau:
    tegra k1 voltage support
    GM204 modesetting support
    GT21x memory reclocking work

    - radeon:
    CI dpm fixes
    GPUVM improvements
    Initial DPM fan control

    - rcar-du:
    HDMI support added
    removed some support for old boards
    slave encoder driver for Analog Devices adv7511

    - exynos:
    Exynos4415 SoC support

    - msm:
    a4xx gpu support
    atomic helper conversion

    - tegra:
    iommu support
    universal plane support
    ganged-mode DSI support

    - sti:
    HDMI i2c improvements

    - vmwgfx:
    some late fixes.

    - qxl:
    use suggested x/y properties"

    * 'drm-next' of git://people.freedesktop.org/~airlied/linux: (969 commits)
    drm: sti: fix module compilation issue
    drm/i915: save/restore GMBUS freq across suspend/resume on gen4
    drm: sti: correctly cleanup CRTC and planes
    drm: sti: add HQVDP plane
    drm: sti: add cursor plane
    drm: sti: enable auxiliary CRTC
    drm: sti: fix delay in VTG programming
    drm: sti: prepare sti_tvout to support auxiliary crtc
    drm: sti: use drm_crtc_vblank_{on/off} instead of drm_vblank_{on/off}
    drm: sti: fix hdmi avi infoframe
    drm: sti: remove event lock while disabling vblank
    drm: sti: simplify gdp code
    drm: sti: clear all mixer control
    drm: sti: remove gpio for HDMI hot plug detection
    drm: sti: allow to change hdmi ddc i2c adapter
    drm/doc: Document drm_add_modes_noedid() usage
    drm/i915: Remove '& 0xffff' from the mask given to WA_REG()
    drm/i915: Invert the mask and val arguments in wa_add() and WA_REG()
    drm: Zero out DRM object memory upon cleanup
    drm/i915/bdw: Fix the write setting up the WIZ hashing mode
    ...

    Linus Torvalds
     

15 Dec, 2014

1 commit


14 Dec, 2014

29 commits

  • There are actually two issues this patch addresses. Let me start with
    the one I tried to solve in the beginning.

    So, in the checkpoint-restore project (criu) we try to dump a task's
    state and restore it back exactly as it was. One of the pieces of task
    state is the rings set up with the io_setup() call. There are (almost)
    no problems dumping them; the problem is restoring them -- if I dump a
    task with an aio ring originally mapped at address A, I want to restore
    it back at exactly the same address A. Unfortunately, io_setup() does
    not allow for that -- it mmaps the ring at whatever place mm finds
    appropriate (it calls do_mmap_pgoff() with a zero address and without
    the MAP_FIXED flag).

    To make restore possible I'm going to mremap() the freshly created ring
    to address A (where it was seen before the dump). The problem is that
    the ring's virtual address is passed back to user-space as the context
    ID, and this ID is then used as the search key by all the other
    io_foo() calls. Reworking this ID to be just some integer doesn't seem
    to work, as the value is already used by libaio as a pointer through
    which the library accesses memory for aio meta-data.

    So, to make restore work we need to make sure that

    a) ring is mapped at desired virtual address
    b) kioctx->user_id matches this value

    Having said that, the patch makes mremap() on the aio region update the
    kioctx's user_id and mmap_base values.
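
    A rough sketch of that idea (simplified, not the exact upstream code;
    in particular, how the kioctx is reached from the VMA is an assumption
    here):

    static int aio_ring_mremap(struct vm_area_struct *vma)
    {
            /* Assumed lookup: get the kioctx back from the ring mapping. */
            struct kioctx *ctx = vma->vm_file->private_data;

            /* Keep the context ID in sync with the ring's new location. */
            ctx->mmap_base = vma->vm_start;
            ctx->user_id = ctx->mmap_base;
            return 0;
    }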

    Here appears the second issue I mentioned at the beginning of this
    mail. If (regardless of the C/R dances I do) someone creates an io
    context with io_setup(), then mremap()-s the ring and then destroys the
    context, the kill_ioctx() routine will call munmap() on the wrong (old)
    address. This results in a) the aio ring remaining in memory and b)
    some other vma getting unexpectedly unmapped.

    What do you think?

    Signed-off-by: Pavel Emelyanov
    Acked-by: Dmitry Monakhov
    Signed-off-by: Benjamin LaHaise

    Pavel Emelyanov
     
  • kmemleak will add allocations as objects to a pool. The memory allocated
    for each object in this pool is periodically searched for pointers to
    other allocated objects. This only works for memory that is mapped into
    the kernel's virtual address space, which happens not to be the case for
    most CMA regions.

    Furthermore, CMA regions are typically used to store data transferred to
    or from a device and therefore don't contain pointers to other objects.

    Without this, the kernel crashes on the first execution of the
    scan_gray_list() because it tries to access highmem. Perhaps a more
    appropriate fix would be to reject any object that can't map to a kernel
    virtual address?
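
    The fix itself is small. A hedged sketch of the idea, for a CMA region
    whose physical base address is held in a variable called base here
    (illustrative, not the verbatim hunk):

    /* Tell kmemleak to neither scan nor report the CMA reserved range;
     * phys_to_virt() gives the lowmem alias where one exists. */
    kmemleak_ignore(phys_to_virt(base));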

    [akpm@linux-foundation.org: add comment]
    [akpm@linux-foundation.org: fix comment, per Catalin]
    [sfr@canb.auug.org.au: include linux/io.h for phys_to_virt()]
    Signed-off-by: Thierry Reding
    Cc: Michal Nazarewicz
    Cc: Marek Szyprowski
    Cc: Joonsoo Kim
    Cc: "Aneesh Kumar K.V"
    Cc: Catalin Marinas
    Signed-off-by: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Thierry Reding
     
  • If we fail to allocate from the current node's stock, we look for free
    objects on other nodes before calling the page allocator (see
    get_any_partial). While checking other nodes we respect cpuset
    constraints by calling cpuset_zone_allowed, and we currently enforce
    the hardwall check. As a result, we fall back to the page allocator
    even when there are partial slabs cached on other nodes, if those nodes
    are not set in the current cpuset. However, the page allocator uses the
    softwall check for kernel allocations, so it may allocate from one of
    those other nodes anyway.

    Therefore we should use softwall cpuset check in get_any_partial to
    conform with the cpuset check in the page allocator.
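
    A hedged illustration of the distinction (the helper name and call site
    are approximations of the cpuset API, not a verbatim hunk):

    /* In get_any_partial(): consult the cpuset with the allocation's gfp
     * flags - the same "softwall" decision the page allocator will make -
     * instead of the unconditional hardwall variant. */
    if (!cpuset_zone_allowed(zone, flags))
            continue;       /* node not usable for this allocation */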

    Signed-off-by: Vladimir Davydov
    Acked-by: Zefan Li
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • fallback_alloc is called on kmalloc if the preferred node has no free
    or partial slabs and there are no pages on the node's free list
    (GFP_THISNODE allocations fail). Before invoking the reclaimer it tries
    to locate a free or partial slab on other allowed nodes' lists. While
    iterating over the preferred node's zonelist it skips those zones for
    which the hardwall cpuset check returns false. That means that for a
    task bound to a specific node using cpusets, fallback_alloc will always
    ignore free slabs on other nodes and go directly to the reclaimer,
    which, however, may allocate from other nodes if cpuset.mem_hardwall is
    unset (the default). As a result, the lists of free slabs on other
    nodes may grow without bound, which is bad, because inactive slabs are
    only evicted by cache_reap at a very slow rate and cannot be dropped
    forcefully.

    To reproduce the issue, run a process that will walk over a directory tree
    with lots of files inside a cpuset bound to a node that constantly
    experiences memory pressure. Look at num_slabs vs active_slabs growth as
    reported by /proc/slabinfo.

    To avoid this we should use softwall cpuset check in fallback_alloc.

    Signed-off-by: Vladimir Davydov
    Acked-by: Zefan Li
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • When zbud is initialized through the zpool wrapper, pool->ops, which
    points to the user-defined operations, is always set regardless of
    whether the upper layer actually specified them. This causes
    zbud_reclaim_page() to iterate its eviction loop over pool pages
    without any gain.

    This patch sets the user-defined ops only when they are needed, so that
    zbud_reclaim_page() can bail out of the reclamation loop early when no
    user-defined operations were specified.

    Signed-off-by: Heesub Shin
    Acked-by: Dan Streetman
    Cc: Seth Jennings
    Cc: Sunae Seo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Heesub Shin
     
  • free_percpu() tests whether its argument is NULL and then returns
    immediately. Thus the test around the call is not needed.

    This issue was detected by using the Coccinelle software.
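
    In other words (an illustrative fragment, not the actual hunk; ptr
    stands for the per-cpu pointer in question):

    /* before: redundant NULL test around the call */
    if (ptr)
            free_percpu(ptr);

    /* after: free_percpu() already returns early on NULL */
    free_percpu(ptr);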

    Signed-off-by: Markus Elfring
    Cc: Seth Jennings
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Markus Elfring
     
  • zswap_cpu_init/zswap_comp_exit/zswap_entry_cache_create are only called
    by __init init_zswap().

    Signed-off-by: Mahendran Ganesh
    Cc: Seth Jennings
    Cc: Dan Streetman
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mahendran Ganesh
     
  • In zs_create_pool(), we allocate more memory than sizeof(struct
    zs_pool):

    ovhd_size = roundup(sizeof(*pool), PAGE_SIZE);

    This patch allocates exactly the needed amount of memory.
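
    Roughly (a before/after sketch, not the verbatim diff):

    /* before: rounds the allocation up to a whole page */
    ovhd_size = roundup(sizeof(*pool), PAGE_SIZE);
    pool = kzalloc(ovhd_size, GFP_KERNEL);

    /* after: allocate exactly what struct zs_pool needs */
    pool = kzalloc(sizeof(*pool), GFP_KERNEL);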

    Signed-off-by: Ganesh Mahendran
    Acked-by: Minchan Kim
    Cc: Nitin Gupta
    Cc: Dan Streetman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ganesh Mahendran
     
  • In zs_create_pool(), prev_class is assigned (ZS_SIZE_CLASSES - 1)
    times, yet prev_class only ever refers to the previous size_class, so
    most of these assignments are unnecessary.

    This patch assigns *prev_class* when a new size_class structure is
    allocated and uses prev_class to check whether the first class has been
    allocated.

    [akpm@linux-foundation.org: remove now-unused ZS_SIZE_CLASSES]
    Signed-off-by: Ganesh Mahendran
    Cc: Minchan Kim
    Cc: Nitin Gupta
    Reviewed-by: Dan Streetman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ganesh Mahendran
     
  • I sent a patch [1] for an unnecessary check in zsmalloc, and Minchan
    Kim found that zsmalloc does not even support allocating an object of
    size ZS_MAX_ALLOC_SIZE in some situations.

    For example, on a system with a 64KB PAGE_SIZE and 32-bit physical
    addresses:

    ZS_MIN_ALLOC_SIZE is 32 bytes, calculated as
    MAX(32, (ZS_MAX_PAGES_PER_ZSPAGE << PAGE_SHIFT >> OBJ_INDEX_BITS))
    ZS_MAX_ALLOC_SIZE is 64KB (in the current code, PAGE_SIZE)
    ZS_SIZE_CLASS_DELTA is 256 bytes

    So ZS_SIZE_CLASSES
    = (ZS_MAX_ALLOC_SIZE - ZS_MIN_ALLOC_SIZE) / ZS_SIZE_CLASS_DELTA + 1
    = 256

    In zs_create_pool(), the largest object size that can be allocated is:
    ZS_MIN_ALLOC_SIZE + i * ZS_SIZE_CLASS_DELTA = 32 + 255*256 = 65312

    We can see that 65312 < 65536 (ZS_MAX_ALLOC_SIZE), so we can NOT
    allocate objects of size ZS_MAX_ALLOC_SIZE (65536), which we promise
    upper layers we can do.

    [1] http://lkml.iu.edu/hypermail/linux/kernel/1411.2/03835.html
    [2] http://lkml.iu.edu/hypermail/linux/kernel/1411.2/04534.html

    This patch fixes the issue by dynamically calculating zs_size_classes
    when the module is loaded and allocating a buffer of size
    ZS_MAX_ALLOC_SIZE, so that the largest object (of size
    ZS_MAX_ALLOC_SIZE) can be stored in it.
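
    A sketch of the runtime calculation described above, rounding up so the
    top class really covers ZS_MAX_ALLOC_SIZE (not necessarily the exact
    upstream code):

    static int zs_size_classes;

    static void init_zs_size_classes(void)
    {
            int nr = (ZS_MAX_ALLOC_SIZE - ZS_MIN_ALLOC_SIZE) /
                     ZS_SIZE_CLASS_DELTA + 1;

            /* Round up, so a 65312-vs-65536 style gap cannot happen. */
            if ((ZS_MAX_ALLOC_SIZE - ZS_MIN_ALLOC_SIZE) % ZS_SIZE_CLASS_DELTA)
                    nr += 1;

            zs_size_classes = nr;
    }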

    [akpm@linux-foundation.org: restore ZS_SIZE_CLASSES to fix bisectability]
    Signed-off-by: Mahendran Ganesh
    Suggested-by: Minchan Kim
    Cc: Nitin Gupta
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mahendran Ganesh
     
  • kunmap_atomic() should be passed the virtual address returned by
    kmap_atomic(). However, some pieces of code in zsmalloc pass a modified
    address rather than the one obtained from kmap_atomic().

    It happens to work because zsmalloc only modifies the address within
    the PAGE_SIZE boundary, which the current kmap_atomic() implementation
    tolerates. But it is still fragile against potential changes to
    kmap_atomic(), so let's correct it.
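
    In short, the rule being restored is (illustrative snippet; page,
    offset, buf and len are whatever the caller has at hand):

    void *vaddr = kmap_atomic(page);

    memcpy(vaddr + offset, buf, len);   /* using an offset is fine...      */
    kunmap_atomic(vaddr);               /* ...but unmap the original
                                           pointer, not vaddr + offset     */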

    I hit a subtle bug when I implemented a new zsmalloc feature
    (compaction), due to mishandling of a link (the link crossed a page
    boundary). Although it was entirely my mistake, it took a while to find
    the cause because an unexpected kmapped address was being unmapped,
    causing an almost random crash.

    Signed-off-by: Minchan Kim
    Cc: Nitin Gupta
    Cc: Sergey Senozhatsky
    Cc: Dan Streetman
    Cc: Seth Jennings
    Cc: Jerome Marchand
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • Mahendran Ganesh reported that zpool-enabled zsmalloc should not call
    zpool_unregister_driver() from zs_init() if cpu notifier registration has
    failed, because error handling is performed before we register the driver
    via zpool_register_driver() call.

    Factor out cpu notifier registration and unregistration code and fix
    zs_init() error handling.

    link: http://lkml.iu.edu//hypermail/linux/kernel/1411.1/04156.html
    [akpm@linux-foundation.org: squash bogus gcc warning]
    [akpm@linux-foundation.org: use __init and __exit]
    Signed-off-by: Sergey Senozhatsky
    Reported-by: Mahendran Ganesh
    Cc: Minchan Kim
    Cc: Nitin Gupta
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sergey Senozhatsky
     
  • zsmalloc has many size_classes to reduce fragmentation, and they are in
    16-byte units, for example 16, 32, 48, etc., if PAGE_SIZE is 4096.
    zsmalloc also has the constraint that each zspage consists of at most 4
    pages.

    In this situation we can see an interesting aspect. Consider the
    size_classes for 1488, 1472, ..., 1376. To prevent external
    fragmentation, they each use 4 pages per zspage, and so each can
    contain at most 11 objects.

    16384 (4096 * 4) = 1488 * 11 + remains
    16384 (4096 * 4) = 1472 * 11 + remains
    16384 (4096 * 4) = ...
    16384 (4096 * 4) = 1376 * 11 + remains

    This means they all have the same characteristics, and classifying them
    separately isn't needed. If we use one size_class for them, we can
    reduce fragmentation and save some memory: since both the 1488- and
    1472-byte classes can only fit 11 objects into 4 pages, and a 1472-byte
    object fits into a 1488-byte slot, merging these classes to always use
    1488-byte objects reduces the total number of size classes. Reducing
    the total number of size classes in turn reduces overall fragmentation,
    because a wider range of compressed pages can fit into a single size
    class, leaving fewer unused objects in each class.

    For this purpose, this patch implements size_class merging. If a
    size_class has the same pages_per_zspage and the same number of objects
    per zspage as the previous size_class, we don't create a new
    size_class; instead, we reuse the previous one with the same
    characteristics. This way, the example sizes above (1488, 1472, ...,
    1376) use just one size_class, so we get much better memory
    utilization.
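
    A sketch of the merge test (a simplified helper, not the verbatim
    upstream code):

    /* Reuse the previous size_class when a new one would pack the same
     * number of objects into the same number of pages per zspage. */
    static bool can_merge(struct size_class *prev, int size,
                          int pages_per_zspage)
    {
            if (prev->pages_per_zspage != pages_per_zspage)
                    return false;

            return get_maxobj_per_zspage(prev->size, prev->pages_per_zspage) ==
                   get_maxobj_per_zspage(size, pages_per_zspage);
    }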

    Below are the results of my simple test.

    TEST ENV: EXT4 on zram, mounted with the discard option
    WORKLOAD: untar the kernel source code, then remove directories in
    descending order of size (drivers arch fs sound include net
    Documentation firmware kernel tools).

    Each line shows orig_data_size, compr_data_size, mem_used_total,
    fragmentation overhead (mem_used - compr_data_size) and the overhead
    ratio (overhead to compr_data_size), respectively, after the untar and
    after each remove operation.

    * untar-nomerge.out

    orig_size compr_size used_size overhead overhead_ratio
    525.88MB 199.16MB 210.23MB 11.08MB 5.56%
    288.32MB 97.43MB 105.63MB 8.20MB 8.41%
    177.32MB 61.12MB 69.40MB 8.28MB 13.55%
    146.47MB 47.32MB 56.10MB 8.78MB 18.55%
    124.16MB 38.85MB 48.41MB 9.55MB 24.58%
    103.93MB 31.68MB 40.93MB 9.25MB 29.21%
    84.34MB 22.86MB 32.72MB 9.86MB 43.13%
    66.87MB 14.83MB 23.83MB 9.00MB 60.70%
    60.67MB 11.11MB 18.60MB 7.49MB 67.48%
    55.86MB 8.83MB 16.61MB 7.77MB 88.03%
    53.32MB 8.01MB 15.32MB 7.31MB 91.24%

    * untar-merge.out

    orig_size compr_size used_size overhead overhead_ratio
    526.23MB 199.18MB 209.81MB 10.64MB 5.34%
    288.68MB 97.45MB 104.08MB 6.63MB 6.80%
    177.68MB 61.14MB 66.93MB 5.79MB 9.47%
    146.83MB 47.34MB 52.79MB 5.45MB 11.51%
    124.52MB 38.87MB 44.30MB 5.43MB 13.96%
    104.29MB 31.70MB 36.83MB 5.13MB 16.19%
    84.70MB 22.88MB 27.92MB 5.04MB 22.04%
    67.11MB 14.83MB 19.26MB 4.43MB 29.86%
    60.82MB 11.10MB 14.90MB 3.79MB 34.17%
    55.90MB 8.82MB 12.61MB 3.79MB 42.97%
    53.32MB 8.01MB 11.73MB 3.73MB 46.53%

    As the results above show, the merged version has better utilization
    (overhead ratio, 5th column) and uses less memory (mem_used_total, 3rd
    column).

    Signed-off-by: Joonsoo Kim
    Cc: Minchan Kim
    Cc: Nitin Gupta
    Cc: Jerome Marchand
    Cc: Sergey Senozhatsky
    Reviewed-by: Dan Streetman
    Cc: Luigi Semenzato
    Cc:
    Cc: "seungho1.park"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • Remove unused mem_cgroup_lru_names_not_uptodate() and move BUILD_BUG_ON()
    to the beginning of memcg_stat_show().

    This was partially found by using a static code analysis program called
    cppcheck.

    Signed-off-by: Rickard Strandqvist
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rickard Strandqvist
     
  • Suppose task @t that belongs to a memory cgroup @memcg is going to
    allocate an object from a kmem cache @c. The copy of @c corresponding to
    @memcg, @mc, is empty. Then if kmem_cache_alloc races with the memory
    cgroup destruction we can access the memory cgroup's copy of the cache
    after it was destroyed:

    CPU0                                    CPU1
    ----                                    ----
    [ current=@t
      @mc->memcg_params->nr_pages=0 ]

    kmem_cache_alloc(@c):
      call memcg_kmem_get_cache(@c);
      proceed to allocation from @mc:
        alloc a page for @mc:
          ...
                                            move @t from @memcg
                                            destroy @memcg:
                                              mem_cgroup_css_offline(@memcg):
                                                memcg_unregister_all_caches(@memcg):
                                                  kmem_cache_destroy(@mc)

        add page to @mc

    We could fix this issue by taking a reference to a per-memcg cache, but
    that would require adding a per-cpu reference counter to per-memcg caches,
    which would look cumbersome.

    Instead, let's take a reference to the memory cgroup, which already has
    a per-cpu reference counter, at the beginning of kmem_cache_alloc and
    drop it at the end, and move per-memcg cache destruction from css
    offline to css free. As a side effect, per-memcg caches will be
    destroyed not one by one, but all at once when the last page accounted
    to the memory cgroup is freed. That doesn't sound like too high a price
    for code readability, though.

    Note, this patch does add some overhead to the kmem_cache_alloc hot path,
    but it is pretty negligible - it's just a function call plus a per cpu
    counter decrement, which is comparable to what we already have in
    memcg_kmem_get_cache. Besides, it's only relevant if there are memory
    cgroups with kmem accounting enabled. I don't think we can find a way to
    handle this race w/o it, because alloc_page called from kmem_cache_alloc
    may sleep so we can't flush all pending kmallocs w/o reference counting.

    Signed-off-by: Vladimir Davydov
    Acked-by: Christoph Lameter
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • test_mem_cgroup_node_reclaimable() is used only when MAX_NUMNODES > 1,
    so move it inside the corresponding conditional compilation block.

    [akpm@linux-foundation.org: clean up layout]
    Signed-off-by: Michele Curti
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michele Curti
     
  • A random seek IO benchmark appeared to regress because of a change to
    readahead, but the real problem was the benchmark. To ensure the IO
    request accessed disk, it used fadvise(FADV_DONTNEED) on a block
    boundary (512K), but the hint is ignored by the kernel. This is correct
    but not necessarily obvious behaviour. As much as I dislike comment
    patches, the explanation for this behaviour predates current git
    history. Clarify why it behaves like this in case someone "fixes"
    fadvise or readahead for the wrong reasons.

    Signed-off-by: Mel Gorman
    Cc: Michael Kerrisk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Read memory barriers must follow the read operations.
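
    A generic illustration of the rule (not the patched code; ready and
    data stand for a flag and the payload it publishes):

    if (READ_ONCE(ready)) {
            smp_rmb();          /* the barrier goes after the read it orders */
            val = data;         /* only now read the published payload       */
    }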

    Signed-off-by: Dmitry Vyukov
    Cc: Eric Dumazet
    Acked-by: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dmitry Vyukov
     
  • After the previous patch we can remove the PT_TRACE_EXIT check in
    oom_scan_process_thread(); it was added to handle the case when the
    coredumping was "frozen" by ptrace, but it doesn't really work. If
    nothing else, we would need to check all threads which could share the
    same ->mm to make it more or less correct.

    Signed-off-by: Oleg Nesterov
    Cc: Cong Wang
    Cc: David Rientjes
    Acked-by: Michal Hocko
    Cc: "Rafael J. Wysocki"
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • oom_kill.c assumes that a PF_EXITING task will exit and free its memory
    soon. This is wrong in many ways, and one important case is the
    coredump: a task can sleep in exit_mm() "forever" while the coredumping
    sub-thread can need more memory.

    Change the PF_EXITING checks to take SIGNAL_GROUP_COREDUMP into
    account; we add a new trivial helper for that.

    Note: this is only the first step, this patch doesn't try to solve other
    problems. The SIGNAL_GROUP_COREDUMP check is obviously racy, a task can
    participate in coredump after it was already observed in PF_EXITING state,
    so TIF_MEMDIE (which also blocks oom-killer) still can be wrongly set.
    fatal_signal_pending() can be true because of SIGNAL_GROUP_COREDUMP so
    out_of_memory() and mem_cgroup_out_of_memory() shouldn't blindly trust it.
    And even the name/usage of the new helper is confusing: an exiting
    thread can only free its ->mm if it is the only/last task in the thread
    group.

    [akpm@linux-foundation.org: add comment]
    Signed-off-by: Oleg Nesterov
    Cc: Cong Wang
    Acked-by: David Rientjes
    Acked-by: Michal Hocko
    Cc: "Rafael J. Wysocki"
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Since commit 01cefaef40c4 ("mm: provide more accurate estimation of
    pages occupied by memmap"), the pages for the highmem zones' memmap are
    allocated from lowmem, so there is no need to reserve memmap pages for
    highmem.

    A 2G DDR3 ARM platform, before this change:
    On node 0 totalpages: 524288
    free_area_init_node: node 0, pgdat 80ccd380, node_mem_map 80d38000
    DMA zone: 3568 pages used for memmap
    DMA zone: 0 pages reserved
    DMA zone: 456704 pages, LIFO batch:31
    HighMem zone: 528 pages used for memmap
    HighMem zone: 67584 pages, LIFO batch:15

    After this change:
    On node 0 totalpages: 524288
    free_area_init_node: node 0, pgdat 80cd6f40, node_mem_map 80d42000
    DMA zone: 3568 pages used for memmap
    DMA zone: 0 pages reserved
    DMA zone: 456704 pages, LIFO batch:31
    HighMem zone: 67584 pages, LIFO batch:15

    Signed-off-by: Hongbo Zhong
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zhong Hongbo
     
  • Page migration's __unmap_and_move(), and rmap's try_to_unmap(), were
    created for use on pages almost certainly mapped into userspace. But
    nowadays compaction often applies them to unmapped page cache pages: which
    may exacerbate contention on i_mmap_rwsem quite unnecessarily, since
    try_to_unmap_file() makes no preliminary page_mapped() check.

    Now check page_mapped() in __unmap_and_move(); and avoid repeating the
    same overhead in rmap_walk_file() - don't remove_migration_ptes() when we
    never inserted any.
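
    A sketch of that check as described, simplified from __unmap_and_move()
    (the TTU flags shown are the usual migration ones and may not match the
    hunk exactly):

    int page_was_mapped = 0;

    if (page_mapped(page)) {
            try_to_unmap(page,
                    TTU_MIGRATION | TTU_IGNORE_MLOCK | TTU_IGNORE_ACCESS);
            page_was_mapped = 1;
    }
    /* remove_migration_ptes() is later called only if page_was_mapped. */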

    (The PageAnon(page) comment blocks now look even sillier than before, but
    clean that up on some other occasion. And note in passing that
    try_to_unmap_one() does not use a migration entry when PageSwapCache, so
    remove_migration_ptes() will then not update that swap entry to newpage
    pte: not a big deal, but something else to clean up later.)

    Davidlohr remarked in "mm,fs: introduce helpers around the i_mmap_mutex"
    conversion to i_mmap_rwsem, that "The biggest winner of these changes is
    migration": a part of the reason might be all of that unnecessary taking
    of i_mmap_mutex in page migration; and it's rather a shame that I didn't
    get around to sending this patch in before his - this one is much less
    useful after Davidlohr's conversion to rwsem, but still good.

    Signed-off-by: Hugh Dickins
    Cc: Davidlohr Bueso
    Cc: Rik van Riel
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • The slab shrinkers are currently invoked from the zonelist walkers in
    kswapd, direct reclaim, and zone reclaim, all of which roughly gauge the
    eligible LRU pages and assemble a nodemask to pass to NUMA-aware
    shrinkers, which then again have to walk over the nodemask. This is
    redundant code, extra runtime work, and fairly inaccurate when it comes to
    the estimation of actually scannable LRU pages. The code duplication will
    only get worse when making the shrinkers cgroup-aware and requiring them
    to have out-of-band cgroup hierarchy walks as well.

    Instead, invoke the shrinkers from shrink_zone(), which is where all
    reclaimers end up, to avoid this duplication.

    Take the count for eligible LRU pages out of get_scan_count(), which
    considers many more factors than just the availability of swap space, like
    zone_reclaimable_pages() currently does. Accumulate the number over all
    visited lruvecs to get the per-zone value.

    Some nodes have multiple zones due to memory addressing restrictions. To
    avoid putting too much pressure on the shrinkers, only invoke them once
    for each such node, using the class zone of the allocation as the pivot
    zone.

    For now, this integrates the slab shrinking better into the reclaim logic
    and gets rid of duplicative invocations from kswapd, direct reclaim, and
    zone reclaim. It also prepares for cgroup-awareness, allowing
    memcg-capable shrinkers to be added at the lruvec level without much
    duplication of both code and runtime work.

    This changes kswapd behavior, which used to invoke the shrinkers for each
    zone, but with scan ratios gathered from the entire node, resulting in
    meaningless pressure quantities on multi-zone nodes.

    Zone reclaim behavior also changes. It used to shrink slabs until the
    same amount of pages were shrunk as were reclaimed from the LRUs. Now it
    merely invokes the shrinkers once with the zone's scan ratio, which makes
    the shrinkers go easier on caches that implement aging and would prefer
    feeding back pressure from recently used slab objects to unused LRU pages.

    [vdavydov@parallels.com: assure class zone is populated]
    Signed-off-by: Johannes Weiner
    Cc: Dave Chinner
    Signed-off-by: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • These flushes deal with sequence number overflows, such as for long lived
    threads. These are rare, but interesting from a debugging PoV. As such,
    display the number of flushes when vmacache debugging is enabled.

    Signed-off-by: Davidlohr Bueso
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
  • The extended memory that stores page owner information is initialized
    some time after the page allocator starts. Until that initialization,
    many pages can be allocated and they have no owner information. This
    makes debugging with page owner harder, so some fixup is helpful.

    This patch fixes up the situation by setting fake owner information
    immediately after page extension is initialized. The information
    doesn't identify the right owner, but at least it tells more correctly
    whether a page is allocated or not.

    In my testing, this patch catches 13343 early allocated pages, although
    they are mostly allocated by the page extension feature itself. After
    that, there is no allocated page left without the page owner flag.

    Signed-off-by: Joonsoo Kim
    Cc: Mel Gorman
    Cc: Johannes Weiner
    Cc: Minchan Kim
    Cc: Dave Hansen
    Cc: Michal Nazarewicz
    Cc: Jungsoo Son
    Cc: Ingo Molnar
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • This is the page owner tracking code, which was introduced quite a
    while ago. It has been resident in Andrew's tree, but nobody tried to
    upstream it, so it remained as is. Our company actively uses this
    feature to debug memory leaks and to find memory hoggers, so I decided
    to upstream it.

    This functionality helps us find out who allocated a page. When
    allocating a page, we store some information about the allocation in
    extra memory. Later, if we need to know the status of all pages, we can
    get and analyze it from this stored information.

    In the previous version of this feature, the extra memory was
    statically defined in struct page, but in this version it is allocated
    outside of struct page. This enables us to turn the feature on or off
    at boot time without considerable memory waste.

    Although we already have tracepoints for tracing page allocation/free,
    using them to analyze page ownership is rather complex. We would need
    to enlarge the trace buffer to prevent it from wrapping before the
    userspace program is launched, and the launched program would have to
    continually dump out the trace buffer for later analysis, which is more
    likely to change system behaviour than just keeping the information in
    memory, so it is bad for debugging.

    Moreover, we can use the page_owner feature for various further
    purposes. For example, it is used for the fragmentation statistics
    implemented in this patch, and I also plan to implement a CMA failure
    debugging feature using this interface.

    I'd like to give credit to all the developers who contributed to this
    feature, but that's not easy because I don't know the exact history.
    Sorry about that. Below are the people who have "Signed-off-by" in the
    patches in Andrew's tree.

    Contributor:
    Alexander Nyberg
    Mel Gorman
    Dave Hansen
    Minchan Kim
    Michal Nazarewicz
    Andrew Morton
    Jungsoo Son

    Signed-off-by: Joonsoo Kim
    Cc: Mel Gorman
    Cc: Johannes Weiner
    Cc: Minchan Kim
    Cc: Dave Hansen
    Cc: Michal Nazarewicz
    Cc: Jungsoo Son
    Cc: Ingo Molnar
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • do_mmap_private() in nommu.c tries, in some cases, to allocate
    physically contiguous pages of arbitrary size, and we now have a good
    abstraction that does exactly that: alloc_pages_exact(). So, change the
    code to use it.

    There is no functional change. This is a preparation step for
    supporting the page owner feature accurately.
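
    Roughly (a sketch, not the full do_mmap_private() hunk):

    void *base;

    /* physically contiguous memory of exactly 'len' bytes */
    base = alloc_pages_exact(len, GFP_KERNEL);
    if (!base)
            return -ENOMEM;     /* or whatever the caller's error path is */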

    Signed-off-by: Joonsoo Kim
    Cc: Mel Gorman
    Cc: Johannes Weiner
    Cc: Minchan Kim
    Cc: Dave Hansen
    Cc: Michal Nazarewicz
    Cc: Jungsoo Son
    Cc: Ingo Molnar
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • Now we have prepared the ground for avoiding debug-pagealloc at boot
    time. So introduce a new kernel parameter to disable debug-pagealloc at
    boot time, and make the related functions be disabled in that case.

    The only non-intuitive part is the change to the guard page functions.
    Because guard pages are only effective when debug-pagealloc is enabled,
    turning them off along with debug-pagealloc is the reasonable thing to
    do.
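
    A minimal sketch of such a boot-time switch (the parameter name, its
    polarity and the helper are illustrative assumptions, not necessarily
    what the patch uses):

    static bool _debug_pagealloc_enabled __read_mostly;

    static int __init early_debug_pagealloc(char *buf)
    {
            if (buf && !strcmp(buf, "on"))
                    _debug_pagealloc_enabled = true;
            return 0;
    }
    early_param("debug_pagealloc", early_debug_pagealloc);

    static inline bool debug_pagealloc_enabled(void)
    {
            return _debug_pagealloc_enabled;
    }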

    Signed-off-by: Joonsoo Kim
    Cc: Mel Gorman
    Cc: Johannes Weiner
    Cc: Minchan Kim
    Cc: Dave Hansen
    Cc: Michal Nazarewicz
    Cc: Jungsoo Son
    Cc: Ingo Molnar
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • Until now, debug-pagealloc has needed extra flags in struct page, so we
    had to recompile the whole source tree whenever we decided to use it.
    This is really painful, because it takes some time to recompile and
    sometimes a rebuild is not possible due to third-party modules that
    depend on struct page. So we can't use this good feature in many cases.

    Now we have the page extension feature, which allows us to keep extra
    flags outside of struct page. This gets rid of the third-party module
    issue mentioned above, and it allows us to determine at boot time
    whether we need the extra memory for this page extension. With these
    properties, a kernel built with CONFIG_DEBUG_PAGEALLOC can avoid
    enabling debug-pagealloc at boot with low computational overhead. This
    will help our development process greatly.

    This patch is the preparation step to achieve the above goal.
    debug-pagealloc originally uses an extra field of struct page, but
    after this patch it will use a field of struct page_ext. Because the
    memory for page_ext is allocated later than the initialization of the
    page allocator under CONFIG_SPARSEMEM, we must disable the
    debug-pagealloc feature temporarily until page_ext is initialized.
    This patch implements that.

    Signed-off-by: Joonsoo Kim
    Cc: Mel Gorman
    Cc: Johannes Weiner
    Cc: Minchan Kim
    Cc: Dave Hansen
    Cc: Michal Nazarewicz
    Cc: Jungsoo Son
    Cc: Ingo Molnar
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim