03 Apr, 2012

1 commit

  • commit 1a5a9906d4e8d1976b701f889d8f35d54b928f25 upstream.

    In some cases pmd_none_or_clear_bad() is called with the mmap_sem held
    in read mode. In those cases the huge page faults can allocate hugepmds
    under pmd_none_or_clear_bad(), and that can trigger a false positive
    from pmd_bad(), which does not expect to see a pmd materializing as
    trans huge.

    It's not khugepaged that causes the problem: khugepaged holds the
    mmap_sem in write mode (and all those sites must hold the mmap_sem in
    read mode to prevent page tables from going away from under them;
    during code review it seems vm86 mode on 32-bit kernels requires that
    too, unless it's restricted to one thread per process or UP builds).
    The race is only with the huge page faults that can convert a
    pmd_none() into a pmd_trans_huge().

    Effectively all these pmd_none_or_clear_bad() sites running with
    mmap_sem in read mode are somewhat speculative with the page faults, and
    the result is always undefined when they run simultaneously. This is
    probably why it wasn't common to run into this. For example if the
    madvise(MADV_DONTNEED) runs zap_page_range() shortly before the page
    fault, the hugepage will not be zapped, if the page fault runs first it
    will be zapped.

    Altering pmd_bad() not to error out if it finds hugepmds won't be enough
    to fix this, because zap_pmd_range would then proceed to call
    zap_pte_range (which would be incorrect if the pmd became a
    pmd_trans_huge()).

    The simplest way to fix this is to read the pmd into the local stack
    (regardless of what we read; no CPU barriers are needed, only a
    compiler barrier), and be sure it is not changing under the code that
    tests its value. Even if the real pmd is changing under the value we
    hold on the stack, we don't care. If we actually end up in
    zap_pte_range it means the pmd was not none already and it was not
    huge, and it can't become huge from under us (khugepaged locking
    explained above).

    All we need is to enforce that there is no way anymore, in a code path
    like the one below, for pmd_trans_huge to return false while
    pmd_none_or_clear_bad then runs into a hugepmd. The overhead of a
    barrier() is just a compiler tweak and should not be measurable (I only
    added it for THP builds). I don't exclude that different compiler
    versions may have prevented the race too, by caching the value of *pmd
    on the stack (that hasn't been verified, but it wouldn't be impossible,
    considering that pmd_none_or_clear_bad, pmd_bad, pmd_trans_huge and
    pmd_none are all inlines and there's no external function called in
    between pmd_trans_huge and pmd_none_or_clear_bad).

    if (pmd_trans_huge(*pmd)) {
            if (next-addr != HPAGE_PMD_SIZE) {
                    VM_BUG_ON(!rwsem_is_locked(&tlb->mm->mmap_sem));
                    split_huge_page_pmd(vma->vm_mm, pmd);
            } else if (zap_huge_pmd(tlb, vma, pmd, addr))
                    continue;
            /* fall through */
    }
    if (pmd_none_or_clear_bad(pmd))
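
    A minimal sketch of the helper this approach suggests, assuming a
    THP-enabled build; it follows the commit text (snapshot *pmd into a
    local variable, compiler barrier only) rather than reproducing the
    exact upstream code:

        /*
         * Sketch: take a local snapshot of *pmd and only ever test the
         * snapshot. barrier() keeps the compiler from re-reading *pmd, so
         * a huge page fault materializing a hugepmd underneath us cannot
         * make pmd_bad() fire on a trans huge pmd.
         */
        static inline int pmd_none_or_trans_huge_or_clear_bad(pmd_t *pmd)
        {
                pmd_t pmdval = *pmd;

        #ifdef CONFIG_TRANSPARENT_HUGEPAGE
                barrier();      /* stabilize pmdval on the stack */
        #endif
                if (pmd_none(pmdval) || pmd_trans_huge(pmdval))
                        return 1;
                if (unlikely(pmd_bad(pmdval))) {
                        pmd_clear_bad(pmd);
                        return 1;
                }
                return 0;
        }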

    Because this race condition could be exercised without special
    privileges, it was reported as CVE-2012-1179.

    The race was identified and fully explained by Ulrich who debugged it.
    I'm quoting his accurate explanation below, for reference.

    ====== start quote =======
    mapcount 0 page_mapcount 1
    kernel BUG at mm/huge_memory.c:1384!

    At some point prior to the panic, a "bad pmd ..." message similar to the
    following is logged on the console:

    mm/memory.c:145: bad pmd ffff8800376e1f98(80000000314000e7).

    The "bad pmd ..." message is logged by pmd_clear_bad() before it clears
    the page's PMD table entry.

    143 void pmd_clear_bad(pmd_t *pmd)
    144 {
    -> 145 pmd_ERROR(*pmd);
    146 pmd_clear(pmd);
    147 }

    After the PMD table entry has been cleared, there is an inconsistency
    between the actual number of PMD table entries that are mapping the page
    and the page's map count (_mapcount field in struct page). When the page
    is subsequently reclaimed, __split_huge_page() detects this inconsistency.

    1381 if (mapcount != page_mapcount(page))
    1382 printk(KERN_ERR "mapcount %d page_mapcount %d\n",
    1383 mapcount, page_mapcount(page));
    -> 1384 BUG_ON(mapcount != page_mapcount(page));

    The root cause of the problem is a race of two threads in a multithreaded
    process. Thread B incurs a page fault on a virtual address that has never
    been accessed (PMD entry is zero) while Thread A is executing an madvise()
    system call on a virtual address within the same 2 MB (huge page) range.

                 virtual address space
                .---------------------.
                |                     |
                |                     |
              .-|---------------------|
              | |                     |
              | |                     |<-- B(fault)
        2 MB  | |/////////////////////|-.
        huge <  |/////////////////////|  > A(range)
        page  | |/////////////////////|-'
              | |                     |
              | |                     |
              '-|---------------------|
                |                     |
                |                     |
                '---------------------'

    - Thread A is executing an madvise(..., MADV_DONTNEED) system call
    on the virtual address range "A(range)" shown in the picture.

    sys_madvise
      // Acquire the semaphore in shared mode.
      down_read(&current->mm->mmap_sem)
      ...
      madvise_vma
        switch (behavior)
        case MADV_DONTNEED:
          madvise_dontneed
            zap_page_range
              unmap_vmas
                unmap_page_range
                  zap_pud_range
                    zap_pmd_range
                      //
                      // Assume that this huge page has never been accessed.
                      // I.e. content of the PMD entry is zero (not mapped).
                      //
                      if (pmd_trans_huge(*pmd)) {
                          // We don't get here due to the above assumption.
                      }
                      //
                      // Assume that Thread B incurred a page fault and
        .-----------> // sneaks in here as shown below.
        |             //
        |             if (pmd_none_or_clear_bad(pmd))
        |               {
        |                 if (unlikely(pmd_bad(*pmd)))
        |                   pmd_clear_bad
        |                   {
        |                     pmd_ERROR
        |                       // Log "bad pmd ..." message here.
        |                     pmd_clear
        |                       // Clear the page's PMD entry.
        |                       // Thread B incremented the map count
        |                       // in page_add_new_anon_rmap(), but
        |                       // now the page is no longer mapped
        |                       // by a PMD entry (-> inconsistency).
        |                   }
        |               }
        |
        v
    - Thread B is handling a page fault on virtual address "B(fault)" shown
    in the picture.

    ...
    do_page_fault
      __do_page_fault
        // Acquire the semaphore in shared mode.
        down_read_trylock(&mm->mmap_sem)
        ...
        handle_mm_fault
          if (pmd_none(*pmd) && transparent_hugepage_enabled(vma))
            // We get here due to the above assumption (PMD entry is zero).
            do_huge_pmd_anonymous_page
              alloc_hugepage_vma
                // Allocate a new transparent huge page here.
              ...
              __do_huge_pmd_anonymous_page
                ...
                spin_lock(&mm->page_table_lock)
                ...
                page_add_new_anon_rmap
                  // Here we increment the page's map count (starts at -1).
                  atomic_set(&page->_mapcount, 0)
                set_pmd_at
                  // Here we set the page's PMD entry which will be cleared
                  // when Thread A calls pmd_clear_bad().
                ...
                spin_unlock(&mm->page_table_lock)

    The mmap_sem does not prevent the race because both threads are acquiring
    it in shared mode (down_read). Thread B holds the page_table_lock while
    the page's map count and PMD table entry are updated. However, Thread A
    does not synchronize on that lock.

    ====== end quote =======

    [akpm@linux-foundation.org: checkpatch fixes]
    Reported-by: Ulrich Obergfell
    Signed-off-by: Andrea Arcangeli
    Acked-by: Johannes Weiner
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Cc: Dave Jones
    Acked-by: Larry Woodman
    Acked-by: Rik van Riel
    Cc: Mark Salter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Andrea Arcangeli
     

13 Mar, 2012

1 commit

  • commit 371528caec553785c37f73fa3926ea0de84f986f upstream.

    There is an issue when memcg unregisters events that were attached to
    the same eventfd:

    - On the first call mem_cgroup_usage_unregister_event() removes all
    events attached to a given eventfd, and if there were no events left,
    thresholds->primary would become NULL;

    - Since there were several events registered, the cgroups core will
    call mem_cgroup_usage_unregister_event() again, but now the kernel
    will oops, as the function doesn't expect that thresholds->primary may
    be NULL.

    It's a fair question whether mem_cgroup_usage_unregister_event()
    should actually remove all events in one go, but nowadays it can't
    do any better, as the cftype->unregister_event callback doesn't pass
    any private event-associated cookie. So, let's fix the issue by
    simply checking for thresholds->primary.
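
    A minimal sketch of that check, assuming the surrounding function
    shape implied by the commit text (a thresholds pointer selected
    earlier and an unlock exit path; the 'unlock' label is an assumption),
    not the verbatim upstream hunk:

        /*
         * A previous call for the same eventfd may already have removed
         * every registered event, leaving the threshold array empty.
         */
        if (!thresholds->primary)
                goto unlock;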

    FWIW, w/o the patch the following oops may be observed:

    BUG: unable to handle kernel NULL pointer dereference at 0000000000000004
    IP: [] mem_cgroup_usage_unregister_event+0x9c/0x1f0
    Pid: 574, comm: kworker/0:2 Not tainted 3.3.0-rc4+ #9 Bochs Bochs
    RIP: 0010:[] [] mem_cgroup_usage_unregister_event+0x9c/0x1f0
    RSP: 0018:ffff88001d0b9d60 EFLAGS: 00010246
    Process kworker/0:2 (pid: 574, threadinfo ffff88001d0b8000, task ffff88001de91cc0)
    Call Trace:
    [] cgroup_event_remove+0x2b/0x60
    [] process_one_work+0x174/0x450
    [] worker_thread+0x123/0x2d0

    Signed-off-by: Anton Vorontsov
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Kirill A. Shutemov
    Cc: Michal Hocko
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Anton Vorontsov
     

26 Jan, 2012

1 commit

  • commit ab936cbcd02072a34b60d268f94440fd5cf1970b upstream.

    Commit ef6a3c6311 ("mm: add replace_page_cache_page() function") added a
    function replace_page_cache_page(). This function replaces a page in
    the radix-tree with a new page. When doing this, the memory cgroup
    needs to fix up the accounting information: memcg needs to check the
    PCG_USED bit, etc.

    In some (many?) cases, 'newpage' is already on the LRU before calling
    replace_page_cache(). So, memcg's LRU accounting information should be
    fixed, too.

    This patch adds mem_cgroup_replace_page_cache() and removes the old
    hooks. In that function, the old page will be unaccounted without
    touching res_counter and the new page will be accounted to the memcg
    (of the old page). When overwriting pc->mem_cgroup of newpage, take
    zone->lru_lock and avoid races with LRU handling.

    Background:
    replace_page_cache_page() is called by FUSE code in its splice()
    handling. Here, 'newpage' is replacing oldpage, but this newpage is not
    a newly allocated page and may be on the LRU. LRU mis-accounting will
    be critical for the memory cgroup because rmdir() checks that the whole
    LRU is empty and that there is no account leak. If a page is on a
    different LRU than it should be, rmdir() will fail.

    This bug was added in March 2011, but there has been no bug report yet.
    I guess there are not many people who use memcg and FUSE at the same
    time with upstream kernels.

    The result of this bug is that the admin cannot destroy a memcg because
    of the account leak. So, no panic, no deadlock. And, even if an active
    cgroup exists, umount can succeed. So there is no problem at shutdown.

    Signed-off-by: KAMEZAWA Hiroyuki
    Acked-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Miklos Szeredi
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    KAMEZAWA Hiroyuki
     

21 Dec, 2011

1 commit

  • If the request is to create a non-root group and we fail to meet it,
    we should leave the root unchanged.

    Signed-off-by: Hillf Danton
    Acked-by: Hugh Dickins
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Michal Hocko
    Cc: Balbir Singh
    Cc: David Rientjes
    Cc: Andrea Arcangeli
    Cc: Johannes Weiner
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hillf Danton
     

07 Nov, 2011

1 commit

  • * 'modsplit-Oct31_2011' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux: (230 commits)
    Revert "tracing: Include module.h in define_trace.h"
    irq: don't put module.h into irq.h for tracking irqgen modules.
    bluetooth: macroize two small inlines to avoid module.h
    ip_vs.h: fix implicit use of module_get/module_put from module.h
    nf_conntrack.h: fix up fallout from implicit moduleparam.h presence
    include: replace linux/module.h with "struct module" wherever possible
    include: convert various register fcns to macros to avoid include chaining
    crypto.h: remove unused crypto_tfm_alg_modname() inline
    uwb.h: fix implicit use of asm/page.h for PAGE_SIZE
    pm_runtime.h: explicitly requires notifier.h
    linux/dmaengine.h: fix implicit use of bitmap.h and asm/page.h
    miscdevice.h: fix up implicit use of lists and types
    stop_machine.h: fix implicit use of smp.h for smp_processor_id
    of: fix implicit use of errno.h in include/linux/of.h
    of_platform.h: delete needless include
    acpi: remove module.h include from platform/aclinux.h
    miscdevice.h: delete unnecessary inclusion of module.h
    device_cgroup.h: delete needless include
    net: sch_generic remove redundant use of
    net: inet_timewait_sock doesnt need
    ...

    Fix up trivial conflicts (other header files, and removal of the ab3550 mfd driver) in
    - drivers/media/dvb/frontends/dibx000_common.c
    - drivers/media/video/{mt9m111.c,ov6650.c}
    - drivers/mfd/ab3550-core.c
    - include/linux/dmaengine.h

    Linus Torvalds
     

03 Nov, 2011

6 commits

  • Various code in memcontrol.c performs calculations with this_cpu_read()
    across two different percpu variables, or does an open-coded
    read-modify-write on a single percpu variable.

    Disable preemption throughout these operations so that the writes go to
    the correct places.
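
    A sketch of the pattern the fix applies, using the generic percpu API
    rather than the exact memcontrol.c hunk; the per-cpu variables and the
    1024 threshold below are purely illustrative:

        #include <linux/percpu.h>
        #include <linux/preempt.h>

        static DEFINE_PER_CPU(unsigned long, nr_events);    /* illustrative */
        static DEFINE_PER_CPU(unsigned long, next_target);  /* illustrative */

        static bool check_event_threshold(void)
        {
                bool over;

                /*
                 * Reading two per-cpu variables (or doing an open-coded
                 * read-modify-write on one) must not be split across CPUs
                 * by preemption, so keep the whole computation on one CPU.
                 */
                preempt_disable();
                over = __this_cpu_read(nr_events) > __this_cpu_read(next_target);
                if (over)
                        __this_cpu_write(next_target,
                                         __this_cpu_read(nr_events) + 1024);
                preempt_enable();

                return over;
        }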

    [hannes@cmpxchg.org: added this_cpu to __this_cpu conversion]
    Signed-off-by: Johannes Weiner
    Signed-off-by: Steven Rostedt
    Cc: Greg Thelen
    Cc: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Cc: Daisuke Nishimura
    Cc: Thomas Gleixner
    Cc: Peter Zijlstra
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Steven Rostedt
     
  • There is a potential race between a thread charging a page and another
    thread putting it back to the LRU list:

    charge:                          putback:
    SetPageCgroupUsed                SetPageLRU
    PageLRU && add to memcg LRU      PageCgroupUsed && add to memcg LRU

    The order of setting one flag and checking the other is crucial;
    otherwise the charge may observe !PageLRU while the putback observes
    !PageCgroupUsed, and the page is not linked to the memcg LRU at all.

    Global memory pressure may fix this by trying to isolate and put back
    the page for reclaim, where that putback would link it to the memcg LRU
    again. Without that, the memory cgroup is undeletable due to a charge
    whose physical page cannot be found and moved out.
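
    An illustration of the ordering invariant described above (not the
    upstream hunk; smp_mb() stands in for whatever synchronization the real
    code uses, and link_page_to_memcg_lru() is a hypothetical stand-in for
    the actual linking): each side publishes its own flag before testing
    the other side's, so at least one of them sees both flags set and links
    the page to the memcg LRU.

        /* charge side */
        SetPageCgroupUsed(pc);
        smp_mb();
        if (PageLRU(page))
                link_page_to_memcg_lru(page);   /* hypothetical stand-in */

        /* putback side */
        SetPageLRU(page);
        smp_mb();
        if (PageCgroupUsed(pc))
                link_page_to_memcg_lru(page);   /* hypothetical stand-in */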

    Signed-off-by: Johannes Weiner
    Cc: Ying Han
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Daisuke Nishimura
    Cc: Balbir Singh
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Reclaim decides to skip scanning an active list when the corresponding
    inactive list is above a certain size in comparison, so as to leave the
    assumed working set alone while there are still enough reclaim
    candidates around.

    The memcg implementation of comparing those lists instead reports whether
    the whole memcg is low on the requested type of inactive pages,
    considering all nodes and zones.

    This can lead to an oversized active list not being scanned because of the
    state of the other lists in the memcg, as well as an active list being
    scanned while its corresponding inactive list has enough pages.

    Not only is this wrong, it's also a scalability hazard, because the global
    memory state over all nodes and zones has to be gathered for each memcg
    and zone scanned.

    Make these calculations purely based on the size of the two LRU lists
    that are actually affected by the outcome of the decision.
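
    A sketch of what such a zone-local check could look like, following the
    commit text; the signature of mem_cgroup_zone_nr_lru_pages() and the
    calc_inactive_ratio() helper are assumptions here, not the verbatim
    upstream code:

        static int mem_cgroup_inactive_anon_is_low(struct mem_cgroup *memcg,
                                                   struct zone *zone)
        {
                int nid = zone_to_nid(zone);
                int zid = zone_idx(zone);
                unsigned long inactive, active, inactive_ratio;

                /* Only the two lists the decision actually affects. */
                inactive = mem_cgroup_zone_nr_lru_pages(memcg, nid, zid,
                                                        BIT(LRU_INACTIVE_ANON));
                active = mem_cgroup_zone_nr_lru_pages(memcg, nid, zid,
                                                      BIT(LRU_ACTIVE_ANON));

                /* Ratio derived from the size of these lists (assumed helper). */
                inactive_ratio = calc_inactive_ratio(inactive + active);

                return inactive * inactive_ratio < active;
        }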

    Signed-off-by: Johannes Weiner
    Reviewed-by: Rik van Riel
    Cc: KOSAKI Motohiro
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Daisuke Nishimura
    Cc: Balbir Singh
    Reviewed-by: Minchan Kim
    Reviewed-by: Ying Han
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • If somebody is touching data too early, it might be easier to diagnose a
    problem when dereferencing NULL at mem->info.nodeinfo[node] than trying to
    understand why mem_cgroup_per_zone is [un|partly]initialized.

    Signed-off-by: Igor Mammedov
    Acked-by: Michal Hocko
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Igor Mammedov
     
  • Before calling schedule_timeout(), task state should be changed.
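
    The standard pattern the fix applies (a sketch; the timeout value is
    illustrative and the actual call site in memcontrol.c is not reproduced
    here):

        /*
         * Set the task state before calling schedule_timeout(); otherwise
         * the task stays runnable and does not actually sleep.
         */
        set_current_state(TASK_INTERRUPTIBLE);
        schedule_timeout(HZ / 10);      /* illustrative timeout */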

    Signed-off-by: KAMEZAWA Hiroyuki
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • The memcg code sometimes uses "struct mem_cgroup *mem" and sometimes uses
    "struct mem_cgroup *memcg". Rename all mem variables to memcg in source
    file.

    Signed-off-by: Raghavendra K T
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Raghavendra K T
     

01 Nov, 2011

1 commit

  • Change the ISOLATE_XXX macros to a bitwise isolate_mode_t type.
    Normally, macros aren't recommended, as they are type-unsafe and make
    debugging harder, since the symbol cannot be passed through to the
    debugger.

    Quote from Johannes
    " Hmm, it would probably be cleaner to fully convert the isolation mode
    into independent flags. INACTIVE, ACTIVE, BOTH is currently a
    tri-state among flags, which is a bit ugly."

    This patch moves the isolate mode definitions from swap.h to mmzone.h,
    which is needed by memcontrol.h.
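
    A sketch of the idea; the type name follows the commit, while the exact
    flag set and values here are illustrative rather than the upstream
    definitions:

        /* A sparse-checked bitwise type instead of bare integer macros. */
        typedef unsigned int __bitwise isolate_mode_t;

        #define ISOLATE_INACTIVE        ((__force isolate_mode_t)0x1)
        #define ISOLATE_ACTIVE          ((__force isolate_mode_t)0x2)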

    Signed-off-by: Minchan Kim
    Cc: Johannes Weiner
    Cc: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Michal Hocko
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     

15 Sep, 2011

1 commit

  • Revert the post-3.0 commit 82f9d486e59f5 ("memcg: add
    memory.vmscan_stat").

    The implementation of per-memcg reclaim statistics violates how memcg
    hierarchies usually behave: hierarchically.

    The reclaim statistics are accounted to child memcgs and the parent
    hitting the limit, but not to hierarchy levels in between. Usually,
    hierarchical statistics are perfectly recursive, with each level
    representing the sum of itself and all its children.

    Since this exports statistics to userspace, this may lead to confusion
    and problems with changing things after the release, so revert it now;
    we can try again later.

    Signed-off-by: Johannes Weiner
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Daisuke Nishimura
    Cc: Michal Hocko
    Cc: Ying Han
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

26 Aug, 2011

2 commits

  • Commit 79dfdaccd1d5 ("memcg: make oom_lock 0 and 1 based rather than
    counter") tried to oom lock the hierarchy and roll back upon
    encountering an already locked memcg.

    The code is confused when it comes to detecting a locked memcg, though,
    so it would fail and rollback after locking one memcg and encountering
    an unlocked second one.

    The result is that oom-locking hierarchies fails unconditionally and
    that every oom killer invocation simply goes to sleep on the oom
    waitqueue forever. The tasks practically hang forever without anyone
    intervening, possibly holding locks that trip up unrelated tasks, too.

    Signed-off-by: Johannes Weiner
    Acked-by: Michal Hocko
    Acked-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Commit d1a05b6973c7 ("memcg do not try to drain per-cpu caches without
    pages") added a drain_local_stock() call to a preemptible section.

    The draining task looks up the cpu-local stock twice to set the
    draining-flag, then to drain the stock and clear the flag again. If the
    task is migrated to a different CPU in between, noone will clear the
    flag on the first stock and it will be forever undrainable. Its charge
    can not be recovered and the cgroup can not be deleted anymore.

    Properly pin the task to the executing CPU while draining stocks.
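
    A sketch of that idea, reusing the memcg_stock percpu structure named
    by the surrounding commits (field names are assumptions); not the exact
    upstream hunk:

        /*
         * Pin ourselves to the executing CPU: the task that sets the
         * draining flag on the local stock must be the same one that
         * drains it and clears the flag again.
         */
        curcpu = get_cpu();
        for_each_online_cpu(cpu) {
                struct memcg_stock_pcp *stock = &per_cpu(memcg_stock, cpu);

                if (cpu == curcpu)
                        drain_local_stock(&stock->work);
                else
                        schedule_work_on(cpu, &stock->work);
        }
        put_cpu();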

    Signed-off-by: Johannes Weiner
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

10 Aug, 2011

1 commit

  • This reverts commit 8521fc50d433507a7cdc96bec280f9e5888a54cc.

    The patch incorrectly assumes that using atomic FLUSHING_CACHED_CHARGE
    bit operations is sufficient but that is not true. Johannes Weiner has
    reported a crash during parallel memory cgroup removal:

    BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
    IP: [] css_is_ancestor+0x20/0x70
    Oops: 0000 [#1] PREEMPT SMP
    Pid: 19677, comm: rmdir Tainted: G W 3.0.0-mm1-00188-gf38d32b #35 ECS MCP61M-M3/MCP61M-M3
    RIP: 0010:[] css_is_ancestor+0x20/0x70
    RSP: 0018:ffff880077b09c88 EFLAGS: 00010202
    Process rmdir (pid: 19677, threadinfo ffff880077b08000, task ffff8800781bb310)
    Call Trace:
    [] mem_cgroup_same_or_subtree+0x33/0x40
    [] drain_all_stock+0x11f/0x170
    [] mem_cgroup_force_empty+0x231/0x6d0
    [] mem_cgroup_pre_destroy+0x14/0x20
    [] cgroup_rmdir+0xb9/0x500
    [] vfs_rmdir+0x86/0xe0
    [] do_rmdir+0xfb/0x110
    [] sys_rmdir+0x16/0x20
    [] system_call_fastpath+0x16/0x1b

    We are crashing because we try to dereference the cached memcg when we
    are checking whether we should wait for draining on the cache. The
    cache has already been cleaned up, though.

    There is also a theoretical chance that the cached memcg gets freed
    between the time we test for FLUSHING_CACHED_CHARGE and the time we
    dereference it in mem_cgroup_same_or_subtree:

    CPU0 CPU1 CPU2
    mem=stock->cached
    stock->cached=NULL
    clear_bit
    test_and_set_bit
    test_bit() ...
    mem_cgroup_destroy
    use after free

    The percpu_charge_mutex protected against this race because sync
    draining is exclusive.

    It is safer to revert now and come up with a more parallel
    implementation later.

    Signed-off-by: Michal Hocko
    Reported-by: Johannes Weiner
    Acked-by: Johannes Weiner
    Acked-by: KAMEZAWA Hiroyuki
    Cc: stable@kernel.org
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

04 Aug, 2011

1 commit

  • Remove mem_cgroup_shmem_charge_fallback(): it was only required when we
    had to move swappage to filecache with GFP_NOWAIT.

    Remove the GFP_NOWAIT special case from mem_cgroup_cache_charge(), by
    moving its call out from shmem_add_to_page_cache() to two of thats three
    callers. But leave it doing mem_cgroup_uncharge_cache_page() on error:
    although asymmetrical, it's easier for all 3 callers to handle.

    These two changes would also be appropriate if anyone were to start
    using shmem_read_mapping_page_gfp() with GFP_NOWAIT.

    Remove mem_cgroup_get_shmem_target(): mc_handle_file_pte() can test
    radix_tree_exceptional_entry() to get what it needs for itself.

    Signed-off-by: Hugh Dickins
    Acked-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

27 Jul, 2011

10 commits

  • percpu_charge_mutex protects against multiple simultaneous per-cpu
    charge cache drainings, because we might otherwise end up with too many
    work items. At least this was the case until commit 26fe61684449
    ("memcg: fix percpu cached charge draining frequency"), when we
    introduced a more targeted draining for async mode.

    Now that sync draining is targeted as well, we can safely remove the
    mutex because we will not send more work than the current number of
    CPUs. FLUSHING_CACHED_CHARGE protects against sending the same work
    multiple times, and stock->nr_pages == 0 protects against pointlessly
    sending a work if there is obviously nothing to be done. This is of
    course racy, but we can live with it as the race window is really small
    (we would have to see FLUSHING_CACHED_CHARGE cleared while nr_pages was
    still non-zero).

    The only remaining place where we can race is synchronous mode, when we
    rely on the FLUSHING_CACHED_CHARGE test, which might have been set by
    another drainer on the same group; but we should wait in that case as
    well.

    Signed-off-by: Michal Hocko
    Cc: Balbir Singh
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • We check whether two given groups are the same or at least in the same
    subtree of a hierarchy in several places. Let's make a helper for it to
    make the code easier to read.
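
    A sketch of such a helper (the exact upstream definition may differ;
    css_is_ancestor() is the hierarchy primitive the memcg code uses for
    this check, as the oops backtrace quoted earlier in this log also
    shows):

        static bool mem_cgroup_same_or_subtree(struct mem_cgroup *root_memcg,
                                               struct mem_cgroup *memcg)
        {
                if (root_memcg == memcg)
                        return true;
                return css_is_ancestor(&memcg->css, &root_memcg->css);
        }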

    Signed-off-by: Michal Hocko
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Currently we have two ways to drain per-CPU caches for charges.
    drain_all_stock_sync will synchronously drain all caches, while
    drain_all_stock_async will asynchronously drain only those that refer
    to a given memory cgroup or its subtree in the hierarchy. Targeted
    async draining was introduced by 26fe6168 ("memcg: fix percpu cached
    charge draining frequency") to reduce the number of CPU workers.

    Sync draining is currently triggered only from mem_cgroup_force_empty,
    which is triggered only by userspace (mem_cgroup_force_empty_write) or
    when a cgroup is removed (mem_cgroup_pre_destroy). Although these are
    not usually frequent operations, it still makes some sense to do
    targeted draining there as well, especially if the box has many CPUs.

    This patch unifies both methods to use a single function
    (drain_all_stock) which relies on the original async implementation and
    just adds flush_work to wait on all caches that are still under work in
    the sync mode. We use the FLUSHING_CACHED_CHARGE bit check to prevent
    waiting on a work that we haven't triggered. Please note that both sync
    and async functions are currently protected by percpu_charge_mutex, so
    we cannot race with other drainers.
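
    A sketch of the sync-mode tail described above (struct and field names
    follow the neighbouring commits but are assumptions, not the verbatim
    upstream code):

        /*
         * Sync mode: wait only on works whose FLUSHING_CACHED_CHARGE bit
         * is set, i.e. works somebody actually scheduled, so we never
         * block on a work that was not triggered.
         */
        if (sync) {
                for_each_online_cpu(cpu) {
                        struct memcg_stock_pcp *stock = &per_cpu(memcg_stock, cpu);

                        if (test_bit(FLUSHING_CACHED_CHARGE, &stock->flags))
                                flush_work(&stock->work);
                }
        }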

    Signed-off-by: Michal Hocko
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • drain_all_stock_async tries to optimize the work to be done on the work
    queue by excluding any work for the current CPU, because it assumes
    that the context we are called from has already tried to charge from
    that cache and failed, so it must be empty already.

    While the assumption is correct, we can optimize it even more by
    checking the current number of pages in the cache. This will also avoid
    work on other CPUs with an empty stock.

    For the current CPU we can simply call drain_local_stock rather than
    deferring it to the work queue.

    [kamezawa.hiroyu@jp.fujitsu.com: use drain_local_stock for current CPU optimization]
    Signed-off-by: Michal Hocko
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • The commit log of 0ae5e89c60c9 ("memcg: count the soft_limit reclaim
    in...") says it adds scanning stats to the memory.stat file. But it
    doesn't, because we considered that we needed to reach a consensus for
    such new APIs.

    This patch is a trial to add memory.scan_stat. This shows
    - the number of scanned pages (total, anon, file)
    - the number of rotated pages (total, anon, file)
    - the number of freed pages (total, anon, file)
    - the elapsed time (including sleep/pause time)

    for both direct and soft reclaim.

    The biggest difference from Ying's original version is that this file
    can be reset by a write, as

    # echo 0 > ...../memory.scan_stat

    Example output is below. This is a result after a make -j 6 kernel
    build under a 300M limit.

    [kamezawa@bluextal ~]$ cat /cgroup/memory/A/memory.scan_stat
    [kamezawa@bluextal ~]$ cat /cgroup/memory/A/memory.vmscan_stat
    scanned_pages_by_limit 9471864
    scanned_anon_pages_by_limit 6640629
    scanned_file_pages_by_limit 2831235
    rotated_pages_by_limit 4243974
    rotated_anon_pages_by_limit 3971968
    rotated_file_pages_by_limit 272006
    freed_pages_by_limit 2318492
    freed_anon_pages_by_limit 962052
    freed_file_pages_by_limit 1356440
    elapsed_ns_by_limit 351386416101
    scanned_pages_by_system 0
    scanned_anon_pages_by_system 0
    scanned_file_pages_by_system 0
    rotated_pages_by_system 0
    rotated_anon_pages_by_system 0
    rotated_file_pages_by_system 0
    freed_pages_by_system 0
    freed_anon_pages_by_system 0
    freed_file_pages_by_system 0
    elapsed_ns_by_system 0
    scanned_pages_by_limit_under_hierarchy 9471864
    scanned_anon_pages_by_limit_under_hierarchy 6640629
    scanned_file_pages_by_limit_under_hierarchy 2831235
    rotated_pages_by_limit_under_hierarchy 4243974
    rotated_anon_pages_by_limit_under_hierarchy 3971968
    rotated_file_pages_by_limit_under_hierarchy 272006
    freed_pages_by_limit_under_hierarchy 2318492
    freed_anon_pages_by_limit_under_hierarchy 962052
    freed_file_pages_by_limit_under_hierarchy 1356440
    elapsed_ns_by_limit_under_hierarchy 351386416101
    scanned_pages_by_system_under_hierarchy 0
    scanned_anon_pages_by_system_under_hierarchy 0
    scanned_file_pages_by_system_under_hierarchy 0
    rotated_pages_by_system_under_hierarchy 0
    rotated_anon_pages_by_system_under_hierarchy 0
    rotated_file_pages_by_system_under_hierarchy 0
    freed_pages_by_system_under_hierarchy 0
    freed_anon_pages_by_system_under_hierarchy 0
    freed_file_pages_by_system_under_hierarchy 0
    elapsed_ns_by_system_under_hierarchy 0

    total_xxxx is for hierarchy management.

    This will be useful for further memcg development and needs to be
    developed before we do some complicated rework on LRU/softlimit
    management.

    This patch adds a new struct memcg_scanrecord into the scan_control
    struct. sc->nr_scanned et al. are not designed for exporting
    information. For example, nr_scanned is reset frequently and
    incremented by +2 when scanning mapped pages.

    To avoid complexity, I added a new parameter in scan_control which is
    for exporting the scanning score.

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Daisuke Nishimura
    Cc: Michal Hocko
    Cc: Ying Han
    Cc: Andrew Bresticker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Commit 22a668d7c3ef ("memcg: fix behavior under memory.limit equals to
    memsw.limit") introduced the "memsw_is_minimum" flag, which becomes
    true when mem_limit == memsw_limit. The flag is checked at the
    beginning of reclaim, and "noswap" is set if the flag is true, because
    using swap is meaningless in this case.

    This works well in most cases, but when we try to shrink mem_limit,
    which is the same as memsw_limit now, we might fail to shrink mem_limit
    because swap isn't used.

    This patch fixes this behavior by:
    - checking MEM_CGROUP_RECLAIM_SHRINK at the beginning of reclaim
    - if it is set, not setting the "noswap" flag even if memsw_is_minimum
      is true (sketched below).
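
    A sketch of that check (variable and flag names are taken from the
    commit text or assumed; this is not the verbatim upstream hunk):

        /*
         * Only force noswap when the caller is not trying to shrink the
         * limit itself.
         */
        if (!(reclaim_options & MEM_CGROUP_RECLAIM_SHRINK) &&
            root_memcg->memsw_is_minimum)
                noswap = true;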

    Signed-off-by: Daisuke Nishimura
    Cc: Balbir Singh
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Michal Hocko
    Cc: Ying Han
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daisuke Nishimura
     
  • memcg_oom_mutex is used to protect the memcg OOM path and the eventfd
    interface for oom_control. None of the critical sections which it
    protects sleep (eventfd_signal works from atomic context and the rest
    are simple linked-list and oom_lock atomic operations).

    A mutex is also too heavyweight for those code paths because it
    triggers a lot of scheduling. It also makes convoying effects more
    visible when we have a big number of OOM kills, because we take the
    lock multiple times during mem_cgroup_handle_oom, so we have multiple
    places where many processes can sleep.

    Signed-off-by: Michal Hocko
    Cc: Balbir Singh
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Commit 867578cb ("memcg: fix oom kill behavior") introduced an oom_lock
    counter which is incremented by mem_cgroup_oom_lock when we are about
    to handle a memcg OOM situation. mem_cgroup_handle_oom falls back to a
    sleep if oom_lock > 1 to prevent multiple oom kills at the same time.
    The counter is then decremented by mem_cgroup_oom_unlock, called from
    the same function.

    This works correctly but it can lead to serious starvations when we have
    many processes triggering OOM and many CPUs available for them (I have
    tested with 16 CPUs).

    Consider a process (call it A) which gets the oom_lock (the first one
    that got to mem_cgroup_handle_oom and grabbed memcg_oom_mutex) and other
    processes that are blocked on the mutex. While A releases the mutex and
    calls mem_cgroup_out_of_memory others will wake up (one after another)
    and increase the counter and fall into sleep (memcg_oom_waitq).

    Once A finishes mem_cgroup_out_of_memory it takes the mutex again and
    decreases oom_lock and wakes other tasks (if releasing memory by
    somebody else - e.g. killed process - hasn't done it yet).

    A testcase would look like:
    Assume malloc XXX is a program allocating XXX Megabytes of memory
    which touches all allocated pages in a tight loop
    # swapoff SWAP_DEVICE
    # cgcreate -g memory:A
    # cgset -r memory.oom_control=0 A
    # cgset -r memory.limit_in_bytes= 200M
    # for i in `seq 100`
    # do
    # cgexec -g memory:A malloc 10 &
    # done

    The main problem here is that all processes still race for the mutex and
    there is no guarantee that we will get counter back to 0 for those that
    got back to mem_cgroup_handle_oom. In the end the whole convoy
    in/decreases the counter but we do not get to 1 that would enable
    killing so nothing useful can be done. The time is basically unbounded
    because it highly depends on scheduling and ordering on mutex (I have
    seen this taking hours...).

    This patch replaces the counter by a simple {un}lock semantic. As
    mem_cgroup_oom_{un}lock works on a subtree of a hierarchy, we have to
    make sure that nobody else races with us, which is guaranteed by the
    memcg_oom_mutex.

    We have to be careful while locking subtrees because we can encounter a
    subtree which is already locked. Hierarchy:

          A
         / \
        B   \
       / \   \
      C   D   E

    B - C - D tree might be already locked. While we want to enable locking
    E subtree because OOM situations cannot influence each other we
    definitely do not want to allow locking A.

    Therefore we have to refuse lock if any subtree is already locked and
    clear up the lock for all nodes that have been set up to the failure
    point.

    On the other hand we have to make sure that the rest of the world will
    recognize that a group is under OOM even though it doesn't have a lock.
    Therefore we have to introduce an under_oom variable which is
    incremented and decremented for the whole subtree when we enter and
    leave mem_cgroup_handle_oom, respectively. under_oom, unlike oom_lock,
    doesn't need to be updated under memcg_oom_mutex because its users only
    check a single group and they use atomic operations for that.

    This can be checked easily by the following test case:

    # cgcreate -g memory:A
    # cgset -r memory.use_hierarchy=1 A
    # cgset -r memory.oom_control=1 A
    # cgset -r memory.limit_in_bytes= 100M
    # cgset -r memory.memsw.limit_in_bytes= 100M
    # cgcreate -g memory:A/B
    # cgset -r memory.oom_control=1 A/B
    # cgset -r memory.limit_in_bytes=20M
    # cgset -r memory.memsw.limit_in_bytes=20M
    # cgexec -g memory:A/B malloc 30 & #->this will be blocked by OOM of group B
    # cgexec -g memory:A malloc 80 & #->this will be blocked by OOM of group A

    While B gets oom_lock A will not get it. Both of them go into sleep and
    wait for an external action. We can make the limit higher for A to
    enforce waking it up

    # cgset -r memory.memsw.limit_in_bytes=300M A
    # cgset -r memory.limit_in_bytes=300M A

    malloc in A has to wake up even though it doesn't have oom_lock.

    Finally, the unlock path is very easy because we always unlock only the
    subtree we have locked previously while we always decrement under_oom.

    Signed-off-by: Michal Hocko
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • In mm/memcontrol.c, there are many lru stat functions as..

    mem_cgroup_zone_nr_lru_pages
    mem_cgroup_node_nr_file_lru_pages
    mem_cgroup_nr_file_lru_pages
    mem_cgroup_node_nr_anon_lru_pages
    mem_cgroup_nr_anon_lru_pages
    mem_cgroup_node_nr_unevictable_lru_pages
    mem_cgroup_nr_unevictable_lru_pages
    mem_cgroup_node_nr_lru_pages
    mem_cgroup_nr_lru_pages
    mem_cgroup_get_local_zonestat

    Some of them are under #ifdef MAX_NUMNODES >1 and others are not.
    This seems bad. This patch consolidates all functions into

    mem_cgroup_zone_nr_lru_pages()
    mem_cgroup_node_nr_lru_pages()
    mem_cgroup_nr_lru_pages()

    For these functions, "which LRU?" information is passed by a mask.

    example:
    mem_cgroup_nr_lru_pages(mem, BIT(LRU_ACTIVE_ANON))

    And I added some macro as ALL_LRU, ALL_LRU_FILE, ALL_LRU_ANON.

    example:
    mem_cgroup_nr_lru_pages(mem, ALL_LRU)

    BTW, considering the layout of NUMA memory placement of counters, this
    patch seems to be better.

    Now, when we gather all LRU information, we scan in the following
    order: for_each_lru -> for_each_node -> for_each_zone.

    This means we'll touch cache lines on different nodes in turn.

    After the patch, we'll scan
    for_each_node -> for_each_zone -> for_each_lru(mask).

    Then, we'll gather information from the same cacheline at once.

    [akpm@linux-foundation.org: fix warnings, build error]
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Daisuke Nishimura
    Cc: Balbir Singh
    Cc: Michal Hocko
    Cc: Ying Han
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Each memory cgroup has a 'swappiness' value which can be accessed by
    get_swappiness(memcg). The major user is try_to_free_mem_cgroup_pages()
    and swappiness is passed by argument. It's propagated by scan_control.

    get_swappiness() is a static function, but some planned updates will
    need to get swappiness from files other than memcontrol.c. This patch
    exports get_swappiness() as mem_cgroup_swappiness(). With this, we can
    remove the swappiness argument from try_to_free... and drop swappiness
    from scan_control. Only memcg uses it.

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Daisuke Nishimura
    Cc: Balbir Singh
    Cc: Michal Hocko
    Cc: Ying Han
    Cc: Shaohua Li
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     

09 Jul, 2011

2 commits

  • commit 889976dbcb12 ("memcg: reclaim memory from nodes in round-robin
    order") adds a NUMA node round-robin for memcg. But the information is
    updated once every 10 seconds.

    This patch changes the update trigger from jiffies to memcg's event count.
    After this patch, numa scan information will be updated when we see 1024
    events of pagein/pageout under a memcg.

    [akpm@linux-foundation.org: attempt to repair code layout]
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Michal Hocko
    Cc: Ying Han
    Cc: Johannes Weiner
    Cc: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Now, in mem_cgroup_hierarchical_reclaim(), mem_cgroup_local_usage() is
    used for checking whether the memcg contains reclaimable pages or not.
    If there are no pages in it, the routine skips it.

    But mem_cgroup_local_usage() includes unevictable pages and cannot
    handle the "noswap" condition correctly. This doesn't work on a
    swapless system.

    This patch adds test_mem_cgroup_reclaimable() and replaces
    mem_cgroup_local_usage(). test_mem_cgroup_reclaimable() looks at the
    LRU counters and returns the correct answer to the caller. And this new
    function has a "noswap" argument and can look only at the FILE LRUs if
    necessary.

    [akpm@linux-foundation.org: coding-style fixes]
    [akpm@linux-foundation.org: fix kerneldoc layout]
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Michal Hocko
    Cc: Ying Han
    Cc: Johannes Weiner
    Cc: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     

28 Jun, 2011

1 commit

  • Before adding any more global entry points into shmem.c, gather such
    prototypes into shmem_fs.h. Remove mm's own declarations from swap.h,
    but for now leave the ones in mm.h: because shmem_file_setup() and
    shmem_zero_setup() are called from various places, and we should not
    force other subsystems to update immediately.

    Signed-off-by: Hugh Dickins
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

16 Jun, 2011

5 commits

  • Based on Michal Hocko's comment.

    We are not draining per-cpu cached charges during soft limit reclaim
    because background reclaim doesn't care about charges. It tries to free
    some memory, and charges will not give it any.

    Cached charges might influence only the selection of the biggest soft
    limit offender, but as the call is done only after the selection has
    already been made, it makes no difference.

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Daisuke Nishimura
    Reviewed-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • For performance, the memory cgroup caches some "charge" from
    res_counter into a per-cpu cache. This works well, but because it's a
    cache, it needs to be flushed in some cases. Typical cases are

    1. when someone hits the limit.

    2. when rmdir() is called and charges need to be 0.

    But "1" has a problem.

    Recently, with large SMP machines, we see many kworker runs because of
    flushing memcg's cache. The bad thing in the implementation is that the
    drain code is called even if a cpu's cache belongs to a memcg not
    related to the memcg which hit the limit.

    This patch does
    A) check whether the percpu cache contains useful data or not.
    B) check that no other asynchronous percpu draining is running.
    C) don't call the local cpu callback.

    (*) This patch avoids changing the calling condition with hard-limit.

    When I run "cat 1Gfile > /dev/null" under 300M limit memcg,

    [Before]
    13767 kamezawa 20 0 98.6m 424 416 D 10.0 0.0 0:00.61 cat
    58 root 20 0 0 0 0 S 0.6 0.0 0:00.09 kworker/2:1
    60 root 20 0 0 0 0 S 0.6 0.0 0:00.08 kworker/4:1
    4 root 20 0 0 0 0 S 0.3 0.0 0:00.02 kworker/0:0
    57 root 20 0 0 0 0 S 0.3 0.0 0:00.05 kworker/1:1
    61 root 20 0 0 0 0 S 0.3 0.0 0:00.05 kworker/5:1
    62 root 20 0 0 0 0 S 0.3 0.0 0:00.05 kworker/6:1
    63 root 20 0 0 0 0 S 0.3 0.0 0:00.05 kworker/7:1

    [After]
    2676 root 20 0 98.6m 416 416 D 9.3 0.0 0:00.87 cat
    2626 kamezawa 20 0 15192 1312 920 R 0.3 0.0 0:00.28 top
    1 root 20 0 19384 1496 1204 S 0.0 0.0 0:00.66 init
    2 root 20 0 0 0 0 S 0.0 0.0 0:00.00 kthreadd
    3 root 20 0 0 0 0 S 0.0 0.0 0:00.00 ksoftirqd/0
    4 root 20 0 0 0 0 S 0.0 0.0 0:00.00 kworker/0:0

    [akpm@linux-foundation.org: make percpu_charge_mutex static, tweak comments]
    Signed-off-by: KAMEZAWA Hiroyuki
    Acked-by: Daisuke Nishimura
    Reviewed-by: Michal Hocko
    Tested-by: Ying Han
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Hierarchical reclaim doesn't swap out if the memsw and resource limits
    are the same (memsw_is_minimum == true), because we would hit the
    mem+swap limit anyway (during hard limit reclaim).

    If it comes to the soft limit, we shouldn't consider memsw_is_minimum
    at all, because it doesn't make much sense. Either the soft limit is
    below the hard limit, and then we cannot hit the mem+swap limit, or
    direct reclaim takes precedence.

    Signed-off-by: KAMEZAWA Hiroyuki
    Reviewed-by: Michal Hocko
    Acked-by: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Commit 406eb0c9ba76 ("memcg: add memory.numastat api for numa
    statistics") adds memory.numa_stat file for memory cgroup. But the file
    permissions are wrong.

    [kamezawa@bluextal linux-2.6]$ ls -l /cgroup/memory/A/memory.numa_stat
    ---------- 1 root root 0 Jun 9 18:36 /cgroup/memory/A/memory.numa_stat

    This patch fixes the permission as

    [root@bluextal kamezawa]# ls -l /cgroup/memory/A/memory.numa_stat
    -r--r--r-- 1 root root 0 Jun 10 16:49 /cgroup/memory/A/memory.numa_stat

    Signed-off-by: KAMEZAWA Hiroyuki
    Acked-by: Ying Han
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Currently, memcg reclaim can disable the swap token even if the swap
    token mm doesn't belong to its memory cgroup. That is slightly risky.
    If an admin creates a very small mem-cgroup and some silly guy runs a
    contentious, heavy memory pressure workload, every task is going to
    lose the swap token and then the system may become unresponsive. That's
    bad.

    This patch adds a 'memcg' parameter to disable_swap_token(), and if the
    parameter doesn't match the swap token, the VM doesn't disable it.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: KOSAKI Motohiro
    Reviewed-by: KAMEZAWA Hiroyuki
    Reviewed-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     

27 May, 2011

4 commits

  • Two new stats in per-memcg memory.stat track the number of page faults
    and the number of major page faults:

    "pgfault"
    "pgmajfault"

    They are different from the "pgpgin"/"pgpgout" stats, which count the
    number of pages charged/discharged to the cgroup and carry no meaning
    of reading/writing pages to disk.

    It is valuable to track the two stats both for measuring an
    application's performance and for the efficiency of the kernel page
    reclaim path. Counting page faults per process is useful, but we also
    need the aggregated value, since processes are monitored and controlled
    on a cgroup basis in memcg.
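
    A sketch of the kind of hook this adds in the fault paths (the memcg
    counter helper is named per the commit's intent and is an assumption
    here):

        /* Bump the global event counter and the mm's memcg counter. */
        count_vm_event(PGMAJFAULT);
        mem_cgroup_count_vm_event(mm, PGMAJFAULT);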

    Functional test: check the total number of pgfault/pgmajfault of all
    memcgs and compare with global vmstat value:

    $ cat /proc/vmstat | grep fault
    pgfault 1070751
    pgmajfault 553

    $ cat /dev/cgroup/memory.stat | grep fault
    pgfault 1071138
    pgmajfault 553
    total_pgfault 1071142
    total_pgmajfault 553

    $ cat /dev/cgroup/A/memory.stat | grep fault
    pgfault 199
    pgmajfault 0
    total_pgfault 199
    total_pgmajfault 0

    Performance test: run the page fault test (pft) with 16 threads,
    faulting in 15G of anon pages in a 16G container. No regression was
    noticed on "flt/cpu/s".

    Sample output from pft:

    TAG pft:anon-sys-default:
    Gb Thr CLine User System Wall flt/cpu/s fault/wsec
    15 16 1 0.67s 233.41s 14.76s 16798.546 266356.260

    +-------------------------------------------------------------------------+
    N Min Max Median Avg Stddev
    x 10 16682.962 17344.027 16913.524 16928.812 166.5362
    + 10 16695.568 16923.896 16820.604 16824.652 84.816568
    No difference proven at 95.0% confidence

    [akpm@linux-foundation.org: fix build]
    [hughd@google.com: shmem fix]
    Signed-off-by: Ying Han
    Acked-by: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Reviewed-by: Minchan Kim
    Cc: Daisuke Nishimura
    Acked-by: Balbir Singh
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ying Han
     
  • The new API exports numa_maps on a per-memcg basis. This is a piece of
    useful information: it exports the per-memcg page distribution across
    real NUMA nodes.

    One of the use cases is evaluating application performance by combining
    this information with the cpu allocation to the application.

    The output of memory.numa_stat tries to follow a format similar to
    numa_maps, like:

    total= N0= N1= ...
    file= N0= N1= ...
    anon= N0= N1= ...
    unevictable= N0= N1= ...

    And we have per-node:

    total = file + anon + unevictable

    $ cat /dev/cgroup/memory/memory.numa_stat
    total=250020 N0=87620 N1=52367 N2=45298 N3=64735
    file=225232 N0=83402 N1=46160 N2=40522 N3=55148
    anon=21053 N0=3424 N1=6207 N2=4776 N3=6646
    unevictable=3735 N0=794 N1=0 N2=0 N3=2941

    Signed-off-by: Ying Han
    Cc: Balbir Singh
    Cc: Daisuke Nishimura
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Daisuke Nishimura
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ying Han
     
  • The caller of the function has been renamed to zone_nr_lru_pages(), and
    this is just the corresponding fix-up in the memcg code. The current
    name is easily misread as the zone's total number of pages.

    Signed-off-by: Ying Han
    Acked-by: Johannes Weiner
    Acked-by: KAMEZAWA Hiroyuki
    Reviewed-by: Minchan Kim
    Cc: Balbir Singh
    Cc: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ying Han
     
  • If the memcg reclaim code detects the target memcg below its limit it
    exits and returns a guaranteed non-zero value so that the charge is
    retried.

    Nowadays, the charge side checks the memcg limit itself and does not rely
    on this non-zero return value trick.

    This patch removes it. The reclaim code will now always return the true
    number of pages it reclaimed on its own.

    Signed-off-by: Johannes Weiner
    Acked-by: Rik van Riel
    Acked-by: Ying Han
    Acked-by: KAMEZAWA Hiroyuki
    Reviewed-by: Michal Hocko
    Cc: Balbir Singh
    Cc: KOSAKI Motohiro
    Cc: Mel Gorman
    Cc: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner