17 Nov, 2012

2 commits

  • When MEMCG is configured on (even when it's disabled by boot option),
    when adding or removing a page to/from its lru list, the zone pointer
    used for stats updates is nowadays taken from the struct lruvec. (On
    many configurations, calculating zone from page is slower.)

    But we have no code to update all the lruvecs (per zone, per memcg) when
    a memory node is hotadded. Here's an extract from the oops which
    results when running numactl to bind a program to a newly onlined node:

    BUG: unable to handle kernel NULL pointer dereference at 0000000000000f60
    IP: __mod_zone_page_state+0x9/0x60
    Pid: 1219, comm: numactl Not tainted 3.6.0-rc5+ #180 Bochs Bochs
    Process numactl (pid: 1219, threadinfo ffff880039abc000, task ffff8800383c4ce0)
    Call Trace:
    __pagevec_lru_add_fn+0xdf/0x140
    pagevec_lru_move_fn+0xb1/0x100
    __pagevec_lru_add+0x1c/0x30
    lru_add_drain_cpu+0xa3/0x130
    lru_add_drain+0x2f/0x40
    ...

    The natural solution might be to use a memcg callback whenever memory is
    hotadded; but that solution has not been scoped out, and it happens that
    we do have an easy location at which to update lruvec->zone. The lruvec
    pointer is discovered either by mem_cgroup_zone_lruvec() or by
    mem_cgroup_page_lruvec(), and both of those do know the right zone.

    So check and set lruvec->zone in those; and remove the inadequate
    attempt to set lruvec->zone from lruvec_init(), which is called before
    NODE_DATA(node) has been allocated in such cases.
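
    A minimal sketch of that check-and-set, assuming the lruvec is looked
    up from the memcg's per-zone info (helper and field names here follow
    memcontrol.c of that era and should be treated as illustrative):

        struct lruvec *mem_cgroup_zone_lruvec(struct zone *zone,
                                              struct mem_cgroup *memcg)
        {
                struct mem_cgroup_per_zone *mz;
                struct lruvec *lruvec;

                if (mem_cgroup_disabled())
                        return &zone->lruvec;

                mz = mem_cgroup_zoneinfo(memcg, zone_to_nid(zone), zone_idx(zone));
                lruvec = &mz->lruvec;

                /*
                 * A node can be onlined after this memcg was created, so
                 * lruvec->zone may still be unset (or stale after offline
                 * and re-online): fix it up here, where the zone is known.
                 */
                if (unlikely(lruvec->zone != zone))
                        lruvec->zone = zone;
                return lruvec;
        }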

    Ah, there was one exception. For no particularly good reason,
    mem_cgroup_force_empty_list() has its own code for deciding lruvec.
    Change it to use the standard mem_cgroup_zone_lruvec() and
    mem_cgroup_get_lru_size() too. In fact it was already safe against such
    an oops (the lru lists in danger could only be empty), but we're better
    proofed against future changes this way.

    I've marked this for stable (3.6) since we introduced the problem in 3.5
    (now closed to stable); but I have no idea if this is the only fix
    needed to get memory hotadd working with memcg in 3.6, and received no
    answer when I enquired twice before.

    Reported-by: Tang Chen
    Signed-off-by: Hugh Dickins
    Acked-by: Johannes Weiner
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Konstantin Khlebnikov
    Cc: Wen Congyang
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • oom_badness() takes a totalpages argument which says how many pages are
    available, and it uses it as a base for the score calculation. The
    value is calculated by mem_cgroup_get_limit(), which considers both
    the limit and total_swap_pages (or the memsw portion of it,
    respectively).

    This is usually correct but since fe35004fbf9e ("mm: avoid swapping out
    with swappiness==0") we do not swap when swappiness is 0 which means
    that we cannot really use up all the totalpages pages. This in turn
    confuses oom score calculation if the memcg limit is much smaller than
    the available swap, because the used memory (capped by the limit) is
    negligible compared to totalpages, so the resulting score is too small
    if adj != 0 (typically a task with CAP_SYS_ADMIN or a non-zero
    oom_score_adj). A wrong process might be selected as a result.

    The problem can be worked around by checking mem_cgroup_swappiness==0
    and not considering swap at all in such a case.
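
    A sketch of that workaround in mem_cgroup_get_limit(), following the
    description above (res_counter accessors as in kernels of that era;
    the exact shape is illustrative):

        u64 mem_cgroup_get_limit(struct mem_cgroup *memcg)
        {
                u64 limit;

                limit = res_counter_read_u64(&memcg->res, RES_LIMIT);

                /* Do not consider swap space if we cannot swap (swappiness == 0). */
                if (mem_cgroup_swappiness(memcg)) {
                        u64 memsw;

                        limit += total_swap_pages << PAGE_SHIFT;
                        memsw = res_counter_read_u64(&memcg->memsw, RES_LIMIT);

                        /* The memsw limit caps memory and swap together. */
                        limit = min(limit, memsw);
                }

                return limit;
        }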

    Signed-off-by: Michal Hocko
    Acked-by: David Rientjes
    Acked-by: Johannes Weiner
    Acked-by: KOSAKI Motohiro
    Acked-by: KAMEZAWA Hiroyuki
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

09 Oct, 2012

2 commits

  • kmem code uses this function and it is better to not use forward
    declarations for static inline functions as some (older) compilers don't
    like it:

    gcc version 4.3.4 [gcc-4_3-branch revision 152973] (SUSE Linux)

    mm/memcontrol.c:421: warning: `mem_cgroup_is_root' declared inline after being called
    mm/memcontrol.c:421: warning: previous declaration of `mem_cgroup_is_root' was here
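
    For illustration only, the shape that trips older gcc (the caller here
    is hypothetical); the fix is simply to move the definition above its
    first caller and drop the forward declaration:

        static bool mem_cgroup_is_root(struct mem_cgroup *memcg);  /* forward declaration */

        static bool memcg_kmem_test(struct mem_cgroup *memcg)      /* hypothetical caller */
        {
                return !mem_cgroup_is_root(memcg);
        }

        /* ...much later in the file... */
        static inline bool mem_cgroup_is_root(struct mem_cgroup *memcg)
        {
                return (memcg == root_mem_cgroup);
        }
        /* gcc 4.3: "`mem_cgroup_is_root' declared inline after being called" */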

    Signed-off-by: Michal Hocko
    Cc: Glauber Costa
    Cc: Sachin Kamat
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • TCP kmem accounting is currently guarded by CONFIG_MEMCG_KMEM ifdefs but
    the code is not used if !CONFIG_INET so we should rather test for both.
    The same would apply to the net/sock.h, net/ip.h and
    net/tcp_memcontrol.h includes, but let's keep those outside of any
    ifdefs because that is considered safer with respect to future
    maintainability.

    Tested with
    - CONFIG_INET && CONFIG_MEMCG_KMEM
    - !CONFIG_INET && CONFIG_MEMCG_KMEM
    - CONFIG_INET && !CONFIG_MEMCG_KMEM
    - !CONFIG_INET && !CONFIG_MEMCG_KMEM
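
    The change amounts to tightening the preprocessor guard around the TCP
    accounting hooks, roughly:

        /* before */
        #ifdef CONFIG_MEMCG_KMEM
        /* TCP kmem accounting hooks ... */
        #endif

        /* after: compiled only when both options are enabled */
        #if defined(CONFIG_INET) && defined(CONFIG_MEMCG_KMEM)
        /* TCP kmem accounting hooks ... */
        #endif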

    Signed-off-by: Sachin Kamat
    Signed-off-by: Michal Hocko
    Cc: Glauber Costa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

15 Sep, 2012

1 commit

  • Currently, cgroup hierarchy support is a mess. cpu related subsystems
    behave correctly - configuration, accounting and control on a parent
    properly cover its children. blkio and freezer completely ignore
    hierarchy and treat all cgroups as if they're directly under the root
    cgroup. Others show yet different behaviors.

    These differing interpretations of cgroup hierarchy make cgroup
    confusing to use and make it impossible to co-mount controllers into
    the same hierarchy and obtain sane behavior.

    Eventually, we want full hierarchy support from all subsystems and
    probably a unified hierarchy. Having users depend on separate
    hierarchies that behave completely differently depending on the
    mounted subsystem is detrimental to making any progress on this
    front.

    This patch adds cgroup_subsys.broken_hierarchy and sets it to %true
    for controllers which are lacking in hierarchy support. The goal of
    this patch is two-fold.

    * Move users away from using hierarchy on currently non-hierarchical
    subsystems, so that implementing proper hierarchy support on those
    doesn't surprise them.

    * Keep track of which controllers are broken how and nudge the
    subsystems to implement proper hierarchy support.

    For now, start with a single warning message. We can whine louder
    later on.
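
    In outline (field names other than ->broken_hierarchy are assumptions
    based on the description): a controller opts in by setting the flag,
    and cgroup creation warns once per subsystem when a nested cgroup is
    created under such a controller:

        /* in a controller that does not yet honor the hierarchy */
        struct cgroup_subsys freezer_subsys = {
                .name                   = "freezer",
                /* ... */
                .broken_hierarchy       = true,
        };

        /* during cgroup creation, after ->create() has completed;
         * 'parent' is the new cgroup's parent */
        for_each_subsys(root, ss) {
                if (ss->broken_hierarchy && !ss->warned_broken_hierarchy &&
                    parent->parent) {           /* only warn for nested cgroups */
                        pr_warning("cgroup: \"%s\" has incomplete hierarchy support; nested cgroups may change behavior in the future\n",
                                   ss->name);
                        ss->warned_broken_hierarchy = true;
                }
        }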

    v2: Fixed a typo spotted by Michal. Warning message updated.

    v3: Updated memcg part so that it doesn't generate warning in the
    cases where .use_hierarchy=false doesn't make the behavior
    different from root.use_hierarchy=true. Fixed a typo spotted by
    Glauber.

    v4: Check ->broken_hierarchy after cgroup creation is complete so that
    ->create() can affect the result per Michal. Dropped unnecessary
    memcg root handling per Michal.

    Signed-off-by: Tejun Heo
    Acked-by: Michal Hocko
    Acked-by: Li Zefan
    Acked-by: Serge E. Hallyn
    Cc: Glauber Costa
    Cc: Peter Zijlstra
    Cc: Paul Turner
    Cc: Johannes Weiner
    Cc: Thomas Graf
    Cc: Vivek Goyal
    Cc: Paul Mackerras
    Cc: Ingo Molnar
    Cc: Arnaldo Carvalho de Melo
    Cc: Neil Horman
    Cc: Aneesh Kumar K.V

    Tejun Heo
     

01 Aug, 2012

25 commits

  • Add a mem_cgroup_from_css() helper to replace open-coded invocations
    of container_of(), to clarify the code and add a little more type
    safety.
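
    The helper is presumably just a thin, typed wrapper (assuming struct
    mem_cgroup embeds its cgroup_subsys_state as a member named css, as it
    does in memcontrol.c):

        static inline struct mem_cgroup *mem_cgroup_from_css(struct cgroup_subsys_state *s)
        {
                return container_of(s, struct mem_cgroup, css);
        }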

    [akpm@linux-foundation.org: fix extensive breakage]
    Signed-off-by: Wanpeng Li
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Cc: KAMEZAWA Hiroyuki
    Cc: Gavin Shan
    Cc: Wanpeng Li
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wanpeng Li
     
  • shmem knows for sure that the page is in swap cache when attempting to
    charge a page, because the cache charge entry function has a check for it.
    Only anon pages may be removed from swap cache already when trying to
    charge their swapin.

    Adjust the comment, though: 4969c11 ("mm: fix swapin race condition")
    added a stable PageSwapCache check under the page lock in
    do_swap_page() before calling the memory controller, so it's
    unuse_pte()'s pte_same() that may fail.

    Signed-off-by: Johannes Weiner
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Michal Hocko
    Cc: David Rientjes
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: Wanpeng Li
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Only anon and shmem pages in the swap cache are attempted to be charged
    multiple times, from every swap pte fault or from shmem_unuse(). No other
    pages require checking PageCgroupUsed().

    Charging pages in the swap cache is also serialized by the page lock, and
    since both the try_charge and commit_charge are called under the same page
    lock section, the PageCgroupUsed() check might as well happen before the
    counter charging, let alone reclaim.

    Signed-off-by: Johannes Weiner
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Michal Hocko
    Cc: David Rientjes
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: Wanpeng Li
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • When shmem is charged upon swapin, it does not need to check twice whether
    the memory controller is enabled.

    Also, shmem pages do not have to be checked for everything that regular
    anon pages have to be checked for, so let shmem use the internal version
    directly and allow future patches to move around checks that are only
    required when swapping in anon pages.

    Signed-off-by: Johannes Weiner
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Michal Hocko
    Cc: David Rientjes
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: Wanpeng Li
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • It does not matter to __mem_cgroup_try_charge() if the passed mm is NULL
    or init_mm, it will charge the root memcg in either case.

    Also fix up the comment in __mem_cgroup_try_charge() that claimed the
    init_mm would be charged when no mm was passed. It's not really
    incorrect, but confusing. Clarify that the root memcg is charged in this
    case.

    Signed-off-by: Johannes Weiner
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Michal Hocko
    Cc: David Rientjes
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: Wanpeng Li
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • shmem page charges have not needed a separate charge type to tell them
    from regular file pages since 08e552c ("memcg: synchronized LRU").

    Signed-off-by: Johannes Weiner
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Michal Hocko
    Cc: David Rientjes
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: Wanpeng Li
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Charging cache pages may require swapin in the shmem case. Save the
    forward declaration and just move the swapin functions above the cache
    charging functions.

    Signed-off-by: Johannes Weiner
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Michal Hocko
    Cc: David Rientjes
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: Wanpeng Li
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Only anon pages that are uncharged at the time their last page table
    mapping vanishes may be in swapcache.

    When shmem pages, file pages, swap-freed anon pages, or just migrated
    pages are uncharged, they are known for sure to be not in swapcache.

    Signed-off-by: Johannes Weiner
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Michal Hocko
    Cc: David Rientjes
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: Wanpeng Li
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Not all uncharge paths need to check if the page is swapcache, some of
    them can know for sure.

    Push down the check into all callsites of uncharge_common() so that the
    patch that removes some of them is more obvious.

    Signed-off-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: David Rientjes
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Wanpeng Li
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Compaction (and page migration in general) can currently be hindered
    through pages being owned by memory cgroups that are at their limits and
    unreclaimable.

    The reason is that the replacement page is being charged against the limit
    while the page being replaced is also still charged. But this seems
    unnecessary, given that only one of the two pages will still be in use
    after migration finishes.

    This patch changes the memcg migration sequence so that the replacement
    page is not charged. Whatever page is still in use after successful or
    failed migration gets to keep the charge of the page that was going to be
    replaced.

    The replacement page will still show up temporarily in the rss/cache
    statistics; this can be fixed in a later patch, as it's less urgent.

    Reported-by: David Rientjes
    Signed-off-by: Johannes Weiner
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Michal Hocko
    Cc: Hugh Dickins
    Cc: David Rientjes
    Cc: Wanpeng Li
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • By globally defining check_panic_on_oom(), the memcg oom handler can be
    moved entirely to mm/memcontrol.c. This removes the ugly #ifdef in the
    oom killer and cleans up the code.

    Signed-off-by: David Rientjes
    Cc: KAMEZAWA Hiroyuki
    Acked-by: Michal Hocko
    Cc: Oleg Nesterov
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • Since exiting tasks require write_lock_irq(&tasklist_lock) several times,
    try to reduce the amount of time the readside is held for oom kills. This
    makes the interface with the memcg oom handler more consistent since it
    now never needs to take tasklist_lock unnecessarily.

    The only time the oom killer now takes tasklist_lock is when iterating the
    children of the selected task, everything else is protected by
    rcu_read_lock().

    This requires that a reference to the selected process, p, is grabbed
    before calling oom_kill_process(). It may release it and grab a reference
    on another one of p's threads if !p->mm, but it also guarantees that it
    will release the reference before returning.

    [hughd@google.com: fix duplicate put_task_struct()]
    Signed-off-by: David Rientjes
    Cc: KAMEZAWA Hiroyuki
    Reviewed-by: Michal Hocko
    Cc: Oleg Nesterov
    Cc: KOSAKI Motohiro
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • The global oom killer is serialized by the per-zonelist
    try_set_zonelist_oom() which is used in the page allocator. Concurrent
    oom kills are thus a rare event and only occur in systems using
    mempolicies and with a large number of nodes.

    Memory controller oom kills, however, can frequently be concurrent since
    there is no serialization once the oom killer is called for oom conditions
    in several different memcgs in parallel.

    This creates a massive contention on tasklist_lock since the oom killer
    requires the readside for the tasklist iteration. If several memcgs are
    calling the oom killer, this lock can be held for a substantial amount of
    time, especially if threads continue to enter it as other threads are
    exiting.

    Since the exit path grabs the writeside of the lock with irqs disabled in
    a few different places, this can cause a soft lockup on cpus as a result
    of tasklist_lock starvation.

    The kernel lacks unfair writelocks, and successful calls to the oom killer
    usually result in at least one thread entering the exit path, so an
    alternative solution is needed.

    This patch introduces a separate oom handler for memcgs so that they do
    not require tasklist_lock for as much time. Instead, it iterates only
    over the threads attached to the oom memcg and grabs a reference to the
    selected thread before calling oom_kill_process() to ensure it doesn't
    prematurely exit.

    This still requires tasklist_lock for the tasklist dump, iterating
    children of the selected process, and killing all other threads on the
    system sharing the same memory as the selected victim. So while this
    isn't a complete solution to tasklist_lock starvation, it significantly
    reduces the amount of time that it is held.
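
    A rough sketch of the memcg-local victim selection described above,
    using the cgroup task iterator of that era (the scoring details are
    elided):

        struct mem_cgroup *iter;

        for_each_mem_cgroup_tree(iter, memcg) {
                struct cgroup *cgroup = iter->css.cgroup;
                struct cgroup_iter it;
                struct task_struct *task;

                cgroup_iter_start(cgroup, &it);
                while ((task = cgroup_iter_next(cgroup, &it))) {
                        /*
                         * Score the task with oom_badness(); remember the
                         * worst one seen so far and take a reference on it
                         * (get_task_struct) so it cannot exit under us.
                         */
                }
                cgroup_iter_end(cgroup, &it);
        }
        /* the chosen victim is then handed to oom_kill_process() */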

    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Michal Hocko
    Signed-off-by: David Rientjes
    Cc: Oleg Nesterov
    Cc: KOSAKI Motohiro
    Reviewed-by: Sha Zhengju
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • Signed-off-by: Wanpeng Li
    Acked-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wanpeng Li
     
  • Signed-off-by: Wanpeng Li
    Acked-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wanpeng Li
     
  • Replace memory_cgroup_xxx() with memcg_xxx()

    Signed-off-by: Wanpeng Li
    Acked-by: Johannes Weiner
    Acked-by: Michal Hocko
    Acked-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wanpeng Li
     
  • I have an application that does the following:

    * copy the state of all controllers attached to a hierarchy
    * replicate it as a child of the current level.

    I would expect writes to the files to mostly succeed, since they are
    inheriting sane values from parents.

    But that is not the case for use_hierarchy. If it is set to 0, the
    write succeeds. If it is set to 1, the value of the file is
    automatically set to 1 in the children, but if userspace tries to
    write the very same 1, it will fail. The same thing happens if we set
    use_hierarchy, create a child, and then try to write 1 again.

    Now, there is no reason whatsoever for failing to write a value that
    is already there. It doesn't even match the comment, which states:

    /* If parent's use_hierarchy is set, we can't make any modifications
    * in the child subtrees...

    since we are not changing anything.

    So test the new value against the one we're storing, and automatically
    return 0 if we're not proposing a change.
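
    The fix is essentially an early exit in the write handler when the
    proposed value equals the stored one (sketch; the existing parent and
    children checks are elided):

        static int mem_cgroup_hierarchy_write(struct cgroup *cont, struct cftype *cft,
                                              u64 val)
        {
                int retval = 0;
                struct mem_cgroup *memcg = mem_cgroup_from_cont(cont);

                cgroup_lock();

                if (memcg->use_hierarchy == val)
                        goto out;       /* writing the value already stored: succeed */

                /* ... existing parent/children checks and the actual update ... */
        out:
                cgroup_unlock();
                return retval;
        }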

    Signed-off-by: Glauber Costa
    Cc: Dhaval Giani
    Acked-by: Michal Hocko
    Cc: Kamezawa Hiroyuki
    Acked-by: Johannes Weiner
    Cc: Ying Han
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Glauber Costa
     
  • Sanity:

    CONFIG_CGROUP_MEM_RES_CTLR -> CONFIG_MEMCG
    CONFIG_CGROUP_MEM_RES_CTLR_SWAP -> CONFIG_MEMCG_SWAP
    CONFIG_CGROUP_MEM_RES_CTLR_SWAP_ENABLED -> CONFIG_MEMCG_SWAP_ENABLED
    CONFIG_CGROUP_MEM_RES_CTLR_KMEM -> CONFIG_MEMCG_KMEM

    [mhocko@suse.cz: fix missed bits]
    Cc: Glauber Costa
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Cc: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Cc: Tejun Heo
    Cc: Aneesh Kumar K.V
    Cc: David Rientjes
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • mem_cgroup_force_empty_list() just returns 0 or -EBUSY and -EBUSY
    indicates 'you need to retry'. Make mem_cgroup_force_empty_list() return
    a bool to simplify the logic.

    [akpm@linux-foundation.org: rework mem_cgroup_force_empty_list()'s comment]
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • After bf544fdc241da8 "memcg: move charges to root cgroup if
    use_hierarchy=0 in mem_cgroup_move_hugetlb_parent()"
    mem_cgroup_move_parent() returns only -EBUSY or -EINVAL. So we can remove
    the -ENOMEM and -EINTR checks.

    Signed-off-by: KAMEZAWA Hiroyuki
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kamezawa Hiroyuki
     
  • After bf544fdc241da8 "memcg: move charges to root cgroup if
    use_hierarchy=0 in mem_cgroup_move_hugetlb_parent()", no memory reclaim
    will occur when removing a memory cgroup. If -EINTR is returned here,
    cgroup will show a warning.

    We don't need to handle any user interruption signal. Remove this.

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Johannes Weiner
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kamezawa Hiroyuki
     
  • There are no users since commit b24028572fb69 ("memcg: remove PCG_CACHE").

    Signed-off-by: KAMEZAWA Hiroyuki
    Acked-by: Hugh Dickins
    Acked-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kamezawa Hiroyuki
     
  • Now, in memcg, two "MAPPED" enums/macros are found:
    MEM_CGROUP_CHARGE_TYPE_MAPPED
    MEM_CGROUP_STAT_FILE_MAPPED

    Their names look similar to each other, but the former is used for
    accounting anonymous memory. Rename it to TYPE_ANON.

    Signed-off-by: KAMEZAWA Hiroyuki
    Acked-by: Michal Hocko
    Acked-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kamezawa Hiroyuki
     
  • MEM_CGROUP_STAT_SWAPOUT represents the usage of swap rather than
    the number of swap-out events. Rename it to be MEM_CGROUP_STAT_SWAP.

    Signed-off-by: KAMEZAWA Hiroyuki
    Acked-by: Michal Hocko
    Acked-by: Hugh Dickins
    Acked-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kamezawa Hiroyuki
     

21 Jun, 2012

2 commits

  • Fix kernel-doc warnings such as

    Warning(../mm/page_cgroup.c:432): No description found for parameter 'id'
    Warning(../mm/page_cgroup.c:432): Excess function parameter 'mem' description in 'swap_cgroup_record'

    Signed-off-by: Wanpeng Li
    Cc: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wanpeng Li
     
  • If use_hierarchy is set, reclaim testing soon oopses in css_is_ancestor()
    called from __mem_cgroup_same_or_subtree() called from page_referenced():
    when processes are exiting, it's easy for mm_match_cgroup() to pass along
    a NULL memcg coming from a NULL mm->owner.

    Check for that in __mem_cgroup_same_or_subtree(). Return true or false?
    False because we cannot know if it was in the hierarchy, but also false
    because it's better not to count a reference from an exiting process.
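
    The resulting check presumably looks something like this (sketch based
    on the description above):

        static bool __mem_cgroup_same_or_subtree(const struct mem_cgroup *root_memcg,
                                                 struct mem_cgroup *memcg)
        {
                if (root_memcg == memcg)
                        return true;
                if (!root_memcg->use_hierarchy || !memcg)   /* NULL from exiting mm->owner */
                        return false;
                return css_is_ancestor(&memcg->css, &root_memcg->css);
        }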

    Signed-off-by: Hugh Dickins
    Acked-by: Johannes Weiner
    Acked-by: Konstantin Khlebnikov
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

30 May, 2012

8 commits

  • We call the destroy function when a cgroup starts to be removed, such as
    by a rmdir event.

    However, because of our reference counters, some objects are still
    inflight. Right now, we are decrementing the static_keys at destroy()
    time, meaning that if we get rid of the last static_key reference, some
    objects will still have charges, but the code to properly uncharge them
    won't be run.

    This becomes a problem especially if it is ever enabled again, because
    new charges will then be added on top of the stale ones, making it
    pretty much impossible to keep the accounting straight.

    We just need to be careful with the static branch activation: since there
    is no particular preferred order of their activation, we need to make sure
    that we only start using it after all call sites are active. This is
    achieved by having a per-memcg flag that is only updated after
    static_key_slow_inc() returns. At this time, we are sure all sites are
    active.

    This is made per-memcg, not global, for a reason: it also has the
    effect of making socket accounting more consistent. The first memcg to
    be limited will trigger static_key() activation and, therefore,
    accounting; but all the others would then be accounted no matter what.
    After this patch, only limited memcgs will have their sockets
    accounted.
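
    The ordering in the limit-setting path then looks roughly like this
    (the flag bit names are illustrative; they correspond to the
    sock_flag_bits enum mentioned in the note below):

        /*
         * Bump the static key first; only after static_key_slow_inc()
         * returns (i.e. every call site has been patched) do we mark
         * this memcg as actively accounted.
         */
        if (!test_and_set_bit(MEMCG_SOCK_ACTIVATED, &cg_proto->flags))
                static_key_slow_inc(&memcg_socket_limit_enabled);
        set_bit(MEMCG_SOCK_ACTIVE, &cg_proto->flags);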

    [akpm@linux-foundation.org: move enum sock_flag_bits into sock.h,
    document enum sock_flag_bits,
    convert memcg_proto_active() and memcg_proto_activated() to test_bit(),
    redo tcp_update_limit() comment to 80 cols]
    Signed-off-by: Glauber Costa
    Cc: Tejun Heo
    Cc: Li Zefan
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Acked-by: David Miller
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Glauber Costa
     
  • Right now we free struct memcg with kfree() right after an RCU grace
    period, but defer it if we need to use vfree() to get rid of that
    memory area. We do that out of necessity, because vfree() must be
    called from process context.

    This patch unifies the behavior by ensuring that even kfree() happens
    in a separate thread. The goal is to have a stable place to call the
    upcoming jump label destruction function, outside the realm of the
    complicated and quite far-reaching cgroup lock (which can't be held
    while holding either cpu_hotplug.lock or jump_label_mutex).
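
    The unified path is roughly: the RCU callback only schedules a work
    item, and the work item, running in process context, picks kfree() or
    vfree() as appropriate (the work_freeing/rcu_freeing field names are
    assumptions):

        static void free_work(struct work_struct *work)
        {
                struct mem_cgroup *memcg;
                int size = sizeof(struct mem_cgroup);

                memcg = container_of(work, struct mem_cgroup, work_freeing);
                /* Process context: both kfree() and vfree() are safe here. */
                if (size < PAGE_SIZE)
                        kfree(memcg);
                else
                        vfree(memcg);
        }

        static void free_rcu(struct rcu_head *rcu_head)
        {
                struct mem_cgroup *memcg;

                memcg = container_of(rcu_head, struct mem_cgroup, rcu_freeing);
                INIT_WORK(&memcg->work_freeing, free_work);
                schedule_work(&memcg->work_freeing);
        }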

    [akpm@linux-foundation.org: tweak comment]
    Signed-off-by: Glauber Costa
    Cc: Tejun Heo
    Cc: Li Zefan
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: David Miller
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Glauber Costa
     
  • Take lruvec further: pass it instead of zone to add_page_to_lru_list()
    and del_page_from_lru_list(), and have pagevec_lru_move_fn() pass
    lruvec down to its target functions.

    This cleanup eliminates a swathe of cruft in memcontrol.c, including
    mem_cgroup_lru_add_list(), mem_cgroup_lru_del_list() and
    mem_cgroup_lru_move_lists() - which never actually touched the lists.

    In their place, mem_cgroup_page_lruvec() decides the lruvec
    (previously a side effect of add), and mem_cgroup_update_lru_size()
    maintains the lru_size stats.

    Whilst these are simplifications in their own right, the goal is to bring
    the evaluation of lruvec next to the spin_locking of the lrus, in
    preparation for a future patch.
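
    The call pattern at a typical site then becomes (sketch of a pagevec
    add function after the change):

        struct lruvec *lruvec;

        spin_lock_irq(&zone->lru_lock);
        /* evaluate the lruvec right next to taking the lru lock */
        lruvec = mem_cgroup_page_lruvec(page, zone);
        SetPageLRU(page);
        add_page_to_lru_list(page, lruvec, page_lru(page));
        spin_unlock_irq(&zone->lru_lock);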

    Signed-off-by: Hugh Dickins
    Cc: KOSAKI Motohiro
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Michal Hocko
    Acked-by: Konstantin Khlebnikov
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Konstantin just introduced mem_cgroup_get_lruvec_size() and
    get_lruvec_size(), and I'm about to add mem_cgroup_update_lru_size();
    but we're dealing with the same thing, lru_size[lru]. We ought to
    agree on the naming, and I do think lru_size is the more correct: so
    rename his ones to get_lru_size().

    Signed-off-by: Hugh Dickins
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Konstantin Khlebnikov
    Acked-by: Michal Hocko
    Cc: KOSAKI Motohiro
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Directly print statistics and event counters instead of going through an
    intermediate accumulation stage into a separate array, which used to
    require defining statistic items in more than one place.

    [akpm@linux-foundation.org: checkpatch fixes]
    Signed-off-by: Johannes Weiner
    Acked-by: Michal Hocko
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Ying Han
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • The counter of currently swapped out pages in a memcg (hierarchy) is
    sitting amidst ever-increasing event counters. Move this item to the
    other counters that reflect current state rather than history.

    This technically breaks the kernel ABI, but hopefully nobody relies on the
    order of items in memory.stat.

    Signed-off-by: Johannes Weiner
    Acked-by: Michal Hocko
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Ying Han
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • All events except the ratelimit counter are statistics exported to
    userspace. Keep this internal value out of the event count array.

    Signed-off-by: Johannes Weiner
    Acked-by: Michal Hocko
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Ying Han
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Being able to use seq_printf() allows being smarter about statistics
    name strings, which are currently listed twice, with the only difference
    being a "total_" prefix on the hierarchical version.

    Signed-off-by: Johannes Weiner
    Acked-by: Michal Hocko
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Ying Han
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner