05 Nov, 2013

1 commit

  • Conflicts:
    drivers/net/ethernet/emulex/benet/be.h
    drivers/net/netconsole.c
    net/bridge/br_private.h

    Three mostly trivial conflicts.

    The net/bridge/br_private.h conflict was a function signature (argument
    addition) change overlapping with the extern removals from Joe Perches.

    In drivers/net/netconsole.c we had one change adjusting a printk message
    whilst another changed "printk(KERN_INFO" into "pr_info(".

    Lastly, the emulex change was a new inline function addition overlapping
    with Joe Perches's extern removals.

    Signed-off-by: David S. Miller

    David S. Miller
     

02 Nov, 2013

1 commit

  • When a memcg is deleted, mem_cgroup_reparent_charges() moves charged
    memory to the parent memcg. As of v3.11-9444-g3ea67d0 "memcg: add per
    cgroup writeback pages accounting" there's a bad pointer read. The
    goal was to check for counter underflow. The counter is a per-cpu
    counter and there are two problems with the code:

    (1) the per-cpu access function isn't used; instead a naked pointer is
    used, which easily causes an oops.
    (2) the check doesn't sum over all cpus.

    Test:
    $ cd /sys/fs/cgroup/memory
    $ mkdir x
    $ echo 3 > /proc/sys/vm/drop_caches
    $ (echo $BASHPID >> x/tasks && exec cat) &
    [1] 7154
    $ grep ^mapped x/memory.stat
    mapped_file 53248
    $ echo 7154 > tasks
    $ rmdir x

    The fix is to remove the check. It's currently dangerous, and it isn't
    worth making it use something expensive, such as percpu_counter_sum(),
    for each reparented page. __this_cpu_read() isn't enough to fix this
    because there are no guarantees about the current cpu's count. The
    only guarantee is that the sum of all per-cpu counters is >= nr_pages.
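    As a sketch of the distinction between the two accesses (the stat
    structure and field names here only loosely follow mm/memcontrol.c of
    this era), only a per-cpu accessor or a sum over all cpus makes sense
    for such a counter:

    /* wrong: dereferencing the __percpu pointer as if it were plain:
     *     long bogus = memcg->stat->count[idx];
     * correct per-cpu read, but only of the current cpu's slice, which
     * may legitimately be negative: */
    long local = __this_cpu_read(memcg->stat->count[idx]);

    /* only the sum over all cpus is guaranteed to be >= nr_pages */
    long total = 0;
    int cpu;
    for_each_possible_cpu(cpu)
        total += per_cpu_ptr(memcg->stat, cpu)->count[idx];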

    Fixes: 3ea67d06e467 ("memcg: add per cgroup writeback pages accounting")
    Reported-and-tested-by: Flavio Leitner
    Signed-off-by: Greg Thelen
    Reviewed-by: Sha Zhengju
    Acked-by: Johannes Weiner
    Signed-off-by: Hugh Dickins
    Signed-off-by: Linus Torvalds

    Greg Thelen
     

01 Nov, 2013

3 commits

  • When memcg code needs to know whether any given memcg has children, it
    uses the cgroup child iteration primitives and returns true/false
    depending on whether the iteration loop is executed at least once or
    not.

    Because a cgroup's list of children is RCU protected, these primitives
    require the RCU read-lock to be held, which is not the case for all
    memcg callers. This results in the following splat when e.g. enabling
    hierarchy mode:

    WARNING: CPU: 3 PID: 1 at kernel/cgroup.c:3043 css_next_child+0xa3/0x160()
    CPU: 3 PID: 1 Comm: systemd Not tainted 3.12.0-rc5-00117-g83f11a9-dirty #18
    Hardware name: LENOVO 3680B56/3680B56, BIOS 6QET69WW (1.39 ) 04/26/2012
    Call Trace:
    dump_stack+0x54/0x74
    warn_slowpath_common+0x78/0xa0
    warn_slowpath_null+0x1a/0x20
    css_next_child+0xa3/0x160
    mem_cgroup_hierarchy_write+0x5b/0xa0
    cgroup_file_write+0x108/0x2a0
    vfs_write+0xbd/0x1e0
    SyS_write+0x4c/0xa0
    system_call_fastpath+0x16/0x1b

    In the memcg case, we only care about children when we are attempting to
    modify inheritable attributes interactively. Racing with deletion could
    mean a spurious -EBUSY, no problem. Racing with addition is handled
    just fine as well through the memcg_create_mutex: if the child group is
    not on the list after the mutex is acquired, it won't be initialized
    from the parent's attributes until after the unlock.
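    A minimal sketch of the kind of fix this implies (the helper name is
    illustrative, not necessarily what the patch uses):

    static bool memcg_has_children(struct mem_cgroup *memcg)
    {
        bool ret;

        /* the children list is RCU protected, so hold the read lock
         * around the iteration primitive */
        rcu_read_lock();
        ret = css_next_child(NULL, &memcg->css) != NULL;
        rcu_read_unlock();
        return ret;
    }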

    Signed-off-by: Johannes Weiner
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • The memcg OOM lock is a mutex-type lock that is open-coded due to
    memcg's special needs. Add annotations for lockdep coverage.
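    A minimal sketch of what teaching lockdep about an open-coded lock
    looks like (the map name and the helpers are illustrative and follow
    the 3.12-era lockdep macros):

    #include <linux/lockdep.h>

    static struct lockdep_map memcg_oom_lock_dep_map = {
        .name = "memcg_oom_lock",
    };

    /* call right after the open-coded trylock succeeds */
    static void memcg_oom_lock_acquired(void)
    {
        mutex_acquire(&memcg_oom_lock_dep_map, 0, 1, _RET_IP_);
    }

    /* call right before the open-coded lock is dropped */
    static void memcg_oom_lock_released(void)
    {
        mutex_release(&memcg_oom_lock_dep_map, 1, _RET_IP_);
    }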

    Signed-off-by: Johannes Weiner
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Commit 84235de394d9 ("fs: buffer: move allocation failure loop into the
    allocator") allowed __GFP_NOFAIL allocations to bypass the limit if they
    fail to reclaim enough memory for the charge. But because the main test
    case was on a 3.2-based system, the patch missed the fact that on newer
    kernels the charge function needs to return root_mem_cgroup when
    bypassing the limit, and not NULL. This will corrupt whatever memory is
    at NULL + percpu pointer offset. Fix this quickly before problems are
    reported.
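    Schematically, the rule in the charge path is (a sketch, not the
    literal diff):

    /* a __GFP_NOFAIL charge may bypass the limit, but callers go on to
     * dereference per-cpu fields of whatever memcg is handed back, so
     * report root_mem_cgroup for a bypassed charge, never NULL */
    if (gfp_mask & __GFP_NOFAIL)
        *ptr = root_mem_cgroup;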

    Signed-off-by: Johannes Weiner
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

31 Oct, 2013

1 commit

  • As of commit 3ea67d06e467 ("memcg: add per cgroup writeback pages
    accounting") memcg counter errors are possible when moving charged
    memory to a different memcg. Charge movement occurs when processing
    writes to memory.force_empty, moving tasks to a memcg with
    memory.move_charge_at_immigrate=1, or memcg deletion.

    An example showing error after memory.force_empty:

    $ cd /sys/fs/cgroup/memory
    $ mkdir x
    $ rm /data/tmp/file
    $ (echo $BASHPID >> x/tasks && exec mmap_writer /data/tmp/file 1M) &
    [1] 13600
    $ grep ^mapped x/memory.stat
    mapped_file 1048576
    $ echo 13600 > tasks
    $ echo 1 > x/memory.force_empty
    $ grep ^mapped x/memory.stat
    mapped_file 4503599627370496

    mapped_file should end with 0.
    4503599627370496 == 0x10,0000,0000,0000 == 0x100,0000,0000 pages
    1048576 == 0x10,0000 == 0x100 pages

    This issue only affects the source memcg on 64 bit machines; the
    destination memcg counters are correct. So the rmdir case is not too
    important because such counters are soon disappearing with the entire
    memcg. But the memory.force_empty and memory.move_charge_at_immigrate=1
    cases are larger problems as the bogus counters are visible for the
    (possibly long) remaining life of the source memcg.

    The problem is due to memcg's use of __this_cpu_add(.., -nr_pages),
    which is subtly wrong because it adds the negation of the unsigned int
    nr_pages (either 1, or 512 for THP) to a signed long per-cpu counter.
    When nr_pages=1, -nr_pages=0xffffffff as an unsigned int. On 64 bit
    machines stat->count[idx] is signed 64 bit, so memcg's attempt to
    simply decrement a count (e.g. from 1 to 0) boils down to:

    long count = 1
    unsigned int nr_pages = 1
    count += -nr_pages /* -nr_pages == 0xffff,ffff */
    count is now 0x1,0000,0000 instead of 0

    The fix is to subtract the unsigned page count rather than adding its
    negation. This only works once "percpu: fix this_cpu_sub() subtrahend
    casting for unsigneds" is applied to fix this_cpu_sub().

    Signed-off-by: Greg Thelen
    Acked-by: Tejun Heo
    Acked-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Greg Thelen
     

24 Oct, 2013

1 commit


22 Oct, 2013

1 commit


17 Oct, 2013

3 commits

  • Buffer allocation has a very crude indefinite loop around waking the
    flusher threads and performing global NOFS direct reclaim because it can
    not handle allocation failures.

    The most immediate problem with this is that the allocation may fail
    due to a memory cgroup limit, where flushers + direct reclaim might
    not make any progress towards resolving the situation at all, because
    unlike the global case a memory cgroup may not have any cache at all,
    only anonymous pages and no swap. This situation will lead to a
    reclaim livelock with insane IO from waking the flushers and thrashing
    unrelated filesystem cache in a tight loop.

    Use __GFP_NOFAIL allocations for buffers for now. This makes sure that
    any looping happens in the page allocator, which knows how to
    orchestrate kswapd, direct reclaim, and the flushers sensibly. It also
    allows memory cgroups to detect allocations that can't handle failure
    and will allow them to ultimately bypass the limit if reclaim can not
    make progress.
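    The core of the change in fs/buffer.c's page-grabbing path looks
    roughly like this (a sketch of the idea, not the full diff):

    gfp_t gfp_mask;

    gfp_mask = mapping_gfp_mask(inode->i_mapping) & ~__GFP_FS;
    /* loop inside the page allocator instead of open-coding retries and
     * flusher wakeups in fs/buffer.c */
    gfp_mask |= __GFP_NOFAIL;

    page = find_or_create_page(inode->i_mapping, index, gfp_mask);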

    Reported-by: azurIt
    Signed-off-by: Johannes Weiner
    Cc: Michal Hocko
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Commit 3812c8c8f395 ("mm: memcg: do not trap chargers with full
    callstack on OOM") assumed that only a few places that can trigger a
    memcg OOM situation do not return VM_FAULT_OOM, like optional page cache
    readahead. But there are many more and it's impractical to annotate
    them all.

    First of all, we don't want to invoke the OOM killer when the failed
    allocation is gracefully handled, so defer the actual kill to the end of
    the fault handling as well. This simplifies the code quite a bit as an
    added bonus.

    Second, since a failed allocation might not be the abrupt end of the
    fault, the memcg OOM handler needs to be re-entrant until the fault
    finishes for subsequent allocation attempts. If an allocation is
    attempted after the task already OOMed, allow it to bypass the limit so
    that it can quickly finish the fault and invoke the OOM killer.
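    The re-entrancy rule boils down to something like this in the charge
    path (a sketch; task_in_memcg_oom() is the per-task state introduced
    by this series):

    /* if this task already hit a memcg OOM during the current fault, let
     * further charges through so the fault can wind up quickly and the
     * OOM killer can be invoked from the end of the fault path */
    if (unlikely(task_in_memcg_oom(current)))
        goto bypass;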

    Reported-by: azurIt
    Signed-off-by: Johannes Weiner
    Cc: Michal Hocko
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • for_each_online_cpu() needs the protection of {get,put}_online_cpus() so
    cpu_online_mask doesn't change during the iteration.

    cpu_hotplug.lock is held while a cpu is going down; it's a coarse lock
    that is used kernel-wide to synchronize cpu hotplug activity. Memcg has
    a cpu hotplug notifier, called while there may not be any cpu hotplug
    refcounts, which drains per-cpu event counts to memcg->nocpu_base.events
    to maintain a cumulative event count as cpus disappear. Without
    get_online_cpus() in mem_cgroup_read_events(), it's possible to account
    for the event count on a dying cpu twice, and this value may be
    significantly large.

    In fact, all memcg->pcp_counter_lock use should be nested by
    {get,put}_online_cpus().

    This fixes that issue and ensures the reported statistics are not vastly
    over-reported during cpu hotplug.
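    The event read then follows this pattern (a sketch along the lines of
    the description; field names are approximate):

    static unsigned long mem_cgroup_read_events(struct mem_cgroup *memcg,
                                                enum mem_cgroup_events_index idx)
    {
        unsigned long val = 0;
        int cpu;

        get_online_cpus();
        for_each_online_cpu(cpu)
            val += per_cpu(memcg->stat->events[idx], cpu);
    #ifdef CONFIG_HOTPLUG_CPU
        /* the hotplug notifier folds dying cpus into nocpu_base */
        spin_lock(&memcg->pcp_counter_lock);
        val += memcg->nocpu_base.events[idx];
        spin_unlock(&memcg->pcp_counter_lock);
    #endif
        put_online_cpus();
        return val;
    }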

    Signed-off-by: David Rientjes
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: KAMEZAWA Hiroyuki
    Acked-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

25 Sep, 2013

7 commits


13 Sep, 2013

16 commits

  • Add memcg routines to count writeback pages; dirty pages will also be
    accounted later.

    After Kame's commit 89c06bd52fb9 ("memcg: use new logic for page stat
    accounting"), we can use 'struct page' flag to test page state instead
    of the per page_cgroup flag. But memcg has a feature to move a page
    from one cgroup to another and may have a race between "move" and
    "page stat accounting". So in order to avoid the race we have designed
    a new lock:

    mem_cgroup_begin_update_page_stat()
    modify page information -->(a)
    mem_cgroup_update_page_stat() -->(b)
    mem_cgroup_end_update_page_stat()

    It requires both (a) and (b) (writeback pages accounting) to be
    protected by mem_cgroup_{begin/end}_update_page_stat(). It's a full
    no-op for !CONFIG_MEMCG, almost a no-op if memcg is disabled (but
    compiled in), an rcu read lock in most cases (no task is moving), and
    spin_lock_irqsave on top in the slow path.

    There are two writeback interfaces to modify: test_{clear/set}_page_writeback().
    And the lock order is:
    --> memcg->move_lock
    --> mapping->tree_lock
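    In code, the protocol around a page state change plus its memcg stat
    update looks roughly like this (a sketch of the pattern, not the
    exact diff):

    bool locked;
    unsigned long flags;

    mem_cgroup_begin_update_page_stat(page, &locked, &flags);
    /* (a) modify the page information, e.g. set PG_writeback */
    mem_cgroup_inc_page_stat(page, MEM_CGROUP_STAT_WRITEBACK);   /* (b) */
    mem_cgroup_end_update_page_stat(page, &locked, &flags);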

    Signed-off-by: Sha Zhengju
    Acked-by: Michal Hocko
    Reviewed-by: Greg Thelen
    Cc: Fengguang Wu
    Cc: KAMEZAWA Hiroyuki
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sha Zhengju
     
  • We should call mem_cgroup_begin_update_page_stat() before
    mem_cgroup_update_page_stat() to get the proper locks; however, the
    latter doesn't do any checking that we use proper locking, which would
    be hard. As suggested by Michal Hocko, we could at least test for
    rcu_read_lock_held(), because RCU is held if !mem_cgroup_disabled().
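    The check itself is tiny, something along these lines inside
    mem_cgroup_update_page_stat():

    /* cheap sanity check that the caller went through
     * mem_cgroup_begin_update_page_stat() */
    if (mem_cgroup_disabled())
        return;
    VM_BUG_ON(!rcu_read_lock_held());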

    Signed-off-by: Sha Zhengju
    Acked-by: Michal Hocko
    Reviewed-by: Greg Thelen
    Cc: Fengguang Wu
    Cc: KAMEZAWA Hiroyuki
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sha Zhengju
     
  • While accounting memcg page stat, it's not worth using
    MEMCG_NR_FILE_MAPPED as an extra layer of indirection because of the
    complexity and presumed performance overhead. We can use
    MEM_CGROUP_STAT_FILE_MAPPED directly.

    Signed-off-by: Sha Zhengju
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Michal Hocko
    Acked-by: Fengguang Wu
    Reviewed-by: Greg Thelen
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sha Zhengju
     
  • RESOURCE_MAX is far too general a name; change it to RES_COUNTER_MAX.

    Signed-off-by: Sha Zhengju
    Signed-off-by: Qiang Huang
    Acked-by: Michal Hocko
    Cc: Daisuke Nishimura
    Cc: Jeff Liu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sha Zhengju
     
  • The memcg OOM handling is incredibly fragile and can deadlock. When a
    task fails to charge memory, it invokes the OOM killer and loops right
    there in the charge code until it succeeds. Comparably, any other task
    that enters the charge path at this point will go to a waitqueue right
    then and there and sleep until the OOM situation is resolved. The problem
    is that these tasks may hold filesystem locks and the mmap_sem; locks that
    the selected OOM victim may need to exit.

    For example, in one reported case, the task invoking the OOM killer was
    about to charge a page cache page during a write(), which holds the
    i_mutex. The OOM killer selected a task that was just entering truncate()
    and trying to acquire the i_mutex:

    OOM invoking task:
    mem_cgroup_handle_oom+0x241/0x3b0
    mem_cgroup_cache_charge+0xbe/0xe0
    add_to_page_cache_locked+0x4c/0x140
    add_to_page_cache_lru+0x22/0x50
    grab_cache_page_write_begin+0x8b/0xe0
    ext3_write_begin+0x88/0x270
    generic_file_buffered_write+0x116/0x290
    __generic_file_aio_write+0x27c/0x480
    generic_file_aio_write+0x76/0xf0 # takes ->i_mutex
    do_sync_write+0xea/0x130
    vfs_write+0xf3/0x1f0
    sys_write+0x51/0x90
    system_call_fastpath+0x18/0x1d

    OOM kill victim:
    do_truncate+0x58/0xa0 # takes i_mutex
    do_last+0x250/0xa30
    path_openat+0xd7/0x440
    do_filp_open+0x49/0xa0
    do_sys_open+0x106/0x240
    sys_open+0x20/0x30
    system_call_fastpath+0x18/0x1d

    The OOM handling task will retry the charge indefinitely while the OOM
    killed task is not releasing any resources.

    A similar scenario can happen when the kernel OOM killer for a memcg is
    disabled and a userspace task is in charge of resolving OOM situations.
    In this case, ALL tasks that enter the OOM path will be made to sleep on
    the OOM waitqueue and wait for userspace to free resources or increase
    the group's limit. But a userspace OOM handler is prone to deadlock
    itself on the locks held by the waiting tasks. For example one of the
    sleeping tasks may be stuck in a brk() call with the mmap_sem held for
    writing but the userspace handler, in order to pick an optimal victim,
    may need to read files from /proc/, which tries to acquire the same
    mmap_sem for reading and deadlocks.

    This patch changes the way tasks behave after detecting a memcg OOM and
    makes sure nobody loops or sleeps with locks held:

    1. When OOMing in a user fault, invoke the OOM killer and restart the
    fault instead of looping on the charge attempt. This way, the OOM
    victim can not get stuck on locks the looping task may hold.

    2. When OOMing in a user fault but somebody else is handling it
    (either the kernel OOM killer or a userspace handler), don't go to
    sleep in the charge context. Instead, remember the OOMing memcg in
    the task struct and then fully unwind the page fault stack with
    -ENOMEM. pagefault_out_of_memory() will then call back into the
    memcg code to check if the -ENOMEM came from the memcg, and then
    either put the task to sleep on the memcg's OOM waitqueue or just
    restart the fault. The OOM victim can no longer get stuck on any
    lock a sleeping task may hold.
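    The second point means the arch fault handlers stay unchanged and the
    decision is made only once the stack has been unwound, roughly like
    this (a sketch; the global OOM path is condensed and the
    mem_cgroup_oom_synchronize() signature is approximate):

    void pagefault_out_of_memory(void)
    {
        /* if the -ENOMEM came from a memcg OOM recorded in the task
         * struct, sleep on that memcg's OOM waitqueue (or let the kill
         * proceed) and have the fault restarted afterwards */
        if (mem_cgroup_oom_synchronize())
            return;

        /* otherwise fall through to the global OOM killer */
        out_of_memory(NULL, 0, 0, NULL, false);
    }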

    Debugged by Michal Hocko.

    Signed-off-by: Johannes Weiner
    Reported-by: azurIt
    Acked-by: Michal Hocko
    Cc: David Rientjes
    Cc: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • The memcg OOM handler open-codes a sleeping lock for OOM serialization
    (trylock, wait, repeat) because the required locking is so specific to
    memcg hierarchies. However, it would be nice if this construct would be
    clearly recognizable and not be as obfuscated as it is right now. Clean
    up as follows:

    1. Remove the return value of mem_cgroup_oom_unlock()

    2. Rename mem_cgroup_oom_lock() to mem_cgroup_oom_trylock().

    3. Pull the prepare_to_wait() out of the memcg_oom_lock scope. This
    makes it more obvious that the task has to be on the waitqueue
    before attempting to OOM-trylock the hierarchy, to not miss any
    wakeups before going to sleep. It just didn't matter until now
    because it was all lumped together into the global memcg_oom_lock
    spinlock section.

    4. Pull the mem_cgroup_oom_notify() out of the memcg_oom_lock scope.
    It is protected by the hierarchical OOM lock.

    5. The memcg_oom_lock spinlock is only required to propagate the OOM
    lock in any given hierarchy atomically. Restrict its scope to
    mem_cgroup_oom_(trylock|unlock).

    6. Do not wake up the waitqueue unconditionally at the end of the
    function. Only the lockholder has to wake up the next in line
    after releasing the lock.

    Note that the lockholder kicks off the OOM-killer, which in turn
    leads to wakeups from the uncharges of the exiting task. But a
    contender is not guaranteed to see them if it enters the OOM path
    after the OOM kills but before the lockholder releases the lock.
    Thus there has to be an explicit wakeup after releasing the lock.

    7. Put the OOM task on the waitqueue before marking the hierarchy as
    under OOM as that is the point where we start to receive wakeups.
    No point in listening before being on the waitqueue.

    8. Likewise, unmark the hierarchy before finishing the sleep, for
    symmetry.
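    Points 3, 5 and 7 give the waiter path roughly this shape (a sketch;
    the names follow mm/memcontrol.c loosely and the owait initialization
    is left out):

    prepare_to_wait(&memcg_oom_waitq, &owait.wait, TASK_KILLABLE);
    mem_cgroup_mark_under_oom(memcg);

    locked = mem_cgroup_oom_trylock(memcg);

    if (locked)
        mem_cgroup_oom_notify(memcg);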

    Signed-off-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: David Rientjes
    Cc: KAMEZAWA Hiroyuki
    Cc: azurIt
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • System calls and kernel faults (uaccess, gup) can handle an out of memory
    situation gracefully and just return -ENOMEM.

    Enable the memcg OOM killer only for user faults, where it's really the
    only option available.

    Signed-off-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: David Rientjes
    Cc: KAMEZAWA Hiroyuki
    Cc: azurIt
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Clean up some mess made by the "Soft limit rework" series, and a few other
    things.

    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • Children in soft limit excess are currently tracked up the hierarchy in
    memcg->children_in_excess. Nevertheless there still might exist tons of
    groups that are not in hierarchy relation to the root cgroup (e.g. all
    first level groups if root_mem_cgroup->use_hierarchy == false).

    As the whole tree walk has to be done when the iteration starts at
    root_mem_cgroup, the iterator should be able to skip the walk
    altogether if there is no child above the limit, without iterating
    over the groups. This can be done easily if the root tracks all
    children rather than only hierarchical children. This patch does that
    by updating root_mem_cgroup's children_in_excess even if
    root_mem_cgroup->use_hierarchy == false, so the root knows about all
    children in excess.

    Please note that this is not an issue for inner memcgs which have
    use_hierarchy == false because then only the single group is visited so
    no special optimization is necessary.

    Signed-off-by: Michal Hocko
    Cc: Balbir Singh
    Cc: Glauber Costa
    Cc: Greg Thelen
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Cc: Michel Lespinasse
    Cc: Tejun Heo
    Cc: Ying Han
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • mem_cgroup_should_soft_reclaim controls whether the soft reclaim pass
    is done, and it currently always says yes. Memcg iterators are clever
    enough to skip nodes that are not soft reclaimable quite efficiently,
    but mem_cgroup_should_soft_reclaim can be smarter and not start the
    soft reclaim pass at all if it knows that nothing would be scanned
    anyway.

    In order to do that, simply reuse mem_cgroup_soft_reclaim_eligible for
    the target group of the reclaim and allow the pass only if the whole
    subtree wouldn't be skipped.

    Signed-off-by: Michal Hocko
    Cc: Balbir Singh
    Cc: Glauber Costa
    Cc: Greg Thelen
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Cc: Michel Lespinasse
    Cc: Tejun Heo
    Cc: Ying Han
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Soft limit reclaim has to check the whole reclaim hierarchy while doing
    the first pass of the reclaim. This leads to a higher system time which
    can be visible especially when there are many groups in the hierarchy.

    This patch adds a per-memcg counter of children in excess. It also
    restores MEM_CGROUP_TARGET_SOFTLIMIT into mem_cgroup_event_ratelimit for a
    proper batching.

    If a group crosses its soft limit for the first time it increases the
    parents' children_in_excess up the hierarchy. Similarly, if a group
    gets below the limit it decreases the counter. The transition is
    recorded in the soft_contributed flag.

    mem_cgroup_soft_reclaim_eligible then uses this information to better
    decide whether to skip just the node or the whole subtree. The rule is
    simple: skip only the node if it has a child in excess, otherwise skip
    the whole subtree.
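    A condensed sketch of the propagation on a state transition (the
    helper is the mem_cgroup_update_soft_limit mentioned in the cover
    letter; the exact field types are assumptions):

    static void mem_cgroup_update_soft_limit(struct mem_cgroup *memcg)
    {
        struct mem_cgroup *parent = memcg;
        bool over = res_counter_soft_limit_excess(&memcg->res) > 0;

        /* only act on a transition, as recorded in soft_contributed */
        if (over == memcg->soft_contributed)
            return;
        memcg->soft_contributed = over;

        while ((parent = parent_mem_cgroup(parent))) {
            if (over)
                atomic_inc(&parent->children_in_excess);
            else
                atomic_dec(&parent->children_in_excess);
        }
    }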

    This has been tested by a stream IO (dd if=/dev/zero of=file with
    4*MemTotal size) which is quite sensitive to overhead during reclaim. The
    load is running in a group with the soft limit set to 0, and also
    without any limit. Apart from that there was a hierarchy with ~500,
    2k and 8k groups (two groups on each level) without any pages in them.
    base denotes the kernel on which the whole series is based, rework is
    the kernel before this patch, and reworkoptim is with this patch
    applied:

    * Run with soft limit set to 0
    Elapsed
    0-0-limit/base: min: 88.21 max: 94.61 avg: 91.73 std: 2.65 runs: 3
    0-0-limit/rework: min: 76.05 [86.2%] max: 79.08 [83.6%] avg: 77.84 [84.9%] std: 1.30 runs: 3
    0-0-limit/reworkoptim: min: 77.98 [88.4%] max: 80.36 [84.9%] avg: 78.92 [86.0%] std: 1.03 runs: 3
    System
    0.5k-0-limit/base: min: 34.86 max: 36.42 avg: 35.89 std: 0.73 runs: 3
    0.5k-0-limit/rework: min: 43.26 [124.1%] max: 48.95 [134.4%] avg: 46.09 [128.4%] std: 2.32 runs: 3
    0.5k-0-limit/reworkoptim: min: 46.98 [134.8%] max: 50.98 [140.0%] avg: 48.49 [135.1%] std: 1.77 runs: 3
    Elapsed
    0.5k-0-limit/base: min: 88.50 max: 97.52 avg: 93.92 std: 3.90 runs: 3
    0.5k-0-limit/rework: min: 75.92 [85.8%] max: 78.45 [80.4%] avg: 77.34 [82.3%] std: 1.06 runs: 3
    0.5k-0-limit/reworkoptim: min: 75.79 [85.6%] max: 79.37 [81.4%] avg: 77.55 [82.6%] std: 1.46 runs: 3
    System
    2k-0-limit/base: min: 34.57 max: 37.65 avg: 36.34 std: 1.30 runs: 3
    2k-0-limit/rework: min: 64.17 [185.6%] max: 68.20 [181.1%] avg: 66.21 [182.2%] std: 1.65 runs: 3
    2k-0-limit/reworkoptim: min: 49.78 [144.0%] max: 52.99 [140.7%] avg: 51.00 [140.3%] std: 1.42 runs: 3
    Elapsed
    2k-0-limit/base: min: 92.61 max: 97.83 avg: 95.03 std: 2.15 runs: 3
    2k-0-limit/rework: min: 78.33 [84.6%] max: 84.08 [85.9%] avg: 81.09 [85.3%] std: 2.35 runs: 3
    2k-0-limit/reworkoptim: min: 75.72 [81.8%] max: 78.57 [80.3%] avg: 76.73 [80.7%] std: 1.30 runs: 3
    System
    8k-0-limit/base: min: 39.78 max: 42.09 avg: 41.09 std: 0.97 runs: 3
    8k-0-limit/rework: min: 200.86 [504.9%] max: 265.42 [630.6%] avg: 241.80 [588.5%] std: 29.06 runs: 3
    8k-0-limit/reworkoptim: min: 53.70 [135.0%] max: 54.89 [130.4%] avg: 54.43 [132.5%] std: 0.52 runs: 3
    Elapsed
    8k-0-limit/base: min: 95.11 max: 98.61 avg: 96.81 std: 1.43 runs: 3
    8k-0-limit/rework: min: 246.96 [259.7%] max: 331.47 [336.1%] avg: 301.32 [311.2%] std: 38.52 runs: 3
    8k-0-limit/reworkoptim: min: 76.79 [80.7%] max: 81.71 [82.9%] avg: 78.97 [81.6%] std: 2.05 runs: 3

    System time is increased by 30-40% but it is reduced a lot compared to
    the kernel without this patch. The higher time can be explained by the fact
    that the original soft reclaim scanned at priority 0 so it was much more
    effective for this workload (which is basically touch once and writeback).
    The Elapsed time looks better though (~20%).

    * Run with no soft limit set
    System
    0-no-limit/base: min: 42.18 max: 50.38 avg: 46.44 std: 3.36 runs: 3
    0-no-limit/rework: min: 40.57 [96.2%] max: 47.04 [93.4%] avg: 43.82 [94.4%] std: 2.64 runs: 3
    0-no-limit/reworkoptim: min: 40.45 [95.9%] max: 45.28 [89.9%] avg: 42.10 [90.7%] std: 2.25 runs: 3
    Elapsed
    0-no-limit/base: min: 75.97 max: 78.21 avg: 76.87 std: 0.96 runs: 3
    0-no-limit/rework: min: 75.59 [99.5%] max: 80.73 [103.2%] avg: 77.64 [101.0%] std: 2.23 runs: 3
    0-no-limit/reworkoptim: min: 77.85 [102.5%] max: 82.42 [105.4%] avg: 79.64 [103.6%] std: 1.99 runs: 3
    System
    0.5k-no-limit/base: min: 44.54 max: 46.93 avg: 46.12 std: 1.12 runs: 3
    0.5k-no-limit/rework: min: 42.09 [94.5%] max: 46.16 [98.4%] avg: 43.92 [95.2%] std: 1.69 runs: 3
    0.5k-no-limit/reworkoptim: min: 42.47 [95.4%] max: 45.67 [97.3%] avg: 44.06 [95.5%] std: 1.31 runs: 3
    Elapsed
    0.5k-no-limit/base: min: 78.26 max: 81.49 avg: 79.65 std: 1.36 runs: 3
    0.5k-no-limit/rework: min: 77.01 [98.4%] max: 80.43 [98.7%] avg: 78.30 [98.3%] std: 1.52 runs: 3
    0.5k-no-limit/reworkoptim: min: 76.13 [97.3%] max: 77.87 [95.6%] avg: 77.18 [96.9%] std: 0.75 runs: 3
    System
    2k-no-limit/base: min: 62.96 max: 69.14 avg: 66.14 std: 2.53 runs: 3
    2k-no-limit/rework: min: 76.01 [120.7%] max: 81.06 [117.2%] avg: 78.17 [118.2%] std: 2.12 runs: 3
    2k-no-limit/reworkoptim: min: 62.57 [99.4%] max: 66.10 [95.6%] avg: 64.53 [97.6%] std: 1.47 runs: 3
    Elapsed
    2k-no-limit/base: min: 76.47 max: 84.22 avg: 79.12 std: 3.60 runs: 3
    2k-no-limit/rework: min: 89.67 [117.3%] max: 93.26 [110.7%] avg: 91.10 [115.1%] std: 1.55 runs: 3
    2k-no-limit/reworkoptim: min: 76.94 [100.6%] max: 79.21 [94.1%] avg: 78.45 [99.2%] std: 1.07 runs: 3
    System
    8k-no-limit/base: min: 104.74 max: 151.34 avg: 129.21 std: 19.10 runs: 3
    8k-no-limit/rework: min: 205.23 [195.9%] max: 285.94 [188.9%] avg: 258.98 [200.4%] std: 38.01 runs: 3
    8k-no-limit/reworkoptim: min: 161.16 [153.9%] max: 184.54 [121.9%] avg: 174.52 [135.1%] std: 9.83 runs: 3
    Elapsed
    8k-no-limit/base: min: 125.43 max: 181.00 avg: 154.81 std: 22.80 runs: 3
    8k-no-limit/rework: min: 254.05 [202.5%] max: 355.67 [196.5%] avg: 321.46 [207.6%] std: 47.67 runs: 3
    8k-no-limit/reworkoptim: min: 193.77 [154.5%] max: 222.72 [123.0%] avg: 210.18 [135.8%] std: 12.13 runs: 3

    Both System and Elapsed are within stdev of the base kernel for all
    configurations except for 8k, where both System and Elapsed are up by
    35%. I do not have a good explanation for this because there is no
    soft reclaim pass going on, as no group is above the limit, which is
    checked in mem_cgroup_should_soft_reclaim.

    Then I tested a kernel build with the same configuration to see the
    behavior with a more general workload.

    * Soft limit set to 0 for the build
    System
    0-0-limit/base: min: 242.70 max: 245.17 avg: 243.85 std: 1.02 runs: 3
    0-0-limit/rework min: 237.86 [98.0%] max: 240.22 [98.0%] avg: 239.00 [98.0%] std: 0.97 runs: 3
    0-0-limit/reworkoptim: min: 241.11 [99.3%] max: 243.53 [99.3%] avg: 242.01 [99.2%] std: 1.08 runs: 3
    Elapsed
    0-0-limit/base: min: 348.48 max: 360.86 avg: 356.04 std: 5.41 runs: 3
    0-0-limit/rework min: 286.95 [82.3%] max: 290.26 [80.4%] avg: 288.27 [81.0%] std: 1.43 runs: 3
    0-0-limit/reworkoptim: min: 286.55 [82.2%] max: 289.00 [80.1%] avg: 287.69 [80.8%] std: 1.01 runs: 3
    System
    0.5k-0-limit/base: min: 251.77 max: 254.41 avg: 252.70 std: 1.21 runs: 3
    0.5k-0-limit/rework min: 286.44 [113.8%] max: 289.30 [113.7%] avg: 287.60 [113.8%] std: 1.23 runs: 3
    0.5k-0-limit/reworkoptim: min: 252.18 [100.2%] max: 253.16 [99.5%] avg: 252.62 [100.0%] std: 0.41 runs: 3
    Elapsed
    0.5k-0-limit/base: min: 347.83 max: 353.06 avg: 350.04 std: 2.21 runs: 3
    0.5k-0-limit/rework min: 290.19 [83.4%] max: 295.62 [83.7%] avg: 293.12 [83.7%] std: 2.24 runs: 3
    0.5k-0-limit/reworkoptim: min: 293.91 [84.5%] max: 294.87 [83.5%] avg: 294.29 [84.1%] std: 0.42 runs: 3
    System
    2k-0-limit/base: min: 263.05 max: 271.52 avg: 267.94 std: 3.58 runs: 3
    2k-0-limit/rework min: 458.99 [174.5%] max: 468.31 [172.5%] avg: 464.45 [173.3%] std: 3.97 runs: 3
    2k-0-limit/reworkoptim: min: 267.10 [101.5%] max: 279.38 [102.9%] avg: 272.78 [101.8%] std: 5.05 runs: 3
    Elapsed
    2k-0-limit/base: min: 372.33 max: 379.32 avg: 375.47 std: 2.90 runs: 3
    2k-0-limit/rework min: 334.40 [89.8%] max: 339.52 [89.5%] avg: 337.44 [89.9%] std: 2.20 runs: 3
    2k-0-limit/reworkoptim: min: 301.47 [81.0%] max: 319.19 [84.1%] avg: 307.90 [82.0%] std: 8.01 runs: 3
    System
    8k-0-limit/base: min: 320.50 max: 332.10 avg: 325.46 std: 4.88 runs: 3
    8k-0-limit/rework min: 1115.76 [348.1%] max: 1165.66 [351.0%] avg: 1132.65 [348.0%] std: 23.34 runs: 3
    8k-0-limit/reworkoptim: min: 403.75 [126.0%] max: 409.22 [123.2%] avg: 406.16 [124.8%] std: 2.28 runs: 3
    Elapsed
    8k-0-limit/base: min: 475.48 max: 585.19 avg: 525.54 std: 45.30 runs: 3
    8k-0-limit/rework min: 616.25 [129.6%] max: 625.90 [107.0%] avg: 620.68 [118.1%] std: 3.98 runs: 3
    8k-0-limit/reworkoptim: min: 420.18 [88.4%] max: 428.28 [73.2%] avg: 423.05 [80.5%] std: 3.71 runs: 3

    Apart from 8k the system time is comparable with the base kernel while
    Elapsed is up to 20% better with all configurations.

    * No soft limit set
    System
    0-no-limit/base: min: 234.76 max: 237.42 avg: 236.25 std: 1.11 runs: 3
    0-no-limit/rework min: 233.09 [99.3%] max: 238.65 [100.5%] avg: 236.09 [99.9%] std: 2.29 runs: 3
    0-no-limit/reworkoptim: min: 236.12 [100.6%] max: 240.53 [101.3%] avg: 237.94 [100.7%] std: 1.88 runs: 3
    Elapsed
    0-no-limit/base: min: 288.52 max: 295.42 avg: 291.29 std: 2.98 runs: 3
    0-no-limit/rework min: 283.17 [98.1%] max: 284.33 [96.2%] avg: 283.78 [97.4%] std: 0.48 runs: 3
    0-no-limit/reworkoptim: min: 288.50 [100.0%] max: 290.79 [98.4%] avg: 289.78 [99.5%] std: 0.95 runs: 3
    System
    0.5k-no-limit/base: min: 286.51 max: 293.23 avg: 290.21 std: 2.78 runs: 3
    0.5k-no-limit/rework min: 291.69 [101.8%] max: 294.38 [100.4%] avg: 292.97 [101.0%] std: 1.10 runs: 3
    0.5k-no-limit/reworkoptim: min: 277.05 [96.7%] max: 288.76 [98.5%] avg: 284.17 [97.9%] std: 5.11 runs: 3
    Elapsed
    0.5k-no-limit/base: min: 294.94 max: 298.92 avg: 296.47 std: 1.75 runs: 3
    0.5k-no-limit/rework min: 292.55 [99.2%] max: 294.21 [98.4%] avg: 293.55 [99.0%] std: 0.72 runs: 3
    0.5k-no-limit/reworkoptim: min: 294.41 [99.8%] max: 301.67 [100.9%] avg: 297.78 [100.4%] std: 2.99 runs: 3
    System
    2k-no-limit/base: min: 443.41 max: 466.66 avg: 457.66 std: 10.19 runs: 3
    2k-no-limit/rework min: 490.11 [110.5%] max: 516.02 [110.6%] avg: 501.42 [109.6%] std: 10.83 runs: 3
    2k-no-limit/reworkoptim: min: 435.25 [98.2%] max: 458.11 [98.2%] avg: 446.73 [97.6%] std: 9.33 runs: 3
    Elapsed
    2k-no-limit/base: min: 330.85 max: 333.75 avg: 332.52 std: 1.23 runs: 3
    2k-no-limit/rework min: 343.06 [103.7%] max: 349.59 [104.7%] avg: 345.95 [104.0%] std: 2.72 runs: 3
    2k-no-limit/reworkoptim: min: 330.01 [99.7%] max: 333.92 [100.1%] avg: 332.22 [99.9%] std: 1.64 runs: 3
    System
    8k-no-limit/base: min: 1175.64 max: 1259.38 avg: 1222.39 std: 34.88 runs: 3
    8k-no-limit/rework min: 1226.31 [104.3%] max: 1241.60 [98.6%] avg: 1233.74 [100.9%] std: 6.25 runs: 3
    8k-no-limit/reworkoptim: min: 1023.45 [87.1%] max: 1056.74 [83.9%] avg: 1038.92 [85.0%] std: 13.69 runs: 3
    Elapsed
    8k-no-limit/base: min: 613.36 max: 619.60 avg: 616.47 std: 2.55 runs: 3
    8k-no-limit/rework min: 627.56 [102.3%] max: 642.33 [103.7%] avg: 633.44 [102.8%] std: 6.39 runs: 3
    8k-no-limit/reworkoptim: min: 545.89 [89.0%] max: 555.36 [89.6%] avg: 552.06 [89.6%] std: 4.37 runs: 3

    and these numbers look good as well. System time is around 100%
    (surprisingly better for the 8k case) and Elapsed copies that trend.

    Signed-off-by: Michal Hocko
    Cc: Balbir Singh
    Cc: Glauber Costa
    Cc: Greg Thelen
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Cc: Michel Lespinasse
    Cc: Tejun Heo
    Cc: Ying Han
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • The caller of the iterator might know that some nodes or even subtrees
    should be skipped, but there is no way to tell iterators about that,
    so the only choice left is to let iterators visit each node and do the
    selection outside of the iterating code. This, however, doesn't scale
    well with hierarchies with many groups where only a few groups are
    interesting.

    This patch adds a mem_cgroup_iter_cond variant of the iterator with a
    callback which gets called for every visited node. There are three
    possible ways the callback can influence the walk: the node is
    visited; the node is skipped but the tree walk continues down the
    tree; or the whole subtree of the current group is skipped.
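    Schematically the new variant looks something like this (the callback
    type and the return values are illustrative, chosen to match the
    three outcomes described above):

    enum mem_cgroup_filter_t {
        VISIT,          /* visit the node */
        SKIP,           /* skip the node but keep walking below it */
        SKIP_TREE,      /* skip the node and its whole subtree */
    };

    typedef enum mem_cgroup_filter_t
        (*mem_cgroup_iter_filter)(struct mem_cgroup *memcg,
                                  struct mem_cgroup *root);

    struct mem_cgroup *mem_cgroup_iter_cond(struct mem_cgroup *root,
                                            struct mem_cgroup *prev,
                                            struct mem_cgroup_reclaim_cookie *reclaim,
                                            mem_cgroup_iter_filter cond);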

    [hughd@google.com: fix memcg-less page reclaim]
    Signed-off-by: Michal Hocko
    Cc: Balbir Singh
    Cc: Glauber Costa
    Cc: Greg Thelen
    Cc: Johannes Weiner
    Cc: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Cc: Michel Lespinasse
    Cc: Tejun Heo
    Cc: Ying Han
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Soft reclaim has been done only for the global reclaim (both background
    and direct). Since "memcg: integrate soft reclaim tighter with zone
    shrinking code" there is no reason for this limitation anymore as the soft
    limit reclaim doesn't use any special code paths and it is a part of the
    zone shrinking code which is used by both global and targeted reclaims.

    From the semantic point of view it is natural to consider the soft
    limit before touching all groups in the hierarchy tree of whichever
    group is hitting its hard limit, because the soft limit tells us where
    to push back when there is memory pressure. It is not important
    whether the pressure comes from the limit or from imbalanced zones.

    This patch simply enables soft reclaim unconditionally in
    mem_cgroup_should_soft_reclaim so it is enabled for both global and
    targeted reclaim paths. mem_cgroup_soft_reclaim_eligible needs to learn
    about the root of the reclaim to know where to stop checking soft limit
    state of parents up the hierarchy. Say we have

    A (over soft limit)
     \
      B (below s.l., hit the hard limit)
     / \
    C   D (below s.l.)

    B is now the source of the outside memory pressure for D, but we
    shouldn't soft reclaim D because it is behaving well under the B
    subtree and we can still reclaim from C (presumably it is over the
    limit). mem_cgroup_soft_reclaim_eligible should therefore stop
    climbing up the hierarchy at B (the root of the memory pressure).

    Signed-off-by: Michal Hocko
    Reviewed-by: Glauber Costa
    Reviewed-by: Tejun Heo
    Cc: Balbir Singh
    Cc: Greg Thelen
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Cc: Michel Lespinasse
    Cc: Ying Han
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Now that the soft limit is integrated to the reclaim directly the whole
    soft-limit tree infrastructure is not needed anymore. Rip it out.

    Signed-off-by: Michal Hocko
    Reviewed-by: Glauber Costa
    Reviewed-by: Tejun Heo
    Cc: Balbir Singh
    Cc: Greg Thelen
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Cc: Michel Lespinasse
    Cc: Ying Han
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • This patchset has been sitting out of tree for quite some time without
    any objections. I would be really happy if it made it into 3.12. I do
    not want to push it too hard, but I think this work is basically ready
    and waiting longer doesn't help.

    The basic idea is quite simple. Pull soft reclaim into shrink_zone in the
    first step and get rid of the previous soft reclaim infrastructure.
    shrink_zone is done in two passes now. First it tries to do the soft
    limit reclaim and it falls back to reclaim-all mode if no group is over
    the limit or no pages have been scanned. The second pass happens at the
    same priority so the only time we waste is the memcg tree walk which has
    been updated in the third step to have only negligible overhead.

    As a bonus we will get rid of a _lot_ of code by this and soft reclaim
    will not stand out like before when it wasn't integrated into the zone
    shrinking code and it reclaimed at priority 0 (the testing results show
    that some workloads suffer from such an aggressive reclaim). The clean
    up is in a separate patch because I felt it would be easier to review that
    way.

    The second step is soft limit reclaim integration into targeted reclaim.
    It should be rather straightforward. Soft limit has been used only for
    the global reclaim so far but it makes sense for any kind of pressure
    coming from up-the-hierarchy, including targeted reclaim.

    The third step (patches 4-8) addresses the tree walk overhead by enhancing
    memcg iterators to enable skipping whole subtrees and tracking number of
    over soft limit children at each level of the hierarchy. This information
    is updated the same way the old soft limit tree was updated (from
    memcg_check_events) so we shouldn't see an additional overhead. In fact
    mem_cgroup_update_soft_limit is much simpler than tree manipulation done
    previously.

    __shrink_zone uses mem_cgroup_soft_reclaim_eligible as a predicate for
    mem_cgroup_iter so the decision whether a particular group should be
    visited is done at the iterator level which allows us to decide to skip
    the whole subtree as well (if there is no child in excess). This reduces
    the tree walk overhead considerably.

    * TEST 1
    ========

    My primary test case was a parallel kernel build with 2 groups (each
    make is running with -j8 with a distribution .config in a separate
    cgroup without any hard limit) on a 32 CPU machine booted with 1GB
    memory, and both builds run with taskset bound to Node 0 cpus.

    I was mostly interested in 2 setups: the default (no soft limit set),
    and a soft limit of 0 set on both groups. The first one should tell us
    whether the rework regresses the default behavior while the second one
    should show us improvements in an extreme case where both workloads
    are always over the soft limit.

    /usr/bin/time -v has been used to collect the statistics and each
    configuration had 3 runs after fresh boot without any other load on the
    system.

    base is mmotm-2013-07-18-16-40
    rework all 8 patches applied on top of base

    * No-limit
    User
    no-limit/base: min: 651.92 max: 672.65 avg: 664.33 std: 8.01 runs: 6
    no-limit/rework: min: 657.34 [100.8%] max: 668.39 [99.4%] avg: 663.13 [99.8%] std: 3.61 runs: 6
    System
    no-limit/base: min: 69.33 max: 71.39 avg: 70.32 std: 0.79 runs: 6
    no-limit/rework: min: 69.12 [99.7%] max: 71.05 [99.5%] avg: 70.04 [99.6%] std: 0.59 runs: 6
    Elapsed
    no-limit/base: min: 398.27 max: 422.36 avg: 408.85 std: 7.74 runs: 6
    no-limit/rework: min: 386.36 [97.0%] max: 438.40 [103.8%] avg: 416.34 [101.8%] std: 18.85 runs: 6

    The results are within noise. Elapsed time has a bigger variance but the
    average looks good.

    * 0-limit
    User
    0-limit/base: min: 573.76 max: 605.63 avg: 585.73 std: 12.21 runs: 6
    0-limit/rework: min: 645.77 [112.6%] max: 666.25 [110.0%] avg: 656.97 [112.2%] std: 7.77 runs: 6
    System
    0-limit/base: min: 69.57 max: 71.13 avg: 70.29 std: 0.54 runs: 6
    0-limit/rework: min: 68.68 [98.7%] max: 71.40 [100.4%] avg: 69.91 [99.5%] std: 0.87 runs: 6
    Elapsed
    0-limit/base: min: 1306.14 max: 1550.17 avg: 1430.35 std: 90.86 runs: 6
    0-limit/rework: min: 404.06 [30.9%] max: 465.94 [30.1%] avg: 434.81 [30.4%] std: 22.68 runs: 6

    The improvement is really huge here (even bigger than with my previous
    testing and I suspect that this highly depends on the storage). Page
    fault statistics tell us at least part of the story:

    Minor
    0-limit/base: min: 37180461.00 max: 37319986.00 avg: 37247470.00 std: 54772.71 runs: 6
    0-limit/rework: min: 36751685.00 [98.8%] max: 36805379.00 [98.6%] avg: 36774506.33 [98.7%] std: 17109.03 runs: 6
    Major
    0-limit/base: min: 170604.00 max: 221141.00 avg: 196081.83 std: 18217.01 runs: 6
    0-limit/rework: min: 2864.00 [1.7%] max: 10029.00 [4.5%] avg: 5627.33 [2.9%] std: 2252.71 runs: 6

    Same as with my previous testing Minor faults are more or less within
    noise but the Major fault count is way below the base kernel.

    While this looks like a nice win, it is fair to say that the 0-limit
    configuration is quite artificial, so I was playing with 0-no-limit
    loads as well.

    * TEST 2
    ========

    The following results are from 2 groups configuration on a 16GB machine
    (single NUMA node).

    - A running stream IO (dd if=/dev/zero of=local.file bs=1024) with
    2*TotalMem with 0 soft limit.
    - B running a mem_eater which consumes TotalMem-1G without any limit. The
    mem_eater consumes the memory in 100 chunks with 1s nap after each
    mmap+populate so that both loads have a chance to fight for the memory.

    The expected result is that B shouldn't be reclaimed and A shouldn't see
    a big dropdown in elapsed time.

    User
    base: min: 2.68 max: 2.89 avg: 2.76 std: 0.09 runs: 3
    rework: min: 3.27 [122.0%] max: 3.74 [129.4%] avg: 3.44 [124.6%] std: 0.21 runs: 3
    System
    base: min: 86.26 max: 88.29 avg: 87.28 std: 0.83 runs: 3
    rework: min: 81.05 [94.0%] max: 84.96 [96.2%] avg: 83.14 [95.3%] std: 1.61 runs: 3
    Elapsed
    base: min: 317.28 max: 332.39 avg: 325.84 std: 6.33 runs: 3
    rework: min: 281.53 [88.7%] max: 298.16 [89.7%] avg: 290.99 [89.3%] std: 6.98 runs: 3

    System time improved slightly as well as Elapsed. My previous testing
    has shown worse numbers but this again seems to depend on the storage
    speed.

    My theory is that the writeback doesn't catch up and prio-0 soft reclaim
    falls into wait on writeback page too often in the base kernel. The
    patched kernel doesn't do that because the soft reclaim is done from the
    kswapd/direct reclaim context. This can be seen nicely on the
    following graph: the A group's usage_in_bytes regularly drops really low.

    All 3 runs
    http://labs.suse.cz/mhocko/soft_limit_rework/stream_io-vs-mem_eater/stream.png
    resp. a detail of the single run
    http://labs.suse.cz/mhocko/soft_limit_rework/stream_io-vs-mem_eater/stream-one-run.png

    mem_eater seems to be doing better as well. It gets to the full
    allocation size faster as can be seen on the following graph:
    http://labs.suse.cz/mhocko/soft_limit_rework/stream_io-vs-mem_eater/mem_eater-one-run.png

    /proc/meminfo collected during the test also shows that rework kernel
    hasn't swapped that much (well almost not at all):
    base: max: 123900 K avg: 56388.29 K
    rework: max: 300 K avg: 128.68 K

    kswapd and direct reclaim statistics are unfortunately of no use because
    soft reclaim is not accounted properly as the counters are hidden by
    global_reclaim() checks in the base kernel.

    * TEST 3
    ========

    Another test was the same configuration as TEST2 except the stream IO was
    replaced by a single kbuild (16 parallel jobs bound to Node0 cpus same as
    in TEST1) and mem_eater allocated TotalMem-200M so kbuild had only 200MB
    left.

    Kbuild did better with the rework kernel here as well:
    User
    base: min: 860.28 max: 872.86 avg: 868.03 std: 5.54 runs: 3
    rework: min: 880.81 [102.4%] max: 887.45 [101.7%] avg: 883.56 [101.8%] std: 2.83 runs: 3
    System
    base: min: 84.35 max: 85.06 avg: 84.79 std: 0.31 runs: 3
    rework: min: 85.62 [101.5%] max: 86.09 [101.2%] avg: 85.79 [101.2%] std: 0.21 runs: 3
    Elapsed
    base: min: 135.36 max: 243.30 avg: 182.47 std: 45.12 runs: 3
    rework: min: 110.46 [81.6%] max: 116.20 [47.8%] avg: 114.15 [62.6%] std: 2.61 runs: 3
    Minor
    base: min: 36635476.00 max: 36673365.00 avg: 36654812.00 std: 15478.03 runs: 3
    rework: min: 36639301.00 [100.0%] max: 36695541.00 [100.1%] avg: 36665511.00 [100.0%] std: 23118.23 runs: 3
    Major
    base: min: 14708.00 max: 53328.00 avg: 31379.00 std: 16202.24 runs: 3
    rework: min: 302.00 [2.1%] max: 414.00 [0.8%] avg: 366.33 [1.2%] std: 47.22 runs: 3

    Again we can see a significant improvement in Elapsed (it also seems
    to be more stable); there is a huge dropdown for the Major page faults
    and much less swapping:
    base: max: 583736 K avg: 112547.43 K
    rework: max: 4012 K avg: 124.36 K

    Graphs from all three runs show the variability of the kbuild quite
    nicely. It even seems that it took longer after every run with the base
    kernel which would be quite surprising as the source tree for the build is
    removed and caches are dropped after each run, so the build operates
    on freshly extracted sources every time.
    http://labs.suse.cz/mhocko/soft_limit_rework/stream_io-vs-mem_eater/kbuild-mem_eater.png

    My other testing shows that this is just a matter of timing and other
    runs behave differently; the std for Elapsed time is similar, ~50.
    Example of three other runs:
    http://labs.suse.cz/mhocko/soft_limit_rework/stream_io-vs-mem_eater/kbuild-mem_eater2.png

    So to wrap this up: the series is still doing well and improves the
    soft limit reclaim.

    The testing results for bunch of cgroups with both stream IO and kbuild
    loads can be found in "memcg: track children in soft limit excess to
    improve soft limit".

    This patch:

    Memcg soft reclaim has been traditionally triggered from the global
    reclaim paths before calling shrink_zone. mem_cgroup_soft_limit_reclaim
    then picked up a group which exceeds the soft limit the most and reclaimed
    it with 0 priority to reclaim at least SWAP_CLUSTER_MAX pages.

    The infrastructure requires per-node-zone trees which hold over-limit
    groups and keep them up-to-date (via memcg_check_events) which is not cost
    free. Although this overhead hasn't turned out to be a bottleneck, the
    implementation is suboptimal because mem_cgroup_update_tree has no
    idea which zones consumed memory over the limit, so we could easily
    end up with a group on a node-zone tree that has only a few pages from
    that node-zone.

    This patch doesn't try to fix node-zone tree management because
    integrating soft reclaim into zone shrinking seems much easier and
    more appropriate for several reasons. First of all, 0 priority reclaim was
    a crude hack which might lead to big stalls if the group's LRUs are big
    and hard to reclaim (e.g. a lot of dirty/writeback pages). Soft reclaim
    should be applicable also to the targeted reclaim which is awkward right
    now without additional hacks. Last but not least the whole infrastructure
    eats quite some code.

    After this patch shrink_zone is done in 2 passes. First it tries to do
    the soft reclaim if appropriate (only for global reclaim for now, to
    stay compatible with the original state) and falls back to ignoring
    the soft limit if no group is eligible for soft reclaim or nothing has
    been scanned during the first pass. Only groups which are over their
    soft limit, or which have a parent over the limit up the hierarchy,
    are considered eligible during the first pass.

    The soft limit tree, which is not necessary anymore, will be removed
    in the follow-up patch to make this patch smaller and easier to review.
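    The resulting structure of shrink_zone is roughly (a condensed
    sketch, not the full patch):

    static void shrink_zone(struct zone *zone, struct scan_control *sc)
    {
        bool do_soft_reclaim = mem_cgroup_should_soft_reclaim(sc);
        unsigned long nr_scanned = sc->nr_scanned;

        /* first pass: only groups over their soft limit (or with such a
         * parent up the hierarchy) are eligible */
        __shrink_zone(zone, sc, do_soft_reclaim);

        /* nothing eligible or nothing scanned: repeat at the same
         * priority while ignoring the soft limit */
        if (do_soft_reclaim && sc->nr_scanned == nr_scanned)
            __shrink_zone(zone, sc, false);
    }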

    Signed-off-by: Michal Hocko
    Reviewed-by: Glauber Costa
    Reviewed-by: Tejun Heo
    Cc: Johannes Weiner
    Cc: KAMEZAWA Hiroyuki
    Cc: Ying Han
    Cc: Hugh Dickins
    Cc: Michel Lespinasse
    Cc: Greg Thelen
    Cc: KOSAKI Motohiro
    Cc: Balbir Singh
    Cc: Glauber Costa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • vfs guarantees the cgroup won't be destroyed, so it's redundant to get a
    css reference.

    Signed-off-by: Li Zefan
    Acked-by: Michal Hocko
    Cc: KAMEZAWA Hiroyuki
    Cc: Johannes Weiner
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li Zefan
     

12 Sep, 2013

2 commits

  • A memory cgroup with (1) multiple threshold notifications and (2) at least
    one threshold >=2G was not reliable. Specifically the notifications would
    either not fire or would not fire in the proper order.

    The __mem_cgroup_threshold() signaling logic depends on keeping 64 bit
    thresholds in sorted order. mem_cgroup_usage_register_event() sorts them
    with compare_thresholds(), which returns the difference of two 64 bit
    thresholds as an int. If the difference is positive but has bit[31] set,
    then sort() treats the difference as negative and breaks sort order.

    This fix compares the two arbitrary 64 bit thresholds returning the
    classic -1, 0, 1 result.
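    The fixed comparator is essentially of this shape:

    static int compare_thresholds(const void *a, const void *b)
    {
        const struct mem_cgroup_threshold *_a = a;
        const struct mem_cgroup_threshold *_b = b;

        /* never truncate the u64 difference to an int */
        if (_a->threshold > _b->threshold)
            return 1;
        if (_a->threshold < _b->threshold)
            return -1;
        return 0;
    }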

    The test below sets two notifications (at 0x1000 and 0x81001000):
    cd /sys/fs/cgroup/memory
    mkdir x
    for x in 4096 2164264960; do
    cgroup_event_listener x/memory.usage_in_bytes $x | sed "s/^/$x listener:/" &
    done
    echo $$ > x/cgroup.procs
    anon_leaker 500M

    v3.11-rc7 fails to signal the 4096 event listener:
    Leaking...
    Done leaking pages.

    Patched v3.11-rc7 properly notifies:
    Leaking...
    4096 listener:2013:8:31:14:13:36
    Done leaking pages.

    The fixed bug is old. It appears to date back to the introduction of
    memcg threshold notifications in v2.6.34-rc1-116-g2e72b6347c94 "memcg:
    implement memory thresholds"

    Signed-off-by: Greg Thelen
    Acked-by: Michal Hocko
    Acked-by: Kirill A. Shutemov
    Acked-by: Johannes Weiner
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Greg Thelen
     
  • The memcg_cache_params structure contains the common part and the union,
    which represents two different types of data: one for root caches and
    another for child caches.

    The size of the child data is fixed. The size of the memcg_caches
    array is calculated at runtime.

    Currently the size of memcg_cache_params for root caches is calculated
    incorrectly, because it includes the size of parameters for child caches.

    ssize_t size = memcg_caches_array_size(num_groups);
    size *= sizeof(void *);

    size += sizeof(struct memcg_cache_params);

    v2: Fix a typo in calculations

    Signed-off-by: Andrey Vagin
    Cc: Glauber Costa
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Balbir Singh
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Vagin
     

04 Sep, 2013

1 commit

  • Pull cgroup updates from Tejun Heo:
    "A lot of activities on the cgroup front. Most changes aren't visible
    to userland at all at this point and are laying foundation for the
    planned unified hierarchy.

    - The biggest change is decoupling the lifetime management of css
    (cgroup_subsys_state) from that of cgroup's. Because controllers
    (cpu, memory, block and so on) will need to be dynamically enabled
    and disabled, css which is the association point between a cgroup
    and a controller may come and go dynamically across the lifetime of
    a cgroup. Till now, css's were created when the associated cgroup
    was created and stayed till the cgroup got destroyed.

    Assumptions around this tight coupling permeated through cgroup
    core and controllers. These assumptions are gradually removed,
    which makes up the bulk of the patches, and the css destruction path is
    completely decoupled from cgroup destruction path. Note that
    decoupling of creation path is relatively easy on top of these
    changes and the patchset is pending for the next window.

    - cgroup has its own event mechanism cgroup.event_control, which is
    only used by memcg. It is overly complex trying to achieve high
    flexibility whose benefits seem dubious at best. Going forward,
    new events will simply generate file modified event and the
    existing mechanism is being made specific to memcg. This pull
    request contains preparatory patches for such a change.

    - Various fixes and cleanups"

    Fixed up conflict in kernel/cgroup.c as per Tejun.

    * 'for-3.12' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (69 commits)
    cgroup: fix cgroup_css() invocation in css_from_id()
    cgroup: make cgroup_write_event_control() use css_from_dir() instead of __d_cgrp()
    cgroup: make cgroup_event hold onto cgroup_subsys_state instead of cgroup
    cgroup: implement CFTYPE_NO_PREFIX
    cgroup: make cgroup_css() take cgroup_subsys * instead and allow NULL subsys
    cgroup: rename cgroup_css_from_dir() to css_from_dir() and update its syntax
    cgroup: fix cgroup_write_event_control()
    cgroup: fix subsystem file accesses on the root cgroup
    cgroup: change cgroup_from_id() to css_from_id()
    cgroup: use css_get() in cgroup_create() to check CSS_ROOT
    cpuset: remove an unncessary forward declaration
    cgroup: RCU protect each cgroup_subsys_state release
    cgroup: move subsys file removal to kill_css()
    cgroup: factor out kill_css()
    cgroup: decouple cgroup_subsys_state destruction from cgroup destruction
    cgroup: replace cgroup->css_kill_cnt with ->nr_css
    cgroup: bounce cgroup_subsys_state ref kill confirmation to a work item
    cgroup: move cgroup->subsys[] assignment to online_css()
    cgroup: reorganize css init / exit paths
    cgroup: add __rcu modifier to cgroup->subsys[]
    ...

    Linus Torvalds
     

24 Aug, 2013

1 commit

  • The swapaccount kernel parameter without any values has been removed by
    commit a2c8990aed5a ("memsw: remove noswapaccount kernel parameter") but
    it seems that we didn't get rid of all the left overs.

    Make sure that the menuconfig help text and kernel-parameters.txt are
    clear about the value for the parameter, and remove the stale comment
    which is not very useful on its own.

    Signed-off-by: Michal Hocko
    Reported-by: Gergely Risko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

14 Aug, 2013

1 commit

  • struct memcg_cache_params has a union. Different parts of this union
    are used for root and non-root caches. The part with the destroying
    work is used only for non-root caches.

    I fixed the same problem in another place v3.9-rc1-16204-gf101a94, but
    didn't notice this one.

    This patch fixes the kernel panic:

    [ 46.848187] BUG: unable to handle kernel paging request at 000000fffffffeb8
    [ 46.849026] IP: [] kmem_cache_destroy_memcg_children+0x6c/0xc0
    [ 46.849092] PGD 0
    [ 46.849092] Oops: 0000 [#1] SMP
    ...

    Signed-off-by: Andrey Vagin
    Cc: Glauber Costa
    Cc: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Balbir Singh
    Cc: KAMEZAWA Hiroyuki
    Cc: Konstantin Khlebnikov
    Cc: [3.9.x]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Vagin
     

09 Aug, 2013

1 commit

  • Previously, all css descendant iterators didn't include the origin
    (root of subtree) css in the iteration. The reasons were maintaining
    consistency with css_for_each_child() and that at the time of
    introduction more use cases needed skipping the origin anyway;
    however, given that css_is_descendant() considers self to be a
    descendant, omitting the origin css has become more confusing and
    looking at the accumulated use cases rather clearly indicates that
    including origin would result in simpler code overall.

    While this is a change which can easily lead to subtle bugs, cgroup
    API including the iterators has recently gone through major
    restructuring and no out-of-tree changes will be applicable without
    adjustments making this a relatively acceptable opportunity for this
    type of change.

    The conversions are mostly straight-forward. If the iteration block
    had explicit origin handling before or after, it's moved inside the
    iteration. If not, if (pos == origin) continue; is added. Some
    conversions add extra reference get/put around origin handling by
    consolidating origin handling and the rest. While the extra ref
    operations aren't strictly necessary, this shouldn't cause any
    noticeable difference.

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan
    Acked-by: Vivek Goyal
    Acked-by: Aristeu Rozanski
    Acked-by: Michal Hocko
    Cc: Jens Axboe
    Cc: Matt Helsley
    Cc: Johannes Weiner
    Cc: Balbir Singh

    Tejun Heo