04 Feb, 2016

1 commit

  • When vmpressure is called for the entire subtree under pressure we
    mistakenly use vmpressure->scanned instead of vmpressure->tree_scanned
    when checking if vmpressure work is to be scheduled. This results in
    suppressing all vmpressure events in the legacy cgroup hierarchy. Fix it.

    Fixes: 8e8ae645249b ("mm: memcontrol: hook up vmpressure to socket pressure")
    Signed-off-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
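
The effect of testing the wrong counter can be illustrated with a minimal userspace sketch. It is not the kernel code: the struct, the function names and the window size of 512 pages are assumptions made for illustration. The point is only that a subtree report advances the tree counters, so a threshold check against the per-group counter never fires.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumed window size in pages; the kernel defines its own value. */
#define SKETCH_WINDOW 512UL

/* Hypothetical stand-in for the counters kept in struct vmpressure. */
struct vmpr_sketch {
	unsigned long scanned;        /* per-group reports */
	unsigned long reclaimed;
	unsigned long tree_scanned;   /* subtree reports */
	unsigned long tree_reclaimed;
};

/* Subtree report, fixed: test the counter that was just advanced. */
static bool tree_report_fixed(struct vmpr_sketch *v, unsigned long scanned,
			      unsigned long reclaimed)
{
	assert(v != NULL);
	v->tree_scanned += scanned;
	v->tree_reclaimed += reclaimed;
	return v->tree_scanned >= SKETCH_WINDOW;
}

/* Subtree report, buggy: tests a counter this path never advances. */
static bool tree_report_buggy(struct vmpr_sketch *v, unsigned long scanned,
			      unsigned long reclaimed)
{
	assert(v != NULL);
	v->tree_scanned += scanned;
	v->tree_reclaimed += reclaimed;
	return v->scanned >= SKETCH_WINDOW;
}
```

A return value of true stands for "schedule the work item that sends events". With the buggy variant the function keeps returning false no matter how many subtree reports arrive.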
     

15 Jan, 2016

2 commits

  • A CONFIG_MEMCG=y kernel booted with "cgroup_disable=memory" crashes on a
    NULL memcg (but non-NULL root_mem_cgroup) when vmpressure kicks in.
    Here's the patch I use to avoid that, but you might prefer a test on
    mem_cgroup_disabled() somewhere.

    Signed-off-by: Hugh Dickins
    Acked-by: Johannes Weiner
    Cc: David S. Miller
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Let the networking stack know when a memcg is under reclaim pressure so
    that it can clamp its transmit windows accordingly.

    Whenever the reclaim efficiency of a cgroup's LRU lists drops low enough
    for a MEDIUM or HIGH vmpressure event to occur, assert a pressure state
    in the socket and tcp memory code that tells it to curb consumption
    growth from sockets associated with said control group.

    Traditionally, vmpressure reports for the entire subtree of a memcg
    under pressure, which drops useful information on the individual groups
    reclaimed. However, it's too late to change the user interface, so add a
    second reporting mode that reports on the level of reclaim instead of at
    the level of pressure, and use that report for sockets.

    vmpressure events are naturally edge triggered, so for hysteresis assert
    socket pressure for a second to allow for subsequent vmpressure events
    to occur before letting the socket code return to normal.

    This will likely need fine-tuning for a wider variety of workloads, but
    for now stick to the vmpressure presets and keep hysteresis simple.

    Signed-off-by: Johannes Weiner
    Acked-by: David S. Miller
    Reviewed-by: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
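
The one-second hysteresis described above can be sketched as a deadline that every qualifying event pushes forward. This is a userspace illustration only: the struct and function names are hypothetical, timestamps are plain millisecond counters supplied by the caller, and counter wraparound is ignored.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hold time from the description above: pressure is asserted for one second. */
#define HOLD_MS 1000UL

/* Hypothetical per-group state: the time until which pressure is asserted. */
struct sock_pressure_sketch {
	unsigned long until_ms;
};

/* Called for each qualifying pressure event; restarts the hold window. */
static void pressure_event(struct sock_pressure_sketch *s, unsigned long now_ms)
{
	assert(s != NULL);
	s->until_ms = now_ms + HOLD_MS;
}

/* What socket code could poll before letting its consumption grow. */
static bool under_pressure(const struct sock_pressure_sketch *s,
			   unsigned long now_ms)
{
	assert(s != NULL);
	return now_ms < s->until_ms;
}
```

Because the events are edge triggered, the state is never cleared explicitly; it simply expires unless a later event extends the deadline.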
     

03 Dec, 2014

1 commit

  • On some Android devices, a "divide by zero" exception occurs because
    vmpr->scanned can be zero before spin_lock(&vmpr->sr_lock).

    Addresses https://bugzilla.kernel.org/show_bug.cgi?id=88051

    [akpm@linux-foundation.org: neaten]
    Reported-by: ji_ang
    Cc: Anton Vorontsov
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
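
One way to guard against this kind of division by zero is to snapshot the counters and bail out when the window is empty. The sketch below is a single-threaded userspace illustration with hypothetical names; in the kernel the snapshot would be taken under vmpr->sr_lock, which the sketch only notes in a comment.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the accumulated reclaim counters. */
struct window_sketch {
	unsigned long scanned;
	unsigned long reclaimed;
};

/*
 * Snapshot and reset the counters. The kernel would hold sr_lock here;
 * this sketch is single-threaded, so no lock is taken.
 * Returns false for an empty window so the caller never divides by zero.
 */
static bool take_window(struct window_sketch *w, unsigned long *scanned,
			unsigned long *reclaimed)
{
	assert(w != NULL && scanned != NULL && reclaimed != NULL);
	if (w->scanned == 0)
		return false;
	*scanned = w->scanned;
	*reclaimed = w->reclaimed;
	w->scanned = 0;
	w->reclaimed = 0;
	return true;
}

/* Percentage of scanned pages that were reclaimed, or -1 if nothing was scanned. */
static long reclaimed_percent(struct window_sketch *w)
{
	unsigned long scanned, reclaimed;

	if (!take_window(w, &scanned, &reclaimed))
		return -1;
	return (long)(reclaimed * 100 / scanned);
}
```

The check is made on the same snapshot that is later used as the divisor, so a concurrent reset between check and use cannot reintroduce the zero.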
     

04 Feb, 2014

1 commit

  • arch/arm/mach-tegra/pm.c, kernel/power/console.c and mm/vmpressure.c
    were somehow getting slab.h indirectly through cgroup.h which in turn
    was getting it indirectly through xattr.h. A scheduled cgroup change
    drops xattr.h inclusion from cgroup.h and breaks compilation of these
    three files. Add explicit slab.h includes to the three files.

    A pending cgroup patch depends on this change and it'd be great if
    this can be routed through cgroup/for-3.14-fixes branch.

    Signed-off-by: Tejun Heo
    Acked-by: Stephen Warren
    Cc: Thierry Reding
    Cc: linux-tegra@vger.kernel.org
    Cc: "Rafael J. Wysocki"
    Cc: linux-pm@vger.kernel.org
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Balbir Singh
    Cc: KAMEZAWA Hiroyuki
    Cc: cgroups@vger.kernel.org

    Tejun Heo
     

23 Nov, 2013

2 commits

  • cgroup_event is now memcg specific. Replace cgroup_event->css with
    ->memcg and convert [un]register_event() callbacks to take mem_cgroup
    pointer instead of cgroup_subsys_state one. This simplifies the code
    slightly and makes css_to_vmpressure() unnecessary, so it is removed.

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan
    Acked-by: Kirill A. Shutemov
    Acked-by: Michal Hocko

    Tejun Heo
     
  • The only use of cgroup_event->cft is distinguishing "usage_in_bytes"
    and "memsw.usage_in_bytes" for mem_cgroup_usage_[un]register_event(),
    which can be done by adding an explicit argument to the function and
    implementing two wrappers so that the two cases can be distinguished
    from the function alone.

    Remove cgroup_event->cft and the related code including
    [un]register_events() methods.

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan
    Acked-by: Kirill A. Shutemov
    Acked-by: Michal Hocko

    Tejun Heo
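
The "explicit argument plus two wrappers" pattern from the entry above can be sketched as follows. All names here are hypothetical and the listener bookkeeping is reduced to counters; the sketch only shows how the callback identity, rather than a file-type pointer, selects the case.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical selector for the two usage files. */
enum res_sketch { RES_MEM, RES_MEMSWAP };

/* Hypothetical group state: number of listeners per usage file. */
struct memcg_sketch {
	int mem_listeners;
	int memsw_listeners;
};

/* Shared implementation: the counter kind is an explicit argument. */
static int usage_register(struct memcg_sketch *m, enum res_sketch type)
{
	assert(m != NULL);
	switch (type) {
	case RES_MEM:
		return ++m->mem_listeners;
	case RES_MEMSWAP:
		return ++m->memsw_listeners;
	}
	return -1;
}

/* Thin wrappers, one per file, so the function alone identifies the case. */
static int mem_usage_register(struct memcg_sketch *m)
{
	return usage_register(m, RES_MEM);
}

static int memsw_usage_register(struct memcg_sketch *m)
{
	return usage_register(m, RES_MEMSWAP);
}
```

Each wrapper can be installed as the callback for its own file, so the shared code no longer needs to inspect which file triggered it.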
     

04 Sep, 2013

1 commit

  • Pull cgroup updates from Tejun Heo:
    "A lot of activities on the cgroup front. Most changes aren't visible
    to userland at all at this point and are laying foundation for the
    planned unified hierarchy.

    - The biggest change is decoupling the lifetime management of css
    (cgroup_subsys_state) from that of cgroup's. Because controllers
    (cpu, memory, block and so on) will need to be dynamically enabled
    and disabled, css which is the association point between a cgroup
    and a controller may come and go dynamically across the lifetime of
    a cgroup. Till now, css's were created when the associated cgroup
    was created and stayed till the cgroup got destroyed.

    Assumptions around this tight coupling permeated through cgroup
    core and controllers. These assumptions are gradually removed,
    which constitutes the bulk of the patches, and css destruction path is
    completely decoupled from cgroup destruction path. Note that
    decoupling of creation path is relatively easy on top of these
    changes and the patchset is pending for the next window.

    - cgroup has its own event mechanism cgroup.event_control, which is
    only used by memcg. It is overly complex trying to achieve high
    flexibility whose benefits seem dubious at best. Going forward,
    new events will simply generate file modified event and the
    existing mechanism is being made specific to memcg. This pull
    request contains preparatory patches for such change.

    - Various fixes and cleanups"

    Fixed up conflict in kernel/cgroup.c as per Tejun.

    * 'for-3.12' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (69 commits)
    cgroup: fix cgroup_css() invocation in css_from_id()
    cgroup: make cgroup_write_event_control() use css_from_dir() instead of __d_cgrp()
    cgroup: make cgroup_event hold onto cgroup_subsys_state instead of cgroup
    cgroup: implement CFTYPE_NO_PREFIX
    cgroup: make cgroup_css() take cgroup_subsys * instead and allow NULL subsys
    cgroup: rename cgroup_css_from_dir() to css_from_dir() and update its syntax
    cgroup: fix cgroup_write_event_control()
    cgroup: fix subsystem file accesses on the root cgroup
    cgroup: change cgroup_from_id() to css_from_id()
    cgroup: use css_get() in cgroup_create() to check CSS_ROOT
    cpuset: remove an unncessary forward declaration
    cgroup: RCU protect each cgroup_subsys_state release
    cgroup: move subsys file removal to kill_css()
    cgroup: factor out kill_css()
    cgroup: decouple cgroup_subsys_state destruction from cgroup destruction
    cgroup: replace cgroup->css_kill_cnt with ->nr_css
    cgroup: bounce cgroup_subsys_state ref kill confirmation to a work item
    cgroup: move cgroup->subsys[] assignment to online_css()
    cgroup: reorganize css init / exit paths
    cgroup: add __rcu modifier to cgroup->subsys[]
    ...

    Linus Torvalds
     

09 Aug, 2013

3 commits

  • cgroup is in the process of converting to css (cgroup_subsys_state)
    from cgroup as the principal subsystem interface handle. This is
    mostly to prepare for the unified hierarchy support where css's will
    be created and destroyed dynamically but also helps cleaning up
    subsystem implementations as css is usually what they are interested
    in anyway.

    cftype->[un]register_event() is among the remaining couple interfaces
    which still use struct cgroup. Convert it to cgroup_subsys_state.
    The conversion is mostly mechanical and removes the last users of
    mem_cgroup_from_cont() and cg_to_vmpressure(), which are removed.

    v2: indentation update as suggested by Li Zefan.

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Cc: Balbir Singh

    Tejun Heo
     
  • cgroup is currently in the process of transitioning to using struct
    cgroup_subsys_state * as the primary handle instead of struct cgroup.
    Please see the previous commit which converts the subsystem methods
    for rationale.

    This patch converts all cftype file operations to take @css instead of
    @cgroup. cftypes for the cgroup core files don't have their subsystem
    pointer set. These will automatically use the dummy_css added by the
    previous patch and can be converted the same way.

    Most subsystem conversions are straightforward but there are some
    interesting ones.

    * freezer: update_if_frozen() is also converted to take @css instead
    of @cgroup for consistency. This will make the code look simpler
    too once iterators are converted to use css.

    * memory/vmpressure: mem_cgroup_from_css() needs to be exported to
    vmpressure while mem_cgroup_from_cont() can be made static.
    Updated accordingly.

    * cpu: cgroup_tg() doesn't have any user left. Removed.

    * cpuacct: cgroup_ca() doesn't have any user left. Removed.

    * hugetlb: hugetlb_cgroup_from_cgroup() doesn't have any user left.
    Removed.

    * net_cls: cgrp_cls_state() doesn't have any user left. Removed.

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan
    Acked-by: Michal Hocko
    Acked-by: Vivek Goyal
    Acked-by: Aristeu Rozanski
    Acked-by: Daniel Wagner
    Cc: Peter Zijlstra
    Cc: Ingo Molnar
    Cc: Johannes Weiner
    Cc: Balbir Singh
    Cc: Matt Helsley
    Cc: Jens Axboe
    Cc: Steven Rostedt

    Tejun Heo
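
Converting from a css pointer to the enclosing memcg, and from there to its embedded vmpressure state, relies on the container_of() idiom. The userspace sketch below uses offsetof() to the same effect; the struct layouts and the helper names memcg_from_css() and vmpressure_from_css() are hypothetical and much simpler than the kernel's.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, heavily reduced stand-ins for the kernel structures. */
struct css_sketch { int id; };
struct vmpressure_sketch { unsigned long scanned; };

struct memcg_sketch {
	unsigned long usage;
	struct css_sketch css;               /* embedded handle */
	struct vmpressure_sketch vmpressure; /* embedded pressure state */
};

/* Userspace equivalent of the container_of() idiom. */
#define container_of_sketch(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Recover the enclosing group from its embedded css; NULL stays NULL. */
static struct memcg_sketch *memcg_from_css(struct css_sketch *css)
{
	return css ? container_of_sketch(css, struct memcg_sketch, css) : NULL;
}

/* Reach the pressure state through the enclosing group. */
static struct vmpressure_sketch *vmpressure_from_css(struct css_sketch *css)
{
	assert(css != NULL);
	return &memcg_from_css(css)->vmpressure;
}
```

Because both members live inside the same object, no extra pointer or lookup table is needed to get from one to the other.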
     
  • The names of the two struct cgroup_subsys_state accessors -
    cgroup_subsys_state() and task_subsys_state() - are somewhat awkward.
    The former clashes with the type name and the latter doesn't even
    indicate it's somehow related to cgroup.

    We're about to revamp a large portion of the cgroup API, so let's rename
    them so that they're less awkward. Most per-controller usages of the
    accessors are localized in accessor wrappers and given the amount of
    scheduled changes, this isn't gonna add any noticeable headache.

    Rename cgroup_subsys_state() to cgroup_css() and task_subsys_state()
    to task_css(). This patch is pure rename.

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan

    Tejun Heo
     

01 Aug, 2013

3 commits

  • vmpressure is called synchronously from reclaim where the target_memcg
    is guaranteed to be alive but the eventfd is signaled from the work
    queue context. This means that memcg (along with the vmpressure structure
    which is embedded into it) might go away while the work item is pending,
    which would result in a use-after-release bug.

    There are two possible ways to fix this. Either vmpressure pins the memcg
    before it schedules vmpr->work and unpins it in vmpressure_work_fn, or
    the work item is explicitly flushed from the css_offline context (as
    suggested by Tejun).

    This patch implements the latter and introduces vmpressure_cleanup,
    which flushes the vmpressure work queue item. It hooks into
    mem_cgroup_css_offline after the memcg itself is cleaned up.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Michal Hocko
    Reported-by: Tejun Heo
    Cc: Anton Vorontsov
    Cc: Johannes Weiner
    Cc: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Cc: Li Zefan
    Acked-by: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
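
The "flush before teardown" approach can be illustrated with a toy deferred-work item. This is a single-threaded userspace sketch and not the kernel workqueue API; every name in it is hypothetical. It shows only the ordering: pending work is run to completion before the object that embeds it is marked gone.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy deferred-work item; not the kernel workqueue API. */
struct work_sketch {
	bool pending;
	void (*fn)(struct work_sketch *work);
};

/* Hypothetical group that embeds its work item as the first member. */
struct group_sketch {
	struct work_sketch work;
	int events_sent;
	bool offline;
};

static void event_work_fn(struct work_sketch *work)
{
	/* The work item is the first member, so the cast recovers the group. */
	struct group_sketch *g = (struct group_sketch *)work;

	assert(!g->offline); /* must never run after teardown */
	g->events_sent++;
}

/* Mark the item as waiting to run. */
static void queue_sketch(struct work_sketch *work)
{
	work->pending = true;
}

/* Run the item now if it is still pending. */
static void flush_sketch(struct work_sketch *work)
{
	if (work->pending) {
		work->pending = false;
		work->fn(work);
	}
}

/* Teardown: flush first, and only then mark the group as gone. */
static void group_offline(struct group_sketch *g)
{
	flush_sketch(&g->work);
	g->offline = true;
}
```

The alternative mentioned in the entry, pinning the object while work is pending, would instead take a reference in the queueing path and drop it in the work function.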
     
  • Do not check for pending work before scheduling it, because the check
    is racy and it doesn't give us much anyway, as schedule_work handles
    this case already.

    Signed-off-by: Michal Hocko
    Reported-by: Tejun Heo
    Cc: Anton Vorontsov
    Cc: Johannes Weiner
    Cc: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Cc: Li Zefan
    Acked-by: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
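
Why a separate "is it already pending?" test is both racy and redundant can be sketched with a C11 atomic flag. The names below are hypothetical and this is not the kernel implementation; the sketch assumes only that queueing marks the item pending with one atomic test-and-set, so exactly one caller wins and the others are told the item was already queued.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical work item whose pending state is a single atomic flag. */
struct work_flag_sketch {
	atomic_flag pending;
};

/*
 * Check and mark in one atomic step.
 * Returns true if this call queued the item, false if it was already pending.
 */
static bool schedule_sketch(struct work_flag_sketch *w)
{
	assert(w != NULL);
	return !atomic_flag_test_and_set(&w->pending);
}

/* Called when the work function has run; the item may be queued again. */
static void work_finished(struct work_flag_sketch *w)
{
	assert(w != NULL);
	atomic_flag_clear(&w->pending);
}
```

A plain read of the flag followed by a later queueing call leaves a gap in which another CPU can change the state, which is why the extra check adds a race without adding protection.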
     
  • There is nothing that can sleep inside critical sections protected by
    this lock and those sections are really small, so it doesn't make much
    sense to use a mutex for them. Change the lock to a spinlock.

    Signed-off-by: Michal Hocko
    Reported-by: Tejun Heo
    Cc: Anton Vorontsov
    Cc: Johannes Weiner
    Cc: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Cc: Li Zefan
    Reviewed-by: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

30 Apr, 2013

1 commit

  • With this patch userland applications that want to maintain the
    interactivity/memory allocation cost can use the pressure level
    notifications. The levels are defined like this:

    The "low" level means that the system is reclaiming memory for new
    allocations. Monitoring this reclaiming activity might be useful for
    maintaining cache level. Upon notification, the program (typically
    "Activity Manager") might analyze vmstat and act in advance (i.e.
    prematurely shut down unimportant services).

    The "medium" level means that the system is experiencing medium memory
    pressure; the system might be swapping, paging out active file
    caches, etc. Upon this event applications may decide to further analyze
    vmstat/zoneinfo/memcg or internal memory usage statistics and free any
    resources that can be easily reconstructed or re-read from a disk.

    The "critical" level means that the system is actively thrashing: it is
    about to run out of memory (OOM), or the in-kernel OOM killer is even
    about to trigger. Applications should do whatever they can to help the
    system. It might be too late to consult with vmstat or any other
    statistics, so it's advisable to take an immediate action.

    The events are propagated upward until the event is handled, i.e. the
    events are not pass-through. Here is what this means: for example you
    have three cgroups: A->B->C. Now you set up an event listener on
    cgroups A, B and C, and suppose group C experiences some pressure. In
    this situation, only group C will receive the notification, i.e. groups
    A and B will not receive it. This is done to avoid excessive
    "broadcasting" of messages, which disturbs the system and which is
    especially bad if we are low on memory or thrashing. So, organize the
    cgroups wisely, or propagate the events manually (or, ask us to
    implement the pass-through events, explaining why you would need them).

    Performance wise, the memory pressure notifications feature itself is
    lightweight and does not require much of bookkeeping, in contrast to the
    rest of memcg features. Unfortunately, in the current memcg
    implementation, page accounting is an inseparable part and cannot be
    turned off. The good news is that there are some efforts[1] to improve
    the situation; plus, implementing the same, fully API-compatible[2]
    interface for CONFIG_MEMCG=n case (e.g. embedded) is also a viable
    option, so it will not require any changes on the userland side.

    [1] http://permalink.gmane.org/gmane.linux.kernel.cgroups/6291
    [2] http://lkml.org/lkml/2013/2/21/454

    [akpm@linux-foundation.org: coding-style fixes]
    [akpm@linux-foundation.org: fix CONFIG_CGROUPS=n warnings]
    Signed-off-by: Anton Vorontsov
    Acked-by: Kirill A. Shutemov
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Tejun Heo
    Cc: David Rientjes
    Cc: Pekka Enberg
    Cc: Mel Gorman
    Cc: Glauber Costa
    Cc: Michal Hocko
    Cc: Luiz Capitulino
    Cc: Greg Thelen
    Cc: Leonid Moiseichuk
    Cc: KOSAKI Motohiro
    Cc: Minchan Kim
    Cc: Bartlomiej Zolnierkiewicz
    Cc: John Stultz
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Anton Vorontsov
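
The level definitions and the upward propagation rule described in the entry above can be sketched together in userspace C. Everything here is an assumption made for illustration: the names are hypothetical, the thresholds of 60 and 95 percent are assumed values, and a listener is modeled as a plain flag rather than an eventfd registered for a given level.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum level_sketch { LEVEL_LOW, LEVEL_MEDIUM, LEVEL_CRITICAL };

/* Assumed thresholds: percent of scanned pages that were NOT reclaimed. */
#define MEDIUM_PCT 60UL
#define CRITICAL_PCT 95UL

/* Map one window of reclaim activity to a pressure level. */
static enum level_sketch classify(unsigned long scanned, unsigned long reclaimed)
{
	unsigned long pressure;

	/* Nothing scanned, or everything reclaimed: no measurable pressure. */
	if (scanned == 0 || reclaimed >= scanned)
		return LEVEL_LOW;
	pressure = 100 - reclaimed * 100 / scanned;
	if (pressure >= CRITICAL_PCT)
		return LEVEL_CRITICAL;
	if (pressure >= MEDIUM_PCT)
		return LEVEL_MEDIUM;
	return LEVEL_LOW;
}

/* Hypothetical cgroup node with a link to its parent. */
struct cgroup_sketch {
	struct cgroup_sketch *parent;
	bool has_listener;
	int delivered;
};

/*
 * Walk upward from the group under pressure and stop at the first
 * group that handles the event, so ancestors are not notified as well.
 */
static struct cgroup_sketch *deliver(struct cgroup_sketch *g)
{
	for (; g != NULL; g = g->parent) {
		if (g->has_listener) {
			g->delivered++;
			return g;
		}
	}
	return NULL;
}
```

For the A->B->C example, calling deliver() on C with listeners on all three groups notifies only C; if C has no listener, the event reaches B and stops there.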