01 Aug, 2013

1 commit

  • vmpressure is called synchronously from reclaim, where the target_memcg
    is guaranteed to be alive, but the eventfd is signaled from the work
    queue context. This means that the memcg (along with the vmpressure
    structure which is embedded into it) might go away while the work item
    is pending, which would result in a use-after-free bug.

    There are two possible ways to fix this: either vmpressure pins the
    memcg before it schedules vmpr->work and unpins it in
    vmpressure_work_fn, or we explicitly flush the work item from the
    css_offline context (as suggested by Tejun).

    This patch implements the latter: it introduces vmpressure_cleanup,
    which flushes the vmpressure work queue item. It hooks into
    mem_cgroup_css_offline after the memcg itself is cleaned up.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Michal Hocko
    Reported-by: Tejun Heo
    Cc: Anton Vorontsov
    Cc: Johannes Weiner
    Cc: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Cc: Li Zefan
    Acked-by: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
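
    A minimal sketch of the chosen approach, assuming only what the
    changelog states (the vmpressure structure embeds its work item as
    vmpr->work and the cleanup hook runs from mem_cgroup_css_offline);
    the function body is illustrative, not the exact upstream code:

    void vmpressure_cleanup(struct vmpressure *vmpr)
    {
            /*
             * Flush any pending vmpressure work before the memcg that
             * embeds vmpr is torn down, so the work item can never run
             * against freed memory.
             */
            flush_work(&vmpr->work);
    }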
     

15 Jul, 2013

1 commit

  • The __cpuinit type of throwaway sections might have made sense
    some time ago when RAM was more constrained, but now the savings
    do not offset the cost and complications. The fix in commit
    5e427ec2d0 ("x86: Fix bit corruption at CPU resume time") is a good
    example of the nasty type of bugs that can be created with improper
    use of the various __init prefixes.

    After a discussion on LKML[1] it was decided that cpuinit should go
    the way of devinit and be phased out. Once all the users are gone,
    we can then finally remove the macros themselves from linux/init.h.

    This removes all the uses of the __cpuinit macros from C files in
    the core kernel directories (kernel, init, lib, mm, and include)
    that don't really have a specific maintainer.

    [1] https://lkml.org/lkml/2013/5/20/589

    Signed-off-by: Paul Gortmaker

    Paul Gortmaker
     

10 Jul, 2013

12 commits

  • Now memcg has the same life cycle with its corresponding cgroup, and a
    cgroup is freed via RCU and then mem_cgroup_css_free() will be called in
    a work function, so we can simply call __mem_cgroup_free() in
    mem_cgroup_css_free().

    This actually reverts commit 59927fb984d ("memcg: free mem_cgroup by RCU
    to fix oops").

    Signed-off-by: Li Zefan
    Cc: Hugh Dickins
    Acked-by: Michal Hocko
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Tejun Heo
    Cc: Glauber Costa
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li Zefan
     
  • Now memcg has the same life cycle as its corresponding cgroup. Kill the
    useless refcnt.

    Signed-off-by: Li Zefan
    Acked-by: Michal Hocko
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Cc: Tejun Heo
    Cc: Glauber Costa
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li Zefan
     
  • The cgroup core guarantees it's always safe to access the parent.

    Signed-off-by: Li Zefan
    Acked-by: Michal Hocko
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Cc: Tejun Heo
    Cc: Glauber Costa
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li Zefan
     
  • Use css_get/put instead of mem_cgroup_get/put. A simple replacement
    will do.

    The historical reason that memcg has its own refcnt instead of always
    using css_get/put is that a cgroup couldn't be removed while there
    were still css refs, so css refs couldn't be used as long-lived
    references. The situation has changed: rmdir of a cgroup now succeeds
    regardless of css refs, but the cgroup won't be freed until the css
    refcount drops to 0.

    Signed-off-by: Li Zefan
    Acked-by: Michal Hocko
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Cc: Tejun Heo
    Cc: Glauber Costa
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li Zefan
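
    A hedged sketch of the replacement pattern described above; the
    helper names are purely illustrative, while css_get/css_put on the
    embedded memcg->css are the primitives the changelog refers to:

    static void memcg_pin(struct mem_cgroup *memcg)
    {
            /* Keeps the memcg allocated; rmdir may still succeed. */
            css_get(&memcg->css);
    }

    static void memcg_unpin(struct mem_cgroup *memcg)
    {
            /* Freeing happens only after the last css reference is gone. */
            css_put(&memcg->css);
    }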
     
  • Use css_get/put instead of mem_cgroup_get/put.

    We can't do a simple replacement, because here mem_cgroup_put() is
    called during mem_cgroup_css_free(), while mem_cgroup_css_free() won't
    be called until css refcnt goes down to 0.

    Instead we increment the css refcnt in mem_cgroup_css_offline(), and
    then check if there are still kmem charges. If not, the css refcnt is
    decremented immediately; otherwise it is released after the last kmem
    allocation is uncharged.

    [akpm@linux-foundation.org: tweak comment]
    Signed-off-by: Li Zefan
    Acked-by: Michal Hocko
    Acked-by: KAMEZAWA Hiroyuki
    Reviewed-by: Tejun Heo
    Cc: Michal Hocko
    Cc: Glauber Costa
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li Zefan
     
  • Use css_get()/css_put() instead of mem_cgroup_get()/mem_cgroup_put().

    There are two things being done in the current code:

    First, we acquire a css ref to make sure that the underlying cgroup
    does not go away. That is a short-lived reference, and it is put as
    soon as the cache is created.

    At that point, we acquire a long-lived per-cache memcg reference count
    to guarantee that the memcg will still be alive.

    So it is:

    enqueue: css_get
    create : memcg_get, css_put
    destroy: memcg_put

    So we only need to get rid of the memcg_get, change the memcg_put to
    css_put, and get rid of the now extra css_put.

    (This changelog is mostly written by Glauber)

    Signed-off-by: Li Zefan
    Acked-by: Michal Hocko
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Cc: Tejun Heo
    Cc: Glauber Costa
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li Zefan
     
  • Use css_get/css_put instead of mem_cgroup_get/put.

    Note, if at the same time someone is moving @current to a different
    cgroup and removing the old cgroup, css_tryget() may return false, and
    sock->sk_cgrp won't be initialized, which is fine.

    Signed-off-by: Li Zefan
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Michal Hocko
    Cc: Hugh Dickins
    Cc: Tejun Heo
    Cc: Glauber Costa
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li Zefan
     
  • mem_cgroup_css_online calls mem_cgroup_put if memcg_init_kmem fails.
    This is not correct because only memcg_propagate_kmem takes an
    additional reference; mem_cgroup_sockets_init is allowed to fail as
    well (although no current implementation fails), but it doesn't take
    any reference. This all suggests that memcg_propagate_kmem should
    clean up after itself, so this patch moves mem_cgroup_put over there.

    Unfortunately this is not that easy (as pointed out by Li Zefan),
    because memcg_kmem_mark_dead marks the group dead (KMEM_ACCOUNTED_DEAD)
    if it is marked active (KMEM_ACCOUNTED_ACTIVE), which is the case even
    if memcg_propagate_kmem fails. The additional reference would then
    also be dropped in kmem_cgroup_destroy, which means it would be
    dropped twice.

    The easiest way then would be to simply remove mem_cgroup_put from
    mem_cgroup_css_online and rely on kmem_cgroup_destroy doing the right
    thing.

    Signed-off-by: Michal Hocko
    Signed-off-by: Li Zefan
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Cc: Tejun Heo
    Cc: Glauber Costa
    Cc: Johannes Weiner
    Cc: [3.8]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • This reverts commit e4715f01be697a.

    mem_cgroup_put is hierarchy aware, so mem_cgroup_put(memcg) already
    drops an additional reference from all parents; the extra
    mem_cgroup_put(parent) therefore potentially causes a use-after-free.

    Signed-off-by: Michal Hocko
    Signed-off-by: Li Zefan
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Cc: Tejun Heo
    Cc: Glauber Costa
    Cc: Johannes Weiner
    Cc: [3.9+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • The memory we use to hold the memcg arrays is currently accounted to
    the current memcg. But that creates a problem, because that memory
    can only be freed after the last user is gone. Our only way to know
    which is the last user is to hook into freeing time, but the fact
    that we still have some in-flight kmallocs will prevent freeing from
    happening. I therefore believe it is just easier to account this
    memory as global overhead.

    Signed-off-by: Glauber Costa
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Kamezawa Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Glauber Costa
     
  • The memory we use to hold the memcg arrays is currently accounted to
    the current memcg. But that creates a problem, because that memory
    can only be freed after the last user is gone. Our only way to know
    which is the last user is to hook into freeing time, but the fact
    that we still have some in-flight kmallocs will prevent freeing from
    happening. I therefore believe it is just easier to account this
    memory as global overhead.

    This patch (of 2):

    Disabling accounting is only relevant for some specific memcg internal
    allocations. Therefore we would initially not have such a check at
    memcg_kmem_newpage_charge, since direct calls to the page allocator that
    are marked with GFP_KMEMCG only happen outside memcg core. We are
    mostly concerned with cache allocations and by having this test at
    memcg_kmem_get_cache we are already able to relay the allocation to the
    root cache and bypass the memcg caches altogether.

    There is one exception, though: the SLUB allocator does not create
    large order caches, but rather services large kmallocs directly from
    the page allocator. Therefore, the following sequence, when backed by
    the SLUB
    allocator:

    memcg_stop_kmem_account();
    kmalloc()
    memcg_resume_kmem_account();

    would effectively ignore the fact that we should skip accounting, since
    it will drive us directly to this function without passing through the
    cache selector memcg_kmem_get_cache. Such large allocations are
    extremely rare but can happen, for instance, for the cache arrays.

    This was never a problem in practice, because we weren't skipping
    accounting for the cache arrays. All the allocations we were skipping
    were fairly small. However, the fact that we were not skipping those
    allocations is a problem and can prevent the memcgs from going away.
    As we fix that, we need to make sure that the fix will also work with
    the SLUB allocator.

    Signed-off-by: Glauber Costa
    Reported-by: Michal Hocko
    Cc: Johannes Weiner
    Cc: Kamezawa Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Glauber Costa
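
    A hedged sketch of the bypass discussed above. The per-task counter
    name follows the memcg code of that era, but treat the exact
    identifiers as illustrative; the point is that the page-allocator
    charge path must honour the same skip flag as memcg_kmem_get_cache:

    static inline void memcg_stop_kmem_account(void)
    {
            current->memcg_kmem_skip_account++;
    }

    static inline void memcg_resume_kmem_account(void)
    {
            current->memcg_kmem_skip_account--;
    }

    /*
     * Illustrative check shared by the cache selector and, after this
     * series, by the GFP_KMEMCG page-allocator charge path as well, so
     * SLUB's large kmallocs are also exempted from accounting.
     */
    static inline bool memcg_kmem_bypass(void)
    {
            return !current->mm || current->memcg_kmem_skip_account;
    }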
     
  • Remove struct mem_cgroup_lru_info and fold its single member, the
    variably sized nodeinfo[0], directly into struct mem_cgroup. This
    should make it more obvious why it has to be the last member there.

    Also move the comment that's above that special last member below it,
    so it is more visible to somebody who considers appending to struct
    mem_cgroup.

    Signed-off-by: Johannes Weiner
    Cc: David Rientjes
    Acked-by: Michal Hocko
    Cc: Glauber Costa
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
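
    A hedged sketch of the resulting layout, trimmed to the point being
    made (the comment now sits below the member, and the variably sized
    array must stay last); other members are elided:

    struct mem_cgroup {
            struct cgroup_subsys_state css;
            /* ... fixed-size members elided ... */

            struct mem_cgroup_per_node *nodeinfo[0];
            /*
             * WARNING: nodeinfo[] is variably sized (one slot per node),
             * so it must remain the very last member of struct mem_cgroup.
             */
    };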
     

04 Jul, 2013

2 commits

  • mem_cgroup_iter() is too hard to follow. Factor out the lockless reclaim
    iterator loading and updating so it's easier to follow the big picture.

    Also document the iterator invalidation mechanism a bit more extensively.

    Signed-off-by: Johannes Weiner
    Reported-by: Tejun Heo
    Reviewed-by: Tejun Heo
    Acked-by: Michal Hocko
    Cc: KAMEZAWA Hiroyuki
    Cc: Glauber Costa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • For processes that have detached their mm's, task_in_mem_cgroup()
    unnecessarily takes task_lock() when rcu_read_lock() is all that is
    necessary to call mem_cgroup_from_task().

    While we're here, switch task_in_mem_cgroup() to return bool.

    Signed-off-by: David Rientjes
    Cc: KAMEZAWA Hiroyuki
    Cc: Johannes Weiner
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
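
    A simplified, hedged sketch of the locking pattern described above
    (the real function also deals with css references and hierarchy; only
    the rcu_read_lock()-instead-of-task_lock() point is shown):

    bool task_in_mem_cgroup(struct task_struct *task,
                            const struct mem_cgroup *memcg)
    {
            struct mem_cgroup *curr;

            /*
             * rcu_read_lock() is enough to dereference the task's memcg
             * for an mm-less task; task_lock() is not required here.
             */
            rcu_read_lock();
            curr = mem_cgroup_from_task(task);
            rcu_read_unlock();

            return curr == memcg;   /* hierarchy handling omitted */
    }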
     

13 Jun, 2013

2 commits

  • The lockless reclaim hierarchy iterator currently has a misplaced
    barrier that can lead to use-after-free crashes.

    The reclaim hierarchy iterator consists of a sequence count and a
    position pointer that are read and written locklessly, with memory
    barriers enforcing ordering.

    The write side sets the position pointer first, then updates the
    sequence count to "publish" the new position. Likewise, the read side
    must read the sequence count first, then the position. If the sequence
    count is up to date, it's guaranteed that the position is up to date as
    well:

    writer:                          reader:
    iter->position = position        if iter->sequence == expected:
    smp_wmb()                            smp_rmb()
    iter->sequence = sequence            position = iter->position

    However, the read side barrier is currently misplaced, which can lead to
    dereferencing stale position pointers that no longer point to valid
    memory. Fix this.

    Signed-off-by: Johannes Weiner
    Reported-by: Tejun Heo
    Reviewed-by: Tejun Heo
    Acked-by: Michal Hocko
    Cc: KAMEZAWA Hiroyuki
    Cc: Glauber Costa
    Cc: [3.10+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • struct memcg_cache_params has a union. Different parts of this union
    are used for root and non-root caches. The part containing the
    cache-destroy work item is used only for non-root caches.

    BUG: unable to handle kernel paging request at 0000000fffffffe0
    IP: kmem_cache_alloc+0x41/0x1f0
    Modules linked in: netlink_diag af_packet_diag udp_diag tcp_diag inet_diag unix_diag ip6table_filter ip6_tables i2c_piix4 virtio_net virtio_balloon microcode i2c_core pcspkr floppy
    CPU: 0 PID: 1929 Comm: lt-vzctl Tainted: G D 3.10.0-rc1+ #2
    Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
    RIP: kmem_cache_alloc+0x41/0x1f0
    Call Trace:
    getname_flags.part.34+0x30/0x140
    getname+0x38/0x60
    do_sys_open+0xc5/0x1e0
    SyS_open+0x22/0x30
    system_call_fastpath+0x16/0x1b
    Code: f4 53 48 83 ec 18 8b 05 8e 53 b7 00 4c 8b 4d 08 21 f0 a8 10 74 0d 4c 89 4d c0 e8 1b 76 4a 00 4c 8b 4d c0 e9 92 00 00 00 4d 89 f5 8b 45 00 65 4c 03 04 25 48 cd 00 00 49 8b 50 08 4d 8b 38 49
    RIP [] kmem_cache_alloc+0x41/0x1f0

    Signed-off-by: Andrey Vagin
    Cc: Konstantin Khlebnikov
    Cc: Glauber Costa
    Cc: Johannes Weiner
    Cc: Balbir Singh
    Cc: KAMEZAWA Hiroyuki
    Reviewed-by: Michal Hocko
    Cc: Li Zefan
    Cc: [3.9.x]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Vagin
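
    A hedged sketch of the union in question as it looked around this
    time; member names are reproduced from memory and should be read as
    illustrative rather than authoritative:

    struct memcg_cache_params {
            bool is_root_cache;
            union {
                    /* used by root caches only */
                    struct kmem_cache *memcg_caches[0];
                    /* used by per-memcg (non-root) caches only */
                    struct {
                            struct mem_cgroup *memcg;
                            struct list_head list;
                            struct kmem_cache *root_cache;
                            bool dead;
                            atomic_t nr_pages;
                            struct work_struct destroy;
                    };
            };
    };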
     

25 May, 2013

1 commit

  • Commit 0c59b89c81ea ("mm: memcg: push down PageSwapCache check into
    uncharge entry functions") added a VM_BUG_ON() on PageSwapCache in the
    uncharge path after checking that page flag once, assuming that the
    state is stable in all paths, but this is not the case and the condition
    triggers in user environments. An uncharge after the last page table
    reference to the page goes away can race with reclaim adding the page to
    swap cache.

    Swap cache pages are usually uncharged when they are freed after
    swapout, from a path that also handles swap usage accounting and memcg
    lifetime management. However, since the last page table reference is
    gone and thus no references to the swap slot are left, the swap slot will be
    freed shortly when reclaim attempts to write the page to disk. The
    whole swap accounting is not even necessary.

    So while the race condition for which this VM_BUG_ON was added is real
    and actually existed all along, there are no negative effects. Remove
    the VM_BUG_ON again.

    Reported-by: Heiko Carstens
    Reported-by: Lingzhu Xiang
    Signed-off-by: Johannes Weiner
    Acked-by: Hugh Dickins
    Acked-by: Michal Hocko
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

08 May, 2013

1 commit

  • This exports the amount of anonymous transparent hugepages for each
    memcg via the new "rss_huge" stat in memory.stat. The units are in
    bytes.

    This is helpful for determining the hugepage utilization of individual
    jobs on the system in comparison to rss, and for spotting
    opportunities where MADV_HUGEPAGE may be helpful.

    The amount of anonymous transparent hugepages is also included in "rss"
    for backwards compatibility.

    Signed-off-by: David Rientjes
    Acked-by: Michal Hocko
    Acked-by: Johannes Weiner
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

30 Apr, 2013

12 commits

  • Pull cgroup updates from Tejun Heo:

    - Fixes and a lot of cleanups. Locking cleanup is finally complete.
    cgroup_mutex is no longer exposed to individual controllers, which
    used to cause nasty deadlock issues. Li fixed and cleaned up quite a
    bit, including long-standing issues like the racy cgroup_path().

    - device cgroup now supports proper hierarchy thanks to Aristeu.

    - perf_event cgroup now supports proper hierarchy.

    - A new mount option "__DEVEL__sane_behavior" is added. As indicated
    by the name, this option is to be used for development only at this
    point and generates a warning message when used. Unfortunately,
    the cgroup interface currently has too many breakages and inconsistencies
    to implement a consistent and unified hierarchy on top. The new flag
    is used to collect the behavior changes which are necessary to
    implement consistent unified hierarchy. It's likely that this flag
    won't be used verbatim when it becomes ready but will be enabled
    implicitly along with unified hierarchy.

    The option currently disables some of the broken behaviors in cgroup
    core and also the .use_hierarchy switch in memcg (will be routed
    through -mm), which can be used to create very unusual hierarchies
    where nesting is
    partially honored. It will also be used to implement hierarchy
    support for blk-throttle which would be impossible otherwise without
    introducing a full separate set of control knobs.

    This is essentially versioning of interface which isn't very nice but
    at this point I can't see any other options which would allow keeping
    the interface the same while moving towards hierarchy behavior which
    is at least somewhat sane. The planned unified hierarchy is likely
    to require some level of adaptation from userland anyway, so I think
    it'd be best to take the chance and update the interface such that
    it's supportable in the long term.

    Maintaining the existing interface does complicate cgroup core but
    shouldn't put too much strain on individual controllers and I think
    it'd be manageable for the foreseeable future. Maybe we'll be able
    to drop it in a decade.

    Fix up conflicts (including a semantic one adding a new #include to ppc
    that was uncovered by the header file changes) as per Tejun.

    * 'for-3.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (45 commits)
    cpuset: fix compile warning when CONFIG_SMP=n
    cpuset: fix cpu hotplug vs rebuild_sched_domains() race
    cpuset: use rebuild_sched_domains() in cpuset_hotplug_workfn()
    cgroup: restore the call to eventfd->poll()
    cgroup: fix use-after-free when umounting cgroupfs
    cgroup: fix broken file xattrs
    devcg: remove parent_cgroup.
    memcg: force use_hierarchy if sane_behavior
    cgroup: remove cgrp->top_cgroup
    cgroup: introduce sane_behavior mount option
    move cgroupfs_root to include/linux/cgroup.h
    cgroup: convert cgroupfs_root flag bits to masks and add CGRP_ prefix
    cgroup: make cgroup_path() not print double slashes
    Revert "cgroup: remove bind() method from cgroup_subsys."
    perf: make perf_event cgroup hierarchical
    cgroup: implement cgroup_is_descendant()
    cgroup: make sure parent won't be destroyed before its children
    cgroup: remove bind() method from cgroup_subsys.
    devcg: remove broken_hierarchy tag
    cgroup: remove cgroup_lock_is_held()
    ...

    Linus Torvalds
     
  • The memcg is not referenced, so it can be destroyed at any time right
    after we exit the rcu read section, so it's not safe to access it.

    To fix this, we call css_tryget() to get a reference while we're
    still inside the rcu read section.

    This also removes a bogus comment above __memcg_create_cache_enqueue().

    Signed-off-by: Li Zefan
    Acked-by: Glauber Costa
    Acked-by: Michal Hocko
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li Zefan
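
    A hedged sketch of the fix described above; the helper name is
    hypothetical and the surrounding code is simplified, only the
    take-the-reference-inside-RCU pattern is the point:

    static struct mem_cgroup *get_current_memcg_or_null(void)
    {
            struct mem_cgroup *memcg;

            rcu_read_lock();
            memcg = mem_cgroup_from_task(current);
            /* Take the reference before leaving the RCU section ... */
            if (memcg && !css_tryget(&memcg->css))
                    memcg = NULL;   /* ... or treat the group as gone. */
            rcu_read_unlock();

            return memcg;           /* caller does css_put() when done */
    }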
     
  • A memcg may livelock when oom if the process that grabs the hierarchy's
    oom lock is never the first process with PF_EXITING set in the memcg's
    task iteration.

    The oom killer, both global and memcg, will defer if it finds an
    eligible process that is in the process of exiting and it is not being
    ptraced. The idea is to allow it to exit without using memory reserves
    before needlessly killing another process.

    This normally works fine except in the memcg case with a large number of
    threads attached to the oom memcg. In this case, the memcg oom killer
    only gets called for the process that grabs the hierarchy's oom lock;
    all others end up blocked on the memcg's oom waitqueue. Thus, if the
    process that grabs the hierarchy's oom lock is never the first
    PF_EXITING process in the memcg's task iteration, the oom killer is
    constantly deferred without anything making progress.

    The fix is to give PF_EXITING processes access to memory reserves so
    that they are marked as oom killed without any iteration. This allows
    __mem_cgroup_try_charge() to succeed so that the process may exit.
    This makes the memcg oom killer exemption for TIF_MEMDIE tasks (now
    immediately granted to processes with pending SIGKILLs and those in
    the exit path) equivalent to what is done for the global oom killer.

    Signed-off-by: David Rientjes
    Acked-by: Michal Hocko
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • This might cause a use-after-free bug.

    Signed-off-by: Li Zefan
    Cc: Glauber Costa
    Acked-by: Michal Hocko
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li Zefan
     
  • With this patch userland applications that want to maintain the
    interactivity/memory allocation cost can use the pressure level
    notifications. The levels are defined like this:

    The "low" level means that the system is reclaiming memory for new
    allocations. Monitoring this reclaiming activity might be useful for
    maintaining cache level. Upon notification, the program (typically
    "Activity Manager") might analyze vmstat and act in advance (i.e.
    prematurely shutdown unimportant services).

    The "medium" level means that the system is experiencing medium memory
    pressure: the system might be swapping, paging out active file
    caches, etc. Upon this event applications may decide to further analyze
    vmstat/zoneinfo/memcg or internal memory usage statistics and free any
    resources that can be easily reconstructed or re-read from a disk.

    The "critical" level means that the system is actively thrashing, it
    is about to run out of memory (OOM), or the in-kernel OOM killer is
    about to trigger. Applications should do whatever they can to help the
    system. It might be too late to consult with vmstat or any other
    statistics, so it's advisable to take an immediate action.

    The events are propagated upward until the event is handled, i.e. the
    events are not pass-through. Here is what this means: for example you
    have three cgroups: A->B->C. Now you set up an event listener on
    cgroups A, B and C, and suppose group C experiences some pressure. In
    this situation, only group C will receive the notification, i.e. groups
    A and B will not receive it. This is done to avoid excessive
    "broadcasting" of messages, which disturbs the system and which is
    especially bad if we are low on memory or thrashing. So, organize the
    cgroups wisely, or propagate the events manually (or ask us to
    implement pass-through events, explaining why you would need them).

    Performance-wise, the memory pressure notifications feature itself is
    lightweight and does not require much bookkeeping, in contrast to the
    rest of the memcg features. Unfortunately, as of the current memcg
    implementation, page accounting is an inseparable part and cannot be
    turned off. The good news is that there are some efforts[1] to improve
    the situation; plus, implementing the same, fully API-compatible[2]
    interface for CONFIG_MEMCG=n case (e.g. embedded) is also a viable
    option, so it will not require any changes on the userland side.

    [1] http://permalink.gmane.org/gmane.linux.kernel.cgroups/6291
    [2] http://lkml.org/lkml/2013/2/21/454

    [akpm@linux-foundation.org: coding-style fixes]
    [akpm@linux-foundation.org: fix CONFIG_CGROUPS=n warnings]
    Signed-off-by: Anton Vorontsov
    Acked-by: Kirill A. Shutemov
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Tejun Heo
    Cc: David Rientjes
    Cc: Pekka Enberg
    Cc: Mel Gorman
    Cc: Glauber Costa
    Cc: Michal Hocko
    Cc: Luiz Capitulino
    Cc: Greg Thelen
    Cc: Leonid Moiseichuk
    Cc: KOSAKI Motohiro
    Cc: Minchan Kim
    Cc: Bartlomiej Zolnierkiewicz
    Cc: John Stultz
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Anton Vorontsov
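
    A minimal user-space sketch of how a "low" level listener could be
    registered, assuming the conventional cgroup v1 mount point and an
    existing group A; paths and error handling are illustrative:

    #include <stdio.h>
    #include <string.h>
    #include <stdint.h>
    #include <unistd.h>
    #include <fcntl.h>
    #include <sys/eventfd.h>

    int main(void)
    {
            char buf[64];
            uint64_t n;
            int efd = eventfd(0, 0);
            int pfd = open("/sys/fs/cgroup/memory/A/memory.pressure_level",
                           O_RDONLY);
            int cfd = open("/sys/fs/cgroup/memory/A/cgroup.event_control",
                           O_WRONLY);

            /* "<event_fd> <pressure_fd> <level>" registers the listener. */
            snprintf(buf, sizeof(buf), "%d %d low", efd, pfd);
            write(cfd, buf, strlen(buf) + 1);

            for (;;) {
                    /* Blocks until the kernel reports the requested level. */
                    read(efd, &n, sizeof(n));
                    printf("memory pressure event, count=%llu\n",
                           (unsigned long long)n);
            }
            return 0;
    }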
     
  • Just a trivial issue I stumbled on while doing something else...

    Signed-off-by: Michel Lespinasse
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • Since commit 2d11085e404f ("memcg: do not create memsw files if swap
    accounting is disabled") memsw files are created only if memcg swap
    accounting is enabled so it doesn't make any sense to check for it
    explicitly in mem_cgroup_read(), mem_cgroup_write() and
    mem_cgroup_reset().

    Signed-off-by: Michal Hocko
    Cc: Kamezawa Hiroyuki
    Cc: Tejun Heo
    Acked-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • mem_cgroup_iter basically does two things currently. It takes care of
    the housekeeping (reference counting, reclaim cookie) and it iterates
    through a hierarchy tree (by using the generic cgroup tree walk). The
    code would be much easier to follow if we moved the iteration outside
    of the function (to __mem_cgroup_iter_next) so the distinction is
    clearer. This patch doesn't introduce any functional changes.

    Signed-off-by: Michal Hocko
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Johannes Weiner
    Cc: Li Zefan
    Cc: Ying Han
    Cc: Tejun Heo
    Cc: Glauber Costa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • The current implementation of mem_cgroup_iter has to consider both css
    and memcg to find out whether no group has been found (css==NULL - aka
    the loop is completed) and that no memcg is associated with the found
    node (!memcg - aka css_tryget failed because the group is no longer
    alive). This leads to awkward tweaks like tests for css && !memcg to
    skip the current node.

    It will be much easier if we get rid of the css variable altogether
    and rely only on memcg. In order to do that the iteration part has to
    skip
    dead nodes. This sounds natural to me and as a nice side effect we will
    get a simple invariant that memcg is always alive when non-NULL and all
    nodes have been visited otherwise.

    We could get rid of the surrounding while loop but keep it in for now to
    make review easier. It will go away in the following patch.

    Signed-off-by: Michal Hocko
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Johannes Weiner
    Cc: Li Zefan
    Cc: Ying Han
    Cc: Tejun Heo
    Cc: Glauber Costa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Now that the per-node-zone-priority iterator caches memory cgroups
    rather than their css ids, we have to be careful and remove them from
    the iterator when they are on the way out; otherwise they might live
    for an unbounded amount of time even though their group is already
    gone (until global/targeted reclaim visits that zone and priority,
    finds out the group is dead and lets it find its final rest).

    We can fix this issue by relaxing rules for the last_visited memcg.
    Instead of taking a reference to the css before it is stored into
    iter->last_visited we can just store its pointer and track the number of
    removed groups from each memcg's subhierarchy.

    This number is stored in the iterator every time a memcg is cached.
    If the iter count doesn't match the current walker root's count, we
    start from the root again. The group counter is incremented up the
    hierarchy every time a group is removed.

    The iter_lock can be dropped because racing iterators cannot leak the
    reference anymore as the reference count is not elevated for
    last_visited when it is cached.

    Locking rules got a bit complicated by this change though. The iterator
    primarily relies on rcu read lock which makes sure that once we see a
    valid last_visited pointer then it will be valid for the whole RCU walk.
    smp_rmb makes sure that dead_count is read before last_visited and
    last_dead_count while smp_wmb makes sure that last_visited is updated
    before last_dead_count so the up-to-date last_dead_count cannot point to
    an outdated last_visited. css_tryget then makes sure that the
    last_visited is still alive in case the iteration races with the cached
    group removal (css is invalidated before mem_cgroup_css_offline
    increments dead_count).

    In short:
    mem_cgroup_iter
        rcu_read_lock()
        dead_count = atomic_read(parent->dead_count)
        smp_rmb()
        if (dead_count != iter->last_dead_count)
            last_visited POSSIBLY INVALID -> last_visited = NULL
        if (!css_tryget(iter->last_visited))
            last_visited DEAD -> last_visited = NULL
        next = find_next(last_visited)
        css_tryget(next)
        css_put(last_visited)  // css would be invalidated and parent->dead_count
                               // incremented if this was the last reference
        iter->last_visited = next
        smp_wmb()
        iter->last_dead_count = dead_count
        rcu_read_unlock()

    cgroup_rmdir
        cgroup_destroy_locked
            atomic_add(CSS_DEACT_BIAS, &css->refcnt) // subsequent css_tryget fail
            mem_cgroup_css_offline
                mem_cgroup_invalidate_reclaim_iterators
                    while(parent = parent_mem_cgroup)
                        atomic_inc(parent->dead_count)
        css_put(css) // last reference held by cgroup core

    Spotted by Ying Han.

    Original idea from Johannes Weiner.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Michal Hocko
    Signed-off-by: Johannes Weiner
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Ying Han
    Cc: Li Zefan
    Cc: Tejun Heo
    Cc: Glauber Costa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • mem_cgroup_iter currently relies on css->id when walking down a group
    hierarchy tree. This is really awkward because the tree walk depends
    on the groups' creation ordering. The only guarantee is that a parent
    node is
    visited before its children.

    Example:

    1) mkdir -p a a/d a/b/c
    2) mkdir -p a a/b/c a/d

    Will create the same trees but the tree walks will be different:

    1) a, d, b, c
    2) a, b, c, d

    Commit 574bd9f7c7c1 ("cgroup: implement generic child / descendant walk
    macros") has introduced generic cgroup tree walkers which provide either
    pre-order or post-order tree walk. This patch converts css->id based
    iteration to pre-order tree walk to keep the semantic with the original
    iterator where parent is always visited before its subtree.

    cgroup_for_each_descendant_pre suggests using post_create and
    pre_destroy for proper synchronization with group addition and
    removal, respectively. This implementation doesn't use those because
    a new memory cgroup is already initialized sufficiently for iteration
    in mem_cgroup_css_alloc, and css reference counting enforces that the
    group is alive for both the last seen cgroup and the found one, or
    else signals that the group is dead and should be skipped.

    If the reclaim cookie is used we need to store the last visited group
    in the iterator, so we have to be careful that it doesn't disappear
    in the meantime. An elevated reference count on the css keeps it
    alive even after the group has been removed (parked waiting for the
    last dput so that it can be freed).

    A per node-zone-prio iter_lock has been introduced to ensure that
    css_tryget and setting iter->last_visited happen atomically.
    Otherwise two racing walkers could both take a reference and only one
    release it, leading to a css leak (which pins the cgroup dentry).

    Signed-off-by: Michal Hocko
    Cc: Johannes Weiner
    Cc: KAMEZAWA Hiroyuki
    Cc: Li Zefan
    Cc: Ying Han
    Cc: Tejun Heo
    Cc: Glauber Costa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • The patchset tries to make mem_cgroup_iter saner in the way it walks
    hierarchies. css->id based traversal is far from ideal as it is not
    deterministic because it depends on the creation ordering. In
    addition, css_id is considered a burden for the cgroup maintainers
    because it is quite some code and memcg is the last user of it. After
    this series only swap accounting uses css_id, but that one will
    follow later.

    Diffstat (if we exclude removed/added comments) looks quite
    promising. We got rid of some code:

    $ git diff mmotm... | grep -v "^[+-][[:space:]]*[/ ]\*" | diffstat
    b/include/linux/cgroup.h |  3 ---
    kernel/cgroup.c          | 33 ---------------------------------
    mm/memcontrol.c          |  4 +++-
    3 files changed, 3 insertions(+), 37 deletions(-)

    The first patch is just preparatory and it changes when we release the
    css of the previously returned memcg. Nothing controversial.

    The second patch is the core of the patchset and it replaces the
    css_id based css_get_next with the generic cgroup pre-order walk.
    This brings some challenges for the last visited group caching during
    reclaim (mem_cgroup_per_zone::reclaim_iter). We have to use memcg
    pointers
    directly now which means that we have to keep a reference to those groups'
    css to keep them alive.

    I also folded the iter_lock introduced by
    https://lkml.org/lkml/2013/1/3/295 in the previous version into this
    patch. Johannes felt the race I was
    describing should be mostly harmless and I haven't been able to trigger it
    so the lock doesn't deserve its own patch. It is still needed
    temporarily, though, because the reference counting on iter->last_visited
    depends on it. It will go away with the next patch.

    The next patch fixes an unbounded cgroup removal holdoff caused by the
    elevated css refcount. The issue was observed by Ying Han. Johannes
    wasn't impressed by the previous version of the fix
    (https://lkml.org/lkml/2013/2/8/379), which cleaned up pending
    references during mem_cgroup_css_offline when a group is removed. He
    suggested a different approach where the iterator checks whether a
    cached memcg is still valid or not. More on that in the patch, but
    the basic idea is that every memcg tracks the number of removed
    subgroups and the iterator records this number when a group is
    cached. These numbers are checked before iter->last_visited is used
    and the iteration is restarted if it is invalid.

    The fourth and fifth patches are an attempt at simplifying
    mem_cgroup_iter. css juggling is removed and the iteration logic is
    moved
    to a helper so that the reference counting and iteration are separated.

    The last patch just removes css_get_next as there is no user for it any
    longer.

    My testing looked as follows:
      A (use_hierarchy=1, limit_in_bytes=150M)
     /|\
    1 2 3

    Child groups were created so that their number was never higher than
    3 and their limits were random between 50-100M. Each group hosts a
    kernel build (starting with tar -xf so the tree is not shared, then
    make -jNUM_CPUs/3), is terminated after a random time (up to 5
    minutes) and then removed.

    This should exercise both leaf and hierarchical reclaim as well as
    races with cgroup removals, and the debugging messages I added on top
    proved that.
    100 groups were created during the test.

    This patch:

    css reference counting keeps the cgroup alive even though it has
    already been removed. mem_cgroup_iter relies on this fact and takes a
    reference to the returned group. The reference is then released on
    the next iteration or in mem_cgroup_iter_break. mem_cgroup_iter
    currently releases the reference right after it gets the last css_id.

    This is correct because neither prev's memcg nor its cgroup are
    accessed after that. This will change in the next patch, so we need
    to keep the group alive a bit longer; let's move the css_put to the
    end of the function.

    Signed-off-by: Michal Hocko
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Johannes Weiner
    Cc: Li Zefan
    Cc: Ying Han
    Cc: Tejun Heo
    Cc: Glauber Costa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

16 Apr, 2013

1 commit

  • Turn on use_hierarchy by default if sane_behavior is specified and
    don't create .use_hierarchy file.

    It is debatable whether to remove the .use_hierarchy file or make it
    read-only, as the former could make the transition easier in certain
    cases; however, the behavior changes which will be gated by
    sane_behavior are extensive, including changing the basic meaning of
    certain control knobs in a few controllers, and I don't really think
    keeping this piece would make things easier in any noticeable way, so
    let's remove it.

    v2: Explain that mem_cgroup_bind() doesn't have to worry about
    children as suggested by Michal Hocko.

    Signed-off-by: Tejun Heo
    Acked-by: Serge E. Hallyn
    Acked-by: Li Zefan
    Acked-by: Michal Hocko
    Acked-by: KAMEZAWA Hiroyuki

    Tejun Heo
     

08 Apr, 2013

1 commit

  • As cgroup supports rename, it's unsafe to dereference dentry->d_name
    without proper vfs locks. Fix this by using cgroup_name() rather than
    dentry directly.

    Also open code memcg_cache_name because it is called only from
    kmem_cache_dup which frees the returned name right after
    kmem_cache_create_memcg makes a copy of it. Such a short-lived
    allocation doesn't make much sense. So replace it with a static
    buffer, as kmem_cache_dup is called with memcg_cache_mutex held.

    Signed-off-by: Li Zefan
    Signed-off-by: Michal Hocko
    Acked-by: Glauber Costa
    Signed-off-by: Tejun Heo

    Michal Hocko
     

09 Mar, 2013

1 commit

  • Fix a warning from lockdep caused by calling cancel_work_sync() on an
    uninitialized struct work. This path is triggered by destroying a
    kmem-cache hierarchy via destroying its root kmem-cache.

    cache ffff88003c072d80
    obj ffff88003b410000 cache ffff88003c072d80
    obj ffff88003b924000 cache ffff88003c20bd40
    INFO: trying to register non-static key.
    the code is fine but needs lockdep annotation.
    turning off the locking correctness validator.
    Pid: 2825, comm: insmod Tainted: G O 3.9.0-rc1-next-20130307+ #611
    Call Trace:
    __lock_acquire+0x16a2/0x1cb0
    lock_acquire+0x8a/0x120
    flush_work+0x38/0x2a0
    __cancel_work_timer+0x89/0xf0
    cancel_work_sync+0xb/0x10
    kmem_cache_destroy_memcg_children+0x81/0xb0
    kmem_cache_destroy+0xf/0xe0
    init_module+0xcb/0x1000 [kmem_test]
    do_one_initcall+0x11a/0x170
    load_module+0x19b0/0x2320
    SyS_init_module+0xc6/0xf0
    system_call_fastpath+0x16/0x1b

    Example module to demonstrate:

    #include <linux/module.h>
    #include <linux/slab.h>
    #include <linux/mm.h>
    #include <linux/workqueue.h>

    int __init mod_init(void)
    {
            int size = 256;
            struct kmem_cache *cache;
            void *obj;
            struct page *page;

            cache = kmem_cache_create("kmem_cache_test", size, size, 0, NULL);
            if (!cache)
                    return -ENOMEM;

            printk("cache %p\n", cache);

            obj = kmem_cache_alloc(cache, GFP_KERNEL);
            if (obj) {
                    page = virt_to_head_page(obj);
                    printk("obj %p cache %p\n", obj, page->slab_cache);
                    kmem_cache_free(cache, obj);
            }

            flush_scheduled_work();

            obj = kmem_cache_alloc(cache, GFP_KERNEL);
            if (obj) {
                    page = virt_to_head_page(obj);
                    printk("obj %p cache %p\n", obj, page->slab_cache);
                    kmem_cache_free(cache, obj);
            }

            kmem_cache_destroy(cache);

            return -EBUSY;
    }

    module_init(mod_init);
    MODULE_LICENSE("GPL");

    Signed-off-by: Konstantin Khlebnikov
    Cc: Glauber Costa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     

24 Feb, 2013

5 commits

  • Whilst I run the risk of a flogging for disloyalty to the Lord of Sealand,
    I do have CONFIG_MEMCG=y CONFIG_MEMCG_KMEM not set, and grow tired of the
    "mm/memcontrol.c:4972:12: warning: `memcg_propagate_kmem' defined but not
    used [-Wunused-function]" seen in 3.8-rc: move the #ifdef outwards.

    Signed-off-by: Hugh Dickins
    Acked-by: Michal Hocko
    Cc: Glauber Costa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • We should encourage all memcg controller initialization that is
    independent of a specific mem_cgroup to be done here, rather than
    exploiting the css_alloc callback and assuming that nothing happens
    before the root cgroup is created.

    Signed-off-by: Michal Hocko
    Acked-by: Johannes Weiner
    Cc: Tejun Heo
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • memcg_stock is currently initialized during the root cgroup allocation,
    which is OK but it pointlessly pollutes the memcg allocation code with
    something that can be called when the memcg subsystem is initialized
    by mem_cgroup_init along with other controller-specific parts.

    This patch wraps the current memcg_stock initialization code into a
    helper and calls it from the controller subsystem initialization code.

    Signed-off-by: Michal Hocko
    Acked-by: Johannes Weiner
    Cc: Tejun Heo
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • The per-node-zone soft limit tree is currently initialized when the
    root cgroup is created, which is OK but it pointlessly pollutes the
    memcg allocation code with something that can be called when the
    memcg subsystem is initialized by mem_cgroup_init along with other
    controller-specific parts.

    While we are at it, let's make mem_cgroup_soft_limit_tree_init void,
    because it doesn't make much sense to report a memory failure: if we
    fail to allocate memory that early during boot then we are screwed
    anyway (this saves some code).

    Signed-off-by: Michal Hocko
    Acked-by: Johannes Weiner
    Cc: Tejun Heo
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
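
    A hedged sketch tying together the three memcg initialization changes
    described in the entries above; mem_cgroup_soft_limit_tree_init is
    named in the changelog, while the stock helper name and the exact
    initcall level are assumptions:

    static int __init mem_cgroup_init(void)
    {
            /*
             * Controller-wide setup that does not depend on any particular
             * mem_cgroup, moved out of the root-cgroup allocation path.
             */
            mem_cgroup_soft_limit_tree_init();
            memcg_stock_init();     /* wraps the per-cpu stock setup */
            return 0;
    }
    subsys_initcall(mem_cgroup_init);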
     
  • An inactive file list is considered low when its active counterpart is
    bigger, regardless of whether it is a global zone LRU list or a memcg
    zone LRU list. The only difference is in how the LRU size is assessed.

    get_lru_size() does the right thing for both global and memcg reclaim
    situations.

    Get rid of inactive_file_is_low_global() and
    mem_cgroup_inactive_file_is_low() by using get_lru_size() and
    comparing the numbers in common code.

    Signed-off-by: Johannes Weiner
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
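
    A hedged sketch of the common comparison described above;
    get_lru_size() is the helper named in the changelog, the rest is
    illustrative:

    static bool inactive_file_is_low(struct lruvec *lruvec)
    {
            unsigned long inactive = get_lru_size(lruvec, LRU_INACTIVE_FILE);
            unsigned long active   = get_lru_size(lruvec, LRU_ACTIVE_FILE);

            /*
             * Works for both the global zone LRU and a memcg zone LRU;
             * only how the sizes are obtained differs inside the helper.
             */
            return active > inactive;
    }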