30 Apr, 2013
40 commits
-
Now that the per-node-zone-priority iterator caches memory cgroups
rather than their css ids, we have to be careful to remove them from the
iterator when they are on the way out. Otherwise they might live for an
unbounded amount of time even though their group is already gone (until
global/targeted reclaim visits the zone at that priority, finds out
that the group is dead and lets it go to its final rest).

We can fix this issue by relaxing the rules for the last_visited memcg.
Instead of taking a reference to the css before it is stored into
iter->last_visited, we can just store its pointer and track the number of
removed groups in each memcg's subhierarchy.

This number is stored into the iterator every time a memcg is
cached. If the iterator's count doesn't match the current walker root's
one, we will start from the root again. The group counter is incremented
upwards through the hierarchy every time a group is removed.

The iter_lock can be dropped because racing iterators cannot leak the
reference anymore, as the reference count is not elevated for
last_visited when it is cached.

Locking rules got a bit complicated by this change though. The iterator
primarily relies on the RCU read lock, which makes sure that once we see a
valid last_visited pointer it will be valid for the whole RCU walk.
smp_rmb makes sure that dead_count is read before last_visited and
last_dead_count, while smp_wmb makes sure that last_visited is updated
before last_dead_count so the up-to-date last_dead_count cannot point to
an outdated last_visited. css_tryget then makes sure that the
last_visited is still alive in case the iteration races with the cached
group removal (css is invalidated before mem_cgroup_css_offline
increments dead_count).

In short:
mem_cgroup_iter
 rcu_read_lock()
 dead_count = atomic_read(parent->dead_count)
 smp_rmb()
 if (dead_count != iter->last_dead_count)
	last_visited POSSIBLY INVALID -> last_visited = NULL
 if (!css_tryget(iter->last_visited))
	last_visited DEAD -> last_visited = NULL
 next = find_next(last_visited)
 css_tryget(next)
 css_put(last_visited)	// css would be invalidated and parent->dead_count
			// incremented if this was the last reference
 iter->last_visited = next
 smp_wmb()
 iter->last_dead_count = dead_count
 rcu_read_unlock()

cgroup_rmdir
 cgroup_destroy_locked
  atomic_add(CSS_DEACT_BIAS, &css->refcnt) // subsequent css_tryget fail
   mem_cgroup_css_offline
    mem_cgroup_invalidate_reclaim_iterators
     while(parent = parent_mem_cgroup)
	atomic_inc(parent->dead_count)
  css_put(css) // last reference held by cgroup core

Spotted by Ying Han.
Original idea from Johannes Weiner.
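
The validation scheme described above can be modelled in userspace C. This
is only a sketch under stated assumptions, not the kernel implementation:
struct group, struct reclaim_iter, iter_load(), iter_store() and
group_remove() are hypothetical names, C11 release/acquire atomics stand in
for smp_wmb/smp_rmb, and a plain "alive" flag stands in for css_tryget
succeeding.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a memory cgroup (hypothetical, not the kernel struct). */
struct group {
	struct group *parent;
	atomic_int dead_count;	/* number of groups removed below this one */
	atomic_bool alive;	/* models whether css_tryget() would succeed */
};

/* Toy stand-in for the cached reclaim position. */
struct reclaim_iter {
	struct group *_Atomic last_visited;
	atomic_int last_dead_count;
};

static void group_init(struct group *g, struct group *parent)
{
	assert(g != NULL);
	g->parent = parent;
	atomic_init(&g->dead_count, 0);
	atomic_init(&g->alive, true);
}

static void iter_init(struct reclaim_iter *iter)
{
	atomic_init(&iter->last_visited, NULL);
	atomic_init(&iter->last_dead_count, 0);
}

/*
 * Return the cached position, or NULL if the walk has to restart from the
 * root. The root's dead_count snapshot is handed back through *dead_count
 * so that it can be recorded later by iter_store().
 */
static struct group *iter_load(struct reclaim_iter *iter, struct group *root,
			       int *dead_count)
{
	struct group *pos = NULL;

	*dead_count = atomic_load_explicit(&root->dead_count,
					   memory_order_acquire);
	/* The acquire load pairs with the release store in iter_store(). */
	if (*dead_count == atomic_load_explicit(&iter->last_dead_count,
						memory_order_acquire)) {
		pos = atomic_load_explicit(&iter->last_visited,
					   memory_order_relaxed);
		if (pos && !atomic_load(&pos->alive))
			pos = NULL;	/* cached group is dead */
	}
	return pos;
}

/* Publish the new position first, then the counter that validates it. */
static void iter_store(struct reclaim_iter *iter, struct group *next,
		       int dead_count)
{
	atomic_store_explicit(&iter->last_visited, next, memory_order_relaxed);
	atomic_store_explicit(&iter->last_dead_count, dead_count,
			      memory_order_release);
}

/* Mark the group dead, then bump dead_count in every ancestor. */
static void group_remove(struct group *g)
{
	atomic_store(&g->alive, false);
	for (struct group *p = g->parent; p; p = p->parent)
		atomic_fetch_add(&p->dead_count, 1);
}
```

In this model a walker calls iter_load() at the start of an iteration and
iter_store() with the same dead_count snapshot at the end. Any removal in
between makes the next iter_load() return NULL, so the walk restarts from
the root instead of trusting a stale pointer.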
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Michal Hocko
Signed-off-by: Johannes Weiner
Acked-by: KAMEZAWA Hiroyuki
Cc: Ying Han
Cc: Li Zefan
Cc: Tejun Heo
Cc: Glauber Costa
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
mem_cgroup_iter currently relies on css->id when walking down a group
hierarchy tree. This is really awkward because the tree walk depends on
the groups' creation ordering. The only guarantee is that a parent node is
visited before its children.

Example:
 1) mkdir -p a a/d a/b/c
 2) mkdir -p a/b/c a/d

Both create the same tree but the tree walks will be different:
 1) a, d, b, c
 2) a, b, c, d

Commit 574bd9f7c7c1 ("cgroup: implement generic child / descendant walk
macros") has introduced generic cgroup tree walkers which provide either
pre-order or post-order tree walk. This patch converts css->id based
iteration to pre-order tree walk to keep the semantics of the original
iterator, where a parent is always visited before its subtree.

cgroup_for_each_descendant_pre suggests using post_create and
pre_destroy for proper synchronization with group addition and
removal. This implementation doesn't use those because a new memory
cgroup is initialized sufficiently for iteration in mem_cgroup_css_alloc
already, and css reference counting either guarantees that the group is
alive (for both the last seen cgroup and the found one) or signals that
the group is dead and should be skipped.

If the reclaim cookie is used we need to store the last visited group
into the iterator so we have to be careful that it doesn't disappear in
the meantime. An elevated reference count on the css keeps it alive even
though the group has been removed (parked waiting for the last dput so
that it can be freed).

A per node-zone-prio iter_lock has been introduced to ensure that
css_tryget and the update of iter->last_visited happen atomically.
Otherwise two racing walkers could both take a reference and only one
release it, leading to a css leak (which pins the cgroup dentry).

Signed-off-by: Michal Hocko
Cc: Johannes Weiner
Cc: KAMEZAWA Hiroyuki
Cc: Li Zefan
Cc: Ying Han
Cc: Tejun Heo
Cc: Glauber Costa
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
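
The pre-order guarantee described in the css->id conversion above (a parent
is always visited before its subtree) can be sketched in userspace C. This
is an illustrative model only: struct node, node_add(), next_pre() and
walk() are hypothetical names and are not the kernel's cgroup walkers.
Siblings are kept in the order in which they were added.

```c
#include <assert.h>
#include <stddef.h>

/* Toy hierarchy node (hypothetical, not a kernel structure). */
struct node {
	char name;
	struct node *parent;
	struct node *first_child;
	struct node *next_sibling;
};

/* Initialize child and append it as the last child of parent. */
static void node_add(struct node *parent, struct node *child, char name)
{
	struct node **link = &parent->first_child;

	assert(parent != NULL && child != NULL);
	child->name = name;
	child->parent = parent;
	child->first_child = NULL;
	child->next_sibling = NULL;
	while (*link)
		link = &(*link)->next_sibling;
	*link = child;
}

/*
 * Return the node that follows pos in a pre-order walk of the subtree
 * rooted at root, or NULL when the walk is complete. Passing pos == NULL
 * starts the walk at root.
 */
static struct node *next_pre(struct node *pos, struct node *root)
{
	if (!pos)
		return root;
	/* Descend first: a parent is visited before its subtree. */
	if (pos->first_child)
		return pos->first_child;
	/* Otherwise take the next sibling, climbing up as needed. */
	while (pos != root) {
		if (pos->next_sibling)
			return pos->next_sibling;
		pos = pos->parent;
	}
	return NULL;
}

/* Write the visited names into out (NUL-terminated); return the count. */
static size_t walk(struct node *root, char *out, size_t len)
{
	size_t n = 0;

	for (struct node *pos = next_pre(NULL, root); pos && n + 1 < len;
	     pos = next_pre(pos, root))
		out[n++] = pos->name;
	if (len)
		out[n] = '\0';
	return n;
}
```

Because next_pre() only needs the current position and the root, a walker
can stop after any node and resume later from the remembered position.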
The patchset tries to make mem_cgroup_iter saner in the way it walks
hierarchies. css->id based traversal is far from ideal as it is not
deterministic because it depends on the creation ordering. In addition,
css_id is considered a burden by cgroup maintainers because it is
quite some code and memcg is the last user of it. After this series only
the swap accounting uses css_id but that one will follow up later.

Diffstat (if we exclude removed/added comments) looks quite
promising. We got rid of some code:

$ git diff mmotm... | grep -v "^[+-][[:space:]]*[/ ]\*" | diffstat
 b/include/linux/cgroup.h |    3 ---
 kernel/cgroup.c          |   33 ---------------------------------
 mm/memcontrol.c          |    4 +++-
 3 files changed, 3 insertions(+), 37 deletions(-)

The first patch is just preparatory and it changes when we release the css
of the previously returned memcg. Nothing controversial.

The second patch is the core of the patchset and it replaces css_get_next
based on css_id by the generic cgroup pre-order walk. This brings some
challenges for the last visited group caching during reclaim
(mem_cgroup_per_zone::reclaim_iter). We have to use memcg pointers
directly now, which means that we have to keep a reference to those groups'
css to keep them alive.

I also folded the iter_lock introduced by https://lkml.org/lkml/2013/1/3/295
in the previous version into this patch. Johannes felt the race I was
describing should be mostly harmless and I haven't been able to trigger it
so the lock doesn't deserve its own patch. It is still needed
temporarily, though, because the reference counting on iter->last_visited
depends on it. It will go away with the next patch.

The next patch fixes an unbounded cgroup removal holdoff caused by the
elevated css refcount. The issue has been observed by Ying Han. Johannes
wasn't impressed by the previous version of the fix
(https://lkml.org/lkml/2013/2/8/379) which cleaned up pending references
during mem_cgroup_css_offline when a group is removed. He suggested a
different approach in which the iterator checks whether a cached memcg is
still valid. More on that in the patch, but the basic idea is that every
memcg tracks the number of removed subgroups and the iterator records this
number when a group is cached. These numbers are checked before
iter->last_visited is about to be used and the iteration is restarted if
it is invalid.

The fourth and fifth patches are an attempt to simplify
mem_cgroup_iter. css juggling is removed and the iteration logic is moved
to a helper so that the reference counting and iteration are separated.

The last patch just removes css_get_next as there is no user for it any
longer.

My testing looked as follows:

        A (use_hierarchy=1, limit_in_bytes=150M)
       /|\
      1 2 3

Child groups were created so that their number is never higher than 3 and
their limits were random between 50-100M. Each group hosts a kernel build
(starting with tar -xf so the tree is not shared, then make -jNUM_CPUs/3)
which is terminated after a random time (up to 5 minutes), and then the
group is removed.

This should exercise both leaf and hierarchical reclaim as well as races
with cgroup removals, and debugging messages I added on top proved that.
100 groups were created during the test.

This patch:

css reference counting keeps the cgroup alive even though it has already
been removed. mem_cgroup_iter relies on this fact and takes a
reference to the returned group. The reference is then released on the
next iteration or by mem_cgroup_iter_break. mem_cgroup_iter currently
releases the reference right after it gets the last css_id.

This is correct because neither prev's memcg nor its cgroup is accessed
after that point. This will change in the next patch so we need to keep
the group alive a bit longer, so let's move the css_put to the end of the
function.

Signed-off-by: Michal Hocko
Acked-by: KAMEZAWA Hiroyuki
Cc: Johannes Weiner
Cc: Li Zefan
Cc: Ying Han
Cc: Tejun Heo
Cc: Glauber Costa
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Use helper function free_highmem_page() to free highmem pages into
the buddy system.

Signed-off-by: Jiang Liu
Cc: Thomas Gleixner
Cc: Ingo Molnar
Cc: "H. Peter Anvin"
Cc: Cong Wang
Cc: Yinghai Lu
Cc: Attilio Rao
Cc: Konrad Rzeszutek Wilk
Reviewed-by: Pekka Enberg
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Use helper function free_highmem_page() to free highmem pages into
the buddy system.

Signed-off-by: Jiang Liu
Cc: Jeff Dike
Cc: Richard Weinberger
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Use helper function free_highmem_page() to free highmem pages into
the buddy system.

Signed-off-by: Jiang Liu
Cc: "David S. Miller"
Acked-by: Sam Ravnborg
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Use helper function free_highmem_page() to free highmem pages into
the buddy system.

Signed-off-by: Jiang Liu
Cc: Benjamin Herrenschmidt
Cc: Paul Mackerras
Cc: Jiang Liu
Cc: Alexander Graf
Cc: "Suzuki K. Poulose"
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Use helper function free_highmem_page() to free highmem pages into
the buddy system.

Signed-off-by: Jiang Liu
Cc: Ralf Baechle
Cc: David Daney
Cc: Cong Wang
Cc: Arnd Bergmann
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Use helper function free_highmem_page() to free highmem pages into
the buddy system.

Signed-off-by: Jiang Liu
Cc: Michal Simek
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Use helper function free_highmem_page() to free highmem pages into
the buddy system.

Signed-off-by: Jiang Liu
Cc: James Hogan
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Use helper function free_highmem_page() to free highmem pages into
the buddy system.

Also fix a bug: totalhigh_pages should be increased when freeing
a highmem page into the buddy system.

Signed-off-by: Jiang Liu
Cc: David Howells
Cc: Geert Uytterhoeven
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Use helper function free_highmem_page() to free highmem pages into
the buddy system.

Signed-off-by: Jiang Liu
Cc: Russell King
Cc: Linus Walleij
Cc: Marek Szyprowski
Cc: Stephen Boyd
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
The original goal of this patchset is to fix the bug reported by
https://bugzilla.kernel.org/show_bug.cgi?id=53501
Now it has also been expanded to reduce common code used by memory
initialization.

This is the second part, which applies on top of the previous part at:
http://marc.info/?l=linux-mm&m=136289696323825&w=2

It introduces a helper function free_highmem_page() to free highmem
pages into the buddy system when initializing the mm subsystem.
Introduction of free_highmem_page() is one step forward to clean up
accesses and modifications of totalhigh_pages, totalram_pages and
zone->managed_pages etc. I hope we could remove all references to
totalhigh_pages from the arch/ subdirectory.

We have only tested this patchset on x86 platforms, and have done basic
compilation tests using cross-compilers from ftp.kernel.org. That means
some code may not pass compilation on some architectures. So any help
to test this patchset is welcome!

There are several other parts still under development:
Part3: refine code to manage totalram_pages, totalhigh_pages and
       zone->managed_pages
Part4: introduce helper functions to simplify mem_init() and remove the
       global variable num_physpages.

This patch:

Introduce helper function free_highmem_page(), which will be used by
architectures with HIGHMEM enabled to free highmem pages into the buddy
system.

Signed-off-by: Jiang Liu
Cc: "David S. Miller"
Cc: "H. Peter Anvin"
Cc: "Suzuki K. Poulose"
Cc: Alexander Graf
Cc: Arnd Bergmann
Cc: Attilio Rao
Cc: Benjamin Herrenschmidt
Cc: Cong Wang
Cc: David Daney
Cc: David Howells
Cc: Geert Uytterhoeven
Cc: Ingo Molnar
Cc: James Hogan
Cc: Jeff Dike
Cc: Jiang Liu
Cc: Jiang Liu
Cc: Konrad Rzeszutek Wilk
Cc: Konstantin Khlebnikov
Cc: Linus Walleij
Cc: Marek Szyprowski
Cc: Mel Gorman
Cc: Michal Nazarewicz
Cc: Michal Simek
Cc: Michel Lespinasse
Cc: Minchan Kim
Cc: Paul Mackerras
Cc: Ralf Baechle
Cc: Richard Weinberger
Cc: Rik van Riel
Cc: Russell King
Cc: Sam Ravnborg
Cc: Stephen Boyd
Cc: Thomas Gleixner
Cc: Yinghai Lu
Reviewed-by: Pekka Enberg
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
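
The role of free_highmem_page() described above can be modelled in a few
lines of userspace C. This is a toy sketch under stated assumptions and not
the kernel code: struct page here is a one-field stand-in, the two counters
are ordinary variables that merely borrow the kernel's names, and the
_model function names are hypothetical. It shows why a single helper makes
it hard to forget the totalhigh_pages update.

```c
#include <assert.h>
#include <stdbool.h>

/* Plain variables standing in for the kernel counters of the same names. */
static unsigned long totalram_pages;
static unsigned long totalhigh_pages;

/* One-field stand-in for the kernel's struct page. */
struct page {
	bool reserved;
};

/* Models handing one reserved page over to the page allocator. */
static void free_reserved_page_model(struct page *page)
{
	assert(page->reserved);
	page->reserved = false;
}

/*
 * Models the helper: release the page and account for it in both
 * counters in one place, so that callers cannot skip either update.
 */
static void free_highmem_page_model(struct page *page)
{
	free_reserved_page_model(page);
	totalram_pages++;
	totalhigh_pages++;
}
```

With the accounting centralized like this, per-architecture init code only
has to call the helper for each highmem page instead of open-coding the
counter updates.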
Use common help functions to free reserved pages.
Signed-off-by: Jiang Liu
Cc: Eric Biederman
Reviewed-by: Zhang Yanfei
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Use common help functions to free reserved pages.
Signed-off-by: Jiang Liu
Cc: James Hogan
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Use common help functions to free reserved pages.
Signed-off-by: Jiang Liu
Acked-by: Vineet Gupta
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Use common help functions to free reserved pages.
Signed-off-by: Jiang Liu
Cc: Chris Zankel
Cc: Max Filippov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Use common help functions to free reserved pages.
Signed-off-by: Jiang Liu
Cc: Thomas Gleixner
Cc: Ingo Molnar
Cc: "H. Peter Anvin"
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Use common help functions to free reserved pages.
Signed-off-by: Jiang Liu
Cc: Guan Xuetao
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Use common help functions to free reserved pages.
Signed-off-by: Jiang Liu
Cc: Jeff Dike
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Use common help functions to free reserved pages.
Signed-off-by: Jiang Liu
Acked-by: Paul Mundt
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Use common help functions to free reserved pages.
Signed-off-by: Jiang Liu
Cc: Chen Liqin
Cc: Lennox Wu
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Use common help functions to free reserved pages.
Signed-off-by: Jiang Liu
Cc: Martin Schwidefsky
Cc: Heiko Carstens
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Use common help functions to free reserved pages.
Signed-off-by: Jiang Liu
Cc: Benjamin Herrenschmidt
Cc: Paul Mackerras
Cc: Anatolij Gustschin
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Use common help functions to free reserved pages.
Signed-off-by: Jiang Liu
Cc: "James E.J. Bottomley"
Cc: Helge Deller
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Use common help functions to free reserved pages.
Also include to avoid local declarations.

Signed-off-by: Jiang Liu
Cc: Jonas Bonn
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Use common help functions to free reserved pages.
Signed-off-by: Jiang Liu
Cc: Koichi Yasutake
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Use common help functions to free reserved pages.
Signed-off-by: Jiang Liu
Cc: Ralf Baechle
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Use common help functions to free reserved pages.
Signed-off-by: Jiang Liu
Cc: Michal Simek
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Use common help functions to free reserved pages.
Signed-off-by: Jiang Liu
Acked-by: Geert Uytterhoeven
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Use common help functions to free reserved pages.
Also include to avoid local declarations.

Signed-off-by: Jiang Liu
Cc: Hirokazu Takata
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Use common help functions to free reserved pages.
Signed-off-by: Jiang Liu
Cc: Tony Luck
Cc: Fenghua Yu
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Use common help functions to free reserved pages.
Signed-off-by: Jiang Liu
Cc: Yoshinori Sato
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Use common help functions to free reserved pages.
Signed-off-by: Jiang Liu
Cc: David Howells
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Use common help functions to free reserved pages.
Also include to avoid local declaration.

Signed-off-by: Jiang Liu
Cc: Mikael Starvik
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Use common help functions to free reserved pages.
Signed-off-by: Jiang Liu
Cc: Mark Salter
Cc: Aurelien Jacquiot
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Use common help functions to free reserved pages.
Signed-off-by: Jiang Liu
Cc: Mike Frysinger
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Use common help functions to free reserved pages.
Signed-off-by: Jiang Liu
Acked-by: Hans-Christian Egtvedt
Cc: Haavard Skinnemoen
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Use common help functions to free reserved pages.
Signed-off-by: Jiang Liu
Cc: Russell King
Cc: Catalin Marinas
Cc: Will Deacon
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Use common help functions to free reserved pages. Also include
to avoid local declarations.

Signed-off-by: Jiang Liu
Cc: Richard Henderson
Cc: Ivan Kokshaysky
Cc: Matt Turner
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds