30 Apr, 2013

40 commits

  • Now that the per-node-zone-priority iterator caches memory cgroups
    rather than their css ids, we have to be careful to remove them from
    the iterator when they are on the way out. Otherwise they might live
    for an unbounded amount of time even though their group is already
    gone (until global/targeted reclaim visits the zone under that
    priority, finds out the group is dead, and lets it find its final
    rest).

    We can fix this issue by relaxing rules for the last_visited memcg.
    Instead of taking a reference to the css before it is stored into
    iter->last_visited we can just store its pointer and track the number of
    removed groups from each memcg's subhierarchy.

    This number is stored in the iterator every time a memcg is cached.
    If the iterator's count doesn't match the current walk root's count,
    we start from the root again. The group counter is incremented
    upwards through the hierarchy every time a group is removed.

    The iter_lock can be dropped because racing iterators cannot leak the
    reference anymore as the reference count is not elevated for
    last_visited when it is cached.

    Locking rules got a bit more complicated by this change, though. The
    iterator primarily relies on the RCU read lock, which makes sure that
    once we see a valid last_visited pointer it stays valid for the whole
    RCU walk. smp_rmb makes sure that dead_count is read before
    last_visited and last_dead_count, while smp_wmb makes sure that
    last_visited is updated before last_dead_count, so an up-to-date
    last_dead_count cannot point to an outdated last_visited. css_tryget
    then makes sure that last_visited is still alive in case the
    iteration races with the cached group's removal (the css is
    invalidated before mem_cgroup_css_offline increments dead_count).

    In short:
    mem_cgroup_iter
        rcu_read_lock()
        dead_count = atomic_read(parent->dead_count)
        smp_rmb()
        if (dead_count != iter->last_dead_count)
            last_visited POSSIBLY INVALID -> last_visited = NULL
        if (!css_tryget(iter->last_visited))
            last_visited DEAD -> last_visited = NULL
        next = find_next(last_visited)
        css_tryget(next)
        css_put(last_visited)  // css would be invalidated and parent->dead_count
                               // incremented if this was the last reference
        iter->last_visited = next
        smp_wmb()
        iter->last_dead_count = dead_count
        rcu_read_unlock()

    cgroup_rmdir
        cgroup_destroy_locked
            atomic_add(CSS_DEACT_BIAS, &css->refcnt)  // subsequent css_tryget fails
        mem_cgroup_css_offline
            mem_cgroup_invalidate_reclaim_iterators
                while (parent = parent_mem_cgroup)
                    atomic_inc(parent->dead_count)
        css_put(css)  // last reference held by cgroup core

    Spotted by Ying Han.

    Original idea from Johannes Weiner.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Michal Hocko
    Signed-off-by: Johannes Weiner
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Ying Han
    Cc: Li Zefan
    Cc: Tejun Heo
    Cc: Glauber Costa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • mem_cgroup_iter currently relies on css->id when walking down a group
    hierarchy tree. This is really awkward because the tree walk depends
    on the groups' creation ordering. The only guarantee is that a parent
    node is visited before its children.

    Example:

    1) mkdir -p a a/d a/b/c
    2) mkdir -p a a/b/c a/d

    Will create the same trees but the tree walks will be different:

    1) a, d, b, c
    2) a, b, c, d

    Commit 574bd9f7c7c1 ("cgroup: implement generic child / descendant walk
    macros") introduced generic cgroup tree walkers which provide either
    pre-order or post-order tree walks. This patch converts the css->id
    based iteration to a pre-order tree walk to keep the semantics of the
    original iterator, where a parent is always visited before its subtree.

    cgroup_for_each_descendant_pre suggests using post_create and
    pre_destroy for proper synchronization with group addition and
    removal, respectively. This implementation doesn't use those because
    a new memory cgroup is already initialized sufficiently for iteration
    in mem_cgroup_css_alloc, and css reference counting ensures that the
    group is alive for both the last seen cgroup and the found one, or
    else it signals that the group is dead and should be skipped.

    If the reclaim cookie is used we need to store the last visited group
    in the iterator, so we have to be careful that it doesn't disappear
    in the meantime. An elevated reference count on the css keeps it
    alive even though the group has been removed (parked waiting for the
    last dput so that it can be freed).

    A per node-zone-prio iter_lock has been introduced to ensure that the
    css_tryget and the iter->last_visited update happen atomically.
    Otherwise two racing walkers could both take a reference and only one
    release it, leading to a css leak (which pins the cgroup dentry).

    Signed-off-by: Michal Hocko
    Cc: Johannes Weiner
    Cc: KAMEZAWA Hiroyuki
    Cc: Li Zefan
    Cc: Ying Han
    Cc: Tejun Heo
    Cc: Glauber Costa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • This patchset tries to make mem_cgroup_iter saner in the way it walks
    hierarchies. css->id based traversal is far from ideal as it is not
    deterministic: it depends on the creation ordering. In addition,
    css_id is considered a burden for cgroup maintainers because it is
    quite some code and memcg is the last user of it. After this series
    only the swap accounting uses css_id, but that one will follow up
    later.

    Diffstat (if we exclude removed/added comments) looks quite
    promising. We got rid of some code:

    $ git diff mmotm... | grep -v "^[+-][[:space:]]*[/ ]\*" | diffstat
    b/include/linux/cgroup.h |  3 ---
    kernel/cgroup.c          | 33 ---------------------------------
    mm/memcontrol.c          |  4 +++-
    3 files changed, 3 insertions(+), 37 deletions(-)

    The first patch is just preparatory and it changes when we release the
    css of the previously returned memcg. Nothing controversial.

    The second patch is the core of the patchset and it replaces the
    css_id based css_get_next with the generic cgroup pre-order walk.
    This brings some challenges for caching the last visited group during
    reclaim (mem_cgroup_per_zone::reclaim_iter). We have to use memcg
    pointers directly now, which means that we have to keep a reference
    to those groups' css to keep them alive.

    I also folded iter_lock introduced by https://lkml.org/lkml/2013/1/3/295
    in the previous version into this patch. Johannes felt the race I was
    describing should be mostly harmless and I haven't been able to trigger it
    so the lock doesn't deserve its own patch. It is still needed
    temporarily, though, because the reference counting on iter->last_visited
    depends on it. It will go away with the next patch.

    The next patch fixes an unbounded cgroup removal holdoff caused by
    the elevated css refcount. The issue was observed by Ying Han.
    Johannes wasn't impressed by the previous version of the fix
    (https://lkml.org/lkml/2013/2/8/379) which cleaned up pending
    references during mem_cgroup_css_offline when a group is removed. He
    suggested a different approach where the iterator checks whether a
    cached memcg is still valid or not. More on that in the patch, but
    the basic idea is that every memcg tracks the number of removed
    subgroups and the iterator records this number when a group is
    cached. These numbers are checked before iter->last_visited is about
    to be used, and the iteration is restarted if it is invalid.

    The fourth and fifth patches are an attempt to simplify
    mem_cgroup_iter. The css juggling is removed and the iteration logic
    is moved to a helper so that the reference counting and iteration are
    separated.

    The last patch just removes css_get_next as there is no user for it any
    longer.

    My testing looked as follows:

            A (use_hierarchy=1, limit_in_bytes=150M)
           /|\
          1 2 3

    Children groups were created so that their number was never higher
    than 3 and their limits were random between 50-100M. Each group
    hosted a kernel build (starting with tar -xf so the tree is not
    shared, then make -jNUM_CPUs/3), was terminated after a random time
    (up to 5 minutes), and then removed.

    This should exercise both leaf and hierarchical reclaim as well as races
    with cgroup removals and debugging messages I added on top proved that.
    100 groups were created during the test.

    This patch:

    css reference counting keeps the cgroup alive even after it has
    already been removed. mem_cgroup_iter relies on this fact and takes
    a reference to the returned group. The reference is then released on
    the next iteration or in mem_cgroup_iter_break. mem_cgroup_iter
    currently releases the reference right after it gets the last css_id.

    This is correct because neither prev's memcg nor its cgroup is
    accessed after that point. This will change in the next patch, so we
    need to keep the group alive a bit longer; let's move the css_put to
    the end of the function.

    Signed-off-by: Michal Hocko
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Johannes Weiner
    Cc: Li Zefan
    Cc: Ying Han
    Cc: Tejun Heo
    Cc: Glauber Costa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Use helper function free_highmem_page() to free highmem pages into
    the buddy system.

    Signed-off-by: Jiang Liu
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Cong Wang
    Cc: Yinghai Lu
    Cc: Attilio Rao
    Cc: Konrad Rzeszutek Wilk
    Reviewed-by: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiang Liu
     
  • Use helper function free_highmem_page() to free highmem pages into
    the buddy system.

    Signed-off-by: Jiang Liu
    Cc: Jeff Dike
    Cc: Richard Weinberger
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiang Liu
     
  • Use helper function free_highmem_page() to free highmem pages into
    the buddy system.

    Signed-off-by: Jiang Liu
    Cc: "David S. Miller"
    Acked-by: Sam Ravnborg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiang Liu
     
  • Use helper function free_highmem_page() to free highmem pages into
    the buddy system.

    Signed-off-by: Jiang Liu
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Jiang Liu
    Cc: Alexander Graf
    Cc: "Suzuki K. Poulose"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiang Liu
     
  • Use helper function free_highmem_page() to free highmem pages into
    the buddy system.

    Signed-off-by: Jiang Liu
    Cc: Ralf Baechle
    Cc: David Daney
    Cc: Cong Wang
    Cc: Arnd Bergmann
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiang Liu
     
  • Use helper function free_highmem_page() to free highmem pages into
    the buddy system.

    Signed-off-by: Jiang Liu
    Cc: Michal Simek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiang Liu
     
  • Use helper function free_highmem_page() to free highmem pages into
    the buddy system.

    Signed-off-by: Jiang Liu
    Cc: James Hogan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiang Liu
     
  • Use helper function free_highmem_page() to free highmem pages into
    the buddy system.

    Also fix a bug where totalhigh_pages was not increased when freeing
    a highmem page into the buddy system.

    Signed-off-by: Jiang Liu
    Cc: David Howells
    Cc: Geert Uytterhoeven
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiang Liu
     
  • Use helper function free_highmem_page() to free highmem pages into
    the buddy system.

    Signed-off-by: Jiang Liu
    Cc: Russell King
    Cc: Linus Walleij
    Cc: Marek Szyprowski
    Cc: Stephen Boyd
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiang Liu
     
  • The original goal of this patchset is to fix the bug reported by

    https://bugzilla.kernel.org/show_bug.cgi?id=53501

    Now it has also been expanded to reduce common code used for memory
    initialization.

    This is the second part, which applies to the previous part at:
    http://marc.info/?l=linux-mm&m=136289696323825&w=2

    It introduces a helper function free_highmem_page() to free highmem
    pages into the buddy system when initializing mm subsystem.
    Introduction of free_highmem_page() is one step toward cleaning up
    accesses and modifications of totalhigh_pages, totalram_pages and
    zone->managed_pages etc. I hope we can remove all references to
    totalhigh_pages from the arch/ subdirectory.

    We have only tested this patchset on x86 platforms and have done
    basic compilation tests using cross-compilers from ftp.kernel.org.
    That means some code may not compile on some architectures, so any
    help testing this patchset is welcome!

    There are several other parts still under development:
    Part 3: refine code to manage totalram_pages, totalhigh_pages and
            zone->managed_pages
    Part 4: introduce helper functions to simplify mem_init() and remove
            the global variable num_physpages.

    This patch:

    Introduce helper function free_highmem_page(), which will be used by
    architectures with HIGHMEM enabled to free highmem pages into the buddy
    system.

    Signed-off-by: Jiang Liu
    Cc: "David S. Miller"
    Cc: "H. Peter Anvin"
    Cc: "Suzuki K. Poulose"
    Cc: Alexander Graf
    Cc: Arnd Bergmann
    Cc: Attilio Rao
    Cc: Benjamin Herrenschmidt
    Cc: Cong Wang
    Cc: David Daney
    Cc: David Howells
    Cc: Geert Uytterhoeven
    Cc: Ingo Molnar
    Cc: James Hogan
    Cc: Jeff Dike
    Cc: Jiang Liu
    Cc: Konrad Rzeszutek Wilk
    Cc: Konstantin Khlebnikov
    Cc: Linus Walleij
    Cc: Marek Szyprowski
    Cc: Mel Gorman
    Cc: Michal Nazarewicz
    Cc: Michal Simek
    Cc: Michel Lespinasse
    Cc: Minchan Kim
    Cc: Paul Mackerras
    Cc: Ralf Baechle
    Cc: Richard Weinberger
    Cc: Rik van Riel
    Cc: Russell King
    Cc: Sam Ravnborg
    Cc: Stephen Boyd
    Cc: Thomas Gleixner
    Cc: Yinghai Lu
    Reviewed-by: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiang Liu
     
  • Use common helper functions to free reserved pages.

    Signed-off-by: Jiang Liu
    Cc: Eric Biederman
    Reviewed-by: Zhang Yanfei
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiang Liu
     
  • Use common helper functions to free reserved pages.

    Signed-off-by: Jiang Liu
    Cc: James Hogan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiang Liu
     
  • Use common helper functions to free reserved pages.

    Signed-off-by: Jiang Liu
    Acked-by: Vineet Gupta
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiang Liu
     
  • Use common helper functions to free reserved pages.

    Signed-off-by: Jiang Liu
    Cc: Chris Zankel
    Cc: Max Filippov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiang Liu
     
  • Use common helper functions to free reserved pages.

    Signed-off-by: Jiang Liu
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiang Liu
     
  • Use common helper functions to free reserved pages.

    Signed-off-by: Jiang Liu
    Cc: Guan Xuetao
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiang Liu
     
  • Use common helper functions to free reserved pages.

    Signed-off-by: Jiang Liu
    Cc: Jeff Dike
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiang Liu
     
  • Use common helper functions to free reserved pages.

    Signed-off-by: Jiang Liu
    Acked-by: Paul Mundt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiang Liu
     
  • Use common helper functions to free reserved pages.

    Signed-off-by: Jiang Liu
    Cc: Chen Liqin
    Cc: Lennox Wu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiang Liu
     
  • Use common helper functions to free reserved pages.

    Signed-off-by: Jiang Liu
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiang Liu
     
  • Use common helper functions to free reserved pages.

    Signed-off-by: Jiang Liu
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Anatolij Gustschin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiang Liu
     
  • Use common helper functions to free reserved pages.

    Signed-off-by: Jiang Liu
    Cc: "James E.J. Bottomley"
    Cc: Helge Deller
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiang Liu
     
  • Use common helper functions to free reserved pages.
    Also include the relevant header to avoid local declarations.

    Signed-off-by: Jiang Liu
    Cc: Jonas Bonn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiang Liu
     
  • Use common helper functions to free reserved pages.

    Signed-off-by: Jiang Liu
    Cc: Koichi Yasutake
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiang Liu
     
  • Use common helper functions to free reserved pages.

    Signed-off-by: Jiang Liu
    Cc: Ralf Baechle
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiang Liu
     
  • Use common helper functions to free reserved pages.

    Signed-off-by: Jiang Liu
    Cc: Michal Simek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiang Liu
     
  • Use common helper functions to free reserved pages.

    Signed-off-by: Jiang Liu
    Acked-by: Geert Uytterhoeven
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiang Liu
     
  • Use common helper functions to free reserved pages.
    Also include the relevant header to avoid local declarations.

    Signed-off-by: Jiang Liu
    Cc: Hirokazu Takata
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiang Liu
     
  • Use common helper functions to free reserved pages.

    Signed-off-by: Jiang Liu
    Cc: Tony Luck
    Cc: Fenghua Yu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiang Liu
     
  • Use common helper functions to free reserved pages.

    Signed-off-by: Jiang Liu
    Cc: Yoshinori Sato
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiang Liu
     
  • Use common helper functions to free reserved pages.

    Signed-off-by: Jiang Liu
    Cc: David Howells
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiang Liu
     
  • Use common helper functions to free reserved pages.
    Also include the relevant header to avoid a local declaration.

    Signed-off-by: Jiang Liu
    Cc: Mikael Starvik
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiang Liu
     
  • Use common helper functions to free reserved pages.

    Signed-off-by: Jiang Liu
    Cc: Mark Salter
    Cc: Aurelien Jacquiot
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiang Liu
     
  • Use common helper functions to free reserved pages.

    Signed-off-by: Jiang Liu
    Cc: Mike Frysinger
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiang Liu
     
  • Use common helper functions to free reserved pages.

    Signed-off-by: Jiang Liu
    Acked-by: Hans-Christian Egtvedt
    Cc: Haavard Skinnemoen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiang Liu
     
  • Use common helper functions to free reserved pages.

    Signed-off-by: Jiang Liu
    Cc: Russell King
    Cc: Catalin Marinas
    Cc: Will Deacon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiang Liu
     
  • Use common helper functions to free reserved pages. Also include
    the relevant header to avoid local declarations.

    Signed-off-by: Jiang Liu
    Cc: Richard Henderson
    Cc: Ivan Kokshaysky
    Cc: Matt Turner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiang Liu