09 Apr, 2008

1 commit

  • This should be N_NORMAL_MEMORY.

    N_NORMAL_MEMORY is "true" if a node has memory for the kernel. N_HIGH_MEMORY
    is "true" if a node has memory for HIGHMEM. (If CONFIG_HIGHMEM=n, always
    "true")

    This check is used for testing whether we can use kmalloc_node() on a node.
    Then, if there is a node which only contains HIGHMEM, the system will call
    kmalloc_node() on a node which has no memory for the kernel. If that happens
    under SLUB, the kernel will panic. I think this only happens on x86_32-numa.
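
    A minimal sketch of the intended test (node_state() and the node state masks
    are the standard nodemask helpers; nid, size and p are illustrative locals):

    if (node_state(nid, N_NORMAL_MEMORY))
            /* the node has memory for the kernel: kmalloc_node() is safe */
            p = kmalloc_node(size, GFP_KERNEL, nid);
    else
            /* HIGHMEM-only node: let the allocator pick a suitable node */
            p = kmalloc(size, GFP_KERNEL);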

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Cc: Pavel Emelyanov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     

05 Apr, 2008

1 commit

  • A boot option for the memory controller was discussed on lkml. It is a good
    idea to add it, since it saves memory for people who want to turn off the
    memory controller.

    By default the option is on for the following two reasons:

    1. It provides compatibility with the current scheme where the memory
    controller turns on if the config option is enabled
    2. It allows for wider testing of the memory controller, once the config
    option is enabled

    We still allow the create and destroy callbacks to succeed, since they are
    not aware of boot options. We do not populate the directory with memory
    resource controller-specific files.
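
    For reference, in mainline kernels the switch ended up on the kernel command
    line; assuming the cgroup_disable= form used there (an assumption, not
    spelled out in this log), turning the controller off at boot looks like:

    cgroup_disable=memory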

    Signed-off-by: Balbir Singh
    Cc: Paul Menage
    Cc: Balbir Singh
    Cc: Pavel Emelyanov
    Cc: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Cc: Sudhir Kumar
    Cc: YAMAMOTO Takashi
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Balbir Singh
     

20 Mar, 2008

1 commit

  • The check t->tgid == t->pid is not the blessed way to check whether a task
    is a group leader.

    This is not only about code beauty, but about the pid namespace fixes -
    both the tgid and the pid fields on the task_struct are (slowly :( )
    becoming deprecated.

    Besides, the thread_group_leader() macro makes only one dereference :)
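
    The change amounts to this kind of substitution (a sketch; the surrounding
    memcontrol code is omitted):

    /* before: open-coded group-leader test on soon-to-be-deprecated fields */
    if (p->tgid == p->pid)
            ...

    /* after: the blessed helper, a single dereference */
    if (thread_group_leader(p))
            ...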

    Signed-off-by: Pavel Emelyanov
    Cc: Balbir Singh
    Cc: KAMEZAWA Hiroyuki
    Cc: Paul Menage
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Emelyanov
     

05 Mar, 2008

12 commits

  • While testing force_empty, during an exit_mmap, __mem_cgroup_remove_list
    called from mem_cgroup_uncharge_page oopsed on a NULL pointer in the lru list.
    I couldn't see what racing tasks on other cpus were doing, but surmise that
    another must have been in mem_cgroup_charge_common on the same page, between
    its unlock_page_cgroup and spin_lock_irqsave near done (thanks to that kzalloc
    which I'd almost changed to a kmalloc).

    Normally such a race cannot happen, the ref_cnt prevents it, the final
    uncharge cannot race with the initial charge. But force_empty buggers the
    ref_cnt, that's what it's all about; and thereafter forced pages are
    vulnerable to races such as this (just think of a shared page also mapped into
    an mm of another mem_cgroup than that just emptied). And they remain
    vulnerable until they're freed, which may be indefinitely later.

    This patch just fixes the oops by moving the unlock_page_cgroups down below
    adding to and removing from the list (only possible given the previous patch);
    and while we're at it, we might as well make it an invariant that
    page->page_cgroup is always set while pc is on lru.

    But this behaviour of force_empty seems highly unsatisfactory to me: why have
    a ref_cnt if we always have to cope with it being violated (as in the earlier
    page migration patch)? We may prefer force_empty to move pages to an orphan
    mem_cgroup (could be the root, but better not), from which other cgroups could
    recover them; we might need to reverse the locking again; but no time now for
    such concerns.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
    As for force_empty, though this may not be the main topic here,
    mem_cgroup_force_empty_list() can be implemented more simply. It is possible
    to make the function just call mem_cgroup_uncharge_page() instead of
    releasing page_cgroups by itself. The trick is to call get_page() before
    invoking mem_cgroup_uncharge_page(), so the page won't be released during
    this function.
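
    A sketch of that shape of loop body (illustrative; the real list walking and
    locking in mem_cgroup_force_empty_list() are not shown):

    page = pc->page;
    get_page(page);                  /* pin the page so uncharge cannot free it */
    mem_cgroup_uncharge_page(page);  /* releases the page_cgroup for us */
    put_page(page);                  /* drop the temporary reference */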

    Kamezawa-san points out that by the time mem_cgroup_uncharge_page() uncharges,
    the page might have been reassigned to an lru of a different mem_cgroup, and
    now be emptied from that; but Hugh claims that's okay, the end state is the
    same as when it hasn't gone to another list.

    And once force_empty stops taking lock_page_cgroup within mz->lru_lock,
    mem_cgroup_move_lists() can be simplified to take mz->lru_lock directly while
    holding page_cgroup lock (but still has to use try_lock_page_cgroup).

    Signed-off-by: Hirokazu Takahashi
    Signed-off-by: Hugh Dickins
    Cc: David Rientjes
    Cc: Balbir Singh
    Cc: KAMEZAWA Hiroyuki
    Cc: YAMAMOTO Takashi
    Cc: Paul Menage
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hirokazu Takahashi
     
  • Ever since the VM_BUG_ON(page_get_page_cgroup(page)) (now Bad page state) went
    into page freeing, I've hit it from time to time in testing on some machines,
    sometimes only after many days. Recently found a machine which could usually
    produce it within a few hours, which got me there at last.

    The culprit is mem_cgroup_move_lists, whose locking is inadequate; and the
    arrangement of structures was such that you got page_cgroups from the lru list
    neatly put on to SLUB's freelist. Kamezawa-san identified the same hole
    independently.

    The main problem was that it was missing the lock_page_cgroup it needs to
    safely page_get_page_cgroup; but it's tricky to go beyond that too, and I
    couldn't do it with SLAB_DESTROY_BY_RCU as I'd expected. See the code for
    comments on the constraints.

    This patch immediately gets replaced by a simpler one from Hirokazu-san; but
    is it just foolish pride that tells me to put this one on record, in case we
    need to come back to it later?

    Signed-off-by: Hugh Dickins
    Cc: David Rientjes
    Cc: Balbir Singh
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Hirokazu Takahashi
    Cc: YAMAMOTO Takashi
    Cc: Paul Menage
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • mem_cgroup_uncharge_page does css_put on the mem_cgroup before uncharging from
    it, and before removing page_cgroup from one of its lru lists: isn't there a
    danger that struct mem_cgroup memory could be freed and reused before
    completing that, so corrupting something? Never seen it, and for all I know
    there may be other constraints which make it impossible; but let's be
    defensive and reverse the ordering there.

    mem_cgroup_force_empty_list is safe because there's an extra css_get around
    all its works; but even so, change its ordering the same way round, to help
    get in the habit of doing it like this.
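
    In other words, the defensive ordering is roughly (a sketch; details of the
    uncharge path and local declarations are omitted):

    /* finish all work that touches the mem_cgroup first ... */
    spin_lock_irqsave(&mz->lru_lock, flags);
    __mem_cgroup_remove_list(pc);               /* off its per-zone lru list */
    spin_unlock_irqrestore(&mz->lru_lock, flags);
    res_counter_uncharge(&mem->res, PAGE_SIZE); /* give back the charge */

    /* ... and only then drop the reference that keeps the mem_cgroup alive */
    css_put(&mem->css);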

    Signed-off-by: Hugh Dickins
    Cc: David Rientjes
    Cc: Balbir Singh
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Hirokazu Takahashi
    Cc: YAMAMOTO Takashi
    Cc: Paul Menage
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Remove clear_page_cgroup: it's an unhelpful helper, see for example how
    mem_cgroup_uncharge_page had to unlock_page_cgroup just in order to call it
    (serious races from that? I'm not sure).

    Once that's gone, you can see it's pointless for page_cgroup's ref_cnt to be
    atomic: it's always manipulated under lock_page_cgroup, except where
    force_empty unilaterally reset it to 0 (and how does uncharge's
    atomic_dec_and_test protect against that?).

    Simplify this page_cgroup locking: if you've got the lock and the pc is
    attached, then the ref_cnt must be positive: VM_BUG_ONs to check that, and to
    check that pc->page matches page (we're on the way to finding why sometimes it
    doesn't, but this patch doesn't fix that).
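
    So the pattern the VM_BUG_ONs guard becomes, roughly (a sketch of the
    invariant rather than the exact diff):

    lock_page_cgroup(page);
    pc = page_get_page_cgroup(page);
    if (pc) {
            VM_BUG_ON(pc->page != page);    /* pc must point back at page */
            VM_BUG_ON(pc->ref_cnt <= 0);    /* attached implies positive ref_cnt */
            ...
    }
    unlock_page_cgroup(page);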

    Signed-off-by: Hugh Dickins
    Cc: David Rientjes
    Cc: Balbir Singh
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Hirokazu Takahashi
    Cc: YAMAMOTO Takashi
    Cc: Paul Menage
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • More cleanup to memcontrol.c, this time changing some of the code generated.
    Let the compiler decide what to inline (except for page_cgroup_locked which is
    only used when CONFIG_DEBUG_VM): the __always_inline on lock_page_cgroup etc.
    was quite a waste since bit_spin_lock etc. are inlines in a header file; made
    mem_cgroup_force_empty and mem_cgroup_write_strategy static.

    Signed-off-by: Hugh Dickins
    Cc: David Rientjes
    Cc: Balbir Singh
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Hirokazu Takahashi
    Cc: YAMAMOTO Takashi
    Cc: Paul Menage
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Sorry, before getting down to more important changes, I'd like to do some
    cleanup in memcontrol.c. This patch doesn't change the code generated, but
    cleans up whitespace, moves up a double declaration, removes an unused enum,
    removes void returns, removes misleading comments, that kind of thing.

    Signed-off-by: Hugh Dickins
    Cc: David Rientjes
    Cc: Balbir Singh
    Cc: KAMEZAWA Hiroyuki
    Cc: Hirokazu Takahashi
    Cc: YAMAMOTO Takashi
    Cc: Paul Menage
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Nothing uses mem_cgroup_uncharge apart from mem_cgroup_uncharge_page (a
    trivial wrapper around it) and mem_cgroup_end_migration (which does the same
    as mem_cgroup_uncharge_page). And it often ends up having to lock just to let
    its caller unlock. Remove it (but leave the silly locking until a later
    patch).

    Moved mem_cgroup_cache_charge next to mem_cgroup_charge in memcontrol.h.

    Signed-off-by: Hugh Dickins
    Cc: David Rientjes
    Acked-by: Balbir Singh
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Hirokazu Takahashi
    Cc: YAMAMOTO Takashi
    Cc: Paul Menage
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • My memcgroup patch to fix hang with shmem/tmpfs added NULL page handling to
    mem_cgroup_charge_common. It seemed convenient at the time, but hard to
    justify now: there's a perfectly appropriate swappage to charge and uncharge
    instead, this is not on any hot path through shmem_getpage, and no performance
    hit was observed from the slight extra overhead.

    So revert that NULL page handling from mem_cgroup_charge_common; and make it
    clearer by bringing page_cgroup_assign_new_page_cgroup into its body - that
    was a helper I found more of a hindrance to understanding.

    Signed-off-by: Hugh Dickins
    Cc: David Rientjes
    Acked-by: Balbir Singh
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Hirokazu Takahashi
    Cc: YAMAMOTO Takashi
    Cc: Paul Menage
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Replace free_hot_cold_page's VM_BUG_ON(page_get_page_cgroup(page)) by a "Bad
    page state" and clear: most users don't have CONFIG_DEBUG_VM on, and if it
    were set here, it'd likely cause corruption when the page is reused.

    Don't use page_assign_page_cgroup to clear it: that should be private to
    memcontrol.c, and always called with the lock taken; and memmap_init_zone
    doesn't need it either - like page->mapping and other pointers throughout the
    kernel, Linux assumes pointers in zeroed structures are NULL pointers.

    Instead use page_reset_bad_cgroup, added to memcontrol.h for this only.
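
    The intent in free_hot_cold_page() is roughly this (a sketch; the message
    text and the exact reporting path are illustrative):

    if (unlikely(page_get_page_cgroup(page) != NULL)) {
            /* complain, as for any other bad page state ... */
            printk(KERN_ALERT "Bad page state: page_cgroup still set\n");
            /* ... then clear the field so the page can be reused safely */
            page_reset_bad_cgroup(page);
    }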

    Signed-off-by: Hugh Dickins
    Cc: David Rientjes
    Acked-by: Balbir Singh
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Hirokazu Takahashi
    Cc: YAMAMOTO Takashi
    Cc: Paul Menage
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Each caller of mem_cgroup_move_lists is having to use page_get_page_cgroup:
    it's more convenient if it acts upon the page itself not the page_cgroup; and
    in a later patch this becomes important to handle within memcontrol.c.
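
    The interface change is roughly (a sketch; the second argument selects the
    active or inactive target list):

    /* before: every caller looked the page_cgroup up itself */
    mem_cgroup_move_lists(page_get_page_cgroup(page), true);

    /* after: pass the page and let memcontrol.c find (and lock) its page_cgroup */
    mem_cgroup_move_lists(page, true);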

    Signed-off-by: Hugh Dickins
    Cc: David Rientjes
    Acked-by: Balbir Singh
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Hirokazu Takahashi
    Cc: YAMAMOTO Takashi
    Cc: Paul Menage
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • vm_match_cgroup is a perverse name for a macro to match mm with cgroup: rename
    it mm_match_cgroup, matching mm_init_cgroup and mm_free_cgroup.

    Signed-off-by: Hugh Dickins
    Acked-by: David Rientjes
    Acked-by: Balbir Singh
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Hirokazu Takahashi
    Cc: YAMAMOTO Takashi
    Cc: Paul Menage
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

24 Feb, 2008

2 commits

  • Cgroup requires the subsystem to return a negative error code on error in
    the create method.

    Signed-off-by: Li Zefan
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li Zefan
     
  • Remove this VM_BUG_ON(), as Balbir stated:

    We used to have a for loop with !list_empty() as a termination condition
    and VM_BUG_ON(!pc) is a spill over. With the new loop, VM_BUG_ON(!pc) does
    not make sense.

    Signed-off-by: Li Zefan
    Acked-by: Balbir Singh
    Acked-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li Zefan
     

10 Feb, 2008

1 commit

  • mm_cgroup() is exclusively used to test whether an mm's mem_cgroup pointer
    is pointing to a specific cgroup. Instead of returning the pointer, we can
    just do the test itself in a new macro:

    vm_match_cgroup(mm, cgroup)

    returns non-zero if the mm's mem_cgroup points to cgroup. Otherwise it
    returns zero.
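
    A sketch of what such a macro can look like (assuming the mm carries a
    mem_cgroup pointer directly, as it did at the time; the exact definition
    may differ):

    #define vm_match_cgroup(mm, cgroup) \
            ((cgroup) == rcu_dereference((mm)->mem_cgroup))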

    Signed-off-by: David Rientjes
    Cc: Balbir Singh
    Cc: Adrian Bunk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

08 Feb, 2008

22 commits

  • Based on the discussion at http://lkml.org/lkml/2007/12/20/383, it was felt
    that control_type might not be a good thing to implement right away. We
    can add this flexibility at a later point when required.

    Signed-off-by: Balbir Singh
    Acked-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Balbir Singh
     
  • Now, the lru is per-zone.

    Then, the lru_lock can be (should be) per-zone, too.
    This patch implements a per-zone lru lock.

    lru_lock is placed into the mem_cgroup_per_zone struct.

    The lock can be accessed by
    mz = mem_cgroup_zoneinfo(mem_cgroup, node, zone);
    &mz->lru_lock

    or
    mz = page_cgroup_zoneinfo(page_cgroup);
    &mz->lru_lock
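
    Typical usage then becomes (a sketch; flags is a local unsigned long for the
    saved irq state):

    mz = page_cgroup_zoneinfo(pc);
    spin_lock_irqsave(&mz->lru_lock, flags);
    /* safe to add/remove pc on this zone's cgroup lru lists here */
    spin_unlock_irqrestore(&mz->lru_lock, flags);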

    Signed-off-by: KAMEZAWA hiroyuki
    Cc: "Eric W. Biederman"
    Cc: Balbir Singh
    Cc: David Rientjes
    Cc: Herbert Poetzl
    Cc: Kirill Korotaev
    Cc: Nick Piggin
    Cc: Paul Menage
    Cc: Pavel Emelianov
    Cc: Peter Zijlstra
    Cc: Vaidyanathan Srinivasan
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • This patch implements a per-zone lru for the memory cgroup.
    It makes use of the mem_cgroup_per_zone struct for the per-zone lru.

    The LRU can be accessed by

    mz = mem_cgroup_zoneinfo(mem_cgroup, node, zone);
    &mz->active_list
    &mz->inactive_list

    or
    mz = page_cgroup_zoneinfo(page_cgroup);
    &mz->active_list
    &mz->inactive_list

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: "Eric W. Biederman"
    Cc: Balbir Singh
    Cc: David Rientjes
    Cc: Herbert Poetzl
    Cc: Kirill Korotaev
    Cc: Nick Piggin
    Cc: Paul Menage
    Cc: Pavel Emelianov
    Cc: Peter Zijlstra
    Cc: Vaidyanathan Srinivasan
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • … pages to be scanned per cgroup

    Define a function for calculating the number of scan targets on each zone/LRU.

    Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
    Cc: "Eric W. Biederman" <ebiederm@xmission.com>
    Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Herbert Poetzl <herbert@13thfloor.at>
    Cc: Kirill Korotaev <dev@sw.ru>
    Cc: Nick Piggin <nickpiggin@yahoo.com.au>
    Cc: Paul Menage <menage@google.com>
    Cc: Pavel Emelianov <xemul@openvz.org>
    Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
    Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
    Cc: Rik van Riel <riel@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

    KAMEZAWA Hiroyuki
     
  • Functions to remember reclaim priority per cgroup (as zone->prev_priority)

    [akpm@linux-foundation.org: build fixes]
    [akpm@linux-foundation.org: more build fixes]
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: "Eric W. Biederman"
    Cc: Balbir Singh
    Cc: David Rientjes
    Cc: Herbert Poetzl
    Cc: Kirill Korotaev
    Cc: Nick Piggin
    Cc: Paul Menage
    Cc: Pavel Emelianov
    Cc: Peter Zijlstra
    Cc: Vaidyanathan Srinivasan
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • …ve imbalance per cgroup

    Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
    Cc: "Eric W. Biederman" <ebiederm@xmission.com>
    Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Herbert Poetzl <herbert@13thfloor.at>
    Cc: Kirill Korotaev <dev@sw.ru>
    Cc: Nick Piggin <nickpiggin@yahoo.com.au>
    Cc: Paul Menage <menage@google.com>
    Cc: Pavel Emelianov <xemul@openvz.org>
    Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
    Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
    Cc: Rik van Riel <riel@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

    KAMEZAWA Hiroyuki
     
    Define a function for calculating mapped_ratio in a memory cgroup.

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: "Eric W. Biederman"
    Cc: Balbir Singh
    Cc: David Rientjes
    Cc: Herbert Poetzl
    Cc: Kirill Korotaev
    Cc: Nick Piggin
    Cc: Paul Menage
    Cc: Pavel Emelianov
    Cc: Peter Zijlstra
    Cc: Vaidyanathan Srinivasan
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • This patch adds per-zone status to the memory cgroup. These values are often
    read (as per-zone values) by page reclaim.

    In the current design, a per-zone stat is just an unsigned long, not an
    atomic value, because it is modified only under lru_lock. (So, atomic ops
    are not necessary.)

    This patch adds ACTIVE and INACTIVE per-zone status values.

    For handling per-zone status, this patch adds
    struct mem_cgroup_per_zone {
    ...
    }
    and some helper functions. This will be useful for adding per-zone objects
    to mem_cgroup.

    This patch sets the memory controller's early_init to 0 so that kmalloc()
    can be called during initialization.

    Acked-by: Balbir Singh
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: "Eric W. Biederman"
    Cc: David Rientjes
    Cc: Herbert Poetzl
    Cc: Kirill Korotaev
    Cc: Nick Piggin
    Cc: Paul Menage
    Cc: Pavel Emelianov
    Cc: Peter Zijlstra
    Cc: Vaidyanathan Srinivasan
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Add macros to get the node_id and zone_id of a page_cgroup. These will be
    used in the per-zone-xxx patches and others.
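
    A sketch of such helpers (the macro names are illustrative; page_to_nid()
    and page_zonenum() are the standard page helpers):

    #define page_cgroup_nid(pc)  page_to_nid((pc)->page)
    #define page_cgroup_zid(pc)  page_zonenum((pc)->page)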

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: "Eric W. Biederman"
    Cc: Balbir Singh
    Cc: David Rientjes
    Cc: Herbert Poetzl
    Cc: Kirill Korotaev
    Cc: Nick Piggin
    Cc: Paul Menage
    Cc: Pavel Emelianov
    Cc: Peter Zijlstra
    Cc: Vaidyanathan Srinivasan
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Add a pre_destroy handler for mem_cgroup and try to make the mem_cgroup
    empty at rmdir().
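
    Roughly, the wiring looks like this (a sketch; the callback signature and
    the mem_cgroup_from_cont() helper of that era are assumed):

    static void mem_cgroup_pre_destroy(struct cgroup_subsys *ss,
                                       struct cgroup *cont)
    {
            struct mem_cgroup *mem = mem_cgroup_from_cont(cont);

            /* drop remaining charges so that the later rmdir() can succeed */
            mem_cgroup_force_empty(mem);
    }

    struct cgroup_subsys mem_cgroup_subsys = {
            ...
            .pre_destroy = mem_cgroup_pre_destroy,
    };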

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: "Eric W. Biederman"
    Cc: Balbir Singh
    Cc: David Rientjes
    Cc: Herbert Poetzl
    Cc: Kirill Korotaev
    Cc: Nick Piggin
    Cc: Paul Menage
    Cc: Pavel Emelianov
    Cc: Peter Zijlstra
    Cc: Vaidyanathan Srinivasan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Show the accounted information of a memory cgroup via the memory.stat file.

    [akpm@linux-foundation.org: coding-style fixes]
    [akpm@linux-foundation.org: fix printk warning]
    Signed-off-by: YAMAMOTO Takashi
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Cc: Pavel Emelianov
    Cc: Paul Menage
    Cc: Peter Zijlstra
    Cc: "Eric W. Biederman"
    Cc: Nick Piggin
    Cc: Kirill Korotaev
    Cc: Herbert Poetzl
    Cc: David Rientjes
    Cc: Vaidyanathan Srinivasan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Add statistics accounting infrastructure for the memory controller. All
    accounting information is stored per-cpu, and callers do not have to take a
    lock or use atomic ops. This will be used by the memory.stat file later.

    CACHE includes swapcache now. I'd like to divide it into
    PAGECACHE and SWAPCACHE later.

    This patch adds 3 functions for accounting.
    * __mem_cgroup_stat_add() ... for the usual case.
    * __mem_cgroup_stat_add_safe() ... for calling under an irq-disabled section.
    * mem_cgroup_read_stat() ... for reading a stat value.
    * renamed PAGECACHE to CACHE (because it may include swapcache *now*)
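
    The per-cpu state behind these helpers is something like (a sketch; the
    struct and index names are illustrative):

    struct mem_cgroup_stat_cpu {
            s64 count[MEM_CGROUP_STAT_NSTATS];
    } ____cacheline_aligned_in_smp;

    static void __mem_cgroup_stat_add_safe(struct mem_cgroup_stat *stat,
                                           enum mem_cgroup_stat_index idx, int val)
    {
            /* caller runs with irqs disabled, so the cpu cannot change under us */
            int cpu = smp_processor_id();

            stat->cpustat[cpu].count[idx] += val;
    }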

    [akpm@linux-foundation.org: coding-style fixes]
    [akpm@linux-foundation.org: fix smp_processor_id-in-preemptible]
    [akpm@linux-foundation.org: uninline things]
    [akpm@linux-foundation.org: remove dead code]
    Signed-off-by: KAMEZAWA Hiroyuki
    Signed-off-by: YAMAMOTO Takashi
    Cc: Balbir Singh
    Cc: Pavel Emelianov
    Cc: Paul Menage
    Cc: Peter Zijlstra
    Cc: "Eric W. Biederman"
    Cc: Nick Piggin
    Cc: Kirill Korotaev
    Cc: Herbert Poetzl
    Cc: David Rientjes
    Cc: Vaidyanathan Srinivasan
    Cc: YAMAMOTO Takashi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Remember whether a page_cgroup is on the active_list or not in
    page_cgroup->flags.

    Signed-off-by: KAMEZAWA Hiroyuki
    Signed-off-by: YAMAMOTO Takashi
    Cc: Balbir Singh
    Cc: Pavel Emelianov
    Cc: Paul Menage
    Cc: Peter Zijlstra
    Cc: "Eric W. Biederman"
    Cc: Nick Piggin
    Cc: Kirill Korotaev
    Cc: Herbert Poetzl
    Cc: David Rientjes
    Cc: Vaidyanathan Srinivasan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • The memcgroup regime relies upon a cgroup reclaiming pages from itself within
    add_to_page_cache: which may involve some waiting. Whereas shmem and tmpfs
    rely upon using add_to_page_cache while holding a spinlock: when it cannot
    wait. The consequence is that when a cgroup reaches its limit, shmem_getpage
    just hangs - unless there is outside memory pressure too, neither kswapd nor
    radix_tree_preload get it out of the retry loop.

    In most cases we can mem_cgroup_cache_charge the page waitably first, to
    attach the page_cgroup in advance, so add_to_page_cache will do no more than
    increment a count; then mem_cgroup_uncharge_page after (in both success and
    failure cases) to balance the books again.

    And where there used to be a congestion_wait for kswapd (recently made
    redundant by radix_tree_preload), use mem_cgroup_cache_charge with NULL page
    to go through a cycle of allocation and freeing, without accounting to any
    particular page, and without updating the statistics vector. This brings the
    cgroup below its limit so the next try usually succeeds.
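
    The shape of the fix in shmem_getpage() is roughly (a sketch; error handling
    is compressed, and page, mapping, idx, gfp are illustrative locals):

    /* charge first, while we are still allowed to wait and reclaim */
    error = mem_cgroup_cache_charge(page, current->mm, gfp);
    if (error)
            goto failed;

    /* under the spinlock this is now little more than a counted insert */
    error = add_to_page_cache(page, mapping, idx, gfp & ~__GFP_WAIT);

    /* balance the books again, whether the insert succeeded or failed */
    mem_cgroup_uncharge_page(page);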

    Signed-off-by: Hugh Dickins
    Cc: Balbir Singh
    Cc: KAMEZAWA Hiroyuki
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Tidy up mem_cgroup_charge_common before extending it. Adjust some comments,
    but mainly clean up its loop: I've an aversion to loops full of continues,
    then a break or a goto at the bottom. And the is_atomic test should be on the
    __GFP_WAIT bit, not GFP_ATOMIC bits.
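
    That is, the test becomes something like (a sketch; the label is
    illustrative):

    /* before: is_atomic deduced from GFP_ATOMIC bits */
    /* after:  only __GFP_WAIT says whether we may sleep to reclaim */
    if (!(gfp_mask & __GFP_WAIT))
            goto nomem;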

    Signed-off-by: Hugh Dickins
    Cc: Balbir Singh
    Cc: KAMEZAWA Hiroyuki
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Hugh Dickins noticed that we were using rcu_dereference() without
    rcu_read_lock() in the cache charging routine. The patch below fixes
    this problem.
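
    The fixed pattern is along these lines (a sketch, assuming the mm carries
    the mem_cgroup pointer directly, as it did at the time):

    rcu_read_lock();
    mem = rcu_dereference(mm->mem_cgroup);
    css_get(&mem->css);     /* pin it before leaving the RCU read section */
    rcu_read_unlock();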

    Signed-off-by: Balbir Singh
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Balbir Singh
     
  • Add a flag to page_cgroup to remember "this page is charged as cache."
    Cache here includes page cache and swap cache. This is useful for
    implementing precise accounting in the memory cgroup.

    TODO: distinguish page-cache and swap-cache

    Signed-off-by: KAMEZAWA Hiroyuki
    Signed-off-by: YAMAMOTO Takashi
    Cc: Balbir Singh
    Cc: Pavel Emelianov
    Cc: Paul Menage
    Cc: Peter Zijlstra
    Cc: "Eric W. Biederman"
    Cc: Nick Piggin
    Cc: Kirill Korotaev
    Cc: Herbert Poetzl
    Cc: David Rientjes
    Cc: Vaidyanathan Srinivasan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • This patch adds an interface, "memory.force_empty". Any write to this file
    will drop all charges in the cgroup if it has no tasks.

    %echo 1 > /....../memory.force_empty

    will drop all charges of the memory cgroup if the cgroup has no tasks.

    This is useful for making rmdir() against a memory cgroup succeed.

    Tested and works well on an x86_64/fake-NUMA system.

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Cc: Pavel Emelianov
    Cc: Paul Menage
    Cc: Peter Zijlstra
    Cc: "Eric W. Biederman"
    Cc: Nick Piggin
    Cc: Kirill Korotaev
    Cc: Herbert Poetzl
    Cc: David Rientjes
    Cc: Vaidyanathan Srinivasan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • mem_cgroup_charge_common shows a tendency to OOM without good reason, when
    a memhog goes well beyond its rss limit but with plenty of swap available.
    Seen on x86 but not on PowerPC; seen when the next patch omits swapcache
    from memcgroup, but we presume it can happen without.

    mem_cgroup_isolate_pages is not quite satisfying reclaim's criteria for OOM
    avoidance. Already it has to scan beyond the nr_to_scan limit when it finds
    a !LRU page, or an active page when handling inactive, or an inactive page
    when handling active. It needs to do exactly the same when it finds a page
    from the wrong zone (the x86 tests had two zones, the PowerPC tests had
    only one).

    Don't increment scan and then decrement it in these cases; just move the
    increment down. Fix a recent off-by-one when checking against nr_to_scan.
    Cut out "Check if the meta page went away from under us", presumably left
    over from early debugging: no amount of such checks could save us if this
    list really were being updated without locking.

    This change does make the unlimited scan while holding two spinlocks
    even worse - bad for latency and bad for containment; but that's a
    separate issue which is better left to be fixed a little later.

    Signed-off-by: Hugh Dickins
    Cc: Pavel Emelianov
    Acked-by: Balbir Singh
    Cc: Paul Menage
    Cc: Peter Zijlstra
    Cc: "Eric W. Biederman"
    Cc: Nick Piggin
    Cc: Kirill Korotaev
    Cc: Herbert Poetzl
    Cc: David Rientjes
    Cc: Vaidyanathan Srinivasan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • This patch makes mem_cgroup_isolate_pages():

    - ignore !PageLRU pages.
    - fix the bug that isolation makes no progress once it finds a page with
      page_zone(page) != zone (just increment scan in this case).

    kswapd and memory migration remove a page from the list when they handle
    a page for reclaim/migration.

    Because __isolate_lru_page() doesn't move !PageLRU pages, it is safe to
    avoid touching a !PageLRU() page and its page_cgroup.

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Cc: Pavel Emelianov
    Cc: Paul Menage
    Cc: Peter Zijlstra
    Cc: "Eric W. Biederman"
    Cc: Nick Piggin
    Cc: Kirill Korotaev
    Cc: Herbert Poetzl
    Cc: David Rientjes
    Cc: Vaidyanathan Srinivasan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • While using the memory control cgroup, page migration under it works as
    follows.
    ==
    1. uncharge all refs at try_to_unmap().
    2. charge refs again at remove_migration_ptes().
    ==
    This is simple but has the following problems.
    ==
    The page is uncharged and charged back again only if it is *mapped*.
    - This means that the cgroup before migration can be different from the one
      after migration.
    - If a page is not mapped but is charged as page cache, the charge is just
      ignored (because it is not mapped, it will not be uncharged before
      migration). This is a memory leak.
    ==
    This patch tries to keep the memory cgroup across page migration by holding
    one extra refcnt during it. 3 functions are added.

    mem_cgroup_prepare_migration() --- increase refcnt of page->page_cgroup
    mem_cgroup_end_migration() --- decrease refcnt of page->page_cgroup
    mem_cgroup_page_migration() --- copy page->page_cgroup from old page to
    new page.

    During migration
    - the old page is under PG_locked.
    - the new page is under PG_locked, too.
    - neither the old page nor the new page is on the LRU.

    These 3 facts guarantee that page_cgroup migration has no races.

    Tested and works well on an x86_64/fake-NUMA box.
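
    The call order during migration is roughly (a sketch of where the hooks sit;
    which page the final call is passed, and the exact prototypes, may differ):

    mem_cgroup_prepare_migration(page);       /* take the extra refcnt */
    /* ... try_to_unmap(), copy, remove_migration_ptes() ... */
    mem_cgroup_page_migration(page, newpage); /* move page_cgroup to the new page */
    mem_cgroup_end_migration(page);           /* drop the extra refcnt */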

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Cc: Pavel Emelianov
    Cc: Paul Menage
    Cc: Peter Zijlstra
    Cc: "Eric W. Biederman"
    Cc: Nick Piggin
    Cc: Kirill Korotaev
    Cc: Herbert Poetzl
    Cc: David Rientjes
    Cc: Vaidyanathan Srinivasan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • This patch adds the following functions.
    - clear_page_cgroup(page, pc)
    - page_cgroup_assign_new_page_cgroup(page, pc)

    Mainly for cleanup.

    The manner of "check page->cgroup again after lock_page_cgroup()" is
    implemented in a straightforward way.

    A comment in mem_cgroup_uncharge() will be removed by the force-empty patch.
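
    The "check again after taking the lock" manner looks roughly like this (a
    sketch; the label and the kfree() of the prepared page_cgroup are
    illustrative):

    lock_page_cgroup(page);
    if (page_get_page_cgroup(page) != NULL) {
            /* raced: the page was charged while we were allocating pc */
            unlock_page_cgroup(page);
            kfree(pc);      /* throw away the page_cgroup we prepared */
            goto done;
    }
    /* otherwise attach pc, account it, and unlock as usual */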

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Cc: Pavel Emelianov
    Cc: Paul Menage
    Cc: Peter Zijlstra
    Cc: "Eric W. Biederman"
    Cc: Nick Piggin
    Cc: Kirill Korotaev
    Cc: Herbert Poetzl
    Cc: David Rientjes
    Cc: Vaidyanathan Srinivasan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki