08 Feb, 2008

40 commits

  • This is a patch for the Compaq ASIC3 multi-function chip, found in many
    PDAs (iPAQs, HTCs...).

    It is a simplified version of Paul Sokolovsky's first proposal [1]. With
    this code, it is basically a GPIO and IRQ expander. My plan is to add more
    features once this patch gets reviewed and accepted.

    [1] http://lkml.org/lkml/2007/5/1/46

    Signed-off-by: Samuel Ortiz
    Cc: Paul Sokolovsky
    Cc: Ben Dooks
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Samuel Ortiz
     
  • Update cpuset documentation to match the October 2007 "Fix cpusets
    update_cpumask" changes that now apply changes to a cpuset's 'cpus' allowed
    mask immediately to the cpus_allowed of the tasks in that cpuset.

    Signed-off-by: Paul Jackson
    Acked-by: Cliff Wickman
    Cc: David Rientjes
    Cc: Paul Menage
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Jackson
     
  • There's one place that works with task pids - it's the "tasks" file in
    cgroups. The read/write handlers assume that the pid values go to/come from
    user space and are thus virtual pids, i.e. pids as they are seen from
    inside a namespace.

    Tune the code accordingly.
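
    As a rough sketch (not the literal diff), the change amounts to swapping
    the global-pid helpers for their namespace-aware counterparts in the two
    handlers; the surrounding variable names here are illustrative:

    /* write side: resolve the pid the user wrote in the writer's namespace */
    tsk = find_task_by_vpid(pid);           /* rather than find_task_by_pid() */

    /* read side: report each task's pid as seen from inside the namespace */
    pidarray[n++] = task_pid_vnr(tsk);      /* rather than tsk->pid */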

    Signed-off-by: Pavel Emelyanov
    Cc: "Eric W. Biederman"
    Acked-by: Paul Menage
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Emelyanov
     
  • - Narrow the scope of callback_mutex in scan_for_empty_cpusets().

    - Avoid rewriting the cpus, mems of cpusets except when it is likely that
    we'll be changing them.

    - Have remove_tasks_in_empty_cpuset() also check for empty mems.

    Signed-off-by: Paul Jackson
    Acked-by: Cliff Wickman
    Cc: David Rientjes
    Cc: Paul Menage
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Jackson
     
  • Various minor formatting and comment tweaks to Cliff Wickman's
    [PATCH_3_of_3]_cpusets__update_cpumask_revision.patch

    I had had "iff", meaning "if and only if", in a comment. However, except
    for ancient mathematicians, the abbreviation "iff" was a tad too cryptic.
    Cliff changed it to "if", presumably figuring that the "iff" was a typo.
    However, it was the "only if" half of the conjunction that was most
    interesting. Reword to emphasize the "only if" aspect.

    The locking comment for remove_tasks_in_empty_cpuset() was wrong; it said
    callback_mutex had to be held on entry. The opposite is true.

    Several mentions of attach_task() in comments needed to be
    changed to cgroup_attach_task().

    A comment about notify_on_release was no longer relevant,
    as the line of code it had commented, namely:
    set_bit(CS_RELEASED_RESOURCE, &parent->flags);
    is no longer present in that place in the cpuset.c code.

    Similarly a comment about notify_on_release before the
    scan_for_empty_cpusets() routine was no longer relevant.

    Removed extra parentheses and unnecessary return statement.

    Renamed attach_task() to cpuset_attach() in various comments.

    Removed comment about not needing memory migration, as it seems the migration
    is done anyway, via the cpuset_attach() callback from cgroup_attach_task().

    Signed-off-by: Paul Jackson
    Acked-by: Cliff Wickman
    Cc: David Rientjes
    Cc: Paul Menage
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Jackson
     
  • Some of the comments in kernel/cpuset.c were stale following the
    transition to control groups; this patch updates them to more closely
    match reality.

    Signed-off-by: Paul Menage
    Acked-by: Paul Jackson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Menage
     
  • Use the new function cgroup_scan_tasks() to step through all tasks in a
    cpuset.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Cliff Wickman
    Cc: Paul Menage
    Cc: Paul Jackson
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cliff Wickman
     
  • This patch corrects a situation that occurs when one disables all the cpus in
    a cpuset.

    Currently, the disabled (cpu-less) cpuset inherits the cpus of its parent,
    which is incorrect because it may then overlap its cpu-exclusive sibling.

    Tasks of an empty cpuset should be moved to the cpuset which is the parent of
    their current cpuset. Or if the parent cpuset has no cpus, to its parent,
    etc.

    And the empty cpuset should be released (if it is flagged notify_on_release).

    Depends on the cgroup_scan_tasks() function (proposed by David Rientjes) to
    iterate through all tasks in the cpu-less cpuset. We are deliberately
    avoiding a walk of the tasklist.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Cliff Wickman
    Cc: Paul Menage
    Cc: Paul Jackson
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cliff Wickman
     
  • Provide cgroup_scan_tasks(), which iterates through every task in a cgroup,
    calling a test function and a process function for each; the process
    function is called without holding the css_set_lock lock.

    The idea is David Rientjes', predicting that such a function will make it
    much easier in the future to extend things that require access to each
    task in a cgroup without holding the lock.
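
    A minimal sketch of how a caller might use it; the cgroup_scanner field
    names and callback signatures below are assumptions for illustration, not
    quoted from the patch:

    /* visit every task; return non-zero to hand p to the process callback */
    static int want_task(struct task_struct *p, struct cgroup_scanner *scan)
    {
            return 1;
    }

    /* called without css_set_lock held */
    static void touch_task(struct task_struct *p, struct cgroup_scanner *scan)
    {
            /* e.g. rebind the task's allowed cpus or memories */
    }

    struct cgroup_scanner scan = {
            .cg           = cgrp,           /* the cgroup to walk */
            .test_task    = want_task,
            .process_task = touch_task,
    };
    cgroup_scan_tasks(&scan);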

    [akpm@linux-foundation.org: cleanup]
    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Cliff Wickman
    Cc: Paul Menage
    Cc: Paul Jackson
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cliff Wickman
     
  • Documentation updates for memory controller.

    Signed-off-by: KAMEZAWA Hiroyuki
    Reviewed-by: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Based on the discussion at http://lkml.org/lkml/2007/12/20/383, it was felt
    that control_type might not be a good thing to implement right away. We
    can add this flexibility at a later point when required.

    Signed-off-by: Balbir Singh
    Acked-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Balbir Singh
     
  • Now, the LRU is per-zone.

    Then, the lru_lock can be (should be) per-zone, too.
    This patch implements a per-zone lru_lock.

    lru_lock is placed into the mem_cgroup_per_zone struct.

    lock can be accessed by
    mz = mem_cgroup_zoneinfo(mem_cgroup, node, zone);
    &mz->lru_lock

    or
    mz = page_cgroup_zoneinfo(page_cgroup);
    &mz->lru_lock
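
    For illustration, the locking pattern around a per-zone LRU operation then
    looks roughly like this (a sketch, not code from the patch):

    struct mem_cgroup_per_zone *mz;
    unsigned long flags;

    mz = page_cgroup_zoneinfo(pc);          /* pc: the page's page_cgroup */
    spin_lock_irqsave(&mz->lru_lock, flags);
    /* ... move pc on mz's active/inactive list, update per-zone counters ... */
    spin_unlock_irqrestore(&mz->lru_lock, flags);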

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: "Eric W. Biederman"
    Cc: Balbir Singh
    Cc: David Rientjes
    Cc: Herbert Poetzl
    Cc: Kirill Korotaev
    Cc: Nick Piggin
    Cc: Paul Menage
    Cc: Pavel Emelianov
    Cc: Peter Zijlstra
    Cc: Vaidyanathan Srinivasan
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • This patch implements a per-zone LRU for the memory cgroup.
    It makes use of the mem_cgroup_per_zone struct for the per-zone LRU lists.

    LRU can be accessed by

    mz = mem_cgroup_zoneinfo(mem_cgroup, node, zone);
    &mz->active_list
    &mz->inactive_list

    or
    mz = page_cgroup_zoneinfo(page_cgroup);
    &mz->active_list
    &mz->inactive_list
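
    A sketch of what moving a page between the lists looks like with this
    layout (illustrative; assumes the page_cgroup carries its list head in
    pc->lru):

    struct mem_cgroup_per_zone *mz = page_cgroup_zoneinfo(pc);

    if (active)
            list_move(&pc->lru, &mz->active_list);
    else
            list_move(&pc->lru, &mz->inactive_list);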

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: "Eric W. Biederman"
    Cc: Balbir Singh
    Cc: David Rientjes
    Cc: Herbert Poetzl
    Cc: Kirill Korotaev
    Cc: Nick Piggin
    Cc: Paul Menage
    Cc: Pavel Emelianov
    Cc: Peter Zijlstra
    Cc: Vaidyanathan Srinivasan
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • …solate global/cgroup lru activity

    When using the memory controller, there are 2 levels of memory reclaim:
    1. zone memory reclaim, because of system/zone memory shortage.
    2. memory cgroup memory reclaim, because of hitting the limit.

    These two can be distinguished by the sc->mem_cgroup parameter
    (the scan_global_lru() macro).

    This patch tries to make the memory cgroup reclaim routine avoid affecting
    system/zone memory reclaim. It inserts scan_global_lru() checks and hooks
    into the memory cgroup reclaim support functions.

    This patch can be a help for isolating system LRU activity from group LRU
    activity, and it shows what additional functions are necessary (a usage
    sketch follows the list below).

    * mem_cgroup_calc_mapped_ratio() ... calculate the mapped ratio for a
      cgroup.
    * mem_cgroup_reclaim_imbalance() ... calculate the active/inactive balance
      in a cgroup.
    * mem_cgroup_calc_reclaim_active() ... calculate the number of active pages
      to be scanned at this priority in a mem_cgroup.
    * mem_cgroup_calc_reclaim_inactive() ... calculate the number of inactive
      pages to be scanned at this priority in a mem_cgroup.
    * mem_cgroup_all_unreclaimable() ... check whether all of a cgroup's pages
      are unreclaimable.
    * mem_cgroup_get_reclaim_priority() ...
    * mem_cgroup_note_reclaim_priority() ... record the reclaim priority
      (temporarily).
    * mem_cgroup_remember_reclaim_priority() ... record the reclaim priority as
      zone->prev_priority. This value is used for calculating reclaim_mapped.
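
    A sketch of the kind of branch this inserts into the reclaim path
    (illustrative only; the helper's exact parameter list is an assumption,
    not quoted from the patch):

    if (scan_global_lru(sc)) {
            /* system/zone reclaim: use the zone's own statistics */
            nr_active = zone_page_state(zone, NR_ACTIVE);
    } else {
            /* cgroup reclaim: ask the memory cgroup instead */
            nr_active = mem_cgroup_calc_reclaim_active(sc->mem_cgroup,
                                                       zone, priority);
    }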

    [akpm@linux-foundation.org: fix unused var warning]
    Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
    Cc: "Eric W. Biederman" <ebiederm@xmission.com>
    Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Herbert Poetzl <herbert@13thfloor.at>
    Cc: Kirill Korotaev <dev@sw.ru>
    Cc: Nick Piggin <nickpiggin@yahoo.com.au>
    Cc: Paul Menage <menage@google.com>
    Cc: Pavel Emelianov <xemul@openvz.org>
    Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
    Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
    Cc: Rik van Riel <riel@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

    KAMEZAWA Hiroyuki
     
  • … pages to be scanned per cgroup

    Define functions for calculating the number of pages to be scanned on each
    zone/LRU.

    Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
    Cc: "Eric W. Biederman" <ebiederm@xmission.com>
    Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Herbert Poetzl <herbert@13thfloor.at>
    Cc: Kirill Korotaev <dev@sw.ru>
    Cc: Nick Piggin <nickpiggin@yahoo.com.au>
    Cc: Paul Menage <menage@google.com>
    Cc: Pavel Emelianov <xemul@openvz.org>
    Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
    Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
    Cc: Rik van Riel <riel@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

    KAMEZAWA Hiroyuki
     
  • Functions to remember reclaim priority per cgroup (as zone->prev_priority)

    [akpm@linux-foundation.org: build fixes]
    [akpm@linux-foundation.org: more build fixes]
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: "Eric W. Biederman"
    Cc: Balbir Singh
    Cc: David Rientjes
    Cc: Herbert Poetzl
    Cc: Kirill Korotaev
    Cc: Nick Piggin
    Cc: Paul Menage
    Cc: Pavel Emelianov
    Cc: Peter Zijlstra
    Cc: Vaidyanathan Srinivasan
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • …ve imbalance per cgroup

    Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
    Cc: "Eric W. Biederman" <ebiederm@xmission.com>
    Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Herbert Poetzl <herbert@13thfloor.at>
    Cc: Kirill Korotaev <dev@sw.ru>
    Cc: Nick Piggin <nickpiggin@yahoo.com.au>
    Cc: Paul Menage <menage@google.com>
    Cc: Pavel Emelianov <xemul@openvz.org>
    Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
    Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
    Cc: Rik van Riel <riel@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

    KAMEZAWA Hiroyuki
     
  • Define a function for calculating the mapped_ratio in a memory cgroup.

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: "Eric W. Biederman"
    Cc: Balbir Singh
    Cc: David Rientjes
    Cc: Herbert Poetzl
    Cc: Kirill Korotaev
    Cc: Nick Piggin
    Cc: Paul Menage
    Cc: Pavel Emelianov
    Cc: Peter Zijlstra
    Cc: Vaidyanathan Srinivasan
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • This patch adds per-zone status to the memory cgroup. These values are
    often read (as per-zone values) by page reclaim.

    In the current design, each per-zone stat is just an unsigned long, not an
    atomic value, because it is modified only under lru_lock (so atomic ops
    are not necessary).

    This patch adds ACTIVE and INACTIVE per-zone status values.

    For handling per-zone status, this patch adds
    struct mem_cgroup_per_zone {
    ...
    }
    and some helper functions. This will be useful for adding per-zone objects
    to mem_cgroup.
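
    A sketch of the shape this implies (field and enum names are assumptions
    for illustration, not quoted from the patch):

    enum mem_cgroup_zstat_index {
            MEM_CGROUP_ZSTAT_ACTIVE,        /* pages on the active list */
            MEM_CGROUP_ZSTAT_INACTIVE,      /* pages on the inactive list */
            NR_MEM_CGROUP_ZSTAT,
    };

    struct mem_cgroup_per_zone {
            /* plain unsigned long: only ever modified under lru_lock */
            unsigned long count[NR_MEM_CGROUP_ZSTAT];
    };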

    This patch changes the memory controller's early_init to 0 so that
    kmalloc() can be called during initialization.

    Acked-by: Balbir Singh
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: "Eric W. Biederman"
    Cc: David Rientjes
    Cc: Herbert Poetzl
    Cc: Kirill Korotaev
    Cc: Nick Piggin
    Cc: Paul Menage
    Cc: Pavel Emelianov
    Cc: Peter Zijlstra
    Cc: Vaidyanathan Srinivasan
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Add macros to get the node_id and zone_id of a page_cgroup. These will be
    used in the per-zone-xxx patches and others.
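
    A sketch of such accessors, written here as inline helpers (names and form
    are assumptions; the patch may spell them as macros):

    static inline int page_cgroup_nid(struct page_cgroup *pc)
    {
            return page_to_nid(pc->page);
    }

    static inline enum zone_type page_cgroup_zid(struct page_cgroup *pc)
    {
            return page_zonenum(pc->page);
    }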

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: "Eric W. Biederman"
    Cc: Balbir Singh
    Cc: David Rientjes
    Cc: Herbert Poetzl
    Cc: Kirill Korotaev
    Cc: Nick Piggin
    Cc: Paul Menage
    Cc: Pavel Emelianov
    Cc: Peter Zijlstra
    Cc: Vaidyanathan Srinivasan
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • This is used to detect whether a scan_control scans the global LRU or a
    mem_cgroup LRU. It is compiled to a static value (1) when the memory
    controller is not configured, which should make the meaning obvious.
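
    A sketch of what such a macro looks like (the config symbol is an
    assumption; at the time the memory controller option was
    CONFIG_CGROUP_MEM_CONT):

    #ifdef CONFIG_CGROUP_MEM_CONT
    /* a non-NULL sc->mem_cgroup means we reclaim on behalf of a cgroup */
    #define scan_global_lru(sc)     (!(sc)->mem_cgroup)
    #else
    #define scan_global_lru(sc)     (1)
    #endif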

    Acked-by: Balbir Singh
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: "Eric W. Biederman"
    Cc: David Rientjes
    Cc: Herbert Poetzl
    Cc: Kirill Korotaev
    Cc: Nick Piggin
    Cc: Paul Menage
    Cc: Pavel Emelianov
    Cc: Peter Zijlstra
    Cc: Vaidyanathan Srinivasan
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Add pre_destroy handler for mem_cgroup and try to make mem_cgroup empty at
    rmdir().

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: "Eric W. Biederman"
    Cc: Balbir Singh
    Cc: David Rientjes
    Cc: Herbert Poetzl
    Cc: Kirill Korotaev
    Cc: Nick Piggin
    Cc: Paul Menage
    Cc: Pavel Emelianov
    Cc: Peter Zijlstra
    Cc: Vaidyanathan Srinivasan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Add a handler "pre_destroy" to cgroup_subsys. It is called before
    cgroup_rmdir() checks all subsystems' refcnts.

    I think this is useful for subsystems which hold extra refs even when there
    are no tasks in the cgroup. By adding pre_destroy(), the kernel keeps the
    rule "destroy() against a subsystem is called only when refcnt == 0" and
    allows css refs to be held by objects other than tasks.
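
    For example, a subsystem wires the new callback up in its cgroup_subsys
    definition; the sketch below uses a hypothetical "foo" subsystem for
    illustration:

    struct cgroup_subsys foo_subsys = {
            .name        = "foo",
            .create      = foo_create,
            .pre_destroy = foo_pre_destroy,  /* release extra css refs here */
            .destroy     = foo_destroy,      /* still only called at refcnt 0 */
            .subsys_id   = foo_subsys_id,
    };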

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: "Eric W. Biederman"
    Cc: Balbir Singh
    Cc: David Rientjes
    Cc: Herbert Poetzl
    Cc: Kirill Korotaev
    Cc: Nick Piggin
    Cc: Paul Menage
    Cc: Pavel Emelianov
    Cc: Peter Zijlstra
    Cc: Vaidyanathan Srinivasan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Show the accounted information of a memory cgroup via the memory.stat
    file.

    [akpm@linux-foundation.org: coding-style fixes]
    [akpm@linux-foundation.org: fix printk warning]
    Signed-off-by: YAMAMOTO Takashi
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Cc: Pavel Emelianov
    Cc: Paul Menage
    Cc: Peter Zijlstra
    Cc: "Eric W. Biederman"
    Cc: Nick Piggin
    Cc: Kirill Korotaev
    Cc: Herbert Poetzl
    Cc: David Rientjes
    Cc: Vaidyanathan Srinivasan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Add statistics accounting infrastructure for the memory controller. All
    accounting information is stored per-cpu, so the caller does not have to
    take a lock or use atomic ops. This will be used by the memory.stat file
    later. (A sketch of the implied per-cpu layout follows the list below.)

    CACHE includes swapcache for now. I'd like to divide it into
    PAGECACHE and SWAPCACHE later.

    This patch adds 3 functions for accounting:
    * __mem_cgroup_stat_add() ... for the usual case.
    * __mem_cgroup_stat_add_safe ... for calling in an irq-disabled section.
    * mem_cgroup_read_stat() ... for reading a stat value.
    * renamed PAGECACHE to CACHE (because it may include swapcache *now*)
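
    A sketch of the implied per-cpu layout (names are illustrative
    assumptions, not quoted from the patch):

    enum mem_cgroup_stat_index {
            MEM_CGROUP_STAT_CACHE,          /* page cache + swap cache, for now */
            MEM_CGROUP_STAT_RSS,            /* anonymous / mapped pages */
            MEM_CGROUP_STAT_NSTATS,
    };

    struct mem_cgroup_stat_cpu {
            s64 count[MEM_CGROUP_STAT_NSTATS];
    } ____cacheline_aligned_in_smp;

    struct mem_cgroup_stat {
            struct mem_cgroup_stat_cpu cpustat[NR_CPUS];
    };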

    [akpm@linux-foundation.org: coding-style fixes]
    [akpm@linux-foundation.org: fix smp_processor_id-in-preemptible]
    [akpm@linux-foundation.org: uninline things]
    [akpm@linux-foundation.org: remove dead code]
    Signed-off-by: KAMEZAWA Hiroyuki
    Signed-off-by: YAMAMOTO Takashi
    Cc: Balbir Singh
    Cc: Pavel Emelianov
    Cc: Paul Menage
    Cc: Peter Zijlstra
    Cc: "Eric W. Biederman"
    Cc: Nick Piggin
    Cc: Kirill Korotaev
    Cc: Herbert Poetzl
    Cc: David Rientjes
    Cc: Vaidyanathan Srinivasan
    Cc: YAMAMOTO Takashi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Remember whether a page_cgroup is on the active_list in
    page_cgroup->flags.

    Signed-off-by: KAMEZAWA Hiroyuki
    Signed-off-by: YAMAMOTO Takashi
    Cc: Balbir Singh
    Cc: Pavel Emelianov
    Cc: Paul Menage
    Cc: Peter Zijlstra
    Cc: "Eric W. Biederman"
    Cc: Nick Piggin
    Cc: Kirill Korotaev
    Cc: Herbert Poetzl
    Cc: David Rientjes
    Cc: Vaidyanathan Srinivasan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • The memcgroup regime relies upon a cgroup reclaiming pages from itself within
    add_to_page_cache: which may involve some waiting. Whereas shmem and tmpfs
    rely upon using add_to_page_cache while holding a spinlock: when it cannot
    wait. The consequence is that when a cgroup reaches its limit, shmem_getpage
    just hangs - unless there is outside memory pressure too, neither kswapd nor
    radix_tree_preload get it out of the retry loop.

    In most cases we can mem_cgroup_cache_charge the page waitably first, to
    attach the page_cgroup in advance, so add_to_page_cache will do no more than
    increment a count; then mem_cgroup_uncharge_page after (in both success and
    failure cases) to balance the books again.

    And where there used to be a congestion_wait for kswapd (recently made
    redundant by radix_tree_preload), use mem_cgroup_cache_charge with NULL page
    to go through a cycle of allocation and freeing, without accounting to any
    particular page, and without updating the statistics vector. This brings the
    cgroup below its limit so the next try usually succeeds.
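
    A minimal sketch of the charge-first pattern described above (error
    handling and the "failed" label are illustrative, not the shmem code
    itself):

    /* charge while we are still allowed to wait */
    error = mem_cgroup_cache_charge(page, current->mm, gfp);
    if (error)
            goto failed;

    /* under the spinlock this can no longer wait; it only bumps a count */
    error = add_to_page_cache(page, mapping, idx, GFP_NOWAIT);

    /* in both the success and failure cases, balance the books again */
    mem_cgroup_uncharge_page(page);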

    Signed-off-by: Hugh Dickins
    Cc: Balbir Singh
    Cc: KAMEZAWA Hiroyuki
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Tidy up mem_cgroup_charge_common before extending it. Adjust some comments,
    but mainly clean up its loop: I've an aversion to loops full of continues,
    then a break or a goto at the bottom. And the is_atomic test should be on the
    __GFP_WAIT bit, not GFP_ATOMIC bits.

    Signed-off-by: Hugh Dickins
    Cc: Balbir Singh
    Cc: KAMEZAWA Hiroyuki
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Hugh Dickins noticed that we were using rcu_dereference() without
    rcu_read_lock() in the cache charging routine. The patch below fixes this
    problem.

    Signed-off-by: Balbir Singh
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Balbir Singh
     
  • Add a flag to page_cgroup to remember "this page is charged as cache."
    Cache here includes page cache and swap cache. This is useful for
    implementing precise accounting in the memory cgroup.

    TODO: distinguish page cache and swap cache.

    Signed-off-by: KAMEZAWA Hiroyuki
    Signed-off-by: YAMAMOTO Takashi
    Cc: Balbir Singh
    Cc: Pavel Emelianov
    Cc: Paul Menage
    Cc: Peter Zijlstra
    Cc: "Eric W. Biederman"
    Cc: Nick Piggin
    Cc: Kirill Korotaev
    Cc: Herbert Poetzl
    Cc: David Rientjes
    Cc: Vaidyanathan Srinivasan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • This patch adds an interface, "memory.force_empty". Any write to this file
    will drop all charges in this cgroup if there are no tasks under it.

    %echo 1 > /....../memory.force_empty

    will drop all charges of the memory cgroup if the cgroup has no tasks.

    This is useful for making rmdir() against a memory cgroup succeed.

    Tested and worked well on an x86_64/fake-NUMA system.

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Cc: Pavel Emelianov
    Cc: Paul Menage
    Cc: Peter Zijlstra
    Cc: "Eric W. Biederman"
    Cc: Nick Piggin
    Cc: Kirill Korotaev
    Cc: Herbert Poetzl
    Cc: David Rientjes
    Cc: Vaidyanathan Srinivasan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Because NODE_DATA(node)->node_zonelists[] is guaranteed to contain all
    necessary zones, it is not necessary to use for_each_online_node().

    Moreover, for_each_online_node() makes the reclaim routine always start
    from node 0, which is not good. This patch makes reclaim start from the
    caller's node and just use the usual (default) zonelist order.

    [akpm@linux-foundation.org: fix warning]
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Cc: Pavel Emelianov
    Cc: Paul Menage
    Cc: Peter Zijlstra
    Cc: "Eric W. Biederman"
    Cc: Nick Piggin
    Cc: Kirill Korotaev
    Cc: Herbert Poetzl
    Cc: David Rientjes
    Cc: Vaidyanathan Srinivasan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • If we're charging rss and we're charging cache, it seems obvious that we
    should be charging swapcache - as has been done. But in practice that
    doesn't work out so well: both swapin readahead and swapoff leave the
    majority of pages charged to the wrong cgroup (the cgroup that happened to
    read them in, rather than the cgroup to which they belong).

    (Which is why unuse_pte's GFP_KERNEL while holding pte lock never showed up
    as a problem: no allocation was ever done there, every page read being
    already charged to the cgroup which initiated the swapoff.)

    It all works rather better if we leave the charging to do_swap_page and
    unuse_pte, and do nothing for swapcache itself: revert mm/swap_state.c to
    what it was before the memory-controller patches. This also speeds up
    significantly a contained process working at its limit: because it no
    longer needs to keep waiting for swap writeback to complete.

    Is it unfair that swap pages become uncharged once they're unmapped, even
    though they're still clearly private to particular cgroups? For a short
    while, yes; but PageReclaim arranges for those pages to go to the end of
    the inactive list and be reclaimed soon if necessary.

    shmem/tmpfs pages are a distinct case: their charging also benefits from
    this change, but their second life on the lists as swapcache pages may
    prove more unfair - that I need to check next.

    Signed-off-by: Hugh Dickins
    Cc: Pavel Emelianov
    Acked-by: Balbir Singh
    Cc: Paul Menage
    Cc: Peter Zijlstra
    Cc: "Eric W. Biederman"
    Cc: Nick Piggin
    Cc: Kirill Korotaev
    Cc: Herbert Poetzl
    Cc: David Rientjes
    Cc: Vaidyanathan Srinivasan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • mem_cgroup_charge_common shows a tendency to OOM without good reason, when
    a memhog goes well beyond its rss limit but with plenty of swap available.
    Seen on x86 but not on PowerPC; seen when the next patch omits swapcache
    from memcgroup, but we presume it can happen without.

    mem_cgroup_isolate_pages is not quite satisfying reclaim's criteria for OOM
    avoidance. Already it has to scan beyond the nr_to_scan limit when it
    finds a !LRU page or an active page when handling inactive or an inactive
    page when handling active. It needs to do exactly the same when it finds a
    page from the wrong zone (the x86 tests had two zones, the PowerPC tests
    had only one).

    Don't increment scan and then decrement it in these cases, just move the
    incrementation down. Fix recent off-by-one when checking against
    nr_to_scan. Cut out "Check if the meta page went away from under us",
    presumably left over from early debugging: no amount of such checks could
    save us if this list really were being updated without locking.

    This change does make the unlimited scan while holding two spinlocks
    even worse - bad for latency and bad for containment; but that's a
    separate issue which is better left to be fixed a little later.

    Signed-off-by: Hugh Dickins
    Cc: Pavel Emelianov
    Acked-by: Balbir Singh
    Cc: Paul Menage
    Cc: Peter Zijlstra
    Cc: "Eric W. Biederman"
    Cc: Nick Piggin
    Cc: Kirill Korotaev
    Cc: Herbert Poetzl
    Cc: David Rientjes
    Cc: Vaidyanathan Srinivasan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • This patch makes mem_cgroup_isolate_pages():

    - ignore !PageLRU pages.
    - fix the bug that isolation makes no progress once it finds a page with
      page_zone(page) != zone (just increment scan in this case).

    kswapd and memory migration remove a page from the LRU list when they
    handle a page for reclaim/migration.

    Because __isolate_lru_page() doesn't move !PageLRU pages, it is safe to
    avoid touching a !PageLRU() page and its page_cgroup.

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Cc: Pavel Emelianov
    Cc: Paul Menage
    Cc: Peter Zijlstra
    Cc: "Eric W. Biederman"
    Cc: Nick Piggin
    Cc: Kirill Korotaev
    Cc: Herbert Poetzl
    Cc: David Rientjes
    Cc: Vaidyanathan Srinivasan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • While using the memory control cgroup, page migration under it works as
    follows:
    ==
    1. uncharge all refs at try_to_unmap().
    2. charge refs again at remove_migration_ptes().
    ==
    This is simple but has the following problems:
    ==
    The page is uncharged and charged back again if *mapped*.
    - This means that the cgroup before migration can be different from the
      one after migration.
    - If the page is not mapped but charged as page cache, the charge is just
      ignored (because it is not mapped, it will not be uncharged before
      migration). This is a memory leak.
    ==
    This patch tries to keep the memory cgroup across page migration by
    increasing one refcnt during it. 3 functions are added:

    mem_cgroup_prepare_migration() --- increase refcnt of page->page_cgroup
    mem_cgroup_end_migration() --- decrease refcnt of page->page_cgroup
    mem_cgroup_page_migration() --- copy page->page_cgroup from old page to
    new page.

    During migration
    - the old page is under PG_locked.
    - the new page is under PG_locked, too.
    - neither the old page nor the new page is on the LRU.

    These 3 facts guarantee that page_cgroup migration has no race.
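
    For illustration, the ordering this describes around the migration of one
    page is roughly the following (call sites and the success check are
    assumptions, not quoted from the patch):

    mem_cgroup_prepare_migration(page);     /* take a ref on page->page_cgroup */

    /* ... try_to_unmap(), copy the page, remove_migration_ptes() ... */

    if (migrated)
            mem_cgroup_page_migration(page, newpage); /* move the page_cgroup */

    mem_cgroup_end_migration(page);         /* drop the ref taken above */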

    Tested and worked well in x86_64/fake-NUMA box.

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Cc: Pavel Emelianov
    Cc: Paul Menage
    Cc: Peter Zijlstra
    Cc: "Eric W. Biederman"
    Cc: Nick Piggin
    Cc: Kirill Korotaev
    Cc: Herbert Poetzl
    Cc: David Rientjes
    Cc: Vaidyanathan Srinivasan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • This patch adds the following functions:
    - clear_page_cgroup(page, pc)
    - page_cgroup_assign_new_page_group(page, pc)

    Mainly for cleanup.

    The manner of "check page->cgroup again after lock_page_cgroup()" is
    implemented in a straightforward way.

    A comment in mem_cgroup_uncharge() will be removed by the force-empty
    patch.

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Cc: Pavel Emelianov
    Cc: Paul Menage
    Cc: Peter Zijlstra
    Cc: "Eric W. Biederman"
    Cc: Nick Piggin
    Cc: Kirill Korotaev
    Cc: Herbert Poetzl
    Cc: David Rientjes
    Cc: Vaidyanathan Srinivasan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • The current kswapd (and try_to_free_pages) code has an oddity where the
    code will wait on IO, even if there is no IO in flight. This problem is
    notable especially when the system scans through many unfreeable pages,
    causing unnecessary stalls in the VM.

    Additionally, tasks without __GFP_FS or __GFP_IO in the direct reclaim path
    will sleep if a significant number of pages are encountered that should be
    written out. This gives kswapd a chance to write out those pages, while
    the direct reclaim task sleeps.

    Signed-off-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     
  • Adds a new sysctl, 'oom_dump_tasks', that enables the kernel to produce a
    dump of all system tasks (excluding kernel threads) when performing an
    OOM-killing. Information includes pid, uid, tgid, vm size, rss, cpu,
    oom_adj score, and name.

    This is helpful for determining why there was an OOM condition and which
    rogue task caused it.

    It is configurable so that large systems, such as those with several
    thousand tasks, do not incur a performance penalty associated with dumping
    data they may not desire.

    If an OOM was triggered as a result of a memory controller, the tasklist
    shall be filtered to exclude tasks that are not a member of the same
    cgroup.

    Cc: Andrea Arcangeli
    Cc: Christoph Lameter
    Cc: Balbir Singh
    Signed-off-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • Creates a helper function to return non-zero if a task is a member of a
    memory controller:

    int task_in_mem_cgroup(const struct task_struct *task,
    const struct mem_cgroup *mem);

    When the OOM killer is constrained by the memory controller, the exclusion
    of tasks that are not members of that controller was previously misplaced:
    it appeared in the badness scoring function. Such tasks should instead be
    excluded during the tasklist scan in select_bad_process().
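
    A sketch of how the filter then reads inside the tasklist scan
    (illustrative placement):

    /* skip tasks outside the memory controller that triggered the OOM */
    if (mem && !task_in_mem_cgroup(p, mem))
            continue;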

    [akpm@linux-foundation.org: build fix]
    Cc: Christoph Lameter
    Cc: Balbir Singh
    Signed-off-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes