11 Aug, 2010

9 commits

  • We have zone_to_nid(). This patch converts all existing users of
    zone->zone_pgdat->node_id to it. (A tiny illustration follows this entry.)

    Signed-off-by: KOSAKI Motohiro
    Acked-by: Balbir Singh
    Cc: KAMEZAWA Hiroyuki
    Cc: Mel Gorman
    Cc: Nishimura Daisuke
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
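
    A tiny illustration of the conversion above, using simplified stand-in
    structs rather than the real kernel definitions; only the access pattern is
    the point.

    #include <stdio.h>

    /* Simplified stand-ins for the kernel structures named in the entry. */
    struct pglist_data { int node_id; };
    struct zone { struct pglist_data *zone_pgdat; };

    /* Helper mirroring what the entry says zone_to_nid() provides. */
    static int zone_to_nid(struct zone *zone)
    {
        return zone->zone_pgdat->node_id;
    }

    int main(void)
    {
        struct pglist_data pgdat = { .node_id = 1 };
        struct zone z = { .zone_pgdat = &pgdat };

        /* Open-coded form the patch replaces ... */
        int nid_old = z.zone_pgdat->node_id;
        /* ... and the helper that replaces it. */
        int nid_new = zone_to_nid(&z);

        printf("%d %d\n", nid_old, nid_new);
        return 0;
    }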
     
  • mem_cgroup_soft_limit_reclaim() takes zone, nid and zid arguments, but nid
    and zid can be calculated from zone, so remove them.

    Signed-off-by: KOSAKI Motohiro
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Mel Gorman
    Cc: Balbir Singh
    Cc: Nishimura Daisuke
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • Currently mem_cgroup_shrink_node_zone() calls shrink_zone() directly, so it
    doesn't need to initialize sc.nodemask: shrink_zone() doesn't use it at
    all.

    Signed-off-by: KOSAKI Motohiro
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Mel Gorman
    Cc: Balbir Singh
    Cc: Nishimura Daisuke
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • Now, the memory cgroup increments the css (cgroup subsys state) reference
    count for each charged page, and the reference is held until the page is
    uncharged. But this has three drawbacks.

    1. Because css_get()/css_put() call atomic_inc()/atomic_dec(), heavy use of
    them on a large SMP machine will not scale well.
    2. Because the css refcnt can never be in a "ready-to-release" state,
    cgroup's notify_on_release handler can't work with memcg.
    3. The css refcnt is an atomic_t, i.e. no wider than 32 bits, which may be
    too small.

    This has been a problem since the first merge of memcg.

    This is an attempt to remove the per-page css refcount. Even if we remove
    the refcnt, pre_destroy() does enough synchronization:
    - check res->usage == 0.
    - check no pages are on the LRU.

    This patch removes the per-page css refcount. Even after this patch, at
    first glance it seems css_get() is still called in try_charge().

    But the logic is:

    - If the memcg of mm->owner is the cached one, consume_stock() succeeds and
    we return immediately.
    - If consume_stock() returns false, css_get() is called and we take the
    slow path, which may block. At the end of the slow path, css_put() is
    called and, if necessary, we restart from the top.

    So in the fast path we don't call css_get() and avoid touching the shared
    counter. This makes the most common case fast. (A minimal sketch of this
    fast/slow split follows this entry.)

    Here is a result of a multi-threaded page-fault benchmark.

    [Before]
    25.32% multi-fault-all [kernel.kallsyms] [k] clear_page_c
    9.30% multi-fault-all [kernel.kallsyms] [k] _raw_spin_lock_irqsave
    8.02% multi-fault-all [kernel.kallsyms] [k] try_get_mem_cgroup_from_mm
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Daisuke Nishimura
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
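
    A minimal userspace sketch of the fast/slow split described above. The
    names consume_stock(), css_get() and css_put() follow the entry, but the
    types and logic here are simplified stand-ins, not the kernel
    implementation.

    #include <stdbool.h>
    #include <stdio.h>

    struct mem_cgroup { int css_refcnt; };

    static void css_get(struct mem_cgroup *memcg) { memcg->css_refcnt++; }
    static void css_put(struct mem_cgroup *memcg) { memcg->css_refcnt--; }

    static bool consume_stock(struct mem_cgroup *memcg)
    {
        /* Model the cached per-cpu stock as empty on the first call only. */
        static bool stocked;
        bool hit = stocked;

        (void)memcg;
        stocked = true;
        return hit;
    }

    static bool charge_slow_path(struct mem_cgroup *memcg)
    {
        (void)memcg;
        return true;        /* may block on reclaim in the real kernel */
    }

    static int try_charge(struct mem_cgroup *memcg)
    {
        for (;;) {
            /* Fast path: no css_get(), no shared-counter access. */
            if (consume_stock(memcg))
                return 0;

            /* Slow path: pin the css only while we might block. */
            css_get(memcg);
            bool ok = charge_slow_path(memcg);
            css_put(memcg);
            if (ok)
                return 0;
            /* otherwise retry from the top */
        }
    }

    int main(void)
    {
        struct mem_cgroup memcg = { 0 };
        int first = try_charge(&memcg);    /* takes the slow path once */
        int second = try_charge(&memcg);   /* fast path via the stock */

        printf("%d %d refcnt=%d\n", first, second, memcg.css_refcnt);
        return 0;
    }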
     
  • When the OOM killer scans tasks, it checks whether a task is under the
    memcg when it's invoked from memcg's context.

    But, as Oleg pointed out, a thread group leader may have a NULL ->mm, so
    task_in_mem_cgroup() may make the wrong decision. We have to use
    find_lock_task_mm() in memcg, as the generic OOM killer does. (A simplified
    model follows this entry.)

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Oleg Nesterov
    Cc: Daisuke Nishimura
    Cc: Balbir Singh
    Reviewed-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
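
    A simplified userspace model of the point above: a thread-group leader can
    have mm == NULL while another thread in the group still owns the mm, so a
    decision based only on the leader's ->mm goes wrong. The types and the
    find_task_with_mm() helper here are illustrative stand-ins for what
    find_lock_task_mm() does in the kernel, minus the locking.

    #include <stddef.h>
    #include <stdio.h>

    struct task { const char *comm; void *mm; };

    /* Scan the thread group and pick a thread that still has an mm. */
    static struct task *find_task_with_mm(struct task *group, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            if (group[i].mm)
                return &group[i];
        return NULL;
    }

    int main(void)
    {
        int dummy_mm;
        struct task group[] = {
            { "leader", NULL },        /* exited group leader: mm already gone */
            { "worker", &dummy_mm },   /* live thread still owning the mm */
        };
        struct task *t = find_task_with_mm(group, 2);

        printf("use %s\n", t ? t->comm : "(none)");
        return 0;
    }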
     
  • mem_cgroup_charge_common() is always called with @mem = NULL, so the
    argument is meaningless. This patch removes it.

    Signed-off-by: Daisuke Nishimura
    Cc: Balbir Singh
    Acked-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daisuke Nishimura
     
  • - try_get_mem_cgroup_from_mm() calls rcu_read_lock/unlock by itself, so we
    don't have to call them in task_in_mem_cgroup().
    - *mz is not used in __mem_cgroup_uncharge_common().
    - we don't have to call lookup_page_cgroup() in mem_cgroup_end_migration()
    after we've cleared PCG_MIGRATION of @oldpage.
    - remove empty comment.
    - remove redundant empty line in mem_cgroup_cache_charge().

    Signed-off-by: Daisuke Nishimura
    Acked-by: Balbir Singh
    Acked-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daisuke Nishimura
     
  • Now, to check whether a memcg is under task-account moving, we do
    css_tryget() against mc.to and mc.from. But this just complicates things.
    This patch makes the check easier.

    This patch adds a spinlock to move_charge_struct and guards modification of
    mc.to and mc.from with it. With this, we don't have to think about
    complicated races around this non-critical path. (A userspace model of the
    locking follows this entry.)

    [balbir@linux.vnet.ibm.com: don't crash on a null memcg being passed]
    Signed-off-by: KAMEZAWA Hiroyuki
    Signed-off-by: Balbir Singh
    Cc: Daisuke Nishimura
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
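
    A userspace model of the locking described above; a pthread mutex stands in
    for the kernel spinlock, and mem_cgroup_under_move() here is just an
    illustration of the simpler check, not the kernel function.

    #include <pthread.h>
    #include <stdbool.h>
    #include <stdio.h>

    struct mem_cgroup { int id; };

    static struct move_charge_struct {
        pthread_mutex_t lock;              /* guards modification of from/to */
        struct mem_cgroup *from;
        struct mem_cgroup *to;
    } mc = { .lock = PTHREAD_MUTEX_INITIALIZER };

    /* "Is this memcg under account moving?" becomes a locked compare. */
    static bool mem_cgroup_under_move(struct mem_cgroup *memcg)
    {
        bool ret;

        pthread_mutex_lock(&mc.lock);
        ret = (memcg == mc.from || memcg == mc.to);
        pthread_mutex_unlock(&mc.lock);
        return ret;
    }

    static void start_move(struct mem_cgroup *from, struct mem_cgroup *to)
    {
        pthread_mutex_lock(&mc.lock);
        mc.from = from;
        mc.to = to;
        pthread_mutex_unlock(&mc.lock);
    }

    int main(void)
    {
        struct mem_cgroup a = { 1 }, b = { 2 }, c = { 3 };

        start_move(&a, &b);
        printf("a: %d, c: %d\n",
               mem_cgroup_under_move(&a), mem_cgroup_under_move(&c));
        return 0;
    }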
     
  • mem_cgroup_try_charge() has a big loop in it and is hard to read. Most of
    the routines in it are for the slow path. This patch moves code out of the
    loop and makes it clear what's done.

    Summary:
    - refactor a function that detects whether a memcg is under account move.
    - refactor a function that waits for the end of task-account moving.
    - refactor the main loop's slow path into a function and make it clear from
    the return code why we retry or quit.
    - add a fatal_signal_pending() check for bypassing the charge loops.

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Daisuke Nishimura
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     

10 Aug, 2010

3 commits

  • Memcg also needs to trace page isolation information, as global reclaim
    does. This patch adds that.

    Signed-off-by: KOSAKI Motohiro
    Reviewed-by: KAMEZAWA Hiroyuki
    Acked-by: Mel Gorman
    Acked-by: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • This is a complete rewrite of the oom killer's badness() heuristic which is
    used to determine which task to kill in oom conditions. The goal is to
    make it as simple and predictable as possible so the results are better
    understood and we end up killing the task which will lead to the most
    memory freeing while still respecting the fine-tuning from userspace.

    Instead of basing the heuristic on mm->total_vm for each task, the task's
    rss and swap space are used instead. This is a better indication of the
    amount of memory that will be freeable if the oom-killed task is chosen
    and subsequently exits. This helps specifically in cases where KDE or
    GNOME is chosen for oom kill on desktop systems instead of a
    memory-hogging task.

    The baseline for the heuristic is a proportion of memory that each task is
    currently using in memory plus swap compared to the amount of "allowable"
    memory. "Allowable," in this sense, means the system-wide resources for
    unconstrained oom conditions, the set of mempolicy nodes, the mems
    attached to current's cpuset, or a memory controller's limit. The
    proportion is given on a scale of 0 (never kill) to 1000 (always kill),
    roughly meaning that a task with a badness() score of 500 consumes
    approximately 50% of the allowable memory resident in RAM or in swap
    space. (A rough arithmetic model of the scoring follows this entry.)

    The proportion is always relative to the amount of "allowable" memory and
    not the total amount of RAM systemwide so that mempolicies and cpusets may
    operate in isolation; they shall not need to know the true size of the
    machine on which they are running if they are bound to a specific set of
    nodes or mems, respectively.

    Root tasks are given 3% extra memory just like __vm_enough_memory()
    provides in LSMs. In the event of two tasks consuming similar amounts of
    memory, it is generally better to save root's task.

    Because of the change in the badness() heuristic's baseline, it is also
    necessary to introduce a new user interface to tune it. It's not possible
    to redefine the meaning of /proc/pid/oom_adj with a new scale since the
    ABI cannot be changed for backward compatibility. Instead, a new tunable,
    /proc/pid/oom_score_adj, is added that ranges from -1000 to +1000. It may
    be used to polarize the heuristic such that certain tasks are never
    considered for oom kill while others may always be considered. The value
    is added directly into the badness() score so a value of -500, for
    example, means to discount 50% of its memory consumption in comparison to
    other tasks either on the system, bound to the mempolicy, in the cpuset,
    or sharing the same memory controller.

    /proc/pid/oom_adj is changed so that its meaning is rescaled into the
    units used by /proc/pid/oom_score_adj, and vice versa. Changing one of
    these per-task tunables will rescale the value of the other to an
    equivalent meaning. Although /proc/pid/oom_adj was originally defined as
    a bitshift on the badness score, it now shares the same linear growth as
    /proc/pid/oom_score_adj but with different granularity. This is required
    so the ABI is not broken with userspace applications and allows oom_adj to
    be deprecated for future removal.

    Signed-off-by: David Rientjes
    Cc: Nick Piggin
    Cc: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Cc: Oleg Nesterov
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
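
    A rough arithmetic model of the scoring described above. The real badness()
    works on task and memcg/cpuset state; the inputs, the 3% root discount and
    the oom_score_adj handling here just mirror the changelog text.

    #include <stdbool.h>
    #include <stdio.h>

    static long badness_points(long rss_pages, long swap_pages,
                               long allowable_pages, bool is_root,
                               long oom_score_adj)
    {
        /* Proportion of allowable memory, scaled to 0..1000. */
        long points = (rss_pages + swap_pages) * 1000 / allowable_pages;

        /* Root tasks get roughly 3% of allowable memory for free,
         * i.e. 30 points on the 0..1000 scale. */
        if (is_root)
            points -= 30;

        /* oom_score_adj (-1000..1000) is added directly to the score. */
        points += oom_score_adj;

        return points < 0 ? 0 : points;
    }

    int main(void)
    {
        /* A task using half of the allowable memory scores about 500 ... */
        printf("%ld\n", badness_points(400, 100, 1000, false, 0));
        /* ... and oom_score_adj=-500 discounts half of that usage. */
        printf("%ld\n", badness_points(400, 100, 1000, false, -500));
        return 0;
    }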
     
  • Since 2.6.28, zone->prev_priority is unused, so it can be removed safely.
    This reduces stack usage slightly.

    Now I have to say that I'm sorry. Two years ago I thought prev_priority
    could be integrated again and would be useful, but four (or more) attempts
    haven't produced good performance numbers. Thus I give up on that approach.

    The rest of this changelog is notes on prev_priority: why it existed in the
    first place and why it might not be necessary any more. This information
    is based heavily on discussions between Andrew Morton, Rik van Riel and
    Kosaki Motohiro, who is quoted heavily below.

    Historically prev_priority was important because it determined when the VM
    would start unmapping PTE pages. i.e. there are no balances of note within
    the VM, Anon vs File and Mapped vs Unmapped. Without prev_priority, there
    is a potential risk of unnecessarily increasing minor faults as a large
    amount of read activity of use-once pages could push mapped pages to the
    end of the LRU and get unmapped.

    There is no proof this is still a problem, but currently it is not
    considered to be. Active files are not deactivated if the active file list
    is smaller than the inactive list, reducing the likelihood that file-mapped
    pages are being pushed off the LRU, and referenced executable pages are
    kept on the active list to avoid them being pushed out by read activity.

    Even if it is a problem, prev_priority wouldn't work nowadays. First of
    all, current vmscan still has a lot of UP-centric code, which exposes
    weaknesses on machines with dozens of CPUs. I think we need more and more
    improvement.

    The problem is that current vmscan mixes up per-system pressure, per-zone
    pressure and per-task pressure a bit. For example, prev_priority tries to
    boost a task's priority to match other concurrent reclaimers, but if the
    other task has a mempolicy restriction this is unnecessary, and it also
    causes unduly large latency and excessive reclaim. Per-task priority plus
    the prev_priority adjustment emulates per-system pressure, but it has two
    issues: 1) the emulation is too rough and brutal, and 2) we need per-zone
    pressure, not per-system pressure.

    Another example: currently DEF_PRIORITY is 12, which means the LRU is
    rotated about 2 cycles (1/4096 + 1/2048 + 1/1024 + ... + 1) before invoking
    the OOM killer. But if 10,000 threads enter DEF_PRIORITY reclaim at the
    same time, the system is under higher memory pressure than priority==0
    (1/4096 * 10,000 > 2). prev_priority can't solve such a multithreaded
    workload issue. In other words, the prev_priority concept assumes the
    system doesn't have lots of threads.

    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: Mel Gorman
    Reviewed-by: Johannes Weiner
    Reviewed-by: Rik van Riel
    Cc: Dave Chinner
    Cc: Chris Mason
    Cc: Nick Piggin
    Cc: Rik van Riel
    Cc: Johannes Weiner
    Cc: Christoph Hellwig
    Cc: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Cc: Andrea Arcangeli
    Cc: Michael Rubin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     

30 Jun, 2010

1 commit

  • The OOM waitqueue should be woken up when oom_disable is canceled. This is
    a fix for 3c11ecf448eff8f1 ("memcg: oom kill disable and oom status").

    How to test:
    Create a cgroup A...
    1. set memory.limit and memory.memsw.limit to a small value
    2. echo 1 > /cgroup/A/memory.oom_control; this disables oom-kill.
    3. run a program which must cause OOM.

    The program run in step 3 will sleep on the oom waitqueue in memcg. Then
    the question is how to wake it up:

    1. echo 0 > /cgroup/A/memory.oom_control (enable the OOM killer)
    2. echo a big value > /cgroup/A/memory.memsw.limit_in_bytes (allow more swap)

    etc.

    Without the patch, the sleeping task cannot be woken up.

    Signed-off-by: KAMEZAWA Hiroyuki
    Acked-by: Daisuke Nishimura
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     

28 May, 2010

10 commits

  • Introduce struct mem_cgroup_thresholds. It helps to reduce the number of
    checks of the thresholds type (memory or mem+swap). (A hedged sketch of the
    grouping follows this entry.)

    [akpm@linux-foundation.org: repair comment]
    Signed-off-by: Kirill A. Shutemov
    Cc: Phil Carmody
    Cc: Balbir Singh
    Cc: Daisuke Nishimura
    Cc: KAMEZAWA Hiroyuki
    Acked-by: Paul Menage
    Cc: Li Zefan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
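
    A hedged sketch of the grouping the entry describes: one small struct,
    instantiated once per counter type, so code can take "which thresholds?"
    as a pointer instead of re-checking the type. The field and type names
    here are illustrative stand-ins, not the kernel definitions.

    #include <stdio.h>

    struct threshold_array;                       /* sorted thresholds + eventfds */

    struct mem_cgroup_thresholds {
        struct threshold_array *primary;          /* active array */
        struct threshold_array *spare;            /* preallocated replacement */
    };

    struct mem_cgroup_sketch {
        struct mem_cgroup_thresholds thresholds;        /* memory usage */
        struct mem_cgroup_thresholds memsw_thresholds;  /* memory + swap usage */
    };

    /* One pointer selects the set; callers no longer branch on the type. */
    static struct mem_cgroup_thresholds *
    pick_thresholds(struct mem_cgroup_sketch *memcg, int swap)
    {
        return swap ? &memcg->memsw_thresholds : &memcg->thresholds;
    }

    int main(void)
    {
        struct mem_cgroup_sketch memcg = { { NULL, NULL }, { NULL, NULL } };

        printf("%d\n", pick_thresholds(&memcg, 1) == &memcg.memsw_thresholds);
        return 0;
    }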
     
  • Since we are unable to handle an error returned by
    cftype.unregister_event() properly, let's make the callback
    void-returning.

    mem_cgroup_unregister_event() has been rewritten to be a "never fail"
    function. In mem_cgroup_usage_register_event() we save the old buffer for
    the thresholds array and reuse it in mem_cgroup_usage_unregister_event() to
    avoid an allocation.

    Signed-off-by: Kirill A. Shutemov
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Phil Carmody
    Cc: Balbir Singh
    Cc: Daisuke Nishimura
    Cc: Paul Menage
    Cc: Li Zefan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • FILE_MAPPED per memcg of migrated file cache is not properly updated,
    because our hook in page_add_file_rmap() can't know to which memcg
    FILE_MAPPED should be counted.

    Basically, this patch fixes the bug, but it includes some big changes to
    fix up other messes as well.

    Now, when migrating a mapped file, events happen in the following sequence.

    1. allocate a new page.
    2. get memcg of an old page.
    3. charge against the new page before migration. But at this point, there
    are no changes to the new page's page_cgroup and no commit for the charge.
    (IOW, the PCG_USED bit is not set.)
    4. page migration replaces radix-tree, old-page and new-page.
    5. page migration remaps the new page if the old page was mapped.
    6. Here, the new page is unlocked.
    7. memcg commits the charge for the new page, marking the new page's
    page_cgroup as PCG_USED.

    Because "commit" happens after page-remap, we can count FILE_MAPPED
    at "5", because we should avoid to trust page_cgroup->mem_cgroup.
    if PCG_USED bit is unset.
    (Note: memcg's LRU removal code does that but LRU-isolation logic is used
    for helping it. When we overwrite page_cgroup->mem_cgroup, page_cgroup is
    not on LRU or page_cgroup->mem_cgroup is NULL.)

    We can lose file_mapped accounting information at 5 because FILE_MAPPED
    is updated only when mapcount changes 0->1. So we should catch it.

    BTW, historically, the above implementation comes from the
    migration-failure handling of anonymous pages. Because we charge both the
    old page and the new page with mapcount=0, we can't catch
    - the page really being freed before remap.
    - migration failing but the page being freed before remap
    or ... other corner cases.

    New migration sequence with memcg is:

    1. allocate a new page.
    2. mark PageCgroupMigration to the old page.
    3. charge against a new page onto the old page's memcg. (here, new page's pc
    is marked as PageCgroupUsed.)
    4. page migration replaces radix-tree, page table, etc...
    5. At remapping, the new page's page_cgroup is now marked as "USED".
    We can catch the 0->1 event and FILE_MAPPED will be properly updated.

    And we can catch the SWAPOUT event after unlocking, and freeing of this
    page by unmap() can be caught as well.

    7. Clear PageCgroupMigration of the old page.

    So, FILE_MAPPED will be correctly updated.

    Then, what is the MIGRATION flag for?
    Without it, on migration failure, we may have to charge the old page again
    because it may have been fully unmapped. "Charge" means that we may have to
    dive into memory reclaim or something similarly complicated, so it's better
    to avoid charging it again. Before this patch, __commit_charge() worked on
    both the old and the new page and fixed everything up, but that technique
    had some racy conditions around FILE_MAPPED, SWAPOUT, etc.
    Now the kernel uses the MIGRATION flag and doesn't uncharge the old page
    until the end of migration.

    I hope this change makes memcg's page migration much simpler. This page
    migration has caused several troubles; it is worth adding a flag for
    simplification.

    Reviewed-by: Daisuke Nishimura
    Tested-by: Daisuke Nishimura
    Reported-by: Daisuke Nishimura
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Cc: Christoph Lameter
    Cc: "Kirill A. Shutemov"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    akpm@linux-foundation.org
     
  • Only an out of memory error will cause ret to be set.

    Signed-off-by: Phil Carmody
    Acked-by: Kirill A. Shutemov
    Cc: Balbir Singh
    Cc: Daisuke Nishimura
    Acked-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Phil Carmody
     
  • The bottom 4 hunks are atomically changing memory to which there are no
    aliases as it's freshly allocated, so there's no need to use atomic
    operations.

    The other hunks are just atomic_read and atomic_set, and do not involve
    any read-modify-write. The use of atomic_{read,set} doesn't prevent a
    read/write or write/write race, so if a race were possible (I'm not saying
    one is), then it would still be there even with atomic_set.

    See:
    http://digitalvampire.org/blog/index.php/2007/05/13/atomic-cargo-cults/

    Signed-off-by: Phil Carmody
    Acked-by: Kirill A. Shutemov
    Cc: Balbir Singh
    Cc: Daisuke Nishimura
    Acked-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Phil Carmody
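
    A small C11 userspace illustration of the point above, not the patched
    kernel code: memory that has no aliases yet (it was just allocated) can be
    initialized with plain stores, and atomic_read/atomic_set-style accesses by
    themselves do not turn two independent updates into one atomic one.

    #include <stdatomic.h>
    #include <stdio.h>
    #include <stdlib.h>

    struct thresholds { size_t size; atomic_ulong usage; };

    int main(void)
    {
        /* Freshly allocated: nobody else can see it, so plain initialization
         * is enough -- no atomic read-modify-write is needed here. */
        struct thresholds *t = malloc(sizeof(*t));
        if (!t)
            return 1;
        t->size = 4;
        atomic_init(&t->usage, 0);

        /* Once published to other threads, a lone atomic load or store still
         * would not order a read/write or write/write pair -- that needs a
         * lock or a real read-modify-write, which is the "cargo cult" point. */
        printf("%zu %lu\n", t->size, (unsigned long)atomic_load(&t->usage));
        free(t);
        return 0;
    }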
     
  • This patch adds support for moving the charge of file pages, which include
    normal files, tmpfs files and swaps of tmpfs files. It's enabled by setting
    bit 1 of /memory.move_charge_at_immigrate. (A sketch of the bit layout
    follows this entry.)

    Unlike the case of anonymous pages, file pages (and swaps) in the range
    mmapped by the task will be moved even if the task hasn't faulted them in,
    i.e. they might not be the task's "RSS" but another task's "RSS" that maps
    the same file. The mapcount of the page is also ignored (the page can be
    moved even if page_mapcount(page) > 1). So the conditions a page/swap must
    meet to be moved are that it is in the range mmapped by the target task and
    that it is charged to the old cgroup.

    [akpm@linux-foundation.org: coding-style fixes]
    [akpm@linux-foundation.org: fix warning]
    Signed-off-by: Daisuke Nishimura
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daisuke Nishimura
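
    A sketch of the control-file bits mentioned above. Bit 1 (file pages) is
    from this entry; treating bit 0 as the anonymous-page bit follows the
    earlier "move charges of anonymous swap" work and is an assumption here,
    as are the macro names.

    #include <stdio.h>

    #define MOVE_CHARGE_ANON 0x1UL   /* bit 0: anonymous pages (assumed) */
    #define MOVE_CHARGE_FILE 0x2UL   /* bit 1: file pages and their swap */

    int main(void)
    {
        /* e.g. writing 3 to memory.move_charge_at_immigrate sets both bits */
        unsigned long move_charge_at_immigrate = MOVE_CHARGE_ANON | MOVE_CHARGE_FILE;

        printf("move file pages: %s\n",
               (move_charge_at_immigrate & MOVE_CHARGE_FILE) ? "yes" : "no");
        return 0;
    }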
     
  • This patch cleans up move charge code by:

    - define functions to handle the pte for each type, and make
    is_target_pte_for_mc() cleaner.

    - instead of checking the MOVE_CHARGE_TYPE_ANON bit, define a function
    that checks the bit.

    Signed-off-by: Daisuke Nishimura
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daisuke Nishimura
     
  • This adds a feature to disable the oom-killer for a memcg. If it is
    disabled, of course, tasks under the memcg will stop at OOM.

    But now we have the oom-notifier for memcg, and the world around the memcg
    is not out of memory; memcg's out-of-memory just shows that the memcg hit
    its limit. Then an administrator or management daemon can recover the
    situation by

    - killing some process
    - enlarging the limit, adding more swap
    - migrating some tasks
    - removing file cache on tmpfs (difficult?)

    Unlike the oom-killer, you can gather enough information before killing
    tasks (by gcore, ps, etc.).

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Daisuke Nishimura
    Cc: Balbir Singh
    Cc: Daisuke Nishimura
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Considering containers or other resource-management software in userland,
    event notification of OOM in memcg should be implemented. Memcg now has a
    "threshold" notifier which uses eventfd, and we can make use of it for OOM
    notification.

    This patch adds an OOM-notification eventfd callback for memcg. The usage
    is very similar to the threshold notifier, but the control file is
    memory.oom_control and no argument other than the eventfd is required.

    % cgroup_event_notifier /cgroup/A/memory.oom_control dummy
    (About cgroup_event_notifier, see Documentation/cgroup/)

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Daisuke Nishimura
    Cc: Balbir Singh
    Cc: Daisuke Nishimura
    Cc: David Rientjes
    Cc: Davide Libenzi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • memcg's OOM waitqueue is a system-wide wait_queue (for handling hierarchy),
    so it's better to add a custom wake function and do the filtering in the
    wakeup path.

    This patch adds a filtering feature for waking up oom-waiters. Hierarchy is
    handled properly.

    Signed-off-by: KAMEZAWA Hiroyuki
    Reviewed-by: Daisuke Nishimura
    Cc: Balbir Singh
    Cc: Daisuke Nishimura
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     

21 May, 2010

1 commit

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (44 commits)
    vlynq: make whole Kconfig-menu dependant on architecture
    add descriptive comment for TIF_MEMDIE task flag declaration.
    EEPROM: max6875: Header file cleanup
    EEPROM: 93cx6: Header file cleanup
    EEPROM: Header file cleanup
    agp: use NULL instead of 0 when pointer is needed
    rtc-v3020: make bitfield unsigned
    PCI: make bitfield unsigned
    jbd2: use NULL instead of 0 when pointer is needed
    cciss: fix shadows sparse warning
    doc: inode uses a mutex instead of a semaphore.
    uml: i386: Avoid redefinition of NR_syscalls
    fix "seperate" typos in comments
    cocbalt_lcdfb: correct sections
    doc: Change urls for sparse
    Powerpc: wii: Fix typo in comment
    i2o: cleanup some exit paths
    Documentation/: it's -> its where appropriate
    UML: Fix compiler warning due to missing task_struct declaration
    UML: add kernel.h include to signal.c
    ...

    Linus Torvalds
     

12 May, 2010

2 commits

  • Some callers (in memcontrol.c) call css_is_ancestor() without
    rcu_read_lock. Because css_is_ancestor() has to access RCU-protected
    data, it should be called under rcu_read_lock().

    This makes css_is_ancestor() itself do safe access to the RCU-protected
    area. (At least, "root" can have refcnt==0 if it's not an ancestor of
    "child", so we need rcu_read_lock().)

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: "Paul E. McKenney"
    Cc: Daisuke Nishimura
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Commit ad4ba375373937817404fd92239ef4cadbded23b ("memcg: css_id() must be
    called under rcu_read_lock()") modified memcontrol.c to fix an RCU check
    message. But Andrew Morton pointed out that the fix doesn't seem sane and
    was just hiding lockdep messages.

    This patch does things properly. Checking again, all the unlocked accesses
    that commit "fixed" were intentional: all callers of css_id() hold a
    reference count on the css, so they don't need to be under rcu_read_lock().

    Considering it again, we can use rcu_dereference_check() for css_id(). We
    know css->id is valid if css->refcnt > 0 (css->id never changes and is
    freed only after css->refcnt drops to 0).

    This patch makes use of rcu_dereference_check() in css_id()/css_depth() and
    removes the unnecessary rcu_read_lock added by that commit.

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: "Paul E. McKenney"
    Cc: Daisuke Nishimura
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     

08 May, 2010

1 commit

  • …/git/tip/linux-2.6-tip

    * 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
    rcu: create rcu_my_thread_group_empty() wrapper
    memcg: css_id() must be called under rcu_read_lock()
    cgroup: Check task_lock in task_subsys_state()
    sched: Fix an RCU warning in print_task()
    cgroup: Fix an RCU warning in alloc_css_id()
    cgroup: Fix an RCU warning in cgroup_path()
    KEYS: Fix an RCU warning in the reading of user keys
    KEYS: Fix an RCU warning

    Linus Torvalds
     

05 May, 2010

1 commit

  • This patch fixes task_in_mem_cgroup(), mem_cgroup_uncharge_swapcache(),
    mem_cgroup_move_swap_account(), and is_target_pte_for_mc() to protect
    calls to css_id(). An additional RCU lockdep splat was reported for
    memcg_oom_wake_function(), however, this function is not yet in
    mainline as of 2.6.34-rc5.

    Reported-by: Li Zefan
    Cc: Daisuke Nishimura
    Cc: Balbir Singh
    Signed-off-by: KAMEZAWA Hiroyuki
    Tested-by: Li Zefan
    Signed-off-by: Paul E. McKenney
    Cc: Andrew Morton

    Paul E. McKenney
     

25 Apr, 2010

1 commit

  • If a signal is pending (task being killed by sigkill)
    __mem_cgroup_try_charge will write NULL into &mem, and css_put will oops
    on null pointer dereference.

    BUG: unable to handle kernel NULL pointer dereference at 0000000000000010
    IP: [] mem_cgroup_prepare_migration+0x7c/0xc0
    PGD a5d89067 PUD a5d8a067 PMD 0
    Oops: 0000 [#1] SMP
    last sysfs file: /sys/devices/platform/microcode/firmware/microcode/loading
    CPU 0
    Modules linked in: nfs lockd nfs_acl auth_rpcgss sunrpc acpi_cpufreq pcspkr sg [last unloaded: microcode]

    Pid: 5299, comm: largepages Tainted: G W 2.6.34-rc3 #3 Penryn1600SLI-110dB/To Be Filled By O.E.M.
    RIP: 0010:[] [] mem_cgroup_prepare_migration+0x7c/0xc0

    [nishimura@mxp.nes.nec.co.jp: fix merge issues]
    Signed-off-by: Andrea Arcangeli
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Signed-off-by: Daisuke Nishimura
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     

23 Apr, 2010

1 commit


07 Apr, 2010

1 commit

  • Presently, memcg's FILE_MAPPED accounting has the following race with
    move_account() (which happens at rmdir()):

        increment page->mapcount (rmap.c)
        mem_cgroup_update_file_mapped()      move_account()
                                             lock_page_cgroup()
                                             check page_mapped();
                                             if page_mapped(page) {
                                                 FILE_MAPPED -1 from old memcg
                                                 FILE_MAPPED +1 to new memcg
                                             }
                                             .....
                                             overwrite pc->mem_cgroup
                                             unlock_page_cgroup()
        lock_page_cgroup()
        FILE_MAPPED +1 to pc->mem_cgroup
        unlock_page_cgroup()

    Then,
        old memcg (-1 file mapped)
        new memcg (+2 file mapped)

    This happens because move_account() sees page_mapped(), which is not
    guarded by lock_page_cgroup(). This patch adds a FILE_MAPPED flag to
    page_cgroup and moves the accounting information based on it. Now all the
    checks are synchronized with lock_page_cgroup().

    Signed-off-by: KAMEZAWA Hiroyuki
    Reviewed-by: Balbir Singh
    Reviewed-by: Daisuke Nishimura
    Cc: Andrea Righi
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     

25 Mar, 2010

2 commits

  • There was a potential null deref introduced in c62b1a3b31b5 ("memcg: use
    generic percpu instead of private implementation").

    Signed-off-by: Dan Carpenter
    Acked-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Carpenter
     
  • In commit 02491447 ("memcg: move charges of anonymous swap"), I tried to
    disable the move-charge feature in the no-MMU case by enclosing all the
    related functions in "#ifdef CONFIG_MMU", but the commit placed these
    ifdefs in the wrong place (it seems it got mangled while handling some
    fixes...).

    This patch fixes it up.

    Signed-off-by: Daisuke Nishimura
    Cc: Balbir Singh
    Cc: KAMEZAWA Hiroyuki
    Cc: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daisuke Nishimura
     

15 Mar, 2010

1 commit


13 Mar, 2010

6 commits

  • In the current page-fault code,

    handle_mm_fault()
    -> ...
    -> mem_cgroup_charge()
    -> map page or handle error.
    -> check return code.

    If the page fault's return code is VM_FAULT_OOM, page_fault_out_of_memory()
    is called. But if it's caused by memcg, the OOM killer should have been
    invoked already.

    Then I added a patch, a636b327f731143ccc544b966cfd8de6cb6d72c6. That patch
    records last_oom_jiffies for memcg's sub-hierarchy and prevents
    page_fault_out_of_memory() from being invoked in the near future.

    But Nishimura-san reported that the jiffies check is not enough when the
    system is terribly heavy.

    This patch changes memcg's OOM logic as follows:
    * If memcg causes an OOM kill, continue to retry.
    * Remove the jiffies check which is used now.
    * Add a memcg-oom-lock which works like the per-zone oom lock.
    * If current is killed (as a process), bypass the charge.

    Something more sophisticated could be added, but this patch does the
    fundamental things.
    TODO:
    - add oom notifier
    - add per-memcg disable-oom-kill flag and freezer at oom.
    - more chances to wake up oom waiters (when changing the memory limit,
    etc.)

    Reviewed-by: Daisuke Nishimura
    Tested-by: Daisuke Nishimura
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Cc: David Rientjes
    Signed-off-by: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Events should be removed after rmdir of the cgroup directory, but before
    destroying the subsystem state objects. Let's take a reference to the
    cgroup directory dentry to do that.

    Signed-off-by: Kirill A. Shutemov
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Paul Menage
    Acked-by: Li Zefan
    Cc: Balbir Singh
    Cc: Pavel Emelyanov
    Cc: Dan Malek
    Cc: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Memcg has two event counters which count "the same" event; only their
    usages differ from each other. This patch reduces them to one event
    counter.

    Now the logic uses an "only increment, no reset" counter and a mask for
    each check. The softlimit check was done per 1000 events, so a similar
    check can be done by !(new_counter & 0x3ff). The threshold check was done
    per 100 events, so a similar check can be done by !(new_counter & 0x7f).
    (A small sketch of this mask trick follows this entry.)

    All event checks are done right after the EVENT percpu counter is updated.

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Kirill A. Shutemov
    Cc: Balbir Singh
    Cc: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
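
    A small standalone sketch of the mask trick above: with a counter that only
    ever increments, "every N events" checks (N a power of two) reduce to
    testing the low bits. The 0x3ff and 0x7f masks follow the entry (every 1024
    and every 128 events); the function names are illustrative.

    #include <stdio.h>

    #define SOFTLIMIT_EVENTS_MASK 0x3ffUL   /* soft limit: every 1024 events */
    #define THRESHOLD_EVENTS_MASK 0x7fUL    /* thresholds: every 128 events */

    static unsigned long event_counter;

    static void account_event(void)
    {
        unsigned long new_counter = ++event_counter;

        if (!(new_counter & THRESHOLD_EVENTS_MASK))
            printf("check thresholds at %lu\n", new_counter);
        if (!(new_counter & SOFTLIMIT_EVENTS_MASK))
            printf("check soft limit at %lu\n", new_counter);
    }

    int main(void)
    {
        for (int i = 0; i < 2048; i++)
            account_event();
        return 0;
    }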
     
  • Presently, move_task does a "batched" precharge. Because the res_counter
    and the css refcnt are not scalable for memcg, try_charge()... tends to be
    done in a batched manner if allowed.

    Now, the softlimit and threshold checks read their event counter in
    try_charge(), but the charge is not a per-page event and the event counter
    is not updated at charge(). Moreover, precharge doesn't pass a "page" to
    try_charge(), so the softlimit tree will never be updated until uncharge()
    causes an event.

    So the best place to check the event counter is commit_charge(), which is a
    per-page event by its nature. This patch moves the checks there.

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Kirill A. Shutemov
    Cc: Balbir Singh
    Cc: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • When the per-cpu counter for memcg was implemented, the dynamic percpu
    allocator was not very good. But now we have a good one and useful macros.
    This patch replaces memcg's private percpu counter implementation with the
    generic dynamic percpu allocator.

    The benefits are
    - We can remove the private implementation.
    - The counters will be NUMA-aware. (The current one is not...)
    - This patch makes sizeof(struct mem_cgroup) smaller, so struct mem_cgroup
    may fit in a page on small configs.
    - About basic performance aspects, see below.

    [Before]
    # size mm/memcontrol.o
    text data bss dec hex filename
    24373 2528 4132 31033 7939 mm/memcontrol.o

    [page-fault-throughput test on 8cpu/SMP in root cgroup]
    # /root/bin/perf stat -a -e page-faults,cache-misses --repeat 5 ./multi-fault-fork 8

    Performance counter stats for './multi-fault-fork 8' (5 runs):

    45878618 page-faults ( +- 0.110% )
    602635826 cache-misses ( +- 0.105% )

    61.005373262 seconds time elapsed ( +- 0.004% )

    Then cache-miss/page fault = 13.14

    [After]
    #size mm/memcontrol.o
    text data bss dec hex filename
    23913 2528 4132 30573 776d mm/memcontrol.o
    # /root/bin/perf stat -a -e page-faults,cache-misses --repeat 5 ./multi-fault-fork 8

    Performance counter stats for './multi-fault-fork 8' (5 runs):

    48179400 page-faults ( +- 0.271% )
    588628407 cache-misses ( +- 0.136% )

    61.004615021 seconds time elapsed ( +- 0.004% )

    Then cache-miss/page fault = 12.22

    The text size is reduced.
    This performance improvement is not big and will be invisible in real-world
    applications, but the result shows this patch has some positive effect even
    on (small) SMP.

    Here is the test program I used.

    1. fork() a process on each cpu.
    2. each process page-faults repeatedly.
    3. after 60 secs, kill all children and exit.

    (Step 3 is necessary for getting stable data; this is an improvement over
    the previous version.)

    #define _GNU_SOURCE
    #include <sched.h>
    #include <signal.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/mman.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    /*
     * For avoiding contention in page table lock, FAULT area is
     * sparse. If FAULT_LENGTH is too large for your cpus, decrease it.
     */
    #define FAULT_LENGTH (2 * 1024 * 1024)
    #define PAGE_SIZE 4096
    #define MAXNUM (128)

    void alarm_handler(int sig)
    {
    }

    void *worker(int cpu, int ppid)
    {
            void *start, *end;
            char *c;
            cpu_set_t set;
            int i;

            CPU_ZERO(&set);
            CPU_SET(cpu, &set);
            sched_setaffinity(0, sizeof(set), &set);

            start = mmap(NULL, FAULT_LENGTH, PROT_READ|PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, 0, 0);
            if (start == MAP_FAILED) {
                    perror("mmap");
                    exit(1);
            }
            end = start + FAULT_LENGTH;

            pause();
            //fprintf(stderr, "run%d", cpu);
            while (1) {
                    for (c = (char*)start; (void *)c < end; c += PAGE_SIZE)
                            *c = 0;
                    madvise(start, FAULT_LENGTH, MADV_DONTNEED);
            }
            return NULL;
    }

    int main(int argc, char *argv[])
    {
            int num, i, ret, pid, status;
            int pids[MAXNUM];

            if (argc < 2)
                    return 0;

            setpgid(0, 0);
            signal(SIGALRM, alarm_handler);
            num = atoi(argv[1]);
            pid = getpid();

            for (i = 0; i < num; ++i) {
                    ret = fork();
                    if (!ret) {
                            worker(i, pid);
                            exit(0);
                    }
                    pids[i] = ret;
            }
            sleep(1);
            kill(-pid, SIGALRM);
            sleep(60);
            for (i = 0; i < num; i++)
                    kill(pids[i], SIGKILL);
            for (i = 0; i < num; i++)
                    waitpid(pids[i], &status, 0);
            return 0;
    }

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Daisuke Nishimura
    Cc: Balbir Singh
    Cc: Pavel Emelyanov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • s/mem_cgroup_print_mem_info/mem_cgroup_print_oom_info/

    Signed-off-by: Kirill A. Shutemov
    Cc: Balbir Singh
    Cc: Pavel Emelyanov
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov