30 Jun, 2010

1 commit

  • The OOM waitqueue should be woken up when oom_disable is canceled. This
    is a fix for 3c11ecf448eff8f1 ("memcg: oom kill disable and oom status").

    How to test:
    Create a cgroup A...
    1. set memory.limit and memory.memsw.limit to be small value
    2. echo 1 > /cgroup/A/memory.oom_control, this disables oom-kill.
    3. run a program which must cause OOM.

    A program executed in 3 will sleep on the oom_waitqueue in memcg. The
    problem, then, is how to wake it up:

    1. echo 0 > /cgroup/A/memory.oom_control (enable OOM-killer)
    2. echo big mem > /cgroup/A/memory.memsw.limit_in_bytes(allow more swap)

    etc..

    Without the patch, the sleeping task cannot be woken up.

    Signed-off-by: KAMEZAWA Hiroyuki
    Acked-by: Daisuke Nishimura
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     

28 May, 2010

10 commits

  • Introduce struct mem_cgroup_thresholds. It helps to reduce the number of
    checks of the thresholds type (memory or mem+swap).

    [akpm@linux-foundation.org: repair comment]
    Signed-off-by: Kirill A. Shutemov
    Cc: Phil Carmody
    Cc: Balbir Singh
    Cc: Daisuke Nishimura
    Cc: KAMEZAWA Hiroyuki
    Acked-by: Paul Menage
    Cc: Li Zefan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Since we are unable to handle an error returned by
    cftype.unregister_event() properly, let's make the callback
    void-returning.

    mem_cgroup_unregister_event() has been rewritten to be a "never fail"
    function. In mem_cgroup_usage_register_event() we save the old buffer for
    the thresholds array and reuse it in mem_cgroup_usage_unregister_event()
    to avoid allocation.

    Signed-off-by: Kirill A. Shutemov
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Phil Carmody
    Cc: Balbir Singh
    Cc: Daisuke Nishimura
    Cc: Paul Menage
    Cc: Li Zefan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • FILE_MAPPED per memcg of migrated file cache is not properly updated,
    because our hook in page_add_file_rmap() can't know to which memcg
    FILE_MAPPED should be counted.

    Basically, this patch is for fixing the bug but includes some big changes
    to fix up other messes.

    Now, when migrating a mapped file, events happen in the following
    sequence.

    1. allocate a new page.
    2. get memcg of an old page.
    3. charge against the new page before migration. But at this point,
    no changes to the new page's page_cgroup, no commit for the charge.
    (IOW, the PCG_USED bit is not set.)
    4. page migration replaces radix-tree, old-page and new-page.
    5. page migration remaps the new page if the old page was mapped.
    6. Here, the new page is unlocked.
    7. memcg commits the charge for the new page and marks the new page's
    page_cgroup as PCG_USED.

    Because "commit" happens after page-remap, we cannot count FILE_MAPPED
    at "5": we should avoid trusting page_cgroup->mem_cgroup while the
    PCG_USED bit is unset.
    (Note: memcg's LRU removal code does that but LRU-isolation logic is used
    for helping it. When we overwrite page_cgroup->mem_cgroup, page_cgroup is
    not on LRU or page_cgroup->mem_cgroup is NULL.)

    We can lose file_mapped accounting information at 5 because FILE_MAPPED
    is updated only when mapcount changes 0->1. So we should catch it.
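    The 0->1 rule above can be sketched in plain C (a minimal illustration
    with hypothetical names, not the kernel's rmap code):

    ```c
    #include <assert.h>

    /* Per-memcg FILE_MAPPED moves only on the first map and the last
     * unmap of a page, mirroring the 0->1 / 1->0 transitions. */
    static int mapcount;     /* stand-in for page->_mapcount */
    static long file_mapped; /* stand-in for the FILE_MAPPED stat */

    static void add_file_rmap(void)
    {
        if (++mapcount == 1)  /* 0 -> 1: page becomes mapped */
            file_mapped++;
    }

    static void remove_file_rmap(void)
    {
        if (--mapcount == 0)  /* 1 -> 0: page becomes unmapped */
            file_mapped--;
    }

    int main(void)
    {
        add_file_rmap();           /* first mapping: counted */
        add_file_rmap();           /* second mapping: not counted */
        assert(file_mapped == 1);
        remove_file_rmap();
        assert(file_mapped == 1);  /* still mapped once */
        remove_file_rmap();
        assert(file_mapped == 0);
        return 0;
    }
    ```

    This is why a migration that remaps the page before the memcg commit
    misses the only transition at which the stat could have been bumped.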

    BTW, historically, the above implementation comes from handling
    migration failure of anonymous pages. Because we charge both the old
    page and the new page with mapcount=0, we can't catch
    - the page is really freed before remap.
    - migration fails but it's freed before remap
    or other corner cases.

    New migration sequence with memcg is:

    1. allocate a new page.
    2. mark PageCgroupMigration to the old page.
    3. charge against a new page onto the old page's memcg. (here, new page's pc
    is marked as PageCgroupUsed.)
    4. page migration replaces radix-tree, page table, etc...
    5. At remapping, the new page's page_cgroup is already marked as "USED",
    so we can catch the 0->1 event and FILE_MAPPED will be properly updated.

    6. After the new page is unlocked, freeing of the page by unmap() can
    be caught as a SWAPOUT event.

    7. Clear PageCgroupMigration of the old page.

    So, FILE_MAPPED will be correctly updated.

    Then, what is the MIGRATION flag for?
    Without it, at migration failure, we may have to charge the old page again
    because it may be fully unmapped. "Charge" means that we have to dive into
    memory reclaim or something similarly complicated. So, it's better to
    avoid charging it again. Before this patch, __commit_charge() was working
    for both the old and new pages and fixed everything up. But this technique
    had some race conditions around FILE_MAPPED, SWAPOUT, etc.
    Now, the kernel uses the MIGRATION flag and doesn't uncharge the old page
    until the end of migration.

    I hope this change makes memcg's page migration much simpler. This
    page migration has caused several troubles. It is worth adding a flag
    for simplification.

    Reviewed-by: Daisuke Nishimura
    Tested-by: Daisuke Nishimura
    Reported-by: Daisuke Nishimura
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Cc: Christoph Lameter
    Cc: "Kirill A. Shutemov"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    akpm@linux-foundation.org
     
  • Only an out of memory error will cause ret to be set.

    Signed-off-by: Phil Carmody
    Acked-by: Kirill A. Shutemov
    Cc: Balbir Singh
    Cc: Daisuke Nishimura
    Acked-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Phil Carmody
     
  • The bottom 4 hunks are atomically changing memory to which there are no
    aliases as it's freshly allocated, so there's no need to use atomic
    operations.

    The other hunks are just atomic_read and atomic_set, and do not involve
    any read-modify-write. The use of atomic_{read,set} doesn't prevent a
    read/write or write/write race, so if a race were possible (I'm not saying
    one is), then it would still be there even with atomic_set.

    See:
    http://digitalvampire.org/blog/index.php/2007/05/13/atomic-cargo-cults/

    Signed-off-by: Phil Carmody
    Acked-by: Kirill A. Shutemov
    Cc: Balbir Singh
    Cc: Daisuke Nishimura
    Acked-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Phil Carmody
     
  • This patch adds support for moving charge of file pages, which include
    normal file, tmpfs file and swaps of tmpfs file. It's enabled by setting
    bit 1 of /memory.move_charge_at_immigrate.

    Unlike the case of anonymous pages, file pages (and swaps) in the range
    mmapped by the task will be moved even if the task hasn't done a page
    fault, i.e. they might not be the task's "RSS", but another task's "RSS"
    that maps the same file. And the mapcount of the page is ignored (the page
    can be moved even if page_mapcount(page) > 1). So, the conditions that a
    page/swap must meet to be moved are that it must be in the range mmapped
    by the target task and it must be charged to the old cgroup.

    [akpm@linux-foundation.org: coding-style fixes]
    [akpm@linux-foundation.org: fix warning]
    Signed-off-by: Daisuke Nishimura
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daisuke Nishimura
     
  • This patch cleans up move charge code by:

    - define functions to handle pte for each types, and make
    is_target_pte_for_mc() cleaner.

    - instead of checking the MOVE_CHARGE_TYPE_ANON bit, define a function
    that checks the bit.

    Signed-off-by: Daisuke Nishimura
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daisuke Nishimura
     
  • This adds a feature to disable the oom-killer for a memcg. If it is
    disabled, of course, tasks under the memcg will stop at OOM.

    But now, we have an oom-notifier for memcg. And the world around the
    memcg is not under out-of-memory; memcg's out-of-memory just shows that
    the memcg hit its limit. Then, an administrator or management daemon can
    recover the situation by

    - kill some process
    - enlarge limit, add more swap.
    - migrate some tasks
    - remove file cache on tmpfs (difficult?)

    Unlike the oom-killer, you can take enough information before killing
    tasks (by gcore, ps, etc.).

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Daisuke Nishimura
    Cc: Balbir Singh
    Cc: Daisuke Nishimura
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Considering containers or other resource management software in userland,
    event notification of OOM in memcg should be implemented. Now that memcg
    has a "threshold" notifier which uses eventfd, we can make use of it for
    oom notification.

    This patch adds an oom notification eventfd callback for memcg. The usage
    is very similar to the threshold notifier, but the control file is
    memory.oom_control and no argument other than an eventfd is required.

    % cgroup_event_notifier /cgroup/A/memory.oom_control dummy
    (About cgroup_event_notifier, see Documentation/cgroup/)

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Daisuke Nishimura
    Cc: Balbir Singh
    Cc: Daisuke Nishimura
    Cc: David Rientjes
    Cc: Davide Libenzi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • memcg's oom waitqueue is a system-wide wait_queue (for handling
    hierarchy), so it's better to add a custom wake function and do filtering
    in the wake-up path.

    This patch adds a filtering feature for waking up oom-waiters. Hierarchy
    is properly handled.

    Signed-off-by: KAMEZAWA Hiroyuki
    Reviewed-by: Daisuke Nishimura
    Cc: Balbir Singh
    Cc: Daisuke Nishimura
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     

21 May, 2010

1 commit

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (44 commits)
    vlynq: make whole Kconfig-menu dependant on architecture
    add descriptive comment for TIF_MEMDIE task flag declaration.
    EEPROM: max6875: Header file cleanup
    EEPROM: 93cx6: Header file cleanup
    EEPROM: Header file cleanup
    agp: use NULL instead of 0 when pointer is needed
    rtc-v3020: make bitfield unsigned
    PCI: make bitfield unsigned
    jbd2: use NULL instead of 0 when pointer is needed
    cciss: fix shadows sparse warning
    doc: inode uses a mutex instead of a semaphore.
    uml: i386: Avoid redefinition of NR_syscalls
    fix "seperate" typos in comments
    cocbalt_lcdfb: correct sections
    doc: Change urls for sparse
    Powerpc: wii: Fix typo in comment
    i2o: cleanup some exit paths
    Documentation/: it's -> its where appropriate
    UML: Fix compiler warning due to missing task_struct declaration
    UML: add kernel.h include to signal.c
    ...

    Linus Torvalds
     

12 May, 2010

2 commits

  • Some callers (in memcontrol.c) call css_is_ancestor() without
    rcu_read_lock. Because css_is_ancestor() has to access RCU protected
    data, it should be under rcu_read_lock().

    This makes css_is_ancestor() itself do safe access to the RCU protected
    area. (At least, "root" can have refcnt==0 if it's not an ancestor of
    "child". So, we need rcu_read_lock().)

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: "Paul E. McKenney"
    Cc: Daisuke Nishimura
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Commit ad4ba375373937817404fd92239ef4cadbded23b ("memcg: css_id() must be
    called under rcu_read_lock()") modifies memcontrol.c to fix an RCU check
    message. But Andrew Morton pointed out that the fix doesn't seem sane
    and was just hiding lockdep messages.

    This is a patch to do things properly. Checking again, all the places
    accessing without rcu_read_lock that that commit fixed were intentional:
    all callers of css_id() hold a reference count on it. So, it's not
    necessary to be under rcu_read_lock().

    Considering it again, we can use rcu_dereference_check() for css_id(). We
    know css->id is valid if css->refcnt > 0 (css->id never changes and is
    freed only after css->refcnt goes to 0).

    This patch makes use of rcu_dereference_check() in css_id/depth and
    removes the unnecessary rcu-read-lock added by that commit.

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: "Paul E. McKenney"
    Cc: Daisuke Nishimura
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     

08 May, 2010

1 commit

  • …/git/tip/linux-2.6-tip

    * 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
    rcu: create rcu_my_thread_group_empty() wrapper
    memcg: css_id() must be called under rcu_read_lock()
    cgroup: Check task_lock in task_subsys_state()
    sched: Fix an RCU warning in print_task()
    cgroup: Fix an RCU warning in alloc_css_id()
    cgroup: Fix an RCU warning in cgroup_path()
    KEYS: Fix an RCU warning in the reading of user keys
    KEYS: Fix an RCU warning

    Linus Torvalds
     

05 May, 2010

1 commit

  • This patch fixes task_in_mem_cgroup(), mem_cgroup_uncharge_swapcache(),
    mem_cgroup_move_swap_account(), and is_target_pte_for_mc() to protect
    calls to css_id(). An additional RCU lockdep splat was reported for
    memcg_oom_wake_function(), however, this function is not yet in
    mainline as of 2.6.34-rc5.

    Reported-by: Li Zefan
    Cc: Daisuke Nishimura
    Cc: Balbir Singh
    Signed-off-by: KAMEZAWA Hiroyuki
    Tested-by: Li Zefan
    Signed-off-by: Paul E. McKenney
    Cc: Andrew Morton

    Paul E. McKenney
     

25 Apr, 2010

1 commit

  • If a signal is pending (the task is being killed by sigkill),
    __mem_cgroup_try_charge() will write NULL into &mem, and css_put will
    oops on a null pointer dereference.

    BUG: unable to handle kernel NULL pointer dereference at 0000000000000010
    IP: [] mem_cgroup_prepare_migration+0x7c/0xc0
    PGD a5d89067 PUD a5d8a067 PMD 0
    Oops: 0000 [#1] SMP
    last sysfs file: /sys/devices/platform/microcode/firmware/microcode/loading
    CPU 0
    Modules linked in: nfs lockd nfs_acl auth_rpcgss sunrpc acpi_cpufreq pcspkr sg [last unloaded: microcode]

    Pid: 5299, comm: largepages Tainted: G W 2.6.34-rc3 #3 Penryn1600SLI-110dB/To Be Filled By O.E.M.
    RIP: 0010:[] [] mem_cgroup_prepare_migration+0x7c/0xc0

    [nishimura@mxp.nes.nec.co.jp: fix merge issues]
    Signed-off-by: Andrea Arcangeli
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Signed-off-by: Daisuke Nishimura
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     

23 Apr, 2010

1 commit


07 Apr, 2010

1 commit

  • Presently, memcg's FILE_MAPPED accounting has the following race with
    move_account() (which happens at rmdir()).

        CPU-A                               CPU-B
        increment page->mapcount (rmap.c)
        mem_cgroup_update_file_mapped()
                                            move_account()
                                            lock_page_cgroup()
                                            check page_mapped(); if
                                            page_mapped(page) > 1 {
                                                FILE_MAPPED -1 from old memcg
                                                FILE_MAPPED +1 to new memcg
                                            }
                                            .....
                                            overwrite pc->mem_cgroup
                                            unlock_page_cgroup()
        lock_page_cgroup()
        FILE_MAPPED +1 to pc->mem_cgroup
        unlock_page_cgroup()

    Then,
        old memcg (-1 file mapped)
        new memcg (+2 file mapped)

    This happens because move_account() sees page_mapped(), which is not
    guarded by lock_page_cgroup(). This patch adds a FILE_MAPPED flag to
    page_cgroup and moves account information based on it. Now, all checks
    are synchronous with lock_page_cgroup().

    Signed-off-by: KAMEZAWA Hiroyuki
    Reviewed-by: Balbir Singh
    Reviewed-by: Daisuke Nishimura
    Cc: Andrea Righi
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     

25 Mar, 2010

2 commits

  • There was a potential null deref introduced in c62b1a3b31b5 ("memcg: use
    generic percpu instead of private implementation").

    Signed-off-by: Dan Carpenter
    Acked-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Carpenter
     
  • In commit 02491447 ("memcg: move charges of anonymous swap"), I tried to
    disable the move charge feature in the no-mmu case by enclosing all the
    related functions with "#ifdef CONFIG_MMU", but the commit places these
    ifdefs in the wrong place. (It seems they were mangled while handling
    some fixes...)

    This patch fixes it up.

    Signed-off-by: Daisuke Nishimura
    Cc: Balbir Singh
    Cc: KAMEZAWA Hiroyuki
    Cc: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daisuke Nishimura
     

15 Mar, 2010

1 commit


13 Mar, 2010

15 commits

  • In current page-fault code,

    handle_mm_fault()
    -> ...
    -> mem_cgroup_charge()
    -> map page or handle error.
    -> check return code.

    If page fault's return code is VM_FAULT_OOM, page_fault_out_of_memory() is
    called. But if it's caused by memcg, OOM should have been already
    invoked.

    Then, I added a patch: a636b327f731143ccc544b966cfd8de6cb6d72c6. That
    patch records last_oom_jiffies for memcg's sub-hierarchy and prevents
    page_fault_out_of_memory from being invoked in near future.

    But Nishimura-san reported that check by jiffies is not enough when the
    system is terribly heavy.

    This patch changes memcg's oom logic as follows:
    * If memcg causes an OOM-kill, continue to retry.
    * remove the jiffies check which is used now.
    * add a memcg-oom-lock which works like the per-zone oom lock.
    * If current is killed (as a process), bypass the charge.

    Something more sophisticated can be added, but this patch does the
    fundamental things.
    TODO:
    - add oom notifier
    - add per-memcg disable-oom-kill flag and freezer at oom.
    - more chances to wake up the oom waiter (when changing the memory limit,
    etc.)

    Reviewed-by: Daisuke Nishimura
    Tested-by: Daisuke Nishimura
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Cc: David Rientjes
    Signed-off-by: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Events should be removed after rmdir of the cgroup directory, but before
    destroying subsystem state objects. Let's take a reference to the cgroup
    directory dentry to do that.

    Signed-off-by: Kirill A. Shutemov
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Paul Menage
    Acked-by: Li Zefan
    Cc: Balbir Singh
    Cc: Pavel Emelyanov
    Cc: Dan Malek
    Cc: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Memcg has 2 event counters which count "the same" event; just their
    usages differ from each other. This patch tries to reduce them to one
    event counter.

    Now the logic uses an "only increment, no reset" counter and masks for
    each check. The softlimit check was done per 1000 events, so a similar
    check can be done by !(new_counter & 0x3ff). The threshold check was done
    per 100 events, so a similar check can be done by !(new_counter & 0x7f).

    ALL event checks are done right after the EVENT percpu counter is
    updated.
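    The mask-based periodic check described above can be sketched in plain C
    (a minimal illustration with made-up names, not the kernel code; note the
    masks fire every 1024 and 128 events, the power-of-two approximations of
    "per 1000" and "per 100"):

    ```c
    #include <assert.h>
    #include <stdbool.h>

    #define SOFTLIMIT_MASK  0x3ffUL  /* fires every 1024 events */
    #define THRESHOLD_MASK  0x7fUL   /* fires every 128 events */

    /* Hypothetical stand-in for the per-cpu EVENT counter:
     * only incremented, never reset. */
    static unsigned long events;

    /* A periodic check fires when the low bits wrap to zero. */
    static bool event_check(unsigned long mask)
    {
        return !(events & mask);
    }

    int main(void)
    {
        int softlimit_checks = 0, threshold_checks = 0;

        for (int i = 0; i < 4096; i++) {
            events++;
            if (event_check(SOFTLIMIT_MASK))
                softlimit_checks++;
            if (event_check(THRESHOLD_MASK))
                threshold_checks++;
        }
        /* 4096 events: 4096/1024 = 4 softlimit checks,
         * 4096/128 = 32 threshold checks. */
        assert(softlimit_checks == 4);
        assert(threshold_checks == 32);
        return 0;
    }
    ```

    One counter with two masks replaces two counters, at the cost of slightly
    coarser periods (1024/128 instead of 1000/100).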

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Kirill A. Shutemov
    Cc: Balbir Singh
    Cc: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Presently, move_task does "batched" precharge. Because res_counter and
    css refcnt operations are not scalable for memcg, try_charge() tends to
    be done in a batched manner if allowed.

    Now, softlimit and threshold check their event counter in try_charge, but
    the charge is not a per-page event. And the event counter is not updated
    at charge(). Moreover, precharge doesn't pass "page" to try_charge(), and
    the softlimit tree will never be updated until uncharge() causes an
    event.

    So the best place to check the event counter is commit_charge(). This is
    a per-page event by its nature. This patch moves the checks there.

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Kirill A. Shutemov
    Cc: Balbir Singh
    Cc: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • When the per-cpu counter for memcg was implemented, the dynamic percpu
    allocator was not very good. But now, we have a good one and useful
    macros. This patch replaces memcg's private percpu counter implementation
    with the generic dynamic percpu allocator.

    The benefits are
    - We can remove private implementation.
    - The counters will be NUMA-aware. (Current one is not...)
    - This patch makes sizeof struct mem_cgroup smaller. Then,
    struct mem_cgroup may be fit in page size on small config.
    - About basic performance aspects, see below.

    [Before]
    # size mm/memcontrol.o
    text data bss dec hex filename
    24373 2528 4132 31033 7939 mm/memcontrol.o

    [page-fault-throughput test on 8cpu/SMP in root cgroup]
    # /root/bin/perf stat -a -e page-faults,cache-misses --repeat 5 ./multi-fault-fork 8

    Performance counter stats for './multi-fault-fork 8' (5 runs):

    45878618 page-faults ( +- 0.110% )
    602635826 cache-misses ( +- 0.105% )

    61.005373262 seconds time elapsed ( +- 0.004% )

    Then cache-miss/page fault = 13.14

    [After]
    # size mm/memcontrol.o
    text data bss dec hex filename
    23913 2528 4132 30573 776d mm/memcontrol.o
    # /root/bin/perf stat -a -e page-faults,cache-misses --repeat 5 ./multi-fault-fork 8

    Performance counter stats for './multi-fault-fork 8' (5 runs):

    48179400 page-faults ( +- 0.271% )
    588628407 cache-misses ( +- 0.136% )

    61.004615021 seconds time elapsed ( +- 0.004% )

    Then cache-miss/page fault = 12.22

    Text size is reduced.
    This performance improvement is not big and will be invisible in real world
    applications. But this result shows this patch has some good effect even
    on (small) SMP.

    Here is a test program I used.

    1. fork() processes on each cpu.
    2. do page faults repeatedly in each process.
    3. after 60 secs, kill all children and exit.

    (3 is necessary for getting stable data; this is an improvement over the
    previous version.)

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>
    #include <signal.h>
    #include <sys/mman.h>
    #include <sys/wait.h>

    /*
     * For avoiding contention in page table lock, FAULT area is
     * sparse. If FAULT_LENGTH is too large for your cpus, decrease it.
     */
    #define FAULT_LENGTH (2 * 1024 * 1024)
    #define PAGE_SIZE 4096
    #define MAXNUM (128)

    void alarm_handler(int sig)
    {
    }

    void *worker(int cpu, int ppid)
    {
            void *start, *end;
            char *c;
            cpu_set_t set;

            CPU_ZERO(&set);
            CPU_SET(cpu, &set);
            sched_setaffinity(0, sizeof(set), &set);

            start = mmap(NULL, FAULT_LENGTH, PROT_READ|PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, 0, 0);
            if (start == MAP_FAILED) {
                    perror("mmap");
                    exit(1);
            }
            end = (char *)start + FAULT_LENGTH;

            pause();
            //fprintf(stderr, "run%d", cpu);
            while (1) {
                    for (c = (char *)start; (void *)c < end; c += PAGE_SIZE)
                            *c = 0;
                    madvise(start, FAULT_LENGTH, MADV_DONTNEED);
            }
            return NULL;
    }

    int main(int argc, char *argv[])
    {
            int num, i, ret, pid, status;
            int pids[MAXNUM];

            if (argc < 2)
                    return 0;

            setpgid(0, 0);
            signal(SIGALRM, alarm_handler);
            num = atoi(argv[1]);
            pid = getpid();

            for (i = 0; i < num; ++i) {
                    ret = fork();
                    if (!ret) {
                            worker(i, pid);
                            exit(0);
                    }
                    pids[i] = ret;
            }
            sleep(1);
            kill(-pid, SIGALRM);
            sleep(60);
            for (i = 0; i < num; i++)
                    kill(pids[i], SIGKILL);
            for (i = 0; i < num; i++)
                    waitpid(pids[i], &status, 0);
            return 0;
    }

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Daisuke Nishimura
    Cc: Balbir Singh
    Cc: Pavel Emelyanov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • s/mem_cgroup_print_mem_info/mem_cgroup_print_oom_info/

    Signed-off-by: Kirill A. Shutemov
    Cc: Balbir Singh
    Cc: Pavel Emelyanov
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • This allows registering multiple memory and memsw thresholds and getting
    notifications when they are crossed.

    To register a threshold, an application needs to:
    - create an eventfd;
    - open memory.usage_in_bytes or memory.memsw.usage_in_bytes;
    - write a string like "<event_fd> <fd of memory.usage_in_bytes>
    <threshold>" to cgroup.event_control.

    The application will be notified through the eventfd when memory usage
    crosses the threshold in any direction.

    It's applicable for root and non-root cgroups.

    It uses stats to track memory usage, similar to soft limits. It checks
    whether we need to send an event to userspace on every 100 pages in/out.
    I guess it's a good compromise between performance and accuracy of
    thresholds.

    [akpm@linux-foundation.org: coding-style fixes]
    [nishimura@mxp.nes.nec.co.jp: fix documentation merge issue]
    Signed-off-by: Kirill A. Shutemov
    Cc: Li Zefan
    Cc: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Cc: Pavel Emelyanov
    Cc: Dan Malek
    Cc: Vladislav Buzov
    Cc: Daisuke Nishimura
    Cc: Alexander Shishkin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Instead of incrementing a counter on each page in/out and comparing it
    with a constant, we set the counter to the constant, decrement it on each
    page in/out and compare it with zero. We want to make the comparison as
    fast as possible. On many RISC systems (probably not only RISC) comparing
    with zero is more efficient than comparing with a constant, since not
    every constant can be an immediate operand for the compare instruction.

    Also, I've renamed MEM_CGROUP_STAT_EVENTS to MEM_CGROUP_STAT_SOFTLIMIT,
    since it's really not a generic counter.
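    The count-down scheme above can be sketched in plain C (a hedged
    illustration with hypothetical names, not the actual memcg code):

    ```c
    #include <assert.h>
    #include <stdbool.h>

    #define SOFTLIMIT_EVENTS_TARGET 1000  /* hypothetical check period */

    /* Per-cpu countdown as described above: preset to the target,
     * decremented on each page in/out, compared against zero. */
    static long softlimit_countdown = SOFTLIMIT_EVENTS_TARGET;

    /* Returns true once every SOFTLIMIT_EVENTS_TARGET calls; the hot
     * path only needs a decrement and a compare-with-zero. */
    static bool charge_event(void)
    {
        if (--softlimit_countdown <= 0) {
            softlimit_countdown = SOFTLIMIT_EVENTS_TARGET;  /* rearm */
            return true;
        }
        return false;
    }

    int main(void)
    {
        int fired = 0;

        for (int i = 0; i < 3000; i++)
            if (charge_event())
                fired++;
        assert(fired == 3);  /* once per 1000 events */
        return 0;
    }
    ```

    Compare-with-zero typically maps to a flag set by the decrement itself,
    so no separate immediate-load or constant comparison is needed.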

    Signed-off-by: Kirill A. Shutemov
    Cc: Li Zefan
    Cc: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Cc: Pavel Emelyanov
    Cc: Dan Malek
    Cc: Vladislav Buzov
    Cc: Daisuke Nishimura
    Cc: Alexander Shishkin
    Cc: Davide Libenzi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Helper to get memory or mem+swap usage of the cgroup.

    Signed-off-by: Kirill A. Shutemov
    Acked-by: Balbir Singh
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Li Zefan
    Cc: Pavel Emelyanov
    Cc: Dan Malek
    Cc: Vladislav Buzov
    Cc: Daisuke Nishimura
    Cc: Alexander Shishkin
    Cc: Davide Libenzi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Try to reduce overheads in moving the swap charge by:

    - Adding a new function (__mem_cgroup_put), which takes "count" as an arg
    and decrements mem->refcnt by "count".
    - Removing res_counter_uncharge, css_put, and mem_cgroup_put from the
    path of moving the swap account, and consolidating all of them into
    mem_cgroup_clear_mc. We cannot do that for mc.to->refcnt.

    These changes reduce the overhead from 1.35sec to 0.9sec to move charges
    of 1G anonymous memory (including 500MB swap) in my test environment.

    Signed-off-by: Daisuke Nishimura
    Cc: Balbir Singh
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Li Zefan
    Cc: Paul Menage
    Cc: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daisuke Nishimura
     
  • This patch is another core part of the move-charge-at-task-migration
    feature. It enables moving charges of anonymous swap.

    To move the charge of swap, we need to exchange swap_cgroup's record.

    In the current implementation, swap_cgroup's record is protected by:

    - page lock: if the entry is on swap cache.
    - swap_lock: if the entry is not on swap cache.

    This works well for usual swap-in/out activity.

    But this behavior makes the feature of moving swap charges check many
    conditions to exchange swap_cgroup's record safely.

    So I changed the modification of swap_cgroup's record
    (swap_cgroup_record()) to use xchg, and defined a new function to cmpxchg
    swap_cgroup's record.

    This patch also enables moving the charge of swap caches which are not
    pte_present but not yet uncharged, which can exist on the swap-out path,
    by getting the target pages via find_get_page() as do_mincore() does.
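    The record-exchange idea can be sketched with C11 atomics (an
    illustration with made-up stand-in types and helper names, not the
    kernel's swap_cgroup code):

    ```c
    #include <assert.h>
    #include <stdatomic.h>

    /* Hypothetical stand-in for a swap_cgroup record: the id of the
     * memcg charged for one swap entry. */
    static _Atomic unsigned short record;

    /* Unconditionally install a new owner and return the old one,
     * analogous to swap_cgroup_record() rewritten to use xchg. */
    static unsigned short record_xchg(unsigned short new_id)
    {
        return atomic_exchange(&record, new_id);
    }

    /* Install new_id only if the record still holds old_id, analogous
     * to the new cmpxchg helper; returns the value actually found, so
     * the caller can tell whether the exchange happened. */
    static unsigned short record_cmpxchg(unsigned short old_id,
                                         unsigned short new_id)
    {
        unsigned short expected = old_id;

        atomic_compare_exchange_strong(&record, &expected, new_id);
        return expected;
    }

    int main(void)
    {
        record = 1;
        assert(record_xchg(2) == 1);        /* old owner was 1 */
        assert(record_cmpxchg(2, 3) == 2);  /* succeeds: 2 -> 3 */
        assert(record_cmpxchg(2, 4) == 3);  /* fails: record is 3 */
        assert(atomic_load(&record) == 3);
        return 0;
    }
    ```

    cmpxchg lets the mover detect that the record changed under it (e.g. by
    a concurrent swap-in) instead of blindly overwriting the new owner.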

    [kosaki.motohiro@jp.fujitsu.com: fix ia64 build]
    [akpm@linux-foundation.org: fix typos]
    Signed-off-by: Daisuke Nishimura
    Cc: Balbir Singh
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Li Zefan
    Cc: Paul Menage
    Cc: Daisuke Nishimura
    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daisuke Nishimura
     
  • This move-charge-at-task-migration feature has extra charges on
    "to" (pre-charges) and "from" (left-over charges) during charge moving.
    This means an unnecessary oom can happen.

    This patch tries to avoid such oom.

    Signed-off-by: Daisuke Nishimura
    Cc: Balbir Singh
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Li Zefan
    Cc: Paul Menage
    Cc: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daisuke Nishimura
     
  • Try to reduce overheads in moving the charge by:

    - Instead of calling res_counter_uncharge() against the old cgroup in
    __mem_cgroup_move_account() every time, call res_counter_uncharge() once
    at the end of task migration.
    - Removed css_get(&to->css) from __mem_cgroup_move_account() because
    callers should have already called css_get(). And removed
    css_put(&to->css) too, which was called by callers of move_account on
    success of move_account.
    - Instead of calling __mem_cgroup_try_charge(), i.e. res_counter_charge(),
    repeatedly, call res_counter_charge(PAGE_SIZE * count) in can_attach() if
    possible.
    - Instead of calling css_get()/css_put() repeatedly, make use of the
    coalesced __css_get()/__css_put() if possible.

    These changes reduce the overhead from 1.7sec to 0.6sec to move charges
    of 1G anonymous memory in my test environment.

    Signed-off-by: Daisuke Nishimura
    Cc: Balbir Singh
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Li Zefan
    Cc: Paul Menage
    Cc: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daisuke Nishimura
     
  • This patch is the core part of the move-charge-at-task-migration feature.
    It implements functions to move charges of anonymous pages mapped only by
    the target task.

    Implementation:
    - define struct move_charge_struct and a variable of it (mc) to remember
    the count of pre-charges and other information.
    - At can_attach(), get anon_rss of the target mm, call
    __mem_cgroup_try_charge() repeatedly and count up mc.precharge.
    - At attach(), parse the page table, find a target page to be moved, and
    call mem_cgroup_move_account() for the page.
    - Cancel all precharges if mc.precharge > 0 on failure or at the end of
    the task move.

    [akpm@linux-foundation.org: a little simplification]
    Signed-off-by: Daisuke Nishimura
    Cc: Balbir Singh
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Li Zefan
    Cc: Paul Menage
    Cc: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daisuke Nishimura
     
  • In the current memcg, charges associated with a task aren't moved to the
    new cgroup at task migration. Some users find this behavior strange.
    These patches are for this feature, that is, for charging to the new
    cgroup and, of course, uncharging from the old cgroup at task migration.

    This patch adds "memory.move_charge_at_immigrate" file, which is a flag
    file to determine whether charges should be moved to the new cgroup at
    task migration or not and what type of charges should be moved. This
    patch also adds read and write handlers of the file.

    This patch also adds no-op handlers for this feature; they will be
    implemented in later patches. For now, you cannot write any value other
    than 0 to move_charge_at_immigrate.

    Signed-off-by: Daisuke Nishimura
    Cc: Balbir Singh
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Li Zefan
    Cc: Paul Menage
    Cc: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daisuke Nishimura
     

07 Mar, 2010

1 commit


17 Jan, 2010

1 commit

    Current mem_cgroup_force_empty() only ensures mem->res.usage == 0 on
    success. But this doesn't guarantee the memcg's LRU is really empty,
    because there are some cases in which !PageCgroupUsed pages exist on the
    memcg's LRU.

    For example:
    - Pages can be uncharged by their owner process while they are on the LRU.
    - A race between mem_cgroup_add_lru_list() and __mem_cgroup_uncharge_common().

    So there can be a case in which the usage is zero but some of the LRUs are not empty.

    OTOH, mem_cgroup_del_lru_list(), which can be called asynchronously with
    rmdir, accesses the mem_cgroup; this access can cause a problem if it
    races with rmdir, because the mem_cgroup might have been freed by rmdir.

    Actually, I saw a bug which seems to be caused by this race.

    [1530745.949906] BUG: unable to handle kernel NULL pointer dereference at 0000000000000230
    [1530745.950651] IP: [] mem_cgroup_del_lru_list+0x30/0x80
    [1530745.950651] PGD 3863de067 PUD 3862c7067 PMD 0
    [1530745.950651] Oops: 0002 [#1] SMP
    [1530745.950651] last sysfs file: /sys/devices/system/cpu/cpu7/cache/index1/shared_cpu_map
    [1530745.950651] CPU 3
    [1530745.950651] Modules linked in: configs ipt_REJECT xt_tcpudp iptable_filter ip_tables x_tables bridge stp nfsd nfs_acl auth_rpcgss exportfs autofs4 hidp rfcomm l2cap crc16 bluetooth lockd sunrpc ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp bnx2i cnic uio ipv6 cxgb3i cxgb3 mdio libiscsi_tcp libiscsi scsi_transport_iscsi dm_mirror dm_multipath scsi_dh video output sbs sbshc battery ac lp kvm_intel kvm sg ide_cd_mod cdrom serio_raw tpm_tis tpm tpm_bios acpi_memhotplug button parport_pc parport rtc_cmos rtc_core rtc_lib e1000 i2c_i801 i2c_core pcspkr dm_region_hash dm_log dm_mod ata_piix libata shpchp megaraid_mbox sd_mod scsi_mod megaraid_mm ext3 jbd uhci_hcd ohci_hcd ehci_hcd [last unloaded: freq_table]
    [1530745.950651] Pid: 19653, comm: shmem_test_02 Tainted: G M 2.6.32-mm1-00701-g2b04386 #3 Express5800/140Rd-4 [N8100-1065]
    [1530745.950651] RIP: 0010:[] [] mem_cgroup_del_lru_list+0x30/0x80
    [1530745.950651] RSP: 0018:ffff8803863ddcb8 EFLAGS: 00010002
    [1530745.950651] RAX: 00000000000001e0 RBX: ffff8803abc02238 RCX: 00000000000001e0
    [1530745.950651] RDX: 0000000000000000 RSI: ffff88038611a000 RDI: ffff8803abc02238
    [1530745.950651] RBP: ffff8803863ddcc8 R08: 0000000000000002 R09: ffff8803a04c8643
    [1530745.950651] R10: 0000000000000000 R11: ffffffff810c7333 R12: 0000000000000000
    [1530745.950651] R13: ffff880000017f00 R14: 0000000000000092 R15: ffff8800179d0310
    [1530745.950651] FS: 0000000000000000(0000) GS:ffff880017800000(0000) knlGS:0000000000000000
    [1530745.950651] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
    [1530745.950651] CR2: 0000000000000230 CR3: 0000000379d87000 CR4: 00000000000006e0
    [1530745.950651] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    [1530745.950651] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
    [1530745.950651] Process shmem_test_02 (pid: 19653, threadinfo ffff8803863dc000, task ffff88038612a8a0)
    [1530745.950651] Stack:
    [1530745.950651] ffffea00040c2fe8 0000000000000000 ffff8803863ddd98 ffffffff810c739a
    [1530745.950651] 00000000863ddd18 000000000000000c 0000000000000000 0000000000000000
    [1530745.950651] 0000000000000002 0000000000000000 ffff8803863ddd68 0000000000000046
    [1530745.950651] Call Trace:
    [1530745.950651] [] release_pages+0x142/0x1e7
    [1530745.950651] [] ? pagevec_move_tail+0x6e/0x112
    [1530745.950651] [] pagevec_move_tail+0xfd/0x112
    [1530745.950651] [] lru_add_drain+0x76/0x94
    [1530745.950651] [] exit_mmap+0x6e/0x145
    [1530745.950651] [] mmput+0x5e/0xcf
    [1530745.950651] [] exit_mm+0x11c/0x129
    [1530745.950651] [] ? audit_free+0x196/0x1c9
    [1530745.950651] [] do_exit+0x1f5/0x6b7
    [1530745.950651] [] ? up_read+0x2b/0x2f
    [1530745.950651] [] ? lockdep_sys_exit_thunk+0x35/0x67
    [1530745.950651] [] do_group_exit+0x83/0xb0
    [1530745.950651] [] sys_exit_group+0x17/0x1b
    [1530745.950651] [] system_call_fastpath+0x16/0x1b
    [1530745.950651] Code: 54 53 0f 1f 44 00 00 83 3d cc 29 7c 00 00 41 89 f4 75 63 eb 4e 48 83 7b 08 00 75 04 0f 0b eb fe 48 89 df e8 18 f3 ff ff 44 89 e2 ff 4c d0 50 48 8b 05 2b 2d 7c 00 48 39 43 08 74 39 48 8b 4b
    [1530745.950651] RIP [] mem_cgroup_del_lru_list+0x30/0x80
    [1530745.950651] RSP
    [1530745.950651] CR2: 0000000000000230
    [1530745.950651] ---[ end trace c3419c1bb8acc34f ]---
    [1530745.950651] Fixing recursive fault but reboot is needed!

    The problem here is that pages on the LRU may contain a pointer to a
    stale memcg. To make res->usage 0, all pages in the memcg must be
    uncharged or moved to another (parent) memcg. A moved page_cgroup has
    already been removed from the original LRU, but an uncharged page_cgroup
    may remain on the LRU with a pointer to the memcg and without the
    PCG_USED bit. (This asynchronous LRU handling is for improving
    performance.) If the PCG_USED bit is not set, the page_cgroup will never
    be added to the memcg's LRU, so pages not on the LRU never access a stale
    pointer; what we have to take care of are page_cgroups that are _on_ an
    LRU list. This patch fixes the problem by making mem_cgroup_force_empty()
    visit all LRUs before exiting its loop, guaranteeing there are no pages
    left on its LRUs.

    Signed-off-by: Daisuke Nishimura
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daisuke Nishimura
     

17 Dec, 2009

1 commit

  • * 'hwpoison' of git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-mce-2.6: (34 commits)
    HWPOISON: Remove stray phrase in a comment
    HWPOISON: Try to allocate migration page on the same node
    HWPOISON: Don't do early filtering if filter is disabled
    HWPOISON: Add a madvise() injector for soft page offlining
    HWPOISON: Add soft page offline support
    HWPOISON: Undefine short-hand macros after use to avoid namespace conflict
    HWPOISON: Use new shake_page in memory_failure
    HWPOISON: Use correct name for MADV_HWPOISON in documentation
    HWPOISON: mention HWPoison in Kconfig entry
    HWPOISON: Use get_user_page_fast in hwpoison madvise
    HWPOISON: add an interface to switch off/on all the page filters
    HWPOISON: add memory cgroup filter
    memcg: add accessor to mem_cgroup.css
    memcg: rename and export try_get_mem_cgroup_from_page()
    HWPOISON: add page flags filter
    mm: export stable page flags
    HWPOISON: limit hwpoison injector to known page types
    HWPOISON: add fs/device filters
    HWPOISON: return 0 to indicate success reliably
    HWPOISON: make semantics of IGNORED/DELAYED clear
    ...

    Linus Torvalds