13 Mar, 2010

40 commits

  • While in theory user_enable_single_step/user_disable_single_step/
    user_enable_block_step could also be provided as an inline or macro, there's
    no good reason to do so, and having the prototypes in one place keeps code
    size and confusion down.

    Roland said:

    The original thought there was that user_enable_single_step() et al
    might well be only an instruction or three on a sane machine (as if we
    have any of those!), and since there is only one call site inlining
    would be beneficial. But I agree that there is no strong reason to care
    about inlining it.

    As to the arch changes, there is only one thought I'd add to the
    record. It was always my thinking that for an arch where
    PTRACE_SINGLESTEP does text-modifying breakpoint insertion,
    user_enable_single_step() should not be provided. That is,
    arch_has_single_step()=>true means that there is an arch facility with
    "pure" semantics that does not have any unexpected side effects.
    Inserting a breakpoint might do very unexpected strange things in
    multi-threaded situations. Aside from that, it is a peculiar side
    effect that user_{enable,disable}_single_step() should cause COW
    de-sharing of text pages and so forth. For PTRACE_SINGLESTEP, all these
    peculiarities are the status quo ante for that arch, so having
    arch_ptrace() itself do those is one thing. But for building other
    things in the future, it is nicer to have a uniform "pure" semantics
    that arch-independent code can expect.

    OTOH, all such arch issues are really up to the arch maintainer. As
    of today, there is nothing but ptrace using user_enable_single_step() et
    al so it's a distinction without a practical difference. If/when there
    are other facilities that use user_enable_single_step() and might care,
    the affected arch's can revisit the question when someone cares about
    the quality of the arch support for said new facility.
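
    For reference, the consolidated prototypes now live in linux/ptrace.h,
    keyed off the arch capability macros. A simplified sketch (not the
    verbatim header):

    /* linux/ptrace.h, simplified sketch: one shared set of prototypes
     * instead of per-arch inlines or macros. */
    #ifdef arch_has_single_step
    extern void user_enable_single_step(struct task_struct *);
    extern void user_disable_single_step(struct task_struct *);
    #endif

    #ifdef arch_has_block_step
    extern void user_enable_block_step(struct task_struct *);
    #endif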

    Signed-off-by: Christoph Hellwig
    Cc: Oleg Nesterov
    Cc: Roland McGrath
    Acked-by: David Howells
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     
  • Use ptrace_request() in the three remaining architectures that didn't use it
    (m68knommu, h8300, microblaze). This means:

    - ptrace_request() now handles the PTRACE_{PEEK,POKE}{TEXT,DATA} and
    PTRACE_DETACH calls that were previously implemented directly, or in the
    case of h8300 even open-coded.
    - adds new support for PTRACE_SETOPTIONS/PTRACE_GETEVENTMSG/
    PTRACE_GETSIGINFO/PTRACE_SETSIGINFO
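
    The shape of the conversion (a sketch, not the literal diff for any of
    the three arches): arch_ptrace() keeps only its arch-specific cases and
    defers everything else to the generic helper.

    /* sketch of the pattern; types and the generic helper are kernel-side */
    long arch_ptrace(struct task_struct *child, long request,
                     long addr, long data)
    {
        switch (request) {
        /* ... arch-specific requests (register peeking etc.) ... */
        default:
            /* PEEK/POKE, DETACH, SETOPTIONS, ... go to kernel/ptrace.c */
            return ptrace_request(child, request, addr, data);
        }
    }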

    Signed-off-by: Christoph Hellwig
    Cc: Geert Uytterhoeven
    Cc: Yoshinori Sato
    Cc: Oleg Nesterov
    Cc: Michal Simek
    Acked-by: Greg Ungerer
    Acked-by: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     
  • We can't declare two variables in the same scope with NODEMASK_ALLOC().

    This patch fixes it.
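
    One plausible shape of the collision (the macro bodies here are
    illustrative, not the kernel's exact definitions): if a variant of
    NODEMASK_ALLOC() names its backing storage with a fixed identifier
    instead of deriving it from the caller's name by token pasting, a second
    use in the same scope redeclares that identifier.

    /* broken: two uses in one scope both declare `_m` -- redefinition error */
    #define NODEMASK_ALLOC_BAD(type, name)  type _m, *name = &_m
    /* fixed: the backing storage name is derived from `name` */
    #define NODEMASK_ALLOC_OK(type, name)   type _##name, *name = &_##name

    void two_masks(void)
    {
        NODEMASK_ALLOC_OK(int, a);      /* declares _a and a */
        NODEMASK_ALLOC_OK(int, b);      /* declares _b and b */
        (void)a; (void)b;
    }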

    Signed-off-by: Miao Xie
    Cc: David Rientjes
    Cc: Lee Schermerhorn
    Cc: Nick Piggin
    Cc: Paul Menage
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miao Xie
     
  • Nishimura-san has been doing very good work for memcg. His reviews and
    tests have given us many improvements, and the account migration work he
    is now taking on is really important.

    He is a stakeholder.

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Daisuke Nishimura
    Cc: Balbir Singh
    Acked-by: Pavel Emelyanov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • In the current page-fault code,

    handle_mm_fault()
    -> ...
    -> mem_cgroup_charge()
    -> map page or handle error.
    -> check return code.

    If the page fault's return code is VM_FAULT_OOM, page_fault_out_of_memory()
    is called. But if it's caused by memcg, OOM should already have been
    invoked.

    Then, I added a patch: a636b327f731143ccc544b966cfd8de6cb6d72c6. That
    patch records last_oom_jiffies for memcg's sub-hierarchy and prevents
    page_fault_out_of_memory() from being invoked in the near future.

    But Nishimura-san reported that the jiffies check is not enough when the
    system is terribly heavy.

    This patch changes memcg's oom logic as follows:
    * If memcg causes an OOM-kill, continue to retry.
    * Remove the jiffies check which is used now.
    * Add a memcg-oom-lock which works like the per-zone oom lock.
    * If current is killed (as a process), bypass the charge.

    Something more sophisticated can be added, but this patch does the
    fundamental things; a sketch of the resulting flow follows the TODO list.
    TODO:
    - add oom notifier
    - add per-memcg disable-oom-kill flag and freezer at oom.
    - more chances to wake up the oom waiter (when changing the memory limit, etc.)
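
    A self-contained model of the resulting charge/retry flow (names and
    logic are illustrative stand-ins, not the actual mm/memcontrol.c code):

    #include <stdbool.h>

    struct memcg { bool oom_lock; long usage, limit; };

    static bool try_charge(struct memcg *m)
    {
        if (m->usage >= m->limit)
            return false;
        m->usage++;
        return true;
    }

    /* trivial stubs standing in for reclaim, signal, and OOM-kill logic */
    static bool try_reclaim(struct memcg *m) { (void)m; return false; }
    static bool task_is_dying(void) { return false; }
    static void oom_kill_in(struct memcg *m) { m->usage = 0; }

    /* returns true when the charge is accounted or deliberately bypassed */
    bool charge_with_oom_retry(struct memcg *m)
    {
        for (;;) {
            if (try_charge(m))
                return true;
            if (try_reclaim(m))
                continue;               /* reclaim made progress; retry */
            if (task_is_dying())
                return true;            /* bypass the charge: current was killed */
            if (!m->oom_lock) {         /* memcg-oom-lock, like the per-zone one */
                m->oom_lock = true;
                oom_kill_in(m);
                m->oom_lock = false;
            }
            /* keep retrying instead of surfacing VM_FAULT_OOM */
        }
    }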

    Reviewed-by: Daisuke Nishimura
    Tested-by: Daisuke Nishimura
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Cc: David Rientjes
    Signed-off-by: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Signed-off-by: Kirill A. Shutemov
    Reviewed-by: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Description of the sanity check for memory thresholds.

    Signed-off-by: Kirill A. Shutemov
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Paul Menage
    Cc: Li Zefan
    Cc: Balbir Singh
    Cc: Pavel Emelyanov
    Cc: Dan Malek
    Cc: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • An example of cgroup notification API usage.

    Signed-off-by: Kirill A. Shutemov
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: Paul Menage
    Cc: Li Zefan
    Cc: Balbir Singh
    Cc: Pavel Emelyanov
    Cc: Dan Malek
    Cc: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Events should be removed after the rmdir of the cgroup directory, but
    before destroying the subsystem state objects. Take a reference to the
    cgroup directory dentry to do that.

    Signed-off-by: Kirill A. Shutemov
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Paul Menage
    Acked-by: Li Zefan
    Cc: Balbir Singh
    Cc: Pavel Emelyanov
    Cc: Dan Malek
    Cc: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Notify userspace about cgroup removal only after the rmdir of the cgroup
    directory, to avoid a race between userspace and kernelspace.

    eventfd is used to notify about two types of event:
    - control-file-specific events, like crossing a memory threshold;
    - cgroup removal.

    To understand what really happened, userspace can check whether the
    cgroup still exists. To avoid a race between userspace and kernelspace we
    have to notify userspace about cgroup removal only after the rmdir of the
    cgroup directory.

    Signed-off-by: Kirill A. Shutemov
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: Paul Menage
    Acked-by: Li Zefan
    Cc: Balbir Singh
    Cc: Pavel Emelyanov
    Cc: Dan Malek
    Cc: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Presently, if panic_on_oom=2, the whole system panics even if the oom
    happened in some special situation (such as a cpuset or mempolicy...). So
    panic_on_oom=2 means panic_on_oom_always.

    Now, memcg doesn't check the panic_on_oom flag. This patch adds a check.

    BTW, how is it useful?

    kdump + panic_on_oom=2 is the last tool to investigate what happened in
    an oom-ed system. When a task is killed, the system recovers and there
    are few hints as to what happened. In mission-critical systems, oom
    should never happen. So panic_on_oom=2 + kdump is useful to avoid the
    next OOM by learning precise information from a snapshot.

    TODO:
    - For memcg, which is for isolating the system's memory usage, an
    oom-notifier and freeze_at_oom (or rest_at_oom) should be implemented.
    Then, a management daemon can do similar jobs (as kdump) or take a
    snapshot per cgroup.

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Cc: David Rientjes
    Cc: Nick Piggin
    Reviewed-by: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Update memcg_test.txt to describe how to test the move-charge feature.

    Signed-off-by: Daisuke Nishimura
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Balbir Singh
    Cc: "Kirill A. Shutemov"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daisuke Nishimura
     
  • Memcg has 2 event counters which count "the same" event. Only their
    usages differ from each other. This patch tries to reduce them to a
    single event counter.

    The new logic uses an "only increment, no reset" counter, with a mask for
    each check. The softlimit check was done per 1000 events, so a similar
    check can be done by !(new_counter & 0x3ff). The threshold check was done
    per 100 events, so a similar check can be done by !(new_counter & 0x7f).

    All event checks are done right after the EVENT percpu counter is updated.
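
    In code form, the two checks collapse to bit masks over a single
    monotonically increasing counter (illustrative; 0x3ff and 0x7f are the
    power-of-two stand-ins for "per 1000" and "per 100"):

    #include <stdbool.h>

    #define SOFTLIMIT_EVENTS_MASK   0x3ffUL     /* fires every 1024 events */
    #define THRESHOLDS_EVENTS_MASK  0x7fUL      /* fires every 128 events */

    /* the counter only ever increments; a check fires when its low bits wrap */
    static inline bool event_check_due(unsigned long counter, unsigned long mask)
    {
        return !(counter & mask);
    }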

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Kirill A. Shutemov
    Cc: Balbir Singh
    Cc: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Presently, move_task does "batched" precharging. Because the res_counter
    and css refcnt are not-scalable jobs for memcg, try_charge_()... tends to
    be done in a batched manner if allowed.

    Now, softlimit and threshold check their event counter in try_charge, but
    the charge is not a per-page event, and the event counter is not updated
    at charge(). Moreover, precharge doesn't pass a "page" to try_charge(),
    so the softlimit tree will never be updated until an uncharge() causes an
    event.

    So the best place to check the event counter is commit_charge(), which is
    a per-page event by its nature. This patch moves the checks there.

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Kirill A. Shutemov
    Cc: Balbir Singh
    Cc: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • When the per-cpu counter for memcg was implemented, the dynamic percpu
    allocator was not very good. But now we have a good one and useful
    macros. This patch replaces memcg's private percpu counter implementation
    with the generic dynamic percpu allocator.

    The benefits are
    - We can remove the private implementation.
    - The counters will be NUMA-aware. (The current one is not.)
    - This patch makes sizeof(struct mem_cgroup) smaller, so struct
    mem_cgroup may fit in a page on small configs.
    - About basic performance aspects, see below.

    [Before]
    # size mm/memcontrol.o
    text data bss dec hex filename
    24373 2528 4132 31033 7939 mm/memcontrol.o

    [page-fault-throughput test on 8-cpu SMP in the root cgroup]
    # /root/bin/perf stat -a -e page-faults,cache-misses --repeat 5 ./multi-fault-fork 8

    Performance counter stats for './multi-fault-fork 8' (5 runs):

    45878618 page-faults ( +- 0.110% )
    602635826 cache-misses ( +- 0.105% )

    61.005373262 seconds time elapsed ( +- 0.004% )

    Then cache-miss/page fault = 13.14

    [After]
    # size mm/memcontrol.o
    text data bss dec hex filename
    23913 2528 4132 30573 776d mm/memcontrol.o
    # /root/bin/perf stat -a -e page-faults,cache-misses --repeat 5 ./multi-fault-fork 8

    Performance counter stats for './multi-fault-fork 8' (5 runs):

    48179400 page-faults ( +- 0.271% )
    588628407 cache-misses ( +- 0.136% )

    61.004615021 seconds time elapsed ( +- 0.004% )

    Then cache-miss/page fault = 12.22

    Text size is reduced. The performance improvement is not big and will be
    invisible in real-world applications, but this result shows the patch has
    some good effect even on (small) SMP.

    Here is the test program I used.

    1. fork() processes on each cpu.
    2. do page faults repeatedly in each process.
    3. after 60 secs, kill all children and exit.

    (3 is necessary for getting stable data; this is an improvement over the
    previous version.)

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <signal.h>
    #include <sched.h>
    #include <sys/types.h>
    #include <sys/mman.h>
    #include <sys/wait.h>

    /*
     * For avoiding contention in page table lock, FAULT area is
     * sparse. If FAULT_LENGTH is too large for your cpus, decrease it.
     */
    #define FAULT_LENGTH (2 * 1024 * 1024)
    #define PAGE_SIZE 4096
    #define MAXNUM (128)

    void alarm_handler(int sig)
    {
    }

    void *worker(int cpu, int ppid)
    {
        void *start, *end;
        char *c;
        cpu_set_t set;

        /* pin this child to its own cpu */
        CPU_ZERO(&set);
        CPU_SET(cpu, &set);
        sched_setaffinity(0, sizeof(set), &set);

        start = mmap(NULL, FAULT_LENGTH, PROT_READ|PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (start == MAP_FAILED) {
            perror("mmap");
            exit(1);
        }
        end = start + FAULT_LENGTH;

        pause();
        //fprintf(stderr, "run%d", cpu);
        while (1) {
            /* touch every page, then drop them to fault again */
            for (c = (char *)start; (void *)c < end; c += PAGE_SIZE)
                *c = 0;
            madvise(start, FAULT_LENGTH, MADV_DONTNEED);
        }
        return NULL;
    }

    int main(int argc, char *argv[])
    {
        int num, i, ret, pid, status;
        int pids[MAXNUM];

        if (argc < 2)
            return 0;

        setpgid(0, 0);
        signal(SIGALRM, alarm_handler);
        num = atoi(argv[1]);
        pid = getpid();

        for (i = 0; i < num; ++i) {
            ret = fork();
            if (!ret) {
                worker(i, pid);
                exit(0);
            }
            pids[i] = ret;
        }
        sleep(1);
        kill(-pid, SIGALRM);    /* start all workers at once */
        sleep(60);
        for (i = 0; i < num; i++)
            kill(pids[i], SIGKILL);
        for (i = 0; i < num; i++)
            waitpid(pids[i], &status, 0);
        return 0;
    }

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Daisuke Nishimura
    Cc: Balbir Singh
    Cc: Pavel Emelyanov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • s/mem_cgroup_print_mem_info/mem_cgroup_print_oom_info/

    Signed-off-by: Kirill A. Shutemov
    Cc: Balbir Singh
    Cc: Pavel Emelyanov
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • It allows registering multiple memory and memsw thresholds and getting
    notifications when a threshold is crossed.

    To register a threshold, an application needs to:
    - create an eventfd;
    - open memory.usage_in_bytes or memory.memsw.usage_in_bytes;
    - write a string like "<event_fd> <fd of memory.usage_in_bytes>
    <threshold>" to cgroup.event_control.

    The application will be notified through the eventfd when memory usage
    crosses the threshold in any direction.

    It's applicable to both root and non-root cgroups.

    It uses stats to track memory usage, similar to soft limits. It checks
    whether we need to send an event to userspace on every 100 pages in/out;
    this seems a good compromise between performance and the accuracy of the
    thresholds.
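
    As a usage illustration, the registration sequence above can be exercised
    from userspace roughly like this (a sketch: the cgroup mount point, group
    name, and 4MB threshold are assumptions, not part of the patch):

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <stdint.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/eventfd.h>

    int main(void)
    {
        /* assumed paths: memory cgroup mounted at /cgroups/memory, group "0" */
        const char *usage_path = "/cgroups/memory/0/memory.usage_in_bytes";
        const char *ctrl_path = "/cgroups/memory/0/cgroup.event_control";
        unsigned long long threshold = 4ULL << 20;  /* 4MB, arbitrary */
        char buf[64];
        uint64_t ticks;
        int efd, ufd, cfd;

        efd = eventfd(0, 0);                /* 1. create an eventfd */
        ufd = open(usage_path, O_RDONLY);   /* 2. open the usage file */
        cfd = open(ctrl_path, O_WRONLY);
        if (efd < 0 || ufd < 0 || cfd < 0) {
            perror("setup");
            return 1;
        }

        /* 3. "<event_fd> <fd of memory.usage_in_bytes> <threshold>" */
        snprintf(buf, sizeof(buf), "%d %d %llu", efd, ufd, threshold);
        if (write(cfd, buf, strlen(buf)) < 0) {
            perror("cgroup.event_control");
            return 1;
        }

        /* blocks until usage crosses the threshold in either direction */
        if (read(efd, &ticks, sizeof(ticks)) == sizeof(ticks))
            printf("memory threshold crossed\n");
        return 0;
    }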

    [akpm@linux-foundation.org: coding-style fixes]
    [nishimura@mxp.nes.nec.co.jp: fix documentation merge issue]
    Signed-off-by: Kirill A. Shutemov
    Cc: Li Zefan
    Cc: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Cc: Pavel Emelyanov
    Cc: Dan Malek
    Cc: Vladislav Buzov
    Cc: Daisuke Nishimura
    Cc: Alexander Shishkin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Instead of incrementing a counter on each page in/out and comparing it
    with a constant, we set the counter to the constant, decrement it on each
    page in/out, and compare it with zero. We want to make the comparison as
    fast as possible: on many RISC systems (probably not only RISC) comparing
    with zero is cheaper than comparing with a constant, since not every
    constant can be an immediate operand of the compare instruction.

    Also, I've renamed MEM_CGROUP_STAT_EVENTS to MEM_CGROUP_STAT_SOFTLIMIT,
    since it's really not a generic counter.
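
    A minimal illustration of the transformation (not the memcg code itself):

    /* before: increment, then compare against a constant */
    static long events;
    static int check_due_before(void)
    {
        if (++events >= 128) {          /* compare with an immediate */
            events = 0;
            return 1;
        }
        return 0;
    }

    /* after: preload the budget, count down, compare with zero */
    static long events_left = 128;
    static int check_due_after(void)
    {
        if (--events_left == 0) {       /* testing against zero is cheap */
            events_left = 128;
            return 1;
        }
        return 0;
    }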

    Signed-off-by: Kirill A. Shutemov
    Cc: Li Zefan
    Cc: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Cc: Pavel Emelyanov
    Cc: Dan Malek
    Cc: Vladislav Buzov
    Cc: Daisuke Nishimura
    Cc: Alexander Shishkin
    Cc: Davide Libenzi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Helper to get memory or mem+swap usage of the cgroup.

    Signed-off-by: Kirill A. Shutemov
    Acked-by: Balbir Singh
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Li Zefan
    Cc: Pavel Emelyanov
    Cc: Dan Malek
    Cc: Vladislav Buzov
    Cc: Daisuke Nishimura
    Cc: Alexander Shishkin
    Cc: Davide Libenzi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • This patchset introduces eventfd-based API for notifications in cgroups
    and implements memory notifications on top of it.

    It uses statistics in the memory controller to track memory usage.

    Output of time(1) for building a kernel on tmpfs:

    Root cgroup before changes:
    make -j2 506.37 user 60.93s system 193% cpu 4:52.77 total
    Non-root cgroup before changes:
    make -j2 507.14 user 62.66s system 193% cpu 4:54.74 total
    Root cgroup after changes (0 thresholds):
    make -j2 507.13 user 62.20s system 193% cpu 4:53.55 total
    Non-root cgroup after changes (0 thresholds):
    make -j2 507.70 user 64.20s system 193% cpu 4:55.70 total
    Root cgroup after changes (1 threshold, never crossed):
    make -j2 506.97 user 62.20s system 193% cpu 4:53.90 total
    Non-root cgroup after changes (1 threshold, never crossed):
    make -j2 507.55 user 64.08s system 193% cpu 4:55.63 total

    This patch:

    Introduce the write-only file "cgroup.event_control" in every cgroup.

    To register a new notification handler you need to:
    - create an eventfd;
    - open a control file to be monitored. Callbacks register_event() and
    unregister_event() must be defined for the control file;
    - write "<event_fd> <control fd> <args>" to cgroup.event_control.
    Interpretation of args is defined by the control file implementation.

    The eventfd will be woken up by the control file implementation or when
    the cgroup is removed.

    To unregister notification handler just close eventfd.

    If you need notification functionality for a control file you have to
    implement callbacks register_event() and unregister_event() in the
    struct cftype.

    [kamezawa.hiroyu@jp.fujitsu.com: Kconfig fix]
    Signed-off-by: Kirill A. Shutemov
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: Paul Menage
    Cc: Li Zefan
    Cc: Balbir Singh
    Cc: Pavel Emelyanov
    Cc: Dan Malek
    Cc: Vladislav Buzov
    Cc: Daisuke Nishimura
    Cc: Alexander Shishkin
    Cc: Davide Libenzi
    Signed-off-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Try to reduce the overhead of moving swap charges by:

    - Adding a new function (__mem_cgroup_put), which takes "count" as an arg
    and decrements mem->refcnt by "count".
    - Removing res_counter_uncharge, css_put, and mem_cgroup_put from the
    path of moving a swap account, and consolidating all of them into
    mem_cgroup_clear_mc. We cannot do that for mc.to->refcnt.

    These changes reduce the overhead from 1.35sec to 0.9sec for moving the
    charges of 1G of anonymous memory (including 500MB of swap) in my test
    environment.

    Signed-off-by: Daisuke Nishimura
    Cc: Balbir Singh
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Li Zefan
    Cc: Paul Menage
    Cc: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daisuke Nishimura
     
  • This patch is another core part of the move-charge-at-task-migration
    feature. It enables moving the charges of anonymous swap.

    To move the charge of a swap entry, we need to exchange swap_cgroup's
    record.

    In the current implementation, swap_cgroup's record is protected by:

    - the page lock: if the entry is on the swap cache.
    - swap_lock: if the entry is not on the swap cache.

    This works well for usual swap-in/out activity, but it forces the
    moving-swap-charge feature to check many conditions in order to exchange
    swap_cgroup's record safely.

    So I changed the modification of swap_cgroup's record
    (swap_cgroup_record()) to use xchg, and defined a new function to
    cmpxchg swap_cgroup's record.

    This patch also enables moving the charge of swap caches which are not
    pte_present but not yet uncharged (these can exist on the swap-out path),
    by getting the target
    pages via find_get_page() as do_mincore() does.
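
    The record exchange described above amounts to an atomic
    compare-and-exchange of the owning cgroup's id. A self-contained model of
    the idea (not the actual mm/page_cgroup.c code):

    #include <stdatomic.h>
    #include <stdbool.h>

    /* stands in for one swap_cgroup record (the owner's id) */
    static _Atomic unsigned short record;

    /* move the charge only if the record still belongs to `from` */
    bool move_swap_charge(unsigned short from, unsigned short to)
    {
        unsigned short expected = from;
        return atomic_compare_exchange_strong(&record, &expected, to);
    }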

    [kosaki.motohiro@jp.fujitsu.com: fix ia64 build]
    [akpm@linux-foundation.org: fix typos]
    Signed-off-by: Daisuke Nishimura
    Cc: Balbir Singh
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Li Zefan
    Cc: Paul Menage
    Cc: Daisuke Nishimura
    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daisuke Nishimura
     
  • The move-charge-at-task-migration feature creates extra charges on
    "to" (pre-charges) and "from" (left-over charges) while moving a charge,
    which means an unnecessary oom can happen.

    This patch tries to avoid such an oom.

    Signed-off-by: Daisuke Nishimura
    Cc: Balbir Singh
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Li Zefan
    Cc: Paul Menage
    Cc: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daisuke Nishimura
     
  • Try to reduce the overhead of moving charges by:

    - Instead of calling res_counter_uncharge() against the old cgroup in
    __mem_cgroup_move_account() every time, calling res_counter_uncharge()
    once at the end of task migration.
    - Removing css_get(&to->css) from __mem_cgroup_move_account() because
    callers should have already called css_get(); also removing
    css_put(&to->css), which was called by callers of move_account on
    success of move_account.
    - Instead of calling __mem_cgroup_try_charge(), i.e.
    res_counter_charge(), repeatedly, calling
    res_counter_charge(PAGE_SIZE * count) in can_attach() if possible.
    - Instead of calling css_get()/css_put() repeatedly, making use of the
    coalesced __css_get()/__css_put() if possible.

    These changes reduce the overhead from 1.7sec to 0.6sec for moving the
    charges of 1G of anonymous memory in my test environment.

    Signed-off-by: Daisuke Nishimura
    Cc: Balbir Singh
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Li Zefan
    Cc: Paul Menage
    Cc: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daisuke Nishimura
     
  • This patch is the core part of the move-charge-at-task-migration
    feature. It implements functions to move the charges of anonymous pages
    mapped only by the target task.

    Implementation:
    - define struct move_charge_struct and a variable of it (mc) to remember
    the count of pre-charges and other information.
    - at can_attach(), get the anon_rss of the target mm, call
    __mem_cgroup_try_charge() repeatedly, and count up mc.precharge.
    - at attach(), parse the page table, find a target page to be moved, and
    call mem_cgroup_move_account() for the page.
    - cancel all precharges if mc.precharge > 0 on failure or at the end of
    the task move.

    [akpm@linux-foundation.org: a little simplification]
    Signed-off-by: Daisuke Nishimura
    Cc: Balbir Singh
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Li Zefan
    Cc: Paul Menage
    Cc: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daisuke Nishimura
     
  • In the current memcg, charges associated with a task aren't moved to the
    new cgroup at task migration. Some users find this behavior strange.
    These patches implement this feature, that is, charging to the new
    cgroup and, of course, uncharging from the old cgroup at task migration.

    This patch adds a "memory.move_charge_at_immigrate" file, which is a
    flag file that determines whether charges should be moved to the new
    cgroup at task migration and what types of charges should be moved. This
    patch also adds the read and write handlers for the file.

    This patch also adds no-op handlers for this feature. These handlers
    will be implemented in later patches, so you cannot yet write any value
    other than 0 to move_charge_at_immigrate.

    Signed-off-by: Daisuke Nishimura
    Cc: Balbir Singh
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Li Zefan
    Cc: Paul Menage
    Cc: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daisuke Nishimura
     
  • Don't call get_pid_ns() before we locate/alloc the ns.

    Signed-off-by: Li Zefan
    Cc: Serge Hallyn
    Acked-by: Paul Menage
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li Zefan
     
  • Modify the Block I/O cgroup subsystem to be able to be built as a module.
    As the CFQ disk scheduler optionally depends on blk-cgroup, config options
    in block/Kconfig, block/Kconfig.iosched, and block/blk-cgroup.h are
    enhanced to support the new module dependency.

    Signed-off-by: Ben Blum
    Cc: Li Zefan
    Cc: Paul Menage
    Cc: "David S. Miller"
    Cc: KAMEZAWA Hiroyuki
    Cc: Lai Jiangshan
    Cc: Vivek Goyal
    Cc: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ben Blum
     
  • Add a forgotten item into CONTENTS.

    Signed-off-by: Kirill A. Shutemov
    Acked-by: Paul Menage
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Provides support for unloading modular subsystems.

    This patch adds a new function, cgroup_unload_subsys(), to be used for
    removing a loaded subsystem during module deletion. Reference counting
    of a subsystem's module is moved from being taken once at load time to
    once per attached hierarchy (in parse_cgroupfs_options and
    rebind_subsystems), i.e., a count of 0 or 1.

    Signed-off-by: Ben Blum
    Acked-by: Li Zefan
    Cc: Paul Menage
    Cc: "David S. Miller"
    Cc: KAMEZAWA Hiroyuki
    Cc: Lai Jiangshan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ben Blum
     
  • Add an interface between cgroup subsystem management and module loading.

    This patch implements rudimentary module-loading support for cgroups -
    namely, a cgroup_load_subsys() (similar to cgroup_init_subsys()) for use
    as a module initcall, and a struct module pointer in struct
    cgroup_subsys.

    Several functions that might be wanted by modules have had EXPORT_SYMBOL
    added to them, but it's unclear exactly which functions want it and
    which don't.

    Signed-off-by: Ben Blum
    Acked-by: Li Zefan
    Cc: Paul Menage
    Cc: "David S. Miller"
    Cc: KAMEZAWA Hiroyuki
    Cc: Lai Jiangshan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ben Blum
     
  • This patch series provides the ability for cgroup subsystems to be
    compiled as modules both within and outside the kernel tree. This is
    mainly useful for classifiers and subsystems that hook into components
    that are already modules. cls_cgroup and blkio-cgroup serve as the
    example use cases for this feature.

    It provides an interface cgroup_load_subsys() and cgroup_unload_subsys()
    which modular subsystems can use to register and depart during runtime.
    The net_cls classifier subsystem serves as the example for a subsystem
    which can be converted into a module using these changes.

    Patch #1 sets up the subsys[] array so its contents can be dynamic as
    modules appear and (eventually) disappear. Iterations over the array are
    modified to handle when subsystems are absent, and the dynamic section of
    the array is protected by cgroup_mutex.

    Patch #2 implements an interface for modules to load subsystems, called
    cgroup_load_subsys, similar to cgroup_init_subsys, and adds a module
    pointer in struct cgroup_subsys.

    Patch #3 adds a mechanism for unloading modular subsystems, which includes
    a more advanced rework of the rudimentary reference counting introduced in
    patch 2.

    Patch #4 modifies the net_cls subsystem, which already had some module
    declarations, to be configurable as a module, which also serves as a
    simple proof-of-concept.

    Part of implementing patches 2 and 4 involved updating css pointers in
    each css_set when the module appears or leaves. In doing this, it was
    discovered that css_sets always remain linked to the dummy cgroup,
    regardless of whether or not any subsystems are actually bound to it
    (i.e., not mounted on an actual hierarchy). The subsystem loading and
    unloading code therefore should keep in mind the special cases where the
    added subsystem is the only one in the dummy cgroup (and therefore all
    css_sets need to be linked back into it) and where the removed subsys was
    the only one in the dummy cgroup (and therefore all css_sets should be
    unlinked from it) - however, as all css_sets always stay attached to the
    dummy cgroup anyway, these cases are ignored. Any fix that addresses this
    issue should also make sure these cases are addressed in the subsystem
    loading and unloading code.

    This patch:

    Make subsys[] able to be dynamically populated to support modular
    subsystems

    This patch reworks the way the subsys[] array is used so that subsystems
    can register themselves after boot time, and enables the internals of
    cgroups to be able to handle when subsystems are not present or may
    appear/disappear.
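
    A self-contained model of the registration scheme (illustrative; the
    kernel versions take a struct cgroup_subsys, hold cgroup_mutex, and do
    much more bookkeeping):

    #include <stddef.h>

    #define SUBSYS_SLOTS 64

    struct subsys { const char *name; };

    /* built-in entries occupy the front; module slots are dynamic */
    static struct subsys *subsys[SUBSYS_SLOTS];

    /* like cgroup_load_subsys(): claim the first free slot */
    int load_subsys(struct subsys *ss)
    {
        size_t i;

        for (i = 0; i < SUBSYS_SLOTS; i++) {
            if (!subsys[i]) {           /* absent entries are NULL */
                subsys[i] = ss;
                return 0;
            }
        }
        return -1;                      /* no space left */
    }

    /* like cgroup_unload_subsys(): clear the slot on module removal */
    void unload_subsys(struct subsys *ss)
    {
        size_t i;

        for (i = 0; i < SUBSYS_SLOTS; i++)
            if (subsys[i] == ss)
                subsys[i] = NULL;
    }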

    Signed-off-by: Ben Blum
    Acked-by: Li Zefan
    Cc: Paul Menage
    Cc: "David S. Miller"
    Cc: KAMEZAWA Hiroyuki
    Cc: Lai Jiangshan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ben Blum
     
  • Current css_get() and css_put() increment/decrement css->refcnt one by
    one.

    This patch adds a new function, __css_get(), which takes "count" as an
    arg and increments css->refcnt by "count". It also adds a "count" arg to
    __css_put() and changes the function to decrement css->refcnt by
    "count".

    These coalesced versions of __css_get()/__css_put() will be used later
    to improve the performance of memcg's moving-charge feature, replacing
    repeated calls to css_get()/css_put().

    No change is needed for current users of css_get()/css_put().
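
    In effect (a model, not the kernel code):

    #include <stdatomic.h>

    struct css { atomic_int refcnt; };

    /* one atomic op instead of `count` separate get/put calls */
    static inline void __css_get(struct css *css, int count)
    {
        atomic_fetch_add(&css->refcnt, count);
    }

    static inline void __css_put(struct css *css, int count)
    {
        atomic_fetch_sub(&css->refcnt, count);
    }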

    Signed-off-by: Daisuke Nishimura
    Acked-by: Paul Menage
    Cc: Balbir Singh
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Li Zefan
    Cc: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daisuke Nishimura
     
  • Add a cancel_attach() operation to struct cgroup_subsys. cancel_attach()
    can be used when the can_attach() operation prepares something for the
    subsys but attaching the task fails after can_attach() has succeeded; it
    rolls back what can_attach() prepared.
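
    A self-contained model of the rollback contract (illustrative types; the
    kernel's callback signatures differ):

    #include <stdbool.h>
    #include <stddef.h>

    struct subsys {
        bool (*can_attach)(void *task);     /* may prepare resources */
        void (*cancel_attach)(void *task);  /* new: undo the preparation */
        void (*attach)(void *task);
    };

    bool attach_task(struct subsys *ss, size_t n, void *task)
    {
        size_t i, failed;

        for (failed = 0; failed < n; failed++)
            if (ss[failed].can_attach && !ss[failed].can_attach(task))
                goto rollback;
        for (i = 0; i < n; i++)
            if (ss[i].attach)
                ss[i].attach(task);
        return true;
    rollback:
        /* undo whatever the earlier, successful can_attach() calls prepared */
        for (i = 0; i < failed; i++)
            if (ss[i].cancel_attach)
                ss[i].cancel_attach(task);
        return false;
    }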

    Signed-off-by: Daisuke Nishimura
    Acked-by: Li Zefan
    Reviewed-by: Paul Menage
    Cc: Balbir Singh
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daisuke Nishimura
     
  • The Gmail web GUI no longer works for sending patches, even with the
    Firefox "view source with" extension: it uses Windows-style line breaks
    and wraps lines automatically when sending email.

    Rewrite the Gmail web GUI part of the email client documentation.

    Signed-off-by: Dave Young
    Cc: Randy Dunlap
    Cc: Martin Bligh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Young
     
  • scripts/kernel-doc mishandles a function that has a multi-line function
    short description and no function parameters. The observed problem was
    from drivers/scsi/scsi_netlink.c:

    /**
    * scsi_netlink_init - Called by SCSI subsystem to intialize
    * the SCSI transport netlink interface
    *
    **/

    kernel-doc treated the " * " line as a Description: section with only a
    newline character in the Description contents. This caused
    output_highlight() to complain: "output_highlight got called with no
    args?", plus produce a perl call stack backtrace.

    The fix is just to ignore Description sections if they only contain "\n".

    Signed-off-by: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Randy Dunlap
     
  • x86/Voyager support was removed a year ago.

    Signed-off-by: Bartlomiej Zolnierkiewicz
    Inspired-by: Jonathan Corbet
    Cc: James Bottomley
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Bartlomiej Zolnierkiewicz
     
  • Add header file requirements. Compliments of Stephen Rothwell.

    Stephen calls this Rule #1, so I put it there, but I didn't want to demote
    any of the others in the list, so I made one of them number 2b.

    Signed-off-by: Randy Dunlap
    Cc: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Randy Dunlap
     
  • Documentation/vm/:
    Expose example and tool source files in the Documentation/ directory in
    their own files instead of being buried (almost hidden) in readme/txt files.
    This should help to prevent bitrot.

    This will make them more visible/usable to users who may need
    to use them, to developers who may need to test with them, and
    to anyone who would fix/update them if they were more visible.

    Also, if any of these possibly should not be in the kernel tree at
    all, it will be clearer that they are here and we can discuss if
    they should be removed.

    Also build the recently-added map_hugetlb.c.
    Make several functions static to prevent linker warnings.

    Signed-off-by: Randy Dunlap
    Acked-by: Eric B Munson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Randy Dunlap
     
  • Make dnotify_test.c a proper source file and add it to the Makefile so
    that bitrot can be prevented.

    Signed-off-by: Randy Dunlap
    Cc: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Randy Dunlap