24 Sep, 2009

40 commits

  • There are two useless lines in fs/char_dev.c.

    In register_chrdev() there is a loop that changes every '/' into '!' in
    the kernel object name.
    This code is useless, as the same substitution is already done in
    kobject_set_name_vargs() in lib/kobject.c:
    /* ewww... some of these buggers have '/' in the name ... */
    while ((s = strchr(kobj->name, '/')))
            s[0] = '!';

    kobject_set_name_vargs is called by kobject_set_name.
    kobject_set_name is called just above the useless loop.

    [hidave.darkstar@gmail.com: fix warning, remove the unused char *s]
    Signed-off-by: Renzo Davoli
    Cc: Al Viro
    Signed-off-by: Dave Young
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Renzo Davoli
     
  • In read_zero, we check for access_ok() once for the count bytes. It is
    unnecessarily checked again in clear_user. Use __clear_user, which does
    not check for access_ok().
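
    A simplified (unchunked) sketch of the pattern, assuming the
    access_ok()/__clear_user() semantics of this era; __clear_user()
    returns the number of bytes it could not zero:

    static ssize_t read_zero(struct file *file, char __user *buf,
                             size_t count, loff_t *ppos)
    {
            size_t left;

            if (!access_ok(VERIFY_WRITE, buf, count))
                    return -EFAULT;          /* validate the range once */

            left = __clear_user(buf, count); /* no access_ok() recheck */
            if (left == count)
                    return -EFAULT;          /* nothing was written */
            return count - left;
    }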

    Signed-off-by: Nikanth Karthikesan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nikanth Karthikesan
     
    There is a common macro now for testing mixed pointer/errno values, so use
    that rather than handling the casts ourselves.
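
    A hedged illustration of the idiom; the macro in question is presumably
    IS_ERR_VALUE() from linux/err.h, and get_addr_or_errno() is a
    hypothetical stand-in for a call that returns either a valid address or
    a negative errno packed into an unsigned long:

    unsigned long addr = get_addr_or_errno();

    if (IS_ERR_VALUE(addr))
            return (int)addr;       /* the value is really a -errno */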

    Signed-off-by: Mike Frysinger
    Acked-by: David McCullough
    Acked-by: Greg Ungerer
    Cc: David Howells
    Cc: Paul Mundt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Frysinger
     
  • Ignore the loader's PT_GNU_STACK when calculating the stack size, and only
    consider the executable's PT_GNU_STACK, assuming the executable has one.

    Currently the behaviour is to take the largest stack size and use that,
    but that means you can't reduce the stack size in the executable. The
    loader's stack size should probably only be used when executing the loader
    directly.

    WARNING: This patch is slightly dangerous - it may render a system
    inoperable if the loader's stack size is larger than that of important
    executables, and the system relies unknowingly on this increasing the size
    of the stack.
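
    A generic, hedged sketch of the scan described above (identifiers
    illustrative, not the verbatim patch):

    unsigned long stack_size = 0;
    int i;

    /* Size the stack from the executable's own PT_GNU_STACK only;
     * do not repeat this scan over the interpreter's headers. */
    for (i = 0; i < exec_ehdr->e_phnum; i++)
            if (exec_phdrs[i].p_type == PT_GNU_STACK)
                    stack_size = exec_phdrs[i].p_memsz;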

    Signed-off-by: David Howells
    Signed-off-by: Mike Frysinger
    Acked-by: Paul Mundt
    Cc: Pavel Machek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Howells
     
    Introduce a helper function, elf_note_info_init(), to do the
    initializations for fill_note_info(), and fix the potential memory leaks.
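
    A hedged sketch of the helper's shape: allocate the note buffers up
    front and unwind on failure so a partial allocation cannot leak
    (trimmed; buffer count and labels illustrative):

    static int elf_note_info_init(struct elf_note_info *info)
    {
            memset(info, 0, sizeof(*info));
            INIT_LIST_HEAD(&info->thread_list);

            info->notes = kmalloc(6 * sizeof(struct memelfnote), GFP_KERNEL);
            if (!info->notes)
                    return 0;
            info->psinfo = kmalloc(sizeof(*info->psinfo), GFP_KERNEL);
            if (!info->psinfo)
                    goto notes_free;
            info->prstatus = kmalloc(sizeof(*info->prstatus), GFP_KERNEL);
            if (!info->prstatus)
                    goto psinfo_free;
            return 1;

    psinfo_free:
            kfree(info->psinfo);
    notes_free:
            kfree(info->notes);
            return 0;
    }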

    [akpm@linux-foundation.org: remove NUM_NOTES]
    Signed-off-by: WANG Cong
    Cc: Alexander Viro
    Cc: David Howells
    Cc: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Amerigo Wang
     
  • __fatal_signal_pending inlines to one instruction on x86, probably two
    instructions on other machines. It takes two longer x86 instructions just
    to call it and test its return value, not to mention the function itself.

    On my random x86_64 config, this saved 70 bytes of text (59 of those being
    __fatal_signal_pending itself).
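
    For reference, the helper is tiny; its definition in this era is
    essentially a single sigismember() test:

    static inline int __fatal_signal_pending(struct task_struct *p)
    {
            return unlikely(sigismember(&p->pending.signal, SIGKILL));
    }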

    Signed-off-by: Roland McGrath
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roland McGrath
     
    In order to direct the SIGIO signal to a particular thread of a
    multi-threaded application we cannot, as the manpage suggests, put a
    TID into the regular fcntl(F_SETOWN) call; the signal will still be
    sent to the whole process of which that thread is a part.

    Since people do want to properly direct SIGIO we introduce F_SETOWN_EX.

    The need to direct SIGIO comes from self-monitoring profiling such as with
    perf-counters. Perf-counters uses SIGIO to notify that new sample data is
    available. If the signal is delivered to the same task that generated the
    new sample it can augment that data by inspecting the task's user-space
    state right after it returns from the kernel. This is especially
    convenient for interpreted or virtual-machine-driven environments.

    Both F_SETOWN_EX and F_GETOWN_EX take a pointer to a struct f_owner_ex
    as argument:

    struct f_owner_ex {
            int   type;
            pid_t pid;
    };

    Where type is one of F_OWNER_TID, F_OWNER_PID or F_OWNER_GID.
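
    A hedged usage sketch (F_SETOWN_EX and F_OWNER_TID may not be in libc
    headers yet at this point, and glibc has no gettid() wrapper, hence
    syscall(); sigio_to_this_thread is an illustrative name):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    /* Deliver SIGIO for fd to the calling thread only. */
    static int sigio_to_this_thread(int fd)
    {
            struct f_owner_ex owner = {
                    .type = F_OWNER_TID,
                    .pid  = syscall(SYS_gettid),
            };

            return fcntl(fd, F_SETOWN_EX, &owner);
    }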

    Signed-off-by: Peter Zijlstra
    Reviewed-by: Oleg Nesterov
    Tested-by: stephane eranian
    Cc: Michael Kerrisk
    Cc: Roland McGrath
    Cc: Al Viro
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     
  • group_send_sig_info()->check_kill_permission() assumes that current is the
    sender and uses current_cred().

    This is not true in the send_sigio_to_task() case. From the security
    pov the sender is not current but the task which did fcntl(F_SETOWN);
    that is why we have sigio_perm(), which uses the right creds to check.

    Fortunately, send_sigio() always sends either a SEND_SIG_PRIV or an
    SI_FROMKERNEL() signal, so check_kill_permission() does nothing. Still,
    it would be tidier to avoid this bogus security check and save a couple
    of cycles.

    Signed-off-by: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: stephane eranian
    Cc: Ingo Molnar
    Cc: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Introduce do_send_sig_info() and convert group_send_sig_info(),
    send_sig_info(), do_send_specific() to use this helper.

    Hopefully it will gain more users soon; it allows callers to specify
    specific/group behaviour via a "bool group" argument.

    Shaves 80 bytes from .text.
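
    A hedged sketch of the helper's shape and how converted callers might
    use it (permission checks and locking elided):

    int do_send_sig_info(int sig, struct siginfo *info,
                         struct task_struct *p, bool group);

    int send_sig_info(int sig, struct siginfo *info, struct task_struct *p)
    {
            return do_send_sig_info(sig, info, p, false);  /* one thread */
    }

    int group_send_sig_info(int sig, struct siginfo *info,
                            struct task_struct *p)
    {
            return do_send_sig_info(sig, info, p, true);   /* whole group */
    }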

    Signed-off-by: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: stephane eranian
    Cc: Ingo Molnar
    Cc: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • sys_delete_module() can set MODULE_STATE_GOING after
    search_binary_handler() does try_module_get(). In this case
    set_binfmt()->try_module_get() fails but since none of the callers
    check the returned error, the task will run with the wrong old
    ->binfmt.

    The proper fix should change all ->load_binary() methods, but we can
    rely on the fact that the caller must hold a reference to
    binfmt->module and use __module_get(), which never fails.
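
    A hedged sketch of the resulting set_binfmt(): since the caller is
    guaranteed to hold a module reference, the failable try_module_get()
    can become __module_get():

    void set_binfmt(struct linux_binfmt *new)
    {
            struct mm_struct *mm = current->mm;

            if (mm->binfmt)
                    module_put(mm->binfmt->module);

            mm->binfmt = new;
            if (new)
                    __module_get(new->module);
    }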

    Signed-off-by: Oleg Nesterov
    Acked-by: Rusty Russell
    Cc: Hiroshi Shimamoto
    Cc: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Allow core_pattern pipes to wait for user space to complete

    One of the things that user space processes like to do is look at the
    metadata for a crashing process in its /proc/<pid> directory. This is
    racy, however, since do_coredump in the kernel doesn't wait for the
    user space process to complete before it reaps the crashing process.
    This patch corrects that, allowing the kernel to wait for the user
    space process to complete before cleaning up the crashing process.
    This is a bit tricky to do for a few reasons:

    1) The user space process isn't our child, so we can't sys_wait4 on it
    2) We need to close the pipe before waiting for the user process to complete,
    since the user process may rely on an EOF condition

    I've discussed several solutions with Oleg Nesterov off-list about this,
    and this is the one we've come up with. We add ourselves as a pipe reader
    (to prevent premature cleanup of the pipe_inode_info), and remove
    ourselves as a writer (to provide an EOF condition to the writer in user
    space), then we iterate until the user space process exits (which we
    detect by pipe->readers == 1, hence the > 1 check in the loop). When we
    exit the loop, we restore the proper reader/writer values, then we return
    and let filp_close in do_coredump clean up the pipe data properly.
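
    A hedged sketch of that loop, based on the pipe internals of this era
    (not the verbatim patch):

    pipe_lock(pipe);
    pipe->readers++;        /* keep pipe_inode_info alive */
    pipe->writers--;        /* give the user space helper its EOF */

    while (pipe->readers > 1 && !signal_pending(current)) {
            wake_up_interruptible_sync(&pipe->wait);
            kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN);
            pipe_wait(pipe);        /* sleep until the helper exits */
    }

    pipe->readers--;        /* restore the counts for filp_close() */
    pipe->writers++;
    pipe_unlock(pipe);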

    Signed-off-by: Neil Horman
    Reported-by: Earl Chew
    Cc: Oleg Nesterov
    Cc: Andi Kleen
    Cc: Alan Cox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Neil Horman
     
  • Introduce core pipe limiting sysctl.

    Since we can dump cores to a pipe, rather than directly to the
    filesystem, we create a condition in which a user can generate a very
    high load on the system simply by running bad applications.

    If the pipe reader specified in core_pattern is poorly written, we can
    have lots of outstanding resources and processes in the system.

    This sysctl introduces an ability to limit that resource consumption.
    core_pipe_limit defines how many in-flight dumps may be run in
    parallel; dumps beyond this value are skipped, and a note is made in
    the kernel log. A special value of 0 in core_pipe_limit denotes that an
    unlimited number of core dumps may be handled (this is the default).
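
    A hedged sketch of the admission check in do_coredump() (counter and
    label names illustrative):

    static atomic_t core_dump_count = ATOMIC_INIT(0);

    if (ispipe) {
            dump_count = atomic_inc_return(&core_dump_count);
            if (core_pipe_limit && dump_count > core_pipe_limit) {
                    printk(KERN_WARNING "Pid %d(%s) over core_pipe_limit\n",
                           task_tgid_vnr(current), current->comm);
                    goto fail_dropcount;    /* skip this dump, note it */
            }
            /* ... spawn the helper and write the dump ... */
    }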

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Neil Horman
    Reported-by: Earl Chew
    Cc: Oleg Nesterov
    Cc: Andi Kleen
    Cc: Alan Cox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Neil Horman
     
  • Change how we detect recursive dumps.

    Currently we have a mechanism by which we try to compare pathnames of the
    crashing process to the core_pattern path. This is broken for a dozen
    reasons, and just doesn't work in any sort of robust way.

    I'm replacing it with the use of a 0 RLIMIT_CORE value. Since helper
    apps set RLIMIT_CORE to zero, we don't write out core files for any
    process with that particular limit set. If the core_pattern is a pipe,
    any non-zero limit is translated to RLIM_INFINITY.

    This allows complete dumps to be captured, but prevents infinite recursion
    in the event that the core_pattern process itself crashes.
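
    A hedged sketch of the rule as described:

    if (ispipe) {
            if (core_limit == 0)
                    goto fail_unlock;       /* helper crashed: no recursion */
            core_limit = RLIM_INFINITY;     /* pipe dumps are complete */
    }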

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Neil Horman
    Reported-by: Earl Chew
    Cc: Oleg Nesterov
    Cc: Andi Kleen
    Cc: Alan Cox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Neil Horman
     
  • This changes tracehook_notify_jctl() so it's called with the siglock held,
    and changes its argument and return value definition. These clean-ups
    make it a better fit for what new tracing hooks need to check.

    Tracing needs the siglock here, held from the time TASK_STOPPED was set,
    to avoid potential SIGCONT races if it wants to allow any blocking in its
    tracing hooks.

    This also folds the finish_stop() function into its caller,
    do_signal_stop(). The function is short, called only once, and called
    unconditionally; folding it in aids readability.

    [oleg@redhat.com: do not call tracehook_notify_jctl() in TASK_STOPPED state]
    [oleg@redhat.com: introduce tracehook_finish_jctl() helper]
    Signed-off-by: Roland McGrath
    Signed-off-by: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roland McGrath
     
    The current behaviour of sys_waitid() looks odd. If the user passes
    infop == NULL, sys_waitid() returns success. When the user additionally
    specifies the WNOWAIT flag, sys_waitid() returns -EFAULT under the same
    conditions. When the user combines WNOWAIT with WCONTINUED, sys_waitid()
    again returns success.

    This patch adds a check for ->wo_info in wait_noreap_copyout().

    User-visible change: starting from this commit, sys_waitid() always checks
    infop != NULL and does not fail if it is NULL.

    Signed-off-by: Vitaly Mayatskikh
    Reviewed-by: Oleg Nesterov
    Cc: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vitaly Mayatskikh
     
    do_wait() checks ->wo_info to figure out who the caller is. If it's
    not NULL, the caller should be sys_waitid(); in that case do_wait()
    fixes up the retval or zeroes ->wo_info, depending on the retval from
    the underlying function.

    This is a bug: the user can pass infop == NULL, and sys_waitid() will
    return an incorrect value.

    man 2 waitid says:

    waitid(): returns 0 on success

    Test-case:

    #include <assert.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void)
    {
            if (fork())
                    assert(waitid(P_ALL, 0, NULL, WEXITED) == 0);

            return 0;
    }

    Result:

    Assertion `waitid(P_ALL, 0, ((void *)0), 4) == 0' failed.

    Move that code to sys_waitid().

    User-visible change: sys_waitid() will return 0 on success, whether
    infop is set or not.

    Note, there's another bug in wait_noreap_copyout() which affects the
    return value of sys_waitid(). It will be fixed in the next patch.

    Signed-off-by: Vitaly Mayatskikh
    Reviewed-by: Oleg Nesterov
    Cc: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vitaly Mayatskikh
     
    Kill the "parent" argument in wait_consider_task(); it was never used.

    Signed-off-by: Oleg Nesterov
    Cc: Roland McGrath
    Cc: Ingo Molnar
    Cc: Ratan Nalumasu
    Cc: Vitaly Mayatskikh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
    task_pid_type() is only used by eligible_pid(), which has to check
    wo_type != PIDTYPE_MAX anyway. Remove this check from task_pid_type()
    and factor out the ->pids[type] access; this shrinks .text a bit and
    simplifies the code.

    This matches the behaviour of other similar helpers, say
    get_task_pid(): the caller must ensure that pid_type is valid, not the
    callee.

    Signed-off-by: Oleg Nesterov
    Cc: Roland McGrath
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
    child_wait_callback()->eligible_child() is not right: we can miss the
    wakeup if the task was detached before __wake_up_parent() and the
    caller of do_wait() didn't use __WALL.

    Move ->wo_pid checks from eligible_child() to the new helper,
    eligible_pid(), and change child_wait_callback() to use it instead of
    eligible_child().

    Note: actually I think it would be better to fix the __WCLONE check in
    eligible_child(); it doesn't look exactly right. But it is not clear
    what the supposed behaviour is, and any change is user-visible.

    Reported-by: KAMEZAWA Hiroyuki
    Tested-by: KAMEZAWA Hiroyuki
    Signed-off-by: Oleg Nesterov
    Cc: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Suggested by Roland.

    do_wait(__WNOTHREAD) can only succeed if the caller is either the
    ptracer, or it is ->real_parent and the child is not traced. IOW,
    caller == p->parent; otherwise we should not wake up.

    Change child_wait_callback() to check this. Ratan reports a workload
    with >99% CPU load caused by unnecessary wakeups; this patch should
    fix it.

    Signed-off-by: Oleg Nesterov
    Acked-by: Roland McGrath
    Cc: Ingo Molnar
    Cc: Ratan Nalumasu
    Cc: Vitaly Mayatskikh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
    Ratan Nalumasu reported that a process with many threads does
    unnecessary wakeups: every waiting thread in the process wakes up to
    loop through the children and see that the only ones it cares about
    are still not ready.

    Now that we have struct wait_opts we can change do_wait/__wake_up_parent
    to use filtered wakeups.

    We can make child_wait_callback() more clever later, right now it only
    checks eligible_child().
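
    A hedged sketch of the filtered wakeup, close to but not verbatim the
    patch: the waker passes the child as the wake key, and the callback
    keeps uninterested waiters asleep:

    void __wake_up_parent(struct task_struct *p, struct task_struct *parent)
    {
            __wake_up_sync_key(&parent->signal->wait_chldexit,
                               TASK_INTERRUPTIBLE, 1, p);
    }

    static int child_wait_callback(wait_queue_t *wait, unsigned mode,
                                   int sync, void *key)
    {
            struct wait_opts *wo = container_of(wait, struct wait_opts,
                                                child_wait);
            struct task_struct *p = key;

            if (!eligible_child(wo, p))
                    return 0;       /* not interesting: stay asleep */

            return default_wake_function(wait, mode, sync, key);
    }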

    Signed-off-by: Oleg Nesterov
    Acked-by: Roland McGrath
    Cc: Ingo Molnar
    Cc: Ratan Nalumasu
    Cc: Vitaly Mayatskikh
    Acked-by: James Morris
    Tested-by: Valdis Kletnieks
    Acked-by: David Howells
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Move security_task_wait() from eligible_child() to wait_consider_task()

    Preparation, no functional changes.

    eligible_child() has a single caller, wait_consider_task(). We can move
    security_task_wait() out from eligible_child(), this allows us to use it
    for filtered wake_up().

    Signed-off-by: Oleg Nesterov <oleg@redhat.com>
    Acked-by: Roland McGrath <roland@redhat.com>
    Cc: Ingo Molnar <mingo@elte.hu>
    Cc: Ratan Nalumasu <rnalumasu@gmail.com>
    Cc: Vitaly Mayatskikh <vmayatsk@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

    Oleg Nesterov
     
    The bug is old; it wasn't caused by recent changes.

    Test case:

    #include <assert.h>
    #include <pthread.h>
    #include <signal.h>
    #include <stdio.h>
    #include <sys/ptrace.h>
    #include <sys/wait.h>
    #include <unistd.h>

    static void *tfunc(void *arg)
    {
            int pid = (long)arg;

            assert(ptrace(PTRACE_ATTACH, pid, NULL, NULL) == 0);
            kill(pid, SIGKILL);

            sleep(1);
            return NULL;
    }

    int main(void)
    {
            pthread_t th;
            long pid = fork();

            if (!pid)
                    pause();

            signal(SIGCHLD, SIG_IGN);
            assert(pthread_create(&th, NULL, tfunc, (void *)pid) == 0);

            int r = waitpid(-1, NULL, __WNOTHREAD);
            printf("waitpid: %d %m\n", r);

            return 0;
    }

    Before the patch this program hangs; after the patch waitpid()
    correctly fails with -1 and errno == ECHILD.

    The problem is, __ptrace_detach() reaps the EXIT_ZOMBIE tracee if its
    ->real_parent is our sub-thread and we ignore SIGCHLD. But in this case
    we should wake up other threads which can sleep in do_wait().

    Signed-off-by: Oleg Nesterov
    Cc: Roland McGrath
    Cc: Vitaly Mayatskikh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
    We now count MEM_CGROUP_STAT_SWAPOUT, so we can show swap usage in the
    memory.stat file. This is useful because users no longer need to
    calculate memsw.usage - res.usage to find the swap usage.

    Signed-off-by: Daisuke Nishimura
    Reviewed-by: KAMEZAWA Hiroyuki
    Reviewed-by: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daisuke Nishimura
     
    Reduce the resource counter overhead (mostly spinlock) associated with
    the root cgroup. This is part of a series of patches to reduce mem
    cgroup overhead. I had posted other approaches earlier (including using
    percpu counters); those patches will be a natural addition and will be
    added iteratively on top of these.

    The patch stops resource counter accounting for the root cgroup. The
    data for display is derived from the statistics we maintain via
    mem_cgroup_charge_statistics() (which is more scalable). What happens
    today is that we do double accounting: once using res_counter_charge()
    and once using mem_cgroup_charge_statistics(). For the root, since we
    don't implement limits any more, we don't need to track every charge
    via res_counter_charge(), check for the limit being exceeded, and
    reclaim.

    The main mem->res usage_in_bytes can be derived by summing the cache and
    rss usage data from memory statistics (MEM_CGROUP_STAT_RSS and
    MEM_CGROUP_STAT_CACHE). However, for memsw->res usage_in_bytes, we need
    additional data about swapped out memory. This patch adds a
    MEM_CGROUP_STAT_SWAPOUT and uses that along with MEM_CGROUP_STAT_RSS and
    MEM_CGROUP_STAT_CACHE to derive the memsw data. This data is computed
    recursively when hierarchy is enabled.
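
    A hedged sketch of the derivation (helper name and signature
    illustrative, not the exact patch):

    /* Root usage comes from the per-cgroup statistics, not res_counter. */
    u64 usage = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_CACHE) +
                mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_RSS);

    if (is_memsw)   /* memsw also counts swapped-out pages */
            usage += mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_SWAPOUT);

    usage <<= PAGE_SHIFT;   /* the statistics are in pages; report bytes */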

    The test results I see on a 24-way system show that:

    1. The lock contention disappears from /proc/lock_stats
    2. The results of the test are comparable to running with
    cgroup_disable=memory.

    Here is a sample of my program runs

    Without Patch

    Performance counter stats for '/home/balbir/parallel_pagefault':

    7192804.124144 task-clock-msecs # 23.937 CPUs
    424691 context-switches # 0.000 M/sec
    267 CPU-migrations # 0.000 M/sec
    28498113 page-faults # 0.004 M/sec
    5826093739340 cycles # 809.989 M/sec
    408883496292 instructions # 0.070 IPC
    7057079452 cache-references # 0.981 M/sec
    3036086243 cache-misses # 0.422 M/sec

    300.485365680 seconds time elapsed

    With cgroup_disable=memory

    Performance counter stats for '/home/balbir/parallel_pagefault':

    7182183.546587 task-clock-msecs # 23.915 CPUs
    425458 context-switches # 0.000 M/sec
    203 CPU-migrations # 0.000 M/sec
    92545093 page-faults # 0.013 M/sec
    6034363609986 cycles # 840.185 M/sec
    437204346785 instructions # 0.072 IPC
    6636073192 cache-references # 0.924 M/sec
    2358117732 cache-misses # 0.328 M/sec

    300.320905827 seconds time elapsed

    With this patch applied

    Performance counter stats for '/home/balbir/parallel_pagefault':

    7191619.223977 task-clock-msecs # 23.955 CPUs
    422579 context-switches # 0.000 M/sec
    88 CPU-migrations # 0.000 M/sec
    91946060 page-faults # 0.013 M/sec
    5957054385619 cycles # 828.333 M/sec
    1058117350365 instructions # 0.178 IPC
    9161776218 cache-references # 1.274 M/sec
    1920494280 cache-misses # 0.267 M/sec

    300.218764862 seconds time elapsed

    Data from Prarit (kernel compile with make -j64 on a 64
    CPU/32G machine)

    For a single run

    Without patch

    real 27m8.988s
    user 87m24.916s
    sys 382m6.037s

    With patch

    real 4m18.607s
    user 84m58.943s
    sys 50m52.682s

    With config turned off

    real 4m54.972s
    user 90m13.456s
    sys 50m19.711s

    NOTE: The data looks counterintuitive due to the increased performance
    with the patch, even over the config being turned off. We probably need
    more runs, but so far all testing has shown that the patches definitely
    help.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Balbir Singh
    Cc: Prarit Bhargava
    Cc: Andi Kleen
    Reviewed-by: KAMEZAWA Hiroyuki
    Reviewed-by: Daisuke Nishimura
    Cc: KOSAKI Motohiro
    Cc: Paul Menage
    Cc: Li Zefan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Balbir Singh
     
  • Implement reclaim from groups over their soft limit

    Permit reclaim from memory cgroups on contention (via the direct reclaim
    path).

    Memory cgroup soft limit reclaim finds the group that exceeds its soft
    limit by the largest number of pages, reclaims pages from it, and then
    reinserts the cgroup into its correct place in the rbtree.

    Add additional checks to mem_cgroup_hierarchical_reclaim() to detect
    long loops in case all swap is turned off. The code has been refactored
    and the loop check (loop < 2) has been enhanced for soft limits. For
    soft limits, we try to do more targeted reclaim: instead of bailing out
    after two loops, the routine now reclaims memory proportional to the
    size by which the soft limit is exceeded. The proportion has been
    empirically determined.

    [akpm@linux-foundation.org: build fix]
    [kamezawa.hiroyu@jp.fujitsu.com: fix softlimit css refcnt handling]
    [nishimura@mxp.nes.nec.co.jp: refcount of the "victim" should be decremented before exiting the loop]
    Signed-off-by: Balbir Singh
    Cc: KAMEZAWA Hiroyuki
    Cc: Li Zefan
    Acked-by: KOSAKI Motohiro
    Signed-off-by: KAMEZAWA Hiroyuki
    Signed-off-by: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Balbir Singh
     
  • Refactor mem_cgroup_hierarchical_reclaim()

    Refactor the arguments passed to mem_cgroup_hierarchical_reclaim() into
    flags, so that new parameters don't have to be passed as we make the
    reclaim routine more flexible.

    Signed-off-by: Balbir Singh
    Cc: KAMEZAWA Hiroyuki
    Cc: Li Zefan
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Balbir Singh
     
  • Organize cgroups over their soft limit in an RB-Tree

    Introduce an RB-Tree for storing memory cgroups that are over their soft
    limit. The overall goal is to

    1. Add a memory cgroup to the RB-Tree when its soft limit is exceeded.
    We are careful about updates; they take place only after a particular
    time interval has passed.
    2. Remove the node from the RB-Tree when the usage goes below the soft
    limit.

    The next set of patches will exploit the RB-Tree to get the group that is
    over its soft limit by the largest amount and reclaim from it, when we
    face memory contention.

    [hugh.dickins@tiscali.co.uk: CONFIG_CGROUP_MEM_RES_CTLR=y CONFIG_PREEMPT=y fails to boot]
    Signed-off-by: Balbir Singh
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Li Zefan
    Cc: KOSAKI Motohiro
    Signed-off-by: Hugh Dickins
    Cc: Jiri Slaby
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Balbir Singh
     
    Add an interface to allow get/set of soft limits. Soft limits for the
    memory-plus-swap controller (memsw) are currently not supported.
    Resource counters have been enhanced to support soft limits, and a new
    type RES_SOFT_LIMIT has been added. Unlike hard limits, soft limits can
    be directly set and do not need any reclaim or checks before being set
    to a new value.

    Kamezawa-San raised a question as to whether soft limit should belong to
    res_counter. Since all resources understand the basic concepts of hard
    and soft limits, it is justified to add soft limits here. Soft limits
    are a generic resource usage feature; even file system quotas support
    them.

    Signed-off-by: Balbir Singh
    Cc: KAMEZAWA Hiroyuki
    Cc: Li Zefan
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Balbir Singh
     
    Soft limits are a new feature for the memory resource controller;
    something similar has existed in the group scheduler in the form of
    shares. The CPU controller's interpretation of shares is very
    different, though.

    Soft limits are the most useful feature to have for environments where the
    administrator wants to overcommit the system, such that only on memory
    contention do the limits become active. The current soft limits
    implementation provides a soft_limit_in_bytes interface for the memory
    controller and not for memory+swap controller. The implementation
    maintains an RB-Tree of groups that exceed their soft limit and starts
    reclaiming from the group that exceeds this limit by the maximum amount.

    This patch:

    Add documentation for soft limits

    Signed-off-by: Balbir Singh
    Cc: KAMEZAWA Hiroyuki
    Cc: Li Zefan
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Balbir Singh
     
  • Add comments explaining the reason for the smp_wmb() in
    mem_cgroup_commit_charge().

    [akpm@linux-foundation.org: coding-style fixes]
    Cc: Daisuke Nishimura
    Cc: Balbir Singh
    Signed-off-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Change the memory cgroup to remove the overhead associated with accounting
    all pages in the root cgroup. As a side-effect, we can no longer set a
    memory hard limit in the root cgroup.

    A new flag to track whether the page has been accounted or not has
    been added as well. Flags are now set atomically for page_cgroup;
    pcg_default_flags is now obsolete and has been removed.

    [akpm@linux-foundation.org: fix a few documentation glitches]
    Signed-off-by: Balbir Singh
    Signed-off-by: Daisuke Nishimura
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: Daisuke Nishimura
    Cc: Li Zefan
    Cc: Paul Menage
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Balbir Singh
     
  • Alter the ss->can_attach and ss->attach functions to be able to deal with
    a whole threadgroup at a time, for use in cgroup_attach_proc. (This is a
    pre-patch to cgroup-procs-writable.patch.)

    Currently, the new mode of the attach function can only tell the
    subsystem about the old cgroup of the threadgroup leader. No subsystem
    currently needs that information for each thread being moved, but if
    one were added (for example, one that counts tasks within a group),
    this code would need to be reworked to pass the subsystem the right
    information.

    [hidave.darkstar@gmail.com: fix build]
    Signed-off-by: Ben Blum
    Signed-off-by: Paul Menage
    Acked-by: Li Zefan
    Reviewed-by: Matt Helsley
    Cc: "Eric W. Biederman"
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Ingo Molnar
    Cc: Dave Young
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ben Blum
     
  • Changes css_set freeing mechanism to be under RCU

    This is a prepatch for making the procs file writable. In order to free the
    old css_sets for each task to be moved as they're being moved, the freeing
    mechanism must be RCU-protected, or else we would have to have a call to
    synchronize_rcu() for each task before freeing its old css_set.

    Signed-off-by: Ben Blum
    Signed-off-by: Paul Menage
    Cc: "Paul E. McKenney"
    Acked-by: Li Zefan
    Cc: Matt Helsley
    Cc: "Eric W. Biederman"
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ben Blum
     
    Separate all pidlist allocation requests into a common helper that
    decides, based on the requested size, whether the array needs to be
    vmalloc'd or can be obtained via kmalloc, and similarly for kfree
    versus vfree.
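
    A hedged sketch of the helper pair; this is the classic size-switch
    idiom, with names following the description above and the threshold
    illustrative:

    static void *pidlist_allocate(int count)
    {
            if (count * sizeof(pid_t) > PAGE_SIZE)
                    return vmalloc(count * sizeof(pid_t));
            else
                    return kmalloc(count * sizeof(pid_t), GFP_KERNEL);
    }

    static void pidlist_free(void *p)
    {
            if (is_vmalloc_addr(p))
                    vfree(p);
            else
                    kfree(p);
    }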

    Signed-off-by: Ben Blum
    Signed-off-by: Paul Menage
    Acked-by: Li Zefan
    Cc: Matt Helsley
    Cc: "Eric W. Biederman"
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ben Blum
     
    Previously there was a problem in which two processes from different
    pid namespaces reading the tasks or procs file could result in one
    process seeing results from the other's namespace. Rather than one
    pidlist for each file in a cgroup, we now keep a list of pidlists keyed
    by namespace and file type (tasks versus procs), in which entries are
    placed on demand. Each pidlist has its own lock, and because the
    pidlists themselves are passed around in the seq_file's private
    pointer, we don't have to touch the cgroup or its master list except
    when creating and destroying entries.

    Signed-off-by: Ben Blum
    Signed-off-by: Paul Menage
    Cc: Li Zefan
    Cc: Matt Helsley
    Cc: "Eric W. Biederman"
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ben Blum
     
    struct cgroup used to have a bunch of fields for keeping track of the
    pidlist for the tasks file. Those are now separated into a new struct
    cgroup_pidlist, of which there are two: one for procs and one for
    tasks. The seq_file operations are now set up so that just the pidlist
    struct gets passed around as the private data.

    Interface example: Suppose a multithreaded process has pid 1000 and other
    threads with ids 1001, 1002, 1003:
    $ cat tasks
    1000
    1001
    1002
    1003
    $ cat cgroup.procs
    1000
    $

    Signed-off-by: Ben Blum
    Signed-off-by: Paul Menage
    Acked-by: Li Zefan
    Cc: Matt Helsley
    Cc: "Eric W. Biederman"
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ben Blum
     
  • The following series adds a "cgroup.procs" file to each cgroup that
    reports unique tgids rather than pids, and allows all threads in a
    threadgroup to be atomically moved to a new cgroup.

    The subsystem "attach" interface is modified to support attaching whole
    threadgroups at a time, which could introduce potential problems if any
    subsystem were to need to access the old cgroup of every thread being
    moved. The attach interface may need to be revised if this becomes the
    case.

    Also added is functionality for read/write locking of all CLONE_THREAD
    fork()ing within a threadgroup, by means of an rwsem that lives in the
    sighand_struct: it is per-threadgroup and shares a cacheline with the
    sighand's atomic count. This scheme should introduce no extra overhead
    in the fork path when there's no contention.

    The final patch reveals potential for a race when forking before a
    subsystem's attach function is called - one potential solution in case any
    subsystem has this problem is to hang on to the group's fork mutex through
    the attach() calls, though no subsystem yet demonstrates need for an
    extended critical section.

    This patch:

    Revert

    commit 096b7fe012d66ed55e98bc8022405ede0cc80e96
    Author: Li Zefan
    AuthorDate: Wed Jul 29 15:04:04 2009 -0700
    Commit: Linus Torvalds
    CommitDate: Wed Jul 29 19:10:35 2009 -0700

    cgroups: fix pid namespace bug

    This is in preparation for some clashing cgroups changes that subsume
    the original commit's functionality.

    The original commit fixed a pid namespace bug which Ben Blum fixed
    independently (in the same way, but with different code) as part of a
    series of patches. I played around with trying to reconcile Ben's patch
    series with Li's patch, but concluded that it was simpler to just revert
    Li's, given that Ben's patch series contained essentially the same fix.

    Signed-off-by: Paul Menage
    Cc: Li Zefan
    Cc: Matt Helsley
    Cc: "Eric W. Biederman"
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Menage
     
  • This patch removes the restriction that a cgroup hierarchy must have at
    least one bound subsystem. The mount option "none" is treated as an
    explicit request for no bound subsystems.

    A hierarchy with no subsystems can be useful for plain task tracking, and
    is also a step towards the support for multiply-bindable subsystems.

    As part of this change, the hierarchy id is no longer calculated from the
    bitmask of subsystems in the hierarchy (since this is not guaranteed to be
    unique) but is allocated via an ida. Reference counts on cgroups from
    css_set objects are now taken explicitly one per hierarchy, rather than
    one per subsystem.

    Example usage:

    mount -t cgroup -o none,name=foo cgroup /mnt/cgroup

    Based on the "no-op"/"none" subsystem concept proposed by
    kamezawa.hiroyu@jp.fujitsu.com

    Signed-off-by: Paul Menage
    Reviewed-by: Li Zefan
    Cc: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Cc: Dhaval Giani
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Menage
     
  • Currently the cgroups code makes the assumption that the subsystem
    pointers in a struct css_set uniquely identify the hierarchy->cgroup
    mappings associated with the css_set; and there's no way to directly
    identify the associated set of cgroups other than by indirecting through
    the appropriate subsystem state pointers.

    This patch removes the need for that assumption by adding a back-pointer
    from struct cg_cgroup_link object to its associated cgroup; this allows
    the set of cgroups to be determined by traversing the cg_links list in
    the struct css_set.
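
    A hedged sketch of the link object with the added back-pointer (field
    names approximate):

    struct cg_cgroup_link {
            /* Runs through the cg_cgroup_links anchored on cgroup->css_sets */
            struct list_head cgrp_link_list;
            /* The cgroup this link belongs to (the new back-pointer) */
            struct cgroup *cgrp;
            /* Runs through the cg_cgroup_links anchored on css_set->cg_links */
            struct list_head cg_link_list;
            /* The css_set this link belongs to */
            struct css_set *cg;
    };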

    Signed-off-by: Paul Menage
    Reviewed-by: Li Zefan
    Cc: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Cc: Dhaval Giani
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Menage