17 May, 2008

1 commit


02 May, 2008

1 commit


30 Apr, 2008

1 commit

  • Suggested by Roland McGrath.

    Initialize signal->curr_target in copy_signal(). This way ->curr_target is
    never NULL, so we can kill the check in __group_complete_signal()'s hot path.
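    A hedged sketch of the idea (not the exact diff; names follow kernel/fork.c,
    error handling and unrelated initialization omitted):

    static int copy_signal(unsigned long clone_flags, struct task_struct *tsk)
    {
        struct signal_struct *sig;

        sig = kmem_cache_alloc(signal_cachep, GFP_KERNEL);
        if (!sig)
            return -ENOMEM;
        /* The forking task is a valid first wake-up target, so
         * ->curr_target never has to be NULL again. */
        sig->curr_target = tsk;
        /* ... remaining fields initialized as before ... */
        tsk->signal = sig;
        return 0;
    }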

    Signed-off-by: Oleg Nesterov
    Cc: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

29 Apr, 2008

4 commits

  • The kernel implements readlink of /proc/pid/exe by getting the file from
    the first executable VMA. Then the path to the file is reconstructed and
    reported as the result.

    Because of the VMA walk the code is slightly different on nommu systems.
    This patch avoids separate /proc/pid/exe code on nommu systems. Instead of
    walking the VMAs to find the first executable file-backed VMA we store a
    reference to the exec'd file in the mm_struct.

    That reference would prevent the filesystem holding the executable file
    from being unmounted even after unmapping the VMAs. So we track the number
    of VM_EXECUTABLE VMAs and drop the new reference when the last one is
    unmapped. This avoids pinning the mounted filesystem.
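    For reference, the user-visible interface being preserved is just a magic
    symlink; a minimal, self-contained reader (illustrative userspace C, not
    part of the patch):

    #include <stdio.h>
    #include <unistd.h>
    #include <limits.h>

    int main(void)
    {
        char path[PATH_MAX];
        ssize_t n = readlink("/proc/self/exe", path, sizeof(path) - 1);

        if (n < 0) {
            perror("readlink");
            return 1;
        }
        path[n] = '\0';    /* readlink() does not NUL-terminate */
        printf("running from: %s\n", path);
        return 0;
    }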

    [akpm@linux-foundation.org: improve comments]
    [yamamoto@valinux.co.jp: fix dup_mmap]
    Signed-off-by: Matt Helsley
    Cc: Oleg Nesterov
    Cc: David Howells
    Cc:"Eric W. Biederman"
    Cc: Christoph Hellwig
    Cc: Al Viro
    Cc: Hugh Dickins
    Signed-off-by: YAMAMOTO Takashi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matt Helsley
     
  • sys_unshare(CLONE_NEWIPC) doesn't handle the undo lists properly; this can
    cause kernel memory corruption. CLONE_NEWIPC must detach from the existing
    undo lists.

    Fix, part 2: perform an implicit CLONE_SYSVSEM in CLONE_NEWIPC. Since
    CLONE_NEWIPC creates a new IPC namespace, the task cannot access the
    existing semaphore arrays after the unshare syscall; thus it can, and must,
    detach from the existing undo list entries, too.

    This fixes the kernel corruption because it makes it impossible for
    undo records from two different namespaces to end up on the same
    sysvsem.undo_list.
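    A minimal illustration of the user-visible behaviour (hedged demo, not
    from the patch; needs CAP_SYS_ADMIN and a kernel with IPC namespaces):

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>

    int main(void)
    {
        if (unshare(CLONE_NEWIPC) != 0) {
            perror("unshare(CLONE_NEWIPC)");
            return 1;
        }
        /* With this fix, the implicit CLONE_SYSVSEM has also detached
         * us from any SEM_UNDO entries of the old namespace. */
        puts("in a fresh IPC namespace");
        return 0;
    }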

    Signed-off-by: Manfred Spraul
    Signed-off-by: Serge E. Hallyn
    Acked-by: "Eric W. Biederman"
    Cc: Pavel Emelyanov
    Cc: Michael Kerrisk
    Cc: Pierre Peiffer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Manfred Spraul
     
  • sys_unshare(CLONE_NEWIPC) doesn't handle the undo lists properly; this can
    cause kernel memory corruption. CLONE_NEWIPC must detach from the existing
    undo lists.

    Fix, part 1: add support for sys_unshare(CLONE_SYSVSEM)

    The original reason to not support it was the potential (inevitable?)
    confusion due to the fact that sys_unshare(CLONE_SYSVSEM) has the
    inverse meaning of clone(CLONE_SYSVSEM).

    Our two most reasonable options then appear to be (1) fully support
    CLONE_SYSVSEM, or (2) continue to refuse explicit CLONE_SYSVSEM,
    but always do it anyway on unshare(CLONE_SYSVSEM). This patch does
    (1).
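    As a rough usage sketch (hedged, illustrative demo, not part of the patch),
    a task holding a SEM_UNDO adjustment can now detach from its undo list
    explicitly:

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <sys/ipc.h>
    #include <sys/sem.h>

    int main(void)
    {
        int id = semget(IPC_PRIVATE, 1, IPC_CREAT | 0600);
        struct sembuf up = { .sem_num = 0, .sem_op = 1, .sem_flg = SEM_UNDO };

        if (id < 0 || semop(id, &up, 1) != 0) {
            perror("semget/semop");
            return 1;
        }
        /* Previously refused by unshare_semundo(); now drops our
         * undo list entries. */
        if (unshare(CLONE_SYSVSEM) != 0)
            perror("unshare(CLONE_SYSVSEM)");
        semctl(id, 0, IPC_RMID);    /* clean up the test semaphore */
        return 0;
    }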

    Changelog:
    Apr 16: SEH: switch to Manfred's alternative patch, which
    removes the unshare_semundo() function that
    always refused CLONE_SYSVSEM.

    Signed-off-by: Manfred Spraul
    Signed-off-by: Serge E. Hallyn
    Acked-by: "Eric W. Biederman"
    Cc: Pavel Emelyanov
    Cc: Michael Kerrisk
    Cc: Pierre Peiffer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Manfred Spraul
     
  • Remove the mem_cgroup member from mm_struct and instead add an owner.

    This approach was suggested by Paul Menage. Its advantage is that, once
    mm->owner is known, the cgroup can be determined from it using the
    subsystem id. It also allows several control groups that are virtually
    grouped by mm_struct to exist independent of the memory controller, i.e.,
    without adding a mem_cgroup-style pointer to mm_struct for each
    controller.

    A new config option CONFIG_MM_OWNER is added and the memory resource
    controller selects this config option.

    This patch also adds cgroup callbacks to notify subsystems when mm->owner
    changes. The mm_cgroup_changed callback is called with the task_lock() of
    the new task held and is called just prior to changing the mm->owner.

    I am indebted to Paul Menage for the several reviews of this patchset and
    helping me make it lighter and simpler.

    This patch was tested on a powerpc box; it was compiled with the
    MM_OWNER config both turned on and off.

    After the thread group leader exits, it is moved to init_css_set by
    cgroup_exit(); thus all future charges from running threads would be
    redirected to the init_css_set's subsystem.
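    A toy model of the indirection (plain userspace C with invented names,
    only to illustrate the pointer chase the patch enables):

    #include <stdio.h>

    struct cgroup { const char *name; };
    struct task  { struct cgroup *css[2]; };  /* one slot per subsystem id */
    struct mm    { struct task *owner; };     /* replaces mm->mem_cgroup   */

    static struct cgroup *mm_cgroup(struct mm *mm, int subsys_id)
    {
        /* Any controller can be reached via the owner task. */
        return mm->owner->css[subsys_id];
    }

    int main(void)
    {
        struct cgroup mem = { "memctl" };
        struct task leader = { { &mem, NULL } };
        struct mm mm = { &leader };

        printf("mm belongs to %s\n", mm_cgroup(&mm, 0)->name);
        return 0;
    }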

    Signed-off-by: Balbir Singh
    Cc: Pavel Emelianov
    Cc: Hugh Dickins
    Cc: Sudhir Kumar
    Cc: YAMAMOTO Takashi
    Cc: Hirokazu Takahashi
    Cc: David Rientjes
    Cc: Balbir Singh
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Pekka Enberg
    Reviewed-by: Paul Menage
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Balbir Singh
     

28 Apr, 2008

2 commits

  • This patch renames mpol_copy() to mpol_dup() because, well, that's what it
    does. Like, e.g., strdup() for strings, mpol_dup() takes a pointer to an
    existing mempolicy, allocates a new one and copies the contents.

    In a later patch, I want to use the name mpol_copy() to copy the contents from
    one mempolicy to another like, e.g., strcpy() does for strings.
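    The naming analogy in ordinary string terms (illustrative only):

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    int main(void)
    {
        char dst[16];
        char *dup = strdup("mempolicy");    /* like mpol_dup(): allocate + copy */

        strcpy(dst, "mempolicy");           /* like the planned mpol_copy() */
        printf("%s / %s\n", dup, dst);
        free(dup);
        return 0;
    }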

    Signed-off-by: Lee Schermerhorn
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Mel Gorman
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     
  • This is a change that was requested some time ago by Mel Gorman. Makes sense
    to me, so here it is.

    Note: I retain the name "mpol_free_shared_policy()" because it actually does
    free the shared_policy, which is NOT a reference counted object. However, ...

    The mempolicy object[s] referenced by the shared_policy are reference counted,
    so mpol_put() is used to release the reference held by the shared_policy. The
    mempolicy might not be freed at this time, because some task attached to the
    shared object associated with the shared policy may be in the process of
    allocating a page based on the mempolicy. In that case, the task performing
    the allocation will hold a reference on the mempolicy, obtained via
    mpol_shared_policy_lookup(). The mempolicy will be freed when all tasks
    holding such a reference have called mpol_put() for the mempolicy.
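    The put-side convention, as a self-contained sketch (simplified: the real
    kernel uses atomic_t and a slab cache):

    #include <stdlib.h>

    struct mempolicy_sketch {
        int refcnt;    /* one count per shared_policy or allocating task */
    };

    void mpol_put_sketch(struct mempolicy_sketch *pol)
    {
        /* Only the last holder frees; a task mid-allocation that got its
         * reference via mpol_shared_policy_lookup() keeps the object alive. */
        if (--pol->refcnt == 0)
            free(pol);
    }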

    Signed-off-by: Lee Schermerhorn
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Mel Gorman
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     

27 Apr, 2008

2 commits

  • The SIE instruction on s390 uses the 2nd half of the page table page to
    virtualize the storage keys of a guest. This patch offers the s390_enable_sie
    function, which reorganizes the page tables of a single-threaded process to
    reserve space in the page table:
    s390_enable_sie makes sure that the process is single-threaded and then uses
    dup_mm to create a new mm with reorganized page tables. The old mm is freed
    and the process now has a page status extended field after every page table.

    Code that wants to exploit pgstes should select CONFIG_PGSTE.

    This patch has a small common code hit, namely making dup_mm non-static.

    Edit (Carsten): I've modified Martin's patch, following Jeremy Fitzhardinge's
    review feedback. We now do have the prototype for dup_mm in
    include/linux/sched.h. Following Martin's suggestion, s390_enable_sie() now
    calls task_lock() to prevent a race against ptrace modification of mm_users.

    Signed-off-by: Martin Schwidefsky
    Signed-off-by: Carsten Otte
    Acked-by: Andrew Morton
    Signed-off-by: Avi Kivity

    Carsten Otte
     
  • Arrgghhh...

    Sorry about that, I'd been sure I'd folded that one, but it actually got
    lost. Please apply - that breaks execve().

    Signed-off-by: Al Viro
    Tested-by: Ingo Molnar
    Signed-off-by: Linus Torvalds

    Al Viro
     

25 Apr, 2008

3 commits

  • * let unshare_files() give the caller the displaced files_struct
    * don't bother with grabbing a reference only to drop it in the
    caller if it hadn't been shared in the first place
    * in that form unshare_files() is trivially implemented via
    unshare_fd(), so we eliminate the duplicate logic in fork.c
    * reset_files_struct() is only ever called for current; it
    would break the system if somebody called it for anything
    else (we can't modify somebody else's ->files). Lose the
    task_struct * argument.
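    The resulting calling convention, sketched (hedged: this mirrors the
    description above rather than quoting the diff):

    /* Returns the displaced files_struct through *displaced (NULL if
     * nothing was shared), instead of grabbing an extra reference. */
    int unshare_files(struct files_struct **displaced);

    /* Acts on current only, hence no task_struct argument anymore. */
    void reset_files_struct(struct files_struct *new_files);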

    Signed-off-by: Al Viro

    Al Viro
     
  • * unshare_files() can fail; doing it after irreversible actions is wrong
    and de_thread() is certainly irreversible.
    * since we do it unconditionally anyway, we might as well do it in do_execve()
    and save ourselves the PITA in binfmt handlers, etc.
    * while we are at it, binfmt_som actually leaked files_struct on failure.

    As a side benefit, unshare_files(), put_files_struct() and reset_files_struct()
    become unexported.

    Signed-off-by: Al Viro

    Al Viro
     
  • updating current->files requires task_lock

    Signed-off-by: Al Viro

    Al Viro
     

20 Apr, 2008

2 commits

  • Signed-off-by: Ingo Molnar
    Signed-off-by: Thomas Gleixner

    Suresh Siddha
     
  • Split the FPU save area from the task struct. This allows easy migration
    of FPU context, and it's generally cleaner. It also allows the following
    two optimizations:

    1) only allocate the save area when the application actually uses the FPU,
    i.e., at the first lazy FPU trap. This could save memory for apps that
    don't use the FPU. The next patch does this lazy allocation.

    2) allocate the right size for the actual CPU rather than always 512 bytes.
    Patches enabling xsave/xrstor support (coming shortly) will take advantage
    of this.
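    A hedged sketch of the resulting layout (simplified; the x86 patch keeps
    the state behind a pointer in thread_struct):

    union thread_xstate;        /* fxsave/fsave image formats */

    struct thread_struct_sketch {
        /* Was a fixed 512-byte area embedded in the task struct; now a
         * pointer, left NULL until the first FPU use and sized for the
         * running CPU's actual features. */
        union thread_xstate *xstate;
    };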

    Signed-off-by: Suresh Siddha
    Signed-off-by: Arjan van de Ven
    Signed-off-by: Ingo Molnar
    Signed-off-by: Thomas Gleixner

    Suresh Siddha
     

29 Mar, 2008

1 commit


15 Feb, 2008

1 commit


09 Feb, 2008

3 commits

  • [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Harvey Harrison
    Acked-by: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Harvey Harrison
     
  • Some time ago the xxx_vnr() calls (e.g. pid_vnr or find_task_by_vpid) were
    _all_ converted to operate on the current pid namespace. After this each call
    like xxx_nr_ns(foo, current->nsproxy->pid_ns) is nothing but an xxx_vnr(foo)
    one.

    Switch all the xxx_nr_ns() callers to use the xxx_vnr() calls where
    appropriate.
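    In kernel-style C, the identity being exploited (a sketch mirroring the
    commit message, not a quoted hunk):

    static inline pid_t pid_vnr(struct pid *pid)
    {
        /* the "current namespace" lookup the callers were open-coding */
        return pid_nr_ns(pid, current->nsproxy->pid_ns);
    }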

    Signed-off-by: Pavel Emelyanov
    Reviewed-by: Oleg Nesterov
    Cc: "Eric W. Biederman"
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Emelyanov
     
  • signal_struct->tsk points to the ->group_leader and thus we have the nasty
    code in de_thread() which has to change it and restart ->real_timer if the
    leader is changed.

    Use "struct pid *leader_pid" instead. This also allows us to kill now
    unneeded send_group_sig_info().

    Signed-off-by: Oleg Nesterov
    Acked-by: "Eric W. Biederman"
    Cc: Davide Libenzi
    Cc: Pavel Emelyanov
    Acked-by: Roland McGrath
    Acked-by: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

08 Feb, 2008

1 commit

  • Basic setup routines: the mm_struct has a pointer to the cgroup that
    it belongs to, and the page has a page_cgroup associated with it.

    Signed-off-by: Pavel Emelianov
    Signed-off-by: Balbir Singh
    Cc: Paul Menage
    Cc: Peter Zijlstra
    Cc: "Eric W. Biederman"
    Cc: Nick Piggin
    Cc: Kirill Korotaev
    Cc: Herbert Poetzl
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Emelianov
     

07 Feb, 2008

2 commits

  • Fix the following section mismatch with CONFIG_HOTPLUG=n,
    CONFIG_HOTPLUG_CPU=y:

    WARNING: vmlinux.o(.text+0x399a6): Section mismatch: reference to .init.text.5:idle_regs (between 'fork_idle' and 'get_task_mm')

    Signed-off-by: Adrian Bunk
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "Luck, Tony"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adrian Bunk
     
  • 1. It is much easier to grep for ->state change if __set_task_state() is used
    instead of the direct assignment.

    2. ptrace_stop() and handle_group_stop() use set_task_state() which adds the
    unneeded mb() (btw even if we use mb() it is still possible that do_wait()
    sees the new ->state but not ->exit_code, but this is ok).

    Signed-off-by: Oleg Nesterov
    Acked-by: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

06 Feb, 2008

3 commits

  • The capability bounding set is a set beyond which capabilities cannot grow.
    Currently cap_bset is per-system. It can be manipulated through sysctl,
    but only init can add capabilities. Root can remove capabilities. By
    default it includes all caps except CAP_SETPCAP.

    This patch makes the bounding set per-process when file capabilities are
    enabled. It is inherited at fork from the parent. No one can add elements;
    CAP_SETPCAP is required to remove them.

    One example use of this is to start a safer container. For instance, until
    device namespaces or per-container device whitelists are introduced, it is
    best to take CAP_MKNOD away from a container.

    The bounding set will not affect pP and pE immediately. It will only
    affect pP' and pE' after subsequent exec()s. It also does not affect pI,
    and exec() does not constrain pI'. So to really start a shell with no way
    of regaining CAP_MKNOD, you would do

    prctl(PR_CAPBSET_DROP, CAP_MKNOD);
    cap_t cap = cap_get_proc();
    cap_value_t caparray[1];
    caparray[0] = CAP_MKNOD;
    cap_set_flag(cap, CAP_INHERITABLE, 1, caparray, CAP_DROP);
    cap_set_proc(cap);
    cap_free(cap);

    The following test program will get and set the bounding
    set (but not pI). For instance

    ./bset get
    (lists capabilities in bset)
    ./bset drop cap_net_raw
    (starts shell with new bset)
    (use capset, setuid binary, or binary with
    file capabilities to try to increase caps)

    ************************************************************
    cap_bound.c
    ************************************************************
    #include <sys/prctl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    #ifndef PR_CAPBSET_READ
    #define PR_CAPBSET_READ 23
    #endif

    #ifndef PR_CAPBSET_DROP
    #define PR_CAPBSET_DROP 24
    #endif

    int usage(char *me)
    {
        printf("Usage: %s get\n", me);
        printf("       %s drop <capability>\n", me);
        return 1;
    }

    #define numcaps 32
    char *captable[numcaps] = {
        "cap_chown",
        "cap_dac_override",
        "cap_dac_read_search",
        "cap_fowner",
        "cap_fsetid",
        "cap_kill",
        "cap_setgid",
        "cap_setuid",
        "cap_setpcap",
        "cap_linux_immutable",
        "cap_net_bind_service",
        "cap_net_broadcast",
        "cap_net_admin",
        "cap_net_raw",
        "cap_ipc_lock",
        "cap_ipc_owner",
        "cap_sys_module",
        "cap_sys_rawio",
        "cap_sys_chroot",
        "cap_sys_ptrace",
        "cap_sys_pacct",
        "cap_sys_admin",
        "cap_sys_boot",
        "cap_sys_nice",
        "cap_sys_resource",
        "cap_sys_time",
        "cap_sys_tty_config",
        "cap_mknod",
        "cap_lease",
        "cap_audit_write",
        "cap_audit_control",
        "cap_setfcap"
    };

    int getbcap(void)
    {
        int comma = 0;
        unsigned long i;
        int ret;

        printf("i know of %d capabilities\n", numcaps);
        printf("capability bounding set:");
        for (i = 0; i < numcaps; i++) {
            ret = prctl(PR_CAPBSET_READ, i);
            if (ret < 0)
                perror("prctl");
            else if (ret == 1)
                printf("%s%s", (comma++) ? ", " : " ", captable[i]);
        }
        printf("\n");
        return 0;
    }

    int capdrop(char *str)
    {
        unsigned long i;
        int found = 0;

        for (i = 0; i < numcaps; i++) {
            if (strcmp(captable[i], str) == 0) {
                found = 1;
                break;
            }
        }
        if (!found)
            return 1;
        if (prctl(PR_CAPBSET_DROP, i)) {
            perror("prctl");
            return 1;
        }
        return 0;
    }

    int main(int argc, char *argv[])
    {
        if (argc < 2)
            return usage(argv[0]);
        if (strcmp(argv[1], "get") == 0)
            return getbcap();
        if (strcmp(argv[1], "drop") == 0) {
            if (argc < 3)
                return usage(argv[0]);
            if (capdrop(argv[2])) {
                printf("unknown capability\n");
                return 1;
            }
            return execl("/bin/bash", "/bin/bash", NULL);
        }
        return usage(argv[0]);
    }
    Signed-off-by: Andrew G. Morgan
    Cc: Stephen Smalley
    Cc: James Morris
    Cc: Chris Wright
    Cc: Casey Schaufler
    Signed-off-by: "Serge E. Hallyn"
    Tested-by: Jiri Slaby
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Serge E. Hallyn
     
  • (with Martin Schwidefsky)

    The pgd/pud/pmd/pte page table allocation functions get a mm_struct pointer as
    their first argument. The free functions do not get the mm_struct argument.
    This is (1) asymmetrical, and (2) to do mm-related page table allocations the
    mm argument is needed in the free functions as well.
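    The symmetry being restored, sketched with the pte-level pair (hedged; the
    same mm argument is added across the pgd/pud/pmd/pte levels):

    pte_t *pte_alloc_one_kernel(struct mm_struct *mm, unsigned long address);
    void pte_free_kernel(struct mm_struct *mm, pte_t *pte);  /* mm now passed here too */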

    [kamalesh@linux.vnet.ibm.com: i386 fix]
    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Benjamin Herrenschmidt
    Signed-off-by: Martin Schwidefsky
    Signed-off-by: Kamalesh Babulal
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Benjamin Herrenschmidt
     
  • Ulrich says that we never used this clone flag and that nothing should be
    using it.

    As we're down to only a single bit left in clone's flags argument, let's add a
    warning to check that no userspace is actually using it. Hopefully we will
    be able to recycle it.

    Roland said:

    CLONE_STOPPED was previously used by some NPTL versions when under
    thread_db (i.e. only when being actively debugged by gdb), but not for a
    long time now, and it never worked reliably when it was used. Removing it
    seems fine to me.

    [akpm@linux-foundation.org: it looks like CLONE_DETACHED is being used]
    Cc: Ulrich Drepper
    Cc: Ingo Molnar
    Cc: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     

30 Jan, 2008

1 commit


28 Jan, 2008

3 commits


26 Jan, 2008

5 commits

  • LatencyTOP kernel infrastructure; it measures latencies in the
    scheduler and tracks them system-wide and per-process.

    Signed-off-by: Arjan van de Ven
    Signed-off-by: Ingo Molnar

    Arjan van de Ven
     
  • Extend group scheduling to also cover the realtime classes. It uses the time
    limiting introduced by the previous patch to allow multiple realtime groups.

    The hard time limit is required to keep behaviour deterministic.

    The algorithms used make the realtime scheduler O(tg), i.e., scaling
    linearly with the number of task groups. This is the worst-case behaviour I
    can't seem to get out of; the average case of the algorithms can be
    improved, but I focused on correctness and the worst case.

    [ akpm@linux-foundation.org: move side-effects out of BUG_ON(). ]

    Signed-off-by: Peter Zijlstra
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • This patch implements a new version of RCU which allows its read-side
    critical sections to be preempted. It uses a set of counter pairs
    to keep track of the read-side critical sections and flips them
    when all tasks exit read-side critical section. The details
    of this implementation can be found in this paper -

    http://www.rdrop.com/users/paulmck/RCU/OLSrtRCU.2006.08.11a.pdf

    and the article-

    http://lwn.net/Articles/253651/

    This patch was developed as a part of the -rt kernel development and
    meant to provide better latencies when read-side critical sections of
    RCU don't disable preemption. As a consequence of keeping track of RCU
    readers, the readers incur a slight overhead (optimizations are discussed
    in the paper). This implementation co-exists with the "classic" RCU
    implementation and can be selected at compile time.

    Also includes RCU tracing summarized in debugfs.

    [ akpm@linux-foundation.org: build fixes on non-preempt architectures ]

    Signed-off-by: Gautham R Shenoy
    Signed-off-by: Dipankar Sarma
    Signed-off-by: Paul E. McKenney
    Reviewed-by: Steven Rostedt
    Signed-off-by: Ingo Molnar

    Paul E. McKenney
     
  • Some RT tasks (particularly kthreads) are bound to one specific CPU.
    It is fairly common for two or more bound tasks to get queued up at the
    same time. Consider, for instance, softirq_timer and softirq_sched. A
    timer goes off in an ISR which schedules softirq_thread to run at RT50.
    Then the timer handler determines that it's time to smp-rebalance the
    system so it schedules softirq_sched to run. So we are in a situation
    where we have two RT50 tasks queued, and the system will go into
    rt-overload condition to request other CPUs for help.

    This causes two problems in the current code:

    1) If a high-priority bound task and a low-priority unbounded task queue
    up behind the running task, we will fail to ever relocate the unbounded
    task because we terminate the search on the first unmovable task.

    2) We spend precious futile cycles in the fast-path trying to pull
    overloaded tasks over. It is therefore optimal to strive to avoid the
    overhead altogether if we can cheaply detect the condition before
    overload even occurs.

    This patch tries to achieve this optimization by utilizing the Hamming
    weight of the task->cpus_allowed mask. A weight of 1 indicates that
    the task cannot be migrated. We will then utilize this information to
    skip non-migratable tasks and to eliminate unnecessary rebalance attempts.

    We introduce a per-rq variable to count the number of migratable tasks
    that are currently running. We only go into overload if we have more
    than one rt task, AND at least one of them is migratable.

    In addition, we introduce a per-task variable to cache the cpus_allowed
    weight, since the hamming calculation is probably relatively expensive.
    We only update the cached value when the mask is updated which should be
    relatively infrequent, especially compared to scheduling frequency
    in the fast path.
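    The weight test itself is just a population count; a toy userspace
    illustration (names invented, not from the patch):

    #include <stdio.h>

    int main(void)
    {
        unsigned long pinned  = 0x4UL;    /* allowed on CPU 2 only */
        unsigned long roaming = 0xfUL;    /* allowed on CPUs 0-3   */

        /* weight == 1 -> cannot migrate, skip it during rebalance */
        printf("pinned migratable:  %d\n", __builtin_popcountl(pinned)  > 1);
        printf("roaming migratable: %d\n", __builtin_popcountl(roaming) > 1);
        return 0;
    }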

    Signed-off-by: Gregory Haskins
    Signed-off-by: Steven Rostedt
    Signed-off-by: Ingo Molnar

    Gregory Haskins
     
  • this patch extends the soft-lockup detector to automatically
    detect hung TASK_UNINTERRUPTIBLE tasks. Such hung tasks are
    printed the following way:

    ------------------>
    INFO: task prctl:3042 blocked for more than 120 seconds.
    "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message
    prctl D fd5e3793 0 3042 2997
    f6050f38 00000046 00000001 fd5e3793 00000009 c06d8264 c06dae80 00000286
    f6050f40 f6050f00 f7d34d90 f7d34fc8 c1e1be80 00000001 f6050000 00000000
    f7e92d00 00000286 f6050f18 c0489d1a f6050f40 00006605 00000000 c0133a5b
    Call Trace:
    [] schedule_timeout+0x6d/0x8b
    [] schedule_timeout_uninterruptible+0x15/0x17
    [] msleep+0x10/0x16
    [] sys_prctl+0x30/0x1e2
    [] sysenter_past_esp+0x5f/0xa5
    =======================
    2 locks held by prctl/3042:
    #0: (&sb->s_type->i_mutex_key#5){--..}, at: [] do_fsync+0x38/0x7a
    #1: (jbd_handle){--..}, at: [] journal_start+0xc7/0xe9
    [ Peter Zijlstra: CPU hotplug fixes. ]
    [ Andrew Morton: build warning fix. ]

    Signed-off-by: Ingo Molnar
    Signed-off-by: Arjan van de Ven

    Ingo Molnar
     

06 Dec, 2007

1 commit

  • Currently we are complicating the code in copy_process(), the clone ABI, and
    (if we were to fix its bugs) sys_setsid() itself with an unnecessary
    open-coded version of sys_setsid().

    So just simplify everything and don't special case the session and pgrp of
    the initial process in a pid namespace.

    Not having this special case actually presents to user space the classic
    Linux startup conditions with session == pgrp == 0 for /sbin/init.

    We already handle sending signals to processes in a child pid namespace.

    We need to handle sending signals to processes in a parent pid namespace
    for cases like SIGCHLD and SIGIO.

    This makes nothing extra visible inside a pid namespace. So this extra
    special case appears to have no redeeming merits.

    Further, removing this special case increases the flexibility of how we can
    use pid namespaces, by not requiring the initial process in a pid namespace
    to be a daemon.
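    A hedged demo of the resulting startup conditions (illustrative, not part
    of the patch; needs root and a kernel with pid namespaces):

    #define _GNU_SOURCE
    #include <sched.h>
    #include <signal.h>
    #include <stdio.h>
    #include <unistd.h>
    #include <sys/wait.h>

    static char stack[64 * 1024];

    static int ns_init(void *unused)
    {
        /* classic conditions: pid 1, pgrp 0, session 0 -- the child may
         * call setsid() itself, but no longer has to be a daemon */
        printf("pid=%d pgrp=%d sid=%d\n",
               (int)getpid(), (int)getpgrp(), (int)getsid(0));
        return 0;
    }

    int main(void)
    {
        pid_t p = clone(ns_init, stack + sizeof(stack),
                        CLONE_NEWPID | SIGCHLD, NULL);

        if (p < 0) {
            perror("clone");
            return 1;
        }
        waitpid(p, NULL, 0);
        return 0;
    }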

    Signed-off-by: Eric W. Biederman
    Cc: Oleg Nesterov
    Cc: Pavel Emelyanov
    Cc: Sukadev Bhattiprolu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric W. Biederman
     

10 Nov, 2007

1 commit

  • Sukadev Bhattiprolu reported a kernel crash with control groups.
    There are a couple of problems discovered by Suka's test:

    - The test requires the cgroup filesystem to be mounted with
    at least the cpu and ns options (i.e. both the namespace and cpu
    controllers are active in the same hierarchy).

    # mkdir /dev/cpuctl
    # mount -t cgroup -ocpu,ns none cpuctl
    (or simply)
    # mount -t cgroup none cpuctl -> Will activate all controllers
    in same hierarchy.

    - The test invokes clone() with CLONE_NEWNS set. This causes a new child
    to be created, along with a new group (do_fork->copy_namespaces->ns_cgroup_clone->
    cgroup_clone), and the child is attached to the new group (cgroup_clone->
    attach_task->sched_move_task). At this point in time, the child's scheduler
    related fields are uninitialized (including its on_rq field, which it has
    inherited from the parent). As a result sched_move_task thinks it's on a
    runqueue when it isn't.

    As a solution to this problem, I moved the sched_fork() call, which
    initializes the scheduler-related fields on a new task, before
    copy_namespaces(). I am not sure though whether moving it up will
    cause other side-effects. Do you see any issue?

    - The second problem exposed by this test is that task_new_fair()
    assumes that the parent and child will be part of the same group (which
    needn't be the case, as this test shows). As a result, cfs_rq->curr can be
    NULL for the child.

    The solution is to test for curr pointer being NULL in
    task_new_fair().

    With the patch below, I could run ns_exec() fine w/o a crash.

    Reported-by: Sukadev Bhattiprolu
    Signed-off-by: Srivatsa Vaddagiri
    Signed-off-by: Ingo Molnar

    Srivatsa Vaddagiri
     

30 Oct, 2007

2 commits