16 Feb, 2009

1 commit


09 Feb, 2009

1 commit


05 Feb, 2009

1 commit

  • Change the process wide cpu timers/clocks so that we:

    1) don't mess up the kernel with too many threads,
    2) don't have a per-cpu allocation for each process,
    3) have no impact when not used.

    In order to accomplish this we're going to split it into two parts:

    - clocks; which can take all the time they want since they run
    from user context -- ie. sys_clock_gettime(CLOCK_PROCESS_CPUTIME_ID)

    - timers; which need constant time sampling but since they're
    explicitly used, the user can pay the overhead.

    The clock readout will go back to a full sum of the thread group, while the
    timers will run off a global 'clock' that only runs when needed, so only
    programs that make use of the facility pay the price.
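
    For reference, a minimal userspace sketch of the clock side described above
    (reading the process-wide CPU clock; the busy loop is only there so the
    value is non-trivial):

    #include <stdio.h>
    #include <time.h>

    int main(void)
    {
            struct timespec ts;
            volatile unsigned long i;

            /* Burn a little CPU so the process-wide clock advances. */
            for (i = 0; i < 100000000UL; i++)
                    ;

            /* Sums the CPU time of all threads in the process; per the change
               above this read-out may take its time, since it runs from the
               caller's (syscall) context. */
            if (clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &ts) != 0) {
                    perror("clock_gettime");
                    return 1;
            }

            printf("process CPU time: %ld.%09ld s\n",
                   (long)ts.tv_sec, ts.tv_nsec);
            return 0;
    }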

    Signed-off-by: Peter Zijlstra
    Reviewed-by: Ingo Molnar
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

11 Jan, 2009

1 commit


08 Jan, 2009

1 commit

  • Either we bounce one cacheline per cpu per tick, yielding n^2 bounces,
    or we just bounce a single one.

    Also, using per-cpu allocations for the thread-groups complicates the
    per-cpu allocator in that it's currently aimed to be a fixed-size
    allocator and the only possible extension to that would be vmap based,
    which is seriously constrained on 32 bit archs.

    So making the per-cpu memory requirement depend on the number of
    processes is an issue.

    Lastly, it didn't deal with cpu-hotplug, although admittedly that might
    be fixable.

    Signed-off-by: Peter Zijlstra
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

01 Jan, 2009

1 commit


29 Dec, 2008

1 commit

  • The RT scheduler employs a "push/pull" design to actively balance tasks
    within the system (on a per disjoint cpuset basis). When a task is
    awoken, it is immediately determined if there are any lower priority
    cpus which should be preempted. This is opposed to the way normal
    SCHED_OTHER tasks behave, which will wait for a periodic rebalancing
    operation to occur before spreading out load.

    When a particular RQ has more than 1 active RT task, it is said to
    be in an "overloaded" state. Once this occurs, the system enters
    the active balancing mode, where it will try to push the task away,
    or persuade a different cpu to pull it over. The system will stay
    in this state until the system falls back below the lock for certain workloads, and by making sure the algorithm
    considers all eligible tasks in the system.

    [ rostedt: added a couple more BUG_ONs ]

    Signed-off-by: Gregory Haskins
    Acked-by: Steven Rostedt

    Gregory Haskins
     

25 Nov, 2008

1 commit

  • The user_ns is moved from nsproxy to user_struct, so that a struct
    cred by itself is sufficient to determine access (which it otherwise
    would not be). Corresponding ecryptfs fixes (by David Howells) are
    here as well.

    Fix refcounting. The following rules now apply:
    1. The task pins the user struct.
    2. The user struct pins its user namespace.
    3. The user namespace pins the struct user which created it.

    User namespaces are cloned during copy_creds(). Unsharing a new user_ns
    is no longer possible. (We could re-add that, but it'll cause code
    duplication and doesn't seem useful if PAM doesn't need to clone user
    namespaces).

    When a user namespace is created, its first user (uid 0) gets empty
    keyrings and a clean group_info.

    This incorporates a previous patch by David Howells. Here
    is his original patch description:

    >I suggest adding the attached incremental patch. It makes the following
    >changes:
    >
    > (1) Provides a current_user_ns() macro to wrap accesses to current's user
    > namespace.
    >
    > (2) Fixes eCryptFS.
    >
    > (3) Renames create_new_userns() to create_user_ns() to be more consistent
    > with the other associated functions and because the 'new' in the name is
    > superfluous.
    >
    > (4) Moves the argument and permission checks made for CLONE_NEWUSER to the
    > beginning of do_fork() so that they're done prior to making any attempts
    > at allocation.
    >
    > (5) Calls create_user_ns() after prepare_creds(), and gives it the new creds
    > to fill in rather than have it return the new root user. I don't imagine
    > the new root user being used for anything other than filling in a cred
    > struct.
    >
    > This also permits me to get rid of a get_uid() and a free_uid(), as the
    > reference the creds were holding on the old user_struct can just be
    > transferred to the new namespace's creator pointer.
    >
    > (6) Makes create_user_ns() reset the UIDs and GIDs of the creds under
    > preparation rather than doing it in copy_creds().
    >
    >David

    >Signed-off-by: David Howells

    Changelog:
    Oct 20: integrate dhowells comments
    1. leave thread_keyring alone
    2. use current_user_ns() in set_user()

    Signed-off-by: Serge Hallyn

    Serge Hallyn
     

14 Nov, 2008

4 commits

  • Differentiate the objective and real subjective credentials from the effective
    subjective credentials on a task by introducing a second credentials pointer
    into the task_struct.

    task_struct::real_cred then refers to the objective and apparent real
    subjective credentials of a task, as perceived by the other tasks in the
    system.

    task_struct::cred then refers to the effective subjective credentials of a
    task, as used by that task when it's actually running. These are not visible
    to the other tasks in the system.

    __task_cred(task) then refers to the objective/real credentials of the task in
    question.

    current_cred() refers to the effective subjective credentials of the current
    task.

    prepare_creds() uses the objective creds as a base and commit_creds() changes
    both pointers in the task_struct (indeed commit_creds() requires them to be the
    same).

    override_creds() and revert_creds() change the subjective creds pointer only,
    and the former returns the old subjective creds. These are used by NFSD,
    faccessat() and do_coredump(), and will be used by CacheFiles.
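
    A hedged sketch of the override/revert pattern these helpers enable
    (kernel context; the do_work() callee is hypothetical):

    const struct cred *old;
    struct cred *override;
    int ret;

    override = prepare_creds();
    if (!override)
            return -ENOMEM;
    /* ... adjust 'override' as needed ... */
    old = override_creds(override);     /* swap in the new subjective creds */
    ret = do_work();                    /* hypothetical work done under them */
    revert_creds(old);                  /* restore previous subjective creds */
    put_cred(override);
    return ret;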

    In SELinux, current_has_perm() is provided as an alternative to
    task_has_perm(). This uses the effective subjective context of current,
    whereas task_has_perm() uses the objective/real context of the subject.

    Signed-off-by: David Howells
    Signed-off-by: James Morris

    David Howells
     
  • Inaugurate copy-on-write credentials management. This uses RCU to manage the
    credentials pointer in the task_struct with respect to accesses by other tasks.
    A process may only modify its own credentials, and so does not need locking to
    access or modify its own credentials.

    A mutex (cred_replace_mutex) is added to the task_struct to control the effect
    of PTRACE_ATTACH on credential calculations, particularly with respect to
    execve().

    With this patch, the contents of an active credentials struct may not be
    changed directly; rather a new set of credentials must be prepared, modified
    and committed using something like the following sequence of events:

    struct cred *new = prepare_creds();
    int ret = blah(new);
    if (ret < 0) {
            abort_creds(new);
            return ret;
    }
    return commit_creds(new);

    There are some exceptions to this rule: the keyrings pointed to by the active
    credentials may be instantiated - keyrings violate the COW rule as managing
    COW keyrings is tricky, given that it is possible for a task to directly alter
    the keys in a keyring in use by another task.

    To help enforce this, various pointers to sets of credentials, such as those in
    the task_struct, are declared const. The purpose of this is compile-time
    discouragement of altering credentials through those pointers. Once a set of
    credentials has been made public through one of these pointers, it may not be
    modified, except under special circumstances:

    (1) Its reference count may be incremented and decremented.

    (2) The keyrings to which it points may be modified, but not replaced.

    The only safe way to modify anything else is to create a replacement and commit
    using the functions described in Documentation/credentials.txt (which will be
    added by a later patch).

    This patch and the preceding patches have been tested with the LTP SELinux
    testsuite.

    This patch makes several logical sets of alterations:

    (1) execve().

    This now prepares and commits credentials in various places in the
    security code rather than altering the current creds directly.

    (2) Temporary credential overrides.

    do_coredump() and sys_faccessat() now prepare their own credentials and
    temporarily override the ones currently on the acting thread, whilst
    preventing interference from other threads by holding cred_replace_mutex
    on the thread being dumped.

    This will be replaced in a future patch by something that hands down the
    credentials directly to the functions being called, rather than altering
    the task's objective credentials.

    (3) LSM interface.

    A number of functions have been changed, added or removed:

    (*) security_capset_check(), ->capset_check()
    (*) security_capset_set(), ->capset_set()

    Removed in favour of security_capset().

    (*) security_capset(), ->capset()

    New. This is passed a pointer to the new creds, a pointer to the old
    creds and the proposed capability sets. It should fill in the new
    creds or return an error. All pointers, barring the pointer to the
    new creds, are now const.

    (*) security_bprm_apply_creds(), ->bprm_apply_creds()

    Changed; now returns a value, which will cause the process to be
    killed if it's an error.

    (*) security_task_alloc(), ->task_alloc_security()

    Removed in favour of security_prepare_creds().

    (*) security_cred_free(), ->cred_free()

    New. Free security data attached to cred->security.

    (*) security_prepare_creds(), ->cred_prepare()

    New. Duplicate any security data attached to cred->security.

    (*) security_commit_creds(), ->cred_commit()

    New. Apply any security effects for the upcoming installation of new
    security by commit_creds().

    (*) security_task_post_setuid(), ->task_post_setuid()

    Removed in favour of security_task_fix_setuid().

    (*) security_task_fix_setuid(), ->task_fix_setuid()

    Fix up the proposed new credentials for setuid(). This is used by
    cap_set_fix_setuid() to implicitly adjust capabilities in line with
    setuid() changes. Changes are made to the new credentials, rather
    than the task itself as in security_task_post_setuid().

    (*) security_task_reparent_to_init(), ->task_reparent_to_init()

    Removed. Instead the task being reparented to init is referred
    directly to init's credentials.

    NOTE! This results in the loss of some state: SELinux's osid no
    longer records the sid of the thread that forked it.

    (*) security_key_alloc(), ->key_alloc()
    (*) security_key_permission(), ->key_permission()

    Changed. These now take cred pointers rather than task pointers to
    refer to the security context.

    (4) sys_capset().

    This has been simplified and uses less locking. The LSM functions it
    calls have been merged.

    (5) reparent_to_kthreadd().

    This gives the current thread the same credentials as init by simply using
    commit_creds() to point that way.

    (6) __sigqueue_alloc() and switch_uid()

    __sigqueue_alloc() can't stop the target task from changing its creds
    beneath it, so this function gets a reference to the currently applicable
    user_struct which it then passes into the sigqueue struct it returns if
    successful.

    switch_uid() is now called from commit_creds(), and possibly should be
    folded into that. commit_creds() should take care of protecting
    __sigqueue_alloc().

    (7) [sg]et[ug]id() and co and [sg]et_current_groups.

    The set functions now all use prepare_creds(), commit_creds() and
    abort_creds() to build and check a new set of credentials before applying
    it.

    security_task_set[ug]id() is called inside the prepared section. This
    guarantees that nothing else will affect the creds until we've finished.

    The calling of set_dumpable() has been moved into commit_creds().

    Much of the functionality of set_user() has been moved into
    commit_creds().

    The get functions all simply access the data directly.

    (8) security_task_prctl() and cap_task_prctl().

    security_task_prctl() has been modified to return -ENOSYS if it doesn't
    want to handle a function, or otherwise return the return value directly
    rather than through an argument.

    Additionally, cap_task_prctl() now prepares a new set of credentials, even
    if it doesn't end up using it.

    (9) Keyrings.

    A number of changes have been made to the keyrings code:

    (a) switch_uid_keyring(), copy_keys(), exit_keys() and suid_keys() have
    all been dropped and built in to the credentials functions directly.
    They may want separating out again later.

    (b) key_alloc() and search_process_keyrings() now take a cred pointer
    rather than a task pointer to specify the security context.

    (c) copy_creds() gives a new thread within the same thread group a new
    thread keyring if its parent had one, otherwise it discards the thread
    keyring.

    (d) The authorisation key now points directly to the credentials to extend
    the search into, rather than pointing to the task that carries them.

    (e) Installing thread, process or session keyrings causes a new set of
    credentials to be created, even though it's not strictly necessary for
    process or session keyrings (they're shared).

    (10) Usermode helper.

    The usermode helper code now carries a cred struct pointer in its
    subprocess_info struct instead of a new session keyring pointer. This set
    of credentials is derived from init_cred and installed on the new process
    after it has been cloned.

    call_usermodehelper_setup() allocates the new credentials and
    call_usermodehelper_freeinfo() discards them if they haven't been used. A
    special cred function (prepare_usermodeinfo_creds()) is provided
    specifically for call_usermodehelper_setup() to call.

    call_usermodehelper_setkeys() adjusts the credentials to sport the
    supplied keyring as the new session keyring.

    (11) SELinux.

    SELinux has a number of changes, in addition to those to support the LSM
    interface changes mentioned above:

    (a) selinux_setprocattr() no longer does its check for whether the
    current ptracer can access processes with the new SID inside the lock
    that covers getting the ptracer's SID. Whilst this lock ensures that
    the check is done with the ptracer pinned, the result is only valid
    until the lock is released, so there's no point doing it inside the
    lock.

    (12) is_single_threaded().

    This function has been extracted from selinux_setprocattr() and put into
    a file of its own in the lib/ directory as join_session_keyring() now
    wants to use it too.

    The code in SELinux just checked to see whether a task shared mm_structs
    with other tasks (CLONE_VM), but that isn't good enough. We really want
    to know if they're part of the same thread group (CLONE_THREAD).

    (13) nfsd.

    The NFS server daemon now has to use the COW credentials to set the
    credentials it is going to use. It really needs to pass the credentials
    down to the functions it calls, but it can't do that until other patches
    in this series have been applied.

    Signed-off-by: David Howells
    Acked-by: James Morris
    Signed-off-by: James Morris

    David Howells
     
  • Detach the credentials from task_struct, duplicating them in copy_process()
    and releasing them in __put_task_struct().

    Signed-off-by: David Howells
    Acked-by: James Morris
    Acked-by: Serge Hallyn
    Signed-off-by: James Morris

    David Howells
     
  • Separate the task security context from task_struct. At this point, the
    security data is temporarily embedded in the task_struct with two pointers
    pointing to it.

    Note that the Alpha arch is altered as it refers to (E)UID and (E)GID in
    entry.S via asm-offsets.

    With comment fixes Signed-off-by: Marc Dionne

    Signed-off-by: David Howells
    Acked-by: James Morris
    Acked-by: Serge Hallyn
    Signed-off-by: James Morris

    David Howells
     

06 Sep, 2008

1 commit

  • We want to be able to control the default "rounding" that is used by
    select() and poll() and friends. This is a per process property
    (so that we can have a "nice" like program to start certain programs with
    a looser or stricter rounding) that can be set/get via a prctl().

    For this purpose, a field called "timer_slack_ns" is added to the task
    struct. In addition, a field called "default_timer_slack_ns" is added
    so that tasks can easily switch temporarily to a more/less accurate slack
    and then back to the default.

    The default value of the slack is set to 50 usec; this is significantly less
    than 2.6.27's average select() and poll() timing error but still allows
    the kernel to group timers somewhat to preserve power behavior. Applications
    and admins can override this via the prctl().
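
    As an illustration, a minimal sketch of adjusting the per-task slack from
    userspace with the PR_SET_TIMERSLACK/PR_GET_TIMERSLACK options (values are
    in nanoseconds; passing 0 falls back to the task's default slack):

    #include <stdio.h>
    #include <sys/prctl.h>

    int main(void)
    {
            /* Ask for a tighter 10 usec slack for this task's poll()/select()
               style timeouts. */
            if (prctl(PR_SET_TIMERSLACK, 10000UL, 0, 0, 0) != 0)
                    perror("PR_SET_TIMERSLACK");

            printf("current timer slack: %d ns\n",
                   prctl(PR_GET_TIMERSLACK, 0, 0, 0, 0));
            return 0;
    }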

    Signed-off-by: Arjan van de Ven

    Arjan van de Ven
     

26 Jul, 2008

1 commit

  • Introduce the new PF_KTHREAD flag to mark the kernel threads. It is set
    by INIT_TASK() and copied to the forked children (we could set it in
    kthreadd() along with PF_NOFREEZE instead).

    daemonize() was changed as well. In that case testing of PF_KTHREAD is
    racy, but daemonize() is hopeless anyway.

    This flag is cleared in do_execve(), before search_binary_handler().
    Probably not the best place, we can do this in exec_mmap() or in
    start_thread(), or clear it along with PF_FORKNOEXEC. But I think this
    doesn't matter in practice, and if do_execve() fails kthread should die
    soon.

    Signed-off-by: Oleg Nesterov
    Cc: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

17 Jul, 2008

1 commit

  • ptrace no longer fiddles with the children/sibling links, and the
    old ptrace_children list is gone. Now ptrace, whether of one's own
    children or another's via PTRACE_ATTACH, just uses the new ptraced
    list instead.

    There should be no user-visible difference that matters. The only
    change is the order in which do_wait() sees multiple stopped
    children and stopped ptrace attachees. Since wait_task_stopped()
    was changed earlier so it no longer reorders the children list, we
    already know this won't cause any new problems.

    Signed-off-by: Roland McGrath

    Roland McGrath
     

17 May, 2008

1 commit


02 May, 2008

1 commit


28 Apr, 2008

1 commit

  • Filesystem capability support makes it possible to do away with (set)uid-0
    based privilege and use capabilities instead. That is, with filesystem
    support for capabilities but without this present patch, it is (conceptually)
    possible to manage a system with capabilities alone and never need to obtain
    privilege via (set)uid-0.

    Of course, conceptually possible isn't quite the same as currently possible, since
    few user applications, certainly not enough to run a viable system, are currently
    prepared to leverage capabilities to exercise privilege. Further, many
    applications exist that may never get upgraded in this way, and the kernel
    will continue to want to support their setuid-0 base privilege needs.

    Where pure-capability applications evolve and replace setuid-0 binaries, it is
    desirable that there be a mechanism by which they can contain their
    privilege. In addition to leveraging the per-process bounding and inheritable
    sets, this should include suppressing the privilege of the uid-0 superuser
    from the process' tree of children.

    The feature added by this patch can be leveraged to suppress the privilege
    associated with (set)uid-0. This suppression requires CAP_SETPCAP to
    initiate, and only immediately affects the 'current' process (it is inherited
    through fork()/exec()). This reimplementation differs significantly from the
    historical support for securebits which was system-wide, unwieldy and which
    has ultimately withered to a dead relic in the source of the modern kernel.

    With this patch applied, a process that is capable(CAP_SETPCAP) can now drop
    all legacy privilege (through uid=0) for itself and all subsequently
    fork()'d/exec()'d children with:

    prctl(PR_SET_SECUREBITS, 0x2f);

    This patch represents a no-op unless CONFIG_SECURITY_FILE_CAPABILITIES is
    enabled at configure time.
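
    For reference, a hedged decoding of the 0x2f value above, using the SECBIT_*
    helpers from <linux/securebits.h> (macro names as found in later kernels;
    the call still requires CAP_SETPCAP):

    #include <stdio.h>
    #include <sys/prctl.h>
    #include <linux/securebits.h>

    int main(void)
    {
            /* 0x2f == NOROOT | NOROOT_LOCKED | NO_SETUID_FIXUP |
               NO_SETUID_FIXUP_LOCKED | KEEP_CAPS_LOCKED */
            unsigned long bits = SECBIT_NOROOT | SECBIT_NOROOT_LOCKED |
                                 SECBIT_NO_SETUID_FIXUP |
                                 SECBIT_NO_SETUID_FIXUP_LOCKED |
                                 SECBIT_KEEP_CAPS_LOCKED;

            if (prctl(PR_SET_SECUREBITS, bits) != 0)
                    perror("PR_SET_SECUREBITS");    /* needs CAP_SETPCAP */
            else
                    printf("securebits set to 0x%lx\n", bits);
            return 0;
    }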

    [akpm@linux-foundation.org: fix uninitialised var warning]
    [serue@us.ibm.com: capabilities: use cap_task_prctl when !CONFIG_SECURITY]
    Signed-off-by: Andrew G. Morgan
    Acked-by: Serge Hallyn
    Reviewed-by: James Morris
    Cc: Stephen Smalley
    Cc: Paul Moore
    Signed-off-by: Serge E. Hallyn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew G. Morgan
     

20 Apr, 2008

1 commit


06 Feb, 2008

1 commit

  • The capability bounding set is a set beyond which capabilities cannot grow.
    Currently cap_bset is per-system. It can be manipulated through sysctl,
    but only init can add capabilities. Root can remove capabilities. By
    default it includes all caps except CAP_SETPCAP.

    This patch makes the bounding set per-process when file capabilities are
    enabled. It is inherited at fork from the parent. No one can add elements;
    CAP_SETPCAP is required to remove them.

    One example use of this is to start a safer container. For instance, until
    device namespaces or per-container device whitelists are introduced, it is
    best to take CAP_MKNOD away from a container.

    The bounding set will not affect pP and pE immediately. It will only
    affect pP' and pE' after subsequent exec()s. It also does not affect pI,
    and exec() does not constrain pI'. So to really start a shell with no way
    of regaining CAP_MKNOD, you would do:

    prctl(PR_CAPBSET_DROP, CAP_MKNOD);
    cap_t cap = cap_get_proc();
    cap_value_t caparray[1];
    caparray[0] = CAP_MKNOD;
    cap_set_flag(cap, CAP_INHERITABLE, 1, caparray, CAP_DROP);
    cap_set_proc(cap);
    cap_free(cap);

    The following test program will get and set the bounding
    set (but not pI). For instance

    ./bset get
    (lists capabilities in bset)
    ./bset drop cap_net_raw
    (starts shell with new bset)
    (use capset, setuid binary, or binary with
    file capabilities to try to increase caps)

    ************************************************************
    cap_bound.c
    ************************************************************
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>
    #include <errno.h>
    #include <sys/types.h>
    #include <sys/prctl.h>

    #ifndef PR_CAPBSET_READ
    #define PR_CAPBSET_READ 23
    #endif

    #ifndef PR_CAPBSET_DROP
    #define PR_CAPBSET_DROP 24
    #endif

    int usage(char *me)
    {
            printf("Usage: %s get\n", me);
            printf("       %s drop <capability>\n", me);
            return 1;
    }

    #define numcaps 32
    char *captable[numcaps] = {
    "cap_chown",
    "cap_dac_override",
    "cap_dac_read_search",
    "cap_fowner",
    "cap_fsetid",
    "cap_kill",
    "cap_setgid",
    "cap_setuid",
    "cap_setpcap",
    "cap_linux_immutable",
    "cap_net_bind_service",
    "cap_net_broadcast",
    "cap_net_admin",
    "cap_net_raw",
    "cap_ipc_lock",
    "cap_ipc_owner",
    "cap_sys_module",
    "cap_sys_rawio",
    "cap_sys_chroot",
    "cap_sys_ptrace",
    "cap_sys_pacct",
    "cap_sys_admin",
    "cap_sys_boot",
    "cap_sys_nice",
    "cap_sys_resource",
    "cap_sys_time",
    "cap_sys_tty_config",
    "cap_mknod",
    "cap_lease",
    "cap_audit_write",
    "cap_audit_control",
    "cap_setfcap"
    };

    int getbcap(void)
    {
            int comma=0;
            unsigned long i;
            int ret;

            printf("i know of %d capabilities\n", numcaps);
            printf("capability bounding set:");
            for (i=0; i<numcaps; i++) {
                    ret = prctl(PR_CAPBSET_READ, i);
                    if (ret < 0)
                            perror("prctl");
                    else if (ret==1)
                            printf("%s%s", (comma++) ? ", " : " ", captable[i]);
            }
            printf("\n");
            return 0;
    }

    int capdrop(char *str)
    {
            unsigned long i;
            int found=0;

            for (i=0; i<numcaps; i++) {
                    if (strcmp(captable[i], str) == 0) {
                            found=1;
                            break;
                    }
            }
            if (!found)
                    return 1;
            if (prctl(PR_CAPBSET_DROP, i)) {
                    perror("prctl");
                    return 1;
            }
            return 0;
    }

    /* The rest of the listing was truncated; the driver below is a
       reconstruction that matches the usage shown above ("./bset get",
       "./bset drop <capability>" then start a shell), so the original
       may differ in detail. */
    int main(int argc, char *argv[])
    {
            if (argc < 2)
                    return usage(argv[0]);
            if (strcmp(argv[1], "get") == 0)
                    return getbcap();
            if (strcmp(argv[1], "drop") != 0 || argc < 3)
                    return usage(argv[0]);
            if (capdrop(argv[2])) {
                    printf("unknown capability\n");
                    return 1;
            }
            return execl("/bin/bash", "/bin/bash", (char *)NULL);
    }
    Signed-off-by: Andrew G. Morgan
    Cc: Stephen Smalley
    Cc: James Morris
    Cc: Chris Wright
    Cc: Casey Schaufler
    Signed-off-by: "Serge E. Hallyn"
    Tested-by: Jiri Slaby
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Serge E. Hallyn
     

02 Feb, 2008

2 commits


28 Jan, 2008

1 commit


26 Jan, 2008

3 commits

  • Extend group scheduling to also cover the realtime classes. It uses the time
    limiting introduced by the previous patch to allow multiple realtime groups.

    The hard time limit is required to keep behaviour deterministic.

    The algorithms used make the realtime scheduler O(tg), i.e. linear scaling with
    the number of task groups. This is the worst-case behaviour I can't seem to get
    out of; the average case of the algorithms can be improved, but I focused on
    correctness and the worst case.

    [ akpm@linux-foundation.org: move side-effects out of BUG_ON(). ]

    Signed-off-by: Peter Zijlstra
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Move the task_struct members specific to rt scheduling together.
    A future optimization could be to put sched_entity and sched_rt_entity
    into a union.

    Signed-off-by: Peter Zijlstra
    CC: Srivatsa Vaddagiri
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Some RT tasks (particularly kthreads) are bound to one specific CPU.
    It is fairly common for two or more bound tasks to get queued up at the
    same time. Consider, for instance, softirq_timer and softirq_sched. A
    timer goes off in an ISR which schedules softirq_thread to run at RT50.
    Then the timer handler determines that it's time to smp-rebalance the
    system so it schedules softirq_sched to run. So we are in a situation
    where we have two RT50 tasks queued, and the system will go into
    rt-overload condition to request other CPUs for help.

    This causes two problems in the current code:

    1) If a high-priority bound task and a low-priority unbounded task queue
    up behind the running task, we will fail to ever relocate the unbounded
    task because we terminate the search on the first unmovable task.

    2) We spend precious futile cycles in the fast-path trying to pull
    overloaded tasks over. It is therefore optimal to strive to avoid the
    overhead altogether if we can cheaply detect the condition before
    overload even occurs.

    This patch tries to achieve this optimization by utilizing the hamming
    weight of the task->cpus_allowed mask. A weight of 1 indicates that
    the task cannot be migrated. We will then utilize this information to
    skip non-migratable tasks and to eliminate unnecessary rebalance attempts.

    We introduce a per-rq variable to count the number of migratable tasks
    that are currently running. We only go into overload if we have more
    than one rt task, AND at least one of them is migratable.

    In addition, we introduce a per-task variable to cache the cpus_allowed
    weight, since the hamming calculation is probably relatively expensive.
    We only update the cached value when the mask is updated which should be
    relatively infrequent, especially compared to scheduling frequency
    in the fast path.
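
    To illustrate the weight idea itself (not the kernel implementation), a
    small userspace sketch that computes the same hamming weight of a task's
    affinity mask, where a weight of 1 means the task is pinned and cannot be
    migrated:

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>

    int main(void)
    {
            cpu_set_t mask;
            int weight;

            /* Read this task's allowed-CPU mask (the cpus_allowed equivalent). */
            if (sched_getaffinity(0, sizeof(mask), &mask) != 0) {
                    perror("sched_getaffinity");
                    return 1;
            }

            /* CPU_COUNT() is the hamming weight of the mask. */
            weight = CPU_COUNT(&mask);
            printf("cpus_allowed weight: %d (%s)\n", weight,
                   weight == 1 ? "pinned, non-migratable" : "migratable");
            return 0;
    }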

    Signed-off-by: Gregory Haskins
    Signed-off-by: Steven Rostedt
    Signed-off-by: Ingo Molnar

    Gregory Haskins
     

20 Oct, 2007

3 commits

  • The pgrp field is not used widely around the kernel, so it is now marked as
    deprecated with an appropriate comment.

    The initialization of INIT_SIGNALS is trimmed because
    a) they are set to 0 automatically;
    b) gcc cannot properly initialize two anonymous (the second one
    is the one with the session) unions. In this particular case
    to make it compile we'd have to add some field initialized
    right before the .pgrp.

    This is the same patch as the 1ec320afdc9552c92191d5f89fcd1ebe588334ca one
    (from Cedric), but for the pgrp field.

    Some progress report:

    We have to deprecate the pid, tgid, session and pgrp fields on struct
    task_struct and struct signal_struct. The session and pgrp are already
    deprecated. The tgid value is close to being such - the worst known usage
    is in fs/locks.c and audit code. The pid field deprecation is mainly
    blocked by numerous printk-s around the kernel that print the tsk->pid to
    log.

    Signed-off-by: Pavel Emelyanov
    Cc: Oleg Nesterov
    Cc: Sukadev Bhattiprolu
    Cc: Cedric Le Goater
    Cc: Serge Hallyn
    Cc: "Eric W. Biederman"
    Cc: Herbert Poetzl
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Emelyanov
     
  • Since we've switched from using pid->nr to pid->upids->nr, some
    fields on struct pid are no longer needed.

    Signed-off-by: Pavel Emelyanov
    Cc: Oleg Nesterov
    Cc: Sukadev Bhattiprolu
    Cc: Paul Menage
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Emelyanov
     
  • Since a task will be visible from different pid namespaces, it has to be
    addressed by multiple pids. struct upid stores the information about
    which id refers to which namespace.

    The construction looks like this. Each struct pid carries the reference
    counter and the list of tasks attached to this pid. At its end it has a
    variable-length array of struct upid-s. Each struct upid has a numerical id
    (the pid itself) and a pointer to the namespace this ID is valid in, and is
    hashed into the pid_hash for searching pids.

    The nr and pid_chain fields are kept in struct pid for a while to keep the
    kernel working (no patch initializes the upids yet), but they will be removed
    at the end of this series when we switch to upids completely.
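
    A rough sketch of the resulting layout (an illustration based on the
    description above and the in-tree definitions of the time, not the exact
    patch text):

    struct upid {
            int nr;                         /* the numerical id itself */
            struct pid_namespace *ns;       /* the namespace this id is valid in */
            struct hlist_node pid_chain;    /* hashed into pid_hash for lookups */
    };

    struct pid {
            atomic_t count;                         /* reference counter */
            struct hlist_head tasks[PIDTYPE_MAX];   /* tasks attached to this pid */
            struct rcu_head rcu;
            struct upid numbers[1];                 /* variable-length array of upids */
    };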

    Signed-off-by: Sukadev Bhattiprolu
    Signed-off-by: Pavel Emelyanov
    Cc: Oleg Nesterov
    Cc: Paul Menage
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sukadev Bhattiprolu
     

17 Oct, 2007

2 commits

  • The nslock spinlock is not used in the kernel at all. Remove it.

    Signed-off-by: Pavel Emelyanov
    Acked-by: Serge Hallyn
    Cc: Cedric Le Goater
    Cc: "Eric W. Biederman"
    Cc: Herbert Poetzl
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Emelyanov
     
  • Based on ideas of Andrew:
    http://marc.info/?l=linux-kernel&m=102912915020543&w=2

    Scale the bdi dirty limit inversely with the task's dirty rate.
    This makes heavy writers have a lower dirty limit than the occasional writer.

    Andrea proposed something similar:
    http://lwn.net/Articles/152277/

    The main disadvantage to his patch is that he uses an unrelated quantity to
    measure time, which leaves him with a workload-dependent tunable. Other than
    that the two approaches appear quite similar.

    [akpm@linux-foundation.org: fix warning]
    Signed-off-by: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     

11 Oct, 2007

2 commits

  • When CONFIG_NET=no, init_net is unresolved because net_namespace.c
    is not compiled and the include pulls in the init_net definition.

    This problem is very similar to the ipc namespace one, where the kernel
    can be compiled with SYSV ipc out.

    This patch fixes that by defining a macro which simply removes the init_net
    initialization from the nsproxy namespace aggregator.

    Compiled and booted on qemu-i386 with CONFIG_NET=no and CONFIG_NET=yes.
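
    A sketch of the macro approach described above (modelled on the in-tree
    INIT_NET_NS pattern; exact naming may differ from the patch, and other
    namespace pointers in the initializer are omitted):

    /* Only reference init_net when networking is compiled in. */
    #ifdef CONFIG_NET
    #define INIT_NET_NS(net_ns)     .net_ns = &init_net,
    #else
    #define INIT_NET_NS(net_ns)
    #endif

    /* In the nsproxy initializer the field then simply disappears
       when CONFIG_NET=n: */
    #define INIT_NSPROXY(nsproxy) {                 \
            .count  = ATOMIC_INIT(1),               \
            INIT_NET_NS(net_ns)                     \
    }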

    Signed-off-by: Daniel Lezcano
    Acked-by: "Eric W. Biederman"
    Signed-off-by: David S. Miller

    Daniel Lezcano
     
  • This is the network namespace from which all sockets and anything
    else under user control ultimately get their network namespace
    parameters.

    Signed-off-by: Eric W. Biederman
    Signed-off-by: David S. Miller

    Eric W. Biederman
     

21 Sep, 2007

1 commit

  • This simplifies the signalfd code by no longer keeping it attached to the
    sighand during its lifetime.

    In this way, the signalfd remain attached to the sighand only during
    poll(2) (and select and epoll) and read(2). This also allows removing
    all the custom "tsk == current" checks in kernel/signal.c, since
    dequeue_signal() will only be called by "current".

    I think this is also what Ben was suggesting time ago.

    The external effect of this is that a thread can extract only its own
    private signals and the group ones. I think this is an acceptable
    behaviour, in that those are the signals the thread would be able to
    fetch w/out signalfd.

    Signed-off-by: Davide Libenzi
    Signed-off-by: Linus Torvalds

    Davide Libenzi
     

17 Jul, 2007

1 commit

  • Basically, it will allow a process to unshare its user_struct table,
    resetting at the same time its own user_struct and all the associated
    accounting.

    A new root user (uid == 0) is added to the user namespace upon creation.
    Such root users have full privileges and it seems that these privileges
    should be controlled through some means (process capabilities ?)

    The unshare is not included in this patch.

    Changes since [try #4]:
    - Updated get_user_ns and put_user_ns to accept NULL, and
    get_user_ns to return the namespace.

    Changes since [try #3]:
    - moved struct user_namespace to files user_namespace.{c,h}

    Changes since [try #2]:
    - removed struct user_namespace* argument from find_user()

    Changes since [try #1]:
    - removed struct user_namespace* argument from find_user()
    - added a root_user per user namespace

    Signed-off-by: Cedric Le Goater
    Signed-off-by: Serge E. Hallyn
    Acked-by: Pavel Emelianov
    Cc: Herbert Poetzl
    Cc: Kirill Korotaev
    Cc: Eric W. Biederman
    Cc: Chris Wright
    Cc: Stephen Smalley
    Cc: James Morris
    Cc: Andrew Morgan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cedric Le Goater
     

11 May, 2007

3 commits

  • This patch series implements the new signalfd() system call.

    I took part of the original Linus code (and you know how badly it can be
    broken :), and I added even more breakage ;) Signals are fetched from the same
    signal queue used by the process, so signalfd will compete with standard
    kernel delivery in dequeue_signal(). If you want to reliably fetch signals on
    the signalfd file, you need to block them with sigprocmask(SIG_BLOCK). This
    seems to be working fine on my Dual Opteron machine. I made a quick test
    program for it:

    http://www.xmailserver.org/signafd-test.c

    The signalfd() system call implements signal delivery into a file descriptor
    receiver. The signalfd file descriptor is created with the following API:

    int signalfd(int ufd, const sigset_t *mask, size_t masksize);

    The "ufd" parameter allows to change an existing signalfd sigmask, w/out going
    to close/create cycle (Linus idea). Use "ufd" == -1 if you want a brand new
    signalfd file.

    The "mask" allows to specify the signal mask of signals that we are interested
    in. The "masksize" parameter is the size of "mask".

    The signalfd fd supports the poll(2) and read(2) system calls. The poll(2)
    will return POLLIN when signals are available to be dequeued. As a direct
    consequence of supporting the Linux poll subsystem, the signalfd fd can use
    used together with epoll(2) too.

    The read(2) system call will return a "struct signalfd_siginfo" structure in
    the userspace supplied buffer. The return value is the number of bytes copied
    in the supplied buffer, or -1 in case of error. The read(2) call can also
    return 0, in case the sighand structure to which the signalfd was attached,
    has been orphaned. The O_NONBLOCK flag is also supported, and read(2) will
    return -EAGAIN in case no signal is available.

    If the size of the buffer passed to read(2) is lower than sizeof(struct
    signalfd_siginfo), -EINVAL is returned. A read from the signalfd can also
    return -ERESTARTSYS in case a signal hits the process. The format of
    struct signalfd_siginfo is shown below; the valid fields depend on the
    (->code & __SI_MASK) value, in the same way a struct siginfo's would:

    struct signalfd_siginfo {
            __u32 signo;    /* si_signo */
            __s32 err;      /* si_errno */
            __s32 code;     /* si_code */
            __u32 pid;      /* si_pid */
            __u32 uid;      /* si_uid */
            __s32 fd;       /* si_fd */
            __u32 tid;      /* si_tid */
            __u32 band;     /* si_band */
            __u32 overrun;  /* si_overrun */
            __u32 trapno;   /* si_trapno */
            __s32 status;   /* si_status */
            __s32 svint;    /* si_int */
            __u64 svptr;    /* si_ptr */
            __u64 utime;    /* si_utime */
            __u64 stime;    /* si_stime */
            __u64 addr;     /* si_addr */
    };
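
    For a sense of the usage pattern (block the signals with sigprocmask(), then
    read them from the fd), here is a small sketch written against the current
    glibc wrapper; note that there the struct fields are prefixed "ssi_" and the
    wrapper's third argument ended up being a flags word, which differs slightly
    from the proposal quoted above:

    #include <sys/signalfd.h>
    #include <signal.h>
    #include <unistd.h>
    #include <stdio.h>

    int main(void)
    {
            sigset_t mask;
            struct signalfd_siginfo ssi;
            int sfd;

            sigemptyset(&mask);
            sigaddset(&mask, SIGINT);
            /* Block normal delivery so the signal is only fetched via the fd. */
            sigprocmask(SIG_BLOCK, &mask, NULL);

            sfd = signalfd(-1, &mask, 0);
            if (sfd < 0) {
                    perror("signalfd");
                    return 1;
            }

            /* Blocks until SIGINT (Ctrl-C) arrives, then dequeues it. */
            if (read(sfd, &ssi, sizeof(ssi)) == sizeof(ssi))
                    printf("got signal %u from pid %u\n",
                           ssi.ssi_signo, ssi.ssi_pid);

            close(sfd);
            return 0;
    }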

    [akpm@linux-foundation.org: fix signalfd_copyinfo() on i386]
    Signed-off-by: Davide Libenzi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davide Libenzi
     
  • Remove initialization of pgrp and __session in INIT_SIGNALS, as these are
    later set by the call to __set_special_pids() in init/main.c by the patch:

    explicitly-set-pgid-and-sid-of-init-process.patch

    Signed-off-by: Sukadev Bhattiprolu
    Cc: Cedric Le Goater
    Cc: Dave Hansen
    Cc: Serge Hallyn
    Cc: Eric Biederman
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sukadev Bhattiprolu
     
  • Statically initialize a struct pid for the swapper process (pid_t == 0) and
    attach it to init_task. This is needed so task_pid(), task_pgrp() and
    task_session() interfaces work on the swapper process also.

    Signed-off-by: Sukadev Bhattiprolu
    Cc: Cedric Le Goater
    Cc: Dave Hansen
    Cc: Serge Hallyn
    Cc: Eric Biederman
    Cc: Herbert Poetzl
    Acked-by: Eric W. Biederman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sukadev Bhattiprolu
     

10 May, 2007

1 commit

  • This finally renames the thread_info field in the task structure to stack, so
    that the assumptions about this field are gone and archs have more freedom
    about placing the thread_info structure.

    Non-broken archs which have a proper thread pointer can access both the
    current thread and the task structure via a single pointer.

    It'll allow for a few more cleanups of the fork code, from which e.g. ia64
    could benefit.

    Signed-off-by: Roman Zippel
    [akpm@linux-foundation.org: build fix]
    Cc: Richard Henderson
    Cc: Ivan Kokshaysky
    Cc: Russell King
    Cc: Ian Molton
    Cc: Haavard Skinnemoen
    Cc: Mikael Starvik
    Cc: David Howells
    Cc: Yoshinori Sato
    Cc: "Luck, Tony"
    Cc: Hirokazu Takata
    Cc: Geert Uytterhoeven
    Cc: Roman Zippel
    Cc: Greg Ungerer
    Cc: Ralf Baechle
    Cc: Ralf Baechle
    Cc: Paul Mackerras
    Cc: Benjamin Herrenschmidt
    Cc: Heiko Carstens
    Cc: Martin Schwidefsky
    Cc: Paul Mundt
    Cc: Kazumoto Kojima
    Cc: Richard Curnow
    Cc: William Lee Irwin III
    Cc: "David S. Miller"
    Cc: Jeff Dike
    Cc: Paolo 'Blaisorblade' Giarrusso
    Cc: Miles Bader
    Cc: Andi Kleen
    Cc: Chris Zankel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Zippel
     

09 May, 2007

1 commit