28 Jan, 2008

1 commit


26 Jan, 2008

3 commits

  • Extend group scheduling to also cover the realtime classes. It uses the time
    limiting introduced by the previous patch to allow multiple realtime groups.

    The hard time limit is required to keep behaviour deterministic.

    The algorithms used make the realtime scheduler O(tg), linear scaling wrt the
    number of task groups. This is the worst case behaviour I can't seem to get out
    of, the avg. case of the algorithms can be improved, I focused on correctness
    and worst case.

    [ akpm@linux-foundation.org: move side-effects out of BUG_ON(). ]

    Signed-off-by: Peter Zijlstra
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Move the task_struct members specific to rt scheduling together.
    A future optimization could be to put sched_entity and sched_rt_entity
    into a union.

    Signed-off-by: Peter Zijlstra
    CC: Srivatsa Vaddagiri
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Some RT tasks (particularly kthreads) are bound to one specific CPU.
    It is fairly common for two or more bound tasks to get queued up at the
    same time. Consider, for instance, softirq_timer and softirq_sched. A
    timer goes off in an ISR which schedules softirq_thread to run at RT50.
    Then the timer handler determines that it's time to smp-rebalance the
    system so it schedules softirq_sched to run. So we are in a situation
    where we have two RT50 tasks queued, and the system will go into
    rt-overload condition to request other CPUs for help.

    This causes two problems in the current code:

    1) If a high-priority bound task and a low-priority unbounded task queue
    up behind the running task, we will fail to ever relocate the unbounded
    task because we terminate the search on the first unmovable task.

    2) We spend precious futile cycles in the fast-path trying to pull
    overloaded tasks over. It is therefore optimial to strive to avoid the
    overhead all together if we can cheaply detect the condition before
    overload even occurs.

    This patch tries to achieve this optimization by utilizing the hamming
    weight of the task->cpus_allowed mask. A weight of 1 indicates that
    the task cannot be migrated. We will then utilize this information to
    skip non-migratable tasks and to eliminate uncessary rebalance attempts.

    We introduce a per-rq variable to count the number of migratable tasks
    that are currently running. We only go into overload if we have more
    than one rt task, AND at least one of them is migratable.

    In addition, we introduce a per-task variable to cache the cpus_allowed
    weight, since the hamming calculation is probably relatively expensive.
    We only update the cached value when the mask is updated which should be
    relatively infrequent, especially compared to scheduling frequency
    in the fast path.

    Signed-off-by: Gregory Haskins
    Signed-off-by: Steven Rostedt
    Signed-off-by: Ingo Molnar

    Gregory Haskins
     

20 Oct, 2007

3 commits

  • The pgrp field is not used widely around the kernel so it is now marked as
    deprecated with appropriate comment.

    The initialization of INIT_SIGNALS is trimmed because
    a) they are set to 0 automatically;
    b) gcc cannot properly initialize two anonymous (the second one
    is the one with the session) unions. In this particular case
    to make it compile we'd have to add some field initialized
    right before the .pgrp.

    This is the same patch as the 1ec320afdc9552c92191d5f89fcd1ebe588334ca one
    (from Cedric), but for the pgrp field.

    Some progress report:

    We have to deprecate the pid, tgid, session and pgrp fields on struct
    task_struct and struct signal_struct. The session and pgrp are already
    deprecated. The tgid value is close to being such - the worst known usage
    in in fs/locks.c and audit code. The pid field deprecation is mainly
    blocked by numerous printk-s around the kernel that print the tsk->pid to
    log.

    Signed-off-by: Pavel Emelyanov
    Cc: Oleg Nesterov
    Cc: Sukadev Bhattiprolu
    Cc: Cedric Le Goater
    Cc: Serge Hallyn
    Cc: "Eric W. Biederman"
    Cc: Herbert Poetzl
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Emelyanov
     
  • Since we've switched from using pid->nr to pid->upids->nr some
    fields on struct pid are no longer needed

    Signed-off-by: Pavel Emelyanov
    Cc: Oleg Nesterov
    Cc: Sukadev Bhattiprolu
    Cc: Paul Menage
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Emelyanov
     
  • Since task will be visible from different pid namespaces each of them have to
    be addressed by multiple pids. struct upid is to store the information about
    which id refers to which namespace.

    The constuciton looks like this. Each struct pid carried the reference
    counter and the list of tasks attached to this pid. At its end it has a
    variable length array of struct upid-s. Each struct upid has a numerical id
    (pid itself), pointer to the namespace, this ID is valid in and is hashed into
    a pid_hash for searching the pids.

    The nr and pid_chain fields are kept in struct pid for a while to make kernel
    still work (no patch initialize the upids yet), but it will be removed at the
    end of this series when we switch to upids completely.

    Signed-off-by: Sukadev Bhattiprolu
    Signed-off-by: Pavel Emelyanov
    Cc: Oleg Nesterov
    Cc: Paul Menage
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sukadev Bhattiprolu
     

17 Oct, 2007

2 commits

  • The nslock spinlock is not used in the kernel at all. Remove it.

    Signed-off-by: Pavel Emelyanov
    Acked-by: Serge Hallyn
    Cc: Cedric Le Goater
    Cc: "Eric W. Biederman"
    Cc: Herbert Poetzl
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Emelyanov
     
  • Based on ideas of Andrew:
    http://marc.info/?l=linux-kernel&m=102912915020543&w=2

    Scale the bdi dirty limit inversly with the tasks dirty rate.
    This makes heavy writers have a lower dirty limit than the occasional writer.

    Andrea proposed something similar:
    http://lwn.net/Articles/152277/

    The main disadvantage to his patch is that he uses an unrelated quantity to
    measure time, which leaves him with a workload dependant tunable. Other than
    that the two approaches appear quite similar.

    [akpm@linux-foundation.org: fix warning]
    Signed-off-by: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     

11 Oct, 2007

2 commits

  • When CONFIG_NET=no, init_net is unresolved because net_namespace.c
    is not compiled and the include pull init_net definition.

    This problem was very similar with the ipc namespace where the kernel
    can be compiled with SYSV ipc out.

    This patch fix that defining a macro which simply remove init_net
    initialization from nsproxy namespace aggregator.

    Compiled and booted on qemu-i386 with CONFIG_NET=no and CONFIG_NET=yes.

    Signed-off-by: Daniel Lezcano
    Acked-by: "Eric W. Biederman"
    Signed-off-by: David S. Miller

    Daniel Lezcano
     
  • This is the network namespace from which all which all sockets
    and anything else under user control ultimately get their network
    namespace parameters.

    Signed-off-by: Eric W. Biederman
    Signed-off-by: David S. Miller

    Eric W. Biederman
     

21 Sep, 2007

1 commit

  • This simplifies signalfd code, by avoiding it to remain attached to the
    sighand during its lifetime.

    In this way, the signalfd remain attached to the sighand only during
    poll(2) (and select and epoll) and read(2). This also allows to remove
    all the custom "tsk == current" checks in kernel/signal.c, since
    dequeue_signal() will only be called by "current".

    I think this is also what Ben was suggesting time ago.

    The external effect of this, is that a thread can extract only its own
    private signals and the group ones. I think this is an acceptable
    behaviour, in that those are the signals the thread would be able to
    fetch w/out signalfd.

    Signed-off-by: Davide Libenzi
    Signed-off-by: Linus Torvalds

    Davide Libenzi
     

17 Jul, 2007

1 commit

  • Basically, it will allow a process to unshare its user_struct table,
    resetting at the same time its own user_struct and all the associated
    accounting.

    A new root user (uid == 0) is added to the user namespace upon creation.
    Such root users have full privileges and it seems that theses privileges
    should be controlled through some means (process capabilities ?)

    The unshare is not included in this patch.

    Changes since [try #4]:
    - Updated get_user_ns and put_user_ns to accept NULL, and
    get_user_ns to return the namespace.

    Changes since [try #3]:
    - moved struct user_namespace to files user_namespace.{c,h}

    Changes since [try #2]:
    - removed struct user_namespace* argument from find_user()

    Changes since [try #1]:
    - removed struct user_namespace* argument from find_user()
    - added a root_user per user namespace

    Signed-off-by: Cedric Le Goater
    Signed-off-by: Serge E. Hallyn
    Acked-by: Pavel Emelianov
    Cc: Herbert Poetzl
    Cc: Kirill Korotaev
    Cc: Eric W. Biederman
    Cc: Chris Wright
    Cc: Stephen Smalley
    Cc: James Morris
    Cc: Andrew Morgan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cedric Le Goater
     

11 May, 2007

3 commits

  • This patch series implements the new signalfd() system call.

    I took part of the original Linus code (and you know how badly it can be
    broken :), and I added even more breakage ;) Signals are fetched from the same
    signal queue used by the process, so signalfd will compete with standard
    kernel delivery in dequeue_signal(). If you want to reliably fetch signals on
    the signalfd file, you need to block them with sigprocmask(SIG_BLOCK). This
    seems to be working fine on my Dual Opteron machine. I made a quick test
    program for it:

    http://www.xmailserver.org/signafd-test.c

    The signalfd() system call implements signal delivery into a file descriptor
    receiver. The signalfd file descriptor if created with the following API:

    int signalfd(int ufd, const sigset_t *mask, size_t masksize);

    The "ufd" parameter allows to change an existing signalfd sigmask, w/out going
    to close/create cycle (Linus idea). Use "ufd" == -1 if you want a brand new
    signalfd file.

    The "mask" allows to specify the signal mask of signals that we are interested
    in. The "masksize" parameter is the size of "mask".

    The signalfd fd supports the poll(2) and read(2) system calls. The poll(2)
    will return POLLIN when signals are available to be dequeued. As a direct
    consequence of supporting the Linux poll subsystem, the signalfd fd can use
    used together with epoll(2) too.

    The read(2) system call will return a "struct signalfd_siginfo" structure in
    the userspace supplied buffer. The return value is the number of bytes copied
    in the supplied buffer, or -1 in case of error. The read(2) call can also
    return 0, in case the sighand structure to which the signalfd was attached,
    has been orphaned. The O_NONBLOCK flag is also supported, and read(2) will
    return -EAGAIN in case no signal is available.

    If the size of the buffer passed to read(2) is lower than sizeof(struct
    signalfd_siginfo), -EINVAL is returned. A read from the signalfd can also
    return -ERESTARTSYS in case a signal hits the process. The format of the
    struct signalfd_siginfo is, and the valid fields depends of the (->code &
    __SI_MASK) value, in the same way a struct siginfo would:

    struct signalfd_siginfo {
    __u32 signo; /* si_signo */
    __s32 err; /* si_errno */
    __s32 code; /* si_code */
    __u32 pid; /* si_pid */
    __u32 uid; /* si_uid */
    __s32 fd; /* si_fd */
    __u32 tid; /* si_fd */
    __u32 band; /* si_band */
    __u32 overrun; /* si_overrun */
    __u32 trapno; /* si_trapno */
    __s32 status; /* si_status */
    __s32 svint; /* si_int */
    __u64 svptr; /* si_ptr */
    __u64 utime; /* si_utime */
    __u64 stime; /* si_stime */
    __u64 addr; /* si_addr */
    };

    [akpm@linux-foundation.org: fix signalfd_copyinfo() on i386]
    Signed-off-by: Davide Libenzi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davide Libenzi
     
  • Remove initialization of pgrp and __session in INIT_SIGNALS, as these are
    later set by the call to __set_special_pids() in init/main.c by the patch:

    explicitly-set-pgid-and-sid-of-init-process.patch

    Signed-off-by: Sukadev Bhattiprolu
    Cc: Cedric Le Goater
    Cc: Dave Hansen
    Cc: Serge Hallyn
    Cc: Eric Biederman
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sukadev Bhattiprolu
     
  • Statically initialize a struct pid for the swapper process (pid_t == 0) and
    attach it to init_task. This is needed so task_pid(), task_pgrp() and
    task_session() interfaces work on the swapper process also.

    Signed-off-by: Sukadev Bhattiprolu
    Cc: Cedric Le Goater
    Cc: Dave Hansen
    Cc: Serge Hallyn
    Cc: Eric Biederman
    Cc: Herbert Poetzl
    Cc:
    Acked-by: Eric W. Biederman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sukadev Bhattiprolu
     

10 May, 2007

1 commit

  • This finally renames the thread_info field in task structure to stack, so that
    the assumptions about this field are gone and archs have more freedom about
    placing the thread_info structure.

    Nonbroken archs which have a proper thread pointer can do the access to both
    current thread and task structure via a single pointer.

    It'll allow for a few more cleanups of the fork code, from which e.g. ia64
    could benefit.

    Signed-off-by: Roman Zippel
    [akpm@linux-foundation.org: build fix]
    Cc: Richard Henderson
    Cc: Ivan Kokshaysky
    Cc: Russell King
    Cc: Ian Molton
    Cc: Haavard Skinnemoen
    Cc: Mikael Starvik
    Cc: David Howells
    Cc: Yoshinori Sato
    Cc: "Luck, Tony"
    Cc: Hirokazu Takata
    Cc: Geert Uytterhoeven
    Cc: Roman Zippel
    Cc: Greg Ungerer
    Cc: Ralf Baechle
    Cc: Ralf Baechle
    Cc: Paul Mackerras
    Cc: Benjamin Herrenschmidt
    Cc: Heiko Carstens
    Cc: Martin Schwidefsky
    Cc: Paul Mundt
    Cc: Kazumoto Kojima
    Cc: Richard Curnow
    Cc: William Lee Irwin III
    Cc: "David S. Miller"
    Cc: Jeff Dike
    Cc: Paolo 'Blaisorblade' Giarrusso
    Cc: Miles Bader
    Cc: Andi Kleen
    Cc: Chris Zankel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Zippel
     

09 May, 2007

1 commit


13 Feb, 2007

1 commit

  • Of kernel subsystems that work with pids the tty layer is probably the largest
    consumer. But it has the nice virtue that the assiation with a session only
    lasts until the session leader exits. Which means that no reference counting
    is required. So using struct pid winds up being a simple optimization to
    avoid hash table lookups.

    In the long term the use of pid_nr also ensures that when we have multiple pid
    spaces mixed everything will work correctly.

    Signed-off-by: Eric W. Biederman
    Cc: Alan Cox
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric W. Biederman
     

14 Dec, 2006

1 commit

  • This reverts commit 373beb35cd6b625e0ba4ad98baace12310a26aa8.

    No one is using this identifier yet. The purpose of this identifier is to
    export nsproxy to user space which is wrong. nsproxy is an internal
    implementation optimization, which should keep our fork times from getting
    slower as we increase the number of global namespaces you don't have to
    share.

    Adding a global identifier like this is inappropriate because it makes
    namespaces inherently non-recursive, greatly limiting what we can do with
    them in the future.

    Signed-off-by: Eric W. Biederman
    Cc: Cedric Le Goater
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric W. Biederman
     

11 Dec, 2006

2 commits

  • An fdtable can either be embedded inside a files_struct or standalone (after
    being expanded). When an fdtable is being discarded after all RCU references
    to it have expired, we must either free it directly, in the standalone case,
    or free the files_struct it is contained within, in the embedded case.

    Currently the free_files field controls this behavior, but we can get rid of
    it entirely, as all the necessary information is already recorded. We can
    distinguish embedded and standalone fdtables using max_fds, and if it is
    embedded we can divine the relevant files_struct using container_of().

    Signed-off-by: Vadim Lobanov
    Cc: Christoph Hellwig
    Cc: Al Viro
    Cc: Dipankar Sarma
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vadim Lobanov
     
  • Currently, each fdtable supports three dynamically-sized arrays of data: the
    fdarray and two fdsets. The code allows the number of fds supported by the
    fdarray (fdtable->max_fds) to differ from the number of fds supported by each
    of the fdsets (fdtable->max_fdset).

    In practice, it is wasteful for these two sizes to differ: whenever we hit a
    limit on the smaller-capacity structure, we will reallocate the entire fdtable
    and all the dynamic arrays within it, so any delta in the memory used by the
    larger-capacity structure will never be touched at all.

    Rather than hogging this excess, we shouldn't even allocate it in the first
    place, and keep the capacities of the fdarray and the fdsets equal. This
    patch removes fdtable->max_fdset. As an added bonus, most of the supporting
    code becomes simpler.

    Signed-off-by: Vadim Lobanov
    Cc: Christoph Hellwig
    Cc: Al Viro
    Cc: Dipankar Sarma
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vadim Lobanov
     

09 Dec, 2006

4 commits

  • Add the pid namespace framework to the nsproxy object. The copy of the pid
    namespace only increases the refcount on the global pid namespace,
    init_pid_ns, and unshare is not implemented.

    There is no configuration option to activate or deactivate this feature
    because this not relevant for the moment.

    Signed-off-by: Cedric Le Goater
    Cc: Kirill Korotaev
    Cc: Eric W. Biederman
    Cc: Herbert Poetzl
    Cc: Sukadev Bhattiprolu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cedric Le Goater
     
  • Add an identifier to nsproxy. The default init_ns_proxy has identifier 0 and
    allocated nsproxies are given -1.

    This identifier will be used by a new syscall sys_bind_ns.

    Signed-off-by: Cedric Le Goater
    Cc: Kirill Korotaev
    Cc: Eric W. Biederman
    Cc: Herbert Poetzl
    Cc: Sukadev Bhattiprolu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cedric Le Goater
     
  • Rename 'struct namespace' to 'struct mnt_namespace' to avoid confusion with
    other namespaces being developped for the containers : pid, uts, ipc, etc.
    'namespace' variables and attributes are also renamed to 'mnt_ns'

    Signed-off-by: Kirill Korotaev
    Signed-off-by: Cedric Le Goater
    Cc: Eric W. Biederman
    Cc: Herbert Poetzl
    Cc: Sukadev Bhattiprolu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill Korotaev
     
  • Add an anonymous union and ((deprecated)) to catch direct usage of the
    session field.

    [akpm@osdl.org: fix various missed conversions]
    [jdike@addtoit.com: fix UML bug]
    Signed-off-by: Jeff Dike
    Cc: Cedric Le Goater
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cedric Le Goater
     

08 Dec, 2006

1 commit


02 Oct, 2006

5 commits

  • This patch adds basic IPC namespace functionality to
    IPC utils:
    - init_ipc_ns
    - copy/clone/unshare/free IPC ns
    - /proc preparations

    Signed-off-by: Pavel Emelianov
    Signed-off-by: Kirill Korotaev
    Cc: "Eric W. Biederman"
    Cc: Cedric Le Goater
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill Korotaev
     
  • This patch set allows to unshare IPCs and have a private set of IPC objects
    (sem, shm, msg) inside namespace. Basically, it is another building block of
    containers functionality.

    This patch implements core IPC namespace changes:
    - ipc_namespace structure
    - new config option CONFIG_IPC_NS
    - adds CLONE_NEWIPC flag
    - unshare support

    [clg@fr.ibm.com: small fix for unshare of ipc namespace]
    [akpm@osdl.org: build fix]
    Signed-off-by: Pavel Emelianov
    Signed-off-by: Kirill Korotaev
    Signed-off-by: Cedric Le Goater
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill Korotaev
     
  • This patch defines the uts namespace and some manipulators.
    Adds the uts namespace to task_struct, and initializes a
    system-wide init namespace.

    It leaves a #define for system_utsname so sysctl will compile.
    This define will be removed in a separate patch.

    [akpm@osdl.org: build fix, cleanup]
    Signed-off-by: Serge Hallyn
    Cc: Kirill Korotaev
    Cc: "Eric W. Biederman"
    Cc: Herbert Poetzl
    Cc: Andrey Savochkin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Serge E. Hallyn
     
  • This moves the mount namespace into the nsproxy. The mount namespace count
    now refers to the number of nsproxies point to it, rather than the number of
    tasks. As a result, the unshare_namespace() function in kernel/fork.c no
    longer checks whether it is being shared.

    Signed-off-by: Serge Hallyn
    Cc: Kirill Korotaev
    Cc: "Eric W. Biederman"
    Cc: Herbert Poetzl
    Cc: Andrey Savochkin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Serge E. Hallyn
     
  • This patch adds a nsproxy structure to the task struct. Later patches will
    move the fs namespace pointer into this structure, and introduce a new utsname
    namespace into the nsproxy.

    The vserver and openvz functionality, then, would be implemented in large part
    by virtualizing/isolating more and more resources into namespaces, each
    contained in the nsproxy.

    [akpm@osdl.org: build fix]
    Signed-off-by: Serge Hallyn
    Cc: Kirill Korotaev
    Cc: "Eric W. Biederman"
    Cc: Herbert Poetzl
    Cc: Andrey Savochkin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Serge E. Hallyn
     

04 Jul, 2006

4 commits

  • Do 'make oldconfig' and accept all the defaults for new config options -
    reboot into the kernel and if everything goes well it should boot up fine and
    you should have /proc/lockdep and /proc/lockdep_stats files.

    Typically if the lock validator finds some problem it will print out
    voluminous debug output that begins with "BUG: ..." and which syslog output
    can be used by kernel developers to figure out the precise locking scenario.

    What does the lock validator do? It "observes" and maps all locking rules as
    they occur dynamically (as triggered by the kernel's natural use of spinlocks,
    rwlocks, mutexes and rwsems). Whenever the lock validator subsystem detects a
    new locking scenario, it validates this new rule against the existing set of
    rules. If this new rule is consistent with the existing set of rules then the
    new rule is added transparently and the kernel continues as normal. If the
    new rule could create a deadlock scenario then this condition is printed out.

    When determining validity of locking, all possible "deadlock scenarios" are
    considered: assuming arbitrary number of CPUs, arbitrary irq context and task
    context constellations, running arbitrary combinations of all the existing
    locking scenarios. In a typical system this means millions of separate
    scenarios. This is why we call it a "locking correctness" validator - for all
    rules that are observed the lock validator proves it with mathematical
    certainty that a deadlock could not occur (assuming that the lock validator
    implementation itself is correct and its internal data structures are not
    corrupted by some other kernel subsystem). [see more details and conditionals
    of this statement in include/linux/lockdep.h and
    Documentation/lockdep-design.txt]

    Furthermore, this "all possible scenarios" property of the validator also
    enables the finding of complex, highly unlikely multi-CPU multi-context races
    via single single-context rules, increasing the likelyhood of finding bugs
    drastically. In practical terms: the lock validator already found a bug in
    the upstream kernel that could only occur on systems with 3 or more CPUs, and
    which needed 3 very unlikely code sequences to occur at once on the 3 CPUs.
    That bug was found and reported on a single-CPU system (!). So in essence a
    race will be found "piecemail-wise", triggering all the necessary components
    for the race, without having to reproduce the race scenario itself! In its
    short existence the lock validator found and reported many bugs before they
    actually caused a real deadlock.

    To further increase the efficiency of the validator, the mapping is not per
    "lock instance", but per "lock-class". For example, all struct inode objects
    in the kernel have inode->inotify_mutex. If there are 10,000 inodes cached,
    then there are 10,000 lock objects. But ->inotify_mutex is a single "lock
    type", and all locking activities that occur against ->inotify_mutex are
    "unified" into this single lock-class. The advantage of the lock-class
    approach is that all historical ->inotify_mutex uses are mapped into a single
    (and as narrow as possible) set of locking rules - regardless of how many
    different tasks or inode structures it took to build this set of rules. The
    set of rules persist during the lifetime of the kernel.

    To see the rough magnitude of checking that the lock validator does, here's a
    portion of /proc/lockdep_stats, fresh after bootup:

    lock-classes: 694 [max: 2048]
    direct dependencies: 1598 [max: 8192]
    indirect dependencies: 17896
    all direct dependencies: 16206
    dependency chains: 1910 [max: 8192]
    in-hardirq chains: 17
    in-softirq chains: 105
    in-process chains: 1065
    stack-trace entries: 38761 [max: 131072]
    combined max dependencies: 2033928
    hardirq-safe locks: 24
    hardirq-unsafe locks: 176
    softirq-safe locks: 53
    softirq-unsafe locks: 137
    irq-safe locks: 59
    irq-unsafe locks: 176

    The lock validator has observed 1598 actual single-thread locking patterns,
    and has validated all possible 2033928 distinct locking scenarios.

    More details about the design of the lock validator can be found in
    Documentation/lockdep-design.txt, which can also found at:

    http://redhat.com/~mingo/lockdep-patches/lockdep-design.txt

    [bunk@stusta.de: cleanups]
    Signed-off-by: Ingo Molnar
    Signed-off-by: Arjan van de Ven
    Signed-off-by: Adrian Bunk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ingo Molnar
     
  • Accurate hard-IRQ-flags and softirq-flags state tracing.

    This allows us to attach extra functionality to IRQ flags on/off
    events (such as trace-on/off).

    Signed-off-by: Ingo Molnar
    Signed-off-by: Arjan van de Ven
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ingo Molnar
     
  • Locking init improvement:

    - introduce and use __SPIN_LOCK_UNLOCKED for array initializations,
    to pass in the name string of locks, used by debugging

    Signed-off-by: Ingo Molnar
    Signed-off-by: Arjan van de Ven
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ingo Molnar
     
  • Generic lock debugging:

    - generalized lock debugging framework. For example, a bug in one lock
    subsystem turns off debugging in all lock subsystems.

    - got rid of the caller address passing (__IP__/__IP_DECL__/etc.) from
    the mutex/rtmutex debugging code: it caused way too much prototype
    hackery, and lockdep will give the same information anyway.

    - ability to do silent tests

    - check lock freeing in vfree too.

    - more finegrained debugging options, to allow distributions to
    turn off more expensive debugging features.

    There's no separate 'held mutexes' list anymore - but there's a 'held locks'
    stack within lockdep, which unifies deadlock detection across all lock
    classes. (this is independent of the lockdep validation stuff - lockdep first
    checks whether we are holding a lock already)

    Here are the current debugging options:

    CONFIG_DEBUG_MUTEXES=y
    CONFIG_DEBUG_LOCK_ALLOC=y

    which do:

    config DEBUG_MUTEXES
    bool "Mutex debugging, basic checks"

    config DEBUG_LOCK_ALLOC
    bool "Detect incorrect freeing of live mutexes"

    Signed-off-by: Ingo Molnar
    Signed-off-by: Arjan van de Ven
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ingo Molnar
     

28 Jun, 2006

2 commits

  • Core functions for the rt-mutex subsystem.

    Signed-off-by: Ingo Molnar
    Signed-off-by: Thomas Gleixner
    Signed-off-by: Arjan van de Ven
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ingo Molnar
     
  • Add framework to boost/unboost the priority of RT tasks.

    This consists of:

    - caching the 'normal' priority in ->normal_prio
    - providing a functions to set/get the priority of the task
    - make sched_setscheduler() aware of boosting

    The effective_prio() cleanups also fix a priority-calculation bug pointed out
    by Andrey Gelman, in set_user_nice().

    has_rt_policy() fix: Peter Williams

    Signed-off-by: Ingo Molnar
    Signed-off-by: Thomas Gleixner
    Signed-off-by: Arjan van de Ven
    Cc: Andrey Gelman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ingo Molnar
     

27 Jun, 2006

1 commit

  • To keep the dcache from filling up with dead /proc entries we flush them on
    process exit. However over the years that code has gotten hairy with a
    dentry_pointer and a lock in task_struct and misdocumented as a correctness
    feature.

    I have rewritten this code to look and see if we have a corresponding entry in
    the dcache and if so flush it on process exit. This removes the extra fields
    in the task_struct and allows me to trivially handle the case of a
    /proc//task/ entry as well as the current /proc/ entries.

    Signed-off-by: Eric W. Biederman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric W. Biederman
     

29 Mar, 2006

1 commit

  • daemonize() calls set_special_pids(1,1), while init and kernel threads spawned
    from init/main.c:init() run with 0,0 special pids. This patch changes
    INIT_SIGNALS() so that that they run with ->pgrp == ->session == 1 also. This
    patch relies on fact that swapper's pid == 1.

    Now we have no hashed zero pids in pid_hash[].

    User-space visibible change is that now /sbin/init runs with (1,1) special
    pids and becomes a session leader.

    Quoting Eric W. Biederman:
    >
    > daemonize consuming pids (1,1) then consumes pgrp 1. So that when
    > /sbin/init calls setsid() it thinks /sbin/init is a process group
    > leader and setsid() fails. So /sbin/init wants pgrp 1 session 1
    > but doesn't get it. I am pretty certain daemonize did not exist so
    > /sbin/init got pgrp 1 session 1 in 2.4.
    >
    > That is the bug that is being fixed.
    >
    > This patch takes things one step farther and essentially calls
    > setsid() for pid == 1 before init is execed. That is new behavior
    > but it cleans up the kernel as we now do not need to support the
    > case of a process without a process group or a session.
    >
    > The only process that could have possibly cared was /sbin/init
    > and it already calls setsid() because it doesn't want that.
    >
    > If this was going to break anything noticeable the change in behavior
    > from 2.4 to 2.6 would have already done that.

    Signed-off-by: Oleg Nesterov
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov