20 Apr, 2006

1 commit

  • While we can already walk through thread groups, process groups, and
    sessions with just the rcu_read_lock, this patch opens the door to walking
    the entire task list.

    We already have all of the other RCU guarantees, so there is no cost in
    doing this; it should be enough for proc to stop taking the tasklist lock
    during readdir (see the sketch below).

    prev_task was killed because it has no users, and using it would miss new
    tasks when doing an RCU traversal.

    Signed-off-by: Eric W. Biederman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric W. Biederman
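
    A minimal illustration of the kind of traversal this enables, assuming the
    standard for_each_process() iterator (the consumer function here is
    hypothetical, and the proc changes themselves are not part of this commit):

        struct task_struct *p;

        rcu_read_lock();
        for_each_process(p) {
                /* p's task_struct cannot be freed while we are inside
                 * this RCU read-side critical section, even if p exits */
                examine_task(p);        /* hypothetical consumer */
        }
        rcu_read_unlock();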
     

15 Apr, 2006

1 commit

  • Somehow, in the midst of dotting i's and crossing t's during
    the merge up to rc1, we wound up keeping __put_task_struct_cb
    when it should have been killed, as it no longer has any users.
    Sorry, I probably should have caught this while it was
    still in the -mm tree.

    Having the old code there gets confusing when reading
    through the code and trying to understand what is
    happening.

    Signed-off-by: Eric W. Biederman
    Signed-off-by: Linus Torvalds

    Eric W. Biederman
     

11 Apr, 2006

4 commits

  • * 'splice' of git://brick.kernel.dk/data/git/linux-2.6-block:
    [PATCH] vfs: add splice_write and splice_read to documentation
    [PATCH] Remove sys_ prefix of new syscalls from __NR_sys_*
    [PATCH] splice: warning fix
    [PATCH] another round of fs/pipe.c cleanups
    [PATCH] splice: comment styles
    [PATCH] splice: add Ingo as addition copyright holder
    [PATCH] splice: unlikely() optimizations
    [PATCH] splice: speedups and optimizations
    [PATCH] pipe.c/fifo.c code cleanups
    [PATCH] get rid of the PIPE_*() macros
    [PATCH] splice: speedup __generic_file_splice_read
    [PATCH] splice: add direct fd fd splicing support
    [PATCH] splice: add optional input and output offsets
    [PATCH] introduce a "kernel-internal pipe object" abstraction
    [PATCH] splice: be smarter about calling do_page_cache_readahead()
    [PATCH] splice: optimize the splice buffer mapping
    [PATCH] splice: cleanup __generic_file_splice_read()
    [PATCH] splice: only call wake_up_interruptible() when we really have to
    [PATCH] splice: potential !page dereference
    [PATCH] splice: mark the io page as accessed

    Linus Torvalds
     
  • Before commit 47e65328a7b1cdfc4e3102e50d60faf94ebba7d3, next_thread() took
    a const task_t. Reinstate the const qualifier; getting the next thread
    never changes the current thread (see the sketch below).

    Signed-off-by: Keith Owens
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Keith Owens
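
    A sketch of the prototype change being described (the exact declaration in
    the header may differ):

        extern task_t *next_thread(const task_t *p);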
     
  • This is more efficient for sendfile() emulation. Basically we cache an
    internal private pipe and just use that as the intermediate area for
    pages. Direct splicing is not available from sys_splice(); it is only
    meant to be used for sendfile() emulation (see the userspace sketch below
    for the underlying file -> pipe -> file idea).

    Additional patch from Ingo Molnar to avoid the PIPE_BUFFERS loop at
    exit for the normal fast path.

    Signed-off-by: Jens Axboe

    Jens Axboe
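
    For illustration only, here is the same file -> pipe -> file idea expressed
    with the userspace splice(2) interface; the in-kernel sendfile() emulation
    does this internally with the cached private pipe described above, not with
    this code:

        #define _GNU_SOURCE
        #include <fcntl.h>
        #include <unistd.h>

        /* Copy up to len bytes from in_fd to out_fd through a pipe. */
        static ssize_t copy_via_pipe(int in_fd, int out_fd, size_t len)
        {
                int pfd[2];
                ssize_t total = 0;

                if (pipe(pfd) < 0)
                        return -1;

                while (total < (ssize_t)len) {
                        ssize_t in_pipe = splice(in_fd, NULL, pfd[1], NULL,
                                                 len - total, SPLICE_F_MOVE);
                        if (in_pipe <= 0)
                                break;
                        /* drain everything we just put into the pipe */
                        while (in_pipe > 0) {
                                ssize_t out = splice(pfd[0], NULL, out_fd, NULL,
                                                     in_pipe, SPLICE_F_MOVE);
                                if (out <= 0)
                                        goto out;
                                in_pipe -= out;
                                total += out;
                        }
                }
        out:
                close(pfd[0]);
                close(pfd[1]);
                return total;
        }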
     
  • Oleg Nesterov spotted two interesting bugs with the current de_thread
    code. The simpler one is a long-standing double decrement of
    __get_cpu_var(process_counts) in __unhash_process, caused by
    two processes exiting when only one was created.

    The other is that since we no longer detach from the thread_group list
    it is possible for do_each_thread when run under the tasklist_lock to
    see the same task_struct twice. Once on the task list as a
    thread_group_leader, and once on the thread list of another
    thread.

    The double appearance in do_each_thread can cause a double increment
    of mm_core_waiters in zap_threads resulting in problems later on in
    coredump_wait.

    To remedy those two problems this patch takes the simple approach
    of changing the old thread group leader into a child thread.
    The only routine in release_task that cares is __unhash_process,
    and it can be trivially seen that we handle cleaning up a
    thread group leader properly.

    Since de_thread doesn't change the pid of the exiting leader process,
    and instead shares it with the new leader process, I change
    thread_group_leader to recognize group leadership based on the
    group_leader field and not based on pids (sketched below). This should
    also be slightly cheaper than the existing thread_group_leader macro.

    I performed a quick audit and I couldn't see any user of
    thread_group_leader that cared about the difference.

    Signed-off-by: Eric W. Biederman
    Signed-off-by: Linus Torvalds

    Eric W. Biederman
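
    A sketch of the macro change described above (old form vs. new form; the
    exact spelling in the header may differ):

        /* before: leadership was recognized by pid equality */
        #define thread_group_leader(p)  ((p)->pid == (p)->tgid)

        /* after: leadership follows the group_leader pointer, which stays
         * correct even though de_thread shares the pid with the new leader */
        #define thread_group_leader(p)  ((p) == (p)->group_leader)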
     

01 Apr, 2006

6 commits

  • This simplifies the code, reduces the need for four pid hash tables, and
    makes the code more capable.

    In the discussions I had with Oleg it was felt that, to a large extent,
    the cleanup itself justified the work. With struct pid being dynamically
    allocated, we can create the hash table entry when the pid is allocated
    and free the hash table entry when the pid is freed, instead of playing
    with the hash lists whenever a process attaches or detaches.

    For me, the fact that it gave what my previous task_ref patch gave, for
    free and with simpler code, was a big win. The problem is that if you hold
    a reference to struct task_struct you lock in 10K of low memory. If you do
    that in a user-controllable way, as /proc does, then with an unprivileged
    but hostile user space application with typical resource limits of 1000
    fds and 100 processes I can trigger the OOM killer by consuming all of low
    memory with task structs, on a machine with 1GB of low memory.

    If I instead hold a reference to struct pid which holds a pointer to my
    task_struct, I don't suffer from that problem because struct pid is 2 orders
    of magnitude smaller. In fact struct pid is small enough that most other
    kernel data structures dwarf it, so simply limiting the number of referring
    data structures is enough to prevent exhaustion of low memory.

    This splits the current struct pid into two structures, struct pid and
    struct pid_link, and reduces our number of hash tables from PIDTYPE_MAX to
    just one. struct pid_link is the per-process linkage into the hash tables
    and lives in struct task_struct. struct pid is given an independent
    lifetime, and holds pointers to each of the pid types (roughly as sketched
    below).

    The independent life of struct pid simplifies attach_pid and detach_pid,
    because we are always manipulating the list of pids and not the hash
    table. In addition, giving struct pid an independent life makes the
    concept much more powerful.

    Kernel data structures can now embed a struct pid * instead of a pid_t and
    not suffer from pid wrap around problems or from keeping unnecessarily
    large amounts of memory allocated.

    Signed-off-by: Eric W. Biederman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric W. Biederman
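
    A rough sketch of the split being described (field layout is illustrative,
    not a verbatim copy of the header):

        struct pid
        {
                atomic_t count;                       /* independent lifetime */
                int nr;                               /* the numeric pid value */
                struct hlist_node pid_chain;          /* the single hash table */
                struct hlist_head tasks[PIDTYPE_MAX]; /* lists of attached tasks */
                struct rcu_head rcu;
        };

        struct pid_link
        {
                struct hlist_node node;               /* lives in task_struct */
                struct pid *pid;
        };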
     
  • A big problem with RCU-protected data structures that are also reference
    counted is that you must jump through several hoops to increase the
    reference count. I think someone finally implemented
    atomic_inc_not_zero(&count) to automate the common case. Unfortunately
    this means you must special-case the RCU access path.

    When data structures are only visible via RCU, in a manner that is not
    determined by the reference count on the object (i.e. tasks are visible
    until their zombies are reaped), there is a much simpler technique we can
    employ: simply delay the decrement of the reference count until the RCU
    grace period is over.

    What that means is that the proc code that looks up a task and later
    wants to sleep can now do:

        rcu_read_lock();
        task = find_task_by_pid(some_pid);
        if (task) {
                get_task_struct(task);
        }
        rcu_read_unlock();

    The effect on the rest of the kernel is that put_task_struct becomes
    cheaper and immediate, and in the case where the task has already been
    reaped it frees the task immediately instead of unnecessarily waiting
    until the RCU grace period is over.

    Cleanup of task_struct does not happen when its reference count drops to
    zero, instead cleanup happens when release_task is called. Tasks can only
    be looked up via rcu before release_task is called. All rcu protected
    members of task_struct are freed by release_task.

    Therefore we can move call_rcu from put_task_struct into release_task, and
    we can modify release_task to not immediately release the reference count
    but instead have it call put_task_struct from the function it gives to
    call_rcu (sketched below).

    The end result:

    - get_task_struct is safe in an rcu context where we have just looked
    up the task.

    - put_task_struct() simplifies into its old pre-RCU self.

    This reorganization also makes put_task_struct uncallable from modules, as
    it is not exported, but it does not appear to be called from any modules,
    so this should not be an issue and would be trivially fixed.

    Signed-off-by: Eric W. Biederman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric W. Biederman
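
    A sketch of the resulting arrangement (the callback name is taken from this
    description and may differ in the tree):

        static void delayed_put_task_struct(struct rcu_head *rhp)
        {
                struct task_struct *tsk = container_of(rhp, struct task_struct, rcu);

                put_task_struct(tsk);
        }

        /* in release_task(), instead of dropping the count directly: */
        call_rcu(&p->rcu, delayed_put_task_struct);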
     
  • This just got nuked in mainline. Bring it back because Eric's patches use it.

    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • To increase the strength of SCHED_BATCH as a scheduling hint we can
    activate batch tasks on the expired array since by definition they are
    latency insensitive tasks.

    Signed-off-by: Con Kolivas
    Acked-by: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Con Kolivas
     
  • The activated flag in task_struct is used to track different sleep types,
    and its usage is somewhat obfuscated. Convert the variable to an enum with
    more descriptive names without altering the function (see the sketch
    below).

    Signed-off-by: Con Kolivas
    Acked-by: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Con Kolivas
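
    A sketch of the kind of conversion described, assuming identifiers along
    these lines (the exact names may differ):

        enum sleep_type {
                SLEEP_NORMAL,
                SLEEP_NONINTERACTIVE,
                SLEEP_INTERACTIVE,
                SLEEP_INTERRUPTED,
        };

        /* in struct task_struct, replacing the old 'activated' integer: */
        enum sleep_type sleep_type;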
     
  • Currently, count_active_tasks() calls both nr_running() &
    nr_uninterruptible(). Each of these functions does a "for_each_cpu" &
    reads values from the runqueue of each cpu. Although this is not a lot of
    instructions, each runqueue may be located on a different node. Depending
    on the architecture, a unique TLB entry may be required to access each
    runqueue.

    Since there may be more runqueues than cpu TLB entries, a scan of all
    runqueues can thrash the TLB. Each memory reference incurs a TLB miss &
    refill.

    In addition, the runqueue cacheline that contains nr_running &
    nr_uninterruptible may be evicted from the cache between the two passes.
    This causes unnecessary cache misses.

    Combining nr_running() & nr_uninterruptible() into a single function
    (sketched below) substantially reduces the TLB & cache misses on large
    systems. This should have no measurable effect on smaller systems.

    On a 128p IA64 system running a memory stress workload, the new function
    reduced the overhead of calc_load() from 605 usec/call to 324 usec/call.

    Signed-off-by: Jack Steiner
    Acked-by: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jack Steiner
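
    A minimal sketch of the combined helper described above (treat the exact
    name and details as illustrative):

        unsigned long nr_active(void)
        {
                unsigned long i, running = 0, uninterruptible = 0;

                for_each_online_cpu(i) {
                        running += cpu_rq(i)->nr_running;
                        uninterruptible += cpu_rq(i)->nr_uninterruptible;
                }

                if (unlikely((long)uninterruptible < 0))
                        uninterruptible = 0;

                return running + uninterruptible;
        }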
     

29 Mar, 2006

12 commits

  • Move 'tsk->sighand = NULL' from cleanup_sighand() to __exit_signal(). This
    makes the exit path more understandable and allows us to do
    cleanup_sighand() outside of the ->siglock-protected section.

    Signed-off-by: Oleg Nesterov
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • This patch kills the PIDTYPE_TGID pid_type, thus saving one hash table in
    kernel/pid.c and speeding up subthread creation/destruction a bit. It is
    also a preparation for the further tref/pids rework.

    This patch adds 'struct list_head thread_group' to 'struct task_struct'
    instead (see the sketch below).

    We don't detach the group leader from the PIDTYPE_PID namespace until
    another thread inherits its ->pid == ->tgid, so we are safe with respect
    to a premature free_pidmap(->tgid) call.

    Currently there are no users of find_task_by_pid_type(PIDTYPE_TGID).
    Should the need arise, we can use find_task_by_pid()->group_leader.

    Signed-off-by: Oleg Nesterov
    Acked-By: Eric Biederman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
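
    A sketch of the replacement linkage: struct task_struct gains a
    thread_group list head, and walking the group becomes a plain list walk
    (the exact form of the helper may differ):

        /* in struct task_struct: */
        struct list_head thread_group;

        /* iterating the group no longer goes through a PIDTYPE_TGID hash: */
        #define next_thread(p) \
                list_entry(rcu_dereference((p)->thread_group.next), \
                           struct task_struct, thread_group)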
     
  • __exit_signal() is private to release_task() now. I think it is better to
    make it static in kernel/exit.c and export flush_sigqueue() instead - that
    function is much simpler and more straightforward.

    Signed-off-by: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Cosmetic, rename __exit_sighand to cleanup_sighand and move it close to
    copy_sighand().

    This matches copy_signal/cleanup_signal naming, and I think it is easier to
    follow.

    Signed-off-by: Oleg Nesterov
    Cc: "Eric W. Biederman"
    Acked-by: "Paul E. McKenney"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • __exit_signal() does important cleanups atomically under ->siglock. It is
    also called from copy_process's error path. This is not good; for example,
    we can't move __unhash_process() under ->siglock for that reason.

    We should not mix these two paths - just look at the ugly
    'if (p->sighand)' under the 'bad_fork_cleanup_sighand:' label. For the
    copy_process() case it is sufficient to just back out copy_signal(),
    nothing more.

    Again, nobody can see this task yet. For the CLONE_THREAD case we just
    decrement signal->count; otherwise nobody can see this ->signal and we can
    free it locklessly.

    This patch assumes it is safe to do exit_thread_group_keys() without
    tasklist_lock.

    Signed-off-by: Oleg Nesterov
    Cc: "Eric W. Biederman"
    Acked-by: David Howells
    Signed-off-by: Adrian Bunk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • The only caller of exit_sighand(tsk) is copy_process's error path. We can
    call __exit_sighand() directly and kill exit_sighand().

    This 'tsk' has not yet been registered in pid_hash[] or init_task.tasks;
    it has no external references and nobody can see it, and:

    IF (clone_flags & CLONE_SIGHAND)
            At least 'current' holds a reference to ->sighand, so
            atomic_dec_and_test(sighand->count) can't be true.

    ELSE
            Nobody can see this ->sighand, so we can free it
            without any locking.

    Signed-off-by: Oleg Nesterov
    Cc: "Eric W. Biederman"
    Acked-by: "Paul E. McKenney"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Add a lock_task_sighand() helper and convert group_send_sig_info() to use
    it (a sketch of the helper follows). Hopefully we will have more users
    soon.

    This patch also removes '!sighand->count' and '!p->usage' checks, I think
    they both are bogus, racy and unneeded (but probably it makes sense to
    restore them as BUG_ON()s).

    ->sighand is cleared and its ->count is decremented in release_task() with
    sighand->siglock held, so it is a bug to see '!p->usage || !->count' after
    we have already locked it and verified it is still the same. On the other
    hand, an already dead task without ->sighand can have a non-zero ->usage
    due to ptrace, for example.

    If we read the stale value of ->sighand we must see the change after
    spin_lock(), because that change was done while holding that same old
    ->sighand.siglock.

    Signed-off-by: Oleg Nesterov
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
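
    A sketch of the helper's retry loop as described (details illustrative):

        struct sighand_struct *lock_task_sighand(struct task_struct *tsk,
                                                 unsigned long *flags)
        {
                struct sighand_struct *sighand;

                for (;;) {
                        sighand = rcu_dereference(tsk->sighand);
                        if (unlikely(sighand == NULL))
                                break;

                        spin_lock_irqsave(&sighand->siglock, *flags);
                        if (likely(sighand == tsk->sighand))
                                break;          /* still the same: we own it */
                        spin_unlock_irqrestore(&sighand->siglock, *flags);
                }

                return sighand;
        }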
     
  • This patch borrows a clever trick from Hugh's 'struct anon_vma' code.

    Without tasklist_lock held we can't trust task->sighand until we have
    locked it and re-checked that it is still the same.

    But this means we don't need to defer 'kmem_cache_free(sighand)'. We can
    return the memory to the slab immediately; all we need is to be sure that
    sighand->siglock can't disappear inside an RCU-protected section.

    To do so we need to initialize ->siglock inside the ctor function;
    SLAB_DESTROY_BY_RCU does the rest (see the sketch below).

    Signed-off-by: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
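
    A sketch of the slab setup described (constructor prototype per the
    2.6-era slab API; details may differ from the tree):

        static void sighand_ctor(void *data, struct kmem_cache *cachep,
                                 unsigned long flags)
        {
                struct sighand_struct *sighand = data;

                /* the lock stays valid for the lifetime of the slab object,
                 * which SLAB_DESTROY_BY_RCU keeps around across an RCU grace
                 * period even after a free */
                spin_lock_init(&sighand->siglock);
        }

        sighand_cachep = kmem_cache_create("sighand_cache",
                                sizeof(struct sighand_struct), 0,
                                SLAB_HWCACHE_ALIGN | SLAB_PANIC | SLAB_DESTROY_BY_RCU,
                                sighand_ctor, NULL);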
     
  • fork_idle() does unhash_process() just after copy_process(). By contrast,
    the boot CPU's idle thread explicitly registers itself for each pid_type
    with nr == 0.

    copy_process() already checks p->pid != 0 before process_counts++, so I
    think we can just skip the attach_pid() calls and job control inits for
    idle threads and kill unhash_process(). We don't need to clean up
    ->proc_dentry in fork_idle(), because with this patch idle threads are
    never hashed in kernel/pid.c:pid_hash[].

    We don't need to hash pid == 0 in pidmap_init(). free_pidmap() is never
    called with pid == 0 arg, so it will never be reused. So it is still possible
    to use pid == 0 in any PIDTYPE_xxx namespace from kernel/pid.c's POV.

    However, with this patch we don't hash pid == 0 for the PIDTYPE_PID case.
    We still have PIDTYPE_PGID/PIDTYPE_SID entries with pid == 0: /sbin/init
    and kernel threads which don't call daemonize().

    Signed-off-by: Oleg Nesterov
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Both SET_LINKS() and REMOVE_LINKS() have exactly one caller, and these
    callers already check thread_group_leader().

    This patch kills these macros; they mix two different things: setting a
    process's parent and registering it in the init_task.tasks list. Callers
    are updated to do these actions by hand.

    Signed-off-by: Oleg Nesterov
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • add_parent(p, parent) is always called with parent == p->parent, and it makes
    no sense to do it differently. This patch removes this argument.

    No changes in affected .o files.

    Signed-off-by: Oleg Nesterov
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • The kill_sl function doesn't exist in the kernel, so a prototype for it is
    completely unnecessary.

    Signed-off-by: Eric W. Biederman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric W. Biederman
     

28 Mar, 2006

2 commits

  • Add 32-bit syscall compatibility support for the new robust futex
    syscalls. (This patch also moves all futex-related compat functionality
    into kernel/futex_compat.c.)

    Signed-off-by: Ingo Molnar
    Signed-off-by: Thomas Gleixner
    Signed-off-by: Arjan van de Ven
    Acked-by: Ulrich Drepper
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ingo Molnar
     
  • Add the core infrastructure for robust futexes: structure definitions (see
    the sketch below), the new syscalls and the do_exit()-based cleanup
    mechanism.

    Signed-off-by: Ingo Molnar
    Signed-off-by: Thomas Gleixner
    Signed-off-by: Arjan van de Ven
    Acked-by: Ulrich Drepper
    Cc: Michael Kerrisk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ingo Molnar
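
    The user-visible structures at the heart of the mechanism look roughly
    like this (see the robust-futex ABI documentation for the authoritative
    definitions):

        struct robust_list {
                struct robust_list __user *next;
        };

        /*
         * Per-thread list head, registered via the new syscall; the kernel
         * walks it from do_exit() to mark held futexes as owner-died.
         */
        struct robust_list_head {
                struct robust_list list;
                long futex_offset;
                struct robust_list __user *list_op_pending;
        };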
     

27 Mar, 2006

1 commit

  • The nanosleep cleanup allows us to remove the data field of hrtimer. The
    callback function can use container_of() to get at its own data (see the
    illustration below). Since the hrtimer structure is embedded in other
    structures anyway, this adds no overhead.

    Signed-off-by: Roman Zippel
    Signed-off-by: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Zippel
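
    An illustration of the pattern being described; the surrounding structure
    here is hypothetical, the point is recovering per-user data via
    container_of() on the embedded hrtimer instead of a ->data field:

        struct my_sleeper {
                struct hrtimer timer;           /* embedded, no ->data needed */
                struct task_struct *task;
        };

        static int my_wakeup(struct hrtimer *timer)
        {
                struct my_sleeper *s = container_of(timer, struct my_sleeper, timer);

                wake_up_process(s->task);
                return HRTIMER_NORESTART;
        }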
     

24 Mar, 2006

4 commits

  • Make the softlockup detector purely timer-interrupt driven, removing
    softirq-context (timer) dependencies. This means that if the softlockup
    watchdog triggers, it has truly observed a longer than 10 seconds
    scheduling delay of a SCHED_FIFO prio 99 task.

    (the patch also turns off the softlockup detector during the initial bootup
    phase and does small style fixes)

    Signed-off-by: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ingo Molnar
     
  • The hooks in the slab cache allocator for supporting NUMA mempolicies and
    cpuset memory spreading sit in an important code path. Many systems will
    use neither feature.

    This patch optimizes those hooks down to a single check of some bits in
    the current task's task_struct flags. For non-NUMA systems, this hook and
    the related code are already ifdef'd out.

    The optimization is done by using another task flag, set if the task is
    using a non-default NUMA mempolicy. Taking this flag bit along with the
    PF_SPREAD_PAGE and PF_SPREAD_SLAB flag bits added earlier in this 'cpuset
    memory spreading' patch set, one can check for the combination of any of
    these special-case memory placement mechanisms with a single test of the
    current task's task_struct flags (sketched below).

    This patch also tightens up the code, to save a few bytes of kernel text
    space, and moves some of it out of line. Due to the nested inlines called
    from multiple places, we were ending up with three copies of this code, which
    once we get off the main code path (for local node allocation) seems a bit
    wasteful of instruction memory.

    Signed-off-by: Paul Jackson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Jackson
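
    A sketch of the single-flag-test fast path described (names as used in
    this patch series; treat the details as illustrative):

        static inline int cpuset_do_page_mem_spread(void)
        {
                return current->flags & PF_SPREAD_PAGE;
        }

        static inline int cpuset_do_slab_mem_spread(void)
        {
                return current->flags & PF_SPREAD_SLAB;
        }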
     
  • This patch provides the implementation and cpuset interface for an alternative
    memory allocation policy that can be applied to certain kinds of memory
    allocations, such as the page cache (file system buffers) and some slab caches
    (such as inode caches).

    The policy is called "memory spreading." If enabled, it spreads out these
    kinds of memory allocations over all the nodes allowed to a task, instead of
    preferring to place them on the node where the task is executing.

    All other kinds of allocations, including anonymous pages for a task's
    stack and data regions, are not affected by this policy choice, and
    continue to be allocated preferring the node local to execution, as
    modified by the NUMA mempolicy.

    There are two boolean flag files per cpuset that control where the kernel
    allocates pages for the file system buffers and related in kernel data
    structures. They are called 'memory_spread_page' and 'memory_spread_slab'.

    If the per-cpuset boolean flag file 'memory_spread_page' is set, then the
    kernel will spread the file system buffers (page cache) evenly over all the
    nodes that the faulting task is allowed to use, instead of preferring to put
    those pages on the node where the task is running.

    If the per-cpuset boolean flag file 'memory_spread_slab' is set, then the
    kernel will spread some file system related slab caches, such as for inodes
    and dentries evenly over all the nodes that the faulting task is allowed to
    use, instead of preferring to put those pages on the node where the task is
    running.

    The implementation is simple. Setting the cpuset flags 'memory_spread_page'
    or 'memory_spread_slab' turns on the per-process flags PF_SPREAD_PAGE or
    PF_SPREAD_SLAB, respectively, for each task that is in the cpuset or
    subsequently joins that cpuset. In subsequent patches, the page allocation
    calls for the affected page cache and slab caches are modified to perform
    an inline check for these flags, and if set, a call to a new routine
    cpuset_mem_spread_node() returns the node to prefer for the allocation.

    The cpuset_mem_spread_node() routine is also simple. It uses the value of
    a per-task rotor, cpuset_mem_spread_rotor, to select the next node in the
    current task's mems_allowed to prefer for the allocation (see the sketch
    below).

    This policy can provide substantial improvements for jobs that need to
    place thread-local data on the corresponding node, but that need to access
    large file system data sets that have to be spread across the several
    nodes in the job's cpuset in order to fit. Without this patch, especially
    for jobs that might have one thread reading in the data set, the memory
    allocation across the nodes in the job's cpuset can become very uneven.

    A couple of Copyright year ranges are updated as well. And a couple of email
    addresses that can be found in the MAINTAINERS file are removed.

    Signed-off-by: Paul Jackson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Jackson
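
    A sketch of the per-task rotor described above (the real routine lives in
    kernel/cpuset.c; details are illustrative):

        int cpuset_mem_spread_node(void)
        {
                int node;

                node = next_node(current->cpuset_mem_spread_rotor,
                                 current->mems_allowed);
                if (node == MAX_NUMNODES)
                        node = first_node(current->mems_allowed);
                current->cpuset_mem_spread_rotor = node;

                return node;
        }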
     
  • Signed-off-by: Jens Axboe

    Jens Axboe
     

12 Mar, 2006

1 commit

  • The patch '[PATCH] RCU signal handling' [1] added an export for
    __put_task_struct_cb, a put_task_struct helper newly introduced in that
    patch. But put_task_struct couldn't be used from modules previously, as
    __put_task_struct wasn't exported. There are no callers of it in modular
    code, and it shouldn't be exported because we don't want drivers to hold
    references to task_structs.

    This patch removes the export and folds __put_task_struct into
    __put_task_struct_cb as there's no other caller.

    [1] http://www2.kernel.org/git/gitweb.cgi?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=e56d090310d7625ecb43a1eeebd479f04affb48b

    Signed-off-by: Christoph Hellwig
    Acked-by: Paul E. McKenney
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     

01 Mar, 2006

1 commit

  • This patch adds mm->task_size to keep track of the task size of a given mm
    and uses that to fix the powerpc vdso so that it uses the mm task size to
    decide what pages to fault in instead of the current thread flags (which
    broke when ptracing).

    (akpm: I expect that mm_struct.task_size will become the way in which we
    finally sort out the confusion between 32-bit processes and 32-bit mm's. It
    may need tweaks, but at this stage this patch is powerpc-only.)

    Signed-off-by: Benjamin Herrenschmidt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Benjamin Herrenschmidt
     

15 Feb, 2006

1 commit

  • Revert commit d7102e95b7b9c00277562c29aad421d2d521c5f6:

    [PATCH] sched: filter affine wakeups

    Apparently caused more than 10% performance regression for aim7 benchmark.
    The setup in use is 16-cpu HP rx8620, 64Gb of memory and 12 MSA1000s with 144
    disks. Each disk is 72Gb with a single ext3 filesystem (courtesy of HP, who
    supplied benchmark results).

    The problem is, for aim7, the wake-up pattern is random, but it still needs
    load balancing action in the wake-up path to achieve best performance. With
    the above commit, lack of load balancing hurts that workload.

    However, for workloads like database transaction processing, the
    requirement is exactly the opposite. In the wake-up path, best performance
    is achieved with absolutely zero load balancing: we simply wake up the
    process on the CPU where it previously ran. Worst performance is obtained
    when we do load balancing at wake-up.

    There isn't an easy way to auto-detect the workload characteristics.
    Ingo's earlier patch, which detects an idle CPU and decides whether to
    load balance or not, doesn't help aim7 either, since all CPUs are busy (it
    causes an even bigger performance regression).

    Revert commit d7102e95b7b9c00277562c29aad421d2d521c5f6, which causes more
    than 10% performance regression with aim7.

    Signed-off-by: Ken Chen
    Acked-by: Ingo Molnar
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chen, Kenneth W
     

10 Feb, 2006

1 commit


19 Jan, 2006

1 commit

  • The TIF_RESTORE_SIGMASK flag allows us to have a generic implementation of
    sys_rt_sigsuspend() (sketched below) instead of duplicating it for each
    architecture. This provides such an implementation and makes arch/powerpc
    use it.

    It also tidies up the ppc32 sys_sigsuspend() to use TIF_RESTORE_SIGMASK.

    Signed-off-by: David Woodhouse
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Woodhouse
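
    A sketch of what the generic implementation looks like with
    TIF_RESTORE_SIGMASK (closely following the kernel/signal.c version of the
    era; treat the details as illustrative):

        asmlinkage long sys_rt_sigsuspend(sigset_t __user *unewset, size_t sigsetsize)
        {
                sigset_t newset;

                if (sigsetsize != sizeof(sigset_t))
                        return -EINVAL;

                if (copy_from_user(&newset, unewset, sizeof(newset)))
                        return -EFAULT;
                sigdelsetmask(&newset, sigmask(SIGKILL) | sigmask(SIGSTOP));

                spin_lock_irq(&current->sighand->siglock);
                current->saved_sigmask = current->blocked;  /* restored on return */
                current->blocked = newset;
                recalc_sigpending();
                spin_unlock_irq(&current->sighand->siglock);

                current->state = TASK_INTERRUPTIBLE;
                schedule();
                set_thread_flag(TIF_RESTORE_SIGMASK);
                return -ERESTARTNOHAND;
        }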
     

15 Jan, 2006

1 commit

  • Add a new SCHED_BATCH (3) scheduling policy: such tasks are presumed
    CPU-intensive and acquire a constant +5 priority level penalty. Such a
    policy is nice for workloads that are non-interactive but do not want to
    give up their nice levels. The policy is also useful for workloads that
    want a deterministic scheduling policy without interactivity causing extra
    preemptions (between that workload's tasks). A user-space usage sketch
    follows.

    Signed-off-by: Ingo Molnar
    Cc: Michael Kerrisk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ingo Molnar
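
    From user space, opting a task into the new policy is an ordinary
    sched_setscheduler() call (a hedged example; it assumes a glibc sched.h
    recent enough to define SCHED_BATCH):

        #define _GNU_SOURCE
        #include <sched.h>

        /* Put the calling task into SCHED_BATCH (static priority must be 0). */
        static int make_batch(void)
        {
                struct sched_param sp = { .sched_priority = 0 };

                return sched_setscheduler(0, SCHED_BATCH, &sp);
        }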
     

13 Jan, 2006

3 commits

  • This patchset annotates arch/* uses of ->thread_info. Ones that really are
    about access to the thread_info of a given process are simply switched to
    task_thread_info(task); ones that deal with access to objects on the stack
    are switched to a new helper - task_stack_page(). A _lot_ of the latter
    are actually open-coded instances of "find where pt_regs are"; those are
    consolidated into task_pt_regs(task) (many architectures actually have
    such a helper already).

    Note that these annotations are not mandatory - any code not converted to
    these helpers still works. However, they clean up a lot of places and have
    actually caught a number of bugs, so converting out-of-tree ports would be
    a good idea...

    As an example of breakage caught by this work, see the i386 pt_regs mess -
    we used to have it open-coded in a bunch of places, and when Stas fixed a
    bug in copy_thread() back in April, the rest was left out of sync. That
    required two follow-up patches (the latest just before 2.6.15) _and_ still
    left the /proc/*/stat eip field broken. Try ps -eo eip on i386 and watch
    the junk...

    This patch:

    New helper - task_stack_page(task). It returns a pointer to the memory
    object containing the task's stack; usually the thread_info of the task
    sits at the beginning of that object (see the sketch below).

    Signed-off-by: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Al Viro
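
    A sketch of the generic forms of the helpers (architectures that place
    thread_info elsewhere define their own versions):

        #define task_thread_info(task)  ((task)->thread_info)
        #define task_stack_page(task)   ((void *)((task)->thread_info))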
     
  • From: Nick Piggin

    Track the last waker CPU, and only consider wakeup-balancing if there's a
    match between current waker CPU and the previous waker CPU. This ensures
    that there is some correlation between two subsequent wakeup events before
    we move the task. Should help random-wakeup workloads on large SMP
    systems, by reducing the migration attempts by a factor of nr_cpus.

    Signed-off-by: Ingo Molnar
    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    akpm@osdl.org
     
  • From: Ingo Molnar

    This is the latest version of the scheduler cache-hot-auto-tune patch.

    The first problem was that detection time scaled with O(N^2), which is
    unacceptable on larger SMP and NUMA systems. To solve this:

    - I've added a 'domain distance' function, which is used to cache
    measurement results. Each distance is only measured once. This means
    that e.g. on NUMA distances of 0, 1 and 2 might be measured, on HT
    distances 0 and 1, and on SMP distance 0 is measured. The code walks
    the domain tree to determine the distance, so it automatically follows
    whatever hierarchy an architecture sets up. This cuts down on the boot
    time significantly and removes the O(N^2) limit. The only assumption
    is that migration costs can be expressed as a function of domain
    distance - this covers the overwhelming majority of existing systems,
    and is a good guess even for more asymmetric systems.

    [ People hacking systems with asymmetries that break this
    assumption (e.g. different CPU speeds) should experiment a bit with
    the cpu_distance() function. Adding a ->migration_distance factor to
    the domain structure would be one possible solution - but let's first
    see the problem systems, if they exist at all. Let's not overdesign. ]

    Another problem was that only a single cache size was used for measuring
    the cost of migration, and most architectures didn't set that variable
    up. Furthermore, a single cache size does not fit NUMA hierarchies with
    L3 caches and does not fit HT setups, where different CPUs will often
    have different 'effective cache sizes'. To solve this problem:

    - Instead of relying on a single cache-size provided by the platform and
    sticking to it, the code now auto-detects the 'effective migration
    cost' between two measured CPUs, via iterating through a wide range of
    cachesizes. The code searches for the maximum migration cost, which
    occurs when the working set of the test-workload falls just below the
    'effective cache size'. I.e. real-life optimized search is done for
    the maximum migration cost, between two real CPUs.

    This, amongst other things, has the positive effect that if e.g. two
    CPUs share an L2/L3 cache, a different (and accurate) migration cost
    will be found than between two CPUs on the same system that don't share
    any caches.

    (The reliable measurement of migration costs is tricky - see the source
    for details.)

    Furthermore i've added various boot-time options to override/tune
    migration behavior.

    Firstly, there's a blanket override for autodetection:

    migration_cost=1000,2000,3000

    will override the depth 0/1/2 values with 1msec/2msec/3msec values.

    Secondly, there's a global factor that can be used to increase (or
    decrease) the autodetected values:

    migration_factor=120

    will increase the autodetected values by 20%. This option is useful to
    tune things in a workload-dependent way - e.g. if a workload is
    cache-insensitive then CPU utilization can be maximized by specifying
    migration_factor=0.

    I've tested the autodetection code quite extensively on x86, on three
    boxes, and the autodetected values look pretty good:

    Dual Celeron (128K L2 cache):

    ---------------------
    migration cost matrix (max_cache_size: 131072, cpu: 467 MHz):
    ---------------------
    [00] [01]
    [00]: - 1.7(1)
    [01]: 1.7(1) -
    ---------------------
    cacheflush times [2]: 0.0 (0) 1.7 (1784008)
    ---------------------

    Here the slow memory subsystem dominates system performance, and even
    though caches are small, the migration cost is 1.7 msecs.

    Dual HT P4 (512K L2 cache):

    ---------------------
    migration cost matrix (max_cache_size: 524288, cpu: 2379 MHz):
    ---------------------
    [00] [01] [02] [03]
    [00]: - 0.4(1) 0.0(0) 0.4(1)
    [01]: 0.4(1) - 0.4(1) 0.0(0)
    [02]: 0.0(0) 0.4(1) - 0.4(1)
    [03]: 0.4(1) 0.0(0) 0.4(1) -
    ---------------------
    cacheflush times [2]: 0.0 (33900) 0.4 (448514)
    ---------------------

    Here it can be seen that there is no migration cost between two HT
    siblings (CPU#0/2 and CPU#1/3 are separate physical CPUs). A fast memory
    system makes inter-physical-CPU migration pretty cheap: 0.4 msecs.

    8-way P3/Xeon [2MB L2 cache]:

    ---------------------
    migration cost matrix (max_cache_size: 2097152, cpu: 700 MHz):
    ---------------------
    [00] [01] [02] [03] [04] [05] [06] [07]
    [00]: - 19.2(1) 19.2(1) 19.2(1) 19.2(1) 19.2(1) 19.2(1) 19.2(1)
    [01]: 19.2(1) - 19.2(1) 19.2(1) 19.2(1) 19.2(1) 19.2(1) 19.2(1)
    [02]: 19.2(1) 19.2(1) - 19.2(1) 19.2(1) 19.2(1) 19.2(1) 19.2(1)
    [03]: 19.2(1) 19.2(1) 19.2(1) - 19.2(1) 19.2(1) 19.2(1) 19.2(1)
    [04]: 19.2(1) 19.2(1) 19.2(1) 19.2(1) - 19.2(1) 19.2(1) 19.2(1)
    [05]: 19.2(1) 19.2(1) 19.2(1) 19.2(1) 19.2(1) - 19.2(1) 19.2(1)
    [06]: 19.2(1) 19.2(1) 19.2(1) 19.2(1) 19.2(1) 19.2(1) - 19.2(1)
    [07]: 19.2(1) 19.2(1) 19.2(1) 19.2(1) 19.2(1) 19.2(1) 19.2(1) -
    ---------------------
    cacheflush times [2]: 0.0 (0) 19.2 (19281756)
    ---------------------

    This one has huge caches and a relatively slow memory subsystem - so the
    migration cost is 19 msecs.

    Signed-off-by: Ingo Molnar
    Signed-off-by: Ashok Raj
    Signed-off-by: Ken Chen
    Cc:
    Signed-off-by: John Hawkes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    akpm@osdl.org