12 Mar, 2006

1 commit

  • The patch '[PATCH] RCU signal handling' [1] added an export for
    __put_task_struct_cb, a put_task_struct helper newly introduced in that
    patch. But put_task_struct couldn't be used from modular code previously,
    as __put_task_struct wasn't exported. There are no callers of it in
    modular code, and it shouldn't be exported anyway, because we don't want
    drivers to hold references to task_structs.

    This patch removes the export and folds __put_task_struct into
    __put_task_struct_cb, as there is no other caller; a rough sketch of the
    resulting function is shown after this entry.

    [1] http://www2.kernel.org/git/gitweb.cgi?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=e56d090310d7625ecb43a1eeebd479f04affb48b

    Signed-off-by: Christoph Hellwig
    Acked-by: Paul E. McKenney
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     
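    A minimal sketch of what the fold looks like, assuming the RCU head added
    by [1] is the task_struct's rcu field; the helper calls in the body stand
    in for whatever __put_task_struct used to do and are not the exact kernel
    code:

        /* sketch only: the RCU callback now carries the freeing work itself */
        void __put_task_struct_cb(struct rcu_head *rhp)
        {
                struct task_struct *tsk = container_of(rhp, struct task_struct, rcu);

                /* body formerly living in __put_task_struct() */
                WARN_ON(atomic_read(&tsk->usage));
                security_task_free(tsk);
                free_uid(tsk->user);
                put_group_info(tsk->group_info);
                free_task(tsk);
        }

    Note that no EXPORT_SYMBOL accompanies it any more, matching the removal
    described above.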

09 Mar, 2006

3 commits

  • I have benchmarked this on an x86_64 NUMA system and see no significant
    performance difference on kernbench. Tested on both x86_64 and powerpc.

    The way we do file struct accounting is not very suitable for batched
    freeing. For scalability reasons, file accounting was
    constructor/destructor based. This meant that nr_files was decremented
    only when the object was removed from the slab cache. This is susceptible
    to slab fragmentation. With the RCU-based file structure, the consequent
    batched freeing, and a test program like Serge's, we just speed this up
    and end up with a very fragmented slab -

    llm22:~ # cat /proc/sys/fs/file-nr
    587730 0 758844

    At the same time, I see only 2000+ objects in the filp cache. The
    following patch fixes this problem.

    This patch changes the file counting by removing the filp_count_lock.
    Instead we use a separate per-cpu counter, nr_files, for now, and all
    accesses to it go through the get_nr_files() API (a sketch follows this
    entry). In the sysctl handler for nr_files, we populate
    files_stat.nr_files before returning to userspace.

    Counting files as and when they are created and destroyed (as opposed to
    inside the slab) allows us to correctly count open files with RCU.

    Signed-off-by: Dipankar Sarma
    Cc: "Paul E. McKenney"
    Cc: "David S. Miller"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dipankar Sarma
     
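    A minimal sketch of the accounting scheme described above, assuming the
    percpu_counter API of that era; the two file_account_* helpers are
    hypothetical names for wherever the increment/decrement ends up:

        #include <linux/percpu_counter.h>

        /* one system-wide counter, cheap to update from any CPU */
        static struct percpu_counter nr_files;

        /* all readers go through this; an approximate sum is good enough */
        int get_nr_files(void)
        {
                return percpu_counter_read_positive(&nr_files);
        }

        /* called when a file struct is really allocated ... */
        static inline void file_account_alloc(void)
        {
                percpu_counter_inc(&nr_files);
        }

        /* ... and when it is finally freed (e.g. from the RCU callback),
         * instead of from the slab constructor/destructor */
        static inline void file_account_free(void)
        {
                percpu_counter_dec(&nr_files);
        }

    The sysctl handler can then copy get_nr_files() into files_stat.nr_files
    just before handing the structure to userspace.
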
  • This patch adds new tunables for RCU queue and finished batches. There are
    two types of controls - the number of completed RCU updates invoked in a
    batch (blimit) and monitoring for a high rate of incoming RCUs on a cpu
    (qhimark, qlowmark). A sketch of the logic follows this entry.

    By default, the per-cpu batch limit is set to a small value. If the
    incoming RCU rate exceeds the high watermark, we do two things - force a
    quiescent state on all cpus and set the batch limit of the CPU to INT_MAX.
    Setting the batch limit to INT_MAX forces all finished RCUs to be
    processed in one shot. If we have more than INT_MAX RCUs queued up, then
    we have bigger problems anyway. Once the queued incoming RCUs fall below
    the low watermark, the batch limit is set back to the default.

    Signed-off-by: Dipankar Sarma
    Cc: "Paul E. McKenney"
    Cc: "David S. Miller"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dipankar Sarma
     
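    Roughly, the control flow described above looks like this; the struct
    fields and the helpers have_finished_callbacks() and
    invoke_next_callback() are placeholders for the real rcupdate code, so
    treat it as a sketch rather than the actual patch:

        static int blimit   = 10;     /* default: callbacks invoked per batch */
        static int qhimark  = 10000;  /* high watermark of queued callbacks per CPU */
        static int qlowmark = 100;    /* low watermark */

        /* enqueue side (call_rcu): watch for a flood of incoming callbacks */
        static void rcu_check_flood(struct rcu_data *rdp)
        {
                if (unlikely(++rdp->qlen > qhimark)) {
                        rdp->blimit = INT_MAX;          /* drain everything pending */
                        force_quiescent_state(rdp);     /* push all CPUs through a QS */
                }
        }

        /* softirq side: invoke at most rdp->blimit finished callbacks */
        static void rcu_do_batch(struct rcu_data *rdp)
        {
                int count = 0;

                while (have_finished_callbacks(rdp) && count++ < rdp->blimit)
                        invoke_next_callback(rdp);

                if (rdp->blimit == INT_MAX && rdp->qlen <= qlowmark)
                        rdp->blimit = blimit;           /* flood over, back to default */
        }
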
  • Idle threads should have a sane ->timestamp value, to prevent the init
    kernel thread(s) from inheriting it and causing miscalculations in
    try_to_wake_up().

    Reported-by: Mike Galbraith
    Signed-off-by: Ingo Molnar
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ingo Molnar
     

07 Mar, 2006

3 commits

  • Add a compiler barrier so that we don't read jiffies before updating
    jiffies_64 (see the sketch after this entry).

    Signed-off-by: Atsushi Nemoto
    Cc: Ralf Baechle
    Cc: Paul Mackerras
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Atsushi Nemoto
     
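    Paraphrased (not the literal diff), the change looks like this:

        void do_timer(struct pt_regs *regs)
        {
                jiffies_64++;
                /* prevent the compiler from loading jiffies (the low word of
                 * jiffies_64 on 32-bit) before the store above is emitted */
                barrier();
                update_times();
        }
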
  • Also from Thomas Gleixner

    Function next_timer_interrupt() got broken by a recent patch,
    6ba1b91213e81aa92b5cf7539f7d2a94ff54947c, when sys_nanosleep() was moved
    to hrtimers. This broke things because next_timer_interrupt() did not
    check the hrtimer tree for the next event (see the sketch after this
    entry).

    Function next_timer_interrupt() is needed by dyntick (CONFIG_NO_IDLE_HZ,
    VST) implementations, as the system can be idle when the next hrtimer
    event is supposed to happen. At least ARM and S390 currently use
    next_timer_interrupt().

    Signed-off-by: Thomas Gleixner
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Cc: Russell King
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tony Lindgren
     
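    The gist of the fix, as a sketch: besides the classic timer wheel,
    consult the hrtimer base for its earliest expiry. The split into a
    __next_timer_interrupt() helper is an assumption made here for brevity;
    hrtimer_get_next_event() returns the delta to the next hrtimer expiry.

        unsigned long next_timer_interrupt(void)
        {
                unsigned long expires = __next_timer_interrupt(); /* timer wheel */
                ktime_t hr_delta = hrtimer_get_next_event();

                if (hr_delta.tv64 != KTIME_MAX) {
                        /* an hrtimer may fire before the next wheel timer */
                        struct timespec tsdelta = ktime_to_timespec(hr_delta);
                        unsigned long hr_expires =
                                jiffies + timespec_to_jiffies(&tsdelta);

                        if (time_before(hr_expires, expires))
                                expires = hr_expires;
                }
                return expires;
        }
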
  • Just to be safe, we should not trigger a conditional reschedule during
    the early boot sequence. We've historically done some questionable things
    early on, and the safety warnings in __might_sleep() are generally turned
    off during that period, so there might be problems lurking. A sketch of
    the guard follows this entry.

    This affects CONFIG_PREEMPT_VOLUNTARY, which takes over might_sleep() to
    cause a voluntary conditional reschedule.

    Acked-by: Ingo Molnar
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     
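    A sketch of the kind of guard this implies (the exact condition used is
    an assumption here):

        int __sched cond_resched(void)
        {
                /* don't reschedule voluntarily until boot is done; early boot
                 * runs with the might_sleep() checks disabled anyway */
                if (need_resched() && system_state == SYSTEM_RUNNING) {
                        __cond_resched();
                        return 1;
                }
                return 0;
        }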

03 Mar, 2006

2 commits

  • On some platforms readq performs additional work to make sure I/O is done
    in a coherent way. This is not needed for time retrieval as done by the
    time interpolator, so we can use readq_relaxed instead, which improves
    performance (see the sketch after this entry).

    This affects only sparc64 and ia64. Apparently it makes a significant
    difference on ia64.

    Signed-off-by: Christoph Lameter
    Cc: john stultz
    Cc: "David S. Miller"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
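    For the MMIO time sources this boils down to something like the following
    sketch; the field and constant names approximate the time interpolator
    code of that era:

        static inline u64 time_interpolator_get_counter_value(void)
        {
                switch (time_interpolator->source) {
                case TIME_SOURCE_MMIO64:
                        /* no ordering against other I/O needed, just the count */
                        return readq_relaxed((void __iomem *)time_interpolator->addr);
                case TIME_SOURCE_MMIO32:
                        return readl_relaxed((void __iomem *)time_interpolator->addr);
                default:
                        return get_cycles();
                }
        }
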
  • The acpi_video_flags variable is an unsigned long, so its sysctl entry
    should be set up as such (see the sketch after this entry). This actually
    matters on x86-64.

    Signed-off-by: Stefan Seyfried
    Signed-off-by: Pavel Machek
    Cc: "Brown, Len"
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Stefan Seyfried
     
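    In sysctl terms the fix amounts to using the unsigned-long handler and
    length for that table entry; roughly (a sketch, not the literal entry):

        {
                .ctl_name     = KERN_ACPI_VIDEO_FLAGS,
                .procname     = "acpi_video_flags",
                .data         = &acpi_video_flags,
                .maxlen       = sizeof(unsigned long),    /* not sizeof(int) */
                .mode         = 0644,
                .proc_handler = &proc_doulongvec_minmax,  /* not proc_dointvec */
        },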

01 Mar, 2006

1 commit

  • Allow the sysadmin to disable all warnings about userland apps making
    unaligned accesses by using:
    # echo 1 > /proc/sys/kernel/ignore-unaligned-usertrap
    rather than having to use prctl on a process-by-process basis.

    The default behaviour leaves the warnings enabled.

    Signed-off-by: Jes Sorensen
    Signed-off-by: Tony Luck

    Jes Sorensen
     

18 Feb, 2006

4 commits

  • Restore compatibility with the older code and make it possible to suspend
    if the kernel command line doesn't contain the "resume=" argument.

    Signed-off-by: Rafael J. Wysocki
    Cc: Pavel Machek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rafael J. Wysocki
     
  • Heiko Carstens wrote:

    The boot sequence on s390 sometimes takes ages and we spend a very long
    time (up to one or two minutes) in calibrate_migration_costs. The time
    spent there differs from boot to boot. Also the calculated costs differ
    a lot. I've seen differences by up to a factor of 15 (yes, factor not
    percent). Also I doubt that making these measurements makes much sense on
    a completely virtualized architecture where you cannot tell how much cpu
    time you will get anyway.

    So introduce the CONFIG_DEFAULT_MIGRATION_COST method for an architecture
    to set the scheduler migration costs (a sketch follows this entry). This
    turns off automatic detection of migration costs. It makes sense on
    virtual platforms, where migration costs are hard to measure accurately.

    Signed-off-by: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ingo Molnar
     
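    Conceptually the knob looks like this (a sketch under the assumption that
    the per-distance cost table is simply pre-seeded instead of measured):

        #ifdef CONFIG_DEFAULT_MIGRATION_COST
        /* the architecture supplies a fixed cost (ns); skip boot-time probing */
        static unsigned long long migration_cost[MAX_DOMAIN_DISTANCE] =
                { [0 ... MAX_DOMAIN_DISTANCE - 1] = CONFIG_DEFAULT_MIGRATION_COST };
        #else
        /* -1 means: measure the cost at boot, as before */
        static unsigned long long migration_cost[MAX_DOMAIN_DISTANCE] =
                { [0 ... MAX_DOMAIN_DISTANCE - 1] = -1LL };
        #endif
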
  • This provides an interface for arch code to find out how many
    nanoseconds are going to be added on to xtime by the next call to
    do_timer. The value returned is a fixed-point number in 52.12 format, in
    nanoseconds. The reason for this format is that it gives the full
    precision that the timekeeping code is using internally (a sketch of how
    arch code might consume it follows this entry).

    The motivation for this is to fix a problem that has arisen on 32-bit
    powerpc in that the value returned by do_gettimeofday drifts apart
    from xtime if NTP is being used. PowerPC is now using a lockless
    do_gettimeofday based on reading the timebase register and performing
    some simple arithmetic. (This method of getting the time is also
    exported to userspace via the VDSO.) However, the factor and offset
    it uses were calculated based on the nominal tick length and weren't
    being adjusted when NTP varied the tick length.

    Note that 64-bit powerpc has had the lockless do_gettimeofday for a
    long time now. It also had an extremely hairy routine that got called
    from the 32-bit compat routine for adjtimex, which adjusted the
    factor and offset according to what it thought the timekeeping code
    was going to do. Not only was this only called if a 32-bit task did
    adjtimex (i.e. not if a 64-bit task did adjtimex), it was also
    duplicating computations from kernel/timer.c and it wasn't clear that
    it was (still) correct.

    The simple solution is to ask the timekeeping code how long the
    current jiffy will be on each timer interrupt, after calling
    do_timer. If this jiffy will be a different length from the last one,
    we then need to compute new values for the factor and offset used in
    the lockless do_gettimeofday. In this way we can keep xtime and
    do_gettimeofday in sync, even when NTP is varying the tick length.

    Note that when adjtimex varies the tick length, it almost always
    introduces the variation from the next tick on. The only case I could
    see where adjtimex would vary the length of the current tick is when
    an old-style adjtime adjustment is being cancelled. (It's not clear
    to me why the adjustment has to be cancelled immediately rather than
    from the next tick on.) Thus I don't see any real need for a hook in
    adjtimex; the rare case of an old-style adjustment being cancelled can
    be fixed up at the next tick.

    Signed-off-by: Paul Mackerras
    Acked-by: john stultz
    Signed-off-by: Linus Torvalds

    Paul Mackerras
     
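    A sketch of how an architecture might consume the value; the function
    name current_tick_length() is from the patch, while the arch hook and the
    recalc_gtod_scale() helper below are hypothetical:

        void arch_timer_tick_hook(void)               /* hypothetical hook */
        {
                static u64 last_tick_len;
                u64 tick_len = current_tick_length(); /* ns in 52.12 fixed point */

                if (tick_len != last_tick_len) {
                        /* NTP changed the tick length: recompute the factor and
                         * offset used by the lockless do_gettimeofday()/VDSO path */
                        recalc_gtod_scale(tick_len);  /* hypothetical helper */
                        last_tick_len = tick_len;
                }
        }
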
  • AMD SimNow!'s JIT doesn't like them at all in the guest. For distribution
    installation it's easiest if it's a boot-time option.

    Also, I moved the variable to a more appropriate place and made it
    independent of sysctl, and marked it __read_mostly, which it is.

    Signed-off-by: Andi Kleen
    Signed-off-by: Linus Torvalds

    Andi Kleen
     

16 Feb, 2006

5 commits

  • I get about 88 squillion of these when suspending an old ad450nx server.

    Cc: Pavel Roskin
    Cc: "Rafael J. Wysocki"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • Fix a latent bug in cpuset_exit() handling. If a task tried to allocate
    memory after calling cpuset_exit(), it oops'd in
    cpuset_update_task_memory_state() on a NULL cpuset pointer.

    So set the exiting task's cpuset to the root cpuset instead of to NULL.

    A distro kernel hit this with an added kernel package that had just such a
    hook (allocating memory) in the exit code path.

    Signed-off-by: Paul Jackson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Jackson
     
  • 1. The tracee can go from ptrace_stop() to do_signal_stop()
    after __ptrace_unlink(p).

    2. It is unsafe to __ptrace_unlink(p) while p->parent may wait
    for tasklist_lock in ptrace_detach().

    Signed-off-by: Oleg Nesterov
    Cc: Roland McGrath
    Cc: Ingo Molnar
    Cc: Christoph Hellwig
    Cc: Eric W. Biederman
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • copy_process:

    attach_pid(p, PIDTYPE_PID, p->pid);
    attach_pid(p, PIDTYPE_TGID, p->tgid);

    What if kill_proc_info(p->pid) happens in between?

    copy_process() holds current->sighand.siglock, so we are safe
    in CLONE_THREAD case, because current->sighand == p->sighand.

    Otherwise, p->sighand is unlocked, the new process is already
    visible to find_task_by_pid(), but it has a copy of the parent's
    'struct pid' in ->pids[PIDTYPE_TGID].

    This means that __group_complete_signal() may hang while doing

    do ... while (next_thread() != p)

    We can solve this problem if we reverse these 2 attach_pid()s:

    attach_pid() does wmb()

    group_send_sig_info() calls spin_lock(), which
    provides a read barrier. // Yes ?

    I don't think we can hit this race in practice, but still.

    Signed-off-by: Oleg Nesterov
    Cc: Roland McGrath
    Cc: Ingo Molnar
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • There is a window after copy_process() unlocks ->sighand.siglock
    and before it adds the new thread to the thread list.

    In that window __group_complete_signal(SIGKILL) will not see the
    new thread yet, so this thread will start running while the whole
    thread group was supposed to exit.

    I believe we have another good reason to place attach_pid(PID/TGID)
    under ->sighand.siglock. We can do the same for

    release_task()->__unhash_process()

    de_thread()->switch_exec_pids()

    After that we don't need tasklist_lock to iterate over the thread
    list, and we can simplify things, see for example do_sigaction()
    or sys_times().

    Signed-off-by: Oleg Nesterov
    Cc: Roland McGrath
    Cc: Ingo Molnar
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

15 Feb, 2006

3 commits

  • CONFIG_TIME_LOW_RES is a temporary way for architectures to signal that
    they simply return xtime in do_gettimeoffset(). In this corner case we
    want to round up by the resolution when starting a relative timer, to
    avoid short timeouts (see the sketch after this entry). This will go away
    with the GTOD framework.

    Signed-off-by: Ingo Molnar
    Cc: Roman Zippel
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ingo Molnar
     
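    The rounding itself is a small addition in the relative-timer start path,
    roughly (an excerpt-style sketch, not the literal diff):

        /* in hrtimer_start(), after a relative expiry is made absolute */
        if (mode == HRTIMER_REL) {
                tim = ktime_add(tim, base->get_time());
        #ifdef CONFIG_TIME_LOW_RES
                /* do_gettimeoffset() just returns xtime here, so round the
                 * expiry up by one resolution unit to avoid short timeouts */
                tim = ktime_add(tim, base->resolution);
        #endif
        }
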
  • Revert commit d7102e95b7b9c00277562c29aad421d2d521c5f6:

    [PATCH] sched: filter affine wakeups

    Apparently it caused a more than 10% performance regression for the aim7
    benchmark.
    The setup in use is 16-cpu HP rx8620, 64Gb of memory and 12 MSA1000s with 144
    disks. Each disk is 72Gb with a single ext3 filesystem (courtesy of HP, who
    supplied benchmark results).

    The problem is, for aim7, the wake-up pattern is random, but it still needs
    load balancing action in the wake-up path to achieve best performance. With
    the above commit, lack of load balancing hurts that workload.

    However, for workloads like database transaction processing, the requirement
    is exactly opposite. In the wake up path, best performance is achieved with
    absolutely zero load balancing. We simply wake up the process on the CPU
    it previously ran on. Worst performance is obtained when we do load
    balancing at wake up.

    There isn't an easy way to auto-detect the workload characteristics.
    Ingo's earlier patch that detects an idle CPU and decides whether to load
    balance or not doesn't perform well with aim7 either, since all CPUs are
    busy (it causes an even bigger perf. regression).

    Revert commit d7102e95b7b9c00277562c29aad421d2d521c5f6, which causes more
    than 10% performance regression with aim7.

    Signed-off-by: Ken Chen
    Acked-by: Ingo Molnar
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chen, Kenneth W
     
  • The PageCompound check before access_process_vm's set_page_dirty_lock is no
    longer necessary, so remove it. But leave the PageCompound checks in
    bio_set_pages_dirty, dio_bio_complete and nfs_free_user_pages: at least some
    of those were introduced as a little optimization on hugetlb pages.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

11 Feb, 2006

2 commits

  • When panic_timeout is zero, suppress triggering a nested panic due to soft
    lockup detection.

    Signed-off-by: Jan Beulich
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Beulich
     
  • I don't think the code is quite ready, which is why I asked for Peter's
    additions to also be merged before I acked it (although it turned out that
    it still isn't quite ready with his additions either).

    Basically I have had similar observations to Suresh in that it does not
    play nicely with the rest of the balancing infrastructure (and I raised
    similar concerns in my review).

    The samples (group of 4) I got for "maximum recorded imbalance" on a 2x2
    SMP+HT Xeon are as follows:

               | Following boot | hackbench 20        | hackbench 40
    -----------+----------------+---------------------+---------------------
    2.6.16-rc2 | 30,37,100,112  | 5600,5530,6020,6090 | 6390,7090,8760,8470
    +nosmpnice | 3, 2, 4, 2     | 28, 150, 294, 132   | 348, 348, 294, 347

    Hackbench raw performance is down around 15% with smpnice (but that in
    itself isn't a huge deal because it is just a benchmark). However, the
    samples show that the imbalance passed into move_tasks is increased by
    about a factor of 10-30. I think this would also go some way to explaining
    latency blips turning up in the balancing code (though I haven't actually
    measured that).

    We'll probably have to revert this in the SUSE kernel.

    Cc: "Siddha, Suresh B"
    Acked-by: Ingo Molnar
    Cc: Peter Williams
    Cc: "Martin J. Bligh"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     

08 Feb, 2006

8 commits

  • Signed-off-by: Al Viro

    Al Viro
     
  • Signed-off-by: Al Viro

    Al Viro
     
  • A bunch of asm/bug.h includes are both not needed (since it will get
    pulled in anyway) and bogus (since they are done too early). Removed.

    Signed-off-by: Al Viro

    Al Viro
     
  • If the file descriptor structure is being shared, allocate a new one and copy
    information from the current, shared, structure.

    Signed-off-by: Janak Desai
    Cc: Al Viro
    Cc: Christoph Hellwig
    Cc: Michael Kerrisk
    Cc: Andi Kleen
    Cc: Paul Mackerras
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    JANAK DESAI
     
  • If the vm structure is being shared, allocate a new one and copy
    information from the current, shared, structure.

    Signed-off-by: Janak Desai
    Cc: Al Viro
    Cc: Christoph Hellwig
    Cc: Michael Kerrisk
    Cc: Andi Kleen
    Cc: Paul Mackerras
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    JANAK DESAI
     
  • If the namespace structure is being shared, allocate a new one and copy
    information from the current, shared, structure.

    Signed-off-by: Janak Desai
    Cc: Al Viro
    Cc: Christoph Hellwig
    Cc: Michael Kerrisk
    Cc: Andi Kleen
    Cc: Paul Mackerras
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    JANAK DESAI
     
  • If the filesystem structure is being shared, allocate a new one and copy
    information from the current, shared, structure.

    Signed-off-by: Janak Desai
    Cc: Al Viro
    Cc: Christoph Hellwig
    Cc: Michael Kerrisk
    Cc: Andi Kleen
    Cc: Paul Mackerras
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    JANAK DESAI
     
  • The sys_unshare system call handler accepts the same flags as the clone
    system call, checks constraints on each of the flags, and invokes the
    corresponding unshare functions to disassociate the respective pieces of
    process context if they were being shared with another task. A small
    userland usage example follows this entry.

    Here is the link to a program for testing the unshare system call:

    http://prdownloads.sourceforge.net/audit/unshare_test.c?download

    Please note that because of a problem in rmdir associated with bind
    mounts and clone with CLONE_NEWNS, the test fails while trying to remove
    the temporary test directory. You can remove that temporary directory by
    running rmdir twice from the command line: the first will fail with
    EBUSY, but the second will succeed. I have reported the problem to Ram
    Pai and Al Viro with a small program which reproduces it. Al told us
    yesterday that he will be looking at the problem soon. I have tried
    multiple rmdirs from the unshare_test program itself, but for some reason
    that is not working; doing two rmdirs from the command line does seem to
    remove the directory.

    Signed-off-by: Janak Desai
    Cc: Al Viro
    Cc: Christoph Hellwig
    Cc: Michael Kerrisk
    Cc: Andi Kleen
    Cc: Paul Mackerras
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    JANAK DESAI
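    A small userland example of exercising the new system call; it goes
    through syscall() directly (libc wrappers came later) and assumes headers
    that define SYS_unshare and a kernel that accepts CLONE_FS here:

        #define _GNU_SOURCE
        #include <sched.h>          /* CLONE_* flags */
        #include <stdio.h>
        #include <sys/syscall.h>
        #include <unistd.h>

        int main(void)
        {
                /* ask for a private copy of the fs context (cwd, root, umask);
                 * a no-op if it wasn't shared with any other task */
                if (syscall(SYS_unshare, CLONE_FS) == -1) {
                        perror("unshare(CLONE_FS)");
                        return 1;
                }

                /* from here on, chdir() no longer affects any task we used
                 * to share the fs_struct with */
                if (chdir("/tmp") == -1) {
                        perror("chdir");
                        return 1;
                }

                printf("fs context is now private\n");
                return 0;
        }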