27 Jun, 2006

15 commits

  • Every inode in /proc holds a reference to a struct task_struct. If a
    directory or file is opened and remains open after the task exits, this
    pinning continues. With 8K stacks on a 32bit machine the amount pinned per
    file descriptor is about 10K.

    Normally I would figure a reasonable per user process limit is about 100
    processes. With 80 processes, each with 1000 file descriptors, I can trigger
    the OOM killer on a 32bit kernel, because I have pinned about 800MB of useless
    data.

    This patch replaces the struct task_struct pointer with a pointer to a struct
    task_ref which has a struct task_struct pointer, so the pinning of dead
    tasks does not happen.

    The code now has to contend with the fact that the task may now exit at any
    time, which is a little but not much more complicated.

    With this change it takes about 1000 processes each opening up 1000 file
    descriptors before I can trigger the OOM killer. Much better.

    [mlp@google.com: task_mmu small fixes]
    Signed-off-by: Eric W. Biederman
    Cc: Trond Myklebust
    Cc: Paul Jackson
    Cc: Oleg Nesterov
    Cc: Albert Cahalan
    Signed-off-by: Prasanna Meda
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric W. Biederman
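
    The indirection described in the commit above can be illustrated with a
    small userspace sketch. All names here (struct task, task_ref_create(),
    task_exit() and so on) are hypothetical stand-ins, not the kernel's API,
    and the sketch ignores the locking a real implementation would need. The
    point is only that an open file keeps the small task_ref alive while the
    large task object is released as soon as the task exits.

```c
#include <stdlib.h>

/* Stand-in for the kernel's task_struct plus its 8K stack (assumption). */
struct task {
    char stack[8192];
    int pid;
};

/* Small indirection object: open /proc files hold this, not the task. */
struct task_ref {
    int count;          /* holders: the task itself plus open files */
    struct task *task;  /* NULL once the task has exited */
};

/* Create a task together with its reference object. */
struct task_ref *task_ref_create(int pid)
{
    struct task_ref *ref = malloc(sizeof(*ref));

    if (!ref)
        return NULL;
    ref->task = malloc(sizeof(*ref->task));
    if (!ref->task) {
        free(ref);
        return NULL;
    }
    ref->task->pid = pid;
    ref->count = 1;     /* the task's own reference */
    return ref;
}

/* Taken by each open /proc file. */
struct task_ref *task_ref_get(struct task_ref *ref)
{
    ref->count++;
    return ref;
}

/* Dropped when a file is closed; frees the small object on last put. */
void task_ref_put(struct task_ref *ref)
{
    if (--ref->count == 0) {
        free(ref->task);    /* free(NULL) is a no-op */
        free(ref);
    }
}

/* Task exit: release the large object now, then drop the task's own ref. */
void task_exit(struct task_ref *ref)
{
    free(ref->task);
    ref->task = NULL;
    task_ref_put(ref);
}
```

    After task_exit(), a holder that still has the task_ref sees task == NULL
    and must treat the task as gone, which is the extra complication the
    commit message mentions.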
     
  • To keep the dcache from filling up with dead /proc entries we flush them on
    process exit. However over the years that code has gotten hairy with a
    dentry_pointer and a lock in task_struct and misdocumented as a correctness
    feature.

    I have rewritten this code to look and see if we have a corresponding entry in
    the dcache and if so flush it on process exit. This removes the extra fields
    in the task_struct and allows me to trivially handle the case of a
    /proc//task/ entry as well as the current /proc/ entries.

    Signed-off-by: Eric W. Biederman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric W. Biederman
     
  • With this patch Kprobes now registers for page fault notifications only when
    there is an active probe registered. Once all the active probes are
    unregistered there is no need to be notified of page faults, and kprobes
    unregisters itself from the page fault notifications. Hence we will have ZERO
    side effects when no probes are active.

    Signed-off-by: Anil S Keshavamurthy
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Anil S Keshavamurthy
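
    The register-on-first, unregister-on-last behaviour described above can be
    sketched with a simple counter. This is an illustration only: the function
    names are hypothetical, and the two notifier helpers just flip a flag
    instead of touching any real page fault notifier.

```c
#include <stdbool.h>

static int active_probes;               /* number of registered probes */
static bool fault_notifier_registered;  /* state of the (simulated) notifier */

/* Hypothetical stand-ins for page fault notifier (un)registration. */
static void register_fault_notifier(void)
{
    fault_notifier_registered = true;
}

static void unregister_fault_notifier(void)
{
    fault_notifier_registered = false;
}

/* The first probe to arrive turns the notification on. */
void probe_register(void)
{
    if (active_probes++ == 0)
        register_fault_notifier();
}

/* The last probe to leave turns it off again. */
void probe_unregister(void)
{
    if (active_probes > 0 && --active_probes == 0)
        unregister_fault_notifier();
}

bool fault_notifier_active(void)
{
    return fault_notifier_registered;
}
```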
     
  • Kprobes now registers for page fault notifications.

    Signed-off-by: Anil S Keshavamurthy
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Anil S Keshavamurthy
     
  • If there are multiple kprobes on the same probepoint, there will be one extra
    aggr_kprobe at the head of the kprobe list. The aggr_kprobe has
    aggr_post_handler/aggr_break_handler set whether or not the other kprobes'
    post_handler/break_handler is NULL. This patch changes that: post_handler
    of the aggr_kprobe is set to aggr_post_handler only when at least one kprobe
    in the list has a non-NULL post_handler.

    [soshima@redhat.com: !CONFIG_PREEMPT fix]
    Signed-off-by: bibo, mao
    Cc: Masami Hiramatsu
    Cc: Ananth N Mavinakayanahalli
    Cc: "Keshavamurthy, Anil S"
    Cc: Prasanna S Panchamukhi
    Cc: Jim Keniston
    Cc: Yumiko Sugita
    Cc: Hideo Aoki
    Signed-off-by: Satoshi Oshima
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    mao, bibo
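
    A minimal sketch of the rule stated above, using simplified hypothetical
    types (the real kprobe handlers take more arguments, and demo_aggr_probe is
    not the kernel structure): the aggregate's post_handler stays NULL until a
    probe with its own post_handler joins the list.

```c
#include <stddef.h>

typedef void (*demo_post_handler_t)(void);

struct demo_probe {
    demo_post_handler_t post_handler;   /* may be NULL */
    struct demo_probe *next;
};

struct demo_aggr_probe {
    demo_post_handler_t post_handler;   /* set only if some child needs it */
    struct demo_probe *list;
};

/* Would walk the list and call each child's post_handler. */
void demo_aggr_post_handler(void)
{
}

/* Example child handler used for illustration. */
void demo_sample_post_handler(void)
{
}

/* Add a probe; install the aggregate handler only when it is needed. */
void demo_aggr_add_probe(struct demo_aggr_probe *ap, struct demo_probe *p)
{
    p->next = ap->list;
    ap->list = p;
    if (p->post_handler && !ap->post_handler)
        ap->post_handler = demo_aggr_post_handler;
}
```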
     
  • This fixes the clock source updates in update_wall_time() to correctly
    track the time coming in via current_tick_length(). Optimize the fast
    paths to be as short as possible to keep the overhead low.

    Signed-off-by: Roman Zippel
    Acked-by: John Stultz
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Zippel
     
  • As suggested by Roman Zippel, change clocksource functions to use
    clocksource_xyz rather than xyz_clocksource to avoid polluting the
    namespace.

    Signed-off-by: John Stultz
    Cc: Roman Zippel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    john stultz
     
  • Implement the time sources for i386 (acpi_pm, cyclone, hpet, pit, and tsc).
    With this patch, the conversion of the i386 arch to the generic timekeeping
    code should be complete.

    The patch should be fairly straightforward, only adding the new clocksources.

    [hirofumi@mail.parknet.co.jp: acpi_pm cleanup]
    Signed-off-by: John Stultz
    Signed-off-by: Adrian Bunk
    Signed-off-by: Paul Mundt
    Signed-off-by: John Stultz
    Signed-off-by: OGAWA Hirofumi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    john stultz
     
  • Introduces clocksource switching code and the arch generic time accessor
    functions that use the clocksource infrastructure.

    Signed-off-by: John Stultz
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    john stultz
     
  • Instead of incrementing xtime by tick_nsec + ntp adjustments, use the
    clocksource abstraction to increment and scale time. Using the clocksource
    abstraction allows other clocksources to be used consistently in the face of
    late or lost ticks, while preserving the existing behavior via the jiffies
    clocksource.

    This removes the need to keep time_phase adjustments as we just use the
    current_tick_length() function as the NTP interface and accumulate time using
    shifted nanoseconds.

    The basics of this design are by Roman Zippel; however, it is my own
    interpretation and implementation, so the credit should go to him and the
    blame to me.

    Signed-off-by: John Stultz
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    john stultz
     
  • Change the current_tick_length() function so it takes an argument which
    specifies how much precision to return in shifted nanoseconds. This provides
    a simple way to convert between NTP's internal nanoseconds shifted by
    (SHIFT_SCALE - 10) to other shifted nanosecond units that are used by the
    clocksource abstraction.

    Signed-off-by: John Stultz
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    john stultz
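
    Converting between two shifted-nanosecond representations, as described
    above, comes down to shifting by the difference of the two shift values.
    The helper below is a hypothetical illustration of that arithmetic, not the
    kernel's current_tick_length(); the shift values in the usage note are
    example figures only.

```c
#include <stdint.h>

/*
 * Rescale a value expressed as (ns << from_shift) into (ns << to_shift).
 * Shifting right discards the low-order fractional bits.
 */
uint64_t rescale_shifted_ns(uint64_t value, unsigned int from_shift,
                            unsigned int to_shift)
{
    if (to_shift >= from_shift)
        return value << (to_shift - from_shift);
    return value >> (from_shift - to_shift);
}
```

    For example, a tick of 1000000 ns held with 12 fractional bits can be
    handed to a consumer that wants 20 fractional bits with
    rescale_shifted_ns(tick, 12, 20), or reduced to whole nanoseconds with
    rescale_shifted_ns(tick, 12, 0).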
     
  • Modify the update_wall_time function so it increments time using the
    clocksource abstraction instead of jiffies. Since the only clocksource driver
    currently provided is the jiffies clocksource, this should result in no
    functional change. Additionally, timekeeping_init and timekeeping_resume
    functions have been added to initialize and maintain some of the new
    timekeeping state.

    [hirofumi@mail.parknet.co.jp: fixlet]
    Signed-off-by: John Stultz
    Signed-off-by: OGAWA Hirofumi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    john stultz
     
  • This introduces the clocksource management infrastructure. A clocksource is a
    driver-like architecture generic abstraction of a free-running counter. This
    code defines the clocksource structure, and provides management code for
    registering, selecting, accessing and scaling clocksources.

    Additionally, this includes the trivial jiffies clocksource, a lowest common
    denominator clocksource, provided mainly for use as an example.

    [hirofumi@mail.parknet.co.jp: Don't enable IRQ too early]
    Signed-off-by: John Stultz
    Signed-off-by: Ingo Molnar
    Signed-off-by: Paul Mundt
    Signed-off-by: John Stultz
    Signed-off-by: OGAWA Hirofumi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    john stultz
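
    To make the idea of "a free-running counter plus scaling" concrete, here is
    a small userspace sketch. The structure and function names are hypothetical
    and much reduced compared with the kernel's clocksource code; it only shows
    the common mult/shift technique for turning counter cycles into nanoseconds
    and the use of a mask to tolerate counter wraparound.

```c
#include <stdint.h>

struct demo_clocksource {
    const char *name;
    uint64_t (*read)(void); /* returns the free-running counter value */
    uint64_t mask;          /* valid bits of the counter */
    uint32_t mult;          /* cycles-to-ns multiplier */
    uint32_t shift;         /* cycles-to-ns divisor, as a power of two */
};

/* ns = (cycles * mult) >> shift; assumes the product fits in 64 bits */
uint64_t demo_cyc2ns(const struct demo_clocksource *cs, uint64_t cycles)
{
    return (cycles * cs->mult) >> cs->shift;
}

/* Nanoseconds between two counter reads, tolerant of one wraparound. */
uint64_t demo_elapsed_ns(const struct demo_clocksource *cs,
                         uint64_t last, uint64_t now)
{
    return demo_cyc2ns(cs, (now - last) & cs->mask);
}
```

    With a 1 MHz counter (1000 ns per cycle) one could pick shift = 8 and
    mult = 1000 << 8, so that demo_cyc2ns() maps 5 cycles to 5000 ns.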
     
  • Convert kernel/cpu.c from semaphore to mutex.

    I've reviewed all lock_cpu_hotplug() critical sections, and they all seem to
    fit mutex semantics.

    Signed-off-by: Ingo Molnar
    Cc: Rusty Russell
    Cc: Ashok Raj
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ingo Molnar
     
  • It seems ppc64 wants to lock mutexes in early bootup code, with interrupts
    disabled, and they expect interrupts to stay disabled, else they crash.

    Work around this bug by making mutex debugging variants save/restore irq
    flags.

    Signed-off-by: Ingo Molnar
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ingo Molnar
     

26 Jun, 2006

21 commits

  • This reverts commits

    3e3318dee0878d42ed62a19c292a2ac284135db3 [PATCH] swsusp: x86_64 mark special saveable/unsaveable pages
    b6370d96e09944c6e3ae8d5743ca8a8ab1f79f6c [PATCH] swsusp: i386 mark special saveable/unsaveable pages
    ce4ab0012b32c1a4a1d6e934aeb73bf3151c48d9 [PATCH] swsusp: add architecture special saveable pages support

    because not only do they apparently cause page faults on x86, the
    infrastructure doesn't compile on powerpc.

    Signed-off-by: Linus Torvalds

    Linus Torvalds
     
  • Not that x86-64 and other architecture support should be difficult to
    add (trivial fixups to the data format and add the proper linker script
    entry).

    Signed-off-by: Linus Torvalds

    Linus Torvalds
     
  • In the current 2.6.17 implementation, the signal_struct referred to from
    task_struct is used as a per-process data structure. The pacct facility also
    uses it as a per-process data structure to store stime, utime, minflt and
    majflt. But those members are saved in __exit_signal(), which is too late.

    For example, if several threads exit at the same time, the pacct facility may
    drop the accounting for some of those threads. (See the following 'The
    results of original 2.6.17 kernel'.) I think accounting information should be
    completely collected into the per-process data structure before an
    accounting record is written out.

    This patch fixes this. Accumulation of stime, utime, minflt and majflt is
    done before the accounting record is generated.

    [mingo@elte.hu: fix acct_collect() siglock bug found by lockdep]
    Signed-off-by: KaiGai Kohei
    Signed-off-by: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KaiGai Kohei
     
  • When the pacct facility generates the 'ac_flag' field of an accounting
    record, it refers to the task_struct of the thread which died last in the
    process. All other task_structs are ignored.

    Therefore, the pacct facility drops the ASU flag even if root-privilege
    operations were used by any thread other than the last one. In addition, the
    AFORK flag is always set when the group-leader thread did not die last,
    even though the process has called execve() after fork().

    ac_exitcode has the same problem. The recorded ac_exitcode is the exit code
    of the last thread in the process, which is not necessarily the group
    leader's.

    KaiGai Kohei
     
  • The pacct facility needs an I/O operation when an accounting record is
    generated, which may wake up the OOM killer. If the OOM killer is
    activated, it kills some processes to make them release their memory
    regions.

    But acct_process() is called in the killed process's context before
    exit_mm(), so those processes cannot release their own memory. As a result,
    the processes stop at this point, which finally causes a system stall.

    KaiGai Kohei
     
  • Move kthread API kernel-doc from kthread.h to kthread.c & fix it.
    Add kthread API to kernel-api DocBook.

    Signed-off-by: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Randy Dunlap
     
  • Fix kernel-doc formatting in ktime.h and hrtimer.[ch] files.

    Signed-off-by: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Randy Dunlap
     
  • If a cpu hotplug callback fails on CPU_UP_PREPARE, all callbacks will be
    called with CPU_UP_CANCELED. A few of these callbacks assume that on
    CPU_UP_PREPARE a pointer to task has been stored in a percpu array. This
    assumption is not true if CPU_UP_PREPARE fails and the following calls to
    kthread_bind() in CPU_UP_CANCELED will cause an addressing exception
    because of passing a NULL pointer.

    Signed-off-by: Heiko Carstens
    Cc: Ashok Raj
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Heiko Carstens
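
    The failure mode described above, and one plausible guard against it, can
    be sketched in userspace as follows. Everything here is a hypothetical
    stand-in (demo_cpu_callback(), the event names, the per-CPU array); it is
    not the code the patch touches. The CANCELED branch simply checks whether
    PREPARE ever stored a thread pointer for that CPU before using it.

```c
#include <stddef.h>

#define DEMO_NR_CPUS 4

enum demo_cpu_event { DEMO_CPU_UP_PREPARE, DEMO_CPU_UP_CANCELED };

struct demo_thread {
    int bound_cpu;
};

static struct demo_thread demo_thread_pool[DEMO_NR_CPUS];
static struct demo_thread *demo_per_cpu_thread[DEMO_NR_CPUS];

int demo_cpu_callback(enum demo_cpu_event ev, unsigned int cpu)
{
    struct demo_thread *t;

    if (cpu >= DEMO_NR_CPUS)
        return -1;

    switch (ev) {
    case DEMO_CPU_UP_PREPARE:
        /* Create the per-CPU thread and remember it. */
        demo_thread_pool[cpu].bound_cpu = -1;
        demo_per_cpu_thread[cpu] = &demo_thread_pool[cpu];
        return 0;
    case DEMO_CPU_UP_CANCELED:
        t = demo_per_cpu_thread[cpu];
        if (!t)
            return 0;   /* PREPARE never got this far: nothing to undo */
        t->bound_cpu = 0;   /* stand-in for rebinding before stopping */
        demo_per_cpu_thread[cpu] = NULL;
        return 0;
    }
    return -1;
}
```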
     
  • - Update stop_machine.c to spawn stop_machine as kthreads rather than the
    deprecated kernel_threads.

    - Update stop_machine to use the more efficient kthread_bind() before
    running task in place of set_cpus_allowed() after.

    [akpm@osdl.org: remove now-wrong set_cpus_allowed()]
    Signed-off-by: Serge E. Hallyn
    Cc: Rusty Russell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Serge E. Hallyn
     
  • If futexes are disabled we fail to link on ppc64.

    Signed-off-by: Anton Blanchard
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Anton Blanchard
     
  • I'm testing glibc on MIPS64, little-endian, N32, O32 and N64 multilibs.

    Among the NPTL test failures seen are some arising from sigsuspend problems
    for N32: it blocks the wrong signals, so SIGCANCEL (SIGRTMIN) is blocked
    despite glibc's carefully excluding it from sets of signals to block.
    Specifically, testing suggests it blocks signal N^32 instead of signal N,
    so (in the example tested) blocking SIGUSR1 (17) blocks signal 49 instead.

    glibc's sigset_t uses an array of unsigned long, as does the kernel.
    In both cases, signal N+1 is represented as
    (1UL << (N % (8 * sizeof (unsigned long)))) in word number
    (N / (8 * sizeof (unsigned long))).

    Thus the N32 glibc uses an array of 32-bit words and the N64 kernel uses an
    array of 64-bit words. For little-endian, the layout is the same, with
    signals 1-32 in the first 4 bytes, signals 33-64 in the second, etc.; for
    big-endian, userspace has that layout while in the kernel each 8 bytes have
    the two halves swapped from the userspace layout.

    The N32 sigsuspend syscall uses sigset_from_compat to convert the userspace
    sigset to kernel format. If __COMPAT_ENDIAN_SWAP__ is *not* set, this uses
    logic of the form

    set->sig[0] = compat->sig[0] | (((long)compat->sig[1]) << 32 )

    to convert the userspace sigset to a kernel one. This looks correct to me
    for both big and little endian, given that in userspace compat->sig[1] will
    represent signals 33-64, and so will the high 32 bits of set->sig[0] in the
    kernel. If however __COMPAT_ENDIAN_SWAP__ *is* set, as it is for
    __MIPSEL__, it uses

    set->sig[0] = compat->sig[1] | (((long)compat->sig[0]) << 32 );

    which seems incorrect for both big and little endian, and would
    explain the observed symptoms.

    This code is the only use of __COMPAT_ENDIAN_SWAP__, so if incorrect
    then that macro serves no purpose, in which case something like the
    following patch would seem appropriate to remove it.

    Signed-off-by: Joseph Myers
    Signed-off-by: Ralf Baechle
    Cc: Arnd Bergmann
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    akpm@osdl.org
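
    The two conversion formulas quoted in the message above can be compared
    directly in userspace. The function names below are hypothetical; the
    bodies follow the two expressions from the message, with the first 32-bit
    userspace word passed as lo and the second as hi.

```c
#include <stdint.h>

/* Non-swapped form: signals 1-32 stay in the low half of the 64-bit word. */
uint64_t sigword_from_compat(uint32_t lo, uint32_t hi)
{
    return (uint64_t)lo | ((uint64_t)hi << 32);
}

/* Swapped form used when __COMPAT_ENDIAN_SWAP__ is set. */
uint64_t sigword_from_compat_swapped(uint32_t lo, uint32_t hi)
{
    return (uint64_t)hi | ((uint64_t)lo << 32);
}

/* Bit for signal number sig (1-based) in a 32-bit userspace word. */
uint32_t sigmask32(unsigned int sig)
{
    return UINT32_C(1) << ((sig - 1) % 32);
}

/* Bit for signal number sig (1-based) in a 64-bit kernel word. */
uint64_t sigmask64(unsigned int sig)
{
    return UINT64_C(1) << ((sig - 1) % 64);
}
```

    Feeding a set containing only signal 17 through the swapped form yields
    the bit for signal 49, which matches the symptom reported in the message.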
     
  • The table is empty, why does it still exist?

    Signed-off-by: Stephen Hemminger
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Stephen Hemminger
     
  • Currently, enabling/disabling printk timestamps is only possible through
    reboot (bootparam) or recompile. I normally do not run with timestamps
    (since syslog handles that in a good manner), but for measuring small
    kernel delays (e.g. irq probing - see parport thread) I needed subsecond
    precision, though only for a few minutes rather than for all kernel
    messages to come. The following patch adds a module_param() with which the
    timestamps can be en-/disabled in a live system through
    /sys/module/printk/parameters/printk_time.

    Signed-off-by: Jan Engelhardt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Engelhardt
     
  • copy_process() appears to be the only caller of acct_clear_integrals() and
    does not pass in NULL task pointers. Remove the unnecessary check.

    Signed-off-by: Matt Helsley
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matt Helsley
     
  • Signed-off-by: Andreas Mohr
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andreas Mohr
     
  • schedule_on_each_cpu() presently does a large kmalloc - 96 kbytes on 1024 CPU
    64-bit.

    Rework it so that we do one 8192-byte allocation and then a pile of tiny ones,
    via alloc_percpu(). This has a much higher chance of success (100% in the
    current VM).

    This also has the effect of reducing the memory requirements from NR_CPUS*n to
    num_possible_cpus()*n.

    Cc: Christoph Lameter
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
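
    The sizes quoted above can be reproduced with a little arithmetic. The
    96-byte item size is inferred here from "96 kbytes on 1024 CPU" and the
    8-byte pointer size from "64-bit"; both are assumptions of this sketch, as
    are the function names.

```c
#include <stddef.h>

/* One flat array with an entry for every configured CPU. */
size_t flat_alloc_bytes(size_t nr_cpus, size_t item_size)
{
    return nr_cpus * item_size;
}

/* A pointer array for every configured CPU plus one small item per
 * possible CPU, allocated separately. */
size_t percpu_alloc_bytes(size_t nr_cpus, size_t possible_cpus,
                          size_t item_size, size_t ptr_size)
{
    return nr_cpus * ptr_size + possible_cpus * item_size;
}
```

    With 1024 configured CPUs the flat layout needs one contiguous 98304-byte
    allocation, while the per-CPU layout needs one 8192-byte pointer array and
    a number of 96-byte items that depends on how many CPUs are possible.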
     
  • - proper prototypes for the following functions:
      - ctrl_alt_del() (in include/linux/reboot.h)
      - getrusage() (in include/linux/resource.h)
    - make the following needlessly global functions static:
      - kernel_restart_prepare()
      - kernel_kexec()

    [akpm@osdl.org: compile fix]
    Signed-off-by: Adrian Bunk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adrian Bunk
     
  • Currently printk is no use for early debugging because it refuses to
    actually print anything to the console unless
    cpu_online(smp_processor_id()) is true.

    The stated explanation is that console drivers may require per-cpu
    resources, or otherwise barf, because the system is not yet setup
    correctly. Fair enough.

    However some console drivers might be quite happy running early during
    boot, in fact we have one, and so it'd be nice if printk understood that.

    So I added a flag (which I would have called CON_BOOT, but that's taken)
    called CON_ANYTIME, which indicates that a console is happy to be called
    anytime, even if the cpu is not yet online.

    Tested on a Power 5 machine, with both a CON_ANYTIME driver and a bogus
    console driver that BUG()s if called while offline. No problems AFAICT.
    Built for i386 UP & SMP.

    Signed-off-by: Michael Ellerman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michael Ellerman
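
    The decision described above reduces to a single predicate. The sketch
    below uses a hypothetical flag name and value (DEMO_CON_ANYTIME) rather
    than the kernel's definitions.

```c
#include <stdbool.h>

/* Illustrative flag bit; the value is an assumption of this sketch. */
#define DEMO_CON_ANYTIME 0x10u

/* A console may be called if the CPU is online, or if the console has
 * declared that it is safe to call at any time. */
bool demo_console_usable(bool cpu_is_online, unsigned int console_flags)
{
    return cpu_is_online || (console_flags & DEMO_CON_ANYTIME) != 0;
}
```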
     
  • Since raw_notifier chains don't benefit from any centralized locking
    protections, they shouldn't suffer from the associated limitations. Under
    some circumstances it might make sense for a raw_notifier callout routine
    to unregister itself from the notifier chain. This patch (as678) changes
    the notifier core to allow for such things.

    Signed-off-by: Alan Stern
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alan Stern
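
    One common way to let a callback remove itself while the chain is being
    walked is to fetch the next pointer before invoking the callback. The
    sketch below shows that technique on a plain singly linked list; the types
    and function names are hypothetical, and it does no locking, in keeping
    with the "raw" flavour described above.

```c
#include <stddef.h>

struct demo_notifier {
    int (*call)(struct demo_notifier *self, void *data);
    struct demo_notifier *next;
};

/* Insert at the head of the chain. */
void demo_chain_register(struct demo_notifier **head, struct demo_notifier *n)
{
    n->next = *head;
    *head = n;
}

/* Unlink n from the chain if it is present. */
void demo_chain_unregister(struct demo_notifier **head, struct demo_notifier *n)
{
    for (; *head; head = &(*head)->next) {
        if (*head == n) {
            *head = n->next;
            n->next = NULL;
            return;
        }
    }
}

/* Call every notifier; returns how many callbacks were invoked. */
int demo_chain_call(struct demo_notifier **head, void *data)
{
    struct demo_notifier *n = *head;
    struct demo_notifier *next;
    int calls = 0;

    while (n) {
        next = n->next;     /* read before the callback can unlink n */
        n->call(n, data);
        calls++;
        n = next;
    }
    return calls;
}

/* Example callback: unregisters itself; data points at the chain head. */
int demo_oneshot(struct demo_notifier *self, void *data)
{
    demo_chain_unregister((struct demo_notifier **)data, self);
    return 0;
}

/* Example callback that does nothing. */
int demo_noop(struct demo_notifier *self, void *data)
{
    (void)self;
    (void)data;
    return 0;
}
```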
     
  • There are several instances of per_cpu(foo, raw_smp_processor_id()), which
    is semantically equivalent to __get_cpu_var(foo) but without the warning
    that smp_processor_id() can give if CONFIG_DEBUG_PREEMPT is enabled. For
    those architectures with optimized per-cpu implementations, namely ia64,
    powerpc, s390, sparc64 and x86_64, per_cpu() turns into more and slower
    code than __get_cpu_var(), so it would be preferable to use __get_cpu_var
    on those platforms.

    This defines a __raw_get_cpu_var(x) macro which turns into per_cpu(x,
    raw_smp_processor_id()) on architectures that use the generic per-cpu
    implementation, and turns into __get_cpu_var(x) on the architectures that
    have an optimized per-cpu implementation.

    Signed-off-by: Paul Mackerras
    Acked-by: David S. Miller
    Acked-by: Ingo Molnar
    Acked-by: Martin Schwidefsky
    Cc: Rusty Russell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Mackerras
     
  • If CONFIG_KALLSYMS is defined and if it should happen that is_exported() is
    given a NULL 'mod' and lookup_symbol(name, __start___ksymtab,
    __stop___ksymtab) returns 0, then we'll end up dereferencing a NULL
    pointer.

    Signed-off-by: Jesper Juhl
    Cc: Rusty Russell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jesper Juhl
     

25 Jun, 2006

1 commit

  • Considering that there isn't a lot of hw we can depend on during resume,
    this is about as good as it gets.

    This is x86-only for now, although the basic concept (and most of the
    code) will certainly work on almost any platform.

    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

23 Jun, 2006

3 commits

  • Signed-off-by: Eric Sesterhenn
    Signed-off-by: Alexey Dobriyan
    Cc: Bartlomiej Zolnierkiewicz
    Cc: Alan Cox
    Cc: James Bottomley
    Acked-by: "Salyzyn, Mark"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric Sesterhenn
     
  • Correct the return type of handle_IRQ_event() (inconsistency noticed during
    Xen development), and remove redundant declarations. The return type
    adjustment required breaking out the definition of irqreturn_t into a
    separate header, in order to satisfy current include order dependencies.

    Signed-off-by: Jan Beulich

    Cc: Richard Henderson
    Cc: Ivan Kokshaysky
    Cc: Russell King
    Cc: Ian Molton
    Cc: Mikael Starvik
    Cc: Yoshinori Sato
    Cc: Hirokazu Takata
    Cc: Heiko Carstens
    Cc: Martin Schwidefsky
    Cc: William Lee Irwin III
    Cc: "David S. Miller"
    Cc: Miles Bader
    Cc: Geert Uytterhoeven
    Cc: Roman Zippel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Beulich
     
  • When CONFIG_BASE_SMALL=1, cascade() may enter an infinite loop.
    With CONFIG_BASE_SMALL=1 (TVR_BITS=6 and TVN_BITS=4), the list
    base->tv5 may cascade into base->tv5 itself, so the kernel loops forever
    in cascade().

    I created a test module to verify this bug, and a patch to fix it.

    #include
    #include
    #include
    #include
    #if 0
    #include
    #else
    #define kdb_printf printk
    #endif

    #define TVN_BITS (CONFIG_BASE_SMALL ? 4 : 6)
    #define TVR_BITS (CONFIG_BASE_SMALL ? 6 : 8)
    #define TVN_SIZE (1 << TVN_BITS)
    #define TVR_SIZE (1 << TVR_BITS)
    #define TVN_MASK (TVN_SIZE - 1)
    #define TVR_MASK (TVR_SIZE - 1)

    #define TV_SIZE(N) (N*TVN_BITS + TVR_BITS)

    struct timer_list timer0;
    struct timer_list dummy_timer1;
    struct timer_list dummy_timer2;

    void dummy_timer_fun(unsigned long data)
    {
    }

    unsigned long j = 0;

    void check_timer_base(unsigned long data)
    {
        /* jiffies is an unsigned long, so print it with %lx */
        kdb_printf("check_timer_base %08lx\n", jiffies);
        mod_timer(&timer0, (jiffies & (~0xFFF)) + 0x1FFF);
    }

    int init_module(void)
    {
        init_timer(&timer0);
        timer0.data = (unsigned long)0;
        timer0.function = check_timer_base;
        mod_timer(&timer0, jiffies + 1);

        init_timer(&dummy_timer1);
        dummy_timer1.data = (unsigned long)0;
        dummy_timer1.function = dummy_timer_fun;

        init_timer(&dummy_timer2);
        dummy_timer2.data = (unsigned long)0;
        dummy_timer2.function = dummy_timer_fun;

        j = jiffies;
    j&=(~((1<<<
    Cc: Matt Mackall
    Signed-off-by: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Porpoise
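
    The wraparound behind the report above can be seen from the slot
    arithmetic alone. The helper below is a hypothetical illustration, assuming
    the outermost wheel level (tv5) is indexed by the TVN_BITS bits that sit
    above TVR_BITS + 3*TVN_BITS; it is not the kernel's timer code. With
    TVR_BITS=6 and TVN_BITS=4 the wheel covers only 22 bits of jiffies, so an
    expiry 2^22 jiffies ahead maps back to the same tv5 slot.

```c
/* Slot in the outermost wheel level for a given expiry time. */
unsigned int demo_tv5_index(unsigned long expires, unsigned int tvr_bits,
                            unsigned int tvn_bits)
{
    unsigned long tvn_mask = (1UL << tvn_bits) - 1;

    return (unsigned int)((expires >> (tvr_bits + 3 * tvn_bits)) & tvn_mask);
}
```

    For instance, demo_tv5_index(0x40000, 6, 4) and
    demo_tv5_index(0x40000 + (1UL << 22), 6, 4) give the same slot, which is
    consistent with a timer from a tv5 list being re-queued onto that list.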