19 Sep, 2006

1 commit


17 Sep, 2006

2 commits

  • I think there is a bug in kmod.c: in __call_usermodehelper(), once
    kernel_thread(wait_for_helper, ...) returns success, wait_for_helper()
    might call complete() at any time, so sub_info must not be used any
    more.

    Normally wait_for_helper() takes long enough to finish that the problem
    rarely shows up, but if you remove /sbin/modprobe it becomes much
    easier to trigger an oops in khelper.

    Cc: Matt Helsley
    Cc: Martin Schwidefsky
    Cc: Arnd Bergmann
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kenneth Lee
     
  • Fix a bug where the IRQ_PENDING flag is never cleared and the ISR is called
    endlessly without an actual interrupt.

    Signed-off-by: Imre Deak
    Acked-by: Thomas Gleixner
    Acked-by: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Imre Deak
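    The shape of the fix can be shown with a minimal userspace model (the
    flag name mirrors the kernel's, but the value and function are made up
    for illustration; this is not the real flow-handler code): the replay
    loop must clear IRQ_PENDING before running the handler, otherwise the
    loop condition never becomes false and the ISR is re-run forever.

```c
#include <assert.h>

#define IRQ_PENDING 0x01  /* mirrors the kernel flag name; value is made up */

/* Model of the handler loop: clear IRQ_PENDING before invoking the ISR,
 * so the loop only repeats if a new interrupt really arrived meanwhile. */
static int handle_irq_model(unsigned int *status)
{
    int isr_calls = 0;

    do {
        *status &= ~IRQ_PENDING;  /* the clear that was missing */
        isr_calls++;              /* run the ISR once */
    } while (*status & IRQ_PENDING);

    return isr_calls;
}
```

    With the clear in place, a single pending interrupt yields exactly one
    ISR invocation instead of an endless loop.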
     

13 Sep, 2006

2 commits

  • rcu_do_batch() decrements rdp->qlen with irqs enabled. This is not
    good: ->qlen can also be modified by call_rcu() from interrupt context.

    Decrement ->qlen once, with irqs disabled, after the main loop.

    Signed-off-by: Oleg Nesterov
    Cc: Dipankar Sarma
    Cc: "Paul E. McKenney"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
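    A userspace sketch of the fixed structure (the struct is a stand-in for
    the real rcu_data, and local_irq_disable()/enable() are only indicated
    in comments): count completed callbacks locally and subtract from
    ->qlen once, instead of decrementing it per callback while interrupts
    are on.

```c
#include <assert.h>

struct rcu_data_model { int qlen; };  /* stand-in for the real rcu_data */

static int rcu_do_batch_model(struct rcu_data_model *rdp, int ncallbacks)
{
    int count = 0;
    int i;

    for (i = 0; i < ncallbacks; i++) {
        /* ...invoke the callback... */
        count++;          /* tally locally; don't touch rdp->qlen here */
    }

    /* local_irq_disable(); -- in the kernel, call_rcu() from an
     * interrupt can also modify ->qlen, so the update must happen
     * with irqs off */
    rdp->qlen -= count;   /* single decrement after the main loop */
    /* local_irq_enable(); */

    return count;
}
```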
     
  • Miles Lane reported the "BUG: MAX_STACK_TRACE_ENTRIES too low!" message,
    which means that during normal use his system produced enough lockdep
    events so that the 128-thousand entries stack-trace array got exhausted.
    Double the size of the array.

    Signed-off-by: Ingo Molnar
    Cc: Miles Lane
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ingo Molnar
     

12 Sep, 2006

4 commits


09 Sep, 2006

1 commit

  • The current implementation of futex_lock_pi returns -ERESTART_RESTARTBLOCK
    when the lock operation is interrupted by a signal. This results in a
    return of -EINTR to userspace when there is a handler for the signal,
    which is wrong: userspace expects the lock function not to return on
    signal delivery at all.

    This was not caught by my insufficient test case, but triggered a nasty
    userspace problem in a high-load application scenario. Unfortunately,
    glibc does not check for this invalid return value either.

    Using -ERESTARTNOINTR makes sure that the interrupted syscall is
    restarted. The restart-block related code can be safely removed, as the
    possible timeout argument is an absolute time value.

    Signed-off-by: Thomas Gleixner
    Acked-by: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Thomas Gleixner
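    To illustrate why the return value matters: with -ERESTARTNOINTR the
    kernel transparently restarts the syscall, so userspace never sees
    EINTR. If the lock function could return EINTR, every caller would need
    a retry loop like the hypothetical sketch below (do_lock_once() is a
    stand-in that simulates two signal interruptions, not a real futex
    call).

```c
#include <assert.h>
#include <errno.h>

static int fails_left = 2;  /* simulate two signal interruptions */

static int do_lock_once(void)
{
    if (fails_left-- > 0) {
        errno = EINTR;      /* interrupted by a signal */
        return -1;
    }
    return 0;               /* lock acquired */
}

/* The retry loop userspace would need if the kernel leaked EINTR.
 * The point of the patch is that this burden must NOT exist: the
 * kernel restarts the interrupted futex_lock_pi() itself. */
static int do_lock(void)
{
    int ret;
    do {
        ret = do_lock_once();
    } while (ret == -1 && errno == EINTR);
    return ret;
}
```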
     

07 Sep, 2006

3 commits


03 Sep, 2006

1 commit

  • It is not possible to find a sub-thread in the ->children/->ptrace_children
    lists; ptrace_attach() does not allow attaching to sub-threads.

    Even if it were possible to ptrace a task from the same thread group,
    we couldn't allow ->group_leader to be released while other (ptracer)
    threads remain in the same group.

    Signed-off-by: Oleg Nesterov
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

02 Sep, 2006

2 commits

  • Adds descriptions of the parameters of handle_bad_irq().

    Signed-off-by: Henrik Kretzschmar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Henrik Kretzschmar
     
  • Cleanup allocation and freeing of tsk->delays used by delay accounting.
    This solves two problems reported for delay accounting:

    1. oops in __delayacct_blkio_ticks
    http://www.uwsg.indiana.edu/hypermail/linux/kernel/0608.2/1844.html

    Currently tsk->delays is freed too early during task exit, which can
    cause a NULL tsk->delays to be accessed via reading of /proc//stats.
    The patch fixes this problem by freeing tsk->delays closer to when
    task_struct itself is freed up. As a result, it also eliminates the use
    of tsk->delays_lock, which was only being used (inadequately) to
    safeguard access to tsk->delays while a task was exiting.

    2. Possible memory leak in kernel/delayacct.c
    http://www.uwsg.indiana.edu/hypermail/linux/kernel/0608.2/1389.html

    The patch cleans up tsk->delays allocations after a bad fork which was
    missing earlier.

    The patch has been tested to fix the problems listed above and stress
    tested with rapid calls to delay accounting's taskstats command interface
    (which is the other path that can access the same data, besides the /proc
    interface causing the oops above).

    Signed-off-by: Shailabh Nagar
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shailabh Nagar
     

28 Aug, 2006

5 commits

  • cpuset_excl_nodes_overlap always returns 0 if current is exiting. This
    caused customers' systems to panic in the OOM killer when processes were
    having trouble getting memory for the final put_user in mm_release, even
    though there were lots of processes left to kill.

    Change it to return 1 in this case. This achieves parity with the
    !CONFIG_CPUSETS case, and was observed to fix the problem.

    Signed-off-by: Nick Piggin
    Acked-by: Paul Jackson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Change the list of cpus allowed to tasks in the top (root) cpuset to
    dynamically track what cpus are online, using a CPU hotplug notifier. Make
    this top cpus file read-only.

    On systems that have cpusets configured in their kernel, but that aren't
    actively using cpusets (for some distros, this covers the majority of
    systems) all tasks end up in the top cpuset.

    If that system does support CPU hotplug, then these tasks cannot make use
    of CPUs that are added after system boot, because the CPUs are not allowed
    in the top cpuset. This is a surprising regression over earlier kernels
    that didn't have cpusets enabled.

    In order to keep the behaviour of cpusets consistent between systems
    actively making use of them and systems not using them, this patch changes
    the behaviour of the 'cpus' file in the top (root) cpuset, making it read
    only, and making it automatically track the value of cpu_online_map. Thus
    tasks in the top cpuset will have automatic use of hot plugged CPUs allowed
    by their cpuset.

    Thanks to Anton Blanchard and Nathan Lynch for reporting this problem,
    driving the fix, and earlier versions of this patch.

    Signed-off-by: Paul Jackson
    Cc: Nathan Lynch
    Cc: Anton Blanchard
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Jackson
     
  • An up() is called in kernel/stop_machine.c on the failure path, and the
    caller also calls up() unconditionally, so the semaphore is released
    twice. Remove the redundant up().

    Signed-off-by: Zhou Yingchao
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yingchao Zhou
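    The double release can be modeled with a plain counter standing in for
    the kernel semaphore (a sketch under that assumption; the real code
    uses up()/down() on a struct semaphore):

```c
#include <assert.h>

static int sem_count;                /* models the semaphore's count */

static void up_model(void) { sem_count++; }

static void callee_buggy(int failed)
{
    if (failed)
        up_model();                  /* error-path up() ... */
}

static void caller(int failed)
{
    callee_buggy(failed);
    up_model();                      /* ... plus this unconditional up() */
}
```

    After a failing call the count is 2 instead of 1, so one extra down()
    would now succeed without a matching release, which is the bug the
    patch removes.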
     
  • futex_find_get_task:

        if (p->state == EXIT_ZOMBIE || p->exit_state == EXIT_ZOMBIE)
                return NULL;

    I can't understand this. First, p->state can't be EXIT_ZOMBIE. The
    ->exit_state check looks strange too. Sub-threads, or tasks whose
    ->parent ignores SIGCHLD, go directly to the EXIT_DEAD state (I am
    ignoring the ptrace case). Why should EXIT_DEAD tasks be ok? Yes,
    EXIT_ZOMBIE is more important (a task may stay zombie for a long time),
    but this doesn't mean we should explicitly ignore other EXIT_XXX states.

    Signed-off-by: Oleg Nesterov
    Acked-by: Ingo Molnar
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • sched_setscheduler() looks at ->signal->rlim[]. It is unsafe to
    dereference ->signal unless tasklist_lock or ->siglock is held (or p ==
    current). We pin the task structure, but that can't prevent
    release_task()->__exit_signal() from setting ->signal = NULL.

    Restore tasklist_lock across the setscheduler call.

    Signed-off-by: Oleg Nesterov
    Cc: Greg KH
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

15 Aug, 2006

5 commits

  • Use a private lock instead. It protects all per-cpu data structures in
    workqueue.c, including the workqueues list.

    Fix a bug in schedule_on_each_cpu(): it was forgetting to lock down the
    per-cpu resources.

    Unfixed long-standing bug: if someone unplugs the CPU identified by
    `singlethread_cpu' the kernel will get very sick.

    Cc: Dave Jones
    Signed-off-by: Andrew Morton
    Signed-off-by: Greg Kroah-Hartman

    Andrew Morton
     
  • We found this issue last week w/ the -RT kernel, but it seems the same
    issue is in mainline as well.

    Basically it is possible for futex_unlock_pi to return without actually
    freeing the lock. This is due to buggy logic in the use of
    futex_handle_fault() and its attempt argument in a failure case.

    Looking at futex.c the logic is as follows:

    1) In futex_unlock_pi() we start w/ ret=0 and we go down to the first
    futex_atomic_cmpxchg_inatomic(), where we find uval==-EFAULT. We then
    jump to the pi_faulted label.

    2) From pi_faulted: We increment attempt, unlock the sem and hit the
    retry label.

    3) From the retry label, with ret still zero, we again hit EFAULT on the
    first futex_atomic_cmpxchg_inatomic(), and again goto the pi_faulted
    label.

    4) Again from pi_faulted: we increment attempt and enter the
    conditional, where we call futex_handle_fault.

    5) futex_handle_fault fails, and we goto the out_unlock_release_sem
    label.

    6) From out_unlock_release_sem we return, and since ret is still zero,
    we return without error, while never actually unlocking the lock.

    Issue #1: at the first futex_atomic_cmpxchg_inatomic() we should
    probably be setting ret = -EFAULT before jumping to pi_faulted. However,
    in our case this doesn't really affect anything, as the glibc we're
    using ignores the error value from futex_unlock_pi().

    Issue #2: Look at futex_handle_fault(): its first conditional will return
    -EFAULT if attempt is >= 2. However, with the "if(attempt++)
    futex_handle_fault(attempt)" logic above, we *never* call
    futex_handle_fault() with attempt less than two. So we never get a
    chance to even try to fault the page in.

    The following patch addresses these two issues by 1) Always setting ret to
    -EFAULT if futex_handle_fault fails, and 2) Removing the = in
    futex_handle_fault's (attempt >= 2) check.

    I'm really not sure this is the right fix, but wanted to bring it up so
    folks knew the issue is alive and well in the current -git tree. From
    looking at the git logs the logic was first introduced (then later copied
    to other places) in the following commit almost a year ago:

    http://www.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=4732efbeb997189d9f9b04708dc26bf8613ed721;hp=5b039e681b8c5f30aac9cc04385cc94be45d0823

    Cc: Rusty Russell
    Cc: Ingo Molnar
    Acked-by: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Greg Kroah-Hartman

    john stultz
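    Steps 1-6 and Issue #2 can be traced with a small model of the counter
    logic (the error value and names just mirror the message above; this is
    not the kernel code):

```c
#include <assert.h>

#define EFAULT_MODEL 14

static int faults_handled;          /* how often the handler really ran */

static int futex_handle_fault_model(int attempt)
{
    if (attempt >= 2)               /* the check Issue #2 points at */
        return -EFAULT_MODEL;
    faults_handled++;               /* try to fault the page in */
    return 0;
}

/* Models the pi_faulted path: "if (attempt++) futex_handle_fault(attempt)".
 * Returns the handler's result on the pass where it is finally called. */
static int pi_faulted_model(void)
{
    int attempt = 0;

    /* first fault: attempt++ evaluates to 0, handler skipped */
    if (attempt++)
        return futex_handle_fault_model(attempt);

    /* second fault: attempt++ evaluates to 1, handler called with 2 */
    if (attempt++)
        return futex_handle_fault_model(attempt);

    return 0;
}
```

    The handler is first reached with attempt == 2, so its ">= 2" check
    fails immediately and the page is never faulted in.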
     
  • The sys_getppid() optimization can access freed memory. On kernels with
    DEBUG_SLAB turned on, this results in an oops. As Dave Hansen noted,
    this optimization is also unsafe for memory hotplug.

    So this patch always takes the lock, to be safe.

    [oleg@tv-sign.ru: simplifications]
    Signed-off-by: Kirill Korotaev
    Cc:
    Cc: Dave Hansen
    Signed-off-by: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Greg Kroah-Hartman

    Kirill Korotaev
     
  • kernel/panic.c: In function 'add_taint':
    kernel/panic.c:176: warning: implicit declaration of function 'debug_locks_off'

    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Greg Kroah-Hartman

    Andrew Morton
     
  • The percpu variable is used incorrectly in switch_hrtimer_base().

    Signed-off-by: Jan Blunck
    Acked-by: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Greg Kroah-Hartman

    Jan Blunck
     

06 Aug, 2006

7 commits

  • The recent fixups in futex.c need to be applied to futex_compat.c too. Fixes
    a hang reported by Olaf.

    Signed-off-by: Thomas Gleixner
    Cc: Olaf Hering
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Thomas Gleixner
     
  • find_next_system_ram() is used to find an available memory resource when
    onlining newly added memory. This patch fixes the following case, which
    find_next_system_ram() cannot catch:

    Resource: (start)-------------(end)
    Section : (start)-------------(end)

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Keith Mannthey
    Cc: Yasunori Goto
    Cc: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • find_next_system_ram() returns a valid memory range that overlaps the
    requested area; it is only used by memory hot-add.

    This function always rewrites the requested resource, even if the
    returned area does not fully fit within the requested one, and sometimes
    the returned resource is larger than the requested area. This annoys the
    caller. This patch changes the returned value to fit within the
    requested area.

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Keith Mannthey
    Cc: Yasunori Goto
    Cc: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
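    The behaviour change amounts to clamping the found range to the
    requested one. A userspace sketch of that intersection (the struct and
    names are illustrative, not the kernel's struct resource):

```c
#include <assert.h>

struct range_model { unsigned long start, end; };  /* inclusive bounds */

/* Clamp *found to the requested area; returns 0 on overlap, -1 if the
 * ranges don't intersect at all. */
static int clamp_to_request(const struct range_model *req,
                            struct range_model *found)
{
    if (found->end < req->start || found->start > req->end)
        return -1;                       /* no overlap */
    if (found->start < req->start)
        found->start = req->start;       /* trim the low side */
    if (found->end > req->end)
        found->end = req->end;           /* trim the high side */
    return 0;
}
```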
     
  • Reported by: Dave Jones

    Whilst printk'ing to both console and serial console, I got this...
    (2.6.18rc1)

    BUG: sleeping function called from invalid context at kernel/sched.c:4438
    in_atomic():0, irqs_disabled():1

    Call Trace:
    [] show_trace+0xaa/0x23d
    [] dump_stack+0x15/0x17
    [] __might_sleep+0xb2/0xb4
    [] __cond_resched+0x15/0x55
    [] cond_resched+0x3b/0x42
    [] console_conditional_schedule+0x12/0x14
    [] fbcon_redraw+0xf6/0x160
    [] fbcon_scroll+0x5d9/0xb52
    [] scrup+0x6b/0xd6
    [] lf+0x24/0x44
    [] vt_console_print+0x166/0x23d
    [] __call_console_drivers+0x65/0x76
    [] _call_console_drivers+0x5e/0x62
    [] release_console_sem+0x14b/0x232
    [] fb_flashcursor+0x279/0x2a6
    [] run_workqueue+0xa8/0xfb
    [] worker_thread+0xef/0x122
    [] kthread+0x100/0x136
    [] child_rip+0x8/0x12

    This can occur when release_console_sem() is called but the log
    buffer still has contents that need to be flushed. The console drivers
    are called while the console_may_schedule flag is still true. The
    might_sleep() is triggered when fbcon calls console_conditional_schedule().

    Fix by setting console_may_schedule to zero earlier, before the call to the
    console drivers.

    Signed-off-by: Antonino Daplas
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Antonino A. Daplas
     
  • When delivering PTRACE_EVENT_VFORK_DONE, provide pid of the child process
    when tracer calls ptrace(PTRACE_GETEVENTMSG). This is already
    (accidentally) available when the tracer is tracing VFORK in addition to
    VFORK_DONE.

    Signed-off-by: Chuck Ebbert
    Cc: Daniel Jacobowitz
    Cc: Albert Cahalan
    Cc: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chuck Ebbert
     
  • This patch adds a barrier() in futex unqueue_me to avoid aliasing of two
    pointers.

    On my s390x system I saw the following oops:

    Unable to handle kernel pointer dereference at virtual kernel address
    0000000000000000
    Oops: 0004 [#1]
    CPU: 0 Not tainted
    Process mytool (pid: 13613, task: 000000003ecb6ac0, ksp: 00000000366bdbd8)
    Krnl PSW : 0704d00180000000 00000000003c9ac2 (_spin_lock+0xe/0x30)
    Krnl GPRS: 00000000ffffffff 000000003ecb6ac0 0000000000000000 0700000000000000
    0000000000000000 0000000000000000 000001fe00002028 00000000000c091f
    000001fe00002054 000001fe00002054 0000000000000000 00000000366bddc0
    00000000005ef8c0 00000000003d00e8 0000000000144f91 00000000366bdcb8
    Krnl Code: ba 4e 20 00 12 44 b9 16 00 3e a7 84 00 08 e3 e0 f0 88 00 04
    Call Trace:
    ([] unqueue_me+0x40/0xe4)
    [] do_futex+0x33c/0xc40
    [] sys_futex+0x12e/0x144
    [] sysc_noemu+0x10/0x16
    [] 0x2000003741c

    The code in question is:

    static int unqueue_me(struct futex_q *q)
    {
            int ret = 0;
            spinlock_t *lock_ptr;

            /* In the common case we don't take the spinlock, which is nice. */
    retry:
            lock_ptr = q->lock_ptr;
            if (lock_ptr != 0) {
                    spin_lock(lock_ptr);
                    /*
                     * q->lock_ptr can change between reading it and
                     * spin_lock(), causing us to take the wrong lock. This
                     * corrects the race condition.
    [...]

    and my compiler (gcc 4.1.0) makes the following out of it:

    00000000000003c8 :
    3c8: eb bf f0 70 00 24 stmg %r11,%r15,112(%r15)
    3ce: c0 d0 00 00 00 00 larl %r13,3ce
    3d0: R_390_PC32DBL .rodata+0x2a
    3d4: a7 f1 1e 00 tml %r15,7680
    3d8: a7 84 00 01 je 3da
    3dc: b9 04 00 ef lgr %r14,%r15
    3e0: a7 fb ff d0 aghi %r15,-48
    3e4: b9 04 00 b2 lgr %r11,%r2
    3e8: e3 e0 f0 98 00 24 stg %r14,152(%r15)
    3ee: e3 c0 b0 28 00 04 lg %r12,40(%r11)
    /* write q->lock_ptr in r12 */
    3f4: b9 02 00 cc ltgr %r12,%r12
    3f8: a7 84 00 4b je 48e
    /* if r12 is zero then jump over the code.... */
    3fc: e3 20 b0 28 00 04 lg %r2,40(%r11)
    /* write q->lock_ptr in r2 */
    402: c0 e5 00 00 00 00 brasl %r14,402
    404: R_390_PC32DBL _spin_lock+0x2
    /* use r2 as parameter for spin_lock */

    So the code becomes more or less:
    if (q->lock_ptr != 0) spin_lock(q->lock_ptr)
    instead of
    if (lock_ptr != 0) spin_lock(lock_ptr)

    Which caused the oops from above.
    After adding a barrier gcc creates code without this problem:
    [...] (the same)
    3ee: e3 c0 b0 28 00 04 lg %r12,40(%r11)
    3f4: b9 02 00 cc ltgr %r12,%r12
    3f8: b9 04 00 2c lgr %r2,%r12
    3fc: a7 84 00 48 je 48c
    400: c0 e5 00 00 00 00 brasl %r14,400
    402: R_390_PC32DBL _spin_lock+0x2

    As a general note, this code in unqueue_me() seems a bit fishy. The
    retry logic of unqueue_me() only works if we can guarantee that the
    original value of q->lock_ptr is always a spinlock (otherwise we
    overwrite kernel memory). We know that q->lock_ptr can change. I don't
    know what happens to the original spinlock, as I am not an expert on the
    futex code.

    Cc: Martin Schwidefsky
    Cc: Rusty Russell
    Acked-by: Ingo Molnar
    Cc: Thomas Gleixner
    Signed-off-by: Christian Borntraeger
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christian Borntraeger
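    The fix in sketch form (a compilable userspace model; barrier() is the
    usual gcc compiler barrier, and spin_lock is replaced by a trivial
    stand-in): snapshotting q->lock_ptr into a local and placing a
    barrier() after the load prevents gcc from re-reading q->lock_ptr when
    passing it to the lock function.

```c
#include <assert.h>
#include <stddef.h>

#define barrier() __asm__ __volatile__("" ::: "memory")

struct futex_q_model { int *lock_ptr; };

static void spin_lock_model(int *l) { *l = 1; }  /* stand-in for spin_lock */

static int unqueue_me_model(struct futex_q_model *q)
{
    int *lock_ptr = q->lock_ptr;  /* read q->lock_ptr exactly once */

    /* Without this barrier, gcc may substitute q->lock_ptr for the
     * local below, re-reading a value that another CPU could have
     * changed to NULL in the meantime. */
    barrier();

    if (lock_ptr != NULL) {
        spin_lock_model(lock_ptr);
        return 1;
    }
    return 0;
}
```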
     
  • It should be possible to suspend, either to RAM or to disk, if there's a
    traced process that has just reached a breakpoint. However, this is a
    special case, because its parent process might have been frozen already and
    then we are unable to deliver the "freeze" signal to the traced process.
    If this happens, it's better to cancel the freezing of the traced process.

    Ref. http://bugzilla.kernel.org/show_bug.cgi?id=6787

    Signed-off-by: Rafael J. Wysocki
    Acked-by: Pavel Machek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rafael J. Wysocki
     

03 Aug, 2006

7 commits