25 Apr, 2010

1 commit

  • On ppc64 you get this error:

    $ setarch ppc -R true
    setarch: ppc: Unrecognized architecture

    because uname still reports ppc64 as the machine.

    So mask off the personality flags when checking for PER_LINUX32.

    Signed-off-by: Andreas Schwab
    Reviewed-by: Christoph Hellwig
    Acked-by: David S. Miller
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andreas Schwab
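
    A minimal userspace sketch of the same masking idea described above (not
    the kernel patch itself), using the personality(2) interface and the
    PER_MASK/PER_LINUX32 constants from <sys/personality.h>:

    #include <stdio.h>
    #include <sys/personality.h>

    int main(void)
    {
        /* Passing 0xffffffff queries the current personality without
         * changing it. */
        unsigned int per = (unsigned int)personality(0xffffffff);

        /* PER_MASK keeps only the base personality type; modifier flags
         * such as ADDR_NO_RANDOMIZE live in the upper bits and must be
         * masked off before comparing against PER_LINUX32. */
        if ((per & PER_MASK) == PER_LINUX32)
            printf("base personality is PER_LINUX32\n");
        else
            printf("base personality 0x%x, flags 0x%x\n",
                   per & PER_MASK, per & ~PER_MASK);
        return 0;
    }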
     

23 Apr, 2010

3 commits

  • There are several issues in the current select_idle_sibling() logic in
    select_task_rq_fair() in the context of a task wake-up:

    a) Once we select the idle sibling, we use that domain (spanning the cpu
    the task is being woken up on and the idle sibling we found) in our
    wake_affine() decisions. This domain is completely different from the
    domain we are supposed to use: the one spanning the cpu the task is being
    woken up on and the cpu where the task previously ran.

    b) We do the select_idle_sibling() check only for the cpu the task is
    being woken up on. If select_task_rq_fair() selects the previously-run
    cpu for waking the task, doing a select_idle_sibling() check for that cpu
    would also help, but we don't do this currently.

    c) In scenarios where the cpu the task is being woken up on is busy but
    its HT siblings are idle, we select the idle HT sibling instead of the
    core where the task previously ran, even if that core is now completely
    idle. That is, we are not taking decisions based on wake_affine() but
    directly selecting an idle sibling, which can cause an imbalance at the
    SMT/MC level that is only later corrected by the periodic load balancer.

    Fix this by first going through the load imbalance calculations using
    wake_affine(); once we have decided between the wakeup cpu and the
    previously-run cpu, then choose a possible idle sibling of that cpu for
    waking up the task on (a toy sketch of this ordering follows below).

    Signed-off-by: Suresh Siddha
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Suresh Siddha
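
    The sketch mentioned above is a toy model of the reordered decision, not
    the scheduler code; cpu_busy[], idle_sibling_of() and
    wake_affine_prefers_wake_cpu() are invented stand-ins, and the "cpu n and
    cpu n^1 are HT siblings" topology is assumed purely for illustration:

    #include <stdio.h>
    #include <stdbool.h>

    #define NR_CPUS 8
    static bool cpu_busy[NR_CPUS] = { true, false, true, true,
                                      false, false, true, false };

    /* Assume cpus n and n^1 are HT siblings sharing a cache. */
    static int idle_sibling_of(int cpu)
    {
        int sib = cpu ^ 1;
        if (!cpu_busy[cpu])
            return cpu;          /* the target itself is idle */
        if (!cpu_busy[sib])
            return sib;          /* fall back to its idle sibling */
        return -1;               /* nothing idle in this cache domain */
    }

    /* Hypothetical stand-in for the wake_affine() imbalance decision. */
    static bool wake_affine_prefers_wake_cpu(int wake_cpu, int prev_cpu)
    {
        return cpu_busy[prev_cpu] && !cpu_busy[wake_cpu];
    }

    static int pick_wakeup_cpu(int wake_cpu, int prev_cpu)
    {
        /* Step 1: wake_affine()-style choice between the two cpus. */
        int target = wake_affine_prefers_wake_cpu(wake_cpu, prev_cpu)
                     ? wake_cpu : prev_cpu;
        /* Step 2: only now look for an idle sibling of that target. */
        int idle = idle_sibling_of(target);
        return idle >= 0 ? idle : target;
    }

    int main(void)
    {
        printf("wake on cpu%d\n", pick_wakeup_cpu(2, 6));
        return 0;
    }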
     
  • Dave reported that his large SPARC machines spend lots of time in
    hweight64(), so try to optimize away some of those needless
    cpumask_weight() invocations (especially with large off-stack cpumasks
    these are very expensive indeed).

    Reported-by: David Miller
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Chase reported that due to us decrementing calc_load_task prematurely
    (before the next LOAD_FREQ sample), the load average could be skewed
    by as much as the number of CPUs in the machine.

    This patch, based on Chase's patch, cures the problem by keeping the
    delta of the CPU going into NO_HZ idle separately and folding that in
    on the next LOAD_FREQ update.

    This restores the balance and we get strict LOAD_FREQ period samples.

    Signed-off-by: Peter Zijlstra
    Acked-by: Chase Douglas
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
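
    A toy, single-threaded illustration of the folding idea: a cpu entering
    NO_HZ idle parks its contribution in a separate delta instead of
    decrementing the global count early, and the delta is only folded in at
    the next LOAD_FREQ sample. Variable names loosely mirror the kernel's,
    but the code is purely illustrative:

    #include <stdio.h>

    static long calc_load_tasks;      /* global active-task count */
    static long calc_load_tasks_idle; /* deltas parked by NO_HZ idle cpus */

    static void cpu_goes_nohz_idle(long delta)
    {
        /* Do not decrement calc_load_tasks prematurely; park the delta. */
        calc_load_tasks_idle += delta;
    }

    static long fold_idle_delta(void)
    {
        long delta = calc_load_tasks_idle;
        calc_load_tasks_idle = 0;
        return delta;
    }

    static void load_freq_sample(void)
    {
        calc_load_tasks += fold_idle_delta();
        printf("sampled active tasks: %ld\n", calc_load_tasks);
    }

    int main(void)
    {
        calc_load_tasks = 8;       /* 8 runnable tasks at the last sample */
        cpu_goes_nohz_idle(-3);    /* a cpu idles, parks a delta of -3    */
        load_freq_sample();        /* next LOAD_FREQ tick folds it in: 5  */
        return 0;
    }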
     

22 Apr, 2010

1 commit

  • creds_are_invalid() reads both cred->usage and cred->subscribers and then
    compares them to make sure the number of processes subscribed to a cred struct
    never exceeds the refcount of that cred struct.

    The problem is that this can cause a race with both copy_creds() and
    exit_creds() as the two counters, whilst they are of atomic_t type, are only
    atomic with respect to themselves, and not atomic with respect to each other.

    This means that creds_are_invalid() can read the values on one CPU whilst
    they're being modified on another CPU, and so can observe an evolving state
    in which the subscribers count is now greater than the usage count was a
    moment before.

    Switching the order in which the counts are read cannot help, so the thing to
    do is to remove that particular check.

    I had considered rechecking the values to see if they're in flux if the test
    fails, but I can't guarantee they won't appear the same, even if they've
    changed several times in the meantime.

    Note that this can only happen if CONFIG_DEBUG_CREDENTIALS is enabled.

    The problem is only likely to occur with multithreaded programs, and can be
    tested by the tst-eintr1 program from glibc's "make check". The symptoms look
    like:

    CRED: Invalid credentials
    CRED: At include/linux/cred.h:240
    CRED: Specified credentials: ffff88003dda5878 [real][eff]
    CRED: ->magic=43736564, put_addr=(null)
    CRED: ->usage=766, subscr=766
    CRED: ->*uid = { 0,0,0,0 }
    CRED: ->*gid = { 0,0,0,0 }
    CRED: ->security is ffff88003d72f538
    CRED: ->security {359, 359}
    ------------[ cut here ]------------
    kernel BUG at kernel/cred.c:850!
    ...
    RIP: 0010:[] [] __invalid_creds+0x4e/0x52
    ...
    Call Trace:
    [] copy_creds+0x6b/0x23f

    Note the ->usage=766 and subscr=766. The values appear the same because
    they've been re-read since the check was made.

    Reported-by: Roland McGrath
    Signed-off-by: David Howells
    Signed-off-by: James Morris

    David Howells
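
    The underlying issue, two counters that are each atomic but do not form
    an atomic pair, can be reproduced in userspace with C11 atomics and
    pthreads; the names below are illustrative, not the kernel's:

    #include <stdatomic.h>
    #include <pthread.h>
    #include <stdio.h>

    static atomic_int usage;
    static atomic_int subscribers;

    static void *writer(void *arg)
    {
        (void)arg;
        for (int i = 0; i < 1000000; i++) {
            atomic_fetch_add(&usage, 1);        /* writers keep the invariant: */
            atomic_fetch_add(&subscribers, 1);  /* usage is bumped first       */
        }
        return NULL;
    }

    static void *reader(void *arg)
    {
        (void)arg;
        long violations = 0;
        for (int i = 0; i < 1000000; i++) {
            int u = atomic_load(&usage);        /* "usage a moment before..." */
            int s = atomic_load(&subscribers);  /* "...subscribers now"       */
            if (s > u)          /* both values evolved between the two reads */
                violations++;
        }
        /* Typically non-zero whenever the two threads overlap. */
        printf("observed subscribers > usage %ld times\n", violations);
        return NULL;
    }

    int main(void)
    {
        pthread_t w, r;
        pthread_create(&w, NULL, writer, NULL);
        pthread_create(&r, NULL, reader, NULL);
        pthread_join(w, NULL);
        pthread_join(r, NULL);
        return 0;
    }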
     

21 Apr, 2010

1 commit

  • Patch 570b8fb505896e007fd3bb07573ba6640e51851d:

    Author: Mathieu Desnoyers
    Date: Tue Mar 30 00:04:00 2010 +0100
    Subject: CRED: Fix memory leak in error handling

    attempts to fix a memory leak in the error handling by making the offending
    return statement into a jump down to the bottom of the function where a
    kfree(tgcred) is inserted.

    This is, however, incorrect, as it does a kfree() after doing put_cred() if
    security_prepare_creds() fails. That will result in a double free if 'error'
    is jumped to, as put_cred() will also attempt to free the new tgcred record by
    virtue of it being pointed to by the new cred record.

    Signed-off-by: David Howells
    Signed-off-by: James Morris

    David Howells
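
    The general pitfall, freeing an object on an error path after its
    ownership has already been handed to something that will free it, can be
    sketched in plain C; the toy cred/tgcred types below are stand-ins, not
    the kernel structures:

    #include <stdlib.h>

    struct tgcred { int refs; };
    struct cred   { struct tgcred *tgcred; };

    static void put_cred(struct cred *cred)
    {
        free(cred->tgcred);   /* cred owns tgcred: releasing cred frees it */
        free(cred);
    }

    static struct cred *prepare_cred(int fail_late)
    {
        struct cred *cred = calloc(1, sizeof(*cred));
        struct tgcred *tgcred = calloc(1, sizeof(*tgcred));
        if (!cred || !tgcred)
            goto error_free_both;   /* nothing owns anything yet */

        cred->tgcred = tgcred;      /* ownership transferred to cred here */

        if (fail_late) {            /* a later preparation step fails...   */
            put_cred(cred);         /* ...this already frees tgcred, so    */
            return NULL;            /* jumping to a kfree(tgcred)-style    */
        }                           /* label here would double-free.       */
        return cred;

    error_free_both:
        free(tgcred);
        free(cred);
        return NULL;
    }

    int main(void)
    {
        struct cred *c = prepare_cred(0);
        if (c)
            put_cred(c);
        prepare_cred(1);
        return 0;
    }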
     

19 Apr, 2010

1 commit

  • The lockdep facility temporarily disables lockdep checking by
    incrementing the current->lockdep_recursion variable. Such
    disabling happens in NMIs and in other situations where lockdep
    might expect to recurse on itself.

    This patch therefore checks current->lockdep_recursion, disabling RCU
    lockdep splats when this variable is non-zero. In addition, this patch
    removes the "likely()", as suggested by Lai Jiangshan.

    Reported-by: Frederic Weisbecker
    Reported-by: David Miller
    Tested-by: Frederic Weisbecker
    Signed-off-by: Paul E. McKenney
    Cc: laijs@cn.fujitsu.com
    Cc: dipankar@in.ibm.com
    Cc: mathieu.desnoyers@polymtl.ca
    Cc: josh@joshtriplett.org
    Cc: dvhltc@us.ibm.com
    Cc: niv@us.ibm.com
    Cc: peterz@infradead.org
    Cc: rostedt@goodmis.org
    Cc: Valdis.Kletnieks@vt.edu
    Cc: dhowells@redhat.com
    Cc: eric.dumazet@gmail.com
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Paul E. McKenney
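
    A hedged userspace analogue of that guard, a per-thread recursion counter
    that suppresses a debug check while the checking facility is busy with
    itself; all names are invented for the sketch:

    #include <stdio.h>

    static __thread int debug_recursion;   /* per-thread, like a task field */

    static int checker_enabled(void)
    {
        /* Splats are suppressed whenever the counter is non-zero. */
        return debug_recursion == 0;
    }

    static void debug_check(const char *what)
    {
        if (!checker_enabled())
            return;                 /* the checker is recursing on itself */
        printf("checking %s\n", what);
    }

    static void checker_internal_work(void)
    {
        debug_recursion++;          /* disable checking while recursing */
        debug_check("internal state (suppressed)");
        debug_recursion--;
    }

    int main(void)
    {
        debug_check("normal path");     /* printed */
        checker_internal_work();        /* suppressed */
        return 0;
    }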
     

11 Apr, 2010

1 commit

  • When CONFIG_DEBUG_BLOCK_EXT_DEVT is set, we decode the device improperly
    with old_decode_dev(), and this results in an error while hibernating
    with s2disk.

    All users already pass the new device number, so switch to
    new_decode_dev().

    Signed-off-by: Jiri Slaby
    Reported-and-tested-by: Jiri Kosina
    Signed-off-by: "Rafael J. Wysocki"

    Jiri Slaby
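
    For context, the two decoders differ in how many major/minor bits they
    can carry; the userspace restatement below follows the kernel's
    include/linux/kdev_t.h helpers as I understand them, so verify against
    that header before relying on the details:

    #include <stdio.h>
    #include <stdint.h>

    #define MINORBITS 20
    #define MKDEV(ma, mi) (((uint32_t)(ma) << MINORBITS) | (mi))

    /* Old 16-bit format: only 8-bit major and 8-bit minor numbers. */
    static uint32_t old_decode_dev(uint16_t val)
    {
        return MKDEV((val >> 8) & 255, val & 255);
    }

    /* New 32-bit format: 12-bit major, 20-bit minor split across the word. */
    static uint32_t new_decode_dev(uint32_t dev)
    {
        unsigned major = (dev & 0xfff00) >> 8;
        unsigned minor = (dev & 0xff) | ((dev >> 12) & 0xfff00);
        return MKDEV(major, minor);
    }

    int main(void)
    {
        uint32_t dev = 0x00010341;   /* arbitrary encoded device number */

        /* The old decoder only sees the low 16 bits and loses the rest. */
        printf("old: %#x  new: %#x\n",
               (unsigned)old_decode_dev((uint16_t)dev),
               (unsigned)new_decode_dev(dev));
        return 0;
    }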
     

06 Apr, 2010

5 commits

  • taskset on 2.6.34-rc3 fails on one of my ppc64 test boxes with
    the following error:

    sched_getaffinity(0, 16, 0x10029650030) = -1 EINVAL (Invalid argument)

    This box has 128 threads and 16 bytes is enough to cover it.

    Commit cd3d8031eb4311e516329aee03c79a08333141f1 (sched:
    sched_getaffinity(): Allow less than NR_CPUS length) is
    comparing these 16 bytes against nr_cpu_ids.

    Fix it by comparing nr_cpu_ids to the number of bits in the
    cpumask we pass in.

    Signed-off-by: Anton Blanchard
    Reviewed-by: KOSAKI Motohiro
    Cc: Sharyathi Nagesh
    Cc: Ulrich Drepper
    Cc: Peter Zijlstra
    Cc: Linus Torvalds
    Cc: Jack Steiner
    Cc: Russ Anderson
    Cc: Mike Travis
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Anton Blanchard
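
    From userspace, the robust way to cope with machines like this is to
    size the cpu set dynamically instead of assuming sizeof(cpu_set_t) is
    enough; a sketch using glibc's CPU_ALLOC interface, growing the buffer
    until the kernel accepts it:

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>

    int main(void)
    {
        for (int ncpus = 128; ncpus <= 65536; ncpus *= 2) {
            cpu_set_t *set = CPU_ALLOC(ncpus);
            size_t size = CPU_ALLOC_SIZE(ncpus);

            if (!set)
                break;
            if (sched_getaffinity(0, size, set) == 0) {
                printf("affinity fits in %zu bytes, %d cpus allowed\n",
                       size, CPU_COUNT_S(size, set));
                CPU_FREE(set);
                return 0;
            }
            CPU_FREE(set);      /* EINVAL: buffer too small, try larger */
        }
        perror("sched_getaffinity");
        return 1;
    }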
     
  • Module refcounting is implemented with a per-cpu counter for speed.
    However there is a race when tallying the counter where a reference may
    be taken by one CPU and released by another. Reference count summation
    may then see the decrement without having seen the previous increment,
    leading to a lower than expected count. A module which never has its
    actual reference count drop below 1 may return a reference count of 0 due
    to this race.

    Module removal generally runs under stop_machine, which prevents this
    race causing bugs due to removal of in-use modules. However there are
    other real bugs in module.c code and driver code (module_refcount is
    exported) where the callers do not run under stop_machine.

    Fix this by maintaining running per-cpu counters for the number of
    module refcount increments and the number of refcount decrements. The
    increments are tallied after the decrements, so any decrement seen will
    always have its corresponding increment counted. The final refcount is
    the difference of the total increments and decrements, preventing a
    spuriously low refcount from being returned.

    Signed-off-by: Nick Piggin
    Acked-by: Rusty Russell
    Signed-off-by: Linus Torvalds

    Nick Piggin
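
    A toy, single-threaded model of the counting scheme, with per-cpu arrays
    standing in for the real per-cpu counters: decrements are summed before
    increments, so a reference taken on one cpu and dropped on another can
    never be seen as decrement-only. The names are invented for the sketch:

    #include <stdio.h>

    #define NR_CPUS 4
    static unsigned long incs[NR_CPUS];
    static unsigned long decs[NR_CPUS];

    static void ref_get(int cpu) { incs[cpu]++; }
    static void ref_put(int cpu) { decs[cpu]++; }

    static long ref_count(void)
    {
        unsigned long inc_sum = 0, dec_sum = 0;
        int cpu;

        /* Read the decrements first... */
        for (cpu = 0; cpu < NR_CPUS; cpu++)
            dec_sum += decs[cpu];
        /* ...then the increments: any decrement we saw has its matching
         * increment already recorded, so it will be counted here too. */
        for (cpu = 0; cpu < NR_CPUS; cpu++)
            inc_sum += incs[cpu];

        return (long)(inc_sum - dec_sum);
    }

    int main(void)
    {
        ref_get(0);         /* reference taken on cpu 0... */
        ref_put(2);         /* ...and dropped on cpu 2     */
        ref_get(1);
        printf("refcount = %ld\n", ref_count());   /* prints 1 */
        return 0;
    }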
     
  • There have been a number of reports of people seeing the message:
    "name_count maxed, losing inode data: dev=00:05, inode=3185"
    in dmesg. These usually lead to people reporting problems to the filesystem
    group, who in turn have no idea what the message means.

    Eventually someone finds me and I explain what is going on and that
    these come from the audit system. The basic problem is that the
    audit subsystem never expects a single syscall to 'interact' (for some
    wishy-washy meaning of interact) with more than 20 inodes. But in fact
    some operations, like loading kernel modules, can cause changes to lots of
    inodes in debugfs.

    There are a couple of real fixes being bandied about, including removing the
    fixed compile-time limit of 20 or not auditing changes in debugfs (or
    both), but neither is small and obvious so I am not sending them for
    immediate inclusion (I hope Al forwards a real solution next devel
    window).

    In the meantime this patch simply adds 'audit' to the beginning of the
    crap message so if a user sees it, they can come blame me first and we can
    talk about what it means and make sure we understand all of the reasons
    it can happen and make sure this gets solved correctly in the long run.

    Signed-off-by: Eric Paris
    Signed-off-by: Linus Torvalds

    Eric Paris
     
  • * 'slabh' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/misc:
    eeepc-wmi: include slab.h
    staging/otus: include slab.h from usbdrv.h
    percpu: don't implicitly include slab.h from percpu.h
    kmemcheck: Fix build errors due to missing slab.h
    include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h
    iwlwifi: don't include iwl-dev.h from iwl-devtrace.h
    x86: don't include slab.h from arch/x86/include/asm/pgtable_32.h

    Fix up trivial conflicts in include/linux/percpu.h due to
    is_kernel_percpu_address() having been introduced since the slab.h
    cleanup with the percpu_up.c splitup.

    Linus Torvalds
     
  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu:
    module: add stub for is_module_percpu_address
    percpu, module: implement and use is_kernel/module_percpu_address()
    module: encapsulate percpu handling better and record percpu_size

    Linus Torvalds
     

03 Apr, 2010

17 commits

  • Now that software events use perf_arch_fetch_caller_regs() too, we
    need the stub version to be always built in for archs that don't
    implement it.

    Fixes the following build error in PARISC:

    kernel/built-in.o: In function `perf_event_task_sched_out':
    (.text.perf_event_task_sched_out+0x54): undefined reference to `perf_arch_fetch_caller_regs'

    Reported-by: Ingo Molnar
    Signed-off-by: Frederic Weisbecker
    Cc: Peter Zijlstra
    Cc: Arnaldo Carvalho de Melo
    Cc: Paul Mackerras

    Frederic Weisbecker
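
    As a generic illustration of the linking problem (an undefined reference
    when an architecture provides no implementation), one common pattern is a
    weak default symbol that a strong per-arch definition overrides. This is
    only an analogy for the always-built-in stub described above, not the
    perf code itself, and the symbol name here is invented:

    #include <stdio.h>

    /* Weak default: does nothing useful, but always satisfies the linker. */
    __attribute__((weak)) void arch_fetch_caller_regs(void)
    {
        printf("generic stub: no arch support\n");
    }

    /* An arch that implements it would simply provide a strong definition
     * of the same symbol in its own object file (not shown here). */

    int main(void)
    {
        arch_fetch_caller_regs();   /* resolves to whichever definition won */
        return 0;
    }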
     
  • * 'kgdb-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jwessel/linux-2.6-kgdb:
    kgdb: Turn off tracing while in the debugger
    kgdb: use atomic_inc and atomic_dec instead of atomic_set
    kgdb: eliminate kgdb_wait(), all cpus enter the same way
    kgdbts,sh: Add in breakpoint pc offset for superh
    kgdb: have ebin2mem call probe_kernel_write once

    Linus Torvalds
     
  • * 'pm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/suspend-2.6:
    Freezer: Fix buggy resume test for tasks frozen with cgroup freezer
    Freezer: Only show the state of tasks refusing to freeze

    Linus Torvalds
     
  • The kernel debugger should turn off kernel tracing any time the
    debugger is active and restore it on resume.

    Signed-off-by: Jason Wessel
    Reviewed-by: Steven Rostedt

    Jason Wessel
     
  • Memory barriers should be used for the kgdb cpu synchronization. The
    atomic_set() does not imply a memory barrier.

    Reported-by: Will Deacon
    Signed-off-by: Jason Wessel

    Jason Wessel
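
    The userspace analogue of the distinction, in C11 atomics: a plain or
    relaxed atomic store is indivisible but implies no ordering, so a flag
    used as a synchronization point needs release/acquire semantics (or an
    explicit fence). The names below are illustrative:

    #include <stdatomic.h>
    #include <pthread.h>
    #include <stdio.h>

    static int payload;                 /* ordinary data */
    static atomic_int ready;            /* synchronization flag */

    static void *producer(void *arg)
    {
        (void)arg;
        payload = 42;
        /* A relaxed store would not guarantee the payload write is visible
         * before the flag; use release ordering (or a fence) instead. */
        atomic_store_explicit(&ready, 1, memory_order_release);
        return NULL;
    }

    static void *consumer(void *arg)
    {
        (void)arg;
        while (!atomic_load_explicit(&ready, memory_order_acquire))
            ;                           /* spin until the flag is published */
        printf("payload = %d\n", payload);   /* guaranteed to print 42 */
        return NULL;
    }

    int main(void)
    {
        pthread_t p, c;
        pthread_create(&c, NULL, consumer, NULL);
        pthread_create(&p, NULL, producer, NULL);
        pthread_join(p, NULL);
        pthread_join(c, NULL);
        return 0;
    }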
     
  • This is a kgdb architectural change to have all the cpus (master or
    slave) enter the same function.

    A cpu that hits an exception (wants to be the master cpu) will call
    kgdb_handle_exception() from the trap handler and then invoke a
    kgdb_roundup_cpu() to synchronize the other cpus and bring them into
    the kgdb_handle_exception() as well.

    A slave cpu will enter kgdb_handle_exception() from the
    kgdb_nmicallback() and set the exception state to note that the
    processor is a slave.

    Previously the slave cpu would have called kgdb_wait(). This change
    allows the debug core to change cpus without resuming the system in
    order to inspect arch specific cpu information.

    Signed-off-by: Jason Wessel

    Jason Wessel
     
  • Rather than call probe_kernel_write() one byte at a time, process the
    whole buffer locally and pass the entire result in one go. This way,
    architectures that need to do special handling based on the length can
    do so, and otherwise we only end up calling memcpy() once.

    [sonic.zhang@analog.com: Reported original problem and preliminary patch]

    Signed-off-by: Jason Wessel
    Signed-off-by: Sonic Zhang
    Signed-off-by: Mike Frysinger

    Jason Wessel
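
    The shape of the change, sketched in userspace terms: decode the escaped
    ('ebin') buffer into a local scratch area first, then hand the whole
    result to a single bulk copy. The '}' / XOR-0x20 escaping follows the GDB
    remote serial protocol as I understand it; treat the details as an
    illustration rather than the kgdb source:

    #include <stdio.h>
    #include <string.h>

    /* '}' (0x7d) escapes the next byte, which is stored XORed with 0x20. */
    static size_t ebin_decode(const unsigned char *in, size_t len,
                              unsigned char *out)
    {
        size_t n = 0;
        for (size_t i = 0; i < len; i++) {
            if (in[i] == 0x7d && i + 1 < len)
                out[n++] = in[++i] ^ 0x20;
            else
                out[n++] = in[i];
        }
        return n;
    }

    int main(void)
    {
        /* The pair 0x7d 0x5d decodes to 0x7d; the rest is literal. */
        const unsigned char in[] = { 'a', 0x7d, 0x5d, 'b' };
        unsigned char scratch[sizeof(in)], target[sizeof(in)];

        size_t n = ebin_decode(in, sizeof(in), scratch);
        memcpy(target, scratch, n);     /* one copy for the whole buffer */

        printf("decoded %zu bytes: %02x %02x %02x\n",
               n, target[0], target[1], target[2]);
        return 0;
    }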
     
  • In order to reduce the dependency on TASK_WAKING rework the enqueue
    interface to support a proper flags field.

    Replace the int wakeup, bool head arguments with an int flags argument
    and create the following flags:

    ENQUEUE_WAKEUP - the enqueue is a wakeup of a sleeping task,
    ENQUEUE_WAKING - the enqueue has relative vruntime due to
    having sched_class::task_waking() called,
    ENQUEUE_HEAD - the waking task should be placed on the head
    of the priority queue (where appropriate).

    For symmetry also convert sched_class::dequeue() to a flags scheme.

    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
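
    A small sketch of the interface shape, replacing separate wakeup/head
    arguments with one flags word; the flag values and the helper below are
    illustrative, not the kernel's definitions:

    #include <stdio.h>

    #define ENQUEUE_WAKEUP  0x01   /* waking a sleeping task               */
    #define ENQUEUE_WAKING  0x02   /* vruntime is relative (task_waking()) */
    #define ENQUEUE_HEAD    0x04   /* place at the head of the queue       */

    static void enqueue_task(const char *name, int flags)
    {
        printf("%s: wakeup=%d waking=%d head=%d\n", name,
               !!(flags & ENQUEUE_WAKEUP),
               !!(flags & ENQUEUE_WAKING),
               !!(flags & ENQUEUE_HEAD));
    }

    int main(void)
    {
        /* Before: enqueue_task(p, wakeup, head) took two fixed arguments
         * that could not grow; after: one flags argument that can. */
        enqueue_task("woken task", ENQUEUE_WAKEUP | ENQUEUE_WAKING);
        enqueue_task("boosted task", ENQUEUE_HEAD);
        return 0;
    }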
     
  • The cpuload calculation in calc_load_account_active() assumes
    rq->nr_uninterruptible will not change on an offline cpu after
    migrate_nr_uninterruptible(). However the recent migrate on wakeup
    changes broke that and would result in decrementing the offline cpu's
    rq->nr_uninterruptible.

    Fix this by accounting the nr_uninterruptible on the waking cpu.

    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Now that we hold the rq->lock over set_task_cpu() again, we can do
    away with most of the TASK_WAKING checks and reduce them again to
    set_cpus_allowed_ptr().

    Removes some conditionals from scheduling hot-paths.

    Signed-off-by: Peter Zijlstra
    Cc: Oleg Nesterov
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Oleg noticed a few races with the TASK_WAKING usage on fork.

    - since TASK_WAKING is basically a spinlock, it should be IRQ safe
    - since we set TASK_WAKING (*) without holding rq->lock, it could
    be that there still is an rq->lock holder, thereby not actually
    providing full serialization.

    (*) in fact we clear PF_STARTING, which in effect enables TASK_WAKING.

    Cure the second issue by not setting TASK_WAKING in sched_fork(), but
    only temporarily in wake_up_new_task() while calling select_task_rq().

    Cure the first by holding rq->lock around the select_task_rq() call,
    this will disable IRQs, this however requires that we push down the
    rq->lock release into select_task_rq_fair()'s cgroup stuff.

    Because select_task_rq_fair() still needs to drop the rq->lock we
    cannot fully get rid of TASK_WAKING.

    Reported-by: Oleg Nesterov
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Introduce cpuset_cpus_allowed_fallback() helper to fix the cpuset problems
    with select_fallback_rq(). It can be called from any context and can't use
    any cpuset locks including task_lock(). It is called when the task doesn't
    have online cpus in ->cpus_allowed but ttwu/etc must be able to find a
    suitable cpu.

    I am not proud of this patch. Everything which needs such a fat comment
    can't be good even if correct. But I'd prefer to not change the locking
    rules in the code I hardly understand, and in any case I believe this
    simple change makes the code much more correct compared to the deadlocks we
    currently have.

    Signed-off-by: Oleg Nesterov
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Oleg Nesterov
     
  • _cpu_down() changes the current task's affinity and then recovers it at
    the end. The problems are well known: we can't restore old_allowed if it
    was bound to the now-dead-cpu, and we can race with the userspace which
    can change cpu-affinity during unplug.

    _cpu_down() should not play with current->cpus_allowed at all. Instead,
    take_cpu_down() can migrate the caller of _cpu_down() after __cpu_disable()
    removes the dying cpu from cpu_online_mask.

    Signed-off-by: Oleg Nesterov
    Acked-by: Rafael J. Wysocki
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Oleg Nesterov
     
  • sched_exec()->select_task_rq() reads/updates ->cpus_allowed lockless.
    This can race with other CPUs updating our ->cpus_allowed, and this
    looks meaningless to me.

    The task is current and running, it must have online cpus in ->cpus_allowed,
    the fallback mode is bogus. And, if ->sched_class returns the "wrong" cpu,
    this likely means we raced with set_cpus_allowed(), which was called
    for a reason; why should sched_exec() retry and call ->select_task_rq()
    again?

    Change the code to call sched_class->select_task_rq() directly and do
    nothing if the returned cpu is wrong after re-checking under rq->lock.

    From now on, task_struct->cpus_allowed is always stable under TASK_WAKING,
    and select_fallback_rq() is always called either under rq->lock or by the
    caller that owns TASK_WAKING (select_task_rq).

    Signed-off-by: Oleg Nesterov
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Oleg Nesterov
     
  • The previous patch preserved the retry logic, but it looks unneeded.

    __migrate_task() can only fail if we raced with migration after we dropped
    the lock, but in this case the caller of set_cpus_allowed/etc must initiate
    migration itself if ->on_rq == T.

    We already fixed p->cpus_allowed; the changes in the active/online masks must
    be visible to the racer, which should then migrate the task to an online cpu
    correctly.

    Signed-off-by: Oleg Nesterov
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Oleg Nesterov
     
  • move_task_off_dead_cpu()->select_fallback_rq() reads/updates ->cpus_allowed
    lockless. We can race with set_cpus_allowed() running in parallel.

    Change it to take rq->lock around select_fallback_rq(). Note that it is not
    trivial to move this spin_lock() into select_fallback_rq(), we must recheck
    the task was not migrated after we take the lock and other callers do not
    need this lock.

    To avoid the races with other callers of select_fallback_rq() which rely on
    TASK_WAKING, we also check p->state != TASK_WAKING and do nothing otherwise.
    The owner of TASK_WAKING must update ->cpus_allowed and choose the correct
    CPU anyway, and the subsequent __migrate_task() is just meaningless because
    p->se.on_rq must be false.

    Alternatively, we could change select_task_rq() to take rq->lock right
    after it calls sched_class->select_task_rq(), but this looks a bit ugly.

    Also, change it to not assume irqs are disabled and absorb __migrate_task_irq().

    Signed-off-by: Oleg Nesterov
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Oleg Nesterov
     
  • This patch just states the fact the cpusets/cpuhotplug interaction is
    broken and removes the deadlockable code which only pretends to work.

    - cpuset_lock() doesn't really work. It is needed for
    cpuset_cpus_allowed_locked() but we can't take this lock in
    the try_to_wake_up()->select_fallback_rq() path.

    - cpuset_lock() is deadlockable. Suppose that a task T bound to CPU takes
    callback_mutex. If cpu_down(CPU) happens before T drops callback_mutex,
    stop_machine() preempts T, and then migration_call(CPU_DEAD) tries to take
    cpuset_lock() and hangs forever because CPU is already dead and thus
    T can't be scheduled.

    - cpuset_cpus_allowed_locked() is deadlockable too. It takes task_lock()
    which is not irq-safe, but try_to_wake_up() can be called from irq.

    Kill them, and change select_fallback_rq() to use cpu_possible_mask, like
    we currently do without CONFIG_CPUSETS.

    Also, with or without this patch, with or without CONFIG_CPUSETS, the
    callers of select_fallback_rq() can race with each other or with
    set_cpus_allowed() paths.

    The subsequent patches try to fix these problems.

    Signed-off-by: Oleg Nesterov
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Oleg Nesterov