19 Feb, 2010

1 commit


17 Feb, 2010

2 commits

  • This patch fixes the following sparse warnings:

    include/linux/kfifo.h:127:25: warning: Using plain integer as NULL pointer
    kernel/kfifo.c:83:21: warning: Using plain integer as NULL pointer

    Signed-off-by: Anton Vorontsov
    Acked-by: Stefani Seibold
    Signed-off-by: Greg Kroah-Hartman

    Anton Vorontsov
     
  • After the kfifo rework it's no longer possible to reliably know whether
    a kfifo is usable, since kfifo_initialized() still returns true after
    kfifo_free(). The correct behaviour is needed by at least the FHCI USB
    driver.

    This patch fixes the issue by resetting the kfifo to zero values (the
    same approach is used in kfifo_alloc() if allocation failed).

    Signed-off-by: Anton Vorontsov
    Acked-by: Stefani Seibold
    Signed-off-by: Greg Kroah-Hartman

    Anton Vorontsov
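
The kfifo_free() reset described above can be illustrated with a minimal userspace sketch. It assumes a simplified FIFO: struct fifo, fifo_alloc(), fifo_free() and fifo_initialized() are hypothetical stand-ins, not the kernel kfifo API. The only point shown is that freeing resets every field to zero, which makes the "is it usable?" check answer false afterwards.

```c
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, simplified stand-in for a kfifo. */
struct fifo {
    unsigned char *buffer;
    unsigned int size;
    unsigned int in;
    unsigned int out;
};

/* Treat the FIFO as usable only while it owns a buffer. */
static bool fifo_initialized(const struct fifo *f)
{
    return f->buffer != NULL;
}

static int fifo_alloc(struct fifo *f, unsigned int size)
{
    f->buffer = malloc(size);
    if (!f->buffer) {
        /* On allocation failure, leave the FIFO zeroed. */
        memset(f, 0, sizeof(*f));
        return -1;
    }
    f->size = size;
    f->in = f->out = 0;
    return 0;
}

static void fifo_free(struct fifo *f)
{
    free(f->buffer);
    /* Reset to zero values so fifo_initialized() now returns false. */
    memset(f, 0, sizeof(*f));
}
```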
     

16 Feb, 2010

3 commits


14 Feb, 2010

1 commit

  • Trying to add a probe like:

    echo p:myprobe 0x10000 > /sys/kernel/debug/tracing/kprobe_events

    will fail since the wrong pointer is passed to strict_strtoul
    when trying to convert the address to an unsigned long.

    Signed-off-by: Heiko Carstens
    Acked-by: Masami Hiramatsu
    Cc: Frederic Weisbecker
    Cc: Steven Rostedt
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Heiko Carstens
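
Below is a hedged sketch of the parsing step. It is a strict unsigned-long parser in the spirit of strict_strtoul(), where the whole string must be a number, and it is applied to the address token rather than the event-name token. The helpers strict_parse_ul() and parse_probe_addr() are hypothetical. The sketch also assumes the command has already been split into argv-style tokens, and it does not reproduce the kernel's actual kprobe_events parser.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical strict parser: the whole string must be a number. */
static int strict_parse_ul(const char *s, int base, unsigned long *res)
{
    char *end;

    if (*s == '\0')
        return -EINVAL;
    errno = 0;
    *res = strtoul(s, &end, base);
    if (errno != 0 || *end != '\0')
        return -EINVAL;
    return 0;
}

/*
 * For "p:myprobe 0x10000", argv[0] is "p:myprobe" and argv[1] is the
 * address, so the pointer handed to the parser must be argv[1].
 */
static int parse_probe_addr(char *argv[], unsigned long *addr)
{
    return strict_parse_ul(argv[1], 0, addr);
}
```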
     

10 Feb, 2010

1 commit


05 Feb, 2010

1 commit


04 Feb, 2010

2 commits

  • Change 'bp_len' type to __u64 to make it work across archs, as
    the s390 architecture watchpoint length can be up to 2^64.

    reference:
    http://lkml.org/lkml/2010/1/25/212

    This is an ABI change that is not backward compatible with
    the previous hardware breakpoint info layout integrated in this
    development cycle: perf tools built from 2.6.33-rc1 through
    2.6.33-rc6 must be rebuilt to work with a kernel based on this
    patch.

    Signed-off-by: Mahesh Salgaonkar
    Acked-by: Peter Zijlstra
    Cc: Ananth N Mavinakayanahalli
    Cc: "K. Prasad"
    Cc: Maneesh Soni
    Cc: Heiko Carstens
    Cc: Martin
    LKML-Reference:
    Signed-off-by: Frederic Weisbecker

    Mahesh Salgaonkar
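
The sketch below shows why a fixed 64-bit field matters for the bp_len change above. It assumes, purely for illustration, that the old field was 32 bits wide; bp_info_old and bp_info_new are hypothetical structs, not perf_event_attr. A length that does not fit in 32 bits is silently truncated in the old layout.

```c
#include <stdint.h>

/* Hypothetical before/after layouts of a breakpoint-length field. */
struct bp_info_old {
    uint64_t bp_addr;
    uint32_t bp_len;   /* assumed 32-bit: too narrow for huge lengths */
};

struct bp_info_new {
    uint64_t bp_addr;
    uint64_t bp_len;   /* fixed-width 64-bit field, same on every arch */
};

/* Store a length and read it back through each layout. */
static uint64_t roundtrip_old(uint64_t len)
{
    struct bp_info_old info = { 0, (uint32_t)len };  /* truncates */
    return info.bp_len;
}

static uint64_t roundtrip_new(uint64_t len)
{
    struct bp_info_new info = { 0, len };
    return info.bp_len;
}
```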
     
  • hrtimer callbacks are always run from hardirq context, either the
    jiffy tick interrupt or the hrtimer device interrupt.

    [ there is currently one exception that can still call a hrtimer
    callback from softirq, but even in that case this will still
    work correctly. ]

    Reported-by: Wei Yongjun
    Signed-off-by: Peter Zijlstra
    Cc: Yury Polyanskiy
    Tested-by: Wei Yongjun
    Acked-by: David S. Miller
    LKML-Reference:
    Signed-off-by: Thomas Gleixner

    Peter Zijlstra
     

03 Feb, 2010

7 commits

  • The WARN_ON in lookup_pi_state which complains about a mismatch
    between pi_state->owner->pid and the pid which we retrieved from the
    user space futex is completely bogus.

    The code just emits the warning and then continues despite the fact
    that it detected an inconsistent state of the futex. This is a
    convenient way for user space to spam the syslog.

    Replace the WARN_ON by a consistency check. If the values do not match
    return -EINVAL and let user space deal with the mess it created.

    This also fixes the missing task_pid_vnr() when we compare the
    pi_state->owner pid with the futex value.

    Reported-by: Jermome Marchand
    Signed-off-by: Thomas Gleixner
    Acked-by: Darren Hart
    Acked-by: Peter Zijlstra
    Cc:

    Thomas Gleixner
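
The consistency check described above is sketched here as a minimal model. struct pi_owner_model and check_pi_owner() are hypothetical. The mask value matches FUTEX_TID_MASK from <linux/futex.h>, and owner_vtid plays the role of task_pid_vnr(pi_state->owner). On a mismatch the function returns -EINVAL instead of warning and carrying on.

```c
#include <errno.h>
#include <stdint.h>

#define TID_MASK 0x3fffffffu  /* same value as FUTEX_TID_MASK */

/* Hypothetical, simplified pi_state: owner TID as seen from user space. */
struct pi_owner_model {
    int has_owner;
    uint32_t owner_vtid;   /* stands in for task_pid_vnr(owner) */
};

/* Reject a futex word whose TID does not match the recorded owner. */
static int check_pi_owner(const struct pi_owner_model *ps, uint32_t uval)
{
    uint32_t tid = uval & TID_MASK;   /* drop the waiters/died bits */

    if (ps->has_owner && tid != ps->owner_vtid)
        return -EINVAL;
    return 0;
}
```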
     
  • If the owner of a PI futex dies we fix up the pi_state and set
    pi_state->owner to NULL. When a malicious or just sloppily programmed
    user space application sets the futex value to 0, e.g. by calling
    pthread_mutex_init(), the futex can be acquired again. A new
    waiter manages to enqueue itself on the pi_state without damage, but on
    unlock the kernel dereferences pi_state->owner and oopses.

    Prevent this by checking pi_state->owner in the unlock path. If
    pi_state->owner is not current we know that user space manipulated the
    futex value. Ignore the mess and return -EINVAL.

    This catches the above case and also the case where a task hijacks the
    futex by setting the tid value and then tries to unlock it.

    Reported-by: Jermome Marchand
    Signed-off-by: Thomas Gleixner
    Acked-by: Darren Hart
    Acked-by: Peter Zijlstra
    Cc:

    Thomas Gleixner
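
The unlock-path check above can be modelled in a few lines. The types and pi_unlock_check() are hypothetical simplifications, not kernel code. Comparing the owner pointer with the current task rejects both a hijacked futex and a NULL owner, so the NULL pointer is never dereferenced.

```c
#include <errno.h>
#include <stddef.h>

struct task_model { int tid; };                     /* hypothetical task */
struct pi_unlock_model { struct task_model *owner; };

/*
 * Only the recorded owner may unlock. A NULL owner (owner died and user
 * space reset the futex word) also fails this test.
 */
static int pi_unlock_check(const struct pi_unlock_model *ps,
                           const struct task_model *current_task)
{
    if (ps->owner != current_task)
        return -EINVAL;
    return 0;
}
```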
     
  • This fixes a futex key reference count bug in futex_lock_pi(),
    where a key's reference count is incremented twice but decremented
    only once, causing the backing object to not be released.

    If the futex is created in a temporary file in an ext3 file system,
    this bug causes the file's inode to become an "undead" orphan,
    which causes an oops from a BUG_ON() in ext3_put_super() when the
    file system is unmounted. glibc's test suite is known to trigger this,
    see .

    The bug is a regression from 2.6.28-git3, namely Peter Zijlstra's
    38d47c1b7075bd7ec3881141bb3629da58f88dab "[PATCH] futex: rely on
    get_user_pages() for shared futexes". That commit made get_futex_key()
    also increment the reference count of the futex key, and updated its
    callers to decrement the key's reference count before returning.
    Unfortunately the normal exit path in futex_lock_pi() wasn't corrected:
    the reference count is incremented by get_futex_key() and queue_lock(),
    but the normal exit path only decrements once, via unqueue_me_pi().
    The fix is to call put_futex_key() after unqueue_me_pi(); since 2.6.31
    this is easily done with 'goto out_put_key' rather than 'goto out'.

    Signed-off-by: Mikael Pettersson
    Acked-by: Peter Zijlstra
    Acked-by: Darren Hart
    Signed-off-by: Thomas Gleixner
    Cc:

    Mikael Pettersson
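
The reference-count imbalance above can be reproduced with a toy model. key_model and lock_pi_model() are hypothetical, and the comments only map each step to the kernel function it imitates. Two references are taken, and the fixed exit path drops both by jumping to out_put_key instead of out.

```c
/* Hypothetical reference-counted futex key. */
struct key_model { int refcount; };

static void get_key(struct key_model *k) { k->refcount++; }
static void put_key(struct key_model *k) { k->refcount--; }

static int lock_pi_model(struct key_model *k, int fixed)
{
    int ret = 0;

    get_key(k);          /* like get_futex_key() */
    get_key(k);          /* like queue_lock() */

    /* ... acquire the lock ... */

    put_key(k);          /* like unqueue_me_pi() */
    if (!fixed)
        goto out;        /* buggy exit: skips the second put */
    goto out_put_key;

out_put_key:
    put_key(k);          /* like put_futex_key() */
out:
    return ret;
}
```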
     
  • …s/security-testing-2.6

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6:
    kernel/cred.c: use kmem_cache_free

    Linus Torvalds
     
  • In cgroup_create(), if alloc_css_id() returns failure, the errno is not
    propagated to userspace, so mkdir will fail silently.

    To trigger this bug, we mount blkio (or the memory subsystem) and create
    more than 65534 cgroups. (The number of cgroups is limited to 65535 if
    a subsystem has use_id == 1.)

    # mount -t cgroup -o blkio xxx /mnt
    # for ((i = 0; i < 65534; i++)); do mkdir /mnt/$i; done
    # mkdir /mnt/65534
    (should return ENOSPC)
    #

    Signed-off-by: Li Zefan
    Acked-by: Serge Hallyn
    Acked-by: Paul Menage
    Acked-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li Zefan
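
The cgroup_create() bug above follows a common error-propagation pattern: the allocator's error must be stored before jumping to the cleanup label. This is a sketch of that pattern only. alloc_id() and create_group() are hypothetical, and MAX_IDS reuses the 65535 limit quoted in the text.

```c
#include <errno.h>

#define MAX_IDS 65535   /* limit quoted in the text for use_id subsystems */

/* Hypothetical ID allocator: fails with -ENOSPC once IDs run out. */
static int alloc_id(int *next_id)
{
    if (*next_id >= MAX_IDS)
        return -ENOSPC;
    return (*next_id)++;
}

static int create_group(int *next_id)
{
    int err = 0;
    int id = alloc_id(next_id);

    if (id < 0) {
        err = id;          /* propagate instead of leaving err == 0 */
        goto err_destroy;
    }
    return 0;

err_destroy:
    /* ... undo partial setup ... */
    return err;
}
```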
     
  • Fix kfifo kernel-doc warnings:

    Warning(kernel/kfifo.c:361): No description found for parameter 'total'
    Warning(kernel/kfifo.c:402): bad line: @ @lenout: pointer to output variable with copied data
    Warning(kernel/kfifo.c:412): No description found for parameter 'lenout'

    Signed-off-by: Randy Dunlap
    Cc: Stefani Seibold
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Randy Dunlap
     
  • Free memory allocated with kmem_cache_zalloc using kmem_cache_free
    rather than kfree.

    The semantic patch that makes this change is as follows:
    (http://coccinelle.lip6.fr/)

    //
    @@
    expression x,E,c;
    @@

     x = \(kmem_cache_alloc\|kmem_cache_zalloc\|kmem_cache_alloc_node\)(c,...)
     ... when != x = E
         when != &x
    ?-kfree(x)
    +kmem_cache_free(c,x)
    //

    Signed-off-by: Julia Lawall
    Acked-by: David Howells
    Cc: James Morris
    Cc: Steve Dickson
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: James Morris

    Julia Lawall
     

02 Feb, 2010

5 commits


01 Feb, 2010

1 commit

  • When CONFIG_HAVE_UNSTABLE_SCHED_CLOCK is set, sched_clock() gets
    the time from hardware such as the TSC on x86. In this
    configuration kgdb will report a soft lockup warning message on
    resuming or detaching from a debug session.

    Sequence of events in the problem case:

    1) "cpu sched clock" and "hardware time" are at 100 sec prior
    to a call to kgdb_handle_exception()

    2) Debugger waits in kgdb_handle_exception() for 80 sec and on
    exit the following is called ... touch_softlockup_watchdog() -->
    __raw_get_cpu_var(touch_timestamp) = 0;

    3) "cpu sched clock" = 100s (it was not updated, because the
    interrupt was disabled in kgdb) but the "hardware time" = 180 sec

    4) The first timer interrupt after resuming from
    kgdb_handle_exception updates the watchdog from the "cpu sched clock"

    update_process_times() { ... run_local_timers() -->
    softlockup_tick() --> check (touch_timestamp == 0) (it is "YES"
    here, we have set "touch_timestamp = 0" at kgdb) -->
    __touch_softlockup_watchdog() ***(A)--> reset "touch_timestamp"
    to "get_timestamp()" (Here, the "touch_timestamp" will still be
    set to 100s.) ...

    scheduler_tick() ***(B)--> sched_clock_tick() (update "cpu sched
    clock" to "hardware time" = 180s) ... }

    5) The second timer interrupt handler sees what appears to be a
    large jump and trips the softlockup warning.

    update_process_times() { ... run_local_timers() -->
    softlockup_tick() --> "cpu sched clock" - "touch_timestamp" =
    180s-100s > 60s --> printk "soft lockup error messages" ... }

    note: ***(A) reset "touch_timestamp" to
    "get_timestamp(this_cpu)"

    Why is "touch_timestamp" 100 sec, instead of 180 sec?

    When CONFIG_HAVE_UNSTABLE_SCHED_CLOCK is set, the call trace of
    get_timestamp() is:

    get_timestamp(this_cpu)
    -->cpu_clock(this_cpu)
    -->sched_clock_cpu(this_cpu)
    -->__update_sched_clock(sched_clock_data, now)

    The __update_sched_clock() function uses the GTOD tick value to
    create a window to normalize the "now" values. So if the "now"
    value is too big for sched_clock_data, it will be ignored.

    The fix is to invoke sched_clock_tick() to update "cpu sched
    clock" in order to recover from this state. This is done by
    introducing the function touch_softlockup_watchdog_sync(). This
    allows kgdb to request that the sched clock is updated when the
    watchdog thread runs the first time after a resume from kgdb.

    [yong.zhang0@gmail.com: Use per cpu instead of an array]
    Signed-off-by: Jason Wessel
    Signed-off-by: Dongdong Deng
    Cc: kgdb-bugreport@lists.sourceforge.net
    Cc: peterz@infradead.org
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Jason Wessel
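
Steps 1) to 5) above can be replayed with a single-CPU model. Everything in it is hypothetical: struct cpu_model, the helper names and the 60 s threshold from the example are illustrations, not the kernel's per-CPU softlockup code. The sync variant refreshes the sched clock before the watchdog samples it, so the second tick no longer reports a soft lockup.

```c
#include <stdbool.h>

#define SOFTLOCKUP_THRESH 60  /* seconds, per the example above */

struct cpu_model {
    unsigned long hw_time;      /* "hardware time" */
    unsigned long sched_clock;  /* "cpu sched clock", updated by ticks */
    unsigned long touch_ts;     /* watchdog timestamp, 0 = touched */
    bool touch_sync;            /* set by the _sync() variant */
};

static void sched_clock_tick(struct cpu_model *c) { c->sched_clock = c->hw_time; }
static void touch_watchdog(struct cpu_model *c) { c->touch_ts = 0; }

static void touch_watchdog_sync(struct cpu_model *c)
{
    c->touch_sync = true;
    c->touch_ts = 0;
}

/* Returns true when a soft lockup would be reported. */
static bool softlockup_tick(struct cpu_model *c)
{
    if (c->touch_ts == 0) {
        if (c->touch_sync) {
            c->touch_sync = false;
            sched_clock_tick(c);        /* refresh before sampling */
        }
        c->touch_ts = c->sched_clock;   /* step (A) */
        return false;
    }
    return c->sched_clock - c->touch_ts > SOFTLOCKUP_THRESH;
}

/* One timer interrupt: watchdog check (A) runs before the sched tick (B). */
static bool timer_interrupt(struct cpu_model *c)
{
    bool warn = softlockup_tick(c);
    sched_clock_tick(c);
    return warn;
}

/* 100 s at entry, 80 s stopped in the debugger, then two ticks. */
static bool second_tick_warns(bool use_sync)
{
    struct cpu_model c = { 100, 100, 100, false };

    c.hw_time += 80;            /* no ticks while kgdb holds the CPU */
    if (use_sync)
        touch_watchdog_sync(&c);
    else
        touch_watchdog(&c);

    timer_interrupt(&c);        /* first tick after resume */
    return timer_interrupt(&c); /* second tick */
}
```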
     

30 Jan, 2010

2 commits

  • This patch fixes the regression in functionality where the
    kernel debugger and the perf API do not nicely share hw
    breakpoint reservations.

    The kernel debugger cannot use any mutex_lock() calls because it
    can start the kernel running from an invalid context.

    A mutex-free version of the reservation API had to be created
    so that the kernel debugger can safely update hw breakpoint
    reservations.

    It is improbable that a breakpoint reservation is being
    processed concurrently at the moment kgdb interrupts the
    system. Should this corner case occur, the end user is
    warned, and the kernel debugger will refuse to update the
    hardware breakpoint reservations.

    Any time the kernel debugger reserves a hardware breakpoint it
    will be a system wide reservation.

    Signed-off-by: Jason Wessel
    Acked-by: Frederic Weisbecker
    Cc: kgdb-bugreport@lists.sourceforge.net
    Cc: K.Prasad
    Cc: Peter Zijlstra
    Cc: Alan Stern
    Cc: torvalds@linux-foundation.org
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Jason Wessel
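
A minimal model of the mutex-free reservation idea follows. The debugger never sleeps on the lock. It only checks whether a regular reservation was in progress and, if so, warns and refuses. struct bp_reservations, dbg_reserve_slot() and the four-slot limit are assumptions. Reading the flag without locking relies on the premise that every other CPU is halted while the debugger runs.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>

#define NR_SLOTS 4   /* assumption: four hardware breakpoint registers */

/* Hypothetical reservation state normally protected by a mutex. */
struct bp_reservations {
    bool mutex_held;   /* "is the reservation mutex currently locked?" */
    int used;
};

static int dbg_reserve_slot(struct bp_reservations *r)
{
    if (r->mutex_held) {
        /* A normal reservation was interrupted mid-update: bail out. */
        fprintf(stderr, "breakpoint reservation in progress, not updating\n");
        return -EBUSY;
    }
    if (r->used >= NR_SLOTS)
        return -ENOSPC;
    r->used++;
    return 0;
}
```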
     
  • In the 2.6.33 kernel, the hw_breakpoint API is now used for the
    performance event counters. The hw_breakpoint_handler() now
    consumes the hw breakpoints that were previously set by kgdb
    arch specific code. In order for kgdb to work in conjunction
    with this core API change, kgdb must use some of the low level
    functions of the hw_breakpoint API to install, uninstall, and
    deal with hw breakpoint reservations.

    The kgdb core required a change to call kgdb_disable_hw_debug
    any time a slave cpu enters kgdb_wait(), in order to keep all the
    hw breakpoints in sync as well as to prevent hitting a hw
    breakpoint while kgdb is active.

    During the architecture specific initialization of kgdb, it will
    pre-allocate 4 disabled (struct perf_event **) structures. Kgdb
    will use these to manage the capabilities for the 4 hw
    breakpoint registers, per cpu. Right now the hw_breakpoint API
    does not have a way to ask how many breakpoints are available
    on each CPU, so it is possible that the install of a breakpoint
    might fail when kgdb restores the system to the run state. The
    intent of this patch is to first get the basic functionality of
    hw breakpoints working and leave it to the person debugging the
    kernel to understand what hw breakpoints are in use and what
    restrictions have been imposed as a result. Breakpoint
    constraints will be dealt with in a future patch.

    While atomic, the x86 specific kgdb code will call
    arch_uninstall_hw_breakpoint() and arch_install_hw_breakpoint()
    to manage the cpu specific hw breakpoints.

    The net result of these changes allows kgdb to use the same pool
    of hw_breakpoints that are used by the perf event API, but
    neither knows about future reservations for the available hw
    breakpoint slots.

    Signed-off-by: Jason Wessel
    Acked-by: Frederic Weisbecker
    Cc: kgdb-bugreport@lists.sourceforge.net
    Cc: K.Prasad
    Cc: Peter Zijlstra
    Cc: Alan Stern
    Cc: torvalds@linux-foundation.org
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Jason Wessel
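
The pre-allocated per-CPU slots described above amount to a small table with one disabled entry per register per CPU. The sketch below is a hypothetical model of that data structure: NR_CPUS_MODEL, struct bp_slot and set_hw_bp() are illustrative, and no perf_event objects are involved. A new breakpoint takes the first free register index and is mirrored on every CPU so they stay in sync.

```c
#include <stdbool.h>

#define NR_CPUS_MODEL 2   /* hypothetical CPU count */
#define HBP_NUM 4         /* four breakpoint registers per CPU */

struct bp_slot {
    unsigned long addr;
    bool enabled;         /* pre-allocated slots start disabled */
};

static struct bp_slot slots[NR_CPUS_MODEL][HBP_NUM];

/* Returns the register index used, or -1 if all four are taken. */
static int set_hw_bp(unsigned long addr)
{
    for (int i = 0; i < HBP_NUM; i++) {
        if (slots[0][i].enabled)
            continue;
        for (int cpu = 0; cpu < NR_CPUS_MODEL; cpu++) {
            slots[cpu][i].addr = addr;
            slots[cpu][i].enabled = true;
        }
        return i;
    }
    return -1;
}
```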
     

28 Jan, 2010

3 commits

  • On a given architecture, when hardware breakpoint registration fails
    due to an unsupported access type (read/write/execute), we lose the bp
    slot since register_perf_hw_breakpoint() does not release the bp slot
    on failure.
    Hence, any subsequent hardware breakpoint registration starts failing
    with a 'no space left on device' error.

    This patch introduces error handling in register_perf_hw_breakpoint()
    and releases the bp slot on error.

    Signed-off-by: Mahesh Salgaonkar
    Cc: Ananth N Mavinakayanahalli
    Cc: K. Prasad
    Cc: Maneesh Soni
    LKML-Reference:
    Signed-off-by: Frederic Weisbecker

    Mahesh Salgaonkar
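
The leak above and its fix fit a reserve, validate, release-on-failure pattern. All names here (reserve_slot(), arch_validate(), register_bp(), the four-slot limit) are hypothetical. Without the release_slot() call, repeated unsupported requests would exhaust the slots and every later registration would fail with -ENOSPC.

```c
#include <errno.h>

#define NR_BP_SLOTS 4    /* assumption */

static int slots_used;   /* hypothetical slot accounting */

static int reserve_slot(void)
{
    if (slots_used >= NR_BP_SLOTS)
        return -ENOSPC;
    slots_used++;
    return 0;
}

static void release_slot(void) { slots_used--; }

/* Hypothetical arch validation: reject unsupported access types. */
static int arch_validate(int access_type_supported)
{
    return access_type_supported ? 0 : -EINVAL;
}

static int register_bp(int access_type_supported)
{
    int ret = reserve_slot();

    if (ret)
        return ret;
    ret = arch_validate(access_type_supported);
    if (ret)
        release_slot();   /* give the slot back instead of leaking it */
    return ret;
}
```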
     
  • Due to an incorrect line break the output currently contains tabs.
    Also remove trailing space.

    The actual output that logcheck sent me looked like this:
    Task events/1 (pid = 10) is on cpu 1^I^I^I^I(state = 1, flags = 84208040)

    After this patch it becomes:
    Task events/1 (pid = 10) is on cpu 1 (state = 1, flags = 84208040)

    Signed-off-by: Frans Pop
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Frans Pop
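
One common way stray tabs get into a message is a backslash-newline inside a string literal, which pulls the next line's indentation into the format string. Whether that was the exact cause here is an assumption. The usual fix is to split the format into adjacent string literals, which the compiler concatenates. format_task_line() is a hypothetical helper that reproduces the corrected line from the example.

```c
#include <stdio.h>

/* Adjacent literals are joined at compile time; no indentation leaks in. */
static int format_task_line(char *buf, size_t len, const char *comm,
                            int pid, int cpu, long state, unsigned int flags)
{
    return snprintf(buf, len, "Task %s (pid = %d) is on cpu %d "
                    "(state = %ld, flags = %x)",
                    comm, pid, cpu, state, flags);
}
```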
     
  • We moved to migrate on wakeup, which means that sleeping tasks could
    still be present on offline cpus. Amend the check to only test running
    tasks.

    Reported-by: Heiko Carstens
    Signed-off-by: Peter Zijlstra
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
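
The amended offline-CPU check above can be modelled as counting only running tasks. struct task_entry, TASK_RUNNING_MODEL and count_stuck_tasks() are hypothetical. With migrate-on-wakeup, a sleeping task still assigned to the dead CPU is expected, so it is not counted.

```c
#define TASK_RUNNING_MODEL 0   /* hypothetical "running" state value */

struct task_entry { int cpu; int state; };

static int count_stuck_tasks(const struct task_entry *t, int n, int dead_cpu)
{
    int stuck = 0;

    for (int i = 0; i < n; i++)
        if (t[i].cpu == dead_cpu && t[i].state == TASK_RUNNING_MODEL)
            stuck++;   /* only a running task on an offline CPU is a bug */
    return stuck;
}
```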
     

27 Jan, 2010

4 commits

  • Lockdep has found the real bug, but the output doesn't look right to me:

    > =========================================================
    > [ INFO: possible irq lock inversion dependency detected ]
    > 2.6.33-rc5 #77
    > ---------------------------------------------------------
    > emacs/1609 just changed the state of lock:
    > (&(&tty->ctrl_lock)->rlock){+.....}, at: [] tty_fasync+0xe8/0x190
    > but this lock took another, HARDIRQ-unsafe lock in the past:
    > (&(&sighand->siglock)->rlock){-.....}

    "HARDIRQ-unsafe" and "this lock took another" looks wrong, afaics.

    > ... key at: [] __key.46539+0x0/0x8
    > ... acquired at:
    > [] __lock_acquire+0x1056/0x15a0
    > [] lock_acquire+0x9f/0x120
    > [] _raw_spin_lock_irqsave+0x52/0x90
    > [] __proc_set_tty+0x3e/0x150
    > [] tty_open+0x51d/0x5e0

    The stack-trace shows that this lock (ctrl_lock) was taken under
    ->siglock (which is hopefully irq-safe).

    This is a clear typo in check_usage_backwards(), where we tell the
    fancy print routine that we're going forwards.

    Signed-off-by: Oleg Nesterov
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Oleg Nesterov
     
  • Update the graph tracer examples to cover the new frame pointer semantics
    (in terms of passing it along). Move the HAVE_FUNCTION_GRAPH_FP_TEST docs
    out of the Kconfig, into the right place, and expand on the details.

    Signed-off-by: Mike Frysinger
    LKML-Reference:
    Signed-off-by: Steven Rostedt

    Mike Frysinger
     
  • If the iterator comes to an empty page for some reason, or if
    the page is emptied by a consuming read, the iterator code currently
    does not check whether the iterator is past the contents, and may
    return a false entry.

    This patch adds a check to the ring buffer iterator to test if the
    current page has been completely read and sets the iterator to the
    next page if necessary.

    Signed-off-by: Steven Rostedt

    Steven Rostedt
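
The empty-page check above is sketched here on a simplified, non-wrapping array of pages, where the last page is where the writer currently is. The types and rb_iter_peek_model() are hypothetical and much simpler than the kernel ring buffer. Before returning an entry, the iterator skips any page it has already read to the end, including empty pages.

```c
#include <stddef.h>

#define NPAGES 4

/* Hypothetical ring-buffer page: 'commit' bytes of valid data. */
struct rb_page_model { size_t commit; };

struct rb_iter_model {
    const struct rb_page_model *pages;
    size_t page;    /* current page index */
    size_t head;    /* read offset within the current page */
};

/* Returns 0 if an entry is available at (page, head), -1 at the end. */
static int rb_iter_peek_model(struct rb_iter_model *it)
{
    while (it->head >= it->pages[it->page].commit) {
        if (it->page + 1 >= NPAGES)
            return -1;              /* caught up with the writer */
        it->page++;
        it->head = 0;
    }
    return 0;
}
```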
     
  • Reads of the ring buffer are usually performed by a single task.
    There are two types of reads from the ring buffer.

    One is a consuming read which will consume the entry that was read
    and the next read will be the entry that follows.

    The other is an iterator that will let the user read the contents of
    the ring buffer without modifying it. When an iterator is allocated,
    writes to the ring buffer are disabled to protect the iterator.

    The problem exists when consuming reads happen while an iterator is
    allocated. Specifically, this affects the kind of read that swaps out
    an entire page (used by splice) and replaces it with a new one. If the
    iterator is on the page that is swapped out, then the next read may
    read from this swapped-out page and return garbage.

    This patch adds a check when reading the iterator to make sure that
    the iterator contents are still valid. If a consuming read has taken
    place, the iterator is reset.

    Signed-off-by: Steven Rostedt

    Steven Rostedt
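
One way to implement the validity check above is for the iterator to snapshot state that consuming reads change, and to reset when that state differs. In this sketch the structs, fields and functions are hypothetical. The reader-page id and read counter are assumed to be what a consuming read updates.

```c
#include <stdbool.h>

struct cpu_buffer_model {
    int reader_page;        /* id of the page owned by the reader */
    unsigned long read;     /* bumped by every consuming read */
};

struct iter_model {
    struct cpu_buffer_model *cpu_buffer;
    int cache_reader_page;  /* snapshot taken at reset */
    unsigned long cache_read;
    int position;           /* stand-in for page/offset */
};

static void iter_reset(struct iter_model *it)
{
    it->cache_reader_page = it->cpu_buffer->reader_page;
    it->cache_read = it->cpu_buffer->read;
    it->position = 0;
}

/* Returns false (and resets) if a consuming read happened meanwhile. */
static bool iter_validate(struct iter_model *it)
{
    if (it->cache_read != it->cpu_buffer->read ||
        it->cache_reader_page != it->cpu_buffer->reader_page) {
        iter_reset(it);
        return false;
    }
    return true;
}
```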
     

26 Jan, 2010

2 commits

  • commit 0f8e8ef7 (clocksource: Simplify clocksource watchdog resume
    logic) introduced a potential kgdb deadlock. When the kernel is
    stopped by kgdb inside code which holds watchdog_lock, kgdb
    deadlocks in clocksource_resume_watchdog().

    clocksource_resume_watchdog() is called from kgdb via
    clocksource_touch_watchdog() to prevent the clocksource watchdog
    from marking the TSC unstable after the kernel has been stopped.

    Solve this by replacing spin_lock with spin_trylock and simply
    returning if the lock is held. Not resetting the watchdog might result
    in the TSC being marked unstable, but that's an acceptable penalty for
    using kgdb.

    Timekeeping is easily screwed up by kgdb anyway when the system
    uses either jiffies or a clock source which wraps in short intervals
    (e.g. pm_timer wraps about every 4.6s), so we really do not have to
    worry about the occasional side effect of the TSC being marked unstable.

    The second caller of clocksource_resume_watchdog() is
    clocksource_resume(). The trylock is safe here as well because the
    system is UP at this point, interrupts are disabled and nothing else
    can hold watchdog_lock.

    Reported-by: Jason Wessel
    LKML-Reference:
    Cc: kgdb-bugreport@lists.sourceforge.net
    Cc: Martin Schwidefsky
    Cc: John Stultz
    Cc: Andrew Morton
    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
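
The trylock approach above can be illustrated in userspace with a spinlock built on a C11 atomic_flag. This is an analogy, not the kernel's spinlock implementation. The lock and helper names are hypothetical, and watchdog_reset_done stands in for actually resetting the watchdog. If the lock is already held, the resume path returns without touching the watchdog instead of spinning forever.

```c
#include <stdatomic.h>
#include <stdbool.h>

static atomic_flag watchdog_lock_model = ATOMIC_FLAG_INIT;
static bool watchdog_reset_done;

static bool spin_trylock_model(atomic_flag *lock)
{
    /* test_and_set returns the previous value: false means we got it. */
    return !atomic_flag_test_and_set(lock);
}

static void spin_unlock_model(atomic_flag *lock)
{
    atomic_flag_clear(lock);
}

static void resume_watchdog_model(void)
{
    if (!spin_trylock_model(&watchdog_lock_model))
        return;                   /* held: skip rather than deadlock */
    watchdog_reset_done = true;   /* stands in for the watchdog reset */
    spin_unlock_model(&watchdog_lock_model);
}
```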
     
  • If the contents of the ftrace ring buffer get corrupted and the trace
    file is read, it could create a kernel oops (usually just killing the user
    task thread). This is caused by the checking of the pid in the buffer.
    If the pid is negative, it still references the cmdline cache array,
    which could point to an invalid address.

    The simple fix is to test for negative PIDs.

    Signed-off-by: Steven Rostedt

    Steven Rostedt
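
The negative-PID test above is a bounds check before using an untrusted value as an array index. The sketch uses a hypothetical pid-to-slot map. PID_MAX_MODEL, NR_SAVED, find_cmdline_model() and both placeholder strings are illustrative, not the kernel's trace_find_cmdline().

```c
#include <string.h>

#define PID_MAX_MODEL 32768   /* hypothetical map size */
#define NR_SAVED 8
#define COMM_LEN 16

static int map_pid_to_cmdline[PID_MAX_MODEL + 1];
static char saved_cmdlines[NR_SAVED][COMM_LEN];

static void find_cmdline_model(int pid, char comm[COMM_LEN])
{
    int slot;

    /* A corrupted, negative pid would index before the array start. */
    if (pid < 0 || pid > PID_MAX_MODEL) {
        strcpy(comm, "<bad pid>");
        return;
    }
    slot = map_pid_to_cmdline[pid];
    if (slot < 0 || slot >= NR_SAVED) {
        strcpy(comm, "<...>");
        return;
    }
    strcpy(comm, saved_cmdlines[slot]);
}
```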
     

25 Jan, 2010

2 commits


22 Jan, 2010

2 commits

  • There are a number of issues:

    1) TASK_WAKING vs cgroup_clone (cpusets)

    copy_process():

      sched_fork()
        child->state = TASK_WAKING; /* waiting for wake_up_new_task() */
      if (current->nsproxy != p->nsproxy)
        ns_cgroup_clone()
          cgroup_clone()
            mutex_lock(inode->i_mutex)
            mutex_lock(cgroup_mutex)
            cgroup_attach_task()
              ss->can_attach()
              ss->attach() [ -> cpuset_attach() ]
                cpuset_attach_task()
                  set_cpus_allowed_ptr();
                    while (child->state == TASK_WAKING)
                      cpu_relax();
    will deadlock the system.

    2) cgroup_clone (cpusets) vs copy_process

    So even if the above would work we still have:

    copy_process():

      if (current->nsproxy != p->nsproxy)
        ns_cgroup_clone()
          cgroup_clone()
            mutex_lock(inode->i_mutex)
            mutex_lock(cgroup_mutex)
            cgroup_attach_task()
              ss->can_attach()
              ss->attach() [ -> cpuset_attach() ]
                cpuset_attach_task()
                  set_cpus_allowed_ptr();
      ...

      p->cpus_allowed = current->cpus_allowed

    over-writing the modified cpus_allowed.

    3) fork() vs hotplug

    If we unplug the child's cpu after the sanity check, when the child
    gets attached to the task_list but before wake_up_new_task(), shit
    will meet fan.

    Solve all these issues by moving fork cpu selection into
    wake_up_new_task().

    Reported-by: Serge E. Hallyn
    Tested-by: Serge E. Hallyn
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Thomas Gleixner

    Peter Zijlstra
     
  • …/git/tip/linux-2.6-tip

    * 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
    perf: x86: Add support for the ANY bit
    perf: Change the is_software_event() definition
    perf: Honour event state for aux stream data
    perf: Fix perf_event_do_pending() fallback callsite
    perf kmem: Print usage help for unknown commands
    perf kmem: Increase "Hit" column length
    hw-breakpoints, perf: Fix broken mmiotrace due to dr6 by reference change
    perf timechart: Use tid not pid for COMM change

    Linus Torvalds
     

21 Jan, 2010

1 commit

  • Anton reported that perf record kept receiving events even after calling
    ioctl(PERF_EVENT_IOC_DISABLE). It turns out that FORK, COMM and MMAP
    events didn't respect the disabled state and kept flowing in.

    Reported-by: Anton Blanchard
    Signed-off-by: Peter Zijlstra
    Tested-by: Anton Blanchard
    LKML-Reference:
    CC: stable@kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
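
The fix above amounts to side-band output checking the event state just as sample output does. This is a hypothetical model. The enum values and output_aux_record() are illustrative, and it assumes a disable request moves the event to an "off" state below "inactive".

```c
enum event_state_model { EV_OFF = -1, EV_INACTIVE = 0, EV_ACTIVE = 1 };

struct event_model {
    enum event_state_model state;
    int records;          /* FORK/COMM/MMAP-style records emitted */
};

/* A disabled (OFF) event must not produce side-band records. */
static void output_aux_record(struct event_model *ev)
{
    if (ev->state < EV_INACTIVE)
        return;
    ev->records++;
}
```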