23 Aug, 2010

1 commit


22 Aug, 2010

1 commit

  • With the introduction of the new unified work queue thread pools,
    we lost one feature: It's no longer possible to know which worker
    is causing the CPU to wake out of idle. The result is that PowerTOP
    now reports a lot of "kworker/a:b" instead of more readable results.

    This patch adds a pair of tracepoints to the new workqueue code,
    similar in style to the timer/hrtimer tracepoints.

    With this pair of tracepoints, the next PowerTOP can correctly
    report which work item caused the wakeup (and how long it took):

    Interrupt (43) i915 time 3.51ms wakeups 141
    Work ieee80211_iface_work time 0.81ms wakeups 29
    Work do_dbs_timer time 0.55ms wakeups 24
    Process Xorg time 21.36ms wakeups 4
    Timer sched_rt_period_timer time 0.01ms wakeups 1

    Signed-off-by: Arjan van de Ven
    Signed-off-by: Linus Torvalds

    Arjan van de Ven
     

21 Aug, 2010

8 commits

  • It's a really simple list, and several of the users want to go backwards
    in it to find the previous vma. So rather than have to look up the
    previous entry with 'find_vma_prev()' or something similar, just make it
    doubly linked instead.

    Tested-by: Ian Campbell
    Signed-off-by: Linus Torvalds

    Linus Torvalds
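
The motivation above can be illustrated with a small userspace sketch. It is not the kernel's vma code: `struct region`, `region_insert_after()` and `region_prev_by_walk()` are hypothetical names chosen for illustration. The sketch contrasts following a stored back link with walking the list from its head to find the previous entry.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a memory region; not the kernel's vm_area_struct. */
struct region {
    unsigned long start, end;
    struct region *next;   /* next region in address order */
    struct region *prev;   /* back link: the kind of field a doubly linked list adds */
};

/* Insert 'node' after 'pos' (or at the head when pos is NULL), keeping both links valid. */
static void region_insert_after(struct region **head, struct region *pos,
                                struct region *node)
{
    node->prev = pos;
    if (pos) {
        node->next = pos->next;
        pos->next = node;
    } else {
        node->next = *head;
        *head = node;
    }
    if (node->next)
        node->next->prev = node;
}

/* Singly linked approach: walk from the head to find the predecessor, O(n).
 * 'node' is assumed to be on the list. */
static struct region *region_prev_by_walk(struct region *head, struct region *node)
{
    struct region *prev = NULL;

    for (struct region *r = head; r && r != node; r = r->next)
        prev = r;
    return prev;
}
```

With the back link in place, a caller reads `node->prev` directly instead of calling a lookup helper; the cost is one extra pointer per entry and one extra store on insert and remove.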
     
  • kfifo_skip() is currently broken because its internal helper function is
    missing. Add it.

    Signed-off-by: Andrea Righi
    Cc: Greg KH
    Acked-by: Stefani Seibold
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Righi
     
  • Because list_empty() does not dereference any RCU-protected pointers, and
    further does not pass such pointers to the caller (so that the caller
    does not dereference them either), it is safe to use list_empty() on
    RCU-protected lists. There is no need for a list_empty_rcu(). This
    commit adds a comment stating this explicitly.

    Requested-by: Andrew Morton
    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
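
The reasoning above can be sketched in userspace C. This is a simplified model rather than the kernel's list implementation: `struct list_node`, `list_init()`, `list_is_empty()` and `list_add_front()` are hypothetical names. The point is that the emptiness test only compares a pointer value against the head's own address and never follows the pointer to another entry.

```c
#include <assert.h>

/* Hypothetical circular doubly linked list node, modeled on the usual layout. */
struct list_node {
    struct list_node *next, *prev;
};

/* An empty list is a head that points at itself. */
static void list_init(struct list_node *head)
{
    head->next = head;
    head->prev = head;
}

/* Compares the value of head->next with the head's address.
 * The entry that head->next points to is never dereferenced. */
static int list_is_empty(const struct list_node *head)
{
    return head->next == head;
}

/* Insert 'node' directly after 'head'. */
static void list_add_front(struct list_node *node, struct list_node *head)
{
    node->next = head->next;
    node->prev = head;
    head->next->prev = node;
    head->next = node;
}
```

Because the check touches only the head itself, it cannot read memory belonging to an entry that a concurrent updater might be freeing, which is the property the commit message relies on.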
     
  • The CONFIG_PREEMPT_RCU kernel configuration parameter was recently
    re-introduced, but as an indication of the type of RCU (preemptible
    vs. non-preemptible) instead of as selecting a given implementation.
    This commit uses CONFIG_PREEMPT_RCU to combine duplicate code
    from include/linux/rcutiny.h and include/linux/rcutree.h into
    include/linux/rcupdate.h. This commit also combines a few other pieces
    of duplicate code that have accumulated.

    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • It is illegal to wait for an SRCU grace period while within the
    corresponding flavor of SRCU read-side critical section. Therefore,
    this commit updates the srcu_read_lock() docbook accordingly.

    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • Combine the duplicate definitions of ULONG_CMP_GE(), ULONG_CMP_LT(),
    and rcu_preempt_depth() into include/linux/rcupdate.h.

    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
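
For readers unfamiliar with these helpers, the following userspace sketch shows what wraparound-tolerant counter comparison looks like. The two macro bodies are written from memory of the kernel header and should be checked against include/linux/rcupdate.h; `counter_reached()` is a hypothetical helper added for illustration.

```c
#include <assert.h>
#include <limits.h>

/* Wraparound-tolerant comparisons of unsigned long counters.
 * Assumption: these bodies match the kernel's definitions; verify before relying on them. */
#define ULONG_CMP_GE(a, b) (ULONG_MAX / 2 >= (a) - (b))
#define ULONG_CMP_LT(a, b) (ULONG_MAX / 2 < (a) - (b))

/* Hypothetical helper: has the counter 'now' reached 'target', even if
 * 'now' has wrapped past ULONG_MAX since 'target' was computed? */
static int counter_reached(unsigned long now, unsigned long target)
{
    return ULONG_CMP_GE(now, target);
}
```

The comparison treats the two values as close together on a circle: the unsigned difference `a - b` is small when `a` is at or slightly ahead of `b`, and huge when `a` is behind. A plain `>=` gives the wrong answer once the counter wraps.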
     
  • When using a kernel debugger, a long sojourn in the debugger can get
    you lots of RCU CPU stall warnings once you resume. This might not be
    helpful, especially if you are using the system console. This patch
    therefore allows RCU CPU stall warnings to be suppressed, but only for
    the duration of the current set of grace periods.

    This differs from Jason's original patch in that it adds support for
    tiny RCU and preemptible RCU, and uses a slightly different method for
    suppressing the RCU CPU stall warning messages.

    Signed-off-by: Jason Wessel
    Signed-off-by: Paul E. McKenney
    Tested-by: Jason Wessel

    Paul E. McKenney
     
  • The comment says that blocking is illegal in rcu_read_lock()-style
    RCU read-side critical sections, which is no longer entirely true
    given preemptible RCU. This commit provides a fix.

    Suggested-by: David Miller
    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     

20 Aug, 2010

18 commits

  • Implement a small-memory-footprint uniprocessor-only implementation of
    preemptible RCU. This implementation uses but a single blocked-tasks
    list rather than the combinatorial number used per leaf rcu_node by
    TREE_PREEMPT_RCU, which reduces memory consumption and greatly simplifies
    processing. This version also takes advantage of uniprocessor execution
    to accelerate grace periods in the case where there are no readers.

    The general design is otherwise broadly similar to that of TREE_PREEMPT_RCU.

    This implementation is a step towards having RCU implementation driven
    off of the SMP and PREEMPT kernel configuration variables, which can
    happen once this implementation has accumulated sufficient experience.

    Removed ACCESS_ONCE() from __rcu_read_unlock() and added barrier() as
    suggested by Steve Rostedt in order to avoid the compiler-reordering
    issue noted by Mathieu Desnoyers (http://lkml.org/lkml/2010/8/16/183).

    As can be seen below, CONFIG_TINY_PREEMPT_RCU saves almost 5 Kbytes
    compared to CONFIG_TREE_PREEMPT_RCU. Of course, for non-real-time
    workloads, CONFIG_TINY_RCU is even better.

    CONFIG_TREE_PREEMPT_RCU

       text    data     bss     dec   filename
         13       0       0      13   kernel/rcupdate.o
       6170     825      28    7023   kernel/rcutree.o
                               ----
                               7026   Total

    CONFIG_TINY_PREEMPT_RCU

       text    data     bss     dec   filename
         13       0       0      13   kernel/rcupdate.o
       2081      81       8    2170   kernel/rcutiny.o
                               ----
                               2183   Total

    CONFIG_TINY_RCU (non-preemptible)

       text    data     bss     dec   filename
         13       0       0      13   kernel/rcupdate.o
        719      25       0     744   kernel/rcutiny.o
                                ---
                                757   Total

    Requested-by: Loïc Minier
    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • RCU heads really don't need to be initialized. Their state before call_rcu()
    really does not matter.

    We need to keep init/destroy_rcu_head_on_stack() though, since we want
    debugobjects to be able to keep track of these objects.

    Signed-off-by: Alexey Dobriyan
    Signed-off-by: Mathieu Desnoyers
    CC: David S. Miller
    CC: "Paul E. McKenney"
    CC: akpm@linux-foundation.org
    CC: mingo@elte.hu
    CC: laijs@cn.fujitsu.com
    CC: dipankar@in.ibm.com
    CC: josh@joshtriplett.org
    CC: dvhltc@us.ibm.com
    CC: niv@us.ibm.com
    CC: tglx@linutronix.de
    CC: peterz@infradead.org
    CC: rostedt@goodmis.org
    CC: Valdis.Kletnieks@vt.edu
    CC: dhowells@redhat.com
    CC: eric.dumazet@gmail.com
    CC: Alexey Dobriyan
    Signed-off-by: Paul E. McKenney
    Reviewed-by: Josh Triplett

    Mathieu Desnoyers
     
  • This adds annotations for RCU operations in core kernel components.

    Signed-off-by: Arnd Bergmann
    Signed-off-by: Paul E. McKenney
    Cc: Al Viro
    Cc: Jens Axboe
    Cc: Andrew Morton
    Reviewed-by: Josh Triplett

    Arnd Bergmann
     
  • Signed-off-by: Arnd Bergmann
    Signed-off-by: Paul E. McKenney
    Cc: Manfred Spraul
    Reviewed-by: Josh Triplett

    Arnd Bergmann
     
  • Signed-off-by: Arnd Bergmann
    Signed-off-by: Paul E. McKenney
    Cc: Nick Piggin
    Reviewed-by: Josh Triplett

    Arnd Bergmann
     
  • Signed-off-by: Arnd Bergmann
    Signed-off-by: Paul E. McKenney
    Cc: Alan Cox
    Reviewed-by: Josh Triplett

    Arnd Bergmann
     
  • Make it explicit that new RCU read-side critical sections that start
    after call_rcu() and synchronize_rcu() start might still be running
    after the end of the relevant grace period.

    Signed-off-by: Paul E. McKenney
    Reviewed-by: Josh Triplett

    Paul E. McKenney
     
  • find_task_by_vpid() says "Must be called under rcu_read_lock()." But due to
    commit 3120438 "rcu: Disable lockdep checking in RCU list-traversal primitives",
    we are currently unable to catch "find_task_by_vpid() with tasklist_lock held
    but RCU lock not held" errors due to the RCU-lockdep checks being
    suppressed in the RCU variants of the struct list_head traversals.
    This commit therefore places an explicit check for being in an RCU
    read-side critical section in find_task_by_pid_ns().

    ===================================================
    [ INFO: suspicious rcu_dereference_check() usage. ]
    ---------------------------------------------------
    kernel/pid.c:386 invoked rcu_dereference_check() without protection!

    other info that might help us debug this:

    rcu_scheduler_active = 1, debug_locks = 1
    1 lock held by rc.sysinit/1102:
    #0: (tasklist_lock){.+.+..}, at: [] sys_setpgid+0x40/0x160

    stack backtrace:
    Pid: 1102, comm: rc.sysinit Not tainted 2.6.35-rc3-dirty #1
    Call Trace:
    [] lockdep_rcu_dereference+0x94/0xb0
    [] find_task_by_pid_ns+0x6d/0x70
    [] find_task_by_vpid+0x18/0x20
    [] sys_setpgid+0x47/0x160
    [] sysenter_do_call+0x12/0x36

    Commit updated to use a new rcu_lockdep_assert() exported API rather than
    the old internal __do_rcu_dereference().

    Signed-off-by: Tetsuo Handa
    Signed-off-by: Paul E. McKenney
    Reviewed-by: Josh Triplett

    Tetsuo Handa
     
  • Signed-off-by: Arnd Bergmann
    Signed-off-by: Paul E. McKenney
    Cc: Avi Kivity
    Cc: Marcelo Tosatti
    Reviewed-by: Josh Triplett

    Arnd Bergmann
     
  • Signed-off-by: Arnd Bergmann
    Signed-off-by: Paul E. McKenney
    Acked-by: Patrick McHardy
    Cc: "David S. Miller"
    Cc: Eric Dumazet
    Reviewed-by: Josh Triplett

    Arnd Bergmann
     
  • Signed-off-by: Arnd Bergmann
    Signed-off-by: Paul E. McKenney
    Cc: Dmitry Torokhov
    Acked-by: Dmitry Torokhov
    Reviewed-by: Josh Triplett

    Arnd Bergmann
     
  • Signed-off-by: Arnd Bergmann
    Signed-off-by: Paul E. McKenney
    Acked-by: Trond Myklebust

    Arnd Bergmann
     
  • Signed-off-by: Arnd Bergmann
    Signed-off-by: Paul E. McKenney
    Acked-by: David Howells
    Reviewed-by: Josh Triplett

    Arnd Bergmann
     
  • Signed-off-by: Arnd Bergmann
    Signed-off-by: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Ingo Molnar
    Acked-by: David Howells
    Reviewed-by: Josh Triplett

    Arnd Bergmann
     
  • Signed-off-by: Arnd Bergmann
    Signed-off-by: Paul E. McKenney
    Acked-by: Paul Menage
    Cc: Li Zefan
    Reviewed-by: Josh Triplett

    Arnd Bergmann
     
  • This avoids warnings from missing __rcu annotations
    in the rculist implementation, making it possible to
    use the same lists in both RCU and non-RCU cases.

    We can add rculist annotations later, together with
    lockdep support for rculist, which is missing as well,
    but that may involve changing all the users.

    Signed-off-by: Arnd Bergmann
    Signed-off-by: Paul E. McKenney
    Cc: Pavel Emelyanov
    Cc: Sukadev Bhattiprolu
    Reviewed-by: Josh Triplett

    Arnd Bergmann
     
  • This commit provides definitions for the __rcu annotation defined earlier.
    This annotation permits sparse to check for correct use of RCU-protected
    pointers. If a pointer that is annotated with __rcu is accessed
    directly (as opposed to via rcu_dereference(), rcu_assign_pointer(),
    or one of their variants), sparse can be made to complain. To enable
    such complaints, use the new default-disabled CONFIG_SPARSE_RCU_POINTER
    kernel configuration option. Please note that these sparse complaints are
    intended to be a debugging aid, -not- a code-style-enforcement mechanism.

    There are special rcu_dereference_protected() and rcu_access_pointer()
    accessors for use when RCU read-side protection is not required, for
    example, when no other CPU has access to the data structure in question
    or while the current CPU holds the update-side lock.

    This patch also updates a number of docbook comments that were showing
    their age.

    Signed-off-by: Arnd Bergmann
    Signed-off-by: Paul E. McKenney
    Cc: Christopher Li
    Reviewed-by: Josh Triplett

    Paul E. McKenney
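
A rough userspace sketch of how such an annotation can work is shown below. Several parts are assumptions: the attribute spelling and address-space number are recalled from the kernel's compiler header and should be verified there, and `sketch_assign_pointer()` and `sketch_dereference()` are hypothetical stand-ins built on GCC/Clang atomic builtins, not the real rcu_assign_pointer() and rcu_dereference(). The real accessors also contain the casts that keep sparse quiet, which these stand-ins omit.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Assumption: when sparse runs (__CHECKER__ defined), the annotation puts the
 * pointer in a separate address space that may not be dereferenced directly.
 * For an ordinary compiler it expands to nothing.
 */
#ifdef __CHECKER__
# define __rcu __attribute__((noderef, address_space(4)))
#else
# define __rcu
#endif

struct config {
    int value;
};

/* The annotation marks gp as a pointer that should only go through accessors. */
static struct config __rcu *gp;

/* Hypothetical stand-ins for the publish and read accessors (GCC/Clang builtins). */
#define sketch_assign_pointer(p, v) __atomic_store_n(&(p), (v), __ATOMIC_RELEASE)
#define sketch_dereference(p)       __atomic_load_n(&(p), __ATOMIC_ACQUIRE)

/* Updater side: publish a fully initialized structure. */
static void publish(struct config *c)
{
    sketch_assign_pointer(gp, c);
}

/* Reader side: fetch the pointer once through the accessor, then use it. */
static int read_value(void)
{
    struct config *c = sketch_dereference(gp);

    return c ? c->value : -1;
}
```

In this model a direct `gp->value` still compiles with an ordinary compiler; only a checker that understands the attribute can flag it, which matches the commit's description of the complaints as a debugging aid.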
     
  • The task_cls_classid() function applies rcu_dereference() to integers,
    which does not work with the shiny new sparse-based checking in
    rcu_dereference(). This commit therefore moves to the new RCU API
    rcu_dereference_index_check().

    Signed-off-by: Paul E. McKenney
    Reviewed-by: Josh Triplett
    Acked-by: David S. Miller
    Acked-by: Herbert Xu

    Paul E. McKenney
     

19 Aug, 2010

4 commits

  • Fix the declaration of sys_execve() in asm-generic/syscalls.h to have
    various consts applied to its pointers.

    Signed-off-by: David Howells
    Signed-off-by: Linus Torvalds

    David Howells
     
  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6:
    fs: brlock vfsmount_lock
    fs: scale files_lock
    lglock: introduce special lglock and brlock spin locks
    tty: fix fu_list abuse
    fs: cleanup files_lock locking
    fs: remove extra lookup in __lookup_hash
    fs: fs_struct rwlock to spinlock
    apparmor: use task path helpers
    fs: dentry allocation consolidation
    fs: fix do_lookup false negative
    mbcache: Limit the maximum number of cache entries
    hostfs ->follow_link() braino
    hostfs: dumb (and usually harmless) tpyo - strncpy instead of strlcpy
    remove SWRITE* I/O types
    kill BH_Ordered flag
    vfs: update ctime when changing the file's permission by setfacl
    cramfs: only unlock new inodes
    fix reiserfs_evict_inode end_writeback second call

    Linus Torvalds
     
  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound-2.6:
    ALSA: emu10k1 - delay the PCM interrupts (add pcm_irq_delay parameter)
    ALSA: hda - Fix ALC680 base model capture
    ASoC: Remove DSP mode support for WM8776
    ALSA: hda - Add quirk for Dell Vostro 1220
    ALSA: riptide - Fix detection / load of firmware files

    Linus Torvalds
     
  • * 'merge-devicetree' of git://git.secretlab.ca/git/linux-2.6:
    spi.h: missing kernel-doc notation, please fix
    of: fix missing headers for of_address_to_resource() in MTD and SysACE drivers
    of: Fix missing includes
    ata: update for of_device to platform_device replacement
    microblaze: Fix of: eliminate of_device->node and dev_archdata->{of,prom}_node
    microblaze: Fix of/address: Merge all of the bus translation code
    booting-without-of: Remove nonexistent chapters from TOC, fix numbering

    Linus Torvalds
     

18 Aug, 2010

8 commits

  • With some hardware combinations, the PCM interrupts are acknowledged
    before the period boundary from the emu10k1 chip. The midlevel PCM code
    gets confused and the playback stream is interrupted.

    It seems that the interrupt processing shift by 2 samples is enough
    to fix this issue. This default value does not harm other,
    non-affected hardware.

    More information: Kernel bugzilla bug#16300

    [A compile warning fixed by tiwai]

    Signed-off-by: Jaroslav Kysela
    Cc:
    Signed-off-by: Takashi Iwai

    Jaroslav Kysela
     
  • fs: scale files_lock

    Improve scalability of files_lock by adding per-cpu, per-sb files lists,
    protected with an lglock. The lglock provides fast access to the per-cpu lists
    to add and remove files. It also provides a snapshot of all the per-cpu lists
    (although this is very slow).

    One difficulty with this approach is that a file can be removed from the list
    by another CPU. We must track which per-cpu list the file is on with a new
    variable in the file struct (packed into a hole on 64-bit archs). Scalability
    could suffer if files are frequently removed from different cpu's list.

    However, loads with frequent removal of files imply a short interval between
    adding and removing the files, and the scheduler attempts to avoid moving
    processes too far away. Also, even in the case of cross-CPU removal, the
    hardware has much more opportunity to parallelise cacheline transfers with N
    cachelines than with 1.

    A worst-case test of 1 CPU allocating files subsequently being freed by N CPUs
    degenerates to contending on a single lock, which is no worse than before. When
    more than one CPU is allocating files, even if they are always freed by
    different CPUs, there will be more parallelism than the single-lock case.

    Testing results:

    On a 2 socket, 8 core opteron, I measure the number of times the lock is taken
    to remove the file, the number of times it is removed by the same CPU that
    added it, and the number of times it is removed by the same node that added it.

    Booting:     locks=  25049  cpu-hits=  23174 (92.5%)  node-hits=  23945 (95.6%)
    kbuild -j16  locks=2281913  cpu-hits=2208126 (96.8%)  node-hits=2252674 (98.7%)
    dbench 64    locks=4306582  cpu-hits=4287247 (99.6%)  node-hits=4299527 (99.8%)

    So a file is removed from the same CPU it was added by over 90% of the time.
    It remains within the same node 95% of the time.

    Tim Chen ran some numbers for a 64 thread Nehalem system performing a compile.

                    throughput
    2.6.34-rc2      24.5
    +patch          24.9

                    us      sys     idle    IO wait (in %)
    2.6.34-rc2      51.25   28.25   17.25   3.25
    +patch          53.75   18.5    19      8.75

    So significantly less CPU time spent in kernel code, higher idle time and
    slightly higher throughput.

    Single-threaded performance difference was within the noise of microbenchmarks.
    That is not to say a penalty does not exist: the code is larger and requires
    more memory accesses, so it will be slightly slower.

    Cc: linux-kernel@vger.kernel.org
    Cc: Tim Chen
    Cc: Andi Kleen
    Signed-off-by: Nick Piggin
    Signed-off-by: Al Viro

    Nick Piggin
     
  • lglock: introduce special lglock and brlock spin locks

    This patch introduces "local-global" locks (lglocks). These can be used to:

    - Provide fast exclusive access to per-CPU data, with exclusive access to
    another CPU's data allowed but possibly subject to contention, and to provide
    very slow exclusive access to all per-CPU data.
    - Or to provide very fast and scalable read serialisation, and to provide
    very slow exclusive serialisation of data (not necessarily per-CPU data).

    Brlocks are also implemented as a short-hand notation for the latter use
    case.

    Thanks to Paul for the local/global naming convention.

    Cc: linux-kernel@vger.kernel.org
    Cc: Al Viro
    Cc: "Paul E. McKenney"
    Signed-off-by: Nick Piggin
    Signed-off-by: Al Viro

    Nick Piggin
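
The first use case listed above can be sketched in userspace C11 as an array of spinlocks, one per slot, where the fast path takes only one slot's lock and the slow path takes all of them. This is a simplified model, not the kernel's implementation: `struct lg_sketch`, `NR_SLOTS` and every function name here are hypothetical, and a caller-supplied slot index stands in for the current CPU.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define NR_SLOTS 4   /* hypothetical stand-in for the number of CPUs */

/* One spinlock per slot; each slot guards that slot's private data. */
struct lg_sketch {
    atomic_flag slot[NR_SLOTS];
};

static void lg_init(struct lg_sketch *lg)
{
    for (int i = 0; i < NR_SLOTS; i++)
        atomic_flag_clear(&lg->slot[i]);
}

/* Returns true if the slot's lock was free and is now held by the caller. */
static bool lg_local_trylock(struct lg_sketch *lg, int cpu)
{
    return !atomic_flag_test_and_set_explicit(&lg->slot[cpu],
                                              memory_order_acquire);
}

/* Fast path: exclusive access to one slot's data only. */
static void lg_local_lock(struct lg_sketch *lg, int cpu)
{
    while (!lg_local_trylock(lg, cpu))
        ;   /* spin until the slot is free */
}

static void lg_local_unlock(struct lg_sketch *lg, int cpu)
{
    atomic_flag_clear_explicit(&lg->slot[cpu], memory_order_release);
}

/* Slow path: exclusive access to all slots, taken in a fixed order. */
static void lg_global_lock(struct lg_sketch *lg)
{
    for (int i = 0; i < NR_SLOTS; i++)
        lg_local_lock(lg, i);
}

static void lg_global_unlock(struct lg_sketch *lg)
{
    for (int i = 0; i < NR_SLOTS; i++)
        lg_local_unlock(lg, i);
}
```

Local lockers on different slots never touch the same lock word, so they do not contend with each other. The global locker pays for that by acquiring every slot, which is why the commit message calls the global path very slow.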
     
  • tty: fix fu_list abuse

    tty code abuses fu_list, which causes a bug in remount,ro handling.

    If a tty device node is opened on a filesystem and the last link to the inode
    is then removed, the filesystem will be allowed to be remounted readonly. This is
    because fs_may_remount_ro does not find the 0 link tty inode on the file sb
    list (because the tty code incorrectly removed it to use for its own purpose).
    This can result in a filesystem with errors after it is marked "clean".

    Taking the idea from Christoph's initial patch, allocate a tty private struct
    at file->private_data and put our required list fields in there, linking
    file and tty. This makes tty nodes behave the same way as other device nodes,
    avoids meddling with the vfs, and avoids this bug.

    The error handling is not trivial in the tty code, so for this bugfix, I take
    the simple approach of using __GFP_NOFAIL and don't worry about memory errors.
    This is not a problem because our allocator doesn't fail small allocs as a rule
    anyway. So proper error handling is left as an exercise for tty hackers.

    [ Arguably filesystem's device inode would ideally be divorced from the
    driver's pseudo inode when it is opened, but in practice it's not clear whether
    that will ever be worth implementing. ]

    Cc: linux-kernel@vger.kernel.org
    Cc: Christoph Hellwig
    Cc: Alan Cox
    Cc: Greg Kroah-Hartman
    Signed-off-by: Nick Piggin
    Signed-off-by: Al Viro

    Nick Piggin
     
  • fs: cleanup files_lock locking

    Lock tty_files with a new spinlock, tty_files_lock; provide helpers to
    manipulate the per-sb files list; unexport the files_lock spinlock.

    Cc: linux-kernel@vger.kernel.org
    Cc: Christoph Hellwig
    Cc: Alan Cox
    Acked-by: Andi Kleen
    Acked-by: Greg Kroah-Hartman
    Signed-off-by: Nick Piggin
    Signed-off-by: Al Viro

    Nick Piggin
     
  • fs: fs_struct rwlock to spinlock

    struct fs_struct.lock is an rwlock with the read-side used to protect root and
    pwd members while taking references to them. Taking a reference to a path
    typically requires just 2 atomic ops, so the critical section is very small.
    Parallel read-side operations would have cacheline contention on the lock, the
    dentry, and the vfsmount cachelines, so the rwlock is unlikely to ever give a
    real parallelism increase.

    Replace it with a spinlock to avoid one or two atomic operations in the typical
    path lookup fastpath.

    Signed-off-by: Nick Piggin
    Signed-off-by: Al Viro

    Nick Piggin
     
  • These flags aren't real I/O types, but tell ll_rw_block to always
    lock the buffer instead of giving up on a failed trylock.

    Instead add a new write_dirty_buffer helper that implements this semantic
    and use it from the existing SWRITE* callers. Note that the ll_rw_block
    code had a bug where it didn't promote WRITE_SYNC_PLUG properly, which
    this patch fixes.

    In the ufs code clean up the helper that used to call ll_rw_block
    to mirror sync_dirty_buffer, which is the function it implements for
    compound buffers.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
     
  • Instead of abusing a buffer_head flag just add a variant of
    sync_dirty_buffer which allows passing the exact type of write
    flag required.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig