07 Jan, 2012

3 commits

  • If there are any inodes on the super block that have been unlinked
    (i_nlink == 0) but have not yet been deleted then prevent the
    remounting the super block read-only.

    Reported-by: Toshiyuki Okajima
    Signed-off-by: Miklos Szeredi
    Tested-by: Toshiyuki Okajima
    Signed-off-by: Al Viro

    Miklos Szeredi
     
  • Currently remouting superblock read-only is racy in a major way.

    With the per mount read-only infrastructure it is now possible to
    prevent most races, which this patch attempts.

    Before starting the remount read-only, iterate through all mounts
    belonging to the superblock and if none of them have any pending
    writes, set sb->s_readonly_remount. This indicates that remount is in
    progress and no further write requests are allowed. If the remount
    succeeds set MS_RDONLY and reset s_readonly_remount.

    If the remounting is unsuccessful just reset s_readonly_remount.
    This can result in transient EROFS errors, despite the fact the
    remount failed. Unfortunately hodling off writes is difficult as
    remount itself may touch the filesystem (e.g. through load_nls())
    which would deadlock.

    A later patch deals with delayed writes due to nlink going to zero.

    Signed-off-by: Miklos Szeredi
    Tested-by: Toshiyuki Okajima
    Signed-off-by: Al Viro

    Miklos Szeredi
     
  • Keep track of vfsmounts belonging to a superblock. List is protected
    by vfsmount_lock.

    Signed-off-by: Miklos Szeredi
    Tested-by: Toshiyuki Okajima
    Signed-off-by: Al Viro

    Miklos Szeredi
     

04 Jan, 2012

3 commits

  • unfortunately, just checking MS_BORN after having grabbed ->s_umount in
    sget() is not enough; places that pick superblock from a list and
    grab s_umount shared need the same check in addition to checking for
    ->s_root; otherwise three-way race between failing mount, sget() and
    such list-walker can leave us with list-walker coming *second*, when
    temporary active ref grabbed by sget() (to be dropped when sget()
    notices that original mount has failed by checking MS_BORN) has
    lead to deactivate_locked_super() from failing ->mount() *not* doing
    ->kill_sb() and just releasing ->s_umount. Once sget() gets through
    and notices that MS_BORN had never been set it will drop the active
    ref and fs will be shut down and kicked out of all lists, but it's
    too late for something like sync_supers().

    Signed-off-by: Al Viro

    Al Viro
     
  • Signed-off-by: Al Viro

    Al Viro
     
  • some stuff in there can actually become static; some belongs to pnode.h
    as it's a private interface between namespace.c and pnode.c...

    Signed-off-by: Al Viro

    Al Viro
     

02 Nov, 2011

1 commit


01 Nov, 2011

1 commit


26 Jul, 2011

1 commit

  • * 'for-3.1/core' of git://git.kernel.dk/linux-block: (24 commits)
    block: strict rq_affinity
    backing-dev: use synchronize_rcu_expedited instead of synchronize_rcu
    block: fix patch import error in max_discard_sectors check
    block: reorder request_queue to remove 64 bit alignment padding
    CFQ: add think time check for group
    CFQ: add think time check for service tree
    CFQ: move think time check variables to a separate struct
    fixlet: Remove fs_excl from struct task.
    cfq: Remove special treatment for metadata rqs.
    block: document blk_plug list access
    block: avoid building too big plug list
    compat_ioctl: fix make headers_check regression
    block: eliminate potential for infinite loop in blkdev_issue_discard
    compat_ioctl: fix warning caused by qemu
    block: flush MEDIA_CHANGE from drivers on close(2)
    blk-throttle: Make total_nr_queued unsigned
    block: Add __attribute__((format(printf...) and fix fallout
    fs/partitions/check.c: make local symbols static
    block:remove some spare spaces in genhd.c
    block:fix the comment error in blkdev.h
    ...

    Linus Torvalds
     

21 Jul, 2011

3 commits

  • Now that the per-sb shrinker is responsible for shrinking 2 or more
    caches, increase the batch size to keep econmies of scale for
    shrinking each cache. Increase the shrinker batch size to 1024
    objects.

    To allow for a large increase in batch size, add a conditional
    reschedule to prune_icache_sb() so that we don't hold the LRU spin
    lock for too long. This mirrors the behaviour of the
    __shrink_dcache_sb(), and allows us to increase the batch size
    without needing to worry about problems caused by long lock hold
    times.

    To ensure that filesystems using the per-sb shrinker callouts don't
    cause problems, document that the object freeing method must
    reschedule appropriately inside loops.

    Signed-off-by: Dave Chinner
    Signed-off-by: Al Viro

    Dave Chinner
     
  • Now we have a per-superblock shrinker implementation, we can add a
    filesystem specific callout to it to allow filesystem internal
    caches to be shrunk by the superblock shrinker.

    Rather than perpetuate the multipurpose shrinker callback API (i.e.
    nr_to_scan == 0 meaning "tell me how many objects freeable in the
    cache), two operations will be added. The first will return the
    number of objects that are freeable, the second is the actual
    shrinker call.

    Signed-off-by: Dave Chinner
    Signed-off-by: Al Viro

    Dave Chinner
     
  • With context based shrinkers, we can implement a per-superblock
    shrinker that shrinks the caches attached to the superblock. We
    currently have global shrinkers for the inode and dentry caches that
    split up into per-superblock operations via a coarse proportioning
    method that does not batch very well. The global shrinkers also
    have a dependency - dentries pin inodes - so we have to be very
    careful about how we register the global shrinkers so that the
    implicit call order is always correct.

    With a per-sb shrinker callout, we can encode this dependency
    directly into the per-sb shrinker, hence avoiding the need for
    strictly ordering shrinker registrations. We also have no need for
    any proportioning code for the shrinker subsystem already provides
    this functionality across all shrinkers. Allowing the shrinker to
    operate on a single superblock at a time means that we do less
    superblock list traversals and locking and reclaim should batch more
    effectively. This should result in less CPU overhead for reclaim and
    potentially faster reclaim of items from each filesystem.

    Signed-off-by: Dave Chinner
    Signed-off-by: Al Viro

    Dave Chinner
     

20 Jul, 2011

5 commits

  • The per-sb shrinker has the same requirement as the writeback
    threads of ensuring that the superblock is usable and pinned for the
    time it takes to run the work. Both need to take a passive reference
    to the sb, take a read lock on the s_umount lock and then only
    continue if an unmount is not in progress.

    pin_sb_for_writeback() does this exactly, so move it to fs/super.c
    and rename it to grab_super_passive() and exporting it via
    fs/internal.h for all the VFS code to be able to use.

    Signed-off-by: Dave Chinner
    Signed-off-by: Al Viro

    Dave Chinner
     
  • With the inode LRUs moving to per-sb structures, there is no longer
    a need for a global inode_lru_lock. The locking can be made more
    fine-grained by moving to a per-sb LRU lock, isolating the LRU
    operations of different filesytsems completely from each other.

    Signed-off-by: Dave Chinner
    Signed-off-by: Al Viro

    Dave Chinner
     
  • The inode unused list is currently a global LRU. This does not match
    the other global filesystem cache - the dentry cache - which uses
    per-superblock LRU lists. Hence we have related filesystem object
    types using different LRU reclaimation schemes.

    To enable a per-superblock filesystem cache shrinker, both of these
    caches need to have per-sb unused object LRU lists. Hence this patch
    converts the global inode LRU to per-sb LRUs.

    The patch only does rudimentary per-sb propotioning in the shrinker
    infrastructure, as this gets removed when the per-sb shrinker
    callouts are introduced later on.

    Signed-off-by: Dave Chinner
    Signed-off-by: Al Viro

    Dave Chinner
     
  • Signed-off-by: Al Viro

    Al Viro
     
  • Call the given function for all superblocks of given type. Function
    gets a superblock (with s_umount locked shared) and (void *) argument
    supplied by caller of iterator.

    Signed-off-by: Al Viro

    Al Viro
     

12 Jul, 2011

1 commit

  • fs_excl is a poor man's priority inheritance for filesystems to hint to
    the block layer that an operation is important. It was never clearly
    specified, not widely adopted, and will not prevent starvation in many
    cases (like across cgroups).

    fs_excl was introduced with the time sliced CFQ IO scheduler, to
    indicate when a process held FS exclusive resources and thus needed
    a boost.

    It doesn't cover all file systems, and it was never fully complete.
    Lets kill it.

    Signed-off-by: Justin TerAvest
    Signed-off-by: Jens Axboe

    Justin TerAvest
     

04 Jun, 2011

1 commit

  • Caching "we have already removed suid/caps" was overenthusiastic as merged.
    On network filesystems we might have had suid/caps set on another client,
    silently picked by this client on revalidate, all of that *without* clearing
    the S_NOSEC flag.

    AFAICS, the only reasonably sane way to deal with that is
    * new superblock flag; unless set, S_NOSEC is not going to be set.
    * local block filesystems set it in their ->mount() (more accurately,
    mount_bdev() does, so does btrfs ->mount(), users of mount_bdev() other than
    local block ones clear it)
    * if any network filesystem (or a cluster one) wants to use S_NOSEC,
    it'll need to set MS_NOSEC in sb->s_flags *AND* take care to clear S_NOSEC when
    inode attribute changes are picked from other clients.

    It's not an earth-shattering hole (anybody that can set suid on another client
    will almost certainly be able to write to the file before doing that anyway),
    but it's a bug that needs fixing.

    Signed-off-by: Al Viro

    Al Viro
     

27 May, 2011

2 commits

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/djm/tmem:
    xen: cleancache shim to Xen Transcendent Memory
    ocfs2: add cleancache support
    ext4: add cleancache support
    btrfs: add cleancache support
    ext3: add cleancache support
    mm/fs: add hooks to support cleancache
    mm: cleancache core ops functions and config
    fs: add field to superblock to support cleancache
    mm/fs: cleancache documentation

    Fix up trivial conflict in fs/btrfs/extent_io.c due to includes

    Linus Torvalds
     
  • This fourth patch of eight in this cleancache series provides the
    core hooks in VFS for: initializing cleancache per filesystem;
    capturing clean pages reclaimed by page cache; attempting to get
    pages from cleancache before filesystem read; and ensuring coherency
    between pagecache, disk, and cleancache. Note that the placement
    of these hooks was stable from 2.6.18 to 2.6.38; a minor semantic
    change was required due to a patchset in 2.6.39.

    All hooks become no-ops if CONFIG_CLEANCACHE is unset, or become
    a check of a boolean global if CONFIG_CLEANCACHE is set but no
    cleancache "backend" has claimed cleancache_ops.

    Details and a FAQ can be found in Documentation/vm/cleancache.txt

    [v8: minchan.kim@gmail.com: adapt to new remove_from_page_cache function]
    Signed-off-by: Chris Mason
    Signed-off-by: Dan Magenheimer
    Reviewed-by: Jeremy Fitzhardinge
    Reviewed-by: Konrad Rzeszutek Wilk
    Cc: Andrew Morton
    Cc: Al Viro
    Cc: Matthew Wilcox
    Cc: Nick Piggin
    Cc: Mel Gorman
    Cc: Rik Van Riel
    Cc: Jan Beulich
    Cc: Andreas Dilger
    Cc: Ted Ts'o
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Nitin Gupta

    Dan Magenheimer
     

19 May, 2011

1 commit


25 Mar, 2011

1 commit

  • * 'for-2.6.39/core' of git://git.kernel.dk/linux-2.6-block: (65 commits)
    Documentation/iostats.txt: bit-size reference etc.
    cfq-iosched: removing unnecessary think time checking
    cfq-iosched: Don't clear queue stats when preempt.
    blk-throttle: Reset group slice when limits are changed
    blk-cgroup: Only give unaccounted_time under debug
    cfq-iosched: Don't set active queue in preempt
    block: fix non-atomic access to genhd inflight structures
    block: attempt to merge with existing requests on plug flush
    block: NULL dereference on error path in __blkdev_get()
    cfq-iosched: Don't update group weights when on service tree
    fs: assign sb->s_bdi to default_backing_dev_info if the bdi is going away
    block: Require subsystems to explicitly allocate bio_set integrity mempool
    jbd2: finish conversion from WRITE_SYNC_PLUG to WRITE_SYNC and explicit plugging
    jbd: finish conversion from WRITE_SYNC_PLUG to WRITE_SYNC and explicit plugging
    fs: make fsync_buffers_list() plug
    mm: make generic_writepages() use plugging
    blk-cgroup: Add unaccounted time to timeslice_used.
    block: fixup plugging stubs for !CONFIG_BLOCK
    block: remove obsolete comments for blkdev_issue_zeroout.
    blktrace: Use rq->cmd_flags directly in blk_add_trace_rq.
    ...

    Fix up conflicts in fs/{aio.c,super.c}

    Linus Torvalds
     

18 Mar, 2011

1 commit

  • new function: mount_fs(). Does all work done by vfs_kern_mount()
    except the allocation and filling of vfsmount; returns root dentry
    or ERR_PTR().

    vfs_kern_mount() switched to using it and taken to fs/namespace.c,
    along with its wrappers.

    alloc_vfsmnt()/free_vfsmnt() made static.

    functions in namespace.c slightly reordered.

    Signed-off-by: Al Viro

    Al Viro
     

17 Mar, 2011

2 commits


12 Feb, 2011

1 commit

  • In commit fa0d7e3de6d6 ("fs: icache RCU free inodes"), we use rcu free
    inode instead of freeing the inode directly. It causes a crash when we
    rmmod immediately after we umount the volume[1].

    So we need to call rcu_barrier after we kill_sb so that the inode is
    freed before we do rmmod. The idea is inspired by Aneesh Kumar.
    rcu_barrier will wait for all callbacks to end before preceding. The
    original patch was done by Tao Ma, but synchronize_rcu() is not enough
    here.

    1. http://marc.info/?l=linux-fsdevel&m=129680863330185&w=2

    Tested-by: Tao Ma
    Signed-off-by: Boaz Harrosh
    Cc: Nick Piggin
    Cc: Al Viro
    Cc: Chris Mason
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Boaz Harrosh
     

17 Jan, 2011

1 commit

  • Instead of splitting refcount between (per-cpu) mnt_count
    and (SMP-only) mnt_longrefs, make all references contribute
    to mnt_count again and keep track of how many are longterm
    ones.

    Accounting rules for longterm count:
    * 1 for each fs_struct.root.mnt
    * 1 for each fs_struct.pwd.mnt
    * 1 for having non-NULL ->mnt_ns
    * decrement to 0 happens only under vfsmount lock exclusive

    That allows nice common case for mntput() - since we can't drop the
    final reference until after mnt_longterm has reached 0 due to the rules
    above, mntput() can grab vfsmount lock shared and check mnt_longterm.
    If it turns out to be non-zero (which is the common case), we know
    that this is not the final mntput() and can just blindly decrement
    percpu mnt_count. Otherwise we grab vfsmount lock exclusive and
    do usual decrement-and-check of percpu mnt_count.

    For fs_struct.c we have mnt_make_longterm() and mnt_make_shortterm();
    namespace.c uses the latter in places where we don't already hold
    vfsmount lock exclusive and opencodes a few remaining spots where
    we need to manipulate mnt_longterm.

    Note that we mostly revert the code outside of fs/namespace.c back
    to what we used to have; in particular, normal code doesn't need
    to care about two kinds of references, etc. And we get to keep
    the optimization Nick's variant had bought us...

    Signed-off-by: Al Viro

    Al Viro
     

14 Jan, 2011

1 commit

  • * 'for-2.6.38/core' of git://git.kernel.dk/linux-2.6-block: (43 commits)
    block: ensure that completion error gets properly traced
    blktrace: add missing probe argument to block_bio_complete
    block cfq: don't use atomic_t for cfq_group
    block cfq: don't use atomic_t for cfq_queue
    block: trace event block fix unassigned field
    block: add internal hd part table references
    block: fix accounting bug on cross partition merges
    kref: add kref_test_and_get
    bio-integrity: mark kintegrityd_wq highpri and CPU intensive
    block: make kblockd_workqueue smarter
    Revert "sd: implement sd_check_events()"
    block: Clean up exit_io_context() source code.
    Fix compile warnings due to missing removal of a 'ret' variable
    fs/block: type signature of major_to_index(int) to major_to_index(unsigned)
    block: convert !IS_ERR(p) && p to !IS_ERR_NOR_NULL(p)
    cfq-iosched: don't check cfqg in choose_service_tree()
    fs/splice: Pull buf->ops->confirm() from splice_from_pipe actors
    cdrom: export cdrom_check_events()
    sd: implement sd_check_events()
    sr: implement sr_check_events()
    ...

    Linus Torvalds
     

07 Jan, 2011

2 commits

  • The problem that this patch aims to fix is vfsmount refcounting scalability.
    We need to take a reference on the vfsmount for every successful path lookup,
    which often go to the same mount point.

    The fundamental difficulty is that a "simple" reference count can never be made
    scalable, because any time a reference is dropped, we must check whether that
    was the last reference. To do that requires communication with all other CPUs
    that may have taken a reference count.

    We can make refcounts more scalable in a couple of ways, involving keeping
    distributed counters, and checking for the global-zero condition less
    frequently.

    - check the global sum once every interval (this will delay zero detection
    for some interval, so it's probably a showstopper for vfsmounts).

    - keep a local count and only taking the global sum when local reaches 0 (this
    is difficult for vfsmounts, because we can't hold preempt off for the life of
    a reference, so a counter would need to be per-thread or tied strongly to a
    particular CPU which requires more locking).

    - keep a local difference of increments and decrements, which allows us to sum
    the total difference and hence find the refcount when summing all CPUs. Then,
    keep a single integer "long" refcount for slow and long lasting references,
    and only take the global sum of local counters when the long refcount is 0.

    This last scheme is what I implemented here. Attached mounts and process root
    and working directory references are "long" references, and everything else is
    a short reference.

    This allows scalable vfsmount references during path walking over mounted
    subtrees and unattached (lazy umounted) mounts with processes still running
    in them.

    This results in one fewer atomic op in the fastpath: mntget is now just a
    per-CPU inc, rather than an atomic inc; and mntput just requires a spinlock
    and non-atomic decrement in the common case. However code is otherwise bigger
    and heavier, so single threaded performance is basically a wash.

    Signed-off-by: Nick Piggin

    Nick Piggin
     
  • We can turn the dcache hash locking from a global dcache_hash_lock into
    per-bucket locking.

    Signed-off-by: Nick Piggin

    Nick Piggin
     

13 Nov, 2010

2 commits

  • After recent blkdev_get() modifications, open_by_devnum() and
    open_bdev_exclusive() are simple wrappers around blkdev_get().
    Replace them with blkdev_get_by_dev() and blkdev_get_by_path().

    blkdev_get_by_dev() is identical to open_by_devnum().
    blkdev_get_by_path() is slightly different in that it doesn't
    automatically add %FMODE_EXCL to @mode.

    All users are converted. Most conversions are mechanical and don't
    introduce any behavior difference. There are several exceptions.

    * btrfs now sets FMODE_EXCL in btrfs_device->mode, so there's no
    reason to OR it explicitly on blkdev_put().

    * gfs2, nilfs2 and the generic mount_bdev() now set FMODE_EXCL in
    sb->s_mode.

    * With the above changes, sb->s_mode now always should contain
    FMODE_EXCL. WARN_ON_ONCE() added to kill_block_super() to detect
    errors.

    The new blkdev_get_*() functions are with proper docbook comments.
    While at it, add function description to blkdev_get() too.

    Signed-off-by: Tejun Heo
    Cc: Philipp Reisner
    Cc: Neil Brown
    Cc: Mike Snitzer
    Cc: Joern Engel
    Cc: Chris Mason
    Cc: Jan Kara
    Cc: "Theodore Ts'o"
    Cc: KONISHI Ryusuke
    Cc: reiserfs-devel@vger.kernel.org
    Cc: xfs-masters@oss.sgi.com
    Cc: Alexander Viro

    Tejun Heo
     
  • Over time, block layer has accumulated a set of APIs dealing with bdev
    open, close, claim and release.

    * blkdev_get/put() are the primary open and close functions.

    * bd_claim/release() deal with exclusive open.

    * open/close_bdev_exclusive() are combination of open and claim and
    the other way around, respectively.

    * bd_link/unlink_disk_holder() to create and remove holder/slave
    symlinks.

    * open_by_devnum() wraps bdget() + blkdev_get().

    The interface is a bit confusing and the decoupling of open and claim
    makes it impossible to properly guarantee exclusive access as
    in-kernel open + claim sequence can disturb the existing exclusive
    open even before the block layer knows the current open if for another
    exclusive access. Reorganize the interface such that,

    * blkdev_get() is extended to include exclusive access management.
    @holder argument is added and, if is @FMODE_EXCL specified, it will
    gain exclusive access atomically w.r.t. other exclusive accesses.

    * blkdev_put() is similarly extended. It now takes @mode argument and
    if @FMODE_EXCL is set, it releases an exclusive access. Also, when
    the last exclusive claim is released, the holder/slave symlinks are
    removed automatically.

    * bd_claim/release() and close_bdev_exclusive() are no longer
    necessary and either made static or removed.

    * bd_link_disk_holder() remains the same but bd_unlink_disk_holder()
    is no longer necessary and removed.

    * open_bdev_exclusive() becomes a simple wrapper around lookup_bdev()
    and blkdev_get(). It also has an unexpected extra bdev_read_only()
    test which probably should be moved into blkdev_get().

    * open_by_devnum() is modified to take @holder argument and pass it to
    blkdev_get().

    Most of bdev open/close operations are unified into blkdev_get/put()
    and most exclusive accesses are tested atomically at the open time (as
    it should). This cleans up code and removes some, both valid and
    invalid, but unnecessary all the same, corner cases.

    open_bdev_exclusive() and open_by_devnum() can use further cleanup -
    rename to blkdev_get_by_path() and blkdev_get_by_devt() and drop
    special features. Well, let's leave them for another day.

    Most conversions are straight-forward. drbd conversion is a bit more
    involved as there was some reordering, but the logic should stay the
    same.

    Signed-off-by: Tejun Heo
    Acked-by: Neil Brown
    Acked-by: Ryusuke Konishi
    Acked-by: Mike Snitzer
    Acked-by: Philipp Reisner
    Cc: Peter Osterlund
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Cc: Jan Kara
    Cc: Andrew Morton
    Cc: Andreas Dilger
    Cc: "Theodore Ts'o"
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Alex Elder
    Cc: Christoph Hellwig
    Cc: dm-devel@redhat.com
    Cc: drbd-dev@lists.linbit.com
    Cc: Leo Chen
    Cc: Scott Branden
    Cc: Chris Mason
    Cc: Steven Whitehouse
    Cc: Dave Kleikamp
    Cc: Joern Engel
    Cc: reiserfs-devel@vger.kernel.org
    Cc: Alexander Viro

    Tejun Heo
     

29 Oct, 2010

5 commits


26 Oct, 2010

1 commit

  • Pull removal of fsnotify marks into generic_shutdown_super().
    Split umount-time work into a new function - evict_inodes().
    Make sure that invalidate_inodes() will be able to cope with
    I_FREEING once we change locking in iput().

    Signed-off-by: Al Viro

    Al Viro
     

18 Aug, 2010

1 commit

  • fs: scale files_lock

    Improve scalability of files_lock by adding per-cpu, per-sb files lists,
    protected with an lglock. The lglock provides fast access to the per-cpu lists
    to add and remove files. It also provides a snapshot of all the per-cpu lists
    (although this is very slow).

    One difficulty with this approach is that a file can be removed from the list
    by another CPU. We must track which per-cpu list the file is on with a new
    variale in the file struct (packed into a hole on 64-bit archs). Scalability
    could suffer if files are frequently removed from different cpu's list.

    However loads with frequent removal of files imply short interval between
    adding and removing the files, and the scheduler attempts to avoid moving
    processes too far away. Also, even in the case of cross-CPU removal, the
    hardware has much more opportunity to parallelise cacheline transfers with N
    cachelines than with 1.

    A worst-case test of 1 CPU allocating files subsequently being freed by N CPUs
    degenerates to contending on a single lock, which is no worse than before. When
    more than one CPU are allocating files, even if they are always freed by
    different CPUs, there will be more parallelism than the single-lock case.

    Testing results:

    On a 2 socket, 8 core opteron, I measure the number of times the lock is taken
    to remove the file, the number of times it is removed by the same CPU that
    added it, and the number of times it is removed by the same node that added it.

    Booting: locks= 25049 cpu-hits= 23174 (92.5%) node-hits= 23945 (95.6%)
    kbuild -j16 locks=2281913 cpu-hits=2208126 (96.8%) node-hits=2252674 (98.7%)
    dbench 64 locks=4306582 cpu-hits=4287247 (99.6%) node-hits=4299527 (99.8%)

    So a file is removed from the same CPU it was added by over 90% of the time.
    It remains within the same node 95% of the time.

    Tim Chen ran some numbers for a 64 thread Nehalem system performing a compile.

    throughput
    2.6.34-rc2 24.5
    +patch 24.9

    us sys idle IO wait (in %)
    2.6.34-rc2 51.25 28.25 17.25 3.25
    +patch 53.75 18.5 19 8.75

    So significantly less CPU time spent in kernel code, higher idle time and
    slightly higher throughput.

    Single threaded performance difference was within the noise of microbenchmarks.
    That is not to say penalty does not exist, the code is larger and more memory
    accesses required so it will be slightly slower.

    Cc: linux-kernel@vger.kernel.org
    Cc: Tim Chen
    Cc: Andi Kleen
    Signed-off-by: Nick Piggin
    Signed-off-by: Al Viro

    Nick Piggin