12 Jun, 2009

40 commits

  • It is unnecessarily fragile to have two places (fsync_super() and do_sync())
    doing data integrity sync of the filesystem. Alter __fsync_super() to
    accommodate needs of both callers and use it. So after this patch
    __fsync_super() is the only place where we gather all the calls needed to
    properly send all data on a filesystem to disk.

    A nice bonus is that we get complete livelock avoidance, and sync_supers()
    is now only used for periodic writeback of superblocks.

    sync_blockdevs() introduced a couple of patches ago is gone now.

    [build fixes folded]

    Signed-off-by: Jan Kara
    Signed-off-by: Al Viro

    Jan Kara
     
  • __fsync_super() does the same thing as fsync_super(). So change the only
    caller to use fsync_super() and make __fsync_super() static. This removes
    an unnecessarily duplicated call to sync_blockdev() and prepares the ground
    for the changes to __fsync_super() in the following patches.

    Signed-off-by: Jan Kara
    Signed-off-by: Al Viro

    Jan Kara
     
  • sync_filesystems() has a condition that if wait == 0 and s_dirt == 0, then
    ->sync_fs() isn't called. This does not really make much sense since s_dirt is
    generally used by a filesystem to mean that ->write_super() needs to be called.
    But ->sync_fs() does different things. I even suspect that some filesystems
    (btrfs?) set s_dirt just to fool this logic.

    Signed-off-by: Jan Kara
    Signed-off-by: Al Viro

    Jan Kara
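
The condition described in the entry above can be modelled in a few lines of user-space C. This is only a sketch, not kernel code: `struct sb_model`, `sync_fs_model()`, `sync_one_old()` and `sync_one_new()` are hypothetical names, and the idea that the fix simply stops testing s_dirt is an assumption inferred from the message, which does not spell out the change.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the superblock state that matters here. */
struct sb_model {
    bool s_dirt;        /* "->write_super() needs to be called" */
    int sync_fs_calls;  /* how many times ->sync_fs() was invoked */
};

/* Stand-in for sb->s_op->sync_fs(sb, wait); it only counts calls. */
static void sync_fs_model(struct sb_model *sb, bool wait)
{
    (void)wait;
    sb->sync_fs_calls++;
}

/* Behaviour as described: skipped when wait == 0 and s_dirt == 0. */
void sync_one_old(struct sb_model *sb, bool wait)
{
    if (wait || sb->s_dirt)
        sync_fs_model(sb, wait);
}

/* Assumed behaviour after the patch: s_dirt no longer gates ->sync_fs(). */
void sync_one_new(struct sb_model *sb, bool wait)
{
    sync_fs_model(sb, wait);
}
```

With a clean superblock and wait == 0, the old variant never reaches the ->sync_fs() stand-in, while the new one always does.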
     
  • So far, do_sync() called:
    sync_inodes(0);
    sync_supers();
    sync_filesystems(0);
    sync_filesystems(1);
    sync_inodes(1);

    This ordering makes it kind of hard for filesystems as sync_inodes(0) need not
    submit all the IO (for example it skips inodes with I_SYNC set) so e.g. forcing
    transaction to disk in ->sync_fs() is not really enough. Therefore sys_sync has
    not been completely reliable on some filesystems (ext3, ext4, reiserfs, ocfs2
    and others are hit by this) when racing e.g. with background writeback. A
    similar problem also hits other filesystems (e.g. ext2) because
    sync_supers() is called before sync_inodes(1).

    Change the ordering of calls in do_sync() - this requires a new function
    sync_blockdevs() to preserve the property that block devices are always synced
    after write_super() / sync_fs() call.

    The same issue is fixed in __fsync_super() function used on umount /
    remount read-only.

    [AV: build fixes]

    Signed-off-by: Jan Kara
    Signed-off-by: Al Viro

    Jan Kara
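
To make the ordering problem in the entry above concrete, the following user-space sketch logs the order in which stubbed sync helpers run. The stubs do no I/O. The old sequence is the one quoted in the message; the reordered sequence is an assumption that merely satisfies the two constraints the message states (the waiting inode sync happens before the superblock and ->sync_fs() work, and sync_blockdevs() comes last), so it should not be read as the exact kernel code.

```c
#include <stdio.h>
#include <string.h>

static char trace[256];  /* call log, one "name(arg);" item per call */

/* Append one call to the log; wait < 0 means "takes no argument". */
static void record(const char *name, int wait)
{
    char buf[48];

    if (wait < 0)
        snprintf(buf, sizeof(buf), "%s();", name);
    else
        snprintf(buf, sizeof(buf), "%s(%d);", name, wait);
    strncat(trace, buf, sizeof(trace) - strlen(trace) - 1);
}

/* Stubs that only log their invocation. */
static void sync_inodes(int wait)      { record("sync_inodes", wait); }
static void sync_supers(void)          { record("sync_supers", -1); }
static void sync_filesystems(int wait) { record("sync_filesystems", wait); }
static void sync_blockdevs(void)       { record("sync_blockdevs", -1); }

/* The ordering quoted in the commit message. */
const char *do_sync_old(void)
{
    trace[0] = '\0';
    sync_inodes(0);
    sync_supers();
    sync_filesystems(0);
    sync_filesystems(1);
    sync_inodes(1);
    return trace;
}

/* Assumed reordering: all inode data first, block devices last. */
const char *do_sync_reordered(void)
{
    trace[0] = '\0';
    sync_inodes(0);
    sync_inodes(1);
    sync_supers();
    sync_filesystems(0);
    sync_filesystems(1);
    sync_blockdevs();
    return trace;
}
```

In the old trace, sync_inodes(1) is the last item, so a filesystem forcing its transaction to disk in ->sync_fs() may still see inode I/O submitted afterwards.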
     
  • Remove the unused s_async_list in the superblock, a leftover of the
    broken async inode deletion code that leaked into mainline. Having this
    in the middle of the sync/unmount path is not helpful for the following
    cleanups.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
     
  • This function walks the s_files list, and operates primarily on the
    files in a superblock, so it fits better here (e.g. see also
    fs_may_remount_ro).

    [AV: ... and it shouldn't be static after that move]

    Signed-off-by: Nick Piggin
    Signed-off-by: Al Viro

    npiggin@suse.de
     
  • This patch speeds up lmbench lat_mmap test by about another 2% after the
    first patch.

    Before:
    avg = 462.286
    std = 5.46106

    After:
    avg = 453.12
    std = 9.58257

    (50 runs of each, stddev gives a reasonable confidence)

    It does this by introducing mnt_clone_write, which avoids some heavyweight
    operations of mnt_want_write if called on a vfsmount which we know already
    has a write count; and mnt_want_write_file, which can call mnt_clone_write
    if the file is open for write.

    After these two patches, mnt_want_write and mnt_drop_write go from 7% on
    the profile down to 1.3% (including mnt_clone_write).

    [AV: mnt_want_write_file() should take file alone and derive mnt from it;
    not only all callers have that form, but that's the only mnt about which
    we know that it's already held for write if file is opened for write]

    Cc: Dave Hansen
    Signed-off-by: Nick Piggin
    Signed-off-by: Al Viro

    npiggin@suse.de
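
The fast path described in the entry above can be sketched as follows. This is a simplified user-space model under stated assumptions: `struct mnt_model`, `struct file_model` and the `slow_checks` counter are hypothetical, the "heavyweight operations" are reduced to a single read-only check, and only the function names come from the commit message.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical model of a mount. */
struct mnt_model {
    bool readonly;
    int writers;      /* outstanding write references */
    int slow_checks;  /* how often the heavyweight path ran */
};

/* Hypothetical model of an open file. */
struct file_model {
    struct mnt_model *mnt;
    bool open_for_write;  /* if true, the mount already has a write count */
};

/* Full path: must verify that the mount may be written to at all. */
int mnt_want_write(struct mnt_model *mnt)
{
    mnt->slow_checks++;
    if (mnt->readonly)
        return -EROFS;
    mnt->writers++;
    return 0;
}

/* Cheap path: the caller knows a write count is already held. */
int mnt_clone_write(struct mnt_model *mnt)
{
    mnt->writers++;
    return 0;
}

/* Takes the file alone and derives the mount from it. */
int mnt_want_write_file(struct file_model *file)
{
    if (file->open_for_write)
        return mnt_clone_write(file->mnt);
    return mnt_want_write(file->mnt);
}

void mnt_drop_write(struct mnt_model *mnt)
{
    mnt->writers--;
}
```

Deriving the mount from the file matches the bracketed note in the message: the file's own mount is the only one known to be held for write when the file is open for writing.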
     
  • This patch speeds up lmbench lat_mmap test by about 8%. lat_mmap is set up
    basically to mmap a 64MB file on tmpfs, fault in its pages, then unmap it.
    A microbenchmark yes, but it exercises some important paths in the mm.

    Before:
    avg = 501.9
    std = 14.7773

    After:
    avg = 462.286
    std = 5.46106

    (50 runs of each, stddev gives a reasonable confidence, but there is quite
    a bit of variation there still)

    It does this by removing the complex per-cpu locking and counter-cache and
    replacing them with a percpu counter in struct vfsmount. This makes the code
    much simpler, and avoids spinlocks (although the msync is still pretty
    costly, unfortunately). It results in about 900 bytes smaller code too. It
    does increase the size of a vfsmount, however.

    It should also give a speedup on large systems if CPUs are frequently operating
    on different mounts (because the existing scheme has to operate on an atomic in
    the struct vfsmount when switching between mounts). But I'm most interested in
    the single threaded path performance for the moment.

    [AV: minor cleanup]

    Cc: Dave Hansen
    Signed-off-by: Nick Piggin
    Signed-off-by: Al Viro

    npiggin@suse.de
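
The per-CPU counter idea in the entry above can be illustrated with a small single-threaded model. Everything here is hypothetical (`struct mnt_counter_model`, `NR_CPUS_MODEL`, the helper names, and passing the CPU number explicitly); real per-CPU data and memory ordering are deliberately left out.

```c
#define NR_CPUS_MODEL 4  /* illustrative CPU count */

/* Hypothetical mount with one writer counter per CPU. */
struct mnt_counter_model {
    int writers[NR_CPUS_MODEL];
};

/* Fast paths: touch only the slot of the CPU we are running on. */
void mnt_inc_writers(struct mnt_counter_model *mnt, int cpu)
{
    mnt->writers[cpu]++;
}

void mnt_dec_writers(struct mnt_counter_model *mnt, int cpu)
{
    mnt->writers[cpu]--;
}

/* Slow path: the true count is the sum over all CPUs. */
int mnt_count_writers(const struct mnt_counter_model *mnt)
{
    int cpu, sum = 0;

    for (cpu = 0; cpu < NR_CPUS_MODEL; cpu++)
        sum += mnt->writers[cpu];
    return sum;
}
```

A single slot may go negative when a reference taken on one CPU is dropped on another; only the sum is meaningful. The array is also why such a scheme increases the size of the structure.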
     
  • Signed-off-by: Al Viro

    Al Viro
     
  • Signed-off-by: Al Viro

    Al Viro
     
  • Signed-off-by: Al Viro

    Al Viro
     
  • Signed-off-by: Al Viro

    Al Viro
     
  • Signed-off-by: Al Viro

    Al Viro
     
  • Signed-off-by: Al Viro

    Al Viro
     
  • Signed-off-by: Al Viro

    Al Viro
     
  • Signed-off-by: Al Viro

    Al Viro
     
  • ... and lose the always-NULL last argument (non-NULL case had been
    split off a while ago).

    Signed-off-by: Al Viro

    Al Viro
     
  • Signed-off-by: Al Viro

    Al Viro
     
  • These guys are what we add as submounts; checks for "is that attached in
    our namespace" are simply irrelevant for those and counterproductive for
    use of private vfsmount trees a-la what NFS folks want.

    Signed-off-by: Al Viro

    Al Viro
     
  • Signed-off-by: Al Viro

    Al Viro
     
  • New field: nd->root. When pathname resolution wants to know the root,
    check if nd->root.mnt is non-NULL; use nd->root if it is, otherwise
    copy current->fs->root there. After path_walk() is finished, we check
    if we'd got a cached value in nd->root and drop it. Before calling
    path_walk() we should either set nd->root.mnt to NULL *or* copy (and
    pin down) some path to nd->root. In the latter case we won't be
    looking at current->fs->root at all.

    Signed-off-by: Al Viro

    Al Viro
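
The nd->root protocol described above can be sketched in user-space C. The types and helpers below (`struct path_model`, `struct nameidata_model`, `get_root()`, `put_root()`, and the `pins` and `fs_root_reads` counters) are hypothetical and stand in for the real reference counting; the sketch only shows the "copy lazily, drop after the walk" rule.

```c
#include <stddef.h>

/* Hypothetical path: a NULL mnt means "nothing cached". */
struct path_model {
    const char *mnt;
};

struct nameidata_model {
    struct path_model root;
};

struct path_model current_fs_root = { "/" };  /* stands in for current->fs->root */
int fs_root_reads;  /* times current_fs_root was consulted */
int pins;           /* references currently held on cached roots */

/* Called whenever pathname resolution needs to know the root. */
const struct path_model *get_root(struct nameidata_model *nd)
{
    if (nd->root.mnt == NULL) {
        nd->root = current_fs_root;  /* copy it ... */
        fs_root_reads++;
        pins++;                      /* ... and pin it down */
    }
    return &nd->root;
}

/* Called once the walk has finished: drop a cached root, if any. */
void put_root(struct nameidata_model *nd)
{
    if (nd->root.mnt != NULL) {
        pins--;
        nd->root.mnt = NULL;
    }
}
```

A caller that sets nd->root to a pinned path of its own before the walk never causes current_fs_root to be read, which is the second case the message describes.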
     
  • Split do_path_lookup(), opencode the call from do_filp_open().

    do_filp_open() is the only caller of do_path_lookup() that
    cares about root afterwards (it keeps resolving symlinks on
    O_CREAT path after it'd done LOOKUP_PARENT walk). So when
    we start caching fs->root in path_walk(), it'll need a different
    treatment.

    Signed-off-by: Al Viro

    Al Viro
     
  • Signed-off-by: Al Viro

    Al Viro
     
  • This patch adds an -oexpose_privroot option to allow access to the privroot.

    Signed-off-by: Jeff Mahoney
    Signed-off-by: Al Viro

    Jeff Mahoney
     
  • * git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: (23 commits)
    Btrfs: fix extent_buffer leak during tree log replay
    Btrfs: fix oops when btrfs_inherit_iflags called with a NULL dir
    Btrfs: fix -o nodatasum printk spelling
    Btrfs: check duplicate backrefs for both data and metadata
    Btrfs: init worker struct fields before kthread-run
    Btrfs: pin buffers during write_dev_supers
    Btrfs: avoid races between super writeout and device list updates
    Fix btrfs when ACLs are configured out
    Btrfs: fdatasync should skip metadata writeout
    Btrfs: remove crc32c.h and use libcrc32c directly.
    Btrfs: implement FS_IOC_GETFLAGS/SETFLAGS/GETVERSION
    Btrfs: autodetect SSD devices
    Btrfs: add mount -o ssd_spread to spread allocations out
    Btrfs: avoid allocation clusters that are too spread out
    Btrfs: Add mount -o nossd
    Btrfs: avoid IO stalls behind congested devices in a multi-device FS
    Btrfs: don't allow WRITE_SYNC bios to starve out regular writes
    Btrfs: fix metadata dirty throttling limits
    Btrfs: reduce mount -o ssd CPU usage
    Btrfs: balance btree more often
    ...

    Linus Torvalds
     
  • * 'for-linus' of git://git.infradead.org/users/eparis/notify:
    fsnotify: allow groups to set freeing_mark to null
    inotify/dnotify: should_send_event shouldn't match on FS_EVENT_ON_CHILD
    dnotify: do not bother to lock entry->lock when reading mask
    dnotify: do not use ?true:false when assigning to a bool
    fsnotify: move events should indicate the event was on a child
    inotify: reimplement inotify using fsnotify
    fsnotify: handle filesystem unmounts with fsnotify marks
    fsnotify: fsnotify marks on inodes pin them in core
    fsnotify: allow groups to add private data to events
    fsnotify: add correlations between events
    fsnotify: include pathnames with entries when possible
    fsnotify: generic notification queue and waitq
    dnotify: reimplement dnotify using fsnotify
    fsnotify: parent event notification
    fsnotify: add marks to inodes so groups can interpret how to handle those inodes
    fsnotify: unified filesystem notification backend

    Linus Torvalds
     
  • * 'for-linus' of git://linux-arm.org/linux-2.6:
    kmemleak: Add the corresponding MAINTAINERS entry
    kmemleak: Simple testing module for kmemleak
    kmemleak: Enable the building of the memory leak detector
    kmemleak: Remove some of the kmemleak false positives
    kmemleak: Add modules support
    kmemleak: Add kmemleak_alloc callback from alloc_large_system_hash
    kmemleak: Add the vmalloc memory allocation/freeing hooks
    kmemleak: Add the slub memory allocation/freeing hooks
    kmemleak: Add the slob memory allocation/freeing hooks
    kmemleak: Add the slab memory allocation/freeing hooks
    kmemleak: Add documentation on the memory leak detector
    kmemleak: Add the base support

    Manual conflict resolution (with the slab/earlyboot changes) in:
    drivers/char/vt.c
    init/main.c
    mm/slab.c

    Linus Torvalds
     
  • …el/git/tip/linux-2.6-tip

    * 'perfcounters-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (574 commits)
    perf_counter: Turn off by default
    perf_counter: Add counter->id to the throttle event
    perf_counter: Better align code
    perf_counter: Rename L2 to LL cache
    perf_counter: Standardize event names
    perf_counter: Rename enums
    perf_counter tools: Clean up u64 usage
    perf_counter: Rename perf_counter_limit sysctl
    perf_counter: More paranoia settings
    perf_counter: powerpc: Implement generalized cache events for POWER processors
    perf_counters: powerpc: Add support for POWER7 processors
    perf_counter: Accurate period data
    perf_counter: Introduce struct for sample data
    perf_counter tools: Normalize data using per sample period data
    perf_counter: Annotate exit ctx recursion
    perf_counter tools: Propagate signals properly
    perf_counter tools: Small frequency related fixes
    perf_counter: More aggressive frequency adjustment
    perf_counter/x86: Fix the model number of Intel Core2 processors
    perf_counter, x86: Correct some event and umask values for Intel processors
    ...

    Linus Torvalds
     
  • …/git/penberg/slab-2.6

    * 'topic/slab/earlyboot' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/slab-2.6:
    vgacon: use slab allocator instead of the bootmem allocator
    irq: use kcalloc() instead of the bootmem allocator
    sched: use slab in cpupri_init()
    sched: use alloc_cpumask_var() instead of alloc_bootmem_cpumask_var()
    memcg: don't use bootmem allocator in setup code
    irq/cpumask: make memoryless node zero happy
    x86: remove some alloc_bootmem_cpumask_var calling
    vt: use kzalloc() instead of the bootmem allocator
    sched: use kzalloc() instead of the bootmem allocator
    init: introduce mm_init()
    vmalloc: use kzalloc() instead of alloc_bootmem()
    slab: setup allocators earlier in the boot sequence
    bootmem: fix slab fallback on numa
    bootmem: use slab if bootmem is no longer available

    Linus Torvalds
     
  • Most fsnotify listeners (all but inotify) do not care about marks being
    freed. Allow groups to set freeing_mark to null and do not call any
    function if it is set that way.

    Signed-off-by: Eric Paris

    Eric Paris
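
An optional callback of the kind described above might look like the following sketch. The struct and function names (`struct mark_model`, `struct group_ops_model`, `notify_mark_freed()`, `count_freed()`) are hypothetical; only the member name freeing_mark is taken from the commit message.

```c
#include <stddef.h>

struct mark_model {
    int id;
};

/* Hypothetical per-group operations; freeing_mark may be NULL. */
struct group_ops_model {
    void (*freeing_mark)(struct mark_model *mark);
};

struct group_model {
    const struct group_ops_model *ops;
};

/* Core side: call the hook only if the group provided one. */
void notify_mark_freed(struct group_model *group, struct mark_model *mark)
{
    if (group->ops->freeing_mark)
        group->ops->freeing_mark(mark);
}

/* Example listener that cares about freed marks. */
int freed_marks;

void count_freed(struct mark_model *mark)
{
    (void)mark;
    freed_marks++;
}

const struct group_ops_model ops_with_callback = { count_freed };
const struct group_ops_model ops_without_callback = { NULL };
```

A group that does not care about freed marks simply leaves the pointer NULL instead of supplying an empty function.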
     
  • inotify and dnotify will both indicate that they want any event which came
    from a child inode. The fix is to mask off FS_EVENT_ON_CHILD when deciding
    if inotify or dnotify is interested in a given event.

    Signed-off-by: Eric Paris

    Eric Paris
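
The effect of masking off FS_EVENT_ON_CHILD can be shown with a small sketch. The bit values and the two helper functions are illustrative assumptions for this example and are not copied from the kernel headers or from the patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative bit values for this sketch. */
#define FS_ACCESS          0x00000001u
#define FS_MODIFY          0x00000002u
#define FS_EVENT_ON_CHILD  0x08000000u

/* Problem: the flag bit alone makes any child event match. */
bool should_send_buggy(uint32_t watch_mask, uint32_t event_mask)
{
    return (watch_mask & event_mask) != 0;
}

/* Fix: ignore the flag bit when comparing event types. */
bool should_send_fixed(uint32_t watch_mask, uint32_t event_mask)
{
    return (watch_mask & (event_mask & ~FS_EVENT_ON_CHILD)) != 0;
}
```

A watch that asks for modify events on children carries the flag in its own mask, so without the masking an access event on a child would match on the flag alone.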
     
  • entry->lock is needed to make sure entry->mask does not change while
    manipulating it. In dnotify_should_send_event() we don't care if we get an
    old or a new mask value out of this entry so there is no point in taking
    the lock.

    Signed-off-by: Eric Paris

    Eric Paris
     
  • dnotify_should_send_event() assigned a bool using ?true:false when computing
    a bit operation. This is pointless, as the bool type does this for us.

    Signed-off-by: Eric Paris

    Eric Paris
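
The point about bool can be checked with a two-function sketch. The helper names are hypothetical; the behaviour relied on is the standard C99 rule that converting any nonzero scalar to bool (_Bool) yields 1.

```c
#include <stdbool.h>
#include <stdint.h>

/* The redundant form: an explicit ternary. */
bool mask_hit_verbose(uint32_t mask, uint32_t bit)
{
    return (mask & bit) ? true : false;
}

/* The plain form: conversion to bool already yields 0 or 1. */
bool mask_hit_plain(uint32_t mask, uint32_t bit)
{
    return mask & bit;
}
```

This holds even when the matching bit lies above the low byte, because the conversion tests for nonzero rather than truncating the value.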
     
  • fsnotify tells its listeners explicitly when an event happened on the given
    inode versus on the child of the given inode (see __fsnotify_parent()).
    However, the semantics of fsnotify_move() are such that we deliver events
    directly to the two parent directories in question (old_dir and new_dir)
    without using the __fsnotify_parent() call. fsnotify should be
    adding FS_EVENT_ON_CHILD for the notifications to these parents.

    Signed-off-by: Eric Paris

    Eric Paris
     
  • Reimplement inotify_user using fsnotify. This should be feature-for-feature
    identical to the original inotify_user. This does not make any changes
    to the in-kernel inotify feature used by audit. Those patches (and the eventual
    removal of in-kernel inotify) will come after the new inotify_user proves to be
    working correctly.

    Signed-off-by: Eric Paris
    Acked-by: Al Viro
    Cc: Christoph Hellwig

    Eric Paris
     
  • When an fs is unmounted with an fsnotify mark entry attached to one of its
    inodes we need to destroy that mark entry and we also (like inotify) send
    an unmount event.

    Signed-off-by: Eric Paris
    Acked-by: Al Viro
    Cc: Christoph Hellwig

    Eric Paris
     
  • This patch pins any inodes with an fsnotify mark in core. The idea is that
    as soon as the mark is removed from the inode->fsnotify_mark_entries list
    the inode will be iput. In reality it doesn't quite work exactly this way.
    The igrab will happen when the mark is added to an inode, but the iput will
    happen when the inode pointer is NULL'd inside the mark.

    It's possible that 2 racing things will try to remove the mark from
    different directions. One may try to remove the mark because of an
    explicit request and one might try to remove it because the inode was
    deleted. It's possible that the removal because of inode deletion will
    remove the mark from the inode's list, but the removal by explicit request
    will actually set entry->inode = NULL and call the iput. This is safe.

    Signed-off-by: Eric Paris
    Acked-by: Al Viro
    Cc: Christoph Hellwig

    Eric Paris
     
  • inotify needs per group information attached to events. This patch allows
    groups to attach private information and implements a callback so that
    information can be freed when an event is being destroyed.

    Signed-off-by: Eric Paris
    Acked-by: Al Viro
    Cc: Christoph Hellwig

    Eric Paris
     
  • Standard inotify events include a correlation cookie that links the two
    events generated by a dentry move. This patch includes the same behaviour
    in fsnotify events. It is needed so that inotify userspace can be
    implemented on top of fsnotify.

    Signed-off-by: Eric Paris
    Acked-by: Al Viro
    Cc: Christoph Hellwig

    Eric Paris
     
  • When inotify wants to send events to a directory about a child it includes
    the name of the original file. This patch collects that filename and makes
    it available for notification.

    Signed-off-by: Eric Paris
    Acked-by: Al Viro
    Cc: Christoph Hellwig

    Eric Paris