13 Apr, 2012

1 commit

  • commit 66189be74ff5f9f3fd6444315b85be210d07cef2 upstream.

    We can deadlock if we have a write oplock and two processes
    use the same file handle. In this case the first process can't
    unlock its lock if the second process is blocked on that lock at the
    same time.

    Fix it by using posix_lock_file rather than posix_lock_file_wait
    under cinode->lock_mutex. If we request a blocking lock and
    posix_lock_file indicates that there is another lock that prevents
    us, wait until that lock is released and restart our call.
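
    A minimal sketch of the retry pattern described above (simplified,
    and close to but not the literal fs/cifs change):

    try_again:
        mutex_lock(&cinode->lock_mutex);
        rc = posix_lock_file(file, flock, NULL);
        mutex_unlock(&cinode->lock_mutex);
        if (rc == FILE_LOCK_DEFERRED) {
            /* a conflicting lock is held: wait outside lock_mutex so
             * the holder can get in to unlock, then retry */
            rc = wait_event_interruptible(flock->fl_wait, !flock->fl_next);
            if (!rc)
                goto try_again;
            locks_delete_block(flock);
        }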

    Acked-by: Jeff Layton
    Signed-off-by: Pavel Shilovsky
    Signed-off-by: Steve French
    Signed-off-by: Greg Kroah-Hartman

    Pavel Shilovsky
     

01 Mar, 2012

1 commit

  • commit 28d82dc1c4edbc352129f97f4ca22624d1fe61de upstream.

    The current epoll code can be tickled to run basically indefinitely in
    both loop detection path check (on ep_insert()), and in the wakeup paths.
    The programs that tickle this behavior set up deeply linked networks of
    epoll file descriptors that cause the epoll algorithms to traverse them
    indefinitely. A couple of these sample programs have been previously
    posted in this thread: https://lkml.org/lkml/2011/2/25/297.

    To fix the loop detection path check algorithms, I simply keep track of
    the epoll nodes that have already been visited. Thus, the loop detection
    becomes proportional to the number of epoll file descriptors and links.
    This dramatically decreases the run-time of the loop check algorithm. In
    one diabolical case I tried, it reduced the run-time from 15 minutes (all
    in kernel time) to 0.3 seconds.

    Fixing the wakeup paths could be done at wakeup time in a similar manner
    by keeping track of nodes that have already been visited, but the
    complexity is harder, since there can be multiple wakeups on different
    CPUs. Thus, I've opted to limit the number of possible wakeup paths when
    the paths are created.

    This is accomplished by noting that the end-point file descriptors
    found during the loop detection pass (from the newly added link) are
    actually the sources for wakeup events. I keep a list of these file
    descriptors and limit the number and length of these paths that emanate
    from these 'source file descriptors'. In the current implementation I
    allow 1000 paths of length 1, 500 of length 2, 100 of length 3, 50 of
    length 4 and 10 of length 5. Note that it is sufficient to check the
    'source file descriptors' reachable from the newly added link, since no
    other 'source file descriptors' will have newly added links. This allows
    us to check only the wakeup paths that may have gotten too long, and not
    re-check all possible wakeup paths on the system.
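
    A rough sketch of the per-depth accounting described above
    (illustrative only, not the literal eventpoll.c code):

    /* allowed wakeup paths per path length (1..5) */
    static const int path_limits[5] = { 1000, 500, 100, 50, 10 };
    static int path_count[5];

    static int path_count_inc(int nests)
    {
        /* refuse the new link once any per-depth limit is exceeded */
        if (++path_count[nests] > path_limits[nests])
            return -1;
        return 0;
    }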

    In terms of the path limit selection, I think it's first worth noting that
    the most common case for epoll is probably the model where you have 1
    epoll file descriptor that is monitoring n 'source file descriptors'. In
    this case, each 'source file descriptor' has 1 path of length 1. Thus, I
    believe that the limits I'm proposing are quite reasonable and in fact may
    be too generous. Thus, I'm hoping that the proposed limits will not cause
    any workloads that currently work to fail.

    In terms of locking, I have extended the use of the 'epmutex' to all
    epoll_ctl add and remove operations. Currently it's only used in a subset
    of the add paths. I need to hold the epmutex so that we can correctly
    traverse a coherent graph to check the number of paths. I believe that
    this additional locking is probably ok, since it's in the setup/teardown
    paths and doesn't affect the running paths, but it certainly is going to
    add some extra overhead. Also worth noting is that the epmutex was
    recently added to the epoll_ctl add operations in the initial path loop
    detection code, using the argument that it was not on a critical path.

    Another thing to note here, is the length of epoll chains that is allowed.
    Currently, eventpoll.c defines:

    /* Maximum number of nesting allowed inside epoll sets */
    #define EP_MAX_NESTS 4

    This basically means that I am limited to a graph depth of 5 (EP_MAX_NESTS
    + 1). However, this limit is currently only enforced during the loop
    check detection code, and only when the epoll file descriptors are added
    in a certain order. Thus, this limit is currently easily bypassed. The
    newly added check for wakeup paths strictly limits the wakeup paths to a
    length of 5, regardless of the order in which ep's are linked together.
    Thus, a side-effect of the new code is a more consistent enforcement of
    the graph depth.

    Thus far, I've tested this using the sample programs previously
    mentioned, which now either return quickly or return -EINVAL. I've also
    tested using the piptest.c epoll tester, which showed no difference in
    performance. I've also created a number of different epoll networks and
    tested that they behave as expected.

    I believe this solves the original diabolical test cases, while still
    preserving the sane epoll nesting.

    Signed-off-by: Jason Baron
    Cc: Nelson Elhage
    Cc: Davide Libenzi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Jason Baron
     

07 Dec, 2011

1 commit

  • __d_path() API is asking for trouble, and in the case of apparmor's
    d_namespace_path() it is getting just that. The root cause is that when __d_path() misses the root
    it had been told to look for, it stores the location of the most remote ancestor
    in *root. Without grabbing references. Sure, at the moment of call it had
    been pinned down by what we have in *path. And if we raced with umount -l, we
    could have very well stopped at vfsmount/dentry that got freed as soon as
    prepend_path() dropped vfsmount_lock.

    It is safe to compare these pointers with pre-existing (and known to be still
    alive) vfsmount and dentry, as long as all we are asking is "is it the same
    address?". Dereferencing is not safe and apparmor ended up stepping into
    that. d_namespace_path() really wants to examine the place where we stopped,
    even if it's not connected to our namespace. As a result, it looked
    at ->d_sb->s_magic of a dentry that might've been already freed by that point.
    All other callers had been careful enough to avoid that, but it's really
    a bad interface - it invites that kind of trouble.

    The fix is fairly straightforward, even though it's bigger than I'd like:
    * prepend_path() root argument becomes const.
    * __d_path() is never called with NULL/NULL root. It was a kludge
    to start with. Instead, we have an explicit function - d_absolute_path().
    Same as __d_path(), except that it doesn't get root passed and stops where
    it stops. apparmor and tomoyo are using it.
    * __d_path() returns NULL on path outside of root. The main
    caller is show_mountinfo() and that's precisely what we pass root for - to
    skip those outside chroot jail. Those who don't want that can (and do)
    use d_path().
    * __d_path() root argument becomes const. Everyone agrees, I hope.
    * apparmor does *NOT* try to use __d_path() or any of its variants
    when it sees that path->mnt is an internal vfsmount. In that case it's
    definitely not mounted anywhere and dentry_path() is exactly what we want
    there. Handling of sysctl()-triggered weirdness is moved to that place.
    * if apparmor is asked to do pathname relative to chroot jail
    and __d_path() tells it it's not in that jail, the sucker just calls
    d_absolute_path() instead. That's the other remaining caller of __d_path(),
    BTW.
    * seq_path_root() does _NOT_ return -ENAMETOOLONG (it's stupid anyway -
    the normal seq_file logic will take care of growing the buffer and redoing
    the call of ->show() just fine). However, if it gets a path not reachable
    from root, it returns SEQ_SKIP. The only caller has been adjusted (i.e. it
    stopped ignoring the return value as it used to do).
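
    A hedged sketch of the resulting calling convention for a security
    module (buffer handling heavily simplified; apparmor's real code
    differs):

    char *name, buf[PATH_MAX];

    name = __d_path(path, root, buf, sizeof(buf));
    if (!name)
        /* path is not reachable from root: fall back to the absolute
         * path, which stops wherever the tree stops */
        name = d_absolute_path(path, buf, sizeof(buf));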

    Reviewed-by: John Johansen
    ACKed-by: John Johansen
    Signed-off-by: Al Viro
    Cc: stable@vger.kernel.org

    Al Viro
     

17 Nov, 2011

1 commit

  • Takes a vfsmount and a relative path, does a lookup within that vfsmount
    (possibly triggering automounts) and returns the result as the root
    of a subtree suitable for return by ->mount() (i.e. a reference to
    dentry and an active reference to its superblock grabbed, superblock
    locked exclusive).

    btrfs and nfs switched to it instead of open-coding the sucker.
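
    Assuming the helper is mount_subtree() (the text above does not name
    it), a filesystem's ->mount() might use it roughly like this:

    static struct dentry *example_mount(struct file_system_type *fs_type,
                                        int flags, const char *dev_name,
                                        void *data)
    {
        /* example_mount_root() is hypothetical: it mounts the fs root */
        struct vfsmount *mnt = example_mount_root(fs_type, flags, data);

        if (IS_ERR(mnt))
            return ERR_CAST(mnt);
        /* look up the wanted subdirectory inside mnt and return its
         * dentry, superblock referenced and locked exclusive */
        return mount_subtree(mnt, dev_name);
    }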

    Signed-off-by: Al Viro

    Al Viro
     

03 Nov, 2011

2 commits

  • * 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/hch/vfs-queue:
    vfs: add d_prune dentry operation
    vfs: protect i_nlink
    filesystems: add set_nlink()
    filesystems: add missing nlink wrappers
    logfs: remove unnecessary nlink setting
    ocfs2: remove unnecessary nlink setting
    jfs: remove unnecessary nlink setting
    hypfs: remove unnecessary nlink setting
    vfs: ignore error on forced remount
    readlinkat: ensure we return ENOENT for the empty pathname for normal lookups
    vfs: fix dentry leak in simple_fill_super()

    Linus Torvalds
     
  • * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (97 commits)
    jbd2: Unify log messages in jbd2 code
    jbd/jbd2: validate sb->s_first in journal_get_superblock()
    ext4: let ext4_ext_rm_leaf work with EXT_DEBUG defined
    ext4: fix a syntax error in ext4_ext_insert_extent when debugging enabled
    ext4: fix a typo in struct ext4_allocation_context
    ext4: Don't normalize an falloc request if it can fit in 1 extent.
    ext4: remove comments about extent mount option in ext4_new_inode()
    ext4: let ext4_discard_partial_buffers handle unaligned range correctly
    ext4: return ENOMEM if find_or_create_pages fails
    ext4: move vars to local scope in ext4_discard_partial_page_buffers_no_lock()
    ext4: Create helper function for EXT4_IO_END_UNWRITTEN and i_aiodio_unwritten
    ext4: optimize locking for end_io extent conversion
    ext4: remove unnecessary call to waitqueue_active()
    ext4: Use correct locking for ext4_end_io_nolock()
    ext4: fix race in xattr block allocation path
    ext4: trace punch_hole correctly in ext4_ext_map_blocks
    ext4: clean up AGGRESSIVE_TEST code
    ext4: move variables to their scope
    ext4: fix quota accounting during migration
    ext4: migrate cleanup
    ...

    Linus Torvalds
     

01 Nov, 2011

2 commits

  • Standardize the style for compiler-based printf format verification.
    Standardize the location of __printf too.

    Done via script and a little typing.

    $ grep -rPl --include=*.[ch] -w "__attribute__" * | \
    grep -vP "^(tools|scripts|include/linux/compiler-gcc.h)" | \
    xargs perl -n -i -e 'local $/; while (<>) { s/\b__attribute__\s*\(\s*\(\s*format\s*\(\s*printf\s*,\s*(.+)\s*,\s*(.+)\s*\)\s*\)\s*\)/__printf($1, $2)/g ; print; }'
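
    For illustration, the conversion turns declarations like the first
    form into the second (the __printf macro comes from
    include/linux/compiler-gcc.h):

    /* before */
    int foo(const char *fmt, ...) __attribute__((format(printf, 1, 2)));

    /* after */
    __printf(1, 2)
    int foo(const char *fmt, ...);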

    [akpm@linux-foundation.org: revert arch bits]
    Signed-off-by: Joe Perches
    Cc: "Kirill A. Shutemov"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Perches
     
  • The basic idea behind cross memory attach is to allow MPI programs doing
    intra-node communication to do a single copy of the message rather than a
    double copy of the message via shared memory.

    The following patch attempts to achieve this by allowing a destination
    process, given an address and size from a source process, to copy memory
    directly from the source process into its own address space via a system
    call. There is also a symmetrical ability to copy from the current
    process's address space into a destination process's address space.
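
    A userspace sketch of the read direction (per the man page linked
    below; pid and remote_addr come from the source process, error
    handling omitted):

    #include <sys/uio.h>

    char buf[4096];
    struct iovec local  = { .iov_base = buf,         .iov_len = sizeof(buf) };
    struct iovec remote = { .iov_base = remote_addr, .iov_len = sizeof(buf) };

    ssize_t n = process_vm_readv(pid, &local, 1, &remote, 1, 0);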

    - Use of /proc/pid/mem has been considered, but there are issues with
    using it:
    - Does not allow for specifying iovecs for both src and dest, assuming
    preadv or pwritev was implemented either the area read from or
    written to would need to be contiguous.
    - Currently mem_read allows only processes that are currently
    ptrace'ing the target and are still able to ptrace the target to read
    from the target. This check could possibly be moved to the open call,
    but it's not clear exactly what race this restriction is stopping
    (the reason appears to have been lost).
    - Having to send the fd of /proc/self/mem via SCM_RIGHTS on unix
    domain socket is a bit ugly from a userspace point of view,
    especially when you may have hundreds if not (eventually) thousands
    of processes that all need to do this with each other
    - Doesn't allow for some future use of the interface we would like to
    consider adding in the future (see below)
    - Interestingly, reading from /proc/pid/mem currently actually
    involves two copies! (But this could be fixed pretty easily.)

    As mentioned previously, use of vmsplice instead was considered, but it
    has problems. Since you need the reader and writer working
    co-operatively, you block if the pipe is not drained, which requires
    some wrapping to do non-blocking sends or polling on the receive side.
    In all-to-all communication it requires ordering, otherwise you can
    deadlock. And in the example of many MPI tasks writing to one MPI task,
    vmsplice serialises the copying.

    There are some cases of MPI collectives where even a single-copy
    interface does not get us all the performance gain we could have. For
    example, in an MPI_Reduce, rather than copy the data from the source we
    would like to use it directly in a math op (say the reduce is doing a
    sum), as this would save us doing a copy. We don't need to keep a copy
    of the data from the source. I haven't implemented this, but I think
    this interface could in the future do all this through the use of the
    flags - e.g. we could specify the math operation and type, and the
    kernel, rather than just copying the data, would apply the specified
    operation between the source and destination and store it in the
    destination.

    Although we don't have a "second user" of the interface (though I've had
    some nibbles from people who may be interested in using it for
    intra-process messaging which is not MPI), this interface is something
    which hardware vendors are already doing in their custom drivers to
    implement fast local communication. So in addition to this being useful
    for OpenMPI, it would mean the driver maintainers don't have to fix
    things up when the mm changes.

    There was some discussion about how much faster a true zero copy would
    go. Here's a link back to the email with some testing I did on that:

    http://marc.info/?l=linux-mm&m=130105930902915&w=2

    There is a basic man page for the proposed interface here:

    http://ozlabs.org/~cyeoh/cma/process_vm_readv.txt

    This has been implemented for x86 and powerpc; other architectures
    should mainly (I think) just need to add syscall numbers for
    process_vm_readv and process_vm_writev. There are 32-bit compatibility
    versions for 64-bit kernels.

    For arch maintainers there are some simple tests to be able to quickly
    verify that the syscalls are working correctly here:

    http://ozlabs.org/~cyeoh/cma/cma-test-20110718.tgz

    Signed-off-by: Chris Yeoh
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Thomas Gleixner
    Cc: Arnd Bergmann
    Cc: Paul Mackerras
    Cc: Benjamin Herrenschmidt
    Cc: David Howells
    Cc: James Morris
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christopher Yeoh
     

29 Oct, 2011

2 commits

  • Rearrange the fields in struct inode so that on an x86_64 system,
    fields that require 8-byte alignment don't end up causing 4-byte holes
    in the structure. It reduces the size of struct inode from 568 bytes
    to 552 bytes.

    Also move the fields protected by i_lock (i_blocks, i_bytes, and
    i_size) into the same cache line as i_lock.
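
    A toy illustration of the effect (not struct inode itself): on x86_64
    the 8-byte pointer forces 4 bytes of padding after 'a' in the first
    layout, which reordering removes.

    struct padded   { int a; void *p; int b; };   /* 24 bytes on x86_64 */
    struct repacked { void *p; int a; int b; };   /* 16 bytes on x86_64 */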

    Signed-off-by: "Theodore Ts'o"

    Theodore Ts'o
     
  • * 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/hch/vfs-queue: (21 commits)
    leases: fix write-open/read-lease race
    nfs: drop unnecessary locking in llseek
    ext4: replace cut'n'pasted llseek code with generic_file_llseek_size
    vfs: add generic_file_llseek_size
    vfs: do (nearly) lockless generic_file_llseek
    direct-io: merge direct_io_walker into __blockdev_direct_IO
    direct-io: inline the complete submission path
    direct-io: separate map_bh from dio
    direct-io: use a slab cache for struct dio
    direct-io: rearrange fields in dio/dio_submit to avoid holes
    direct-io: fix a wrong comment
    direct-io: separate fields only used in the submission path from struct dio
    vfs: fix spinning prevention in prune_icache_sb
    vfs: add a comment to inode_permission()
    vfs: pass all mask flags check_acl and posix_acl_permission
    vfs: add hex format for MAY_* flag values
    vfs: indicate that the permission functions take all the MAY_* flags
    compat: sync compat_stats with statfs.
    vfs: add "device" tag to /proc/self/mountstats
    cleanup: vfs: small comment fix for block_invalidatepage
    ...

    Fix up trivial conflict in fs/gfs2/file.c (llseek changes)

    Linus Torvalds
     

28 Oct, 2011

3 commits

  • Add a generic_file_llseek variant to the VFS that allows passing in
    the maximum file size of the file system, instead of always
    using maxbytes from the superblock.

    This can be used to eliminate some cut'n'paste seek code in ext4.
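
    A hedged sketch of a caller (signature as described above; the limit
    helper is hypothetical and a real filesystem computes its own):

    static loff_t example_llseek(struct file *file, loff_t offset, int origin)
    {
        /* the fs's own limit, used instead of sb->s_maxbytes */
        loff_t maxbytes = example_max_file_size(file->f_mapping->host);

        return generic_file_llseek_size(file, offset, origin, maxbytes);
    }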

    Signed-off-by: Andi Kleen
    Signed-off-by: Christoph Hellwig

    Andi Kleen
     
  • The i_mutex lock use of generic_file_llseek hurts. Independent processes
    accessing the same file synchronize over a single lock, even though
    they have no need for synchronization at all.

    Under high utilization this can cause llseek to scale very poorly on larger
    systems.

    This patch does some rethinking of the llseek locking model:

    First the 64bit f_pos is not necessarily atomic without locks
    on 32bit systems. This can already cause races with read() today.
    This was discussed on linux-kernel in the past and deemed acceptable.
    The patch does not change that.

    Let's look at the different seek variants:

    SEEK_SET: Doesn't really need any locking.
    If there's a race one writer wins, the other loses.

    For 32bit the non atomic update races against read()
    stay the same. Without a lock they can also happen
    against write() now. The read() race was deemed
    acceptable in past discussions, and I think if it's
    ok for read it's ok for write too.

    => Don't need a lock.

    SEEK_END: This behaves like SEEK_SET plus it reads
    the maximum size too. Reading the maximum size would have the
    32bit atomic problem. But luckily we already have a way to read
    the maximum size without locking (i_size_read), so we
    can just use that instead.

    Without i_mutex there is no synchronization with write() anymore,
    however since the write() update is atomic on 64bit it just behaves
    like another racy SEEK_SET. On non atomic 32bit it's the same
    as SEEK_SET.

    => Don't need a lock, but need to use i_size_read()

    SEEK_CUR: This has a read-modify-write race window
    on the same file. One could argue that any application
    doing unsynchronized seeks on the same file is already broken.
    But for the sake of not adding a regression here I'm
    using the file->f_lock to synchronize this. Using this
    lock is much better than the inode mutex because it doesn't
    synchronize between processes.

    => So still need a lock, but can use a f_lock.

    This patch implements this new scheme in generic_file_llseek.
    I dropped generic_file_llseek_unlocked and changed all callers.
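
    A condensed sketch of the scheme (simplified from the real
    generic_file_llseek):

    loff_t llseek_sketch(struct file *file, loff_t offset, int origin)
    {
        struct inode *inode = file->f_mapping->host;

        switch (origin) {
        case SEEK_END:
            offset += i_size_read(inode);  /* lockless i_size */
            /* fall through */
        case SEEK_SET:
            if (offset < 0)
                return -EINVAL;
            file->f_pos = offset;          /* racy, like read()/write() */
            return offset;
        case SEEK_CUR:
            spin_lock(&file->f_lock);      /* guard the read-modify-write */
            offset += file->f_pos;
            if (offset >= 0)
                file->f_pos = offset;
            spin_unlock(&file->f_lock);
            return offset < 0 ? -EINVAL : offset;
        }
        return -EINVAL;
    }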

    Signed-off-by: Andi Kleen
    Signed-off-by: Christoph Hellwig

    Andi Kleen
     
  • We are going to add more flags, and having them in hex format
    makes it simpler.
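
    For reference, the flag block ends up looking like this (values as in
    include/linux/fs.h of this era; treat them as illustrative):

    #define MAY_EXEC    0x00000001
    #define MAY_WRITE   0x00000002
    #define MAY_READ    0x00000004
    #define MAY_APPEND  0x00000008
    #define MAY_ACCESS  0x00000010
    #define MAY_OPEN    0x00000020
    #define MAY_CHDIR   0x00000040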

    Acked-by: J. Bruce Fields
    Acked-by: David Howells
    Signed-off-by: Aneesh Kumar K.V
    Signed-off-by: Christoph Hellwig

    Aneesh Kumar K.V
     

25 Oct, 2011

1 commit

  • * 'for-3.2' of git://linux-nfs.org/~bfields/linux: (103 commits)
    nfs41: implement DESTROY_CLIENTID operation
    nfsd4: typo logical vs bitwise negate for want_mask
    nfsd4: allow NFS4_SHARE_SIGNAL_DELEG_WHEN_RESRC_AVAIL | NFS4_SHARE_PUSH_DELEG_WHEN_UNCONTENDED
    nfsd4: seq->status_flags may be used unitialized
    nfsd41: use SEQ4_STATUS_BACKCHANNEL_FAULT when cb_sequence is invalid
    nfsd4: implement new 4.1 open reclaim types
    nfsd4: remove unneeded CLAIM_DELEGATE_CUR workaround
    nfsd4: warn on open failure after create
    nfsd4: preallocate open stateid in process_open1()
    nfsd4: do idr preallocation with stateid allocation
    nfsd4: preallocate nfs4_file in process_open1()
    nfsd4: clean up open owners on OPEN failure
    nfsd4: simplify process_open1 logic
    nfsd4: make is_open_owner boolean
    nfsd4: centralize renew_client() calls
    nfsd4: typo logical vs bitwise negate
    nfs: fix bug about IPv6 address scope checking
    nfsd4: more robust ignoring of WANT bits in OPEN
    nfsd4: move name-length checks to xdr
    nfsd4: move access/deny validity checks to xdr code
    ...

    Linus Torvalds
     

22 Sep, 2011

1 commit

  • * 'for-linus' of git://git.kernel.dk/linux-block:
    floppy: use del_timer_sync() in init cleanup
    blk-cgroup: be able to remove the record of unplugged device
    block: Don't check QUEUE_FLAG_SAME_COMP in __blk_complete_request
    mm: Add comment explaining task state setting in bdi_forker_thread()
    mm: Cleanup clearing of BDI_pending bit in bdi_forker_thread()
    block: simplify force plug flush code a little bit
    block: change force plug flush call order
    block: Fix queue_flag update when rq_affinity goes from 2 to 1
    block: separate priority boosting from REQ_META
    block: remove READ_META and WRITE_META
    xen-blkback: fixed indentation and comments
    xen-blkback: Don't disconnect backend until state switched to XenbusStateClosed.

    Linus Torvalds
     

26 Aug, 2011

1 commit

  • Purely in-memory filesystems do not use the inode hash as the dcache
    tells us if an entry already exists. As a result, they do not call
    unlock_new_inode, and thus directory inodes do not get put into a
    different lockdep class for i_mutex.

    We need the different lockdep classes, because the locking order for
    i_mutex is different for directory inodes and regular inodes. Directory
    inodes can do "readdir()", which takes i_mutex *before* possibly taking
    mm->mmap_sem (due to a page fault while copying the directory entry to
    user space).

    In contrast, regular inodes can be mmap'ed, which takes mm->mmap_sem
    before accessing i_mutex.

    The two cases can never happen for the same inode, so no real deadlock
    can occur, but without the different lockdep classes, lockdep cannot
    understand that. As a result, if CONFIG_DEBUG_LOCK_ALLOC is set, this
    can lead to false positives from lockdep like below:

    find/645 is trying to acquire lock:
    (&mm->mmap_sem){++++++}, at: [] might_fault+0x5c/0xac

    but task is already holding lock:
    (&sb->s_type->i_mutex_key#15){+.+.+.}, at: []
    vfs_readdir+0x5b/0xb4

    which lock already depends on the new lock.

    the existing dependency chain (in reverse order) is:

    -> #1 (&sb->s_type->i_mutex_key#15){+.+.+.}:
    [] lock_acquire+0xbf/0x103
    [] __mutex_lock_common+0x4c/0x361
    [] mutex_lock_nested+0x40/0x45
    [] hugetlbfs_file_mmap+0x82/0x110
    [] mmap_region+0x258/0x432
    [] do_mmap_pgoff+0x2ac/0x306
    [] sys_mmap_pgoff+0x118/0x16a
    [] sys_mmap+0x22/0x24
    [] system_call_fastpath+0x16/0x1b

    -> #0 (&mm->mmap_sem){++++++}:
    [] __lock_acquire+0xa1a/0xcf7
    [] lock_acquire+0xbf/0x103
    [] might_fault+0x89/0xac
    [] filldir+0x6f/0xc7
    [] dcache_readdir+0x67/0x205
    [] vfs_readdir+0x7b/0xb4
    [] sys_getdents+0x7e/0xd1
    [] system_call_fastpath+0x16/0x1b

    This patch moves the directory vs file lockdep annotation into a helper
    function that can be called by in-memory filesystems and has hugetlbfs
    call it.
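
    A hedged sketch of the call site; the helper name,
    lockdep_annotate_inode_mutex_key(), is not given in the text above
    and is assumed here, and a real in-memory fs's inode setup differs:

    static struct inode *example_get_inode(struct super_block *sb, umode_t mode)
    {
        struct inode *inode = new_inode(sb);

        if (inode && S_ISDIR(mode)) {
            inode->i_mode = mode;
            inode->i_op  = &simple_dir_inode_operations;
            inode->i_fop = &simple_dir_operations;
            /* directory inodes here never go through unlock_new_inode(),
             * so move i_mutex into the directory lockdep class by hand */
            lockdep_annotate_inode_mutex_key(inode);
        }
        return inode;
    }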

    Signed-off-by: Josh Boyer
    Acked-by: Peter Zijlstra
    Signed-off-by: Linus Torvalds

    Josh Boyer
     

20 Aug, 2011

2 commits

  • We currently use a bit in fl_flags to record whether a lease is being
    broken, and set fl_type to the type (RDLCK or UNLCK) that it will
    eventually have. This means that once the lease break starts, we forget
    what the lease's type *used* to be. Breaking a read lease will then
    result in blocking read opens, even though there's no conflict--because
    the lease type is now F_UNLCK and we can no longer tell whether it was
    previously a read or write lease.

    So, instead keep fl_type as the original type (the type which we
    enforce), and keep track of whether we're unlocking or merely
    downgrading by replacing the single FL_INPROGRESS flag by
    FL_UNLOCK_PENDING and FL_DOWNGRADE_PENDING flags.

    To get this right we also need to track separate downgrade and break
    times, to handle the case where a write-leased file gets conflicting
    opens first for read, then later for write.

    (I first considered just eliminating the downgrade behavior
    completely--nfsv4 doesn't need it, and nobody as far as I can tell
    actually uses it currently--but Jeremy Allison tells me that Windows
    oplocks do behave this way, so Samba will probably use this some day.)
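
    A hedged sketch of the conflict test this enables (simplified, not
    the literal locks.c code):

    static bool lease_blocks_open(struct file_lock *lease, bool open_for_write)
    {
        /* fl_type now always holds the *enforced* type, even while a
         * break or downgrade is pending in fl_flags */
        if (lease->fl_type == F_WRLCK)
            return true;            /* write lease conflicts with any open */
        return open_for_write;      /* read lease conflicts only with writes */
    }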

    Reviewed-by: Jeff Layton
    Signed-off-by: J. Bruce Fields

    J. Bruce Fields
     
  • F_INPROGRESS isn't exposed to userspace. To me it makes more sense in
    fl_flags....

    Reviewed-by: Jeff Layton
    Signed-off-by: J. Bruce Fields

    J. Bruce Fields
     

07 Aug, 2011

1 commit

  • The inode structure layout is largely random, and some of the vfs paths
    really do care. The path lookup in particular is already quite D$
    intensive, and profiles show that accessing the 'inode->i_op->xyz'
    fields is quite costly.

    We already optimized the dcache to not unnecessarily load the d_op
    structure for members that are often NULL using the DCACHE_OP_xyz bits
    in dentry->d_flags, and this does something very similar for the inode
    ops that are used during pathname lookup.

    It also re-orders the fields so that the fields accessed by 'stat' are
    together at the beginning of the inode structure, and roughly in the
    order accessed.

    The effect of this seems to be in the 1-2% range for an empty kernel
    "make -j" run (which is fairly kernel-intensive, mostly in filename
    lookup), so it's visible. The numbers are fairly noisy, though, and
    likely depend a lot on exact microarchitecture. So there's more tuning
    to be done.

    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

27 Jul, 2011

2 commits

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6:
    merge fchmod() and fchmodat() guts, kill ancient broken kludge
    xfs: fix misspelled S_IS...()
    xfs: get rid of open-coded S_ISREG(), etc.
    vfs: document locking requirements for d_move, __d_move and d_materialise_unique
    omfs: fix (mode & S_IFDIR) abuse
    btrfs: S_ISREG(mode) is not mode & S_IFREG...
    ima: fmode_t misspelled as mode_t...
    pci-label.c: size_t misspelled as mode_t
    jffs2: S_ISLNK(mode & S_IFMT) is pointless
    snd_msnd ->mode is fmode_t, not mode_t
    v9fs_iop_get_acl: get rid of unused variable
    vfs: dont chain pipe/anon/socket on superblock s_inodes list
    Documentation: Exporting: update description of d_splice_alias
    fs: add missing unlock in default_llseek()

    Linus Torvalds
     
  • Workloads using pipes and sockets hit inode_sb_list_lock contention.

    The superblock s_inodes list is needed for quota, dirty, pagecache and
    fsnotify management. pipe/anon/socket filesystems are clearly not
    candidates for these.

    Signed-off-by: Eric Dumazet
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Eric Dumazet
     

26 Jul, 2011

5 commits

  • * 'for-3.1' of git://linux-nfs.org/~bfields/linux:
    nfsd: don't break lease on CLAIM_DELEGATE_CUR
    locks: rename lock-manager ops
    nfsd4: update nfsv4.1 implementation notes
    nfsd: turn on reply cache for NFSv4
    nfsd4: call nfsd4_release_compoundargs from pc_release
    nfsd41: Deny new lock before RECLAIM_COMPLETE done
    fs: locks: remove init_once
    nfsd41: check the size of request
    nfsd41: error out when client sets maxreq_sz or maxresp_sz too small
    nfsd4: fix file leak on open_downgrade
    nfsd4: remember to put RW access on stateid destruction
    NFSD: Added TEST_STATEID operation
    NFSD: added FREE_STATEID operation
    svcrpc: fix list-corrupting race on nfsd shutdown
    rpc: allow autoloading of gss mechanisms
    svcauth_unix.c: quiet sparse noise
    svcsock.c: include sunrpc.h to quiet sparse noise
    nfsd: Remove deprecated nfsctl system call and related code.
    NFSD: allow OP_DESTROY_CLIENTID to be only op in COMPOUND

    Fix up trivial conflicts in Documentation/feature-removal-schedule.txt

    Linus Torvalds
     
  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (43 commits)
    fs: Merge split strings
    treewide: fix potentially dangerous trailing ';' in #defined values/expressions
    uwb: Fix misspelling of neighbourhood in comment
    net, netfilter: Remove redundant goto in ebt_ulog_packet
    trivial: don't touch files that are removed in the staging tree
    lib/vsprintf: replace link to Draft by final RFC number
    doc: Kconfig: `to be' -> `be'
    doc: Kconfig: Typo: square -> squared
    doc: Konfig: Documentation/power/{pm => apm-acpi}.txt
    drivers/net: static should be at beginning of declaration
    drivers/media: static should be at beginning of declaration
    drivers/i2c: static should be at beginning of declaration
    XTENSA: static should be at beginning of declaration
    SH: static should be at beginning of declaration
    MIPS: static should be at beginning of declaration
    ARM: static should be at beginning of declaration
    rcu: treewide: Do not use rcu_read_lock_held when calling rcu_dereference_check
    Update my e-mail address
    PCIe ASPM: forcedly -> forcibly
    gma500: push through device driver tree
    ...

    Fix up trivial conflicts:
    - arch/arm/mach-ep93xx/dma-m2p.c (deleted)
    - drivers/gpio/gpio-ep93xx.c (renamed and context nearby)
    - drivers/net/r8169.c (just context changes)

    Linus Torvalds
     
  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6:
    fs: take the ACL checks to common code
    bury posix_acl_..._masq() variants
    kill boilerplates around posix_acl_create_masq()
    generic_acl: no need to clone acl just to push it to set_cached_acl()
    kill boilerplate around posix_acl_chmod_masq()
    reiserfs: cache negative ACLs for v1 stat format
    xfs: cache negative ACLs if there is no attribute fork
    9p: do no return 0 from ->check_acl without actually checking
    vfs: move ACL cache lookup into generic code
    CIFS: Fix oops while mounting with prefixpath
    xfs: Fix wrong return value of xfs_file_aio_write
    fix devtmpfs race
    caam: don't pass bogus S_IFCHR to debugfs_create_...()
    get rid of create_proc_entry() abuses - proc_mkdir() is there for purpose
    asus-wmi: ->is_visible() can't return negative
    fix jffs2 ACLs on big-endian with 16bit mode_t
    9p: close ACL leaks
    ocfs2_init_acl(): fix a leak
    VFS : mount lock scalability for internal mounts

    Linus Torvalds
     
  • Replace the ->check_acl method with a ->get_acl method that simply reads an
    ACL from disk after having a cache miss. This means we can replace the ACL
    checking boilerplate code with a single implementation in namei.c.
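
    A hedged sketch of a ->get_acl instance under this scheme
    (example_read_acl_from_disk() is hypothetical):

    static struct posix_acl *example_get_acl(struct inode *inode, int type)
    {
        /* called only on a cache miss; generic code in namei.c does the
         * caching and the actual permission check */
        struct posix_acl *acl = example_read_acl_from_disk(inode, type);

        if (!IS_ERR(acl))
            set_cached_acl(inode, type, acl);
        return acl;
    }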

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
     
  • * 'for-3.1/core' of git://git.kernel.dk/linux-block: (24 commits)
    block: strict rq_affinity
    backing-dev: use synchronize_rcu_expedited instead of synchronize_rcu
    block: fix patch import error in max_discard_sectors check
    block: reorder request_queue to remove 64 bit alignment padding
    CFQ: add think time check for group
    CFQ: add think time check for service tree
    CFQ: move think time check variables to a separate struct
    fixlet: Remove fs_excl from struct task.
    cfq: Remove special treatment for metadata rqs.
    block: document blk_plug list access
    block: avoid building too big plug list
    compat_ioctl: fix make headers_check regression
    block: eliminate potential for infinite loop in blkdev_issue_discard
    compat_ioctl: fix warning caused by qemu
    block: flush MEDIA_CHANGE from drivers on close(2)
    blk-throttle: Make total_nr_queued unsigned
    block: Add __attribute__((format(printf...) and fix fallout
    fs/partitions/check.c: make local symbols static
    block:remove some spare spaces in genhd.c
    block:fix the comment error in blkdev.h
    ...

    Linus Torvalds
     

24 Jul, 2011

1 commit

  • A number of file systems that don't have a mount point (e.g. sockfs
    and pipefs) are not marked as long term. Therefore in
    mntput_no_expire, all cpu locks of the vfsmount lock are taken instead
    of just the local cpu's lock to aggregate reference counts when we
    release a reference to a file object. In fact, only the local lock
    needs to be taken to update ref counts, as these file systems are in
    no danger of going away until we are ready to unregister them.

    The attached patch marks file systems using kern_mount without a
    mount point as long term. The contention on the vfsmount lock
    is now eliminated. Before unregistering such a file system,
    kern_unmount should be called to remove the long term flag and
    make the mount point ready to be freed.

    Signed-off-by: Tim Chen
    Signed-off-by: Al Viro

    Tim Chen
     

21 Jul, 2011

7 commits

  • The big kernel lock has been removed, and setlease now uses lock_flocks()
    to hold the special spinlock file_lock_lock, courtesy of Matthew.
    So just remove the out-of-date NOTE.

    Signed-off-by: Wanlong Gao
    Signed-off-by: Al Viro

    Wanlong Gao
     
  • Btrfs needs to be able to control how filemap_write_and_wait_range() is called
    in fsync to make it less of a painful operation, so push taking i_mutex and
    the call to filemap_write_and_wait() down into the ->fsync() handlers. Some
    file systems can drop taking the i_mutex altogether, it seems, like ext3 and
    ocfs2. For correctness' sake I just pushed everything down in all cases to make
    sure that we keep the current behavior the same for everybody, and then each
    individual fs maintainer can make up their mind about what to do from there.
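
    A hedged sketch of one handler after the push-down (range-based
    ->fsync prototype per this change; example_sync_inode() is
    hypothetical):

    static int example_fsync(struct file *file, loff_t start, loff_t end,
                             int datasync)
    {
        struct inode *inode = file->f_mapping->host;
        int ret;

        /* previously done by the VFS before calling ->fsync() */
        ret = filemap_write_and_wait_range(file->f_mapping, start, end);
        if (ret)
            return ret;

        mutex_lock(&inode->i_mutex);
        ret = example_sync_inode(inode, datasync);
        mutex_unlock(&inode->i_mutex);
        return ret;
    }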
    Thanks,

    Acked-by: Jan Kara
    Signed-off-by: Josef Bacik
    Signed-off-by: Al Viro

    Josef Bacik
     
  • This just gets us ready to support the SEEK_HOLE and SEEK_DATA flags. Turns out
    using fiemap in things like cp causes more problems than it solves, so let's try
    to give userspace an interface that doesn't suck. We need to match Solaris
    here, and the definitions are

    *o* If /whence/ is SEEK_HOLE, the offset of the start of the
    next hole greater than or equal to the supplied offset
    is returned. The definition of a hole is provided near
    the end of the DESCRIPTION.

    *o* If /whence/ is SEEK_DATA, the file pointer is set to the
    start of the next non-hole file region greater than or
    equal to the supplied offset.

    So in the generic case the entire file is data and there is a virtual hole at
    the end. That means we will just return i_size for SEEK_HOLE and will return
    the same offset for SEEK_DATA. This is how Solaris does it so we have to do it
    the same way.
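
    A sketch of the generic default just described (simplified):

    static loff_t seek_hole_data_sketch(struct inode *inode, loff_t offset,
                                        int whence)
    {
        loff_t size = i_size_read(inode);

        if (offset >= size)
            return -ENXIO;          /* nothing at or beyond EOF */
        /* the whole file is data, with one virtual hole at EOF */
        return whence == SEEK_HOLE ? size : offset;
    }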

    Thanks,

    Signed-off-by: Josef Bacik
    Signed-off-by: Al Viro

    Josef Bacik
     
  • Simple filesystems always pass inode->i_sb->s_bdev as the block device
    argument, and never need an end_io handler. Let's simplify things for
    them and for my grepping activity by dropping these arguments. The
    only thing not falling into that scheme is ext4, which passes an
    end_io handler without needing special flags (yet), but given how
    messy the direct I/O code there is, use of __blockdev_direct_IO
    in one instead of two out of three cases isn't going to make a large
    difference anyway.
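
    A hedged sketch of the simplified calling convention for a "simple"
    filesystem (signature inferred from the description above;
    example_get_block is the fs's own get_block_t):

    static ssize_t example_direct_IO(int rw, struct kiocb *iocb,
                                     const struct iovec *iov, loff_t offset,
                                     unsigned long nr_segs)
    {
        struct inode *inode = iocb->ki_filp->f_mapping->host;

        /* no bdev argument and no end_io handler anymore: the block
         * device comes from the inode's superblock internally */
        return blockdev_direct_IO(rw, iocb, inode, iov, offset, nr_segs,
                                  example_get_block);
    }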

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
     
  • i_alloc_sem is a rather special rw_semaphore. It's the last one that may
    be released by a non-owner, and its write side is always mirrored by
    real exclusion. Its intended use is to wait for all pending direct I/O
    requests to finish before starting a truncate.

    Replace it with a hand-grown construct:

    - exclusion for truncates is already guaranteed by i_mutex, so it can
    simply fall away
    - the reader side is replaced by an i_dio_count member in struct inode
    that counts the number of pending direct I/O requests. Truncate can't
    proceed as long as it's non-zero
    - when i_dio_count drops back to zero we wake up a pending truncate using
    wake_up_bit on a new bit in i_flags
    - new references to i_dio_count can't appear while we are waiting for
    it to reach zero, because taking the direct I/O count always needs i_mutex
    (or an equivalent like XFS's i_iolock) for starting a new operation.

    This scheme is much simpler, and saves the space of a spinlock_t and a
    struct list_head in struct inode (typically 160 bits on a non-debug 64-bit
    system).
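
    A hedged sketch of the waiting side (helper names as added by this
    series; truncate callers hold i_mutex):

    static void example_setsize(struct inode *inode, loff_t newsize)
    {
        /* sleeps until i_dio_count drops to zero; the direct I/O
         * completion path calls inode_dio_done(), which wakes us
         * via wake_up_bit() */
        inode_dio_wait(inode);
        truncate_setsize(inode, newsize);
    }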

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
     
  • Now that we have a per-superblock shrinker implementation, we can add a
    filesystem specific callout to it to allow filesystem internal
    caches to be shrunk by the superblock shrinker.

    Rather than perpetuate the multipurpose shrinker callback API (i.e.
    nr_to_scan == 0 meaning "tell me how many objects are freeable in the
    cache"), two operations will be added. The first will return the
    number of objects that are freeable, the second is the actual
    shrinker call.
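
    A hedged sketch of the two callouts (op names as they landed in
    super_operations; the counting and pruning helpers are hypothetical):

    static int example_nr_cached_objects(struct super_block *sb)
    {
        return example_count_freeable(sb);   /* how many objects could go */
    }

    static void example_free_cached_objects(struct super_block *sb, int nr)
    {
        example_prune_cache(sb, nr);         /* actually reclaim nr objects */
    }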

    Signed-off-by: Dave Chinner
    Signed-off-by: Al Viro

    Dave Chinner
     
  • With context based shrinkers, we can implement a per-superblock
    shrinker that shrinks the caches attached to the superblock. We
    currently have global shrinkers for the inode and dentry caches that
    split up into per-superblock operations via a coarse proportioning
    method that does not batch very well. The global shrinkers also
    have a dependency - dentries pin inodes - so we have to be very
    careful about how we register the global shrinkers so that the
    implicit call order is always correct.

    With a per-sb shrinker callout, we can encode this dependency
    directly into the per-sb shrinker, hence avoiding the need for
    strictly ordering shrinker registrations. We also have no need for
    any proportioning code, as the shrinker subsystem already provides
    this functionality across all shrinkers. Allowing the shrinker to
    operate on a single superblock at a time means that we do less
    superblock list traversals and locking and reclaim should batch more
    effectively. This should result in less CPU overhead for reclaim and
    potentially faster reclaim of items from each filesystem.

    Signed-off-by: Dave Chinner
    Signed-off-by: Al Viro

    Dave Chinner