05 Jan, 2012

1 commit

  • From c6d615d2b97fe305cbf123a8751ced859dca1d5e Mon Sep 17 00:00:00 2001
    From: NeilBrown
    Date: Wed, 16 Nov 2011 09:39:05 +1100
    Subject: [PATCH] NFS - fix recent breakage to NFS error handling.

    commit 02c24a82187d5a628c68edfe71ae60dc135cd178 made a small and
    presumably unintended change to write error handling in NFS.

    Previously an error from filemap_write_and_wait_range would only be of
    interest if nfs_file_fsync did not return an error. After this commit,
    an error from filemap_write_and_wait_range would mean that (the rest of)
    nfs_file_fsync would not even be called.

    This means that:
    1/ you are more likely to see EIO than e.g. EDQUOT or ENOSPC.
    2/ NFS_CONTEXT_ERROR_WRITE remains set for longer so more writes are
    synchronous.

    This patch restores previous behaviour.

    Cc: stable@kernel.org
    Cc: Josef Bacik
    Cc: Jan Kara
    Cc: Al Viro
    Signed-off-by: NeilBrown
    Signed-off-by: Trond Myklebust

    NeilBrown
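
The error precedence this commit message describes can be sketched in a few lines of userspace C. This is only an illustration under assumptions: `combine_fsync_errors` is a hypothetical helper, not a kernel function, and its two arguments stand in for the results of filemap_write_and_wait_range and of the rest of nfs_file_fsync.

```c
#include <assert.h>
#include <errno.h>

/*
 * Hypothetical helper (not kernel code). Both arguments are 0 on success
 * or a negative errno value.
 *
 * flush_err: result of the generic writeback step.
 * fsync_err: result of the NFS-specific step, which is always run.
 *
 * The NFS-specific error takes precedence; the writeback error is only
 * reported when the NFS-specific step succeeded.
 */
int combine_fsync_errors(int flush_err, int fsync_err)
{
    if (fsync_err != 0)
        return fsync_err;
    return flush_err;
}
```

With this ordering, a specific error such as -ENOSPC from the NFS-specific step is reported in preference to a generic -EIO from the writeback step.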
     

16 Dec, 2011

1 commit

  • After commit 06222e491e663dac939f04b125c9dc52126a75c4 (fs: handle
    SEEK_HOLE/SEEK_DATA properly in all fs's that define their own llseek)
    the behaviour of llseek() was changed so that it always revalidates
    the file size. The bug appears to be due to a logic error in the
    aforementioned commit: a condition that always evaluates to 'true'.

    Reported-by: Roel Kluin
    Signed-off-by: Trond Myklebust
    Cc: stable@vger.kernel.org [>=3.1]

    Trond Myklebust
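
The message does not quote the faulty expression. A common way for a test to become always true is joining two inequality checks with `||` where `&&` was meant, and the sketch below assumes that shape purely for illustration. Both function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>   /* SEEK_SET, SEEK_CUR, SEEK_END */

/*
 * Assumed buggy shape: no value equals both constants at once, so at
 * least one inequality always holds and the result is always true.
 */
bool needs_size_revalidation_buggy(int whence)
{
    return whence != SEEK_SET || whence != SEEK_CUR;
}

/*
 * Intended shape: true only for seek types whose result depends on the
 * file size, i.e. anything other than SEEK_SET and SEEK_CUR.
 */
bool needs_size_revalidation_fixed(int whence)
{
    return whence != SEEK_SET && whence != SEEK_CUR;
}
```

Under this assumption, the buggy variant would request a file size revalidation on every seek, which matches the behaviour change the message reports.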
     

05 Nov, 2011

2 commits

  • ...a remove a set of forward declarations.

    Signed-off-by: Jeff Layton
    Signed-off-by: Trond Myklebust

    Jeff Layton
     
  • commit d953126 changed how nfs_atomic_lookup handles an -EISDIR return
    from an OPEN call. Prior to that patch, that caused the client to fall
    back to doing a normal lookup. When that patch went in, the code began
    returning that error to userspace. The d_revalidate codepath however
    never had the corresponding change, so it was still possible to end up
    with a NULL ctx->state pointer after that.

    That patch caused a regression. When we attempt to open a directory that
    does not have a cached dentry, that open now errors out with EISDIR. If
    you attempt the same open with a cached dentry, it will succeed.

    Fix this by reverting the change in nfs_atomic_lookup and allowing
    attempts to open directories to fall back to a normal lookup.

    Also, add an NFSv4-specific f_ops->open routine that just returns
    -ENOTDIR. This should never be called if things are working properly,
    but if it ever is, then the dprintk may help in debugging.

    To facilitate this, a new file_operations field is also added to the
    nfs_rpc_ops struct.

    Cc: stable@kernel.org
    Signed-off-by: Jeff Layton
    Signed-off-by: Trond Myklebust

    Jeff Layton
     

03 Nov, 2011

2 commits


28 Oct, 2011

2 commits

  • This makes NFS follow the standard generic_file_llseek locking scheme.

    Cc: Trond.Myklebust@netapp.com
    Signed-off-by: Andi Kleen
    Signed-off-by: Christoph Hellwig

    Andi Kleen
     
  • The i_mutex lock use of generic_file_llseek hurts. Independent processes
    accessing the same file synchronize over a single lock, even though
    they have no need for synchronization at all.

    Under high utilization this can cause llseek to scale very poorly on larger
    systems.

    This patch does some rethinking of the llseek locking model:

    First the 64bit f_pos is not necessarily atomic without locks
    on 32bit systems. This can already cause races with read() today.
    This was discussed on linux-kernel in the past and deemed acceptable.
    The patch does not change that.

    Let's look at the different seek variants:

    SEEK_SET: Doesn't really need any locking.
    If there's a race one writer wins, the other loses.

    For 32bit the non atomic update races against read()
    stay the same. Without a lock they can also happen
    against write() now. The read() race was deemed
    acceptable in past discussions, and I think if it's
    ok for read it's ok for write too.

    => Don't need a lock.

    SEEK_END: This behaves like SEEK_SET plus it reads
    the maximum size too. Reading the maximum size would have the
    32bit atomic problem. But luckily we already have a way to read
    the maximum size without locking (i_size_read), so we
    can just use that instead.

    Without i_mutex there is no synchronization with write() anymore,
    however since the write() update is atomic on 64bit it just behaves
    like another racy SEEK_SET. On non atomic 32bit it's the same
    as SEEK_SET.

    => Don't need a lock, but need to use i_size_read()

    SEEK_CUR: This has a read-modify-write race window
    on the same file. One could argue that any application
    doing unsynchronized seeks on the same file is already broken.
    But for the sake of not adding a regression here I'm
    using the file->f_lock to synchronize this. Using this
    lock is much better than the inode mutex because it doesn't
    synchronize between processes.

    => So still need a lock, but can use f_lock.

    This patch implements this new scheme in generic_file_llseek.
    I dropped generic_file_llseek_unlocked and changed all callers.

    Signed-off-by: Andi Kleen
    Signed-off-by: Christoph Hellwig

    Andi Kleen
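
The per-variant locking decisions above can be modelled in userspace. The sketch below is not the kernel's generic_file_llseek: `struct model_file` and `model_llseek` are hypothetical names, a pthread mutex stands in for file->f_lock, a plain read stands in for i_size_read(), and offset overflow checks are omitted for brevity.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>   /* SEEK_SET, SEEK_CUR, SEEK_END */

/* Hypothetical model of an open file; not a kernel structure. */
struct model_file {
    pthread_mutex_t f_lock;   /* per-open-file lock, stand-in for f_lock */
    int64_t f_pos;            /* current file position */
    const int64_t *i_size;    /* size of the shared "inode" */
};

/* Returns the new position, or -EINVAL for a bad whence or offset. */
int64_t model_llseek(struct model_file *f, int64_t offset, int whence)
{
    int64_t pos;

    switch (whence) {
    case SEEK_SET:
        /* Plain store: racing callers simply overwrite each other. */
        pos = offset;
        break;
    case SEEK_END:
        /* Lockless size read, stand-in for i_size_read(). */
        pos = *f->i_size + offset;
        break;
    case SEEK_CUR:
        /* Read-modify-write of f_pos: guard it with the per-file lock. */
        pthread_mutex_lock(&f->f_lock);
        pos = f->f_pos + offset;
        if (pos >= 0)
            f->f_pos = pos;
        pthread_mutex_unlock(&f->f_lock);
        return pos < 0 ? -EINVAL : pos;
    default:
        return -EINVAL;
    }

    if (pos < 0)
        return -EINVAL;
    f->f_pos = pos;
    return pos;
}
```

Only the SEEK_CUR path takes a lock, and that lock belongs to the open file rather than the inode, so independent openers of the same file never contend with each other.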
     

21 Jul, 2011

2 commits

  • Btrfs needs to be able to control how filemap_write_and_wait_range() is called
    in fsync to make it less of a painful operation, so push taking i_mutex and
    the calling of filemap_write_and_wait() down into the ->fsync() handlers. Some
    file systems can drop taking the i_mutex altogether it seems, like ext3 and
    ocfs2. For correctness' sake I just pushed everything down in all cases to make
    sure that we keep the current behavior the same for everybody, and then each
    individual fs maintainer can make up their mind about what to do from there.
    Thanks,

    Acked-by: Jan Kara
    Signed-off-by: Josef Bacik
    Signed-off-by: Al Viro

    Josef Bacik
     
  • This converts everybody to handle SEEK_HOLE/SEEK_DATA properly. In some cases
    we just return -EINVAL, in others we do the normal generic thing, and in others
    we're simply making sure that the proper due diligence is done. For example
    in NFS/CIFS we need to make sure the file size is updated properly for the
    SEEK_HOLE and SEEK_DATA case, but since it calls the generic llseek stuff itself
    that is all we have to do. Thanks,

    Signed-off-by: Josef Bacik
    Signed-off-by: Al Viro

    Josef Bacik
     

31 Mar, 2011

1 commit


25 Mar, 2011

1 commit


24 Mar, 2011

1 commit

  • The filelayout driver sends LAYOUTCOMMIT only when COMMIT goes to
    the data server (as opposed to the MDS) and the data server WRITE
    is not NFS_FILE_SYNC.

    Only whole file layout support means that there is only one IOMODE_RW layout
    segment.

    Signed-off-by: Andy Adamson
    Signed-off-by: Alexandros Batsakis
    Signed-off-by: Boaz Harrosh
    Signed-off-by: Dean Hildebrand
    Signed-off-by: Fred Isaman
    Signed-off-by: Mingyang Guo
    Signed-off-by: Tao Guo
    Signed-off-by: Zhang Jingwang
    Tested-by: Boaz Harrosh
    Signed-off-by: Benny Halevy
    Signed-off-by: Fred Isaman
    Signed-off-by: Trond Myklebust

    Andy Adamson
     

12 Mar, 2011

1 commit

  • Move the pnfs_update_layout call location to nfs_pageio_do_add_request().
    Grab the lseg sent in the doio function to nfs_read_rpcsetup and attach
    it to each nfs_read_data so it can be sent to the layout driver.

    Signed-off-by: Andy Adamson
    Signed-off-by: Andy Adamson
    Signed-off-by: Dean Hildebrand
    Signed-off-by: Fred Isaman
    Signed-off-by: Fred Isaman
    Signed-off-by: Benny Halevy
    Signed-off-by: Boaz Harrosh
    Signed-off-by: Oleg Drokin
    Signed-off-by: Tao Guo
    Signed-off-by: Trond Myklebust

    Fred Isaman
     

08 Dec, 2010

1 commit

  • The commit 129a84de2347002f09721cda3155ccfd19fade40 (locks: fix F_GETLK
    regression (failure to find conflicts)) fixed the posix_test_lock()
    function by itself, however, its usage in NFS changed by the commit
    9d6a8c5c213e34c475e72b245a8eb709258e968c (locks: give posix_test_lock
    same interface as ->lock) remained broken - subsequent NFS-specific
    locking code received F_UNLCK instead of the user-specified lock type.
    To fix the problem, fl->fl_type needs to be saved before the
    posix_test_lock() call and restored if no local conflicts were reported.

    Reference: https://bugzilla.kernel.org/show_bug.cgi?id=23892
    Tested-by: Alexander Morozov
    Signed-off-by: Sergey Vlasov
    Cc:
    Signed-off-by: Trond Myklebust

    Sergey Vlasov
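
The save-and-restore fix described above follows a simple pattern, sketched here with stand-ins. Everything in the block is hypothetical: `model_test_lock` only imitates the relevant side effect of posix_test_lock() (overwriting the lock type with an "unlocked" marker when no local conflict exists), and the constants are illustrative values, not the real F_* definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative lock types; not the real F_RDLCK/F_WRLCK/F_UNLCK values. */
enum { LK_RDLCK = 0, LK_WRLCK = 1, LK_UNLCK = 2 };

struct model_lock {
    int type;
};

/*
 * Stand-in for posix_test_lock(): on conflict, describe the conflicting
 * lock in *fl; otherwise overwrite fl->type with LK_UNLCK.
 */
void model_test_lock(struct model_lock *fl, const struct model_lock *held)
{
    if (held != NULL && (held->type == LK_WRLCK || fl->type == LK_WRLCK))
        fl->type = held->type;
    else
        fl->type = LK_UNLCK;
}

/*
 * Returns 1 if a local conflict was found (*fl then describes it).
 * Returns 0 otherwise, with fl->type restored so that later checks
 * still see the lock type the caller asked about.
 */
int model_getlk_local(struct model_lock *fl, const struct model_lock *held)
{
    int saved_type = fl->type;

    model_test_lock(fl, held);
    if (fl->type != LK_UNLCK)
        return 1;

    fl->type = saved_type;
    return 0;
}
```

Without the restore step, any code that runs after the local test would see the "unlocked" marker instead of the user-specified lock type, which is the breakage the message describes.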
     

31 Oct, 2010

2 commits

  • The caller allocated it, the caller should free it.

    The only issue so far is that we could change the flp pointer even on an
    error return if the fl_change callback failed. But we can simply move
    the flp assignment after the fl_change invocation, as the callers don't
    care about the flp return value if the setlease call failed.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
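
The reordering described above (assign the caller's pointer only after the callback has succeeded) can be illustrated with a small model. All names here are hypothetical: `model_setlease` is not the kernel's setlease, and the two callbacks merely play the role of a succeeding and a failing fl_change.

```c
#include <assert.h>
#include <errno.h>

struct model_lease {
    int type;
};

typedef int (*change_fn)(struct model_lease *lease, int new_type);

/* Example callback that succeeds and applies the new type. */
int change_ok(struct model_lease *lease, int new_type)
{
    lease->type = new_type;
    return 0;
}

/* Example callback that fails without touching the lease. */
int change_fail(struct model_lease *lease, int new_type)
{
    (void)lease;
    (void)new_type;
    return -EAGAIN;
}

/*
 * Invoke the change callback first and publish the result through *flp
 * only on success, so the caller's pointer is untouched on failure and
 * the caller can still free what it allocated.
 */
int model_setlease(struct model_lease *existing, int new_type,
                   change_fn change, struct model_lease **flp)
{
    int err = change(existing, new_type);

    if (err)
        return err;

    *flp = existing;
    return 0;
}
```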
     
  • We modified setlease to require the caller to allocate the new lease in
    the case of creating a new lease, but forgot to fix up the filesystem
    methods.

    Cc: Steven Whitehouse
    Cc: Steve French
    Cc: Trond Myklebust
    Signed-off-by: J. Bruce Fields
    Acked-by: Arnd Bergmann
    Signed-off-by: Linus Torvalds

    J. Bruce Fields
     

27 Oct, 2010

1 commit

  • * 'nfs-for-2.6.37' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6:
    net/sunrpc: Use static const char arrays
    nfs4: fix channel attribute sanity-checks
    NFSv4.1: Use more sensible names for 'initialize_mountpoint'
    NFSv4.1: pnfs: filelayout: add driver's LAYOUTGET and GETDEVICEINFO infrastructure
    NFSv4.1: pnfs: add LAYOUTGET and GETDEVICEINFO infrastructure
    NFS: client needs to maintain list of inodes with active layouts
    NFS: create and destroy inode's layout cache
    NFSv4.1: pnfs: filelayout: introduce minimal file layout driver
    NFSv4.1: pnfs: full mount/umount infrastructure
    NFS: set layout driver
    NFS: ask for layouttypes during v4 fsinfo call
    NFS: change stateid to be a union
    NFSv4.1: pnfsd, pnfs: protocol level pnfs constants
    SUNRPC: define xdr_decode_opaque_fixed
    NFSD: remove duplicate NFS4_STATEID_SIZE

    Linus Torvalds
     

26 Oct, 2010

1 commit

  • * 'nfs-for-2.6.37' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6: (67 commits)
    SUNRPC: Cleanup duplicate assignment in rpcauth_refreshcred
    nfs: fix unchecked value
    Ask for time_delta during fsinfo probe
    Revalidate caches on lock
    SUNRPC: After calling xprt_release(), we must restart from call_reserve
    NFSv4: Fix up the 'dircount' hint in encode_readdir
    NFSv4: Clean up nfs4_decode_dirent
    NFSv4: nfs4_decode_dirent must clear entry->fattr->valid
    NFSv4: Fix a regression in decode_getfattr
    NFSv4: Fix up decode_attr_filehandle() to handle the case of empty fh pointer
    NFS: Ensure we check all allocation return values in new readdir code
    NFS: Readdir plus in v4
    NFS: introduce generic decode_getattr function
    NFS: check xdr_decode for errors
    NFS: nfs_readdir_filler catch all errors
    NFS: readdir with vmapped pages
    NFS: remove page size checking code
    NFS: decode_dirent should use an xdr_stream
    SUNRPC: Add a helper function xdr_inline_peek
    NFS: remove readdir plus limit
    ...

    Linus Torvalds
     

25 Oct, 2010

2 commits

  • At the start of the io paths, try to grab the relevant layout
    information. This will initiate the inode's layout cache, but
    stubs ensure the cache stays empty.

    Signed-off-by: Benny Halevy
    Signed-off-by: Dean Hildebrand
    Signed-off-by: Marc Eshel
    Signed-off-by: Tao Guo
    Signed-off-by: Ricardo Labiaga
    Signed-off-by: Boaz Harrosh
    Signed-off-by: Andy Adamson
    Signed-off-by: Fred Isaman
    Signed-off-by: Trond Myklebust

    Benny Halevy
     
  • Instead of blindly zapping the caches, attempt to revalidate them if
    the server has indicated that it uses high resolution timestamps.

    NFSv4 should be able to always revalidate the cache since the
    protocol requires the update of the change attribute on modification of
    the data. In reality, there are servers (the Linux NFS server
    for example) that do not obey this requirement and use ctime as the
    basis for change attribute. Long term, the server needs to be fixed.
    At this time, and to be on the safe side, continue zapping caches if
    the server indicates that it does not have a high resolution timestamp.

    Signed-off-by: Ricardo Labiaga
    Signed-off-by: Trond Myklebust

    Ricardo Labiaga
     

20 Oct, 2010

1 commit


23 Sep, 2010

1 commit

  • NFS clients since 2.6.12 support flock locks by emulating fcntl byte-range
    locks. Due to this, some Windows applications which seem to use both flock
    (share mode lock mapped as flock by Samba) and fcntl locks sequentially on
    the same file, can't lock as they falsely assume the file is already locked.
    The problem was reported on a setup with Windows clients accessing Excel files
    on a Samba exported share which is originally an NFS mount from a NetApp filer.

    Older NFS clients (< 2.6.12) did not see this problem as flock locks were
    considered local. To support legacy flock behavior, this patch adds a mount
    option "-olocal_lock=" which can take the following values:

    'none' - Neither flock locks nor POSIX locks are local
    'flock' - flock locks are local
    'posix' - fcntl/POSIX locks are local
    'all' - Both flock locks and POSIX locks are local

    Testing:

    - This patch was tested by using -olocal_lock option with different values
    and the NLM calls were noted from the network packet captured.

    'none' - NLM calls were seen during both flock() and fcntl(), flock lock
    was granted, fcntl was denied
    'flock' - no NLM calls for flock(), NLM call was seen for fcntl(),
    granted
    'posix' - NLM call was seen for flock() - granted, no NLM call for fcntl()
    'all' - no NLM calls were seen during both flock() and fcntl()

    - No bugs were seen during NFSv4 locking/unlocking in general and NFSv4
    reboot recovery.

    Cc: Neil Brown
    Signed-off-by: Suresh Jayaraman
    Signed-off-by: Trond Myklebust

    Suresh Jayaraman
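
The four local_lock values listed above map naturally onto two independent flags. The parser below is a minimal sketch of that mapping, not the kernel's mount option code: `parse_local_lock` and the `LOCAL_*` constants are hypothetical names chosen for the example.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical flag bits: which lock families are handled locally. */
enum {
    LOCAL_FLOCK = 1 << 0,   /* flock locks are local */
    LOCAL_POSIX = 1 << 1    /* fcntl/POSIX locks are local */
};

/*
 * Translate a local_lock= value into flag bits.
 * Returns 0 on success, or -1 for an unknown value (in which case
 * *flags is left unchanged).
 */
int parse_local_lock(const char *value, unsigned *flags)
{
    if (strcmp(value, "none") == 0)
        *flags = 0;
    else if (strcmp(value, "flock") == 0)
        *flags = LOCAL_FLOCK;
    else if (strcmp(value, "posix") == 0)
        *flags = LOCAL_POSIX;
    else if (strcmp(value, "all") == 0)
        *flags = LOCAL_FLOCK | LOCAL_POSIX;
    else
        return -1;
    return 0;
}
```

In this model, a lock request goes to the server only when the bit for its family is clear, which lines up with the NLM traffic noted in the testing section for each value.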
     

13 Sep, 2010

1 commit

  • The do_vfs_lock function in fs/nfs/file.c is only called if NLM is
    not being used, via the -onolock mount option. Therefore it cannot
    really be "out of sync with lock manager" when the local locking
    function called returns an error, as there will be no corresponding
    call to the NLM. For details, simply check the if/else in do_setlk
    and do_unlk in fs/nfs/file.c.

    Signed-Off-By: Fabio Olive Leite
    Reviewed-by: Jeff Layton
    Signed-off-by: Trond Myklebust

    Fabio Olive Leite
     

12 Aug, 2010

1 commit


08 Aug, 2010

1 commit

  • * 'nfs-for-2.6.36' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6: (42 commits)
    NFS: NFSv4.1 is no longer a "developer only" feature
    NFS: NFS_V4 is no longer an EXPERIMENTAL feature
    NFS: Fix /proc/mount for legacy binary interface
    NFS: Fix the locking in nfs4_callback_getattr
    SUNRPC: Defer deleting the security context until gss_do_free_ctx()
    SUNRPC: prevent task_cleanup running on freed xprt
    SUNRPC: Reduce asynchronous RPC task stack usage
    SUNRPC: Move the bound cred to struct rpc_rqst
    SUNRPC: Clean up of rpc_bindcred()
    SUNRPC: Move remaining RPC client related task initialisation into clnt.c
    SUNRPC: Ensure that rpc_exit() always wakes up a sleeping task
    SUNRPC: Make the credential cache hashtable size configurable
    SUNRPC: Store the hashtable size in struct rpc_cred_cache
    NFS: Ensure the AUTH_UNIX credcache is allocated dynamically
    NFS: Fix the NFS users of rpc_restart_call()
    SUNRPC: The function rpc_restart_call() should return success/failure
    NFSv4: Get rid of the bogus RPC_ASSASSINATED(task) checks
    NFSv4: Clean up the process of renewing the NFSv4 lease
    NFSv4.1: Handle NFS4ERR_DELAY on SEQUENCE correctly
    NFS: nfs_rename() should not have to flush out writebacks
    ...

    Linus Torvalds
     

04 Aug, 2010

1 commit

  • Christoph points out that the VFS will always flush out data before calling
    nfs_fsync(), so we can dispense with a full call to nfs_wb_all(), and
    replace that with a simpler call to nfs_commit_inode().

    Signed-off-by: Trond Myklebust

    Trond Myklebust
     

31 Jul, 2010

1 commit


28 May, 2010

1 commit


15 May, 2010

1 commit


30 Mar, 2010

1 commit

  • …it slab.h inclusion from percpu.h

    percpu.h is included by sched.h and module.h and thus ends up being
    included when building most .c files. percpu.h includes slab.h which
    in turn includes gfp.h making everything defined by the two files
    universally available and complicating inclusion dependencies.

    percpu.h -> slab.h dependency is about to be removed. Prepare for
    this change by updating users of gfp and slab facilities to include those
    headers directly instead of assuming availability. As this conversion
    needs to touch a large number of source files, the following script is
    used as the basis of conversion.

    http://userweb.kernel.org/~tj/misc/slabh-sweep.py

    The script does the following.

    * Scan files for gfp and slab usages and update includes such that
    only the necessary includes are there. ie. if only gfp is used,
    gfp.h, if slab is used, slab.h.

    * When the script inserts a new include, it looks at the include
    blocks and tries to put the new include such that its order conforms
    to its surrounding. It's put in the include block which contains
    core kernel includes, in the same order that the rest are ordered -
    alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
    doesn't seem to be any matching order.

    * If the script can't find a place to put a new include (mostly
    because the file doesn't have fitting include block), it prints out
    an error message indicating which .h file needs to be added to the
    file.

    The conversion was done in the following steps.

    1. The initial automatic conversion of all .c files updated slightly
    over 4000 files, deleting around 700 includes and adding ~480 gfp.h
    and ~3000 slab.h inclusions. The script emitted errors for ~400
    files.

    2. Each error was manually checked. Some didn't need the inclusion,
    some needed manual addition while adding it to implementation .h or
    embedding .c file was more appropriate for others. This step added
    inclusions to around 150 files.

    3. The script was run again and the output was compared to the edits
    from #2 to make sure no file was left behind.

    4. Several build tests were done and a couple of problems were fixed.
    e.g. lib/decompress_*.c used malloc/free() wrappers around slab
    APIs requiring slab.h to be added manually.

    5. The script was run on all .h files but without automatically
    editing them as sprinkling gfp.h and slab.h inclusions around .h
    files could easily lead to inclusion dependency hell. Most gfp.h
    inclusion directives were ignored as stuff from gfp.h was usually
    widely available and often used in preprocessor macros. Each
    slab.h inclusion directive was examined and added manually as
    necessary.

    6. percpu.h was updated not to include slab.h.

    7. Build tests were done on the following configurations and failures
    were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
    distributed build env didn't work with gcov compiles) and a few
    more options had to be turned off depending on archs to make things
    build (like ipr on powerpc/64 which failed due to missing writeq).

    * x86 and x86_64 UP and SMP allmodconfig and a custom test config.
    * powerpc and powerpc64 SMP allmodconfig
    * sparc and sparc64 SMP allmodconfig
    * ia64 SMP allmodconfig
    * s390 SMP allmodconfig
    * alpha SMP allmodconfig
    * um on x86_64 SMP allmodconfig

    8. percpu.h modifications were reverted so that it could be applied as
    a separate patch and serve as bisection point.

    Given the fact that I had only a couple of failures from tests on step
    6, I'm fairly confident about the coverage of this conversion patch.
    If there is a breakage, it's likely to be something in one of the arch
    headers which should be easily discoverable on most builds of
    the specific arch.

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>

    Tejun Heo
     

20 Mar, 2010

1 commit


10 Feb, 2010

4 commits

  • The bytes counted by the performance counters for NFS writes should
    reflect write and sync errors. If the write(2) system call reports
    an error, the bytes should not be counted. And, if the write is
    short, the actual number of bytes that was written should be counted,
    not the number of bytes that was requested.

    Signed-off-by: Chuck Lever
    Signed-off-by: Trond Myklebust

    Chuck Lever
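
The accounting rule in this message can be written down as a tiny helper. The sketch assumes a simplified model, not the actual NFS statistics code: `struct io_stats` and `account_write` are hypothetical, `written` is the write result (byte count or negative errno), and `sync_err` is the result of any follow-up sync (0 or negative errno).

```c
#include <assert.h>
#include <errno.h>
#include <sys/types.h>   /* ssize_t */

/* Hypothetical per-mount counter; not the kernel's statistics struct. */
struct io_stats {
    unsigned long long bytes_written;
};

/*
 * Fold a sync failure into the write result, then count bytes only when
 * the final result is a positive byte count. Returns the value the
 * caller would report to the application.
 */
ssize_t account_write(struct io_stats *stats, ssize_t written, int sync_err)
{
    ssize_t result = written;

    if (result >= 0 && sync_err < 0)
        result = sync_err;

    if (result > 0)
        stats->bytes_written += (unsigned long long)result;

    return result;
}
```

A short write therefore adds only the bytes actually written, and a write or sync error adds nothing.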
     
  • Bytes read via the splice API should be accounted for in the NFS
    performance statistics.

    Signed-off-by: Chuck Lever
    Signed-off-by: Trond Myklebust

    Chuck Lever
     
  • Currently, the NFS I/O counters count the number of bytes requested
    by applications, rather than the number of bytes actually read by the
    system calls.

    The number of bytes requested for reads is actually not that useful,
    because the value is usually a buffer size for reads. That is, that
    requested number is usually a maximum, and frequently doesn't reflect
    the actual number of bytes read.

    Signed-off-by: Chuck Lever
    Signed-off-by: Trond Myklebust

    Chuck Lever
     
  • Nit: The VFSOPEN and VFSFLUSH counters are function call counters.
    Count every call to these routines.

    Signed-off-by: Chuck Lever
    Signed-off-by: Trond Myklebust

    Chuck Lever
     

27 Jan, 2010

1 commit


10 Dec, 2009

1 commit

  • While Linux provided an O_SYNC flag basically since day 1, it took until
    Linux 2.4.0-test12pre2 to actually get it implemented for filesystems.
    Since that day we have had generic_osync_around with only minor changes
    and the great "For now, when the user asks for O_SYNC, we'll actually give
    O_DSYNC" comment. This patch intends to actually give us real O_SYNC
    semantics in addition to the O_DSYNC semantics. After Jan's O_SYNC
    patches, which are required before this patch, it's actually surprisingly
    simple: we just need to figure out when to set the datasync flag to
    vfs_fsync_range and when not.

    This patch renames the existing O_SYNC flag to O_DSYNC while keeping its
    numerical value to keep binary compatibility, and adds a new real O_SYNC
    flag. To guarantee backwards compatibility it is defined as expanding to
    both the O_DSYNC and the new additional binary flag (__O_SYNC) to make
    sure we are backwards-compatible when compiled against the new headers.

    This also means that all places that don't care about the differences can
    just check O_DSYNC and get the right behaviour for O_SYNC, too - only
    places that actually care need to check __O_SYNC in addition. Drivers and
    network filesystems have been updated in a fail safe way to always do the
    full sync magic if O_DSYNC is set. The few places setting O_SYNC for
    lower layers are kept that way for now to stay failsafe.

    We enforce that O_DSYNC is set when __O_SYNC is set early in the open path
    to make sure we always get these sane options.

    Note that parisc really screwed up their headers as they already define a
    O_DSYNC that has always been a no-op. We try to repair it by using it for
    the new O_DSYNC and redefining O_SYNC to send both the traditional
    O_SYNC numerical value _and_ the O_DSYNC one.

    Cc: Richard Henderson
    Cc: Ivan Kokshaysky
    Cc: Grant Grundler
    Cc: "David S. Miller"
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Thomas Gleixner
    Cc: Al Viro
    Cc: Andreas Dilger
    Acked-by: Trond Myklebust
    Acked-by: Kyle McMartin
    Acked-by: Ulrich Drepper
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Jan Kara

    Christoph Hellwig
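
The flag layout described above can be modelled with plain constants. The sketch uses its own `MODEL_*` names and illustrative bit values instead of the real header definitions (which differ between architectures), and `datasync_arg` is a hypothetical helper showing one way the datasync argument could be derived from the open flags.

```c
#include <assert.h>

/* Illustrative bit values; real values are architecture specific. */
enum {
    MODEL_O_DSYNC    = 00010000,   /* old O_SYNC value, now data-only sync */
    MODEL_O_SYNC_BIT = 04000000,   /* new extra bit, plays the role of __O_SYNC */
    MODEL_O_SYNC     = MODEL_O_SYNC_BIT | MODEL_O_DSYNC
};

/* True for both O_DSYNC and O_SYNC opens: code that does not care
 * about the difference only needs this check. */
int wants_sync_writes(int flags)
{
    return (flags & MODEL_O_DSYNC) != 0;
}

/* Value for a "datasync" argument: 1 means data-only sync is enough,
 * 0 means full O_SYNC semantics were requested. */
int datasync_arg(int flags)
{
    return (flags & MODEL_O_SYNC_BIT) ? 0 : 1;
}
```

Because the full-sync flag contains the data-sync bit, a binary built against the old headers (passing only the old numeric value) still gets the O_DSYNC behaviour it effectively had before.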
     

28 Sep, 2009

1 commit


16 Sep, 2009

1 commit

  • Enable hardware memory error handling for NFS

    Truncation of data pages at runtime should be safe in NFS,
    even when it doesn't support migration so far.

    Trond tells me migration is also queued up for 2.6.32.

    Acked-by: Trond.Myklebust@netapp.com
    Signed-off-by: Andi Kleen

    Andi Kleen