30 Nov, 2017

1 commit

  • commit fcfa447062b2061e11f68b846d61cbfe60d0d604 upstream.

    Commit e12937279c8b "NFS: Move the flock open mode check into nfs_flock()"
    changed NFSv3 behavior for flock() such that the open mode must match the
    lock type, however that requirement shouldn't be enforced for flock().

    Signed-off-by: Benjamin Coddington
    Signed-off-by: Anna Schumaker
    Signed-off-by: Greg Kroah-Hartman

    Benjamin Coddington
     

12 Sep, 2017

1 commit

  • 1/ remove 'start' and 'end' args from nfs_file_fsync_commit().
    They aren't used.

    2/ Make nfs_context_set_write_error() a "static inline" in internal.h
    so we can...

    3/ Use nfs_context_set_write_error() instead of mapping_set_error()
    if nfs_pageio_add_request() fails before sending any request.
    NFS generally keeps errors in the open_context, not the mapping,
    so this is more consistent.

    4/ If filemap_write_and_wait_range() reports any error, still
    check ctx->error. The value in ctx->error is likely to be
    more useful. As part of this, NFS_CONTEXT_ERROR_WRITE is
    cleared slightly earlier, before nfs_file_fsync_commit() is called,
    rather than at the start of that function.

    Signed-off-by: NeilBrown
    Signed-off-by: Trond Myklebust

    NeilBrown
     

07 Sep, 2017

2 commits

  • Since commit 18290650b1c8 ("NFS: Move buffered I/O locking into
    nfs_file_write()") nfs_file_write() has not flushed the correct byte
    range during synchronous writes. generic_write_sync() expects that
    iocb->ki_pos points to the right edge of the range rather than the
    left edge.

    To replicate the problem, open a file with O_DSYNC, have the client
    write at increasing offsets, and then print the successful offsets.
    Block port 2049 partway through that sequence, and observe that the
    client application indicates successful writes in advance of what the
    server received.
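
    The right-edge convention can be illustrated with a small userspace
    model. This is only a sketch: struct sync_range and range_to_sync()
    are hypothetical names, not kernel code. It assumes the position
    passed in is the one AFTER the write, so the range to flush is
    [pos - count, pos - 1].

```c
#include <assert.h>

/* Inclusive byte range handed to the flush, as [start, end]. */
struct sync_range {
    long long start;
    long long end;
};

/*
 * Model of the convention described above: `pos` is the file position
 * AFTER the write (the right edge) and `count` is the number of bytes
 * just written, so the range to flush is [pos - count, pos - 1].
 */
static struct sync_range range_to_sync(long long pos, long long count)
{
    assert(count > 0 && pos >= count);
    struct sync_range r = { pos - count, pos - 1 };
    return r;
}
```

    For a 4096-byte write at offset 4096, passing the right edge (8192)
    yields [4096, 8191]. Passing the left edge (4096) by mistake yields
    [0, 4095], which is the range of the previous write rather than the
    one just issued.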

    Fixes: 18290650b1c8 ("NFS: Move buffered I/O locking into nfs_file_write()")
    Signed-off-by: Jacob Strauss
    Signed-off-by: Tarang Gupta
    Tested-by: Tarang Gupta
    Cc: stable@vger.kernel.org # v4.8+
    Signed-off-by: Trond Myklebust

    tarangg@amazon.com
     
  • When a byte range lock (or flock) is taken out on an NFS file, the
    validity of the cached data is checked and the inode is marked
    NFS_INO_INVALID_DATA. However the cached data isn't flushed from
    the page cache.

    This is sufficient for future read() requests or mmap() requests as
    they call nfs_revalidate_mapping() which performs the flush if
    necessary.

    However an existing mapping is not affected. Accessing data through
    that mapping will continue to return old data even though the inode is
    marked NFS_INO_INVALID_DATA.

    This can easily be confirmed using the 'nfs' tool in
    git://github.com/okirch/twopence-nfs.git
    and running

    nfs coherence FILENAME
    on one client, and
    nfs coherence -r FILENAME
    on another client.

    It appears that prior to Linux 2.6.0 this worked correctly.

    However commit:

    http://git.kernel.org/cgit/linux/kernel/git/history/history.git/commit/?id=ca9268fe3ddd075714005adecd4afbd7f9ab87d0

    removed the call to inode_invalidate_pages() from nfs_zap_caches(). I
    haven't tested this code, but inspection suggests that prior to this
    commit, file locking would invalidate all inode pages.

    This patch adds a call to nfs_revalidate_mapping() after a
    successful SETLK so that invalid data is flushed. With this patch the
    above test passes. To minimize impact (and possibly avoid a GETATTR
    call) this only happens if the mapping might be mapped into
    userspace.
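
    The access pattern in question can be sketched from userspace as
    follows. This is a minimal illustration using standard POSIX calls;
    read_mapped_byte_under_lock() is a hypothetical helper name. On a
    local filesystem the read is trivially coherent; the concern above is
    an NFS client, where the byte read through a mapping created BEFORE
    the lock could be stale.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Map the first byte of `path`, then take a blocking whole-file read
 * lock, read through the already existing mapping, and unlock.
 * Returns the byte (0..255), or -1 on error.
 */
static int read_mapped_byte_under_lock(const char *path)
{
    int fd = open(path, O_RDWR);
    if (fd < 0)
        return -1;

    /* The mapping is established before the lock is requested. */
    unsigned char *map = mmap(NULL, 1, PROT_READ, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED) {
        close(fd);
        return -1;
    }

    struct flock fl = { 0 };
    fl.l_type = F_RDLCK;
    fl.l_whence = SEEK_SET;
    fl.l_start = 0;
    fl.l_len = 0;               /* 0 means "to end of file" */

    int result = -1;
    if (fcntl(fd, F_SETLKW, &fl) == 0) {
        result = map[0];        /* access via the pre-existing mapping */
        fl.l_type = F_UNLCK;
        fcntl(fd, F_SETLK, &fl);
    }

    munmap(map, 1);
    close(fd);
    return result;
}
```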

    Cc: Olaf Kirch
    Signed-off-by: NeilBrown
    Signed-off-by: Trond Myklebust

    NeilBrown
     

27 Jul, 2017

2 commits

  • posix_fallocate() will allocate space in an NFS file by considering
    the last byte of every 4K block. If it is before EOF, it will read
    the byte and if it is zero, a zero is written out. If it is after EOF,
    the zero is unconditionally written.

    For the blocks beyond EOF, if NFS believes its cache is valid, it will
    expand these writes to write full pages, and then will merge the pages.
    This results in (typically) 1MB writes. If NFS believes its cache is
    not valid (particularly if NFS_INO_INVALID_DATA or
    NFS_INO_REVAL_PAGECACHE are set - see nfs_write_pageuptodate()), it will
    send the individual 1-byte writes. This results in (typically) 256 times
    as many RPC requests, and can be substantially slower.
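
    The factor of 256 follows from the typical sizes quoted above (one
    1-byte write per 4K block versus one merged 1MB write). A small worked
    example, where writes_needed() is a hypothetical helper:

```c
#include <assert.h>

/*
 * Number of writes needed to cover `len` bytes when each write covers
 * at most `per_write` bytes (ceiling division).
 */
static unsigned long writes_needed(unsigned long len, unsigned long per_write)
{
    assert(per_write > 0);
    return (len + per_write - 1) / per_write;
}
```

    With len = 1MB, a per-write span of 4K gives 256 requests, while a
    per-write span of 1MB gives a single request.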

    Currently nfs_revalidate_mapping() is only used when reading a file or
    mmapping a file, as these are times when the content needs to be
    up-to-date. Writes don't generally need the cache to be up-to-date, but
    writes beyond EOF can benefit, particularly in the posix_fallocate()
    case.

    So this patch calls nfs_revalidate_mapping() when writing beyond EOF -
    i.e. when there is a gap between the end of the file and the start of
    the write. If the cache is thought to be out of date (as happens after
    taking a file lock), this will cause a GETATTR, and the two flags
    mentioned above will be cleared. With this, posix_fallocate() on a
    newly locked file does not generate excessive tiny writes.

    Signed-off-by: NeilBrown
    Signed-off-by: Anna Schumaker

    NeilBrown
     
  • Prior to commit ca0daa277aca ("NFS: Cache aggressively when file is open
    for writing"), NFS would revalidate, or invalidate, the file size when
    taking a lock. Since that commit it only invalidates the file content.

    If the file size is changed on the server while waiting for the lock, the
    client will have an incorrect understanding of the file size and could
    corrupt data. This particularly happens when writing beyond the
    (supposed) end of file and can easily be demonstrated with
    posix_fallocate().

    If an application opens an empty file, waits for a write lock, and then
    calls posix_fallocate(), glibc will determine that the underlying
    filesystem doesn't support fallocate (assuming version 4.1 or earlier)
    and will write out a '0' byte at the end of each 4K page in the region
    being fallocated that is after the end of the file.
    NFS will (usually) detect that these writes are beyond EOF and will
    expand them to cover the whole page, and then will merge the pages.
    Consequently, NFS will write out large blocks of zeroes beyond where it
    thought EOF was. If EOF had moved, the pre-existing part of the file
    will be over-written. Locking should have protected against this,
    but it doesn't.
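
    The fallback behaviour described above can be sketched roughly as
    follows. This is NOT glibc's code: emulate_fallocate() is a
    hypothetical helper that only models the description (touch the last
    byte of each block; read-then-rewrite zeroes before EOF, write
    unconditionally after EOF). Note that the file size is sampled once,
    which mirrors a client trusting a cached idea of where EOF is.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Touch the last byte of every `blksize` block in [offset, offset + len)
 * and return the number of 1-byte writes issued, or -1 on error.
 */
static long emulate_fallocate(const char *path, off_t offset, off_t len,
                              off_t blksize)
{
    const char zero = 0;
    struct stat st;
    long writes = 0;

    assert(offset >= 0 && len > 0 && blksize > 0);

    int fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd < 0)
        return -1;
    /* The size is sampled once and never refreshed. */
    if (fstat(fd, &st) != 0) {
        close(fd);
        return -1;
    }

    off_t end = offset + len;
    off_t pos = offset - (offset % blksize) + blksize - 1;
    for (;;) {
        if (pos > end - 1)
            pos = end - 1;      /* final, possibly partial, block */
        char c = 0;
        int beyond_eof = pos >= st.st_size;
        if (!beyond_eof && pread(fd, &c, 1, pos) != 1)
            goto fail;
        /* Before EOF only zero bytes are rewritten; after EOF always. */
        if (beyond_eof || c == 0) {
            if (pwrite(fd, &zero, 1, pos) != 1)
                goto fail;
            writes++;
        }
        if (pos == end - 1)
            break;
        pos += blksize;
    }
    close(fd);
    return writes;
fail:
    close(fd);
    return -1;
}
```

    On an empty file, a 16384-byte region with 4096-byte blocks produces
    four 1-byte writes and extends the file to 16384 bytes. If the sampled
    size is smaller than the real size, every one of those writes is
    treated as "beyond EOF", which is the situation the patch guards
    against.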

    This patch restores the use of nfs_zap_caches() which invalidated the
    cached attributes. When posix_fallocate() asks for the file size, the
    request will go to the server and get a correct answer.

    cc: stable@vger.kernel.org (v4.8+)
    Fixes: ca0daa277aca ("NFS: Cache aggressively when file is open for writing")
    Signed-off-by: NeilBrown
    Signed-off-by: Anna Schumaker

    NeilBrown
     

27 Apr, 2017

1 commit


21 Apr, 2017

2 commits

  • NFS attempts to wait for read and write completion before unlocking in
    order to ensure that the data returned was protected by the lock. When
    this waiting is interrupted by a signal, the unlock may be skipped, and
    messages similar to the following are seen in the kernel ring buffer:

    [20.167876] Leaked locks on dev=0x0:0x2b ino=0x8dd4c3:
    [20.168286] POSIX: fl_owner=ffff880078b06940 fl_flags=0x1 fl_type=0x0 fl_pid=20183
    [20.168727] POSIX: fl_owner=ffff880078b06680 fl_flags=0x1 fl_type=0x0 fl_pid=20185

    For NFSv3, the missing unlock will cause the server to refuse conflicting
    locks indefinitely. For NFSv4, the leftover lock will be removed by the
    server after the lease timeout.

    This patch fixes this issue by skipping the usual wait in
    nfs_iocounter_wait if the FL_CLOSE flag is set when signaled. Instead, the
    wait happens in the unlock RPC task on the NFS UOC rpc_waitqueue.

    For NFSv3, use lockd's new nlmclnt_operations along with
    nfs_async_iocounter_wait to defer NLM's unlock task until the lock
    context's iocounter reaches zero.

    For NFSv4, call nfs_async_iocounter_wait() directly from unlock's
    current rpc_call_prepare.

    Signed-off-by: Benjamin Coddington
    Reviewed-by: Jeff Layton
    Signed-off-by: Trond Myklebust

    Benjamin Coddington
     
  • We only need to check lock exclusive/shared types against open mode when
    flock() is used on NFS, so move it into the flock-specific path instead of
    checking it for all locks.
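
    A userspace model of the kind of check being moved is sketched below.
    flock_mode_check() is a hypothetical name, and returning -EBADF on a
    mismatch is an assumption of this sketch rather than something stated
    in the commit message.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>

/*
 * Returns 0 if a lock of `lock_type` (F_RDLCK for shared, F_WRLCK for
 * exclusive) is compatible with the access mode in `open_flags`, or
 * -EBADF otherwise. Unlock requests are always allowed.
 */
static int flock_mode_check(short lock_type, int open_flags)
{
    int acc = open_flags & O_ACCMODE;

    switch (lock_type) {
    case F_RDLCK:
        return (acc == O_RDONLY || acc == O_RDWR) ? 0 : -EBADF;
    case F_WRLCK:
        return (acc == O_WRONLY || acc == O_RDWR) ? 0 : -EBADF;
    default:
        return 0;
    }
}
```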

    Signed-off-by: Benjamin Coddington
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Jeff Layton
    Signed-off-by: Trond Myklebust

    Benjamin Coddington
     

25 Feb, 2017

1 commit

  • ->fault(), ->page_mkwrite(), and ->pfn_mkwrite() calls do not need to
    take a vma and vmf parameter when the vma already resides in vmf.

    Remove the vma parameter to simplify things.

    [arnd@arndb.de: fix ARM build]
    Link: http://lkml.kernel.org/r/20170125223558.1451224-1-arnd@arndb.de
    Link: http://lkml.kernel.org/r/148521301778.19116.10840599906674778980.stgit@djiang5-desk3.ch.intel.com
    Signed-off-by: Dave Jiang
    Signed-off-by: Arnd Bergmann
    Reviewed-by: Ross Zwisler
    Cc: Theodore Ts'o
    Cc: Darrick J. Wong
    Cc: Matthew Wilcox
    Cc: Dave Hansen
    Cc: Christoph Hellwig
    Cc: Jan Kara
    Cc: Dan Williams
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Jiang
     

25 Dec, 2016

1 commit


22 Dec, 2016

1 commit

  • Pull more NFS client updates from Trond Myklebust:
    "Highlights include:

    - further attribute cache improvements to make revalidation more fine
    grained

    - NFSv4 locking improvements

    Bugfixes:

    - nfs4_fl_prepare_ds must be careful about reporting success in files
    layout

    - pNFS/flexfiles: Instead of marking a device inactive, remove it
    from the cache"

    * tag 'nfs-for-4.10-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
    NFSv4: Retry the DELEGRETURN if the embedded GETATTR is rejected with EACCES
    NFS: Retry the CLOSE if the embedded GETATTR is rejected with EACCES
    NFSv4: Place the GETATTR operation before the CLOSE
    NFSv4: Also ask for attributes when downgrading to a READ-only state
    NFS: Don't abuse NFS_INO_REVAL_FORCED in nfs_post_op_update_inode_locked()
    pNFS: Return RW layouts on OPEN_DOWNGRADE
    NFSv4: Add encode/decode of the layoutreturn op in OPEN_DOWNGRADE
    NFS: Don't disconnect open-owner on NFS4ERR_BAD_SEQID
    NFSv4: ensure __nfs4_find_lock_state returns consistent result.
    NFSv4.1: nfs4_fl_prepare_ds must be careful about reporting success.
    pNFS/flexfiles: delete deviceid, don't mark inactive
    NFS: Clean up nfs_attribute_timeout()
    NFS: Remove unused function nfs_revalidate_inode_rcu()
    NFS: Fix and clean up the access cache validity checking
    NFS: Only look at the change attribute cache state in nfs_weak_revalidate()
    NFS: Clean up cache validity checking
    NFS: Don't revalidate the file on close if we hold a delegation
    NFSv4: Don't discard the attributes returned by asynchronous DELEGRETURN
    NFSv4: Update the attribute cache info in update_changeattr

    Linus Torvalds
     

20 Dec, 2016

1 commit


18 Dec, 2016

1 commit

  • Pull more vfs updates from Al Viro:
    "In this pile:

    - autofs-namespace series
    - dedupe stuff
    - more struct path constification"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (40 commits)
    ocfs2: implement the VFS clone_range, copy_range, and dedupe_range features
    ocfs2: charge quota for reflinked blocks
    ocfs2: fix bad pointer cast
    ocfs2: always unlock when completing dio writes
    ocfs2: don't eat io errors during _dio_end_io_write
    ocfs2: budget for extent tree splits when adding refcount flag
    ocfs2: prohibit refcounted swapfiles
    ocfs2: add newlines to some error messages
    ocfs2: convert inode refcount test to a helper
    simple_write_end(): don't zero in short copy into uptodate
    exofs: don't mess with simple_write_{begin,end}
    9p: saner ->write_end() on failing copy into non-uptodate page
    fix gfs2_stuffed_write_end() on short copies
    fix ceph_write_end()
    nfs_write_end(): fix handling of short copies
    vfs: refactor clone/dedupe_file_range common functions
    fs: try to clone files first in vfs_copy_file_range
    vfs: misc struct path constification
    namespace.c: constify struct path passed to a bunch of primitives
    quota: constify struct path in quota_on
    ...

    Linus Torvalds
     

10 Dec, 2016

1 commit

  • What matters when deciding if we should make a page uptodate is
    not how much we _wanted_ to copy, but how much we actually have
    copied. As it is, on architectures that do not zero tail on
    short copy we can leave uninitialized data in page marked uptodate.
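
    The distinction can be shown with a deliberately simplified model.
    may_mark_uptodate() is a hypothetical helper, and it ignores details
    such as the file size; it only captures that the decision depends on
    the bytes actually copied, not on the bytes requested.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * A page that is not yet uptodate may be marked uptodate only if the
 * bytes ACTUALLY copied cover it completely. `requested` is accepted
 * but deliberately not used for the decision.
 */
static bool may_mark_uptodate(unsigned offset, unsigned requested,
                              unsigned copied, unsigned page_size)
{
    assert(offset <= page_size);
    assert(requested <= page_size - offset);
    assert(copied <= requested);
    return offset == 0 && copied == page_size;
}
```

    A full-page request that copies only 100 bytes must not mark the page
    uptodate, even though the requested length would have covered it.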

    Cc: stable@vger.kernel.org
    Signed-off-by: Al Viro

    Al Viro
     

05 Dec, 2016

1 commit


14 Oct, 2016

1 commit

  • Pull NFS client updates from Anna Schumaker:
    "Highlights include:

    Stable bugfixes:
    - sunrpc: fix write space race causing stalls
    - NFS: Fix inode corruption in nfs_prime_dcache()
    - NFSv4: Don't report revoked delegations as valid in nfs_have_delegation()
    - NFSv4: nfs4_copy_delegation_stateid() must fail if the delegation is invalid
    - NFSv4: Open state recovery must account for file permission changes
    - NFSv4.2: Fix a reference leak in nfs42_proc_layoutstats_generic

    Features:
    - Add support for tracking multiple layout types with an ordered list
    - Add support for using multiple backchannel threads on the client
    - Add support for pNFS file layout session trunking
    - Delay xprtrdma use of DMA API (for device driver removal)
    - Add support for xprtrdma remote invalidation
    - Add support for larger xprtrdma inline thresholds
    - Use a scatter/gather list for sending xprtrdma RPC calls
    - Add support for the CB_NOTIFY_LOCK callback
    - Improve hashing sunrpc auth_creds by using both uid and gid

    Bugfixes:
    - Fix xprtrdma use of DMA API
    - Validate filenames before adding to the dcache
    - Fix corruption of xdr->nwords in xdr_copy_to_scratch
    - Fix setting buffer length in xdr_set_next_buffer()
    - Don't deadlock the state manager on the SEQUENCE status flags
    - Various delegation and stateid related fixes
    - Retry operations if an interrupted slot receives EREMOTEIO
    - Make nfs boot time y2038 safe"

    * tag 'nfs-for-4.9-1' of git://git.linux-nfs.org/projects/anna/linux-nfs: (100 commits)
    NFSv4.2: Fix a reference leak in nfs42_proc_layoutstats_generic
    fs: nfs: Make nfs boot time y2038 safe
    sunrpc: replace generic auth_cred hash with auth-specific function
    sunrpc: add RPCSEC_GSS hash_cred() function
    sunrpc: add auth_unix hash_cred() function
    sunrpc: add generic_auth hash_cred() function
    sunrpc: add hash_cred() function to rpc_authops struct
    Retry operation on EREMOTEIO on an interrupted slot
    pNFS: Fix atime updates on pNFS clients
    sunrpc: queue work on system_power_efficient_wq
    NFSv4.1: Even if the stateid is OK, we may need to recover the open modes
    NFSv4: If recovery failed for a specific open stateid, then don't retry
    NFSv4: Fix retry issues with nfs41_test/free_stateid
    NFSv4: Open state recovery must account for file permission changes
    NFSv4: Mark the lock and open stateids as invalid after freeing them
    NFSv4: Don't test open_stateid unless it is set
    NFSv4: nfs4_do_handle_exception() handle revoke/expiry of a single stateid
    NFS: Always call nfs_inode_find_state_and_recover() when revoking a delegation
    NFSv4: Fix a race when updating an open_stateid
    NFSv4: Fix a race in nfs_inode_reclaim_delegation()
    ...

    Linus Torvalds
     

06 Oct, 2016

1 commit


23 Sep, 2016

1 commit


20 Sep, 2016

1 commit


04 Sep, 2016

1 commit


25 Jul, 2016

1 commit


20 Jul, 2016

1 commit

  • A generic_cred can be used to look up a unx_cred or a gss_cred, so it's
    not really safe to use the generic_cred->acred->ac_flags to store
    the NO_CRKEY_TIMEOUT flag. A lookup for a unx_cred triggered while the
    KEY_EXPIRE_SOON flag is already set will cause both NO_CRKEY_TIMEOUT and
    KEY_EXPIRE_SOON to be set in the ac_flags, leaving the user associated
    with the auth_cred to be in a state where they're perpetually doing 4K
    NFS_FILE_SYNC writes.

    This can be reproduced as follows:

    1. Mount two NFS filesystems, one with sec=krb5 and one with sec=sys.
    They do not need to be the same export, nor do they even need to be from
    the same NFS server. Also, v3 is fine.
    $ sudo mount -o v3,sec=krb5 server1:/export /mnt/krb5
    $ sudo mount -o v3,sec=sys server2:/export /mnt/sys

    2. As the normal user, before accessing the kerberized mount, kinit with
    a short lifetime (but not so short that renewing the ticket would leave
    you within the 4-minute window again by the time the original ticket
    expires), e.g.
    $ kinit -l 10m -r 60m

    3. Do some I/O to the kerberized mount and verify that the writes are
    wsize, UNSTABLE:
    $ dd if=/dev/zero of=/mnt/krb5/file bs=1M count=1

    4. Wait until you're within 4 minutes of key expiry, then do some more
    I/O to the kerberized mount to ensure that RPC_CRED_KEY_EXPIRE_SOON gets
    set. Verify that the writes are 4K, FILE_SYNC:
    $ dd if=/dev/zero of=/mnt/krb5/file bs=1M count=1

    5. Now do some I/O to the sec=sys mount. This will cause
    RPC_CRED_NO_CRKEY_TIMEOUT to be set:
    $ dd if=/dev/zero of=/mnt/sys/file bs=1M count=1

    6. Writes for that user will now be permanently 4K, FILE_SYNC,
    regardless of which mount is being written to, until you reboot
    the client. Renewing the kerberos ticket (assuming it hasn't already
    expired) will have no effect. Grabbing a new kerberos ticket at this
    point will have no effect either.

    Move the flag to the auth->au_flags field (which is currently unused)
    and rename it slightly to reflect that it's no longer associated with
    the auth_cred->ac_flags. Add the rpc_auth to the arg list of
    rpcauth_cred_key_to_expire and check the au_flags there too. Finally,
    add the inode to the arg list of nfs_ctx_key_to_expire so we can
    determine the rpc_auth to pass to rpcauth_cred_key_to_expire.
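
    The resulting decision can be modelled roughly as below. This is a
    sketch only: the struct names, flag names, and bit values are
    assumptions chosen for illustration, not the kernel's definitions. The
    point is that the "no key timeout" property lives with the auth
    flavour, so it cannot leak into state shared by another mount.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative flag bits; names and values are assumptions. */
#define AUTH_NO_CRKEY_TIMEOUT (1UL << 0)   /* kept per auth flavour */
#define CRED_KEY_EXPIRE_SOON  (1UL << 0)   /* kept per credential */

struct auth_model { unsigned long au_flags; };
struct cred_model { unsigned long cr_flags; };

/*
 * Report whether writes should be treated as "key about to expire".
 * An auth flavour without key timeouts never reports expiry, whatever
 * the credential flags say.
 */
static bool key_to_expire(const struct auth_model *auth,
                          const struct cred_model *cred)
{
    if (auth->au_flags & AUTH_NO_CRKEY_TIMEOUT)
        return false;
    return (cred->cr_flags & CRED_KEY_EXPIRE_SOON) != 0;
}
```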

    Signed-off-by: Scott Mayhew
    Signed-off-by: Trond Myklebust

    Scott Mayhew
     

06 Jul, 2016

5 commits


22 Jun, 2016

4 commits


02 May, 2016

1 commit


05 Apr, 2016

1 commit

  • PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced a *long* time
    ago with the promise that one day it would be possible to implement the
    page cache with bigger chunks than PAGE_SIZE.

    This promise never materialized, and it is unlikely it ever will.

    We have many places where PAGE_CACHE_SIZE is assumed to be equal to
    PAGE_SIZE. And it's a constant source of confusion on whether
    PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
    especially on the border between fs and mm.

    Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause too much
    breakage to be doable.

    Let's stop pretending that pages in page cache are special. They are
    not.

    The changes are pretty straight-forward:

    - <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;

    - <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;

    - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};

    - page_cache_get() -> get_page();

    - page_cache_release() -> put_page();

    This patch contains automated changes generated with coccinelle using
    script below. For some reason, coccinelle doesn't patch header files.
    I've called spatch for them manually.

    The only adjustment after coccinelle is a revert of the changes to the
    PAGE_CACHE_ALIGN definition: we are going to drop it later.

    There are a few places in the code that coccinelle didn't reach. I'll
    fix them manually in a separate patch. Comments and documentation will
    also be addressed in a separate patch.

    virtual patch

    @@
    expression E;
    @@
    - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
    + E

    @@
    expression E;
    @@
    - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
    + E

    @@
    @@
    - PAGE_CACHE_SHIFT
    + PAGE_SHIFT

    @@
    @@
    - PAGE_CACHE_SIZE
    + PAGE_SIZE

    @@
    @@
    - PAGE_CACHE_MASK
    + PAGE_MASK

    @@
    expression E;
    @@
    - PAGE_CACHE_ALIGN(E)
    + PAGE_ALIGN(E)

    @@
    expression E;
    @@
    - page_cache_get(E)
    + get_page(E)

    @@
    expression E;
    @@
    - page_cache_release(E)
    + put_page(E)
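
    Why the first two rules of the script above are safe can be seen in a
    small standalone example. The value 12 for PAGE_SHIFT is an assumption
    (4K pages), and old_style_index()/new_style_index() are hypothetical
    names; what matters is that the two shifts were defined to be equal,
    so the shift amount is always zero.

```c
#include <assert.h>

#define PAGE_SHIFT       12           /* assumption: 4K pages */
#define PAGE_CACHE_SHIFT PAGE_SHIFT   /* the two were always equal */

/* Conversion as it was written before the cleanup. */
static unsigned long old_style_index(unsigned long index)
{
    return index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
}

/* Conversion after the cleanup: the shift by zero is simply dropped. */
static unsigned long new_style_index(unsigned long index)
{
    return index;
}
```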

    Signed-off-by: Kirill A. Shutemov
    Acked-by: Michal Hocko
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

17 Mar, 2016

2 commits


23 Jan, 2016

1 commit

  • parallel to mutex_{lock,unlock,trylock,is_locked,lock_nested},
    inode_foo(inode) being mutex_foo(&inode->i_mutex).

    Please, use those for access to ->i_mutex; over the coming cycle
    ->i_mutex will become rwsem, with ->lookup() done with it held
    only shared.

    Signed-off-by: Al Viro

    Al Viro
     

08 Jan, 2016

1 commit

  • The use of wait_on_atomic_t() for waiting on I/O to complete before
    unlocking allows us to get rid of the NFS_IO_INPROGRESS flag, and thus the
    nfs_iocounter's flags member, and finally the nfs_iocounter altogether.
    The count of I/O is moved to the lock context, and the counter
    increment/decrement functions become simple enough to open-code.
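
    A userspace analogue of such an open-coded counter, using C11 atomics,
    might look like the sketch below. struct lock_context, io_start() and
    io_end() are hypothetical names for illustration; the kernel uses its
    own atomic and wait primitives.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a lock context that counts in-flight I/O. */
struct lock_context {
    atomic_int io_count;
};

/* Called when an I/O request under this lock context is issued. */
static void io_start(struct lock_context *ctx)
{
    atomic_fetch_add(&ctx->io_count, 1);
}

/*
 * Called when an I/O request completes. Returns true when this call
 * dropped the count to zero, i.e. when a waiter blocked ahead of an
 * unlock could be woken.
 */
static bool io_end(struct lock_context *ctx)
{
    int before = atomic_fetch_sub(&ctx->io_count, 1);
    assert(before > 0);
    return before == 1;
}
```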

    Signed-off-by: Benjamin Coddington
    [Trond: Fix up conflict with existing function nfs_wait_atomic_killable()]
    Signed-off-by: Trond Myklebust

    Benjamin Coddington
     

05 Jan, 2016

1 commit

  • * pnfs_generic:
    NFSv4.1/pNFS: Cleanup constify struct pnfs_layout_range arguments
    NFSv4.1/pnfs: Cleanup copying of pnfs_layout_range structures
    NFSv4.1/pNFS: Cleanup pnfs_mark_matching_lsegs_invalid()
    NFSv4.1/pNFS: Fix a race in initiate_file_draining()
    NFSv4.1/pNFS: pnfs_error_mark_layout_for_return() must always return layout
    NFSv4.1/pNFS: pnfs_mark_matching_lsegs_return() should set the iomode
    NFSv4.1/pNFS: Use nfs4_stateid_copy for copying stateids
    NFSv4.1/pNFS: Don't pass stateids by value to pnfs_send_layoutreturn()
    NFS: Relax requirements in nfs_flush_incompatible
    NFSv4.1/pNFS: Don't queue up a new commit if the layout segment is invalid
    NFS: Allow multiple commit requests in flight per file
    NFS/pNFS: Fix up pNFS write reschedule layering violations and bugs
    NFSv4: List stateid information in the callback tracepoints
    NFSv4.1/pNFS: Don't return NFS4ERR_DELAY unnecessarily in CB_LAYOUTRECALL
    NFSv4.1/pNFS: Ensure we enforce RFC5661 Section 12.5.5.2.1
    pNFS: If we have to delay the layout callback, mark the layout for return
    NFSv4.1/pNFS: Add a helper to mark the layout as returned
    pNFS: Ensure nfs4_layoutget_prepare returns the correct error

    Trond Myklebust
     

01 Jan, 2016

1 commit