27 Apr, 2015

1 commit

  • Pull NFS client updates from Trond Myklebust:
    "Another set of mainly bugfixes and a couple of cleanups. No new
    functionality in this round.

    Highlights include:

    Stable patches:
    - Fix a regression in /proc/self/mountstats
    - Fix the pNFS flexfiles O_DIRECT support
    - Fix high load average due to callback thread sleeping

    Bugfixes:
    - Various patches to fix the pNFS layoutcommit support
    - Do not cache pNFS deviceids unless server notifications are enabled
    - Fix a SUNRPC transport reconnection regression
    - make debugfs file creation failure non-fatal in SUNRPC
    - Another fix for circular directory warnings on NFSv4 "junctioned"
    mountpoints
    - Fix locking around NFSv4.2 fallocate() support
    - Truncating NFSv4 file opens should also sync O_DIRECT writes
    - Prevent infinite loop in rpcrdma_ep_create()

    Features:
    - Various improvements to the RDMA transport code's handling of
    memory registration
    - Various code cleanups"

    * tag 'nfs-for-4.1-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (55 commits)
    fs/nfs: fix new compiler warning about boolean in switch
    nfs: Remove unneeded casts in nfs
    NFS: Don't attempt to decode missing directory entries
    Revert "nfs: replace nfs_add_stats with nfs_inc_stats when add one"
    NFS: Rename idmap.c to nfs4idmap.c
    NFS: Move nfs_idmap.h into fs/nfs/
    NFS: Remove CONFIG_NFS_V4 checks from nfs_idmap.h
    NFS: Add a stub for GETDEVICELIST
    nfs: remove WARN_ON_ONCE from nfs_direct_good_bytes
    nfs: fix DIO good bytes calculation
    nfs: Fetch MOUNTED_ON_FILEID when updating an inode
    sunrpc: make debugfs file creation failure non-fatal
    nfs: fix high load average due to callback thread sleeping
    NFS: Reduce time spent holding the i_mutex during fallocate()
    NFS: Don't zap caches on fallocate()
    xprtrdma: Make rpcrdma_{un}map_one() into inline functions
    xprtrdma: Handle non-SEND completions via a callout
    xprtrdma: Add "open" memreg op
    xprtrdma: Add "destroy MRs" memreg op
    xprtrdma: Add "reset MRs" memreg op
    ...

    Linus Torvalds
     

16 Apr, 2015

1 commit


12 Apr, 2015

1 commit


28 Mar, 2015

1 commit


04 Mar, 2015

1 commit

  • When invalidating the page cache for a regular file, we want to first
    sync all dirty data to disk and then call invalidate_inode_pages2().
    The latter relies on nfs_launder_page() and nfs_release_page() to deal
    with dirty pages and unstable written pages, respectively.

    When commit 9590544694bec ("NFS: avoid deadlocks with loop-back mounted
    NFS filesystems.") changed the behaviour of nfs_release_page(), it
    made it possible for invalidate_inode_pages2() to fail with an EBUSY.
    Unfortunately, that error is then propagated back to read().

    Let's therefore work around the problem for now by protecting the call
    to sync the data and invalidate_inode_pages2() so that they are atomic
    w.r.t. the addition of new writes.
    Later on, we can revisit whether or not we still need nfs_launder_page()
    and nfs_release_page().

    Signed-off-by: Trond Myklebust

    Trond Myklebust
     

02 Mar, 2015

3 commits


14 Feb, 2015

1 commit


25 Nov, 2014

1 commit

  • Recent work in the pgio layer made it possible for there to be more than one
    request per page. This caused a subtle change in commit behavior, because
    write.c:nfs_commit_unstable_pages compares the number of *pages* waiting for
    writeback against the number of requests on a commit list to choose when to
    send a COMMIT in a non-blocking flush.

    This is probably hard to hit in normal operation - you have to be using
    rsize/wsize < PAGE_SIZE, or pnfs with lots of boundaries that are not page
    aligned to have a noticeable change in behavior.

    Signed-off-by: Weston Andros Adamson
    Signed-off-by: Trond Myklebust

    Weston Andros Adamson
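
    The unit mismatch described above can be illustrated with a small
    sketch. The struct, the function names and the "more than half of the
    outstanding work" threshold are assumptions made for illustration only;
    they are not taken from write.c.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-inode counters; names are illustrative, not the kernel's. */
struct demo_inode_counters {
    unsigned long pages_in_writeback;    /* pages with outstanding writes */
    unsigned long requests_in_writeback; /* requests with outstanding writes */
    unsigned long requests_to_commit;    /* requests sitting on the commit list */
};

/* Assumed heuristic: send COMMIT only once more than half of the
 * outstanding work is already waiting on the commit list. */
static bool demo_should_commit(unsigned long ncommit, unsigned long outstanding)
{
    return ncommit > outstanding / 2;
}

/* Compares requests against *pages*: the two sides use different units. */
static bool demo_commit_by_pages(const struct demo_inode_counters *c)
{
    assert(c != NULL);
    return demo_should_commit(c->requests_to_commit, c->pages_in_writeback);
}

/* Compares requests against requests: both sides use the same unit. */
static bool demo_commit_by_requests(const struct demo_inode_counters *c)
{
    assert(c != NULL);
    return demo_should_commit(c->requests_to_commit, c->requests_in_writeback);
}
```

    With 100 pages split into 4 requests each (400 requests outstanding)
    and 120 requests on the commit list, the page-based comparison decides
    to commit while the request-based comparison still defers.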
     

19 Oct, 2014

1 commit

  • Pull core block layer changes from Jens Axboe:
    "This is the core block IO pull request for 3.18. Apart from the new
    and improved flush machinery for blk-mq, this is all mostly bug fixes
    and cleanups.

    - blk-mq timeout updates and fixes from Christoph.

    - Removal of REQ_END, also from Christoph. We pass it through the
    ->queue_rq() hook for blk-mq instead, freeing up one of the request
    bits. The space was overly tight on 32-bit, so Martin also killed
    REQ_KERNEL since it's no longer used.

    - blk integrity updates and fixes from Martin and Gu Zheng.

    - Update to the flush machinery for blk-mq from Ming Lei. Now we
    have a per hardware context flush request, which both cleans up the
    code and should scale better for flush intensive workloads on blk-mq.

    - Improve the error printing, from Rob Elliott.

    - Backing device improvements and cleanups from Tejun.

    - Fixup of a misplaced rq_complete() tracepoint from Hannes.

    - Make blk_get_request() return error pointers, fixing up issues
    where we NULL deref when a device goes bad or missing. From Joe
    Lawrence.

    - Prep work for drastically reducing the memory consumption of dm
    devices from Junichi Nomura. This allows creating clone bio sets
    without preallocating a lot of memory.

    - Fix a blk-mq hang on certain combinations of queue depths and
    hardware queues from me.

    - Limit memory consumption for blk-mq devices for crash dump
    scenarios and drivers that use crazy high depths (certain SCSI
    shared tag setups). We now just use a single queue and limited
    depth for that"

    * 'for-3.18/core' of git://git.kernel.dk/linux-block: (58 commits)
    block: Remove REQ_KERNEL
    blk-mq: allocate cpumask on the home node
    bio-integrity: remove the needless fail handle of bip_slab creating
    block: include func name in __get_request prints
    block: make blk_update_request print prefix match ratelimited prefix
    blk-merge: don't compute bi_phys_segments from bi_vcnt for cloned bio
    block: fix alignment_offset math that assumes io_min is a power-of-2
    blk-mq: Make bt_clear_tag() easier to read
    blk-mq: fix potential hang if rolling wakeup depth is too high
    block: add bioset_create_nobvec()
    block: use bio_clone_fast() in blk_rq_prep_clone()
    block: misplaced rq_complete tracepoint
    sd: Honor block layer integrity handling flags
    block: Replace strnicmp with strncasecmp
    block: Add T10 Protection Information functions
    block: Don't merge requests if integrity flags differ
    block: Integrity checksum flag
    block: Relocate bio integrity flags
    block: Add a disk flag to block integrity profile
    block: Add prefix to block integrity profile flags
    ...

    Linus Torvalds
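
    The "return error pointers" item above refers to the kernel idiom of
    encoding a negative errno in the pointer value itself. The following is
    a userspace sketch of that idiom, not the kernel's own code: every
    demo_ name is hypothetical, and demo_get_request() only stands in for
    an allocator that can fail when a device has gone away.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define DEMO_MAX_ERRNO 4095 /* errno values are assumed to fit in 1..4095 */

struct demo_request { int tag; };

/* Encode a negative errno as a pointer in the top 4095 addresses. */
static void *demo_err_ptr(long error)
{
    assert(error < 0 && error >= -DEMO_MAX_ERRNO);
    return (void *)(intptr_t)error;
}

/* Recover the negative errno from an error pointer. */
static long demo_ptr_err(const void *ptr)
{
    return (long)(intptr_t)ptr;
}

/* True if the pointer value lies in the reserved error range. */
static int demo_is_err(const void *ptr)
{
    return (uintptr_t)ptr >= (uintptr_t)-DEMO_MAX_ERRNO;
}

/* Hypothetical allocator: fails with -ENODEV when the device is gone. */
static struct demo_request *demo_get_request(int device_alive)
{
    static struct demo_request rq = { 1 };

    if (!device_alive)
        return demo_err_ptr(-ENODEV);
    return &rq;
}
```

    A caller checks demo_is_err() before touching the result, so a missing
    device yields an errno instead of a NULL dereference.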
     

14 Oct, 2014

1 commit

  • REQ_KERNEL is no longer used. Remove it and drop the redundant uio
    argument to nfs_file_direct_{read,write}.

    Signed-off-by: Martin K. Petersen
    Cc: Christoph Hellwig
    Reported-by: Dan Carpenter
    Signed-off-by: Jens Axboe

    Martin K. Petersen
     

13 Sep, 2014

2 commits


04 Aug, 2014

2 commits

  • The access cache is used during RCU-walk path lookups, so it is best
    to avoid locking if possible as taking a lock kills concurrency.

    The rbtree is not rcu-safe and cannot easily be made so.
    Instead we simply check the last (i.e. most recent) entry on the LRU
    list. If this doesn't match, then we return -ECHILD and retry in
    lock/refcount mode.

    This requires freeing the nfs_access_entry struct with rcu, and
    requires using rcu access primitives when adding entries to the lru, and
    when examining the last entry.

    Calling put_rpccred before kfree_rcu looks a bit odd, but as
    put_rpccred already provides rcu protection, we know that the cred will
    not actually be freed until the next grace period, so any concurrent
    access will be safe.

    This patch provides about 5% performance improvement on a stat-heavy
    synthetic work load with 4 threads on a 2-core CPU.

    Signed-off-by: NeilBrown
    Signed-off-by: Trond Myklebust

    NeilBrown
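
    The lockless fast path described above can be sketched in userspace
    with C11 atomics standing in for the RCU primitives. All names are
    hypothetical, credentials are reduced to an integer id, and the
    deferred freeing that RCU provides is deliberately left out.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical cache entry: which credential it is for and what it may do. */
struct demo_access_entry {
    int cred_id;
    unsigned int mask;
};

/* Only the most recently added entry is visible to the lockless path. */
struct demo_access_cache {
    _Atomic(struct demo_access_entry *) newest;
};

/* Publish a fully initialised entry (plays the role of an RCU-safe add). */
static void demo_cache_publish(struct demo_access_cache *cache,
                               struct demo_access_entry *entry)
{
    assert(cache != NULL && entry != NULL);
    atomic_store_explicit(&cache->newest, entry, memory_order_release);
}

/* Lockless lookup: inspect only the newest entry. On a miss, return
 * -ECHILD so the caller can retry on the locked slow path. */
static int demo_cache_lookup_lockless(struct demo_access_cache *cache,
                                      int cred_id, unsigned int *mask)
{
    struct demo_access_entry *entry =
        atomic_load_explicit(&cache->newest, memory_order_acquire);

    if (entry == NULL || entry->cred_id != cred_id)
        return -ECHILD;
    *mask = entry->mask;
    return 0;
}
```

    A miss here is not an error in itself: it only signals that the
    lookup has to be repeated in lock/refcount mode.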
     
  • This requires nfs_check_verifier to take an rcu_walk flag, and requires
    an rcu version of nfs_revalidate_inode which returns -ECHILD rather
    than making an RPC call.

    With this, nfs_lookup_revalidate can call nfs_neg_need_reval in
    RCU-walk mode.

    We can also move the LOOKUP_RCU check past the nfs_check_verifier()
    call in nfs_lookup_revalidate.

    If RCU_WALK prevents nfs_check_verifier or nfs_neg_need_reval from
    doing a full check, they return a status indicating that a revalidation
    is required. As this revalidation will not be possible in RCU_WALK
    mode, -ECHILD will ultimately be returned, which is the desired result.

    Signed-off-by: NeilBrown
    Signed-off-by: Trond Myklebust

    NeilBrown
     

13 Jun, 2014

1 commit

  • Pull vfs updates from Al Viro:
    "This is the bunch that sat in -next + lock_parent() fix. This is the
    minimal set; there's more pending stuff.

    In particular, I really hope to get acct.c fixes merged this cycle -
    we need that to deal sanely with delayed-mntput stuff. In the next
    pile, hopefully - that series is fairly short and localized
    (kernel/acct.c, fs/super.c and fs/namespace.c). In this pile: more
    iov_iter work. Most of prereqs for ->splice_write with sane locking
    order are there and Kent's dio rewrite would also fit nicely on top of
    this pile"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (70 commits)
    lock_parent: don't step on stale ->d_parent of all-but-freed one
    kill generic_file_splice_write()
    ceph: switch to iter_file_splice_write()
    shmem: switch to iter_file_splice_write()
    nfs: switch to iter_splice_write_file()
    fs/splice.c: remove unneeded exports
    ocfs2: switch to iter_file_splice_write()
    ->splice_write() via ->write_iter()
    bio_vec-backed iov_iter
    optimize copy_page_{to,from}_iter()
    bury generic_file_aio_{read,write}
    lustre: get rid of messing with iovecs
    ceph: switch to ->write_iter()
    ceph_sync_direct_write: stop poking into iov_iter guts
    ceph_sync_read: stop poking into iov_iter guts
    new helper: copy_page_from_iter()
    fuse: switch to ->write_iter()
    btrfs: switch to ->write_iter()
    ocfs2: switch to ->write_iter()
    xfs: switch to ->write_iter()
    ...

    Linus Torvalds
     

29 May, 2014

2 commits


07 May, 2014

2 commits


18 Mar, 2014

1 commit


12 Feb, 2014

1 commit


01 Feb, 2014

1 commit

  • Pull NFS client bugfixes from Trond Myklebust:
    "Highlights:

    - Fix several races in nfs_revalidate_mapping
    - NFSv4.1 slot leakage in the pNFS files driver
    - Stable fix for a slot leak in nfs40_sequence_done
    - Don't reject NFSv4 servers that support ACLs with only ALLOW aces"

    * tag 'nfs-for-3.14-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
    nfs: initialize the ACL support bits to zero.
    NFSv4.1: Cleanup
    NFSv4.1: Clean up nfs41_sequence_done
    NFSv4: Fix a slot leak in nfs40_sequence_done
    NFSv4.1 free slot before resending I/O to MDS
    nfs: add memory barriers around NFS_INO_INVALID_DATA and NFS_INO_INVALIDATING
    NFS: Fix races in nfs_revalidate_mapping
    sunrpc: turn warn_gssd() log message into a dprintk()
    NFS: fix the handling of NFS_INO_INVALID_DATA flag in nfs_revalidate_mapping
    nfs: handle servers that support only ALLOW ACE type.

    Linus Torvalds
     

28 Jan, 2014

1 commit

  • There is a possible race in how the nfs_invalidate_mapping function is
    handled. Currently, we go and invalidate the pages in the file and then
    clear NFS_INO_INVALID_DATA.

    The problem is that it's possible for a stale page to creep into the
    mapping after the page was invalidated (i.e., via readahead). If another
    writer comes along and sets the flag after that happens but before
    invalidate_inode_pages2 returns then we could clear the flag
    without the cache having been properly invalidated.

    So, we must clear the flag first and then invalidate the pages. Doing
    this, however, opens another race:

    It's possible to have two concurrent read() calls that end up in
    nfs_revalidate_mapping at the same time. The first one clears the
    NFS_INO_INVALID_DATA flag and then goes to call nfs_invalidate_mapping.

    Just before calling that though, the other task races in, checks the
    flag and finds it cleared. At that point, it trusts that the mapping is
    good and gets the lock on the page, allowing the read() to be satisfied
    from the cache even though the data is no longer valid.

    These effects are easily manifested by running diotest3 from the LTP
    test suite on NFS. That program does a series of DIO writes and buffered
    reads. The operations are serialized and page-aligned but the existing
    code fails the test since it occasionally allows a read to come out of
    the cache incorrectly. While mixing direct and buffered I/O isn't
    recommended, I believe it's possible to hit this in other ways that just
    use buffered I/O, though that situation is much harder to reproduce.

    The problem is that the checking/clearing of that flag and the
    invalidation of the mapping really need to be atomic. Fix this by
    serializing concurrent invalidations with a bitlock.

    At the same time, we also need to allow other places that check
    NFS_INO_INVALID_DATA to check whether we might be in the middle of
    invalidating the file, so fix up a couple of places that do that
    to look for the new NFS_INO_INVALIDATING flag.

    Doing this requires us to be careful not to set the bitlock
    unnecessarily, so this code only does that if it believes it will
    be doing an invalidation.

    Signed-off-by: Jeff Layton
    Signed-off-by: Trond Myklebust

    Jeff Layton
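
    The bitlock scheme described above can be sketched with C11 atomics.
    This is a userspace illustration under stated assumptions, not the
    patch itself: the flag values, the struct and the function name are
    hypothetical, and a waiter simply retries where kernel code would
    sleep.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define DEMO_INVALID_DATA 0x1u /* cached data is known to be stale */
#define DEMO_INVALIDATING 0x2u /* bit lock: an invalidation is in flight */

struct demo_inode {
    atomic_uint flags;
    int cached_pages;
};

/* Returns true if this caller performed the invalidation. */
static bool demo_revalidate_mapping(struct demo_inode *ino)
{
    bool did_invalidate = false;

    assert(ino != NULL);
    for (;;) {
        unsigned int flags = atomic_load(&ino->flags);

        /* Avoid taking the bit lock when there is nothing to do. */
        if (!(flags & (DEMO_INVALID_DATA | DEMO_INVALIDATING)))
            return false;

        /* fetch_or returns the old flags: we own the lock only if the
         * INVALIDATING bit was clear before our update. */
        if (!(atomic_fetch_or(&ino->flags, DEMO_INVALIDATING) & DEMO_INVALIDATING))
            break;
        /* Another task is invalidating; retry (real code would sleep). */
    }

    /* Under the lock: clear the flag first, then drop the pages. */
    if (atomic_fetch_and(&ino->flags, ~DEMO_INVALID_DATA) & DEMO_INVALID_DATA) {
        ino->cached_pages = 0;
        did_invalidate = true;
    }

    /* Release the bit lock. */
    atomic_fetch_and(&ino->flags, ~DEMO_INVALIDATING);
    return did_invalidate;
}
```

    Because a second reader sees DEMO_INVALIDATING while the first is
    still dropping pages, it cannot conclude that the mapping is good just
    because DEMO_INVALID_DATA has already been cleared.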
     

26 Jan, 2014

1 commit


20 Nov, 2013

1 commit


28 Sep, 2013

1 commit

  • Use i_writecount to control whether to get an fscache cookie in nfs_open() as
    NFS does not do write caching yet. I *think* this is the cause of a problem
    encountered by Mark Moseley whereby __fscache_uncache_page() gets a NULL
    pointer dereference because cookie->def is NULL:

    BUG: unable to handle kernel NULL pointer dereference at 0000000000000010
    IP: [] __fscache_uncache_page+0x23/0x160
    PGD 0
    Thread overran stack, or stack corrupted
    Oops: 0000 [#1] SMP
    Modules linked in: ...
    CPU: 7 PID: 18993 Comm: php Not tainted 3.11.1 #1
    Hardware name: Dell Inc. PowerEdge R420/072XWF, BIOS 1.3.5 08/21/2012
    task: ffff8804203460c0 ti: ffff880420346640
    RIP: 0010:[] __fscache_uncache_page+0x23/0x160
    RSP: 0018:ffff8801053af878 EFLAGS: 00210286
    RAX: 0000000000000000 RBX: ffff8800be2f8780 RCX: ffff88022ffae5e8
    RDX: 0000000000004c66 RSI: ffffea00055ff440 RDI: ffff8800be2f8780
    RBP: ffff8801053af898 R08: 0000000000000001 R09: 0000000000000003
    R10: 0000000000000000 R11: 0000000000000000 R12: ffffea00055ff440
    R13: 0000000000001000 R14: ffff8800c50be538 R15: 0000000000000000
    FS: 0000000000000000(0000) GS:ffff88042fc60000(0063) knlGS:00000000e439c700
    CS: 0010 DS: 002b ES: 002b CR0: 0000000080050033
    CR2: 0000000000000010 CR3: 0000000001d8f000 CR4: 00000000000607f0
    Stack:
    ...
    Call Trace:
    [] __nfs_fscache_invalidate_page+0x42/0x70
    [] nfs_invalidate_page+0x75/0x90
    [] truncate_inode_page+0x8e/0x90
    [] truncate_inode_pages_range.part.12+0x14d/0x620
    [] ? __mutex_lock_slowpath+0x1fd/0x2e0
    [] truncate_inode_pages_range+0x53/0x70
    [] truncate_inode_pages+0x2d/0x40
    [] truncate_pagecache+0x4f/0x70
    [] nfs_setattr_update_inode+0xa0/0x120
    [] nfs3_proc_setattr+0xc4/0xe0
    [] nfs_setattr+0xc8/0x150
    [] notify_change+0x1cb/0x390
    [] do_truncate+0x7b/0xc0
    [] do_last+0xa4c/0xfd0
    [] path_openat+0xcc/0x670
    [] do_filp_open+0x4e/0xb0
    [] do_sys_open+0x13f/0x2b0
    [] compat_SyS_open+0x36/0x50
    [] sysenter_dispatch+0x7/0x24

    The code at the instruction pointer was disassembled:

    > (gdb) disas __fscache_uncache_page
    > Dump of assembler code for function __fscache_uncache_page:
    > ...
    > 0xffffffff812a18ff : mov 0x48(%rbx),%rax
    > 0xffffffff812a1903 : cmpb $0x0,0x10(%rax)
    > 0xffffffff812a1907 : je 0xffffffff812a19cd

    These instructions make up:

    ASSERTCMP(cookie->def->type, !=, FSCACHE_COOKIE_TYPE_INDEX);

    That cmpb is the faulting instruction (%rax is 0). So cookie->def is NULL -
    which presumably means that the cookie has already been at least partway
    through __fscache_relinquish_cookie().

    What I think may be happening is something like a three-way race on the same
    file:

    PROCESS 1                           PROCESS 2                           PROCESS 3
    =================================== =================================== ===================================
    open(O_TRUNC|O_WRONLY)
                                        open(O_RDONLY)
                                                                            open(O_WRONLY)
    -->nfs_open()
    -->nfs_fscache_set_inode_cookie()
    nfs_fscache_inode_lock()
    nfs_fscache_disable_inode_cookie()
    __fscache_relinquish_cookie()
    nfs_inode->fscache = NULL
                                        nfs_open()
                                        -->nfs_fscache_set_inode_cookie()
                                        nfs_fscache_inode_lock()
                                        nfs_fscache_enable_inode_cookie()
                                        __fscache_acquire_cookie()
                                        nfs_inode->fscache = cookie
    nfs_setattr()
    ...
    ...
    -->nfs_invalidate_page()
    -->__nfs_fscache_invalidate_page()
    cookie = nfsi->fscache
                                                                            -->nfs_open()
                                                                            -->nfs_fscache_set_inode_cookie()
                                                                            nfs_fscache_inode_lock()
                                                                            nfs_fscache_disable_inode_cookie()
                                                                            -->__fscache_relinquish_cookie()
    -->__fscache_uncache_page(cookie)

                                                                            fscache = NULL

    Signed-off-by: David Howells

    David Howells
     

04 Sep, 2013

1 commit

  • If an NFS client does

    mkdir("dir");
    fd = open("dir/file");
    unlink("dir/file");
    close(fd);
    rmdir("dir");

    then the asynchronous nature of the sillyrename operation means that
    we can end up getting EBUSY for the rmdir() in the above test. Fix
    that by ensuring that we wait for any in-progress sillyrenames
    before sending the rmdir() to the server.

    Signed-off-by: Trond Myklebust

    Trond Myklebust
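
    The sequence above is shorthand. A compilable version might look like
    the sketch below; the function name, the file modes and the O_CREAT
    flag (needed so that "dir/file" exists) are assumptions that are not
    part of the original message.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

/* Runs mkdir/open/unlink/close/rmdir on `dir`. Returns 0 on success or
 * the errno of the first call that failed. */
static int demo_sillyrename_sequence(const char *dir)
{
    char path[4096];
    int fd, err;

    assert(dir != NULL);
    if (snprintf(path, sizeof(path), "%s/file", dir) >= (int)sizeof(path))
        return ENAMETOOLONG;

    if (mkdir(dir, 0700) != 0)
        return errno;

    fd = open(path, O_CREAT | O_RDWR, 0600);
    if (fd < 0) {
        err = errno;
        rmdir(dir);
        return err;
    }

    /* Unlinking a file that is still open is what triggers a
     * sillyrename on an NFS client. */
    if (unlink(path) != 0) {
        err = errno;
        close(fd);
        return err;
    }

    if (close(fd) != 0)
        return errno;

    /* The failure described above would surface here as EBUSY. */
    if (rmdir(dir) != 0)
        return errno;

    return 0;
}
```

    On a local filesystem the whole sequence is expected to return 0; the
    interesting case is running it with `dir` on an NFS mount.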
     

10 Jul, 2013

1 commit


29 Jun, 2013

1 commit

  • * labeled-nfs:
    NFS: Apply v4.1 capabilities to v4.2
    NFS: Add in v4.2 callback operation
    NFS: Make callbacks minor version generic
    Kconfig: Add Kconfig entry for Labeled NFS V4 client
    NFS: Extend NFS xattr handlers to accept the security namespace
    NFS: Client implementation of Labeled-NFS
    NFS: Add label lifecycle management
    NFS:Add labels to client function prototypes
    NFSv4: Extend fattr bitmaps to support all 3 words
    NFSv4: Introduce new label structure
    NFSv4: Add label recommended attribute and NFSv4 flags
    NFSv4.2: Added NFS v4.2 support to the NFS client
    SELinux: Add new labeling type native labels
    LSM: Add flags field to security_sb_set_mnt_opts for in kernel mount data.
    Security: Add Hook to test if the particular xattr is part of a MAC model.
    Security: Add hook to calculate context based on a negative dentry.
    NFS: Add NFSv4.2 protocol constants

    Conflicts:
    fs/nfs/nfs4proc.c

    Trond Myklebust
     

09 Jun, 2013

3 commits

  • This patch implements the client transport and handling support for labeled
    NFS. The patch adds two functions to encode and decode the security label
    recommended attribute which makes use of the LSM hooks added earlier. It also
    adds code to grab the label from the file attribute structures and encode the
    label to be sent back to the server.

    Acked-by: James Morris
    Signed-off-by: Matthew N. Dodd
    Signed-off-by: Miguel Rodel Felipe
    Signed-off-by: Phua Eu Gene
    Signed-off-by: Khin Mi Mi Aung
    Signed-off-by: Steve Dickson
    Signed-off-by: Trond Myklebust

    David Quigley
     
  • After looking at all of the nfsv4 operations the label structure has been added
    to the prototypes of the functions which can transmit label data.

    Signed-off-by: Matthew N. Dodd
    Signed-off-by: Miguel Rodel Felipe
    Signed-off-by: Phua Eu Gene
    Signed-off-by: Khin Mi Mi Aung
    Signed-off-by: Steve Dickson
    Signed-off-by: Trond Myklebust

    David Quigley
     
  • In order to mimic the way that NFSv4 ACLs are implemented we have created a
    structure to be used to pass label data up and down the call chain. This patch
    adds the new structure and new members to the required NFSv4 call structures.

    Signed-off-by: Matthew N. Dodd
    Signed-off-by: Miguel Rodel Felipe
    Signed-off-by: Phua Eu Gene
    Signed-off-by: Khin Mi Mi Aung
    Signed-off-by: Steve Dickson
    Signed-off-by: Trond Myklebust

    Steve Dickson
     

07 Jun, 2013

1 commit

  • State recovery currently relies on being able to find a valid
    nfs_open_context in the inode->open_files list.
    We therefore need to put the nfs_open_context on the list while
    we're still protected by the sp->so_reclaim_seqcount in order
    to avoid reboot races.

    Signed-off-by: Trond Myklebust

    Trond Myklebust
     

09 Apr, 2013

1 commit


26 Mar, 2013

1 commit


13 Oct, 2012

1 commit


02 Oct, 2012

1 commit

  • The OPEN operation has no way to differentiate an open for read and an
    open for execution - both look like read to the server. This allowed
    users to read files that didn't have READ access but did have EXEC access,
    which is obviously wrong.

    This patch adds an ACCESS call to the OPEN compound to handle the
    difference between OPENs for reading and execution. Since we're going
    through the trouble of calling ACCESS, we check all possible access bits
    and cache the results, hopefully avoiding an ACCESS call in the future.

    Signed-off-by: Weston Andros Adamson
    Signed-off-by: Trond Myklebust

    Weston Andros Adamson
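
    How the ACCESS result can separate the two kinds of open is sketched
    below. The bit values are the ACCESS4_* constants from the NFSv4
    protocol; the demo_ names and the exact policy (an open for execution
    checks only the execute bit) are assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* ACCESS4 bit values as defined by the NFSv4 protocol. */
#define DEMO_ACCESS4_READ    0x01u
#define DEMO_ACCESS4_LOOKUP  0x02u
#define DEMO_ACCESS4_MODIFY  0x04u
#define DEMO_ACCESS4_EXTEND  0x08u
#define DEMO_ACCESS4_DELETE  0x10u
#define DEMO_ACCESS4_EXECUTE 0x20u

/* Hypothetical description of why the client is opening the file. */
enum demo_open_intent { DEMO_OPEN_READ, DEMO_OPEN_EXEC };

/* Decide whether the open may proceed, given the access bits the server
 * granted in the same compound. */
static bool demo_open_allowed(enum demo_open_intent intent,
                              uint32_t access_granted)
{
    assert(intent == DEMO_OPEN_READ || intent == DEMO_OPEN_EXEC);
    if (intent == DEMO_OPEN_EXEC)
        return (access_granted & DEMO_ACCESS4_EXECUTE) != 0;
    return (access_granted & DEMO_ACCESS4_READ) != 0;
}
```

    For a file that grants only execute access, an open for execution is
    allowed while an open for reading is refused, which is the distinction
    OPEN alone cannot express.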
     

29 Sep, 2012

1 commit

  • If the server reboots before it can commit the unstable writes to disk,
    then nfs_commit_release_pages() will detect this when it compares the
    verifier returned by COMMIT to the one returned by WRITE. When this
    happens, the client needs to resend those writes in order to guarantee
    that they make it to stable storage.

    This patch adds a signalling mechanism to notify fsync() that it
    needs to retry all writes before it can exit.

    Signed-off-by: Trond Myklebust

    Trond Myklebust
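
    The verifier comparison described above can be sketched as follows.
    The 8-byte verifier size comes from the NFS protocol; the struct, the
    field names and the function are hypothetical and only illustrate the
    check, not the signalling mechanism the patch adds.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define DEMO_VERIFIER_SIZE 8 /* NFS write verifiers are 8 opaque bytes */

/* Hypothetical record of one unstable write awaiting COMMIT. */
struct demo_write_req {
    unsigned char write_verf[DEMO_VERIFIER_SIZE]; /* verifier from WRITE */
    bool needs_resend;
};

/* Compare each WRITE verifier with the one returned by COMMIT. A mismatch
 * means the server rebooted in between, so the data may not have reached
 * stable storage and the write must be sent again. Returns the number of
 * requests marked for resend. */
static size_t demo_commit_done(struct demo_write_req *reqs, size_t nreqs,
                               const unsigned char *commit_verf)
{
    size_t resend = 0;

    assert(commit_verf != NULL && (reqs != NULL || nreqs == 0));
    for (size_t i = 0; i < nreqs; i++) {
        reqs[i].needs_resend =
            memcmp(reqs[i].write_verf, commit_verf, DEMO_VERIFIER_SIZE) != 0;
        if (reqs[i].needs_resend)
            resend++;
    }
    return resend;
}
```

    A non-zero return value is the condition under which fsync() must not
    report success yet and has to retry the affected writes.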