09 Feb, 2017

1 commit

  • commit d19fb70dd68c4e960e2ac09b0b9c79dfdeefa726 upstream.

    nfsd assigns the nfs4_free_lock_stateid to .sc_free in init_lock_stateid().

    If nfsd doesn't go through init_lock_stateid() and puts the stateid at the end,
    there is a NULL dereference of .sc_free when calling nfs4_put_stid(ns).

    This patch moves the nfs4_stid.sc_free assignment into nfs4_alloc_stid().

    Fixes: 356a95ece7aa "nfsd: clean up races in lock stateid searching..."
    Signed-off-by: Kinglong Mee
    Reviewed-by: Jeff Layton
    Signed-off-by: J. Bruce Fields
    Signed-off-by: Greg Kroah-Hartman

    Kinglong Mee
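
The idea behind this fix can be illustrated with a minimal user-space sketch. It assumes a much simplified stand-in for struct nfs4_stid; the names alloc_stid(), put_stid() and free_lock_stid() are hypothetical and do not reproduce the kernel code. The point is only that the destructor is set at allocation time, so no later init step can be skipped.

```c
#include <stdlib.h>

/* Simplified stand-in for the kernel's struct nfs4_stid (assumption). */
struct stid {
    int refcount;
    void (*sc_free)(struct stid *);   /* destructor, must never be NULL */
};

/* Hypothetical destructor; counts calls so the effect is observable. */
static int freed_count;

static void free_lock_stid(struct stid *s)
{
    freed_count++;
    free(s);
}

/* The allocator takes the destructor, so every object has one from birth. */
static struct stid *alloc_stid(void (*sc_free)(struct stid *))
{
    struct stid *s = calloc(1, sizeof(*s));

    if (!s)
        return NULL;
    s->refcount = 1;
    s->sc_free = sc_free;
    return s;
}

/* Drop a reference; calls the destructor when the last one goes away. */
static void put_stid(struct stid *s)
{
    if (--s->refcount == 0)
        s->sc_free(s);
}
```

With this shape, an error path that puts the object before any type-specific init has run still finds a valid sc_free pointer.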
     

02 Nov, 2016

1 commit

  • When I push NFSv4.1 / RDMA hard, (xfstests generic/089, for example),
    I get this crash on the server:

    Oct 28 22:04:30 klimt kernel: general protection fault: 0000 [#1] SMP DEBUG_PAGEALLOC
    Oct 28 22:04:30 klimt kernel: Modules linked in: cts rpcsec_gss_krb5 iTCO_wdt iTCO_vendor_support sb_edac edac_core x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm btrfs irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd xor pcspkr raid6_pq i2c_i801 i2c_smbus lpc_ich mfd_core sg mei_me mei ioatdma shpchp wmi ipmi_si ipmi_msghandler rpcrdma ib_ipoib rdma_ucm acpi_power_meter acpi_pad ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c mlx4_ib mlx4_en ib_core sr_mod cdrom sd_mod ast drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm crc32c_intel igb ahci libahci ptp mlx4_core pps_core dca libata i2c_algo_bit i2c_core dm_mirror dm_region_hash dm_log dm_mod
    Oct 28 22:04:30 klimt kernel: CPU: 7 PID: 1558 Comm: nfsd Not tainted 4.9.0-rc2-00005-g82cd754 #8
    Oct 28 22:04:30 klimt kernel: Hardware name: Supermicro Super Server/X10SRL-F, BIOS 1.0c 09/09/2015
    Oct 28 22:04:30 klimt kernel: task: ffff880835c3a100 task.stack: ffff8808420d8000
    Oct 28 22:04:30 klimt kernel: RIP: 0010:[] [] release_lock_stateid+0x1f/0x60 [nfsd]
    Oct 28 22:04:30 klimt kernel: RSP: 0018:ffff8808420dbce0 EFLAGS: 00010246
    Oct 28 22:04:30 klimt kernel: RAX: ffff88084e6660f0 RBX: ffff88084e667020 RCX: 0000000000000000
    Oct 28 22:04:30 klimt kernel: RDX: 0000000000000007 RSI: 0000000000000000 RDI: ffff88084e667020
    Oct 28 22:04:30 klimt kernel: RBP: ffff8808420dbcf8 R08: 0000000000000001 R09: 0000000000000000
    Oct 28 22:04:30 klimt kernel: R10: ffff880835c3a100 R11: ffff880835c3aca8 R12: 6b6b6b6b6b6b6b6b
    Oct 28 22:04:30 klimt kernel: R13: ffff88084e6670d8 R14: ffff880835f546f0 R15: ffff880835f1c548
    Oct 28 22:04:30 klimt kernel: FS: 0000000000000000(0000) GS:ffff88087bdc0000(0000) knlGS:0000000000000000
    Oct 28 22:04:30 klimt kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    Oct 28 22:04:30 klimt kernel: CR2: 00007ff020389000 CR3: 0000000001c06000 CR4: 00000000001406e0
    Oct 28 22:04:30 klimt kernel: Stack:
    Oct 28 22:04:30 klimt kernel: ffff88084e667020 0000000000000000 ffff88084e6670d8 ffff8808420dbd20
    Oct 28 22:04:30 klimt kernel: ffffffffa05ac80d ffff880835f54548 ffff88084e640008 ffff880835f545b0
    Oct 28 22:04:30 klimt kernel: ffff8808420dbd70 ffffffffa059803d ffff880835f1c768 0000000000000870
    Oct 28 22:04:30 klimt kernel: Call Trace:
    Oct 28 22:04:30 klimt kernel: [] nfsd4_free_stateid+0xfd/0x1b0 [nfsd]
    Oct 28 22:04:30 klimt kernel: [] nfsd4_proc_compound+0x40d/0x690 [nfsd]
    Oct 28 22:04:30 klimt kernel: [] nfsd_dispatch+0xd4/0x1d0 [nfsd]
    Oct 28 22:04:30 klimt kernel: [] svc_process_common+0x3d9/0x700 [sunrpc]
    Oct 28 22:04:30 klimt kernel: [] svc_process+0xf4/0x330 [sunrpc]
    Oct 28 22:04:30 klimt kernel: [] nfsd+0xfa/0x160 [nfsd]
    Oct 28 22:04:30 klimt kernel: [] ? nfsd_destroy+0x170/0x170 [nfsd]
    Oct 28 22:04:30 klimt kernel: [] kthread+0x10b/0x120
    Oct 28 22:04:30 klimt kernel: [] ? kthread_stop+0x280/0x280
    Oct 28 22:04:30 klimt kernel: [] ret_from_fork+0x2a/0x40
    Oct 28 22:04:30 klimt kernel: Code: c3 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 89 e5 41 55 41 54 53 48 8b 87 b0 00 00 00 48 89 fb 4c 8b a0 98 00 00 00 8b 44 24 20 48 8d b8 80 03 00 00 e8 10 66 1a e1 48 89 df e8
    Oct 28 22:04:30 klimt kernel: RIP [] release_lock_stateid+0x1f/0x60 [nfsd]
    Oct 28 22:04:30 klimt kernel: RSP
    Oct 28 22:04:30 klimt kernel: ---[ end trace cf5d0b371973e167 ]---

    Jeff Layton says:
    > Hm...now that I look though, this is a little suspicious:
    >
    > struct nfs4_openowner *oo = openowner(stp->st_openstp->st_stateowner);
    >
    > I wonder if it's possible for the openstateid to have already been
    > destroyed at this point.
    >
    > We might be better off doing something like this to get the client pointer:
    >
    > stp->st_stid.sc_client;
    >
    > ...which should be more direct and less dependent on other stateids
    > staying valid.

    With the suggested change, I am no longer able to reproduce the above oops.

    v2: Fix unhash_lock_stateid() as well

    Fix-suggested-by: Jeff Layton
    Fixes: 42691398be08 ('nfsd: Fix race between FREE_STATEID and LOCK')
    Signed-off-by: Chuck Lever
    Reviewed-by: Jeff Layton
    Cc: stable@vger.kernel.org
    Signed-off-by: J. Bruce Fields

    Chuck Lever
     

25 Oct, 2016

1 commit

  • Bruce was hitting some lockdep warnings in testing, showing that we
    could hit a deadlock with the new CB_NOTIFY_LOCK handling, in a
    rather complex situation involving four different spinlocks.

    The crux of the matter is that we end up taking the nn->client_lock in
    the lm_notify handler. The simplest fix is to just declare a new
    per-nfsd_net spinlock to protect the new CB_NOTIFY_LOCK structures.

    Signed-off-by: Jeff Layton
    Signed-off-by: J. Bruce Fields

    Jeff Layton
     

14 Oct, 2016

1 commit

  • Pull nfsd updates from Bruce Fields:
    "Some RDMA work and some good bugfixes, and two new features that could
    benefit from user testing:

    - Anna Schumaker contributed a simple NFSv4.2 COPY implementation.
    COPY is already supported on the client side, so a call to
    copy_file_range() on a recent client should now result in a
    server-side copy that doesn't require all the data to make a round
    trip to the client and back.

    - Jeff Layton implemented callbacks to notify clients when contended
    locks become available, which should reduce latency on workloads
    with contended locks"

    * tag 'nfsd-4.9' of git://linux-nfs.org/~bfields/linux:
    NFSD: Implement the COPY call
    nfsd: handle EUCLEAN
    nfsd: only WARN once on unmapped errors
    exportfs: be careful to only return expected errors.
    nfsd4: setclientid_confirm with unmatched verifier should fail
    nfsd: randomize SETCLIENTID reply to help distinguish servers
    nfsd: set the MAY_NOTIFY_LOCK flag in OPEN replies
    nfs: add a new NFS4_OPEN_RESULT_MAY_NOTIFY_LOCK constant
    nfsd: add a LRU list for blocked locks
    nfsd: have nfsd4_lock use blocking locks for v4.1+ locks
    nfsd: plumb in a CB_NOTIFY_LOCK operation
    NFSD: fix corruption in notifier registration
    svcrdma: support Remote Invalidation
    svcrdma: Server-side support for rpcrdma_connect_private
    rpcrdma: RDMA/CM private message data structure
    svcrdma: Skip put_page() when send_reply() fails
    svcrdma: Tail iovec leaves an orphaned DMA mapping
    nfsd: fix dprintk in nfsd4_encode_getdeviceinfo
    nfsd: eliminate cb_minorversion field
    nfsd: don't set a FL_LAYOUT lease for flexfiles layouts

    Linus Torvalds
     

11 Oct, 2016

2 commits

  • Pull more vfs updates from Al Viro:
    "->rename2() work from Miklos + current_time() from Deepa"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    fs: Replace current_fs_time() with current_time()
    fs: Replace CURRENT_TIME_SEC with current_time() for inode timestamps
    fs: Replace CURRENT_TIME with current_time() for inode timestamps
    fs: proc: Delete inode time initializations in proc_alloc_inode()
    vfs: Add current_time() api
    vfs: add note about i_op->rename changes to porting
    fs: rename "rename2" i_op to "rename"
    vfs: remove unused i_op->rename
    fs: make remaining filesystems use .rename2
    libfs: support RENAME_NOREPLACE in simple_rename()
    fs: support RENAME_NOREPLACE for local filesystems
    ncpfs: fix unused variable warning

    Linus Torvalds
     
  • Pull misc vfs updates from Al Viro:
    "Assorted misc bits and pieces.

    There are several single-topic branches left after this (rename2
    series from Miklos, current_time series from Deepa Dinamani, xattr
    series from Andreas, uaccess stuff from me) and I'd prefer to
    send those separately"

    * 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (39 commits)
    proc: switch auxv to use of __mem_open()
    hpfs: support FIEMAP
    cifs: get rid of unused arguments of CIFSSMBWrite()
    posix_acl: uapi header split
    posix_acl: xattr representation cleanups
    fs/aio.c: eliminate redundant loads in put_aio_ring_file
    fs/internal.h: add const to ns_dentry_operations declaration
    compat: remove compat_printk()
    fs/buffer.c: make __getblk_slow() static
    proc: unsigned file descriptors
    fs/file: more unsigned file descriptors
    fs: compat: remove redundant check of nr_segs
    cachefiles: Fix attempt to read i_blocks after deleting file [ver #2]
    cifs: don't use memcpy() to copy struct iov_iter
    get rid of separate multipage fault-in primitives
    fs: Avoid premature clearing of capabilities
    fs: Give dentry to inode_change_ok() instead of inode
    fuse: Propagate dentry down to inode_change_ok()
    ceph: Propagate dentry down to inode_change_ok()
    xfs: Propagate dentry down to inode_change_ok()
    ...

    Linus Torvalds
     

08 Oct, 2016

5 commits


28 Sep, 2016

1 commit

  • current_fs_time() uses struct super_block* as an argument.
    As per Linus's suggestion, this is changed to take struct
    inode* as a parameter instead. This is because the function
    is primarily meant for vfs inode timestamps.
    Also the function was renamed as per Arnd's suggestion.

    Change all calls to current_fs_time() to use the new
    current_time() function instead. current_fs_time() will be
    deleted.

    Signed-off-by: Deepa Dinamani
    Signed-off-by: Al Viro

    Deepa Dinamani
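
The interface change described in this commit can be sketched in user space as follows. The structures are simplified stand-ins, and trunc_time(), sketch_current_fs_time() and sketch_current_time() are hypothetical names used only for illustration; the sketch assumes the timestamp is truncated to a per-superblock granularity.

```c
#include <time.h>

/* Simplified stand-ins for the kernel structures (assumption). */
struct super_block { long s_time_gran; };   /* timestamp granularity in ns */
struct inode { struct super_block *i_sb; };

/* Round a timestamp down to a multiple of gran nanoseconds. */
static struct timespec trunc_time(struct timespec t, long gran)
{
    if (gran <= 1)
        return t;                       /* nanosecond granularity: keep as-is */
    if (gran >= 1000000000L)
        t.tv_nsec = 0;                  /* whole-second granularity */
    else
        t.tv_nsec -= t.tv_nsec % gran;
    return t;
}

/* Old style: the caller has to dig out the superblock itself. */
static struct timespec sketch_current_fs_time(struct super_block *sb)
{
    struct timespec now;

    timespec_get(&now, TIME_UTC);
    return trunc_time(now, sb->s_time_gran);
}

/* New style: pass the inode whose timestamp is being set. */
static struct timespec sketch_current_time(struct inode *inode)
{
    return sketch_current_fs_time(inode->i_sb);
}
```

A caller that previously wrote sketch_current_fs_time(inode->i_sb) would now write sketch_current_time(inode), which keeps the superblock lookup in one place.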
     

27 Sep, 2016

7 commits

  • A setclientid_confirm with (clientid, verifier) both matching an
    existing confirmed record is assumed to be a replay, but if the verifier
    doesn't match, it shouldn't be.

    This would be a very rare case, except that clients following
    https://tools.ietf.org/html/rfc7931#section-5.8 may depend on the
    failure.

    Reviewed-by: Jeff Layton
    Signed-off-by: J. Bruce Fields

    J. Bruce Fields
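
The rule described in this commit can be written as a small decision function. This is a minimal sketch that assumes a single record per clientid; the server's real bookkeeping is more involved, and classify_confirm() and its types are hypothetical names.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum confirm_result { CONFIRM_NEW, CONFIRM_REPLAY, CONFIRM_FAIL };

/* Simplified client record (assumption: one record per clientid). */
struct client_record {
    uint64_t clientid;
    uint64_t verifier;
    bool confirmed;
};

/* Decide how to treat a SETCLIENTID_CONFIRM carrying (clientid, verifier). */
static enum confirm_result
classify_confirm(const struct client_record *rec,
                 uint64_t clientid, uint64_t verifier)
{
    if (!rec || rec->clientid != clientid)
        return CONFIRM_FAIL;            /* no such client */
    if (rec->verifier != verifier)
        return CONFIRM_FAIL;            /* mismatched verifier: not a replay */
    return rec->confirmed ? CONFIRM_REPLAY : CONFIRM_NEW;
}
```

Only a request where both values match an already confirmed record is treated as a replay.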
     
  • NFSv4.1 has built-in trunking support that allows a client to determine
    whether two connections to two different IP addresses are actually to
    the same server. NFSv4.0 does not, but RFC 7931 attempts to provide
    clients a means to do this, basically by performing a SETCLIENTID to one
    address and confirming it with a SETCLIENTID_CONFIRM to the other.

    Linux clients since 05f4c350ee02 "NFS: Discover NFSv4 server trunking
    when mounting" implement a variation on this suggestion. It is possible
    that other clients do too.

    This depends on the clientid and verifier not being accepted by an
    unrelated server. Since both are 64-bit values, that would be very
    unlikely if they were random numbers. But they aren't:

    knfsd generates the 64-bit clientid by concatenating the 32-bit boot
    time (in seconds) and a counter. This makes collisions between
    clientids generated by the same server extremely unlikely. But
    collisions are very likely between clientids generated by servers that
    boot at the same time, and it's quite common for multiple servers to
    boot at the same time. The verifier is a concatenation of the
    SETCLIENTID time (in seconds) and a counter, so again collisions between
    different servers are likely if multiple SETCLIENTIDs are done at the
    same time, which is a common case.

    Therefore recent NFSv4.0 clients may decide two different servers are
    really the same, and mount a filesystem from the wrong server.

    Fortunately the Linux client, since 55b9df93ddd6 "nfsv4/v4.1: Verify the
    client owner id during trunking detection", only does this when given
    the non-default "migration" mount option.

    The fault is really with RFC 7931, and needs a client fix, but in the
    meantime we can mitigate the chance of these collisions by randomizing
    the starting value of the counters used to generate clientids and
    verifiers.

    Reported-by: Frank Sorenson
    Reviewed-by: Jeff Layton
    Signed-off-by: J. Bruce Fields

    J. Bruce Fields
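
To make the collision argument concrete, here is a small sketch of an ID built from a 32-bit boot time and a 32-bit counter, as the message describes. The struct and function names are hypothetical, and the random starting value is supplied by the caller rather than drawn from any particular random source.

```c
#include <stdint.h>

/* Hypothetical generator state: boot time in seconds plus a counter. */
struct id_gen {
    uint32_t boot_time;
    uint32_t counter;
};

/* 'start' is the initial counter value; the mitigation makes it random. */
static void id_gen_init(struct id_gen *g, uint32_t boot_time, uint32_t start)
{
    g->boot_time = boot_time;
    g->counter = start;
}

/* Concatenate the boot time (high 32 bits) and the counter (low 32 bits). */
static uint64_t next_clientid(struct id_gen *g)
{
    return ((uint64_t)g->boot_time << 32) | g->counter++;
}
```

Two servers that boot in the same second and both start counting at zero hand out identical IDs; with independent random starting values the low 32 bits are very unlikely to line up.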
     
  • If we are using v4.1+, then we can send notification when contended
    locks become free. Inform the client of that fact.

    Signed-off-by: Jeff Layton
    Signed-off-by: J. Bruce Fields

    Jeff Layton
     
  • It's possible for a client to call in on a lock that is blocked for a
    long time, but discontinue polling for it. A malicious client could
    even set a lock on a file, and then spam the server with failing lock
    requests from different lockowners that pile up in a DoS attack.

    Add the blocked lock structures to a per-net namespace LRU when hashing
    them, and timestamp them. If the lock request is not revisited after a
    lease period, we'll drop it under the assumption that the client is no
    longer interested.

    This also gives us a mechanism to clean up these objects at server
    shutdown time as well.

    Signed-off-by: Jeff Layton
    Signed-off-by: J. Bruce Fields

    Jeff Layton
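
The expiry scheme described in this commit can be sketched as a timestamped queue. This is a user-space illustration under simplifying assumptions (a singly linked list, no locking, time passed in by the caller); the names blocked_lru, lru_add() and lru_expire() are hypothetical.

```c
#include <stddef.h>
#include <stdlib.h>
#include <time.h>

/* One queued blocked-lock request, stamped when it was queued. */
struct blocked_lock {
    time_t queued_at;
    struct blocked_lock *next;
};

/* Entries are appended at the tail, so the head is always the oldest. */
struct blocked_lru {
    struct blocked_lock *head;
    struct blocked_lock *tail;
};

/* Queue a new blocked lock with the current timestamp. */
static struct blocked_lock *lru_add(struct blocked_lru *lru, time_t now)
{
    struct blocked_lock *b = calloc(1, sizeof(*b));

    if (!b)
        return NULL;
    b->queued_at = now;
    if (lru->tail)
        lru->tail->next = b;
    else
        lru->head = b;
    lru->tail = b;
    return b;
}

/* Free every entry at least 'lease' seconds old; return how many were dropped. */
static int lru_expire(struct blocked_lru *lru, time_t now, time_t lease)
{
    int dropped = 0;

    while (lru->head && now - lru->head->queued_at >= lease) {
        struct blocked_lock *b = lru->head;

        lru->head = b->next;
        if (!lru->head)
            lru->tail = NULL;
        free(b);
        dropped++;
    }
    return dropped;
}
```

Calling lru_expire() with a lease of 0 drops everything that is queued, which is the same mechanism the message mentions for cleanup at shutdown.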
     
  • Create a new per-lockowner+per-inode structure that contains a
    file_lock. Have nfsd4_lock add this structure to the lockowner's list
    prior to setting the lock. Then call the vfs and request a blocking lock
    (by setting FL_SLEEP). If we get anything besides FILE_LOCK_DEFERRED
    back, then we dequeue the block structure and free it. When the next
    lock request comes in, we'll look for an existing block for the same
    filehandle and dequeue and reuse it if there is one.

    When the lock comes free (à la an lm_notify call), we dequeue it
    from the lockowner's list and kick off a CB_NOTIFY_LOCK callback to
    inform the client that it should retry the lock request.

    Signed-off-by: Jeff Layton
    Signed-off-by: J. Bruce Fields

    Jeff Layton
     
  • Add the encoding/decoding for CB_NOTIFY_LOCK operations.

    Signed-off-by: Jeff Layton
    Signed-off-by: J. Bruce Fields

    Jeff Layton
     
  • By design a notifier can be registered only once; however, nfsd registers
    the same inetaddr notifiers once per net namespace. When this happens it
    corrupts the list of notifiers. As a result, some notifiers may not be
    called on the proper event, list traversal can loop forever, and a second
    unregister can access already-freed memory.

    Cc: stable@vger.kernel.org
    fixes: 36684996 ("nfsd: Register callbacks on the inetaddr_chain and inet6addr_chain")
    Signed-off-by: Vasily Averin
    Reviewed-by: Jeff Layton
    Cc: stable@vger.kernel.org
    Signed-off-by: J. Bruce Fields

    Vasily Averin
     

23 Sep, 2016

1 commit


22 Sep, 2016

1 commit

  • inode_change_ok() will be responsible for clearing capabilities and IMA
    extended attributes and as such will need a dentry. Give it as an argument
    to inode_change_ok() instead of an inode. Also rename inode_change_ok()
    to setattr_prepare() to better reflect that it also does some
    modifications in addition to checks.

    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jan Kara

    Jan Kara
     

17 Sep, 2016

2 commits

  • We already have that info in the client pointer. No need to pass around
    a copy.

    Signed-off-by: Jeff Layton
    Signed-off-by: J. Bruce Fields

    Jeff Layton
     
  • We currently can hit a deadlock (of sorts) when trying to use flexfiles
    layouts with XFS. XFS will call break_layout when something wants to
    write to the file. In the case of the (super-simple) flexfiles layout
    driver in knfsd, the MDS and DS are the same machine.

    The client can get a layout and then issue a v3 write to do its I/O. XFS
    will then call xfs_break_layouts, which will cause a CB_LAYOUTRECALL to
    be issued to the client. The client however can't return the layout
    until the v3 WRITE completes, but XFS won't allow the write to proceed
    until the layout is returned.

    Christoph says:

    XFS only cares about block-like layouts where the client has direct
    access to the file blocks. I'd need to look how to propagate the
    flag into break_layout, but in principle we don't need to do any
    recalls on truncate ever for file and flexfile layouts.

    If we're never going to recall the layout, then we don't even need to
    set the lease at all. Just skip doing so on flexfiles layouts by
    adding a new flag to struct nfsd4_layout_ops and skipping the lease
    setting and removal when that flag is true.

    Cc: Christoph Hellwig
    Signed-off-by: Jeff Layton
    Signed-off-by: J. Bruce Fields

    Jeff Layton
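
The approach in the last paragraph of this commit can be sketched as a per-layout-type flag that gates lease handling. The structures below are simplified stand-ins, and the field and function names (disable_recalls, layout_setlease(), layout_unsetlease()) are assumptions chosen for illustration.

```c
#include <stdbool.h>

/* Per-layout-type operations; the flag marks types that are never recalled. */
struct layout_ops {
    const char *name;
    bool disable_recalls;
};

/* Per-layout state; has_lease stands in for the real lease on the file. */
struct layout_state {
    const struct layout_ops *ops;
    bool has_lease;
};

static const struct layout_ops block_ops = { "block", false };
static const struct layout_ops flex_ops = { "flexfiles", true };

/* Take a lease only for layout types that may need a recall. */
static void layout_setlease(struct layout_state *ls)
{
    if (ls->ops->disable_recalls)
        return;
    ls->has_lease = true;
}

/* Mirror image: nothing to remove if no lease was ever set. */
static void layout_unsetlease(struct layout_state *ls)
{
    if (ls->ops->disable_recalls)
        return;
    ls->has_lease = false;
}
```

Because a flexfiles layout never holds a lease in this model, a write to the file has nothing to break, so the recall that caused the deadlock is never issued.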
     

13 Aug, 2016

1 commit

  • nfsd4_lock will take the st_mutex before working with the stateid it
    gets, but between the time when we drop the cl_lock and take the mutex,
    the stateid could become unhashed (à la FREE_STATEID). If that happens
    the lock stateid returned to the client will be forgotten.

    Fix this by first moving the st_mutex acquisition into
    lookup_or_create_lock_state. Then, have it check to see if the lock
    stateid is still hashed after taking the mutex. If it's not, then put
    the stateid and try the find/create again.

    Signed-off-by: Jeff Layton
    Tested-by: Alexey Kodanev
    Cc: stable@vger.kernel.org # feb9dad5 nfsd: Always lock state exclusively.
    Cc: stable@vger.kernel.org
    Signed-off-by: J. Bruce Fields

    Jeff Layton
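
The retry pattern described in this commit (take the mutex, re-check that the object is still hashed, otherwise put it and start over) can be modelled in single-threaded user-space code. This is only a sketch: the mutex is a boolean, the race is simulated by a test hook that runs in the unlocked window, and all names are hypothetical.

```c
#include <stdbool.h>
#include <stdlib.h>

struct lock_stid {
    bool hashed;        /* still reachable through the owner's table? */
    bool mutex_held;    /* stands in for st_mutex */
    int refs;
};

struct owner {
    struct lock_stid *hashed_stid;      /* the one hashed stateid, if any */
    void (*race_hook)(struct owner *);  /* test hook: runs before locking */
};

static void put_stid(struct lock_stid *s)
{
    if (--s->refs == 0)
        free(s);
}

/* Return the hashed stateid, creating it if needed, with a new reference. */
static struct lock_stid *find_or_create(struct owner *o)
{
    if (!o->hashed_stid) {
        o->hashed_stid = calloc(1, sizeof(*o->hashed_stid));
        if (!o->hashed_stid)
            return NULL;
        o->hashed_stid->hashed = true;
        o->hashed_stid->refs = 1;       /* the table's reference */
    }
    o->hashed_stid->refs++;             /* the caller's reference */
    return o->hashed_stid;
}

/* What a concurrent FREE_STATEID would do in this model. */
static void unhash(struct owner *o)
{
    struct lock_stid *s = o->hashed_stid;

    if (!s)
        return;
    s->hashed = false;
    o->hashed_stid = NULL;
    put_stid(s);                        /* drop the table's reference */
}

/* Hook that simulates the race exactly once. */
static void unhash_once(struct owner *o)
{
    o->race_hook = NULL;
    unhash(o);
}

/* Find or create, lock, then verify the stateid is still hashed. */
static struct lock_stid *lookup_or_create_locked(struct owner *o)
{
    for (;;) {
        struct lock_stid *s = find_or_create(o);

        if (!s)
            return NULL;
        if (o->race_hook)
            o->race_hook(o);            /* window before the mutex is taken */
        s->mutex_held = true;
        if (s->hashed)
            return s;                   /* still valid: return it locked */
        s->mutex_held = false;
        put_stid(s);                    /* stale: drop it and try again */
    }
}
```

With race_hook set to unhash_once, the first candidate is discarded and the function returns a fresh stateid that is both hashed and locked.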
     

12 Aug, 2016

1 commit

  • When running LTP's nfslock01 test, the Linux client can send a LOCK
    and a FREE_STATEID request at the same time. The outcome is:

    Frame 324 R OPEN stateid [2,O]

    Frame 115004 C LOCK lockowner_is_new stateid [2,O] offset 672000 len 64
    Frame 115008 R LOCK stateid [1,L]
    Frame 115012 C WRITE stateid [0,L] offset 672000 len 64
    Frame 115016 R WRITE NFS4_OK
    Frame 115019 C LOCKU stateid [1,L] offset 672000 len 64
    Frame 115022 R LOCKU NFS4_OK
    Frame 115025 C FREE_STATEID stateid [2,L]
    Frame 115026 C LOCK lockowner_is_new stateid [2,O] offset 672128 len 64
    Frame 115029 R FREE_STATEID NFS4_OK
    Frame 115030 R LOCK stateid [3,L]
    Frame 115034 C WRITE stateid [0,L] offset 672128 len 64
    Frame 115038 R WRITE NFS4ERR_BAD_STATEID

    In other words, the server returns stateid L in a successful LOCK
    reply, but it has already released it. Subsequent uses of stateid L
    fail.

    To address this, protect the generation check in nfsd4_free_stateid
    with the st_mutex. This should guarantee that only one of two
    outcomes occurs: either LOCK returns a fresh valid stateid, or
    FREE_STATEID returns NFS4ERR_LOCKS_HELD.

    Reported-by: Alexey Kodanev
    Fix-suggested-by: Jeff Layton
    Signed-off-by: Chuck Lever
    Tested-by: Alexey Kodanev
    Cc: stable@vger.kernel.org
    Signed-off-by: J. Bruce Fields

    Chuck Lever
     

11 Aug, 2016

1 commit

  • b44061d0b9 introduced a dentry ref counting bug. Previously we were
    grabbing one ref to dchild in nfsd_create(), but with the creation of
    nfsd_create_locked() we have a ref for dchild from the lookup in
    nfsd_create(), and then another ref in nfsd_create_locked(). The ref
    from the lookup in nfsd_create() is never dropped and results in
    dentries still in use at unmount.

    Signed-off-by: Josef Bacik
    Fixes: b44061d0b9 "nfsd: reorganize nfsd_create"
    Reported-by: kernel test robot
    Reviewed-by: Jeff Layton
    Acked-by: Al Viro
    Signed-off-by: J. Bruce Fields

    Josef Bacik
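
The imbalance described in this commit can be modelled with a bare reference counter. This is a loose user-space model, not the kernel code: dget() and dput() only adjust a counter, and sketch_create() and sketch_create_locked() are hypothetical names. The drop_lookup_ref flag selects between the leaking and the balanced behaviour.

```c
/* Minimal stand-in for a dentry: just a reference count. */
struct dentry {
    int refs;
};

static struct dentry *dget(struct dentry *d)
{
    d->refs++;
    return d;
}

static void dput(struct dentry *d)
{
    d->refs--;
}

/* The inner helper takes its own reference and drops it before returning. */
static int sketch_create_locked(struct dentry *dchild)
{
    dget(dchild);
    /* ... the object would be created here ... */
    dput(dchild);
    return 0;
}

/* The outer function holds a reference from its lookup of the child. */
static int sketch_create(struct dentry *dchild, int drop_lookup_ref)
{
    int err;

    dget(dchild);                       /* stands in for the lookup */
    err = sketch_create_locked(dchild);
    if (drop_lookup_ref)
        dput(dchild);                   /* balanced: lookup reference dropped */
    return err;
}
```

Calling sketch_create() with drop_lookup_ref set to 0 leaves the count at 1 afterwards, which corresponds to the "dentries still in use at unmount" symptom.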
     

05 Aug, 2016

9 commits

  • Pull nfsd updates from Bruce Fields:
    "Highlights:

    - Trond made a change to the server's tcp logic that allows a fast
    client to better take advantage of high bandwidth networks, but may
    increase the risk that a single client could starve other clients;
    a new sunrpc.svc_rpc_per_connection_limit parameter should help
    mitigate this in the (hopefully unlikely) event this becomes a
    problem in practice.

    - Tom Haynes added a minimal flex-layout pnfs server, which is of no
    use in production for now--don't build it unless you're doing
    client testing or further server development"

    * tag 'nfsd-4.8' of git://linux-nfs.org/~bfields/linux: (32 commits)
    nfsd: remove some dead code in nfsd_create_locked()
    nfsd: drop unnecessary MAY_EXEC check from create
    nfsd: clean up bad-type check in nfsd_create_locked
    nfsd: remove unnecessary positive-dentry check
    nfsd: reorganize nfsd_create
    nfsd: check d_can_lookup in fh_verify of directories
    nfsd: remove redundant zero-length check from create
    nfsd: Make creates return EEXIST instead of EACCES
    SUNRPC: Detect immediate closure of accepted sockets
    SUNRPC: accept() may return sockets that are still in SYN_RECV
    nfsd: allow nfsd to advertise multiple layout types
    nfsd: Close race between nfsd4_release_lockowner and nfsd4_lock
    nfsd/blocklayout: Make sure calculate signature/designator length aligned
    xfs: abstract block export operations from nfsd layouts
    SUNRPC: Remove unused callback xpo_adjust_wspace()
    SUNRPC: Change TCP socket space reservation
    SUNRPC: Add a server side per-connection limit
    SUNRPC: Micro optimisation for svc_data_ready
    SUNRPC: Call the default socket callbacks instead of open coding
    SUNRPC: lock the socket while detaching it
    ...

    Linus Torvalds
     
  • We changed this around in f135af1041f ('nfsd: reorganize nfsd_create')
    so "dchild" can't be an error pointer any more. Also, dchild can't be
    NULL here (and dput would already handle this even if it was).

    Signed-off-by: Dan Carpenter
    Signed-off-by: J. Bruce Fields

    Dan Carpenter
     
  • We need an fh_verify to make sure we at least have a dentry, but actual
    permission checks happen later.

    Signed-off-by: J. Bruce Fields

    J. Bruce Fields
     
  • Minor cleanup, no change in behavior.

    Signed-off-by: J. Bruce Fields

    J. Bruce Fields
     
  • vfs_{create,mkdir,mknod} each begin with a call to may_create(), which
    returns EEXIST if the object already exists.

    This check is therefore unnecessary.

    (In the NFSv2 case, nfsd_proc_create also has such a check. Contrary to
    RFC 1094, our code seems to believe that a CREATE of an existing file
    should succeed. I'm leaving that behavior alone.)

    Signed-off-by: J. Bruce Fields

    J. Bruce Fields
     
  • There's some odd logic in nfsd_create() that allows it to be called with
    the parent directory either locked or unlocked. The only already-locked
    caller is NFSv2's nfsd_proc_create(). It's less confusing to split out
    the already-locked case into a separate function which the NFSv2 code can call
    directly.

    Also fix some comments while we're here.

    Signed-off-by: J. Bruce Fields

    J. Bruce Fields
     
  • Create and other nfsd ops generally assume we can call lookup_one_len on
    inodes with S_IFDIR set. Al says that this assumption isn't true in
    general, though it should be for the filesystem objects nfsd sees.

    Add a check just to make sure our assumption isn't violated.

    Remove a couple checks for i_op->lookup in create code.

    Cc: Al Viro
    Signed-off-by: J. Bruce Fields

    J. Bruce Fields
     
  • lookup_one_len already has this check.

    The only effect of this patch is to return access instead of perm in the
    0-length-filename case. I actually prefer nfserr_perm (or _inval?), but
    I doubt anyone cares.

    The isdotent check seems redundant too, but I worry that some client
    might actually care about that strange nfserr_exist error.

    Signed-off-by: J. Bruce Fields

    J. Bruce Fields
     
  • When doing a create (mkdir/mknod) on a name, it's worth
    checking whether the name exists first before returning EACCES in case
    the directory is not writeable by the user.
    This makes return values on the client more consistent
    regardless of whether the entry is cached in the local
    cache or not.
    Another positive side effect is that certain programs expect only
    EEXIST in that case, even though POSIX allows any valid
    error to be returned.

    Signed-off-by: Oleg Drokin
    Signed-off-by: J. Bruce Fields

    Oleg Drokin
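
The ordering of checks described in this commit comes down to a few lines. This is a minimal sketch; check_create() is a hypothetical helper, and the two booleans stand in for the real lookup and permission checks.

```c
#include <errno.h>
#include <stdbool.h>

/* Report an existing name before complaining about directory permissions. */
static int check_create(bool name_exists, bool dir_writable)
{
    if (name_exists)
        return -EEXIST;     /* checked first, whatever the permissions are */
    if (!dir_writable)
        return -EACCES;
    return 0;
}
```

A client then sees EEXIST for an existing name whether or not the entry was already in its local cache.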
     

30 Jul, 2016

1 commit

  • Pull userns vfs updates from Eric Biederman:
    "This tree contains some very long awaited work on generalizing the
    user namespace support for mounting filesystems to include filesystems
    with a backing store. The real world target is fuse but the goal is
    to update the vfs to allow any filesystem to be supported. This
    patchset is based on a lot of code review and testing to approach that
    goal.

    While looking at what is needed to support the fuse filesystem it
    became clear that there were things like xattrs for security modules
    that needed special treatment. That the resolution of those concerns
    would not be fuse specific. That sorting out these general issues
    made most sense at the generic level, where the right people could be
    drawn into the conversation, and the issues could be solved for
    everyone.

    At a high level this patchset does a couple of simple things:

    - Add a user namespace owner (s_user_ns) to struct super_block.

    - Teach the vfs to handle filesystem uids and gids not mapping into
    kuids and kgids and being reported as INVALID_UID and
    INVALID_GID in vfs data structures.

    By assigning a user namespace owner filesystems that are mounted with
    only user namespace privilege can be detected. This allows security
    modules and the like to know which mounts may not be trusted. This
    also allows the set of uids and gids that are communicated to the
    filesystem to be capped at the set of kuids and kgids that are in the
    owning user namespace of the filesystem.

    One of the crazier corner cases this handles is the case of inodes
    whose i_uid or i_gid are not mapped into the vfs. Most of the code
    simply doesn't care but it is easy to confuse the inode writeback path
    so no operation that could cause an inode write-back is permitted for
    such inodes (aka only reads are allowed).

    This set of changes starts out by cleaning up the code paths involved
    in user namespace permitted mounts. Then when things are clean enough
    adds code that cleanly sets s_user_ns. Then additional restrictions
    are added that are possible now that the filesystem superblock
    contains owner information.

    These changes should not affect anyone in practice, but there are some
    parts of these restrictions that are changes in behavior.

    - Andy's restriction on suid executables that does not honor the
    suid bit when the path is from another mount namespace (think
    /proc/[pid]/fd/) or when the filesystem was mounted by a less
    privileged user.

    - The replacement of the user namespace implicit setting of MNT_NODEV
    with implicitly setting SB_I_NODEV on the filesystem superblock
    instead.

    Using SB_I_NODEV is a stronger form that happens to make this state
    user invisible. The user visibility can be managed but it caused
    problems when it was introduced from applications reasonably
    expecting mount flags to be what they were set to.

    There is a little bit of work remaining before it is safe to support
    mounting filesystems with backing store in user namespaces, beyond
    what is in this set of changes.

    - Verifying the mounter has permission to read/write the block device
    during mount.

    - Teaching the integrity modules IMA and EVM to handle filesystems
    mounted with only user namespace root and to reduce trust in their
    security xattrs accordingly.

    - Capturing the mounter's credentials and using that for permission
    checks in d_automount and the like. (Given that overlayfs already
    does this, and we need the work in d_automount it makes sense to
    generalize this case).

    Furthermore there are a few changes that are on the wishlist:

    - Get all filesystems supporting posix acls using the generic posix
    acls so that posix_acl_fix_xattr_from_user and
    posix_acl_fix_xattr_to_user may be removed. [Maintainability]

    - Reducing the permission checks in places such as remount to allow
    the superblock owner to perform them.

    - Allowing the superblock owner to chown files with unmapped uids and
    gids to something that is mapped so the files may be treated
    normally.

    I am not considering even obvious relaxations of permission checks
    until it is clear there are no more corner cases that need to be
    locked down and handled generically.

    Many thanks to Seth Forshee who kept this code alive, for putting up
    with me rewriting substantial portions of what he did to handle more
    corner cases, and for his diligent testing and reviewing of my
    changes"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (30 commits)
    fs: Call d_automount with the filesystems creds
    fs: Update i_[ug]id_(read|write) to translate relative to s_user_ns
    evm: Translate user/group ids relative to s_user_ns when computing HMAC
    dquot: For now explicitly don't support filesystems outside of init_user_ns
    quota: Handle quota data stored in s_user_ns in quota_setxquota
    quota: Ensure qids map to the filesystem
    vfs: Don't create inodes with a uid or gid unknown to the vfs
    vfs: Don't modify inodes with a uid or gid unknown to the vfs
    cred: Reject inodes with invalid ids in set_create_file_as()
    fs: Check for invalid i_uid in may_follow_link()
    vfs: Verify acls are valid within superblock's s_user_ns.
    userns: Handle -1 in k[ug]id_has_mapping when !CONFIG_USER_NS
    fs: Refuse uid/gid changes which don't map into s_user_ns
    selinux: Add support for unprivileged mounts from user namespaces
    Smack: Handle labels consistently in untrusted mounts
    Smack: Add support for unprivileged mounts from user namespaces
    fs: Treat foreign mounts as nosuid
    fs: Limit file caps to the user namespace of the super block
    userns: Remove the now unnecessary FS_USERNS_DEV_MOUNT flag
    userns: Remove implicit MNT_NODEV fragility.
    ...

    Linus Torvalds
     

29 Jul, 2016

1 commit

  • Pull vfs updates from Al Viro:
    "Assorted cleanups and fixes.

    Probably the most interesting part long-term is ->d_init() - that will
    have a bunch of followups in (at least) ceph and lustre, but we'll
    need to sort the barrier-related rules before it can get used for
    really non-trivial stuff.

    Another fun thing is the merge of ->d_iput() callers (dentry_iput()
    and dentry_unlink_inode()) and a bunch of ->d_compare() ones (all
    except the one in __d_lookup_lru())"

    * 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (26 commits)
    fs/dcache.c: avoid soft-lockup in dput()
    vfs: new d_init method
    vfs: Update lookup_dcache() comment
    bdev: get rid of ->bd_inodes
    Remove last traces of ->sync_page
    new helper: d_same_name()
    dentry_cmp(): use lockless_dereference() instead of smp_read_barrier_depends()
    vfs: clean up documentation
    vfs: document ->d_real()
    vfs: merge .d_select_inode() into .d_real()
    unify dentry_iput() and dentry_unlink_inode()
    binfmt_misc: ->s_root is not going anywhere
    drop redundant ->owner initializations
    ufs: get rid of redundant checks
    orangefs: constify inode_operations
    missed comment updates from ->direct_IO() prototype change
    file_inode(f)->i_mapping is f->f_mapping
    trim fsnotify hooks a bit
    9p: new helper - v9fs_parent_fid()
    debugfs: ->d_parent is never NULL or negative
    ...

    Linus Torvalds
     

28 Jul, 2016

1 commit

  • Pull xfs updates from Dave Chinner:
    "The major addition is the new iomap based block mapping
    infrastructure. We've been kicking this about locally for years, but
    there are other filesystems that want to use it too (e.g. gfs2). Now it
    is fully working, reviewed and ready to be merged and used by other
    filesystems.

    There are a lot of other fixes and cleanups in the tree, but those are
    XFS internal things and none are of the scale or visibility of the
    iomap changes. See below for details.

    I am likely to send another pull request next week - we're just about
    ready to merge some new functionality (on disk block->owner reverse
    mapping infrastructure), but that's a huge chunk of code (74 files
    changed, 7283 insertions(+), 1114 deletions(-)) so I'm keeping that
    separate to all the "normal" pull request changes so they don't get
    lost in the noise.

    Summary of changes in this update:
    - generic iomap based IO path infrastructure
    - generic iomap based fiemap implementation
    - xfs iomap based IO path implementation
    - buffer error handling fixes
    - tracking of in flight buffer IO for unmount serialisation
    - direct IO and DAX IO path separation and simplification
    - shortform directory format definition changes for wider platform
    compatibility
    - various buffer cache fixes
    - cleanups in preparation for rmap merge
    - error injection cleanups and fixes
    - log item format buffer memory allocation restructuring to prevent
    rare OOM reclaim deadlocks
    - sparse inode chunks are now fully supported"

    * tag 'xfs-for-linus-4.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs: (53 commits)
    xfs: remove EXPERIMENTAL tag from sparse inode feature
    xfs: bufferhead chains are invalid after end_page_writeback
    xfs: allocate log vector buffers outside CIL context lock
    libxfs: directory node splitting does not have an extra block
    xfs: remove dax code from object file when disabled
    xfs: skip dirty pages in ->releasepage()
    xfs: remove __arch_pack
    xfs: kill xfs_dir2_inou_t
    xfs: kill xfs_dir2_sf_off_t
    xfs: split direct I/O and DAX path
    xfs: direct calls in the direct I/O path
    xfs: stop using generic_file_read_iter for direct I/O
    xfs: split xfs_file_read_iter into buffered and direct I/O helpers
    xfs: remove s_maxbytes enforcement in xfs_file_read_iter
    xfs: kill ioflags
    xfs: don't pass ioflags around in the ioctl path
    xfs: track and serialize in-flight async buffers against unmount
    xfs: exclude never-released buffers from buftarg I/O accounting
    xfs: don't reset b_retries to 0 on every failure
    xfs: remove extraneous buffer flag changes
    ...

    Linus Torvalds
     

16 Jul, 2016

2 commits

  • If the underlying filesystem supports multiple layout types, then there
    is little reason not to advertise that fact to clients and let them
    choose what type to use.

    Turn the ex_layout_type field into a bitfield. For each supported
    layout type, we set a bit in that field. When the client requests a
    layout, ensure that the bit for that layout type is set. When the
    client requests attributes, send back a list of supported types.

    Signed-off-by: Jeff Layton
    Reviewed-by: Weston Andros Adamson
    Signed-off-by: J. Bruce Fields

    Jeff Layton
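
    The bitfield scheme described in the commit message above can be
    sketched in plain C as follows. This is an illustration only, not the
    nfsd code: the demo_* names, the struct, and the layout type numbers
    are hypothetical stand-ins chosen for the example. It shows one bit
    being set per supported layout type, a check against that bit when a
    layout is requested, and a walk over the set bits to build the list
    of supported types.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout type numbers, used only for this illustration. */
enum demo_layout_type {
	DEMO_LAYOUT_FILES = 1,
	DEMO_LAYOUT_BLOCK = 3,
	DEMO_LAYOUT_SCSI  = 5,
	DEMO_LAYOUT_MAX   = 6,	/* one past the highest type number */
};

/* Hypothetical stand-in for an export: one bit per supported type. */
struct demo_export {
	uint32_t layout_types;
};

/* Mark a layout type as supported by setting its bit. */
static void demo_export_add_layout(struct demo_export *exp, unsigned int type)
{
	if (type < DEMO_LAYOUT_MAX)
		exp->layout_types |= UINT32_C(1) << type;
}

/* Check a requested layout type against the bitfield. */
static bool demo_export_supports(const struct demo_export *exp,
				 unsigned int type)
{
	if (type >= DEMO_LAYOUT_MAX)
		return false;
	return (exp->layout_types & (UINT32_C(1) << type)) != 0;
}

/* Fill 'out' with the supported types in ascending order; return the count. */
static size_t demo_export_list_layouts(const struct demo_export *exp,
				       unsigned int *out, size_t cap)
{
	size_t n = 0;

	for (unsigned int t = 0; t < DEMO_LAYOUT_MAX && n < cap; t++)
		if (demo_export_supports(exp, t))
			out[n++] = t;
	return n;
}
```

    A single integer holding one bit per type keeps the "is this type
    allowed" check to one mask operation, and the same field doubles as
    the source for the list returned to the client.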
     
  • nfsd4_release_lockowner finds a lock owner that has no lock state,
    and drops cl_lock. Then release_lockowner picks up cl_lock and
    unhashes the lock owner.

    During the window where cl_lock is dropped, I don't see anything
    preventing a concurrent nfsd4_lock from finding that same lock owner
    and adding lock state to it.

    Move release_lockowner() into nfsd4_release_lockowner and hang onto
    the cl_lock until after the lock owner's state cannot be found
    again.

    Found by inspection; we don't currently have a reproducer.

    Fixes: 2c41beb0e5cf ("nfsd: reduce cl_lock thrashing in ... ")
    Reviewed-by: Jeff Layton
    Signed-off-by: Chuck Lever
    Signed-off-by: J. Bruce Fields

    Chuck Lever
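
    The locking pattern described in the commit message above (check that
    the owner has no lock state and unhash it within a single hold of the
    lock) can be sketched in userspace C. This is a simplified model, not
    the nfsd implementation: the demo_* names and structures are
    hypothetical, and a pthread mutex stands in for cl_lock.

```c
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-in for a lock owner. */
struct demo_owner {
	bool hashed;	/* true while lookups can still find the owner */
	int  nr_locks;	/* amount of lock state attached to the owner */
};

/* Hypothetical stand-in for a client; 'lock' plays the role of cl_lock. */
struct demo_client {
	pthread_mutex_t   lock;
	struct demo_owner owner;
};

/*
 * Release path: the "no lock state" check and the unhash happen under
 * one hold of the lock, so no lock state can be added in between.
 * Returns true if the owner was unhashed.
 */
static bool demo_release_owner(struct demo_client *clp)
{
	bool released = false;

	pthread_mutex_lock(&clp->lock);
	if (clp->owner.hashed && clp->owner.nr_locks == 0) {
		clp->owner.hashed = false;
		released = true;
	}
	pthread_mutex_unlock(&clp->lock);
	return released;
}

/*
 * Lock path: state is attached only if the owner is still hashed.
 * Returns true if lock state was added.
 */
static bool demo_add_lock(struct demo_client *clp)
{
	bool added = false;

	pthread_mutex_lock(&clp->lock);
	if (clp->owner.hashed) {
		clp->owner.nr_locks++;
		added = true;
	}
	pthread_mutex_unlock(&clp->lock);
	return added;
}
```

    Had demo_release_owner() dropped the mutex between the nr_locks check
    and the unhash, a concurrent demo_add_lock() could have attached
    state to an owner that was about to disappear, which is the window
    the commit message describes.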