24 Oct, 2014

1 commit

  • Unknown operation numbers are caught in nfsd4_decode_compound() which
    sets op->opnum to OP_ILLEGAL and op->status to nfserr_op_illegal. The
    error causes the main loop in nfsd4_proc_compound() to skip most
    processing. But nfsd4_proc_compound() also peeks ahead at the next
    operation in one case and doesn't take similar precautions there.

    Cc: stable@vger.kernel.org
    Signed-off-by: J. Bruce Fields

    J. Bruce Fields
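
    A minimal user-space sketch of the hazard described in this commit,
    not the kernel code itself. The table, the small opnum values and the
    helper names (decode_op, next_op_needs_fh) are hypothetical; only
    OP_ILLEGAL and its protocol value 10044 come from NFSv4. The point is
    that a peek at the next operation must check for OP_ILLEGAL before
    using the opnum as a table index.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified model; opnum values other than OP_ILLEGAL are made up. */
enum { OP_DEMO_GETATTR = 0, OP_DEMO_PUTFH = 1, OP_TABLE_SIZE = 2, OP_ILLEGAL = 10044 };

struct op_desc { int needs_current_fh; };

static const struct op_desc op_table[OP_TABLE_SIZE] = {
    [OP_DEMO_GETATTR] = { .needs_current_fh = 1 },
    [OP_DEMO_PUTFH]   = { .needs_current_fh = 0 },
};

struct op { int opnum; };

/* Decoder step: any operation number outside the table becomes OP_ILLEGAL. */
static void decode_op(struct op *op, int wire_opnum)
{
    if (wire_opnum < 0 || wire_opnum >= OP_TABLE_SIZE)
        op->opnum = OP_ILLEGAL;
    else
        op->opnum = wire_opnum;
}

/*
 * Peek at the operation after ops[i]. Without the OP_ILLEGAL check the
 * table lookup below would index far outside op_table.
 */
static int next_op_needs_fh(const struct op *ops, size_t nops, size_t i)
{
    assert(i < nops);
    if (i + 1 >= nops)
        return 0;                       /* no next operation */
    if (ops[i + 1].opnum == OP_ILLEGAL)
        return 0;                       /* same precaution the main loop takes */
    return op_table[ops[i + 1].opnum].needs_current_fh;
}
```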
     

21 Oct, 2014

1 commit

  • We added this new estimator function but forgot to hook it up. The
    effect is that NFSv4.1 (and greater) won't do zero-copy reads.

    The estimate was also wrong by 8 bytes.

    Fixes: ccae70a9ee41 "nfsd4: estimate sequence response size"
    Cc: stable@vger.kernel.org
    Reported-by: Chuck Lever
    Signed-off-by: J. Bruce Fields

    J. Bruce Fields
     

13 Oct, 2014

1 commit

  • Pull scheduler updates from Ingo Molnar:
    "The main changes in this cycle were:

    - Optimized support for Intel "Cluster-on-Die" (CoD) topologies (Dave
    Hansen)

    - Various sched/idle refinements for better idle handling (Nicolas
    Pitre, Daniel Lezcano, Chuansheng Liu, Vincent Guittot)

    - sched/numa updates and optimizations (Rik van Riel)

    - sysbench speedup (Vincent Guittot)

    - capacity calculation cleanups/refactoring (Vincent Guittot)

    - Various cleanups to thread group iteration (Oleg Nesterov)

    - Double-rq-lock removal optimization and various refactorings
    (Kirill Tkhai)

    - various sched/deadline fixes

    ... and lots of other changes"

    * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (72 commits)
    sched/dl: Use dl_bw_of() under rcu_read_lock_sched()
    sched/fair: Delete resched_cpu() from idle_balance()
    sched, time: Fix build error with 64 bit cputime_t on 32 bit systems
    sched: Improve sysbench performance by fixing spurious active migration
    sched/x86: Fix up typo in topology detection
    x86, sched: Add new topology for multi-NUMA-node CPUs
    sched/rt: Use resched_curr() in task_tick_rt()
    sched: Use rq->rd in sched_setaffinity() under RCU read lock
    sched: cleanup: Rename 'out_unlock' to 'out_free_new_mask'
    sched: Use dl_bw_of() under RCU read lock
    sched/fair: Remove duplicate code from can_migrate_task()
    sched, mips, ia64: Remove __ARCH_WANT_UNLOCKED_CTXSW
    sched: print_rq(): Don't use tasklist_lock
    sched: normalize_rt_tasks(): Don't use _irqsave for tasklist_lock, use task_rq_lock()
    sched: Fix the task-group check in tg_has_rt_tasks()
    sched/fair: Leverage the idle state info when choosing the "idlest" cpu
    sched: Let the scheduler see CPU idle states
    sched/deadline: Fix inter- exclusive cpusets migrations
    sched/deadline: Clear dl_entity params when setscheduling to different class
    sched/numa: Kill the wrong/dead TASK_DEAD check in task_numa_fault()
    ...

    Linus Torvalds
     

12 Oct, 2014

2 commits

  • Pull security subsystem updates from James Morris.

    Mostly ima, selinux, smack and key handling updates.

    * 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: (65 commits)
    integrity: do zero padding of the key id
    KEYS: output last portion of fingerprint in /proc/keys
    KEYS: strip 'id:' from ca_keyid
    KEYS: use swapped SKID for performing partial matching
    KEYS: Restore partial ID matching functionality for asymmetric keys
    X.509: If available, use the raw subjKeyId to form the key description
    KEYS: handle error code encoded in pointer
    selinux: normalize audit log formatting
    selinux: cleanup error reporting in selinux_nlmsg_perm()
    KEYS: Check hex2bin()'s return when generating an asymmetric key ID
    ima: detect violations for mmaped files
    ima: fix race condition on ima_rdwr_violation_check and process_measurement
    ima: added ima_policy_flag variable
    ima: return an error code from ima_add_boot_aggregate()
    ima: provide 'ima_appraise=log' kernel option
    ima: move keyring initialization to ima_init()
    PKCS#7: Handle PKCS#7 messages that contain no X.509 certs
    PKCS#7: Better handling of unsupported crypto
    KEYS: Overhaul key identification when searching for asymmetric keys
    KEYS: Implement binary asymmetric key ID handling
    ...

    Linus Torvalds
     
  • Pull file locking related changes from Jeff Layton:
    "This release is a little more busy for file locking changes than the
    last:

    - a set of patches from Kinglong Mee to fix the lockowner handling in
    knfsd
    - a pile of cleanups to the internal file lease API. This should get
    us a bit closer to allowing for setlease methods that can block.

    There are some dependencies between mine and Bruce's trees this cycle,
    and I based my tree on top of the requisite patches in Bruce's tree"

    * tag 'locks-v3.18-1' of git://git.samba.org/jlayton/linux: (26 commits)
    locks: fix fcntl_setlease/getlease return when !CONFIG_FILE_LOCKING
    locks: flock_make_lock should return a struct file_lock (or PTR_ERR)
    locks: set fl_owner for leases to filp instead of current->files
    locks: give lm_break a return value
    locks: __break_lease cleanup in preparation of allowing direct removal of leases
    locks: remove i_have_this_lease check from __break_lease
    locks: move freeing of leases outside of i_lock
    locks: move i_lock acquisition into generic_*_lease handlers
    locks: define a lm_setup handler for leases
    locks: plumb a "priv" pointer into the setlease routines
    nfsd: don't keep a pointer to the lease in nfs4_file
    locks: clean up vfs_setlease kerneldoc comments
    locks: generic_delete_lease doesn't need a file_lock at all
    nfsd: fix potential lease memory leak in nfs4_setlease
    locks: close potential race in lease_get_mtime
    security: make security_file_set_fowner, f_setown and __f_setown void return
    locks: consolidate "nolease" routines
    locks: remove lock_may_read and lock_may_write
    lockd: rip out deferred lock handling from testlock codepath
    NFSD: Get reference of lockowner when coping file_lock
    ...

    Linus Torvalds
     

09 Oct, 2014

1 commit

  • Pull nfsd updates from Bruce Fields:
    "Highlights:

    - support the NFSv4.2 SEEK operation (allowing clients to support
    SEEK_HOLE/SEEK_DATA), thanks to Anna.
    - end the grace period early in a number of cases, mitigating a
    long-standing annoyance, thanks to Jeff
    - improve SMP scalability, thanks to Trond"

    * 'for-3.18' of git://linux-nfs.org/~bfields/linux: (55 commits)
    nfsd: eliminate "to_delegation" define
    NFSD: Implement SEEK
    NFSD: Add generic v4.2 infrastructure
    svcrdma: advertise the correct max payload
    nfsd: introduce nfsd4_callback_ops
    nfsd: split nfsd4_callback initialization and use
    nfsd: introduce a generic nfsd4_cb
    nfsd: remove nfsd4_callback.cb_op
    nfsd: do not clear rpc_resp in nfsd4_cb_done_sequence
    nfsd: fix nfsd4_cb_recall_done error handling
    nfsd4: clarify how grace period ends
    nfsd4: stop grace_time update at end of grace period
    nfsd: skip subsequent UMH "create" operations after the first one for v4.0 clients
    nfsd: set and test NFSD4_CLIENT_STABLE bit to reduce nfsdcltrack upcalls
    nfsd: serialize nfsdcltrack upcalls for a particular client
    nfsd: pass extra info in env vars to upcalls to allow for early grace period end
    nfsd: add a v4_end_grace file to /proc/fs/nfsd
    lockd: add a /proc/fs/lockd/nlm_end_grace file
    nfsd: reject reclaim request when client has already sent RECLAIM_COMPLETE
    nfsd: remove redundant boot_time parm from grace_done client tracking op
    ...

    Linus Torvalds
     

08 Oct, 2014

7 commits

  • Christoph suggests:

    "Add a return value to lm_break so that the lock manager can tell the
    core code "you can delete this lease right now". That gets rid of
    the games with the timeout which require all kinds of race avoidance
    code in the users."

    Do that here and have the nfsd lease break routine use it when it detects
    that there was a race between setting up the lease and it being broken.

    Signed-off-by: Jeff Layton
    Reviewed-by: Christoph Hellwig

    Jeff Layton
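
    The idea of giving lm_break a return value can be sketched as follows.
    This is a user-space model under stated assumptions, not the locks.c
    implementation: struct lease, its fields, demo_lm_break and
    break_lease are hypothetical names. A true return means "delete this
    lease right now"; false falls back to the timeout path.

```c
#include <assert.h>
#include <stdbool.h>

struct lease;

struct lease_ops {
    /* Returns true if the core code may delete the lease immediately. */
    bool (*lm_break)(struct lease *l);
};

struct lease {
    const struct lease_ops *ops;
    bool active;        /* still on the (modelled) lease list */
    bool setup_done;    /* the lock manager finished setting it up */
    long break_time;    /* 0 means no break timeout is pending */
};

/* Hypothetical lock manager: a lease broken before setup completed can go at once. */
static bool demo_lm_break(struct lease *l)
{
    return !l->setup_done;
}

static void break_lease(struct lease *l, long now, long timeout)
{
    assert(l->active);
    if (l->ops->lm_break(l))
        l->active = false;              /* delete right away, no timeout games */
    else
        l->break_time = now + timeout;  /* wait for the holder to give it up */
}
```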
     
  • There was only one place where we still could free a file_lock while
    holding the i_lock -- lease_modify. Add a new list_head argument to the
    lm_change operation, pass in a private list when calling it, and fix
    those callers to dispose of the list once the lock has been dropped.

    Signed-off-by: Jeff Layton
    Reviewed-by: Christoph Hellwig

    Jeff Layton
     
  • ...and move the fasync setup into it for fcntl lease calls. At the same
    time, change the semantics of how the file_lock double-pointer is
    handled. Up until now, on a successful lease return you got a pointer to
    the lock on the list. This is bad, since that pointer can no longer be
    relied on as valid once the inode->i_lock has been released.

    Change the code to instead just zero out the pointer if the lease we
    passed in ended up being used. Then the callers can just check to see
    if it's NULL after the call and free it if it isn't.

    The priv argument has the same semantics. The lm_setup function can
    zero the pointer out to signal to the caller that it should not be
    freed after the function returns.

    Signed-off-by: Jeff Layton
    Reviewed-by: Christoph Hellwig

    Jeff Layton
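
    The double-pointer convention described above can be sketched in user
    space like this. All names here (inode_model, set_lease, add_lease,
    the frees counter) are hypothetical and the locking is left out; the
    sketch only shows the ownership rule: the callee zeroes the pointer
    when it keeps the lease, and the caller frees whatever is left.

```c
#include <assert.h>
#include <stdlib.h>

struct lease { int type; };
struct inode_model { struct lease *lease; };

static int frees;   /* counts how often the caller freed its own allocation */

/* If the lease passed in ends up being used, *flp is set to NULL. */
static int set_lease(struct inode_model *ino, struct lease **flp)
{
    assert(flp && *flp);
    if (ino->lease) {
        /* An existing lease is reused; the caller still owns *flp. */
        ino->lease->type = (*flp)->type;
        return 0;
    }
    ino->lease = *flp;
    *flp = NULL;        /* ownership moved to the inode */
    return 0;
}

static int add_lease(struct inode_model *ino, int type)
{
    struct lease *fl = malloc(sizeof(*fl));
    int ret;

    if (!fl)
        return -1;
    fl->type = type;
    ret = set_lease(ino, &fl);
    if (fl) {           /* not consumed: free it here so it cannot leak */
        free(fl);
        frees++;
    }
    return ret;
}
```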
     
  • In later patches, we're going to add a new lock_manager_operation to
    finish setting up the lease while still holding the i_lock. To do
    this, we'll need to pass a little bit of info in the fcntl setlease
    case (primarily an fasync structure). Plumb the extra pointer into
    there in advance of that.

    We declare this pointer as a void ** to make it clear that this is
    private info, and that the caller isn't required to set this unless
    the lm_setup specifically requires it.

    Signed-off-by: Jeff Layton
    Reviewed-by: Christoph Hellwig

    Jeff Layton
     
  • Now that we don't need to pass in an actual lease pointer to
    vfs_setlease on unlock, we can stop tracking a pointer to the lease in
    the nfs4_file.

    Switch all of the places that check the fi_lease to check fi_deleg_file
    instead. We always set that at the same time so it will have the same
    semantics.

    Cc: J. Bruce Fields
    Signed-off-by: Jeff Layton
    Reviewed-by: Christoph Hellwig

    Jeff Layton
     
  • Ensure that it's OK to pass in a NULL file_lock double pointer on
    a F_UNLCK request and convert the vfs_setlease F_UNLCK callers to
    do just that.

    Finally, turn the BUG_ON in generic_setlease into a WARN_ON_ONCE
    with an error return. That's a problem we can handle without
    crashing the box if it occurs.

    Signed-off-by: Jeff Layton
    Reviewed-by: Christoph Hellwig

    Jeff Layton
     
  • It's unlikely to ever occur, but if there were already a lease set on
    the file then we could end up getting back a different pointer on a
    successful setlease attempt than the one we allocated. If that happens,
    the one we allocated could leak.

    In practice, I don't think this will happen due to the fact that we only
    try to set up the lease once per nfs4_file, but this error handling is a
    bit more correct given the current lease API.

    Cc: J. Bruce Fields
    Signed-off-by: Jeff Layton
    Reviewed-by: Christoph Hellwig

    Jeff Layton
     

02 Oct, 2014

1 commit

  • We now have cb_to_delegation and to_delegation, which do the same thing
    and are defined separately in different .c files. Move the
    cb_to_delegation definition into a header file and eliminate the
    redundant to_delegation definition.

    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jeff Layton

    Jeff Layton
     

01 Oct, 2014

1 commit

  • The calculation of page_ptr here is wrong in the case the read doesn't
    start at an offset that is a multiple of a page.

    The result is that nfs4svc_encode_compoundres sets rq_next_page to a
    value one too small, and then the loop in svc_free_res_pages may
    incorrectly fail to clear a page pointer in rq_respages[].

    Pages left in rq_respages[] are available for the next rpc request to
    use, so xdr data may be written to that page, which may hold data still
    waiting to be transmitted to the client or data in the page cache.

    The observed result was silent data corruption seen on an NFSv4 client.

    We tag this as "fixing" 05638dc73af2 because that commit exposed this
    bug, though the incorrect calculation predates it.

    Particular thanks to Andrea Arcangeli and David Gilbert for analysis and
    testing.

    Fixes: 05638dc73af2 "nfsd4: simplify server xdr->next_page use"
    Cc: stable@vger.kernel.org
    Reported-by: Andrea Arcangeli
    Tested-by: "Dr. David Alan Gilbert"
    Signed-off-by: J. Bruce Fields

    J. Bruce Fields
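
    The arithmetic behind this kind of off-by-one can be illustrated with
    a generic page-span calculation. This is not the nfsd code: the page
    size constant and both helper names are assumptions made for the
    example. A count that ignores the starting offset within the first
    page comes out one too small whenever the data crosses an extra page
    boundary.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096u   /* assumed page size for the example */

/* Naive count: correct only when the data starts on a page boundary. */
static size_t pages_naive(size_t len)
{
    return (len + SKETCH_PAGE_SIZE - 1) / SKETCH_PAGE_SIZE;
}

/* Count that accounts for the starting offset within the first page. */
static size_t pages_spanned(size_t off, size_t len)
{
    assert(off < SKETCH_PAGE_SIZE);
    if (len == 0)
        return 0;
    return (off + len + SKETCH_PAGE_SIZE - 1) / SKETCH_PAGE_SIZE;
}
```

    For a 4096-byte read starting 100 bytes into a page, the naive count
    is 1 while the data actually touches 2 pages.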
     

30 Sep, 2014

3 commits


27 Sep, 2014

6 commits

  • Add a higher level abstraction than the rpc_ops for callback operations.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Jeff Layton
    Signed-off-by: J. Bruce Fields

    Christoph Hellwig
     
  • Split out initializing the nfs4_callback structure from using it. For
    the NULL callback this gets rid of tons of pointless re-initializations.

    Note that I don't quite understand what protects us from running multiple
    NULL callbacks at the same time, but at least this change doesn't make
    it worse.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Jeff Layton
    Signed-off-by: J. Bruce Fields

    Christoph Hellwig
     
  • Add a helper to queue up a callback. CB_NULL has a bit of special casing
    because it is special in the specification, but all other new callback
    operations will be able to share code with this and a few more changes
    to refactor the callback code.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Jeff Layton
    Signed-off-by: J. Bruce Fields

    Christoph Hellwig
     
  • We can always get at the private data by using container_of, no need for
    a void pointer. Also introduce a little to_delegation helper to avoid
    opencoding the container_of everywhere.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Jeff Layton
    Signed-off-by: J. Bruce Fields

    Christoph Hellwig
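
    The container_of technique mentioned above, sketched in portable C.
    The macro is a simplified user-space version without the kernel's
    type checking, and the struct and field names are hypothetical
    stand-ins for the nfsd structures.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified container_of: recover the enclosing struct from a member pointer. */
#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

struct callback { int done; };

struct delegation {
    int id;
    struct callback recall;   /* embedded, so no void pointer is needed */
};

/* Small helper so callers do not open-code the container_of everywhere. */
static struct delegation *cb_to_delegation(struct callback *cb)
{
    return container_of(cb, struct delegation, recall);
}
```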
     
  • This is incorrect when a callback has to be restarted, in which case
    the XDR decoding of the second iteration will see a NULL cb argument.

    [hch: updated description]
    Signed-off-by: Benny Halevy
    Signed-off-by: Christoph Hellwig
    Signed-off-by: J. Bruce Fields

    Benny Halevy
     
  • For any error that is not EBADHANDLE or NFS4ERR_BAD_STATEID,
    nfsd4_cb_recall_done first marks the connection down, then
    retries until dl_retries hits zero, then marks the connection down
    again and sets cb_done. This changes the code to only retry
    for EBADHANDLE or NFS4ERR_BAD_STATEID, and factors setting
    cb_done into a single point in the function.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: J. Bruce Fields

    Christoph Hellwig
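
    A rough model of the control flow this commit describes, assuming
    made-up status codes and field names (CB_*, recall_cb, retries_left);
    it is not the body of nfsd4_cb_recall_done. Only the two retryable
    errors consume retries, any other failure marks the connection down
    once, and cb_done is set at a single point.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical status codes standing in for 0, EBADHANDLE, NFS4ERR_BAD_STATEID, other. */
enum { CB_OK = 0, CB_EBADHANDLE = 1, CB_BAD_STATEID = 2, CB_OTHER_ERROR = 3 };

struct recall_cb {
    int retries_left;
    bool conn_down;
    bool cb_done;
};

static bool is_retryable(int status)
{
    return status == CB_EBADHANDLE || status == CB_BAD_STATEID;
}

/* Returns true when the caller should resend the recall. */
static bool recall_done(struct recall_cb *cb, int status)
{
    if (is_retryable(status) && cb->retries_left > 0) {
        cb->retries_left--;
        return true;
    }
    if (status != CB_OK)
        cb->conn_down = true;   /* includes retryable errors that ran out of retries */
    cb->cb_done = true;         /* the only place cb_done is set */
    return false;
}
```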
     

19 Sep, 2014

1 commit

  • schedule(), io_schedule() and schedule_timeout() always return
    with TASK_RUNNING state set, so one more setting is unnecessary.

    (All places in the patch are visibly correct; the only exception is
    kiblnd_scheduler() from:

    drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd_cb.c

    Its schedule() is one line above the standard 3 lines of unified diff
    context.)

    There are no places where set_current_state() is used for mb().

    Signed-off-by: Kirill Tkhai
    Signed-off-by: Peter Zijlstra (Intel)
    Link: http://lkml.kernel.org/r/1410529254.3569.23.camel@tkhai
    Cc: Alasdair Kergon
    Cc: Anil Belur
    Cc: Arnd Bergmann
    Cc: Dave Kleikamp
    Cc: David Airlie
    Cc: David Howells
    Cc: Dmitry Eremin
    Cc: Frank Blaschka
    Cc: Greg Kroah-Hartman
    Cc: Heiko Carstens
    Cc: Helge Deller
    Cc: Isaac Huang
    Cc: James E.J. Bottomley
    Cc: J. Bruce Fields
    Cc: Jeff Dike
    Cc: Jesper Nilsson
    Cc: Jiri Slaby
    Cc: Laura Abbott
    Cc: Liang Zhen
    Cc: Linus Torvalds
    Cc: Martin Schwidefsky
    Cc: Masaru Nomura
    Cc: Michael Opdenacker
    Cc: Mikael Starvik
    Cc: Mike Snitzer
    Cc: Neil Brown
    Cc: Oleg Drokin
    Cc: Peng Tao
    Cc: Richard Weinberger
    Cc: Robert Love
    Cc: Steven Rostedt
    Cc: Trond Myklebust
    Cc: Ursula Braun
    Cc: Zi Shen Lim
    Cc: devel@driverdev.osuosl.org
    Cc: dm-devel@redhat.com
    Cc: dri-devel@lists.freedesktop.org
    Cc: fcoe-devel@open-fcoe.org
    Cc: jfs-discussion@lists.sourceforge.net
    Cc: linux390@de.ibm.com
    Cc: linux-afs@lists.infradead.org
    Cc: linux-cris-kernel@axis.com
    Cc: linux-kernel@vger.kernel.org
    Cc: linux-nfs@vger.kernel.org
    Cc: linux-parisc@vger.kernel.org
    Cc: linux-raid@vger.kernel.org
    Cc: linux-s390@vger.kernel.org
    Cc: linux-scsi@vger.kernel.org
    Cc: qla2xxx-upstream@qlogic.com
    Cc: user-mode-linux-devel@lists.sourceforge.net
    Cc: user-mode-linux-user@lists.sourceforge.net
    Signed-off-by: Ingo Molnar

    Kirill Tkhai
     

18 Sep, 2014

10 commits

  • The grace period is ended in two steps--first userland is notified that
    the grace period is now long enough that any clients who have not yet
    reclaimed can be safely forgotten, then we flip the switch that forbids
    reclaims and allows new opens. I had to think a bit to convince myself
    that the ordering was right here. Document it.

    Signed-off-by: J. Bruce Fields

    J. Bruce Fields
     
  • The attempt to automatically set a new grace period time at the end of
    the grace period isn't really helpful. We'll probably shut down and
    reboot before we actually make use of the new grace period time anyway.
    So may as well leave it up to the init system to get this right.

    This just confuses people when they see /proc/fs/nfsd/nfsv4gracetime
    change from what they set it to.

    Signed-off-by: J. Bruce Fields

    J. Bruce Fields
     
  • In the case of v4.0 clients, we may call into the "create" client
    tracking operation multiple times (once for each openowner). Upcalling
    for each one of those is wasteful and slow, however. We can skip doing
    further "create" operations after the first one if we know that one has
    already been done.

    v4.1+ clients generally only call into this function once (on
    RECLAIM_COMPLETE), and we can't skip upcalling on the create even if the
    STABLE bit is set. Doing so would make it impossible for nfsdcltrack to
    lift the grace period early since the timestamp has a different meaning
    in the case where the client is expected to issue a RECLAIM_COMPLETE.

    Signed-off-by: Jeff Layton

    Jeff Layton
     
  • The nfsdcltrack upcall doesn't utilize the NFSD4_CLIENT_STABLE flag,
    which basically results in an upcall every time we call into the client
    tracking ops.

    Change it to set this bit on a successful "check" or "create" request,
    and clear it on a "remove" request. Also, check to see if that bit is
    set before upcalling on a "check" or "remove" request, and skip
    upcalling appropriately, depending on its state.

    Signed-off-by: Jeff Layton

    Jeff Layton
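
    The flag handling described in this commit can be modelled as below.
    This is a sketch with hypothetical names (client, do_upcall,
    track_*), and the upcall is replaced by a counter that always
    succeeds; the real code upcalls to nfsdcltrack and handles failures.

```c
#include <assert.h>
#include <stdbool.h>

struct client { bool stable; };   /* models the NFSD4_CLIENT_STABLE bit */

static int upcalls;               /* how many (modelled) upcalls were made */

/* Stand-in for the userland upcall; assumed to always succeed here. */
static bool do_upcall(const char *cmd)
{
    (void)cmd;
    upcalls++;
    return true;
}

static void track_create(struct client *c)
{
    if (do_upcall("create"))
        c->stable = true;
}

static void track_check(struct client *c)
{
    if (c->stable)
        return;                   /* already known to be recorded */
    if (do_upcall("check"))
        c->stable = true;
}

static void track_remove(struct client *c)
{
    if (!c->stable)
        return;                   /* nothing recorded, nothing to remove */
    do_upcall("remove");
    c->stable = false;
}
```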
     
  • In a later patch, we want to add a flag that will allow us to reduce the
    need for upcalls. In order to handle that correctly, we'll need to
    ensure that racing upcalls for the same client can't occur. In practice
    it should be rare for this to occur with a well-behaved client, but it
    is possible.

    Convert one of the bits in the cl_flags field to be an upcall bitlock,
    and use it to ensure that upcalls for the same client are serialized.

    Signed-off-by: Jeff Layton

    Jeff Layton
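
    Using one bit of a flags word as a lock can be sketched with C11
    atomics. This is only an illustration: the bit names are hypothetical,
    and the sketch is a non-blocking trylock, whereas kernel code would
    normally sleep until the bit becomes free.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define CLIENT_STABLE      (1u << 0)   /* unrelated flag sharing the word */
#define CLIENT_UPCALL_LOCK (1u << 1)   /* bit used as the upcall lock */

/* Atomically set the lock bit; succeed only if it was clear before. */
static bool upcall_trylock(atomic_uint *flags)
{
    unsigned int old = atomic_fetch_or(flags, CLIENT_UPCALL_LOCK);
    return !(old & CLIENT_UPCALL_LOCK);
}

/* Clear the lock bit, leaving every other flag untouched. */
static void upcall_unlock(atomic_uint *flags)
{
    unsigned int old = atomic_fetch_and(flags, ~CLIENT_UPCALL_LOCK);
    assert(old & CLIENT_UPCALL_LOCK);   /* unlocking an unlocked bit is a bug */
    (void)old;
}
```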
     
  • In order to support lifting the grace period early, we must tell
    nfsdcltrack what sort of client the "create" upcall is for. We can't
    reliably tell if a v4.0 client has completed reclaiming, so we can only
    lift the grace period once all the v4.1+ clients have issued a
    RECLAIM_COMPLETE and if there are no v4.0 clients.

    Also, in order to lift the grace period, we have to tell userland when
    the grace period started so that it can tell whether a RECLAIM_COMPLETE
    has been issued for each client since then.

    Since this is all optional info, we pass it along in environment
    variables to the "init" and "create" upcalls. By doing this, we don't
    need to revise the upcall format. The UMH upcall can simply make use of
    this info if it happens to be present. If it's not then it can just
    avoid lifting the grace period early.

    Signed-off-by: Jeff Layton

    Jeff Layton
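
    The condition for lifting the grace period early, as stated above, can
    be written as a small predicate. The record layout and function name
    are hypothetical; has_session stands for "this is a v4.1+ client", and
    how userland actually stores this is not shown in the commit message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct client_rec {
    bool has_session;        /* true for v4.1+ clients, false for v4.0 */
    bool reclaim_complete;   /* RECLAIM_COMPLETE seen since the grace period started */
};

/*
 * Early lifting is only safe if there are no v4.0 clients and every
 * v4.1+ client has issued RECLAIM_COMPLETE. With no clients recorded
 * at all, this sketch returns true.
 */
static bool can_lift_grace_early(const struct client_rec *c, size_t n)
{
    assert(c != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (!c[i].has_session)
            return false;    /* cannot tell when a v4.0 client is done */
        if (!c[i].reclaim_complete)
            return false;    /* still reclaiming */
    }
    return true;
}
```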
     
  • Allow a privileged userland process to end the v4 grace period early.
    Writing "Y", "y", or "1" to the file will cause the v4 grace period to
    be lifted. The basic idea with this will be to allow the userland
    client tracking program to lift the grace period once it knows that no
    more clients will be reclaiming state.

    Signed-off-by: Jeff Layton

    Jeff Layton
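
    How the written value might be interpreted, as a user-space sketch.
    The function name is hypothetical, and the assumption that only the
    first byte matters (so a trailing newline is tolerated) is ours, not
    something the commit message states.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Returns true if the written buffer asks for the grace period to end. */
static bool wants_end_grace(const char *buf, size_t len)
{
    if (len == 0)
        return false;
    switch (buf[0]) {
    case 'Y':
    case 'y':
    case '1':
        return true;
    default:
        return false;    /* anything else is not a request to end grace */
    }
}
```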
     
  • As stated in RFC 5661, section 18.51.3:

    Once a RECLAIM_COMPLETE is done, there can be no further reclaim
    operations for locks whose scope is defined as having completed
    recovery. Once the client sends RECLAIM_COMPLETE, the server will
    not allow the client to do subsequent reclaims of locking state for
    that scope and, if these are attempted, will return
    NFS4ERR_NO_GRACE.

    Ensure that we enforce that requirement.

    Signed-off-by: Jeff Layton

    Jeff Layton
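
    The rule quoted from the RFC amounts to a check like the following.
    The struct and function are hypothetical; NFS4ERR_NO_GRACE uses its
    protocol value of 10033. Treating a reclaim outside the grace period
    the same way is an addition of this sketch for completeness.

```c
#include <assert.h>
#include <stdbool.h>

enum { NFS4_OK = 0, NFS4ERR_NO_GRACE = 10033 };

struct client_state { bool reclaim_complete; };

/* Decide whether a reclaim request from this client may proceed. */
static int check_reclaim(const struct client_state *clp, bool in_grace)
{
    if (!in_grace)
        return NFS4ERR_NO_GRACE;   /* grace period already over */
    if (clp->reclaim_complete)
        return NFS4ERR_NO_GRACE;   /* client already said it was finished */
    return NFS4_OK;
}
```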
     
  • Since it's stored in nfsd_net, we don't need to pass it in separately.

    Signed-off-by: Jeff Layton

    Jeff Layton
     
  • Currently, all of the grace period handling is part of lockd. Eventually
    though we'd like to be able to build v4-only servers, at which point
    we'll need to put all of this elsewhere.

    Move the code itself into fs/nfs_common and have it build a grace.ko
    module. Then, rejigger the Kconfig options so that both nfsd and lockd
    enable it automatically.

    Signed-off-by: Jeff Layton

    Jeff Layton
     

11 Sep, 2014

1 commit

  • This fixes a failure in xfstests generic/313 because nfs doesn't update
    mtime on a truncate. The protocol requires this to be done implicitly
    for a size changing setattr.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: J. Bruce Fields

    Christoph Hellwig
     

10 Sep, 2014

2 commits


09 Sep, 2014

2 commits

  • Empty files and missing xattrs do not guarantee that a file was
    just created. This patch passes FILE_CREATED flag to IMA to
    reliably identify new files.

    Signed-off-by: Dmitry Kasatkin
    Signed-off-by: Mimi Zohar
    Cc: 3.14+

    Dmitry Kasatkin
     
  • Commit 3b299709091b "nfsd4: enforce rd_dircount" totally misunderstood
    rd_dircount; it refers to total non-attribute bytes returned, not number
    of directory entries returned.

    Bring the code into agreement with RFC 3530 section 14.2.24.

    Cc: stable@vger.kernel.org
    Fixes: 3b299709091b "nfsd4: enforce rd_dircount"
    Signed-off-by: J. Bruce Fields

    J. Bruce Fields
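
    The difference between counting entries and counting non-attribute
    bytes can be sketched as below. This is not the nfsd encoder: the
    helper names are hypothetical, and the per-entry cost assumes the XDR
    encoding of an 8-byte cookie plus a name (4-byte length and the name
    padded to a 4-byte multiple), which is what RFC 3530 describes
    dircount as covering.

```c
#include <assert.h>
#include <stddef.h>

/* XDR pads opaque data to a multiple of 4 bytes. */
static size_t xdr_pad4(size_t n)
{
    return (n + 3) & ~(size_t)3;
}

/* Non-attribute bytes for one entry: cookie (8) + name length (4) + padded name. */
static size_t entry_dir_bytes(size_t namelen)
{
    return 8 + 4 + xdr_pad4(namelen);
}

/* How many leading entries fit in a dircount given in bytes, not in entries. */
static size_t entries_within_dircount(const size_t *namelens, size_t n,
                                      size_t dircount)
{
    size_t used = 0, i;

    for (i = 0; i < n; i++) {
        size_t need = entry_dir_bytes(namelens[i]);

        if (used + need > dircount)
            break;
        used += need;
    }
    return i;
}
```

    With name lengths 1, 5 and 8 the entries cost 16, 20 and 20 bytes, so
    a dircount of 36 admits two entries, not thirty-six.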