23 May, 2014

7 commits

  • PF_LESS_THROTTLE has a very specific use case: to avoid deadlocks
    and live-locks while writing to the page cache in a loop-back
    NFS mount situation.

    It therefore makes sense to *only* set PF_LESS_THROTTLE in this
    situation.
    We now know when a request came from the local-host so it could be a
    loop-back mount. We already know when we are handling write requests,
    and when we are doing anything else.

    So combine those two to allow nfsd to still be throttled (like any
    other process) in every situation except when it is known to be
    problematic.

    Signed-off-by: NeilBrown
    Signed-off-by: J. Bruce Fields

    NeilBrown
     
  • If an incoming NFS request is coming from the local host, then
    nfsd will need to perform some special handling. So detect that
    possibility and make the source visible in rq_local.

    Signed-off-by: NeilBrown
    Signed-off-by: J. Bruce Fields

    NeilBrown
     
  • If the accept() call fails, we need to put the module reference.

    Signed-off-by: Trond Myklebust
    Cc: stable@vger.kernel.org
    Signed-off-by: J. Bruce Fields

    Trond Myklebust
     
  • An NFS/RDMA client's source port is meaningless for RDMA transports.
    The transport layer typically sets the source port value on the
    connection to a random ephemeral port.

    Currently, NFS server administrators must specify the "insecure"
    export option to enable clients to access exports via RDMA.

    But this means NFS clients can access such an export via IP using an
    ephemeral port, which may not be desirable.

    This patch eliminates the need to specify the "insecure" export
    option to allow NFS/RDMA clients access to an export.

    BugLink: https://bugzilla.linux-nfs.org/show_bug.cgi?id=250
    Signed-off-by: Chuck Lever
    Signed-off-by: J. Bruce Fields

    Chuck Lever
     
  • No need for a kmem_cache_destroy wrapper in nfsd, just do proper
    goto based unwinding.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: J. Bruce Fields

    Christoph Hellwig
     
  • Assignments should not happen inside an if conditional, but in the line
    before. This issue was reported by checkpatch.

    The semantic patch that makes this change is as follows
    (http://coccinelle.lip6.fr/):

    //

    @@
    identifier i1;
    expression e1;
    statement S;
    @@
    -if(!(i1 = e1)) S
    +i1 = e1;
    +if(!i1)
    +S

    //

    It has been tested by compilation.

    Signed-off-by: Benoit Taine
    Signed-off-by: J. Bruce Fields

    Benoit Taine
     
  • J. Bruce Fields
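
The goto-based unwinding that the kmem_cache_destroy commit above refers to can be sketched in user-space C. This is only an illustration of the pattern, not the nfsd code: the struct, the function names, and the `fail_at` hook are all hypothetical, and malloc/free stand in for cache creation and destruction.

```c
#include <stdlib.h>

/* Hypothetical stand-ins for three caches; these are not the nfsd names. */
struct caches {
    void *a;
    void *b;
    void *c;
};

/* Hypothetical hook: allocation number fail_at (1-based) fails; 0 = never. */
static int fail_at;
static int alloc_count;

static void *cache_create(void)
{
    alloc_count++;
    if (alloc_count == fail_at)
        return NULL;
    return malloc(1);
}

/* Returns 0 on success, or -1 with every earlier allocation released. */
int caches_init(struct caches *c)
{
    c->a = c->b = c->c = NULL;

    c->a = cache_create();
    if (!c->a)
        goto out;
    c->b = cache_create();
    if (!c->b)
        goto out_free_a;
    c->c = cache_create();
    if (!c->c)
        goto out_free_b;
    return 0;

    /* Labels run in reverse order of setup, so each failure point
     * undoes exactly what was set up before it. */
out_free_b:
    free(c->b);
    c->b = NULL;
out_free_a:
    free(c->a);
    c->a = NULL;
out:
    return -1;
}

/* Release everything after a successful caches_init(). */
void caches_exit(struct caches *c)
{
    free(c->c);
    free(c->b);
    free(c->a);
    c->a = c->b = c->c = NULL;
}
```

Because every error path falls through the labels below it, no wrapper that tolerates "not yet created" objects is needed.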
     

22 May, 2014

2 commits

  • We're not cleaning up everything we need to on error. In particular,
    we're not removing our lease. Among other problems this can cause the
    struct nfs4_file used as fl_owner to be referenced after it has been
    destroyed.

    Signed-off-by: J. Bruce Fields

    J. Bruce Fields
     
  • We're clearing the SUID/SGID bits on write by hand in nfsd_vfs_write,
    even though the subsequent vfs_writev() call will end up doing this for
    us (through file system write methods eventually calling
    file_remove_suid(), e.g., from __generic_file_aio_write).

    So, remove the redundant nfsd code.

    The only change in behavior is when the write is by root, in which case
    we previously cleared SUID/SGID, but will now leave it alone. The new
    behavior is the behavior of every filesystem we've checked.

    It seems better to be consistent with local filesystem behavior. And
    the security advantage seems limited as root could always restore these
    bits by hand if it wanted.

    SUID/SGID is not cleared after writing data with (root, local ext4),
    File: ‘test’
    Size: 0 Blocks: 0 IO Block: 4096 regular empty file
    Device: 803h/2051d Inode: 1200137 Links: 1
    Access: (4777/-rwsrwxrwx) Uid: ( 0/ root) Gid: ( 0/ root)
    Context: unconfined_u:object_r:admin_home_t:s0
    Access: 2014-04-18 21:36:31.016029014 +0800
    Modify: 2014-04-18 21:36:31.016029014 +0800
    Change: 2014-04-18 21:36:31.026030285 +0800
    Birth: -
    File: ‘test’
    Size: 5 Blocks: 8 IO Block: 4096 regular file
    Device: 803h/2051d Inode: 1200137 Links: 1
    Access: (4777/-rwsrwxrwx) Uid: ( 0/ root) Gid: ( 0/ root)
    Context: unconfined_u:object_r:admin_home_t:s0
    Access: 2014-04-18 21:36:31.016029014 +0800
    Modify: 2014-04-18 21:36:31.040032065 +0800
    Change: 2014-04-18 21:36:31.040032065 +0800
    Birth: -

    With no_root_squash, (root, remote ext4), SUID/SGID are cleared,
    File: ‘test’
    Size: 0 Blocks: 0 IO Block: 262144 regular empty file
    Device: 24h/36d Inode: 786439 Links: 1
    Access: (4777/-rwsrwxrwx) Uid: ( 1000/ test) Gid: ( 1000/ test)
    Context: system_u:object_r:nfs_t:s0
    Access: 2014-04-18 21:45:32.155805097 +0800
    Modify: 2014-04-18 21:45:32.155805097 +0800
    Change: 2014-04-18 21:45:32.168806749 +0800
    Birth: -
    File: ‘test’
    Size: 5 Blocks: 8 IO Block: 262144 regular file
    Device: 24h/36d Inode: 786439 Links: 1
    Access: (0777/-rwxrwxrwx) Uid: ( 1000/ test) Gid: ( 1000/ test)
    Context: system_u:object_r:nfs_t:s0
    Access: 2014-04-18 21:45:32.155805097 +0800
    Modify: 2014-04-18 21:45:32.184808783 +0800
    Change: 2014-04-18 21:45:32.184808783 +0800
    Birth: -

    Signed-off-by: Kinglong Mee
    Signed-off-by: J. Bruce Fields

    Kinglong Mee
     

21 May, 2014

2 commits

  • The current code assumes a one-to-one lockowner<->lock stateid
    correspondence.

    Cc: stable@vger.kernel.org
    Signed-off-by: J. Bruce Fields

    J. Bruce Fields
     
  • The nfsv4 state code has always assumed a one-to-one correspondence
    between lock stateids and lockowners even if it appears not to in some
    places.

    We may actually change that, but for now when FREE_STATEID releases a
    lock stateid it also needs to release the parent lockowner.

    Symptoms were a subsequent LOCK crashing in find_lockowner_str when it
    calls same_lockowner_ino on a lockowner that unexpectedly has an empty
    so_stateids list.

    Cc: stable@vger.kernel.org
    Signed-off-by: J. Bruce Fields

    J. Bruce Fields
     

16 May, 2014

1 commit


09 May, 2014

7 commits

  • Signed-off-by: Kinglong Mee
    Signed-off-by: J. Bruce Fields

    Kinglong Mee
     
  • Signed-off-by: Kinglong Mee
    Signed-off-by: J. Bruce Fields

    Kinglong Mee
     
  • Signed-off-by: Kinglong Mee
    Signed-off-by: J. Bruce Fields

    Kinglong Mee
     
  • J. Bruce Fields
     
  • Use fh_fsid when referring to the fsid part of the filehandle. The
    variable length auth field envisioned in nfsfh wasn't ever implemented.
    Also clean up some loose ends around this and document the file handle
    format better.

    Btw, why do we even export nfsfh.h to userspace? The file handle very
    much is kernel private, and nothing in nfs-utils includes the header
    either.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: J. Bruce Fields

    Christoph Hellwig
     
  • Commit 4ac7249ea5a0ceef9f8269f63f33cc873c3fac61 removed all EXPORT_SYMBOL
    uses, so linux/export.h is no longer needed; remove the include.

    Signed-off-by: Kinglong Mee
    Signed-off-by: J. Bruce Fields

    Kinglong Mee
     
  • After setting an ACL on a directory, I hit two problems caused by
    the cached zero-length default POSIX ACL.

    This patch makes sure nfsd4_set_nfs4_acl calls ->set_acl
    with a NULL ACL structure if there are no entries.

    Thanks to Christoph Hellwig for the advice.

    First problem:
    ............ hang ...........

    Second problem:
    [ 1610.167668] ------------[ cut here ]------------
    [ 1610.168320] kernel BUG at /root/nfs/linux/fs/nfsd/nfs4acl.c:239!
    [ 1610.168320] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC
    [ 1610.168320] Modules linked in: nfsv4(OE) nfs(OE) nfsd(OE)
    rpcsec_gss_krb5 fscache ip6t_rpfilter ip6t_REJECT cfg80211 xt_conntrack
    rfkill ebtable_nat ebtable_broute bridge stp llc ebtable_filter ebtables
    ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6
    ip6table_mangle ip6table_security ip6table_raw ip6table_filter
    ip6_tables iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4
    nf_nat nf_conntrack iptable_mangle iptable_security iptable_raw
    auth_rpcgss nfs_acl snd_intel8x0 ppdev lockd snd_ac97_codec ac97_bus
    snd_pcm snd_timer e1000 pcspkr parport_pc snd parport serio_raw joydev
    i2c_piix4 sunrpc(OE) microcode soundcore i2c_core ata_generic pata_acpi
    [last unloaded: nfsd]
    [ 1610.168320] CPU: 0 PID: 27397 Comm: nfsd Tainted: G OE
    3.15.0-rc1+ #15
    [ 1610.168320] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS
    VirtualBox 12/01/2006
    [ 1610.168320] task: ffff88005ab653d0 ti: ffff88005a944000 task.ti:
    ffff88005a944000
    [ 1610.168320] RIP: 0010:[] []
    _posix_to_nfsv4_one+0x3cd/0x3d0 [nfsd]
    [ 1610.168320] RSP: 0018:ffff88005a945b00 EFLAGS: 00010293
    [ 1610.168320] RAX: 0000000000000001 RBX: ffff88006700bac0 RCX:
    0000000000000000
    [ 1610.168320] RDX: 0000000000000000 RSI: ffff880067c83f00 RDI:
    ffff880068233300
    [ 1610.168320] RBP: ffff88005a945b48 R08: ffffffff81c64830 R09:
    0000000000000000
    [ 1610.168320] R10: ffff88004ea85be0 R11: 000000000000f475 R12:
    ffff880068233300
    [ 1610.168320] R13: 0000000000000003 R14: 0000000000000002 R15:
    ffff880068233300
    [ 1610.168320] FS: 0000000000000000(0000) GS:ffff880077800000(0000)
    knlGS:0000000000000000
    [ 1610.168320] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
    [ 1610.168320] CR2: 00007f5bcbd3b0b9 CR3: 0000000001c0f000 CR4:
    00000000000006f0
    [ 1610.168320] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
    0000000000000000
    [ 1610.168320] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
    0000000000000400
    [ 1610.168320] Stack:
    [ 1610.168320] ffffffff00000000 0000000b67c83500 000000076700bac0
    0000000000000000
    [ 1610.168320] ffff88006700bac0 ffff880068233300 ffff88005a945c08
    0000000000000002
    [ 1610.168320] 0000000000000000 ffff88005a945b88 ffffffffa034e2d5
    000000065a945b68
    [ 1610.168320] Call Trace:
    [ 1610.168320] [] nfsd4_get_nfs4_acl+0x95/0x150 [nfsd]
    [ 1610.168320] [] nfsd4_encode_fattr+0x646/0x1e70 [nfsd]
    [ 1610.168320] [] ? kmemleak_alloc+0x4e/0xb0
    [ 1610.168320] [] ?
    nfsd_setuser_and_check_port+0x52/0x80 [nfsd]
    [ 1610.168320] [] ? selinux_cred_prepare+0x1b/0x30
    [ 1610.168320] [] nfsd4_encode_getattr+0x5a/0x60 [nfsd]
    [ 1610.168320] [] nfsd4_encode_operation+0x67/0x110
    [nfsd]
    [ 1610.168320] [] nfsd4_proc_compound+0x21d/0x810 [nfsd]
    [ 1610.168320] [] nfsd_dispatch+0xbb/0x200 [nfsd]
    [ 1610.168320] [] svc_process_common+0x46d/0x6d0 [sunrpc]
    [ 1610.168320] [] svc_process+0x103/0x170 [sunrpc]
    [ 1610.168320] [] nfsd+0xbf/0x130 [nfsd]
    [ 1610.168320] [] ? nfsd_destroy+0x80/0x80 [nfsd]
    [ 1610.168320] [] kthread+0xd2/0xf0
    [ 1610.168320] [] ? insert_kthread_work+0x40/0x40
    [ 1610.168320] [] ret_from_fork+0x7c/0xb0
    [ 1610.168320] [] ? insert_kthread_work+0x40/0x40
    [ 1610.168320] Code: 78 02 e9 e7 fc ff ff 31 c0 31 d2 31 c9 66 89 45 ce
    41 8b 04 24 66 89 55 d0 66 89 4d d2 48 8d 04 80 49 8d 5c 84 04 e9 37 fd
    ff ff 0b 90 0f 1f 44 00 00 55 8b 56 08 c7 07 00 00 00 00 8b 46 0c
    [ 1610.168320] RIP [] _posix_to_nfsv4_one+0x3cd/0x3d0
    [nfsd]
    [ 1610.168320] RSP
    [ 1610.257313] ---[ end trace 838254e3e352285b ]---

    Signed-off-by: Kinglong Mee
    Cc: stable@vger.kernel.org
    Signed-off-by: J. Bruce Fields

    Kinglong Mee
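
The idea behind the ACL fix above, passing NULL rather than a zero-length ACL, can be sketched as follows. This is a minimal illustration under stated assumptions: `struct acl` and `acl_or_null` are hypothetical and much simpler than the kernel's ACL structures, and the real patch may be organized differently.

```c
#include <stddef.h>

/* Hypothetical, simplified ACL type; not the kernel's struct posix_acl. */
struct acl {
    unsigned int count; /* number of entries */
};

/*
 * Normalize an ACL before handing it to a setter: an ACL with no
 * entries is reported as NULL so that no empty ACL is ever cached.
 */
const struct acl *acl_or_null(const struct acl *acl)
{
    if (acl == NULL || acl->count == 0)
        return NULL;
    return acl;
}
```

A caller would pass the result of `acl_or_null()` to its set operation, so later readers never see an ACL object with zero entries.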
     

07 May, 2014

10 commits


18 Apr, 2014

3 commits

  • Since we're still limiting attributes to a page, the result here is that
    a large getattr result will return NFS4ERR_REP_TOO_BIG/TOO_BIG_TO_CACHE
    instead of NFS4ERR_RESOURCE.

    Both error returns are wrong, and the real bug here is the arbitrary
    limit on getattr results, fixed by as-yet out-of-tree patches. But at a
    minimum we can make life easier for clients by sticking to one broken
    behavior in released kernels instead of two....

    Trond says:

    one immediate consequence of this patch will be that NFSv4.1
    clients will now report EIO instead of EREMOTEIO if they hit the
    problem. That may make debugging a little less obvious.

    Another consequence will be that if we ever do try to add client
    side handling of NFS4ERR_REP_TOO_BIG, then we now have to deal
    with the “handle existing buggy server” syndrome.

    Reported-by: Trond Myklebust
    Signed-off-by: J. Bruce Fields

    J. Bruce Fields
     
  • ...otherwise the logic in the timeout handling doesn't work correctly.

    Spotted-by: Trond Myklebust
    Cc: stable@vger.kernel.org
    Signed-off-by: Jeff Layton
    Signed-off-by: J. Bruce Fields

    Jeff Layton
     
  • A fl->fl_break_time of 0 has a special meaning to the lease break code
    that basically means "never break the lease". knfsd uses this to ensure
    that leases don't disappear out from under it.

    Unfortunately, the code in __break_lease can end up passing this value
    to wait_event_interruptible as a timeout, which prevents it from going
    to sleep at all. This causes __break_lease to spin in a tight loop and
    causes soft lockups.

    Fix this by ensuring that we pass a minimum value of 1 as a timeout
    instead.

    Cc:
    Cc: J. Bruce Fields
    Reported-by: Terry Barnaby
    Signed-off-by: Jeff Layton
    Signed-off-by: J. Bruce Fields

    Jeff Layton
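
The minimum-timeout fix described in the commit above can be sketched in user-space C. The function name and the plain tick arithmetic are assumptions for illustration; the sketch only shows how a break time of 0 ("never break") is kept from turning into a timeout of 0, and it does not handle a deadline that has already passed.

```c
/*
 * break_time: absolute deadline in ticks, where 0 means "never break".
 * now:        current tick count.
 * Returns a relative timeout that is never 0, so a waiter that treats
 * 0 as "do not sleep" cannot spin.
 */
unsigned long lease_wait_timeout(unsigned long break_time, unsigned long now)
{
    unsigned long timeout = break_time;

    /* Convert a real deadline into a relative timeout. */
    if (timeout != 0)
        timeout -= now;

    /* Covers both "never break" and a deadline equal to now. */
    if (timeout == 0)
        timeout = 1;

    return timeout;
}
```

The waiter then sleeps for at least one tick on every pass instead of returning immediately.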
     

14 Apr, 2014

4 commits

  • Linus Torvalds
     
  • Some versions of gcc even warn about it:

    mm/shmem.c: In function ‘shmem_file_aio_read’:
    mm/shmem.c:1414: warning: ‘error’ may be used uninitialized in this function

    If the loop is aborted during the first iteration by one of the two
    first break statements, error will be uninitialized.

    Introduced by commit 6e58e79db8a1 ("introduce copy_page_to_iter, kill
    loop over iovec in generic_file_aio_read()").

    Signed-off-by: Geert Uytterhoeven
    Acked-by: Al Viro
    Signed-off-by: Linus Torvalds

    Geert Uytterhoeven
     
  • On 32 bit, size_t is "unsigned int", not "unsigned long", causing the
    following warning when comparing with PAGE_SIZE, which is always "unsigned
    long":

    fs/cifs/file.c: In function ‘cifs_readdata_to_iov’:
    fs/cifs/file.c:2757: warning: comparison of distinct pointer types lacks a cast

    Introduced by commit 7f25bba819a3 ("cifs_iovec_read: keep iov_iter
    between the calls of cifs_readdata_to_iov()"), which changed the
    signedness of "remaining" and the code from min_t() to min().

    Signed-off-by: Geert Uytterhoeven
    Signed-off-by: Linus Torvalds

    Geert Uytterhoeven
     
  • Pull slab changes from Pekka Enberg:
    "The biggest change is byte-sized freelist indices which reduces slab
    freelist memory usage:

    https://lkml.org/lkml/2013/12/2/64"

    * 'slab/next' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/linux:
    mm: slab/slub: use page->list consistently instead of page->lru
    mm/slab.c: cleanup outdated comments and unify variables naming
    slab: fix wrongly used macro
    slub: fix high order page allocation problem with __GFP_NOFAIL
    slab: Make allocations with GFP_ZERO slightly more efficient
    slab: make more slab management structure off the slab
    slab: introduce byte sized index for the freelist of a slab
    slab: restrict the number of objects in a slab
    slab: introduce helper functions to get/set free object
    slab: factor out calculate nr objects in cache_estimate

    Linus Torvalds
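
The min() versus min_t() distinction behind the cifs warning above can be sketched in portable C. The kernel's min() refuses operands of different types, while min_t() casts both to a named type first. `MIN_T`, `EXAMPLE_PAGE_SIZE`, and `chunk_len` are hypothetical names for this illustration, and unlike the kernel macro this simple version evaluates its arguments more than once.

```c
#include <stddef.h>

/* Cast both operands to `type` before comparing, in the spirit of min_t().
 * Note: arguments are evaluated more than once, so avoid side effects. */
#define MIN_T(type, a, b) ((type)(a) < (type)(b) ? (type)(a) : (type)(b))

/* Hypothetical page size, typed unsigned long as in the commit message. */
#define EXAMPLE_PAGE_SIZE 4096UL

/* Bytes to take from the current page: what is left, capped at one page. */
size_t chunk_len(size_t remaining)
{
    return MIN_T(size_t, remaining, EXAMPLE_PAGE_SIZE);
}
```

Naming the type explicitly makes the comparison well defined even where size_t and unsigned long are distinct types, as on 32-bit builds.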
     

13 Apr, 2014

4 commits

  • Pull misc kbuild changes from Michal Marek:
    "Here is the non-critical part of kbuild:
    - One bogus coccinelle check removed, one check fixed not to suggest
    the obsolete PTR_RET macro
    - scripts/tags.sh does not index the generated *.mod.c files
    - new objdiff tool to list differences between two versions of an
    object file
    - A fix for scripts/bootgraph.pl"

    * 'misc' of git://git.kernel.org/pub/scm/linux/kernel/git/mmarek/kbuild:
    scripts/coccinelle: Use PTR_ERR_OR_ZERO
    scripts/bootgraph.pl: Add graphic header
    scripts: objdiff: detect object code changes between two commits
    Coccicheck: Remove memcpy to struct assignment test
    scripts/tags.sh: Ignore *.mod.c

    Linus Torvalds
     
  • This patch fixes I/O errors with the sym53c8xx_2 driver when the disk
    returns QUEUE FULL status.

    When the controller encounters an error (including QUEUE FULL or BUSY
    status), it aborts all not yet submitted requests in the function
    sym_dequeue_from_squeue.

    This function aborts them with DID_SOFT_ERROR.

    If the disk has a full tag queue, the request that caused the overflow is
    aborted with QUEUE FULL status (and the scsi midlayer properly retries
    it until it is accepted by the disk), but the sym53c8xx_2 driver aborts
    the following requests with DID_SOFT_ERROR --- for them, the midlayer
    does just a few retries and then signals the error up to sd.

    The result is that a disk returning QUEUE FULL causes request failures.

    The error was reproduced on 53c895 with COMPAQ BD03685A24 disk
    (rebranded ST336607LC) with command queue 48 or 64 tags. The disk has
    64 tags, but under some access patterns it returns QUEUE FULL when there
    are fewer than 64 pending tags. The SCSI specification allows returning
    QUEUE FULL anytime and it is up to the host to retry.

    Signed-off-by: Mikulas Patocka
    Cc: Matthew Wilcox
    Cc: James Bottomley
    Signed-off-by: Linus Torvalds

    Mikulas Patocka
     
  • Commit 8f619b5429d9 ("powerpc/ppc64: Do not turn AIL (reloc-on
    interrupts) too early") added code to set the AIL bit in the LPCR
    without checking whether the kernel is running in hypervisor mode. The
    result is that when the kernel is running as a guest (i.e., under
    PowerKVM or PowerVM), the processor takes a privileged instruction
    interrupt at that point, causing a panic. The visible result is that
    the kernel hangs after printing "returning from prom_init".

    This fixes it by checking for hypervisor mode being available before
    setting LPCR. If we are not in hypervisor mode, we enable relocation-on
    interrupts later in pSeries_setup_arch using the H_SET_MODE hcall.

    Signed-off-by: Paul Mackerras
    Acked-by: Benjamin Herrenschmidt
    Signed-off-by: Linus Torvalds

    Paul Mackerras
     
  • Commits 11d4616bd07f ("futex: revert back to the explicit waiter
    counting code") and 69cd9eba3886 ("futex: avoid race between requeue and
    wake") changed some of the finer details of how we think about futexes.
    One was a late fix and the other a consequence of overlooking the whole
    requeuing logic.

    The first change caused our documentation to be incorrect, and the
    second made us aware that we need to explicitly add more details to it.

    Signed-off-by: Davidlohr Bueso
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso