01 Mar, 2013

3 commits

  • The client will currently try LAYOUTGETs forever if a server is returning
    NFS4ERR_LAYOUTTRYLATER or NFS4ERR_RECALLCONFLICT - even if the client no
    longer needs the layout (i.e. the process was killed or the filesystem unmounted).

    This patch uses the DS timeout value (module parameter 'dataserver_timeo'
    via rpc layer) to set an upper limit on how long the client tries LAYOUTGETs
    in this situation. Once the timeout is reached, IO is redirected to the MDS.

    This also changes how the client checks if a layout is on the clp list
    to avoid a double list_add.

    Signed-off-by: Weston Andros Adamson
    Signed-off-by: Trond Myklebust

    Weston Andros Adamson
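
The bounded retry described in this commit can be sketched in userspace C. This is a minimal model, not the kernel code: the enum values, `choose_io_path`, and the scripted reply array are all hypothetical, and each attempt is assumed to cost a fixed `tick` of time.

```c
#include <stddef.h>

/* Hypothetical reply codes for this sketch; not the kernel's NFS4ERR_* values. */
enum lg_reply { LG_OK, LG_TRYLATER };
enum io_path { IO_VIA_DS, IO_VIA_MDS };

/*
 * Replay a scripted sequence of server replies. Each attempt costs `tick`
 * time units. Once `timeout` units have elapsed, stop retrying and fall
 * back to the MDS.
 */
enum io_path choose_io_path(const enum lg_reply *replies, size_t n,
                            unsigned int tick, unsigned int timeout)
{
    unsigned int elapsed = 0;

    for (size_t i = 0; i < n; i++) {
        if (replies[i] == LG_OK)
            return IO_VIA_DS;       /* got a layout: use the data server */
        elapsed += tick;
        if (elapsed >= timeout)
            return IO_VIA_MDS;      /* gave up: redirect I/O to the MDS */
    }
    return IO_VIA_MDS;              /* script exhausted without a layout */
}
```

With a server that keeps answering "try later", a short timeout ends the loop early and sends I/O to the MDS, while a longer timeout lets a later successful reply win.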
     
  • The client should have 60 second default timeouts for DS operations, not 6
    seconds.

    NFS4_DEF_DS_TIMEO is used as "timeout in tenths of a second" in
    nfs_init_timeout_values (and is not used anywhere else).
    This matches up with the description of the module param dataserver_timeo.

    Signed-off-by: Weston Andros Adamson
    Signed-off-by: Trond Myklebust

    Weston Andros Adamson
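
The unit mismatch in this commit is easy to check by hand. A minimal sketch, assuming the value is interpreted purely as tenths of a second; `ds_timeo_to_seconds` is a hypothetical helper, not a kernel function:

```c
/* Hypothetical helper: convert a timeout given in tenths of a second
 * (the unit described for the dataserver_timeo module parameter)
 * into whole seconds. */
unsigned int ds_timeo_to_seconds(unsigned int tenths)
{
    return tenths / 10;
}
```

Under that interpretation a value of 60 gives 6 seconds, while 600 gives the intended 60 seconds.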
     
  • If we don't release the open seqid before we wait for state recovery,
    then we may end up deadlocking the state recovery thread.
    This patch addresses a new deadlock that was introduced by
    commit c21443c2c792cd9b463646d982b0fe48aa6feb0f (NFSv4: Fix a reboot
    recovery race when opening a file).

    Reported-by: Andy Adamson
    Signed-off-by: Trond Myklebust

    Trond Myklebust
     

28 Feb, 2013

1 commit

  • Benny Halevy reported the following oops when testing RHEL6:

    nfs_update_inode: inode 892950 mode changed, 0040755 to 0100644
    BUG: unable to handle kernel NULL pointer dereference at (null)
    IP: [] nfs_closedir+0x15/0x30 [nfs]
    PGD 81448a067 PUD 831632067 PMD 0
    Oops: 0000 [#1] SMP
    last sysfs file: /sys/kernel/mm/redhat_transparent_hugepage/enabled
    CPU 6
    Modules linked in: fuse bonding 8021q garp ebtable_nat ebtables be2iscsi iscsi_boot_sysfs bnx2i cnic uio cxgb4i cxgb4 cxgb3i libcxgbi cxgb3 mdio ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi softdog bridge stp llc xt_physdev ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 xt_multiport iptable_filter ip_tables ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables ipv6 dm_round_robin dm_multipath objlayoutdriver2(U) nfs(U) lockd fscache auth_rpcgss nfs_acl sunrpc vhost_net macvtap macvlan tun kvm_intel kvm be2net igb dca ptp pps_core microcode serio_raw sg iTCO_wdt iTCO_vendor_support i7core_edac edac_core shpchp ext4 mbcache jbd2 sd_mod crc_t10dif ahci dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]

    Pid: 6332, comm: dd Not tainted 2.6.32-358.el6.x86_64 #1 HP ProLiant DL170e G6 /ProLiant DL170e G6
    RIP: 0010:[] [] nfs_closedir+0x15/0x30 [nfs]
    RSP: 0018:ffff88081458bb98 EFLAGS: 00010292
    RAX: ffffffffa02a52b0 RBX: 0000000000000000 RCX: 0000000000000003
    RDX: ffffffffa02e45a0 RSI: ffff88081440b300 RDI: ffff88082d5f5760
    RBP: ffff88081458bba8 R08: 0000000000000000 R09: 0000000000000000
    R10: 0000000000000772 R11: 0000000000400004 R12: 0000000040000008
    R13: ffff88082d5f5760 R14: ffff88082d6e8800 R15: ffff88082f12d780
    FS: 00007f728f37e700(0000) GS:ffff8800456c0000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
    CR2: 0000000000000000 CR3: 0000000831279000 CR4: 00000000000007e0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
    Process dd (pid: 6332, threadinfo ffff88081458a000, task ffff88082fa0e040)
    Stack:
    0000000040000008 ffff88081440b300 ffff88081458bbf8 ffffffff81182745
    ffff88082d5f5760 ffff88082d6e8800 ffff88081458bbf8 ffffffffffffffea
    ffff88082f12d780 ffff88082d6e8800 ffffffffa02a50a0 ffff88082d5f5760
    Call Trace:
    [] __fput+0xf5/0x210
    [] ? do_open+0x0/0x20 [nfs]
    [] fput+0x25/0x30
    [] __dentry_open+0x27e/0x360
    [] ? inotify_d_instantiate+0x2a/0x60
    [] lookup_instantiate_filp+0x69/0x90
    [] nfs_intent_set_file+0x59/0x90 [nfs]
    [] nfs_atomic_lookup+0x1bb/0x310 [nfs]
    [] __lookup_hash+0x102/0x160
    [] ? selinux_inode_permission+0x72/0xb0
    [] lookup_hash+0x3a/0x50
    [] do_filp_open+0x2eb/0xdd0
    [] ? __do_page_fault+0x1ec/0x480
    [] ? alloc_fd+0x92/0x160
    [] do_sys_open+0x69/0x140
    [] ? sys_lseek+0x66/0x80
    [] sys_open+0x20/0x30
    [] system_call_fastpath+0x16/0x1b
    Code: 65 48 8b 04 25 c8 cb 00 00 83 a8 44 e0 ff ff 01 5b 41 5c c9 c3 90 55 48 89 e5 53 48 83 ec 08 0f 1f 44 00 00 48 8b 9e a0 00 00 00 8b 3b e8 13 0c f7 ff 48 89 df e8 ab 3d ec e0 48 83 c4 08 31
    RIP [] nfs_closedir+0x15/0x30 [nfs]
    RSP
    CR2: 0000000000000000

    I think this is ultimately due to a bug on the server. The client had
    previously found a directory dentry. It then later tried to do an atomic
    open on a new (regular file) dentry. The attributes it got back had the
    same filehandle as the previously found directory inode. It then tried
    to put the filp because it failed the aops tests for O_DIRECT opens, and
    oopsed here because the ctx was still NULL.

    Obviously the root cause here is a server issue, but we can take steps
    to mitigate this on the client. When nfs_fhget is called, we always know
    what type of inode it is. In the event that there's a broken or
    malicious server on the other end of the wire, the client can end up
    crashing because the wrong ops are set on it.

    Have nfs_find_actor check that the inode type is correct after checking
    the fileid. The fileid check should rarely ever match, so it should only
    rarely ever get to this check. In the case where we have a broken
    server, we may see two different inodes with the same i_ino, but the
    client should be able to cope with them without crashing.

    This should fix the oops reported here:

    https://bugzilla.redhat.com/show_bug.cgi?id=913660

    Reported-by: Benny Halevy
    Signed-off-by: Jeff Layton
    Signed-off-by: Trond Myklebust

    Jeff Layton
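
The extra check described above can be illustrated with a small userspace sketch. The structs and `inode_matches` are hypothetical simplifications, not the kernel's `nfs_find_actor`; the type mask is assumed to use the same bit layout as `S_IFMT` on Linux.

```c
#include <stdbool.h>

/* Assumed to mirror the file-type bits of S_IFMT on Linux. */
#define SKETCH_IFMT 0170000u

/* Hypothetical, simplified stand-ins for the cached inode and the
 * attributes returned by the server. */
struct cached_inode { unsigned long long fileid; unsigned int mode; };
struct server_attrs { unsigned long long fileid; unsigned int mode; };

/* Return true only if both the fileid and the file type agree. */
bool inode_matches(const struct cached_inode *ino,
                   const struct server_attrs *fa)
{
    if (ino->fileid != fa->fileid)
        return false;
    /* Compare only the type bits (directory, regular file, ...). */
    if ((ino->mode & SKETCH_IFMT) != (fa->mode & SKETCH_IFMT))
        return false;
    return true;
}
```

With the modes from the log line above (0040755 for the cached directory, 0100644 for the new regular file), the fileids match but the type bits differ, so the sketch reports no match.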
     

26 Feb, 2013

1 commit

  • This fixes an oops where a LAYOUTGET is still in the rpciod queue,
    but the requesting process has been killed. Without this, killing
    the process does the final pnfs_put_layout_hdr() and sets NFS_I(inode)->layout
    to NULL while the LAYOUTGET rpc task still references it.

    Example oops:

    BUG: unable to handle kernel NULL pointer dereference at 0000000000000080
    IP: [] pnfs_choose_layoutget_stateid+0x37/0xef [nfsv4]
    PGD 7365b067 PUD 7365d067 PMD 0
    Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
    Modules linked in: nfs_layout_nfsv41_files nfsv4 auth_rpcgss nfs lockd sunrpc ipt_MASQUERADE ip6table_mangle ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 iptable_nat nf_nat_ipv4 nf_nat iptable_mangle ip6table_filter ip6_tables ppdev e1000 i2c_piix4 i2c_core shpchp parport_pc parport crc32c_intel aesni_intel xts aes_x86_64 lrw gf128mul ablk_helper cryptd mptspi scsi_transport_spi mptscsih mptbase floppy autofs4
    CPU 0
    Pid: 27, comm: kworker/0:1 Not tainted 3.8.0-dros_cthon2013+ #4 VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform
    RIP: 0010:[] [] pnfs_choose_layoutget_stateid+0x37/0xef [nfsv4]
    RSP: 0018:ffff88007b0c1c88 EFLAGS: 00010246
    RAX: ffff88006ed36678 RBX: 0000000000000000 RCX: 0000000ea877e3bc
    RDX: ffff88007a729da8 RSI: 0000000000000000 RDI: ffff88007a72b958
    RBP: ffff88007b0c1ca8 R08: 0000000000000002 R09: 0000000000000000
    R10: 0000000000000000 R11: 0000000000000000 R12: ffff88007a72b958
    R13: ffff88007a729da8 R14: 0000000000000000 R15: ffffffffa011077e
    FS: 0000000000000000(0000) GS:ffff88007f600000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 0000000000000080 CR3: 00000000735f8000 CR4: 00000000001407f0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
    Process kworker/0:1 (pid: 27, threadinfo ffff88007b0c0000, task ffff88007c2fa0c0)
    Stack:
    ffff88006fc05388 ffff88007a72b908 ffff88007b240900 ffff88006fc05388
    ffff88007b0c1cd8 ffffffffa01a2170 ffff88007b240900 ffff88007b240900
    ffff88007b240970 ffffffffa011077e ffff88007b0c1ce8 ffffffffa0110791
    Call Trace:
    [] nfs4_layoutget_prepare+0x7b/0x92 [nfsv4]
    [] ? __rpc_atrun+0x15/0x15 [sunrpc]
    [] rpc_prepare_task+0x13/0x15 [sunrpc]

    Reported-by: Tigran Mkrtchyan
    Signed-off-by: Weston Andros Adamson
    Cc: stable@kernel.org
    Signed-off-by: Trond Myklebust

    Weston Andros Adamson
     

24 Feb, 2013

1 commit

  • Pass the directio request on pageio_init to clean up the API.

    Percolate pg_dreq from original nfs_pageio_descriptor to the
    pnfs_{read,write}_done_resend_to_mds and use it on respective
    call to nfs_pageio_init_{read,write} on the newly created
    nfs_pageio_descriptor.

    Reproduced by command:
    mount -o vers=4.1 server:/ /mnt
    dd bs=128k count=8 if=/dev/zero of=/mnt/dd.out oflag=direct

    BUG: unable to handle kernel NULL pointer dereference at 0000000000000028
    IP: [] atomic_inc+0x4/0x9 [nfs]
    PGD 34786067 PUD 34794067 PMD 0
    Oops: 0002 [#1] SMP
    Modules linked in: nfs_layout_nfsv41_files nfsv4 nfs nfsd lockd nfs_acl auth_rpcgss exportfs sunrpc btrfs zlib_deflate libcrc32c ipv6 autofs4
    CPU 1
    Pid: 259, comm: kworker/1:2 Not tainted 3.8.0-rc6 #2 Bochs Bochs
    RIP: 0010:[] [] atomic_inc+0x4/0x9 [nfs]
    RSP: 0018:ffff880038f8fa68 EFLAGS: 00010206
    RAX: ffffffffa021a6a9 RBX: ffff880038f8fb48 RCX: 00000000000a0000
    RDX: ffffffffa021e616 RSI: ffff8800385e9a40 RDI: 0000000000000028
    RBP: ffff880038f8fa68 R08: ffffffff81ad6720 R09: ffff8800385e9510
    R10: ffffffffa0228450 R11: ffff880038e87418 R12: ffff8800385e9a40
    R13: ffff8800385e9a70 R14: ffff880038f8fb38 R15: ffffffffa0148878
    FS: 0000000000000000(0000) GS:ffff88003e400000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
    CR2: 0000000000000028 CR3: 0000000034789000 CR4: 00000000000006e0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
    Process kworker/1:2 (pid: 259, threadinfo ffff880038f8e000, task ffff880038302480)
    Stack:
    ffff880038f8fa78 ffffffffa021a6bf ffff880038f8fa88 ffffffffa021bb82
    ffff880038f8fae8 ffffffffa021f454 ffff880038f8fae8 ffffffff8109689d
    ffff880038f8fab8 ffffffff00000006 0000000000000000 ffff880038f8fb48
    Call Trace:
    [] nfs_direct_pgio_init+0x16/0x18 [nfs]
    [] nfs_pgheader_init+0x6a/0x6c [nfs]
    [] nfs_generic_pg_writepages+0x51/0xf8 [nfs]
    [] ? mark_held_locks+0x71/0x99
    [] ? rpc_release_resources_task+0x37/0x37 [sunrpc]
    [] nfs_pageio_doio+0x1a/0x43 [nfs]
    [] nfs_pageio_complete+0x16/0x2c [nfs]
    [] pnfs_write_done_resend_to_mds+0x95/0xc5 [nfsv4]
    [] ? rpc_release_resources_task+0x37/0x37 [sunrpc]
    [] filelayout_reset_write+0x8c/0x99 [nfs_layout_nfsv41_files]
    [] filelayout_write_done_cb+0x4d/0xc1 [nfs_layout_nfsv41_files]
    [] nfs4_write_done+0x36/0x49 [nfsv4]
    [] nfs_writeback_done+0x53/0x1cc [nfs]
    [] nfs_writeback_done_common+0xe/0x10 [nfs]
    [] filelayout_write_call_done+0x28/0x2a [nfs_layout_nfsv41_files]
    [] rpc_exit_task+0x29/0x87 [sunrpc]
    [] __rpc_execute+0x11d/0x3cc [sunrpc]
    [] ? trace_hardirqs_on_caller+0x117/0x173
    [] rpc_async_schedule+0x27/0x32 [sunrpc]
    [] ? __rpc_execute+0x3cc/0x3cc [sunrpc]
    [] process_one_work+0x226/0x422
    [] ? process_one_work+0x159/0x422
    [] ? lock_acquired+0x210/0x249
    [] ? __rpc_execute+0x3cc/0x3cc [sunrpc]
    [] worker_thread+0x126/0x1c4
    [] ? manage_workers+0x240/0x240
    [] kthread+0xb1/0xb9
    [] ? __kthread_parkme+0x65/0x65
    [] ret_from_fork+0x7c/0xb0
    [] ? __kthread_parkme+0x65/0x65
    Code: 00 83 38 02 74 12 48 81 4b 50 00 00 01 00 c7 83 60 07 00 00 01 00 00 00 48 89 df e8 55 fe ff ff 5b 41 5c 5d c3 66 90 55 48 89 e5 ff 07 5d c3 55 48 89 e5 f0 ff 0f 0f 94 c0 84 c0 0f 95 c0 0f
    RIP [] atomic_inc+0x4/0x9 [nfs]
    RSP
    CR2: 0000000000000028

    Signed-off-by: Benny Halevy
    Cc: stable@kernel.org [>= 3.6]
    Signed-off-by: Trond Myklebust

    Benny Halevy
     

23 Feb, 2013

1 commit

  • Commit 73ca100 broke the code that prevents the client from deleting
    a silly renamed dentry. This affected "delete on last close"
    semantics as after that commit, nothing prevented removal of
    silly-renamed files. As a result, a process holding a file open
    could easily get an ESTALE on the file in a directory where some
    other process issued 'rm -rf some_dir_containing_the_file' twice.
    Before the commit, any attempt at unlinking silly renamed files would
    fail inside may_delete() with -EBUSY because of the
    DCACHE_NFSFS_RENAMED flag. The following testcase demonstrates
    the problem:
    tail -f /nfsmnt/dir/file &
    rm -rf /nfsmnt/dir
    rm -rf /nfsmnt/dir
    # second removal does not fail, 'tail' process receives ESTALE

    The problem with the above commit is that it unhashes the old and
    new dentries from the lookup path, even in the normal case when
    a signal is not encountered and it would have been safe to call
    d_move. Unfortunately the old dentry has the special
    DCACHE_NFSFS_RENAMED flag set on it. Unhashing has the
    side-effect that future lookups call d_alloc(), allocating a new
    dentry without the special flag for any silly-renamed files. As a
    result, subsequent calls to unlink silly renamed files do not fail
    but allow the removal to go through. This will result in ESTALE
    errors for any other process doing operations on the file.

    To fix this, go back to using d_move on success.
    For the signal case, it's unclear what we may safely do beyond d_drop.

    Reported-by: Dave Wysochanski
    Signed-off-by: Trond Myklebust
    Acked-by: Jeff Layton
    Cc: stable@vger.kernel.org

    Trond Myklebust
     

20 Feb, 2013

1 commit

  • Currently, nlmclnt_lock will break out of the for(;;) loop when
    the reclaimer wakes up the blocking lock thread by setting
    nlm_lck_denied_grace_period. This causes the lock request to fail
    with an ENOLCK error.
    The intention was always to ensure that we resend the lock request
    after the grace period has expired.

    Reported-by: Wangyuan Zhang
    Signed-off-by: Trond Myklebust
    Cc: stable@vger.kernel.org

    Trond Myklebust
     

18 Feb, 2013

3 commits

  • When the pNFS client is using the block layout, the blocklayoutdriver
    module can still be removed first. A later umount then causes an
    oops in unset_pnfs_layoutdriver, because
    nfss->pnfs_curr_ld->clear_layoutdriver is no longer valid.

    To reproduce:
    modprobe blocklayoutdriver
    mount -t nfs4 -o minorversion=1 pnfsip:/ /mnt/
    rmmod blocklayoutdriver
    umount /mnt

    The following oops then appears:

    CPU 0
    Pid: 17023, comm: umount.nfs4 Tainted: GF O 3.7.0-rc6-pnfs #1 VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform
    RIP: 0010:[] [] unset_pnfs_layoutdriver+0x1d/0x70 [nfsv4]
    RSP: 0018:ffff8800022d9e48 EFLAGS: 00010286
    RAX: ffffffffa04a1b00 RBX: ffff88000b013800 RCX: 0000000000000001
    RDX: ffffffff81ae8ee0 RSI: ffff880001ee94b8 RDI: ffff88000b013800
    RBP: ffff8800022d9e58 R08: 0000000000000001 R09: 0000000000000000
    R10: 0000000000000000 R11: 0000000000000000 R12: ffff880001ee9400
    R13: ffff8800105978c0 R14: 00007fff25846c08 R15: 0000000001bba550
    FS: 00007f45ae7f0700(0000) GS:ffff880012c00000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
    CR2: ffffffffa04a1b38 CR3: 0000000002c0c000 CR4: 00000000000006f0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
    Process umount.nfs4 (pid: 17023, threadinfo ffff8800022d8000, task ffff880006e48aa0)
    Stack:
    ffff8800105978c0 ffff88000b013800 ffff8800022d9e78 ffffffffa04cd0ce
    ffff8800022d9e78 ffff88000b013800 ffff8800022d9ea8 ffffffffa04755a7
    ffff8800022d9ea8 ffff880002f96400 ffff88000b013800 ffff880002f96400
    Call Trace:
    [] nfs4_destroy_server+0x1e/0x30 [nfsv4]
    [] nfs_free_server+0xb7/0x150 [nfs]
    [] nfs_kill_super+0x35/0x40 [nfs]
    [] deactivate_locked_super+0x45/0x70
    [] deactivate_super+0x4a/0x70
    [] mntput_no_expire+0xd2/0x130
    [] sys_umount+0x72/0xe0
    [] system_call_fastpath+0x16/0x1b
    Code: 06 e1 b8 ea ff ff ff eb 9e 0f 1f 44 00 00 55 48 89 e5 53 48 83 ec 08 66 66 66 66 90 48 8b 87 80 03 00 00 48 89 fb 48 85 c0 74 29 8b 40 38 48 85 c0 74 02 ff d0 48 8b 03 3e ff 48 04 0f 94 c2
    RIP [] unset_pnfs_layoutdriver+0x1d/0x70 [nfsv4]
    RSP
    CR2: ffffffffa04a1b38
    ---[ end trace 29f75aaedda058bf ]---

    Signed-off-by: fanchaoting
    Signed-off-by: Trond Myklebust
    Cc: stable@vger.kernel.org

    fanchaoting
     
  • smatch analysis:

    fs/nfs/getroot.c:130 nfs_get_root() info: redundant null
    check on name calling kfree()

    fs/nfs/unlink.c:272 nfs_async_unlink() info: redundant null
    check on devname_garbage calling kfree()

    Cc: Trond Myklebust
    Cc: linux-nfs@vger.kernel.org
    Signed-off-by: Tim Gardner
    Signed-off-by: Trond Myklebust

    Tim Gardner
     
  • layoutget's prepare hook can call rpc_exit with status = NFS4_OK (0).
    Because of this, nfs4_proc_layoutget can't depend on a 0 status to mean
    that the RPC was successfully sent, received and parsed.

    To fix this, use the result's len member to see if parsing took place.

    This fixes the following OOPS -- calling xdr_init_decode() with a buffer length
    0 doesn't set the stream's 'p' member and ends up using uninitialized memory
    in filelayout_decode_layout.

    BUG: unable to handle kernel paging request at 0000000000008050
    IP: [] memcpy+0x18/0x120
    PGD 0
    Oops: 0000 [#1] SMP
    last sysfs file: /sys/devices/pci0000:00/0000:00:11.0/0000:02:01.0/irq
    CPU 1
    Modules linked in: nfs_layout_nfsv41_files nfs lockd fscache auth_rpcgss nfs_acl autofs4 sunrpc ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 iptable_filter ip_tables ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables ipv6 dm_mirror dm_region_hash dm_log dm_mod ppdev parport_pc parport snd_ens1371 snd_rawmidi snd_ac97_codec ac97_bus snd_seq snd_seq_device snd_pcm snd_timer snd soundcore snd_page_alloc e1000 microcode vmware_balloon i2c_piix4 i2c_core sg shpchp ext4 mbcache jbd2 sr_mod cdrom sd_mod crc_t10dif pata_acpi ata_generic ata_piix mptspi mptscsih mptbase scsi_transport_spi [last unloaded: speedstep_lib]

    Pid: 1665, comm: flush-0:22 Not tainted 2.6.32-356-test-2 #2 VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform
    RIP: 0010:[] [] memcpy+0x18/0x120
    RSP: 0018:ffff88003dfab588 EFLAGS: 00010206
    RAX: ffff88003dc42000 RBX: ffff88003dfab610 RCX: 0000000000000009
    RDX: 000000003f807ff0 RSI: 0000000000008050 RDI: ffff88003dc42000
    RBP: ffff88003dfab5b0 R08: 0000000000000000 R09: 0000000000000000
    R10: 0000000000000000 R11: 0000000000000080 R12: 0000000000000024
    R13: ffff88003dc42000 R14: ffff88003f808030 R15: ffff88003dfab6a0
    FS: 0000000000000000(0000) GS:ffff880003420000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
    CR2: 0000000000008050 CR3: 000000003bc92000 CR4: 00000000001407e0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
    Process flush-0:22 (pid: 1665, threadinfo ffff88003dfaa000, task ffff880037f77540)
    Stack:
    ffffffffa0398ac1 ffff8800397c5940 ffff88003dfab610 ffff88003dfab6a0
    ffff88003dfab5d0 ffff88003dfab680 ffffffffa01c150b ffffea0000d82e70
    000000508116713b 0000000000000000 0000000000000000 0000000000000000
    Call Trace:
    [] ? xdr_inline_decode+0xb1/0x120 [sunrpc]
    [] filelayout_decode_layout+0xeb/0x350 [nfs_layout_nfsv41_files]
    [] filelayout_alloc_lseg+0x8c/0x3c0 [nfs_layout_nfsv41_files]
    [] ? __wait_on_bit+0x7e/0x90

    Signed-off-by: Weston Andros Adamson
    Signed-off-by: Trond Myklebust
    Cc: stable@vger.kernel.org

    Weston Andros Adamson
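
The idea of checking the decoded length rather than the status alone can be sketched as follows. The struct and `layout_was_decoded` are hypothetical simplifications for illustration, not the kernel's data structures.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified view of a LAYOUTGET result. */
struct layoutget_result {
    int status;   /* 0 may also come from an early exit in the prepare hook */
    size_t len;   /* bytes of layout actually received and parsed */
};

/* A zero status alone is not enough: require that something was parsed. */
bool layout_was_decoded(const struct layoutget_result *res)
{
    return res->status == 0 && res->len > 0;
}
```

A result with status 0 and length 0 models the early-exit case from the commit message and is rejected before any decoding is attempted.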
     

15 Feb, 2013

1 commit

  • The current code in pnfs_destroy_all_layouts() assumes that removing
    the layout from the server->layouts list is sufficient to make it
    invisible to other processes. This ignores the fact that most
    users access the layout through the nfs_inode->layout...
    There is further breakage due to lack of reference counting of the
    layouts, meaning that the whole thing Oopses at the drop of a hat.

    The code in initiate_bulk_draining() is almost correct, and can be
    used as a model for pnfs_destroy_all_layouts(), so move that
    code to pnfs.c, and refactor the code to allow us to choose between
    a single filesystem bulk recall, and a recall of all layouts.
    Also note that initiate_bulk_draining() currently calls iput() while
    holding locks. Fix that too.

    Signed-off-by: Trond Myklebust
    Cc: stable@vger.kernel.org

    Trond Myklebust
     

12 Feb, 2013

7 commits

  • Ensure that if nfs_wait_on_sequence() causes our rpc task to wait for
    an NFSv4 state serialisation lock, then we also drop the session slot.

    Signed-off-by: Trond Myklebust
    Cc: stable@vger.kernel.org

    Trond Myklebust
     
  • If the server reboots after it has replied to our OPEN, but before we
    call nfs4_opendata_to_nfs4_state(), then the reboot recovery thread
    will not see a stateid for this open, and so will fail to recover it.

    Signed-off-by: Trond Myklebust

    Trond Myklebust
     
  • Add a mutex to the struct nfs4_state_owner to ensure that delegation
    recall doesn't conflict with byte range lock removal.

    Note that we nest the new mutex _outside_ the state manager reclaim
    protection (nfsi->rwsem) in order to avoid deadlocks.

    Signed-off-by: Trond Myklebust

    Trond Myklebust
     
  • Adjust the return values so that they return EAGAIN to the caller in
    cases where we might want to retry the delegation recall after
    the state recovery has run.
    Note that we can't wait and retry in this routine, because the caller
    may be the state manager thread.

    If delegation recall fails due to a session or reboot related issue,
    also ensure that we mark the stateid as delegated so that
    nfs_delegation_claim_opens can find it again later.

    Signed-off-by: Trond Myklebust

    Trond Myklebust
     
  • If the server reboots while we are converting a delegation into
    OPEN/LOCK stateids as part of a delegation return, the current code
    will simply exit with an error. This causes us to lose both
    delegation state and locking state (i.e. locking atomicity).

    Deal with this by exposing the delegation stateid during delegation
    return, so that we can recover the delegation, and then resume
    open/lock recovery.

    Note that not having to hold the nfs_inode->rwsem across the
    calls to nfs_delegation_claim_opens() also fixes a deadlock against
    the NFSv4.1 reboot recovery code.

    Signed-off-by: Trond Myklebust

    Trond Myklebust
     
  • We currently have a deadlock in which the state recovery thread
    ends up blocking due to one of the locks which it is trying to
    recover holding the nfs_inode->rwsem.
    The situation is as follows: the state recovery thread is
    scheduled in order to recover from a reboot. It immediately
    drains the session, forcing all ordinary NFSv4.1 calls to
    nfs41_setup_sequence() to be put to sleep. This includes the
    file locking process that holds the nfs_inode->rwsem.
    When the thread gets to nfs4_reclaim_locks(), it tries to
    grab a write lock on nfs_inode->rwsem, and boom...

    Fix is to have the lock drop the nfs_inode->rwsem while it is
    doing RPC calls. We use a sequence lock in order to signal to
    the locking process whether or not a state recovery thread has
    run on that inode, in which case it should retry the lock.

    Reported-by: Andy Adamson
    Signed-off-by: Trond Myklebust

    Trond Myklebust
     
  • This patch adds a seqcount_t lock for use by the state manager to
    signal that an open owner has been recovered. This mechanism will be
    used by the delegation, open and byte range lock code in order to
    figure out if they need to replay requests due to collisions with
    lock recovery.

    Signed-off-by: Trond Myklebust

    Trond Myklebust
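
The replay-detection pattern described in this commit can be modelled with a plain counter. This is a single-threaded sketch with hypothetical names; it omits the memory barriers and in-progress tracking that a real kernel seqcount_t provides.

```c
#include <stdbool.h>

/* Hypothetical single-threaded model of a recovery sequence counter. */
struct recovery_seq { unsigned int count; };

/* Writer side: called by the (modelled) state manager after recovery. */
void recovery_done(struct recovery_seq *s)
{
    s->count++;
}

/* Reader side: sample the counter before starting an operation. */
unsigned int op_begin(const struct recovery_seq *s)
{
    return s->count;
}

/* Reader side: true if recovery ran since op_begin(), so replay. */
bool op_needs_replay(const struct recovery_seq *s, unsigned int start)
{
    return s->count != start;
}
```

An open, lock, or delegation operation would sample the counter first, do its work, and replay the request if the counter has moved in the meantime.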
     

01 Feb, 2013

2 commits

  • This reverts commit 324d003b0cd82151adbaecefef57b73f7959a469.

    The deadlock turned out to be caused by a workqueue limitation that has
    now been worked around in the RPC code (see comment in rpc_free_task).

    Signed-off-by: Trond Myklebust

    Trond Myklebust
     
  • Pull NFS client bugfixes from Trond Myklebust:

    - Error reporting in nfs_xdev_mount incorrectly maps all errors to
    ENOMEM

    - Fix an NFSv4 refcounting issue

    - Fix a mount failure when the server reboots during NFSv4 trunking
    discovery

    - NFSv4.1 mounts may need to run the lease recovery thread.

    - Don't silently fail setattr() requests on mountpoints

    - Fix a SUNRPC socket/transport livelock and priority queue issue

    - We must handle NFS4ERR_DELAY when resetting the NFSv4.1 session.

    * tag 'nfs-for-3.8-4' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
    NFSv4.1: Handle NFS4ERR_DELAY when resetting the NFSv4.1 session
    SUNRPC: When changing the queue priority, ensure that we change the owner
    NFS: Don't silently fail setattr() requests on mountpoints
    NFSv4.1: Ensure that nfs41_walk_client_list() does start lease recovery
    NFSv4: Fix NFSv4 trunking discovery
    NFSv4: Fix NFSv4 reference counting for trunked sessions
    NFS: Fix error reporting in nfs_xdev_mount

    Linus Torvalds
     

31 Jan, 2013

2 commits

  • NFS4ERR_DELAY is a legal reply when we call DESTROY_SESSION. It
    usually means that the server is busy handling an unfinished RPC
    request. Just sleep for a second and then retry.
    We also need to be able to handle the NFS4ERR_BACK_CHAN_BUSY return
    value. If the NFS server has outstanding callbacks, we just want to
    similarly sleep & retry.

    Signed-off-by: Trond Myklebust
    Cc: stable@vger.kernel.org

    Trond Myklebust
     
  • Ensure that any setattr and getattr requests for junctions and/or
    mountpoints are sent to the server. Ever since commit
    0ec26fd0698 (vfs: automount should ignore LOOKUP_FOLLOW), we have
    silently dropped any setattr requests to a server-side mountpoint.
    For referrals, we have silently dropped both getattr and setattr
    requests.

    This patch restores the original behaviour for setattr on mountpoints,
    and tries to do the same for referrals, provided that we have a
    filehandle...

    Signed-off-by: Trond Myklebust
    Cc: stable@vger.kernel.org

    Trond Myklebust
     

30 Jan, 2013

1 commit

  • Pull xfs bugfixes from Ben Myers:
    "Here are fixes for returning EFSCORRUPTED on probe of a non-xfs
    filesystem, the stack switch in xfs_bmapi_allocate, a crash in
    _xfs_buf_find, speculative preallocation as the filesystem nears
    ENOSPC, an unmount hang, a race with AIO, and a regression with
    xfs_fsr:

    - fix return value when filesystem probe finds no XFS magic, a
    regression introduced in 9802182.

    - fix stack switch in __xfs_bmapi_allocate by moving the check for
    stack switch up into xfs_bmapi_write.

    - fix oops in _xfs_buf_find by validating that the requested block is
    within the filesystem bounds.

    - limit speculative preallocation near ENOSPC.

    - fix an unmount hang in xfs_wait_buftarg by freeing the
    xfs_buf_log_item in xfs_buf_item_unlock.

    - fix a possible use after free with AIO.

    - fix xfs_swap_extents after removal of xfs_flushinval_pages, a
    regression introduced in commit fb59581404a."

    * tag 'for-linus-v3.8-rc6' of git://oss.sgi.com/xfs/xfs:
    xfs: Fix xfs_swap_extents() after removal of xfs_flushinval_pages()
    xfs: Fix possible use-after-free with AIO
    xfs: fix shutdown hang on invalid inode during create
    xfs: limit speculative prealloc near ENOSPC thresholds
    xfs: fix _xfs_buf_find oops on blocks beyond the filesystem end
    xfs: pull up stack_switch check into xfs_bmapi_write
    xfs: Do not return EFSCORRUPTED when filesystem probe finds no XFS magic

    Linus Torvalds
     

29 Jan, 2013

7 commits

  • Commit fb59581404ab7ec5075299065c22cb211a9262a9 removed
    xfs_flushinval_pages() and changed its callers to use
    filemap_write_and_wait() and truncate_pagecache_range() directly.

    But in xfs_swap_extents() this change accidentally switched the argument
    for 'tip' to 'ip'. This patch switches it back to 'tip'.

    Signed-off-by: Torsten Kaiser
    Reviewed-by: Ben Myers
    Signed-off-by: Ben Myers

    Torsten Kaiser
     
  • A running AIO pins the inode in memory through a file reference. Once
    the AIO is completed using aio_complete(), the file reference is put and
    the inode can be freed from memory. So we have to be sure that calling
    aio_complete() is the last thing we do with the inode.

    CC: xfs@oss.sgi.com
    CC: Ben Myers
    CC: stable@vger.kernel.org
    Signed-off-by: Jan Kara
    Reviewed-by: Ben Myers
    Signed-off-by: Ben Myers

    Jan Kara
     
  • When the new inode verify in xfs_iread() fails, the create
    transaction is aborted and a shutdown occurs. The subsequent unmount
    then hangs in xfs_wait_buftarg() on a buffer that has an elevated
    hold count. Debug showed that it was an AGI buffer getting stuck:

    [ 22.576147] XFS (vdb): buffer 0x2/0x1, hold 0x2 stuck
    [ 22.976213] XFS (vdb): buffer 0x2/0x1, hold 0x2 stuck
    [ 23.376206] XFS (vdb): buffer 0x2/0x1, hold 0x2 stuck
    [ 23.776325] XFS (vdb): buffer 0x2/0x1, hold 0x2 stuck

    The trace of this buffer leading up to the shutdown (trimmed for
    brevity) looks like:

    xfs_buf_init: bno 0x2 nblks 0x1 hold 1 caller xfs_buf_get_map
    xfs_buf_get: bno 0x2 len 0x200 hold 1 caller xfs_buf_read_map
    xfs_buf_read: bno 0x2 len 0x200 hold 1 caller xfs_trans_read_buf_map
    xfs_buf_iorequest: bno 0x2 nblks 0x1 hold 1 caller _xfs_buf_read
    xfs_buf_hold: bno 0x2 nblks 0x1 hold 1 caller xfs_buf_iorequest
    xfs_buf_rele: bno 0x2 nblks 0x1 hold 2 caller xfs_buf_iorequest
    xfs_buf_iowait: bno 0x2 nblks 0x1 hold 1 caller _xfs_buf_read
    xfs_buf_ioerror: bno 0x2 len 0x200 hold 1 caller xfs_buf_bio_end_io
    xfs_buf_iodone: bno 0x2 nblks 0x1 hold 1 caller _xfs_buf_ioend
    xfs_buf_iowait_done: bno 0x2 nblks 0x1 hold 1 caller _xfs_buf_read
    xfs_buf_hold: bno 0x2 nblks 0x1 hold 1 caller xfs_buf_item_init
    xfs_trans_read_buf: bno 0x2 len 0x200 hold 2 recur 0 refcount 1
    xfs_trans_brelse: bno 0x2 len 0x200 hold 2 recur 0 refcount 1
    xfs_buf_item_relse: bno 0x2 nblks 0x1 hold 2 caller xfs_trans_brelse
    xfs_buf_rele: bno 0x2 nblks 0x1 hold 2 caller xfs_buf_item_relse
    xfs_buf_unlock: bno 0x2 nblks 0x1 hold 1 caller xfs_trans_brelse
    xfs_buf_rele: bno 0x2 nblks 0x1 hold 1 caller xfs_trans_brelse
    xfs_buf_trylock: bno 0x2 nblks 0x1 hold 2 caller _xfs_buf_find
    xfs_buf_find: bno 0x2 len 0x200 hold 2 caller xfs_buf_get_map
    xfs_buf_get: bno 0x2 len 0x200 hold 2 caller xfs_buf_read_map
    xfs_buf_read: bno 0x2 len 0x200 hold 2 caller xfs_trans_read_buf_map
    xfs_buf_hold: bno 0x2 nblks 0x1 hold 2 caller xfs_buf_item_init
    xfs_trans_read_buf: bno 0x2 len 0x200 hold 3 recur 0 refcount 1
    xfs_trans_log_buf: bno 0x2 len 0x200 hold 3 recur 0 refcount 1
    xfs_buf_item_unlock: bno 0x2 len 0x200 hold 3 flags DIRTY liflags ABORTED
    xfs_buf_unlock: bno 0x2 nblks 0x1 hold 3 caller xfs_buf_item_unlock
    xfs_buf_rele: bno 0x2 nblks 0x1 hold 3 caller xfs_buf_item_unlock

    That is the AGI buffer's history, from the cold cache read into memory
    through to transaction abort. You can see that at transaction abort the
    bli is dirty and only has a single reference. The item is not pinned,
    and it's not in the AIL. Hence the only reference to it is this
    transaction.

    The problem is that the xfs_buf_item_unlock() call is dropping the
    last reference to the xfs_buf_log_item attached to the buffer (which
    holds a reference to the buffer), but it is not freeing the
    xfs_buf_log_item. Hence nothing will ever release the buffer, and
    the unmount hangs waiting for this reference to go away.

    The fix is simple - xfs_buf_item_unlock needs to detect the last
    reference going away in this case and free the xfs_buf_log_item to
    release the reference it holds on the buffer.

    Signed-off-by: Dave Chinner
    Reviewed-by: Ben Myers
    Signed-off-by: Ben Myers

    Dave Chinner
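
The reference leak described in the commit above can be illustrated with a toy model. This is a minimal sketch in Python, not the kernel code: `Buffer`, `BufLogItem`, `item_unlock` and `abort_and_release` are hypothetical stand-ins for xfs_buf, xfs_buf_log_item and xfs_buf_item_unlock(), and the `free_on_last_ref` flag is an assumption used only to switch between the old and the fixed behaviour.

```python
class Buffer:
    """Toy stand-in for an xfs_buf; tracks only the hold count."""

    def __init__(self):
        self.hold = 0


class BufLogItem:
    """Toy stand-in for an xfs_buf_log_item attached to a buffer."""

    def __init__(self, buf):
        self.buf = buf
        self.refcount = 1      # the transaction's reference to the item
        self.freed = False
        buf.hold += 1          # the item holds a reference on the buffer

    def free(self):
        # Freeing the item releases the buffer reference it holds.
        if not self.freed:
            self.freed = True
            self.buf.hold -= 1


def item_unlock(bli, free_on_last_ref):
    """Drop the transaction's reference to the item at abort time."""
    bli.refcount -= 1
    if bli.refcount == 0 and free_on_last_ref:
        bli.free()


def abort_and_release(free_on_last_ref):
    """Return the buffer hold count left over after a transaction abort."""
    buf = Buffer()
    buf.hold += 1              # reference taken by the read in the transaction
    bli = BufLogItem(buf)      # hold is now 2
    item_unlock(bli, free_on_last_ref)
    buf.hold -= 1              # the transaction releases its own reference
    return buf.hold
```

In this model, `abort_and_release(False)` leaves one hold behind, which corresponds to the "stuck" buffer at unmount, while `abort_and_release(True)` lets the hold count reach zero.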
     
  • There is a window on small filesystems where speculative
    preallocation can be larger than the ENOSPC throttling thresholds,
    resulting in speculative preallocation trying to reserve more space
    than there is space available. This causes immediate ENOSPC to be
    triggered, prealloc to be turned off and flushing to occur. On the
    next write (i.e. next 4k page), we do exactly the same thing, and so
    effectively drive into synchronous 4k writes by triggering ENOSPC
    flushing on every page while in the window between the prealloc size
    and the ENOSPC prealloc throttle threshold.

    Fix this by checking to see if the prealloc size would consume all
    free space, and throttle it appropriately to avoid premature
    ENOSPC...

    Signed-off-by: Dave Chinner
    Reviewed-by: Brian Foster
    Signed-off-by: Ben Myers

    Dave Chinner
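
The throttling idea in the commit above can be sketched as follows. This is a hedged illustration in Python, not the XFS implementation: the function name `throttle_prealloc`, its parameters, and the shift of 4 bits per step are all assumptions chosen for the example. It only shows the principle of shrinking a speculative preallocation until it no longer consumes all free space.

```python
def throttle_prealloc(alloc_blocks, free_blocks, shift=4):
    """Shrink a speculative preallocation that would exhaust free space.

    alloc_blocks: requested preallocation size in blocks (hypothetical unit)
    free_blocks:  free blocks currently available
    shift:        how aggressively to shrink per step (assumed value)
    """
    # Keep scaling the request down while it would consume all free space.
    # If it shrinks to zero, no speculative preallocation is done at all.
    while alloc_blocks and alloc_blocks >= free_blocks:
        alloc_blocks >>= shift
    return alloc_blocks
```

A request that already fits is returned unchanged, so the throttle only affects writes that fall inside the window the commit message describes.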
     
  • When _xfs_buf_find is passed an out of range address, it will fail
    to find a relevant struct xfs_perag and oops with a null
    dereference. This can happen when trying to walk a filesystem with a
    metadata inode that has a partially corrupted extent map (i.e. the
    block number returned is corrupt, but is otherwise intact) and we
    try to read from the corrupted block address.

    In this case, just fail the lookup. If it is readahead being issued,
    it will simply not be done, but if it is a real read that fails we
    will get an error being reported. Ideally this case should result
    in an EFSCORRUPTED error being reported, but we cannot return an
    error through xfs_buf_read() or xfs_buf_get() so this lookup failure
    may result in ENOMEM or EIO errors being reported instead.

    Signed-off-by: Dave Chinner
    Reviewed-by: Brian Foster
    Reviewed-by: Ben Myers
    Signed-off-by: Ben Myers

    Dave Chinner
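
The behaviour described in the commit above (fail the lookup instead of dereferencing a missing per-AG structure) can be modelled roughly as below. This is a simplified Python sketch under stated assumptions: `find_perag` and `read_buffer` are hypothetical names, the filesystem is modelled as equally sized allocation groups, and EIO is picked as the reported error because the commit message names it as one possible outcome.

```python
import errno


def find_perag(perags, ag_blocks, blkno):
    """Return the per-AG entry covering blkno, or None if out of range."""
    total_blocks = ag_blocks * len(perags)
    if blkno < 0 or blkno >= total_blocks:
        # Out-of-range address: fail the lookup rather than crash.
        return None
    return perags[blkno // ag_blocks]


def read_buffer(perags, ag_blocks, blkno, readahead=False):
    """Look up a block; readahead is skipped quietly, real reads error out."""
    pag = find_perag(perags, ag_blocks, blkno)
    if pag is None:
        if readahead:
            return None        # readahead is simply not done
        raise OSError(errno.EIO, "block address beyond end of filesystem")
    return (pag, blkno)
```

With two groups of 100 blocks each, `read_buffer(["ag0", "ag1"], 100, 150)` resolves to the second group, while a block number of 5000 returns None for readahead and raises OSError for a real read.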
     
  • The stack_switch check currently occurs in __xfs_bmapi_allocate,
    which means the stack switch only occurs when xfs_bmapi_allocate()
    is called in a loop. Pull the check up before the loop in
    xfs_bmapi_write() such that the first iteration of the loop has
    consistent behavior.

    Signed-off-by: Brian Foster
    Reviewed-by: Dave Chinner
    Signed-off-by: Ben Myers

    Brian Foster
     
  • 9802182 changed the return value from EWRONGFS (aka EINVAL)
    to EFSCORRUPTED which doesn't seem to be handled properly by
    the root filesystem probe.

    Signed-off-by: Eric Sandeen
    Tested-by: Sergei Trofimovich
    Reviewed-by: Ben Myers
    Signed-off-by: Ben Myers

    Eric Sandeen
     

28 Jan, 2013

5 commits

  • The recent commit fb6791d100d1bba20b5cdbc4912e1f7086ec60f8
    included the wrong logic. The lvbptr check was incorrectly
    added after the patch was tested.

    Signed-off-by: David Teigland
    Signed-off-by: Steven Whitehouse

    David Teigland
     
  • We do need to start the lease recovery thread prior to waiting for the
    client initialisation to complete in NFSv4.1.

    Signed-off-by: Trond Myklebust
    Cc: Chuck Lever
    Cc: Ben Greear
    Cc: stable@vger.kernel.org [>=3.7]

    Trond Myklebust
     
  • If walking the list in nfs4[01]_walk_client_list fails, then the most
    likely explanation is that the server dropped the clientid before we
    actually managed to confirm it. As long as our nfs_client is the very
    last one in the list to be tested, the caller can be assured that this
    is the case when the final return value is NFS4ERR_STALE_CLIENTID.

    Reported-by: Ben Greear
    Signed-off-by: Trond Myklebust
    Cc: Chuck Lever
    Cc: stable@vger.kernel.org [>=3.7]
    Tested-by: Ben Greear

    Trond Myklebust
     
  • The reference counting in nfs4_init_client assumes wrongly that it
    is safe for nfs4_discover_server_trunking() to return a pointer to a
    nfs_client prior to bumping the reference count.

    Signed-off-by: Trond Myklebust
    Cc: Chuck Lever
    Cc: Ben Greear
    Cc: stable@vger.kernel.org [>=3.7]

    Trond Myklebust
     
  • Currently, nfs_xdev_mount converts all errors from clone_server() to
    ENOMEM, which can then leak to userspace (for instance to 'mount'). Fix that.
    Also ensure that if nfs_fs_mount_common() returns an error, we
    don't dprintk(0)...

    The regression originated in commit 3d176e3fe4f6dc379b252bf43e2e146a8f7caf01
    (NFS: Use nfs_fs_mount_common() for xdev mounts)

    Signed-off-by: Trond Myklebust
    Cc: stable@vger.kernel.org [>= 3.5]

    Trond Myklebust
     

26 Jan, 2013

1 commit

  • Pull btrfs fixes from Chris Mason:
    "It turns out that we had two crc bugs when running fsx-linux in a
    loop. Many thanks to Josef, Miao Xie, and Dave Sterba for nailing it
    all down. Miao also has a new OOM fix in this v2 pull as well.

    Ilya fixed a regression Liu Bo found in the balance ioctls for pausing
    and resuming a running balance across drives.

    Josef's orphan truncate patch fixes an obscure corruption we'd see
    during xfstests.

    Arne's patches address problems with subvolume quotas. If the user
    destroys quota groups incorrectly the FS will refuse to mount.

    The rest are smaller fixes and plugs for memory leaks."

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (30 commits)
    Btrfs: fix repeated delalloc work allocation
    Btrfs: fix wrong max device number for single profile
    Btrfs: fix missed transaction->aborted check
    Btrfs: Add ACCESS_ONCE() to transaction->abort accesses
    Btrfs: put csums on the right ordered extent
    Btrfs: use right range to find checksum for compressed extents
    Btrfs: fix panic when recovering tree log
    Btrfs: do not allow logged extents to be merged or removed
    Btrfs: fix a regression in balance usage filter
    Btrfs: prevent qgroup destroy when there are still relations
    Btrfs: ignore orphan qgroup relations
    Btrfs: reorder locks and sanity checks in btrfs_ioctl_defrag
    Btrfs: fix unlock order in btrfs_ioctl_rm_dev
    Btrfs: fix unlock order in btrfs_ioctl_resize
    Btrfs: fix "mutually exclusive op is running" error code
    Btrfs: bring back balance pause/resume logic
    btrfs: update timestamps on truncate()
    btrfs: fix btrfs_cont_expand() freeing IS_ERR em
    Btrfs: fix a bug when llseek for delalloc bytes behind prealloc extents
    Btrfs: fix off-by-one in lseek
    ...

    Linus Torvalds
     

25 Jan, 2013

3 commits

  • Pull cifs fixes from Steve French:
    "Two small cifs fixes"

    * 'for-next' of git://git.samba.org/sfrench/cifs-2.6:
    fs/cifs/cifs_dfs_ref.c: fix potential memory leakage
    cifs: fix srcip_matches() for ipv6

    Linus Torvalds
     
  • btrfs_start_delalloc_inodes() locks the delalloc_inodes list, fetches the
    first inode, unlocks the list, triggers btrfs_alloc_delalloc_work/
    btrfs_queue_worker for this inode, and then locks the list and checks the
    head of the list again. But because the inode it has just dealt with is
    never deleted from the list, it fetches the same inode again. As a result,
    this function allocates a huge number of btrfs_delalloc_work structures,
    and OOM happens.

    Fix this problem by splicing the delalloc list.

    Reported-by: Alex Lyakas
    Signed-off-by: Miao Xie
    Signed-off-by: Josef Bacik

    Miao Xie
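
The difference between refetching the list head and splicing the list, as described in the commit above, can be shown with a small model. This is a sketch in Python under stated assumptions, not the btrfs code: both function names are hypothetical, `limit` exists only to stop the runaway loop in the example, and putting each inode back on the shared list after queueing it is an assumption about how the inodes stay tracked.

```python
from collections import deque


def start_delalloc_refetch_head(delalloc_inodes, limit):
    """Old pattern: the head is never removed, so it is fetched repeatedly."""
    work = []
    while delalloc_inodes and len(work) < limit:
        inode = delalloc_inodes[0]     # same head inode on every pass
        work.append(inode)             # one work item allocated per pass
    return work


def start_delalloc_spliced(delalloc_inodes):
    """Fixed pattern: move everything to a private list, then drain it."""
    private = deque(delalloc_inodes)   # splice shared list onto a local one
    delalloc_inodes.clear()
    work = []
    while private:
        inode = private.popleft()
        delalloc_inodes.append(inode)  # inode stays tracked on the shared list
        work.append(inode)             # exactly one work item per inode
    return work
```

Draining a private list guarantees the loop terminates after one pass over the inodes that were present when it started, regardless of what happens to the shared list in the meantime.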
     
  • The max device number of single profile is 1, not 0 (0 means 'as many as
    possible'). Fix it.

    Cc: Liu Bo
    Signed-off-by: Miao Xie
    Reviewed-by: Liu Bo
    Signed-off-by: Josef Bacik

    Miao Xie