01 Mar, 2013
3 commits
-
The client will currently try LAYOUTGETs forever if a server is returning
NFS4ERR_LAYOUTTRYLATER or NFS4ERR_RECALLCONFLICT - even if the client no
longer needs the layout (ie process killed, unmounted).This patch uses the DS timeout value (module parameter 'dataserver_timeo'
via rpc layer) to set an upper limit of how long the client tries LATOUTGETs
in this situation. Once the timeout is reached, IO is redirected to the MDS.This also changes how the client checks if a layout is on the clp list
to avoid a double list_add.Signed-off-by: Weston Andros Adamson
Signed-off-by: Trond Myklebust -
The client should have 60 second default timeouts for DS operations, not 6
seconds.NFS4_DEF_DS_TIMEO is used as "timeout in tenths of a second" in
nfs_init_timeout_values (and is not used anywhere else).
This matches up with the description of the module param dataserver_timeo.Signed-off-by: Weston Andros Adamson
Signed-off-by: Trond Myklebust -
If we don't release the open seqid before we wait for state recovery,
then we may end up deadlocking the state recovery thread.
This patch addresses a new deadlock that was introduced by
commit c21443c2c792cd9b463646d982b0fe48aa6feb0f (NFSv4: Fix a reboot
recovery race when opening a file)Reported-by: Andy Adamson
Signed-off-by: Trond Myklebust
28 Feb, 2013
1 commit
-
Benny Halevy reported the following oops when testing RHEL6:
nfs_update_inode: inode 892950 mode changed, 0040755 to 0100644
BUG: unable to handle kernel NULL pointer dereference at (null)
IP: [] nfs_closedir+0x15/0x30 [nfs]
PGD 81448a067 PUD 831632067 PMD 0
Oops: 0000 [#1] SMP
last sysfs file: /sys/kernel/mm/redhat_transparent_hugepage/enabled
CPU 6
Modules linked in: fuse bonding 8021q garp ebtable_nat ebtables be2iscsi iscsi_boot_sysfs bnx2i cnic uio cxgb4i cxgb4 cxgb3i libcxgbi cxgb3 mdio ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi softdog bridge stp llc xt_physdev ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 xt_multiport iptable_filter ip_tables ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables ipv6 dm_round_robin dm_multipath objlayoutdriver2(U) nfs(U) lockd fscache auth_rpcgss nfs_acl sunrpc vhost_net macvtap macvlan tun kvm_intel kvm be2net igb dca ptp pps_core microcode serio_raw sg iTCO_wdt iTCO_vendor_support i7core_edac edac_core shpchp ext4 mbcache jbd2 sd_mod crc_t10dif ahci dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]Pid: 6332, comm: dd Not tainted 2.6.32-358.el6.x86_64 #1 HP ProLiant DL170e G6 /ProLiant DL170e G6
RIP: 0010:[] [] nfs_closedir+0x15/0x30 [nfs]
RSP: 0018:ffff88081458bb98 EFLAGS: 00010292
RAX: ffffffffa02a52b0 RBX: 0000000000000000 RCX: 0000000000000003
RDX: ffffffffa02e45a0 RSI: ffff88081440b300 RDI: ffff88082d5f5760
RBP: ffff88081458bba8 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000772 R11: 0000000000400004 R12: 0000000040000008
R13: ffff88082d5f5760 R14: ffff88082d6e8800 R15: ffff88082f12d780
FS: 00007f728f37e700(0000) GS:ffff8800456c0000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000000 CR3: 0000000831279000 CR4: 00000000000007e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process dd (pid: 6332, threadinfo ffff88081458a000, task ffff88082fa0e040)
Stack:
0000000040000008 ffff88081440b300 ffff88081458bbf8 ffffffff81182745
ffff88082d5f5760 ffff88082d6e8800 ffff88081458bbf8 ffffffffffffffea
ffff88082f12d780 ffff88082d6e8800 ffffffffa02a50a0 ffff88082d5f5760
Call Trace:
[] __fput+0xf5/0x210
[] ? do_open+0x0/0x20 [nfs]
[] fput+0x25/0x30
[] __dentry_open+0x27e/0x360
[] ? inotify_d_instantiate+0x2a/0x60
[] lookup_instantiate_filp+0x69/0x90
[] nfs_intent_set_file+0x59/0x90 [nfs]
[] nfs_atomic_lookup+0x1bb/0x310 [nfs]
[] __lookup_hash+0x102/0x160
[] ? selinux_inode_permission+0x72/0xb0
[] lookup_hash+0x3a/0x50
[] do_filp_open+0x2eb/0xdd0
[] ? __do_page_fault+0x1ec/0x480
[] ? alloc_fd+0x92/0x160
[] do_sys_open+0x69/0x140
[] ? sys_lseek+0x66/0x80
[] sys_open+0x20/0x30
[] system_call_fastpath+0x16/0x1b
Code: 65 48 8b 04 25 c8 cb 00 00 83 a8 44 e0 ff ff 01 5b 41 5c c9 c3 90 55 48 89 e5 53 48 83 ec 08 0f 1f 44 00 00 48 8b 9e a0 00 00 00 8b 3b e8 13 0c f7 ff 48 89 df e8 ab 3d ec e0 48 83 c4 08 31
RIP [] nfs_closedir+0x15/0x30 [nfs]
RSP
CR2: 0000000000000000I think this is ultimately due to a bug on the server. The client had
previously found a directory dentry. It then later tried to do an atomic
open on a new (regular file) dentry. The attributes it got back had the
same filehandle as the previously found directory inode. It then tried
to put the filp because it failed the aops tests for O_DIRECT opens, and
oopsed here because the ctx was still NULL.Obviously the root cause here is a server issue, but we can take steps
to mitigate this on the client. When nfs_fhget is called, we always know
what type of inode it is. In the event that there's a broken or
malicious server on the other end of the wire, the client can end up
crashing because the wrong ops are set on it.Have nfs_find_actor check that the inode type is correct after checking
the fileid. The fileid check should rarely ever match, so it should only
rarely ever get to this check. In the case where we have a broken
server, we may see two different inodes with the same i_ino, but the
client should be able to cope with them without crashing.This should fix the oops reported here:
https://bugzilla.redhat.com/show_bug.cgi?id=913660
Reported-by: Benny Halevy
Signed-off-by: Jeff Layton
Signed-off-by: Trond Myklebust
26 Feb, 2013
1 commit
-
This fixes an oops where a LAYOUTGET is in still in the rpciod queue,
but the requesting processes has been killed. Without this, killing
the process does the final pnfs_put_layout_hdr() and sets NFS_I(inode)->layout
to NULL while the LAYOUTGET rpc task still references it.Example oops:
BUG: unable to handle kernel NULL pointer dereference at 0000000000000080
IP: [] pnfs_choose_layoutget_stateid+0x37/0xef [nfsv4]
PGD 7365b067 PUD 7365d067 PMD 0
Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
Modules linked in: nfs_layout_nfsv41_files nfsv4 auth_rpcgss nfs lockd sunrpc ipt_MASQUERADE ip6table_mangle ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 iptable_nat nf_nat_ipv4 nf_nat iptable_mangle ip6table_filter ip6_tables ppdev e1000 i2c_piix4 i2c_core shpchp parport_pc parport crc32c_intel aesni_intel xts aes_x86_64 lrw gf128mul ablk_helper cryptd mptspi scsi_transport_spi mptscsih mptbase floppy autofs4
CPU 0
Pid: 27, comm: kworker/0:1 Not tainted 3.8.0-dros_cthon2013+ #4 VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform
RIP: 0010:[] [] pnfs_choose_layoutget_stateid+0x37/0xef [nfsv4]
RSP: 0018:ffff88007b0c1c88 EFLAGS: 00010246
RAX: ffff88006ed36678 RBX: 0000000000000000 RCX: 0000000ea877e3bc
RDX: ffff88007a729da8 RSI: 0000000000000000 RDI: ffff88007a72b958
RBP: ffff88007b0c1ca8 R08: 0000000000000002 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: ffff88007a72b958
R13: ffff88007a729da8 R14: 0000000000000000 R15: ffffffffa011077e
FS: 0000000000000000(0000) GS:ffff88007f600000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000080 CR3: 00000000735f8000 CR4: 00000000001407f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process kworker/0:1 (pid: 27, threadinfo ffff88007b0c0000, task ffff88007c2fa0c0)
Stack:
ffff88006fc05388 ffff88007a72b908 ffff88007b240900 ffff88006fc05388
ffff88007b0c1cd8 ffffffffa01a2170 ffff88007b240900 ffff88007b240900
ffff88007b240970 ffffffffa011077e ffff88007b0c1ce8 ffffffffa0110791
Call Trace:
[] nfs4_layoutget_prepare+0x7b/0x92 [nfsv4]
[] ? __rpc_atrun+0x15/0x15 [sunrpc]
[] rpc_prepare_task+0x13/0x15 [sunrpc]Reported-by: Tigran Mkrtchyan
Signed-off-by: Weston Andros Adamson
Cc: stable@kernel.org
Signed-off-by: Trond Myklebust
24 Feb, 2013
1 commit
-
Pass the directio request on pageio_init to clean up the API.
Percolate pg_dreq from original nfs_pageio_descriptor to the
pnfs_{read,write}_done_resend_to_mds and use it on respective
call to nfs_pageio_init_{read,write} on the newly created
nfs_pageio_descriptor.Reproduced by command:
mount -o vers=4.1 server:/ /mnt
dd bs=128k count=8 if=/dev/zero of=/mnt/dd.out oflag=directBUG: unable to handle kernel NULL pointer dereference at 0000000000000028
IP: [] atomic_inc+0x4/0x9 [nfs]
PGD 34786067 PUD 34794067 PMD 0
Oops: 0002 [#1] SMP
Modules linked in: nfs_layout_nfsv41_files nfsv4 nfs nfsd lockd nfs_acl auth_rpcgss exportfs sunrpc btrfs zlib_deflate libcrc32c ipv6 autofs4
CPU 1
Pid: 259, comm: kworker/1:2 Not tainted 3.8.0-rc6 #2 Bochs Bochs
RIP: 0010:[] [] atomic_inc+0x4/0x9 [nfs]
RSP: 0018:ffff880038f8fa68 EFLAGS: 00010206
RAX: ffffffffa021a6a9 RBX: ffff880038f8fb48 RCX: 00000000000a0000
RDX: ffffffffa021e616 RSI: ffff8800385e9a40 RDI: 0000000000000028
RBP: ffff880038f8fa68 R08: ffffffff81ad6720 R09: ffff8800385e9510
R10: ffffffffa0228450 R11: ffff880038e87418 R12: ffff8800385e9a40
R13: ffff8800385e9a70 R14: ffff880038f8fb38 R15: ffffffffa0148878
FS: 0000000000000000(0000) GS:ffff88003e400000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000028 CR3: 0000000034789000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process kworker/1:2 (pid: 259, threadinfo ffff880038f8e000, task ffff880038302480)
Stack:
ffff880038f8fa78 ffffffffa021a6bf ffff880038f8fa88 ffffffffa021bb82
ffff880038f8fae8 ffffffffa021f454 ffff880038f8fae8 ffffffff8109689d
ffff880038f8fab8 ffffffff00000006 0000000000000000 ffff880038f8fb48
Call Trace:
[] nfs_direct_pgio_init+0x16/0x18 [nfs]
[] nfs_pgheader_init+0x6a/0x6c [nfs]
[] nfs_generic_pg_writepages+0x51/0xf8 [nfs]
[] ? mark_held_locks+0x71/0x99
[] ? rpc_release_resources_task+0x37/0x37 [sunrpc]
[] nfs_pageio_doio+0x1a/0x43 [nfs]
[] nfs_pageio_complete+0x16/0x2c [nfs]
[] pnfs_write_done_resend_to_mds+0x95/0xc5 [nfsv4]
[] ? rpc_release_resources_task+0x37/0x37 [sunrpc]
[] filelayout_reset_write+0x8c/0x99 [nfs_layout_nfsv41_files]
[] filelayout_write_done_cb+0x4d/0xc1 [nfs_layout_nfsv41_files]
[] nfs4_write_done+0x36/0x49 [nfsv4]
[] nfs_writeback_done+0x53/0x1cc [nfs]
[] nfs_writeback_done_common+0xe/0x10 [nfs]
[] filelayout_write_call_done+0x28/0x2a [nfs_layout_nfsv41_files]
[] rpc_exit_task+0x29/0x87 [sunrpc]
[] __rpc_execute+0x11d/0x3cc [sunrpc]
[] ? trace_hardirqs_on_caller+0x117/0x173
[] rpc_async_schedule+0x27/0x32 [sunrpc]
[] ? __rpc_execute+0x3cc/0x3cc [sunrpc]
[] process_one_work+0x226/0x422
[] ? process_one_work+0x159/0x422
[] ? lock_acquired+0x210/0x249
[] ? __rpc_execute+0x3cc/0x3cc [sunrpc]
[] worker_thread+0x126/0x1c4
[] ? manage_workers+0x240/0x240
[] kthread+0xb1/0xb9
[] ? __kthread_parkme+0x65/0x65
[] ret_from_fork+0x7c/0xb0
[] ? __kthread_parkme+0x65/0x65
Code: 00 83 38 02 74 12 48 81 4b 50 00 00 01 00 c7 83 60 07 00 00 01 00 00 00 48 89 df e8 55 fe ff ff 5b 41 5c 5d c3 66 90 55 48 89 e5 ff 07 5d c3 55 48 89 e5 f0 ff 0f 0f 94 c0 84 c0 0f 95 c0 0f
RIP [] atomic_inc+0x4/0x9 [nfs]
RSP
CR2: 0000000000000028Signed-off-by: Benny Halevy
Cc: stable@kernel.org [>= 3.6]
Signed-off-by: Trond Myklebust
23 Feb, 2013
1 commit
-
Commit 73ca100 broke the code that prevents the client from deleting
a silly renamed dentry. This affected "delete on last close"
semantics as after that commit, nothing prevented removal of
silly-renamed files. As a result, a process holding a file open
could easily get an ESTALE on the file in a directory where some
other process issued 'rm -rf some_dir_containing_the_file' twice.
Before the commit, any attempt at unlinking silly renamed files would
fail inside may_delete() with -EBUSY because of the
DCACHE_NFSFS_RENAMED flag. The following testcase demonstrates
the problem:
tail -f /nfsmnt/dir/file &
rm -rf /nfsmnt/dir
rm -rf /nfsmnt/dir
# second removal does not fail, 'tail' process receives ESTALEThe problem with the above commit is that it unhashes the old and
new dentries from the lookup path, even in the normal case when
a signal is not encountered and it would have been safe to call
d_move. Unfortunately the old dentry has the special
DCACHE_NFSFS_RENAMED flag set on it. Unhashing has the
side-effect that future lookups call d_alloc(), allocating a new
dentry without the special flag for any silly-renamed files. As a
result, subsequent calls to unlink silly renamed files do not fail
but allow the removal to go through. This will result in ESTALE
errors for any other process doing operations on the file.To fix this, go back to using d_move on success.
For the signal case, it's unclear what we may safely do beyond d_drop.Reported-by: Dave Wysochanski
Signed-off-by: Trond Myklebust
Acked-by: Jeff Layton
Cc: stable@vger.kernel.org
20 Feb, 2013
1 commit
-
Currently, nlmclnt_lock will break out of the for(;;) loop when
the reclaimer wakes up the blocking lock thread by setting
nlm_lck_denied_grace_period. This causes the lock request to fail
with an ENOLCK error.
The intention was always to ensure that we resend the lock request
after the grace period has expired.Reported-by: Wangyuan Zhang
Signed-off-by: Trond Myklebust
Cc: stable@vger.kernel.org
18 Feb, 2013
3 commits
-
now pnfs client uses block layout, maybe we can remove
blocklayoutdriver first. if we umount later,
it can cause oops in unset_pnfs_layoutdriver.
because nfss->pnfs_curr_ld->clear_layoutdriver is invalid.reproduce it:
modprobe blocklayoutdriver
mount -t nfs4 -o minorversion=1 pnfsip:/ /mnt/
rmmod blocklayoutdriver
umount /mntthen you can see following
CPU 0
Pid: 17023, comm: umount.nfs4 Tainted: GF O 3.7.0-rc6-pnfs #1 VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform
RIP: 0010:[] [] unset_pnfs_layoutdriver+0x1d/0x70 [nfsv4]
RSP: 0018:ffff8800022d9e48 EFLAGS: 00010286
RAX: ffffffffa04a1b00 RBX: ffff88000b013800 RCX: 0000000000000001
RDX: ffffffff81ae8ee0 RSI: ffff880001ee94b8 RDI: ffff88000b013800
RBP: ffff8800022d9e58 R08: 0000000000000001 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: ffff880001ee9400
R13: ffff8800105978c0 R14: 00007fff25846c08 R15: 0000000001bba550
FS: 00007f45ae7f0700(0000) GS:ffff880012c00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: ffffffffa04a1b38 CR3: 0000000002c0c000 CR4: 00000000000006f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process umount.nfs4 (pid: 17023, threadinfo ffff8800022d8000, task ffff880006e48aa0)
Stack:
ffff8800105978c0 ffff88000b013800 ffff8800022d9e78 ffffffffa04cd0ce
ffff8800022d9e78 ffff88000b013800 ffff8800022d9ea8 ffffffffa04755a7
ffff8800022d9ea8 ffff880002f96400 ffff88000b013800 ffff880002f96400
Call Trace:
[] nfs4_destroy_server+0x1e/0x30 [nfsv4]
[] nfs_free_server+0xb7/0x150 [nfs]
[] nfs_kill_super+0x35/0x40 [nfs]
[] deactivate_locked_super+0x45/0x70
[] deactivate_super+0x4a/0x70
[] mntput_no_expire+0xd2/0x130
[] sys_umount+0x72/0xe0
[] system_call_fastpath+0x16/0x1b
Code: 06 e1 b8 ea ff ff ff eb 9e 0f 1f 44 00 00 55 48 89 e5 53 48 83 ec 08 66 66 66 66 90 48 8b 87 80 03 00 00 48 89 fb 48 85 c0 74 29 8b 40 38 48 85 c0 74 02 ff d0 48 8b 03 3e ff 48 04 0f 94 c2
RIP [] unset_pnfs_layoutdriver+0x1d/0x70 [nfsv4]
RSP
CR2: ffffffffa04a1b38
---[ end trace 29f75aaedda058bf ]---Signed-off-by: fanchaoting
Signed-off-by: Trond Myklebust
Cc: stable@vger.kernel.org -
smatch analysis:
fs/nfs/getroot.c:130 nfs_get_root() info: redundant null
check on name calling kfree()fs/nfs/unlink.c:272 nfs_async_unlink() info: redundant null
check on devname_garbage calling kfree()Cc: Trond Myklebust
Cc: linux-nfs@vger.kernel.org
Signed-off-by: Tim Gardner
Signed-off-by: Trond Myklebust -
layoutget's prepare hook can call rpc_exit with status = NFS4_OK (0).
Because of this, nfs4_proc_layoutget can't depend on a 0 status to mean
that the RPC was successfully sent, received and parsed.To fix this, use the result's len member to see if parsing took place.
This fixes the following OOPS -- calling xdr_init_decode() with a buffer length
0 doesn't set the stream's 'p' member and ends up using uninitialized memory
in filelayout_decode_layout.BUG: unable to handle kernel paging request at 0000000000008050
IP: [] memcpy+0x18/0x120
PGD 0
Oops: 0000 [#1] SMP
last sysfs file: /sys/devices/pci0000:00/0000:00:11.0/0000:02:01.0/irq
CPU 1
Modules linked in: nfs_layout_nfsv41_files nfs lockd fscache auth_rpcgss nfs_acl autofs4 sunrpc ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 iptable_filter ip_tables ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables ipv6 dm_mirror dm_region_hash dm_log dm_mod ppdev parport_pc parport snd_ens1371 snd_rawmidi snd_ac97_codec ac97_bus snd_seq snd_seq_device snd_pcm snd_timer snd soundcore snd_page_alloc e1000 microcode vmware_balloon i2c_piix4 i2c_core sg shpchp ext4 mbcache jbd2 sr_mod cdrom sd_mod crc_t10dif pata_acpi ata_generic ata_piix mptspi mptscsih mptbase scsi_transport_spi [last unloaded: speedstep_lib]Pid: 1665, comm: flush-0:22 Not tainted 2.6.32-356-test-2 #2 VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform
RIP: 0010:[] [] memcpy+0x18/0x120
RSP: 0018:ffff88003dfab588 EFLAGS: 00010206
RAX: ffff88003dc42000 RBX: ffff88003dfab610 RCX: 0000000000000009
RDX: 000000003f807ff0 RSI: 0000000000008050 RDI: ffff88003dc42000
RBP: ffff88003dfab5b0 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000080 R12: 0000000000000024
R13: ffff88003dc42000 R14: ffff88003f808030 R15: ffff88003dfab6a0
FS: 0000000000000000(0000) GS:ffff880003420000(0000) knlGS:0000000000000000
CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000000000008050 CR3: 000000003bc92000 CR4: 00000000001407e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process flush-0:22 (pid: 1665, threadinfo ffff88003dfaa000, task ffff880037f77540)
Stack:
ffffffffa0398ac1 ffff8800397c5940 ffff88003dfab610 ffff88003dfab6a0
ffff88003dfab5d0 ffff88003dfab680 ffffffffa01c150b ffffea0000d82e70
000000508116713b 0000000000000000 0000000000000000 0000000000000000
Call Trace:
[] ? xdr_inline_decode+0xb1/0x120 [sunrpc]
[] filelayout_decode_layout+0xeb/0x350 [nfs_layout_nfsv41_files]
[] filelayout_alloc_lseg+0x8c/0x3c0 [nfs_layout_nfsv41_files]
[] ? __wait_on_bit+0x7e/0x90Signed-off-by: Weston Andros Adamson
Signed-off-by: Trond Myklebust
Cc: stable@vger.kernel.org
15 Feb, 2013
1 commit
-
The current code in pnfs_destroy_all_layouts() assumes that removing
the layout from the server->layouts list is sufficient to make it
invisible to other processes. This ignores the fact that most
users access the layout through the nfs_inode->layout...
There is further breakage due to lack of reference counting of the
layouts, meaning that the whole thing Oopses at the drop of a hat.The code in initiate_bulk_draining() is almost correct, and can be
used as a model for pnfs_destroy_all_layouts(), so move that
code to pnfs.c, and refactor the code to allow us to choose between
a single filesystem bulk recall, and a recall of all layouts.
Also note that initiate_bulk_draining() currently calls iput() while
holding locks. Fix that too.Signed-off-by: Trond Myklebust
Cc: stable@vger.kernel.org
12 Feb, 2013
7 commits
-
Ensure that if nfs_wait_on_sequence() causes our rpc task to wait for
an NFSv4 state serialisation lock, then we also drop the session slot.Signed-off-by: Trond Myklebust
Cc: stable@vger.kernel.org -
If the server reboots after it has replied to our OPEN, but before we
call nfs4_opendata_to_nfs4_state(), then the reboot recovery thread
will not see a stateid for this open, and so will fail to recover it.Signed-off-by: Trond Myklebust
-
Add a mutex to the struct nfs4_state_owner to ensure that delegation
recall doesn't conflict with byte range lock removal.Note that we nest the new mutex _outside_ the state manager reclaim
protection (nfsi->rwsem) in order to avoid deadlocks.Signed-off-by: Trond Myklebust
-
Adjust the return values so that they return EAGAIN to the caller in
cases where we might want to retry the delegation recall after
the state recovery has run.
Note that we can't wait and retry in this routine, because the caller
may be the state manager thread.If delegation recall fails due to a session or reboot related issue,
also ensure that we mark the stateid as delegated so that
nfs_delegation_claim_opens can find it again later.Signed-off-by: Trond Myklebust
-
If the server reboots while we are converting a delegation into
OPEN/LOCK stateids as part of a delegation return, the current code
will simply exit with an error. This causes us to lose both
delegation state and locking state (i.e. locking atomicity).Deal with this by exposing the delegation stateid during delegation
return, so that we can recover the delegation, and then resume
open/lock recovery.Note that not having to hold the nfs_inode->rwsem across the
calls to nfs_delegation_claim_opens() also fixes a deadlock against
the NFSv4.1 reboot recovery code.Signed-off-by: Trond Myklebust
-
We currently have a deadlock in which the state recovery thread
ends up blocking due to one of the locks which it is trying to
recover holding the nfs_inode->rwsem.
The situation is as follows: the state recovery thread is
scheduled in order to recover from a reboot. It immediately
drains the session, forcing all ordinary NFSv4.1 calls to
nfs41_setup_sequence() to be put to sleep. This includes the
file locking process that holds the nfs_inode->rwsem.
When the thread gets to nfs4_reclaim_locks(), it tries to
grab a write lock on nfs_inode->rwsem, and boom...Fix is to have the lock drop the nfs_inode->rwsem while it is
doing RPC calls. We use a sequence lock in order to signal to
the locking process whether or not a state recovery thread has
run on that inode, in which case it should retry the lock.Reported-by: Andy Adamson
Signed-off-by: Trond Myklebust -
This patch adds a seqcount_t lock for use by the state manager to
signal that an open owner has been recovered. This mechanism will be
used by the delegation, open and byte range lock code in order to
figure out if they need to replay requests due to collisions with
lock recovery.Signed-off-by: Trond Myklebust
01 Feb, 2013
2 commits
-
This reverts commit 324d003b0cd82151adbaecefef57b73f7959a469.
The deadlock turned out to be caused by a workqueue limitation that has
now been worked around in the RPC code (see comment in rpc_free_task).Signed-off-by: Trond Myklebust
-
Pull NFS client bugfixes from Trond Myklebust:
- Error reporting in nfs_xdev_mount incorrectly maps all errors to
ENOMEM- Fix an NFSv4 refcounting issue
- Fix a mount failure when the server reboots during NFSv4 trunking
discovery- NFSv4.1 mounts may need to run the lease recovery thread.
- Don't silently fail setattr() requests on mountpoints
- Fix a SUNRPC socket/transport livelock and priority queue issue
- We must handle NFS4ERR_DELAY when resetting the NFSv4.1 session.
* tag 'nfs-for-3.8-4' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
NFSv4.1: Handle NFS4ERR_DELAY when resetting the NFSv4.1 session
SUNRPC: When changing the queue priority, ensure that we change the owner
NFS: Don't silently fail setattr() requests on mountpoints
NFSv4.1: Ensure that nfs41_walk_client_list() does start lease recovery
NFSv4: Fix NFSv4 trunking discovery
NFSv4: Fix NFSv4 reference counting for trunked sessions
NFS: Fix error reporting in nfs_xdev_mount
31 Jan, 2013
2 commits
-
NFS4ERR_DELAY is a legal reply when we call DESTROY_SESSION. It
usually means that the server is busy handling an unfinished RPC
request. Just sleep for a second and then retry.
We also need to be able to handle the NFS4ERR_BACK_CHAN_BUSY return
value. If the NFS server has outstanding callbacks, we just want to
similarly sleep & retry.Signed-off-by: Trond Myklebust
Cc: stable@vger.kernel.org -
Ensure that any setattr and getattr requests for junctions and/or
mountpoints are sent to the server. Ever since commit
0ec26fd0698 (vfs: automount should ignore LOOKUP_FOLLOW), we have
silently dropped any setattr requests to a server-side mountpoint.
For referrals, we have silently dropped both getattr and setattr
requests.This patch restores the original behaviour for setattr on mountpoints,
and tries to do the same for referrals, provided that we have a
filehandle...Signed-off-by: Trond Myklebust
Cc: stable@vger.kernel.org
30 Jan, 2013
1 commit
-
Pull xfs bugfixes from Ben Myers:
"Here are fixes for returning EFSCORRUPTED on probe of a non-xfs
filesystem, the stack switch in xfs_bmapi_allocate, a crash in
_xfs_buf_find, speculative preallocation as the filesystem nears
ENOSPC, an unmount hang, a race with AIO, and a regression with
xfs_fsr:- fix return value when filesystem probe finds no XFS magic, a
regression introduced in 9802182.- fix stack switch in __xfs_bmapi_allocate by moving the check for
stack switch up into xfs_bmapi_write.- fix oops in _xfs_buf_find by validating that the requested block is
within the filesystem bounds.- limit speculative preallocation near ENOSPC.
- fix an unmount hang in xfs_wait_buftarg by freeing the
xfs_buf_log_item in xfs_buf_item_unlock.- fix a possible use after free with AIO.
- fix xfs_swap_extents after removal of xfs_flushinval_pages, a
regression introduced in commit fb59581404a."* tag 'for-linus-v3.8-rc6' of git://oss.sgi.com/xfs/xfs:
xfs: Fix xfs_swap_extents() after removal of xfs_flushinval_pages()
xfs: Fix possible use-after-free with AIO
xfs: fix shutdown hang on invalid inode during create
xfs: limit speculative prealloc near ENOSPC thresholds
xfs: fix _xfs_buf_find oops on blocks beyond the filesystem end
xfs: pull up stack_switch check into xfs_bmapi_write
xfs: Do not return EFSCORRUPTED when filesystem probe finds no XFS magic
29 Jan, 2013
7 commits
-
Commit fb59581404ab7ec5075299065c22cb211a9262a9 removed
xfs_flushinval_pages() and changed its callers to use
filemap_write_and_wait() and truncate_pagecache_range() directly.But in xfs_swap_extents() this change accidental switched the argument
for 'tip' to 'ip'. This patch switches it back to 'tip'Signed-off-by: Torsten Kaiser
Reviewed-by: Ben Myers
Signed-off-by: Ben Myers -
Running AIO is pinning inode in memory using file reference. Once AIO
is completed using aio_complete(), file reference is put and inode can
be freed from memory. So we have to be sure that calling aio_complete()
is the last thing we do with the inode.CC: xfs@oss.sgi.com
CC: Ben Myers
CC: stable@vger.kernel.org
Signed-off-by: Jan Kara
Reviewed-by: Ben Myers
Signed-off-by: Ben Myers -
When the new inode verify in xfs_iread() fails, the create
transaction is aborted and a shutdown occurs. The subsequent unmount
then hangs in xfs_wait_buftarg() on a buffer that has an elevated
hold count. Debug showed that it was an AGI buffer getting stuck:[ 22.576147] XFS (vdb): buffer 0x2/0x1, hold 0x2 stuck
[ 22.976213] XFS (vdb): buffer 0x2/0x1, hold 0x2 stuck
[ 23.376206] XFS (vdb): buffer 0x2/0x1, hold 0x2 stuck
[ 23.776325] XFS (vdb): buffer 0x2/0x1, hold 0x2 stuckThe trace of this buffer leading up to the shutdown (trimmed for
brevity) looks like:xfs_buf_init: bno 0x2 nblks 0x1 hold 1 caller xfs_buf_get_map
xfs_buf_get: bno 0x2 len 0x200 hold 1 caller xfs_buf_read_map
xfs_buf_read: bno 0x2 len 0x200 hold 1 caller xfs_trans_read_buf_map
xfs_buf_iorequest: bno 0x2 nblks 0x1 hold 1 caller _xfs_buf_read
xfs_buf_hold: bno 0x2 nblks 0x1 hold 1 caller xfs_buf_iorequest
xfs_buf_rele: bno 0x2 nblks 0x1 hold 2 caller xfs_buf_iorequest
xfs_buf_iowait: bno 0x2 nblks 0x1 hold 1 caller _xfs_buf_read
xfs_buf_ioerror: bno 0x2 len 0x200 hold 1 caller xfs_buf_bio_end_io
xfs_buf_iodone: bno 0x2 nblks 0x1 hold 1 caller _xfs_buf_ioend
xfs_buf_iowait_done: bno 0x2 nblks 0x1 hold 1 caller _xfs_buf_read
xfs_buf_hold: bno 0x2 nblks 0x1 hold 1 caller xfs_buf_item_init
xfs_trans_read_buf: bno 0x2 len 0x200 hold 2 recur 0 refcount 1
xfs_trans_brelse: bno 0x2 len 0x200 hold 2 recur 0 refcount 1
xfs_buf_item_relse: bno 0x2 nblks 0x1 hold 2 caller xfs_trans_brelse
xfs_buf_rele: bno 0x2 nblks 0x1 hold 2 caller xfs_buf_item_relse
xfs_buf_unlock: bno 0x2 nblks 0x1 hold 1 caller xfs_trans_brelse
xfs_buf_rele: bno 0x2 nblks 0x1 hold 1 caller xfs_trans_brelse
xfs_buf_trylock: bno 0x2 nblks 0x1 hold 2 caller _xfs_buf_find
xfs_buf_find: bno 0x2 len 0x200 hold 2 caller xfs_buf_get_map
xfs_buf_get: bno 0x2 len 0x200 hold 2 caller xfs_buf_read_map
xfs_buf_read: bno 0x2 len 0x200 hold 2 caller xfs_trans_read_buf_map
xfs_buf_hold: bno 0x2 nblks 0x1 hold 2 caller xfs_buf_item_init
xfs_trans_read_buf: bno 0x2 len 0x200 hold 3 recur 0 refcount 1
xfs_trans_log_buf: bno 0x2 len 0x200 hold 3 recur 0 refcount 1
xfs_buf_item_unlock: bno 0x2 len 0x200 hold 3 flags DIRTY liflags ABORTED
xfs_buf_unlock: bno 0x2 nblks 0x1 hold 3 caller xfs_buf_item_unlock
xfs_buf_rele: bno 0x2 nblks 0x1 hold 3 caller xfs_buf_item_unlockAnd that is the AGI buffer from cold cache read into memory to
transaction abort. You can see at transaction abort the bli is dirty
and only has a single reference. The item is not pinned, and it's
not in the AIL. Hence the only reference to it is this transaction.The problem is that the xfs_buf_item_unlock() call is dropping the
last reference to the xfs_buf_log_item attached to the buffer (which
holds a reference to the buffer), but it is not freeing the
xfs_buf_log_item. Hence nothing will ever release the buffer, and
the unmount hangs waiting for this reference to go away.The fix is simple - xfs_buf_item_unlock needs to detect the last
reference going away in this case and free the xfs_buf_log_item to
release the reference it holds on the buffer.Signed-off-by: Dave Chinner
Reviewed-by: Ben Myers
Signed-off-by: Ben Myers -
There is a window on small filesytsems where specualtive
preallocation can be larger than that ENOSPC throttling thresholds,
resulting in specualtive preallocation trying to reserve more space
than there is space available. This causes immediate ENOSPC to be
triggered, prealloc to be turned off and flushing to occur. One the
next write (i.e. next 4k page), we do exactly the same thing, and so
effective drive into synchronous 4k writes by triggering ENOSPC
flushing on every page while in the window between the prealloc size
and the ENOSPC prealloc throttle threshold.Fix this by checking to see if the prealloc size would consume all
free space, and throttle it appropriately to avoid premature
ENOSPC...Signed-off-by: Dave Chinner
Reviewed-by: Brian Foster
Signed-off-by: Ben Myers -
When _xfs_buf_find is passed an out of range address, it will fail
to find a relevant struct xfs_perag and oops with a null
dereference. This can happen when trying to walk a filesystem with a
metadata inode that has a partially corrupted extent map (i.e. the
block number returned is corrupt, but is otherwise intact) and we
try to read from the corrupted block address.In this case, just fail the lookup. If it is readahead being issued,
it will simply not be done, but if it is real read that fails we
will get an error being reported. Ideally this case should result
in an EFSCORRUPTED error being reported, but we cannot return an
error through xfs_buf_read() or xfs_buf_get() so this lookup failure
may result in ENOMEM or EIO errors being reported instead.Signed-off-by: Dave Chinner
Reviewed-by: Brian Foster
Reviewed-by: Ben Myers
Signed-off-by: Ben Myers -
The stack_switch check currently occurs in __xfs_bmapi_allocate,
which means the stack switch only occurs when xfs_bmapi_allocate()
is called in a loop. Pull the check up before the loop in
xfs_bmapi_write() such that the first iteration of the loop has
consistent behavior.Signed-off-by: Brian Foster
Reviewed-by: Dave Chinner
Signed-off-by: Ben Myers -
9802182 changed the return value from EWRONGFS (aka EINVAL)
to EFSCORRUPTED which doesn't seem to be handled properly by
the root filesystem probe.Signed-off-by: Eric Sandeen
Tested-by: Sergei Trofimovich
Reviewed-by: Ben Myers
Signed-off-by: Ben Myers
28 Jan, 2013
5 commits
-
The recent commit fb6791d100d1bba20b5cdbc4912e1f7086ec60f8
included the wrong logic. The lvbptr check was incorrectly
added after the patch was tested.Signed-off-by: David Teigland
Signed-off-by: Steven Whitehouse -
We do need to start the lease recovery thread prior to waiting for the
client initialisation to complete in NFSv4.1.Signed-off-by: Trond Myklebust
Cc: Chuck Lever
Cc: Ben Greear
Cc: stable@vger.kernel.org [>=3.7] -
If walking the list in nfs4[01]_walk_client_list fails, then the most
likely explanation is that the server dropped the clientid before we
actually managed to confirm it. As long as our nfs_client is the very
last one in the list to be tested, the caller can be assured that this
is the case when the final return value is NFS4ERR_STALE_CLIENTID.Reported-by: Ben Greear
Signed-off-by: Trond Myklebust
Cc: Chuck Lever
Cc: stable@vger.kernel.org [>=3.7]
Tested-by: Ben Greear -
The reference counting in nfs4_init_client assumes wongly that it
is safe for nfs4_discover_server_trunking() to return a pointer to a
nfs_client prior to bumping the reference count.Signed-off-by: Trond Myklebust
Cc: Chuck Lever
Cc: Ben Greear
Cc: stable@vger.kernel.org [>=3.7] -
Currently, nfs_xdev_mount converts all errors from clone_server() to
ENOMEM, which can then leak to userspace (for instance to 'mount'). Fix that.
Also ensure that if nfs_fs_mount_common() returns an error, we
don't dprintk(0)...The regression originated in commit 3d176e3fe4f6dc379b252bf43e2e146a8f7caf01
(NFS: Use nfs_fs_mount_common() for xdev mounts)Signed-off-by: Trond Myklebust
Cc: stable@vger.kernel.org [>= 3.5]
26 Jan, 2013
1 commit
-
Pull btrfs fixes from Chris Mason:
"It turns out that we had two crc bugs when running fsx-linux in a
loop. Many thanks to Josef, Miao Xie, and Dave Sterba for nailing it
all down. Miao also has a new OOM fix in this v2 pull as well.Ilya fixed a regression Liu Bo found in the balance ioctls for pausing
and resuming a running balance across drives.Josef's orphan truncate patch fixes an obscure corruption we'd see
during xfstests.Arne's patches address problems with subvolume quotas. If the user
destroys quota groups incorrectly the FS will refuse to mount.The rest are smaller fixes and plugs for memory leaks."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (30 commits)
Btrfs: fix repeated delalloc work allocation
Btrfs: fix wrong max device number for single profile
Btrfs: fix missed transaction->aborted check
Btrfs: Add ACCESS_ONCE() to transaction->abort accesses
Btrfs: put csums on the right ordered extent
Btrfs: use right range to find checksum for compressed extents
Btrfs: fix panic when recovering tree log
Btrfs: do not allow logged extents to be merged or removed
Btrfs: fix a regression in balance usage filter
Btrfs: prevent qgroup destroy when there are still relations
Btrfs: ignore orphan qgroup relations
Btrfs: reorder locks and sanity checks in btrfs_ioctl_defrag
Btrfs: fix unlock order in btrfs_ioctl_rm_dev
Btrfs: fix unlock order in btrfs_ioctl_resize
Btrfs: fix "mutually exclusive op is running" error code
Btrfs: bring back balance pause/resume logic
btrfs: update timestamps on truncate()
btrfs: fix btrfs_cont_expand() freeing IS_ERR em
Btrfs: fix a bug when llseek for delalloc bytes behind prealloc extents
Btrfs: fix off-by-one in lseek
...
25 Jan, 2013
3 commits
-
Pull cifs fixes from Steve French:
"Two small cifs fixes"* 'for-next' of git://git.samba.org/sfrench/cifs-2.6:
fs/cifs/cifs_dfs_ref.c: fix potential memory leakage
cifs: fix srcip_matches() for ipv6 -
btrfs_start_delalloc_inodes() locks the delalloc_inodes list, fetches the
first inode, unlocks the list, triggers btrfs_alloc_delalloc_work/
btrfs_queue_worker for this inode, and then it locks the list, checks the
head of the list again. But because we don't delete the first inode that it
deals with before, it will fetch the same inode. As a result, this function
allocates a huge amount of btrfs_delalloc_work structures, and OOM happens.Fix this problem by splice this delalloc list.
Reported-by: Alex Lyakas
Signed-off-by: Miao Xie
Signed-off-by: Josef Bacik -
The max device number of single profile is 1, not 0 (0 means 'as many as
possible'). Fix it.Cc: Liu Bo
Signed-off-by: Miao Xie
Reviewed-by: Liu Bo
Signed-off-by: Josef Bacik