23 Sep, 2015

2 commits

  • If we send a layoutreturn asynchronously before close, the close
    might reach the server first, and the layoutreturn would then fail
    with BADSTATEID because nothing keeps the layout stateid alive.

    Also, do not pretend to be sending a layoutreturn when we are not.

    Signed-off-by: Peng Tao
    Signed-off-by: Trond Myklebust

    Peng Tao
     
  • When the lseg's commit_through_mds is set, the pNFS client always
    WARNs once in nfs_direct_select_verf after checking ds_cinfo.nbuckets.

    NFS should use the DS verf, except when commit_through_mds is set for
    the layout segment, where nbuckets is zero.

    [17844.666094] ------------[ cut here ]------------
    [17844.667071] WARNING: CPU: 0 PID: 21758 at /root/source/linux-pnfs/fs/nfs/direct.c:174 nfs_direct_select_verf+0x5a/0x70 [nfs]()
    [17844.668650] Modules linked in: nfs_layout_nfsv41_files(OE) nfsv4(OE) nfs(OE) fscache(E) nfsd(OE) xfs libcrc32c btrfs ppdev coretemp crct10dif_pclmul auth_rpcgss crc32_pclmul crc32c_intel nfs_acl ghash_clmulni_intel lockd vmw_balloon xor vmw_vmci grace raid6_pq shpchp sunrpc parport_pc i2c_piix4 parport vmwgfx drm_kms_helper ttm drm serio_raw mptspi e1000 scsi_transport_spi mptscsih mptbase ata_generic pata_acpi [last unloaded: fscache]
    [17844.686676] CPU: 0 PID: 21758 Comm: kworker/0:1 Tainted: G W OE 4.3.0-rc1-pnfs+ #245
    [17844.687352] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 05/20/2014
    [17844.698502] Workqueue: nfsiod rpc_async_release [sunrpc]
    [17844.699212] 0000000000000009 0000000043e58010 ffff8800454fbc10 ffffffff813680c4
    [17844.699990] ffff8800454fbc48 ffffffff8108b49d ffff88004eb20000 ffff88004eb20000
    [17844.700844] ffff880062e26000 0000000000000000 0000000000000001 ffff8800454fbc58
    [17844.701637] Call Trace:
    [17844.725252] [] dump_stack+0x19/0x25
    [17844.732693] [] warn_slowpath_common+0x7d/0xb0
    [17844.733855] [] warn_slowpath_null+0x1a/0x20
    [17844.735015] [] nfs_direct_select_verf+0x5a/0x70 [nfs]
    [17844.735999] [] nfs_direct_set_hdr_verf+0x23/0x90 [nfs]
    [17844.736846] [] nfs_direct_write_completion+0x227/0x260 [nfs]
    [17844.737782] [] nfs_pgio_release+0x1c/0x20 [nfs]
    [17844.738597] [] pnfs_generic_rw_release+0x23/0x30 [nfsv4]
    [17844.739486] [] rpc_free_task+0x2a/0x70 [sunrpc]
    [17844.740326] [] rpc_async_release+0x15/0x20 [sunrpc]
    [17844.741173] [] process_one_work+0x21c/0x4c0
    [17844.741984] [] ? process_one_work+0x16d/0x4c0
    [17844.742837] [] worker_thread+0x4a/0x440
    [17844.743639] [] ? process_one_work+0x4c0/0x4c0
    [17844.744399] [] ? process_one_work+0x4c0/0x4c0
    [17844.745176] [] kthread+0xf5/0x110
    [17844.745927] [] ? kthread_create_on_node+0x240/0x240
    [17844.747105] [] ret_from_fork+0x3f/0x70
    [17844.747856] [] ? kthread_create_on_node+0x240/0x240
    [17844.748642] ---[ end trace 336a2845d42b83f0 ]---

    Signed-off-by: Kinglong Mee
    Signed-off-by: Trond Myklebust

    Kinglong Mee
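
The verifier selection rule described in this commit can be sketched in user space as follows. This is a minimal model, not the kernel code: the types, the `used_ds` parameter, and the `model_` names are hypothetical, and it assumes DS verifiers are only usable when at least one commit bucket exists.

```c
#include <assert.h>
#include <stddef.h>

/* User-space model; hypothetical stand-in, not a kernel struct. */
struct model_commit_info {
	int nbuckets;	/* DS commit buckets; zero when committing through the MDS */
};

enum verf_source { VERF_FROM_MDS, VERF_FROM_DS };

/*
 * Pick the verifier source for a completed write.
 * Assumption: a DS verifier exists only if a data server handled the
 * I/O and at least one commit bucket was allocated.
 */
static enum verf_source model_select_verf(int used_ds,
					  const struct model_commit_info *ds_cinfo)
{
	assert(ds_cinfo != NULL);
	if (used_ds && ds_cinfo->nbuckets > 0)
		return VERF_FROM_DS;
	/* Covers commit_through_mds, where nbuckets is zero. */
	return VERF_FROM_MDS;
}
```

Falling back to the MDS verifier when there are no buckets avoids treating the commit_through_mds case as an error.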
     

21 Sep, 2015

4 commits

  • If the current open or layout stateid doesn't match the stateid used
    in the layoutget RPC call, then don't try to recover it.

    Signed-off-by: Trond Myklebust

    Trond Myklebust
     
  • When a read delegation is being recalled, and we're reclaiming the
    cached opens, we need to make sure that we only reclaim read-only
    modes.
    A previous attempt to do this relied on retrieving the delegation
    type from the nfs4_opendata structure. Unfortunately, as Kinglong
    pointed out, this field can only be set when performing reboot recovery.

    Furthermore, if we call nfs4_open_recover(), then we end up clobbering
    the state->flags for all modes that we're not recovering...

    The fix is to have the delegation recall code pass this information
    to the recovery call, and then refactor the recovery code so that
    nfs4_open_delegation_recall() does not need to call nfs4_open_recover().

    Reported-by: Kinglong Mee
    Fixes: 39f897fdbd46 ("NFSv4: When returning a delegation, don't...")
    Tested-by: Kinglong Mee
    Cc: NeilBrown
    Cc: stable@vger.kernel.org # v4.2+
    Signed-off-by: Trond Myklebust

    Trond Myklebust
     
  • If layoutget fails with BAD_STATEID, the restart should not use the old
    stateid. However, the NFS client chooses the layout stateid first, and
    then the open stateid.

    To avoid an infinite loop of using a bad stateid for layoutget,
    this patch sets the NFS_LAYOUT_INVALID_STID bit in the layout flags to
    skip choosing the bad layout stateid.

    Signed-off-by: Kinglong Mee
    Signed-off-by: Trond Myklebust

    Kinglong Mee
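
The stateid selection described in this commit might look roughly like the sketch below. It is a user-space model under stated assumptions: the struct, the bit number, and the `model_` helpers are hypothetical, and stateids are reduced to plain integers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_LAYOUT_INVALID_STID 0	/* hypothetical bit number */

/* Hypothetical model of a layout header: a flags word plus a stateid. */
struct model_layout {
	unsigned long flags;
	uint32_t layout_stateid;
};

/* Called when layoutget fails with BAD_STATEID. */
static void model_mark_stateid_bad(struct model_layout *lo)
{
	assert(lo != NULL);
	lo->flags |= 1UL << MODEL_LAYOUT_INVALID_STID;
}

/* Prefer the layout stateid unless it has been marked invalid. */
static uint32_t model_choose_stateid(const struct model_layout *lo,
				     uint32_t open_stateid)
{
	assert(lo != NULL);
	if (lo->flags & (1UL << MODEL_LAYOUT_INVALID_STID))
		return open_stateid;
	return lo->layout_stateid;
}
```

Once the flag is set, a retry picks the open stateid instead of reusing the rejected layout stateid, which breaks the loop.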
     
  • A layout segment reference is leaked after resetting
    pageio read/write to the MDS.

    Signed-off-by: Kinglong Mee
    Cc: stable@vger.kernel.org # v4.0+
    Signed-off-by: Trond Myklebust

    Kinglong Mee
     

18 Sep, 2015

4 commits

  • If filelayout_decode_layout fails, _filelayout_free_lseg will cause
    a double free of fh_array.

    [ 1179.279800] BUG: unable to handle kernel NULL pointer dereference at (null)
    [ 1179.280198] IP: [] filelayout_free_fh_array.isra.11+0x1d/0x70 [nfs_layout_nfsv41_files]
    [ 1179.281010] PGD 0
    [ 1179.281443] Oops: 0000 [#1]
    [ 1179.281831] Modules linked in: nfs_layout_nfsv41_files(OE) nfsv4(OE) nfs(OE) fscache(E) xfs libcrc32c coretemp nfsd crct10dif_pclmul ppdev crc32_pclmul crc32c_intel auth_rpcgss ghash_clmulni_intel nfs_acl lockd vmw_balloon grace sunrpc parport_pc vmw_vmci parport shpchp i2c_piix4 vmwgfx drm_kms_helper ttm drm serio_raw mptspi scsi_transport_spi mptscsih e1000 mptbase ata_generic pata_acpi [last unloaded: fscache]
    [ 1179.283891] CPU: 0 PID: 13336 Comm: cat Tainted: G OE 4.3.0-rc1-pnfs+ #244
    [ 1179.284323] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 05/20/2014
    [ 1179.285206] task: ffff8800501d48c0 ti: ffff88003e3c4000 task.ti: ffff88003e3c4000
    [ 1179.285668] RIP: 0010:[] [] filelayout_free_fh_array.isra.11+0x1d/0x70 [nfs_layout_nfsv41_files]
    [ 1179.286612] RSP: 0018:ffff88003e3c77f8 EFLAGS: 00010202
    [ 1179.287092] RAX: 0000000000000000 RBX: ffff88001fe78900 RCX: 0000000000000000
    [ 1179.287731] RDX: ffffea0000f40760 RSI: ffff88001fe789c8 RDI: ffff88001fe789c0
    [ 1179.288383] RBP: ffff88003e3c7810 R08: ffffea0000f40760 R09: 0000000000000000
    [ 1179.289170] R10: 0000000000000000 R11: 0000000000000001 R12: ffff88001fe789c8
    [ 1179.289959] R13: ffff88001fe789c0 R14: ffff88004ec05a80 R15: ffff88004f935b88
    [ 1179.290791] FS: 00007f4e66bb5700(0000) GS:ffffffff81c29000(0000) knlGS:0000000000000000
    [ 1179.291580] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [ 1179.292209] CR2: 0000000000000000 CR3: 00000000203f8000 CR4: 00000000001406f0
    [ 1179.292731] Stack:
    [ 1179.293195] ffff88001fe78900 00000000000000d0 ffff88001fe78178 ffff88003e3c7868
    [ 1179.293676] ffffffffa0272737 0000000000000001 0000000000000001 ffff88001fe78800
    [ 1179.294151] 00000000614fffce ffffffff81727671 ffff88001fe78100 ffff88001fe78100
    [ 1179.294623] Call Trace:
    [ 1179.295092] [] filelayout_alloc_lseg+0xa7/0x2d0 [nfs_layout_nfsv41_files]
    [ 1179.295625] [] ? out_of_line_wait_on_bit+0x81/0xb0
    [ 1179.296133] [] pnfs_layout_process+0xae/0x320 [nfsv4]
    [ 1179.296632] [] nfs4_proc_layoutget+0x2b1/0x360 [nfsv4]
    [ 1179.297134] [] pnfs_update_layout+0x853/0xb30 [nfsv4]
    [ 1179.297632] [] ? nfs_get_lock_context+0x74/0x170 [nfs]
    [ 1179.298158] [] filelayout_pg_init_read+0x37/0x50 [nfs_layout_nfsv41_files]
    [ 1179.298834] [] __nfs_pageio_add_request+0x119/0x460 [nfs]
    [ 1179.299385] [] ? nfs_create_request.part.9+0x37/0x2e0 [nfs]
    [ 1179.299872] [] nfs_pageio_add_request+0xa3/0x1b0 [nfs]
    [ 1179.300362] [] readpage_async_filler+0x85/0x260 [nfs]
    [ 1179.300907] [] read_cache_pages+0x91/0xd0
    [ 1179.301391] [] ? nfs_read_completion+0x220/0x220 [nfs]
    [ 1179.301867] [] nfs_readpages+0x128/0x200 [nfs]
    [ 1179.302330] [] __do_page_cache_readahead+0x203/0x280
    [ 1179.302784] [] ? __do_page_cache_readahead+0xd8/0x280
    [ 1179.303413] [] ondemand_readahead+0x1a6/0x2f0
    [ 1179.303855] [] page_cache_sync_readahead+0x31/0x50
    [ 1179.304286] [] generic_file_read_iter+0x4a6/0x5c0
    [ 1179.304711] [] ? __nfs_revalidate_mapping+0x1f6/0x240 [nfs]
    [ 1179.305132] [] nfs_file_read+0x52/0xa0 [nfs]
    [ 1179.305540] [] __vfs_read+0xcc/0x100
    [ 1179.305936] [] vfs_read+0x85/0x130
    [ 1179.306326] [] SyS_read+0x58/0xd0
    [ 1179.306708] [] entry_SYSCALL_64_fastpath+0x12/0x76
    [ 1179.307094] Code: c4 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 89 e5 41 55 41 54 53 8b 07 49 89 f4 85 c0 74 47 48 8b 06 49 89 fd 8b 38 48 85 ff 74 22 31 db eb 0c 48 63 d3 48 8b 3c d0 48 85
    [ 1179.308357] RIP [] filelayout_free_fh_array.isra.11+0x1d/0x70 [nfs_layout_nfsv41_files]
    [ 1179.309177] RSP
    [ 1179.309582] CR2: 0000000000000000

    Signed-off-by: Kinglong Mee
    Signed-off-by: Trond Myklebust

    Kinglong Mee
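
One common way to make such a cleanup path safe to run twice is to reset the pointer and count after freeing. The sketch below illustrates that pattern only; it is not the fix applied in this commit, and the struct and `model_` names are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical model of a segment that owns an array of file handles. */
struct model_fh_array {
	unsigned int num_fh;
	void **fh_array;
};

/* Allocate n empty slots; returns 0 on success, -1 on failure. */
static int model_alloc_fh_array(struct model_fh_array *fl, unsigned int n)
{
	assert(fl != NULL);
	fl->fh_array = calloc(n, sizeof(*fl->fh_array));
	if (!fl->fh_array)
		return -1;
	fl->num_fh = n;
	return 0;
}

/* Idempotent free: a second call sees a NULL array and does nothing. */
static void model_free_fh_array(struct model_fh_array *fl)
{
	unsigned int i;

	if (!fl || !fl->fh_array)
		return;
	for (i = 0; i < fl->num_fh; i++)
		free(fl->fh_array[i]);	/* free(NULL) is a no-op */
	free(fl->fh_array);
	fl->fh_array = NULL;
	fl->num_fh = 0;
}
```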
     
  • We're incorrectly assigning a loff_t return to an int. If SEEK_HOLE or
    SEEK_DATA returns an offset over 2^31 then the application will see a
    weird lseek() result (usually -EIO).

    Cc: stable@vger.kernel.org
    Fixes: bdcc2cd14e4e "NFSv4.2: handle NFS-specific llseek errors"
    Signed-off-by: J. Bruce Fields
    Reviewed-by: Anna Schumaker
    Signed-off-by: Trond Myklebust

    J. Bruce Fields
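
The narrowing problem in this commit can be reproduced in user space. In the sketch below, `model_llseek` is a hypothetical stand-in for a seek routine that returns a 64-bit offset, and the code assumes a platform where int is 32 bits wide.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical seek routine: returns the resulting 64-bit offset. */
static int64_t model_llseek(int64_t offset)
{
	assert(offset >= 0);
	return offset;
}

/* Buggy pattern: the 64-bit result is stored in an int first. */
static int64_t model_seek_narrowed(int64_t offset)
{
	/* Narrowing drops the high bits where int is 32 bits wide. */
	int ret = (int)model_llseek(offset);

	return ret;
}

/* Corrected pattern: keep the result in a 64-bit type throughout. */
static int64_t model_seek_wide(int64_t offset)
{
	int64_t ret = model_llseek(offset);

	return ret;
}
```

Small offsets survive the narrowed version unchanged, which is why the bug only shows up once an offset reaches 2^31.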
     
  • We really want sizeof(struct page *) instead. Otherwise we limit
    maximum IO size to 64 pages rather than 512 pages on a 64bit system.

    Fixes 2e11f829(nfs: cap request size to fit a kmalloced page array).

    Cc: Christoph Hellwig
    Signed-off-by: Peng Tao
    Fixes: 2e11f8296d22 ("nfs: cap request size to fit a kmalloced page array")
    Signed-off-by: Trond Myklebust

    Peng Tao
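
The 64 versus 512 figures in this commit are consistent with dividing a single 4096-byte buffer by the element size. The sketch below assumes a 4096-byte page, a 64-byte struct page, and 8-byte pointers; none of those sizes are stated in the commit message, and the helper name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* How many elements of elem_size bytes fit in a buffer of buf_size bytes. */
static size_t model_entries_per_buffer(size_t buf_size, size_t elem_size)
{
	assert(elem_size > 0);
	return buf_size / elem_size;
}
```

Under those assumptions, `model_entries_per_buffer(4096, 64)` gives 64 entries, while `model_entries_per_buffer(4096, 8)` gives 512.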
     
  • A test case, as the description says:
    open(foobar, O_WRONLY);
    sleep() --> reboot the server
    close(foobar)

    The bug is that in nfs4state.c, in nfs4_reclaim_open_state(), a few
    lines before going to restart, there is
    clear_bit(NFS4CLNT_RECLAIM_NOGRACE, &state->flags).

    NFS4CLNT_RECLAIM_NOGRACE is a flag for the client state, not for open
    owner states. The value of NFS4CLNT_RECLAIM_NOGRACE is 4, which is also
    the value of NFS_O_WRONLY_STATE in nfs4_state->flags. So clearing it
    wipes out the state, and when we go to close it, “call_close” doesn’t
    get set because the state flag is not set, and CLOSE doesn’t go on the
    wire.

    Signed-off-by: Olga Kornievskaia
    Signed-off-by: Trond Myklebust

    Olga Kornievskaia
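
The bit collision described in this commit can be modelled in user space. The two constants below use the value 4 given in the commit message; the helper functions are simplified, non-atomic stand-ins for the kernel's bit operations, and all `model_` and `MODEL_` names are hypothetical.

```c
#include <assert.h>
#include <limits.h>

/* Bit numbers from two unrelated flag words that happen to be equal. */
enum { MODEL_CLNT_RECLAIM_NOGRACE = 4 };	/* belongs to the client word */
enum { MODEL_O_WRONLY_STATE = 4 };		/* belongs to the open-state word */

static void model_set_bit(unsigned int nr, unsigned long *word)
{
	assert(nr < sizeof(*word) * CHAR_BIT);
	*word |= 1UL << nr;
}

static void model_clear_bit(unsigned int nr, unsigned long *word)
{
	assert(nr < sizeof(*word) * CHAR_BIT);
	*word &= ~(1UL << nr);
}

static int model_test_bit(unsigned int nr, const unsigned long *word)
{
	assert(nr < sizeof(*word) * CHAR_BIT);
	return (int)((*word >> nr) & 1UL);
}
```

Clearing `MODEL_CLNT_RECLAIM_NOGRACE` in the client word leaves the open-state word alone, but clearing the same bit number in the open-state word silently drops `MODEL_O_WRONLY_STATE`.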
     

08 Sep, 2015

3 commits

  • Pull NFS client updates from Trond Myklebust:
    "Highlights include:

    Stable patches:
    - Fix atomicity of pNFS commit list updates
    - Fix NFSv4 handling of open(O_CREAT|O_EXCL|O_RDONLY)
    - nfs_set_pgio_error sometimes misses errors
    - Fix a thinko in xs_connect()
    - Fix borkage in _same_data_server_addrs_locked()
    - Fix a NULL pointer dereference of migration recovery ops for v4.2
    client
    - Don't let the ctime override attribute barriers.
    - Revert "NFSv4: Remove incorrect check in can_open_delegated()"
    - Ensure flexfiles pNFS driver updates the inode after write finishes
    - flexfiles must not pollute the attribute cache with attributes from
    the DS
    - Fix a protocol error in layoutreturn
    - Fix a protocol issue with NFSv4.1 CLOSE stateids

    Bugfixes + cleanups
    - pNFS blocks bugfixes from Christoph
    - Various cleanups from Anna
    - More fixes for delegation corner cases
    - Don't fsync twice for O_SYNC/IS_SYNC files
    - Fix pNFS and flexfiles layoutstats bugs
    - pnfs/flexfiles: avoid duplicate tracking of mirror data
    - pnfs: Fix layoutget/layoutreturn/return-on-close serialisation
    issues
    - pnfs/flexfiles: error handling retries a layoutget before fallback
    to MDS

    Features:
    - Full support for the OPEN NFS4_CREATE_EXCLUSIVE4_1 mode from
    Kinglong
    - More RDMA client transport improvements from Chuck
    - Removal of the deprecated ib_reg_phys_mr() and ib_rereg_phys_mr()
    verbs from the SUNRPC, Lustre and core infiniband tree.
    - Optimise away the close-to-open getattr if there is no cached data"

    * tag 'nfs-for-4.3-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (108 commits)
    NFSv4: Respect the server imposed limit on how many changes we may cache
    NFSv4: Express delegation limit in units of pages
    Revert "NFS: Make close(2) asynchronous when closing NFS O_DIRECT files"
    NFS: Optimise away the close-to-open getattr if there is no cached data
    NFSv4.1/flexfiles: Clean up ff_layout_write_done_cb/ff_layout_commit_done_cb
    NFSv4.1/flexfiles: Mark the layout for return in ff_layout_io_track_ds_error()
    nfs: Remove unneeded checking of the return value from scnprintf
    nfs: Fix truncated client owner id without proto type
    NFSv4.1/flexfiles: Mark layout for return if the mirrors are invalid
    NFSv4.1/flexfiles: RW layouts are valid only if all mirrors are valid
    NFSv4.1/flexfiles: Fix incorrect usage of pnfs_generic_mark_devid_invalid()
    NFSv4.1/flexfiles: Fix freeing of mirrors
    NFSv4.1/pNFS: Don't request a minimal read layout beyond the end of file
    NFSv4.1/pnfs: Handle LAYOUTGET return values correctly
    NFSv4.1/pnfs: Don't ask for a read layout for an empty file.
    NFSv4.1: Fix a protocol issue with CLOSE stateids
    NFSv4.1/flexfiles: Don't mark the entire deviceid as bad for file errors
    SUNRPC: Prevent SYN+SYNACK+RST storms
    SUNRPC: xs_reset_transport must mark the connection as disconnected
    NFSv4.1/pnfs: Ensure layoutreturn reserves space for the opaque payload
    ...

    Linus Torvalds
     
  • The NFSv4 delegation spec allows the server to tell a client to limit how
    much data it caches after the file is closed. In return, the server
    guarantees enough free space to avoid ENOSPC situations, etc.
    Prior to this patch, we assumed we could always cache aggressively after
    close. Unfortunately, this causes problems with servers that set the
    limit to 0 and therefore do not offer any ENOSPC guarantees.

    Signed-off-by: Trond Myklebust

    Trond Myklebust
     
  • Since we're tracking modifications to the page cache on a per-page
    basis, it makes sense to express the limit to how much we may cache
    in units of pages.

    Signed-off-by: Trond Myklebust

    Trond Myklebust
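
Expressing a byte limit in units of pages amounts to a division by the page size. The sketch below assumes the result is rounded down and that the caller supplies the page size (for example 4096); neither detail is stated in the commit message, and the function name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Convert a server-imposed byte limit to whole pages, rounding down. */
static uint64_t model_limit_in_pages(uint64_t limit_bytes, uint64_t page_size)
{
	assert(page_size > 0);
	return limit_bytes / page_size;
}
```

With this rounding, a limit of 0 bytes maps to 0 pages, so a server that sets the limit to 0 permits no cached modifications after close.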
     

06 Sep, 2015

1 commit

  • Pull nfsd updates from Bruce Fields:
    "Nothing major, but:

    - Add Jeff Layton as an nfsd co-maintainer: no change to existing
    practice, just an acknowledgement of the status quo.

    - Two patches ("nfsd: ensure that...") for a race overlooked by the
    state locking rewrite, causing a crash noticed by multiple users.

    - Lots of smaller bugfixes all over from Kinglong Mee.

    - From Jeff, some cleanup of server rpc code in preparation for
    possible shift of nfsd threads to workqueues"

    * tag 'nfsd-4.3' of git://linux-nfs.org/~bfields/linux: (52 commits)
    nfsd: deal with DELEGRETURN racing with CB_RECALL
    nfsd: return CLID_INUSE for unexpected SETCLIENTID_CONFIRM case
    nfsd: ensure that delegation stateid hash references are only put once
    nfsd: ensure that the ol stateid hash reference is only put once
    net: sunrpc: fix tracepoint Warning: unknown op '->'
    nfsd: allow more than one laundry job to run at a time
    nfsd: don't WARN/backtrace for invalid container deployment.
    fs: fix fs/locks.c kernel-doc warning
    nfsd: Add Jeff Layton as co-maintainer
    NFSD: Return word2 bitmask if setting security label in OPEN/CREATE
    NFSD: Set the attributes used to store the verifier for EXCLUSIVE4_1
    nfsd: SUPPATTR_EXCLCREAT must be encoded before SECURITY_LABEL.
    nfsd: Fix an FS_LAYOUT_TYPES/LAYOUT_TYPES encode bug
    NFSD: Store parent's stat in a separate value
    nfsd: Fix two typos in comments
    lockd: NLM grace period shouldn't block NFSv4 opens
    nfsd: include linux/nfs4.h in export.h
    sunrpc: Switch to using hash list instead single list
    sunrpc/nfsd: Remove redundant code by exports seq_operations functions
    sunrpc: Store cache_detail in seq_file's private directly
    ...

    Linus Torvalds
     

05 Sep, 2015

2 commits


03 Sep, 2015

3 commits

  • Signed-off-by: Trond Myklebust

    Trond Myklebust
     
  • When I/O cannot complete due to a fatal error on the DS, ensure that we
    invalidate the corresponding layout segment and return it.

    Signed-off-by: Trond Myklebust

    Trond Myklebust
     
  • Pull core block updates from Jens Axboe:
    "This first core part of the block IO changes contains:

    - Cleanup of the bio IO error signaling from Christoph. We used to
    rely on the uptodate bit and passing around of an error, now we
    store the error in the bio itself.

    - Improvement of the above from myself, by shrinking the bio size
    down again to fit in two cachelines on x86-64.

    - Revert of the max_hw_sectors cap removal from a revision again,
    from Jeff Moyer. This caused performance regressions in various
    tests. Reinstate the limit, bump it to a more reasonable size
    instead.

    - Make /sys/block/<dev>/queue/discard_max_bytes writeable, by me.
    Most devices have huge trim limits, which can cause nasty latencies
    when deleting files. Enable the admin to configure the size down.
    We will look into having a more sane default instead of UINT_MAX
    sectors.

    - Improvement of the SG gaps logic from Keith Busch.

    - Enable the block core to handle arbitrarily sized bios, which
    enables a nice simplification of bio_add_page() (which is an IO hot
    path). From Kent.

    - Improvements to the partition io stats accounting, making it
    faster. From Ming Lei.

    - Also from Ming Lei, a basic fixup for overflow of the sysfs pending
    file in blk-mq, as well as a fix for a blk-mq timeout race
    condition.

    - Ming Lin has been carrying Kent's above-mentioned patches forward
    for a while, and testing them. Ming also did a few fixes around
    that.

    - Sasha Levin found and fixed a use-after-free problem introduced by
    the bio->bi_error changes from Christoph.

    - Small blk cgroup cleanup from Viresh Kumar"

    * 'for-4.3/core' of git://git.kernel.dk/linux-block: (26 commits)
    blk: Fix bio_io_vec index when checking bvec gaps
    block: Replace SG_GAPS with new queue limits mask
    block: bump BLK_DEF_MAX_SECTORS to 2560
    Revert "block: remove artifical max_hw_sectors cap"
    blk-mq: fix race between timeout and freeing request
    blk-mq: fix buffer overflow when reading sysfs file of 'pending'
    Documentation: update notes in biovecs about arbitrarily sized bios
    block: remove bio_get_nr_vecs()
    fs: use helper bio_add_page() instead of open coding on bi_io_vec
    block: kill merge_bvec_fn() completely
    md/raid5: get rid of bio_fits_rdev()
    md/raid5: split bio for chunk_aligned_read
    block: remove split code in blkdev_issue_{discard,write_same}
    btrfs: remove bio splitting and merge_bvec_fn() calls
    bcache: remove driver private bio splitting code
    block: simplify bio_add_page()
    block: make generic_make_request handle arbitrarily sized bios
    blk-cgroup: Drop unlikely before IS_ERR(_OR_NULL)
    block: don't access bio->bi_error after bio_put()
    block: shrink struct bio down to 2 cache lines again
    ...

    Linus Torvalds
     

02 Sep, 2015

6 commits


31 Aug, 2015

5 commits


28 Aug, 2015

7 commits


26 Aug, 2015

3 commits