02 Oct, 2015

1 commit

  • Commit 46c043ede471 ("mm: take i_mmap_lock in unmap_mapping_range() for
    DAX") moved some code in __dax_pmd_fault() that was responsible for
    zeroing newly allocated PMD pages. The new location didn't properly set
    up 'kaddr', so when run, this code resulted in a NULL pointer BUG.

    Fix this by getting the correct 'kaddr' via bdev_direct_access().

    Signed-off-by: Ross Zwisler
    Reported-by: Dan Williams
    Reviewed-by: Dan Williams
    Cc: Alexander Viro
    Cc: Matthew Wilcox
    Cc: "Kirill A. Shutemov"
    Cc: Dave Chinner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ross Zwisler
     

01 Oct, 2015

1 commit


29 Sep, 2015

1 commit

  • Fixes the following lockdep splat:
    [ 1.244527] =============================================
    [ 1.245193] [ INFO: possible recursive locking detected ]
    [ 1.245193] 4.2.0-rc1+ #37 Not tainted
    [ 1.245193] ---------------------------------------------
    [ 1.245193] cp/742 is trying to acquire lock:
    [ 1.245193] (&sb->s_type->i_mutex_key#9){+.+.+.}, at: [] ubifs_init_security+0x29/0xb0
    [ 1.245193]
    [ 1.245193] but task is already holding lock:
    [ 1.245193] (&sb->s_type->i_mutex_key#9){+.+.+.}, at: [] path_openat+0x3af/0x1280
    [ 1.245193]
    [ 1.245193] other info that might help us debug this:
    [ 1.245193] Possible unsafe locking scenario:
    [ 1.245193]
    [ 1.245193] CPU0
    [ 1.245193] ----
    [ 1.245193] lock(&sb->s_type->i_mutex_key#9);
    [ 1.245193] lock(&sb->s_type->i_mutex_key#9);
    [ 1.245193]
    [ 1.245193] *** DEADLOCK ***
    [ 1.245193]
    [ 1.245193] May be due to missing lock nesting notation
    [ 1.245193]
    [ 1.245193] 2 locks held by cp/742:
    [ 1.245193] #0: (sb_writers#5){.+.+.+}, at: [] mnt_want_write+0x1f/0x50
    [ 1.245193] #1: (&sb->s_type->i_mutex_key#9){+.+.+.}, at: [] path_openat+0x3af/0x1280
    [ 1.245193]
    [ 1.245193] stack backtrace:
    [ 1.245193] CPU: 2 PID: 742 Comm: cp Not tainted 4.2.0-rc1+ #37
    [ 1.245193] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140816_022509-build35 04/01/2014
    [ 1.245193] ffffffff8252d530 ffff88007b023a38 ffffffff814f6f49 ffffffff810b56c5
    [ 1.245193] ffff88007c30cc80 ffff88007b023af8 ffffffff810a150d ffff88007b023a68
    [ 1.245193] 000000008101302a ffff880000000000 00000008f447e23f ffffffff8252d500
    [ 1.245193] Call Trace:
    [ 1.245193] [] dump_stack+0x4c/0x65
    [ 1.245193] [] ? console_unlock+0x1c5/0x510
    [ 1.245193] [] __lock_acquire+0x1a6d/0x1ea0
    [ 1.245193] [] ? __lock_is_held+0x58/0x80
    [ 1.245193] [] lock_acquire+0xd3/0x270
    [ 1.245193] [] ? ubifs_init_security+0x29/0xb0
    [ 1.245193] [] mutex_lock_nested+0x6b/0x3a0
    [ 1.245193] [] ? ubifs_init_security+0x29/0xb0
    [ 1.245193] [] ? ubifs_init_security+0x29/0xb0
    [ 1.245193] [] ubifs_init_security+0x29/0xb0
    [ 1.245193] [] ubifs_create+0xa6/0x1f0
    [ 1.245193] [] ? path_openat+0x3af/0x1280
    [ 1.245193] [] vfs_create+0x95/0xc0
    [ 1.245193] [] path_openat+0x7cc/0x1280
    [ 1.245193] [] ? __lock_acquire+0x543/0x1ea0
    [ 1.245193] [] ? sched_clock_cpu+0x90/0xc0
    [ 1.245193] [] ? calc_global_load_tick+0x60/0x90
    [ 1.245193] [] ? sched_clock_cpu+0x90/0xc0
    [ 1.245193] [] ? __alloc_fd+0xaf/0x180
    [ 1.245193] [] do_filp_open+0x75/0xd0
    [ 1.245193] [] ? _raw_spin_unlock+0x26/0x40
    [ 1.245193] [] ? __alloc_fd+0xaf/0x180
    [ 1.245193] [] do_sys_open+0x129/0x200
    [ 1.245193] [] SyS_open+0x19/0x20
    [ 1.245193] [] entry_SYSCALL_64_fastpath+0x12/0x6f

    While the lockdep splat is a false positive, because path_openat holds i_mutex
    of the parent directory and ubifs_init_security() tries to acquire i_mutex
    of a new inode, it reveals that taking i_mutex in ubifs_init_security() is
    in vain because it is only being called in the inode allocation path
    and therefore nobody else can see the inode yet.

    Cc: stable@vger.kernel.org # 3.20-
    Reported-and-tested-by: Boris Brezillon
    Reviewed-and-tested-by: Dongsheng Yang
    Signed-off-by: Richard Weinberger
    Signed-off-by: dedekind1@gmail.com

    Richard Weinberger
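Lockdep validates lock classes rather than individual lock instances, which is why the splat in the UBIFS commit above fires even though two different inodes are involved. The following is a minimal Python model of that check, not kernel code: the `Lock` tuple, the class name string and the subclass numbers are illustrative assumptions.

```python
from collections import namedtuple

# Illustrative model only: lockdep keys its checks on a lock class (plus an
# optional nesting subclass), not on the address of the individual lock.
Lock = namedtuple("Lock", ["lock_class", "subclass", "owner"])


def possible_recursive_locking(held, new):
    """Return True if `new` has the same class and subclass as a held lock."""
    return any(
        h.lock_class == new.lock_class and h.subclass == new.subclass
        for h in held
    )


parent_dir = Lock("i_mutex_key", 0, "parent directory inode")
new_inode = Lock("i_mutex_key", 0, "newly allocated inode")

# Two distinct inodes, one class: the model reports a possible recursion.
flagged = possible_recursive_locking([parent_dir], new_inode)

# A nesting annotation (modelled as a different subclass) silences the report.
annotated = new_inode._replace(subclass=1)
flagged_with_annotation = possible_recursive_locking([parent_dir], annotated)
```

The commit message argues that the lock is unnecessary in the first place, since nobody else can see the new inode yet, so no annotation is needed.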
     

27 Sep, 2015

1 commit

  • Pull CIFS fixes from Steve French:
    "Four fixes from testing at the recent SMB3 Plugfest including two
    important authentication ones (one fixes authentication problems to
    some popular servers when clock times differ more than two hours
    between systems, the other fixes Kerberos authentication for SMB3)"

    * 'for-next' of git://git.samba.org/sfrench/cifs-2.6:
    fix encryption error checks on mount
    [SMB3] Fix sec=krb5 on smb3 mounts
    cifs: use server timestamp for ntlmv2 authentication
    disabling oplocks/leases via module parm enable_oplocks broken for SMB3

    Linus Torvalds
     

26 Sep, 2015

2 commits

  • Pull btrfs fixes from Chris Mason:
    "This is an assorted set I've been queuing up:

    Jeff Mahoney tracked down a tricky one where we ended up starting IO
    on the wrong mapping for special files in btrfs_evict_inode. A few
    people reported this one on the list.

    Filipe found (and provided a test for) a difficult bug in reading
    compressed extents, and Josef fixed up some quota record keeping with
    snapshot deletion. Chandan killed off an accounting bug during DIO
    that led to WARN_ONs as we freed inodes"

    * 'for-linus-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
    Btrfs: keep dropped roots in cache until transaction commit
    Btrfs: Direct I/O: Fix space accounting
    btrfs: skip waiting on ordered range for special files
    Btrfs: fix read corruption of compressed and shared extents
    Btrfs: remove unnecessary locking of cleaner_mutex to avoid deadlock
    Btrfs: don't initialize a space info as full to prevent ENOSPC

    Linus Torvalds
     
  • Pull NFS client bugfixes from Trond Myklebust:
    "Highlights include:

    Stable patches:
    - fix v4.2 SEEK on files over 2 gigs
    - Fix a layout segment reference leak when pNFS I/O falls back to inband I/O.
    - Fix recovery of recalled read delegations

    Bugfixes:
    - Fix a case where NFSv4 fails to send CLOSE after a server reboot
    - Fix sunrpc to wait for connections to complete before retrying
    - Fix sunrpc races between transport connect/disconnect and shutdown
    - Fix an infinite loop when layoutget fails with BAD_STATEID
    - nfs/filelayout: Fix NULL reference caused by double freeing of fh_array
    - Fix a bogus WARN_ON_ONCE() in O_DIRECT when layout commit_through_mds is set
    - Fix layoutreturn/close ordering issues"

    * tag 'nfs-for-4.3-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
    NFS41: make close wait for layoutreturn
    NFS: Skip checking ds_cinfo.buckets when lseg's commit_through_mds is set
    NFSv4.x/pnfs: Don't try to recover stateids twice in layoutget
    NFSv4: Recovery of recalled read delegations is broken
    NFS: Fix an infinite loop when layoutget fail with BAD_STATEID
    NFS: Do cleanup before resetting pageio read/write to mds
    SUNRPC: xs_sock_mark_closed() does not need to trigger socket autoclose
    SUNRPC: Lock the transport layer on shutdown
    nfs/filelayout: Fix NULL reference caused by double freeing of fh_array
    SUNRPC: Ensure that we wait for connections to complete before retrying
    SUNRPC: drop null test before destroy functions
    nfs: fix v4.2 SEEK on files over 2 gigs
    SUNRPC: Fix races between socket connection and destroy code
    nfs: fix pg_test page count calculation
    Failing to send a CLOSE if file is opened WRONLY and server reboots on a 4.x mount

    Linus Torvalds
     

24 Sep, 2015

2 commits

  • Signed-off-by: Steve French

    Steve French
     
  • Kerberos, which is very important for security, was only enabled for
    CIFS not SMB2/SMB3 mounts (e.g. vers=3.0)

    Patch based on the information detailed in
    http://thread.gmane.org/gmane.linux.kernel.cifs/10081/focus=10307
    to enable Kerberized SMB2/SMB3

    a) SMB2_negotiate: enable/use decode_negTokenInit in SMB2_negotiate
    b) SMB2_sess_setup: handle Kerberos sectype and replicate Kerberos
    SMB1 processing done in sess_auth_kerberos

    Signed-off-by: Noel Power
    Signed-off-by: Jim McDonough
    CC: Stable
    Signed-off-by: Steve French

    Steve French
     

23 Sep, 2015

7 commits

  • If we send a layoutreturn asynchronously before close, the close
    might reach the server first and the layoutreturn would fail with
    BADSTATEID because there is nothing keeping the layout stateid alive.

    Also, do not pretend to be sending a layoutreturn when we are not.

    Signed-off-by: Peng Tao
    Signed-off-by: Trond Myklebust

    Peng Tao
     
  • The order of the following three spinlocks should be:
    dlm_domain_lock < dlm_ctxt->spinlock < dlm_lock_resource->spinlock

    But dlm_dispatch_assert_master() is called while holding
    dlm_ctxt->spinlock and dlm_lock_resource->spinlock, and then it calls
    dlm_grab() which will take dlm_domain_lock.

    If another thread (for example, dlm_query_join_handler) has already
    taken dlm_domain_lock and then tries to take dlm_ctxt->spinlock, a
    deadlock occurs.

    Signed-off-by: Joseph Qi
    Cc: Joel Becker
    Cc: Mark Fasheh
    Cc: "Junxiao Bi"
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joseph Qi
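The lock hierarchy in the ocfs2 DLM commit above can be checked mechanically. This is a small Python sketch, not kernel code: `first_order_violation` and the two acquisition lists are hypothetical names built only from the order and call paths stated in the commit message.

```python
# Documented order from the commit message, outermost lock first.
LOCK_ORDER = [
    "dlm_domain_lock",
    "dlm_ctxt->spinlock",
    "dlm_lock_resource->spinlock",
]
RANK = {name: i for i, name in enumerate(LOCK_ORDER)}


def first_order_violation(acquisitions):
    """Return (held, wanted) for the first out-of-order acquisition, else None."""
    held = []
    for wanted in acquisitions:
        for lock in held:
            if RANK[lock] > RANK[wanted]:
                return (lock, wanted)
        held.append(wanted)
    return None


# Path described for dlm_dispatch_assert_master(): both spinlocks are held
# when dlm_grab() takes dlm_domain_lock.
buggy_path = [
    "dlm_ctxt->spinlock",
    "dlm_lock_resource->spinlock",
    "dlm_domain_lock",
]
# Path of the other thread, e.g. dlm_query_join_handler.
other_path = ["dlm_domain_lock", "dlm_ctxt->spinlock"]
```

Run against `buggy_path`, the checker reports that `dlm_domain_lock` is requested while `dlm_ctxt->spinlock` is held, which is the inversion that can deadlock against `other_path`.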
     
  • This reverts commit 51360155eccb907ff8635bd10fc7de876408c2e0 and adapts
    fs/userfaultfd.c to use the old version of that function.

    It didn't look robust to call __wake_up_common with "nr == 1" when we
    absolutely require wakeall semantics, but we have full control of what
    we insert into the two waitqueue heads of the blocked userfaults. No
    exclusive waitqueue entry can be inserted into those two heads, so we
    may as well stick to the "nr == 1" of the old code and rely purely on
    the fact that no entry inserted into either of the two waitqueue heads,
    where we must enforce wakeall, has WQ_FLAG_EXCLUSIVE set in wait->flags.

    Signed-off-by: Andrea Arcangeli
    Cc: Dr. David Alan Gilbert
    Cc: Michael Ellerman
    Cc: Shuah Khan
    Cc: Thierry Reding
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
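The reasoning in the userfaultfd commit above rests on how the wake-up loop treats exclusive waiters. The following Python model is a simplified sketch of that loop, assuming every wake function succeeds. `Waiter` and `wake_up_common` are illustrative names, not the kernel implementation.

```python
from dataclasses import dataclass


@dataclass
class Waiter:
    name: str
    exclusive: bool = False  # models WQ_FLAG_EXCLUSIVE in wait->flags


def wake_up_common(queue, nr_exclusive):
    """Wake waiters in order; stop after `nr_exclusive` exclusive ones.

    Simplified model: every wake function is assumed to succeed.
    """
    woken = []
    for waiter in queue:
        woken.append(waiter.name)
        if waiter.exclusive:
            nr_exclusive -= 1
            if nr_exclusive == 0:
                break
    return woken


non_exclusive_only = [Waiter("a"), Waiter("b"), Waiter("c")]
with_exclusive = [Waiter("a"), Waiter("b", exclusive=True), Waiter("c")]
```

With only non-exclusive waiters queued, `nr_exclusive == 1` still wakes everyone, which is the property the commit relies on. The loop stops early only when an exclusive waiter is present.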
     
  • When lseg's commit_through_mds is set, the pNFS client always WARNs once
    in nfs_direct_select_verf after checking ds_cinfo.nbuckets.

    NFS should use the DS verf unless commit_through_mds is set for the
    layout segment, in which case nbuckets is zero.

    [17844.666094] ------------[ cut here ]------------
    [17844.667071] WARNING: CPU: 0 PID: 21758 at /root/source/linux-pnfs/fs/nfs/direct.c:174 nfs_direct_select_verf+0x5a/0x70 [nfs]()
    [17844.668650] Modules linked in: nfs_layout_nfsv41_files(OE) nfsv4(OE) nfs(OE) fscache(E) nfsd(OE) xfs libcrc32c btrfs ppdev coretemp crct10dif_pclmul auth_rpcgss crc32_pclmul crc32c_intel nfs_acl ghash_clmulni_intel lockd vmw_balloon xor vmw_vmci grace raid6_pq shpchp sunrpc parport_pc i2c_piix4 parport vmwgfx drm_kms_helper ttm drm serio_raw mptspi e1000 scsi_transport_spi mptscsih mptbase ata_generic pata_acpi [last unloaded: fscache]
    [17844.686676] CPU: 0 PID: 21758 Comm: kworker/0:1 Tainted: G W OE 4.3.0-rc1-pnfs+ #245
    [17844.687352] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 05/20/2014
    [17844.698502] Workqueue: nfsiod rpc_async_release [sunrpc]
    [17844.699212] 0000000000000009 0000000043e58010 ffff8800454fbc10 ffffffff813680c4
    [17844.699990] ffff8800454fbc48 ffffffff8108b49d ffff88004eb20000 ffff88004eb20000
    [17844.700844] ffff880062e26000 0000000000000000 0000000000000001 ffff8800454fbc58
    [17844.701637] Call Trace:
    [17844.725252] [] dump_stack+0x19/0x25
    [17844.732693] [] warn_slowpath_common+0x7d/0xb0
    [17844.733855] [] warn_slowpath_null+0x1a/0x20
    [17844.735015] [] nfs_direct_select_verf+0x5a/0x70 [nfs]
    [17844.735999] [] nfs_direct_set_hdr_verf+0x23/0x90 [nfs]
    [17844.736846] [] nfs_direct_write_completion+0x227/0x260 [nfs]
    [17844.737782] [] nfs_pgio_release+0x1c/0x20 [nfs]
    [17844.738597] [] pnfs_generic_rw_release+0x23/0x30 [nfsv4]
    [17844.739486] [] rpc_free_task+0x2a/0x70 [sunrpc]
    [17844.740326] [] rpc_async_release+0x15/0x20 [sunrpc]
    [17844.741173] [] process_one_work+0x21c/0x4c0
    [17844.741984] [] ? process_one_work+0x16d/0x4c0
    [17844.742837] [] worker_thread+0x4a/0x440
    [17844.743639] [] ? process_one_work+0x4c0/0x4c0
    [17844.744399] [] ? process_one_work+0x4c0/0x4c0
    [17844.745176] [] kthread+0xf5/0x110
    [17844.745927] [] ? kthread_create_on_node+0x240/0x240
    [17844.747105] [] ret_from_fork+0x3f/0x70
    [17844.747856] [] ? kthread_create_on_node+0x240/0x240
    [17844.748642] ---[ end trace 336a2845d42b83f0 ]---

    Signed-off-by: Kinglong Mee
    Signed-off-by: Trond Myklebust

    Kinglong Mee
     
  • Linux cifs mount with ntlmssp against a Mac OS X (Yosemite
    10.10.5) share fails if the clocks differ by more than +/-2h:

    digest-service: digest-request: od failed with 2 proto=ntlmv2
    digest-service: digest-request: kdc failed with -1561745592 proto=ntlmv2

    Fix this by (re-)using the given server timestamp for the
    ntlmv2 authentication (as Windows 7 does).

    A related problem was also reported earlier by Namjae Jeon (see below):

    Windows machines have an extended security feature which refuses to allow
    authentication when there is a time difference between server time and
    client time when ntlmv2 negotiation is used. This problem is prevalent
    in embedded environments where the system time is set to the default of 1970.

    Modern servers send the server timestamp in the TargetInfo Av_Pair
    structure in the challenge message [see MS-NLMP 2.2.2.1].
    In [MS-NLMP 3.1.5.1.2] it is explicitly mentioned that the client must
    use the server-provided timestamp if present, or the current time if it
    is not.

    Reported-by: Namjae Jeon
    Signed-off-by: Peter Seiderer
    Signed-off-by: Steve French
    CC: Stable

    Peter Seiderer
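The timestamp rule quoted from MS-NLMP in the commit above can be sketched as follows. This is an illustrative Python model, not the cifs.ko code: the function names and the `av_pair` helper are assumptions, while the AV_PAIR layout (16-bit id, 16-bit length, value) and the `MsvAvTimestamp` id follow MS-NLMP.

```python
import struct

MSV_AV_EOL = 0x0000        # terminator AvId in MS-NLMP
MSV_AV_TIMESTAMP = 0x0007  # AvId carrying the server FILETIME
EPOCH_DELTA_SECONDS = 11644473600  # seconds from 1601-01-01 to 1970-01-01


def unix_to_filetime(unix_seconds):
    """Convert Unix seconds to 100 ns ticks since 1601-01-01 (FILETIME)."""
    return (unix_seconds + EPOCH_DELTA_SECONDS) * 10_000_000


def find_server_timestamp(target_info):
    """Scan the AV_PAIR list for MsvAvTimestamp; return FILETIME or None."""
    offset = 0
    while offset + 4 <= len(target_info):
        av_id, av_len = struct.unpack_from("<HH", target_info, offset)
        offset += 4
        if av_id == MSV_AV_EOL:
            break
        if av_id == MSV_AV_TIMESTAMP and av_len == 8:
            return struct.unpack_from("<Q", target_info, offset)[0]
        offset += av_len
    return None


def ntlmv2_timestamp(target_info, client_unix_seconds):
    """Prefer the server-provided timestamp, else fall back to client time."""
    server_time = find_server_timestamp(target_info)
    if server_time is not None:
        return server_time
    return unix_to_filetime(client_unix_seconds)


def av_pair(av_id, value=b""):
    """Hypothetical helper for building sample input: one encoded AV_PAIR."""
    return struct.pack("<HH", av_id, len(value)) + value
```

A client whose clock still reads 1970 then authenticates with the server's notion of time whenever the challenge carries a timestamp, and only falls back to its own clock when it does not.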
     
  • Leases (oplocks) were always requested for SMB2/SMB3 even when oplocks
    were disabled in the cifs.ko module.

    Signed-off-by: Steve French
    Reviewed-by: Chandrika Srinivasan
    CC: Stable

    Steve French
     
  • When dropping a snapshot we need to account for the qgroup changes. If we drop
    the snapshot all in one go then the backref code will fail to find blocks from
    the snapshot we dropped since it won't be able to find the root in the fs root
    cache. This can lead to us failing to find refs from other roots that pointed
    at blocks in the now deleted root. To handle this we need to not remove the fs
    roots from the cache until after we process the qgroup operations. Do this by
    adding dropped roots to a list on the transaction, and letting the transaction
    remove the roots at the same time it drops the commit roots. This will keep all
    of the backref searching code in sync properly, and fixes a problem Mark was
    seeing with snapshot delete and qgroups. Thanks,

    Signed-off-by: Josef Bacik
    Tested-by: Holger Hoffstätte
    Signed-off-by: Chris Mason

    Josef Bacik
     

22 Sep, 2015

1 commit

  • The following call trace is seen when generic/095 test is executed,

    WARNING: CPU: 3 PID: 2769 at /home/chandan/code/repos/linux/fs/btrfs/inode.c:8967 btrfs_destroy_inode+0x284/0x2a0()
    Modules linked in:
    CPU: 3 PID: 2769 Comm: umount Not tainted 4.2.0-rc5+ #31
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20150306_163512-brownie 04/01/2014
    ffffffff81c08150 ffff8802ec9cbce8 ffffffff81984058 ffff8802ffd8feb0
    0000000000000000 ffff8802ec9cbd28 ffffffff81050385 ffff8802ec9cbd38
    ffff8802d12f8588 ffff8802d12f8588 ffff8802f15ab000 ffff8800bb96c0b0
    Call Trace:
    [] dump_stack+0x45/0x57
    [] warn_slowpath_common+0x85/0xc0
    [] warn_slowpath_null+0x15/0x20
    [] btrfs_destroy_inode+0x284/0x2a0
    [] destroy_inode+0x37/0x60
    [] evict+0x109/0x170
    [] dispose_list+0x35/0x50
    [] evict_inodes+0xaa/0x100
    [] generic_shutdown_super+0x47/0xf0
    [] kill_anon_super+0x11/0x20
    [] btrfs_kill_super+0x13/0x110
    [] deactivate_locked_super+0x39/0x70
    [] deactivate_super+0x5f/0x70
    [] cleanup_mnt+0x3e/0x90
    [] __cleanup_mnt+0xd/0x10
    [] task_work_run+0x96/0xb0
    [] do_notify_resume+0x3d/0x50
    [] int_signal+0x12/0x17

    This means that the inode had non-zero "outstanding extents" during
    eviction. This occurs because, during direct I/O, a task which successfully
    used up its reserved data space would set the BTRFS_INODE_DIO_READY bit and
    not clear it after finishing the DIO write. A future DIO write could then
    fail, and the unused reserved space would not be freed because of the
    previously set BTRFS_INODE_DIO_READY bit.

    Clearing the BTRFS_INODE_DIO_READY bit in btrfs_direct_IO() caused the
    following issue,
    |-----------------------------------+-------------------------------------|
    | Task A                            | Task B                              |
    |-----------------------------------+-------------------------------------|
    | Start direct i/o write on inode X.|                                     |
    | reserve space                     |                                     |
    | Allocate ordered extent           |                                     |
    | release reserved space            |                                     |
    | Set BTRFS_INODE_DIO_READY bit.    |                                     |
    |                                   | splice()                            |
    |                                   | Transfer data from pipe buffer to   |
    |                                   | destination file.                   |
    |                                   | - kmap(pipe buffer page)            |
    |                                   | - Start direct i/o write on         |
    |                                   |   inode X.                          |
    |                                   |   - reserve space                   |
    |                                   |   - dio_refill_pages()              |
    |                                   |     - sdio->blocks_available == 0   |
    |                                   |     - Since a kernel address is     |
    |                                   |       being passed instead of a     |
    |                                   |       user space address,           |
    |                                   |       iov_iter_get_pages() returns  |
    |                                   |       -EFAULT.                      |
    |                                   |   - Since BTRFS_INODE_DIO_READY is  |
    |                                   |     set, we don't release reserved  |
    |                                   |     space.                          |
    |                                   |   - Clear BTRFS_INODE_DIO_READY bit.|
    | -EIOCBQUEUED is returned.         |                                     |
    |-----------------------------------+-------------------------------------|

    Hence this commit introduces "struct btrfs_dio_data" to track the usage of
    reserved data space. The remaining unused "reserve space" can now be freed
    reliably.

    Signed-off-by: Chandan Rajendra
    Reviewed-by: Liu Bo
    Signed-off-by: Chris Mason

    chandan
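The accounting problem in the Btrfs direct I/O commit above comes from deciding what to free based on a flag shared by every writer to the inode. The Python sketch below contrasts that with per-call tracking. It is a loose model under stated assumptions: `Inode`, `DioData` and its `reserve` field are hypothetical stand-ins, and the real layout of struct btrfs_dio_data is not given in the commit message.

```python
from dataclasses import dataclass


@dataclass
class Inode:
    reserved: int = 0         # bytes of data space currently reserved
    dio_ready: bool = False   # models the shared BTRFS_INODE_DIO_READY bit


@dataclass
class DioData:
    """Hypothetical per-call tracker, standing in for struct btrfs_dio_data."""
    reserve: int = 0          # bytes reserved by this call and not yet used


def failed_write_shared_flag(inode, length):
    """Old scheme: a failing write consults the inode-wide flag."""
    inode.reserved += length
    # The write fails before using any space (e.g. -EFAULT).
    if not inode.dio_ready:
        inode.reserved -= length  # release only if nobody set the flag
    inode.dio_ready = False
    return inode.reserved


def failed_write_per_call(inode, length):
    """New scheme: each call frees exactly what it reserved and did not use."""
    data = DioData(reserve=length)
    inode.reserved += length
    # The write fails before using any space, so data.reserve is untouched.
    inode.reserved -= data.reserve
    return inode.reserved
```

In the scenario from the table, Task A has already set the bit when Task B's write fails, so the shared-flag version leaves Task B's reservation in place while the per-call version returns it.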
     

21 Sep, 2015

4 commits

  • If the current open or layout stateid doesn't match the stateid used
    in the layoutget RPC call, then don't try to recover it.

    Signed-off-by: Trond Myklebust

    Trond Myklebust
     
  • When a read delegation is being recalled, and we're reclaiming the
    cached opens, we need to make sure that we only reclaim read-only
    modes.
    A previous attempt to do this, relied on retrieving the delegation
    type from the nfs4_opendata structure. Unfortunately, as Kinglong
    pointed out, this field can only be set when performing reboot recovery.

    Furthermore, if we call nfs4_open_recover(), then we end up clobbering
    the state->flags for all modes that we're not recovering...

    The fix is to have the delegation recall code pass this information
    to the recovery call, and then refactor the recovery code so that
    nfs4_open_delegation_recall() does not need to call nfs4_open_recover().

    Reported-by: Kinglong Mee
    Fixes: 39f897fdbd46 ("NFSv4: When returning a delegation, don't...")
    Tested-by: Kinglong Mee
    Cc: NeilBrown
    Cc: stable@vger.kernel.org # v4.2+
    Signed-off-by: Trond Myklebust

    Trond Myklebust
     
  • If layoutget fails with BAD_STATEID, the restart should not use the old
    stateid. But the NFS client chooses the layout stateid first, and only
    then the open stateid.

    To avoid an infinite loop of using a bad stateid for layoutget,
    this patch sets the NFS_LAYOUT_INVALID_STID bit in the layout flags to
    skip choosing the bad layout stateid.

    Signed-off-by: Kinglong Mee
    Signed-off-by: Trond Myklebust

    Kinglong Mee
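The retry loop described in the layoutget commit above can be simulated in a few lines. This Python sketch is an assumption-laden model rather than the NFS client code: `choose_stateid`, `layoutget_attempts` and the string stateids are hypothetical, and the `layout_invalid` boolean stands in for the NFS_LAYOUT_INVALID_STID bit.

```python
def choose_stateid(layout_stateid, open_stateid, layout_invalid):
    """Prefer the layout stateid unless it has been flagged invalid."""
    if layout_stateid is not None and not layout_invalid:
        return layout_stateid
    return open_stateid


def layoutget_attempts(layout_stateid, open_stateid, bad, mark_invalid, limit=10):
    """Count LAYOUTGET tries until the simulated server accepts a stateid.

    `bad` is the set of stateids the server rejects with BAD_STATEID.
    Returns None if all `limit` tries fail.
    """
    layout_invalid = False
    for attempt in range(1, limit + 1):
        stateid = choose_stateid(layout_stateid, open_stateid, layout_invalid)
        if stateid not in bad:
            return attempt
        if mark_invalid:
            layout_invalid = True  # models setting NFS_LAYOUT_INVALID_STID
    return None
```

Without the flag the same rejected stateid is chosen on every retry. With it, the second attempt falls back to the open stateid and succeeds.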
     
  • There is a reference leak of layout segment after resetting
    pageio read/write to mds.

    Signed-off-by: Kinglong Mee
    Cc: stable@vger.kernel.org # v4.0+
    Signed-off-by: Trond Myklebust

    Kinglong Mee
     

20 Sep, 2015

2 commits

  • Pull libnvdimm fixes from Dan Williams:

    - a boot regression (since v4.2) fix for some ARM configurations from
    Tyler

    - regression (since v4.1) fixes for mkfs.xfs on a DAX enabled device
    from Jeff. These are tagged for -stable.

    - a pair of locking fixes from Axel that are hidden from lockdep since
    they involve device_lock(). The "btt" one is tagged for -stable, the
    other only applies to the new "pfn" mechanism in v4.3.

    - a fix for the pmem ->rw_page() path to use wmb_pmem() from Ross.

    * 'libnvdimm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
    mm: fix type cast in __pfn_to_phys()
    pmem: add proper fencing to pmem_rw_page()
    libnvdimm: pfn_devs: Fix locking in namespace_store
    libnvdimm: btt_devs: Fix locking in namespace_store
    blockdev: don't set S_DAX for misaligned partitions
    dax: fix O_DIRECT I/O to the last block of a blockdev

    Linus Torvalds
     
  • Commit 505a666ee3fc ("writeback: plug writeback in wb_writeback() and
    writeback_inodes_wb()") has us holding a plug during writeback_sb_inodes,
    which increases the merge rate when relatively contiguous small files
    are written by the filesystem. It helps both on flash and spindles.

    For an fs_mark workload creating 4K files in parallel across 8 drives,
    this commit improves performance ~9% more by unplugging before calling
    cond_resched(). cond_resched() doesn't trigger an implicit unplug, so
    explicitly getting the IO down to the device before scheduling reduces
    latencies for anyone waiting on clean pages.

    It also cuts down on how often we use kblockd to unplug, which means
    less work bouncing from one workqueue to another.

    Many more details about how we got here:

    https://lkml.org/lkml/2015/9/11/570

    Signed-off-by: Chris Mason
    Signed-off-by: Linus Torvalds

    Chris Mason
     

18 Sep, 2015

5 commits

  • This fixes a memleak if anon_inode_getfile() fails in userfaultfd().

    Signed-off-by: Eric Biggers
    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric Biggers
     
  • If filelayout_decode_layout fails, _filelayout_free_lseg will cause
    a double free of fh_array.

    [ 1179.279800] BUG: unable to handle kernel NULL pointer dereference at (null)
    [ 1179.280198] IP: [] filelayout_free_fh_array.isra.11+0x1d/0x70 [nfs_layout_nfsv41_files]
    [ 1179.281010] PGD 0
    [ 1179.281443] Oops: 0000 [#1]
    [ 1179.281831] Modules linked in: nfs_layout_nfsv41_files(OE) nfsv4(OE) nfs(OE) fscache(E) xfs libcrc32c coretemp nfsd crct10dif_pclmul ppdev crc32_pclmul crc32c_intel auth_rpcgss ghash_clmulni_intel nfs_acl lockd vmw_balloon grace sunrpc parport_pc vmw_vmci parport shpchp i2c_piix4 vmwgfx drm_kms_helper ttm drm serio_raw mptspi scsi_transport_spi mptscsih e1000 mptbase ata_generic pata_acpi [last unloaded: fscache]
    [ 1179.283891] CPU: 0 PID: 13336 Comm: cat Tainted: G OE 4.3.0-rc1-pnfs+ #244
    [ 1179.284323] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 05/20/2014
    [ 1179.285206] task: ffff8800501d48c0 ti: ffff88003e3c4000 task.ti: ffff88003e3c4000
    [ 1179.285668] RIP: 0010:[] [] filelayout_free_fh_array.isra.11+0x1d/0x70 [nfs_layout_nfsv41_files]
    [ 1179.286612] RSP: 0018:ffff88003e3c77f8 EFLAGS: 00010202
    [ 1179.287092] RAX: 0000000000000000 RBX: ffff88001fe78900 RCX: 0000000000000000
    [ 1179.287731] RDX: ffffea0000f40760 RSI: ffff88001fe789c8 RDI: ffff88001fe789c0
    [ 1179.288383] RBP: ffff88003e3c7810 R08: ffffea0000f40760 R09: 0000000000000000
    [ 1179.289170] R10: 0000000000000000 R11: 0000000000000001 R12: ffff88001fe789c8
    [ 1179.289959] R13: ffff88001fe789c0 R14: ffff88004ec05a80 R15: ffff88004f935b88
    [ 1179.290791] FS: 00007f4e66bb5700(0000) GS:ffffffff81c29000(0000) knlGS:0000000000000000
    [ 1179.291580] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [ 1179.292209] CR2: 0000000000000000 CR3: 00000000203f8000 CR4: 00000000001406f0
    [ 1179.292731] Stack:
    [ 1179.293195] ffff88001fe78900 00000000000000d0 ffff88001fe78178 ffff88003e3c7868
    [ 1179.293676] ffffffffa0272737 0000000000000001 0000000000000001 ffff88001fe78800
    [ 1179.294151] 00000000614fffce ffffffff81727671 ffff88001fe78100 ffff88001fe78100
    [ 1179.294623] Call Trace:
    [ 1179.295092] [] filelayout_alloc_lseg+0xa7/0x2d0 [nfs_layout_nfsv41_files]
    [ 1179.295625] [] ? out_of_line_wait_on_bit+0x81/0xb0
    [ 1179.296133] [] pnfs_layout_process+0xae/0x320 [nfsv4]
    [ 1179.296632] [] nfs4_proc_layoutget+0x2b1/0x360 [nfsv4]
    [ 1179.297134] [] pnfs_update_layout+0x853/0xb30 [nfsv4]
    [ 1179.297632] [] ? nfs_get_lock_context+0x74/0x170 [nfs]
    [ 1179.298158] [] filelayout_pg_init_read+0x37/0x50 [nfs_layout_nfsv41_files]
    [ 1179.298834] [] __nfs_pageio_add_request+0x119/0x460 [nfs]
    [ 1179.299385] [] ? nfs_create_request.part.9+0x37/0x2e0 [nfs]
    [ 1179.299872] [] nfs_pageio_add_request+0xa3/0x1b0 [nfs]
    [ 1179.300362] [] readpage_async_filler+0x85/0x260 [nfs]
    [ 1179.300907] [] read_cache_pages+0x91/0xd0
    [ 1179.301391] [] ? nfs_read_completion+0x220/0x220 [nfs]
    [ 1179.301867] [] nfs_readpages+0x128/0x200 [nfs]
    [ 1179.302330] [] __do_page_cache_readahead+0x203/0x280
    [ 1179.302784] [] ? __do_page_cache_readahead+0xd8/0x280
    [ 1179.303413] [] ondemand_readahead+0x1a6/0x2f0
    [ 1179.303855] [] page_cache_sync_readahead+0x31/0x50
    [ 1179.304286] [] generic_file_read_iter+0x4a6/0x5c0
    [ 1179.304711] [] ? __nfs_revalidate_mapping+0x1f6/0x240 [nfs]
    [ 1179.305132] [] nfs_file_read+0x52/0xa0 [nfs]
    [ 1179.305540] [] __vfs_read+0xcc/0x100
    [ 1179.305936] [] vfs_read+0x85/0x130
    [ 1179.306326] [] SyS_read+0x58/0xd0
    [ 1179.306708] [] entry_SYSCALL_64_fastpath+0x12/0x76
    [ 1179.307094] Code: c4 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 89 e5 41 55 41 54 53 8b 07 49 89 f4 85 c0 74 47 48 8b 06 49 89 fd 8b 38 48 85 ff 74 22 31 db eb 0c 48 63 d3 48 8b 3c d0 48 85
    [ 1179.308357] RIP [] filelayout_free_fh_array.isra.11+0x1d/0x70 [nfs_layout_nfsv41_files]
    [ 1179.309177] RSP
    [ 1179.309582] CR2: 0000000000000000

    Signed-off-by: Kinglong Mee
    Signed-off-by: Trond Myklebust

    Kinglong Mee
     
  • We're incorrectly assigning a loff_t return to an int. If SEEK_HOLE or
    SEEK_DATA returns an offset over 2^31 then the application will see a
    weird lseek() result (usually -EIO).

    Cc: stable@vger.kernel.org
    Fixes: bdcc2cd14e4e "NFSv4.2: handle NFS-specific llseek errors"
    Signed-off-by: J. Bruce Fields
    Reviewed-by: Anna Schumaker
    Signed-off-by: Trond Myklebust

    J. Bruce Fields
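The truncation described in the SEEK commit above is easy to reproduce outside the kernel. This short Python illustration uses `ctypes.c_int` to mimic assigning a 64-bit offset to a C `int`. `store_in_int` is a hypothetical helper, and a 32-bit `int` is assumed, as on common Linux platforms.

```python
import ctypes


def store_in_int(offset):
    """Return what a C `int` holds after being assigned `offset`."""
    return ctypes.c_int(offset).value


small = store_in_int(2**31 - 1)      # largest offset that survives
wrapped = store_in_int(2**31)        # first offset that does not
three_gib = store_in_int(3 * 2**30)  # an offset inside a 4 GiB file
```

Offsets at or above 2^31 come back negative, and a negative return value reads as an error to the caller, which fits the odd lseek() results mentioned in the commit message.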
     
  • We really want sizeof(struct page *) instead. Otherwise we limit
    maximum IO size to 64 pages rather than 512 pages on a 64bit system.

    Fixes 2e11f829(nfs: cap request size to fit a kmalloced page array).

    Cc: Christoph Hellwig
    Signed-off-by: Peng Tao
    Fixes: 2e11f8296d22 ("nfs: cap request size to fit a kmalloced page array")
    Signed-off-by: Trond Myklebust

    Peng Tao
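The 64 versus 512 page figures in the pg_test commit above follow from simple division. The arithmetic is spelled out below in Python, assuming a 4096-byte page and a 64-byte struct page. Both sizes are assumptions chosen to match the numbers in the commit message, which does not state them.

```python
PAGE_SIZE = 4096          # assumed page size in bytes
SIZEOF_PAGE_POINTER = 8   # sizeof(struct page *) on a 64-bit system
SIZEOF_STRUCT_PAGE = 64   # assumed sizeof(struct page); implied by "64 pages"


def max_pages_in_array(element_size, array_bytes=PAGE_SIZE):
    """Number of array elements that fit in one kmalloc'ed page."""
    return array_bytes // element_size


pages_with_bug = max_pages_in_array(SIZEOF_STRUCT_PAGE)
pages_with_fix = max_pages_in_array(SIZEOF_PAGE_POINTER)
max_io_bytes_with_fix = pages_with_fix * PAGE_SIZE
```

Under these assumptions the fix raises the cap from 64 pages (256 KiB) to 512 pages (2 MiB) per request.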
     
  • A test case is as the description says:
    open(foobar, O_WRONLY);
    sleep() --> reboot the server
    close(foobar)

    The bug is that in nfs4state.c, in nfs4_reclaim_open_state(), a few
    lines before going to restart, there is
    clear_bit(NFS4CLNT_RECLAIM_NOGRACE, &state->flags).

    NFS4CLNT_RECLAIM_NOGRACE is a flag for the client state, not for open
    owner states. The value of NFS4CLNT_RECLAIM_NOGRACE is 4, which is the
    value of NFS_O_WRONLY_STATE in nfs4_state->flags. So clearing it wipes
    out the state, and when we go to close it, “call_close” doesn’t get set
    because the state flag is not set, and CLOSE doesn’t go on the wire.

    Signed-off-by: Olga Kornievskaia
    Signed-off-by: Trond Myklebust

    Olga Kornievskaia
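The collision described in the CLOSE commit above comes from applying a bit number that belongs to one flags word to a different flags word. Here is a minimal Python model of it. The bit helpers are illustrative stand-ins for the kernel's bit operations, and the value 4 for both constants is taken from the commit message.

```python
NFS4CLNT_RECLAIM_NOGRACE = 4  # bit number meant for the client's state word
NFS_O_WRONLY_STATE = 4        # bit number in nfs4_state->flags (same value)


def set_bit(nr, flags):
    """Return `flags` with bit `nr` set."""
    return flags | (1 << nr)


def clear_bit(nr, flags):
    """Return `flags` with bit `nr` cleared."""
    return flags & ~(1 << nr)


def bit_is_set(nr, flags):
    """Return True if bit `nr` is set in `flags`."""
    return bool(flags & (1 << nr))


state_flags = set_bit(NFS_O_WRONLY_STATE, 0)  # file opened O_WRONLY
# The buggy line applies a client-level bit number to the per-state flags.
state_flags_after = clear_bit(NFS4CLNT_RECLAIM_NOGRACE, state_flags)
call_close = bit_is_set(NFS_O_WRONLY_STATE, state_flags_after)
```

After the stray clear, the write-only open state is gone, so the close path in this model sees nothing to close.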
     

16 Sep, 2015

2 commits

  • The dax code doesn't currently support misaligned partitions,
    so disable O_DIRECT via dax until such time as that support
    materializes.

    Cc:
    Suggested-by: Boaz Harrosh
    Signed-off-by: Jeff Moyer
    Signed-off-by: Dan Williams

    Jeff Moyer
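One plausible form of the alignment test behind the S_DAX commit above is sketched here in Python. The commit message does not say how alignment is determined, so the rule used (partition start and length must both be whole pages), the 4096-byte page size and the function name are all assumptions.

```python
SECTOR_SIZE = 512
PAGE_SIZE = 4096  # assumed page size
SECTORS_PER_PAGE = PAGE_SIZE // SECTOR_SIZE


def partition_is_page_aligned(start_sector, nr_sectors):
    """True if both the start and the length are whole pages."""
    return (
        start_sector % SECTORS_PER_PAGE == 0
        and nr_sectors % SECTORS_PER_PAGE == 0
    )
```

A partition starting at sector 2048 passes this check, while a legacy layout starting at sector 63 does not and would be left without DAX under this rule.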
     
  • commit bbab37ddc20b (block: Add support for DAX reads/writes to
    block devices) caused a regression in mkfs.xfs. That utility
    sets the block size of the device to the logical block size
    using the BLKBSZSET ioctl, and then issues a single sector read
    from the last sector of the device. This results in the dax_io
    code trying to do a page-sized read from 512 bytes from the end
    of the device. The result is -ERANGE being returned to userspace.

    The fix is to align the block to the page size before calling
    get_block.

    Thanks to willy for simplifying my original patch.

    Cc:
    Signed-off-by: Jeff Moyer
    Tested-by: Linda Knippers
    Signed-off-by: Dan Williams

    Jeff Moyer
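The out-of-range read in the mkfs.xfs regression above can be reproduced with byte arithmetic. This Python sketch works on byte offsets rather than block numbers, so it approximates the fix and is not the dax_io code. The device size, the 4096-byte page size and the function names are assumptions.

```python
SECTOR_SIZE = 512
PAGE_SIZE = 4096  # assumed page size


def page_read_range(pos, align):
    """Return the [start, end) byte range of a page-sized read for `pos`."""
    start = pos & ~(PAGE_SIZE - 1) if align else pos
    return start, start + PAGE_SIZE


def read_fits(device_size, pos, align):
    """True if the page-sized read stays inside the device."""
    _, end = page_read_range(pos, align)
    return end <= device_size


device_size = 1024 * 1024                # hypothetical page-aligned device
last_sector = device_size - SECTOR_SIZE  # where mkfs.xfs issues its read
```

Starting a page-sized read 512 bytes from the end runs past the device, which corresponds to the -ERANGE case. Rounding the start down to a page boundary keeps the read inside it.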
     

15 Sep, 2015

3 commits

  • In btrfs_evict_inode, we properly truncate the page cache for evicted
    inodes but then we call btrfs_wait_ordered_range for every inode as well.
    It's the right thing to do for regular files but results in incorrect
    behavior for device inodes for block devices.

    filemap_fdatawrite_range gets called with inode->i_mapping which gets
    resolved to the block device inode before getting passed to
    wbc_attach_fdatawrite_inode and ultimately to inode_to_bdi. What happens
    next depends on whether there's an open file handle associated with the
    inode. If there is, we write to the block device, which is unexpected
    behavior. If there isn't, we fall through normally and inode->i_data is used.
    We can also end up racing against open/close which can result in crashes
    when i_mapping points to a block device inode that has been closed.

    Since there can't be any page cache associated with special file inodes,
    it's safe to skip the btrfs_wait_ordered_range call entirely and avoid
    the problem.

    Cc:
    Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=100911
    Tested-by: Christoph Biedl
    Signed-off-by: Jeff Mahoney
    Reviewed-by: Filipe Manana

    Jeff Mahoney
     
  • If a file has a range pointing to a compressed extent, followed by
    another range that points to the same compressed extent and a read
    operation attempts to read both ranges (either completely or part of
    them), the pages that correspond to the second range are incorrectly
    filled with zeroes.

    Consider the following example:

    File layout
    [0 - 8K]                    [8K - 24K]
        |                           |
        |                           |
    points to extent X,         points to extent X,
    offset 4K, length of 8K     offset 0, length 16K

    [extent X, compressed length = 4K uncompressed length = 16K]

    If a readpages() call spans the 2 ranges, a single bio to read the extent
    is submitted - extent_io.c:submit_extent_page() would only create a new
    bio to cover the second range pointing to the extent if the extent it
    points to had a different logical address than the extent associated with
    the first range. As a consequence, the compressed read end io handler
    (compression.c:end_compressed_bio_read()) finishes once the extent
    is decompressed into the pages covering the first range, leaving the
    remaining pages (belonging to the second range) filled with zeroes (done
    by compression.c:btrfs_clear_biovec_end()).

    So fix this by submitting the current bio whenever we find a range
    pointing to a compressed extent that was preceded by a range with a
    different extent map. This is the simplest solution for this corner
    case. Making the end io callback populate both ranges (or more, if we
    have multiple ranges pointing to the same extent) is a much more complex
    solution since each bio is tightly coupled with a single extent map and
    the extent maps associated with the ranges pointing to the shared extent
    can have different offsets and lengths.
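
    The rule described above can be illustrated with a small user-space
    model. This is only a sketch under stated assumptions, not the kernel
    code: the Range class, the group_into_bios() helper and the 0x1000
    logical address are hypothetical names and values invented for
    illustration, and each Range stands in for one extent map.

```python
from dataclasses import dataclass

K = 1024


@dataclass(frozen=True)
class Range:
    """One file range in the toy model; each Range stands for one extent map."""
    file_offset: int    # where the range starts in the file
    length: int         # length of the range in the file
    extent_start: int   # logical address of the extent it points to
    extent_offset: int  # offset into the uncompressed extent
    compressed: bool


def group_into_bios(ranges, fixed):
    """Group consecutive ranges into bios (lists of ranges).

    fixed=False: a new bio is started only when the extent's logical
                 address changes (the behaviour the commit describes as buggy).
    fixed=True:  a new bio is also started when a range points to a
                 compressed extent and the preceding range used a
                 different extent map.
    """
    bios = []
    prev = None
    for r in ranges:
        new_extent = prev is None or r.extent_start != prev.extent_start
        different_map = prev is not None and r.file_offset != prev.file_offset
        if new_extent or (fixed and r.compressed and different_map):
            bios.append([r])
        else:
            bios[-1].append(r)
        prev = r
    return bios


# The layout from the example: both ranges point to compressed extent X.
EXTENT_X = 0x1000  # hypothetical logical address
LAYOUT = [
    Range(0, 8 * K, EXTENT_X, 4 * K, True),
    Range(8 * K, 16 * K, EXTENT_X, 0, True),
]
```

    In this model, group_into_bios(LAYOUT, fixed=False) puts both ranges
    into one bio, which is the situation that left the second range
    zero-filled, while fixed=True yields one bio per range.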

    The following test case for fstests triggers the issue:

    seq=`basename $0`
    seqres=$RESULT_DIR/$seq
    echo "QA output created by $seq"
    tmp=/tmp/$$
    status=1 # failure is the default!
    trap "_cleanup; exit \$status" 0 1 2 3 15

    _cleanup()
    {
        rm -f $tmp.*
    }

    # get standard environment, filters and checks
    . ./common/rc
    . ./common/filter

    # real QA test starts here
    _need_to_be_root
    _supported_fs btrfs
    _supported_os Linux
    _require_scratch
    _require_cloner

    rm -f $seqres.full

    test_clone_and_read_compressed_extent()
    {
        local mount_opts=$1

        _scratch_mkfs >>$seqres.full 2>&1
        _scratch_mount $mount_opts

        # Create a test file with a single extent that is compressed (the
        # data we write into it is highly compressible no matter which
        # compression algorithm is used, zlib or lzo).
        $XFS_IO_PROG -f -c "pwrite -S 0xaa 0K 4K" \
                -c "pwrite -S 0xbb 4K 8K" \
                -c "pwrite -S 0xcc 12K 4K" \
                $SCRATCH_MNT/foo | _filter_xfs_io

        # Now clone our extent into an adjacent offset.
        $CLONER_PROG -s $((4 * 1024)) -d $((16 * 1024)) -l $((8 * 1024)) \
                $SCRATCH_MNT/foo $SCRATCH_MNT/foo

        # Same as before but for this file we clone the extent into a lower
        # file offset.
        $XFS_IO_PROG -f -c "pwrite -S 0xaa 8K 4K" \
                -c "pwrite -S 0xbb 12K 8K" \
                -c "pwrite -S 0xcc 20K 4K" \
                $SCRATCH_MNT/bar | _filter_xfs_io

        $CLONER_PROG -s $((12 * 1024)) -d 0 -l $((8 * 1024)) \
                $SCRATCH_MNT/bar $SCRATCH_MNT/bar

        echo "File digests before unmounting filesystem:"
        md5sum $SCRATCH_MNT/foo | _filter_scratch
        md5sum $SCRATCH_MNT/bar | _filter_scratch

        # Evicting the inode or clearing the page cache before reading
        # again the file would also trigger the bug - reads were returning
        # all bytes in the range corresponding to the second reference to
        # the extent with a value of 0, but the correct data was persisted
        # (it was a bug exclusively in the read path). The issue happened
        # only if the same readpages() call targeted pages belonging to the
        # first and second ranges that point to the same compressed extent.
        _scratch_remount

        echo "File digests after mounting filesystem again:"
        # Must match the same digests we got before.
        md5sum $SCRATCH_MNT/foo | _filter_scratch
        md5sum $SCRATCH_MNT/bar | _filter_scratch
    }

    echo -e "\nTesting with zlib compression..."
    test_clone_and_read_compressed_extent "-o compress=zlib"

    _scratch_unmount

    echo -e "\nTesting with lzo compression..."
    test_clone_and_read_compressed_extent "-o compress=lzo"

    status=0
    exit

    Cc: stable@vger.kernel.org
    Signed-off-by: Filipe Manana
    Reviewed-by: Qu Wenruo
    Reviewed-by: Liu Bo

    Filipe Manana
     
  • Pull CIFS fixes from Steve French:
    "Two small cifs fixes"

    * 'for-next' of git://git.samba.org/sfrench/cifs-2.6:
    [CIFS] mount option sec=none not displayed properly in /proc/mounts
    CIFS: fix type confusion in copy offload ioctl

    Linus Torvalds
     

13 Sep, 2015

2 commits

  • Fix up the writeback plugging introduced in commit d353d7587d02
    ("writeback: plug writeback at a high level") that then caused problems
    due to the unplug happening with a spinlock held.

    * writeback-plugging:
    writeback: plug writeback in wb_writeback() and writeback_inodes_wb()
    Revert "writeback: plug writeback at a high level"

    Linus Torvalds
     
  • We had to revert the plugging in writeback_sb_inodes() because the
    wb->list_lock is held, but we could easily plug at a higher level before
    taking that lock, and unplug after releasing it. This does that.
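
    The ordering issue can be sketched with a toy user-space model. This is
    not kernel code: the Plug class and both writeback_* helpers below are
    hypothetical, and a threading.Lock stands in for wb->list_lock. The
    model only shows why finishing the plug has to happen after the lock is
    released.

```python
import threading


class Plug:
    """Toy plug: batches I/O and flushes it when the plug is finished."""

    def __init__(self, list_lock):
        self.list_lock = list_lock
        self.pending = []
        self.flushed = []

    def __enter__(self):
        return self

    def __exit__(self, *exc):
        # Flushing may sleep, so in this model it must not run while the
        # lock standing in for the spinlock is held.
        if self.list_lock.locked():
            raise RuntimeError("sleeping function called from invalid context")
        self.flushed, self.pending = self.pending, []
        return False


def writeback_plug_inside(list_lock, inodes):
    """Reverted layout: plug and unplug while the lock is held."""
    with list_lock:
        with Plug(list_lock) as plug:
            plug.pending.extend(inodes)
    return plug.flushed


def writeback_plug_outside(list_lock, inodes):
    """Layout described above: plug before taking the lock, unplug after."""
    with Plug(list_lock) as plug:
        with list_lock:
            plug.pending.extend(inodes)
    return plug.flushed
```

    Calling writeback_plug_outside() flushes the queued items normally,
    while writeback_plug_inside() raises because the flush is attempted
    with the lock still held.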

    Chris will run performance numbers, just to verify that this approach is
    comparable to the alternative (we could just drop and re-take the lock
    around the blk_finish_plug() rather than these two commits).

    I'd have preferred waiting for actual performance numbers before picking
    one approach over the other, but I don't want to release rc1 with the
    known "sleeping function called from invalid context" issue, so I'll
    pick this cleanup version for now. But if the numbers show that we
    really want to plug just at the writeback_sb_inodes() level, and we
    should just play ugly games with the spinlock, we'll switch to that.

    Cc: Chris Mason
    Cc: Josef Bacik
    Cc: Dave Chinner
    Cc: Neil Brown
    Cc: Jan Kara
    Cc: Christoph Hellwig
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

12 Sep, 2015

6 commits

  • Merge fourth patch-bomb from Andrew Morton:

    - sys_membarrier syscall

    - seq_file interface changes

    - a few misc fixups

    * emailed patches from Andrew Morton:
    revert "ocfs2/dlm: use list_for_each_entry instead of list_for_each"
    mm/early_ioremap: add explicit #include of asm/early_ioremap.h
    fs/seq_file: convert int seq_vprint/seq_printf/etc... returns to void
    selftests: enhance membarrier syscall test
    selftests: add membarrier syscall test
    sys_membarrier(): system-wide memory barrier (generic, x86)
    MODSIGN: fix a compilation warning in extract-cert

    Linus Torvalds
     
  • When the user specifies "sec=none" in a cifs mount, we set
    sec_type as unspecified (and set a flag and the username will be
    null) rather than setting sectype as "none", so
    cifs_show_security was not properly displaying it in
    cifs /proc/mounts entries.

    Signed-off-by: Steve French
    Reviewed-by: Jeff Layton

    Steve French
     
  • Revert commit f83c7b5e9fd6 ("ocfs2/dlm: use list_for_each_entry instead
    of list_for_each").

    list_for_each_entry() will dereference its `pos' argument, which can be
    NULL in dlm_process_recovery_data().

    Reported-by: Julia Lawall
    Reported-by: Fengguang Wu
    Cc: Joseph Qi
    Cc: Mark Fasheh
    Cc: Joel Becker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • The seq_ function return values were frequently misused.

    See: commit 1f33c41c03da ("seq_file: Rename seq_overflow() to
    seq_has_overflowed() and make public")

    All uses of these return values have been removed, so convert the
    return types to void.

    Miscellanea:

    o Move seq_put_decimal_ and seq_escape prototypes closer to the
    other seq_vprintf prototypes
    o Reorder seq_putc and seq_puts to return early on overflow
    o Add argument names to seq_vprintf and seq_printf
    o Update the seq_escape kernel-doc
    o Convert a couple of leading spaces to tabs in seq_escape
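
    The calling convention this change settles on (output helpers return
    nothing, and callers that care ask about overflow explicitly) can be
    sketched with a toy buffer. The SeqBuf class below is a hypothetical
    user-space stand-in, loosely modeled on the behaviour described here;
    it is not the kernel seq_file API.

```python
class SeqBuf:
    """Toy fixed-size output buffer with an explicit overflow query."""

    def __init__(self, size):
        self.size = size
        self.buf = ""
        self.count = 0

    def puts(self, s):
        """Append s if it fits; deliberately returns nothing."""
        if self.count + len(s) >= self.size:
            # Return early on overflow and remember that it happened.
            self.count = self.size
            return
        self.buf += s
        self.count += len(s)

    def has_overflowed(self):
        """The one place callers check for overflow."""
        return self.count == self.size
```

    A caller writes unconditionally with puts() and checks
    has_overflowed() once at the end, rather than testing a return value
    after every call.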

    Signed-off-by: Joe Perches
    Cc: Al Viro
    Cc: Steven Rostedt
    Cc: Mark Brown
    Cc: Stephen Rothwell
    Cc: Joerg Roedel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Perches
     
  • This reverts commit d353d7587d02116b9732d5c06615aed75a4d3a47.

    Doing the block layer plug/unplug inside writeback_sb_inodes() is
    broken, because that function is actually called with a spinlock held:
    wb->list_lock, as pointed out by Chris Mason.

    Chris suggested just dropping and re-taking the spinlock around the
    blk_finish_plug() call (the plugging itself can happen under the
    spinlock), and that would technically work, but is just disgusting.

    We do something fairly similar - but not quite as disgusting because we
    at least have a better reason for it - in writeback_single_inode(), so
    it's not like the caller can depend on the lock being held over the
    call, but in this case there just isn't any good reason for that
    "release and re-take the lock" pattern.

    [ In general, we should really strive to avoid the "release and retake"
    pattern for locks, because in the general case it can easily cause
    subtle bugs when the caller caches any state around the call that
    might be invalidated by dropping the lock even just temporarily. ]

    But in this case, the plugging should be easy to just move up to the
    callers before the spinlock is taken, which should even improve the
    effectiveness of the plug. So there is really no good reason to play
    games with locking here.

    I'll send off a test-patch so that Dave Chinner can verify that that
    plug movement works. In the meantime this just reverts the problematic
    commit and adds a comment to the function so that we hopefully don't
    make this mistake again.

    Reported-by: Chris Mason
    Cc: Josef Bacik
    Cc: Dave Chinner
    Cc: Neil Brown
    Cc: Jan Kara
    Cc: Christoph Hellwig
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     
  • Pull btrfs cleanups and fixes from Chris Mason:
    "These are small cleanups, and also some fixes for our async worker
    thread initialization.

    I was having some trouble testing these, but it ended up being a
    combination of changing around my test servers and a shiny new
    schedule while atomic from the new start/finish_plug in
    writeback_sb_inodes().

    That one only hits on btrfs raid5/6 or MD raid10, and if I wasn't
    changing a bunch of things in my test setup at once it would have been
    really clear. Fix for writeback_sb_inodes() on the way as well"

    * 'for-linus-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
    Btrfs: cleanup: remove unnecessary check before btrfs_free_path is called
    btrfs: async_thread: Fix workqueue 'max_active' value when initializing
    btrfs: Add raid56 support for updating num_tolerated_disk_barrier_failures in btrfs_balance
    btrfs: Cleanup for btrfs_calc_num_tolerated_disk_barrier_failures
    btrfs: Remove noused chunk_tree and chunk_objectid from scrub_enumerate_chunks and scrub_chunk
    btrfs: Update out-of-date "skip parity stripe" comment

    Linus Torvalds