07 Mar, 2015

15 commits

  • commit 23b133bdc452aa441fcb9b82cbf6dd05cfd342d0 upstream.

    Check length of extended attributes and allocation descriptors when
    loading inodes from disk. Otherwise corrupted filesystems could confuse
    the code and make the kernel oops.

    Reported-by: Carl Henrik Lunde
    Signed-off-by: Jan Kara
    Signed-off-by: Greg Kroah-Hartman

    Jan Kara
     
  • commit 79144954278d4bb5989f8b903adcac7a20ff2a5a upstream.

    Store blocksize in a local variable in udf_fill_inode() since it is used
    a lot of times.

    Signed-off-by: Jan Kara
    Signed-off-by: Greg Kroah-Hartman

    Jan Kara
     
  • commit d8ba1f971497c19cf80da1ea5391a46a5f9fbd41 upstream.

    If the call to decode_rc_list() fails due to a memory allocation error,
    then we need to truncate the array size to ensure that we only call
    kfree() on those pointers that were allocated.

    Reported-by: David Ramos
    Fixes: 4aece6a19cf7f ("nfs41: cb_sequence xdr implementation")
    Signed-off-by: Trond Myklebust
    Signed-off-by: Greg Kroah-Hartman

    Trond Myklebust
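
The truncate-on-failure pattern from this message can be illustrated in userspace C. This is only a sketch: `struct entry`, `fill_entries()`, `free_entries()` and the `fail_at` test hook are hypothetical names, and `malloc()`/`free()` stand in for the kernel allocator.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical element type: each entry owns one heap buffer. */
struct entry {
    int *data;
};

/*
 * Fill up to *count entries. On allocation failure, truncate *count to the
 * number of entries that were actually allocated, so the caller's cleanup
 * loop never frees an uninitialised pointer. fail_at is a test hook: the
 * allocation for that index is forced to fail; pass (size_t)-1 to disable.
 */
int fill_entries(struct entry *arr, size_t *count, size_t fail_at)
{
    assert(arr != NULL && count != NULL);

    for (size_t i = 0; i < *count; i++) {
        arr[i].data = (i == fail_at) ? NULL : malloc(sizeof(int));
        if (arr[i].data == NULL) {
            *count = i; /* only entries [0, i) hold valid pointers */
            return -1;
        }
    }
    return 0;
}

/* Free exactly the entries recorded in count. */
void free_entries(struct entry *arr, size_t count)
{
    for (size_t i = 0; i < count; i++)
        free(arr[i].data);
}
```

Without the `*count = i` assignment, the cleanup loop would walk the full array and pass uninitialised pointers to `free()`.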
     
  • commit ea7c38fef0b774a5dc16fb0ca5935f0ae8568176 upstream.

    If we have to do a return-on-close in the delegreturn code, then
    we must ensure that the inode and super block remain referenced.

    Cc: Peng Tao
    Signed-off-by: Trond Myklebust
    Reviewed-by: Peng Tao
    Signed-off-by: Greg Kroah-Hartman

    Trond Myklebust
     
  • commit 03a9a42a1a7e5b3e7919ddfacc1d1cc81882a955 upstream.

    Fix an Oopsable condition when nsm_mon_unmon is called as part of the
    namespace cleanup, which now apparently happens after the utsname
    has been freed.

    Link: http://lkml.kernel.org/r/20150125220604.090121ae@neptune.home
    Reported-by: Bruno Prémont
    Signed-off-by: Trond Myklebust
    Signed-off-by: Greg Kroah-Hartman

    Trond Myklebust
     
  • commit cb5d04bc39e914124e811ea55f3034d2379a5f6c upstream.

    With pgio refactoring in v3.15, .init_read and .init_write can be
    called with valid pgio->pg_lseg. file layout was fixed at that time
    by commit c6194271f (pnfs: filelayout: support non page aligned
    layouts). But the generic helper still needs to be fixed.

    Signed-off-by: Peng Tao
    Signed-off-by: Greg Kroah-Hartman

    Peng Tao
     
  • commit f4086a3d789dbe18949862276d83b8f49fce6d2f upstream.

    Commit 411a99adffb4f (nfs: clear_request_commit while holding i_lock)
    assumes that the nfs_commit_info always points to the inode->i_lock.
    For historical reasons, that is not the case for O_DIRECT writes.

    Cc: Weston Andros Adamson
    Fixes: 411a99adffb4f ("nfs: clear_request_commit while holding i_lock")
    Signed-off-by: Trond Myklebust
    Signed-off-by: Greg Kroah-Hartman

    Trond Myklebust
     
  • commit 6ffa30d3f734d4f6b478081dfc09592021028f90 upstream.

    Bruce reported seeing this warning pop when mounting using v4.1:

    ------------[ cut here ]------------
    WARNING: CPU: 1 PID: 1121 at kernel/sched/core.c:7300 __might_sleep+0xbd/0xd0()
    do not call blocking ops when !TASK_RUNNING; state=1 set at [] prepare_to_wait+0x2f/0x90
    Modules linked in: rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace sunrpc fscache ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 xt_conntrack ebtable_nat ebtable_broute bridge stp llc ebtable_filter ebtables ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw ip6table_filter ip6_tables iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_security iptable_raw snd_hda_codec_generic snd_hda_intel snd_hda_controller snd_hda_codec snd_hwdep snd_pcm snd_timer ppdev joydev snd virtio_console virtio_balloon pcspkr serio_raw parport_pc parport pvpanic floppy soundcore i2c_piix4 virtio_blk virtio_net qxl drm_kms_helper ttm drm virtio_pci virtio_ring ata_generic virtio pata_acpi
    CPU: 1 PID: 1121 Comm: nfsv4.1-svc Not tainted 3.19.0-rc4+ #25
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140709_153950- 04/01/2014
    0000000000000000 000000004e5e3f73 ffff8800b998fb48 ffffffff8186ac78
    0000000000000000 ffff8800b998fba0 ffff8800b998fb88 ffffffff810ac9da
    ffff8800b998fb68 ffffffff81c923e7 00000000000004d9 0000000000000000
    Call Trace:
    [] dump_stack+0x4c/0x65
    [] warn_slowpath_common+0x8a/0xc0
    [] warn_slowpath_fmt+0x55/0x70
    [] ? prepare_to_wait+0x2f/0x90
    [] ? prepare_to_wait+0x2f/0x90
    [] __might_sleep+0xbd/0xd0
    [] kmem_cache_alloc_trace+0x243/0x430
    [] ? groups_alloc+0x3e/0x130
    [] groups_alloc+0x3e/0x130
    [] svcauth_unix_accept+0x16e/0x290 [sunrpc]
    [] svc_authenticate+0xe1/0xf0 [sunrpc]
    [] svc_process_common+0x244/0x6a0 [sunrpc]
    [] bc_svc_process+0x1c4/0x260 [sunrpc]
    [] nfs41_callback_svc+0x128/0x1f0 [nfsv4]
    [] ? wait_woken+0xc0/0xc0
    [] ? nfs4_callback_svc+0x60/0x60 [nfsv4]
    [] kthread+0x11f/0x140
    [] ? local_clock+0x15/0x30
    [] ? kthread_create_on_node+0x250/0x250
    [] ret_from_fork+0x7c/0xb0
    [] ? kthread_create_on_node+0x250/0x250
    ---[ end trace 675220a11e30f4f2 ]---

    nfs41_callback_svc does most of its work while in TASK_INTERRUPTIBLE,
    which is just wrong. Fix that by finishing the wait immediately if we've
    found that the list has something on it.

    Also, we don't expect this kthread to accept signals, so we should be
    using a TASK_UNINTERRUPTIBLE sleep instead. That, however, opens us up
    to hung task warnings from the watchdog, so have the schedule_timeout
    wake up every 60s if there's no callback activity.

    Reported-by: "J. Bruce Fields"
    Signed-off-by: Jeff Layton
    Signed-off-by: Trond Myklebust
    Signed-off-by: Greg Kroah-Hartman

    Jeff Layton
     
  • commit 05fbf357d94152171bc50f8a369390f1f16efd89 upstream.

    Lockless access to pte in pagemap_pte_range() might race with page
    migration and trigger BUG_ON(!PageLocked()) in migration_entry_to_page():

    CPU A (pagemap)                    CPU B (migration)
                                       lock_page()
                                       try_to_unmap(page, TTU_MIGRATION...)
                                         make_migration_entry()
                                         set_pte_at()

    pte_to_pagemap_entry()
                                       remove_migration_ptes()
                                       unlock_page()
      if(is_migration_entry())
        migration_entry_to_page()
          BUG_ON(!PageLocked(page))

    Also lockless read might be non-atomic if pte is larger than wordsize.
    Other pte walkers (smaps, numa_maps, clear_refs) already lock ptes.

    Fixes: 052fb0d635df ("proc: report file/anon bit in /proc/pid/pagemap")
    Signed-off-by: Konstantin Khlebnikov
    Reported-by: Andrey Ryabinin
    Reviewed-by: Cyrill Gorcunov
    Acked-by: Naoya Horiguchi
    Acked-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Konstantin Khlebnikov
     
  • commit e9892d3cc853afdda2cc69e2576d9ddb5fafad71 upstream.

    The commit 2d3d0c5 ("xfs: lobotomise xfs_trans_read_buf_map()") left
    a landmine in the tracing code: trace_xfs_trans_buf_read() is now
    called on all buffers that are read through this interface rather than
    just buffers in transactions. For buffers outside transaction
    context, bp->b_fspriv is null, and so the buf log item tracing
    functions cannot be called. This causes a NULL pointer dereference
    in the trace_xfs_trans_buf_read() function when tracing is turned
    on.

    Signed-off-by: Dave Chinner
    Reviewed-by: Brian Foster
    Signed-off-by: Dave Chinner
    Signed-off-by: Greg Kroah-Hartman

    Dave Chinner
     
  • commit 3443a3bca54588f43286b725d8648d33a38c86f1 upstream.

    When the superblock is modified in a transaction, the commonly
    modified fields are not actually copied to the superblock buffer to
    avoid the buffer lock becoming a serialisation point. However, there
    are some other operations that modify the superblock fields within
    the transaction that don't directly log to the superblock but rely
    on the changes to be applied during the transaction commit (to
    minimise the buffer lock hold time).

    When we do this, we fail to mark the buffer log item as being a
    superblock buffer and that can lead to the buffer not being marked
    with the correct type in the log and hence causing recovery issues.
    Fix it by setting the type correctly, similar to xfs_mod_sb()...

    Tested-by: Jan Kara
    Signed-off-by: Dave Chinner
    Reviewed-by: Brian Foster
    Signed-off-by: Dave Chinner
    Signed-off-by: Greg Kroah-Hartman

    Dave Chinner
     
  • commit fe22d552b82d7cc7de1851233ae8bef579198637 upstream.

    Conversion from local to extent format does not set the buffer type
    correctly on the new extent buffer when a symlink data is moved out
    of line.

    Fix the symlink code and leave a comment in the generic bmap code
    reminding us that the format-specific data copy needs to set the
    destination buffer type appropriately.

    Tested-by: Jan Kara
    Signed-off-by: Dave Chinner
    Reviewed-by: Brian Foster
    Signed-off-by: Dave Chinner
    Signed-off-by: Greg Kroah-Hartman

    Dave Chinner
     
  • commit f19b872b086711bb4b22c3a0f52f16aa920bcc61 upstream.

    This leads to log recovery throwing errors like:

    XFS (md0): Mounting V5 Filesystem
    XFS (md0): Starting recovery (logdev: internal)
    XFS (md0): Unknown buffer type 0!
    XFS (md0): _xfs_buf_ioapply: no ops on block 0xaea8802/0x1
    ffff8800ffc53800: 58 41 47 49 .....

    Which is the AGI buffer magic number.

    Ensure that we set the type appropriately in both unlink list
    addition and removal.

    Tested-by: Jan Kara
    Signed-off-by: Dave Chinner
    Reviewed-by: Brian Foster
    Signed-off-by: Dave Chinner
    Signed-off-by: Greg Kroah-Hartman

    Dave Chinner
     
  • commit 0d612fb570b71ea2e49554a770cff4c489018b2c upstream.

    Jan Kara reported that log recovery was finding buffers with invalid
    types in them. This should not happen, and indicates a bug in the
    logging of buffers. To catch this, add asserts to the buffer
    formatting code to ensure that the buffer type is in range when the
    transaction is committed.

    We don't set a type on buffers being marked stale - they are not
    going to get replayed, the format item exists only for recovery to
    be able to prevent replay of the buffer, so the type does not
    matter. Hence that needs special casing here.

    Reported-by: Jan Kara
    Tested-by: Jan Kara
    Signed-off-by: Dave Chinner
    Reviewed-by: Brian Foster
    Signed-off-by: Dave Chinner
    Signed-off-by: Greg Kroah-Hartman

    Dave Chinner
     
  • commit 2d5b86e048780c5efa7f7d9708815555919e7b05 upstream.

    As of v3.18, ext4 started rejecting a remount which changes the
    journal_checksum option.

    Prior to that, it was simply ignored; the problem here is that
    if someone has this in their fstab for the root fs, now the box
    fails to boot properly, because remount of root with the new options
    will fail, and the box proceeds with a readonly root.

    I think it is a little nicer behavior to accept the option, but
    warn that it's being ignored, rather than failing the mount,
    but that might be a subjective matter...

    Reported-by: Cónräd
    Signed-off-by: Eric Sandeen
    Signed-off-by: Theodore Ts'o
    Cc: Josh Boyer
    Signed-off-by: Greg Kroah-Hartman

    Eric Sandeen
     

09 Feb, 2015

1 commit


08 Feb, 2015

1 commit


06 Feb, 2015

1 commit

  • Nilfs2 eventually hangs in a stress test with fsstress program. This
    issue was caused by the following deadlock over I_SYNC flag between
    nilfs_segctor_thread() and writeback_sb_inodes():

    nilfs_segctor_thread()
      nilfs_segctor_thread_construct()
        nilfs_segctor_unlock()
          nilfs_dispose_list()
            iput()
              iput_final()
                evict()
                  inode_wait_for_writeback() * wait for I_SYNC flag

    writeback_sb_inodes()
      * set I_SYNC flag on inode->i_state
      __writeback_single_inode()
        do_writepages()
          nilfs_writepages()
            nilfs_construct_dsync_segment()
              nilfs_segctor_sync()
                * wait for completion of segment constructor
      inode_sync_complete()
        * clear I_SYNC flag after __writeback_single_inode() completed

    writeback_sb_inodes() calls do_writepages() for dirty inodes after
    setting I_SYNC flag on inode->i_state. do_writepages() in turn calls
    nilfs_writepages(), which can run segment constructor and wait for its
    completion. On the other hand, segment constructor calls iput(), which
    can call evict() and wait for the I_SYNC flag on
    inode_wait_for_writeback().

    Since segment constructor doesn't know when I_SYNC will be set, it
    cannot know whether iput() will block or not unless inode->i_nlink has a
    non-zero count. We can prevent evict() from being called in iput() by
    implementing sop->drop_inode(), but it's not preferable to leave inodes
    with i_nlink == 0 for long periods because it even defers file
    truncation and inode deallocation. So, this instead resolves the
    deadlock by calling iput() asynchronously with a workqueue for inodes
    with i_nlink == 0.

    Signed-off-by: Ryusuke Konishi
    Cc: Al Viro
    Tested-by: Ryusuke Konishi
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ryusuke Konishi
     

05 Feb, 2015

2 commits


04 Feb, 2015

1 commit

  • Under CONFIG_DEBUG_ATOMIC_SLEEP=y, aio_read_event_ring() will throw
    warnings like the following due to being called from wait_event
    context:

    WARNING: CPU: 0 PID: 16006 at kernel/sched/core.c:7300 __might_sleep+0x7f/0x90()
    do not call blocking ops when !TASK_RUNNING; state=1 set at [] prepare_to_wait_event+0x63/0x110
    Modules linked in:
    CPU: 0 PID: 16006 Comm: aio-dio-fcntl-r Not tainted 3.19.0-rc6-dgc+ #705
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
    ffffffff821c0372 ffff88003c117cd8 ffffffff81daf2bd 000000000000d8d8
    ffff88003c117d28 ffff88003c117d18 ffffffff8109beda ffff88003c117cf8
    ffffffff821c115e 0000000000000061 0000000000000000 00007ffffe4aa300
    Call Trace:
    [] dump_stack+0x4c/0x65
    [] warn_slowpath_common+0x8a/0xc0
    [] warn_slowpath_fmt+0x46/0x50
    [] ? prepare_to_wait_event+0x63/0x110
    [] ? prepare_to_wait_event+0x63/0x110
    [] __might_sleep+0x7f/0x90
    [] mutex_lock+0x24/0x45
    [] aio_read_events+0x4c/0x290
    [] read_events+0x1ec/0x220
    [] ? prepare_to_wait_event+0x110/0x110
    [] ? hrtimer_get_res+0x50/0x50
    [] SyS_io_getevents+0x4d/0xb0
    [] system_call_fastpath+0x12/0x17
    ---[ end trace bde69eaf655a4fea ]---

    There is not actually a bug here, so annotate the code to tell the
    debug logic that everything is just fine and not to fire a false
    positive.

    Signed-off-by: Dave Chinner
    Signed-off-by: Benjamin LaHaise

    Dave Chinner
     

31 Jan, 2015

2 commits


30 Jan, 2015

1 commit

  • Pull NFS client bugfixes from Trond Myklebust:
    "Highlights include:

    - Stable fix for a NFSv4.1 Oops on mount
    - Stable fix for an O_DIRECT deadlock condition
    - Fix an issue with submounted volumes and fake duplicate inode
    numbers"

    * tag 'nfs-for-3.19-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
    NFS: Fix use of nfs_attr_use_mounted_on_fileid()
    NFSv4.1: Fix an Oops in nfs41_walk_client_list
    nfs: fix dio deadlock when O_DIRECT flag is flipped

    Linus Torvalds
     

28 Jan, 2015

3 commits

  • Currently ->get_dqblk() and ->set_dqblk() use struct fs_disk_quota which
    tracks space limits and usage in 512-byte blocks. However VFS quotas
    track usage in bytes (as some filesystems require that) and we need to
    somehow pass this information. Up to now it wasn't a problem because we
    didn't do any unit conversion (thus VFS quota routines happily stuck
    the number of bytes into the d_bcount field of struct fs_disk_quota). Only if
    you tried to use Q_XGETQUOTA or Q_XSETQLIM for VFS quotas (or Q_GETQUOTA
    / Q_SETQUOTA for XFS quotas), you got bogus results. Hardly anyone
    tried this but reportedly some Samba users hit the problem in practice.
    So if we want the interfaces to be compatible, we need to fix this.

    We bite the bullet and define another quota structure used for passing
    information from/to ->get_dqblk()/->set_dqblk. It's somewhat sad we have
    to have more conversion routines in fs/quota/quota.c and another copying
    of quota structure slows down getting of quota information by about 2%
    but it seems cleaner than overloading e.g. units of d_bcount to bytes.

    CC: stable@vger.kernel.org
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jan Kara

    Jan Kara
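
The unit mismatch described above comes down to converting between bytes and 512-byte blocks. A minimal sketch of such conversion helpers follows; the names `bb_to_bytes()` and `bytes_to_bb()` and the `BB_SHIFT` macro are hypothetical and are not taken from fs/quota/quota.c.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption for this sketch: a "basic block" is 512 bytes (shift of 9). */
#define BB_SHIFT 9

/* Convert a count of 512-byte blocks to bytes. */
uint64_t bb_to_bytes(uint64_t blocks)
{
    assert(blocks <= (UINT64_MAX >> BB_SHIFT)); /* result must fit in 64 bits */
    return blocks << BB_SHIFT;
}

/* Convert bytes to 512-byte blocks, rounding up so usage is not understated. */
uint64_t bytes_to_bb(uint64_t bytes)
{
    const uint64_t round = (UINT64_C(1) << BB_SHIFT) - 1;

    assert(bytes <= UINT64_MAX - round); /* the rounding add must not wrap */
    return (bytes + round) >> BB_SHIFT;
}
```

Storing a byte count in a field that callers read as a block count, with no conversion like this in between, yields values that are off by a factor of 512, which matches the bogus results the message describes.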
     
  • Commit 6fb1ca92a640 "udf: Fix race between write(2) and close(2)"
    changed the condition when preallocation is released. The idea was that
    we don't want to release the preallocation for an inode on close when
    there are other writeable file descriptors for the inode. However the
    condition was written in the opposite way so we released preallocation
    only if there were other writeable file descriptors. Fix the problem by
    changing the condition properly.

    CC: stable@vger.kernel.org
    Fixes: 6fb1ca92a6409a9d5b0696447cd4997bc9aaf5a2
    Reported-by: Fabian Frederick
    Signed-off-by: Jan Kara

    Jan Kara
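
The inverted condition is easy to see when both forms are written as small predicates. This is a sketch under assumptions: the function and parameter names are hypothetical, and `writers` is taken to count the writable descriptors open on the inode, including the one being closed.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Decide whether closing a file should release the inode's preallocation.
 * Intended behaviour: release only when the last writer goes away.
 */
bool should_release_prealloc(bool closing_writable, int writers)
{
    assert(writers >= 0);
    return closing_writable && writers == 1;
}

/* The inverted form: releases only while other writers still exist. */
bool should_release_prealloc_inverted(bool closing_writable, int writers)
{
    assert(writers >= 0);
    return closing_writable && writers > 1;
}
```

With two writable descriptors open, the intended predicate keeps the preallocation on the first close and releases it on the second; the inverted one does the opposite.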
     
  • The xfstests btrfs/072 reports uncorrectable read errors in dmesg,
    because scrub forgets to use commit_root for parity scrub routine
    and scrub attempts to scrub those extents items whose contents are
    not fully on disk.

    To fix it, we just add the @search_commit_root flag back.

    Signed-off-by: Gui Hecheng
    Signed-off-by: Qu Wenruo
    Reviewed-by: Miao Xie
    Signed-off-by: Chris Mason

    Gui Hecheng
     

27 Jan, 2015

1 commit

  • If CONFIG_CIFS_WEAK_PW_HASH is not set, CIFSSEC_MUST_LANMAN
    and CIFSSEC_MUST_PLNTXT are defined as 0.

    When setting new SecurityFlags without any MUST flags,
    your flags would be overwritten with CIFSSEC_MUST_LANMAN (0).

    Signed-off-by: Niklas Cassel
    Signed-off-by: Steve French

    Niklas Cassel
     

26 Jan, 2015

1 commit


24 Jan, 2015

1 commit

  • Pull btrfs fixes from Chris Mason:
    "We have a few fixes in my for-linus branch.

    Qu Wenruo's batch fixes a regression between some of our merge window
    pulls and the inode_cache feature. The rest are smaller bugs"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
    btrfs: Don't call btrfs_start_transaction() on frozen fs to avoid deadlock.
    btrfs: Fix the bug that fs_info->pending_changes is never cleared.
    btrfs: fix state->private cast on 32 bit machines
    Btrfs: fix race deleting block group from space_info->ro_bgs list
    Btrfs: fix incorrect freeing in scrub_stripe
    btrfs: sync ioctl, handle errors after transaction start

    Linus Torvalds
     

22 Jan, 2015

3 commits

  • This function call was being optimized out during nfs_fhget(), leading
    to situations where we have a valid fileid but still want to use the
    mounted_on_fileid. For example, imagine we have our server configured
    like this:

    server % df
    Filesystem      Size  Used Avail Use% Mounted on
    /dev/vda1       9.1G  6.5G  1.9G  78% /
    /dev/vdb1       487M  2.3M  456M   1% /exports
    /dev/vdc1       487M  2.3M  456M   1% /exports/vol1
    /dev/vdd1       487M  2.3M  456M   1% /exports/vol2

    If our client mounts /exports and tries to do a "chown -R" across the
    entire mountpoint, we will get a nasty message warning us about a circular
    directory structure. Running chown with strace tells me that each directory
    has the same device and inode number:

    newfstatat(AT_FDCWD, "/nfs/", {st_dev=makedev(0, 38), st_ino=2, ...}) = 0
    newfstatat(4, "vol1", {st_dev=makedev(0, 38), st_ino=2, ...}) = 0
    newfstatat(4, "vol2", {st_dev=makedev(0, 38), st_ino=2, ...}) = 0

    With this patch the mounted_on_fileid values are used for st_ino, so the
    directory loop warning isn't reported.

    Signed-off-by: Anna Schumaker
    Signed-off-by: Trond Myklebust

    Anna Schumaker
     
  • If we start state recovery on a client that failed to initialise correctly,
    then we are very likely to Oops.

    Reported-by: "Mkrtchyan, Tigran"
    Link: http://lkml.kernel.org/r/130621862.279655.1421851650684.JavaMail.zimbra@desy.de
    Cc: stable@vger.kernel.org
    Signed-off-by: Trond Myklebust

    Trond Myklebust
     
  • We only support swap files calling nfs_direct_IO. However, an application
    might be able to get to nfs_direct_IO if it toggles the O_DIRECT flag
    during IO, and it can deadlock because we grab inode->i_mutex in
    nfs_file_direct_write(). So return 0 in that case; the generic
    layer will then fall back to buffered IO.

    Signed-off-by: Peng Tao
    Cc: stable@vger.kernel.org
    Signed-off-by: Trond Myklebust

    Peng Tao
     

21 Jan, 2015

2 commits

  • Commit 6b5fe46dfa52 (btrfs: do commit in sync_fs if there are pending
    changes) will call btrfs_start_transaction() in sync_fs(), to handle
    some operations needed to be done in next transaction.

    However this can cause deadlock if the filesystem is frozen, with the
    following sysrq+w output:
    [ 143.255932] Call Trace:
    [ 143.255936] [] schedule+0x29/0x70
    [ 143.255939] [] __sb_start_write+0xb3/0x100
    [ 143.255971] [] start_transaction+0x2e6/0x5a0 [btrfs]
    [ 143.255992] [] btrfs_start_transaction+0x1b/0x20 [btrfs]
    [ 143.256003] [] btrfs_sync_fs+0xca/0xd0 [btrfs]
    [ 143.256007] [] sync_fs_one_sb+0x20/0x30
    [ 143.256011] [] iterate_supers+0xe1/0xf0
    [ 143.256014] [] sys_sync+0x55/0x90
    [ 143.256017] [] system_call_fastpath+0x12/0x17
    [ 143.256111] Call Trace:
    [ 143.256114] [] schedule+0x29/0x70
    [ 143.256119] [] rwsem_down_write_failed+0x1c5/0x2d0
    [ 143.256123] [] call_rwsem_down_write_failed+0x13/0x20
    [ 143.256131] [] thaw_super+0x28/0xc0
    [ 143.256135] [] do_vfs_ioctl+0x3f5/0x540
    [ 143.256187] [] SyS_ioctl+0x91/0xb0
    [ 143.256213] [] system_call_fastpath+0x12/0x17

    The reason is as follows:
    (Holding s_umount)
    VFS sync_fs stuff:
    |- btrfs_sync_fs()
       |- btrfs_start_transaction()
          |- sb_start_intwrite()
          (Waiting thaw_fs to unfreeze)
                                        VFS thaw_fs stuff:
                                        thaw_fs()
                                        (Waiting sync_fs to release
                                         s_umount)

    So deadlock happens.
    This can be easily triggered by fstest/generic/068 with inode_cache
    mount option.

    The fix is to check whether the fs is frozen; if it is, just
    return and wait for the next transaction.

    Cc: David Sterba
    Reported-by: Gui Hecheng
    Signed-off-by: Qu Wenruo
    [enhanced comment, changed to SB_FREEZE_WRITE]
    Signed-off-by: David Sterba
    Signed-off-by: Chris Mason

    Qu Wenruo
     
  • fs_info->pending_changes is never cleared since the original code uses
    cmpxchg(&fs_info->pending_changes, 0, 0), which will only clear it if
    pending_changes is already 0.

    This causes a lot of problems when the filesystem is mounted with the
    inode_cache mount option.
    If btrfs is mounted with inode_cache, pending_changes will always be
    1, even when the fs is frozen.

    Signed-off-by: Qu Wenruo
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba
    Signed-off-by: Chris Mason

    Qu Wenruo
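
Why a compare-and-swap of 0 for 0 never clears a non-zero value can be shown with C11 atomics. This is a userspace sketch, not the btrfs code: the function names are hypothetical, and the unconditional exchange is shown as one possible way to clear the value, not necessarily the one the patch uses.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/*
 * Compare-and-swap with expected == 0 and new == 0: the store happens only
 * when the value is already 0, so a non-zero value is left untouched.
 * Returns the value observed before the operation.
 */
unsigned long clear_with_cas(atomic_ulong *v)
{
    unsigned long expected = 0;

    assert(v != NULL);
    atomic_compare_exchange_strong(v, &expected, 0);
    return expected; /* on failure this holds the current non-zero value */
}

/* Unconditional exchange: always stores 0 and returns the previous value. */
unsigned long clear_with_xchg(atomic_ulong *v)
{
    assert(v != NULL);
    return atomic_exchange(v, 0);
}
```

Both helpers report the old value, but only the exchange leaves the variable at 0 afterwards when it started out non-zero.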
     

20 Jan, 2015

5 commits

  • Commit
    c11f1df5003d534fd067f0168bfad7befffb3b5c
    requires writers to wait for any pending oplock break handler to
    complete before proceeding to write. This is done by waiting on bit
    CIFS_INODE_PENDING_OPLOCK_BREAK in cifsFileInfo->flags. This bit is
    cleared by the oplock break handler job queued on the workqueue once it
    has completed handling the oplock break allowing writers to proceed with
    writing to the file.

    While testing, it was noticed that the filehandle could be closed while
    there is a pending oplock break which results in the oplock break
    handler on the cifsiod workqueue being cancelled before it has had a
    chance to execute and clear the CIFS_INODE_PENDING_OPLOCK_BREAK bit.
    Any subsequent attempt to write to this file hangs waiting for the
    CIFS_INODE_PENDING_OPLOCK_BREAK bit to be cleared.

    We fix this by ensuring that we also clear the bit
    CIFS_INODE_PENDING_OPLOCK_BREAK when we remove the oplock break handler
    from the workqueue.

    The bug was found by Red Hat QA while testing using ltp's fsstress
    command.

    Signed-off-by: Sachin Prabhu
    Acked-by: Shirish Pargaonkar
    Signed-off-by: Jeff Layton
    Cc: stable@vger.kernel.org
    Signed-off-by: Steve French

    Sachin Prabhu
     
  • When leaving a function use memzero_explicit instead of memset(0) to
    clear stack allocated buffers. memset(0) may be optimized away.

    This particular buffer is highly likely to contain sensitive data which
    we shouldn't leak (it's named 'passwd' after all).

    Signed-off-by: Giel van Schijndel
    Acked-by: Herbert Xu
    Reported-at: http://www.viva64.com/en/b/0299/
    Reported-by: Andrey Karpov
    Reported-by: Svyatoslav Razmyslov
    Signed-off-by: Steve French

    Giel van Schijndel
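
The dead-store problem can be illustrated in userspace C. This is a sketch only: `zero_explicit()` is a hypothetical stand-in that writes through a volatile pointer, which is not how the kernel's memzero_explicit() is implemented, and `check_secret()` with its hard-coded string is invented for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Zero a buffer in a way the compiler is not expected to elide:
 * every store goes through a volatile-qualified pointer.
 */
void zero_explicit(void *buf, size_t len)
{
    volatile unsigned char *p = buf;

    assert(buf != NULL || len == 0);
    while (len--)
        *p++ = 0;
}

/* Hypothetical function that handles a secret in a stack buffer. */
int check_secret(const char *input)
{
    char passwd[32];
    int ok;

    strncpy(passwd, input, sizeof(passwd) - 1);
    passwd[sizeof(passwd) - 1] = '\0';
    ok = strcmp(passwd, "hunter2") == 0;

    /*
     * A plain memset(passwd, 0, sizeof(passwd)) here is a dead store,
     * since passwd is never read again, and may be optimized away.
     */
    zero_explicit(passwd, sizeof(passwd));
    return ok;
}
```

The point of the pattern is that the clearing call sits at the very end of the function, exactly where an optimizer is most likely to consider an ordinary memset() redundant.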
     
  • Suppress the following warning displayed on building 32bit (i686) kernel.

    ===============================================================================
    ...
    CC [M] fs/btrfs/extent_io.o
    fs/btrfs/extent_io.c: In function ‘btrfs_free_io_failure_record’:
    fs/btrfs/extent_io.c:2193:13: warning: cast to pointer from integer of
    different size [-Wint-to-pointer-cast]
    failrec = (struct io_failure_record *)state->private;
    ...
    ===============================================================================

    Signed-off-by: Satoru Takeuchi
    Reported-by: Chris Murphy
    Signed-off-by: Chris Mason

    Satoru Takeuchi
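
The warning appears when an integer wider than a pointer is cast straight to a pointer type. A common way to silence it is to cast through a pointer-sized integer first. The sketch below assumes the stored field is a 64-bit integer and uses a hypothetical `struct record`; it uses `uintptr_t` from standard C as the intermediate type.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical record type standing in for the structure being stored. */
struct record {
    int value;
};

/* Store a pointer in a 64-bit integer field. */
uint64_t record_to_u64(struct record *r)
{
    return (uint64_t)(uintptr_t)r;
}

/*
 * Recover the pointer. Casting through uintptr_t first narrows the 64-bit
 * value to pointer width explicitly, which avoids -Wint-to-pointer-cast
 * on targets where pointers are 32 bits wide.
 */
struct record *u64_to_record(uint64_t v)
{
    assert(v <= UINTPTR_MAX); /* value must fit in a pointer */
    return (struct record *)(uintptr_t)v;
}
```

The round trip is lossless as long as the integer originally came from a pointer on the same machine.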
     
  • When removing a block group we were deleting it from its space_info's
    ro_bgs list without the correct protection - the space info's spinlock.
    Fix this by doing the list delete while holding the spinlock of the
    corresponding space info, which is the correct lock for any operation
    on that list.

    This issue was introduced in the 3.19 kernel by the following change:

    Btrfs: move read only block groups onto their own list V2
    commit 633c0aad4c0243a506a3e8590551085ad78af82d

    I ran into a kernel crash while a task was running statfs, which iterates
    the space_info->ro_bgs list while holding the space info's spinlock,
    and another task was deleting it from the same list, without holding that
    spinlock, as part of the block group remove operation (while running the
    function btrfs_remove_block_group). This happened often when running the
    stress test xfstests/generic/038 I recently made.

    Signed-off-by: Filipe Manana
    Signed-off-by: Chris Mason

    Filipe Manana
     
  • The address that should be freed is not 'ppath' but 'path'.

    Signed-off-by: Tsutomu Itoh
    Reviewed-by: Miao Xie
    Signed-off-by: Chris Mason

    Tsutomu Itoh