13 Apr, 2014

1 commit

  • Pull vfs updates from Al Viro:
    "The first vfs pile, with deep apologies for being very late in this
    window.

    Assorted cleanups and fixes, plus a large preparatory part of iov_iter
    work. There's a lot more of that, but it'll probably go into the next
    merge window - it *does* shape up nicely, removes a lot of
    boilerplate, gets rid of locking inconsistencies between aio_write and
    splice_write and I hope to get Kent's direct-io rewrite merged into
    the same queue, but some of the stuff after this point is having
    (mostly trivial) conflicts with the things already merged into
    mainline and with some I want more testing.

    This one passes LTP and xfstests without regressions, in addition to
    usual beating. BTW, readahead02 in ltp syscalls testsuite has started
    giving failures since "mm/readahead.c: fix readahead failure for
    memoryless NUMA nodes and limit readahead pages" - might be a false
    positive, might be a real regression..."

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (63 commits)
    missing bits of "splice: fix racy pipe->buffers uses"
    cifs: fix the race in cifs_writev()
    ceph_sync_{,direct_}write: fix an oops on ceph_osdc_new_request() failure
    kill generic_file_buffered_write()
    ocfs2_file_aio_write(): switch to generic_perform_write()
    ceph_aio_write(): switch to generic_perform_write()
    xfs_file_buffered_aio_write(): switch to generic_perform_write()
    export generic_perform_write(), start getting rid of generic_file_buffer_write()
    generic_file_direct_write(): get rid of ppos argument
    btrfs_file_aio_write(): get rid of ppos
    kill the 5th argument of generic_file_buffered_write()
    kill the 4th argument of __generic_file_aio_write()
    lustre: don't open-code kernel_recvmsg()
    ocfs2: don't open-code kernel_recvmsg()
    drbd: don't open-code kernel_recvmsg()
    constify blk_rq_map_user_iov() and friends
    lustre: switch to kernel_sendmsg()
    ocfs2: don't open-code kernel_sendmsg()
    take iov_iter stuff to mm/iov_iter.c
    process_vm_access: tidy up a bit
    ...

    Linus Torvalds
     

12 Apr, 2014

1 commit

  • Pull second set of btrfs updates from Chris Mason:
    "The most important changes here are from Josef, fixing a btrfs
    regression in 3.14 that can cause corruptions in the extent allocation
    tree when snapshots are in use.

    Josef also fixed some deadlocks in send/recv and other assorted races
    when balance is running"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (23 commits)
    Btrfs: fix compile warnings on on avr32 platform
    btrfs: allow mounting btrfs subvolumes with different ro/rw options
    btrfs: export global block reserve size as space_info
    btrfs: fix crash in remount(thread_pool=) case
    Btrfs: abort the transaction when we don't find our extent ref
    Btrfs: fix EINVAL checks in btrfs_clone
    Btrfs: fix unlock in __start_delalloc_inodes()
    Btrfs: scrub raid56 stripes in the right way
    Btrfs: don't compress for a small write
    Btrfs: more efficient io tree navigation on wait_extent_bit
    Btrfs: send, build path string only once in send_hole
    btrfs: filter invalid arg for btrfs resize
    Btrfs: send, fix data corruption due to incorrect hole detection
    Btrfs: kmalloc() doesn't return an ERR_PTR
    Btrfs: fix snapshot vs nocow writting
    btrfs: Change the expanding write sequence to fix snapshot related bug.
    btrfs: make device scan less noisy
    btrfs: fix lockdep warning with reclaim lock inversion
    Btrfs: hold the commit_root_sem when getting the commit root during send
    Btrfs: remove transaction from send
    ...

    Linus Torvalds
     

11 Apr, 2014

2 commits

  • fs/btrfs/scrub.c: In function 'get_raid56_logic_offset':
    fs/btrfs/scrub.c:2269: warning: comparison of distinct pointer types lacks a cast
    fs/btrfs/scrub.c:2269: warning: right shift count >= width of type
    fs/btrfs/scrub.c:2269: warning: passing argument 1 of '__div64_32' from incompatible pointer type

    Since @rot is an int type, we should not use do_div(), fix it.
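
    As a rough illustration of the point above, the following user-space
    sketch contrasts a do_div()-style helper, which expects a 64-bit
    dividend, with ordinary integer arithmetic on an int. The names
    sketch_do_div and rotate_index are hypothetical and this is not the code
    from the patch.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical user-space stand-in for a do_div()-style helper: divides a
 * 64-bit value in place and returns the remainder.
 */
static uint32_t sketch_do_div(uint64_t *n, uint32_t base)
{
    uint32_t rem = (uint32_t)(*n % base);

    *n /= base;
    return rem;
}

/* For a plain int, the ordinary C operators are sufficient. */
static int rotate_index(int stripe_nr, int num_stripes)
{
    assert(num_stripes > 0);
    return stripe_nr % num_stripes;
}
```

    The 64-bit helper only makes sense when the dividend really is 64 bits
    wide; for an int value the plain % and / operators avoid the type
    mismatch that the warnings complain about.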

    Reported-by: kbuild test robot
    Signed-off-by: Wang Shilong
    Signed-off-by: Chris Mason

    Wang Shilong
     
  • Given the following /etc/fstab entries:

    /dev/sda3 /mnt/foo btrfs subvol=foo,ro 0 0
    /dev/sda3 /mnt/bar btrfs subvol=bar,rw 0 0

    you can't issue:

    $ mount /mnt/foo
    $ mount /mnt/bar

    You would have to do:

    $ mount /mnt/foo
    $ mount -o remount,rw /mnt/foo
    $ mount --bind -o remount,ro /mnt/foo
    $ mount /mnt/bar

    or

    $ mount /mnt/bar
    $ mount --rw /mnt/foo
    $ mount --bind -o remount,ro /mnt/foo

    With this patch you can do

    $ mount /mnt/foo
    $ mount /mnt/bar

    $ cat /proc/self/mountinfo
    49 33 0:41 /foo /mnt/foo ro,relatime shared:36 - btrfs /dev/sda3 rw,ssd,space_cache
    87 33 0:41 /bar /mnt/bar rw,relatime shared:74 - btrfs /dev/sda3 rw,ssd,space_cache

    Signed-off-by: Chris Mason

    Harald Hoyer
     

08 Apr, 2014

18 commits

  • filemap_map_pages() is a generic implementation of ->map_pages() for
    filesystems that use the page cache.

    It should be safe to use filemap_map_pages() for ->map_pages() if the
    filesystem uses filemap_fault() for ->fault().

    Signed-off-by: Kirill A. Shutemov
    Acked-by: Linus Torvalds
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Andi Kleen
    Cc: Matthew Wilcox
    Cc: Dave Hansen
    Cc: Alexander Viro
    Cc: Dave Chinner
    Cc: Ning Qu
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Introduce a block group type bit for a global reserve and fill the space
    info for SPACE_INFO ioctl. This should replace the newly added ioctl
    (01e219e8069516cdb98594d417b8bb8d906ed30d) to get just the 'size' part
    of the global reserve, while the actual usage can be now visible in the
    'btrfs fi df' output during ENOSPC stress.

    The unpatched userspace tools will show the blockgroup as 'unknown'.

    CC: Jeff Mahoney
    CC: Josef Bacik
    Signed-off-by: David Sterba
    Signed-off-by: Chris Mason

    David Sterba
     
  • Reproducer:
    mount /dev/ubda /mnt
    mount -oremount,thread_pool=42 /mnt

    Gives a crash:
    ? btrfs_workqueue_set_max+0x0/0x70
    btrfs_resize_thread_pool+0xe3/0xf0
    ? sync_filesystem+0x0/0xc0
    ? btrfs_resize_thread_pool+0x0/0xf0
    btrfs_remount+0x1d2/0x570
    ? kern_path+0x0/0x80
    do_remount_sb+0xd9/0x1c0
    do_mount+0x26a/0xbf0
    ? kfree+0x0/0x1b0
    SyS_mount+0xc4/0x110

    It's a call
    btrfs_workqueue_set_max(fs_info->scrub_wr_completion_workers, new_pool_size);
    with
    fs_info->scrub_wr_completion_workers = NULL;

    as scrub wqs get created only on user's demand.

    Patch skips not-created-yet workqueues.
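
    A minimal sketch of the idea, assuming the guard lives in the setter
    itself (the real patch may place the check elsewhere). The struct
    sketch_wq type and the sketch_wq_set_max() function are hypothetical
    stand-ins, not the btrfs workqueue API.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a workqueue with a concurrency limit. */
struct sketch_wq {
    int max_active;
};

/*
 * Applies a new limit. Returns 1 if the limit was applied and 0 if the
 * queue has not been created yet, in which case it is simply skipped.
 */
static int sketch_wq_set_max(struct sketch_wq *wq, int max)
{
    if (wq == NULL)
        return 0;
    wq->max_active = max;
    return 1;
}
```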

    Signed-off-by: Sergei Trofimovich
    CC: Qu Wenruo
    CC: Chris Mason
    CC: Josef Bacik
    CC: linux-btrfs@vger.kernel.org
    Signed-off-by: Chris Mason

    Sergei Trofimovich
     
  • I'm not sure why we weren't aborting here in the first place, it is obviously a
    bad time from the fact that we print the leaf and yell loudly about it. Fix
    this up, otherwise we panic because our path could be pointing into oblivion.
    Thanks,

    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik
     
  • btrfs_drop_extents can now return -EINVAL, but only one caller
    in btrfs_clone was checking for it. This adds it to the
    caller for inline extents, which is where we really need it.

    Signed-off-by: Chris Mason

    Chris Mason
     
  • This patch fixes a regression caused by the following patch:
    Btrfs: don't flush all delalloc inodes when we doesn't get s_umount lock

    Breaking out of the while loop makes us call @spin_unlock() without
    having called @spin_lock() first; fix it.
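
    The pattern can be sketched in user space with a toy lock that only
    counts holders. This is an illustration of keeping lock and unlock calls
    paired across an early loop exit, under the assumption that the loop
    drops the lock around each unit of work; it is not the actual btrfs code,
    and every name in it is hypothetical.

```c
#include <assert.h>

/* Hypothetical toy lock: 'held' counts acquisitions minus releases. */
struct sketch_lock {
    int held;
};

static void sketch_lock_acquire(struct sketch_lock *l) { l->held++; }
static void sketch_lock_release(struct sketch_lock *l) { l->held--; }

/*
 * Processes up to nr items, dropping the lock around each unit of work.
 * stop_at is the index at which the loop exits early (-1 for never).
 * Returns the lock depth after the final release; 0 means balanced.
 */
static int sketch_flush(struct sketch_lock *l, int nr, int stop_at)
{
    int i;

    sketch_lock_acquire(l);
    for (i = 0; i < nr; i++) {
        sketch_lock_release(l);          /* the work happens unlocked */
        if (i == stop_at) {
            /* Early exit: retake the lock so the release below pairs up. */
            sketch_lock_acquire(l);
            break;
        }
        sketch_lock_acquire(l);
    }
    sketch_lock_release(l);
    return l->held;
}
```

    Without the acquire before the break, the final release would run with
    the lock already dropped and the function would return -1.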

    Signed-off-by: Wang Shilong
    Reviewed-by: David Sterba
    Signed-off-by: Chris Mason

    Wang Shilong
     
  • Steps to reproduce:
    # mkfs.btrfs -f /dev/sda[8-11] -m raid5 -d raid5
    # mount /dev/sda8 /mnt
    # btrfs scrub start -BR /mnt
    # echo $?
    Signed-off-by: Chris Mason

    Wang Shilong
     
  • To compress a small file range(
    Signed-off-by: Chris Mason

    Wang Shilong
     
  • If we don't reschedule, use rb_next to find the next extent state
    instead of a full tree search. This is more efficient, and it is safe
    because we didn't release the io tree's lock.

    Signed-off-by: Filipe David Borba Manana
    Signed-off-by: Chris Mason

    Filipe Manana
     
  • There's no point building the path string in each iteration of the
    send_hole loop, as it always produces the same string.

    Signed-off-by: Filipe David Borba Manana
    Signed-off-by: Chris Mason

    Filipe Manana
     
  • Originally, the following commands would be accepted:
    # btrfs fi resize -10A
    # btrfs fi resize -10Gaha
    Filter the arg by checking the return pointer of memparse.
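
    The check can be sketched in user space with strtoull(), which reports
    where parsing stopped in the same way memparse() does through its return
    pointer. sketch_memparse() below is a simplified, hypothetical analogue
    that only understands K, M and G suffixes; it is not the kernel function.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Simplified analogue of a memparse()-style parser: reads a number with an
 * optional K/M/G suffix and reports where parsing stopped via 'end'.
 */
static uint64_t sketch_memparse(const char *s, char **end)
{
    uint64_t v = strtoull(s, end, 0);

    switch (**end) {
    case 'G': case 'g': v <<= 10; /* fall through */
    case 'M': case 'm': v <<= 10; /* fall through */
    case 'K': case 'k': v <<= 10; (*end)++; break;
    default: break;
    }
    return v;
}

/* Returns 1 and stores the size if the whole string was consumed, else 0. */
static int sketch_parse_size(const char *s, uint64_t *out)
{
    char *end;
    uint64_t v = sketch_memparse(s, &end);

    if (end == s || *end != '\0')
        return 0;   /* nothing parsed, or trailing garbage such as "A" or "aha" */
    *out = v;
    return 1;
}
```

    With this check, "10G" is accepted while "10A" and "10Gaha" are rejected
    because characters remain after the parsed size.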

    Signed-off-by: Gui Hecheng
    Signed-off-by: Chris Mason

    Gui Hecheng
     
  • During an incremental send, when we finish processing an inode (corresponding to
    a regular file) we would assume the gap between the end of the last processed file
    extent and the file's size corresponded to a file hole, and therefore incorrectly
    send a bunch of zero bytes to overwrite that region in the file.

    This affects only kernel 3.14.

    Reproducer:

    mkfs.btrfs -f /dev/sdc
    mount /dev/sdc /mnt

    xfs_io -f -c "falloc -k 0 268435456" /mnt/foo

    btrfs subvolume snapshot -r /mnt /mnt/mysnap0

    xfs_io -c "pwrite -S 0x01 -b 9216 16190218 9216" /mnt/foo
    xfs_io -c "pwrite -S 0x02 -b 1121 198720104 1121" /mnt/foo
    xfs_io -c "pwrite -S 0x05 -b 9216 107887439 9216" /mnt/foo
    xfs_io -c "pwrite -S 0x06 -b 9216 225520207 9216" /mnt/foo
    xfs_io -c "pwrite -S 0x07 -b 67584 102138300 67584" /mnt/foo
    xfs_io -c "pwrite -S 0x08 -b 7000 94897484 7000" /mnt/foo
    xfs_io -c "pwrite -S 0x09 -b 113664 245083212 113664" /mnt/foo
    xfs_io -c "pwrite -S 0x10 -b 123 17937788 123" /mnt/foo
    xfs_io -c "pwrite -S 0x11 -b 39936 229573311 39936" /mnt/foo
    xfs_io -c "pwrite -S 0x12 -b 67584 174792222 67584" /mnt/foo
    xfs_io -c "pwrite -S 0x13 -b 9216 249253213 9216" /mnt/foo
    xfs_io -c "pwrite -S 0x16 -b 67584 150046083 67584" /mnt/foo
    xfs_io -c "pwrite -S 0x17 -b 39936 118246040 39936" /mnt/foo
    xfs_io -c "pwrite -S 0x18 -b 67584 215965442 67584" /mnt/foo
    xfs_io -c "pwrite -S 0x19 -b 33792 97096725 33792" /mnt/foo
    xfs_io -c "pwrite -S 0x20 -b 125952 166300596 125952" /mnt/foo
    xfs_io -c "pwrite -S 0x21 -b 123 1078957 123" /mnt/foo
    xfs_io -c "pwrite -S 0x25 -b 9216 212044492 9216" /mnt/foo
    xfs_io -c "pwrite -S 0x26 -b 7000 265037146 7000" /mnt/foo
    xfs_io -c "pwrite -S 0x27 -b 42757 215922685 42757" /mnt/foo
    xfs_io -c "pwrite -S 0x28 -b 7000 69865411 7000" /mnt/foo
    xfs_io -c "pwrite -S 0x29 -b 67584 67948958 67584" /mnt/foo
    xfs_io -c "pwrite -S 0x30 -b 39936 266967019 39936" /mnt/foo
    xfs_io -c "pwrite -S 0x31 -b 1121 19582453 1121" /mnt/foo
    xfs_io -c "pwrite -S 0x32 -b 17408 257710255 17408" /mnt/foo
    xfs_io -c "pwrite -S 0x33 -b 39936 3895518 39936" /mnt/foo
    xfs_io -c "pwrite -S 0x34 -b 125952 12045847 125952" /mnt/foo
    xfs_io -c "pwrite -S 0x35 -b 17408 19156379 17408" /mnt/foo
    xfs_io -c "pwrite -S 0x36 -b 39936 50160066 39936" /mnt/foo
    xfs_io -c "pwrite -S 0x37 -b 113664 9549793 113664" /mnt/foo
    xfs_io -c "pwrite -S 0x38 -b 105472 94391506 105472" /mnt/foo
    xfs_io -c "pwrite -S 0x39 -b 23552 143632863 23552" /mnt/foo
    xfs_io -c "pwrite -S 0x40 -b 39936 241283845 39936" /mnt/foo
    xfs_io -c "pwrite -S 0x41 -b 113664 199937606 113664" /mnt/foo
    xfs_io -c "pwrite -S 0x42 -b 67584 67380093 67584" /mnt/foo
    xfs_io -c "pwrite -S 0x43 -b 67584 26793129 67584" /mnt/foo
    xfs_io -c "pwrite -S 0x44 -b 39936 14421913 39936" /mnt/foo
    xfs_io -c "pwrite -S 0x45 -b 123 253097405 123" /mnt/foo
    xfs_io -c "pwrite -S 0x46 -b 1121 128233424 1121" /mnt/foo
    xfs_io -c "pwrite -S 0x47 -b 105472 91577959 105472" /mnt/foo
    xfs_io -c "pwrite -S 0x48 -b 1121 7245381 1121" /mnt/foo
    xfs_io -c "pwrite -S 0x49 -b 113664 182414694 113664" /mnt/foo
    xfs_io -c "pwrite -S 0x50 -b 9216 32750608 9216" /mnt/foo
    xfs_io -c "pwrite -S 0x51 -b 67584 266546049 67584" /mnt/foo
    xfs_io -c "pwrite -S 0x52 -b 67584 87969398 67584" /mnt/foo
    xfs_io -c "pwrite -S 0x53 -b 9216 260848797 9216" /mnt/foo
    xfs_io -c "pwrite -S 0x54 -b 39936 119461243 39936" /mnt/foo
    xfs_io -c "pwrite -S 0x55 -b 7000 200178693 7000" /mnt/foo
    xfs_io -c "pwrite -S 0x56 -b 9216 243316029 9216" /mnt/foo
    xfs_io -c "pwrite -S 0x57 -b 7000 209658229 7000" /mnt/foo
    xfs_io -c "pwrite -S 0x58 -b 101376 179745192 101376" /mnt/foo
    xfs_io -c "pwrite -S 0x59 -b 9216 64012300 9216" /mnt/foo
    xfs_io -c "pwrite -S 0x60 -b 125952 181705139 125952" /mnt/foo
    xfs_io -c "pwrite -S 0x61 -b 23552 235737348 23552" /mnt/foo
    xfs_io -c "pwrite -S 0x62 -b 113664 106021355 113664" /mnt/foo
    xfs_io -c "pwrite -S 0x63 -b 67584 135753552 67584" /mnt/foo
    xfs_io -c "pwrite -S 0x64 -b 23552 95730888 23552" /mnt/foo
    xfs_io -c "pwrite -S 0x65 -b 11 17311415 11" /mnt/foo
    xfs_io -c "pwrite -S 0x66 -b 33792 120695553 33792" /mnt/foo
    xfs_io -c "pwrite -S 0x67 -b 9216 17164631 9216" /mnt/foo
    xfs_io -c "pwrite -S 0x68 -b 9216 136065853 9216" /mnt/foo
    xfs_io -c "pwrite -S 0x69 -b 67584 37752198 67584" /mnt/foo
    xfs_io -c "pwrite -S 0x70 -b 101376 189717473 101376" /mnt/foo
    xfs_io -c "pwrite -S 0x71 -b 7000 227463698 7000" /mnt/foo
    xfs_io -c "pwrite -S 0x72 -b 9216 12655137 9216" /mnt/foo
    xfs_io -c "pwrite -S 0x73 -b 7000 7488866 7000" /mnt/foo
    xfs_io -c "pwrite -S 0x74 -b 113664 87813649 113664" /mnt/foo
    xfs_io -c "pwrite -S 0x75 -b 33792 25802183 33792" /mnt/foo
    xfs_io -c "pwrite -S 0x76 -b 39936 93524024 39936" /mnt/foo
    xfs_io -c "pwrite -S 0x77 -b 33792 113336388 33792" /mnt/foo
    xfs_io -c "pwrite -S 0x78 -b 105472 184955320 105472" /mnt/foo
    xfs_io -c "pwrite -S 0x79 -b 101376 225691598 101376" /mnt/foo
    xfs_io -c "pwrite -S 0x80 -b 23552 77023155 23552" /mnt/foo
    xfs_io -c "pwrite -S 0x81 -b 11 201888192 11" /mnt/foo
    xfs_io -c "pwrite -S 0x82 -b 11 115332492 11" /mnt/foo
    xfs_io -c "pwrite -S 0x83 -b 67584 230278015 67584" /mnt/foo
    xfs_io -c "pwrite -S 0x84 -b 11 120589073 11" /mnt/foo
    xfs_io -c "pwrite -S 0x85 -b 125952 202207819 125952" /mnt/foo
    xfs_io -c "pwrite -S 0x86 -b 113664 86672080 113664" /mnt/foo
    xfs_io -c "pwrite -S 0x87 -b 17408 208459603 17408" /mnt/foo
    xfs_io -c "pwrite -S 0x88 -b 7000 73372211 7000" /mnt/foo
    xfs_io -c "pwrite -S 0x89 -b 7000 42252122 7000" /mnt/foo
    xfs_io -c "pwrite -S 0x90 -b 23552 46784881 23552" /mnt/foo
    xfs_io -c "pwrite -S 0x91 -b 101376 63172351 101376" /mnt/foo
    xfs_io -c "pwrite -S 0x92 -b 23552 59341931 23552" /mnt/foo
    xfs_io -c "pwrite -S 0x93 -b 39936 239599283 39936" /mnt/foo
    xfs_io -c "pwrite -S 0x94 -b 67584 175643105 67584" /mnt/foo
    xfs_io -c "pwrite -S 0x97 -b 23552 105534880 23552" /mnt/foo
    xfs_io -c "pwrite -S 0x98 -b 113664 8236844 113664" /mnt/foo
    xfs_io -c "pwrite -S 0x99 -b 125952 144489686 125952" /mnt/foo
    xfs_io -c "pwrite -S 0xa0 -b 7000 73273112 7000" /mnt/foo
    xfs_io -c "pwrite -S 0xa1 -b 125952 194580243 125952" /mnt/foo
    xfs_io -c "pwrite -S 0xa2 -b 123 56296779 123" /mnt/foo
    xfs_io -c "pwrite -S 0xa3 -b 11 233066845 11" /mnt/foo
    xfs_io -c "pwrite -S 0xa4 -b 39936 197727090 39936" /mnt/foo
    xfs_io -c "pwrite -S 0xa5 -b 101376 53579812 101376" /mnt/foo
    xfs_io -c "pwrite -S 0xa6 -b 9216 85669738 9216" /mnt/foo
    xfs_io -c "pwrite -S 0xa7 -b 125952 21266322 125952" /mnt/foo
    xfs_io -c "pwrite -S 0xa8 -b 23552 125726568 23552" /mnt/foo
    xfs_io -c "pwrite -S 0xa9 -b 9216 18423680 9216" /mnt/foo
    xfs_io -c "pwrite -S 0xb0 -b 1121 165901483 1121" /mnt/foo

    btrfs subvolume snapshot -r /mnt /mnt/mysnap1

    xfs_io -c "pwrite -S 0xff -b 10 16190218 10" /mnt/foo

    btrfs subvolume snapshot -r /mnt /mnt/mysnap2

    md5sum /mnt/foo # returns 79e53f1466bfc09fd82b450689e6119e
    md5sum /mnt/mysnap2/foo # returns 79e53f1466bfc09fd82b450689e6119e too

    btrfs send /mnt/mysnap1 -f /tmp/1.snap
    btrfs send -p /mnt/mysnap1 /mnt/mysnap2 -f /tmp/2.snap

    mkfs.btrfs -f /dev/sdc
    mount /dev/sdc /mnt

    btrfs receive /mnt -f /tmp/1.snap
    btrfs receive /mnt -f /tmp/2.snap

    md5sum /mnt/mysnap2/foo # returns 2bb414c5155767cedccd7063e51beabd !!

    A testcase for xfstests follows soon too.

    Signed-off-by: Filipe David Borba Manana
    Signed-off-by: Chris Mason

    Filipe Manana
     
  • The error handling was copied and pasted from memdup_user(). It should be
    checking for NULL obviously.
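
    To illustrate why the two checks are not interchangeable, here is a
    user-space sketch of an IS_ERR()-style test next to a plain NULL test.
    The sketch_ names and the SKETCH_MAX_ERRNO constant are hypothetical
    stand-ins chosen for the example.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define SKETCH_MAX_ERRNO 4095

/*
 * IS_ERR()-style check: true only for pointers in the top SKETCH_MAX_ERRNO
 * bytes of the address space. NULL is not in that range.
 */
static int sketch_is_err(const void *p)
{
    return (uintptr_t)p >= (uintptr_t)-SKETCH_MAX_ERRNO;
}

/* Correct check for an allocator that signals failure by returning NULL. */
static int sketch_alloc_failed(const void *p)
{
    return p == NULL;
}
```

    An error-pointer test applied to the result of a NULL-returning allocator
    never fires, so a failed allocation would go unnoticed and be
    dereferenced later.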

    Fixes: abccd00f8af2 ('btrfs: Fix 32/64-bit problem with BTRFS_SET_RECEIVED_SUBVOL ioctl')
    Signed-off-by: Dan Carpenter
    Signed-off-by: Chris Mason

    Dan Carpenter
     
  • While running fsstress and snapshots concurrently, we can hit something
    like the following:

    Thread 1                        Thread 2

    |->fallocate
    |->write pages
    |->join transaction
    |->add ordered extent
    |->end transaction
                                    |->flushing data
                                    |->creating pending snapshots
    |->write data into src root's
       fallocated space

    After the above work flows have finished, the source root and the
    snapshot root share the same space, but the source root has written data
    into the fallocated space. This makes fsck fail to verify checksums for
    the snapshot root's preallocated file extent data. Nocow writing has
    the same problem.

    Fix this problem by synchronizing snapshots with nocow writing:

    1. For a nocow write, if there are pending snapshots, we fall back to
       the COW path.

    2. If there are pending nocow writes, snapshots for this root are
       blocked until the nocow writes finish.

    Reported-by: Gui Hecheng
    Signed-off-by: Wang Shilong
    Signed-off-by: Chris Mason

    Wang Shilong
     
  • When running fsstress with snapshots being created in the background,
    some snapshots show the following problem.

    Snapshot 270:
    inode 323: size 0

    Snapshot 271:
    inode 323: size 349145
    |-------Hole---|---------Empty gap-------|-------Hole-----|
    0            122880                   172032         349145

    Snapshot 272:
    inode 323: size 349145
    |-------Hole---|------------Data---------|-------Hole-----|
    0            122880                   172032         349145

    The fsstress operation on inode 323 is the following:
    write: offset 126832 len 43124
    truncate: size 349145

    Since a write at an offset consists of two operations:
    1. punch hole
    2. write data
    Hole punching is faster than writing data, so the hole punching for the
    write and the truncate completes first and the buffered write follows.
    Snapshot 271 was taken in between and so got an empty gap, which does not
    pass btrfsck.

    To fix the bug, this patch changes the write sequence so that it
    first punches a hole covering the write end if a hole is needed.

    Reported-by: Gui Hecheng
    Signed-off-by: Qu Wenruo
    Signed-off-by: Chris Mason

    Qu Wenruo
     
  • Print the message only when the device is seen for the first time.

    Signed-off-by: David Sterba
    Signed-off-by: Chris Mason

    David Sterba
     
  • When encountering memory pressure, testers have run into the following
    lockdep warning. It was caused by __link_block_group calling kobject_add
    with the groups_sem held. kobject_add calls kvasprintf with GFP_KERNEL,
    which gets us into reclaim context. The kobject doesn't actually need
    to be added under the lock -- it just needs to ensure that it's only
    added for the first block group to be linked.

    =========================================================
    [ INFO: possible irq lock inversion dependency detected ]
    3.14.0-rc8-default #1 Not tainted
    ---------------------------------------------------------
    kswapd0/169 just changed the state of lock:
    (&delayed_node->mutex){+.+.-.}, at: [] __btrfs_release_delayed_node+0x3a/0x200 [btrfs]
    but this lock took another, RECLAIM_FS-unsafe lock in the past:
    (&found->groups_sem){+++++.}

    and interrupts could create inverse lock ordering between them.

    other info that might help us debug this:
    Possible interrupt unsafe locking scenario:

           CPU0                    CPU1
           ----                    ----
      lock(&found->groups_sem);
                                   local_irq_disable();
                                   lock(&delayed_node->mutex);
                                   lock(&found->groups_sem);
      <Interrupt>
        lock(&delayed_node->mutex);

    *** DEADLOCK ***
    2 locks held by kswapd0/169:
    #0: (shrinker_rwsem){++++..}, at: [] shrink_slab+0x3a/0x160
    #1: (&type->s_umount_key#27){++++..}, at: [] grab_super_passive+0x3f/0x90

    Signed-off-by: Jeff Mahoney
    Signed-off-by: Chris Mason

    Jeff Mahoney
     
  • We currently rely too heavily on roots being read-only to save us from just
    accessing root->commit_root. We can easily balance blocks out from underneath a
    read only root, so to save us from getting screwed make sure we only access
    root->commit_root under the commit root sem. Thanks,

    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik
     

07 Apr, 2014

4 commits

  • Let's try this again. We can deadlock the box if we send on a box and try to
    write onto the same fs with the app that is trying to listen to the send pipe.
    This is because the writer could get stuck waiting for a transaction commit
    which is being blocked by the send. So fix this by making sure looking at the
    commit roots is always going to be consistent. We do this by keeping track of
    which roots need to have their commit roots swapped during commit, and then
    taking the commit_root_sem and swapping them all at once. Then make sure we
    take a read lock on the commit_root_sem in cases where we search the commit root
    to make sure we're always looking at a consistent view of the commit roots.
    Previously we had problems with this because we would swap a fs tree commit root
    and then swap the extent tree commit root independently which would cause the
    backref walking code to screw up sometimes. With this patch we no longer
    deadlock and pass all the weird send/receive corner cases. Thanks,

    Reported-by: Hugo Mills
    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik
     
  • So I have an awful exercise script that will run snapshot, balance and
    send/receive in parallel. This sometimes would crash spectacularly and when it
    came back up the fs would be completely hosed. Turns out this is because of a
    bad interaction of balance and send/receive. Send will hold onto its entire
    path for the whole send, but its blocks could get relocated out from underneath
    it, and because it doesn't hold tree locks there's nothing to keep this from
    happening. So it will go to read in a slot with an old transid, and we could
    have re-allocated this block for something else and it could have a completely
    different transid. But because we think it is invalid we clear uptodate and
    re-read in the block. If we do this before we actually write out the new block
    we could write back stale data to the fs, and boom we're screwed.

    Now we definitely need to fix this disconnect between send and balance, but we
    really really need to not allow ourselves to accidentally read in stale data over
    new data. So make sure we check if the extent buffer is not under io before
    clearing uptodate, this will kick back EIO to the caller instead of reading in
    stale data and keep us from corrupting the fs. Thanks,

    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik
     
  • We could have possibly added an extent_op to the locked_ref while we dropped
    locked_ref->lock, so check for this case as well and loop around. Otherwise we
    could lose flag updates which would lead to extent tree corruption. Thanks,

    cc: stable@vger.kernel.org
    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik
     
  • This was done to allow NO_COW to continue to be NO_COW after relocation but it
    is not right. When relocating we will convert blocks to FULL_BACKREF that we
    relocate. We can leave some of these full backref blocks behind if they are not
    cow'ed out during the relocation, like if we fail the relocation with ENOSPC and
    then just drop the reloc tree. Then when we go to cow the block again we won't
    lookup the extent flags because we won't think there has been a snapshot
    recently which means we will do our normal ref drop thing instead of adding back
    a tree ref and dropping the shared ref. This will cause btrfs_free_extent to
    blow up because it can't find the ref we are trying to free. This was found
    with my ref verifying tool. Thanks,

    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik
     

05 Apr, 2014

2 commits

  • Pull ext4 updates from Ted Ts'o:
    "Major changes for 3.14 include support for the newly added ZERO_RANGE
    and COLLAPSE_RANGE fallocate operations, and scalability improvements
    in the jbd2 layer and in xattr handling when the extended attributes
    spill over into an external block.

    Other than that, the usual clean ups and minor bug fixes"

    * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (42 commits)
    ext4: fix premature freeing of partial clusters split across leaf blocks
    ext4: remove unneeded test of ret variable
    ext4: fix comment typo
    ext4: make ext4_block_zero_page_range static
    ext4: atomically set inode->i_flags in ext4_set_inode_flags()
    ext4: optimize Hurd tests when reading/writing inodes
    ext4: kill i_version support for Hurd-castrated file systems
    ext4: each filesystem creates and uses its own mb_cache
    fs/mbcache.c: doucple the locking of local from global data
    fs/mbcache.c: change block and index hash chain to hlist_bl_node
    ext4: Introduce FALLOC_FL_ZERO_RANGE flag for fallocate
    ext4: refactor ext4_fallocate code
    ext4: Update inode i_size after the preallocation
    ext4: fix partial cluster handling for bigalloc file systems
    ext4: delete path dealloc code in ext4_ext_handle_uninitialized_extents
    ext4: only call sync_filesystm() when remounting read-only
    fs: push sync_filesystem() down to the file system's remount_fs()
    jbd2: improve error messages for inconsistent journal heads
    jbd2: minimize region locked by j_list_lock in jbd2_journal_forget()
    jbd2: minimize region locked by j_list_lock in journal_get_create_access()
    ...

    Linus Torvalds
     
  • Pull btrfs changes from Chris Mason:
    "This is a pretty long stream of bug fixes and performance fixes.

    Qu Wenruo has replaced the btrfs async threads with regular kernel
    workqueues. We'll keep an eye out for performance differences, but
    it's nice to be using more generic code for this.

    We still have some corruption fixes and other patches coming in for
    the merge window, but this batch is tested and ready to go"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (108 commits)
    Btrfs: fix a crash of clone with inline extents's split
    btrfs: fix uninit variable warning
    Btrfs: take into account total references when doing backref lookup
    Btrfs: part 2, fix incremental send's decision to delay a dir move/rename
    Btrfs: fix incremental send's decision to delay a dir move/rename
    Btrfs: remove unnecessary inode generation lookup in send
    Btrfs: fix race when updating existing ref head
    btrfs: Add trace for btrfs_workqueue alloc/destroy
    Btrfs: less fs tree lock contention when using autodefrag
    Btrfs: return EPERM when deleting a default subvolume
    Btrfs: add missing kfree in btrfs_destroy_workqueue
    Btrfs: cache extent states in defrag code path
    Btrfs: fix deadlock with nested trans handles
    Btrfs: fix possible empty list access when flushing the delalloc inodes
    Btrfs: split the global ordered extents mutex
    Btrfs: don't flush all delalloc inodes when we doesn't get s_umount lock
    Btrfs: reclaim delalloc metadata more aggressively
    Btrfs: remove unnecessary lock in may_commit_transaction()
    Btrfs: remove the unnecessary flush when preparing the pages
    Btrfs: just do dirty page flush for the inode with compression before direct IO
    ...

    Linus Torvalds
     

04 Apr, 2014

3 commits

  • Reclaim will be leaving shadow entries in the page cache radix tree upon
    evicting the real page. As those pages are found from the LRU, an
    iput() can lead to the inode being freed concurrently. At this point,
    reclaim must no longer install shadow pages because the inode freeing
    code needs to ensure the page tree is really empty.

    Add an address_space flag, AS_EXITING, that the inode freeing code sets
    under the tree lock before doing the final truncate. Reclaim will check
    for this flag before installing shadow pages.

    Signed-off-by: Johannes Weiner
    Reviewed-by: Rik van Riel
    Reviewed-by: Minchan Kim
    Cc: Andrea Arcangeli
    Cc: Bob Liu
    Cc: Christoph Hellwig
    Cc: Dave Chinner
    Cc: Greg Thelen
    Cc: Hugh Dickins
    Cc: Jan Kara
    Cc: KOSAKI Motohiro
    Cc: Luigi Semenzato
    Cc: Mel Gorman
    Cc: Metin Doslu
    Cc: Michel Lespinasse
    Cc: Ozgun Erdogan
    Cc: Peter Zijlstra
    Cc: Roman Gushchin
    Cc: Ryan Mallon
    Cc: Tejun Heo
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • shmem mappings already contain exceptional entries where swap slot
    information is remembered.

    To be able to store eviction information for regular page cache, prepare
    every site dealing with the radix trees directly to handle entries other
    than pages.

    The common lookup functions will filter out non-page entries and return
    NULL for page cache holes, just as before. But provide a raw version of
    the API which returns non-page entries as well, and switch shmem over to
    use it.
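
    The split between a filtered and a raw lookup can be sketched as follows.
    This is a user-space illustration over a tiny fixed array, assuming a
    hypothetical tag bit marks non-page entries; the types, the tag value and
    the function names are invented for the example and are not the page
    cache API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_EXCEPTIONAL 0x2UL   /* hypothetical tag bit for non-page entries */
#define SKETCH_SLOTS 4

struct sketch_page {
    int id;
};

struct sketch_mapping {
    void *slots[SKETCH_SLOTS];
};

static int sketch_is_exceptional(const void *entry)
{
    return ((uintptr_t)entry & SKETCH_EXCEPTIONAL) != 0;
}

/* Encodes a small value (e.g. eviction info) as a tagged, non-pointer entry. */
static void *sketch_make_exceptional(uintptr_t value)
{
    return (void *)((value << 2) | SKETCH_EXCEPTIONAL);
}

/* Raw lookup: returns whatever is stored, page or exceptional entry. */
static void *sketch_lookup_entry(struct sketch_mapping *m, size_t index)
{
    return index < SKETCH_SLOTS ? m->slots[index] : NULL;
}

/* Filtered lookup: exceptional entries look like holes to the caller. */
static struct sketch_page *sketch_lookup_page(struct sketch_mapping *m,
                                              size_t index)
{
    void *entry = sketch_lookup_entry(m, index);

    return sketch_is_exceptional(entry) ? NULL : entry;
}
```

    Callers that only care about pages keep using the filtered variant and
    see no behavioural change, while a caller that stores its own tagged
    entries uses the raw variant.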

    Signed-off-by: Johannes Weiner
    Reviewed-by: Rik van Riel
    Reviewed-by: Minchan Kim
    Cc: Andrea Arcangeli
    Cc: Bob Liu
    Cc: Christoph Hellwig
    Cc: Dave Chinner
    Cc: Greg Thelen
    Cc: Hugh Dickins
    Cc: Jan Kara
    Cc: KOSAKI Motohiro
    Cc: Luigi Semenzato
    Cc: Mel Gorman
    Cc: Metin Doslu
    Cc: Michel Lespinasse
    Cc: Ozgun Erdogan
    Cc: Peter Zijlstra
    Cc: Roman Gushchin
    Cc: Ryan Mallon
    Cc: Tejun Heo
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • We know that "ret > 0" is true here. These tests were left over from
    commit 02afc27faec9 ('direct-io: Handle O_(D)SYNC AIO') and aren't
    needed any more.

    Signed-off-by: Dan Carpenter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Carpenter
     

02 Apr, 2014

3 commits


22 Mar, 2014

5 commits

  • xfstests's btrfs/035 triggers a BUG_ON, which we use to detect the split
    of inline extents in __btrfs_drop_extents().

    For inline extents, we cannot duplicate another EXTENT_DATA item, because
    it breaks the rule of inline extents, that is, 'start offset' needs to be 0.

    We have set limitations for the source inode's compressed inline extents,
    because it needs to decompress and recompress. Now the destination inode's
    inline extents also need similar limitations.

    With this, xfstests btrfs/035 doesn't run into panic.

    Signed-off-by: Liu Bo
    Signed-off-by: Chris Mason

    Liu Bo
     
  • fs/btrfs/send.c:2926: warning: ‘entry’ may be used uninitialized in this
    function

    Signed-off-by: Chris Mason

    Chris Mason
     
  • I added an optimization for large files where we would stop searching for
    backrefs once we had looked at the number of references we currently had for
    this extent. This works great most of the time, but for snapshots that point to
    this extent and have changes in the original root, this assumption falls on its
    face. So keep track of any delayed ref mods made, add in the actual ref
    count as reported by the extent item, and use that to limit how far down an inode
    we'll search for extents. Thanks,

    Signed-off-by: Josef Bacik
    Reported-by: Hugo Mills
    Tested-by: Hugo Mills
    Signed-off-by: Chris Mason

    Josef Bacik
     
  • For an incremental send, fix the process of determining whether the directory
    inode we're currently processing needs to have its move/rename operation delayed.

    We were ignoring the fact that if the inode's new immediate ancestor has a higher
    inode number than ours but wasn't renamed/moved, we might still need to delay our
    move/rename, because some other ancestor directory higher in the hierarchy might
    have an inode number higher than ours *and* have been renamed/moved too - in this
    case we have to wait for the rename/move of that ancestor to happen before our
    current directory's rename/move operation.

    Simple steps to reproduce this issue:

    $ mkfs.btrfs -f /dev/sdd
    $ mount /dev/sdd /mnt

    $ mkdir -p /mnt/a/x1/x2
    $ mkdir /mnt/a/Z
    $ mkdir -p /mnt/a/x1/x2/x3/x4/x5

    $ btrfs subvolume snapshot -r /mnt /mnt/snap1
    $ btrfs send /mnt/snap1 -f /tmp/base.send

    $ mv /mnt/a/x1/x2/x3 /mnt/a/Z/X33
    $ mv /mnt/a/x1/x2 /mnt/a/Z/X33/x4/x5/X22

    $ btrfs subvolume snapshot -r /mnt /mnt/snap2
    $ btrfs send -p /mnt/snap1 /mnt/snap2 -f /tmp/incremental.send

    The incremental send caused the kernel code to enter an infinite loop when
    building the path string for directory Z after its references are processed.
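
    The reproducer above can be modelled with a toy Python sketch of the
    ancestor walk. The inode numbers are illustrative (assumed to follow
    creation order, with 0 standing in for the subvolume root), and
    must_delay_move() is a hypothetical helper, not the kernel's code.

```python
def must_delay_move(ino, new_parent, moved, root=0):
    """Return True if some ancestor of `ino` in the new snapshot has a
    higher inode number than `ino` and was itself renamed/moved."""
    ancestor = new_parent[ino]
    while ancestor != root:
        if ancestor > ino and ancestor in moved:
            return True
        ancestor = new_parent[ancestor]
    return False


# Illustrative numbering: a=1, x1=2, x2=3, Z=4, x3=5, x4=6, x5=7.
# Parents in the second snapshot, after the two mv commands:
new_parent = {1: 0, 2: 1, 4: 1, 5: 4, 6: 5, 7: 6, 3: 7}
moved = {5, 3}  # x3 and x2 were renamed/moved

x2_delayed = must_delay_move(3, new_parent, moved)
x3_delayed = must_delay_move(5, new_parent, moved)
```

    Here x2's new immediate parent x5 has a higher number but was not moved,
    so checking only the immediate parent would miss that x3, further up,
    has a higher number and was moved.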

    A more complex scenario:

    $ mkfs.btrfs -f /dev/sdd
    $ mount /dev/sdd /mnt

    $ mkdir -p /mnt/a/b/c/d
    $ mkdir /mnt/a/b/c/d/e
    $ mkdir /mnt/a/b/c/d/f
    $ mv /mnt/a/b/c/d/e /mnt/a/b/c/d/f/E2
    $ mkdir /mnt/a/b/c/g
    $ mv /mnt/a/b/c/d /mnt/a/b/D2

    $ btrfs subvolume snapshot -r /mnt /mnt/snap1
    $ btrfs send /mnt/snap1 -f /tmp/base.send

    $ mkdir /mnt/a/o
    $ mv /mnt/a/b/c/g /mnt/a/b/D2/f/G2
    $ mv /mnt/a/b/D2 /mnt/a/b/dd
    $ mv /mnt/a/b/c /mnt/a/C2
    $ mv /mnt/a/b/dd/f /mnt/a/o/FF
    $ mv /mnt/a/b /mnt/a/o/FF/E2/BB

    $ btrfs subvolume snapshot -r /mnt /mnt/snap2
    $ btrfs send -p /mnt/snap1 /mnt/snap2 -f /tmp/incremental.send

    A test case for xfstests follows.

    Signed-off-by: Filipe David Borba Manana
    Signed-off-by: Chris Mason

    Filipe Manana
     
  • It's possible to change the parent/child relationship between directories
    in such a way that if a child directory has a higher inode number than
    its parent, it doesn't necessarily mean the child rename/move operation
    can be performed immediately. The parent might have its own rename/move
    operation delayed, therefore in this case the child needs to have its
    rename/move operation delayed too, and be performed after its new parent's
    rename/move.

    Steps to reproduce the issue:

    $ umount /mnt
    $ mkfs.btrfs -f /dev/sdd
    $ mount /dev/sdd /mnt

    $ mkdir /mnt/A
    $ mkdir /mnt/B
    $ mkdir /mnt/C
    $ mv /mnt/C /mnt/A
    $ mv /mnt/B /mnt/A/C
    $ mkdir /mnt/A/C/D

    $ btrfs subvolume snapshot -r /mnt /mnt/snap1
    $ btrfs send /mnt/snap1 -f /tmp/base.send

    $ mv /mnt/A/C/D /mnt/A/D2
    $ mv /mnt/A/C/B /mnt/A/D2/B2
    $ mv /mnt/A/C /mnt/A/D2/B2/C2

    $ btrfs subvolume snapshot -r /mnt /mnt/snap2
    $ btrfs send -p /mnt/snap1 /mnt/snap2 -f /tmp/incremental.send

    The incremental send caused the kernel code to enter an infinite loop when
    building the path string for directory C after its references are processed.

    The necessary conditions here are that C has an inode number higher than both
    A and B, B has a higher inode number than A, and D has the highest
    inode number, that is:
    inode_number(A) < inode_number(B) < inode_number(C) < inode_number(D)
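
    A toy Python sketch of the ordering problem, assuming inodes are visited
    in increasing inode number and using illustrative numbers A=1, B=2, C=3,
    D=4 (0 stands in for the subvolume root). plan_moves() is a hypothetical
    helper that models only parent changes; it is not the kernel's algorithm.

```python
def plan_moves(old_parent, new_parent):
    """Return the order in which moves get applied in this toy model."""
    moved = {i for i in new_parent if old_parent.get(i) != new_parent[i]}
    done = []
    waiting = {}  # parent inode -> inodes waiting for that parent's move

    def apply(ino):
        done.append(ino)
        # Applying a move releases everything that was waiting on it.
        for child in waiting.pop(ino, []):
            apply(child)

    for ino in sorted(moved):
        parent = new_parent[ino]
        # Delay whenever the new parent's own move is still pending,
        # even if that parent has a lower inode number than ours.
        if parent in moved and parent not in done:
            waiting.setdefault(parent, []).append(ino)
        else:
            apply(ino)
    return done


snap1 = {1: 0, 3: 1, 2: 3, 4: 3}  # A, A/C, A/C/B, A/C/D
snap2 = {1: 0, 4: 1, 2: 4, 3: 2}  # A, A/D2, A/D2/B2, A/D2/B2/C2
order = plan_moves(snap1, snap2)
```

    In this model B waits for D, and C, although its new parent B has a lower
    inode number, has to wait for B's delayed move; the moves end up applied
    as D, then B, then C.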

    The same issue could happen if after the first snapshot there's any number
    of intermediary parent directories between A2 and B2, and between B2 and C2.

    A test case for xfstests follows, covering this simple case and more advanced
    ones, with files and hard links created inside the directories.

    Signed-off-by: Filipe David Borba Manana
    Signed-off-by: Chris Mason

    Filipe Manana
     

21 Mar, 2014

1 commit