30 Jun, 2017

1 commit

  • Introduce a new parameter, struct extent_changeset, for
    btrfs_qgroup_reserve_data() and its callers.

    The extent_changeset is used in btrfs_qgroup_reserve_data() to record
    which ranges were reserved by the current call, so that they can be freed
    in error paths.

    The reason we need to export it to callers is that, in the buffered write
    error path, without knowing exactly which range we reserved in the current
    allocation, we can free space which was not reserved by us.

    This will lead to qgroup reserved space underflow.

    Reviewed-by: Chandan Rajendra
    Signed-off-by: Qu Wenruo
    Signed-off-by: David Sterba

    Qu Wenruo
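
The changeset idea described in the commit above can be illustrated with a small userspace sketch. This is not the kernel implementation: `struct changeset`, `reserve_range()`, `free_changeset()` and the fixed-size block map are hypothetical stand-ins, chosen only to show why recording "what this call newly reserved" lets an error path avoid freeing someone else's reservation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_BLOCKS 16    /* hypothetical: size of the toy reservation map */

/* Hypothetical stand-in for struct extent_changeset: blocks THIS call reserved. */
struct changeset {
    bool newly_reserved[NR_BLOCKS];
    size_t blocks_changed;
};

/* Reserve [start, start + len); record only blocks that were not reserved before. */
static void reserve_range(bool reserved[NR_BLOCKS], struct changeset *cs,
                          size_t start, size_t len)
{
    assert(start + len <= NR_BLOCKS);
    for (size_t i = start; i < start + len; i++) {
        if (!reserved[i]) {
            reserved[i] = true;
            cs->newly_reserved[i] = true;
            cs->blocks_changed++;
        }
    }
}

/* Error path: release only what the changeset says this call reserved. */
static size_t free_changeset(bool reserved[NR_BLOCKS], struct changeset *cs)
{
    size_t freed = 0;

    for (size_t i = 0; i < NR_BLOCKS; i++) {
        if (cs->newly_reserved[i]) {
            reserved[i] = false;
            cs->newly_reserved[i] = false;
            freed++;
        }
    }
    cs->blocks_changed = 0;
    return freed;
}
```

If a second caller reserves a range that overlaps an earlier reservation and then fails, freeing through its changeset leaves the overlapping part reserved, which is the underflow the commit message is concerned with.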
     

28 Feb, 2017

1 commit


17 Feb, 2017

1 commit


06 Dec, 2016

3 commits


27 Sep, 2016

1 commit

  • For many printks, we want to know which file system issued the message.

    This patch converts most pr_* calls to use the btrfs_* versions instead.
    In some cases, this means adding plumbing to allow call sites access to
    an fs_info pointer.

    fs/btrfs/check-integrity.c is left alone for another day.

    Signed-off-by: Jeff Mahoney
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Jeff Mahoney
     

25 Aug, 2016

1 commit

  • This patch can fix some false ENOSPC errors. The test script below can
    reproduce one of them:
    #!/bin/bash
    dd if=/dev/zero of=fs.img bs=$((1024*1024)) count=128
    dev=$(losetup --show -f fs.img)
    mkfs.btrfs -f -M $dev
    mkdir /tmp/mntpoint
    mount $dev /tmp/mntpoint
    cd /tmp/mntpoint
    xfs_io -f -c "falloc 0 $((64*1024*1024))" testfile

    The above script fails with ENOSPC, although the fs still has enough free
    space to satisfy this request. Please see the call graph:
    btrfs_fallocate()
    |-> btrfs_alloc_data_chunk_ondemand()
    |       bytes_may_use += 64M
    |-> btrfs_prealloc_file_range()
        |-> btrfs_reserve_extent()
            |-> btrfs_add_reserved_bytes()
            |       alloc_type is RESERVE_ALLOC_NO_ACCOUNT, so it does not
            |       change bytes_may_use, and bytes_reserved += 64M. Now
            |       bytes_may_use + bytes_reserved == 128M, which is greater
            |       than btrfs_space_info's total_bytes, so a false ENOSPC
            |       occurs. Note, the bytes_may_use decrease operation will be
            |       done at the end of btrfs_fallocate(), which is too late.

    Here is another simple case for buffered write:
                     CPU 1                  |              CPU 2
                                            |
    |-> cow_file_range()                    |-> __btrfs_buffered_write()
        |-> btrfs_reserve_extent()          |   |
        |                                   |   |
        |                                   |   |
        |    .....                          |   |-> btrfs_check_data_free_space()
        |                                   |
        |                                   |
        |-> extent_clear_unlock_delalloc()  |

    On CPU 1, btrfs_reserve_extent()->find_free_extent()->
    btrfs_add_reserved_bytes() does not decrease bytes_may_use; the decrease
    is delayed until extent_clear_unlock_delalloc().
    Assume that in this case btrfs_reserve_extent() reserved 128MB of data and
    CPU 2's btrfs_check_data_free_space() tries to reserve 100MB of data space.
    If
    100MB > data_sinfo->total_bytes - data_sinfo->bytes_used -
    data_sinfo->bytes_reserved - data_sinfo->bytes_pinned -
    data_sinfo->bytes_readonly - data_sinfo->bytes_may_use
    btrfs_check_data_free_space() will try to allocate a new data chunk, call
    btrfs_start_delalloc_roots(), or commit the current transaction in order to
    reserve some free space, which is obviously a lot of work. But none of it
    is necessary: if bytes_may_use were decreased in time (by the 128M already
    reserved), we would still have free space.

    To fix this issue, this patch chooses to update bytes_may_use for both
    data and metadata in btrfs_add_reserved_bytes(). For the compress path, the
    real extent length may not be equal to the file content length, so introduce
    a ram_bytes argument for btrfs_reserve_extent(), find_free_extent() and
    btrfs_add_reserved_bytes(), because bytes_may_use is increased by the
    file content length. Then the compress path can update bytes_may_use
    correctly. Also now we can discard RESERVE_ALLOC_NO_ACCOUNT, RESERVE_ALLOC
    and RESERVE_FREE.

    As we know, usually EXTENT_DO_ACCOUNTING is used for error path. In
    run_delalloc_nocow(), for inode marked as NODATACOW or extent marked as
    PREALLOC, we also need to update bytes_may_use, but can not pass
    EXTENT_DO_ACCOUNTING, because it also clears metadata reservation, so
    here we introduce EXTENT_CLEAR_DATA_RESV flag to indicate btrfs_clear_bit_hook()
    to update btrfs_space_info's bytes_may_use.

    Meanwhile __btrfs_prealloc_file_range() will call
    btrfs_free_reserved_data_space() internally for both the successful and
    the failed path, so btrfs_prealloc_file_range()'s callers do not need to
    call btrfs_free_reserved_data_space() any more.

    Signed-off-by: Wang Xiaoguang
    Reviewed-by: Josef Bacik
    Signed-off-by: David Sterba
    Signed-off-by: Chris Mason

    Wang Xiaoguang
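
The accounting change described in the commit above can be sketched in userspace as follows. The struct mirrors the counter names quoted in the message, but everything else is an assumption for illustration: `accounted()`, `can_reserve()`, `add_reserved_old()` and `add_reserved_new()` are hypothetical helpers, not the kernel functions, and the 100M total used in the usage note is made up.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Simplified, hypothetical model of the counters named in the commit message. */
struct space_info {
    uint64_t total_bytes;
    uint64_t bytes_used;
    uint64_t bytes_reserved;
    uint64_t bytes_pinned;
    uint64_t bytes_readonly;
    uint64_t bytes_may_use;
};

/* Sum of everything that counts against total_bytes. */
static uint64_t accounted(const struct space_info *s)
{
    return s->bytes_used + s->bytes_reserved + s->bytes_pinned +
           s->bytes_readonly + s->bytes_may_use;
}

/* The check from the message: does a new request of `bytes` still fit? */
static bool can_reserve(const struct space_info *s, uint64_t bytes)
{
    uint64_t acc = accounted(s);

    return acc <= s->total_bytes && bytes <= s->total_bytes - acc;
}

/* Old behaviour: the extent is reserved but bytes_may_use is left untouched. */
static void add_reserved_old(struct space_info *s, uint64_t num_bytes)
{
    s->bytes_reserved += num_bytes;
}

/* Patched behaviour: move ram_bytes out of bytes_may_use at reservation time. */
static void add_reserved_new(struct space_info *s, uint64_t ram_bytes,
                             uint64_t num_bytes)
{
    assert(s->bytes_may_use >= ram_bytes);
    s->bytes_may_use -= ram_bytes;
    s->bytes_reserved += num_bytes;
}
```

With a 100M total and 64M already in bytes_may_use, the old path counts the same 64M twice (128M accounted), while the patched path keeps the total at 64M and leaves 36M available to other reservers.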
     

26 Jul, 2016

2 commits


05 Apr, 2016

1 commit

  • PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
    ago with promise that one day it will be possible to implement page
    cache with bigger chunks than PAGE_SIZE.

    This promise never materialized. And unlikely will.

    We have many places where PAGE_CACHE_SIZE assumed to be equal to
    PAGE_SIZE. And it's constant source of confusion on whether
    PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
    especially on the border between fs and mm.

    Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause too much
    breakage to be doable.

    Let's stop pretending that pages in page cache are special. They are
    not.

    The changes are pretty straight-forward:

    - <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;

    - <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;

    - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};

    - page_cache_get() -> get_page();

    - page_cache_release() -> put_page();

    This patch contains automated changes generated with coccinelle using
    script below. For some reason, coccinelle doesn't patch header files.
    I've called spatch for them manually.

    The only adjustment after coccinelle is a revert of the changes to the
    PAGE_CACHE_ALIGN definition: we are going to drop it later.

    There are a few places in the code that coccinelle didn't reach. I'll
    fix them manually in a separate patch. Comments and documentation will
    also be addressed in a separate patch.

    virtual patch

    @@
    expression E;
    @@
    - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
    + E

    @@
    expression E;
    @@
    - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
    + E

    @@
    @@
    - PAGE_CACHE_SHIFT
    + PAGE_SHIFT

    @@
    @@
    - PAGE_CACHE_SIZE
    + PAGE_SIZE

    @@
    @@
    - PAGE_CACHE_MASK
    + PAGE_MASK

    @@
    expression E;
    @@
    - PAGE_CACHE_ALIGN(E)
    + PAGE_ALIGN(E)

    @@
    expression E;
    @@
    - page_cache_get(E)
    + get_page(E)

    @@
    expression E;
    @@
    - page_cache_release(E)
    + put_page(E)

    Signed-off-by: Kirill A. Shutemov
    Acked-by: Michal Hocko
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
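
As a small illustration of why the shift rules in the commit above are safe, the following sketch assumes PAGE_CACHE_SHIFT equals PAGE_SHIFT (which the message says was true everywhere in practice). The value 12 and the helper names `index_old()` and `index_new()` are hypothetical and only serve the example.

```c
#include <assert.h>

/* Hypothetical userspace copies of the constants; real values are per-architecture. */
#define PAGE_SHIFT        12
#define PAGE_CACHE_SHIFT  PAGE_SHIFT    /* what the old macro amounted to */

/* Old style: convert a page-cache index to a page index. */
static unsigned long index_old(unsigned long pgoff)
{
    return pgoff << (PAGE_CACHE_SHIFT - PAGE_SHIFT);    /* shift by zero */
}

/* After the conversion the shift is dropped entirely. */
static unsigned long index_new(unsigned long pgoff)
{
    return pgoff;
}
```

Because the shift amount is zero, both helpers return the same value for every input, which is what the first two semantic patch rules rely on.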
     

12 Mar, 2016

1 commit


20 Jan, 2016

1 commit


16 Jan, 2016

1 commit

  • The following call trace is seen when btrfs/031 test is executed in a loop,

    [ 158.661848] ------------[ cut here ]------------
    [ 158.662634] WARNING: CPU: 2 PID: 890 at /home/chandan/repos/linux/fs/btrfs/ioctl.c:558 create_subvol+0x3d1/0x6ea()
    [ 158.664102] BTRFS: Transaction aborted (error -2)
    [ 158.664774] Modules linked in:
    [ 158.665266] CPU: 2 PID: 890 Comm: btrfs Not tainted 4.4.0-rc6-g511711a #2
    [ 158.666251] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
    [ 158.667392] ffffffff81c0a6b0 ffff8806c7c4f8e8 ffffffff81431fc8 ffff8806c7c4f930
    [ 158.668515] ffff8806c7c4f920 ffffffff81051aa1 ffff880c85aff000 ffff8800bb44d000
    [ 158.669647] ffff8808863b5c98 0000000000000000 00000000fffffffe ffff8806c7c4f980
    [ 158.670769] Call Trace:
    [ 158.671153] [] dump_stack+0x44/0x5c
    [ 158.671884] [] warn_slowpath_common+0x81/0xc0
    [ 158.672769] [] warn_slowpath_fmt+0x47/0x50
    [ 158.673620] [] create_subvol+0x3d1/0x6ea
    [ 158.674440] [] btrfs_mksubvol.isra.30+0x369/0x520
    [ 158.675376] [] ? percpu_down_read+0x1a/0x50
    [ 158.676235] [] btrfs_ioctl_snap_create_transid+0x101/0x180
    [ 158.677268] [] btrfs_ioctl_snap_create+0x52/0x70
    [ 158.678183] [] btrfs_ioctl+0x474/0x2f90
    [ 158.678975] [] ? vma_merge+0xee/0x300
    [ 158.679751] [] ? alloc_pages_vma+0x91/0x170
    [ 158.680599] [] ? lru_cache_add_active_or_unevictable+0x22/0x70
    [ 158.681686] [] ? selinux_file_ioctl+0xff/0x1d0
    [ 158.682581] [] do_vfs_ioctl+0x2c1/0x490
    [ 158.683399] [] ? security_file_ioctl+0x3e/0x60
    [ 158.684297] [] SyS_ioctl+0x74/0x80
    [ 158.685051] [] entry_SYSCALL_64_fastpath+0x12/0x6a
    [ 158.685958] ---[ end trace 4b63312de5a2cb76 ]---
    [ 158.686647] BTRFS: error (device loop0) in create_subvol:558: errno=-2 No such entry
    [ 158.709508] BTRFS info (device loop0): forced readonly
    [ 158.737113] BTRFS info (device loop0): disk space caching is enabled
    [ 158.738096] BTRFS error (device loop0): Remounting read-write after error is not allowed
    [ 158.851303] BTRFS error (device loop0): cleaner transaction attach returned -30

    This occurs because,

    Mount filesystem
    Create subvol with ID 257
    Unmount filesystem
    Mount filesystem
    Delete subvol with ID 257
    btrfs_drop_snapshot()
    Add root corresponding to subvol 257 into
    btrfs_transaction->dropped_roots list
    Create new subvol (i.e. create_subvol())
    257 is returned as the next free objectid
    btrfs_read_fs_root_no_name()
    Finds the btrfs_root instance corresponding to the old subvol with ID 257
    in btrfs_fs_info->fs_roots_radix.
    Returns error since btrfs_root_item->refs has the value of 0.

    To fix the issue the commit initializes tree root's and subvolume root's
    highest_objectid when loading the roots from disk.

    Signed-off-by: Chandan Rajendra
    Signed-off-by: David Sterba

    Chandan Rajendra
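
A rough model of the fix described above, under the assumption that the allocator simply hands out "highest id seen so far, plus one". `struct root`, `init_highest_objectid()` and `next_free_objectid()` are hypothetical names for this sketch; only the constant 256 mirrors BTRFS_FIRST_FREE_OBJECTID.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define FIRST_FREE_OBJECTID 256ULL  /* mirrors BTRFS_FIRST_FREE_OBJECTID */

/* Hypothetical in-memory root: tracks the highest objectid handed out so far. */
struct root {
    uint64_t highest_objectid;
};

/*
 * At load time, scan the ids present on disk so that the allocator starts
 * above all of them, including ids of subvolumes that are deleted later
 * but still have an in-memory root.
 */
static void init_highest_objectid(struct root *r, const uint64_t *ids, size_t n)
{
    r->highest_objectid = FIRST_FREE_OBJECTID - 1;
    for (size_t i = 0; i < n; i++) {
        if (ids[i] > r->highest_objectid)
            r->highest_objectid = ids[i];
    }
}

/* Hand out the next id above everything seen at load time. */
static uint64_t next_free_objectid(struct root *r)
{
    return ++r->highest_objectid;
}
```

In the scenario from the message, subvol 257 exists at mount time, so the first id handed out afterwards is 258 even if 257 is deleted before the new subvolume is created.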
     

11 Jan, 2016

1 commit


07 Jan, 2016

3 commits


22 Oct, 2015

2 commits


01 Jul, 2015

2 commits

  • While the inode cache caching kthread is calling btrfs_unpin_free_ino(),
    we could have a concurrent call to btrfs_return_ino() that adds a new
    entry to the root's free space cache of pinned inodes. This concurrent
    call does not acquire the fs_info->commit_root_sem before adding a new
    entry if the caching state is BTRFS_CACHE_FINISHED, which is a problem
    because the caching kthread calls btrfs_unpin_free_ino() after setting
    the caching state to BTRFS_CACHE_FINISHED and therefore races with
    the task calling btrfs_return_ino(), which is adding a new entry, while
    the former (caching kthread) is navigating the cache's rbtree, removing
    and freeing nodes from the cache's rbtree without acquiring the spinlock
    that protects the rbtree.

    This race resulted in memory corruption due to double free of struct
    btrfs_free_space objects because both tasks can end up freeing the
    same objects. Note that adding a new entry can result in merging it with
    other entries in the cache, in which case those entries are freed.
    This is particularly important as btrfs_free_space structures are also
    used for the block group free space caches.

    This memory corruption can be detected by a debugging kernel, which
    reports it with the following trace:

    [132408.501148] slab error in verify_redzone_free(): cache `btrfs_free_space': double free detected
    [132408.505075] CPU: 15 PID: 12248 Comm: btrfs-ino-cache Tainted: G W 4.1.0-rc5-btrfs-next-10+ #1
    [132408.505075] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.1-0-g4adadbd-20150316_085822-nilsson.home.kraxel.org 04/01/2014
    [132408.505075] ffff880023e7d320 ffff880163d73cd8 ffffffff8145eec7 ffffffff81095dce
    [132408.505075] ffff880009735d40 ffff880163d73ce8 ffffffff81154e1e ffff880163d73d68
    [132408.505075] ffffffff81155733 ffffffffa054a95a ffff8801b6099f00 ffffffffa0505b5f
    [132408.505075] Call Trace:
    [132408.505075] [] dump_stack+0x4f/0x7b
    [132408.505075] [] ? console_unlock+0x356/0x3a2
    [132408.505075] [] __slab_error.isra.28+0x25/0x36
    [132408.505075] [] __cache_free+0xe2/0x4b6
    [132408.505075] [] ? __btrfs_add_free_space+0x2f0/0x343 [btrfs]
    [132408.505075] [] ? btrfs_unpin_free_ino+0x8e/0x99 [btrfs]
    [132408.505075] [] ? time_hardirqs_off+0x15/0x28
    [132408.505075] [] ? trace_hardirqs_off+0xd/0xf
    [132408.505075] [] ? kfree+0xb6/0x14e
    [132408.505075] [] kfree+0xe5/0x14e
    [132408.505075] [] btrfs_unpin_free_ino+0x8e/0x99 [btrfs]
    [132408.505075] [] caching_kthread+0x29e/0x2d9 [btrfs]
    [132408.505075] [] ? btrfs_unpin_free_ino+0x99/0x99 [btrfs]
    [132408.505075] [] kthread+0xef/0xf7
    [132408.505075] [] ? time_hardirqs_on+0x15/0x28
    [132408.505075] [] ? __kthread_parkme+0xad/0xad
    [132408.505075] [] ret_from_fork+0x42/0x70
    [132408.505075] [] ? __kthread_parkme+0xad/0xad
    [132408.505075] ffff880023e7d320: redzone 1:0x9f911029d74e35b, redzone 2:0x9f911029d74e35b.
    [132409.501654] slab: double free detected in cache 'btrfs_free_space', objp ffff880023e7d320
    [132409.503355] ------------[ cut here ]------------
    [132409.504241] kernel BUG at mm/slab.c:2571!

    Therefore fix this by having btrfs_unpin_free_ino() acquire the lock
    that protects the rbtree while doing the searches and removing entries.

    Fixes: 1c70d8fb4dfa ("Btrfs: fix inode caching vs tree log")
    Cc: stable@vger.kernel.org
    Signed-off-by: Filipe Manana
    Signed-off-by: Chris Mason

    Filipe Manana
     
  • The free space entries are allocated using kmem_cache_zalloc(),
    through __btrfs_add_free_space(), therefore we should use
    kmem_cache_free() and not kfree() to avoid any confusion and
    any potential problem. Looking at the kfree() definition at
    mm/slab.c it has the following comment:

    /*
    * (...)
    *
    * Don't free memory not originally allocated by kmalloc()
    * or you will run into trouble.
    */

    So better be safe and use kmem_cache_free().

    Cc: stable@vger.kernel.org
    Signed-off-by: Filipe Manana
    Reviewed-by: David Sterba
    Signed-off-by: Chris Mason

    Filipe Manana
     

11 Apr, 2015

1 commit

  • We loop through all of the dirty block groups during commit and write
    the free space cache. In order to make sure the cache is correct, we do
    this while no other writers are allowed in the commit.

    If a large number of block groups are dirty, this can introduce long
    stalls during the final stages of the commit, which can block new procs
    trying to change the filesystem.

    This commit changes the block group cache writeout to take appropriate
    locks and allow it to run earlier in the commit. We'll still have to
    redo some of the block groups, but it means we can get most of the work
    out of the way without blocking the entire FS.

    Signed-off-by: Chris Mason

    Chris Mason
     

03 Dec, 2014

1 commit

  • Trimming is completely transactionless, and it operates by hiding free
    space entries from a block group, performing the trim/discard and then
    making the free space entries visible again.
    Therefore while a free space entry is being trimmed, we can have free space
    cache writing running in parallel (as part of a transaction commit) which
    will miss the free space entry. This means that if we unmount (or
    crash/reboot) after that transaction commit and mount again before another
    transaction starts/commits after the discard finishes, we will have some
    free space that won't be used again unless the free space cache is rebuilt.
    After the unmount, fsck (btrfsck, btrfs check) reports the issue as in the
    following example:

    *** fsck.btrfs output ***
    checking extents
    checking free space cache
    There is no free space entry for 521764864-521781248
    There is no free space entry for 521764864-1103101952
    cache appears valid but isnt 29360128
    Checking filesystem on /dev/sdc
    UUID: b4789e27-4774-4626-98e9-ae8dfbfb0fb5
    found 1235681286 bytes used err is -22
    (...)

    Another issue caused by this race is a crash while writing bitmap entries
    to the cache, because while the cache writeout task accesses the bitmaps,
    the trim task can be concurrently modifying the bitmap or, worse, might
    be freeing the bitmap. The latter case results in the following crash:

    [55650.804460] general protection fault: 0000 [#1] SMP DEBUG_PAGEALLOC
    [55650.804835] Modules linked in: btrfs dm_flakey dm_mod crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd fscache sunrpc loop parport_pc parport i2c_piix4 psmouse evdev pcspkr microcode processor i2ccore serio_raw thermal_sys button ext4 crc16 jbd2 mbcache sg sd_mod crc_t10dif sr_mod cdrom crct10dif_generic crct10dif_common ata_generic virtio_scsi floppy ata_piix libata virtio_pci virtio_ring virtio scsi_mod e1000 [last unloaded: btrfs]
    [55650.806169] CPU: 1 PID: 31002 Comm: btrfs-transacti Tainted: G W 3.17.0-rc5-btrfs-next-1+ #1
    [55650.806493] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
    [55650.806867] task: ffff8800b12f6410 ti: ffff880071538000 task.ti: ffff880071538000
    [55650.807166] RIP: 0010:[] [] write_bitmap_entries+0x65/0xbb [btrfs]
    [55650.807514] RSP: 0018:ffff88007153bc30 EFLAGS: 00010246
    [55650.807687] RAX: 000000005d1ec000 RBX: ffff8800a665df08 RCX: 0000000000000400
    [55650.807885] RDX: ffff88005d1ec000 RSI: 6b6b6b6b6b6b6b6b RDI: ffff88005d1ec000
    [55650.808017] RBP: ffff88007153bc58 R08: 00000000ddd51536 R09: 00000000000001e0
    [55650.808017] R10: 0000000000000000 R11: 0000000000000037 R12: 6b6b6b6b6b6b6b6b
    [55650.808017] R13: ffff88007153bca8 R14: 6b6b6b6b6b6b6b6b R15: ffff88007153bc98
    [55650.808017] FS: 0000000000000000(0000) GS:ffff88023ec80000(0000) knlGS:0000000000000000
    [55650.808017] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
    [55650.808017] CR2: 0000000002273b88 CR3: 00000000b18f6000 CR4: 00000000000006e0
    [55650.808017] Stack:
    [55650.808017] ffff88020e834e00 ffff880172d68db0 0000000000000000 ffff88019257c800
    [55650.808017] ffff8801d42ea720 ffff88007153bd10 ffffffffa037d2fa ffff880224e99180
    [55650.808017] ffff8801469a6188 ffff880224e99140 ffff880172d68c50 00000003000000b7
    [55650.808017] Call Trace:
    [55650.808017] [] __btrfs_write_out_cache+0x1ea/0x37f [btrfs]
    [55650.808017] [] btrfs_write_out_cache+0xa1/0xd8 [btrfs]
    [55650.808017] [] btrfs_write_dirty_block_groups+0x4b5/0x505 [btrfs]
    [55650.808017] [] commit_cowonly_roots+0x15e/0x1f7 [btrfs]
    [55650.808017] [] ? _raw_spin_lock+0xe/0x10
    [55650.808017] [] btrfs_commit_transaction+0x411/0x882 [btrfs]
    [55650.808017] [] transaction_kthread+0xf2/0x1a4 [btrfs]
    [55650.808017] [] ? btrfs_cleanup_transaction+0x3d8/0x3d8 [btrfs]
    [55650.808017] [] kthread+0xb7/0xbf
    [55650.808017] [] ? __kthread_parkme+0x67/0x67
    [55650.808017] [] ret_from_fork+0x7c/0xb0
    [55650.808017] [] ? __kthread_parkme+0x67/0x67
    [55650.808017] Code: 4c 89 ef 8d 70 ff e8 d4 fc ff ff 41 8b 45 34 41 39 45 30 7d 5c 31 f6 4c 89 ef e8 80 f6 ff ff 49 8b 7d 00 4c 89 f6 b9 00 04 00 00 a5 4c 89 ef 41 8b 45 30 8d 70 ff e8 a3 fc ff ff 41 8b 45 34
    [55650.808017] RIP [] write_bitmap_entries+0x65/0xbb [btrfs]
    [55650.808017] RSP
    [55650.815725] ---[ end trace 1c032e96b149ff86 ]---

    Fix this by serializing both tasks in such a way that cache writeout
    doesn't wait for the trim/discard of free space entries to finish and
    doesn't miss any free space entry.

    Signed-off-by: Filipe Manana
    Signed-off-by: Chris Mason

    Filipe Manana
     

12 Nov, 2014

1 commit

  • The pending mount option(s) now share namespace and bits with the normal
    options, and the existing one for (inode_cache) is unset unconditionally
    at each transaction commit.

    Introduce a separate namespace for pending changes and enhance the
    descriptions of the intended change to use separate bits for each
    action.

    Signed-off-by: David Sterba

    David Sterba
     

18 Sep, 2014

1 commit


10 Jun, 2014

1 commit


25 Apr, 2014

2 commits

  • Currently, with the inode cache enabled, we reuse an inode id immediately
    after unlinking the file, so we may hit something like the following:

    |->iput inode
    |->return inode id into inode cache
    |->create dir,fsync
    |->power off

    An easy way to reproduce this problem is:

    mkfs.btrfs -f /dev/sdb
    mount /dev/sdb /mnt -o inode_cache,commit=100
    dd if=/dev/zero of=/mnt/data bs=1M count=10 oflag=sync
    inode_id=`ls -i /mnt/data | awk '{print $1}'`
    rm -f /mnt/data

    i=1
    while [ 1 ]
    do
    mkdir /mnt/dir_$i
    test1=`stat /mnt/dir_$i | grep Inode: | awk '{print $4}'`
    if [ $test1 -eq $inode_id ]
    then
    dd if=/dev/zero of=/mnt/dir_$i/data bs=1M count=1 oflag=sync
    echo b > /proc/sysrq-trigger
    fi
    sleep 1
    i=$(($i+1))
    done

    mount /dev/sdb /mnt
    umount /dev/sdb
    btrfs check /dev/sdb

    We fix this problem by adding the unlinked inode's id into the pinned tree,
    so that it cannot be reused until the transaction is committed.

    Cc: stable@vger.kernel.org
    Signed-off-by: Miao Xie
    Signed-off-by: Wang Shilong
    Signed-off-by: Chris Mason

    Miao Xie
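
The "pin until commit" rule from the commit above can be sketched with two small sets. This is a toy model, not the kernel's free space cache: `struct ino_set`, `struct ino_cache`, `return_ino()`, `alloc_ino()` and `commit_transaction()` are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_IDS 8   /* hypothetical capacity of the toy sets */

struct ino_set {
    uint64_t ids[MAX_IDS];
    size_t n;
};

struct ino_cache {
    struct ino_set free_ino;    /* ids that may be handed out now */
    struct ino_set pinned;      /* ids of unlinked inodes, unusable until commit */
};

static void set_add(struct ino_set *s, uint64_t ino)
{
    assert(s->n < MAX_IDS);
    s->ids[s->n++] = ino;
}

/* Unlink path: park the id in the pinned set instead of making it reusable. */
static void return_ino(struct ino_cache *c, uint64_t ino)
{
    set_add(&c->pinned, ino);
}

/* Allocation path: only ids in the free set are candidates. */
static bool alloc_ino(struct ino_cache *c, uint64_t *ino)
{
    if (c->free_ino.n == 0)
        return false;
    *ino = c->free_ino.ids[--c->free_ino.n];
    return true;
}

/* Transaction commit: pinned ids become reusable. */
static void commit_transaction(struct ino_cache *c)
{
    for (size_t i = 0; i < c->pinned.n; i++)
        set_add(&c->free_ino, c->pinned.ids[i]);
    c->pinned.n = 0;
}
```

An id returned by an unlink cannot be allocated again until `commit_transaction()` has run, so a power failure before the commit cannot leave two on-disk objects sharing the id.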
     
  • When running a stress test (including snapshots, balance, fstress), we trigger
    the following BUG_ON(), which fires because we fail to start the inode caching task.

    [ 181.131945] kernel BUG at fs/btrfs/inode-map.c:179!
    [ 181.137963] invalid opcode: 0000 [#1] SMP
    [ 181.217096] CPU: 11 PID: 2532 Comm: btrfs Not tainted 3.14.0 #1
    [ 181.240521] task: ffff88013b621b30 ti: ffff8800b6ada000 task.ti: ffff8800b6ada000
    [ 181.367506] Call Trace:
    [ 181.371107] [] btrfs_return_ino+0x9e/0x110 [btrfs]
    [ 181.379191] [] btrfs_evict_inode+0x46b/0x4c0 [btrfs]
    [ 181.387464] [] ? autoremove_wake_function+0x40/0x40
    [ 181.395642] [] evict+0x9e/0x190
    [ 181.401882] [] iput+0xf3/0x180
    [ 181.408025] [] btrfs_orphan_cleanup+0x1ee/0x430 [btrfs]
    [ 181.416614] [] btrfs_mksubvol.isra.29+0x3bd/0x450 [btrfs]
    [ 181.425399] [] btrfs_ioctl_snap_create_transid+0x186/0x190 [btrfs]
    [ 181.435059] [] btrfs_ioctl_snap_create_v2+0xeb/0x130 [btrfs]
    [ 181.444148] [] btrfs_ioctl+0xf76/0x2b90 [btrfs]
    [ 181.451971] [] ? handle_mm_fault+0x475/0xe80
    [ 181.459509] [] ? __do_page_fault+0x1ec/0x520
    [ 181.467046] [] ? do_mmap_pgoff+0x2f5/0x3c0
    [ 181.474393] [] do_vfs_ioctl+0x2d8/0x4b0
    [ 181.481450] [] SyS_ioctl+0x81/0xa0
    [ 181.488021] [] system_call_fastpath+0x16/0x1b

    We should avoid triggering BUG_ON() here, instead, we output warning messages
    and clear inode_cache option.

    Signed-off-by: Wang Shilong
    Signed-off-by: Chris Mason

    Wang Shilong
     

07 Apr, 2014

1 commit

  • Let's try this again. We can deadlock the box if we send on a box and try to
    write onto the same fs with the app that is trying to listen to the send pipe.
    This is because the writer could get stuck waiting for a transaction commit
    which is being blocked by the send. So fix this by making sure looking at the
    commit roots is always going to be consistent. We do this by keeping track of
    which roots need to have their commit roots swapped during commit, and then
    taking the commit_root_sem and swapping them all at once. Then make sure we
    take a read lock on the commit_root_sem in cases where we search the commit root
    to make sure we're always looking at a consistent view of the commit roots.
    Previously we had problems with this because we would swap a fs tree commit root
    and then swap the extent tree commit root independently which would cause the
    backref walking code to screw up sometimes. With this patch we no longer
    deadlock and pass all the weird send/receive corner cases. Thanks,

    Reported-by: Hugo Mills
    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik
     

12 Nov, 2013

5 commits

  • Use WARN_ON()'s return value in place of WARN_ON(1) for cleaner source
    code that outputs more descriptive warnings. Also fix the styling
    warning of redundant braces that came up as a result of this fix.

    Signed-off-by: Dulshani Gunawardhana
    Reviewed-by: Zach Brown
    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Dulshani Gunawardhana
     
  • Due to an off-by-one error, it is possible to reproduce a bug
    when the inode cache is used.

    The same inode number is assigned twice, the second time this
    leads to an EEXIST in btrfs_insert_empty_items().

    The issue can happen when a file is removed right after a subvolume
    is created and then a new inode number is created before the
    inodes in free_ino_pinned are processed.
    unlink() calls btrfs_return_ino() which calls start_caching() in this
    case which adds [highest_ino + 1, BTRFS_LAST_FREE_OBJECTID] by
    searching for the highest inode (which already cannot find the
    unlinked one anymore in btrfs_find_free_objectid()). So if this
    unlinked inode's number is equal to the highest_ino + 1 (or >= this value
    instead of > this value which was the off-by-one error), we mustn't add
    the inode number to free_ino_pinned (caching_thread() does it right).
    In this case we need to try directly to add the number to the inode_cache
    which will fail in this case.

    When this inode number is allocated while it is still in free_ino_pinned,
    it is allocated and still added to the free inode cache when the
    pinned inodes are processed, thus one of the following inode number
    allocations will get an inode that is already in use and fail with EEXIST
    in btrfs_insert_empty_items().

    One example which was created with the reproducer below:
    Create a snapshot, work in the newly created snapshot for the rest.
    In unlink(inode 34284) call btrfs_return_ino() which calls start_caching().
    start_caching() calls add_free_space [34284, 18446744073709517077].
    In btrfs_return_ino(), call start_caching pinned [34284, 1] which is wrong.
    mkdir() call btrfs_find_ino_for_alloc() which returns the number 34284.
    btrfs_unpin_free_ino calls add_free_space [34284, 1].
    mkdir() call btrfs_find_ino_for_alloc() which returns the number 34284.
    EEXIST when the new inode is inserted.

    One possible reproducer is this one:
    #!/bin/sh
    # preparation
    TEST_DEV=/dev/sdc1
    TEST_MNT=/mnt
    umount ${TEST_MNT} 2>/dev/null || true
    mkfs.btrfs -f ${TEST_DEV}
    mount ${TEST_DEV} ${TEST_MNT} -o \
    rw,relatime,compress=lzo,space_cache,inode_cache
    btrfs subv create ${TEST_MNT}/s1
    for i in `seq 34027`; do touch ${TEST_MNT}/s1/${i}; done
    btrfs subv snap ${TEST_MNT}/s1 ${TEST_MNT}/s2
    FILENAME=`find ${TEST_MNT}/s1/ -inum 4085 | sed 's|^.*/\([^/]*\)$|\1|'`
    rm ${TEST_MNT}/s2/$FILENAME
    touch ${TEST_MNT}/s2/$FILENAME
    # the following steps can be repeated to reproduce the issue again and again
    [ -e ${TEST_MNT}/s3 ] && btrfs subv del ${TEST_MNT}/s3
    btrfs subv snap ${TEST_MNT}/s2 ${TEST_MNT}/s3
    rm ${TEST_MNT}/s3/$FILENAME
    touch ${TEST_MNT}/s3/$FILENAME
    ls -alFi ${TEST_MNT}/s?/$FILENAME
    touch ${TEST_MNT}/s3/_1 || logger FAILED
    ls -alFi ${TEST_MNT}/s?/_1
    touch ${TEST_MNT}/s3/_2 || logger FAILED
    ls -alFi ${TEST_MNT}/s?/_2
    touch ${TEST_MNT}/s3/__1 || logger FAILED
    ls -alFi ${TEST_MNT}/s?/__1
    touch ${TEST_MNT}/s3/__2 || logger FAILED
    ls -alFi ${TEST_MNT}/s?/__2
    # if the above is not enough, add the following loop:
    for i in `seq 3 9`; do touch ${TEST_MNT}/s3/__${i} || logger FAILED; done
    #for i in `seq 3 34027`; do touch ${TEST_MNT}/s3/__${i} || logger FAILED; done
    # one of the touch(1) calls in s3 fail due to EEXIST because the inode is
    # already in use that btrfs_find_ino_for_alloc() returns.

    Signed-off-by: Stefan Behrens
    Reviewed-by: Jan Schmidt
    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Stefan Behrens
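
The off-by-one described above comes down to one comparison. The sketch below is a simplification that assumes only what the message states, namely that start_caching() has already added [highest_ino + 1, BTRFS_LAST_FREE_OBJECTID] to the free cache. The helper names are hypothetical, and the real check in btrfs_return_ino() takes further state into account.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * start_caching() is described as adding [highest_ino + 1, LAST] to the
 * free cache, so any id at or above highest_ino + 1 is already covered.
 */
static bool already_in_free_range(uint64_t ino, uint64_t highest_ino)
{
    return ino >= highest_ino + 1;      /* fixed comparison */
}

/* The off-by-one variant: misses the id exactly equal to highest_ino + 1. */
static bool already_in_free_range_buggy(uint64_t ino, uint64_t highest_ino)
{
    return ino > highest_ino + 1;
}

/* Returns true when the id should go to the pinned set. */
static bool should_pin(uint64_t ino, uint64_t highest_ino)
{
    return !already_in_free_range(ino, highest_ino);
}
```

For the example in the message (inode 34284 unlinked, so the highest remaining inode is assumed to be 34283), the buggy comparison wrongly treats 34284 as not yet covered and pins it, which later puts the same number into the free cache a second time.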
     
  • Not used for anything, and removing it avoids caller's need to
    allocate a path structure.

    Signed-off-by: Filipe David Borba Manana
    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Filipe David Borba Manana
     
  • We're doing an unnecessary extra lookup of the ino cache's
    inode when we already have it (and are holding a reference)
    during the process of saving the ino cache contents to disk.
    Therefore remove this extra lookup.

    Signed-off-by: Filipe David Borba Manana
    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Filipe David Borba Manana
     
  • The fact that btrfs_root_refs() returned 0 for the tree_root caused
    bugs in the past, therefore it is set to 1 with this patch and
    (hopefully) all affected code is adapted to this change.

    I verified this change by temporarily adding WARN_ON() checks
    everywhere where btrfs_root_refs() is used, checking whether the
    logic of the code is changed by btrfs_root_refs() returning 1
    instead of 0 for root->root_key.objectid == BTRFS_ROOT_TREE_OBJECTID.
    With these added checks, I ran the xfstests './check -g auto'.

    The two roots chunk_root and log_root_tree that are only referenced
    by the superblock and the log_roots below the log_root_tree still
    have btrfs_root_refs() == 0, only the tree_root is changed.

    Signed-off-by: Stefan Behrens
    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Stefan Behrens
     

18 May, 2013

2 commits

  • It is very likely that there are lots of subvolumes/snapshots in the filesystem,
    so if we use global block reservation to do inode cache truncation, we may hog
    all the free space that is reserved in global rsv. So it is better that we do
    the free space reservation for inode cache truncation by ourselves.

    Cc: Tsutomu Itoh
    Signed-off-by: Miao Xie
    Signed-off-by: Josef Bacik

    Miao Xie
     
  • The filesystem with inode cache was forced to be read-only when we umounted it.

    Steps to reproduce:
    # mkfs.btrfs -f ${DEV}
    # mount -o inode_cache ${DEV} ${MNT}
    # dd if=/dev/zero of=${MNT}/file1 bs=1M count=8192
    # btrfs fi syn ${MNT}
    # dd if=${MNT}/file1 of=/dev/null bs=1M
    # rm -f ${MNT}/file1
    # btrfs fi syn ${MNT}
    # umount ${MNT}

    This happened because there was not enough space to do the inode cache
    truncation, so we aborted the current transaction.

    But a no-space error is not a serious problem when we write out the inode
    cache, and it is safe to just skip this step if we hit it. So we need not
    abort the current transaction.

    Reported-by: Tsutomu Itoh
    Signed-off-by: Miao Xie
    Tested-by: Tsutomu Itoh
    Signed-off-by: Josef Bacik

    Miao Xie
     

12 Dec, 2012

1 commit

  • In some places (such as when evicting an inode), we cannot flush the
    reserved space of delalloc, but flushing the delayed directory index and
    delayed inode is OK. However, we did not try to flush those things and
    simply returned when there was not enough space to reserve. This patch
    fixes this problem.

    We define 3 types of flush operations: NO_FLUSH, FLUSH_LIMIT and FLUSH_ALL.
    If we are in a transaction, we should not flush anything, or a deadlock
    would happen, so use NO_FLUSH. If flushing the reserved space of delalloc
    would cause a deadlock, use FLUSH_LIMIT. In all other cases, FLUSH_ALL is
    used and we flush everything.

    Signed-off-by: Miao Xie
    Signed-off-by: Chris Mason

    Miao Xie
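
The three flush types listed in the commit above might be selected along these lines. This is only a sketch: the enum uses the short names from the message rather than the kernel's identifiers, and `pick_flush()` with its two boolean inputs is a hypothetical helper, not an existing btrfs function.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names modelled on the three types listed in the message. */
enum reserve_flush {
    NO_FLUSH,       /* inside a transaction: any flushing could deadlock */
    FLUSH_LIMIT,    /* flushing delalloc could deadlock: flush delayed items only */
    FLUSH_ALL,      /* anything may be flushed */
};

/* Pick the most aggressive flush type that the calling context allows. */
static enum reserve_flush pick_flush(bool in_transaction,
                                     bool delalloc_flush_deadlocks)
{
    if (in_transaction)
        return NO_FLUSH;
    if (delalloc_flush_deadlocks)
        return FLUSH_LIMIT;
    return FLUSH_ALL;
}
```

An inode eviction path would correspond to the second case: it is not inside a transaction, but it cannot wait on delalloc, so it would get FLUSH_LIMIT.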
     

29 Mar, 2012

1 commit


22 Mar, 2012

1 commit

  • btrfs currently handles most errors with BUG_ON. This patch is a work-in-
    progress but aims to handle most errors other than internal logic
    errors and ENOMEM more gracefully.

    This iteration prevents most crashes but can run into lockups with
    the page lock on occasion when the timing "works out."

    Signed-off-by: Jeff Mahoney

    Jeff Mahoney