29 Jan, 2012

1 commit

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
    Btrfs: fix reservations in btrfs_page_mkwrite
    Btrfs: advance window_start if we're using a bitmap
    btrfs: mask out gfp flags in releasepage
    Btrfs: fix enospc error caused by wrong checks of the chunk
    Btrfs: do not defrag a file partially
    Btrfs: fix warning for 32-bit build of fs/btrfs/check-integrity.c
    Btrfs: use cluster->window_start when allocating from a cluster bitmap
    Btrfs: Check for NULL page in extent_range_uptodate
    btrfs: Fix busyloops in transaction waiting code
    Btrfs: make sure a bitmap has enough bytes
    Btrfs: fix uninit warning in backref.c

    Linus Torvalds
     

27 Jan, 2012

11 commits

  • Josef fixed btrfs_page_mkwrite to properly release reserved
    extents if there was an error. But if we fail to get a reservation
    and we fail to dirty the inode (for ENOSPC reasons), we'll end up
    trying to release a reservation we never had.

    This makes sure we only release if we were able to reserve.

    Signed-off-by: Chris Mason

    Chris Mason
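
The rule in the message above (release only what was actually reserved) can be sketched in userspace C. This is a minimal illustration, not the btrfs code: `struct pool`, `pool_reserve`, `pool_release` and `mkwrite_sketch` are hypothetical names, and the `dirty_fails` flag merely simulates the inode update failing.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical space pool; a stand-in, not a btrfs structure. */
struct pool {
    long free_bytes;
    long reserved_bytes;
};

int pool_reserve(struct pool *p, long n)
{
    if (p->free_bytes < n)
        return -ENOSPC;
    p->free_bytes -= n;
    p->reserved_bytes += n;
    return 0;
}

void pool_release(struct pool *p, long n)
{
    /* Releasing space that was never reserved would corrupt the counters. */
    assert(p->reserved_bytes >= n);
    p->reserved_bytes -= n;
    p->free_bytes += n;
}

/* dirty_fails simulates the inode update failing after the reservation. */
int mkwrite_sketch(struct pool *p, long n, bool dirty_fails)
{
    bool reserved = false;
    int ret;

    ret = pool_reserve(p, n);
    if (ret)
        goto out;
    reserved = true;

    if (dirty_fails) {
        ret = -ENOSPC;
        goto out;
    }
    return 0; /* success: the reservation stays in place */

out:
    /* Only undo the reservation if we actually obtained one. */
    if (reserved)
        pool_release(p, n);
    return ret;
}
```

With a 4096-byte pool, a failed 8192-byte reservation leaves the counters untouched, and a failure after a successful reservation returns the space.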
     
  • If we span a long area in a bitmap we could end up taking a lot of time
    searching to the next free area if we're searching from the original
    window_start, so advance window_start in order to make sure we don't do any
    superfluous searching. Thanks,

    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik
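
The effect of advancing window_start can be illustrated with a toy bitmap allocator. This is a sketch under simplifying assumptions (one bit per unit, a fixed 64-bit map); `struct cluster` and `alloc_from_bitmap` are hypothetical and do not mirror the btrfs free space cache.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NBITS 64

/* Toy cluster: used[i] is true when unit i is allocated. */
struct cluster {
    bool used[NBITS];
    size_t window_start;
};

/*
 * Find the first run of `want` free units at or after window_start,
 * mark it used and return its start index, or -1 if none exists.
 * Every unit examined is added to *scanned. When `advance` is true,
 * window_start moves past the run so later searches skip it.
 */
long alloc_from_bitmap(struct cluster *c, size_t want, size_t *scanned,
                       bool advance)
{
    size_t run = 0;

    assert(want > 0);
    for (size_t i = c->window_start; i < NBITS; i++) {
        (*scanned)++;
        run = c->used[i] ? 0 : run + 1;
        if (run == want) {
            size_t start = i + 1 - want;

            for (size_t j = start; j <= i; j++)
                c->used[j] = true;
            if (advance)
                c->window_start = i + 1;
            return (long)start;
        }
    }
    return -1;
}
```

Four 8-unit allocations from an empty map return the same offsets either way, but the variant that never advances window_start re-examines the already allocated units on every call.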
     
  • btree_releasepage is a callback and can be passed unknown gfp flags, which
    may then end up in kmem_cache_alloc called from alloc_extent_state. The slab
    allocator will BUG_ON when the HIGHMEM or DMA32 flag is set.

    This may happen when btrfs is mounted from a loop device, which masks out
    the __GFP_IO flag. The check in try_release_extent_state

        if ((mask & GFP_NOFS) == GFP_NOFS)
                mask = GFP_NOFS;

    then does not match and passes the unfiltered flags on, resulting in a
    crash at mm/slab.c:2963

    [] cache_alloc_refill+0x3b4/0x5c8
    [] kmem_cache_alloc+0x204/0x294
    [] mempool_alloc+0x52/0x170
    [] alloc_extent_state+0x40/0xd4 [btrfs]
    [] __clear_extent_bit+0x38a/0x4cc [btrfs]
    [] try_release_extent_state+0x9c/0xd4 [btrfs]
    [] btree_releasepage+0x7e/0xd0 [btrfs]
    [] shrink_page_list+0x6a0/0x724
    [] shrink_inactive_list+0x230/0x578
    [] shrink_list+0x6c/0x120
    [] shrink_zone+0x1e2/0x228
    [] shrink_zones+0x90/0x254
    [] do_try_to_free_pages+0xac/0x420
    [] try_to_free_pages+0x13c/0x1b0
    [] __alloc_pages_nodemask+0x5b4/0x9a8
    [] grab_cache_page_write_begin+0x7e/0xe8

    Signed-off-by: David Sterba
    Signed-off-by: Chris Mason

    David Sterba
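
Why the GFP_NOFS comparison stops filtering once __GFP_IO has been cleared can be shown with plain bit arithmetic. The flag values below are illustrative placeholders (the real kernel values differ between versions), and `old_filter` and `masked_filter` are hypothetical helpers, not kernel functions.

```c
#include <assert.h>

/* Illustrative bit values only; not the kernel's definitions. */
#define X_GFP_HIGHMEM   0x02u
#define X_GFP_DMA32     0x04u
#define X_GFP_WAIT      0x10u
#define X_GFP_IO        0x40u
#define X_GFP_NOFS      (X_GFP_WAIT | X_GFP_IO)
#define X_SLAB_BUG_MASK (X_GFP_HIGHMEM | X_GFP_DMA32)

/* The check quoted above: it only normalizes masks containing all of NOFS. */
unsigned old_filter(unsigned mask)
{
    if ((mask & X_GFP_NOFS) == X_GFP_NOFS)
        mask = X_GFP_NOFS;
    return mask;
}

/* Sketch of the remedy: strip the zone flags before anything else. */
unsigned masked_filter(unsigned mask)
{
    unsigned filtered = old_filter(mask & ~X_SLAB_BUG_MASK);

    assert((filtered & X_SLAB_BUG_MASK) == 0);
    return filtered;
}
```

A mask with the IO bit cleared and a HIGHMEM bit set passes through `old_filter` unchanged, while `masked_filter` never returns a zone flag.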
     
  • When we ran a sysbench test on inline files, an enospc error occurred easily
    even though there was plenty of free disk space that could be allocated for
    new chunks.

    Steps to reproduce:
    # mkfs.btrfs -b $((2 * 1024 * 1024 * 1024))
    # mount /mnt
    # ulimit -n 102400
    # cd /mnt
    # sysbench --num-threads=1 --test=fileio --file-num=81920 \
    > --file-total-size=80M --file-block-size=1K --file-io-mode=sync \
    > --file-test-mode=seqwr prepare
    # sysbench --num-threads=1 --test=fileio --file-num=81920 \
    > --file-total-size=80M --file-block-size=1K --file-io-mode=sync \
    > --file-test-mode=seqwr run

    The cause of this bug:
    We can now reserve more space than is free in the existing chunks, as long
    as there is enough free disk space for new chunks. In that case, the space
    allocator should force allocation of a new chunk when there is no free space
    in the free space cache. But two wrong checks break this operation.

    One is
        if (ret == -ENOSPC && num_bytes > min_alloc_size)
    in btrfs_reserve_extent(). It is wrong: we should try to allocate a new chunk
    even if we fail to allocate free space of the minimum allocatable size.

    The other is
        if (space_info->force_alloc)
                force = space_info->force_alloc;
    in do_chunk_alloc(). It makes the allocator ignore CHUNK_ALLOC_FORCE if someone
    has set ->force_alloc to CHUNK_ALLOC_LIMITED, which causes the enospc error.

    Fix these two wrong checks. For the second one, we change the values of
    CHUNK_ALLOC_LIMITED and CHUNK_ALLOC_FORCE so that CHUNK_ALLOC_FORCE is greater
    than CHUNK_ALLOC_LIMITED, since CHUNK_ALLOC_FORCE has higher priority. If the
    value passed in by the caller is greater than ->force_alloc, use the passed
    value.

    Signed-off-by: Miao Xie
    Signed-off-by: Chris Mason

    Miao Xie
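
The second fix amounts to ordering the force levels by priority and taking the larger of the caller's value and the stored one. The sketch below assumes concrete numeric values (0, 1, 2) purely for illustration; the message only states that FORCE must be greater than LIMITED. The enum and both functions are hypothetical names.

```c
#include <assert.h>

/* Assumed values; only the ordering LIMITED < FORCE comes from the text. */
enum chunk_alloc {
    ALLOC_NO_FORCE = 0,
    ALLOC_LIMITED  = 1,
    ALLOC_FORCE    = 2
};

/* Old behaviour: any stored value overrides the caller, even a weaker one. */
enum chunk_alloc old_effective_force(enum chunk_alloc passed,
                                     enum chunk_alloc stored)
{
    if (stored)
        return stored;
    return passed;
}

/* Fixed behaviour: the higher-priority request wins. */
enum chunk_alloc effective_force(enum chunk_alloc passed,
                                 enum chunk_alloc stored)
{
    assert(passed <= ALLOC_FORCE && stored <= ALLOC_FORCE);
    return passed > stored ? passed : stored;
}
```

With a stored LIMITED and a caller asking for FORCE, the old logic downgrades the request to LIMITED, whereas the comparison keeps FORCE.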
     
  • xfstests 218 complains that btrfs defrags a file partially:
    After: 1
    Write backwards sync, but contiguous - should defrag to 1 extent
    Before: 10
    -After: 1
    +After: 2

    To fix this, we need to set max_to_defrag count properly.

    Signed-off-by: Liu Bo
    Signed-off-by: Chris Mason

    Liu Bo
     
  • There were 4 warnings on the 32-bit build; this fixes them.

    Signed-off-by: Stefan Behrens
    Signed-off-by: Chris Mason

    Stefan Behrens
     
  • We specifically set window_start in the cluster struct to indicate where the
    cluster starts in a bitmap, but we've been using min_start to indicate where
    we're searching from. This is usually the start of the blockgroup, so
    essentially means we're constantly searching from the start of any bitmap we
    find, which completely negates all the trouble we go to in order to setup a
    cluster. So start using window_start to make sure we actually use the area we
    found. Thanks,

    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik
     
  • A user has encountered a NULL pointer kernel oops in btrfs when
    encountering media errors. The problem has been identified
    as an unhandled NULL pointer returned from find_get_page().
    This modification simply checks for a NULL page, and returns
    with an error if found (the extent_range_uptodate() function
    returns 1 on errors).

    After testing this patch, the user reported that the error with
    the NULL pointer oops was solved. However, there is still a
    remaining problem with a thread becoming stuck in
    wait_on_page_locked(page) in the read_extent_buffer_pages(...)
    function in extent_io.c

        for (i = start_i; i < num_pages; i++) {
                page = extent_buffer_page(eb, i);
                wait_on_page_locked(page);
                if (!PageUptodate(page))
                        ret = -EIO;
        }

    This patch leaves the issue with the locked page yet to be resolved.

    Signed-off-by: Mitch Harder
    Signed-off-by: Chris Mason

    Mitch Harder
     
  • wait_log_commit() and wait_for_writer() were using slightly different
    conditions for deciding whether they should call schedule() and whether they
    should continue in the wait loop. Thus it could happen that we busylooped when
    the first condition was not true while the second one was. That is burning CPU
    cycles needlessly and is deadly on UP machines...

    Signed-off-by: Jan Kara
    Signed-off-by: Chris Mason

    Jan Kara
     
  • We have only been checking for min_bytes available in bitmap entries, but we
    won't successfully set up a bitmap cluster unless the bitmap has at least
    'bytes' available. In the common case min_bytes is 4k and we want something
    like 2MB, so if there are a bunch of bitmap entries with less than 2MB in
    them, we'll search all of them anyway, which is suboptimal. Fix this check.
    Thanks,

    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik
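
The cost of the weaker check can be sketched by counting how many bitmap entries a search would visit under each threshold. `struct free_entry` and `bitmaps_to_search` are hypothetical; the real code walks the free space cache rather than an array.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy free-space entry: bytes available and whether it is a bitmap. */
struct free_entry {
    uint64_t bytes;
    bool is_bitmap;
};

/* Count the bitmap entries that pass the size threshold and get searched. */
size_t bitmaps_to_search(const struct free_entry *e, size_t n,
                         uint64_t threshold)
{
    size_t visited = 0;

    assert(e != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (e[i].is_bitmap && e[i].bytes >= threshold)
            visited++;
    }
    return visited;
}
```

For three bitmaps holding 8K, 64K and 4MB, a 4k threshold visits all three, while a 2MB threshold visits only the one that could actually satisfy the request.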
     
  • Initialize ret at its declaration. It isn't set later in the
    switch-default branch (which should never be taken).

    Signed-off-by: Jan Schmidt
    Signed-off-by: Chris Mason

    Jan Schmidt
     

18 Jan, 2012

2 commits

  • * 'btrfs' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    btrfs: take allocation of ->tree_root into open_ctree()
    btrfs: let ->s_fs_info point to fs_info, not root...
    btrfs: consolidate failure exits in btrfs_mount() a bit
    btrfs: make free_fs_info() call ->kill_sb() unconditional
    btrfs: merge free_fs_info() calls on fill_super failures
    btrfs: kill pointless reassignment of ->s_fs_info in btrfs_fill_super()
    btrfs: make open_ctree() return int
    btrfs: sanitizing ->fs_info, part 5
    btrfs: sanitizing ->fs_info, part 4
    btrfs: sanitizing ->fs_info, part 3
    btrfs: sanitizing ->fs_info, part 2
    btrfs: sanitizing ->fs_info, part 1
    btrfs: fix a deadlock in btrfs_scan_one_device()
    btrfs: fix mount/umount race
    btrfs: get ->kill_sb() of its own
    btrfs: preparation to fixing mount/umount race

    Linus Torvalds
     
  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (62 commits)
    Btrfs: use larger system chunks
    Btrfs: add a delalloc mutex to inodes for delalloc reservations
    Btrfs: space leak tracepoints
    Btrfs: protect orphan block rsv with spin_lock
    Btrfs: add allocator tracepoints
    Btrfs: don't call btrfs_throttle in file write
    Btrfs: release space on error in page_mkwrite
    Btrfs: fix btrfsck error 400 when truncating a compressed
    Btrfs: do not use btrfs_end_transaction_throttle everywhere
    Btrfs: add balance progress reporting
    Btrfs: allow for resuming restriper after it was paused
    Btrfs: allow for canceling restriper
    Btrfs: allow for pausing restriper
    Btrfs: add skip_balance mount option
    Btrfs: recover balance on mount
    Btrfs: save balance parameters to disk
    Btrfs: soft profile changing mode (aka soft convert)
    Btrfs: implement online profile changing
    Btrfs: do not reduce profile in do_chunk_alloc()
    Btrfs: virtual address space subset filter
    ...

    Fix up trivial conflict in fs/btrfs/ioctl.c due to the use of the new
    mnt_drop_write_file() helper.

    Linus Torvalds
     

17 Jan, 2012

26 commits

  • System chunks are very small by default. This makes them slightly
    larger and also fixes the conditional checks to make sure we don't
    allocate a billion of them at once.

    Signed-off-by: Chris Mason

    Chris Mason
     
  • I was using i_mutex for this, but we're getting bogus lockdep warnings by doing
    that and there's no real way to get rid of those, so just stop using i_mutex to
    protect delalloc metadata reservations and use a delalloc mutex instead. This
    shouldn't be contended often at all, only if you are writing and mmap writing to
    the file at the same time. Thanks,

    Signed-off-by: Josef Bacik

    Josef Bacik
     
  • This in addition to a script in my btrfs-tracing tree will help track down space
    leaks when we're getting space left over in block groups on umount. Thanks,

    Signed-off-by: Josef Bacik

    Josef Bacik
     
  • We've been seeing warnings coming out of the orphan commit stuff forever from
    ceph. Turns out it's because we're racing with checking if the orphan block
    reserve is set, because we clear it outside of the spin_lock. So leave the
    normal fastpath checks where they are, but take the spin_lock and _recheck_ to
    make sure we haven't had an orphan block rsv added in the meantime. Then clear
    the root's orphan block rsv and release the lock. With this patch a user said
    the warnings went away and they usually showed up pretty soon after he started
    ceph. Thanks,

    Signed-off-by: Josef Bacik

    Josef Bacik
     
  • I used these tracepoints when figuring out what the cluster stuff was doing, so
    add them to mainline in case we need to profile this stuff again. Thanks,

    Signed-off-by: Josef Bacik

    Josef Bacik
     
  • Btrfs_throttle will make us wait if there is a currently committing transaction
    until we can open new transactions, which is ridiculous since we don't actually
    start any transactions within the file write path anyway, so all this does is
    introduce big latencies if we have a sync/fsync heavy workload going on while
    somebody else is trying to do work. Thanks,

    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik
     
  • If updating the inode gave us an ENOSPC we were just returning in page_mkwrite,
    which is a problem since we make our reservation right before trying to update
    the inode, so fix the out label so that we actually free our reservation.
    Thanks,

    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik
     
  • Steps to reproduce:
    # mkfs.btrfs /dev/sdb5
    # mount /dev/sdb5 -o compress=lzo /mnt
    # dd if=/dev/zero of=/mnt/tmpfile bs=128K count=1
    # sync
    # truncate -s 64K /mnt/tmpfile
    root 5 inode 257 errors 400

    This is caused by a wrong if condition, which checks whether we should
    subtract the bytes of the dropped range from the inode's i_blocks/i_bytes.
    When we truncate a compressed extent, btrfs subtracts the bytes of the whole
    extent, which is wrong. We should subtract the size that we actually
    truncate, whether or not the extent is compressed. Fix it.

    Signed-off-by: Miao Xie
    Signed-off-by: Chris Mason

    Miao Xie
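
The accounting rule can be worked through with the numbers from the reproducer: a 128K extent truncated to 64K should lose 64K, not the whole 128K. The helper below is a hypothetical sketch that ignores block rounding and the on-disk compressed size.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Bytes of the extent [start, start + len) that lie beyond new_size,
 * i.e. the amount to subtract from the inode's byte count on truncate.
 */
uint64_t truncated_bytes(uint64_t start, uint64_t len, uint64_t new_size)
{
    uint64_t end = start + len;

    assert(end >= start); /* the extent range must not overflow */
    if (new_size >= end)
        return 0;         /* extent is entirely kept */
    if (new_size <= start)
        return len;       /* extent is entirely dropped */
    return end - new_size;
}
```

Subtracting `len` unconditionally, as the old condition effectively did for compressed extents, would leave the inode with 0 bytes accounted instead of 64K.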
     
  • A user reported a problem where things like open with O_CREAT would take up to
    30 seconds when he had nfs activity on the same mount. This is because all of
    our quick metadata operations, like create, symlink etc all do
    btrfs_end_transaction_throttle, which if the transaction is blocked will wait
    for the commit to complete before it returns. This adds a ridiculous amount of
    latency and isn't really needed. The normal btrfs_end_transaction will mark the
    transaction as blocked and wake the transaction kthread up if it thinks the
    transaction needs to end (this being in the running out of global reserve space
    scenario), and this is all that is really needed since we've already done
    everything we're going to do, we just need to return. This should help people
    with the latency they were seeing when using synchronous heavy workloads.
    Thanks,

    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik
     
  • Conflicts:
    fs/btrfs/ctree.h
    fs/btrfs/super.c

    Signed-off-by: Chris Mason

    Chris Mason
     
  • Chris Mason
     
  • Conflicts:
    fs/btrfs/volumes.c

    Signed-off-by: Chris Mason

    Chris Mason
     
  • Chris Mason
     
  • Signed-off-by: Ilya Dryomov

    Ilya Dryomov
     
  • Recognize BTRFS_BALANCE_RESUME flag passed from userspace. We use the
    same heuristics used when recovering balance after a crash to try to
    start where we left off last time.

    Signed-off-by: Ilya Dryomov

    Ilya Dryomov
     
  • Implement an ioctl for canceling restriper. Currently we wait until
    relocation of the current block group is finished; in the future this can
    be done by triggering a commit. The balance item is deleted and no record
    of the interrupted balance is kept.

    Signed-off-by: Ilya Dryomov

    Ilya Dryomov
     
  • Implement an ioctl for pausing restriper. This pauses the relocation,
    but balance is still considered to be "in progress": the balance item is
    not deleted, other volume operations cannot be started, etc. If paused
    in the middle of a profile changing operation, we will continue making
    allocations with the target profile.

    Add a hook to close_ctree() to pause restriper and free its data
    structures on unmount. (It's safe to unmount when restriper is in
    "paused" state, we will resume with the same parameters on the next
    mount)

    Signed-off-by: Ilya Dryomov

    Ilya Dryomov
     
  • Since the restriper kthread starts involuntarily on mount and can suck cpu
    and memory bandwidth, add a mount option to forcefully skip it. In that
    case the restriper hangs around in paused state and can be resumed
    from userspace when it's convenient.

    Signed-off-by: Ilya Dryomov

    Ilya Dryomov
     
  • On mount, if balance item is found, resume balance in a separate
    kernel thread.

    Try to be smart to continue roughly where previous balance (or convert)
    was interrupted. For chunk types that were being converted to some
    profile we turn on soft convert, in case of a simple balance we turn on
    usage filter and relocate only less-than-90%-full chunks of that type.
    These are just heuristics but they help quite a bit, and can be improved
    in future.

    Signed-off-by: Ilya Dryomov

    Ilya Dryomov
     
  • Introduce a new btree objectid for storing balance item. The reason is
    to be able to resume restriper after a crash with the same parameters.
    Balance item has a very high objectid and goes into tree of tree roots.

    The key for the new item is as follows:

    [ BTRFS_BALANCE_OBJECTID ; BTRFS_BALANCE_ITEM_KEY ; 0 ]

    Older kernels simply ignore it so it's safe to mount with an older
    kernel and then go back to the newer one.

    Signed-off-by: Ilya Dryomov

    Ilya Dryomov
     
  • When converting from one profile to another with soft mode on, restriper
    won't touch chunks that already have the profile we are converting to.
    This is useful if e.g. half of the FS was converted earlier.

    The soft mode switch is (like every other filter) per-type. This means
    that we can convert for example meta chunks the "hard" way while
    converting data chunks selectively with soft switch.

    Signed-off-by: Ilya Dryomov

    Ilya Dryomov
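
The per-type soft switch can be sketched as a small predicate. The profile enum and `struct type_args` are hypothetical simplifications; the real arguments live in btrfs_balance_args and carry more fields.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative profiles; not the on-disk flag values. */
enum profile { PROFILE_SINGLE, PROFILE_RAID0, PROFILE_RAID1 };

/* Per-chunk-type conversion arguments (one set each for data, meta, ...). */
struct type_args {
    bool convert;        /* conversion requested for this type */
    bool soft;           /* skip chunks already at the target profile */
    enum profile target;
};

/* Decide whether a chunk of this type should be rewritten. */
bool should_convert_chunk(const struct type_args *a, enum profile chunk_profile)
{
    assert(a != NULL);
    if (!a->convert)
        return false;
    if (a->soft && chunk_profile == a->target)
        return false;
    return true;
}
```

Because each type has its own arguments, metadata can be converted the "hard" way while data chunks already at the target profile are left alone.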
     
  • Profile changing is done by launching a balance with
    BTRFS_BALANCE_CONVERT bits set and target fields of respective
    btrfs_balance_args structs initialized. Profile reducing code in this
    case will pick restriper's target profile if it's available instead of
    doing a blind reduce. If target profile is not yet available it goes
    back to a plain reduce.

    Signed-off-by: Ilya Dryomov

    Ilya Dryomov
     
  • Every caller of do_chunk_alloc() feeds it the reduced allocation
    profile, so stop trying to reduce it one more time. Instead check the
    validity of the passed profile.

    Signed-off-by: Ilya Dryomov

    Ilya Dryomov
     
  • Select chunks which have at least one byte located inside a given
    [vstart, vend) virtual address space range.

    Signed-off-by: Ilya Dryomov

    Ilya Dryomov
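
"At least one byte inside [vstart, vend)" is the usual overlap test for half-open intervals. A minimal sketch, with `chunk_in_vrange` as a hypothetical helper name:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * True when the chunk [chunk_offset, chunk_offset + chunk_len) shares
 * at least one byte with the half-open range [vstart, vend).
 */
bool chunk_in_vrange(uint64_t chunk_offset, uint64_t chunk_len,
                     uint64_t vstart, uint64_t vend)
{
    assert(chunk_len > 0);
    return chunk_offset < vend && chunk_offset + chunk_len > vstart;
}
```

Because vend is exclusive, a chunk starting exactly at vend is not selected, and a chunk ending exactly at vstart is not selected either.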
     
  • Select chunks which have at least one byte of at least one stripe
    located on a device with devid X in a given [pstart,pend) physical
    address range.

    This filter only works when devid filter is turned on.

    Signed-off-by: Ilya Dryomov

    Ilya Dryomov
     
  • Relocate chunks which have at least one stripe located on a device with
    devid X.

    Signed-off-by: Ilya Dryomov

    Ilya Dryomov