23 Feb, 2012

4 commits


21 Feb, 2012

1 commit


17 Feb, 2012

6 commits


16 Feb, 2012

1 commit


15 Feb, 2012

9 commits

  • Raid array setup code creates an extent buffer in an unusual way. When the
    PAGE_CACHE_SIZE is > super block size, the extent pages are not marked
    up-to-date, which triggers a WARN_ON in the following
    write_extent_buffer call. Add an explicit up-to-date call to silence the
    warning.

    Signed-off-by: David Sterba

    David Sterba
     
  • On ia64, powerpc64 and sparc64 the bitfield is modified through a RMW cycle and current
    gcc rewrites the adjacent 4B word, which in case of a spinlock or atomic has
    disastrous effect.

    https://lkml.org/lkml/2012/2/1/220

    Signed-off-by: David Sterba

    David Sterba
     
  • We encountered an issue that was easily observable on s/390 systems but
    could really happen anywhere. The timing just seemed to hit reliably
    on s/390 with limited memory.

    The gist is that when an unexpected set_page_dirty() happened, we'd
    run into the BUG() in btrfs_writepage_fixup_worker since it wasn't
    properly set up for delalloc.

    This patch does the following:
    - Performs the missing delalloc in the fixup worker
    - Allows the start hook to return -EBUSY, which informs __extent_writepage
    that it should mark the page skipped and not redirty it. This is
    required since the fixup worker can fail with -ENOSPC and the page
    will have already been redirtied. That causes an Oops in
    drop_outstanding_extents later. Retrying the fixup worker could
    lead to an infinite loop. Deferring the page redirty also saves us
    some cycles since the page would be stuck in a resubmit-redirty loop
    until the fixup worker completes. It's not harmful, just wasteful.
    - If the fixup worker fails, we mark the page and mapping as errored,
    and end the writeback, similar to what we would do had the page
    actually been submitted to writeback.

    Signed-off-by: Jeff Mahoney

    Jeff Mahoney
     
  • load_free_space_cache() has forgotten to free path.

    Signed-off-by: Tsutomu Itoh

    Tsutomu Itoh
     
  • Because scrub enumerates the dev extent tree to find the chunks to scrub,
    it currently finds each DUP chunk twice and also scrubs it twice. This
    patch makes sure that scrub_chunk only checks that part of the chunk the
    dev extent has been found for. This only changes the behaviour for DUP
    chunks.

    Reported-and-tested-by: Stefan Behrens
    Signed-off-by: Arne Jansen

    Arne Jansen
     
  • A user reported a bug in btrfs's trim: we trim 0 bytes after a device
    delete.

    The reproducer:

    $ mkfs.btrfs disk1
    $ mkfs.btrfs disk2
    $ mount disk1 /mnt
    $ fstrim -v /mnt
    $ btrfs device add disk2 /mnt
    $ btrfs device del disk1 /mnt
    $ fstrim -v /mnt

    This is because after we delete the device, the block group may start from
    a non-zero offset, which confuses trim into discarding nothing.

    Reported-by: Lutz Euler
    Signed-off-by: Liu Bo

    Liu Bo
     
  • …led for SEEK_DATA/SEEK_HOLE inquiry

    ENXIO only means "offset beyond EOF" for a SEEK_DATA or SEEK_HOLE inquiry
    in a desired file range, so if the btrfs_get_extent_fiemap() call fails we
    should return the internal error unchanged rather than ENXIO.

    Cc: Dave Chinner <david@fromorbit.com>
    Signed-off-by: Jie Liu <jeff.liu@oracle.com>

    Jeff Liu
     
  • inode_ref_info() returns 1 when the element wasn't found and < 0 on error,
    just like btrfs_search_slot(). In iref_to_path() it's an error when the
    inode ref can't be found, thus we return ERR_PTR(ret) in that case. In order
    to avoid ERR_PTR(1), we now set ret to -ENOENT in that case.

    Signed-off-by: Jan Schmidt

    Jan Schmidt
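
The ERR_PTR(1) hazard described above can be sketched in userspace C. This is a minimal illustration, not the kernel code: ERR_PTR, PTR_ERR and IS_ERR are simplified stand-ins for the kernel helpers, and lookup_ref() and ref_to_path() are hypothetical names that only mimic the return convention of inode_ref_info() and iref_to_path().

```c
#include <assert.h>
#include <errno.h>

/* Simplified userspace stand-ins for the kernel's error-pointer helpers. */
#define MAX_ERRNO 4095

static inline void *ERR_PTR(long error) { return (void *)error; }
static inline long PTR_ERR(const void *ptr) { return (long)ptr; }
static inline int IS_ERR(const void *ptr)
{
	/* Only the top MAX_ERRNO addresses are treated as encoded errors. */
	return (unsigned long)ptr >= (unsigned long)-MAX_ERRNO;
}

/* Hypothetical lookup: < 0 on error, 1 when not found, 0 when found. */
static int lookup_ref(int key)
{
	if (key < 0)
		return -EIO;
	return key == 42 ? 0 : 1;
}

static char found_path[] = "dir/file";

/* Hypothetical caller: "not found" is an error here, so map 1 to -ENOENT. */
static char *ref_to_path(int key)
{
	int ret = lookup_ref(key);

	if (ret > 0)
		ret = -ENOENT;	/* avoid ERR_PTR(1), which IS_ERR() cannot detect */
	if (ret)
		return ERR_PTR(ret);
	return found_path;
}
```

Without the mapping step, a caller testing IS_ERR() on the result would treat (void *)1 as a valid pointer and dereference it.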
     
  • Gracefully fail when trying to mount a BTRFS file system that has a
    sectorsize smaller than PAGE_SIZE.

    On PPC it is possible to build a FS while using a 4k PAGE_SIZE kernel
    then boot into a 64K PAGE_SIZE kernel. Presently open_ctree fails in an
    endless loop and hangs the machine in this situation.

    My debugging has shown this sector size < page size case to be a non-trivial
    situation, and a graceful exit from it would be nice for the
    time being.

    Signed-off-by: Keith Mannthey

    Keith Mannthey
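
A graceful-failure check of the kind described above might look like the following sketch. The function name check_sectorsize() and the choice of -EINVAL are assumptions for illustration; the commit message does not say which error code the mount path returns.

```c
#include <assert.h>
#include <errno.h>

/*
 * Hypothetical mount-time check: refuse a filesystem whose sector size is
 * smaller than the running kernel's page size instead of looping forever.
 */
static int check_sectorsize(unsigned long sectorsize, unsigned long page_size)
{
	if (sectorsize < page_size)
		return -EINVAL;	/* assumed error code */
	return 0;
}
```

For example, a filesystem built with 4k sectors would be rejected on a 64K page-size kernel and accepted on a 4k one.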
     

01 Feb, 2012

1 commit


27 Jan, 2012

11 commits

  • Josef fixed btrfs_page_mkwrite to properly release reserved
    extents if there was an error. But if we fail to get a reservation
    and we fail to dirty the inode (for ENOSPC reasons), we'll end up
    trying to release a reservation we never had.

    This makes sure we only release if we were able to reserve.

    Signed-off-by: Chris Mason

    Chris Mason
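
The "only release if we were able to reserve" rule can be sketched as follows. This is a userspace illustration under stated assumptions: pool_sketch, reserve(), release() and mkwrite_sketch() are hypothetical names, and update_ret stands in for the result of dirtying the inode.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical space pool standing in for the delalloc reservation. */
struct pool_sketch {
	long free_bytes;
	long reserved_bytes;
};

static int reserve(struct pool_sketch *p, long n)
{
	if (p->free_bytes < n)
		return -ENOSPC;
	p->free_bytes -= n;
	p->reserved_bytes += n;
	return 0;
}

static void release(struct pool_sketch *p, long n)
{
	assert(p->reserved_bytes >= n);	/* releasing more than reserved is a bug */
	p->reserved_bytes -= n;
	p->free_bytes += n;
}

/* update_ret simulates the outcome of the inode update after reserving. */
static int mkwrite_sketch(struct pool_sketch *p, long n, int update_ret)
{
	int reserved = 0;
	int ret = reserve(p, n);

	if (!ret) {
		reserved = 1;
		ret = update_ret;
	}
	if (ret) {
		if (reserved)	/* release only what we actually reserved */
			release(p, n);
		return ret;
	}
	return 0;
}
```

Tracking the reservation in a local flag keeps the error path correct whether the failure came from the reservation itself or from the later inode update.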
     
  • If we span a long area in a bitmap we could end up taking a lot of time
    searching to the next free area if we're searching from the original
    window_start, so advance window_start in order to make sure we don't do any
    superfluous searching. Thanks,

    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik
     
  • btree_releasepage is a callback and can be passed unknown gfp flags, which
    may then end up in kmem_cache_alloc called from alloc_extent_state. The slab
    allocator will BUG_ON when the HIGHMEM or DMA32 flag is set.

    This may happen when btrfs is mounted from a loop device, which masks out
    __GFP_IO flag. The check in try_release_extent_state

    if ((mask & GFP_NOFS) == GFP_NOFS)
            mask = GFP_NOFS;

    will not work and passes unfiltered flags further resulting in crash at
    mm/slab.c:2963

    [] cache_alloc_refill+0x3b4/0x5c8
    [] kmem_cache_alloc+0x204/0x294
    [] mempool_alloc+0x52/0x170
    [] alloc_extent_state+0x40/0xd4 [btrfs]
    [] __clear_extent_bit+0x38a/0x4cc [btrfs]
    [] try_release_extent_state+0x9c/0xd4 [btrfs]
    [] btree_releasepage+0x7e/0xd0 [btrfs]
    [] shrink_page_list+0x6a0/0x724
    [] shrink_inactive_list+0x230/0x578
    [] shrink_list+0x6c/0x120
    [] shrink_zone+0x1e2/0x228
    [] shrink_zones+0x90/0x254
    [] do_try_to_free_pages+0xac/0x420
    [] try_to_free_pages+0x13c/0x1b0
    [] __alloc_pages_nodemask+0x5b4/0x9a8
    [] grab_cache_page_write_begin+0x7e/0xe8

    Signed-off-by: David Sterba
    Signed-off-by: Chris Mason

    David Sterba
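
Why the quoted check lets zone flags through can be shown with a small sketch. The X_GFP_* bit values are illustrative only and do not match the real kernel constants; the sketch assumes, as the message implies, that GFP_NOFS includes the __GFP_IO bit. strip_zone_bits() is one possible way to filter the flags and is an assumption, not necessarily the change this commit made.

```c
#include <assert.h>

/* Illustrative flag bits; the real kernel values differ. */
#define X_GFP_HIGHMEM 0x02u
#define X_GFP_DMA32   0x04u
#define X_GFP_WAIT    0x10u
#define X_GFP_IO      0x40u
#define X_GFP_FS      0x80u
#define X_GFP_NOFS    (X_GFP_WAIT | X_GFP_IO)

/* The check quoted above: it clamps the mask only if every NOFS bit is set. */
static unsigned int old_filter(unsigned int mask)
{
	if ((mask & X_GFP_NOFS) == X_GFP_NOFS)
		mask = X_GFP_NOFS;
	return mask;
}

/* Hypothetical hardening: always drop the zone bits the slab allocator rejects. */
static unsigned int strip_zone_bits(unsigned int mask)
{
	return mask & ~(X_GFP_HIGHMEM | X_GFP_DMA32);
}
```

With the IO bit masked out, as on a loop device, the comparison fails and a HIGHMEM bit in the incoming mask survives old_filter() unchanged.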
     
  • When we ran a sysbench test on inline files, ENOSPC errors occurred easily even
    though there was plenty of free disk space that could be allocated for new chunks.

    Reproduce steps:
    # mkfs.btrfs -b $((2 * 1024 * 1024 * 1024))
    # mount /mnt
    # ulimit -n 102400
    # cd /mnt
    # sysbench --num-threads=1 --test=fileio --file-num=81920 \
    > --file-total-size=80M --file-block-size=1K --file-io-mode=sync \
    > --file-test-mode=seqwr prepare
    # sysbench --num-threads=1 --test=fileio --file-num=81920 \
    > --file-total-size=80M --file-block-size=1K --file-io-mode=sync \
    > --file-test-mode=seqwr run

    The cause of this bug:
    We can now reserve more space than is free in the existing chunks, provided
    there is enough free disk space for new chunks. In that case, the space
    allocator should force the allocation of a new chunk when there is no free
    space in the free space cache. But two wrong checks break this operation.

    One is
    if (ret == -ENOSPC && num_bytes > min_alloc_size)
    in btrfs_reserve_extent(). It is wrong: we should try to allocate a new chunk
    even if we fail to allocate free space at the minimum allocatable size.

    The other is
    if (space_info->force_alloc)
    force = space_info->force_alloc;
    in do_chunk_alloc(). It makes the allocator ignore CHUNK_ALLOC_FORCE if someone
    has set ->force_alloc to CHUNK_ALLOC_LIMITED, which causes the ENOSPC error.

    Fix these two wrong checks. We fix the second one by changing the values of
    CHUNK_ALLOC_LIMITED and CHUNK_ALLOC_FORCE so that CHUNK_ALLOC_FORCE is greater
    than CHUNK_ALLOC_LIMITED, since CHUNK_ALLOC_FORCE has higher priority. If the
    value passed in by the caller is greater than ->force_alloc, use the passed
    value.

    Signed-off-by: Miao Xie
    Signed-off-by: Chris Mason

    Miao Xie
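
The second fix, ordering the force levels by priority and keeping the stronger of the two requests, can be sketched as below. The numeric values and the function names are assumptions for illustration; only the relative order of CHUNK_ALLOC_LIMITED and CHUNK_ALLOC_FORCE comes from the message above.

```c
#include <assert.h>

/* Assumed values: a larger number means a higher priority. */
enum chunk_alloc_sketch {
	CHUNK_ALLOC_NO_FORCE = 0,
	CHUNK_ALLOC_LIMITED = 1,
	CHUNK_ALLOC_FORCE = 2,
};

/* Old behaviour: any non-zero ->force_alloc overrides the caller's request. */
static int effective_force_old(int force_alloc, int force)
{
	if (force_alloc)
		force = force_alloc;
	return force;
}

/* New behaviour: ->force_alloc only wins when it is the stronger request. */
static int effective_force_new(int force_alloc, int force)
{
	if (force < force_alloc)
		force = force_alloc;
	return force;
}
```

Under the old logic a caller asking for CHUNK_ALLOC_FORCE is downgraded to CHUNK_ALLOC_LIMITED whenever ->force_alloc holds the weaker value; the new logic keeps the caller's request.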
     
  • xfstests 218 complains that btrfs defrags a file partially:
    After: 1
    Write backwards sync, but contiguous - should defrag to 1 extent
    Before: 10
    -After: 1
    +After: 2

    To fix this, we need to set max_to_defrag count properly.

    Signed-off-by: Liu Bo
    Signed-off-by: Chris Mason

    Liu Bo
     
  • There were 4 warnings on 32-bit builds; they are fixed herewith.

    Signed-off-by: Stefan Behrens
    Signed-off-by: Chris Mason

    Stefan Behrens
     
  • We specifically set window_start in the cluster struct to indicate where the
    cluster starts in a bitmap, but we've been using min_start to indicate where
    we're searching from. This is usually the start of the blockgroup, so
    essentially means we're constantly searching from the start of any bitmap we
    find, which completely negates all the trouble we go to in order to setup a
    cluster. So start using window_start to make sure we actually use the area we
    found. Thanks,

    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik
     
  • A user has encountered a NULL pointer kernel oops in btrfs when
    encountering media errors. The problem has been identified
    as an unhandled NULL pointer returned from find_get_page().
    This modification simply checks for a NULL page, and returns
    with an error if found (the extent_range_uptodate() function
    returns 1 on errors).

    After testing this patch, the user reported that the error with
    the NULL pointer oops was solved. However, there is still a
    remaining problem with a thread becoming stuck in
    wait_on_page_locked(page) in the read_extent_buffer_pages(...)
    function in extent_io.c

    for (i = start_i; i < num_pages; i++) {
    page = extent_buffer_page(eb, i);
    wait_on_page_locked(page);
    if (!PageUptodate(page))
    ret = -EIO;
    }

    This patch leaves the issue with the locked page yet to be resolved.

    Signed-off-by: Mitch Harder
    Signed-off-by: Chris Mason

    Mitch Harder
     
  • wait_log_commit() and wait_for_writer() were using slightly different
    conditions for deciding whether they should call schedule() and whether they
    should continue in the wait loop. Thus it could happen that we busylooped when
    the first condition was not true while the second one was. That is burning CPU
    cycles needlessly and is deadly on UP machines...

    Signed-off-by: Jan Kara
    Signed-off-by: Chris Mason

    Jan Kara
     
  • We have only been checking for min_bytes available in bitmap entries, but we
    won't successfully set up a bitmap cluster unless it has at least bytes in the
    bitmap. In the common case min_bytes is 4k and we want something like 2MB, so
    if there are a bunch of bitmap entries with less than 2MB in them, we'll
    search all of them anyway, which is suboptimal. Fix this check. Thanks,

    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik
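
The effect of the threshold choice can be sketched with a simple filter over bitmap entries. The struct and function names are hypothetical and the sizes are made-up sample data; the sketch only shows how many entries pass when the threshold is min_bytes versus the full requested size.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical bitmap entry: only the free byte count matters here. */
struct bitmap_entry_sketch {
	unsigned long long free_bytes;
};

/* Count the entries that would be searched for a given size threshold. */
static size_t count_candidates(const struct bitmap_entry_sketch *entries,
			       size_t n, unsigned long long threshold)
{
	size_t hits = 0;

	for (size_t i = 0; i < n; i++)
		if (entries[i].free_bytes >= threshold)
			hits++;
	return hits;
}
```

Filtering on the requested size rather than on min_bytes skips entries that could never hold the cluster, so fewer bitmaps are searched.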
     
  • Added initialization with the declaration of ret. It isn't set later on the
    switch-default branch (which should never be taken).

    Signed-off-by: Jan Schmidt
    Signed-off-by: Chris Mason

    Jan Schmidt
     

17 Jan, 2012

7 commits

  • system chunks by default are very small. This makes them slightly
    larger and also fixes the conditional checks to make sure we don't
    allocate a billion of them at once.

    Signed-off-by: Chris Mason

    Chris Mason
     
  • I was using i_mutex for this, but we're getting bogus lockdep warnings by doing
    that and there's no real way to get rid of those, so just stop using i_mutex to
    protect delalloc metadata reservations and use a delalloc mutex instead. This
    shouldn't be contended often at all, only if you are writing and mmap writing to
    the file at the same time. Thanks,

    Signed-off-by: Josef Bacik

    Josef Bacik
     
  • This in addition to a script in my btrfs-tracing tree will help track down space
    leaks when we're getting space left over in block groups on umount. Thanks,

    Signed-off-by: Josef Bacik

    Josef Bacik
     
  • We've been seeing warnings coming out of the orphan commit stuff forever from
    ceph. Turns out it's because we're racing with checking if the orphan block
    reserve is set, because we clear it outside of the spin_lock. So leave the
    normal fastpath checks where they are, but take the spin_lock and _recheck_ to
    make sure we haven't had an orphan block rsv added in the meantime. Then clear
    the root's orphan block rsv and release the lock. With this patch a user said
    the warnings went away and they usually showed up pretty soon after he started
    ceph. Thanks,

    Signed-off-by: Josef Bacik

    Josef Bacik
     
  • I used these tracepoints when figuring out what the cluster stuff was doing, so
    add them to mainline in case we need to profile this stuff again. Thanks,

    Signed-off-by: Josef Bacik

    Josef Bacik
     
  • btrfs_throttle will make us wait if there is a currently committing transaction
    until we can open new transactions, which is ridiculous since we don't actually
    start any transactions within the file write path anyway, so all this does is
    introduce big latencies if we have a sync/fsync heavy workload going on while
    somebody else is trying to do work. Thanks,

    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik
     
  • If updating the inode gave us an ENOSPC we were just returning in page_mkwrite,
    which is a problem since we make our reservation right before trying to update
    the inode, so fix the out label so that we actually free our reservation.
    Thanks,

    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik