28 Mar, 2013

2 commits

  • If we don't find the expected csum item but do find a csum item which
    is adjacent to the specified extent, we should return -EFBIG;
    otherwise we should return -ENOENT. But btrfs_lookup_csum() returned
    -EFBIG even when the csum item was not adjacent to the specified
    extent. Fix it.
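
    A minimal sketch of the corrected decision, assuming illustrative
    names and a simplified view where a csum item covers the byte range
    [item_start, item_end): a covered target is a hit, an item ending
    exactly at the target bytenr (adjacent, so it could be extended)
    yields -EFBIG, and anything else yields -ENOENT.

```c
#include <errno.h>

/* Illustrative sketch only; names and ranges are simplified, not the
 * kernel's.  The item covers bytes [item_start, item_end). */
static int csum_lookup_result(unsigned long long item_start,
                              unsigned long long item_end,
                              unsigned long long target_bytenr)
{
    if (item_start <= target_bytenr && target_bytenr < item_end)
        return 0;                 /* target is covered by the item */
    if (target_bytenr == item_end)
        return -EFBIG;            /* adjacent: item could be extended */
    return -ENOENT;               /* not adjacent: nothing useful found */
}
```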

    Signed-off-by: Miao Xie
    Signed-off-by: Josef Bacik

    Miao Xie
     
  • We reserve space for csums only when we write data into a file; in
    the other cases, such as tree log and log replay, we don't reserve
    anything, so the reservation of the transaction handle can be used
    only for the former. For the latter, we should use the tree's own
    reservation. But btrfs_csum_file_blocks() didn't differentiate
    between these two types of cases. Fix it.

    Signed-off-by: Miao Xie
    Signed-off-by: Josef Bacik

    Miao Xie
     

21 Feb, 2013

1 commit

  • For writes, we also reserve some space for the COW blocks used while
    updating the checksum tree, and we calculate the number of blocks by
    checking whether the number of outstanding bytes that are going to
    need csums requires one more block for csums.

    When we add these checksums to the checksum tree, we use the ordered
    sums list. Every ordered sum contains csums for each sector, and
    we'll first try to look up an existing csum item:
    a) if we don't yet have a proper csum item, then we need to insert
    one;
    b) if we find one but the csum item is not big enough, then we need
    to extend it.

    The point is that we unlock the whole path and then insert or extend,
    so others can sneak in and update the tree.

    Each insert or extend needs to update the tree with COW on, and we
    may need to insert/extend many times.

    That means what we've reserved for updating the checksum tree is NOT
    actually enough.

    The case is even more serious with several write threads running at
    the same time: they can quickly eat up our reserved space and start
    eating into the global reserve pool instead.

    I haven't yet come up with a way to calculate the worst case for
    updating csums, but extending the checksum item as much as possible
    was helpful in my tests.

    The idea behind this is that it reduces the number of times we
    insert/extend, which saves us precious reserved space.
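
    The reservation arithmetic described above can be sketched as
    follows; the constants (4KB sectors, an assumed per-leaf csum
    capacity) are illustrative assumptions, not values read from a real
    filesystem.

```c
/* Hedged sketch: how many csum-tree leaves a run of outstanding bytes
 * might touch.  Both constants are assumptions for illustration. */
#define SECTORSIZE      4096UL
#define CSUMS_PER_LEAF  1000UL   /* assumed leaf capacity in csums */

static unsigned long csum_leaves_needed(unsigned long bytes)
{
    /* one csum per sector, rounded up */
    unsigned long sectors = (bytes + SECTORSIZE - 1) / SECTORSIZE;
    /* pack csums into leaves, rounded up again */
    return (sectors + CSUMS_PER_LEAF - 1) / CSUMS_PER_LEAF;
}
```

    The double round-up is one reason a naive bytes-per-sector estimate
    can come up a block short, and repeated insert/extend cycles under
    concurrency multiply the shortfall.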

    Signed-off-by: Liu Bo
    Signed-off-by: Josef Bacik

    Liu Bo
     

25 Jan, 2013

1 commit

  • I noticed a WARN_ON going off when adding csums because we were going over
    the amount of csum bytes that should have been allowed for an ordered
    extent. This is a leftover from when we used to hold the csums privately
    for direct io, but now we use the normal ordered sum stuff, so we
    need to check whether we've moved on to another extent so that the
    csums are added to the right extent. Without this we could end up
    with csums for
    bytenrs that don't have extents to cover them yet. Thanks,

    Signed-off-by: Josef Bacik

    Josef Bacik
     

13 Dec, 2012

1 commit

  • There are two types of file extents - inline extents and regular
    extents. When we log file extents, we didn't take inline extents into
    account. Fix it.

    Signed-off-by: Miao Xie
    Reviewed-by: Liu Bo
    Signed-off-by: Chris Mason

    Miao Xie
     

09 Oct, 2012

1 commit

  • commit 7ca4be45a0255ac8f08c05491c6add2dd87dd4f8 limited csum items to
    PAGE_CACHE_SIZE. It used min() with incompatible types in 32bit which
    generates warnings:

    fs/btrfs/file-item.c: In function ‘btrfs_csum_file_blocks’:
    fs/btrfs/file-item.c:717: warning: comparison of distinct pointer types lacks a cast

    This uses min_t(u32,) to fix the warnings. u32 seemed reasonable
    because btrfs_root->leafsize is u32 and PAGE_CACHE_SIZE is unsigned
    long.
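
    A portable sketch of the difference (the kernel's min()/min_t() are
    GCC statement-expression macros; this simplified min_t keeps only
    the cast-both-sides idea, and unlike the kernel macro it evaluates
    its arguments twice):

```c
/* Simplified stand-in for the kernel's min_t(): cast both operands to
 * one named type before comparing, so u32 vs unsigned long is legal.
 * Note: this form evaluates x and y twice, unlike the kernel macro. */
#define min_t(type, x, y) (((type)(x) < (type)(y)) ? (type)(x) : (type)(y))

typedef unsigned int u32;

static u32 csum_item_limit(u32 leafsize, unsigned long page_cache_size)
{
    /* a type-checking min(leafsize, page_cache_size) would mix u32 and
     * unsigned long; min_t(u32, ...) compares both as u32 instead */
    return min_t(u32, leafsize, page_cache_size);
}
```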

    Signed-off-by: Zach Brown

    Zach Brown
     

29 Aug, 2012

1 commit

  • We've been allocating a big array for csums instead of storing them in the
    io_tree like we do for buffered reads because previously we were locking the
    entire range, so we didn't have an extent state for each sector of the
    range. But now that we do the range locking as we map the buffers, we
    can limit the mapping length to sectorsize and use the private part of the
    io_tree for our csums. This allows us to avoid an extra memory allocation
    for direct reads which could incur latency. Thanks,

    Signed-off-by: Josef Bacik

    Josef Bacik
     

24 Jul, 2012

2 commits

  • Since the root can be fetched via the BTRFS_I macro directly, we can
    drop an argument from btrfs_is_free_space_inode().

    Signed-off-by: Liu Bo
    Signed-off-by: Josef Bacik

    Liu Bo
     
  • There is weird logic I had to put in place to make sure that when we were
    adding csums that we'd used the delalloc block rsv instead of the global
    block rsv. Part of this meant that we had to free up our transaction
    reservation before we ran the delayed refs since csum deletion happens
    during the delayed ref work. The problem with this is that when we release
    a reservation we will add it to the global reserve if it is not full in
    order to keep us going along longer before we have to force a transaction
    commit. By releasing our reservation before we run delayed refs we don't
    get the opportunity to drain down the global reserve for the work we did, so
    we won't refill it as often. This isn't a problem per se, it just results
    in us possibly committing transactions more and more often, and in rare
    cases could cause those WARN_ON()'s to pop in use_block_rsv because we ran
    out of space in our block rsv.

    This also helps us by holding onto space while the delayed refs run so we
    don't end up with as many people trying to do things at the same time, which
    again will help us not force commits or hit the use_block_rsv warnings.
    Thanks,

    Signed-off-by: Josef Bacik

    Josef Bacik
     

31 Mar, 2012

1 commit

  • Pull btrfs fixes and features from Chris Mason:
    "We've merged in the error handling patches from SuSE. These are
    already shipping in the sles kernel, and they give btrfs the ability
    to abort transactions and go readonly on errors. It involves a lot of
    churn as they clarify BUG_ONs, and remove the ones we now properly
    deal with.

    Josef reworked the way our metadata interacts with the page cache.
    page->private now points to the btrfs extent_buffer object, which
    makes everything faster. He changed it so we write a whole extent
    buffer at a time instead of allowing individual pages to go down,
    which will be important for the raid5/6 code (for the 3.5 merge
    window ;)

    Josef also made us more aggressive about dropping pages for metadata
    blocks that were freed due to COW. Overall, our metadata caching is
    much faster now.

    We've integrated my patch for metadata bigger than the page size.
    This allows metadata blocks up to 64KB in size. In practice 16K and
    32K seem to work best. For workloads with lots of metadata, this cuts
    down the size of the extent allocation tree dramatically and fragments
    much less.

    Scrub was updated to support the larger block sizes, which ended up
    being a fairly large change (thanks Stefan Behrens).

    We also have an assortment of fixes and updates, especially to the
    balancing code (Ilya Dryomov), the back ref walker (Jan Schmidt) and
    the defragging code (Liu Bo)."

    Fixed up trivial conflicts in fs/btrfs/scrub.c that were just due to
    removal of the second argument to k[un]map_atomic() in commit
    7ac687d9e047.

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (75 commits)
    Btrfs: update the checks for mixed block groups with big metadata blocks
    Btrfs: update to the right index of defragment
    Btrfs: do not bother to defrag an extent if it is a big real extent
    Btrfs: add a check to decide if we should defrag the range
    Btrfs: fix recursive defragment with autodefrag option
    Btrfs: fix the mismatch of page->mapping
    Btrfs: fix race between direct io and autodefrag
    Btrfs: fix deadlock during allocating chunks
    Btrfs: show useful info in space reservation tracepoint
    Btrfs: don't use crc items bigger than 4KB
    Btrfs: flush out and clean up any block device pages during mount
    btrfs: disallow unequal data/metadata blocksize for mixed block groups
    Btrfs: enhance superblock sanity checks
    Btrfs: change scrub to support big blocks
    Btrfs: minor cleanup in scrub
    Btrfs: introduce common define for max number of mirrors
    Btrfs: fix infinite loop in btrfs_shrink_device()
    Btrfs: fix memory leak in resolver code
    Btrfs: allow dup for data chunks in mixed mode
    Btrfs: validate target profiles only if we are going to use them
    ...

    Linus Torvalds
     

29 Mar, 2012

1 commit

  • With the big metadata blocks, we can have crc items
    that are much bigger than a page. There are a few
    places where we try to kmalloc memory to hold the
    items during a split.

    Items bigger than 4KB don't really have a huge benefit
    in efficiency, but they do trigger larger order allocations.
    This commit changes the csums to make sure they stay under
    4KB. This is not a format change, just a #define to limit
    huge items.
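
    The cap can be sketched as a clamp applied wherever the item size is
    computed; MAX_CSUM_ITEM_BYTES is an illustrative name, not the
    kernel's actual #define.

```c
/* Hedged sketch: whatever the leaf layout would permit, never build a
 * csum item larger than 4KB, so splits avoid higher-order kmallocs. */
#define MAX_CSUM_ITEM_BYTES 4096U

static unsigned int clamp_csum_item(unsigned int leaf_room)
{
    return leaf_room > MAX_CSUM_ITEM_BYTES ? MAX_CSUM_ITEM_BYTES
                                           : leaf_room;
}
```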

    Signed-off-by: Chris Mason

    Chris Mason
     

22 Mar, 2012

3 commits

  • btrfs currently handles most errors with BUG_ON. This patch is a
    work-in-progress but aims to handle most errors other than internal
    logic errors and ENOMEM more gracefully.

    This iteration prevents most crashes but can run into lockups with
    the page lock on occasion when the timing "works out."

    Signed-off-by: Jeff Mahoney

    Jeff Mahoney
     
  • Unfortunately it isn't enough to just exit here - the kzalloc() happens in a
    loop and the allocated items are added to a linked list whose head is passed
    in from the caller.

    To fix the BUG_ON() and also provide the semantics that the list
    passed in is only modified on success, I create a function-local
    temporary list that we add items to. If no error is met, that list is
    spliced onto the caller's list at the end of the function. Otherwise
    the list is walked and all items are freed before the error value is
    returned.

    I did a simple test on this patch by forcing an error at the kzalloc() point
    and verifying that when this hits (git clone seemed to exercise this), the
    function throws the proper error. Unfortunately but predictably, we later
    hit a BUG_ON(ret) type line that still hasn't been fixed up ;)
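
    The pattern reads roughly like this sketch, which uses a plain
    singly linked list instead of the kernel's list_head; all names are
    illustrative, and fail_at stands in for a kzalloc() failure.

```c
#include <stdlib.h>

struct node { struct node *next; };

/* Build n nodes on a local temporary list; fail_at simulates an
 * allocation failure at that iteration (negative means no failure).
 * On failure the temporary list is freed and *head is left untouched;
 * on success the whole batch is spliced onto *head. */
static int build_and_splice(struct node **head, int n, int fail_at)
{
    struct node *tmp = NULL;

    for (int i = 0; i < n; i++) {
        struct node *nd = (i == fail_at) ? NULL
                                         : malloc(sizeof(*nd));
        if (!nd) {
            while (tmp) {                 /* walk and free on error */
                struct node *next = tmp->next;
                free(tmp);
                tmp = next;
            }
            return -1;                    /* caller's list unmodified */
        }
        nd->next = tmp;
        tmp = nd;
    }
    while (tmp) {                         /* splice on success */
        struct node *next = tmp->next;
        tmp->next = *head;
        *head = tmp;
        tmp = next;
    }
    return 0;
}

/* Helper for exercising the pattern: length of the caller's list after
 * one build_and_splice() call (0 when the call failed). */
static int demo_len_after(int n, int fail_at)
{
    struct node *head = NULL;
    int len = 0;

    if (build_and_splice(&head, n, fail_at) < 0)
        return 0;                         /* error: list stayed empty */
    for (struct node *p = head; p; p = p->next)
        len++;
    return len;
}
```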

    Signed-off-by: Mark Fasheh

    Mark Fasheh
     
  • Signed-off-by: Jeff Mahoney

    Jeff Mahoney
     

06 Nov, 2011

1 commit

  • fs_info is now ~9kb, more than fits into one page. This will cause
    mount failures when memory is too fragmented. The top space consumers
    are the super block structures super_copy and super_for_commit,
    ~2.8kb each. Allocate them dynamically; fs_info will be ~3.5kb.
    (measured on x86_64)

    Add a wrapper for freeing fs_info and all of its dynamically
    allocated members.

    Signed-off-by: David Sterba

    David Sterba
     

28 Jul, 2011

2 commits

  • Now that we are using regular file crcs for the free space cache,
    we can deadlock if we try to read the free_space_inode while we are
    updating the crc tree.

    This commit fixes things by using the commit_root to read the crcs.
    This is safe because the free space cache file would already be
    loaded if that block group had been changed in the current
    transaction.

    Signed-off-by: Chris Mason

    Chris Mason
     
  • The extent_buffers have a very complex interface where
    we use HIGHMEM for metadata and try to cache a kmap mapping
    to access the memory.

    The next commit adds reader/writer locks, and concurrent use
    of this kmap cache would make it even more complex.

    This commit drops the ability to use HIGHMEM with extent buffers,
    and rips out all of the related code.

    Signed-off-by: Chris Mason

    Chris Mason
     

15 Jul, 2011

1 commit

  • This patch fixes many callers of btrfs_alloc_path() which BUG_ON allocation
    failure. All the sites that are fixed in this patch were checked by me to
    be fairly trivial to fix because of at least one of two criteria:

    - Callers of the function catch errors from it already so bubbling the
    error up will be handled.
    - Callers of the function might BUG_ON any nonzero return code, in
    which case there is no behavior change (but we still get to remove a
    BUG_ON)

    The following functions were updated:

    btrfs_lookup_extent, alloc_reserved_tree_block, btrfs_remove_block_group,
    btrfs_lookup_csums_range, btrfs_csum_file_blocks, btrfs_mark_extent_written,
    btrfs_inode_by_name, btrfs_new_inode, btrfs_symlink,
    insert_reserved_file_extent, and run_delalloc_nocow
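
    The conversion pattern can be sketched with a hypothetical caller
    (stub types; real code propagates the error through btrfs's usual
    ret plumbing):

```c
#include <errno.h>
#include <stdlib.h>

struct path_stub { int reserved; };   /* stand-in for struct btrfs_path */

/* simulate_oom stands in for an allocation failure */
static struct path_stub *alloc_path(int simulate_oom)
{
    return simulate_oom ? NULL : calloc(1, sizeof(struct path_stub));
}

static int do_lookup(int simulate_oom)
{
    struct path_stub *path = alloc_path(simulate_oom);

    if (!path)
        return -ENOMEM;   /* was: BUG_ON(!path) */
    /* ... search the tree using path ... */
    free(path);
    return 0;
}
```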

    Signed-off-by: Mark Fasheh

    Mark Fasheh
     

12 May, 2011

1 commit

  • This adds an initial implementation for scrub. It works quite
    straightforwardly. The usermode side issues an ioctl for each device
    in the fs. For each device, it enumerates the allocated device
    chunks. For
    each chunk, the contained extents are enumerated and the data checksums
    fetched. The extents are read sequentially and the checksums verified.
    If an error occurs (checksum or EIO), a good copy is searched for. If
    one is found, the bad copy will be rewritten.
    All enumerations happen from the commit roots. During a transaction
    commit, the scrubs get paused and afterwards continue from the new
    roots.

    This commit is based on the series originally posted to linux-btrfs
    with some improvements that resulted from comments from David Sterba,
    Ilya Dryomov and Jan Schmidt.

    Signed-off-by: Arne Jansen

    Arne Jansen
     

25 Apr, 2011

1 commit

  • There's a potential problem on 32bit systems when we exhaust 32bit
    inode numbers and start to allocate big inode numbers, because btrfs
    uses inode->i_ino in many places.

    So here we always use BTRFS_I(inode)->location.objectid, which is an
    u64 variable.

    There are two exceptions where BTRFS_I(inode)->location.objectid !=
    inode->i_ino: the btree inode (0 vs 1) and empty subvol dirs (256 vs
    2); inode->i_ino will still be used in those cases.

    Another reason to make this change is that I'm going to use a special
    inode to save the free ino cache, and its inode number must be >
    (u64)-256.
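
    A hedged sketch of the resulting helper (mainline later grew a
    btrfs_ino() helper along these lines; the special-case test below is
    a simplification of the two exceptions above, not the kernel's
    actual check):

```c
/* Illustrative only: prefer the 64-bit objectid, and fall back to
 * i_ino for the btree inode and empty subvol dirs.  The real kernel's
 * test differs; this just shows the shape of the fallback. */
struct inode_stub {
    unsigned long       i_ino;      /* 32 bits on 32bit systems */
    unsigned long long  objectid;   /* always 64 bits */
};

static unsigned long long stub_ino(const struct inode_stub *inode)
{
    if (inode->objectid == 0 || inode->objectid == 256)
        return inode->i_ino;        /* the two exceptional inodes */
    return inode->objectid;
}

/* convenience wrapper for exercising the helper */
static unsigned long long demo_ino(unsigned long i_ino,
                                   unsigned long long objectid)
{
    struct inode_stub in = { i_ino, objectid };
    return stub_ino(&in);
}
```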

    Signed-off-by: Li Zefan

    Li Zefan
     

29 Jan, 2011

2 commits

  • Got a report of a box panicking because we got a NULL eb in
    read_extent_buffer. His fs was borked and btrfs_search_path returned
    EIO, but we don't check for errors, so the box panicked. Yes, I know
    this will just make something higher up the stack panic, but that's a
    problem for future Josef. Thanks,

    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik
     
  • To make btrfs more stable, add several missing, necessary memory
    allocation checks, and when there is no memory, return the proper
    errno.

    We've checked that some of those -ENOMEM errors will be returned to
    userspace, some will be caught by BUG_ON() in the upper callers, and
    none will be ignored silently.

    Signed-off-by: Liu Bo
    Signed-off-by: Chris Mason

    liubo
     

25 May, 2010

2 commits

  • This provides basic DIO support for reading and writing. It does not
    do the work to recover from mismatched checksums; that will come
    later. A few design changes have been made from Jim's code (sorry
    Jim!)

    1) Use the generic direct-io code. Jim originally rewrote all the
    generic DIO code in order to account for all of BTRFS's oddities, but
    thanks to that work it seems like the best bet is to just ignore
    compression and such and opt to fall back on buffered IO.

    2) Fall back on buffered IO for compressed or inline extents. Jim's
    code did its own buffering to make dio with compressed extents work.
    Now we just fall back onto normal buffered IO.

    3) Use ordered extents for the writes so that all of the

    lock_extent()
    lookup_ordered()

    type checks continue to work.

    4) Do the lock_extent() lookup_ordered() loop in readpage so we don't race with
    DIO writes.

    I've tested this with fsx and everything works great. This patch depends on my
    dio and filemap.c patches to work. Thanks,

    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik
     
  • Previous patches make the allocator return -ENOSPC if there is no
    unreserved free metadata space. This patch updates the tree log code
    and various other places to propagate/handle the ENOSPC error.

    Signed-off-by: Yan Zheng
    Signed-off-by: Chris Mason

    Yan, Zheng
     

30 Mar, 2010

1 commit

  • …it slab.h inclusion from percpu.h

    percpu.h is included by sched.h and module.h and thus ends up being
    included when building most .c files. percpu.h includes slab.h which
    in turn includes gfp.h making everything defined by the two files
    universally available and complicating inclusion dependencies.

    The percpu.h -> slab.h dependency is about to be removed. Prepare for
    this change by updating users of gfp and slab facilities to include
    those headers directly instead of assuming availability. As this
    conversion needs to touch a large number of source files, the
    following script was used as the basis of the conversion.

    http://userweb.kernel.org/~tj/misc/slabh-sweep.py

    The script does the following.

    * Scan files for gfp and slab usages and update includes such that
    only the necessary includes are there, i.e. if only gfp is used,
    gfp.h; if slab is used, slab.h.

    * When the script inserts a new include, it looks at the include
    blocks and tries to put the new include so that its order conforms
    to its surroundings. It's put in the include block which contains
    core kernel includes, in the same order that the rest are ordered -
    alphabetical, Christmas tree, rev-Xmas-tree, or at the end if there
    doesn't seem to be any matching order.

    * If the script can't find a place to put a new include (mostly
    because the file doesn't have a fitting include block), it prints out
    an error message indicating which .h file needs to be added to the
    file.

    The conversion was done in the following steps.

    1. The initial automatic conversion of all .c files updated slightly
    over 4000 files, deleting around 700 includes and adding ~480 gfp.h
    and ~3000 slab.h inclusions. The script emitted errors for ~400
    files.

    2. Each error was manually checked. Some didn't need the inclusion,
    some needed manual addition while adding it to implementation .h or
    embedding .c file was more appropriate for others. This step added
    inclusions to around 150 files.

    3. The script was run again and the output was compared to the edits
    from #2 to make sure no file was left behind.

    4. Several build tests were done and a couple of problems were fixed.
    e.g. lib/decompress_*.c used malloc/free() wrappers around slab
    APIs requiring slab.h to be added manually.

    5. The script was run on all .h files but without automatically
    editing them as sprinkling gfp.h and slab.h inclusions around .h
    files could easily lead to inclusion dependency hell. Most gfp.h
    inclusion directives were ignored as stuff from gfp.h was usually
    wildly available and often used in preprocessor macros. Each
    slab.h inclusion directive was examined and added manually as
    necessary.

    6. percpu.h was updated not to include slab.h.

    7. Build tests were done on the following configurations and failures
    were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
    distributed build env didn't work with gcov compiles) and a few
    more options had to be turned off depending on archs to make things
    build (like ipr on powerpc/64 which failed due to missing writeq).

    * x86 and x86_64 UP and SMP allmodconfig and a custom test config.
    * powerpc and powerpc64 SMP allmodconfig
    * sparc and sparc64 SMP allmodconfig
    * ia64 SMP allmodconfig
    * s390 SMP allmodconfig
    * alpha SMP allmodconfig
    * um on x86_64 SMP allmodconfig

    8. percpu.h modifications were reverted so that it could be applied as
    a separate patch and serve as bisection point.

    Given the fact that I had only a couple of failures from tests on step
    6, I'm fairly confident about the coverage of this conversion patch.
    If there is a breakage, it's likely to be something in one of the
    arch headers, which should be easily discoverable on most builds of
    the specific arch.

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>

    Tejun Heo
     

25 Mar, 2009

1 commit

  • btrfs_mark_buffer_dirty would set dirty bits in the extent_io tree
    for the buffers it was dirtying. This may require a kmalloc and it
    was not atomic. So, anyone who called btrfs_mark_buffer_dirty had to
    set any btree locks they were holding to blocking first.

    This commit changes dirty tracking for extent buffers to just use a flag
    in the extent buffer. Now that we have one and only one extent buffer
    per page, this can be safely done without losing dirty bits along the way.

    This also introduces a path->leave_spinning flag that callers of
    btrfs_search_slot can use to indicate they will properly deal with a
    path returned where all the locks are spinning instead of blocking.

    Many of the btree search callers now expect spinning paths,
    resulting in better btree concurrency overall.

    Signed-off-by: Chris Mason

    Chris Mason
     

07 Jan, 2009

1 commit

  • This patch contains the following changes.

    1) Limit the max size of the btrfs_ordered_sum structure to
    PAGE_SIZE. This struct is kmalloced, so we want to keep it reasonably
    sized.

    2) Replace copy_extent_csums with btrfs_lookup_csums_range. This
    removes duplicated code in tree-log.c.

    3) Remove replay_one_csum. csum items are replayed at the same time as
    replaying file extents. This guarantees we only replay useful csums.

    4) nbytes accounting fix.
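
    For point 1, the cap translates into a fixed number of per-sector
    csums per structure; this arithmetic sketch uses assumed values (4KB
    page, 4-byte crc32c, a guessed 64-byte struct header), not the real
    struct layout.

```c
/* Hedged sketch: how many csums fit once btrfs_ordered_sum is capped
 * at PAGE_SIZE.  All three constants are assumptions for illustration. */
#define PAGE_SIZE_BYTES  4096UL
#define HEADER_BYTES       64UL   /* assumed struct overhead */
#define CSUM_BYTES          4UL   /* crc32c per sector */

static unsigned long max_csums_per_ordered_sum(void)
{
    return (PAGE_SIZE_BYTES - HEADER_BYTES) / CSUM_BYTES;
}
```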

    Signed-off-by: Yan Zheng

    Yan Zheng
     

06 Jan, 2009

1 commit