09 Oct, 2012

3 commits

  • When building btrfs from the kernel tree, the compiler reports:

    fs/btrfs/extent_io.h:281: warning: 'extent_buffer_page' declared inline after being called
    fs/btrfs/extent_io.h:281: warning: previous declaration of 'extent_buffer_page' was here
    fs/btrfs/extent_io.h:280: warning: 'num_extent_pages' declared inline after being called
    fs/btrfs/extent_io.h:280: warning: previous declaration of 'num_extent_pages' was here

    because of the wrong declaration of inline functions.

    Signed-off-by: Robin Dong

    Robin Dong
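
    The warning pattern here is a function that is used before any inline
    declaration of it has been seen. A standalone illustration of the pattern
    and its fix (not the btrfs code itself) might look like this:

      /* illustration only -- not fs/btrfs code */
      static int num_pages(int len);            /* declared without inline   */

      static int total_bytes(int len)
      {
              return num_pages(len) * 4096;     /* first call happens here   */
      }

      /* This is the shape behind warnings like
       * "'num_pages' declared inline after being called": the inline hint
       * shows up only after the function has already been used. */
      static inline int num_pages(int len)
      {
              return (len + 4095) / 4096;
      }

      /* Fix: declare it "static inline" up front as well, or move the
       * inline definition above its first caller. */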
     
  • Every time we write out dirty pages we search for an offset in the tree,
    convert the bits in the state, and then when we wait we search for the
    offset again and clear the bits. So for every dirty range in the io tree we
    are doing 4 rb searches, which is suboptimal. With this patch we are only
    doing 2 searches for every cycle (modulo weird things happening). Thanks,

    Signed-off-by: Josef Bacik

    Josef Bacik
     
  • There are a couple of scenarios where farming metadata csumming off to an async
    thread doesn't help. The first is if our processor supports crc32c, in
    which case the csumming will be fast and so the overhead of the async model
    is not worth the cost. The other case is for our tree log. We will be
    making that stuff dirty and writing it out and waiting for it immediately.
    Even with software crc32c this gives me a ~15% increase in speed with O_SYNC
    workloads. Thanks,

    Signed-off-by: Josef Bacik

    Josef Bacik
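
    A rough sketch of the decision being described, using made-up helper and
    struct names rather than the real btrfs submit path:

      #include <stdbool.h>

      struct toy_block { bool belongs_to_log_tree; };

      bool cpu_has_hw_crc32c(void)                { return true; }  /* stub */
      void csum_and_submit(struct toy_block *b)   { (void)b; }      /* stub */
      void queue_async_csum(struct toy_block *b)  { (void)b; }      /* stub */

      void submit_metadata_write(struct toy_block *b)
      {
              /* Async csumming only pays off when the checksum is expensive
               * (software crc32c) and nobody is about to wait on the block.
               * Hardware crc32c makes the csum cheap, and tree-log blocks
               * are written and waited on immediately, so handle both of
               * those inline instead of handing them to a worker thread. */
              if (cpu_has_hw_crc32c() || b->belongs_to_log_tree)
                      csum_and_submit(b);
              else
                      queue_async_csum(b);
      }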
     

02 Oct, 2012

1 commit

  • We're going to use this flag EXTENT_DEFRAG to indicate which ranges
    belong to a defrag operation, so that we can implement snapshot-aware defrag:

    We set the EXTENT_DEFRAG flag when dirtying the extents that need to be
    defragmented, so that later the writeback thread can differentiate
    between normal writeback and writeback started by defragmentation.

    Original-Signed-off-by: Li Zefan
    Signed-off-by: Liu Bo

    Liu Bo
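
    A toy illustration of how the flag is meant to be used; the bit values and
    helpers below are invented for the example and are not the real
    extent-state definitions:

      #include <stdio.h>

      #define TOY_EXTENT_DIRTY   (1u << 0)   /* made-up bit values */
      #define TOY_EXTENT_DEFRAG  (1u << 1)

      struct toy_range { unsigned int flags; };

      void defrag_mark_dirty(struct toy_range *r)
      {
              /* defrag dirties the range and tags it at the same time ... */
              r->flags |= TOY_EXTENT_DIRTY | TOY_EXTENT_DEFRAG;
      }

      void writeback_range(struct toy_range *r)
      {
              /* ... so the writeback path can tell the two cases apart */
              if (r->flags & TOY_EXTENT_DEFRAG)
                      printf("writeback started by defragmentation\n");
              else
                      printf("normal writeback\n");
              r->flags &= ~(TOY_EXTENT_DIRTY | TOY_EXTENT_DEFRAG);
      }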
     

30 May, 2012

1 commit

  • We noticed that the ordered extent completion doesn't really rely on having
    a page and that it could be done independently of ending the writeback on a
    page. This patch makes us not do the threaded endio stuff for normal
    buffered writes and direct writes so we can end page writeback as soon as
    possible (in irq context) and only start threads to do the ordered work when
    it is actually done. Compression needs to be reworked some to take
    advantage of this as well, but atm it has to do a find_get_page in its endio
    handler so it must be done in its own thread. This makes direct writes
    quite a bit faster. Thanks,

    Signed-off-by: Josef Bacik

    Josef Bacik
     

19 Apr, 2012

1 commit

  • A user reported a panic where we were trying to fix a bad mirror but the
    mirror number we were giving was 0, which is invalid. This is because we
    don't do the transid verification until after the read, so as far as the
    read code is concerned the read was a success. So instead store the mirror
    we read from so that if there is some failure post read we know which mirror
    to try next and which mirror needs to be fixed if we find a good copy of the
    block. Thanks,

    Signed-off-by: Josef Bacik

    Josef Bacik
     

27 Mar, 2012

5 commits

  • Since we need to read and write extent buffers in their entirety we can't use
    the normal bio_readpage_error stuff since it only works on a per page basis. So
    instead make it so that if we see an io error in endio we just mark the eb as
    having an IO error and then in btree_read_extent_buffer_pages we will manually
    try other mirrors and then overwrite the bad mirror if we find a good copy.
    This works with larger than page size blocks. Thanks,

    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik
     
  • This patch simplifies how we track our extent buffers. Previously we could exit
    writepages with only having written half of an extent buffer, which meant we had
    to track the state of the pages and the state of the extent buffers differently.
    Now we only read in entire extent buffers and write out entire extent buffers;
    this allows us to simply set bits in our bflags to indicate the state of the eb
    and we no longer have to do things like track uptodate with our iotree. Thanks,

    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik
     
  • Because btrfs COWs, we can end up with extent buffers that are no longer
    necessary just sitting around in memory. Instead of those stale pages being
    evicted, we could end up evicting things we actually care about. Thus we have
    free_extent_buffer_stale for use when we are freeing tree blocks. This will
    make it so that the ref for the eb being in the radix tree is dropped as soon as
    possible and then is freed when the refcount hits 0 instead of waiting to be
    released by releasepage. Thanks,

    Signed-off-by: Josef Bacik

    Josef Bacik
     
  • We spend a lot of time looking up extent buffers from pages when we could just
    store the pointer to the eb the page is associated with in page->private. This
    patch does just that, and it makes things a little simpler and reduces a bit of
    CPU overhead involved with doing metadata IO. Thanks,

    Signed-off-by: Josef Bacik

    Josef Bacik
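
    A kernel-flavored sketch of the idea; set_page_private() and page_private()
    are the stock page helpers, everything else here is an illustrative
    stand-in rather than the actual patch:

      #include <linux/mm.h>   /* struct page, set_page_private(), page_private() */

      struct extent_buffer;   /* opaque here; the real one lives in extent_io.h */

      /* remember which eb a metadata page belongs to ... */
      void toy_attach_eb(struct page *page, struct extent_buffer *eb)
      {
              set_page_private(page, (unsigned long)eb);
      }

      /* ... so the IO paths can get it back without a radix tree lookup */
      struct extent_buffer *toy_eb_of_page(struct page *page)
      {
              return (struct extent_buffer *)page_private(page);
      }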
     
  • A few years ago the btrfs code to support blocks larger than
    the page size was disabled to fix a few corner cases in the
    page cache handling. This fixes the code to properly support
    large metadata blocks again.

    Since current kernels will crash early and often with larger
    metadata blocks, this adds an incompat bit so that older kernels
    can't mount it.

    This also does away with different blocksizes for nodes and leaves.
    You get a single block size for all tree blocks.

    Signed-off-by: Chris Mason

    Chris Mason
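
    A toy model of the incompat check at mount time; the names and bit values
    are invented for the sketch and differ from the real btrfs flags:

      #include <stdint.h>
      #include <stdio.h>

      #define TOY_INCOMPAT_BIG_METADATA  (1ULL << 5)   /* made-up value */

      struct toy_super { uint64_t incompat_flags; };

      int toy_mount(const struct toy_super *sb, uint64_t supported)
      {
              uint64_t unknown = sb->incompat_flags & ~supported;

              if (unknown) {
                      /* an older kernel sees a bit it does not understand and
                       * refuses the mount instead of corrupting the fs */
                      fprintf(stderr, "unknown incompat flags 0x%llx\n",
                              (unsigned long long)unknown);
                      return -1;
              }
              return 0;
      }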
     

15 Feb, 2012

1 commit

  • We encountered an issue that was easily observable on s/390 systems but
    could really happen anywhere. The timing just seemed to hit reliably
    on s/390 with limited memory.

    The gist is that when an unexpected set_page_dirty() happened, we'd
    run into the BUG() in btrfs_writepage_fixup_worker since it wasn't
    properly set up for delalloc.

    This patch does the following:
    - Performs the missing delalloc in the fixup worker.
    - Allows the start hook to return -EBUSY, which informs __extent_writepage
      that it should mark the page skipped and not redirty it. This is
      required since the fixup worker can fail with -ENOSPC and the page
      will have already been redirtied; that would cause an oops in
      drop_outstanding_extents later. Retrying the fixup worker could
      lead to an infinite loop. Deferring the page redirty also saves us
      some cycles, since the page would otherwise be stuck in a
      resubmit-redirty loop until the fixup worker completes. It's not
      harmful, just wasteful.
    - If the fixup worker fails, we mark the page and mapping as errored
      and end the writeback, similar to what we would do had the page
      actually been submitted to writeback.

    Signed-off-by: Jeff Mahoney

    Jeff Mahoney
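
    A highly simplified sketch of the new control flow; every name below is a
    stand-in for the description above, not the actual btrfs code:

      #include <errno.h>

      struct toy_page { int skipped; };

      int toy_start_hook(struct toy_page *page);   /* may return -EBUSY */
      int toy_submit(struct toy_page *page);

      int toy_extent_writepage(struct toy_page *page)
      {
              int ret = toy_start_hook(page);

              if (ret == -EBUSY) {
                      /* the fixup worker owns this page: mark it skipped and
                       * do not redirty it here, so an -ENOSPC failure in the
                       * worker cannot leave us looping or oopsing later */
                      page->skipped = 1;
                      return 0;
              }
              if (ret)
                      return ret;

              return toy_submit(page);
      }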
     

20 Oct, 2011

2 commits

  • While looking for a performance regression a user was complaining about, I
    noticed that we had a regression with the varmail test of filebench. This was
    introduced by

    0d10ee2e6deb5c8409ae65b970846344897d5e4e

    which keeps us from calling writepages in writepage. This is a correct
    change; however, that writepages call happened to help the varmail test
    because it wrote dirty pages out in larger chunks. This is largely to do
    with how we write out dirty pages for each transaction. If you run filebench with

    load varmail
    set $dir=/mnt/btrfs-test
    run 60

    prior to that commit you would get ~1420 ops/second, but with it you get
    ~1200 ops/second, a 16% decrease. Since we know the range of dirty pages we
    want to write out, don't write them out in one-page chunks; write them out
    in ranges. To do this we call filemap_fdatawrite_range() on the range of
    bytes, then convert the DIRTY extents to NEED_WAIT extents. When we then
    call btrfs_wait_marked_extents() we only have to filemap_fdatawait_range()
    on that range and clear the NEED_WAIT extents. This doesn't get us back to
    our original speed, but I've been seeing ~1380 ops/second, which is only
    about a 3% regression. That is acceptable given that the original commit
    greatly reduces our latency to begin with. Thanks,

    Signed-off-by: Josef Bacik

    Josef Bacik
     
  • If I have a range where I know a certain bit is set and I want to change it to
    another bit, the only option I have is to call set and then clear the bit, which will result
    in 2 tree searches. This is inefficient, so introduce convert_extent_bit which
    will go through and set the bit I want and clear the old bit I don't want.
    Thanks,

    Signed-off-by: Josef Bacik

    Josef Bacik
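
    A sketch of how the two changes above fit together (kernel-flavored and
    illustrative only: filemap_fdatawrite_range()/filemap_fdatawait_range()
    are the stock helpers, while toy_convert_bits()/toy_clear_bits() stand in
    for the new convert_extent_bit() and the existing clear path):

      #include <linux/fs.h>   /* struct address_space, filemap_fdata*_range() */

      /* stand-ins for the io-tree operations described above */
      void toy_convert_bits(struct address_space *mapping, loff_t start, loff_t end);
      void toy_clear_bits(struct address_space *mapping, loff_t start, loff_t end);

      int toy_write_and_wait_range(struct address_space *mapping,
                                   loff_t start, loff_t end)
      {
              int ret;

              /* push the whole dirty range at once, not page-sized chunks */
              ret = filemap_fdatawrite_range(mapping, start, end);
              if (ret)
                      return ret;

              /* one search-and-convert pass: DIRTY cleared, NEED_WAIT set */
              toy_convert_bits(mapping, start, end);

              ret = filemap_fdatawait_range(mapping, start, end);
              toy_clear_bits(mapping, start, end);   /* drop NEED_WAIT */
              return ret;
      }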
     

02 Oct, 2011

2 commits

  • Add a READAHEAD extent buffer flag.
    Add a function to trigger a read with this flag set.

    Changes v2:
    - use extent buffer flags instead of extent state flags

    Changes v5:
    - adapt to changed read_extent_buffer_pages interface
    - don't return eb from reada_tree_block_flagged if it has CORRUPT flag set

    Signed-off-by: Arne Jansen

    Arne Jansen
     
  • read_extent_buffer_pages currently has two modes: either trigger a read
    without waiting for anything, or wait for the I/O to finish. The former
    also bails when it's unable to lock the page. This patch adds an
    additional parameter to allow it to block on the page lock but not wait
    for completion.

    Changes v5:
    - merge the 2 wait parameters into one and define WAIT_NONE, WAIT_COMPLETE and
    WAIT_PAGE_LOCK

    Change v6:
    - fix bug introduced in v5

    Signed-off-by: Arne Jansen

    Arne Jansen
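
    The three modes from the v5 note, written out as an enum; the names come
    from the message above and the comments paraphrase their meaning:

      enum toy_eb_read_wait {
              WAIT_NONE,        /* trigger the read, skip pages we cannot lock */
              WAIT_COMPLETE,    /* block until the read I/O has finished       */
              WAIT_PAGE_LOCK,   /* block on the page lock, but not on the I/O  */
      };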
     

29 Sep, 2011

3 commits

  • The raid-retry code in inode.c can be generalized so that it works for
    metadata as well. Thus, this patch moves it to extent_io.c and makes the
    raid-retry code a raid-repair code.

    Repair works this way: whenever a read error occurs and we have more
    mirrors to try, note the failed mirror, and retry another. If we find a
    good one, check if we did note a failure earlier and if so, do not allow
    the read to complete until after the bad sector was written with the good
    data we just fetched. As we have the extent locked while reading, no one
    can change the data in between.

    Signed-off-by: Jan Schmidt

    Jan Schmidt
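
    A toy user-space model of the retry/repair loop described above; all of the
    helpers are invented for the illustration, and real btrfs mirror numbers
    start at 1 rather than 0:

      #include <stdbool.h>

      bool toy_read_mirror(int mirror, void *buf);            /* true = good copy */
      void toy_rewrite_mirror(int mirror, const void *buf);   /* repair bad copy  */
      int  toy_num_mirrors(void);

      bool toy_read_with_repair(void *buf)
      {
              int failed = -1;

              for (int mirror = 0; mirror < toy_num_mirrors(); mirror++) {
                      if (!toy_read_mirror(mirror, buf)) {
                              if (failed < 0)
                                      failed = mirror;   /* note the failed mirror */
                              continue;                  /* and retry another one  */
                      }
                      /* good data found: rewrite the bad copy before letting the
                       * read complete (the extent stays locked the whole time) */
                      if (failed >= 0)
                              toy_rewrite_mirror(failed, buf);
                      return true;
              }
              return false;   /* no mirror had a good copy */
      }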
     
  • This removes a FIXME comment and introduces the first part of nodatasum
    fixup: It gets the corresponding inode for a logical address and triggers a
    regular readpage for the corrupted sector.

    Once we have on-the-fly error correction our error will be automatically
    corrected. The correction code is expected to clear the newly introduced
    EXTENT_DAMAGED flag, making scrub report that error as "corrected" instead
    of "uncorrectable" eventually.

    Signed-off-by: Jan Schmidt

    Jan Schmidt
     
  • Currently, extent_read_full_page always assumes we are trying to read mirror
    0, which generally is the best we can do. To add flexibility, pass it as a
    parameter. This will be needed by scrub fixup code.

    Signed-off-by: Jan Schmidt

    Jan Schmidt
     

28 Jul, 2011

2 commits

  • The btrfs metadata btree is the source of significant
    lock contention, especially in the root node. This
    commit changes our locking to use a reader/writer
    lock.

    The lock is built on top of rw spinlocks, and it
    extends the lock tracking to remember if we have a
    read lock or a write lock when we go to blocking. Atomics
    count the number of blocking readers or writers at any
    given time.

    It removes all of the adaptive spinning from the old code
    and uses only the spinning/blocking hints inside of btrfs
    to decide when it should continue spinning.

    In read heavy workloads this is dramatically faster. In write
    heavy workloads we're still faster because of less contention
    on the root node lock.

    We suffer slightly in dbench because we schedule more often
    during write locks, but all other benchmarks so far are improved.

    Signed-off-by: Chris Mason

    Chris Mason
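
    A stripped-down sketch of the locking state being described (kernel-style
    and illustrative only; the real extent_buffer fields differ in detail):

      #include <linux/spinlock.h>   /* rwlock_t          */
      #include <linux/atomic.h>     /* atomic_t          */
      #include <linux/wait.h>       /* wait_queue_head_t */

      /* per tree-block locking state: an rw spinlock for the short spinning
       * case, plus atomics that record how many holders have gone blocking */
      struct toy_tree_lock {
              rwlock_t          lock;              /* spinning readers/writers */
              atomic_t          blocking_readers;  /* readers gone blocking    */
              atomic_t          blocking_writers;  /* writers gone blocking    */
              wait_queue_head_t read_lock_wq;      /* waiters for read locks   */
              wait_queue_head_t write_lock_wq;     /* waiters for write locks  */
      };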
     
  • The extent_buffers have a very complex interface where
    we use HIGHMEM for metadata and try to cache a kmap mapping
    to access the memory.

    The next commit adds reader/writer locks, and concurrent use
    of this kmap cache would make it even more complex.

    This commit drops the ability to use HIGHMEM with extent buffers,
    and rips out all of the related code.

    Signed-off-by: Chris Mason

    Chris Mason
     

11 Jun, 2011

1 commit

  • Reorder extent_buffer to remove 8 bytes of alignment padding on 64 bit
    builds. This shrinks its size to 128 bytes allowing it to fit into one
    fewer cache lines and allows more objects per slab in its kmem_cache.

    slabinfo extent_buffer reports :-

    before:-
      Sizes (bytes)          Slabs
      ------------------------------------
      Object :  136          Total  :  123
      SlabObj:  136          Full   :  121
      SlabSiz: 4096          Partial:    0
      Loss   :    0          CpuSlab:    2
      Align  :    8          Objects:   30

    after :-
      Object :  128          Total  :    4
      SlabObj:  128          Full   :    2
      SlabSiz: 4096          Partial:    0
      Loss   :    0          CpuSlab:    2
      Align  :    8          Objects:   32

    Signed-off-by: Richard Kennedy
    Signed-off-by: Chris Mason

    richard kennedy
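
    A standalone illustration of the kind of reordering win being measured here;
    the structs are examples, not the real extent_buffer layout:

      #include <stdio.h>

      struct before_reorder {          /* on a typical 64-bit (LP64) ABI:     */
              void         *a;         /* offset  0, 8 bytes                  */
              unsigned int  b;         /* offset  8, 4 bytes + 4 bytes pad    */
              void         *c;         /* offset 16, 8 bytes                  */
              unsigned int  d;         /* offset 24, 4 bytes + 4 bytes pad    */
      };                               /* sizeof == 32                        */

      struct after_reorder {
              void         *a;         /* offset  0                           */
              void         *c;         /* offset  8                           */
              unsigned int  b;         /* offset 16                           */
              unsigned int  d;         /* offset 20                           */
      };                               /* sizeof == 24: 8 bytes saved         */

      int main(void)
      {
              printf("before: %zu bytes, after: %zu bytes\n",
                     sizeof(struct before_reorder), sizeof(struct after_reorder));
              return 0;
      }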
     

06 May, 2011

1 commit

  • Remove static and global declarations and/or definitions. Reduces size
    of btrfs.ko by ~3.4kB.

      text   data  bss     dec    hex  filename
    402081   7464  200  409745  64091  btrfs.ko.base
    398620   7144  200  405964  631cc  btrfs.ko.remove-all

    Signed-off-by: David Sterba

    David Sterba
     
