27 Jul, 2012

1 commit

  • Pull large btrfs update from Chris Mason:
    "This pull request is very large, and the two main features in here
    have been under testing/devel for quite a while.

    We have subvolume quotas from the strato developers. This enables
    full tracking of how many blocks are allocated to each subvolume (and
    all snapshots) and you can set limits on a per-subvolume basis. You
    can also create quota groups and toss multiple subvolumes into a big
    group. It's everything you need to be a web hosting company and give
    each user their own subvolume.
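
    As a rough illustration of per-subvolume limits combined with a shared
    group, the sketch below charges an allocation to a quota group and to
    each enclosing group. The struct layout, the single-parent chain, and the
    name qgroup_charge() are assumptions made for this example; they are not
    the kernel's data structures or API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical quota group: not the kernel's own layout. */
struct qgroup {
    uint64_t used;          /* bytes currently charged to this group */
    uint64_t limit;         /* 0 means "no limit" */
    struct qgroup *parent;  /* enclosing group, or NULL at the top */
};

/* Charge `bytes` to `g` and every ancestor; refuse if any limit would be exceeded. */
bool qgroup_charge(struct qgroup *g, uint64_t bytes)
{
    assert(g != NULL);

    /* First pass: check every level so a refusal leaves no partial charge. */
    for (struct qgroup *q = g; q != NULL; q = q->parent) {
        if (q->limit == 0)
            continue;
        if (q->used > q->limit || bytes > q->limit - q->used)
            return false;
    }

    /* Second pass: commit the charge at every level. */
    for (struct qgroup *q = g; q != NULL; q = q->parent)
        q->used += bytes;
    return true;
}
```

    In this model a hosting group with a 100-byte limit can hold one
    subvolume capped at 80 bytes and another with no cap of its own; either
    one is refused once the shared group is full.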

    The userland side of the quotas is being refreshed; they'll send out
    details on where to grab it soon.

    Next is the kernel side of btrfs send/receive from Alexander Block.
    This leverages the same infrastructure as the quota code to figure out
    relationships between blocks and their owners. It can then compute
    the difference between two snapshots and send the diffs in a neutral
    format into userland.

    The basic model:

    create a snapshot
    send that snapshot as the initial backup
    make changes
    create a second snapshot
    send the incremental as a backup
    delete the first snapshot
    (use the second snapshot for the next incremental)

    The receive portion is all in userland, and in the 'next' branch of my
    btrfs-progs repo.

    There's still some work to do in terms of optimizing the send side
    from kernel to userland. The really important part is figuring out
    how two snapshots are different, and this is where we are
    concentrating right now. The initial send of a dataset is a little
    slower than tar, but the incremental sends are dramatically faster
    than what rsync can do.

    On top of all of that, we have a nice queue of fixes, cleanups and
    optimizations."
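
    The snapshot comparison described in the pull message can be pictured
    as a merge-walk over two key-sorted item lists. The sketch below is a
    simplified model under that assumption: struct item, its version field,
    and snapshot_diff() are hypothetical and do not reflect how the kernel
    walks the btrfs trees.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flattened view of a snapshot: items sorted by key. */
struct item {
    uint64_t key;      /* for example, an inode number */
    uint64_t version;  /* bumped whenever the item is rewritten */
};

struct diff_counts {
    size_t added;
    size_t deleted;
    size_t changed;
};

/* Merge-walk two key-sorted snapshots and count what differs between them. */
struct diff_counts snapshot_diff(const struct item *old_snap, size_t n_old,
                                 const struct item *new_snap, size_t n_new)
{
    struct diff_counts d = { 0, 0, 0 };
    size_t i = 0, j = 0;

    assert(old_snap != NULL || n_old == 0);
    assert(new_snap != NULL || n_new == 0);

    while (i < n_old && j < n_new) {
        if (old_snap[i].key < new_snap[j].key) {
            d.deleted++;            /* only in the old snapshot */
            i++;
        } else if (old_snap[i].key > new_snap[j].key) {
            d.added++;              /* only in the new snapshot */
            j++;
        } else {
            if (old_snap[i].version != new_snap[j].version)
                d.changed++;        /* same key, rewritten content */
            i++;
            j++;
        }
    }
    d.deleted += n_old - i;         /* leftovers in the old snapshot */
    d.added += n_new - j;           /* leftovers in the new snapshot */
    return d;
}
```

    Only the added, deleted and changed items would need to travel in an
    incremental send; unchanged items are skipped.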

    Fix up trivial modify/del conflict in fs/btrfs/ioctl.c

    Also fix up semantic conflict in fs/btrfs/send.c: the interface to
    dentry_open() changed in commit 765927b2d508 ("switch dentry_open() to
    struct path, make it grab references itself"), and since it now grabs
    whatever references it needs, we should no longer do the mntget() on the
    mnt (and we need to dput() the dentry reference we took).

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (65 commits)
    Btrfs: uninit variable fixes in send/receive
    Btrfs: introduce BTRFS_IOC_SEND for btrfs send/receive
    Btrfs: add btrfs_compare_trees function
    Btrfs: introduce subvol uuids and times
    Btrfs: make iref_to_path non static
    Btrfs: add a barrier before a waitqueue_active check
    Btrfs: call the ordered free operation without any locks held
    Btrfs: Check INCOMPAT flags on remount and add helper function
    Btrfs: add helper for tree enumeration
    btrfs: allow cross-subvolume file clone
    Btrfs: improve multi-thread buffer read
    Btrfs: make btrfs's allocation smoothly with preallocation
    Btrfs: lock the transition from dirty to writeback for an eb
    Btrfs: fix potential race in extent buffer freeing
    Btrfs: don't return true in releasepage unless we actually freed the eb
    Btrfs: suppress printk() if all device I/O stats are zero
    Btrfs: remove unwanted printk() for btrfs device I/O stats
    Btrfs: rewrite BTRFS_SETGET_FUNCS
    Btrfs: zero unused bytes in inode item
    Btrfs: kill free_space pointer from inode structure
    ...

    Conflicts:
    fs/btrfs/ioctl.c

    Linus Torvalds
     

25 Jul, 2012

1 commit

  • Pull trivial tree from Jiri Kosina:
    "Trivial updates all over the place as usual."

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (29 commits)
    Fix typo in include/linux/clk.h .
    pci: hotplug: Fix typo in pci
    iommu: Fix typo in iommu
    video: Fix typo in drivers/video
    Documentation: Add newline at end-of-file to files lacking one
    arm,unicore32: Remove obsolete "select MISC_DEVICES"
    module.c: spelling s/postition/position/g
    cpufreq: Fix typo in cpufreq driver
    trivial: typo in comment in mksysmap
    mach-omap2: Fix typo in debug message and comment
    scsi: aha152x: Fix sparse warning and make printing pointer address more portable.
    Change email address for Steve Glendinning
    Btrfs: fix typo in convert_extent_bit
    via: Remove bogus if check
    netprio_cgroup.c: fix comment typo
    backlight: fix memory leak on obscure error path
    Documentation: asus-laptop.txt references an obsolete Kconfig item
    Documentation: ManagementStyle: fixed typo
    mm/vmscan: cleanup comment error in balance_pgdat
    mm: cleanup on the comments of zone_reclaim_stat
    ...

    Linus Torvalds
     

24 Jul, 2012

5 commits

  • While testing with my buffer read fio jobs[1], I find that btrfs does not
    perform well enough.

    Here is a scenario in fio jobs:

    We have 4 threads, "t1 t2 t3 t4", starting to buffer read the same file,
    and all of them will race on add_to_page_cache_lru(). Whichever thread
    successfully puts its page into the page cache takes responsibility for
    reading that page's data.

    What's more, reading a page takes some time to finish, during which the
    other threads can slide in and process the remaining pages:

         t1           t2          t3          t4
     add Page1
     read Page1   add Page2
         |       read Page2   add Page3
         |            |      read Page3   add Page4
         |            |           |      read Page4
    -----|------------|-----------|-----------|--------
         v            v           v           v
        bio          bio         bio         bio

    Now we have four bios, each of which holds only one page since we need to
    maintain consecutive pages in bio. Thus, we can end up with far more bios
    than we need.

    Here we're going to
    a) delay the real read-page section and
    b) try to put more pages into page cache.

    With that said, we can make each bio hold more pages and reduce the number
    of bios we need.
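
    To make the effect concrete, the sketch below counts bios in a
    simplified model where a bio may only hold consecutive page indexes.
    count_bios() is a hypothetical helper written for this illustration, not
    a function from the patch.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Count how many bios a submitter needs for the pages it holds, given that
 * a bio (in this simplified model) may only contain consecutive page indexes.
 */
size_t count_bios(const unsigned long *pages, size_t n)
{
    size_t bios = 0;

    assert(pages != NULL || n == 0);

    for (size_t i = 0; i < n; i++) {
        /* Start a new bio at the first page or whenever the run breaks. */
        if (i == 0 || pages[i] != pages[i - 1] + 1)
            bios++;
    }
    return bios;
}
```

    Four threads that each submit a single page need four bios in total,
    whereas one submitter holding pages 1 to 4 needs only one.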

    Here are some numbers taken from fio results:
                     w/o patch                w patch
                   -------------  --------  ---------------
    READ:             745MB/s       +25%       934MB/s

    [1]:
    [global]
    group_reporting
    thread
    numjobs=4
    bs=32k
    rw=read
    ioengine=sync
    directory=/mnt/btrfs/

    [READ]
    filename=foobar
    size=2000M
    invalidate=1

    Signed-off-by: Liu Bo
    Signed-off-by: Josef Bacik

    Liu Bo
     
  • There is a small window where an eb can have no IO bits set on it, which
    could potentially result in extent_buffer_under_io() returning false when we
    want it to return true, which could result in not fun things happening. So
    in order to protect this case we need to hold the refs_lock when we make
    this transition to make sure we get reliable results out of
    extent_buffer_under_io(). Thanks,

    Signed-off-by: Josef Bacik

    Josef Bacik
     
  • This sounds sort of impossible but it is the only thing I can think of and
    at the very least it is theoretically possible so here it goes.

    If we are in try_release_extent_buffer we will check that the ref count on
    the extent buffer is 1 and not under IO, and then go down and clear the tree
    ref. If between this check and clearing the tree ref somebody else comes in
    and grabs a ref on the eb and then marks it dirty before
    try_release_extent_buffer() does its tree ref clear we can end up with a
    dirty eb that will be freed while it is still dirty which will result in a
    panic. Thanks,

    Signed-off-by: Josef Bacik

    Josef Bacik
     
  • I noticed while looking at an extent_buffer race that we will
    unconditionally return 1 if we get down to release_extent_buffer after
    clearing the tree ref. However we can easily race in here and get a ref on
    the eb and not actually free the eb. So make release_extent_buffer return 1
    if it free'd the eb and 0 if not so we can be a little kinder to the vm.
    Thanks,
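
    The idea of reporting whether the buffer was really freed can be
    sketched in userspace with C11 atomics. struct ebuf and the ebuf_*()
    helpers are hypothetical stand-ins for this example; the kernel code
    uses its own types and locking.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical stand-in for an extent buffer; only the refcount matters here. */
struct ebuf {
    atomic_int refs;
};

/* Allocate a buffer holding one reference, or return NULL on failure. */
struct ebuf *ebuf_alloc(void)
{
    struct ebuf *eb = malloc(sizeof(*eb));

    if (eb != NULL)
        atomic_init(&eb->refs, 1);
    return eb;
}

/* Take an extra reference on a buffer the caller already holds. */
void ebuf_get(struct ebuf *eb)
{
    assert(eb != NULL);
    atomic_fetch_add(&eb->refs, 1);
}

/* Drop one reference. Returns 1 only if this call freed the buffer, else 0. */
int ebuf_release(struct ebuf *eb)
{
    assert(eb != NULL);

    /* atomic_fetch_sub() returns the value held before the decrement. */
    if (atomic_fetch_sub(&eb->refs, 1) == 1) {
        free(eb);
        return 1;
    }
    return 0;   /* someone else still holds a reference */
}
```

    A caller that raced with another reference holder now sees 0 and can
    tell the VM that nothing was actually released.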

    Signed-off-by: Josef Bacik

    Josef Bacik
     
  • Changing printk_in_rcu to printk_ratelimited_in_rcu will suffice

    Signed-off-by: Josef Bacik

    Anand Jain
     

12 Jul, 2012

1 commit


03 Jul, 2012

1 commit

  • We can race with unlink and not actually be able to do our igrab in
    btrfs_add_ordered_extent. This will result in all sorts of problems.
    Instead of doing the complicated work to try and handle returning an error
    properly from btrfs_add_ordered_extent, just hold a ref to the inode during
    writepages. If we cannot grab a ref we know we're freeing this inode anyway
    and can just drop the dirty pages on the floor, because screw them we're
    going to invalidate them anyway. Thanks,

    Signed-off-by: Josef Bacik

    Josef Bacik
     

15 Jun, 2012

1 commit

  • Al pointed out that we can just toss out the old name on a device and add a
    new one arbitrarily, so anybody who uses device->name in printk could
    possibly use free'd memory. Instead of adding locking around all of this he
    suggested doing it with RCU, so I've introduced a struct rcu_string that
    does just that and have gone through and protected all accesses to
    device->name that aren't under the uuid_mutex with rcu_read_lock(). This
    protects us and I will use it for dealing with removing the device that we
    used to mount the file system in a later patch. Thanks,

    Reviewed-by: David Sterba
    Signed-off-by: Josef Bacik

    Josef Bacik
     

01 Jun, 2012

1 commit


30 May, 2012

4 commits

  • The goal is to detect when drives start to get an increased error rate,
    when drives should be replaced soon. Therefore statistic counters are
    added that count IO errors (read, write and flush). Additionally, the
    software detected errors like checksum errors and corrupted blocks are
    counted.

    Signed-off-by: Stefan Behrens

    Stefan Behrens
     
  • Fully utilize our extent state's new helper functions to use
    fastpath as much as possible.

    Signed-off-by: Liu Bo
    Reviewed-by: Josef Bacik

    Liu Bo
     
  • We noticed that the ordered extent completion doesn't really rely on having
    a page and that it could be done independently of ending the writeback on a
    page. This patch makes us not do the threaded endio stuff for normal
    buffered writes and direct writes so we can end page writeback as soon as
    possible (in irq context) and only start threads to do the ordered work when
    it is actually done. Compression needs to be reworked some to take
    advantage of this as well, but atm it has to do a find_get_page in its endio
    handler so it must be done in its own thread. This makes direct writes
    quite a bit faster. Thanks,

    Signed-off-by: Josef Bacik

    Josef Bacik
     
  • Btrfs: fix compile warnings in extent_io.c

    These warnings are bogus since we will always have at least one page in an
    eb, but to make the compiler happy just set ret = 0 in these two cases.
    Thanks,

    Signed-off-by: Josef Bacik

    Josef Bacik
     

26 May, 2012

1 commit


11 May, 2012

4 commits


07 May, 2012

1 commit

  • Pull btrfs fixes from Chris Mason:
    "The big ones here are a memory leak we introduced in rc1, and a
    scheduling while atomic if the transid on disk doesn't match the
    transid we expected. This happens for corrupt blocks, or out of date
    disks.

    It also fixes up the ioctl definition for our ioctl to resolve logical
    inode numbers. The __u32 was a merging error and doesn't match what
    we ship in the progs."

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
    Btrfs: avoid sleeping in verify_parent_transid while atomic
    Btrfs: fix crash in scrub repair code when device is missing
    btrfs: Fix mismatching struct members in ioctl.h
    Btrfs: fix page leak when allocing extent buffers
    Btrfs: Add properly locking around add_root_to_dirty_list

    Linus Torvalds
     

05 May, 2012

1 commit

  • If we happen to alloc an extent buffer and then alloc a page and notice that
    page is already attached to an extent buffer, we will only unlock it and
    free our existing eb. Any pages currently attached to that eb will be
    properly freed, but we don't do the page_cache_release() on the page where
    we noticed the other extent buffer which can cause us to leak pages and I
    hope cause the weird issues we've been seeing in this area. Thanks,

    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik
     

29 Apr, 2012

1 commit

  • Pull btrfs fixes from Chris Mason:
    "This has our collection of bug fixes. I missed the last rc because I
    thought our patches were making NFS crash during my xfs test runs.
    Turns out it was an NFS client bug fixed by someone else while I tried
    to bisect it.

    All of these fixes are small, but some are fairly high impact. The
    biggest are fixes for our mount -o remount handling, a deadlock due to
    GFP_KERNEL allocations in readdir, and a RAID10 error handling bug.

    This was tested against both 3.3 and Linus' master as of this morning."

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (26 commits)
    Btrfs: reduce lock contention during extent insertion
    Btrfs: avoid deadlocks from GFP_KERNEL allocations during btrfs_real_readdir
    Btrfs: Fix space checking during fs resize
    Btrfs: fix block_rsv and space_info lock ordering
    Btrfs: Prevent root_list corruption
    Btrfs: fix repair code for RAID10
    Btrfs: do not start delalloc inodes during sync
    Btrfs: fix that check_int_data mount option was ignored
    Btrfs: don't count CRC or header errors twice while scrubbing
    Btrfs: fix btrfs_ioctl_dev_info() crash on missing device
    btrfs: don't return EINTR
    Btrfs: double unlock bug in error handling
    Btrfs: always store the mirror we read the eb from
    fs/btrfs/volumes.c: add missing free_fs_devices
    btrfs: fix early abort in 'remount'
    Btrfs: fix max chunk size check in chunk allocator
    Btrfs: add missing read locks in backref.c
    Btrfs: don't call free_extent_buffer twice in iterate_irefs
    Btrfs: Make free_ipath() deal gracefully with NULL pointers
    Btrfs: avoid possible use-after-free in clear_extent_bit()
    ...

    Linus Torvalds
     

19 Apr, 2012

3 commits

  • A user reported a panic where we were trying to fix a bad mirror but the
    mirror number we were giving was 0, which is invalid. This is because we
    don't do the transid verification until after the read, so as far as the
    read code is concerned the read was a success. So instead store the mirror
    we read from so that if there is some failure post read we know which mirror
    to try next and which mirror needs to be fixed if we find a good copy of the
    block. Thanks,

    Signed-off-by: Josef Bacik

    Josef Bacik
     
  • clear_extent_bit()
    {
    next_node = rb_next(&state->rb_node);
    ...
    clear_state_bit(state);

    Li Zefan
     
  • Currently it returns a set of bits that were cleared, but this return
    value is not used at all.

    Moreover it doesn't seem to be useful, because we may clear the bits
    of a few extent_states, but only the cleared bits of last one is
    returned.

    Signed-off-by: Li Zefan

    Li Zefan
     

14 Apr, 2012

1 commit

  • Pull the minimal btrfs branch from Chris Mason:
    "We have a use-after-free in there, along with errors when mount -o
    discard is enabled, and a BUG_ON (we should compile with UP more
    often)."

    * 'for-linus-min' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
    Btrfs: use commit root when loading free space cache
    Btrfs: fix use-after-free in __btrfs_end_transaction
    Btrfs: check return value of bio_alloc() properly
    Btrfs: remove lock assert from get_restripe_target()
    Btrfs: fix eof while discarding extents
    Btrfs: fix uninit variable in repair_eb_io_failure
    Revert "Btrfs: increase the global block reserve estimates"

    Linus Torvalds
     

13 Apr, 2012

2 commits


31 Mar, 2012

1 commit

  • Pull btrfs fixes and features from Chris Mason:
    "We've merged in the error handling patches from SuSE. These are
    already shipping in the sles kernel, and they give btrfs the ability
    to abort transactions and go readonly on errors. It involves a lot of
    churn as they clarify BUG_ONs, and remove the ones we now properly
    deal with.

    Josef reworked the way our metadata interacts with the page cache.
    page->private now points to the btrfs extent_buffer object, which
    makes everything faster. He changed it so we write a whole extent
    buffer at a time instead of allowing individual pages to go down,
    which will be important for the raid5/6 code (for the 3.5 merge
    window ;)

    Josef also made us more aggressive about dropping pages for metadata
    blocks that were freed due to COW. Overall, our metadata caching is
    much faster now.

    We've integrated my patch for metadata bigger than the page size.
    This allows metadata blocks up to 64KB in size. In practice 16K and
    32K seem to work best. For workloads with lots of metadata, this cuts
    down the size of the extent allocation tree dramatically and fragments
    much less.

    Scrub was updated to support the larger block sizes, which ended up
    being a fairly large change (thanks Stefan Behrens).

    We also have an assortment of fixes and updates, especially to the
    balancing code (Ilya Dryomov), the back ref walker (Jan Schmidt) and
    the defragging code (Liu Bo)."

    Fixed up trivial conflicts in fs/btrfs/scrub.c that were just due to
    removal of the second argument to k[un]map_atomic() in commit
    7ac687d9e047.

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (75 commits)
    Btrfs: update the checks for mixed block groups with big metadata blocks
    Btrfs: update to the right index of defragment
    Btrfs: do not bother to defrag an extent if it is a big real extent
    Btrfs: add a check to decide if we should defrag the range
    Btrfs: fix recursive defragment with autodefrag option
    Btrfs: fix the mismatch of page->mapping
    Btrfs: fix race between direct io and autodefrag
    Btrfs: fix deadlock during allocating chunks
    Btrfs: show useful info in space reservation tracepoint
    Btrfs: don't use crc items bigger than 4KB
    Btrfs: flush out and clean up any block device pages during mount
    btrfs: disallow unequal data/metadata blocksize for mixed block groups
    Btrfs: enhance superblock sanity checks
    Btrfs: change scrub to support big blocks
    Btrfs: minor cleanup in scrub
    Btrfs: introduce common define for max number of mirrors
    Btrfs: fix infinite loop in btrfs_shrink_device()
    Btrfs: fix memory leak in resolver code
    Btrfs: allow dup for data chunks in mixed mode
    Btrfs: validate target profiles only if we are going to use them
    ...

    Linus Torvalds
     

29 Mar, 2012

1 commit


27 Mar, 2012

8 commits

  • Since we need to read and write extent buffers in their entirety we can't use
    the normal bio_readpage_error stuff since it only works on a per page basis. So
    instead make it so that if we see an io error in endio we just mark the eb as
    having an IO error and then in btree_read_extent_buffer_pages we will manually
    try other mirrors and then overwrite the bad mirror if we find a good copy.
    This works with larger than page size blocks. Thanks,
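
    The retry-and-repair flow can be sketched as below. This is a toy
    model: struct mirrors, NUM_MIRRORS, BLOCK_SIZE and read_with_repair()
    are all hypothetical, and the bad[] flag stands in for whatever check
    decides that a copy is unusable.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define NUM_MIRRORS 3   /* hypothetical; mirrors are numbered from 1 */
#define BLOCK_SIZE  8

/* Toy device set: one copy of the block per mirror plus a "bad" flag. */
struct mirrors {
    char data[NUM_MIRRORS][BLOCK_SIZE];
    bool bad[NUM_MIRRORS];
};

/*
 * Read a block, trying each mirror in turn. On success, overwrite every
 * mirror that failed earlier with the good copy and return the mirror
 * number that supplied it. Return 0 (an invalid mirror number) if all fail.
 */
int read_with_repair(struct mirrors *m, char out[BLOCK_SIZE])
{
    assert(m != NULL && out != NULL);

    for (int mirror = 1; mirror <= NUM_MIRRORS; mirror++) {
        if (m->bad[mirror - 1])
            continue;   /* this copy failed, move on to the next mirror */

        memcpy(out, m->data[mirror - 1], BLOCK_SIZE);

        /* Every mirror before this one failed, so rewrite each of them. */
        for (int failed = 1; failed < mirror; failed++) {
            memcpy(m->data[failed - 1], out, BLOCK_SIZE);
            m->bad[failed - 1] = false;
        }
        return mirror;
    }
    return 0;
}
```

    Returning the mirror number that supplied the data lets the caller
    know which copy was good and which ones had to be rewritten.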

    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik
     
  • lock_extent_buffer_for_io needs to loop around and make sure the
    writeback bits are not set.

    Signed-off-by: Chris Mason

    Chris Mason
     
  • This patch simplifies how we track our extent buffers. Previously we could exit
    writepages with only having written half of an extent buffer, which meant we had
    to track the state of the pages and the state of the extent buffers differently.
    Now we only read in entire extent buffers and write out entire extent buffers.
    This allows us to simply set bits in our bflags to indicate the state of the eb
    and we no longer have to do things like track uptodate with our iotree. Thanks,

    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik
     
  • Because an eb can have multiple pages we need to make sure that all pages within
    the eb are marked as accessed, since releasepage can be called against any page
    in the eb. This will keep us from possibly evicting hot eb's when we're doing
    larger than pagesize eb's. Thanks,

    Signed-off-by: Josef Bacik

    Josef Bacik
     
  • Because btrfs cow's we can end up with extent buffers that are no longer
    necessary just sitting around in memory. So instead of evicting these pages, we
    could end up evicting things we actually care about. Thus we have
    free_extent_buffer_stale for use when we are freeing tree blocks. This will
    make it so that the ref for the eb being in the radix tree is dropped as soon as
    possible and then is freed when the refcount hits 0 instead of waiting to be
    released by releasepage. Thanks,

    Signed-off-by: Josef Bacik

    Josef Bacik
     
  • We can run into a problem where we find an eb for our existing page already on
    the radix tree but it has a ref count of 0. It hasn't been removed by RCU
    yet so this can cause issues where we will use the EB after free. So do
    atomic_inc_not_zero on the exists->refs and if it is zero just do
    synchronize_rcu() and try again. We won't have to worry about new allocators
    coming in since they will block on the page lock at this point. Thanks,

    Signed-off-by: Josef Bacik

    Josef Bacik
     
  • We spend a lot of time looking up extent buffers from pages when we could just
    store the pointer to the eb the page is associated with in page->private. This
    patch does just that, and it makes things a little simpler and reduces a bit of
    CPU overhead involved with doing metadata IO. Thanks,
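
    A back-pointer of this kind can be sketched as follows. struct
    toy_page, struct ebuf and both helpers are hypothetical; private_data
    plays the role of page->private, and the four-page cap is an arbitrary
    choice for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Toy page: `private_data` plays the role of page->private. */
struct toy_page {
    unsigned long index;
    void *private_data;
};

/* Hypothetical extent buffer spanning up to four pages. */
struct ebuf {
    unsigned long start;
    struct toy_page *pages[4];
    size_t nr_pages;
};

/* Attach pages to the buffer and leave a back-pointer on each page. */
void ebuf_attach_pages(struct ebuf *eb, struct toy_page *pages, size_t n)
{
    assert(eb != NULL && pages != NULL && n <= 4);

    eb->nr_pages = n;
    for (size_t i = 0; i < n; i++) {
        eb->pages[i] = &pages[i];
        pages[i].private_data = eb;
    }
}

/* Constant-time lookup: no search by offset, just follow the pointer. */
struct ebuf *page_to_ebuf(const struct toy_page *page)
{
    assert(page != NULL);
    return page->private_data;
}
```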

    Signed-off-by: Josef Bacik

    Josef Bacik
     
  • A few years ago the btrfs code to support blocks larger than
    the page size was disabled to fix a few corner cases in the
    page cache handling. This fixes the code to properly support
    large metadata blocks again.

    Since current kernels will crash early and often with larger
    metadata blocks, this adds an incompat bit so that older kernels
    can't mount it.

    This also does away with different blocksizes for nodes and leaves.
    You get a single block size for all tree blocks.

    Signed-off-by: Chris Mason

    Chris Mason
     

22 Mar, 2012

1 commit

  • btrfs currently handles most errors with BUG_ON. This patch is a
    work-in-progress but aims to handle most errors other than internal
    logic errors and ENOMEM more gracefully.

    This iteration prevents most crashes but can run into lockups with
    the page lock on occasion when the timing "works out."

    Signed-off-by: Jeff Mahoney

    Jeff Mahoney