20 Nov, 2011

1 commit

  • When btrfs is writing the super blocks, it sends barrier flushes to make
    sure writeback caching drives get all the metadata on disk in the
    right order.

    But, we have two bugs in the way these are sent down. When doing
    full commits (not via the tree log), we are sending the barrier down
    before the last super when it should be going down before the first.

    In multi-device setups, we should be waiting for the barriers to
    complete on all devices before writing any of the supers.

    Both of these bugs can cause corruptions on power failures. We fix it
    with some new code to send down empty barriers to all devices before
    writing the first super.

    Alexandre Oliva found the multi-device bug. Arne Jansen did the async
    barrier loop.

    Signed-off-by: Chris Mason
    Reported-by: Alexandre Oliva

    Chris Mason
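
    The ordering this commit describes (flush every device first, then write the supers) can be modelled in user space. This is only a sketch, not the btrfs code: `struct event_log`, `flush_all_devices()`, `write_all_supers()` and `ordering_ok()` are hypothetical names, and a "flush" here merely records an event.

```c
#include <stddef.h>

/* Hypothetical event log used to observe the order of operations. */
enum event { EV_FLUSH, EV_SUPER };

struct event_log {
    enum event ev[64];
    size_t n;
};

void log_event(struct event_log *log, enum event e)
{
    if (log->n < sizeof(log->ev) / sizeof(log->ev[0]))
        log->ev[log->n++] = e;
}

/* Phase 1: send an empty flush to every device (and, in a real
 * implementation, wait for all of them to complete). */
void flush_all_devices(struct event_log *log, size_t ndevs)
{
    for (size_t i = 0; i < ndevs; i++)
        log_event(log, EV_FLUSH);
}

/* Phase 2: only after every flush is done, write the supers. */
void write_all_supers(struct event_log *log, size_t ndevs)
{
    flush_all_devices(log, ndevs);
    for (size_t i = 0; i < ndevs; i++)
        log_event(log, EV_SUPER);
}

/* Returns 1 if no flush is logged after the first super write. */
int ordering_ok(const struct event_log *log)
{
    int seen_super = 0;

    for (size_t i = 0; i < log->n; i++) {
        if (log->ev[i] == EV_SUPER)
            seen_super = 1;
        else if (seen_super)
            return 0;
    }
    return 1;
}
```

    The model is single-threaded on purpose; the point is only that no super reaches any device until all devices have been flushed.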
     

06 Nov, 2011

1 commit


02 Oct, 2011

1 commit


29 Sep, 2011

1 commit

  • btrfs_bio is a bio abstraction able to split and not complete after the last
    bio has returned (like the old btrfs_multi_bio). Additionally, btrfs_bio
    tracks the mirror_num used to read data which can be used for error
    correction purposes.

    Signed-off-by: Jan Schmidt

    Jan Schmidt
     

17 Aug, 2011

1 commit

  • We have a problem where if a user specifies discard but the device doesn't
    actually support it, we will return EOPNOTSUPP from btrfs_discard_extent. This
    is a problem because this gets called (in a fashion) from the tree log recovery
    code, which has a nice little BUG_ON(ret) after it, which causes us to fail the
    tree log replay. So instead detect whether our devices support discard when
    we're adding them and then don't issue discards if we know that the device
    doesn't support it. And just for good measure set ret = 0 in btrfs_issue_discard
    just in case we still get EOPNOTSUPP so we don't screw anybody up like this
    again. Thanks,

    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik
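
    A minimal sketch of the two safeguards described above: skip devices known not to support discard, and still swallow EOPNOTSUPP if it shows up anyway. The struct and function names (`dev_stub`, `dev_issue_discard()`, `discard_extent()`) are hypothetical stand-ins, not the kernel's.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

struct dev_stub {
    int can_discard;      /* what was detected when the device was added */
    int really_supports;  /* ground truth, normally unknown to the caller */
    int discards_done;    /* counter, only for this demo */
};

/* Hypothetical low-level call: fails on devices without discard support. */
int dev_issue_discard(struct dev_stub *d, uint64_t start, uint64_t len)
{
    (void)start;
    (void)len;
    if (!d->really_supports)
        return -EOPNOTSUPP;
    d->discards_done++;
    return 0;
}

int discard_extent(struct dev_stub *devs, size_t ndevs,
                   uint64_t start, uint64_t len)
{
    int ret = 0;

    for (size_t i = 0; i < ndevs; i++) {
        /* Safeguard 1: never issue a discard to a device we know lacks it. */
        if (!devs[i].can_discard)
            continue;
        ret = dev_issue_discard(&devs[i], start, len);
        /* Safeguard 2: an unsupported discard is not an error for callers. */
        if (ret == -EOPNOTSUPP)
            ret = 0;
        if (ret)
            break;
    }
    return ret;
}
```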
     

24 May, 2011

2 commits


23 May, 2011

2 commits


13 May, 2011

2 commits

  • In a multi device setup, the chunk allocator currently always allocates
    chunks on the devices in the same order. This leads to a very uneven
    distribution, especially with RAID1 or RAID10 and an uneven number of
    devices.
    This patch always sorts the devices before allocating, and allocates the
    stripes on the devices with the most available space, as long as there
    is enough space available. In a low space situation, it first tries to
    maximize striping.
    The patch also simplifies the allocator and reduces the checks for
    corner cases.
    The simplification is done by several means. First, it defines the
    properties of each RAID type upfront. These properties are used afterwards
    instead of differentiating cases in several places.
    Second, the old allocator defined a minimum stripe size for each block
    group type, tried to find a large enough chunk, and if this fails just
    allocates a smaller one. This is now done in one step. The largest possible
    chunk (up to max_chunk_size) is searched and allocated.
    Because we now have only one pass, the allocation of the map (struct
    map_lookup) is moved down to the point where the number of stripes is
    already known. This way we avoid reallocation of the map.
    We still avoid allocating stripes that are not a multiple of STRIPE_SIZE.

    Arne Jansen
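
    The sorting step can be illustrated with a small user-space sketch. It assumes a hypothetical `struct dev_info`, a hypothetical helper `pick_stripes()`, and an illustrative STRIPE_SIZE of 64KiB; the real allocator handles many more cases.

```c
#include <stdint.h>
#include <stdlib.h>

#define STRIPE_SIZE (64 * 1024ULL) /* assumed value, for illustration only */

struct dev_info {
    int devid;
    uint64_t avail; /* bytes available on this device */
};

/* qsort comparator: device with the most available space first. */
static int cmp_avail_desc(const void *a, const void *b)
{
    const struct dev_info *x = a, *y = b;

    if (x->avail < y->avail)
        return 1;
    if (x->avail > y->avail)
        return -1;
    return 0;
}

/*
 * Sort the devices, use the first num_stripes of them, and return the
 * stripe size: the space on the smallest chosen device, capped at
 * max_stripe and rounded down to a multiple of STRIPE_SIZE.
 * Returns 0 when there are not enough devices.
 */
uint64_t pick_stripes(struct dev_info *devs, size_t ndevs,
                      size_t num_stripes, uint64_t max_stripe)
{
    uint64_t size;

    if (num_stripes == 0 || ndevs < num_stripes)
        return 0;
    qsort(devs, ndevs, sizeof(*devs), cmp_avail_desc);
    size = devs[num_stripes - 1].avail;
    if (size > max_stripe)
        size = max_stripe;
    return size - size % STRIPE_SIZE;
}
```

    After the call, the chosen devices are `devs[0]` to `devs[num_stripes - 1]`.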
     
  • this function won't be used here anymore, so move it to super.c where it is
    used for df-calculation

    Arne Jansen
     

12 May, 2011

1 commit

  • This adds an initial implementation for scrub. It works in a quite
    straightforward way. User mode issues an ioctl for each device in the
    fs. For each device, it enumerates the allocated device chunks. For
    each chunk, the contained extents are enumerated and the data checksums
    fetched. The extents are read sequentially and the checksums verified.
    If an error occurs (checksum or EIO), a good copy is searched for. If
    one is found, the bad copy will be rewritten.
    All enumerations happen from the commit roots. During a transaction
    commit, the scrubs get paused and afterwards continue from the new
    roots.

    This commit is based on the series originally posted to linux-btrfs
    with some improvements that resulted from comments from David Sterba,
    Ilya Dryomov and Jan Schmidt.

    Signed-off-by: Arne Jansen

    Arne Jansen
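
    The verify-and-repair step can be sketched as below. This is a toy model under loud assumptions: two in-memory mirrors of a 16-byte block, and a stand-in checksum `toy_csum()` instead of the checksum btrfs actually uses; `scrub_block()` is a hypothetical name.

```c
#include <stdint.h>
#include <string.h>

#define BLOCK 16
#define MIRRORS 2

/* Stand-in checksum, used only to keep this sketch self-contained. */
uint32_t toy_csum(const uint8_t *buf, size_t len)
{
    uint32_t s = 0;

    for (size_t i = 0; i < len; i++)
        s = s * 31 + buf[i];
    return s;
}

/*
 * Verify every copy of a block against the expected checksum. If a bad
 * copy is found, search for a good one and rewrite the bad copy from it.
 * Returns the number of copies repaired, or -1 if no good copy exists.
 */
int scrub_block(uint8_t copies[MIRRORS][BLOCK], uint32_t expected)
{
    int good = -1, repaired = 0;

    for (int m = 0; m < MIRRORS; m++) {
        if (toy_csum(copies[m], BLOCK) == expected) {
            good = m;
            break;
        }
    }
    if (good < 0)
        return -1;

    for (int m = 0; m < MIRRORS; m++) {
        if (toy_csum(copies[m], BLOCK) != expected) {
            memcpy(copies[m], copies[good], BLOCK);
            repaired++;
        }
    }
    return repaired;
}
```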
     

06 May, 2011

1 commit

  • Remove static and global declarations and/or definitions. Reduces size
    of btrfs.ko by ~3.4kB.

      text    data     bss     dec     hex filename
    402081    7464     200  409745   64091 btrfs.ko.base
    398620    7144     200  405964   631cc btrfs.ko.remove-all

    Signed-off-by: David Sterba

    David Sterba
     

04 May, 2011

1 commit


28 Mar, 2011

2 commits

  • btrfs_map_block() will only return a single stripe length, but we want the
    full extent to be mapped to each disk when we are trimming the extent,
    so we add length to btrfs_bio_stripe and fill it if we are mapping for REQ_DISCARD.

    Signed-off-by: Li Dongyang
    Signed-off-by: Chris Mason

    Li Dongyang
     
  • Tracepoints can provide insight into why btrfs hits bugs and be greatly
    helpful for debugging, e.g.:
    dd-7822 [000] 2121.641088: btrfs_inode_request: root = 5(FS_TREE), gen = 4, ino = 256, blocks = 8, disk_i_size = 0, last_trans = 8, logged_trans = 0
    dd-7822 [000] 2121.641100: btrfs_inode_new: root = 5(FS_TREE), gen = 8, ino = 257, blocks = 0, disk_i_size = 0, last_trans = 0, logged_trans = 0
    btrfs-transacti-7804 [001] 2146.935420: btrfs_cow_block: root = 2(EXTENT_TREE), refs = 2, orig_buf = 29368320 (orig_level = 0), cow_buf = 29388800 (cow_level = 0)
    btrfs-transacti-7804 [001] 2146.935473: btrfs_cow_block: root = 1(ROOT_TREE), refs = 2, orig_buf = 29364224 (orig_level = 0), cow_buf = 29392896 (cow_level = 0)
    btrfs-transacti-7804 [001] 2146.972221: btrfs_transaction_commit: root = 1(ROOT_TREE), gen = 8
    flush-btrfs-2-7821 [001] 2155.824210: btrfs_chunk_alloc: root = 3(CHUNK_TREE), offset = 1103101952, size = 1073741824, num_stripes = 1, sub_stripes = 0, type = DATA
    flush-btrfs-2-7821 [001] 2155.824241: btrfs_cow_block: root = 2(EXTENT_TREE), refs = 2, orig_buf = 29388800 (orig_level = 0), cow_buf = 29396992 (cow_level = 0)
    flush-btrfs-2-7821 [001] 2155.824255: btrfs_cow_block: root = 4(DEV_TREE), refs = 2, orig_buf = 29372416 (orig_level = 0), cow_buf = 29401088 (cow_level = 0)
    flush-btrfs-2-7821 [000] 2155.824329: btrfs_cow_block: root = 3(CHUNK_TREE), refs = 2, orig_buf = 20971520 (orig_level = 0), cow_buf = 20975616 (cow_level = 0)
    btrfs-endio-wri-7800 [001] 2155.898019: btrfs_cow_block: root = 5(FS_TREE), refs = 2, orig_buf = 29384704 (orig_level = 0), cow_buf = 29405184 (cow_level = 0)
    btrfs-endio-wri-7800 [001] 2155.898043: btrfs_cow_block: root = 7(CSUM_TREE), refs = 2, orig_buf = 29376512 (orig_level = 0), cow_buf = 29409280 (cow_level = 0)

    Here is what I have added:

    1) ordered_extent:
    btrfs_ordered_extent_add
    btrfs_ordered_extent_remove
    btrfs_ordered_extent_start
    btrfs_ordered_extent_put

    These provide critical information to understand how ordered_extents are
    updated.

    2) extent_map:
    btrfs_get_extent

    extent_map is used in both read and write cases, and it is useful for tracking
    how btrfs specific IO is running.

    3) writepage:
    __extent_writepage
    btrfs_writepage_end_io_hook

    Pages are critical resources and produce a lot of corner cases during writeback,
    so it is valuable to know how a page is written to disk.

    4) inode:
    btrfs_inode_new
    btrfs_inode_request
    btrfs_inode_evict

    These can show where and when an inode is created, and when an inode is evicted.

    5) sync:
    btrfs_sync_file
    btrfs_sync_fs

    These show sync arguments.

    6) transaction:
    btrfs_transaction_commit

    In a transaction-based filesystem, it is useful to know the generation and
    who does the commit.

    7) back reference and cow:
    btrfs_delayed_tree_ref
    btrfs_delayed_data_ref
    btrfs_delayed_ref_head
    btrfs_cow_block

    Btrfs natively supports back references; these tracepoints are helpful for
    understanding btrfs's COW mechanism.

    8) chunk:
    btrfs_chunk_alloc
    btrfs_chunk_free

    A chunk is a link between physical offset and logical offset, and stands for space
    information in btrfs; these tracepoints are helpful for tracing space usage.

    9) reserved_extent:
    btrfs_reserved_extent_alloc
    btrfs_reserved_extent_free

    These can show how btrfs uses its space.

    Signed-off-by: Liu Bo
    Signed-off-by: Chris Mason

    liubo
     

18 Jan, 2011

1 commit

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: (25 commits)
    Btrfs: forced readonly mounts on errors
    btrfs: Require CAP_SYS_ADMIN for filesystem rebalance
    Btrfs: don't warn if we get ENOSPC in btrfs_block_rsv_check
    btrfs: Fix memory leak in btrfs_read_fs_root_no_radix()
    btrfs: check NULL or not
    btrfs: Don't pass NULL ptr to func that may deref it.
    btrfs: mount failure return value fix
    btrfs: Mem leak in btrfs_get_acl()
    btrfs: fix wrong free space information of btrfs
    btrfs: make the chunk allocator utilize the devices better
    btrfs: restructure find_free_dev_extent()
    btrfs: fix wrong calculation of stripe size
    btrfs: try to reclaim some space when chunk allocation fails
    btrfs: fix wrong data space statistics
    fs/btrfs: Fix build of ctree
    Btrfs: fix off by one while setting block groups readonly
    Btrfs: Add BTRFS_IOC_SUBVOL_GETFLAGS/SETFLAGS ioctls
    Btrfs: Add readonly snapshots support
    Btrfs: Refactor btrfs_ioctl_snap_create()
    btrfs: Extract duplicate decompress code
    ...

    Linus Torvalds
     

17 Jan, 2011

2 commits

  • When we store data with a RAID profile in btrfs on two or more disks of
    different sizes, the df command shows some free space in the filesystem
    even though the user cannot actually write any data. In other words, df
    shows the wrong free space information for btrfs.

    # mkfs.btrfs -d raid1 /dev/sda9 /dev/sda10
    # btrfs-show
    Label: none uuid: a95cd49e-6e33-45b8-8741-a36153ce4b64
    Total devices 2 FS bytes used 28.00KB
    devid 1 size 5.01GB used 2.03GB path /dev/sda9
    devid 2 size 10.00GB used 2.01GB path /dev/sda10
    # btrfs device scan /dev/sda9 /dev/sda10
    # mount /dev/sda9 /mnt
    # dd if=/dev/zero of=tmpfile0 bs=4K count=9999999999
    (fill the filesystem)
    # sync
    # df -TH
    Filesystem Type Size Used Avail Use% Mounted on
    /dev/sda9 btrfs 17G 8.6G 5.4G 62% /mnt
    # btrfs-show
    Label: none uuid: a95cd49e-6e33-45b8-8741-a36153ce4b64
    Total devices 2 FS bytes used 3.99GB
    devid 1 size 5.01GB used 5.01GB path /dev/sda9
    devid 2 size 10.00GB used 4.99GB path /dev/sda10

    This happens because btrfs cannot allocate chunks when one of the paired
    disks has no space left. The free space on the other disks can never be
    used and should be subtracted from the total space, but btrfs doesn't
    subtract it. This is confusing to the user.

    This patch fixes it by calculating the free space that can be used to allocate
    chunks.

    Implementation:
    1. Get the free space of all the devices, and align it by stripe length.
    2. Sort the devices by free space.
    3. Check the free space of each device:
       3.1. If it is not zero, check the number of devices that have more
            free space than this device:
            - if that number is beyond the min stripe number, the free space
              can be used, and is added to the total free space;
            - if that number is below the min stripe number, we cannot use
              the free space, and the check ends.
       3.2. If the free space is zero, check the next device and go to 3.1.

    This implementation works just like a fake chunk allocation.

    After applying this patch, df can show correct space information:
    # df -TH
    Filesystem Type Size Used Avail Use% Mounted on
    /dev/sda9 btrfs 17G 8.6G 0 100% /mnt

    Signed-off-by: Miao Xie
    Signed-off-by: Chris Mason

    Miao Xie
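
    The "fake chunk allocation" idea can be sketched in user space as follows. This is a simplified reading of the steps in the commit message, not the kernel function: `usable_raw_space()` is a hypothetical name, the stripe-length alignment of step 1 is omitted, every fake chunk is assumed to use exactly `min_stripes` devices, and the result is raw bytes (for RAID1 the space visible to the user would be half of it).

```c
#include <stdint.h>
#include <stdlib.h>

/* qsort comparator: largest free space first. */
static int cmp_u64_desc(const void *a, const void *b)
{
    uint64_t x = *(const uint64_t *)a, y = *(const uint64_t *)b;

    return (x < y) - (x > y);
}

/*
 * Walk the devices from the one with the least free space upwards. While
 * at least min_stripes devices remain, "allocate" the smallest remaining
 * free amount on the min_stripes smallest remaining devices and count it.
 * The free_bytes array is modified in place.
 */
uint64_t usable_raw_space(uint64_t *free_bytes, size_t ndevs,
                          size_t min_stripes)
{
    uint64_t total = 0;
    size_t nr = ndevs;

    if (min_stripes == 0)
        return 0;
    qsort(free_bytes, ndevs, sizeof(*free_bytes), cmp_u64_desc);

    while (nr >= min_stripes) {
        uint64_t size = free_bytes[nr - 1]; /* smallest remaining device */

        if (size > 0) {
            total += size * min_stripes;
            for (size_t j = nr - min_stripes; j < nr; j++)
                free_bytes[j] -= size;
        }
        nr--; /* this device is now exhausted (or was empty) */
    }
    return total;
}
```

    With one full device and one device that still has space, as in the df example above, the sketch reports no usable space for a two-stripe profile.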
     
  • With this patch, we change the handling method when we can not get enough free
    extents with default size.

    Implementation:
    1. Look up a suitable free extent on each device and keep the search result.
    If no suitable free extent is found, keep the max free extent.
    2. If we get enough suitable free extents with default size, chunk allocation
    succeeds.
    3. If we cannot get enough free extents, but the number of extents with
    default size is >= min_stripes, we just change the mapping information
    (reduce the number of stripes in the extent map), and chunk allocation
    succeeds.
    4. If the number of extents with default size is < min_stripes, sort the
    devices by the size of their max free extent, in descending order.
    5. Use the size of the max free extent on the (num_stripes - 1)th device as the
    stripe size to allocate the device space.

    This way, the chunk allocator can allocate chunks as large as possible when
    the devices' space is not enough, and make full use of the devices.

    Signed-off-by: Miao Xie
    Signed-off-by: Chris Mason

    Miao Xie
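
    The decision logic in steps 2 to 5 can be sketched as below. The names `struct alloc_plan` and `plan_chunk()` are hypothetical, and the fallback in step 5 is one interpretation of the message: it is assumed that the chunk then uses `min_stripes` stripes, sized by the smallest of the `min_stripes` largest extents.

```c
#include <stdint.h>
#include <stdlib.h>

struct alloc_plan {
    size_t num_stripes;   /* 0 means the allocation fails */
    uint64_t stripe_size;
};

/* qsort comparator: largest extent first. */
static int cmp_extent_desc(const void *a, const void *b)
{
    uint64_t x = *(const uint64_t *)a, y = *(const uint64_t *)b;

    return (x < y) - (x > y);
}

/* max_extent[i] is the largest free extent found on device i. */
struct alloc_plan plan_chunk(uint64_t *max_extent, size_t ndevs,
                             uint64_t default_size,
                             size_t num_stripes, size_t min_stripes)
{
    struct alloc_plan plan = { 0, 0 };
    size_t enough = 0;

    for (size_t i = 0; i < ndevs; i++)
        if (max_extent[i] >= default_size)
            enough++;

    if (enough >= num_stripes) {          /* step 2: full-size chunk */
        plan.num_stripes = num_stripes;
        plan.stripe_size = default_size;
        return plan;
    }
    if (min_stripes > 0 && enough >= min_stripes) { /* step 3: fewer stripes */
        plan.num_stripes = enough;
        plan.stripe_size = default_size;
        return plan;
    }
    if (min_stripes == 0 || ndevs < min_stripes)
        return plan;

    /* steps 4 and 5: shrink the stripe size instead */
    qsort(max_extent, ndevs, sizeof(*max_extent), cmp_extent_desc);
    if (max_extent[min_stripes - 1] == 0)
        return plan;
    plan.num_stripes = min_stripes;
    plan.stripe_size = max_extent[min_stripes - 1];
    return plan;
}
```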
     

14 Jan, 2011

1 commit

  • * 'for-2.6.38/core' of git://git.kernel.dk/linux-2.6-block: (43 commits)
    block: ensure that completion error gets properly traced
    blktrace: add missing probe argument to block_bio_complete
    block cfq: don't use atomic_t for cfq_group
    block cfq: don't use atomic_t for cfq_queue
    block: trace event block fix unassigned field
    block: add internal hd part table references
    block: fix accounting bug on cross partition merges
    kref: add kref_test_and_get
    bio-integrity: mark kintegrityd_wq highpri and CPU intensive
    block: make kblockd_workqueue smarter
    Revert "sd: implement sd_check_events()"
    block: Clean up exit_io_context() source code.
    Fix compile warnings due to missing removal of a 'ret' variable
    fs/block: type signature of major_to_index(int) to major_to_index(unsigned)
    block: convert !IS_ERR(p) && p to !IS_ERR_NOR_NULL(p)
    cfq-iosched: don't check cfqg in choose_service_tree()
    fs/splice: Pull buf->ops->confirm() from splice_from_pipe actors
    cdrom: export cdrom_check_events()
    sd: implement sd_check_events()
    sr: implement sr_check_events()
    ...

    Linus Torvalds
     

15 Dec, 2010

1 commit

  • * git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
    Btrfs: prevent RAID level downgrades when space is low
    Btrfs: account for missing devices in RAID allocation profiles
    Btrfs: EIO when we fail to read tree roots
    Btrfs: fix compiler warnings
    Btrfs: Make async snapshot ioctl more generic
    Btrfs: pwrite blocked when writing from the mmaped buffer of the same page
    Btrfs: Fix a crash when mounting a subvolume
    Btrfs: fix sync subvol/snapshot creation
    Btrfs: Fix page leak in compressed writeback path
    Btrfs: do not BUG if we fail to remove the orphan item for dead snapshots
    Btrfs: fixup return code for btrfs_del_orphan_item
    Btrfs: do not do fast caching if we are allocating blocks for tree_root
    Btrfs: deal with space cache errors better
    Btrfs: fix use after free in O_DIRECT

    Linus Torvalds
     

14 Dec, 2010

1 commit

  • When we mount in RAID degraded mode without adding a new device to
    replace the failed one, we can end up using the wrong RAID flags for
    allocations.

    This results in strange combinations of block groups (raid1 in a raid10
    filesystem) and corruptions when we try to allocate blocks from single
    spindle chunks on drives that are actually missing.

    The first device has two small 4MB chunks in it that mkfs creates and
    these are usually unused in a raid1 or raid10 setup. But, in -o degraded,
    the allocator will fall back to these because the mask of desired raid groups
    isn't correct.

    The fix here is to count the missing devices as we build up the list
    of devices in the system. This count is used when picking the
    raid level to make sure we continue using the same levels that were
    in place before we lost a drive.

    Signed-off-by: Chris Mason

    Chris Mason
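
    The idea of counting missing devices when picking the RAID level can be sketched like this. The flag values, the device-count thresholds and the name `reduce_profile()` are assumptions made for the example, not taken from the commit.

```c
/* Hypothetical profile bits for this sketch; not the on-disk values. */
#define PROFILE_RAID0  (1u << 0)
#define PROFILE_RAID1  (1u << 1)
#define PROFILE_RAID10 (1u << 2)

/*
 * Drop RAID profiles that the filesystem cannot satisfy. Missing devices
 * are counted together with the writable ones, so a degraded mount keeps
 * the levels that were in place before a drive was lost.
 */
unsigned reduce_profile(unsigned flags, unsigned rw_devices,
                        unsigned missing_devices)
{
    unsigned num_devices = rw_devices + missing_devices;

    if (num_devices == 1)
        flags &= ~(PROFILE_RAID0 | PROFILE_RAID1);
    if (num_devices < 4)
        flags &= ~PROFILE_RAID10;
    return flags;
}
```

    With three writable devices and one missing, a RAID10 request stays RAID10; if the missing device were ignored, the same request would be reduced.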
     

13 Nov, 2010

1 commit

  • After recent blkdev_get() modifications, open_by_devnum() and
    open_bdev_exclusive() are simple wrappers around blkdev_get().
    Replace them with blkdev_get_by_dev() and blkdev_get_by_path().

    blkdev_get_by_dev() is identical to open_by_devnum().
    blkdev_get_by_path() is slightly different in that it doesn't
    automatically add %FMODE_EXCL to @mode.

    All users are converted. Most conversions are mechanical and don't
    introduce any behavior difference. There are several exceptions.

    * btrfs now sets FMODE_EXCL in btrfs_device->mode, so there's no
    reason to OR it explicitly on blkdev_put().

    * gfs2, nilfs2 and the generic mount_bdev() now set FMODE_EXCL in
    sb->s_mode.

    * With the above changes, sb->s_mode now always should contain
    FMODE_EXCL. WARN_ON_ONCE() added to kill_block_super() to detect
    errors.

    The new blkdev_get_*() functions come with proper docbook comments.
    While at it, add function description to blkdev_get() too.

    Signed-off-by: Tejun Heo
    Cc: Philipp Reisner
    Cc: Neil Brown
    Cc: Mike Snitzer
    Cc: Joern Engel
    Cc: Chris Mason
    Cc: Jan Kara
    Cc: "Theodore Ts'o"
    Cc: KONISHI Ryusuke
    Cc: reiserfs-devel@vger.kernel.org
    Cc: xfs-masters@oss.sgi.com
    Cc: Alexander Viro

    Tejun Heo
     

10 Sep, 2010

1 commit


22 Sep, 2009

1 commit

  • Currently, we can panic the box if the first block group we go to move is of a
    type where there is no space left to move those extents. For example, if we
    fill the disk up with data, and then we try to balance and we have no room to
    move the data nor room to allocate new chunks, we will panic. Change this by
    checking to see if we have room to move this chunk around, and if not, return
    -ENOSPC and move on to the next chunk. This will make sure we remove block
    groups that are moveable, like if we have a lot of empty metadata block groups,
    and then that way we make room to be able to balance our data chunks as well.
    Tested this with an fs that would panic on btrfs-vol -b normally, but no longer
    panics with this patch.

    V1->V2:
    -actually search for a free extent on the device to make sure we can allocate a
    chunk if need be.

    -fix btrfs_shrink_device to make sure we actually try to relocate all the
    chunks, and then if we can't return -ENOSPC so if we are doing a btrfs-vol -r
    we don't remove the device with data still on it.

    -check to make sure the block group we are going to relocate isn't the last one
    in that particular space

    -fix a bug in btrfs_shrink_device where we would change the device's size and
    not fix it if we fail to do our relocate

    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik
     

11 Jun, 2009

1 commit

  • On multi-device filesystems, btrfs writes supers to all of the devices
    before considering a sync complete. There wasn't any additional
    locking between super writeout and the device list management code
    because device management was done inside a transaction and
    super writeout only happened with no transaction writers running.

    With the btrfs fsync log and other async transaction updates, this
    has been racy for some time. This adds a mutex to protect
    the device list. The existing volume mutex could not be reused due to
    transaction lock ordering requirements.

    Signed-off-by: Chris Mason

    Chris Mason
     

10 Jun, 2009

1 commit

  • During mount, btrfs will check the queue nonrot flag
    for all the devices found in the FS. If they are all
    non-rotating, SSD mode is enabled by default.

    If the FS was mounted with -o nossd, the non-rotating
    flag is ignored.

    Signed-off-by: Chris Mason

    Chris Mason
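
    The mount-time decision described above reduces to a small predicate. `want_ssd_mode()` and its parameters are hypothetical names for this sketch; treating a filesystem with no devices as "not SSD" is also an assumption.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * nonrot[i] is true when device i reports itself as non-rotating.
 * SSD mode is enabled only if every device is non-rotating and the
 * user did not mount with -o nossd.
 */
bool want_ssd_mode(const bool *nonrot, size_t ndevs, bool nossd)
{
    if (nossd || ndevs == 0)
        return false;
    for (size_t i = 0; i < ndevs; i++)
        if (!nonrot[i])
            return false;
    return true;
}
```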
     

27 Apr, 2009

1 commit

  • Previously, we updated a device's size prior to attempting a shrink
    operation. This patch moves the device resizing logic to only happen if
    the shrink completes successfully. In the process, it introduces a new
    field to btrfs_device -- disk_total_bytes -- to track the on-disk size.

    Signed-off-by: Chris Ball
    Signed-off-by: Chris Mason

    Chris Ball
     

21 Apr, 2009

1 commit

  • Part of reducing fsync/O_SYNC/O_DIRECT latencies is using WRITE_SYNC for
    writes we plan on waiting on in the near future. This patch
    mirrors recent changes in other filesystems and the generic code to
    use WRITE_SYNC when WB_SYNC_ALL is passed and to use WRITE_SYNC for
    other latency critical writes.

    Btrfs uses async worker threads for checksumming before the write is done,
    and then again to actually submit the bios. The bio submission code just
    runs a per-device list of bios that need to be sent down the pipe.

    This list is split into low priority and high priority lists so the
    WRITE_SYNC IO happens first.

    Signed-off-by: Chris Mason

    Chris Mason
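
    The effect of splitting the pending bios into a high priority and a low priority list can be sketched as follows. The commit keeps two per-device lists; this example instead partitions one array, which yields the same submission order. `struct bio_stub` and `run_pending()` are hypothetical names.

```c
#include <stddef.h>

struct bio_stub {
    int id;
    int sync; /* non-zero for WRITE_SYNC-style, latency critical IO */
};

/*
 * Fill order[] with bio ids in submission order: all sync bios first,
 * then the rest, keeping FIFO order within each class.
 * order[] must have room for n entries. Returns the number written.
 */
size_t run_pending(const struct bio_stub *pending, size_t n, int *order)
{
    size_t out = 0;

    for (size_t i = 0; i < n; i++)
        if (pending[i].sync)
            order[out++] = pending[i].id;
    for (size_t i = 0; i < n; i++)
        if (!pending[i].sync)
            order[out++] = pending[i].id;
    return out;
}
```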
     

03 Apr, 2009

1 commit


12 Dec, 2008

1 commit

  • This patch makes it possible for a seed device to be shared by
    multiple mounted file systems. The sharing is achieved
    by cloning the seed device's btrfs_fs_devices structure.
    Thank you,

    Signed-off-by: Yan Zheng

    Yan Zheng
     

09 Dec, 2008

1 commit

  • This patch implements superblock duplication. Superblocks
    are stored at offsets 16K, 64M and 256G on every device.
    Space used by superblocks is preserved by the allocator,
    which uses a reverse mapping function to find the logical
    addresses that correspond to superblocks. Thank you,

    Signed-off-by: Yan Zheng

    Yan Zheng
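
    The three offsets quoted above form a geometric progression: each copy sits 4096 times further into the device than the previous one. A sketch that reproduces them is shown below; the constant names and `sb_offset()` are made up for the example, and the values are derived only from the offsets in this commit message.

```c
#include <stdint.h>

#define SUPER_FIRST_OFFSET (16 * 1024ULL) /* 16K, from the commit message */
#define SUPER_MIRROR_SHIFT 12             /* assumed: 64M / 16K == 1 << 12 */
#define SUPER_MIRROR_MAX   3              /* three copies are listed */

/* Byte offset of superblock copy `mirror` (0, 1 or 2) on a device. */
uint64_t sb_offset(unsigned mirror)
{
    return SUPER_FIRST_OFFSET << (SUPER_MIRROR_SHIFT * mirror);
}
```

    A device smaller than a given offset simply cannot hold that copy, so callers would need to compare `sb_offset(i)` against the device size.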
     

02 Dec, 2008

1 commit


20 Nov, 2008

1 commit


18 Nov, 2008

1 commit

  • A seed device is a special btrfs filesystem with the SEEDING super flag
    set; it can only be mounted in read-only mode. Seed
    devices allow people to create a new btrfs on top of them.

    The new FS contains the same contents as the seed device,
    but it can be mounted in read-write mode.

    This patch does the following:

    1) Split the code in btrfs_alloc_chunk into two parts. The first part makes
    the newly allocated chunk usable, but does not do any operation that modifies
    the chunk tree. The second part does the chunk tree modifications. This
    division is for the bootstrap step of adding storage to the seed device.

    2) Update device management code to handle seed device.
    The basic idea is: For an FS grown from seed devices, its
    seed devices are put into a list. Seed devices are
    opened on demand at mounting time. If any seed device is
    missing or has been changed, btrfs kernel module will
    refuse to mount the FS.

    3) make btrfs_find_block_group not return NULL when all
    block groups are read-only.

    Signed-off-by: Yan Zheng

    Yan Zheng
     

25 Sep, 2008

5 commits

  • The multi-bio code is responsible for duplicating blocks in raid1 and
    single spindle duplication. It has counters to make sure all of
    the locations for a given extent are properly written before io completion
    is returned to the higher layers.

    But, it didn't always complete the same bio it was given, sometimes a
    clone was completed instead. This led to problems with the async
    work queues because they saved a pointer to the bio in a struct off
    bi_private.

    The fix is to remember the original bio and only complete that one.

    Signed-off-by: Chris Mason

    Chris Mason
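
    The fix can be pictured with a small single-threaded model: the multi-bio remembers the original bio, counts outstanding stripes, and completes only the original when the last stripe finishes. `struct bio_stub`, `struct multi_bio` and `multi_end_io()` are hypothetical names, and a plain counter is used here where concurrent code would need an atomic one.

```c
#include <stddef.h>

struct bio_stub {
    int completed;      /* set when completion is reported upwards */
    void *private_data; /* upper layers may keep a pointer here */
};

struct multi_bio {
    int stripes_pending;       /* copies still in flight */
    struct bio_stub *orig_bio; /* the bio we were originally given */
};

/*
 * Called once per finished stripe. `bio` may be the original or a clone;
 * either way, only the remembered original is ever completed.
 */
void multi_end_io(struct multi_bio *multi, struct bio_stub *bio)
{
    (void)bio;
    if (--multi->stripes_pending == 0)
        multi->orig_bio->completed = 1;
}
```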
     
  • Btrfs has been using workqueues to spread the checksumming load across
    other CPUs in the system. But, workqueues only schedule work on the
    same CPU that queued the work, giving them a limited benefit for systems with
    higher CPU counts.

    This code adds a generic facility to schedule work with pools of kthreads,
    and changes the bio submission code to queue bios up. The queueing is
    important to make sure large numbers of procs on the system don't
    turn streaming workloads into random workloads by sending IO down
    concurrently.

    The end result of all of this is much higher performance (and CPU usage) when
    doing checksumming on large machines. Two worker pools are created,
    one for writes and one for endio processing. The two could deadlock if
    we tried to service both from a single pool.

    Signed-off-by: Chris Mason

    Chris Mason
     
  • Devices can change after the scan ioctls are done, and btrfs_open_devices
    needs to be able to verify them as they are opened and used by the FS.

    Signed-off-by: Chris Mason

    Chris Mason
     
  • Signed-off-by: Chris Mason

    Chris Mason
     
  • This required a few structural changes to the code that manages bdev pointers:

    The VFS super block now gets an anon-bdev instead of a pointer to the
    lowest bdev. This allows us to avoid swapping the super block bdev pointer
    around at run time.

    The code to read in the super block no longer goes through the extent
    buffer interface. Things got ugly keeping the mapping constant.

    Signed-off-by: Chris Mason

    Chris Mason