11 Jan, 2012

2 commits

  • * 'writeback-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linux:
    writeback: move MIN_WRITEBACK_PAGES to fs-writeback.c
    writeback: balanced_rate cannot exceed write bandwidth
    writeback: do strict bdi dirty_exceeded
    writeback: avoid tiny dirty poll intervals
    writeback: max, min and target dirty pause time
    writeback: dirty ratelimit - think time compensation
    btrfs: fix dirtied pages accounting on sub-page writes
    writeback: fix dirtied pages accounting on redirty
    writeback: fix dirtied pages accounting on sub-page writes
    writeback: charge leaked page dirties to active tasks
    writeback: Include all dirty inodes in background writeback

    Linus Torvalds
     
  • Tell the page allocator that pages allocated for a buffered write are
    expected to become dirty soon.

    Signed-off-by: Johannes Weiner
    Reviewed-by: Rik van Riel
    Acked-by: Mel Gorman
    Cc: Minchan Kim
    Cc: Michal Hocko
    Cc: KAMEZAWA Hiroyuki
    Cc: Christoph Hellwig
    Cc: Wu Fengguang
    Cc: Dave Chinner
    Cc: Jan Kara
    Cc: Shaohua Li
    Cc: Chris Mason
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

18 Dec, 2011

1 commit

  • When doing 1KB sequential writes to the same page,
    balance_dirty_pages_ratelimited_nr() should be called once instead of 4
    times; the extra calls throttle the dirtier tasks far too heavily.

    Fix it with proper de-accounting on clear_page_dirty_for_io().
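
    The 4-versus-1 count can be illustrated with a minimal sketch. It models
    only the arithmetic, not the kernel accounting code; the helper names and
    the 4096-byte page size are assumptions made for illustration.

```c
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096u   /* assumed page size */

/* Number of pages touched by a write of len bytes starting at pos. */
static size_t sketch_pages_spanned(size_t pos, size_t len)
{
    if (len == 0)
        return 0;
    return (pos + len - 1) / SKETCH_PAGE_SIZE - pos / SKETCH_PAGE_SIZE + 1;
}

/* Without de-accounting: every write counts each page it touches again. */
size_t sketch_dirtied_naive(size_t nr_writes, size_t write_size)
{
    size_t pos = 0, total = 0;

    for (size_t i = 0; i < nr_writes; i++) {
        total += sketch_pages_spanned(pos, write_size);
        pos += write_size;
    }
    return total;
}

/* With de-accounting: a page redirtied by sequential writes counts once. */
size_t sketch_dirtied_once(size_t nr_writes, size_t write_size)
{
    return sketch_pages_spanned(0, nr_writes * write_size);
}
```

    For four sequential 1KB writes, sketch_dirtied_naive(4, 1024) yields 4
    while sketch_dirtied_once(4, 1024) yields 1.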

    CC: Chris Mason
    Signed-off-by: Wu Fengguang

    Wu Fengguang
     

17 Dec, 2011

2 commits

  • …inux/kernel/git/mason/linux-btrfs

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
    Btrfs: unplug every once and a while
    Btrfs: deal with NULL srv_rsv in the delalloc inode reservation code
    Btrfs: only set cache_generation if we setup the block group
    Btrfs: don't panic if orphan item already exists
    Btrfs: fix leaked space in truncate
    Btrfs: fix how we do delalloc reservations and how we free reservations on error
    Btrfs: deal with enospc from dirtying inodes properly
    Btrfs: fix num_workers_starting bug and other bugs in async thread
    BTRFS: Establish i_ops before calling d_instantiate
    Btrfs: add a cond_resched() into the worker loop
    Btrfs: fix ctime update of on-disk inode
    btrfs: keep orphans for subvolume deletion
    Btrfs: fix inaccurate available space on raid0 profile
    Btrfs: fix wrong disk space information of the files
    Btrfs: fix wrong i_size when truncating a file to a larger size
    Btrfs: fix btrfs_end_bio to deal with write errors to a single mirror

    * 'for-linus-3.2' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
    btrfs: lower the dirty balance poll interval

    Linus Torvalds
     
  • Tests show that the original large intervals can easily cause the dirty
    limit to be exceeded with 100 concurrent dd's. So adapt the interval to
    be at most as large as the next check point selected by the dirty
    throttling algorithm.

    Signed-off-by: Wu Fengguang
    Signed-off-by: Chris Mason

    Wu Fengguang
     

16 Dec, 2011

1 commit

  • Now that we're properly keeping track of delayed inode space we've been getting
    a lot of warnings out of btrfs_dirty_inode() when running xfstest 83. This is
    because a bunch of people call mark_inode_dirty, which is void so we can't
    return ENOSPC. This needs to be fixed in a few areas:

    1) file_update_time - this updates the mtime and such when writing to a file,
    which will call mark_inode_dirty. So copy file_update_time into btrfs so we can
    call btrfs_dirty_inode directly and return an error if we get one appropriately.

    2) fix symlinks to use btrfs_setattr for ->setattr. For some reason we weren't
    setting ->setattr for symlinks, even though we should have been. This catches
    one of the cases where we were getting errors in mark_inode_dirty.

    3) Fix btrfs_setattr and btrfs_setsize to call btrfs_dirty_inode directly
    instead of mark_inode_dirty. This lets us return errors properly for truncate
    and chown/anything related to setattr.

    4) Add a new btrfs_fs_dirty_inode which will just call btrfs_dirty_inode and
    print an error if we have one. The only remaining user we can't control for
    this is touch_atime(), but we don't really want to keep people from walking
    down the tree if we don't have space to save the atime update, so just complain
    but don't worry about it.

    With this patch xfstests 83 complains a handful of times instead of hundreds of
    times. Thanks,

    Signed-off-by: Josef Bacik

    Josef Bacik
     

07 Nov, 2011

1 commit

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (114 commits)
    Btrfs: check for a null fs root when writing to the backup root log
    Btrfs: fix race during transaction joins
    Btrfs: fix a potential btrfs_bio leak on scrub fixups
    Btrfs: rename btrfs_bio multi -> bbio for consistency
    Btrfs: stop leaking btrfs_bios on readahead
    Btrfs: stop the readahead threads on failed mount
    Btrfs: fix extent_buffer leak in the metadata IO error handling
    Btrfs: fix the new inspection ioctls for 32 bit compat
    Btrfs: fix delayed insertion reservation
    Btrfs: ClearPageError during writepage and clean_tree_block
    Btrfs: be smarter about committing the transaction in reserve_metadata_bytes
    Btrfs: make a delayed_block_rsv for the delayed item insertion
    Btrfs: add a log of past tree roots
    btrfs: separate superblock items out of fs_info
    Btrfs: use the global reserve when truncating the free space cache inode
    Btrfs: release metadata from global reserve if we have to fallback for unlink
    Btrfs: make sure to flush queued bios if write_cache_pages waits
    Btrfs: fix extent pinning bugs in the tree log
    Btrfs: make sure btrfs_remove_free_space doesn't leak EAGAIN
    Btrfs: don't wait as long for more batches during SSD log commit
    ...

    Linus Torvalds
     

28 Oct, 2011

1 commit

  • The i_mutex lock use of generic_file_llseek hurts. Independent processes
    accessing the same file synchronize over a single lock, even though
    they have no need for synchronization at all.

    Under high utilization this can cause llseek to scale very poorly on larger
    systems.

    This patch does some rethinking of the llseek locking model:

    First the 64bit f_pos is not necessarily atomic without locks
    on 32bit systems. This can already cause races with read() today.
    This was discussed on linux-kernel in the past and deemed acceptable.
    The patch does not change that.

    Let's look at the different seek variants:

    SEEK_SET: Doesn't really need any locking.
    If there's a race one writer wins, the other loses.

    For 32bit the non atomic update races against read()
    stay the same. Without a lock they can also happen
    against write() now. The read() race was deemed
    acceptable in past discussions, and I think if it's
    ok for read it's ok for write too.

    => Don't need a lock.

    SEEK_END: This behaves like SEEK_SET plus it reads
    the maximum size too. Reading the maximum size would have the
    32bit atomic problem. But luckily we already have a way to read
    the maximum size without locking (i_size_read), so we
    can just use that instead.

    Without i_mutex there is no synchronization with write() anymore,
    however since the write() update is atomic on 64bit it just behaves
    like another racy SEEK_SET. On non atomic 32bit it's the same
    as SEEK_SET.

    => Don't need a lock, but need to use i_size_read()

    SEEK_CUR: This has a read-modify-write race window
    on the same file. One could argue that any application
    doing unsynchronized seeks on the same file is already broken.
    But for the sake of not adding a regression here I'm
    using the file->f_lock to synchronize this. Using this
    lock is much better than the inode mutex because it doesn't
    synchronize between processes.

    => So still need a lock, but can use f_lock.

    This patch implements this new scheme in generic_file_llseek.
    I dropped generic_file_llseek_unlocked and changed all callers.
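
    The scheme above can be sketched in userspace C. This is a minimal
    illustration, not the kernel implementation: the sketch_* types and
    functions are hypothetical, a C11 atomic_flag spinlock stands in for
    file->f_lock, a plain field read stands in for i_size_read(), and
    overflow checks are omitted.

```c
#include <errno.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>   /* SEEK_SET, SEEK_CUR, SEEK_END */

/* Hypothetical stand-ins for struct inode and struct file. */
struct sketch_inode {
    int64_t size;            /* the kernel would read this via i_size_read() */
};

struct sketch_file {
    struct sketch_inode *inode;
    int64_t pos;             /* stands in for f_pos */
    atomic_flag f_lock;      /* per-file spinlock, stands in for file->f_lock */
};

static void sketch_spin_lock(atomic_flag *l)
{
    while (atomic_flag_test_and_set_explicit(l, memory_order_acquire))
        ;                    /* spin until the flag was previously clear */
}

static void sketch_spin_unlock(atomic_flag *l)
{
    atomic_flag_clear_explicit(l, memory_order_release);
}

/* Returns the new position, or -EINVAL for a bad offset or whence. */
int64_t sketch_llseek(struct sketch_file *f, int64_t offset, int whence)
{
    int64_t newpos;

    switch (whence) {
    case SEEK_SET:
        /* No lock: if two seekers race, one of them simply wins. */
        newpos = offset;
        break;
    case SEEK_END:
        /* No lock: one lockless size read, then the same as SEEK_SET. */
        newpos = f->inode->size + offset;
        break;
    case SEEK_CUR:
        /* Read-modify-write of pos: serialize on the per-file lock only. */
        sketch_spin_lock(&f->f_lock);
        newpos = f->pos + offset;
        if (newpos >= 0)
            f->pos = newpos;
        sketch_spin_unlock(&f->f_lock);
        return newpos >= 0 ? newpos : -EINVAL;
    default:
        return -EINVAL;
    }

    if (newpos < 0)
        return -EINVAL;
    f->pos = newpos;
    return newpos;
}
```

    Only the SEEK_CUR branch takes a lock, and that lock belongs to the open
    file rather than the inode, so independent processes with their own file
    descriptions do not contend with each other.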

    Signed-off-by: Andi Kleen
    Signed-off-by: Christoph Hellwig

    Andi Kleen
     

20 Oct, 2011

2 commits

  • Johannes pointed out we were allocating only kernel pages for doing writes,
    which is kind of a big deal if you are on 32bit and have more than a gig of ram.
    So fix our allocations to use the mapping's gfp but still clear __GFP_FS so we
    don't re-enter. Thanks,
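
    The mask manipulation described here amounts to one bitwise operation.
    A minimal sketch follows; the SKETCH_* flag values are made up for
    illustration and do not match the real kernel GFP bits.

```c
/* Illustrative flag bits only; real kernel GFP values differ. */
#define SKETCH_GFP_FS       0x1u   /* allocator may call back into the fs */
#define SKETCH_GFP_HIGHMEM  0x2u   /* allocation may come from highmem */
#define SKETCH_GFP_IO       0x4u   /* allocator may start I/O */

/* Keep whatever the mapping allows, but never re-enter the filesystem. */
unsigned int sketch_write_gfp(unsigned int mapping_gfp)
{
    return mapping_gfp & ~SKETCH_GFP_FS;
}
```

    Starting from the mapping's own mask keeps highmem eligible, which is
    the point of the fix on 32bit machines with more than a gig of ram.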

    Reported-by: Johannes Weiner
    Signed-off-by: Josef Bacik

    Josef Bacik
     
  • Lukas found a problem where if he tries to fallocate over the same region twice
    and the first fallocate took up all the space we would fail with ENOSPC. This
    is because we reserve the total space we want to use for fallocate, regardless
    of whether or not we will have to actually preallocate. So instead move the
    check into the loop where we actually have to do the preallocate. Thanks,
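
    A minimal sketch of moving the space check into the loop, under the
    assumption that the region is already split into ranges flagged as
    allocated or not. The sketch_range type and the flat free_space counter
    are hypothetical simplifications.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of one piece of the fallocate region. */
struct sketch_range {
    int64_t len;
    int already_allocated;   /* nonzero if an earlier fallocate covered it */
};

/* Check and consume space only for ranges that really need preallocation. */
int sketch_fallocate(const struct sketch_range *r, size_t n, int64_t *free_space)
{
    for (size_t i = 0; i < n; i++) {
        if (r[i].already_allocated)
            continue;                    /* nothing to reserve for this one */
        if (r[i].len > *free_space)
            return -ENOSPC;
        *free_space -= r[i].len;
    }
    return 0;
}
```

    With this shape, a second fallocate over a fully allocated region
    succeeds even when no free space is left.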

    Tested-by: Lukas Czerner
    Signed-off-by: Josef Bacik

    Josef Bacik
     

04 Oct, 2011

1 commit


01 Oct, 2011

1 commit

  • A user reported a problem where ceph was getting into 100% cpu usage while doing
    some writing. It turns out it's because we were doing a short write on a not
    uptodate page, which means we'd fall back to one page at a time and fault the
    page in. The problem is our position is on the page boundary, so our fault in
    logic wasn't actually reading the page, so we'd just spin forever or until the
    page got read in by somebody else. This will force a readpage if we end up
    doing a short copy. Alexandre could reproduce this easily with ceph and reports
    it fixes his problem. I also wrote a reproducer that no longer hangs my box
    with this patch. Thanks,

    Reported-and-tested-by: Alexandre Oliva
    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik
     

18 Sep, 2011

1 commit

  • The recent reworking of btrfs' lseek led to incorrect
    values being returned. This adds checks for seeking
    beyond EOF in SEEK_HOLE and makes sure the error
    values come back correct.

    Andi Kleen also sent in similar patches.

    Signed-off-by: Jie Liu
    Reported-by: Andi Kleen
    Signed-off-by: Chris Mason

    Jeff Liu
     

13 Sep, 2011

1 commit

  • * 'for-linus' of git://github.com/chrismason/linux:
    Btrfs: add dummy extent if dst offset excceeds file end in
    Btrfs: calc file extent num_bytes correctly in file clone
    btrfs: xattr: fix attribute removal
    Btrfs: fix wrong nbytes information of the inode
    Btrfs: fix the file extent gap when doing direct IO
    Btrfs: fix unclosed transaction handle in btrfs_cont_expand
    Btrfs: fix misuse of trans block rsv
    Btrfs: reset to appropriate block rsv after orphan operations
    Btrfs: skip locking if searching the commit root in csum lookup
    btrfs: fix warning in iput for bad-inode
    Btrfs: fix an oops when deleting snapshots

    Linus Torvalds
     

11 Sep, 2011

1 commit

  • When we write data beyond the end of the file in direct I/O mode, a data
    hole is created, and Btrfs should insert a file extent item that points
    to this hole into the fs tree. Unfortunately, Btrfs forgets to do so.

    The following is a simple way to reproduce it:
    # mkfs.btrfs /dev/sdc2
    # mount /dev/sdc2 /test4
    # touch /test4/a
    # dd if=/dev/zero of=/test4/a seek=8 count=1 bs=4K oflag=direct conv=nocreat,notrunc
    # umount /test4
    # btrfsck /dev/sdc2
    root 5 inode 257 errors 100

    Reported-by: Tsutomu Itoh
    Signed-off-by: Miao Xie
    Tested-by: Tsutomu Itoh
    Signed-off-by: Chris Mason

    Miao Xie
     

18 Aug, 2011

3 commits


17 Aug, 2011

1 commit


03 Aug, 2011

1 commit

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: (31 commits)
    Btrfs: don't call writepages from within write_full_page
    Btrfs: Remove unused variable 'last_index' in file.c
    Btrfs: clean up for find_first_extent_bit()
    Btrfs: clean up for wait_extent_bit()
    Btrfs: clean up for insert_state()
    Btrfs: remove unused members from struct extent_state
    Btrfs: clean up code for merging extent maps
    Btrfs: clean up code for extent_map lookup
    Btrfs: clean up search_extent_mapping()
    Btrfs: remove redundant code for dir item lookup
    Btrfs: make acl functions really no-op if acl is not enabled
    Btrfs: remove remaining ref-cache code
    Btrfs: remove a BUG_ON() in btrfs_commit_transaction()
    Btrfs: use wait_event()
    Btrfs: check the nodatasum flag when writing compressed files
    Btrfs: copy string correctly in INO_LOOKUP ioctl
    Btrfs: don't print the leaf if we had an error
    btrfs: make btrfs_set_root_node void
    Btrfs: fix oops while writing data to SSD partitions
    Btrfs: Protect the readonly flag of block group
    ...

    Fix up trivial conflicts (due to acl and writeback cleanups) in
    - fs/btrfs/acl.c
    - fs/btrfs/ctree.h
    - fs/btrfs/extent_io.c

    Linus Torvalds
     

02 Aug, 2011

3 commits


28 Jul, 2011

3 commits

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
    Btrfs: make sure reserve_metadata_bytes doesn't leak out strange errors
    Btrfs: use the commit_root for reading free_space_inode crcs
    Btrfs: reduce extent_state lock contention for metadata
    Btrfs: remove lockdep magic from btrfs_next_leaf
    Btrfs: make a lockdep class for each root
    Btrfs: switch the btrfs tree locks to reader/writer
    Btrfs: fix deadlock when throttling transactions
    Btrfs: stop using highmem for extent_buffers
    Btrfs: fix BUG_ON() caused by ENOSPC when relocating space
    Btrfs: tag pages for writeback in sync
    Btrfs: fix enospc problems with delalloc
    Btrfs: don't flush delalloc arbitrarily
    Btrfs: use find_or_create_page instead of grab_cache_page
    Btrfs: use a worker thread to do caching
    Btrfs: fix how we merge extent states and deal with cached states
    Btrfs: use the normal checksumming infrastructure for free space cache
    Btrfs: serialize flushers in reserve_metadata_bytes
    Btrfs: do transaction space reservation before joining the transaction
    Btrfs: try to only do one btrfs_search_slot in do_setxattr

    Linus Torvalds
     
  • So I had this brilliant idea to use atomic counters for outstanding and reserved
    extents, but this turned out to be a bad idea. Consider this where we have 1
    outstanding extent and 1 reserved extent

    Reserver                                    Releaser
                                                atomic_dec(outstanding) now 0
    atomic_read(outstanding)+1 get 1
    atomic_read(reserved) get 1
    don't actually reserve anything because
    they are the same
                                                atomic_cmpxchg(reserved, 1, 0)
    atomic_inc(outstanding)
    atomic_add(0, reserved)
                                                free reserved space for 1 extent

    Then the reserver now has no actual space reserved for it, and when it goes to
    finish the ordered IO it won't have enough space to do its allocation and you
    get those lovely warnings.
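
    The interleaving can be replayed deterministically on a single thread.
    The following is a minimal sketch using C11 atomics in place of the
    kernel atomic_t helpers; the extent_counters struct and replay_race()
    are hypothetical names, and the actual freeing of space is modelled by
    an out-parameter.

```c
#include <stdatomic.h>

/* Hypothetical stand-in for the two per-inode counters in the message. */
struct extent_counters {
    atomic_int outstanding;
    atomic_int reserved;
};

/*
 * Replays the steps in the order given above.
 * Returns how many extents' worth of space the reserver reserved.
 */
int replay_race(struct extent_counters *c, int *freed_by_releaser)
{
    int nr, have, to_reserve, old;

    /* Releaser: drop one outstanding extent (1 -> 0). */
    atomic_fetch_sub(&c->outstanding, 1);

    /* Reserver: wants outstanding + 1 = 1 and sees reserved = 1. */
    nr = atomic_load(&c->outstanding) + 1;
    have = atomic_load(&c->reserved);
    to_reserve = nr > have ? nr - have : 0;   /* equal, so reserve nothing */

    /* Releaser: swap reserved from 1 to 0 and take 1 extent to free. */
    old = 1;
    *freed_by_releaser =
        atomic_compare_exchange_strong(&c->reserved, &old, 0) ? 1 : 0;

    /* Reserver: account for its extent using the now stale decision. */
    atomic_fetch_add(&c->outstanding, 1);
    atomic_fetch_add(&c->reserved, to_reserve);

    return to_reserve;
}
```

    Starting from outstanding = 1 and reserved = 1, the replay ends with
    outstanding = 1 and reserved = 0: one extent is outstanding with no
    reservation behind it. Each operation is atomic on its own, but the
    decision spans two counters, which is why the pair needs to be updated
    under a single lock.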

    Signed-off-by: Josef Bacik
    Signed-off-by: Chris Mason

    Josef Bacik
     
  • grab_cache_page will use mapping_gfp_mask(), which for all inodes is set to
    GFP_HIGHUSER_MOVABLE. So instead use find_or_create_page in all cases where we
    need GFP_NOFS so we don't deadlock. Thanks,

    Signed-off-by: Josef Bacik

    Josef Bacik
     

21 Jul, 2011

2 commits

  • Btrfs needs to be able to control how filemap_write_and_wait_range() is called
    in fsync to make it less of a painful operation, so push down taking i_mutex and
    the calling of filemap_write_and_wait() down into the ->fsync() handlers. Some
    file systems can drop taking the i_mutex altogether it seems, like ext3 and
    ocfs2. For correctness' sake I just pushed everything down in all cases to make
    sure that we keep the current behavior the same for everybody, and then each
    individual fs maintainer can make up their mind about what to do from there.
    Thanks,

    Acked-by: Jan Kara
    Signed-off-by: Josef Bacik
    Signed-off-by: Al Viro

    Josef Bacik
     
  • In order to handle SEEK_HOLE/SEEK_DATA we need to implement our own llseek.
    Basically for the normal SEEK_*'s we will just defer to the generic helper, and
    for SEEK_HOLE/SEEK_DATA we will use our fiemap helper to figure out the nearest
    hole or data. Currently this helper doesn't check for delalloc bytes for
    prealloc space, so for now treat prealloc as data until that is fixed. Thanks,
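
    A minimal sketch of the hole/data lookup, assuming the file is described
    by a sorted, contiguous array of extents. The sketch_* names are
    hypothetical and this does not use the real fiemap helper; returning
    -ENXIO for an offset at or past EOF follows lseek(2), not this message.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

enum sketch_kind { SK_HOLE, SK_DATA, SK_PREALLOC };

struct sketch_extent {
    int64_t start;
    int64_t len;
    enum sketch_kind kind;
};

/* Prealloc is reported as data for now, as the message describes. */
static int sketch_is_data(const struct sketch_extent *e)
{
    return e->kind != SK_HOLE;
}

/*
 * exts[] is sorted, contiguous, and covers [0, size).
 * want_data != 0 behaves like SEEK_DATA, 0 like SEEK_HOLE.
 */
int64_t sketch_seek_hole_data(const struct sketch_extent *exts, size_t n,
                              int64_t size, int64_t offset, int want_data)
{
    if (offset < 0 || offset >= size)
        return -ENXIO;

    for (size_t i = 0; i < n; i++) {
        int64_t end = exts[i].start + exts[i].len;

        if (end <= offset)
            continue;                    /* extent lies before the offset */
        if (sketch_is_data(&exts[i]) == (want_data != 0))
            return offset > exts[i].start ? offset : exts[i].start;
    }
    /* No later hole: use the implicit hole at EOF. No later data: ENXIO. */
    return want_data ? -ENXIO : size;
}
```

    With extents data [0, 4096), hole [4096, 8192), prealloc [8192, 12288),
    a SEEK_DATA from 5000 lands on 8192 because prealloc counts as data.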

    Signed-off-by: Josef Bacik
    Signed-off-by: Al Viro

    Josef Bacik
     

15 Jul, 2011

1 commit

  • This patch fixes many callers of btrfs_alloc_path() which BUG_ON allocation
    failure. All the sites that are fixed in this patch were checked by me to
    be fairly trivial to fix because of at least one of two criteria:

    - Callers of the function catch errors from it already so bubbling the
    error up will be handled.
    - Callers of the function might BUG_ON any nonzero return code in which
    case there is no behavior changed (but we still got to remove a BUG_ON)

    The following functions were updated:

    btrfs_lookup_extent, alloc_reserved_tree_block, btrfs_remove_block_group,
    btrfs_lookup_csums_range, btrfs_csum_file_blocks, btrfs_mark_extent_written,
    btrfs_inode_by_name, btrfs_new_inode, btrfs_symlink,
    insert_reserved_file_extent, and run_delalloc_nocow
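
    The shape of each fix is roughly the following. This is a minimal
    userspace sketch with hypothetical sketch_* names, where calloc() stands
    in for the real allocator and a flag simulates allocation failure.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct btrfs_path. */
struct sketch_path {
    int slots[8];
};

/* Stand-in for btrfs_alloc_path(); `fail` simulates an allocation failure. */
static struct sketch_path *sketch_alloc_path(int fail)
{
    return fail ? NULL : calloc(1, sizeof(struct sketch_path));
}

/* After the fix: report the failure instead of BUG_ON(!path). */
int sketch_lookup(int simulate_failure)
{
    struct sketch_path *path = sketch_alloc_path(simulate_failure);

    if (!path)
        return -ENOMEM;   /* callers already handle nonzero returns */

    /* ... the tree search would happen here ... */
    free(path);
    return 0;
}
```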

    Signed-off-by: Mark Fasheh

    Mark Fasheh
     

04 Jun, 2011

2 commits


28 May, 2011

1 commit

  • git://git.kernel.org/pub/scm/linux/kernel/git/josef/btrfs-work into for-linus

    Conflicts:
    fs/btrfs/disk-io.c
    fs/btrfs/extent-tree.c
    fs/btrfs/free-space-cache.c
    fs/btrfs/inode.c
    fs/btrfs/transaction.c

    Signed-off-by: Chris Mason

    Chris Mason
     

27 May, 2011

1 commit

  • This will detect small random writes into files and
    queue them up for an auto defrag process. It isn't well suited to
    database workloads yet, but works for smaller files such as rpm, sqlite
    or bdb databases.

    Signed-off-by: Chris Mason

    Chris Mason
     

24 May, 2011

1 commit

  • We use trans_mutex for lots of things, here's a basic list

    1) To serialize trans_handles joining the currently running transaction
    2) To make sure that no new trans handles are started while we are committing
    3) To protect the dead_roots list and the transaction lists

    Serializing trans_handles joining is not too hard, but it can get bogged
    down in acquiring a reference to the transaction. So replace the
    trans_mutex with a trans_lock spinlock and use it to do the following

    1) Protect fs_info->running_transaction. All trans handles have to do is check
    this, and then take a reference of the transaction and keep on going.
    2) Protect the fs_info->trans_list. This doesn't get used too much, basically
    it just holds the current transactions, which will usually just be the currently
    committing transaction and the currently running transaction at most.
    3) Protect the dead roots list. This is only ever processed by splicing the
    list so this is relatively simple.
    4) Protect the fs_info->reloc_ctl stuff. This is very lightweight and was using
    the trans_mutex before, so this is a pretty straightforward change.
    5) Protect fs_info->no_trans_join. Because we don't hold the trans_lock over
    the entirety of the commit we need to have a way to block new people from
    creating a new transaction while we're doing our work. So we set no_trans_join
    and in join_transaction we test to see if that is set, and if it is we do a
    wait_on_commit.
    6) Make the transaction use count atomic so we don't need to take locks to
    modify it when we're dropping references.
    7) Add a commit_lock to the transaction to make sure multiple people trying to
    commit the same transaction don't race and commit at the same time.
    8) Make open_ioctl_trans an atomic so we don't have to take any locks for ioctl
    trans.
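
    Points 1, 5 and 6 can be sketched together. This is a minimal userspace
    illustration, not the btrfs code: the sketch_* names are hypothetical, a
    C11 atomic_flag spinlock stands in for trans_lock, and returning NULL
    stands in for the caller having to do a wait_on_commit.

```c
#include <stdatomic.h>
#include <stddef.h>

struct sketch_trans {
    atomic_int use_count;            /* point 6: atomic reference count */
};

struct sketch_fs_info {
    atomic_flag trans_lock;          /* spinlock replacing trans_mutex */
    struct sketch_trans *running;    /* point 1: protected by trans_lock */
    int no_trans_join;               /* point 5: set while committing */
};

static void sketch_lock(atomic_flag *l)
{
    while (atomic_flag_test_and_set_explicit(l, memory_order_acquire))
        ;                            /* spin */
}

static void sketch_unlock(atomic_flag *l)
{
    atomic_flag_clear_explicit(l, memory_order_release);
}

/* Returns the running transaction with a reference taken, or NULL if the
 * caller has to wait for the commit in progress. */
struct sketch_trans *sketch_join(struct sketch_fs_info *fs)
{
    struct sketch_trans *t = NULL;

    sketch_lock(&fs->trans_lock);
    if (!fs->no_trans_join && fs->running) {
        t = fs->running;
        atomic_fetch_add(&t->use_count, 1);
    }
    sketch_unlock(&fs->trans_lock);
    return t;
}

/* Dropping a reference needs no lock because the count is atomic. */
void sketch_put(struct sketch_trans *t)
{
    atomic_fetch_sub(&t->use_count, 1);
}
```

    The critical section in sketch_join() is a pointer check and an
    increment, which is what makes a spinlock a reasonable fit here.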

    I have tested this with xfstests, but obviously it is a pretty hairy change so
    lots of testing is greatly appreciated. Thanks,

    Signed-off-by: Josef Bacik

    Josef Bacik
     

23 May, 2011

1 commit


02 May, 2011

3 commits


25 Apr, 2011

1 commit

  • There's a potential problem on 32bit systems when we exhaust 32bit inode
    numbers and start to allocate big inode numbers, because btrfs uses
    inode->i_ino in many places.

    So here we always use BTRFS_I(inode)->location.objectid, which is a
    u64 variable.

    There are 2 exceptions where BTRFS_I(inode)->location.objectid !=
    inode->i_ino: the btree inode (0 vs 1) and empty subvol dirs (256 vs 2),
    and inode->i_ino will be used in those cases.

    Another reason to make this change is I'm going to use a special inode
    to save free ino cache, and the inode number must be > (u64)-256.
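
    The truncation problem can be shown with plain integer arithmetic. A
    minimal sketch, assuming i_ino is a 32-bit unsigned long as on 32bit
    systems; the sketch_* helpers are hypothetical.

```c
#include <stdint.h>

/* What a 32-bit i_ino would keep of a 64-bit objectid. */
uint32_t sketch_i_ino_32(uint64_t objectid)
{
    return (uint32_t)objectid;       /* high 32 bits are discarded */
}

/* True when a 32-bit i_ino can no longer identify the inode. */
int sketch_ino_truncated(uint64_t objectid)
{
    return (uint64_t)sketch_i_ino_32(objectid) != objectid;
}
```

    For example, (uint64_t)-256 is 0xFFFFFFFFFFFFFF00, which a 32-bit i_ino
    would store as 0xFFFFFF00, so any inode number above that threshold
    cannot be represented without the u64 objectid.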

    Signed-off-by: Li Zefan

    Li Zefan
     

09 Apr, 2011

1 commit

  • Currently we don't handle running out of space in the cache, so to fix this we
    keep track of how far in the cache we are. Then we only dirty the pages if we
    successfully modify all of them, otherwise if we have an error or run out of
    space we can just drop them and not worry about the vm writing them out.
    Thanks,
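
    The all-or-nothing dirtying can be sketched as follows. This is a
    minimal illustration with hypothetical sketch_* names; the fixed number
    of entries per page and the -ENOSPC return value are assumptions, and
    entries_per_page must be nonzero.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for one page of the cache file. */
struct sketch_page {
    int modified;
    int dirty;
};

/* Write nr_entries into the pages; dirty them only if every entry fit. */
int sketch_write_cache(struct sketch_page *pages, size_t nr_pages,
                       size_t nr_entries, size_t entries_per_page)
{
    size_t used = 0;                 /* how far into the cache we are */

    for (size_t i = 0; i < nr_entries; i++) {
        size_t idx = used / entries_per_page;

        if (idx >= nr_pages) {
            /* Out of room: drop the changes and dirty nothing. */
            for (size_t p = 0; p < nr_pages; p++)
                pages[p].modified = 0;
            return -ENOSPC;
        }
        pages[idx].modified = 1;
        used++;
    }

    /* Every entry fit: only now mark the touched pages dirty. */
    for (size_t p = 0; p < nr_pages; p++)
        if (pages[p].modified)
            pages[p].dirty = 1;
    return 0;
}
```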

    Tested-by: Johannes Hirte
    Signed-off-by: Josef Bacik

    Josef Bacik