01 Jul, 2013

1 commit


17 Jun, 2013

1 commit


13 Jun, 2013

10 commits

  • Return the FIEMAP_EXTENT_UNKNOWN flag as well except the
    FIEMAP_EXTENT_DELALLOC because the data location of an
    delayed allocation extent is unknown.

    Signed-off-by: Jie Liu

    Jie Liu
     
  • Commit b6e96d0067d8 ("jbd2: use module parameters instead of debugfs
    for jbd_debug") removed any need for a dependency on DEBUG_FS. It
    also moved the /sys variables out from underneath the typical debugfs
    mount point. Delete the dependency and update the /sys path to where
    the debug settings are currently.

    Signed-off-by: Paul Gortmaker
    Signed-off-by: "Theodore Ts'o"

    Paul Gortmaker
     
  • Since the jbd_debug() is implemented with two separate printk()
    calls, it can lead to corrupted and misleading debug output like
    the following (see lines marked with "*"):

    [ 290.339362] (fs/jbd2/journal.c, 203): kjournald2: kjournald2 wakes
    [ 290.339365] (fs/jbd2/journal.c, 155): kjournald2: commit_sequence=42103, commit_request=42104
    [ 290.339369] (fs/jbd2/journal.c, 158): kjournald2: OK, requests differ
    [* 290.339376] (fs/jbd2/journal.c, 648): jbd2_log_wait_commit:
    [* 290.339379] (fs/jbd2/commit.c, 370): jbd2_journal_commit_transaction: JBD2: want 42104, j_commit_sequence=42103
    [* 290.339382] JBD2: starting commit of transaction 42104
    [ 290.339410] (fs/jbd2/revoke.c, 566): jbd2_journal_write_revoke_records: Wrote 0 revoke records
    [ 290.376555] (fs/jbd2/commit.c, 1088): jbd2_journal_commit_transaction: JBD2: commit 42104 complete, head 42079

    i.e. the debug output from log_wait_commit and journal_commit_transaction
    have become interleaved. The output should have been:

    (fs/jbd2/journal.c, 648): jbd2_log_wait_commit: JBD2: want 42104, j_commit_sequence=42103
    (fs/jbd2/commit.c, 370): jbd2_journal_commit_transaction: JBD2: starting commit of transaction 42104

    It is expected that this is not easy to replicate -- I was only able
    to cause it on preempt-rt kernels, and even then only under heavy
    I/O load.

    Reported-by: Paul Gortmaker
    Suggested-by: "Theodore Ts'o"
    Signed-off-by: Paul Gortmaker
    Signed-off-by: "Theodore Ts'o"

    Paul Gortmaker
     
  • The bit_spinlock functions are only used for the jbd_lock_bh_state
    functions (and friends) in jbd_common.h and are not directly used
    by either of jbd.h or jbd2.h content.

    The jbd_common file is new as of commit 446066724c36 ("jdb/jbd2: factor
    out common functions from the jbd[2] header files") but common
    (and isolated) headers were not considered for factoring at that time.

    Signed-off-by: Paul Gortmaker
    Signed-off-by: "Theodore Ts'o"

    Paul Gortmaker
     
  • Currently we see this output:

    $git grep phase fs/jbd2
    fs/jbd2/commit.c: jbd_debug(3, "JBD2: commit phase 1\n");
    fs/jbd2/commit.c: jbd_debug(3, "JBD2: commit phase 2\n");
    fs/jbd2/commit.c: jbd_debug(3, "JBD2: commit phase 2\n");
    fs/jbd2/commit.c: jbd_debug(3, "JBD2: commit phase 3\n");
    fs/jbd2/commit.c: jbd_debug(3, "JBD2: commit phase 4\n");
    [...]

    There is clearly a duplicate label for phase 2, and they are
    both active (i.e. not in #if ... #else block). Rename them to
    be "2a" and "2b" so the debug output is unambiguous.

    Signed-off-by: Paul Gortmaker
    Signed-off-by: "Theodore Ts'o"

    Paul Gortmaker
     
  • While trying to debug an an issue under extreme I/O loading
    on preempt-rt kernels, the following backtrace was observed
    via SysRQ output:

    rm D ffff8802203afbc0 4600 4878 4748 0x00000000
    ffff8802217bfb78 0000000000000082 ffff88021fc2bb80 ffff88021fc2bb80
    ffff88021fc2bb80 ffff8802217bffd8 ffff8802217bffd8 ffff8802217bffd8
    ffff88021f1d4c80 ffff88021fc2bb80 ffff8802217bfb88 ffff88022437b000
    Call Trace:
    [] schedule+0x24/0x70
    [] jbd2_log_wait_commit+0xbd/0x140
    [] ? __init_waitqueue_head+0x50/0x50
    [] jbd2_log_do_checkpoint+0xf5/0x520
    [] __jbd2_log_wait_for_space+0xa9/0x1f0
    [] start_this_handle.isra.10+0x2e0/0x530
    [] ? __init_waitqueue_head+0x50/0x50
    [] jbd2__journal_start+0xc3/0x110
    [] ? ext4_rmdir+0x6e/0x230
    [] jbd2_journal_start+0xe/0x10
    [] ext4_journal_start_sb+0x5b/0x160
    [] ext4_rmdir+0x6e/0x230
    [] vfs_rmdir+0xd5/0x140
    [] do_rmdir+0xdf/0x120
    [] ? task_work_run+0x44/0x80
    [] ? do_notify_resume+0x89/0x100
    [] ? int_signal+0x12/0x17
    [] sys_unlinkat+0x25/0x40
    [] system_call_fastpath+0x16/0x1b

    What is interesting here, is that we call log_wait_commit, from
    within wait_for_space, but we are still holding the checkpoint_mutex
    as it surrounds mostly the whole of wait_for_space. And then, as we
    are waiting, journal_commit_transaction can run, and if the JBD2_FLUSHED
    bit is set, then we will also try to take the same checkpoint_mutex.

    It seems that we need to drop the checkpoint_mutex while sitting in
    jbd2_log_wait_commit, if we want to guarantee that progress can be made
    by jbd2_journal_commit_transaction(). There does not seem to be
    anything preempt-rt specific about this, other then perhaps increasing
    the odds of it happening.

    Signed-off-by: Paul Gortmaker
    Signed-off-by: "Theodore Ts'o"

    Paul Gortmaker
     
  • The state lock is taken after we are doing an assert on the state
    value, not before. So we might in fact be doing an assert on a
    transient value. Ensure the state check is within the scope of
    the state lock being taken.

    Signed-off-by: Paul Gortmaker
    Signed-off-by: "Theodore Ts'o"

    Paul Gortmaker
     
  • If filesystem was aborted after inode's write back is complete
    but before its metadata was updated we may return success
    results in data loss.
    In order to handle fs abort correctly we have to check
    fs state once we discover that it is in MS_RDONLY state

    Test case: http://patchwork.ozlabs.org/patch/244297

    Reviewed-by: Jan Kara
    Signed-off-by: Dmitry Monakhov
    Signed-off-by: "Theodore Ts'o"

    Dmitry Monakhov
     
  • Inode's data or non journaled quota may be written w/o jounral so we
    _must_ send a barrier at the end of ext4_sync_fs. But it can be
    skipped if journal commit will do it for us.

    Also fix data integrity for nojournal mode.

    Signed-off-by: Dmitry Monakhov
    Signed-off-by: "Theodore Ts'o"

    Dmitry Monakhov
     
  • Current implementation of jbd2_journal_force_commit() is suboptimal because
    result in empty and useless commits. But callers just want to force and wait
    any unfinished commits. We already have jbd2_journal_force_commit_nested()
    which does exactly what we want, except we are guaranteed that we do not hold
    journal transaction open.

    Signed-off-by: Dmitry Monakhov
    Signed-off-by: "Theodore Ts'o"

    Dmitry Monakhov
     

12 Jun, 2013

2 commits

  • Commit 18888cf0883c: "ext4: speed up truncate/unlink by not using
    bforget() unless needed" removed the use of EXT4_FREE_BLOCKS_FORGET in
    the most important codepath for file systems using extents, but a
    similar optimization also can be done for file systems using indirect
    blocks, and for the two special cases in the ext4 extents code.

    Cc: Andrey Sidorov
    Signed-off-by: "Theodore Ts'o"

    Theodore Ts'o
     
  • For a file systems with a very large number of block groups, if all of
    the block group bitmaps are in memory and the file system is
    relatively badly fragmented, it's possible ext4_mb_regular_allocator()
    to take a long time trying to find a good match. This is especially
    true if the tuning parameter mb_max_to_scan has been sent to a very
    large number. So add a cond_resched() to avoid soft lockup warnings
    and to provide better system responsiveness.

    For ext4_free_blocks(), if we are deleting a large range of blocks,
    and data=journal is enabled so that EXT4_FREE_BLOCKS_FORGET is passed,
    the loop to call sb_find_get_block() and to call ext4_forget() can
    take over 10-15 milliseocnds or more. So it's better to add a
    cond_resched() here a well.

    Signed-off-by: "Theodore Ts'o"

    Theodore Ts'o
     

07 Jun, 2013

1 commit

  • Rename ext4_da_writepages() to ext4_writepages() and use it for all
    modes. We still need to iterate over all the pages in the case of
    data=journalling, but in the case of nodelalloc/data=ordered (which is
    what file systems mounted using ext3 backwards compatibility will use)
    this will allow us to use a much more efficient I/O submission path.

    Signed-off-by: "Theodore Ts'o"

    Theodore Ts'o
     

06 Jun, 2013

4 commits


05 Jun, 2013

21 commits

  • Now that we clear PageWriteback after extent conversion, there's no
    need to wait for io_end processing in ext4_evict_inode(). Running
    AIO/DIO keeps file reference until aio_complete() is called so
    ext4_evict_inode() cannot be called. For io_end structures resulting
    from buffered IO waiting is happening because we wait for
    PageWriteback in truncate_inode_pages().

    Reviewed-by: Zheng Liu
    Signed-off-by: Jan Kara
    Signed-off-by: "Theodore Ts'o"

    Jan Kara
     
  • We don't have to wait for extent conversion in ext4_punch_hole() as
    buffered IO for the punched range has been flushed and waited upon
    (thus all extent conversions for that range have completed). Also we
    wait for all DIO to finish using inode_dio_wait() so there cannot be
    any extent conversions pending due to direct IO.

    Also remove ext4_flush_unwritten_io() since it's unused now.

    Reviewed-by: Zheng Liu
    Signed-off-by: Jan Kara
    Signed-off-by: "Theodore Ts'o"

    Jan Kara
     
  • We don't have to wait for unwritten extent conversion in
    ext4_ind_direct_IO() as all writes that happened before DIO are
    flushed by the generic code and extent conversion has happened before
    we cleared PageWriteback bit.

    Reviewed-by: Zheng Liu
    Signed-off-by: Jan Kara
    Signed-off-by: "Theodore Ts'o"

    Jan Kara
     
  • After removal of ext4_flush_unwritten_io() call, ext4_file_sync()
    doesn't need i_mutex anymore. Forcing of transaction commits doesn't
    need i_mutex as there's nothing inode specific in that code apart from
    grabbing transaction ids from the inode. So remove the lock.

    Reviewed-by: Zheng Liu
    Signed-off-by: Jan Kara
    Signed-off-by: "Theodore Ts'o"

    Jan Kara
     
  • Just use the generic function instead of duplicating it. We only need
    to reshuffle the read-only check a bit (which is there to prevent
    writing to a filesystem which has been remounted read-only after error
    I assume).

    Reviewed-by: Zheng Liu
    Signed-off-by: Jan Kara
    Signed-off-by: "Theodore Ts'o"

    Jan Kara
     
  • Since PageWriteback bit is now cleared after extents are converted
    from unwritten to written ones, we have full exclusion of writeback
    path from truncate (truncate_inode_pages() waits for PageWriteback
    bits to get cleared on all invalidated pages). Exclusion from DIO
    path is achieved by inode_dio_wait() call in ext4_setattr(). So
    there's no need to wait for extent convertion in ext4_truncate()
    anymore.

    Reviewed-by: Zheng Liu
    Signed-off-by: Jan Kara
    Signed-off-by: "Theodore Ts'o"

    Jan Kara
     
  • Make sure extent conversion after DIO happens while i_dio_count is
    still elevated so that inode_dio_wait() waits until extent conversion
    is done. This removes the need for explicit waiting for extent
    conversion in some cases.

    Reviewed-by: Zheng Liu
    Signed-off-by: Jan Kara
    Signed-off-by: "Theodore Ts'o"

    Jan Kara
     
  • Currently PageWriteback bit gets cleared from put_io_page() called
    from ext4_end_bio(). This is somewhat inconvenient as extent tree is
    not fully updated at that time (unwritten extents are not marked as
    written) so we cannot read the data back yet. This design was
    dictated by lock ordering as we cannot start a transaction while
    PageWriteback bit is set (we could easily deadlock with
    ext4_da_writepages()). But now that we use transaction reservation
    for extent conversion, locking issues are solved and we can move
    PageWriteback bit clearing after extent conversion is done. As a
    result we can remove wait for unwritten extent conversion from
    ext4_sync_file() because it already implicitely happens through
    wait_on_page_writeback().

    We implement deferring of PageWriteback clearing by queueing completed
    bios to appropriate io_end and processing all the pages when io_end is
    going to be freed instead of at the moment ext4_io_end() is called.

    Reviewed-by: Zheng Liu
    Signed-off-by: Jan Kara
    Signed-off-by: "Theodore Ts'o"

    Jan Kara
     
  • Now that we have extent conversions with reserved transaction, we have
    to prevent extent conversions without reserved transaction (from DIO
    code) to block these (as that would effectively void any transaction
    reservation we did). So split lists, work items, and work queues to
    reserved and unreserved parts.

    Reviewed-by: Zheng Liu
    Signed-off-by: Jan Kara
    Signed-off-by: "Theodore Ts'o"

    Jan Kara
     
  • Later we would like to clear PageWriteback bit only after extent
    conversion from unwritten to written extents is performed. However it
    is not possible to start a transaction after PageWriteback is set
    because that violates lock ordering (and is easy to deadlock). So we
    have to reserve a transaction before locking pages and sending them
    for IO and later we use the transaction for extent conversion from
    ext4_end_io().

    Reviewed-by: Zheng Liu
    Signed-off-by: Jan Kara
    Signed-off-by: "Theodore Ts'o"

    Jan Kara
     
  • There isn't any need for setting BH_Uninit on buffers anymore. It was
    only used to signal we need to mark io_end as needing extent
    conversion in add_bh_to_extent() but now we can mark the io_end
    directly when mapping extent.

    Reviewed-by: Zheng Liu
    Signed-off-by: Jan Kara
    Signed-off-by: "Theodore Ts'o"

    Jan Kara
     
  • There are two issues with current writeback path in ext4. For one we
    don't necessarily map complete pages when blocksize < pagesize and
    thus needn't do any writeback in one iteration. We always map some
    blocks though so we will eventually finish mapping the page. Just if
    writeback races with other operations on the file, forward progress is
    not really guaranteed. The second problem is that current code
    structure makes it hard to associate all the bios to some range of
    pages with one io_end structure so that unwritten extents can be
    converted after all the bios are finished. This will be especially
    difficult later when io_end will be associated with reserved
    transaction handle.

    We restructure the writeback path to a relatively simple loop which
    first prepares extent of pages, then maps one or more extents so that
    no page is partially mapped, and once page is fully mapped it is
    submitted for IO. We keep all the mapping and IO submission
    information in mpage_da_data structure to somewhat reduce stack usage.
    Resulting code is somewhat shorter than the old one and hopefully also
    easier to read.

    Reviewed-by: Zheng Liu
    Signed-off-by: Jan Kara
    Signed-off-by: "Theodore Ts'o"

    Jan Kara
     
  • We limit the number of blocks written in a single loop of
    ext4_da_writepages() to 64 when inode uses indirect blocks. That is
    unnecessary as credit estimates for mapping logically continguous run
    of blocks is rather low even for inode with indirect blocks. So just
    lift this limitation and properly calculate the number of necessary
    credits.

    This better credit estimate will also later allow us to always write
    at least a single page in one iteration.

    Reviewed-by: Zheng Liu
    Signed-off-by: Jan Kara
    Signed-off-by: "Theodore Ts'o"

    Jan Kara
     
  • ext4_ind_trans_blocks() wrongly used 'chunk' argument to decide whether
    blocks mapped are logically contiguous. That is wrong since the argument
    informs whether the blocks are physically contiguous. As the blocks
    mapped are always logically contiguous and that's all
    ext4_ind_trans_blocks() cares about, just remove the 'chunk' argument.

    Reviewed-by: Zheng Liu
    Signed-off-by: Jan Kara
    Signed-off-by: "Theodore Ts'o"

    Jan Kara
     
  • This attribute is now unused so deprecate it. We still show the old
    default value to keep some compatibility but we don't allow writing to
    that attribute anymore.

    Signed-off-by: Jan Kara
    Signed-off-by: "Theodore Ts'o"

    Jan Kara
     
  • Writeback code got better in how it submits IO and now the number of
    pages requested to be written is usually higher than original 1024.
    The number is now dynamically computed based on observed throughput
    and is set to be about 0.5 s worth of writeback. E.g. on ordinary
    SATA drive this ends up somewhere around 10000 as my testing shows.
    So remove the unnecessary smarts from ext4_da_writepages().

    Reviewed-by: Zheng Liu
    Signed-off-by: Jan Kara
    Signed-off-by: "Theodore Ts'o"

    Jan Kara
     
  • Reviewed-by: Zheng Liu
    Signed-off-by: Jan Kara
    Signed-off-by: "Theodore Ts'o"

    Jan Kara
     
  • In some cases we cannot start a transaction because of locking
    constraints and passing started transaction into those places is not
    handy either because we could block transaction commit for too long.
    Transaction reservation is designed to solve these issues. It
    reserves a handle with given number of credits in the journal and the
    handle can be later attached to the running transaction without
    blocking on commit or checkpointing. Reserved handles do not block
    transaction commit in any way, they only reduce maximum size of the
    running transaction (because we have to always be prepared to
    accomodate request for attaching reserved handle).

    Signed-off-by: Jan Kara
    Signed-off-by: "Theodore Ts'o"

    Jan Kara
     
  • j_wait_logspace and j_wait_checkpoint are unused. Remove them.

    Reviewed-by: Zheng Liu
    Signed-off-by: Jan Kara
    Signed-off-by: "Theodore Ts'o"

    Jan Kara
     
  • jbd2_journal_extend() first checked whether transaction can accept
    extending handle with more credits and then added credits to
    t_outstanding_credits. This can race with start_this_handle() adding
    another handle to a transaction and thus overbooking a transaction.
    Make jbd2_journal_extend() use atomic_add_return() to close the race.

    Reviewed-by: Zheng Liu
    Signed-off-by: Jan Kara
    Signed-off-by: "Theodore Ts'o"

    Jan Kara
     
  • __jbd2_log_space_left() and jbd_space_needed() were kind of odd.
    jbd_space_needed() accounted also credits needed for currently
    committing transaction while it didn't account for credits needed for
    control blocks. __jbd2_log_space_left() then accounted for control
    blocks as a fraction of free space. Since results of these two
    functions are always only compared against each other, this works
    correct but is somewhat strange. Move the estimates so that
    jbd_space_needed() returns number of blocks needed for a transaction
    including control blocks and __jbd2_log_space_left() returns free
    space in the journal (with the committing transaction already
    subtracted). Rename functions to jbd2_log_space_left() and
    jbd2_space_needed() while we are changing them.

    Reviewed-by: Zheng Liu
    Signed-off-by: Jan Kara
    Signed-off-by: "Theodore Ts'o"

    Jan Kara