16 Sep, 2009

3 commits


21 Jul, 2009

1 commit

  • The function journal_write_metadata_buffer() calls jbd_unlock_bh_state(bh_in)
    too early; this could potentially allow another thread to call get_write_access
    on the buffer head, modify the data, and dirty it, allowing the wrong data
    to be written into the journal. Fortunately, if we lose this race, the only
    time this will actually cause filesystem corruption is if there is a system
    crash or other unclean shutdown of the system before the next commit can take
    place.

    Signed-off-by: dingdinghua
    Acked-by: "Theodore Ts'o"
    Signed-off-by: Jan Kara

    dingdinghua
     

16 Jul, 2009

2 commits

  • The following race can happen:

    CPU1                                        CPU2
    checkpointing code checks the buffer, adds
    it to an array for writeback
                                                do_get_write_access()
                                                ...
                                                lock_buffer()
                                                unlock_buffer()
    flush_batch() submits the buffer for IO
                                                __jbd_journal_file_buffer()

    So a buffer under writeout is returned from do_get_write_access(). Since
    the filesystem code relies on the fact that journaled buffers cannot be
    written out, it does not take the buffer lock and so it can modify the
    buffer while it is under writeout. That can lead to filesystem corruption
    if we crash at the right moment. A similar problem can happen with
    the journal_get_create_access() path.
    We fix the problem by clearing the buffer dirty bit under buffer_lock
    even if the buffer is on the BJ_None list. Actually, we clear the dirty bit
    regardless of the list the buffer is in and warn about the fact if
    the buffer is already journalled.

    Thanks for spotting the problem go to dingdinghua.

    Reported-by: dingdinghua
    Signed-off-by: Jan Kara

    Jan Kara
     
  • Due to on-disk corruption, the journal can turn out to be too short. Fail
    to load it in that case so that we don't oops somewhere later.
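
    The check described above can be sketched in user-space C as follows.
    This is only an illustration: the function name, the parameters and the
    SKETCH_MIN_JOURNAL_BLOCKS limit are assumptions made for the example,
    not the names or values used by the actual patch.

```c
#include <errno.h>

/* Hypothetical minimum journal size in blocks, assumed for this sketch. */
#define SKETCH_MIN_JOURNAL_BLOCKS 1024UL

/*
 * Return 0 if a journal spanning blocks first..last (inclusive) is long
 * enough to be loaded, or -EINVAL if corruption left it too short.
 */
int sketch_check_journal_length(unsigned long first, unsigned long last)
{
    if (last < first)
        return -EINVAL;         /* nonsensical geometry */
    if (last - first + 1 < SKETCH_MIN_JOURNAL_BLOCKS)
        return -EINVAL;         /* too short: refuse to load */
    return 0;
}
```

    Rejecting the journal at load time turns a later oops into an ordinary
    error return that the mounting code can report.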

    Reported-by: Nageswara R Sastry
    Signed-off-by: Jan Kara

    Jan Kara
     

19 Jun, 2009

1 commit

  • This patch deletes the following patch:

    "commit 3f31fddfa26b7594b44ff2b34f9a04ba409e0f91
    Author: Mingming Cao
    Date: Fri Jul 25 01:46:22 2008 -0700

    jbd: fix race between free buffer and commit transaction"

    That patch is no longer needed: if the race between freeing a buffer and
    committing a transaction occurs and dio gets an error, dio now falls back
    to buffered IO thanks to the following patch.

    commit 6ccfa806a9cfbbf1cd43d5b6aa47ef2c0eb518fd
    Author: Hisashi Hifumi
    Date: Tue Sep 2 14:35:40 2008 -0700

    VFS: fix dio write returning EIO when try_to_release_page fails

    Signed-off-by: Hisashi Hifumi
    Cc: Theodore Tso
    Cc: Mingming Cao
    Acked-by: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hisashi Hifumi
     

10 Jun, 2009

1 commit

  • In commit code, we scan buffers attached to a transaction. During this
    scan, we sometimes have to drop j_list_lock and then we recheck whether
    the journal buffer head didn't get freed by journal_try_to_free_buffers().
    But checking for buffer_jbd(bh) isn't enough because a new journal head
    could get attached to our buffer head. So add a check whether the journal
    head remained the same and whether it's still at the same transaction and
    list.

    This is a nasty bug and can cause problems like memory corruption (use after
    free) or trigger various assertions in JBD code (observed).
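
    A minimal user-space sketch of the recheck described above. The struct
    layouts and the sketch_* names are simplified stand-ins invented for
    this example, not the kernel's definitions.

```c
#include <stdbool.h>
#include <stddef.h>

struct sketch_transaction { int id; };

struct sketch_journal_head {
    struct sketch_transaction *b_transaction;  /* owning transaction */
    int b_jlist;                               /* list the buffer is filed on */
};

struct sketch_buffer_head {
    struct sketch_journal_head *b_private;     /* attached journal head or NULL */
};

/*
 * Called after the list lock has been dropped and re-taken: is bh still
 * the buffer we were looking at before the lock was dropped?
 */
bool sketch_still_ours(const struct sketch_buffer_head *bh,
                       const struct sketch_journal_head *jh,
                       const struct sketch_transaction *commit_transaction,
                       int expected_list)
{
    if (bh->b_private == NULL)
        return false;           /* journal head was freed */
    if (bh->b_private != jh)
        return false;           /* a new journal head got attached */
    if (jh->b_transaction != commit_transaction)
        return false;           /* moved to another transaction */
    return jh->b_jlist == expected_list;
}
```

    Note that jh is dereferenced only after the buffer head has confirmed
    that it is still attached, which is what avoids the use after free.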

    Signed-off-by: Jan Kara
    Cc:
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     

24 Apr, 2009

1 commit


14 Apr, 2009

2 commits


06 Apr, 2009

1 commit

  • When we are going to be submitting several sync writes, we want to
    give the IO scheduler a chance to merge some of them. Instead of
    using the implicitly unplugging WRITE_SYNC variant, use WRITE_SYNC_PLUG
    and rely on sync_buffer() doing the unplug when someone does a
    wait_on_buffer()/lock_buffer().

    Signed-off-by: Jens Axboe
    Signed-off-by: Linus Torvalds

    Jens Axboe
     

04 Apr, 2009

1 commit


03 Apr, 2009

1 commit


28 Mar, 2009

1 commit


12 Feb, 2009

1 commit

  • journal_start_commit() returns 1 if either a transaction is committing or
    the function has queued a transaction commit. But it returns 0 if we
    raced with somebody queueing the transaction commit as well. This
    resulted in ext3_sync_fs() not functioning correctly (description from
    Arthur Jones): In the case of a data=ordered umount with pending long
    symlinks which are delayed due to a long list of other I/O on the backing
    block device, this causes the buffer associated with the long symlinks to
    not be moved to the inode dirty list in the second phase of fsync_super.
    Then, before they can be dirtied again, kjournald exits, seeing the UMOUNT
    flag and the dirty pages are never written to the backing block device,
    causing long symlink corruption and exposing new or previously freed block
    data to userspace.

    This can be reproduced with a script created by Eric Sandeen:

    #!/bin/bash

    umount /mnt/test2
    mount /dev/sdb4 /mnt/test2
    rm -f /mnt/test2/*
    dd if=/dev/zero of=/mnt/test2/bigfile bs=1M count=512
    touch /mnt/test2/thisisveryveryveryveryveryveryveryveryveryveryveryveryveryveryveryverylongfilename
    ln -s /mnt/test2/thisisveryveryveryveryveryveryveryveryveryveryveryveryveryveryveryverylongfilename \
        /mnt/test2/link
    umount /mnt/test2
    mount /dev/sdb4 /mnt/test2
    ls /mnt/test2/

    This patch fixes journal_start_commit() to always return 1 when there's
    a transaction committing or queued for commit.
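
    The fixed return-value rule can be sketched like this. The structures
    and the sketch_* names are simplified assumptions for illustration; the
    real function also has to take locks and compare transaction IDs with
    wraparound in mind.

```c
#include <stddef.h>

typedef unsigned int sketch_tid_t;

struct sketch_transaction { sketch_tid_t t_tid; };

struct sketch_journal {
    struct sketch_transaction *j_running_transaction;
    struct sketch_transaction *j_committing_transaction;
    sketch_tid_t j_commit_request;  /* newest tid a commit was requested for */
};

/*
 * Return 1 whenever a transaction is committing or queued for commit,
 * including the case where somebody else queued the commit first.
 */
int sketch_start_commit(struct sketch_journal *j, sketch_tid_t *ptid)
{
    if (j->j_running_transaction) {
        sketch_tid_t tid = j->j_running_transaction->t_tid;

        j->j_commit_request = tid;  /* idempotent if already requested */
        if (ptid)
            *ptid = tid;
        return 1;
    }
    if (j->j_committing_transaction) {
        if (ptid)
            *ptid = j->j_committing_transaction->t_tid;
        return 1;
    }
    return 0;                       /* nothing to wait for */
}
```

    A caller such as a sync path can then wait on the returned tid whenever
    the result is 1, without caring who queued the commit.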

    Cc: Eric Sandeen
    Cc: Mike Snitzer
    Cc:
    Signed-off-by: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     

09 Jan, 2009

2 commits

  • Remove excess kernel-doc from fs/jbd/transaction.c:

    Warning(linux-2.6.28-git5//fs/jbd/transaction.c:764): Excess function parameter 'credits' description in 'journal_get_write_access'

    Signed-off-by: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Randy Dunlap
     
  • There is a flaw with the way jbd handles fsync batching. If we fsync() a
    file and we were not the last person to run fsync() on this fs then we
    automatically sleep for 1 jiffy in order to wait for new writers to join
    the transaction before forcing the commit. The problem with this is
    that with really fast storage (e.g. a Clariion) the time it takes to commit
    a transaction to disk is way faster than 1 jiffy in most cases, so
    sleeping means waiting longer with nothing to do than if we just committed
    the transaction and kept going. Ric Wheeler noticed this when using
    fs_mark with more than 1 thread: the throughput would plummet as he added
    more threads.

    This patch attempts to fix this problem by recording the average time in
    nanoseconds that it takes to commit a transaction to disk, and the time at
    which we started the transaction. If we run an fsync() and we have been
    running for less time than it takes to commit the transaction to disk, we
    sleep for the delta amount of time and then commit to disk. We achieve
    sub-jiffy sleeping using schedule_hrtimeout. This means that the wait
    time is auto-tuned to the speed of the underlying disk, instead of having
    this static timeout. I weighted the average according to somebody's
    comments (Andreas Dilger I think) in order to help normalize random
    outliers where we take way longer or way less time to commit than the
    average. I also have a min() check in there to make sure we don't sleep
    longer than a jiffy in case our storage is super slow; this was requested
    by Andrew.
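
    The arithmetic described above can be sketched as two small helpers.
    The 3:1 weighting of the old average and all sketch_* names are
    assumptions made for this example; the message does not state the
    weights the patch uses.

```c
#include <stdint.h>

/* Fold one measured commit time into the running average (nanoseconds). */
uint64_t sketch_update_average(uint64_t average_ns, uint64_t commit_ns)
{
    if (average_ns == 0)
        return commit_ns;       /* first sample seeds the average */
    /* Assumed weighting: the old average counts three times as much. */
    return (commit_ns + 3 * average_ns) / 4;
}

/*
 * How long an fsync() caller should sleep before forcing the commit:
 * the part of the average commit time the transaction has not yet been
 * running for, never more than one jiffy.
 */
uint64_t sketch_batch_sleep_ns(uint64_t running_ns, uint64_t average_ns,
                               uint64_t jiffy_ns)
{
    uint64_t cap = average_ns < jiffy_ns ? average_ns : jiffy_ns;

    if (running_ns >= cap)
        return 0;               /* already waited long enough */
    return cap - running_ns;
}
```

    With fast storage the average shrinks and the sleep approaches zero,
    while slow storage is capped at the old one-jiffy behaviour.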

    I unfortunately do not have access to a Clariion, so I had to use a
    ramdisk to represent a super fast array. I tested with a SATA drive with
    barrier=1 to make sure there was no regression with local disks, with
    a 4 way multipathed Apple Xserve RAID array, and of course with the
    ramdisk. I ran the following command:

    fs_mark -d /mnt/ext3-test -s 4096 -n 2000 -D 64 -t $i

    where $i was 2, 4, 8, 16 and 32. I mkfs'ed the fs each time. Here are my
    results:

    type     threads  with patch  without patch
    sata           2        24.6           26.3
    sata           4        49.2           48.1
    sata           8        70.1           67.0
    sata          16       104.0           94.1
    sata          32       153.6          142.7

    xserve         2       246.4          222.0
    xserve         4       480.0          440.8
    xserve         8       829.5          730.8
    xserve        16      1172.7         1026.9
    xserve        32      1816.3         1650.5

    ramdisk        2      2538.3         1745.6
    ramdisk        4      2942.3          661.9
    ramdisk        8      2882.5          999.8
    ramdisk       16      2738.7         1801.9
    ramdisk       32      2541.9         2394.0

    Signed-off-by: Josef Bacik
    Cc: Andreas Dilger
    Cc: Arjan van de Ven
    Cc: Ric Wheeler
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Josef Bacik
     

07 Nov, 2008

1 commit

  • Commit be07c4ed introduced a regression because it assumed that if
    there were no transactions ready to be checkpointed, that no progress
    could be made on making space available in the journal, and so the
    journal should be aborted. This assumption is false; it could be the
    case that simply calling cleanup_journal_tail() will recover the
    necessary space, or, for small journals, the currently committing
    transaction could be responsible for chewing up the required space in
    the log, so we need to wait for the currently committing transaction
    to finish before trying to force a checkpoint operation.
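
    The order of decisions described above can be sketched as a pure
    function. The enum, the struct and their names are hypothetical and
    exist only for this illustration; the real loop performs these steps
    under the journal locks.

```c
#include <stdbool.h>

enum sketch_action {
    SKETCH_CHECKPOINT,          /* checkpoint the oldest transaction */
    SKETCH_RETRY,               /* space was recovered: re-check free space */
    SKETCH_WAIT_FOR_COMMIT,     /* wait for the committing transaction */
    SKETCH_ABORT                /* nothing can free space: abort the journal */
};

struct sketch_state {
    bool has_checkpoint_transactions;
    bool tail_cleanup_freed_space;  /* cleanup_journal_tail() helped */
    bool has_committing_transaction;
};

/* One iteration of waiting for log space, as described in the text. */
enum sketch_action sketch_wait_for_space_step(const struct sketch_state *s)
{
    if (s->has_checkpoint_transactions)
        return SKETCH_CHECKPOINT;
    if (s->tail_cleanup_freed_space)
        return SKETCH_RETRY;
    if (s->has_committing_transaction)
        return SKETCH_WAIT_FOR_COMMIT;
    return SKETCH_ABORT;
}
```

    Aborting is the last resort here, reached only when neither tail
    cleanup nor a pending commit can make progress.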

    This patch fixes the bug reported by Meelis Roos at:
    http://bugzilla.kernel.org/show_bug.cgi?id=11937

    Signed-off-by: "Theodore Ts'o"
    Cc: Duane Griffin
    Cc: Toshiyuki Okajima

    Theodore Ts'o
     

31 Oct, 2008

1 commit

  • Delete excess kernel-doc notation in fs/ subdirectory:

    Warning(linux-2.6.27-git10//fs/jbd/transaction.c:886): Excess function parameter or struct member 'credits' description in 'journal_get_undo_access'

    Signed-off-by: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Randy Dunlap
     

23 Oct, 2008

3 commits

  • The __log_wait_for_space function sits in a loop checkpointing
    transactions until there is sufficient space free in the journal.
    However, if there are no transactions to be processed (e.g. because the
    free space calculation is wrong due to a corrupted filesystem) it will
    never progress.

    Check for space being required when no transactions are outstanding and
    abort the journal instead of endlessly looping.

    This patch fixes the bug reported by Sami Liedes at:
    http://bugzilla.kernel.org/show_bug.cgi?id=10976

    Signed-off-by: Duane Griffin
    Tested-by: Sami Liedes
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Duane Griffin
     
  • __try_to_free_cp_buf(), __process_buffer(), and __wait_cp_io() test the
    BH_Uptodate flag to detect write I/O errors on metadata buffers. But since
    commit 95450f5a7e53d5752ce1a0d0b8282e10fe745ae0 "ext3: don't read inode
    block if the buffer has a write error"(*), the BH_Uptodate flag can be set
    on inode buffers with BH_Write_EIO in order to avoid reading old inode data.
    So now we have to test the BH_Write_EIO flag of checkpointing inode buffers
    instead of BH_Uptodate. This patch does that.
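
    Why the old test misses the error can be seen in a small sketch. The
    bit positions and the sketch_* names are made up for the example; only
    the two flag meanings come from the text above.

```c
#include <stdbool.h>

/* Hypothetical bit positions standing in for the buffer state flags. */
#define SKETCH_BH_UPTODATE  (1u << 0)
#define SKETCH_BH_WRITE_EIO (1u << 1)

/* Old detection: a buffer that is not up to date is assumed to have failed. */
bool sketch_old_write_failed(unsigned int state)
{
    return !(state & SKETCH_BH_UPTODATE);
}

/* New detection: look at the dedicated write-error flag. */
bool sketch_new_write_failed(unsigned int state)
{
    return (state & SKETCH_BH_WRITE_EIO) != 0;
}
```

    For an inode buffer that carries both flags, the old test reports
    success while the new one reports the write error.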

    Signed-off-by: Hidehiro Kawai
    Acked-by: Jan Kara
    Acked-by: Eric Sandeen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hidehiro Kawai
     
  • When a checkpointing IO fails, current JBD code doesn't check the error
    and continues journaling. This means the latest metadata can be lost from
    both the journal and the filesystem.

    This patch leaves the failed metadata blocks in the journal space and
    aborts journaling in the case of log_do_checkpoint(). To achieve this, we
    need to do:

    1. don't remove the failed buffer from the checkpoint list in
       the case of __try_to_free_cp_buf(), because it may be released or
       overwritten by a later transaction
    2. log_do_checkpoint() is the last chance, so remove the failed buffer
       from the checkpoint list and abort the journal
    3. when checkpointing fails, don't update the journal super block, to
       prevent the journaled contents from being cleaned. For safety,
       don't update j_tail and j_tail_sequence either
    4. when checkpointing fails, notify the ext3 layer of this error so
       that ext3 doesn't clear the needs_recovery flag; otherwise the
       journaled contents are ignored and cleaned in the recovery phase
    5. if the recovery fails, keep the needs_recovery flag
    6. prevent cleanup_journal_tail() from being called between
       __journal_drop_transaction() and journal_abort() (a race issue
       between journal_flush() and __log_wait_for_space())

    Signed-off-by: Hidehiro Kawai
    Acked-by: Jan Kara
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hidehiro Kawai
     

21 Oct, 2008

1 commit


20 Oct, 2008

4 commits

  • In ordered mode, if a file data buffer being dirtied exists in the
    committing transaction, we write the buffer to the disk, move it from the
    committing transaction to the running transaction, then dirty it. But we
    must not remove the buffer from the committing transaction when the
    buffer couldn't be written out; otherwise the error would be missed and
    the committing transaction would not abort.

    This patch adds an error check before removing the buffer from the
    committing transaction.

    Signed-off-by: Hidehiro Kawai
    Acked-by: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hidehiro Kawai
     
  • If the journal doesn't abort when it gets an IO error in file data blocks,
    the file data corruption will spread silently. Because most
    applications and commands do buffered writes without fsync(), they don't
    notice the IO error. That is scary for mission critical systems. On the
    other hand, if the journal aborts whenever it gets an IO error in file
    data blocks, the system will easily become inoperable. So this patch
    introduces a filesystem option to determine whether the journal aborts
    or just calls printk() when it gets an IO error in file data.

    If you mount an ext3 fs with the data_err=abort option, it aborts on a
    file data write error. If you mount it with data_err=ignore, it doesn't
    abort and just calls printk(). data_err=ignore is the default.
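
    The behaviour of the two option values can be sketched as below. The
    types and function are hypothetical user-space stand-ins, and
    fprintf() plays the role of printk().

```c
#include <stdio.h>

/* Hypothetical representation of the data_err= mount option. */
enum sketch_data_err_mode { SKETCH_DATA_ERR_IGNORE, SKETCH_DATA_ERR_ABORT };

struct sketch_fs {
    enum sketch_data_err_mode data_err;  /* chosen at mount time */
    int journal_aborted;
};

/* Called when writing a file data block has failed. */
void sketch_on_data_write_error(struct sketch_fs *fs)
{
    /* Both modes report the error. */
    fprintf(stderr, "sketch: IO error writing file data\n");

    /* Only data_err=abort also aborts the journal. */
    if (fs->data_err == SKETCH_DATA_ERR_ABORT)
        fs->journal_aborted = 1;
}
```

    Keeping "ignore" as the default preserves availability, while "abort"
    lets mission critical systems stop before corruption spreads.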

    Signed-off-by: Hidehiro Kawai
    Cc: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hidehiro Kawai
     
  • Currently, original metadata buffers are dirtied when they are unfiled
    whether the journal has aborted or not. Eventually these buffers will be
    written back to the filesystem by pdflush. This means some metadata
    buffers are written to the filesystem without journaling if the journal
    aborts. So if a journal abort and a system crash happen at the same
    time, the filesystem would be left in an inconsistent state. Additionally,
    replaying journaled metadata can partly overwrite the latest metadata on
    the filesystem, because when the journal aborts, journaled metadata are
    preserved and replayed during the next mount so as not to lose
    uncheckpointed metadata. This would also break the consistency of the
    filesystem.

    This patch prevents original metadata buffers from being dirtied on abort
    by clearing BH_JBDDirty flag from those buffers. Thus, no metadata
    buffers are written to the filesystem without journaling.

    Signed-off-by: Hidehiro Kawai
    Acked-by: Jan Kara
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hidehiro Kawai
     
  • If we fail to write metadata buffers to the journal space but succeed
    in writing the commit record, stale data can be written back to the
    filesystem as metadata in the recovery phase.

    To avoid this, when we fail to write out metadata buffers, abort the
    journal before writing the commit record.

    Signed-off-by: Hidehiro Kawai
    Acked-by: Jan Kara
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hidehiro Kawai
     

12 Aug, 2008

1 commit


11 Aug, 2008

2 commits


05 Aug, 2008

2 commits

  • Like the page lock change, this also requires name change, so convert the
    raw test_and_set bitop to a trylock.

    Signed-off-by: Nick Piggin
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Converting page lock to new locking bitops requires a change of page flag
    operation naming, so we might as well convert it to something nicer
    (!TestSetPageLocked_Lock => trylock_page, SetPageLocked => set_page_locked).

    This also facilitates lockdeping of page lock.

    Signed-off-by: Nick Piggin
    Acked-by: KOSAKI Motohiro
    Acked-by: Peter Zijlstra
    Acked-by: Andrew Morton
    Acked-by: Benjamin Herrenschmidt
    Signed-off-by: Linus Torvalds

    Nick Piggin
     

26 Jul, 2008

7 commits

  • In ordered mode, the current jbd aborts the journal if a file data buffer
    has an error. But this behavior is unintended, and we found that it has
    been adopted accidentally.

    This patch undoes it and just calls printk() instead of aborting the
    journal. Additionally, set AS_EIO into the address_space object of the
    failed buffer which is submitted by journal_do_submit_data() so that
    fsync() can get -EIO.

    Missing error checks are also added so that errors on file data
    buffers are reported to the user. The following buffers are targeted.

    (a) the buffer which has already been written out by pdflush
    (b) the buffer which has been unlocked before scanned in the
    t_locked_list loop

    [akpm@linux-foundation.org: improve grammar in a printk]
    Signed-off-by: Hidehiro Kawai
    Acked-by: Jan Kara
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hidehiro Kawai
     
  • After ext3-ordered files are truncated, there is a possibility that
    pages which cannot be estimated still remain. The remaining pages can be
    released when the system is really low on memory, so this is not a memory
    leak. But resource management software and the like may not work
    correctly.

    It is possible that journal_unmap_buffer() cannot release the buffers and
    the pages to which they belong, because they are attached to a committing
    transaction. To release such buffers and pages later,
    journal_unmap_buffer() leaves the job to journal_commit_transaction().
    (journal_unmap_buffer() marks the buffers 'BH_Freed' so that
    journal_commit_transaction() can identify whether they can be released
    or not.)

    In the journalled mode and the writeback mode, jbd deals only with
    metadata buffers. But in the ordered mode, jbd deals with metadata
    buffers and also data buffers.

    Actually, journal_commit_transaction() releases only the metadata buffers
    whose release is requested by journal_unmap_buffer(), and also releases
    the pages to which they belong if possible.

    As a result, the data buffers whose release is requested by
    journal_unmap_buffer() remain after a transaction commits, and so do the
    pages to which they belong.

    Such remaining pages no longer have a mapping. Due to this fact,
    there is a possibility that pages which cannot be estimated remain.

    The metadata buffers marked 'BH_Freed' and the pages to which
    they belong can be released at 'JBD: commit phase 7'.

    Therefore, by applying the same code to 'JBD: commit phase 2' (where the
    data buffers are dealt with), journal_commit_transaction() can also release
    the data buffers marked 'BH_Freed' and the pages to which they belong.

    As a result, all the buffers marked 'BH_Freed' can be released, and also
    all the pages to which these buffers belong can be released in
    journal_commit_transaction(). So pages which cannot be estimated no
    longer remain.

    <>
    > spin_lock(&journal->j_list_lock);
    > while (commit_transaction->t_forget) {
    >         transaction_t *cp_transaction;
    >         struct buffer_head *bh;
    >
    >         jh = commit_transaction->t_forget;
    >...
    >         if (buffer_freed(bh)) {
    >         ^^^^^^^^^^^^^^^^^^^^^^^^
    >                 clear_buffer_freed(bh);
    >                 ^^^^^^^^^^^^^^^^^^^^^^^^
    >                 clear_buffer_jbddirty(bh);
    >         }
    >
    >         if (buffer_jbddirty(bh)) {
    >                 JBUFFER_TRACE(jh, "add to new checkpointing trans");
    >                 __journal_insert_checkpoint(jh, commit_transaction);
    >                 JBUFFER_TRACE(jh, "refile for checkpoint writeback");
    >                 __journal_refile_buffer(jh);
    >                 jbd_unlock_bh_state(bh);
    >         } else {
    >                 J_ASSERT_BH(bh, !buffer_dirty(bh));
    >                 ...
    >                 JBUFFER_TRACE(jh, "refile or unfile freed buffer");
    >                 __journal_refile_buffer(jh);
    >                 if (!jh->b_transaction) {
    >                         jbd_unlock_bh_state(bh);
    >                         /* needs a brelse */
    >                         journal_remove_journal_head(bh);
    >                         release_buffer_page(bh);
    >                         ^^^^^^^^^^^^^^^^^^^^^^^^
    >                 } else
    >         }
    ****************************************************************
    * Apply the code of "^^^^^^" lines into 'JBD: commit phase 2' *
    ****************************************************************

    In the journal_commit_transaction() code, there is one extra message
    ("JBD: commit phase 2") in the series of jbd debug messages. This patch
    fixes that, too.

    Signed-off-by: Toshiyuki Okajima
    Acked-by: Jan Kara
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Toshiyuki Okajima
     
  • Remove the unused EXPORT_SYMBOL(journal_update_superblock).

    Signed-off-by: Adrian Bunk
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adrian Bunk
     
  • journal_try_to_free_buffers() could race with the jbd commit transaction
    when the latter is holding the buffer reference while waiting for the data
    buffer to flush to disk. If the caller of journal_try_to_free_buffers()
    tries hard to release the buffers, it will treat the failure as an
    error and return it to its caller. We have seen direct IO fail
    due to this race. Some callers of releasepage() also expect the
    buffer to be dropped when the GFP_KERNEL mask is passed to
    releasepage()->journal_try_to_free_buffers().

    With this patch, if the caller passes __GFP_WAIT and __GFP_FS to
    indicate that this call may wait, then when try_to_free_buffers() fails
    we wait for journal_commit_transaction() to finish committing the
    current committing transaction and try to free those buffers again.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Mingming Cao
    Reviewed-by: Badari Pulavarty
    Acked-by: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mingming Cao
     
  • Make revocation cache destruction safe to call if initialisation fails
    partially or entirely. This allows it to be used to clean up in the case
    of initialisation failure, simplifying that code slightly.

    Signed-off-by: Duane Griffin
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Duane Griffin
     
  • The revocation table initialisation/destruction code is repeated for each
    of the two revocation tables stored in the journal. Refactoring the
    duplicated code into functions is tidier, simplifies the logic in
    initialisation in particular, and slightly reduces the code size.

    There should not be any functional change.

    Signed-off-by: Duane Griffin
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Duane Griffin
     
  • If an error occurs during jbd cache initialisation it is possible for the
    journal_head_cache to be NULL when journal_destroy_journal_head_cache is
    called. Replace the J_ASSERT with an if block to handle the situation
    correctly.

    Note that even with this fix things will break badly if jbd is statically
    compiled in and cache initialisation fails.
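
    The NULL-tolerant destruction pattern can be sketched in user space as
    follows. malloc()/free() stand in for the slab cache calls, and every
    sketch_* name and the simulate_failure parameter are assumptions made
    for this example.

```c
#include <stdbool.h>
#include <stdlib.h>

struct sketch_cache { int unused; };    /* stand-in for a slab cache */

struct sketch_cache *sketch_journal_head_cache;

/* Create the cache; simulate_failure mimics an allocation failure. */
int sketch_init_journal_head_cache(bool simulate_failure)
{
    if (simulate_failure)
        return -1;                      /* cache pointer stays NULL */
    sketch_journal_head_cache = calloc(1, sizeof(*sketch_journal_head_cache));
    return sketch_journal_head_cache ? 0 : -1;
}

/* Safe to call whether or not initialisation succeeded. */
void sketch_destroy_journal_head_cache(void)
{
    if (sketch_journal_head_cache) {    /* an if block instead of an assert */
        free(sketch_journal_head_cache);
        sketch_journal_head_cache = NULL;
    }
}
```

    Because the destroy function tolerates a NULL cache, the error path of
    the initialisation code can call it unconditionally.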

    Signed-off-by: Duane Griffin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Duane Griffin