11 Feb, 2009

2 commits

  • If we race with commit code setting i_transaction to NULL, we could
    possibly dereference it. Proper locking requires the journal pointer
    (to access journal->j_list_lock), which we don't have. So we have to
    change the prototype of the function so that the filesystem passes us the
    journal pointer. Also add a more detailed comment about why the
    function jbd2_journal_begin_ordered_truncate() does what it does and
    how it should be used.
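    A minimal userspace sketch of the locking pattern this change enables
    (all names and types here are illustrative stand-ins, not the real jbd2
    API): i_transaction may be set to NULL by the commit code at any moment,
    so it must only be dereferenced under the journal's j_list_lock, which
    the function can only take once the filesystem hands it the journal
    pointer.

    ```c
    #include <assert.h>
    #include <stddef.h>

    struct sketch_txn { int tid; };

    /* Illustrative stand-in for journal_t: the lock and the racy pointer
     * live together, so having the journal means we can lock safely. */
    struct sketch_journal {
            int j_list_lock;                  /* stand-in for the spinlock */
            struct sketch_txn *i_transaction; /* may be NULLed by commit code */
    };

    static int sketch_begin_ordered_truncate(struct sketch_journal *journal)
    {
            int tid = 0;

            journal->j_list_lock = 1;        /* "take" j_list_lock */
            if (journal->i_transaction)      /* safe: commit code also locks */
                    tid = journal->i_transaction->tid;
            journal->j_list_lock = 0;        /* "release" */
            return tid;                      /* 0: nothing to wait for */
    }
    ```

    Without the journal pointer, the NULL check and the dereference could
    not be made atomic with respect to the commit code clearing the field.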

    Thanks to Dan Carpenter for pointing out the suspicious code.

    Signed-off-by: Jan Kara
    Signed-off-by: "Theodore Ts'o"
    Acked-by: Joel Becker
    CC: linux-ext4@vger.kernel.org
    CC: ocfs2-devel@oss.oracle.com
    CC: mfasheh@suse.de
    CC: Dan Carpenter

    Jan Kara
     
  • The function jbd2_journal_start_commit() returns 1 if either a
    transaction is committing or the function has queued a transaction
    commit. But it returns 0 if we raced with somebody queueing the
    transaction commit as well. This resulted in ext4_sync_fs() not
    functioning correctly (description from Arthur Jones):

    In the case of a data=ordered umount with pending long symlinks
    which are delayed due to a long list of other I/O on the backing
    block device, this causes the buffer associated with the long
    symlinks to not be moved to the inode dirty list in the second
    phase of fsync_super. Then, before they can be dirtied again,
    kjournald exits, seeing the UMOUNT flag and the dirty pages are
    never written to the backing block device, causing long symlink
    corruption and exposing new or previously freed block data to
    userspace.

    This can be reproduced with a script created by Eric Sandeen:

    #!/bin/bash

    umount /mnt/test2
    mount /dev/sdb4 /mnt/test2
    rm -f /mnt/test2/*
    dd if=/dev/zero of=/mnt/test2/bigfile bs=1M count=512
    touch /mnt/test2/thisisveryveryveryveryveryveryveryveryveryveryveryveryveryveryveryverylongfilename
    ln -s /mnt/test2/thisisveryveryveryveryveryveryveryveryveryveryveryveryveryveryveryverylongfilename \
        /mnt/test2/link
    umount /mnt/test2
    mount /dev/sdb4 /mnt/test2
    ls /mnt/test2/

    This patch fixes jbd2_journal_start_commit() to always return 1 when
    there's a transaction committing or queued for commit.
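    A simplified userspace sketch of the fixed decision logic (field and
    function names here are illustrative; the real jbd2_journal_start_commit()
    operates on journal_t under j_state_lock): the key change is that racing
    with someone who already queued the commit must still return 1.

    ```c
    #include <stddef.h>

    struct sketch_journal {
            int running_tid;     /* transaction accepting updates, 0 if none */
            int committing_tid;  /* transaction mid-commit, 0 if none */
            int commit_request;  /* highest tid already queued for commit */
    };

    /* Return 1 if a commit is running or queued (queueing it now if
     * needed); return 0 only when there is truly nothing to commit. */
    static int sketch_start_commit(struct sketch_journal *j)
    {
            if (j->committing_tid)
                    return 1;                /* commit already in progress */
            if (j->running_tid) {
                    if (j->commit_request != j->running_tid)
                            j->commit_request = j->running_tid; /* queue it */
                    return 1;  /* fixed: 1 even if someone queued it first */
            }
            return 0;
    }
    ```

    The pre-fix behavior returned 0 on the "already queued" path, which is
    exactly the race ext4_sync_fs() tripped over.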

    Signed-off-by: Jan Kara
    Signed-off-by: "Theodore Ts'o"
    CC: Eric Sandeen
    CC: linux-ext4@vger.kernel.org

    Jan Kara
     

12 Jan, 2009

1 commit

  • The following warning:

    fs/jbd2/journal.c: In function ‘jbd2_seq_info_show’:
    fs/jbd2/journal.c:850: warning: format ‘%lu’ expects type ‘long
    unsigned int’, but argument 3 has type ‘uint32_t’

    is caused by incorrect usage of do_div(), which modifies the dividend
    in place (leaving the quotient there) and returns the remainder. So not
    only would an incorrect value be displayed, but
    s->journal->j_average_commit_time would also be changed to a wrong
    value!

    Fix it by using div_u64 instead.
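    A userspace sketch of the pitfall (these are stand-in macros mimicking
    the kernel semantics, not the kernel's own do_div()/div_u64()): the
    do_div()-style macro clobbers its 64-bit dividend with the quotient and
    evaluates to the remainder, while the div_u64()-style helper is
    side-effect free and returns the quotient.

    ```c
    #include <stdint.h>

    /* Mimics kernel do_div() semantics: the 64-bit dividend is overwritten
     * with the quotient and the expression yields the 32-bit remainder.
     * (GCC statement-expression, as in the kernel's generic version.) */
    #define sketch_do_div(n, base) ({                       \
            uint32_t __rem = (uint32_t)((n) % (base));      \
            (n) = (n) / (base);                             \
            __rem;                                          \
    })

    /* div_u64()-style helper: returns the quotient, leaves the dividend
     * untouched -- the right tool for a read-only /proc display path. */
    static inline uint64_t sketch_div_u64(uint64_t dividend, uint32_t divisor)
    {
            return dividend / divisor;
    }
    ```

    Using the do_div()-style form in a stats printout both displays the
    wrong number and corrupts the stored average as a side effect.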

    Signed-off-by: Simon Holm Thøgersen
    Signed-off-by: "Theodore Ts'o"

    Simon Holm Thøgersen
     

09 Jan, 2009

1 commit

  • * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (57 commits)
    jbd2: Fix oops in jbd2_journal_init_inode() on corrupted fs
    ext4: Remove "extents" mount option
    block: Add Kconfig help which notes that ext4 needs CONFIG_LBD
    ext4: Make printk's consistently prefixed with "EXT4-fs: "
    ext4: Add sanity checks for the superblock before mounting the filesystem
    ext4: Add mount option to set kjournald's I/O priority
    jbd2: Submit writes to the journal using WRITE_SYNC
    jbd2: Add pid and journal device name to the "kjournald2 starting" message
    ext4: Add markers for better debuggability
    ext4: Remove code to create the journal inode
    ext4: provide function to release metadata pages under memory pressure
    ext3: provide function to release metadata pages under memory pressure
    add releasepage hooks to block devices which can be used by file systems
    ext4: Fix s_dirty_blocks_counter if block allocation failed with nodelalloc
    ext4: Init the complete page while building buddy cache
    ext4: Don't allow new groups to be added during block allocation
    ext4: mark the blocks/inode bitmap beyond end of group as used
    ext4: Use new buffer_head flag to check uninit group bitmaps initialization
    ext4: Fix the race between read_inode_bitmap() and ext4_new_inode()
    ext4: code cleanup
    ...

    Linus Torvalds
     

07 Jan, 2009

2 commits

  • On a 32-bit system with CONFIG_LBD, getblk() can fail because the
    provided block number is too big. Add error checks so we fail
    gracefully if getblk() returns NULL (which can also happen on memory
    allocation failures).

    Thanks to David Maciejak from Fortinet's FortiGuard Global Security
    Research Team for reporting this bug.

    http://bugzilla.kernel.org/show_bug.cgi?id=12370

    Signed-off-by: Jan Kara
    Signed-off-by: "Theodore Ts'o"
    cc: stable@kernel.org

    Jan Kara
     
  • This code has been obsolete for quite some time, since the supported
    method for adding a journal inode is to use tune2fs (or to create a
    new filesystem with a journal via mke2fs or mkfs.ext4).

    Signed-off-by: "Theodore Ts'o"

    Theodore Ts'o
     

06 Jan, 2009

2 commits

  • Filesystems often need to do a compute-intensive operation on some
    metadata. If this operation is repeated many times, it can be very
    expensive. It would be much nicer if the operation could be performed
    once, just before the buffer goes to disk.

    This adds triggers to jbd2 buffer heads. Just before writing a metadata
    buffer to the journal, jbd2 will optionally call a commit trigger associated
    with the buffer. If the journal is aborted, an abort trigger will be
    called on any dirty buffers as they are dropped from pending
    transactions.

    ocfs2 will use this feature.

    Initially I tried to come up with a more generic trigger that could be
    used for non-buffer-related events like transaction completion. It
    doesn't fit nicely, because the information a buffer trigger needs
    (specific to a journal_head) isn't the same as what a transaction
    trigger needs (specific to a transaction_t or perhaps journal_t). So I
    implemented buffer trigger sets, with the understanding that
    journal- or transaction-wide triggers should be implemented separately.

    There is only one trigger set allowed per buffer. I can't think of any
    reason to attach more than one set. Contrast this with a journal or
    transaction in which multiple places may want to watch the entire
    transaction separately.

    The trigger sets are considered static allocation from the jbd2
    perspective. ocfs2 will just have one trigger set per block type,
    setting the same set on every bh of the same type.
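    A minimal userspace sketch of the shape of such a trigger set (names
    are illustrative stand-ins; the real jbd2 type is a struct of commit-
    and abort-time hooks attached one-per-buffer): the journal write path
    fires the commit hook just before the buffer goes to the journal.

    ```c
    #include <stddef.h>

    struct sketch_triggers;

    struct sketch_buf {
            char data[16];
            const struct sketch_triggers *triggers; /* at most one set */
    };

    struct sketch_triggers {
            /* called just before the buffer is written to the journal */
            void (*commit)(struct sketch_buf *buf);
            /* called on dirty buffers dropped from an aborted journal */
            void (*abort)(struct sketch_buf *buf);
    };

    static void sketch_write_to_journal(struct sketch_buf *buf)
    {
            if (buf->triggers && buf->triggers->commit)
                    buf->triggers->commit(buf); /* e.g. recompute a checksum */
            /* ... the actual journal write would follow ... */
    }

    /* A counting trigger set, statically allocated as the text describes:
     * one set per block type, shared by every bh of that type. */
    static int commit_calls;
    static void count_commit(struct sketch_buf *buf) { (void)buf; commit_calls++; }
    static const struct sketch_triggers counting = { .commit = count_commit };
    ```

    Since the sets are static and shared, attaching one costs a single
    pointer per buffer, matching the "one trigger set per buffer" rule.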

    Signed-off-by: Joel Becker
    Cc: "Theodore Ts'o"
    Cc:
    Signed-off-by: Mark Fasheh

    Joel Becker
     
  • Xen doesn't report that barriers are not supported until buffer I/O is
    reported as completed, instead of when the buffer I/O is submitted.
    Add a check and a fallback codepath to journal_wait_on_commit_record()
    to detect this case, so that attempts to mount ext4 filesystems on
    LVM/devicemapper devices on Xen guests don't blow up with an "Aborting
    journal on device XXX"; "Remounting filesystem read-only" error.

    Thanks to Andreas Sundstrom for reporting this issue.

    Signed-off-by: "Theodore Ts'o"
    Cc: stable@kernel.org

    Theodore Ts'o
     

05 Jan, 2009

1 commit


04 Jan, 2009

2 commits


17 Dec, 2008

1 commit


26 Nov, 2008

1 commit

  • This patch removes the static sleep time in favor of a more
    self-optimizing approach, where we measure the average amount of time
    it takes to commit a transaction to disk and the amount of time a
    transaction has been running. If somebody does a sync write or an
    fsync(), traditionally we would sleep for 1 jiffy, which, depending on
    the value of HZ, could be a significant amount of time compared to how
    long it takes to commit a transaction to the underlying storage. With
    this patch, instead of sleeping for a jiffy, we check whether the
    amount of time this transaction has been running is less than the
    average commit time; if it is, we sleep for the delta using
    schedule_hrtimeout() to get a higher-precision sleep time. This
    greatly benefits high-end storage, where you could otherwise end up
    sleeping for longer than it takes to commit the transaction, sitting
    idle instead of letting the transaction commit. Keeping the sleep time
    to a minimum ensures you are always doing something.
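    The sleep-time decision reduces to a small calculation, sketched here
    in userspace (the function name is illustrative; the real code feeds
    the resulting delta to schedule_hrtimeout()):

    ```c
    #include <stdint.h>

    /* Sleep only for the time the average commit is still expected to
     * need, given how long this transaction has already been running.
     * Returns 0 when the transaction has run at least as long as the
     * average commit takes, i.e. commit now rather than sleeping. */
    static uint64_t sketch_commit_sleep_ns(uint64_t avg_commit_ns,
                                           uint64_t running_ns)
    {
            if (running_ns >= avg_commit_ns)
                    return 0;                  /* don't sleep at all */
            return avg_commit_ns - running_ns; /* high-resolution delta */
    }
    ```

    On fast storage where commits take far less than a jiffy, this delta
    is what keeps fsync() latency close to the device's real commit time.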

    Signed-off-by: Josef Bacik
    Signed-off-by: "Theodore Ts'o"

    Josef Bacik
     

07 Nov, 2008

2 commits

  • Avoid freeing the transaction in __jbd2_journal_drop_transaction() so
    the journal commit callback can run without holding j_list_lock, to
    avoid lock contention on this spinlock.

    Signed-off-by: Aneesh Kumar K.V
    Signed-off-by: "Theodore Ts'o"

    Aneesh Kumar K.V
     
  • Commit 23f8b79e introduced a regression because it assumed that if
    there were no transactions ready to be checkpointed, no progress
    could be made on making space available in the journal, and so the
    journal should be aborted. This assumption is false; it could be the
    case that simply calling jbd2_cleanup_journal_tail() will recover the
    necessary space, or, for small journals, the currently committing
    transaction could be responsible for chewing up the required space in
    the log, so we need to wait for the currently committing transaction
    to finish before trying to force a checkpoint operation.

    This patch fixes a bug reported by Mihai Harpau at:
    https://bugzilla.redhat.com/show_bug.cgi?id=469582

    This patch fixes a bug reported by François Valenduc at:
    http://bugzilla.kernel.org/show_bug.cgi?id=11840

    Signed-off-by: "Theodore Ts'o"
    Cc: Duane Griffin
    Cc: Toshiyuki Okajima

    Theodore Ts'o
     

05 Nov, 2008

1 commit


03 Nov, 2008

1 commit

  • jbd2_journal_init_inode() does not call jbd2_stats_proc_exit() on all
    failure paths after calling jbd2_stats_proc_init(). This leaves
    dangling references to the fs in proc.

    This patch fixes a bug reported by Sami Liedes at:
    http://bugzilla.kernel.org/show_bug.cgi?id=11493

    Signed-off-by: Sami Liedes
    Signed-off-by: "Theodore Ts'o"

    Sami Liedes
     

29 Oct, 2008

1 commit

  • The transaction can potentially get dropped if there are no buffers
    that need to be written. Make sure we call the commit callback before
    potentially deciding to drop the transaction. Also avoid
    dereferencing the commit_transaction pointer in the marker for the
    same reason.

    This patch fixes the bug reported by Eric Paris at:
    http://bugzilla.kernel.org/show_bug.cgi?id=11838

    Signed-off-by: "Theodore Ts'o"
    Acked-by: Eric Sandeen
    Tested-by: Eric Paris

    Theodore Ts'o
     

21 Oct, 2008

1 commit


17 Oct, 2008

1 commit

  • The multiblock allocator needs to be able to release blocks (and issue
    a blkdev discard request) when the transaction that freed those
    blocks is committed. Previously this was done via a polling mechanism
    when blocks were allocated or freed. A much better way of doing things
    is to create a jbd2 callback function and attach the list of blocks
    to be freed directly to the transaction structure.
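    A userspace sketch of the attach-and-callback shape (all names are
    illustrative; the real mechanism is the jbd2 commit callback, with
    ext4 keeping its freed-extent list on the transaction): the freed-block
    list rides on the transaction, and the callback runs once the commit
    completes.

    ```c
    #include <stddef.h>

    struct sketch_freed {
            unsigned long block;
            struct sketch_freed *next;   /* list hangs off the transaction */
    };

    struct sketch_txn {
            struct sketch_freed *freed_list;
            void (*commit_callback)(struct sketch_txn *txn);
    };

    static int discarded;   /* counts blocks released after commit */

    static void sketch_release_blocks(struct sketch_txn *txn)
    {
            struct sketch_freed *f;
            for (f = txn->freed_list; f; f = f->next)
                    discarded++;  /* here ext4 would issue blkdev discards */
    }

    static void sketch_commit(struct sketch_txn *txn)
    {
            /* the journal commit finishes first, then the callback fires */
            if (txn->commit_callback)
                    txn->commit_callback(txn);
    }
    ```

    The point of the callback is ordering: no block is reused or discarded
    until the transaction that freed it is durably committed.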

    Signed-off-by: "Theodore Ts'o"

    Theodore Ts'o
     

13 Oct, 2008

1 commit

  • If we fail to write metadata buffers to the journal space but succeed
    in writing the commit record, stale data can be written back to the
    filesystem as metadata during the recovery phase.

    To avoid this, when we fail to write out metadata buffers, abort the
    journal before writing the commit record.

    We can also avoid this kind of corruption by using the journal
    checksum feature, because it can detect invalid metadata blocks in the
    journal and keep them from being replayed. So we don't need to care
    about asynchronous commit record writeout when a checksum is in use.

    Signed-off-by: Hidehiro Kawai
    Signed-off-by: Theodore Ts'o

    Hidehiro Kawai
     

11 Oct, 2008

3 commits

  • If the journal doesn't abort when it gets an IO error in file data
    blocks, the file data corruption will spread silently. Because most
    applications and commands do buffered writes without fsync(), they
    don't notice the IO error. That's scary for mission-critical systems.
    On the other hand, if the journal aborts whenever it gets an IO error
    in file data blocks, the system easily becomes inoperable. So this
    patch introduces a filesystem option to determine whether the journal
    aborts or just calls printk() when it gets an IO error in file data.

    If you mount an ext4 fs with the data_err=abort option, it aborts on a
    file data write error. If you mount it with data_err=ignore, it
    doesn't abort but just calls printk(). data_err=ignore is the default.

    Here is the corresponding patch of the ext3 version:
    http://kerneltrap.org/mailarchive/linux-kernel/2008/9/9/3239374

    Signed-off-by: Hidehiro Kawai
    Signed-off-by: Theodore Ts'o

    Hidehiro Kawai
     
  • Currently, original metadata buffers are dirtied when they are
    unfiled, whether the journal has aborted or not. Eventually these
    buffers will be written back to the filesystem by pdflush. This
    means some metadata buffers are written to the filesystem without
    journaling if the journal aborts. So if a journal abort and a
    system crash happen at the same time, the filesystem would be left in
    an inconsistent state. Additionally, replaying journaled metadata
    can partly overwrite the latest metadata on the filesystem, because
    if the journal gets aborted, journaled metadata are preserved and
    replayed during the next mount so as not to lose uncheckpointed
    metadata. This would also break the consistency of the filesystem.

    This patch prevents original metadata buffers from being dirtied
    on abort by clearing BH_JBDDirty flag from those buffers. Thus,
    no metadata buffers are written to the filesystem without journaling.

    Signed-off-by: Hidehiro Kawai
    Signed-off-by: Theodore Ts'o

    Hidehiro Kawai
     
  • When a checkpointing IO fails, the current JBD2 code doesn't check the
    error and continues journaling. This means the latest metadata can be
    lost from both the journal and the filesystem.

    This patch leaves the failed metadata blocks in the journal space
    and aborts journaling in the case of jbd2_log_do_checkpoint().
    To achieve this, we need to:

    1. not remove the failed buffer from the checkpoint list in the case
    of __try_to_free_cp_buf(), because it may be released or overwritten
    by a later transaction
    2. in jbd2_log_do_checkpoint(), which is the last chance, remove the
    failed buffer from the checkpoint list and abort the journal
    3. when checkpointing fails, not update the journal superblock, to
    prevent the journaled contents from being cleaned; for safety, don't
    update j_tail and j_tail_sequence either
    4. when checkpointing fails, notify the ext4 layer of this error so
    that ext4 doesn't clear the needs_recovery flag; otherwise the
    journaled contents are ignored and cleaned in the recovery phase
    5. if the recovery fails, keep the needs_recovery flag
    6. prevent jbd2_cleanup_journal_tail() from being called between
    __jbd2_journal_drop_transaction() and jbd2_journal_abort()
    (a possible race between the jbd2_log_do_checkpoint()s called by
    jbd2_journal_flush() and __jbd2_log_wait_for_space())

    Signed-off-by: Hidehiro Kawai
    Signed-off-by: Theodore Ts'o

    Hidehiro Kawai
     

09 Oct, 2008

1 commit

  • The __jbd2_log_wait_for_space function sits in a loop checkpointing
    transactions until there is sufficient space free in the journal.
    However, if there are no transactions to be processed (e.g. because the
    free space calculation is wrong due to a corrupted filesystem) it will
    never progress.

    Check for space being required when no transactions are outstanding and
    abort the journal instead of endlessly looping.
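    The fixed exit condition can be sketched in a few lines of userspace C
    (everything here is simplified and illustrative, including the pretend
    space gain per checkpoint): keep checkpointing while space is short,
    but bail out instead of spinning when nothing is left to checkpoint.

    ```c
    enum sketch_result { SKETCH_OK, SKETCH_ABORTED };

    /* Loop shape of __jbd2_log_wait_for_space after the fix: if space is
     * still required but there are no checkpointable transactions (e.g. a
     * corrupted fs made the free-space calculation wrong), abort the
     * journal rather than looping forever. */
    static enum sketch_result sketch_wait_for_space(int *free_blocks,
                                                    int needed,
                                                    int *checkpointable_txns)
    {
            while (*free_blocks < needed) {
                    if (*checkpointable_txns == 0)
                            return SKETCH_ABORTED;  /* no progress possible */
                    (*checkpointable_txns)--;
                    *free_blocks += 8;  /* pretend a checkpoint freed space */
            }
            return SKETCH_OK;
    }
    ```

    The pre-fix loop lacked the inner no-transactions check, which is what
    turned a corrupted free-space count into an infinite loop.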

    This patch fixes the bug reported by Sami Liedes at:
    http://bugzilla.kernel.org/show_bug.cgi?id=10976

    Signed-off-by: Duane Griffin
    Cc: Sami Liedes
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: "Theodore Ts'o"

    Duane Griffin
     

07 Oct, 2008

2 commits


06 Oct, 2008

1 commit


17 Sep, 2008

1 commit

  • Calculate the journal device name once and stash it away in the
    journal_s structure. This avoids needing to call bdevname()
    everywhere and reduces stack usage by not needing to allocate an
    on-stack buffer. In addition, we eliminate the '/' that can appear in
    device names (e.g. "cciss/c0d0p9" --- see kernel bugzilla #11321) that
    can cause problems when creating proc directory names, and include the
    inode number to support ocfs2 which creates multiple journals with
    different inode numbers.
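    A userspace sketch of building such a proc-safe name (the function
    name, the '!' substitution character, and the "name-inode" layout are
    assumptions for illustration; the real code stashes its result in
    journal_s):

    ```c
    #include <stdio.h>
    #include <string.h>

    /* Build a proc-directory-safe journal name: append the inode number
     * so multiple journals (as in ocfs2) get distinct names, then replace
     * any '/' from the device name (e.g. "cciss/c0d0p9") since '/' cannot
     * appear in a proc entry name. */
    static void sketch_proc_name(char *dst, size_t len,
                                 const char *bdevname, unsigned long ino)
    {
            size_t i;

            snprintf(dst, len, "%s-%lu", bdevname, ino);
            for (i = 0; dst[i]; i++)
                    if (dst[i] == '/')
                            dst[i] = '!';
    }
    ```

    Computing this once at journal init also removes the per-call
    bdevname() stack buffer the old code needed.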

    Signed-off-by: "Theodore Ts'o"

    Theodore Ts'o
     

12 Aug, 2008

1 commit


11 Aug, 2008

2 commits


05 Aug, 2008

1 commit

  • Converting page lock to new locking bitops requires a change of page flag
    operation naming, so we might as well convert it to something nicer
    (!TestSetPageLocked_Lock => trylock_page, SetPageLocked => set_page_locked).

    This also facilitates lockdeping of page lock.

    Signed-off-by: Nick Piggin
    Acked-by: KOSAKI Motohiro
    Acked-by: Peter Zijlstra
    Acked-by: Andrew Morton
    Acked-by: Benjamin Herrenschmidt
    Signed-off-by: Linus Torvalds

    Nick Piggin
     

01 Aug, 2008

1 commit

  • In ordered mode, the current jbd2 aborts the journal if a file data buffer
    has an error. But this behavior is unintended, and we found that it has
    been adopted accidentally.

    This patch undoes it and just calls printk() instead of aborting the
    journal. Unlike a similar patch for ext3/jbd, file data buffers are
    written via generic_writepages(). But we also need to set AS_EIO
    into their mappings because wait_on_page_writeback_range() clears
    AS_EIO before a user process sees it.

    Signed-off-by: Hidehiro Kawai
    Signed-off-by: "Theodore Ts'o"

    Hidehiro Kawai
     

27 Jul, 2008

1 commit


14 Jul, 2008

1 commit

    journal_try_to_free_buffers() could race with the jbd commit
    transaction when the latter is holding the buffer reference while
    waiting for the data buffer to flush to disk. If the caller of
    journal_try_to_free_buffers() tries hard to release the buffers, it
    will treat the failure as an error and return to the caller. We have
    seen direct IO fail due to this race. Some callers of releasepage()
    also expect the buffer to be dropped when
    releasepage()->journal_try_to_free_buffers() is passed the GFP_KERNEL
    mask.

    With this patch, if the caller passes GFP_KERNEL to indicate that this
    call can wait, then, in case try_to_free_buffers() fails, we wait for
    journal_commit_transaction() to finish committing the current
    committing transaction and then try to free those buffers again with
    the journal locked.

    Signed-off-by: Mingming Cao
    Reviewed-by: Badari Pulavarty
    Signed-off-by: "Theodore Ts'o"

    Mingming Cao
     

12 Jul, 2008

4 commits

  • This provides a new ordered mode implementation which gets rid of
    using buffer heads to enforce the ordering between a metadata change
    and the related data change. Instead, the new ordered mode keeps track
    of all of the inodes touched by each transaction on a list, and when
    that transaction is committed, it flushes all of the dirty pages for
    those inodes. In addition, the new ordered mode reverses the lock
    ordering of the page lock and transaction lock, which provides easier
    support for delayed allocation.

    Signed-off-by: Aneesh Kumar K.V
    Signed-off-by: Mingming Cao
    Signed-off-by: "Theodore Ts'o"

    Aneesh Kumar K.V
     
  • Signed-off-by: Jan Kara

    Jan Kara
     
  • This patch adds necessary framework into JBD2 to be able to track inodes
    with each transaction and write-out their dirty data during transaction
    commit time.

    This new ordered mode brings all sorts of advantages, such as the
    possibility of getting rid of journal heads and buffer heads for data
    buffers in ordered mode, better ordering of writes on transaction
    commit, simplification of some JBD code, and no more anonymous pages
    when a truncate of data being committed happens. Also, with this new
    ordered mode, delayed allocation on ordered mode is much simpler.
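    The per-transaction inode tracking can be sketched as follows in
    userspace (structure and function names are illustrative stand-ins for
    the jbd2 framework, which also guards against double-adding an inode):
    each transaction keeps a list of touched inodes, and commit flushes
    their dirty data instead of tracking individual data buffer heads.

    ```c
    #include <stddef.h>

    struct sketch_inode {
            int dirty_pages;             /* pending data to write out */
            struct sketch_inode *next;   /* link on the transaction's list */
    };

    struct sketch_txn {
            struct sketch_inode *inodes; /* inodes touched by this txn */
    };

    static void sketch_txn_add_inode(struct sketch_txn *t,
                                     struct sketch_inode *i)
    {
            i->next = t->inodes;   /* real code avoids duplicate entries */
            t->inodes = i;
    }

    /* Commit time: write out every tracked inode's dirty data before the
     * metadata commits, enforcing the ordered-mode guarantee. */
    static int sketch_txn_commit(struct sketch_txn *t)
    {
            int flushed = 0;
            struct sketch_inode *i;

            for (i = t->inodes; i; i = i->next) {
                    flushed += i->dirty_pages;
                    i->dirty_pages = 0;
            }
            return flushed;
    }
    ```

    Tracking whole inodes rather than buffer heads is what removes the
    anonymous-page and journal-head bookkeeping the old mode required.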

    Signed-off-by: Jan Kara

    Jan Kara
     
  • Carlo Wood has demonstrated that it's possible to recover deleted
    files from the journal. Something that will make this easier is if we
    can put the time of the commit into the commit block.

    Signed-off-by: "Theodore Ts'o"

    Theodore Ts'o