05 Jan, 2012

1 commit

  • Toshiyuki Okajima found out that when running

    for ((i=0; i < 100000; i++)); do
    if ((i%2 == 0)); then
    chattr +j /mnt/file
    else
    chattr -j /mnt/file
    fi
    echo "0" >> /mnt/file
    done

    the process sometimes hangs indefinitely in jbd2_journal_lock_updates().

    Toshiyuki identified that the following race happens:

    jbd2_journal_lock_updates()            | jbd2_journal_stop()
    ---------------------------------------+---------------------------------------
    write_lock(&journal->j_state_lock)     |   .
    ++journal->j_barrier_count             |   .
    spin_lock(&tran->t_handle_lock)        |   .
    atomic_read(&tran->t_updates) //not 0  |
                                           | atomic_dec_and_test(&tran->t_updates)
                                           |   // t_updates = 0
                                           | wake_up(&journal->j_wait_updates)
    prepare_to_wait()                      |   // no process is woken up.
    spin_unlock(&tran->t_handle_lock)      |
    write_unlock(&journal->j_state_lock)   |
    schedule() // never return             |

    We fix the problem by first calling prepare_to_wait() and only after that
    checking t_updates in jbd2_journal_lock_updates().

    Reported-and-analyzed-by: Toshiyuki Okajima
    Signed-off-by: Jan Kara
    Signed-off-by: "Theodore Ts'o"

    Jan Kara
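
The ordering problem in the commit above can be illustrated with a small single-threaded model. This is only a sketch, not kernel code: `struct model`, `model_stop()` and `waiter_hangs()` are hypothetical names, and the one racing jbd2_journal_stop() call is placed by hand at the point where the race table puts it.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model, not kernel code: one waiter, one counter, one wait queue. */
struct model {
    int  t_updates;     /* outstanding handles on the transaction */
    bool on_waitqueue;  /* waiter has already called prepare_to_wait() */
    bool woken;         /* a wake_up() found the waiter on the queue */
};

/* Models jbd2_journal_stop(): drop one update, wake only queued waiters. */
static void model_stop(struct model *m)
{
    if (--m->t_updates == 0 && m->on_waitqueue)
        m->woken = true;
}

/* Returns true if the waiter would sleep with nobody left to wake it. */
static bool waiter_hangs(bool check_before_queueing)
{
    struct model m = { .t_updates = 1 };
    bool must_sleep;

    if (check_before_queueing) {
        must_sleep = (m.t_updates != 0); /* old order: check first */
        model_stop(&m);                  /* racing stop: wake-up is lost */
        m.on_waitqueue = true;           /* prepare_to_wait() comes too late */
    } else {
        m.on_waitqueue = true;           /* new order: queue first */
        model_stop(&m);                  /* racing stop now finds the waiter */
        must_sleep = (m.t_updates != 0); /* check only after queueing */
    }
    return must_sleep && !m.woken;
}
```

With the old order the model reports a hang; with the order described in the fix it does not, because the check can no longer miss a decrement that happened before the waiter was queued.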
     

02 Nov, 2011

1 commit


27 Oct, 2011

1 commit

  • Fix build error when CONFIG_BUG is not enabled:

    fs/jbd2/transaction.c:1175:3: error: implicit declaration of function '__WARN'

    by changing __WARN() to WARN_ON(), as suggested by
    Arnaud Lacombe.

    Signed-off-by: Randy Dunlap
    Signed-off-by: "Theodore Ts'o"
    Cc: Arnd Bergmann
    Cc: Arnaud Lacombe

    Randy Dunlap
     

04 Sep, 2011

2 commits

  • This silences some Sparse warnings:
    fs/jbd2/transaction.c:135:69: warning: incorrect type in argument 2 (different base types)
    fs/jbd2/transaction.c:135:69: expected restricted gfp_t [usertype] flags
    fs/jbd2/transaction.c:135:69: got int [signed] gfp_mask

    Signed-off-by: Dan Carpenter
    Signed-off-by: "Theodore Ts'o"

    Dan Carpenter
     
  • Add debugging information in case jbd2_journal_dirty_metadata() is
    called with a buffer_head which didn't have
    jbd2_journal_get_write_access() called on it, or if the journal_head
    has the wrong transaction in it. In addition, return an error code.
    This won't change anything for ocfs2, which will BUG_ON() the non-zero
    exit code.

    For ext4, the caller of this function is ext4_handle_dirty_metadata(),
    and on seeing a non-zero return code, will call __ext4_journal_stop(),
    which will print the function and line number of the (buggy) calling
    function and abort the journal. This will allow us to recover instead
    of bug halting, which is better from a robustness and reliability
    point of view.

    Signed-off-by: "Theodore Ts'o"

    Theodore Ts'o
     

14 Jun, 2011

1 commit

  • jbd2_journal_remove_journal_head() can oops when trying to access
    journal_head returned by bh2jh(). This is caused for example by the
    following race:

    TASK1                                   TASK2
    jbd2_journal_commit_transaction()
      ...
      processing t_forget list
        __jbd2_journal_refile_buffer(jh);
        if (!jh->b_transaction) {
          jbd_unlock_bh_state(bh);
                                            jbd2_journal_try_to_free_buffers()
                                              jbd2_journal_grab_journal_head(bh)
                                              jbd_lock_bh_state(bh)
                                              __journal_try_to_free_buffer()
                                              jbd2_journal_put_journal_head(jh)
          jbd2_journal_remove_journal_head(bh);

    jbd2_journal_put_journal_head() in TASK2 sees that b_jcount == 0 and
    buffer is not part of any transaction and thus frees journal_head
    before TASK1 gets to doing so. Note that even buffer_head can be
    released by try_to_free_buffers() after
    jbd2_journal_put_journal_head() which adds even larger opportunity for
    oops (but I didn't see this happen in reality).

    Fix the problem by making transactions hold their own journal_head
    reference (in b_jcount). That way we don't have to remove journal_head
    explicitly via jbd2_journal_remove_journal_head() and instead just
    remove journal_head when b_jcount drops to zero. The result of this is
    that [__]jbd2_journal_refile_buffer(),
    [__]jbd2_journal_unfile_buffer(), and
    __jbd2_journal_remove_checkpoint() can free journal_head which needs
    modification of a few callers. Also we have to be careful because once
    journal_head is removed, buffer_head might be freed as well. So we
    have to get our own buffer_head reference where it matters.

    Signed-off-by: Jan Kara
    Signed-off-by: "Theodore Ts'o"

    Jan Kara
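
A minimal reference-counting sketch of the idea in the commit above, assuming a simplified stand-in for journal_head. `struct jh_model`, `jh_get()`, `jh_put()` and `survives_concurrent_put()` are hypothetical names for illustration; the real code also involves locking and buffer_head references that are left out here.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a journal_head; not the kernel structure. */
struct jh_model {
    int  b_jcount;  /* reference count */
    bool freed;     /* set instead of freeing, so callers can inspect it */
};

static void jh_get(struct jh_model *jh)
{
    jh->b_jcount++;
}

/* Drop one reference; the last put releases the object. */
static void jh_put(struct jh_model *jh)
{
    assert(jh->b_jcount > 0);
    if (--jh->b_jcount == 0)
        jh->freed = true;
}

/*
 * A second task takes and drops a temporary reference while the buffer
 * still belongs to a transaction. Returns true if the object survives.
 */
static bool survives_concurrent_put(bool transaction_holds_ref)
{
    struct jh_model jh = { 0 };

    if (transaction_holds_ref)
        jh_get(&jh);  /* reference owned by the transaction */
    jh_get(&jh);      /* TASK2: grab the journal head */
    jh_put(&jh);      /* TASK2: put it again */
    return !jh.freed;
}
```

When the transaction owns a reference, the temporary grab/put pair in the other task can no longer be the one that drops the count to zero.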
     

13 Jun, 2011

1 commit


26 May, 2011

1 commit


25 May, 2011

1 commit


24 May, 2011

1 commit

  • In data=ordered mode, it's theoretically possible (however rare) that
    an inode is filed to transaction's t_inode_list and a flusher thread
    writes all the data and inode is reclaimed before the transaction
    starts to commit. In such a case, we could erroneously omit sending a
    flush to file system device when it is different from the journal
    device (because data can still be in disk cache only).

    Fix the problem by setting a flag in a transaction when some inode is added
    to it and then send disk flush in the commit code when the flag is set.

    Signed-off-by: Jan Kara
    Signed-off-by: "Theodore Ts'o"

    Jan Kara
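
The flag-based fix above can be sketched as follows. This is a toy model under stated assumptions: `struct txn_model`, the field name `need_data_flush`, and the "old" rule that looks at the inode list at commit time are all chosen for illustration and are not taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy transaction; field names are assumptions made for this sketch. */
struct txn_model {
    int  inodes_on_list;   /* inodes currently filed to the transaction */
    bool need_data_flush;  /* sticky: set once any inode was ever filed */
};

static void txn_file_inode(struct txn_model *t)
{
    t->inodes_on_list++;
    t->need_data_flush = true;
}

/* Flusher wrote the data and the inode was reclaimed before commit. */
static void txn_reclaim_inode(struct txn_model *t)
{
    t->inodes_on_list--;   /* the flag deliberately stays set */
}

/* Decision based on the list contents at commit time (assumed old rule). */
static bool flush_by_list(const struct txn_model *t, bool separate_fs_dev)
{
    return separate_fs_dev && t->inodes_on_list > 0;
}

/* Decision based on the sticky flag, as the fix describes. */
static bool flush_by_flag(const struct txn_model *t, bool separate_fs_dev)
{
    return separate_fs_dev && t->need_data_flush;
}

/* Runs the rare scenario and reports whether a flush would be sent. */
static bool scenario_sends_flush(bool use_flag)
{
    struct txn_model t = { 0 };

    txn_file_inode(&t);
    txn_reclaim_inode(&t);
    return use_flag ? flush_by_flag(&t, true) : flush_by_list(&t, true);
}
```

In the scenario, only the flag-based decision still sends the flush to the file system device.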
     

23 May, 2011

1 commit

    t_max_wait was added in commit 8e85fb3f to indicate how long we
    waited for a new transaction to start. In commit 6d0bf005, the
    calculation was moved to another function, update_t_max_wait(), to
    avoid a build warning. The problem is that 'ts' was originally
    initialized at the start of start_this_handle(), so t_max_wait
    could be calculated correctly, whereas after that change ts is
    initialized inside update_t_max_wait() and t_max_wait can never
    be calculated correctly.

    This patch moves the initialization of ts back to the beginning
    of start_this_handle() and passes it to update_t_max_wait(), so
    that t_max_wait is calculated correctly and the build warning is
    still avoided.

    Cc: Jan Kara
    Signed-off-by: Tao Ma
    Signed-off-by: "Theodore Ts'o"
    Reviewed-by: Eric Sandeen

    Tao Ma
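
The timestamp problem described above can be shown with a fake clock. This is a sketch only: `fake_jiffies`, `struct stats_model` and the function names are hypothetical, and the wait is modelled by simply advancing the clock.

```c
#include <assert.h>

static unsigned long fake_jiffies;  /* fake clock for the sketch */

struct stats_model {
    unsigned long max_wait;
};

/* Buggy shape: the timestamp is taken inside the helper, after the wait. */
static void update_max_wait_buggy(struct stats_model *s)
{
    unsigned long ts = fake_jiffies;
    unsigned long waited = fake_jiffies - ts;  /* always 0 */

    if (waited > s->max_wait)
        s->max_wait = waited;
}

/* Fixed shape: the caller passes the timestamp it took at entry. */
static void update_max_wait_fixed(struct stats_model *s, unsigned long ts)
{
    unsigned long waited = fake_jiffies - ts;

    if (waited > s->max_wait)
        s->max_wait = waited;
}

/* Models start_this_handle(): take ts, wait, then update the statistic. */
static unsigned long start_handle_model(int fixed, unsigned long wait_ticks)
{
    struct stats_model s = { 0 };
    unsigned long ts = fake_jiffies;  /* taken at function entry */

    fake_jiffies += wait_ticks;       /* waiting for a transaction */
    if (fixed)
        update_max_wait_fixed(&s, ts);
    else
        update_max_wait_buggy(&s);
    return s.max_wait;
}
```

The buggy variant records a wait of zero no matter how long the caller waited, which matches the symptom described in the commit message.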
     

31 Mar, 2011

1 commit


12 Feb, 2011

1 commit

  • On an SMP ARM system running ext4, I've received a report that the
    first J_ASSERT in jbd2_journal_commit_transaction has been triggering:

    J_ASSERT(journal->j_running_transaction != NULL);

    While investigating possible causes for this problem, I noticed that
    __jbd2_log_start_commit() is getting called with j_state_lock only
    read-locked, in spite of the fact that it's possible for it to modify
    j_commit_request. Fix this by grabbing the necessary information so
    we can test to see if we need to start a new transaction before
    dropping the read lock, and then calling jbd2_log_start_commit() which
    will grab the write lock.

    Signed-off-by: "Theodore Ts'o"

    Theodore Ts'o
     

14 Jan, 2011

1 commit

  • * 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (43 commits)
    Documentation/trace/events.txt: Remove obsolete sched_signal_send.
    writeback: fix global_dirty_limits comment runtime -> real-time
    ppc: fix comment typo singal -> signal
    drivers: fix comment typo diable -> disable.
    m68k: fix comment typo diable -> disable.
    wireless: comment typo fix diable -> disable.
    media: comment typo fix diable -> disable.
    remove doc for obsolete dynamic-printk kernel-parameter
    remove extraneous 'is' from Documentation/iostats.txt
    Fix spelling milisec -> ms in snd_ps3 module parameter description
    Fix spelling mistakes in comments
    Revert conflicting V4L changes
    i7core_edac: fix typos in comments
    mm/rmap.c: fix comment
    sound, ca0106: Fix assignment to 'channel'.
    hrtimer: fix a typo in comment
    init/Kconfig: fix typo
    anon_inodes: fix wrong function name in comment
    fix comment typos concerning "consistent"
    poll: fix a typo in comment
    ...

    Fix up trivial conflicts in:
    - drivers/net/wireless/iwlwifi/iwl-core.c (moved to iwl-legacy.c)
    - fs/ext4/ext4.h

    Also fix missed 'diabled' typo in drivers/net/bnx2x/bnx2x.h while at it.

    Linus Torvalds
     

19 Dec, 2010

3 commits


10 Dec, 2010

1 commit


28 Oct, 2010

1 commit


10 Aug, 2010

1 commit


08 Aug, 2010

1 commit

  • * 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (40 commits)
    ext4: Adding error check after calling ext4_mb_regular_allocator()
    ext4: Fix dirtying of journalled buffers in data=journal mode
    ext4: re-inline ext4_rec_len_(to|from)_disk functions
    jbd2: Remove t_handle_lock from start_this_handle()
    jbd2: Change j_state_lock to be a rwlock_t
    jbd2: Use atomic variables to avoid taking t_handle_lock in jbd2_journal_stop
    ext4: Add mount options in superblock
    ext4: force block allocation on quota_off
    ext4: fix freeze deadlock under IO
    ext4: drop inode from orphan list if ext4_delete_inode() fails
    ext4: check to make make sure bd_dev is set before dereferencing it
    jbd2: Make barrier messages less scary
    ext4: don't print scary messages for allocation failures post-abort
    ext4: fix EFBIG edge case when writing to large non-extent file
    ext4: fix ext4_get_blocks references
    ext4: Always journal quota file modifications
    ext4: Fix potential memory leak in ext4_fill_super
    ext4: Don't error out the fs if the user tries to make a file too big
    ext4: allocate stripe-multiple IOs on stripe boundaries
    ext4: move aio completion after unwritten extent conversion
    ...

    Fix up conflicts in fs/ext4/inode.c as per Ted.

    Fix up xfs conflicts as per earlier xfs merge.

    Linus Torvalds
     

04 Aug, 2010

2 commits


02 Aug, 2010

1 commit


27 Jul, 2010

1 commit

  • __GFP_NOFAIL is going away, so add our own retry loop. Also add
    jbd2__journal_start() and jbd2__journal_restart() which take a gfp
    mask, so that file systems can optionally (re)start transaction
    handles using GFP_KERNEL. If they do this, then they need to be
    prepared to handle receiving an ERR_PTR(-ENOMEM) error, and be ready
    to reflect that error up to userspace.

    Signed-off-by: "Theodore Ts'o"

    Theodore Ts'o
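
A userspace sketch of the retry-loop idea above. `flaky_alloc()` is a hypothetical allocator that fails a configurable number of times, and the `may_fail` parameter stands in for the gfp mask: callers that can tolerate failure get -ENOMEM back, everyone else keeps retrying. The kernel code would also back off between attempts, which is omitted here.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

static int failures_left;  /* test knob: how many allocations still fail */
static int attempts;       /* how many times the allocator was called */

/* Hypothetical allocator that fails on demand. */
static void *flaky_alloc(size_t size)
{
    attempts++;
    if (failures_left > 0) {
        failures_left--;
        return NULL;
    }
    return malloc(size);
}

/*
 * may_fail == true models a caller that asked for failable allocation
 * and must handle -ENOMEM; may_fail == false models the open-coded
 * replacement for "never fail" semantics.
 */
static int alloc_with_retry(size_t size, bool may_fail, void **out)
{
    for (;;) {
        *out = flaky_alloc(size);
        if (*out)
            return 0;
        if (may_fail)
            return -ENOMEM;
        /* a real implementation would wait here before retrying */
    }
}

/* Helper: run one allocation with a given number of injected failures. */
static int try_alloc(int failures, bool may_fail)
{
    void *p = NULL;
    int ret;

    failures_left = failures;
    attempts = 0;
    ret = alloc_with_retry(16, may_fail, &p);
    free(p);
    return ret;
}
```

A caller that opts into failable allocation sees the error on the first failed attempt and has to propagate it, which is the trade-off the commit message describes.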
     

16 Jul, 2010

1 commit

  • OCFS2 uses t_commit trigger to compute and store checksum of the just
    committed blocks. When a buffer has b_frozen_data, checksum is computed
    for it instead of b_data but this can result in an old checksum being
    written to the filesystem in the following scenario:

    1) transaction1 is opened
    2) handle1 is opened
    3) journal_access(handle1, bh)
       - This sets jh->b_transaction to transaction1
    4) modify(bh)
    5) journal_dirty(handle1, bh)
    6) handle1 is closed
    7) start committing transaction1, opening transaction2
    8) handle2 is opened
    9) journal_access(handle2, bh)
       - This copies off b_frozen_data to make it safe for transaction1 to commit.
         jh->b_next_transaction is set to transaction2.
    10) jbd2_journal_write_metadata() checksums b_frozen_data
    11) the journal correctly writes b_frozen_data to the disk journal
    12) handle2 is closed
        - There was no dirty call for the bh on handle2, so it is never queued for
          any more journal operation
    13) Checkpointing finally happens, and it just spools the bh via normal buffer
        writeback. This will write b_data, which was never triggered on and thus
        contains a wrong (old) checksum.

    This patch fixes the problem by calling the trigger at the moment data is
    frozen for journal commit - i.e., either when b_frozen_data is created by
    do_get_write_access or just before we write a buffer to the log if
    b_frozen_data does not exist. We also rename the trigger to t_frozen as
    that better describes when it is called.

    Signed-off-by: Jan Kara
    Signed-off-by: Mark Fasheh
    Signed-off-by: Joel Becker

    Jan Kara
     

16 May, 2010

1 commit

  • One of the most contended locks in the jbd2 layer is j_state_lock when
    running dbench. This is especially true if using the real-time kernel
    with its "sleeping spinlocks" patch that replaces spinlocks with
    priority inheriting mutexes --- but it also shows up on large SMP
    benchmarks.

    Thanks to John Stultz for pointing this out.

    Reviewed by Mingming Cao and Jan Kara.

    Signed-off-by: "Theodore Ts'o"

    Theodore Ts'o
     

16 Feb, 2010

1 commit

    Delay discarding buffers in journal_unmap_buffer() until
    we know that the "add to orphan" operation has definitely been
    committed. Otherwise the log space of the committing transaction
    may be freed and reused before the truncate gets committed, and
    updates may be lost if a crash happens.

    Signed-off-by: dingdinghua
    Signed-off-by: "Theodore Ts'o"

    dingdinghua
     

18 Aug, 2009

1 commit


11 Aug, 2009

1 commit

    Fix jiffy rounding in the jbd commit timer setup code. Rounding down
    could cause the timer to fire before the corresponding transaction
    has expired. That transaction can then stay uncommitted forever if no
    new transaction is created and no explicit sync/umount happens.

    Signed-off-by: Alex Zhuravlev (Tomas)
    Signed-off-by: Andreas Dilger
    Signed-off-by: "Theodore Ts'o"

    Andreas Dilger
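
The rounding problem above comes down to simple arithmetic, sketched here with plain integers. `TICKS_PER_SEC` and the helper names are assumptions for the example; the kernel's own rounding helpers are not reproduced.

```c
#include <assert.h>

#define TICKS_PER_SEC 100UL  /* assumed rounding granularity for the sketch */

/* Round a tick count down to a whole second. */
static unsigned long round_down_ticks(unsigned long t)
{
    return t - t % TICKS_PER_SEC;
}

/* Round a tick count up to a whole second. */
static unsigned long round_up_ticks(unsigned long t)
{
    unsigned long rem = t % TICKS_PER_SEC;

    return rem ? t + (TICKS_PER_SEC - rem) : t;
}

/* True if a timer armed for timer_at fires before the expiry time. */
static int fires_early(unsigned long timer_at, unsigned long expires)
{
    return timer_at < expires;
}
```

For an expiry at tick 1250, rounding down arms the timer at 1200, before the transaction has expired, while rounding up arms it at 1300, so the commit thread always finds an expired transaction when the timer fires.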
     

14 Jul, 2009

1 commit

  • The following race can happen:

    CPU1 CPU2
    checkpointing code checks the buffer, adds
    it to an array for writeback
    do_get_write_access()
    ...
    lock_buffer()
    unlock_buffer()
    flush_batch() submits the buffer for IO
    __jbd2_journal_file_buffer()

    So a buffer under writeout is returned from
    do_get_write_access(). Since the filesystem code relies on the fact
    that journaled buffers cannot be written out, it does not take the
    buffer lock and so it can modify buffer while it is under
    writeout. That can lead to a filesystem corruption if we crash at the
    right moment.

    We fix the problem by clearing the buffer dirty bit under buffer_lock
    even if the buffer is on BJ_None list. Actually, we clear the dirty
    bit regardless of the list the buffer is on and warn about the fact if
    the buffer is already journalled.

    Thanks to dingdinghua for spotting the problem.

    Reported-by: dingdinghua
    Signed-off-by: Jan Kara
    Signed-off-by: "Theodore Ts'o"

    Jan Kara
     

18 Jun, 2009

1 commit

    This patch reverts 3f31fddf, which is no longer needed: if a race
    between freeing a buffer and committing a transaction occurs and
    dio gets an error, dio now falls back to buffered IO thanks to
    commit 6ccfa806.

    Signed-off-by: Hisashi Hifumi
    Cc: Mingming Cao
    Acked-by: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: "Theodore Ts'o"

    Hisashi Hifumi
     

26 Mar, 2009

1 commit


11 Feb, 2009

1 commit

  • If we race with commit code setting i_transaction to NULL, we could
    possibly dereference it. Proper locking requires the journal pointer
    (to access journal->j_list_lock), which we don't have. So we have to
    change the prototype of the function so that filesystem passes us the
    journal pointer. Also add a more detailed comment about why the
    function jbd2_journal_begin_ordered_truncate() does what it does and
    how it should be used.

    Thanks to Dan Carpenter for pointing out the suspicious code.

    Signed-off-by: Jan Kara
    Signed-off-by: "Theodore Ts'o"
    Acked-by: Joel Becker
    CC: linux-ext4@vger.kernel.org
    CC: ocfs2-devel@oss.oracle.com
    CC: mfasheh@suse.de
    CC: Dan Carpenter

    Jan Kara
     

09 Jan, 2009

1 commit

  • * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (57 commits)
    jbd2: Fix oops in jbd2_journal_init_inode() on corrupted fs
    ext4: Remove "extents" mount option
    block: Add Kconfig help which notes that ext4 needs CONFIG_LBD
    ext4: Make printk's consistently prefixed with "EXT4-fs: "
    ext4: Add sanity checks for the superblock before mounting the filesystem
    ext4: Add mount option to set kjournald's I/O priority
    jbd2: Submit writes to the journal using WRITE_SYNC
    jbd2: Add pid and journal device name to the "kjournald2 starting" message
    ext4: Add markers for better debuggability
    ext4: Remove code to create the journal inode
    ext4: provide function to release metadata pages under memory pressure
    ext3: provide function to release metadata pages under memory pressure
    add releasepage hooks to block devices which can be used by file systems
    ext4: Fix s_dirty_blocks_counter if block allocation failed with nodelalloc
    ext4: Init the complete page while building buddy cache
    ext4: Don't allow new groups to be added during block allocation
    ext4: mark the blocks/inode bitmap beyond end of group as used
    ext4: Use new buffer_head flag to check uninit group bitmaps initialization
    ext4: Fix the race between read_inode_bitmap() and ext4_new_inode()
    ext4: code cleanup
    ...

    Linus Torvalds
     

06 Jan, 2009

1 commit

    Filesystems often do compute-intensive operations on some
    metadata. If such an operation is repeated many times, it can be very
    expensive. It would be much nicer if the operation could be performed
    once before a buffer goes to disk.

    This adds triggers to jbd2 buffer heads. Just before writing a metadata
    buffer to the journal, jbd2 will optionally call a commit trigger associated
    with the buffer. If the journal is aborted, an abort trigger will be
    called on any dirty buffers as they are dropped from pending
    transactions.

    ocfs2 will use this feature.

    Initially I tried to come up with a more generic trigger that could be
    used for non-buffer-related events like transaction completion. It
    doesn't tie nicely, because the information a buffer trigger needs
    (specific to a journal_head) isn't the same as what a transaction
    trigger needs (specific to a transaction_t or perhaps journal_t). So I
    implemented a buffer set, with the understanding that
    journal/transaction wide triggers should be implemented separately.

    There is only one trigger set allowed per buffer. I can't think of any
    reason to attach more than one set. Contrast this with a journal or
    transaction in which multiple places may want to watch the entire
    transaction separately.

    The trigger sets are considered static allocation from the jbd2
    perspective. ocfs2 will just have one trigger set per block type,
    setting the same set on every bh of the same type.

    Signed-off-by: Joel Becker
    Cc: "Theodore Ts'o"
    Signed-off-by: Mark Fasheh

    Joel Becker
     

04 Jan, 2009

1 commit

    Add new mount options, min_batch_time and max_batch_time, which
    control how long the jbd2 layer should wait for additional filesystem
    operations to get batched with a synchronous write transaction.

    Signed-off-by: "Theodore Ts'o"

    Theodore Ts'o
     

26 Nov, 2008

1 commit

    This patch removes the static sleep time in favor of a more
    self-optimizing approach, in which we measure the average time it
    takes to commit a transaction to disk and the amount of time a
    transaction has been running. Traditionally, if somebody did a sync
    write or an fsync(), we would sleep for 1 jiffy, which, depending on
    the value of HZ, could be a significant amount of time compared to how
    long it takes to commit a transaction to the underlying storage. With
    this patch, instead of sleeping for a jiffy, we check whether the
    amount of time this transaction has been running is less than the
    average commit time, and if it is, we sleep for the delta using
    schedule_hrtimeout() to get a higher-precision sleep time. This
    greatly benefits high-end storage, where you could otherwise end up
    sleeping for longer than it takes to commit the transaction and
    therefore sit idle instead of letting the transaction commit. Keeping
    the sleep time to a minimum ensures you are always doing something.

    Signed-off-by: Josef Bacik
    Signed-off-by: "Theodore Ts'o"

    Josef Bacik
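
The sleep-time rule described above reduces to one comparison and one subtraction. The sketch below assumes both values are kept in nanoseconds; `batch_sleep_ns()` is a hypothetical helper name, and the bookkeeping that maintains the average commit time is not shown.

```c
#include <assert.h>
#include <stdint.h>

/*
 * How long to sleep before forcing a commit: only the part of the
 * average commit time that the transaction has not yet been running.
 */
static uint64_t batch_sleep_ns(uint64_t txn_age_ns, uint64_t avg_commit_ns)
{
    if (txn_age_ns < avg_commit_ns)
        return avg_commit_ns - txn_age_ns;
    return 0;  /* already older than an average commit: do not sleep */
}
```

For example, with an average commit time of 500 microseconds, a transaction that has been running for 200 microseconds would sleep for the remaining 300, and one that has been running for 700 would not sleep at all.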
     

17 Oct, 2008

1 commit

    The multiblock allocator needs to be able to release blocks (and issue
    a blkdev discard request) when the transaction which freed those
    blocks is committed. Previously this was done via a polling mechanism
    when blocks are allocated or freed. A much better way of doing things
    is to create a jbd2 callback function and attach the list of blocks
    to be freed directly to the transaction structure.

    Signed-off-by: "Theodore Ts'o"

    Theodore Ts'o
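
A sketch of the callback approach described above, assuming a toy transaction structure. `struct txn_model`, `commit_cb`, `struct freed_extent` and the helper functions are hypothetical; a real callback would return the blocks to the allocator and issue discards rather than just count them.

```c
#include <assert.h>
#include <stddef.h>

/* One run of blocks freed inside the transaction (hypothetical layout). */
struct freed_extent {
    unsigned long start;
    unsigned long count;
    struct freed_extent *next;
};

/* Toy transaction carrying a commit callback and its private list. */
struct txn_model {
    void (*commit_cb)(struct txn_model *t);  /* called once at commit */
    struct freed_extent *freed;              /* blocks to release then */
    unsigned long released;                  /* blocks released so far */
};

/* Attach a freed extent directly to the transaction. */
static void txn_add_freed(struct txn_model *t, struct freed_extent *e)
{
    e->next = t->freed;
    t->freed = e;
}

/* Callback: release everything that was queued on this transaction. */
static void release_freed_blocks(struct txn_model *t)
{
    for (struct freed_extent *e = t->freed; e; e = e->next)
        t->released += e->count;  /* stand-in for free + discard */
    t->freed = NULL;
}

/* Commit path: after the journal writes, invoke the callback if set. */
static void txn_commit(struct txn_model *t)
{
    if (t->commit_cb)
        t->commit_cb(t);
}

/* Queues two extents, commits, and returns how many blocks were released. */
static unsigned long demo_commit(void)
{
    struct freed_extent a = { 100, 8, NULL };
    struct freed_extent b = { 300, 4, NULL };
    struct txn_model t = { release_freed_blocks, NULL, 0 };

    txn_add_freed(&t, &a);
    txn_add_freed(&t, &b);
    assert(t.released == 0);  /* nothing is released before commit */
    txn_commit(&t);
    return t.released;
}
```

Compared with polling on every allocation or free, the release work runs once, at the moment the freeing transaction commits.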
     

11 Aug, 2008

1 commit