Eric Lee / smarc-fsl-linux-kernel

16 Sep, 2009

3 commits

3adae9da0 jbd: Annotate transaction start also for journal_restart()

lockdep annotation for a transaction start has been at the end of
journal_start(). But a transaction is also started from journal_restart(). Move
the lockdep annotation to start_this_handle() which covers both cases.

Signed-off-by: Jan Kara

Jan Kara
2009-09-16 23:44:10 +0800
9c28cbcce jbd: Journal block numbers can ever be only 32-bit use unsigned int for them

It does not make sense to store journal block numbers as unsigned long
since they can only be 32-bit (because of an on-disk format limitation). So
change in-memory structures and variables to use unsigned int instead.

Signed-off-by: Jan Kara

Jan Kara
2009-09-16 23:44:10 +0800
b449fc6fc JBD: round commit timer up to avoid uncommitted transaction

Fix jiffy rounding in the jbd commit timer setup code. Rounding down could
cause the timer to fire before the corresponding transaction has expired.
That transaction can then stay uncommitted indefinitely unless a new
transaction is created or an explicit sync/umount happens.
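
To see why the rounding direction matters, here is a minimal userspace
sketch. It assumes a tick rate of 100 ticks per second and rounds an expiry
time to a whole-second boundary. The helper names are hypothetical and are
not taken from the patch.

```c
#include <stdbool.h>

/* Assumed tick rate for this sketch: 100 ticks per second. */
#define SKETCH_HZ 100UL

/* Round a tick count down to the previous whole-second boundary. */
unsigned long round_ticks_down(unsigned long t)
{
    return t - (t % SKETCH_HZ);
}

/* Round a tick count up to the next whole-second boundary. */
unsigned long round_ticks_up(unsigned long t)
{
    unsigned long rem = t % SKETCH_HZ;

    return rem ? t + (SKETCH_HZ - rem) : t;
}

/* A timer fires too early if it goes off before the transaction expires. */
bool fires_before_expiry(unsigned long timer, unsigned long expires)
{
    return timer < expires;
}
```

With an expiry at tick 1234, the rounded-down timer fires at tick 1200,
before the transaction has expired, so the commit is skipped and nothing
re-arms the timer. The rounded-up timer fires at tick 1300, after expiry.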

Signed-off-by: Andreas Dilger
Signed-off-by: Jan Kara

Andreas Dilger
2009-09-16 23:44:10 +0800

21 Jul, 2009

1 commit

f1015c447 jbd: fix race between write_metadata_buffer and get_write_access

The function journal_write_metadata_buffer() calls jbd_unlock_bh_state(bh_in)
too early; this could potentially allow another thread to call get_write_access
on the buffer head, modify the data, and dirty it, allowing the wrong data
to be written into the journal. Fortunately, if we lose this race, the only
time this will actually cause filesystem corruption is if there is a system
crash or other unclean shutdown of the system before the next commit can take
place.

Signed-off-by: dingdinghua
Acked-by: "Theodore Ts'o"
Signed-off-by: Jan Kara

dingdinghua
2009-07-21 17:54:42 +0800

16 Jul, 2009

2 commits

1e9fd53b7 jbd: Fix a race between checkpointing code and journal_get_write_access()

The following race can happen:

  CPU1                                        CPU2
checkpointing code checks the buffer, adds
  it to an array for writeback
                                            do_get_write_access()
                                              ...
                                              lock_buffer()
                                              unlock_buffer()
flush_batch() submits the buffer for IO
                                            __jbd_journal_file_buffer()

So a buffer under writeout is returned from do_get_write_access(). Since
the filesystem code relies on the fact that journaled buffers cannot be
written out, it does not take the buffer lock and so it can modify buffer
while it is under writeout. That can lead to a filesystem corruption
if we crash at the right moment. A similar problem can happen with
the journal_get_create_access() path.
We fix the problem by clearing the buffer dirty bit under buffer_lock
even if the buffer is on the BJ_None list. Actually, we clear the dirty bit
regardless of the list the buffer is on and warn about the fact if
the buffer is already journalled.

Thanks for spotting the problem go to dingdinghua.

Reported-by: dingdinghua
Signed-off-by: Jan Kara

Jan Kara
2009-07-16 03:30:07 +0800
7447a668a jbd: Fail to load a journal if it is too short

Due to on-disk corruption, it can happen that the journal is too short. Fail
to load it in such a case so that we don't oops somewhere later.

Reported-by: Nageswara R Sastry
Signed-off-by: Jan Kara

Jan Kara
2009-07-16 03:26:23 +0800

19 Jun, 2009

1 commit

6f3f1cb21 jbd: clean up journal_try_to_free_buffers()

This patch removes the changes made by the following commit:

commit 3f31fddfa26b7594b44ff2b34f9a04ba409e0f91
Author: Mingming Cao
Date: Fri Jul 25 01:46:22 2008 -0700

jbd: fix race between free buffer and commit transaction

That change is no longer needed: if the race between freeing a buffer and
committing a transaction occurs and dio gets an error, dio now falls back
to buffered IO thanks to the following patch.

commit 6ccfa806a9cfbbf1cd43d5b6aa47ef2c0eb518fd
Author: Hisashi Hifumi
Date: Tue Sep 2 14:35:40 2008 -0700

VFS: fix dio write returning EIO when try_to_release_page fails

Signed-off-by: Hisashi Hifumi
Cc: Theodore Tso
Cc: Mingming Cao
Acked-by: Jan Kara
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Hisashi Hifumi
2009-06-19 04:03:45 +0800

10 Jun, 2009

1 commit

a61d90d75 jbd: fix race in buffer processing in commit code

In commit code, we scan buffers attached to a transaction. During this
scan, we sometimes have to drop j_list_lock and then we recheck whether
the journal buffer head didn't get freed by journal_try_to_free_buffers().
But checking for buffer_jbd(bh) isn't enough because a new journal head
could get attached to our buffer head. So add a check whether the journal
head remained the same and whether it's still at the same transaction and
list.
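
The recheck described above can be sketched in userspace as follows. The
structures are simplified stand-ins with hypothetical field names, not the
kernel's struct buffer_head or struct journal_head, and locking is left out.

```c
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-in for a journal head; field names are illustrative. */
struct journal_head_sketch {
    int transaction_id;   /* transaction the buffer belongs to */
    int list_id;          /* which per-transaction list it is on */
};

/* Simplified stand-in for a buffer head. */
struct buffer_head_sketch {
    struct journal_head_sketch *jh; /* NULL when no journal head is attached */
};

/*
 * After re-taking the list lock, the buffer is only safe to keep processing
 * if the very same journal head is still attached and it is still on the
 * same transaction and list as before the lock was dropped.
 */
bool still_same_buffer(const struct buffer_head_sketch *bh,
                       const struct journal_head_sketch *old_jh,
                       int old_transaction, int old_list)
{
    if (bh->jh == NULL)          /* journal head was freed */
        return false;
    if (bh->jh != old_jh)        /* a new journal head got attached */
        return false;
    return bh->jh->transaction_id == old_transaction &&
           bh->jh->list_id == old_list;
}
```

Checking only that some journal head is attached corresponds to the first
test alone; the second and third tests are the additional conditions the
commit message asks for.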

This is a nasty bug and can cause problems like memory corruption (use after
free) or trigger various assertions in JBD code (observed).

Signed-off-by: Jan Kara
Cc:
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Jan Kara
2009-06-10 07:59:03 +0800

24 Apr, 2009

1 commit

a4277bf12 Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
ext4: Fix potential inode allocation soft lockup in Orlov allocator
ext4: Make the extent validity check more paranoid
jbd: use SWRITE_SYNC_PLUG when writing synchronous revoke records
jbd2: use SWRITE_SYNC_PLUG when writing synchronous revoke records
ext4: really print the find_group_flex fallback warning only once

Linus Torvalds
2009-04-24 23:37:40 +0800

14 Apr, 2009

2 commits

38d726d15 jbd: use SWRITE_SYNC_PLUG when writing synchronous revoke records

The revoke records must be written in the same way as the rest of
the blocks during the commit process; that is, either marked as
synchronous writes or as asynchronous writes.

Signed-off-by: "Theodore Ts'o"

Theodore Ts'o
2009-04-14 22:10:47 +0800
324338794 jbd: update locking coments

Update information about locking in JBD revoke code.

Reported-by: Lin Tan
Signed-off-by: Jan Kara
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Jan Kara
2009-04-14 06:04:32 +0800

06 Apr, 2009

1 commit

6c4bac6b3 jbd: use WRITE_SYNC_PLUG instead of WRITE_SYNC

When we are going to submit several sync writes, we want to
give the IO scheduler a chance to merge some of them. Instead of
using the implicitly unplugging WRITE_SYNC variant, use WRITE_SYNC_PLUG
and rely on sync_buffer() doing the unplug when someone does a
wait_on_buffer()/lock_buffer().

Signed-off-by: Jens Axboe
Signed-off-by: Linus Torvalds

Jens Axboe
2009-04-06 23:04:53 +0800

04 Apr, 2009

1 commit

20bec8ab1 Merge branch 'ext3-latency-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

* 'ext3-latency-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
ext3: Add replace-on-rename hueristics for data=writeback mode
ext3: Add replace-on-truncate hueristics for data=writeback mode
ext3: Use WRITE_SYNC for commits which are caused by fsync()
block_write_full_page: Use synchronous writes for WBC_SYNC_ALL writebacks

Linus Torvalds
2009-04-04 02:10:33 +0800

03 Apr, 2009

1 commit

ecca9af0a jbd: fix oops in jbd_journal_init_inode() on corrupted fs

On a 32-bit system with CONFIG_LBD, getblk can fail because the provided
block number is too big. Make JBD handle that gracefully.

Signed-off-by: Jan Kara
Cc:
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Jan Kara
2009-04-03 10:04:52 +0800

28 Mar, 2009

1 commit

512a00438 ext3: Use WRITE_SYNC for commits which are caused by fsync()

If a commit is triggered by fsync(), set a flag indicating the journal
blocks associated with the transaction should be flushed out using
WRITE_SYNC.

Signed-off-by: "Theodore Ts'o"
Acked-by: Jan Kara

Theodore Ts'o
2009-03-28 10:14:27 +0800

12 Feb, 2009

1 commit

8fe4cd0dc jbd: fix return value of journal_start_commit()

journal_start_commit() returns 1 if either a transaction is committing or
the function has queued a transaction commit. But it returns 0 if we
raced with somebody queueing the transaction commit as well. This
resulted in ext3_sync_fs() not functioning correctly (description from
Arthur Jones): In the case of a data=ordered umount with pending long
symlinks which are delayed due to a long list of other I/O on the backing
block device, this causes the buffer associated with the long symlinks to
not be moved to the inode dirty list in the second phase of fsync_super.
Then, before they can be dirtied again, kjournald exits, seeing the UMOUNT
flag and the dirty pages are never written to the backing block device,
causing long symlink corruption and exposing new or previously freed block
data to userspace.

This can be reproduced with a script created by Eric Sandeen:

#!/bin/bash

umount /mnt/test2
mount /dev/sdb4 /mnt/test2
rm -f /mnt/test2/*
dd if=/dev/zero of=/mnt/test2/bigfile bs=1M count=512
touch /mnt/test2/thisisveryveryveryveryveryveryveryveryveryveryveryveryveryveryveryverylongfilename
ln -s /mnt/test2/thisisveryveryveryveryveryveryveryveryveryveryveryveryveryveryveryverylongfilename /mnt/test2/link
umount /mnt/test2
mount /dev/sdb4 /mnt/test2
ls /mnt/test2/

This patch fixes journal_start_commit() to always return 1 when there's
a transaction committing or queued for commit.
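
The difference in return values can be illustrated with a small userspace
model. It assumes the old code simply passed through the result of the
helper that queues the commit; all names below are hypothetical and the
real function's locking and transaction IDs are left out.

```c
#include <stdbool.h>

/* Toy model of the journal state relevant to the return value. */
struct journal_sketch {
    bool has_running;       /* a running transaction exists */
    bool has_committing;    /* a transaction is being committed */
    bool commit_requested;  /* someone already queued the running one */
};

/* Queue a commit; returns 1 only if THIS call did the queueing. */
int queue_commit(struct journal_sketch *j)
{
    if (j->commit_requested)
        return 0;
    j->commit_requested = true;
    return 1;
}

/* Assumed old behaviour: losing the race to queue the commit yields 0. */
int start_commit_old(struct journal_sketch *j)
{
    if (j->has_running)
        return queue_commit(j);
    return j->has_committing ? 1 : 0;
}

/* Fixed behaviour: 1 whenever a commit is in progress or queued. */
int start_commit_fixed(struct journal_sketch *j)
{
    if (j->has_running) {
        queue_commit(j);
        return 1;
    }
    return j->has_committing ? 1 : 0;
}
```

A caller that waits for the commit only when the return value is 1 would
skip the wait in the old variant whenever another thread queued the commit
first, which is the situation the ext3_sync_fs() report describes.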

Cc: Eric Sandeen
Cc: Mike Snitzer
Cc:
Signed-off-by: Jan Kara
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Jan Kara
2009-02-12 06:25:35 +0800

09 Jan, 2009

2 commits

1579c3a15 jbd: remove excess kernel-doc notation

Remove excess kernel-doc from fs/jbd/transaction.c:

Warning(linux-2.6.28-git5//fs/jbd/transaction.c:764): Excess function parameter 'credits' description in 'journal_get_write_access'

Signed-off-by: Randy Dunlap
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Randy Dunlap
2009-01-09 00:31:01 +0800
f420d4dc4 jbd: improve fsync batching

There is a flaw in the way jbd handles fsync batching. If we fsync() a
file and we were not the last person to run fsync() on this fs, then we
automatically sleep for 1 jiffy in order to wait for new writers to join
the transaction before forcing the commit. The problem is that with
really fast storage (e.g. a Clariion) committing a transaction to disk
takes far less than 1 jiffy in most cases, so sleeping means waiting
longer with nothing to do than if we just committed the transaction and
kept going. Ric Wheeler noticed this when using fs_mark with more than
1 thread: the throughput would plummet as he added more threads.

This patch attempts to fix this problem by recording the average time in
nanoseconds that it takes to commit a transaction to disk, and the time at
which we started the transaction. If we run an fsync() and we have been
running for less time than it takes to commit the transaction to disk, we
sleep for the delta amount of time and then commit to disk. We achieve
sub-jiffy sleeping using schedule_hrtimeout. This means that the wait
time is auto-tuned to the speed of the underlying disk, instead of having
this static timeout. I weighted the average according to somebody's
comments (Andreas Dilger I think) in order to help normalize random
outliers where we take way longer or way less time to commit than the
average. I also have a min() check in there to make sure we don't sleep
longer than a jiffy in case our storage is super slow; this was requested
by Andrew.
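
The auto-tuning described above can be sketched in userspace as below. The
3:1 weighting toward the old average and the 1000 Hz tick length are
assumptions of this sketch, and the function names are hypothetical; the
message itself only says that the average is weighted and that the sleep is
capped at one jiffy.

```c
#include <stdint.h>

/* Assumed cap for the sleep: one tick at an assumed 1000 Hz, in ns. */
#define ONE_TICK_NS 1000000ULL

/*
 * Fold a new commit-time sample into the running average. Weighting the
 * old average more heavily (an assumption here) damps random outliers.
 */
uint64_t update_avg_commit_ns(uint64_t avg_ns, uint64_t sample_ns)
{
    if (avg_ns == 0)
        return sample_ns;            /* first sample seeds the average */
    return (sample_ns + 3 * avg_ns) / 4;
}

/*
 * How long an fsync() caller should wait for other writers to join:
 * the part of an average commit time that the transaction has not yet
 * been running for, capped at one tick.
 */
uint64_t batching_sleep_ns(uint64_t avg_commit_ns, uint64_t running_ns)
{
    uint64_t delta;

    if (running_ns >= avg_commit_ns)
        return 0;                    /* already ran long enough: commit now */
    delta = avg_commit_ns - running_ns;
    return delta < ONE_TICK_NS ? delta : ONE_TICK_NS;
}
```

On fast storage the average commit time is small, so the computed sleep is
far shorter than a tick; on slow storage the cap keeps the wait no longer
than the old fixed one-jiffy sleep.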

I unfortunately do not have access to a Clariion, so I had to use a
ramdisk to represent a super fast array. I tested with a SATA drive with
barrier=1 to make sure there was no regression with local disks, with
a 4 way multipathed Apple Xserve RAID array, and of course with the
ramdisk. I ran the following command:

fs_mark -d /mnt/ext3-test -s 4096 -n 2000 -D 64 -t $i

where $i was 2, 4, 8, 16 and 32. I mkfs'ed the fs each time. Here are my
results

type     threads  with patch  without patch
sata           2        24.6           26.3
sata           4        49.2           48.1
sata           8        70.1           67.0
sata          16       104.0           94.1
sata          32       153.6          142.7

xserve         2       246.4          222.0
xserve         4       480.0          440.8
xserve         8       829.5          730.8
xserve        16      1172.7         1026.9
xserve        32      1816.3         1650.5

ramdisk        2      2538.3         1745.6
ramdisk        4      2942.3          661.9
ramdisk        8      2882.5          999.8
ramdisk       16      2738.7         1801.9
ramdisk       32      2541.9         2394.0

Signed-off-by: Josef Bacik
Cc: Andreas Dilger
Cc: Arjan van de Ven
Cc: Ric Wheeler
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Josef Bacik
2009-01-09 00:31:00 +0800

07 Nov, 2008

1 commit

e219cca08 jbd: don't give up looking for space so easily in __log_wait_for_space

Commit be07c4ed introduced a regression because it assumed that if
there were no transactions ready to be checkpointed, that no progress
could be made on making space available in the journal, and so the
journal should be aborted. This assumption is false; it could be the
case that simply calling cleanup_journal_tail() will recover the
necessary space, or, for small journals, the currently committing
transaction could be responsible for chewing up the required space in
the log, so we need to wait for the currently committing transaction
to finish before trying to force a checkpoint operation.

This patch fixes the bug reported by Meelis Roos at:
http://bugzilla.kernel.org/show_bug.cgi?id=11937

Signed-off-by: "Theodore Ts'o"
Cc: Duane Griffin
Cc: Toshiyuki Okajima

Theodore Ts'o
2008-11-07 11:37:59 +0800

31 Oct, 2008

1 commit

e74481e23 fs: remove excess kernel-doc

Delete excess kernel-doc notation in fs/ subdirectory:

Warning(linux-2.6.27-git10//fs/jbd/transaction.c:886): Excess function parameter or struct member 'credits' description in 'journal_get_undo_access'

Signed-off-by: Randy Dunlap
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Randy Dunlap
2008-10-31 02:38:46 +0800

23 Oct, 2008

3 commits

be07c4ed4 jbd: abort instead of waiting for nonexistent transactions

The __log_wait_for_space function sits in a loop checkpointing
transactions until there is sufficient space free in the journal.
However, if there are no transactions to be processed (e.g. because the
free space calculation is wrong due to a corrupted filesystem) it will
never progress.

Check for space being required when no transactions are outstanding and
abort the journal instead of endlessly looping.

This patch fixes the bug reported by Sami Liedes at:
http://bugzilla.kernel.org/show_bug.cgi?id=10976

Signed-off-by: Duane Griffin
Tested-by: Sami Liedes
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Duane Griffin
2008-10-23 23:55:02 +0800
9f818b4ac jbd: test BH_Write_EIO to detect errors on metadata buffers

__try_to_free_cp_buf(), __process_buffer(), and __wait_cp_io() test the
BH_Uptodate flag to detect write I/O errors on metadata buffers. But since
commit 95450f5a7e53d5752ce1a0d0b8282e10fe745ae0 "ext3: don't read inode
block if the buffer has a write error"(*), the BH_Uptodate flag can be set
on inode buffers with BH_Write_EIO in order to avoid reading old inode data.
So now we have to test the BH_Write_EIO flag of checkpointing inode buffers
instead of BH_Uptodate. This patch does that.

Signed-off-by: Hidehiro Kawai
Acked-by: Jan Kara
Acked-by: Eric Sandeen
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Hidehiro Kawai
2008-10-23 23:55:02 +0800
4afe97853 jbd: fix error handling for checkpoint io

When a checkpointing IO fails, current JBD code doesn't check the error
and continues journaling. This means the latest metadata can be lost from
both the journal and the filesystem.

This patch leaves the failed metadata blocks in the journal space and
aborts journaling in the case of log_do_checkpoint(). To achieve this, we
need to do:

1. don't remove the failed buffer from the checkpoint list in the
   __try_to_free_cp_buf() case, because it may be released or
   overwritten by a later transaction
2. log_do_checkpoint() is the last chance: remove the failed buffer
   from the checkpoint list and abort the journal
3. when checkpointing fails, don't update the journal super block, to
   prevent the journaled contents from being cleaned. For safety,
   don't update j_tail and j_tail_sequence either
4. when checkpointing fails, report the error to the ext3 layer so
   that ext3 doesn't clear the needs_recovery flag; otherwise the
   journaled contents are ignored and cleaned in the recovery phase
5. if the recovery fails, keep the needs_recovery flag
6. prevent cleanup_journal_tail() from being called between
   __journal_drop_transaction() and journal_abort() (a race
   between journal_flush() and __log_wait_for_space())

Signed-off-by: Hidehiro Kawai
Acked-by: Jan Kara
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Hidehiro Kawai
2008-10-23 23:55:01 +0800

21 Oct, 2008

1 commit

6da0b38f4 fs/Kconfig: move ext2, ext3, ext4, JBD, JBD2 out

Use fs/*/Kconfig more, which is good because everything related to one
filesystem is in one place and fs/Kconfig is quite fat.

Signed-off-by: Alexey Dobriyan
Signed-off-by: Linus Torvalds

Alexey Dobriyan
2008-10-21 02:43:59 +0800

20 Oct, 2008

4 commits

960a22ae6 jbd: ordered data integrity fix

In ordered mode, if a file data buffer being dirtied exists in the
committing transaction, we write the buffer to the disk, move it from the
committing transaction to the running transaction, then dirty it. But we
must not remove the buffer from the committing transaction when the
buffer couldn't be written out; otherwise the error would be missed and
the committing transaction would not abort.

This patch adds an error check before removing the buffer from the
committing transaction.

Signed-off-by: Hidehiro Kawai
Acked-by: Jan Kara
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Hidehiro Kawai
2008-10-20 23:52:37 +0800
0e4fb5e28 ext3: add an option to control error handling on file data

If the journal doesn't abort when it gets an IO error in file data blocks,
the file data corruption will spread silently. Because most applications
and commands do buffered writes without fsync(), they don't notice the IO
error. That is scary for mission-critical systems. On the other hand, if
the journal aborts whenever it gets an IO error in file data blocks, the
system will easily become inoperable. So this patch introduces a
filesystem option to determine whether it aborts the journal or just
calls printk() when it gets an IO error in file data.

If you mount an ext3 fs with the data_err=abort option, it aborts on a
file data write error. If you mount it with data_err=ignore, it doesn't
abort and just calls printk(). data_err=ignore is the default.

Signed-off-by: Hidehiro Kawai
Cc: Jan Kara
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Hidehiro Kawai
2008-10-20 23:52:37 +0800
885e353c7 jbd: don't dirty original metadata buffer on abort

Currently, original metadata buffers are dirtied when they are unfiled,
whether the journal has aborted or not. Eventually these buffers will be
written back to the filesystem by pdflush. This means some metadata
buffers are written to the filesystem without journaling if the journal
aborts. So if a journal abort and a system crash happen at the same
time, the filesystem would be left in an inconsistent state. Additionally,
replaying journaled metadata can partly overwrite the latest metadata on
the filesystem, because if the journal aborts, journaled metadata are
preserved and replayed during the next mount so as not to lose
uncheckpointed metadata. This would also break the consistency of the
filesystem.

This patch prevents original metadata buffers from being dirtied on abort
by clearing BH_JBDDirty flag from those buffers. Thus, no metadata
buffers are written to the filesystem without journaling.

Signed-off-by: Hidehiro Kawai
Acked-by: Jan Kara
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Hidehiro Kawai
2008-10-20 23:52:37 +0800
d1645e526 jbd: abort when failed to log metadata buffers

If we fail to write metadata buffers to the journal space but succeed in
writing the commit record, stale data can be written back to the
filesystem as metadata in the recovery phase.

To avoid this, when we fail to write out metadata buffers, abort the
journal before writing the commit record.

Signed-off-by: Hidehiro Kawai
Acked-by: Jan Kara
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Hidehiro Kawai
2008-10-20 23:52:36 +0800

12 Aug, 2008

1 commit

23a0ee908 Merge branch 'core/locking' into core/urgent

Ingo Molnar
2008-08-12 06:11:49 +0800

11 Aug, 2008

2 commits

3295f0ef9 lockdep: rename map_[acquire|release]() => lock_map_[acquire|release]()

the names were too generic:

drivers/uio/uio.c:87: error: expected identifier or '(' before 'do'
drivers/uio/uio.c:87: error: expected identifier or '(' before 'while'
drivers/uio/uio.c:113: error: 'map_release' undeclared here (not in a function)

Signed-off-by: Ingo Molnar

Ingo Molnar
2008-08-11 16:30:30 +0800
4f3e7524b lockdep: map_acquire

Most of the free-standing lock_acquire() usages look remarkably similar;
sweep them into a new helper.

Signed-off-by: Peter Zijlstra
Signed-off-by: Ingo Molnar

Peter Zijlstra
2008-08-11 15:30:23 +0800

05 Aug, 2008

2 commits

ca5de404f fs: rename buffer trylock

Like the page lock change, this also requires name change, so convert the
raw test_and_set bitop to a trylock.
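
The change in calling convention can be shown with a userspace sketch built
on C11 atomics. This is not the kernel's bitop API; the type and function
names are hypothetical and only illustrate how the sense of the return
value flips between a test_and_set style call and a trylock style call.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* A one-bit lock, standing in for a buffer's lock bit. */
struct bit_lock {
    atomic_flag locked;
};

/*
 * test_and_set style: returns the OLD value, so "true" means the lock was
 * already held and the caller did NOT get it.
 */
bool test_set_locked(struct bit_lock *l)
{
    return atomic_flag_test_and_set_explicit(&l->locked,
                                             memory_order_acquire);
}

/* trylock style: returns true when the caller DID get the lock. */
bool try_lock(struct bit_lock *l)
{
    return !test_set_locked(l);
}

/* Release the lock so that a later try_lock() can succeed. */
void bit_unlock(struct bit_lock *l)
{
    atomic_flag_clear_explicit(&l->locked, memory_order_release);
}
```

Call sites written as "if (test_set_locked(l)) bail out" become
"if (!try_lock(l)) bail out", which reads the same way as other trylock
calls.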

Signed-off-by: Nick Piggin
Signed-off-by: Linus Torvalds

Nick Piggin
2008-08-05 12:56:09 +0800
529ae9aaa mm: rename page trylock

Converting page lock to new locking bitops requires a change of page flag
operation naming, so we might as well convert it to something nicer
(!TestSetPageLocked_Lock => trylock_page, SetPageLocked => set_page_locked).

This also facilitates lockdeping of page lock.

Signed-off-by: Nick Piggin
Acked-by: KOSAKI Motohiro
Acked-by: Peter Zijlstra
Acked-by: Andrew Morton
Acked-by: Benjamin Herrenschmidt
Signed-off-by: Linus Torvalds

Nick Piggin
2008-08-05 12:31:34 +0800

26 Jul, 2008

7 commits

cbe5f466f jbd: don't abort if flushing file data failed

In ordered mode, the current jbd aborts the journal if a file data buffer
has an error. But this behavior is unintended, and we found that it has
been adopted accidentally.

This patch undoes it and just calls printk() instead of aborting the
journal. Additionally, set AS_EIO into the address_space object of the
failed buffer which is submitted by journal_do_submit_data() so that
fsync() can get -EIO.

Missing error checks are also added so that errors on file data buffers
are reported to the user. The following buffers are targeted:

(a) the buffer which has already been written out by pdflush
(b) the buffer which has been unlocked before scanned in the
t_locked_list loop

[akpm@linux-foundation.org: improve grammar in a printk]
Signed-off-by: Hidehiro Kawai
Acked-by: Jan Kara
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Hidehiro Kawai
2008-07-26 01:53:32 +0800
fc80c4427 jbd: positively dispose the unmapped data buffers in journal_commit_transaction()

After ext3-ordered files are truncated, there is a possibility that
pages which cannot be accounted for still remain. The remaining pages can
be released when the system is really short of memory, so this is not a
memory leak. But resource management software and the like may not work
correctly.

It is possible that journal_unmap_buffer() cannot release the buffers, and
the pages to which they belong, because they are attached to a committing
transaction. To release such buffers and pages later,
journal_unmap_buffer() leaves the work to journal_commit_transaction().
(journal_unmap_buffer() marks the buffers with 'BH_Freed' so that
journal_commit_transaction() can identify whether they can be released
or not.)

In the journalled mode and the writeback mode, jbd deals with only metadata
buffers. But in the ordered mode, jbd deals with metadata buffers and also
data buffers.

Actually, journal_commit_transaction() releases only the metadata buffers
of which release is demanded by journal_unmap_buffer(), and also releases
the pages to which they belong if possible.

As a result, the data buffers of which release is demanded by
journal_unmap_buffer() remain after a transaction commits. And also the
pages to which they belong remain.

Such remaining pages no longer have a mapping. Due to this fact,
there is a possibility that pages which cannot be accounted for remain.

The metadata buffers marked 'BH_Freed' and the pages to which
they belong can be released at 'JBD: commit phase 7'.

Therefore, by applying the same code into 'JBD: commit phase 2' (where the
data buffers are done with), journal_commit_transaction() can also release
the data buffers marked 'BH_Freed' and the pages to which they belong.

As a result, all the buffers marked 'BH_Freed' can be released, and also
all the pages to which these buffers belong can be released in
journal_commit_transaction(). So pages which cannot be accounted for no
longer remain.

<>
> spin_lock(&journal->j_list_lock);
> while (commit_transaction->t_forget) {
>         transaction_t *cp_transaction;
>         struct buffer_head *bh;
>
>         jh = commit_transaction->t_forget;
>         ...
>         if (buffer_freed(bh)) {
>         ^^^^^^^^^^^^^^^^^^^^^^^^
>                 clear_buffer_freed(bh);
>                 ^^^^^^^^^^^^^^^^^^^^^^^^
>                 clear_buffer_jbddirty(bh);
>         }
>
>         if (buffer_jbddirty(bh)) {
>                 JBUFFER_TRACE(jh, "add to new checkpointing trans");
>                 __journal_insert_checkpoint(jh, commit_transaction);
>                 JBUFFER_TRACE(jh, "refile for checkpoint writeback");
>                 __journal_refile_buffer(jh);
>                 jbd_unlock_bh_state(bh);
>         } else {
>                 J_ASSERT_BH(bh, !buffer_dirty(bh));
>                 ...
>                 JBUFFER_TRACE(jh, "refile or unfile freed buffer");
>                 __journal_refile_buffer(jh);
>                 if (!jh->b_transaction) {
>                         jbd_unlock_bh_state(bh);
>                         /* needs a brelse */
>                         journal_remove_journal_head(bh);
>                         release_buffer_page(bh);
>                         ^^^^^^^^^^^^^^^^^^^^^^^^
>                 } else
>         }
****************************************************************
* Apply the code of "^^^^^^" lines into 'JBD: commit phase 2' *
****************************************************************

In the journal_commit_transaction() code, there is one extra message in the
series of jbd debug messages ("JBD: commit phase 2"). This patch fixes
that, too.

Signed-off-by: Toshiyuki Okajima
Acked-by: Jan Kara
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Toshiyuki Okajima
2008-07-26 01:53:32 +0800
a10320e8f jbd: unexport journal_update_superblock

Remove the unused EXPORT_SYMBOL(journal_update_superblock).

Signed-off-by: Adrian Bunk
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Adrian Bunk
2008-07-26 01:53:32 +0800
3f31fddfa jbd: fix race between free buffer and commit transaction

journal_try_to_free_buffers() could race with the jbd commit transaction
code when the latter is holding the buffer reference while waiting for the
data buffer to flush to disk. If the caller of
journal_try_to_free_buffers() tries hard to release the buffers, it will
treat the failure as an error and return it to its caller. We have seen
direct IO fail due to this race. Some callers of releasepage() also expect
the buffer to be dropped when the GFP_KERNEL mask is passed to
releasepage()->journal_try_to_free_buffers().

With this patch, if the caller passes __GFP_WAIT and __GFP_FS to indicate
that this call may wait, then when try_to_free_buffers() fails we wait for
journal_commit_transaction() to finish committing the current committing
transaction and then try to free those buffers again.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Mingming Cao
Reviewed-by: Badari Pulavarty
Acked-by: Jan Kara
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Mingming Cao
2008-07-26 01:53:32 +0800
1984bb763 jbd: tidy up revoke cache initialisation and destruction

Make revocation cache destruction safe to call if initialisation fails
partially or entirely. This allows it to be used to clean up in the case
of initialisation failure, simplifying that code slightly.
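
The pattern can be sketched in userspace as follows: a destroy routine that
tolerates anything left unallocated, reused as the single error path of the
init routine. malloc()/free() stand in for the kernel's cache allocation
calls, and the structure, function names and the fail_second test hook are
all hypothetical.

```c
#include <stdlib.h>

/* Two caches that must both exist; names are illustrative. */
struct caches {
    void *record_cache;
    void *table_cache;
};

/* Safe to call in any state: skips whatever was never allocated. */
void caches_destroy(struct caches *c)
{
    if (c->record_cache) {
        free(c->record_cache);
        c->record_cache = NULL;
    }
    if (c->table_cache) {
        free(c->table_cache);
        c->table_cache = NULL;
    }
}

/*
 * Returns 0 on success, -1 on failure. fail_second simulates the second
 * allocation failing so the partial-initialisation path can be exercised.
 */
int caches_init(struct caches *c, int fail_second)
{
    c->record_cache = NULL;
    c->table_cache = NULL;

    c->record_cache = malloc(64);
    if (!c->record_cache)
        goto fail;
    c->table_cache = fail_second ? NULL : malloc(64);
    if (!c->table_cache)
        goto fail;
    return 0;
fail:
    caches_destroy(c);   /* one cleanup path, however far init got */
    return -1;
}
```

Because the destroy routine checks each pointer before releasing it, the
init routine needs no per-step unwinding, and calling destroy twice is
harmless.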

Signed-off-by: Duane Griffin
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Duane Griffin
2008-07-26 01:53:32 +0800
f4d79ca2f jbd: eliminate duplicated code in revocation table init/destroy functions

The revocation table initialisation/destruction code is repeated for each
of the two revocation tables stored in the journal. Refactoring the
duplicated code into functions is tidier, simplifies the logic in
initialisation in particular, and slightly reduces the code size.

There should not be any functional change.

Signed-off-by: Duane Griffin
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Duane Griffin
2008-07-26 01:53:32 +0800
3850f7a52 jbd: replace potentially false assertion with if block

If an error occurs during jbd cache initialisation it is possible for the
journal_head_cache to be NULL when journal_destroy_journal_head_cache is
called. Replace the J_ASSERT with an if block to handle the situation
correctly.

Note that even with this fix things will break badly if jbd is statically
compiled in and cache initialisation fails.

Signed-off-by: Duane Griffin
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Duane Griffin
2008-07-26 01:53:32 +0800