23 Sep, 2009

1 commit

  • According to Documentation/CodingStyle the EXPORT* macro should follow
    immediately after the closing function brace line.

    Also, mark_buffer_async_write_endio() and do_thaw_all() are not used
    elsewhere so they should be marked as static.

    In addition, file_fsync() is actually in fs/sync.c so move the EXPORT* to
    that file.
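
    As an illustration only, a minimal sketch of the two conventions being
    applied; the function names below are generic placeholders, not the
    actual fs/buffer.c symbols:

    /* A helper used only within this file is marked static. */
    static int my_helper(int x)
    {
            return x * 2;
    }

    /* Per Documentation/CodingStyle, the EXPORT* macro follows immediately
     * after the closing brace of the function it exports. */
    int my_exported_fn(int x)
    {
            return my_helper(x) + 1;
    }
    EXPORT_SYMBOL(my_exported_fn);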

    Signed-off-by: H Hartley Sweeten
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    H Hartley Sweeten
     

11 Sep, 2009

1 commit

  • This gets rid of pdflush for bdi writeout and kupdated style cleaning.
    pdflush writeout suffers from lack of locality and also requires more
    threads to handle the same workload, since it has to work in a
    non-blocking fashion against each queue. This also introduces lumpy
    behaviour and potential request starvation, since pdflush can be starved
    for queue access if others are accessing it. A sample ffsb workload that
    does random writes to files is about 8% faster here on a simple SATA drive
    during the benchmark phase. File layout also seems a LOT smoother in
    vmstat:

    r b swpd free buff cache si so bi bo in cs us sy id wa
    0 1 0 608848 2652 375372 0 0 0 71024 604 24 1 10 48 42
    0 1 0 549644 2712 433736 0 0 0 60692 505 27 1 8 48 44
    1 0 0 476928 2784 505192 0 0 4 29540 553 24 0 9 53 37
    0 1 0 457972 2808 524008 0 0 0 54876 331 16 0 4 38 58
    0 1 0 366128 2928 614284 0 0 4 92168 710 58 0 13 53 34
    0 1 0 295092 3000 684140 0 0 0 62924 572 23 0 9 53 37
    0 1 0 236592 3064 741704 0 0 4 58256 523 17 0 8 48 44
    0 1 0 165608 3132 811464 0 0 0 57460 560 21 0 8 54 38
    0 1 0 102952 3200 873164 0 0 4 74748 540 29 1 10 48 41
    0 1 0 48604 3252 926472 0 0 0 53248 469 29 0 7 47 45

    where vanilla tends to fluctuate a lot in the creation phase:

    r b swpd free buff cache si so bi bo in cs us sy id wa
    1 1 0 678716 5792 303380 0 0 0 74064 565 50 1 11 52 36
    1 0 0 662488 5864 319396 0 0 4 352 302 329 0 2 47 51
    0 1 0 599312 5924 381468 0 0 0 78164 516 55 0 9 51 40
    0 1 0 519952 6008 459516 0 0 4 78156 622 56 1 11 52 37
    1 1 0 436640 6092 541632 0 0 0 82244 622 54 0 11 48 41
    0 1 0 436640 6092 541660 0 0 0 8 152 39 0 0 51 49
    0 1 0 332224 6200 644252 0 0 4 102800 728 46 1 13 49 36
    1 0 0 274492 6260 701056 0 0 4 12328 459 49 0 7 50 43
    0 1 0 211220 6324 763356 0 0 0 106940 515 37 1 10 51 39
    1 0 0 160412 6376 813468 0 0 0 8224 415 43 0 6 49 45
    1 1 0 85980 6452 886556 0 0 4 113516 575 39 1 11 54 34
    0 2 0 85968 6452 886620 0 0 0 1640 158 211 0 0 46 54

    A 10 disk test with btrfs performs 26% faster with per-bdi flushing. A
    SSD based writeback test on XFS performs over 20% better as well, with
    the throughput being very stable around 1GB/sec, where pdflush only
    manages 750MB/sec and fluctuates wildly while doing so. Random buffered
    writes to many files behave a lot better as well, as do random mmap'ed
    writes.

    A separate thread is added to sync the super blocks. In the long term,
    adding sync_supers_bdi() functionality could get rid of this thread again.

    Signed-off-by: Jens Axboe

    Jens Axboe
     

22 Aug, 2009

1 commit

  • In commit a8e7d49aa7be728c4ae241a75a2a124cdcabc0c5 ("Fix race in
    create_empty_buffers() vs __set_page_dirty_buffers()"), I removed a test
    for a NULL page mapping unintentionally when some of the code inside
    __set_page_dirty() was moved to the callers.

    That removal generally didn't matter, since a filesystem would serialize
    truncation (which clears the page mapping) against writing (which marks
    the buffer dirty), so locking at a higher level (either per-page or an
    inode at a time) should mean that the buffer page would be stable. And
    indeed, nothing bad seemed to happen.

    Except it turns out that apparently reiserfs does something odd when
    under load and writing out the journal, and we have a number of bugzilla
    entries that look similar:

    http://bugzilla.kernel.org/show_bug.cgi?id=13556
    http://bugzilla.kernel.org/show_bug.cgi?id=13756
    http://bugzilla.kernel.org/show_bug.cgi?id=13876

    and it looks like reiserfs depended on that check (the common theme
    seems to be "data=journal", and a journal writeback during a truncate).

    I suspect reiserfs should have some additional locking, but in the
    meantime this should get us back to the pre-2.6.29 behavior.
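
    Re-instating the test amounts to checking page->mapping under the
    mapping's lock before doing the dirty accounting. A rough sketch of the
    shape of the restored check (not the literal patch; details of the
    helper in fs/buffer.c may differ):

    static void __set_page_dirty(struct page *page,
                                 struct address_space *mapping, int warn)
    {
            spin_lock_irq(&mapping->tree_lock);
            if (page->mapping) {    /* still attached, i.e. no truncate race */
                    WARN_ON_ONCE(warn && !PageUptodate(page));
                    account_page_dirtied(page, mapping);
                    radix_tree_tag_set(&mapping->page_tree,
                                       page_index(page), PAGECACHE_TAG_DIRTY);
            }
            spin_unlock_irq(&mapping->tree_lock);
            __mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
    }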

    Pattern-pointed-out-by: Roland Kletzing
    Cc: stable@kernel.org (2.6.29 and 2.6.30)
    Cc: Jeff Mahoney
    Cc: Nick Piggin
    Cc: Al Viro
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

12 Jun, 2009

2 commits

  • * 'for-2.6.31' of git://git.kernel.dk/linux-2.6-block: (153 commits)
    block: add request clone interface (v2)
    floppy: fix hibernation
    ramdisk: remove long-deprecated "ramdisk=" boot-time parameter
    fs/bio.c: add missing __user annotation
    block: prevent possible io_context->refcount overflow
    Add serial number support for virtio_blk, V4a
    block: Add missing bounce_pfn stacking and fix comments
    Revert "block: Fix bounce limit setting in DM"
    cciss: decode unit attention in SCSI error handling code
    cciss: Remove no longer needed sendcmd reject processing code
    cciss: change SCSI error handling routines to work with interrupts enabled.
    cciss: separate error processing and command retrying code in sendcmd_withirq_core()
    cciss: factor out fix target status processing code from sendcmd functions
    cciss: simplify interface of sendcmd() and sendcmd_withirq()
    cciss: factor out core of sendcmd_withirq() for use by SCSI error handling code
    cciss: Use schedule_timeout_uninterruptible in SCSI error handling code
    block: needs to set the residual length of a bidi request
    Revert "block: implement blkdev_readpages"
    block: Fix bounce limit setting in DM
    Removed reference to non-existing file Documentation/PCI/PCI-DMA-mapping.txt
    ...

    Manually fix conflicts with tracing updates in:
    block/blk-sysfs.c
    drivers/ide/ide-atapi.c
    drivers/ide/ide-cd.c
    drivers/ide/ide-floppy.c
    drivers/ide/ide-tape.c
    include/trace/events/block.h
    kernel/trace/blktrace.c

    Linus Torvalds
     
  • * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (49 commits)
    ext4: Avoid corrupting the uninitialized bit in the extent during truncate
    ext4: Don't treat a truncation of a zero-length file as replace-via-truncate
    ext4: fix dx_map_entry to support 256k directory blocks
    ext4: truncate the file properly if we fail to copy data from userspace
    ext4: Avoid leaking blocks after a block allocation failure
    ext4: Change all super.c messages to print the device
    ext4: Get rid of EXTEND_DISKSIZE flag of ext4_get_blocks_handle()
    ext4: super.c whitespace cleanup
    jbd2: Fix minor typos in comments in fs/jbd2/journal.c
    ext4: Clean up calls to ext4_get_group_desc()
    ext4: remove unused function __ext4_write_dirty_metadata
    ext2: Fix memory leak in ext2_fill_super() in case of a failed mount
    ext3: Fix memory leak in ext3_fill_super() in case of a failed mount
    ext4: Fix memory leak in ext4_fill_super() in case of a failed mount
    ext4: down i_data_sem only for read when walking tree for fiemap
    ext4: Add a comprehensive block validity check to ext4_get_blocks()
    ext4: Clean up ext4_get_blocks() so it does not depend on bh_result->b_state
    ext4: Merge ext4_da_get_block_write() into mpage_da_map_blocks()
    ext4: Add BUG_ON debugging checks to noalloc_get_block_write()
    ext4: Add documentation to the ext4_*get_block* functions
    ...

    Linus Torvalds
     

06 Jun, 2009

1 commit

  • The nobh_truncate_page() function is used by ext2, exofs, and jfs. Of
    these three, only the ext2 and jfs get_block() functions pay attention
    to bh->b_size --- which is normally always the filesystem blocksize
    except when the get_block() function is called by either
    mpage_readpage(), mpage_readpages(), or the direct I/O routines in
    fs/direct_io.c.

    Unfortunately, nobh_truncate_page() does not initialize map_bh before
    calling the filesystem-supplied get_block() function. So ext2 and jfs
    will try to calculate the number of blocks to map by taking stack
    garbage and shifting it left by inode->i_blkbits. This should be
    *mostly* harmless (except the filesystem will do some unneeded work)
    unless the stack garbage is less than the filesystem's blocksize, in which
    case maxblocks will be zero, and the attempt to find out whether or
    not the filesystem has a hole at a given logical block will fail, and
    the page cache entry might not get zero'ed out.

    Also, if the stack garbage in map_bh->state happens to have the
    BH_Mapped bit set, there could be an attempt to call readpage() on a
    non-existent page, which could cause nobh_truncate_page() to return an
    error when it should not.

    Fix this by initializing map_bh->state and map_bh->size.

    Fortunately, it's probably fairly unlikely that ext2 and jfs users
    mount with nobh these days.
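
    The fix is simply to give the on-stack buffer_head sane initial values
    before handing it to get_block(). A sketch of the idea (variable names
    are illustrative; field names follow struct buffer_head):

    struct buffer_head map_bh;

    map_bh.b_size = blocksize;   /* ask get_block() to map one fs block */
    map_bh.b_state = 0;          /* in particular, BH_Mapped must start clear */
    err = get_block(inode, iblock, &map_bh, 0);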

    Signed-off-by: "Theodore Ts'o"
    Cc: Dave Kleikamp
    Cc: linux-fsdevel@vger.kernel.org
    Signed-off-by: Al Viro

    Theodore Ts'o
     

23 May, 2009

1 commit

  • Until now we have had a 1:1 mapping between storage device physical
    block size and the logical block size used when addressing the device.
    With SATA 4KB drives coming out that will no longer be the case. The
    sector size will be 4KB but the logical block size will remain
    512-bytes. Hence we need to distinguish between the physical block size
    and the logical ditto.

    This patch renames hardsect_size to logical_block_size.

    Signed-off-by: Martin K. Petersen
    Signed-off-by: Jens Axboe

    Martin K. Petersen
     

13 May, 2009

1 commit

  • The BH_Delay and BH_Unwritten flags should never leak out to
    submit_bh(). So add some BUG_ON() checks to submit_bh so we can get a
    stack trace and determine how and why this might have happened.

    (Note that only XFS and ext4 use these buffer head flags, and XFS does
    not use submit_bh(). So this patch should only modify behavior for
    ext4.)
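
    The added checks are of this form (a sketch; in submit_bh() they sit
    alongside the existing sanity checks):

    /* In submit_bh(), next to the existing sanity checks: */
    BUG_ON(!buffer_locked(bh));
    BUG_ON(!buffer_mapped(bh));
    BUG_ON(!bh->b_end_io);
    /* New: delayed-allocation / unwritten state must never leak this far. */
    BUG_ON(buffer_delay(bh));
    BUG_ON(buffer_unwritten(bh));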

    Signed-off-by: Aneesh Kumar K.V
    Signed-off-by: "Theodore Ts'o"
    Cc: linux-fsdevel@vger.kernel.org

    Aneesh Kumar K.V
     

03 May, 2009

1 commit

  • Change page_mkwrite to allow implementations to return with the page
    locked, and also change its callers (in page fault paths) to hold the
    lock until the page is marked dirty. This allows the filesystem to have
    full control of page dirtying events coming from the VM.

    Rather than simply hold the page locked over the page_mkwrite call, we
    call page_mkwrite with the page unlocked and allow callers to return with
    it locked, so filesystems can avoid lock-order (LOR) problems with the
    page lock.

    The problem with the current scheme is this: a filesystem that wants to
    associate some metadata with a page as long as the page is dirty, will
    perform this manipulation in its ->page_mkwrite. It currently then must
    return with the page unlocked and may not hold any other locks (according
    to existing page_mkwrite convention).

    In this window, the VM could write out the page, clearing page-dirty. The
    filesystem has no good way to detect that a dirty pte is about to be
    attached, so it will happily write out the page, at which point, the
    filesystem may manipulate the metadata to reflect that the page is no
    longer dirty.

    It is not always possible to perform the required metadata manipulation in
    ->set_page_dirty, because that function cannot block or fail. The
    filesystem may need to allocate some data structure, for example.

    And the VM cannot mark the pte dirty before page_mkwrite, because
    page_mkwrite is allowed to fail, so we must not allow any window where the
    page could be written to if page_mkwrite does fail.

    This solution of holding the page locked over the 3 critical operations
    (page_mkwrite, setting the pte dirty, and finally setting the page dirty)
    closes out races nicely, preventing page cleaning for writeout being
    initiated in that window. This provides the filesystem with a strong
    synchronisation against the VM here.

    - Sage needs this race closed for ceph filesystem.
    - Trond for NFS (http://bugzilla.kernel.org/show_bug.cgi?id=12913).
    - I need it for fsblock.
    - I suspect other filesystems may need it too (eg. btrfs).
    - I have converted buffer.c to the new locking. Even simple block allocation
    under dirty pages might be susceptible to i_size changing under partial page
    at the end of file (we also have a buffer.c-side problem here, but it cannot
    be fixed properly without this patch).
    - Other filesystems (eg. NFS, maybe btrfs) will need to change their
    page_mkwrite functions themselves.

    [ This also moves page_mkwrite another step closer to fault, which should
    eventually allow page_mkwrite to be moved into ->fault, and thus avoiding a
    filesystem calldown and page lock/unlock cycle in __do_fault. ]
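
    With the new convention a filesystem's page_mkwrite() may return with
    the page locked and tell the VM so via VM_FAULT_LOCKED; the VM keeps
    that lock across the pte-dirty and page-dirty steps. A hedged sketch
    (the myfs_* names are placeholders, not code from this patch):

    static int myfs_page_mkwrite(struct vm_area_struct *vma,
                                 struct vm_fault *vmf)
    {
            struct page *page = vmf->page;

            lock_page(page);
            /* ... set up whatever metadata must live while the page is dirty;
             * this may block or allocate, which ->set_page_dirty cannot ... */

            /* Return with the page still locked; the VM unlocks it only
             * after the pte and the page have both been marked dirty. */
            return VM_FAULT_LOCKED;
    }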

    [akpm@linux-foundation.org: fix derefs of NULL ->mapping]
    Cc: Sage Weil
    Cc: Trond Myklebust
    Signed-off-by: Nick Piggin
    Cc: Valdis Kletnieks
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     

16 Apr, 2009

1 commit

  • block_write_full_page doesn't allow the caller to control what happens
    when the IO is over. This adds a new call named block_write_full_page_endio
    so the buffer head end_io handler can be provided by the caller.

    This will be used by the ext3 data=guarded mode to do i_size updates in
    a workqueue based end_io handler. end_buffer_async_write is also
    exported so it can be called to do the dirty work of managing page
    writeback for the higher level end_io handler.
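
    Usage is expected to look roughly like the sketch below; the myfs_*
    names are placeholders and the exact signature should be checked
    against fs/buffer.c, but the shape is block_write_full_page() plus a
    caller-supplied buffer_head end_io handler:

    /* Default behaviour, unchanged: */
    static int myfs_writepage(struct page *page, struct writeback_control *wbc)
    {
            return block_write_full_page(page, myfs_get_block, wbc);
    }

    /* With a private end_io handler, e.g. to defer i_size updates to a
     * workqueue; myfs_end_write would typically wrap end_buffer_async_write. */
    static int myfs_writepage_guarded(struct page *page,
                                      struct writeback_control *wbc)
    {
            return block_write_full_page_endio(page, myfs_get_block, wbc,
                                               myfs_end_write);
    }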

    Signed-off-by: Chris Mason
    Acked-by: Theodore Tso
    Acked-by: Jan Kara
    Signed-off-by: Linus Torvalds

    Chris Mason
     

15 Apr, 2009

1 commit


09 Apr, 2009

1 commit

  • Now that we have a distinction between WRITE_SYNC and WRITE_SYNC_PLUG,
    use WRITE_SYNC_PLUG in __block_write_full_page() to avoid unplugging
    the block device I/O queue between each page that gets flushed out.

    Otherwise, when we run sync() or fsync() and we need to write out a
    large number of pages, the block device queue will get unplugged
    for every page that is flushed out, which will be a pretty
    serious performance regression caused by commit a64c8610.
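
    The idea, roughly (a sketch of the shape of the change, not the literal
    diff): pick the write op once per page and use the plugged variant for
    synchronous writeback.

    /* In __block_write_full_page(): */
    int write_op = (wbc->sync_mode == WB_SYNC_ALL) ? WRITE_SYNC_PLUG : WRITE;

    do {
            struct buffer_head *next = bh->b_this_page;
            if (buffer_async_write(bh)) {
                    submit_bh(write_op, bh);
                    nr_underway++;
            }
            bh = next;
    } while (bh != head);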

    Signed-off-by: "Theodore Ts'o"

    Theodore Ts'o
     

06 Apr, 2009

2 commits


04 Apr, 2009

1 commit


03 Apr, 2009

2 commits

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6:
    Remove two unneeded exports and make two symbols static in fs/mpage.c
    Cleanup after commit 585d3bc06f4ca57f975a5a1f698f65a45ea66225
    Trim includes of fdtable.h
    Don't crap into descriptor table in binfmt_som
    Trim includes in binfmt_elf
    Don't mess with descriptor table in load_elf_binary()
    Get rid of indirect include of fs_struct.h
    New helper - current_umask()
    check_unsafe_exec() doesn't care about signal handlers sharing
    New locking/refcounting for fs_struct
    Take fs_struct handling to new file (fs/fs_struct.c)
    Get rid of bumping fs_struct refcount in pivot_root(2)
    Kill unsharing fs_struct in __set_personality()

    Linus Torvalds
     
  • Check bh->b_blocknr only if BH_Mapped is set.

    akpm: I doubt if b_blocknr is ever uninitialised here, but it could
    conceivably cause a problem if we're doing a lookup for block zero.

    Signed-off-by: Nikanth Karthikesan
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nikanth Karthikesan
     

01 Apr, 2009

6 commits

  • Now that the filesystem freeze operation has been elevated to the VFS, and
    is just an ioctl away, some sort of safety net for unintentionally frozen
    root filesystems may be in order.

    The timeout thaw originally proposed did not get merged, but perhaps
    something like this would be useful in emergencies.

    For example, freeze /path/to/mountpoint may freeze your root filesystem if
    you forgot that you had that unmounted.

    I chose 'j' as the last remaining character other than 'h' which is sort
    of reserved for help (because help is generated on any unknown character).

    I've tested this on a non-root fs with multiple (nested) freezers, as well
    as on a system rendered unresponsive due to a frozen root fs.

    [randy.dunlap@oracle.com: emergency thaw only if CONFIG_BLOCK enabled]
    Signed-off-by: Eric Sandeen
    Cc: Takashi Sato
    Signed-off-by: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric Sandeen
     
  • try_to_free_pages() is used for the direct reclaim of up to
    SWAP_CLUSTER_MAX pages when watermarks are low. The caller to
    alloc_pages_nodemask() can specify a nodemask of nodes that are allowed to
    be used but this is not passed to try_to_free_pages(). This can lead to
    unnecessary reclaim of pages that are unusable by the caller and, in the
    worst case, lead to allocation failure, as progress is not being made where
    it is needed.

    This patch passes the nodemask used for alloc_pages_nodemask() to
    try_to_free_pages().
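
    In terms of the interface, the change is roughly the following (the
    exact prototype is reproduced here from memory and is an assumption):

    /* Before: direct reclaim cannot see which nodes the allocation may use. */
    unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
                                    gfp_t gfp_mask);

    /* After: __alloc_pages_nodemask() threads its nodemask through, so
     * reclaim frees pages where the caller can actually use them. */
    unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
                                    gfp_t gfp_mask, nodemask_t *nodemask);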

    Reviewed-by: KOSAKI Motohiro
    Acked-by: Mel Gorman
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • page_mkwrite is called with neither the page lock nor the ptl held. This
    means a page can be concurrently truncated or invalidated out from
    underneath it. Callers are supposed to prevent truncate races themselves;
    however, previously the only thing they could do when they hit one was to
    raise SIGBUS. SIGBUS is wrong for the case that the page has been
    invalidated or truncated within i_size (eg. hole punched). Callers may
    also have to perform memory allocations in this path, where again, SIGBUS
    would be wrong.

    The previous patch ("mm: page_mkwrite change prototype to match fault")
    made it possible to properly specify errors. Convert the generic buffer.c
    code and btrfs to return sane error values (in the case of page removed
    from pagecache, VM_FAULT_NOPAGE will cause the fault handler to exit
    without doing anything, and the fault will be retried properly).

    This fixes core code, and converts btrfs as a template/example. All other
    filesystems defining their own page_mkwrite should be fixed in a similar
    manner.
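
    For a filesystem doing its own page_mkwrite(), the conversion boils
    down to returning fault codes instead of raising SIGBUS for everything.
    A hedged sketch of the error mapping (myfs_* is a placeholder):

    lock_page(page);
    if (page->mapping != inode->i_mapping) {
            /* Truncated or invalidated: nothing to do, fault is retried. */
            unlock_page(page);
            return VM_FAULT_NOPAGE;
    }
    ret = myfs_reserve_for_dirtying(page);  /* placeholder; may allocate */
    unlock_page(page);
    if (ret == -ENOMEM)
            return VM_FAULT_OOM;
    if (ret)
            return VM_FAULT_SIGBUS;         /* e.g. a real I/O or fs error */
    return 0;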

    Acked-by: Chris Mason
    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Change the page_mkwrite prototype to take a struct vm_fault, and return
    VM_FAULT_xxx flags. There should be no functional change.

    This makes it possible to return much more detailed error information to
    the VM (and also can provide more information eg. virtual_address to the
    driver, which might be important in some special cases).

    This is required for a subsequent fix. And will also make it easier to
    merge page_mkwrite() with fault() in future.
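
    My rendering of the before/after prototype, based on the description
    above:

    /* Before */
    int (*page_mkwrite)(struct vm_area_struct *vma, struct page *page);

    /* After: the page is reached via vmf->page, extra context such as
     * vmf->virtual_address is available, and the return value is a
     * VM_FAULT_xxx code. */
    int (*page_mkwrite)(struct vm_area_struct *vma, struct vm_fault *vmf);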

    Signed-off-by: Nick Piggin
    Cc: Chris Mason
    Cc: Trond Myklebust
    Cc: Miklos Szeredi
    Cc: Steven Whitehouse
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Artem Bityutskiy
    Cc: Felix Blyakher
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Add a helper function account_page_dirtied(). Use that from two
    callsites. reiser4 adds a function which adds a third callsite.

    Signed-off-by: Edward Shishkin
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Edward Shishkin
     
  • fsync_bdev() export and a bunch of stubs for !CONFIG_BLOCK case had
    been left behind

    Signed-off-by: Al Viro

    Al Viro
     

28 Mar, 2009

2 commits


20 Mar, 2009

1 commit

  • Nick Piggin noticed this (very unlikely) race between setting a page
    dirty and creating the buffers for it - we need to hold the mapping
    private_lock until we've set the page dirty bit in order to make sure
    that create_empty_buffers() cannot build up a set of buffers without
    the dirty bits set while the page is dirty.

    I doubt anybody has ever hit this race (and it didn't solve the issue
    Nick was looking at), but as Nick says: "Still, it does appear to solve
    a real race, which we should close."

    Acked-by: Nick Piggin
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

19 Feb, 2009

2 commits

  • * 'for-linus' of git://git.kernel.dk/linux-2.6-block:
    block: fix deadlock in blk_abort_queue() for drivers that readd to timeout list
    block: fix booting from partitioned md array
    block: revert part of 18ce3751ccd488c78d3827e9f6bf54e6322676fb
    cciss: PCI power management reset for kexec
    paride/pg.c: xs(): &&/|| confusion
    fs/bio: bio_alloc_bioset: pass right object ptr to mempool_free
    block: fix bad definition of BIO_RW_SYNC
    bsg: Fix sense buffer bug in SG_IO

    Linus Torvalds
     
  • YAMAMOTO-san noticed that task_dirty_inc doesn't seem to be called properly for
    cases where set_page_dirty is not used to dirty a page (eg. mark_buffer_dirty).

    Additionally, there is some inconsistency about when task_dirty_inc is
    called. It is used for dirty balancing, however it even gets called for
    __set_page_dirty_no_writeback.

    So rather than increment it in a set_page_dirty wrapper, move it down to
    exactly where the dirty page accounting stats are incremented.

    Cc: YAMAMOTO Takashi
    Signed-off-by: Nick Piggin
    Acked-by: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     

18 Feb, 2009

1 commit

  • The above commit added WRITE_SYNC and switched various places to using
    that for committing writes that will be waited upon immediately after
    submission. However, this causes a performance regression with AS and CFQ
    for ext3 at least, since sync_dirty_buffer() will submit some writes with
    WRITE_SYNC while ext3 has submitted other dependent writes without the sync
    flag set. This causes excessive anticipation/idling in the IO scheduler
    because sync and async writes get interleaved, causing a big performance
    regression for the below test case (which is meant to simulate sqlite
    like behaviour).

    ---- test case ----

    int main(int argc, char **argv)
    {
            int fdes, i;
            FILE *fp;
            struct timeval start;
            struct timeval end;
            struct timeval res;

            gettimeofday(&start, NULL);
            for (i=0; i

    Jens Axboe
     

07 Feb, 2009

1 commit

  • This is a modification of a patch by Bill Pemberton.

    nobh_write_end() could call attach_nobh_buffers() with head == NULL.
    This would result in a trap when attach_nobh_buffers() attempted to
    access bh->b_this_page.

    This can be illustrated by running the writev01 testcase from LTP on jfs.

    This error was introduced by commit 5b41e74a "vfs: fix data leak in
    nobh_write_end()". That patch did not take into account that if
    PageMappedToDisk() is true upon entry to nobh_write_begin(), then no
    buffers will be allocated for the page. In that case, we won't have to
    worry about a failed write leaving uninitialized data in the page.

    Of course, head != NULL implies !page_has_buffers(page), so no need to
    test both.
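
    The resulting guard is essentially of this shape (a sketch consistent
    with the description above, not the literal diff):

    /* In nobh_write_end(): head is NULL when nobh_write_begin() saw
     * PageMappedToDisk and therefore allocated no buffers, so only try to
     * reattach buffers when there is actually a list to attach. */
    if (unlikely(copied < len) && head)
            attach_nobh_buffers(page, head);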

    Signed-off-by: Dave Kleikamp
    Cc: Bill Pemberton
    Cc: Dmitri Monakhov
    Signed-off-by: Linus Torvalds

    Dave Kleikamp
     

14 Jan, 2009

1 commit


10 Jan, 2009

2 commits

  • The ioctls for the generic freeze feature are below.

    o Freeze the filesystem
      int ioctl(int fd, int FIFREEZE, arg)
        fd: The file descriptor of the mountpoint
        FIFREEZE: request code for the freeze
        arg: Ignored
        Return value: 0 if the operation succeeds. Otherwise, -1

    o Unfreeze the filesystem
      int ioctl(int fd, int FITHAW, arg)
        fd: The file descriptor of the mountpoint
        FITHAW: request code for unfreeze
        arg: Ignored
        Return value: 0 if the operation succeeds. Otherwise, -1
        Error number: If the filesystem has already been unfrozen,
        errno is set to EINVAL.
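
    A minimal user-space sketch of the sequence (the mountpoint path is an
    example; error handling is abbreviated and CAP_SYS_ADMIN is required):

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <linux/fs.h>                       /* FIFREEZE, FITHAW */

    int main(void)
    {
            int fd = open("/mnt/data", O_RDONLY);   /* fd of the mountpoint */

            if (fd < 0)
                    return 1;
            if (ioctl(fd, FIFREEZE, 0) < 0)         /* arg is ignored */
                    perror("FIFREEZE");

            /* ... take the snapshot or split the replica here ... */

            if (ioctl(fd, FITHAW, 0) < 0)           /* EINVAL if already thawed */
                    perror("FITHAW");
            close(fd);
            return 0;
    }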

    [akpm@linux-foundation.org: fix CONFIG_BLOCK=n]
    Signed-off-by: Takashi Sato
    Signed-off-by: Masayuki Hamaguchi
    Cc:
    Cc:
    Cc: Christoph Hellwig
    Cc: Dave Kleikamp
    Cc: Dave Chinner
    Cc: Alasdair G Kergon
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Takashi Sato
     
  • Currently, ext3 in mainline Linux doesn't have the freeze feature which
    suspends write requests. So, we cannot take a backup which keeps the
    filesystem's consistency with the storage device's features (snapshot and
    replication) while it is mounted.

    In many cases, a commercial filesystem (e.g. VxFS) has the freeze feature
    and it would be used to get the consistent backup.

    If Linux's standard filesystem ext3 has the freeze feature, we can do it
    without a commercial filesystem.

    So I have implemented the ioctls of the freeze feature.
    I think we can take the consistent backup with the following steps.
    1. Freeze the filesystem with the freeze ioctl.
    2. Separate the replication volume or create the snapshot
    with the storage device's feature.
    3. Unfreeze the filesystem with the unfreeze ioctl.
    4. Take the backup from the separated replication volume
    or the snapshot.

    This patch:

    VFS:
    Changed the type of write_super_lockfs and unlockfs from "void"
    to "int" so that they can return an error.
    Renamed write_super_lockfs and unlockfs of the super block operations
    to freeze_fs and unfreeze_fs to avoid confusion.

    ext3, ext4, xfs, gfs2, jfs:
    Changed the type of write_super_lockfs and unlockfs from "void"
    to "int" so that write_super_lockfs returns an error if needed,
    and unlockfs always returns 0.

    reiserfs:
    Changed the type of write_super_lockfs and unlockfs from "void"
    to "int" so that they always return 0 (success) to keep a current behavior.
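
    On the filesystem side the renamed operations end up wired into
    super_operations roughly as below (myfs_* names are placeholders; the
    point is simply that both callbacks now return int):

    static int myfs_freeze(struct super_block *sb)
    {
            /* quiesce writes, flush the journal; return 0 or -errno */
            return 0;
    }

    static int myfs_unfreeze(struct super_block *sb)
    {
            /* resume normal operation */
            return 0;
    }

    static const struct super_operations myfs_sops = {
            /* formerly .write_super_lockfs / .unlockfs */
            .freeze_fs   = myfs_freeze,
            .unfreeze_fs = myfs_unfreeze,
    };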

    Signed-off-by: Takashi Sato
    Signed-off-by: Masayuki Hamaguchi
    Cc:
    Cc:
    Cc: Christoph Hellwig
    Cc: Dave Kleikamp
    Cc: Dave Chinner
    Cc: Alasdair G Kergon
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Takashi Sato
     

07 Jan, 2009

1 commit


05 Jan, 2009

1 commit

  • With the write_begin/write_end aops, page_symlink was broken because it
    could no longer pass a GFP_NOFS type mask into the point where the
    allocations happened. They are done in write_begin, which would always
    assume that the filesystem can be entered from reclaim. This bug could
    cause filesystem deadlocks.

    The funny thing with having a gfp_t mask there is that it doesn't really
    allow the caller to arbitrarily tinker with the context in which it can be
    called. It couldn't ever be GFP_ATOMIC, for example, because it needs to
    take the page lock. The only thing any callers care about is __GFP_FS
    anyway, so turn that into a single flag.

    Add a new flag for write_begin, AOP_FLAG_NOFS. Filesystems can now act on
    this flag in their write_begin function. Change __grab_cache_page to
    accept a nofs argument as well, to honour that flag (while we're there,
    change the name to grab_cache_page_write_begin which is more instructive
    and does away with random leading underscores).

    This is really a more flexible way to go in the end anyway -- if a
    filesystem happens to want any extra allocations aside from the pagecache
    ones in its write_begin function, it may now use GFP_KERNEL (rather than
    GFP_NOFS) for common case allocations (eg. ocfs2_alloc_write_ctxt, for a
    random example).
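
    A sketch of how a write_begin implementation would honour the new flag
    (the myfs_* names and the extra allocation are illustrative only):

    static int myfs_write_begin(struct file *file, struct address_space *mapping,
                                loff_t pos, unsigned len, unsigned flags,
                                struct page **pagep, void **fsdata)
    {
            pgoff_t index = pos >> PAGE_CACHE_SHIFT;
            gfp_t gfp = (flags & AOP_FLAG_NOFS) ? GFP_NOFS : GFP_KERNEL;
            struct page *page;

            /* grab_cache_page_write_begin() itself honours AOP_FLAG_NOFS when
             * allocating the pagecache page. */
            page = grab_cache_page_write_begin(mapping, index, flags);
            if (!page)
                    return -ENOMEM;

            /* Any private allocations must respect the caller's context too. */
            *fsdata = kmalloc(64, gfp);             /* illustrative only */
            if (!*fsdata) {
                    unlock_page(page);
                    page_cache_release(page);
                    return -ENOMEM;
            }
            *pagep = page;
            return 0;
    }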

    [kosaki.motohiro@jp.fujitsu.com: fix ubifs]
    [kosaki.motohiro@jp.fujitsu.com: fix fuse]
    Signed-off-by: Nick Piggin
    Reviewed-by: KOSAKI Motohiro
    Cc: [2.6.28.x]
    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    [ Cleaned up the calling convention: just pass in the AOP flags
    untouched to the grab_cache_page_write_begin() function. That
    just simplifies everybody, and may even allow future expansion of the
    logic. - Linus ]
    Signed-off-by: Linus Torvalds

    Nick Piggin
     

29 Dec, 2008

1 commit

  • Allow the scsi request REQ_QUIET flag to be propagated to the buffer
    file system layer. The basic idea is to pass the flag from the scsi
    request to the bio (block IO) and then to the buffer layer. The buffer
    layer can then suppress needless printks.

    This patch declutters the kernel log by removing the 40-50 (per LUN)
    buffer I/O error messages seen during a boot in my multipath setup. There
    is a good chance any real errors will be missed in the "noise" in the
    logs without this patch.

    During boot I see blocks of messages like
    "
    __ratelimit: 211 callbacks suppressed
    Buffer I/O error on device sdm, logical block 5242879
    Buffer I/O error on device sdm, logical block 5242879
    Buffer I/O error on device sdm, logical block 5242847
    Buffer I/O error on device sdm, logical block 1
    Buffer I/O error on device sdm, logical block 5242878
    Buffer I/O error on device sdm, logical block 5242879
    Buffer I/O error on device sdm, logical block 5242879
    Buffer I/O error on device sdm, logical block 5242879
    Buffer I/O error on device sdm, logical block 5242879
    Buffer I/O error on device sdm, logical block 5242872
    "
    in my logs.

    My disk environment is multipath fiber channel using the SCSI_DH_RDAC
    code and multipathd. This topology includes an "active" and "ghost"
    path for each LUN. I/Os to the "ghost" path will never complete and the
    SCSI layer, via the scsi device handler rdac code, quickly returns the I/Os
    to these paths and sets the REQ_QUIET scsi flag to suppress the scsi
    layer messages.

    I want to extend the QUIET behavior to include the buffer file
    system layer to deal with these errors as well. I have been running this
    patch for a while now on several boxes without issue. A few runs of
    bonnie++ show no noticeable difference in performance in my setup.

    Thanks to John Stultz for the quiet_error finalization.

    Submitted-by: Keith Mannthey
    Signed-off-by: Jens Axboe

    Keith Mannthey
     

28 Nov, 2008

1 commit

  • udf_clear_inode() can leave behind buffers on mapping's i_private list (when
    we truncated preallocation). Call invalidate_inode_buffers() so that the list
    is properly cleaned-up before we return from udf_clear_inode(). This is ugly
    and suggests that we should clean up preallocation earlier than in clear_inode()
    but currently there's no such call available since drop_inode() is called under
    inode lock and thus is unusable for disk operations.
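
    The fix is a one-liner in spirit; something along these lines (a
    sketch, not the full udf_clear_inode()):

    static void udf_clear_inode(struct inode *inode)
    {
            /* ... truncate preallocated extents as before ... */

            /* Don't leave stale buffers on the mapping's private list. */
            invalidate_inode_buffers(inode);

            /* ... release in-core UDF state ... */
    }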

    Signed-off-by: Jan Kara

    Jan Kara
     

20 Oct, 2008

1 commit

  • trylock_buffer and unlock_buffer open and close a critical section.
    Hence, we can use the lock bitops to get the desired memory ordering.
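
    With the lock bitops the pair looks roughly like this (close to the
    buffer_head code of the time, reproduced from memory, so treat the
    details as an assumption):

    static inline int trylock_buffer(struct buffer_head *bh)
    {
            /* test_and_set_bit_lock() has acquire semantics on success. */
            return likely(!test_and_set_bit_lock(BH_Lock, &bh->b_state));
    }

    void unlock_buffer(struct buffer_head *bh)
    {
            /* clear_bit_unlock() provides the release barrier closing the
             * critical section opened by lock_buffer()/trylock_buffer(). */
            clear_bit_unlock(BH_Lock, &bh->b_state);
            smp_mb__after_clear_bit();
            wake_up_bit(&bh->b_state, BH_Lock);
    }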

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     

27 Aug, 2008

1 commit


05 Aug, 2008

1 commit