23 Sep, 2009

1 commit

  • According to Documentation/CodingStyle the EXPORT* macro should follow
    immediately after the closing function brace line.

    Also, mark_buffer_async_write_endio() and do_thaw_all() are not used
    elsewhere so they should be marked as static.

    In addition, file_fsync() is actually in fs/sync.c so move the EXPORT* to
    that file.
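
    As an illustration only, a minimal sketch of the two conventions being
    applied; the function names below are generic placeholders, not the
    actual fs/buffer.c symbols:

    /* A helper used only within this file is marked static. */
    static int my_helper(int x)
    {
            return x * 2;
    }

    /* Per Documentation/CodingStyle, the EXPORT* macro follows immediately
     * after the closing brace of the function it exports. */
    int my_exported_fn(int x)
    {
            return my_helper(x) + 1;
    }
    EXPORT_SYMBOL(my_exported_fn);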

    Signed-off-by: H Hartley Sweeten
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    H Hartley Sweeten
     

11 Sep, 2009

1 commit

  • This gets rid of pdflush for bdi writeout and kupdated style cleaning.
    pdflush writeout suffers from lack of locality and also requires more
    threads to handle the same workload, since it has to work in a
    non-blocking fashion against each queue. This also introduces lumpy
    behaviour and potential request starvation, since pdflush can be starved
    for queue access if others are accessing it. A sample ffsb workload that
    does random writes to files is about 8% faster here on a simple SATA drive
    during the benchmark phase. File layout also seems a LOT smoother in
    vmstat:

    r b swpd free buff cache si so bi bo in cs us sy id wa
    0 1 0 608848 2652 375372 0 0 0 71024 604 24 1 10 48 42
    0 1 0 549644 2712 433736 0 0 0 60692 505 27 1 8 48 44
    1 0 0 476928 2784 505192 0 0 4 29540 553 24 0 9 53 37
    0 1 0 457972 2808 524008 0 0 0 54876 331 16 0 4 38 58
    0 1 0 366128 2928 614284 0 0 4 92168 710 58 0 13 53 34
    0 1 0 295092 3000 684140 0 0 0 62924 572 23 0 9 53 37
    0 1 0 236592 3064 741704 0 0 4 58256 523 17 0 8 48 44
    0 1 0 165608 3132 811464 0 0 0 57460 560 21 0 8 54 38
    0 1 0 102952 3200 873164 0 0 4 74748 540 29 1 10 48 41
    0 1 0 48604 3252 926472 0 0 0 53248 469 29 0 7 47 45

    where vanilla tends to fluctuate a lot in the creation phase:

    r b swpd free buff cache si so bi bo in cs us sy id wa
    1 1 0 678716 5792 303380 0 0 0 74064 565 50 1 11 52 36
    1 0 0 662488 5864 319396 0 0 4 352 302 329 0 2 47 51
    0 1 0 599312 5924 381468 0 0 0 78164 516 55 0 9 51 40
    0 1 0 519952 6008 459516 0 0 4 78156 622 56 1 11 52 37
    1 1 0 436640 6092 541632 0 0 0 82244 622 54 0 11 48 41
    0 1 0 436640 6092 541660 0 0 0 8 152 39 0 0 51 49
    0 1 0 332224 6200 644252 0 0 4 102800 728 46 1 13 49 36
    1 0 0 274492 6260 701056 0 0 4 12328 459 49 0 7 50 43
    0 1 0 211220 6324 763356 0 0 0 106940 515 37 1 10 51 39
    1 0 0 160412 6376 813468 0 0 0 8224 415 43 0 6 49 45
    1 1 0 85980 6452 886556 0 0 4 113516 575 39 1 11 54 34
    0 2 0 85968 6452 886620 0 0 0 1640 158 211 0 0 46 54

    A 10 disk test with btrfs performs 26% faster with per-bdi flushing. A
    SSD based writeback test on XFS performs over 20% better as well, with
    the throughput being very stable around 1GB/sec, where pdflush only
    manages 750MB/sec and fluctuates wildly while doing so. Random buffered
    writes to many files behave a lot better as well, as do random mmap'ed
    writes.

    A separate thread is added to sync the super blocks. In the long term,
    adding sync_supers_bdi() functionality could get rid of this thread again.

    Signed-off-by: Jens Axboe

    Jens Axboe
     

22 Aug, 2009

1 commit

  • In commit a8e7d49aa7be728c4ae241a75a2a124cdcabc0c5 ("Fix race in
    create_empty_buffers() vs __set_page_dirty_buffers()"), I removed a test
    for a NULL page mapping unintentionally when some of the code inside
    __set_page_dirty() was moved to the callers.

    That removal generally didn't matter, since a filesystem would serialize
    truncation (which clears the page mapping) against writing (which marks
    the buffer dirty), so locking at a higher level (either per-page or an
    inode at a time) should mean that the buffer page would be stable. And
    indeed, nothing bad seemed to happen.

    Except it turns out that apparently reiserfs does something odd when
    under load and writing out the journal, and we have a number of bugzilla
    entries that look similar:

    http://bugzilla.kernel.org/show_bug.cgi?id=13556
    http://bugzilla.kernel.org/show_bug.cgi?id=13756
    http://bugzilla.kernel.org/show_bug.cgi?id=13876

    and it looks like reiserfs depended on that check (the common theme
    seems to be "data=journal", and a journal writeback during a truncate).

    I suspect reiserfs should have some additional locking, but in the
    meantime this should get us back to the pre-2.6.29 behavior.
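
    Re-instating the test amounts to checking page->mapping under the
    mapping's lock before doing the dirty accounting. A rough sketch of the
    shape of the restored check (not the literal patch; details of the
    helper in fs/buffer.c may differ):

    static void __set_page_dirty(struct page *page,
                                 struct address_space *mapping, int warn)
    {
            spin_lock_irq(&mapping->tree_lock);
            if (page->mapping) {    /* still attached, i.e. no truncate race */
                    WARN_ON_ONCE(warn && !PageUptodate(page));
                    account_page_dirtied(page, mapping);
                    radix_tree_tag_set(&mapping->page_tree,
                                       page_index(page), PAGECACHE_TAG_DIRTY);
            }
            spin_unlock_irq(&mapping->tree_lock);
            __mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
    }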

    Pattern-pointed-out-by: Roland Kletzing
    Cc: stable@kernel.org (2.6.29 and 2.6.30)
    Cc: Jeff Mahoney
    Cc: Nick Piggin
    Cc: Al Viro
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

12 Jun, 2009

2 commits

  • * 'for-2.6.31' of git://git.kernel.dk/linux-2.6-block: (153 commits)
    block: add request clone interface (v2)
    floppy: fix hibernation
    ramdisk: remove long-deprecated "ramdisk=" boot-time parameter
    fs/bio.c: add missing __user annotation
    block: prevent possible io_context->refcount overflow
    Add serial number support for virtio_blk, V4a
    block: Add missing bounce_pfn stacking and fix comments
    Revert "block: Fix bounce limit setting in DM"
    cciss: decode unit attention in SCSI error handling code
    cciss: Remove no longer needed sendcmd reject processing code
    cciss: change SCSI error handling routines to work with interrupts enabled.
    cciss: separate error processing and command retrying code in sendcmd_withirq_core()
    cciss: factor out fix target status processing code from sendcmd functions
    cciss: simplify interface of sendcmd() and sendcmd_withirq()
    cciss: factor out core of sendcmd_withirq() for use by SCSI error handling code
    cciss: Use schedule_timeout_uninterruptible in SCSI error handling code
    block: needs to set the residual length of a bidi request
    Revert "block: implement blkdev_readpages"
    block: Fix bounce limit setting in DM
    Removed reference to non-existing file Documentation/PCI/PCI-DMA-mapping.txt
    ...

    Manually fix conflicts with tracing updates in:
    block/blk-sysfs.c
    drivers/ide/ide-atapi.c
    drivers/ide/ide-cd.c
    drivers/ide/ide-floppy.c
    drivers/ide/ide-tape.c
    include/trace/events/block.h
    kernel/trace/blktrace.c

    Linus Torvalds
     
  • * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (49 commits)
    ext4: Avoid corrupting the uninitialized bit in the extent during truncate
    ext4: Don't treat a truncation of a zero-length file as replace-via-truncate
    ext4: fix dx_map_entry to support 256k directory blocks
    ext4: truncate the file properly if we fail to copy data from userspace
    ext4: Avoid leaking blocks after a block allocation failure
    ext4: Change all super.c messages to print the device
    ext4: Get rid of EXTEND_DISKSIZE flag of ext4_get_blocks_handle()
    ext4: super.c whitespace cleanup
    jbd2: Fix minor typos in comments in fs/jbd2/journal.c
    ext4: Clean up calls to ext4_get_group_desc()
    ext4: remove unused function __ext4_write_dirty_metadata
    ext2: Fix memory leak in ext2_fill_super() in case of a failed mount
    ext3: Fix memory leak in ext3_fill_super() in case of a failed mount
    ext4: Fix memory leak in ext4_fill_super() in case of a failed mount
    ext4: down i_data_sem only for read when walking tree for fiemap
    ext4: Add a comprehensive block validity check to ext4_get_blocks()
    ext4: Clean up ext4_get_blocks() so it does not depend on bh_result->b_state
    ext4: Merge ext4_da_get_block_write() into mpage_da_map_blocks()
    ext4: Add BUG_ON debugging checks to noalloc_get_block_write()
    ext4: Add documentation to the ext4_*get_block* functions
    ...

    Linus Torvalds
     

06 Jun, 2009

1 commit

  • The nobh_truncate_page() function is used by ext2, exofs, and jfs. Of
    these three, only the ext2 and jfs get_block() functions pay attention
    to bh->b_size --- which is normally always the filesystem blocksize
    except when the get_block() function is called by either
    mpage_readpage(), mpage_readpages(), or the direct I/O routines in
    fs/direct_io.c.

    Unfortunately, nobh_truncate_page() does not initialize map_bh before
    calling the filesystem-supplied get_block() function. So ext2 and jfs
    will try to calculate the number of blocks to map by taking stack
    garbage and shifting it left by inode->i_blkbits. This should be
    *mostly* harmless (except the filesystem will do some unneeded work)
    unless the stack garbage is less than the filesystem's blocksize, in which
    case maxblocks will be zero, and the attempt to find out whether or
    not the filesystem has a hole at a given logical block will fail, and
    the page cache entry might not get zero'ed out.

    Also, if the stack garbage in map_bh->state happens to have the
    BH_Mapped bit set, there could be an attempt to call readpage() on a
    non-existent page, which could cause nobh_truncate_page() to return an
    error when it should not.

    Fix this by initializing map_bh->state and map_bh->size.

    Fortunately, it's probably fairly unlikely that ext2 and jfs users
    mount with nobh these days.
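
    The fix is simply to give the on-stack buffer_head sane initial values
    before handing it to get_block(). A sketch of the idea (variable names
    are illustrative; field names follow struct buffer_head):

    struct buffer_head map_bh;

    map_bh.b_size = blocksize;   /* ask get_block() to map one fs block */
    map_bh.b_state = 0;          /* in particular, BH_Mapped must start clear */
    err = get_block(inode, iblock, &map_bh, 0);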

    Signed-off-by: "Theodore Ts'o"
    Cc: Dave Kleikamp
    Cc: linux-fsdevel@vger.kernel.org
    Signed-off-by: Al Viro

    Theodore Ts'o
     

23 May, 2009

1 commit

  • Until now we have had a 1:1 mapping between storage device physical
    block size and the logical block size used when addressing the device.
    With SATA 4KB drives coming out that will no longer be the case. The
    sector size will be 4KB but the logical block size will remain
    512-bytes. Hence we need to distinguish between the physical block size
    and the logical ditto.

    This patch renames hardsect_size to logical_block_size.

    Signed-off-by: Martin K. Petersen
    Signed-off-by: Jens Axboe

    Martin K. Petersen
     

13 May, 2009

1 commit

  • The BH_Delay and BH_Unwritten flags should never leak out to
    submit_bh(). So add some BUG_ON() checks to submit_bh so we can get a
    stack trace and determine how and why this might have happened.

    (Note that only XFS and ext4 use these buffer head flags, and XFS does
    not use submit_bh(). So this patch should only modify behavior for
    ext4.)
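
    The added checks are of this form (a sketch; in submit_bh() they sit
    alongside the existing sanity checks):

    /* In submit_bh(), next to the existing sanity checks: */
    BUG_ON(!buffer_locked(bh));
    BUG_ON(!buffer_mapped(bh));
    BUG_ON(!bh->b_end_io);
    /* New: delayed-allocation / unwritten state must never leak this far. */
    BUG_ON(buffer_delay(bh));
    BUG_ON(buffer_unwritten(bh));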

    Signed-off-by: Aneesh Kumar K.V
    Signed-off-by: "Theodore Ts'o"
    Cc: linux-fsdevel@vger.kernel.org

    Aneesh Kumar K.V
     

03 May, 2009

1 commit

  • Change page_mkwrite to allow implementations to return with the page
    locked, and also change its callers (in page fault paths) to hold the
    lock until the page is marked dirty. This allows the filesystem to have
    full control of page dirtying events coming from the VM.

    Rather than simply hold the page locked over the page_mkwrite call, we
    call page_mkwrite with the page unlocked and allow callers to return with
    it locked, so filesystems can avoid lock-order (LOR) problems with the
    page lock.

    The problem with the current scheme is this: a filesystem that wants to
    associate some metadata with a page as long as the page is dirty, will
    perform this manipulation in its ->page_mkwrite. It currently then must
    return with the page unlocked and may not hold any other locks (according
    to existing page_mkwrite convention).

    In this window, the VM could write out the page, clearing page-dirty. The
    filesystem has no good way to detect that a dirty pte is about to be
    attached, so it will happily write out the page, at which point, the
    filesystem may manipulate the metadata to reflect that the page is no
    longer dirty.

    It is not always possible to perform the required metadata manipulation in
    ->set_page_dirty, because that function cannot block or fail. The
    filesystem may need to allocate some data structure, for example.

    And the VM cannot mark the pte dirty before page_mkwrite, because
    page_mkwrite is allowed to fail, so we must not allow any window where the
    page could be written to if page_mkwrite does fail.

    This solution of holding the page locked over the 3 critical operations
    (page_mkwrite, setting the pte dirty, and finally setting the page dirty)
    closes out races nicely, preventing page cleaning for writeout being
    initiated in that window. This provides the filesystem with a strong
    synchronisation against the VM here.

    - Sage needs this race closed for ceph filesystem.
    - Trond for NFS (http://bugzilla.kernel.org/show_bug.cgi?id=12913).
    - I need it for fsblock.
    - I suspect other filesystems may need it too (eg. btrfs).
    - I have converted buffer.c to the new locking. Even simple block allocation
    under dirty pages might be susceptible to i_size changing under partial page
    at the end of file (we also have a buffer.c-side problem here, but it cannot
    be fixed properly without this patch).
    - Other filesystems (eg. NFS, maybe btrfs) will need to change their
    page_mkwrite functions themselves.

    [ This also moves page_mkwrite another step closer to fault, which should
    eventually allow page_mkwrite to be moved into ->fault, and thus avoiding a
    filesystem calldown and page lock/unlock cycle in __do_fault. ]
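
    With the new convention a filesystem's page_mkwrite() may return with
    the page locked and tell the VM so via VM_FAULT_LOCKED; the VM keeps
    that lock across the pte-dirty and page-dirty steps. A hedged sketch
    (the myfs_* names are placeholders, not code from this patch):

    static int myfs_page_mkwrite(struct vm_area_struct *vma,
                                 struct vm_fault *vmf)
    {
            struct page *page = vmf->page;

            lock_page(page);
            /* ... set up whatever metadata must live while the page is dirty;
             * this may block or allocate, which ->set_page_dirty cannot ... */

            /* Return with the page still locked; the VM unlocks it only
             * after the pte and the page have both been marked dirty. */
            return VM_FAULT_LOCKED;
    }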

    [akpm@linux-foundation.org: fix derefs of NULL ->mapping]
    Cc: Sage Weil
    Cc: Trond Myklebust
    Signed-off-by: Nick Piggin
    Cc: Valdis Kletnieks
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     

16 Apr, 2009

1 commit

  • block_write_full_page doesn't allow the caller to control what happens
    when the IO is over. This adds a new call named block_write_full_page_endio
    so the buffer head end_io handler can be provided by the caller.

    This will be used by the ext3 data=guarded mode to do i_size updates in
    a workqueue based end_io handler. end_buffer_async_write is also
    exported so it can be called to do the dirty work of managing page
    writeback for the higher level end_io handler.
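
    Usage is expected to look roughly like the sketch below; the myfs_*
    names are placeholders and the exact signature should be checked
    against fs/buffer.c, but the shape is block_write_full_page() plus a
    caller-supplied buffer_head end_io handler:

    /* Default behaviour, unchanged: */
    static int myfs_writepage(struct page *page, struct writeback_control *wbc)
    {
            return block_write_full_page(page, myfs_get_block, wbc);
    }

    /* With a private end_io handler, e.g. to defer i_size updates to a
     * workqueue; myfs_end_write would typically wrap end_buffer_async_write. */
    static int myfs_writepage_guarded(struct page *page,
                                      struct writeback_control *wbc)
    {
            return block_write_full_page_endio(page, myfs_get_block, wbc,
                                               myfs_end_write);
    }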

    Signed-off-by: Chris Mason
    Acked-by: Theodore Tso
    Acked-by: Jan Kara
    Signed-off-by: Linus Torvalds

    Chris Mason
     

15 Apr, 2009

1 commit


09 Apr, 2009

1 commit

  • Now that we have a distinction between WRITE_SYNC and WRITE_SYNC_PLUG,
    use WRITE_SYNC_PLUG in __block_write_full_page() to avoid unplugging
    the block device I/O queue between each page that gets flushed out.

    Otherwise, when we run sync() or fsync() and we need to write out a
    large number of pages, the block device queue will get unplugged
    for every page that is flushed out, which will be a pretty
    serious performance regression caused by commit a64c8610.
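
    The idea, roughly (a sketch of the shape of the change, not the literal
    diff): pick the write op once per page and use the plugged variant for
    synchronous writeback.

    /* In __block_write_full_page(): */
    int write_op = (wbc->sync_mode == WB_SYNC_ALL) ? WRITE_SYNC_PLUG : WRITE;

    do {
            struct buffer_head *next = bh->b_this_page;
            if (buffer_async_write(bh)) {
                    submit_bh(write_op, bh);
                    nr_underway++;
            }
            bh = next;
    } while (bh != head);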

    Signed-off-by: "Theodore Ts'o"

    Theodore Ts'o
     

06 Apr, 2009

2 commits


04 Apr, 2009

1 commit


03 Apr, 2009

2 commits

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6:
    Remove two unneeded exports and make two symbols static in fs/mpage.c
    Cleanup after commit 585d3bc06f4ca57f975a5a1f698f65a45ea66225
    Trim includes of fdtable.h
    Don't crap into descriptor table in binfmt_som
    Trim includes in binfmt_elf
    Don't mess with descriptor table in load_elf_binary()
    Get rid of indirect include of fs_struct.h
    New helper - current_umask()
    check_unsafe_exec() doesn't care about signal handlers sharing
    New locking/refcounting for fs_struct
    Take fs_struct handling to new file (fs/fs_struct.c)
    Get rid of bumping fs_struct refcount in pivot_root(2)
    Kill unsharing fs_struct in __set_personality()

    Linus Torvalds
     
  • Check bh->b_blocknr only if BH_Mapped is set.

    akpm: I doubt if b_blocknr is ever uninitialised here, but it could
    conceivably cause a problem if we're doing a lookup for block zero.

    Signed-off-by: Nikanth Karthikesan
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nikanth Karthikesan
     

01 Apr, 2009

6 commits

  • Now that the filesystem freeze operation has been elevated to the VFS, and
    is just an ioctl away, some sort of safety net for unintentionally frozen
    root filesystems may be in order.

    The timeout thaw originally proposed did not get merged, but perhaps
    something like this would be useful in emergencies.

    For example, freeze /path/to/mountpoint may freeze your root filesystem if
    you forgot that you had that unmounted.

    I chose 'j' as the last remaining character other than 'h' which is sort
    of reserved for help (because help is generated on any unknown character).

    I've tested this on a non-root fs with multiple (nested) freezers, as well
    as on a system rendered unresponsive due to a frozen root fs.

    [randy.dunlap@oracle.com: emergency thaw only if CONFIG_BLOCK enabled]
    Signed-off-by: Eric Sandeen
    Cc: Takashi Sato
    Signed-off-by: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric Sandeen
     
  • try_to_free_pages() is used for the direct reclaim of up to
    SWAP_CLUSTER_MAX pages when watermarks are low. The caller to
    alloc_pages_nodemask() can specify a nodemask of nodes that are allowed to
    be used but this is not passed to try_to_free_pages(). This can lead to
    unnecessary reclaim of pages that are unusable by the caller and, in the
    worst case, lead to allocation failure, as progress is not being made where
    it is needed.

    This patch passes the nodemask used for alloc_pages_nodemask() to
    try_to_free_pages().
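
    In terms of the interface, the change is roughly the following (the
    exact prototype is reproduced here from memory and is an assumption):

    /* Before: direct reclaim cannot see which nodes the allocation may use. */
    unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
                                    gfp_t gfp_mask);

    /* After: __alloc_pages_nodemask() threads its nodemask through, so
     * reclaim frees pages where the caller can actually use them. */
    unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
                                    gfp_t gfp_mask, nodemask_t *nodemask);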

    Reviewed-by: KOSAKI Motohiro
    Acked-by: Mel Gorman
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • page_mkwrite is called with neither the page lock nor the ptl held. This
    means a page can be concurrently truncated or invalidated out from
    underneath it. Callers are supposed to prevent truncate races themselves;
    however, previously the only thing they could do when they hit one was to
    raise SIGBUS. SIGBUS is wrong for the case that the page has been
    invalidated or truncated within i_size (eg. hole punched). Callers may
    also have to perform memory allocations in this path, where again, SIGBUS
    would be wrong.

    The previous patch ("mm: page_mkwrite change prototype to match fault")
    made it possible to properly specify errors. Convert the generic buffer.c
    code and btrfs to return sane error values (in the case of page removed
    from pagecache, VM_FAULT_NOPAGE will cause the fault handler to exit
    without doing anything, and the fault will be retried properly).

    This fixes core code, and converts btrfs as a template/example. All other
    filesystems defining their own page_mkwrite should be fixed in a similar
    manner.
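
    For a filesystem doing its own page_mkwrite(), the conversion boils
    down to returning fault codes instead of raising SIGBUS for everything.
    A hedged sketch of the error mapping (myfs_* is a placeholder):

    lock_page(page);
    if (page->mapping != inode->i_mapping) {
            /* Truncated or invalidated: nothing to do, fault is retried. */
            unlock_page(page);
            return VM_FAULT_NOPAGE;
    }
    ret = myfs_reserve_for_dirtying(page);  /* placeholder; may allocate */
    unlock_page(page);
    if (ret == -ENOMEM)
            return VM_FAULT_OOM;
    if (ret)
            return VM_FAULT_SIGBUS;         /* e.g. a real I/O or fs error */
    return 0;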

    Acked-by: Chris Mason
    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Change the page_mkwrite prototype to take a struct vm_fault, and return
    VM_FAULT_xxx flags. There should be no functional change.

    This makes it possible to return much more detailed error information to
    the VM (and also can provide more information eg. virtual_address to the
    driver, which might be important in some special cases).

    This is required for a subsequent fix. And will also make it easier to
    merge page_mkwrite() with fault() in future.
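
    My rendering of the before/after prototype, based on the description
    above:

    /* Before */
    int (*page_mkwrite)(struct vm_area_struct *vma, struct page *page);

    /* After: the page is reached via vmf->page, extra context such as
     * vmf->virtual_address is available, and the return value is a
     * VM_FAULT_xxx code. */
    int (*page_mkwrite)(struct vm_area_struct *vma, struct vm_fault *vmf);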

    Signed-off-by: Nick Piggin
    Cc: Chris Mason
    Cc: Trond Myklebust
    Cc: Miklos Szeredi
    Cc: Steven Whitehouse
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Artem Bityutskiy
    Cc: Felix Blyakher
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Add a helper function account_page_dirtied(). Use that from two
    callsites. reiser4 adds a function which adds a third callsite.

    Signed-off-by: Edward Shishkin
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Edward Shishkin
     
  • fsync_bdev() export and a bunch of stubs for !CONFIG_BLOCK case had
    been left behind

    Signed-off-by: Al Viro

    Al Viro
     

28 Mar, 2009

2 commits


20 Mar, 2009

1 commit

  • Nick Piggin noticed this (very unlikely) race between setting a page
    dirty and creating the buffers for it - we need to hold the mapping
    private_lock until we've set the page dirty bit in order to make sure
    that create_empty_buffers() cannot build up a set of buffers without
    the dirty bits set while the page is dirty.

    I doubt anybody has ever hit this race (and it didn't solve the issue
    Nick was looking at), but as Nick says: "Still, it does appear to solve
    a real race, which we should close."

    Acked-by: Nick Piggin
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

19 Feb, 2009

2 commits

  • * 'for-linus' of git://git.kernel.dk/linux-2.6-block:
    block: fix deadlock in blk_abort_queue() for drivers that readd to timeout list
    block: fix booting from partitioned md array
    block: revert part of 18ce3751ccd488c78d3827e9f6bf54e6322676fb
    cciss: PCI power management reset for kexec
    paride/pg.c: xs(): &&/|| confusion
    fs/bio: bio_alloc_bioset: pass right object ptr to mempool_free
    block: fix bad definition of BIO_RW_SYNC
    bsg: Fix sense buffer bug in SG_IO

    Linus Torvalds
     
  • YAMAMOTO-san noticed that task_dirty_inc doesn't seem to be called properly for
    cases where set_page_dirty is not used to dirty a page (eg. mark_buffer_dirty).

    Additionally, there is some inconsistency about when task_dirty_inc is
    called. It is used for dirty balancing, however it even gets called for
    __set_page_dirty_no_writeback.

    So rather than increment it in a set_page_dirty wrapper, move it down to
    exactly where the dirty page accounting stats are incremented.

    Cc: YAMAMOTO Takashi
    Signed-off-by: Nick Piggin
    Acked-by: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     

18 Feb, 2009

1 commit

  • The above commit added WRITE_SYNC and switched various places to using
    that for committing writes that will be waited upon immediately after
    submission. However, this causes a performance regression with AS and CFQ
    for ext3 at least, since sync_dirty_buffer() will submit some writes with
    WRITE_SYNC while ext3 has submitted other dependent writes without the sync
    flag set. This causes excessive anticipation/idling in the IO scheduler
    because sync and async writes get interleaved, causing a big performance
    regression for the below test case (which is meant to simulate sqlite
    like behaviour).

    ---- test case ----

    int main(int argc, char **argv)
    {
            int fdes, i;
            FILE *fp;
            struct timeval start;
            struct timeval end;
            struct timeval res;

            gettimeofday(&start, NULL);
            for (i=0; i

    Jens Axboe
     

07 Feb, 2009

1 commit

  • This is a modification of a patch by Bill Pemberton.

    nobh_write_end() could call attach_nobh_buffers() with head == NULL.
    This would result in a trap when attach_nobh_buffers() attempted to
    access bh->b_this_page.

    This can be illustrated by running the writev01 testcase from LTP on jfs.

    This error was introduced by commit 5b41e74a "vfs: fix data leak in
    nobh_write_end()". That patch did not take into account that if
    PageMappedToDisk() is true upon entry to nobh_write_begin(), then no
    buffers will be allocated for the page. In that case, we won't have to
    worry about a failed write leaving uninitialized data in the page.

    Of course, head != NULL implies !page_has_buffers(page), so no need to
    test both.
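
    The resulting guard is essentially of this shape (a sketch consistent
    with the description above, not the literal diff):

    /* In nobh_write_end(): head is NULL when nobh_write_begin() saw
     * PageMappedToDisk and therefore allocated no buffers, so only try to
     * reattach buffers when there is actually a list to attach. */
    if (unlikely(copied < len) && head)
            attach_nobh_buffers(page, head);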

    Signed-off-by: Dave Kleikamp
    Cc: Bill Pemberton
    Cc: Dmitri Monakhov
    Signed-off-by: Linus Torvalds

    Dave Kleikamp
     

14 Jan, 2009

1 commit


10 Jan, 2009

2 commits

  • The ioctls for the generic freeze feature are below.

    o Freeze the filesystem
      int ioctl(int fd, int FIFREEZE, arg)
        fd: The file descriptor of the mountpoint
        FIFREEZE: request code for the freeze
        arg: Ignored
        Return value: 0 if the operation succeeds. Otherwise, -1

    o Unfreeze the filesystem
      int ioctl(int fd, int FITHAW, arg)
        fd: The file descriptor of the mountpoint
        FITHAW: request code for unfreeze
        arg: Ignored
        Return value: 0 if the operation succeeds. Otherwise, -1
        Error number: If the filesystem has already been unfrozen,
        errno is set to EINVAL.
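
    A minimal user-space sketch of the sequence (the mountpoint path is an
    example; error handling is abbreviated and CAP_SYS_ADMIN is required):

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <linux/fs.h>                       /* FIFREEZE, FITHAW */

    int main(void)
    {
            int fd = open("/mnt/data", O_RDONLY);   /* fd of the mountpoint */

            if (fd < 0)
                    return 1;
            if (ioctl(fd, FIFREEZE, 0) < 0)         /* arg is ignored */
                    perror("FIFREEZE");

            /* ... take the snapshot or split the replica here ... */

            if (ioctl(fd, FITHAW, 0) < 0)           /* EINVAL if already thawed */
                    perror("FITHAW");
            close(fd);
            return 0;
    }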

    [akpm@linux-foundation.org: fix CONFIG_BLOCK=n]
    Signed-off-by: Takashi Sato
    Signed-off-by: Masayuki Hamaguchi
    Cc:
    Cc:
    Cc: Christoph Hellwig
    Cc: Dave Kleikamp
    Cc: Dave Chinner
    Cc: Alasdair G Kergon
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Takashi Sato
     
  • Currently, ext3 in mainline Linux doesn't have the freeze feature which
    suspends write requests. So, we cannot take a backup which keeps the
    filesystem's consistency with the storage device's features (snapshot and
    replication) while it is mounted.

    In many cases, a commercial filesystem (e.g. VxFS) has the freeze feature
    and it would be used to get the consistent backup.

    If Linux's standard filesystem ext3 has the freeze feature, we can do it
    without a commercial filesystem.

    So I have implemented the ioctls of the freeze feature.
    I think we can take the consistent backup with the following steps.
    1. Freeze the filesystem with the freeze ioctl.
    2. Separate the replication volume or create the snapshot
    with the storage device's feature.
    3. Unfreeze the filesystem with the unfreeze ioctl.
    4. Take the backup from the separated replication volume
    or the snapshot.

    This patch:

    VFS:
    Changed the type of write_super_lockfs and unlockfs from "void"
    to "int" so that they can return an error.
    Renamed write_super_lockfs and unlockfs of the super block operations
    to freeze_fs and unfreeze_fs to avoid confusion.

    ext3, ext4, xfs, gfs2, jfs:
    Changed the type of write_super_lockfs and unlockfs from "void"
    to "int" so that write_super_lockfs returns an error if needed,
    and unlockfs always returns 0.

    reiserfs:
    Changed the type of write_super_lockfs and unlockfs from "void"
    to "int" so that they always return 0 (success) to keep a current behavior.
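
    On the filesystem side the renamed operations end up wired into
    super_operations roughly as below (myfs_* names are placeholders; the
    point is simply that both callbacks now return int):

    static int myfs_freeze(struct super_block *sb)
    {
            /* quiesce writes, flush the journal; return 0 or -errno */
            return 0;
    }

    static int myfs_unfreeze(struct super_block *sb)
    {
            /* resume normal operation */
            return 0;
    }

    static const struct super_operations myfs_sops = {
            /* formerly .write_super_lockfs / .unlockfs */
            .freeze_fs   = myfs_freeze,
            .unfreeze_fs = myfs_unfreeze,
    };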

    Signed-off-by: Takashi Sato
    Signed-off-by: Masayuki Hamaguchi
    Cc:
    Cc:
    Cc: Christoph Hellwig
    Cc: Dave Kleikamp
    Cc: Dave Chinner
    Cc: Alasdair G Kergon
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Takashi Sato
     

07 Jan, 2009

1 commit


05 Jan, 2009

1 commit

  • With the write_begin/write_end aops, page_symlink was broken because it
    could no longer pass a GFP_NOFS type mask into the point where the
    allocations happened. They are done in write_begin, which would always
    assume that the filesystem can be entered from reclaim. This bug could
    cause filesystem deadlocks.

    The funny thing with having a gfp_t mask there is that it doesn't really
    allow the caller to arbitrarily tinker with the context in which it can be
    called. It couldn't ever be GFP_ATOMIC, for example, because it needs to
    take the page lock. The only thing any callers care about is __GFP_FS
    anyway, so turn that into a single flag.

    Add a new flag for write_begin, AOP_FLAG_NOFS. Filesystems can now act on
    this flag in their write_begin function. Change __grab_cache_page to
    accept a nofs argument as well, to honour that flag (while we're there,
    change the name to grab_cache_page_write_begin which is more instructive
    and does away with random leading underscores).

    This is really a more flexible way to go in the end anyway -- if a
    filesystem happens to want any extra allocations aside from the pagecache
    ones in its write_begin function, it may now use GFP_KERNEL (rather than
    GFP_NOFS) for common case allocations (eg. ocfs2_alloc_write_ctxt, for a
    random example).
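
    A sketch of how a write_begin implementation would honour the new flag
    (the myfs_* names and the extra allocation are illustrative only):

    static int myfs_write_begin(struct file *file, struct address_space *mapping,
                                loff_t pos, unsigned len, unsigned flags,
                                struct page **pagep, void **fsdata)
    {
            pgoff_t index = pos >> PAGE_CACHE_SHIFT;
            gfp_t gfp = (flags & AOP_FLAG_NOFS) ? GFP_NOFS : GFP_KERNEL;
            struct page *page;

            /* grab_cache_page_write_begin() itself honours AOP_FLAG_NOFS when
             * allocating the pagecache page. */
            page = grab_cache_page_write_begin(mapping, index, flags);
            if (!page)
                    return -ENOMEM;

            /* Any private allocations must respect the caller's context too. */
            *fsdata = kmalloc(64, gfp);             /* illustrative only */
            if (!*fsdata) {
                    unlock_page(page);
                    page_cache_release(page);
                    return -ENOMEM;
            }
            *pagep = page;
            return 0;
    }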

    [kosaki.motohiro@jp.fujitsu.com: fix ubifs]
    [kosaki.motohiro@jp.fujitsu.com: fix fuse]
    Signed-off-by: Nick Piggin
    Reviewed-by: KOSAKI Motohiro
    Cc: [2.6.28.x]
    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    [ Cleaned up the calling convention: just pass in the AOP flags
    untouched to the grab_cache_page_write_begin() function. That
    just simplifies everybody, and may even allow future expansion of the
    logic. - Linus ]
    Signed-off-by: Linus Torvalds

    Nick Piggin
     

29 Dec, 2008

1 commit

  • Allow the scsi request REQ_QUIET flag to be propagated to the buffer
    file system layer. The basic idea is to pass the flag from the scsi
    request to the bio (block IO) and then to the buffer layer. The buffer
    layer can then suppress needless printks.

    This patch declutters the kernel log by removing the 40-50 (per LUN)
    buffer I/O error messages seen during a boot in my multipath setup. There
    is a good chance any real errors will be missed in the "noise" in the
    logs without this patch.

    During boot I see blocks of messages like
    "
    __ratelimit: 211 callbacks suppressed
    Buffer I/O error on device sdm, logical block 5242879
    Buffer I/O error on device sdm, logical block 5242879
    Buffer I/O error on device sdm, logical block 5242847
    Buffer I/O error on device sdm, logical block 1
    Buffer I/O error on device sdm, logical block 5242878
    Buffer I/O error on device sdm, logical block 5242879
    Buffer I/O error on device sdm, logical block 5242879
    Buffer I/O error on device sdm, logical block 5242879
    Buffer I/O error on device sdm, logical block 5242879
    Buffer I/O error on device sdm, logical block 5242872
    "
    in my logs.

    My disk environment is multipath fiber channel using the SCSI_DH_RDAC
    code and multipathd. This topology includes an "active" and "ghost"
    path for each LUN. I/Os to the "ghost" path will never complete and the
    SCSI layer, via the scsi device handler rdac code, quickly returns the I/Os
    to these paths and sets the REQ_QUIET scsi flag to suppress the scsi
    layer messages.

    I want to extend the QUIET behavior to include the buffer file
    system layer to deal with these errors as well. I have been running this
    patch for a while now on several boxes without issue. A few runs of
    bonnie++ show no noticeable difference in performance in my setup.

    Thanks to John Stultz for the quiet_error finalization.

    Submitted-by: Keith Mannthey
    Signed-off-by: Jens Axboe

    Keith Mannthey
     

28 Nov, 2008

1 commit

  • udf_clear_inode() can leave behind buffers on mapping's i_private list (when
    we truncated preallocation). Call invalidate_inode_buffers() so that the list
    is properly cleaned-up before we return from udf_clear_inode(). This is ugly
    and suggests that we should clean up preallocation earlier than in clear_inode()
    but currently there's no such call available since drop_inode() is called under
    inode lock and thus is unusable for disk operations.
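
    The fix is a one-liner in spirit; something along these lines (a
    sketch, not the full udf_clear_inode()):

    static void udf_clear_inode(struct inode *inode)
    {
            /* ... truncate preallocated extents as before ... */

            /* Don't leave stale buffers on the mapping's private list. */
            invalidate_inode_buffers(inode);

            /* ... release in-core UDF state ... */
    }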

    Signed-off-by: Jan Kara

    Jan Kara
     

20 Oct, 2008

1 commit

  • trylock_buffer and unlock_buffer open and close a critical section.
    Hence, we can use the lock bitops to get the desired memory ordering.
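
    With the lock bitops the pair looks roughly like this (close to the
    buffer_head code of the time, reproduced from memory, so treat the
    details as an assumption):

    static inline int trylock_buffer(struct buffer_head *bh)
    {
            /* test_and_set_bit_lock() has acquire semantics on success. */
            return likely(!test_and_set_bit_lock(BH_Lock, &bh->b_state));
    }

    void unlock_buffer(struct buffer_head *bh)
    {
            /* clear_bit_unlock() provides the release barrier closing the
             * critical section opened by lock_buffer()/trylock_buffer(). */
            clear_bit_unlock(BH_Lock, &bh->b_state);
            smp_mb__after_clear_bit();
            wake_up_bit(&bh->b_state, BH_Lock);
    }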

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     

27 Aug, 2008

1 commit


05 Aug, 2008

1 commit