25 Mar, 2011

2 commits

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6:
    fs: simplify iget & friends
    fs: pull inode->i_lock up out of writeback_single_inode
    fs: rename inode_lock to inode_hash_lock
    fs: move i_wb_list out from under inode_lock
    fs: move i_sb_list out from under inode_lock
    fs: remove inode_lock from iput_final and prune_icache
    fs: Lock the inode LRU list separately
    fs: factor inode disposal
    fs: protect inode->i_state with inode->i_lock
    autofs4: Do not potentially dereference NULL pointer returned by fget() in autofs_dev_ioctl_setpipefd()
    autofs4 - remove autofs4_lock
    autofs4 - fix d_manage() return on rcu-walk
    autofs4 - fix autofs4_expire_indirect() traversal
    autofs4 - fix dentry leak in autofs4_expire_direct()
    autofs4 - reinstate last used update on access
    vfs - check non-mountpoint dentry might block in __follow_mount_rcu()

    Linus Torvalds
     
  • Protect inode state transitions and validity checks with the
    inode->i_lock. This enables us to make inode state transitions
    independently of the inode_lock and is the first step to peeling
    away the inode_lock from the code.

    This requires that __iget() is done atomically with i_state checks
    during list traversals so that we don't race with another thread
    marking the inode I_FREEING between the state check and grabbing the
    reference.

    Also remove the unlock_new_inode() memory barrier optimisation
    required to avoid taking the inode_lock when clearing I_NEW.
    Simplify the code by simply taking the inode->i_lock around the
    state change and wakeup. Because the wakeup is no longer tricky,
    remove the wake_up_inode() function and open code the wakeup where
    necessary.

    Signed-off-by: Dave Chinner
    Signed-off-by: Al Viro

    Dave Chinner
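
The requirement that the state check and the reference grab happen atomically can be illustrated with a small user-space sketch. It is not kernel code: `struct toy_inode`, `toy_grab_inode()` and `toy_mark_freeing()` are hypothetical stand-ins, and a pthread mutex plays the role of inode->i_lock.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the I_FREEING state bit. */
#define TOY_I_FREEING 0x1u

/* Hypothetical user-space model of an inode; not the kernel structure. */
struct toy_inode {
    pthread_mutex_t i_lock;  /* protects i_state and i_count */
    unsigned int i_state;
    unsigned int i_count;
};

static void toy_inode_init(struct toy_inode *inode)
{
    assert(inode != NULL);
    pthread_mutex_init(&inode->i_lock, NULL);
    inode->i_state = 0;
    inode->i_count = 0;
}

/*
 * Check the state and take the reference under a single hold of the
 * lock, so no other thread can set the freeing bit in between.
 */
static bool toy_grab_inode(struct toy_inode *inode)
{
    bool grabbed = false;

    pthread_mutex_lock(&inode->i_lock);
    if (!(inode->i_state & TOY_I_FREEING)) {
        inode->i_count++;
        grabbed = true;
    }
    pthread_mutex_unlock(&inode->i_lock);
    return grabbed;
}

/* The state transition takes the same lock as the check above. */
static void toy_mark_freeing(struct toy_inode *inode)
{
    pthread_mutex_lock(&inode->i_lock);
    inode->i_state |= TOY_I_FREEING;
    pthread_mutex_unlock(&inode->i_lock);
}
```

Once `toy_mark_freeing()` has run, `toy_grab_inode()` refuses to hand out further references. If the check and the increment were done under separate lock holds, a reference could be taken on an inode that is already being freed.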
     

17 Mar, 2011

1 commit


10 Mar, 2011

2 commits

  • With the plugging now being explicitly controlled by the
    submitter, callers need not pass down unplugging hints
    to the block layer. If they want to unplug, it's because they
    manually plugged on their own - in which case, they should just
    unplug at will.

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • Code has been converted over to the new explicit on-stack plugging,
    and delay users have been converted to use the new API for that.
    So let's kill off the old plugging along with aops->sync_page().

    Signed-off-by: Jens Axboe

    Jens Axboe
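
The idea of submitter-controlled, on-stack plugging can be sketched in user space as follows. This is only a toy model under stated assumptions: the names `toy_plug`, `toy_start_plug()`, `toy_submit()` and `toy_finish_plug()` are hypothetical, and "dispatching" is reduced to a counter. In the kernel this role is played by `struct blk_plug` together with `blk_start_plug()` and `blk_finish_plug()`.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_PLUG_MAX 16

/* Hypothetical on-stack plug: requests collect here until unplugged. */
struct toy_plug {
    int pending[TOY_PLUG_MAX];
    size_t count;
};

/* Number of requests that have reached the toy "device". */
static int toy_dispatched;

static void toy_start_plug(struct toy_plug *plug)
{
    assert(plug != NULL);
    plug->count = 0;
}

/* Hand everything held by the plug to the device in one batch. */
static void toy_flush(struct toy_plug *plug)
{
    toy_dispatched += (int)plug->count;
    plug->count = 0;
}

/* With no plug, dispatch at once; with a plug, queue on the caller's stack. */
static void toy_submit(struct toy_plug *plug, int req)
{
    if (plug == NULL) {
        toy_dispatched++;
        return;
    }
    if (plug->count == TOY_PLUG_MAX)
        toy_flush(plug);
    plug->pending[plug->count++] = req;
}

/* The submitter unplugs explicitly; no hint is passed down to lower layers. */
static void toy_finish_plug(struct toy_plug *plug)
{
    toy_flush(plug);
}
```

The point of the design is that the code that plugged is also the code that unplugs, so the lower layer needs no unplug hints and no sync_page()-style callback.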
     

17 Dec, 2010

2 commits

  • __this_cpu_inc can create a single instruction with the same effect
    as the __get_cpu_var(..)++ construct in buffer.c.

    Cc: Wu Fengguang
    Cc: Christoph Hellwig
    Acked-by: H. Peter Anvin
    Signed-off-by: Christoph Lameter
    Signed-off-by: Tejun Heo

    Christoph Lameter
     
  • Optimize various per cpu area operations through these new percpu
    operations. These operations avoid address calculations through the
    use of segment prefixes and multiple memory references through RMW
    instructions etc.

    Reduces code size:

    Before:

    christoph@linux-2.6$ size fs/buffer.o
    text data bss dec hex filename
    19169 80 28 19277 4b4d fs/buffer.o

    After:

    christoph@linux-2.6$ size fs/buffer.o
    text data bss dec hex filename
    19138 80 28 19246 4b2e fs/buffer.o

    V3->V4:
    - Move the use of this_cpu_inc_return into a later patch so that
    this one can go in without percpu infrastructure changes.

    Cc: Wu Fengguang
    Cc: Christoph Hellwig
    Acked-by: H. Peter Anvin
    Signed-off-by: Christoph Lameter
    Signed-off-by: Tejun Heo

    Christoph Lameter
     

27 Oct, 2010

3 commits

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (52 commits)
    split invalidate_inodes()
    fs: skip I_FREEING inodes in writeback_sb_inodes
    fs: fold invalidate_list into invalidate_inodes
    fs: do not drop inode_lock in dispose_list
    fs: inode split IO and LRU lists
    fs: switch bdev inode bdi's correctly
    fs: fix buffer invalidation in invalidate_list
    fsnotify: use dget_parent
    smbfs: use dget_parent
    exportfs: use dget_parent
    fs: use RCU read side protection in d_validate
    fs: clean up dentry lru modification
    fs: split __shrink_dcache_sb
    fs: improve DCACHE_REFERENCED usage
    fs: use percpu counter for nr_dentry and nr_dentry_unused
    fs: simplify __d_free
    fs: take dcache_lock inside __d_path
    fs: do not assign default i_ino in new_inode
    fs: introduce a per-cpu last_ino allocator
    new helper: ihold()
    ...

    Linus Torvalds
     
  • bh->b_private is initialized within init_buffer(), thus this assignment is
    redundant.

    Signed-off-by: Namhyung Kim
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Namhyung Kim
     
  • This removes more dead code that was somehow missed by commit 0d99519efef
    (writeback: remove unused nonblocking and congestion checks). There is
    no behavior change except for the removal of two entries from one of the
    ext4 tracing interfaces.

    The nonblocking checks in ->writepages are no longer used because the
    flusher now prefers to block on get_request_wait() rather than skip inodes
    on IO congestion. Skipping would lead to more seeky IO.

    The nonblocking checks in ->writepage are no longer used because they are
    redundant with the WB_SYNC_NONE check.

    We no longer set ->nonblocking in VM page out and page migration, because
    a) it's effectively redundant with WB_SYNC_NONE in current code
    b) its old semantics of "Don't get stuck on request queues" is misbehavior:
    that would skip some dirty inodes on congestion and page out others, which
    is unfair in terms of LRU age.

    Inspired by Christoph Hellwig. Thanks!

    Signed-off-by: Wu Fengguang
    Cc: Theodore Ts'o
    Cc: David Howells
    Cc: Sage Weil
    Cc: Steve French
    Cc: Chris Mason
    Cc: Jens Axboe
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wu Fengguang
     

26 Oct, 2010

3 commits


10 Sep, 2010

1 commit


18 Aug, 2010

2 commits

  • These flags aren't real I/O types, but tell ll_rw_block to always
    lock the buffer instead of giving up on a failed trylock.

    Instead add a new write_dirty_buffer helper that implements this semantic
    and use it from the existing SWRITE* callers. Note that the ll_rw_block
    code had a bug where it didn't promote WRITE_SYNC_PLUG properly, which
    this patch fixes.

    In the ufs code clean up the helper that used to call ll_rw_block
    to mirror sync_dirty_buffer, which is the function it implements for
    compound buffers.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
     
  • Instead of abusing a buffer_head flag just add a variant of
    sync_dirty_buffer which allows passing the exact type of write
    flag required.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
     

10 Aug, 2010

4 commits

  • Move the call to vmtruncate to get rid of excessive blocks to the callers
    in preparation of the new truncate sequence and rename the non-truncating
    version to block_write_begin.

    While we're at it also remove several unused arguments to block_write_begin.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
     
  • Split up the block_write_begin implementation - __block_write_begin is a new
    trivial wrapper for block_prepare_write that always takes an already
    allocated page and can be either called from block_write_begin or filesystem
    code that already has a page allocated. Remove the handling of already
    allocated pages from block_write_begin after switching all callers that
    do it to __block_write_begin.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
     
  • Move the call to vmtruncate to get rid of excessive blocks to the callers
    in preparation of the new truncate sequence and rename the non-truncating
    version to cont_write_begin.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
     
  • Move the call to vmtruncate to get rid of excessive blocks to the only
    remaining caller and rename the non-truncating version to nobh_write_begin.

    Get rid of the superfluous file argument to it while we're at it.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro

    Christoph Hellwig
     

28 May, 2010

1 commit

  • Introduce a new truncate calling sequence into fs/mm subsystems. Rather than
    setattr > vmtruncate > truncate, have filesystems call their truncate sequence
    from ->setattr if filesystem specific operations are required. vmtruncate is
    deprecated, and truncate_pagecache and inode_newsize_ok helpers introduced
    previously should be used.

    simple_setattr is introduced for simple in-ram filesystems to implement
    the new truncate sequence. Eventually all filesystems should be converted
    to implement a setattr, and the default code in notify_change should go
    away.

    simple_setsize is also introduced to perform just the ATTR_SIZE portion
    of simple_setattr (ie. changing i_size and trimming pagecache).

    To implement the new truncate sequence:
    - filesystem specific manipulations (eg freeing blocks) must be done in
    the setattr method rather than ->truncate.
    - vmtruncate can not be used by core code to trim blocks past i_size in
    the event of write failure after allocation, so this must be performed
    in the fs code.
    - convert usage of helpers block_write_begin, nobh_write_begin,
    cont_write_begin, and *blockdev_direct_IO* to use _newtrunc postfixed
    variants. These avoid calling vmtruncate to trim blocks (see previous).
    - inode_setattr should not be used. generic_setattr is a new function
    to be used to copy simple attributes into the generic inode.
    - make use of the better opportunity to handle errors with the new sequence.

    Big problem with the previous calling sequence: the filesystem is not called
    until i_size has already changed. This means it is not allowed to fail the
    call, and also it does not know what the previous i_size was. Also, generic
    code calling vmtruncate to truncate allocated blocks in case of error had
    no good way to return a meaningful error (or, for example, atomically handle
    block deallocation).

    Cc: Christoph Hellwig
    Acked-by: Jan Kara
    Signed-off-by: Nick Piggin
    Signed-off-by: Al Viro

    npiggin@suse.de
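
The ordering described above (validate the new size first, then change i_size, then trim the page cache and free blocks from the filesystem's own setattr path) can be sketched in user space. Everything here is a hypothetical model: `struct toy_file`, `toy_newsize_ok()` and `toy_setattr_size()` are illustrative names, and the size check is only a stand-in for what inode_newsize_ok does in the kernel.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical model of a file; sizes are in bytes. */
struct toy_file {
    long long i_size;     /* current file size */
    long long cached;     /* bytes held in the toy "page cache" */
    long long allocated;  /* bytes covered by allocated blocks */
    long long limit;      /* largest size this toy filesystem accepts */
};

/* Stand-in for a size check done before anything is modified. */
static int toy_newsize_ok(const struct toy_file *f, long long newsize)
{
    if (newsize < 0)
        return -EINVAL;
    if (newsize > f->limit)
        return -EFBIG;
    return 0;
}

/* Size change driven from the filesystem's own setattr-style entry point. */
static int toy_setattr_size(struct toy_file *f, long long newsize)
{
    long long oldsize;
    int err;

    assert(f != NULL);
    oldsize = f->i_size;  /* the old size is still known at this point */

    err = toy_newsize_ok(f, newsize);
    if (err)
        return err;       /* fail before any state has changed */

    f->i_size = newsize;
    if (newsize < oldsize) {
        if (f->cached > newsize)
            f->cached = newsize;     /* trim the page cache */
        if (f->allocated > newsize)
            f->allocated = newsize;  /* filesystem-specific: free blocks */
    }
    return 0;
}
```

Because the filesystem is entered before i_size changes, it can reject the request with a meaningful error and it can see both the old and the new size, which the previous calling sequence did not allow.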
     

22 May, 2010

5 commits

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (69 commits)
    fix handling of offsets in cris eeprom.c, get rid of fake on-stack files
    get rid of home-grown mutex in cris eeprom.c
    switch ecryptfs_write() to struct inode *, kill on-stack fake files
    switch ecryptfs_get_locked_page() to struct inode *
    simplify access to ecryptfs inodes in ->readpage() and friends
    AFS: Don't put struct file on the stack
    Ban ecryptfs over ecryptfs
    logfs: replace inode uid,gid,mode initialization with helper function
    ufs: replace inode uid,gid,mode initialization with helper function
    udf: replace inode uid,gid,mode init with helper
    ubifs: replace inode uid,gid,mode initialization with helper function
    sysv: replace inode uid,gid,mode initialization with helper function
    reiserfs: replace inode uid,gid,mode initialization with helper function
    ramfs: replace inode uid,gid,mode initialization with helper function
    omfs: replace inode uid,gid,mode initialization with helper function
    bfs: replace inode uid,gid,mode initialization with helper function
    ocfs2: replace inode uid,gid,mode initialization with helper function
    nilfs2: replace inode uid,gid,mode initialization with helper function
    minix: replace inode uid,gid,mode init with helper
    ext4: replace inode uid,gid,mode init with helper
    ...

    Trivial conflict in fs/fs-writeback.c (mark bitfields unsigned)

    Linus Torvalds
     
  • ... and switch the simple "loop over superblocks and do something"
    loops to it.

    Signed-off-by: Al Viro

    Al Viro
     
  • Signed-off-by: Al Viro

    Al Viro
     
  • We used to remove from s_list and s_instances at the same
    time. So let's *not* do the former and skip superblocks
    that have empty s_instances in the loops over s_list.

    The next step, of course, will be to get rid of rescan logic
    in those loops.

    Signed-off-by: Al Viro

    Al Viro
     
  • invalidate_bdev() should release all page cache pages which are clean
    and not being used; however, if some pages are still in the percpu LRU
    add caches on other cpus, those pages are considered in use and don't
    get released. Fix it by calling lru_add_drain_all() before trying to
    invalidate pages.

    This problem was discovered while testing block automatic native
    capacity unlocking. Null pages which were read before automatic
    unlocking didn't get released by invalidate_bdev() and ended up
    interfering with partition scan after unlocking.

    Signed-off-by: Tejun Heo
    Acked-by: David S. Miller
    Signed-off-by: Jens Axboe

    Tejun Heo
     

13 Mar, 2010

2 commits

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (56 commits)
    doc: fix typo in comment explaining rb_tree usage
    Remove fs/ntfs/ChangeLog
    doc: fix console doc typo
    doc: cpuset: Update the cpuset flag file
    Fix of spelling in arch/sparc/kernel/leon_kernel.c no longer needed
    Remove drivers/parport/ChangeLog
    Remove drivers/char/ChangeLog
    doc: typo - Table 1-2 should refer to "status", not "statm"
    tree-wide: fix typos "ass?o[sc]iac?te" -> "associate" in comments
    No need to patch AMD-provided drivers/gpu/drm/radeon/atombios.h
    devres/irq: Fix devm_irq_match comment
    Remove reference to kthread_create_on_cpu
    tree-wide: Assorted spelling fixes
    tree-wide: fix 'lenght' typo in comments and code
    drm/kms: fix spelling in error message
    doc: capitalization and other minor fixes in pnp doc
    devres: typo fix s/dev/devm/
    Remove redundant trailing semicolons from macros
    fix typo "definetly" -> "definitely" in comment
    tree-wide: s/widht/width/g typo in comments
    ...

    Fix trivial conflict in Documentation/laptops/00-INDEX

    Linus Torvalds
     
  • When using slub, having a kmem_cache constructor forces slub to add a free
    pointer to the size of the cached object, which can have a significant
    impact on the number of small objects that can fit into a slab.

    As buffer_head is relatively small and we can have large numbers of them,
    removing the constructor is a definite win.

    On x86_64 removing the constructor gives me 39 objects/slab, 3 more than
    without the patch. And on x86_32 73 objects/slab, which is 9 more.

    As alloc_buffer_head() already initializes each new object there is very
    little difference in actual code run.

    Signed-off-by: Richard Kennedy
    Cc: Alexander Viro
    Cc: Jens Axboe
    Acked-by: Nick Piggin
    Cc: "Theodore Ts'o"
    Reviewed-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Richard Kennedy
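
The effect of the extra free pointer on packing density is simple integer division, sketched below. The concrete sizes used in the usage note (a 4096-byte slab, a 104-byte object and an 8-byte pointer) are assumptions: the commit message does not state them, and they are picked only because they reproduce the x86_64 figures it quotes. The helper `toy_objects_per_slab()` is hypothetical and ignores alignment and any per-slab metadata.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: how many objects fit into one slab when each
 * object carries extra_bytes of per-object overhead (for example a
 * free pointer appended after the object).
 */
static size_t toy_objects_per_slab(size_t slab_bytes, size_t object_bytes,
                                   size_t extra_bytes)
{
    assert(object_bytes + extra_bytes > 0);
    return slab_bytes / (object_bytes + extra_bytes);
}
```

Under those assumed sizes, `toy_objects_per_slab(4096, 104, 0)` yields 39 and `toy_objects_per_slab(4096, 104, 8)` yields 36, matching the "39 objects/slab, 3 more than without the patch" figure for x86_64.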
     

04 Feb, 2010

1 commit


26 Sep, 2009

2 commits

  • * 'writeback' of git://git.kernel.dk/linux-2.6-block:
    writeback: writeback_inodes_sb() should use bdi_start_writeback()
    writeback: don't delay inodes redirtied by a fast dirtier
    writeback: make the super_block pinning more efficient
    writeback: don't resort for a single super_block in move_expired_inodes()
    writeback: move inodes from one super_block together
    writeback: get rid to incorrect references to pdflush in comments
    writeback: improve readability of the wb_writeback() continue/break logic
    writeback: cleanup writeback_single_inode()
    writeback: kupdate writeback shall not stop when more io is possible
    writeback: stop background writeback when below background threshold
    writeback: balance_dirty_pages() shall write more than dirtied pages
    fs: Fix busyloop in wb_writeback()

    Linus Torvalds
     
  • Signed-off-by: Jens Axboe

    Jens Axboe
     

24 Sep, 2009

1 commit

  • Update some fs code to make use of new helper functions introduced
    in the previous patch. Should be no significant change in behaviour
    (except CIFS now calls send_sig under i_lock, via inode_newsize_ok).

    Reviewed-by: Christoph Hellwig
    Acked-by: Miklos Szeredi
    Cc: linux-nfs@vger.kernel.org
    Cc: Trond.Myklebust@netapp.com
    Cc: linux-cifs-client@lists.samba.org
    Cc: sfrench@samba.org
    Signed-off-by: Nick Piggin
    Signed-off-by: Al Viro

    npiggin@suse.de
     

23 Sep, 2009

1 commit

  • According to Documentation/CodingStyle the EXPORT* macro should follow
    immediately after the closing function brace line.

    Also, mark_buffer_async_write_endio() and do_thaw_all() are not used
    elsewhere so they should be marked as static.

    In addition, file_fsync() is actually in fs/sync.c so move the EXPORT* to
    that file.

    Signed-off-by: H Hartley Sweeten
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    H Hartley Sweeten
     

11 Sep, 2009

1 commit

  • This gets rid of pdflush for bdi writeout and kupdated style cleaning.
    pdflush writeout suffers from lack of locality and also requires more
    threads to handle the same workload, since it has to work in a
    non-blocking fashion against each queue. This also introduces lumpy
    behaviour and potential request starvation, since pdflush can be starved
    for queue access if others are accessing it. A sample ffsb workload that
    does random writes to files is about 8% faster here on a simple SATA drive
    during the benchmark phase. File layout also seems a LOT more smooth in
    vmstat:

    r b swpd   free buff  cache si so bi     bo  in  cs us sy id wa
    0 1    0 608848 2652 375372  0  0  0  71024 604  24  1 10 48 42
    0 1    0 549644 2712 433736  0  0  0  60692 505  27  1  8 48 44
    1 0    0 476928 2784 505192  0  0  4  29540 553  24  0  9 53 37
    0 1    0 457972 2808 524008  0  0  0  54876 331  16  0  4 38 58
    0 1    0 366128 2928 614284  0  0  4  92168 710  58  0 13 53 34
    0 1    0 295092 3000 684140  0  0  0  62924 572  23  0  9 53 37
    0 1    0 236592 3064 741704  0  0  4  58256 523  17  0  8 48 44
    0 1    0 165608 3132 811464  0  0  0  57460 560  21  0  8 54 38
    0 1    0 102952 3200 873164  0  0  4  74748 540  29  1 10 48 41
    0 1    0  48604 3252 926472  0  0  0  53248 469  29  0  7 47 45

    where vanilla tends to fluctuate a lot in the creation phase:

    r b swpd   free buff  cache si so bi     bo  in  cs us sy id wa
    1 1    0 678716 5792 303380  0  0  0  74064 565  50  1 11 52 36
    1 0    0 662488 5864 319396  0  0  4    352 302 329  0  2 47 51
    0 1    0 599312 5924 381468  0  0  0  78164 516  55  0  9 51 40
    0 1    0 519952 6008 459516  0  0  4  78156 622  56  1 11 52 37
    1 1    0 436640 6092 541632  0  0  0  82244 622  54  0 11 48 41
    0 1    0 436640 6092 541660  0  0  0      8 152  39  0  0 51 49
    0 1    0 332224 6200 644252  0  0  4 102800 728  46  1 13 49 36
    1 0    0 274492 6260 701056  0  0  4  12328 459  49  0  7 50 43
    0 1    0 211220 6324 763356  0  0  0 106940 515  37  1 10 51 39
    1 0    0 160412 6376 813468  0  0  0   8224 415  43  0  6 49 45
    1 1    0  85980 6452 886556  0  0  4 113516 575  39  1 11 54 34
    0 2    0  85968 6452 886620  0  0  0   1640 158 211  0  0 46 54

    A 10 disk test with btrfs performs 26% faster with per-bdi flushing. A
    SSD based writeback test on XFS performs over 20% better as well, with
    the throughput being very stable around 1GB/sec, where pdflush only
    manages 750MB/sec and fluctuates wildly while doing so. Random buffered
    writes to many files behave a lot better as well, as do random mmap'ed
    writes.

    A separate thread is added to sync the super blocks. In the long term,
    adding sync_supers_bdi() functionality could get rid of this thread again.

    Signed-off-by: Jens Axboe

    Jens Axboe
     

22 Aug, 2009

1 commit

  • In commit a8e7d49aa7be728c4ae241a75a2a124cdcabc0c5 ("Fix race in
    create_empty_buffers() vs __set_page_dirty_buffers()"), I removed a test
    for a NULL page mapping unintentionally when some of the code inside
    __set_page_dirty() was moved to the callers.

    That removal generally didn't matter, since a filesystem would serialize
    truncation (which clears the page mapping) against writing (which marks
    the buffer dirty), so locking at a higher level (either per-page or an
    inode at a time) should mean that the buffer page would be stable. And
    indeed, nothing bad seemed to happen.

    Except it turns out that apparently reiserfs does something odd when
    under load and writing out the journal, and we have a number of bugzilla
    entries that look similar:

    http://bugzilla.kernel.org/show_bug.cgi?id=13556
    http://bugzilla.kernel.org/show_bug.cgi?id=13756
    http://bugzilla.kernel.org/show_bug.cgi?id=13876

    and it looks like reiserfs depended on that check (the common theme
    seems to be "data=journal", and a journal writeback during a truncate).

    I suspect reiserfs should have some additional locking, but in the
    meantime this should get us back to the pre-2.6.29 behavior.

    Pattern-pointed-out-by: Roland Kletzing
    Cc: stable@kernel.org (2.6.29 and 2.6.30)
    Cc: Jeff Mahoney
    Cc: Nick Piggin
    Cc: Al Viro
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

12 Jun, 2009

2 commits

  • * 'for-2.6.31' of git://git.kernel.dk/linux-2.6-block: (153 commits)
    block: add request clone interface (v2)
    floppy: fix hibernation
    ramdisk: remove long-deprecated "ramdisk=" boot-time parameter
    fs/bio.c: add missing __user annotation
    block: prevent possible io_context->refcount overflow
    Add serial number support for virtio_blk, V4a
    block: Add missing bounce_pfn stacking and fix comments
    Revert "block: Fix bounce limit setting in DM"
    cciss: decode unit attention in SCSI error handling code
    cciss: Remove no longer needed sendcmd reject processing code
    cciss: change SCSI error handling routines to work with interrupts enabled.
    cciss: separate error processing and command retrying code in sendcmd_withirq_core()
    cciss: factor out fix target status processing code from sendcmd functions
    cciss: simplify interface of sendcmd() and sendcmd_withirq()
    cciss: factor out core of sendcmd_withirq() for use by SCSI error handling code
    cciss: Use schedule_timeout_uninterruptible in SCSI error handling code
    block: needs to set the residual length of a bidi request
    Revert "block: implement blkdev_readpages"
    block: Fix bounce limit setting in DM
    Removed reference to non-existing file Documentation/PCI/PCI-DMA-mapping.txt
    ...

    Manually fix conflicts with tracing updates in:
    block/blk-sysfs.c
    drivers/ide/ide-atapi.c
    drivers/ide/ide-cd.c
    drivers/ide/ide-floppy.c
    drivers/ide/ide-tape.c
    include/trace/events/block.h
    kernel/trace/blktrace.c

    Linus Torvalds
     
  • * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (49 commits)
    ext4: Avoid corrupting the uninitialized bit in the extent during truncate
    ext4: Don't treat a truncation of a zero-length file as replace-via-truncate
    ext4: fix dx_map_entry to support 256k directory blocks
    ext4: truncate the file properly if we fail to copy data from userspace
    ext4: Avoid leaking blocks after a block allocation failure
    ext4: Change all super.c messages to print the device
    ext4: Get rid of EXTEND_DISKSIZE flag of ext4_get_blocks_handle()
    ext4: super.c whitespace cleanup
    jbd2: Fix minor typos in comments in fs/jbd2/journal.c
    ext4: Clean up calls to ext4_get_group_desc()
    ext4: remove unused function __ext4_write_dirty_metadata
    ext2: Fix memory leak in ext2_fill_super() in case of a failed mount
    ext3: Fix memory leak in ext3_fill_super() in case of a failed mount
    ext4: Fix memory leak in ext4_fill_super() in case of a failed mount
    ext4: down i_data_sem only for read when walking tree for fiemap
    ext4: Add a comprehensive block validity check to ext4_get_blocks()
    ext4: Clean up ext4_get_blocks() so it does not depend on bh_result->b_state
    ext4: Merge ext4_da_get_block_write() into mpage_da_map_blocks()
    ext4: Add BUG_ON debugging checks to noalloc_get_block_write()
    ext4: Add documentation to the ext4_*get_block* functions
    ...

    Linus Torvalds
     

06 Jun, 2009

1 commit

  • The nobh_truncate_page() function is used by ext2, exofs, and jfs. Of
    these three, only the get_block() functions of ext2 and jfs pay attention
    to bh->b_size --- which is normally always the filesystem blocksize
    except when the get_block() function is called by either
    mpage_readpage(), mpage_readpages(), or the direct I/O routines in
    fs/direct-io.c.

    Unfortunately, nobh_truncate_page() does not initialize map_bh before
    calling the filesystem-supplied get_block() function. So ext2 and jfs
    will try to calculate the number of blocks to map by taking stack
    garbage and shifting it right by inode->i_blkbits. This should be
    *mostly* harmless (except the filesystem will do some unneeded work)
    unless the stack garbage is less than the filesystem's blocksize, in which
    case maxblocks will be zero, and the attempt to find out whether or
    not the filesystem has a hole at a given logical block will fail, and
    the page cache entry might not get zeroed out.

    Also if the stack garbage in map_bh->state happens to have the
    BH_Mapped bit set, there could be an attempt to call readpage() on a
    non-existent page, which could cause nobh_truncate_page() to return an
    error when it should not.

    Fix this by initializing map_bh->state and map_bh->size.

    Fortunately, it's probably fairly unlikely that ext2 and jfs users
    mount with nobh these days.

    Signed-off-by: "Theodore Ts'o"
    Cc: Dave Kleikamp
    Cc: linux-fsdevel@vger.kernel.org
    Signed-off-by: Al Viro

    Theodore Ts'o
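
The failure mode with an uninitialized size can be shown with a short sketch. It assumes the block count is derived by shifting the buffer size right by the block-size bits, which is what makes a value smaller than the block size yield zero. `toy_max_blocks()` is a hypothetical helper, not the ext2 or jfs code.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical model of how a get_block()-style function might derive
 * the number of blocks to map from the size field of the buffer head.
 */
static unsigned long toy_max_blocks(size_t b_size, unsigned int blkbits)
{
    assert(blkbits < sizeof(size_t) * 8);
    return (unsigned long)(b_size >> blkbits);
}
```

With 4096-byte blocks (`blkbits` of 12), a properly initialized size of 4096 maps one block, while any leftover stack value below 4096 gives zero blocks, which is the case the commit message warns about.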
     

23 May, 2009

1 commit

  • Until now we have had a 1:1 mapping between storage device physical
    block size and the logical block size used when addressing the device.
    With SATA 4KB drives coming out that will no longer be the case. The
    sector size will be 4KB but the logical block size will remain
    512 bytes. Hence we need to distinguish between the physical block size
    and the logical block size.

    This patch renames hardsect_size to logical_block_size.

    Signed-off-by: Martin K. Petersen
    Signed-off-by: Jens Axboe

    Martin K. Petersen
     

03 May, 2009

1 commit

  • Change page_mkwrite to allow implementations to return with the page
    locked, and also change its callers (in page fault paths) to hold the
    lock until the page is marked dirty. This allows the filesystem to have
    full control of page dirtying events coming from the VM.

    Rather than simply hold the page locked over the page_mkwrite call, we
    call page_mkwrite with the page unlocked and allow implementations to
    return with it locked, so filesystems can avoid LOR conditions with page lock.

    The problem with the current scheme is this: a filesystem that wants to
    associate some metadata with a page as long as the page is dirty, will
    perform this manipulation in its ->page_mkwrite. It currently then must
    return with the page unlocked and may not hold any other locks (according
    to existing page_mkwrite convention).

    In this window, the VM could write out the page, clearing page-dirty. The
    filesystem has no good way to detect that a dirty pte is about to be
    attached, so it will happily write out the page, at which point, the
    filesystem may manipulate the metadata to reflect that the page is no
    longer dirty.

    It is not always possible to perform the required metadata manipulation in
    ->set_page_dirty, because that function cannot block or fail. The
    filesystem may need to allocate some data structure, for example.

    And the VM cannot mark the pte dirty before page_mkwrite, because
    page_mkwrite is allowed to fail, so we must not allow any window where the
    page could be written to if page_mkwrite does fail.

    This solution of holding the page locked over the 3 critical operations
    (page_mkwrite, setting the pte dirty, and finally setting the page dirty)
    closes out races nicely, preventing page cleaning for writeout being
    initiated in that window. This provides the filesystem with a strong
    synchronisation against the VM here.

    - Sage needs this race closed for ceph filesystem.
    - Trond for NFS (http://bugzilla.kernel.org/show_bug.cgi?id=12913).
    - I need it for fsblock.
    - I suspect other filesystems may need it too (eg. btrfs).
    - I have converted buffer.c to the new locking. Even simple block allocation
    under dirty pages might be susceptible to i_size changing under partial page
    at the end of file (we also have a buffer.c-side problem here, but it cannot
    be fixed properly without this patch).
    - Other filesystems (eg. NFS, maybe btrfs) will need to change their
    page_mkwrite functions themselves.

    [ This also moves page_mkwrite another step closer to fault, which should
    eventually allow page_mkwrite to be moved into ->fault, and thus avoiding a
    filesystem calldown and page lock/unlock cycle in __do_fault. ]

    [akpm@linux-foundation.org: fix derefs of NULL ->mapping]
    Cc: Sage Weil
    Cc: Trond Myklebust
    Signed-off-by: Nick Piggin
    Cc: Valdis Kletnieks
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin