22 Sep, 2014

1 commit

  • On 32-bit architectures, the legacy buffer_head functions are not always
    handling the sector number with the proper 64-bit types, and will thus
    fail on 4TB+ disks.

    Any code that uses __getblk() (and thus bread(), breadahead(),
    sb_bread(), sb_breadahead(), sb_getblk()), and calls it using a 64-bit
    block on a 32-bit arch (where "long" is 32-bit) causes an inifinite loop
    in __getblk_slow() with an infinite stream of errors logged to dmesg
    like this:

    __find_get_block_slow() failed. block=6740375944, b_blocknr=2445408648
    b_state=0x00000020, b_size=512
    device sda1 blocksize: 512

    Note how in hex block is 0x191C1F988 and b_blocknr is 0x91C1F988 i.e. the
    top 32-bits are missing (in this case the 0x1 at the top).

    This is because grow_dev_page() is broken and has a 32-bit overflow due
    to shifting the page index value (a pgoff_t - which is just 32 bits on
    32-bit architectures) left-shifted as the block number. But the top
    bits to get lost as the pgoff_t is not type cast to sector_t / 64-bit
    before the shift.

    This patch fixes this issue by type casting "index" to sector_t before
    doing the left shift.

    Note this is not a theoretical bug but has been seen in the field on a
    4TiB hard drive with logical sector size 512 bytes.

    This patch has been verified to fix the infinite loop problem on 3.17-rc5
    kernel using a 4TB disk image mounted using "-o loop". Without this patch
    doing a "find /nt" where /nt is an NTFS volume causes the inifinite loop
    100% reproducibly whilst with the patch it works fine as expected.

    Signed-off-by: Anton Altaparmakov
    Cc: stable@vger.kernel.org
    Signed-off-by: Linus Torvalds

    Anton Altaparmakov
     

16 Jul, 2014

1 commit

  • The current "wait_on_bit" interface requires an 'action'
    function to be provided which does the actual waiting.
    There are over 20 such functions, many of them identical.
    Most cases can be satisfied by one of just two functions, one
    which uses io_schedule() and one which just uses schedule().

    So:
    Rename wait_on_bit and wait_on_bit_lock to
    wait_on_bit_action and wait_on_bit_lock_action
    to make it explicit that they need an action function.

    Introduce new wait_on_bit{,_lock} and wait_on_bit{,_lock}_io
    which are *not* given an action function but implicitly use
    a standard one.
    The decision to error-out if a signal is pending is now made
    based on the 'mode' argument rather than being encoded in the action
    function.

    All instances of the old wait_on_bit and wait_on_bit_lock which
    can use the new version have been changed accordingly and their
    action functions have been discarded.
    wait_on_bit{_lock} does not return any specific error code in the
    event of a signal so the caller must check for non-zero and
    interpolate their own error code as appropriate.

    The wait_on_bit() call in __fscache_wait_on_invalidate() was
    ambiguous as it specified TASK_UNINTERRUPTIBLE but used
    fscache_wait_bit_interruptible as an action function.
    David Howells confirms this should be uniformly
    "uninterruptible"

    The main remaining user of wait_on_bit{,_lock}_action is NFS
    which needs to use a freezer-aware schedule() call.

    A comment in fs/gfs2/glock.c notes that having multiple 'action'
    functions is useful as they display differently in the 'wchan'
    field of 'ps'. (and /proc/$PID/wchan).
    As the new bit_wait{,_io} functions are tagged "__sched", they
    will not show up at all, but something higher in the stack. So
    the distinction will still be visible, only with different
    function names (gds2_glock_wait versus gfs2_glock_dq_wait in the
    gfs2/glock.c case).

    Since first version of this patch (against 3.15) two new action
    functions appeared, on in NFS and one in CIFS. CIFS also now
    uses an action function that makes the same freezer aware
    schedule call as NFS.

    Signed-off-by: NeilBrown
    Acked-by: David Howells (fscache, keys)
    Acked-by: Steven Whitehouse (gfs2)
    Acked-by: Peter Zijlstra
    Cc: Oleg Nesterov
    Cc: Steve French
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/20140707051603.28027.72349.stgit@notabene.brown
    Signed-off-by: Ingo Molnar

    NeilBrown
     

05 Jun, 2014

3 commits

  • aops->write_begin may allocate a new page and make it visible only to have
    mark_page_accessed called almost immediately after. Once the page is
    visible the atomic operations are necessary which is noticable overhead
    when writing to an in-memory filesystem like tmpfs but should also be
    noticable with fast storage. The objective of the patch is to initialse
    the accessed information with non-atomic operations before the page is
    visible.

    The bulk of filesystems directly or indirectly use
    grab_cache_page_write_begin or find_or_create_page for the initial
    allocation of a page cache page. This patch adds an init_page_accessed()
    helper which behaves like the first call to mark_page_accessed() but may
    called before the page is visible and can be done non-atomically.

    The primary APIs of concern in this care are the following and are used
    by most filesystems.

    find_get_page
    find_lock_page
    find_or_create_page
    grab_cache_page_nowait
    grab_cache_page_write_begin

    All of them are very similar in detail to the patch creates a core helper
    pagecache_get_page() which takes a flags parameter that affects its
    behavior such as whether the page should be marked accessed or not. Then
    old API is preserved but is basically a thin wrapper around this core
    function.

    Each of the filesystems are then updated to avoid calling
    mark_page_accessed when it is known that the VM interfaces have already
    done the job. There is a slight snag in that the timing of the
    mark_page_accessed() has now changed so in rare cases it's possible a page
    gets to the end of the LRU as PageReferenced where as previously it might
    have been repromoted. This is expected to be rare but it's worth the
    filesystem people thinking about it in case they see a problem with the
    timing change. It is also the case that some filesystems may be marking
    pages accessed that previously did not but it makes sense that filesystems
    have consistent behaviour in this regard.

    The test case used to evaulate this is a simple dd of a large file done
    multiple times with the file deleted on each iterations. The size of the
    file is 1/10th physical memory to avoid dirty page balancing. In the
    async case it will be possible that the workload completes without even
    hitting the disk and will have variable results but highlight the impact
    of mark_page_accessed for async IO. The sync results are expected to be
    more stable. The exception is tmpfs where the normal case is for the "IO"
    to not hit the disk.

    The test machine was single socket and UMA to avoid any scheduling or NUMA
    artifacts. Throughput and wall times are presented for sync IO, only wall
    times are shown for async as the granularity reported by dd and the
    variability is unsuitable for comparison. As async results were variable
    do to writback timings, I'm only reporting the maximum figures. The sync
    results were stable enough to make the mean and stddev uninteresting.

    The performance results are reported based on a run with no profiling.
    Profile data is based on a separate run with oprofile running.

    async dd
    3.15.0-rc3 3.15.0-rc3
    vanilla accessed-v2
    ext3 Max elapsed 13.9900 ( 0.00%) 11.5900 ( 17.16%)
    tmpfs Max elapsed 0.5100 ( 0.00%) 0.4900 ( 3.92%)
    btrfs Max elapsed 12.8100 ( 0.00%) 12.7800 ( 0.23%)
    ext4 Max elapsed 18.6000 ( 0.00%) 13.3400 ( 28.28%)
    xfs Max elapsed 12.5600 ( 0.00%) 2.0900 ( 83.36%)

    The XFS figure is a bit strange as it managed to avoid a worst case by
    sheer luck but the average figures looked reasonable.

    samples percentage
    ext3 86107 0.9783 vmlinux-3.15.0-rc4-vanilla mark_page_accessed
    ext3 23833 0.2710 vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed
    ext3 5036 0.0573 vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed
    ext4 64566 0.8961 vmlinux-3.15.0-rc4-vanilla mark_page_accessed
    ext4 5322 0.0713 vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed
    ext4 2869 0.0384 vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed
    xfs 62126 1.7675 vmlinux-3.15.0-rc4-vanilla mark_page_accessed
    xfs 1904 0.0554 vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed
    xfs 103 0.0030 vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed
    btrfs 10655 0.1338 vmlinux-3.15.0-rc4-vanilla mark_page_accessed
    btrfs 2020 0.0273 vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed
    btrfs 587 0.0079 vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed
    tmpfs 59562 3.2628 vmlinux-3.15.0-rc4-vanilla mark_page_accessed
    tmpfs 1210 0.0696 vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed
    tmpfs 94 0.0054 vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed

    [akpm@linux-foundation.org: don't run init_page_accessed() against an uninitialised pointer]
    Signed-off-by: Mel Gorman
    Cc: Johannes Weiner
    Cc: Vlastimil Babka
    Cc: Jan Kara
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Theodore Ts'o
    Cc: "Paul E. McKenney"
    Cc: Oleg Nesterov
    Cc: Rik van Riel
    Cc: Peter Zijlstra
    Tested-by: Prabhakar Lad
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Discarding buffers uses a bunch of atomic operations when discarding
    buffers because ...... I can't think of a reason. Use a cmpxchg loop to
    clear all the necessary flags. In most (all?) cases this will be a single
    atomic operations.

    [akpm@linux-foundation.org: move BUFFER_FLAGS_DISCARD into the .c file]
    Signed-off-by: Mel Gorman
    Cc: Johannes Weiner
    Cc: Vlastimil Babka
    Cc: Jan Kara
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Theodore Ts'o
    Cc: "Paul E. McKenney"
    Cc: Oleg Nesterov
    Cc: Rik van Riel
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • The last in-tree caller of block_write_full_page_endio() was removed in
    January 2013. It's time to remove the EXPORT_SYMBOL, which leaves
    block_write_full_page() as the only caller of
    block_write_full_page_endio(), so inline block_write_full_page_endio()
    into block_write_full_page().

    Signed-off-by: Matthew Wilcox
    Cc: Hugh Dickins
    Cc: Dave Chinner
    Cc: Dheeraj Reddy
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     

18 Apr, 2014

1 commit

  • Mostly scripted conversion of the smp_mb__* barriers.

    Signed-off-by: Peter Zijlstra
    Acked-by: Paul E. McKenney
    Link: http://lkml.kernel.org/n/tip-55dhyhocezdw1dg7u19hmh1u@git.kernel.org
    Cc: Linus Torvalds
    Cc: linux-arch@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

13 Apr, 2014

1 commit

  • Pull vfs updates from Al Viro:
    "The first vfs pile, with deep apologies for being very late in this
    window.

    Assorted cleanups and fixes, plus a large preparatory part of iov_iter
    work. There's a lot more of that, but it'll probably go into the next
    merge window - it *does* shape up nicely, removes a lot of
    boilerplate, gets rid of locking inconsistencie between aio_write and
    splice_write and I hope to get Kent's direct-io rewrite merged into
    the same queue, but some of the stuff after this point is having
    (mostly trivial) conflicts with the things already merged into
    mainline and with some I want more testing.

    This one passes LTP and xfstests without regressions, in addition to
    usual beating. BTW, readahead02 in ltp syscalls testsuite has started
    giving failures since "mm/readahead.c: fix readahead failure for
    memoryless NUMA nodes and limit readahead pages" - might be a false
    positive, might be a real regression..."

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (63 commits)
    missing bits of "splice: fix racy pipe->buffers uses"
    cifs: fix the race in cifs_writev()
    ceph_sync_{,direct_}write: fix an oops on ceph_osdc_new_request() failure
    kill generic_file_buffered_write()
    ocfs2_file_aio_write(): switch to generic_perform_write()
    ceph_aio_write(): switch to generic_perform_write()
    xfs_file_buffered_aio_write(): switch to generic_perform_write()
    export generic_perform_write(), start getting rid of generic_file_buffer_write()
    generic_file_direct_write(): get rid of ppos argument
    btrfs_file_aio_write(): get rid of ppos
    kill the 5th argument of generic_file_buffered_write()
    kill the 4th argument of __generic_file_aio_write()
    lustre: don't open-code kernel_recvmsg()
    ocfs2: don't open-code kernel_recvmsg()
    drbd: don't open-code kernel_recvmsg()
    constify blk_rq_map_user_iov() and friends
    lustre: switch to kernel_sendmsg()
    ocfs2: don't open-code kernel_sendmsg()
    take iov_iter stuff to mm/iov_iter.c
    process_vm_access: tidy up a bit
    ...

    Linus Torvalds
     

02 Apr, 2014

1 commit


20 Feb, 2014

1 commit


19 Feb, 2014

1 commit


07 Feb, 2014

1 commit

  • To use spin_{un}lock_irq is dangerous if caller disabled interrupt.
    During aio buffer migration, we have a possibility to see the following
    call stack.

    aio_migratepage [disable interrupt]
    migrate_page_copy
    clear_page_dirty_for_io
    set_page_dirty
    __set_page_dirty_buffers
    __set_page_dirty
    spin_lock_irq

    This mean, current aio migration is a deadlockable. spin_lock_irqsave
    is a safer alternative and we should use it.

    Signed-off-by: KOSAKI Motohiro
    Reported-by: David Rientjes rientjes@google.com>
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     

04 Dec, 2013

1 commit


24 Nov, 2013

1 commit

  • Immutable biovecs are going to require an explicit iterator. To
    implement immutable bvecs, a later patch is going to add a bi_bvec_done
    member to this struct; for now, this patch effectively just renames
    things.

    Signed-off-by: Kent Overstreet
    Cc: Jens Axboe
    Cc: Geert Uytterhoeven
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: "Ed L. Cashin"
    Cc: Nick Piggin
    Cc: Lars Ellenberg
    Cc: Jiri Kosina
    Cc: Matthew Wilcox
    Cc: Geoff Levand
    Cc: Yehuda Sadeh
    Cc: Sage Weil
    Cc: Alex Elder
    Cc: ceph-devel@vger.kernel.org
    Cc: Joshua Morris
    Cc: Philip Kelleher
    Cc: Rusty Russell
    Cc: "Michael S. Tsirkin"
    Cc: Konrad Rzeszutek Wilk
    Cc: Jeremy Fitzhardinge
    Cc: Neil Brown
    Cc: Alasdair Kergon
    Cc: Mike Snitzer
    Cc: dm-devel@redhat.com
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Cc: linux390@de.ibm.com
    Cc: Boaz Harrosh
    Cc: Benny Halevy
    Cc: "James E.J. Bottomley"
    Cc: Greg Kroah-Hartman
    Cc: "Nicholas A. Bellinger"
    Cc: Alexander Viro
    Cc: Chris Mason
    Cc: "Theodore Ts'o"
    Cc: Andreas Dilger
    Cc: Jaegeuk Kim
    Cc: Steven Whitehouse
    Cc: Dave Kleikamp
    Cc: Joern Engel
    Cc: Prasad Joshi
    Cc: Trond Myklebust
    Cc: KONISHI Ryusuke
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Ben Myers
    Cc: xfs@oss.sgi.com
    Cc: Steven Rostedt
    Cc: Frederic Weisbecker
    Cc: Ingo Molnar
    Cc: Len Brown
    Cc: Pavel Machek
    Cc: "Rafael J. Wysocki"
    Cc: Herton Ronaldo Krzesinski
    Cc: Ben Hutchings
    Cc: Andrew Morton
    Cc: Guo Chao
    Cc: Tejun Heo
    Cc: Asai Thambi S P
    Cc: Selvan Mani
    Cc: Sam Bradshaw
    Cc: Wei Yongjun
    Cc: "Roger Pau Monné"
    Cc: Jan Beulich
    Cc: Stefano Stabellini
    Cc: Ian Campbell
    Cc: Sebastian Ott
    Cc: Christian Borntraeger
    Cc: Minchan Kim
    Cc: Jiang Liu
    Cc: Nitin Gupta
    Cc: Jerome Marchand
    Cc: Joe Perches
    Cc: Peng Tao
    Cc: Andy Adamson
    Cc: fanchaoting
    Cc: Jie Liu
    Cc: Sunil Mushran
    Cc: "Martin K. Petersen"
    Cc: Namjae Jeon
    Cc: Pankaj Kumar
    Cc: Dan Magenheimer
    Cc: Mel Gorman 6

    Kent Overstreet
     

17 Oct, 2013

1 commit

  • Buffer allocation has a very crude indefinite loop around waking the
    flusher threads and performing global NOFS direct reclaim because it can
    not handle allocation failures.

    The most immediate problem with this is that the allocation may fail due
    to a memory cgroup limit, where flushers + direct reclaim might not make
    any progress towards resolving the situation at all. Because unlike the
    global case, a memory cgroup may not have any cache at all, only
    anonymous pages but no swap. This situation will lead to a reclaim
    livelock with insane IO from waking the flushers and thrashing unrelated
    filesystem cache in a tight loop.

    Use __GFP_NOFAIL allocations for buffers for now. This makes sure that
    any looping happens in the page allocator, which knows how to
    orchestrate kswapd, direct reclaim, and the flushers sensibly. It also
    allows memory cgroups to detect allocations that can't handle failure
    and will allow them to ultimately bypass the limit if reclaim can not
    make progress.

    Reported-by: azurIt
    Signed-off-by: Johannes Weiner
    Cc: Michal Hocko
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

04 Jul, 2013

1 commit

  • Page reclaim keeps track of dirty and under writeback pages and uses it
    to determine if wait_iff_congested() should stall or if kswapd should
    begin writing back pages. This fails to account for buffer pages that
    can be under writeback but not PageWriteback which is the case for
    filesystems like ext3 ordered mode. Furthermore, PageDirty buffer pages
    can have all the buffers clean and writepage does no IO so it should not
    be accounted as congested.

    This patch adds an address_space operation that filesystems may
    optionally use to check if a page is really dirty or really under
    writeback. An implementation is provided for for buffer_heads is added
    and used for block operations and ext3 in ordered mode. By default the
    page flags are obeyed.

    Credit goes to Jan Kara for identifying that the page flags alone are
    not sufficient for ext3 and sanity checking a number of ideas on how the
    problem could be addressed.

    Signed-off-by: Mel Gorman
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Rik van Riel
    Cc: KAMEZAWA Hiroyuki
    Cc: Jiri Slaby
    Cc: Valdis Kletnieks
    Cc: Zlatko Calusic
    Cc: dormando
    Cc: Trond Myklebust
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

22 May, 2013

1 commit

  • Currently there is no way to truncate partial page where the end
    truncate point is not at the end of the page. This is because it was not
    needed and the functionality was enough for file system truncate
    operation to work properly. However more file systems now support punch
    hole feature and it can benefit from mm supporting truncating page just
    up to the certain point.

    Specifically, with this functionality truncate_inode_pages_range() can
    be changed so it supports truncating partial page at the end of the
    range (currently it will BUG_ON() if 'end' is not at the end of the
    page).

    This commit changes the invalidatepage() address space operation
    prototype to accept range to be invalidated and update all the instances
    for it.

    We also change the block_invalidatepage() in the same way and actually
    make a use of the new length argument implementing range invalidation.

    Actual file system implementations will follow except the file systems
    where the changes are really simple and should not change the behaviour
    in any way .Implementation for truncate_page_range() which will be able
    to accept page unaligned ranges will follow as well.

    Signed-off-by: Lukas Czerner
    Cc: Andrew Morton
    Cc: Hugh Dickins

    Lukas Czerner
     

09 May, 2013

1 commit

  • Pull block core updates from Jens Axboe:

    - Major bit is Kents prep work for immutable bio vecs.

    - Stable candidate fix for a scheduling-while-atomic in the queue
    bypass operation.

    - Fix for the hang on exceeded rq->datalen 32-bit unsigned when merging
    discard bios.

    - Tejuns changes to convert the writeback thread pool to the generic
    workqueue mechanism.

    - Runtime PM framework, SCSI patches exists on top of these in James'
    tree.

    - A few random fixes.

    * 'for-3.10/core' of git://git.kernel.dk/linux-block: (40 commits)
    relay: move remove_buf_file inside relay_close_buf
    partitions/efi.c: replace useless kzalloc's by kmalloc's
    fs/block_dev.c: fix iov_shorten() criteria in blkdev_aio_read()
    block: fix max discard sectors limit
    blkcg: fix "scheduling while atomic" in blk_queue_bypass_start
    Documentation: cfq-iosched: update documentation help for cfq tunables
    writeback: expose the bdi_wq workqueue
    writeback: replace custom worker pool implementation with unbound workqueue
    writeback: remove unused bdi_pending_list
    aoe: Fix unitialized var usage
    bio-integrity: Add explicit field for owner of bip_buf
    block: Add an explicit bio flag for bios that own their bvec
    block: Add bio_alloc_pages()
    block: Convert some code to bio_for_each_segment_all()
    block: Add bio_for_each_segment_all()
    bounce: Refactor __blk_queue_bounce to not use bi_io_vec
    raid1: use bio_copy_data()
    pktcdvd: Use bio_reset() in disabled code to kill bi_idx usage
    pktcdvd: use bio_copy_data()
    block: Add bio_copy_data()
    ...

    Linus Torvalds
     

01 May, 2013

1 commit

  • Pull ext4 updates from Ted Ts'o:
    "Mostly performance and bug fixes, plus some cleanups. The one new
    feature this merge window is a new ioctl EXT4_IOC_SWAP_BOOT which
    allows installation of a hidden inode designed for boot loaders."

    * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (50 commits)
    ext4: fix type-widening bug in inode table readahead code
    ext4: add check for inodes_count overflow in new resize ioctl
    ext4: fix Kconfig documentation for CONFIG_EXT4_DEBUG
    ext4: fix online resizing for ext3-compat file systems
    jbd2: trace when lock_buffer in do_get_write_access takes a long time
    ext4: mark metadata blocks using bh flags
    buffer: add BH_Prio and BH_Meta flags
    ext4: mark all metadata I/O with REQ_META
    ext4: fix readdir error in case inline_data+^dir_index.
    ext4: fix readdir error in the case of inline_data+dir_index
    jbd2: use kmem_cache_zalloc instead of kmem_cache_alloc/memset
    ext4: mext_insert_extents should update extent block checksum
    ext4: move quota initialization out of inode allocation transaction
    ext4: reserve xattr index for Rich ACL support
    jbd2: reduce journal_head size
    ext4: clear buffer_uninit flag when submitting IO
    ext4: use io_end for multiple bios
    ext4: make ext4_bio_write_page() use BH_Async_Write flags
    ext4: Use kstrtoul() instead of parse_strtoul()
    ext4: defragmentation code cleanup
    ...

    Linus Torvalds
     

30 Apr, 2013

2 commits

  • bh allocation uses kmem_cache_zalloc() so we needn't call
    'init_buffer(bh, NULL, NULL)' and perform other set-zero-operations.

    Signed-off-by: Jianpeng Ma
    Cc: Jan Kara
    Cc: Theodore Ts'o
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    majianpeng
     
  • Walking a bio's page mappings has proved problematic, so create a new
    bio flag to indicate that a bio's data needs to be snapshotted in order
    to guarantee stable pages during writeback. Next, for the one user
    (ext3/jbd) of snapshotting, hook all the places where writes can be
    initiated without PG_writeback set, and set BIO_SNAP_STABLE there.

    We must also flag journal "metadata" bios for stable writeout, since
    file data can be written through the journal. Finally, the
    MS_SNAP_STABLE mount flag (only used by ext3) is now superfluous, so get
    rid of it.

    [akpm@linux-foundation.org: rename _submit_bh()'s `flags' to `bio_flags', delobotomize the _submit_bh declaration]
    [akpm@linux-foundation.org: teeny cleanup]
    Signed-off-by: Darrick J. Wong
    Cc: Andy Lutomirski
    Cc: Adrian Hunter
    Cc: Artem Bityutskiy
    Reviewed-by: Jan Kara
    Cc: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Darrick J. Wong
     

21 Apr, 2013

1 commit


24 Mar, 2013

1 commit

  • For immutable bvecs, all bi_idx usage needs to be audited - so here
    we're removing all the unnecessary uses.

    Most of these are places where it was being initialized on a bio that
    was just allocated, a few others are conversions to standard macros.

    Signed-off-by: Kent Overstreet
    CC: Jens Axboe

    Kent Overstreet
     

01 Mar, 2013

1 commit

  • Pull block IO core bits from Jens Axboe:
    "Below are the core block IO bits for 3.9. It was delayed a few days
    since my workstation kept crashing every 2-8h after pulling it into
    current -git, but turns out it is a bug in the new pstate code (divide
    by zero, will report separately). In any case, it contains:

    - The big cfq/blkcg update from Tejun and and Vivek.

    - Additional block and writeback tracepoints from Tejun.

    - Improvement of the should sort (based on queues) logic in the plug
    flushing.

    - _io() variants of the wait_for_completion() interface, using
    io_schedule() instead of schedule() to contribute to io wait
    properly.

    - Various little fixes.

    You'll get two trivial merge conflicts, which should be easy enough to
    fix up"

    Fix up the trivial conflicts due to hlist traversal cleanups (commit
    b67bfe0d42ca: "hlist: drop the node parameter from iterators").

    * 'for-3.9/core' of git://git.kernel.dk/linux-block: (39 commits)
    block: remove redundant check to bd_openers()
    block: use i_size_write() in bd_set_size()
    cfq: fix lock imbalance with failed allocations
    drivers/block/swim3.c: fix null pointer dereference
    block: don't select PERCPU_RWSEM
    block: account iowait time when waiting for completion of IO request
    sched: add wait_for_completion_io[_timeout]
    writeback: add more tracepoints
    block: add block_{touch|dirty}_buffer tracepoint
    buffer: make touch_buffer() an exported function
    block: add @req to bio_{front|back}_merge tracepoints
    block: add missing block_bio_complete() tracepoint
    block: Remove should_sort judgement when flush blk_plug
    block,elevator: use new hashtable implementation
    cfq-iosched: add hierarchical cfq_group statistics
    cfq-iosched: collect stats from dead cfqgs
    cfq-iosched: separate out cfqg_stats_reset() from cfq_pd_reset_stats()
    blkcg: make blkcg_print_blkgs() grab q locks instead of blkcg lock
    block: RCU free request_queue
    blkcg: implement blkg_[rw]stat_recursive_sum() and blkg_[rw]stat_merge()
    ...

    Linus Torvalds
     

27 Feb, 2013

1 commit

  • Pull vfs pile (part one) from Al Viro:
    "Assorted stuff - cleaning namei.c up a bit, fixing ->d_name/->d_parent
    locking violations, etc.

    The most visible changes here are death of FS_REVAL_DOT (replaced with
    "has ->d_weak_revalidate()") and a new helper getting from struct file
    to inode. Some bits of preparation to xattr method interface changes.

    Misc patches by various people sent this cycle *and* ocfs2 fixes from
    several cycles ago that should've been upstream right then.

    PS: the next vfs pile will be xattr stuff."

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (46 commits)
    saner proc_get_inode() calling conventions
    proc: avoid extra pde_put() in proc_fill_super()
    fs: change return values from -EACCES to -EPERM
    fs/exec.c: make bprm_mm_init() static
    ocfs2/dlm: use GFP_ATOMIC inside a spin_lock
    ocfs2: fix possible use-after-free with AIO
    ocfs2: Fix oops in ocfs2_fast_symlink_readpage() code path
    get_empty_filp()/alloc_file() leave both ->f_pos and ->f_version zero
    target: writev() on single-element vector is pointless
    export kernel_write(), convert open-coded instances
    fs: encode_fh: return FILEID_INVALID if invalid fid_type
    kill f_vfsmnt
    vfs: kill FS_REVAL_DOT by adding a d_weak_revalidate dentry op
    nfsd: handle vfs_getattr errors in acl protocol
    switch vfs_getattr() to struct path
    default SET_PERSONALITY() in linux/elf.h
    ceph: prepopulate inodes only when request is aborted
    d_hash_and_lookup(): export, switch open-coded instances
    9p: switch v9fs_set_create_acl() to inode+fid, do it before d_instantiate()
    9p: split dropping the acls from v9fs_set_create_acl()
    ...

    Linus Torvalds
     

24 Feb, 2013

1 commit


23 Feb, 2013

1 commit


22 Feb, 2013

1 commit

  • Create a helper function to check if a backing device requires stable
    page writes and, if so, performs the necessary wait. Then, make it so
    that all points in the memory manager that handle making pages writable
    use the helper function. This should provide stable page write support
    to most filesystems, while eliminating unnecessary waiting for devices
    that don't require the feature.

    Before this patchset, all filesystems would block, regardless of whether
    or not it was necessary. ext3 would wait, but still generate occasional
    checksum errors. The network filesystems were left to do their own
    thing, so they'd wait too.

    After this patchset, all the disk filesystems except ext3 and btrfs will
    wait only if the hardware requires it. ext3 (if necessary) snapshots
    pages instead of blocking, and btrfs provides its own bdi so the mm will
    never wait. Network filesystems haven't been touched, so either they
    provide their own stable page guarantees or they don't block at all.
    The blocking behavior is back to what it was before 3.0 if you don't
    have a disk requiring stable page writes.

    Here's the result of using dbench to test latency on ext2:

    3.8.0-rc3:
    Operation Count AvgLat MaxLat
    ----------------------------------------
    WriteX 109347 0.028 59.817
    ReadX 347180 0.004 3.391
    Flush 15514 29.828 287.283

    Throughput 57.429 MB/sec 4 clients 4 procs max_latency=287.290 ms

    3.8.0-rc3 + patches:
    WriteX 105556 0.029 4.273
    ReadX 335004 0.005 4.112
    Flush 14982 30.540 298.634

    Throughput 55.4496 MB/sec 4 clients 4 procs max_latency=298.650 ms

    As you can see, the maximum write latency drops considerably with this
    patch enabled. The other filesystems (ext3/ext4/xfs/btrfs) behave
    similarly, but see the cover letter for those results.

    Signed-off-by: Darrick J. Wong
    Acked-by: Steven Whitehouse
    Reviewed-by: Jan Kara
    Cc: Adrian Hunter
    Cc: Andy Lutomirski
    Cc: Artem Bityutskiy
    Cc: Joel Becker
    Cc: Mark Fasheh
    Cc: Jens Axboe
    Cc: Eric Van Hensbergen
    Cc: Ron Minnich
    Cc: Latchesar Ionkov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Darrick J. Wong
     

15 Jan, 2013

1 commit

  • Andrew Morton pointed this out a month ago, and then I completely forgot
    about it.

    If we read a partial last page of a block device, we will zero out the
    end of the page, but since that page can then be mapped into user space,
    we should also make sure to flush the cache on architectures that have
    virtual caches. We have the flush_dcache_page() function for this, so
    use it.

    Now, in practice this really never matters, because nobody sane uses
    virtual caches to begin with, and they largely exist on old broken RISC
    arhitectures.

    And even if you did run on one of those obsolete CPU's, the whole "mmap
    and access the last partial page of a block device" behavior probably
    doesn't actually exist. The normal IO functions (read/write) will never
    see the zeroed-out part of the page that migth not be coherent in the
    cache, because they honor the size of the device.

    So I'm marking this for stable (3.7 only), but I'm not sure anybody will
    ever care.

    Pointed-out-by: Andrew Morton
    Cc: stable@vger.kernel.org # 3.7
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

14 Jan, 2013

2 commits

  • The former is triggered from touch_buffer() and the latter
    mark_buffer_dirty().

    This is part of tracepoint additions to improve visiblity into
    dirtying / writeback operations for io tracer and userland.

    v2: Transformed writeback_dirty_buffer to block_dirty_buffer and made
    it share TP definition with block_touch_buffer.

    Signed-off-by: Tejun Heo
    Cc: Fengguang Wu
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • We want to add a trace point to touch_buffer() but macros and inline
    functions defined in header files can't have tracing points. Move
    touch_buffer() to fs/buffer.c and make it a proper function.

    The new exported function is also declared inline. As most uses of
    touch_buffer() are inside buffer.c with nilfs2 as the only other user,
    the effect of this change should be negligible.

    Signed-off-by: Tejun Heo
    Cc: Steven Rostedt
    Signed-off-by: Jens Axboe

    Tejun Heo
     

13 Dec, 2012

2 commits


12 Dec, 2012

1 commit

  • Overhaul struct address_space.assoc_mapping renaming it to
    address_space.private_data and its type is redefined to void*. By this
    approach we consistently name the .private_* elements from struct
    address_space as well as allow extended usage for address_space
    association with other data structures through ->private_data.

    Also, all users of old ->assoc_mapping element are converted to reflect
    its new name and type change (->private_data).

    Signed-off-by: Rafael Aquini
    Cc: Rusty Russell
    Cc: "Michael S. Tsirkin"
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Andi Kleen
    Cc: Konrad Rzeszutek Wilk
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rafael Aquini
     

06 Dec, 2012

1 commit


05 Dec, 2012

1 commit

  • The block device access simplification that avoided accessing the (racy)
    block size information (commit bbec0270bdd8: "blkdev_max_block: make
    private to fs/buffer.c") no longer checks the maximum block size in the
    block mapping path.

    That was _almost_ as simple as just removing the code entirely, because
    the readers and writers all check the size of the device anyway, so
    under normal circumstances it "just worked".

    However, the block size may be such that the end of the device may
    straddle one single buffer_head. At which point we may still want to
    access the end of the device, but the buffer we use to access it
    partially extends past the end.

    The 'bd_set_size()' function intentionally sets the block size to avoid
    this, but mounting the device - or setting the block size by hand to
    some other value - can modify that block size.

    So instead, teach 'submit_bh()' about the special case of the buffer
    head straddling the end of the device, and turning such an access into a
    smaller IO access, avoiding the problem.

    This, btw, also means that unlike before, we can now access the whole
    device regardless of device block size setting. So now, even if the
    device size is only 512-byte aligned, we can read and write even the
    last sector even when having a much bigger block size for accessing the
    rest of the device.

    So with this, we could now get rid of the 'bd_set_size()' block size
    code entirely - resulting in faster IO for the common case - but that
    would be a separate patch.

    Reported-and-tested-by: Romain Francoise
    Reporeted-and-tested-by: Meelis Roos
    Reported-by: Tony Luck
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

30 Nov, 2012

2 commits

  • We really don't want to look at the block size for the raw block device
    accesses in fs/block-dev.c, because it may be changing from under us.
    So get rid of the max_block logic entirely, since the caller should
    already have done it anyway.

    That leaves the only user of this function in fs/buffer.c, so move the
    whole function there and make it static.

    Signed-off-by: Linus Torvalds

    Linus Torvalds
     
  • This makes the buffer size handling be a per-page thing, which allows us
    to not have to worry about locking too much when changing the buffer
    size. If a page doesn't have buffers, we still need to read the block
    size from the inode, but we can do that with ACCESS_ONCE(), so that even
    if the size is changing, we get a consistent value.

    This doesn't convert all functions - many of the buffer functions are
    used purely by filesystems, which in turn results in the buffer size
    being fixed at mount-time. So they don't have the same consistency
    issues that the raw device access can have.

    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

08 Oct, 2012

1 commit

  • Pull ext4 updates from Ted Ts'o:
    "The big new feature added this time is supporting online resizing
    using the meta_bg feature. This allows us to resize file systems
    which are greater than 16TB. In addition, the speed of online
    resizing has been improved in general.

    We also fix a number of races, some of which could lead to deadlocks,
    in ext4's Asynchronous I/O and online defrag support, thanks to good
    work by Dmitry Monakhov.

    There are also a large number of more minor bug fixes and cleanups
    from a number of other ext4 contributors, quite of few of which have
    submitted fixes for the first time."

    * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (69 commits)
    ext4: fix ext4_flush_completed_IO wait semantics
    ext4: fix mtime update in nodelalloc mode
    ext4: fix ext_remove_space for punch_hole case
    ext4: punch_hole should wait for DIO writers
    ext4: serialize truncate with owerwrite DIO workers
    ext4: endless truncate due to nonlocked dio readers
    ext4: serialize unlocked dio reads with truncate
    ext4: serialize dio nonlocked reads with defrag workers
    ext4: completed_io locking cleanup
    ext4: fix unwritten counter leakage
    ext4: give i_aiodio_unwritten a more appropriate name
    ext4: ext4_inode_info diet
    ext4: convert to use leXX_add_cpu()
    ext4: ext4_bread usage audit
    fs: reserve fallocate flag codepoint
    ext4: remove redundant offset check in mext_check_arguments()
    ext4: don't clear orphan list on ro mount with errors
    jbd2: fix assertion failure in commit code due to lacking transaction credits
    ext4: release donor reference when EXT4_IOC_MOVE_EXT ioctl fails
    ext4: enable FITRIM ioctl on bigalloc file system
    ...

    Linus Torvalds
     

01 Oct, 2012

1 commit

  • Commits 5e8830dc85d0 and 41c4d25f78c0 introduced a regression into
    v3.6-rc1 for ext4 in nodealloc mode, such that mtime updates would not
    take place for files modified via mmap if the page was already in the
    page cache. This would also affect ext3 file systems mounted using
    the ext4 file system driver.

    The problem was that ext4_page_mkwrite() had a shortcut which would
    avoid calling __block_page_mkwrite() under some circumstances, and the
    above two commit transferred the responsibility of calling
    file_update_time() to __block_page_mkwrite --- which woudln't get
    called in some circumstances.

    Since __block_page_mkwrite() only has three callers,
    block_page_mkwrite(), ext4_page_mkwrite, and nilfs_page_mkwrite(), the
    best way to solve this is to move the responsibility for calling
    file_update_time() to its caller.

    This problem was found via xfstests #215 with a file system mounted
    with -o nodelalloc.

    Signed-off-by: "Theodore Ts'o"
    Reviewed-by: Jan Kara
    Cc: KONISHI Ryusuke
    Cc: stable@vger.kernel.org

    Theodore Ts'o
     

23 Aug, 2012

1 commit

  • Commit 91f68c89d8f3 ("block: fix infinite loop in __getblk_slow")
    is not good: a successful call to grow_buffers() cannot guarantee
    that the page won't be reclaimed before the immediate next call to
    __find_get_block(), which is why there was always a loop there.

    Yesterday I got "EXT4-fs error (device loop0): __ext4_get_inode_loc:3595:
    inode #19278: block 664: comm cc1: unable to read itable block" on console,
    which pointed to this commit.

    I've been trying to bisect for weeks, why kbuild-on-ext4-on-loop-on-tmpfs
    sometimes fails from a missing header file, under memory pressure on
    ppc G5. I've never seen this on x86, and I've never seen it on 3.5-rc7
    itself, despite that commit being in there: bisection pointed to an
    irrelevant pinctrl merge, but hard to tell when failure takes between
    18 minutes and 38 hours (but so far it's happened quicker on 3.6-rc2).

    (I've since found such __ext4_get_inode_loc errors in /var/log/messages
    from previous weeks: why the message never appeared on console until
    yesterday morning is a mystery for another day.)

    Revert 91f68c89d8f3, restoring __getblk_slow() to how it was (plus
    a checkpatch nitfix). Simplify the interface between grow_buffers()
    and grow_dev_page(), and avoid the infinite loop beyond end of device
    by instead checking init_page_buffers()'s end_block there (I presume
    that's more efficient than a repeated call to blkdev_max_block()),
    returning -ENXIO to __getblk_slow() in that case.

    And remove akpm's ten-year-old "__getblk() cannot fail ... weird"
    comment, but that is worrying: are all users of __getblk() really
    now prepared for a NULL bh beyond end of device, or will some oops??

    Signed-off-by: Hugh Dickins
    Cc: stable@vger.kernel.org # 3.0 3.2 3.4 3.5
    Signed-off-by: Jens Axboe

    Hugh Dickins