13 Apr, 2014

1 commit

  • Pull vfs updates from Al Viro:
    "The first vfs pile, with deep apologies for being very late in this
    window.

    Assorted cleanups and fixes, plus a large preparatory part of iov_iter
    work. There's a lot more of that, but it'll probably go into the next
    merge window - it *does* shape up nicely, removes a lot of
    boilerplate, gets rid of locking inconsistencies between aio_write and
    splice_write and I hope to get Kent's direct-io rewrite merged into
    the same queue, but some of the stuff after this point is having
    (mostly trivial) conflicts with the things already merged into
    mainline, and for some of it I want more testing.

    This one passes LTP and xfstests without regressions, in addition to
    usual beating. BTW, readahead02 in ltp syscalls testsuite has started
    giving failures since "mm/readahead.c: fix readahead failure for
    memoryless NUMA nodes and limit readahead pages" - might be a false
    positive, might be a real regression..."

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (63 commits)
    missing bits of "splice: fix racy pipe->buffers uses"
    cifs: fix the race in cifs_writev()
    ceph_sync_{,direct_}write: fix an oops on ceph_osdc_new_request() failure
    kill generic_file_buffered_write()
    ocfs2_file_aio_write(): switch to generic_perform_write()
    ceph_aio_write(): switch to generic_perform_write()
    xfs_file_buffered_aio_write(): switch to generic_perform_write()
    export generic_perform_write(), start getting rid of generic_file_buffer_write()
    generic_file_direct_write(): get rid of ppos argument
    btrfs_file_aio_write(): get rid of ppos
    kill the 5th argument of generic_file_buffered_write()
    kill the 4th argument of __generic_file_aio_write()
    lustre: don't open-code kernel_recvmsg()
    ocfs2: don't open-code kernel_recvmsg()
    drbd: don't open-code kernel_recvmsg()
    constify blk_rq_map_user_iov() and friends
    lustre: switch to kernel_sendmsg()
    ocfs2: don't open-code kernel_sendmsg()
    take iov_iter stuff to mm/iov_iter.c
    process_vm_access: tidy up a bit
    ...

    Linus Torvalds
     

07 Feb, 2014

1 commit

  • Using spin_{un}lock_irq is dangerous if the caller has already disabled
    interrupts. During aio buffer migration, we can see the following call
    stack.

    aio_migratepage [disable interrupt]
    migrate_page_copy
    clear_page_dirty_for_io
    set_page_dirty
    __set_page_dirty_buffers
    __set_page_dirty
    spin_lock_irq

    This means the current aio migration code can deadlock. spin_lock_irqsave
    is a safer alternative and we should use it.
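
    A minimal sketch of the two locking patterns (the lock and helpers are
    hypothetical, for illustration only; this is not the actual
    __set_page_dirty() code):

    #include <linux/spinlock.h>

    static DEFINE_SPINLOCK(demo_lock);

    static void demo_irq_unsafe(void)
    {
            spin_lock_irq(&demo_lock);
            /* ... critical section ... */
            spin_unlock_irq(&demo_lock);    /* unconditionally re-enables IRQs */
    }

    static void demo_irq_safe(void)
    {
            unsigned long flags;

            spin_lock_irqsave(&demo_lock, flags);
            /* ... critical section ... */
            spin_unlock_irqrestore(&demo_lock, flags); /* restores caller's state */
    }

    If demo_irq_unsafe() is reached from a context that has already disabled
    interrupts, as in the call stack above, the unlock re-enables them behind
    the caller's back; the irqsave/irqrestore pair preserves the previous
    state instead.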

    Signed-off-by: KOSAKI Motohiro
    Reported-by: David Rientjes
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     

24 Nov, 2013

1 commit

  • Immutable biovecs are going to require an explicit iterator. To
    implement immutable bvecs, a later patch is going to add a bi_bvec_done
    member to this struct; for now, this patch effectively just renames
    things.
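
    For reference, the iterator this series builds toward ends up looking
    roughly like the following (sketched from the later definition in the
    block headers, not quoted from this patch):

    struct bvec_iter {
            sector_t        bi_sector;      /* device address in 512-byte sectors */
            unsigned int    bi_size;        /* residual I/O count, in bytes */
            unsigned int    bi_idx;         /* current index into bi_io_vec */
            unsigned int    bi_bvec_done;   /* bytes completed in the current bvec */
    };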

    Signed-off-by: Kent Overstreet
    Cc: Jens Axboe
    Cc: Geert Uytterhoeven
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: "Ed L. Cashin"
    Cc: Nick Piggin
    Cc: Lars Ellenberg
    Cc: Jiri Kosina
    Cc: Matthew Wilcox
    Cc: Geoff Levand
    Cc: Yehuda Sadeh
    Cc: Sage Weil
    Cc: Alex Elder
    Cc: ceph-devel@vger.kernel.org
    Cc: Joshua Morris
    Cc: Philip Kelleher
    Cc: Rusty Russell
    Cc: "Michael S. Tsirkin"
    Cc: Konrad Rzeszutek Wilk
    Cc: Jeremy Fitzhardinge
    Cc: Neil Brown
    Cc: Alasdair Kergon
    Cc: Mike Snitzer
    Cc: dm-devel@redhat.com
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Cc: linux390@de.ibm.com
    Cc: Boaz Harrosh
    Cc: Benny Halevy
    Cc: "James E.J. Bottomley"
    Cc: Greg Kroah-Hartman
    Cc: "Nicholas A. Bellinger"
    Cc: Alexander Viro
    Cc: Chris Mason
    Cc: "Theodore Ts'o"
    Cc: Andreas Dilger
    Cc: Jaegeuk Kim
    Cc: Steven Whitehouse
    Cc: Dave Kleikamp
    Cc: Joern Engel
    Cc: Prasad Joshi
    Cc: Trond Myklebust
    Cc: KONISHI Ryusuke
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Ben Myers
    Cc: xfs@oss.sgi.com
    Cc: Steven Rostedt
    Cc: Frederic Weisbecker
    Cc: Ingo Molnar
    Cc: Len Brown
    Cc: Pavel Machek
    Cc: "Rafael J. Wysocki"
    Cc: Herton Ronaldo Krzesinski
    Cc: Ben Hutchings
    Cc: Andrew Morton
    Cc: Guo Chao
    Cc: Tejun Heo
    Cc: Asai Thambi S P
    Cc: Selvan Mani
    Cc: Sam Bradshaw
    Cc: Wei Yongjun
    Cc: "Roger Pau Monné"
    Cc: Jan Beulich
    Cc: Stefano Stabellini
    Cc: Ian Campbell
    Cc: Sebastian Ott
    Cc: Christian Borntraeger
    Cc: Minchan Kim
    Cc: Jiang Liu
    Cc: Nitin Gupta
    Cc: Jerome Marchand
    Cc: Joe Perches
    Cc: Peng Tao
    Cc: Andy Adamson
    Cc: fanchaoting
    Cc: Jie Liu
    Cc: Sunil Mushran
    Cc: "Martin K. Petersen"
    Cc: Namjae Jeon
    Cc: Pankaj Kumar
    Cc: Dan Magenheimer
    Cc: Mel Gorman

    Kent Overstreet
     

17 Oct, 2013

1 commit

  • Buffer allocation has a very crude indefinite loop around waking the
    flusher threads and performing global NOFS direct reclaim because it can
    not handle allocation failures.

    The most immediate problem with this is that the allocation may fail due
    to a memory cgroup limit, where flushers + direct reclaim might not make
    any progress towards resolving the situation at all. Because unlike the
    global case, a memory cgroup may not have any cache at all, only
    anonymous pages but no swap. This situation will lead to a reclaim
    livelock with insane IO from waking the flushers and thrashing unrelated
    filesystem cache in a tight loop.

    Use __GFP_NOFAIL allocations for buffers for now. This makes sure that
    any looping happens in the page allocator, which knows how to
    orchestrate kswapd, direct reclaim, and the flushers sensibly. It also
    allows memory cgroups to detect allocations that can't handle failure
    and will allow them to ultimately bypass the limit if reclaim can not
    make progress.
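
    A hedged sketch of the pattern being described (illustrative helper, not
    the exact fs/buffer.c hunk):

    #include <linux/buffer_head.h>
    #include <linux/gfp.h>

    static struct buffer_head *demo_alloc_bh(void)
    {
            /*
             * Instead of looping around a plain GFP_NOFS allocation here,
             * waking the flushers and doing global direct reclaim between
             * attempts, annotate the allocation as "cannot fail" and let
             * the page allocator do the looping; memcg can also spot
             * __GFP_NOFAIL and bypass its limit if reclaim is stuck.
             */
            return alloc_buffer_head(GFP_NOFS | __GFP_NOFAIL);
    }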

    Reported-by: azurIt
    Signed-off-by: Johannes Weiner
    Cc: Michal Hocko
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

04 Jul, 2013

1 commit

  • Page reclaim keeps track of dirty and under writeback pages and uses it
    to determine if wait_iff_congested() should stall or if kswapd should
    begin writing back pages. This fails to account for buffer pages that
    can be under writeback but not PageWriteback which is the case for
    filesystems like ext3 ordered mode. Furthermore, PageDirty buffer pages
    can have all the buffers clean and writepage does no IO so it should not
    be accounted as congested.

    This patch adds an address_space operation that filesystems may
    optionally use to check if a page is really dirty or really under
    writeback. An implementation is provided for buffer_heads and is used
    for block operations and ext3 in ordered mode. By default the
    page flags are obeyed.

    Credit goes to Jan Kara for identifying that the page flags alone are
    not sufficient for ext3 and sanity checking a number of ideas on how the
    problem could be addressed.
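
    A sketch of how reclaim can consult such a hook; the operation's shape is
    roughly "void (*is_dirty_writeback)(struct page *, bool *, bool *)", and
    the helper below is illustrative rather than the exact mm/vmscan.c code:

    #include <linux/fs.h>
    #include <linux/mm.h>
    #include <linux/page-flags.h>

    static void demo_check_dirty_writeback(struct page *page,
                                           bool *dirty, bool *writeback)
    {
            struct address_space *mapping = page_mapping(page);

            if (mapping && mapping->a_ops->is_dirty_writeback) {
                    /* Let the filesystem decide (buffer_heads, ext3 ordered). */
                    mapping->a_ops->is_dirty_writeback(page, dirty, writeback);
                    return;
            }

            /* Default: obey the page flags. */
            *dirty = PageDirty(page);
            *writeback = PageWriteback(page);
    }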

    Signed-off-by: Mel Gorman
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Rik van Riel
    Cc: KAMEZAWA Hiroyuki
    Cc: Jiri Slaby
    Cc: Valdis Kletnieks
    Cc: Zlatko Calusic
    Cc: dormando
    Cc: Trond Myklebust
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

22 May, 2013

1 commit

  • Currently there is no way to truncate a partial page where the end
    truncation point is not at the end of the page. This is because it was
    not needed and the existing functionality was enough for file system
    truncate operations to work properly. However, more file systems now
    support the punch hole feature, and they can benefit from mm supporting
    truncation of a page just up to a certain point.

    Specifically, with this functionality truncate_inode_pages_range() can
    be changed so it supports truncating partial page at the end of the
    range (currently it will BUG_ON() if 'end' is not at the end of the
    page).

    This commit changes the invalidatepage() address space operation
    prototype to accept the range to be invalidated, and updates all of its
    instances accordingly.

    We also change block_invalidatepage() in the same way and actually make
    use of the new length argument to implement range invalidation.

    Actual file system implementations will follow, except for the file
    systems where the changes are really simple and should not change the
    behaviour in any way. An implementation of truncate_page_range(), which
    will be able to accept page-unaligned ranges, will follow as well.
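
    The prototype change, roughly (shown as a sketch, with an illustrative
    whole-page caller):

    #include <linux/fs.h>
    #include <linux/pagemap.h>

    /* old: offset only - invalidate from offset to the end of the page
     *      void (*invalidatepage)(struct page *page, unsigned long offset);
     *
     * new: offset + length, so a sub-page range can be invalidated
     *      void (*invalidatepage)(struct page *page, unsigned int offset,
     *                             unsigned int length);
     */
    static void demo_invalidate_whole_page(struct address_space *mapping,
                                           struct page *page)
    {
            if (mapping->a_ops->invalidatepage)
                    mapping->a_ops->invalidatepage(page, 0, PAGE_CACHE_SIZE);
    }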

    Signed-off-by: Lukas Czerner
    Cc: Andrew Morton
    Cc: Hugh Dickins

    Lukas Czerner
     

09 May, 2013

1 commit

  • Pull block core updates from Jens Axboe:

    - Major bit is Kent's prep work for immutable bio vecs.

    - Stable candidate fix for a scheduling-while-atomic in the queue
    bypass operation.

    - Fix for the hang on exceeded rq->datalen 32-bit unsigned when merging
    discard bios.

    - Tejun's changes to convert the writeback thread pool to the generic
    workqueue mechanism.

    - Runtime PM framework; SCSI patches exist on top of these in James'
    tree.

    - A few random fixes.

    * 'for-3.10/core' of git://git.kernel.dk/linux-block: (40 commits)
    relay: move remove_buf_file inside relay_close_buf
    partitions/efi.c: replace useless kzalloc's by kmalloc's
    fs/block_dev.c: fix iov_shorten() criteria in blkdev_aio_read()
    block: fix max discard sectors limit
    blkcg: fix "scheduling while atomic" in blk_queue_bypass_start
    Documentation: cfq-iosched: update documentation help for cfq tunables
    writeback: expose the bdi_wq workqueue
    writeback: replace custom worker pool implementation with unbound workqueue
    writeback: remove unused bdi_pending_list
    aoe: Fix unitialized var usage
    bio-integrity: Add explicit field for owner of bip_buf
    block: Add an explicit bio flag for bios that own their bvec
    block: Add bio_alloc_pages()
    block: Convert some code to bio_for_each_segment_all()
    block: Add bio_for_each_segment_all()
    bounce: Refactor __blk_queue_bounce to not use bi_io_vec
    raid1: use bio_copy_data()
    pktcdvd: Use bio_reset() in disabled code to kill bi_idx usage
    pktcdvd: use bio_copy_data()
    block: Add bio_copy_data()
    ...

    Linus Torvalds
     

01 May, 2013

1 commit

  • Pull ext4 updates from Ted Ts'o:
    "Mostly performance and bug fixes, plus some cleanups. The one new
    feature this merge window is a new ioctl EXT4_IOC_SWAP_BOOT which
    allows installation of a hidden inode designed for boot loaders."

    * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (50 commits)
    ext4: fix type-widening bug in inode table readahead code
    ext4: add check for inodes_count overflow in new resize ioctl
    ext4: fix Kconfig documentation for CONFIG_EXT4_DEBUG
    ext4: fix online resizing for ext3-compat file systems
    jbd2: trace when lock_buffer in do_get_write_access takes a long time
    ext4: mark metadata blocks using bh flags
    buffer: add BH_Prio and BH_Meta flags
    ext4: mark all metadata I/O with REQ_META
    ext4: fix readdir error in case inline_data+^dir_index.
    ext4: fix readdir error in the case of inline_data+dir_index
    jbd2: use kmem_cache_zalloc instead of kmem_cache_alloc/memset
    ext4: mext_insert_extents should update extent block checksum
    ext4: move quota initialization out of inode allocation transaction
    ext4: reserve xattr index for Rich ACL support
    jbd2: reduce journal_head size
    ext4: clear buffer_uninit flag when submitting IO
    ext4: use io_end for multiple bios
    ext4: make ext4_bio_write_page() use BH_Async_Write flags
    ext4: Use kstrtoul() instead of parse_strtoul()
    ext4: defragmentation code cleanup
    ...

    Linus Torvalds
     

30 Apr, 2013

2 commits

  • bh allocation uses kmem_cache_zalloc(), so we needn't call
    init_buffer(bh, NULL, NULL) or perform other zero-initialization.

    Signed-off-by: Jianpeng Ma
    Cc: Jan Kara
    Cc: Theodore Ts'o
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    majianpeng
     
  • Walking a bio's page mappings has proved problematic, so create a new
    bio flag to indicate that a bio's data needs to be snapshotted in order
    to guarantee stable pages during writeback. Next, for the one user
    (ext3/jbd) of snapshotting, hook all the places where writes can be
    initiated without PG_writeback set, and set BIO_SNAP_STABLE there.

    We must also flag journal "metadata" bios for stable writeout, since
    file data can be written through the journal. Finally, the
    MS_SNAP_STABLE mount flag (only used by ext3) is now superfluous, so get
    rid of it.
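
    A hedged sketch of how a journalling caller can ask for the snapshot
    (illustrative helper, not the exact fs/jbd hunk):

    #include <linux/buffer_head.h>
    #include <linux/blk_types.h>

    static int demo_submit_journal_bh(int write_op, struct buffer_head *bh)
    {
            /*
             * This buffer may be written while PG_writeback is not set, so
             * ask the block layer to snapshot (bounce) the data for stable
             * writeout by passing the BIO_SNAP_STABLE bio flag.
             */
            return _submit_bh(write_op, bh, 1 << BIO_SNAP_STABLE);
    }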

    [akpm@linux-foundation.org: rename _submit_bh()'s `flags' to `bio_flags', delobotomize the _submit_bh declaration]
    [akpm@linux-foundation.org: teeny cleanup]
    Signed-off-by: Darrick J. Wong
    Cc: Andy Lutomirski
    Cc: Adrian Hunter
    Cc: Artem Bityutskiy
    Reviewed-by: Jan Kara
    Cc: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Darrick J. Wong
     

24 Mar, 2013

1 commit

  • For immutable bvecs, all bi_idx usage needs to be audited - so here
    we're removing all the unnecessary uses.

    Most of these are places where it was being initialized on a bio that
    was just allocated; a few others are conversions to standard macros.

    Signed-off-by: Kent Overstreet
    CC: Jens Axboe

    Kent Overstreet
     

01 Mar, 2013

1 commit

  • Pull block IO core bits from Jens Axboe:
    "Below are the core block IO bits for 3.9. It was delayed a few days
    since my workstation kept crashing every 2-8h after pulling it into
    current -git, but turns out it is a bug in the new pstate code (divide
    by zero, will report separately). In any case, it contains:

    - The big cfq/blkcg update from Tejun and Vivek.

    - Additional block and writeback tracepoints from Tejun.

    - Improvement of the "should sort" (based on queues) logic in the plug
    flushing.

    - _io() variants of the wait_for_completion() interface, using
    io_schedule() instead of schedule() to contribute to io wait
    properly.

    - Various little fixes.

    You'll get two trivial merge conflicts, which should be easy enough to
    fix up"

    Fix up the trivial conflicts due to hlist traversal cleanups (commit
    b67bfe0d42ca: "hlist: drop the node parameter from iterators").

    * 'for-3.9/core' of git://git.kernel.dk/linux-block: (39 commits)
    block: remove redundant check to bd_openers()
    block: use i_size_write() in bd_set_size()
    cfq: fix lock imbalance with failed allocations
    drivers/block/swim3.c: fix null pointer dereference
    block: don't select PERCPU_RWSEM
    block: account iowait time when waiting for completion of IO request
    sched: add wait_for_completion_io[_timeout]
    writeback: add more tracepoints
    block: add block_{touch|dirty}_buffer tracepoint
    buffer: make touch_buffer() an exported function
    block: add @req to bio_{front|back}_merge tracepoints
    block: add missing block_bio_complete() tracepoint
    block: Remove should_sort judgement when flush blk_plug
    block,elevator: use new hashtable implementation
    cfq-iosched: add hierarchical cfq_group statistics
    cfq-iosched: collect stats from dead cfqgs
    cfq-iosched: separate out cfqg_stats_reset() from cfq_pd_reset_stats()
    blkcg: make blkcg_print_blkgs() grab q locks instead of blkcg lock
    block: RCU free request_queue
    blkcg: implement blkg_[rw]stat_recursive_sum() and blkg_[rw]stat_merge()
    ...

    Linus Torvalds
     

27 Feb, 2013

1 commit

  • Pull vfs pile (part one) from Al Viro:
    "Assorted stuff - cleaning namei.c up a bit, fixing ->d_name/->d_parent
    locking violations, etc.

    The most visible changes here are death of FS_REVAL_DOT (replaced with
    "has ->d_weak_revalidate()") and a new helper getting from struct file
    to inode. Some bits of preparation to xattr method interface changes.

    Misc patches by various people sent this cycle *and* ocfs2 fixes from
    several cycles ago that should've been upstream right then.

    PS: the next vfs pile will be xattr stuff."

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (46 commits)
    saner proc_get_inode() calling conventions
    proc: avoid extra pde_put() in proc_fill_super()
    fs: change return values from -EACCES to -EPERM
    fs/exec.c: make bprm_mm_init() static
    ocfs2/dlm: use GFP_ATOMIC inside a spin_lock
    ocfs2: fix possible use-after-free with AIO
    ocfs2: Fix oops in ocfs2_fast_symlink_readpage() code path
    get_empty_filp()/alloc_file() leave both ->f_pos and ->f_version zero
    target: writev() on single-element vector is pointless
    export kernel_write(), convert open-coded instances
    fs: encode_fh: return FILEID_INVALID if invalid fid_type
    kill f_vfsmnt
    vfs: kill FS_REVAL_DOT by adding a d_weak_revalidate dentry op
    nfsd: handle vfs_getattr errors in acl protocol
    switch vfs_getattr() to struct path
    default SET_PERSONALITY() in linux/elf.h
    ceph: prepopulate inodes only when request is aborted
    d_hash_and_lookup(): export, switch open-coded instances
    9p: switch v9fs_set_create_acl() to inode+fid, do it before d_instantiate()
    9p: split dropping the acls from v9fs_set_create_acl()
    ...

    Linus Torvalds
     

22 Feb, 2013

1 commit

  • Create a helper function to check if a backing device requires stable
    page writes and, if so, performs the necessary wait. Then, make it so
    that all points in the memory manager that handle making pages writable
    use the helper function. This should provide stable page write support
    to most filesystems, while eliminating unnecessary waiting for devices
    that don't require the feature.

    Before this patchset, all filesystems would block, regardless of whether
    or not it was necessary. ext3 would wait, but still generate occasional
    checksum errors. The network filesystems were left to do their own
    thing, so they'd wait too.

    After this patchset, all the disk filesystems except ext3 and btrfs will
    wait only if the hardware requires it. ext3 (if necessary) snapshots
    pages instead of blocking, and btrfs provides its own bdi so the mm will
    never wait. Network filesystems haven't been touched, so either they
    provide their own stable page guarantees or they don't block at all.
    The blocking behavior is back to what it was before 3.0 if you don't
    have a disk requiring stable page writes.

    Here's the result of using dbench to test latency on ext2:

    3.8.0-rc3:
    Operation      Count    AvgLat    MaxLat
    ----------------------------------------
    WriteX        109347     0.028    59.817
    ReadX         347180     0.004     3.391
    Flush          15514    29.828   287.283

    Throughput 57.429 MB/sec 4 clients 4 procs max_latency=287.290 ms

    3.8.0-rc3 + patches:
    Operation      Count    AvgLat    MaxLat
    ----------------------------------------
    WriteX        105556     0.029     4.273
    ReadX         335004     0.005     4.112
    Flush          14982    30.540   298.634

    Throughput 55.4496 MB/sec 4 clients 4 procs max_latency=298.650 ms

    As you can see, the maximum write latency drops considerably with this
    patch enabled. The other filesystems (ext3/ext4/xfs/btrfs) behave
    similarly, but see the cover letter for those results.
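
    A sketch of what the helper boils down to (the real one went into
    mm/page-writeback.c as wait_for_stable_page(); this version is
    illustrative):

    #include <linux/pagemap.h>
    #include <linux/backing-dev.h>

    static void demo_wait_for_stable_page(struct page *page)
    {
            struct backing_dev_info *bdi = page->mapping->backing_dev_info;

            /* Only block if the backing device actually needs stable pages. */
            if (bdi_cap_stable_pages_required(bdi))
                    wait_on_page_writeback(page);
    }

    It is called from the points that make pages writable, which is what
    confines the waiting to hardware that requires it.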

    Signed-off-by: Darrick J. Wong
    Acked-by: Steven Whitehouse
    Reviewed-by: Jan Kara
    Cc: Adrian Hunter
    Cc: Andy Lutomirski
    Cc: Artem Bityutskiy
    Cc: Joel Becker
    Cc: Mark Fasheh
    Cc: Jens Axboe
    Cc: Eric Van Hensbergen
    Cc: Ron Minnich
    Cc: Latchesar Ionkov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Darrick J. Wong
     

15 Jan, 2013

1 commit

  • Andrew Morton pointed this out a month ago, and then I completely forgot
    about it.

    If we read a partial last page of a block device, we will zero out the
    end of the page, but since that page can then be mapped into user space,
    we should also make sure to flush the cache on architectures that have
    virtual caches. We have the flush_dcache_page() function for this, so
    use it.

    Now, in practice this really never matters, because nobody sane uses
    virtual caches to begin with, and they largely exist on old broken RISC
    architectures.

    And even if you did run on one of those obsolete CPU's, the whole "mmap
    and access the last partial page of a block device" behavior probably
    doesn't actually exist. The normal IO functions (read/write) will never
    see the zeroed-out part of the page that might not be coherent in the
    cache, because they honor the size of the device.

    So I'm marking this for stable (3.7 only), but I'm not sure anybody will
    ever care.
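
    A minimal sketch of the pattern the fix applies (names hypothetical):

    #include <linux/highmem.h>
    #include <linux/mm.h>

    static void demo_zero_page_tail(struct page *page, unsigned int valid_bytes)
    {
            /* Zero the part of the page beyond the end of the device ... */
            zero_user(page, valid_bytes, PAGE_SIZE - valid_bytes);
            /* ... and make the zeroes visible to a user-space mapping on
             * architectures with virtually indexed caches. */
            flush_dcache_page(page);
    }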

    Pointed-out-by: Andrew Morton
    Cc: stable@vger.kernel.org # 3.7
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

14 Jan, 2013

2 commits

  • The former (block_touch_buffer) is triggered from touch_buffer() and the
    latter (block_dirty_buffer) from mark_buffer_dirty().

    This is part of tracepoint additions to improve visibility into
    dirtying / writeback operations for io tracer and userland.

    v2: Transformed writeback_dirty_buffer to block_dirty_buffer and made
    it share TP definition with block_touch_buffer.
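
    After the change, touch_buffer() fires the new tracepoint before touching
    the page; roughly (sketch, not the verbatim fs/buffer.c code):

    #include <linux/buffer_head.h>
    #include <linux/swap.h>
    #include <trace/events/block.h>

    void touch_buffer(struct buffer_head *bh)
    {
            trace_block_touch_buffer(bh);
            mark_page_accessed(bh->b_page);
    }

    /* mark_buffer_dirty() gains a matching trace_block_dirty_buffer(bh)
     * call at its head. */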

    Signed-off-by: Tejun Heo
    Cc: Fengguang Wu
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • We want to add a trace point to touch_buffer() but macros and inline
    functions defined in header files can't have tracing points. Move
    touch_buffer() to fs/buffer.c and make it a proper function.

    The new exported function is also declared inline. As most uses of
    touch_buffer() are inside buffer.c with nilfs2 as the only other user,
    the effect of this change should be negligible.

    Signed-off-by: Tejun Heo
    Cc: Steven Rostedt
    Signed-off-by: Jens Axboe

    Tejun Heo
     

12 Dec, 2012

1 commit

  • Overhaul struct address_space.assoc_mapping, renaming it to
    address_space.private_data and redefining its type as void *. This
    approach consistently names the .private_* elements of struct
    address_space and allows extended address_space association with
    other data structures through ->private_data.

    Also, all users of old ->assoc_mapping element are converted to reflect
    its new name and type change (->private_data).
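
    In other words, roughly (sketch of the before/after field, not the full
    include/linux/fs.h diff):

    #include <linux/fs.h>

    /* before:
     *      struct address_space {
     *              ...
     *              struct address_space *assoc_mapping;
     *      };
     *
     * after:
     *      struct address_space {
     *              ...
     *              void *private_data;
     *      };
     */

    /* A converted buffer.c-style association, illustrative only: */
    static void demo_set_assoc(struct address_space *mapping,
                               struct address_space *buffer_mapping)
    {
            mapping->private_data = buffer_mapping;
    }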

    Signed-off-by: Rafael Aquini
    Cc: Rusty Russell
    Cc: "Michael S. Tsirkin"
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Andi Kleen
    Cc: Konrad Rzeszutek Wilk
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rafael Aquini
     

05 Dec, 2012

1 commit

  • The block device access simplification that avoided accessing the (racy)
    block size information (commit bbec0270bdd8: "blkdev_max_block: make
    private to fs/buffer.c") no longer checks the maximum block size in the
    block mapping path.

    That was _almost_ as simple as just removing the code entirely, because
    the readers and writers all check the size of the device anyway, so
    under normal circumstances it "just worked".

    However, the block size may be such that the end of the device may
    straddle one single buffer_head. At which point we may still want to
    access the end of the device, but the buffer we use to access it
    partially extends past the end.

    The 'bd_set_size()' function intentionally sets the block size to avoid
    this, but mounting the device - or setting the block size by hand to
    some other value - can modify that block size.

    So instead, teach 'submit_bh()' about the special case of the buffer
    head straddling the end of the device, and turning such an access into a
    smaller IO access, avoiding the problem.

    This, btw, also means that unlike before, we can now access the whole
    device regardless of device block size setting. So now, even if the
    device size is only 512-byte aligned, we can read and write even the
    last sector even when having a much bigger block size for accessing the
    rest of the device.

    So with this, we could now get rid of the 'bd_set_size()' block size
    code entirely - resulting in faster IO for the common case - but that
    would be a separate patch.
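
    A hedged sketch of the guard described above (not the literal fs/buffer.c
    hunk; 3.7-era bio fields are assumed):

    #include <linux/fs.h>
    #include <linux/bio.h>

    static void demo_guard_bio_eod(struct block_device *bdev, struct bio *bio)
    {
            sector_t maxsector = i_size_read(bdev->bd_inode) >> 9;
            unsigned int bytes = bio->bi_size;

            if (!maxsector)
                    return;
            if (bio->bi_sector + (bytes >> 9) <= maxsector)
                    return;

            /* The buffer head straddles the end of the device: shrink the
             * I/O to the part that is actually inside it. */
            bytes = (maxsector - bio->bi_sector) << 9;
            bio->bi_size = bytes;
            bio->bi_io_vec[0].bv_len = bytes;
    }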

    Reported-and-tested-by: Romain Francoise
    Reported-and-tested-by: Meelis Roos
    Reported-by: Tony Luck
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

30 Nov, 2012

2 commits

  • We really don't want to look at the block size for the raw block device
    accesses in fs/block_dev.c, because it may be changing from under us.
    So get rid of the max_block logic entirely, since the caller should
    already have done it anyway.

    That leaves the only user of this function in fs/buffer.c, so move the
    whole function there and make it static.

    Signed-off-by: Linus Torvalds

    Linus Torvalds
     
  • This makes buffer size handling a per-page thing, which allows us
    to not have to worry about locking too much when changing the buffer
    size. If a page doesn't have buffers, we still need to read the block
    size from the inode, but we can do that with ACCESS_ONCE(), so that even
    if the size is changing, we get a consistent value.

    This doesn't convert all functions - many of the buffer functions are
    used purely by filesystems, which in turn results in the buffer size
    being fixed at mount-time. So they don't have the same consistency
    issues that the raw device access can have.
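
    The "read it once" part, roughly (helper name is hypothetical):

    #include <linux/fs.h>
    #include <linux/compiler.h>

    static unsigned int demo_bdev_blocksize(struct inode *bdev_inode)
    {
            /* The soft block size can change under us (e.g. a concurrent
             * mount), so sample i_blkbits exactly once and derive everything
             * from that single, consistent value. */
            unsigned int blkbits = ACCESS_ONCE(bdev_inode->i_blkbits);

            return 1U << blkbits;
    }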

    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

08 Oct, 2012

1 commit

  • Pull ext4 updates from Ted Ts'o:
    "The big new feature added this time is supporting online resizing
    using the meta_bg feature. This allows us to resize file systems
    which are greater than 16TB. In addition, the speed of online
    resizing has been improved in general.

    We also fix a number of races, some of which could lead to deadlocks,
    in ext4's Asynchronous I/O and online defrag support, thanks to good
    work by Dmitry Monakhov.

    There are also a large number of more minor bug fixes and cleanups
    from a number of other ext4 contributors, quite a few of whom have
    submitted fixes for the first time."

    * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (69 commits)
    ext4: fix ext4_flush_completed_IO wait semantics
    ext4: fix mtime update in nodelalloc mode
    ext4: fix ext_remove_space for punch_hole case
    ext4: punch_hole should wait for DIO writers
    ext4: serialize truncate with owerwrite DIO workers
    ext4: endless truncate due to nonlocked dio readers
    ext4: serialize unlocked dio reads with truncate
    ext4: serialize dio nonlocked reads with defrag workers
    ext4: completed_io locking cleanup
    ext4: fix unwritten counter leakage
    ext4: give i_aiodio_unwritten a more appropriate name
    ext4: ext4_inode_info diet
    ext4: convert to use leXX_add_cpu()
    ext4: ext4_bread usage audit
    fs: reserve fallocate flag codepoint
    ext4: remove redundant offset check in mext_check_arguments()
    ext4: don't clear orphan list on ro mount with errors
    jbd2: fix assertion failure in commit code due to lacking transaction credits
    ext4: release donor reference when EXT4_IOC_MOVE_EXT ioctl fails
    ext4: enable FITRIM ioctl on bigalloc file system
    ...

    Linus Torvalds
     

01 Oct, 2012

1 commit

  • Commits 5e8830dc85d0 and 41c4d25f78c0 introduced a regression into
    v3.6-rc1 for ext4 in nodelalloc mode, such that mtime updates would not
    take place for files modified via mmap if the page was already in the
    page cache. This would also affect ext3 file systems mounted using
    the ext4 file system driver.

    The problem was that ext4_page_mkwrite() had a shortcut which would
    avoid calling __block_page_mkwrite() under some circumstances, and the
    above two commits transferred the responsibility of calling
    file_update_time() to __block_page_mkwrite() --- which wouldn't get
    called in some circumstances.

    Since __block_page_mkwrite() only has three callers,
    block_page_mkwrite(), ext4_page_mkwrite(), and nilfs_page_mkwrite(), the
    best way to solve this is to move the responsibility for calling
    file_update_time() to its caller.

    This problem was found via xfstests #215 with a file system mounted
    with -o nodelalloc.
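
    The resulting caller-side pattern, as a hedged sketch (illustrative
    wrapper, not the exact ext4/nilfs2 code):

    #include <linux/buffer_head.h>
    #include <linux/fs.h>
    #include <linux/mm.h>

    static int demo_page_mkwrite(struct vm_area_struct *vma,
                                 struct vm_fault *vmf,
                                 get_block_t get_block)
    {
            /* Each ->page_mkwrite() now updates the timestamp itself ... */
            file_update_time(vma->vm_file);

            /* ... because __block_page_mkwrite() no longer does it. */
            return __block_page_mkwrite(vma, vmf, get_block);
    }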

    Signed-off-by: "Theodore Ts'o"
    Reviewed-by: Jan Kara
    Cc: KONISHI Ryusuke
    Cc: stable@vger.kernel.org

    Theodore Ts'o
     

23 Aug, 2012

1 commit

  • Commit 91f68c89d8f3 ("block: fix infinite loop in __getblk_slow")
    is not good: a successful call to grow_buffers() cannot guarantee
    that the page won't be reclaimed before the immediate next call to
    __find_get_block(), which is why there was always a loop there.

    Yesterday I got "EXT4-fs error (device loop0): __ext4_get_inode_loc:3595:
    inode #19278: block 664: comm cc1: unable to read itable block" on console,
    which pointed to this commit.

    I've been trying to bisect for weeks, why kbuild-on-ext4-on-loop-on-tmpfs
    sometimes fails from a missing header file, under memory pressure on
    ppc G5. I've never seen this on x86, and I've never seen it on 3.5-rc7
    itself, despite that commit being in there: bisection pointed to an
    irrelevant pinctrl merge, but hard to tell when failure takes between
    18 minutes and 38 hours (but so far it's happened quicker on 3.6-rc2).

    (I've since found such __ext4_get_inode_loc errors in /var/log/messages
    from previous weeks: why the message never appeared on console until
    yesterday morning is a mystery for another day.)

    Revert 91f68c89d8f3, restoring __getblk_slow() to how it was (plus
    a checkpatch nitfix). Simplify the interface between grow_buffers()
    and grow_dev_page(), and avoid the infinite loop beyond end of device
    by instead checking init_page_buffers()'s end_block there (I presume
    that's more efficient than a repeated call to blkdev_max_block()),
    returning -ENXIO to __getblk_slow() in that case.

    And remove akpm's ten-year-old "__getblk() cannot fail ... weird"
    comment, but that is worrying: are all users of __getblk() really
    now prepared for a NULL bh beyond end of device, or will some oops??

    Signed-off-by: Hugh Dickins
    Cc: stable@vger.kernel.org # 3.0 3.2 3.4 3.5
    Signed-off-by: Jens Axboe

    Hugh Dickins
     

31 Jul, 2012

2 commits

  • There are several entry points which dirty pages in a filesystem. mmap
    (handled by block_page_mkwrite()), buffered write (handled by
    __generic_file_aio_write()), splice write (generic_file_splice_write()),
    truncate, and fallocate (these can dirty the last partial page - handled inside
    each filesystem separately). Protect these places with sb_start_write() and
    sb_end_write().

    ->page_mkwrite() calls are particularly complex since they are called with
    mmap_sem held and thus we cannot use standard sb_start_write() due to lock
    ordering constraints. We solve the problem by using a special freeze protection
    sb_start_pagefault() which ranks below mmap_sem.
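
    A hedged sketch of the two protection patterns described above (the
    surrounding functions are hypothetical):

    #include <linux/fs.h>
    #include <linux/mm.h>

    static ssize_t demo_write_path(struct file *file)
    {
            struct inode *inode = file->f_mapping->host;
            ssize_t ret = 0;

            sb_start_write(inode->i_sb);    /* ordinary dirtying path */
            /* ... perform the buffered write / truncate / fallocate ... */
            sb_end_write(inode->i_sb);
            return ret;
    }

    static int demo_fault_path(struct vm_area_struct *vma, struct vm_fault *vmf)
    {
            struct inode *inode = vma->vm_file->f_mapping->host;
            int ret = 0;

            /* mmap_sem is held here, so use the level that ranks below it. */
            sb_start_pagefault(inode->i_sb);
            /* ... make the page writable ... */
            sb_end_pagefault(inode->i_sb);
            return ret;
    }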

    BugLink: https://bugs.launchpad.net/bugs/897421
    Tested-by: Kamal Mostafa
    Tested-by: Peter M. Petrakis
    Tested-by: Dann Frazier
    Tested-by: Massimo Morana
    Signed-off-by: Jan Kara
    Signed-off-by: Al Viro

    Jan Kara
     
  • Tested-by: Kamal Mostafa
    Tested-by: Peter M. Petrakis
    Tested-by: Dann Frazier
    Tested-by: Massimo Morana
    Signed-off-by: Jan Kara
    Signed-off-by: Al Viro

    Jan Kara
     

13 Jul, 2012

1 commit

  • Commit 080399aaaf35 ("block: don't mark buffers beyond end of disk as
    mapped") exposed a bug in __getblk_slow that causes mount to hang as it
    loops infinitely waiting for a buffer that lies beyond the end of the
    disk to become uptodate.

    The problem was initially reported by Torsten Hilbrich here:

    https://lkml.org/lkml/2012/6/18/54

    and also reported independently here:

    http://www.sysresccd.org/forums/viewtopic.php?f=13&t=4511

    and then Richard W.M. Jones and Marcos Mello noted a few separate
    bugzillas also associated with the same issue. This patch has been
    confirmed to fix:

    https://bugzilla.redhat.com/show_bug.cgi?id=835019

    The main problem is here, in __getblk_slow:

    for (;;) {
            struct buffer_head *bh;
            int ret;

            bh = __find_get_block(bdev, block, size);
            if (bh)
                    return bh;

            ret = grow_buffers(bdev, block, size);
            if (ret < 0)
                    return NULL;
            if (ret == 0)
                    free_more_memory();
    }

    __find_get_block does not find the block, since it will not be marked as
    mapped, and so grow_buffers is called to fill in the buffers for the
    associated page. I believe the for (;;) loop is there primarily to
    retry in the case of memory pressure keeping grow_buffers from
    succeeding. However, we also continue to loop for other cases, like the
    block lying beyond the end of the disk. So, the fix I came up with is to
    only loop when grow_buffers fails due to memory allocation issues
    (return value of 0).

    The attached patch was tested by myself, Torsten, and Rich, and was
    found to resolve the problem in all cases.

    Signed-off-by: Jeff Moyer
    Reported-and-Tested-by: Torsten Hilbrich
    Tested-by: Richard W.M. Jones
    Reviewed-by: Josh Boyer
    Cc: Stable # 3.0+
    [ Jens is on vacation, taking this directly - Linus ]
    --
    Stable Notes: this patch requires backport to 3.0, 3.2 and 3.3.
    Signed-off-by: Linus Torvalds

    Jeff Moyer
     

11 May, 2012

1 commit

  • Hi,

    We have a bug report open where a squashfs image mounted on ppc64 would
    exhibit errors due to trying to read beyond the end of the disk. It can
    easily be reproduced by doing the following:

    [root@ibm-p750e-02-lp3 ~]# ls -l install.img
    -rw-r--r-- 1 root root 142032896 Apr 30 16:46 install.img
    [root@ibm-p750e-02-lp3 ~]# mount -o loop ./install.img /mnt/test
    [root@ibm-p750e-02-lp3 ~]# dd if=/dev/loop0 of=/dev/null
    dd: reading `/dev/loop0': Input/output error
    277376+0 records in
    277376+0 records out
    142016512 bytes (142 MB) copied, 0.9465 s, 150 MB/s

    In dmesg, you'll find the following:

    squashfs: version 4.0 (2009/01/31) Phillip Lougher
    [ 43.106012] attempt to access beyond end of device
    [ 43.106029] loop0: rw=0, want=277410, limit=277408
    [ 43.106039] Buffer I/O error on device loop0, logical block 138704
    [ 43.106053] attempt to access beyond end of device
    [ 43.106057] loop0: rw=0, want=277412, limit=277408
    [ 43.106061] Buffer I/O error on device loop0, logical block 138705
    [ 43.106066] attempt to access beyond end of device
    [ 43.106070] loop0: rw=0, want=277414, limit=277408
    [ 43.106073] Buffer I/O error on device loop0, logical block 138706
    [ 43.106078] attempt to access beyond end of device
    [ 43.106081] loop0: rw=0, want=277416, limit=277408
    [ 43.106085] Buffer I/O error on device loop0, logical block 138707
    [ 43.106089] attempt to access beyond end of device
    [ 43.106093] loop0: rw=0, want=277418, limit=277408
    [ 43.106096] Buffer I/O error on device loop0, logical block 138708
    [ 43.106101] attempt to access beyond end of device
    [ 43.106104] loop0: rw=0, want=277420, limit=277408
    [ 43.106108] Buffer I/O error on device loop0, logical block 138709
    [ 43.106112] attempt to access beyond end of device
    [ 43.106116] loop0: rw=0, want=277422, limit=277408
    [ 43.106120] Buffer I/O error on device loop0, logical block 138710
    [ 43.106124] attempt to access beyond end of device
    [ 43.106128] loop0: rw=0, want=277424, limit=277408
    [ 43.106131] Buffer I/O error on device loop0, logical block 138711
    [ 43.106135] attempt to access beyond end of device
    [ 43.106139] loop0: rw=0, want=277426, limit=277408
    [ 43.106143] Buffer I/O error on device loop0, logical block 138712
    [ 43.106147] attempt to access beyond end of device
    [ 43.106151] loop0: rw=0, want=277428, limit=277408
    [ 43.106154] Buffer I/O error on device loop0, logical block 138713
    [ 43.106158] attempt to access beyond end of device
    [ 43.106162] loop0: rw=0, want=277430, limit=277408
    [ 43.106166] attempt to access beyond end of device
    [ 43.106169] loop0: rw=0, want=277432, limit=277408
    ...
    [ 43.106307] attempt to access beyond end of device
    [ 43.106311] loop0: rw=0, want=277470, limit=2774

    Squashfs manages to read in the end block(s) of the disk during the
    mount operation. Then, when dd reads the block device, it leads to
    block_read_full_page being called with buffers that are beyond end of
    disk, but are marked as mapped. Thus, it would end up submitting read
    I/O against them, resulting in the errors mentioned above. I fixed the
    problem by modifying init_page_buffers to only set the buffer mapped if
    it fell inside of i_size.

    Cheers,
    Jeff
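
    For reference, a hedged sketch of the idea (not the literal
    init_page_buffers() hunk):

    #include <linux/buffer_head.h>

    static void demo_init_one_buffer(struct buffer_head *bh,
                                     struct block_device *bdev,
                                     sector_t block, sector_t end_block)
    {
            bh->b_bdev = bdev;
            bh->b_blocknr = block;
            /* Only buffers that lie inside the device get "mapped"; anything
             * past end_block stays unmapped, so block_read_full_page() will
             * not submit I/O against it. */
            if (block < end_block)
                    set_buffer_mapped(bh);
    }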

    Signed-off-by: Jeff Moyer
    Acked-by: Nick Piggin

    --

    Changes from v1->v2: re-used max_block, as suggested by Nick Piggin.
    Signed-off-by: Jens Axboe

    Jeff Moyer
     

26 Apr, 2012

1 commit