17 Nov, 2014

1 commit

  • Currently, freezing a filesystem involves calling freeze_super, which locks
    sb->s_umount and then calls the fs-specific freeze_fs hook. This makes it
    hard for gfs2 (and potentially other cluster filesystems) to use the vfs
    freezing code to do freezes on all the cluster nodes.

    In order to communicate that a freeze has been requested, and to make sure
    that only one node is trying to freeze at a time, gfs2 uses a glock
    (sd_freeze_gl). The problem is that there is no hook for gfs2 to acquire
    this lock before calling freeze_super. This means that two nodes can
    attempt to freeze the filesystem by both calling freeze_super, acquiring
    the sb->s_umount lock, and then attempting to grab the cluster glock
    sd_freeze_gl. Only one will succeed, and the other will be stuck in
    freeze_super, making it impossible to finish freezing the node.

    To solve this problem, this patch adds the freeze_super and thaw_super
    hooks. If a filesystem implements these hooks, they are called instead of
    the vfs freeze_super and thaw_super functions. This means that every
    filesystem that implements these hooks must call the vfs freeze_super and
    thaw_super functions itself within the hook function to make use of the vfs
    freezing code.

    Reviewed-by: Jan Kara
    Signed-off-by: Benjamin Marzinski
    Signed-off-by: Steven Whitehouse

    Benjamin Marzinski
     

31 Oct, 2014

1 commit

  • Author: David Jeffery
    Changes to the basic direct I/O code have broken the raw driver when reading
    to the end of a raw device. Instead of returning a short read for a read that
    extends partially beyond the device's end or 0 when at the end of the device,
    these reads now return EIO.

    The raw driver needs the same end of device handling as was added for normal
    block devices. Using blkdev_read_iter, which has the needed size checks,
    prevents the EIO conditions at the end of the device.

    Signed-off-by: David Jeffery
    Signed-off-by: Al Viro

    David Jeffery
     

19 Oct, 2014

1 commit

  • Pull core block layer changes from Jens Axboe:
    "This is the core block IO pull request for 3.18. Apart from the new
    and improved flush machinery for blk-mq, this is all mostly bug fixes
    and cleanups.

    - blk-mq timeout updates and fixes from Christoph.

    - Removal of REQ_END, also from Christoph. We pass it through the
    ->queue_rq() hook for blk-mq instead, freeing up one of the request
    bits. The space was overly tight on 32-bit, so Martin also killed
    REQ_KERNEL since it's no longer used.

    - blk integrity updates and fixes from Martin and Gu Zheng.

    - Update to the flush machinery for blk-mq from Ming Lei. Now we
    have a per hardware context flush request, which both cleans up the
    code should scale better for flush intensive workloads on blk-mq.

    - Improve the error printing, from Rob Elliott.

    - Backing device improvements and cleanups from Tejun.

    - Fixup of a misplaced rq_complete() tracepoint from Hannes.

    - Make blk_get_request() return error pointers, fixing up issues
    where we NULL deref when a device goes bad or missing. From Joe
    Lawrence.

    - Prep work for drastically reducing the memory consumption of dm
    devices from Junichi Nomura. This allows creating clone bio sets
    without preallocating a lot of memory.

    - Fix a blk-mq hang on certain combinations of queue depths and
    hardware queues from me.

    - Limit memory consumption for blk-mq devices for crash dump
    scenarios and drivers that use crazy high depths (certain SCSI
    shared tag setups). We now just use a single queue and limited
    depth for that"

    * 'for-3.18/core' of git://git.kernel.dk/linux-block: (58 commits)
    block: Remove REQ_KERNEL
    blk-mq: allocate cpumask on the home node
    bio-integrity: remove the needless fail handle of bip_slab creating
    block: include func name in __get_request prints
    block: make blk_update_request print prefix match ratelimited prefix
    blk-merge: don't compute bi_phys_segments from bi_vcnt for cloned bio
    block: fix alignment_offset math that assumes io_min is a power-of-2
    blk-mq: Make bt_clear_tag() easier to read
    blk-mq: fix potential hang if rolling wakeup depth is too high
    block: add bioset_create_nobvec()
    block: use bio_clone_fast() in blk_rq_prep_clone()
    block: misplaced rq_complete tracepoint
    sd: Honor block layer integrity handling flags
    block: Replace strnicmp with strncasecmp
    block: Add T10 Protection Information functions
    block: Don't merge requests if integrity flags differ
    block: Integrity checksum flag
    block: Relocate bio integrity flags
    block: Add a disk flag to block integrity profile
    block: Add prefix to block integrity profile flags
    ...

    Linus Torvalds
     

10 Oct, 2014

1 commit

  • Sequential read from a block device is expected to be equal or faster than
    from the file on a filesystem. But it is not correct due to the lack of
    effective readpages() in the address space operations for block device.

    This implements readpages() operation for block device by using
    mpage_readpages() which can create multipage BIOs instead of BIOs for each
    page and reduce system CPU time consumption.

    Install 1GB of RAM disk storage:

    # modprobe scsi_debug dev_size_mb=1024 delay=0

    Sequential read from file on a filesystem:

    # mkfs.ext4 /dev/$DEV
    # mount /dev/$DEV /mnt
    # fio --name=t --size=512m --rw=read --filename=/mnt/file
    ...
    read : io=524288KB, bw=2133.4MB/s, iops=546133, runt= 240msec

    Sequential read from a block device:
    # fio --name=t --size=512m --rw=read --filename=/dev/$DEV
    ...
    (Without this commit)
    read : io=524288KB, bw=1700.2MB/s, iops=435455, runt= 301msec

    (With this commit)
    read : io=524288KB, bw=2160.4MB/s, iops=553046, runt= 237msec

    Signed-off-by: Akinobu Mita
    Cc: Jens Axboe
    Cc: Alexander Viro
    Cc: Jeff Moyer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Akinobu Mita
     

09 Sep, 2014

2 commits

  • A block_device may be attached to different gendisks and thus
    different bdis over time. bdev_inode_switch_bdi() is used to switch
    the associated bdi. The function assumes that the inode could be
    dirty and transfers it between bdis if so. This is a bit nasty in
    that it reaches into bdi internals.

    This patch reimplements the function so that it writes out the inode
    if dirty. This is a lot simpler and can be implemented without
    exposing bdi internals.

    Signed-off-by: Tejun Heo
    Cc: Alexander Viro
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • bdev_get_queue() returns the request_queue associated with the
    specified block_device. blk_get_backing_dev_info() makes use of
    bdev_get_queue() to determine the associated bdi given a block_device.

    All the callers of bdev_get_queue() including
    blk_get_backing_dev_info() assume that bdev_get_queue() may return
    NULL and implement NULL handling; however, bdev_get_queue() requires
    the passed in block_device is opened and attached to its gendisk.
    Because an active gendisk always has a valid request_queue associated
    with it, bdev_get_queue() can never return NULL and neither can
    blk_get_backing_dev_info().

    Make it clear that neither of the two functions can return NULL and
    remove NULL handling from all the callers.

    Signed-off-by: Tejun Heo
    Cc: Chris Mason
    Cc: Dave Chinner
    Signed-off-by: Jens Axboe

    Tejun Heo
     

13 Jun, 2014

1 commit

  • Pull vfs updates from Al Viro:
    "This the bunch that sat in -next + lock_parent() fix. This is the
    minimal set; there's more pending stuff.

    In particular, I really hope to get acct.c fixes merged this cycle -
    we need that to deal sanely with delayed-mntput stuff. In the next
    pile, hopefully - that series is fairly short and localized
    (kernel/acct.c, fs/super.c and fs/namespace.c). In this pile: more
    iov_iter work. Most of prereqs for ->splice_write with sane locking
    order are there and Kent's dio rewrite would also fit nicely on top of
    this pile"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (70 commits)
    lock_parent: don't step on stale ->d_parent of all-but-freed one
    kill generic_file_splice_write()
    ceph: switch to iter_file_splice_write()
    shmem: switch to iter_file_splice_write()
    nfs: switch to iter_splice_write_file()
    fs/splice.c: remove unneeded exports
    ocfs2: switch to iter_file_splice_write()
    ->splice_write() via ->write_iter()
    bio_vec-backed iov_iter
    optimize copy_page_{to,from}_iter()
    bury generic_file_aio_{read,write}
    lustre: get rid of messing with iovecs
    ceph: switch to ->write_iter()
    ceph_sync_direct_write: stop poking into iov_iter guts
    ceph_sync_read: stop poking into iov_iter guts
    new helper: copy_page_from_iter()
    fuse: switch to ->write_iter()
    btrfs: switch to ->write_iter()
    ocfs2: switch to ->write_iter()
    xfs: switch to ->write_iter()
    ...

    Linus Torvalds
     

12 Jun, 2014

1 commit

  • iter_file_splice_write() - a ->splice_write() instance that gathers the
    pipe buffers, builds a bio_vec-based iov_iter covering those and feeds
    it to ->write_iter(). A bunch of simple cases coverted to that...

    [AV: fixed the braino spotted by Cyrill]

    Signed-off-by: Al Viro

    Al Viro
     

05 Jun, 2014

1 commit

  • A block device driver may choose to provide a rw_page operation. These
    will be called when the filesystem is attempting to do page sized I/O to
    page cache pages (ie not for direct I/O). This does preclude I/Os that
    are larger than page size, so this may only be a performance gain for
    some devices.

    Signed-off-by: Matthew Wilcox
    Tested-by: Dheeraj Reddy
    Cc: Dave Chinner
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     

07 May, 2014

5 commits


13 Apr, 2014

1 commit

  • Pull vfs updates from Al Viro:
    "The first vfs pile, with deep apologies for being very late in this
    window.

    Assorted cleanups and fixes, plus a large preparatory part of iov_iter
    work. There's a lot more of that, but it'll probably go into the next
    merge window - it *does* shape up nicely, removes a lot of
    boilerplate, gets rid of locking inconsistencie between aio_write and
    splice_write and I hope to get Kent's direct-io rewrite merged into
    the same queue, but some of the stuff after this point is having
    (mostly trivial) conflicts with the things already merged into
    mainline and with some I want more testing.

    This one passes LTP and xfstests without regressions, in addition to
    usual beating. BTW, readahead02 in ltp syscalls testsuite has started
    giving failures since "mm/readahead.c: fix readahead failure for
    memoryless NUMA nodes and limit readahead pages" - might be a false
    positive, might be a real regression..."

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (63 commits)
    missing bits of "splice: fix racy pipe->buffers uses"
    cifs: fix the race in cifs_writev()
    ceph_sync_{,direct_}write: fix an oops on ceph_osdc_new_request() failure
    kill generic_file_buffered_write()
    ocfs2_file_aio_write(): switch to generic_perform_write()
    ceph_aio_write(): switch to generic_perform_write()
    xfs_file_buffered_aio_write(): switch to generic_perform_write()
    export generic_perform_write(), start getting rid of generic_file_buffer_write()
    generic_file_direct_write(): get rid of ppos argument
    btrfs_file_aio_write(): get rid of ppos
    kill the 5th argument of generic_file_buffered_write()
    kill the 4th argument of __generic_file_aio_write()
    lustre: don't open-code kernel_recvmsg()
    ocfs2: don't open-code kernel_recvmsg()
    drbd: don't open-code kernel_recvmsg()
    constify blk_rq_map_user_iov() and friends
    lustre: switch to kernel_sendmsg()
    ocfs2: don't open-code kernel_sendmsg()
    take iov_iter stuff to mm/iov_iter.c
    process_vm_access: tidy up a bit
    ...

    Linus Torvalds
     

04 Apr, 2014

2 commits

  • Reclaim will be leaving shadow entries in the page cache radix tree upon
    evicting the real page. As those pages are found from the LRU, an
    iput() can lead to the inode being freed concurrently. At this point,
    reclaim must no longer install shadow pages because the inode freeing
    code needs to ensure the page tree is really empty.

    Add an address_space flag, AS_EXITING, that the inode freeing code sets
    under the tree lock before doing the final truncate. Reclaim will check
    for this flag before installing shadow pages.

    Signed-off-by: Johannes Weiner
    Reviewed-by: Rik van Riel
    Reviewed-by: Minchan Kim
    Cc: Andrea Arcangeli
    Cc: Bob Liu
    Cc: Christoph Hellwig
    Cc: Dave Chinner
    Cc: Greg Thelen
    Cc: Hugh Dickins
    Cc: Jan Kara
    Cc: KOSAKI Motohiro
    Cc: Luigi Semenzato
    Cc: Mel Gorman
    Cc: Metin Doslu
    Cc: Michel Lespinasse
    Cc: Ozgun Erdogan
    Cc: Peter Zijlstra
    Cc: Roman Gushchin
    Cc: Ryan Mallon
    Cc: Tejun Heo
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • We know that "ret > 0" is true here. These tests were left over from
    commit 02afc27faec9 ('direct-io: Handle O_(D)SYNC AIO') and aren't
    needed any more.

    Signed-off-by: Dan Carpenter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Carpenter
     

02 Apr, 2014

1 commit


14 Sep, 2013

2 commits

  • Pull writeback fix from Wu Fengguang:
    "A trivial writeback fix"

    * tag 'writeback-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linux:
    writeback: Do not sort b_io list only because of block device inode

    Linus Torvalds
     
  • Pull aio changes from Ben LaHaise:
    "First off, sorry for this pull request being late in the merge window.
    Al had raised a couple of concerns about 2 items in the series below.
    I addressed the first issue (the race introduced by Gu's use of
    mm_populate()), but he has not provided any further details on how he
    wants to rework the anon_inode.c changes (which were sent out months
    ago but have yet to be commented on).

    The bulk of the changes have been sitting in the -next tree for a few
    months, with all the issues raised being addressed"

    * git://git.kvack.org/~bcrl/aio-next: (22 commits)
    aio: rcu_read_lock protection for new rcu_dereference calls
    aio: fix race in ring buffer page lookup introduced by page migration support
    aio: fix rcu sparse warnings introduced by ioctx table lookup patch
    aio: remove unnecessary debugging from aio_free_ring()
    aio: table lookup: verify ctx pointer
    staging/lustre: kiocb->ki_left is removed
    aio: fix error handling and rcu usage in "convert the ioctx list to table lookup v3"
    aio: be defensive to ensure request batching is non-zero instead of BUG_ON()
    aio: convert the ioctx list to table lookup v3
    aio: double aio_max_nr in calculations
    aio: Kill ki_dtor
    aio: Kill ki_users
    aio: Kill unneeded kiocb members
    aio: Kill aio_rw_vect_retry()
    aio: Don't use ctx->tail unnecessarily
    aio: io_cancel() no longer returns the io_event
    aio: percpu ioctx refcount
    aio: percpu reqs_available
    aio: reqs_active -> reqs_available
    aio: fix build when migration is disabled
    ...

    Linus Torvalds
     

04 Sep, 2013

1 commit

  • Call generic_write_sync() from the deferred I/O completion handler if
    O_DSYNC is set for a write request. Also make sure various callers
    don't call generic_write_sync if the direct I/O code returns
    -EIOCBQUEUED.

    Based on an earlier patch from Jan Kara with updates from
    Jeff Moyer and Darrick J. Wong .

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jan Kara
    Signed-off-by: Al Viro

    Christoph Hellwig
     

30 Jul, 2013

1 commit

  • This code doesn't serve any purpose anymore, since the aio retry
    infrastructure has been removed.

    This change should be safe because aio_read/write are also used for
    synchronous IO, and called from do_sync_read()/do_sync_write() - and
    there's no looping done in the sync case (the read and write syscalls).

    Signed-off-by: Kent Overstreet
    Cc: Zach Brown
    Cc: Felipe Balbi
    Cc: Greg Kroah-Hartman
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Rusty Russell
    Cc: Jens Axboe
    Cc: Asai Thambi S P
    Cc: Selvan Mani
    Cc: Sam Bradshaw
    Cc: Jeff Moyer
    Cc: Al Viro
    Cc: Benjamin LaHaise
    Signed-off-by: Benjamin LaHaise

    Kent Overstreet
     

12 Jul, 2013

1 commit

  • Pull core block IO updates from Jens Axboe:
    "Here are the core IO block bits for 3.11. It contains:

    - A tweak to the reserved tag logic from Jan, for weirdo devices with
    just 3 free tags. But for those it improves things substantially
    for random writes.

    - Periodic writeback fix from Jan. Marked for stable as well.

    - Fix for a race condition in IO scheduler switching from Jianpeng.

    - The hierarchical blk-cgroup support from Tejun. This is the grunt
    of the series.

    - blk-throttle fix from Vivek.

    Just a note that I'm in the middle of a relocation, whole family is
    flying out tomorrow. Hence I will be awal the remainder of this week,
    but back at work again on Monday the 15th. CC'ing Tejun, since any
    potential "surprises" will most likely be from the blk-cgroup work.
    But it's been brewing for a while and sitting in my tree and
    linux-next for a long time, so should be solid."

    * 'for-3.11/core' of git://git.kernel.dk/linux-block: (36 commits)
    elevator: Fix a race in elevator switching
    block: Reserve only one queue tag for sync IO if only 3 tags are available
    writeback: Fix periodic writeback after fs mount
    blk-throttle: implement proper hierarchy support
    blk-throttle: implement throtl_grp->has_rules[]
    blk-throttle: Account for child group's start time in parent while bio climbs up
    blk-throttle: add throtl_qnode for dispatch fairness
    blk-throttle: make throtl_pending_timer_fn() ready for hierarchy
    blk-throttle: make tg_dispatch_one_bio() ready for hierarchy
    blk-throttle: make blk_throtl_bio() ready for hierarchy
    blk-throttle: make blk_throtl_drain() ready for hierarchy
    blk-throttle: dispatch from throtl_pending_timer_fn()
    blk-throttle: implement dispatch looping
    blk-throttle: separate out throtl_service_queue->pending_timer from throtl_data->dispatch_work
    blk-throttle: set REQ_THROTTLED from throtl_charge_bio() and gate stats update with it
    blk-throttle: implement sq_to_tg(), sq_to_td() and throtl_log()
    blk-throttle: add throtl_service_queue->parent_sq
    blk-throttle: generalize update_disptime optimization in blk_throtl_bio()
    blk-throttle: dispatch to throtl_data->service_queue.bio_lists[]
    blk-throttle: move bio_lists[] and friends to throtl_service_queue
    ...

    Linus Torvalds
     

09 Jul, 2013

1 commit

  • It is very likely that block device inode will be part of BDI dirty list
    as well. However it doesn't make sence to sort inodes on the b_io list
    just because of this inode (as it contains buffers all over the device
    anyway). So save some CPU cycles which is valuable since we hold relatively
    contented wb->list_lock.

    Signed-off-by: Jan Kara

    Jan Kara
     

04 Jul, 2013

1 commit

  • Page reclaim keeps track of dirty and under writeback pages and uses it
    to determine if wait_iff_congested() should stall or if kswapd should
    begin writing back pages. This fails to account for buffer pages that
    can be under writeback but not PageWriteback which is the case for
    filesystems like ext3 ordered mode. Furthermore, PageDirty buffer pages
    can have all the buffers clean and writepage does no IO so it should not
    be accounted as congested.

    This patch adds an address_space operation that filesystems may
    optionally use to check if a page is really dirty or really under
    writeback. An implementation is provided for for buffer_heads is added
    and used for block operations and ext3 in ordered mode. By default the
    page flags are obeyed.

    Credit goes to Jan Kara for identifying that the page flags alone are
    not sufficient for ext3 and sanity checking a number of ideas on how the
    problem could be addressed.

    Signed-off-by: Mel Gorman
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Rik van Riel
    Cc: KAMEZAWA Hiroyuki
    Cc: Jiri Slaby
    Cc: Valdis Kletnieks
    Cc: Zlatko Calusic
    Cc: dormando
    Cc: Trond Myklebust
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

29 Jun, 2013

1 commit


28 Jun, 2013

1 commit

  • Code in blkdev.c moves a device inode to default_backing_dev_info when
    the last reference to the device is put and moves the device inode back
    to its bdi when the first reference is acquired. This includes moving to
    wb.b_dirty list if the device inode is dirty. The code however doesn't
    setup timer to wake corresponding flusher thread and while wb.b_dirty
    list is non-empty __mark_inode_dirty() will not set it up either. Thus
    periodic writeback is effectively disabled until a sync(2) call which can
    lead to unexpected data loss in case of crash or power failure.

    Fix the problem by setting up a timer for periodic writeback in case we
    add the first dirty inode to wb.b_dirty list in bdev_inode_switch_bdi().

    Reported-by: Bert De Jonghe
    CC: stable@vger.kernel.org # >= 3.0
    Signed-off-by: Jan Kara
    Signed-off-by: Jens Axboe

    Jan Kara
     

09 May, 2013

1 commit

  • Pull block core updates from Jens Axboe:

    - Major bit is Kents prep work for immutable bio vecs.

    - Stable candidate fix for a scheduling-while-atomic in the queue
    bypass operation.

    - Fix for the hang on exceeded rq->datalen 32-bit unsigned when merging
    discard bios.

    - Tejuns changes to convert the writeback thread pool to the generic
    workqueue mechanism.

    - Runtime PM framework, SCSI patches exists on top of these in James'
    tree.

    - A few random fixes.

    * 'for-3.10/core' of git://git.kernel.dk/linux-block: (40 commits)
    relay: move remove_buf_file inside relay_close_buf
    partitions/efi.c: replace useless kzalloc's by kmalloc's
    fs/block_dev.c: fix iov_shorten() criteria in blkdev_aio_read()
    block: fix max discard sectors limit
    blkcg: fix "scheduling while atomic" in blk_queue_bypass_start
    Documentation: cfq-iosched: update documentation help for cfq tunables
    writeback: expose the bdi_wq workqueue
    writeback: replace custom worker pool implementation with unbound workqueue
    writeback: remove unused bdi_pending_list
    aoe: Fix unitialized var usage
    bio-integrity: Add explicit field for owner of bip_buf
    block: Add an explicit bio flag for bios that own their bvec
    block: Add bio_alloc_pages()
    block: Convert some code to bio_for_each_segment_all()
    block: Add bio_for_each_segment_all()
    bounce: Refactor __blk_queue_bounce to not use bi_io_vec
    raid1: use bio_copy_data()
    pktcdvd: Use bio_reset() in disabled code to kill bi_idx usage
    pktcdvd: use bio_copy_data()
    block: Add bio_copy_data()
    ...

    Linus Torvalds
     

08 May, 2013

2 commits

  • Merge more incoming from Andrew Morton:

    - Various fixes which were stalled or which I picked up recently

    - A large rotorooting of the AIO code. Allegedly to improve
    performance but I don't really have good performance numbers (I might
    have lost the email) and I can't raise Kent today. I held this out
    of 3.9 and we could give it another cycle if it's all too late/scary.

    I ended up taking only the first two thirds of the AIO rotorooting. I
    left the percpu parts and the batch completion for later. - Linus

    * emailed patches from Andrew Morton : (33 commits)
    aio: don't include aio.h in sched.h
    aio: kill ki_retry
    aio: kill ki_key
    aio: give shared kioctx fields their own cachelines
    aio: kill struct aio_ring_info
    aio: kill batch allocation
    aio: change reqs_active to include unreaped completions
    aio: use cancellation list lazily
    aio: use flush_dcache_page()
    aio: make aio_read_evt() more efficient, convert to hrtimers
    wait: add wait_event_hrtimeout()
    aio: refcounting cleanup
    aio: make aio_put_req() lockless
    aio: do fget() after aio_get_req()
    aio: dprintk() -> pr_debug()
    aio: move private stuff out of aio.h
    aio: add kiocb_cancel()
    aio: kill return value of aio_complete()
    char: add aio_{read,write} to /dev/{null,zero}
    aio: remove retry-based AIO
    ...

    Linus Torvalds
     
  • Faster kernel compiles by way of fewer unnecessary includes.

    [akpm@linux-foundation.org: fix fallout]
    [akpm@linux-foundation.org: fix build]
    Signed-off-by: Kent Overstreet
    Cc: Zach Brown
    Cc: Felipe Balbi
    Cc: Greg Kroah-Hartman
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Rusty Russell
    Cc: Jens Axboe
    Cc: Asai Thambi S P
    Cc: Selvan Mani
    Cc: Sam Bradshaw
    Cc: Jeff Moyer
    Cc: Al Viro
    Cc: Benjamin LaHaise
    Reviewed-by: "Theodore Ts'o"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kent Overstreet
     

07 May, 2013

2 commits


01 May, 2013

1 commit


30 Apr, 2013

1 commit

  • blkdev_aio_read() test 'size' to see if it is equal or greater than the
    target count we request(iocb->ki_left). If so there is no need to call
    iov_shorten() to reduce number of segments and the iovec's length. So the
    judgement should be changed to 'if (size < iocb->ki_left)' instead.

    Signed-off-by: Jianpeng Ma
    Signed-off-by: Gu Zheng
    Cc: Al Viro
    Cc: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Jens Axboe

    Gu Zheng
     

02 Apr, 2013

1 commit

  • struct block_device lifecycle is defined by its inode (see fs/block_dev.c) -
    block_device allocated first time we access /dev/loopXX and deallocated on
    bdev_destroy_inode. When we create the device "losetup /dev/loopXX afile"
    we want that block_device stay alive until we destroy the loop device
    with "losetup -d".

    But because we do not hold /dev/loopXX inode its counter goes 0, and
    inode/bdev can be destroyed at any moment. Usually it happens at memory
    pressure or when user drops inode cache (like in the test below). When later in
    loop_clr_fd() we want to use bdev we have use-after-free error with following
    stack:

    BUG: unable to handle kernel NULL pointer dereference at 0000000000000280
    bd_set_size+0x10/0xa0
    loop_clr_fd+0x1f8/0x420 [loop]
    lo_ioctl+0x200/0x7e0 [loop]
    lo_compat_ioctl+0x47/0xe0 [loop]
    compat_blkdev_ioctl+0x341/0x1290
    do_filp_open+0x42/0xa0
    compat_sys_ioctl+0xc1/0xf20
    do_sys_open+0x16e/0x1d0
    sysenter_dispatch+0x7/0x1a

    To prevent use-after-free we need to grab the device in loop_set_fd()
    and put it later in loop_clr_fd().

    The issue is reprodusible on current Linus head and v3.3. Here is the test:

    dd if=/dev/zero of=loop.file bs=1M count=1
    while [ true ]; do
    losetup /dev/loop0 loop.file
    echo 2 > /proc/sys/vm/drop_caches
    losetup -d /dev/loop0
    done

    [ Doing bdgrab/bput in loop_set_fd/loop_clr_fd is safe, because every
    time we call loop_set_fd() we check that loop_device->lo_state is
    Lo_unbound and set it to Lo_bound If somebody will try to set_fd again
    it will get EBUSY. And if we try to loop_clr_fd() on unbound loop
    device we'll get ENXIO.

    loop_set_fd/loop_clr_fd (and any other loop ioctl) is called under
    loop_device->lo_ctl_mutex. ]

    Signed-off-by: Anatol Pomozov
    Cc: Al Viro
    Signed-off-by: Linus Torvalds

    Anatol Pomozov
     

01 Mar, 2013

1 commit

  • Pull block IO core bits from Jens Axboe:
    "Below are the core block IO bits for 3.9. It was delayed a few days
    since my workstation kept crashing every 2-8h after pulling it into
    current -git, but turns out it is a bug in the new pstate code (divide
    by zero, will report separately). In any case, it contains:

    - The big cfq/blkcg update from Tejun and and Vivek.

    - Additional block and writeback tracepoints from Tejun.

    - Improvement of the should sort (based on queues) logic in the plug
    flushing.

    - _io() variants of the wait_for_completion() interface, using
    io_schedule() instead of schedule() to contribute to io wait
    properly.

    - Various little fixes.

    You'll get two trivial merge conflicts, which should be easy enough to
    fix up"

    Fix up the trivial conflicts due to hlist traversal cleanups (commit
    b67bfe0d42ca: "hlist: drop the node parameter from iterators").

    * 'for-3.9/core' of git://git.kernel.dk/linux-block: (39 commits)
    block: remove redundant check to bd_openers()
    block: use i_size_write() in bd_set_size()
    cfq: fix lock imbalance with failed allocations
    drivers/block/swim3.c: fix null pointer dereference
    block: don't select PERCPU_RWSEM
    block: account iowait time when waiting for completion of IO request
    sched: add wait_for_completion_io[_timeout]
    writeback: add more tracepoints
    block: add block_{touch|dirty}_buffer tracepoint
    buffer: make touch_buffer() an exported function
    block: add @req to bio_{front|back}_merge tracepoints
    block: add missing block_bio_complete() tracepoint
    block: Remove should_sort judgement when flush blk_plug
    block,elevator: use new hashtable implementation
    cfq-iosched: add hierarchical cfq_group statistics
    cfq-iosched: collect stats from dead cfqgs
    cfq-iosched: separate out cfqg_stats_reset() from cfq_pd_reset_stats()
    blkcg: make blkcg_print_blkgs() grab q locks instead of blkcg lock
    block: RCU free request_queue
    blkcg: implement blkg_[rw]stat_recursive_sum() and blkg_[rw]stat_merge()
    ...

    Linus Torvalds
     

27 Feb, 2013

1 commit

  • Pull vfs pile (part one) from Al Viro:
    "Assorted stuff - cleaning namei.c up a bit, fixing ->d_name/->d_parent
    locking violations, etc.

    The most visible changes here are death of FS_REVAL_DOT (replaced with
    "has ->d_weak_revalidate()") and a new helper getting from struct file
    to inode. Some bits of preparation to xattr method interface changes.

    Misc patches by various people sent this cycle *and* ocfs2 fixes from
    several cycles ago that should've been upstream right then.

    PS: the next vfs pile will be xattr stuff."

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (46 commits)
    saner proc_get_inode() calling conventions
    proc: avoid extra pde_put() in proc_fill_super()
    fs: change return values from -EACCES to -EPERM
    fs/exec.c: make bprm_mm_init() static
    ocfs2/dlm: use GFP_ATOMIC inside a spin_lock
    ocfs2: fix possible use-after-free with AIO
    ocfs2: Fix oops in ocfs2_fast_symlink_readpage() code path
    get_empty_filp()/alloc_file() leave both ->f_pos and ->f_version zero
    target: writev() on single-element vector is pointless
    export kernel_write(), convert open-coded instances
    fs: encode_fh: return FILEID_INVALID if invalid fid_type
    kill f_vfsmnt
    vfs: kill FS_REVAL_DOT by adding a d_weak_revalidate dentry op
    nfsd: handle vfs_getattr errors in acl protocol
    switch vfs_getattr() to struct path
    default SET_PERSONALITY() in linux/elf.h
    ceph: prepopulate inodes only when request is aborted
    d_hash_and_lookup(): export, switch open-coded instances
    9p: switch v9fs_set_create_acl() to inode+fid, do it before d_instantiate()
    9p: split dropping the acls from v9fs_set_create_acl()
    ...

    Linus Torvalds
     

23 Feb, 2013

1 commit


22 Feb, 2013

2 commits

  • bd_openers is stable under bd_mutex, no need to check it twice.

    Signed-off-by: Guo Chao
    Cc: Alexander Viro
    Cc: Guo Chao
    Cc: M. Hindess
    Cc: Nikanth Karthikesan
    Cc: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Jens Axboe

    Guo Chao
     
  • blkdev_ioctl(GETBLKSIZE) uses i_size_read() to read size of block device.
    If we update block size directly, reader may see intermediate result in
    some machines and configurations. Use i_size_write() instead.

    Signed-off-by: Guo Chao
    Cc: Alexander Viro
    Cc: Guo Chao
    Cc: M. Hindess
    Cc: Nikanth Karthikesan
    Cc: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Jens Axboe

    Guo Chao