04 Aug, 2012

1 commit


30 May, 2012

1 commit

  • Merge block/IO core bits from Jens Axboe:
    "This is a bit bigger on the core side than usual, but that is purely
    because we decided to hold off on parts of Tejun's submission on 3.4
    to give it a bit more time to simmer. As a consequence, it's seen a
    long cycle in for-next.

    It contains:

    - Bug fix from Dan, wrong locking type.
    - Relax splice gifting restriction from Eric.
    - A ton of updates from Tejun, primarily for blkcg. This improves
    the code a lot, making the API nicer and cleaner, and also includes
    fixes for how we handle and tie policies and re-activate on
    switches. The changes also include generic bug fixes.
    - A simple fix from Vivek, along with a fix for doing proper delayed
    allocation of the blkcg stats."

    Fix up annoying conflict just due to different merge resolution in
    Documentation/feature-removal-schedule.txt

    * 'for-3.5/core' of git://git.kernel.dk/linux-block: (92 commits)
    blkcg: tg_stats_alloc_lock is an irq lock
    vmsplice: relax alignement requirements for SPLICE_F_GIFT
    blkcg: use radix tree to index blkgs from blkcg
    blkcg: fix blkcg->css ref leak in __blkg_lookup_create()
    block: fix elvpriv allocation failure handling
    block: collapse blk_alloc_request() into get_request()
    blkcg: collapse blkcg_policy_ops into blkcg_policy
    blkcg: embed struct blkg_policy_data in policy specific data
    blkcg: mass rename of blkcg API
    blkcg: style cleanups for blk-cgroup.h
    blkcg: remove blkio_group->path[]
    blkcg: blkg_rwstat_read() was missing inline
    blkcg: shoot down blkgs if all policies are deactivated
    blkcg: drop stuff unused after per-queue policy activation update
    blkcg: implement per-queue policy activation
    blkcg: add request_queue->root_blkg
    blkcg: make request_queue bypassing on allocation
    blkcg: make sure blkg_lookup() returns %NULL if @q is bypassing
    blkcg: make blkg_conf_prep() take @pol and return with queue lock held
    blkcg: remove static policy ID enums
    ...

    Linus Torvalds
     

11 May, 2012

1 commit

  • The value returned by bio_get_nr_vecs() is passed down via bio_alloc() to
    bvec_alloc_bs(), which fails the bio allocation if
    nr_iovecs > BIO_MAX_PAGES. For the underlying caller this causes an
    unexpected bio allocation failure.
    Limiting to queue_max_segments() is not sufficient, as max_segments
    also might be very large.

    bvec_alloc_bs(gfp_mask, nr_iovecs, ) => NULL when nr_iovecs > BIO_MAX_PAGES
    bio_alloc_bioset(gfp_mask, nr_iovecs, ...)
    bio_alloc(GFP_NOIO, nvecs)
    xfs_alloc_ioend_bio()

    Signed-off-by: Bernd Schubert
    Cc: stable@kernel.org
    Signed-off-by: Jens Axboe

    Bernd Schubert
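
    The clamping described in this message can be sketched in user space as
    follows. This is a minimal illustration, not the kernel code: the helper
    names and the SKETCH_BIO_MAX_PAGES constant are hypothetical, and the value
    256 is an assumption about kernels of this era. The "divide and add 1"
    estimate comes from the 09 Feb, 2012 entry further down.

```c
/* Assumption for this sketch: 256 stands in for the kernel's BIO_MAX_PAGES. */
#define SKETCH_BIO_MAX_PAGES 256u

static unsigned int min_uint(unsigned int a, unsigned int b)
{
    return a < b ? a : b;
}

/*
 * Hypothetical stand-in for the vector-count estimate: take the smaller of
 * the queue's segment limit and its sector limit expressed in pages, then
 * clamp the result so a later allocation of that many vectors cannot be
 * rejected for exceeding the per-bio maximum.
 */
static unsigned int clamped_nr_vecs(unsigned int max_segments,
                                    unsigned int max_sectors,
                                    unsigned int page_size)
{
    unsigned int by_sectors = max_sectors / (page_size >> 9) + 1;
    unsigned int nr = min_uint(max_segments, by_sectors);

    return min_uint(nr, SKETCH_BIO_MAX_PAGES);
}
```

    With a queue that advertises 1024 segments and 65535 sectors, the
    unclamped estimate would be 1024; the clamp brings it down to 256, so the
    caller never asks for more vectors than one bio can hold.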
     

02 Apr, 2012

1 commit

  • cgroup/for-3.5 contains the following changes which blk-cgroup needs
    to proceed with the on-going cleanup.

    * Dynamic addition and removal of cftypes to make config/stat file
    handling modular for policies.

    * cgroup removal update to not wait for css references to drain to fix
    blkcg removal hang caused by cfq caching cfqgs.

    Pull in cgroup/for-3.5 into block/for-3.5/core. This causes the
    following conflicts in block/blk-cgroup.c.

    * 761b3ef50e "cgroup: remove cgroup_subsys argument from callbacks"
    conflicts with blkiocg_pre_destroy() addition and blkiocg_attach()
    removal. Resolved by removing @subsys from all subsys methods.

    * 676f7c8f84 "cgroup: relocate cftype and cgroup_subsys definitions in
    controllers" conflicts with ->pre_destroy() and ->attach() updates
    and removal of modular config. Resolved by dropping forward
    declarations of the methods and applying updates to the relocated
    blkio_subsys.

    * 4baf6e3325 "cgroup: convert all non-memcg controllers to the new
    cftype interface" builds upon the previous item. Resolved by adding
    ->base_cftypes to the relocated blkio_subsys.

    Signed-off-by: Tejun Heo

    Tejun Heo
     

07 Mar, 2012

1 commit

  • IO scheduling and cgroup are tied to the issuing task via io_context
    and cgroup of %current. Unfortunately, there are cases where IOs need
    to be routed via a different task which makes scheduling and cgroup
    limit enforcement applied completely incorrectly.

    For example, all bios delayed by blk-throttle end up being issued by a
    delayed work item and get assigned the io_context of the worker task
    which happens to serve the work item and dumped to the default block
    cgroup. This is doubly confusing as bios which aren't delayed end up
    in the correct cgroup, and it makes using blk-throttle and cfq propio
    together impossible.

    Any code which punts IO issuing to another task is affected which is
    getting more and more common (e.g. btrfs). As both io_context and
    cgroup are firmly tied to task including userland visible APIs to
    manipulate them, it makes a lot of sense to match up tasks to bios.

    This patch implements bio_associate_current() which associates the
    specified bio with %current. The bio will record the associated ioc
    and blkcg at that point and block layer will use the recorded ones
    regardless of which task actually ends up issuing the bio. bio
    release puts the associated ioc and blkcg.

    It grabs and remembers ioc and blkcg instead of the task itself
    because task may already be dead by the time the bio is issued making
    ioc and blkcg inaccessible and those are all block layer cares about.

    elevator_set_req_fn() is updated such that the bio for which elvdata
    is being allocated is available to the elevator.

    This doesn't update block cgroup policies yet. Further patches will
    implement the support.

    -v2: #ifdef CONFIG_BLK_CGROUP added around bio->bi_ioc dereference in
    rq_ioc() to fix build breakage.

    Signed-off-by: Tejun Heo
    Cc: Vivek Goyal
    Cc: Kent Overstreet
    Signed-off-by: Jens Axboe

    Tejun Heo
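
    The association scheme described in this message (record the submitter's
    contexts with a reference, use them at issue time, drop them on release)
    can be sketched in user space roughly as follows. All type and function
    names here are hypothetical stand-ins; the real kernel structures,
    reference counting and error codes differ.

```c
#include <stddef.h>

/* Hypothetical reference-counted context (stands in for ioc or blkcg). */
struct ctx_sketch { int refcount; };

/* Hypothetical task: owns one I/O context and one cgroup context. */
struct task_sketch { struct ctx_sketch *ioc; struct ctx_sketch *blkcg; };

/* Hypothetical bio: remembers the contexts, not the task itself. */
struct bio_sketch { struct ctx_sketch *ioc; struct ctx_sketch *blkcg; };

/* Record the current task's contexts in the bio, taking a reference on each. */
static int bio_associate_sketch(struct bio_sketch *bio, struct task_sketch *cur)
{
    if (bio->ioc || bio->blkcg)
        return -1;                      /* already associated */
    if (!cur->ioc || !cur->blkcg)
        return -1;                      /* nothing to record */

    cur->ioc->refcount++;
    cur->blkcg->refcount++;
    bio->ioc = cur->ioc;
    bio->blkcg = cur->blkcg;
    return 0;
}

/* At issue time, prefer the recorded context over the issuing task's own. */
static struct ctx_sketch *bio_effective_ioc(const struct bio_sketch *bio,
                                            const struct task_sketch *issuer)
{
    return bio->ioc ? bio->ioc : issuer->ioc;
}

/* Releasing the bio puts the references taken at association time. */
static void bio_release_sketch(struct bio_sketch *bio)
{
    if (bio->ioc)
        bio->ioc->refcount--;
    if (bio->blkcg)
        bio->blkcg->refcount--;
    bio->ioc = NULL;
    bio->blkcg = NULL;
}
```

    Holding references on the contexts rather than on the task is what lets
    the bio outlive the submitting task, as the message explains.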
     

29 Feb, 2012

1 commit


09 Feb, 2012

1 commit

  • There were two places bio_get_nr_vecs() could overflow:

    First, it did a left shift to convert from sectors to bytes immediately
    before dividing by PAGE_SIZE. If PAGE_SIZE ever was less than 512 a great
    many things would break, so dividing by PAGE_SIZE >> 9 is safe and will
    generate smaller code too.

    The nastier overflow was in the DIV_ROUND_UP() (that's what the code was
    effectively doing, anyways). If n + d overflowed, the whole thing would
    return 0 which breaks things rather effectively.

    bio_get_nr_vecs() doesn't claim to give an exact value anyways, so the
    DIV_ROUND_UP() is silly; we could do a straight divide except if a
    device's queue_max_sectors was less than PAGE_SIZE we'd return 0. So we
    just add 1; this should always be safe. Things will break badly if
    bio_get_nr_vecs() returns > BIO_MAX_PAGES (bio_alloc() will suddenly start
    failing), but it's queue_max_segments that must guard against this; if
    queue_max_sectors is what prevents this from happening, things are going
    to explode on architectures with a different PAGE_SIZE.

    Signed-off-by: Kent Overstreet
    Cc: Tejun Heo
    Acked-by: Valdis Kletnieks
    Signed-off-by: Andrew Morton
    Signed-off-by: Jens Axboe

    Kent Overstreet
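
    The overflow in the round-up division can be reproduced in plain C. The
    sketch below is only an illustration under stated assumptions: the
    function names are hypothetical, and the second helper mirrors the
    "divide by PAGE_SIZE >> 9, then add 1" approach the message describes
    rather than quoting the kernel source.

```c
#include <limits.h>

/*
 * Classic round-up division. The intermediate sum n + d - 1 wraps around
 * when n is close to UINT_MAX, so the result can collapse to 0.
 */
static unsigned int div_round_up_naive(unsigned int n, unsigned int d)
{
    return (n + d - 1) / d;
}

/*
 * Overflow-free estimate: divide the sector count by the number of
 * 512-byte sectors per page, then add 1 so that a limit smaller than one
 * page still yields at least one vector. The result may overestimate by
 * one, which is acceptable for a value that is only an upper bound.
 */
static unsigned int nr_vecs_estimate(unsigned int max_sectors,
                                     unsigned int page_size)
{
    unsigned int sectors_per_page = page_size >> 9;

    return max_sectors / sectors_per_page + 1;
}
```

    For example, div_round_up_naive(UINT_MAX, 8) wraps and yields 0, while
    nr_vecs_estimate(UINT_MAX, 4096) stays non-zero, and
    nr_vecs_estimate(4, 4096) gives 1 instead of 0 for a device whose sector
    limit is below one page.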
     

16 Nov, 2011

1 commit

  • This is just a cleanup patch to silence a static checker warning.

    The problem is that we cap "nr_iovecs" so it can't be larger than
    "UIO_MAXIOV" but we don't check for negative values. It turns out this is
    prevented at other layers, but logically it doesn't make sense to have
    negative nr_iovecs so making it unsigned is nicer.

    Signed-off-by: Dan Carpenter
    Signed-off-by: Andrew Morton
    Signed-off-by: Jens Axboe

    Dan Carpenter
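
    Why the unsigned type is the tidier choice can be seen in a small
    stand-alone example. The two checker functions are hypothetical; the
    SKETCH_UIO_MAXIOV constant is a local stand-in assumed to match the
    kernel's UIO_MAXIOV value of 1024.

```c
/* Local stand-in for the kernel constant; 1024 is assumed here. */
#define SKETCH_UIO_MAXIOV 1024

/* Signed parameter: an upper-bound check alone lets negative counts pass. */
static int cap_ok_signed(int nr_iovecs)
{
    return !(nr_iovecs > SKETCH_UIO_MAXIOV);
}

/*
 * Unsigned parameter: a negative value passed by a caller converts to a
 * huge positive one, so the same single comparison rejects it.
 */
static int cap_ok_unsigned(unsigned int nr_iovecs)
{
    return !(nr_iovecs > SKETCH_UIO_MAXIOV);
}
```

    Here cap_ok_signed(-1) accepts the value, while cap_ok_unsigned receiving
    the same -1 (converted to unsigned) rejects it without a second test.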
     

24 Oct, 2011

1 commit

  • bio originally had the ability to set the completion CPU, but
    it is broken.

    Christoph said that "This code is unused, and from the all the
    discussions lately pretty obviously broken. The only thing keeping
    it serves is creating more confusion and possibly more bugs."

    And Jens replied with "We can kill bio_set_completion_cpu(). I'm fine
    with leaving cpu control to the request based drivers, they are the
    only ones that can toggle the setting anyway".

    So this patch removes all of the code for controlling the completion
    CPU from a bio.

    Cc: Shaohua Li
    Cc: Christoph Hellwig
    Signed-off-by: Tao Ma
    Signed-off-by: Jens Axboe

    Tao Ma
     

27 May, 2011

1 commit


31 Mar, 2011

1 commit


25 Mar, 2011

1 commit

  • * 'for-2.6.39/core' of git://git.kernel.dk/linux-2.6-block: (65 commits)
    Documentation/iostats.txt: bit-size reference etc.
    cfq-iosched: removing unnecessary think time checking
    cfq-iosched: Don't clear queue stats when preempt.
    blk-throttle: Reset group slice when limits are changed
    blk-cgroup: Only give unaccounted_time under debug
    cfq-iosched: Don't set active queue in preempt
    block: fix non-atomic access to genhd inflight structures
    block: attempt to merge with existing requests on plug flush
    block: NULL dereference on error path in __blkdev_get()
    cfq-iosched: Don't update group weights when on service tree
    fs: assign sb->s_bdi to default_backing_dev_info if the bdi is going away
    block: Require subsystems to explicitly allocate bio_set integrity mempool
    jbd2: finish conversion from WRITE_SYNC_PLUG to WRITE_SYNC and explicit plugging
    jbd: finish conversion from WRITE_SYNC_PLUG to WRITE_SYNC and explicit plugging
    fs: make fsync_buffers_list() plug
    mm: make generic_writepages() use plugging
    blk-cgroup: Add unaccounted time to timeslice_used.
    block: fixup plugging stubs for !CONFIG_BLOCK
    block: remove obsolete comments for blkdev_issue_zeroout.
    blktrace: Use rq->cmd_flags directly in blk_add_trace_rq.
    ...

    Fix up conflicts in fs/{aio.c,super.c}

    Linus Torvalds
     

23 Mar, 2011

1 commit

  • printk()s without a priority level default to KERN_WARNING. To reduce
    noise at KERN_WARNING, this patch sets the priority level appropriately
    for unleveled printk()s. This should be useful to folks that look at
    dmesg warnings closely.

    Signed-off-by: Mandeep Singh Baines
    Cc: Jens Axboe
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mandeep Singh Baines
     

17 Mar, 2011

1 commit

  • MD and DM create a new bio_set for every metadevice. Each bio_set has an
    integrity mempool attached regardless of whether the metadevice is
    capable of passing integrity metadata. This is a waste of memory.

    Instead we defer the allocation decision to MD and DM since we know at
    metadevice creation time whether integrity passthrough is needed or not.

    Automatic integrity mempool allocation can then be removed from
    bioset_create() and we make an explicit integrity allocation for the
    fs_bio_set.

    Signed-off-by: Martin K. Petersen
    Reported-by: Zdenek Kabelac
    Acked-by: Mike Snitzer
    Signed-off-by: Jens Axboe

    Martin K. Petersen
     

08 Mar, 2011

1 commit


10 Nov, 2010

2 commits


08 Aug, 2010

1 commit

  • Remove the current bio flags and reuse the request flags for the bio, too.
    This makes it easier to trace the type of I/O from the filesystem
    down to the block driver. There were two flags in the bio that were
    missing in the requests: BIO_RW_UNPLUG and BIO_RW_AHEAD. Also I've
    renamed two request flags that had a superfluous RW in them.

    Note that the flags are in bio.h despite having the REQ_ name - as
    blkdev.h includes bio.h that is the only way to go for now.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

19 Mar, 2010

1 commit


08 Mar, 2010

2 commits

  • Conflicts:
    Documentation/filesystems/proc.txt
    arch/arm/mach-u300/include/mach/debug-macro.S
    drivers/net/qlge/qlge_ethtool.c
    drivers/net/qlge/qlge_main.c
    drivers/net/typhoon.c

    Jiri Kosina
     
  • merge_bvec_fn() returns bvec->bv_len on success, so we have to check
    against this value. But in the fs_optimization merge case we compare
    against the wrong value. This patch should have been part of
    b428cd6da7e6559aca69aa2e3a526037d3f20403,
    but I accidentally forgot to add it to the initial patch.
    To set things straight, let's replace all such checks.
    In fact, this makes the code easier to understand.

    Signed-off-by: Dmitry Monakhov
    Signed-off-by: Jens Axboe

    Dmitry Monakhov
     

03 Mar, 2010

1 commit


01 Mar, 2010

1 commit

  • merge_bvec_fn() returns bvec->bv_len on success, so we have to check
    against this value. But in the fs_optimization merge case we compare
    against the wrong value. This patch should have been part of
    b428cd6da7e6559aca69aa2e3a526037d3f20403,
    but I accidentally forgot to add it to the initial patch.
    To set things straight, let's replace all such checks.
    In fact, this makes the code easier to understand.

    Signed-off-by: Dmitry Monakhov
    Signed-off-by: Jens Axboe

    Dmitry Monakhov
     

26 Feb, 2010

1 commit


05 Feb, 2010

1 commit

  • In commit 451a9ebf653d28337ba53ed5b4b70b0b9543cca1 bio_alloc_bioset()
    was refactored not to take NULL as a valid argument for bs. This patch
    changes the comment for that function accordingly. Currently, passing
    NULL as argument to parameter bs would result in a NULL pointer
    dereference.

    Signed-off-by: Jaak Ristioja
    Signed-off-by: Jiri Kosina

    Jaak Ristioja
     

28 Jan, 2010

1 commit

  • We have to properly decrease bi_size in order for merge_bvec_fn to
    return the right result. Otherwise this results in false merge rejects
    for two absolutely valid bio_vecs. This may cause a significant
    performance penalty, for example when fs_block_size == 1k and the block
    device is raid0 with a small chunk_size = 8k. Then it is impossible to
    merge the 7th fs-block into a bio which already has 6 fs-blocks.

    Cc:
    Signed-off-by: Dmitry Monakhov
    Signed-off-by: Jens Axboe

    Dmitry Monakhov
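
    The false reject can be illustrated with a simplified model. Everything
    below is an assumption made for the sake of the example: the callback is
    a hypothetical striping-style function that reports the room left in the
    current chunk, the bio is assumed to start at the beginning of a chunk,
    and the last vector is assumed to already hold 2k (two 1k blocks in the
    same page). It is not the kernel's raid0 code.

```c
/*
 * Hypothetical merge callback: bytes that still fit in the current chunk,
 * given the bio's starting offset in the chunk and the bytes it already
 * accounts for (bi_size).
 */
static unsigned int chunk_room(unsigned int chunk_size,
                               unsigned int offset_in_chunk,
                               unsigned int bi_size)
{
    unsigned int used = offset_in_chunk + bi_size;

    return used >= chunk_size ? 0 : chunk_size - used;
}

/*
 * Try to grow the bio's last vector from prev_len to prev_len + add_len.
 * The grown vector is what gets offered to the callback, so its old length
 * must not also be counted in the reported bi_size. subtract_prev selects
 * between the corrected accounting (1) and the double-counting one (0).
 */
static int can_extend_last_vec(unsigned int chunk_size, unsigned int bi_size,
                               unsigned int prev_len, unsigned int add_len,
                               int subtract_prev)
{
    unsigned int grown_len = prev_len + add_len;
    unsigned int reported = subtract_prev ? bi_size - prev_len : bi_size;

    return chunk_room(chunk_size, 0, reported) >= grown_len;
}
```

    With an 8k chunk and a bio holding six 1k blocks, adding the seventh
    block is rejected when the last vector's 2k is counted twice (2k of room
    reported against a 3k vector) and accepted once bi_size is reduced first
    (4k of room against the same 3k vector).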
     

19 Jan, 2010

1 commit


10 Dec, 2009

1 commit

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (42 commits)
    tree-wide: fix misspelling of "definition" in comments
    reiserfs: fix misspelling of "journaled"
    doc: Fix a typo in slub.txt.
    inotify: remove superfluous return code check
    hdlc: spelling fix in find_pvc() comment
    doc: fix regulator docs cut-and-pasteism
    mtd: Fix comment in Kconfig
    doc: Fix IRQ chip docs
    tree-wide: fix assorted typos all over the place
    drivers/ata/libata-sff.c: comment spelling fixes
    fix typos/grammos in Documentation/edac.txt
    sysctl: add missing comments
    fs/debugfs/inode.c: fix comment typos
    sgivwfb: Make use of ARRAY_SIZE.
    sky2: fix sky2_link_down copy/paste comment error
    tree-wide: fix typos "couter" -> "counter"
    tree-wide: fix typos "offest" -> "offset"
    fix kerneldoc for set_irq_msi()
    spidev: fix double "of of" in comment
    comment typo fix: sybsystem -> subsystem
    ...

    Linus Torvalds
     

04 Dec, 2009

1 commit

  • That is "success", "unknown", "through", "performance", "[re|un]mapping"
    , "access", "default", "reasonable", "[con]currently", "temperature"
    , "channel", "[un]used", "application", "example","hierarchy", "therefore"
    , "[over|under]flow", "contiguous", "threshold", "enough" and others.

    Signed-off-by: André Goddard Rosa
    Signed-off-by: Jiri Kosina

    André Goddard Rosa
     

26 Nov, 2009

1 commit

  • The mtdblock driver doesn't call flush_dcache_page() for pages in a
    request. This causes problems on architectures where the icache doesn't
    fill from the dcache or with dcache aliases. The patch fixes this.

    The ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE symbol was introduced to avoid
    pointless empty cache-thrashing loops on architectures for which
    flush_dcache_page() is a no-op. Every architecture was provided with this
    symbol: pages are flushed on architectures where
    ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE equals 1, and nothing is done otherwise.

    See "fix mtd_blkdevs problem with caches on some architectures" discussion
    on LKML for more information.

    Signed-off-by: Ilya Loginov
    Cc: Ingo Molnar
    Cc: David Woodhouse
    Cc: Peter Horton
    Cc: "Ed L. Cashin"
    Signed-off-by: Jens Axboe

    Ilya Loginov
     

02 Nov, 2009

2 commits


02 Oct, 2009

1 commit


11 Jul, 2009

1 commit

  • I overlooked SG_DXFER_TO_FROM_DEV support when I converted sg to use
    the block layer mapping API (2.6.28).

    Douglas Gilbert explained SG_DXFER_TO_FROM_DEV:

    http://www.spinics.net/lists/linux-scsi/msg37135.html

    =
    The semantics of SG_DXFER_TO_FROM_DEV were:
    - copy user space buffer to kernel (LLD) buffer
    - do SCSI command which is assumed to be of the DATA_IN
    (data from device) variety. This would overwrite
    some or all of the kernel buffer
    - copy kernel (LLD) buffer back to the user space.

    The idea was to detect short reads by filling the original
    user space buffer with some marker bytes ("0xec" it would
    seem in this report). The "resid" value is a better way
    of detecting short reads but that was only added this century
    and requires co-operation from the LLD.
    =

    This patch changes the block layer mapping API to support these
    semantics. It simply adds another field to struct rq_map_data and
    enables __bio_copy_iov() to copy data from user space even with READ
    requests.

    It would be better to add a flags field and kill the null_mapped and
    the new from_user fields in struct rq_map_data, but that approach makes
    it difficult to send this patch to stable trees because the st and osst
    drivers use struct rq_map_data (they were converted to use the block
    layer in 2.6.29 and 2.6.30). Well, I should clean up the block layer
    mapping API.

    zhou sf reported this regression and tested this patch:

    http://www.spinics.net/lists/linux-scsi/msg37128.html
    http://www.spinics.net/lists/linux-scsi/msg37168.html

    Reported-by: zhou sf
    Tested-by: zhou sf
    Cc: stable@kernel.org
    Signed-off-by: FUJITA Tomonori
    Signed-off-by: Jens Axboe

    FUJITA Tomonori
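
    The three-step semantics quoted above can be modelled in user space. This
    is a sketch only: the "device" is a hypothetical function that fills part
    of a buffer, memcpy() stands in for the user/kernel copies, and none of
    the names correspond to the sg driver or block layer API.

```c
#include <stddef.h>
#include <string.h>

#define DEVICE_BYTE 0xAB   /* hypothetical data pattern returned by the device */

/* Hypothetical DATA_IN command: the device delivers only `avail` bytes. */
static size_t fake_device_read(unsigned char *kbuf, size_t len, size_t avail)
{
    size_t n = avail < len ? avail : len;

    memset(kbuf, DEVICE_BYTE, n);
    return n;
}

/* The copy-in, read, copy-out sequence described in the message. */
static void to_from_dev_read(unsigned char *ubuf, unsigned char *kbuf,
                             size_t len, size_t avail)
{
    memcpy(kbuf, ubuf, len);             /* 1. user buffer -> kernel buffer */
    fake_device_read(kbuf, len, avail);  /* 2. device overwrites some or all */
    memcpy(ubuf, kbuf, len);             /* 3. kernel buffer -> user buffer */
}

/* Count marker bytes left at the tail, i.e. bytes the device never wrote. */
static size_t count_marker_tail(const unsigned char *ubuf, size_t len,
                                unsigned char marker)
{
    size_t n = 0;

    while (n < len && ubuf[len - 1 - n] == marker)
        n++;
    return n;
}
```

    If the caller pre-fills an 8-byte buffer with 0xEC and the device returns
    only 5 bytes, the last 3 bytes still carry the marker afterwards, which is
    how a short read is detected without a residual count. Skipping step 1
    would instead copy back whatever happened to be in the kernel buffer.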
     

01 Jul, 2009

1 commit

  • This patch restores stacking ability to the block layer integrity
    infrastructure by creating a set of dedicated bip slabs. Each bip slab
    has an embedded bio_vec array at the end. This cuts down on memory
    allocations and also simplifies the code compared to the original bvec
    version. Only the largest bip slab is backed by a mempool. The pool is
    contained in the bio_set so stacking drivers can ensure forward
    progress.

    Signed-off-by: Martin K. Petersen
    Signed-off-by: Jens Axboe

    Martin K. Petersen
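
    The "embedded bio_vec array at the end" layout corresponds to a C
    flexible array member, where header and vectors come from a single
    allocation. The sketch below uses hypothetical struct names and calloc()
    in place of the kernel's slab caches and mempool.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a bio_vec. */
struct vec_sketch {
    void *page;
    unsigned int len;
    unsigned int offset;
};

/* Hypothetical integrity payload with its vector array embedded at the end. */
struct payload_sketch {
    unsigned int nr_vecs;
    struct vec_sketch vecs[];   /* flexible array member */
};

/* One allocation covers the header and nr_vecs trailing vectors. */
static struct payload_sketch *payload_alloc(unsigned int nr_vecs)
{
    struct payload_sketch *p;

    p = calloc(1, sizeof(*p) + (size_t)nr_vecs * sizeof(p->vecs[0]));
    if (p)
        p->nr_vecs = nr_vecs;
    return p;
}
```

    Compared with allocating the header and the vector array separately,
    this needs one allocation and one free per payload, which matches the
    reduction in memory allocations the message mentions.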
     

16 Jun, 2009

1 commit


13 Jun, 2009

1 commit


12 Jun, 2009

1 commit

  • * 'for-2.6.31' of git://git.kernel.dk/linux-2.6-block: (153 commits)
    block: add request clone interface (v2)
    floppy: fix hibernation
    ramdisk: remove long-deprecated "ramdisk=" boot-time parameter
    fs/bio.c: add missing __user annotation
    block: prevent possible io_context->refcount overflow
    Add serial number support for virtio_blk, V4a
    block: Add missing bounce_pfn stacking and fix comments
    Revert "block: Fix bounce limit setting in DM"
    cciss: decode unit attention in SCSI error handling code
    cciss: Remove no longer needed sendcmd reject processing code
    cciss: change SCSI error handling routines to work with interrupts enabled.
    cciss: separate error processing and command retrying code in sendcmd_withirq_core()
    cciss: factor out fix target status processing code from sendcmd functions
    cciss: simplify interface of sendcmd() and sendcmd_withirq()
    cciss: factor out core of sendcmd_withirq() for use by SCSI error handling code
    cciss: Use schedule_timeout_uninterruptible in SCSI error handling code
    block: needs to set the residual length of a bidi request
    Revert "block: implement blkdev_readpages"
    block: Fix bounce limit setting in DM
    Removed reference to non-existing file Documentation/PCI/PCI-DMA-mapping.txt
    ...

    Manually fix conflicts with tracing updates in:
    block/blk-sysfs.c
    drivers/ide/ide-atapi.c
    drivers/ide/ide-cd.c
    drivers/ide/ide-floppy.c
    drivers/ide/ide-tape.c
    include/trace/events/block.h
    kernel/trace/blktrace.c

    Linus Torvalds
     

11 Jun, 2009

1 commit

  • As reported by sparse:

    fs/bio.c:720:13: warning: incorrect type in assignment (different address spaces)
    fs/bio.c:720:13: expected char *iov_addr
    fs/bio.c:720:13: got void [noderef] *
    fs/bio.c:724:36: warning: incorrect type in argument 2 (different address spaces)
    fs/bio.c:724:36: expected void const [noderef] *from
    fs/bio.c:724:36: got char *iov_addr

    Signed-off-by: Michal Simek
    Signed-off-by: Andrew Morton
    Signed-off-by: Jens Axboe

    Michal Simek
     

10 Jun, 2009

1 commit

  • TRACE_EVENT is a more generic way to define tracepoints. Doing so adds
    these new capabilities to this tracepoint:

    - zero-copy and per-cpu splice() tracing
    - binary tracing without printf overhead
    - structured logging records exposed under /debug/tracing/events
    - trace events embedded in function tracer output and other plugins
    - user-defined, per tracepoint filter expressions
    ...

    Cons:

    - no dev_t info for the output of plug, unplug_timer and unplug_io events.
    no dev_t info for getrq and sleeprq events if bio == NULL.
    no dev_t info for rq_abort,...,rq_requeue events if rq->rq_disk == NULL.

    This is mainly because we can't get the device from a request queue.
    But this may change in the future.

    - A packet command is converted to a string in TP_assign, not TP_print,
    whereas blktrace does the conversion just before output.

    Since pc requests should be rather rare, this is not a big issue.

    - In blktrace, an event can have 2 different print formats, but a TRACE_EVENT
    has a unique format, which means we have some unused data in a trace entry.

    The overhead is minimized by using __dynamic_array() instead of __array().

    I've benchmarked the ioctl blktrace vs the splice based TRACE_EVENT tracing:

         dd                  dd + ioctl blktrace   dd + TRACE_EVENT (splice)
    1    7.36s, 42.7 MB/s    7.50s, 42.0 MB/s      7.41s, 42.5 MB/s
    2    7.43s, 42.3 MB/s    7.48s, 42.1 MB/s      7.43s, 42.4 MB/s
    3    7.38s, 42.6 MB/s    7.45s, 42.2 MB/s      7.41s, 42.5 MB/s

    So the overhead of tracing is very small, and no regression when using
    those trace events vs blktrace.

    And the binary output of TRACE_EVENT is much smaller than blktrace:

    # ls -l -h
    -rw-r--r-- 1 root root 8.8M 06-09 13:24 sda.blktrace.0
    -rw-r--r-- 1 root root 195K 06-09 13:24 sda.blktrace.1
    -rw-r--r-- 1 root root 2.7M 06-09 13:25 trace_splice.out

    Following are some comparisons between TRACE_EVENT and blktrace:

    plug:
    kjournald-480 [000] 303.084981: block_plug: [kjournald]
    kjournald-480 [000] 303.084981: 8,0 P N [kjournald]

    unplug_io:
    kblockd/0-118 [000] 300.052973: block_unplug_io: [kblockd/0] 1
    kblockd/0-118 [000] 300.052974: 8,0 U N [kblockd/0] 1

    remap:
    kjournald-480 [000] 303.085042: block_remap: 8,0 W 102736992 + 8

    Changelog from v2 -> v3:

    - use the newly introduced __dynamic_array().

    Changelog from v1 -> v2:

    - use __string() instead of __array() to minimize the memory required
    to store hex dump of rq->cmd().

    - support large pc requests.

    - add missing blk_fill_rwbs_rq() in block_rq_requeue TRACE_EVENT.

    - some cleanups.

    Signed-off-by: Li Zefan
    LKML-Reference:
    Signed-off-by: Steven Rostedt

    Li Zefan