06 May, 2016

1 commit

  • Commit 326e1dbb57 ("block: remove management of bi_remaining when
    restoring original bi_end_io") made bio_inc_remaining() private to bio.c
    because the only use-case that made sense was confined to the
    bio_chain() interface.

    Since that time DM thinp went on to use bio_chain() in its relatively
    complex implementation of async discard support. That implementation,
    even when converted over to use the new async __blkdev_issue_discard()
    interface, depends on deferred completion of the original discard bio --
    which is most appropriately implemented using bio_inc_remaining().

    DM thinp foolishly duplicated bio_inc_remaining(), local to dm-thin.c as
    __bio_inc_remaining(), so re-exporting bio_inc_remaining() allows us to
    put an end to that foolishness.

    All said, bio_inc_remaining() should really only be used in conjunction
    with bio_chain(). It isn't intended for generic bio reference counting.

    Signed-off-by: Mike Snitzer
    Acked-by: Joe Thornber
    Signed-off-by: Jens Axboe

    Mike Snitzer
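    A minimal sketch of deferred completion using bio_inc_remaining(), in the
    spirit of the dm-thin async discard described above (the helper names are
    hypothetical, not the dm-thin code):

        #include <linux/bio.h>

        /*
         * Take an extra __bi_remaining reference on the original bio so the
         * normal bio_endio() path does not run its ->bi_end_io until the
         * deferred work has finished as well.
         */
        static void defer_original_completion(struct bio *orig)
        {
                bio_inc_remaining(orig);
        }

        /* Called when the deferred work completes; pairs with the call above. */
        static void deferred_work_done(struct bio *orig, int error)
        {
                if (error)
                        orig->bi_error = error;
                bio_endio(orig);        /* drops the extra remaining reference */
        }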
     

05 Apr, 2016

2 commits

  • Mostly direct substitution with occasional adjustment or removing
    outdated comments.

    Signed-off-by: Kirill A. Shutemov
    Acked-by: Michal Hocko
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced a *long* time
    ago with the promise that one day it would be possible to implement the
    page cache with chunks bigger than PAGE_SIZE.

    That promise never materialized, and it is unlikely that it ever will.

    We have many places where PAGE_CACHE_SIZE is assumed to be equal to
    PAGE_SIZE, and it's a constant source of confusion whether the
    PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
    especially on the border between fs and mm.

    Globally switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause too much
    breakage to be doable.

    Let's stop pretending that pages in page cache are special. They are
    not.

    The changes are pretty straight-forward:

    - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> E;

    - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> E;

    - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};

    - page_cache_get() -> get_page();

    - page_cache_release() -> put_page();

    This patch contains automated changes generated with Coccinelle using the
    script below. For some reason, Coccinelle doesn't patch header files, so
    I've called spatch on them manually.

    The only adjustment after Coccinelle is a revert of the changes to the
    PAGE_CACHE_ALIGN definition: we are going to drop it later.

    There are a few places in the code that Coccinelle didn't reach; I'll
    fix them manually in a separate patch. Comments and documentation will
    also be addressed in a separate patch. (A before/after sketch of a
    typical conversion follows this entry.)

    virtual patch

    @@
    expression E;
    @@
    - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
    + E

    @@
    expression E;
    @@
    - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
    + E

    @@
    @@
    - PAGE_CACHE_SHIFT
    + PAGE_SHIFT

    @@
    @@
    - PAGE_CACHE_SIZE
    + PAGE_SIZE

    @@
    @@
    - PAGE_CACHE_MASK
    + PAGE_MASK

    @@
    expression E;
    @@
    - PAGE_CACHE_ALIGN(E)
    + PAGE_ALIGN(E)

    @@
    expression E;
    @@
    - page_cache_get(E)
    + get_page(E)

    @@
    expression E;
    @@
    - page_cache_release(E)
    + put_page(E)

    Signed-off-by: Kirill A. Shutemov
    Acked-by: Michal Hocko
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
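    A before/after illustration of the kind of mechanical conversion the
    semantic patch above performs (a hypothetical filesystem snippet, not
    taken from any particular file):

        #include <linux/pagemap.h>

        /* before */
        static void old_way(struct page *page, loff_t pos)
        {
                pgoff_t index  = pos >> PAGE_CACHE_SHIFT;
                size_t  offset = pos & (PAGE_CACHE_SIZE - 1);

                page_cache_get(page);
                /* ... use index/offset ... */
                page_cache_release(page);
        }

        /* after: PAGE_CACHE_* == PAGE_*, and the refcount helpers are the
         * generic page ones */
        static void new_way(struct page *page, loff_t pos)
        {
                pgoff_t index  = pos >> PAGE_SHIFT;
                size_t  offset = pos & (PAGE_SIZE - 1);

                get_page(page);
                /* ... use index/offset ... */
                put_page(page);
        }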
     

19 Mar, 2016

1 commit

  • Pull core block updates from Jens Axboe:
    "Here are the core block changes for this merge window. Not a lot of
    exciting stuff going on in this round, most of the changes have been
    on the driver side of things. That pull request is coming next. This
    pull request contains:

    - A set of fixes for chained bio handling from Christoph.

    - A tag bounds check for blk-mq from Hannes, ensuring that we don't
    do something stupid if a device reports an invalid tag value.

    - A set of fixes/updates for the CFQ IO scheduler from Jan Kara.

    - A set of blk-mq fixes from Keith, adding support for dynamic
    hardware queues, and fixing init of max_dev_sectors for stacking
    devices.

    - A fix for the dynamic hw context from Ming.

    - Enabling of cgroup writeback support on a block device, from
    Shaohua"

    * 'for-4.6/core' of git://git.kernel.dk/linux-block:
    blk-mq: add bounds check on tag-to-rq conversion
    block: bio_remaining_done() isn't unlikely
    block: cleanup bio_endio
    block: factor out chained bio completion
    block: don't unecessarily clobber bi_error for chained bios
    block-dev: enable writeback cgroup support
    blk-mq: Fix NULL pointer updating nr_requests
    blk-mq: mark request queue as mq asap
    block: Initialize max_dev_sectors to 0
    blk-mq: dynamic h/w context count
    cfq-iosched: Allow parent cgroup to preempt its child
    cfq-iosched: Allow sync noidle workloads to preempt each other
    cfq-iosched: Reorder checks in cfq_should_preempt()
    cfq-iosched: Don't group_idle if cfqq has big thinktime

    Linus Torvalds
     

14 Mar, 2016

4 commits


12 Feb, 2016

1 commit

  • Commit 35dc248383bbab0a7203fca4d722875bc81ef091 introduced a check for
    current->mm to see if we have a user-space context, and only copies data
    if we do. Now if an I/O gets interrupted by a signal, data isn't copied
    into user space any more (as we don't have a user-space context), but
    user space isn't notified about it.

    This patch modifies the behaviour to return -EINTR from bio_uncopy_user()
    to notify userland that a signal has interrupted the syscall; otherwise
    the caller may get back a buffer with no data in it.

    This can be reproduced by issuing SG_IO ioctl()s in one thread while
    constantly sending signals to it (a hypothetical reproducer sketch
    follows this entry).

    Fixes: 35dc248383bb ("[SCSI] sg: Fix user memory corruption when SG_IO is interrupted by a signal")
    Signed-off-by: Johannes Thumshirn
    Signed-off-by: Hannes Reinecke
    Cc: stable@vger.kernel.org # v3.11+
    Signed-off-by: Jens Axboe

    Hannes Reinecke
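    The reproducer sketch referenced above is a hypothetical userspace
    program, not taken from the patch; the device path, signal choice and
    timing are assumptions:

        #include <fcntl.h>
        #include <pthread.h>
        #include <scsi/sg.h>
        #include <signal.h>
        #include <stdio.h>
        #include <string.h>
        #include <sys/ioctl.h>
        #include <unistd.h>

        static void on_sig(int sig) { (void)sig; }  /* only interrupt the syscall */

        static void *sg_io_loop(void *arg)
        {
                int fd = *(int *)arg;
                unsigned char cdb[6] = { 0x12, 0, 0, 0, 96, 0 };  /* INQUIRY */
                unsigned char buf[96], sense[32];

                for (;;) {
                        struct sg_io_hdr hdr;

                        memset(&hdr, 0, sizeof(hdr));
                        hdr.interface_id = 'S';
                        hdr.dxfer_direction = SG_DXFER_FROM_DEV;
                        hdr.cmdp = cdb;
                        hdr.cmd_len = sizeof(cdb);
                        hdr.dxferp = buf;
                        hdr.dxfer_len = sizeof(buf);
                        hdr.sbp = sense;
                        hdr.mx_sb_len = sizeof(sense);
                        hdr.timeout = 5000;     /* ms */

                        if (ioctl(fd, SG_IO, &hdr) < 0)
                                perror("SG_IO"); /* EINTR expected with the fix */
                }
                return NULL;
        }

        int main(void)
        {
                struct sigaction sa = { .sa_handler = on_sig }; /* no SA_RESTART */
                pthread_t t;
                int fd = open("/dev/sg0", O_RDWR);      /* assumed sg device */

                if (fd < 0)
                        return 1;
                sigaction(SIGUSR1, &sa, NULL);
                pthread_create(&t, NULL, sg_io_loop, &fd);
                for (;;) {
                        pthread_kill(t, SIGUSR1);
                        usleep(1000);
                }
        }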
     

10 Feb, 2016

1 commit

  • When a process is doing random writes with the O_DSYNC flag, the I/O
    wait is not accounted in the kernel (get_cpu_iowait_time_us). This
    prevents the governor or the cpufreq driver from accounting for I/O
    wait and thus from selecting the right P-state.

    Signed-off-by: Stephane Gasparini
    Signed-off-by: Philippe Longepe
    Signed-off-by: Jens Axboe

    Stephane Gasparini
     

25 Nov, 2015

1 commit


07 Nov, 2015

1 commit

  • mm, page_alloc: distinguish between being unable to sleep, unwilling to
    sleep and avoiding waking kswapd

    __GFP_WAIT has been used to identify atomic context in callers that hold
    spinlocks or are in interrupts. They are expected to be high priority and
    have access one of two watermarks lower than "min" which can be referred
    to as the "atomic reserve". __GFP_HIGH users get access to the first
    lower watermark and can be called the "high priority reserve".

    Over time, callers had a requirement to not block when fallback options
    were available. Some have abused __GFP_WAIT leading to a situation where
    an optimistic allocation with a fallback option can access atomic
    reserves.

    This patch uses __GFP_ATOMIC to identify callers that are truly atomic,
    cannot sleep and have no alternative. High priority users continue to use
    __GFP_HIGH. __GFP_DIRECT_RECLAIM identifies callers that can sleep and
    are willing to enter direct reclaim. __GFP_KSWAPD_RECLAIM identifies
    callers that want to wake kswapd for background reclaim. __GFP_WAIT is
    redefined as a caller that is willing to enter direct reclaim and wake
    kswapd for background reclaim.

    This patch then converts a number of sites:

    o __GFP_ATOMIC is used by callers that are high priority and have memory
    pools for those requests. GFP_ATOMIC uses this flag.

    o Callers that have a limited mempool to guarantee forward progress clear
    __GFP_DIRECT_RECLAIM but keep __GFP_KSWAPD_RECLAIM. bio allocations fall
    into this category where kswapd will still be woken but atomic reserves
    are not used as there is a one-entry mempool to guarantee progress.

    o Callers that are checking whether they are non-blocking should use the
    helper gfpflags_allow_blocking() where possible (a sketch follows this
    entry). This is because checking for __GFP_WAIT, as was done
    historically, can now trigger false positives. Some exceptions like
    dm-crypt.c exist where the code intent is clearer if __GFP_DIRECT_RECLAIM
    is used instead of the helper due to flag manipulations.

    o Callers that built their own GFP flags instead of starting with GFP_KERNEL
    and friends now also need to specify __GFP_KSWAPD_RECLAIM.

    The first key hazard to watch out for is callers that removed __GFP_WAIT
    and were depending on access to atomic reserves for inconspicuous
    reasons. In some cases it may be appropriate for them to use __GFP_HIGH.

    The second key hazard is callers that assembled their own combination of
    GFP flags instead of starting with something like GFP_KERNEL. They may
    now wish to specify __GFP_KSWAPD_RECLAIM. It's almost certainly harmless
    if it's missed in most cases as other activity will wake kswapd.

    Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Acked-by: Michal Hocko <mhocko@suse.com>
    Acked-by: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Christoph Lameter <cl@linux.com>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Vitaly Wool <vitalywool@gmail.com>
    Cc: Rik van Riel <riel@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

    Mel Gorman
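    The sketch referenced above: a hypothetical caller (the function name and
    the vmalloc fallback are assumptions, not from the patch) that asks
    gfpflags_allow_blocking() instead of testing __GFP_WAIT to decide whether
    a sleeping fallback path is allowed.

        #include <linux/gfp.h>
        #include <linux/slab.h>
        #include <linux/vmalloc.h>

        static void *alloc_buffer(size_t size, gfp_t gfp)
        {
                void *p = kmalloc(size, gfp);

                /* Use the helper rather than open-coding a __GFP_WAIT test. */
                if (!p && gfpflags_allow_blocking(gfp))
                        p = vmalloc(size);      /* sleeping fallback; free with kvfree() */

                return p;
        }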
     

11 Sep, 2015

1 commit

  • Pull blk-cg updates from Jens Axboe:
    "A bit later in the cycle, but this has been in the block tree for a a
    while. This is basically four patchsets from Tejun, that improve our
    buffered cgroup writeback. It was dependent on the other cgroup
    changes, but they went in earlier in this cycle.

    Series 1 is set of 5 patches that has cgroup writeback updates:

    - bdi_writeback iteration fix which could lead to some wb's being
    skipped or repeated during e.g. sync under memory pressure.

    - Simplification of wb work wait mechanism.

    - Writeback tracepoints updated to report cgroup.

    Series 2 is a set of updates for the CFQ cgroup writeback handling:

    cfq has always charged all async IOs to the root cgroup. It didn't
    have much choice as writeback didn't know about cgroups and there
    was no way to tell who to blame for a given writeback IO.
    writeback finally grew support for cgroups and now tags each
    writeback IO with the appropriate cgroup to charge it against.

    This patchset updates cfq so that it follows the blkcg each bio is
    tagged with. Async cfq_queues are now shared across cfq_group,
    which is per-cgroup, instead of per-request_queue cfq_data. This
    makes all IOs follow the weight based IO resource distribution
    implemented by cfq.

    - Switched from GFP_ATOMIC to GFP_NOWAIT as suggested by Jeff.

    - Other misc review points addressed, acks added and rebased.

    Series 3 is the blkcg policy cleanup patches:

    This patchset contains assorted cleanups for blkcg_policy methods
    and blk[c]g_policy_data handling.

    - alloc/free added for blkg_policy_data. exit dropped.

    - alloc/free added for blkcg_policy_data.

    - blk-throttle's async percpu allocation is replaced with direct
    allocation.

    - all methods now take blk[c]g_policy_data instead of blkcg_gq or
    blkcg.

    And finally, series 4 is a set of patches cleaning up the blkcg stats
    handling:

    blkcg's stats have always been somewhat of a mess. This patchset
    tries to improve the situation a bit.

    - The following patches were added to consolidate the blkcg entry
    point and blkg creation. This in itself is an improvement and helps
    collecting common stats on bio issue.

    - per-blkg stats now accounted on bio issue rather than request
    completion so that bio based and request based drivers can behave
    the same way. The issue was spotted by Vivek.

    - cfq-iosched implements custom recursive stats and blk-throttle
    implements custom per-cpu stats. This patchset make blkcg core
    support both by default.

    - cfq-iosched and blk-throttle keep track of the same stats
    multiple times. Unify them"

    * 'for-4.3/blkcg' of git://git.kernel.dk/linux-block: (45 commits)
    blkcg: use CGROUP_WEIGHT_* scale for io.weight on the unified hierarchy
    blkcg: s/CFQ_WEIGHT_*/CFQ_WEIGHT_LEGACY_*/
    blkcg: implement interface for the unified hierarchy
    blkcg: misc preparations for unified hierarchy interface
    blkcg: separate out tg_conf_updated() from tg_set_conf()
    blkcg: move body parsing from blkg_conf_prep() to its callers
    blkcg: mark existing cftypes as legacy
    blkcg: rename subsystem name from blkio to io
    blkcg: refine error codes returned during blkcg configuration
    blkcg: remove unnecessary NULL checks from __cfqg_set_weight_device()
    blkcg: reduce stack usage of blkg_rwstat_recursive_sum()
    blkcg: remove cfqg_stats->sectors
    blkcg: move io_service_bytes and io_serviced stats into blkcg_gq
    blkcg: make blkg_[rw]stat_recursive_sum() to be able to index into blkcg_gq
    blkcg: make blkcg_[rw]stat per-cpu
    blkcg: add blkg_[rw]stat->aux_cnt and replace cfq_group->dead_stats with it
    blkcg: consolidate blkg creation in blkcg_bio_issue_check()
    blk-throttle: improve queue bypass handling
    blkcg: move root blkg lookup optimization from throtl_lookup_tg() to __blkg_lookup()
    blkcg: inline [__]blkg_lookup()
    ...

    Linus Torvalds
     

03 Sep, 2015

1 commit

  • Pull core block updates from Jens Axboe:
    "This first core part of the block IO changes contains:

    - Cleanup of the bio IO error signaling from Christoph. We used to
    rely on the uptodate bit and passing around of an error, now we
    store the error in the bio itself.

    - Improvement of the above from myself, by shrinking the bio size
    down again to fit in two cachelines on x86-64.

    - Revert of the max_hw_sectors cap removal from a revision again,
    from Jeff Moyer. This caused performance regressions in various
    tests. Reinstate the limit, bump it to a more reasonable size
    instead.

    - Make /sys/block/<dev>/queue/discard_max_bytes writeable, by me.
    Most devices have huge trim limits, which can cause nasty latencies
    when deleting files. Enable the admin to configure the size down.
    We will look into having a more sane default instead of UINT_MAX
    sectors.

    - Improvement of the SG gaps logic from Keith Busch.

    - Enable the block core to handle arbitrarily sized bios, which
    enables a nice simplification of bio_add_page() (which is an IO hot
    path). From Kent.

    - Improvements to the partition io stats accounting, making it
    faster. From Ming Lei.

    - Also from Ming Lei, a basic fixup for overflow of the sysfs pending
    file in blk-mq, as well as a fix for a blk-mq timeout race
    condition.

    - Ming Lin has been carrying Kent's above-mentioned patches forward
    for a while, and testing them. Ming also did a few fixes around
    that.

    - Sasha Levin found and fixed a use-after-free problem introduced by
    the bio->bi_error changes from Christoph.

    - Small blk cgroup cleanup from Viresh Kumar"

    * 'for-4.3/core' of git://git.kernel.dk/linux-block: (26 commits)
    blk: Fix bio_io_vec index when checking bvec gaps
    block: Replace SG_GAPS with new queue limits mask
    block: bump BLK_DEF_MAX_SECTORS to 2560
    Revert "block: remove artifical max_hw_sectors cap"
    blk-mq: fix race between timeout and freeing request
    blk-mq: fix buffer overflow when reading sysfs file of 'pending'
    Documentation: update notes in biovecs about arbitrarily sized bios
    block: remove bio_get_nr_vecs()
    fs: use helper bio_add_page() instead of open coding on bi_io_vec
    block: kill merge_bvec_fn() completely
    md/raid5: get rid of bio_fits_rdev()
    md/raid5: split bio for chunk_aligned_read
    block: remove split code in blkdev_issue_{discard,write_same}
    btrfs: remove bio splitting and merge_bvec_fn() calls
    bcache: remove driver private bio splitting code
    block: simplify bio_add_page()
    block: make generic_make_request handle arbitrarily sized bios
    blk-cgroup: Drop unlikely before IS_ERR(_OR_NULL)
    block: don't access bio->bi_error after bio_put()
    block: shrink struct bio down to 2 cache lines again
    ...

    Linus Torvalds
     

20 Aug, 2015

1 commit

  • The SG_GAPS queue flag caused checks for bio vector alignment against
    PAGE_SIZE, but the device may have different constraints. This patch
    adds a queue limit so a driver with such constraints can set it to allow
    requests that would otherwise have been unnecessarily split (a
    driver-side sketch follows this entry). The new gaps check takes the
    request_queue as a parameter to simplify the logic around invoking this
    function.

    This new limit makes the queue flag redundant, so remove it and all its
    usage. Device-mappers will inherit the correct settings through
    blk_stack_limits().

    Signed-off-by: Keith Busch
    Reviewed-by: Martin K. Petersen
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Keith Busch
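    The driver-side sketch referenced above assumes the new limit is the
    virt-boundary mask this series introduces; the driver name and chosen
    mask are hypothetical.

        #include <linux/blkdev.h>

        /*
         * Instead of setting the old SG_GAPS queue flag, express the
         * "no gaps between segments" constraint as a queue limit. An
         * NVMe-style device would use its page size here.
         */
        static void mydrv_set_queue_limits(struct request_queue *q)
        {
                blk_queue_virt_boundary(q, PAGE_SIZE - 1);
        }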
     

19 Aug, 2015

1 commit

  • blkio interface has become messy over time and is currently the
    largest. In addition to the inconsistent naming scheme, it has
    multiple stat files which report more or less the same thing, a number
    of debug stat files which expose internal details which shouldn't have
    been part of the public interface in the first place, recursive and
    non-recursive stats and leaf and non-leaf knobs.

    Both recursive vs. non-recursive and leaf vs. non-leaf distinctions
    don't make any sense on the unified hierarchy as only leaf cgroups can
    contain processes. cgroups is going through a major interface
    revision with the unified hierarchy involving significant fundamental
    usage changes and given that a significant portion of the interface
    doesn't make sense anymore, it's a good time to reorganize the
    interface.

    As the first step, this patch renames the external visible subsystem
    name from "blkio" to "io". This is more concise, matches the other
    two major subsystem names, "cpu" and "memory", and better suited as
    blkcg will be involved in anything writeback related too whether an
    actual block device is involved or not.

    As the subsystem legacy_name is set to "blkio", the only userland
    visible change outside the unified hierarchy is that blkcg is reported
    as "io" instead of "blkio" in the subsystem initialized message during
    boot. On the unified hierarchy, blkcg now appears as "io".

    Signed-off-by: Tejun Heo
    Cc: Li Zefan
    Cc: Johannes Weiner
    Cc: cgroups@vger.kernel.org
    Signed-off-by: Jens Axboe

    Tejun Heo
     

14 Aug, 2015

2 commits

  • We can always fill up the bio now, no need to estimate the possible
    size based on queue parameters.

    Acked-by: Steven Whitehouse
    Signed-off-by: Kent Overstreet
    [hch: rebased and wrote a changelog]
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Ming Lin
    Signed-off-by: Jens Axboe

    Kent Overstreet
     
  • Since generic_make_request() can now handle arbitrarily sized bios, all
    we have to do is make sure the bvec array doesn't overflow;
    __bio_add_page() no longer needs to call ->merge_bvec_fn(), so we can
    get rid of unnecessary code paths (a usage sketch follows this entry).

    Removing the call to ->merge_bvec_fn() is also fine, as no driver that
    implements support for BLOCK_PC commands even has a ->merge_bvec_fn()
    method.

    Cc: Christoph Hellwig
    Cc: Jens Axboe
    Signed-off-by: Kent Overstreet
    [dpark: rebase and resolve merge conflicts, change a couple of comments,
    make bio_add_page() warn once upon a cloned bio.]
    Signed-off-by: Dongsu Park
    Signed-off-by: Ming Lin
    Signed-off-by: Jens Axboe

    Kent Overstreet
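    The usage sketch referenced above is a hypothetical caller, not from the
    patch: with arbitrarily sized bios, filling a bio is just a matter of not
    overflowing the bvec array.

        #include <linux/bio.h>

        /*
         * Add whole pages to a bio until either all pages are added or the
         * bvec array is full; bio_add_page() no longer consults a
         * merge_bvec_fn.
         */
        static int add_pages_to_bio(struct bio *bio, struct page **pages, int nr)
        {
                int i;

                for (i = 0; i < nr; i++)
                        if (bio_add_page(bio, pages[i], PAGE_SIZE, 0) != PAGE_SIZE)
                                return i;       /* bvec array full: pages added so far */
                return nr;
        }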
     

29 Jul, 2015

2 commits

  • Some places use helpers now, others don't. We only have the 'is set'
    helper, so add helpers for setting and clearing flags too (a short
    sketch follows this entry).

    It was a bit of a mess of atomic vs non-atomic access. With
    BIO_UPTODATE gone, we don't have any risk of concurrent access to the
    flags, so relax the restriction and don't make any of them atomic. The
    flags that do have serialization issues (reffed and chained) are
    already handled separately.

    Signed-off-by: Jens Axboe

    Jens Axboe
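    The short sketch referenced above uses a hypothetical caller; the flag is
    just an example. The new non-atomic helpers replace open-coded bit
    operations on bio->bi_flags.

        #include <linux/bio.h>

        /* Suppress error messages for this bio. */
        static void quieten_bio(struct bio *bio)
        {
                if (!bio_flagged(bio, BIO_QUIET))
                        bio_set_flag(bio, BIO_QUIET);
        }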
     
  • Currently we have two different ways to signal an I/O error on a BIO:

    (1) by clearing the BIO_UPTODATE flag
    (2) by returning a Linux errno value to the bi_end_io callback

    The first one has the drawback of only communicating a single possible
    error (-EIO), and the second one has the drawbacks of not being
    persistent when bios are queued up and of not being passed along from
    child to parent bio in the ever more popular chaining scenario. Having
    both mechanisms available has the additional drawback of utterly
    confusing driver authors and introducing bugs where various I/O
    submitters only deal with one of them, and the others have to add
    boilerplate code to deal with both kinds of error returns.

    So add a new bi_error field to store an errno value directly in struct
    bio and remove the existing mechanisms to clean all this up (a
    completion-handler sketch follows this entry).

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Hannes Reinecke
    Reviewed-by: NeilBrown
    Signed-off-by: Jens Axboe

    Christoph Hellwig
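    The completion-handler sketch referenced above uses hypothetical driver
    code: with this change the error travels in bio->bi_error and
    ->bi_end_io takes only the bio.

        #include <linux/bio.h>
        #include <linux/printk.h>

        static void mydrv_end_io(struct bio *bio)
        {
                if (bio->bi_error)
                        pr_err("mydrv: I/O failed: %d\n", bio->bi_error);

                bio_put(bio);   /* drop the driver's reference to the bio */
        }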
     

24 Jul, 2015

2 commits

  • This fixes a data corruption bug when using discard on top of MD linear,
    raid0 and raid10 personalities.

    Commit 20d0189b1012 "block: Introduce new bio_split()" permits sharing
    the bio_vec between the two resulting bios. That is fine for read/write
    requests where the bio_vec is immutable. For discards, however, we need
    to be able to attach a payload and update the bio_vec so the page can
    get mapped to a scatterlist entry. Therefore the bio_vec can not be
    shared when splitting discards and we must do a full clone.

    Signed-off-by: Martin K. Petersen
    Reported-by: Seunguk Shin
    Tested-by: Seunguk Shin
    Cc: Seunguk Shin
    Cc: Jens Axboe
    Cc: Kent Overstreet
    Cc: # v3.14+
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Martin K. Petersen
     
  • bio_associate_blkcg(), bio_associate_current() and wbc_account_io()
    are used to implement cgroup writeback support for filesystems and
    thus need to be exported. Export them.

    Signed-off-by: Tejun Heo
    Reported-by: Stephen Rothwell
    Signed-off-by: Jens Axboe

    Tejun Heo
     

02 Jun, 2015

2 commits

  • Currently, a bio can only be associated with the io_context and blkcg
    of %current using bio_associate_current(). This is too restrictive
    for cgroup writeback support. Implement bio_associate_blkcg() which
    associates a bio with the specified blkcg.

    bio_associate_blkcg() leaves the io_context unassociated.
    bio_associate_current() is updated so that it considers a bio as
    already associated if it has a blkcg_css, instead of an io_context,
    associated with it.

    Signed-off-by: Tejun Heo
    Cc: Jens Axboe
    Cc: Vivek Goyal
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • bio_associate_current() currently open codes task_css() and
    css_tryget_online() to find and pin $current's blkcg css. Abstract it
    into task_get_css() which is implemented from cgroup side. As a task
    is always associated with an online css for every subsystem except
    while the css_set update is propagating, task_get_css() retries till
    css_tryget_online() succeeds.

    This is a cleanup and shouldn't lead to noticeable behavior changes.

    Signed-off-by: Tejun Heo
    Cc: Li Zefan
    Cc: Jens Axboe
    Cc: Vivek Goyal
    Signed-off-by: Jens Axboe

    Tejun Heo
     

22 May, 2015

1 commit

  • Commit c4cf5261 ("bio: skip atomic inc/dec of ->bi_remaining for
    non-chains") regressed all existing callers that followed this pattern:
    1) saving a bio's original bi_end_io
    2) wiring up an intermediate bi_end_io
    3) restoring the original bi_end_io from intermediate bi_end_io
    4) calling bio_endio() to execute the restored original bi_end_io

    The regression was due to BIO_CHAIN only ever getting set if
    bio_inc_remaining() is called. For the above pattern it isn't set until
    step 3 above (step 2 would've needed to establish BIO_CHAIN). As such
    the first bio_endio(), in step 2 above, never decremented __bi_remaining
    before calling the intermediate bi_end_io -- leaving __bi_remaining with
    the value 1 instead of 0. When bio_inc_remaining() occurred during step
    3 it brought it to a value of 2. When the second bio_endio() was
    called, in step 4 above, it should've called the original bi_end_io but
    it didn't because there was an extra reference that wasn't dropped (due
    to atomic operations being optimized away since BIO_CHAIN wasn't set
    upfront).

    Fix this issue by removing the __bi_remaining management complexity for
    all callers that use the above pattern -- bio_chain() is the only
    interface that _needs_ to be concerned with __bi_remaining. For the
    above pattern callers just expect the bi_end_io they set to get called!
    Remove bio_endio_nodec() and also remove all bio_inc_remaining() calls
    that aren't associated with the bio_chain() interface.

    Also, the bio_inc_remaining() interface has been moved local to bio.c.

    Fixes: c4cf5261 ("bio: skip atomic inc/dec of ->bi_remaining for non-chains")
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Jan Kara
    Signed-off-by: Mike Snitzer
    Signed-off-by: Jens Axboe

    Mike Snitzer
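    A sketch of the save/restore pattern described in the entry above, with
    hypothetical driver names (in this era ->bi_end_io and bio_endio() still
    carried an error argument): the intermediate handler simply restores the
    original completion and calls bio_endio() again, with no __bi_remaining
    manipulation.

        #include <linux/bio.h>

        /* Per-I/O bookkeeping; its lifetime is managed by the caller. */
        struct endio_hook {
                bio_end_io_t    *orig_end_io;   /* step 1: saved original */
                void            *orig_private;
        };

        /* step 2: wired up as the intermediate completion */
        static void intermediate_end_io(struct bio *bio, int error)
        {
                struct endio_hook *h = bio->bi_private;

                bio->bi_end_io = h->orig_end_io;        /* step 3: restore */
                bio->bi_private = h->orig_private;
                bio_endio(bio, error);                  /* step 4: runs the original */
        }

        static void hook_bio_endio(struct bio *bio, struct endio_hook *h)
        {
                h->orig_end_io = bio->bi_end_io;
                h->orig_private = bio->bi_private;
                bio->bi_end_io = intermediate_end_io;
                bio->bi_private = h;
        }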
     

06 May, 2015

2 commits

  • Struct bio has a reference count that controls when it can be freed.
    The most common use case is allocating the bio, which then returns with
    a single reference to it, doing IO, and then dropping that single
    reference. We can remove this atomic_dec_and_test() in the completion
    path, if nobody else is holding a reference to the bio.

    If someone does call bio_get() on the bio, then we flag the bio as
    now having valid count and that we must properly honor the reference
    count when it's being put.

    Tested-by: Robert Elliott
    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • Struct bio has an atomic ref count for chained bios, and we use this
    to know when to end IO on the bio. However, most bios are not chained,
    so we don't need to always introduce this atomic operation as part of
    ending IO.

    Add a helper to elevate the bi_remaining count, and flag the bio as
    now actually needing the decrement at end_io time. Rename the field
    to __bi_remaining to catch any current users of this doing the
    incrementing manually.

    For high IOPS workloads, this reduces the overhead of bio_endio()
    substantially.

    Tested-by: Robert Elliott
    Acked-by: Kent Overstreet
    Reviewed-by: Jan Kara
    Signed-off-by: Jens Axboe

    Jens Axboe
     

06 Feb, 2015

7 commits


12 Dec, 2014

1 commit

  • The original behaviour is to refuse to add a new page if the maximum
    number of segments has been reached, regardless of whether the page we
    are going to add can be merged into the last segment or not.

    Unfortunately, when the system runs under heavy memory fragmentation
    conditions, a driver may try to add multiple pages to the last segment.
    The original code won't accept them and EBUSY will be reported to
    userspace.

    This patch modifies the function so it refuses to add a page only in case
    the latter starts a new segment and the maximum number of segments has
    already been reached.

    The bug can be easily reproduced with the st driver:

    1) set CONFIG_SCSI_MPT2SAS_MAX_SGE or CONFIG_SCSI_MPT3SAS_MAX_SGE to 16
    2) modprobe st buffer_kbs=1024
    3) #dd if=/dev/zero of=/dev/st0 bs=1M count=10
    dd: error writing `/dev/st0': Device or resource busy

    Signed-off-by: Maurizio Lombardi
    Signed-off-by: Ming Lei
    Cc: Jet Chen
    Cc: Tomas Henzl
    Cc: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Jens Axboe

    Maurizio Lombardi
     

24 Nov, 2014

1 commit

  • Many block drivers account I/O stats based on the bio (e.g. NVMe...), so
    the request-based blk_account_io_start/end() does not make sense to
    them. Introduce similar helper functions named generic_start/end_io_acct()
    based on raw sectors, which can simplify some drivers' open-coded I/O
    accounting (a usage sketch follows this entry).

    Signed-off-by: Gu Zheng
    Signed-off-by: Jens Axboe

    Gu Zheng
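    The usage sketch referenced above shows a hypothetical bio-based driver
    path, using the helpers' original signatures:

        #include <linux/bio.h>
        #include <linux/genhd.h>
        #include <linux/jiffies.h>

        /* Account the bio with the new helpers instead of open-coding
         * part_stat updates. */
        static void mydrv_do_bio(struct gendisk *disk, struct bio *bio)
        {
                unsigned long start = jiffies;

                generic_start_io_acct(bio_data_dir(bio), bio_sectors(bio),
                                      &disk->part0);

                /* ... actually perform the I/O ... */

                generic_end_io_acct(bio_data_dir(bio), &disk->part0, start);
        }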
     

04 Oct, 2014

1 commit

  • Users of bio_clone_fast() do not want bios with their own bvecs.
    Allocating a bvec mempool as part of the bioset intended for such users
    is a waste of memory.

    bioset_create_nobvec() creates a bioset that doesn't have the bvec
    mempool.

    Signed-off-by: Jun'ichi Nomura
    Signed-off-by: Mike Snitzer
    Signed-off-by: Jens Axboe

    Junichi Nomura
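    A minimal sketch (hypothetical driver init; the pool size is arbitrary):
    a bioset used only with bio_clone_fast() can be created without the bvec
    mempool.

        #include <linux/bio.h>
        #include <linux/errno.h>

        static struct bio_set *clone_bs;        /* driver-wide bioset */

        static int mydrv_init_bioset(void)
        {
                /* pool size 64, no per-bio front padding */
                clone_bs = bioset_create_nobvec(64, 0);
                return clone_bs ? 0 : -ENOMEM;
        }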
     

02 Aug, 2014

1 commit

  • Various subsystems can ask the bio subsystem to create a bio slab cache
    with some free space before the bio. This free space can be used for any
    purpose. Device mapper uses this per-bio-data feature to place some
    target-specific and device-mapper specific data before the bio, so that
    the target-specific data doesn't have to be allocated separately.

    This per-bio-data mechanism is used in place of kmalloc, so we need the
    allocated slab to have the same memory alignment as memory allocated
    with kmalloc.

    Change bio_find_or_create_slab() so that it uses ARCH_KMALLOC_MINALIGN
    alignment when creating the slab cache. This is needed so that dm-crypt
    can use per-bio-data for encryption - the crypto subsystem assumes this
    data will have the same alignment as kmalloc'ed memory.

    Signed-off-by: Mikulas Patocka
    Signed-off-by: Mike Snitzer
    Acked-by: Jens Axboe

    Mikulas Patocka
     

25 Jun, 2014

1 commit

  • Another restriction inherited from NVMe - those devices don't support
    SG lists that have "gaps" in them. A gap refers to a case where the
    previous SG entry doesn't end on a page boundary. For NVMe, all SG
    entries must start at offset 0 (except the first) and end on a page
    boundary (except the last).

    Signed-off-by: Jens Axboe

    Jens Axboe
     

11 Jun, 2014

1 commit

  • Pull block layer fixes from Jens Axboe:
    "Final small batch of fixes to be included before -rc1. Some general
    cleanups in here as well, but some of the blk-mq fixes we need for the
    NVMe conversion and/or scsi-mq. The pull request contains:

    - Support for not merging across a specified "chunk size", if set by
    the driver. Some NVMe devices perform poorly for IO that crosses
    such a chunk, so we need to support it generically as part of
    request merging to avoid having to do complicated split logic. From
    me.

    - Bump max tag depth to 10Ki tags. Some scsi devices have a huge
    shared tag space. Before we failed with EINVAL if a too large tag
    depth was specified, now we truncate it and pass back the actual
    value. From me.

    - Various blk-mq rq init fixes from me and others.

    - A fix for enter on a dying queue for blk-mq from Keith. This is
    needed to prevent oopsing on hot device removal.

    - Fixup for blk-mq timer addition from Ming Lei.

    - Small round of performance fixes for mtip32xx from Sam Bradshaw.

    - Minor stack leak fix from Rickard Strandqvist.

    - Two __init annotations from Fabian Frederick"

    * 'for-linus' of git://git.kernel.dk/linux-block:
    block: add __init to blkcg_policy_register
    block: add __init to elv_register
    block: ensure that bio_add_page() always accepts a page for an empty bio
    blk-mq: add timer in blk_mq_start_request
    blk-mq: always initialize request->start_time
    block: blk-exec.c: Cleaning up local variable address returnd
    mtip32xx: minor performance enhancements
    blk-mq: ->timeout should be cleared in blk_mq_rq_ctx_init()
    blk-mq: don't allow queue entering for a dying queue
    blk-mq: bump max tag depth to 10K tags
    block: add blk_rq_set_block_pc()
    block: add notion of a chunk size for request merging

    Linus Torvalds