07 Nov, 2015

1 commit

  • …d avoiding waking kswapd

    __GFP_WAIT has been used to identify atomic context in callers that hold
    spinlocks or are in interrupts. They are expected to be high priority and
    have access to one of two watermarks lower than "min" which can be referred
    to as the "atomic reserve". __GFP_HIGH users get access to the first
    lower watermark and can be called the "high priority reserve".

    Over time, callers had a requirement to not block when fallback options
    were available. Some have abused __GFP_WAIT leading to a situation where
    an optimistic allocation with a fallback option can access atomic
    reserves.

    This patch uses __GFP_ATOMIC to identify callers that are truly atomic,
    cannot sleep and have no alternative. High priority users continue to use
    __GFP_HIGH. __GFP_DIRECT_RECLAIM identifies callers that can sleep and
    are willing to enter direct reclaim. __GFP_KSWAPD_RECLAIM identifies
    callers that want to wake kswapd for background reclaim. __GFP_WAIT is
    redefined as a caller that is willing to enter direct reclaim and wake
    kswapd for background reclaim.

    This patch then converts a number of sites

    o __GFP_ATOMIC is used by callers that are high priority and have memory
    pools for those requests. GFP_ATOMIC uses this flag.

    o Callers that have a limited mempool to guarantee forward progress clear
    __GFP_DIRECT_RECLAIM but keep __GFP_KSWAPD_RECLAIM. bio allocations fall
    into this category where kswapd will still be woken but atomic reserves
    are not used as there is a one-entry mempool to guarantee progress.

    o Callers that are checking if they are non-blocking should use the
    helper gfpflags_allow_blocking() where possible. This is because
    checking for __GFP_WAIT as was done historically now can trigger false
    positives. Some exceptions like dm-crypt.c exist where the code intent
    is clearer if __GFP_DIRECT_RECLAIM is used instead of the helper due to
    flag manipulations.

    o Callers that built their own GFP flags instead of starting with GFP_KERNEL
    and friends now also need to specify __GFP_KSWAPD_RECLAIM.

    The first key hazard to watch out for is callers that removed __GFP_WAIT
    and were depending on access to atomic reserves for inconspicuous reasons.
    In some cases it may be appropriate for them to use __GFP_HIGH.

    The second key hazard is callers that assembled their own combination of
    GFP flags instead of starting with something like GFP_KERNEL. They may
    now wish to specify __GFP_KSWAPD_RECLAIM. It's almost certainly harmless
    if it's missed in most cases as other activity will wake kswapd.
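
    A minimal sketch of how a caller might use the new helper after this
    change (the function name my_alloc is illustrative, not from the patch):

        #include <linux/gfp.h>
        #include <linux/slab.h>

        static void *my_alloc(size_t size, gfp_t gfp)
        {
                /* preferred over testing __GFP_WAIT/__GFP_DIRECT_RECLAIM
                 * directly when deciding whether blocking is allowed */
                if (!gfpflags_allow_blocking(gfp))
                        /* atomic context: no sleeping, no direct reclaim */
                        return kmalloc(size, gfp | __GFP_NOWARN);

                return kmalloc(size, gfp);      /* may sleep and reclaim */
        }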

    Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Acked-by: Michal Hocko <mhocko@suse.com>
    Acked-by: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Christoph Lameter <cl@linux.com>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Vitaly Wool <vitalywool@gmail.com>
    Cc: Rik van Riel <riel@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

    Mel Gorman
     

11 Sep, 2015

1 commit

  • Pull blk-cg updates from Jens Axboe:
    "A bit later in the cycle, but this has been in the block tree for a a
    while. This is basically four patchsets from Tejun, that improve our
    buffered cgroup writeback. It was dependent on the other cgroup
    changes, but they went in earlier in this cycle.

    Series 1 is set of 5 patches that has cgroup writeback updates:

    - bdi_writeback iteration fix which could lead to some wb's being
    skipped or repeated during e.g. sync under memory pressure.

    - Simplification of wb work wait mechanism.

    - Writeback tracepoints updated to report cgroup.

    Series 2 is a set of updates for the CFQ cgroup writeback handling:

    cfq has always charged all async IOs to the root cgroup. It didn't
    have much choice as writeback didn't know about cgroups and there
    was no way to tell who to blame for a given writeback IO.
    writeback finally grew support for cgroups and now tags each
    writeback IO with the appropriate cgroup to charge it against.

    This patchset updates cfq so that it follows the blkcg each bio is
    tagged with. Async cfq_queues are now shared across cfq_group,
    which is per-cgroup, instead of per-request_queue cfq_data. This
    makes all IOs follow the weight based IO resource distribution
    implemented by cfq.

    - Switched from GFP_ATOMIC to GFP_NOWAIT as suggested by Jeff.

    - Other misc review points addressed, acks added and rebased.

    Series 3 is the blkcg policy cleanup patches:

    This patchset contains assorted cleanups for blkcg_policy methods
    and blk[c]g_policy_data handling.

    - alloc/free added for blkg_policy_data. exit dropped.

    - alloc/free added for blkcg_policy_data.

    - blk-throttle's async percpu allocation is replaced with direct
    allocation.

    - all methods now take blk[c]g_policy_data instead of blkcg_gq or
    blkcg.

    And finally, series 4 is a set of patches cleaning up the blkcg stats
    handling:

    blkcg's stats have always been somewhat of a mess. This patchset
    tries to improve the situation a bit.

    - The following patches were added to consolidate the blkcg entry
    point and blkg creation. This is in itself an improvement and helps
    in collecting common stats on bio issue.

    - per-blkg stats now accounted on bio issue rather than request
    completion so that bio based and request based drivers can behave
    the same way. The issue was spotted by Vivek.

    - cfq-iosched implements custom recursive stats and blk-throttle
    implements custom per-cpu stats. This patchset makes blkcg core
    support both by default.

    - cfq-iosched and blk-throttle keep track of the same stats
    multiple times. Unify them"

    * 'for-4.3/blkcg' of git://git.kernel.dk/linux-block: (45 commits)
    blkcg: use CGROUP_WEIGHT_* scale for io.weight on the unified hierarchy
    blkcg: s/CFQ_WEIGHT_*/CFQ_WEIGHT_LEGACY_*/
    blkcg: implement interface for the unified hierarchy
    blkcg: misc preparations for unified hierarchy interface
    blkcg: separate out tg_conf_updated() from tg_set_conf()
    blkcg: move body parsing from blkg_conf_prep() to its callers
    blkcg: mark existing cftypes as legacy
    blkcg: rename subsystem name from blkio to io
    blkcg: refine error codes returned during blkcg configuration
    blkcg: remove unnecessary NULL checks from __cfqg_set_weight_device()
    blkcg: reduce stack usage of blkg_rwstat_recursive_sum()
    blkcg: remove cfqg_stats->sectors
    blkcg: move io_service_bytes and io_serviced stats into blkcg_gq
    blkcg: make blkg_[rw]stat_recursive_sum() to be able to index into blkcg_gq
    blkcg: make blkcg_[rw]stat per-cpu
    blkcg: add blkg_[rw]stat->aux_cnt and replace cfq_group->dead_stats with it
    blkcg: consolidate blkg creation in blkcg_bio_issue_check()
    blk-throttle: improve queue bypass handling
    blkcg: move root blkg lookup optimization from throtl_lookup_tg() to __blkg_lookup()
    blkcg: inline [__]blkg_lookup()
    ...

    Linus Torvalds
     

03 Sep, 2015

1 commit

  • Pull core block updates from Jens Axboe:
    "This first core part of the block IO changes contains:

    - Cleanup of the bio IO error signaling from Christoph. We used to
    rely on the uptodate bit and passing around of an error, now we
    store the error in the bio itself.

    - Improvement of the above from myself, by shrinking the bio size
    down again to fit in two cachelines on x86-64.

    - Revert of the max_hw_sectors cap removal from a revision again,
    from Jeff Moyer. This caused performance regressions in various
    tests. Reinstate the limit, bump it to a more reasonable size
    instead.

    - Make /sys/block/<dev>/queue/discard_max_bytes writeable, by me.
    Most devices have huge trim limits, which can cause nasty latencies
    when deleting files. Enable the admin to configure the size down.
    We will look into having a more sane default instead of UINT_MAX
    sectors.

    - Improvement of the SGP gaps logic from Keith Busch.

    - Enable the block core to handle arbitrarily sized bios, which
    enables a nice simplification of bio_add_page() (which is an IO hot
    path). From Kent.

    - Improvements to the partition io stats accounting, making it
    faster. From Ming Lei.

    - Also from Ming Lei, a basic fixup for overflow of the sysfs pending
    file in blk-mq, as well as a fix for a blk-mq timeout race
    condition.

    - Ming Lin has been carrying Kent's above-mentioned patches forward
    for a while, and testing them. Ming also did a few fixes around
    that.

    - Sasha Levin found and fixed a use-after-free problem introduced by
    the bio->bi_error changes from Christoph.

    - Small blk cgroup cleanup from Viresh Kumar"

    * 'for-4.3/core' of git://git.kernel.dk/linux-block: (26 commits)
    blk: Fix bio_io_vec index when checking bvec gaps
    block: Replace SG_GAPS with new queue limits mask
    block: bump BLK_DEF_MAX_SECTORS to 2560
    Revert "block: remove artifical max_hw_sectors cap"
    blk-mq: fix race between timeout and freeing request
    blk-mq: fix buffer overflow when reading sysfs file of 'pending'
    Documentation: update notes in biovecs about arbitrarily sized bios
    block: remove bio_get_nr_vecs()
    fs: use helper bio_add_page() instead of open coding on bi_io_vec
    block: kill merge_bvec_fn() completely
    md/raid5: get rid of bio_fits_rdev()
    md/raid5: split bio for chunk_aligned_read
    block: remove split code in blkdev_issue_{discard,write_same}
    btrfs: remove bio splitting and merge_bvec_fn() calls
    bcache: remove driver private bio splitting code
    block: simplify bio_add_page()
    block: make generic_make_request handle arbitrarily sized bios
    blk-cgroup: Drop unlikely before IS_ERR(_OR_NULL)
    block: don't access bio->bi_error after bio_put()
    block: shrink struct bio down to 2 cache lines again
    ...

    Linus Torvalds
     

20 Aug, 2015

1 commit

  • The SG_GAPS queue flag caused checks for bio vector alignment against
    PAGE_SIZE, but the device may have different constraints. This patch
    adds a queue limits so a driver with such constraints can set to allow
    requests that would have been unnecessarily split. The new gaps check
    takes the request_queue as a parameter to simplify the logic around
    invoking this function.

    This new limit makes the queue flag redundant, so remove it and
    all usage. Device-mappers will inherit the correct settings through
    blk_stack_limits().
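
    A sketch of how a driver with NVMe-like constraints might set the new
    limit in place of the old queue flag (the 4KB boundary and the function
    name are assumptions for illustration):

        #include <linux/blkdev.h>

        static void my_set_limits(struct request_queue *q)
        {
                /* segments may not leave gaps inside a 4KB boundary;
                 * replaces setting QUEUE_FLAG_SG_GAPS */
                blk_queue_virt_boundary(q, 4096 - 1);
        }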

    Signed-off-by: Keith Busch
    Reviewed-by: Martin K. Petersen
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Keith Busch
     

19 Aug, 2015

1 commit

  • The blkio interface has become messy over time and is currently the
    largest of the cgroup controller interfaces. In addition to the
    inconsistent naming scheme, it has
    multiple stat files which report more or less the same thing, a number
    of debug stat files which expose internal details which shouldn't have
    been part of the public interface in the first place, recursive and
    non-recursive stats and leaf and non-leaf knobs.

    Both recursive vs. non-recursive and leaf vs. non-leaf distinctions
    don't make any sense on the unified hierarchy as only leaf cgroups can
    contain processes. cgroups is going through a major interface
    revision with the unified hierarchy involving significant fundamental
    usage changes and given that a significant portion of the interface
    doesn't make sense anymore, it's a good time to reorganize the
    interface.

    As the first step, this patch renames the externally visible subsystem
    name from "blkio" to "io". This is more concise, matches the other
    two major subsystem names, "cpu" and "memory", and better suited as
    blkcg will be involved in anything writeback related too whether an
    actual block device is involved or not.

    As the subsystem legacy_name is set to "blkio", the only userland
    visible change outside the unified hierarchy is that blkcg is reported
    as "io" instead of "blkio" in the subsystem initialized message during
    boot. On the unified hierarchy, blkcg now appears as "io".

    Signed-off-by: Tejun Heo
    Cc: Li Zefan
    Cc: Johannes Weiner
    Cc: cgroups@vger.kernel.org
    Signed-off-by: Jens Axboe

    Tejun Heo
     

14 Aug, 2015

2 commits

  • We can always fill up the bio now, no need to estimate the possible
    size based on queue parameters.

    Acked-by: Steven Whitehouse
    Signed-off-by: Kent Overstreet
    [hch: rebased and wrote a changelog]
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Ming Lin
    Signed-off-by: Jens Axboe

    Kent Overstreet
     
  • Since generic_make_request() can now handle arbitrary size bios, all we
    have to do is make sure the bvec array doesn't overflow.
    __bio_add_page() doesn't need to call ->merge_bvec_fn(), so
    we can get rid of unnecessary code paths.

    Removing the call to ->merge_bvec_fn() is also fine, as no driver that
    implements support for BLOCK_PC commands even has a ->merge_bvec_fn()
    method.

    Cc: Christoph Hellwig
    Cc: Jens Axboe
    Signed-off-by: Kent Overstreet
    [dpark: rebase and resolve merge conflicts, change a couple of comments,
    make bio_add_page() warn once upon a cloned bio.]
    Signed-off-by: Dongsu Park
    Signed-off-by: Ming Lin
    Signed-off-by: Jens Axboe

    Kent Overstreet
     

29 Jul, 2015

2 commits

  • Some places use helpers now, others don't. We only have the 'is set'
    helper, so add helpers for setting and clearing flags too.

    It was a bit of a mess of atomic vs non-atomic access. With
    BIO_UPTODATE gone, we don't have any risk of concurrent access to the
    flags. So relax the restriction and don't make any of them atomic. The
    flags that do have serialization issues (reffed and chained) are
    already handled separately.
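
    A sketch of the new helpers next to the existing test helper (BIO_QUIET
    and the function name are only examples):

        #include <linux/bio.h>

        static void my_toggle_quiet(struct bio *bio)
        {
                bio_set_flag(bio, BIO_QUIET);            /* new: set */
                if (bio_flagged(bio, BIO_QUIET))         /* existing: test */
                        bio_clear_flag(bio, BIO_QUIET);  /* new: clear */
        }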

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • Currently we have two different ways to signal an I/O error on a BIO:

    (1) by clearing the BIO_UPTODATE flag
    (2) by returning a Linux errno value to the bi_end_io callback

    The first one has the drawback of only communicating a single possible
    error (-EIO), and the second one has the drawback of not being persistent
    when bios are queued up and of not being passed along from child to parent
    bio in the ever more popular chaining scenario. Having both mechanisms
    available has the additional drawback of utterly confusing driver authors
    and introducing bugs where various I/O submitters only deal with one of
    them, and the others have to add boilerplate code to deal with both kinds
    of error returns.

    So add a new bi_error field to store an errno value directly in struct
    bio and remove the existing mechanisms to clean all this up.
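
    A sketch of the resulting completion convention (the function name is
    illustrative):

        #include <linux/bio.h>

        static void my_complete(struct bio *bio, int err)
        {
                bio->bi_error = err;    /* 0 on success, negative errno otherwise */
                bio_endio(bio);         /* no separate error argument any more */
        }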

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Hannes Reinecke
    Reviewed-by: NeilBrown
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

24 Jul, 2015

2 commits

  • This fixes a data corruption bug when using discard on top of MD linear,
    raid0 and raid10 personalities.

    Commit 20d0189b1012 "block: Introduce new bio_split()" permits sharing
    the bio_vec between the two resulting bios. That is fine for read/write
    requests where the bio_vec is immutable. For discards, however, we need
    to be able to attach a payload and update the bio_vec so the page can
    get mapped to a scatterlist entry. Therefore the bio_vec can not be
    shared when splitting discards and we must do a full clone.

    Signed-off-by: Martin K. Petersen
    Reported-by: Seunguk Shin
    Tested-by: Seunguk Shin
    Cc: Seunguk Shin
    Cc: Jens Axboe
    Cc: Kent Overstreet
    Cc: # v3.14+
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Martin K. Petersen
     
  • bio_associate_blkcg(), bio_associate_current() and wbc_account_io()
    are used to implement cgroup writeback support for filesystems and
    thus need to be exported. Export them.

    Signed-off-by: Tejun Heo
    Reported-by: Stephen Rothwell
    Signed-off-by: Jens Axboe

    Tejun Heo
     

02 Jun, 2015

2 commits

  • Currently, a bio can only be associated with the io_context and blkcg
    of %current using bio_associate_current(). This is too restrictive
    for cgroup writeback support. Implement bio_associate_blkcg() which
    associates a bio with the specified blkcg.

    bio_associate_blkcg() leaves the io_context unassociated.
    bio_associate_current() is updated so that it considers a bio as
    already associated if it has a blkcg_css, instead of an io_context,
    associated with it.
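
    A sketch of the new association call as a writeback path might use it
    (the function name is illustrative and blkcg_css stands for a css the
    caller already holds a reference to):

        #include <linux/cgroup.h>
        #include <linux/bio.h>

        static void my_tag_bio(struct bio *bio,
                               struct cgroup_subsys_state *blkcg_css)
        {
                /* charge the bio to an explicit blkcg rather than %current */
                bio_associate_blkcg(bio, blkcg_css);
        }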

    Signed-off-by: Tejun Heo
    Cc: Jens Axboe
    Cc: Vivek Goyal
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • bio_associate_current() currently open codes task_css() and
    css_tryget_online() to find and pin $current's blkcg css. Abstract it
    into task_get_css() which is implemented from cgroup side. As a task
    is always associated with an online css for every subsystem except
    while the css_set update is propagating, task_get_css() retries till
    css_tryget_online() succeeds.

    This is a cleanup and shouldn't lead to noticeable behavior changes.
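
    A sketch of the helper as a blkcg user might call it, using the
    pre-rename subsystem id (the surrounding function is illustrative):

        #include <linux/cgroup.h>
        #include <linux/sched.h>

        static void my_use_blkcg_of_current(void)
        {
                struct cgroup_subsys_state *css;

                /* retries internally until css_tryget_online() succeeds */
                css = task_get_css(current, blkio_cgrp_id);
                /* ... use css ... */
                css_put(css);
        }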

    Signed-off-by: Tejun Heo
    Cc: Li Zefan
    Cc: Jens Axboe
    Cc: Vivek Goyal
    Signed-off-by: Jens Axboe

    Tejun Heo
     

22 May, 2015

1 commit

  • Commit c4cf5261 ("bio: skip atomic inc/dec of ->bi_remaining for
    non-chains") regressed all existing callers that followed this pattern:
    1) saving a bio's original bi_end_io
    2) wiring up an intermediate bi_end_io
    3) restoring the original bi_end_io from intermediate bi_end_io
    4) calling bio_endio() to execute the restored original bi_end_io

    The regression was due to BIO_CHAIN only ever getting set if
    bio_inc_remaining() is called. For the above pattern it isn't set until
    step 3 above (step 2 would've needed to establish BIO_CHAIN). As such
    the first bio_endio(), in step 2 above, never decremented __bi_remaining
    before calling the intermediate bi_end_io -- leaving __bi_remaining with
    the value 1 instead of 0. When bio_inc_remaining() occurred during step
    3 it brought it to a value of 2. When the second bio_endio() was
    called, in step 4 above, it should've called the original bi_end_io but
    it didn't because there was an extra reference that wasn't dropped (due
    to atomic operations being optimized away since BIO_CHAIN wasn't set
    upfront).

    Fix this issue by removing the __bi_remaining management complexity for
    all callers that use the above pattern -- bio_chain() is the only
    interface that _needs_ to be concerned with __bi_remaining. For the
    above pattern callers just expect the bi_end_io they set to get called!
    Remove bio_endio_nodec() and also remove all bio_inc_remaining() calls
    that aren't associated with the bio_chain() interface.

    Also, the bio_inc_remaining() interface has been moved local to bio.c.
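
    A sketch of the save/restore pattern in question, using the bi_end_io
    signature of this era; the struct and function names are illustrative:

        #include <linux/bio.h>

        struct my_hook {
                bio_end_io_t *orig_end_io;
                void *orig_private;
        };

        static void my_intermediate_end_io(struct bio *bio, int error)
        {
                struct my_hook *h = bio->bi_private;

                bio->bi_end_io  = h->orig_end_io;       /* step 3: restore */
                bio->bi_private = h->orig_private;
                bio_endio(bio, error);                  /* step 4: run the original */
        }

        static void my_hook_bio(struct bio *bio, struct my_hook *h)
        {
                h->orig_end_io  = bio->bi_end_io;       /* step 1: save */
                h->orig_private = bio->bi_private;
                bio->bi_end_io  = my_intermediate_end_io;  /* step 2: hook */
                bio->bi_private = h;
        }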

    Fixes: c4cf5261 ("bio: skip atomic inc/dec of ->bi_remaining for non-chains")
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Jan Kara
    Signed-off-by: Mike Snitzer
    Signed-off-by: Jens Axboe

    Mike Snitzer
     

06 May, 2015

2 commits

  • Struct bio has a reference count that controls when it can be freed.
    The most common use case is allocating the bio, which then returns with a
    single reference to it, doing IO, and then dropping that single
    reference. We can remove this atomic_dec_and_test() in the completion
    path, if nobody else is holding a reference to the bio.

    If someone does call bio_get() on the bio, then we flag the bio as
    now having a valid count, and we must then properly honor the reference
    count when it's being put.
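
    A sketch of the case that still needs the reference count, using the
    submit_bio() of this era which takes the rw argument (the function name
    is illustrative):

        #include <linux/bio.h>
        #include <linux/fs.h>

        static void my_submit_and_inspect(struct bio *bio)
        {
                bio_get(bio);                   /* marks the count as "in use" */
                submit_bio(READ, bio);
                /* ... wait for completion ... */
                /* the extra reference keeps the bio valid here */
                bio_put(bio);
        }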

    Tested-by: Robert Elliott
    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • Struct bio has an atomic ref count for chained bios, and we use this
    to know when to end IO on the bio. However, most bios are not chained,
    so we don't need to always introduce this atomic operation as part of
    ending IO.

    Add a helper to elevate the bi_remaining count, and flag the bio as
    now actually needing the decrement at end_io time. Rename the field
    to __bi_remaining to catch any current users of this doing the
    incrementing manually.

    For high IOPS workloads, this reduces the overhead of bio_endio()
    substantially.
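
    A sketch of the one path that still pays for the atomic accounting
    (split and parent are assumed to be already prepared bios; the wrapper
    function is illustrative):

        #include <linux/bio.h>
        #include <linux/fs.h>

        static void my_submit_split(struct bio *split, struct bio *parent, int rw)
        {
                /* parent completes only after split does */
                bio_chain(split, parent);
                submit_bio(rw, split);
        }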

    Tested-by: Robert Elliott
    Acked-by: Kent Overstreet
    Reviewed-by: Jan Kara
    Signed-off-by: Jens Axboe

    Jens Axboe
     

06 Feb, 2015

7 commits


12 Dec, 2014

1 commit

  • The original behaviour is to refuse to add a new page if the maximum
    number of segments has been reached, regardless of whether the page we
    are going to add can be merged into the last segment or not.

    Unfortunately, when the system runs under heavy memory fragmentation
    conditions, a driver may try to add multiple pages to the last segment.
    The original code won't accept them and EBUSY will be reported to
    userspace.

    This patch modifies the function so it refuses to add a page only in case
    the latter starts a new segment and the maximum number of segments has
    already been reached.
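
    A simplified sketch of the relaxed check; the real patch also considers
    physical mergeability, this only shows the contiguous-page case and the
    helper name is illustrative:

        #include <linux/bio.h>
        #include <linux/blkdev.h>

        static int my_can_add_page(struct request_queue *q, struct bio *bio,
                                   struct page *page, unsigned int offset)
        {
                struct bio_vec *prev = &bio->bi_io_vec[bio->bi_vcnt - 1];

                if (bio->bi_vcnt < queue_max_segments(q))
                        return 1;       /* room for a new segment */

                /* at the limit: accept only pages that extend the last bvec */
                return page == prev->bv_page &&
                       offset == prev->bv_offset + prev->bv_len;
        }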

    The bug can be easily reproduced with the st driver:

    1) set CONFIG_SCSI_MPT2SAS_MAX_SGE or CONFIG_SCSI_MPT3SAS_MAX_SGE to 16
    2) modprobe st buffer_kbs=1024
    3) #dd if=/dev/zero of=/dev/st0 bs=1M count=10
    dd: error writing `/dev/st0': Device or resource busy

    Signed-off-by: Maurizio Lombardi
    Signed-off-by: Ming Lei
    Cc: Jet Chen
    Cc: Tomas Henzl
    Cc: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Jens Axboe

    Maurizio Lombardi
     

24 Nov, 2014

1 commit

  • Many block drivers account io stats based on the bio (e.g. NVMe...),
    so blk_account_io_start/end(), which is based on the request, does not
    make sense for them. Introduce similar helper functions named
    generic_start/end_io_acct() based on raw sectors; they can simplify
    some drivers' open-coded io accounting.
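
    A sketch of how a bio-based driver might call the new helpers; the
    wrapper function and the placement of the calls are illustrative:

        #include <linux/bio.h>
        #include <linux/genhd.h>
        #include <linux/jiffies.h>

        static void my_account_bio(struct gendisk *disk, struct bio *bio)
        {
                unsigned long start = jiffies;

                generic_start_io_acct(bio_data_dir(bio), bio_sectors(bio),
                                      &disk->part0);
                /* ... perform and complete the I/O ... */
                generic_end_io_acct(bio_data_dir(bio), &disk->part0, start);
        }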

    Signed-off-by: Gu Zheng
    Signed-off-by: Jens Axboe

    Gu Zheng
     

04 Oct, 2014

1 commit

  • Users of bio_clone_fast() do not want bios with their own bvecs.
    Allocating a bvec mempool as part of the bioset intended for such users
    is a waste of memory.

    bioset_create_nobvec() creates a bioset that doesn't have the bvec
    mempool.
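
    A sketch of the intended pairing with bio_clone_fast(); the pool size,
    GFP flags and function names are illustrative:

        #include <linux/bio.h>

        static struct bio_set *my_bs;

        static int my_init(void)
        {
                /* no bvec mempool: intended for bio_clone_fast() users only */
                my_bs = bioset_create_nobvec(BIO_POOL_SIZE, 0);
                return my_bs ? 0 : -ENOMEM;
        }

        static struct bio *my_clone(struct bio *bio)
        {
                /* the clone shares the parent's bvec array */
                return bio_clone_fast(bio, GFP_NOIO, my_bs);
        }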

    Signed-off-by: Jun'ichi Nomura
    Signed-off-by: Mike Snitzer
    Signed-off-by: Jens Axboe

    Junichi Nomura
     

02 Aug, 2014

1 commit

  • Various subsystems can ask the bio subsystem to create a bio slab cache
    with some free space before the bio. This free space can be used for any
    purpose. Device mapper uses this per-bio-data feature to place some
    target-specific and device-mapper specific data before the bio, so that
    the target-specific data doesn't have to be allocated separately.

    This per-bio-data mechanism is used in place of kmalloc, so we need the
    allocated slab to have the same memory alignment as memory allocated
    with kmalloc.

    Change bio_find_or_create_slab() so that it uses ARCH_KMALLOC_MINALIGN
    alignment when creating the slab cache. This is needed so that dm-crypt
    can use per-bio-data for encryption - the crypto subsystem assumes this
    data will have the same alignment as kmalloc'ed memory.
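
    A sketch of the per-bio-data layout the alignment matters for; the
    struct, its contents and the function names are illustrative of the
    general front_pad technique, not device-mapper's exact code:

        #include <linux/bio.h>

        struct my_per_bio {
                u8 iv[16];              /* space a target wants per bio */
                struct bio bio;         /* must be the last member */
        };

        static struct bio_set *my_bs;

        static int my_init(void)
        {
                /* front_pad bytes live directly in front of the bio and
                 * must therefore be kmalloc-aligned */
                my_bs = bioset_create(BIO_POOL_SIZE,
                                      offsetof(struct my_per_bio, bio));
                return my_bs ? 0 : -ENOMEM;
        }

        static struct my_per_bio *my_data(struct bio *bio)
        {
                return container_of(bio, struct my_per_bio, bio);
        }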

    Signed-off-by: Mikulas Patocka
    Signed-off-by: Mike Snitzer
    Acked-by: Jens Axboe

    Mikulas Patocka
     

25 Jun, 2014

1 commit

  • Another restriction inherited for NVMe - those devices don't support
    SG lists that have "gaps" in them. Gaps refer to cases where the
    previous SG entry doesn't end on a page boundary. For NVMe, all SG
    entries must start at offset 0 (except the first) and end on a page
    boundary (except the last).
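
    A sketch of the gap condition described above (the helper name is
    illustrative):

        #include <linux/bio.h>

        /* a "gap": the previous element does not end on a page boundary,
         * or the next one does not start at offset 0 */
        static bool my_sg_gap(const struct bio_vec *prev,
                              const struct bio_vec *next)
        {
                return ((prev->bv_offset + prev->bv_len) & (PAGE_SIZE - 1)) ||
                       (next->bv_offset & (PAGE_SIZE - 1));
        }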

    Signed-off-by: Jens Axboe

    Jens Axboe
     

11 Jun, 2014

2 commits

  • Pull block layer fixes from Jens Axboe:
    "Final small batch of fixes to be included before -rc1. Some general
    cleanups in here as well, but some of the blk-mq fixes we need for the
    NVMe conversion and/or scsi-mq. The pull request contains:

    - Support for not merging across a specified "chunk size", if set by
    the driver. Some NVMe devices perform poorly for IO that crosses
    such a chunk, so we need to support it generically as part of
    request merging to avoid having to do complicated split logic. From
    me.

    - Bump max tag depth to 10Ki tags. Some scsi devices have a huge
    shared tag space. Before we failed with EINVAL if a too large tag
    depth was specified, now we truncate it and pass back the actual
    value. From me.

    - Various blk-mq rq init fixes from me and others.

    - A fix for enter on a dying queue for blk-mq from Keith. This is
    needed to prevent oopsing on hot device removal.

    - Fixup for blk-mq timer addition from Ming Lei.

    - Small round of performance fixes for mtip32xx from Sam Bradshaw.

    - Minor stack leak fix from Rickard Strandqvist.

    - Two __init annotations from Fabian Frederick"

    * 'for-linus' of git://git.kernel.dk/linux-block:
    block: add __init to blkcg_policy_register
    block: add __init to elv_register
    block: ensure that bio_add_page() always accepts a page for an empty bio
    blk-mq: add timer in blk_mq_start_request
    blk-mq: always initialize request->start_time
    block: blk-exec.c: Cleaning up local variable address returnd
    mtip32xx: minor performance enhancements
    blk-mq: ->timeout should be cleared in blk_mq_rq_ctx_init()
    blk-mq: don't allow queue entering for a dying queue
    blk-mq: bump max tag depth to 10K tags
    block: add blk_rq_set_block_pc()
    block: add notion of a chunk size for request merging

    Linus Torvalds
     
  • Commit 762380ad9322 added support for chunk sizes and no merging
    across them, but it broke the rule of always allowing the addition of
    a single page to an empty bio. So relax the restriction a bit to allow
    for that,
    similarly to what we have always done.

    This fixes a crash with mkfs.xfs and 512b sector sizes on NVMe.

    Reported-by: Keith Busch
    Signed-off-by: Jens Axboe

    Jens Axboe
     

10 Jun, 2014

1 commit

  • Pull cgroup updates from Tejun Heo:
    "A lot of activities on cgroup side. Heavy restructuring including
    locking simplification took place to improve the code base and enable
    implementation of the unified hierarchy, which currently exists behind
    a __DEVEL__ mount option. The core support is mostly complete but
    individual controllers need further work. To explain the design and
    rationales of the unified hierarchy

    Documentation/cgroups/unified-hierarchy.txt

    is added.

    Another notable change is css (cgroup_subsys_state - what each
    controller uses to identify and interact with a cgroup) iteration
    update. This is part of continuing updates on css object lifetime and
    visibility. cgroup started with reference count draining on removal
    way back and is now reaching a point where csses behave and are
    iterated like normal refcnted objects albeit with some complexities to
    allow distinguishing the state where they're being deleted. The css
    iteration update isn't taken advantage of yet but is planned to be
    used to simplify memcg significantly"

    * 'for-3.16' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (77 commits)
    cgroup: disallow disabled controllers on the default hierarchy
    cgroup: don't destroy the default root
    cgroup: disallow debug controller on the default hierarchy
    cgroup: clean up MAINTAINERS entries
    cgroup: implement css_tryget()
    device_cgroup: use css_has_online_children() instead of has_children()
    cgroup: convert cgroup_has_live_children() into css_has_online_children()
    cgroup: use CSS_ONLINE instead of CGRP_DEAD
    cgroup: iterate cgroup_subsys_states directly
    cgroup: introduce CSS_RELEASED and reduce css iteration fallback window
    cgroup: move cgroup->serial_nr into cgroup_subsys_state
    cgroup: link all cgroup_subsys_states in their sibling lists
    cgroup: move cgroup->sibling and ->children into cgroup_subsys_state
    cgroup: remove cgroup->parent
    device_cgroup: remove direct access to cgroup->children
    memcg: update memcg_has_children() to use css_next_child()
    memcg: remove tasks/children test from mem_cgroup_force_empty()
    cgroup: remove css_parent()
    cgroup: skip refcnting on normal root csses and cgrp_dfl_root self css
    cgroup: use cgroup->self.refcnt for cgroup refcnting
    ...

    Linus Torvalds
     

06 Jun, 2014

1 commit

  • Some drivers have different limits on what size a request should
    optimally be, depending on the offset of the request. Similar to
    dividing a device into chunks. Add a setting that allows the driver
    to inform the block layer of such a chunk size. The block layer will
    then prevent merging across the chunks.

    This is needed to optimally support NVMe with a non-zero stripe size.
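
    A sketch of a driver advertising such a chunk; the 128KB value and the
    function name are assumptions for illustration (the value must be a
    power of two):

        #include <linux/blkdev.h>

        static void my_set_chunk(struct request_queue *q)
        {
                /* the block layer will not merge requests across 128KB
                 * boundaries of the device */
                blk_queue_chunk_sectors(q, 256);        /* 256 * 512B = 128KB */
        }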

    Signed-off-by: Jens Axboe

    Jens Axboe
     

19 May, 2014

1 commit