28 Jun, 2015

1 commit

  • Whenever blk_fill_sghdr_rq fails, its errno code is ignored and changed to
    EFAULT. This can cause very confusing errors:

    $ sg_persist -k /dev/sda
    persistent reservation in: pass through os error: Bad address

    The fix is trivial: just propagate the return value from
    blk_fill_sghdr_rq.
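
    A minimal sketch of the shape of such a fix (the surrounding variable
    and label names here are illustrative, not quoted from the kernel
    source):

    ret = blk_fill_sghdr_rq(q, rq, hdr, mode);
    if (ret) {
            /* was: ret = -EFAULT; now the real errno (e.g. -EPERM) survives */
            goto out_free_request;
    }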

    Signed-off-by: Paolo Bonzini
    Acked-by: Jeff Moyer
    Signed-off-by: Jens Axboe

    Paolo Bonzini
     

27 Jun, 2015

1 commit

  • Pull device mapper fixes from Mike Snitzer:
    "Apologies for not pressing this request-based DM partial completion
    issue further, it was an oversight on my part. We'll have to get it
    fixed up properly and revisit for a future release.

    - Revert block and DM core changes that removed request-based DM's
    ability to handle partial request completions -- otherwise with the
    current SCSI LLDs these changes could lead to silent data
    corruption.

    - Fix two DM version bumps that were missing from the initial 4.2 DM
    pull request (enabled userspace lvm2 to know certain changes have
    been made)"

    * tag 'dm-4.2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
    dm cache policy smq: fix "default" version to be 1.4.0
    dm: bump the ioctl version to 4.32.0
    Revert "block, dm: don't copy bios for request clones"
    Revert "dm: do not allocate any mempools for blk-mq request-based DM"

    Linus Torvalds
     

26 Jun, 2015

3 commits

  • This reverts commit 5f1b670d0bef508a5554d92525f5f6d00d640b38.

    Justification for revert as reported in this dm-devel post:
    https://www.redhat.com/archives/dm-devel/2015-June/msg00160.html

    This change should not be pushed to mainline yet.

    Firstly, Christoph has a newer version of the patch that fixes the
    silent data corruption problem:
    https://www.redhat.com/archives/dm-devel/2015-May/msg00229.html

    And the new version still depends on LLDDs to always complete requests
    to the end when an error happens, while the block API doesn't enforce
    such a requirement. If the assumption is ever broken, the inconsistency between
    request and bio (e.g. rq->__sector and rq->bio) will cause silent data
    corruption:
    https://www.redhat.com/archives/dm-devel/2015-June/msg00022.html

    Reported-by: Junichi Nomura
    Signed-off-by: Mike Snitzer

    Mike Snitzer
     
  • Pull cgroup writeback support from Jens Axboe:
    "This is the big pull request for adding cgroup writeback support.

    This code has been in development for a long time, and it has been
    simmering in for-next for a good chunk of this cycle too. This is one
    of those problems that has been talked about for at least half a
    decade, finally there's a solution and code to go with it.

    Also see last week's writeup on LWN:

    http://lwn.net/Articles/648292/"

    * 'for-4.2/writeback' of git://git.kernel.dk/linux-block: (85 commits)
    writeback, blkio: add documentation for cgroup writeback support
    vfs, writeback: replace FS_CGROUP_WRITEBACK with SB_I_CGROUPWB
    writeback: do foreign inode detection iff cgroup writeback is enabled
    v9fs: fix error handling in v9fs_session_init()
    bdi: fix wrong error return value in cgwb_create()
    buffer: remove unusued 'ret' variable
    writeback: disassociate inodes from dying bdi_writebacks
    writeback: implement foreign cgroup inode bdi_writeback switching
    writeback: add lockdep annotation to inode_to_wb()
    writeback: use unlocked_inode_to_wb transaction in inode_congested()
    writeback: implement unlocked_inode_to_wb transaction and use it for stat updates
    writeback: implement [locked_]inode_to_wb_and_lock_list()
    writeback: implement foreign cgroup inode detection
    writeback: make writeback_control track the inode being written back
    writeback: relocate wb[_try]_get(), wb_put(), inode_{attach|detach}_wb()
    mm: vmscan: disable memcg direct reclaim stalling if cgroup writeback support is in use
    writeback: implement memcg writeback domain based throttling
    writeback: reset wb_domain->dirty_limit[_tstmp] when memcg domain size changes
    writeback: implement memcg wb_domain
    writeback: update wb_over_bg_thresh() to use wb_domain aware operations
    ...

    Linus Torvalds
     
  • Pull core block IO update from Jens Axboe:
    "Nothing really major in here, mostly a collection of smaller
    optimizations and cleanups, mixed with various fixes. In more detail,
    this contains:

    - Addition of policy specific data to blkcg for block cgroups. From
    Arianna Avanzini.

    - Various cleanups around command types from Christoph.

    - Cleanup of the suspend block I/O path from Christoph.

    - Plugging updates from Shaohua and Jeff Moyer, for blk-mq.

    - Eliminating atomic inc/dec of both remaining IO count and reference
    count in a bio. From me.

    - Fixes for SG gap and chunk size support for data-less (discards)
    IO, so we can merge these better. From me.

    - Small restructuring of blk-mq shared tag support, freeing drivers
    from iterating hardware queues. From Keith Busch.

    - A few cfq-iosched tweaks, from Tahsin Erdogan and me. Makes the
    IOPS mode the default for non-rotational storage"

    * 'for-4.2/core' of git://git.kernel.dk/linux-block: (35 commits)
    cfq-iosched: fix other locations where blkcg_to_cfqgd() can return NULL
    cfq-iosched: fix sysfs oops when attempting to read unconfigured weights
    cfq-iosched: move group scheduling functions under ifdef
    cfq-iosched: fix the setting of IOPS mode on SSDs
    blktrace: Add blktrace.c to BLOCK LAYER in MAINTAINERS file
    block, cgroup: implement policy-specific per-blkcg data
    block: Make CFQ default to IOPS mode on SSDs
    block: add blk_set_queue_dying() to blkdev.h
    blk-mq: Shared tag enhancements
    block: don't honor chunk sizes for data-less IO
    block: only honor SG gap prevention for merges that contain data
    block: fix returnvar.cocci warnings
    block, dm: don't copy bios for request clones
    block: remove management of bi_remaining when restoring original bi_end_io
    block: replace trylock with mutex_lock in blkdev_reread_part()
    block: export blkdev_reread_part() and __blkdev_reread_part()
    suspend: simplify block I/O handling
    block: collapse bio bit space
    block: remove unused BIO_RW_BLOCK and BIO_EOF flags
    block: remove BIO_EOPNOTSUPP
    ...

    Linus Torvalds
     

23 Jun, 2015

1 commit

  • Pull scheduler updates from Ingo Molnar:
    "The main changes are:

    - lockless wakeup support for futexes and IPC message queues
    (Davidlohr Bueso, Peter Zijlstra)

    - Replace spinlocks with atomics in thread_group_cputimer(), to
    improve scalability (Jason Low)

    - NUMA balancing improvements (Rik van Riel)

    - SCHED_DEADLINE improvements (Wanpeng Li)

    - clean up and reorganize preemption helpers (Frederic Weisbecker)

    - decouple page fault disabling machinery from the preemption
    counter, to improve debuggability and robustness (David
    Hildenbrand)

    - SCHED_DEADLINE documentation updates (Luca Abeni)

    - topology CPU masks cleanups (Bartosz Golaszewski)

    - /proc/sched_debug improvements (Srikar Dronamraju)"

    * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (79 commits)
    sched/deadline: Remove needless parameter in dl_runtime_exceeded()
    sched: Remove superfluous resetting of the p->dl_throttled flag
    sched/deadline: Drop duplicate init_sched_dl_class() declaration
    sched/deadline: Reduce rq lock contention by eliminating locking of non-feasible target
    sched/deadline: Make init_sched_dl_class() __init
    sched/deadline: Optimize pull_dl_task()
    sched/preempt: Add static_key() to preempt_notifiers
    sched/preempt: Fix preempt notifiers documentation about hlist_del() within unsafe iteration
    sched/stop_machine: Fix deadlock between multiple stop_two_cpus()
    sched/debug: Add sum_sleep_runtime to /proc/<pid>/sched
    sched/debug: Replace vruntime with wait_sum in /proc/sched_debug
    sched/debug: Properly format runnable tasks in /proc/sched_debug
    sched/numa: Only consider less busy nodes as numa balancing destinations
    Revert 095bebf61a46 ("sched/numa: Do not move past the balance point if unbalanced")
    sched/fair: Prevent throttling in early pick_next_task_fair()
    preempt: Reorganize the notrace definitions a bit
    preempt: Use preempt_schedule_context() as the official tracing preemption point
    sched: Make preempt_schedule_context() function-tracing safe
    x86: Remove cpu_sibling_mask() and cpu_core_mask()
    x86: Replace cpu_**_mask() with topology_**_cpumask()
    ...

    Linus Torvalds
     

21 Jun, 2015

1 commit


20 Jun, 2015

2 commits

  • If none of the devices in the system are using CFQ, then attempting to
    read:

    /sys/fs/cgroup/blkio/blkio.leaf_weight

    will result in a NULL dereference. Check for a valid cfq_group_data
    struct before attempting to dereference it.
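
    Roughly, the sysfs show path gains a check along these lines (a sketch
    based on the description; the helper names follow related commits in
    this log, details simplified):

    static int cfq_print_weight(struct seq_file *sf, void *v)
    {
            struct blkcg *blkcg = css_to_blkcg(seq_css(sf));
            struct cfq_group_data *cgd = blkcg_to_cfqgd(blkcg);
            unsigned int val = 0;

            /* no cfq_group_data exists if CFQ isn't active on any queue */
            if (cgd)
                    val = cgd->weight;

            seq_printf(sf, "%u\n", val);
            return 0;
    }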

    Reported-by: Andrey Wagin
    Fixes: e48453c3 ("block, cgroup: implement policy-specific per-blkcg data")
    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • If CFQ_GROUP_IOSCHED is not set, the compiler produces the
    following warning:

    CC block/cfq-iosched.o
    linux/block/cfq-iosched.c:469:2:
    warning: 'cpd_to_cfqgd' defined but not used [-Wunused-function]
    *cpd_to_cfqgd(struct blkcg_policy_data *cpd)
    ^

    In reality, two other lookup functions aren't used either if
    CFQ_GROUP_IOSCHED isn't set. Move all three under one of the
    CFQ_GROUP_IOSCHED sections in the code.
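
    The move amounts to putting the helpers under the existing
    group-scheduling ifdef, roughly as below (the container_of field name
    is an assumption):

    #ifdef CONFIG_CFQ_GROUP_IOSCHED
    /* blkcg -> cfq_group_data lookups are only used by group scheduling */
    static struct cfq_group_data *cpd_to_cfqgd(struct blkcg_policy_data *cpd)
    {
            return cpd ? container_of(cpd, struct cfq_group_data, pd) : NULL;
    }
    /* blkcg_to_cfqgd() and the other unused lookup move here as well */
    #endif /* CONFIG_CFQ_GROUP_IOSCHED */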

    Reported-by: Vladimir Zapolskiy
    Signed-off-by: Jens Axboe

    Jens Axboe
     

11 Jun, 2015

1 commit

  • =================================
    [ INFO: inconsistent lock state ]
    4.1.0-rc7+ #217 Tainted: G O
    ---------------------------------
    inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage.
    swapper/6/0 [HC0[0]:SC1[1]:HE1:SE0] takes:
    (ext_devt_lock){+.?...}, at: [] blk_free_devt+0x3c/0x70
    {SOFTIRQ-ON-W} state was registered at:
    [] __lock_acquire+0x461/0x1e70
    [] lock_acquire+0xb7/0x290
    [] _raw_spin_lock+0x38/0x50
    [] blk_alloc_devt+0x6d/0xd0
    [] __lock_acquire+0x3fe/0x1e70
    [] ? __lock_acquire+0xe5d/0x1e70
    [] lock_acquire+0xb7/0x290
    [] ? blk_free_devt+0x3c/0x70
    [] _raw_spin_lock+0x38/0x50
    [] ? blk_free_devt+0x3c/0x70
    [] blk_free_devt+0x3c/0x70
    [] part_release+0x1c/0x50
    [] device_release+0x36/0xb0
    [] kobject_cleanup+0x7b/0x1a0
    [] kobject_put+0x30/0x70
    [] put_device+0x17/0x20
    [] delete_partition_rcu_cb+0x16c/0x180
    [] ? read_dev_sector+0xa0/0xa0
    [] rcu_process_callbacks+0x2ff/0xa90
    [] ? rcu_process_callbacks+0x2bf/0xa90
    [] __do_softirq+0xde/0x600

    Neil sees this in his tests and it also triggers on pmem driver unbind
    for the libnvdimm tests. This fix is on top of an initial fix by Keith
    for incorrect usage of mutex_lock() in this path: 2da78092dda1 "block:
    Fix dev_t minor allocation lifetime". Both this and 2da78092dda1 are
    candidates for -stable.
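
    The splat says ext_devt_lock is taken from softirq context (the RCU
    callback releasing the devt) while also being taken with softirqs
    enabled elsewhere. The usual remedy for this class of report, and
    roughly what the fix does, is to take the lock with bottom halves
    disabled on the process-context side, e.g. in blk_free_devt():

    /* process-context users must block the softirq that also takes the lock */
    spin_lock_bh(&ext_devt_lock);
    idr_remove(&ext_devt_idr, blk_mangle_minor(MINOR(devt)));
    spin_unlock_bh(&ext_devt_lock);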

    Fixes: 2da78092dda1 ("block: Fix dev_t minor allocation lifetime")
    Cc:
    Cc: Keith Busch
    Reported-by: NeilBrown
    Signed-off-by: Dan Williams
    Signed-off-by: Jens Axboe

    Dan Williams
     

10 Jun, 2015

2 commits

  • A previous commit wanted to make CFQ default to IOPS mode on
    non-rotational storage; however, it did so when the queue was
    initialized, and the non-rotational flag is only set later on
    in the probe.

    Add an elevator hook that gets called off the add_disk() path;
    at that point we know that feature probing has finished, and
    we can reliably check for the various flags that drivers can
    set.
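
    A sketch of what the CFQ side of such a hook looks like (the function
    name and exact wiring are illustrative rather than quoted from the
    patch):

    static void cfq_registered_queue(struct request_queue *q)
    {
            struct cfq_data *cfqd = q->elevator->elevator_data;

            /* runs off add_disk(), after the driver has set QUEUE_FLAG_NONROT */
            if (blk_queue_nonrot(q))
                    cfqd->cfq_slice_idle = 0;
    }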

    Fixes: 41c0126b ("block: Make CFQ default to IOPS mode on SSDs")
    Tested-by: Romain Francoise
    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • blk_cleanup_queue() can now be called before del_gendisk() [1].
    del_gendisk() touches hctx->ctxs via blk_mq_unregister_hctx(),
    but by that time the variable has already been freed by
    blk_cleanup_queue().

    So this patch moves the freeing of hctx->ctxs into the queue's
    release handler, fixing the oops reported by Stefan.

    [1], 6cd18e711dd8075 (block: destroy bdi before blockdev is
    unregistered)
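
    A sketch of the idea (simplified): the ctxs array is freed only once
    the queue's release handler runs, i.e. after the last reference to the
    queue is dropped, so blk_mq_unregister_hctx() can still walk it:

    void blk_mq_release(struct request_queue *q)
    {
            struct blk_mq_hw_ctx *hctx;
            unsigned int i;

            queue_for_each_hw_ctx(q, hctx, i) {
                    if (!hctx)
                            continue;
                    kfree(hctx->ctxs);      /* no longer freed in blk_cleanup_queue() */
                    kfree(hctx);
            }
    }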

    Reported-by: Stefan Seyfried
    Cc: NeilBrown
    Cc: Christoph Hellwig
    Cc: stable@vger.kernel.org (v4.0)
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     

07 Jun, 2015

1 commit

  • The block IO (blkio) controller enables the block layer to provide service
    guarantees in a hierarchical fashion. Specifically, service guarantees
    are provided by registered request-accounting policies. As of now, a
    proportional-share and a throttling policy are available. They are
    implemented, respectively, by the CFQ I/O scheduler and the blk-throttle
    subsystem. Unfortunately, the current implementation of the block IO
    controller is only halfway ready to allow new policies to be plugged in.
    This commit provides a solution to make
    the block IO controller fully ready to handle new policies.
    In what follows, we first describe briefly the current state, and then
    list the changes made by this commit.

    The throttling policy does not need any per-cgroup information to perform
    its task. In contrast, the proportional share policy uses, for each cgroup,
    both the weight assigned by the user to the cgroup, and a set of dynamically-
    computed weights, one for each device.

    The first, user-defined weight is stored in the blkcg data structure: the
    block IO controller allocates a private blkcg data structure for each
    cgroup in the blkio cgroups hierarchy (regardless of which policy is active).
    In other words, the block IO controller internally mirrors the blkio cgroups
    with private blkcg data structures.

    On the other hand, for each cgroup and device, the corresponding dynamically-
    computed weight is maintained in the following, different way. For each device,
    the block IO controller keeps a private blkcg_gq structure for each cgroup in
    blkio. In other words, block IO also keeps one private mirror copy of the blkio
    cgroups hierarchy for each device, made of blkcg_gq structures.
    Each blkcg_gq structure keeps per-policy information in a generic array of
    dynamically-allocated 'dedicated' data structures, one for each registered
    policy (so currently the array contains two elements). To be inserted into the
    generic array, each dedicated data structure embeds a generic blkg_policy_data
    structure. Consider now the array contained in the blkcg_gq structure
    corresponding to a given pair of cgroup and device: one of the elements
    of the array contains the dedicated data structure for the proportional-share
    policy, and this dedicated data structure contains the dynamically-computed
    weight for that pair of cgroup and device.

    The generic strategy adopted for storing per-policy data in blkcg_gq structures
    is already capable of handling new policies, whereas the one adopted with blkcg
    structures is not, because per-policy data are hard-coded in the blkcg
    structures themselves (currently only data related to the proportional-
    share policy).

    This commit addresses the above issues through the following changes:
    . It generalizes blkcg structures so that per-policy data are stored in the same
    way as in blkcg_gq structures.
    Specifically, it lets also the blkcg structure store per-policy data in a
    generic array of dynamically-allocated dedicated data structures. We will
    refer to these data structures as blkcg dedicated data structures, to
    distinguish them from the dedicated data structures inserted in the generic
    arrays kept by blkcg_gq structures.
    To allow blkcg dedicated data structures to be inserted in the generic array
    inside a blkcg structure, this commit also introduces a new blkcg_policy_data
    structure, which is the equivalent of blkg_policy_data for blkcg dedicated
    data structures.
    . It adds to the blkcg_policy structure, i.e., to the descriptor of a policy, a
    cpd_size field and a cpd_init field, to be initialized by the policy with,
    respectively, the size of the blkcg dedicated data structures, and the
    address of a constructor function for blkcg dedicated data structures.
    . It moves the CFQ-specific fields embedded in the blkcg data structure (i.e.,
    the fields related to the proportional-share policy), into a new blkcg
    dedicated data structure called cfq_group_data.
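
    A sketch of what the result looks like on the CFQ side (member and
    hook names below follow the description above and are otherwise
    illustrative):

    /* CFQ's blkcg dedicated data structure */
    struct cfq_group_data {
            struct blkcg_policy_data pd;    /* generic part, for the blkcg array */
            unsigned int weight;            /* user-assigned, formerly in struct blkcg */
            unsigned int leaf_weight;
    };

    /* the policy descriptor tells blkcg how big the cpd is and how to init it */
    static struct blkcg_policy blkcg_policy_cfq = {
            .pd_size        = sizeof(struct cfq_group),       /* per cgroup + device */
            .cpd_size       = sizeof(struct cfq_group_data),  /* per cgroup          */
            .cpd_init_fn    = cfq_cpd_init,
            /* cftypes and the existing pd_* hooks are unchanged */
    };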

    Signed-off-by: Paolo Valente
    Signed-off-by: Arianna Avanzini
    Acked-by: Tejun Heo
    Cc: Jens Axboe
    Signed-off-by: Jens Axboe

    Arianna Avanzini
     

06 Jun, 2015

1 commit

  • CFQ idling causes reduced IOPS throughput on non-rotational disks.
    Since disk head seeking is not applicable to SSDs, it doesn't really
    help performance by anticipating future nearby IO requests.

    By turning off idling (and switching to IOPS mode), we allow other
    processes to dispatch IO requests down to the driver and so increase IO
    throughput.

    The following FIO benchmark results were taken on a cloud SSD offering
    with idling on and off:

    Idling   iops     avg-lat(ms)   stddev    bw
    ------------------------------------------------------
    On        7054     90.107       38.697    28217KB/s
    Off      29255     21.836       11.730   117022KB/s

    fio --name=temp --size=100G --time_based --ioengine=libaio \
    --randrepeat=0 --direct=1 --invalidate=1 --verify=0 \
    --verify_fatal=0 --rw=randread --blocksize=4k --group_reporting=1 \
    --filename=/dev/sdb --runtime=10 --iodepth=64 --numjobs=10

    And the following is from a local SSD run:

    Idling   iops     avg-lat(ms)   stddev    bw
    ------------------------------------------------------
    On       19320     33.043       14.068    77281KB/s
    Off      21626     29.465       12.662    86507KB/s

    fio --name=temp --size=5G --time_based --ioengine=libaio \
    --randrepeat=0 --direct=1 --invalidate=1 --verify=0 \
    --verify_fatal=0 --rw=randread --blocksize=4k --group_reporting=1 \
    --filename=/fio_data --runtime=10 --iodepth=64 --numjobs=10

    Reviewed-by: Nauman Rafique
    Signed-off-by: Tahsin Erdogan
    Signed-off-by: Jens Axboe

    Tahsin Erdogan
     

02 Jun, 2015

14 commits

  • Now that bdi layer can handle per-blkcg bdi_writeback_congested state,
    blk_{set|clear}_congested() can propagate non-root blkcg congestion
    state to them.

    This can be easily achieved by disabling the root_rl tests in
    blk_{set|clear}_congested(). Note that we still need those tests when
    !CONFIG_CGROUP_WRITEBACK as otherwise we'll end up flipping root blkcg
    wb's congestion state for events happening on other blkcgs.

    v2: Updated for bdi_writeback_congested.
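
    Roughly what this ends up looking like (a sketch; the exact member
    paths are assumptions based on the series' description):

    #ifdef CONFIG_CGROUP_WRITEBACK
    static void blk_set_congested(struct request_list *rl, int sync)
    {
            /* every request_list maps to a per-blkcg wb_congested; no root_rl test */
            set_wb_congested(rl->blkg->wb_congested, sync);
    }
    #else
    static void blk_set_congested(struct request_list *rl, int sync)
    {
            /* without cgroup writeback only the root rl owns the bdi's wb state */
            if (rl == &rl->q->root_rl)
                    set_wb_congested(rl->q->backing_dev_info.wb.congested, sync);
    }
    #endif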

    Signed-off-by: Tejun Heo
    Cc: Jens Axboe
    Cc: Jan Kara
    Cc: Vivek Goyal
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • blk_{set|clear}_queue_congested() take @q and set or clear,
    respectively, the congestion state of its bdi's root wb. Because bdi
    used to be able to handle congestion state only on the root wb, the
    callers of those functions tested whether the congestion is on the
    root blkcg and skipped if not.

    This is cumbersome and makes implementation of per cgroup
    bdi_writeback congestion state propagation difficult. This patch
    renames blk_{set|clear}_queue_congested() to
    blk_{set|clear}_congested(), and makes them take request_list instead
    of request_queue and test whether the specified request_list is the
    root one before updating bdi_writeback congestion state. This makes
    the tests in the callers unnecessary and simplifies them.

    As there are no external users of these functions, the definitions are
    moved from include/linux/blkdev.h to block/blk-core.c.

    This patch doesn't introduce any noticeable behavior difference.

    Signed-off-by: Tejun Heo
    Cc: Jens Axboe
    Cc: Jan Kara
    Cc: Vivek Goyal
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • A blkg (blkcg_gq) can be congested and decongested independently from
    other blkgs on the same request_queue. Accordingly, for cgroup
    writeback support, the congestion status at bdi (backing_dev_info)
    should be split and updated separately from matching blkg's.

    This patch prepares by adding blkg->wb_congested and associating a
    blkg with its matching per-blkcg bdi_writeback_congested on creation.

    v2: Updated to associate bdi_writeback_congested instead of
    bdi_writeback.

    Signed-off-by: Tejun Heo
    Cc: Jens Axboe
    Cc: Jan Kara
    Cc: Vivek Goyal
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • For the planned cgroup writeback support, on each bdi
    (backing_dev_info), each memcg will be served by a separate wb
    (bdi_writeback). This patch updates bdi so that a bdi can host
    multiple wbs (bdi_writebacks).

    On the default hierarchy, blkcg implicitly enables memcg. This allows
    using memcg's page ownership for attributing writeback IOs, and every
    memcg - blkcg combination can be served by its own wb by assigning a
    dedicated wb to each memcg. This means that there may be multiple
    wb's of a bdi mapped to the same blkcg. As congested state is per
    blkcg - bdi combination, those wb's should share the same congested
    state. This is achieved by tracking congested state via
    bdi_writeback_congested structs which are keyed by blkcg.

    bdi->wb remains unchanged and will keep serving the root cgroup.
    cgwb's (cgroup wb's) for non-root cgroups are created on-demand or
    looked up while dirtying an inode according to the memcg of the page
    being dirtied or current task. Each cgwb is indexed on bdi->cgwb_tree
    by its memcg id. Once an inode is associated with its wb, it can be
    retrieved using inode_to_wb().

    Currently, none of the filesystems has FS_CGROUP_WRITEBACK and all
    pages will keep being associated with bdi->wb.

    v3: inode_attach_wb() in account_page_dirtied() moved inside
    mapping_cap_account_dirty() block where it's known to be !NULL.
    Also, an unnecessary NULL check before kfree() removed. Both
    detected by the kbuild bot.

    v2: Updated so that wb association is per inode and wb is per memcg
    rather than blkcg.

    Signed-off-by: Tejun Heo
    Cc: kbuild test robot
    Cc: Dan Carpenter
    Cc: Jens Axboe
    Cc: Jan Kara
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • cgroup writeback requires support from both bdi and filesystem sides.
    Add BDI_CAP_CGROUP_WRITEBACK and FS_CGROUP_WRITEBACK to indicate
    support and enable BDI_CAP_CGROUP_WRITEBACK on block based bdi's by
    default. Also, define CONFIG_CGROUP_WRITEBACK which is enabled if
    both MEMCG and BLK_CGROUP are enabled.

    inode_cgwb_enabled(), which determines whether both the bdi and the
    filesystem of a given inode support cgroup writeback, is added.
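
    A sketch of that check (the exact set of conditions is simplified):

    static inline bool inode_cgwb_enabled(struct inode *inode)
    {
            struct backing_dev_info *bdi = inode_to_bdi(inode);

            /* both the backing device and the filesystem must opt in */
            return (bdi->capabilities & BDI_CAP_CGROUP_WRITEBACK) &&
                   (inode->i_sb->s_type->fs_flags & FS_CGROUP_WRITEBACK);
    }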

    Signed-off-by: Tejun Heo
    Cc: Jens Axboe
    Cc: Jan Kara
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • With the planned cgroup writeback support, backing-dev related
    declarations will be more widely used across block and cgroup;
    unfortunately, including backing-dev.h from include/linux/blkdev.h
    makes cyclic include dependency quite likely.

    This patch separates out backing-dev-defs.h which only has the
    essential definitions and updates blkdev.h to include it. c files
    which need access to more backing-dev details now include
    backing-dev.h directly. This takes backing-dev.h off the common
    include dependency chain making it a lot easier to use it across block
    and cgroup.

    v2: fs/fat build failure fixed.

    Signed-off-by: Tejun Heo
    Reviewed-by: Jan Kara
    Cc: Jens Axboe
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • Currently, a bdi (backing_dev_info) embeds single wb (bdi_writeback)
    and the role of the separation is unclear. For cgroup support for
    writeback IOs, a bdi will be updated to host multiple wb's where each
    wb serves writeback IOs of a different cgroup on the bdi. To achieve
    that, a wb should carry all states necessary for servicing writeback
    IOs for a cgroup independently.

    This patch moves bdi->state into wb.

    * enum bdi_state is renamed to wb_state and the prefix of all enums is
    changed from BDI_ to WB_.

    * Explicit zeroing of bdi->state is removed without adding zeroing of
    wb->state as the whole data structure is zeroed on init anyway.

    * As there's still only one bdi_writeback per backing_dev_info, all
    uses of bdi->state are mechanically replaced with bdi->wb.state
    introducing no behavior changes.

    Signed-off-by: Tejun Heo
    Reviewed-by: Jan Kara
    Cc: Jens Axboe
    Cc: Wu Fengguang
    Cc: drbd-dev@lists.linbit.com
    Cc: Neil Brown
    Cc: Alasdair Kergon
    Cc: Mike Snitzer
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • Currently, a bio can only be associated with the io_context and blkcg
    of %current using bio_associate_current(). This is too restrictive
    for cgroup writeback support. Implement bio_associate_blkcg() which
    associates a bio with the specified blkcg.

    bio_associate_blkcg() leaves the io_context unassociated.
    bio_associate_current() is updated so that it considers a bio as
    already associated if it has a blkcg_css, instead of an io_context,
    associated with it.
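
    A sketch of the new helper as described (the error value and exact
    field handling are assumptions):

    int bio_associate_blkcg(struct bio *bio, struct cgroup_subsys_state *blkcg_css)
    {
            if (unlikely(bio->bi_css))
                    return -EBUSY;          /* already associated */
            css_get(blkcg_css);
            bio->bi_css = blkcg_css;        /* bi_ioc is deliberately left untouched */
            return 0;
    }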

    Signed-off-by: Tejun Heo
    Cc: Jens Axboe
    Cc: Vivek Goyal
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • bio_associate_current() currently open codes task_css() and
    css_tryget_online() to find and pin $current's blkcg css. Abstract it
    into task_get_css() which is implemented from cgroup side. As a task
    is always associated with an online css for every subsystem except
    while the css_set update is propagating, task_get_css() retries till
    css_tryget_online() succeeds.

    This is a cleanup and shouldn't lead to noticeable behavior changes.
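
    The helper boils down to a retry loop around the existing primitives,
    roughly:

    static inline struct cgroup_subsys_state *
    task_get_css(struct task_struct *task, int subsys_id)
    {
            struct cgroup_subsys_state *css;

            rcu_read_lock();
            while (true) {
                    css = task_css(task, subsys_id);
                    if (likely(css_tryget_online(css)))
                            break;
                    /* css_set update in flight; retry until an online css is seen */
                    cpu_relax();
            }
            rcu_read_unlock();
            return css;
    }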

    Signed-off-by: Tejun Heo
    Cc: Li Zefan
    Cc: Jens Axboe
    Cc: Vivek Goyal
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • Add global constant blkcg_root_css which points to &blkcg_root.css.
    This will be used by cgroup writeback support. If blkcg is disabled,
    it's defined as ERR_PTR(-EINVAL).

    v2: The declarations moved to include/linux/blk-cgroup.h as suggested
    by Vivek.

    Signed-off-by: Tejun Heo
    Cc: Vivek Goyal
    Cc: Jens Axboe
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • Currently, blkcg does a minor optimization where the root blkcg is
    created when the first blkcg policy is activated on a queue and
    destroyed on the deactivation of the last. On systems where blkcg is
    configured but not used, this saves one blkcg_gq struct per queue. On
    systems where blkcg is actually used, there's no difference. The only
    case where this can lead to any meaningful, albeit still minute, saving
    in memory consumption is when all blkcg policies are deactivated after
    being widely used in the system, which is a highly unlikely scenario.

    The conditional existence of root blkcg_gq has already created several
    bugs in blkcg and became an issue once again for the new per-cgroup
    wb_congested mechanism for cgroup writeback support leading to a NULL
    dereference when no blkcg policy is active. This is really not worth
    bothering with. This patch makes blkcg always allocate and link the
    root blkcg_gq and release it only on queue destruction.

    Signed-off-by: Tejun Heo
    Reported-by: Fengguang Wu
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • cgroup aware writeback support will require exposing some of blkcg
    details. In preparation, move block/blk-cgroup.h to
    include/linux/blk-cgroup.h. This patch is pure file move.

    Signed-off-by: Tejun Heo
    Cc: Vivek Goyal
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • Conflicts:
    arch/sparc/include/asm/topology_64.h

    Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • Storage controllers may expose multiple block devices that share hardware
    resources managed by blk-mq. This patch enhances the shared tags so a
    low-level driver can access the shared resources not tied to the unshared
    h/w contexts. This way the LLD can dynamically add and delete disks and
    request queues without having to track all the request_queue hctx's to
    iterate outstanding tags.

    Signed-off-by: Keith Busch
    Signed-off-by: Jens Axboe

    Keith Busch
     

30 May, 2015

2 commits


29 May, 2015

1 commit

  • bdi_unregister() now contains very little functionality.

    It contains a "WARN_ON" if bdi->dev is NULL. This warning is of no
    real consequence, as bdi->dev isn't needed by anything else in the
    function, and it triggers whenever
    blk_cleanup_queue() -> bdi_destroy()
    is called before bdi_unregister(), which has been the case since
    commit 6cd18e711dd8 ("block: destroy bdi before blockdev is unregistered.")

    So this warning isn't wanted.

    It also calls bdi_set_min_ratio(). This needs to be called after
    writes through the bdi have all been flushed, and before the bdi is destroyed.
    Calling it early is better than calling it late as it frees up a global
    resource.

    Calling it immediately after bdi_wb_shutdown() in bdi_destroy()
    perfectly fits these requirements.

    So bdi_unregister() can be discarded, with the important content moved
    to bdi_destroy(), as can the writeback_bdi_unregister event, which is
    already unused.

    Reported-by: Mike Snitzer
    Cc: stable@vger.kernel.org (v4.0)
    Fixes: c4db59d31e39 ("fs: don't reassign dirty inodes to default_backing_dev_info")
    Fixes: 6cd18e711dd8 ("block: destroy bdi before blockdev is unregistered.")
    Acked-by: Peter Zijlstra (Intel)
    Acked-by: Dan Williams
    Tested-by: Nicholas Moulin
    Signed-off-by: NeilBrown
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    NeilBrown
     

27 May, 2015

1 commit

  • Rename topology_thread_cpumask() to topology_sibling_cpumask()
    for more consistency with scheduler code.

    Signed-off-by: Bartosz Golaszewski
    Reviewed-by: Thomas Gleixner
    Acked-by: Russell King
    Acked-by: Catalin Marinas
    Cc: Benoit Cousson
    Cc: Fenghua Yu
    Cc: Guenter Roeck
    Cc: Jean Delvare
    Cc: Jonathan Corbet
    Cc: Linus Torvalds
    Cc: Oleg Drokin
    Cc: Peter Zijlstra
    Cc: Rafael J. Wysocki
    Cc: Russell King
    Cc: Viresh Kumar
    Link: http://lkml.kernel.org/r/1432645896-12588-2-git-send-email-bgolaszewski@baylibre.com
    Signed-off-by: Ingo Molnar

    Bartosz Golaszewski
     

22 May, 2015

2 commits

  • Currently dm-multipath has to clone the bios for every request sent
    to the lower devices, which wastes cpu cycles and ties down memory.

    This patch instead adds a new REQ_CLONE flag that instructs req_bio_endio
    to not complete bios attached to a request, which we set on clone
    requests similar to bios in a flush sequence. With this change I/O
    errors on a path failure only get propagated to dm-multipath, which
    can then either resubmit the I/O or complete the bios on the original
    request.

    I've done some basic testing of this on a Linux target with ALUA support,
    and it survives path failures during I/O nicely.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Mike Snitzer
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • Commit c4cf5261 ("bio: skip atomic inc/dec of ->bi_remaining for
    non-chains") regressed all existing callers that followed this pattern:
    1) saving a bio's original bi_end_io
    2) wiring up an intermediate bi_end_io
    3) restoring the original bi_end_io from intermediate bi_end_io
    4) calling bio_endio() to execute the restored original bi_end_io

    The regression was due to BIO_CHAIN only ever getting set if
    bio_inc_remaining() is called. For the above pattern it isn't set until
    step 3 above (step 2 would've needed to establish BIO_CHAIN). As such
    the first bio_endio(), in step 2 above, never decremented __bi_remaining
    before calling the intermediate bi_end_io -- leaving __bi_remaining with
    the value 1 instead of 0. When bio_inc_remaining() occurred during step
    3 it brought it to a value of 2. When the second bio_endio() was
    called, in step 4 above, it should've called the original bi_end_io but
    it didn't because there was an extra reference that wasn't dropped (due
    to atomic operations being optimized away since BIO_CHAIN wasn't set
    upfront).

    Fix this issue by removing the __bi_remaining management complexity for
    all callers that use the above pattern -- bio_chain() is the only
    interface that _needs_ to be concerned with __bi_remaining. For the
    above pattern callers just expect the bi_end_io they set to get called!
    Remove bio_endio_nodec() and also remove all bio_inc_remaining() calls
    that aren't associated with the bio_chain() interface.

    Also, the bio_inc_remaining() interface has been moved local to bio.c.
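
    For reference, the save/restore pattern described in steps 1)-4) looks
    roughly like this (the context struct and function names are
    illustrative; bi_end_io still took an error argument in this kernel
    generation):

    struct saved_ctx {
            bio_end_io_t    *orig_end_io;
            void            *orig_private;
    };

    static void intermediate_end_io(struct bio *bio, int error)
    {
            struct saved_ctx *ctx = bio->bi_private;

            /* step 3: restore the original completion */
            bio->bi_end_io  = ctx->orig_end_io;
            bio->bi_private = ctx->orig_private;
            kfree(ctx);

            /* step 4: the caller expects the restored bi_end_io to run here */
            bio_endio(bio, error);
    }

    static void hook_bio(struct bio *bio, struct saved_ctx *ctx)
    {
            /* steps 1 and 2: save the original completion and wire up ours */
            ctx->orig_end_io  = bio->bi_end_io;
            ctx->orig_private = bio->bi_private;
            bio->bi_end_io    = intermediate_end_io;
            bio->bi_private   = ctx;
    }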

    Fixes: c4cf5261 ("bio: skip atomic inc/dec of ->bi_remaining for non-chains")
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Jan Kara
    Signed-off-by: Mike Snitzer
    Signed-off-by: Jens Axboe

    Mike Snitzer
     

20 May, 2015

2 commits

  • The only possible problem of using mutex_lock() instead of trylock
    is about deadlock.

    If there aren't any locks held before calling blkdev_reread_part(),
    deadlock can't be caused by this conversion.

    If there are locks held before calling blkdev_reread_part(),
    and if these locks aren't required in the open/close handlers or the
    I/O path, deadlock shouldn't be caused either.

    Both user space's ioctl(BLKRRPART) and md_setup_drive() from
    init/do_mounts_md.c belong to the first case, so the conversion is safe
    for both.

    For loop, the previous patches in this patchset have fixed the ABBA lock
    dependency, so the conversion is OK.

    For nbd, tx_lock is held when calling the function:

    - neither open nor release holds the lock
    - when blkdev_reread_part() is run, the I/O thread has already been
    stopped, so tx_lock won't be acquired in the I/O path at that time
    - so the conversion won't cause deadlock for nbd

    For dasd, dasd_open(), dasd_release() and the request function don't
    acquire any mutex/semaphore, so the conversion should be safe.
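
    The conversion itself is a small wrapper, roughly:

    int blkdev_reread_part(struct block_device *bdev)
    {
            int res;

            /* plain mutex_lock(): per the analysis above, no caller holds bd_mutex */
            mutex_lock(&bdev->bd_mutex);
            res = __blkdev_reread_part(bdev);
            mutex_unlock(&bdev->bd_mutex);

            return res;
    }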

    Reviewed-by: Christoph Hellwig
    Tested-by: Jarod Wilson
    Acked-by: Jarod Wilson
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     
  • This patch exports blkdev_reread_part() for block drivers, also
    introduce __blkdev_reread_part().

    For some drivers, such as loop, reread of partitions can be run
    from the release path, and bd_mutex may already be held prior to
    calling ioctl_by_bdev(bdev, BLKRRPART, 0), so introduce
    __blkdev_reread_part for use in such cases.
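
    A sketch of the split (the permission and partition-scan checks shown
    in the unlocked helper are assumptions about what it keeps):

    /* caller must already hold bdev->bd_mutex, e.g. loop's release path */
    int __blkdev_reread_part(struct block_device *bdev)
    {
            struct gendisk *disk = bdev->bd_disk;

            if (!disk_part_scan_enabled(disk) || bdev != bdev->bd_contains)
                    return -EINVAL;
            if (!capable(CAP_SYS_ADMIN))
                    return -EACCES;

            lockdep_assert_held(&bdev->bd_mutex);

            return rescan_partitions(disk, bdev);
    }
    EXPORT_SYMBOL(__blkdev_reread_part);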

    CC: Christoph Hellwig
    CC: Jens Axboe
    CC: Tejun Heo
    CC: Alexander Viro
    CC: Markus Pargmann
    CC: Stefan Weinhuber
    CC: Stefan Haberland
    CC: Sebastian Ott
    CC: Fabian Frederick
    CC: Ming Lei
    CC: David Herrmann
    CC: Andrew Morton
    CC: Peter Zijlstra
    CC: nbd-general@lists.sourceforge.net
    CC: linux-s390@vger.kernel.org
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jarod Wilson
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Jarod Wilson
     

19 May, 2015

3 commits


13 May, 2015

1 commit

  • With commit ff36ab345 ("dm: remove request-based logic from
    make_request_fn wrapper") DM no longer calls blk_queue_bio() directly,
    so remove its export. Doing so required a forward declaration in
    blk-core.c.

    Signed-off-by: Mike Snitzer
    Signed-off-by: Jens Axboe

    Mike Snitzer