16 Aug, 2015

1 commit

  • Pull SCSI fixes from James Bottomley:
    "This has two libfc fixes for bugs causing rare crashes, one iscsi fix
    for a potential hang on shutdown, and a fix for an I/O blocksize issue
    which caused a regression"

    * tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
    sd: Fix maximum I/O size for BLOCK_PC requests
    libfc: Fix fc_fcp_cleanup_each_cmd()
    libfc: Fix fc_exch_recv_req() error path
    libiscsi: Fix host busy blocking during connection teardown

    Linus Torvalds
     

13 Aug, 2015

1 commit

  • Commit bcdb247c6b6a ("sd: Limit transfer length") clamped the maximum
    size of an I/O request to the MAXIMUM TRANSFER LENGTH field in the BLOCK
    LIMITS VPD. This had the unfortunate effect of also limiting the maximum
    size of non-filesystem requests sent to the device through sg/bsg.

    Avoid using blk_queue_max_hw_sectors() and set the max_sectors queue
    limit directly.

    Also update the comment in blk_limits_max_hw_sectors() to clarify that
    max_hw_sectors defines the limit for the I/O controller only.
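
    A minimal sketch of the idea described above (variable names are
    illustrative, not the exact sd.c code): clamp only the default max_sectors,
    so BLOCK_PC requests submitted through sg/bsg can still use the full
    controller limit.

        /* sd_revalidate_disk(), sketched: max_xfer is assumed to hold the
         * device's MAXIMUM TRANSFER LENGTH converted to 512-byte sectors */
        q->limits.max_sectors = min_t(unsigned int,
                                      queue_max_hw_sectors(q), max_xfer);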

    Signed-off-by: Martin K. Petersen
    Reported-by: Brian King
    Tested-by: Brian King
    Cc: stable@vger.kernel.org # 3.17+
    Signed-off-by: James Bottomley

    Martin K. Petersen
     

24 Jul, 2015

2 commits

  • This fixes a data corruption bug when using discard on top of MD linear,
    raid0 and raid10 personalities.

    Commit 20d0189b1012 "block: Introduce new bio_split()" permits sharing
    the bio_vec between the two resulting bios. That is fine for read/write
    requests where the bio_vec is immutable. For discards, however, we need
    to be able to attach a payload and update the bio_vec so the page can
    get mapped to a scatterlist entry. Therefore the bio_vec can not be
    shared when splitting discards and we must do a full clone.
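
    A minimal sketch of the resulting logic inside bio_split(), assuming the
    3.x/4.x-era REQ_DISCARD flag and clone helpers (an approximation of the
    fix, not the verbatim patch):

        struct bio *split;

        if (bio->bi_rw & REQ_DISCARD)
            /* full clone: the discard payload needs its own mutable bio_vec */
            split = bio_clone_bioset(bio, gfp, bs);
        else
            /* read/write: the immutable bio_vec can safely be shared */
            split = bio_clone_fast(bio, gfp, bs);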

    Signed-off-by: Martin K. Petersen
    Reported-by: Seunguk Shin
    Tested-by: Seunguk Shin
    Cc: Seunguk Shin
    Cc: Jens Axboe
    Cc: Kent Overstreet
    Cc: # v3.14+
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Martin K. Petersen
     
  • bio_associate_blkcg(), bio_associate_current() and wbc_account_io()
    are used to implement cgroup writeback support for filesystems and
    thus need to be exported. Export them.
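
    For illustration, the exports amount to the usual pattern (GPL-only export
    assumed):

        EXPORT_SYMBOL_GPL(bio_associate_blkcg);
        EXPORT_SYMBOL_GPL(bio_associate_current);
        EXPORT_SYMBOL_GPL(wbc_account_io);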

    Signed-off-by: Tejun Heo
    Reported-by: Stephen Rothwell
    Signed-off-by: Jens Axboe

    Tejun Heo
     

23 Jul, 2015

1 commit


16 Jul, 2015

1 commit

  • It is reasonable to set the default request timeout to 30 seconds instead
    of 30000 ticks, which may be as long as 300 seconds if HZ is 100; some
    arm64 based systems, for example, choose HZ=100.
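
    The fix is a one-liner in blk_mq_init_queue(); a sketch of the corrected
    call:

        /* express the default timeout in HZ-independent units */
        blk_queue_rq_timeout(q, 30 * HZ);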

    Signed-off-by: Ming Lei
    Fixes: c76cbbcf4044 ("blk-mq: put blk_queue_rq_timeout together in blk_mq_init_queue()")
    Signed-off-by: Jens Axboe

    Ming Lei
     

10 Jul, 2015

4 commits

  • e48453c386f3 ("block, cgroup: implement policy-specific per-blkcg
    data") updated per-blkcg policy data to be dynamically allocated.
    When a policy is registered, its policy data aren't created. Instead,
    when the policy is activated on a queue, the policy data are allocated
    if there are blkg's (blkcg_gq's) which are attached to a given blkcg.
    This is buggy. Consider the following scenario.

    1. A blkcg is created. No blkg's attached yet.

    2. The policy is registered. No policy data is allocated.

    3. The policy is activated on a queue. As the above blkcg doesn't
    have any blkg's, it won't allocate the matching blkcg_policy_data.

    4. An IO is issued from the blkcg and blkg is created and the blkcg
    still doesn't have the matching policy data allocated.

    With cfq-iosched, this leads to an oops.

    It also doesn't free policy data on policy unregistration assuming
    that freeing of all policy data on blkcg destruction should take care
    of it; however, this also is incorrect.

    1. A blkcg has policy data.

    2. The policy gets unregistered but the policy data remains.

    3. Another policy gets registered on the same slot.

    4. Later, the new policy tries to allocate policy data on the previous
    blkcg but the slot is already occupied and gets skipped. The
    policy ends up operating on the policy data of the previous policy.

    There's no reason to manage blkcg_policy_data lazily. The reason we
    do lazy allocation of blkg's is that the number of all possible blkg's
    is the product of cgroups and block devices which can reach a
    surprising level. blkcg_policy_data is constrained by the number of
    cgroups and shouldn't be a problem.

    This patch makes blkcg_policy_data be allocated for all existing blkcg's
    on policy registration and freed on unregistration, and removes
    blkcg_policy_data handling from the policy [de]activation paths. This
    ensures that blkcg_policy_data is created and removed together with the
    policy it belongs to and fixes the problems described above.
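
    A rough sketch of the registration-time allocation (the all_blkcgs list
    comes from the companion patch below; the per-blkcg slot array and field
    names are approximations):

        /* in blkcg_policy_register(), with blkcg_pol_mutex held */
        if (pol->cpd_size) {
            struct blkcg *blkcg;

            list_for_each_entry(blkcg, &all_blkcgs, all_blkcgs_node) {
                struct blkcg_policy_data *cpd;

                cpd = kzalloc(pol->cpd_size, GFP_KERNEL);
                if (!cpd)
                    goto err_free_cpds;       /* unwind already-allocated cpd's */
                blkcg->pd[pol->plid] = cpd;   /* per-blkcg slot, name approximated */
                cpd->plid = pol->plid;
                pol->cpd_init_fn(blkcg);      /* policy's per-blkcg constructor */
            }
        }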

    Signed-off-by: Tejun Heo
    Fixes: e48453c386f3 ("block, cgroup: implement policy-specific per-blkcg data")
    Cc: Vivek Goyal
    Cc: Arianna Avanzini
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • Add an all_blkcgs list which goes through blkcg->all_blkcgs_node and is
    protected by blkcg_pol_mutex. This will be used to fix the
    blkcg_policy_data allocation bug.
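
    A minimal sketch of what this adds (exact hook placement approximated):

        static LIST_HEAD(all_blkcgs);           /* protected by blkcg_pol_mutex */

        /* blkcg_css_alloc(): register the new blkcg */
        mutex_lock(&blkcg_pol_mutex);
        list_add_tail(&blkcg->all_blkcgs_node, &all_blkcgs);
        mutex_unlock(&blkcg_pol_mutex);

        /* blkcg_css_free(): drop it again */
        mutex_lock(&blkcg_pol_mutex);
        list_del(&blkcg->all_blkcgs_node);
        mutex_unlock(&blkcg_pol_mutex);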

    Signed-off-by: Tejun Heo
    Cc: Vivek Goyal
    Cc: Arianna Avanzini
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • An entry in blkcg_policy[] is stable while there are non-bypassing
    in-flight IOs on a request_queue which has the policy activated. This
    is why most derefs of blkcg_policy[] don't need explicit locking;
    however, blkcg_css_alloc() isn't invoked from IO path and thus doesn't
    have this protection and may race policies being added and removed.

    Fix it by adding explicit blkcg_pol_mutex protection around
    blkcg_policy[] iteration in blkcg_css_alloc().
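
    The shape of the fix, sketched (the per-policy allocation inside the loop
    is elided):

        mutex_lock(&blkcg_pol_mutex);
        for (i = 0; i < BLKCG_MAX_POLS; i++) {
            struct blkcg_policy *pol = blkcg_policy[i];

            if (!pol)
                continue;
            /* ... allocate per-blkcg policy data for pol as before ... */
        }
        mutex_unlock(&blkcg_pol_mutex);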

    Signed-off-by: Tejun Heo
    Fixes: e48453c386f3 ("block, cgroup: implement policy-specific per-blkcg data")
    Cc: Vivek Goyal
    Cc: Arianna Avanzini
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • blkcg_pol_mutex primarily protects the blkcg_policy array. It also
    protects cgroup file type [un]registration during policy addition /
    removal. This puts blkcg_pol_mutex outside cgroup internal
    synchronization and in turn makes it impossible to grab from blkcg's
    cgroup methods as that leads to cyclic dependency.

    Another problematic dependency arising from this is through cgroup
    interface file deactivation. Removing a cftype requires removing all
    files of the type which in turn involves draining all on-going
    invocations of the file methods. This means that an interface file
    implementation can't grab blkcg_pol_mutex as draining can lead to AA
    deadlock.

    blkcg_reset_stats() is already in this situation. It currently
    trylocks blkcg_pol_mutex and then unwinds and retries the whole
    operation on failure, which is cumbersome at best. It has a lengthy
    comment explaining how cgroup internal synchronization is involved and
    expected to be updated but as explained above this doesn't need cgroup
    internal locking to deadlock. It's a self-contained AA deadlock.

    The described circular dependencies can easily be broken by moving cftype
    [un]registration out of blkcg_pol_mutex and protecting it with an outer
    mutex. This patch introduces blkcg_pol_register_mutex, which wraps the
    entire policy [un]registration path including the cftype operations, and
    shrinks the blkcg_pol_mutex critical section. This also makes the trylock
    dancing in blkcg_reset_stats() unnecessary. Removed.

    This patch is necessary for the following blkcg_policy_data allocation
    bug fixes.
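
    A sketch of the new locking structure (error handling elided; the cftype
    registration call and subsystem name are assumptions based on the 4.2-era
    cgroup API):

        static DEFINE_MUTEX(blkcg_pol_register_mutex);  /* outer, covers cftypes */
        static DEFINE_MUTEX(blkcg_pol_mutex);           /* protects blkcg_policy[] */

        int blkcg_policy_register(struct blkcg_policy *pol)
        {
            mutex_lock(&blkcg_pol_register_mutex);
            mutex_lock(&blkcg_pol_mutex);
            /* ... find a free blkcg_policy[] slot and install pol ... */
            mutex_unlock(&blkcg_pol_mutex);

            /* cftype registration now happens outside blkcg_pol_mutex */
            if (pol->cftypes)
                WARN_ON(cgroup_add_legacy_cftypes(&blkio_cgrp_subsys,
                                                  pol->cftypes));

            mutex_unlock(&blkcg_pol_register_mutex);
            return 0;
        }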

    Signed-off-by: Tejun Heo
    Cc: Vivek Goyal
    Cc: Arianna Avanzini
    Signed-off-by: Jens Axboe

    Tejun Heo
     

07 Jul, 2015

3 commits

  • Currently, per-blkcg data is freed each time a policy is deactivated,
    which also happens on scheduler switch. However, when switching from a
    scheduler implementing a policy which requires per-blkcg data to another
    one, that same policy might still be active on other devices, and
    therefore the same per-blkcg data could still be in use.
    This commit lets per-blkcg data be freed when the blkcg itself is freed
    instead of on policy deactivation.

    Signed-off-by: Arianna Avanzini
    Reported-and-tested-by: Michael Kaminsky
    Fixes: e48453c3 ("block, cgroup: implement policy-specific per-blkcg data")
    Signed-off-by: Jens Axboe

    Arianna Avanzini
     
  • Use FIELD_SIZEOF() instead of open coding the equivalent sizeof
    expression.
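
    For reference, the kernel.h macro and a hypothetical before/after (the
    actual call site isn't quoted in this log):

        #define FIELD_SIZEOF(t, f) (sizeof(((t*)0)->f))

        /* before (hypothetical example) */
        len = sizeof(((struct request *)0)->cmd_flags);
        /* after */
        len = FIELD_SIZEOF(struct request, cmd_flags);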

    Signed-off-by: Maninder Singh
    Signed-off-by: Jens Axboe

    Maninder Singh
     
  • bio_integrity_alloc() and bio_integrity_free() assume that if a bio was
    allocated from a bioset, that bioset also had its bio_integrity_pool
    allocated using bioset_integrity_create(). This is a very bad assumption
    given that bioset_create() and bioset_integrity_create() are completely
    disjoint. Not all callers of bioset_create() have been trained to also
    call bioset_integrity_create() -- and they may not care to be.

    Fix this by falling back to kmalloc'ing 'struct bio_integrity_payload'
    rather than force all bioset consumers to (wastefully) preallocate a
    bio_integrity_pool that they very likely won't actually need (given the
    niche nature of the current block integrity support).
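
    A sketch of the fallback in bio_integrity_alloc(), assuming the 4.2-era
    mempool-pointer layout (error handling elided):

        struct bio_integrity_payload *bip;
        unsigned int inline_vecs;

        if (!bs || !bs->bio_integrity_pool) {
            /* bioset has no integrity pool: fall back to kmalloc */
            bip = kmalloc(sizeof(*bip) +
                          sizeof(struct bio_vec) * nr_vecs, gfp_mask);
            inline_vecs = nr_vecs;
        } else {
            bip = mempool_alloc(bs->bio_integrity_pool, gfp_mask);
            inline_vecs = BIP_INLINE_VECS;
        }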

    Otherwise, a NULL pointer "Kernel BUG" with a trace like the following
    will be observed (as seen on s390x using zfcp storage) because dm-io
    doesn't use bioset_integrity_create() when creating its bioset:

    [ 791.643338] Call Trace:
    [ 791.643339] ([] 0x3df98b848)
    [ 791.643341] [] bio_integrity_alloc+0x48/0xf8
    [ 791.643348] [] bio_integrity_prep+0xae/0x2f0
    [ 791.643349] [] blk_queue_bio+0x1c8/0x3d8
    [ 791.643355] [] generic_make_request+0xc0/0x100
    [ 791.643357] [] submit_bio+0xa2/0x198
    [ 791.643406] [] dispatch_io+0x15c/0x3b0 [dm_mod]
    [ 791.643419] [] dm_io+0x176/0x2f0 [dm_mod]
    [ 791.643423] [] do_reads+0x13a/0x1a8 [dm_mirror]
    [ 791.643425] [] do_mirror+0x142/0x298 [dm_mirror]
    [ 791.643428] [] process_one_work+0x18a/0x3f8
    [ 791.643432] [] worker_thread+0x132/0x3b0
    [ 791.643435] [] kthread+0xd2/0xd8
    [ 791.643438] [] kernel_thread_starter+0x6/0xc
    [ 791.643446] [] kernel_thread_starter+0x0/0xc

    Signed-off-by: Mike Snitzer
    Cc: stable@vger.kernel.org
    Signed-off-by: Jens Axboe

    Mike Snitzer
     

28 Jun, 2015

1 commit

  • Whenever blk_fill_sghdr_rq fails, its errno code is ignored and changed to
    EFAULT. This can cause very confusing errors:

    $ sg_persist -k /dev/sda
    persistent reservation in: pass through os error: Bad address

    The fix is trivial, just propagate the return value from
    blk_fill_sghdr_rq.
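
    Sketched against the sg_io() path in block/scsi_ioctl.c (the goto label is
    illustrative):

        ret = blk_fill_sghdr_rq(q, rq, hdr, mode);
        if (ret < 0)
            goto out_free_cdb;      /* was: ret = -EFAULT on any failure */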

    Signed-off-by: Paolo Bonzini
    Acked-by: Jeff Moyer
    Signed-off-by: Jens Axboe

    Paolo Bonzini
     

27 Jun, 2015

1 commit

  • Pull device mapper fixes from Mike Snitzer:
    "Apologies for not pressing this request-based DM partial completion
    issue further, it was an oversight on my part. We'll have to get it
    fixed up properly and revisit for a future release.

    - Revert the block and DM core changes that removed request-based DM's
    ability to handle partial request completions -- otherwise with the
    current SCSI LLDs these changes could lead to silent data
    corruption.

    - Fix two DM version bumps that were missing from the initial 4.2 DM
    pull request (enabled userspace lvm2 to know certain changes have
    been made)"

    * tag 'dm-4.2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
    dm cache policy smq: fix "default" version to be 1.4.0
    dm: bump the ioctl version to 4.32.0
    Revert "block, dm: don't copy bios for request clones"
    Revert "dm: do not allocate any mempools for blk-mq request-based DM"

    Linus Torvalds
     

26 Jun, 2015

3 commits

  • This reverts commit 5f1b670d0bef508a5554d92525f5f6d00d640b38.

    Justification for revert as reported in this dm-devel post:
    https://www.redhat.com/archives/dm-devel/2015-June/msg00160.html

    this change should not be pushed to mainline yet.

    Firstly, Christoph has a newer version of the patch that fixes the silent
    data corruption problem:
    https://www.redhat.com/archives/dm-devel/2015-May/msg00229.html

    And the new version still depends on LLDDs always completing requests to
    the end when an error happens, while the block API doesn't enforce such a
    requirement. If the assumption is ever broken, the inconsistency between
    the request and its bios (e.g. rq->__sector and rq->bio) will cause silent
    data corruption:
    https://www.redhat.com/archives/dm-devel/2015-June/msg00022.html

    Reported-by: Junichi Nomura
    Signed-off-by: Mike Snitzer

    Mike Snitzer
     
  • Pull cgroup writeback support from Jens Axboe:
    "This is the big pull request for adding cgroup writeback support.

    This code has been in development for a long time, and it has been
    simmering in for-next for a good chunk of this cycle too. This is one
    of those problems that has been talked about for at least half a
    decade, finally there's a solution and code to go with it.

    Also see last week's writeup on LWN:

    http://lwn.net/Articles/648292/"

    * 'for-4.2/writeback' of git://git.kernel.dk/linux-block: (85 commits)
    writeback, blkio: add documentation for cgroup writeback support
    vfs, writeback: replace FS_CGROUP_WRITEBACK with SB_I_CGROUPWB
    writeback: do foreign inode detection iff cgroup writeback is enabled
    v9fs: fix error handling in v9fs_session_init()
    bdi: fix wrong error return value in cgwb_create()
    buffer: remove unusued 'ret' variable
    writeback: disassociate inodes from dying bdi_writebacks
    writeback: implement foreign cgroup inode bdi_writeback switching
    writeback: add lockdep annotation to inode_to_wb()
    writeback: use unlocked_inode_to_wb transaction in inode_congested()
    writeback: implement unlocked_inode_to_wb transaction and use it for stat updates
    writeback: implement [locked_]inode_to_wb_and_lock_list()
    writeback: implement foreign cgroup inode detection
    writeback: make writeback_control track the inode being written back
    writeback: relocate wb[_try]_get(), wb_put(), inode_{attach|detach}_wb()
    mm: vmscan: disable memcg direct reclaim stalling if cgroup writeback support is in use
    writeback: implement memcg writeback domain based throttling
    writeback: reset wb_domain->dirty_limit[_tstmp] when memcg domain size changes
    writeback: implement memcg wb_domain
    writeback: update wb_over_bg_thresh() to use wb_domain aware operations
    ...

    Linus Torvalds
     
  • Pull core block IO update from Jens Axboe:
    "Nothing really major in here, mostly a collection of smaller
    optimizations and cleanups, mixed with various fixes. In more detail,
    this contains:

    - Addition of policy specific data to blkcg for block cgroups. From
    Arianna Avanzini.

    - Various cleanups around command types from Christoph.

    - Cleanup of the suspend block I/O path from Christoph.

    - Plugging updates from Shaohua and Jeff Moyer, for blk-mq.

    - Eliminating atomic inc/dec of both remaining IO count and reference
    count in a bio. From me.

    - Fixes for SG gap and chunk size support for data-less (discards)
    IO, so we can merge these better. From me.

    - Small restructuring of blk-mq shared tag support, freeing drivers
    from iterating hardware queues. From Keith Busch.

    - A few cfq-iosched tweaks, from Tahsin Erdogan and me. Makes the
    IOPS mode the default for non-rotational storage"

    * 'for-4.2/core' of git://git.kernel.dk/linux-block: (35 commits)
    cfq-iosched: fix other locations where blkcg_to_cfqgd() can return NULL
    cfq-iosched: fix sysfs oops when attempting to read unconfigured weights
    cfq-iosched: move group scheduling functions under ifdef
    cfq-iosched: fix the setting of IOPS mode on SSDs
    blktrace: Add blktrace.c to BLOCK LAYER in MAINTAINERS file
    block, cgroup: implement policy-specific per-blkcg data
    block: Make CFQ default to IOPS mode on SSDs
    block: add blk_set_queue_dying() to blkdev.h
    blk-mq: Shared tag enhancements
    block: don't honor chunk sizes for data-less IO
    block: only honor SG gap prevention for merges that contain data
    block: fix returnvar.cocci warnings
    block, dm: don't copy bios for request clones
    block: remove management of bi_remaining when restoring original bi_end_io
    block: replace trylock with mutex_lock in blkdev_reread_part()
    block: export blkdev_reread_part() and __blkdev_reread_part()
    suspend: simplify block I/O handling
    block: collapse bio bit space
    block: remove unused BIO_RW_BLOCK and BIO_EOF flags
    block: remove BIO_EOPNOTSUPP
    ...

    Linus Torvalds
     

23 Jun, 2015

1 commit

  • Pull scheduler updates from Ingo Molnar:
    "The main changes are:

    - lockless wakeup support for futexes and IPC message queues
    (Davidlohr Bueso, Peter Zijlstra)

    - Replace spinlocks with atomics in thread_group_cputimer(), to
    improve scalability (Jason Low)

    - NUMA balancing improvements (Rik van Riel)

    - SCHED_DEADLINE improvements (Wanpeng Li)

    - clean up and reorganize preemption helpers (Frederic Weisbecker)

    - decouple page fault disabling machinery from the preemption
    counter, to improve debuggability and robustness (David
    Hildenbrand)

    - SCHED_DEADLINE documentation updates (Luca Abeni)

    - topology CPU masks cleanups (Bartosz Golaszewski)

    - /proc/sched_debug improvements (Srikar Dronamraju)"

    * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (79 commits)
    sched/deadline: Remove needless parameter in dl_runtime_exceeded()
    sched: Remove superfluous resetting of the p->dl_throttled flag
    sched/deadline: Drop duplicate init_sched_dl_class() declaration
    sched/deadline: Reduce rq lock contention by eliminating locking of non-feasible target
    sched/deadline: Make init_sched_dl_class() __init
    sched/deadline: Optimize pull_dl_task()
    sched/preempt: Add static_key() to preempt_notifiers
    sched/preempt: Fix preempt notifiers documentation about hlist_del() within unsafe iteration
    sched/stop_machine: Fix deadlock between multiple stop_two_cpus()
    sched/debug: Add sum_sleep_runtime to /proc/<pid>/sched
    sched/debug: Replace vruntime with wait_sum in /proc/sched_debug
    sched/debug: Properly format runnable tasks in /proc/sched_debug
    sched/numa: Only consider less busy nodes as numa balancing destinations
    Revert 095bebf61a46 ("sched/numa: Do not move past the balance point if unbalanced")
    sched/fair: Prevent throttling in early pick_next_task_fair()
    preempt: Reorganize the notrace definitions a bit
    preempt: Use preempt_schedule_context() as the official tracing preemption point
    sched: Make preempt_schedule_context() function-tracing safe
    x86: Remove cpu_sibling_mask() and cpu_core_mask()
    x86: Replace cpu_**_mask() with topology_**_cpumask()
    ...

    Linus Torvalds
     

21 Jun, 2015

1 commit


20 Jun, 2015

2 commits

  • If none of the devices in the system are using CFQ, then attempting to
    read:

    /sys/fs/cgroup/blkio/blkio.leaf_weight

    will result in a NULL dereference. Check for a valid cfq_group_data
    struct before attempting to dereference it.
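
    A sketch of the guard in the sysfs show path (helper names are from
    cfq-iosched; the exact function touched is approximated):

        struct blkcg *blkcg = css_to_blkcg(seq_css(sf));
        struct cfq_group_data *cfqgd = blkcg_to_cfqgd(blkcg);
        unsigned int val = 0;

        if (cfqgd)                      /* NULL when CFQ was never configured */
            val = cfqgd->weight;
        seq_printf(sf, "%u\n", val);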

    Reported-by: Andrey Wagin
    Fixes: e48453c3 ("block, cgroup: implement policy-specific per-blkcg data")
    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • If CFQ_GROUP_IOSCHED is not set, the compiler produces the
    following warning:

    CC block/cfq-iosched.o
    linux/block/cfq-iosched.c:469:2:
    warning: 'cpd_to_cfqgd' defined but not used [-Wunused-function]
    *cpd_to_cfqgd(struct blkcg_policy_data *cpd)
    ^

    In reality, two other lookup functions aren't used either if
    CFQ_GROUP_IOSCHED isn't set. Move all three under one of the
    CFQ_GROUP_IOSCHED sections in the code.
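
    The change boils down to wrapping the group-scheduling lookup helpers,
    along the lines of (bodies elided):

        #ifdef CONFIG_CFQ_GROUP_IOSCHED
        /* cpd_to_cfqgd(), blkcg_to_cfqgd(), ... only used by group scheduling */
        #endif /* CONFIG_CFQ_GROUP_IOSCHED */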

    Reported-by: Vladimir Zapolskiy
    Signed-off-by: Jens Axboe

    Jens Axboe
     

11 Jun, 2015

1 commit

  • =================================
    [ INFO: inconsistent lock state ]
    4.1.0-rc7+ #217 Tainted: G O
    ---------------------------------
    inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage.
    swapper/6/0 [HC0[0]:SC1[1]:HE1:SE0] takes:
    (ext_devt_lock){+.?...}, at: [] blk_free_devt+0x3c/0x70
    {SOFTIRQ-ON-W} state was registered at:
    [] __lock_acquire+0x461/0x1e70
    [] lock_acquire+0xb7/0x290
    [] _raw_spin_lock+0x38/0x50
    [] blk_alloc_devt+0x6d/0xd0
    [] __lock_acquire+0x3fe/0x1e70
    [] ? __lock_acquire+0xe5d/0x1e70
    [] lock_acquire+0xb7/0x290
    [] ? blk_free_devt+0x3c/0x70
    [] _raw_spin_lock+0x38/0x50
    [] ? blk_free_devt+0x3c/0x70
    [] blk_free_devt+0x3c/0x70
    [] part_release+0x1c/0x50
    [] device_release+0x36/0xb0
    [] kobject_cleanup+0x7b/0x1a0
    [] kobject_put+0x30/0x70
    [] put_device+0x17/0x20
    [] delete_partition_rcu_cb+0x16c/0x180
    [] ? read_dev_sector+0xa0/0xa0
    [] rcu_process_callbacks+0x2ff/0xa90
    [] ? rcu_process_callbacks+0x2bf/0xa90
    [] __do_softirq+0xde/0x600

    Neil sees this in his tests and it also triggers on pmem driver unbind
    for the libnvdimm tests. This fix is on top of an initial fix by Keith
    for incorrect usage of mutex_lock() in this path: 2da78092dda1 "block:
    Fix dev_t minor allocation lifetime". Both this and 2da78092dda1 are
    candidates for -stable.

    Fixes: 2da78092dda1 ("block: Fix dev_t minor allocation lifetime")
    Cc:
    Cc: Keith Busch
    Reported-by: NeilBrown
    Signed-off-by: Dan Williams
    Signed-off-by: Jens Axboe

    Dan Williams
     

10 Jun, 2015

2 commits

  • A previous commit wanted to make CFQ default to IOPS mode on
    non-rotational storage; however, it did so when the queue was
    initialized, and the non-rotational flag is only set later on
    in the probe.

    Add an elevator hook that gets called off the add_disk() path;
    at that point we know that feature probing has finished, and
    we can reliably check for the various flags that drivers can
    set.
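
    A sketch of such a hook for CFQ (the elevator_ops field name is an
    assumption based on the 4.x elevator API):

        /* .elevator_registered_fn, invoked once the disk has been added */
        static void cfq_registered_queue(struct request_queue *q)
        {
            struct cfq_data *cfqd = q->elevator->elevator_data;

            /* feature probing is complete, so the flag is trustworthy now */
            if (blk_queue_nonrot(q))
                cfqd->cfq_slice_idle = 0;   /* IOPS mode */
        }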

    Fixes: 41c0126b ("block: Make CFQ default to IOPS mode on SSDs")
    Tested-by: Romain Francoise
    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • blk_cleanup_queue() can now be called before del_gendisk() [1].
    del_gendisk() ends up touching hctx->ctxs from
    blk_mq_unregister_hctx(), but by that time the array has already
    been freed by blk_cleanup_queue().

    So this patch moves the freeing of hctx->ctxs into the queue's
    release handler, fixing the oops reported by Stefan.

    [1], 6cd18e711dd8075 (block: destroy bdi before blockdev is
    unregistered)
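
    Sketch of the relocated cleanup (the release-handler name is assumed to be
    the existing blk_mq_release()):

        static void blk_mq_release(struct request_queue *q)
        {
            struct blk_mq_hw_ctx *hctx;
            unsigned int i;

            queue_for_each_hw_ctx(q, hctx, i) {
                if (!hctx)
                    continue;
                kfree(hctx->ctxs);   /* moved here from the earlier teardown path */
                kfree(hctx);
            }
        }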

    Reported-by: Stefan Seyfried
    Cc: NeilBrown
    Cc: Christoph Hellwig
    Cc: stable@vger.kernel.org (v4.0)
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     

07 Jun, 2015

1 commit

  • The block IO (blkio) controller enables the block layer to provide service
    guarantees in a hierarchical fashion. Specifically, service guarantees
    are provided by registered request-accounting policies. As of now, a
    proportional-share and a throttling policy are available. They are
    implemented, respectively, by the CFQ I/O scheduler and the blk-throttle
    subsystem. Unfortunately, as for adding new policies, the current
    implementation of the block IO controller is only halfway ready to allow
    new policies to be plugged in. This commit provides a solution to make
    the block IO controller fully ready to handle new policies.
    In what follows, we first describe briefly the current state, and then
    list the changes made by this commit.

    The throttling policy does not need any per-cgroup information to perform
    its task. In contrast, the proportional share policy uses, for each cgroup,
    both the weight assigned by the user to the cgroup, and a set of dynamically-
    computed weights, one for each device.

    The first, user-defined weight is stored in the blkcg data structure: the
    block IO controller allocates a private blkcg data structure for each
    cgroup in the blkio cgroups hierarchy (regardless of which policy is active).
    In other words, the block IO controller internally mirrors the blkio cgroups
    with private blkcg data structures.

    On the other hand, for each cgroup and device, the corresponding dynamically-
    computed weight is maintained in the following, different way. For each device,
    the block IO controller keeps a private blkcg_gq structure for each cgroup in
    blkio. In other words, block IO also keeps one private mirror copy of the blkio
    cgroups hierarchy for each device, made of blkcg_gq structures.
    Each blkcg_gq structure keeps per-policy information in a generic array of
    dynamically-allocated 'dedicated' data structures, one for each registered
    policy (so currently the array contains two elements). To be inserted into the
    generic array, each dedicated data structure embeds a generic blkg_policy_data
    structure. Consider now the array contained in the blkcg_gq structure
    corresponding to a given pair of cgroup and device: one of the elements
    of the array contains the dedicated data structure for the proportional-share
    policy, and this dedicated data structure contains the dynamically-computed
    weight for that pair of cgroup and device.

    The generic strategy adopted for storing per-policy data in blkcg_gq structures
    is already capable of handling new policies, whereas the one adopted with blkcg
    structures is not, because per-policy data are hard-coded in the blkcg
    structures themselves (currently only data related to the proportional-
    share policy).

    This commit addresses the above issues through the following changes:
    . It generalizes blkcg structures so that per-policy data are stored in the same
    way as in blkcg_gq structures.
    Specifically, it also lets the blkcg structure store per-policy data in a
    generic array of dynamically-allocated dedicated data structures. We will
    refer to these data structures as blkcg dedicated data structures, to
    distinguish them from the dedicated data structures inserted in the generic
    arrays kept by blkcg_gq structures.
    To allow blkcg dedicated data structures to be inserted in the generic array
    inside a blkcg structure, this commit also introduces a new blkcg_policy_data
    structure, which is the equivalent of blkg_policy_data for blkcg dedicated
    data structures.
    . It adds to the blkcg_policy structure, i.e., to the descriptor of a policy, a
    cpd_size field and a cpd_init field, to be initialized by the policy with,
    respectively, the size of the blkcg dedicated data structures, and the
    address of a constructor function for blkcg dedicated data structures.
    . It moves the CFQ-specific fields embedded in the blkcg data structure (i.e.,
    the fields related to the proportional-share policy), into a new blkcg
    dedicated data structure called cfq_group_data.
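
    The structures involved, sketched from the description above (member
    layout and the typedef signature are approximations, not the verbatim
    headers):

        /* per-blkcg (per-cgroup) policy data, embedded by each policy */
        struct blkcg_policy_data {
            int plid;                               /* owning policy id */
        };

        typedef void (blkcg_pol_init_cpd_fn)(const struct blkcg *blkcg);

        struct blkcg_policy {
            /* ... existing per-blkg (blkcg_gq) fields and hooks ... */
            size_t                   cpd_size;      /* size of the blkcg dedicated data */
            blkcg_pol_init_cpd_fn   *cpd_init_fn;   /* its constructor ("cpd_init") */
        };

        /* CFQ's blkcg dedicated data structure */
        struct cfq_group_data {
            struct blkcg_policy_data cpd;           /* must be the first member */
            unsigned int weight;
            unsigned int leaf_weight;
        };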

    Signed-off-by: Paolo Valente
    Signed-off-by: Arianna Avanzini
    Acked-by: Tejun Heo
    Cc: Jens Axboe
    Signed-off-by: Jens Axboe

    Arianna Avanzini
     

06 Jun, 2015

1 commit

  • CFQ idling causes reduced IOPS throughput on non-rotational disks.
    Since disk head seeking is not applicable to SSDs, it doesn't really
    help performance by anticipating future near-by IO requests.

    By turning off idling (and switching to IOPS mode), we allow other
    processes to dispatch IO requests down to the driver and so increase IO
    throughput.

    Following FIO benchmark results were taken on a cloud SSD offering with
    idling on and off:

    Idling      iops    avg-lat(ms)    stddev         bw
    ------------------------------------------------------
    On          7054         90.107    38.697    28217KB/s
    Off        29255         21.836    11.730   117022KB/s

    fio --name=temp --size=100G --time_based --ioengine=libaio \
    --randrepeat=0 --direct=1 --invalidate=1 --verify=0 \
    --verify_fatal=0 --rw=randread --blocksize=4k --group_reporting=1 \
    --filename=/dev/sdb --runtime=10 --iodepth=64 --numjobs=10

    And the following is from a local SSD run:

    Idling      iops    avg-lat(ms)    stddev         bw
    ------------------------------------------------------
    On         19320         33.043    14.068    77281KB/s
    Off        21626         29.465    12.662    86507KB/s

    fio --name=temp --size=5G --time_based --ioengine=libaio \
    --randrepeat=0 --direct=1 --invalidate=1 --verify=0 \
    --verify_fatal=0 --rw=randread --blocksize=4k --group_reporting=1 \
    --filename=/fio_data --runtime=10 --iodepth=64 --numjobs=10

    Reviewed-by: Nauman Rafique
    Signed-off-by: Tahsin Erdogan
    Signed-off-by: Jens Axboe

    Tahsin Erdogan
     

02 Jun, 2015

13 commits

  • Now that bdi layer can handle per-blkcg bdi_writeback_congested state,
    blk_{set|clear}_congested() can propagate non-root blkcg congestion
    state to them.

    This can be easily achieved by disabling the root_rl tests in
    blk_{set|clear}_congested(). Note that we still need those tests when
    !CONFIG_CGROUP_WRITEBACK as otherwise we'll end up flipping root blkcg
    wb's congestion state for events happening on other blkcgs.

    v2: Updated for bdi_writeback_congested.

    Signed-off-by: Tejun Heo
    Cc: Jens Axboe
    Cc: Jan Kara
    Cc: Vivek Goyal
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • blk_{set|clear}_queue_congested() take @q and set or clear,
    respectively, the congestion state of its bdi's root wb. Because bdi
    used to be able to handle congestion state only on the root wb, the
    callers of those functions tested whether the congestion is on the
    root blkcg and skipped if not.

    This is cumbersome and makes implementation of per cgroup
    bdi_writeback congestion state propagation difficult. This patch
    renames blk_{set|clear}_queue_congested() to
    blk_{set|clear}_congested(), and makes them take request_list instead
    of request_queue and test whether the specified request_list is the
    root one before updating bdi_writeback congestion state. This makes
    the tests in the callers unnecessary and simplifies them.

    As there are no external users of these functions, the definitions are
    moved from include/linux/blkdev.h to block/blk-core.c.

    This patch doesn't introduce any noticeable behavior difference.
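
    A rough sketch of the reworked helper, using the long-standing
    set_bdi_congested() interface for illustration (the actual patch updates
    the root wb's congestion state directly):

        static void blk_set_congested(struct request_list *rl, int sync)
        {
            /* only the root request_list maps to the bdi's congestion state */
            if (rl != &rl->q->root_rl)
                return;
            set_bdi_congested(&rl->q->backing_dev_info, sync);
        }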

    Signed-off-by: Tejun Heo
    Cc: Jens Axboe
    Cc: Jan Kara
    Cc: Vivek Goyal
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • A blkg (blkcg_gq) can be congested and decongested independently from
    other blkgs on the same request_queue. Accordingly, for cgroup
    writeback support, the congestion status at bdi (backing_dev_info)
    should be split and updated separately from matching blkg's.

    This patch prepares by adding blkg->wb_congested and associating a
    blkg with its matching per-blkcg bdi_writeback_congested on creation.

    v2: Updated to associate bdi_writeback_congested instead of
    bdi_writeback.

    Signed-off-by: Tejun Heo
    Cc: Jens Axboe
    Cc: Jan Kara
    Cc: Vivek Goyal
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • For the planned cgroup writeback support, on each bdi
    (backing_dev_info), each memcg will be served by a separate wb
    (bdi_writeback). This patch updates bdi so that a bdi can host
    multiple wbs (bdi_writebacks).

    On the default hierarchy, blkcg implicitly enables memcg. This allows
    using memcg's page ownership for attributing writeback IOs, and every
    memcg - blkcg combination can be served by its own wb by assigning a
    dedicated wb to each memcg. This means that there may be multiple
    wb's of a bdi mapped to the same blkcg. As congested state is per
    blkcg - bdi combination, those wb's should share the same congested
    state. This is achieved by tracking congested state via
    bdi_writeback_congested structs which are keyed by blkcg.

    bdi->wb remains unchanged and will keep serving the root cgroup.
    cgwb's (cgroup wb's) for non-root cgroups are created on-demand or
    looked up while dirtying an inode according to the memcg of the page
    being dirtied or current task. Each cgwb is indexed on bdi->cgwb_tree
    by its memcg id. Once an inode is associated with its wb, it can be
    retrieved using inode_to_wb().

    Currently, none of the filesystems has FS_CGROUP_WRITEBACK and all
    pages will keep being associated with bdi->wb.

    v3: inode_attach_wb() in account_page_dirtied() moved inside
    mapping_cap_account_dirty() block where it's known to be !NULL.
    Also, an unnecessary NULL check before kfree() removed. Both
    detected by the kbuild bot.

    v2: Updated so that wb association is per inode and wb is per memcg
    rather than blkcg.

    Signed-off-by: Tejun Heo
    Cc: kbuild test robot
    Cc: Dan Carpenter
    Cc: Jens Axboe
    Cc: Jan Kara
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • cgroup writeback requires support from both bdi and filesystem sides.
    Add BDI_CAP_CGROUP_WRITEBACK and FS_CGROUP_WRITEBACK to indicate
    support and enable BDI_CAP_CGROUP_WRITEBACK on block based bdi's by
    default. Also, define CONFIG_CGROUP_WRITEBACK which is enabled if
    both MEMCG and BLK_CGROUP are enabled.

    inode_cgwb_enabled(), which determines whether both the bdi and the
    filesystem of a given inode support cgroup writeback, is added.
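
    A sketch of the helper under CONFIG_CGROUP_WRITEBACK (the exact checks are
    approximated from the description above):

        static inline bool inode_cgwb_enabled(struct inode *inode)
        {
            struct backing_dev_info *bdi = inode_to_bdi(inode);

            return bdi_cap_account_dirty(bdi) &&
                   (bdi->capabilities & BDI_CAP_CGROUP_WRITEBACK) &&
                   (inode->i_sb->s_type->fs_flags & FS_CGROUP_WRITEBACK);
        }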

    Signed-off-by: Tejun Heo
    Cc: Jens Axboe
    Cc: Jan Kara
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • With the planned cgroup writeback support, backing-dev related
    declarations will be more widely used across block and cgroup;
    unfortunately, including backing-dev.h from include/linux/blkdev.h
    makes cyclic include dependency quite likely.

    This patch separates out backing-dev-defs.h which only has the
    essential definitions and updates blkdev.h to include it. c files
    which need access to more backing-dev details now include
    backing-dev.h directly. This takes backing-dev.h off the common
    include dependency chain making it a lot easier to use it across block
    and cgroup.

    v2: fs/fat build failure fixed.

    Signed-off-by: Tejun Heo
    Reviewed-by: Jan Kara
    Cc: Jens Axboe
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • Currently, a bdi (backing_dev_info) embeds single wb (bdi_writeback)
    and the role of the separation is unclear. For cgroup support for
    writeback IOs, a bdi will be updated to host multiple wb's where each
    wb serves writeback IOs of a different cgroup on the bdi. To achieve
    that, a wb should carry all states necessary for servicing writeback
    IOs for a cgroup independently.

    This patch moves bdi->state into wb.

    * enum bdi_state is renamed to wb_state and the prefix of all enums is
    changed from BDI_ to WB_.

    * Explicit zeroing of bdi->state is removed without adding zeroing of
    wb->state as the whole data structure is zeroed on init anyway.

    * As there's still only one bdi_writeback per backing_dev_info, all
    uses of bdi->state are mechanically replaced with bdi->wb.state
    introducing no behavior changes.
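
    Sketch of the renamed state bits (member set approximated):

        enum wb_state {
            WB_registered,          /* was BDI_registered */
            WB_writeback_running,   /* was BDI_writeback_running */
            WB_async_congested,     /* was BDI_async_congested */
            WB_sync_congested,      /* was BDI_sync_congested */
        };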

    Signed-off-by: Tejun Heo
    Reviewed-by: Jan Kara
    Cc: Jens Axboe
    Cc: Wu Fengguang
    Cc: drbd-dev@lists.linbit.com
    Cc: Neil Brown
    Cc: Alasdair Kergon
    Cc: Mike Snitzer
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • Currently, a bio can only be associated with the io_context and blkcg
    of %current using bio_associate_current(). This is too restrictive
    for cgroup writeback support. Implement bio_associate_blkcg() which
    associates a bio with the specified blkcg.

    bio_associate_blkcg() leaves the io_context unassociated.
    bio_associate_current() is updated so that it considers a bio as
    already associated if it has a blkcg_css, instead of an io_context,
    associated with it.
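
    A minimal sketch of the new association helper (assuming the 4.2-era
    bio->bi_css field used for blkcg association):

        int bio_associate_blkcg(struct bio *bio, struct cgroup_subsys_state *blkcg_css)
        {
            if (unlikely(bio->bi_css))
                return -EBUSY;          /* already associated */
            css_get(blkcg_css);
            bio->bi_css = blkcg_css;
            return 0;                   /* io_context deliberately left alone */
        }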

    Signed-off-by: Tejun Heo
    Cc: Jens Axboe
    Cc: Vivek Goyal
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • bio_associate_current() currently open codes task_css() and
    css_tryget_online() to find and pin $current's blkcg css. Abstract it
    into task_get_css() which is implemented from cgroup side. As a task
    is always associated with an online css for every subsystem except
    while the css_set update is propagating, task_get_css() retries till
    css_tryget_online() succeeds.

    This is a cleanup and shouldn't lead to noticeable behavior changes.
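
    The retry loop described above, sketched as such a cgroup-side helper
    might look (an approximation, not the verbatim header):

        static inline struct cgroup_subsys_state *
        task_get_css(struct task_struct *task, int subsys_id)
        {
            struct cgroup_subsys_state *css;

            rcu_read_lock();
            while (true) {
                css = task_css(task, subsys_id);
                /* a propagating css_set update may briefly expose an offline css */
                if (likely(css_tryget_online(css)))
                    break;
                cpu_relax();
            }
            rcu_read_unlock();
            return css;
        }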

    Signed-off-by: Tejun Heo
    Cc: Li Zefan
    Cc: Jens Axboe
    Cc: Vivek Goyal
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • Add global constant blkcg_root_css which points to &blkcg_root.css.
    This will be used by cgroup writeback support. If blkcg is disabled,
    it's defined as ERR_PTR(-EINVAL).

    v2: The declarations moved to include/linux/blk-cgroup.h as suggested
    by Vivek.
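
    Sketch of the declaration and the blkcg-disabled fallback, as described
    above:

        /* include/linux/blk-cgroup.h */
        #ifdef CONFIG_BLK_CGROUP
        extern struct cgroup_subsys_state * const blkcg_root_css;
        #else
        #define blkcg_root_css  ((struct cgroup_subsys_state *)ERR_PTR(-EINVAL))
        #endif

        /* block/blk-cgroup.c */
        struct cgroup_subsys_state * const blkcg_root_css = &blkcg_root.css;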

    Signed-off-by: Tejun Heo
    Cc: Vivek Goyal
    Cc: Jens Axboe
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • Currently, blkcg does a minor optimization where the root blkcg is
    created when the first blkcg policy is activated on a queue and
    destroyed on the deactivation of the last. On systems where blkcg is
    configured but not used, this saves one blkcg_gq struct per queue. On
    systems where blkcg is actually used, there's no difference. The only
    case where this can lead to any meaningful, albeit still minute, saving
    in memory consumption is when all blkcg policies are deactivated after
    being widely used in the system, which is a highly unlikely scenario.

    The conditional existence of root blkcg_gq has already created several
    bugs in blkcg and became an issue once again for the new per-cgroup
    wb_congested mechanism for cgroup writeback support leading to a NULL
    dereference when no blkcg policy is active. This is really not worth
    bothering with. This patch makes blkcg always allocate and link the
    root blkcg_gq and release it only on queue destruction.

    Signed-off-by: Tejun Heo
    Reported-by: Fengguang Wu
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • cgroup-aware writeback support will require exposing some blkcg
    details. In preparation, move block/blk-cgroup.h to
    include/linux/blk-cgroup.h. This patch is a pure file move.

    Signed-off-by: Tejun Heo
    Cc: Vivek Goyal
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • Conflicts:
    arch/sparc/include/asm/topology_64.h

    Signed-off-by: Ingo Molnar

    Ingo Molnar