18 Aug, 2020

1 commit

  • Changes from v1:
    - update commit description with proper ref-accounting justification

    commit db37a34c563b ("block, bfq: get a ref to a group when adding it to a service tree")
    introduced a leak of bfq_group and blkcg_gq objects because of a
    get/put imbalance.
    In fact, the whole idea of the original commit is wrong, because a
    bfq_group entity cannot disappear under us: it is referenced by its
    child bfq_queue entities from here:
    -> bfq_init_entity()
       -> bfqg_and_blkg_get(bfqg)
       -> entity->parent = bfqg->my_entity

    -> bfq_put_queue(bfqq)
       FINAL_PUT
       -> bfqg_and_blkg_put(bfqq_group(bfqq))
       -> kmem_cache_free(bfq_pool, bfqq)

    So the parent entity cannot disappear while a child entity is in a
    tree, and the child entities already have proper protection.
    This patch reverts commit db37a34c563b ("block, bfq: get a ref to a
    group when adding it to a service tree").

    bfq_group leak trace caused by the bad commit:
    -> blkg_alloc
       -> bfq_pd_alloc
          -> bfqg_get (+1)
    -> bfq_activate_bfqq
       -> bfq_activate_requeue_entity
          -> __bfq_activate_entity
             -> bfq_get_entity
                -> bfqg_and_blkg_get (+1)          <== (Note1)
    -> bfq_del_bfqq_busy
       -> bfq_deactivate_entity+0x53/0xc0 [bfq]
          -> __bfq_deactivate_entity+0x1b8/0x210 [bfq]
             -> bfq_forget_entity(is_in_service = true)
                entity->on_st_or_in_serv = false   <== (Note2)
                if (is_in_service)
                    return;   /* reference not touched */
    -> blkcg_css_offline
       -> blkcg_destroy_blkgs
          -> blkg_destroy
             -> bfq_pd_offline
                -> __bfq_deactivate_entity
                   if (!entity->on_st_or_in_serv)  /* true, because of (Note2) */
                       return false;
    -> bfq_pd_free
       -> bfqg_put() (-1; but bfqg->ref == 2 because of (Note1))
    So bfq_group and blkcg_gq will leak forever; see the test case below.

    ##TESTCASE_BEGIN:
    #!/bin/bash

    max_iters=${1:-100}
    #prep cgroup mounts
    mount -t tmpfs cgroup_root /sys/fs/cgroup
    mkdir /sys/fs/cgroup/blkio
    mount -t cgroup -o blkio none /sys/fs/cgroup/blkio

    # Prepare blkdev
    grep blkio /proc/cgroups
    truncate -s 1M img
    losetup /dev/loop0 img
    echo bfq > /sys/block/loop0/queue/scheduler

    grep blkio /proc/cgroups
    for ((i=0;i<max_iters;i++))
    do
        mkdir /sys/fs/cgroup/blkio/a
        echo 0 > /sys/fs/cgroup/blkio/a/cgroup.procs
        dd if=/dev/loop0 bs=4k count=1 of=/dev/null iflag=direct 2> /dev/null
        echo 0 > /sys/fs/cgroup/blkio/cgroup.procs
        rmdir /sys/fs/cgroup/blkio/a
        grep blkio /proc/cgroups
    done
    ##TESTCASE_END:

    Fixes: db37a34c563b ("block, bfq: get a ref to a group when adding it to a service tree")
    Tested-by: Oleksandr Natalenko
    Signed-off-by: Dmitry Monakhov
    Signed-off-by: Jens Axboe

    Dmitry Monakhov
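
    A minimal sketch of the parent-reference pairing the entry above
    relies on (simplified from the kernel sources; most bookkeeping
    elided):

        /* On bfq_queue init: the child entity pins its parent group. */
        void bfq_init_entity(struct bfq_entity *entity, struct bfq_group *bfqg)
        {
                struct bfq_queue *bfqq = bfq_entity_to_bfqq(entity);

                if (bfqq)
                        bfqg_and_blkg_get(bfqg);  /* +1 on bfqg and its blkg */
                entity->parent = bfqg->my_entity;
                entity->sched_data = &bfqg->sched_data;
        }

        /* On the final put of a bfq_queue, the pin is dropped. */
        void bfq_put_queue(struct bfq_queue *bfqq)
        {
                if (--bfqq->ref)
                        return;
                bfqg_and_blkg_put(bfqq_group(bfqq));  /* -1, pairs the get */
                kmem_cache_free(bfq_pool, bfqq);
        }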
     

22 Mar, 2020

4 commits

  • In bfq_pd_offline(), the function bfq_flush_idle_tree() is invoked to
    flush the rb tree that contains all idle entities belonging to the pd
    (cgroup) being destroyed. In particular, bfq_flush_idle_tree() is
    invoked before bfq_reparent_active_queues(). Yet the latter may happen
    to add some entities to the idle tree. It happens if, in some of the
    calls to bfq_bfqq_move() performed by bfq_reparent_active_queues(),
    the queue to move is empty and gets expired.

    This commit simply reverses the invocation order between
    bfq_flush_idle_tree() and bfq_reparent_active_queues().

    Tested-by: cki-project@redhat.com
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Paolo Valente
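
    A sketch of the reordering (function bodies elided; not the exact
    kernel diff):

        static void bfq_pd_offline(struct blkg_policy_data *pd)
        {
                /* ... for each service tree st, per ioprio class ... */

                /* First move all active queues to the root group ... */
                bfq_reparent_active_queues(bfqd, bfqg, st, ioprio_class);

                /*
                 * ... and only then flush the idle tree: reparenting may
                 * expire empty queues and thereby add entities to it.
                 */
                bfq_flush_idle_tree(st);
        }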
     
  • bfq_reparent_leaf_entity() reparents the input leaf entity (a leaf
    entity represents just a bfq_queue in an entity tree). Yet, the input
    entity is guaranteed to always be a leaf entity only in two-level
    entity trees. In this respect, because of the error fixed by
    commit 14afc5936197 ("block, bfq: fix overwrite of bfq_group pointer
    in bfq_find_set_group()"), all (wrongly collapsed) entity trees happened
    to actually have only two levels. After the latter commit, this does not
    hold any longer.

    This commit fixes this problem by modifying
    bfq_reparent_leaf_entity(), so that it searches an active leaf entity
    down the path that stems from the input entity. Such a leaf entity is
    guaranteed to exist when bfq_reparent_leaf_entity() is invoked.

    Tested-by: cki-project@redhat.com
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Paolo Valente
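
    A sketch of the fixed lookup (the helper name here is hypothetical;
    the real loop lives inside bfq_reparent_leaf_entity()):

        /* Walk down from 'entity' to an active leaf (a bfq_queue). */
        static struct bfq_entity *bfq_find_active_leaf(struct bfq_entity *entity,
                                                       int ioprio_class)
        {
                while (entity->my_sched_data) {   /* not a leaf yet */
                        struct bfq_sched_data *sd = entity->my_sched_data;
                        struct bfq_service_tree *st =
                                sd->service_tree + ioprio_class;

                        entity = bfq_entity_of(rb_first(&st->active));
                        if (!entity)   /* no active child in the tree */
                                entity = sd->in_service_entity;
                }
                return entity;   /* guaranteed to exist when invoked here */
        }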
     
  • A bfq_put_queue() may be invoked in __bfq_bic_change_cgroup(). The
    goal of this put is to release a process reference to a bfq_queue. But
    process-reference releases may trigger also some extra operation, and,
    to this goal, are handled through bfq_release_process_ref(). So, turn
    the invocation of bfq_put_queue() into an invocation of
    bfq_release_process_ref().

    Tested-by: cki-project@redhat.com
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Paolo Valente
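
    The change itself is a one-line substitution in
    __bfq_bic_change_cgroup() (sketch):

        -       bfq_put_queue(bfqq);                   /* bare reference drop */
        +       bfq_release_process_ref(bfqd, bfqq);   /* drop + extra operations */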
     
  • Commit ecedd3d7e199 ("block, bfq: get extra ref to prevent a queue
    from being freed during a group move") gets an extra reference to a
    bfq_queue before possibly deactivating it (temporarily), in
    bfq_bfqq_move(). This prevents the bfq_queue from disappearing before
    being reactivated in its new group.

    Yet, the bfq_queue may also be expired (i.e., its service may be
    stopped) before the bfq_queue is deactivated. And also an expiration
    may lead to a premature freeing. This commit fixes this issue by
    simply moving forward the getting of the extra reference already
    introduced by commit ecedd3d7e199 ("block, bfq: get extra ref to
    prevent a queue from being freed during a group move").

    Reported-by: cki-project@redhat.com
    Tested-by: cki-project@redhat.com
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Paolo Valente
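
    A sketch of the reordering inside bfq_bfqq_move() (simplified):

        /* Take the extra reference BEFORE a possible expiration. */
        bfqq->ref++;

        if (bfqq == bfqd->in_service_queue)
                bfq_bfqq_expire(bfqd, bfqq, false, BFQQE_PREEMPTED);

        /* ... deactivation and the actual group move follow ... */

        bfq_put_queue(bfqq);   /* release the extra reference */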
     

06 Mar, 2020

1 commit

  • The bfq_find_set_group() function takes as input a blkcg (which represents
    a cgroup) and retrieves the corresponding bfq_group, then it updates the
    bfq internal group hierarchy (see comments inside the function for why
    this is needed) and finally it returns the bfq_group.
    In the hierarchy update cycle, the pointer holding the correct bfq_group
    that has to be returned is mistakenly used to traverse the hierarchy
    bottom to top, meaning that in each iteration it gets overwritten with the
    parent of the current group. Since the update cycle stops at root's
    children (depth = 2), the overwrite becomes a problem only if the blkcg
    describes a cgroup at a hierarchy level deeper than that (depth > 2). In
    this case the root's child that happens to be also an ancestor of the
    correct bfq_group is returned. The main consequence is that processes
    contained in a cgroup at depth greater than 2 are wrongly placed in the
    group described above by BFQ.

    This commit fixes this problem by using a different bfq_group pointer in
    the update cycle in order to avoid the overwrite of the variable holding
    the original group reference.

    Reported-by: Kwon Je Oh
    Signed-off-by: Carlo Nonato
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Carlo Nonato
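
    A sketch of the fixed update cycle (simplified): a dedicated pointer
    walks the hierarchy, so the group to be returned is never clobbered:

        entity = &bfqg->entity;
        for_each_entity(entity) {
                struct bfq_group *curr_bfqg = container_of(entity,
                                                struct bfq_group, entity);
                if (curr_bfqg != bfqd->root_group) {
                        parent = bfqg_parent(curr_bfqg);
                        if (!parent)
                                parent = bfqd->root_group;
                        bfq_group_set_parent(curr_bfqg, parent);
                }
        }

        return bfqg;   /* still the group resolved from the input blkcg */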
     

03 Feb, 2020

4 commits

  • BFQ schedules generic entities, which may represent either bfq_queues
    or groups of bfq_queues. When an entity is inserted into a service
    tree, a reference must be taken, to make sure that the entity does not
    disappear while still referred in the tree. Unfortunately, such a
    reference is mistakenly taken only if the entity represents a
    bfq_queue. This commit takes a reference also in case the entity
    represents a group.

    Tested-by: Oleksandr Natalenko
    Tested-by: Chris Evich
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Paolo Valente
     
  • ifdefs around gets and puts of bfq groups reduce readability, remove them.

    Tested-by: Oleksandr Natalenko
    Reported-by: Jens Axboe
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Paolo Valente
     
  • The flag on_st in the bfq_entity data structure is true if the entity
    is on a service tree or is in service. Yet the name of the field,
    confusingly, does not mention the second, very important case. Extend
    the name to mention the second case too.

    Tested-by: Oleksandr Natalenko
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Paolo Valente
     
  • In bfq_bfqq_move(), the bfq_queue, say Q, to be moved to a new group
    may happen to be deactivated in the scheduling data structures of the
    source group (and then activated in the destination group). If Q is
    referred only by the data structures in the source group when the
    deactivation happens, then Q is freed upon the deactivation.

    This commit addresses this issue by getting an extra reference before
    the possible deactivation, and releasing this extra reference after Q
    has been moved.

    Tested-by: Chris Evich
    Tested-by: Oleksandr Natalenko
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Paolo Valente
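
    A sketch of the fix (simplified): bracket the deactivation and the
    move with an extra reference:

        void bfq_bfqq_move(struct bfq_data *bfqd, struct bfq_queue *bfqq,
                           struct bfq_group *bfqg)
        {
                struct bfq_entity *entity = &bfqq->entity;

                bfqq->ref++;   /* survive deactivation in the source group */

                if (bfq_bfqq_busy(bfqq))
                        bfq_deactivate_bfqq(bfqd, bfqq, false, false);

                /* ... detach from the old group, attach to bfqg ... */
                entity->parent = bfqg->my_entity;
                entity->sched_data = &bfqg->sched_data;

                if (bfq_bfqq_busy(bfqq))
                        bfq_activate_bfqq(bfqd, bfqq);

                bfq_put_queue(bfqq);   /* release the extra reference */
        }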
     

05 Dec, 2019

1 commit

    bio->bi_blkg will be NULL when the issuing of the request has bypassed
    the block layer, as shown in the following oops:

    Internal error: Oops: 96000005 [#1] SMP
    CPU: 17 PID: 2996 Comm: scsi_id Not tainted 5.4.0 #4
    Call trace:
    percpu_counter_add_batch+0x38/0x4c8
    bfqg_stats_update_legacy_io+0x9c/0x280
    bfq_insert_requests+0xbac/0x2190
    blk_mq_sched_insert_request+0x288/0x670
    blk_execute_rq_nowait+0x140/0x178
    blk_execute_rq+0x8c/0x140
    sg_io+0x604/0x9c0
    scsi_cmd_ioctl+0xe38/0x10a8
    scsi_cmd_blk_ioctl+0xac/0xe8
    sd_ioctl+0xe4/0x238
    blkdev_ioctl+0x590/0x20e0
    block_ioctl+0x60/0x98
    do_vfs_ioctl+0xe0/0x1b58
    ksys_ioctl+0x80/0xd8
    __arm64_sys_ioctl+0x40/0x78
    el0_svc_handler+0xc4/0x270

    so ensure its validity before using it.

    Fixes: fd41e60331b1 ("bfq-iosched: stop using blkg->stat_bytes and ->stat_ios")
    Signed-off-by: Hou Tao
    Signed-off-by: Jens Axboe

    Hou Tao
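
    The fix boils down to a NULL check before dereferencing (sketch,
    simplified from the patch):

        static void bfqg_stats_update_legacy_io(struct request_queue *q,
                                                struct request *rq)
        {
                struct bfq_group *bfqg = blkg_to_bfqg(rq->bio->bi_blkg);

                if (!bfqg)   /* request bypassed the block layer: no blkg */
                        return;

                blkg_rwstat_add(&bfqg->stats.bytes, rq->cmd_flags,
                                blk_rq_bytes(rq));
                blkg_rwstat_add(&bfqg->stats.ios, rq->cmd_flags, 1);
        }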
     

08 Nov, 2019

2 commits

  • When used on cgroup1, bfq uses the blkg->stat_bytes and ->stat_ios
    from blk-cgroup core to populate six stat knobs. blk-cgroup core is
    moving away from blkg_rwstat to improve scalability and won't be able
    to support this usage.

    It isn't like the sharing gains all that much. Let's break it out to
    dedicated rwstat counters which are updated when on cgroup1. This
    makes use of bfqg_*rwstat*() helpers outside of
    CONFIG_BFQ_CGROUP_DEBUG. Move them out.

    v2: Compile fix when !CONFIG_BFQ_CGROUP_DEBUG.

    Signed-off-by: Tejun Heo
    Cc: Paolo Valente
    Signed-off-by: Jens Axboe

    Tejun Heo
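
    A sketch of the resulting stats layout (simplified from the actual
    struct):

        struct bfqg_stats {
                /* dedicated counters, no longer shared with blk-cgroup core */
                struct blkg_rwstat bytes;
                struct blkg_rwstat ios;
        #ifdef CONFIG_BFQ_CGROUP_DEBUG
                /* ... debug-only statistics ... */
        #endif
        };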
     
  • Collect them right under #ifdef CONFIG_BFQ_CGROUP_DEBUG. The next
    patch will use them from !DEBUG path and this makes it easy to move
    them out of the ifdef block.

    This is pure code reorganization.

    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
     

07 Sep, 2019

2 commits

  • This adds to BFQ the missing per-device weight interfaces:
    blkio.bfq.weight_device on legacy and io.bfq.weight on unified. The
    implementation pretty closely resembles what we had in CFQ and the parsing code
    is basically reused.

    Tests
    =====

    Using two cgroups and three block devices, having weights setup as:

    Cgroup     test1      test2
    ============================
    default    100        500
    sda        500        100
    sdb        default    default
    sdc        200        200

    cgroup v1 runs
    --------------

    sda.test1.out: READ: bw=913MiB/s
    sda.test2.out: READ: bw=183MiB/s

    sdb.test1.out: READ: bw=213MiB/s
    sdb.test2.out: READ: bw=1054MiB/s

    sdc.test1.out: READ: bw=650MiB/s
    sdc.test2.out: READ: bw=650MiB/s

    cgroup v2 runs
    --------------

    sda.test1.out: READ: bw=915MiB/s
    sda.test2.out: READ: bw=184MiB/s

    sdb.test1.out: READ: bw=216MiB/s
    sdb.test2.out: READ: bw=1069MiB/s

    sdc.test1.out: READ: bw=621MiB/s
    sdc.test2.out: READ: bw=622MiB/s

    Signed-off-by: Fam Zheng
    Acked-by: Tejun Heo
    Reviewed-by: Paolo Valente

    Signed-off-by: Jens Axboe

    Fam Zheng
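
    A sketch of how such a per-device handler is typically structured
    (the surrounding function is illustrative; the blkg_conf_* helpers
    are the blk-cgroup parsing API the message refers to):

        static ssize_t bfq_io_set_device_weight(struct kernfs_open_file *of,
                                                char *buf, size_t nbytes,
                                                loff_t off)
        {
                struct blkcg *blkcg = css_to_blkcg(of_css(of));
                struct blkg_conf_ctx ctx;
                int ret;

                /* Parses the leading "MAJ:MIN" and resolves that device. */
                ret = blkg_conf_prep(blkcg, &blkcg_policy_bfq, buf, &ctx);
                if (ret)
                        return ret;

                /* ctx.body points past "MAJ:MIN": the weight to apply. */
                /* ... parse ctx.body and set the group's device weight ... */

                blkg_conf_finish(&ctx);
                return nbytes;   /* consumed the whole write */
        }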
     
  • This function will be useful when we update weight from the soon-coming
    per-device interface.

    Signed-off-by: Fam Zheng
    Reviewed-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Fam Zheng
     

10 Jun, 2019

1 commit

  • There's some discussion on how to do this the best, and Tejun prefers
    that BFQ just create the file itself instead of having cgroups support
    a symlink feature.

    Hence revert commits 54b7b868e826 and 19e9da9e86c4 for 5.2, and this
    can be done properly for 5.3.

    Signed-off-by: Jens Axboe

    Jens Axboe
     

07 Jun, 2019

1 commit

  • Many userspace tools and services use the proportional-share policy of
    the blkio/io cgroups controller. The CFQ I/O scheduler implemented
    this policy for the legacy block layer. To modify the weight of a
    group in case CFQ was in charge, the 'weight' parameter of the group
    must be modified. On the other hand, the BFQ I/O scheduler implements
    the same policy in blk-mq, but, with BFQ, the parameter to modify has
    a different name: bfq.weight (a forced choice while legacy block was
    still present, because two different policies cannot share a common
    parameter in cgroups).

    Because of this CFQ legacy, most if not all userspace configurations
    still use the parameter 'weight', and for the moment they do not seem
    likely to change. But, when CFQ went away with legacy block, such a
    parameter ceased to exist.

    So, a simple workaround has been proposed [1] to make all
    configurations work: add a symlink, named weight, to bfq.weight. This
    commit adds such a symlink.

    [1] https://lkml.org/lkml/2019/4/8/555

    Suggested-by: Johannes Thumshirn
    Signed-off-by: Angelo Ruocco
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Angelo Ruocco
     

01 Apr, 2019

1 commit

  • To boost throughput with a set of processes doing interleaved I/O
    (i.e., a set of processes whose individual I/O is random, but whose
    merged cumulative I/O is sequential), BFQ merges the queues associated
    with these processes, i.e., redirects the I/O of these processes into a
    common, shared queue. In the shared queue, I/O requests are ordered by
    their position on the medium, thus sequential I/O gets dispatched to
    the device when the shared queue is served.

    Queue merging costs execution time, because, to detect which queues to
    merge, BFQ must maintain a list of the head I/O requests of active
    queues, ordered by request positions. Measurements showed that this
    costs about 10% of BFQ's total per-request processing time.

    Request processing time becomes more and more critical as the speed of
    the underlying storage device grows. Yet, fortunately, queue merging
    is basically useless on the very devices that are so fast as to make
    request processing time critical. To reach a high throughput, these
    devices must have many requests queued at the same time. But, in this
    configuration, the internal scheduling algorithms of these devices
    also do the job of queue merging: they reorder requests so as to
    obtain an I/O pattern that is as sequential as possible. As a
    consequence, with processes doing interleaved I/O, the throughput
    reached by one such device is likely to be the same, with and without
    queue merging.

    In view of this fact, this commit disables queue merging, and all
    related housekeeping, for non-rotational devices with internal
    queueing. The total, single-lock-protected, per-request processing
    time of BFQ drops to, e.g., 1.9 us on an Intel Core i7-2760QM@2.40GHz
    (time measured with simple code instrumentation, and using the
    throughput-sync.sh script of the S suite [1], in performance-profiling
    mode). To put this result into context, the total,
    single-lock-protected, per-request execution time of the lightest I/O
    scheduler available in blk-mq, mq-deadline, is 0.7 us (mq-deadline is
    ~800 LOC, against ~10500 LOC for BFQ).

    Disabling merging provides a further, remarkable benefit in terms of
    throughput. Merging tends to make many workloads artificially more
    uneven, mainly because of shared queues remaining non-empty for
    incomparably more time than normal queues. So, if, e.g., one of the
    queues in a set of merged queues has a higher weight than a normal
    queue, then the shared queue may inherit such a high weight and, by
    staying almost always active, may force BFQ to perform I/O plugging
    most of the time. This evidently makes it harder for BFQ to let the
    device reach a high throughput.

    As a practical example of this problem, and of the benefits of this
    commit, we measured again the throughput in the nasty scenario
    considered in previous commit messages: dbench test (in the Phoronix
    suite), with 6 clients, on a filesystem with journaling, and with the
    journaling daemon enjoying a higher weight than normal processes. With
    this commit, the throughput grows from ~150 MB/s to ~200 MB/s on a
    PLEXTOR PX-256M5 SSD. This is the same peak throughput reached by any
    of the other I/O schedulers. As such, this is also likely to be the
    maximum possible throughput reachable with this workload on this
    device, because I/O is mostly random, and the other schedulers
    basically just pass I/O requests to the drive as fast as possible.

    [1] https://github.com/Algodev-github/S

    Tested-by: Holger Hoffstätte
    Tested-by: Oleksandr Natalenko
    Tested-by: Francesco Pollicino
    Signed-off-by: Alessio Masola
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Paolo Valente
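
    A sketch of the gating (simplified; the flag is recomputed whenever
    the hw_tag estimate changes):

        /* Non-rotational device with internal queueing? */
        bfqd->nonrot_with_queueing =
                blk_queue_nonrot(bfqd->queue) && bfqd->hw_tag;

        /* Then, at the top of the queue-merging setup path: */
        if (unlikely(bfqd->nonrot_with_queueing))
                return NULL;   /* skip merging and related housekeeping */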
     

08 Dec, 2018

1 commit

  • The bio_blkcg() function turns out to be inconsistent and consequently
    dangerous to use. The first part returns a blkcg where a reference is
    owned by the bio meaning it does not need to be rcu protected. However,
    the third case, the last line, is problematic:

    return css_to_blkcg(task_css(current, io_cgrp_id));

    This can race against task migration and the cgroup dying. It is also
    semantically different as it must be called rcu protected and is
    susceptible to failure when trying to get a reference to it.

    This patch adds association ahead of calling bio_blkcg() rather than
    after. This makes association a required and explicit step along the
    code paths for calling bio_blkcg(). In blk-iolatency, association is
    moved above the bio_blkcg() call to ensure it will not return %NULL.

    BFQ uses the old bio_blkcg() function, but I do not want to address it
    in this series due to the complexity. I have created a private version
    documenting the inconsistency and noting not to use it.

    Signed-off-by: Dennis Zhou
    Acked-by: Tejun Heo
    Reviewed-by: Josef Bacik
    Signed-off-by: Jens Axboe

    Dennis Zhou
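
    An illustration of the ordering requirement (the association helper
    named here is hypothetical; the series introduces its own API):

        /* Unsafe: with no association, bio_blkcg() falls back to
         * css_to_blkcg(task_css(current, io_cgrp_id)) -- racy against
         * task migration and the cgroup dying. */
        blkcg = bio_blkcg(bio);

        /* Required pattern: associate first, then look up. */
        bio_associate_blkg(bio);   /* hypothetical: pins a blkg on the bio */
        blkcg = bio_blkcg(bio);    /* returns the bio-owned association */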
     

16 Nov, 2018

1 commit

  • With the legacy request path gone there is no good reason to keep
    queue_lock as a pointer, we can always use the embedded lock now.

    Reviewed-by: Hannes Reinecke
    Signed-off-by: Christoph Hellwig

    Fixed floppy and blk-cgroup missing conversions and half-done edits.

    Signed-off-by: Jens Axboe

    Christoph Hellwig
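
    The conversion pattern, in short:

        /* before: queue_lock was a pointer (possibly a driver's lock) */
        spin_lock_irq(q->queue_lock);

        /* after: always the lock embedded in struct request_queue */
        spin_lock_irq(&q->queue_lock);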
     

02 Nov, 2018

1 commit

  • This reverts a series committed earlier due to null pointer exception
    bug report in [1]. It seems there are edge case interactions that I did
    not consider and will need some time to understand what causes the
    adverse interactions.

    The original series can be found in [2] with a follow up series in [3].

    [1] https://www.spinics.net/lists/cgroups/msg20719.html
    [2] https://lore.kernel.org/lkml/20180911184137.35897-1-dennisszhou@gmail.com/
    [3] https://lore.kernel.org/lkml/20181020185612.51587-1-dennis@kernel.org/

    This reverts the following commits:
    d459d853c2ed, b2c3fa546705, 101246ec02b5, b3b9f24f5fcc, e2b0989954ae,
    f0fcb3ec89f3, c839e7a03f92, bdc2491708c4, 74b7c02a9bc1, 5bf9a1f3b4ef,
    a7b39b4e961c, 07b05bcc3213, 49f4c2dc2b50, 27e6fa996c53

    Signed-off-by: Dennis Zhou
    Signed-off-by: Jens Axboe

    Dennis Zhou
     

22 Sep, 2018

1 commit

  • The accessor function bio_blkcg either returns the blkcg associated with
    the bio or finds one in the current context. This can cause an issue
    when trying to associate a bio with a blkcg. Particularly, it's the
    third case that is problematic:

    return css_to_blkcg(task_css(current, io_cgrp_id));

    As the above may race against task migration and the cgroup exiting, it
    is not always ok to take a reference on the blkcg returned from
    bio_blkcg.

    This patch adds association ahead of calling bio_blkcg rather than
    after. This makes association a required and explicit step along the
    code paths for calling bio_blkcg. blk_get_rl is modified as well to get
    a reference to the blkcg it may use and blk_put_rl will always put the
    reference back. Association is also moved above the bio_blkcg call to
    ensure it will not return NULL in blk-iolatency.

    BFQ and CFQ utilize this flaw, but due to the complexity, I do not want
    to address this in this series. I've created a private version of the
    function with notes not to use it describing the flaw. Hopefully soon,
    that code can be cleaned up.

    Signed-off-by: Dennis Zhou
    Acked-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Dennis Zhou (Facebook)
     

17 Aug, 2018

1 commit

    The value that a struct cftype .write() method returns is directly
    returned to userspace as the return value of the write() syscall, so
    it should be the number of bytes actually written (or consumed) and
    not zero.

    Returning zero from the write() syscall makes programs like /bin/echo
    or bash spin.

    Signed-off-by: Maciej S. Szmigiero
    Fixes: e21b7a0b9887 ("block, bfq: add full hierarchical scheduling and cgroups support")
    Cc: stable@vger.kernel.org
    Signed-off-by: Jens Axboe

    Maciej S. Szmigiero
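
    A sketch of the corrected .write() handler (simplified):

        static ssize_t bfq_io_set_weight(struct kernfs_open_file *of,
                                         char *buf, size_t nbytes, loff_t off)
        {
                u64 weight;
                int ret = kstrtou64(strim(buf), 0, &weight);

                if (ret)
                        return ret;
                /* ... apply the weight ... */

                /* Bytes consumed, not 0: returning 0 makes echo spin. */
                return nbytes;
        }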
     

09 May, 2018

1 commit

  • cfq and bfq have some internal fields that use sched_clock() which can
    trivially use ktime_get_ns() instead. Their timestamp fields in struct
    request can also use ktime_get_ns(), which resolves the 8-year-old
    comment added by commit 28f4197e5d47 ("block: disable preemption before
    using sched_clock()").

    Signed-off-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Omar Sandoval
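
    The substitution pattern:

        /* before: required preemption disabled (see 28f4197e5d47) */
        u64 now_ns = sched_clock();

        /* after: monotonic nanoseconds, no preemption dance needed */
        u64 now_ns = ktime_get_ns();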
     

09 Jan, 2018

1 commit

  • For each pair [device for which bfq is selected as I/O scheduler,
    group in blkio/io], bfq maintains a corresponding bfq group. Each such
    bfq group contains a set of async queues, with each async queue
    created on demand, i.e., when some I/O request arrives for it. On
    creation, an async queue gets an extra reference, to make sure that
    the queue is not freed as long as its bfq group exists. Accordingly,
    to allow the queue to be freed after the group has exited, this extra
    reference must be released on group exit.

    The above holds also for a bfq root group, i.e., for the bfq group
    corresponding to the blkio/io root group for a given device. Yet, by
    mistake, the references to the existing async queues of a root group
    are not released when the latter exits. This causes a memory leak when
    the instance of bfq for a given device exits. In a similar vein,
    bfqg_stats_xfer_dead is not executed for a root group.

    This commit fixes bfq_pd_offline so that the latter executes the above
    missing operations for a root group too.

    Reported-by: Holger Hoffstätte
    Reported-by: Guoqing Jiang
    Tested-by: Holger Hoffstätte
    Signed-off-by: Davide Ferrari
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Paolo Valente
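
    A sketch of the fixed bfq_pd_offline() tail (simplified; the label
    name is illustrative):

        static void bfq_pd_offline(struct blkg_policy_data *pd)
        {
                /* ... a root group has no entities to reparent ... */
                if (!entity)
                        goto put_async_queues;   /* was: plain return */

                /* ... non-root teardown ... */

        put_async_queues:
                /* Release the extra refs on this group's async queues. */
                bfq_put_async_queues(bfqd, bfqg);
                /* Transfer the stats; now done for root groups too. */
                bfqg_stats_xfer_dead(bfqg);
        }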
     

15 Nov, 2017

1 commit

  • BFQ currently creates, and updates, its own instance of the whole
    set of blkio statistics that cfq creates. Yet, from the comments
    of Tejun Heo in [1], it turned out that most of these statistics
    are meant/useful only for debugging. This commit makes BFQ create
    the latter, debugging statistics only if the option
    CONFIG_DEBUG_BLK_CGROUP is set.

    By doing so, this commit also enables BFQ to enjoy a high performance
    boost. The reason is that, if CONFIG_DEBUG_BLK_CGROUP is not set, then
    BFQ has to update far fewer statistics, and, in particular, not the
    heaviest to update. To give an idea of the benefits, if
    CONFIG_DEBUG_BLK_CGROUP is not set, then, on an Intel i7-4850HQ, and
    with 8 threads doing random I/O in parallel on null_blk (configured
    with 0 latency), the throughput of BFQ grows from 310 to 400 KIOPS
    (+30%). We have measured similar or even much higher boosts with other
    CPUs: e.g., +45% with an ARM CortexTM-A53 Octa-core. Our results have
    been obtained and can be reproduced very easily with the script in [1].

    [1] https://www.spinics.net/lists/linux-block/msg18943.html

    Suggested-by: Tejun Heo
    Suggested-by: Ulf Hansson
    Tested-by: Lee Tibbert
    Tested-by: Oleksandr Natalenko
    Signed-off-by: Luca Miccio
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Luca Miccio
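
    The compile-time split, sketched:

        #ifdef CONFIG_DEBUG_BLK_CGROUP
        void bfqg_stats_update_io_add(struct bfq_group *bfqg,
                                      struct bfq_queue *bfqq, unsigned int op)
        {
                blkg_rwstat_add(&bfqg->stats.queued, op, 1);
                /* ... further debug-only accounting ... */
        }
        #else
        static inline void bfqg_stats_update_io_add(struct bfq_group *bfqg,
                                                    struct bfq_queue *bfqq,
                                                    unsigned int op) { }
        #endif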
     

08 Jun, 2017

1 commit

  • In blk-cgroup, operations on blkg objects are protected with the
    request_queue lock. This is no longer the lock that protects
    I/O-scheduler operations in blk-mq. In fact, the latter are now
    protected with a finer-grained per-scheduler-instance lock. As a
    consequence, although blkg lookups are also rcu-protected, blk-mq I/O
    schedulers may see inconsistent data when they access blkg and
    blkg-related objects. BFQ does access these objects, and does incur
    this problem, in the following case.

    The blkg_lookup performed in bfq_get_queue, being protected (only)
    through rcu, may happen to return the address of a copy of the
    original blkg. If this is the case, then the blkg_get performed in
    bfq_get_queue, to pin down the blkg, is useless: it does not prevent
    blk-cgroup code from destroying both the original blkg and all objects
    directly or indirectly referred by the copy of the blkg. BFQ accesses
    these objects, which typically causes a crash due to a NULL-pointer
    dereference or a memory-protection violation.

    Some additional protection mechanism should be added to blk-cgroup to
    address this issue. In the meantime, this commit provides a quick
    temporary fix for BFQ: cache (when safe) blkg data that might
    disappear right after a blkg_lookup.

    In particular, this commit exploits the following facts to achieve its
    goal without introducing further locks. Destroy operations on a blkg
    invoke, as a first step, hooks of the scheduler associated with the
    blkg. And these hooks are executed with bfqd->lock held for BFQ. As a
    consequence, for any blkg associated with the request queue an
    instance of BFQ is attached to, we are guaranteed that such a blkg is
    not destroyed, and that all the pointers it contains are consistent,
    while that instance is holding its bfqd->lock. A blkg_lookup performed
    with bfqd->lock held then returns a fully consistent blkg, which
    remains consistent as long as this lock is held. In more detail, this holds
    even if the returned blkg is a copy of the original one.

    Finally, also the object describing a group inside BFQ needs to be
    protected from destruction on the blkg_free of the original blkg
    (which invokes bfq_pd_free). This commit adds private refcounting for
    this object, to let it disappear only after no bfq_queue refers to it
    any longer.

    This commit also removes or updates some stale comments on locking
    issues related to blk-cgroup operations.

    Reported-by: Tomas Konir
    Reported-by: Lee Tibbert
    Reported-by: Marco Piazza
    Signed-off-by: Paolo Valente
    Tested-by: Tomas Konir
    Tested-by: Lee Tibbert
    Tested-by: Marco Piazza
    Signed-off-by: Jens Axboe

    Paolo Valente
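
    The private refcounting, sketched (all updates happen under
    bfqd->lock):

        static void bfqg_get(struct bfq_group *bfqg)
        {
                bfqg->ref++;
        }

        static void bfqg_put(struct bfq_group *bfqg)
        {
                bfqg->ref--;
                if (!bfqg->ref)
                        kfree(bfqg);  /* no bfq_queue refers to bfqg anymore */
        }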
     

19 Apr, 2017

1 commit

  • The BFQ I/O scheduler features an optimal fair-queuing
    (proportional-share) scheduling algorithm, enriched with several
    mechanisms to boost throughput and reduce latency for interactive and
    real-time applications. This makes BFQ a large and complex piece of
    code. This commit addresses this issue by splitting BFQ into three
    main, independent components, and by moving each component into a
    separate source file:
    1. Main algorithm: handles the interaction with the kernel, and
    decides which requests to dispatch; it uses the following two further
    components to achieve its goals.
    2. Scheduling engine (Hierarchical B-WF2Q+ scheduling algorithm):
    computes the schedule, using weights and budgets provided by the above
    component.
    3. cgroups support: handles group operations (creation, destruction,
    move, ...).

    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Paolo Valente