08 Aug, 2018

2 commits

  • This patch does not change any functionality but prevents gcc from
    reporting the following warnings when building with W=1:

    block/cfq-iosched.c: In function 'cfq_back_seek_max_store':
    block/cfq-iosched.c:4741:13: warning: comparison of unsigned expression < 0 is always false [-Wtype-limits]
    if (__data < (MIN)) \
    ^
    block/cfq-iosched.c:4756:1: note: in expansion of macro 'STORE_FUNCTION'
    STORE_FUNCTION(cfq_back_seek_max_store, &cfqd->cfq_back_max, 0, UINT_MAX, 0);
    ^~~~~~~~~~~~~~
    block/cfq-iosched.c: In function 'cfq_slice_idle_store':
    block/cfq-iosched.c:4741:13: warning: comparison of unsigned expression < 0 is always false [-Wtype-limits]
    if (__data < (MIN)) \
    ^
    block/cfq-iosched.c:4759:1: note: in expansion of macro 'STORE_FUNCTION'
    STORE_FUNCTION(cfq_slice_idle_store, &cfqd->cfq_slice_idle, 0, UINT_MAX, 1);
    ^~~~~~~~~~~~~~
    block/cfq-iosched.c: In function 'cfq_group_idle_store':
    block/cfq-iosched.c:4741:13: warning: comparison of unsigned expression < 0 is always false [-Wtype-limits]
    if (__data < (MIN)) \
    ^
    block/cfq-iosched.c:4760:1: note: in expansion of macro 'STORE_FUNCTION'
    STORE_FUNCTION(cfq_group_idle_store, &cfqd->cfq_group_idle, 0, UINT_MAX, 1);
    ^~~~~~~~~~~~~~
    block/cfq-iosched.c: In function 'cfq_low_latency_store':
    block/cfq-iosched.c:4741:13: warning: comparison of unsigned expression < 0 is always false [-Wtype-limits]
    if (__data < (MIN)) \
    ^
    block/cfq-iosched.c:4765:1: note: in expansion of macro 'STORE_FUNCTION'
    STORE_FUNCTION(cfq_low_latency_store, &cfqd->cfq_latency, 0, 1, 0);
    ^~~~~~~~~~~~~~
    block/cfq-iosched.c: In function 'cfq_slice_idle_us_store':
    block/cfq-iosched.c:4775:13: warning: comparison of unsigned expression < 0 is always false [-Wtype-limits]
    if (__data < (MIN)) \
    ^
    block/cfq-iosched.c:4782:1: note: in expansion of macro 'USEC_STORE_FUNCTION'
    USEC_STORE_FUNCTION(cfq_slice_idle_us_store, &cfqd->cfq_slice_idle, 0, UINT_MAX);
    ^~~~~~~~~~~~~~~~~~~
    block/cfq-iosched.c: In function 'cfq_group_idle_us_store':
    block/cfq-iosched.c:4775:13: warning: comparison of unsigned expression < 0 is always false [-Wtype-limits]
    if (__data < (MIN)) \
    ^
    block/cfq-iosched.c:4783:1: note: in expansion of macro 'USEC_STORE_FUNCTION'
    USEC_STORE_FUNCTION(cfq_group_idle_us_store, &cfqd->cfq_group_idle, 0, UINT_MAX);
    ^~~~~~~~~~~~~~~~~~~
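
    For context, a minimal macro in the same shape as STORE_FUNCTION (names
    here are illustrative, not the kernel code) shows why gcc complains:

    /* __data is unsigned, so when MIN is 0 the first comparison can never
     * be true; that always-false test is exactly what -Wtype-limits flags
     * under W=1. */
    #define CLAMP_STORE(__data, MIN, MAX)        \
    do {                                         \
        if ((__data) < (MIN))                    \
            (__data) = (MIN);                    \
        else if ((__data) > (MAX))               \
            (__data) = (MAX);                    \
    } while (0)

    Silencing the warning without changing behaviour generally means
    restructuring the bound check so the compiler no longer sees an
    always-false comparison; the exact approach taken by the patch is not
    reproduced here.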

    Signed-off-by: Bart Van Assche
    Signed-off-by: Jens Axboe

    Bart Van Assche
     
  • This patch prevents gcc from complaining about fall-through when building
    with W=1.
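
    For reference, a sketch of the usual annotation that keeps gcc's
    -Wimplicit-fallthrough warning (enabled at W=1) quiet; the helpers are
    hypothetical, not the patched code:

    void prepare(void);  /* hypothetical helpers */
    void run(void);

    static void handle_stage(int stage)
    {
        switch (stage) {
        case 0:
            prepare();
            /* fall through */
        case 1:
            run();
            break;
        default:
            break;
        }
    }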

    Signed-off-by: Bart Van Assche
    Signed-off-by: Jens Axboe

    Bart Van Assche
     

25 May, 2018

1 commit

  • Convert the S_ symbolic permissions to their octal equivalents, as
    many find octal permissions more readable than the symbolic ones.

    see: https://lkml.org/lkml/2016/8/2/1945

    Done with automated conversion via:
    $ ./scripts/checkpatch.pl -f --types=SYMBOLIC_PERMS --fix-inplace
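
    As an illustration, the conversion typically looks like the following
    (the module parameter here is made up, not taken from the patch):

    -module_param(debug_level, int, S_IRUGO | S_IWUSR);  /* 0444 | 0200 */
    +module_param(debug_level, int, 0644);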

    Miscellanea:

    o Wrapped modified multi-line calls to a single line where appropriate
    o Realigned modified multi-line calls to the open parenthesis

    Signed-off-by: Joe Perches
    Signed-off-by: Jens Axboe

    Joe Perches
     

09 May, 2018

2 commits

  • Currently, struct request has four timestamp fields:

    - A start time, set at get_request time, in jiffies, used for iostats
    - An I/O start time, set at start_request time, in ktime nanoseconds,
    used for blk-stats (i.e., wbt, kyber, hybrid polling)
    - Another start time and another I/O start time, used for cfq and bfq

    These can all be consolidated into one start time and one I/O start
    time, both in ktime nanoseconds, shaving off up to 16 bytes from struct
    request depending on the kernel config.
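
    A sketch of the consolidated fields (names follow this series as I
    recall it and are not verified against the tree; the rest of the struct
    is omitted):

    struct request {
        /* ... */
        u64 start_time_ns;     /* set when the request is allocated */
        u64 io_start_time_ns;  /* set when the request is issued to the device */
        /* ... */
    };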

    Signed-off-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Omar Sandoval
     
  • cfq and bfq have some internal fields that use sched_clock() which can
    trivially use ktime_get_ns() instead. Their timestamp fields in struct
    request can also use ktime_get_ns(), which resolves the 8 year old
    comment added by commit 28f4197e5d47 ("block: disable preemption before
    using sched_clock()").

    Signed-off-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Omar Sandoval
     

09 Sep, 2017

2 commits

  • For the same reasons we already cache the leftmost pointer, apply the same
    optimization for rb_last() calls. Users must explicitly do this as
    rb_root_cached only deals with the smallest node.
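
    A sketch of what doing this explicitly can look like on the caller's
    side (names are illustrative, not the cfq code):

    #include <linux/rbtree.h>

    struct cached_rbtree {
        struct rb_root_cached root;   /* caches the leftmost node for us */
        struct rb_node *rightmost;    /* maintained by hand on insert/erase */
    };

    static inline struct rb_node *cached_rb_last(struct cached_rbtree *t)
    {
        return t->rightmost;          /* O(1) instead of walking rb_last() */
    }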

    [dave@stgolabs.net: brain fart #1]
    Link: http://lkml.kernel.org/r/20170731155955.GD21328@linux-80c1.suse
    Link: http://lkml.kernel.org/r/20170719014603.19029-18-dave@stgolabs.net
    Signed-off-by: Davidlohr Bueso
    Cc: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
  • ... with the generic rbtree flavor instead. No changes
    in semantics whatsoever.

    Link: http://lkml.kernel.org/r/20170719014603.19029-11-dave@stgolabs.net
    Signed-off-by: Davidlohr Bueso
    Reviewed-by: Jan Kara
    Acked-by: Peter Zijlstra (Intel)
    Cc: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     

29 Aug, 2017

1 commit


11 Aug, 2017

1 commit

  • In the scenario below, blkio cgroups do not honor their assigned
    weights:
    1. When the underlying device is nonrotational with a single HW queue
    with depth of >= CFQ_HW_QUEUE_MIN
    2. When the use case is forming two blkio cgroups cg1(weight 1000) &
    cg2(weight 100) and two processes (file1 and file2) doing sync IO in
    their respective blkio cgroups.

    fio results for the above use case (without this patch):
    file1: (groupid=0, jobs=1): err= 0: pid=685: Thu Jan 1 19:41:49 1970
    write: IOPS=1315, BW=41.1MiB/s (43.1MB/s)(1024MiB/24906msec)

    file2: (groupid=0, jobs=1): err= 0: pid=686: Thu Jan 1 19:41:49 1970
    write: IOPS=1295, BW=40.5MiB/s (42.5MB/s)(1024MiB/25293msec)

    // both processes get equal BW even though they belong to different
    cgroups with weights of 1000 (cg1) and 100 (cg2)

    In the above case (for non-rotational NCQ devices), as soon as the
    request from cg1 completes, the CFQ algorithm expires that group when
    the driver tries to fetch the next request, even though cg1 was given
    the larger slice (set_slice=10), without granting it any idle time or
    weight priority, and it schedules another cfq group (in this case cg2).
    Both cfq groups (cg1 & cg2) thus keep alternating to get the disk time,
    and the cgroup weight based scheduling is lost.

    The patch below gives the cfq algorithm (cfq_arm_slice_timer) a chance
    to arm the slice timer when group_idle is enabled. If group_idle is not
    wanted either (including for non-rotational NCQ drives), it needs to be
    explicitly set to 0 from sysfs (see the sketch after the fio results
    below).

    fio results with this patch (for the above use case):
    file1: (groupid=0, jobs=1): err= 0: pid=690: Thu Jan 1 00:06:08 1970
    write: IOPS=1706, BW=53.3MiB/s (55.9MB/s)(1024MiB/19197msec)

    file2: (groupid=0, jobs=1): err= 0: pid=691: Thu Jan 1 00:06:08 1970
    write: IOPS=1043, BW=32.6MiB/s (34.2MB/s)(1024MiB/31401msec)

    // Here each process's BW matches its respective cgroup weight.
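
    A sketch of the decision described above (not the literal patch; the
    helper and its parameters are made up, while cfq_slice_idle and
    cfq_group_idle are the tunables named elsewhere in this log):

    static bool want_idle_timer(unsigned int slice_idle,
                                unsigned int group_idle, bool nonrot_ncq)
    {
        if (group_idle)
            return true;     /* idle on the group's behalf so weights apply */
        if (nonrot_ncq)
            return false;    /* old behaviour: never idle on fast NCQ devices */
        return slice_idle != 0;
    }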

    Signed-off-by: Ritesh Harjani
    Signed-off-by: Jens Axboe

    Ritesh Harjani
     

29 Jul, 2017

1 commit

  • Currently cfq/bfq/blk-throttle output cgroup info in trace in their own
    way. Now we have standard blktrace API for this, so convert them to use
    it.

    Note, this changes the behavior a little bit. cgroup info isn't output
    by default; we only do this with the 'blk_cgroup' option enabled. cgroup
    info isn't output as a string by default either; that requires the
    'blk_cgname' option. Also, cgroup info is output at a different
    position in the note string. I think these behavior changes aren't a big
    issue (actually we make trace data shorter which is good), since the
    blktrace note is solely for debugging.

    Signed-off-by: Shaohua Li
    Signed-off-by: Jens Axboe

    Shaohua Li
     

12 Jun, 2017

1 commit

  • We've already got a few conflicts and upcoming work depends on some of the
    changes that have gone into mainline as regression fixes for this series.

    Pull in 4.12-rc5 to resolve these conflicts and make it easier on downstream
    trees to continue working on 4.13 changes.

    Signed-off-by: Jens Axboe

    Jens Axboe
     

31 May, 2017

1 commit

  • When adding a cfq_group into the cfq service tree, we use CFQ_IDLE_DELAY
    as the delay of cfq_group's vdisktime if there have been other cfq_groups
    already.

    When cfq is under iops mode, commit 9a7f38c42c2b ("cfq-iosched: Convert
    from jiffies to nanoseconds") could result in a large iops delay and
    lead to an abnormal io schedule delay for the added cfq_group. To fix
    it, we just need to revert to the old CFQ_IDLE_DELAY value: HZ / 5
    when iops mode is enabled.

    Despite having the same value, the delay of a cfq_queue in idle class
    and the delay of cfq_group are different things, so I define two new
    macros for the delay of a cfq_group under time-slice mode and iops mode.
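
    A reconstruction of the two delays described above (macro names and
    values are inferred from the text, not verified against the tree):

    /* vdisktime is in nanoseconds under time-slice mode but is a plain
     * count under iops mode, so the two modes need different constants. */
    #define CFQ_SLICE_MODE_GROUP_DELAY  (NSEC_PER_SEC / 5)  /* time-based */
    #define CFQ_IOPS_MODE_GROUP_DELAY   (HZ / 5)            /* the old value */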

    Fixes: 9a7f38c42c2b ("cfq-iosched: Convert from jiffies to nanoseconds")
    Cc: # 4.8+
    Signed-off-by: Hou Tao
    Acked-by: Jan Kara
    Signed-off-by: Jens Axboe

    Hou Tao
     

30 May, 2017

1 commit


05 Apr, 2017

1 commit

  • Writeback throttling does not play well with CFQ, since CFQ also tries
    to throttle async writes. As a result async writeback can get starved in
    the presence of readers. As an example, take a benchmark simulating a
    PostgreSQL database running over a standard rotating SATA drive. There
    are 16 processes doing random reads from a huge file (2*machine memory),
    1 process doing random writes to the huge file and calling fsync once
    per 50000 writes and 1 process doing sequential 8k writes to a
    relatively small file wrapping around at the end of the file and calling
    fsync every 5 writes. Under this load read latency easily exceeds the
    target latency of 75 ms (just because there are so many reads happening
    against a relatively slow disk) and thus writeback is throttled to a
    point where only 1 write request is allowed at a time. Blktrace data
    then looks like:

    8,0 1 0 8.347751764 0 m N cfq workload slice:40000000
    8,0 1 0 8.347755256 0 m N cfq293A / set_active wl_class: 0 wl_type:0
    8,0 1 0 8.347784100 0 m N cfq293A / Not idling. st->count:1
    8,0 1 3814 8.347763916 5839 UT N [kworker/u9:2] 1
    8,0 0 0 8.347777605 0 m N cfq293A / Not idling. st->count:1
    8,0 1 0 8.347784100 0 m N cfq293A / Not idling. st->count:1
    8,0 3 1596 8.354364057 0 C R 156109528 + 8 (6906954) [0]
    8,0 3 0 8.354383193 0 m N cfq6196SN / complete rqnoidle 0
    8,0 3 0 8.354386476 0 m N cfq schedule dispatch
    8,0 3 0 8.354399397 0 m N cfq293A / Not idling. st->count:1
    8,0 3 0 8.354404705 0 m N cfq293A / dispatch_insert
    8,0 3 0 8.354409454 0 m N cfq293A / dispatched a request
    8,0 3 0 8.354412527 0 m N cfq293A / activate rq, drv=1
    8,0 3 1597 8.354414692 0 D W 145961400 + 24 (6718452) [swapper/0]
    8,0 3 0 8.354484184 0 m N cfq293A / Not idling. st->count:1
    8,0 3 0 8.354487536 0 m N cfq293A / slice expired t=0
    8,0 3 0 8.354498013 0 m N / served: vt=5888102466265088 min_vt=5888074869387264
    8,0 3 0 8.354502692 0 m N cfq293A / sl_used=6737519 disp=1 charge=6737519 iops=0 sect=24
    8,0 3 0 8.354505695 0 m N cfq293A / del_from_rr
    ...
    8,0 0 1810 8.354728768 0 C W 145961400 + 24 (314076) [0]
    8,0 0 0 8.354746927 0 m N cfq293A / complete rqnoidle 0
    ...
    8,0 1 3829 8.389886102 5839 G W 145962968 + 24 [kworker/u9:2]
    8,0 1 3830 8.389888127 5839 P N [kworker/u9:2]
    8,0 1 3831 8.389908102 5839 A W 145978336 + 24 count:1
    8,0 0 0 9.455395969 0 m N cfq293A / slice expired t=0
    8,0 0 0 9.455404210 0 m N / served: vt=5888958194597888 min_vt=5888941810597888
    8,0 0 0 9.455410077 0 m N cfq293A / sl_used=4000000 disp=1 charge=4000000 iops=0 sect=24
    8,0 0 0 9.455416851 0 m N cfq293A / del_from_rr
    ...
    8,0 0 2045 9.455648515 0 C W 145962968 + 24 (332305) [0]
    8,0 0 0 9.455668350 0 m N cfq293A / complete rqnoidle 0
    ...
    8,0 1 4371 9.455710115 5839 G W 145978336 + 24 [kworker/u9:2]
    8,0 1 4372 9.455712350 5839 P N [kworker/u9:2]
    8,0 1 4373 9.455730159 5839 A W 145986616 + 24
    Signed-off-by: Jens Axboe

    Jan Kara
     

02 Mar, 2017

1 commit


22 Feb, 2017

1 commit

  • Pull block layer updates from Jens Axboe:

    - blk-mq scheduling framework from me and Omar, with a port of the
    deadline scheduler for this framework. A port of BFQ from Paolo is in
    the works, and should be ready for 4.12.

    - Various fixups and improvements to the above scheduling framework
    from Omar, Paolo, Bart, me, others.

    - Cleanup of the exported sysfs blk-mq data into debugfs, from Omar.
    This allows us to export more information that helps debug hangs or
    performance issues, without cluttering or abusing the sysfs API.

    - Fixes for the sbitmap code, the scalable bitmap code that was
    migrated from blk-mq, from Omar.

    - Removal of the BLOCK_PC support in struct request, and refactoring of
    carrying SCSI payloads in the block layer. This cleans up the code
    nicely, and enables us to kill the SCSI specific parts of struct
    request, shrinking it down nicely. From Christoph mainly, with help
    from Hannes.

    - Support for ranged discard requests and discard merging, also from
    Christoph.

    - Support for OPAL in the block layer, and for NVMe as well. Mainly
    from Scott Bauer, with fixes/updates from various other folks.

    - Error code fixup for gdrom from Christophe.

    - cciss pci irq allocation cleanup from Christoph.

    - Making the cdrom device operations read only, from Kees Cook.

    - Fixes for duplicate bdi registrations and bdi/queue life time
    problems from Jan and Dan.

    - Set of fixes and updates for lightnvm, from Matias and Javier.

    - A few fixes for nbd from Josef, using idr to name devices and a
    workqueue deadlock fix on receive. Also marks Josef as the current
    maintainer of nbd.

    - Fix from Josef, overwriting queue settings when the number of
    hardware queues is updated for a blk-mq device.

    - NVMe fix from Keith, ensuring that we don't repeatedly mark an IO
    aborted if we didn't end up aborting it.

    - SG gap merging fix from Ming Lei for block.

    - Loop fix also from Ming, fixing a race and crash between setting loop
    status and IO.

    - Two block race fixes from Tahsin, fixing request list iteration and
    fixing a race between device registration and udev device add
    notifications.

    - Double free fix from cgroup writeback, from Tejun.

    - Another double free fix in blkcg, from Hou Tao.

    - Partition overflow fix for EFI from Alden Tondettar.

    * tag 'for-4.11/linus-merge-signed' of git://git.kernel.dk/linux-block: (156 commits)
    nvme: Check for Security send/recv support before issuing commands.
    block/sed-opal: allocate struct opal_dev dynamically
    block/sed-opal: tone down not supported warnings
    block: don't defer flushes on blk-mq + scheduling
    blk-mq-sched: ask scheduler for work, if we failed dispatching leftovers
    blk-mq: don't special case flush inserts for blk-mq-sched
    blk-mq-sched: don't add flushes to the head of requeue queue
    blk-mq: have blk_mq_dispatch_rq_list() return if we queued IO or not
    block: do not allow updates through sysfs until registration completes
    lightnvm: set default lun range when no luns are specified
    lightnvm: fix off-by-one error on target initialization
    Maintainers: Modify SED list from nvme to block
    Move stack parameters for sed_ioctl to prevent oversized stack with CONFIG_KASAN
    uapi: sed-opal fix IOW for activate lsp to use correct struct
    cdrom: Make device operations read-only
    elevator: fix loading wrong elevator type for blk-mq devices
    cciss: switch to pci_irq_alloc_vectors
    block/loop: fix race between I/O and set_status
    blk-mq-sched: don't hold queue_lock when calling exit_icq
    block: set make_request_fn manually in blk_mq_update_nr_hw_queues
    ...

    Linus Torvalds
     

16 Feb, 2017

1 commit


09 Feb, 2017

1 commit


23 Jan, 2017

2 commits

  • The script "checkpatch.pl" pointed out information like the following.

    ERROR: do not use assignment in if condition

    Thus fix the affected source code place.
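
    For readers unfamiliar with this checkpatch rule, the complaint is about
    code of the following shape (function names are made up):

    int prepare(void);

    static int before(void)
    {
        int ret;

        if ((ret = prepare()) != 0)   /* assignment inside the condition */
            return ret;
        return 0;
    }

    static int after(void)
    {
        int ret;

        ret = prepare();              /* assignment pulled out of the if */
        if (ret != 0)
            return ret;
        return 0;
    }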

    Signed-off-by: Markus Elfring
    Signed-off-by: Jens Axboe

    Markus Elfring
     
  • KMSAN (KernelMemorySanitizer, a new error detection tool) reports use of
    uninitialized memory in cfq_init_cfqq():

    ==================================================================
    BUG: KMSAN: use of unitialized memory
    ...
    Call Trace:
    [< inline >] __dump_stack lib/dump_stack.c:15
    [] dump_stack+0x157/0x1d0 lib/dump_stack.c:51
    [] kmsan_report+0x205/0x360 ??:?
    [] __msan_warning+0x5b/0xb0 ??:?
    [< inline >] cfq_init_cfqq block/cfq-iosched.c:3754
    [] cfq_get_queue+0xc80/0x14d0 block/cfq-iosched.c:3857
    ...
    origin:
    [] save_stack_trace+0x27/0x50 arch/x86/kernel/stacktrace.c:67
    [] kmsan_internal_poison_shadow+0xab/0x150 ??:?
    [] kmsan_poison_slab+0xbb/0x120 ??:?
    [< inline >] allocate_slab mm/slub.c:1627
    [] new_slab+0x3af/0x4b0 mm/slub.c:1641
    [< inline >] new_slab_objects mm/slub.c:2407
    [] ___slab_alloc+0x323/0x4a0 mm/slub.c:2564
    [< inline >] __slab_alloc mm/slub.c:2606
    [< inline >] slab_alloc_node mm/slub.c:2669
    [] kmem_cache_alloc_node+0x1d2/0x1f0 mm/slub.c:2746
    [] cfq_get_queue+0x47d/0x14d0 block/cfq-iosched.c:3850
    ...
    ==================================================================
    (the line numbers are relative to 4.8-rc6, but the bug persists
    upstream)

    The uninitialized struct cfq_queue is created by kmem_cache_alloc_node()
    and then passed to cfq_init_cfqq(), which accesses cfqq->ioprio_class
    before it's initialized.

    Signed-off-by: Alexander Potapenko
    Signed-off-by: Jens Axboe

    Alexander Potapenko
     

18 Jan, 2017

1 commit


29 Nov, 2016

1 commit


22 Nov, 2016

1 commit

  • blkcg allocates some per-cgroup data structures with GFP_NOWAIT and
    when that fails falls back to operations which aren't specific to the
    cgroup. Occasional failures are expected under pressure and falling
    back to non-cgroup operation is the right thing to do.

    Unfortunately, I forgot to add __GFP_NOWARN to these allocations and
    these expected failures end up creating a lot of noise. Add
    __GFP_NOWARN.
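
    The shape of the fix (the struct here is a stand-in; only the gfp flags
    matter): allocations that are expected to fail under pressure pass
    __GFP_NOWARN so the failure stays silent and we quietly fall back.

    #include <linux/slab.h>

    struct per_cgroup_data { int weight; };   /* stand-in for the real structures */

    static struct per_cgroup_data *alloc_cgroup_data(int node)
    {
        return kzalloc_node(sizeof(struct per_cgroup_data),
                            GFP_NOWAIT | __GFP_NOWARN, node);
    }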

    Signed-off-by: Tejun Heo
    Reported-by: Marc MERLIN
    Reported-by: Vlastimil Babka
    Signed-off-by: Jens Axboe

    Tejun Heo
     

11 Nov, 2016

2 commits

  • Enable throttling of buffered writeback to make it a lot
    smoother and have far less impact on other system activity.
    Background writeback should be, by definition, background
    activity. The fact that we flush huge bundles of it at a time
    means that it potentially has a heavy impact on foreground workloads,
    which isn't ideal. We can't easily limit the sizes of writes that
    we do, since that would impact file system layout in the presence
    of delayed allocation. So just throttle back buffered writeback,
    unless someone is waiting for it.

    The algorithm for when to throttle takes its inspiration from the
    CoDel network scheduling algorithm. Like CoDel, blk-wb monitors
    the minimum latencies of requests over a window of time. In that
    window of time, if the minimum latency of any request exceeds a
    given target, then a scale count is incremented and the queue depth
    is shrunk. The next monitoring window is shrunk accordingly. Unlike
    CoDel, if we hit a window that exhibits good behavior, then we
    simply increment the scale count and re-calculate the limits for that
    scale value. This prevents us from oscillating between a
    close-to-ideal value and max all the time, instead remaining in the
    windows where we get good behavior.

    Unlike CoDel, blk-wb allows the scale count to go negative. This
    happens if we primarily have writes going on. Unlike positive
    scale counts, this doesn't change the size of the monitoring window.
    When the heavy writers finish, blk-wb quickly snaps back to its
    stable state of a zero scale count.

    The patch registers a sysfs entry, 'wb_lat_usec'. This sets the latency
    target to be met. It defaults to 2 msec for non-rotational storage, and
    75 msec for rotational storage. Setting this value to '0' disables
    blk-wb. Generally, a user would not have to touch this setting.

    We don't enable WBT on devices that are managed with CFQ, and have
    a non-root block cgroup attached. If we have a proportional share setup
    on this particular disk, then the wbt throttling will interfere with
    that. We don't have a strong need for wbt for that case, since we will
    rely on CFQ doing that for us.

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • cfq_cpd_alloc() which is the cpd_alloc_fn implementation for cfq was
    incorrectly hard coding GFP_KERNEL instead of using the mask specified
    through the @gfp parameter. This currently doesn't cause any actual
    issues because all current callers specify GFP_KERNEL. Fix it.
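
    A sketch of the one-line nature of the fix (surrounding kernel context
    trimmed, details approximate):

    static struct blkcg_policy_data *cfq_cpd_alloc(gfp_t gfp)
    {
        struct cfq_group_data *cgd;

        cgd = kzalloc(sizeof(*cgd), gfp);   /* previously hard coded GFP_KERNEL */
        return cgd ? &cgd->cpd : NULL;
    }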

    Signed-off-by: Tejun Heo
    Reported-by: Dan Carpenter
    Fixes: e4a9bde9589f ("blkcg: replace blkcg_policy->cpd_size with ->cpd_alloc/free_fn() methods")
    Signed-off-by: Jens Axboe

    Tejun Heo
     

01 Nov, 2016

2 commits


28 Oct, 2016

1 commit

  • Now that we don't need the common flags to overflow outside the range
    of a 32-bit type we can encode them the same way for both the bio and
    request fields. This in addition allows us to place the operation
    first (and make some room for more ops while we're at it) and to
    stop having to shift around the operation values.

    In addition this allows passing around only one value in the block layer
    instead of two (and eventually also in the file systems, but we can do
    that later), and thus cleans up a lot of code.

    Last but not least this allows decreasing the size of the cmd_flags
    field in struct request to 32-bits. Various functions passing this
    value could also be updated, but I'd like to avoid the churn for now.
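
    A sketch of the resulting encoding (the bit width matches the REQ_OP_*
    layout as I recall it; the helpers are made up):

    #define REQ_OP_BITS  8
    #define REQ_OP_MASK  ((1u << REQ_OP_BITS) - 1)

    static inline unsigned int op_of(u32 cmd_flags)
    {
        return cmd_flags & REQ_OP_MASK;    /* op lives in the low bits, no shifting */
    }

    static inline u32 common_flags_of(u32 cmd_flags)
    {
        return cmd_flags & ~REQ_OP_MASK;   /* shared flags sit above the op */
    }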

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

24 Sep, 2016

1 commit

  • While debugging timeouts happening in my application workload (ScyllaDB), I have
    observed calls to open() taking a long time, ranging everywhere from 2 seconds -
    the first ones that are enough to time out my application - to more than 30
    seconds.

    The problem seems to happen because XFS may block on pending metadata updates
    under certain circumstances, and that's confirmed with the following backtrace
    taken by the offcputime tool (iovisor/bcc):

    ffffffffb90c57b1 finish_task_switch
    ffffffffb97dffb5 schedule
    ffffffffb97e310c schedule_timeout
    ffffffffb97e1f12 __down
    ffffffffb90ea821 down
    ffffffffc046a9dc xfs_buf_lock
    ffffffffc046abfb _xfs_buf_find
    ffffffffc046ae4a xfs_buf_get_map
    ffffffffc046babd xfs_buf_read_map
    ffffffffc0499931 xfs_trans_read_buf_map
    ffffffffc044a561 xfs_da_read_buf
    ffffffffc0451390 xfs_dir3_leaf_read.constprop.16
    ffffffffc0452b90 xfs_dir2_leaf_lookup_int
    ffffffffc0452e0f xfs_dir2_leaf_lookup
    ffffffffc044d9d3 xfs_dir_lookup
    ffffffffc047d1d9 xfs_lookup
    ffffffffc0479e53 xfs_vn_lookup
    ffffffffb925347a path_openat
    ffffffffb9254a71 do_filp_open
    ffffffffb9242a94 do_sys_open
    ffffffffb9242b9e sys_open
    ffffffffb97e42b2 entry_SYSCALL_64_fastpath
    00007fb0698162ed [unknown]

    Inspecting my run with blktrace, I can see that the xfsaild kthread exhibits very
    high "Dispatch wait" times, in the dozens-of-seconds range and consistent with
    the open() times I saw in that run.

    Still from the blktrace output, we can, after searching a bit, identify the
    request that wasn't dispatched:

    8,0 11 152 81.092472813 804 A WM 141698288 + 8
    8,0 0 289372 96.718761435 0 D WM 141698288 + 8 (15626265317) [swapper/0]

    As we can see above, in this particular example CFQ took 15 seconds to dispatch
    this request. Going back to the full trace, we can see that the xfsaild queue
    had plenty of opportunity to run, and it was selected as the active queue many
    times. It would just always be preempted by something else (example):

    8,0 1 0 81.117912979 0 m N cfq1618SN / insert_request
    8,0 1 0 81.117913419 0 m N cfq1618SN / add_to_rr
    8,0 1 0 81.117914044 0 m N cfq1618SN / preempt
    8,0 1 0 81.117914398 0 m N cfq767A / slice expired t=1
    8,0 1 0 81.117914755 0 m N cfq767A / resid=40
    8,0 1 0 81.117915340 0 m N / served: vt=1948520448 min_vt=1948520448
    8,0 1 0 81.117915858 0 m N cfq767A / sl_used=1 disp=0 charge=0 iops=1 sect=0

    where cfq767 is the xfsaild queue and cfq1618 corresponds to one of the ScyllaDB
    IO dispatchers.

    The requests preempting the xfsaild queue are synchronous requests. That's a
    characteristic of ScyllaDB workloads, as we only ever issue O_DIRECT requests.
    While it can be argued that preempting ASYNC requests in favor of SYNC is part
    of the CFQ logic, I don't believe that doing so for 15+ seconds is anyone's
    goal.

    Moreover, unless I am misunderstanding something, that breaks the expectation
    set by the "fifo_expire_async" tunable, which in my system is set to the
    default.

    Looking at the code, it seems to me that the issue is that after we make
    an async queue active, there is no guarantee that it will execute any request.

    When the queue itself checks cfq_may_dispatch(), it can bail if it sees SYNC
    requests in flight. An incoming request from another queue can also preempt it
    in such situation before we have the chance to execute anything (as seen in the
    trace above).

    This patch sets the must_dispatch flag if we notice that we have requests
    that are already fifo_expired. This flag is always cleared after
    cfq_dispatch_request() returns from cfq_dispatch_requests(), so it won't pin
    the queue for subsequent requests (unless they are themselves expired).

    Care is taken during preempt to still allow rt requests to preempt us
    regardless.
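
    The shape of that change, as I read the description (a sketch, not the
    literal patch; the fifo-expiry helper is hypothetical, the must_dispatch
    flag is the one named above):

    static void maybe_pin_expired_queue(struct cfq_queue *cfqq)
    {
        if (queue_has_fifo_expired_request(cfqq))  /* hypothetical helper */
            cfq_mark_cfqq_must_dispatch(cfqq);     /* dropped after one dispatch round */
    }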

    Testing my workload with this patch applied produces much better results.
    From the application side I see no timeouts, and the open() latency histogram
    generated by systemtap looks much better, with the worst outlier at 131ms:

    Latency histogram of xfs_buf_lock acquisition (microseconds):
    value |-------------------------------------------------- count
    0 | 11
    1 |@@@@ 161
    2 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1966
    4 |@ 54
    8 | 36
    16 | 7
    32 | 0
    64 | 0
    ~
    1024 | 0
    2048 | 0
    4096 | 1
    8192 | 1
    16384 | 2
    32768 | 0
    65536 | 0
    131072 | 1
    262144 | 0
    524288 | 0

    Signed-off-by: Glauber Costa
    CC: Jens Axboe
    CC: linux-block@vger.kernel.org
    CC: linux-kernel@vger.kernel.org

    Signed-off-by: Glauber Costa
    Signed-off-by: Jens Axboe

    Glauber Costa
     

08 Aug, 2016

1 commit

  • Since commit 63a4cc24867d, bio->bi_rw contains flags in the lower
    portion and the op code in the higher portions. This means that
    old code that relies on manually setting bi_rw is most likely
    going to be broken. Instead of letting that brokenness linger,
    rename the member to force old and out-of-tree code to break
    at compile time instead of at runtime.

    No intended functional changes in this commit.

    Signed-off-by: Jens Axboe

    Jens Axboe
     

21 Jul, 2016

1 commit

  • Before merging a bio into an existing request, the io scheduler is called
    to get its approval first. However, requests that come from a plug
    flush may get merged by the block layer without consulting the io
    scheduler.

    In the case of CFQ, this can cause fairness problems. For instance, if a
    request gets merged into a low weight cgroup's request, the high weight
    cgroup now depends on the low weight cgroup to get scheduled. If the high
    weight cgroup needs that io request to complete before submitting more
    requests, then it will also lose its timeslice.

    The following script demonstrates the problem. Group g1 has a low weight, g2
    and g3 have equal high weights but g2's requests are adjacent to g1's
    requests so they are subject to merging. Due to these merges, g2 gets
    poor disk time allocation.

    cat > cfq-merge-repro.sh << "EOF"
    #!/bin/bash
    set -e

    IO_ROOT=/mnt-cgroup/io

    mkdir -p $IO_ROOT

    if ! mount | grep -qw $IO_ROOT; then
    mount -t cgroup none -oblkio $IO_ROOT
    fi

    cd $IO_ROOT

    for i in g1 g2 g3; do
    if [ -d $i ]; then
    rmdir $i
    fi
    done

    mkdir g1 && echo 10 > g1/blkio.weight
    mkdir g2 && echo 495 > g2/blkio.weight
    mkdir g3 && echo 495 > g3/blkio.weight

    RUNTIME=10

    (echo $BASHPID > g1/cgroup.procs &&
    fio --readonly --name name1 --filename /dev/sdb \
    --rw read --size 64k --bs 64k --time_based \
    --runtime=$RUNTIME --offset=0k &> /dev/null)&

    (echo $BASHPID > g2/cgroup.procs &&
    fio --readonly --name name1 --filename /dev/sdb \
    --rw read --size 64k --bs 64k --time_based \
    --runtime=$RUNTIME --offset=64k &> /dev/null)&

    (echo $BASHPID > g3/cgroup.procs &&
    fio --readonly --name name1 --filename /dev/sdb \
    --rw read --size 64k --bs 64k --time_based \
    --runtime=$RUNTIME --offset=256k &> /dev/null)&

    sleep $((RUNTIME+1))

    for i in g1 g2 g3; do
    echo ---- $i ----
    cat $i/blkio.time
    done

    EOF
    # ./cfq-merge-repro.sh
    ---- g1 ----
    8:16 162
    ---- g2 ----
    8:16 165
    ---- g3 ----
    8:16 686

    After applying the patch:

    # ./cfq-merge-repro.sh
    ---- g1 ----
    8:16 90
    ---- g2 ----
    8:16 445
    ---- g3 ----
    8:16 471

    Signed-off-by: Tahsin Erdogan
    Signed-off-by: Jens Axboe

    Tahsin Erdogan
     

28 Jun, 2016

3 commits

  • Commit 9a7f38c42c2b (cfq-iosched: Convert from jiffies to nanoseconds)
    could result in charging just 1 ns to a cgroup submitting IO instead of
    the 1 jiffie we always charged before. It is arguable what the right
    amount to charge is, but for now let's retain the old behavior of always
    charging at least one jiffie.
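
    The retained minimum amounts to something like the following clamp
    (variable name illustrative):

    static u64 clamp_group_charge(u64 charge_ns)
    {
        /* never bill less than one jiffy's worth, as before the conversion */
        return max_t(u64, charge_ns, jiffies_to_nsecs(1));
    }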

    Fixes: 9a7f38c42c2b92391d9dabaf9f51df7cfe5608e4
    Signed-off-by: Jan Kara
    Signed-off-by: Jens Axboe

    Jan Kara
     
  • Commit 9a7f38c42c2 (cfq-iosched: Convert from jiffies to nanoseconds)
    broke the condition for detecting starved sync IO in
    cfq_completed_request() because rq->start_time remained in jiffies but
    we compared it with nanosecond values. This manifested as a regression
    in bonnie++ rewrite performance because we always ended up considering
    sync IO starved and thus never increased async IO queue depth.

    Since rq->start_time is used in a lot of places, converting it to ns
    values would be non-trivial. So just revert the condition in CFQ to use
    comparison with jiffies. This will lead to suboptimal results if
    cfq_fifo_expire[1] will ever come close to 1 jiffie but so far we are
    relatively far from that with the storage used with CFQ (the default
    value is 128 ms).

    Fixes: 9a7f38c42c2b92391d9dabaf9f51df7cfe5608e4
    Signed-off-by: Jan Kara
    Signed-off-by: Jens Axboe

    Jan Kara
     
  • slice_resid can be both positive and negative. Commit 9a7f38c42c2b
    (cfq-iosched: Convert from jiffies to nanoseconds) converted it from
    long to u64. Although this did not introduce any functional regression
    (the operations just overflow and the result was fine), it is certainly
    wrong and could cause issues in future. So convert the type to more
    appropriate s64.

    Fixes: 9a7f38c42c2b92391d9dabaf9f51df7cfe5608e4
    Signed-off-by: Jan Kara
    Signed-off-by: Jens Axboe

    Jan Kara
     

10 Jun, 2016

1 commit

  • If we're queuing REQ_PRIO IO and the task is running at an idle IO
    class, then temporarily boost the priority. This prevents livelocks
    due to priority inversion, when a low priority task is holding file
    system resources while attempting to do IO.
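
    Conceptually the boost amounts to something like the sketch below (not
    the literal patch; only the ioprio constants and REQ_PRIO are real):

    static unsigned short effective_ioprio_class(unsigned short task_class,
                                                 unsigned int op_flags)
    {
        if ((op_flags & REQ_PRIO) && task_class == IOPRIO_CLASS_IDLE)
            return IOPRIO_CLASS_BE;   /* temporary boost for this IO */
        return task_class;
    }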

    An example of such a livelock is shown below. An ioniced idle task is holding
    the directory mutex, while a normal priority task is trying to do
    a directory lookup.

    [478381.198925] ------------[ cut here ]------------
    [478381.200315] INFO: task ionice:1168369 blocked for more than 120 seconds.
    [478381.201324] Not tainted 4.0.9-38_fbk5_hotfix1_2936_g85409c6 #1
    [478381.202278] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    [478381.203462] ionice D ffff8803692736a8 0 1168369 1 0x00000080
    [478381.203466] ffff8803692736a8 ffff880399c21300 ffff880276adcc00 ffff880369273698
    [478381.204589] ffff880369273fd8 0000000000000000 7fffffffffffffff 0000000000000002
    [478381.205752] ffffffff8177d5e0 ffff8803692736c8 ffffffff8177cea7 0000000000000000
    [478381.206874] Call Trace:
    [478381.207253] [] ? bit_wait_io_timeout+0x80/0x80
    [478381.208175] [] schedule+0x37/0x90
    [478381.208932] [] schedule_timeout+0x1dc/0x250
    [478381.209805] [] ? __blk_run_queue+0x37/0x50
    [478381.210706] [] ? ktime_get+0x45/0xb0
    [478381.211489] [] io_schedule_timeout+0xa7/0x110
    [478381.212402] [] ? prepare_to_wait+0x5b/0x90
    [478381.213280] [] bit_wait_io+0x36/0x50
    [478381.214063] [] __wait_on_bit+0x65/0x90
    [478381.214961] [] ? bit_wait_io_timeout+0x80/0x80
    [478381.215872] [] out_of_line_wait_on_bit+0x7c/0x90
    [478381.216806] [] ? wake_atomic_t_function+0x40/0x40
    [478381.217773] [] __wait_on_buffer+0x2a/0x30
    [478381.218641] [] ext4_bread+0x57/0x70
    [478381.219425] [] __ext4_read_dirblock+0x3c/0x380
    [478381.220467] [] ext4_dx_find_entry+0x7d/0x170
    [478381.221357] [] ? find_get_entry+0x1e/0xa0
    [478381.222208] [] ext4_find_entry+0x484/0x510
    [478381.223090] [] ext4_lookup+0x52/0x160
    [478381.223882] [] lookup_real+0x1d/0x60
    [478381.224675] [] __lookup_hash+0x38/0x50
    [478381.225697] [] lookup_slow+0x45/0xab
    [478381.226941] [] link_path_walk+0x7ae/0x820
    [478381.227880] [] path_init+0xc2/0x430
    [478381.228677] [] ? security_file_alloc+0x16/0x20
    [478381.229776] [] path_openat+0x77/0x620
    [478381.230767] [] ? page_add_file_rmap+0x2e/0x70
    [478381.232019] [] do_filp_open+0x43/0xa0
    [478381.233016] [] ? creds_are_invalid+0x29/0x70
    [478381.234072] [] do_open_execat+0x70/0x170
    [478381.235039] [] do_execveat_common.isra.36+0x1b8/0x6e0
    [478381.236051] [] do_execve+0x2c/0x30
    [478381.236809] [] ? getname+0x12/0x20
    [478381.237564] [] SyS_execve+0x2e/0x40
    [478381.238338] [] stub_execve+0x6d/0xa0
    [478381.239126] ------------[ cut here ]------------
    [478381.239915] ------------[ cut here ]------------
    [478381.240606] INFO: task python2.7:1168375 blocked for more than 120 seconds.
    [478381.242673] Not tainted 4.0.9-38_fbk5_hotfix1_2936_g85409c6 #1
    [478381.243653] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    [478381.244902] python2.7 D ffff88005cf8fb98 0 1168375 1168248 0x00000080
    [478381.244904] ffff88005cf8fb98 ffff88016c1f0980 ffffffff81c134c0 ffff88016c1f11a0
    [478381.246023] ffff88005cf8ffd8 ffff880466cd0cbc ffff88016c1f0980 00000000ffffffff
    [478381.247138] ffff880466cd0cc0 ffff88005cf8fbb8 ffffffff8177cea7 ffff88005cf8fcc8
    [478381.248252] Call Trace:
    [478381.248630] [] schedule+0x37/0x90
    [478381.249382] [] schedule_preempt_disabled+0xe/0x10
    [478381.250465] [] __mutex_lock_slowpath+0x92/0x100
    [478381.251409] [] mutex_lock+0x1b/0x2f
    [478381.252199] [] lookup_slow+0x36/0xab
    [478381.253023] [] link_path_walk+0x7ae/0x820
    [478381.253877] [] ? try_charge+0xc1/0x700
    [478381.254690] [] path_init+0xc2/0x430
    [478381.255525] [] ? security_file_alloc+0x16/0x20
    [478381.256450] [] path_openat+0x77/0x620
    [478381.257256] [] ? lru_cache_add_active_or_unevictable+0x2b/0xa0
    [478381.258390] [] ? handle_mm_fault+0x13f3/0x1720
    [478381.259309] [] do_filp_open+0x43/0xa0
    [478381.260139] [] ? __alloc_fd+0x42/0x120
    [478381.260962] [] do_sys_open+0x13c/0x230
    [478381.261779] [] ? syscall_trace_enter_phase1+0x113/0x170
    [478381.262851] [] SyS_open+0x22/0x30
    [478381.263598] [] system_call_fastpath+0x12/0x17
    [478381.264551] ------------[ cut here ]------------
    [478381.265377] ------------[ cut here ]------------

    Signed-off-by: Jens Axboe
    Reviewed-by: Jeff Moyer

    Jens Axboe
     

08 Jun, 2016

5 commits