05 Apr, 2017

1 commit

  • Writeback throttling does not play well with CFQ since that also tries
    to throttle async writes. As a result async writeback can get starved in
    presence of readers. As an example, take a benchmark simulating a
    PostgreSQL database running over a standard rotating SATA drive. There
    are 16 processes doing random reads from a huge file (2*machine memory),
    1 process doing random writes to the huge file and calling fsync once
    per 50000 writes and 1 process doing sequential 8k writes to a
    relatively small file wrapping around at the end of the file and calling
    fsync every 5 writes. Under this load read latency easily exceeds the
    target latency of 75 ms (just because there are so many reads happening
    against a relatively slow disk) and thus writeback is throttled to a
    point where only 1 write request is allowed at a time. Blktrace data
    then looks like:

    8,0 1 0 8.347751764 0 m N cfq workload slice:40000000
    8,0 1 0 8.347755256 0 m N cfq293A / set_active wl_class: 0 wl_type:0
    8,0 1 0 8.347784100 0 m N cfq293A / Not idling. st->count:1
    8,0 1 3814 8.347763916 5839 UT N [kworker/u9:2] 1
    8,0 0 0 8.347777605 0 m N cfq293A / Not idling. st->count:1
    8,0 1 0 8.347784100 0 m N cfq293A / Not idling. st->count:1
    8,0 3 1596 8.354364057 0 C R 156109528 + 8 (6906954) [0]
    8,0 3 0 8.354383193 0 m N cfq6196SN / complete rqnoidle 0
    8,0 3 0 8.354386476 0 m N cfq schedule dispatch
    8,0 3 0 8.354399397 0 m N cfq293A / Not idling. st->count:1
    8,0 3 0 8.354404705 0 m N cfq293A / dispatch_insert
    8,0 3 0 8.354409454 0 m N cfq293A / dispatched a request
    8,0 3 0 8.354412527 0 m N cfq293A / activate rq, drv=1
    8,0 3 1597 8.354414692 0 D W 145961400 + 24 (6718452) [swapper/0]
    8,0 3 0 8.354484184 0 m N cfq293A / Not idling. st->count:1
    8,0 3 0 8.354487536 0 m N cfq293A / slice expired t=0
    8,0 3 0 8.354498013 0 m N / served: vt=5888102466265088 min_vt=5888074869387264
    8,0 3 0 8.354502692 0 m N cfq293A / sl_used=6737519 disp=1 charge=6737519 iops=0 sect=24
    8,0 3 0 8.354505695 0 m N cfq293A / del_from_rr
    ...
    8,0 0 1810 8.354728768 0 C W 145961400 + 24 (314076) [0]
    8,0 0 0 8.354746927 0 m N cfq293A / complete rqnoidle 0
    ...
    8,0 1 3829 8.389886102 5839 G W 145962968 + 24 [kworker/u9:2]
    8,0 1 3830 8.389888127 5839 P N [kworker/u9:2]
    8,0 1 3831 8.389908102 5839 A W 145978336 + 24 count:1
    8,0 0 0 9.455395969 0 m N cfq293A / slice expired t=0
    8,0 0 0 9.455404210 0 m N / served: vt=5888958194597888 min_vt=5888941810597888
    8,0 0 0 9.455410077 0 m N cfq293A / sl_used=4000000 disp=1 charge=4000000 iops=0 sect=24
    8,0 0 0 9.455416851 0 m N cfq293A / del_from_rr
    ...
    8,0 0 2045 9.455648515 0 C W 145962968 + 24 (332305) [0]
    8,0 0 0 9.455668350 0 m N cfq293A / complete rqnoidle 0
    ...
    8,0 1 4371 9.455710115 5839 G W 145978336 + 24 [kworker/u9:2]
    8,0 1 4372 9.455712350 5839 P N [kworker/u9:2]
    8,0 1 4373 9.455730159 5839 A W 145986616 + 24
    Signed-off-by: Jens Axboe

    Jan Kara
     

02 Mar, 2017

1 commit


22 Feb, 2017

1 commit

  • Pull block layer updates from Jens Axboe:

    - blk-mq scheduling framework from me and Omar, with a port of the
    deadline scheduler for this framework. A port of BFQ from Paolo is in
    the works, and should be ready for 4.12.

    - Various fixups and improvements to the above scheduling framework
    from Omar, Paolo, Bart, me, others.

    - Cleanup of the exported sysfs blk-mq data into debugfs, from Omar.
    This allows us to export more information that helps debug hangs or
    performance issues, without cluttering or abusing the sysfs API.

    - Fixes for the sbitmap code, the scalable bitmap code that was
    migrated from blk-mq, from Omar.

    - Removal of the BLOCK_PC support in struct request, and refactoring of
    carrying SCSI payloads in the block layer. This cleans up the code
    nicely, and enables us to kill the SCSI specific parts of struct
    request, shrinking it down nicely. From Christoph mainly, with help
    from Hannes.

    - Support for ranged discard requests and discard merging, also from
    Christoph.

    - Support for OPAL in the block layer, and for NVMe as well. Mainly
    from Scott Bauer, with fixes/updates from various other folks.

    - Error code fixup for gdrom from Christophe.

    - cciss pci irq allocation cleanup from Christoph.

    - Making the cdrom device operations read only, from Kees Cook.

    - Fixes for duplicate bdi registrations and bdi/queue life time
    problems from Jan and Dan.

    - Set of fixes and updates for lightnvm, from Matias and Javier.

    - A few fixes for nbd from Josef, using idr to name devices and a
    workqueue deadlock fix on receive. Also marks Josef as the current
    maintainer of nbd.

    - Fix from Josef, overwriting queue settings when the number of
    hardware queues is updated for a blk-mq device.

    - NVMe fix from Keith, ensuring that we don't repeatedly mark an IO
    aborted, if we didn't end up aborting it.

    - SG gap merging fix from Ming Lei for block.

    - Loop fix also from Ming, fixing a race and crash between setting loop
    status and IO.

    - Two block race fixes from Tahsin, fixing request list iteration and
    fixing a race between device registration and udev device add
    notifications.

    - Double free fix in the cgroup writeback code, from Tejun.

    - Another double free fix in blkcg, from Hou Tao.

    - Partition overflow fix for EFI from Alden Tondettar.

    * tag 'for-4.11/linus-merge-signed' of git://git.kernel.dk/linux-block: (156 commits)
    nvme: Check for Security send/recv support before issuing commands.
    block/sed-opal: allocate struct opal_dev dynamically
    block/sed-opal: tone down not supported warnings
    block: don't defer flushes on blk-mq + scheduling
    blk-mq-sched: ask scheduler for work, if we failed dispatching leftovers
    blk-mq: don't special case flush inserts for blk-mq-sched
    blk-mq-sched: don't add flushes to the head of requeue queue
    blk-mq: have blk_mq_dispatch_rq_list() return if we queued IO or not
    block: do not allow updates through sysfs until registration completes
    lightnvm: set default lun range when no luns are specified
    lightnvm: fix off-by-one error on target initialization
    Maintainers: Modify SED list from nvme to block
    Move stack parameters for sed_ioctl to prevent oversized stack with CONFIG_KASAN
    uapi: sed-opal fix IOW for activate lsp to use correct struct
    cdrom: Make device operations read-only
    elevator: fix loading wrong elevator type for blk-mq devices
    cciss: switch to pci_irq_alloc_vectors
    block/loop: fix race between I/O and set_status
    blk-mq-sched: don't hold queue_lock when calling exit_icq
    block: set make_request_fn manually in blk_mq_update_nr_hw_queues
    ...

    Linus Torvalds
     

16 Feb, 2017

1 commit


09 Feb, 2017

1 commit


23 Jan, 2017

2 commits

  • The script "checkpatch.pl" pointed out an issue like the following.

    ERROR: do not use assignment in if condition

    Thus fix the affected place in the source code.
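
    A minimal illustration of the pattern checkpatch complains about (the
    call below is hypothetical, not the hunk touched by this commit):

    /* flagged: assignment buried inside the if condition */
    if ((ret = setup_queue(q)) != 0)
        return ret;

    /* preferred: separate the assignment from the test */
    ret = setup_queue(q);
    if (ret)
        return ret;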

    Signed-off-by: Markus Elfring
    Signed-off-by: Jens Axboe

    Markus Elfring
     
  • KMSAN (KernelMemorySanitizer, a new error detection tool) reports use of
    uninitialized memory in cfq_init_cfqq():

    ==================================================================
    BUG: KMSAN: use of unitialized memory
    ...
    Call Trace:
    [< inline >] __dump_stack lib/dump_stack.c:15
    [] dump_stack+0x157/0x1d0 lib/dump_stack.c:51
    [] kmsan_report+0x205/0x360 ??:?
    [] __msan_warning+0x5b/0xb0 ??:?
    [< inline >] cfq_init_cfqq block/cfq-iosched.c:3754
    [] cfq_get_queue+0xc80/0x14d0 block/cfq-iosched.c:3857
    ...
    origin:
    [] save_stack_trace+0x27/0x50 arch/x86/kernel/stacktrace.c:67
    [] kmsan_internal_poison_shadow+0xab/0x150 ??:?
    [] kmsan_poison_slab+0xbb/0x120 ??:?
    [< inline >] allocate_slab mm/slub.c:1627
    [] new_slab+0x3af/0x4b0 mm/slub.c:1641
    [< inline >] new_slab_objects mm/slub.c:2407
    [] ___slab_alloc+0x323/0x4a0 mm/slub.c:2564
    [< inline >] __slab_alloc mm/slub.c:2606
    [< inline >] slab_alloc_node mm/slub.c:2669
    [] kmem_cache_alloc_node+0x1d2/0x1f0 mm/slub.c:2746
    [] cfq_get_queue+0x47d/0x14d0 block/cfq-iosched.c:3850
    ...
    ==================================================================
    (the line numbers are relative to 4.8-rc6, but the bug persists
    upstream)

    The uninitialized struct cfq_queue is created by kmem_cache_alloc_node()
    and then passed to cfq_init_cfqq(), which accesses cfqq->ioprio_class
    before it's initialized.
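
    Reduced to a sketch, the pattern is (the surrounding code is
    illustrative, not the exact cfq hunk):

    cfqq = kmem_cache_alloc_node(cfq_pool, gfp_mask, q->node);
    /* slab memory is not zeroed: every field holds garbage until stored to */
    ...
    if (cfqq->ioprio_class != ioprio_class)  /* read happens before any store */
        ...

    so the field needs to be written (or the allocation zeroed, e.g. with
    __GFP_ZERO) before that first read.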

    Signed-off-by: Alexander Potapenko
    Signed-off-by: Jens Axboe

    Alexander Potapenko
     

18 Jan, 2017

1 commit


29 Nov, 2016

1 commit


22 Nov, 2016

1 commit

  • blkcg allocates some per-cgroup data structures with GFP_NOWAIT and
    when that fails falls back to operations which aren't specific to the
    cgroup. Occasional failures are expected under pressure and falling
    back to non-cgroup operation is the right thing to do.

    Unfortunately, I forgot to add __GFP_NOWARN to these allocations and
    these expected failures end up creating a lot of noise. Add
    __GFP_NOWARN.
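
    The change amounts to adjustments of the following kind (the allocation
    site shown is illustrative):

    /* failure here is expected under memory pressure and is handled by
     * falling back to non-cgroup operation, so don't warn about it */
    blkg = kzalloc_node(sizeof(*blkg), GFP_NOWAIT | __GFP_NOWARN, q->node);
    if (!blkg)
        return NULL;   /* caller falls back to the root blkg */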

    Signed-off-by: Tejun Heo
    Reported-by: Marc MERLIN
    Reported-by: Vlastimil Babka
    Signed-off-by: Jens Axboe

    Tejun Heo
     

11 Nov, 2016

2 commits

  • Enable throttling of buffered writeback to make it a lot
    smoother and have far less impact on other system activity.
    Background writeback should be, by definition, background
    activity. The fact that we flush huge bundles of it at the time
    means that it potentially has heavy impacts on foreground workloads,
    which isn't ideal. We can't easily limit the sizes of writes that
    we do, since that would impact file system layout in the presence
    of delayed allocation. So just throttle back buffered writeback,
    unless someone is waiting for it.

    The algorithm for when to throttle takes its inspiration in the
    CoDel networking scheduling algorithm. Like CoDel, blk-wb monitors
    the minimum latencies of requests over a window of time. In that
    window of time, if the minimum latency of any request exceeds a
    given target, then a scale count is incremented and the queue depth
    is shrunk. The next monitoring window is shrunk accordingly. Unlike
    CoDel, if we hit a window that exhibits good behavior, then we
    simply decrement the scale count and re-calculate the limits for that
    scale value. This prevents us from oscillating between a
    close-to-ideal value and max all the time, instead remaining in the
    windows where we get good behavior.

    Unlike CoDel, blk-wb allows the scale count to go negative. This
    happens if we primarily have writes going on. Unlike positive
    scale counts, this doesn't change the size of the monitoring window.
    When the heavy writers finish, blk-wb quickly snaps back to its
    stable state of a zero scale count.
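
    A rough sketch of the scaling step described above (names and the
    depth/window formulas are illustrative, not the actual blk-wbt code):

    struct wb_sketch {
        int scale_step;              /* > 0: throttling, < 0: write-mostly */
        unsigned int queue_depth;    /* write depth allowed at step 0 */
        unsigned int wb_depth;       /* currently allowed writeback depth */
        unsigned long win_usec;      /* current monitoring window length */
        unsigned long base_win_usec; /* window length at step <= 0 */
        unsigned long target_usec;   /* latency target ('wb_lat_usec') */
    };

    static void wb_window_end(struct wb_sketch *wb, unsigned long min_lat_usec)
    {
        if (min_lat_usec > wb->target_usec)
            wb->scale_step++;        /* missed the target: throttle harder */
        else
            wb->scale_step--;        /* good window: back off one step */

        /* keep the toy shifts in a sane range */
        if (wb->scale_step > 8)
            wb->scale_step = 8;
        else if (wb->scale_step < -8)
            wb->scale_step = -8;

        if (wb->scale_step > 0) {
            /* positive steps shrink both the allowed depth and the window */
            wb->wb_depth = wb->queue_depth >> wb->scale_step;
            if (!wb->wb_depth)
                wb->wb_depth = 1;
            wb->win_usec = wb->base_win_usec >> wb->scale_step;
        } else {
            /* negative steps allow more depth; the window size stays put */
            wb->wb_depth = wb->queue_depth << -wb->scale_step;
            wb->win_usec = wb->base_win_usec;
        }
    }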

    The patch registers a sysfs entry, 'wb_lat_usec'. This sets the latency
    target to be met. It defaults to 2 msec for non-rotational storage, and
    75 msec for rotational storage. Setting this value to '0' disables
    blk-wb. Generally, a user would not have to touch this setting.

    We don't enable WBT on devices that are managed with CFQ, and have
    a non-root block cgroup attached. If we have a proportional share setup
    on this particular disk, then the wbt throttling will interfere with
    that. We don't have a strong need for wbt for that case, since we will
    rely on CFQ doing that for us.

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • cfq_cpd_alloc(), which is the cpd_alloc_fn implementation for cfq, was
    incorrectly hard-coding GFP_KERNEL instead of using the mask specified
    through the @gfp parameter. This currently doesn't cause any actual
    issues because all current callers specify GFP_KERNEL. Fix it.

    Signed-off-by: Tejun Heo
    Reported-by: Dan Carpenter
    Fixes: e4a9bde9589f ("blkcg: replace blkcg_policy->cpd_size with ->cpd_alloc/free_fn() methods")
    Signed-off-by: Jens Axboe

    Tejun Heo
     

01 Nov, 2016

2 commits


28 Oct, 2016

1 commit

  • Now that we don't need the common flags to overflow outside the range
    of a 32-bit type we can encode them the same way for both the bio and
    request fields. This in addition allows us to place the operation
    first (and make some room for more ops while we're at it) and to
    stop having to shift around the operation values.

    In addition this allows passing around only one value in the block layer
    instead of two (and eventually also in the file systems, but we can do
    that later) and thus clean up a lot of code.

    Last but not least this allows decreasing the size of the cmd_flags
    field in struct request to 32-bits. Various functions passing this
    value could also be updated, but I'd like to avoid the churn for now.
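
    Conceptually, the combined value is carved up as sketched below (the bit
    width here is only illustrative; see include/linux/blk_types.h for the
    real layout):

    #define OPF_OP_BITS  8
    #define OPF_OP_MASK  ((1u << OPF_OP_BITS) - 1)

    /* the low bits hold the REQ_OP_* value... */
    static inline unsigned int opf_op(unsigned int opf)
    {
        return opf & OPF_OP_MASK;
    }

    /* ...and everything above them holds the flag bits */
    static inline unsigned int opf_flags(unsigned int opf)
    {
        return opf & ~OPF_OP_MASK;
    }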

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

24 Sep, 2016

1 commit

  • While debugging timeouts happening in my application workload (ScyllaDB), I have
    observed calls to open() taking a long time, ranging everywhere from 2 seconds -
    the first ones that are enough to time out my application - to more than 30
    seconds.

    The problem seems to happen because XFS may block on pending metadata updates
    under certain circumstances, and that's confirmed with the following backtrace
    taken by the offcputime tool (iovisor/bcc):

    ffffffffb90c57b1 finish_task_switch
    ffffffffb97dffb5 schedule
    ffffffffb97e310c schedule_timeout
    ffffffffb97e1f12 __down
    ffffffffb90ea821 down
    ffffffffc046a9dc xfs_buf_lock
    ffffffffc046abfb _xfs_buf_find
    ffffffffc046ae4a xfs_buf_get_map
    ffffffffc046babd xfs_buf_read_map
    ffffffffc0499931 xfs_trans_read_buf_map
    ffffffffc044a561 xfs_da_read_buf
    ffffffffc0451390 xfs_dir3_leaf_read.constprop.16
    ffffffffc0452b90 xfs_dir2_leaf_lookup_int
    ffffffffc0452e0f xfs_dir2_leaf_lookup
    ffffffffc044d9d3 xfs_dir_lookup
    ffffffffc047d1d9 xfs_lookup
    ffffffffc0479e53 xfs_vn_lookup
    ffffffffb925347a path_openat
    ffffffffb9254a71 do_filp_open
    ffffffffb9242a94 do_sys_open
    ffffffffb9242b9e sys_open
    ffffffffb97e42b2 entry_SYSCALL_64_fastpath
    00007fb0698162ed [unknown]

    Inspecting my run with blktrace, I can see that the xfsaild kthread exhibits very
    high "Dispatch wait" times, in the dozens-of-seconds range and consistent with
    the open() times I have seen in that run.

    Still from the blktrace output, we can, after searching a bit, identify the
    request that wasn't dispatched:

    8,0 11 152 81.092472813 804 A WM 141698288 + 8
    8,0 0 289372 96.718761435 0 D WM 141698288 + 8 (15626265317) [swapper/0]

    As we can see above, in this particular example CFQ took 15 seconds to dispatch
    this request. Going back to the full trace, we can see that the xfsaild queue
    had plenty of opportunity to run, and it was selected as the active queue many
    times. It would just always be preempted by something else (example):

    8,0 1 0 81.117912979 0 m N cfq1618SN / insert_request
    8,0 1 0 81.117913419 0 m N cfq1618SN / add_to_rr
    8,0 1 0 81.117914044 0 m N cfq1618SN / preempt
    8,0 1 0 81.117914398 0 m N cfq767A / slice expired t=1
    8,0 1 0 81.117914755 0 m N cfq767A / resid=40
    8,0 1 0 81.117915340 0 m N / served: vt=1948520448 min_vt=1948520448
    8,0 1 0 81.117915858 0 m N cfq767A / sl_used=1 disp=0 charge=0 iops=1 sect=0

    where cfq767 is the xfsaild queue and cfq1618 corresponds to one of the ScyllaDB
    IO dispatchers.

    The requests preempting the xfsaild queue are synchronous requests. That's a
    characteristic of ScyllaDB workloads, as we only ever issue O_DIRECT requests.
    While it can be argued that preempting ASYNC requests in favor of SYNC is part
    of the CFQ logic, I don't believe that doing so for 15+ seconds is anyone's
    goal.

    Moreover, unless I am misunderstanding something, that breaks the expectation
    set by the "fifo_expire_async" tunable, which in my system is set to the
    default.

    Looking at the code, it seems to me that the issue is that after we make
    an async queue active, there is no guarantee that it will execute any request.

    When the queue itself tests whether it may dispatch (cfq_may_dispatch()), it can
    bail if it sees SYNC requests in flight. An incoming request from another queue
    can also preempt it in such a situation before we have the chance to execute
    anything (as seen in the trace above).

    This patch sets the must_dispatch flag if we notice that we have requests
    that are already fifo_expired. This flag is always cleared after
    cfq_dispatch_request() returns from cfq_dispatch_requests(), so it won't pin
    the queue for subsequent requests (unless they are themselves expired).

    Care is taken during preempt to still allow rt requests to preempt us
    regardless.

    Testing my workload with this patch applied produces much better results.
    From the application side I see no timeouts, and the open() latency histogram
    generated by systemtap looks much better, with the worst outlier at 131ms:

    Latency histogram of xfs_buf_lock acquisition (microseconds):
    value |-------------------------------------------------- count
    0 | 11
    1 |@@@@ 161
    2 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1966
    4 |@ 54
    8 | 36
    16 | 7
    32 | 0
    64 | 0
    ~
    1024 | 0
    2048 | 0
    4096 | 1
    8192 | 1
    16384 | 2
    32768 | 0
    65536 | 0
    131072 | 1
    262144 | 0
    524288 | 0

    Signed-off-by: Glauber Costa
    CC: Jens Axboe
    CC: linux-block@vger.kernel.org
    CC: linux-kernel@vger.kernel.org

    Signed-off-by: Glauber Costa
    Signed-off-by: Jens Axboe

    Glauber Costa
     

08 Aug, 2016

1 commit

  • Since commit 63a4cc24867d, bio->bi_rw contains flags in the lower
    portion and the op code in the higher portions. This means that
    old code that relies on manually setting bi_rw is most likely
    going to be broken. Instead of letting that brokenness linger,
    rename the member, to force old and out-of-tree code to break
    at compile time instead of at runtime.

    No intended functional changes in this commit.

    Signed-off-by: Jens Axboe

    Jens Axboe
     

21 Jul, 2016

1 commit

  • Before merging a bio into an existing request, io scheduler is called to
    get its approval first. However, the requests that come from a plug
    flush may get merged by block layer without consulting with io
    scheduler.

    In the case of CFQ, this can cause fairness problems. For instance, if a
    request gets merged into a low weight cgroup's request, the high weight cgroup
    now depends on the low weight cgroup to get scheduled. If the high weight cgroup
    needs that io request to complete before submitting more requests, then it
    will also lose its timeslice.

    The following script demonstrates the problem. Group g1 has a low weight, g2
    and g3 have equal high weights but g2's requests are adjacent to g1's
    requests so they are subject to merging. Due to these merges, g2 gets
    poor disk time allocation.

    cat > cfq-merge-repro.sh << "EOF"
    #!/bin/bash
    set -e

    IO_ROOT=/mnt-cgroup/io

    mkdir -p $IO_ROOT

    if ! mount | grep -qw $IO_ROOT; then
      mount -t cgroup none -oblkio $IO_ROOT
    fi

    cd $IO_ROOT

    for i in g1 g2 g3; do
      if [ -d $i ]; then
        rmdir $i
      fi
    done

    mkdir g1 && echo 10 > g1/blkio.weight
    mkdir g2 && echo 495 > g2/blkio.weight
    mkdir g3 && echo 495 > g3/blkio.weight

    RUNTIME=10

    (echo $BASHPID > g1/cgroup.procs &&
      fio --readonly --name name1 --filename /dev/sdb \
          --rw read --size 64k --bs 64k --time_based \
          --runtime=$RUNTIME --offset=0k &> /dev/null)&

    (echo $BASHPID > g2/cgroup.procs &&
      fio --readonly --name name1 --filename /dev/sdb \
          --rw read --size 64k --bs 64k --time_based \
          --runtime=$RUNTIME --offset=64k &> /dev/null)&

    (echo $BASHPID > g3/cgroup.procs &&
      fio --readonly --name name1 --filename /dev/sdb \
          --rw read --size 64k --bs 64k --time_based \
          --runtime=$RUNTIME --offset=256k &> /dev/null)&

    sleep $((RUNTIME+1))

    for i in g1 g2 g3; do
      echo ---- $i ----
      cat $i/blkio.time
    done

    EOF
    # ./cfq-merge-repro.sh
    ---- g1 ----
    8:16 162
    ---- g2 ----
    8:16 165
    ---- g3 ----
    8:16 686

    After applying the patch:

    # ./cfq-merge-repro.sh
    ---- g1 ----
    8:16 90
    ---- g2 ----
    8:16 445
    ---- g3 ----
    8:16 471

    Signed-off-by: Tahsin Erdogan
    Signed-off-by: Jens Axboe

    Tahsin Erdogan
     

28 Jun, 2016

3 commits

  • Commit 9a7f38c42c2b (cfq-iosched: Convert from jiffies to nanoseconds)
    could result in charging just 1 ns to a cgroup submitting IO instead of the
    1 jiffie we always charged before. It is arguable what the right amount to
    charge is, but for now let's retain the old behavior of always charging at
    least one jiffie.

    Fixes: 9a7f38c42c2b92391d9dabaf9f51df7cfe5608e4
    Signed-off-by: Jan Kara
    Signed-off-by: Jens Axboe

    Jan Kara
     
  • Commit 9a7f38c42c2 (cfq-iosched: Convert from jiffies to nanoseconds)
    broke the condition for detecting starved sync IO in
    cfq_completed_request() because rq->start_time remained in jiffies but
    we compared it with nanosecond values. This manifested as a regression
    in bonnie++ rewrite performance because we always ended up considering
    sync IO starved and thus never increased async IO queue depth.

    Since rq->start_time is used in a lot of places, converting it to ns
    values would be non-trivial. So just revert the condition in CFQ to use
    comparison with jiffies. This will lead to suboptimal results if
    cfq_fifo_expire[1] ever comes close to 1 jiffie, but so far we are
    relatively far from that with the storage used with CFQ (the default
    value is 128 ms).

    Fixes: 9a7f38c42c2b92391d9dabaf9f51df7cfe5608e4
    Signed-off-by: Jan Kara
    Signed-off-by: Jens Axboe

    Jan Kara
     
  • slice_resid can be both positive and negative. Commit 9a7f38c42c2b
    (cfq-iosched: Convert from jiffies to nanoseconds) converted it from
    long to u64. Although this did not introduce any functional regression
    (the operations just overflow and the result was fine), it is certainly
    wrong and could cause issues in the future. So convert the type to the
    more appropriate s64.
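
    A tiny standalone illustration (using the stdint equivalents of the
    kernel's u64/s64) of why the unsigned type is a poor fit even though the
    arithmetic happens to wrap back to the right answer:

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        uint64_t resid = 0;   /* stand-in for a u64 slice_resid */

        resid -= 5;           /* wraps to 18446744073709551611 */
        /* sums involving resid still wrap back to the right result, but any
         * direct test or printout of the value itself is silently wrong: */
        printf("%d %lld\n", resid > 0, (long long)(int64_t)resid);
        /* prints "1 -5": the signed view expresses the intent directly */
        return 0;
    }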

    Fixes: 9a7f38c42c2b92391d9dabaf9f51df7cfe5608e4
    Signed-off-by: Jan Kara
    Signed-off-by: Jens Axboe

    Jan Kara
     

10 Jun, 2016

1 commit

  • If we're queuing REQ_PRIO IO and the task is running at an idle IO
    class, then temporarily boost the priority. This prevents livelocks
    due to priority inversion, when a low priority task is holding file
    system resources while attempting to do IO.

    An example of that is shown below. An ioniced idle task is holding
    the directory mutex, while a normal priority task is trying to do
    a directory lookup.

    [478381.198925] ------------[ cut here ]------------
    [478381.200315] INFO: task ionice:1168369 blocked for more than 120 seconds.
    [478381.201324] Not tainted 4.0.9-38_fbk5_hotfix1_2936_g85409c6 #1
    [478381.202278] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    [478381.203462] ionice D ffff8803692736a8 0 1168369 1 0x00000080
    [478381.203466] ffff8803692736a8 ffff880399c21300 ffff880276adcc00 ffff880369273698
    [478381.204589] ffff880369273fd8 0000000000000000 7fffffffffffffff 0000000000000002
    [478381.205752] ffffffff8177d5e0 ffff8803692736c8 ffffffff8177cea7 0000000000000000
    [478381.206874] Call Trace:
    [478381.207253] [] ? bit_wait_io_timeout+0x80/0x80
    [478381.208175] [] schedule+0x37/0x90
    [478381.208932] [] schedule_timeout+0x1dc/0x250
    [478381.209805] [] ? __blk_run_queue+0x37/0x50
    [478381.210706] [] ? ktime_get+0x45/0xb0
    [478381.211489] [] io_schedule_timeout+0xa7/0x110
    [478381.212402] [] ? prepare_to_wait+0x5b/0x90
    [478381.213280] [] bit_wait_io+0x36/0x50
    [478381.214063] [] __wait_on_bit+0x65/0x90
    [478381.214961] [] ? bit_wait_io_timeout+0x80/0x80
    [478381.215872] [] out_of_line_wait_on_bit+0x7c/0x90
    [478381.216806] [] ? wake_atomic_t_function+0x40/0x40
    [478381.217773] [] __wait_on_buffer+0x2a/0x30
    [478381.218641] [] ext4_bread+0x57/0x70
    [478381.219425] [] __ext4_read_dirblock+0x3c/0x380
    [478381.220467] [] ext4_dx_find_entry+0x7d/0x170
    [478381.221357] [] ? find_get_entry+0x1e/0xa0
    [478381.222208] [] ext4_find_entry+0x484/0x510
    [478381.223090] [] ext4_lookup+0x52/0x160
    [478381.223882] [] lookup_real+0x1d/0x60
    [478381.224675] [] __lookup_hash+0x38/0x50
    [478381.225697] [] lookup_slow+0x45/0xab
    [478381.226941] [] link_path_walk+0x7ae/0x820
    [478381.227880] [] path_init+0xc2/0x430
    [478381.228677] [] ? security_file_alloc+0x16/0x20
    [478381.229776] [] path_openat+0x77/0x620
    [478381.230767] [] ? page_add_file_rmap+0x2e/0x70
    [478381.232019] [] do_filp_open+0x43/0xa0
    [478381.233016] [] ? creds_are_invalid+0x29/0x70
    [478381.234072] [] do_open_execat+0x70/0x170
    [478381.235039] [] do_execveat_common.isra.36+0x1b8/0x6e0
    [478381.236051] [] do_execve+0x2c/0x30
    [478381.236809] [] ? getname+0x12/0x20
    [478381.237564] [] SyS_execve+0x2e/0x40
    [478381.238338] [] stub_execve+0x6d/0xa0
    [478381.239126] ------------[ cut here ]------------
    [478381.239915] ------------[ cut here ]------------
    [478381.240606] INFO: task python2.7:1168375 blocked for more than 120 seconds.
    [478381.242673] Not tainted 4.0.9-38_fbk5_hotfix1_2936_g85409c6 #1
    [478381.243653] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    [478381.244902] python2.7 D ffff88005cf8fb98 0 1168375 1168248 0x00000080
    [478381.244904] ffff88005cf8fb98 ffff88016c1f0980 ffffffff81c134c0 ffff88016c1f11a0
    [478381.246023] ffff88005cf8ffd8 ffff880466cd0cbc ffff88016c1f0980 00000000ffffffff
    [478381.247138] ffff880466cd0cc0 ffff88005cf8fbb8 ffffffff8177cea7 ffff88005cf8fcc8
    [478381.248252] Call Trace:
    [478381.248630] [] schedule+0x37/0x90
    [478381.249382] [] schedule_preempt_disabled+0xe/0x10
    [478381.250465] [] __mutex_lock_slowpath+0x92/0x100
    [478381.251409] [] mutex_lock+0x1b/0x2f
    [478381.252199] [] lookup_slow+0x36/0xab
    [478381.253023] [] link_path_walk+0x7ae/0x820
    [478381.253877] [] ? try_charge+0xc1/0x700
    [478381.254690] [] path_init+0xc2/0x430
    [478381.255525] [] ? security_file_alloc+0x16/0x20
    [478381.256450] [] path_openat+0x77/0x620
    [478381.257256] [] ? lru_cache_add_active_or_unevictable+0x2b/0xa0
    [478381.258390] [] ? handle_mm_fault+0x13f3/0x1720
    [478381.259309] [] do_filp_open+0x43/0xa0
    [478381.260139] [] ? __alloc_fd+0x42/0x120
    [478381.260962] [] do_sys_open+0x13c/0x230
    [478381.261779] [] ? syscall_trace_enter_phase1+0x113/0x170
    [478381.262851] [] SyS_open+0x22/0x30
    [478381.263598] [] system_call_fastpath+0x12/0x17
    [478381.264551] ------------[ cut here ]------------
    [478381.265377] ------------[ cut here ]------------

    Signed-off-by: Jens Axboe
    Reviewed-by: Jeff Moyer

    Jens Axboe
     

08 Jun, 2016

6 commits


05 Apr, 2016

1 commit

  • PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced a *long* time
    ago with the promise that one day it would be possible to implement page
    cache with bigger chunks than PAGE_SIZE.

    This promise never materialized. And unlikely will.

    We have many places where PAGE_CACHE_SIZE is assumed to be equal to
    PAGE_SIZE. And it's a constant source of confusion on whether a
    PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
    especially on the border between fs and mm.

    Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause too much
    breakage to be doable.

    Let's stop pretending that pages in page cache are special. They are
    not.

    The changes are pretty straight-forward:

    - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> E;

    - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> E;

    - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};

    - page_cache_get() -> get_page();

    - page_cache_release() -> put_page();

    This patch contains automated changes generated with coccinelle using
    the script below. For some reason, coccinelle doesn't patch header files.
    I've called spatch for them manually.

    The only adjustment after coccinelle is a revert of the changes to the
    PAGE_CACHE_ALIGN definition: we are going to drop it later.

    There are a few places in the code that coccinelle didn't reach. I'll
    fix them manually in a separate patch. Comments and documentation will
    also be addressed in a separate patch.

    virtual patch

    @@
    expression E;
    @@
    - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
    + E

    @@
    expression E;
    @@
    - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
    + E

    @@
    @@
    - PAGE_CACHE_SHIFT
    + PAGE_SHIFT

    @@
    @@
    - PAGE_CACHE_SIZE
    + PAGE_SIZE

    @@
    @@
    - PAGE_CACHE_MASK
    + PAGE_MASK

    @@
    expression E;
    @@
    - PAGE_CACHE_ALIGN(E)
    + PAGE_ALIGN(E)

    @@
    expression E;
    @@
    - page_cache_get(E)
    + get_page(E)

    @@
    expression E;
    @@
    - page_cache_release(E)
    + put_page(E)

    Signed-off-by: Kirill A. Shutemov
    Acked-by: Michal Hocko
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

05 Feb, 2016

4 commits

  • Currently we don't allow sync workload of one cgroup to preempt sync
    workload of any other cgroup. This is because we want to achieve service
    separation between cgroups. However, in cases where the preempting cgroup
    is an ancestor of the current cgroup, there is no need for separation and
    idling introduces unnecessary overhead. This hurts, for example, the case
    when a workload is isolated within a cgroup but journalling threads are in
    the root cgroup. A simple way to demonstrate the issue is using:

    dbench4 -c /usr/share/dbench4/client.txt -t 10 -D /mnt 1

    on an ext4 filesystem on a plain SATA drive (mounted with barrier=0 to make
    the difference more visible). When all processes are in the root cgroup,
    reported throughput is 153.132 MB/sec. When dbench process gets its own
    blkio cgroup, reported throughput drops to 26.1006 MB/sec.

    Fix the problem by making the check in cfq_should_preempt() more benevolent
    and allowing preemption by an ancestor cgroup. This improves the throughput
    reported by dbench4 to 48.9106 MB/sec.

    Acked-by: Tejun Heo
    Signed-off-by: Jan Kara
    Signed-off-by: Jens Axboe

    Jan Kara
     
  • The original idea with preemption of sync noidle queues (introduced in
    commit 718eee0579b8 "cfq-iosched: fairness for sync no-idle queues") was
    that we service all sync noidle queues together, we don't idle on any of
    the queues individually and we idle only if there is no sync noidle
    queue to be served. This intention also matches the original test:

    if (cfqd->serving_type == SYNC_NOIDLE_WORKLOAD
        && new_cfqq->service_tree == cfqq->service_tree)
            return true;

    However since at that time cfqq->service_tree was not set for idling
    queues, this test was unreliable and was replaced in commit e4a229196a7c
    "cfq-iosched: fix no-idle preemption logic" by:

    if (cfqd->serving_type == SYNC_NOIDLE_WORKLOAD &&
        cfqq_type(new_cfqq) == SYNC_NOIDLE_WORKLOAD &&
        new_cfqq->service_tree->count == 1)
            return true;

    That was a reliable test but was actually doing something different -
    now we preempt sync noidle queue only if the new queue is the only one
    busy in the service tree.

    These days a cfq queue is kept in the service tree even if it is idling and
    thus the original check would be safe again. But since we actually check
    that cfq queues are in the same cgroup, of the same priority class and
    workload type (sync noidle), we know that new_cfqq is fine to preempt
    cfqq. So just remove the service tree check.

    Acked-by: Tejun Heo
    Signed-off-by: Jan Kara
    Signed-off-by: Jens Axboe

    Jan Kara
     
  • Move check for preemption by rt class up. There is no functional change
    but it makes arguing about conditions simpler since we can be sure both
    cfq queues are from the same ioprio class.

    Acked-by: Tejun Heo
    Signed-off-by: Jan Kara
    Signed-off-by: Jens Axboe

    Jan Kara
     
  • There is no point in idling on a cfq group if the only cfq queue that is
    there has too big a thinktime.

    Signed-off-by: Jan Kara
    Acked-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Jan Kara
     

18 Sep, 2015

1 commit

  • cgroup_on_dfl() tests whether the cgroup's root is the default
    hierarchy; however, an individual controller is only interested in
    whether the controller is attached to the default hierarchy and never
    tests a cgroup which doesn't belong to the hierarchy that the
    controller is attached to.

    This patch replaces cgroup_on_dfl() tests in controllers with faster
    static_key based cgroup_subsys_on_dfl(). This leaves cgroup core as
    the only user of cgroup_on_dfl() and the function is moved from the
    header file to cgroup.c.
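
    In blkcg terms the change has the following shape (the call site and the
    argument on the first line are illustrative, not an exact hunk):

    -   if (cgroup_on_dfl(blkcg->css.cgroup))
    +   if (cgroup_subsys_on_dfl(io_cgrp_subsys))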

    Signed-off-by: Tejun Heo
    Acked-by: Zefan Li
    Cc: Vivek Goyal
    Cc: Jens Axboe
    Cc: Johannes Weiner
    Cc: Michal Hocko

    Tejun Heo
     

19 Aug, 2015

6 commits

  • cgroup is trying to make the interface consistent across different
    controllers. For weight based resource control, the knob should have
    the range [1, 10000] and default to 100. This patch updates
    cfq-iosched so that the weight range conforms. The internal
    calculations have enough range and the widening of the weight range
    shouldn't cause any problem.

    * blkcg_policy->cpd_bind_fn() is added. If present, this is invoked
    when blkcg is attached to a hierarchy.

    * cfq_cpd_init() is updated to use the new default value on the
    unified hierarchy.

    * cfq_cpd_bind() callback is implemented to clear per-blkg configs and
    apply the default config matching the hierarchy type.

    * cfqd->root_group->[leaf_]weight initialization in cfq_init_queue()
    is moved into !CONFIG_CFQ_GROUP_IOSCHED block. cfq_cpd_bind() is
    now responsible for initializing the initial weights when blkcg is
    enabled.

    Signed-off-by: Tejun Heo
    Cc: Vivek Goyal
    Cc: Arianna Avanzini
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • blkcg is gonna switch to cgroup common weight range as defined by
    CGROUP_WEIGHT_* on the unified hierarchy. In preparation, rename
    CFQ_WEIGHT_* constants to CFQ_WEIGHT_LEGACY_*.

    Signed-off-by: Tejun Heo
    Cc: Vivek Goyal
    Cc: Arianna Avanzini
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • blkcg interface grew to be the biggest of all controllers and
    unfortunately most inconsistent too. The interface files are
    inconsistent with a number of close duplicates. Some files have
    recursive variants while others don't. There's a distinction between
    normal and leaf weights which isn't intuitive and there are a lot of
    stat knobs which don't make much sense outside of debugging and expose
    too many implementation details to userland.

    In the unified hierarchy, everything is always hierarchical and
    internal nodes can't have tasks, rendering moot the two structural issues
    twisting the current interface. The interface has to be updated in a
    significant way anyway and this is a good chance to revamp it as a whole.
    This patch implements blkcg interface for the unified hierarchy.

    * (from a previous patch) blkcg is identified by "io" instead of
    "blkio" on the unified hierarchy. Given that the whole interface is
    updated anyway, the rename shouldn't carry noticeable conversion
    overhead.

    * The original interface, which consisted of 27 files, is replaced with
    the following three files.

    blkio.stat : per-blkcg stats
    blkio.weight : per-cgroup and per-cgroup-queue weight settings
    blkio.max : per-cgroup-queue bps and iops max limits

    Documentation/cgroups/unified-hierarchy.txt updated accordingly.

    v2: blkcg_policy->dfl_cftypes wasn't removed on
    blkcg_policy_unregister() corrupting the cftypes list. Fixed.

    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • * Export blkg_dev_name()

    * Drop unnecessary @cft from __cfq_set_weight().

    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • Currently, blkg_conf_prep() expects input to be of the following form

    MAJ:MIN NUM

    and reads the NUM part into blkg_conf_ctx->v. This is quite
    restrictive and gets in the way of implementing the blkcg interface for
    the unified hierarchy. This patch updates blkg_conf_prep() so that it
    expects

    MAJ:MIN BODY_STR

    where BODY_STR is an arbitrary string. blkg_conf_ctx->v is replaced
    with ->body which is a char pointer pointing to the start of BODY_STR.
    Parsing of the body is moved to blkg_conf_prep()'s callers.

    To allow using, for example, strsep() on blkg_conf_ctx->body, it is a
    non-const pointer, and to accommodate that, const is dropped from @input
    too.

    This doesn't cause any behavior changes.
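
    A small userspace illustration of the new input shape (purely
    illustrative; the kernel function itself also resolves the device and
    looks up the blkg):

    #include <stdio.h>
    #include <string.h>

    /* split "MAJ:MIN BODY_STR" into its parts; *body points into input */
    static int split_conf(char *input, unsigned int *maj, unsigned int *min,
                          char **body)
    {
        int consumed = 0;

        if (sscanf(input, "%u:%u%n", maj, min, &consumed) != 2)
            return -1;
        *body = input + consumed;
        *body += strspn(*body, " \t");  /* body starts after the whitespace */
        return 0;
    }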

    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • blkcg is about to grow an interface for the unified hierarchy. Add a
    legacy prefix to the existing cftypes.

    * blkcg_policy->cftypes -> blkcg_policy->legacy_cftypes
    * blk-cgroup.c:blkcg_files -> blkcg_legacy_files
    * cfq-iosched.c:cfq_blkcg_files -> cfq_blkcg_legacy_files
    * blk-throttle.c:throtl_files -> throtl_legacy_files

    Pure renames. No functional change.

    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo