17 Jun, 2017

1 commit

  • commit 223220356d5ebc05ead9a8d697abb0c0a906fc81 upstream.

    The code in block/partitions/msdos.c recognizes FreeBSD, OpenBSD
    and NetBSD partitions and does a reasonable job picking out OpenBSD
    and NetBSD UFS subpartitions.

    But for FreeBSD the subpartitions are always "bad".

    Kernel:
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Richard
     

14 Jun, 2017

1 commit

  • commit 5be6b75610cefd1e21b98a218211922c2feb6e08 upstream.

    When adding a cfq_group into the cfq service tree, we use CFQ_IDLE_DELAY
    as the delay of cfq_group's vdisktime if there have been other cfq_groups
    already.

    When cfq is under iops mode, commit 9a7f38c42c2b ("cfq-iosched: Convert
    from jiffies to nanoseconds") could result in a large iops delay and
    lead to an abnormal io schedule delay for the added cfq_group. To fix
    it, we just need to revert to the old CFQ_IDLE_DELAY value: HZ / 5
    when iops mode is enabled.

    Despite having the same value, the delay of a cfq_queue in idle class
    and the delay of cfq_group are different things, so I define two new
    macros for the delay of a cfq_group under time-slice mode and iops mode.
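
    In code terms the idea is roughly the following (a sketch reconstructed
    from the description above; the macro and helper names are illustrative,
    not necessarily the ones used in the patch):

    /* delay applied to a newly added cfq_group's vdisktime (sketch) */
    #define CFQ_SLICE_MODE_GROUP_DELAY   (NSEC_PER_SEC / 5)  /* time-slice mode, ns */
    #define CFQ_IOPS_MODE_GROUP_DELAY    (HZ / 5)            /* iops mode, old value */

    static u64 cfq_group_vdisktime_delay(struct cfq_data *cfqd)
    {
            /* iops_mode() already exists in cfq-iosched.c */
            return iops_mode(cfqd) ? CFQ_IOPS_MODE_GROUP_DELAY
                                   : CFQ_SLICE_MODE_GROUP_DELAY;
    }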

    Fixes: 9a7f38c42c2b ("cfq-iosched: Convert from jiffies to nanoseconds")
    Signed-off-by: Hou Tao
    Acked-by: Jan Kara
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Hou Tao
     

20 May, 2017

1 commit

  • commit 2859323e35ab5fc42f351fbda23ab544eaa85945 upstream.

    When registering an integrity profile: if the template's interval_exp is
    not 0 use it, otherwise use the ilog2() of logical block size of the
    provided gendisk.

    This fixes a long-standing DM linear target bug where it cannot pass
    integrity data to the underlying device if its logical block size
    conflicts with the underlying device's logical block size.
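
    Concretely, the registration logic described above boils down to something
    like this (a sketch, not the verbatim patch):

    /* bi is the integrity profile being filled in for this gendisk */
    bi->interval_exp = template->interval_exp ?
                       template->interval_exp :
                       ilog2(queue_logical_block_size(disk->queue));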

    Reported-by: Mikulas Patocka
    Signed-off-by: Mike Snitzer
    Acked-by: Martin K. Petersen
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Mike Snitzer
     

14 May, 2017

1 commit

  • commit 19b7ccf8651df09d274671b53039c672a52ad84d upstream.

    Commit 25520d55cdb6 ("block: Inline blk_integrity in struct gendisk")
    introduced blk_integrity_revalidate(), which seems to assume ownership
    of the stable pages flag and unilaterally clears it if no blk_integrity
    profile is registered:

    if (bi->profile)
            disk->queue->backing_dev_info->capabilities |=
                    BDI_CAP_STABLE_WRITES;
    else
            disk->queue->backing_dev_info->capabilities &=
                    ~BDI_CAP_STABLE_WRITES;

    It's called from revalidate_disk() and rescan_partitions(), making it
    impossible to enable stable pages for drivers that support partitions
    and don't use blk_integrity: while the call in revalidate_disk() can be
    trivially worked around (see zram, which doesn't support partitions and
    hence gets away with zram_revalidate_disk()), rescan_partitions() can
    be triggered from userspace at any time. This breaks rbd, where the
    ceph messenger is responsible for generating/verifying CRCs.

    Since blk_integrity_{un,}register() "must" be used for (un)registering
    the integrity profile with the block layer, move BDI_CAP_STABLE_WRITES
    setting there. This way drivers that call blk_integrity_register() and
    use integrity infrastructure won't interfere with drivers that don't
    but still want stable pages.
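
    A sketch of the resulting behaviour (the helper name is hypothetical; the
    member access follows the upstream snippet quoted above, while the
    backport note below uses the embedded bdi):

    /* Only the integrity (un)registration paths touch the flag now. */
    static void blk_integrity_set_stable_pages(struct gendisk *disk, bool on)
    {
            if (on)         /* called from blk_integrity_register() */
                    disk->queue->backing_dev_info->capabilities |=
                            BDI_CAP_STABLE_WRITES;
            else            /* called from blk_integrity_unregister() */
                    disk->queue->backing_dev_info->capabilities &=
                            ~BDI_CAP_STABLE_WRITES;
    }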

    Fixes: 25520d55cdb6 ("block: Inline blk_integrity in struct gendisk")
    Cc: "Martin K. Petersen"
    Cc: Christoph Hellwig
    Cc: Mike Snitzer
    Tested-by: Dan Williams
    Signed-off-by: Ilya Dryomov
    [idryomov@gmail.com: backport to < 4.11: bdi is embedded in queue]
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Ilya Dryomov
     

18 Apr, 2017

1 commit

  • commit 36e1f3d107867b25c616c2fd294f5a1c9d4e5d09 upstream.

    While stressing memory and IO at the same time we changed SMT settings,
    we were able to consistently trigger deadlocks in the mm system, which
    froze the entire machine.

    I think that under memory stress conditions, the large allocations
    performed by blk_mq_init_rq_map may trigger a reclaim, which stalls
    waiting on the block layer remapping completion, thus deadlocking the
    system. The trace below was collected after the machine stalled,
    waiting for the hotplug event completion.

    The simplest fix for this is to make allocations in this path
    non-reclaimable, with GFP_NOIO. With this patch, we couldn't hit the
    issue anymore.
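
    As a sketch, the allocations in blk_mq_init_rq_map() end up looking
    something like this (the exact flag combination shown is illustrative):

    /* must not recurse into IO-issuing reclaim from the remap path */
    tags->rqs = kzalloc_node(set->queue_depth * sizeof(struct request *),
                             GFP_NOIO | __GFP_NOWARN | __GFP_NORETRY,
                             set->numa_node);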

    This should apply on top of Jens's for-next branch cleanly.

    Changes since v1:
    - Use GFP_NOIO instead of GFP_NOWAIT.

    Call Trace:
    [c000000f0160aaf0] [c000000f0160ab50] 0xc000000f0160ab50 (unreliable)
    [c000000f0160acc0] [c000000000016624] __switch_to+0x2e4/0x430
    [c000000f0160ad20] [c000000000b1a880] __schedule+0x310/0x9b0
    [c000000f0160ae00] [c000000000b1af68] schedule+0x48/0xc0
    [c000000f0160ae30] [c000000000b1b4b0] schedule_preempt_disabled+0x20/0x30
    [c000000f0160ae50] [c000000000b1d4fc] __mutex_lock_slowpath+0xec/0x1f0
    [c000000f0160aed0] [c000000000b1d678] mutex_lock+0x78/0xa0
    [c000000f0160af00] [d000000019413cac] xfs_reclaim_inodes_ag+0x33c/0x380 [xfs]
    [c000000f0160b0b0] [d000000019415164] xfs_reclaim_inodes_nr+0x54/0x70 [xfs]
    [c000000f0160b0f0] [d0000000194297f8] xfs_fs_free_cached_objects+0x38/0x60 [xfs]
    [c000000f0160b120] [c0000000003172c8] super_cache_scan+0x1f8/0x210
    [c000000f0160b190] [c00000000026301c] shrink_slab.part.13+0x21c/0x4c0
    [c000000f0160b2d0] [c000000000268088] shrink_zone+0x2d8/0x3c0
    [c000000f0160b380] [c00000000026834c] do_try_to_free_pages+0x1dc/0x520
    [c000000f0160b450] [c00000000026876c] try_to_free_pages+0xdc/0x250
    [c000000f0160b4e0] [c000000000251978] __alloc_pages_nodemask+0x868/0x10d0
    [c000000f0160b6f0] [c000000000567030] blk_mq_init_rq_map+0x160/0x380
    [c000000f0160b7a0] [c00000000056758c] blk_mq_map_swqueue+0x33c/0x360
    [c000000f0160b820] [c000000000567904] blk_mq_queue_reinit+0x64/0xb0
    [c000000f0160b850] [c00000000056a16c] blk_mq_queue_reinit_notify+0x19c/0x250
    [c000000f0160b8a0] [c0000000000f5d38] notifier_call_chain+0x98/0x100
    [c000000f0160b8f0] [c0000000000c5fb0] __cpu_notify+0x70/0xe0
    [c000000f0160b930] [c0000000000c63c4] notify_prepare+0x44/0xb0
    [c000000f0160b9b0] [c0000000000c52f4] cpuhp_invoke_callback+0x84/0x250
    [c000000f0160ba10] [c0000000000c570c] cpuhp_up_callbacks+0x5c/0x120
    [c000000f0160ba60] [c0000000000c7cb8] _cpu_up+0xf8/0x1d0
    [c000000f0160bac0] [c0000000000c7eb0] do_cpu_up+0x120/0x150
    [c000000f0160bb40] [c0000000006fe024] cpu_subsys_online+0x64/0xe0
    [c000000f0160bb90] [c0000000006f5124] device_online+0xb4/0x120
    [c000000f0160bbd0] [c0000000006f5244] online_store+0xb4/0xc0
    [c000000f0160bc20] [c0000000006f0a68] dev_attr_store+0x68/0xa0
    [c000000f0160bc60] [c0000000003ccc30] sysfs_kf_write+0x80/0xb0
    [c000000f0160bca0] [c0000000003cbabc] kernfs_fop_write+0x17c/0x250
    [c000000f0160bcf0] [c00000000030fe6c] __vfs_write+0x6c/0x1e0
    [c000000f0160bd90] [c000000000311490] vfs_write+0xd0/0x270
    [c000000f0160bde0] [c0000000003131fc] SyS_write+0x6c/0x110
    [c000000f0160be30] [c000000000009204] system_call+0x38/0xec

    Signed-off-by: Gabriel Krisman Bertazi
    Cc: Brian King
    Cc: Douglas Miller
    Cc: linux-block@vger.kernel.org
    Cc: linux-scsi@vger.kernel.org
    Signed-off-by: Jens Axboe
    Signed-off-by: Sumit Semwal
    Signed-off-by: Greg Kroah-Hartman

    Gabriel Krisman Bertazi
     

08 Apr, 2017

2 commits

  • commit f5fe1b51905df7cfe4fdfd85c5fb7bc5b71a094f upstream.

    Commit 79bd99596b73 ("blk: improve order of bio handling in generic_make_request()")
    changed current->bio_list so that it did not contain *all* of the
    queued bios, but only those submitted by the currently running
    make_request_fn.

    There are two places which walk the list and requeue selected bios,
    and others that check if the list is empty. These are no longer
    correct.

    So redefine current->bio_list to point to an array of two lists, which
    contain all queued bios, and adjust various code to test or walk both
    lists.
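
    A sketch of the new arrangement in generic_make_request() (reconstructed
    from the description, not the verbatim patch):

    /* [0]: bios queued by the currently running make_request_fn
     * [1]: bios that were already queued when it was entered     */
    struct bio_list bio_list_on_stack[2];

    bio_list_init(&bio_list_on_stack[0]);
    bio_list_init(&bio_list_on_stack[1]);
    current->bio_list = bio_list_on_stack;

    /* an "is anything queued?" test must now look at both lists */
    bool queued = !bio_list_empty(&current->bio_list[0]) ||
                  !bio_list_empty(&current->bio_list[1]);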

    Signed-off-by: NeilBrown
    Fixes: 79bd99596b73 ("blk: improve order of bio handling in generic_make_request()")
    Signed-off-by: Jens Axboe
    Cc: Jack Wang
    Signed-off-by: Greg Kroah-Hartman

    NeilBrown
     
  • commit 79bd99596b7305ab08109a8bf44a6a4511dbf1cd upstream.

    To avoid recursion on the kernel stack when stacked block devices
    are in use, generic_make_request() will, when called recursively,
    queue new requests for later handling. They will be handled when the
    make_request_fn for the current bio completes.

    If any bios are submitted by a make_request_fn, these will ultimately
    be handled sequentially. If the handling of one of those generates
    further requests, they will be added to the end of the queue.

    This strict first-in-first-out behaviour can lead to deadlocks in
    various ways, normally because a request might need to wait for a
    previous request to the same device to complete. This can happen when
    they share a mempool, and can happen due to interdependencies
    particular to the device. Both md and dm have examples where this happens.

    These deadlocks can be eradicated by more selective ordering of bios.
    Specifically by handling them in depth-first order. That is: when the
    handling of one bio generates one or more further bios, they are
    handled immediately after the parent, before any siblings of the
    parent. That way, when generic_make_request() calls make_request_fn
    for some particular device, we can be certain that all previously
    submitted requests for that device have been completely handled and are
    not waiting for anything in the queue of requests maintained in
    generic_make_request().

    An easy way to achieve this would be to use a last-in-first-out stack
    instead of a queue. However this will change the order of consecutive
    bios submitted by a make_request_fn, which could have unexpected consequences.
    Instead we take a slightly more complex approach.
    A fresh queue is created for each call to a make_request_fn. After it completes,
    any bios for a different device are placed on the front of the main queue, followed
    by any bios for the same device, followed by all bios that were already on
    the queue before the make_request_fn was called.
    This provides the depth-first approach without reordering bios on the same level.

    This, by itself, is not enough to remove all deadlocks. It just makes
    it possible for drivers to take the extra step required themselves.

    To avoid deadlocks, drivers must never risk waiting for a request
    after submitting one to generic_make_request. This includes never
    allocating from a mempool twice in the same call to a make_request_fn.

    A common pattern in drivers is to call bio_split() in a loop, handling
    the first part and then looping around to possibly split the next part.
    Instead, a driver that finds it needs to split a bio should queue
    (with generic_make_request) the second part, handle the first part,
    and then return. The new code in generic_make_request will ensure the
    requests to underlying bios are processed first, then the second bio
    that was split off. If it splits again, the same process happens. In
    each case one bio will be completely handled before the next one is attempted.
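
    The recommended pattern looks roughly like this (a sketch in the style
    used by md; "sectors" stands for whatever limit the driver computed, and
    fs_bio_set is used here only for illustration):

    if (sectors < bio_sectors(bio)) {
            struct bio *split = bio_split(bio, sectors, GFP_NOIO, fs_bio_set);

            bio_chain(split, bio);
            generic_make_request(bio);      /* queue the remainder, don't recurse */
            bio = split;                    /* carry on with the first part only */
    }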

    With this in place, it should be possible to disable the
    punt_bios_to_recover() recovery thread for many block devices, and
    eventually it may be possible to remove it completely.

    Ref: http://www.spinics.net/lists/raid/msg54680.html
    Tested-by: Jinpu Wang
    Inspired-by: Lars Ellenberg
    Signed-off-by: NeilBrown
    Signed-off-by: Jens Axboe
    Cc: Jack Wang
    Signed-off-by: Greg Kroah-Hartman

    NeilBrown
     

30 Mar, 2017

1 commit

  • commit 95a49603707d982b25d17c5b70e220a05556a2f9 upstream.

    When iterating busy requests in the timeout handler,
    if the STARTED flag of a request isn't set, that means
    the request is being processed in the block layer or driver and
    hasn't been submitted to hardware yet.

    In the current implementation of blk_mq_check_expired(),
    if the request queue becomes dying, un-started requests are
    handled as being completed/freed immediately. This is wrong,
    and can cause rq corruption or double allocation[1][2]
    when doing I/O and removing & resetting an NVMe device at the same time.

    This patch fixes several issues reported by Yi Zhang.
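
    A rough sketch of the fixed check (reconstructed from the description;
    field and helper names follow the blk-mq internals of that era):

    static void blk_mq_check_expired(struct blk_mq_hw_ctx *hctx,
                                     struct request *rq, void *priv, bool reserved)
    {
            struct blk_mq_timeout_data *data = priv;

            /* not started: still owned by the block layer/driver, so skip it
             * even if the queue is dying */
            if (!test_bit(REQ_ATOM_STARTED, &rq->atomic_flags))
                    return;

            if (time_after_eq(jiffies, rq->deadline))
                    blk_mq_rq_timed_out(rq, reserved);
            else if (!data->next_set || time_after(data->next, rq->deadline)) {
                    data->next = rq->deadline;
                    data->next_set = 1;
            }
    }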

    [1]. oops log 1
    [ 581.789754] ------------[ cut here ]------------
    [ 581.789758] kernel BUG at block/blk-mq.c:374!
    [ 581.789760] invalid opcode: 0000 [#1] SMP
    [ 581.789761] Modules linked in: vfat fat ipmi_ssif intel_rapl sb_edac
    edac_core x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm nvme
    irqbypass crct10dif_pclmul nvme_core crc32_pclmul ghash_clmulni_intel
    intel_cstate ipmi_si mei_me ipmi_devintf intel_uncore sg ipmi_msghandler
    intel_rapl_perf iTCO_wdt mei iTCO_vendor_support mxm_wmi lpc_ich dcdbas shpchp
    pcspkr acpi_power_meter wmi nfsd auth_rpcgss nfs_acl lockd dm_multipath grace
    sunrpc ip_tables xfs libcrc32c sd_mod mgag200 i2c_algo_bit drm_kms_helper
    syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm ahci libahci
    crc32c_intel tg3 libata megaraid_sas i2c_core ptp fjes pps_core dm_mirror
    dm_region_hash dm_log dm_mod
    [ 581.789796] CPU: 1 PID: 1617 Comm: kworker/1:1H Not tainted 4.10.0.bz1420297+ #4
    [ 581.789797] Hardware name: Dell Inc. PowerEdge R730xd/072T6D, BIOS 2.2.5 09/06/2016
    [ 581.789804] Workqueue: kblockd blk_mq_timeout_work
    [ 581.789806] task: ffff8804721c8000 task.stack: ffffc90006ee4000
    [ 581.789809] RIP: 0010:blk_mq_end_request+0x58/0x70
    [ 581.789810] RSP: 0018:ffffc90006ee7d50 EFLAGS: 00010202
    [ 581.789811] RAX: 0000000000000001 RBX: ffff8802e4195340 RCX: ffff88028e2f4b88
    [ 581.789812] RDX: 0000000000001000 RSI: 0000000000001000 RDI: 0000000000000000
    [ 581.789813] RBP: ffffc90006ee7d60 R08: 0000000000000003 R09: ffff88028e2f4b00
    [ 581.789814] R10: 0000000000001000 R11: 0000000000000001 R12: 00000000fffffffb
    [ 581.789815] R13: ffff88042abe5780 R14: 000000000000002d R15: ffff88046fbdff80
    [ 581.789817] FS: 0000000000000000(0000) GS:ffff88047fc00000(0000) knlGS:0000000000000000
    [ 581.789818] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [ 581.789819] CR2: 00007f64f403a008 CR3: 000000014d078000 CR4: 00000000001406e0
    [ 581.789820] Call Trace:
    [ 581.789825] blk_mq_check_expired+0x76/0x80
    [ 581.789828] bt_iter+0x45/0x50
    [ 581.789830] blk_mq_queue_tag_busy_iter+0xdd/0x1f0
    [ 581.789832] ? blk_mq_rq_timed_out+0x70/0x70
    [ 581.789833] ? blk_mq_rq_timed_out+0x70/0x70
    [ 581.789840] ? __switch_to+0x140/0x450
    [ 581.789841] blk_mq_timeout_work+0x88/0x170
    [ 581.789845] process_one_work+0x165/0x410
    [ 581.789847] worker_thread+0x137/0x4c0
    [ 581.789851] kthread+0x101/0x140
    [ 581.789853] ? rescuer_thread+0x3b0/0x3b0
    [ 581.789855] ? kthread_park+0x90/0x90
    [ 581.789860] ret_from_fork+0x2c/0x40
    [ 581.789861] Code: 48 85 c0 74 0d 44 89 e6 48 89 df ff d0 5b 41 5c 5d c3 48
    8b bb 70 01 00 00 48 85 ff 75 0f 48 89 df e8 7d f0 ff ff 5b 41 5c 5d c3
    0b e8 71 f0 ff ff 90 eb e9 0f 1f 40 00 66 2e 0f 1f 84 00 00
    [ 581.789882] RIP: blk_mq_end_request+0x58/0x70 RSP: ffffc90006ee7d50
    [ 581.789889] ---[ end trace bcaf03d9a14a0a70 ]---

    [2]. oops log2
    [ 6984.857362] BUG: unable to handle kernel NULL pointer dereference at 0000000000000010
    [ 6984.857372] IP: nvme_queue_rq+0x6e6/0x8cd [nvme]
    [ 6984.857373] PGD 0
    [ 6984.857374]
    [ 6984.857376] Oops: 0000 [#1] SMP
    [ 6984.857379] Modules linked in: ipmi_ssif vfat fat intel_rapl sb_edac
    edac_core x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm
    irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel ipmi_si iTCO_wdt
    iTCO_vendor_support mxm_wmi ipmi_devintf intel_cstate sg dcdbas intel_uncore
    mei_me intel_rapl_perf mei pcspkr lpc_ich ipmi_msghandler shpchp
    acpi_power_meter wmi nfsd auth_rpcgss dm_multipath nfs_acl lockd grace sunrpc
    ip_tables xfs libcrc32c sd_mod mgag200 i2c_algo_bit drm_kms_helper syscopyarea
    sysfillrect crc32c_intel sysimgblt fb_sys_fops ttm nvme drm nvme_core ahci
    libahci i2c_core tg3 libata ptp megaraid_sas pps_core fjes dm_mirror
    dm_region_hash dm_log dm_mod
    [ 6984.857416] CPU: 7 PID: 1635 Comm: kworker/7:1H Not tainted
    4.10.0-2.el7.bz1420297.x86_64 #1
    [ 6984.857417] Hardware name: Dell Inc. PowerEdge R730xd/072T6D, BIOS 2.2.5 09/06/2016
    [ 6984.857427] Workqueue: kblockd blk_mq_run_work_fn
    [ 6984.857429] task: ffff880476e3da00 task.stack: ffffc90002e90000
    [ 6984.857432] RIP: 0010:nvme_queue_rq+0x6e6/0x8cd [nvme]
    [ 6984.857433] RSP: 0018:ffffc90002e93c50 EFLAGS: 00010246
    [ 6984.857434] RAX: 0000000000000000 RBX: ffff880275646600 RCX: 0000000000001000
    [ 6984.857435] RDX: 0000000000000fff RSI: 00000002fba2a000 RDI: ffff8804734e6950
    [ 6984.857436] RBP: ffffc90002e93d30 R08: 0000000000002000 R09: 0000000000001000
    [ 6984.857437] R10: 0000000000001000 R11: 0000000000000000 R12: ffff8804741d8000
    [ 6984.857438] R13: 0000000000000040 R14: ffff880475649f80 R15: ffff8804734e6780
    [ 6984.857439] FS: 0000000000000000(0000) GS:ffff88047fcc0000(0000) knlGS:0000000000000000
    [ 6984.857440] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [ 6984.857442] CR2: 0000000000000010 CR3: 0000000001c09000 CR4: 00000000001406e0
    [ 6984.857443] Call Trace:
    [ 6984.857451] ? mempool_free+0x2b/0x80
    [ 6984.857455] ? bio_free+0x4e/0x60
    [ 6984.857459] blk_mq_dispatch_rq_list+0xf5/0x230
    [ 6984.857462] blk_mq_process_rq_list+0x133/0x170
    [ 6984.857465] __blk_mq_run_hw_queue+0x8c/0xa0
    [ 6984.857467] blk_mq_run_work_fn+0x12/0x20
    [ 6984.857473] process_one_work+0x165/0x410
    [ 6984.857475] worker_thread+0x137/0x4c0
    [ 6984.857478] kthread+0x101/0x140
    [ 6984.857480] ? rescuer_thread+0x3b0/0x3b0
    [ 6984.857481] ? kthread_park+0x90/0x90
    [ 6984.857489] ret_from_fork+0x2c/0x40
    [ 6984.857490] Code: 8b bd 70 ff ff ff 89 95 50 ff ff ff 89 8d 58 ff ff ff 44
    89 95 60 ff ff ff e8 b7 dd 12 e1 8b 95 50 ff ff ff 48 89 85 68 ff ff ff
    8b 48 10 44 8b 58 18 8b 8d 58 ff ff ff 44 8b 95 60 ff ff ff
    [ 6984.857511] RIP: nvme_queue_rq+0x6e6/0x8cd [nvme] RSP: ffffc90002e93c50
    [ 6984.857512] CR2: 0000000000000010
    [ 6984.895359] ---[ end trace 2d7ceb528432bf83 ]---

    Reported-by: Yi Zhang
    Tested-by: Yi Zhang
    Reviewed-by: Bart Van Assche
    Reviewed-by: Hannes Reinecke
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Ming Lei
     

22 Mar, 2017

1 commit

  • [ Upstream commit 25cdb64510644f3e854d502d69c73f21c6df88a9 ]

    The WRITE_SAME commands are not present in the blk_default_cmd_filter
    write_ok list, and thus are failed with -EPERM when the SG_IO ioctl()
    is executed without CAP_SYS_RAWIO capability (e.g., unprivileged users).
    [ sg_io() -> blk_fill_sghdr_rq() -> blk_verify_command() -> -EPERM ]

    The problem can be reproduced with the sg_write_same command

    # sg_write_same --num 1 --xferlen 512 /dev/sda
    #

    # capsh --drop=cap_sys_rawio -- -c \
    'sg_write_same --num 1 --xferlen 512 /dev/sda'
    Write same: pass through os error: Operation not permitted
    #

    For comparison, the WRITE_VERIFY command does not observe this problem,
    since it is in that list:

    # capsh --drop=cap_sys_rawio -- -c \
    'sg_write_verify --num 1 --ilen 512 --lba 0 /dev/sda'
    #

    So, this patch adds the WRITE_SAME commands to the list, in order
    for the SG_IO ioctl to finish successfully:

    # capsh --drop=cap_sys_rawio -- -c \
    'sg_write_same --num 1 --xferlen 512 /dev/sda'
    #
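
    The change itself is small; roughly (a sketch, and the exact set of
    opcodes added upstream may differ), the default filter gains:

    /* in blk_set_cmd_filter_defaults(): allow WRITE SAME on write-opened fds */
    __set_bit(WRITE_SAME, filter->write_ok);
    __set_bit(WRITE_SAME_16, filter->write_ok);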

    That case happens to be exercised by QEMU KVM guests with 'scsi-block' devices
    (qemu "-device scsi-block" [1], libvirt "" [2]),
    which employs the SG_IO ioctl() and runs as an unprivileged user (libvirt-qemu).

    In that scenario, when a filesystem (e.g., ext4) performs its zero-out calls,
    which are translated to write-same calls in the guest kernel, and then into
    SG_IO ioctls to the host kernel, SCSI I/O errors may be observed in the guest:

    [...] sd 0:0:0:0: [sda] tag#0 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
    [...] sd 0:0:0:0: [sda] tag#0 Sense Key : Aborted Command [current]
    [...] sd 0:0:0:0: [sda] tag#0 Add. Sense: I/O process terminated
    [...] sd 0:0:0:0: [sda] tag#0 CDB: Write Same(10) 41 00 01 04 e0 78 00 00 08 00
    [...] blk_update_request: I/O error, dev sda, sector 17096824

    Links:
    [1] http://git.qemu.org/?p=qemu.git;a=commit;h=336a6915bc7089fb20fea4ba99972ad9a97c5f52
    [2] https://libvirt.org/formatdomain.html#elementsDisks (see 'disk' -> 'device')

    Signed-off-by: Mauricio Faria de Oliveira
    Signed-off-by: Brahadambal Srinivasan
    Reported-by: Manjunatha H R
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Mauricio Faria de Oliveira
     

20 Jan, 2017

2 commits

  • commit c02ebfdddbafa9a6a0f52fbd715e6bfa229af9d3 upstream.

    Commit 0e87e58bf60e ("blk-mq: improve warning for running a queue on the
    wrong CPU") attempts to avoid triggering the WARN_ON in
    __blk_mq_run_hw_queue when the expected CPU is dead. Problem is, in the
    last batch execution before round robin, blk_mq_hctx_next_cpu can
    schedule a dead CPU and also update next_cpu to the next alive CPU in
    the mask, which will trigger the WARN_ON despite the previous
    workaround.

    The following patch fixes this scenario by always scheduling the value
    in hctx->next_cpu. This changes the moment when we round-robin the CPU
    running the hctx, but it really doesn't matter, since it still executes
    BLK_MQ_CPU_WORK_BATCH times in a row before switching to another CPU.
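
    After the fix the helper reads roughly as follows (a sketch reconstructed
    from the description above):

    static int blk_mq_hctx_next_cpu(struct blk_mq_hw_ctx *hctx)
    {
            if (hctx->queue->nr_hw_queues == 1)
                    return WORK_CPU_UNBOUND;

            if (--hctx->next_cpu_batch <= 0) {
                    int next_cpu;

                    next_cpu = cpumask_next(hctx->next_cpu, hctx->cpumask);
                    if (next_cpu >= nr_cpu_ids)
                            next_cpu = cpumask_first(hctx->cpumask);

                    hctx->next_cpu = next_cpu;
                    hctx->next_cpu_batch = BLK_MQ_CPU_WORK_BATCH;
            }

            /* always schedule on next_cpu, never on the pre-advance value */
            return hctx->next_cpu;
    }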

    Fixes: 0e87e58bf60e ("blk-mq: improve warning for running a queue on the wrong CPU")
    Signed-off-by: Gabriel Krisman Bertazi
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Gabriel Krisman Bertazi
     
  • commit ebc4ff661fbe76781c6b16dfb7b754a5d5073f8e upstream.

    cfq_cpd_alloc() which is the cpd_alloc_fn implementation for cfq was
    incorrectly hard coding GFP_KERNEL instead of using the mask specified
    through the @gfp parameter. This currently doesn't cause any actual
    issues because all current callers specify GFP_KERNEL. Fix it.
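
    The fix is essentially a one-liner (sketch):

    static struct blkcg_policy_data *cfq_cpd_alloc(gfp_t gfp)
    {
            struct cfq_group_data *cgd;

            cgd = kzalloc(sizeof(*cgd), gfp);       /* was: GFP_KERNEL */
            if (!cgd)
                    return NULL;
            return &cgd->cpd;
    }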

    Signed-off-by: Tejun Heo
    Reported-by: Dan Carpenter
    Fixes: e4a9bde9589f ("blkcg: replace blkcg_policy->cpd_size with ->cpd_alloc/free_fn() methods")
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Tejun Heo
     

09 Jan, 2017

1 commit

  • commit 128394eff343fc6d2f32172f03e24829539c5835 upstream.

    Both damn things interpret userland pointers embedded into the payload;
    worse, they are actually traversing those. Leaving aside the bad
    API design, this is very much _not_ safe to call with KERNEL_DS.
    Bail out early if that happens.
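
    The bail-out is a simple check at the top of the two write paths in
    question (a sketch; the addressing-mode test shown is the pre-4.11 idiom):

    /* refuse to parse/traverse user pointers while running under KERNEL_DS */
    if (unlikely(segment_eq(get_fs(), KERNEL_DS)))
            return -EINVAL;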

    Signed-off-by: Al Viro
    Signed-off-by: Greg Kroah-Hartman

    Al Viro
     

06 Jan, 2017

1 commit

  • commit bc27c01b5c46d3bfec42c96537c7a3fae0bb2cc4 upstream.

    The meaning of the BLK_MQ_S_STOPPED flag is "do not call
    .queue_rq()". Hence modify blk_mq_make_request() such that requests
    are queued instead of issued if a queue has been stopped.
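
    In sketch form (the surrounding call names are illustrative of that era,
    not the verbatim patch):

    if (test_bit(BLK_MQ_S_STOPPED, &hctx->state)) {
            /* do not call .queue_rq(); park the request instead */
            blk_mq_insert_request(rq, false, false, false);
            return cookie;
    }
    /* otherwise the normal direct-issue fast path may run */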

    Reported-by: Ming Lei
    Signed-off-by: Bart Van Assche
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Ming Lei
    Reviewed-by: Hannes Reinecke
    Reviewed-by: Johannes Thumshirn
    Reviewed-by: Sagi Grimberg
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Bart Van Assche
     

27 Oct, 2016

1 commit

  • If we end up sleeping due to running out of requests, we should
    update the hardware and software queues in the map ctx structure.
    Otherwise we could end up having rq->mq_ctx point to the pre-sleep
    context, and risk corrupting ctx->rq_list since we'll be
    grabbing the wrong lock when inserting the request.
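
    A sketch of what that update amounts to (reconstructed; names follow the
    blk-mq internals of that era):

    /* after sleeping for a tag we may have moved CPUs: re-derive the
     * software/hardware queue pair before storing it in the request */
    data->ctx  = blk_mq_get_ctx(q);
    data->hctx = blk_mq_map_queue(q, data->ctx->cpu);
    rq->mq_ctx = data->ctx;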

    Reported-by: Dave Jones
    Reported-by: Chris Mason
    Tested-by: Chris Mason
    Fixes: 63581af3f31e ("blk-mq: remove non-blocking pass in blk_mq_map_request")
    Signed-off-by: Jens Axboe

    Jens Axboe
     

22 Oct, 2016

2 commits

  • When badblocks_set acknowledges a range or badblocks_clear clears a range,
    it's possible all badblocks are acknowledged. We should update
    unacked_exist if this occurs.
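
    A sketch of the kind of helper this implies (the name is hypothetical):

    /* drop unacked_exist once no stored entry is unacknowledged */
    static void badblocks_update_acked(struct badblocks *bb)
    {
            u64 *p = bb->page;
            int i;

            if (!bb->unacked_exist)
                    return;

            for (i = 0; i < bb->count; i++)
                    if (!BB_ACK(p[i]))
                            return;         /* still something unacked */

            bb->unacked_exist = 0;
    }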

    Signed-off-by: Shaohua Li
    Reviewed-by: Tomasz Majchrzak
    Tested-by: Tomasz Majchrzak
    Signed-off-by: Jens Axboe

    Shaohua Li
     
  • Pull block fixes from Jens Axboe:
    "A set of fixes that missed the merge window, mostly due to me being
    away around that time.

    Nothing major here, a mix of nvme cleanups and fixes, and one fix for
    the badblocks handling"

    * 'for-linus' of git://git.kernel.dk/linux-block:
    nvmet: use symbolic constants for CNS values
    nvme: use symbolic constants for CNS values
    nvme.h: add an enum for cns values
    nvme.h: don't use uuid_be
    nvme.h: resync with nvme-cli
    nvme: Add tertiary number to NVME_VS
    nvme : Add sysfs entry for NVMe CMBs when appropriate
    nvme: don't schedule multiple resets
    nvme: Delete created IO queues on reset
    nvme: Stop probing a removed device
    badblocks: fix overlapping check for clearing

    Linus Torvalds
     

16 Oct, 2016

1 commit

  • Pull gcc plugins update from Kees Cook:
    "This adds a new gcc plugin named "latent_entropy". It is designed to
    extract as much possible uncertainty from a running system at boot
    time as possible, hoping to capitalize on any possible variation in
    CPU operation (due to runtime data differences, hardware differences,
    SMP ordering, thermal timing variation, cache behavior, etc).

    At the very least, this plugin is a much more comprehensive example
    for how to manipulate kernel code using the gcc plugin internals"

    * tag 'gcc-plugins-v4.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
    latent_entropy: Mark functions with __latent_entropy
    gcc-plugins: Add latent_entropy plugin

    Linus Torvalds
     

15 Oct, 2016

1 commit

  • Pull cgroup updates from Tejun Heo:

    - tracepoints for basic cgroup management operations added

    - kernfs and cgroup path formatting functions updated to behave in the
    style of strlcpy()

    - non-critical bug fixes

    * 'for-4.9' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
    blkcg: Unlock blkcg_pol_mutex only once when cpd == NULL
    cgroup: fix error handling regressions in proc_cgroup_show() and cgroup_release_agent()
    cpuset: fix error handling regression in proc_cpuset_show()
    cgroup: add tracepoints for basic operations
    cgroup: make cgroup_path() and friends behave in the style of strlcpy()
    kernfs: remove kernfs_path_len()
    kernfs: make kernfs_path*() behave in the style of strlcpy()
    kernfs: add dummy implementation of kernfs_path_from_node()

    Linus Torvalds
     

12 Oct, 2016

3 commits

  • The current bad block clear implementation assumes the range to clear
    overlaps with at least one bad block already stored. If the given range to
    clear precedes the first bad block in the list, the first entry is incorrectly
    updated.

    Check not only if stored block end is past clear block end but also if
    stored block start is before clear block end.
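
    In other words, the clear path should only touch the stored entry when it
    genuinely overlaps the range being cleared (a sketch, using the existing
    BB_OFFSET/BB_END helpers):

    /* s .. target is the range being cleared; p[lo] is the stored entry */
    if (BB_OFFSET(p[lo]) < target && BB_END(p[lo]) > target) {
            /* entry straddles the clear end: keep only the tail past 'target' */
    }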

    Signed-off-by: Tomasz Majchrzak
    Acked-by: NeilBrown
    Signed-off-by: Jens Axboe

    Tomasz Majchrzak
     
  • Make sure that the offset and length arguments that we're using to
    construct WRITE SAME and DISCARD requests are actually aligned to the
    logical block size. Failure to do this causes other errors in other parts
    of the block layer or the SCSI layer because disks don't support partial
    logical block writes.
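
    The check is a simple mask test against the logical block size (sketch):

    /* start sector and length are in 512-byte units */
    sector_t bs_mask = (bdev_logical_block_size(bdev) >> 9) - 1;

    if ((sector | nr_sects) & bs_mask)
            return -EINVAL;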

    Link: http://lkml.kernel.org/r/147518379026.22791.4437508871355153928.stgit@birch.djwong.org
    Signed-off-by: Darrick J. Wong
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Bart Van Assche
    Reviewed-by: Martin K. Petersen
    Reviewed-by: Hannes Reinecke
    Cc: Theodore Ts'o
    Cc: Mike Snitzer # tweaked header
    Cc: Brian Foster
    Cc: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Darrick J. Wong
     
  • Patch series "fallocate for block devices", v11.

    This is a patchset to fix page cache coherency with BLKZEROOUT and
    implement fallocate for block devices.

    The first patch is a fix to the existing BLKZEROOUT ioctl to invalidate
    the page cache if the zeroing command to the underlying device succeeds.
    Without this patch we still have the pagecache coherence bug that's been
    in the kernel forever.

    The second patch changes the internal block device functions to reject
    attempts to discard or zeroout that are not aligned to the logical block
    size. Previously, we only checked that the start/len parameters were
    512-byte aligned, which caused kernel BUG_ONs for unaligned IOs to 4k-LBA
    devices.

    The third patch creates an fallocate handler for block devices, wires up
    the FALLOC_FL_PUNCH_HOLE flag to zeroing-discard, and connects
    FALLOC_FL_ZERO_RANGE to write-same so that we can have a consistent
    fallocate interface between files and block devices. It also allows the
    combination of PUNCH_HOLE and NO_HIDE_STALE to invoke non-zeroing discard.

    Test cases for the new block device fallocate are now in xfstests as
    generic/349-351.

    This patch (of 3):

    Invalidate the page cache (as a regular O_DIRECT write would do) to avoid
    returning stale cache contents at a later time.
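
    Sketched against the BLKZEROOUT ioctl path (not the verbatim patch):

    /* invalidate cached pages over the zeroed range, dirty pages included */
    struct address_space *mapping = bdev->bd_inode->i_mapping;

    truncate_inode_pages_range(mapping, start, end);
    return blkdev_issue_zeroout(bdev, start >> 9, len >> 9, GFP_KERNEL, false);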

    Link: http://lkml.kernel.org/r/147518378313.22791.16649519283678515021.stgit@birch.djwong.org
    Signed-off-by: Darrick J. Wong
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Martin K. Petersen
    Reviewed-by: Bart Van Assche
    Reviewed-by: Hannes Reinecke
    Cc: Theodore Ts'o
    Cc: Mike Snitzer
    Cc: Brian Foster
    Cc: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Darrick J. Wong
     

11 Oct, 2016

1 commit

  • The __latent_entropy gcc attribute can be used only on functions and
    variables. If it is on a function then the plugin will instrument it for
    gathering control-flow entropy. If the attribute is on a variable then
    the plugin will initialize it with random contents. The variable must
    be an integer, an integer array type or a structure with integer fields.

    These specific functions have been selected because they are init
    functions (to help gather boot-time entropy), are called at unpredictable
    times, or they have variable loops, each of which provide some level of
    latent entropy.
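
    A hypothetical usage example (not taken from the patch) of the attribute
    on a function and on a variable:

    /* the plugin instruments this init function for control-flow entropy */
    static int __init __latent_entropy example_init(void)
    {
            return 0;
    }

    /* an integer array marked this way is seeded with random contents */
    static u64 __latent_entropy example_pool[4];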

    Signed-off-by: Emese Revfy
    [kees: expanded commit message]
    Signed-off-by: Kees Cook

    Emese Revfy
     

10 Oct, 2016

2 commits

  • Pull blk-mq CPU hotplug update from Jens Axboe:
    "This is the conversion of blk-mq to the new hotplug state machine"

    * 'for-4.9/block-smp' of git://git.kernel.dk/linux-block:
    blk-mq: fixup "Convert to new hotplug state machine"
    blk-mq: Convert to new hotplug state machine
    blk-mq/cpu-notif: Convert to new hotplug state machine

    Linus Torvalds
     
  • Pull blk-mq irq/cpu mapping updates from Jens Axboe:
    "This is the block-irq topic branch for 4.9-rc. It's mostly from
    Christoph, and it allows drivers to specify their own mappings, and
    more importantly, to share the blk-mq mappings with the IRQ affinity
    mappings. It's a good step towards making this work better out of the
    box"

    * 'for-4.9/block-irq' of git://git.kernel.dk/linux-block:
    blk_mq: linux/blk-mq.h does not include all the headers it depends on
    blk-mq: kill unused blk_mq_create_mq_map()
    blk-mq: get rid of the cpumask in struct blk_mq_tags
    nvme: remove the post_scan callout
    nvme: switch to use pci_alloc_irq_vectors
    blk-mq: provide a default queue mapping for PCI device
    blk-mq: allow the driver to pass in a queue mapping
    blk-mq: remove ->map_queue
    blk-mq: only allocate a single mq_map per tag_set
    blk-mq: don't redistribute hardware queues on a CPU hotplug event

    Linus Torvalds
     

08 Oct, 2016

1 commit

  • Pull block layer updates from Jens Axboe:
    "This is the main pull request for block layer changes in 4.9.

    As mentioned at the last merge window, I've changed things up and now
    do just one branch for core block layer changes, and driver changes.
    This avoids dependencies between the two branches. Outside of this
    main pull request, there are two topical branches coming as well.

    This pull request contains:

    - A set of fixes, and a conversion to blk-mq, of nbd. From Josef.

    - Set of fixes and updates for lightnvm from Matias, Simon, and Arnd.
    Followup dependency fix from Geert.

    - General fixes from Bart, Baoyou, Guoqing, and Linus W.

    - CFQ async write starvation fix from Glauber.

    - Add support for delayed kick of the requeue list, from Mike.

    - Pull out the scalable bitmap code from blk-mq-tag.c and make it
    generally available under the name of sbitmap. Only blk-mq-tag uses
    it for now, but the blk-mq scheduling bits will use it as well.
    From Omar.

    - bdev thaw error propagation from Pierre.

    - Improve the blk polling statistics, and allow the user to clear
    them. From Stephen.

    - Set of minor cleanups from Christoph in block/blk-mq.

    - Set of cleanups and optimizations from me for block/blk-mq.

    - Various nvme/nvmet/nvmeof fixes from the various folks"

    * 'for-4.9/block' of git://git.kernel.dk/linux-block: (54 commits)
    fs/block_dev.c: return the right error in thaw_bdev()
    nvme: Pass pointers, not dma addresses, to nvme_get/set_features()
    nvme/scsi: Remove power management support
    nvmet: Make dsm number of ranges zero based
    nvmet: Use direct IO for writes
    admin-cmd: Added smart-log command support.
    nvme-fabrics: Add host_traddr options field to host infrastructure
    nvme-fabrics: revise host transport option descriptions
    nvme-fabrics: rework nvmf_get_address() for variable options
    nbd: use BLK_MQ_F_BLOCKING
    blkcg: Annotate blkg_hint correctly
    cfq: fix starvation of asynchronous writes
    blk-mq: add flag for drivers wanting blocking ->queue_rq()
    blk-mq: remove non-blocking pass in blk_mq_map_request
    blk-mq: get rid of manual run of queue with __blk_mq_run_hw_queue()
    block: export bio_free_pages to other modules
    lightnvm: propagate device_add() error code
    lightnvm: expose device geometry through sysfs
    lightnvm: control life of nvm_dev in driver
    blk-mq: register device instead of disk
    ...

    Linus Torvalds
     

04 Oct, 2016

1 commit

  • Pull CPU hotplug updates from Thomas Gleixner:
    "Yet another batch of cpu hotplug core updates and conversions:

    - Provide core infrastructure for multi instance drivers so the
    drivers do not have to keep custom lists.

    - Convert custom lists to the new infrastructure. The block-mq custom
    list conversion comes through the block tree and makes the diffstat
    tip over to more lines removed than added.

    - Handle unbalanced hotplug enable/disable calls more gracefully.

    - Remove the obsolete CPU_STARTING/DYING notifier support.

    - Convert another batch of notifier users.

    The relayfs changes which conflicted with the conversion have been
    shipped to me by Andrew.

    The remaining lot is targeted for 4.10 so that we finally can remove
    the rest of the notifiers"

    * 'smp-hotplug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (46 commits)
    cpufreq: Fix up conversion to hotplug state machine
    blk/mq: Reserve hotplug states for block multiqueue
    x86/apic/uv: Convert to hotplug state machine
    s390/mm/pfault: Convert to hotplug state machine
    mips/loongson/smp: Convert to hotplug state machine
    mips/octeon/smp: Convert to hotplug state machine
    fault-injection/cpu: Convert to hotplug state machine
    padata: Convert to hotplug state machine
    cpufreq: Convert to hotplug state machine
    ACPI/processor: Convert to hotplug state machine
    virtio scsi: Convert to hotplug state machine
    oprofile/timer: Convert to hotplug state machine
    block/softirq: Convert to hotplug state machine
    lib/irq_poll: Convert to hotplug state machine
    x86/microcode: Convert to hotplug state machine
    sh/SH-X3 SMP: Convert to hotplug state machine
    ia64/mca: Convert to hotplug state machine
    ARM/OMAP/wakeupgen: Convert to hotplug state machine
    ARM/shmobile: Convert to hotplug state machine
    arm64/FP/SIMD: Convert to hotplug state machine
    ...

    Linus Torvalds
     

30 Sep, 2016

1 commit

  • Unlocking a mutex twice is wrong. Hence modify blkcg_policy_register()
    such that blkcg_pol_mutex is unlocked once if cpd == NULL. This patch
    prevents smatch from reporting the following error:

    block/blk-cgroup.c:1378: blkcg_policy_register() error: double unlock 'mutex:&blkcg_pol_mutex'
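
    The fixed error path, in sketch form:

    cpd = pol->cpd_alloc_fn(GFP_KERNEL);
    if (!cpd)
            goto err_free_cpds;     /* no mutex_unlock() here anymore;
                                       the error label unlocks exactly once */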

    Fixes: 06b285bd1125 ("blkcg: fix blkcg_policy_data allocation bug")
    Signed-off-by: Bart Van Assche
    Cc: Tejun Heo
    Cc: # v4.2+
    Signed-off-by: Tejun Heo

    Bart Van Assche
     

24 Sep, 2016

2 commits

  • This provides the caller feedback that a given hctx is not mapped and thus
    no command can be sent on it.

    Signed-off-by: Christoph Hellwig
    Tested-by: Steve Wise
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • While debugging timeouts happening in my application workload (ScyllaDB), I have
    observed calls to open() taking a long time, ranging everywhere from 2 seconds -
    the first ones that are enough to time out my application - to more than 30
    seconds.

    The problem seems to happen because XFS may block on pending metadata updates
    under certain circumstances, and that's confirmed with the following backtrace
    taken by the offcputime tool (iovisor/bcc):

    ffffffffb90c57b1 finish_task_switch
    ffffffffb97dffb5 schedule
    ffffffffb97e310c schedule_timeout
    ffffffffb97e1f12 __down
    ffffffffb90ea821 down
    ffffffffc046a9dc xfs_buf_lock
    ffffffffc046abfb _xfs_buf_find
    ffffffffc046ae4a xfs_buf_get_map
    ffffffffc046babd xfs_buf_read_map
    ffffffffc0499931 xfs_trans_read_buf_map
    ffffffffc044a561 xfs_da_read_buf
    ffffffffc0451390 xfs_dir3_leaf_read.constprop.16
    ffffffffc0452b90 xfs_dir2_leaf_lookup_int
    ffffffffc0452e0f xfs_dir2_leaf_lookup
    ffffffffc044d9d3 xfs_dir_lookup
    ffffffffc047d1d9 xfs_lookup
    ffffffffc0479e53 xfs_vn_lookup
    ffffffffb925347a path_openat
    ffffffffb9254a71 do_filp_open
    ffffffffb9242a94 do_sys_open
    ffffffffb9242b9e sys_open
    ffffffffb97e42b2 entry_SYSCALL_64_fastpath
    00007fb0698162ed [unknown]

    Inspecting my run with blktrace, I can see that the xfsaild kthread exhibits very
    high "Dispatch wait" times, in the dozens-of-seconds range, consistent with
    the open() times I saw in that run.

    Still from the blktrace output, we can, after searching a bit, identify the
    request that wasn't dispatched:

    8,0 11 152 81.092472813 804 A WM 141698288 + 8
    8,0 0 289372 96.718761435 0 D WM 141698288 + 8 (15626265317) [swapper/0]

    As we can see above, in this particular example CFQ took 15 seconds to dispatch
    this request. Going back to the full trace, we can see that the xfsaild queue
    had plenty of opportunity to run, and it was selected as the active queue many
    times. It would just always be preempted by something else (example):

    8,0 1 0 81.117912979 0 m N cfq1618SN / insert_request
    8,0 1 0 81.117913419 0 m N cfq1618SN / add_to_rr
    8,0 1 0 81.117914044 0 m N cfq1618SN / preempt
    8,0 1 0 81.117914398 0 m N cfq767A / slice expired t=1
    8,0 1 0 81.117914755 0 m N cfq767A / resid=40
    8,0 1 0 81.117915340 0 m N / served: vt=1948520448 min_vt=1948520448
    8,0 1 0 81.117915858 0 m N cfq767A / sl_used=1 disp=0 charge=0 iops=1 sect=0

    where cfq767 is the xfsaild queue and cfq1618 corresponds to one of the ScyllaDB
    IO dispatchers.

    The requests preempting the xfsaild queue are synchronous requests. That's a
    characteristic of ScyllaDB workloads, as we only ever issue O_DIRECT requests.
    While it can be argued that preempting ASYNC requests in favor of SYNC is part
    of the CFQ logic, I don't believe that doing so for 15+ seconds is anyone's
    goal.

    Moreover, unless I am misunderstanding something, that breaks the expectation
    set by the "fifo_expire_async" tunable, which in my system is set to the
    default.

    Looking at the code, it seems to me that the issue is that after we make
    an async queue active, there is no guarantee that it will execute any request.

    When the queue itself tests whether it may dispatch via cfq_may_dispatch(), it can
    bail if it sees SYNC requests in flight. An incoming request from another queue can
    also preempt it in such a situation before we have the chance to execute anything
    (as seen in the trace above).
    trace above).

    This patch sets the must_dispatch flag if we notice that we have requests
    that are already fifo_expired. This flag is always cleared after
    cfq_dispatch_request() returns from cfq_dispatch_requests(), so it won't pin
    the queue for subsequent requests (unless they are themselves expired).

    Care is taken during preempt to still allow rt requests to preempt us
    regardless.
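
    A sketch of the two halves of the change (reconstructed from the
    description; cfq_check_fifo() and the cfqq flag helpers already exist in
    cfq-iosched.c):

    /* when picking a request: pin the queue if its fifo already expired */
    rq = cfq_check_fifo(cfqq);              /* fifo-expired request or NULL */
    if (rq)
            cfq_mark_cfqq_must_dispatch(cfqq);

    /* in the preemption decision: only RT queues may still preempt it;
     * the flag is cleared again after the dispatch round */
    if (cfq_cfqq_must_dispatch(cfqq) && !cfq_class_rt(new_cfqq))
            return false;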

    Testing my workload with this patch applied produces much better results.
    From the application side I see no timeouts, and the open() latency histogram
    generated by systemtap looks much better, with the worst outlier at 131ms:

    Latency histogram of xfs_buf_lock acquisition (microseconds):
    value |-------------------------------------------------- count
    0 | 11
    1 |@@@@ 161
    2 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1966
    4 |@ 54
    8 | 36
    16 | 7
    32 | 0
    64 | 0
    ~
    1024 | 0
    2048 | 0
    4096 | 1
    8192 | 1
    16384 | 2
    32768 | 0
    65536 | 0
    131072 | 1
    262144 | 0
    524288 | 0

    Signed-off-by: Glauber Costa
    CC: Jens Axboe
    CC: linux-block@vger.kernel.org
    CC: linux-kernel@vger.kernel.org

    Signed-off-by: Glauber Costa
    Signed-off-by: Jens Axboe

    Glauber Costa
     

22 Sep, 2016

4 commits

  • Two cases:

    1) blk_mq_alloc_request() needlessly re-runs the queue, after
    calling into the tag allocation without NOWAIT set. We don't
    need to do that.

    2) blk_mq_map_request() should just use blk_mq_run_hw_queue() with
    the async flag set to false.

    Signed-off-by: Jens Axboe
    Reviewed-by: Christoph Hellwig

    Jens Axboe
     
  • Install the callbacks via the state machine so we can phase out the cpu
    hotplug notifiers mess.

    Signed-off-by: Sebastian Andrzej Siewior
    Signed-off-by: Thomas Gleixner
    Cc: Peter Zijlstra
    Cc: linux-block@vger.kernel.org
    Cc: rt@linutronix.de
    Cc: Christoph Hellwig
    Link: http://lkml.kernel.org/r/20160919212601.180033814@linutronix.de
    Signed-off-by: Thomas Gleixner
    Signed-off-by: Jens Axboe

    Sebastian Andrzej Siewior
     
  • Replace the block-mq notifier list management with the multi instance
    facility in the cpu hotplug state machine.

    Signed-off-by: Thomas Gleixner
    Cc: Peter Zijlstra
    Cc: linux-block@vger.kernel.org
    Cc: rt@linutronix.de
    Cc: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Thomas Gleixner
     
  • bio_free_pages was introduced in commit 1dfa0f68c040
    ("block: add a helper to free bio bounce buffer pages");
    other modules can reuse the function once it is
    exported.

    Cc: Christoph Hellwig
    Cc: Jens Axboe
    Cc: Mike Snitzer
    Cc: Shaohua Li
    Signed-off-by: Guoqing Jiang
    Acked-by: Kent Overstreet
    Signed-off-by: Jens Axboe

    Guoqing Jiang
     

20 Sep, 2016

1 commit

  • Right now, if the slice has expired, we start a new slice. If a bio is
    queued, we keep extending the slice by the throtl_slice interval (100ms).

    This worked well as long as the pending timer function got executed within
    a few milliseconds of the scheduled time. But it looks like, with recent changes
    in the timer subsystem, the slack can be much longer depending on the expiry time
    of the scheduled timer.

    commit 500462a9de65 ("timers: Switch to a non-cascading wheel")

    This means that, by the time the timer function gets executed, it is possible the
    delay from the scheduled time is more than 100ms. That means the current code
    will conclude that the existing slice has expired and a new one needs to
    be started. The new slice will be 100ms by default, and that will not be
    sufficient to meet the rate requirement of the group given the bio size, so the
    bio will not be dispatched and we will start a new timer function to
    wait. And when that timer expires, the same process will repeat and we
    will wait again; this can easily become an infinite loop.

    Solve this issue by starting a new slice only if the throttle group is
    empty. If it is not empty, that means there should be an active slice
    going on. Ideally it should not have expired, but given the slack, it is
    possible that it has expired.
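
    In sketch form, the dispatch-time check becomes something like:

    /* start a fresh slice only when nothing is queued in this direction;
     * otherwise extend the existing (possibly overdue) slice */
    if (throtl_slice_used(tg, rw) && !tg->service_queue.nr_queued[rw])
            throtl_start_new_slice(tg, rw);
    else if (time_before(tg->slice_end[rw], jiffies + throtl_slice))
            throtl_extend_slice(tg, rw, jiffies + throtl_slice);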

    Reported-by: Hou Tao
    Signed-off-by: Vivek Goyal
    Signed-off-by: Jens Axboe

    Vivek Goyal