12 Feb, 2019

5 commits

  • bio_check_eod() should check the partition size, not the whole disk,
    if bio->bi_partno is non-zero. Do this by moving the call
    to bio_check_eod() into blk_partition_remap().

    Based on an earlier patch from Jiufei Xue.

    Fixes: 74d46992e0d9 ("block: replace bi_bdev with a gendisk pointer and partitions index")
    Reported-by: Jiufei Xue
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe
    (cherry picked from commit 52c5e62d4c4beecddc6e1b8045ce1d695fca1ba7)

    Christoph Hellwig
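
    As a rough illustration of the check being moved (a standalone sketch
    with stand-in types, not the kernel code): an I/O runs past the end of
    the device when its start sector plus length exceeds the capacity, and
    after this change the capacity compared against is the partition's,
    not the whole disk's.

        /* Minimal end-of-device check, in plain C. */
        #include <stdbool.h>
        #include <stdint.h>

        struct io_span {
            uint64_t sector;      /* first 512-byte sector of the I/O */
            uint32_t nr_sectors;  /* length of the I/O in sectors */
        };

        static bool spans_eod(const struct io_span *io, uint64_t maxsector)
        {
            if (io->nr_sectors == 0)
                return false;                 /* empty bios never span EOD */
            if (io->sector >= maxsector)
                return true;                  /* starts at or past the end */
            return maxsector - io->sector < io->nr_sectors;
        }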
     
  • Regular block device writes go through blkdev_write_iter(), which
    checks bdev_read_only(), while zeroout/discard/etc requests are never
    checked, whether triggered from userspace or from the kernel. Add a
    generic catch-all check to generic_make_request_checks() to actually
    enforce ioctl(BLKROSET) and set_disk_ro(), which is used by quite a
    few drivers for things like snapshots, read-only backing files/images,
    etc.

    Reviewed-by: Sagi Grimberg
    Signed-off-by: Ilya Dryomov
    Signed-off-by: Jens Axboe
    (cherry picked from commit 721c7fc701c71f693307d274d2b346a1ecd4a534)

    Ilya Dryomov
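
    Sketched in standalone C under assumed op codes (the real patch works
    on bio op flags), the catch-all boils down to rejecting any
    media-modifying operation once the disk or partition is marked
    read-only:

        /* Illustrative only: classify ops and reject writes to RO devices. */
        enum req_op { OP_READ, OP_WRITE, OP_DISCARD, OP_WRITE_ZEROES };

        static int bio_rejected_as_ro(enum req_op op, int dev_read_only)
        {
            int modifies_media = (op != OP_READ);

            /* the catch-all in generic_make_request_checks() fails such
             * bios with an error instead of handing them to the driver */
            return modifies_media && dev_read_only;
        }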
     
  • So that we can also poll non-blk-mq queues. This is mostly needed for
    the NVMe multipath code, but could also be useful elsewhere.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Hannes Reinecke
    Signed-off-by: Jens Axboe
    (cherry picked from commit ea435e1b9392a33deceaea2a16ebaa3397bead93)

    Christoph Hellwig
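
    The shape of the change, sketched with hypothetical names: polling
    goes through a callback on the queue, so a non-blk-mq queue (such as
    the NVMe multipath device) can plug in its own implementation:

        /* Sketch of a per-queue poll indirection; names are illustrative. */
        struct queue_sketch {
            int (*poll_fn)(struct queue_sketch *q, unsigned int cookie);
        };

        static int queue_poll(struct queue_sketch *q, unsigned int cookie)
        {
            if (!q->poll_fn)
                return 0;                    /* polling unsupported */
            return q->poll_fn(q, cookie);    /* blk-mq or multipath hook */
        }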
     
  • This helper allows stealing the uncompleted bios from a request so
    that they can be reissued on another path.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Sagi Grimberg
    Reviewed-by: Hannes Reinecke
    Signed-off-by: Jens Axboe
    (cherry picked from commit ef71de8b15d891b27b8c983a9a8972b11cb4576a)

    Christoph Hellwig
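
    Conceptually the helper detaches the request's chain of uncompleted
    bios and splices it onto a caller-owned list; a simplified standalone
    sketch (types are stand-ins, not the kernel's):

        /* Simplified sketch of stealing a request's bio chain. */
        struct bio_sk { struct bio_sk *next; };
        struct req_sk { struct bio_sk *bio; };

        static void steal_bios(struct bio_sk **list, struct req_sk *rq)
        {
            struct bio_sk *tail = rq->bio;

            if (!tail)
                return;
            while (tail->next)
                tail = tail->next;
            tail->next = *list;   /* splice chain ahead of the list */
            *list = rq->bio;
            rq->bio = NULL;       /* the request no longer owns them */
        }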
     
  • This helper allows reinserting a bio into a new queue without much
    overhead, but requires all queue limits to be the same for the upper
    and lower queues, and it does not provide any recursion prevention.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Sagi Grimberg
    Reviewed-by: Javier González
    Reviewed-by: Hannes Reinecke
    Signed-off-by: Jens Axboe
    (cherry picked from commit f421e1d9ade4e1b88183e54425cf50e390d16a7f)

    Christoph Hellwig
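
    A sketch of why this is low-overhead, with hypothetical helper names:
    the bio goes straight to the lower queue's make_request function, with
    none of the splitting or on-stack recursion handling of
    generic_make_request(), which is exactly why the two queues' limits
    must match:

        /* Sketch only: hand a bio directly to a lower queue. */
        struct bio_s;
        struct queue_s {
            void (*make_request_fn)(struct queue_s *q, struct bio_s *bio);
        };

        static int  enter_queue(struct queue_s *q) { (void)q; return 0; }
        static void exit_queue(struct queue_s *q)  { (void)q; }

        static int direct_make_request_sk(struct queue_s *q, struct bio_s *bio)
        {
            if (enter_queue(q))              /* pin the queue while submitting */
                return -1;
            q->make_request_fn(q, bio);      /* no checks, no recursion guard */
            exit_queue(q);
            return 0;
        }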
     

21 Nov, 2018

1 commit

  • commit 8dc765d438f1e42b3e8227b3b09fad7d73f4ec9a upstream.

    c2856ae2f315d ("blk-mq: quiesce queue before freeing queue") has
    already fixed this race, however the implied synchronize_rcu()
    in blk_mq_quiesce_queue() can slow down LUN probing a lot, which
    caused a performance regression.

    Then 1311326cf4755c7 ("blk-mq: avoid to synchronize rcu inside blk_cleanup_queue()")
    tried to quiesce the queue, to avoid the unnecessary synchronize_rcu(),
    only when queue initialization is done, because it is common to see
    lots of nonexistent LUNs which need to be probed.

    However, it turns out it isn't safe to quiesce the queue only when
    queue initialization is done. When one SCSI command is completed,
    the submitter of the command may be woken up immediately, and the
    scsi device may then be removed while the run queue in scsi_end_request()
    is still in progress, so a kernel panic can be caused.

    In the Red Hat QE lab, there are several reports of this kind of kernel
    panic triggered during kernel boot.

    This patch addresses the issue by grabbing one queue usage counter
    reference while freeing a request and doing the following run queue.

    Fixes: 1311326cf4755c7 ("blk-mq: avoid to synchronize rcu inside blk_cleanup_queue()")
    Cc: Andrew Jones
    Cc: Bart Van Assche
    Cc: linux-scsi@vger.kernel.org
    Cc: Martin K. Petersen
    Cc: Christoph Hellwig
    Cc: James E.J. Bottomley
    Cc: stable
    Cc: jianchao.wang
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Ming Lei
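
    The pattern described above, as a kernel-style fragment (reconstructed
    from the description, not quoted from the diff): the completion path
    pins the queue's usage counter across freeing the request and the
    subsequent run queue:

        /* Fragment illustrating the described fix (not the literal patch). */
        percpu_ref_get(&q->q_usage_counter);  /* pin queue against teardown */
        blk_mq_free_request(rq);              /* may drop the last request */
        blk_mq_run_hw_queues(q, true);        /* run queue is now safe */
        percpu_ref_put(&q->q_usage_counter);  /* allow cleanup to proceed */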
     

26 Sep, 2018

1 commit

  • [ Upstream commit 1311326cf4755c7ffefd20f576144ecf46d9906b ]

    SCSI probing may synchronously create and destroy a lot of request_queues
    for nonexistent devices. Any synchronize_rcu() in the queue creation or
    destruction path may introduce long latency during booting; see the
    detailed description in the comment of blk_register_queue().

    This patch removes one synchronize_rcu() inside blk_cleanup_queue()
    for this case. Commit c2856ae2f315d75 ("blk-mq: quiesce queue before
    freeing queue") needs synchronize_rcu() for implementing
    blk_mq_quiesce_queue(), but when the queue isn't initialized, that
    isn't necessary since only pass-through requests are involved and
    there is no such issue in scsi_execute() at all.

    Without this patch and the previous one, it may take more than 20
    seconds for virtio-scsi to complete disk probing. With the two patches,
    the time becomes less than 100ms.

    Fixes: c2856ae2f315d75 ("blk-mq: quiesce queue before freeing queue")
    Reported-by: Andrew Jones
    Cc: Omar Sandoval
    Cc: Bart Van Assche
    Cc: linux-scsi@vger.kernel.org
    Cc: "Martin K. Petersen"
    Cc: Christoph Hellwig
    Tested-by: Andrew Jones
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Ming Lei
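
    A plausible sketch of the resulting conditional (a hedged
    reconstruction; blk_queue_init_done() is the kernel's predicate for a
    fully initialized queue): skip the quiesce, and with it the
    synchronize_rcu(), for queues that never finished initialization:

        /* Sketch: only quiesce fully initialized blk-mq queues. */
        if (q->mq_ops && blk_queue_init_done(q))
                blk_mq_quiesce_queue(q);   /* implies synchronize_rcu() */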
     

10 Sep, 2018

2 commits

  • commit b233f127042dba991229e3882c6217c80492f6ef upstream.

    Runtime PM isn't ready for blk-mq yet, and commit 765e40b675a9 ("block:
    disable runtime-pm for blk-mq") tried to disable it. Unfortunately,
    that approach can't take effect since user space can still switch
    it on via 'echo auto > /sys/block/sdN/device/power/control'.

    This patch really disables runtime PM for blk-mq via pm_runtime_disable()
    and fixes all kinds of PM-related kernel crashes.

    Cc: Tomas Janousek
    Cc: Przemek Socha
    Cc: Alan Stern
    Cc:
    Reviewed-by: Bart Van Assche
    Reviewed-by: Christoph Hellwig
    Tested-by: Patrick Steinhardt
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Ming Lei
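
    Sketched from the description (close to, but not quoted from, the
    patch): block runtime PM at initialization time for blk-mq queues, so
    the sysfs knob can no longer re-enable it:

        /* Sketch: hard-disable runtime PM for blk-mq queues. */
        void blk_pm_runtime_init(struct request_queue *q, struct device *dev)
        {
                if (q->mq_ops) {
                        pm_runtime_disable(dev);  /* not just unused: disabled */
                        return;
                }
                /* ... legacy-queue runtime PM setup continues as before ... */
        }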
     
  • commit 54648cf1ec2d7f4b6a71767799c45676a138ca24 upstream.

    We found a use-after-free issue in __blk_drain_queue()
    on kernel 4.14. After reading the latest kernel 4.18-rc6 we
    think it has the same problem.

    Memory is allocated for q->fq in blk_init_allocated_queue().
    If the elevator init function returns an error, the failure
    path frees q->fq.

    __blk_drain_queue() then uses that same memory after the free
    of q->fq, which leads to unpredictable behavior.

    This patch sets q->fq to NULL in the failure path of
    blk_init_allocated_queue().

    Fixes: commit 7c94e1c157a2 ("block: introduce blk_flush_queue to drive flush machinery")
    Cc:
    Reviewed-by: Ming Lei
    Reviewed-by: Bart Van Assche
    Signed-off-by: xiao jin
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    xiao jin
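
    The fix's shape as a fragment (reconstructed from the description):
    clear the pointer when the failure path frees the flush queue, so the
    later drain cannot dereference freed memory:

        /* Fragment: failure path of blk_init_allocated_queue() after the fix. */
        blk_free_flush_queue(q->fq);
        q->fq = NULL;    /* __blk_drain_queue() must not see a stale pointer */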
     

22 Jul, 2018

1 commit

  • commit 1dc3039bc87ae7d19a990c3ee71cfd8a9068f428 upstream.

    When blk_queue_enter() waits for a queue to unfreeze, or unset the
    PREEMPT_ONLY flag, do not allow it to be interrupted by a signal.

    The PREEMPT_ONLY flag was introduced later in commit 3a0a529971ec
    ("block, scsi: Make SCSI quiesce and resume work reliably"). Note the SCSI
    device is resumed asynchronously, i.e. after un-freezing userspace tasks.

    So that commit exposed the bug as a regression in v4.15. A mysterious
    SIGBUS (or -EIO) sometimes happened during the time the device was being
    resumed. Most frequently, there was no kernel log message, and we saw Xorg
    or Xwayland killed by SIGBUS.[1]

    [1] E.g. https://bugzilla.redhat.com/show_bug.cgi?id=1553979

    Without this fix, I get an IO error in this test:

    # dd if=/dev/sda of=/dev/null iflag=direct & \
    while killall -SIGUSR1 dd; do sleep 0.1; done & \
    echo mem > /sys/power/state ; \
    sleep 5; killall dd # stop after 5 seconds

    The interruptible wait was added to blk_queue_enter in
    commit 3ef28e83ab15 ("block: generic request_queue reference counting").
    Before then, the interruptible wait was only in blk-mq, but I don't think
    it could ever have been correct.

    Reviewed-by: Bart Van Assche
    Cc: stable@vger.kernel.org
    Signed-off-by: Alan Jenkins
    Signed-off-by: Jens Axboe
    Signed-off-by: Sudip Mukherjee
    Signed-off-by: Greg Kroah-Hartman

    Alan Jenkins
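
    The change amounts to swapping the interruptible wait in
    blk_queue_enter() for an uninterruptible one, roughly as follows (a
    reconstruction, with the wait condition elided as `ready`):

        /* Before: a signal aborted the wait and the I/O failed with -EIO. */
        ret = wait_event_interruptible(q->mq_freeze_wq, ready);
        if (ret)
                return ret;

        /* After: wait out the freeze/PREEMPT_ONLY window regardless of
         * pending signals. */
        wait_event(q->mq_freeze_wq, ready);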
     

03 Jul, 2018

1 commit

  • commit 297ba57dcdec7ea37e702bcf1a577ac32a034e21 upstream.

    This patch ensures that removing a path controlled by the dm-mpath
    driver while mkfs is running no longer triggers the following kernel
    bug:

    kernel BUG at block/blk-core.c:3347!
    invalid opcode: 0000 [#1] PREEMPT SMP KASAN
    CPU: 20 PID: 24369 Comm: mkfs.ext4 Not tainted 4.18.0-rc1-dbg+ #2
    RIP: 0010:blk_end_request_all+0x68/0x70
    Call Trace:

    dm_softirq_done+0x326/0x3d0 [dm_mod]
    blk_done_softirq+0x19b/0x1e0
    __do_softirq+0x128/0x60d
    irq_exit+0x100/0x110
    smp_call_function_single_interrupt+0x90/0x330
    call_function_single_interrupt+0xf/0x20

    Fixes: f9d03f96b988 ("block: improve handling of the magic discard payload")
    Reviewed-by: Ming Lei
    Reviewed-by: Christoph Hellwig
    Acked-by: Mike Snitzer
    Signed-off-by: Bart Van Assche
    Cc: Hannes Reinecke
    Cc: Johannes Thumshirn
    Cc:
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Bart Van Assche
     

26 Apr, 2018

1 commit

  • [ Upstream commit 445251d0f4d329aa061f323546cd6388a3bb7ab5 ]

    I ran into an issue on my laptop that triggered a bug on the
    discard path:

    WARNING: CPU: 2 PID: 207 at drivers/nvme/host/core.c:527 nvme_setup_cmd+0x3d3/0x430
    Modules linked in: rfcomm fuse ctr ccm bnep arc4 binfmt_misc snd_hda_codec_hdmi nls_iso8859_1 nls_cp437 vfat snd_hda_codec_conexant fat snd_hda_codec_generic iwlmvm snd_hda_intel snd_hda_codec snd_hwdep mac80211 snd_hda_core snd_pcm snd_seq_midi snd_seq_midi_event snd_rawmidi snd_seq x86_pkg_temp_thermal intel_powerclamp kvm_intel uvcvideo iwlwifi btusb snd_seq_device videobuf2_vmalloc btintel videobuf2_memops kvm snd_timer videobuf2_v4l2 bluetooth irqbypass videobuf2_core aesni_intel aes_x86_64 crypto_simd cryptd snd glue_helper videodev cfg80211 ecdh_generic soundcore hid_generic usbhid hid i915 psmouse e1000e ptp pps_core xhci_pci xhci_hcd intel_gtt
    CPU: 2 PID: 207 Comm: jbd2/nvme0n1p7- Tainted: G U 4.15.0+ #176
    Hardware name: LENOVO 20FBCTO1WW/20FBCTO1WW, BIOS N1FET59W (1.33 ) 12/19/2017
    RIP: 0010:nvme_setup_cmd+0x3d3/0x430
    RSP: 0018:ffff880423e9f838 EFLAGS: 00010217
    RAX: 0000000000000000 RBX: ffff880423e9f8c8 RCX: 0000000000010000
    RDX: ffff88022b200010 RSI: 0000000000000002 RDI: 00000000327f0000
    RBP: ffff880421251400 R08: ffff88022b200000 R09: 0000000000000009
    R10: 0000000000000000 R11: 0000000000000000 R12: 000000000000ffff
    R13: ffff88042341e280 R14: 000000000000ffff R15: ffff880421251440
    FS: 0000000000000000(0000) GS:ffff880441500000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 000055b684795030 CR3: 0000000002e09006 CR4: 00000000001606e0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    Call Trace:
    nvme_queue_rq+0x40/0xa00
    ? __sbitmap_queue_get+0x24/0x90
    ? blk_mq_get_tag+0xa3/0x250
    ? wait_woken+0x80/0x80
    ? blk_mq_get_driver_tag+0x97/0xf0
    blk_mq_dispatch_rq_list+0x7b/0x4a0
    ? deadline_remove_request+0x49/0xb0
    blk_mq_do_dispatch_sched+0x4f/0xc0
    blk_mq_sched_dispatch_requests+0x106/0x170
    __blk_mq_run_hw_queue+0x53/0xa0
    __blk_mq_delay_run_hw_queue+0x83/0xa0
    blk_mq_run_hw_queue+0x6c/0xd0
    blk_mq_sched_insert_request+0x96/0x140
    __blk_mq_try_issue_directly+0x3d/0x190
    blk_mq_try_issue_directly+0x30/0x70
    blk_mq_make_request+0x1a4/0x6a0
    generic_make_request+0xfd/0x2f0
    ? submit_bio+0x5c/0x110
    submit_bio+0x5c/0x110
    ? __blkdev_issue_discard+0x152/0x200
    submit_bio_wait+0x43/0x60
    ext4_process_freed_data+0x1cd/0x440
    ? account_page_dirtied+0xe2/0x1a0
    ext4_journal_commit_callback+0x4a/0xc0
    jbd2_journal_commit_transaction+0x17e2/0x19e0
    ? kjournald2+0xb0/0x250
    kjournald2+0xb0/0x250
    ? wait_woken+0x80/0x80
    ? commit_timeout+0x10/0x10
    kthread+0x111/0x130
    ? kthread_create_worker_on_cpu+0x50/0x50
    ? do_group_exit+0x3a/0xa0
    ret_from_fork+0x1f/0x30
    Code: 73 89 c1 83 ce 10 c1 e1 10 09 ca 83 f8 04 0f 87 0f ff ff ff 8b 4d 20 48 8b 7d 00 c1 e9 09 48 01 8c c7 00 08 00 00 e9 f8 fe ff ff ff 4c 89 c7 41 bc 0a 00 00 00 e8 0d 78 d6 ff e9 a1 fc ff ff
    ---[ end trace 50d361cc444506c8 ]---
    print_req_error: I/O error, dev nvme0n1, sector 847167488

    Decoding the assembly, the request claims to have 0xffff segments,
    while nvme counts two. This turns out to be because we don't check
    for a data carrying request on the mq scheduler path, and since
    blk_phys_contig_segment() returns true for a non-data request,
    we decrement the initial segment count of 0 and end up with
    0xffff in the unsigned short.

    There are a few issues here:

    1) We should initialize the segment count for a discard to 1.
    2) The discard merging is currently using the data limits for
    segments and sectors.

    Fix this up by having attempt_merge() correctly identify the
    request, and by initializing the segment count correctly
    for discards.

    This can only be triggered with mq-deadline on discard capable
    devices right now, which isn't a common configuration.

    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Jens Axboe
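
    The 0xffff artifact is ordinary unsigned wraparound, demonstrable in a
    few lines of standalone C: decrementing an unsigned short that holds 0
    yields 0xffff, which is exactly what the decoded request showed.

        /* Standalone demo of the wraparound described above. */
        #include <stdio.h>

        int main(void)
        {
            unsigned short nr_segments = 0;  /* discard started at 0 pre-fix */

            nr_segments--;                   /* bogus merge-time decrement */
            printf("segments = 0x%x\n", nr_segments);  /* prints 0xffff */
            return 0;
        }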
     

09 Mar, 2018

1 commit

  • commit 7c5a0dcf557c6511a61e092ba887de28882fe857 upstream.

    The vm counters are counted in sectors, so we should do the conversion
    in submit_bio().

    Fixes: 74d46992e0d9 ("block: replace bi_bdev with a gendisk pointer and partitions index")
    Cc: stable@vger.kernel.org
    Reviewed-by: Omar Sandoval
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jiufei Xue
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Jiufei Xue
     

03 Mar, 2018

1 commit

  • [ Upstream commit 454be724f6f99cc7e7bbf15067128be9868186c6 ]

    Now we track legacy requests with .q_usage_counter in commit 055f6e18e08f
    ("block: Make q_usage_counter also track legacy requests"), but that
    commit never runs and drains the legacy queue before waiting for this
    counter to become zero, so an IO hang is caused in the test of pulling
    a disk during IO.

    This patch fixes the issue by draining requests before waiting for
    q_usage_counter to become zero. Both Mauricio and chenxiang reported
    this issue, and observed that it can be fixed by this patch.

    Link: https://marc.info/?l=linux-block&m=151192424731797&w=2
    Fixes: 055f6e18e08f ("block: Make q_usage_counter also track legacy requests")
    Cc: Wen Xiong
    Tested-by: "chenxiang (M)"
    Tested-by: Mauricio Faria de Oliveira
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Ming Lei
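
    One plausible shape of the fix, heavily hedged since the message above
    states only the ordering: drain outstanding legacy requests first,
    then wait for the usage counter to reach zero:

        /* Sketch of the described ordering in queue cleanup. */
        __blk_drain_queue(q, true);   /* flush queued legacy requests */
        blk_freeze_queue(q);          /* waits for q_usage_counter == 0 */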
     

17 Feb, 2018

1 commit

  • commit c2856ae2f315d754a0b6a268e4c6745b332b42e7 upstream.

    After a queue is frozen, dispatch may still happen, for example:

    1) requests are submitted from several contexts
    2) requests from all these contexts are inserted into the queue, but
    may be dispatched to the LLD in just one of these paths, while other
    paths still need to move on even after all these requests are
    completed (that means blk_mq_freeze_queue_wait() returns at that time)
    3) dispatch after queue freezing still moves on and causes
    use-after-free, because the request queue has been freed

    This patch quiesces the queue after it is frozen, and makes sure all
    in-progress dispatch is completed.

    This patch fixes the following kernel crash when running heavy IOs vs.
    deleting device:

    [ 36.719251] BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
    [ 36.720318] IP: kyber_has_work+0x14/0x40
    [ 36.720847] PGD 254bf5067 P4D 254bf5067 PUD 255e6a067 PMD 0
    [ 36.721584] Oops: 0000 [#1] PREEMPT SMP
    [ 36.722105] Dumping ftrace buffer:
    [ 36.722570] (ftrace buffer empty)
    [ 36.723057] Modules linked in: scsi_debug ebtable_filter ebtables ip6table_filter ip6_tables tcm_loop iscsi_target_mod target_core_file target_core_iblock target_core_pscsi target_core_mod xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack libcrc32c bridge stp llc fuse iptable_filter ip_tables sd_mod sg btrfs xor zstd_decompress zstd_compress xxhash raid6_pq mptsas mptscsih bcache crc32c_intel ahci mptbase libahci serio_raw scsi_transport_sas nvme libata shpchp lpc_ich virtio_scsi nvme_core binfmt_misc dm_mod iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi null_blk configs
    [ 36.733438] CPU: 2 PID: 2374 Comm: fio Not tainted 4.15.0-rc2.blk_mq_quiesce+ #714
    [ 36.735143] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.9.3-1.fc25 04/01/2014
    [ 36.736688] RIP: 0010:kyber_has_work+0x14/0x40
    [ 36.737515] RSP: 0018:ffffc9000209bca0 EFLAGS: 00010202
    [ 36.738431] RAX: 0000000000000008 RBX: ffff88025578bfc8 RCX: ffff880257bf4ed0
    [ 36.739581] RDX: 0000000000000038 RSI: ffffffff81a98c6d RDI: ffff88025578bfc8
    [ 36.740730] RBP: ffff880253cebfc8 R08: ffffc9000209bda0 R09: ffff8802554f3480
    [ 36.741885] R10: ffffc9000209be60 R11: ffff880263f72538 R12: ffff88025573e9e8
    [ 36.743036] R13: ffff88025578bfd0 R14: 0000000000000001 R15: 0000000000000000
    [ 36.744189] FS: 00007f9b9bee67c0(0000) GS:ffff88027fc80000(0000) knlGS:0000000000000000
    [ 36.746617] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [ 36.748483] CR2: 0000000000000008 CR3: 0000000254bf4001 CR4: 00000000003606e0
    [ 36.750164] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    [ 36.751455] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    [ 36.752796] Call Trace:
    [ 36.753992] blk_mq_do_dispatch_sched+0x7f/0xe0
    [ 36.755110] blk_mq_sched_dispatch_requests+0x119/0x190
    [ 36.756179] __blk_mq_run_hw_queue+0x83/0x90
    [ 36.757144] __blk_mq_delay_run_hw_queue+0xaf/0x110
    [ 36.758046] blk_mq_run_hw_queue+0x24/0x70
    [ 36.758845] blk_mq_flush_plug_list+0x1e7/0x270
    [ 36.759676] blk_flush_plug_list+0xd6/0x240
    [ 36.760463] blk_finish_plug+0x27/0x40
    [ 36.761195] do_io_submit+0x19b/0x780
    [ 36.761921] ? entry_SYSCALL_64_fastpath+0x1a/0x7d
    [ 36.762788] entry_SYSCALL_64_fastpath+0x1a/0x7d
    [ 36.763639] RIP: 0033:0x7f9b9699f697
    [ 36.764352] RSP: 002b:00007ffc10f991b8 EFLAGS: 00000206 ORIG_RAX: 00000000000000d1
    [ 36.765773] RAX: ffffffffffffffda RBX: 00000000008f6f00 RCX: 00007f9b9699f697
    [ 36.766965] RDX: 0000000000a5e6c0 RSI: 0000000000000001 RDI: 00007f9b8462a000
    [ 36.768377] RBP: 0000000000000000 R08: 0000000000000001 R09: 00000000008f6420
    [ 36.769649] R10: 00007f9b846e5000 R11: 0000000000000206 R12: 00007f9b795d6a70
    [ 36.770807] R13: 00007f9b795e4140 R14: 00007f9b795e3fe0 R15: 0000000100000000
    [ 36.771955] Code: 83 c7 10 e9 3f 68 d1 ff 0f 1f 44 00 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 8b 97 b0 00 00 00 48 8d 42 08 48 83 c2 38 3b 00 74 06 b8 01 00 00 00 c3 48 3b 40 08 75 f4 48 83 c0 10
    [ 36.775004] RIP: kyber_has_work+0x14/0x40 RSP: ffffc9000209bca0
    [ 36.776012] CR2: 0000000000000008
    [ 36.776690] ---[ end trace 4045cbce364ff2a4 ]---
    [ 36.777527] Kernel panic - not syncing: Fatal exception
    [ 36.778526] Dumping ftrace buffer:
    [ 36.779313] (ftrace buffer empty)
    [ 36.780081] Kernel Offset: disabled
    [ 36.780877] ---[ end Kernel panic - not syncing: Fatal exception

    Reviewed-by: Christoph Hellwig
    Tested-by: Yi Zhang
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Ming Lei
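
    The ordering the patch establishes, as a fragment (reconstructed from
    the description): freeze first, then quiesce, so that dispatch paths
    still running after the freeze completes are also waited out before
    the queue is freed:

        /* Fragment: teardown ordering after the fix. */
        blk_freeze_queue(q);            /* all requests completed */
        if (q->mq_ops)
                blk_mq_quiesce_queue(q);  /* in-flight dispatch finished too */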
     

17 Dec, 2017

1 commit

  • [ Upstream commit aba7afc5671c23beade64d10caf86e24a9105dab ]

    Prevent removal of a request queue from sporadically triggering the
    following warning:

    list_del corruption. next->prev should be ffff8807d649b970, but was 6b6b6b6b6b6b6b6b
    WARNING: CPU: 3 PID: 342 at lib/list_debug.c:56 __list_del_entry_valid+0x92/0xa0
    Call Trace:
    process_one_work+0x11b/0x660
    worker_thread+0x3d/0x3b0
    kthread+0x129/0x140
    ret_from_fork+0x27/0x40

    Signed-off-by: Bart Van Assche
    Cc: Christoph Hellwig
    Cc: Hannes Reinecke
    Cc: Johannes Thumshirn
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Bart Van Assche
     

14 Dec, 2017

1 commit

  • [ Upstream commit 34d9715ac1edd50285168dd8d80c972739a4f6a4 ]

    Once blk_set_queue_dying() is done in blk_cleanup_queue(), we call
    blk_freeze_queue() and wait for q->q_usage_counter to become zero. But
    if there are tasks blocked in get_request(), q->q_usage_counter can
    never become zero. So we have to wake up all these tasks in
    blk_set_queue_dying() first.

    Fixes: 3ef28e83ab157997 ("block: generic request_queue reference counting")
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Ming Lei
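
    Sketched from the description (a reconstruction, not the quoted hunk):
    walk the legacy request lists and wake every sleeper, in both the sync
    and async directions:

        /* Sketch: wake tasks blocked in get_request() on a dying queue. */
        if (q->request_fn) {                 /* legacy request path only */
                struct request_list *rl;

                blk_queue_for_each_rl(rl, q) {
                        if (rl->rq_pool) {
                                wake_up_all(&rl->wait[BLK_RW_SYNC]);
                                wake_up_all(&rl->wait[BLK_RW_ASYNC]);
                        }
                }
        }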
     

30 Nov, 2017

1 commit

  • commit 4e9b6f20828ac880dbc1fa2fdbafae779473d1af upstream.

    Make sure that if the timeout timer fires after a queue has been
    marked "dying", the affected requests are finished.

    Reported-by: chenxiang (M)
    Fixes: commit 287922eb0b18 ("block: defer timeouts to a workqueue")
    Signed-off-by: Bart Van Assche
    Tested-by: chenxiang (M)
    Cc: Christoph Hellwig
    Cc: Keith Busch
    Cc: Hannes Reinecke
    Cc: Ming Lei
    Cc: Johannes Thumshirn
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Bart Van Assche
     

25 Sep, 2017

1 commit

  • The lockdep code had reported the following unsafe locking scenario:

    CPU0                                CPU1
    ----                                ----
    lock(s_active#228);
                                        lock(&bdev->bd_mutex/1);
                                        lock(s_active#228);
    lock(&bdev->bd_mutex);

    *** DEADLOCK ***

    The deadlock may happen when one task (CPU1) is trying to delete a
    partition in a block device and another task (CPU0) is accessing
    tracing sysfs file (e.g. /sys/block/dm-1/trace/act_mask) in that
    partition.

    The s_active isn't an actual lock. It is a reference count (kn->count)
    on the sysfs (kernfs) file. Removal of a sysfs file, however, requires
    waiting until all the references are gone. The reference count is
    treated like a rwsem by the lockdep instrumentation code.

    The fact that a thread is in a sysfs callback method or in an
    ioctl call means there is a reference to the opened sysfs or device
    file. That should prevent the underlying block structure from being
    removed.

    Instead of using bd_mutex in the block_device structure, a new
    blk_trace_mutex is now added to the request_queue structure to protect
    access to the blk_trace structure.

    Suggested-by: Christoph Hellwig
    Signed-off-by: Waiman Long
    Acked-by: Steven Rostedt (VMware)

    Fix typo in patch subject line, and prune a comment detailing how
    the code used to work.

    Signed-off-by: Jens Axboe

    Waiman Long
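
    The resulting locking pattern in the blktrace ioctl path, sketched
    (argument names follow the kernel's blk_trace_setup(); the call site
    is reconstructed, not quoted):

        /* Sketch: blk_trace access now serialized on a queue-local mutex. */
        mutex_lock(&q->blk_trace_mutex);          /* was bdev->bd_mutex */
        ret = blk_trace_setup(q, name, dev, bdev, arg);
        mutex_unlock(&q->blk_trace_mutex);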
     

12 Sep, 2017

1 commit

  • A NULL pointer crash was reported for the case of having the BFQ IO
    scheduler attached to the underlying blk-mq paths of a DM multipath
    device. The crash occurred in blk_mq_sched_insert_request()'s call to
    e->type->ops.mq.insert_requests().

    Paolo Valente correctly summarized why the crash occurred with:
    "the call chain (dm_mq_queue_rq -> map_request -> setup_clone ->
    blk_rq_prep_clone) creates a cloned request without invoking
    e->type->ops.mq.prepare_request for the target elevator e. The cloned
    request is therefore not initialized for the scheduler, but it is
    however inserted into the scheduler by blk_mq_sched_insert_request."

    All said, a request-based DM multipath device's IO scheduler should be
    the only one used -- when the original requests are issued to the
    underlying paths as cloned requests they are inserted directly in the
    underlying dispatch queue(s) rather than through an additional elevator.

    But commit bd166ef18 ("blk-mq-sched: add framework for MQ capable IO
    schedulers") switched blk_insert_cloned_request() from using
    blk_mq_insert_request() to blk_mq_sched_insert_request(). Which
    incorrectly added elevator machinery into a call chain that isn't
    supposed to have any.

    To fix this introduce a blk-mq private blk_mq_request_bypass_insert()
    that blk_insert_cloned_request() calls to insert the request without
    involving any elevator that may be attached to the cloned request's
    request_queue.

    Fixes: bd166ef183c2 ("blk-mq-sched: add framework for MQ capable IO schedulers")
    Cc: stable@vger.kernel.org
    Reported-by: Bart Van Assche
    Tested-by: Mike Snitzer
    Signed-off-by: Jens Axboe

    Jens Axboe
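
    A reconstruction of what such a bypass insert looks like (hedged;
    based on the description above, not quoted from the patch): the
    request goes straight onto the hardware context's dispatch list,
    never touching elevator callbacks:

        /* Sketch: insert a (cloned) request, bypassing any elevator. */
        void blk_mq_request_bypass_insert(struct request *rq)
        {
                struct blk_mq_hw_ctx *hctx =
                        blk_mq_map_queue(rq->q, rq->mq_ctx->cpu);

                spin_lock(&hctx->lock);
                list_add_tail(&rq->queuelist, &hctx->dispatch);
                spin_unlock(&hctx->lock);

                blk_mq_run_hw_queue(hctx, false);
        }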
     

24 Aug, 2017

1 commit

  • This way we don't need a block_device structure to submit I/O. The
    block_device has different lifetime rules from the gendisk and
    request_queue, and is usually only available when the block device node
    is open. Other callers need to explicitly create one (e.g. the lightnvm
    passthrough code, or the new nvme multipathing code).

    For the actual I/O path all that we need is the gendisk, which exists
    once per block device. But given that the block layer also does
    partition remapping we additionally need a partition index, which is
    used for said remapping in generic_make_request.

    Note that all the block drivers generally want request_queue or
    sometimes the gendisk, so this removes a layer of indirection all
    over the stack.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
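
    The new addressing model in bio terms (field names as introduced by
    this commit; the snippet is illustrative, not a quoted hunk):

        /* After this change a bio addresses its device like so: */
        bio->bi_disk   = disk;    /* the gendisk, one per block device */
        bio->bi_partno = partno;  /* 0 = whole disk, else partition index */
        /* generic_make_request() remaps bi_sector by the partition start
         * when bi_partno is non-zero. */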
     

18 Aug, 2017

1 commit

  • Calling blk_start_queue() from interrupt context with the queue
    lock held and without disabling IRQs, as the skd driver does, is
    safe. This patch prevents loading the skd driver from triggering
    the following warning:

    WARNING: CPU: 11 PID: 1348 at block/blk-core.c:283 blk_start_queue+0x84/0xa0
    RIP: 0010:blk_start_queue+0x84/0xa0
    Call Trace:
    skd_unquiesce_dev+0x12a/0x1d0 [skd]
    skd_complete_internal+0x1e7/0x5a0 [skd]
    skd_complete_other+0xc2/0xd0 [skd]
    skd_isr_completion_posted.isra.30+0x2a5/0x470 [skd]
    skd_isr+0x14f/0x180 [skd]
    irq_forced_thread_fn+0x2a/0x70
    irq_thread+0x144/0x1a0
    kthread+0x125/0x140
    ret_from_fork+0x2a/0x40

    Fixes: commit a038e2536472 ("[PATCH] blk_start_queue() must be called with irq disabled - add warning")
    Signed-off-by: Bart Van Assche
    Cc: Paolo 'Blaisorblade' Giarrusso
    Cc: Andrew Morton
    Cc: Christoph Hellwig
    Cc: Hannes Reinecke
    Cc: Johannes Thumshirn
    Cc:
    Signed-off-by: Jens Axboe

    Bart Van Assche
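
    A hedged reconstruction of the relaxed check in blk_start_queue():
    interrupt context is accepted, and only a process-context caller with
    IRQs enabled keeps warning:

        /* Before: any caller with IRQs enabled warned. */
        WARN_ON(!irqs_disabled());

        /* After (reconstructed): interrupt context is fine too. */
        WARN_ON_ONCE(!in_interrupt() && !irqs_disabled());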
     

24 Jul, 2017

1 commit

  • The blk-mq code lacks support for looking at the rpm_status field,
    tracking active requests, and the RQF_PM flag.

    Due to the default switch to blk-mq for SCSI, people have started to run
    into suspend/resume issues because of this, so make sure we disable the
    runtime PM functionality until it is properly implemented.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Ming Lei
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

04 Jul, 2017

1 commit

  • Currently all integrity prep hooks are open-coded, and if prepare fails
    we ignore its error code and fail the bio with EIO. Let's return the
    real error to the upper layer, so later the caller may react
    accordingly.

    In fact, no one wants to use bio_integrity_prep() without
    bio_integrity_enabled(), so it is reasonable to fold them into one
    function.

    Signed-off-by: Dmitry Monakhov
    Reviewed-by: Martin K. Petersen
    [hch: merged with the latest block tree,
    return bool from bio_integrity_prep]
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Dmitry Monakhov
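
    The resulting calling convention for submitters after this change (a
    sketch of a call site, not a quoted hunk):

        /* bio_integrity_prep() now returns bool and has already completed
         * the bio with the real error on failure. */
        if (!bio_integrity_prep(bio))
                return;    /* do not submit; error already propagated */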
     

21 Jun, 2017

4 commits

  • Some functions in block/blk-core.c must only be used on blk-sq queues
    while others are safe to use against any queue type. Document which
    functions are intended for blk-sq queues and issue a warning if the
    blk-sq API is misused. This not only helps block driver authors
    but will also make it easier to remove the blk-sq code once that code
    is declared obsolete.

    Signed-off-by: Bart Van Assche
    Reviewed-by: Christoph Hellwig
    Cc: Hannes Reinecke
    Cc: Omar Sandoval
    Cc: Ming Lei
    Signed-off-by: Jens Axboe

    Bart Van Assche
     
  • Instead of documenting the locking assumptions of most block layer
    functions as a comment, use lockdep_assert_held() to verify locking
    assumptions at runtime.

    Signed-off-by: Bart Van Assche
    Reviewed-by: Christoph Hellwig
    Cc: Hannes Reinecke
    Cc: Omar Sandoval
    Cc: Ming Lei
    Signed-off-by: Jens Axboe

    Bart Van Assche
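
    The pattern in question, shown on one example signature (the specific
    function chosen here is an assumption; the patch applied this across
    many of blk-core.c's legacy helpers):

        /* Runtime-verified locking documentation: */
        void blk_requeue_request(struct request_queue *q, struct request *rq)
        {
                lockdep_assert_held(q->queue_lock);  /* caller must hold it */
                /* ... existing body unchanged ... */
        }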
     
  • Several block drivers need to initialize the driver-private request
    data after having called blk_get_request() and before .prep_rq_fn()
    is called, e.g. when submitting a REQ_OP_SCSI_* request. Avoid that
    this initialization code has to be repeated after every
    blk_get_request() call by adding new callback functions to struct
    request_queue and to struct blk_mq_ops.

    Signed-off-by: Bart Van Assche
    Cc: Christoph Hellwig
    Cc: Hannes Reinecke
    Cc: Omar Sandoval
    Signed-off-by: Jens Axboe

    Bart Van Assche
     
  • Instead of declaring the second argument of blk_*_get_request()
    as int and passing it to functions that expect an unsigned int,
    declare that second argument as unsigned int. Also, for consistency,
    rename that second argument from 'rw' to 'op'.
    This patch does not change any functionality.

    Signed-off-by: Bart Van Assche
    Reviewed-by: Christoph Hellwig
    Cc: Hannes Reinecke
    Cc: Omar Sandoval
    Cc: Ming Lei
    Signed-off-by: Jens Axboe

    Bart Van Assche
     

20 Jun, 2017

1 commit

  • A new bio operation flag REQ_NOWAIT is introduced to identify bios
    originating from an iocb with IOCB_NOWAIT. This flag indicates that
    we should return immediately if a request cannot be made, instead
    of retrying.

    Stacked devices such as md (the ones with make_request_fn hooks)
    are currently not supported because they may block for housekeeping.
    For example, an md device can have a part of the device suspended.
    For this reason, only request-based devices are supported.
    In the future, this feature will be expanded to stacked devices
    by teaching them how to handle the REQ_NOWAIT flag.

    Reviewed-by: Christoph Hellwig
    Reviewed-by: Jens Axboe
    Signed-off-by: Goldwyn Rodrigues
    Signed-off-by: Jens Axboe

    Goldwyn Rodrigues
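
    How the flag is derived, sketched from the description (this mirrors
    what a direct-I/O submitter does; the snippet is illustrative, not a
    quoted hunk):

        /* Propagate the iocb's nowait intent onto the bio. */
        if (iocb->ki_flags & IOCB_NOWAIT)
                bio->bi_opf |= REQ_NOWAIT;   /* fail fast, don't sleep */
        /* a submitter that cannot get a request then sees -EAGAIN */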
     

19 Jun, 2017

2 commits

  • A rescuing bioset is only useful if there might be bios from
    that same bioset on the bio_list_on_stack queue at a time
    when bio_alloc_bioset() is called. This never applies to
    q->bio_split.

    Allocations from q->bio_split are only ever made from
    blk_queue_split() which is only ever called early in each of
    various make_request_fn()s. The original bio (call this A)
    is then passed to generic_make_request() and is placed on
    the bio_list_on_stack queue, and the bio that was allocated
    from q->bio_split (B) is processed.

    The processing of this may cause other bios to be passed to
    generic_make_request(), or may even cause the bio B itself to
    be passed on, possibly after some prefix has been split off
    (using some other bioset).

    generic_make_request() now guarantees that all of these bios
    (B and its dependants) will be fully processed before the tail
    of the original bio A gets handled. None of these early bios
    can possibly trigger an allocation from the original
    q->bio_split as they are either too small to require
    splitting or (more likely) are destined for a different queue.

    The next time that the original q->bio_split might be used
    by this thread is when A is processed again, as it might
    still be too big to handle directly. By this time there
    cannot be any other bios allocated from q->bio_split in the
    generic_make_request() queue. So no rescuing will ever be
    needed.

    Reviewed-by: Christoph Hellwig
    Reviewed-by: Ming Lei
    Signed-off-by: NeilBrown
    Signed-off-by: Jens Axboe

    NeilBrown
     
  • This patch converts bioset_create() to not create a workqueue by
    default, so allocations will never trigger punt_bios_to_rescuer(). It
    also introduces a new flag BIOSET_NEED_RESCUER which tells
    bioset_create() to preserve the old behavior.

    All callers of bioset_create() that are inside block device drivers,
    are given the BIOSET_NEED_RESCUER flag.

    biosets used by filesystems or other top-level users do not
    need rescuing as the bio can never be queued behind other
    bios. This includes fs_bio_set, blkdev_dio_pool,
    btrfs_bioset, xfs_ioend_bioset, and one allocated by
    target_core_iblock.c.

    biosets used by md/raid do not need rescuing as
    their usage was recently audited and revised to never
    risk deadlock.

    It is hoped that most, if not all, of the remaining biosets
    can end up being the non-rescued version.

    Reviewed-by: Christoph Hellwig
    Credit-to: Ming Lei (minor fixes)
    Reviewed-by: Ming Lei
    Signed-off-by: NeilBrown
    Signed-off-by: Jens Axboe

    NeilBrown
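
    Usage after this change, as the commit describes it (BIO_POOL_SIZE
    stands in for whatever pool size a driver chooses): a block-driver
    bioset opts back into the rescuer explicitly, while top-level users
    simply omit the flag:

        /* Driver-side bioset keeping the old rescuer behavior: */
        struct bio_set *bs = bioset_create(BIO_POOL_SIZE, 0,
                        BIOSET_NEED_BVECS | BIOSET_NEED_RESCUER);

        /* Filesystem-style bioset that never needs rescuing: */
        struct bio_set *fs_bs = bioset_create(BIO_POOL_SIZE, 0,
                        BIOSET_NEED_BVECS);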