16 Aug, 2020

1 commit

  • Pull block fixes from Jens Axboe:
    "A few fixes on the block side of things:

    - Discard granularity fix (Coly)

    - rnbd cleanups (Guoqing)

    - md error handling fix (Dan)

    - md sysfs fix (Junxiao)

    - Fix flush request accounting, which caused an IO slowdown for some
    configurations (Ming)

    - Properly propagate loop flag for partition scanning (Lennart)"

    * tag 'block-5.9-2020-08-14' of git://git.kernel.dk/linux-block:
    block: fix double account of flush request's driver tag
    loop: unset GENHD_FL_NO_PART_SCAN on LOOP_CONFIGURE
    rnbd: no need to set bi_end_io in rnbd_bio_map_kern
    rnbd: remove rnbd_dev_submit_io
    md-cluster: Fix potential error pointer dereference in resize_bitmaps()
    block: check queue's limits.discard_granularity in __blkdev_issue_discard()
    md: get sysfs entry after redundancy attr group create

    Linus Torvalds
     

12 Aug, 2020

1 commit

  • In case of none scheduler, we share the data request's driver tag for
    the flush request, so we have to mark the flush request as INFLIGHT to
    avoid double accounting of this driver tag.

    Fixes: 568f27006577 ("blk-mq: centralise related handling into blk_mq_get_driver_tag")
    Reported-by: Matthew Wilcox
    Signed-off-by: Ming Lei
    Tested-by: Matthew Wilcox
    Cc: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Ming Lei
     

11 Aug, 2020

1 commit

  • Pull locking updates from Thomas Gleixner:
    "A set of locking fixes and updates:

    - Untangle the header spaghetti which causes build failures in
    various situations caused by the lockdep additions to seqcount to
    validate that the write side critical sections are non-preemptible.

    - The seqcount associated lock debug addons which were blocked by the
    above fallout.

    seqcount writers contrary to seqlock writers must be externally
    serialized, which usually happens via locking - except for strict
    per CPU seqcounts. As the lock is not part of the seqcount, lockdep
    cannot validate that the lock is held.

    This new debug mechanism adds the concept of associated locks.
    The sequence count now has lock type variants and corresponding
    initializers which take a pointer to the associated lock used for
    writer serialization. If lockdep is enabled the pointer is stored
    and write_seqcount_begin() has a lockdep assertion to validate that
    the lock is held.

    Aside of the type and the initializer no other code changes are
    required at the seqcount usage sites. The rest of the seqcount API
    is unchanged and determines the type at compile time with the help
    of _Generic which is possible now that the minimal GCC version has
    been moved up.

    Adding this lockdep coverage unearthed a handful of seqcount bugs
    which have been addressed already independent of this.

    While generally useful this comes with a Trojan Horse twist: On RT
    kernels the write side critical section can become preemptible if
    the writers are serialized by an associated lock, which leads to
    the well known reader preempts writer livelock. RT prevents this by
    storing the associated lock pointer independent of lockdep in the
    seqcount and changing the reader side to block on the lock when a
    reader detects that a writer is in the write side critical section.

    - Conversion of seqcount usage sites to associated types and
    initializers"

    * tag 'locking-urgent-2020-08-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (25 commits)
    locking/seqlock, headers: Untangle the spaghetti monster
    locking, arch/ia64: Reduce header dependencies by moving XTP bits into the new header
    x86/headers: Remove APIC headers from <asm/smp.h>
    seqcount: More consistent seqprop names
    seqcount: Compress SEQCNT_LOCKNAME_ZERO()
    seqlock: Fold seqcount_LOCKNAME_init() definition
    seqlock: Fold seqcount_LOCKNAME_t definition
    seqlock: s/__SEQ_LOCKDEP/__SEQ_LOCK/g
    hrtimer: Use sequence counter with associated raw spinlock
    kvm/eventfd: Use sequence counter with associated spinlock
    userfaultfd: Use sequence counter with associated spinlock
    NFSv4: Use sequence counter with associated spinlock
    iocost: Use sequence counter with associated spinlock
    raid5: Use sequence counter with associated spinlock
    vfs: Use sequence counter with associated spinlock
    timekeeping: Use sequence counter with associated raw spinlock
    xfrm: policy: Use sequence counters with associated lock
    netfilter: nft_set_rbtree: Use sequence counter with associated rwlock
    netfilter: conntrack: Use sequence counter with associated spinlock
    sched: tasks: Use sequence counter with associated spinlock
    ...

    Linus Torvalds
     

07 Aug, 2020

1 commit

  • Pull SCSI updates from James Bottomley:
    "This consists of the usual driver updates (ufs, qla2xxx, tcmu, lpfc,
    hpsa, zfcp, scsi_debug) and minor bug fixes.

    We also have a huge docbook fix update like most other subsystems and
    no major update to the core (the few non trivial updates are either
    minor fixes or removing an unused feature [scsi_sdb_cache])"

    * tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (307 commits)
    scsi: scsi_transport_srp: Sanitize scsi_target_block/unblock sequences
    scsi: ufs-mediatek: Apply DELAY_AFTER_LPM quirk to Micron devices
    scsi: ufs: Introduce device quirk "DELAY_AFTER_LPM"
    scsi: virtio-scsi: Correctly handle the case where all LUNs are unplugged
    scsi: scsi_debug: Implement tur_ms_to_ready parameter
    scsi: scsi_debug: Fix request sense
    scsi: lpfc: Fix typo in comment for ULP
    scsi: ufs-mediatek: Prevent LPM operation on undeclared VCC
    scsi: iscsi: Do not put host in iscsi_set_flashnode_param()
    scsi: hpsa: Correct ctrl queue depth
    scsi: target: tcmu: Make TMR notification optional
    scsi: target: tcmu: Implement tmr_notify callback
    scsi: target: tcmu: Fix and simplify timeout handling
    scsi: target: tcmu: Factor out new helper ring_insert_padding
    scsi: target: tcmu: Do not queue aborted commands
    scsi: target: tcmu: Use priv pointer in se_cmd
    scsi: target: Add tmr_notify backend function
    scsi: target: Modify core_tmr_abort_task()
    scsi: target: iscsi: Fix inconsistent debug message
    scsi: target: iscsi: Fix login error when receiving
    ...

    Linus Torvalds
     

06 Aug, 2020

3 commits

  • If a loop device is created with a backing NVMe SSD, the current loop
    device driver doesn't correctly set its queue's
    limits.discard_granularity and leaves it as 0. For a discard request at
    LBA 0 on this loop device, the req_sects calculated in
    __blkdev_issue_discard() will be 0, and the resulting zero-length
    discard request triggers a BUG() panic in generic block layer code at
    block/blk-mq.c:563.

    [ 955.565006][ C39] ------------[ cut here ]------------
    [ 955.559660][ C39] invalid opcode: 0000 [#1] SMP NOPTI
    [ 955.622171][ C39] CPU: 39 PID: 248 Comm: ksoftirqd/39 Tainted: G E 5.8.0-default+ #40
    [ 955.622171][ C39] Hardware name: Lenovo ThinkSystem SR650 -[7X05CTO1WW]-/-[7X05CTO1WW]-, BIOS -[IVE160M-2.70]- 07/17/2020
    [ 955.622175][ C39] RIP: 0010:blk_mq_end_request+0x107/0x110
    [ 955.622177][ C39] Code: 48 8b 03 e9 59 ff ff ff 48 89 df 5b 5d 41 5c e9 9f ed ff ff 48 8b 35 98 3c f4 00 48 83 c7 10 48 83 c6 19 e8 cb 56 c9 ff eb cb 0b 0f 1f 80 00 00 00 00 0f 1f 44 00 00 55 48 89 e5 41 56 41 54
    [ 955.622179][ C39] RSP: 0018:ffffb1288701fe28 EFLAGS: 00010202
    [ 955.749277][ C39] RAX: 0000000000000001 RBX: ffff956fffba5080 RCX: 0000000000004003
    [ 955.749278][ C39] RDX: 0000000000000003 RSI: 0000000000000000 RDI: 0000000000000000
    [ 955.749279][ C39] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
    [ 955.749279][ C39] R10: ffffb1288701fd28 R11: 0000000000000001 R12: ffffffffa8e05160
    [ 955.749280][ C39] R13: 0000000000000004 R14: 0000000000000004 R15: ffffffffa7ad3a1e
    [ 955.749281][ C39] FS: 0000000000000000(0000) GS:ffff95bfbda00000(0000) knlGS:0000000000000000
    [ 955.749282][ C39] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [ 955.749282][ C39] CR2: 00007f6f0ef766a8 CR3: 0000005a37012002 CR4: 00000000007606e0
    [ 955.749283][ C39] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    [ 955.749284][ C39] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    [ 955.749284][ C39] PKRU: 55555554
    [ 955.749285][ C39] Call Trace:
    [ 955.749290][ C39] blk_done_softirq+0x99/0xc0
    [ 957.550669][ C39] __do_softirq+0xd3/0x45f
    [ 957.550677][ C39] ? smpboot_thread_fn+0x2f/0x1e0
    [ 957.550679][ C39] ? smpboot_thread_fn+0x74/0x1e0
    [ 957.550680][ C39] ? smpboot_thread_fn+0x14e/0x1e0
    [ 957.550684][ C39] run_ksoftirqd+0x30/0x60
    [ 957.550687][ C39] smpboot_thread_fn+0x149/0x1e0
    [ 957.886225][ C39] ? sort_range+0x20/0x20
    [ 957.886226][ C39] kthread+0x137/0x160
    [ 957.886228][ C39] ? kthread_park+0x90/0x90
    [ 957.886231][ C39] ret_from_fork+0x22/0x30
    [ 959.117120][ C39] ---[ end trace 3dacdac97e2ed164 ]---

    This is the procedure to reproduce the panic,
    # modprobe scsi_debug delay=0 dev_size_mb=2048 max_queue=1
    # losetup -f /dev/nvme0n1 --direct-io=on
    # blkdiscard /dev/loop0 -o 0 -l 0x200

    This patch fixes the issue by checking q->limits.discard_granularity in
    __blkdev_issue_discard() before composing the discard bio. If the value
    is 0, print a warning with oops information and return -EOPNOTSUPP to
    the caller to indicate that this buggy device driver doesn't support
    discard requests.

    Fixes: 9b15d109a6b2 ("block: improve discard bio alignment in __blkdev_issue_discard()")
    Fixes: c52abf563049 ("loop: Better discard support for block devices")
    Reported-and-suggested-by: Ming Lei
    Signed-off-by: Coly Li
    Reviewed-by: Ming Lei
    Reviewed-by: Hannes Reinecke
    Reviewed-by: Jack Wang
    Cc: Bart Van Assche
    Cc: Christoph Hellwig
    Cc: Darrick J. Wong
    Cc: Enzo Matsumiya
    Cc: Evan Green
    Cc: Jens Axboe
    Cc: Martin K. Petersen
    Cc: Xiao Ni
    Signed-off-by: Jens Axboe

    Coly Li
     
  • Pull block stacking updates from Jens Axboe:
    "The stacking related fixes depended on both the core block and drivers
    branches, so here's a topic branch with that change.

    Outside of that, a late fix from Johannes for zone revalidation"

    * tag 'for-5.9/block-merge-20200804' of git://git.kernel.dk/linux-block:
    block: don't do revalidate zones on invalid devices
    block: remove blk_queue_stack_limits
    block: remove bdev_stack_limits
    block: inherit the zoned characteristics in blk_stack_limits

    Linus Torvalds
     
  • Pull block driver updates from Jens Axboe:

    - NVMe:
    - ZNS support (Aravind, Keith, Matias, Niklas)
    - Misc cleanups, optimizations, fixes (Baolin, Chaitanya, David,
    Dongli, Max, Sagi)

    - null_blk zone capacity support (Aravind)

    - MD:
    - raid5/6 fixes (ChangSyun)
    - Warning fixes (Damien)
    - raid5 stripe fixes (Guoqing, Song, Yufen)
    - sysfs deadlock fix (Junxiao)
    - raid10 deadlock fix (Vitaly)

    - struct_size conversions (Gustavo)

    - Set of bcache updates/fixes (Coly)

    * tag 'for-5.9/drivers-20200803' of git://git.kernel.dk/linux-block: (117 commits)
    md/raid5: Allow degraded raid6 to do rmw
    md/raid5: Fix Force reconstruct-write io stuck in degraded raid5
    raid5: don't duplicate code for different paths in handle_stripe
    raid5-cache: hold spinlock instead of mutex in r5c_journal_mode_show
    md: print errno in super_written
    md/raid5: remove the redundant setting of STRIPE_HANDLE
    md: register new md sysfs file 'uuid' read-only
    md: fix max sectors calculation for super 1.0
    nvme-loop: remove extra variable in create ctrl
    nvme-loop: set ctrl state connecting after init
    nvme-multipath: do not fall back to __nvme_find_path() for non-optimized paths
    nvme-multipath: fix logic for non-optimized paths
    nvme-rdma: fix controller reset hang during traffic
    nvme-tcp: fix controller reset hang during traffic
    nvmet: introduce the passthru Kconfig option
    nvmet: introduce the passthru configfs interface
    nvmet: Add passthru enable/disable helpers
    nvmet: add passthru code to process commands
    nvme: export nvme_find_get_ns() and nvme_put_ns()
    nvme: introduce nvme_ctrl_get_by_path()
    ...

    Linus Torvalds
     

05 Aug, 2020

1 commit

  • Pull uninitialized_var() macro removal from Kees Cook:
    "This is long overdue, and has hidden too many bugs over the years. The
    series has several "by hand" fixes, and then a trivial treewide
    replacement.

    - Clean up non-trivial uses of uninitialized_var()

    - Update documentation and checkpatch for uninitialized_var() removal

    - Treewide removal of uninitialized_var()"

    * tag 'uninit-macro-v5.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
    compiler: Remove uninitialized_var() macro
    treewide: Remove uninitialized_var() usage
    checkpatch: Remove awareness of uninitialized_var() macro
    mm/debug_vm_pgtable: Remove uninitialized_var() usage
    f2fs: Eliminate usage of uninitialized_var() macro
    media: sur40: Remove uninitialized_var() usage
    KVM: PPC: Book3S PR: Remove uninitialized_var() usage
    clk: spear: Remove uninitialized_var() usage
    clk: st: Remove uninitialized_var() usage
    spi: davinci: Remove uninitialized_var() usage
    ide: Remove uninitialized_var() usage
    rtlwifi: rtl8192cu: Remove uninitialized_var() usage
    b43: Remove uninitialized_var() usage
    drbd: Remove uninitialized_var() usage
    x86/mm/numa: Remove uninitialized_var() usage
    docs: deprecated.rst: Add uninitialized_var()

    Linus Torvalds
     

04 Aug, 2020

2 commits

  • Pull io_uring updates from Jens Axboe:
    "Lots of cleanups in here, hardening the code and/or making it easier
    to read and fixing bugs, but a core feature/change too adding support
    for real async buffered reads. With the latter in place, we just need
    buffered write async support and we're done relying on kthreads for
    the fast path. In detail:

    - Cleanup how memory accounting is done on ring setup/free (Bijan)

    - sq array offset calculation fixup (Dmitry)

    - Consistently handle blocking off O_DIRECT submission path (me)

    - Support proper async buffered reads, instead of relying on kthread
    offload for that. This uses the page waitqueue to drive retries
    from task_work, like we handle poll based retry. (me)

    - IO completion optimizations (me)

    - Fix race with accounting and ring fd install (me)

    - Support EPOLLEXCLUSIVE (Jiufei)

    - Get rid of the io_kiocb unionizing, made possible by shrinking
    other bits (Pavel)

    - Completion side cleanups (Pavel)

    - Cleanup REQ_F_ flags handling, and kill off many of them (Pavel)

    - Request environment grabbing cleanups (Pavel)

    - File and socket read/write cleanups (Pavel)

    - Improve kiocb_set_rw_flags() (Pavel)

    - Tons of fixes and cleanups (Pavel)

    - IORING_SQ_NEED_WAKEUP clear fix (Xiaoguang)"

    * tag 'for-5.9/io_uring-20200802' of git://git.kernel.dk/linux-block: (127 commits)
    io_uring: flip if handling after io_setup_async_rw
    fs: optimise kiocb_set_rw_flags()
    io_uring: don't touch 'ctx' after installing file descriptor
    io_uring: get rid of atomic FAA for cq_timeouts
    io_uring: consolidate *_check_overflow accounting
    io_uring: fix stalled deferred requests
    io_uring: fix racy overflow count reporting
    io_uring: deduplicate __io_complete_rw()
    io_uring: de-unionise io_kiocb
    io-wq: update hash bits
    io_uring: fix missing io_queue_linked_timeout()
    io_uring: mark ->work uninitialised after cleanup
    io_uring: deduplicate io_grab_files() calls
    io_uring: don't do opcode prep twice
    io_uring: clear IORING_SQ_NEED_WAKEUP after executing task works
    io_uring: batch put_task_struct()
    tasks: add put_task_struct_many()
    io_uring: return locked and pinned page accounting
    io_uring: don't miscount pinned memory
    io_uring: don't open-code recv kbuf managment
    ...

    Linus Torvalds
     
  • Pull core block updates from Jens Axboe:
    "Good amount of cleanups and tech debt removals in here, and as a
    result, the diffstat shows a nice net reduction in code.

    - Softirq completion cleanups (Christoph)

    - Stop using ->queuedata (Christoph)

    - Cleanup bd claiming (Christoph)

    - Use check_events, moving away from the legacy media change
    (Christoph)

    - Use inode i_blkbits consistently (Christoph)

    - Remove old unused writeback congestion bits (Christoph)

    - Cleanup/unify submission path (Christoph)

    - Use bio_uninit consistently, instead of bio_disassociate_blkg
    (Christoph)

    - sbitmap cleared bits handling (John)

    - Request merging blktrace event addition (Jan)

    - sysfs add/remove race fixes (Luis)

    - blk-mq tag fixes/optimizations (Ming)

    - Duplicate words in comments (Randy)

    - Flush deferral cleanup (Yufen)

    - IO context locking/retry fixes (John)

    - struct_size() usage (Gustavo)

    - blk-iocost fixes (Chengming)

    - blk-cgroup IO stats fixes (Boris)

    - Various little fixes"

    * tag 'for-5.9/block-20200802' of git://git.kernel.dk/linux-block: (135 commits)
    block: blk-timeout: delete duplicated word
    block: blk-mq-sched: delete duplicated word
    block: blk-mq: delete duplicated word
    block: genhd: delete duplicated words
    block: elevator: delete duplicated word and fix typos
    block: bio: delete duplicated words
    block: bfq-iosched: fix duplicated word
    iocost_monitor: start from the oldest usage index
    iocost: Fix check condition of iocg abs_vdebt
    block: Remove callback typedefs for blk_mq_ops
    block: Use non _rcu version of list functions for tag_set_list
    blk-cgroup: show global disk stats in root cgroup io.stat
    blk-cgroup: make iostat functions visible to stat printing
    block: improve discard bio alignment in __blkdev_issue_discard()
    block: change REQ_OP_ZONE_RESET and REQ_OP_ZONE_RESET_ALL to be odd numbers
    block: defer flush request no matter whether we have elevator
    block: make blk_timeout_init() static
    block: remove retry loop in ioc_release_fn()
    block: remove unnecessary ioc nested locking
    block: integrate bd_start_claiming into __blkdev_get
    ...

    Linus Torvalds
     

03 Aug, 2020

1 commit

  • When we lose a device for whatever reason while (re)scanning zones, we
    trip over a NULL pointer in blk_revalidate_zone_cb, like in the following
    log:

    sd 0:0:0:0: [sda] 3418095616 4096-byte logical blocks: (14.0 TB/12.7 TiB)
    sd 0:0:0:0: [sda] 52156 zones of 65536 logical blocks
    sd 0:0:0:0: [sda] Write Protect is off
    sd 0:0:0:0: [sda] Mode Sense: 37 00 00 08
    sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
    sd 0:0:0:0: [sda] REPORT ZONES start lba 1065287680 failed
    sd 0:0:0:0: [sda] REPORT ZONES: Result: hostbyte=0x00 driverbyte=0x08
    sd 0:0:0:0: [sda] Sense Key : 0xb [current]
    sd 0:0:0:0: [sda] ASC=0x0 ASCQ=0x6
    sda: failed to revalidate zones
    sd 0:0:0:0: [sda] 0 4096-byte logical blocks: (0 B/0 B)
    sda: detected capacity change from 14000519643136 to 0
    ==================================================================
    BUG: KASAN: null-ptr-deref in blk_revalidate_zone_cb+0x1b7/0x550
    Write of size 8 at addr 0000000000000010 by task kworker/u4:1/58

    CPU: 1 PID: 58 Comm: kworker/u4:1 Not tainted 5.8.0-rc1 #692
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4-rebuilt.opensuse.org 04/01/2014
    Workqueue: events_unbound async_run_entry_fn
    Call Trace:
    dump_stack+0x7d/0xb0
    ? blk_revalidate_zone_cb+0x1b7/0x550
    kasan_report.cold+0x5/0x37
    ? blk_revalidate_zone_cb+0x1b7/0x550
    check_memory_region+0x145/0x1a0
    blk_revalidate_zone_cb+0x1b7/0x550
    sd_zbc_parse_report+0x1f1/0x370
    ? blk_req_zone_write_trylock+0x200/0x200
    ? sectors_to_logical+0x60/0x60
    ? blk_req_zone_write_trylock+0x200/0x200
    ? blk_req_zone_write_trylock+0x200/0x200
    sd_zbc_report_zones+0x3c4/0x5e0
    ? sd_dif_config_host+0x500/0x500
    blk_revalidate_disk_zones+0x231/0x44d
    ? _raw_write_lock_irqsave+0xb0/0xb0
    ? blk_queue_free_zone_bitmaps+0xd0/0xd0
    sd_zbc_read_zones+0x8cf/0x11a0
    sd_revalidate_disk+0x305c/0x64e0
    ? __device_add_disk+0x776/0xf20
    ? read_capacity_16.part.0+0x1080/0x1080
    ? blk_alloc_devt+0x250/0x250
    ? create_object.isra.0+0x595/0xa20
    ? kasan_unpoison_shadow+0x33/0x40
    sd_probe+0x8dc/0xcd2
    really_probe+0x20e/0xaf0
    __driver_attach_async_helper+0x249/0x2d0
    async_run_entry_fn+0xbe/0x560
    process_one_work+0x764/0x1290
    ? _raw_read_unlock_irqrestore+0x30/0x30
    worker_thread+0x598/0x12f0
    ? __kthread_parkme+0xc6/0x1b0
    ? schedule+0xed/0x2c0
    ? process_one_work+0x1290/0x1290
    kthread+0x36b/0x440
    ? kthread_create_worker_on_cpu+0xa0/0xa0
    ret_from_fork+0x22/0x30
    ==================================================================

    When the device is already gone we end up with the following scenario:
    The device's capacity is 0 and thus the number of zones will be 0 as well. When
    allocating the bitmap for the conventional zones, we then trip over a NULL
    pointer.

    So if we encounter a zoned block device with a 0 capacity, don't dare to
    revalidate the zone sizes.

    Fixes: 6c6b35491422 ("block: set the zone size in blk_revalidate_disk_zones atomically")
    Signed-off-by: Johannes Thumshirn
    Reviewed-by: Damien Le Moal
    Signed-off-by: Jens Axboe

    Johannes Thumshirn
     

01 Aug, 2020

7 commits


31 Jul, 2020

1 commit


29 Jul, 2020

1 commit

  • A sequence counter write side critical section must be protected by some
    form of locking to serialize writers. A plain seqcount_t does not
    contain the information of which lock must be held when entering a write
    side critical section.

    Use the new seqcount_spinlock_t data type, which allows to associate a
    spinlock with the sequence counter. This enables lockdep to verify that
    the spinlock used for writer serialization is held when the write side
    critical section is entered.

    If lockdep is disabled this lock association is compiled out and has
    neither storage size nor runtime overhead.

    Signed-off-by: Ahmed S. Darwish
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Daniel Wagner
    Link: https://lkml.kernel.org/r/20200720155530.1173732-21-a.darwish@linutronix.de

    Ahmed S. Darwish
     

28 Jul, 2020

1 commit

  • tag_set_list is only accessed under the tag_set_lock lock. There is
    no need for using the _rcu list functions.

    The _rcu list functions were introduced to allow read access to the
    tag_set_list protected under RCU, see 705cda97ee3a ("blk-mq: Make it
    safe to use RCU to iterate over blk_mq_tag_set.tag_list") and
    05b79413946d ("Revert "blk-mq: don't handle TAG_SHARED in restart"").
    Those changes got reverted later but the cleanup commit missed a
    couple of places to undo the changes.

    Fixes: 97889f9ac24f ("blk-mq: remove synchronize_rcu() from blk_mq_del_queue_tag_set()")
    Signed-off-by: Daniel Wagner
    Reviewed-by: Hannes Reinecke
    Cc: Ming Lei
    Signed-off-by: Jens Axboe

    Daniel Wagner
     

25 Jul, 2020

1 commit

  • Commit 05d18ae1cc8a ("scsi: pm: Balance pm_only counter of request queue
    during system resume") fixed a problem in the block layer's runtime-PM
    code: blk_set_runtime_active() failed to call blk_clear_pm_only().
    However, the commit's implementation was awkward; it forced the SCSI
    system-resume handler to choose whether to call blk_post_runtime_resume()
    or blk_set_runtime_active(), depending on whether or not the SCSI device
    had previously been runtime suspended.

    This patch simplifies the situation considerably by adding the missing
    function call directly into blk_set_runtime_active() (under the condition
    that the queue is not already in the RPM_ACTIVE state). This allows the
    SCSI routine to revert to its original form. Furthermore, making this
    change reveals that blk_post_runtime_resume() (in its success pathway) does
    exactly the same thing as blk_set_runtime_active(). The duplicate code is
    easily removed by making one routine call the other.

    No functional changes are intended.

    Link: https://lore.kernel.org/r/20200706151436.GA702867@rowland.harvard.edu
    CC: Can Guo
    CC: Bart Van Assche
    Reviewed-by: Bart Van Assche
    Signed-off-by: Alan Stern
    Signed-off-by: Martin K. Petersen

    Alan Stern
     

21 Jul, 2020

5 commits

  • This function is just a tiny wrapper around blk_stack_limits. Open code
    it in the two callers.

    Reviewed-by: Johannes Thumshirn
    Reviewed-by: Damien Le Moal
    Tested-by: Damien Le Moal
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • This function is just a tiny wrapper around blk_stack_limits and has
    two callers. Simplify the stack a bit by open coding it in the two
    callers.

    Reviewed-by: Johannes Thumshirn
    Reviewed-by: Damien Le Moal
    Tested-by: Damien Le Moal
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • Lift the code from device mapper into blk_stack_limits to inherit the
    zoned characteristics when stacking limits. This ensures we do the
    right thing for all stacked zoned block devices.

    Reviewed-by: Johannes Thumshirn
    Reviewed-by: Damien Le Moal
    Tested-by: Damien Le Moal
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • * for-5.9/drivers: (38 commits)
    block: add max_active_zones to blk-sysfs
    block: add max_open_zones to blk-sysfs
    s390/dasd: Use struct_size() helper
    s390/dasd: fix inability to use DASD with DIAG driver
    md-cluster: fix wild pointer of unlock_all_bitmaps()
    md/raid5-cache: clear MD_SB_CHANGE_PENDING before flushing stripes
    md: fix deadlock causing by sysfs_notify
    md: improve io stats accounting
    md: raid0/linear: fix dereference before null check on pointer mddev
    rsxx: switch from 'pci_free_consistent()' to 'dma_free_coherent()'
    nvme: remove ns->disk checks
    nvme-pci: use standard block status symbolic names
    nvme-pci: use the consistent return type of nvme_pci_iod_alloc_size()
    nvme-pci: add a blank line after declarations
    nvme-pci: fix some comments issues
    nvme-pci: remove redundant segment validation
    nvme: document quirked Intel models
    nvme: expose reconnect_delay and ctrl_loss_tmo via sysfs
    nvme: support for zoned namespaces
    nvme: support for multiple Command Sets Supported and Effects log pages
    ...

    Jens Axboe
     
  • * for-5.9/block: (124 commits)
    blk-cgroup: show global disk stats in root cgroup io.stat
    blk-cgroup: make iostat functions visible to stat printing
    block: improve discard bio alignment in __blkdev_issue_discard()
    block: change REQ_OP_ZONE_RESET and REQ_OP_ZONE_RESET_ALL to be odd numbers
    block: defer flush request no matter whether we have elevator
    block: make blk_timeout_init() static
    block: remove retry loop in ioc_release_fn()
    block: remove unnecessary ioc nested locking
    block: integrate bd_start_claiming into __blkdev_get
    block: use bd_prepare_to_claim directly in the loop driver
    block: refactor bd_start_claiming
    block: simplify the restart case in __blkdev_get
    Revert "blk-rq-qos: remove redundant finish_wait to rq_qos_wait."
    block: always remove partitions from blk_drop_partitions()
    block: relax jiffies rounding for timeouts
    blk-mq: remove redundant validation in __blk_mq_end_request()
    blk-mq: Remove unnecessary local variable
    writeback: remove bdi->congested_fn
    writeback: remove struct bdi_writeback_congested
    writeback: remove {set,clear}_wb_congested
    ...

    Jens Axboe
     

18 Jul, 2020

2 commits

  • In order to improve consistency and usability in cgroup stat accounting,
    we would like to support the root cgroup's io.stat.

    Since the root cgroup has processes doing io even if the system has no
    explicitly created cgroups, we need to be careful to avoid overhead in
    that case. For that reason, the rstat algorithms don't handle the root
    cgroup, so just turning the file on wouldn't give correct statistics.

    To get around this, we simulate flushing the iostat struct by filling it
    out directly from global disk stats. The result is a root cgroup io.stat
    file consistent with both /proc/diskstats and io.stat.

    Note that in order to collect the disk stats, we needed to iterate over
    devices. To facilitate that, we had to change the linkage of a disk_type
    to external so that it can be used from blk-cgroup.c to iterate over
    disks.

    Suggested-by: Tejun Heo
    Signed-off-by: Boris Burkov
    Acked-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Boris Burkov
     
  • Previously, the code which printed io.stat only needed access to the
    generic rstat flushing code, but since we plan to write some more
    specific code for preparing root cgroup stats, we need to manipulate
    iostat structs directly. Since declaring static functions ahead does not
    seem like common practice in this file, simply move the iostat functions
    up. We only plan to use blkg_iostat_set, but it seems better to keep them
    all together.

    Signed-off-by: Boris Burkov
    Acked-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Boris Burkov
     

17 Jul, 2020

6 commits

  • This patch improves discard bio splitting for address and size alignment
    in __blkdev_issue_discard(). Aligned discard bios may help the
    underlying device controller perform better discard and internal
    garbage collection, and avoid unnecessary internal fragmentation.

    The current discard bio split algorithm in __blkdev_issue_discard() may
    leave non-discarded fragments on the device even when the discard bio's
    LBA and size are both aligned to the device's discard granularity.

    Here are the example steps to reproduce the above problem.
    - On a VMWare ESXi 6.5 update3 installation, create a 51GB virtual disk
    in thin mode and give it to a Linux virtual machine.
    - Inside the Linux virtual machine, if the 50GB virtual disk shows up as
    /dev/sdb, fill data into the first 50GB by,
    # dd if=/dev/zero of=/dev/sdb bs=4096 count=13107200
    - Discard the 50GB range from offset 0 on /dev/sdb,
    # blkdiscard /dev/sdb -o 0 -l 53687091200
    - Observe the underlying mapping status of the device
    # sg_get_lba_status /dev/sdb -m 1048 --lba=0
    descriptor LBA: 0x0000000000000000 blocks: 2048 mapped (or unknown)
    descriptor LBA: 0x0000000000000800 blocks: 16773120 deallocated
    descriptor LBA: 0x0000000000fff800 blocks: 2048 mapped (or unknown)
    descriptor LBA: 0x0000000001000000 blocks: 8386560 deallocated
    descriptor LBA: 0x00000000017ff800 blocks: 2048 mapped (or unknown)
    descriptor LBA: 0x0000000001800000 blocks: 8386560 deallocated
    descriptor LBA: 0x0000000001fff800 blocks: 2048 mapped (or unknown)
    descriptor LBA: 0x0000000002000000 blocks: 8386560 deallocated
    descriptor LBA: 0x00000000027ff800 blocks: 2048 mapped (or unknown)
    descriptor LBA: 0x0000000002800000 blocks: 8386560 deallocated
    descriptor LBA: 0x0000000002fff800 blocks: 2048 mapped (or unknown)
    descriptor LBA: 0x0000000003000000 blocks: 8386560 deallocated
    descriptor LBA: 0x00000000037ff800 blocks: 2048 mapped (or unknown)
    descriptor LBA: 0x0000000003800000 blocks: 8386560 deallocated
    descriptor LBA: 0x0000000003fff800 blocks: 2048 mapped (or unknown)
    descriptor LBA: 0x0000000004000000 blocks: 8386560 deallocated
    descriptor LBA: 0x00000000047ff800 blocks: 2048 mapped (or unknown)
    descriptor LBA: 0x0000000004800000 blocks: 8386560 deallocated
    descriptor LBA: 0x0000000004fff800 blocks: 2048 mapped (or unknown)
    descriptor LBA: 0x0000000005000000 blocks: 8386560 deallocated
    descriptor LBA: 0x00000000057ff800 blocks: 2048 mapped (or unknown)
    descriptor LBA: 0x0000000005800000 blocks: 8386560 deallocated
    descriptor LBA: 0x0000000005fff800 blocks: 2048 mapped (or unknown)
    descriptor LBA: 0x0000000006000000 blocks: 6291456 deallocated
    descriptor LBA: 0x0000000006600000 blocks: 0 deallocated

    Although the discard bio starts at LBA 0 and has 50<<30 bytes size,
    which are perfectly aligned to the discard granularity, the above
    list unexpectedly shows many still-mapped 1MB (2048 sectors)
    internal fragments.

    The problem is in __blkdev_issue_discard(): an improper split
    algorithm produces bio sizes that are not aligned to the discard
    granularity.

    [snipped]
    57 while (nr_sects) {
    58 sector_t req_sects = min_t(sector_t, nr_sects,
    59 bio_allowed_max_sectors(q));
    60
    61 WARN_ON_ONCE((req_sects << 9) > UINT_MAX);
    62
    63 bio = blk_next_bio(bio, 0, gfp_mask);
    64 bio->bi_iter.bi_sector = sector;
    65 bio_set_dev(bio, bdev);
    66 bio_set_op_attrs(bio, op, 0);
    67
    68 bio->bi_iter.bi_size = req_sects << 9;
    69 sector += req_sects;
    70 nr_sects -= req_sects;
    [snipped]
    79 }
    80
    81 *biop = bio;
    82 return 0;
    83 }
    84 EXPORT_SYMBOL(__blkdev_issue_discard);

    At lines 58-59, to discard a 50GB range, req_sects is set to the
    return value of bio_allowed_max_sectors(q), which is 8388607 sectors.
    In the above case the discard granularity is 2048 sectors; although
    the start LBA and discard length are aligned to the discard
    granularity, req_sects itself never gets a chance to be aligned to
    it. This is why still-mapped 2048-sector fragments remain in every
    4GB or 8GB range.

    If req_sects at line 58 is instead set to a value aligned to
    discard_granularity and close to UINT_MAX, then all subsequent split
    bios inside the device driver are (almost all) aligned to the
    discard_granularity of the device queue, and the 2048-sector
    still-mapped fragments disappear.

    This patch introduces bio_aligned_discard_max_sectors() to return
    the value which is aligned to q->limits.discard_granularity and
    closest to UINT_MAX. Then this patch replaces
    bio_allowed_max_sectors() with this new routine to choose a more
    suitable split bio length.

    But we still need to handle the case where the discard start LBA is
    not aligned to q->limits.discard_granularity; otherwise, even with an
    aligned length, the current code may still leave a 2048-sector
    fragment around every 4GB range. Therefore, when calculating
    req_sects, the start LBA of the discard range (including the
    partition offset) is checked first; if it is not aligned to the
    discard granularity, the first split is placed so that the following
    bio has bi_sector aligned to the discard granularity. Then no
    still-mapped fragments remain in the middle of the discard range.

    The above is how this patch improves discard bio alignment in
    __blkdev_issue_discard(). Now, after discarding with the same
    command line mentioned previously, sg_get_lba_status returns,
    descriptor LBA: 0x0000000000000000 blocks: 106954752 deallocated
    descriptor LBA: 0x0000000006600000 blocks: 0 deallocated

    We can see there is no 2048-sector segment anymore; everything is clean.

    Reported-and-tested-by: Acshai Manoj
    Signed-off-by: Coly Li
    Reviewed-by: Hannes Reinecke
    Reviewed-by: Ming Lei
    Reviewed-by: Xiao Ni
    Cc: Bart Van Assche
    Cc: Christoph Hellwig
    Cc: Enzo Matsumiya
    Cc: Jens Axboe
    Signed-off-by: Jens Axboe

    Coly Li
     
  • Commit 7520872c0cf4 ("block: don't defer flushes on blk-mq + scheduling")
    tried to fix a deadlock caused by a cyclic wait between flush requests
    and the data requests on fq->flush_data_in_flight: the former held all
    driver tags while waiting for data request completion, but the latter
    could not complete because no driver tags were free.

    After commit 923218f6166a ("blk-mq: don't allocate driver tag upfront
    for flush rq"), flush requests will not get driver tag before queuing
    into flush queue.

    * With an elevator, a flush request only gets a scheduler tag before
    being inserted into the flush queue; it does not get a driver tag
    until it is issued to the driver. The data requests on
    fq->flush_data_in_flight will therefore complete eventually.

    * Without an elevator, each flush request gets a driver tag when the
    request is allocated, so the data requests on
    fq->flush_data_in_flight never lack driver tags.

    In both of these cases the cyclic wait cannot occur, so we may allow
    flush requests to be deferred.

    Signed-off-by: Yufen Yu
    Reviewed-by: Ming Lei
    Signed-off-by: Jens Axboe

    Yufen Yu
     
  • The sparse tool complains as follows:

    block/blk-timeout.c:93:12: warning:
    symbol 'blk_timeout_init' was not declared. Should it be static?

    Function blk_timeout_init() is not used outside of blk-timeout.c, so
    mark it static.

    Fixes: 9054650fac24 ("block: relax jiffies rounding for timeouts")
    Reported-by: Hulk Robot
    Signed-off-by: Wei Yongjun
    Signed-off-by: Jens Axboe

    Wei Yongjun
     
  • Using uninitialized_var() is dangerous as it papers over real bugs[1]
    (or can in the future), and suppresses unrelated compiler warnings
    (e.g. "unused variable"). If the compiler thinks it is uninitialized,
    either simply initialize the variable or make compiler changes.

    In preparation for removing[2] the[3] macro[4], remove all remaining
    needless uses with the following script:

    git grep '\buninitialized_var\b' | cut -d: -f1 | sort -u | \
    xargs perl -pi -e \
    's/\buninitialized_var\(([^\)]+)\)/\1/g;
    s:\s*/\* (GCC be quiet|to make compiler happy) \*/$::g;'

    drivers/video/fbdev/riva/riva_hw.c was manually tweaked to avoid
    pathological white-space.

    No outstanding warnings were found building allmodconfig with GCC 9.3.0
    for x86_64, i386, arm64, arm, powerpc, powerpc64le, s390x, mips, sparc64,
    alpha, and m68k.

    [1] https://lore.kernel.org/lkml/20200603174714.192027-1-glider@google.com/
    [2] https://lore.kernel.org/lkml/CA+55aFw+Vbj0i=1TGqCR5vQkCzWJ0QxK6CernOU6eedsudAixw@mail.gmail.com/
    [3] https://lore.kernel.org/lkml/CA+55aFwgbgqhbp1fkxvRKEpzyR5J8n1vKT1VZdz9knmPuXhOeg@mail.gmail.com/
    [4] https://lore.kernel.org/lkml/CA+55aFz2500WfbKXAx8s67wrm9=yVJu65TpLgN_ybYNv0VEOKA@mail.gmail.com/

    Reviewed-by: Leon Romanovsky # drivers/infiniband and mlx4/mlx5
    Acked-by: Jason Gunthorpe # IB
    Acked-by: Kalle Valo # wireless drivers
    Reviewed-by: Chao Yu # erofs
    Signed-off-by: Kees Cook

    Kees Cook
     
  • The reverse-order double lock dance in ioc_release_fn() is using a
    retry loop. This is a problem on PREEMPT_RT because it could preempt
    the task that would release q->queue_lock and thus live lock in the
    retry loop.

    RCU is already managing the freeing of the request queue and icq. If
    the trylock fails, use RCU to guarantee that the request queue and
    icq are not freed and re-acquire the locks in the correct order,
    allowing forward progress.

    Signed-off-by: John Ogness
    Reviewed-by: Daniel Wagner
    Signed-off-by: Jens Axboe

    John Ogness
     
  • The legacy CFQ IO scheduler could call put_io_context() in its exit_icq()
    elevator callback. This led to a lockdep warning, which was fixed in
    commit d8c66c5d5924 ("block: fix lockdep warning on io_context release
    put_io_context()") by using a nested subclass for the ioc spinlock.
    However, with commit f382fb0bcef4 ("block: remove legacy IO schedulers")
    the CFQ IO scheduler no longer exists.

    The BFQ IO scheduler also implements the exit_icq() elevator callback but
    does not call put_io_context().

    The nested subclass for the ioc spinlock is no longer needed. Since it
    existed as an exception and no longer applies, remove the nested subclass
    usage.

    Signed-off-by: John Ogness
    Reviewed-by: Daniel Wagner
    Signed-off-by: Jens Axboe

    John Ogness
     

16 Jul, 2020

2 commits

  • Add a new max_active_zones definition in the sysfs documentation.
    This definition will be common for all devices utilizing the zoned block
    device support in the kernel.

    Export max_active_zones according to this new definition for NVMe Zoned
    Namespace devices, ZAC ATA devices (which are treated as SCSI devices by
    the kernel), and ZBC SCSI devices.

    Add the new max_active_zones member to struct request_queue, rather
    than as a queue limit, since this property cannot be split across stacking
    drivers.

    For SCSI devices, even though max active zones is not part of the ZBC/ZAC
    spec, export max_active_zones as 0, signifying "no limit".

    Signed-off-by: Niklas Cassel
    Reviewed-by: Javier González
    Reviewed-by: Damien Le Moal
    Reviewed-by: Johannes Thumshirn
    Reviewed-by: Martin K. Petersen
    Signed-off-by: Jens Axboe

    Niklas Cassel
     
  • Add a new max_open_zones definition in the sysfs documentation.
    This definition will be common for all devices utilizing the zoned block
    device support in the kernel.

    Export max open zones according to this new definition for NVMe Zoned
    Namespace devices, ZAC ATA devices (which are treated as SCSI devices by
    the kernel), and ZBC SCSI devices.

    Add the new max_open_zones member to struct request_queue, rather
    than as a queue limit, since this property cannot be split across stacking
    drivers.

    Signed-off-by: Niklas Cassel
    Reviewed-by: Javier González
    Reviewed-by: Damien Le Moal
    Reviewed-by: Johannes Thumshirn
    Reviewed-by: Martin K. Petersen
    Signed-off-by: Jens Axboe

    Niklas Cassel
     

15 Jul, 2020

3 commits

  • This reverts commit 826f2f48da8c331ac51e1381998d318012d66550.

    Qian Cai reports that this commit causes stalls with swap. Revert until
    the reason can be figured out.

    Reported-by: Qian Cai
    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • In theory, when GENHD_FL_NO_PART_SCAN is set, no partitions can be
    created on a disk. However, ioctl(BLKPG, BLKPG_ADD_PARTITION) doesn't
    check GENHD_FL_NO_PART_SCAN, so partitions can still be added even
    though GENHD_FL_NO_PART_SCAN is set.

    So far, blk_drop_partitions() only removes partitions when
    disk_part_scan_enabled() returns true. This can leave ghost
    partitions on a loop device after changing/clearing the FD while
    PARTSCAN is disabled: partitions can still be added via 'parted' on
    the loop disk even though GENHD_FL_NO_PART_SCAN is set.

    Fix this issue by always removing partitions in blk_drop_partitions().
    This is correct because the current code assumes that no partitions
    can be added when GENHD_FL_NO_PART_SCAN is set.

    Signed-off-by: Ming Lei
    Reviewed-by: Christoph Hellwig
    Cc: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Ming Lei
     
  • In doing high IOPS testing, blk-mq is generally pretty well optimized.
    There are a few things that stuck out as using more CPU than what is
    really warranted, and one thing is the round_jiffies_up() that we do
    twice for each request. That accounts for about 0.8% of the CPU in
    my testing.

    We can make this cheaper by avoiding an integer division, by just adding
    a rough HZ mask that we can AND with instead. The timeouts are only on a
    second granularity already, we don't have to be that accurate here and
    this patch barely changes that. All we care about is nice grouping.

    Signed-off-by: Jens Axboe

    Jens Axboe