27 Sep, 2019

1 commit


26 Sep, 2019

1 commit

  • cecf5d87ff20 ("block: split .sysfs_lock into two locks") starts to
    release & acquire sysfs_lock before registering/un-registering elevator
    queue during switching elevator for avoiding potential deadlock from
    showing & storing 'queue/iosched' attributes and removing elevator's
    kobject.

    Turns out there isn't such deadlock because 'q->sysfs_lock' isn't
    required in .show & .store of queue/iosched's attributes, and just
    elevator's sysfs lock is acquired in elv_iosched_store() and
    elv_iosched_show(). So it is safe to hold queue's sysfs lock when
    registering/un-registering elevator queue.

    The biggest issue is that commit cecf5d87ff20 assumes that concurrent
    writes to 'queue/scheduler' can't happen. However, this assumption isn't
    true, because kernfs_fop_write() only guarantees that concurrent writes
    aren't issued on the same open file, but the writes could come from
    different opens of the file. So we can't release & re-acquire queue's
    sysfs lock during switching elevator, otherwise use-after-free on
    elevator could be triggered.

    Fix the issue by not releasing queue's sysfs lock during switching
    elevator.

    Fixes: cecf5d87ff20 ("block: split .sysfs_lock into two locks")
    Cc: Christoph Hellwig
    Cc: Hannes Reinecke
    Cc: Greg KH
    Cc: Mike Snitzer
    Reviewed-by: Bart Van Assche
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     

18 Sep, 2019

1 commit

  • Pull block updates from Jens Axboe:

    - Two NVMe pull requests:
        - ana log parse fix from Anton
        - nvme quirks support for Apple devices from Ben
        - fix missing bio completion tracing for multipath stack devices
          from Hannes and Mikhail
        - IP TOS settings for nvme rdma and tcp transports from Israel
        - rq_dma_dir cleanups from Israel
        - tracing for Get LBA Status command from Minwoo
        - Some nvme-tcp cleanups from Minwoo, Potnuri and Myself
        - Some consolidation between the fabrics transports for handling
          the CAP register
        - reset race with ns scanning fix for fabrics (move fabrics
          commands to a dedicated request queue with a different lifetime
          from the admin request queue)
        - controller reset and namespace scan races fixes
        - nvme discovery log change uevent support
        - naming improvements from Keith
        - multiple discovery controllers reject fix from James
        - some regular cleanups from various people

    - Series fixing (and re-fixing) null_blk debug printing and nr_devices
    checks (André)

    - A few pull requests from Song, with fixes from Andy, Guoqing,
    Guilherme, Neil, Nigel, and Yufen.

    - REQ_OP_ZONE_RESET_ALL support (Chaitanya)

    - Bio merge handling unification (Christoph)

    - Pick default elevator correctly for devices with special needs
    (Damien)

    - Block stats fixes (Hou)

    - Timeout and support devices nbd fixes (Mike)

    - Series fixing races around elevator switching and device add/remove
    (Ming)

    - sed-opal cleanups (Revanth)

    - Per device weight support for BFQ (Fam)

    - Support for blk-iocost, a new model that can properly account cost of
    IO workloads. (Tejun)

    - blk-cgroup writeback fixes (Tejun)

    - paride queue init fixes (zhengbin)

    - blk_set_runtime_active() cleanup (Stanley)

    - Block segment mapping optimizations (Bart)

    - lightnvm fixes (Hans/Minwoo/YueHaibing)

    - Various little fixes and cleanups

    * tag 'for-5.4/block-2019-09-16' of git://git.kernel.dk/linux-block: (186 commits)
    null_blk: format pr_* logs with pr_fmt
    null_blk: match the type of parameter nr_devices
    null_blk: do not fail the module load with zero devices
    block: also check RQF_STATS in blk_mq_need_time_stamp()
    block: make rq sector size accessible for block stats
    bfq: Fix bfq linkage error
    raid5: use bio_end_sector in r5_next_bio
    raid5: remove STRIPE_OPS_REQ_PENDING
    md: add feature flag MD_FEATURE_RAID0_LAYOUT
    md/raid0: avoid RAID0 data corruption due to layout confusion.
    raid5: don't set STRIPE_HANDLE to stripe which is in batch list
    raid5: don't increment read_errors on EILSEQ return
    nvmet: fix a wrong error status returned in error log page
    nvme: send discovery log page change events to userspace
    nvme: add uevent variables for controller devices
    nvme: enable aen regardless of the presence of I/O queues
    nvme-fabrics: allow discovery subsystems accept a kato
    nvmet: Use PTR_ERR_OR_ZERO() in nvmet_init_discovery()
    nvme: Remove redundant assignment of cq vector
    nvme: Assign subsys instance from first ctrl
    ...

    Linus Torvalds
     

12 Sep, 2019

1 commit

  • cecf5d87ff20 ("block: split .sysfs_lock into two locks") starts to
    release & acquire sysfs_lock again during switching elevator. So it
    isn't enough to prevent switching elevator from happening by simply
    clearing QUEUE_FLAG_REGISTERED while holding sysfs_lock, because an
    in-progress switch can still move on after re-acquiring the lock,
    and meanwhile the QUEUE_FLAG_REGISTERED flag won't get checked.

    Fix this issue by checking 'q->elevator' directly & locklessly after
    q->kobj is removed in blk_unregister_queue(). This is safe because
    q->elevator can't be changed at that time.

    Fixes: cecf5d87ff20 ("block: split .sysfs_lock into two locks")
    Cc: Christoph Hellwig
    Cc: Hannes Reinecke
    Cc: Greg KH
    Cc: Mike Snitzer
    Cc: Bart Van Assche
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     

28 Aug, 2019

2 commits

  • The kernfs built-in lock of 'kn->count' is held in sysfs .show/.store
    path. Meantime, inside block's .show/.store callback, q->sysfs_lock is
    required.

    However, when mq & iosched kobjects are removed via
    blk_mq_unregister_dev() & elv_unregister_queue(), q->sysfs_lock is held
    too. This causes an AB-BA lock inversion because the kernfs built-in lock of
    'kn->count' is required inside kobject_del() too, see the lockdep warning[1].

    On the other hand, it isn't necessary to acquire q->sysfs_lock for
    both blk_mq_unregister_dev() & elv_unregister_queue() because
    clearing the REGISTERED flag prevents storing to 'queue/scheduler'
    from happening. Also sysfs write(store) is exclusive, so it isn't
    necessary to hold the lock for elv_unregister_queue() when it is
    called in the switching elevator path.

    So split .sysfs_lock into two: one is still named as .sysfs_lock for
    covering sync .store, the other one is named as .sysfs_dir_lock
    for covering kobjects and related status change.

    sysfs itself can handle the race between add/remove kobjects and
    showing/storing attributes under kobjects. For switching scheduler
    via storing to 'queue/scheduler', we use the queue flag of
    QUEUE_FLAG_REGISTERED with .sysfs_lock for avoiding the race, so
    we can avoid holding .sysfs_lock while removing/adding kobjects.

    [1] lockdep warning
    ======================================================
    WARNING: possible circular locking dependency detected
    5.3.0-rc3-00044-g73277fc75ea0 #1380 Not tainted
    ------------------------------------------------------
    rmmod/777 is trying to acquire lock:
    00000000ac50e981 (kn->count#202){++++}, at: kernfs_remove_by_name_ns+0x59/0x72

    but task is already holding lock:
    00000000fb16ae21 (&q->sysfs_lock){+.+.}, at: blk_unregister_queue+0x78/0x10b

    which lock already depends on the new lock.

    the existing dependency chain (in reverse order) is:

    -> #1 (&q->sysfs_lock){+.+.}:
    __lock_acquire+0x95f/0xa2f
    lock_acquire+0x1b4/0x1e8
    __mutex_lock+0x14a/0xa9b
    blk_mq_hw_sysfs_show+0x63/0xb6
    sysfs_kf_seq_show+0x11f/0x196
    seq_read+0x2cd/0x5f2
    vfs_read+0xc7/0x18c
    ksys_read+0xc4/0x13e
    do_syscall_64+0xa7/0x295
    entry_SYSCALL_64_after_hwframe+0x49/0xbe

    -> #0 (kn->count#202){++++}:
    check_prev_add+0x5d2/0xc45
    validate_chain+0xed3/0xf94
    __lock_acquire+0x95f/0xa2f
    lock_acquire+0x1b4/0x1e8
    __kernfs_remove+0x237/0x40b
    kernfs_remove_by_name_ns+0x59/0x72
    remove_files+0x61/0x96
    sysfs_remove_group+0x81/0xa4
    sysfs_remove_groups+0x3b/0x44
    kobject_del+0x44/0x94
    blk_mq_unregister_dev+0x83/0xdd
    blk_unregister_queue+0xa0/0x10b
    del_gendisk+0x259/0x3fa
    null_del_dev+0x8b/0x1c3 [null_blk]
    null_exit+0x5c/0x95 [null_blk]
    __se_sys_delete_module+0x204/0x337
    do_syscall_64+0xa7/0x295
    entry_SYSCALL_64_after_hwframe+0x49/0xbe

    other info that might help us debug this:

    Possible unsafe locking scenario:

    CPU0                    CPU1
    ----                    ----
    lock(&q->sysfs_lock);
                            lock(kn->count#202);
                            lock(&q->sysfs_lock);
    lock(kn->count#202);

    *** DEADLOCK ***

    2 locks held by rmmod/777:
    #0: 00000000e69bd9de (&lock){+.+.}, at: null_exit+0x2e/0x95 [null_blk]
    #1: 00000000fb16ae21 (&q->sysfs_lock){+.+.}, at: blk_unregister_queue+0x78/0x10b

    stack backtrace:
    CPU: 0 PID: 777 Comm: rmmod Not tainted 5.3.0-rc3-00044-g73277fc75ea0 #1380
    Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS ?-20180724_192412-buildhw-07.phx4
    Call Trace:
    dump_stack+0x9a/0xe6
    check_noncircular+0x207/0x251
    ? print_circular_bug+0x32a/0x32a
    ? find_usage_backwards+0x84/0xb0
    check_prev_add+0x5d2/0xc45
    validate_chain+0xed3/0xf94
    ? check_prev_add+0xc45/0xc45
    ? mark_lock+0x11b/0x804
    ? check_usage_forwards+0x1ca/0x1ca
    __lock_acquire+0x95f/0xa2f
    lock_acquire+0x1b4/0x1e8
    ? kernfs_remove_by_name_ns+0x59/0x72
    __kernfs_remove+0x237/0x40b
    ? kernfs_remove_by_name_ns+0x59/0x72
    ? kernfs_next_descendant_post+0x7d/0x7d
    ? strlen+0x10/0x23
    ? strcmp+0x22/0x44
    kernfs_remove_by_name_ns+0x59/0x72
    remove_files+0x61/0x96
    sysfs_remove_group+0x81/0xa4
    sysfs_remove_groups+0x3b/0x44
    kobject_del+0x44/0x94
    blk_mq_unregister_dev+0x83/0xdd
    blk_unregister_queue+0xa0/0x10b
    del_gendisk+0x259/0x3fa
    ? disk_events_poll_msecs_store+0x12b/0x12b
    ? check_flags+0x1ea/0x204
    ? mark_held_locks+0x1f/0x7a
    null_del_dev+0x8b/0x1c3 [null_blk]
    null_exit+0x5c/0x95 [null_blk]
    __se_sys_delete_module+0x204/0x337
    ? free_module+0x39f/0x39f
    ? blkcg_maybe_throttle_current+0x8a/0x718
    ? rwlock_bug+0x62/0x62
    ? __blkcg_punt_bio_submit+0xd0/0xd0
    ? trace_hardirqs_on_thunk+0x1a/0x20
    ? mark_held_locks+0x1f/0x7a
    ? do_syscall_64+0x4c/0x295
    do_syscall_64+0xa7/0x295
    entry_SYSCALL_64_after_hwframe+0x49/0xbe
    RIP: 0033:0x7fb696cdbe6b
    Code: 73 01 c3 48 8b 0d 1d 20 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 008
    RSP: 002b:00007ffec9588788 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
    RAX: ffffffffffffffda RBX: 0000559e589137c0 RCX: 00007fb696cdbe6b
    RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559e58913828
    RBP: 0000000000000000 R08: 00007ffec9587701 R09: 0000000000000000
    R10: 00007fb696d4eae0 R11: 0000000000000206 R12: 00007ffec95889b0
    R13: 00007ffec95896b3 R14: 0000559e58913260 R15: 0000559e589137c0

    Cc: Christoph Hellwig
    Cc: Hannes Reinecke
    Cc: Greg KH
    Cc: Mike Snitzer
    Reviewed-by: Bart Van Assche
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     
  • There are 4 users which check if queue is registered, so add one helper
    to check it.

    Cc: Christoph Hellwig
    Cc: Hannes Reinecke
    Cc: Greg KH
    Cc: Mike Snitzer
    Cc: Bart Van Assche
    Reviewed-by: Bart Van Assche
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
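
The lock split from the first commit of this date, together with the registered-check helper from the second, can be sketched in user space. This is a minimal model under stated assumptions, not kernel code: `toy_lock`, `model_queue` and every `model_*` function are hypothetical names, the locks are try-locks built on C11 atomics instead of kernel mutexes, and the return codes are placeholders for whatever error the kernel returns.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Toy lock built on a C11 atomic; stands in for a kernel mutex. */
struct toy_lock { atomic_bool held; };

static bool toy_trylock(struct toy_lock *l)
{
    /* atomic_exchange returns the previous value: false means we took it. */
    return !atomic_exchange(&l->held, true);
}

static void toy_unlock(struct toy_lock *l)
{
    atomic_store(&l->held, false);
}

/* Hypothetical stand-in for a request queue. */
struct model_queue {
    struct toy_lock sysfs_lock;      /* serializes attribute .store */
    struct toy_lock sysfs_dir_lock;  /* covers kobject add/remove */
    bool registered;                 /* models QUEUE_FLAG_REGISTERED */
};

static void model_queue_init(struct model_queue *q)
{
    atomic_init(&q->sysfs_lock.held, false);
    atomic_init(&q->sysfs_dir_lock.held, false);
    q->registered = true;
}

/* One helper instead of open-coded flag tests at every call site. */
static bool model_queue_registered(const struct model_queue *q)
{
    return q->registered;
}

/* Models a store to 'queue/scheduler': 0 ok, -1 not registered, -2 busy. */
static int model_scheduler_store(struct model_queue *q)
{
    int ret;

    if (!toy_trylock(&q->sysfs_lock))
        return -2;
    ret = model_queue_registered(q) ? 0 : -1;
    toy_unlock(&q->sysfs_lock);
    return ret;
}

/* Clear the flag under sysfs_lock, then remove kobjects under the other lock. */
static void model_unregister_queue(struct model_queue *q)
{
    while (!toy_trylock(&q->sysfs_lock))
        ;
    q->registered = false;
    toy_unlock(&q->sysfs_lock);

    while (!toy_trylock(&q->sysfs_dir_lock))
        ;
    /* kobject removal would happen here, with sysfs_lock not held */
    toy_unlock(&q->sysfs_dir_lock);
}
```

The point of the model is that kobject removal never runs with `sysfs_lock` held, so a reader or writer that needs `sysfs_lock` inside a sysfs callback cannot form the AB-BA cycle shown in the lockdep report.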
     

12 Aug, 2019

1 commit

  • blk_exit_queue will free elevator_data, while blk_mq_requeue_work
    will access it. Move cancel of requeue_work to the front of
    blk_exit_queue to avoid use-after-free.

    blk_exit_queue                blk_mq_requeue_work
      __elevator_exit               blk_mq_run_hw_queues
        blk_mq_exit_sched             blk_mq_run_hw_queue
          dd_exit_queue                 blk_mq_hctx_has_pending
            kfree(elevator_data)          blk_mq_sched_has_work
                                            dd_has_work

    Fixes: fbc2a15e3433 ("blk-mq: move cancel of requeue_work into blk_mq_release")
    Cc: stable@vger.kernel.org
    Reviewed-by: Ming Lei
    Signed-off-by: zhengbin
    Signed-off-by: Jens Axboe

    zhengbin
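
The ordering fix in this commit (cancel the requeue work first, free the elevator data second) can be illustrated with a small user-space model. Everything here is an assumption made for illustration: `model_queue` and the `model_*` functions are hypothetical, and a boolean stands in for a pending work item that the kernel would cancel with a synchronous work-cancel call.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for the queue state involved in the race. */
struct model_queue {
    int *elevator_data;     /* models the scheduler's private data */
    bool requeue_pending;   /* models a queued requeue work item */
};

/* Models the requeue work reaching the scheduler: it reads elevator_data. */
static int model_requeue_work(struct model_queue *q)
{
    if (!q->requeue_pending)
        return -1;          /* work was cancelled, nothing runs */
    q->requeue_pending = false;
    return *q->elevator_data;
}

/* Models a synchronous cancel: after this the work cannot run. */
static void model_cancel_requeue_work(struct model_queue *q)
{
    q->requeue_pending = false;
}

/* Models the fixed exit path: cancel first, free second. */
static void model_exit_queue(struct model_queue *q)
{
    model_cancel_requeue_work(q);   /* no more users of elevator_data */
    free(q->elevator_data);         /* now safe to free */
    q->elevator_data = NULL;
}
```

With the two steps in the opposite order, a work item that was still pending could dereference freed memory, which is the use-after-free the call chains above describe.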
     

07 Jun, 2019

1 commit

  • In theory, IO scheduler belongs to request queue, and the request pool
    of sched tags belongs to the request queue too.

    However, the current tags allocation interfaces are re-used for both
    driver tags and sched tags, and driver tags is definitely host wide,
    and doesn't belong to any request queue, same with its request pool.
    So we need tagset instance for freeing request of sched tags.

    Meanwhile, blk_mq_free_tag_set() often follows blk_cleanup_queue() in case
    of non-BLK_MQ_F_TAG_SHARED, which requires the request pool of sched
    tags to be freed before calling blk_mq_free_tag_set().

    Commit 47cdee29ef9d94e ("block: move blk_exit_queue into __blk_release_queue")
    moves blk_exit_queue into __blk_release_queue for simplifying the fast
    path in generic_make_request(), which then causes an oops when freeing requests
    of sched tags in __blk_release_queue().

    Fix the above issue by moving the freeing of the sched tags' request pool into
    blk_cleanup_queue(). This is safe because the queue has been frozen and there
    are no in-queue requests at that time. Freeing sched tags has to be kept in queue's
    release handler because there might be un-completed dispatch activity
    which might refer to sched tags.

    Cc: Bart Van Assche
    Cc: Christoph Hellwig
    Fixes: 47cdee29ef9d94e485eb08f962c74943023a5271 ("block: move blk_exit_queue into __blk_release_queue")
    Tested-by: Yi Zhang
    Reported-by: kernel test robot
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     

29 May, 2019

1 commit

  • Commit 498f6650aec8 ("block: Fix a race between the cgroup code and
    request queue initialization") moves what blk_exit_queue does into
    blk_cleanup_queue() for fixing issue caused by changing back
    queue lock.

    However, after the legacy request IO path is killed, the driver queue lock
    won't be used at all, and there is no longer any need to change back the
    queue lock. So the issue addressed by commit 498f6650aec8 doesn't
    exist any more.

    So move blk_exit_queue into __blk_release_queue.

    This patch basically reverts the following two commits:

    498f6650aec8 block: Fix a race between the cgroup code and request queue initialization
    24ecc3585348 block: Ensure that a request queue is dissociated from the cgroup controller

    Cc: Bart Van Assche
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     

22 Apr, 2019

1 commit


21 Mar, 2019

1 commit

  • q->poll_nsec == -1 means doing classic poll, not hybrid poll.
    Introduce a new flag BLK_MQ_POLL_CLASSIC to replace -1, which
    makes the code much easier to read.

    Additionally, since val is an int obtained with kstrtoint(), val can be
    a negative value other than -1, so return -EINVAL for that case.

    Thanks to Damien Le Moal for some good suggestions.

    Reviewed-by: Damien Le Moal
    Signed-off-by: Yufen Yu
    Signed-off-by: Jens Axboe

    Yufen Yu
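
The validation described in this commit can be sketched as a user-space function. This is a hedged model, not the kernel's store handler: `MODEL_POLL_CLASSIC` stands in for BLK_MQ_POLL_CLASSIC, `strtol()` substitutes for kstrtoint(), and both the microsecond-to-nanosecond conversion and the overflow guard are assumptions of this sketch.

```c
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

#define MODEL_POLL_CLASSIC (-1)   /* stands in for BLK_MQ_POLL_CLASSIC */

/* Parse a decimal int, allowing one trailing newline; 0 or -EINVAL. */
static int model_parse_int(const char *s, int *out)
{
    char *end;
    long v;

    errno = 0;
    v = strtol(s, &end, 10);
    if (end == s || errno == ERANGE || v < INT_MIN || v > INT_MAX)
        return -EINVAL;
    if (*end == '\n')
        end++;
    if (*end != '\0')
        return -EINVAL;
    *out = (int)v;
    return 0;
}

/* Models the store handler: -1 selects classic polling, other negatives fail. */
static int model_poll_delay_store(int *poll_nsec, const char *page)
{
    int val;

    if (model_parse_int(page, &val) < 0)
        return -EINVAL;

    if (val == MODEL_POLL_CLASSIC)
        *poll_nsec = MODEL_POLL_CLASSIC;     /* classic polling */
    else if (val >= 0 && val <= INT_MAX / 1000)
        *poll_nsec = val * 1000;             /* assumed usec to nsec */
    else
        return -EINVAL;                      /* e.g. "-2" */
    return 0;
}
```

Naming the sentinel makes the three cases (classic, hybrid with a delay, invalid) explicit at the call site, which is the readability gain the commit message refers to.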
     

11 Feb, 2019

2 commits


29 Dec, 2018

1 commit

  • Pull SCSI updates from James Bottomley:
    "This is mostly an update of the usual drivers: smartpqi, lpfc, qedi,
    megaraid_sas, libsas, zfcp, mpt3sas, hisi_sas.

    Additionally, we have a pile of annotation, unused variable and minor
    updates.

    The big API change is the updates for Christoph's DMA rework which
    include removing the DISABLE_CLUSTERING flag.

    And finally there are a couple of target tree updates"

    * tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (259 commits)
    scsi: isci: request: mark expected switch fall-through
    scsi: isci: remote_node_context: mark expected switch fall-throughs
    scsi: isci: remote_device: Mark expected switch fall-throughs
    scsi: isci: phy: Mark expected switch fall-through
    scsi: iscsi: Capture iscsi debug messages using tracepoints
    scsi: myrb: Mark expected switch fall-throughs
    scsi: megaraid: fix out-of-bound array accesses
    scsi: mpt3sas: mpt3sas_scsih: Mark expected switch fall-through
    scsi: fcoe: remove set but not used variable 'port'
    scsi: smartpqi: call pqi_free_interrupts() in pqi_shutdown()
    scsi: smartpqi: fix build warnings
    scsi: smartpqi: update driver version
    scsi: smartpqi: add ofa support
    scsi: smartpqi: increase fw status register read timeout
    scsi: smartpqi: bump driver version
    scsi: smartpqi: add smp_utils support
    scsi: smartpqi: correct lun reset issues
    scsi: smartpqi: correct volume status
    scsi: smartpqi: do not offline disks for transient did no connect conditions
    scsi: smartpqi: allow for larger raid maps
    ...

    Linus Torvalds
     

19 Dec, 2018

1 commit

  • Now that the SCSI layer replaced the use of the cluster flag with
    segment size limits and the DMA boundary we can remove the cluster flag
    from the block layer.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Jens Axboe
    Signed-off-by: Martin K. Petersen

    Christoph Hellwig
     

18 Dec, 2018

1 commit

  • The queue mapping of type poll only exists when set->map[HCTX_TYPE_POLL].nr_queues
    is bigger than zero, so enhance the constraint by checking .nr_queues of type poll
    before enabling IO poll.

    Otherwise IO race & timeout can be observed when running block/007.

    Cc: Jeff Moyer
    Cc: Christoph Hellwig
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
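
The stricter check described here (a poll map must exist and must have at least one queue) might look like the following user-space sketch. The types and names (`model_tag_set`, `model_queue_map`, `model_poll_store`, the `MODEL_HCTX_*` enum) are hypothetical mirrors of the structures the commit message mentions, not the kernel definitions.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical mirror of the hardware-context map types. */
enum model_hctx_type {
    MODEL_HCTX_TYPE_DEFAULT,
    MODEL_HCTX_TYPE_READ,
    MODEL_HCTX_TYPE_POLL,
    MODEL_HCTX_MAX_TYPES
};

struct model_queue_map { unsigned int nr_queues; };

struct model_tag_set {
    unsigned int nr_maps;
    struct model_queue_map map[MODEL_HCTX_MAX_TYPES];
};

/* Allow changing the poll setting only when poll queues really exist. */
static int model_poll_store(const struct model_tag_set *set,
                            bool *poll_enabled, bool want)
{
    if (!set ||
        set->nr_maps <= MODEL_HCTX_TYPE_POLL ||
        set->map[MODEL_HCTX_TYPE_POLL].nr_queues == 0)
        return -EINVAL;

    *poll_enabled = want;
    return 0;
}
```

Checking only that the map index is in range would accept a tag set whose poll map is present but empty, which is the case the commit tightens.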
     

05 Dec, 2018

1 commit


29 Nov, 2018

1 commit


16 Nov, 2018

3 commits

  • Various spots check for q->mq_ops being non-NULL, but provide
    a helper to do this instead.

    Where the ->mq_ops != NULL check is redundant, remove it.

    Since mq == rq-based now that legacy is gone, get rid of the
    queue_is_rq_based() and just use queue_is_mq() everywhere.

    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • With the legacy request path gone there is no good reason to keep
    queue_lock as a pointer, we can always use the embedded lock now.

    Reviewed-by: Hannes Reinecke
    Signed-off-by: Christoph Hellwig

    Fixed floppy and blk-cgroup missing conversions and half done edits.

    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • ->queue_flags is generally not set or cleared in the fast path, and also
    generally set or cleared one flag at a time. Make use of the normal
    atomic bitops for it so that we don't need to take the queue_lock,
    which is otherwise mostly unused in the core block layer now.

    Reviewed-by: Hannes Reinecke
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
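
The move to atomic bit operations for the queue flags can be modeled in user space with C11 atomics. This is a sketch under assumptions: `model_queue` and the `model_queue_flag_*` helpers are hypothetical, and `atomic_fetch_or`/`atomic_fetch_and` stand in for the kernel's atomic bitops.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a request queue with a flags word. */
struct model_queue { atomic_ulong queue_flags; };

/* Set one flag bit atomically; no lock is needed around the update. */
static void model_queue_flag_set(unsigned int flag, struct model_queue *q)
{
    atomic_fetch_or(&q->queue_flags, 1UL << flag);
}

/* Clear one flag bit atomically. */
static void model_queue_flag_clear(unsigned int flag, struct model_queue *q)
{
    atomic_fetch_and(&q->queue_flags, ~(1UL << flag));
}

/* Read one flag bit. */
static bool model_queue_flag_test(unsigned int flag, struct model_queue *q)
{
    return (atomic_load(&q->queue_flags) >> flag) & 1UL;
}
```

Because each update touches a single bit with an atomic read-modify-write, concurrent updates to different flags cannot lose each other's changes, which is what a lock was previously needed for.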
     

08 Nov, 2018

2 commits

  • This removes a bunch of core and elevator related code. On the core
    front, we remove anything related to queue running, draining,
    initialization, plugging, and congestions. We also kill anything
    related to request allocation, merging, retrieval, and completion.

    Remove any checking for single queue IO schedulers, as they no
    longer exist. This means we can also delete a bunch of code related
    to request issue, adding, completion, etc - and all the SQ related
    ops and helpers.

    Also kill the load_default_modules(), as all that did was provide
    for a way to load the default single queue elevator.

    Tested-by: Ming Lei
    Reviewed-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • It's now unused, kill it.

    Reviewed-by: Hannes Reinecke
    Tested-by: Ming Lei
    Reviewed-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Jens Axboe
     

31 Oct, 2018

1 commit

  • rq_qos_exit() removes the current q->rq_qos. This has to be
    done after the queue is frozen, otherwise the IO queue path may never
    be woken up, which causes an IO hang.

    So fix this issue by moving rq_qos_exit() after the queue is frozen.

    Cc: Josef Bacik
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     

26 Oct, 2018

2 commits

  • Drivers exposing zoned block devices have to initialize and maintain
    correctness (i.e. revalidate) of the device zone bitmaps attached to
    the device request queue (seq_zones_bitmap and seq_zones_wlock).

    To simplify coding this, introduce a generic helper function
    blk_revalidate_disk_zones() suitable for most (and likely all) cases.
    This new function always updates the seq_zones_bitmap and seq_zones_wlock
    bitmaps as well as the queue nr_zones field when called for a disk
    using a request based queue. For a disk using a BIO based queue, only
    the number of zones is updated since these queues do not have
    schedulers and so do not need the zone bitmaps.

    With this change, the zone bitmap initialization code in sd_zbc.c can be
    replaced with a call to this function in sd_zbc_read_zones(), which is
    called from the disk revalidate block operation method.

    A call to blk_revalidate_disk_zones() is also added to the null_blk
    driver for devices created with the zoned mode enabled.

    Finally, to ensure that zoned devices created with dm-linear or
    dm-flakey expose the correct number of zones through sysfs, a call to
    blk_revalidate_disk_zones() is added to dm_table_set_restrictions().

    The zone bitmaps allocated and initialized with
    blk_revalidate_disk_zones() are freed automatically from
    __blk_release_queue() using the block internal function
    blk_queue_free_zone_bitmaps().

    Reviewed-by: Hannes Reinecke
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Martin K. Petersen
    Reviewed-by: Mike Snitzer
    Signed-off-by: Damien Le Moal
    Signed-off-by: Jens Axboe

    Damien Le Moal
     
  • Expose through sysfs the nr_zones field of struct request_queue.
    Exposing this value helps in debugging disk issues as well as
    facilitating scripts based use of the disk (e.g. blktests).

    For zoned block devices, the nr_zones field indicates the total number
    of zones of the device calculated using the known disk capacity and
    zone size. This number of zones is always 0 for regular block devices.

    Since nr_zones is defined conditionally with CONFIG_BLK_DEV_ZONED,
    introduce the blk_queue_nr_zones() function to return the correct value
    for any device, regardless if CONFIG_BLK_DEV_ZONED is set.

    Reviewed-by: Christoph Hellwig
    Reviewed-by: Hannes Reinecke
    Signed-off-by: Damien Le Moal
    Signed-off-by: Jens Axboe

    Damien Le Moal
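
The accessor pattern described for nr_zones (a field that only exists under a config option, wrapped by a function that is always available) can be sketched as follows. `MODEL_CONFIG_BLK_DEV_ZONED`, `model_queue` and `model_queue_nr_zones()` are hypothetical stand-ins; the real helper may apply further checks that this model omits.

```c
/* Hypothetical stand-in for a request queue. */
struct model_queue {
#ifdef MODEL_CONFIG_BLK_DEV_ZONED
    unsigned int nr_zones;  /* only present when zoned support is built in */
#endif
    int unused;             /* keeps the struct non-empty in both builds */
};

#ifdef MODEL_CONFIG_BLK_DEV_ZONED
/* Zoned support compiled in: report the stored zone count. */
static inline unsigned int model_queue_nr_zones(const struct model_queue *q)
{
    return q->nr_zones;
}
#else
/* Zoned support compiled out: every device reports zero zones. */
static inline unsigned int model_queue_nr_zones(const struct model_queue *q)
{
    (void)q;
    return 0;
}
#endif
```

Callers such as a sysfs show routine can then use the accessor unconditionally, without their own `#ifdef` around the field access.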
     

23 Aug, 2018

1 commit

  • A previous commit removed the ability to have per-rq flags. We used
    those flags to maintain inflight counts. Since we don't have those
    anymore, we have to always maintain inflight counts, even if wbt is
    disabled. This is clearly suboptimal.

    Add a queue quiesce around changing the wbt latency settings from sysfs
    to work around this. With that, we can reliably put the enabled check in
    our bio_to_wbt_flags(), since we know the WBT_TRACKED flag will be
    consistent for the lifetime of the request.

    Fixes: c1c80384c8f ("block: remove external dependency on wbt_flags")
    Reviewed-by: Josef Bacik
    Signed-off-by: Jens Axboe

    Jens Axboe
     

12 Aug, 2018

1 commit

  • For legacy queues the only call of blkg_root_lookup() happens after
    bypass mode has been enabled. Since blkg_lookup() returns NULL for
    queues in bypass mode, modify the blkg_root_lookup() such that it
    no longer depends on bypass mode. Rename the function into
    blk_queue_root_blkg() as suggested by Tejun.

    Suggested-by: Tejun Heo
    Fixes: 6bad9b210a22 ("blkcg: Introduce blkg_root_lookup()")
    Signed-off-by: Bart Van Assche
    Cc: Tejun Heo
    Signed-off-by: Jens Axboe

    Bart Van Assche
     

09 Aug, 2018

1 commit

  • Several block drivers call alloc_disk() followed by put_disk() if
    something fails before device_add_disk() is called without calling
    blk_cleanup_queue(). Make sure that also for this scenario a request
    queue is dissociated from the cgroup controller. This patch avoids
    that loading the parport_pc, paride and pf drivers triggers the
    following kernel crash:

    BUG: KASAN: null-ptr-deref in pi_init+0x42e/0x580 [paride]
    Read of size 4 at addr 0000000000000008 by task modprobe/744
    Call Trace:
    dump_stack+0x9a/0xeb
    kasan_report+0x139/0x350
    pi_init+0x42e/0x580 [paride]
    pf_init+0x2bb/0x1000 [pf]
    do_one_initcall+0x8e/0x405
    do_init_module+0xd9/0x2f2
    load_module+0x3ab4/0x4700
    SYSC_finit_module+0x176/0x1a0
    do_syscall_64+0xee/0x2b0
    entry_SYSCALL_64_after_hwframe+0x42/0xb7

    Reported-by: Alexandru Moise
    Fixes: a063057d7c73 ("block: Fix a race between request queue removal and the block cgroup controller") # v4.17
    Signed-off-by: Bart Van Assche
    Tested-by: Alexandru Moise
    Reviewed-by: Johannes Thumshirn
    Cc: Tejun Heo
    Cc: Christoph Hellwig
    Cc: Ming Lei
    Cc: Alexandru Moise
    Cc: Joseph Qi
    Cc:
    Signed-off-by: Jens Axboe

    Bart Van Assche
     

09 Jul, 2018

1 commit


31 May, 2018

1 commit


25 May, 2018

1 commit

  • Convert the S_ symbolic permissions to their octal equivalents as
    using octal and not symbolic permissions is preferred by many as more
    readable.

    see: https://lkml.org/lkml/2016/8/2/1945

    Done with automated conversion via:
    $ ./scripts/checkpatch.pl -f --types=SYMBOLIC_PERMS --fix-inplace

    Miscellanea:

    o Wrapped modified multi-line calls to a single line where appropriate
    o Realign modified multi-line calls to open parenthesis

    Signed-off-by: Joe Perches
    Signed-off-by: Jens Axboe

    Joe Perches
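
As a quick illustration of the equivalence this conversion relies on, the POSIX permission bits from `<sys/stat.h>` combine to the familiar octal values. The `MODEL_S_IRUGO` macro and the two helper functions are hypothetical names for this sketch; the kernel's own combined shorthands live in kernel headers and are only mimicked here.

```c
#include <sys/stat.h>

/* User-space rebuild of a "read for user, group, other" shorthand. */
#define MODEL_S_IRUGO (S_IRUSR | S_IRGRP | S_IROTH)

/* Read-only attribute: symbolic form of octal 0444. */
static unsigned int model_perm_read_only(void)
{
    return MODEL_S_IRUGO;
}

/* Read for everyone plus owner write: symbolic form of octal 0644. */
static unsigned int model_perm_owner_writable(void)
{
    return MODEL_S_IRUGO | S_IWUSR;
}
```

Writing `0444` or `0644` directly conveys the same bits without requiring the reader to expand the macros mentally.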
     

15 May, 2018

1 commit


09 Mar, 2018

1 commit

  • Introduce functions that modify the queue flags and that protect
    these modifications with the request queue lock. Except for moving
    one wake_up_all() call from inside to outside a critical section,
    this patch does not change any functionality.

    Cc: Christoph Hellwig
    Cc: Hannes Reinecke
    Cc: Ming Lei
    Reviewed-by: Johannes Thumshirn
    Reviewed-by: Martin K. Petersen
    Signed-off-by: Bart Van Assche
    Signed-off-by: Jens Axboe

    Bart Van Assche
     

01 Mar, 2018

1 commit

  • Avoid that the following race can occur:

    blk_cleanup_queue()               blkcg_print_blkgs()
      spin_lock_irq(lock) (1)           spin_lock_irq(blkg->q->queue_lock) (2,5)
        q->queue_lock = &q->__queue_lock (3)
      spin_unlock_irq(lock) (4)
                                        spin_unlock_irq(blkg->q->queue_lock) (6)

    (1) take driver lock;
    (2) busy loop for driver lock;
    (3) override driver lock with internal lock;
    (4) unlock driver lock;
    (5) can take driver lock now;
    (6) but unlock internal lock.

    This change is safe because only the SCSI core and the NVME core keep
    a reference on a request queue after having called blk_cleanup_queue().
    Neither driver accesses any of the removed data structures between its
    blk_cleanup_queue() and blk_put_queue() calls.

    Reported-by: Joseph Qi
    Signed-off-by: Bart Van Assche
    Reviewed-by: Joseph Qi
    Cc: Jan Kara
    Signed-off-by: Jens Axboe

    Bart Van Assche
     

19 Jan, 2018

1 commit

  • The __blk_mq_register_dev(), blk_mq_unregister_dev(),
    elv_register_queue() and elv_unregister_queue() calls need to be
    protected with sysfs_lock, but the other code in these functions does not.
    Hence protect only these calls with sysfs_lock. This patch fixes a
    locking inversion issue in blk_unregister_queue() and also in an
    error path of blk_register_queue(): it is not allowed to hold
    sysfs_lock around the kobject_del(&q->kobj) call.

    Reviewed-by: Christoph Hellwig
    Signed-off-by: Bart Van Assche
    Signed-off-by: Jens Axboe

    Bart Van Assche
     

15 Jan, 2018

2 commits

  • Since I can remember DM has forced the block layer to allow the
    allocation and initialization of the request_queue to be distinct
    operations. The reason for this is that block/genhd.c:add_disk() requires
    that the request_queue (and associated bdi) be tied to the gendisk
    before add_disk() is called -- because add_disk() also deals with
    exposing the request_queue via blk_register_queue().

    DM's dynamic creation of arbitrary device types (and associated
    request_queue types) requires the DM device's gendisk be available so
    that DM table loads can establish a master/slave relationship with
    subordinate devices that are referenced by loaded DM tables -- using
    bd_link_disk_holder(). But until these DM tables, and their associated
    subordinate devices, are known DM cannot know what type of request_queue
    it needs -- nor what its queue_limits should be.

    This chicken and egg scenario has created all manner of problems for DM
    and, at times, the block layer.

    Summary of changes:

    - Add device_add_disk_no_queue_reg() and add_disk_no_queue_reg() variant
    that drivers may use to add a disk without also calling
    blk_register_queue(). Driver must call blk_register_queue() once its
    request_queue is fully initialized.

    - Return early from blk_unregister_queue() if QUEUE_FLAG_REGISTERED
    is not set. It won't be set if driver used add_disk_no_queue_reg()
    but driver encounters an error and must del_gendisk() before calling
    blk_register_queue().

    - Export blk_register_queue().

    These changes allow DM to use add_disk_no_queue_reg() to anchor its
    gendisk as the "master" for master/slave relationships DM must establish
    with subordinate devices referenced in DM tables that get loaded. Once
    all "slave" devices for a DM device are known its request_queue can be
    properly initialized and then advertised via sysfs -- important
    improvement being that no request_queue resource initialization
    performed by blk_register_queue() is missed for DM devices anymore.

    Signed-off-by: Mike Snitzer
    Reviewed-by: Ming Lei
    Signed-off-by: Jens Axboe

    Mike Snitzer
     
  • The original commit e9a823fb34a8b (block: fix warning when I/O elevator
    is changed as request_queue is being removed) is pretty conflated.
    "conflated" because the resource being protected by q->sysfs_lock isn't
    the queue_flags (it is the 'queue' kobj).

    q->sysfs_lock serializes __elevator_change() (via elv_iosched_store)
    from racing with blk_unregister_queue():
    1) By holding q->sysfs_lock first, __elevator_change() can complete
    before a racing blk_unregister_queue().
    2) Conversely, __elevator_change() is testing for QUEUE_FLAG_REGISTERED
    in case elv_iosched_store() loses the race with blk_unregister_queue(),
    it needs a way to know the 'queue' kobj isn't there.

    Expand the scope of blk_unregister_queue()'s q->sysfs_lock use so it is
    held until after the 'queue' kobj is removed.

    To do so blk_mq_unregister_dev() must not also take q->sysfs_lock. So
    rename __blk_mq_unregister_dev() to blk_mq_unregister_dev().

    Also, blk_unregister_queue() should use q->queue_lock to protect against
    any concurrent writes to q->queue_flags -- even though chances are the
    queue is being cleaned up so no concurrent writes are likely.

    Fixes: e9a823fb34a8b ("block: fix warning when I/O elevator is changed as request_queue is being removed")
    Signed-off-by: Mike Snitzer
    Reviewed-by: Ming Lei
    Signed-off-by: Jens Axboe

    Mike Snitzer
     

24 Nov, 2017

1 commit


02 Nov, 2017

1 commit

  • Many source files in the tree are missing licensing information, which
    makes it harder for compliance tools to determine the correct license.

    By default all files without license information are under the default
    license of the kernel, which is GPL version 2.

    Update the files which contain no license information with the 'GPL-2.0'
    SPDX license identifier. The SPDX identifier is a legally binding
    shorthand, which can be used instead of the full boiler plate text.

    This patch is based on work done by Thomas Gleixner and Kate Stewart and
    Philippe Ombredanne.

    How this work was done:

    Patches were generated and checked against linux-4.14-rc6 for a subset of
    the use cases:
    - file had no licensing information in it,
    - file was a */uapi/* one with no licensing information in it,
    - file was a */uapi/* one with existing licensing information.

    Further patches will be generated in subsequent months to fix up cases
    where non-standard license headers were used, and references to license
    had to be inferred by heuristics based on keywords.

    The analysis to determine which SPDX License Identifier to be applied to
    a file was done in a spreadsheet of side by side results from the
    output of two independent scanners (ScanCode & Windriver) producing SPDX
    tag:value files created by Philippe Ombredanne. Philippe prepared the
    base worksheet, and did an initial spot review of a few 1000 files.

    The 4.13 kernel was the starting point of the analysis with 60,537 files
    assessed. Kate Stewart did a file by file comparison of the scanner
    results in the spreadsheet to determine which SPDX license identifier(s)
    to be applied to the file. She confirmed any determination that was not
    immediately clear with lawyers working with the Linux Foundation.

    Criteria used to select files for SPDX license identifier tagging was:
    - Files considered eligible had to be source code files.
    - Make and config files were included as candidates if they contained >5
    lines of source
    - File already had some variant of a license header in it (even if
    Reviewed-by: Philippe Ombredanne
    Reviewed-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Greg Kroah-Hartman