05 Apr, 2018

1 commit

  • Pull char/misc updates from Greg KH:
    "Here is the big set of char/misc driver patches for 4.17-rc1.

    There are a lot of little things in here, nothing huge, but all
    important to the different hardware types involved:

    - thunderbolt driver updates

    - parport updates (people still care...)

    - nvmem driver updates

    - mei updates (as always)

    - hwtracing driver updates

    - hyperv driver updates

    - extcon driver updates

    - ... and a handful of even smaller driver subsystem and individual
    driver updates

    All of these have been in linux-next with no reported issues"

    * tag 'char-misc-4.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc: (149 commits)
    hwtracing: Add HW tracing support menu
    intel_th: Add ACPI glue layer
    intel_th: Allow forcing host mode through drvdata
    intel_th: Pick up irq number from resources
    intel_th: Don't touch switch routing in host mode
    intel_th: Use correct method of finding hub
    intel_th: Add SPDX GPL-2.0 header to replace GPLv2 boilerplate
    stm class: Make dummy's master/channel ranges configurable
    stm class: Add SPDX GPL-2.0 header to replace GPLv2 boilerplate
    MAINTAINERS: Bestow upon myself the care for drivers/hwtracing
    hv: add SPDX license id to Kconfig
    hv: add SPDX license to trace
    Drivers: hv: vmbus: do not mark HV_PCIE as perf_device
    Drivers: hv: vmbus: respect what we get from hv_get_synint_state()
    /dev/mem: Avoid overwriting "err" in read_mem()
    eeprom: at24: use SPDX identifier instead of GPL boiler-plate
    eeprom: at24: simplify the i2c functionality checking
    eeprom: at24: fix a line break
    eeprom: at24: tweak newlines
    eeprom: at24: refactor at24_probe()
    ...

    Linus Torvalds
     


16 Mar, 2018

1 commit

  • register_blkdev() and __register_chrdev_region() treat the major
    number as an unsigned int. So print it the same way to avoid
    absurd error statements such as:
    "... major requested (-1) is greater than the maximum (511) ..."
    (and also fix off-by-one bugs in the error prints).

    While at it, also update the comment describing register_blkdev().

    Signed-off-by: Srivatsa S. Bhat
    Reviewed-by: Logan Gunthorpe
    Signed-off-by: Greg Kroah-Hartman

    Srivatsa S. Bhat
     

03 Mar, 2018

1 commit

  • Pull block fixes from Jens Axboe:
    "A collection of fixes for this series. This is a little larger than
    usual at this time, but that's mainly because I was out on vacation
    last week. Nothing in here is major in any way, it's just two weeks of
    fixes. This contains:

    - NVMe pull from Keith, with a set of fixes from the usual suspects.

    - mq-deadline zone unlock fix from Damien, fixing an issue with the
    SMR zone locking added for 4.16.

    - two bcache fixes sent in by Michael, with changes from Coly and
    Tang.

    - comment typo fix from Eric for blktrace.

    - return-value error handling fix for nbd, from Gustavo.

    - fix a direct-io case where we don't defer to a completion handler,
    making us sleep from IRQ device completion. From Jan.

    - a small series from Jan fixing up holes around handling of bdev
    references.

    - small set of regression fixes from Jiufei, mostly fixing problems
    around the gendisk pointer -> partition index change.

    - regression fix from Ming, fixing a boundary issue with the discard
    page cache invalidation.

    - two-patch series from Ming, fixing both a core blk-mq-sched and
    kyber issue around token freeing on a requeue condition"

    * tag 'for-linus-20180302' of git://git.kernel.dk/linux-block: (24 commits)
    block: fix a typo
    block: display the correct diskname for bio
    block: fix the count of PGPGOUT for WRITE_SAME
    mq-deadline: Make sure to always unlock zones
    nvmet: fix PSDT field check in command format
    nvme-multipath: fix sysfs dangerously created links
    nbd: fix return value in error handling path
    bcache: fix kcrashes with fio in RAID5 backend dev
    bcache: correct flash only vols (check all uuids)
    blktrace_api.h: fix comment for struct blk_user_trace_setup
    blockdev: Avoid two active bdev inodes for one device
    genhd: Fix BUG in blkdev_open()
    genhd: Fix use after free in __blkdev_get()
    genhd: Add helper put_disk_and_module()
    genhd: Rename get_disk() to get_disk_and_module()
    genhd: Fix leaked module reference for NVME devices
    direct-io: Fix sleep in atomic due to sync AIO
    nvme-pci: Fix nvme queue cleanup if IRQ setup fails
    block: kyber: fix domain token leak during requeue
    blk-mq: don't call io sched's .requeue_request when requeueing rq to ->dispatch
    ...

    Linus Torvalds
     

01 Mar, 2018

3 commits

  • bio_devname() uses __bdevname() to display the device name, but that
    can only show the major and minor of part0. Fix this by using
    disk_name() to display the correct name.

    Fixes: 74d46992e0d9 ("block: replace bi_bdev with a gendisk pointer and partitions index")
    Reviewed-by: Omar Sandoval
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jiufei Xue
    Signed-off-by: Jens Axboe

    Jiufei Xue
     
  • The vm counters are counted in sectors, so we should do the conversion
    in submit_bio().

    Fixes: 74d46992e0d9 ("block: replace bi_bdev with a gendisk pointer and partitions index")
    Cc: stable@vger.kernel.org
    Reviewed-by: Omar Sandoval
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jiufei Xue
    Signed-off-by: Jens Axboe

    Jiufei Xue
     
  • In case of a failed write request (all retries failed) and when using
    libata, the SCSI error handler calls scsi_finish_command(). In the
    case of blk-mq this means that scsi_mq_done() does not get called,
    that blk_mq_complete_request() does not get called and also that the
    mq-deadline .completed_request() method is not called. This results in
    the target zone of the failed write request being left in a locked
    state, preventing that any new write requests are issued to the same
    zone.

    Fix this by replacing the .completed_request() method with the
    .finish_request() method as this method is always called whether or
    not a request completes successfully. Since the .finish_request()
    method is only called by the blk-mq core if a .prepare_request()
    method exists, add a dummy .prepare_request() method.

    Fixes: 5700f69178e9 ("mq-deadline: Introduce zone locking support")
    Cc: Hannes Reinecke
    Reviewed-by: Ming Lei
    Signed-off-by: Damien Le Moal
    [ bvanassche: edited patch description ]
    Signed-off-by: Bart Van Assche
    Signed-off-by: Jens Axboe

    Damien Le Moal
     

27 Feb, 2018

4 commits

  • When two blkdev_open() calls for a partition race with device removal
    and recreation, we can hit BUG_ON(!bd_may_claim(bdev, whole, holder)) in
    blkdev_open(). The race can happen as follows:

    CPU0: del_gendisk()
            bdev_unhash_inode(part1);
    CPU1: blkdev_open(part1, O_EXCL)
            bdev = bd_acquire()
    CPU2: blkdev_open(part1, O_EXCL)
            bdev = bd_acquire()
    CPU1: blkdev_get(bdev)
            bd_start_claiming(bdev)
            - finds old inode 'whole'
            bd_prepare_to_claim() -> 0
    CPU0: bdev_unhash_inode(whole);
          (device is removed and recreated under the same numbers)
    CPU2: blkdev_get(bdev);
            bd_start_claiming(bdev)
            - finds new inode 'whole'
            bd_prepare_to_claim()
            - this also succeeds as we have a
              different 'whole' here...
            - bad things happen now as we have
              two exclusive openers of the same bdev

    The problem here is that block device opens can see various intermediate
    states while gendisk is shutting down and then being recreated.

    We fix the problem by introducing a new lookup_sem in gendisk that
    synchronizes gendisk deletion with get_gendisk(), and furthermore by
    making sure that get_gendisk() does not return a gendisk that is being
    (or has been) deleted. This makes sure that once we manage to look up
    the newly created bdev inode, we are also guaranteed that a following
    get_gendisk() will either return failure (and we fail the open) or
    return the gendisk for the new device, and a following bdget_disk()
    will return the new bdev inode (i.e., blkdev_open() follows the path as
    if it were run entirely after the new device was created).

    Reported-and-analyzed-by: Hou Tao
    Tested-by: Hou Tao
    Signed-off-by: Jan Kara
    Signed-off-by: Jens Axboe

    Jan Kara
     
  • Add a proper counterpart to get_disk_and_module() -
    put_disk_and_module(). Currently it is open-coded in several places.

    Signed-off-by: Jan Kara
    Signed-off-by: Jens Axboe

    Jan Kara
     
  • Rename get_disk() to get_disk_and_module() to make clear what the
    function does. It's not a great name, but at least it is now obvious
    that put_disk() is not its counterpart.

    Signed-off-by: Jan Kara
    Signed-off-by: Jens Axboe

    Jan Kara
     
  • Commit 8ddcd653257c "block: introduce GENHD_FL_HIDDEN" added handling of
    hidden devices to get_gendisk() but forgot to drop module reference
    which is also acquired by get_disk(). Drop the reference as necessary.

    Arguably the function naming here is misleading as put_disk() is *not*
    the counterpart of get_disk() but let's fix that in the follow up
    commit since that will be more intrusive.

    Fixes: 8ddcd653257c ("block: introduce GENHD_FL_HIDDEN")
    CC: Christoph Hellwig
    Signed-off-by: Jan Kara
    Signed-off-by: Jens Axboe

    Jan Kara
     

25 Feb, 2018

2 commits

  • When requeuing a request, the domain token should be freed before
    re-inserting the request into the io scheduler. Otherwise the assigned
    domain token is leaked, and an IO hang can result.

    Cc: Paolo Valente
    Cc: Omar Sandoval
    Cc: stable@vger.kernel.org
    Reviewed-by: Bart Van Assche
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     
  • __blk_mq_requeue_request() covers two cases:

    - one is that the requeued request is added to hctx->dispatch, such as
    blk_mq_dispatch_rq_list()

    - another case is that the request is requeued to io scheduler, such as
    blk_mq_requeue_request().

    We should call the io scheduler's .requeue_request callback only for
    the second case.

    Cc: Paolo Valente
    Cc: Omar Sandoval
    Fixes: bd166ef183c2 ("blk-mq-sched: add framework for MQ capable IO schedulers")
    Cc: stable@vger.kernel.org
    Reviewed-by: Bart Van Assche
    Acked-by: Paolo Valente
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     


22 Feb, 2018

1 commit

  • On lkml, suggestions were made to split up such trivial typo fixes
    into per-subsystem patches:

    --- a/arch/x86/boot/compressed/eboot.c
    +++ b/arch/x86/boot/compressed/eboot.c
    @@ -439,7 +439,7 @@ setup_uga32(void **uga_handle, unsigned long size, u32 *width, u32 *height)
    struct efi_uga_draw_protocol *uga = NULL, *first_uga;
    efi_guid_t uga_proto = EFI_UGA_PROTOCOL_GUID;
    unsigned long nr_ugas;
    - u32 *handles = (u32 *)uga_handle;;
    + u32 *handles = (u32 *)uga_handle;
    efi_status_t status = EFI_INVALID_PARAMETER;
    int i;

    This patch is the result of the following script:

    $ sed -i 's/;;$/;/g' $(git grep -E ';;$' | grep "\.[ch]:" | grep -vwE 'for|ia64' | cut -d: -f1 | sort | uniq)

    ... followed by manual review to make sure it's all good.

    Splitting this up is just crazy talk, let's get over with this and just do it.

    Reported-by: Pavel Machek
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Ingo Molnar
     

14 Feb, 2018

1 commit

  • This removes the dependency on interrupts to wake up the polling task.
    Set the task state to TASK_RUNNING if need_resched() returns true while
    polling for IO completion. Earlier, the polling task used to sleep,
    relying on an interrupt to wake it up. This made some IO take very long
    when interrupt coalescing is enabled in NVMe.

    Reference:
    http://lists.infradead.org/pipermail/linux-nvme/2018-February/015435.html

    Changes since v2->v3:
    -using __set_current_state() instead of set_current_state()

    Changes since v1->v2:
    -setting task state once in blk_poll, instead of multiple
    callers.

    Signed-off-by: Nitesh Shetty
    Signed-off-by: Jens Axboe

    Nitesh Shetty
     

12 Feb, 2018

1 commit

  • This is the mindless scripted replacement of kernel use of POLL*
    variables as described by Al, done by this script:

    for V in IN OUT PRI ERR RDNORM RDBAND WRNORM WRBAND HUP RDHUP NVAL MSG; do
    L=`git grep -l -w POLL$V | grep -v '^t' | grep -v /um/ | grep -v '^sa' | grep -v '/poll.h$'|grep -v '^D'`
    for f in $L; do sed -i "-es/^\([^\"]*\)\(\\<POLL$V\\>\)/\\1E\\2/" $f; done
    done

    with de-mangling cleanups yet to come.

    NOTE! On almost all architectures, the EPOLL* constants have the same
    values as the POLL* constants do. But the keyword here is "almost".
    For various bad reasons they aren't the same, and epoll() doesn't
    actually work quite correctly in some cases due to this on Sparc et al.

    The next patch from Al will sort out the final differences, and we
    should be all done.

    Scripted-by: Al Viro
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

08 Feb, 2018

1 commit

  • Commit 'a6a252e64914 ("blk-mq-sched: decide how to handle flush rq via
    RQF_FLUSH_SEQ")' makes all non-flush re-prepared requests for a device
    be re-inserted into the active I/O scheduler for that device. As a
    consequence, I/O schedulers may get the same request inserted again,
    even several times, without a finish_request invoked on that request
    before each re-insertion.

    This fact is the cause of the failure reported in [1]. For an I/O
    scheduler, every re-insertion of the same re-prepared request is
    equivalent to the insertion of a new request. For schedulers like
    mq-deadline or kyber, this fact causes no harm. In contrast, it
    confuses a stateful scheduler like BFQ, which keeps state for an I/O
    request, until the finish_request hook is invoked on the request. In
    particular, BFQ may get stuck, waiting forever for the number of
    request dispatches, of the same request, to be balanced by an equal
    number of request completions (while there will be one completion for
    that request). In this state, BFQ may refuse to serve I/O requests
    from other bfq_queues. The hang reported in [1] then follows.

    However, the above re-prepared requests undergo a requeue, thus the
    requeue_request hook of the active elevator is invoked for these
    requests, if set. This commit then addresses the above issue by
    properly implementing the hook requeue_request in BFQ.

    [1] https://marc.info/?l=linux-block&m=151211117608676

    Reported-by: Ivan Kozik
    Reported-by: Alban Browaeys
    Tested-by: Mike Galbraith
    Signed-off-by: Paolo Valente
    Signed-off-by: Serena Ziviani
    Signed-off-by: Jens Axboe

    Paolo Valente
     

07 Feb, 2018

2 commits

  • The classic error injection mechanism, should_fail_request(), does not
    support use cases where more information is required (the entire
    struct bio, for example).

    To that end, this patch introduces should_fail_bio(), which calls
    should_fail_request() under the hood but provides a convenient
    place for kprobes to hook into if they require the entire struct bio.
    This patch also replaces some existing calls to should_fail_request()
    with should_fail_bio() with no degradation in performance.

    Signed-off-by: Howard McLauchlan
    Signed-off-by: Jens Axboe

    Howard McLauchlan
     
  • Mikulas reported a workload that saw bad performance, and figured
    out what it was due to various other types of requests being
    accounted as reads. Flush requests, for instance. Due to the
    high latency of those, we heavily throttle the writes to keep
    the latencies in balance. But they really should be accounted
    as writes.

    Fix this by checking the exact type of the request. If it's a
    read, account as a read, if it's a write or a flush, account
    as a write. Any other request we disregard. Previously everything
    would have been mistakenly accounted as reads.

    Reported-by: Mikulas Patocka
    Cc: stable@vger.kernel.org # v4.12+
    Signed-off-by: Jens Axboe

    Jens Axboe
     

05 Feb, 2018

1 commit

  • Pull more block updates from Jens Axboe:
    "Most of this is fixes and not new code/features:

    - skd fix from Arnd, fixing a build error dependent on slab allocator
    type.

    - blk-mq scheduler discard merging fixes, one from me and one from
    Keith. This fixes a segment miscalculation for blk-mq-sched, where
    we mistakenly think two segments are physically contiguous even
    though the request isn't carrying real data. Also fixes a bio-to-rq
    merge case.

    - Don't re-set a bit on the buffer_head flags, if it's already set.
    This can cause scalability concerns on bigger machines and
    workloads. From Kemi Wang.

    - Add BLK_STS_DEV_RESOURCE return value to blk-mq, allowing us to
    distinguish between a local (device related) resource starvation
    and a global one. The latter might happen without IO being in
    flight, so it has to be handled a bit differently. From Ming"

    * tag 'for-linus-20180204' of git://git.kernel.dk/linux-block:
    block: skd: fix incorrect linux/slab_def.h inclusion
    buffer: Avoid setting buffer bits that are already set
    blk-mq-sched: Enable merging discard bio into request
    blk-mq: fix discard merge with scheduler attached
    blk-mq: introduce BLK_STS_DEV_RESOURCE

    Linus Torvalds
     

02 Feb, 2018

2 commits

  • Signed-off-by: Keith Busch
    Signed-off-by: Jens Axboe

    Keith Busch
     
  • I ran into an issue on my laptop that triggered a bug on the
    discard path:

    WARNING: CPU: 2 PID: 207 at drivers/nvme/host/core.c:527 nvme_setup_cmd+0x3d3/0x430
    Modules linked in: rfcomm fuse ctr ccm bnep arc4 binfmt_misc snd_hda_codec_hdmi nls_iso8859_1 nls_cp437 vfat snd_hda_codec_conexant fat snd_hda_codec_generic iwlmvm snd_hda_intel snd_hda_codec snd_hwdep mac80211 snd_hda_core snd_pcm snd_seq_midi snd_seq_midi_event snd_rawmidi snd_seq x86_pkg_temp_thermal intel_powerclamp kvm_intel uvcvideo iwlwifi btusb snd_seq_device videobuf2_vmalloc btintel videobuf2_memops kvm snd_timer videobuf2_v4l2 bluetooth irqbypass videobuf2_core aesni_intel aes_x86_64 crypto_simd cryptd snd glue_helper videodev cfg80211 ecdh_generic soundcore hid_generic usbhid hid i915 psmouse e1000e ptp pps_core xhci_pci xhci_hcd intel_gtt
    CPU: 2 PID: 207 Comm: jbd2/nvme0n1p7- Tainted: G U 4.15.0+ #176
    Hardware name: LENOVO 20FBCTO1WW/20FBCTO1WW, BIOS N1FET59W (1.33 ) 12/19/2017
    RIP: 0010:nvme_setup_cmd+0x3d3/0x430
    RSP: 0018:ffff880423e9f838 EFLAGS: 00010217
    RAX: 0000000000000000 RBX: ffff880423e9f8c8 RCX: 0000000000010000
    RDX: ffff88022b200010 RSI: 0000000000000002 RDI: 00000000327f0000
    RBP: ffff880421251400 R08: ffff88022b200000 R09: 0000000000000009
    R10: 0000000000000000 R11: 0000000000000000 R12: 000000000000ffff
    R13: ffff88042341e280 R14: 000000000000ffff R15: ffff880421251440
    FS: 0000000000000000(0000) GS:ffff880441500000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 000055b684795030 CR3: 0000000002e09006 CR4: 00000000001606e0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    Call Trace:
    nvme_queue_rq+0x40/0xa00
    ? __sbitmap_queue_get+0x24/0x90
    ? blk_mq_get_tag+0xa3/0x250
    ? wait_woken+0x80/0x80
    ? blk_mq_get_driver_tag+0x97/0xf0
    blk_mq_dispatch_rq_list+0x7b/0x4a0
    ? deadline_remove_request+0x49/0xb0
    blk_mq_do_dispatch_sched+0x4f/0xc0
    blk_mq_sched_dispatch_requests+0x106/0x170
    __blk_mq_run_hw_queue+0x53/0xa0
    __blk_mq_delay_run_hw_queue+0x83/0xa0
    blk_mq_run_hw_queue+0x6c/0xd0
    blk_mq_sched_insert_request+0x96/0x140
    __blk_mq_try_issue_directly+0x3d/0x190
    blk_mq_try_issue_directly+0x30/0x70
    blk_mq_make_request+0x1a4/0x6a0
    generic_make_request+0xfd/0x2f0
    ? submit_bio+0x5c/0x110
    submit_bio+0x5c/0x110
    ? __blkdev_issue_discard+0x152/0x200
    submit_bio_wait+0x43/0x60
    ext4_process_freed_data+0x1cd/0x440
    ? account_page_dirtied+0xe2/0x1a0
    ext4_journal_commit_callback+0x4a/0xc0
    jbd2_journal_commit_transaction+0x17e2/0x19e0
    ? kjournald2+0xb0/0x250
    kjournald2+0xb0/0x250
    ? wait_woken+0x80/0x80
    ? commit_timeout+0x10/0x10
    kthread+0x111/0x130
    ? kthread_create_worker_on_cpu+0x50/0x50
    ? do_group_exit+0x3a/0xa0
    ret_from_fork+0x1f/0x30
    Code: 73 89 c1 83 ce 10 c1 e1 10 09 ca 83 f8 04 0f 87 0f ff ff ff 8b 4d 20 48 8b 7d 00 c1 e9 09 48 01 8c c7 00 08 00 00 e9 f8 fe ff ff ff 4c 89 c7 41 bc 0a 00 00 00 e8 0d 78 d6 ff e9 a1 fc ff ff
    ---[ end trace 50d361cc444506c8 ]---
    print_req_error: I/O error, dev nvme0n1, sector 847167488

    Decoding the assembly, the request claims to have 0xffff segments,
    while nvme counts two. This turns out to be because we don't check
    for a data carrying request on the mq scheduler path, and since
    blk_phys_contig_segment() returns true for a non-data request,
    we decrement the initial segment count of 0 and end up with
    0xffff in the unsigned short.

    There are a few issues here:

    1) We should initialize the segment count for a discard to 1.
    2) The discard merging is currently using the data limits for
    segments and sectors.

    Fix this up by having attempt_merge() correctly identify the
    request, and by initializing the segment count correctly
    for discards.

    This can only be triggered with mq-deadline on discard capable
    devices right now, which isn't a common configuration.

    Signed-off-by: Jens Axboe

    Jens Axboe
     

31 Jan, 2018

2 commits

  • This status is returned from the driver to the block layer if a
    device-related resource is unavailable, but the driver can guarantee
    that IO dispatch will be triggered in the future when the resource is
    available.

    Convert some drivers to return BLK_STS_DEV_RESOURCE. Also, if the
    driver returns BLK_STS_RESOURCE and SCHED_RESTART is set, rerun the
    queue after a delay (BLK_MQ_DELAY_QUEUE) to avoid IO stalls.
    BLK_MQ_DELAY_QUEUE is 3 ms because both scsi-mq and nvmefc are using
    that magic value.

    If a driver can make sure there is in-flight IO, it is safe to return
    BLK_STS_DEV_RESOURCE because:

    1) If all in-flight IOs complete before examining SCHED_RESTART in
    blk_mq_dispatch_rq_list(), SCHED_RESTART must be cleared, so queue
    is run immediately in this case by blk_mq_dispatch_rq_list();

    2) if there is any in-flight IO after/when examining SCHED_RESTART
    in blk_mq_dispatch_rq_list():
    - if SCHED_RESTART isn't set, queue is run immediately as handled in 1)
    - otherwise, this request will be dispatched after any in-flight IO is
    completed via blk_mq_sched_restart()

    3) if SCHED_RESTART is set concurrently in context because of
    BLK_STS_RESOURCE, blk_mq_delay_run_hw_queue() will cover the above two
    cases and make sure an IO hang can be avoided.

    One invariant is that queue will be rerun if SCHED_RESTART is set.

    Suggested-by: Jens Axboe
    Tested-by: Laurence Oberman
    Signed-off-by: Ming Lei
    Signed-off-by: Mike Snitzer
    Signed-off-by: Jens Axboe

    Ming Lei
     
  • Pull poll annotations from Al Viro:
    "This introduces a __bitwise type for POLL### bitmap, and propagates
    the annotations through the tree. Most of that stuff is as simple as
    'make ->poll() instances return __poll_t and do the same to local
    variables used to hold the future return value'.

    Some of the obvious brainos found in process are fixed (e.g. POLLIN
    misspelled as POLL_IN). At that point the amount of sparse warnings is
    low and most of them are for genuine bugs - e.g. ->poll() instance
    deciding to return -EINVAL instead of a bitmap. I hadn't touched those
    in this series - it's large enough as it is.

    Another problem it has caught was eventpoll() ABI mess; select.c and
    eventpoll.c assumed that corresponding POLL### and EPOLL### were
    equal. That's true for some, but not all of them - EPOLL### are
    arch-independent, but POLL### are not.

    The last commit in this series separates userland POLL### values from
    the (now arch-independent) kernel-side ones, converting between them
    in the few places where they are copied to/from userland. AFAICS, this
    is the least disruptive fix preserving poll(2) ABI and making epoll()
    work on all architectures.

    As it is, it's simply broken on sparc - try to give it EPOLLWRNORM and
    it will trigger only on what would've triggered EPOLLWRBAND on other
    architectures. EPOLLWRBAND and EPOLLRDHUP, OTOH, are never triggered
    at all on sparc. With this patch they should work consistently on all
    architectures"

    * 'misc.poll' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (37 commits)
    make kernel-side POLL... arch-independent
    eventpoll: no need to mask the result of epi_item_poll() again
    eventpoll: constify struct epoll_event pointers
    debugging printk in sg_poll() uses %x to print POLL... bitmap
    annotate poll(2) guts
    9p: untangle ->poll() mess
    ->si_band gets POLL... bitmap stored into a user-visible long field
    ring_buffer_poll_wait() return value used as return value of ->poll()
    the rest of drivers/*: annotate ->poll() instances
    media: annotate ->poll() instances
    fs: annotate ->poll() instances
    ipc, kernel, mm: annotate ->poll() instances
    net: annotate ->poll() instances
    apparmor: annotate ->poll() instances
    tomoyo: annotate ->poll() instances
    sound: annotate ->poll() instances
    acpi: annotate ->poll() instances
    crypto: annotate ->poll() instances
    block: annotate ->poll() instances
    x86: annotate ->poll() instances
    ...

    Linus Torvalds
     

30 Jan, 2018

1 commit

  • Pull block updates from Jens Axboe:
    "This is the main pull request for block IO related changes for the
    4.16 kernel. Nothing major in this pull request, but a good amount of
    improvements and fixes all over the map. This contains:

    - BFQ improvements, fixes, and cleanups from Angelo, Chiara, and
    Paolo.

    - Support for SMR zones for deadline and mq-deadline from Damien and
    Christoph.

    - Set of fixes for bcache by way of Michael Lyle, including fixes
    from himself, Kent, Rui, Tang, and Coly.

    - Series from Matias for lightnvm with fixes from Hans Holmberg,
    Javier, and Matias. Mostly centered around pblk, and removing rrpc 1.2
    in preparation for supporting 2.0.

    - A couple of NVMe pull requests from Christoph. Nothing major in
    here, just fixes and cleanups, and support for command tracing from
    Johannes.

    - Support for blk-throttle for tracking reads and writes separately.
    From Joseph Qi. A few cleanups/fixes also for blk-throttle from
    Weiping.

    - Series from Mike Snitzer that enables dm to register its queue more
    logically, something that's always been problematic on dm since it's a
    stacked device.

    - Series from Ming cleaning up some of the bio accessor use, in
    preparation for supporting multipage bvecs.

    - Various fixes from Ming closing up holes around queue mapping and
    quiescing.

    - BSD partition fix from Richard Narron, fixing a problem where we
    can't mount newer (10/11) FreeBSD partitions.

    - Series from Tejun reworking blk-mq timeout handling. The previous
    scheme relied on atomic bits, but it had races where we would think a
    request had timed out if it got reused at the wrong time.

    - null_blk now supports faking timeouts, to enable us to better
    exercise and test that functionality separately. From me.

    - Kill the separate atomic poll bit in the request struct. After
    this, we don't use the atomic bits on blk-mq anymore at all. From
    me.

    - sgl_alloc/free helpers from Bart.

    - Heavily contended tag case scalability improvement from me.

    - Various little fixes and cleanups from Arnd, Bart, Corentin,
    Douglas, Eryu, Goldwyn, and myself"

    * 'for-4.16/block' of git://git.kernel.dk/linux-block: (186 commits)
    block: remove smart1,2.h
    nvme: add tracepoint for nvme_complete_rq
    nvme: add tracepoint for nvme_setup_cmd
    nvme-pci: introduce RECONNECTING state to mark initializing procedure
    nvme-rdma: remove redundant boolean for inline_data
    nvme: don't free uuid pointer before printing it
    nvme-pci: Suspend queues after deleting them
    bsg: use pr_debug instead of hand crafted macros
    blk-mq-debugfs: don't allow write on attributes with seq_operations set
    nvme-pci: Fix queue double allocations
    block: Set BIO_TRACE_COMPLETION on new bio during split
    blk-throttle: use queue_is_rq_based
    block: Remove kblockd_schedule_delayed_work{,_on}()
    blk-mq: Avoid that blk_mq_delay_run_hw_queue() introduces unintended delays
    blk-mq: Rename blk_mq_request_direct_issue() into blk_mq_request_issue_directly()
    lib/scatterlist: Fix chaining support in sgl_alloc_order()
    blk-throttle: track read and write request individually
    block: add bdev_read_only() checks to common helpers
    block: fail op_is_write() requests to read-only partitions
    blk-throttle: export io_serviced_recursive, io_service_bytes_recursive
    ...

    Linus Torvalds
     

25 Jan, 2018

2 commits

  • Use pr_debug instead of hand-crafted macros. This way there is no need
    to re-compile the kernel to enable bsg debug output, and it's possible
    to selectively enable specific prints.

    Cc: Joe Perches
    Reviewed-by: Bart Van Assche
    Signed-off-by: Johannes Thumshirn
    Signed-off-by: Jens Axboe

    Johannes Thumshirn
     
  • Attributes that only implement .seq_ops are read-only; any write to
    them should be rejected. But currently the kernel would crash when
    writing to such debugfs entries, e.g.

    chmod +w /sys/kernel/debug/block/<dev>/requeue_list
    echo 0 > /sys/kernel/debug/block/<dev>/requeue_list
    chmod -w /sys/kernel/debug/block/<dev>/requeue_list

    Fix it by returning -EPERM in blk_mq_debugfs_write() when writing to
    such attributes.

    Cc: Ming Lei
    Signed-off-by: Eryu Guan
    Signed-off-by: Jens Axboe

    Eryu Guan
     



19 Jan, 2018

7 commits

  • In a mixed read/write workload on an SSD, write latency is much lower
    than read latency. But we currently only track and record read latency,
    and then use it as the threshold base for both read and write io
    latency accounting. As a result, write io latency will always be
    considered good and bad_bio_cnt will be much smaller than 20% of
    bio_cnt. That is to say, the tg being checked will be treated as idle
    most of the time and will let others dispatch more ios, even when it is
    truly running under its low limit and wants that limit to be
    guaranteed, which is not what we expect. So track read and write
    requests individually, which brings more precise latency control for
    low-limit idle detection.

    Signed-off-by: Joseph Qi
    Reviewed-by: Shaohua Li
    Signed-off-by: Jens Axboe

    Joseph Qi
     
  • Similar to blkdev_write_iter(), return -EPERM if the partition is
    read-only. This covers ioctl(), fallocate() and most in-kernel users
    but isn't meant to be exhaustive -- everything else will be caught in
    generic_make_request_checks(), fail with -EIO and can be fixed later.

    Reviewed-by: Sagi Grimberg
    Signed-off-by: Ilya Dryomov
    Signed-off-by: Jens Axboe

    Ilya Dryomov
     
  • Regular block device writes go through blkdev_write_iter(), which does
    bdev_read_only(), while zeroout/discard/etc requests are never checked,
    whether userspace- or kernel-triggered. Add a generic catch-all check
    to generic_make_request_checks() to actually enforce ioctl(BLKROSET)
    and set_disk_ro(), which is used by quite a few drivers for things like
    snapshots, read-only backing files/images, etc.

    Reviewed-by: Sagi Grimberg
    Signed-off-by: Ilya Dryomov
    Signed-off-by: Jens Axboe

    Ilya Dryomov
     
  • Export these two interfaces for cgroup v1.

    Acked-by: Tejun Heo
    Signed-off-by: weiping zhang
    Signed-off-by: Jens Axboe

    weiping zhang
     
  • The __blk_mq_register_dev(), blk_mq_unregister_dev(),
    elv_register_queue() and elv_unregister_queue() calls need to be
    protected with sysfs_lock, but the other code in these functions does
    not. Hence protect only those calls with sysfs_lock. This patch fixes a
    locking inversion issue in blk_unregister_queue() and also in an error
    path of blk_register_queue(): it is not allowed to hold sysfs_lock
    around the kobject_del(&q->kobj) call.

    Reviewed-by: Christoph Hellwig
    Signed-off-by: Bart Van Assche
    Signed-off-by: Jens Axboe

    Bart Van Assche
     
  • This patch does not change any functionality.

    Reviewed-by: Christoph Hellwig
    Signed-off-by: Bart Van Assche
    Signed-off-by: Jens Axboe

    Bart Van Assche
     
  • These two functions are only called from inside the block layer so
    unexport them.

    Reviewed-by: Christoph Hellwig
    Signed-off-by: Bart Van Assche
    Signed-off-by: Jens Axboe

    Bart Van Assche