14 Nov, 2018

1 commit

  • commit 1adfc5e4136f5967d591c399aff95b3b035f16b7 upstream.

    Obviously the created discard bio has to be aligned with logical block size.

    This patch introduces the helper of bio_allowed_max_sectors() for
    this purpose.

    Cc: stable@vger.kernel.org
    Cc: Mike Snitzer
    Cc: Christoph Hellwig
    Cc: Xiao Ni
    Cc: Mariusz Dabrowski
    Fixes: 744889b7cbb56a6 ("block: don't deal with discard limit in blkdev_issue_discard()")
    Fixes: a22c4d7e34402cc ("block: re-add discard_granularity and alignment checks")
    Reported-by: Rui Salvaterra
    Tested-by: Rui Salvaterra
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Ming Lei
     

03 Jun, 2018

1 commit

  • If we end up splitting a bio and the queue goes away between
    the initial submission and the later split submission, then we
    can block forever in blk_queue_enter() waiting for the reference
    to drop to zero. This will never happen, since we already hold
    a reference.

    Mark a split bio as already having entered the queue, so we can
    just use the live non-blocking queue enter variant.

    Thanks to Tetsuo Handa for the analysis.

    Reported-by: syzbot+c4f9cebf9d651f6e54de@syzkaller.appspotmail.com
    Signed-off-by: Jens Axboe

    Jens Axboe
     

31 May, 2018

1 commit


09 May, 2018

1 commit

  • Currently, struct request has four timestamp fields:

    - A start time, set at get_request time, in jiffies, used for iostats
    - An I/O start time, set at start_request time, in ktime nanoseconds,
    used for blk-stats (i.e., wbt, kyber, hybrid polling)
    - Another start time and another I/O start time, used for cfq and bfq

    These can all be consolidated into one start time and one I/O start
    time, both in ktime nanoseconds, shaving off up to 16 bytes from struct
    request depending on the kernel config.

    Signed-off-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Omar Sandoval
     

02 Feb, 2018

1 commit

  • I ran into an issue on my laptop that triggered a bug on the
    discard path:

    WARNING: CPU: 2 PID: 207 at drivers/nvme/host/core.c:527 nvme_setup_cmd+0x3d3/0x430
    Modules linked in: rfcomm fuse ctr ccm bnep arc4 binfmt_misc snd_hda_codec_hdmi nls_iso8859_1 nls_cp437 vfat snd_hda_codec_conexant fat snd_hda_codec_generic iwlmvm snd_hda_intel snd_hda_codec snd_hwdep mac80211 snd_hda_core snd_pcm snd_seq_midi snd_seq_midi_event snd_rawmidi snd_seq x86_pkg_temp_thermal intel_powerclamp kvm_intel uvcvideo iwlwifi btusb snd_seq_device videobuf2_vmalloc btintel videobuf2_memops kvm snd_timer videobuf2_v4l2 bluetooth irqbypass videobuf2_core aesni_intel aes_x86_64 crypto_simd cryptd snd glue_helper videodev cfg80211 ecdh_generic soundcore hid_generic usbhid hid i915 psmouse e1000e ptp pps_core xhci_pci xhci_hcd intel_gtt
    CPU: 2 PID: 207 Comm: jbd2/nvme0n1p7- Tainted: G U 4.15.0+ #176
    Hardware name: LENOVO 20FBCTO1WW/20FBCTO1WW, BIOS N1FET59W (1.33 ) 12/19/2017
    RIP: 0010:nvme_setup_cmd+0x3d3/0x430
    RSP: 0018:ffff880423e9f838 EFLAGS: 00010217
    RAX: 0000000000000000 RBX: ffff880423e9f8c8 RCX: 0000000000010000
    RDX: ffff88022b200010 RSI: 0000000000000002 RDI: 00000000327f0000
    RBP: ffff880421251400 R08: ffff88022b200000 R09: 0000000000000009
    R10: 0000000000000000 R11: 0000000000000000 R12: 000000000000ffff
    R13: ffff88042341e280 R14: 000000000000ffff R15: ffff880421251440
    FS: 0000000000000000(0000) GS:ffff880441500000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 000055b684795030 CR3: 0000000002e09006 CR4: 00000000001606e0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    Call Trace:
    nvme_queue_rq+0x40/0xa00
    ? __sbitmap_queue_get+0x24/0x90
    ? blk_mq_get_tag+0xa3/0x250
    ? wait_woken+0x80/0x80
    ? blk_mq_get_driver_tag+0x97/0xf0
    blk_mq_dispatch_rq_list+0x7b/0x4a0
    ? deadline_remove_request+0x49/0xb0
    blk_mq_do_dispatch_sched+0x4f/0xc0
    blk_mq_sched_dispatch_requests+0x106/0x170
    __blk_mq_run_hw_queue+0x53/0xa0
    __blk_mq_delay_run_hw_queue+0x83/0xa0
    blk_mq_run_hw_queue+0x6c/0xd0
    blk_mq_sched_insert_request+0x96/0x140
    __blk_mq_try_issue_directly+0x3d/0x190
    blk_mq_try_issue_directly+0x30/0x70
    blk_mq_make_request+0x1a4/0x6a0
    generic_make_request+0xfd/0x2f0
    ? submit_bio+0x5c/0x110
    submit_bio+0x5c/0x110
    ? __blkdev_issue_discard+0x152/0x200
    submit_bio_wait+0x43/0x60
    ext4_process_freed_data+0x1cd/0x440
    ? account_page_dirtied+0xe2/0x1a0
    ext4_journal_commit_callback+0x4a/0xc0
    jbd2_journal_commit_transaction+0x17e2/0x19e0
    ? kjournald2+0xb0/0x250
    kjournald2+0xb0/0x250
    ? wait_woken+0x80/0x80
    ? commit_timeout+0x10/0x10
    kthread+0x111/0x130
    ? kthread_create_worker_on_cpu+0x50/0x50
    ? do_group_exit+0x3a/0xa0
    ret_from_fork+0x1f/0x30
    Code: 73 89 c1 83 ce 10 c1 e1 10 09 ca 83 f8 04 0f 87 0f ff ff ff 8b 4d 20 48 8b 7d 00 c1 e9 09 48 01 8c c7 00 08 00 00 e9 f8 fe ff ff ff 4c 89 c7 41 bc 0a 00 00 00 e8 0d 78 d6 ff e9 a1 fc ff ff
    ---[ end trace 50d361cc444506c8 ]---
    print_req_error: I/O error, dev nvme0n1, sector 847167488

    Decoding the assembly, the request claims to have 0xffff segments,
    while nvme counts two. This turns out to be because we don't check
    for a data carrying request on the mq scheduler path, and since
    blk_phys_contig_segment() returns true for a non-data request,
    we decrement the initial segment count of 0 and end up with
    0xffff in the unsigned short.

    There are a few issues here:

    1) We should initialize the segment count for a discard to 1.
    2) The discard merging is currently using the data limits for
    segments and sectors.

    Fix this up by having attempt_merge() correctly identify the
    request, and by initializing the segment count correctly
    for discards.

    This can only be triggered with mq-deadline on discard capable
    devices right now, which isn't a common configuration.

    Signed-off-by: Jens Axboe

    Jens Axboe
     

10 Jan, 2018

1 commit

  • This reverts commit a2d37968d784363842f87820a21e106741d28004.

    If max segment size isn't 512-aligned, this patch won't work well.

    Also once multipage bvec is enabled, adjacent bvecs won't be physically
    contiguous if page is added via bio_add_page(), so we don't need this
    kind of complicated logic.

    Reported-by: Dmitry Osipenko
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     

07 Jan, 2018

3 commits

  • In this case, 'sectors' can't be zero at all, so remove the check
    and let the bio be split.

    Reviewed-by: Christoph Hellwig
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     
  • When merging one bvec into segment, if the bvec is too big
    to merge, current policy is to move the whole bvec into another
    new segment.

    This patchset changes the policy into trying to maximize size of
    front segments, that means in above situation, part of bvec
    is merged into current segment, and the remainder is put
    into next segment.

    This patch prepares for support multipage bvec because
    it can be quite common to see this case and we should try
    to make front segments in full size.

    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     
  • It is enough to check and compute bio->bi_seg_front_size just
    after the 1st segment is found, but current code checks that
    for each bvec, which is inefficient.

    This patch follows the way in __blk_recalc_rq_segments()
    for computing bio->bi_seg_front_size, and it is more efficient
    and code becomes more readable too.

    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     

02 Nov, 2017

1 commit

  • Many source files in the tree are missing licensing information, which
    makes it harder for compliance tools to determine the correct license.

    By default all files without license information are under the default
    license of the kernel, which is GPL version 2.

    Update the files which contain no license information with the 'GPL-2.0'
    SPDX license identifier. The SPDX identifier is a legally binding
    shorthand, which can be used instead of the full boiler plate text.

    This patch is based on work done by Thomas Gleixner and Kate Stewart and
    Philippe Ombredanne.

    How this work was done:

    Patches were generated and checked against linux-4.14-rc6 for a subset of
    the use cases:
    - file had no licensing information it it.
    - file was a */uapi/* one with no licensing information in it,
    - file was a */uapi/* one with existing licensing information,

    Further patches will be generated in subsequent months to fix up cases
    where non-standard license headers were used, and references to license
    had to be inferred by heuristics based on keywords.

    The analysis to determine which SPDX License Identifier to be applied to
    a file was done in a spreadsheet of side by side results from of the
    output of two independent scanners (ScanCode & Windriver) producing SPDX
    tag:value files created by Philippe Ombredanne. Philippe prepared the
    base worksheet, and did an initial spot review of a few 1000 files.

    The 4.13 kernel was the starting point of the analysis with 60,537 files
    assessed. Kate Stewart did a file by file comparison of the scanner
    results in the spreadsheet to determine which SPDX license identifier(s)
    to be applied to the file. She confirmed any determination that was not
    immediately clear with lawyers working with the Linux Foundation.

    Criteria used to select files for SPDX license identifier tagging was:
    - Files considered eligible had to be source code files.
    - Make and config files were included as candidates if they contained >5
    lines of source
    - File already had some variant of a license header in it (even if
    Reviewed-by: Philippe Ombredanne
    Reviewed-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Greg Kroah-Hartman
     

24 Aug, 2017

1 commit

  • This way we don't need a block_device structure to submit I/O. The
    block_device has different life time rules from the gendisk and
    request_queue and is usually only available when the block device node
    is open. Other callers need to explicitly create one (e.g. the lightnvm
    passthrough code, or the new nvme multipathing code).

    For the actual I/O path all that we need is the gendisk, which exists
    once per block device. But given that the block layer also does
    partition remapping we additionally need a partition index, which is
    used for said remapping in generic_make_request.

    Note that all the block drivers generally want request_queue or
    sometimes the gendisk, so this removes a layer of indirection all
    over the stack.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

10 Aug, 2017

1 commit


28 Jun, 2017

1 commit

  • No functional changes in this patch, we just use up some holes
    in the bio and request structures to define a write hint that
    we psas down the stack.

    Ensure that we don't merge requests that have different life time
    hints assigned to them, and that we inherit the write hint when
    cloning a bio.

    Reviewed-by: Martin K. Petersen
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Jens Axboe
     

21 Jun, 2017

1 commit

  • Instead of documenting the locking assumptions of most block layer
    functions as a comment, use lockdep_assert_held() to verify locking
    assumptions at runtime.

    Signed-off-by: Bart Van Assche
    Reviewed-by: Christoph Hellwig
    Cc: Hannes Reinecke
    Cc: Omar Sandoval
    Cc: Ming Lei
    Signed-off-by: Jens Axboe

    Bart Van Assche
     

19 Jun, 2017

4 commits

  • blk_bio_segment_split() makes sure bios have no more than
    BIO_MAX_PAGES entries in the bi_io_vec.
    This was done because bio_clone_bioset() (when given a
    mempool bioset) could not handle larger io_vecs.

    No driver uses bio_clone_bioset() any more, they all
    use bio_clone_fast() if anything, and bio_clone_fast()
    doesn't clone the bi_io_vec.

    The main user of of bio_clone_bioset() at this level
    is bounce.c, and bouncing now happens before blk_bio_segment_split(),
    so that is not of concern.

    So remove the big helpful comment and the code.

    Reviewed-by: Christoph Hellwig
    Signed-off-by: NeilBrown
    Signed-off-by: Jens Axboe

    NeilBrown
     
  • bio_clone() is no longer used.
    Only bio_clone_bioset() or bio_clone_fast().
    This is for the best, as bio_clone() used fs_bio_set,
    and filesystems are unlikely to want to use bio_clone().

    So remove bio_clone() and all references.
    This includes a fix to some incorrect documentation.

    Reviewed-by: Christoph Hellwig
    Reviewed-by: Ming Lei
    Signed-off-by: NeilBrown
    Signed-off-by: Jens Axboe

    NeilBrown
     
  • Since commit 23688bf4f830 ("block: ensure to split after potentially
    bouncing a bio") blk_queue_bounce() is called *before*
    blk_queue_split().
    This means that:
    1/ the comments blk_queue_split() about bounce buffers are
    irrelevant, and
    2/ a very large bio (more than BIO_MAX_PAGES) will no longer be
    split before it arrives at blk_queue_bounce(), leading to the
    possibility that bio_clone_bioset() will fail and a NULL
    will be dereferenced.

    Separately, blk_queue_bounce() shouldn't use fs_bio_set as the bio
    being copied could be from the same set, and this could lead to a
    deadlock.

    So:
    - allocate 2 private biosets for blk_queue_bounce, one for
    splitting enormous bios and one for cloning bios.
    - add code to split a bio that exceeds BIO_MAX_PAGES.
    - Fix up the comments in blk_queue_split()

    Credit-to: Ming Lei (suggested using single bio_for_each_segment loop)
    Reviewed-by: Ming Lei
    Reviewed-by: Christoph Hellwig
    Signed-off-by: NeilBrown
    Signed-off-by: Jens Axboe

    NeilBrown
     
  • blk_queue_split() is always called with the last arg being q->bio_split,
    where 'q' is the first arg.

    Also blk_queue_split() sometimes uses the passed-in 'bs' and sometimes uses
    q->bio_split.

    This is inconsistent and unnecessary. Remove the last arg and always use
    q->bio_split inside blk_queue_split()

    Reviewed-by: Christoph Hellwig
    Reviewed-by: Ming Lei
    Credit-to: Javier González (Noticed that lightnvm was missed)
    Reviewed-by: Javier González
    Tested-by: Javier González
    Signed-off-by: NeilBrown
    Signed-off-by: Jens Axboe

    NeilBrown
     

09 Apr, 2017

1 commit


09 Feb, 2017

3 commits

  • Add a new merge strategy that merges discard bios into a request until the
    maximum number of discard ranges (or the maximum discard size) is reached
    from the plug merging code. I/O scheduler merging is not wired up yet
    but might also be useful, although not for fast devices like NVMe which
    are the only user for now.

    Note that for now we don't support limiting the size of each discard range,
    but if needed that can be added later.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • Switch these constants to an enum, and make let the compiler ensure that
    all callers of blk_try_merge and elv_merge handle all potential values.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • This makes it available outside of blk-merge.c, and inlining such a trivial
    helper seems pretty useful to start with.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

04 Feb, 2017

2 commits

  • If we end up doing a request-to-request merge when we have completed
    a bio-to-request merge, we free the request from deep down in that
    path. For blk-mq-sched, the merge path has to hold the appropriate
    lock, but we don't need it for freeing the request. And in fact
    holding the lock is problematic, since we are now calling the
    mq sched put_rq_private() hook with the lock held. Other call paths
    do not hold this lock.

    Fix this inconsistency by ensuring that the caller frees a merged
    request. Then we can do it outside of the lock, making it both more
    efficient and fixing the blk-mq-sched problem of invoking parts of
    the scheduler with an unknown lock state.

    Reported-by: Paolo Valente
    Signed-off-by: Jens Axboe
    Reviewed-by: Omar Sandoval

    Jens Axboe
     
  • When we attempt to merge request-to-request, we return a 0/1 if we
    ended up merging or not. Change that to return the pointer to the
    request that we freed. We will use this to move the freeing of
    that request out of the merge logic, so that callers can drop
    locks before freeing the request.

    There should be no functional changes in this patch.

    Signed-off-by: Jens Axboe
    Reviewed-by: Omar Sandoval

    Jens Axboe
     

18 Jan, 2017

2 commits

  • This adds a set of hooks that intercepts the blk-mq path of
    allocating/inserting/issuing/completing requests, allowing
    us to develop a scheduler within that framework.

    We reuse the existing elevator scheduler API on the registration
    side, but augment that with the scheduler flagging support for
    the blk-mq interfce, and with a separate set of ops hooks for MQ
    devices.

    We split driver and scheduler tags, so we can run the scheduling
    independently of device queue depth.

    Signed-off-by: Jens Axboe
    Reviewed-by: Bart Van Assche
    Reviewed-by: Omar Sandoval

    Jens Axboe
     
  • Prep patch for adding MQ ops as well, since doing anon unions with
    named initializers doesn't work on older compilers.

    Signed-off-by: Jens Axboe
    Reviewed-by: Johannes Thumshirn
    Reviewed-by: Bart Van Assche
    Reviewed-by: Omar Sandoval

    Jens Axboe
     

09 Dec, 2016

1 commit

  • Instead of allocating a single unused biovec for discard requests, send
    them down without any payload. Instead we allow the driver to add a
    "special" payload using a biovec embedded into struct request (unioned
    over other fields never used while in the driver), and overloading
    the number of segments for this case.

    This has a couple of advantages:

    - we don't have to allocate the bio_vec
    - the amount of special casing for discard requests in the block
    layer is significantly reduced
    - using this same scheme for other request types is trivial,
    which will be important for implementing the new WRITE_ZEROES
    op on devices where it actually requires a payload (e.g. SCSI)
    - we can get rid of playing games with the request length, as
    we'll never touch it and completions will work just fine
    - it will allow us to support ranged discard operations in the
    future by merging non-contiguous discard bios into a single
    request
    - last but not least it removes a lot of code

    This patch is the common base for my WIP series for ranges discards and to
    remove discard_zeroes_data in favor of always using REQ_OP_WRITE_ZEROES,
    so it would be good to get it in quickly.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

01 Dec, 2016

2 commits

  • Factor out common code for setting REQ_NOMERGE flag which is being used
    out at certain places and make it a helper instead, req_set_nomerge().

    Signed-off-by: Ritesh Harjani

    Get rid of the inline.

    Signed-off-by: Jens Axboe

    Ritesh Harjani
     
  • This adds a new block layer operation to zero out a range of
    LBAs. This allows to implement zeroing for devices that don't use
    either discard with a predictable zero pattern or WRITE SAME of zeroes.
    The prominent example of that is NVMe with the Write Zeroes command,
    but in the future, this should also help with improving the way
    zeroing discards work. For this operation, suitable entry is exported in
    sysfs which indicate the number of maximum bytes allowed in one
    write zeroes operation by the device.

    Signed-off-by: Chaitanya Kulkarni
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Chaitanya Kulkarni
     

28 Oct, 2016

1 commit

  • A lot of the REQ_* flags are only used on struct requests, and only of
    use to the block layer and a few drivers that dig into struct request
    internals.

    This patch adds a new req_flags_t rq_flags field to struct request for
    them, and thus dramatically shrinks the number of common requests. It
    also removes the unfortunate situation where we have to fit the fields
    from the same enum into 32 bits for struct bio and 64 bits for
    struct request.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Shaun Tancheff
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

24 Aug, 2016

1 commit

  • After arbitrary bio size was introduced, the incoming bio may
    be very big. We have to split the bio into small bios so that
    each holds at most BIO_MAX_PAGES bvecs for safety reason, such
    as bio_clone().

    This patch fixes the following kernel crash:

    > [ 172.660142] BUG: unable to handle kernel NULL pointer dereference at 0000000000000028
    > [ 172.660229] IP: [] bio_trim+0xf/0x2a
    > [ 172.660289] PGD 7faf3e067 PUD 7f9279067 PMD 0
    > [ 172.660399] Oops: 0000 [#1] SMP
    > [...]
    > [ 172.664780] Call Trace:
    > [ 172.664813] [] ? raid1_make_request+0x2e8/0xad7 [raid1]
    > [ 172.664846] [] ? blk_queue_split+0x377/0x3d4
    > [ 172.664880] [] ? md_make_request+0xf6/0x1e9 [md_mod]
    > [ 172.664912] [] ? generic_make_request+0xb5/0x155
    > [ 172.664947] [] ? prio_io+0x85/0x95 [bcache]
    > [ 172.664981] [] ? register_cache_set+0x355/0x8d0 [bcache]
    > [ 172.665016] [] ? register_bcache+0x1006/0x1174 [bcache]

    The issue can be reproduced by the following steps:
    - create one raid1 over two virtio-blk
    - build bcache device over the above raid1 and another cache device
    and bucket size is set as 2Mbytes
    - set cache mode as writeback
    - run random write over ext4 on the bcache device

    Fixes: 54efd50(block: make generic_make_request handle arbitrarily sized bios)
    Reported-by: Sebastian Roesner
    Reported-by: Eric Wheeler
    Cc: stable@vger.kernel.org (4.3+)
    Cc: Shaohua Li
    Acked-by: Kent Overstreet
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     

16 Aug, 2016

1 commit

  • Commit 288dab8a35a0 ("block: add a separate operation type for secure
    erase") split REQ_OP_SECURE_ERASE from REQ_OP_DISCARD without considering
    all the places REQ_OP_DISCARD was being used to mean either. Fix those.

    Signed-off-by: Adrian Hunter
    Fixes: 288dab8a35a0 ("block: add a separate operation type for secure erase")
    Signed-off-by: Jens Axboe

    Adrian Hunter
     

08 Aug, 2016

1 commit

  • Since commit 63a4cc24867d, bio->bi_rw contains flags in the lower
    portion and the op code in the higher portions. This means that
    old code that relies on manually setting bi_rw is most likely
    going to be broken. Instead of letting that brokeness linger,
    rename the member, to force old and out-of-tree code to break
    at compile time instead of at runtime.

    No intended functional changes in this commit.

    Signed-off-by: Jens Axboe

    Jens Axboe
     

27 Jul, 2016

1 commit

  • Pull block driver updates from Jens Axboe:
    "This branch also contains core changes. I've come to the conclusion
    that from 4.9 and forward, I'll be doing just a single branch. We
    often have dependencies between core and drivers, and it's hard to
    always split them up appropriately without pulling core into drivers
    when that happens.

    That said, this contains:

    - separate secure erase type for the core block layer, from
    Christoph.

    - set of discard fixes, from Christoph.

    - bio shrinking fixes from Christoph, as a followup up to the
    op/flags change in the core branch.

    - map and append request fixes from Christoph.

    - NVMeF (NVMe over Fabrics) code from Christoph. This is pretty
    exciting!

    - nvme-loop fixes from Arnd.

    - removal of ->driverfs_dev from Dan, after providing a
    device_add_disk() helper.

    - bcache fixes from Bhaktipriya and Yijing.

    - cdrom subchannel read fix from Vchannaiah.

    - set of lightnvm updates from Wenwei, Matias, Johannes, and Javier.

    - set of drbd updates and fixes from Fabian, Lars, and Philipp.

    - mg_disk error path fix from Bart.

    - user notification for failed device add for loop, from Minfei.

    - NVMe in general:
    + NVMe delay quirk from Guilherme.
    + SR-IOV support and command retry limits from Keith.
    + fix for memory-less NUMA node from Masayoshi.
    + use UINT_MAX for discard sectors, from Minfei.
    + cancel IO fixes from Ming.
    + don't allocate unused major, from Neil.
    + error code fixup from Dan.
    + use constants for PSDT/FUSE from James.
    + variable init fix from Jay.
    + fabrics fixes from Ming, Sagi, and Wei.
    + various fixes"

    * 'for-4.8/drivers' of git://git.kernel.dk/linux-block: (115 commits)
    nvme/pci: Provide SR-IOV support
    nvme: initialize variable before logical OR'ing it
    block: unexport various bio mapping helpers
    scsi/osd: open code blk_make_request
    target: stop using blk_make_request
    block: simplify and export blk_rq_append_bio
    block: ensure bios return from blk_get_request are properly initialized
    virtio_blk: use blk_rq_map_kern
    memstick: don't allow REQ_TYPE_BLOCK_PC requests
    block: shrink bio size again
    block: simplify and cleanup bvec pool handling
    block: get rid of bio_rw and READA
    block: don't ignore -EOPNOTSUPP blkdev_issue_write_same
    block: introduce BLKDEV_DISCARD_ZERO to fix zeroout
    NVMe: don't allocate unused nvme_major
    nvme: avoid crashes when node 0 is memoryless node.
    nvme: Limit command retries
    loop: Make user notify for adding loop device failed
    nvme-loop: fix nvme-loop Kconfig dependencies
    nvmet: fix return value check in nvmet_subsys_alloc()
    ...

    Linus Torvalds
     

21 Jul, 2016

2 commits

  • For a front merge, the maximum number of sectors of the
    request must be checked against the front merge BIO sector,
    not the current sector of the request.

    Signed-off-by: Damien Le Moal
    Reviewed-by: Hannes Reinecke
    Signed-off-by: Jens Axboe

    Damien Le Moal
     
  • Before merging a bio into an existing request, io scheduler is called to
    get its approval first. However, the requests that come from a plug
    flush may get merged by block layer without consulting with io
    scheduler.

    In case of CFQ, this can cause fairness problems. For instance, if a
    request gets merged into a low weight cgroup's request, high weight cgroup
    now will depend on low weight cgroup to get scheduled. If high weigt cgroup
    needs that io request to complete before submitting more requests, then it
    will also lose its timeslice.

    Following script demonstrates the problem. Group g1 has a low weight, g2
    and g3 have equal high weights but g2's requests are adjacent to g1's
    requests so they are subject to merging. Due to these merges, g2 gets
    poor disk time allocation.

    cat > cfq-merge-repro.sh << "EOF"
    #!/bin/bash
    set -e

    IO_ROOT=/mnt-cgroup/io

    mkdir -p $IO_ROOT

    if ! mount | grep -qw $IO_ROOT; then
    mount -t cgroup none -oblkio $IO_ROOT
    fi

    cd $IO_ROOT

    for i in g1 g2 g3; do
    if [ -d $i ]; then
    rmdir $i
    fi
    done

    mkdir g1 && echo 10 > g1/blkio.weight
    mkdir g2 && echo 495 > g2/blkio.weight
    mkdir g3 && echo 495 > g3/blkio.weight

    RUNTIME=10

    (echo $BASHPID > g1/cgroup.procs &&
    fio --readonly --name name1 --filename /dev/sdb \
    --rw read --size 64k --bs 64k --time_based \
    --runtime=$RUNTIME --offset=0k &> /dev/null)&

    (echo $BASHPID > g2/cgroup.procs &&
    fio --readonly --name name1 --filename /dev/sdb \
    --rw read --size 64k --bs 64k --time_based \
    --runtime=$RUNTIME --offset=64k &> /dev/null)&

    (echo $BASHPID > g3/cgroup.procs &&
    fio --readonly --name name1 --filename /dev/sdb \
    --rw read --size 64k --bs 64k --time_based \
    --runtime=$RUNTIME --offset=256k &> /dev/null)&

    sleep $((RUNTIME+1))

    for i in g1 g2 g3; do
    echo ---- $i ----
    cat $i/blkio.time
    done

    EOF
    # ./cfq-merge-repro.sh
    ---- g1 ----
    8:16 162
    ---- g2 ----
    8:16 165
    ---- g3 ----
    8:16 686

    After applying the patch:

    # ./cfq-merge-repro.sh
    ---- g1 ----
    8:16 90
    ---- g2 ----
    8:16 445
    ---- g3 ----
    8:16 471

    Signed-off-by: Tahsin Erdogan
    Signed-off-by: Jens Axboe

    Tahsin Erdogan
     

09 Jun, 2016

1 commit


08 Jun, 2016

3 commits

  • This patch converts the block layer merging code to use separate variables
    for the operation and flags, and to check req_op for the REQ_OP.

    Signed-off-by: Mike Christie
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Hannes Reinecke
    Signed-off-by: Jens Axboe

    Mike Christie
     
  • This patch converts the simple bi_rw use cases in the block,
    drivers, mm and fs code to set/get the bio operation using
    bio_set_op_attrs/bio_op

    These should be simple one or two liner cases, so I just did them
    in one patch. The next patches handle the more complicated
    cases in a module per patch.

    Signed-off-by: Mike Christie
    Reviewed-by: Hannes Reinecke
    Signed-off-by: Jens Axboe

    Mike Christie
     
  • We currently set REQ_WRITE/WRITE for all non READ IOs
    like discard, flush, writesame, etc. In the next patches where we
    no longer set up the op as a bitmap, we will not be able to
    detect a operation direction like writesame by testing if REQ_WRITE is
    set.

    This patch converts the drivers and cgroup to use the
    op_is_write helper. This should just cover the simple
    cases. I did dm, md and bcache in their own patches
    because they were more involved.

    Signed-off-by: Mike Christie
    Reviewed-by: Hannes Reinecke
    Signed-off-by: Jens Axboe

    Mike Christie