12 Dec, 2018

1 commit

  • null_blk_zoned creation fails if the number of zones specified is
    equal to or smaller than 64, due to a memory allocation failure in
    blk_alloc_zones(). With such a small number of zones, the required
    memory for all zone descriptors fits in a single page, so the page
    order for alloc_pages_node() is zero. Allow this value in
    blk_alloc_zones() so that the allocation succeeds.

    Fixes: bf5054569653 ("block: Introduce blk_revalidate_disk_zones()")
    Reviewed-by: Damien Le Moal
    Signed-off-by: Shin'ichiro Kawasaki
    Signed-off-by: Jens Axboe

    Shin'ichiro Kawasaki
     

11 Dec, 2018

1 commit

  • We don't need to zero fill the bio if not using kernel allocated pages.

    Fixes: f3587d76da05 ("block: Clear kernel memory before copying to user") # v4.20-rc2
    Reported-by: Todd Aiken
    Cc: Laurence Oberman
    Cc: stable@vger.kernel.org
    Cc: Bart Van Assche
    Tested-by: Laurence Oberman
    Signed-off-by: Keith Busch
    Signed-off-by: Jens Axboe

    Keith Busch
     

07 Dec, 2018

2 commits

  • After the direct dispatch corruption fix, we permanently disallow direct
    dispatch of non read/write requests. This works fine off the normal IO
    path, as they will be retried like any other failed direct dispatch
    request. But for the blk_insert_cloned_request() that only DM uses to
    bypass the bottom level scheduler, we always first attempt direct
    dispatch. For some types of requests, that's now a permanent failure,
    and no amount of retrying will make that succeed. This results in a
    livelock.

    Instead of making special cases for what we can direct issue, and now
    having to deal with DM solving the livelock while still retaining a BUSY
    condition feedback loop, always just add a request that has been through
    ->queue_rq() to the hardware queue dispatch list. These are safe to use
    as no merging can take place there. Additionally, if requests do have
    prepped data from drivers, we aren't dependent on them not sharing space
    in the request structure to safely add them to the IO scheduler lists.

    This basically reverts ffe81d45322c and is based on a patch from Ming,
    but with the list insert case covered as well.

    Fixes: ffe81d45322c ("blk-mq: fix corruption with direct issue")
    Cc: stable@vger.kernel.org
    Suggested-by: Ming Lei
    Reported-by: Bart Van Assche
    Tested-by: Ming Lei
    Acked-by: Mike Snitzer
    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • Since commit 2d29c9f89fcd ("block, bfq: improve asymmetric scenarios
    detection"), if there are process groups with I/O requests waiting for
    completion, then BFQ tags the scenario as 'asymmetric'. This detection
    is needed for preserving service guarantees (for details, see the
    comments on the computation of the variable asymmetric_scenario in
    the function bfq_better_to_idle).

    Unfortunately, commit '2d29c9f89fcd ("block, bfq: improve asymmetric
    scenarios detection")' contains an error exactly in the updating of
    the number of groups with I/O requests waiting for completion: if a
    group has more than one descendant process, then the above number of
    groups, which is renamed from num_active_groups to a more appropriate
    num_groups_with_pending_reqs by this commit, may happen to be wrongly
    decremented multiple times, namely every time one of the descendant
    processes gets all its pending I/O requests completed.

    A correct, complete solution should work as follows. Consider a group
    that is inactive, i.e., that has no descendant process with pending
    I/O inside BFQ queues. Then suppose that num_groups_with_pending_reqs
    is still accounting for this group, because the group still has some
    descendant process with some I/O request still in
    flight. num_groups_with_pending_reqs should be decremented when the
    in-flight request of the last descendant process is finally completed
    (assuming that nothing else has changed for the group in the meantime,
    in terms of composition of the group and active/inactive state of
    child groups and processes). To accomplish this, an additional
    pending-request counter must be added to entities, and must be
    updated correctly.

    To avoid this additional field and operations, this commit resorts to
    the following tradeoff between simplicity and accuracy: for an
    inactive group that is still counted in num_groups_with_pending_reqs,
    this commit decrements num_groups_with_pending_reqs when the first
    descendant process of the group remains with no request waiting for
    completion.

    This simplified scheme provides a fix to the unbalanced decrements
    introduced by 2d29c9f89fcd. Since this error was also caused by lack
    of comments on this non-trivial issue, this commit also adds related
    comments.

    Fixes: 2d29c9f89fcd ("block, bfq: improve asymmetric scenarios detection")
    Reported-by: Steven Barrett
    Tested-by: Steven Barrett
    Tested-by: Lucjan Lucjanov
    Reviewed-by: Federico Motta
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Paolo Valente
     

05 Dec, 2018

1 commit

  • If we attempt a direct issue to a SCSI device, and it returns BUSY, then
    we queue the request up normally. However, the SCSI layer may have
    already setup SG tables etc for this particular command. If we later
    merge with this request, then the old tables are no longer valid. Once
    we issue the IO, we only read/write the original part of the request,
    not the new state of it.

    This causes data corruption, and is most often noticed with the file
    system complaining about the just read data being invalid:

    [ 235.934465] EXT4-fs error (device sda1): ext4_iget:4831: inode #7142: comm dpkg-query: bad extra_isize 24937 (inode size 256)

    because most of it is garbage...

    This doesn't happen from the normal issue path, as we will simply defer
    the request to the hardware queue dispatch list if we fail. Once it's on
    the dispatch list, we never merge with it.

    Fix this from the direct issue path by flagging the request as
    REQ_NOMERGE so we don't change the size of it before issue.

    See also:
    https://bugzilla.kernel.org/show_bug.cgi?id=201685

    Tested-by: Guenter Roeck
    Fixes: 6ce3dd6eec1 ("blk-mq: issue directly if hw queue isn't busy in case of 'none'")
    Cc: stable@vger.kernel.org
    Signed-off-by: Jens Axboe

    Jens Axboe
     

01 Dec, 2018

1 commit

  • There are actually two kinds of discard merge:

    - one is the normal discard merge, which behaves just like a normal
    read/write request; call it single-range discard

    - the other is the multi-range discard, where
    queue_max_discard_segments(rq->q) > 1

    In the former case, queue_max_discard_segments(rq->q) is 1, and this
    kind of discard merge should be handled like a normal read/write
    request.

    This patch fixes the following kernel panic issue[1], which is caused by
    not removing the single-range discard request from elevator queue.

    Guangwu has a RAID discard test case in which this issue is a bit
    easier to trigger, and I verified that this patch fixes the kernel
    panic in that test case.

    [1] kernel panic log from Jens's report

    BUG: unable to handle kernel NULL pointer dereference at 0000000000000148
    PGD 0 P4D 0.
    Oops: 0000 [#1] SMP PTI
    CPU: 37 PID: 763 Comm: kworker/37:1H Not tainted \
    4.20.0-rc3-00649-ge64d9a554a91-dirty #14 Hardware name: Wiwynn \
    Leopard-Orv2/Leopard-DDR BW, BIOS LBM08 03/03/2017 Workqueue: kblockd \
    blk_mq_run_work_fn RIP: \
    0010:blk_mq_get_driver_tag+0x81/0x120 Code: 24 \
    10 48 89 7c 24 20 74 21 83 fa ff 0f 95 c0 48 8b 4c 24 28 65 48 33 0c 25 28 00 00 00 \
    0f 85 96 00 00 00 48 83 c4 30 5b 5d c3 8b 87 48 01 00 00 8b 40 04 39 43 20 72 37 \
    f6 87 b0 00 00 00 02 RSP: 0018:ffffc90004aabd30 EFLAGS: 00010246 \
    RAX: 0000000000000003 RBX: ffff888465ea1300 RCX: ffffc90004aabde8
    RDX: 00000000ffffffff RSI: ffffc90004aabde8 RDI: 0000000000000000
    RBP: 0000000000000000 R08: ffff888465ea1348 R09: 0000000000000000
    R10: 0000000000001000 R11: 00000000ffffffff R12: ffff888465ea1300
    R13: 0000000000000000 R14: ffff888465ea1348 R15: ffff888465d10000
    FS: 0000000000000000(0000) GS:ffff88846f9c0000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 0000000000000148 CR3: 000000000220a003 CR4: 00000000003606e0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    Call Trace:
    blk_mq_dispatch_rq_list+0xec/0x480
    ? elv_rb_del+0x11/0x30
    blk_mq_do_dispatch_sched+0x6e/0xf0
    blk_mq_sched_dispatch_requests+0xfa/0x170
    __blk_mq_run_hw_queue+0x5f/0xe0
    process_one_work+0x154/0x350
    worker_thread+0x46/0x3c0
    kthread+0xf5/0x130
    ? process_one_work+0x350/0x350
    ? kthread_destroy_worker+0x50/0x50
    ret_from_fork+0x1f/0x30
    Modules linked in: sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel \
    kvm switchtec irqbypass iTCO_wdt iTCO_vendor_support efivars cdc_ether usbnet mii \
    cdc_acm i2c_i801 lpc_ich mfd_core ipmi_si ipmi_devintf ipmi_msghandler acpi_cpufreq \
    button sch_fq_codel nfsd nfs_acl lockd grace auth_rpcgss oid_registry sunrpc nvme \
    nvme_core fuse sg loop efivarfs autofs4 CR2: 0000000000000148 \

    ---[ end trace 340a1fb996df1b9b ]---
    RIP: 0010:blk_mq_get_driver_tag+0x81/0x120
    Code: 24 10 48 89 7c 24 20 74 21 83 fa ff 0f 95 c0 48 8b 4c 24 28 65 48 33 0c 25 28 \
    00 00 00 0f 85 96 00 00 00 48 83 c4 30 5b 5d c3 8b 87 48 01 00 00 8b 40 04 39 43 \
    20 72 37 f6 87 b0 00 00 00 02

    Fixes: 445251d0f4d329a ("blk-mq: fix discard merge with scheduler attached")
    Reported-by: Jens Axboe
    Cc: Guangwu Zhang
    Cc: Christoph Hellwig
    Cc: Jianchao Wang
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     

14 Nov, 2018

2 commits

  • c2856ae2f315d ("blk-mq: quiesce queue before freeing queue") already
    fixed this race, but the implied synchronize_rcu() in
    blk_mq_quiesce_queue() can slow down LUN probing a lot, causing a
    performance regression.

    Then 1311326cf4755c7 ("blk-mq: avoid to synchronize rcu inside blk_cleanup_queue()")
    quiesced the queue, to avoid an unnecessary synchronize_rcu(), only
    when queue initialization is done, because it is common to see lots
    of nonexistent LUNs that need to be probed.

    However, it turns out it isn't safe to quiesce the queue only when
    queue initialization is done. When one SCSI command is completed, the
    user that sent the command can be woken up immediately, and the scsi
    device may then be removed while the run queue in scsi_end_request()
    is still in progress, so a kernel panic can be caused.

    In Red Hat QE lab, there are several reports about this kind of kernel
    panic triggered during kernel booting.

    This patch addresses the issue by grabbing one queue usage counter
    while freeing one request and during the following run queue.

    Fixes: 1311326cf4755c7 ("blk-mq: avoid to synchronize rcu inside blk_cleanup_queue()")
    Cc: Andrew Jones
    Cc: Bart Van Assche
    Cc: linux-scsi@vger.kernel.org
    Cc: Martin K. Petersen
    Cc: Christoph Hellwig
    Cc: James E.J. Bottomley
    Cc: stable
    Cc: jianchao.wang
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     
  • A discard cleanup merged into 4.20-rc2 causes fstests xfs/259 to
    fall into an endless loop in the discard code. The test is creating
    a device that is exactly 2^32 sectors in size to test mkfs boundary
    conditions around the 32 bit sector overflow region.

    mkfs issues a discard for the entire device size by default, and
    hence this throws a sector count of 2^32 into
    blkdev_issue_discard(). It takes the number of sectors to discard as
    a sector_t - a 64 bit value.

    Commit ba5d73851e71 ("block: cleanup __blkdev_issue_discard") takes
    this sector count and casts it to a 32 bit value before comparing it
    against the maximum allowed discard size of the device. This
    truncates away the upper 32 bits, so if the lower 32 bits of the
    sector count are zero, it starts issuing discards of length 0. This
    causes the code to fall into an endless loop, issuing zero-length
    discards over and over again on the same sector.

    Fixes: ba5d73851e71 ("block: cleanup __blkdev_issue_discard")
    Tested-by: Darrick J. Wong
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Dave Chinner

    Killed pointless WARN_ON().

    Signed-off-by: Jens Axboe

    Dave Chinner
     

13 Nov, 2018

1 commit

  • We need to copy the io priority, too; otherwise the clone will run
    with a different priority than the original one.

    Fixes: 43b62ce3ff0a ("block: move bio io prio to a new field")
    Signed-off-by: Hannes Reinecke
    Signed-off-by: Jean Delvare

    Fixed up subject, and ordered stores.

    Signed-off-by: Jens Axboe

    Hannes Reinecke
     

10 Nov, 2018

1 commit

  • Pull block layer fixes from Jens Axboe:

    - Two fixes for an ubd regression, one for missing locking, and one for
    a missing initialization of a field. The latter was an old latent
    bug, but it's now visible and triggers (Me, Anton Ivanov)

    - Set of NVMe fixes via Christoph, but applied manually due to a git
    tree mixup (Christoph, Sagi)

    - Fix for a discard split regression, in three patches (Ming)

    - Update libata git trees (Geert)

    - SPDX identifier for sata_rcar (Kuninori Morimoto)

    - Virtual boundary merge fix (Johannes)

    - Preemptively clear memory we are going to pass to userspace, in case
    the driver does a short read (Keith)

    * tag 'for-linus-20181109' of git://git.kernel.dk/linux-block:
    block: make sure writesame bio is aligned with logical block size
    block: cleanup __blkdev_issue_discard()
    block: make sure discard bio is aligned with logical block size
    Revert "nvmet-rdma: use a private workqueue for delete"
    nvme: make sure ns head inherits underlying device limits
    nvmet: don't try to add ns to p2p map unless it actually uses it
    sata_rcar: convert to SPDX identifiers
    ubd: fix missing initialization of io_req
    block: Clear kernel memory before copying to user
    MAINTAINERS: Fix remaining pointers to obsolete libata.git
    ubd: fix missing lock around request issue
    block: respect virtual boundary mask in bvecs

    Linus Torvalds
     

09 Nov, 2018

3 commits

  • Obviously the created writesame bio has to be aligned to the logical
    block size; use bio_allowed_max_sectors() to retrieve this limit.

    Cc: stable@vger.kernel.org
    Cc: Mike Snitzer
    Cc: Christoph Hellwig
    Cc: Xiao Ni
    Cc: Mariusz Dabrowski
    Fixes: b49a0871be31a745b2ef ("block: remove split code in blkdev_issue_{discard,write_same}")
    Tested-by: Rui Salvaterra
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     
  • Cleanup __blkdev_issue_discard() a bit:

    - remove local variable of 'end_sect'
    - remove code block of 'fail'

    Cc: Mike Snitzer
    Cc: Christoph Hellwig
    Cc: Xiao Ni
    Cc: Mariusz Dabrowski
    Tested-by: Rui Salvaterra
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     
  • Obviously the created discard bio has to be aligned to the logical
    block size.

    This patch introduces the bio_allowed_max_sectors() helper for this
    purpose.

    Cc: stable@vger.kernel.org
    Cc: Mike Snitzer
    Cc: Christoph Hellwig
    Cc: Xiao Ni
    Cc: Mariusz Dabrowski
    Fixes: 744889b7cbb56a6 ("block: don't deal with discard limit in blkdev_issue_discard()")
    Fixes: a22c4d7e34402cc ("block: re-add discard_granularity and alignment checks")
    Reported-by: Rui Salvaterra
    Tested-by: Rui Salvaterra
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     

08 Nov, 2018

2 commits

  • If the kernel allocates a bounce buffer for user read data, this memory
    needs to be cleared before copying it to the user, otherwise it may leak
    kernel memory to user space.

    Laurence Oberman
    Signed-off-by: Keith Busch
    Signed-off-by: Jens Axboe

    Keith Busch
     
  • With drivers that are setting a virtual boundary constraint, we are
    seeing a lot of bio splitting and smaller I/Os being submitted to the
    driver.

    This happens because the bio gap detection code does not account for
    cases where PAGE_SIZE - 1 is bigger than queue_virt_boundary() and
    thus splits the bio unnecessarily.

    Cc: Jan Kara
    Cc: Bart Van Assche
    Cc: Ming Lei
    Reviewed-by: Sagi Grimberg
    Signed-off-by: Johannes Thumshirn
    Acked-by: Keith Busch
    Reviewed-by: Ming Lei
    Signed-off-by: Jens Axboe

    Johannes Thumshirn
     

03 Nov, 2018

1 commit

  • Pull block layer fixes from Jens Axboe:
    "The biggest part of this pull request is the revert of the blkcg
    cleanup series. It had one fix earlier for a stacked device issue, but
    another one was reported. Rather than play whack-a-mole with this,
    revert the entire series and try again for the next kernel release.

    Apart from that, only small fixes/changes.

    Summary:

    - Indentation fixup for mtip32xx (Colin Ian King)

    - The blkcg cleanup series revert (Dennis Zhou)

    - Two NVMe fixes. One fixing a regression in the nvme request
    initialization in this merge window, causing nvme-fc to not work.
    The other is a suspend/resume p2p resource issue (James, Keith)

    - Fix sg discard merge, allowing us to merge in cases where we didn't
    before (Jianchao Wang)

    - Call rq_qos_exit() after the queue is frozen, preventing a hang
    (Ming)

    - Fix brd queue setup, fixing an oops if we fail setting up all
    devices (Ming)"

    * tag 'for-linus-20181102' of git://git.kernel.dk/linux-block:
    nvme-pci: fix conflicting p2p resource adds
    nvme-fc: fix request private initialization
    blkcg: revert blkcg cleanups series
    block: brd: associate with queue until adding disk
    block: call rq_qos_exit() after queue is frozen
    mtip32xx: clean an indentation issue, remove extraneous tabs
    block: fix the DISCARD request merge

    Linus Torvalds
     

02 Nov, 2018

2 commits

  • Pull AFS updates from Al Viro:
    "AFS series, with some iov_iter bits included"

    * 'work.afs' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (26 commits)
    missing bits of "iov_iter: Separate type from direction and use accessor functions"
    afs: Probe multiple fileservers simultaneously
    afs: Fix callback handling
    afs: Eliminate the address pointer from the address list cursor
    afs: Allow dumping of server cursor on operation failure
    afs: Implement YFS support in the fs client
    afs: Expand data structure fields to support YFS
    afs: Get the target vnode in afs_rmdir() and get a callback on it
    afs: Calc callback expiry in op reply delivery
    afs: Fix FS.FetchStatus delivery from updating wrong vnode
    afs: Implement the YFS cache manager service
    afs: Remove callback details from afs_callback_break struct
    afs: Commit the status on a new file/dir/symlink
    afs: Increase to 64-bit volume ID and 96-bit vnode ID for YFS
    afs: Don't invoke the server to read data beyond EOF
    afs: Add a couple of tracepoints to log I/O errors
    afs: Handle EIO from delivery function
    afs: Fix TTL on VL server and address lists
    afs: Implement VL server rotation
    afs: Improve FS server rotation error handling
    ...

    Linus Torvalds
     
  • This reverts a series committed earlier due to the null pointer
    exception bug reported in [1]. It seems there are edge case
    interactions that I did not consider, and it will take some time to
    understand what causes the adverse interactions.

    The original series can be found in [2] with a follow up series in [3].

    [1] https://www.spinics.net/lists/cgroups/msg20719.html
    [2] https://lore.kernel.org/lkml/20180911184137.35897-1-dennisszhou@gmail.com/
    [3] https://lore.kernel.org/lkml/20181020185612.51587-1-dennis@kernel.org/

    This reverts the following commits:
    d459d853c2ed, b2c3fa546705, 101246ec02b5, b3b9f24f5fcc, e2b0989954ae,
    f0fcb3ec89f3, c839e7a03f92, bdc2491708c4, 74b7c02a9bc1, 5bf9a1f3b4ef,
    a7b39b4e961c, 07b05bcc3213, 49f4c2dc2b50, 27e6fa996c53

    Signed-off-by: Dennis Zhou
    Signed-off-by: Jens Axboe

    Dennis Zhou
     

31 Oct, 2018

2 commits

  • Move remaining definitions and declarations from include/linux/bootmem.h
    into include/linux/memblock.h and remove the redundant header.

    The includes were replaced with the semantic patch below, followed by
    semi-automated removal of duplicated '#include <linux/memblock.h>':

    @@
    @@
    - #include <linux/bootmem.h>
    + #include <linux/memblock.h>

    [sfr@canb.auug.org.au: dma-direct: fix up for the removal of linux/bootmem.h]
    Link: http://lkml.kernel.org/r/20181002185342.133d1680@canb.auug.org.au
    [sfr@canb.auug.org.au: powerpc: fix up for removal of linux/bootmem.h]
    Link: http://lkml.kernel.org/r/20181005161406.73ef8727@canb.auug.org.au
    [sfr@canb.auug.org.au: x86/kaslr, ACPI/NUMA: fix for linux/bootmem.h removal]
    Link: http://lkml.kernel.org/r/20181008190341.5e396491@canb.auug.org.au
    Link: http://lkml.kernel.org/r/1536927045-23536-30-git-send-email-rppt@linux.vnet.ibm.com
    Signed-off-by: Mike Rapoport
    Signed-off-by: Stephen Rothwell
    Acked-by: Michal Hocko
    Cc: Catalin Marinas
    Cc: Chris Zankel
    Cc: "David S. Miller"
    Cc: Geert Uytterhoeven
    Cc: Greentime Hu
    Cc: Greg Kroah-Hartman
    Cc: Guan Xuetao
    Cc: Ingo Molnar
    Cc: "James E.J. Bottomley"
    Cc: Jonas Bonn
    Cc: Jonathan Corbet
    Cc: Ley Foon Tan
    Cc: Mark Salter
    Cc: Martin Schwidefsky
    Cc: Matt Turner
    Cc: Michael Ellerman
    Cc: Michal Simek
    Cc: Palmer Dabbelt
    Cc: Paul Burton
    Cc: Richard Kuo
    Cc: Richard Weinberger
    Cc: Rich Felker
    Cc: Russell King
    Cc: Serge Semin
    Cc: Thomas Gleixner
    Cc: Tony Luck
    Cc: Vineet Gupta
    Cc: Yoshinori Sato
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     
  • rq_qos_exit() removes the current q->rq_qos, and this has to be done
    after the queue is frozen; otherwise the IO queue path may never be
    woken up, causing an IO hang.

    So fix this issue by moving rq_qos_exit() to after the queue is
    frozen.

    Cc: Josef Bacik
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     

29 Oct, 2018

1 commit

  • There are two cases when handling a DISCARD merge.
    If max_discard_segments == 1, the bios/requests need to be contiguous
    to merge. If max_discard_segments > 1, every bio is taken as a range
    and different ranges need not be contiguous.

    But now attempt_merge screws this up: it always considers contiguity
    for DISCARD in the case max_discard_segments > 1, and cannot merge
    contiguous DISCARDs in the case max_discard_segments == 1, because
    rq_attempt_discard_merge always returns false in that case.
    This patch fixes both of the two cases above.

    Reviewed-by: Christoph Hellwig
    Reviewed-by: Ming Lei
    Signed-off-by: Jianchao Wang
    Signed-off-by: Jens Axboe

    Jianchao Wang
     

27 Oct, 2018

2 commits

  • Merge updates from Andrew Morton:

    - a few misc things

    - ocfs2 updates

    - most of MM

    * emailed patches from Andrew Morton : (132 commits)
    hugetlbfs: dirty pages as they are added to pagecache
    mm: export add_swap_extent()
    mm: split SWP_FILE into SWP_ACTIVATED and SWP_FS
    tools/testing/selftests/vm/map_fixed_noreplace.c: add test for MAP_FIXED_NOREPLACE
    mm: thp: relocate flush_cache_range() in migrate_misplaced_transhuge_page()
    mm: thp: fix mmu_notifier in migrate_misplaced_transhuge_page()
    mm: thp: fix MADV_DONTNEED vs migrate_misplaced_transhuge_page race condition
    mm/kasan/quarantine.c: make quarantine_lock a raw_spinlock_t
    mm/gup: cache dev_pagemap while pinning pages
    Revert "x86/e820: put !E820_TYPE_RAM regions into memblock.reserved"
    mm: return zero_resv_unavail optimization
    mm: zero remaining unavailable struct pages
    tools/testing/selftests/vm/gup_benchmark.c: add MAP_HUGETLB option
    tools/testing/selftests/vm/gup_benchmark.c: add MAP_SHARED option
    tools/testing/selftests/vm/gup_benchmark.c: allow user specified file
    tools/testing/selftests/vm/gup_benchmark.c: fix 'write' flag usage
    mm/gup_benchmark.c: add additional pinning methods
    mm/gup_benchmark.c: time put_page()
    mm: don't raise MEMCG_OOM event due to failed high-order allocation
    mm/page-writeback.c: fix range_cyclic writeback vs writepages deadlock
    ...

    Linus Torvalds
     
  • There are several definitions of those functions/macros in places that
    mess with fixed-point load averages. Provide an official version.

    [akpm@linux-foundation.org: fix missed conversion in block/blk-iolatency.c]
    Link: http://lkml.kernel.org/r/20180828172258.3185-5-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Acked-by: Peter Zijlstra (Intel)
    Tested-by: Suren Baghdasaryan
    Tested-by: Daniel Drake
    Cc: Christopher Lameter
    Cc: Ingo Molnar
    Cc: Johannes Weiner
    Cc: Mike Galbraith
    Cc: Peter Enderborg
    Cc: Randy Dunlap
    Cc: Shakeel Butt
    Cc: Tejun Heo
    Cc: Vinayak Menon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

26 Oct, 2018

10 commits

  • Since commit 2d29c9f89fcd ("block, bfq: improve asymmetric scenarios
    detection"), a scenario is defined asymmetric when one of the
    following conditions holds:
    - active bfq_queues have different weights
    - one or more group of entities (bfq_queue or other groups of entities)
    are active
    bfq grants fairness and low latency also in such asymmetric scenarios,
    by plugging the dispatching of I/O if the bfq_queue in service happens
    to be temporarily idle. This plugging may lower throughput, so it is
    important to do it only when strictly needed.

    By mistake, in commit '2d29c9f89fcd' ("block, bfq: improve asymmetric
    scenarios detection") the num_active_groups counter was firstly
    incremented and subsequently decremented at any entity (group or
    bfq_queue) weight change.

    This is useless, because only transitions from active to inactive and
    vice versa matter for that counter. Unfortunately this is also
    incorrect in the following case: the entity at issue is a bfq_queue
    and it is under weight raising. In fact in this case there is a
    spurious increment of the num_active_groups counter.

    This spurious increment may cause scenarios to be wrongly detected as
    asymmetric, thus causing useless plugging and loss of throughput.

    This commit fixes this issue by simply removing the above useless and
    wrong increments and decrements.

    Fixes: 2d29c9f89fcd ("block, bfq: improve asymmetric scenarios detection")
    Tested-by: Oleksandr Natalenko
    Signed-off-by: Federico Motta
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Federico Motta
     
  • trace_block_getrq() is meant to indicate that a request struct has
    been allocated for the queue, so call it in the right place.

    Reviewed-by: Jianchao Wang
    Signed-off-by: Xiaoguang Wang
    Signed-off-by: Jens Axboe

    Xiaoguang Wang
     
  • Drivers exposing zoned block devices have to initialize and maintain
    correctness (i.e. revalidate) of the device zone bitmaps attached to
    the device request queue (seq_zones_bitmap and seq_zones_wlock).

    To simplify coding this, introduce a generic helper function
    blk_revalidate_disk_zones() suitable for most (and likely all) cases.
    This new function always updates the seq_zones_bitmap and seq_zones_wlock
    bitmaps as well as the queue nr_zones field when called for a disk
    using a request-based queue. For a disk using a BIO-based queue, only
    the number of zones is updated since these queues do not have
    schedulers and so do not need the zone bitmaps.

    With this change, the zone bitmap initialization code in sd_zbc.c can be
    replaced with a call to this function in sd_zbc_read_zones(), which is
    called from the disk revalidate block operation method.

    A call to blk_revalidate_disk_zones() is also added to the null_blk
    driver for devices created with the zoned mode enabled.

    Finally, to ensure that zoned devices created with dm-linear or
    dm-flakey expose the correct number of zones through sysfs, a call to
    blk_revalidate_disk_zones() is added to dm_table_set_restrictions().

    The zone bitmaps allocated and initialized with
    blk_revalidate_disk_zones() are freed automatically from
    __blk_release_queue() using the block internal function
    blk_queue_free_zone_bitmaps().

    Reviewed-by: Hannes Reinecke
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Martin K. Petersen
    Reviewed-by: Mike Snitzer
    Signed-off-by: Damien Le Moal
    Signed-off-by: Jens Axboe

    Damien Le Moal
     
  • Dispatching a report zones command through the request queue is a major
    pain due to the command reply payload rewriting necessary. Given that
    blkdev_report_zones() is executing everything synchronously, implement
    report zones as a block device file operation instead, allowing major
    simplification of the code in many places.

    sd, null-blk, dm-linear and dm-flakey being the only block device
    drivers supporting exposing zoned block devices, these drivers are
    modified to provide the device side implementation of the
    report_zones() block device file operation.

    For device mappers, a new report_zones() target type operation is
    defined so that the upper block layer calls blkdev_report_zones() can
    be propagated down to the underlying devices of the dm targets.
    Implementation for this new operation is added to the dm-linear and
    dm-flakey targets.

    Reviewed-by: Hannes Reinecke
    Signed-off-by: Christoph Hellwig
    [Damien]
    * Changed method block_device argument to gendisk
    * Various bug fixes and improvements
    * Added support for null_blk, dm-linear and dm-flakey.
    Reviewed-by: Martin K. Petersen
    Reviewed-by: Mike Snitzer
    Signed-off-by: Damien Le Moal
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • Expose through sysfs the nr_zones field of struct request_queue.
    Exposing this value helps in debugging disk issues as well as
    facilitating scripts based use of the disk (e.g. blktests).

    For zoned block devices, the nr_zones field indicates the total number
    of zones of the device calculated using the known disk capacity and
    zone size. This number of zones is always 0 for regular block devices.

    Since nr_zones is defined conditionally with CONFIG_BLK_DEV_ZONED,
    introduce the blk_queue_nr_zones() function to return the correct
    value for any device, regardless of whether CONFIG_BLK_DEV_ZONED is
    set.

    Reviewed-by: Christoph Hellwig
    Reviewed-by: Hannes Reinecke
    Signed-off-by: Damien Le Moal
    Signed-off-by: Jens Axboe

    Damien Le Moal
     
  • There is no need to synchronously execute all REQ_OP_ZONE_RESET BIOs
    necessary to reset a range of zones. Similarly to what is done for
    discard BIOs in blk-lib.c, all zone reset BIOs can be chained and
    executed asynchronously and a synchronous call done only for the last
    BIO of the chain.

    Modify blkdev_reset_zones() to operate similarly to
    blkdev_issue_discard(), using the next_bio() helper for chaining BIOs. To
    avoid duplicating that function in blk-zoned.c, rename next_bio() to
    blk_next_bio() and declare it as a block-internal function in blk.h.

    Reviewed-by: Christoph Hellwig
    Reviewed-by: Hannes Reinecke
    Signed-off-by: Damien Le Moal
    Signed-off-by: Jens Axboe

    Damien Le Moal
     
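The submission pattern can be sketched with counters standing in for real bio submission. submit_async(), submit_wait() and reset_zones_sketch() are invented names; in the kernel, the chaining is done with bio chaining plus submit_bio(), and only the final BIO is waited on.

```c
#include <assert.h>

static int async_submits, sync_waits;

static void submit_async(int zone) { (void)zone; async_submits++; }
static void submit_wait(int zone)  { (void)zone; sync_waits++; }

/* Reset zones [first, first + nr): every reset but the last is fired
 * asynchronously as part of the chain; the caller blocks only on the
 * tail, as blkdev_issue_discard() does for discard BIOs. */
static void reset_zones_sketch(int first, int nr)
{
	for (int z = first; z < first + nr; z++) {
		if (z == first + nr - 1)
			submit_wait(z);		/* only the tail blocks */
		else
			submit_async(z);	/* completes in the background */
	}
}
```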
  • Get a zoned block device total number of zones. The device can be a
    partition of the whole device. The number of zones is always 0 for
    regular block devices.

    Reviewed-by: Hannes Reinecke
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Damien Le Moal
    Signed-off-by: Jens Axboe

    Damien Le Moal
     
  • Get a zoned block device zone size in number of 512 B sectors.
    The zone size is always 0 for regular block devices.

    Reviewed-by: Hannes Reinecke
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Damien Le Moal
    Signed-off-by: Jens Axboe

    Damien Le Moal
     
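From user space, the two values the commits above expose are read with the BLKGETNRZONES and BLKGETZONESZ ioctls. A sketch, guarded so it only compiles where the Linux UAPI headers define them; the zone_bytes() helper is an invented convenience:

```c
#include <stdio.h>
#ifdef __linux__
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/blkzoned.h>
#endif

/* Invented convenience: BLKGETZONESZ reports 512B sectors. */
static unsigned long long zone_bytes(unsigned int zone_sectors)
{
	return (unsigned long long)zone_sectors * 512;
}

/* Query a block device; both ioctls read back 0 on a regular disk. */
static int print_zone_info(const char *dev)
{
#if defined(__linux__) && defined(BLKGETNRZONES) && defined(BLKGETZONESZ)
	unsigned int nr_zones = 0, zone_sectors = 0;
	int fd = open(dev, O_RDONLY);

	if (fd < 0)
		return -1;
	if (ioctl(fd, BLKGETNRZONES, &nr_zones) == 0 &&
	    ioctl(fd, BLKGETZONESZ, &zone_sectors) == 0)
		printf("%s: %u zones, %llu bytes each\n",
		       dev, nr_zones, zone_bytes(zone_sectors));
	close(fd);
	return 0;
#else
	(void)dev;
	return -1;	/* zoned ioctls unavailable on this platform */
#endif
}
```

For example, `print_zone_info("/dev/sdb")` would print the zone geometry of a host-managed SMR disk, given sufficient privileges.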
  • There is no point in allocating more zone descriptors than the number of
    zones a block device has when doing a zone report. Avoid this in
    blkdev_report_zones_ioctl() by limiting the number of zone descriptors
    allocated internally to process the user request.

    Reviewed-by: Christoph Hellwig
    Reviewed-by: Hannes Reinecke
    Signed-off-by: Damien Le Moal
    Signed-off-by: Jens Axboe

    Damien Le Moal
     
  • Introduce the blkdev_nr_zones() helper function to get the total
    number of zones of a zoned block device. This number is always 0 for a
    regular block device (q->limits.zoned == BLK_ZONED_NONE case).

    Replace hard-coded number of zones calculation in dmz_get_zoned_device()
    with a call to this helper.

    Reviewed-by: Christoph Hellwig
    Reviewed-by: Hannes Reinecke
    Signed-off-by: Damien Le Moal
    Signed-off-by: Jens Axboe

    Damien Le Moal
     
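The arithmetic such a helper wraps can be sketched as a round-up division of capacity by zone size, both in 512B sectors, so a smaller trailing zone is still counted. nr_zones_sketch() is an illustrative stand-in, not the kernel function:

```c
#include <assert.h>

/* Zone count = capacity / zone size, rounded up; both in 512B sectors.
 * A zone size of 0 marks a regular (non-zoned) device. */
static unsigned int nr_zones_sketch(unsigned long long capacity_sectors,
				    unsigned int zone_sectors)
{
	if (zone_sectors == 0)
		return 0;	/* regular device: no zones */
	return (unsigned int)((capacity_sectors + zone_sectors - 1) /
			      zone_sectors);
}
```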

24 Oct, 2018

1 commit

  • Use accessor functions to access an iterator's type and direction. This
    allows for the possibility of using some other method of determining the
    type of iterator than if-chains with bitwise-AND conditions.

    Signed-off-by: David Howells

    David Howells
     
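The accessor pattern can be sketched stand-alone: instead of open-coded `type & FLAG` checks scattered through callers, small helpers report the iterator's kind and direction, so the underlying encoding can change later. The flag values and names below are invented for illustration, not the kernel's.

```c
#include <assert.h>
#include <stdbool.h>

#define ITER_DIR_WRITE	0x1	/* direction bit: data flows into the iter */
#define ITER_KIND_MASK	0x6	/* which backing store the iter walks */
#define ITER_KIND_IOVEC	0x0
#define ITER_KIND_BVEC	0x2

struct iov_iter_sketch { unsigned int type; };

static inline unsigned int iov_iter_type_sketch(const struct iov_iter_sketch *i)
{
	return i->type & ITER_KIND_MASK;	/* the only place that masks */
}

static inline bool iov_iter_is_bvec_sketch(const struct iov_iter_sketch *i)
{
	return iov_iter_type_sketch(i) == ITER_KIND_BVEC;
}

static inline bool iov_iter_is_write_sketch(const struct iov_iter_sketch *i)
{
	return (i->type & ITER_DIR_WRITE) != 0;
}
```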

23 Oct, 2018

1 commit

  • Pull block layer updates from Jens Axboe:
    "This is the main pull request for block changes for 4.20. This
    contains:

    - Series enabling runtime PM for blk-mq (Bart).

    - Two pull requests from Christoph for NVMe, with items such as;
    - Better AEN tracking
    - Multipath improvements
    - RDMA fixes
    - Rework of FC for target removal
    - Fixes for issues identified by static checkers
    - Fabric cleanups, as prep for TCP transport
    - Various cleanups and bug fixes

    - Block merging cleanups (Christoph)

    - Conversion of drivers to generic DMA mapping API (Christoph)

    - Series fixing ref count issues with blkcg (Dennis)

    - Series improving BFQ heuristics (Paolo, et al)

    - Series improving heuristics for the Kyber IO scheduler (Omar)

    - Removal of dangerous bio_rewind_iter() API (Ming)

    - Apply single queue IPI redirection logic to blk-mq (Ming)

    - Set of fixes and improvements for bcache (Coly et al)

    - Series closing a hotplug race with sysfs group attributes (Hannes)

    - Set of patches for lightnvm:
    - pblk trace support (Hans)
    - SPDX license header update (Javier)
    - Tons of refactoring patches to cleanly abstract the 1.2 and 2.0
    specs behind a common core interface. (Javier, Matias)
    - Enable pblk to use a common interface to retrieve chunk metadata
    (Matias)
    - Bug fixes (Various)

    - Set of fixes and updates to the blk IO latency target (Josef)

    - blk-mq queue number updates fixes (Jianchao)

    - Convert a bunch of drivers from the old legacy IO interface to
    blk-mq. This will conclude with the removal of the legacy IO
    interface itself in 4.21, with the rest of the drivers (me, Omar)

    - Removal of the DAC960 driver. The SCSI tree will introduce two
    replacement drivers for this (Hannes)"

    * tag 'for-4.20/block-20181021' of git://git.kernel.dk/linux-block: (204 commits)
    block: setup bounce bio_sets properly
    blkcg: reassociate bios when make_request() is called recursively
    blkcg: fix edge case for blk_get_rl() under memory pressure
    nvme-fabrics: move controller options matching to fabrics
    nvme-rdma: always have a valid trsvcid
    mtip32xx: fully switch to the generic DMA API
    rsxx: switch to the generic DMA API
    umem: switch to the generic DMA API
    sx8: switch to the generic DMA API
    sx8: remove dead IF_64BIT_DMA_IS_POSSIBLE code
    skd: switch to the generic DMA API
    ubd: remove use of blk_rq_map_sg
    nvme-pci: remove duplicate check
    drivers/block: Remove DAC960 driver
    nvme-pci: fix hot removal during error handling
    nvmet-fcloop: suppress a compiler warning
    nvme-core: make implicit seed truncation explicit
    nvmet-fc: fix kernel-doc headers
    nvme-fc: rework the request initialization code
    nvme-fc: introduce struct nvme_fcp_op_w_sgl
    ...

    Linus Torvalds
     

22 Oct, 2018

1 commit

  • We're only setting up the bounce bio sets if we happen
    to need bouncing for regular HIGHMEM, not if we only need
    it for ISA devices.

    Protect the ISA bounce setup with a mutex, since it's
    being invoked from driver init functions and can thus be
    called in parallel.

    Cc: stable@vger.kernel.org
    Reported-by: Ondrej Zary
    Tested-by: Ondrej Zary
    Signed-off-by: Jens Axboe

    Jens Axboe
     
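The locking shape of the fix can be sketched in user space, with a pthread mutex standing in for the kernel mutex and invented names throughout; the real code initializes bio sets and mempools where the counter below increments.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

static pthread_mutex_t bounce_lock = PTHREAD_MUTEX_INITIALIZER;
static bool isa_bounce_ready;
static int setup_calls;	/* counts how often real setup actually ran */

/* Called from driver init paths that may run in parallel: the mutex
 * plus the "already done" flag guarantee the setup runs exactly once. */
static void isa_init_bounce(void)
{
	pthread_mutex_lock(&bounce_lock);
	if (!isa_bounce_ready) {	/* first caller does the work */
		setup_calls++;		/* stands in for bioset setup */
		isa_bounce_ready = true;
	}
	pthread_mutex_unlock(&bounce_lock);	/* later callers return fast */
}
```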

21 Oct, 2018

1 commit

  • When submitting a bio, multiple recursive calls to make_request() may
    occur. This causes the initial association done in blkcg_bio_issue_check()
    to be incorrect, referencing the prior request_queue. Introduce a helper
    to reassociate the bio when make_request() is called recursively.

    Fixes: a7b39b4e961c ("blkcg: always associate a bio with a blkg")
    Reported-by: Valdis Kletnieks
    Signed-off-by: Dennis Zhou
    Tested-by: Valdis Kletnieks
    Signed-off-by: Jens Axboe

    Dennis Zhou
     

18 Oct, 2018

1 commit

  • blk_queue_split() already respects this limit via bio splitting, so
    there is no need to do it in blkdev_issue_discard(); this also aligns
    discard with the normal bio submission path (bio_add_page() &
    submit_bio()).

    More importantly, this patch fixes an issue introduced in a22c4d7e34402cc
    ("block: re-add discard_granularity and alignment checks"), in which
    a zero-sized discard bio may be generated when the alignment is zero.

    Fixes: a22c4d7e34402ccdf3 ("block: re-add discard_granularity and alignment checks")
    Cc: stable@vger.kernel.org
    Cc: Ming Lin
    Cc: Mike Snitzer
    Cc: Christoph Hellwig
    Cc: Xiao Ni
    Tested-by: Mariusz Dabrowski
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     
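A simplified model of the arithmetic involved: trimming each discard chunk to a granularity boundary can collapse the chunk to zero, for instance when the granularity exceeds the per-bio maximum. The helpers and numbers below are illustrative only; the commit's actual fix is to drop the trimming from blkdev_issue_discard() entirely and rely on blk_queue_split().

```c
#include <assert.h>

/* Old-style chunking (simplified): trim each chunk so it ends on a
 * granularity boundary. The subtraction can remove the entire chunk,
 * yielding a zero-sized discard bio. */
static unsigned long long old_chunk(unsigned long long sector,
				    unsigned long long nr_sects,
				    unsigned long long granularity,
				    unsigned long long max_sects)
{
	unsigned long long req = nr_sects < max_sects ? nr_sects : max_sects;

	if (req < nr_sects && granularity)
		req -= (sector + req) % granularity;	/* can hit zero */
	return req;
}

/* New-style: only cap at the maximum; granularity and alignment are
 * left to bio splitting at submission time. */
static unsigned long long new_chunk(unsigned long long nr_sects,
				    unsigned long long max_sects)
{
	return nr_sects < max_sects ? nr_sects : max_sects;
}
```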

16 Oct, 2018

1 commit


15 Oct, 2018

1 commit