17 Apr, 2019

1 commit

  • commit a89afe58f1a74aac768a5eb77af95ef4ee15beaa upstream.

    If the last bio returned is not dio->bio, the status of the bio will
    not assigned to dio->bio if it is error. This will cause the whole IO
    status wrong.

    ksoftirqd/21-117 [021] ..s. 4017.966090: 8,0 C N 4883648 [0]
    -0 [018] ..s. 4017.970888: 8,0 C WS 4924800 + 1024 [0]
    -0 [018] ..s. 4017.970909: 8,0 D WS 4935424 + 1024 []
    -0 [018] ..s. 4017.970924: 8,0 D WS 4936448 + 321 []
    ksoftirqd/21-117 [021] ..s. 4017.995033: 8,0 C R 4883648 + 336 [65475]
    ksoftirqd/21-117 [021] d.s. 4018.001988: myprobe1: (blkdev_bio_end_io+0x0/0x168) bi_status=7
    ksoftirqd/21-117 [021] d.s. 4018.001992: myprobe: (aio_complete_rw+0x0/0x148) x0=0xffff802f2595ad80 res=0x12a000 res2=0x0

    We always have to assign bio->bi_status to dio->bio.bi_status because we
    will only check dio->bio.bi_status when we return the whole IO to
    the upper layer.

    Fixes: 542ff7bf18c6 ("block: new direct I/O implementation")
    Cc: stable@vger.kernel.org
    Cc: Christoph Hellwig
    Cc: Jens Axboe
    Reviewed-by: Ming Lei
    Signed-off-by: Jason Yan
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Jason Yan
     

23 Jan, 2019

1 commit

  • commit 04906b2f542c23626b0ef6219b808406f8dddbe9 upstream.

    bd_set_size() updates also block device's block size. This is somewhat
    unexpected from its name and at this point, only blkdev_open() uses this
    functionality. Furthermore, this can result in changing block size under
    a filesystem mounted on a loop device which leads to livelocks inside
    __getblk_gfp() like:

    Sending NMI from CPU 0 to CPUs 1:
    NMI backtrace for cpu 1
    CPU: 1 PID: 10863 Comm: syz-executor0 Not tainted 4.18.0-rc5+ #151
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google
    01/01/2011
    RIP: 0010:__sanitizer_cov_trace_pc+0x3f/0x50 kernel/kcov.c:106
    ...
    Call Trace:
    init_page_buffers+0x3e2/0x530 fs/buffer.c:904
    grow_dev_page fs/buffer.c:947 [inline]
    grow_buffers fs/buffer.c:1009 [inline]
    __getblk_slow fs/buffer.c:1036 [inline]
    __getblk_gfp+0x906/0xb10 fs/buffer.c:1313
    __bread_gfp+0x2d/0x310 fs/buffer.c:1347
    sb_bread include/linux/buffer_head.h:307 [inline]
    fat12_ent_bread+0x14e/0x3d0 fs/fat/fatent.c:75
    fat_ent_read_block fs/fat/fatent.c:441 [inline]
    fat_alloc_clusters+0x8ce/0x16e0 fs/fat/fatent.c:489
    fat_add_cluster+0x7a/0x150 fs/fat/inode.c:101
    __fat_get_block fs/fat/inode.c:148 [inline]
    ...

    Trivial reproducer for the problem looks like:

    truncate -s 1G /tmp/image
    losetup /dev/loop0 /tmp/image
    mkfs.ext4 -b 1024 /dev/loop0
    mount -t ext4 /dev/loop0 /mnt
    losetup -c /dev/loop0
    l /mnt

    Fix the problem by moving initialization of a block device block size
    into a separate function and call it when needed.

    Thanks to Tetsuo Handa for help with
    debugging the problem.

    Reported-by: syzbot+9933e4476f365f5d5a1b@syzkaller.appspotmail.com
    Signed-off-by: Jan Kara
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Jan Kara
     

15 Aug, 2018

1 commit

  • Pull block updates from Jens Axboe:
    "First pull request for this merge window, there will also be a
    followup request with some stragglers.

    This pull request contains:

    - Fix for a thundering heard issue in the wbt block code (Anchal
    Agarwal)

    - A few NVMe pull requests:
    * Improved tracepoints (Keith)
    * Larger inline data support for RDMA (Steve Wise)
    * RDMA setup/teardown fixes (Sagi)
    * Effects log suppor for NVMe target (Chaitanya Kulkarni)
    * Buffered IO suppor for NVMe target (Chaitanya Kulkarni)
    * TP4004 (ANA) support (Christoph)
    * Various NVMe fixes

    - Block io-latency controller support. Much needed support for
    properly containing block devices. (Josef)

    - Series improving how we handle sense information on the stack
    (Kees)

    - Lightnvm fixes and updates/improvements (Mathias/Javier et al)

    - Zoned device support for null_blk (Matias)

    - AIX partition fixes (Mauricio Faria de Oliveira)

    - DIF checksum code made generic (Max Gurtovoy)

    - Add support for discard in iostats (Michael Callahan / Tejun)

    - Set of updates for BFQ (Paolo)

    - Removal of async write support for bsg (Christoph)

    - Bio page dirtying and clone fixups (Christoph)

    - Set of bcache fix/changes (via Coly)

    - Series improving blk-mq queue setup/teardown speed (Ming)

    - Series improving merging performance on blk-mq (Ming)

    - Lots of other fixes and cleanups from a slew of folks"

    * tag 'for-4.19/block-20180812' of git://git.kernel.dk/linux-block: (190 commits)
    blkcg: Make blkg_root_lookup() work for queues in bypass mode
    bcache: fix error setting writeback_rate through sysfs interface
    null_blk: add lock drop/acquire annotation
    Blk-throttle: reduce tail io latency when iops limit is enforced
    block: paride: pd: mark expected switch fall-throughs
    block: Ensure that a request queue is dissociated from the cgroup controller
    block: Introduce blk_exit_queue()
    blkcg: Introduce blkg_root_lookup()
    block: Remove two superfluous #include directives
    blk-mq: count the hctx as active before allocating tag
    block: bvec_nr_vecs() returns value for wrong slab
    bcache: trivial - remove tailing backslash in macro BTREE_FLAG
    bcache: make the pr_err statement used for ENOENT only in sysfs_attatch section
    bcache: set max writeback rate when I/O request is idle
    bcache: add code comments for bset.c
    bcache: fix mistaken comments in request.c
    bcache: fix mistaken code comments in bcache.h
    bcache: add a comment in super.c
    bcache: avoid unncessary cache prefetch bch_btree_node_get()
    bcache: display rate debug parameters to 0 when writeback is not running
    ...

    Linus Torvalds
     

28 Jul, 2018

1 commit

  • Pull block fixes from Jens Axboe:
    "Bigger than usual at this time, mostly due to the O_DIRECT corruption
    issue and the fact that I was on vacation last week. This contains:

    - NVMe pull request with two fixes for the FC code, and two target
    fixes (Christoph)

    - a DIF bio reset iteration fix (Greg Edwards)

    - two nbd reply and requeue fixes (Josef)

    - SCSI timeout fixup (Keith)

    - a small series that fixes an issue with bio_iov_iter_get_pages(),
    which ended up causing corruption for larger sized O_DIRECT writes
    that ended up racing with buffered writes (Martin Wilck)"

    * tag 'for-linus-20180727' of git://git.kernel.dk/linux-block:
    block: reset bi_iter.bi_done after splitting bio
    block: bio_iov_iter_get_pages: pin more pages for multi-segment IOs
    blkdev: __blkdev_direct_IO_simple: fix leak in error case
    block: bio_iov_iter_get_pages: fix size of last iovec
    nvmet: only check for filebacking on -ENOTBLK
    nvmet: fixup crash on NULL device path
    scsi: set timed out out mq requests to complete
    blk-mq: export setting request completion state
    nvme: if_ready checks to fail io to deleting controller
    nvmet-fc: fix target sgl list on large transfers
    nbd: handle unexpected replies better
    nbd: don't requeue the same request twice.

    Linus Torvalds
     

27 Jul, 2018

1 commit


18 Jul, 2018

1 commit

  • c11f0c0b5bb9 ("block/mm: make bdev_ops->rw_page() take a bool for
    read/write") replaced @op with boolean @is_write, which limited the
    amount of information going into ->rw_page() and more importantly
    page_endio(), which removed the need to expose block internals to mm.

    Unfortunately, we want to track discards separately and @is_write
    isn't enough information. This patch updates bdev_ops->rw_page() to
    take REQ_OP instead but leaves page_endio() to take bool @is_write.
    This allows the block part of operations to have enough information
    while not leaking it to mm.

    Signed-off-by: Tejun Heo
    Cc: Mike Christie
    Cc: Minchan Kim
    Cc: Dan Williams
    Signed-off-by: Jens Axboe

    Tejun Heo
     

13 Jun, 2018

1 commit

  • The kmalloc() function has a 2-factor argument form, kmalloc_array(). This
    patch replaces cases of:

    kmalloc(a * b, gfp)

    with:
    kmalloc_array(a * b, gfp)

    as well as handling cases of:

    kmalloc(a * b * c, gfp)

    with:

    kmalloc(array3_size(a, b, c), gfp)

    as it's slightly less ugly than:

    kmalloc_array(array_size(a, b), c, gfp)

    This does, however, attempt to ignore constant size factors like:

    kmalloc(4 * 1024, gfp)

    though any constants defined via macros get caught up in the conversion.

    Any factors with a sizeof() of "unsigned char", "char", and "u8" were
    dropped, since they're redundant.

    The tools/ directory was manually excluded, since it has its own
    implementation of kmalloc().

    The Coccinelle script used for this was:

    // Fix redundant parens around sizeof().
    @@
    type TYPE;
    expression THING, E;
    @@

    (
    kmalloc(
    - (sizeof(TYPE)) * E
    + sizeof(TYPE) * E
    , ...)
    |
    kmalloc(
    - (sizeof(THING)) * E
    + sizeof(THING) * E
    , ...)
    )

    // Drop single-byte sizes and redundant parens.
    @@
    expression COUNT;
    typedef u8;
    typedef __u8;
    @@

    (
    kmalloc(
    - sizeof(u8) * (COUNT)
    + COUNT
    , ...)
    |
    kmalloc(
    - sizeof(__u8) * (COUNT)
    + COUNT
    , ...)
    |
    kmalloc(
    - sizeof(char) * (COUNT)
    + COUNT
    , ...)
    |
    kmalloc(
    - sizeof(unsigned char) * (COUNT)
    + COUNT
    , ...)
    |
    kmalloc(
    - sizeof(u8) * COUNT
    + COUNT
    , ...)
    |
    kmalloc(
    - sizeof(__u8) * COUNT
    + COUNT
    , ...)
    |
    kmalloc(
    - sizeof(char) * COUNT
    + COUNT
    , ...)
    |
    kmalloc(
    - sizeof(unsigned char) * COUNT
    + COUNT
    , ...)
    )

    // 2-factor product with sizeof(type/expression) and identifier or constant.
    @@
    type TYPE;
    expression THING;
    identifier COUNT_ID;
    constant COUNT_CONST;
    @@

    (
    - kmalloc
    + kmalloc_array
    (
    - sizeof(TYPE) * (COUNT_ID)
    + COUNT_ID, sizeof(TYPE)
    , ...)
    |
    - kmalloc
    + kmalloc_array
    (
    - sizeof(TYPE) * COUNT_ID
    + COUNT_ID, sizeof(TYPE)
    , ...)
    |
    - kmalloc
    + kmalloc_array
    (
    - sizeof(TYPE) * (COUNT_CONST)
    + COUNT_CONST, sizeof(TYPE)
    , ...)
    |
    - kmalloc
    + kmalloc_array
    (
    - sizeof(TYPE) * COUNT_CONST
    + COUNT_CONST, sizeof(TYPE)
    , ...)
    |
    - kmalloc
    + kmalloc_array
    (
    - sizeof(THING) * (COUNT_ID)
    + COUNT_ID, sizeof(THING)
    , ...)
    |
    - kmalloc
    + kmalloc_array
    (
    - sizeof(THING) * COUNT_ID
    + COUNT_ID, sizeof(THING)
    , ...)
    |
    - kmalloc
    + kmalloc_array
    (
    - sizeof(THING) * (COUNT_CONST)
    + COUNT_CONST, sizeof(THING)
    , ...)
    |
    - kmalloc
    + kmalloc_array
    (
    - sizeof(THING) * COUNT_CONST
    + COUNT_CONST, sizeof(THING)
    , ...)
    )

    // 2-factor product, only identifiers.
    @@
    identifier SIZE, COUNT;
    @@

    - kmalloc
    + kmalloc_array
    (
    - SIZE * COUNT
    + COUNT, SIZE
    , ...)

    // 3-factor product with 1 sizeof(type) or sizeof(expression), with
    // redundant parens removed.
    @@
    expression THING;
    identifier STRIDE, COUNT;
    type TYPE;
    @@

    (
    kmalloc(
    - sizeof(TYPE) * (COUNT) * (STRIDE)
    + array3_size(COUNT, STRIDE, sizeof(TYPE))
    , ...)
    |
    kmalloc(
    - sizeof(TYPE) * (COUNT) * STRIDE
    + array3_size(COUNT, STRIDE, sizeof(TYPE))
    , ...)
    |
    kmalloc(
    - sizeof(TYPE) * COUNT * (STRIDE)
    + array3_size(COUNT, STRIDE, sizeof(TYPE))
    , ...)
    |
    kmalloc(
    - sizeof(TYPE) * COUNT * STRIDE
    + array3_size(COUNT, STRIDE, sizeof(TYPE))
    , ...)
    |
    kmalloc(
    - sizeof(THING) * (COUNT) * (STRIDE)
    + array3_size(COUNT, STRIDE, sizeof(THING))
    , ...)
    |
    kmalloc(
    - sizeof(THING) * (COUNT) * STRIDE
    + array3_size(COUNT, STRIDE, sizeof(THING))
    , ...)
    |
    kmalloc(
    - sizeof(THING) * COUNT * (STRIDE)
    + array3_size(COUNT, STRIDE, sizeof(THING))
    , ...)
    |
    kmalloc(
    - sizeof(THING) * COUNT * STRIDE
    + array3_size(COUNT, STRIDE, sizeof(THING))
    , ...)
    )

    // 3-factor product with 2 sizeof(variable), with redundant parens removed.
    @@
    expression THING1, THING2;
    identifier COUNT;
    type TYPE1, TYPE2;
    @@

    (
    kmalloc(
    - sizeof(TYPE1) * sizeof(TYPE2) * COUNT
    + array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
    , ...)
    |
    kmalloc(
    - sizeof(TYPE1) * sizeof(THING2) * (COUNT)
    + array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
    , ...)
    |
    kmalloc(
    - sizeof(THING1) * sizeof(THING2) * COUNT
    + array3_size(COUNT, sizeof(THING1), sizeof(THING2))
    , ...)
    |
    kmalloc(
    - sizeof(THING1) * sizeof(THING2) * (COUNT)
    + array3_size(COUNT, sizeof(THING1), sizeof(THING2))
    , ...)
    |
    kmalloc(
    - sizeof(TYPE1) * sizeof(THING2) * COUNT
    + array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
    , ...)
    |
    kmalloc(
    - sizeof(TYPE1) * sizeof(THING2) * (COUNT)
    + array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
    , ...)
    )

    // 3-factor product, only identifiers, with redundant parens removed.
    @@
    identifier STRIDE, SIZE, COUNT;
    @@

    (
    kmalloc(
    - (COUNT) * STRIDE * SIZE
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    |
    kmalloc(
    - COUNT * (STRIDE) * SIZE
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    |
    kmalloc(
    - COUNT * STRIDE * (SIZE)
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    |
    kmalloc(
    - (COUNT) * (STRIDE) * SIZE
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    |
    kmalloc(
    - COUNT * (STRIDE) * (SIZE)
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    |
    kmalloc(
    - (COUNT) * STRIDE * (SIZE)
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    |
    kmalloc(
    - (COUNT) * (STRIDE) * (SIZE)
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    |
    kmalloc(
    - COUNT * STRIDE * SIZE
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    )

    // Any remaining multi-factor products, first at least 3-factor products,
    // when they're not all constants...
    @@
    expression E1, E2, E3;
    constant C1, C2, C3;
    @@

    (
    kmalloc(C1 * C2 * C3, ...)
    |
    kmalloc(
    - (E1) * E2 * E3
    + array3_size(E1, E2, E3)
    , ...)
    |
    kmalloc(
    - (E1) * (E2) * E3
    + array3_size(E1, E2, E3)
    , ...)
    |
    kmalloc(
    - (E1) * (E2) * (E3)
    + array3_size(E1, E2, E3)
    , ...)
    |
    kmalloc(
    - E1 * E2 * E3
    + array3_size(E1, E2, E3)
    , ...)
    )

    // And then all remaining 2 factors products when they're not all constants,
    // keeping sizeof() as the second factor argument.
    @@
    expression THING, E1, E2;
    type TYPE;
    constant C1, C2, C3;
    @@

    (
    kmalloc(sizeof(THING) * C2, ...)
    |
    kmalloc(sizeof(TYPE) * C2, ...)
    |
    kmalloc(C1 * C2 * C3, ...)
    |
    kmalloc(C1 * C2, ...)
    |
    - kmalloc
    + kmalloc_array
    (
    - sizeof(TYPE) * (E2)
    + E2, sizeof(TYPE)
    , ...)
    |
    - kmalloc
    + kmalloc_array
    (
    - sizeof(TYPE) * E2
    + E2, sizeof(TYPE)
    , ...)
    |
    - kmalloc
    + kmalloc_array
    (
    - sizeof(THING) * (E2)
    + E2, sizeof(THING)
    , ...)
    |
    - kmalloc
    + kmalloc_array
    (
    - sizeof(THING) * E2
    + E2, sizeof(THING)
    , ...)
    |
    - kmalloc
    + kmalloc_array
    (
    - (E1) * E2
    + E1, E2
    , ...)
    |
    - kmalloc
    + kmalloc_array
    (
    - (E1) * (E2)
    + E1, E2
    , ...)
    |
    - kmalloc
    + kmalloc_array
    (
    - E1 * E2
    + E1, E2
    , ...)
    )

    Signed-off-by: Kees Cook

    Kees Cook
     

09 Jun, 2018

1 commit

  • Pull aio iopriority support from Al Viro:
    "The rest of aio stuff for this cycle - Adam's aio ioprio series"

    * 'work.aio' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    fs: aio ioprio use ioprio_check_cap ret val
    fs: aio ioprio add explicit block layer dependence
    fs: iomap dio set bio prio from kiocb prio
    fs: blkdev set bio prio from kiocb prio
    fs: Add aio iopriority support
    fs: Convert kiocb rw_hint from enum to u16
    block: add ioprio_check_cap function

    Linus Torvalds
     

31 May, 2018

2 commits


29 May, 2018

2 commits


11 Apr, 2018

1 commit

  • Pull libnvdimm updates from Dan Williams:
    "This cycle was was not something I ever want to repeat as there were
    several late changes that have only now just settled.

    Half of the branch up to commit d2c997c0f145 ("fs, dax: use
    page->mapping to warn...") have been in -next for several releases.
    The of_pmem driver and the address range scrub rework were late
    arrivals, and the dax work was scaled back at the last moment.

    The of_pmem driver missed a previous merge window due to an oversight.
    A sense of obligation to rectify that miss is why it is included for
    4.17. It has acks from PowerPC folks. Stephen reported a build failure
    that only occurs when merging it with your latest tree, for now I have
    fixed that up by disabling modular builds of of_pmem. A test merge
    with your tree has received a build success report from the 0day robot
    over 156 configs.

    An initial version of the ARS rework was submitted before the merge
    window. It is self contained to libnvdimm, a net code reduction, and
    passing all unit tests.

    The filesystem-dax changes are based on the wait_var_event()
    functionality from tip/sched/core. However, late review feedback
    showed that those changes regressed truncate performance to a large
    degree. The branch was rewound to drop the truncate behavior change
    and now only includes preparation patches and cleanups (with full acks
    and reviews). The finalization of this dax-dma-vs-trnucate work will
    need to wait for 4.18.

    Summary:

    - A rework of the filesytem-dax implementation provides for detection
    of unmap operations (truncate / hole punch) colliding with
    in-progress device-DMA. A fix for these collisions remains a
    work-in-progress pending resolution of truncate latency and
    starvation regressions.

    - The of_pmem driver expands the users of libnvdimm outside of x86
    and ACPI to describe an implementation of persistent memory on
    PowerPC with Open Firmware / Device tree.

    - Address Range Scrub (ARS) handling is completely rewritten to
    account for the fact that ARS may run for 100s of seconds and there
    is no platform defined way to cancel it. ARS will now no longer
    block namespace initialization.

    - The NVDIMM Namespace Label implementation is updated to handle
    label areas as small as 1K, down from 128K.

    - Miscellaneous cleanups and updates to unit test infrastructure"

    * tag 'libnvdimm-for-4.17' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (39 commits)
    libnvdimm, of_pmem: workaround OF_NUMA=n build error
    nfit, address-range-scrub: add module option to skip initial ars
    nfit, address-range-scrub: rework and simplify ARS state machine
    nfit, address-range-scrub: determine one platform max_ars value
    powerpc/powernv: Create platform devs for nvdimm buses
    doc/devicetree: Persistent memory region bindings
    libnvdimm: Add device-tree based driver
    libnvdimm: Add of_node to region and bus descriptors
    libnvdimm, region: quiet region probe
    libnvdimm, namespace: use a safe lookup for dimm device name
    libnvdimm, dimm: fix dpa reservation vs uninitialized label area
    libnvdimm, testing: update the default smart ctrl_temperature
    libnvdimm, testing: Add emulation for smart injection commands
    nfit, address-range-scrub: introduce nfit_spa->ars_state
    libnvdimm: add an api to cast a 'struct nd_region' to its 'struct device'
    nfit, address-range-scrub: fix scrub in-progress reporting
    dax, dm: allow device-mapper to operate without dax support
    dax: introduce CONFIG_DAX_DRIVER
    fs, dax: use page->mapping to warn if truncate collides with a busy page
    ext2, dax: introduce ext2_dax_aops
    ...

    Linus Torvalds
     

06 Apr, 2018

1 commit

  • When changing the size of a block device, its all caches are freed.
    It's necessary on shrinking to prevent spurious I/Os to the disappeared
    region. However, on expanding, such kind of I/Os doesn't happen.

    Similar things can be considered for btrfs filesystem resize and
    resize2fs, but they are designed not to drop caches when expanding.
    Therefore this patch removes unnecessary cache drop.

    Link: http://lkml.kernel.org/r/1521457240-153390-1-git-send-email-shunki-fujita@cybozu.co.jp
    Signed-off-by: Shunki Fujita
    Cc: Al Viro
    Cc: Jens Axboe
    Cc: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    shunki-fujita
     

31 Mar, 2018

1 commit


27 Feb, 2018

3 commits

  • When blkdev_open() races with device removal and creation it can happen
    that unhashed bdev inode gets associated with newly created gendisk
    like:

    CPU0 CPU1
    blkdev_open()
    bdev = bd_acquire()
    del_gendisk()
    bdev_unhash_inode(bdev);
    remove device
    create new device with the same number
    __blkdev_get()
    disk = get_gendisk()
    - gets reference to gendisk of the new device

    Now another blkdev_open() will not find original 'bdev' as it got
    unhashed, create a new one and associate it with the same 'disk' at
    which point problems start as we have two independent page caches for
    one device.

    Fix the problem by verifying that the bdev inode didn't get unhashed
    before we acquired gendisk reference. That way we make sure gendisk can
    get associated only with visible bdev inodes.

    Tested-by: Hou Tao
    Signed-off-by: Jan Kara
    Signed-off-by: Jens Axboe

    Jan Kara
     
  • When two blkdev_open() calls race with device removal and recreation,
    __blkdev_get() can use looked up gendisk after it is freed:

    CPU0 CPU1 CPU2
    del_gendisk(disk);
    bdev_unhash_inode(inode);
    blkdev_open() blkdev_open()
    bdev = bd_acquire(inode);
    - creates and returns new inode
    bdev = bd_acquire(inode);
    - returns the same inode
    __blkdev_get(devt) __blkdev_get(devt)
    disk = get_gendisk(devt);
    - got structure of device going away


    disk = get_gendisk(devt);
    - got new device structure
    if (!bdev->bd_openers) {
    does the first open
    }
    if (!bdev->bd_openers)
    - false
    } else {
    put_disk_and_module(disk)
    - remember this was old device - this was last ref and disk is
    now freed
    }
    disk_unblock_events(disk); -> oops

    Fix the problem by making sure we drop reference to disk in
    __blkdev_get() only after we are really done with it.

    Reported-by: Hou Tao
    Tested-by: Hou Tao
    Signed-off-by: Jan Kara
    Signed-off-by: Jens Axboe

    Jan Kara
     
  • Add a proper counterpart to get_disk_and_module() -
    put_disk_and_module(). Currently it is opencoded in several places.

    Signed-off-by: Jan Kara
    Signed-off-by: Jens Axboe

    Jan Kara
     

15 Nov, 2017

1 commit

  • Pull core block layer updates from Jens Axboe:
    "This is the main pull request for block storage for 4.15-rc1.

    Nothing out of the ordinary in here, and no API changes or anything
    like that. Just various new features for drivers, core changes, etc.
    In particular, this pull request contains:

    - A patch series from Bart, closing the whole on blk/scsi-mq queue
    quescing.

    - A series from Christoph, building towards hidden gendisks (for
    multipath) and ability to move bio chains around.

    - NVMe
    - Support for native multipath for NVMe (Christoph).
    - Userspace notifications for AENs (Keith).
    - Command side-effects support (Keith).
    - SGL support (Chaitanya Kulkarni)
    - FC fixes and improvements (James Smart)
    - Lots of fixes and tweaks (Various)

    - bcache
    - New maintainer (Michael Lyle)
    - Writeback control improvements (Michael)
    - Various fixes (Coly, Elena, Eric, Liang, et al)

    - lightnvm updates, mostly centered around the pblk interface
    (Javier, Hans, and Rakesh).

    - Removal of unused bio/bvec kmap atomic interfaces (me, Christoph)

    - Writeback series that fix the much discussed hundreds of millions
    of sync-all units. This goes all the way, as discussed previously
    (me).

    - Fix for missing wakeup on writeback timer adjustments (Yafang
    Shao).

    - Fix laptop mode on blk-mq (me).

    - {mq,name} tupple lookup for IO schedulers, allowing us to have
    alias names. This means you can use 'deadline' on both !mq and on
    mq (where it's called mq-deadline). (me).

    - blktrace race fix, oopsing on sg load (me).

    - blk-mq optimizations (me).

    - Obscure waitqueue race fix for kyber (Omar).

    - NBD fixes (Josef).

    - Disable writeback throttling by default on bfq, like we do on cfq
    (Luca Miccio).

    - Series from Ming that enable us to treat flush requests on blk-mq
    like any other request. This is a really nice cleanup.

    - Series from Ming that improves merging on blk-mq with schedulers,
    getting us closer to flipping the switch on scsi-mq again.

    - BFQ updates (Paolo).

    - blk-mq atomic flags memory ordering fixes (Peter Z).

    - Loop cgroup support (Shaohua).

    - Lots of minor fixes from lots of different folks, both for core and
    driver code"

    * 'for-4.15/block' of git://git.kernel.dk/linux-block: (294 commits)
    nvme: fix visibility of "uuid" ns attribute
    blk-mq: fixup some comment typos and lengths
    ide: ide-atapi: fix compile error with defining macro DEBUG
    blk-mq: improve tag waiting setup for non-shared tags
    brd: remove unused brd_mutex
    blk-mq: only run the hardware queue if IO is pending
    block: avoid null pointer dereference on null disk
    fs: guard_bio_eod() needs to consider partitions
    xtensa/simdisk: fix compile error
    nvme: expose subsys attribute to sysfs
    nvme: create 'slaves' and 'holders' entries for hidden controllers
    block: create 'slaves' and 'holders' entries for hidden gendisks
    nvme: also expose the namespace identification sysfs files for mpath nodes
    nvme: implement multipath access to nvme subsystems
    nvme: track shared namespaces
    nvme: introduce a nvme_ns_ids structure
    nvme: track subsystems
    block, nvme: Introduce blk_mq_req_flags_t
    block, scsi: Make SCSI quiesce and resume work reliably
    block: Add the QUEUE_FLAG_PREEMPT_ONLY request queue flag
    ...

    Linus Torvalds
     

11 Nov, 2017

1 commit

  • The contexts from which a SCSI device can be quiesced or resumed are:
    * Writing into /sys/class/scsi_device/*/device/state.
    * SCSI parallel (SPI) domain validation.
    * The SCSI device power management methods. See also scsi_bus_pm_ops.

    It is essential during suspend and resume that neither the filesystem
    state nor the filesystem metadata in RAM changes. This is why while
    the hibernation image is being written or restored that SCSI devices
    are quiesced. The SCSI core quiesces devices through scsi_device_quiesce()
    and scsi_device_resume(). In the SDEV_QUIESCE state execution of
    non-preempt requests is deferred. This is realized by returning
    BLKPREP_DEFER from inside scsi_prep_state_check() for quiesced SCSI
    devices. Avoid that a full queue prevents power management requests
    to be submitted by deferring allocation of non-preempt requests for
    devices in the quiesced state. This patch has been tested by running
    the following commands and by verifying that after each resume the
    fio job was still running:

    for ((i=0; i sync_action
    sleep 1
    done
    ) &
    pids=($!)
    for d in /sys/class/block/sd*[a-z]; do
    bdev=${d#/sys/class/block/}
    hcil=$(readlink "$d/device")
    hcil=${hcil#../../../}
    echo 4 > "$d/queue/nr_requests"
    echo 1 > "/sys/class/scsi_device/$hcil/device/queue_depth"
    fio --name="$bdev" --filename="/dev/$bdev" --buffered=0 --bs=512 \
    --rw=randread --ioengine=libaio --numjobs=4 --iodepth=16 \
    --iodepth_batch=1 --thread --loops=$((2**31)) &
    pids+=($!)
    done
    sleep 1
    echo "$(date) Hibernating ..." >>hibernate-test-log.txt
    systemctl hibernate
    sleep 10
    kill "${pids[@]}"
    echo idle > /sys/block/md0/md/sync_action
    wait
    echo "$(date) Done." >>hibernate-test-log.txt
    done

    Reported-by: Oleksandr Natalenko
    References: "I/O hangs after resuming from suspend-to-ram" (https://marc.info/?l=linux-block&m=150340235201348).
    Signed-off-by: Bart Van Assche
    Reviewed-by: Hannes Reinecke
    Tested-by: Martin Steigerwald
    Tested-by: Oleksandr Natalenko
    Cc: Martin K. Petersen
    Cc: Ming Lei
    Cc: Christoph Hellwig
    Cc: Johannes Thumshirn
    Signed-off-by: Jens Axboe

    Bart Van Assche
     

04 Nov, 2017

1 commit


14 Oct, 2017

1 commit

  • When using FAT on a block device which supports rw_page, we can hit
    BUG_ON(!PageLocked(page)) in try_to_free_buffers(). This is because we
    call clean_buffers() after unlocking the page we've written. Introduce
    a new clean_page_buffers() which cleans all buffers associated with a
    page and call it from within bdev_write_page().

    [akpm@linux-foundation.org: s/PAGE_SIZE/~0U/ per Linus and Matthew]
    Link: http://lkml.kernel.org/r/20171006211541.GA7409@bombadil.infradead.org
    Signed-off-by: Matthew Wilcox
    Reported-by: Toshi Kani
    Reported-by: OGAWA Hirofumi
    Tested-by: Toshi Kani
    Acked-by: Johannes Thumshirn
    Cc: Ross Zwisler
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     

13 Oct, 2017

1 commit


15 Sep, 2017

1 commit


05 Sep, 2017

1 commit


24 Aug, 2017

2 commits

  • This way we don't need a block_device structure to submit I/O. The
    block_device has different life time rules from the gendisk and
    request_queue and is usually only available when the block device node
    is open. Other callers need to explicitly create one (e.g. the lightnvm
    passthrough code, or the new nvme multipathing code).

    For the actual I/O path all that we need is the gendisk, which exists
    once per block device. But given that the block layer also does
    partition remapping we additionally need a partition index, which is
    used for said remapping in generic_make_request.

    Note that all the block drivers generally want request_queue or
    sometimes the gendisk, so this removes a layer of indirection all
    over the stack.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

08 Jul, 2017

1 commit

  • Pull Writeback error handling updates from Jeff Layton:
    "This pile represents the bulk of the writeback error handling fixes
    that I have for this cycle. Some of the earlier patches in this pile
    may look trivial but they are prerequisites for later patches in the
    series.

    The aim of this set is to improve how we track and report writeback
    errors to userland. Most applications that care about data integrity
    will periodically call fsync/fdatasync/msync to ensure that their
    writes have made it to the backing store.

    For a very long time, we have tracked writeback errors using two flags
    in the address_space: AS_EIO and AS_ENOSPC. Those flags are set when a
    writeback error occurs (via mapping_set_error) and are cleared as a
    side-effect of filemap_check_errors (as you noted yesterday). This
    model really sucks for userland.

    Only the first task to call fsync (or msync or fdatasync) will see the
    error. Any subsequent task calling fsync on a file will get back 0
    (unless another writeback error occurs in the interim). If I have
    several tasks writing to a file and calling fsync to ensure that their
    writes got stored, then I need to have them coordinate with one
    another. That's difficult enough, but in a world of containerized
    setups that coordination may even not be possible.

    But wait...it gets worse!

    The calls to filemap_check_errors can be buried pretty far down in the
    call stack, and there are internal callers of filemap_write_and_wait
    and the like that also end up clearing those errors. Many of those
    callers ignore the error return from that function or return it to
    userland at nonsensical times (e.g. truncate() or stat()). If I get
    back -EIO on a truncate, there is no reason to think that it was
    because some previous writeback failed, and a subsequent fsync() will
    (incorrectly) return 0.

    This pile aims to do three things:

    1) ensure that when a writeback error occurs that that error will be
    reported to userland on a subsequent fsync/fdatasync/msync call,
    regardless of what internal callers are doing

    2) report writeback errors on all file descriptions that were open at
    the time that the error occurred. This is a user-visible change,
    but I think most applications are written to assume this behavior
    anyway. Those that aren't are unlikely to be hurt by it.

    3) document what filesystems should do when there is a writeback
    error. Today, there is very little consistency between them, and a
    lot of cargo-cult copying. We need to make it very clear what
    filesystems should do in this situation.

    To achieve this, the set adds a new data type (errseq_t) and then
    builds new writeback error tracking infrastructure around that. Once
    all of that is in place, we change the filesystems to use the new
    infrastructure for reporting wb errors to userland.

    Note that this is just the initial foray into cleaning up this mess.
    There is a lot of work remaining here:

    1) convert the rest of the filesystems in a similar fashion. Once the
    initial set is in, then I think most other fs' will be fairly
    simple to convert. Hopefully most of those can in via individual
    filesystem trees.

    2) convert internal waiters on writeback to use errseq_t for
    detecting errors instead of relying on the AS_* flags. I have some
    draft patches for this for ext4, but they are not quite ready for
    prime time yet.

    This was a discussion topic this year at LSF/MM too. If you're
    interested in the gory details, LWN has some good articles about this:

    https://lwn.net/Articles/718734/
    https://lwn.net/Articles/724307/"

    * tag 'for-linus-v4.13-2' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux:
    btrfs: minimal conversion to errseq_t writeback error reporting on fsync
    xfs: minimal conversion to errseq_t writeback error reporting
    ext4: use errseq_t based error handling for reporting data writeback errors
    fs: convert __generic_file_fsync to use errseq_t based reporting
    block: convert to errseq_t based writeback error tracking
    dax: set errors in mapping when writeback fails
    Documentation: flesh out the section in vfs.txt on storing and reporting writeback errors
    mm: set both AS_EIO/AS_ENOSPC and errseq_t in mapping_set_error
    fs: new infrastructure for writeback error handling and reporting
    lib: add errseq_t type and infrastructure for handling it
    mm: don't TestClearPageError in __filemap_fdatawait_range
    mm: clear AS_EIO/AS_ENOSPC when writeback initiation fails
    jbd2: don't clear and reset errors after waiting on writeback
    buffer: set errors in mapping at the time that the error occurs
    fs: check for writeback errors after syncing out buffers in generic_file_fsync
    buffer: use mapping_set_error instead of setting the flag
    mm: fix mapping_set_error call in me_pagecache_dirty

    Linus Torvalds
     

06 Jul, 2017

2 commits

  • This is a very minimal conversion to errseq_t based error tracking
    for raw block device access. Just have it use the standard
    file_write_and_wait_range call.

    Note that there are internal callers that call sync_blockdev
    and the like that are not affected by this. They'll continue
    to use the AS_EIO/AS_ENOSPC flags for error reporting like
    they always have for now.

    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jeff Layton

    Jeff Layton
     
  • Most filesystems currently use mapping_set_error and
    filemap_check_errors for setting and reporting/clearing writeback errors
    at the mapping level. filemap_check_errors is indirectly called from
    most of the filemap_fdatawait_* functions and from
    filemap_write_and_wait*. These functions are called from all sorts of
    contexts to wait on writeback to finish -- e.g. mostly in fsync, but
    also in truncate calls, getattr, etc.

    The non-fsync callers are problematic. We should be reporting writeback
    errors during fsync, but many places spread over the tree clear out
    errors before they can be properly reported, or report errors at
    nonsensical times.

    If I get -EIO on a stat() call, there is no reason for me to assume that
    it is because some previous writeback failed. The fact that it also
    clears out the error such that a subsequent fsync returns 0 is a bug,
    and a nasty one since that's potentially silent data corruption.

    This patch adds a small bit of new infrastructure for setting and
    reporting errors during address_space writeback. While the above was my
    original impetus for adding this, I think it's also the case that
    current fsync semantics are just problematic for userland. Most
    applications that call fsync do so to ensure that the data they wrote
    has hit the backing store.

    In the case where there are multiple writers to the file at the same
    time, this is really hard to determine. The first one to call fsync will
    see any stored error, and the rest get back 0. The processes with open
    fds may not be associated with one another in any way. They could even
    be in different containers, so ensuring coordination between all fsync
    callers is not really an option.

    One way to remedy this would be to track what file descriptor was used
    to dirty the file, but that's rather cumbersome and would likely be
    slow. However, there is a simpler way to improve the semantics here
    without incurring too much overhead.

    This set adds an errseq_t to struct address_space, and a corresponding
    one is added to struct file. Writeback errors are recorded in the
    mapping's errseq_t, and the one in struct file is used as the "since"
    value.

    This changes the semantics of the Linux fsync implementation such that
    applications can now use it to determine whether there were any
    writeback errors since fsync(fd) was last called (or since the file was
    opened in the case of fsync having never been called).

    Note that those writeback errors may have occurred when writing data
    that was dirtied via an entirely different fd, but that's the case now
    with the current mapping_set_error/filemap_check_error infrastructure.
    This will at least prevent you from getting a false report of success.

    The new behavior is still consistent with the POSIX spec, and is more
    reliable for application developers. This patch just adds some basic
    infrastructure for doing this, and ensures that the f_wb_err "cursor"
    is properly set when a file is opened. Later patches will change the
    existing code to use this new infrastructure for reporting errors at
    fsync time.

    Signed-off-by: Jeff Layton
    Reviewed-by: Jan Kara

    Jeff Layton
     

04 Jul, 2017

1 commit

  • Pull core block/IO updates from Jens Axboe:
    "This is the main pull request for the block layer for 4.13. Not a huge
    round in terms of features, but there's a lot of churn related to some
    core cleanups.

    Note this depends on the UUID tree pull request, that Christoph
    already sent out.

    This pull request contains:

    - A series from Christoph, unifying the error/stats codes in the
    block layer. We now use blk_status_t everywhere, instead of using
    different schemes for different places.

    - Also from Christoph, some cleanups around request allocation and IO
    scheduler interactions in blk-mq.

    - And yet another series from Christoph, cleaning up how we handle
    and do bounce buffering in the block layer.

    - A blk-mq debugfs series from Bart, further improving on the support
    we have for exporting internal information to aid debugging IO
    hangs or stalls.

    - Also from Bart, a series that cleans up the request initialization
    differences across types of devices.

    - A series from Goldwyn Rodrigues, allowing the block layer to return
    failure if we will block and the user asked for non-blocking.

    - Patch from Hannes for supporting setting loop devices block size to
    that of the underlying device.

    - Two series of patches from Javier, fixing various issues with
    lightnvm, particular around pblk.

    - A series from me, adding support for write hints. This comes with
    NVMe support as well, so applications can help guide data placement
    on flash to improve performance, latencies, and write
    amplification.

    - A series from Ming, improving and hardening blk-mq support for
    stopping/starting and quiescing hardware queues.

    - Two pull requests for NVMe updates. Nothing major on the feature
    side, but lots of cleanups and bug fixes. From the usual crew.

    - A series from Neil Brown, greatly improving the bio rescue set
    support. Most notably, this kills the bio rescue work queues, if we
    don't really need them.

    - Lots of other little bug fixes that are all over the place"

    * 'for-4.13/block' of git://git.kernel.dk/linux-block: (217 commits)
    lightnvm: pblk: set line bitmap check under debug
    lightnvm: pblk: verify that cache read is still valid
    lightnvm: pblk: add initialization check
    lightnvm: pblk: remove target using async. I/Os
    lightnvm: pblk: use vmalloc for GC data buffer
    lightnvm: pblk: use right metadata buffer for recovery
    lightnvm: pblk: schedule if data is not ready
    lightnvm: pblk: remove unused return variable
    lightnvm: pblk: fix double-free on pblk init
    lightnvm: pblk: fix bad le64 assignations
    nvme: Makefile: remove dead build rule
    blk-mq: map all HWQ also in hyperthreaded system
    nvmet-rdma: register ib_client to not deadlock in device removal
    nvme_fc: fix error recovery on link down.
    nvmet_fc: fix crashes on bad opcodes
    nvme_fc: Fix crash when nvme controller connection fails.
    nvme_fc: replace ioabort msleep loop with completion
    nvme_fc: fix double calls to nvme_cleanup_cmd()
    nvme-fabrics: verify that a controller returns the correct NQN
    nvme: simplify nvme_dev_attrs_are_visible
    ...

    Linus Torvalds
     

29 Jun, 2017

1 commit

  • Wen reports significant memory leaks with DIF and O_DIRECT:

    "With nvme devive + T10 enabled, On a system it has 256GB and started
    logging /proc/meminfo & /proc/slabinfo for every minute and in an hour
    it increased by 15968128 kB or ~15+GB.. Approximately 256 MB / minute
    leaking.

    /proc/meminfo | grep SUnreclaim...

    SUnreclaim: 6752128 kB
    SUnreclaim: 6874880 kB
    SUnreclaim: 7238080 kB
    ....
    SUnreclaim: 22307264 kB
    SUnreclaim: 22485888 kB
    SUnreclaim: 22720256 kB

    When testcases with T10 enabled call into __blkdev_direct_IO_simple,
    code doesn't free memory allocated by bio_integrity_alloc. The patch
    fixes the issue. HTX has been run with +60 hours without failure."

    Since __blkdev_direct_IO_simple() allocates the bio on the stack, it
    doesn't go through the regular bio free. This means that any ancillary
    data allocated with the bio through the stack is not freed. Hence, we
    can leak the integrity data associated with the bio, if the device is
    using DIF/DIX.

    Fix this by providing a bio_uninit() and export it, so that we can use
    it to free this data. Note that this is a minimal fix for this issue.
    Any current user of bio's that are allocated outside of
    bio_alloc_bioset() suffers from this issue, most notably some drivers.
    We will fix those in a more comprehensive patch for 4.13. This also
    means that the commit marked as being fixed by this isn't the real
    culprit, it's just the most obvious one out there.

    Fixes: 542ff7bf18c6 ("block: new direct I/O implementation")
    Reported-by: Wen Xiong
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Jens Axboe
     

28 Jun, 2017

1 commit


19 Jun, 2017

1 commit

  • "flags" arguments are often seen as good API design as they allow
    easy extensibility.
    bioset_create_nobvec() is implemented internally as a variation in
    flags passed to __bioset_create().

    To support future extension, make the internal structure part of the
    API.
    i.e. add a 'flags' argument to bioset_create() and discard
    bioset_create_nobvec().

    Note that the bio_split allocations in drivers/md/raid* do not need
    the bvec mempool - they should have used bioset_create_nobvec().

    Suggested-by: Christoph Hellwig
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Ming Lei
    Signed-off-by: NeilBrown
    Signed-off-by: Jens Axboe

    NeilBrown
     

09 Jun, 2017

2 commits


13 May, 2017

1 commit

  • Pull libnvdimm fixes from Dan Williams:
    "Incremental fixes and a small feature addition on top of the main
    libnvdimm 4.12 pull request:

    - Geert noticed that tinyconfig was bloated by BLOCK selecting DAX.
    The size regression is fixed by moving all dax helpers into the
    dax-core and only specifying "select DAX" for FS_DAX and
    dax-capable drivers. He also asked for clarification of the
    NR_DEV_DAX config option which, on closer look, does not need to be
    a config option at all. Mike also throws in a DEV_DAX_PMEM fixup
    for good measure.

    - Ben's attention to detail on -stable patch submissions caught a
    case where the recent fixes to arch_copy_from_iter_pmem() missed a
    condition where we strand dirty data in the cache. This is tagged
    for -stable and will also be included in the rework of the pmem api
    to a proposed {memcpy,copy_user}_flushcache() interface for 4.13.

    - Vishal adds a feature that missed the initial pull due to pending
    review feedback. It allows the kernel to clear media errors when
    initializing a BTT (atomic sector update driver) instance on a pmem
    namespace.

    - Ross noticed that the dax_device + dax_operations conversion broke
    __dax_zero_page_range(). The nvdimm unit tests fail to check this
    path, but xfstests immediately trips over it. No excuse for missing
    this before submitting the 4.12 pull request.

    These all pass the nvdimm unit tests and an xfstests spot check. The
    set has received a build success notification from the kbuild robot"

    * 'libnvdimm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
    filesystem-dax: fix broken __dax_zero_page_range() conversion
    libnvdimm, btt: ensure that initializing metadata clears poison
    libnvdimm: add an atomic vs process context flag to rw_bytes
    x86, pmem: Fix cache flushing for iovec write < 8 bytes
    device-dax: kill NR_DEV_DAX
    block, dax: move "select DAX" from BLOCK to FS_DAX
    device-dax: Tell kbuild DEV_DAX_PMEM depends on DEV_DAX

    Linus Torvalds
     

09 May, 2017

1 commit

  • For configurations that do not enable DAX filesystems or drivers, do not
    require the DAX core to be built.

    Given that the 'direct_access' method has been removed from
    'block_device_operations', we can also go ahead and remove the
    block-related dax helper functions from fs/block_dev.c to
    drivers/dax/super.c. This keeps dax details out of the block layer and
    lets the DAX core be built as a module in the FS_DAX=n case.

    Filesystems need to include dax.h to call bdev_dax_supported().

    Cc: linux-xfs@vger.kernel.org
    Cc: Jens Axboe
    Cc: "Theodore Ts'o"
    Cc: Matthew Wilcox
    Cc: Alexander Viro
    Cc: "Darrick J. Wong"
    Cc: Ross Zwisler
    Reviewed-by: Jan Kara
    Reported-by: Geert Uytterhoeven
    Signed-off-by: Dan Williams

    Dan Williams
     

06 May, 2017

1 commit

  • Pull libnvdimm updates from Dan Williams:
    "The bulk of this has been in multiple -next releases. There were a few
    late breaking fixes and small features that got added in the last
    couple days, but the whole set has received a build success
    notification from the kbuild robot.

    Change summary:

    - Region media error reporting: A libnvdimm region device is the
    parent to one or more namespaces. To date, media errors have been
    reported via the "badblocks" attribute attached to pmem block
    devices for namespaces in "raw" or "memory" mode. Given that
    namespaces can be in "device-dax" or "btt-sector" mode this new
    interface reports media errors generically, i.e. independent of
    namespace modes or state.

    This subsequently allows userspace tooling to craft "ACPI 6.1
    Section 9.20.7.6 Function Index 4 - Clear Uncorrectable Error"
    requests and submit them via the ioctl path for NVDIMM root bus
    devices.

    - Introduce 'struct dax_device' and 'struct dax_operations': Prompted
    by a request from Linus and feedback from Christoph this allows for
    dax capable drivers to publish their own custom dax operations.
    This fixes the broken assumption that all dax operations are
    related to a persistent memory device, and makes it easier for
    other architectures and platforms to add customized persistent
    memory support.

    - 'libnvdimm' core updates: A new "deep_flush" sysfs attribute is
    available for storage appliance applications to manually trigger
    memory controllers to drain write-pending buffers that would
    otherwise be flushed automatically by the platform ADR
    (asynchronous-DRAM-refresh) mechanism at a power loss event.
    Support for "locked" DIMMs is included to prevent namespaces from
    surfacing when the namespace label data area is locked. Finally,
    fixes for various reported deadlocks and crashes, also tagged for
    -stable.

    - ACPI / nfit driver updates: General updates of the nfit driver to
    add DSM command overrides, ACPI 6.1 health state flags support, DSM
    payload debug available by default, and various fixes.

    Acknowledgements that came after the branch was pushed:

    - commmit 565851c972b5 "device-dax: fix sysfs attribute deadlock":
    Tested-by: Yi Zhang

    - commit 23f498448362 "libnvdimm: rework region badblocks clearing"
    Tested-by: Toshi Kani "

    * tag 'libnvdimm-for-4.12' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (52 commits)
    libnvdimm, pfn: fix 'npfns' vs section alignment
    libnvdimm: handle locked label storage areas
    libnvdimm: convert NDD_ flags to use bitops, introduce NDD_LOCKED
    brd: fix uninitialized use of brd->dax_dev
    block, dax: use correct format string in bdev_dax_supported
    device-dax: fix sysfs attribute deadlock
    libnvdimm: restore "libnvdimm: band aid btt vs clear poison locking"
    libnvdimm: fix nvdimm_bus_lock() vs device_lock() ordering
    libnvdimm: rework region badblocks clearing
    acpi, nfit: kill ACPI_NFIT_DEBUG
    libnvdimm: fix clear length of nvdimm_forget_poison()
    libnvdimm, pmem: fix a NULL pointer BUG in nd_pmem_notify
    libnvdimm, region: sysfs trigger for nvdimm_flush()
    libnvdimm: fix phys_addr for nvdimm_clear_poison
    x86, dax, pmem: remove indirection around memcpy_from_pmem()
    block: remove block_device_operations ->direct_access()
    block, dax: convert bdev_dax_supported() to dax_direct_access()
    filesystem-dax: convert to dax_direct_access()
    Revert "block: use DAX for partition table reads"
    ext2, ext4, xfs: retrieve dax_device for iomap operations
    ...

    Linus Torvalds
     

04 May, 2017

1 commit

  • invalidate_bdev() calls cleancache_invalidate_inode() iff ->nrpages != 0
    which doen't make any sense.

    Make sure that invalidate_bdev() always calls cleancache_invalidate_inode()
    regardless of mapping->nrpages value.

    Fixes: c515e1fd361c ("mm/fs: add hooks to support cleancache")
    Link: http://lkml.kernel.org/r/20170424164135.22350-3-aryabinin@virtuozzo.com
    Signed-off-by: Andrey Ryabinin
    Reviewed-by: Jan Kara
    Acked-by: Konrad Rzeszutek Wilk
    Cc: Alexander Viro
    Cc: Ross Zwisler
    Cc: Jens Axboe
    Cc: Johannes Weiner
    Cc: Alexey Kuznetsov
    Cc: Christoph Hellwig
    Cc: Nikolay Borisov
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Ryabinin