25 Sep, 2016

1 commit

  • The definition of the flush hint table as:

    void __iomem *flush_wpq[0][0];

    ...passed the unit test, but is broken as flush_wpq[0][1] and
    flush_wpq[1][0] refer to the same entry. Fix this to use a helper that
    calculates a slot in the table based on the geometry of flush hints in
    the region. This is important to get right since virtualization
    solutions use this mechanism to trigger hypervisor flushes to platform
    persistence.

    Reported-by: Dave Jiang
    Tested-by: Dave Jiang
    Signed-off-by: Dan Williams

    Dan Williams
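
    The slot calculation described above can be sketched in plain C. This
    is a minimal userspace model, not the kernel's helper: struct
    flush_table and flush_slot() are hypothetical names, and the layout
    (one flat array with a fixed number of hints per DIMM) is an
    assumption made for illustration.

```c
#include <stddef.h>

/*
 * Hypothetical model of a per-region flush hint table. The names are
 * illustrative and are not the kernel's actual structures.
 */
struct flush_table {
    size_t num_dimms;       /* DIMMs in the region */
    size_t hints_per_dimm;  /* flush hints available on each DIMM */
};

/*
 * Map (dimm, hint) to a unique slot in a flat array holding
 * num_dimms * hints_per_dimm entries.
 */
size_t flush_slot(const struct flush_table *t, size_t dimm, size_t hint)
{
    return dimm * t->hints_per_dimm + hint;
}
```

    With two hints per DIMM, (dimm 0, hint 1) and (dimm 1, hint 0) land in
    different slots, which is the property the zero-sized two-dimensional
    array could not provide.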
     

22 Sep, 2016

1 commit


19 Sep, 2016

1 commit

  • nd_activate_region() iomaps any hint addresses required when activating
    a region. To prevent duplicate mappings it checks the PFN of the hint to
    be mapped against the PFNs of the already mapped hints. Unfortunately it
    doesn't convert the PFN back into a physical address before passing it
    to devm_nvdimm_ioremap(). Instead it applies PHYS_PFN a second time
    which ends about as well as you would imagine.

    Signed-off-by: Oliver O'Halloran
    Signed-off-by: Dan Williams

    Oliver O'Halloran
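
    The double conversion can be reproduced with a small userspace sketch.
    The helpers below are stand-ins for the kernel's PHYS_PFN() and
    PFN_PHYS() macros, 4K pages are assumed, and map_addr_buggy() and
    map_addr_fixed() are hypothetical names used only for illustration.

```c
#include <stdint.h>

#define PAGE_SHIFT 12  /* assumes 4K pages */

/* Userspace stand-ins for the kernel's PHYS_PFN() and PFN_PHYS(). */
uint64_t phys_to_pfn(uint64_t phys) { return phys >> PAGE_SHIFT; }
uint64_t pfn_to_phys(uint64_t pfn)  { return pfn << PAGE_SHIFT; }

/* Buggy: treats the PFN as a physical address and shifts it down again. */
uint64_t map_addr_buggy(uint64_t hint_pfn)
{
    return phys_to_pfn(hint_pfn);
}

/* Fixed: converts the PFN back into a physical address before mapping. */
uint64_t map_addr_fixed(uint64_t hint_pfn)
{
    return pfn_to_phys(hint_pfn);
}
```

    For a hint at physical address 0xfed40000 the PFN is 0xfed40. The
    fixed path recovers 0xfed40000, while the buggy path yields 0xfe, an
    address that has nothing to do with the hint.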
     

10 Sep, 2016

1 commit

  • Bad blocks can be injected via /sys/block/pmemN/badblocks. When legacy
    pmem is in use, or a pmem region was created with the memmap kernel
    parameter, the injected bad blocks are not cleared because
    nvdimm_clear_poison() fails for lack of an ndctl function pointer. In
    this case we need to just return as handled and allow the bad blocks to
    be cleared rather than fail.

    Reviewed-by: Vishal Verma
    Signed-off-by: Dave Jiang
    Signed-off-by: Dan Williams

    Dave Jiang
     

09 Aug, 2016

1 commit


08 Aug, 2016

2 commits

  • Since commit 63a4cc24867d, bio->bi_rw contains flags in the lower
    portion and the op code in the higher portions. This means that
    old code that relies on manually setting bi_rw is most likely
    going to be broken. Instead of letting that brokenness linger,
    rename the member, to force old and out-of-tree code to break
    at compile time instead of at runtime.

    No intended functional changes in this commit.

    Signed-off-by: Jens Axboe

    Jens Axboe
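
    A rough userspace sketch of why manually set values break. The bit
    layout here (a 3-bit op code in the top bits of a 32-bit word, flags
    below it) and the names pack_opf(), opf_op(), opf_flags(), OP_READ and
    OP_WRITE are assumptions for illustration, not the kernel's
    definitions.

```c
#include <stdint.h>

#define OP_BITS   3                      /* assumed width of the op code */
#define OP_SHIFT  (32 - OP_BITS)         /* op code lives in the top bits */
#define FLAG_MASK ((1u << OP_SHIFT) - 1) /* flags live in the low bits */

enum { OP_READ = 0, OP_WRITE = 1 };      /* hypothetical op codes */

/* Combine an op code and flags into one word. */
uint32_t pack_opf(unsigned int op, uint32_t flags)
{
    return ((uint32_t)op << OP_SHIFT) | (flags & FLAG_MASK);
}

/* Extract the op code from the high bits. */
unsigned int opf_op(uint32_t opf)
{
    return opf >> OP_SHIFT;
}

/* Extract the flags from the low bits. */
uint32_t opf_flags(uint32_t opf)
{
    return opf & FLAG_MASK;
}
```

    Under this layout, old-style code that stores a bare 1 to mean "write"
    sets only a low flag bit, so opf_op() decodes the word as OP_READ.
    Renaming the member turns that silent misbehavior into a compile
    error.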
     
  • Commit abf545484d31 changed ->rw_page() from an 'rw' flags type to
    the newer ops based interface, but now we're effectively leaking
    some bdev internals to the rest of the kernel. Since we only
    care about whether it's a read or a write at that level, just
    pass in a bool 'is_write' parameter instead.

    Then we can also move op_is_write() and friends back under
    CONFIG_BLOCK protection.

    Reviewed-by: Mike Christie
    Signed-off-by: Jens Axboe

    Jens Axboe
     

05 Aug, 2016

1 commit

  • The rw_page users were not converted to use bio/req ops. As a result
    bdev_write_page is not passing down REQ_OP_WRITE and the IOs will
    be sent down as reads.

    Signed-off-by: Mike Christie
    Fixes: 4e1b2d52a80d ("block, fs, drivers: remove REQ_OP compat defs and related code")

    Modified by me to:

    1) Drop op_flags passing into ->rw_page(), as we don't use it.
    2) Make op_is_write() and friends safe to use for !CONFIG_BLOCK

    Signed-off-by: Jens Axboe

    Mike Christie
     

29 Jul, 2016

1 commit

  • Pull libnvdimm updates from Dan Williams:

    - Replace pcommit with ADR / directed-flushing.

    The pcommit instruction, which has not shipped on any product, is
    deprecated. Instead, the requirement is that platforms implement
    either ADR, or provide one or more flush addresses per nvdimm.

    ADR (Asynchronous DRAM Refresh) flushes data in posted write buffers
    to the memory controller on a power-fail event.

    Flush addresses are defined in ACPI 6.x as an NVDIMM Firmware
    Interface Table (NFIT) sub-structure: "Flush Hint Address Structure".
    A flush hint is an mmio address that when written and fenced assures
    that all previous posted writes targeting a given dimm have been
    flushed to media.

    - On-demand ARS (address range scrub).

    Linux uses the results of the ACPI ARS commands to track bad blocks
    in pmem devices. When latent errors are detected we re-scrub the
    media to refresh the bad block list, userspace can also request a
    re-scrub at any time.

    - Support for the Microsoft DSM (device specific method) command
    format.

    - Support for EDK2/OVMF virtual disk device memory ranges.

    - Various fixes and cleanups across the subsystem.

    * tag 'libnvdimm-for-4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (41 commits)
    libnvdimm-btt: Delete an unnecessary check before the function call "__nd_device_register"
    nfit: do an ARS scrub on hitting a latent media error
    nfit: move to nfit/ sub-directory
    nfit, libnvdimm: allow an ARS scrub to be triggered on demand
    libnvdimm: register nvdimm_bus devices with an nd_bus driver
    pmem: clarify a debug print in pmem_clear_poison
    x86/insn: remove pcommit
    Revert "KVM: x86: add pcommit support"
    nfit, tools/testing/nvdimm/: unify shutdown paths
    libnvdimm: move ->module to struct nvdimm_bus_descriptor
    nfit: cleanup acpi_nfit_init calling convention
    nfit: fix _FIT evaluation memory leak + use after free
    tools/testing/nvdimm: add manufacturing_{date|location} dimm properties
    tools/testing/nvdimm: add virtual ramdisk range
    acpi, nfit: treat virtual ramdisk SPA as pmem region
    pmem: kill __pmem address space
    pmem: kill wmb_pmem()
    libnvdimm, pmem: use nvdimm_flush() for namespace I/O writes
    fs/dax: remove wmb_pmem()
    libnvdimm, pmem: flush posted-write queues on shutdown
    ...

    Linus Torvalds
     

27 Jul, 2016

2 commits

  • Pull block driver updates from Jens Axboe:
    "This branch also contains core changes. I've come to the conclusion
    that from 4.9 and forward, I'll be doing just a single branch. We
    often have dependencies between core and drivers, and it's hard to
    always split them up appropriately without pulling core into drivers
    when that happens.

    That said, this contains:

    - separate secure erase type for the core block layer, from
    Christoph.

    - set of discard fixes, from Christoph.

    - bio shrinking fixes from Christoph, as a follow-up to the
    op/flags change in the core branch.

    - map and append request fixes from Christoph.

    - NVMeF (NVMe over Fabrics) code from Christoph. This is pretty
    exciting!

    - nvme-loop fixes from Arnd.

    - removal of ->driverfs_dev from Dan, after providing a
    device_add_disk() helper.

    - bcache fixes from Bhaktipriya and Yijing.

    - cdrom subchannel read fix from Vchannaiah.

    - set of lightnvm updates from Wenwei, Matias, Johannes, and Javier.

    - set of drbd updates and fixes from Fabian, Lars, and Philipp.

    - mg_disk error path fix from Bart.

    - user notification for failed device add for loop, from Minfei.

    - NVMe in general:
    + NVMe delay quirk from Guilherme.
    + SR-IOV support and command retry limits from Keith.
    + fix for memory-less NUMA node from Masayoshi.
    + use UINT_MAX for discard sectors, from Minfei.
    + cancel IO fixes from Ming.
    + don't allocate unused major, from Neil.
    + error code fixup from Dan.
    + use constants for PSDT/FUSE from James.
    + variable init fix from Jay.
    + fabrics fixes from Ming, Sagi, and Wei.
    + various fixes"

    * 'for-4.8/drivers' of git://git.kernel.dk/linux-block: (115 commits)
    nvme/pci: Provide SR-IOV support
    nvme: initialize variable before logical OR'ing it
    block: unexport various bio mapping helpers
    scsi/osd: open code blk_make_request
    target: stop using blk_make_request
    block: simplify and export blk_rq_append_bio
    block: ensure bios return from blk_get_request are properly initialized
    virtio_blk: use blk_rq_map_kern
    memstick: don't allow REQ_TYPE_BLOCK_PC requests
    block: shrink bio size again
    block: simplify and cleanup bvec pool handling
    block: get rid of bio_rw and READA
    block: don't ignore -EOPNOTSUPP blkdev_issue_write_same
    block: introduce BLKDEV_DISCARD_ZERO to fix zeroout
    NVMe: don't allocate unused nvme_major
    nvme: avoid crashes when node 0 is memoryless node.
    nvme: Limit command retries
    loop: Make user notify for adding loop device failed
    nvme-loop: fix nvme-loop Kconfig dependencies
    nvmet: fix return value check in nvmet_subsys_alloc()
    ...

    Linus Torvalds
     
  • Pull core block updates from Jens Axboe:

    - the big change is the cleanup from Mike Christie, cleaning up our
    uses of command types and modified flags. This is what will throw
    some merge conflicts

    - regression fix for the above for btrfs, from Vincent

    - following up to the above, better packing of struct request from
    Christoph

    - a 2038 fix for blktrace from Arnd

    - a few trivial/spelling fixes from Bart Van Assche

    - a front merge check fix from Damien, which could cause issues on
    SMR drives

    - Atari partition fix from Gabriel

    - convert cfq to highres timers, since jiffies isn't granular enough
    for some devices these days. From Jan and Jeff

    - CFQ priority boost fix for idle classes, from me

    - cleanup series from Ming, improving our bio/bvec iteration

    - a direct issue fix for blk-mq from Omar

    - fix for plug merging not involving the IO scheduler, like we do for
    other types of merges. From Tahsin

    - expose DAX type internally and through sysfs. From Toshi and Yigal

    * 'for-4.8/core' of git://git.kernel.dk/linux-block: (76 commits)
    block: Fix front merge check
    block: do not merge requests without consulting with io scheduler
    block: Fix spelling in a source code comment
    block: expose QUEUE_FLAG_DAX in sysfs
    block: add QUEUE_FLAG_DAX for devices to advertise their DAX support
    Btrfs: fix comparison in __btrfs_map_block()
    block: atari: Return early for unsupported sector size
    Doc: block: Fix a typo in queue-sysfs.txt
    cfq-iosched: Charge at least 1 jiffie instead of 1 ns
    cfq-iosched: Fix regression in bonnie++ rewrite performance
    cfq-iosched: Convert slice_resid from u64 to s64
    block: Convert fifo_time from ulong to u64
    blktrace: avoid using timespec
    block/blk-cgroup.c: Declare local symbols static
    block/bio-integrity.c: Add #include "blk.h"
    block/partition-generic.c: Remove a set-but-not-used variable
    block: bio: kill BIO_MAX_SIZE
    cfq-iosched: temporarily boost queue priority for idle classes
    block: drbd: avoid to use BIO_MAX_SIZE
    block: bio: remove BIO_MAX_SECTORS
    ...

    Linus Torvalds
     

24 Jul, 2016

5 commits

  • Dan Williams
     
  • The __nd_device_register() function tests whether its argument is NULL
    and then returns immediately. Thus the test around the call is not needed.

    This issue was detected by using the Coccinelle software.

    Signed-off-by: Markus Elfring
    Signed-off-by: Dan Williams

    Markus Elfring
     
  • Normally, an ARS (Address Range Scrub) only happens at
    boot/initialization time. There can however arise situations where a
    bus-wide rescan is needed - notably, in the case of discovering a latent
    media error, we should do a full rescan to figure out what other sectors
    are bad, and thus potentially avoid triggering an mce on them in the
    future. Also provide a sysfs trigger to start a bus-wide scrub.

    Cc: Rafael J. Wysocki
    Signed-off-by: Vishal Verma
    Signed-off-by: Dan Williams

    Vishal Verma
     
  • A recent effort to add a new nvdimm bus provider attribute highlighted a
    race between interrogating nvdimm_bus->nd_desc and nvdimm_bus tear down.
    The typical way to handle these races is to take the device_lock() in
    the attribute method and validate that the device is still active. In
    order for a device to be 'active' it needs to be associated with a
    driver. So, we create the small boilerplate for a driver and register
    nvdimm_bus devices on the 'nvdimm_bus_type' bus.

    A result of this change is that ndbusX devices now appear under
    /sys/bus/nd/devices. In fact this makes /sys/class/nd somewhat
    redundant, but removing that will need to take a long deprecation period
    given its use by ndctl binaries in the field.

    This change naturally pulls code from drivers/nvdimm/core.c to
    drivers/nvdimm/bus.c, so it is a nice code organization clean-up as
    well.

    Cc: Vishal Verma
    Signed-off-by: Dan Williams

    Dan Williams
     
  • Prefix the sector number being cleared with a '0x' to make it clear that
    this is a hex value.

    Signed-off-by: Vishal Verma
    Signed-off-by: Dan Williams

    Vishal Verma
     

22 Jul, 2016

1 commit


21 Jul, 2016

1 commit

  • Currently, presence of direct_access() in block_device_operations
    indicates support of DAX on its block device. Because
    block_device_operations is instantiated with 'const', this DAX
    capability may not be enabled conditionally.

    In preparation for supporting DAX to device-mapper devices, add
    QUEUE_FLAG_DAX to request_queue flags to advertise their DAX
    support. This will allow the DAX capability to be set based on how
    the mapped device is composed.

    Signed-off-by: Toshi Kani
    Acked-by: Dan Williams
    Signed-off-by: Mike Snitzer
    Cc: Jens Axboe
    Cc: Ross Zwisler
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Cc:
    Signed-off-by: Jens Axboe

    Toshi Kani
     

13 Jul, 2016

3 commits


12 Jul, 2016

6 commits

  • Given that nvdimm_flush() has higher overhead than wmb_pmem() (pointer
    chasing through nd_region), and that we otherwise assume a platform has
    ADR capability when flush hints are not present, move nvdimm_flush() to
    REQ_FLUSH context.

    Note that we still arrange for nvdimm_flush() to be called even in the
    ADR case. We need at least one wmb() fence to push buffered writes in
    the cpu out to the ADR protected domain.

    Cc: Toshi Kani
    Cc: Ross Zwisler
    Signed-off-by: Dan Williams

    Dan Williams
     
  • When the NFIT provides multiple flush hint addresses per-dimm it is
    expressing that the platform is capable of processing multiple flush
    requests in parallel. There is some fixed cost per flush request, let
    the cost be shared in parallel on multiple cpus.

    Since there may not be enough flush hint addresses for each cpu to have
    one, keep a per-cpu index of the last used hint, hash it with current
    pid, and assume that access pattern and scheduler randomness will keep
    the flush-hint usage somewhat staggered across cpus.

    Cc: Ross Zwisler
    Signed-off-by: Dan Williams

    Dan Williams
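
    The hint selection described above can be sketched as follows. This
    is a userspace approximation under stated assumptions: hash32() is a
    simple multiplicative hash standing in for the kernel's hashing,
    next_flush_hint() is a hypothetical name, the per-cpu counter is
    modeled as a plain variable, and the hint count is assumed to be a
    power of two.

```c
#include <stdint.h>

/* Simple multiplicative hash that keeps the top 'bits' bits (1..32). */
uint32_t hash32(uint32_t val, unsigned int bits)
{
    return (val * 0x9E3779B1u) >> (32 - bits);
}

/*
 * Advance a per-cpu counter by a hash of (pid + last index) and fold
 * the result into the number of available hints. num_hints must be a
 * power of two in this sketch so that masking acts as a modulo.
 */
unsigned int next_flush_hint(uint32_t *cpu_idx, uint32_t pid,
                             unsigned int num_hints)
{
    *cpu_idx += hash32(pid + *cpu_idx, 8);
    return *cpu_idx & (num_hints - 1);
}
```

    The result is not a fair round-robin. It only aims to keep different
    tasks and cpus from piling onto the same hint address at the same
    time.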
     
  • nvdimm_flush() is a replacement for the x86 'pcommit' instruction. It is
    an optional write flushing mechanism that an nvdimm bus can provide for
    the pmem driver to consume. In the case of the NFIT nvdimm-bus-provider
    nvdimm_flush() is implemented as a series of flush-hint-address [1]
    writes to each dimm in the interleave set (region) that backs the
    namespace.

    The nvdimm_has_flush() routine relies on platform firmware to describe
    the flushing capabilities of a platform. It uses the heuristic of
    whether an nvdimm bus provider provides flush address data to return a
    ternary result:

    1: flush addresses defined
    0: dimm topology described without flush addresses (assume ADR)
    -errno: no topology information, unable to determine flush mechanism

    The pmem driver is expected to take the following actions on this ternary
    result:

    1: nvdimm_flush() in response to REQ_FUA / REQ_FLUSH and shutdown
    0: do not set WC or FUA on the queue, take no further action
    -errno: warn and then operate as if nvdimm_has_flush() returned '0'

    The caveat of this heuristic is that it cannot distinguish the "dimm
    does not have flush address" case from the "platform firmware is broken
    and failed to describe a flush address" case. Given we are already
    explicitly trusting the NFIT there's not much more we can do beyond
    blacklisting broken firmwares if they are ever encountered.

    Cc: Ross Zwisler
    Signed-off-by: Dan Williams

    Dan Williams
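
    The ternary handling listed in this commit message can be sketched as
    a small decision function. struct flush_policy and
    pick_flush_policy() are hypothetical names, and the sketch follows
    only the table in this message, not any later change to the driver.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical summary of what a pmem-like driver would configure. */
struct flush_policy {
    bool advertise_write_cache; /* set WC / FUA on the queue */
    bool call_flush;            /* flush on REQ_FUA / REQ_FLUSH and shutdown */
};

/* has_flush: 1 = flush addresses defined, 0 = assume ADR, <0 = -errno */
struct flush_policy pick_flush_policy(int has_flush)
{
    struct flush_policy p = { false, false };

    if (has_flush < 0) {
        /* No topology information: warn, then behave as if 0. */
        fprintf(stderr, "unable to determine flush mechanism (%d)\n",
                has_flush);
        has_flush = 0;
    }
    if (has_flush > 0) {
        p.advertise_write_cache = true;
        p.call_flush = true;
    }
    return p;
}
```

    Folding the -errno case into the '0' case after a warning keeps the
    driver usable on platforms that provide no topology information.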
     
  • nd_region device driver data will be used in the namespace i/o path.
    Re-order nd_region_remove() to ensure this data stays live across
    namespace device removal.

    Signed-off-by: Dan Williams

    Dan Williams
     
  • In preparation for triggering flushes of a DIMM's writes-posted-queue
    (WPQ) via the pmem driver move mapping of flush hint addresses to the
    region driver. Since this uses devm_nvdimm_memremap() the flush
    addresses will remain mapped while any region to which the dimm belongs
    is active.

    We need to communicate more information to the nvdimm core to facilitate
    this mapping, namely each dimm object now carries an array of flush hint
    address resources.

    Signed-off-by: Dan Williams

    Dan Williams
     
  • Now that all shared mappings are handled by devm_nvdimm_memremap() we no
    longer need nfit_spa_map() nor do we need to trigger a callback to the
    bus provider at region disable time.

    Signed-off-by: Dan Williams

    Dan Williams
     

08 Jul, 2016

1 commit

  • In preparation for generically mapping flush hint addresses for both the
    BLK and PMEM use case, provide a generic / reference counted mapping
    api. Given the fact that a dimm may belong to multiple regions (PMEM
    and BLK), the flush hint addresses need to be held valid as long as any
    region associated with the dimm is active. This is similar to the
    existing BLK-region case where multiple BLK-regions may share an
    aperture mapping. Up-level this shared / reference-counted mapping
    capability from the nfit driver to a core nvdimm capability.

    This eliminates the need for the nd_blk_region.disable() callback. Note
    that the removal of nfit_spa_map() and related infrastructure is
    deferred to a later patch.

    Signed-off-by: Dan Williams

    Dan Williams
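
    The shared / reference-counted mapping idea can be sketched in
    userspace as below. This is not the nvdimm core API: struct
    shared_map, struct map_bus, map_get() and map_put() are hypothetical
    names, and a heap allocation stands in for the real mapping.

```c
#include <stdint.h>
#include <stdlib.h>

/* One shared mapping, keyed by (offset, size). */
struct shared_map {
    uint64_t offset;
    size_t size;
    int refs;
    struct shared_map *next;
};

/* Hypothetical per-bus list of live mappings. */
struct map_bus {
    struct shared_map *head;
};

/* Return an existing mapping with one more reference, or create one. */
struct shared_map *map_get(struct map_bus *bus, uint64_t offset, size_t size)
{
    struct shared_map *m;

    for (m = bus->head; m; m = m->next) {
        if (m->offset == offset && m->size == size) {
            m->refs++;
            return m;
        }
    }
    m = calloc(1, sizeof(*m));
    if (!m)
        return NULL;
    m->offset = offset;
    m->size = size;
    m->refs = 1;
    m->next = bus->head;
    bus->head = m;
    return m;
}

/* Drop one reference and tear the mapping down when the last user leaves. */
void map_put(struct map_bus *bus, struct shared_map *m)
{
    struct shared_map **pp;

    if (--m->refs > 0)
        return;
    for (pp = &bus->head; *pp; pp = &(*pp)->next) {
        if (*pp == m) {
            *pp = m->next;
            break;
        }
    }
    free(m);
}
```

    Two regions that request the same range get the same object, and the
    range stays mapped until both have released it.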
     

07 Jul, 2016

1 commit


28 Jun, 2016

2 commits

  • Now that all drivers that specify a ->driverfs_dev have been converted
    to device_add_disk(), the pointer can be removed from struct gendisk.

    Cc: Jens Axboe
    Cc: Ross Zwisler
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Johannes Thumshirn
    Signed-off-by: Dan Williams

    Dan Williams
     
  • For block drivers that specify a parent device, convert them to use
    device_add_disk().

    This conversion was done with the following semantic patch:

    @@
    struct gendisk *disk;
    expression E;
    @@

    - disk->driverfs_dev = E;
    ...
    - add_disk(disk);
    + device_add_disk(E, disk);

    @@
    struct gendisk *disk;
    expression E1, E2;
    @@

    - disk->driverfs_dev = E1;
    ...
    E2 = disk;
    ...
    - add_disk(E2);
    + device_add_disk(E1, E2);

    ...plus some manual fixups for a few missed conversions.

    Cc: Jens Axboe
    Cc: Keith Busch
    Cc: Michael S. Tsirkin
    Cc: David Woodhouse
    Cc: David S. Miller
    Cc: James Bottomley
    Cc: Ross Zwisler
    Cc: Konrad Rzeszutek Wilk
    Cc: Martin K. Petersen
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Johannes Thumshirn
    Signed-off-by: Dan Williams

    Dan Williams
     

25 Jun, 2016

1 commit

  • Currently phys_to_pfn_t() is an exported symbol to allow nfit_test to
    override it and indicate that nfit_test-pmem is not device-mapped. Now,
    we want to enable nfit_test to operate without DMA_CMA and the pmem it
    provides will no longer be physically contiguous, i.e. won't be capable
    of supporting direct_access requests larger than a page. Make
    pmem_direct_access() a weak symbol so that it can be replaced by the
    tools/testing/nvdimm/ version, and move phys_to_pfn_t() to a static
    inline now that it no longer needs to be overridden.

    Acked-by: Johannes Thumshirn
    Signed-off-by: Dan Williams

    Dan Williams
     

24 Jun, 2016

1 commit

  • The updated ndctl unit tests discovered that if a pfn configuration with
    a 4K alignment is read from the namespace, that alignment will be
    ignored in favor of the default 2M alignment. The result is that the
    configuration will fail initialization with a message like:

    dax6.1: bad offset: 0x22000 dax disabled align: 0x200000

    Fix this by allowing the alignment read from the info block to override
    the default, which is 2M (not 0) in the autodetect path. This also fixes a
    similar problem with the mode and alignment settings silently being
    overwritten by the kernel when userspace has changed them. We now will
    either overwrite the info block if userspace changes the uuid or fail
    and warn if a live setting disagrees with the info block.

    Cc:
    Cc: Micah Parrish
    Cc: Toshi Kani
    Signed-off-by: Dan Williams

    Dan Williams
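
    A minimal sketch of the override, assuming that an info block
    alignment of 0 means "nothing recorded". DEFAULT_ALIGN and
    effective_align() are hypothetical names and do not reflect the
    driver's actual code.

```c
#include <stdint.h>

#define DEFAULT_ALIGN 0x200000ULL /* 2M default alignment */

/*
 * Prefer the alignment recorded in the info block. Fall back to the
 * default only when nothing was recorded (assumed to be encoded as 0).
 */
uint64_t effective_align(uint64_t info_block_align)
{
    return info_block_align ? info_block_align : DEFAULT_ALIGN;
}
```

    With a recorded 4K alignment, the offset 0x22000 from the message
    above is a multiple of the effective alignment. Against the 2M
    default it is not, which matches the reported failure.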
     

18 Jun, 2016

1 commit

  • Prompted by commit 287980e49ffc "remove lots of IS_ERR_VALUE abuses", I
    ran make coccicheck against drivers/nvdimm/ and found that:

    if (IS_ERR(x))
    return PTR_ERR(x);
    return 0;

    ...can be replaced with PTR_ERR_OR_ZERO().

    Reported-by: Linus Torvalds
    Reviewed-by: Johannes Thumshirn
    Signed-off-by: Dan Williams

    Dan Williams
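
    The equivalence can be checked with userspace stand-ins for the
    kernel's error-pointer helpers. The lowercase names below are
    illustrative re-implementations, not the kernel macros themselves,
    and MAX_ERRNO is assumed to be 4095.

```c
#include <stdbool.h>
#include <stdint.h>

#define MAX_ERRNO 4095

/* Encode a negative errno in a pointer value. */
void *err_ptr(long err)
{
    return (void *)(intptr_t)err;
}

/* True when the pointer falls in the top MAX_ERRNO addresses. */
bool is_err(const void *p)
{
    return (uintptr_t)p >= (uintptr_t)-MAX_ERRNO;
}

/* Recover the errno from an error pointer. */
long ptr_err(const void *p)
{
    return (long)(intptr_t)p;
}

/* The long form that coccicheck flags. */
long check_long_form(const void *p)
{
    if (is_err(p))
        return ptr_err(p);
    return 0;
}

/* The single-call replacement. */
long ptr_err_or_zero(const void *p)
{
    return is_err(p) ? ptr_err(p) : 0;
}
```

    Both forms return 0 for a valid pointer and the encoded errno for an
    error pointer, so the replacement is a pure simplification.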
     

16 Jun, 2016

1 commit

  • Clean up needless calls to the action routine by letting
    devm_add_action_or_reset() call it automatically. This does cause the
    disk to be registered and immediately unregistered when a memory allocation
    fails, but the block layer should be prepared for such an event.

    Reported-by: Sudip Mukherjee
    Signed-off-by: Dan Williams

    Dan Williams
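
    The "or reset" behavior can be modeled in userspace as below.
    add_action(), add_action_or_reset() and release_disk() are
    hypothetical stand-ins, and the simulate_failure parameter exists
    only to exercise the error path in this sketch.

```c
#include <stddef.h>

typedef void (*action_fn)(void *data);

/* Hypothetical registration that can fail, e.g. on memory allocation. */
int add_action(action_fn fn, void *data, int simulate_failure)
{
    (void)fn;
    (void)data;
    return simulate_failure ? -12 : 0;
}

/* On failure, run the action immediately so the caller need not clean up. */
int add_action_or_reset(action_fn fn, void *data, int simulate_failure)
{
    int rc = add_action(fn, data, simulate_failure);

    if (rc)
        fn(data);
    return rc;
}

/* Example action: mark a "disk" as no longer registered. */
void release_disk(void *data)
{
    *(int *)data = 0;
}
```

    When registration fails, the resource is released before the error
    is returned, so the caller only has to propagate the error code.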
     

27 May, 2016

1 commit

  • Pull misc DAX updates from Vishal Verma:
    "DAX error handling for 4.7

    - Until now, dax has been disabled if media errors were found on any
    device. This enables the use of DAX in the presence of these
    errors by making all sector-aligned zeroing go through the driver.

    - The driver (already) has the ability to clear errors on writes that
    are sent through the block layer using 'DSMs' defined in ACPI 6.1.

    Other misc changes:

    - When mounting DAX filesystems, check to make sure the partition is
    page aligned. This is a requirement for DAX, and previously, we
    allowed such unaligned mounts to succeed, but subsequent
    reads/writes would fail.

    - Misc/cleanup fixes from Jan that remove unused code from DAX
    related to zeroing, writeback, and some size checks"

    * tag 'dax-misc-for-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
    dax: fix a comment in dax_zero_page_range and dax_truncate_page
    dax: for truncate/hole-punch, do zeroing through the driver if possible
    dax: export a low-level __dax_zero_page_range helper
    dax: use sb_issue_zerout instead of calling dax_clear_sectors
    dax: enable dax in the presence of known media errors (badblocks)
    dax: fallback from pmd to pte on error
    block: Update blkdev_dax_capable() for consistency
    xfs: Add alignment check for DAX mount
    ext2: Add alignment check for DAX mount
    ext4: Add alignment check for DAX mount
    block: Add bdev_dax_supported() for dax mount checks
    block: Add vfs_msg() interface
    dax: Remove redundant inode size checks
    dax: Remove pointless writeback from dax_do_io()
    dax: Remove zeroing from dax_io()
    dax: Remove dead zeroing code from fault handlers
    ext2: Avoid DAX zeroing to corrupt data
    ext2: Fix block zeroing in ext2_get_blocks() for DAX
    dax: Remove complete_unwritten argument
    DAX: move RADIX_DAX_ definitions to dax.c

    Linus Torvalds
     

22 May, 2016

3 commits

  • Dan Williams
     
  • The ndctl unit tests discovered that the dax enabling omitted updates to
    nd_detach_and_reset(). This routine clears the device configuration
    when the namespace is detached. Without this clearing userspace may
    assume that the device is in the process of being configured by another
    agent in the system.

    Signed-off-by: Dan Williams

    Dan Williams
     
  • Testing the dax-device autodetect support revealed a probe failure with
    the following result:

    dax0.1: bad offset: 0x8200000 dax disabled

    The original pfn-device implementation inferred the alignment from
    ilog2(offset). Now that the alignment is explicit, the is_power_of_2()
    check needs replacing with a real sanity check against the recorded
    alignment.
    Otherwise the alignment check is useless in the implicit case and only
    the minimum size of the offset matters.

    This self-consistency check is further validated by the probe path that
    will re-check that the offset is large enough to contain all the
    metadata required to enable the device.

    Cc:
    Signed-off-by: Dan Williams

    Dan Williams
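
    The difference between the two checks can be sketched as follows.
    is_pow2(), offset_ok_implicit() and offset_ok_explicit() are
    hypothetical names, and the 2M alignment used in the note below is an
    assumption about the failing configuration.

```c
#include <stdbool.h>
#include <stdint.h>

/* True when n has exactly one bit set. */
bool is_pow2(uint64_t n)
{
    return n != 0 && (n & (n - 1)) == 0;
}

/* Implicit check: passes only when the offset itself is a power of two. */
bool offset_ok_implicit(uint64_t offset)
{
    return is_pow2(offset);
}

/* Explicit check: the offset must be a multiple of the recorded alignment. */
bool offset_ok_explicit(uint64_t offset, uint64_t align)
{
    return is_pow2(align) && (offset & (align - 1)) == 0;
}
```

    The offset 0x8200000 from the failure above is not a power of two,
    so the implicit check rejects it, yet it is a clean multiple of 2M
    and passes the explicit check.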
     

21 May, 2016

1 commit

  • For autodetecting a previously established dax configuration we need the
    info block to indicate block-device vs device-dax mode, and we need to
    have the default namespace probe hand-off the configuration to the
    dax_pmem driver.

    Signed-off-by: Dan Williams

    Dan Williams