13 Nov, 2015

2 commits


11 Nov, 2015

2 commits

  • Pull block IO poll support from Jens Axboe:
    "Various groups have been doing experimentation around IO polling for
    (really) fast devices. The code has been reviewed and has been
    sitting on the side for a few releases, but this is now good enough
    for coordinated benchmarking and further experimentation.

    Currently O_DIRECT sync read/write are supported. A framework is in
    the works that allows scalable stats tracking so we can auto-tune
    this. And we'll add libaio support as well soon. For now, it's an
    opt-in feature for test purposes"

    * 'for-4.4/io-poll' of git://git.kernel.dk/linux-block:
    direct-io: be sure to assign dio->bio_bdev for both paths
    directio: add block polling support
    NVMe: add blk polling support
    block: add block polling support
    blk-mq: return tag/queue combo in the make_request_fn handlers
    block: change ->make_request_fn() and users to return a queue cookie

    Linus Torvalds
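
For illustration only, the userspace side of the path named above (synchronous O_DIRECT reads) might look like the following sketch. The helper name direct_read_block and the 4096-byte alignment are assumptions, not taken from the pull request. Whether the kernel polls for completion is decided inside the kernel and is not visible in this code.

```c
#define _GNU_SOURCE             /* exposes O_DIRECT on Linux/glibc */
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#ifndef O_DIRECT
#define O_DIRECT 0              /* platform without O_DIRECT: plain buffered read */
#endif

#define IO_ALIGN 4096           /* assumed alignment; real devices may differ */

/* Hypothetical helper: read len bytes at offset off, bypassing the page
 * cache where the filesystem allows it. Returns bytes read, or -1. */
static ssize_t direct_read_block(const char *path, void *out, size_t len, off_t off)
{
    void *buf = NULL;
    ssize_t n;
    int fd;

    /* O_DIRECT generally requires aligned length and offset. */
    if (len == 0 || len % IO_ALIGN || off % IO_ALIGN)
        return -1;

    fd = open(path, O_RDONLY | O_DIRECT);
    if (fd < 0 && errno == EINVAL)
        fd = open(path, O_RDONLY);  /* filesystem rejects O_DIRECT */
    if (fd < 0)
        return -1;

    /* The user buffer must be aligned as well. */
    if (posix_memalign(&buf, IO_ALIGN, len) != 0) {
        close(fd);
        return -1;
    }

    n = pread(fd, buf, len, off);   /* synchronous: returns after completion */
    if (n > 0)
        memcpy(out, buf, (size_t)n);

    free(buf);
    close(fd);
    return n;
}
```

A caller would pass a block-aligned length and offset, for example direct_read_block(path, dst, 4096, 0).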
     
  • Pull libnvdimm updates from Dan Williams:
    "Outside of the new ACPI-NFIT hot-add support this pull request is more
    notable for what it does not contain, than what it does. There were a
    handful of development topics this cycle, dax get_user_pages, dax
    fsync, and raw block dax, that need more iteration and will wait
    for 4.5.

    The patches to make devm and the pmem driver NUMA aware have been in
    -next for several weeks. The hot-add support has not, but is
    contained to the NFIT driver and is passing unit tests. The coredump
    support is straightforward and was looked over by Jeff. All of it has
    received a 0day build success notification across 107 configs.

    Summary:

    - Add support for the ACPI 6.0 NFIT hot add mechanism to process
    updates of the NFIT at runtime.

    - Teach the coredump implementation how to filter out DAX mappings.

    - Introduce NUMA hints for allocations made by the pmem driver, and
    as a side effect all devm allocations now hint their NUMA node by
    default"

    * tag 'libnvdimm-for-4.4' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
    coredump: add DAX filtering for FDPIC ELF coredumps
    coredump: add DAX filtering for ELF coredumps
    acpi: nfit: Add support for hot-add
    nfit: in acpi_nfit_init, break on a 0-length table
    pmem, memremap: convert to numa aware allocations
    devm_memremap_pages: use numa_mem_id
    devm: make allocations numa aware by default
    devm_memremap: convert to return ERR_PTR
    devm_memunmap: use devres_release()
    pmem: kill memremap_pmem()
    x86, mm: quiet arch_add_memory()

    Linus Torvalds
     

08 Nov, 2015

1 commit


22 Oct, 2015

4 commits

  • The libnvdimm-btt and nvme drivers use blk_integrity to reserve space
    for per-sector metadata, but sometimes without protection checksums.
    This property is generically useful, so teach the block core to
    internally specify a nop profile if one is not provided at registration
    time.

    Cc: Keith Busch
    Cc: Matthew Wilcox
    Suggested-by: Christoph Hellwig
    [hch: kill the local nvme nop profile as well]
    Acked-by: Martin K. Petersen
    Signed-off-by: Dan Williams
    Signed-off-by: Jens Axboe

    Dan Williams
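
The fallback described in this commit can be modelled in userspace roughly as follows. This is a sketch under assumptions: the struct layouts are simplified, and integrity_register_model and nop_fn are hypothetical names rather than the kernel's definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Simplified model of the static, shareable part of a profile. */
struct blk_integrity_profile {
    const char *name;
    int (*generate_fn)(void *data, size_t len);
    int (*verify_fn)(void *data, size_t len);
};

/* Simplified model of the per-device part. */
struct blk_integrity {
    const struct blk_integrity_profile *profile;
    unsigned short tuple_size;      /* bytes of metadata per sector */
};

/* Does nothing and reports success: space is reserved, nothing is checked. */
static int nop_fn(void *data, size_t len)
{
    (void)data;
    (void)len;
    return 0;
}

static const struct blk_integrity_profile nop_profile = {
    .name = "nop",
    .generate_fn = nop_fn,
    .verify_fn = nop_fn,
};

/* Hypothetical registration helper: if the caller supplies no profile,
 * the core substitutes the shared nop profile. */
static void integrity_register_model(struct blk_integrity *bi,
                                     const struct blk_integrity *tmpl)
{
    bi->tuple_size = tmpl->tuple_size;
    bi->profile = tmpl->profile ? tmpl->profile : &nop_profile;
}
```

With this shape a driver that only needs per-sector metadata space registers a template with a NULL profile and no longer carries its own nop profile.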
     
  • Now that the integrity profile is statically allocated there is no work
    to do when shutting down an integrity enabled block device.

    Cc: Matthew Wilcox
    Cc: Mike Snitzer
    Cc: James Bottomley
    Acked-by: NeilBrown
    Acked-by: Keith Busch
    Acked-by: Vishal Verma
    Tested-by: Ross Zwisler
    Signed-off-by: Dan Williams
    Signed-off-by: Jens Axboe

    Dan Williams
     
  • Up until now the integrity profile has been dynamically allocated and
    attached to struct gendisk after the disk has been made active.

    This causes problems because NVMe devices need to register the profile
    prior to the partition table being read due to a mandatory metadata
    buffer requirement. In addition, DM jumps through hoops to deal with
    preallocating, but not initializing integrity profiles.

    Since the integrity profile is small (4 bytes + a pointer), Christoph
    suggested moving it to struct gendisk proper. This requires several
    changes:

    - Moving the blk_integrity definition to genhd.h.

    - Inlining blk_integrity in struct gendisk.

    - Removing the dynamic allocation code.

    - Adding helper functions which allow gendisk to set up and tear down
    the integrity sysfs dir when a disk is added/deleted.

    - Adding a blk_integrity_revalidate() callback for updating the stable
    pages bdi setting.

    - The calls that depend on whether a device has an integrity profile or
    not now key off of the bi->profile pointer.

    - Simplifying the integrity support routines in DM (Mike Snitzer).

    Signed-off-by: Martin K. Petersen
    Reported-by: Christoph Hellwig
    Reviewed-by: Sagi Grimberg
    Signed-off-by: Mike Snitzer
    Cc: Dan Williams
    Signed-off-by: Dan Williams
    Signed-off-by: Jens Axboe

    Martin K. Petersen
     
  • We previously made a complete copy of a device's data integrity profile
    even though several of the fields inside the blk_integrity struct are
    pointers to fixed template entries in t10-pi.c.

    Split the static and per-device portions so that we can reference the
    template directly.

    Signed-off-by: Martin K. Petersen
    Reported-by: Christoph Hellwig
    Reviewed-by: Sagi Grimberg
    Cc: Dan Williams
    Signed-off-by: Dan Williams
    Signed-off-by: Jens Axboe

    Martin K. Petersen
     

10 Oct, 2015

3 commits

  • Given that pmem ranges come with numa-locality hints, arrange for the
    resulting driver objects to be obtained from node-local memory.

    Reviewed-by: Tejun Heo
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Dan Williams

    Dan Williams
     
  • Make devm_memremap consistent with the error return scheme of
    devm_memremap_pages to remove special casing in the pmem driver.

    Cc: Christoph Hellwig
    Cc: Ross Zwisler
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Dan Williams

    Dan Williams
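
For readers unfamiliar with the ERR_PTR convention, here is a minimal userspace model of it. The three helpers mirror the kernel's well-known macros in spirit; remap_model is a hypothetical stand-in for a mapping call and is not the devm_memremap() implementation.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

#define MAX_ERRNO 4095  /* errors are encoded in the top 4095 addresses */

/* Encode a negative errno value as a pointer. */
static inline void *ERR_PTR(long error)
{
    return (void *)(intptr_t)error;
}

/* Decode the errno value from an error pointer. */
static inline long PTR_ERR(const void *ptr)
{
    return (long)(intptr_t)ptr;
}

/* True if ptr lies in the range reserved for encoded errors. */
static inline bool IS_ERR(const void *ptr)
{
    return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

/* Hypothetical mapping call: never returns NULL, so callers need a
 * single IS_ERR() check instead of separate NULL and error paths. */
static void *remap_model(size_t size)
{
    void *p;

    if (size == 0)
        return ERR_PTR(-EINVAL);
    p = malloc(size);
    return p ? p : ERR_PTR(-ENOMEM);
}
```

A caller then checks IS_ERR(addr) once and returns PTR_ERR(addr) on failure, which is the kind of special casing the commit removes from the pmem driver.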
     
  • Now that the pmem-api is defined as "a set of apis that enables access
    to WB mapped pmem", the mapping type is implied. Remove the wrapper
    and push the functionality down into the pmem driver in preparation for
    adding support for direct-mapped pmem.

    Reviewed-by: Christoph Hellwig
    Signed-off-by: Dan Williams

    Dan Williams
     

17 Sep, 2015

3 commits


09 Sep, 2015

1 commit

  • Pull libnvdimm updates from Dan Williams:
    "This update has successfully completed a 0day-kbuild run and has
    appeared in a linux-next release. The changes outside of the typical
    drivers/nvdimm/ and drivers/acpi/nfit.[ch] paths are related to the
    removal of IORESOURCE_CACHEABLE, the introduction of memremap(), and
    the introduction of ZONE_DEVICE + devm_memremap_pages().

    Summary:

    - Introduce ZONE_DEVICE and devm_memremap_pages() as a generic
    mechanism for adding device-driver-discovered memory regions to the
    kernel's direct map.

    This facility is used by the pmem driver to enable pfn_to_page()
    operations on the page frames returned by DAX ('direct_access' in
    'struct block_device_operations').

    For now, the 'memmap' allocation for these "device" pages comes
    from "System RAM". Support for allocating the memmap from device
    memory will arrive in a later kernel.

    - Introduce memremap() to replace usages of ioremap_cache() and
    ioremap_wt(). memremap() drops the __iomem annotation for these
    mappings to memory that do not have i/o side effects. The
    replacement of ioremap_cache() with memremap() is limited to the
    pmem driver to ease merging the api change in v4.3.

    Completion of the conversion is targeted for v4.4.

    - Similar to the usage of memcpy_to_pmem() + wmb_pmem() in the pmem
    driver, update the VFS DAX implementation and PMEM api to provide
    persistence guarantees for kernel operations on a DAX mapping.

    - Convert the ACPI NFIT 'BLK' driver to map the block apertures as
    cacheable to improve performance.

    - Miscellaneous updates and fixes to libnvdimm including support for
    issuing "address range scrub" commands, clarifying the optimal
    'sector size' of pmem devices, a clarification of the usage of the
    ACPI '_STA' (status) property for DIMM devices, and other minor
    fixes"

    * tag 'libnvdimm-for-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (34 commits)
    libnvdimm, pmem: direct map legacy pmem by default
    libnvdimm, pmem: 'struct page' for pmem
    libnvdimm, pfn: 'struct page' provider infrastructure
    x86, pmem: clarify that ARCH_HAS_PMEM_API implies PMEM mapped WB
    add devm_memremap_pages
    mm: ZONE_DEVICE for "device memory"
    mm: move __phys_to_pfn and __pfn_to_phys to asm/generic/memory_model.h
    dax: drop size parameter to ->direct_access()
    nd_blk: change aperture mapping from WC to WB
    nvdimm: change to use generic kvfree()
    pmem, dax: have direct_access use __pmem annotation
    dax: update I/O path to do proper PMEM flushing
    pmem: add copy_from_iter_pmem() and clear_pmem()
    pmem, x86: clean up conditional pmem includes
    pmem: remove layer when calling arch_has_wmb_pmem()
    pmem, x86: move x86 PMEM API to new pmem.h header
    libnvdimm, e820: make CONFIG_X86_PMEM_LEGACY a tristate option
    pmem: switch to devm_ allocations
    devres: add devm_memremap
    libnvdimm, btt: write and validate parent_uuid
    ...

    Linus Torvalds
     

03 Sep, 2015

1 commit

  • Pull core block updates from Jens Axboe:
    "This first core part of the block IO changes contains:

    - Cleanup of the bio IO error signaling from Christoph. We used to
    rely on the uptodate bit and passing around of an error, now we
    store the error in the bio itself.

    - Improvement of the above from myself, by shrinking the bio size
    down again to fit in two cachelines on x86-64.

    - Revert of the max_hw_sectors cap removal from a revision again,
    from Jeff Moyer. This caused performance regressions in various
    tests. Reinstate the limit, bump it to a more reasonable size
    instead.

    - Make /sys/block/<dev>/queue/discard_max_bytes writeable, by me.
    Most devices have huge trim limits, which can cause nasty latencies
    when deleting files. Enable the admin to configure the size down.
    We will look into having a more sane default instead of UINT_MAX
    sectors.

    - Improvement of the SG gaps logic from Keith Busch.

    - Enable the block core to handle arbitrarily sized bios, which
    enables a nice simplification of bio_add_page() (which is an IO hot
    path). From Kent.

    - Improvements to the partition io stats accounting, making it
    faster. From Ming Lei.

    - Also from Ming Lei, a basic fixup for overflow of the sysfs pending
    file in blk-mq, as well as a fix for a blk-mq timeout race
    condition.

    - Ming Lin has been carrying Kent's above-mentioned patches forward
    for a while, and testing them. Ming also did a few fixes around
    that.

    - Sasha Levin found and fixed a use-after-free problem introduced by
    the bio->bi_error changes from Christoph.

    - Small blk cgroup cleanup from Viresh Kumar"

    * 'for-4.3/core' of git://git.kernel.dk/linux-block: (26 commits)
    blk: Fix bio_io_vec index when checking bvec gaps
    block: Replace SG_GAPS with new queue limits mask
    block: bump BLK_DEF_MAX_SECTORS to 2560
    Revert "block: remove artifical max_hw_sectors cap"
    blk-mq: fix race between timeout and freeing request
    blk-mq: fix buffer overflow when reading sysfs file of 'pending'
    Documentation: update notes in biovecs about arbitrarily sized bios
    block: remove bio_get_nr_vecs()
    fs: use helper bio_add_page() instead of open coding on bi_io_vec
    block: kill merge_bvec_fn() completely
    md/raid5: get rid of bio_fits_rdev()
    md/raid5: split bio for chunk_aligned_read
    block: remove split code in blkdev_issue_{discard,write_same}
    btrfs: remove bio splitting and merge_bvec_fn() calls
    bcache: remove driver private bio splitting code
    block: simplify bio_add_page()
    block: make generic_make_request handle arbitrarily sized bios
    blk-cgroup: Drop unlikely before IS_ERR(_OR_NULL)
    block: don't access bio->bi_error after bio_put()
    block: shrink struct bio down to 2 cache lines again
    ...

    Linus Torvalds
     

29 Aug, 2015

3 commits

  • The expectation is that the legacy / non-standard pmem discovery method
    (e820 type-12) will only ever be used to describe small quantities of
    persistent memory. Larger capacities will be described via the ACPI
    NFIT. When "allocate struct page from pmem" support is added this default
    policy can be overridden by assigning a legacy pmem namespace to a pfn
    device, however this would only be necessary if a platform used the
    legacy mechanism to define a very large range.

    Cc: Christoph Hellwig
    Signed-off-by: Dan Williams

    Dan Williams
     
  • Enable the pmem driver to handle PFN device instances. Attaching a pmem
    namespace to a pfn device triggers the driver to allocate and initialize
    struct page entries for pmem. Memory capacity for this allocation comes
    exclusively from RAM for now which is suitable for low PMEM to RAM
    ratios. This mechanism will be expanded later for setting an "allocate
    from PMEM" policy.

    Cc: Boaz Harrosh
    Cc: Ross Zwisler
    Cc: Christoph Hellwig
    Signed-off-by: Dan Williams

    Dan Williams
     
  • Implement the base infrastructure for libnvdimm PFN devices. Similar to
    BTT devices they take a namespace as a backing device and layer
    functionality on top. In this case the functionality is reserving space
    for an array of 'struct page' entries to be handed out through
    pfn_to_page(). For now this is just the basic libnvdimm-device-model for
    configuring the base PFN device.

    As the namespace claiming mechanism for PFN devices is mostly identical
    to BTT devices drivers/nvdimm/claim.c is created to house the common
    bits.

    Cc: Ross Zwisler
    Signed-off-by: Dan Williams

    Dan Williams
     

28 Aug, 2015

4 commits

  • Given that a write-back (WB) mapping plus non-temporal stores is
    expected to be the most efficient way to access PMEM, update the
    definition of ARCH_HAS_PMEM_API to imply arch support for
    WB-mapped-PMEM. This is needed as a pre-requisite for adding PMEM to
    the direct map and mapping it with struct page.

    The above clarification for X86_64 means that memcpy_to_pmem() is
    permitted to use the non-temporal arch_memcpy_to_pmem() rather than
    needlessly fall back to default_memcpy_to_pmem() when the pcommit
    instruction is not available. When arch_memcpy_to_pmem() is not
    guaranteed to flush writes out of cache, i.e. on older X86_32
    implementations where non-temporal stores may just dirty cache,
    ARCH_HAS_PMEM_API is simply disabled.

    The default fall back for persistent memory handling remains. Namely,
    map it with the WT (write-through) cache-type and hope for the best.

    arch_has_pmem_api() is updated to only indicate whether the arch
    provides the proper helpers to meet the minimum "writes are visible
    outside the cache hierarchy after memcpy_to_pmem() + wmb_pmem()". Code
    that cares whether wmb_pmem() actually flushes writes to pmem must now
    call arch_has_wmb_pmem() directly.

    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Reviewed-by: Ross Zwisler
    [hch: set ARCH_HAS_PMEM_API=n on x86_32]
    Reviewed-by: Christoph Hellwig
    [toshi: x86_32 compile fixes]
    Signed-off-by: Toshi Kani
    Signed-off-by: Dan Williams

    Dan Williams
     
  • None of the ->direct_access() implementations currently use the size
    parameter. The common
    bdev_direct_access() entry point handles all the size checks before
    calling ->direct_access().

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Dan Williams

    Dan Williams
     
  • Dan Williams
     
  • Signed-off-by: yalin wang
    Reviewed-by: Ross Zwisler
    Signed-off-by: Dan Williams

    yalin wang
     

21 Aug, 2015

1 commit

  • Update the annotation for the kaddr pointer returned by direct_access()
    so that it is a __pmem pointer. This is consistent with the PMEM driver
    and with how this direct_access() pointer is used in the DAX code.

    Signed-off-by: Ross Zwisler
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Dan Williams

    Ross Zwisler
     

19 Aug, 2015

1 commit

  • We currently register a platform device for e820 type-12 memory and
    register a nvdimm bus beneath it. Registering the platform device
    triggers the device-core machinery to probe for a driver, but that
    search currently comes up empty. Building the nvdimm-bus registration
    into the e820_pmem platform device registration in this way forces
    libnvdimm to be built-in. Instead, convert the built-in portion of
    CONFIG_X86_PMEM_LEGACY to simply register a platform device and move the
    rest of the logic to the driver for e820_pmem, for the following
    reasons:

    1/ Letting e820_pmem support be a module allows building and testing
    libnvdimm.ko changes without rebooting

    2/ All the normal policy around modules can be applied to e820_pmem
    (unbind to disable and/or blacklisting the module from loading by
    default)

    3/ Moving the driver to a generic location and converting it to scan
    "iomem_resource" rather than "e820.map" means any other architecture can
    take advantage of this simple nvdimm resource discovery mechanism by
    registering a resource named "Persistent Memory (legacy)"

    Cc: Christoph Hellwig
    Signed-off-by: Dan Williams

    Dan Williams
     

15 Aug, 2015

4 commits

  • Signed-off-by: Christoph Hellwig
    [djbw: tools/testing/nvdimm/ and memunmap_pmem support]
    Reviewed-by: Ross Zwisler
    Signed-off-by: Dan Williams

    Christoph Hellwig
     
  • When a BTT is instantiated on a namespace it must validate the namespace
    uuid matches the 'parent_uuid' stored in the btt superblock. This
    property enforces that changing the namespace UUID invalidates all
    former BTT instances on that storage. For "IO namespaces" that don't
    have a label or UUID, the parent_uuid is set to zero, and this
    validation is skipped. For such cases, old BTTs have to be invalidated
    by forcing the namespace to raw mode, and overwriting the BTT info
    blocks.

    Based on a patch by Dan Williams

    Signed-off-by: Vishal Verma
    Signed-off-by: Dan Williams

    Vishal Verma
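
The validation rule described in this commit can be sketched as a small predicate. This is a userspace model under assumptions: the function names and the 16-byte raw UUID representation are illustrative, not the driver's actual code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define UUID_LEN 16

/* True if every byte of the UUID is zero. */
static bool uuid_is_zero(const uint8_t uuid[UUID_LEN])
{
    static const uint8_t zero[UUID_LEN];

    return memcmp(uuid, zero, UUID_LEN) == 0;
}

/* Hypothetical check: sb_parent_uuid comes from the BTT superblock,
 * ns_uuid from the namespace (NULL if the namespace has no UUID). */
static bool parent_uuid_matches(const uint8_t sb_parent_uuid[UUID_LEN],
                                const uint8_t *ns_uuid)
{
    /* A zero parent_uuid marks an "IO namespace": validation is skipped. */
    if (uuid_is_zero(sb_parent_uuid))
        return true;
    /* A non-zero parent_uuid cannot match a namespace without a UUID. */
    if (ns_uuid == NULL)
        return false;
    return memcmp(sb_parent_uuid, ns_uuid, UUID_LEN) == 0;
}
```

Changing the namespace UUID makes the comparison fail, so a BTT written under the old UUID is treated as invalid.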
     
  • Use arena_is_valid as a common routine for checking the validity of an
    info block from both discover_arenas, and nd_btt_probe.

    As a result, don't check for validity of the BTT's UUID, and lbasize.
    The checksum in the BTT info block guarantees self-consistency, and when
    we're called from nd_btt_probe, we don't have a valid uuid or lbasize
    available to check against.

    Also clean up to return a bool instead of an int.

    Signed-off-by: Vishal Verma
    Signed-off-by: Dan Williams

    Vishal Verma
     
  • Consolidate the parameters passed to arena_is_valid into just nd_btt,
    and an info block to increase re-usability.

    Similarly, btt_arena_write_layout doesn't need to be passed a uuid, as
    it can be obtained from arena->nd_btt.

    Signed-off-by: Vishal Verma
    Signed-off-by: Dan Williams

    Vishal Verma
     

01 Aug, 2015

1 commit

  • Fix multiple build warnings when CONFIG_BTT is not enabled:

    In file included from ../drivers/nvdimm/bus.c:29:0:
    ../drivers/nvdimm/nd.h:169:15: warning: return type defaults to 'int' [-Wreturn-type]
    static inline nd_btt_probe(struct nd_namespace_common *ndns, void *drvdata)
    ^

    Signed-off-by: Randy Dunlap
    Cc: Dan Williams
    Cc: linux-nvdimm@lists.01.org
    Signed-off-by: Dan Williams

    Randy Dunlap
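
The warning quoted above comes from a stub declared without a return type. A corrected stub plausibly looks like the sketch below. The -ENODEV return value is an assumption about what the disabled-BTT stub reports, and the struct is left opaque.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct nd_namespace_common;     /* opaque here; defined by the real driver */

/* The explicit 'int' removes the "return type defaults to 'int'" warning. */
static inline int nd_btt_probe(struct nd_namespace_common *ndns, void *drvdata)
{
    (void)ndns;
    (void)drvdata;
    return -ENODEV;             /* assumed result when BTT is compiled out */
}
```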
     

29 Jul, 2015

1 commit

  • Currently we have two different ways to signal an I/O error on a BIO:

    (1) by clearing the BIO_UPTODATE flag
    (2) by returning a Linux errno value to the bi_end_io callback

    The first one has the drawback of only communicating a single possible
    error (-EIO), and the second one has the drawback of not being persistent
    when bios are queued up, and are not passed along from child to parent
    bio in the ever more popular chaining scenario. Having both mechanisms
    available has the additional drawback of utterly confusing driver authors
    and introducing bugs where various I/O submitters only deal with one of
    them, and the others have to add boilerplate code to deal with both kinds
    of error returns.

    So add a new bi_error field to store an errno value directly in struct
    bio and remove the existing mechanisms to clean all this up.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Hannes Reinecke
    Reviewed-by: NeilBrown
    Signed-off-by: Jens Axboe

    Christoph Hellwig
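
To make the chaining argument concrete, here is a minimal userspace model of an error stored in the bio and carried from child to parent. Only the bi_error and bi_end_io field names come from the commit message. The remaining and parent fields and the two helper functions are hypothetical simplifications, not the block layer's code.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Userspace model of a bio that carries its own completion status. */
struct bio {
    int bi_error;               /* 0 on success, negative errno on failure */
    int remaining;              /* completions still outstanding */
    struct bio *parent;         /* NULL for a top-level bio */
    void (*bi_end_io)(struct bio *bio);
};

/* Chain child to parent: the parent cannot finish before the child. */
static void bio_chain_model(struct bio *child, struct bio *parent)
{
    child->parent = parent;
    parent->remaining++;
}

/* Drop one outstanding completion. When the last one is gone, call the
 * owner's callback and hand any error up to the parent. */
static void bio_endio_model(struct bio *bio)
{
    while (bio) {
        struct bio *parent;

        assert(bio->remaining > 0);
        if (--bio->remaining > 0)
            return;
        if (bio->bi_end_io)
            bio->bi_end_io(bio);    /* callback reads bio->bi_error itself */
        parent = bio->parent;
        if (parent && bio->bi_error && !parent->bi_error)
            parent->bi_error = bio->bi_error;   /* first error is kept */
        bio = parent;
    }
}
```

Because the status lives in the bio, a specific errno such as -ENOSPC survives queuing and chaining instead of collapsing to -EIO.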
     

28 Jul, 2015

2 commits

  • Based on a patch: c8fa317 brd: Request from fdisk 4k alignment by Boaz
    Harrosh, allow fdisk to create properly aligned partitions for DAX. This
    will also cause mkfs.ext4 to emit a warning if using a file system block
    size of less than PAGE_SIZE.

    Cc: Dan Williams
    Cc: Ross Zwisler
    Cc: Matthew Wilcox
    Cc: Christoph Hellwig
    Cc: Elliott, Robert
    Signed-off-by: Vishal Verma
    Acked-by: Boaz Harrosh
    Acked-by: Ross Zwisler
    Signed-off-by: Dan Williams

    Vishal Verma
     
  • Fix:
    drivers/nvdimm/btt.c:635:29: warning: restricted __le64 degrades to integer

    Signed-off-by: Dan Williams

    Dan Williams
     

26 Jul, 2015

1 commit

  • A new BLK namespace "seed" device is created whenever the current seed
    is successfully probed. However, if that namespace is assigned to a BTT
    it may never directly experience a successful probe as it is a
    subordinate device to a BTT configuration.

    The effect of the current code is that no new namespaces can be
    instantiated, after the seed namespace, to consume available BLK DPA
    capacity. Fix this by treating a successful BTT probe event as a
    successful probe event for the backing namespace.

    Reported-by: Nicholas Moulin
    Signed-off-by: Dan Williams

    Dan Williams
     

01 Jul, 2015

2 commits


26 Jun, 2015

3 commits

  • Based on an original patch by Ross Zwisler [1].

    Writes to persistent memory have the potential to be posted to cpu
    cache, cpu write buffers, and platform write buffers (memory controller)
    before being committed to persistent media. Provide apis,
    memcpy_to_pmem(), wmb_pmem(), and memremap_pmem(), to write data to
    pmem and assert that it is durable in PMEM (a persistent linear address
    range). A '__pmem' attribute is added so sparse can track proper usage
    of pointers to pmem.

    This continues the status quo of pmem being x86 only for 4.2, but
    reworks to ioremap, and wider implementation of memremap() will enable
    other archs in 4.3.

    [1]: https://lists.01.org/pipermail/linux-nvdimm/2015-May/000932.html

    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Signed-off-by: Ross Zwisler
    [djbw: various reworks]
    Signed-off-by: Dan Williams

    Ross Zwisler
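
As a loose userspace analogy for the "copy, then make durable" pattern, the sketch below writes through a shared file mapping and then waits for the data to reach the backing file. This is not the kernel API: memcpy() and msync() merely stand in for the roles of memcpy_to_pmem() and wmb_pmem(), and durable_write_model is a hypothetical name.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Copy len bytes to offset off of the file at path through a shared
 * mapping, then block until the update has been written back.
 * Returns 0 on success, -1 on failure. */
static int durable_write_model(const char *path, size_t off,
                               const void *src, size_t len)
{
    struct stat st;
    char *base;
    int rc = -1;
    int fd = open(path, O_RDWR);

    if (fd < 0)
        return -1;
    /* Reject empty files and writes that would run past the end. */
    if (fstat(fd, &st) != 0 || st.st_size <= 0 ||
        off > (size_t)st.st_size || len > (size_t)st.st_size - off)
        goto out_close;

    base = mmap(NULL, (size_t)st.st_size, PROT_READ | PROT_WRITE,
                MAP_SHARED, fd, 0);
    if (base == MAP_FAILED)
        goto out_close;

    memcpy(base + off, src, len);                   /* role of memcpy_to_pmem() */
    rc = msync(base, (size_t)st.st_size, MS_SYNC);  /* role of wmb_pmem() */
    munmap(base, (size_t)st.st_size);
out_close:
    close(fd);
    return rc;
}
```

The point of the analogy is the ordering: the store alone does not imply durability, and the caller must issue the second step before treating the data as persistent.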
     
  • Add support for sysfs 'numa_node' to I/O-related NVDIMM devices
    under /sys/bus/nd/devices, regionN, namespaceN.0, and bttN.x.

    An example of numa_node values on a 2-socket system with a single
    NVDIMM range on each socket is shown below.
    /sys/bus/nd/devices
    |-- btt0.0/numa_node:0
    |-- btt1.0/numa_node:1
    |-- btt1.1/numa_node:1
    |-- namespace0.0/numa_node:0
    |-- namespace1.0/numa_node:1
    |-- region0/numa_node:0
    |-- region1/numa_node:1

    These numa_node files are then linked under the block class of
    their device names.
    /sys/class/block/pmem0/device/numa_node:0
    /sys/class/block/pmem1s/device/numa_node:1

    This enables numactl(8) to accept 'block:' and 'file:' paths of
    pmem and btt devices as shown in the examples below.
    numactl --preferred block:pmem0 --show
    numactl --preferred file:/dev/pmem1s --show

    Signed-off-by: Toshi Kani
    Signed-off-by: Dan Williams

    Toshi Kani
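
A program that wants the same information without numactl could read the attribute directly. The sketch below assumes the attribute file holds a single decimal integer, as in the listing above; read_numa_node_attr is a hypothetical helper name.

```c
#include <assert.h>
#include <stdio.h>

/* Read a numa_node attribute such as
 * /sys/class/block/pmem0/device/numa_node into *node.
 * Returns 0 on success, -1 if the file cannot be opened or parsed.
 * The value is returned through *node so that a negative node id
 * cannot be confused with a read failure. */
static int read_numa_node_attr(const char *attr_path, int *node)
{
    FILE *f;
    int value;
    int rc = -1;

    assert(attr_path != NULL && node != NULL);
    f = fopen(attr_path, "r");
    if (!f)
        return -1;
    if (fscanf(f, "%d", &value) == 1) {
        *node = value;
        rc = 0;
    }
    fclose(f);
    return rc;
}
```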
     
  • ACPI NFIT table has System Physical Address Range Structure entries that
    describe a proximity ID of each range when ACPI_NFIT_PROXIMITY_VALID is
    set in the flags.

    Change acpi_nfit_register_region() to map a proximity ID to its node ID,
    and set it to a new numa_node field of nd_region_desc, which is then
    conveyed to the nd_region device.

    The device core arranges for btt and namespace devices to inherit their
    node from their parent region.

    Signed-off-by: Toshi Kani
    [djbw: move set_dev_node() from region.c to bus.c]
    Signed-off-by: Dan Williams

    Toshi Kani