14 Oct, 2020

1 commit

  • The 'struct resource' in 'struct dev_pagemap' is only used for holding
    resource span information. The other fields, 'name', 'flags', 'desc',
    'parent', 'sibling', and 'child' are all unused wasted space.

    This is in preparation for introducing a multi-range extension of
    devm_memremap_pages().

    The bulk of this change is unwinding all the places internal to libnvdimm
    that used 'struct resource' unnecessarily, and replacing instances of
    'struct dev_pagemap'.res with 'struct dev_pagemap'.range.

    P2PDMA had a minor usage of the resource flags field, but only to report
    failures with "%pR". That is replaced with an open coded print of the
    range.

    [dan.carpenter@oracle.com: mm/hmm/test: use after free in dmirror_allocate_chunk()]
    Link: https://lkml.kernel.org/r/20200926121402.GA7467@kadam

    Signed-off-by: Dan Williams
    Signed-off-by: Dan Carpenter
    Signed-off-by: Andrew Morton
    Reviewed-by: Boris Ostrovsky [xen]
    Cc: Paul Mackerras
    Cc: Michael Ellerman
    Cc: Benjamin Herrenschmidt
    Cc: Vishal Verma
    Cc: Vivek Goyal
    Cc: Dave Jiang
    Cc: Ben Skeggs
    Cc: David Airlie
    Cc: Daniel Vetter
    Cc: Ira Weiny
    Cc: Bjorn Helgaas
    Cc: Juergen Gross
    Cc: Stefano Stabellini
    Cc: "Jérôme Glisse"
    Cc: Andy Lutomirski
    Cc: Ard Biesheuvel
    Cc: Ard Biesheuvel
    Cc: Borislav Petkov
    Cc: Brice Goglin
    Cc: Catalin Marinas
    Cc: Dave Hansen
    Cc: David Hildenbrand
    Cc: Greg Kroah-Hartman
    Cc: "H. Peter Anvin"
    Cc: Hulk Robot
    Cc: Ingo Molnar
    Cc: Jason Gunthorpe
    Cc: Jason Yan
    Cc: Jeff Moyer
    Cc: Jia He
    Cc: Joao Martins
    Cc: Jonathan Cameron
    Cc: kernel test robot
    Cc: Mike Rapoport
    Cc: Pavel Tatashin
    Cc: Peter Zijlstra
    Cc: "Rafael J. Wysocki"
    Cc: Randy Dunlap
    Cc: Thomas Gleixner
    Cc: Tom Lendacky
    Cc: Wei Yang
    Cc: Will Deacon
    Link: https://lkml.kernel.org/r/159643103173.4062302.768998885691711532.stgit@dwillia2-desk3.amr.corp.intel.com
    Link: https://lkml.kernel.org/r/160106115761.30709.13539840236873663620.stgit@dwillia2-desk3.amr.corp.intel.com
    Signed-off-by: Linus Torvalds

    Dan Williams
     

02 Sep, 2020

1 commit

  • The nvdimm block driver abuse revalidate_disk in a strange way, and
    totally unrelated to what other drivers do. Simplify this by just
    calling nvdimm_revalidate_disk (which seems rather misnamed) from the
    probe routines, as the additional bdev size revalidation is pointless
    at this point, and remove the revalidate_disk methods given that
    it can only be triggered from add_disk, which is right before the
    manual calls.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Josef Bacik
    Reviewed-by: Johannes Thumshirn
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

27 May, 2020

1 commit


03 Apr, 2020

1 commit

  • - Introduce 'zero_page_range' as a dax operation. This facilitates
    filesystem-dax operation without a block-device.

    - Advertise a persistence-domain for of_pmem and papr_scm. The
    persistence domain indicates where cpu-store cycles need to reach in
    the platform-memory subsystem before the platform will consider them
    power-fail protected.

    - Fixup some flexible-array declarations.

    Dan Williams
     

31 Mar, 2020

1 commit

  • The current codebase makes use of the zero-length array language
    extension to the C90 standard, but the preferred mechanism to declare
    variable-length types such as these ones is a flexible array member[1][2],
    introduced in C99:

    struct foo {
    int stuff;
    struct boo array[];
    };

    By making use of the mechanism above, we will get a compiler warning
    in case the flexible array does not occur last in the structure, which
    will help us prevent some kind of undefined behavior bugs from being
    inadvertently introduced[3] to the codebase from now on.

    Also, notice that, dynamic memory allocations won't be affected by
    this change:

    "Flexible array members have incomplete type, and so the sizeof operator
    may not be applied. As a quirk of the original implementation of
    zero-length arrays, sizeof evaluates to zero."[1]

    This issue was found with the help of Coccinelle.

    [1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
    [2] https://github.com/KSPP/linux/issues/21
    [3] commit 76497732932f ("cxgb3/l2t: Fix undefined behaviour")

    Signed-off-by: Gustavo A. R. Silva
    Link: https://lore.kernel.org/r/20200319230937.GA16648@embeddedor.com
    Signed-off-by: Dan Williams

    Gustavo A. R. Silva
     

18 Mar, 2020

2 commits

  • The align attribute applies an alignment constraint for namespace
    creation in a region. Whereas the 'align' attribute of a namespace
    applied alignment padding via an info block, the 'align' attribute
    applies alignment constraints to the free space allocation.

    The default for 'align' is the maximum known memremap_compat_align()
    across all archs (16MiB from PowerPC at time of writing) multiplied by
    the number of interleave ways if there is blk-aliasing. The minimum is
    PAGE_SIZE and allows for the creation of cross-arch incompatible
    namespaces, just as previous kernels allowed, but the expectation is
    cross-arch and mode-independent compatibility by default.

    The regression risk with this change is limited to cases that were
    dependent on the ability to create unaligned namespaces, *and* for some
    reason are unable to opt-out of aligned namespaces by writing to
    'regionX/align'. If such a scenario arises the default can be flipped
    from opt-out to opt-in of compat-aligned namespace creation, but that is
    a last resort. The kernel will otherwise continue to support existing
    defined misaligned namespaces.

    Unfortunately this change needs to touch several parts of the
    implementation at once:

    - region/available_size: expand busy extents to current align
    - region/max_available_extent: expand busy extents to current align
    - namespace/size: trim free space to current align

    ...to keep the free space accounting conforming to the dynamic align
    setting.

    Reported-by: Aneesh Kumar K.V
    Reported-by: Jeff Moyer
    Reviewed-by: Aneesh Kumar K.V
    Reviewed-by: Jeff Moyer
    Link: https://lore.kernel.org/r/158041478371.3889308.14542630147672668068.stgit@dwillia2-desk3.amr.corp.intel.com
    Signed-off-by: Dan Williams

    Dan Williams
     
  • The NDD_ALIASING flag is used to indicate where pmem capacity might
    alias with blk capacity and require labeling. It is also used to
    indicate whether the DIMM supports labeling. Separate this latter
    capability into its own flag so that the NDD_ALIASING flag is scoped to
    true aliased configurations.

    To my knowledge aliased configurations only exist in the ACPI spec,
    there are no known platforms that ship this support in production.

    This clarity allows namespace-capacity alignment constraints around
    interleave-ways to be relaxed.

    Cc: Vishal Verma
    Cc: Oliver O'Halloran
    Reviewed-by: Jeff Moyer
    Reviewed-by: Aneesh Kumar K.V
    Link: https://lore.kernel.org/r/158041477856.3889308.4212605617834097674.stgit@dwillia2-desk3.amr.corp.intel.com
    Signed-off-by: Dan Williams

    Dan Williams
     

20 Nov, 2019

2 commits

  • A 'struct device_type' instance can carry default attributes for the
    device. Use this facility to remove the export of
    nvdimm_bus_attribute_group and put the responsibility on the core rather
    than leaf implementations to define this attribute.

    Cc: Ira Weiny
    Cc: Michael Ellerman
    Cc: "Oliver O'Halloran"
    Cc: Vishal Verma
    Cc: Aneesh Kumar K.V
    Signed-off-by: Dan Williams
    Reviewed-by: Aneesh Kumar K.V
    Link: https://lore.kernel.org/r/157309903815.1582359.6418211876315050283.stgit@dwillia2-desk3.amr.corp.intel.com

    Dan Williams
     
  • A 'struct device_type' instance can carry default attributes for the
    device. Use this facility to remove the export of
    nd_numa_attribute_group and put the responsibility on the core rather
    than leaf implementations to define this attribute.

    Cc: Ira Weiny
    Cc: Michael Ellerman
    Cc: "Oliver O'Halloran"
    Cc: Vishal Verma
    Cc: Aneesh Kumar K.V
    Reviewed-by: Aneesh Kumar K.V
    Link: https://lore.kernel.org/r/157401269537.43284.14411189404186877352.stgit@dwillia2-desk3.amr.corp.intel.com
    Signed-off-by: Dan Williams

    Dan Williams
     

18 Nov, 2019

2 commits

  • A 'struct device_type' instance can carry default attributes for the
    device. Use this facility to remove the export of
    nd_device_attribute_group and put the responsibility on the core rather
    than leaf implementations to define this attribute.

    For regions this creates a new nd_region_attribute_groups[] added to the
    per-region device-type instances.

    Cc: Ira Weiny
    Cc: Michael Ellerman
    Cc: "Oliver O'Halloran"
    Cc: Vishal Verma
    Cc: Aneesh Kumar K.V
    Reviewed-by: Aneesh Kumar K.V
    Link: https://lore.kernel.org/r/157309901138.1582359.12909354140826530394.stgit@dwillia2-desk3.amr.corp.intel.com
    Signed-off-by: Dan Williams

    Dan Williams
     
  • Statically initialize the attribute groups for each libnvdimm
    device_type. This is a preparation step for removing unnecessary exports
    of attributes that can be included in the device_type by default.

    Also take the opportunity to mark 'struct device_type' instances const.

    Cc: Ira Weiny
    Cc: Vishal Verma
    Signed-off-by: Dan Williams
    Reviewed-by: Aneesh Kumar K.V
    Link: https://lore.kernel.org/r/157309900111.1582359.2445687530383470348.stgit@dwillia2-desk3.amr.corp.intel.com

    Dan Williams
     

15 Nov, 2019

1 commit

  • The nvdimm core currently maps the full namespace to an ioremap range
    while probing the namespace mode. This can result in probe failures on
    architectures that have limited ioremap space.

    For example, with a large btt namespace that consumes most of I/O remap
    range, depending on the sequence of namespace initialization, the user
    can find a pfn namespace initialization failure due to unavailable I/O
    remap space which nvdimm core uses for temporary mapping.

    nvdimm core can avoid this failure by only mapping the reserved info
    block area to check for pfn superblock type and map the full namespace
    resource only before using the namespace.

    Given that personalities like BTT can be layered on top of any namespace
    type create a generic form of devm_nsio_enable (devm_namespace_enable)
    and use it inside the per-personality attach routines. Now
    devm_namespace_enable() is always paired with disable unless the mapping
    is going to be used for long term runtime access.

    Signed-off-by: Aneesh Kumar K.V
    Link: https://lore.kernel.org/r/20191017073308.32645-1-aneesh.kumar@linux.ibm.com
    [djbw: reworks to move devm_namespace_{en,dis}able into *attach helpers]
    Reported-by: kbuild test robot
    Link: https://lore.kernel.org/r/20191031105741.102793-2-aneesh.kumar@linux.ibm.com
    Signed-off-by: Dan Williams

    Aneesh Kumar K.V
     

25 Sep, 2019

1 commit

  • Allow arch to provide the supported alignments and use hugepage alignment only
    if we support hugepage. Right now we depend on compile time configs whereas this
    patch switch this to runtime discovery.

    Architectures like ppc64 can have THP enabled in code, but then can have
    hugepage size disabled by the hypervisor. This allows us to create dax devices
    with PAGE_SIZE alignment in this case.

    Existing dax namespace with alignment larger than PAGE_SIZE will fail to
    initialize in this specific case. We still allow fsdax namespace initialization.

    With respect to identifying whether to enable hugepage fault for a dax device,
    if THP is enabled during compile, we default to taking hugepage fault and in dax
    fault handler if we find the fault size > alignment we retry with PAGE_SIZE
    fault size.

    This also addresses the below failure scenario on ppc64

    ndctl create-namespace --mode=devdax | grep align
    "align":16777216,
    "align":16777216

    cat /sys/devices/ndbus0/region0/dax0.0/supported_alignments
    65536 16777216

    daxio.static-debug -z -o /dev/dax0.0
    Bus error (core dumped)

    $ dmesg | tail
    lpar: Failed hash pte insert with error -4
    hash-mmu: mm: Hashing failure ! EA=0x7fff17000000 access=0x8000000000000006 current=daxio
    hash-mmu: trap=0x300 vsid=0x22cb7a3 ssize=1 base psize=2 psize 10 pte=0xc000000501002b86
    daxio[3860]: bus error (7) at 7fff17000000 nip 7fff973c007c lr 7fff973bff34 code 2 in libpmem.so.1.0.0[7fff973b0000+20000]
    daxio[3860]: code: 792945e4 7d494b78 e95f0098 7d494b78 f93f00a0 4800012c e93f0088 f93f0120
    daxio[3860]: code: e93f00a0 f93f0128 e93f0120 e95f0128 e93f0088 39290008 f93f0110

    The failure was due to guest kernel using wrong page size.

    The namespaces created with 16M alignment will appear as below on a config with
    16M page size disabled.

    $ ndctl list -Ni
    [
    {
    "dev":"namespace0.1",
    "mode":"fsdax",
    "map":"dev",
    "size":5351931904,
    "uuid":"fc6e9667-461a-4718-82b4-69b24570bddb",
    "align":16777216,
    "blockdev":"pmem0.1",
    "supported_alignments":[
    65536
    ]
    },
    {
    "dev":"namespace0.0",
    "mode":"fsdax",
    Signed-off-by: Aneesh Kumar K.V
    Link: https://lore.kernel.org/r/20190905154603.10349-8-aneesh.kumar@linux.ibm.com
    Signed-off-by: Dan Williams

    Aneesh Kumar K.V
     

06 Sep, 2019

1 commit

  • Namespaces created with PFN_MODE_PMEM mode stores struct page in the reserve
    block area. We need to make sure we account for the right struct page
    size while doing this. Instead of directly depending on sizeof(struct page)
    which can change based on different kernel config option, use the max struct
    page size (64) while calculating the reserve block area. This makes sure pmem
    device can be used across kernels built with different configs.

    If the above assumption of max struct page size change, we need to update the
    reserve block allocation space for new namespaces created.

    Signed-off-by: Aneesh Kumar K.V
    Link: https://lore.kernel.org/r/20190905154603.10349-4-aneesh.kumar@linux.ibm.com
    Signed-off-by: Dan Williams

    Aneesh Kumar K.V
     

06 Jul, 2019

1 commit

  • This patch adds functionality to perform flush from guest
    to host over VIRTIO. We are registering a callback based
    on 'nd_region' type. virtio_pmem driver requires this special
    flush function. For rest of the region types we are registering
    existing flush function. Report error returned by host fsync
    failure to userspace.

    Signed-off-by: Pankaj Gupta
    Signed-off-by: Dan Williams

    Pankaj Gupta
     

05 Jun, 2019

1 commit

  • Based on 1 normalized pattern(s):

    this program is free software you can redistribute it and or modify
    it under the terms of version 2 of the gnu general public license as
    published by the free software foundation this program is
    distributed in the hope that it will be useful but without any
    warranty without even the implied warranty of merchantability or
    fitness for a particular purpose see the gnu general public license
    for more details

    extracted by the scancode license scanner the SPDX license identifier

    GPL-2.0-only

    has been chosen to replace the boilerplate/reference in 64 file(s).

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Alexios Zavras
    Reviewed-by: Allison Randal
    Cc: linux-spdx@vger.kernel.org
    Link: https://lkml.kernel.org/r/20190529141901.894819585@linutronix.de
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     

01 May, 2019

1 commit

  • Users have reported intermittent occurrences of DIMM initialization
    failures due to duplicate allocations of address capacity detected in
    the labels, or errors of the form below, both have the same root cause.

    nd namespace1.4: failed to track label: 0
    WARNING: CPU: 17 PID: 1381 at drivers/nvdimm/label.c:863

    RIP: 0010:__pmem_label_update+0x56c/0x590 [libnvdimm]
    Call Trace:
    ? nd_pmem_namespace_label_update+0xd6/0x160 [libnvdimm]
    nd_pmem_namespace_label_update+0xd6/0x160 [libnvdimm]
    uuid_store+0x17e/0x190 [libnvdimm]
    kernfs_fop_write+0xf0/0x1a0
    vfs_write+0xb7/0x1b0
    ksys_write+0x57/0xd0
    do_syscall_64+0x60/0x210

    Unfortunately those reports were typically with a busy parallel
    namespace creation / destruction loop making it difficult to see the
    components of the bug. However, Jane provided a simple reproducer using
    the work-in-progress sub-section implementation.

    When ndctl is reconfiguring a namespace it may take an existing defunct
    / disabled namespace and reconfigure it with a new uuid and other
    parameters. Critically namespace_update_uuid() takes existing address
    resources and renames them for the new namespace to use / reconfigure as
    it sees fit. The bug is that this rename only happens in the resource
    tracking tree. Existing labels with the old uuid are not reaped leading
    to a scenario where multiple active labels reference the same span of
    address range.

    Teach namespace_update_uuid() to flag any references to the old uuid for
    reaping at the next label update attempt.

    Cc:
    Fixes: bf9bccc14c05 ("libnvdimm: pmem label sets and namespace instantiation")
    Link: https://github.com/pmem/ndctl/issues/91
    Reported-by: Jane Chu
    Reported-by: Jeff Moyer
    Reported-by: Erwin Tsaur
    Cc: Johannes Thumshirn
    Signed-off-by: Dan Williams

    Dan Williams
     

17 Mar, 2019

1 commit

  • Pull device-dax updates from Dan Williams:
    "New device-dax infrastructure to allow persistent memory and other
    "reserved" / performance differentiated memories, to be assigned to
    the core-mm as "System RAM".

    Some users want to use persistent memory as additional volatile
    memory. They are willing to cope with potential performance
    differences, for example between DRAM and 3D Xpoint, and want to use
    typical Linux memory management apis rather than a userspace memory
    allocator layered over an mmap() of a dax file. The administration
    model is to decide how much Persistent Memory (pmem) to use as System
    RAM, create a device-dax-mode namespace of that size, and then assign
    it to the core-mm. The rationale for device-dax is that it is a
    generic memory-mapping driver that can be layered over any "special
    purpose" memory, not just pmem. On subsequent boots udev rules can be
    used to restore the memory assignment.

    One implication of using pmem as RAM is that mlock() no longer keeps
    data off persistent media. For this reason it is recommended to enable
    NVDIMM Security (previously merged for 5.0) to encrypt pmem contents
    at rest. We considered making this recommendation an actively enforced
    requirement, but in the end decided to leave it as a distribution /
    administrator policy to allow for emulation and test environments that
    lack security capable NVDIMMs.

    Summary:

    - Replace the /sys/class/dax device model with /sys/bus/dax, and
    include a compat driver so distributions can opt-in to the new ABI.

    - Allow for an alternative driver for the device-dax address-range

    - Introduce the 'kmem' driver to hotplug / assign a device-dax
    address-range to the core-mm.

    - Arrange for the device-dax target-node to be onlined so that the
    newly added memory range can be uniquely referenced by numa apis"

    NOTE! I'm not entirely happy with the whole "PMEM as RAM" model because
    we currently have special - and very annoying rules in the kernel about
    accessing PMEM only with the "MC safe" accessors, because machine checks
    inside the regular repeat string copy functions can be fatal in some
    (not described) circumstances.

    And apparently the PMEM modules can cause that a lot more than regular
    RAM. The argument is that this happens because PMEM doesn't necessarily
    get scrubbed at boot like RAM does, but that is planned to be added for
    the user space tooling.

    Quoting Dan from another email:
    "The exposure can be reduced in the volatile-RAM case by scanning for
    and clearing errors before it is onlined as RAM. The userspace tooling
    for that can be in place before v5.1-final. There's also runtime
    notifications of errors via acpi_nfit_uc_error_notify() from
    background scrubbers on the DIMM devices. With that mechanism the
    kernel could proactively clear newly discovered poison in the volatile
    case, but that would be additional development more suitable for v5.2.

    I understand the concern, and the need to highlight this issue by
    tapping the brakes on feature development, but I don't see PMEM as RAM
    making the situation worse when the exposure is also there via DAX in
    the PMEM case. Volatile-RAM is arguably a safer use case since it's
    possible to repair pages where the persistent case needs active
    application coordination"

    * tag 'devdax-for-5.1' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
    device-dax: "Hotplug" persistent memory for use like normal RAM
    mm/resource: Let walk_system_ram_range() search child resources
    mm/memory-hotplug: Allow memory resources to be children
    mm/resource: Move HMM pr_debug() deeper into resource code
    mm/resource: Return real error codes from walk failures
    device-dax: Add a 'modalias' attribute to DAX 'bus' devices
    device-dax: Add a 'target_node' attribute
    device-dax: Auto-bind device after successful new_id
    acpi/nfit, device-dax: Identify differentiated memory with a unique numa-node
    device-dax: Add /sys/class/dax backwards compatibility
    device-dax: Add support for a dax override driver
    device-dax: Move resource pinning+mapping into the common driver
    device-dax: Introduce bus + driver model
    device-dax: Start defining a dax bus model
    device-dax: Remove multi-resource infrastructure
    device-dax: Kill dax_region base
    device-dax: Kill dax_region ida

    Linus Torvalds
     

22 Jan, 2019

1 commit

  • The following warning:

    ACPI0012:00: security event setup failed: -19

    ...is meant to capture exceptional failures of sysfs_get_dirent(),
    however it will also fail in the common case when security support is
    disabled. A few issues:

    1/ A dev_warn() report for a common case is too chatty
    2/ The setup of this notifier is generic, no need for it to be driven
    from the nfit driver, it can exist completely in the core.
    3/ If it fails for any reason besides security support being disabled,
    that's fatal and should abort DIMM activation. Userspace may hang if
    it never gets overwrite notifications.
    4/ The dirent needs to be released.

    Move the call to the core 'dimm' driver, make it conditional on security
    support being active, make it fatal for the exceptional case, add the
    missing sysfs_put() at device disable time.

    Fixes: 7d988097c546 ("...Add security DSM overwrite support")
    Reviewed-by: Dave Jiang
    Signed-off-by: Dan Williams

    Dan Williams
     

07 Jan, 2019

1 commit

  • Persistent memory, as described by the ACPI NFIT (NVDIMM Firmware
    Interface Table), is the first known instance of a memory range
    described by a unique "target" proximity domain. Where "initiator" and
    "target" proximity domains is an approach that the ACPI HMAT
    (Heterogeneous Memory Attributes Table) uses to described the unique
    performance properties of a memory range relative to a given initiator
    (e.g. CPU or DMA device).

    Currently the numa-node for a /dev/pmemX block-device or /dev/daxX.Y
    char-device follows the traditional notion of 'numa-node' where the
    attribute conveys the closest online numa-node. That numa-node attribute
    is useful for cpu-binding and memory-binding processes *near* the
    device. However, when the memory range backing a 'pmem', or 'dax' device
    is onlined (memory hot-add) the memory-only-numa-node representing that
    address needs to be differentiated from the set of online nodes. In
    other words, the numa-node association of the device depends on whether
    you can bind processes *near* the cpu-numa-node in the offline
    device-case, or bind process *on* the memory-range directly after the
    backing address range is onlined.

    Allow for the case that platform firmware describes persistent memory
    with a unique proximity domain, i.e. when it is distinct from the
    proximity of DRAM and CPUs that are on the same socket. Plumb the Linux
    numa-node translation of that proximity through the libnvdimm region
    device to namespaces that are in device-dax mode. With this in place the
    proposed kmem driver [1] can optionally discover a unique numa-node
    number for the address range as it transitions the memory from an
    offline state managed by a device-driver to an online memory range
    managed by the core-mm.

    [1]: https://lore.kernel.org/lkml/20181022201317.8558C1D8@viggo.jf.intel.com

    Reported-by: Fan Du
    Cc: Michael Ellerman
    Cc: "Oliver O'Halloran"
    Cc: Dave Hansen
    Cc: Jérôme Glisse
    Reviewed-by: Yang Shi
    Signed-off-by: Dan Williams

    Dan Williams
     

14 Dec, 2018

1 commit

  • Add support to unlock the dimm via the kernel key management APIs. The
    passphrase is expected to be pulled from userspace through keyutils.
    The key management and sysfs attributes are libnvdimm generic.

    Encrypted keys are used to protect the nvdimm passphrase at rest. The
    master key can be a trusted-key sealed in a TPM, preferred, or an
    encrypted-key, more flexible, but more exposure to a potential attacker.

    Signed-off-by: Dave Jiang
    Co-developed-by: Dan Williams
    Reported-by: Randy Dunlap
    Signed-off-by: Dan Williams

    Dave Jiang
     

12 Oct, 2018

1 commit

  • This patch splits the initialization of the label data into two functions.
    One for doing the init, and another for reading the actual configuration
    data. The idea behind this is that by doing this we create a symmetry
    between the getting and setting of config data in that we have a function
    for both. In addition it will make it easier for us to identify the bits
    that are related to init versus the pieces that are a wrapper for reading
    data from the ACPI interface.

    So for example by splitting things out like this it becomes much more
    obvious that we were performing checks that weren't necessarily related to
    the set/get operations such as relying on ndd->data being present when the
    set and get ops should not care about a locally cached copy of the label
    area.

    Reviewed-by: Toshi Kani
    Signed-off-by: Alexander Duyck
    Signed-off-by: Dan Williams

    Alexander Duyck
     

26 Aug, 2018

1 commit

  • Pull libnvdimm updates from Dave Jiang:
    "Collection of misc libnvdimm patches for 4.19 submission:

    - Adding support to read locked nvdimm capacity.

    - Change test code to make DSM failure code injection an override.

    - Add support for calculate maximum contiguous area for namespace.

    - Add support for queueing a short ARS when there is on going ARS for
    nvdimm.

    - Allow NULL to be passed in to ->direct_access() for kaddr and pfn
    params.

    - Improve smart injection support for nvdimm emulation testing.

    - Fix test code that supports for emulating controller temperature.

    - Fix hang on error before devm_memremap_pages()

    - Fix a bug that causes user memory corruption when data returned to
    user for ars_status.

    - Maintainer updates for Ross Zwisler emails and adding Jan Kara to
    fsdax"

    * tag 'libnvdimm-for-4.19_misc' of gitolite.kernel.org:pub/scm/linux/kernel/git/nvdimm/nvdimm:
    libnvdimm: fix ars_status output length calculation
    device-dax: avoid hang on error before devm_memremap_pages()
    tools/testing/nvdimm: improve emulation of smart injection
    filesystem-dax: Do not request kaddr and pfn when not required
    md/dm-writecache: Don't request pointer dummy_addr when not required
    dax/super: Do not request a pointer kaddr when not required
    tools/testing/nvdimm: kaddr and pfn can be NULL to ->direct_access()
    s390, dcssblk: kaddr and pfn can be NULL to ->direct_access()
    libnvdimm, pmem: kaddr and pfn can be NULL to ->direct_access()
    acpi/nfit: queue issuing of ars when an uc error notification comes in
    libnvdimm: Export max available extent
    libnvdimm: Use max contiguous area for namespace size
    MAINTAINERS: Add Jan Kara for filesystem DAX
    MAINTAINERS: update Ross Zwisler's email address
    tools/testing/nvdimm: Fix support for emulating controller temperature
    tools/testing/nvdimm: Make DSM failure code injection an override
    acpi, nfit: Prefer _DSM over _LSR for namespace label reads
    libnvdimm: Introduce locked DIMM capacity support

    Linus Torvalds
     

18 Jul, 2018

1 commit

  • Add and use a new op_stat_group() function for indexing partition stat
    fields rather than indexing them by rq_data_dir() or bio_data_dir().
    This function works similarly to op_is_sync() in that it takes the
    request::cmd_flags or bio::bi_opf flags and determines which stats
    should et updated.

    In addition, the second parameter to generic_start_io_acct() and
    generic_end_io_acct() is now a REQ_OP rather than simply a read or
    write bit and it uses op_stat_group() on the parameter to determine
    the stat group.

    Note that the partition in_flight counts are not part of the per-cpu
    statistics and as such are not indexed via this function. It's now
    indexed by op_is_write().

    tj: Refreshed on top of v4.17. Updated to pass around REQ_OP.

    Signed-off-by: Michael Callahan
    Signed-off-by: Tejun Heo
    Cc: Minchan Kim
    Cc: Dan Williams
    Cc: Joshua Morris
    Cc: Philipp Reisner
    Cc: Matias Bjorling
    Cc: Kent Overstreet
    Cc: Alasdair Kergon
    Signed-off-by: Jens Axboe

    Michael Callahan
     

15 Jul, 2018

1 commit


11 Apr, 2018

1 commit

  • Pull libnvdimm updates from Dan Williams:
    "This cycle was was not something I ever want to repeat as there were
    several late changes that have only now just settled.

    Half of the branch up to commit d2c997c0f145 ("fs, dax: use
    page->mapping to warn...") have been in -next for several releases.
    The of_pmem driver and the address range scrub rework were late
    arrivals, and the dax work was scaled back at the last moment.

    The of_pmem driver missed a previous merge window due to an oversight.
    A sense of obligation to rectify that miss is why it is included for
    4.17. It has acks from PowerPC folks. Stephen reported a build failure
    that only occurs when merging it with your latest tree, for now I have
    fixed that up by disabling modular builds of of_pmem. A test merge
    with your tree has received a build success report from the 0day robot
    over 156 configs.

    An initial version of the ARS rework was submitted before the merge
    window. It is self contained to libnvdimm, a net code reduction, and
    passing all unit tests.

    The filesystem-dax changes are based on the wait_var_event()
    functionality from tip/sched/core. However, late review feedback
    showed that those changes regressed truncate performance to a large
    degree. The branch was rewound to drop the truncate behavior change
    and now only includes preparation patches and cleanups (with full acks
    and reviews). The finalization of this dax-dma-vs-trnucate work will
    need to wait for 4.18.

    Summary:

    - A rework of the filesytem-dax implementation provides for detection
    of unmap operations (truncate / hole punch) colliding with
    in-progress device-DMA. A fix for these collisions remains a
    work-in-progress pending resolution of truncate latency and
    starvation regressions.

    - The of_pmem driver expands the users of libnvdimm outside of x86
    and ACPI to describe an implementation of persistent memory on
    PowerPC with Open Firmware / Device tree.

    - Address Range Scrub (ARS) handling is completely rewritten to
    account for the fact that ARS may run for 100s of seconds and there
    is no platform defined way to cancel it. ARS will now no longer
    block namespace initialization.

    - The NVDIMM Namespace Label implementation is updated to handle
    label areas as small as 1K, down from 128K.

    - Miscellaneous cleanups and updates to unit test infrastructure"

    * tag 'libnvdimm-for-4.17' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (39 commits)
    libnvdimm, of_pmem: workaround OF_NUMA=n build error
    nfit, address-range-scrub: add module option to skip initial ars
    nfit, address-range-scrub: rework and simplify ARS state machine
    nfit, address-range-scrub: determine one platform max_ars value
    powerpc/powernv: Create platform devs for nvdimm buses
    doc/devicetree: Persistent memory region bindings
    libnvdimm: Add device-tree based driver
    libnvdimm: Add of_node to region and bus descriptors
    libnvdimm, region: quiet region probe
    libnvdimm, namespace: use a safe lookup for dimm device name
    libnvdimm, dimm: fix dpa reservation vs uninitialized label area
    libnvdimm, testing: update the default smart ctrl_temperature
    libnvdimm, testing: Add emulation for smart injection commands
    nfit, address-range-scrub: introduce nfit_spa->ars_state
    libnvdimm: add an api to cast a 'struct nd_region' to its 'struct device'
    nfit, address-range-scrub: fix scrub in-progress reporting
    dax, dm: allow device-mapper to operate without dax support
    dax: introduce CONFIG_DAX_DRIVER
    fs, dax: use page->mapping to warn if truncate collides with a busy page
    ext2, dax: introduce ext2_dax_aops
    ...

    Linus Torvalds
     

04 Apr, 2018

1 commit


18 Mar, 2018

1 commit

  • It happens often while I'm preparing a patch for a block driver that
    I'm wondering: is a definition of SECTOR_SIZE and/or SECTOR_SHIFT
    available for this driver? Do I have to introduce definitions of these
    constants before I can use these constants? To avoid this confusion,
    move the existing definitions of SECTOR_SIZE and SECTOR_SHIFT into the
    header file such that these become available for all
    block drivers. Make the SECTOR_SIZE definition in the uapi msdos_fs.h
    header file conditional to avoid that including that header file after
    causes the compiler to complain about a SECTOR_SIZE
    redefinition.

    Note: the SECTOR_SIZE / SECTOR_SHIFT / SECTOR_BITS definitions have
    not been removed from uapi header files nor from NAND drivers in
    which these constants are used for another purpose than converting
    block layer offsets and sizes into a number of sectors.

    Cc: David S. Miller
    Cc: Mike Snitzer
    Cc: Dan Williams
    Cc: Minchan Kim
    Cc: Nitin Gupta
    Reviewed-by: Sergey Senozhatsky
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Johannes Thumshirn
    Reviewed-by: Martin K. Petersen
    Signed-off-by: Bart Van Assche
    Signed-off-by: Jens Axboe

    Bart Van Assche
     

09 Jan, 2018

1 commit

  • This new interface is similar to how struct device (and many others)
    work. The caller initializes a 'struct dev_pagemap' as required
    and calls 'devm_memremap_pages'. This allows the pagemap structure to
    be embedded in another structure and thus container_of can be used. In
    this way application specific members can be stored in a containing
    struct.

    This will be used by the P2P infrastructure and HMM could probably
    be cleaned up to use it as well (instead of having it's own, similar
    'hmm_devmem_pages_create' function).

    Signed-off-by: Logan Gunthorpe
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Dan Williams

    Christoph Hellwig
     

03 Nov, 2017

1 commit

  • nfit_test needs to use the poison list manipulation code as well. Make
    it more generic and in the process rename poison to badrange, and move
    all the related helpers to a new file.

    Signed-off-by: Dave Jiang
    [vishal: Add badrange.o to nfit_test's Kbuild]
    [vishal: add a missed include in bus.c for the new badrange functions]
    [vishal: rename all instances of 'be' to 'bre']
    Signed-off-by: Vishal Verma
    Signed-off-by: Dan Williams

    Dave Jiang
     

29 Sep, 2017

1 commit

  • If we successfully enable a DIMM then it must not be locked and we can
    clear the label-read failure condition. Otherwise, we need to reload the
    entire bus provider driver to achieve the same effect, and that can
    disrupt unrelated DIMMs and namespaces.

    Fixes: 9d62ed965118 ("libnvdimm: handle locked label storage areas")
    Cc:
    Signed-off-by: Dan Williams

    Dan Williams
     

12 Sep, 2017

1 commit

  • Pull libnvdimm from Dan Williams:
    "A rework of media error handling in the BTT driver and other updates.
    It has appeared in a few -next releases and collected some late-
    breaking build-error and warning fixups as a result.

    Summary:

    - Media error handling support in the Block Translation Table (BTT)
    driver is reworked to address sleeping-while-atomic locking and
    memory-allocation-context conflicts.

    - The dax_device lookup overhead for xfs and ext4 is moved out of the
    iomap hot-path to a mount-time lookup.

    - A new 'ecc_unit_size' sysfs attribute is added to advertise the
    read-modify-write boundary property of a persistent memory range.

    - Preparatory fix-ups for arm and powerpc pmem support are included
    along with other miscellaneous fixes"

    * tag 'libnvdimm-for-4.14' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (26 commits)
    libnvdimm, btt: fix format string warnings
    libnvdimm, btt: clean up warning and error messages
    ext4: fix null pointer dereference on sbi
    libnvdimm, nfit: move the check on nd_reserved2 to the endpoint
    dax: fix FS_DAX=n BLOCK=y compilation
    libnvdimm: fix integer overflow static analysis warning
    libnvdimm, nd_blk: remove mmio_flush_range()
    libnvdimm, btt: rework error clearing
    libnvdimm: fix potential deadlock while clearing errors
    libnvdimm, btt: cache sector_size in arena_info
    libnvdimm, btt: ensure that flags were also unchanged during a map_read
    libnvdimm, btt: refactor map entry operations with macros
    libnvdimm, btt: fix a missed NVDIMM_IO_ATOMIC case in the write path
    libnvdimm, nfit: export an 'ecc_unit_size' sysfs attribute
    ext4: perform dax_device lookup at mount
    ext2: perform dax_device lookup at mount
    xfs: perform dax_device lookup at mount
    dax: introduce a fs_dax_get_by_bdev() helper
    libnvdimm, btt: check memory allocation failure
    libnvdimm, label: fix index block size calculation
    ...

    Linus Torvalds
     

30 Aug, 2017

1 commit

  • The old calculation assumed that the label space was 128k and the label
    size is 128. With v1.2 labels where the label size is 256 this
    calculation will return zero. We are saved by the fact that the
    nsindex_size is always pre-initialized from a previous 128 byte
    assumption and we are lucky that the index sizes turn out the same.

    Fix this going forward in case we start encountering different
    geometries of label areas besides 128k.

    Since the label size can change from one call to the next, drop the
    caching of nsindex_size.

    Signed-off-by: Dan Williams

    Dan Williams
     

24 Aug, 2017

1 commit

  • This way we don't need a block_device structure to submit I/O. The
    block_device has different life time rules from the gendisk and
    request_queue and is usually only available when the block device node
    is open. Other callers need to explicitly create one (e.g. the lightnvm
    passthrough code, or the new nvme multipathing code).

    For the actual I/O path all that we need is the gendisk, which exists
    once per block device. But given that the block layer also does
    partition remapping we additionally need a partition index, which is
    used for said remapping in generic_make_request.

    Note that all the block drivers generally want request_queue or
    sometimes the gendisk, so this removes a layer of indirection all
    over the stack.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

12 Aug, 2017

1 commit


10 Aug, 2017

1 commit


05 Aug, 2017

1 commit

  • It is useful to be able to know the position of a DIMM in an
    interleave-set. Consider the case where the order of the DIMMs changes
    causing a namespace to be invalidated because the interleave-set cookie no
    longer matches. If the before and after state of each DIMM position is
    known this state debugged by the system owner.

    Signed-off-by: Dan Williams

    Dan Williams
     

26 Jul, 2017

1 commit

  • Currently libnvdimm uses HPAGE_SIZE as the default alignment for DAX and
    PFN devices. HPAGE_SIZE is the default hugetlbfs page size and when
    hugetlbfs is disabled it defaults to PAGE_SIZE. Given DAX has more
    in common with THP than hugetlbfs we should proably be using
    HPAGE_PMD_SIZE, but this is undefined when THP is disabled so lets just
    give it a new name.

    The other usage of HPAGE_SIZE in libnvdimm is when determining how large
    the altmap should be. For the reasons mentioned above it doesn't really
    make sense to use HPAGE_SIZE here either. PMD_SIZE seems to be safe to
    use in generic code and it happens to match the vmemmap allocation block
    on x86 and Power. It's still a hack, but it's a slightly nicer hack.

    Signed-off-by: Oliver O'Halloran
    Signed-off-by: Dan Williams

    Oliver O'Halloran
     

30 Jun, 2017

1 commit

  • The UEFI 2.7 specification defines an updated BTT metadata format,
    bumping the revision to 2.0. Add support for the new format, while
    retaining compatibility for the old 1.1 format.

    Cc: Toshi Kani
    Cc: Linda Knippers
    Cc: Dan Williams
    Signed-off-by: Vishal Verma
    Signed-off-by: Dan Williams

    Vishal Verma
     

16 Jun, 2017

1 commit

  • Sysfs "badblocks" information may be updated during run-time that:
    - MCE, SCI, and sysfs "scrub" may add new bad blocks
    - Writes and ioctl() may clear bad blocks

    Add support to send sysfs notifications to sysfs "badblocks" file
    under region and pmem directories when their badblocks information
    is re-evaluated (but is not necessarily changed) during run-time.

    Signed-off-by: Toshi Kani
    Cc: Vishal Verma
    Cc: Linda Knippers
    Signed-off-by: Dan Williams

    Toshi Kani