20 Jul, 2019

1 commit

  • Pull vfs mount updates from Al Viro:
    "The first part of mount updates.

    Convert filesystems to use the new mount API"

    * 'work.mount0' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (63 commits)
    mnt_init(): call shmem_init() unconditionally
    constify ksys_mount() string arguments
    don't bother with registering rootfs
    init_rootfs(): don't bother with init_ramfs_fs()
    vfs: Convert smackfs to use the new mount API
    vfs: Convert selinuxfs to use the new mount API
    vfs: Convert securityfs to use the new mount API
    vfs: Convert apparmorfs to use the new mount API
    vfs: Convert openpromfs to use the new mount API
    vfs: Convert xenfs to use the new mount API
    vfs: Convert gadgetfs to use the new mount API
    vfs: Convert oprofilefs to use the new mount API
    vfs: Convert ibmasmfs to use the new mount API
    vfs: Convert qib_fs/ipathfs to use the new mount API
    vfs: Convert efivarfs to use the new mount API
    vfs: Convert configfs to use the new mount API
    vfs: Convert binfmt_misc to use the new mount API
    convenience helper: get_tree_single()
    convenience helper get_tree_nodev()
    vfs: Kill sget_userns()
    ...

    Linus Torvalds
     

19 Jul, 2019

2 commits

  • Pull dax updates from Dan Williams:
    "The fruits of a bug hunt in the fsdax implementation with Willy and a
    small feature update for device-dax:

    - Fix a hang condition that started triggering after the Xarray
    conversion of fsdax in the v4.20 kernel.

    - Add a 'resource' (root-only physical base address) sysfs attribute
    to device-dax instances to correlate memory-blocks onlined via the
    kmem driver with a given device instance"

    * tag 'dax-for-5.3' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
    dax: Fix missed wakeup with PMD faults
    device-dax: Add a 'resource' attribute

    Linus Torvalds
     
  • Pull libnvdimm updates from Dan Williams:
    "Primarily just the virtio_pmem driver:

    - virtio_pmem

    The new virtio_pmem facility introduces a paravirtualized
    persistent memory device that allows a guest VM to use DAX
    mechanisms to access a host-file with host-page-cache. It arranges
    for MAP_SYNC to be disabled and instead triggers a host fsync()
    when a 'write-cache flush' command is sent to the virtual disk
    device.

    - Miscellaneous small fixups"

    * tag 'libnvdimm-for-5.3' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
    virtio_pmem: fix sparse warning
    xfs: disable map_sync for async flush
    ext4: disable map_sync for async flush
    dax: check synchronous mapping is supported
    dm: enable synchronous dax
    libnvdimm: add dax_dev sync flag
    virtio-pmem: Add virtio pmem driver
    libnvdimm: nd_region flush callback support
    libnvdimm, namespace: Drop uuid_t implementation detail

    Linus Torvalds
     

17 Jul, 2019

2 commits

  • It is now possible to use persistent memory like regular RAM, but
    currently there is no way to remove this memory until the machine is
    rebooted.

    This work expands the functionality to also allow hot-removing
    previously hotplugged persistent memory, and recovering the device for
    use for other purposes.

    To hot-remove persistent memory, the management software must first
    offline all memory blocks of the dax region, and then unbind it from
    the device-dax/kmem driver. The operations should look like this:

    echo offline > /sys/devices/system/memory/memoryN/state
    ...
    echo dax0.0 > /sys/bus/dax/drivers/kmem/unbind

    Note: if unbind is done without offlining memory beforehand, it won't be
    possible to do dax0.0 hotremove, and dax's memory is going to be part of
    System RAM until reboot.

    Link: http://lkml.kernel.org/r/20190517215438.6487-4-pasha.tatashin@soleen.com
    Signed-off-by: Pavel Tatashin
    Reviewed-by: David Hildenbrand
    Cc: James Morris
    Cc: Sasha Levin
    Cc: Michal Hocko
    Cc: Dave Hansen
    Cc: Dan Williams
    Cc: Keith Busch
    Cc: Vishal Verma
    Cc: Dave Jiang
    Cc: Ross Zwisler
    Cc: Tom Lendacky
    Cc: Huang Ying
    Cc: Fengguang Wu
    Cc: Borislav Petkov
    Cc: Bjorn Helgaas
    Cc: Yaowei Bai
    Cc: Takashi Iwai
    Cc: Jérôme Glisse
    Cc: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Tatashin
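The ordering described in the entry above (offline every memory block first, then unbind from kmem) could be wrapped in a small shell helper. This is only a sketch under stated assumptions: the function name dax_hotremove and the SYSFS_ROOT variable are hypothetical, and the caller must supply the memory block names that belong to the device.

```shell
# Sketch only: SYSFS_ROOT and dax_hotremove are hypothetical names.
# SYSFS_ROOT defaults to the real sysfs mount point.
SYSFS_ROOT="${SYSFS_ROOT:-/sys}"

# usage: dax_hotremove <dax-device> <memory-block>...
dax_hotremove() {
    dev="$1"; shift
    for blk in "$@"; do
        state="$SYSFS_ROOT/devices/system/memory/$blk/state"
        # Offline the block; stop if the write is rejected.
        echo offline > "$state" || return 1
        # Do not unbind while any block still reports another state.
        [ "$(cat "$state")" = "offline" ] || return 1
    done
    # Every block is offline, so the device can be unbound from kmem.
    echo "$dev" > "$SYSFS_ROOT/bus/dax/drivers/kmem/unbind"
}
```

Run as root, a call might look like `dax_hotremove dax0.0 memory32 memory33`, where the block names are placeholders. Stopping at the first block that fails to go offline avoids the situation the note warns about, where an early unbind leaves the memory in System RAM until reboot.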
     
  • Patch series ""Hotremove" persistent memory", v6.

    Recently, support for using persistent memory like regular RAM was
    added to Linux. This work extends this functionality to also allow hot
    removing persistent memory.

    We (Microsoft) have an important use case for this functionality.

    The requirement is for physical machines with a small amount of RAM (~8G)
    to be able to reboot in a very short period of time ( [...]
    echo dax0.0 > /sys/bus/dax/drivers/device_dax/unbind
    echo dax0.0 > /sys/bus/dax/drivers/kmem/new_id
    echo online_movable > /sys/devices/system/memory/memoryXXX/state
    4. Before reboot hotremove device-dax memory from System RAM
    echo offline > /sys/devices/system/memory/memoryXXX/state
    echo dax0.0 > /sys/bus/dax/drivers/kmem/unbind
    5. Create raw pmem0 device
    ndctl create-namespace --mode raw -e namespace0.0 -f
    6. Copy the state that was stored by apps to ramdisk to pmem device
    7. Do kexec reboot or reboot through firmware if firmware does not
    zero memory in pmem0 region (these machines have only regular
    volatile memory). So, to have a pmem0 device, either the memmap kernel
    parameter is used or device nodes are specified in the dtb.

    This patch (of 3):

    When add_memory() fails, the resource and the memory should be freed.

    Link: http://lkml.kernel.org/r/20190517215438.6487-2-pasha.tatashin@soleen.com
    Fixes: c221c0b0308f ("device-dax: "Hotplug" persistent memory for use like normal RAM")
    Signed-off-by: Pavel Tatashin
    Reviewed-by: Dave Hansen
    Cc: Bjorn Helgaas
    Cc: Borislav Petkov
    Cc: Dan Williams
    Cc: Dave Hansen
    Cc: Dave Jiang
    Cc: David Hildenbrand
    Cc: Fengguang Wu
    Cc: Huang Ying
    Cc: James Morris
    Cc: Jérôme Glisse
    Cc: Keith Busch
    Cc: Michal Hocko
    Cc: Ross Zwisler
    Cc: Sasha Levin
    Cc: Takashi Iwai
    Cc: Tom Lendacky
    Cc: Vishal Verma
    Cc: Yaowei Bai
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Tatashin
     

06 Jul, 2019

1 commit

  • This patch adds a 'DAXDEV_SYNC' flag, which is set
    for an nd_region doing synchronous flush. The flag is
    later used to disable MAP_SYNC functionality on
    ext4 & xfs filesystems for devices that don't support
    synchronous flush.

    Signed-off-by: Pankaj Gupta
    Signed-off-by: Dan Williams

    Pankaj Gupta
     

03 Jul, 2019

4 commits


21 Jun, 2019

1 commit

  • device-dax based devices were missing a 'resource' attribute to indicate
    the physical address range contributed by the device in question. This
    information is desirable to userspace tooling that may want to use the
    dax device as system-ram, and wants to selectively hotplug and online
    the memory blocks associated with a given device.

    Without this, the tooling would have to parse /proc/iomem for the memory
    ranges contributed by dax devices, which can be a workaround, but it is
    far easier to provide this information in the sysfs hierarchy.

    Cc: Dave Hansen
    Cc: Dan Williams
    Signed-off-by: Vishal Verma
    Signed-off-by: Dan Williams

    Vishal Verma
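As an illustration of how tooling might use the 'resource' attribute described above, the sketch below maps a device's physical range to memory block names. The function name dax_memory_blocks and the SYSFS_ROOT variable are hypothetical. It assumes the device directory also exposes a 'size' attribute in bytes, that memory blocks are named memoryN with N equal to the physical address divided by the block size, and that the range is aligned to the memory block size (partially covered blocks at either end may not exist).

```shell
# Sketch only: SYSFS_ROOT and dax_memory_blocks are hypothetical names.
SYSFS_ROOT="${SYSFS_ROOT:-/sys}"

# usage: dax_memory_blocks <dax-device>
# Prints one memoryN name per memory block spanned by the device.
dax_memory_blocks() {
    dev_dir="$SYSFS_ROOT/bus/dax/devices/$1"
    # 'resource' holds the physical base address (readable by root only).
    start=$(( $(cat "$dev_dir/resource") ))
    # Assumption: 'size' holds the device size in bytes.
    size=$(( $(cat "$dev_dir/size") ))
    # block_size_bytes is printed in hex without a 0x prefix.
    bs=$(( 0x$(cat "$SYSFS_ROOT/devices/system/memory/block_size_bytes") ))
    i=$(( start / bs ))
    last=$(( (start + size - 1) / bs ))
    while [ "$i" -le "$last" ]; do
        echo "memory$i"
        i=$(( i + 1 ))
    done
}
```

This avoids parsing /proc/iomem: the base address comes from the device's own sysfs directory, and the block size from the memory subsystem.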
     

14 Jun, 2019

1 commit

  • Logan noticed that devm_memremap_pages_release() kills the percpu_ref,
    drops all the page references that were acquired at init and then
    immediately proceeds to unplug, arch_remove_memory(), the backing pages
    for the pagemap. If for some reason device shutdown actually collides
    with a busy / elevated-ref-count page then arch_remove_memory() should
    be deferred until after that reference is dropped.

    As it stands the "wait for last page ref drop" happens *after*
    devm_memremap_pages_release() returns, which is obviously too late and
    can lead to crashes.

    Fix this situation by assigning the responsibility to wait for the
    percpu_ref to go idle to devm_memremap_pages() with a new ->cleanup()
    callback. Implement the new cleanup callback for all
    devm_memremap_pages() users: pmem, devdax, hmm, and p2pdma.

    Link: http://lkml.kernel.org/r/155727339156.292046.5432007428235387859.stgit@dwillia2-desk3.amr.corp.intel.com
    Fixes: 41e94a851304 ("add devm_memremap_pages")
    Signed-off-by: Dan Williams
    Reported-by: Logan Gunthorpe
    Reviewed-by: Ira Weiny
    Reviewed-by: Logan Gunthorpe
    Cc: Bjorn Helgaas
    Cc: "Jérôme Glisse"
    Cc: Christoph Hellwig
    Cc: Greg Kroah-Hartman
    Cc: "Rafael J. Wysocki"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
     

05 Jun, 2019

1 commit

  • Based on 1 normalized pattern(s):

    this program is free software you can redistribute it and or modify
    it under the terms of version 2 of the gnu general public license as
    published by the free software foundation this program is
    distributed in the hope that it will be useful but without any
    warranty without even the implied warranty of merchantability or
    fitness for a particular purpose see the gnu general public license
    for more details

    extracted by the scancode license scanner the SPDX license identifier

    GPL-2.0-only

    has been chosen to replace the boilerplate/reference in 64 file(s).

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Alexios Zavras
    Reviewed-by: Allison Randal
    Cc: linux-spdx@vger.kernel.org
    Link: https://lkml.kernel.org/r/20190529141901.894819585@linutronix.de
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     

26 May, 2019

3 commits

  • Convert the dax filesystem to the new internal mount API as the old
    one will be obsoleted and removed. This allows greater flexibility in
    communication of mount parameters between userspace, the VFS and the
    filesystem.

    See Documentation/filesystems/mount_api.txt for more information.

    Signed-off-by: David Howells
    cc: Dan Williams
    cc: Vishal Verma
    cc: Keith Busch
    cc: Dave Jiang
    cc: linux-nvdimm@lists.01.org
    Signed-off-by: Al Viro

    David Howells
     
  • Once upon a time we used to set ->d_name of e.g. pipefs root
    so that d_path() on pipes would work. These days it's
    completely pointless - dentries of pipes are not even connected
    to pipefs root. However, mount_pseudo() had set the root
    dentry name (passed as the second argument) and callers
    kept inventing names to pass to it. Including those that
    didn't *have* any non-root dentries to start with...

    All of that had been pointless for about 8 years now; it's
    time to get rid of that cargo-culting...

    Signed-off-by: Al Viro

    Al Viro
     
  • Pull libnvdimm fixes from Dan Williams:

    - Fix a regression that disabled device-mapper dax support

    - Remove unnecessary hardened-user-copy overhead (>30%) for dax
    read(2)/write(2).

    - Fix some compilation warnings.

    * tag 'libnvdimm-fixes-5.2-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
    libnvdimm/pmem: Bypass CONFIG_HARDENED_USERCOPY overhead
    dax: Arrange for dax_supported check to span multiple devices
    libnvdimm: Fix compilation warnings with W=1

    Linus Torvalds
     

21 May, 2019

3 commits

  • Add SPDX license identifiers to all Make/Kconfig files which:

    - Have no license information of any form

    These files fall under the project license, GPL v2 only. The resulting SPDX
    license identifier is:

    GPL-2.0-only

    Signed-off-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     
  • The device-dax fs is only there to allocate a common inode for each
    device-node that refers to the same device by major:minor. It is
    otherwise not user mountable and need not be displayed in
    /proc/filesystems.

    Reported-by: Al Viro
    Acked-by: Al Viro
    Signed-off-by: Dan Williams
    Signed-off-by: Al Viro

    Dan Williams
     
  • Pankaj reports that starting with commit ad428cdb525a "dax: Check the
    end of the block-device capacity with dax_direct_access()" device-mapper
    no longer allows dax operation. This results from the stricter checks in
    __bdev_dax_supported() that validate that the start and end of a
    block-device map to the same 'pagemap' instance.

    Teach the dax-core and device-mapper to validate the 'pagemap' on a
    per-target basis. This is accomplished by refactoring the
    bdev_dax_supported() internals into generic_fsdax_supported() which
    takes a sector range to validate. Consequently generic_fsdax_supported()
    is suitable to be used in a device-mapper ->iterate_devices() callback.
    A new ->dax_supported() operation is added to allow composite devices to
    split and route upper-level bdev_dax_supported() requests.

    Fixes: ad428cdb525a ("dax: Check the end of the block-device...")
    Cc: Ira Weiny
    Cc: Dave Jiang
    Cc: Keith Busch
    Cc: Matthew Wilcox
    Cc: Vishal Verma
    Cc: Heiko Carstens
    Cc: Martin Schwidefsky
    Reviewed-by: Jan Kara
    Reported-by: Pankaj Gupta
    Reviewed-by: Pankaj Gupta
    Tested-by: Pankaj Gupta
    Tested-by: Vaibhav Jain
    Reviewed-by: Mike Snitzer
    Signed-off-by: Dan Williams

    Dan Williams
     

16 May, 2019

1 commit

  • Pull libnvdimm updates from Dan Williams:
    "Just a small collection of fixes this time around.

    The new virtio-pmem driver is nearly ready, but some last minute
    device-mapper acks and virtio questions made it prudent to await v5.3.

    Other major topics that were brewing on the linux-nvdimm mailing list
    like sub-section hotplug, and other devm_memremap_pages() reworks will
    go upstream through Andrew's tree.

    Summary:

    - Fix a long standing namespace label corruption scenario when
    re-provisioning capacity for a namespace.

    - Restore the ability of the dax_pmem module to be built-in.

    - Harden the build for the 'nfit_test' unit test modules so that the
    userspace test harness can ensure all required test modules are
    available"

    * tag 'libnvdimm-fixes-5.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
    drivers/dax: Allow to include DEV_DAX_PMEM as builtin
    libnvdimm/namespace: Fix label tracking error
    tools/testing/nvdimm: add watermarks for dax_pmem* modules
    dax/pmem: Fix whitespace in dax_pmem

    Linus Torvalds
     

15 May, 2019

1 commit

  • Starting with c6f3c5ee40c1 ("mm/huge_memory.c: fix modifying of page
    protection by insert_pfn_pmd()") vmf_insert_pfn_pmd() internally calls
    pmdp_set_access_flags(). That helper enforces a pmd aligned @address
    argument via VM_BUG_ON() assertion.

    Update the implementation to take a 'struct vm_fault' argument directly
    and apply the address alignment fixup internally to fix crash signatures
    like:

    kernel BUG at arch/x86/mm/pgtable.c:515!
    invalid opcode: 0000 [#1] SMP NOPTI
    CPU: 51 PID: 43713 Comm: java Tainted: G OE 4.19.35 #1
    [..]
    RIP: 0010:pmdp_set_access_flags+0x48/0x50
    [..]
    Call Trace:
    vmf_insert_pfn_pmd+0x198/0x350
    dax_iomap_fault+0xe82/0x1190
    ext4_dax_huge_fault+0x103/0x1f0
    ? __switch_to_asm+0x40/0x70
    __handle_mm_fault+0x3f6/0x1370
    ? __switch_to_asm+0x34/0x70
    ? __switch_to_asm+0x40/0x70
    handle_mm_fault+0xda/0x200
    __do_page_fault+0x249/0x4f0
    do_page_fault+0x32/0x110
    ? page_fault+0x8/0x30
    page_fault+0x1e/0x30

    Link: http://lkml.kernel.org/r/155741946350.372037.11148198430068238140.stgit@dwillia2-desk3.amr.corp.intel.com
    Fixes: c6f3c5ee40c1 ("mm/huge_memory.c: fix modifying of page protection by insert_pfn_pmd()")
    Signed-off-by: Dan Williams
    Reported-by: Piotr Balcer
    Tested-by: Yan Ma
    Tested-by: Pankaj Gupta
    Reviewed-by: Matthew Wilcox
    Reviewed-by: Jan Kara
    Reviewed-by: Aneesh Kumar K.V
    Cc: Chandan Rajendra
    Cc: Souptick Joarder
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
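The alignment fixup mentioned in the entry above amounts to rounding the faulting address down to a PMD boundary. The arithmetic can be illustrated in shell. This is not the kernel code: the pmd_align name is hypothetical, and the 2 MiB PMD size is an assumption that matches x86-64 with 4 KiB base pages.

```shell
# Assumption: 2 MiB PMD mappings, as on x86-64 with 4 KiB base pages.
PMD_SIZE=$(( 1 << 21 ))

# usage: pmd_align <address>
# Rounds an address down to the start of its PMD-sized region.
pmd_align() {
    printf '0x%x\n' $(( $1 & ~(PMD_SIZE - 1) ))
}
```

An address that is already PMD-aligned is returned unchanged, while any other address within the same 2 MiB region maps to the same aligned value.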
     

07 May, 2019

1 commit

  • This moves the dependency to DEV_DAX_PMEM_COMPAT such that the
    compat support is only allowed if DEV_DAX_PMEM is built as a module.

    This makes it easy to test the new code in an emulation setup where we
    often build things without module support.

    Fixes: 730926c3b099 ("device-dax: Add /sys/class/dax backwards compatibility")
    Signed-off-by: Aneesh Kumar K.V
    Signed-off-by: Dan Williams

    Aneesh Kumar K.V
     

02 May, 2019

1 commit

  • we might want to drop ->destroy_inode() there - it's used only for
    WARN_ON() now, and AFAICS that could be moved to ->evict_inode()
    if we had one...

    Reviewed-by: Jan Kara
    Acked-by: Dan Williams
    Signed-off-by: Al Viro

    Al Viro
     

23 Apr, 2019

1 commit


17 Mar, 2019

1 commit

  • Pull device-dax updates from Dan Williams:
    "New device-dax infrastructure to allow persistent memory and other
    "reserved" / performance differentiated memories, to be assigned to
    the core-mm as "System RAM".

    Some users want to use persistent memory as additional volatile
    memory. They are willing to cope with potential performance
    differences, for example between DRAM and 3D Xpoint, and want to use
    typical Linux memory management apis rather than a userspace memory
    allocator layered over an mmap() of a dax file. The administration
    model is to decide how much Persistent Memory (pmem) to use as System
    RAM, create a device-dax-mode namespace of that size, and then assign
    it to the core-mm. The rationale for device-dax is that it is a
    generic memory-mapping driver that can be layered over any "special
    purpose" memory, not just pmem. On subsequent boots udev rules can be
    used to restore the memory assignment.

    One implication of using pmem as RAM is that mlock() no longer keeps
    data off persistent media. For this reason it is recommended to enable
    NVDIMM Security (previously merged for 5.0) to encrypt pmem contents
    at rest. We considered making this recommendation an actively enforced
    requirement, but in the end decided to leave it as a distribution /
    administrator policy to allow for emulation and test environments that
    lack security capable NVDIMMs.

    Summary:

    - Replace the /sys/class/dax device model with /sys/bus/dax, and
    include a compat driver so distributions can opt-in to the new ABI.

    - Allow for an alternative driver for the device-dax address-range

    - Introduce the 'kmem' driver to hotplug / assign a device-dax
    address-range to the core-mm.

    - Arrange for the device-dax target-node to be onlined so that the
    newly added memory range can be uniquely referenced by numa apis"

    NOTE! I'm not entirely happy with the whole "PMEM as RAM" model because
    we currently have special - and very annoying - rules in the kernel about
    accessing PMEM only with the "MC safe" accessors, because machine checks
    inside the regular repeat string copy functions can be fatal in some
    (not described) circumstances.

    And apparently the PMEM modules can cause that a lot more than regular
    RAM. The argument is that this happens because PMEM doesn't necessarily
    get scrubbed at boot like RAM does, but that is planned to be added for
    the user space tooling.

    Quoting Dan from another email:
    "The exposure can be reduced in the volatile-RAM case by scanning for
    and clearing errors before it is onlined as RAM. The userspace tooling
    for that can be in place before v5.1-final. There's also runtime
    notifications of errors via acpi_nfit_uc_error_notify() from
    background scrubbers on the DIMM devices. With that mechanism the
    kernel could proactively clear newly discovered poison in the volatile
    case, but that would be additional development more suitable for v5.2.

    I understand the concern, and the need to highlight this issue by
    tapping the brakes on feature development, but I don't see PMEM as RAM
    making the situation worse when the exposure is also there via DAX in
    the PMEM case. Volatile-RAM is arguably a safer use case since it's
    possible to repair pages where the persistent case needs active
    application coordination"

    * tag 'devdax-for-5.1' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
    device-dax: "Hotplug" persistent memory for use like normal RAM
    mm/resource: Let walk_system_ram_range() search child resources
    mm/memory-hotplug: Allow memory resources to be children
    mm/resource: Move HMM pr_debug() deeper into resource code
    mm/resource: Return real error codes from walk failures
    device-dax: Add a 'modalias' attribute to DAX 'bus' devices
    device-dax: Add a 'target_node' attribute
    device-dax: Auto-bind device after successful new_id
    acpi/nfit, device-dax: Identify differentiated memory with a unique numa-node
    device-dax: Add /sys/class/dax backwards compatibility
    device-dax: Add support for a dax override driver
    device-dax: Move resource pinning+mapping into the common driver
    device-dax: Introduce bus + driver model
    device-dax: Start defining a dax bus model
    device-dax: Remove multi-resource infrastructure
    device-dax: Kill dax_region base
    device-dax: Kill dax_region ida

    Linus Torvalds
     

01 Mar, 2019

1 commit

  • This is intended for use with NVDIMMs that are physically persistent
    (physically like flash) so that they can be used as a cost-effective
    RAM replacement. Intel Optane DC persistent memory is one
    implementation of this kind of NVDIMM.

    Currently, a persistent memory region is "owned" by a device driver,
    either the "Direct DAX" or "Filesystem DAX" drivers. These drivers
    allow applications to explicitly use persistent memory, generally
    by being modified to use special, new libraries. (DIMM-based
    persistent memory hardware/software is described in great detail
    here: Documentation/nvdimm/nvdimm.txt).

    However, this limits persistent memory use to applications which
    *have* been modified. To make it more broadly usable, this driver
    "hotplugs" memory into the kernel, to be managed and used just like
    normal RAM would be.

    To make this work, management software must remove the device from
    being controlled by the "Device DAX" infrastructure:

    echo dax0.0 > /sys/bus/dax/drivers/device_dax/unbind

    and then tell the new driver that it can bind to the device:

    echo dax0.0 > /sys/bus/dax/drivers/kmem/new_id

    After this, there will be a number of new memory sections visible
    in sysfs that can be onlined, or that may get onlined by existing
    udev-initiated memory hotplug rules.

    This rebinding procedure is currently a one-way trip. Once memory
    is bound to "kmem", it's there permanently and can not be
    unbound and assigned back to device_dax.

    The kmem driver will never bind to a dax device unless the device
    is *explicitly* bound to the driver. There are two reasons for
    this: One, since it is a one-way trip, it can not be undone if
    bound incorrectly. Two, the kmem driver destroys data on the
    device. Suppose you had good data on a pmem device. It
    would be catastrophic if you compiled in "kmem" but left out
    the "device_dax" driver. kmem would take over the device and
    write volatile data all over your good data.

    This inherits any existing NUMA information for the newly-added
    memory from the persistent memory device that came from the
    firmware. On Intel platforms, the firmware has guarantees that
    require each socket's persistent memory to be in a separate
    memory-only NUMA node. That means that this patch is not expected
    to create NUMA nodes, but will simply hotplug memory into existing
    nodes.

    Because NUMA nodes are created, the existing NUMA APIs and tools
    are sufficient to create policies for applications or memory areas
    to have affinity for or an aversion to using this memory.

    There is currently some metadata at the beginning of pmem regions.
    The section-size memory hotplug restrictions, plus this small
    reserved area can cause the "loss" of a section or two of capacity.
    This should be fixable in follow-on patches. But, as a first step,
    losing 256MB of memory (worst case) out of hundreds of gigabytes
    is a good tradeoff vs. the required code to fix this up precisely.
    This calculation is also the reason we export
    memory_block_size_bytes().

    Signed-off-by: Dave Hansen
    Reviewed-by: Dan Williams
    Reviewed-by: Keith Busch
    Cc: Dave Jiang
    Cc: Ross Zwisler
    Cc: Vishal Verma
    Cc: Tom Lendacky
    Cc: Andrew Morton
    Cc: Michal Hocko
    Cc: linux-nvdimm@lists.01.org
    Cc: linux-kernel@vger.kernel.org
    Cc: linux-mm@kvack.org
    Cc: Huang Ying
    Cc: Fengguang Wu
    Cc: Borislav Petkov
    Cc: Bjorn Helgaas
    Cc: Yaowei Bai
    Cc: Takashi Iwai
    Cc: Jerome Glisse
    Reviewed-by: Vishal Verma
    Signed-off-by: Dan Williams

    Dave Hansen
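The two rebinding steps from the entry above (unbind from device_dax, then register the device with kmem through new_id) could be combined in one helper. This is a sketch, not part of the patch: the dax_to_kmem name and the SYSFS_ROOT variable are hypothetical. Keep in mind the warning above that kmem destroys any data on the device.

```shell
# Sketch only: SYSFS_ROOT and dax_to_kmem are hypothetical names.
SYSFS_ROOT="${SYSFS_ROOT:-/sys}"

# usage: dax_to_kmem <dax-device>
dax_to_kmem() {
    drivers="$SYSFS_ROOT/bus/dax/drivers"
    # Step 1: release the device from the device_dax driver.
    echo "$1" > "$drivers/device_dax/unbind" || return 1
    # Step 2: tell the kmem driver that it may bind to this device.
    echo "$1" > "$drivers/kmem/new_id"
}
```

After a successful `dax_to_kmem dax0.0`, the new memory sections still have to be onlined, either by hand through their sysfs state files or by existing udev memory hotplug rules.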
     

28 Feb, 2019

1 commit

  • Add a 'modalias' attribute to devices under the DAX bus so that userspace
    is able to dynamically load modules as needed.

    Normally, udev can get the modalias from 'uevent', and that is correctly
    set up by the DAX bus. However other tooling such as 'libndctl' for
    interacting with drivers/nvdimm/, and 'libdaxctl' for drivers/dax/ can
    also use the modalias to dynamically load modules via libkmod lookups.

    The 'nd' bus set up by the libnvdimm subsystem exports a modalias
    attribute. Imitate this to export the same for the 'dax' bus.

    Cc: Dave Hansen
    Signed-off-by: Vishal Verma
    Signed-off-by: Dan Williams

    Vishal Verma
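Outside of libkmod, the same lookup can be approximated from a shell by passing the attribute's contents to modprobe, which accepts an alias in place of a module name. The sketch below assumes the attribute lives in the device's directory under /sys/bus/dax/devices. The names dax_load_driver, SYSFS_ROOT and MODPROBE are hypothetical.

```shell
# Sketch only: SYSFS_ROOT, MODPROBE and dax_load_driver are hypothetical.
SYSFS_ROOT="${SYSFS_ROOT:-/sys}"
# MODPROBE can be overridden, e.g. MODPROBE=echo for a dry run.
MODPROBE="${MODPROBE:-modprobe}"

# usage: dax_load_driver <dax-device>
dax_load_driver() {
    modalias=$(cat "$SYSFS_ROOT/bus/dax/devices/$1/modalias") || return 1
    # modprobe resolves the alias string to a module and loads it.
    $MODPROBE "$modalias"
}
```

Setting MODPROBE=echo prints the alias instead of loading anything, which is a convenient way to inspect what a device advertises.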
     

21 Feb, 2019

2 commits

  • The checks in __bdev_dax_supported() helped mitigate a potential data
    corruption bug in the pmem driver's handling of section alignment
    padding. Strengthen the checks, including checking the end of the range,
    to validate the dev_pagemap, Xarray entries, and sector-to-pfn
    translation established for pmem namespaces.

    Acked-by: Jan Kara
    Cc: "Darrick J. Wong"
    Signed-off-by: Dan Williams

    Dan Williams
     
  • The target-node attribute is the Linux numa-node that a device-dax
    instance may create when it is online. Prior to being online the
    device's 'numa_node' property reflects the closest online cpu node which
    is the typical expectation of a device 'numa_node'. Once it is online it
    becomes its own distinct numa node, i.e. 'target_node'.

    Export the 'target_node' property to give userspace tooling the ability
    to predict the effective numa-node from a device-dax instance configured
    to provide 'System RAM' capacity.

    Cc: Vishal Verma
    Reported-by: Dave Hansen
    Signed-off-by: Dan Williams

    Dan Williams
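A minimal sketch of how userspace tooling might read the exported property, assuming the attribute appears in the device's directory under /sys/bus/dax/devices. The dax_target_node name and the SYSFS_ROOT variable are hypothetical.

```shell
# Sketch only: SYSFS_ROOT and dax_target_node are hypothetical names.
SYSFS_ROOT="${SYSFS_ROOT:-/sys}"

# usage: dax_target_node <dax-device>
# Prints the numa-node the device's memory is expected to appear on.
dax_target_node() {
    cat "$SYSFS_ROOT/bus/dax/devices/$1/target_node"
}
```

The result can feed existing NUMA tools, for example `numactl --membind="$(dax_target_node dax0.0)" ./app`, where ./app is a placeholder for the workload to bind.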
     

25 Jan, 2019

1 commit


07 Jan, 2019

9 commits

  • Persistent memory, as described by the ACPI NFIT (NVDIMM Firmware
    Interface Table), is the first known instance of a memory range
    described by a unique "target" proximity domain. The "initiator" and
    "target" proximity domains are the approach that the ACPI HMAT
    (Heterogeneous Memory Attributes Table) uses to describe the unique
    performance properties of a memory range relative to a given initiator
    (e.g. CPU or DMA device).

    Currently the numa-node for a /dev/pmemX block-device or /dev/daxX.Y
    char-device follows the traditional notion of 'numa-node' where the
    attribute conveys the closest online numa-node. That numa-node attribute
    is useful for cpu-binding and memory-binding processes *near* the
    device. However, when the memory range backing a 'pmem', or 'dax' device
    is onlined (memory hot-add) the memory-only-numa-node representing that
    address needs to be differentiated from the set of online nodes. In
    other words, the numa-node association of the device depends on whether
    you can bind processes *near* the cpu-numa-node in the offline
    device-case, or bind processes *on* the memory-range directly after the
    backing address range is onlined.

    Allow for the case that platform firmware describes persistent memory
    with a unique proximity domain, i.e. when it is distinct from the
    proximity of DRAM and CPUs that are on the same socket. Plumb the Linux
    numa-node translation of that proximity through the libnvdimm region
    device to namespaces that are in device-dax mode. With this in place the
    proposed kmem driver [1] can optionally discover a unique numa-node
    number for the address range as it transitions the memory from an
    offline state managed by a device-driver to an online memory range
    managed by the core-mm.

    [1]: https://lore.kernel.org/lkml/20181022201317.8558C1D8@viggo.jf.intel.com

    Reported-by: Fan Du
    Cc: Michael Ellerman
    Cc: "Oliver O'Halloran"
    Cc: Dave Hansen
    Cc: Jérôme Glisse
    Reviewed-by: Yang Shi
    Signed-off-by: Dan Williams

    Dan Williams
     
  • On the expectation that some environments may not upgrade libdaxctl
    (userspace component that depends on the /sys/class/dax hierarchy),
    provide a default / legacy dax_pmem_compat driver. The dax_pmem_compat
    driver implements the original /sys/class/dax sysfs layout rather than
    /sys/bus/dax. When userspace is upgraded it can blacklist this module
    and switch to the dax_pmem driver going forward.

    CONFIG_DEV_DAX_PMEM_COMPAT and supporting code will be deleted according
    to the dax_pmem entry in Documentation/ABI/obsolete/.

    Signed-off-by: Dan Williams

    Dan Williams
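Blacklisting the compat module, as suggested above, is normally done with a modprobe configuration fragment. The helper below is a sketch: the file name dax-pmem-compat.conf, the MODPROBE_D variable and the function name are hypothetical, and it assumes modprobe reads *.conf files from /etc/modprobe.d.

```shell
# Assumption: modprobe reads *.conf files from this directory.
MODPROBE_D="${MODPROBE_D:-/etc/modprobe.d}"

# usage: blacklist_dax_compat
blacklist_dax_compat() {
    # 'blacklist' stops alias-based autoloading of the named module.
    echo "blacklist dax_pmem_compat" > "$MODPROBE_D/dax-pmem-compat.conf"
}
```

A blacklist entry only affects alias-based autoloading, so a module that is already loaded stays loaded until it is removed or the machine reboots.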
     
  • Introduce the 'new_id' concept for enabling a custom device-driver attach
    policy for dax-bus drivers. The intended use is to have a mechanism for
    hot-plugging device-dax ranges into the page allocator on-demand. With
    this in place the default policy of using device-dax for performance
    differentiated memory can be overridden by user-space policy that can
    arrange for the memory range to be managed as 'System RAM' with
    user-defined NUMA and other performance attributes.

    Signed-off-by: Dan Williams

    Dan Williams
     
  • Move the responsibility of calling devm_request_resource() and
    devm_memremap_pages() into the common device-dax driver. This is another
    preparatory step to allowing an alternate personality driver for a
    device-dax range.

    Signed-off-by: Dan Williams

    Dan Williams
     
  • In support of multiple device-dax instances per device-dax-region and
    allowing the 'kmem' driver to attach to dax-instances instead of the
    current device-node access, convert the dax sub-system from a class to a
    bus. Recall that the kmem driver takes reserved / special purpose
    memories and assigns them to be managed by the core-mm.

    Aside from the fact the device-dax instances are registered and probed
    on a bus, two other lifetime-management changes are made:

    1/ Delay attaching a cdev until driver probe time

    2/ A new run_dax() helper is introduced to allow restoring dax-operation
    after a kill_dax() event. So, at driver ->probe() time we run_dax()
    and at ->remove() time we kill_dax() and invalidate all mappings.

    Signed-off-by: Dan Williams

    Dan Williams
     
  • Towards eliminating the dax_class, move the dax-device-attribute
    enabling to a new bus.c file in the core. The amount of code
    thrash of subsequent patches is reduced as no logic changes are made,
    just pure code movement.

    A temporary export of unregister_dev_dax() and dax_attribute_groups is
    needed to preserve compilation, but those symbols become static again in
    a follow-on patch.

    Signed-off-by: Dan Williams

    Dan Williams
     
  • The multi-resource implementation anticipated discontiguous sub-division
    support. That has not yet materialized, so delete the infrastructure and
    related code.

    Signed-off-by: Dan Williams

    Dan Williams
     
  • Nothing consumes this attribute of a region and devres otherwise
    remembers the value for de-allocation purposes.

    Signed-off-by: Dan Williams

    Dan Williams
     
  • Commit bbb3be170ac2 "device-dax: fix sysfs duplicate warnings" arranged
    for passing a dax instance-id to devm_create_dax_dev(), rather than
    generating one internally. Remove the dax_region ida and related code.

    Signed-off-by: Dan Williams

    Dan Williams
     

29 Dec, 2018

1 commit

  • The last step before devm_memremap_pages() returns success is to allocate
    a release action, devm_memremap_pages_release(), to tear the entire setup
    down. However, the result from devm_add_action() is not checked.

    Checking the error from devm_add_action() is not enough. The api
    currently relies on the fact that the percpu_ref it is using is killed by
    the time the devm_memremap_pages_release() is run. Rather than continue
    this awkward situation, offload the responsibility of killing the
    percpu_ref to devm_memremap_pages_release() directly. This allows
    devm_memremap_pages() to do the right thing relative to init failures and
    shutdown.

    Without this change we could fail to register the teardown of
    devm_memremap_pages(). The likelihood of hitting this failure is tiny as
    small memory allocations almost always succeed. However, the impact of
    the failure is large given any future reconfiguration, or disable/enable,
    of an nvdimm namespace will fail forever as subsequent calls to
    devm_memremap_pages() will fail to setup the pgmap_radix since there will
    be stale entries for the physical address range.

    An argument could be made to require that the ->kill() operation be set in
    the @pgmap arg rather than passed in separately. However, it helps code
    readability, tracking the lifetime of a given instance, to be able to grep
    the kill routine directly at the devm_memremap_pages() call site.

    Link: http://lkml.kernel.org/r/154275558526.76910.7535251937849268605.stgit@dwillia2-desk3.amr.corp.intel.com
    Signed-off-by: Dan Williams
    Fixes: e8d513483300 ("memremap: change devm_memremap_pages interface...")
    Reviewed-by: "Jérôme Glisse"
    Reported-by: Logan Gunthorpe
    Reviewed-by: Logan Gunthorpe
    Reviewed-by: Christoph Hellwig
    Cc: Balbir Singh
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams