Eric Lee / smarc-fsl-linux-kernel

20 Jul, 2019

1 commit

933a90bf4 Merge branch 'work.mount0' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

Pull vfs mount updates from Al Viro:
"The first part of mount updates.

Convert filesystems to use the new mount API"

* 'work.mount0' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (63 commits)
mnt_init(): call shmem_init() unconditionally
constify ksys_mount() string arguments
don't bother with registering rootfs
init_rootfs(): don't bother with init_ramfs_fs()
vfs: Convert smackfs to use the new mount API
vfs: Convert selinuxfs to use the new mount API
vfs: Convert securityfs to use the new mount API
vfs: Convert apparmorfs to use the new mount API
vfs: Convert openpromfs to use the new mount API
vfs: Convert xenfs to use the new mount API
vfs: Convert gadgetfs to use the new mount API
vfs: Convert oprofilefs to use the new mount API
vfs: Convert ibmasmfs to use the new mount API
vfs: Convert qib_fs/ipathfs to use the new mount API
vfs: Convert efivarfs to use the new mount API
vfs: Convert configfs to use the new mount API
vfs: Convert binfmt_misc to use the new mount API
convenience helper: get_tree_single()
convenience helper get_tree_nodev()
vfs: Kill sget_userns()
...

Linus Torvalds
2019-07-20 01:42:02 +0800

19 Jul, 2019

2 commits

0fe49f70a Merge tag 'dax-for-5.3' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm

Pull dax updates from Dan Williams:
"The fruits of a bug hunt in the fsdax implementation with Willy and a
small feature update for device-dax:

- Fix a hang condition that started triggering after the Xarray
conversion of fsdax in the v4.20 kernel.

- Add a 'resource' (root-only physical base address) sysfs attribute
to device-dax instances to correlate memory-blocks onlined via the
kmem driver with a given device instance"

* tag 'dax-for-5.3' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
dax: Fix missed wakeup with PMD faults
device-dax: Add a 'resource' attribute

Linus Torvalds
2019-07-19 01:58:52 +0800
f8c3500cd Merge tag 'libnvdimm-for-5.3' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm

Pull libnvdimm updates from Dan Williams:
"Primarily just the virtio_pmem driver:

- virtio_pmem

The new virtio_pmem facility introduces a paravirtualized
persistent memory device that allows a guest VM to use DAX
mechanisms to access a host-file with host-page-cache. It arranges
for MAP_SYNC to be disabled and instead triggers a host fsync()
when a 'write-cache flush' command is sent to the virtual disk
device.

- Miscellaneous small fixups"

* tag 'libnvdimm-for-5.3' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
virtio_pmem: fix sparse warning
xfs: disable map_sync for async flush
ext4: disable map_sync for async flush
dax: check synchronous mapping is supported
dm: enable synchronous dax
libnvdimm: add dax_dev sync flag
virtio-pmem: Add virtio pmem driver
libnvdimm: nd_region flush callback support
libnvdimm, namespace: Drop uuid_t implementation detail

Linus Torvalds
2019-07-19 01:52:08 +0800

17 Jul, 2019

2 commits

9f960da72 device-dax: "Hotremove" persistent memory that is used like normal RAM

It is now allowed to use persistent memory like regular RAM, but
currently there is no way to remove this memory until the machine is
rebooted.

This work expands the functionality to also allow hotremoving
previously hotplugged persistent memory, and recovering the device for
use for other purposes.

To hotremove persistent memory, the management software must first
offline all memory blocks of the dax region, and then unbind it from
the device-dax/kmem driver. So, operations should look like this:

echo offline > /sys/devices/system/memory/memoryN/state
...
echo dax0.0 > /sys/bus/dax/drivers/kmem/unbind

Note: if unbind is done without offlining memory beforehand, it won't be
possible to do dax0.0 hotremove, and dax's memory is going to be part of
System RAM until reboot.
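
The ordering described above (offline every block first, unbind last) can be sketched as a small shell helper. This is only an illustration under stated assumptions: the function name dax_hotremove and the SYSFS_ROOT override are hypothetical and exist so the logic can be dry-run against a scratch directory, and the caller is assumed to already know which memoryN blocks belong to the device.

```shell
# Hypothetical helper, not part of the patch: offline the given memory
# blocks, then unbind the dax device from the kmem driver.
# SYSFS_ROOT is an assumed override for dry runs; it defaults to /sys.
SYSFS_ROOT="${SYSFS_ROOT:-/sys}"

# usage: dax_hotremove DEVICE BLOCK...   e.g. dax_hotremove dax0.0 32 33
dax_hotremove() {
    dev="$1"
    shift
    for blk in "$@"; do
        state="$SYSFS_ROOT/devices/system/memory/memory$blk/state"
        # Stop before unbinding if any block is missing or not writable.
        [ -w "$state" ] || return 1
        echo offline > "$state" || return 1
        # Confirm the block really went offline before moving on.
        [ "$(cat "$state")" = "offline" ] || return 1
    done
    # Only reached when every block is offline.
    echo "$dev" > "$SYSFS_ROOT/bus/dax/drivers/kmem/unbind"
}
```

Returning early on the first block that fails to offline keeps the device bound, which avoids the situation the note above warns about.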

Link: http://lkml.kernel.org/r/20190517215438.6487-4-pasha.tatashin@soleen.com
Signed-off-by: Pavel Tatashin
Reviewed-by: David Hildenbrand
Cc: James Morris
Cc: Sasha Levin
Cc: Michal Hocko
Cc: Dave Hansen
Cc: Dan Williams
Cc: Keith Busch
Cc: Vishal Verma
Cc: Dave Jiang
Cc: Ross Zwisler
Cc: Tom Lendacky
Cc: Huang Ying
Cc: Fengguang Wu
Cc: Borislav Petkov
Cc: Bjorn Helgaas
Cc: Yaowei Bai
Cc: Takashi Iwai
Cc: Jérôme Glisse
Cc: Dave Hansen
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Pavel Tatashin
2019-07-17 10:23:24 +0800
31e4ca92a device-dax: fix memory and resource leak if hotplug fails

Patch series ""Hotremove" persistent memory", v6.

Recently, support for using persistent memory like regular RAM was
added to Linux. This work extends this functionality to also allow hot
removing persistent memory.

We (Microsoft) have an important use case for this functionality.

The requirement is for physical machines with a small amount of RAM (~8G)
to be able to reboot in a very short period of time ( [...]
echo dax0.0 > /sys/bus/dax/drivers/device_dax/unbind
echo dax0.0 > /sys/bus/dax/drivers/kmem/new_id
echo online_movable > /sys/devices/system/memory/memoryXXX/state
4. Before reboot hotremove device-dax memory from System RAM
echo offline > /sys/devices/system/memory/memoryXXX/state
echo dax0.0 > /sys/bus/dax/drivers/kmem/unbind
5. Create raw pmem0 device
ndctl create-namespace --mode raw -e namespace0.0 -f
6. Copy the state that was stored by apps to ramdisk to pmem device
7. Do kexec reboot or reboot through firmware if firmware does not
zero memory in pmem0 region (These machines have only regular
volatile memory). So to have pmem0 device either memmap kernel
parameter is used, or device nodes in dtb are specified.

This patch (of 3):

When add_memory() fails, the resource and the memory should be freed.

Link: http://lkml.kernel.org/r/20190517215438.6487-2-pasha.tatashin@soleen.com
Fixes: c221c0b0308f ("device-dax: "Hotplug" persistent memory for use like normal RAM")
Signed-off-by: Pavel Tatashin
Reviewed-by: Dave Hansen
Cc: Bjorn Helgaas
Cc: Borislav Petkov
Cc: Dan Williams
Cc: Dave Hansen
Cc: Dave Jiang
Cc: David Hildenbrand
Cc: Fengguang Wu
Cc: Huang Ying
Cc: James Morris
Cc: Jérôme Glisse
Cc: Keith Busch
Cc: Michal Hocko
Cc: Ross Zwisler
Cc: Sasha Levin
Cc: Takashi Iwai
Cc: Tom Lendacky
Cc: Vishal Verma
Cc: Yaowei Bai
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Pavel Tatashin
2019-07-17 10:23:24 +0800

06 Jul, 2019

1 commit

fefc1d97f libnvdimm: add dax_dev sync flag

This patch adds a 'DAXDEV_SYNC' flag, which is set
for an nd_region doing synchronous flush. This is later
used to disable MAP_SYNC functionality on
ext4 & xfs filesystems for devices that don't support
synchronous flush.

Signed-off-by: Pankaj Gupta
Signed-off-by: Dan Williams

Pankaj Gupta
2019-07-06 06:19:10 +0800

03 Jul, 2019

4 commits

ea31d5859 device-dax: use the dev_pagemap internal refcount

The functionality is identical to the one currently open coded in
device-dax.

Signed-off-by: Christoph Hellwig
Reviewed-by: Ira Weiny
Reviewed-by: Dan Williams
Tested-by: Dan Williams
Signed-off-by: Jason Gunthorpe

Christoph Hellwig
2019-07-03 01:32:44 +0800
d8668bb04 memremap: pass a struct dev_pagemap to ->kill and ->cleanup

Passing the actual typed structure leads to more understandable code
vs just passing the ref member.

Reported-by: Logan Gunthorpe
Signed-off-by: Christoph Hellwig
Reviewed-by: Logan Gunthorpe
Reviewed-by: Jason Gunthorpe
Reviewed-by: Dan Williams
Tested-by: Dan Williams
Signed-off-by: Jason Gunthorpe

Christoph Hellwig
2019-07-03 01:32:44 +0800
1e240e8d4 memremap: move dev_pagemap callbacks into a separate structure

The dev_pagemap is growing too many callbacks. Move them into a
separate ops structure so that they are not duplicated for multiple
instances, and an attacker can't easily overwrite them.

Signed-off-by: Christoph Hellwig
Reviewed-by: Logan Gunthorpe
Reviewed-by: Jason Gunthorpe
Reviewed-by: Dan Williams
Tested-by: Dan Williams
Signed-off-by: Jason Gunthorpe

Christoph Hellwig
2019-07-03 01:32:44 +0800
3ed2dcdf5 memremap: validate the pagemap type passed to devm_memremap_pages

Most pgmap types are only supported when certain config options are
enabled. Check for a type that is valid for the current configuration
before setting up the pagemap. For this the usage of the 0 type for
device dax gets replaced with an explicit MEMORY_DEVICE_DEVDAX type.

Signed-off-by: Christoph Hellwig
Reviewed-by: Ira Weiny
Reviewed-by: Dan Williams
Tested-by: Dan Williams
Signed-off-by: Jason Gunthorpe

Christoph Hellwig
2019-07-03 01:32:44 +0800

21 Jun, 2019

1 commit

40cdc60ac device-dax: Add a 'resource' attribute

device-dax based devices were missing a 'resource' attribute to indicate
the physical address range contributed by the device in question. This
information is desirable to userspace tooling that may want to use the
dax device as system-ram, and wants to selectively hotplug and online
the memory blocks associated with a given device.

Without this, the tooling would have to parse /proc/iomem for the memory
ranges contributed by dax devices, which can be a workaround, but it is
far easier to provide this information in the sysfs hierarchy.
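
As a rough illustration of how tooling might use the attribute, the sketch below maps a device's physical range onto memory block indexes. It rests on several assumptions that are not stated in this commit: that 'resource' reads back as a 0x-prefixed hex base address, that a sibling 'size' attribute reports the length in decimal bytes, that /sys/devices/system/memory/block_size_bytes is hex without a prefix, and that memoryN indexes equal the physical address divided by the block size. The function name and SYSFS_ROOT override are hypothetical.

```shell
# Hypothetical helper: print the first and last memory block index
# covered by a device-dax instance. Reading 'resource' needs root.
# SYSFS_ROOT is an assumed override for dry runs; it defaults to /sys.
SYSFS_ROOT="${SYSFS_ROOT:-/sys}"

# usage: dax_memory_blocks DEVICE   e.g. dax_memory_blocks dax0.0
dax_memory_blocks() {
    devdir="$SYSFS_ROOT/bus/dax/devices/$1"
    base=$(cat "$devdir/resource") || return 1   # assumed: 0x-prefixed hex
    size=$(cat "$devdir/size") || return 1       # assumed: decimal bytes
    bs_hex=$(cat "$SYSFS_ROOT/devices/system/memory/block_size_bytes") || return 1
    bs=$(( 0x$bs_hex ))                          # assumed: hex, no prefix
    first=$(( $base / $bs ))
    last=$(( ($base + $size - 1) / $bs ))
    echo "$first $last"
}
```

The two numbers can then be turned into /sys/devices/system/memory/memoryN paths by the caller. The arithmetic assumes the range is aligned to the block size; an unaligned range would need its partial first and last blocks handled separately.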

Cc: Dave Hansen
Cc: Dan Williams
Signed-off-by: Vishal Verma
Signed-off-by: Dan Williams

Vishal Verma
2019-06-21 08:40:00 +0800

14 Jun, 2019

1 commit

50f44ee72 mm/devm_memremap_pages: fix final page put race

Logan noticed that devm_memremap_pages_release() kills the percpu_ref,
drops all the page references that were acquired at init, and then
immediately proceeds to unplug, arch_remove_memory(), the backing pages
for the pagemap. If for some reason device shutdown actually collides
with a busy / elevated-ref-count page then arch_remove_memory() should
be deferred until after that reference is dropped.

As it stands the "wait for last page ref drop" happens *after*
devm_memremap_pages_release() returns, which is obviously too late and
can lead to crashes.

Fix this situation by assigning the responsibility to wait for the
percpu_ref to go idle to devm_memremap_pages() with a new ->cleanup()
callback. Implement the new cleanup callback for all
devm_memremap_pages() users: pmem, devdax, hmm, and p2pdma.

Link: http://lkml.kernel.org/r/155727339156.292046.5432007428235387859.stgit@dwillia2-desk3.amr.corp.intel.com
Fixes: 41e94a851304 ("add devm_memremap_pages")
Signed-off-by: Dan Williams
Reported-by: Logan Gunthorpe
Reviewed-by: Ira Weiny
Reviewed-by: Logan Gunthorpe
Cc: Bjorn Helgaas
Cc: "Jérôme Glisse"
Cc: Christoph Hellwig
Cc: Greg Kroah-Hartman
Cc: "Rafael J. Wysocki"
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Dan Williams
2019-06-14 11:34:56 +0800

05 Jun, 2019

1 commit

5b497af42 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 295

Based on 1 normalized pattern(s):

this program is free software you can redistribute it and or modify
it under the terms of version 2 of the gnu general public license as
published by the free software foundation this program is
distributed in the hope that it will be useful but without any
warranty without even the implied warranty of merchantability or
fitness for a particular purpose see the gnu general public license
for more details

extracted by the scancode license scanner the SPDX license identifier

GPL-2.0-only

has been chosen to replace the boilerplate/reference in 64 file(s).

Signed-off-by: Thomas Gleixner
Reviewed-by: Alexios Zavras
Reviewed-by: Allison Randal
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190529141901.894819585@linutronix.de
Signed-off-by: Greg Kroah-Hartman

Thomas Gleixner
2019-06-05 23:36:38 +0800

26 May, 2019

3 commits

75d4e06f0 vfs: Convert dax to use the new mount API

Convert the dax filesystem to the new internal mount API as the old
one will be obsoleted and removed. This allows greater flexibility in
communication of mount parameters between userspace, the VFS and the
filesystem.

See Documentation/filesystems/mount_api.txt for more information.

Signed-off-by: David Howells
cc: Dan Williams
cc: Vishal Verma
cc: Keith Busch
cc: Dave Jiang
cc: linux-nvdimm@lists.01.org
Signed-off-by: Al Viro

David Howells
2019-05-26 06:06:12 +0800
1f58bb18f mount_pseudo(): drop 'name' argument, switch to d_make_root()

Once upon a time we used to set ->d_name of e.g. pipefs root
so that d_path() on pipes would work. These days it's
completely pointless - dentries of pipes are not even connected
to pipefs root. However, mount_pseudo() had set the root
dentry name (passed as the second argument) and callers
kept inventing names to pass to it. Including those that
didn't *have* any non-root dentries to start with...

All of that had been pointless for about 8 years now; it's
time to get rid of that cargo-culting...

Signed-off-by: Al Viro

Al Viro
2019-05-26 05:59:24 +0800
b2ad81363 Merge tag 'libnvdimm-fixes-5.2-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm

Pull libnvdimm fixes from Dan Williams:

- Fix a regression that disabled device-mapper dax support

- Remove unnecessary hardened-user-copy overhead (>30%) for dax
read(2)/write(2).

- Fix some compilation warnings.

* tag 'libnvdimm-fixes-5.2-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
libnvdimm/pmem: Bypass CONFIG_HARDENED_USERCOPY overhead
dax: Arrange for dax_supported check to span multiple devices
libnvdimm: Fix compilation warnings with W=1

Linus Torvalds
2019-05-26 01:11:23 +0800

21 May, 2019

3 commits

ec8f24b7f treewide: Add SPDX license identifier - Makefile/Kconfig

Add SPDX license identifiers to all Make/Kconfig files which:

- Have no license information of any form

These files fall under the project license, GPL v2 only. The resulting SPDX
license identifier is:

GPL-2.0-only

Signed-off-by: Thomas Gleixner
Signed-off-by: Greg Kroah-Hartman

Thomas Gleixner
2019-05-21 16:50:46 +0800
1a6e9e76b device-dax: Drop register_filesystem()

The device-dax fs is only there to allocate a common inode for each
device-node that refers to the same device by major:minor. It is
otherwise not user mountable and need not be displayed in
/proc/filesystems.

Reported-by: Al Viro
Acked-by: Al Viro
Signed-off-by: Dan Williams
Signed-off-by: Al Viro

Dan Williams
2019-05-21 15:23:41 +0800
7bf7eac8d dax: Arrange for dax_supported check to span multiple devices

Pankaj reports that starting with commit ad428cdb525a "dax: Check the
end of the block-device capacity with dax_direct_access()" device-mapper
no longer allows dax operation. This results from the stricter checks in
__bdev_dax_supported() that validate that the start and end of a
block-device map to the same 'pagemap' instance.

Teach the dax-core and device-mapper to validate the 'pagemap' on a
per-target basis. This is accomplished by refactoring the
bdev_dax_supported() internals into generic_fsdax_supported() which
takes a sector range to validate. Consequently generic_fsdax_supported()
is suitable to be used in a device-mapper ->iterate_devices() callback.
A new ->dax_supported() operation is added to allow composite devices to
split and route upper-level bdev_dax_supported() requests.

Fixes: ad428cdb525a ("dax: Check the end of the block-device...")
Cc:
Cc: Ira Weiny
Cc: Dave Jiang
Cc: Keith Busch
Cc: Matthew Wilcox
Cc: Vishal Verma
Cc: Heiko Carstens
Cc: Martin Schwidefsky
Reviewed-by: Jan Kara
Reported-by: Pankaj Gupta
Reviewed-by: Pankaj Gupta
Tested-by: Pankaj Gupta
Tested-by: Vaibhav Jain
Reviewed-by: Mike Snitzer
Signed-off-by: Dan Williams

Dan Williams
2019-05-21 06:02:08 +0800

16 May, 2019

1 commit

83f3ef3de Merge tag 'libnvdimm-fixes-5.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm

Pull libnvdimm updates from Dan Williams:
"Just a small collection of fixes this time around.

The new virtio-pmem driver is nearly ready, but some last minute
device-mapper acks and virtio questions made it prudent to await v5.3.

Other major topics that were brewing on the linux-nvdimm mailing list
like sub-section hotplug, and other devm_memremap_pages() reworks will
go upstream through Andrew's tree.

Summary:

- Fix a long standing namespace label corruption scenario when
re-provisioning capacity for a namespace.

- Restore the ability of the dax_pmem module to be built-in.

- Harden the build for the 'nfit_test' unit test modules so that the
userspace test harness can ensure all required test modules are
available"

* tag 'libnvdimm-fixes-5.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
drivers/dax: Allow to include DEV_DAX_PMEM as builtin
libnvdimm/namespace: Fix label tracking error
tools/testing/nvdimm: add watermarks for dax_pmem* modules
dax/pmem: Fix whitespace in dax_pmem

Linus Torvalds
2019-05-16 09:56:50 +0800

15 May, 2019

1 commit

fce86ff58 mm/huge_memory: fix vmf_insert_pfn_{pmd, pud}() crash, handle unaligned addresses

Starting with c6f3c5ee40c1 ("mm/huge_memory.c: fix modifying of page
protection by insert_pfn_pmd()") vmf_insert_pfn_pmd() internally calls
pmdp_set_access_flags(). That helper enforces a pmd aligned @address
argument via VM_BUG_ON() assertion.

Update the implementation to take a 'struct vm_fault' argument directly
and apply the address alignment fixup internally to fix crash signatures
like:

kernel BUG at arch/x86/mm/pgtable.c:515!
invalid opcode: 0000 [#1] SMP NOPTI
CPU: 51 PID: 43713 Comm: java Tainted: G OE 4.19.35 #1
[..]
RIP: 0010:pmdp_set_access_flags+0x48/0x50
[..]
Call Trace:
vmf_insert_pfn_pmd+0x198/0x350
dax_iomap_fault+0xe82/0x1190
ext4_dax_huge_fault+0x103/0x1f0
? __switch_to_asm+0x40/0x70
__handle_mm_fault+0x3f6/0x1370
? __switch_to_asm+0x34/0x70
? __switch_to_asm+0x40/0x70
handle_mm_fault+0xda/0x200
__do_page_fault+0x249/0x4f0
do_page_fault+0x32/0x110
? page_fault+0x8/0x30
page_fault+0x1e/0x30

Link: http://lkml.kernel.org/r/155741946350.372037.11148198430068238140.stgit@dwillia2-desk3.amr.corp.intel.com
Fixes: c6f3c5ee40c1 ("mm/huge_memory.c: fix modifying of page protection by insert_pfn_pmd()")
Signed-off-by: Dan Williams
Reported-by: Piotr Balcer
Tested-by: Yan Ma
Tested-by: Pankaj Gupta
Reviewed-by: Matthew Wilcox
Reviewed-by: Jan Kara
Reviewed-by: Aneesh Kumar K.V
Cc: Chandan Rajendra
Cc: Souptick Joarder
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Dan Williams
2019-05-15 00:47:44 +0800

07 May, 2019

1 commit

67476656f drivers/dax: Allow to include DEV_DAX_PMEM as builtin

This moves the dependency to DEV_DAX_PMEM_COMPAT such that the compat
support is allowed only if DEV_DAX_PMEM is built as a module.

This makes it easy to test the new code in an emulation setup where we
often build things without module support.

Cc:
Fixes: 730926c3b099 ("device-dax: Add /sys/class/dax backwards compatibility")
Signed-off-by: Aneesh Kumar K.V
Signed-off-by: Dan Williams

Aneesh Kumar K.V
2019-05-07 22:48:06 +0800

02 May, 2019

1 commit

53e228299 dax: make use of ->free_inode()

we might want to drop ->destroy_inode() there - it's used only for
WARN_ON() now, and AFAICS that could be moved to ->evict_inode()
if we had one...

Reviewed-by: Jan Kara
Acked-by: Dan Williams
Signed-off-by: Al Viro

Al Viro
2019-05-02 10:43:26 +0800

23 Apr, 2019

1 commit

d521fbaed dax/pmem: Fix whitespace in dax_pmem

A few lines were whitespace damaged, with spaces at the start instead of
tabs. This was noticed while debugging an nfit_test failure, so fix
them.

Cc: Dan Williams
Signed-off-by: Vishal Verma
Signed-off-by: Dan Williams

Vishal Verma
2019-04-23 06:56:20 +0800

17 Mar, 2019

1 commit

f67e3fb48 Merge tag 'devdax-for-5.1' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm

Pull device-dax updates from Dan Williams:
"New device-dax infrastructure to allow persistent memory and other
"reserved" / performance differentiated memories, to be assigned to
the core-mm as "System RAM".

Some users want to use persistent memory as additional volatile
memory. They are willing to cope with potential performance
differences, for example between DRAM and 3D Xpoint, and want to use
typical Linux memory management apis rather than a userspace memory
allocator layered over an mmap() of a dax file. The administration
model is to decide how much Persistent Memory (pmem) to use as System
RAM, create a device-dax-mode namespace of that size, and then assign
it to the core-mm. The rationale for device-dax is that it is a
generic memory-mapping driver that can be layered over any "special
purpose" memory, not just pmem. On subsequent boots udev rules can be
used to restore the memory assignment.

One implication of using pmem as RAM is that mlock() no longer keeps
data off persistent media. For this reason it is recommended to enable
NVDIMM Security (previously merged for 5.0) to encrypt pmem contents
at rest. We considered making this recommendation an actively enforced
requirement, but in the end decided to leave it as a distribution /
administrator policy to allow for emulation and test environments that
lack security capable NVDIMMs.

Summary:

- Replace the /sys/class/dax device model with /sys/bus/dax, and
include a compat driver so distributions can opt-in to the new ABI.

- Allow for an alternative driver for the device-dax address-range

- Introduce the 'kmem' driver to hotplug / assign a device-dax
address-range to the core-mm.

- Arrange for the device-dax target-node to be onlined so that the
newly added memory range can be uniquely referenced by numa apis"

NOTE! I'm not entirely happy with the whole "PMEM as RAM" model because
we currently have special - and very annoying - rules in the kernel about
accessing PMEM only with the "MC safe" accessors, because machine checks
inside the regular repeat string copy functions can be fatal in some
(not described) circumstances.

And apparently the PMEM modules can cause that a lot more than regular
RAM. The argument is that this happens because PMEM doesn't necessarily
get scrubbed at boot like RAM does, but that is planned to be added for
the user space tooling.

Quoting Dan from another email:
"The exposure can be reduced in the volatile-RAM case by scanning for
and clearing errors before it is onlined as RAM. The userspace tooling
for that can be in place before v5.1-final. There's also runtime
notifications of errors via acpi_nfit_uc_error_notify() from
background scrubbers on the DIMM devices. With that mechanism the
kernel could proactively clear newly discovered poison in the volatile
case, but that would be additional development more suitable for v5.2.

I understand the concern, and the need to highlight this issue by
tapping the brakes on feature development, but I don't see PMEM as RAM
making the situation worse when the exposure is also there via DAX in
the PMEM case. Volatile-RAM is arguably a safer use case since it's
possible to repair pages where the persistent case needs active
application coordination"

* tag 'devdax-for-5.1' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
device-dax: "Hotplug" persistent memory for use like normal RAM
mm/resource: Let walk_system_ram_range() search child resources
mm/memory-hotplug: Allow memory resources to be children
mm/resource: Move HMM pr_debug() deeper into resource code
mm/resource: Return real error codes from walk failures
device-dax: Add a 'modalias' attribute to DAX 'bus' devices
device-dax: Add a 'target_node' attribute
device-dax: Auto-bind device after successful new_id
acpi/nfit, device-dax: Identify differentiated memory with a unique numa-node
device-dax: Add /sys/class/dax backwards compatibility
device-dax: Add support for a dax override driver
device-dax: Move resource pinning+mapping into the common driver
device-dax: Introduce bus + driver model
device-dax: Start defining a dax bus model
device-dax: Remove multi-resource infrastructure
device-dax: Kill dax_region base
device-dax: Kill dax_region ida

Linus Torvalds
2019-03-17 04:05:32 +0800

01 Mar, 2019

1 commit

c221c0b03 device-dax: "Hotplug" persistent memory for use like normal RAM

This is intended for use with NVDIMMs that are physically persistent
(physically like flash) so that they can be used as a cost-effective
RAM replacement. Intel Optane DC persistent memory is one
implementation of this kind of NVDIMM.

Currently, a persistent memory region is "owned" by a device driver,
either the "Direct DAX" or "Filesystem DAX" drivers. These drivers
allow applications to explicitly use persistent memory, generally
by being modified to use special, new libraries. (DIMM-based
persistent memory hardware/software is described in great detail
here: Documentation/nvdimm/nvdimm.txt).

However, this limits persistent memory use to applications which
*have* been modified. To make it more broadly usable, this driver
"hotplugs" memory into the kernel, to be managed and used just like
normal RAM would be.

To make this work, management software must remove the device from
being controlled by the "Device DAX" infrastructure:

echo dax0.0 > /sys/bus/dax/drivers/device_dax/unbind

and then tell the new driver that it can bind to the device:

echo dax0.0 > /sys/bus/dax/drivers/kmem/new_id

After this, there will be a number of new memory sections visible
in sysfs that can be onlined, or that may get onlined by existing
udev-initiated memory hotplug rules.
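
The rebinding steps above can be sketched as a pair of shell helpers. The function names and the SYSFS_ROOT override are hypothetical and only there so the sequence can be dry-run against a scratch directory. The online_movable state value is taken from the hotremove series elsewhere in this log, not from this commit.

```shell
# Hypothetical helpers illustrating the rebind sequence described above.
# SYSFS_ROOT is an assumed override for dry runs; it defaults to /sys.
SYSFS_ROOT="${SYSFS_ROOT:-/sys}"

# usage: dax_to_kmem DEVICE   e.g. dax_to_kmem dax0.0
dax_to_kmem() {
    drivers="$SYSFS_ROOT/bus/dax/drivers"
    # Step 1: release the device from the default device_dax driver.
    echo "$1" > "$drivers/device_dax/unbind" || return 1
    # Step 2: tell the kmem driver it may bind to this device.
    echo "$1" > "$drivers/kmem/new_id" || return 1
}

# usage: online_memory_blocks BLOCK...   e.g. online_memory_blocks 32 33
online_memory_blocks() {
    for blk in "$@"; do
        state="$SYSFS_ROOT/devices/system/memory/memory$blk/state"
        [ -w "$state" ] || return 1
        echo online_movable > "$state" || return 1
    done
}
```

The second helper is only needed when no udev memory hotplug rule onlines the new sections automatically.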

This rebinding procedure is currently a one-way trip. Once memory
is bound to "kmem", it's there permanently and can not be
unbound and assigned back to device_dax.

The kmem driver will never bind to a dax device unless the device
is *explicitly* bound to the driver. There are two reasons for
this: One, since it is a one-way trip, it can not be undone if
bound incorrectly. Two, the kmem driver destroys data on the
device. Think of if you had good data on a pmem device. It
would be catastrophic if you compile-in "kmem", but leave out
the "device_dax" driver. kmem would take over the device and
write volatile data all over your good data.

This inherits any existing NUMA information for the newly-added
memory from the persistent memory device that came from the
firmware. On Intel platforms, the firmware has guarantees that
require each socket's persistent memory to be in a separate
memory-only NUMA node. That means that this patch is not expected
to create NUMA nodes, but will simply hotplug memory into existing
nodes.

Because NUMA nodes are created, the existing NUMA APIs and tools
are sufficient to create policies for applications or memory areas
to have affinity for or an aversion to using this memory.

There is currently some metadata at the beginning of pmem regions.
The section-size memory hotplug restrictions, plus this small
reserved area can cause the "loss" of a section or two of capacity.
This should be fixable in follow-on patches. But, as a first step,
losing 256MB of memory (worst case) out of hundreds of gigabytes
is a good tradeoff vs. the required code to fix this up precisely.
This calculation is also the reason we export
memory_block_size_bytes().

Signed-off-by: Dave Hansen
Reviewed-by: Dan Williams
Reviewed-by: Keith Busch
Cc: Dave Jiang
Cc: Ross Zwisler
Cc: Vishal Verma
Cc: Tom Lendacky
Cc: Andrew Morton
Cc: Michal Hocko
Cc: linux-nvdimm@lists.01.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: Huang Ying
Cc: Fengguang Wu
Cc: Borislav Petkov
Cc: Bjorn Helgaas
Cc: Yaowei Bai
Cc: Takashi Iwai
Cc: Jerome Glisse
Reviewed-by: Vishal Verma
Signed-off-by: Dan Williams

Dave Hansen
2019-03-01 02:41:23 +0800

28 Feb, 2019

1 commit

c347bd71d device-dax: Add a 'modalias' attribute to DAX 'bus' devices

Add a 'modalias' attribute to devices under the DAX bus so that userspace
is able to dynamically load modules as needed.

Normally, udev can get the modalias from 'uevent', and that is correctly
set up by the DAX bus. However other tooling such as 'libndctl' for
interacting with drivers/nvdimm/, and 'libdaxctl' for drivers/dax/ can
also use the modalias to dynamically load modules via libkmod lookups.

The 'nd' bus set up by the libnvdimm subsystem exports a modalias
attribute. Imitate this to export the same for the 'dax' bus.

Cc: Dave Hansen
Signed-off-by: Vishal Verma
Signed-off-by: Dan Williams

Vishal Verma
2019-02-28 13:03:48 +0800

21 Feb, 2019

2 commits

ad428cdb5 dax: Check the end of the block-device capacity with dax_direct_access()

The checks in __bdev_dax_supported() helped mitigate a potential data
corruption bug in the pmem driver's handling of section alignment
padding. Strengthen the checks, including checking the end of the range,
to validate the dev_pagemap, Xarray entries, and sector-to-pfn
translation established for pmem namespaces.

Acked-by: Jan Kara
Cc: "Darrick J. Wong"
Signed-off-by: Dan Williams

Dan Williams
2019-02-21 13:12:50 +0800
21c75763a device-dax: Add a 'target_node' attribute

The target-node attribute is the Linux numa-node that a device-dax
instance may create when it is online. Prior to being online the
device's 'numa_node' property reflects the closest online cpu node which
is the typical expectation of a device 'numa_node'. Once it is online it
becomes its own distinct numa node, i.e. 'target_node'.

Export the 'target_node' property to give userspace tooling the ability
to predict the effective numa-node from a device-dax instance configured
to provide 'System RAM' capacity.
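
A minimal sketch of how tooling might read the exported property follows. The function name and the SYSFS_ROOT override are hypothetical, and treating a negative value as "no target node" is an assumption rather than something this commit states.

```shell
# Hypothetical helper: print the numa node a device-dax instance would
# become once its memory is onlined as System RAM.
# SYSFS_ROOT is an assumed override for dry runs; it defaults to /sys.
SYSFS_ROOT="${SYSFS_ROOT:-/sys}"

# usage: dax_target_node DEVICE   e.g. dax_target_node dax0.0
dax_target_node() {
    node=$(cat "$SYSFS_ROOT/bus/dax/devices/$1/target_node") || return 1
    # Assumption: a negative value means no distinct node is available.
    [ "$node" -ge 0 ] || return 1
    echo "$node"
}
```

The printed node number could then be fed to an existing NUMA policy tool, for example as the argument to numactl --membind.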

Cc: Vishal Verma
Reported-by: Dave Hansen
Signed-off-by: Dan Williams

Dan Williams
2019-02-21 03:39:36 +0800

25 Jan, 2019

1 commit

664525b2d device-dax: Auto-bind device after successful new_id

The typical 'new_id' attribute behavior is to immediately attach a
device to its driver after a new device-id is added. Implement this
behavior for the dax bus.

Reported-by: Alexander Duyck
Reported-by: Brice Goglin
Cc: Dave Hansen
Signed-off-by: Dan Williams

Dan Williams
2019-01-25 05:12:04 +0800

07 Jan, 2019

9 commits

8fc5c7355 acpi/nfit, device-dax: Identify differentiated memory with a unique numa-node

Persistent memory, as described by the ACPI NFIT (NVDIMM Firmware
Interface Table), is the first known instance of a memory range
described by a unique "target" proximity domain. "Initiator" and
"target" proximity domains are an approach that the ACPI HMAT
(Heterogeneous Memory Attributes Table) uses to describe the unique
performance properties of a memory range relative to a given initiator
(e.g. CPU or DMA device).

Currently the numa-node for a /dev/pmemX block-device or /dev/daxX.Y
char-device follows the traditional notion of 'numa-node' where the
attribute conveys the closest online numa-node. That numa-node attribute
is useful for cpu-binding and memory-binding processes *near* the
device. However, when the memory range backing a 'pmem', or 'dax' device
is onlined (memory hot-add) the memory-only-numa-node representing that
address needs to be differentiated from the set of online nodes. In
other words, the numa-node association of the device depends on whether
you can bind processes *near* the cpu-numa-node in the offline
device-case, or bind processes *on* the memory-range directly after the
backing address range is onlined.

Allow for the case that platform firmware describes persistent memory
with a unique proximity domain, i.e. when it is distinct from the
proximity of DRAM and CPUs that are on the same socket. Plumb the Linux
numa-node translation of that proximity through the libnvdimm region
device to namespaces that are in device-dax mode. With this in place the
proposed kmem driver [1] can optionally discover a unique numa-node
number for the address range as it transitions the memory from an
offline state managed by a device-driver to an online memory range
managed by the core-mm.
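
The distinction between binding *near* and binding *on* the range can be sketched as a simple selection rule. This is an illustrative model only: the struct, its field names, and the use of -1 for "no unique proximity domain described" are assumptions, not the kernel's data structures.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical description of a pmem/dax memory range. */
struct model_range {
    int near_node;   /* closest online node, the traditional numa-node */
    int target_node; /* node unique to this memory, or -1 if firmware
                      * did not describe a distinct proximity domain */
};

/*
 * Pick the node a process would bind to:
 * - while the range is offline (driver-managed), bind *near* it;
 * - once the range is onlined, bind *on* its own memory-only node,
 *   if the platform described one.
 */
int model_bind_node(const struct model_range *r, bool memory_onlined)
{
    assert(r != NULL);
    if (memory_onlined && r->target_node >= 0)
        return r->target_node;
    return r->near_node;
}
```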

[1]: https://lore.kernel.org/lkml/20181022201317.8558C1D8@viggo.jf.intel.com

Reported-by: Fan Du
Cc: Michael Ellerman
Cc: "Oliver O'Halloran"
Cc: Dave Hansen
Cc: Jérôme Glisse
Reviewed-by: Yang Shi
Signed-off-by: Dan Williams

Dan Williams
2019-01-07 13:41:57 +0800

730926c3b device-dax: Add /sys/class/dax backwards compatibility

On the expectation that some environments may not upgrade libdaxctl
(userspace component that depends on the /sys/class/dax hierarchy),
provide a default / legacy dax_pmem_compat driver. The dax_pmem_compat
driver implements the original /sys/class/dax sysfs layout rather than
/sys/bus/dax. When userspace is upgraded it can blacklist this module
and switch to the dax_pmem driver going forward.

CONFIG_DEV_DAX_PMEM_COMPAT and supporting code will be deleted according
to the dax_pmem entry in Documentation/ABI/obsolete/.

Signed-off-by: Dan Williams

Dan Williams
2019-01-07 13:41:57 +0800

d200781ef device-dax: Add support for a dax override driver

Introduce the 'new_id' concept for enabling a custom device-driver attach
policy for dax-bus drivers. The intended use is to have a mechanism for
hot-plugging device-dax ranges into the page allocator on-demand. With
this in place the default policy of using device-dax for performance
differentiated memory can be overridden by user-space policy that can
arrange for the memory range to be managed as 'System RAM' with
user-defined NUMA and other performance attributes.

Signed-off-by: Dan Williams

Dan Williams
2019-01-07 13:41:55 +0800

89ec9f2cf device-dax: Move resource pinning+mapping into the common driver

Move the responsibility of calling devm_request_resource() and
devm_memremap_pages() into the common device-dax driver. This is another
preparatory step to allowing an alternate personality driver for a
device-dax range.

Signed-off-by: Dan Williams

Dan Williams
2019-01-07 13:26:21 +0800

9567da0b4 device-dax: Introduce bus + driver model

In support of multiple device-dax instances per device-dax-region and
allowing the 'kmem' driver to attach to dax-instances instead of the
current device-node access, convert the dax sub-system from a class to a
bus. Recall that the kmem driver takes reserved / special purpose
memories and assigns them to be managed by the core-mm.

Aside from the fact that device-dax instances are registered and probed
on a bus, two other lifetime-management changes are made:

1/ Delay attaching a cdev until driver probe time

2/ A new run_dax() helper is introduced to allow restoring dax-operation
after a kill_dax() event. So, at driver ->probe() time we run_dax()
and at ->remove() time we kill_dax() and invalidate all mappings.
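
The probe/remove pairing described in item 2 can be sketched as a userspace model. The names below are hypothetical stand-ins, not the kernel functions, and "invalidate all mappings" is reduced to resetting a counter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a dax device's liveness state. */
struct model_dax {
    bool alive;   /* may new mappings be established? */
    int mappings; /* number of currently established mappings */
};

/* Counterparts of the run/kill pair: toggle whether the device is usable. */
void model_run_dax(struct model_dax *d)  { d->alive = true; }
void model_kill_dax(struct model_dax *d) { d->alive = false; }

/* Driver probe: (re)enable operation, even after an earlier kill. */
int model_probe(struct model_dax *d)
{
    assert(d != NULL);
    model_run_dax(d);
    return 0;
}

/* Driver remove: stop new users, then invalidate existing mappings. */
void model_remove(struct model_dax *d)
{
    assert(d != NULL);
    model_kill_dax(d);
    d->mappings = 0;
}

/* A mapping attempt succeeds only while the device is alive. */
int model_map(struct model_dax *d)
{
    if (!d->alive)
        return -1;
    d->mappings++;
    return 0;
}
```

The point of the pairing is that a device can be unbound and rebound repeatedly: a kill is no longer permanent because the next probe restores operation.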

Signed-off-by: Dan Williams

Dan Williams
2019-01-07 13:24:46 +0800

51cf784c4 device-dax: Start defining a dax bus model

Towards eliminating the dax_class, move the dax-device-attribute
enabling to a new bus.c file in the core. This reduces the code
thrash of subsequent patches, as no logic changes are made here,
just pure code movement.

A temporary export of unregister_dev_dax() and dax_attribute_groups is
needed to preserve compilation, but those symbols become static again in
a follow-on patch.

Signed-off-by: Dan Williams

Dan Williams
2019-01-07 13:24:46 +0800

753a0850e device-dax: Remove multi-resource infrastructure

The multi-resource implementation anticipated discontiguous sub-division
support. That has not yet materialized, so delete the infrastructure and
related code.

Signed-off-by: Dan Williams

Dan Williams
2019-01-07 13:24:46 +0800

93694f963 device-dax: Kill dax_region base

Nothing consumes this attribute of a region and devres otherwise
remembers the value for de-allocation purposes.

Signed-off-by: Dan Williams

Dan Williams
2019-01-07 13:24:46 +0800

21b9e9795 device-dax: Kill dax_region ida

Commit bbb3be170ac2 "device-dax: fix sysfs duplicate warnings" arranged
for passing a dax instance-id to devm_create_dax_dev(), rather than
generating one internally. Remove the dax_region ida and related code.

Signed-off-by: Dan Williams

Dan Williams
2019-01-07 13:24:45 +0800

29 Dec, 2018

1 commit

a95c90f1e mm, devm_memremap_pages: fix shutdown handling

The last step before devm_memremap_pages() returns success is to allocate
a release action, devm_memremap_pages_release(), to tear the entire setup
down. However, the result from devm_add_action() is not checked.

Checking the error from devm_add_action() is not enough. The API
currently relies on the fact that the percpu_ref it is using is killed by
the time the devm_memremap_pages_release() is run. Rather than continue
this awkward situation, offload the responsibility of killing the
percpu_ref to devm_memremap_pages_release() directly. This allows
devm_memremap_pages() to do the right thing relative to init failures and
shutdown.

Without this change we could fail to register the teardown of
devm_memremap_pages(). The likelihood of hitting this failure is tiny as
small memory allocations almost always succeed. However, the impact of
the failure is large given any future reconfiguration, or disable/enable,
of an nvdimm namespace will fail forever as subsequent calls to
devm_memremap_pages() will fail to set up the pgmap_radix since there will
be stale entries for the physical address range.

An argument could be made to require that the ->kill() operation be set in
the @pgmap arg rather than passed in separately. However, it helps code
readability, tracking the lifetime of a given instance, to be able to grep
the kill routine directly at the devm_memremap_pages() call site.
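
The two ideas above (never ignore a failed teardown registration, and let the release routine own killing the reference) can be sketched in a userspace model. All names here are hypothetical; the kernel offers devm_add_action_or_reset() for the first idea, but this sketch does not claim to mirror how the commit is implemented.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_ACTIONS 4

typedef void (*action_fn)(void *data);

/* Hypothetical per-device list of teardown actions. */
struct action_list {
    action_fn fn[MAX_ACTIONS];
    void *data[MAX_ACTIONS];
    size_t count;
    bool fail_next; /* hook to simulate an allocation failure */
};

/* Queue a teardown action; returns 0 on success, -1 on failure. */
int add_action(struct action_list *l, action_fn fn, void *data)
{
    if (l->fail_next || l->count == MAX_ACTIONS)
        return -1;
    l->fn[l->count] = fn;
    l->data[l->count] = data;
    l->count++;
    return 0;
}

/* Like add_action(), but run the action right away if queuing fails. */
int add_action_or_reset(struct action_list *l, action_fn fn, void *data)
{
    int rc = add_action(l, fn, data);

    if (rc)
        fn(data);
    return rc;
}

/* Hypothetical mapping state guarded by a reference. */
struct mapping {
    bool ref_live;
    bool mapped;
};

/* Release kills the reference itself, then undoes the mapping. */
void mapping_release(void *data)
{
    struct mapping *m = data;

    assert(m != NULL);
    m->ref_live = false;
    m->mapped = false;
}

/* Setup never leaves a mapping behind without a registered teardown. */
int mapping_setup(struct action_list *l, struct mapping *m)
{
    m->ref_live = true;
    m->mapped = true;
    return add_action_or_reset(l, mapping_release, m);
}

/* Run queued actions in reverse order, as on device unbind. */
void run_actions(struct action_list *l)
{
    while (l->count) {
        l->count--;
        l->fn[l->count](l->data[l->count]);
    }
}
```

In the failure path, mapping_setup() returns an error with the mapping already torn down, so a later retry does not find stale state for the same range.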

Link: http://lkml.kernel.org/r/154275558526.76910.7535251937849268605.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams
Fixes: e8d513483300 ("memremap: change devm_memremap_pages interface...")
Reviewed-by: "Jérôme Glisse"
Reported-by: Logan Gunthorpe
Reviewed-by: Logan Gunthorpe
Reviewed-by: Christoph Hellwig
Cc: Balbir Singh
Cc: Michal Hocko
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Dan Williams
2018-12-29 04:11:47 +0800