17 Oct, 2020
2 commits
-
We soon want to pass flags, e.g., to mark added System RAM resources
mergeable. Prepare for that.

This patch is based on a similar patch by Oscar Salvador:
https://lkml.kernel.org/r/20190625075227.15193-3-osalvador@suse.de
Signed-off-by: David Hildenbrand
Signed-off-by: Andrew Morton
Reviewed-by: Juergen Gross # Xen related part
Reviewed-by: Pankaj Gupta
Acked-by: Wei Liu
Cc: Michal Hocko
Cc: Dan Williams
Cc: Jason Gunthorpe
Cc: Baoquan He
Cc: Michael Ellerman
Cc: Benjamin Herrenschmidt
Cc: Paul Mackerras
Cc: "Rafael J. Wysocki"
Cc: Len Brown
Cc: Greg Kroah-Hartman
Cc: Vishal Verma
Cc: Dave Jiang
Cc: "K. Y. Srinivasan"
Cc: Haiyang Zhang
Cc: Stephen Hemminger
Cc: Wei Liu
Cc: Heiko Carstens
Cc: Vasily Gorbik
Cc: Christian Borntraeger
Cc: David Hildenbrand
Cc: "Michael S. Tsirkin"
Cc: Jason Wang
Cc: Boris Ostrovsky
Cc: Stefano Stabellini
Cc: "Oliver O'Halloran"
Cc: Pingfan Liu
Cc: Nathan Lynch
Cc: Libor Pechacek
Cc: Anton Blanchard
Cc: Leonardo Bras
Cc: Ard Biesheuvel
Cc: Eric Biederman
Cc: Julien Grall
Cc: Kees Cook
Cc: Roger Pau Monné
Cc: Thomas Gleixner
Cc: Wei Yang
Link: https://lkml.kernel.org/r/20200911103459.10306-5-david@redhat.com
Signed-off-by: Linus Torvalds
-
The conversion to request_mem_region() is broken because it assumes that
the range is marked busy prior to release. However, due to the way that
the kmem driver manipulates the IORESOURCE_BUSY flag (clears it to let
{add,remove}_memory() handle busy) it requires a manual release_resource()
to perform cleanup.

Given that the actual 'struct resource *' needs to be recalled, not just
the range, add that tracking to the kmem driver-data.

Fixes: 0513bd5bb114 ("device-dax/kmem: replace release_resource() with release_mem_region()")
Reported-by: David Hildenbrand
Signed-off-by: Dan Williams
Signed-off-by: Andrew Morton
Reviewed-by: David Hildenbrand
Cc: Vishal Verma
Cc: Dave Hansen
Cc: Pavel Tatashin
Cc: Brice Goglin
Cc: Dave Jiang
Cc: Ira Weiny
Cc: Jia He
Cc: Joao Martins
Cc: Jonathan Cameron
Link: https://lkml.kernel.org/r/160272252925.3136502.17220638073995895400.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Linus Torvalds
14 Oct, 2020
6 commits
-
Break the requirement that device-dax instances are physically contiguous.
Removing this constraint allows fragmented available capacity to be fully
allocated.

This capability is useful to mitigate the "noisy neighbor" problem with
memory-side-cache management for virtual machines, or any other scenario
where a platform address boundary also designates a performance boundary.
For example, a direct-mapped memory-side cache might rotate cache colors at
1GB boundaries. With discontiguous allocations, a device-dax instance
could be configured to contain only 1 cache color.

It also satisfies Joao's use case (see link) for partitioning memory for
exclusive guest access. It allows for a future potential mode where the
host kernel need not allocate 'struct page' capacity up-front.

Reported-by: Joao Martins
Signed-off-by: Dan Williams
Signed-off-by: Andrew Morton
Cc: Andy Lutomirski
Cc: Ard Biesheuvel
Cc: Benjamin Herrenschmidt
Cc: Ben Skeggs
Cc: Bjorn Helgaas
Cc: Borislav Petkov
Cc: Boris Ostrovsky
Cc: Brice Goglin
Cc: Catalin Marinas
Cc: Daniel Vetter
Cc: Dave Hansen
Cc: Dave Jiang
Cc: David Airlie
Cc: David Hildenbrand
Cc: Greg Kroah-Hartman
Cc: "H. Peter Anvin"
Cc: Hulk Robot
Cc: Ingo Molnar
Cc: Ira Weiny
Cc: Jason Gunthorpe
Cc: Jason Yan
Cc: Jeff Moyer
Cc: "Jérôme Glisse"
Cc: Jia He
Cc: Jonathan Cameron
Cc: Juergen Gross
Cc: kernel test robot
Cc: Michael Ellerman
Cc: Mike Rapoport
Cc: Paul Mackerras
Cc: Pavel Tatashin
Cc: Peter Zijlstra
Cc: "Rafael J. Wysocki"
Cc: Randy Dunlap
Cc: Stefano Stabellini
Cc: Thomas Gleixner
Cc: Tom Lendacky
Cc: Vishal Verma
Cc: Vivek Goyal
Cc: Wei Yang
Cc: Will Deacon
Link: https://lore.kernel.org/lkml/20200110190313.17144-1-joao.m.martins@oracle.com/
Link: https://lkml.kernel.org/r/159643104304.4062302.16561669534797528660.stgit@dwillia2-desk3.amr.corp.intel.com
Link: https://lkml.kernel.org/r/160106116875.30709.11456649969327399771.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Linus Torvalds
-
In preparation for introducing seed devices the dax-bus core needs to be
able to intercept ->probe() and ->remove() operations. Towards that end
arrange for the bus and drivers to switch from raw 'struct device' driver
operations to 'struct dev_dax' typed operations.

Reported-by: Hulk Robot
Signed-off-by: Dan Williams
Signed-off-by: Andrew Morton
Cc: Jason Yan
Cc: Vishal Verma
Cc: Brice Goglin
Cc: Dave Hansen
Cc: Dave Jiang
Cc: David Hildenbrand
Cc: Ira Weiny
Cc: Jia He
Cc: Joao Martins
Cc: Jonathan Cameron
Cc: Andy Lutomirski
Cc: Ard Biesheuvel
Cc: Benjamin Herrenschmidt
Cc: Ben Skeggs
Cc: Bjorn Helgaas
Cc: Borislav Petkov
Cc: Boris Ostrovsky
Cc: Catalin Marinas
Cc: Daniel Vetter
Cc: David Airlie
Cc: Greg Kroah-Hartman
Cc: "H. Peter Anvin"
Cc: Ingo Molnar
Cc: Jason Gunthorpe
Cc: Jeff Moyer
Cc: "Jérôme Glisse"
Cc: Juergen Gross
Cc: kernel test robot
Cc: Michael Ellerman
Cc: Mike Rapoport
Cc: Paul Mackerras
Cc: Pavel Tatashin
Cc: Peter Zijlstra
Cc: "Rafael J. Wysocki"
Cc: Randy Dunlap
Cc: Stefano Stabellini
Cc: Thomas Gleixner
Cc: Tom Lendacky
Cc: Vivek Goyal
Cc: Wei Yang
Cc: Will Deacon
Link: https://lkml.kernel.org/r/160106113357.30709.4541750544799737855.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Linus Torvalds
-
Towards removing the mode specific @dax_kmem_res attribute from the
generic 'struct dev_dax', and preparing for multi-range support, change
the kmem driver to use the idiomatic release_mem_region() to pair with the
initial request_mem_region(). This also eliminates the need to open code
the release of the resource allocated by request_mem_region().

As there are no more dax_kmem_res users, delete this struct member.

Signed-off-by: Dan Williams
Signed-off-by: Andrew Morton
Cc: David Hildenbrand
Cc: Vishal Verma
Cc: Dave Hansen
Cc: Pavel Tatashin
Cc: Brice Goglin
Cc: Dave Jiang
Cc: Ira Weiny
Cc: Jia He
Cc: Joao Martins
Cc: Jonathan Cameron
Cc: Andy Lutomirski
Cc: Ard Biesheuvel
Cc: Benjamin Herrenschmidt
Cc: Ben Skeggs
Cc: Bjorn Helgaas
Cc: Borislav Petkov
Cc: Boris Ostrovsky
Cc: Catalin Marinas
Cc: Daniel Vetter
Cc: David Airlie
Cc: Greg Kroah-Hartman
Cc: "H. Peter Anvin"
Cc: Hulk Robot
Cc: Ingo Molnar
Cc: Jason Gunthorpe
Cc: Jason Yan
Cc: Jeff Moyer
Cc: "Jérôme Glisse"
Cc: Juergen Gross
Cc: kernel test robot
Cc: Michael Ellerman
Cc: Mike Rapoport
Cc: Paul Mackerras
Cc: Peter Zijlstra
Cc: "Rafael J. Wysocki"
Cc: Randy Dunlap
Cc: Stefano Stabellini
Cc: Thomas Gleixner
Cc: Tom Lendacky
Cc: Vivek Goyal
Cc: Wei Yang
Cc: Will Deacon
Link: https://lkml.kernel.org/r/160106112239.30709.15909567572288425294.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Linus Torvalds
-
Towards removing the mode specific @dax_kmem_res attribute from the
generic 'struct dev_dax', and preparing for multi-range support, move
resource name tracking to driver data. The memory for the resource name
needs to have its own lifetime separate from the device bind lifetime for
cases where the driver is unbound, but the kmem range could not be
unplugged from the page allocator.

Signed-off-by: Dan Williams
Signed-off-by: Andrew Morton
Cc: David Hildenbrand
Cc: Vishal Verma
Cc: Dave Hansen
Cc: Pavel Tatashin
Cc: Brice Goglin
Cc: Dave Jiang
Cc: Ira Weiny
Cc: Jia He
Cc: Joao Martins
Cc: Jonathan Cameron
Cc: Andy Lutomirski
Cc: Ard Biesheuvel
Cc: Benjamin Herrenschmidt
Cc: Ben Skeggs
Cc: Bjorn Helgaas
Cc: Borislav Petkov
Cc: Boris Ostrovsky
Cc: Catalin Marinas
Cc: Daniel Vetter
Cc: David Airlie
Cc: Greg Kroah-Hartman
Cc: "H. Peter Anvin"
Cc: Hulk Robot
Cc: Ingo Molnar
Cc: Jason Gunthorpe
Cc: Jason Yan
Cc: Jeff Moyer
Cc: "Jérôme Glisse"
Cc: Juergen Gross
Cc: kernel test robot
Cc: Michael Ellerman
Cc: Mike Rapoport
Cc: Paul Mackerras
Cc: Peter Zijlstra
Cc: "Rafael J. Wysocki"
Cc: Randy Dunlap
Cc: Stefano Stabellini
Cc: Thomas Gleixner
Cc: Tom Lendacky
Cc: Vivek Goyal
Cc: Wei Yang
Cc: Will Deacon
Link: https://lkml.kernel.org/r/160106111639.30709.17624822766862009183.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Linus Torvalds
-
Towards removing the mode specific @dax_kmem_res attribute from the
generic 'struct dev_dax', and preparing for multi-range support, teach the
driver to calculate the hotplug range from the device range. The hotplug
range is the trivially calculated memory-block-size aligned version of the
device range.

Signed-off-by: Dan Williams
Signed-off-by: Andrew Morton
Cc: David Hildenbrand
Cc: Vishal Verma
Cc: Dave Hansen
Cc: Pavel Tatashin
Cc: Brice Goglin
Cc: Dave Jiang
Cc: Ira Weiny
Cc: Jia He
Cc: Joao Martins
Cc: Jonathan Cameron
Cc: Andy Lutomirski
Cc: Ard Biesheuvel
Cc: Benjamin Herrenschmidt
Cc: Ben Skeggs
Cc: Bjorn Helgaas
Cc: Borislav Petkov
Cc: Boris Ostrovsky
Cc: Catalin Marinas
Cc: Daniel Vetter
Cc: David Airlie
Cc: Greg Kroah-Hartman
Cc: "H. Peter Anvin"
Cc: Hulk Robot
Cc: Ingo Molnar
Cc: Jason Gunthorpe
Cc: Jason Yan
Cc: Jeff Moyer
Cc: "Jérôme Glisse"
Cc: Juergen Gross
Cc: kernel test robot
Cc: Michael Ellerman
Cc: Mike Rapoport
Cc: Paul Mackerras
Cc: Peter Zijlstra
Cc: "Rafael J. Wysocki"
Cc: Randy Dunlap
Cc: Stefano Stabellini
Cc: Thomas Gleixner
Cc: Tom Lendacky
Cc: Vivek Goyal
Cc: Wei Yang
Cc: Will Deacon
Link: https://lkml.kernel.org/r/160106111109.30709.3173462396758431559.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Linus Torvalds
-
The passed in dev_pagemap is only required in the pmem case as the
libnvdimm core may have reserved a vmem_altmap for devm_memremap_pages() to
place the memmap in pmem directly. In the hmem case there is no agent
reserving an altmap so it can all be handled by a core internal default.

Pass the resource range via a new @range property of 'struct
dev_dax_data'.

Signed-off-by: Dan Williams
Signed-off-by: Andrew Morton
Cc: David Hildenbrand
Cc: Vishal Verma
Cc: Dave Hansen
Cc: Pavel Tatashin
Cc: Brice Goglin
Cc: Dave Jiang
Cc: Ira Weiny
Cc: Jia He
Cc: Joao Martins
Cc: Jonathan Cameron
Cc: Andy Lutomirski
Cc: Ard Biesheuvel
Cc: Benjamin Herrenschmidt
Cc: Ben Skeggs
Cc: Bjorn Helgaas
Cc: Borislav Petkov
Cc: Boris Ostrovsky
Cc: Catalin Marinas
Cc: Daniel Vetter
Cc: David Airlie
Cc: Greg Kroah-Hartman
Cc: "H. Peter Anvin"
Cc: Hulk Robot
Cc: Ingo Molnar
Cc: Jason Gunthorpe
Cc: Jason Yan
Cc: Jeff Moyer
Cc: "Jérôme Glisse"
Cc: Juergen Gross
Cc: kernel test robot
Cc: Michael Ellerman
Cc: Mike Rapoport
Cc: Paul Mackerras
Cc: Peter Zijlstra
Cc: "Rafael J. Wysocki"
Cc: Randy Dunlap
Cc: Stefano Stabellini
Cc: Thomas Gleixner
Cc: Tom Lendacky
Cc: Vivek Goyal
Cc: Wei Yang
Cc: Will Deacon
Link: https://lkml.kernel.org/r/159643099958.4062302.10379230791041872886.stgit@dwillia2-desk3.amr.corp.intel.com
Link: https://lkml.kernel.org/r/160106110513.30709.4303239334850606031.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Linus Torvalds
05 Jun, 2020
1 commit
-
Currently, when adding memory, we create entries in /sys/firmware/memmap/
as "System RAM". This leads kexec-tools to add that memory to the
fixed-up initial memmap for a kexec kernel (loaded via kexec_load()). The
memory will be considered initial System RAM by the kexec'd kernel and can
no longer be reconfigured. This is not what happens during a real reboot.

Let's add our memory via add_memory_driver_managed() now, so we won't
create entries in /sys/firmware/memmap/ and indicate the memory as "System
RAM (kmem)" in /proc/iomem. This allows everybody (especially
kexec-tools) to identify that this memory is special and has to be treated
differently than ordinary (hotplugged) System RAM.

Before configuring the namespace:
[root@localhost ~]# cat /proc/iomem
...
140000000-33fffffff : Persistent Memory
140000000-33fffffff : namespace0.0
3280000000-32ffffffff : PCI Bus 0000:00

After configuring the namespace:
[root@localhost ~]# cat /proc/iomem
...
140000000-33fffffff : Persistent Memory
140000000-1481fffff : namespace0.0
148200000-33fffffff : dax0.0
3280000000-32ffffffff : PCI Bus 0000:00

After loading kmem before this change:
[root@localhost ~]# cat /proc/iomem
...
140000000-33fffffff : Persistent Memory
140000000-1481fffff : namespace0.0
150000000-33fffffff : dax0.0
150000000-33fffffff : System RAM
3280000000-32ffffffff : PCI Bus 0000:00

After loading kmem after this change:
[root@localhost ~]# cat /proc/iomem
...
140000000-33fffffff : Persistent Memory
140000000-1481fffff : namespace0.0
150000000-33fffffff : dax0.0
150000000-33fffffff : System RAM (kmem)
3280000000-32ffffffff : PCI Bus 0000:00

After a proper reboot:
[root@localhost ~]# cat /proc/iomem
...
140000000-33fffffff : Persistent Memory
140000000-1481fffff : namespace0.0
148200000-33fffffff : dax0.0
3280000000-32ffffffff : PCI Bus 0000:00

Within the kexec kernel before this change:
[root@localhost ~]# cat /proc/iomem
...
140000000-33fffffff : Persistent Memory
140000000-1481fffff : namespace0.0
150000000-33fffffff : System RAM
3280000000-32ffffffff : PCI Bus 0000:00

Within the kexec kernel after this change:
[root@localhost ~]# cat /proc/iomem
...
140000000-33fffffff : Persistent Memory
140000000-1481fffff : namespace0.0
148200000-33fffffff : dax0.0
3280000000-32ffffffff : PCI Bus 0000:00

/sys/firmware/memmap/ before this change:
0000000000000000-000000000009fc00 (System RAM)
000000000009fc00-00000000000a0000 (Reserved)
00000000000f0000-0000000000100000 (Reserved)
0000000000100000-00000000bffdf000 (System RAM)
00000000bffdf000-00000000c0000000 (Reserved)
00000000feffc000-00000000ff000000 (Reserved)
00000000fffc0000-0000000100000000 (Reserved)
0000000100000000-0000000140000000 (System RAM)
0000000150000000-0000000340000000 (System RAM)

/sys/firmware/memmap/ after a proper reboot:
0000000000000000-000000000009fc00 (System RAM)
000000000009fc00-00000000000a0000 (Reserved)
00000000000f0000-0000000000100000 (Reserved)
0000000000100000-00000000bffdf000 (System RAM)
00000000bffdf000-00000000c0000000 (Reserved)
00000000feffc000-00000000ff000000 (Reserved)
00000000fffc0000-0000000100000000 (Reserved)
0000000100000000-0000000140000000 (System RAM)

/sys/firmware/memmap/ after this change:
0000000000000000-000000000009fc00 (System RAM)
000000000009fc00-00000000000a0000 (Reserved)
00000000000f0000-0000000000100000 (Reserved)
0000000000100000-00000000bffdf000 (System RAM)
00000000bffdf000-00000000c0000000 (Reserved)
00000000feffc000-00000000ff000000 (Reserved)
00000000fffc0000-0000000100000000 (Reserved)
0000000100000000-0000000140000000 (System RAM)

kexec-tools already seem to basically ignore any System RAM that's not on
top level when searching for areas to place kexec images - but also for
determining crash areas to dump via kdump. Changing the resource name
won't have an impact.

Handle unloading of the driver after memory hotremove failed properly, by
duplicating the string if necessary.

Signed-off-by: David Hildenbrand
Signed-off-by: Andrew Morton
Acked-by: Pankaj Gupta
Cc: Michal Hocko
Cc: Pankaj Gupta
Cc: Wei Yang
Cc: Baoquan He
Cc: Dave Hansen
Cc: Eric Biederman
Cc: Pavel Tatashin
Cc: Dan Williams
Link: http://lkml.kernel.org/r/20200508084217.9160-5-david@redhat.com
Signed-off-by: Linus Torvalds
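The naming change above lets a tool that parses /proc/iomem tell driver-managed memory apart from ordinary System RAM. A minimal userspace sketch of such a check follows. The helper names are hypothetical, and the assumption that driver-managed entries always look like "System RAM (<driver>)" is inferred from the "System RAM (kmem)" example in this log; this is not how kexec-tools is actually implemented.

```c
#include <stdbool.h>
#include <string.h>

/*
 * Return the resource name of one /proc/iomem line, i.e. the text after
 * " : ", or NULL if the separator is missing. The line is expected
 * without a trailing newline.
 */
static const char *iomem_name(const char *line)
{
    const char *sep = strstr(line, " : ");

    return sep ? sep + 3 : NULL;
}

/* True for entries named exactly "System RAM". */
static bool is_plain_system_ram(const char *line)
{
    const char *name = iomem_name(line);

    return name && strcmp(name, "System RAM") == 0;
}

/*
 * True for entries named "System RAM (<something>)", for example
 * "System RAM (kmem)". Assumption: this pattern marks driver-managed memory.
 */
static bool is_driver_managed_ram(const char *line)
{
    const char *name = iomem_name(line);
    size_t len;

    if (!name)
        return false;
    len = strlen(name);
    /* "System RAM (" is 12 characters; require at least one more plus ')'. */
    return strncmp(name, "System RAM (", 12) == 0 &&
           len > 13 && name[len - 1] == ')';
}
```

A caller would typically read /proc/iomem line by line, strip the newline and leading indentation, and skip every range for which is_driver_managed_ram() returns true when building a memory map for the next kernel.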
24 May, 2020
1 commit
-
Assume we have kmem configured and loaded:
[root@localhost ~]# cat /proc/iomem
...
140000000-33fffffff : Persistent Memory
140000000-1481fffff : namespace0.0
150000000-33fffffff : dax0.0
150000000-33fffffff : System RAM

Assume we try to unload kmem. This force-unloading will work, even if
memory cannot get removed from the system.

[root@localhost ~]# rmmod kmem
[ 86.380228] removing memory fails, because memory [0x0000000150000000-0x0000000157ffffff] is onlined
...
[ 86.431225] kmem dax0.0: DAX region [mem 0x150000000-0x33fffffff] cannot be hotremoved until the next reboot

Now, we can reconfigure the namespace:

[root@localhost ~]# ndctl create-namespace --force --reconfig=namespace0.0 --mode=devdax
[ 131.409351] nd_pmem namespace0.0: could not reserve region [mem 0x140000000-0x33fffffff]
[ 131.410147] nd_pmem: probe of namespace0.0 failed with error -16
...

This fails as expected due to the busy memory resource, and the memory
cannot be used. However, the dax0.0 device is removed, and along with it
its name.

The name of the memory resource now points at freed memory (name of the
device):

[root@localhost ~]# cat /proc/iomem
...
140000000-33fffffff : Persistent Memory
140000000-1481fffff : namespace0.0
150000000-33fffffff : �_�^7_��/_��wR��WQ���^��� ...
150000000-33fffffff : System RAM

We have to make sure to duplicate the string. While at it, remove the
superfluous setting of the name and fixup a stale comment.

Fixes: 9f960da72b25 ("device-dax: "Hotremove" persistent memory that is used like normal RAM")
Signed-off-by: David Hildenbrand
Signed-off-by: Andrew Morton
Cc: Dan Williams
Cc: Vishal Verma
Cc: Dave Jiang
Cc: Pavel Tatashin
Cc: Andrew Morton
Cc: [5.3]
Link: http://lkml.kernel.org/r/20200508084217.9160-2-david@redhat.com
Signed-off-by: Linus Torvalds
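The dangling-name problem described above comes down to string ownership: the resource borrowed a pointer to the device name, and the device went away first. A minimal userspace sketch of the fix idea follows. The struct and function names are hypothetical stand-ins, not the kernel's 'struct resource' API, and malloc()/free() stand in for the kernel allocator.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a resource that carries a name pointer. */
struct fake_resource {
    const char *name;
};

/*
 * Give the resource its own copy of the device name, so that the name
 * stays valid after the device (and its name buffer) is gone.
 * Returns 0 on success, -1 if the allocation fails.
 */
static int fake_resource_set_name(struct fake_resource *res,
                                  const char *dev_name)
{
    size_t len = strlen(dev_name) + 1;
    char *copy = malloc(len);

    if (!copy)
        return -1;
    memcpy(copy, dev_name, len);
    res->name = copy;
    return 0;
}

/* Free the private copy once the resource itself is released. */
static void fake_resource_release(struct fake_resource *res)
{
    free((char *)res->name);
    res->name = NULL;
}
```

The copy costs a few bytes per binding, but it decouples the lifetime of the name from the lifetime of the device, which matters in exactly the case shown above, where the driver is unbound while the memory cannot be removed.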
17 Jul, 2019
2 commits
-
It is now allowed to use persistent memory like regular RAM, but
currently there is no way to remove this memory until the machine is
rebooted.

This work expands the functionality to also allow hotremoving
previously hotplugged persistent memory, and recovering the device for use
for other purposes.

To hotremove persistent memory, the management software must first
offline all memory blocks of the dax region, and then unbind it from the
device-dax/kmem driver. So, operations should look like this:

echo offline > /sys/devices/system/memory/memoryN/state
...
echo dax0.0 > /sys/bus/dax/drivers/kmem/unbind

Note: if unbind is done without offlining memory beforehand, it won't be
possible to do dax0.0 hotremove, and dax's memory is going to be part of
System RAM until reboot.

Link: http://lkml.kernel.org/r/20190517215438.6487-4-pasha.tatashin@soleen.com
Signed-off-by: Pavel Tatashin
Reviewed-by: David Hildenbrand
Cc: James Morris
Cc: Sasha Levin
Cc: Michal Hocko
Cc: Dave Hansen
Cc: Dan Williams
Cc: Keith Busch
Cc: Vishal Verma
Cc: Dave Jiang
Cc: Ross Zwisler
Cc: Tom Lendacky
Cc: Huang Ying
Cc: Fengguang Wu
Cc: Borislav Petkov
Cc: Bjorn Helgaas
Cc: Yaowei Bai
Cc: Takashi Iwai
Cc: Jérôme Glisse
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
Patch series ""Hotremove" persistent memory", v6.

Recently, support for using persistent memory like regular RAM was
added to Linux. This work extends this functionality to also allow hot
removing persistent memory.

We (Microsoft) have an important use case for this functionality.
The requirement is for physical machines with a small amount of RAM (~8G)
to be able to reboot in a very short period of time ( /sys/bus/dax/drivers/device_dax/unbind
echo dax0.0 > /sys/bus/dax/drivers/kmem/new_id
echo online_movable > /sys/devices/system/memory/memoryXXX/state
4. Before reboot, hotremove device-dax memory from System RAM
echo offline > /sys/devices/system/memory/memoryXXX/state
echo dax0.0 > /sys/bus/dax/drivers/kmem/unbind
5. Create raw pmem0 device
ndctl create-namespace --mode raw -e namespace0.0 -f
6. Copy the state that was stored by apps to ramdisk to pmem device
7. Do kexec reboot or reboot through firmware if firmware does not
zero memory in pmem0 region (These machines have only regular
volatile memory). So, to have a pmem0 device, either the memmap kernel
parameter is used, or device nodes in the dtb are specified.

This patch (of 3):

When add_memory() fails, the resource and the memory should be freed.

Link: http://lkml.kernel.org/r/20190517215438.6487-2-pasha.tatashin@soleen.com
Fixes: c221c0b0308f ("device-dax: "Hotplug" persistent memory for use like normal RAM")
Signed-off-by: Pavel Tatashin
Reviewed-by: Dave Hansen
Cc: Bjorn Helgaas
Cc: Borislav Petkov
Cc: Dan Williams
Cc: Dave Hansen
Cc: Dave Jiang
Cc: David Hildenbrand
Cc: Fengguang Wu
Cc: Huang Ying
Cc: James Morris
Cc: Jérôme Glisse
Cc: Keith Busch
Cc: Michal Hocko
Cc: Ross Zwisler
Cc: Sasha Levin
Cc: Takashi Iwai
Cc: Tom Lendacky
Cc: Vishal Verma
Cc: Yaowei Bai
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
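The fix described above ("when add_memory() fails, the resource and the memory should be freed") is an instance of undoing earlier setup steps on a later failure. A minimal userspace sketch of that pattern follows. Every name here is a hypothetical stand-in: fake_request_region(), fake_add_memory() and the two counters only model the bookkeeping and are not the kernel functions.

```c
#include <stdlib.h>

static int claimed_regions;   /* ranges currently marked as in use */
static int live_allocations;  /* resource objects not yet freed */

/* Hypothetical stand-in for a claimed address range. */
struct fake_resource {
    unsigned long long start;
    unsigned long long end;   /* inclusive */
};

/* Allocate a resource object and mark the range as claimed. */
static struct fake_resource *fake_request_region(unsigned long long start,
                                                 unsigned long long end)
{
    struct fake_resource *res = malloc(sizeof(*res));

    if (!res)
        return NULL;
    res->start = start;
    res->end = end;
    live_allocations++;
    claimed_regions++;
    return res;
}

/* Drop the claim on the range; the object itself still has to be freed. */
static void fake_release_resource(struct fake_resource *res)
{
    (void)res;
    claimed_regions--;
}

static void fake_free_resource(struct fake_resource *res)
{
    free(res);
    live_allocations--;
}

/* Stand-in for the hotplug step, which can be told to fail. */
static int fake_add_memory(const struct fake_resource *res, int fail)
{
    (void)res;
    return fail ? -1 : 0;
}

/* Probe: claim the range, then hotplug it; undo the claim on failure. */
static int fake_probe(int add_memory_fails, struct fake_resource **out)
{
    struct fake_resource *res;
    int rc;

    res = fake_request_region(0x150000000ULL, 0x33fffffffULL);
    if (!res)
        return -1;

    rc = fake_add_memory(res, add_memory_fails);
    if (rc) {
        /* Without these two calls the range would stay claimed forever. */
        fake_release_resource(res);
        fake_free_resource(res);
        return rc;
    }

    *out = res;
    return 0;
}
```

After a failed fake_probe() both counters are back at zero, which is the property the fix restores: a failed bind must leave nothing claimed and nothing allocated.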
01 Mar, 2019
1 commit
-
This is intended for use with NVDIMMs that are physically persistent
(physically like flash) so that they can be used as a cost-effective
RAM replacement. Intel Optane DC persistent memory is one
implementation of this kind of NVDIMM.

Currently, a persistent memory region is "owned" by a device driver,
either the "Direct DAX" or "Filesystem DAX" drivers. These drivers
allow applications to explicitly use persistent memory, generally
by being modified to use special, new libraries. (DIMM-based
persistent memory hardware/software is described in great detail
here: Documentation/nvdimm/nvdimm.txt).

However, this limits persistent memory use to applications which
*have* been modified. To make it more broadly usable, this driver
"hotplugs" memory into the kernel, to be managed and used just like
normal RAM would be.

To make this work, management software must remove the device from
being controlled by the "Device DAX" infrastructure:

echo dax0.0 > /sys/bus/dax/drivers/device_dax/unbind

and then tell the new driver that it can bind to the device:

echo dax0.0 > /sys/bus/dax/drivers/kmem/new_id

After this, there will be a number of new memory sections visible
in sysfs that can be onlined, or that may get onlined by existing
udev-initiated memory hotplug rules.

This rebinding procedure is currently a one-way trip. Once memory
is bound to "kmem", it's there permanently and can not be
unbound and assigned back to device_dax.

The kmem driver will never bind to a dax device unless the device
is *explicitly* bound to the driver. There are two reasons for
this: One, since it is a one-way trip, it can not be undone if
bound incorrectly. Two, the kmem driver destroys data on the
device. Imagine you had good data on a pmem device. It
would be catastrophic if you compile-in "kmem", but leave out
the "device_dax" driver. kmem would take over the device and
write volatile data all over your good data.

This inherits any existing NUMA information for the newly-added
memory from the persistent memory device that came from the
firmware. On Intel platforms, the firmware has guarantees that
require each socket's persistent memory to be in a separate
memory-only NUMA node. That means that this patch is not expected
to create NUMA nodes, but will simply hotplug memory into existing
nodes.

Because NUMA nodes are created, the existing NUMA APIs and tools
are sufficient to create policies for applications or memory areas
to have affinity for or an aversion to using this memory.

There is currently some metadata at the beginning of pmem regions.
The section-size memory hotplug restrictions, plus this small
reserved area can cause the "loss" of a section or two of capacity.
This should be fixable in follow-on patches. But, as a first step,
losing 256MB of memory (worst case) out of hundreds of gigabytes
is a good tradeoff vs. the required code to fix this up precisely.
This calculation is also the reason we export
memory_block_size_bytes().

Signed-off-by: Dave Hansen
Reviewed-by: Dan Williams
Reviewed-by: Keith Busch
Cc: Dave Jiang
Cc: Ross Zwisler
Cc: Vishal Verma
Cc: Tom Lendacky
Cc: Andrew Morton
Cc: Michal Hocko
Cc: linux-nvdimm@lists.01.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: Huang Ying
Cc: Fengguang Wu
Cc: Borislav Petkov
Cc: Bjorn Helgaas
Cc: Yaowei Bai
Cc: Takashi Iwai
Cc: Jerome Glisse
Reviewed-by: Vishal Verma
Signed-off-by: Dan Williams
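The capacity "loss" described above follows from trimming the device range to memory-block boundaries. A minimal userspace sketch of that calculation follows, under stated assumptions: the struct and helper names are hypothetical, the end address is treated as inclusive, and the block size must be a power of two. In the kernel the block size would come from memory_block_size_bytes(); the sketch takes it as a parameter and is not the driver's actual code.

```c
#include <stdint.h>

/* Address range with an inclusive end, as printed in /proc/iomem. */
struct addr_range {
    uint64_t start;
    uint64_t end;
};

/* Round x up to a multiple of a; a must be a power of two. */
static uint64_t align_up(uint64_t x, uint64_t a)
{
    return (x + a - 1) & ~(a - 1);
}

/* Round x down to a multiple of a; a must be a power of two. */
static uint64_t align_down(uint64_t x, uint64_t a)
{
    return x & ~(a - 1);
}

/*
 * Trim a device range to whole memory blocks. Returns 0 and fills *out
 * on success, or -1 if not even one whole block fits. Overflow of
 * dev->end + 1 is ignored in this sketch.
 */
static int hotplug_range(const struct addr_range *dev, uint64_t block_size,
                         struct addr_range *out)
{
    uint64_t start = align_up(dev->start, block_size);
    uint64_t end_excl = align_down(dev->end + 1, block_size);

    if (start >= end_excl)
        return -1;
    out->start = start;
    out->end = end_excl - 1;
    return 0;
}
```

With the dax0.0 range 0x148200000-0x33fffffff from the /proc/iomem listings earlier in this log and an assumed 128 MiB block size, the result is 0x150000000-0x33fffffff, which matches the System RAM range shown there. The 126 MiB between 0x148200000 and 0x150000000 is the capacity that goes unused.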