17 Oct, 2020

2 commits

  • We soon want to pass flags, e.g., to mark added System RAM resources
    mergeable. Prepare for that.

    This patch is based on a similar patch by Oscar Salvador:

    https://lkml.kernel.org/r/20190625075227.15193-3-osalvador@suse.de

    Signed-off-by: David Hildenbrand
    Signed-off-by: Andrew Morton
    Reviewed-by: Juergen Gross # Xen related part
    Reviewed-by: Pankaj Gupta
    Acked-by: Wei Liu
    Cc: Michal Hocko
    Cc: Dan Williams
    Cc: Jason Gunthorpe
    Cc: Baoquan He
    Cc: Michael Ellerman
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: "Rafael J. Wysocki"
    Cc: Len Brown
    Cc: Greg Kroah-Hartman
    Cc: Vishal Verma
    Cc: Dave Jiang
    Cc: "K. Y. Srinivasan"
    Cc: Haiyang Zhang
    Cc: Stephen Hemminger
    Cc: Wei Liu
    Cc: Heiko Carstens
    Cc: Vasily Gorbik
    Cc: Christian Borntraeger
    Cc: David Hildenbrand
    Cc: "Michael S. Tsirkin"
    Cc: Jason Wang
    Cc: Boris Ostrovsky
    Cc: Stefano Stabellini
    Cc: "Oliver O'Halloran"
    Cc: Pingfan Liu
    Cc: Nathan Lynch
    Cc: Libor Pechacek
    Cc: Anton Blanchard
    Cc: Leonardo Bras
    Cc: Ard Biesheuvel
    Cc: Eric Biederman
    Cc: Julien Grall
    Cc: Kees Cook
    Cc: Roger Pau Monné
    Cc: Thomas Gleixner
    Cc: Wei Yang
    Link: https://lkml.kernel.org/r/20200911103459.10306-5-david@redhat.com
    Signed-off-by: Linus Torvalds

    David Hildenbrand
     
  • The conversion to request_mem_region() is broken because it assumes that
    the range is marked busy prior to release. However, due to the way that
    the kmem driver manipulates the IORESOURCE_BUSY flag (clears it to let
    {add,remove}_memory() handle busy) it requires a manual release_resource()
    to perform cleanup.

    Given that the actual 'struct resource *' needs to be recalled, not just
    the range, add that tracking to the kmem driver-data.
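
    The failure mode can be pictured with a toy userspace model. This is
    not the kernel's resource code: toy_release_region() is a hypothetical
    stand-in for a range-based release that only matches entries still
    flagged busy, and toy_release_resource() is a stand-in for a release
    that takes the remembered pointer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_BUSY 0x1u   /* stand-in for IORESOURCE_BUSY */
#define TOY_MAX_RES 8

struct toy_resource {
    uint64_t start, end;   /* inclusive range */
    unsigned int flags;
    bool in_use;
};

struct toy_resource toy_table[TOY_MAX_RES];

/* Claim a range and mark it busy, as a region request would. */
struct toy_resource *toy_request_region(uint64_t start, uint64_t end)
{
    for (size_t i = 0; i < TOY_MAX_RES; i++) {
        if (!toy_table[i].in_use) {
            toy_table[i] = (struct toy_resource){ start, end, TOY_BUSY, true };
            return &toy_table[i];
        }
    }
    return NULL;
}

/* Range-based release: only finds entries that are still marked busy. */
bool toy_release_region(uint64_t start, uint64_t end)
{
    for (size_t i = 0; i < TOY_MAX_RES; i++) {
        struct toy_resource *r = &toy_table[i];

        if (r->in_use && (r->flags & TOY_BUSY) &&
            r->start == start && r->end == end) {
            r->in_use = false;
            return true;
        }
    }
    return false;
}

/* Pointer-based release: works regardless of the busy flag. */
void toy_release_resource(struct toy_resource *res)
{
    assert(res != NULL);
    res->in_use = false;
}
```

    Once the busy flag has been cleared, the range-based call no longer
    finds the entry, so the owner has to keep the pointer around to clean
    up.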

    Fixes: 0513bd5bb114 ("device-dax/kmem: replace release_resource() with release_mem_region()")
    Reported-by: David Hildenbrand
    Signed-off-by: Dan Williams
    Signed-off-by: Andrew Morton
    Reviewed-by: David Hildenbrand
    Cc: Vishal Verma
    Cc: Dave Hansen
    Cc: Pavel Tatashin
    Cc: Brice Goglin
    Cc: Dave Jiang
    Cc: Ira Weiny
    Cc: Jia He
    Cc: Joao Martins
    Cc: Jonathan Cameron
    Link: https://lkml.kernel.org/r/160272252925.3136502.17220638073995895400.stgit@dwillia2-desk3.amr.corp.intel.com
    Signed-off-by: Linus Torvalds

    Dan Williams
     

14 Oct, 2020

6 commits

  • Break the requirement that device-dax instances are physically contiguous.
    With this constraint removed it allows fragmented available capacity to
    be fully allocated.

    This capability is useful to mitigate the "noisy neighbor" problem with
    memory-side-cache management for virtual machines, or any other scenario
    where a platform address boundary also designates a performance boundary.
    For example a direct mapped memory side cache might rotate cache colors at
    1GB boundaries. With dis-contiguous allocations a device-dax instance
    could be configured to contain only 1 cache color.
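
    A minimal userspace sketch of what a dis-contiguous instance implies
    for address translation, assuming the instance is described by an
    ordered array of inclusive physical ranges. The struct layout and the
    helper name dev_offset_to_phys() are illustrative, not taken from the
    patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct range {
    uint64_t start;
    uint64_t end;   /* inclusive */
};

/*
 * Translate a byte offset into the device to a physical address by
 * walking the ranges in order. Returns 0 and sets *phys on success,
 * -1 when the offset lies beyond the last range.
 */
int dev_offset_to_phys(const struct range *ranges, size_t nr,
                       uint64_t off, uint64_t *phys)
{
    for (size_t i = 0; i < nr; i++) {
        uint64_t len = ranges[i].end - ranges[i].start + 1;

        if (off < len) {
            *phys = ranges[i].start + off;
            return 0;
        }
        off -= len;   /* skip this range, keep looking */
    }
    return -1;
}
```

    With two 1GB ranges that are not adjacent, offset 1GB maps to the
    start of the second range even though the physical addresses in
    between belong to someone else.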

    It also satisfies Joao's use case (see link) for partitioning memory for
    exclusive guest access. It allows for a future potential mode where the
    host kernel need not allocate 'struct page' capacity up-front.

    Reported-by: Joao Martins
    Signed-off-by: Dan Williams
    Signed-off-by: Andrew Morton
    Cc: Andy Lutomirski
    Cc: Ard Biesheuvel
    Cc: Benjamin Herrenschmidt
    Cc: Ben Skeggs
    Cc: Bjorn Helgaas
    Cc: Borislav Petkov
    Cc: Boris Ostrovsky
    Cc: Brice Goglin
    Cc: Catalin Marinas
    Cc: Daniel Vetter
    Cc: Dave Hansen
    Cc: Dave Jiang
    Cc: David Airlie
    Cc: David Hildenbrand
    Cc: Greg Kroah-Hartman
    Cc: "H. Peter Anvin"
    Cc: Hulk Robot
    Cc: Ingo Molnar
    Cc: Ira Weiny
    Cc: Jason Gunthorpe
    Cc: Jason Yan
    Cc: Jeff Moyer
    Cc: "Jérôme Glisse"
    Cc: Jia He
    Cc: Jonathan Cameron
    Cc: Juergen Gross
    Cc: kernel test robot
    Cc: Michael Ellerman
    Cc: Mike Rapoport
    Cc: Paul Mackerras
    Cc: Pavel Tatashin
    Cc: Peter Zijlstra
    Cc: "Rafael J. Wysocki"
    Cc: Randy Dunlap
    Cc: Stefano Stabellini
    Cc: Thomas Gleixner
    Cc: Tom Lendacky
    Cc: Vishal Verma
    Cc: Vivek Goyal
    Cc: Wei Yang
    Cc: Will Deacon
    Link: https://lore.kernel.org/lkml/20200110190313.17144-1-joao.m.martins@oracle.com/
    Link: https://lkml.kernel.org/r/159643104304.4062302.16561669534797528660.stgit@dwillia2-desk3.amr.corp.intel.com
    Link: https://lkml.kernel.org/r/160106116875.30709.11456649969327399771.stgit@dwillia2-desk3.amr.corp.intel.com
    Signed-off-by: Linus Torvalds

    Dan Williams
     
  • In preparation for introducing seed devices the dax-bus core needs to be
    able to intercept ->probe() and ->remove() operations. Towards that end
    arrange for the bus and drivers to switch from raw 'struct device' driver
    operations to 'struct dev_dax' typed operations.
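
    The shape of that switch can be sketched in userspace C. The structs
    below are heavily simplified stand-ins, and the two-argument
    dax_bus_probe() signature is an assumption made for the sketch. The
    point is only that the bus hook sees every probe and hands the driver
    a typed pointer.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace copy of the usual container_of idiom. */
#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

struct device { const char *name; };
struct device_driver { const char *name; };

/* Simplified: the real 'struct dev_dax' has many more members. */
struct dev_dax {
    int id;
    struct device dev;
};

/* Driver type with 'struct dev_dax' typed callbacks. */
struct dax_device_driver {
    struct device_driver drv;
    int (*probe)(struct dev_dax *dev_dax);
    int (*remove)(struct dev_dax *dev_dax);
};

int bus_probe_calls;   /* counts how often the bus hook ran */
int last_probed_id;    /* set by the example driver below */

/* Bus-level hook: runs for every probe, then forwards a typed pointer. */
int dax_bus_probe(struct device *dev, struct device_driver *drv)
{
    struct dax_device_driver *dax_drv =
        container_of(drv, struct dax_device_driver, drv);
    struct dev_dax *dev_dax = container_of(dev, struct dev_dax, dev);

    bus_probe_calls++;   /* the place where the core can intercept */
    return dax_drv->probe(dev_dax);
}

/* A driver only implements the typed callback. */
int example_probe(struct dev_dax *dev_dax)
{
    last_probed_id = dev_dax->id;
    return 0;
}
```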

    Reported-by: Hulk Robot
    Signed-off-by: Dan Williams
    Signed-off-by: Andrew Morton
    Cc: Jason Yan
    Cc: Vishal Verma
    Cc: Brice Goglin
    Cc: Dave Hansen
    Cc: Dave Jiang
    Cc: David Hildenbrand
    Cc: Ira Weiny
    Cc: Jia He
    Cc: Joao Martins
    Cc: Jonathan Cameron
    Cc: Andy Lutomirski
    Cc: Ard Biesheuvel
    Cc: Benjamin Herrenschmidt
    Cc: Ben Skeggs
    Cc: Bjorn Helgaas
    Cc: Borislav Petkov
    Cc: Boris Ostrovsky
    Cc: Catalin Marinas
    Cc: Daniel Vetter
    Cc: David Airlie
    Cc: Greg Kroah-Hartman
    Cc: "H. Peter Anvin"
    Cc: Ingo Molnar
    Cc: Jason Gunthorpe
    Cc: Jeff Moyer
    Cc: "Jérôme Glisse"
    Cc: Juergen Gross
    Cc: kernel test robot
    Cc: Michael Ellerman
    Cc: Mike Rapoport
    Cc: Paul Mackerras
    Cc: Pavel Tatashin
    Cc: Peter Zijlstra
    Cc: "Rafael J. Wysocki"
    Cc: Randy Dunlap
    Cc: Stefano Stabellini
    Cc: Thomas Gleixner
    Cc: Tom Lendacky
    Cc: Vivek Goyal
    Cc: Wei Yang
    Cc: Will Deacon
    Link: https://lkml.kernel.org/r/160106113357.30709.4541750544799737855.stgit@dwillia2-desk3.amr.corp.intel.com
    Signed-off-by: Linus Torvalds

    Dan Williams
     
  • Towards removing the mode specific @dax_kmem_res attribute from the
    generic 'struct dev_dax', and preparing for multi-range support, change
    the kmem driver to use the idiomatic release_mem_region() to pair with the
    initial request_mem_region(). This also eliminates the need to open code
    the release of the resource allocated by request_mem_region().

    As there are no more dax_kmem_res users, delete this struct member.

    Signed-off-by: Dan Williams
    Signed-off-by: Andrew Morton
    Cc: David Hildenbrand
    Cc: Vishal Verma
    Cc: Dave Hansen
    Cc: Pavel Tatashin
    Cc: Brice Goglin
    Cc: Dave Jiang
    Cc: Ira Weiny
    Cc: Jia He
    Cc: Joao Martins
    Cc: Jonathan Cameron
    Cc: Andy Lutomirski
    Cc: Ard Biesheuvel
    Cc: Benjamin Herrenschmidt
    Cc: Ben Skeggs
    Cc: Bjorn Helgaas
    Cc: Borislav Petkov
    Cc: Boris Ostrovsky
    Cc: Catalin Marinas
    Cc: Daniel Vetter
    Cc: David Airlie
    Cc: Greg Kroah-Hartman
    Cc: "H. Peter Anvin"
    Cc: Hulk Robot
    Cc: Ingo Molnar
    Cc: Jason Gunthorpe
    Cc: Jason Yan
    Cc: Jeff Moyer
    Cc: "Jérôme Glisse"
    Cc: Juergen Gross
    Cc: kernel test robot
    Cc: Michael Ellerman
    Cc: Mike Rapoport
    Cc: Paul Mackerras
    Cc: Peter Zijlstra
    Cc: "Rafael J. Wysocki"
    Cc: Randy Dunlap
    Cc: Stefano Stabellini
    Cc: Thomas Gleixner
    Cc: Tom Lendacky
    Cc: Vivek Goyal
    Cc: Wei Yang
    Cc: Will Deacon
    Link: https://lkml.kernel.org/r/160106112239.30709.15909567572288425294.stgit@dwillia2-desk3.amr.corp.intel.com
    Signed-off-by: Linus Torvalds

    Dan Williams
     
  • Towards removing the mode specific @dax_kmem_res attribute from the
    generic 'struct dev_dax', and preparing for multi-range support, move
    resource name tracking to driver data. The memory for the resource name
    needs to have its own lifetime separate from the device bind lifetime for
    cases where the driver is unbound, but the kmem range could not be
    unplugged from the page allocator.

    Signed-off-by: Dan Williams
    Signed-off-by: Andrew Morton
    Cc: David Hildenbrand
    Cc: Vishal Verma
    Cc: Dave Hansen
    Cc: Pavel Tatashin
    Cc: Brice Goglin
    Cc: Dave Jiang
    Cc: Ira Weiny
    Cc: Jia He
    Cc: Joao Martins
    Cc: Jonathan Cameron
    Cc: Andy Lutomirski
    Cc: Ard Biesheuvel
    Cc: Benjamin Herrenschmidt
    Cc: Ben Skeggs
    Cc: Bjorn Helgaas
    Cc: Borislav Petkov
    Cc: Boris Ostrovsky
    Cc: Catalin Marinas
    Cc: Daniel Vetter
    Cc: David Airlie
    Cc: Greg Kroah-Hartman
    Cc: "H. Peter Anvin"
    Cc: Hulk Robot
    Cc: Ingo Molnar
    Cc: Jason Gunthorpe
    Cc: Jason Yan
    Cc: Jeff Moyer
    Cc: "Jérôme Glisse"
    Cc: Juergen Gross
    Cc: kernel test robot
    Cc: Michael Ellerman
    Cc: Mike Rapoport
    Cc: Paul Mackerras
    Cc: Peter Zijlstra
    Cc: "Rafael J. Wysocki"
    Cc: Randy Dunlap
    Cc: Stefano Stabellini
    Cc: Thomas Gleixner
    Cc: Tom Lendacky
    Cc: Vivek Goyal
    Cc: Wei Yang
    Cc: Will Deacon
    Link: https://lkml.kernel.org/r/160106111639.30709.17624822766862009183.stgit@dwillia2-desk3.amr.corp.intel.com
    Signed-off-by: Linus Torvalds

    Dan Williams
     
  • Towards removing the mode specific @dax_kmem_res attribute from the
    generic 'struct dev_dax', and preparing for multi-range support, teach the
    driver to calculate the hotplug range from the device range. The hotplug
    range is the trivially calculated memory-block-size aligned version of the
    device range.
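
    A userspace sketch of that calculation, assuming an inclusive
    start/end range and a power-of-two memory block size. The helper
    names are illustrative. The start is rounded up and the exclusive end
    is rounded down, so only whole memory blocks remain.

```c
#include <assert.h>
#include <stdint.h>

struct range {
    uint64_t start;
    uint64_t end;   /* inclusive */
};

/* Round to a power-of-two alignment. */
uint64_t align_up(uint64_t x, uint64_t a)   { return (x + a - 1) & ~(a - 1); }
uint64_t align_down(uint64_t x, uint64_t a) { return x & ~(a - 1); }

/*
 * Shrink a device range to whole memory blocks. Returns 0 and fills
 * *out on success, -1 if not even one full block fits.
 * Assumes dev.end + 1 does not overflow.
 */
int hotplug_range(struct range dev, uint64_t block_size, struct range *out)
{
    assert(block_size != 0 && (block_size & (block_size - 1)) == 0);

    uint64_t start = align_up(dev.start, block_size);
    uint64_t end_excl = align_down(dev.end + 1, block_size);

    if (start >= end_excl)
        return -1;

    out->start = start;
    out->end = end_excl - 1;
    return 0;
}
```

    With a 128MB block size (an assumption that matches the
    0x150000000-0x157ffffff block reported in a log elsewhere in this
    history), a device range of 0x148200000-0x33fffffff shrinks to
    0x150000000-0x33fffffff.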

    Signed-off-by: Dan Williams
    Signed-off-by: Andrew Morton
    Cc: David Hildenbrand
    Cc: Vishal Verma
    Cc: Dave Hansen
    Cc: Pavel Tatashin
    Cc: Brice Goglin
    Cc: Dave Jiang
    Cc: Ira Weiny
    Cc: Jia He
    Cc: Joao Martins
    Cc: Jonathan Cameron
    Cc: Andy Lutomirski
    Cc: Ard Biesheuvel
    Cc: Benjamin Herrenschmidt
    Cc: Ben Skeggs
    Cc: Bjorn Helgaas
    Cc: Borislav Petkov
    Cc: Boris Ostrovsky
    Cc: Catalin Marinas
    Cc: Daniel Vetter
    Cc: David Airlie
    Cc: Greg Kroah-Hartman
    Cc: "H. Peter Anvin"
    Cc: Hulk Robot
    Cc: Ingo Molnar
    Cc: Jason Gunthorpe
    Cc: Jason Yan
    Cc: Jeff Moyer
    Cc: "Jérôme Glisse"
    Cc: Juergen Gross
    Cc: kernel test robot
    Cc: Michael Ellerman
    Cc: Mike Rapoport
    Cc: Paul Mackerras
    Cc: Peter Zijlstra
    Cc: "Rafael J. Wysocki"
    Cc: Randy Dunlap
    Cc: Stefano Stabellini
    Cc: Thomas Gleixner
    Cc: Tom Lendacky
    Cc: Vivek Goyal
    Cc: Wei Yang
    Cc: Will Deacon
    Link: https://lkml.kernel.org/r/160106111109.30709.3173462396758431559.stgit@dwillia2-desk3.amr.corp.intel.com
    Signed-off-by: Linus Torvalds

    Dan Williams
     
  • The passed in dev_pagemap is only required in the pmem case as the
    libnvdimm core may have reserved a vmem_altmap for dev_memremap_pages() to
    place the memmap in pmem directly. In the hmem case there is no agent
    reserving an altmap so it can all be handled by a core internal default.

    Pass the resource range via a new @range property of 'struct
    dev_dax_data'.

    Signed-off-by: Dan Williams
    Signed-off-by: Andrew Morton
    Cc: David Hildenbrand
    Cc: Vishal Verma
    Cc: Dave Hansen
    Cc: Pavel Tatashin
    Cc: Brice Goglin
    Cc: Dave Jiang
    Cc: Ira Weiny
    Cc: Jia He
    Cc: Joao Martins
    Cc: Jonathan Cameron
    Cc: Andy Lutomirski
    Cc: Ard Biesheuvel
    Cc: Benjamin Herrenschmidt
    Cc: Ben Skeggs
    Cc: Bjorn Helgaas
    Cc: Borislav Petkov
    Cc: Boris Ostrovsky
    Cc: Catalin Marinas
    Cc: Daniel Vetter
    Cc: David Airlie
    Cc: Greg Kroah-Hartman
    Cc: "H. Peter Anvin"
    Cc: Hulk Robot
    Cc: Ingo Molnar
    Cc: Jason Gunthorpe
    Cc: Jason Yan
    Cc: Jeff Moyer
    Cc: "Jérôme Glisse"
    Cc: Juergen Gross
    Cc: kernel test robot
    Cc: Michael Ellerman
    Cc: Mike Rapoport
    Cc: Paul Mackerras
    Cc: Peter Zijlstra
    Cc: "Rafael J. Wysocki"
    Cc: Randy Dunlap
    Cc: Stefano Stabellini
    Cc: Thomas Gleixner
    Cc: Tom Lendacky
    Cc: Vivek Goyal
    Cc: Wei Yang
    Cc: Will Deacon
    Link: https://lkml.kernel.org/r/159643099958.4062302.10379230791041872886.stgit@dwillia2-desk3.amr.corp.intel.com
    Link: https://lkml.kernel.org/r/160106110513.30709.4303239334850606031.stgit@dwillia2-desk3.amr.corp.intel.com
    Signed-off-by: Linus Torvalds

    Dan Williams
     

05 Jun, 2020

1 commit

  • Currently, when adding memory, we create entries in /sys/firmware/memmap/
    as "System RAM". This leads kexec-tools to add that memory to the
    fixed-up initial memmap for a kexec kernel (loaded via kexec_load()). The
    memory will be considered initial System RAM by the kexec'd kernel and can
    no longer be reconfigured. This is not what happens during a real reboot.

    Let's add our memory via add_memory_driver_managed() now, so we won't
    create entries in /sys/firmware/memmap/ and indicate the memory as "System
    RAM (kmem)" in /proc/iomem. This allows everybody (especially
    kexec-tools) to identify that this memory is special and has to be treated
    differently than ordinary (hotplugged) System RAM.
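
    As a hedged illustration of how a userspace tool could tell the two
    kinds of entries apart by name alone (this is not how kexec-tools is
    implemented, and both helper names are made up for the sketch):

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Plain "System RAM": ordinary boot-time or hotplugged memory. */
bool is_plain_system_ram(const char *name)
{
    return strcmp(name, "System RAM") == 0;
}

/* "System RAM (<driver>)": memory added and managed by a driver. */
bool is_driver_managed_ram(const char *name)
{
    static const char prefix[] = "System RAM (";
    size_t plen = strlen(prefix);
    size_t len = strlen(name);

    /* Need the prefix, at least one driver character, and ')'. */
    return len > plen + 1 &&
           strncmp(name, prefix, plen) == 0 &&
           name[len - 1] == ')';
}
```

    Applied to the resource names above, "System RAM (kmem)" is
    classified as driver managed while "System RAM" is not.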

    Before configuring the namespace:
    [root@localhost ~]# cat /proc/iomem
    ...
    140000000-33fffffff : Persistent Memory
      140000000-33fffffff : namespace0.0
    3280000000-32ffffffff : PCI Bus 0000:00

    After configuring the namespace:
    [root@localhost ~]# cat /proc/iomem
    ...
    140000000-33fffffff : Persistent Memory
      140000000-1481fffff : namespace0.0
      148200000-33fffffff : dax0.0
    3280000000-32ffffffff : PCI Bus 0000:00

    After loading kmem before this change:
    [root@localhost ~]# cat /proc/iomem
    ...
    140000000-33fffffff : Persistent Memory
      140000000-1481fffff : namespace0.0
      150000000-33fffffff : dax0.0
        150000000-33fffffff : System RAM
    3280000000-32ffffffff : PCI Bus 0000:00

    After loading kmem after this change:
    [root@localhost ~]# cat /proc/iomem
    ...
    140000000-33fffffff : Persistent Memory
      140000000-1481fffff : namespace0.0
      150000000-33fffffff : dax0.0
        150000000-33fffffff : System RAM (kmem)
    3280000000-32ffffffff : PCI Bus 0000:00

    After a proper reboot:
    [root@localhost ~]# cat /proc/iomem
    ...
    140000000-33fffffff : Persistent Memory
      140000000-1481fffff : namespace0.0
      148200000-33fffffff : dax0.0
    3280000000-32ffffffff : PCI Bus 0000:00

    Within the kexec kernel before this change:
    [root@localhost ~]# cat /proc/iomem
    ...
    140000000-33fffffff : Persistent Memory
      140000000-1481fffff : namespace0.0
      150000000-33fffffff : System RAM
    3280000000-32ffffffff : PCI Bus 0000:00

    Within the kexec kernel after this change:
    [root@localhost ~]# cat /proc/iomem
    ...
    140000000-33fffffff : Persistent Memory
      140000000-1481fffff : namespace0.0
      148200000-33fffffff : dax0.0
    3280000000-32ffffffff : PCI Bus 0000:00

    /sys/firmware/memmap/ before this change:
    0000000000000000-000000000009fc00 (System RAM)
    000000000009fc00-00000000000a0000 (Reserved)
    00000000000f0000-0000000000100000 (Reserved)
    0000000000100000-00000000bffdf000 (System RAM)
    00000000bffdf000-00000000c0000000 (Reserved)
    00000000feffc000-00000000ff000000 (Reserved)
    00000000fffc0000-0000000100000000 (Reserved)
    0000000100000000-0000000140000000 (System RAM)
    0000000150000000-0000000340000000 (System RAM)

    /sys/firmware/memmap/ after a proper reboot:
    0000000000000000-000000000009fc00 (System RAM)
    000000000009fc00-00000000000a0000 (Reserved)
    00000000000f0000-0000000000100000 (Reserved)
    0000000000100000-00000000bffdf000 (System RAM)
    00000000bffdf000-00000000c0000000 (Reserved)
    00000000feffc000-00000000ff000000 (Reserved)
    00000000fffc0000-0000000100000000 (Reserved)
    0000000100000000-0000000140000000 (System RAM)

    /sys/firmware/memmap/ after this change:
    0000000000000000-000000000009fc00 (System RAM)
    000000000009fc00-00000000000a0000 (Reserved)
    00000000000f0000-0000000000100000 (Reserved)
    0000000000100000-00000000bffdf000 (System RAM)
    00000000bffdf000-00000000c0000000 (Reserved)
    00000000feffc000-00000000ff000000 (Reserved)
    00000000fffc0000-0000000100000000 (Reserved)
    0000000100000000-0000000140000000 (System RAM)

    kexec-tools already seem to basically ignore any System RAM that's not on
    top level when searching for areas to place kexec images - but also for
    determining crash areas to dump via kdump. Changing the resource name
    won't have an impact.

    Properly handle unloading of the driver after a failed memory hotremove
    by duplicating the string if necessary.

    Signed-off-by: David Hildenbrand
    Signed-off-by: Andrew Morton
    Acked-by: Pankaj Gupta
    Cc: Michal Hocko
    Cc: Pankaj Gupta
    Cc: Wei Yang
    Cc: Baoquan He
    Cc: Dave Hansen
    Cc: Eric Biederman
    Cc: Pavel Tatashin
    Cc: Dan Williams
    Link: http://lkml.kernel.org/r/20200508084217.9160-5-david@redhat.com
    Signed-off-by: Linus Torvalds

    David Hildenbrand
     

24 May, 2020

1 commit

  • Assume we have kmem configured and loaded:

    [root@localhost ~]# cat /proc/iomem
    ...
    140000000-33fffffff : Persistent Memory
      140000000-1481fffff : namespace0.0
      150000000-33fffffff : dax0.0
        150000000-33fffffff : System RAM

    Assume we try to unload kmem. This force-unloading will work, even if
    memory cannot get removed from the system.

    [root@localhost ~]# rmmod kmem
    [ 86.380228] removing memory fails, because memory [0x0000000150000000-0x0000000157ffffff] is onlined
    ...
    [ 86.431225] kmem dax0.0: DAX region [mem 0x150000000-0x33fffffff] cannot be hotremoved until the next reboot

    Now, we can reconfigure the namespace:

    [root@localhost ~]# ndctl create-namespace --force --reconfig=namespace0.0 --mode=devdax
    [ 131.409351] nd_pmem namespace0.0: could not reserve region [mem 0x140000000-0x33fffffff]
    [ 131.410147] nd_pmem: probe of namespace0.0 failed with error -16
    ...

    This fails as expected due to the busy memory resource, and the memory
    cannot be used. However, the dax0.0 device is removed, and along with it
    its name.

    The name of the memory resource now points at freed memory (name of the
    device):

    [root@localhost ~]# cat /proc/iomem
    ...
    140000000-33fffffff : Persistent Memory
      140000000-1481fffff : namespace0.0
      150000000-33fffffff : �_�^7_��/_��wR��WQ���^��� ...
        150000000-33fffffff : System RAM

    We have to make sure to duplicate the string. While at it, remove the
    superfluous setting of the name and fixup a stale comment.
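
    The lifetime rule behind the fix can be shown with a small userspace
    sketch. The toy_* types are hypothetical and plain malloc() stands in
    for the kernel's string duplication. The resource takes a private
    copy of the name, so destroying the device cannot leave it pointing
    at freed memory.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

struct toy_device   { char *name; };         /* owns its name */
struct toy_resource { const char *name; };   /* must outlive the device */

/*
 * Give the resource its own copy of the device name instead of
 * borrowing dev->name. Returns 0 on success, -1 on allocation failure.
 */
int toy_resource_set_name(struct toy_resource *res,
                          const struct toy_device *dev)
{
    size_t len = strlen(dev->name) + 1;
    char *copy = malloc(len);

    if (!copy)
        return -1;
    memcpy(copy, dev->name, len);
    res->name = copy;
    return 0;
}

/* Device teardown frees the device's own string only. */
void toy_device_destroy(struct toy_device *dev)
{
    free(dev->name);
    dev->name = NULL;
}
```

    After toy_device_destroy() the resource name is still readable. The
    copy then has to be freed together with the resource.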

    Fixes: 9f960da72b25 ("device-dax: "Hotremove" persistent memory that is used like normal RAM")
    Signed-off-by: David Hildenbrand
    Signed-off-by: Andrew Morton
    Cc: Dan Williams
    Cc: Vishal Verma
    Cc: Dave Jiang
    Cc: Pavel Tatashin
    Cc: Andrew Morton
    Cc: [5.3]
    Link: http://lkml.kernel.org/r/20200508084217.9160-2-david@redhat.com
    Signed-off-by: Linus Torvalds

    David Hildenbrand
     

17 Jul, 2019

2 commits

  • It is now allowed to use persistent memory like regular RAM, but
    currently there is no way to remove this memory until the machine is
    rebooted.

    This work expands the functionality to also allow hotremoving
    previously hotplugged persistent memory, and recovering the device for
    use for other purposes.

    To hotremove persistent memory, the management software must first
    offline all memory blocks of the dax region, and then unbind it from
    the device-dax/kmem driver. So, operations should look like this:

    echo offline > /sys/devices/system/memory/memoryN/state
    ...
    echo dax0.0 > /sys/bus/dax/drivers/kmem/unbind

    Note: if unbind is done without offlining memory beforehand, it won't be
    possible to do dax0.0 hotremove, and dax's memory is going to be part of
    System RAM until reboot.
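
    To find which memoryN entries belong to a dax region, management
    software needs the block numbers that cover it. A minimal sketch,
    assuming a block's number is its start address divided by the memory
    block size and that the region is block aligned (the helper names
    are illustrative):

```c
#include <assert.h>
#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Block number of the memory block starting at phys. */
uint64_t memory_block_id(uint64_t phys, uint64_t block_size)
{
    assert(block_size != 0 && phys % block_size == 0);
    return phys / block_size;
}

/* Number of blocks in [start, end], end inclusive, both block aligned. */
uint64_t memory_block_count(uint64_t start, uint64_t end, uint64_t block_size)
{
    assert(block_size != 0);
    assert(start % block_size == 0 && (end + 1) % block_size == 0);
    return (end + 1 - start) / block_size;
}

/* Format the sysfs state path for block 'id'; returns snprintf()'s result. */
int memory_block_state_path(char *buf, size_t len, uint64_t id)
{
    return snprintf(buf, len,
                    "/sys/devices/system/memory/memory%" PRIu64 "/state", id);
}
```

    For a region of 0x150000000-0x33fffffff and 128MB blocks this yields
    62 blocks starting at memory42. Each of them has to be offlined
    before the unbind.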

    Link: http://lkml.kernel.org/r/20190517215438.6487-4-pasha.tatashin@soleen.com
    Signed-off-by: Pavel Tatashin
    Reviewed-by: David Hildenbrand
    Cc: James Morris
    Cc: Sasha Levin
    Cc: Michal Hocko
    Cc: Dave Hansen
    Cc: Dan Williams
    Cc: Keith Busch
    Cc: Vishal Verma
    Cc: Dave Jiang
    Cc: Ross Zwisler
    Cc: Tom Lendacky
    Cc: Huang Ying
    Cc: Fengguang Wu
    Cc: Borislav Petkov
    Cc: Bjorn Helgaas
    Cc: Yaowei Bai
    Cc: Takashi Iwai
    Cc: Jérôme Glisse
    Cc: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Tatashin
     
  • Patch series ""Hotremove" persistent memory", v6.

    Recently, support for using persistent memory like regular RAM was
    added to Linux. This work extends this functionality to also allow hot
    removing persistent memory.

    We (Microsoft) have an important use case for this functionality.

    The requirement is for physical machines with a small amount of RAM (~8G)
    to be able to reboot in a very short period of time ( ...
       ... /sys/bus/dax/drivers/device_dax/unbind
       echo dax0.0 > /sys/bus/dax/drivers/kmem/new_id
       echo online_movable > /sys/devices/system/memory/memoryXXX/state
    4. Before reboot hotremove device-dax memory from System RAM
       echo offline > /sys/devices/system/memory/memoryXXX/state
       echo dax0.0 > /sys/bus/dax/drivers/kmem/unbind
    5. Create raw pmem0 device
       ndctl create-namespace --mode raw -e namespace0.0 -f
    6. Copy the state that apps stored in the ramdisk to the pmem device
    7. Do a kexec reboot, or reboot through firmware if the firmware does
       not zero memory in the pmem0 region (these machines have only
       regular volatile memory). So to have a pmem0 device, either the
       memmap kernel parameter is used, or device nodes in the dtb are
       specified.

    This patch (of 3):

    When add_memory() fails, the resource and the memory should be freed.
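
    The error path can be pictured with a toy userspace probe routine.
    Everything here is a stand-in: toy_add_memory() is a stub that can be
    told to fail, and the allocation counter exists only so the cleanup
    can be checked.

```c
#include <assert.h>
#include <stdlib.h>

struct toy_resource { int unused; };

int toy_live_allocations;   /* resources currently held */

/* Stub for the hotplug call: fails when asked to. */
int toy_add_memory(int should_fail)
{
    return should_fail ? -1 : 0;
}

struct toy_resource *toy_request(void)
{
    struct toy_resource *res = calloc(1, sizeof(*res));

    if (res)
        toy_live_allocations++;
    return res;
}

void toy_release(struct toy_resource *res)
{
    assert(res != NULL);
    free(res);
    toy_live_allocations--;
}

/* Reserve, then hotplug; undo the reservation if hotplug fails. */
int toy_probe(int add_memory_fails, struct toy_resource **out)
{
    struct toy_resource *res = toy_request();
    int rc;

    if (!res)
        return -1;

    rc = toy_add_memory(add_memory_fails);
    if (rc) {
        toy_release(res);   /* do not leak the reservation */
        return rc;
    }

    *out = res;
    return 0;
}
```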

    Link: http://lkml.kernel.org/r/20190517215438.6487-2-pasha.tatashin@soleen.com
    Fixes: c221c0b0308f ("device-dax: "Hotplug" persistent memory for use like normal RAM")
    Signed-off-by: Pavel Tatashin
    Reviewed-by: Dave Hansen
    Cc: Bjorn Helgaas
    Cc: Borislav Petkov
    Cc: Dan Williams
    Cc: Dave Hansen
    Cc: Dave Jiang
    Cc: David Hildenbrand
    Cc: Fengguang Wu
    Cc: Huang Ying
    Cc: James Morris
    Cc: Jérôme Glisse
    Cc: Keith Busch
    Cc: Michal Hocko
    Cc: Ross Zwisler
    Cc: Sasha Levin
    Cc: Takashi Iwai
    Cc: Tom Lendacky
    Cc: Vishal Verma
    Cc: Yaowei Bai
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Tatashin
     

01 Mar, 2019

1 commit

  • This is intended for use with NVDIMMs that are physically persistent
    (physically like flash) so that they can be used as a cost-effective
    RAM replacement. Intel Optane DC persistent memory is one
    implementation of this kind of NVDIMM.

    Currently, a persistent memory region is "owned" by a device driver,
    either the "Device DAX" or "Filesystem DAX" drivers. These drivers
    allow applications to explicitly use persistent memory, generally
    by being modified to use special, new libraries. (DIMM-based
    persistent memory hardware/software is described in great detail
    here: Documentation/nvdimm/nvdimm.txt).

    However, this limits persistent memory use to applications which
    *have* been modified. To make it more broadly usable, this driver
    "hotplugs" memory into the kernel, to be managed and used just like
    normal RAM would be.

    To make this work, management software must remove the device from
    being controlled by the "Device DAX" infrastructure:

    echo dax0.0 > /sys/bus/dax/drivers/device_dax/unbind

    and then tell the new driver that it can bind to the device:

    echo dax0.0 > /sys/bus/dax/drivers/kmem/new_id

    After this, there will be a number of new memory sections visible
    in sysfs that can be onlined, or that may get onlined by existing
    udev-initiated memory hotplug rules.

    This rebinding procedure is currently a one-way trip. Once memory
    is bound to "kmem", it's there permanently and cannot be
    unbound and assigned back to device_dax.

    The kmem driver will never bind to a dax device unless the device
    is *explicitly* bound to the driver. There are two reasons for
    this: One, since it is a one-way trip, it cannot be undone if
    bound incorrectly. Two, the kmem driver destroys data on the
    device. Imagine you had good data on a pmem device. It
    would be catastrophic if you compiled in "kmem" but left out
    the "device_dax" driver. kmem would take over the device and
    write volatile data all over your good data.

    This inherits any existing NUMA information for the newly-added
    memory from the persistent memory device that came from the
    firmware. On Intel platforms, the firmware has guarantees that
    require each socket's persistent memory to be in a separate
    memory-only NUMA node. That means that this patch is not expected
    to create NUMA nodes, but will simply hotplug memory into existing
    nodes.

    Because the memory lands in NUMA nodes of its own, the existing NUMA
    APIs and tools are sufficient to create policies for applications or
    memory areas to have affinity for or an aversion to using this memory.

    There is currently some metadata at the beginning of pmem regions.
    The section-size memory hotplug restrictions, plus this small
    reserved area can cause the "loss" of a section or two of capacity.
    This should be fixable in follow-on patches. But, as a first step,
    losing 256MB of memory (worst case) out of hundreds of gigabytes
    is a good tradeoff vs. the required code to fix this up precisely.
    This calculation is also the reason we export
    memory_block_size_bytes().

    Signed-off-by: Dave Hansen
    Reviewed-by: Dan Williams
    Reviewed-by: Keith Busch
    Cc: Dave Jiang
    Cc: Ross Zwisler
    Cc: Vishal Verma
    Cc: Tom Lendacky
    Cc: Andrew Morton
    Cc: Michal Hocko
    Cc: linux-nvdimm@lists.01.org
    Cc: linux-kernel@vger.kernel.org
    Cc: linux-mm@kvack.org
    Cc: Huang Ying
    Cc: Fengguang Wu
    Cc: Borislav Petkov
    Cc: Bjorn Helgaas
    Cc: Yaowei Bai
    Cc: Takashi Iwai
    Cc: Jerome Glisse
    Reviewed-by: Vishal Verma
    Signed-off-by: Dan Williams

    Dave Hansen