17 Oct, 2020

3 commits

  • "mem" in the name already indicates the root, similar to
    release_mem_region() and devm_request_mem_region(). Make it implicit.
    The only caller always passes iomem_resource; other parents are not
    applicable.

    Suggested-by: Wei Yang
    Signed-off-by: David Hildenbrand
    Signed-off-by: Andrew Morton
    Reviewed-by: Wei Yang
    Cc: Michal Hocko
    Cc: Dan Williams
    Cc: Jason Gunthorpe
    Cc: Kees Cook
    Cc: Ard Biesheuvel
    Cc: Pankaj Gupta
    Cc: Baoquan He
    Link: https://lkml.kernel.org/r/20200916073041.10355-1-david@redhat.com
    Signed-off-by: Linus Torvalds

    David Hildenbrand
     
  • Some add_memory*() users add memory in small, contiguous memory blocks.
    Examples include virtio-mem, hyper-v balloon, and the XEN balloon.

    This can quickly result in a lot of memory resources, whereby the actual
    resource boundaries are not of interest (e.g., it might be relevant for
    DIMMs, exposed via /proc/iomem to user space). We really want to merge
    added resources in this scenario where possible.

    Let's provide a flag (MEMHP_MERGE_RESOURCE) to specify that a resource
    either created within add_memory*() or passed via add_memory_resource()
    shall be marked mergeable and merged with applicable siblings.

    To implement that, we need a kernel/resource interface to mark selected
    System RAM resources mergeable (IORESOURCE_SYSRAM_MERGEABLE) and trigger
    merging.

    Note: We really want to merge after the whole operation succeeded, not
    directly when adding a resource to the resource tree (it would break
    add_memory_resource() and require splitting resources again when the
    operation failed - e.g., due to -ENOMEM).

    Signed-off-by: David Hildenbrand
    Signed-off-by: Andrew Morton
    Reviewed-by: Pankaj Gupta
    Cc: Michal Hocko
    Cc: Dan Williams
    Cc: Jason Gunthorpe
    Cc: Kees Cook
    Cc: Ard Biesheuvel
    Cc: Thomas Gleixner
    Cc: "K. Y. Srinivasan"
    Cc: Haiyang Zhang
    Cc: Stephen Hemminger
    Cc: Wei Liu
    Cc: Boris Ostrovsky
    Cc: Juergen Gross
    Cc: Stefano Stabellini
    Cc: Roger Pau Monné
    Cc: Julien Grall
    Cc: Baoquan He
    Cc: Wei Yang
    Cc: Anton Blanchard
    Cc: Benjamin Herrenschmidt
    Cc: Christian Borntraeger
    Cc: Dave Jiang
    Cc: Eric Biederman
    Cc: Greg Kroah-Hartman
    Cc: Heiko Carstens
    Cc: Jason Wang
    Cc: Len Brown
    Cc: Leonardo Bras
    Cc: Libor Pechacek
    Cc: Michael Ellerman
    Cc: "Michael S. Tsirkin"
    Cc: Nathan Lynch
    Cc: "Oliver O'Halloran"
    Cc: Paul Mackerras
    Cc: Pingfan Liu
    Cc: "Rafael J. Wysocki"
    Cc: Vasily Gorbik
    Cc: Vishal Verma
    Link: https://lkml.kernel.org/r/20200911103459.10306-6-david@redhat.com
    Signed-off-by: Linus Torvalds

    David Hildenbrand
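
    A minimal user-space sketch of the merging idea described in this
    commit, not the kernel implementation. It assumes a sorted,
    non-overlapping array instead of the resource tree; struct res,
    F_MERGEABLE and merge_adjacent() are hypothetical names standing in for
    struct resource, IORESOURCE_SYSRAM_MERGEABLE and the merge step.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define F_MERGEABLE 0x1u	/* hypothetical stand-in for the mergeable flag */

struct res {
	uint64_t start;
	uint64_t end;		/* inclusive */
	unsigned int flags;
};

/*
 * Merge contiguous entries that are marked mergeable and carry identical
 * flags. Works in place on a sorted, non-overlapping table and returns the
 * new number of entries.
 */
static size_t merge_adjacent(struct res *tbl, size_t n)
{
	size_t out = 0;

	if (n == 0)
		return 0;

	for (size_t i = 1; i < n; i++) {
		struct res *cur = &tbl[out];
		const struct res *next = &tbl[i];

		assert(cur->end < next->start);	/* table must be sorted */

		if ((cur->flags & F_MERGEABLE) && cur->flags == next->flags &&
		    cur->end + 1 == next->start)
			cur->end = next->end;	/* grow the current entry */
		else
			tbl[++out] = *next;	/* keep as a separate entry */
	}
	return out + 1;
}
```

    Running the merge only once the whole table is in its final state
    mirrors the note above: merging is a separate step after the add
    succeeded, so a failed add never has to undo a merge.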
     
  • Patch series "selective merging of system ram resources", v4.

    Some add_memory*() users add memory in small, contiguous memory blocks.
    Examples include virtio-mem, hyper-v balloon, and the XEN balloon.

    This can quickly result in a lot of memory resources, whereby the actual
    resource boundaries are not of interest (e.g., it might be relevant for
    DIMMs, exposed via /proc/iomem to user space). We really want to merge
    added resources in this scenario where possible.

    Resources are effectively stored in a list-based tree. Having a lot of
    resources not only wastes memory, it also makes traversing that tree more
    expensive, and makes /proc/iomem explode in size (e.g., requiring
    kexec-tools to manually merge resources when creating a kdump header. The
    current kexec-tools resource count limit does not allow for more than
    ~100GB of memory with a memory block size of 128MB on x86-64).

    Let's allow selectively merging System RAM resources by specifying a new
    flag for add_memory*(). Patch #5 contains a /proc/iomem example. Only
    tested with virtio-mem.

    This patch (of 8):

    Let's make sure splitting a resource on memory hotunplug will never fail.
    This will become more relevant once we merge selected System RAM resources
    - then, we'll trigger that case more often on memory hotunplug.

    In general, this function is already unlikely to fail. When we remove
    memory, we free up quite a lot of metadata (memmap, page tables, memory
    block device, etc.). The only reason it could really fail would be when
    injecting allocation errors.

    All other error cases inside release_mem_region_adjustable() seem to be
    sanity checks for the case where the function is misused in a different
    context - let's add WARN_ON_ONCE() in these cases so we can catch them.

    [natechancellor@gmail.com: fix use of ternary condition in release_mem_region_adjustable]
    Link: https://lkml.kernel.org/r/20200922060748.2452056-1-natechancellor@gmail.com
    Link: https://github.com/ClangBuiltLinux/linux/issues/1159

    Signed-off-by: David Hildenbrand
    Signed-off-by: Nathan Chancellor
    Signed-off-by: Andrew Morton
    Cc: Michal Hocko
    Cc: Dan Williams
    Cc: Jason Gunthorpe
    Cc: Kees Cook
    Cc: Ard Biesheuvel
    Cc: Pankaj Gupta
    Cc: Baoquan He
    Cc: Wei Yang
    Cc: Anton Blanchard
    Cc: Benjamin Herrenschmidt
    Cc: Boris Ostrovsky
    Cc: Christian Borntraeger
    Cc: Dave Jiang
    Cc: Eric Biederman
    Cc: Greg Kroah-Hartman
    Cc: Haiyang Zhang
    Cc: Heiko Carstens
    Cc: Jason Wang
    Cc: Juergen Gross
    Cc: Julien Grall
    Cc: "K. Y. Srinivasan"
    Cc: Len Brown
    Cc: Leonardo Bras
    Cc: Libor Pechacek
    Cc: Michael Ellerman
    Cc: "Michael S. Tsirkin"
    Cc: Nathan Lynch
    Cc: "Oliver O'Halloran"
    Cc: Paul Mackerras
    Cc: Pingfan Liu
    Cc: "Rafael J. Wysocki"
    Cc: Roger Pau Monné
    Cc: Stefano Stabellini
    Cc: Stephen Hemminger
    Cc: Thomas Gleixner
    Cc: Vasily Gorbik
    Cc: Vishal Verma
    Cc: Wei Liu
    Link: https://lkml.kernel.org/r/20200911103459.10306-2-david@redhat.com
    Signed-off-by: Linus Torvalds

    David Hildenbrand
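
    A hedged user-space sketch of why splitting matters on hotunplug:
    removing a sub-range from the middle of a resource leaves two pieces,
    so a second descriptor is needed. Here the caller supplies storage for
    that second piece up front, so the adjustment itself has no failure
    path. This illustrates the idea only; struct range and
    release_adjust() are hypothetical names, and the kernel's own
    mechanism for avoiding the failure is not shown in the text above.

```c
#include <assert.h>
#include <stdint.h>

struct range {
	uint64_t start;
	uint64_t end;	/* inclusive */
};

/*
 * Remove [start, end] from *res. Returns how many pieces remain:
 *   0 - the whole resource was removed
 *   1 - the head or tail was trimmed, *res holds the remainder
 *   2 - the middle was removed, *res holds the head and *tail the rest
 */
static int release_adjust(struct range *res, uint64_t start, uint64_t end,
			  struct range *tail)
{
	/* Sanity checks; a misuse would be a caller bug. */
	assert(start <= end);
	assert(res->start <= start && end <= res->end);

	if (res->start == start && res->end == end)
		return 0;

	if (res->start == start) {
		res->start = end + 1;
		return 1;
	}

	if (res->end == end) {
		res->end = start - 1;
		return 1;
	}

	/* Split: the caller-provided descriptor takes the upper piece. */
	tail->start = end + 1;
	tail->end = res->end;
	res->end = start - 1;
	return 2;
}
```

    With merged System RAM resources the last case becomes the common one,
    because unplugging a single memory block from a large merged resource
    almost always lands in the middle.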
     

14 Oct, 2020

1 commit

  • In support of detecting whether a resource might have been claimed,
    report the parent to the walk_iomem_res_desc() callback. For example, the
    ACPI HMAT parser publishes "hmem" platform devices per target range.
    However, if the HMAT is disabled or missing, a fallback driver can attach
    devices to the raw memory ranges if it sees unclaimed / orphan
    "Soft Reserved" resources in the resource tree.

    Otherwise, find_next_iomem_res() returns a resource with garbage data from
    the stack allocation in __walk_iomem_res_desc() for the res->parent field.

    There are currently no users that expect ->child and ->sibling to be
    valid, and the resource_lock would be needed to traverse them. Use a
    compound literal to implicitly zero initialize the fields that are not
    being returned in addition to setting ->parent.

    Signed-off-by: Dan Williams
    Signed-off-by: Andrew Morton
    Cc: Jason Gunthorpe
    Cc: Dave Hansen
    Cc: Wei Yang
    Cc: Tom Lendacky
    Cc: Andy Lutomirski
    Cc: Ard Biesheuvel
    Cc: Benjamin Herrenschmidt
    Cc: Ben Skeggs
    Cc: Borislav Petkov
    Cc: Brice Goglin
    Cc: Catalin Marinas
    Cc: Daniel Vetter
    Cc: Dave Jiang
    Cc: David Airlie
    Cc: David Hildenbrand
    Cc: Greg Kroah-Hartman
    Cc: "H. Peter Anvin"
    Cc: Ingo Molnar
    Cc: Ira Weiny
    Cc: Jeff Moyer
    Cc: Jia He
    Cc: Joao Martins
    Cc: Jonathan Cameron
    Cc: Michael Ellerman
    Cc: Mike Rapoport
    Cc: Paul Mackerras
    Cc: Pavel Tatashin
    Cc: Peter Zijlstra
    Cc: Rafael J. Wysocki
    Cc: Thomas Gleixner
    Cc: Vishal Verma
    Cc: Will Deacon
    Cc: Ard Biesheuvel
    Cc: Bjorn Helgaas
    Cc: Boris Ostrovsky
    Cc: Hulk Robot
    Cc: Jason Yan
    Cc: "Jérôme Glisse"
    Cc: Juergen Gross
    Cc: kernel test robot
    Cc: Randy Dunlap
    Cc: Stefano Stabellini
    Cc: Vivek Goyal
    Link: https://lkml.kernel.org/r/159643097166.4062302.11875688887228572793.stgit@dwillia2-desk3.amr.corp.intel.com
    Signed-off-by: Linus Torvalds

    Dan Williams
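
    The compound-literal technique mentioned in this commit can be shown
    in plain C. This is a sketch under assumptions: struct node and
    report() are hypothetical stand-ins for struct resource and the code
    that fills in the returned resource. Members not named in the
    initializer are zero-initialized by the language, which is what
    clears ->sibling and ->child.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct node {
	uint64_t start;
	uint64_t end;		/* inclusive */
	unsigned long flags;
	struct node *parent, *sibling, *child;
};

/*
 * Copy the interesting fields of *p into *out, clipped to [start, end].
 * Assigning a compound literal overwrites every member of *out: the ones
 * named here get the given values, all others become zero / NULL.
 */
static void report(struct node *out, const struct node *p,
		   uint64_t start, uint64_t end)
{
	assert(start <= end);

	*out = (struct node) {
		.start = p->start > start ? p->start : start,
		.end = p->end < end ? p->end : end,
		.flags = p->flags,
		.parent = p->parent,
	};
}
```

    Compared with assigning the fields one by one, stale stack contents in
    the output structure cannot leak through a member that was forgotten.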
     

27 May, 2020

1 commit

  • Close the hole of holding a mapping across a kernel driver takeover of
    a given address range.

    Commit 90a545e98126 ("restrict /dev/mem to idle io memory ranges")
    introduced CONFIG_IO_STRICT_DEVMEM with the goal of protecting the
    kernel against scenarios where a /dev/mem user tramples memory that a
    kernel driver owns. However, this protection only prevents *new* read(),
    write() and mmap() requests. Established mappings prior to the driver
    calling request_mem_region() are left alone.

    Especially with persistent memory, and the core kernel metadata that is
    stored there, there are plentiful scenarios for a /dev/mem user to
    violate the expectations of the driver and cause amplified damage.

    Teach request_mem_region() to find and shoot down active /dev/mem
    mappings that it believes it has successfully claimed for the exclusive
    use of the driver. Effectively a driver call to request_mem_region()
    becomes a hole-punch on the /dev/mem device.

    The typical usage of unmap_mapping_range() is part of
    truncate_pagecache() to punch a hole in a file, but in this case the
    implementation is only doing the "first half" of a hole punch. Namely it
    is just evacuating current established mappings of the "hole", and it
    relies on the fact that /dev/mem establishes mappings in terms of
    absolute physical address offsets. Once existing mmap users are
    invalidated they can attempt to re-establish the mapping, or attempt to
    continue issuing read(2) / write(2) to the invalidated extent, but they
    will then be subject to the CONFIG_IO_STRICT_DEVMEM checking that can
    block those subsequent accesses.

    Cc: Arnd Bergmann
    Cc: Ingo Molnar
    Cc: Kees Cook
    Cc: Matthew Wilcox
    Cc: Russell King
    Cc: Andrew Morton
    Cc: Greg Kroah-Hartman
    Fixes: 90a545e98126 ("restrict /dev/mem to idle io memory ranges")
    Signed-off-by: Dan Williams
    Reviewed-by: Kees Cook
    Link: https://lore.kernel.org/r/159009507306.847224.8502634072429766747.stgit@dwillia2-desk3.amr.corp.intel.com
    Signed-off-by: Greg Kroah-Hartman

    Dan Williams
     

25 Sep, 2019

1 commit

  • Patch series "mm/memory_hotplug: online_pages() cleanups", v2.

    Some cleanups (+ one fix for a special case) in the context of
    online_pages().

    This patch (of 5):

    This makes it clearer that we will never call func() with duplicate PFNs
    in case we have multiple sub-page memory resources. All unaligned parts
    of PFNs are completely discarded.

    Link: http://lkml.kernel.org/r/20190814154109.3448-2-david@redhat.com
    Signed-off-by: David Hildenbrand
    Acked-by: Michal Hocko
    Reviewed-by: Wei Yang
    Cc: Dan Williams
    Cc: Borislav Petkov
    Cc: Bjorn Helgaas
    Cc: Ingo Molnar
    Cc: Dave Hansen
    Cc: Nadav Amit
    Cc: Oscar Salvador
    Cc: Arun KS
    Cc: Pavel Tatashin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Hildenbrand
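
    A small sketch of the rounding described in this commit: round the
    start up and the end down to page boundaries, so a page that is only
    partially covered is never reported. It assumes 4 KiB pages; pfn_up(),
    pfn_down() and whole_pages() are hypothetical user-space helpers
    modeled on the PFN rounding idea, not the kernel's code.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT 12			/* assumption: 4 KiB pages */
#define PAGE_SIZE  (1ULL << PAGE_SHIFT)

/* First page frame that starts at or after addr. */
static uint64_t pfn_up(uint64_t addr)
{
	return (addr + PAGE_SIZE - 1) >> PAGE_SHIFT;
}

/* Page frame that contains addr. */
static uint64_t pfn_down(uint64_t addr)
{
	return addr >> PAGE_SHIFT;
}

/*
 * Number of pages fully covered by the inclusive byte range [start, end].
 * *first_pfn is only written when the result is non-zero.
 */
static uint64_t whole_pages(uint64_t start, uint64_t end, uint64_t *first_pfn)
{
	uint64_t pfn, end_pfn;

	assert(start <= end);
	assert(end < UINT64_MAX);	/* end + 1 must not wrap */

	pfn = pfn_up(start);
	end_pfn = pfn_down(end + 1);	/* exclusive end PFN */
	if (end_pfn <= pfn)
		return 0;		/* no complete page in this range */

	*first_pfn = pfn;
	return end_pfn - pfn;
}
```

    Two sub-page resources that share one page each yield zero whole
    pages, so a callback would never see that PFN twice.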
     

20 Aug, 2019

1 commit

  • Factor out the guts of devm_request_free_mem_region so that we can
    implement both a device managed and a manually released version as tiny
    wrappers around it.

    Link: https://lore.kernel.org/r/20190818090557.17853-2-hch@lst.de
    Signed-off-by: Christoph Hellwig
    Reviewed-by: Ira Weiny
    Reviewed-by: Dan Williams
    Tested-by: Bharata B Rao
    Reviewed-by: Jason Gunthorpe
    Signed-off-by: Jason Gunthorpe

    Christoph Hellwig
     

19 Jul, 2019

2 commits

  • find_next_iomem_res() turns out to be a source of overhead in dax
    benchmarks.

    Improve performance by not considering children of the tree if the top
    level does not match. Since the range of a parent should include the
    ranges of its children, checking the children is redundant.

    Running sysbench on dax (pmem emulation, with write_cache disabled):

    sysbench fileio --file-total-size=3G --file-test-mode=rndwr \
    --file-io-mode=mmap --threads=4 --file-fsync-mode=fdatasync run

    Provides the following results:

    events (avg/stddev)
    -------------------
    5.2-rc3: 1247669.0000/16075.39
    w/patch: 1286320.5000/16402.72 (+3%)

    Link: http://lkml.kernel.org/r/20190613045903.4922-3-namit@vmware.com
    Signed-off-by: Nadav Amit
    Cc: Borislav Petkov
    Cc: Toshi Kani
    Cc: Peter Zijlstra
    Cc: Dave Hansen
    Cc: Dan Williams
    Cc: Bjorn Helgaas
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nadav Amit
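
    The pruning idea in this commit can be sketched on a small tree:
    if a node ends before the searched range starts, none of its children
    can overlap the range either, so the walk jumps straight to the next
    sibling. This is an illustrative user-space sketch; struct node,
    next_skip_children(), next_node() and find_overlap() are hypothetical
    names, and the visited counter exists only to make the saving
    observable.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct node {
	uint64_t start;
	uint64_t end;		/* inclusive */
	unsigned int flags;
	struct node *parent, *sibling, *child;
};

/* Next node in pre-order, without descending into p's children. */
static struct node *next_skip_children(struct node *p)
{
	while (!p->sibling && p->parent)
		p = p->parent;
	return p->sibling;
}

/* Next node in pre-order, descending into children first. */
static struct node *next_node(struct node *p)
{
	return p->child ? p->child : next_skip_children(p);
}

/* Find the first node overlapping [start, end] that has all 'flags' set. */
static struct node *find_overlap(struct node *root, uint64_t start,
				 uint64_t end, unsigned int flags,
				 unsigned int *visited)
{
	struct node *p = root->child;

	assert(start <= end);
	*visited = 0;

	while (p) {
		(*visited)++;
		if (p->start > end)
			return NULL;	/* walked past the range */
		if (p->end < start) {
			/* Parent misses the range, so do its children. */
			p = next_skip_children(p);
			continue;
		}
		if ((p->flags & flags) == flags)
			return p;
		p = next_node(p);	/* overlap but wrong type: look deeper */
	}
	return NULL;
}
```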
     
  • Since resources can be removed, locking should ensure that the resource
    is not removed while accessing it. However, find_next_iomem_res() does
    not hold the lock while copying the data of the resource.

    Keep holding the lock while the data is copied. While at it, change the
    return value to a more informative value. It is disregarded by the
    callers.

    [akpm@linux-foundation.org: fix find_next_iomem_res() documentation]
    Link: http://lkml.kernel.org/r/20190613045903.4922-2-namit@vmware.com
    Fixes: ff3cc952d3f00 ("resource: Add remove_resource interface")
    Signed-off-by: Nadav Amit
    Reviewed-by: Andrew Morton
    Reviewed-by: Dan Williams
    Cc: Borislav Petkov
    Cc: Toshi Kani
    Cc: Peter Zijlstra
    Cc: Dave Hansen
    Cc: Bjorn Helgaas
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nadav Amit
     

03 Jul, 2019

1 commit


21 May, 2019

1 commit

  • Add SPDX license identifiers to all files which:

    - Have no license information of any form

    - Have EXPORT_.*_SYMBOL_GPL inside which was used in the
    initial scan/conversion to ignore the file

    These files fall under the project license, GPL v2 only. The resulting SPDX
    license identifier is:

    GPL-2.0-only

    Signed-off-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     

19 Apr, 2019

1 commit

  • The three checks in region_intersects() are basically an open-coded version
    of resource_overlaps() - so use the real thing.

    Also fix typos in comments while at it.

    Signed-off-by: Wei Yang
    Reviewed-by: Like Xu
    Reviewed-by: Yuan Yao
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: akpm@linux-foundation.org
    Cc: bhelgaas@google.com
    Cc: bp@suse.de
    Cc: dan.j.williams@intel.com
    Cc: jack@suse.cz
    Cc: rdunlap@infradead.org
    Cc: tiwai@suse.de
    Link: http://lkml.kernel.org/r/20190305083432.23675-1-richardw.yang@linux.intel.com
    [ Rewrote the changelog. ]
    Signed-off-by: Ingo Molnar

    Wei Yang
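
    To illustrate why an open-coded three-way check can be replaced by a
    single overlap test, the sketch below compares both on inclusive
    ranges. The three conditions are an assumed reconstruction of such an
    open-coded check (start inside, end inside, or fully covering), not a
    quote of the removed kernel code; all names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct range {
	uint64_t start;
	uint64_t end;	/* inclusive */
};

/* Open-coded variant: three separate conditions. */
static bool overlaps_three_checks(const struct range *p,
				  uint64_t start, uint64_t end)
{
	bool start_inside = start >= p->start && start <= p->end;
	bool end_inside = end >= p->start && end <= p->end;
	bool covers = p->start >= start && p->end <= end;

	assert(start <= end);
	return start_inside || end_inside || covers;
}

/* Single test: two inclusive ranges overlap iff each starts before the
 * other ends. */
static bool overlaps_single_check(const struct range *p,
				  uint64_t start, uint64_t end)
{
	assert(start <= end);
	return p->start <= end && p->end >= start;
}

/* Exhaustively compare both variants for all ranges below 'limit'. */
static bool checks_agree(uint64_t limit)
{
	for (uint64_t ps = 0; ps < limit; ps++)
		for (uint64_t pe = ps; pe < limit; pe++)
			for (uint64_t s = 0; s < limit; s++)
				for (uint64_t e = s; e < limit; e++) {
					struct range p = { ps, pe };

					if (overlaps_three_checks(&p, s, e) !=
					    overlaps_single_check(&p, s, e))
						return false;
				}
	return true;
}
```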
     

17 Mar, 2019

1 commit

  • Pull device-dax updates from Dan Williams:
    "New device-dax infrastructure to allow persistent memory and other
    "reserved" / performance differentiated memories, to be assigned to
    the core-mm as "System RAM".

    Some users want to use persistent memory as additional volatile
    memory. They are willing to cope with potential performance
    differences, for example between DRAM and 3D Xpoint, and want to use
    typical Linux memory management apis rather than a userspace memory
    allocator layered over an mmap() of a dax file. The administration
    model is to decide how much Persistent Memory (pmem) to use as System
    RAM, create a device-dax-mode namespace of that size, and then assign
    it to the core-mm. The rationale for device-dax is that it is a
    generic memory-mapping driver that can be layered over any "special
    purpose" memory, not just pmem. On subsequent boots udev rules can be
    used to restore the memory assignment.

    One implication of using pmem as RAM is that mlock() no longer keeps
    data off persistent media. For this reason it is recommended to enable
    NVDIMM Security (previously merged for 5.0) to encrypt pmem contents
    at rest. We considered making this recommendation an actively enforced
    requirement, but in the end decided to leave it as a distribution /
    administrator policy to allow for emulation and test environments that
    lack security capable NVDIMMs.

    Summary:

    - Replace the /sys/class/dax device model with /sys/bus/dax, and
    include a compat driver so distributions can opt-in to the new ABI.

    - Allow for an alternative driver for the device-dax address-range

    - Introduce the 'kmem' driver to hotplug / assign a device-dax
    address-range to the core-mm.

    - Arrange for the device-dax target-node to be onlined so that the
    newly added memory range can be uniquely referenced by numa apis"

    NOTE! I'm not entirely happy with the whole "PMEM as RAM" model because
    we currently have special - and very annoying rules in the kernel about
    accessing PMEM only with the "MC safe" accessors, because machine checks
    inside the regular repeat string copy functions can be fatal in some
    (not described) circumstances.

    And apparently the PMEM modules can cause that a lot more than regular
    RAM. The argument is that this happens because PMEM doesn't necessarily
    get scrubbed at boot like RAM does, but that is planned to be added for
    the user space tooling.

    Quoting Dan from another email:
    "The exposure can be reduced in the volatile-RAM case by scanning for
    and clearing errors before it is onlined as RAM. The userspace tooling
    for that can be in place before v5.1-final. There's also runtime
    notifications of errors via acpi_nfit_uc_error_notify() from
    background scrubbers on the DIMM devices. With that mechanism the
    kernel could proactively clear newly discovered poison in the volatile
    case, but that would be additional development more suitable for v5.2.

    I understand the concern, and the need to highlight this issue by
    tapping the brakes on feature development, but I don't see PMEM as RAM
    making the situation worse when the exposure is also there via DAX in
    the PMEM case. Volatile-RAM is arguably a safer use case since it's
    possible to repair pages where the persistent case needs active
    application coordination"

    * tag 'devdax-for-5.1' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
    device-dax: "Hotplug" persistent memory for use like normal RAM
    mm/resource: Let walk_system_ram_range() search child resources
    mm/memory-hotplug: Allow memory resources to be children
    mm/resource: Move HMM pr_debug() deeper into resource code
    mm/resource: Return real error codes from walk failures
    device-dax: Add a 'modalias' attribute to DAX 'bus' devices
    device-dax: Add a 'target_node' attribute
    device-dax: Auto-bind device after successful new_id
    acpi/nfit, device-dax: Identify differentiated memory with a unique numa-node
    device-dax: Add /sys/class/dax backwards compatibility
    device-dax: Add support for a dax override driver
    device-dax: Move resource pinning+mapping into the common driver
    device-dax: Introduce bus + driver model
    device-dax: Start defining a dax bus model
    device-dax: Remove multi-resource infrastructure
    device-dax: Kill dax_region base
    device-dax: Kill dax_region ida

    Linus Torvalds
     

01 Mar, 2019

3 commits

  • In the process of onlining memory, we use walk_system_ram_range()
    to find the actual RAM areas inside of the area being onlined.

    However, it currently only finds memory resources which are
    "top-level" iomem_resources. Children are not currently
    searched which causes it to skip System RAM in areas like this
    (in the format of /proc/iomem):

    a0000000-bfffffff : Persistent Memory (legacy)
      a0000000-afffffff : System RAM

    Changing the true->false here allows children to be searched
    as well. We need this because we add a new "System RAM"
    resource underneath the "persistent memory" resource when
    we use persistent memory in a volatile mode.

    Signed-off-by: Dave Hansen
    Cc: Keith Busch
    Cc: Dan Williams
    Cc: Dave Jiang
    Cc: Ross Zwisler
    Cc: Vishal Verma
    Cc: Tom Lendacky
    Cc: Andrew Morton
    Cc: Michal Hocko
    Cc: linux-nvdimm@lists.01.org
    Cc: linux-kernel@vger.kernel.org
    Cc: linux-mm@kvack.org
    Cc: Huang Ying
    Cc: Fengguang Wu
    Cc: Borislav Petkov
    Cc: Bjorn Helgaas
    Cc: Yaowei Bai
    Cc: Takashi Iwai
    Cc: Jerome Glisse
    Signed-off-by: Dan Williams

    Dave Hansen
     
  • HMM consumes physical address space for its own use, even
    though nothing is mapped or accessible there. It uses a
    special resource description (IORES_DESC_DEVICE_PRIVATE_MEMORY)
    to uniquely identify these areas.

    When HMM consumes address space, it makes a best guess about
    what to consume. However, it is possible that a future memory
    or device hotplug can collide with the reserved area. In the
    case of these conflicts, there is an error message in
    register_memory_resource().

    Later patches in this series move register_memory_resource()
    from using request_resource_conflict() to __request_region().
    Unfortunately, __request_region() does not return the conflict
    like the previous function did, which makes it impossible to
    check for IORES_DESC_DEVICE_PRIVATE_MEMORY in a conflicting
    resource.

    Instead of warning in register_memory_resource(), move the
    check into the core resource code itself (__request_region())
    where the conflicting resource _is_ available. This has the
    added bonus of producing a warning in case of HMM conflicts
    with devices *or* RAM address space, as opposed to the RAM-
    only warnings that were there previously.

    Signed-off-by: Dave Hansen
    Reviewed-by: Jerome Glisse
    Cc: Dan Williams
    Cc: Dave Jiang
    Cc: Ross Zwisler
    Cc: Vishal Verma
    Cc: Tom Lendacky
    Cc: Andrew Morton
    Cc: Michal Hocko
    Cc: linux-nvdimm@lists.01.org
    Cc: linux-kernel@vger.kernel.org
    Cc: linux-mm@kvack.org
    Cc: Huang Ying
    Cc: Fengguang Wu
    Cc: Keith Busch
    Signed-off-by: Dan Williams

    Dave Hansen
     
  • walk_system_ram_range() can return an error code either because
    *it* failed, or because the 'func' that it calls returned an
    error. The memory hotplug code does the following:

        ret = walk_system_ram_range(..., func);
        if (ret)
                return ret;

    and 'ret' makes it out to userspace, eventually. The problem
    is, walk_system_ram_range() failures that result from *it* failing
    (as opposed to 'func') return -1. That leads to a very odd
    -EPERM (-1) return code out to userspace.

    Make walk_system_ram_range() return -EINVAL for internal
    failures to keep userspace less confused.

    This return code is compatible with all the callers that I
    audited.

    Signed-off-by: Dave Hansen
    Reviewed-by: Bjorn Helgaas
    Acked-by: Michael Ellerman (powerpc)
    Cc: Dan Williams
    Cc: Dave Jiang
    Cc: Ross Zwisler
    Cc: Vishal Verma
    Cc: Tom Lendacky
    Cc: Andrew Morton
    Cc: Michal Hocko
    Cc: linux-nvdimm@lists.01.org
    Cc: linux-kernel@vger.kernel.org
    Cc: linux-mm@kvack.org
    Cc: Huang Ying
    Cc: Fengguang Wu
    Cc: Borislav Petkov
    Cc: Yaowei Bai
    Cc: Takashi Iwai
    Cc: Jerome Glisse
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: linuxppc-dev@lists.ozlabs.org
    Cc: Keith Busch
    Signed-off-by: Dan Williams

    Dave Hansen
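
    A short sketch of the return-code issue from this commit: a bare -1
    that reaches userspace is indistinguishable from -EPERM, because EPERM
    is errno 1 on Linux. The walker below is a hypothetical single-range
    stand-in (walk_range(), walk_fn and always_ok() are not kernel names)
    that reports its own failure as -EINVAL and passes the callback's
    result through unchanged.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

typedef int (*walk_fn)(unsigned long start_pfn, unsigned long nr_pages,
		       void *arg);

/*
 * Call func on one range. If there is nothing to walk, that is a failure
 * of the walker itself and is reported as -EINVAL rather than -1.
 */
static int walk_range(unsigned long start_pfn, unsigned long nr_pages,
		      void *arg, walk_fn func)
{
	int ret = -EINVAL;

	assert(func != NULL);
	if (nr_pages)
		ret = func(start_pfn, nr_pages, arg);
	return ret;
}

/* Example callback that accepts every range. */
static int always_ok(unsigned long start_pfn, unsigned long nr_pages,
		     void *arg)
{
	(void)start_pfn;
	(void)nr_pages;
	(void)arg;
	return 0;
}
```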
     

04 Feb, 2019

1 commit

  • Since commit c40dd2f76644 ("powerpc: Add System RAM to /proc/iomem")
    it is possible to use the generic walk_system_ram_range() and
    the generic page_is_ram().

    To enable the use of walk_system_ram_range() by the IBM EHEA ethernet
    driver, we still need an export of the generic function.

    As powerpc was the only user of CONFIG_ARCH_HAS_WALK_MEMORY, the
    ifdef around the generic walk_system_ram_range() has become useless
    and can be dropped.

    Fixes: c40dd2f76644 ("powerpc: Add System RAM to /proc/iomem")
    Signed-off-by: Christophe Leroy
    [mpe: Keep the EXPORT_SYMBOL_GPL in powerpc code]
    Signed-off-by: Michael Ellerman

    Christophe Leroy
     

29 Dec, 2018

1 commit

  • This is a preparation for the next patch.

    Currently, we only call release_mem_region_adjustable() in __remove_pages
    if the zone is not ZONE_DEVICE, because resources that belong to HMM/devm
    are being released by themselves with devm_release_mem_region.

    Since we do not want to touch any zone/page stuff during the removing of
    the memory (but during the offlining), we do not want to check for the
    zone here. So we need another way to tell release_mem_region_adjustable()
    to not release the resource in case it belongs to HMM/devm.

    HMM/devm acquires/releases a resource through
    devm_request_mem_region/devm_release_mem_region.

    These resources have the flag IORESOURCE_MEM, while resources acquired by
    hot-add memory path (register_memory_resource()) contain
    IORESOURCE_SYSTEM_RAM.

    So, we can check for this flag in release_mem_region_adjustable, and if
    the resource does not contain such flag, we know that we are dealing with
    a HMM/devm resource, so we can back off.

    Link: http://lkml.kernel.org/r/20181127162005.15833-3-osalvador@suse.de
    Signed-off-by: Oscar Salvador
    Reviewed-by: David Hildenbrand
    Reviewed-by: Pavel Tatashin
    Cc: Dan Williams
    Cc: Jerome Glisse
    Cc: Jonathan Cameron
    Cc: Michal Hocko
    Cc: Oscar Salvador
    Cc: "Rafael J. Wysocki"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oscar Salvador
     

07 Nov, 2018

1 commit

  • Add the missing kernel-doc style function parameters documentation.

    Signed-off-by: Borislav Petkov
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: akpm@linux-foundation.org
    Cc: linux-tip-commits@vger.kernel.org
    Cc: rdunlap@infradead.org
    Fixes: b69c2e20f6e4 ("resource: Clean it up a bit")
    Link: http://lkml.kernel.org/r/20181105093307.GA12445@zn.tnic
    Signed-off-by: Ingo Molnar

    Borislav Petkov
     

05 Nov, 2018

1 commit

  • The first group of warnings is caused by a "/**" kernel-doc notation
    marker but the function comments are not in kernel-doc format.
    Also add another error return value here.

    ../kernel/resource.c:337: warning: Function parameter or member 'start' not described in 'find_next_iomem_res'
    ../kernel/resource.c:337: warning: Function parameter or member 'end' not described in 'find_next_iomem_res'
    ../kernel/resource.c:337: warning: Function parameter or member 'flags' not described in 'find_next_iomem_res'
    ../kernel/resource.c:337: warning: Function parameter or member 'desc' not described in 'find_next_iomem_res'
    ../kernel/resource.c:337: warning: Function parameter or member 'first_lvl' not described in 'find_next_iomem_res'
    ../kernel/resource.c:337: warning: Function parameter or member 'res' not described in 'find_next_iomem_res'

    Add the missing function parameter documentation for the other warnings:

    ../kernel/resource.c:409: warning: Function parameter or member 'arg' not described in 'walk_iomem_res_desc'
    ../kernel/resource.c:409: warning: Function parameter or member 'func' not described in 'walk_iomem_res_desc'

    Signed-off-by: Randy Dunlap
    Cc: Andrew Morton
    Cc: Borislav Petkov
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Fixes: b69c2e20f6e4 ("resource: Clean it up a bit")
    Link: http://lkml.kernel.org/r/dda2e4d8-bedd-3167-20fe-8c7d2d35b354@infradead.org
    Signed-off-by: Ingo Molnar

    Randy Dunlap
     

09 Oct, 2018

3 commits

  • - Drop BUG_ON()s and do normal error handling instead, in
    find_next_iomem_res().

    - Align function arguments on opening braces.

    - Get rid of local var sibling_only in find_next_iomem_res().

    - Shorten unnecessarily long first_level_children_only arg name.

    Signed-off-by: Borislav Petkov
    CC: Andrew Morton
    CC: Bjorn Helgaas
    CC: Brijesh Singh
    CC: Dan Williams
    CC: H. Peter Anvin
    CC: Lianbo Jiang
    CC: Takashi Iwai
    CC: Thomas Gleixner
    CC: Tom Lendacky
    CC: Vivek Goyal
    CC: Yaowei Bai
    CC: bhe@redhat.com
    CC: dan.j.williams@intel.com
    CC: dyoung@redhat.com
    CC: kexec@lists.infradead.org
    CC: mingo@redhat.com

    Borislav Petkov
     
  • Previously find_next_iomem_res() used "*res" as both an input parameter for
    the range to search and the type of resource to search for, and an output
    parameter for the resource we found, which makes the interface confusing.

    The current callers use find_next_iomem_res() incorrectly because they
    allocate a single struct resource and use it for repeated calls to
    find_next_iomem_res(). When find_next_iomem_res() returns a resource, it
    overwrites the start, end, flags, and desc members of the struct. If we
    call find_next_iomem_res() again, we must update or restore these fields.
    The previous code restored res.start and res.end, but not res.flags or
    res.desc.

    Since the callers did not restore res.flags, if they searched for flags
    IORESOURCE_MEM | IORESOURCE_BUSY and found a resource with flags
    IORESOURCE_MEM | IORESOURCE_BUSY | IORESOURCE_SYSRAM, the next search would
    incorrectly skip resources unless they were also marked as
    IORESOURCE_SYSRAM.

    Fix this by restructuring the interface so it takes explicit "start, end,
    flags" parameters and uses "*res" only as an output parameter.

    Based on a patch by Lianbo Jiang.

    [ bp: While at it:
    - make comments kernel-doc style.
    -

    Originally-by: http://lore.kernel.org/lkml/20180921073211.20097-2-lijiang@redhat.com
    Signed-off-by: Bjorn Helgaas
    Signed-off-by: Borislav Petkov
    CC: Andrew Morton
    CC: Brijesh Singh
    CC: Dan Williams
    CC: H. Peter Anvin
    CC: Lianbo Jiang
    CC: Takashi Iwai
    CC: Thomas Gleixner
    CC: Tom Lendacky
    CC: Vivek Goyal
    CC: Yaowei Bai
    CC: bhe@redhat.com
    CC: dan.j.williams@intel.com
    CC: dyoung@redhat.com
    CC: kexec@lists.infradead.org
    CC: mingo@redhat.com
    CC: x86-ml
    Link: http://lkml.kernel.org/r/153805812916.1157.177580438135143788.stgit@bhelgaas-glaptop.roam.corp.google.com

    Bjorn Helgaas
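
    The interface change in this commit can be sketched in user space: the
    search range and flags are passed by value, and the result structure
    is written only as output, so a caller looping over several matches
    never has to restore its search criteria. This is a simplified sketch
    over a sorted array, not the kernel function; struct res, the F_*
    bits, find_next() and count_matches() are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define F_MEM    0x1u	/* hypothetical flag bits */
#define F_BUSY   0x2u
#define F_SYSRAM 0x4u

struct res {
	uint64_t start;
	uint64_t end;		/* inclusive */
	unsigned int flags;
};

/* Find the first entry overlapping [start, end] with all 'flags' set. */
static bool find_next(const struct res *tbl, size_t n, uint64_t start,
		      uint64_t end, unsigned int flags, struct res *out)
{
	assert(start <= end);

	for (size_t i = 0; i < n; i++) {
		const struct res *p = &tbl[i];

		if (p->start > end)
			break;	/* sorted table: nothing further can match */
		if (p->end < start || (p->flags & flags) != flags)
			continue;

		/* 'out' is output only; the search inputs stay untouched. */
		out->start = p->start > start ? p->start : start;
		out->end = p->end < end ? p->end : end;
		out->flags = p->flags;
		return true;
	}
	return false;
}

/* Count all matches by repeatedly searching past the previous hit. */
static size_t count_matches(const struct res *tbl, size_t n, uint64_t start,
			    uint64_t end, unsigned int flags)
{
	struct res found;
	size_t count = 0;

	while (start <= end && find_next(tbl, n, start, end, flags, &found)) {
		count++;
		if (found.end == UINT64_MAX)
			break;
		start = found.end + 1;
	}
	return count;
}
```

    Because 'flags' is never overwritten with the flags of the resource
    that was found, a hit that happens to carry an extra bit does not
    narrow the following searches.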
     
  • find_next_iomem_res() finds an iomem resource that covers part of a range
    described by "start, end". All callers expect that range to be inclusive,
    i.e., both start and end are included, but find_next_iomem_res() doesn't
    handle the end address correctly.

    If it finds an iomem resource that contains exactly the end address, it
    skips it, e.g., if "start, end" is [0x0-0x10000] and there happens to be an
    iomem resource [mem 0x10000-0x10000] (the single byte at 0x10000), we skip
    it:

      find_next_iomem_res(...)
      {
        start = 0x0;
        end = 0x10000;
        for (p = next_resource(...)) {
          # p->start = 0x10000;
          # p->end = 0x10000;
          # we *should* return this resource, but this condition is false:
          if ((p->end >= start) && (p->start < end))
            break;

    Adjust find_next_iomem_res() so it allows a resource that includes the
    single byte at the end of the range. This is a corner case that we
    probably don't see in practice.

    Fixes: 58c1b5b07907 ("[PATCH] memory hotadd fixes: find_next_system_ram catch range fix")
    Signed-off-by: Bjorn Helgaas
    Signed-off-by: Borislav Petkov
    CC: Andrew Morton
    CC: Brijesh Singh
    CC: Dan Williams
    CC: H. Peter Anvin
    CC: Lianbo Jiang
    CC: Takashi Iwai
    CC: Thomas Gleixner
    CC: Tom Lendacky
    CC: Vivek Goyal
    CC: Yaowei Bai
    CC: bhe@redhat.com
    CC: dan.j.williams@intel.com
    CC: dyoung@redhat.com
    CC: kexec@lists.infradead.org
    CC: mingo@redhat.com
    CC: x86-ml
    Link: http://lkml.kernel.org/r/153805812254.1157.16736368485811773752.stgit@bhelgaas-glaptop.roam.corp.google.com

    Bjorn Helgaas
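
The corner case in the find_next_iomem_res() commit above can be sketched in standalone C. This is only an illustration, not the kernel code: `struct range` and both helper names are hypothetical, resources are modelled as inclusive [start, end] pairs, and the "fixed" condition assumes the adjustment amounts to comparing with `<=` instead of `<`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a resource: an inclusive [start, end] range. */
struct range {
    uint64_t start;
    uint64_t end;
};

/* Condition quoted in the commit message: misses a resource that
 * begins exactly at 'end'. */
static bool overlaps_old(const struct range *p, uint64_t start, uint64_t end)
{
    assert(start <= end);
    return (p->end >= start) && (p->start < end);
}

/* Inclusive variant (assumed shape of the fix): a resource that starts
 * at 'end' still shares that one byte with the search range. */
static bool overlaps_inclusive(const struct range *p, uint64_t start, uint64_t end)
{
    assert(start <= end);
    return (p->end >= start) && (p->start <= end);
}
```

With the values from the commit message (search range 0x0-0x10000, resource 0x10000-0x10000), only the inclusive variant reports an overlap.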
     

09 Jun, 2018

1 commit

  • Pull libnvdimm updates from Dan Williams:
    "This adds a user for the new 'bytes-remaining' updates to
    memcpy_mcsafe() that you already received through Ingo via the
    x86-dax-for-linus pull.

    Not included here, but still targeting this cycle, is support for
    handling memory media errors (poison) consumed via userspace dax
    mappings.

    Summary:

    - DAX broke a fundamental assumption of truncate of file mapped
    pages. The truncate path assumed that it is safe to disconnect a
    pinned page from a file and let the filesystem reclaim the physical
    block. With DAX the page is equivalent to the filesystem block.
    Introduce dax_layout_busy_page() to enable filesystems to wait for
    pinned DAX pages to be released. Without this wait a filesystem
    could allocate blocks under active device-DMA to a new file.

    - DAX arranges for the block layer to be bypassed and uses
    dax_direct_access() + copy_to_iter() to satisfy read(2) calls.
    However, the memcpy_mcsafe() facility is available through the pmem
    block driver. In order to safely handle media errors, via the DAX
    block-layer bypass, introduce copy_to_iter_mcsafe().

    - Fix cache management policy relative to the ACPI NFIT Platform
    Capabilities Structure to properly elide cache flushes when they
    are not necessary. The table indicates whether CPU caches are
    power-fail protected. Clarify that a deep flush is always performed
    on REQ_{FUA,PREFLUSH} requests"

    * tag 'libnvdimm-for-4.18' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (21 commits)
    dax: Use dax_write_cache* helpers
    libnvdimm, pmem: Do not flush power-fail protected CPU caches
    libnvdimm, pmem: Unconditionally deep flush on *sync
    libnvdimm, pmem: Complete REQ_FLUSH => REQ_PREFLUSH
    acpi, nfit: Remove ecc_unit_size
    dax: dax_insert_mapping_entry always succeeds
    libnvdimm, e820: Register all pmem resources
    libnvdimm: Debug probe times
    linvdimm, pmem: Preserve read-only setting for pmem devices
    x86, nfit_test: Add unit test for memcpy_mcsafe()
    pmem: Switch to copy_to_iter_mcsafe()
    dax: Report bytes remaining in dax_iomap_actor()
    dax: Introduce a ->copy_to_iter dax operation
    uio, lib: Fix CONFIG_ARCH_HAS_UACCESS_MCSAFE compilation
    xfs, dax: introduce xfs_break_dax_layouts()
    xfs: prepare xfs_break_layouts() for another layout type
    xfs: prepare xfs_break_layouts() to be called with XFS_MMAPLOCK_EXCL
    mm, fs, dax: handle layout changes to pinned dax mappings
    mm: fix __gup_device_huge vs unmap
    mm: introduce MEMORY_DEVICE_FS_DAX and CONFIG_DEV_PAGEMAP_OPS
    ...

    Linus Torvalds
     

03 Jun, 2018

1 commit

  • There is currently a mismatch between the resources that will trigger
    the e820_pmem driver to register/load and the resources that will
    actually be surfaced as pmem ranges. register_e820_pmem() uses
    walk_iomem_res_desc() which includes children and siblings. In contrast,
    e820_pmem_probe() only considers top level resources. For example the
    following resource tree results in the driver being loaded, but no
    resources being registered:

    398000000000-39bfffffffff : PCI Bus 0000:ae
      39be00000000-39bf07ffffff : PCI Bus 0000:af
        39be00000000-39beffffffff : 0000:af:00.0
          39be10000000-39beffffffff : Persistent Memory (legacy)

    Fix this up to allow definitions of "legacy" pmem ranges anywhere in
    system-physical address space. Not that it is recommended or safe to
    define a pmem range in PCI space, but it is useful for debug /
    experimentation, and the restriction on being a top-level resource was
    arbitrary.

    Cc: Christoph Hellwig
    Signed-off-by: Dan Williams

    Dan Williams
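
The mismatch described in the commit above (one walker descends into children, the other only looks at top-level entries) can be illustrated with a small standalone sketch. Everything here is hypothetical: `struct res` is a stripped-down node with only a name and child/sibling links, and the two counting helpers are not kernel functions.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified resource node. */
struct res {
    const char *name;
    struct res *child;
    struct res *sibling;
};

/* Top-level-only scan: inspects the root's direct children, nothing deeper. */
static int count_top_level(const struct res *root, const char *name)
{
    int n = 0;

    assert(root != NULL);
    for (const struct res *p = root->child; p; p = p->sibling)
        if (strcmp(p->name, name) == 0)
            n++;
    return n;
}

/* Depth-first scan: visits children and siblings at every level. */
static int count_all(const struct res *node, const char *name)
{
    int n = 0;

    assert(node != NULL);
    for (const struct res *p = node->child; p; p = p->sibling) {
        if (strcmp(p->name, name) == 0)
            n++;
        n += count_all(p, name);
    }
    return n;
}
```

For a tree nested like the example in the commit message, the depth-first scan finds the "Persistent Memory (legacy)" entry while the top-level scan finds nothing, which is the situation where the driver loads but registers no ranges.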
     

16 May, 2018

1 commit


14 Apr, 2018

1 commit

  • We received a bug report of a kernel panic at boot on an x86-32 system,
    which turned out to be caused by an invalid PCI resource assigned after
    reallocation. __find_resource() first aligns the resource start address
    and resets the end address to start+size-1 accordingly, then checks
    whether it is contained. Here the end address may overflow the integer,
    yet resource_contains() still returns true because it only compares the
    start and end addresses. So this ends up returning an invalid resource
    (start > end).

    There was already an attempt to cover such a problem in commit
    47ea91b4052d ("Resource: fix wrong resource window calculation"), but
    this case was overlooked.

    This patch adds a validity check of the newly calculated resource to
    avoid the integer overflow problem.

    Bugzilla: http://bugzilla.opensuse.org/show_bug.cgi?id=1086739
    Link: http://lkml.kernel.org/r/s5hpo37d5l8.wl-tiwai@suse.de
    Fixes: 23c570a67448 ("resource: ability to resize an allocated resource")
    Signed-off-by: Takashi Iwai
    Reported-by: Michael Henders
    Tested-by: Michael Henders
    Reviewed-by: Andrew Morton
    Cc: Ram Pai
    Cc: Bjorn Helgaas
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Takashi Iwai
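
A minimal sketch of the overflow described in the commit above, assuming a 32-bit address type as on x86-32. The names `res_size_t`, `struct range`, `place()`, `contains()` and `fits()` are hypothetical; only the idea (reject start > end before the containment test) comes from the commit message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t res_size_t;    /* hypothetical 32-bit address type */

struct range {                  /* inclusive [start, end] */
    res_size_t start;
    res_size_t end;
};

/* Compares only the two end points, as the commit message describes. */
static bool contains(const struct range *outer, const struct range *inner)
{
    return outer->start <= inner->start && outer->end >= inner->end;
}

/* Round 'start' up to 'align' (a power of two) and recompute the end.
 * The end may wrap around; unsigned wrap-around is well defined in C. */
static struct range place(res_size_t start, res_size_t size, res_size_t align)
{
    struct range r;

    assert(align != 0 && (align & (align - 1)) == 0);
    r.start = (start + align - 1) & ~(align - 1);
    r.end = r.start + size - 1;
    return r;
}

/* Validity check first: a wrapped range has start > end and is rejected. */
static bool fits(const struct range *avail, const struct range *alloc)
{
    return alloc->start <= alloc->end && contains(avail, alloc);
}
```

Placing 0x10000000 bytes at 0xf8000000 wraps the end to 0x07ffffff; `contains()` alone still accepts that range inside 0xf0000000-0xffffffff, while `fits()` rejects it.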
     

07 Feb, 2018

2 commits

  • Merge misc updates from Andrew Morton:

    - kasan updates

    - procfs

    - lib/bitmap updates

    - other lib/ updates

    - checkpatch tweaks

    - rapidio

    - ubsan

    - pipe fixes and cleanups

    - lots of other misc bits

    * emailed patches from Andrew Morton: (114 commits)
    Documentation/sysctl/user.txt: fix typo
    MAINTAINERS: update ARM/QUALCOMM SUPPORT patterns
    MAINTAINERS: update various PALM patterns
    MAINTAINERS: update "ARM/OXNAS platform support" patterns
    MAINTAINERS: update Cortina/Gemini patterns
    MAINTAINERS: remove ARM/CLKDEV SUPPORT file pattern
    MAINTAINERS: remove ANDROID ION pattern
    mm: docs: add blank lines to silence sphinx "Unexpected indentation" errors
    mm: docs: fix parameter names mismatch
    mm: docs: fixup punctuation
    pipe: read buffer limits atomically
    pipe: simplify round_pipe_size()
    pipe: reject F_SETPIPE_SZ with size over UINT_MAX
    pipe: fix off-by-one error when checking buffer limits
    pipe: actually allow root to exceed the pipe buffer limits
    pipe, sysctl: remove pipe_proc_fn()
    pipe, sysctl: drop 'min' parameter from pipe-max-size converter
    kasan: rework Kconfig settings
    crash_dump: is_kdump_kernel can be boolean
    kernel/mutex: mutex_is_locked can be boolean
    ...

    Linus Torvalds
     
  • Make iomem_is_exclusive() return bool, since this function only ever
    returns one or zero.

    No functional change.

    Link: http://lkml.kernel.org/r/1513266622-15860-5-git-send-email-baiyaowei@cmss.chinamobile.com
    Signed-off-by: Yaowei Bai
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yaowei Bai
     

19 Dec, 2017

2 commits


07 Nov, 2017

3 commits

  • In order for memory pages to be properly mapped when SEV is active, it's
    necessary to use the PAGE_KERNEL protection attribute as the base
    protection. This ensures that memory mapping of, e.g. ACPI tables,
    receives the proper mapping attributes.

    Signed-off-by: Tom Lendacky
    Signed-off-by: Brijesh Singh
    Signed-off-by: Thomas Gleixner
    Reviewed-by: Borislav Petkov
    Tested-by: Borislav Petkov
    Cc: Laura Abbott
    Cc: Kees Cook
    Cc: kvm@vger.kernel.org
    Cc: Jérôme Glisse
    Cc: Borislav Petkov
    Cc: Andy Lutomirski
    Cc: Andrew Morton
    Cc: Dan Williams
    Cc: "Kirill A. Shutemov"
    Link: https://lkml.kernel.org/r/20171020143059.3291-11-brijesh.singh@amd.com

    Tom Lendacky
     
  • In preparation for a new function that will need additional resource
    information during the resource walk, update the resource walk callback to
    pass the resource structure. Since the current callback start and end
    arguments are pulled from the resource structure, the callback functions
    can obtain them from the resource structure directly.

    Signed-off-by: Tom Lendacky
    Signed-off-by: Brijesh Singh
    Signed-off-by: Thomas Gleixner
    Reviewed-by: Kees Cook
    Reviewed-by: Borislav Petkov
    Tested-by: Borislav Petkov
    Cc: kvm@vger.kernel.org
    Cc: Benjamin Herrenschmidt
    Cc: Borislav Petkov
    Cc: linuxppc-dev@lists.ozlabs.org
    Link: https://lkml.kernel.org/r/20171020143059.3291-10-brijesh.singh@amd.com

    Tom Lendacky
     
  • The walk_iomem_res_desc(), walk_system_ram_res() and walk_system_ram_range()
    functions each have much of the same code.

    Create a new function that consolidates the common code from these
    functions in one place to reduce the amount of duplicated code.

    Signed-off-by: Tom Lendacky
    Signed-off-by: Brijesh Singh
    Signed-off-by: Thomas Gleixner
    Reviewed-by: Borislav Petkov
    Tested-by: Borislav Petkov
    Cc: kvm@vger.kernel.org
    Cc: Borislav Petkov
    Link: https://lkml.kernel.org/r/20171020143059.3291-9-brijesh.singh@amd.com

    Tom Lendacky
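
The two walker commits above (one shared walk routine, and a callback that receives the resource itself instead of separate start/end values) can be sketched as follows. This is a simplified illustration under stated assumptions: resources live in a flat array rather than a tree, and `struct res`, `walk_fn`, `walk_range()`, `struct flag_count` and `count_flagged()` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified resource: inclusive range plus flags. */
struct res {
    uint64_t start;
    uint64_t end;
    unsigned long flags;
};

/* The callback gets the whole resource, so it can read flags as well
 * as start and end. */
typedef int (*walk_fn)(struct res *r, void *arg);

/* Single shared walker: call 'func' for every entry overlapping
 * [start, end]; stop at the first non-zero return value. */
static int walk_range(struct res *table, size_t n, uint64_t start,
                      uint64_t end, void *arg, walk_fn func)
{
    assert(start <= end);
    for (size_t i = 0; i < n; i++) {
        struct res *p = &table[i];
        int ret;

        if (p->end < start || p->start > end)
            continue;
        ret = func(p, arg);
        if (ret)
            return ret;
    }
    return 0;
}

/* Example callback that needs more than start/end: count by flag. */
struct flag_count {
    unsigned long flag;
    int hits;
};

static int count_flagged(struct res *r, void *arg)
{
    struct flag_count *fc = arg;

    if (r->flags & fc->flag)
        fc->hits++;
    return 0;
}
```

A callback that only needs the range can still read `r->start` and `r->end` directly, which is why the separate arguments become redundant once the structure is passed.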
     

15 Apr, 2016

1 commit


10 Mar, 2016

3 commits

  • insert_resource() and remove_resource() are called by producers
    of resources, such as FW modules and bus drivers. These modules
    may be implemented as loadable modules.

    Export insert_resource() and remove_resource() so that they can
    be called from such modules.

    Link: https://lkml.org/lkml/2016/3/8/872
    Signed-off-by: Toshi Kani
    Cc: Linus Torvalds
    Cc: Ingo Molnar
    Cc: Borislav Petkov
    Cc: Andrew Morton
    Cc: Dan Williams
    Signed-off-by: Dan Williams

    Toshi Kani
     
  • insert_resource() and insert_resource_conflict() are called
    by resource producers to insert a new resource. When there
    is any conflict, they move conflicting resources down to the
    children of the new resource. There is no destructor for these
    interfaces, however.

    Add remove_resource(), which removes a resource previously
    inserted by insert_resource() or insert_resource_conflict(),
    and moves the children up to where they were before.

    __release_resource() is changed to have @release_child, so
    that this function can be used for remove_resource() as well.

    Also add comments to clarify that these functions are intended
    for producers of resources to avoid any confusion with
    request/release_resource() for consumers.

    Signed-off-by: Toshi Kani
    Cc: Ingo Molnar
    Cc: Borislav Petkov
    Cc: Andrew Morton
    Cc: Dan Williams
    Signed-off-by: Dan Williams

    Toshi Kani
     
  • __request_region() sets 'flags' of a new resource from @parent
    as it inherits the parent's attributes. When a target resource
    has a conflict, this function inserts the new resource entry
    under the conflicted entry by updating @parent. In this case,
    the new resource entry needs to inherit attributes from the updated
    parent. This conflict is a typical case since __request_region()
    is used to allocate a new resource from a specific resource range.

    For instance, request_mem_region() calls __request_region() with
    @parent set to &iomem_resource, which is the root entry of the
    whole iomem range. When this request results in inserting a new
    entry "DEV-A" under "BUS-1", "DEV-A" needs to inherit from the
    immediate parent "BUS-1" as it holds the specific attributes for
    the range.

    root (&iomem_resource)
     :
     + "BUS-1"
       + "DEV-A"

    Change __request_region() to set 'flags' and 'desc' of a new entry
    from the immediate parent.

    Signed-off-by: Toshi Kani
    Cc: Ingo Molnar
    Cc: Borislav Petkov
    Cc: Andrew Morton
    Cc: Dan Williams
    Signed-off-by: Dan Williams

    Toshi Kani
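
The inheritance rule from the __request_region() commit above (take attributes from the immediate parent, not from the parent originally passed in) can be sketched like this. It is a heavy simplification and not the kernel implementation: partial overlaps, busy checks and sibling ordering are ignored, and `struct res`, `immediate_parent()` and `request_region_sketch()` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified resource node. */
struct res {
    uint64_t start;
    uint64_t end;               /* inclusive */
    unsigned long flags;
    struct res *parent;
    struct res *child;
    struct res *sibling;
};

/* Descend from 'root' to the deepest entry that fully contains
 * [start, end]; that entry is the immediate parent of a new region. */
static struct res *immediate_parent(struct res *root, uint64_t start,
                                    uint64_t end)
{
    struct res *parent = root;

    assert(start <= end);
    for (;;) {
        struct res *p;

        for (p = parent->child; p; p = p->sibling)
            if (p->start <= start && p->end >= end)
                break;
        if (!p)
            return parent;
        parent = p;
    }
}

/* Link 'new_res' under its immediate parent and inherit that parent's
 * flags. Simplification: the entry is pushed at the head of the child
 * list instead of being kept sorted by address. */
static void request_region_sketch(struct res *root, struct res *new_res,
                                  uint64_t start, uint64_t end)
{
    struct res *parent = immediate_parent(root, start, end);

    new_res->start = start;
    new_res->end = end;
    new_res->flags = parent->flags;
    new_res->parent = parent;
    new_res->child = NULL;
    new_res->sibling = parent->child;
    parent->child = new_res;
}
```

In the "BUS-1" / "DEV-A" example, a request made against the root lands under "BUS-1" and picks up the flags of "BUS-1" rather than those of the root.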
     

04 Mar, 2016

1 commit


21 Feb, 2016

1 commit

  • In __request_region, if a conflict with a BUSY and MUXED resource is
    detected, then the caller goes to sleep and waits for the resource to be
    released. A pointer to the conflicting resource is kept. At wake-up
    this pointer is used as a parent to retry to request the region.

    A first problem is that this pointer might well be invalid (if, for
    example, the conflicting resource has already been freed). Another
    problem is that the next call to __request_region() fails to detect a
    remaining conflict. The previously conflicting resource is passed as a
    parameter and __request_region() will look for a conflict among the
    children of this resource and not at the resource itself. It is likely
    to succeed anyway, even if there is still a conflict.

    Instead, the parent of the conflicting resource should be passed to
    __request_region().

    As a fix, this patch doesn't update the parent resource pointer in the
    case we have to wait for a muxed region right after.

    Reported-and-tested-by: Vincent Pelletier
    Signed-off-by: Simon Guinot
    Tested-by: Vincent Donnefort
    Cc: stable@kernel.org
    Signed-off-by: Linus Torvalds

    Simon Guinot