17 Mar, 2019

1 commit

  • Pull device-dax updates from Dan Williams:
    "New device-dax infrastructure to allow persistent memory and other
    "reserved" / performance differentiated memories, to be assigned to
    the core-mm as "System RAM".

    Some users want to use persistent memory as additional volatile
    memory. They are willing to cope with potential performance
    differences, for example between DRAM and 3D Xpoint, and want to use
    typical Linux memory management apis rather than a userspace memory
    allocator layered over an mmap() of a dax file. The administration
    model is to decide how much Persistent Memory (pmem) to use as System
    RAM, create a device-dax-mode namespace of that size, and then assign
    it to the core-mm. The rationale for device-dax is that it is a
    generic memory-mapping driver that can be layered over any "special
    purpose" memory, not just pmem. On subsequent boots udev rules can be
    used to restore the memory assignment.

    One implication of using pmem as RAM is that mlock() no longer keeps
    data off persistent media. For this reason it is recommended to enable
    NVDIMM Security (previously merged for 5.0) to encrypt pmem contents
    at rest. We considered making this recommendation an actively enforced
    requirement, but in the end decided to leave it as a distribution /
    administrator policy to allow for emulation and test environments that
    lack security capable NVDIMMs.

    Summary:

    - Replace the /sys/class/dax device model with /sys/bus/dax, and
    include a compat driver so distributions can opt-in to the new ABI.

    - Allow for an alternative driver for the device-dax address-range.

    - Introduce the 'kmem' driver to hotplug / assign a device-dax
    address-range to the core-mm.

    - Arrange for the device-dax target-node to be onlined so that the
    newly added memory range can be uniquely referenced by numa apis"

    NOTE! I'm not entirely happy with the whole "PMEM as RAM" model because
    we currently have special - and very annoying - rules in the kernel about
    accessing PMEM only with the "MC safe" accessors, because machine checks
    inside the regular repeat string copy functions can be fatal in some
    (not described) circumstances.

    And apparently the PMEM modules can cause that a lot more than regular
    RAM. The argument is that this happens because PMEM doesn't necessarily
    get scrubbed at boot like RAM does, but that is planned to be added for
    the user space tooling.

    Quoting Dan from another email:
    "The exposure can be reduced in the volatile-RAM case by scanning for
    and clearing errors before it is onlined as RAM. The userspace tooling
    for that can be in place before v5.1-final. There's also runtime
    notifications of errors via acpi_nfit_uc_error_notify() from
    background scrubbers on the DIMM devices. With that mechanism the
    kernel could proactively clear newly discovered poison in the volatile
    case, but that would be additional development more suitable for v5.2.

    I understand the concern, and the need to highlight this issue by
    tapping the brakes on feature development, but I don't see PMEM as RAM
    making the situation worse when the exposure is also there via DAX in
    the PMEM case. Volatile-RAM is arguably a safer use case since it's
    possible to repair pages where the persistent case needs active
    application coordination"

    * tag 'devdax-for-5.1' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
    device-dax: "Hotplug" persistent memory for use like normal RAM
    mm/resource: Let walk_system_ram_range() search child resources
    mm/memory-hotplug: Allow memory resources to be children
    mm/resource: Move HMM pr_debug() deeper into resource code
    mm/resource: Return real error codes from walk failures
    device-dax: Add a 'modalias' attribute to DAX 'bus' devices
    device-dax: Add a 'target_node' attribute
    device-dax: Auto-bind device after successful new_id
    acpi/nfit, device-dax: Identify differentiated memory with a unique numa-node
    device-dax: Add /sys/class/dax backwards compatibility
    device-dax: Add support for a dax override driver
    device-dax: Move resource pinning+mapping into the common driver
    device-dax: Introduce bus + driver model
    device-dax: Start defining a dax bus model
    device-dax: Remove multi-resource infrastructure
    device-dax: Kill dax_region base
    device-dax: Kill dax_region ida

    Linus Torvalds
     

01 Mar, 2019

3 commits

  • In the process of onlining memory, we use walk_system_ram_range()
    to find the actual RAM areas inside of the area being onlined.

    However, it currently only finds memory resources which are
    "top-level" iomem_resources. Children are not currently
    searched which causes it to skip System RAM in areas like this
    (in the format of /proc/iomem):

    a0000000-bfffffff : Persistent Memory (legacy)
      a0000000-afffffff : System RAM

    Changing the true->false here allows children to be searched
    as well. We need this because we add a new "System RAM"
    resource underneath the "persistent memory" resource when
    we use persistent memory in a volatile mode.
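
    A minimal userspace sketch of the difference, assuming a simplified
    stand-in for struct resource (struct res, next_res(), count_sysram()
    and demo_count() are hypothetical names, not kernel code). With
    first_lvl set, the walk never descends, so the System RAM child
    under the pmem entry is missed:

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for the kernel's struct resource. */
struct res {
    unsigned long start, end;          /* inclusive range */
    bool is_sysram;
    struct res *child, *sibling, *parent;
};

/* Pre-order successor; with first_lvl set, children are never visited. */
static struct res *next_res(struct res *p, bool first_lvl)
{
    if (!first_lvl && p->child)
        return p->child;
    while (!p->sibling && p->parent)
        p = p->parent;
    return p->sibling;
}

/* Count System RAM entries reachable from the first top-level entry. */
static int count_sysram(struct res *first, bool first_lvl)
{
    int n = 0;

    for (struct res *p = first; p; p = next_res(p, first_lvl))
        if (p->is_sysram)
            n++;
    return n;
}

/* Builds the two-entry layout from the /proc/iomem excerpt above. */
static int demo_count(bool first_lvl)
{
    struct res pmem = { 0xa0000000UL, 0xbfffffffUL, false, NULL, NULL, NULL };
    struct res ram  = { 0xa0000000UL, 0xafffffffUL, true,  NULL, NULL, &pmem };

    pmem.child = &ram;
    return count_sysram(&pmem, first_lvl);
}
```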

    Signed-off-by: Dave Hansen
    Cc: Keith Busch
    Cc: Dan Williams
    Cc: Dave Jiang
    Cc: Ross Zwisler
    Cc: Vishal Verma
    Cc: Tom Lendacky
    Cc: Andrew Morton
    Cc: Michal Hocko
    Cc: linux-nvdimm@lists.01.org
    Cc: linux-kernel@vger.kernel.org
    Cc: linux-mm@kvack.org
    Cc: Huang Ying
    Cc: Fengguang Wu
    Cc: Borislav Petkov
    Cc: Bjorn Helgaas
    Cc: Yaowei Bai
    Cc: Takashi Iwai
    Cc: Jerome Glisse
    Signed-off-by: Dan Williams

    Dave Hansen
     
  • HMM consumes physical address space for its own use, even
    though nothing is mapped or accessible there. It uses a
    special resource description (IORES_DESC_DEVICE_PRIVATE_MEMORY)
    to uniquely identify these areas.

    When HMM consumes address space, it makes a best guess about
    what to consume. However, it is possible that a future memory
    or device hotplug can collide with the reserved area. In the
    case of these conflicts, there is an error message in
    register_memory_resource().

    Later patches in this series move register_memory_resource()
    from using request_resource_conflict() to __request_region().
    Unfortunately, __request_region() does not return the conflict
    like the previous function did, which makes it impossible to
    check for IORES_DESC_DEVICE_PRIVATE_MEMORY in a conflicting
    resource.

    Instead of warning in register_memory_resource(), move the
    check into the core resource code itself (__request_region())
    where the conflicting resource _is_ available. This has the
    added bonus of producing a warning in case of HMM conflicts
    with devices *or* RAM address space, as opposed to the RAM-
    only warnings that were there previously.

    Signed-off-by: Dave Hansen
    Reviewed-by: Jerome Glisse
    Cc: Dan Williams
    Cc: Dave Jiang
    Cc: Ross Zwisler
    Cc: Vishal Verma
    Cc: Tom Lendacky
    Cc: Andrew Morton
    Cc: Michal Hocko
    Cc: linux-nvdimm@lists.01.org
    Cc: linux-kernel@vger.kernel.org
    Cc: linux-mm@kvack.org
    Cc: Huang Ying
    Cc: Fengguang Wu
    Cc: Keith Busch
    Signed-off-by: Dan Williams

    Dave Hansen
     
  • walk_system_ram_range() can return an error code either because
    *it* failed, or because the 'func' that it calls returned an
    error. The memory hotplug does the following:

        ret = walk_system_ram_range(..., func);
        if (ret)
                return ret;

    and 'ret' makes it out to userspace, eventually. The problem
    is, walk_system_ram_range() failures that result from *it* failing
    (as opposed to 'func') return -1. That leads to a very odd
    -EPERM (-1) return code out to userspace.

    Make walk_system_ram_range() return -EINVAL for internal
    failures to keep userspace less confused.

    This return code is compatible with all the callers that I
    audited.
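
    As an illustration only (walk_range_sketch() and always_busy() are
    hypothetical, and the real walker iterates over the resource tree
    rather than taking a have_ram flag), the return-code convention
    might look like this in userspace C:

```c
#include <errno.h>
#include <stddef.h>

typedef int (*walk_fn)(unsigned long start_pfn, unsigned long nr_pages,
                       void *arg);

/*
 * Hypothetical walker over a single range. If no RAM is found, the
 * walker itself fails and reports -EINVAL (previously a bare -1, which
 * userspace would read as -EPERM). Otherwise the callback's return
 * value is passed through unchanged.
 */
static int walk_range_sketch(unsigned long start_pfn, unsigned long nr_pages,
                             int have_ram, walk_fn func, void *arg)
{
    int ret = -EINVAL;

    if (have_ram)
        ret = func(start_pfn, nr_pages, arg);
    return ret;
}

/* Example callback that always reports its own error. */
static int always_busy(unsigned long start_pfn, unsigned long nr_pages,
                       void *arg)
{
    (void)start_pfn;
    (void)nr_pages;
    (void)arg;
    return -EBUSY;
}
```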

    Signed-off-by: Dave Hansen
    Reviewed-by: Bjorn Helgaas
    Acked-by: Michael Ellerman (powerpc)
    Cc: Dan Williams
    Cc: Dave Jiang
    Cc: Ross Zwisler
    Cc: Vishal Verma
    Cc: Tom Lendacky
    Cc: Andrew Morton
    Cc: Michal Hocko
    Cc: linux-nvdimm@lists.01.org
    Cc: linux-kernel@vger.kernel.org
    Cc: linux-mm@kvack.org
    Cc: Huang Ying
    Cc: Fengguang Wu
    Cc: Borislav Petkov
    Cc: Yaowei Bai
    Cc: Takashi Iwai
    Cc: Jerome Glisse
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: linuxppc-dev@lists.ozlabs.org
    Cc: Keith Busch
    Signed-off-by: Dan Williams

    Dave Hansen
     

04 Feb, 2019

1 commit

  • Since commit c40dd2f76644 ("powerpc: Add System RAM to /proc/iomem")
    it is possible to use the generic walk_system_ram_range() and
    the generic page_is_ram().

    To enable the use of walk_system_ram_range() by the IBM EHEA ethernet
    driver, we still need an export of the generic function.

    As powerpc was the only user of CONFIG_ARCH_HAS_WALK_MEMORY, the
    ifdef around the generic walk_system_ram_range() has become useless
    and can be dropped.

    Fixes: c40dd2f76644 ("powerpc: Add System RAM to /proc/iomem")
    Signed-off-by: Christophe Leroy
    [mpe: Keep the EXPORT_SYMBOL_GPL in powerpc code]
    Signed-off-by: Michael Ellerman

    Christophe Leroy
     

29 Dec, 2018

1 commit

  • This is a preparation for the next patch.

    Currently, we only call release_mem_region_adjustable() in __remove_pages
    if the zone is not ZONE_DEVICE, because resources that belong to HMM/devm
    are being released by themselves with devm_release_mem_region.

    Since we do not want to touch any zone/page stuff during the removing of
    the memory (but during the offlining), we do not want to check for the
    zone here. So we need another way to tell release_mem_region_adjustable()
    to not release the resource in case it belongs to HMM/devm.

    HMM/devm acquires/releases a resource through
    devm_request_mem_region/devm_release_mem_region.

    These resources have the flag IORESOURCE_MEM, while resources acquired by
    hot-add memory path (register_memory_resource()) contain
    IORESOURCE_SYSTEM_RAM.

    So, we can check for this flag in release_mem_region_adjustable, and if
    the resource does not contain such flag, we know that we are dealing with
    a HMM/devm resource, so we can back off.
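
    A rough sketch of the flag test described above. The flag values
    are assumed to mirror include/linux/ioport.h, and
    may_release_sketch() is a hypothetical helper, not the actual
    change to release_mem_region_adjustable():

```c
#include <stdbool.h>

/* Assumed to mirror the kernel's ioport.h definitions. */
#define IORESOURCE_MEM        0x00000200UL
#define IORESOURCE_SYSRAM     0x01000000UL
#define IORESOURCE_SYSTEM_RAM (IORESOURCE_MEM | IORESOURCE_SYSRAM)

/*
 * Hypothetical helper: only resources registered by the memory
 * hot-add path carry the full IORESOURCE_SYSTEM_RAM type. Anything
 * else (e.g. a plain IORESOURCE_MEM region owned by HMM/devm) is
 * left alone.
 */
static bool may_release_sketch(unsigned long flags)
{
    return (flags & IORESOURCE_SYSTEM_RAM) == IORESOURCE_SYSTEM_RAM;
}
```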

    Link: http://lkml.kernel.org/r/20181127162005.15833-3-osalvador@suse.de
    Signed-off-by: Oscar Salvador
    Reviewed-by: David Hildenbrand
    Reviewed-by: Pavel Tatashin
    Cc: Dan Williams
    Cc: Jerome Glisse
    Cc: Jonathan Cameron
    Cc: Michal Hocko
    Cc: Oscar Salvador
    Cc: "Rafael J. Wysocki"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oscar Salvador
     

07 Nov, 2018

1 commit

  • Add the missing kernel-doc style function parameters documentation.

    Signed-off-by: Borislav Petkov
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: akpm@linux-foundation.org
    Cc: linux-tip-commits@vger.kernel.org
    Cc: rdunlap@infradead.org
    Fixes: b69c2e20f6e4 ("resource: Clean it up a bit")
    Link: http://lkml.kernel.org/r/20181105093307.GA12445@zn.tnic
    Signed-off-by: Ingo Molnar

    Borislav Petkov
     

05 Nov, 2018

1 commit

  • The first group of warnings is caused by a "/**" kernel-doc notation
    marker but the function comments are not in kernel-doc format.
    Also add another error return value here.

    ../kernel/resource.c:337: warning: Function parameter or member 'start' not described in 'find_next_iomem_res'
    ../kernel/resource.c:337: warning: Function parameter or member 'end' not described in 'find_next_iomem_res'
    ../kernel/resource.c:337: warning: Function parameter or member 'flags' not described in 'find_next_iomem_res'
    ../kernel/resource.c:337: warning: Function parameter or member 'desc' not described in 'find_next_iomem_res'
    ../kernel/resource.c:337: warning: Function parameter or member 'first_lvl' not described in 'find_next_iomem_res'
    ../kernel/resource.c:337: warning: Function parameter or member 'res' not described in 'find_next_iomem_res'

    Add the missing function parameter documentation for the other warnings:

    ../kernel/resource.c:409: warning: Function parameter or member 'arg' not described in 'walk_iomem_res_desc'
    ../kernel/resource.c:409: warning: Function parameter or member 'func' not described in 'walk_iomem_res_desc'

    Signed-off-by: Randy Dunlap
    Cc: Andrew Morton
    Cc: Borislav Petkov
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Fixes: b69c2e20f6e4 ("resource: Clean it up a bit")
    Link: http://lkml.kernel.org/r/dda2e4d8-bedd-3167-20fe-8c7d2d35b354@infradead.org
    Signed-off-by: Ingo Molnar

    Randy Dunlap
     

09 Oct, 2018

3 commits

  • - Drop BUG_ON()s and do normal error handling instead, in
    find_next_iomem_res().

    - Align function arguments on opening braces.

    - Get rid of local var sibling_only in find_next_iomem_res().

    - Shorten unnecessarily long first_level_children_only arg name.

    Signed-off-by: Borislav Petkov
    CC: Andrew Morton
    CC: Bjorn Helgaas
    CC: Brijesh Singh
    CC: Dan Williams
    CC: H. Peter Anvin
    CC: Lianbo Jiang
    CC: Takashi Iwai
    CC: Thomas Gleixner
    CC: Tom Lendacky
    CC: Vivek Goyal
    CC: Yaowei Bai
    CC: bhe@redhat.com
    CC: dan.j.williams@intel.com
    CC: dyoung@redhat.com
    CC: kexec@lists.infradead.org
    CC: mingo@redhat.com
    Link:

    Borislav Petkov
     
  • Previously find_next_iomem_res() used "*res" as both an input parameter for
    the range to search and the type of resource to search for, and an output
    parameter for the resource we found, which makes the interface confusing.

    The current callers use find_next_iomem_res() incorrectly because they
    allocate a single struct resource and use it for repeated calls to
    find_next_iomem_res(). When find_next_iomem_res() returns a resource, it
    overwrites the start, end, flags, and desc members of the struct. If we
    call find_next_iomem_res() again, we must update or restore these fields.
    The previous code restored res.start and res.end, but not res.flags or
    res.desc.

    Since the callers did not restore res.flags, if they searched for flags
    IORESOURCE_MEM | IORESOURCE_BUSY and found a resource with flags
    IORESOURCE_MEM | IORESOURCE_BUSY | IORESOURCE_SYSRAM, the next search would
    incorrectly skip resources unless they were also marked as
    IORESOURCE_SYSRAM.

    Fix this by restructuring the interface so it takes explicit "start, end,
    flags" parameters and uses "*res" only as an output parameter.
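
    To illustrate the idea (this is a simplified userspace sketch over
    a sorted array, not the kernel function; find_next_sketch(),
    count_matches() and the F_* flags are made up for the example),
    keeping the search inputs separate from the output means a repeated
    search keeps using the caller's original flags mask:

```c
#include <errno.h>
#include <stddef.h>

#define F_MEM    0x1u
#define F_BUSY   0x2u
#define F_SYSRAM 0x4u

struct res {
    unsigned long start, end;   /* inclusive range */
    unsigned int flags;
};

/* Inputs are explicit; *out is written only on success. */
static int find_next_sketch(const struct res *tbl, size_t n,
                            unsigned long start, unsigned long end,
                            unsigned int flags, struct res *out)
{
    for (size_t i = 0; i < n; i++) {
        const struct res *p = &tbl[i];

        if ((p->flags & flags) != flags)
            continue;
        if (p->start > end)
            break;      /* table is sorted: nothing further can match */
        if (p->end >= start) {
            out->start = p->start > start ? p->start : start;
            out->end = p->end < end ? p->end : end;
            out->flags = p->flags;
            return 0;
        }
    }
    return -ENOENT;
}

/* Walk [start, end] with a fixed flags mask; returns the match count. */
static int count_matches(const struct res *tbl, size_t n,
                         unsigned long start, unsigned long end,
                         unsigned int flags)
{
    struct res found;
    int hits = 0;

    while (start <= end &&
           find_next_sketch(tbl, n, start, end, flags, &found) == 0) {
        hits++;
        if (found.end == ~0UL)
            break;      /* avoid wrapping past the top of the space */
        start = found.end + 1;
    }
    return hits;
}

/* First entry also has F_SYSRAM set; the second must still be found. */
static int demo_repeated_search(void)
{
    static const struct res tbl[] = {
        { 0x0000, 0x0fff, F_MEM | F_BUSY | F_SYSRAM },
        { 0x1000, 0x1fff, F_MEM | F_BUSY },
    };

    return count_matches(tbl, 2, 0x0, 0x1fff, F_MEM | F_BUSY);
}
```

    Because found.flags is never fed back into the next search, the
    second entry is still reported even though it lacks the extra flag
    carried by the first.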

    Based on a patch by Lianbo Jiang.

    [ bp: While at it:
    - make comments kernel-doc style.
    -

    Originally-by: http://lore.kernel.org/lkml/20180921073211.20097-2-lijiang@redhat.com
    Signed-off-by: Bjorn Helgaas
    Signed-off-by: Borislav Petkov
    CC: Andrew Morton
    CC: Brijesh Singh
    CC: Dan Williams
    CC: H. Peter Anvin
    CC: Lianbo Jiang
    CC: Takashi Iwai
    CC: Thomas Gleixner
    CC: Tom Lendacky
    CC: Vivek Goyal
    CC: Yaowei Bai
    CC: bhe@redhat.com
    CC: dan.j.williams@intel.com
    CC: dyoung@redhat.com
    CC: kexec@lists.infradead.org
    CC: mingo@redhat.com
    CC: x86-ml
    Link: http://lkml.kernel.org/r/153805812916.1157.177580438135143788.stgit@bhelgaas-glaptop.roam.corp.google.com

    Bjorn Helgaas
     
  • find_next_iomem_res() finds an iomem resource that covers part of a range
    described by "start, end". All callers expect that range to be inclusive,
    i.e., both start and end are included, but find_next_iomem_res() doesn't
    handle the end address correctly.

    If it finds an iomem resource that contains exactly the end address, it
    skips it, e.g., if "start, end" is [0x0-0x10000] and there happens to be an
    iomem resource [mem 0x10000-0x10000] (the single byte at 0x10000), we skip
    it:

    find_next_iomem_res(...)
    {
      start = 0x0;
      end = 0x10000;
      for (p = next_resource(...)) {
        # p->start = 0x10000;
        # p->end = 0x10000;
        # we *should* return this resource, but this condition is false:
        if ((p->end >= start) && (p->start < end))
          break;

    Adjust find_next_iomem_res() so it allows a resource that includes the
    single byte at the end of the range. This is a corner case that we
    probably don't see in practice.
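
    The off-by-one can be reproduced in isolation. The two helpers
    below are hypothetical and only model the overlap condition against
    an inclusive [start, end] range; the "fixed" form is one plausible
    way to express the adjustment, not a copy of the patch:

```c
#include <stdbool.h>

/* Condition quoted above: treats 'end' as exclusive. */
static bool overlaps_old(unsigned long p_start, unsigned long p_end,
                         unsigned long start, unsigned long end)
{
    return (p_end >= start) && (p_start < end);
}

/* Inclusive form: a resource starting exactly at 'end' still overlaps. */
static bool overlaps_inclusive(unsigned long p_start, unsigned long p_end,
                               unsigned long start, unsigned long end)
{
    return (p_end >= start) && (p_start <= end);
}
```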

    Fixes: 58c1b5b07907 ("[PATCH] memory hotadd fixes: find_next_system_ram catch range fix")
    Signed-off-by: Bjorn Helgaas
    Signed-off-by: Borislav Petkov
    CC: Andrew Morton
    CC: Brijesh Singh
    CC: Dan Williams
    CC: H. Peter Anvin
    CC: Lianbo Jiang
    CC: Takashi Iwai
    CC: Thomas Gleixner
    CC: Tom Lendacky
    CC: Vivek Goyal
    CC: Yaowei Bai
    CC: bhe@redhat.com
    CC: dan.j.williams@intel.com
    CC: dyoung@redhat.com
    CC: kexec@lists.infradead.org
    CC: mingo@redhat.com
    CC: x86-ml
    Link: http://lkml.kernel.org/r/153805812254.1157.16736368485811773752.stgit@bhelgaas-glaptop.roam.corp.google.com

    Bjorn Helgaas
     

09 Jun, 2018

1 commit

  • Pull libnvdimm updates from Dan Williams:
    "This adds a user for the new 'bytes-remaining' updates to
    memcpy_mcsafe() that you already received through Ingo via the
    x86-dax-for-linus pull.

    Not included here, but still targeting this cycle, is support for
    handling memory media errors (poison) consumed via userspace dax
    mappings.

    Summary:

    - DAX broke a fundamental assumption of truncate of file mapped
    pages. The truncate path assumed that it is safe to disconnect a
    pinned page from a file and let the filesystem reclaim the physical
    block. With DAX the page is equivalent to the filesystem block.
    Introduce dax_layout_busy_page() to enable filesystems to wait for
    pinned DAX pages to be released. Without this wait a filesystem
    could allocate blocks under active device-DMA to a new file.

    - DAX arranges for the block layer to be bypassed and uses
    dax_direct_access() + copy_to_iter() to satisfy read(2) calls.
    However, the memcpy_mcsafe() facility is available through the pmem
    block driver. In order to safely handle media errors, via the DAX
    block-layer bypass, introduce copy_to_iter_mcsafe().

    - Fix cache management policy relative to the ACPI NFIT Platform
    Capabilities Structure to properly elide cache flushes when they
    are not necessary. The table indicates whether CPU caches are
    power-fail protected. Clarify that a deep flush is always performed
    on REQ_{FUA,PREFLUSH} requests"

    * tag 'libnvdimm-for-4.18' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (21 commits)
    dax: Use dax_write_cache* helpers
    libnvdimm, pmem: Do not flush power-fail protected CPU caches
    libnvdimm, pmem: Unconditionally deep flush on *sync
    libnvdimm, pmem: Complete REQ_FLUSH => REQ_PREFLUSH
    acpi, nfit: Remove ecc_unit_size
    dax: dax_insert_mapping_entry always succeeds
    libnvdimm, e820: Register all pmem resources
    libnvdimm: Debug probe times
    linvdimm, pmem: Preserve read-only setting for pmem devices
    x86, nfit_test: Add unit test for memcpy_mcsafe()
    pmem: Switch to copy_to_iter_mcsafe()
    dax: Report bytes remaining in dax_iomap_actor()
    dax: Introduce a ->copy_to_iter dax operation
    uio, lib: Fix CONFIG_ARCH_HAS_UACCESS_MCSAFE compilation
    xfs, dax: introduce xfs_break_dax_layouts()
    xfs: prepare xfs_break_layouts() for another layout type
    xfs: prepare xfs_break_layouts() to be called with XFS_MMAPLOCK_EXCL
    mm, fs, dax: handle layout changes to pinned dax mappings
    mm: fix __gup_device_huge vs unmap
    mm: introduce MEMORY_DEVICE_FS_DAX and CONFIG_DEV_PAGEMAP_OPS
    ...

    Linus Torvalds
     

03 Jun, 2018

1 commit

  • There is currently a mismatch between the resources that will trigger
    the e820_pmem driver to register/load and the resources that will
    actually be surfaced as pmem ranges. register_e820_pmem() uses
    walk_iomem_res_desc() which includes children and siblings. In contrast,
    e820_pmem_probe() only considers top level resources. For example the
    following resource tree results in the driver being loaded, but no
    resources being registered:

    398000000000-39bfffffffff : PCI Bus 0000:ae
      39be00000000-39bf07ffffff : PCI Bus 0000:af
        39be00000000-39beffffffff : 0000:af:00.0
          39be10000000-39beffffffff : Persistent Memory (legacy)

    Fix this up to allow definitions of "legacy" pmem ranges anywhere in
    system-physical address space. Not that it is recommended or safe to
    define a pmem range in PCI space, but it is useful for debug /
    experimentation, and the restriction on being a top-level resource was
    arbitrary.

    Cc: Christoph Hellwig
    Signed-off-by: Dan Williams

    Dan Williams
     

16 May, 2018

1 commit


14 Apr, 2018

1 commit

  • We've got a bug report indicating a kernel panic at booting on an x86-32
    system, and it turned out to be the invalid PCI resource assigned after
    reallocation. __find_resource() first aligns the resource start address
    and resets the end address with start+size-1 accordingly, then checks
    whether it's contained. Here the end address may overflow the integer,
    although resource_contains() still returns true because the function
    validates only start and end address. So this ends up with returning an
    invalid resource (start > end).

    There was already an attempt to cover such a problem in the commit
    47ea91b4052d ("Resource: fix wrong resource window calculation"), but
    this case was overlooked.

    This patch adds the validity check of the newly calculated resource for
    avoiding the integer overflow problem.
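
    A small sketch of the failure mode, using 32-bit arithmetic to
    mimic the x86-32 case. place_sketch() is a hypothetical helper and
    assumes a power-of-two alignment and a non-zero size; it is not the
    __find_resource() code itself:

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Round 'start' up to 'align' and place 'size' bytes there.
 * Returns false when either step wraps around, which would otherwise
 * yield a bogus range with start > end.
 */
static bool place_sketch(uint32_t start, uint32_t size, uint32_t align,
                         uint32_t *out_start, uint32_t *out_end)
{
    uint32_t s = (start + align - 1) & ~(align - 1);
    uint32_t e = s + size - 1;      /* may wrap past 0xffffffff */

    if (s < start || e < s)         /* validity check on the new range */
        return false;

    *out_start = s;
    *out_end = e;
    return true;
}
```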

    Bugzilla: http://bugzilla.opensuse.org/show_bug.cgi?id=1086739
    Link: http://lkml.kernel.org/r/s5hpo37d5l8.wl-tiwai@suse.de
    Fixes: 23c570a67448 ("resource: ability to resize an allocated resource")
    Signed-off-by: Takashi Iwai
    Reported-by: Michael Henders
    Tested-by: Michael Henders
    Reviewed-by: Andrew Morton
    Cc: Ram Pai
    Cc: Bjorn Helgaas
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Takashi Iwai
     

07 Feb, 2018

2 commits

  • Merge misc updates from Andrew Morton:

    - kasan updates

    - procfs

    - lib/bitmap updates

    - other lib/ updates

    - checkpatch tweaks

    - rapidio

    - ubsan

    - pipe fixes and cleanups

    - lots of other misc bits

    * emailed patches from Andrew Morton: (114 commits)
    Documentation/sysctl/user.txt: fix typo
    MAINTAINERS: update ARM/QUALCOMM SUPPORT patterns
    MAINTAINERS: update various PALM patterns
    MAINTAINERS: update "ARM/OXNAS platform support" patterns
    MAINTAINERS: update Cortina/Gemini patterns
    MAINTAINERS: remove ARM/CLKDEV SUPPORT file pattern
    MAINTAINERS: remove ANDROID ION pattern
    mm: docs: add blank lines to silence sphinx "Unexpected indentation" errors
    mm: docs: fix parameter names mismatch
    mm: docs: fixup punctuation
    pipe: read buffer limits atomically
    pipe: simplify round_pipe_size()
    pipe: reject F_SETPIPE_SZ with size over UINT_MAX
    pipe: fix off-by-one error when checking buffer limits
    pipe: actually allow root to exceed the pipe buffer limits
    pipe, sysctl: remove pipe_proc_fn()
    pipe, sysctl: drop 'min' parameter from pipe-max-size converter
    kasan: rework Kconfig settings
    crash_dump: is_kdump_kernel can be boolean
    kernel/mutex: mutex_is_locked can be boolean
    ...

    Linus Torvalds
     
  • Make iomem_is_exclusive return bool due to this particular function only
    using either one or zero as its return value.

    No functional change.

    Link: http://lkml.kernel.org/r/1513266622-15860-5-git-send-email-baiyaowei@cmss.chinamobile.com
    Signed-off-by: Yaowei Bai
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yaowei Bai
     

19 Dec, 2017

2 commits


07 Nov, 2017

3 commits

  • In order for memory pages to be properly mapped when SEV is active, it's
    necessary to use the PAGE_KERNEL protection attribute as the base
    protection. This ensures that memory mapping of, e.g. ACPI tables,
    receives the proper mapping attributes.

    Signed-off-by: Tom Lendacky
    Signed-off-by: Brijesh Singh
    Signed-off-by: Thomas Gleixner
    Reviewed-by: Borislav Petkov
    Tested-by: Borislav Petkov
    Cc: Laura Abbott
    Cc: Kees Cook
    Cc: kvm@vger.kernel.org
    Cc: Jérôme Glisse
    Cc: Borislav Petkov
    Cc: Andy Lutomirski
    Cc: Andrew Morton
    Cc: Dan Williams
    Cc: "Kirill A. Shutemov"
    Link: https://lkml.kernel.org/r/20171020143059.3291-11-brijesh.singh@amd.com

    Tom Lendacky
     
  • In preparation for a new function that will need additional resource
    information during the resource walk, update the resource walk callback to
    pass the resource structure. Since the current callback start and end
    arguments are pulled from the resource structure, the callback functions
    can obtain them from the resource structure directly.
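
    A userspace sketch of the callback shape this describes (struct
    res, walk_sketch() and sum_sizes() are hypothetical names used only
    for illustration). The callback receives the whole entry and pulls
    start and end out of it itself:

```c
#include <stddef.h>

struct res {
    unsigned long long start, end;  /* inclusive range */
    unsigned long flags;
};

/* Callback takes the resource itself instead of (start, end). */
typedef int (*res_cb)(const struct res *r, void *arg);

/* Call func on each entry; stop at the first non-zero return. */
static int walk_sketch(const struct res *tbl, size_t n, res_cb func,
                       void *arg)
{
    int ret = 0;

    for (size_t i = 0; i < n && ret == 0; i++)
        ret = func(&tbl[i], arg);
    return ret;
}

/* Example callback: derives the size from the structure directly. */
static int sum_sizes(const struct res *r, void *arg)
{
    *(unsigned long long *)arg += r->end - r->start + 1;
    return 0;
}

static unsigned long long demo_total(void)
{
    static const struct res tbl[] = {
        { 0x0000, 0x0fff, 0 },
        { 0x4000, 0x5fff, 0 },
    };
    unsigned long long total = 0;

    walk_sketch(tbl, 2, sum_sizes, &total);
    return total;
}
```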

    Signed-off-by: Tom Lendacky
    Signed-off-by: Brijesh Singh
    Signed-off-by: Thomas Gleixner
    Reviewed-by: Kees Cook
    Reviewed-by: Borislav Petkov
    Tested-by: Borislav Petkov
    Cc: kvm@vger.kernel.org
    Cc: Benjamin Herrenschmidt
    Cc: Borislav Petkov
    Cc: linuxppc-dev@lists.ozlabs.org
    Link: https://lkml.kernel.org/r/20171020143059.3291-10-brijesh.singh@amd.com

    Tom Lendacky
     
  • The walk_iomem_res_desc(), walk_system_ram_res() and walk_system_ram_range()
    functions each have much of the same code.

    Create a new function that consolidates the common code from these
    functions in one place to reduce the amount of duplicated code.

    Signed-off-by: Tom Lendacky
    Signed-off-by: Brijesh Singh
    Signed-off-by: Thomas Gleixner
    Reviewed-by: Borislav Petkov
    Tested-by: Borislav Petkov
    Cc: kvm@vger.kernel.org
    Cc: Borislav Petkov
    Link: https://lkml.kernel.org/r/20171020143059.3291-9-brijesh.singh@amd.com

    Tom Lendacky
     

15 Apr, 2016

1 commit


10 Mar, 2016

3 commits

  • insert_resource() and remove_resource() are called by producers
    of resources, such as FW modules and bus drivers. These modules
    may be implemented as loadable modules.

    Export insert_resource() and remove_resource() so that they can
    be called from such modules.

    link: https://lkml.org/lkml/2016/3/8/872
    Signed-off-by: Toshi Kani
    Cc: Linus Torvalds
    Cc: Ingo Molnar
    Cc: Borislav Petkov
    Cc: Andrew Morton
    Cc: Dan Williams
    Signed-off-by: Dan Williams

    Toshi Kani
     
  • insert_resource() and insert_resource_conflict() are called
    by resource producers to insert a new resource. When there
    is any conflict, they move conflicting resources down to the
    children of the new resource. There is no destructor of these
    interfaces, however.

    Add remove_resource(), which removes a resource previously
    inserted by insert_resource() or insert_resource_conflict(),
    and moves the children up to where they were before.

    __release_resource() is changed to have @release_child, so
    that this function can be used for remove_resource() as well.

    Also add comments to clarify that these functions are intended
    for producers of resources to avoid any confusion with
    request/release_resource() for consumers.

    Signed-off-by: Toshi Kani
    Cc: Ingo Molnar
    Cc: Borislav Petkov
    Cc: Andrew Morton
    Cc: Dan Williams
    Signed-off-by: Dan Williams

    Toshi Kani
     
  • __request_region() sets 'flags' of a new resource from @parent
    as it inherits the parent's attribute. When a target resource
    has a conflict, this function inserts the new resource entry
    under the conflicted entry by updating @parent. In this case,
    the new resource entry needs to inherit attribute from the updated
    parent. This conflict is a typical case since __request_region()
    is used to allocate a new resource from a specific resource range.

    For instance, request_mem_region() calls __request_region() with
    @parent set to &iomem_resource, which is the root entry of the
    whole iomem range. When this request results in inserting a new
    entry "DEV-A" under "BUS-1", "DEV-A" needs to inherit from the
    immediate parent "BUS-1" as it holds specific attribute for the
    range.

    root (&iomem_resource)
     :
     + "BUS-1"
       + "DEV-A"

    Change __request_region() to set 'flags' and 'desc' of a new entry
    from the immediate parent.
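
    As a rough model of the behaviour described (the struct and
    request_region_sketch() are hypothetical; real conflict handling,
    the busy checks and sibling ordering are omitted), the new entry
    descends to the innermost containing resource and copies 'flags'
    and 'desc' from it rather than from the root:

```c
#include <stddef.h>

struct res {
    unsigned long start, end;       /* inclusive range */
    unsigned long flags;
    int desc;
    struct res *parent, *sibling, *child;
};

static int contains(const struct res *outer, const struct res *inner)
{
    return outer->start <= inner->start && outer->end >= inner->end;
}

/* Insert new_res under the deepest entry that fully contains it. */
static void request_region_sketch(struct res *root, struct res *new_res)
{
    struct res *parent = root;
    struct res *c = parent->child;

    while (c) {
        if (contains(c, new_res)) {
            parent = c;             /* descend into the conflict */
            c = parent->child;
        } else {
            c = c->sibling;
        }
    }

    /* Inherit attributes from the immediate parent, not from root. */
    new_res->flags = parent->flags;
    new_res->desc = parent->desc;
    new_res->parent = parent;
    new_res->sibling = parent->child;
    parent->child = new_res;
}

/* Mirrors the root / "BUS-1" / "DEV-A" picture above. */
static int demo_inherits_from_bus(void)
{
    struct res root  = { 0x0,    0xffffffffUL, 0x1, 0, NULL,  NULL, NULL };
    struct res bus1  = { 0x1000, 0x1fff,       0x2, 7, &root, NULL, NULL };
    struct res dev_a = { 0x1100, 0x11ff,       0x0, 0, NULL,  NULL, NULL };

    root.child = &bus1;
    request_region_sketch(&root, &dev_a);
    return dev_a.parent == &bus1 &&
           dev_a.flags == bus1.flags && dev_a.desc == bus1.desc;
}
```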

    Signed-off-by: Toshi Kani
    Cc: Ingo Molnar
    Cc: Borislav Petkov
    Cc: Andrew Morton
    Cc: Dan Williams
    Signed-off-by: Dan Williams

    Toshi Kani
     

04 Mar, 2016

1 commit


21 Feb, 2016

1 commit

  • In __request_region, if a conflict with a BUSY and MUXED resource is
    detected, then the caller goes to sleep and waits for the resource to be
    released. A pointer to the conflicting resource is kept. At wake-up
    this pointer is used as a parent to retry to request the region.

    A first problem is that this pointer might well be invalid (if for
    example the conflicting resource has already been freed). Another
    problem is that the next call to __request_region() fails to detect a
    remaining conflict. The previously conflicting resource is passed as a
    parameter and __request_region() will look for a conflict among the
    children of this resource and not at the resource itself. It is likely
    to succeed anyway, even if there is still a conflict.

    Instead, the parent of the conflicting resource should be passed to
    __request_region().

    As a fix, this patch doesn't update the parent resource pointer in the
    case we have to wait for a muxed region right after.

    Reported-and-tested-by: Vincent Pelletier
    Signed-off-by: Simon Guinot
    Tested-by: Vincent Donnefort
    Cc: stable@kernel.org
    Signed-off-by: Linus Torvalds

    Simon Guinot
     

30 Jan, 2016

6 commits

  • walk_iomem_res_desc() replaced walk_iomem_res() and there is no
    caller to walk_iomem_res() any more. Kill it. Also remove @name
    from find_next_iomem_res() as it is no longer used.

    Signed-off-by: Toshi Kani
    Signed-off-by: Borislav Petkov
    Acked-by: Dave Young
    Cc: Andrew Morton
    Cc: Andy Lutomirski
    Cc: Borislav Petkov
    Cc: Brian Gerst
    Cc: Dan Williams
    Cc: Denys Vlasenko
    Cc: H. Peter Anvin
    Cc: Hanjun Guo
    Cc: Jakub Sitnicki
    Cc: Jiang Liu
    Cc: Linus Torvalds
    Cc: Luis R. Rodriguez
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Toshi Kani
    Cc: Vinod Koul
    Cc: linux-arch@vger.kernel.org
    Cc: linux-mm
    Link: http://lkml.kernel.org/r/1453841853-11383-17-git-send-email-bp@alien8.de
    Signed-off-by: Ingo Molnar

    Toshi Kani
     
  • Add a new interface, walk_iomem_res_desc(), which walks through
    the iomem table by identifying a target with @flags and @desc.
    This interface provides the same functionality as
    walk_iomem_res(), but does not use strcmp() on @name for better
    efficiency.
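
    A sketch of what matching on @flags and @desc can look like, as
    opposed to a string compare on the name. The enum values and
    matches_sketch() are illustrative assumptions; only the idea that a
    "none" descriptor acts as a wildcard is taken from how the kernel
    treats IORES_DESC_NONE:

```c
#include <stdbool.h>

/* Hypothetical descriptor IDs standing in for the IORES_DESC_* enum. */
enum { DESC_NONE = 0, DESC_PMEM_LEGACY = 1 };

#define F_MEM  0x1UL
#define F_BUSY 0x2UL

struct res {
    unsigned long flags;
    int desc;
};

/* All requested flag bits must be set; DESC_NONE matches any desc. */
static bool matches_sketch(const struct res *p, unsigned long flags,
                           int desc)
{
    if ((p->flags & flags) != flags)
        return false;
    if (desc != DESC_NONE && desc != p->desc)
        return false;
    return true;
}
```

    Two integer comparisons replace a strcmp() per visited entry, which
    is where the efficiency gain would come from.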

    walk_iomem_res() is deprecated and will be removed in a later
    patch.

    Requested-by: Borislav Petkov
    Signed-off-by: Toshi Kani
    [ Fixup comments. ]
    Signed-off-by: Borislav Petkov
    Cc: Andrew Morton
    Cc: Andy Lutomirski
    Cc: Borislav Petkov
    Cc: Brian Gerst
    Cc: Dan Williams
    Cc: Denys Vlasenko
    Cc: H. Peter Anvin
    Cc: Hanjun Guo
    Cc: Jakub Sitnicki
    Cc: Jiang Liu
    Cc: Linus Torvalds
    Cc: Luis R. Rodriguez
    Cc: Peter Zijlstra
    Cc: Rafael J. Wysocki
    Cc: Thomas Gleixner
    Cc: Toshi Kani
    Cc: linux-arch@vger.kernel.org
    Cc: linux-mm
    Link: http://lkml.kernel.org/r/1453841853-11383-14-git-send-email-bp@alien8.de
    Signed-off-by: Ingo Molnar

    Toshi Kani
     
  • Change region_intersects() to identify a target with @flags and
    @desc, instead of @name with strcmp().

    Change the callers of region_intersects(), memremap() and
    devm_memremap(), to set IORESOURCE_SYSTEM_RAM in @flags and
    IORES_DESC_NONE in @desc when searching System RAM.

    Also, export region_intersects() so that the ACPI EINJ error
    injection driver can call this function in a later patch.

    Signed-off-by: Toshi Kani
    Signed-off-by: Borislav Petkov
    Acked-by: Dan Williams
    Cc: Andrew Morton
    Cc: Andy Lutomirski
    Cc: Ard Biesheuvel
    Cc: Borislav Petkov
    Cc: Brian Gerst
    Cc: Denys Vlasenko
    Cc: H. Peter Anvin
    Cc: Jakub Sitnicki
    Cc: Jan Kara
    Cc: Jiang Liu
    Cc: Kees Cook
    Cc: Kirill A. Shutemov
    Cc: Konstantin Khlebnikov
    Cc: Linus Torvalds
    Cc: Luis R. Rodriguez
    Cc: Michal Hocko
    Cc: Naoya Horiguchi
    Cc: Peter Zijlstra
    Cc: Tejun Heo
    Cc: Thomas Gleixner
    Cc: Toshi Kani
    Cc: Vlastimil Babka
    Cc: linux-arch@vger.kernel.org
    Cc: linux-mm
    Link: http://lkml.kernel.org/r/1453841853-11383-13-git-send-email-bp@alien8.de
    Signed-off-by: Ingo Molnar

    Toshi Kani
     
  • Now that all System RAM resource entries have been initialized
    to IORESOURCE_SYSTEM_RAM type, change walk_system_ram_res() and
    walk_system_ram_range() to call find_next_iomem_res() by setting
    @res.flags to IORESOURCE_SYSTEM_RAM and @name to NULL. With this
    change, they walk through the iomem table to find System RAM
    ranges without the need to do strcmp() on the resource names.

    No functional change is made to the interfaces.

    Signed-off-by: Toshi Kani
    [ Boris: fixup comments. ]
    Signed-off-by: Borislav Petkov
    Cc: Andrew Morton
    Cc: Andy Lutomirski
    Cc: Borislav Petkov
    Cc: Brian Gerst
    Cc: Dan Williams
    Cc: Denys Vlasenko
    Cc: H. Peter Anvin
    Cc: Jakub Sitnicki
    Cc: Jiang Liu
    Cc: Linus Torvalds
    Cc: Luis R. Rodriguez
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Toshi Kani
    Cc: linux-arch@vger.kernel.org
    Cc: linux-mm
    Link: http://lkml.kernel.org/r/1453841853-11383-11-git-send-email-bp@alien8.de
    Signed-off-by: Ingo Molnar

    Toshi Kani
     
  • walk_iomem_res() and region_intersects() still need to use
    strcmp() for searching a resource entry by @name in the iomem
    table.

    This patch introduces I/O resource descriptor 'desc' in struct
    resource for the iomem search interfaces. Drivers can assign
    their unique descriptor to a range when they support the search
    interfaces.

    Otherwise, 'desc' is set to IORES_DESC_NONE (0). This avoids
    changing most of the drivers as they typically allocate resource
    entries statically, or by calling alloc_resource(), kzalloc(),
    or alloc_bootmem_low(), which set the field to zero by default.
    A later patch will address some drivers that use kmalloc()
    without zero'ing the field.

    Also change release_mem_region_adjustable() to set 'desc' when
    its resource entry gets separated. Other resource interfaces are
    also changed to initialize 'desc' explicitly although
    alloc_resource() sets it to 0.
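
    The zero-initialization point can be illustrated in user-space C,
    with calloc() standing in for kzalloc() and malloc() for kmalloc().
    The struct and the sketch_ names are hypothetical; this is only a
    sketch of why zero-filling allocators need no driver changes.

```c
#include <stdlib.h>

#define SKETCH_DESC_NONE 0UL   /* assumed "no descriptor" value */

/* Hypothetical stand-in for a resource entry with the new field. */
struct sketch_res {
    unsigned long long start, end;
    unsigned long flags;
    unsigned long desc;
};

/* Zero-filling allocation: desc is SKETCH_DESC_NONE without extra code. */
static struct sketch_res *sketch_alloc_zeroed(void)
{
    return calloc(1, sizeof(struct sketch_res));
}

/*
 * Plain allocation leaves the memory indeterminate, so every field,
 * including desc, has to be set explicitly before the entry is used.
 */
static struct sketch_res *sketch_alloc_explicit(void)
{
    struct sketch_res *r = malloc(sizeof(*r));

    if (r) {
        r->start = 0;
        r->end = 0;
        r->flags = 0;
        r->desc = SKETCH_DESC_NONE;
    }
    return r;
}
```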

    Signed-off-by: Toshi Kani
    Signed-off-by: Borislav Petkov
    Cc: Andrew Morton
    Cc: Andy Lutomirski
    Cc: Borislav Petkov
    Cc: Brian Gerst
    Cc: Dan Williams
    Cc: Denys Vlasenko
    Cc: H. Peter Anvin
    Cc: Jakub Sitnicki
    Cc: Jiang Liu
    Cc: Linus Torvalds
    Cc: Luis R. Rodriguez
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Toshi Kani
    Cc: linux-arch@vger.kernel.org
    Cc: linux-mm
    Link: http://lkml.kernel.org/r/1453841853-11383-4-git-send-email-bp@alien8.de
    Signed-off-by: Ingo Molnar

    Toshi Kani
     
  • I/O resource flags consist of I/O resource types and modifier
    bits. Therefore, checking an I/O resource type in 'flags' must
    be performed with a bitwise operation.

    Fix find_next_iomem_res() and region_intersects() that simply
    compare 'flags' against a given value.

    Also change __request_region() to set 'res->flags' from
    resource_type() and resource_ext_type() of the parent, so that
    children nodes will inherit the extended I/O resource type.
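
    A minimal user-space sketch of the difference between comparing
    'flags' by equality and testing it with a bitwise operation. The
    constant names are hypothetical stand-ins and their values are
    illustrative only.

```c
#include <stdbool.h>

/* Illustrative values; hypothetical stand-ins for type/modifier bits. */
#define SKETCH_TYPE_MEM  0x00000200UL   /* a resource type bit */
#define SKETCH_MOD_BUSY  0x80000000UL   /* a modifier bit */

/* Buggy form: fails as soon as any modifier bit is also set. */
static bool sketch_type_matches_by_equality(unsigned long flags,
                                            unsigned long type)
{
    return flags == type;
}

/* Fixed form: only the requested type bits are examined. */
static bool sketch_type_matches_by_mask(unsigned long flags,
                                        unsigned long type)
{
    return (flags & type) == type;
}
```

    With flags set to SKETCH_TYPE_MEM | SKETCH_MOD_BUSY, the equality
    form rejects a memory resource while the masked form accepts it.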

    Signed-off-by: Toshi Kani
    Signed-off-by: Borislav Petkov
    Cc: Andrew Morton
    Cc: Andy Lutomirski
    Cc: Borislav Petkov
    Cc: Brian Gerst
    Cc: Dan Williams
    Cc: Denys Vlasenko
    Cc: H. Peter Anvin
    Cc: Jakub Sitnicki
    Cc: Jiang Liu
    Cc: Linus Torvalds
    Cc: Luis R. Rodriguez
    Cc: Peter Zijlstra
    Cc: Rafael J. Wysocki
    Cc: Thomas Gleixner
    Cc: Toshi Kani
    Cc: Vinod Koul
    Cc: linux-arch@vger.kernel.org
    Cc: linux-mm
    Link: http://lkml.kernel.org/r/1453841853-11383-3-git-send-email-bp@alien8.de
    Signed-off-by: Ingo Molnar

    Toshi Kani
     

09 Jan, 2016

1 commit

  • This effectively promotes IORESOURCE_BUSY to IORESOURCE_EXCLUSIVE
    semantics by default. If userspace really believes it is safe to access
    the memory region it can also perform the extra step of disabling an
    active driver. This protects device address ranges with read side
    effects and otherwise directs userspace to use the driver.

    Persistent memory presents a large "mistake surface" to /dev/mem as now
    accidental writes can corrupt a filesystem.

    In general if a device driver is busily using a memory region it already
    informs other parts of the kernel to not touch it via
    request_mem_region(). /dev/mem should honor the same safety restriction
    by default. Debugging a device driver from userspace becomes more
    difficult with this enabled. Any application using /dev/mem or mmap of
    sysfs pci resources will now need to perform the extra step of either:

    1/ Disabling the driver, for example:

    echo <mmio-address> > /dev/bus/<subsystem>/drivers/<driver-name>/unbind

    2/ Rebooting with "iomem=relaxed" on the command line

    3/ Recompiling with CONFIG_IO_STRICT_DEVMEM=n

    Traditional users of /dev/mem like dosemu are unaffected because the
    first 1MB of memory is not subject to the IO_STRICT_DEVMEM restriction.
    Legacy X configurations use /dev/mem to talk to graphics hardware, but
    that functionality has since moved to kernel graphics drivers.
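
    The policy described above can be summarized as a small decision
    function. This is a simplified sketch under stated assumptions, not
    the kernel's check: it models only the three inputs mentioned in
    this message (the first 1MB, whether a driver has claimed the range,
    and whether the strict option is active) and ignores any other
    checks the kernel performs. All names are hypothetical.

```c
#include <stdbool.h>

#define SKETCH_LOW_1MB 0x100000ULL   /* first 1MB is exempt */

/*
 * Returns true if a /dev/mem style access to addr would be permitted
 * under the policy sketched here.
 *   region_busy: a driver has claimed the range
 *   strict:      the strict option is enabled and not relaxed at boot
 */
static bool sketch_devmem_may_access(unsigned long long addr,
                                     bool region_busy, bool strict)
{
    if (addr < SKETCH_LOW_1MB)
        return true;        /* legacy users such as dosemu */
    if (!strict)
        return true;        /* option disabled or relaxed */
    return !region_busy;    /* busy ranges are off limits */
}
```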

    Cc: Arnd Bergmann
    Cc: Russell King
    Cc: Andrew Morton
    Cc: Greg Kroah-Hartman
    Acked-by: Kees Cook
    Acked-by: Ingo Molnar
    Signed-off-by: Dan Williams

    Dan Williams
     

11 Aug, 2015

1 commit

  • region_is_ram() is used to prevent the establishment of aliased mappings
    to physical "System RAM" with incompatible cache settings. However, it
    uses "-1" to indicate both "unknown" memory ranges (ranges not described
    by platform firmware) and "mixed" ranges (where the parameters describe
    a range that partially overlaps "System RAM").

    Fix this up by explicitly tracking the "unknown" vs "mixed" resource
    cases and returning REGION_INTERSECTS, REGION_MIXED, or REGION_DISJOINT.
    This re-write also adds support for detecting when the requested region
    completely eclipses all of a resource. Note, the implementation treats
    overlaps between "unknown" and the requested memory type as
    REGION_INTERSECTS.

    Finally, other memory types can be passed in by name; for now the only
    usage is "System RAM".
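
    The three-way classification can be sketched over a flat table in
    user-space C. This is an illustration of the idea rather than the
    kernel code; the table layout and the sketch_ names are
    hypothetical. Address space not covered by any entry plays the role
    of "unknown" memory and counts neither for nor against the
    requested type.

```c
#include <stddef.h>
#include <string.h>

enum sketch_region { SKETCH_INTERSECTS, SKETCH_MIXED, SKETCH_DISJOINT };

/* Hypothetical flat table entry; inclusive [start, end] range. */
struct sketch_range {
    unsigned long long start, end;
    const char *name;
};

static enum sketch_region
sketch_region_intersects(const struct sketch_range *tbl, size_t n,
                         unsigned long long start, unsigned long long size,
                         const char *name)
{
    unsigned long long end = start + size - 1;
    int type = 0, other = 0;

    for (size_t i = 0; i < n; i++) {
        if (end < tbl[i].start || start > tbl[i].end)
            continue;                       /* no overlap with this entry */
        if (strcmp(tbl[i].name, name) == 0)
            type++;                         /* overlaps the requested type */
        else
            other++;                        /* overlaps some other type */
    }

    if (other == 0)
        return type ? SKETCH_INTERSECTS : SKETCH_DISJOINT;
    return type ? SKETCH_MIXED : SKETCH_DISJOINT;
}
```

    A request that covers part of a "System RAM" entry plus an
    undescribed gap yields SKETCH_INTERSECTS here, which mirrors the
    note above about "unknown" overlaps.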

    Suggested-by: Luis R. Rodriguez
    Reviewed-by: Toshi Kani
    Signed-off-by: Dan Williams

    Dan Williams
     

22 Jul, 2015

1 commit

  • region_is_ram() looks up the iomem_resource table to check if
    a target range is in RAM. However, it always returns with -1
    due to invalid range checks. It always breaks the loop at the
    first entry of the table.

    Another issue is that the comparison of p->flags against flags always
    fails: flags is declared as int, so it becomes a negative value when
    IORESOURCE_BUSY (0x80000000) is set, while p->flags is unsigned long.

    Fix the range check and flags so that region_is_ram() works as
    advertised.
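
    The flags mismatch can be reproduced in user-space C. The sketch
    below assumes a platform where unsigned long is 64 bits wide and
    int is 32 bits wide; the sketch_ names are hypothetical.

```c
#include <stdbool.h>

#define SKETCH_BUSY 0x80000000UL   /* bit 31, as in the message above */

/*
 * Buggy form: flags is an int. With bit 31 set it is negative, and
 * converting it to a 64-bit unsigned long sign-extends it to
 * 0xffffffff80000000, which never equals 0x80000000.
 */
static bool sketch_flags_equal_buggy(unsigned long res_flags, int flags)
{
    return res_flags == (unsigned long)flags;
}

/* Fixed form: both sides have the same unsigned type. */
static bool sketch_flags_equal_fixed(unsigned long res_flags,
                                     unsigned long flags)
{
    return res_flags == flags;
}
```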

    Signed-off-by: Toshi Kani
    Reviewed-by: Dan Williams
    Cc: Mike Travis
    Cc: Luis R. Rodriguez
    Cc: Andrew Morton
    Cc: Roland Dreier
    Cc: linux-mm@kvack.org
    Link: http://lkml.kernel.org/r/1437088996-28511-4-git-send-email-toshi.kani@hp.com
    Signed-off-by: Thomas Gleixner

    Toshi Kani
     

16 Apr, 2015

1 commit

  • All users of __check_region(), check_region(), and check_mem_region() are
    gone. We got rid of the last user in v4.0-rc1. Remove them.

    bloat-o-meter on x86_64 shows:

    add/remove: 0/3 grow/shrink: 0/0 up/down: 0/-102 (-102)
    function old new delta
    __kstrtab___check_region 15 - -15
    __ksymtab___check_region 16 - -16
    __check_region 71 - -71

    Signed-off-by: Jakub Sitnicki
    Cc: Bjorn Helgaas
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jakub Sitnicki
     

05 Feb, 2015

1 commit

  • Currently ACPI, PCI and pnp all implement the same resource list
    management with different data structures. We need to convert from
    one data structure into another when passing resources from one
    subsystem into another. So move struct resource_list_entry
    from ACPI into the resource core and rename it as resource_entry,
    so that it can be reused by different subsystems and avoid the data
    structure conversion.

    Introduce dedicated header file resource_ext.h instead of embedding
    it into ioport.h to avoid header file inclusion order issues.

    Signed-off-by: Jiang Liu
    Acked-by: Vinod Koul
    Signed-off-by: Rafael J. Wysocki

    Jiang Liu
     

14 Oct, 2014

1 commit

  • We have a large university system in the UK that is experiencing very long
    delays modprobing the driver for a specific I/O device. The delay is from
    8-10 minutes per device and there are 31 devices in the system. This 4 to
    5 hour delay in starting up those I/O devices is very much a burden on the
    customer.

    There are two causes for requiring a restart/reload of the drivers. First
    is periodic preventive maintenance (PM) and the second is if any of the
    devices experience a fatal error. Both of these trigger this excessively
    long delay in bringing the system back up to full capability.

    The problem was tracked down to a very slow IOREMAP operation and the
    excessively long ioresource lookup to ensure that the user is not
    attempting to ioremap RAM. These patches provide a speed up to that
    function.

    The modprobe time appears to be affected quite a bit by previous activity
    on the ioresource list, which I suspect is due to cache preloading. While
    the overall improvement is impacted by other overhead of starting the
    devices, this drastically improves the modprobe time.

    Also our system is considerably smaller so the percentages gained will not
    be the same. Best case improvement with the modprobe on our 20 device
    smallish system was from 'real 5m51.913s' to 'real 0m18.275s'.

    This patch (of 2):

    Since the ioremap operation is verifying that the specified address range
    is NOT RAM, it will search the entire ioresource list if the condition is
    true. To make matters worse, it does this one 4k page at a time. For a
    128M BAR region this is 32K (32,768) passes to determine the entire region
    does not contain any RAM addresses.

    This patch provides another resource lookup function, region_is_ram, that
    searches for the entire region specified, verifying that it is completely
    contained within the resource region. If it is found, then it is checked
    to be RAM or not, within a single pass.

    The return value is -1 if no containing resource was found; otherwise
    it is 1 if the region is RAM and 0 if it is not. This allows the caller
    to fall back to the previous page-by-page search if it was not found.
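
    A user-space sketch of the two ideas in this patch: how many
    page-sized lookups the old approach needs, and a single-pass lookup
    with the three-valued result described above. The flat table and
    the sketch_ names are hypothetical simplifications.

```c
#include <stddef.h>

/* Hypothetical flat table entry; inclusive [start, end] range. */
struct sketch_entry {
    unsigned long long start, end;
    int is_ram;
};

/* Number of page-sized lookups needed to cover a region. */
static unsigned long long sketch_page_passes(unsigned long long region_bytes,
                                             unsigned long long page_bytes)
{
    return (region_bytes + page_bytes - 1) / page_bytes;
}

/*
 * Single pass over the table: returns 1 if the whole region lies in a
 * RAM entry, 0 if it lies in a non-RAM entry, and -1 if no entry fully
 * contains it (the caller may then fall back to a per-page search).
 */
static int sketch_region_is_ram(const struct sketch_entry *tbl, size_t n,
                                unsigned long long start,
                                unsigned long long size)
{
    unsigned long long end = start + size - 1;

    for (size_t i = 0; i < n; i++) {
        if (start >= tbl[i].start && end <= tbl[i].end)
            return tbl[i].is_ram ? 1 : 0;
    }
    return -1;
}
```

    For a 128 MiB region and 4 KiB pages, sketch_page_passes() gives
    32768 lookups, against one table walk for sketch_region_is_ram().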

    [akpm@linux-foundation.org: fix spellos and typos in comment]
    Signed-off-by: Mike Travis
    Acked-by: Alex Thorlton
    Reviewed-by: Cliff Wickman
    Cc: Thomas Gleixner
    Cc: H. Peter Anvin
    Cc: Mark Salter
    Cc: Dave Young
    Cc: Rik van Riel
    Cc: Peter Zijlstra
    Cc: Mel Gorman
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Travis