03 Aug, 2018

3 commits

  • [ Upstream commit 48d8476b41eed63567dd2f0ad125c895b9ac648a ]

    MAP_DMA ioctls might be called from various threads within a process,
    for example when using QEMU, the vCPU threads are often generating
    these calls and we therefore take a reference to that vCPU task.
    However, QEMU also supports vCPU hotplug on some machines and the task
    that called MAP_DMA may have exited by the time UNMAP_DMA is called,
    resulting in the mm_struct pointer being NULL and thus a failure to
    match against the existing mapping.

    To resolve this, we instead take a reference to the thread
    group_leader, which has the same mm_struct and resource limits, but
    is less likely to exit, at least in the QEMU case. A difficulty here is
    guaranteeing that the capabilities of the group_leader match that of
    the calling thread, which we resolve by tracking CAP_IPC_LOCK at the
    time of calling rather than at an indeterminate time in the future.
    Potentially this also results in better efficiency as this is now
    recorded once per MAP_DMA ioctl.

    Reported-by: Xu Yandong
    Signed-off-by: Alex Williamson
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Alex Williamson
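
    A minimal C sketch of the idea described in this commit, assuming a
    vfio_dma-like descriptor with a task pointer and a saved capability
    flag (the struct and field names are illustrative, not the exact
    upstream ones):

      #include <linux/capability.h>
      #include <linux/sched.h>
      #include <linux/sched/task.h>

      struct example_dma {
              struct task_struct *task;  /* group leader, not the vCPU thread */
              bool lock_cap;             /* CAP_IPC_LOCK sampled at MAP_DMA time */
      };

      static void example_record_task(struct example_dma *dma)
      {
              /*
               * The group leader shares the mm_struct and resource limits
               * of the calling thread but is less likely to exit.
               */
              get_task_struct(current->group_leader);
              dma->task = current->group_leader;
              dma->lock_cap = capable(CAP_IPC_LOCK);
      }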
     
  • [ Upstream commit 002fe996f67f4f46d8917b14cfb6e4313c20685a ]

    When we create an mdev device, we check for duplicates against the
    parent device and return -EEXIST if found, but the mdev device
    namespace is global since we'll link all devices from the bus. We do
    catch this later in sysfs_do_create_link_sd() to return -EEXIST, but
    with it comes a kernel warning and stack trace for trying to create
    duplicate sysfs links, which makes it an undesirable response.

    Therefore we should really be looking for duplicates across all mdev
    parent devices, or as implemented here, against our mdev device list.
    Using mdev_list to prevent duplicates means that we can remove
    mdev_parent.lock, but in order not to serialize mdev device creation
    and removal globally, we add mdev_device.active which allows UUIDs to
    be reserved such that we can drop the mdev_list_lock before the mdev
    device is fully in place.

    Two behavioral notes: first, mdev_parent.lock had the side-effect of
    serializing mdev create and remove ops per parent device. This was
    an implementation detail, not an intentional guarantee provided to
    the mdev vendor drivers. Vendor drivers can trivially provide this
    serialization internally if necessary. Second, review comments note
    the new -EAGAIN behavior when the device, and in particular the remove
    attribute, becomes visible in sysfs. If a remove is triggered prior
    to completion of mdev_device_create() the user will see a -EAGAIN
    error. While the errno is different, receiving an error during this
    period is not new; the previous implementation returned -ENODEV for
    the same condition. Furthermore, consistency for the user is improved
    in the case where mdev_device_remove_ops() returns an error.
    Previously, a concurrent call to mdev_device_remove() could see the
    device disappear with -ENODEV and return even though the failed
    removal left the device in place. Now the user sees -EAGAIN while the
    device is in this transitory state.

    Reviewed-by: Kirti Wankhede
    Reviewed-by: Cornelia Huck
    Acked-by: Halil Pasic
    Acked-by: Zhenyu Wang
    Signed-off-by: Alex Williamson
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Alex Williamson
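
    A hedged sketch of the global duplicate check described in this
    commit: reserve the UUID on a single global list before dropping the
    list lock, and mark the entry active only once creation has finished
    (all names here are illustrative):

      #include <linux/errno.h>
      #include <linux/list.h>
      #include <linux/mutex.h>
      #include <linux/uuid.h>

      static LIST_HEAD(example_mdev_list);
      static DEFINE_MUTEX(example_mdev_list_lock);

      struct example_mdev {
              struct list_head next;
              guid_t uuid;
              bool active;    /* false while creation is still in flight */
      };

      static int example_reserve_uuid(struct example_mdev *mdev, const guid_t *uuid)
      {
              struct example_mdev *tmp;

              mutex_lock(&example_mdev_list_lock);
              list_for_each_entry(tmp, &example_mdev_list, next) {
                      if (guid_equal(&tmp->uuid, uuid)) {
                              mutex_unlock(&example_mdev_list_lock);
                              return -EEXIST; /* duplicate across all parents */
                      }
              }
              guid_copy(&mdev->uuid, uuid);
              mdev->active = false;   /* a racing remove sees -EAGAIN for now */
              list_add(&mdev->next, &example_mdev_list);
              mutex_unlock(&example_mdev_list_lock);
              return 0;       /* finish setup, then set mdev->active = true */
      }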
     
  • [ Upstream commit 28a68387888997e8a7fa57940ea5d55f2e16b594 ]

    If the IOMMU group setup fails, the reset module is not released.

    Fixes: b5add544d677d363 ("vfio, platform: make reset driver a requirement by default")
    Signed-off-by: Geert Uytterhoeven
    Reviewed-by: Eric Auger
    Reviewed-by: Simon Horman
    Acked-by: Eric Auger
    Signed-off-by: Alex Williamson
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Geert Uytterhoeven
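
    A short, hedged sketch of the error-path ordering implied by this fix:
    if IOMMU group setup fails after the reset handler was looked up, the
    reset module reference must be dropped on the way out (function names
    are illustrative placeholders, not the vfio-platform API):

      struct example_vdev;

      int example_get_reset(struct example_vdev *vdev);
      void example_put_reset(struct example_vdev *vdev);
      int example_iommu_group_get(struct example_vdev *vdev);

      static int example_platform_probe(struct example_vdev *vdev)
      {
              int ret;

              ret = example_get_reset(vdev);  /* may take a module reference */
              if (ret)
                      return ret;

              ret = example_iommu_group_get(vdev);
              if (ret) {
                      example_put_reset(vdev);        /* the missing release */
                      return ret;
              }

              return 0;
      }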
     

28 Jul, 2018

1 commit

  • commit 76fa4975f3ed12d15762bc979ca44078598ed8ee upstream.

    A VM which has:
    - a DMA capable device passed through to it (eg. network card);
    - running a malicious kernel that ignores H_PUT_TCE failure;
    - capability of using IOMMU pages bigger than physical pages
    can create an IOMMU mapping that exposes (for example) 16MB of
    the host physical memory to the device when only 64K was allocated to the VM.

    The remaining 16MB - 64K will be some other content of host memory, possibly
    including pages of the VM, but also pages of host kernel memory, host
    programs or other VMs.

    The attacking VM does not control the location of the page it can map,
    and is only allowed to map as many pages as it has pages of RAM.

    We already have a check in drivers/vfio/vfio_iommu_spapr_tce.c that
    an IOMMU page is contained in the physical page so the PCI hardware won't
    get access to unassigned host memory; however this check is missing in
    the KVM fastpath (H_PUT_TCE accelerated code). We were lucky so far and
    did not hit this yet because the very first time the mapping happens
    we do not have tbl::it_userspace allocated yet and fall back to
    userspace, which in turn calls the VFIO IOMMU driver; this fails and
    the guest does not retry.

    This stores the smallest preregistered page size in the preregistered
    region descriptor and changes the mm_iommu_xxx API to check this against
    the IOMMU page size.

    This calculates maximum page size as a minimum of the natural region
    alignment and compound page size. For the page shift this uses the shift
    returned by find_linux_pte() which indicates how the page is mapped to
    the current userspace - if the page is huge and this is not zero, then
    it is a leaf pte and the page is mapped within the range.

    Fixes: 121f80ba68f1 ("KVM: PPC: VFIO: Add in-kernel acceleration for VFIO")
    Cc: stable@vger.kernel.org # v4.12+
    Signed-off-by: Alexey Kardashevskiy
    Reviewed-by: David Gibson
    Signed-off-by: Michael Ellerman
    Signed-off-by: Alexey Kardashevskiy
    Signed-off-by: Greg Kroah-Hartman

    Alexey Kardashevskiy
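
    A minimal sketch of the containment check this commit describes:
    refuse to create an IOMMU mapping whose page is larger than the
    smallest host page backing the preregistered region (names are
    illustrative):

      #include <linux/types.h>

      static bool example_page_contained(unsigned int host_pageshift,
                                         unsigned int iommu_pageshift)
      {
              /*
               * E.g. a 16MB IOMMU page (shift 24) backed by 64K host pages
               * (shift 16): 16 >= 24 is false, so the mapping is refused.
               */
              return host_pageshift >= iommu_pageshift;
      }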
     

25 Jul, 2018

2 commits

  • commit 1463edca6734d42ab4406fa2896e20b45478ea36 upstream.

    The size is always equal to 1 page so let's use this. Later on this will
    be used for other checks which use page shifts to check the granularity
    of access.

    This should cause no behavioral change.

    Cc: stable@vger.kernel.org # v4.12+
    Reviewed-by: David Gibson
    Acked-by: Alex Williamson
    Signed-off-by: Alexey Kardashevskiy
    Signed-off-by: Michael Ellerman
    Signed-off-by: Greg Kroah-Hartman

    Alexey Kardashevskiy
     
  • commit 0e714d27786ce1fb3efa9aac58abc096e68b1c2a upstream.

    info.index can be indirectly controlled by user-space, hence leading
    to a potential exploitation of the Spectre variant 1 vulnerability.

    This issue was detected with the help of Smatch:

    drivers/vfio/pci/vfio_pci.c:734 vfio_pci_ioctl()
    warn: potential spectre issue 'vdev->region'

    Fix this by sanitizing info.index before indirectly using it to index
    vdev->region

    Notice that given that speculation windows are large, the policy is
    to kill the speculation on the first load and not worry if it can be
    completed with a dependent load/store [1].

    [1] https://marc.info/?l=linux-kernel&m=152449131114778&w=2

    Cc: stable@vger.kernel.org
    Signed-off-by: Gustavo A. R. Silva
    Signed-off-by: Alex Williamson
    Signed-off-by: Greg Kroah-Hartman

    Gustavo A. R. Silva
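
    A hedged sketch of the sanitization pattern this commit applies, using
    the kernel's array_index_nospec() helper to clamp a user-influenced
    index before it is used to dereference an array (the surrounding code
    is illustrative):

      #include <linux/nospec.h>

      static void *example_lookup_region(void **regions, unsigned int num_regions,
                                         unsigned int index)
      {
              if (index >= num_regions)
                      return NULL;

              /* Clamp the index so it cannot be speculated out of bounds */
              index = array_index_nospec(index, num_regions);
              return regions[index];
      }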
     

11 Jul, 2018

1 commit

  • commit bb94b55af3461e26b32f0e23d455abeae0cfca5d upstream.

    The patch noted in the fixes below converted get_user_pages_fast() to
    get_user_pages_longterm(), however the two calls differ in a few ways.

    First _fast() is documented to not require the mmap_sem, while _longterm()
    is documented to need it. Hold the mmap sem as required.

    Second, _fast accepts an 'int write' while _longterm uses 'unsigned int
    gup_flags', so the expression '!!(prot & IOMMU_WRITE)' is only working by
    luck as FOLL_WRITE is currently == 0x1. Use the expected FOLL_WRITE
    constant instead.

    Fixes: 94db151dc892 ("vfio: disable filesystem-dax page pinning")
    Cc:
    Signed-off-by: Jason Gunthorpe
    Acked-by: Dan Williams
    Signed-off-by: Alex Williamson
    Signed-off-by: Greg Kroah-Hartman

    Jason Gunthorpe
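
    A hedged sketch of the two fixes described here, using the API names
    as they existed at the time of this commit (mmap_sem and
    get_user_pages_longterm()); the surrounding helper is illustrative:

      #include <linux/mm.h>
      #include <linux/sched.h>
      #include <linux/sched/mm.h>

      static long example_pin_one_page(unsigned long vaddr, bool writable,
                                       struct page **page)
      {
              unsigned int gup_flags = writable ? FOLL_WRITE : 0;
              long ret;

              down_read(&current->mm->mmap_sem);      /* _longterm needs mmap_sem */
              ret = get_user_pages_longterm(vaddr, 1, gup_flags, page, NULL);
              up_read(&current->mm->mmap_sem);

              return ret;
      }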
     

24 Apr, 2018

1 commit

  • commit cf0d53ba4947aad6e471491d5b20a567cbe92e56 upstream.

    MRRS defines the maximum read request size a device is allowed to
    make. Drivers will often increase this to allow more data transfer
    with a single request. Completions to this request are bound by the
    MPS setting for the bus. Aside from device quirks (none known), it
    doesn't seem to make sense to set an MRRS value less than MPS, yet
    this is a likely scenario given that user drivers do not have a
    system-wide view of the PCI topology. Virtualize MRRS such that the
    user can set MRRS >= MPS, but use MPS as the floor value that we'll
    write to hardware.

    Signed-off-by: Alex Williamson
    Signed-off-by: Greg Kroah-Hartman

    Alex Williamson
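
    A minimal sketch of the MRRS handling described above: accept the
    user's (virtualized) value, but never program hardware below the bus
    MPS (the helper is illustrative):

      #include <linux/types.h>

      static u16 example_mrrs_to_write(u16 user_mrrs, u16 bus_mps)
      {
              /* MPS is the floor for the value that actually reaches hardware */
              return user_mrrs < bus_mps ? bus_mps : user_mrrs;
      }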
     

09 Mar, 2018

1 commit

  • commit 94db151dc89262bfa82922c44e8320cea2334667 upstream.

    Filesystem-DAX is incompatible with 'longterm' page pinning. Without
    page cache indirection a DAX mapping maps filesystem blocks directly.
    This means that the filesystem must not modify a file's block map while
    any page in a mapping is pinned. In order to prevent the situation of
    userspace holding off filesystem operations indefinitely, disallow
    'longterm' Filesystem-DAX mappings.

    RDMA has the same conflict and the plan there is to add a 'with lease'
    mechanism to allow the kernel to notify userspace that the mapping is
    being torn down for block-map maintenance. Perhaps something similar can
    be put in place for vfio.

    Note that xfs and ext4 still report:

    "DAX enabled. Warning: EXPERIMENTAL, use at your own risk"

    ...at mount time, and resolving the dax-dma-vs-truncate problem is one
    of the last hurdles to remove that designation.

    Acked-by: Alex Williamson
    Cc: Michal Hocko
    Cc: kvm@vger.kernel.org
    Cc:
    Reported-by: Haozhong Zhang
    Tested-by: Haozhong Zhang
    Fixes: d475c6346a38 ("dax,ext2: replace XIP read and write with DAX I/O")
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Dan Williams
    Signed-off-by: Greg Kroah-Hartman

    Dan Williams
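
    A hedged sketch of the policy this commit enforces: a long-term pin
    path can reject VMAs backed by filesystem DAX, since the filesystem
    must stay free to change the underlying block map (the helper is
    illustrative):

      #include <linux/errno.h>
      #include <linux/fs.h>
      #include <linux/mm.h>

      static int example_check_longterm_pin(struct vm_area_struct *vma)
      {
              if (vma_is_fsdax(vma))
                      return -EOPNOTSUPP;     /* no 'longterm' pins on fsdax */
              return 0;
      }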
     

25 Dec, 2017

1 commit

  • [ Upstream commit 523184972b282cd9ca17a76f6ca4742394856818 ]

    With virtual PCI-Express chipsets, we now see userspace/guest drivers
    trying to match the physical MPS setting to a virtual downstream port.
    Of course a lone physical device surrounded by virtual interconnects
    cannot make a correct decision for a proper MPS setting. Instead,
    let's virtualize the MPS control register so that writes through to
    hardware are disallowed. Userspace drivers like QEMU assume they can
    write anything to the device and we'll filter out anything dangerous.
    Since mismatched MPS can lead to AER and other faults, let's add it
    to the kernel side rather than relying on userspace virtualization to
    handle it.

    Signed-off-by: Alex Williamson
    Reviewed-by: Eric Auger
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Alex Williamson
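
    A hedged sketch of config-write filtering in the spirit of this
    commit: the user-written payload-size bits land only in a virtual copy
    of the Device Control register, while hardware keeps its own value
    (the helper is illustrative, not the vfio-pci implementation):

      #include <linux/pci.h>

      static u16 example_filter_devctl(u16 hw_val, u16 user_val)
      {
              /* Hardware keeps its MPS bits; everything else follows the user */
              return (hw_val & PCI_EXP_DEVCTL_PAYLOAD) |
                     (user_val & ~PCI_EXP_DEVCTL_PAYLOAD);
      }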
     

02 Nov, 2017

1 commit

  • Many source files in the tree are missing licensing information, which
    makes it harder for compliance tools to determine the correct license.

    By default all files without license information are under the default
    license of the kernel, which is GPL version 2.

    Update the files which contain no license information with the 'GPL-2.0'
    SPDX license identifier. The SPDX identifier is a legally binding
    shorthand, which can be used instead of the full boiler plate text.

    This patch is based on work done by Thomas Gleixner and Kate Stewart and
    Philippe Ombredanne.

    How this work was done:

    Patches were generated and checked against linux-4.14-rc6 for a subset of
    the use cases:
    - file had no licensing information in it.
    - file was a */uapi/* one with no licensing information in it,
    - file was a */uapi/* one with existing licensing information,

    Further patches will be generated in subsequent months to fix up cases
    where non-standard license headers were used, and references to license
    had to be inferred by heuristics based on keywords.

    The analysis to determine which SPDX License Identifier to be applied to
    a file was done in a spreadsheet of side by side results from of the
    output of two independent scanners (ScanCode & Windriver) producing SPDX
    tag:value files created by Philippe Ombredanne. Philippe prepared the
    base worksheet, and did an initial spot review of a few 1000 files.

    The 4.13 kernel was the starting point of the analysis with 60,537 files
    assessed. Kate Stewart did a file by file comparison of the scanner
    results in the spreadsheet to determine which SPDX license identifier(s)
    to be applied to the file. She confirmed any determination that was not
    immediately clear with lawyers working with the Linux Foundation.

    Criteria used to select files for SPDX license identifier tagging was:
    - Files considered eligible had to be source code files.
    - Make and config files were included as candidates if they contained >5
    lines of source
    - File already had some variant of a license header in it (even if <5
    lines).

    Reviewed-by: Philippe Ombredanne
    Reviewed-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Greg Kroah-Hartman
     

31 Aug, 2017

3 commits

    amba_id entries are not supposed to change at runtime, and all
    functions that use them take a const amba_id, so mark the non-const
    structs as const.

    Signed-off-by: Arvind Yadav
    Signed-off-by: Alex Williamson

    Arvind Yadav
     
  • When the user unbinds the last device of a group from a vfio bus
    driver, the devices within that group should be available for other
    purposes. We currently have a race that makes this generally, but
    not always true. The device can be unbound from the vfio bus driver,
    but remaining IOMMU context of the group attached to the container
    can result in errors as the next driver configures DMA for the device.

    Wait for the group to be detached from the IOMMU backend before
    allowing the bus driver remove callback to complete.

    Signed-off-by: Alex Williamson

    Alex Williamson
     
  • In vfio_iommu_group_get() we want to increase the reference
    count of the iommu group.

    In the noiommu case, the group does not exist yet and is allocated.
    iommu_group_add_device() increases the group ref count. However we
    then call iommu_group_put() which decrements it.

    This leads to a "refcount_t: underflow WARN_ON".

    Only decrement the ref count in case of iommu_group_add_device
    failure.

    Signed-off-by: Eric Auger
    Signed-off-by: Alex Williamson

    Eric Auger
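
    A minimal sketch of the corrected error handling: the reference that
    iommu_group_add_device() takes is only dropped when that call fails
    (the surrounding function is illustrative):

      #include <linux/device.h>
      #include <linux/iommu.h>

      static int example_group_add(struct iommu_group *group, struct device *dev)
      {
              int ret;

              ret = iommu_group_add_device(group, dev);       /* takes a group ref */
              if (ret) {
                      iommu_group_put(group);                 /* only on failure */
                      return ret;
              }

              return 0;       /* keep the reference for the added device */
      }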
     

11 Aug, 2017

2 commits

  • If the IOMMU driver advertises 'real' reserved regions for MSIs, but
    still includes the software-managed region as well, we are currently
    blind to the former and will configure the IOMMU domain to map MSIs into
    the latter, which is unlikely to work as expected.

    Since it would take a ridiculous hardware topology for both regions to
    be valid (which would be rather difficult to support in general), we
    should be safe to assume that the presence of any hardware regions makes
    the software region irrelevant. However, the IOMMU driver might still
    advertise the software region by default, particularly if the hardware
    regions are filled in elsewhere by generic code, so it might not be fair
    for VFIO to be super-strict about not mixing them. To that end, make
    vfio_iommu_has_sw_msi() robust against the presence of both region types
    at once, so that we end up doing what is almost certainly right, rather
    than what is almost certainly wrong.

    Signed-off-by: Robin Murphy
    Tested-by: Shameer Kolothum
    Reviewed-by: Eric Auger
    Signed-off-by: Alex Williamson

    Robin Murphy
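
    A hedged sketch of the region scan this commit describes: the presence
    of any hardware-managed MSI reserved region makes the software-managed
    one irrelevant (structure and names loosely follow the upstream idea):

      #include <linux/iommu.h>
      #include <linux/list.h>
      #include <linux/types.h>

      static bool example_needs_sw_msi(struct list_head *resv_regions,
                                       phys_addr_t *sw_base)
      {
              struct iommu_resv_region *region;
              bool hw = false, sw = false;

              list_for_each_entry(region, resv_regions, list) {
                      if (region->type == IOMMU_RESV_MSI)
                              hw = true;      /* hardware handles MSI isolation */
                      if (region->type == IOMMU_RESV_SW_MSI && !sw) {
                              *sw_base = region->start;
                              sw = true;
                      }
              }

              return sw && !hw;       /* only map MSIs when no hardware region exists */
      }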
     
  • For ARM-based systems with a GICv3 ITS to provide interrupt isolation,
    but hardware limitations which are worked around by having MSIs bypass
    SMMU translation (e.g. HiSilicon Hip06/Hip07), VFIO neglects to check
    for the IRQ_DOMAIN_FLAG_MSI_REMAP capability, (and thus erroneously
    demands unsafe_interrupts) if a software-managed MSI region is absent.

    Fix this by always checking for isolation capability at both the IRQ
    domain and IOMMU domain levels, rather than predicating that on whether
    MSIs require an IOMMU mapping (which was always slightly tenuous logic).

    Signed-off-by: Robin Murphy
    Tested-by: Shameer Kolothum
    Reviewed-by: Eric Auger
    Signed-off-by: Alex Williamson

    Robin Murphy
     

27 Jul, 2017

1 commit

  • Device lock bites again; if a device .remove() callback races a user
    calling ioctl(VFIO_GROUP_GET_DEVICE_FD), the unbind request will hold
    the device lock, but the user ioctl may have already taken a vfio_device
    reference. In the case of a PCI device, the initial open will attempt
    to reset the device, which again attempts to get the device lock,
    resulting in deadlock. Use the trylock PCI reset interface and return
    error on the open path if reset fails due to lock contention.

    Link: https://lkml.org/lkml/2017/7/25/381
    Reported-by: Wen Congyang
    Signed-off-by: Alex Williamson

    Alex Williamson
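
    A minimal sketch of the open-path behavior described here: use the
    trylock reset interface and back out with an error when the device
    lock is contended, instead of deadlocking against a concurrent unbind
    (the wrapper is illustrative):

      #include <linux/errno.h>
      #include <linux/pci.h>

      static int example_open_reset(struct pci_dev *pdev)
      {
              /*
               * pci_try_reset_function() fails rather than blocking on the
               * device lock; report the contention back to the caller.
               */
              if (pci_try_reset_function(pdev))
                      return -EAGAIN;
              return 0;
      }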
     

14 Jul, 2017

1 commit

  • Pull VFIO updates from Alex Williamson:

    - Include Intel XXV710 in INTx workaround (Alex Williamson)

    - Make use of ERR_CAST() for error return (Dan Carpenter)

    - Fix vfio_group release deadlock from iommu notifier (Alex Williamson)

    - Unset KVM-VFIO attributes only on group match (Alex Williamson)

    - Fix release path group/file matching with KVM-VFIO (Alex Williamson)

    - Remove unnecessary lock uses triggering lockdep splat (Alex Williamson)

    * tag 'vfio-v4.13-rc1' of git://github.com/awilliam/linux-vfio:
    vfio: Remove unnecessary uses of vfio_container.group_lock
    vfio: New external user group/file match
    kvm-vfio: Decouple only when we match a group
    vfio: Fix group release deadlock
    vfio: Use ERR_CAST() instead of open coding it
    vfio/pci: Add Intel XXV710 to hidden INTx devices

    Linus Torvalds
     

08 Jul, 2017

1 commit

  • The original intent of vfio_container.group_lock is to protect
    vfio_container.group_list, however over time it's become a crutch to
    prevent changes in container composition any time we call into the
    iommu driver backend. This introduces problems when we start to have
    more complex interactions, for example when a user's DMA unmap request
    triggers a notification to an mdev vendor driver, who responds by
    attempting to unpin mappings within that request, re-entering the
    iommu backend. We incorrectly assume that the use of read-locks here
    allows for this nested locking behavior, but a poorly timed write-lock
    could in fact trigger a deadlock.

    The current use of group_lock seems to fall into the trap of locking
    code, not data. Correct that by removing uses of group_lock that are
    not directly related to group_list. Note that the vfio type1 iommu
    backend has its own mutex, vfio_iommu.lock, which it uses to protect
    itself for each of these interfaces anyway. The group_lock appears to
    be a redundancy for these interfaces and type1 even goes so far as to
    release its mutex to allow for exactly the re-entrant code path above.

    Reported-by: Chuanxiao Dong
    Signed-off-by: Alex Williamson
    Acked-by: Alexey Kardashevskiy
    Cc: stable@vger.kernel.org # v4.10+

    Alex Williamson
     

29 Jun, 2017

2 commits

  • At the point where the kvm-vfio pseudo device wants to release its
    vfio group reference, we can't always acquire a new reference to make
    that happen. The group can be in a state where we wouldn't allow a
    new reference to be added. This new helper function allows a caller
    to match a file to a group to facilitate this. Given a file and
    group, report if they match. Thus the caller needs to already have a
    group reference to match to the file. This allows the deletion of a
    group without acquiring a new reference.

    Signed-off-by: Alex Williamson
    Reviewed-by: Eric Auger
    Reviewed-by: Paolo Bonzini
    Tested-by: Eric Auger
    Cc: stable@vger.kernel.org

    Alex Williamson
     
  • If vfio_iommu_group_notifier() acquires a group reference and that
    reference becomes the last reference to the group, then vfio_group_put
    introduces a deadlock code path where we're trying to unregister from
    the iommu notifier chain from within a callout of that chain. Use a
    work_struct to release this reference asynchronously.

    Signed-off-by: Alex Williamson
    Reviewed-by: Eric Auger
    Tested-by: Eric Auger
    Cc: stable@vger.kernel.org

    Alex Williamson
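
    A hedged sketch of the deferred release: when dropping the reference
    from within the notifier callout could recurse into notifier
    unregistration, punt the final put to a workqueue (names are
    illustrative):

      #include <linux/kernel.h>
      #include <linux/workqueue.h>

      struct example_group {
              struct work_struct put_work;
      };

      static void example_group_put_bg(struct work_struct *work)
      {
              struct example_group *group =
                      container_of(work, struct example_group, put_work);

              /* drop the reference here, outside the notifier chain */
              (void)group;
      }

      static void example_group_schedule_put(struct example_group *group)
      {
              INIT_WORK(&group->put_work, example_group_put_bg);
              schedule_work(&group->put_work);
      }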
     

20 Jun, 2017

1 commit

  • Rename:

    wait_queue_t => wait_queue_entry_t

    'wait_queue_t' was always a slight misnomer: its name implies that it's a "queue",
    but in reality it's a queue *entry*. The 'real' queue is the wait queue head,
    which had to carry the name.

    Start sorting this out by renaming it to 'wait_queue_entry_t'.

    This also allows the real structure name 'struct __wait_queue' to
    lose its double underscore and become 'struct wait_queue_entry',
    which is the more canonical nomenclature for such data types.

    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Ingo Molnar
     

06 May, 2017

1 commit

  • Pull powerpc updates from Michael Ellerman:
    "Highlights include:

    - Larger virtual address space on 64-bit server CPUs. By default we
    use a 128TB virtual address space, but a process can request access
    to the full 512TB by passing a hint to mmap().

    - Support for the new Power9 "XIVE" interrupt controller.

    - TLB flushing optimisations for the radix MMU on Power9.

    - Support for CAPI cards on Power9, using the "Coherent Accelerator
    Interface Architecture 2.0".

    - The ability to configure the mmap randomisation limits at build and
    runtime.

    - Several small fixes and cleanups to the kprobes code, as well as
    support for KPROBES_ON_FTRACE.

    - Major improvements to handling of system reset interrupts,
    correctly treating them as NMIs, giving them a dedicated stack and
    using a new hypervisor call to trigger them, all of which should
    aid debugging and robustness.

    - Many fixes and other minor enhancements.

    Thanks to: Alastair D'Silva, Alexey Kardashevskiy, Alistair Popple,
    Andrew Donnellan, Aneesh Kumar K.V, Anshuman Khandual, Anton
    Blanchard, Balbir Singh, Ben Hutchings, Benjamin Herrenschmidt,
    Bhupesh Sharma, Chris Packham, Christian Zigotzky, Christophe Leroy,
    Christophe Lombard, Daniel Axtens, David Gibson, Gautham R. Shenoy,
    Gavin Shan, Geert Uytterhoeven, Guilherme G. Piccoli, Hamish Martin,
    Hari Bathini, Kees Cook, Laurent Dufour, Madhavan Srinivasan, Mahesh J
    Salgaonkar, Mahesh Salgaonkar, Masami Hiramatsu, Matt Brown, Matthew
    R. Ochs, Michael Neuling, Naveen N. Rao, Nicholas Piggin, Oliver
    O'Halloran, Pan Xinhui, Paul Mackerras, Rashmica Gupta, Russell
    Currey, Sukadev Bhattiprolu, Thadeu Lima de Souza Cascardo, Tobin C.
    Harding, Tyrel Datwyler, Uma Krishnan, Vaibhav Jain, Vipin K Parashar,
    Yang Shi"

    * tag 'powerpc-4.12-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (214 commits)
    powerpc/64s: Power9 has no LPCR[VRMASD] field so don't set it
    powerpc/powernv: Fix TCE kill on NVLink2
    powerpc/mm/radix: Drop support for CPUs without lockless tlbie
    powerpc/book3s/mce: Move add_taint() later in virtual mode
    powerpc/sysfs: Move #ifdef CONFIG_HOTPLUG_CPU out of the function body
    powerpc/smp: Document irq enable/disable after migrating IRQs
    powerpc/mpc52xx: Don't select user-visible RTAS_PROC
    powerpc/powernv: Document cxl dependency on special case in pnv_eeh_reset()
    powerpc/eeh: Clean up and document event handling functions
    powerpc/eeh: Avoid use after free in eeh_handle_special_event()
    cxl: Mask slice error interrupts after first occurrence
    cxl: Route eeh events to all drivers in cxl_pci_error_detected()
    cxl: Force context lock during EEH flow
    powerpc/64: Allow CONFIG_RELOCATABLE if COMPILE_TEST
    powerpc/xmon: Teach xmon oops about radix vectors
    powerpc/mm/hash: Fix off-by-one in comment about kernel contexts ids
    powerpc/pseries: Enable VFIO
    powerpc/powernv: Fix iommu table size calculation hook for small tables
    powerpc/powernv: Check kzalloc() return value in pnv_pci_table_alloc
    powerpc: Add arch/powerpc/tools directory
    ...

    Linus Torvalds
     

19 Apr, 2017

2 commits

  • vfio_pin_pages_remote() is typically called to iterate over a range
    of memory. Testing CAP_IPC_LOCK is relatively expensive, so it makes
    sense to push it up to the caller, which can then repeatedly call
    vfio_pin_pages_remote() using that value. This can show nearly a 20%
    improvement on the worst case path through VFIO_IOMMU_MAP_DMA with
    contiguous page mapping disabled. Testing RLIMIT_MEMLOCK is much more
    lightweight, but we bring it along on the same principle and it does
    seem to show a marginal improvement.

    Reviewed-by: Peter Xu
    Reviewed-by: Kirti Wankhede
    Signed-off-by: Alex Williamson

    Alex Williamson
     
  • With vfio_lock_acct() testing the locked memory limit under mmap_sem,
    it's redundant to do it here for a single page. We can also reorder
    our tests such that we can avoid testing for reserved pages if we're
    not doing accounting and let vfio_lock_acct() test the process
    CAP_IPC_LOCK. Finally, this function oddly returns 1 on success.
    Update to return zero on success, -errno on error. Since the function
    only pins a single page, there's no need to return the number of pages
    pinned.

    N.B. vfio_pin_pages_remote() can pin a large contiguous range of pages
    before calling vfio_lock_acct(). If we were to similarly remove the
    extra test there, a user could temporarily pin far more pages than
    they're allowed.

    Suggested-by: Kirti Wankhede
    Suggested-by: Eric Auger
    Reviewed-by: Kirti Wankhede
    Reviewed-by: Peter Xu
    Signed-off-by: Alex Williamson

    Alex Williamson
     

14 Apr, 2017

1 commit

  • If the mmap_sem is contented then the vfio type1 IOMMU backend will
    defer locked page accounting updates to a workqueue task. This has a
    few problems and depending on which side the user tries to play, they
    might be over-penalized for unmaps that haven't yet been accounted or
    race the workqueue to enter more mappings than they're allowed. The
    original intent of this workqueue mechanism seems to be focused on
    reducing latency through the ioctl, but we cannot do so at the cost
    of correctness. Remove this workqueue mechanism and update the
    callers to allow for failure. We can also now recheck the limit under
    write lock to make sure we don't exceed it.

    vfio_pin_pages_remote() also now necessarily includes an unwind path
    which we can jump to directly if the consecutive page pinning finds
    that we're exceeding the user's memory limits. This avoids the
    current lazy approach which does accounting and mapping up to the
    fault, only to return an error on the next iteration to unwind the
    entire vfio_dma.

    Cc: stable@vger.kernel.org
    Reviewed-by: Peter Xu
    Reviewed-by: Kirti Wankhede
    Signed-off-by: Alex Williamson

    Alex Williamson
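
    A hedged sketch of synchronous accounting in the spirit of this
    change: take the mm semaphore, recheck the memlock limit, and fail the
    caller instead of deferring the update to a workqueue (names follow
    the era's API and are illustrative):

      #include <linux/errno.h>
      #include <linux/mm.h>
      #include <linux/sched/signal.h>

      static int example_lock_acct(struct mm_struct *mm, long npage, bool lock_cap)
      {
              unsigned long limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
              int ret = 0;

              down_write(&mm->mmap_sem);
              if (!lock_cap && mm->locked_vm + npage > limit)
                      ret = -ENOMEM;  /* caller unwinds instead of over-pinning */
              else
                      mm->locked_vm += npage;
              up_write(&mm->mmap_sem);

              return ret;
      }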
     

12 Apr, 2017

2 commits

  • This adds missing checking for kzalloc() return value.

    Fixes: 4b6fad7097f8 ("powerpc/mm/iommu, vfio/spapr: Put pages on VFIO container shutdown")
    Signed-off-by: Alexey Kardashevskiy
    Reviewed-by: David Gibson
    Signed-off-by: Alex Williamson

    Alexey Kardashevskiy
     
  • The existing SPAPR TCE driver advertises both VFIO_SPAPR_TCE_IOMMU and
    VFIO_SPAPR_TCE_v2_IOMMU types to the userspace and the userspace usually
    picks the v2.

    Normally the userspace would create a container, attach an IOMMU group
    to it and only then set the IOMMU type (which would normally be v2).

    However a specific IOMMU group may not support v2, in other words
    it may not implement set_window/unset_window/take_ownership/
    release_ownership and such a group should not be attached to
    a v2 container.

    This adds extra checks that a new group can do what the selected IOMMU
    type suggests. The userspace can then test the return value from
    ioctl(VFIO_SET_IOMMU, VFIO_SPAPR_TCE_v2_IOMMU) and try
    VFIO_SPAPR_TCE_IOMMU.

    Signed-off-by: Alexey Kardashevskiy
    Signed-off-by: Alex Williamson

    Alexey Kardashevskiy
     

30 Mar, 2017

2 commits

    So far iommu_table objects were only used in virtual mode and had
    a single owner. We are going to change this by implementing in-kernel
    acceleration of DMA mapping requests. The proposed acceleration
    will handle requests in real mode and KVM will keep references to tables.

    This adds a kref to iommu_table and defines new helpers to update it.
    This replaces iommu_free_table() with iommu_tce_table_put() and makes
    iommu_free_table() static. iommu_tce_table_get() is not used in this patch
    but it will be in the following patch.

    Since this touches prototypes, this also removes @node_name parameter as
    it has never been really useful on powernv and carrying it for
    the pseries platform code to iommu_free_table() seems to be quite
    useless as well.

    This should cause no behavioral change.

    Signed-off-by: Alexey Kardashevskiy
    Reviewed-by: David Gibson
    Acked-by: Alex Williamson
    Signed-off-by: Michael Ellerman

    Alexey Kardashevskiy
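
    A minimal sketch of the reference counting described here: an embedded
    kref with get/put helpers, where the final put disposes of the table
    (names are illustrative, not the exact powerpc API):

      #include <linux/kernel.h>
      #include <linux/kref.h>
      #include <linux/slab.h>

      struct example_iommu_table {
              struct kref it_kref;
      };

      static void example_table_release(struct kref *kref)
      {
              struct example_iommu_table *tbl =
                      container_of(kref, struct example_iommu_table, it_kref);

              kfree(tbl);     /* dispose of the table once the last user is gone */
      }

      static void example_tce_table_get(struct example_iommu_table *tbl)
      {
              kref_get(&tbl->it_kref);
      }

      static void example_tce_table_put(struct example_iommu_table *tbl)
      {
              kref_put(&tbl->it_kref, example_table_release);
      }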
     
    At the moment iommu_table can be disposed of by either calling
    iommu_free_table() directly or it_ops::free(); the only implementation
    of free() is in IODA2 - pnv_ioda2_table_free() - and it calls
    iommu_free_table() anyway.

    As we are going to have reference counting on tables, we need a unified
    way of disposing tables.

    This moves it_ops::free() call into iommu_free_table() and makes use
    of the latter. The free() callback now handles only platform-specific
    data.

    Since iommu_free_table() now calls it_ops->free(), it_ops must be
    initialized before iommu_free_table() is called, so this moves that
    initialization into pnv_pci_ioda2_create_table().

    This should cause no behavioral change.

    Signed-off-by: Alexey Kardashevskiy
    Reviewed-by: David Gibson
    Acked-by: Alex Williamson
    Signed-off-by: Michael Ellerman

    Alexey Kardashevskiy
     

22 Mar, 2017

2 commits

  • The introduction of reserved regions has left a couple of rough edges
    which we could do with sorting out sooner rather than later. Since we
    are not yet addressing the potential dynamic aspect of software-managed
    reservations and presenting them at arbitrary fixed addresses, it is
    incongruous that we end up displaying hardware vs. software-managed MSI
    regions to userspace differently, especially since ARM-based systems may
    actually require one or the other, or even potentially both at once,
    (which iommu-dma currently has no hope of dealing with at all). Let's
    resolve the former user-visible inconsistency ASAP before the ABI has
    been baked into a kernel release, in a way that also lays the groundwork
    for the latter shortcoming to be addressed by follow-up patches.

    For clarity, rename the software-managed type to IOMMU_RESV_SW_MSI, use
    IOMMU_RESV_MSI to describe the hardware type, and document everything a
    little bit. Since the x86 MSI remapping hardware falls squarely under
    this meaning of IOMMU_RESV_MSI, apply that type to their regions as well,
    so that we tell the same story to userspace across all platforms.

    Secondly, as the various region types require quite different handling,
    and it really makes little sense to ever try combining them, convert the
    bitfield-esque #defines to a plain enum in the process before anyone
    gets the wrong impression.

    Fixes: d30ddcaa7b02 ("iommu: Add a new type field in iommu_resv_region")
    Reviewed-by: Eric Auger
    CC: Alex Williamson
    CC: David Woodhouse
    CC: kvm@vger.kernel.org
    Signed-off-by: Robin Murphy
    Signed-off-by: Joerg Roedel

    Robin Murphy
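
    A short sketch of the enum conversion this commit describes: the
    region types become a plain enum with distinct hardware and software
    MSI types, rather than bitfield-style defines (names and values shown
    here are illustrative):

      enum example_iommu_resv_type {
              EXAMPLE_RESV_DIRECT,    /* direct-mapped, e.g. firmware regions */
              EXAMPLE_RESV_RESERVED,  /* never mapped at all */
              EXAMPLE_RESV_MSI,       /* hardware MSI doorbell, fixed by platform */
              EXAMPLE_RESV_SW_MSI,    /* software-managed MSI window */
      };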
     
  • The intent of the original warning is make sure that the mdev vendor
    driver has removed any group notifiers at the point where the group
    is closed by the user. Theoretically this would be through an
    orderly shutdown where any devices are release prior to the group
    release. We can't always count on an orderly shutdown, the user can
    close the group before the notifier can be removed or the user task
    might be killed. We'd like to add this sanity test when the group is
    idle and the only references are from the devices within the group
    themselves, but we don't have a good way to do that. Instead check
    both when the group itself is removed and when the group is opened.
    A bit later than we'd prefer, but better than the current overly
    aggressive approach.

    Fixes: ccd46dbae77d ("vfio: support notifier chain in vfio_group")
    Signed-off-by: Alex Williamson
    Cc: # v4.10
    Cc: Jike Song

    Alex Williamson
     

02 Mar, 2017

2 commits

    We are going to split a new header out of <linux/sched.h>, which
    will have to be picked up from other headers and a couple of .c files.

    Create a trivial placeholder file for the new header that simply maps
    back to <linux/sched.h>, to make this patch obviously correct and
    bisectable.

    Include the new header in the files that are going to need it.

    Acked-by: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Ingo Molnar
     
    We are going to split <linux/sched/mm.h> out of <linux/sched.h>,
    which will have to be picked up from other headers and a couple of
    .c files.

    Create a trivial placeholder <linux/sched/mm.h> file that simply maps
    back to <linux/sched.h>, to make this patch obviously correct and
    bisectable.

    The APIs that are going to be moved first are:

    mm_alloc()
    __mmdrop()
    mmdrop()
    mmdrop_async_fn()
    mmdrop_async()
    mmget_not_zero()
    mmput()
    mmput_async()
    get_task_mm()
    mm_access()
    mm_release()

    Include the new header in the files that are going to need it.

    Acked-by: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Ingo Molnar
     

24 Feb, 2017

1 commit

  • Pull VFIO updates from Alex Williamson:

    - Kconfig fixes for SPAPR_TCE_IOMMU=n (Michael Ellerman)

    - Module softdep rather than request_module to simplify usage from
    initrd (Alex Williamson)

    - Comment typo fix (Changbin Du)

    * tag 'vfio-v4.11-rc1' of git://github.com/awilliam/linux-vfio:
    vfio: fix a typo in comment of function vfio_pin_pages
    vfio: Replace module request with softdep
    vfio/mdev: Use a module softdep for vfio_mdev
    vfio: Fix build break when SPAPR_TCE_IOMMU=n

    Linus Torvalds
     

23 Feb, 2017

1 commit