10 Sep, 2020

1 commit

  • commit 41311242221e3482b20bfed10fa4d9db98d87016 upstream.

    With conversion to follow_pfn(), DMA mapping a PFNMAP range depends on
    the range being faulted into the vma. Add support to manually provide
    that, in the same way as done on KVM with hva_to_pfn_remapped().

    Reviewed-by: Peter Xu
    Signed-off-by: Alex Williamson
    Signed-off-by: Ajay Kaher
    Signed-off-by: Sasha Levin

    Ajay Kaher
     

26 Aug, 2020

1 commit

  • [ Upstream commit aae7a75a821a793ed6b8ad502a5890fb8e8f172d ]

    The vfio_iommu_replay() function does not currently unwind on error,
    yet it does pin pages, perform IOMMU mapping, and modify the vfio_dma
    structure to indicate IOMMU mapping. The IOMMU mappings are torn down
    when the domain is destroyed, but the other actions go on to cause
    trouble later. For example, the iommu->domain_list can be empty if we
    only have a non-IOMMU backed mdev attached. We don't currently check
    if the list is empty before getting the first entry in the list, which
    leads to a bogus domain pointer. If a vfio_dma entry is erroneously
    marked as iommu_mapped, we'll attempt to use that bogus pointer to
    retrieve the existing physical page addresses.

    This is the scenario that uncovered this issue, attempting to hot-add
    a vfio-pci device to a container with an existing mdev device and DMA
    mappings, one of which could not be pinned, causing a failure adding
    the new group to the existing container and setting the conditions
    for a subsequent attempt to explode.

    To resolve this, we can first check if the domain_list is empty so
    that we can reject replay of a bogus domain, should we ever encounter
    this inconsistent state again in the future. The real fix though is
    to add the necessary unwind support, which means cleaning up the
    current pinning if an IOMMU mapping fails, then walking back through
    the r-b tree of DMA entries, reading from the IOMMU which ranges are
    mapped, and unmapping and unpinning those ranges. To be able to do
    this, we also defer marking the DMA entry as IOMMU mapped until all
    entries are processed, in order to allow the unwind to know the
    disposition of each entry.
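
    A condensed sketch of that unwind (locking and the coalescing of
    physically contiguous ranges are elided; 'd' is the vfio_domain being
    replayed, other names follow the surrounding type1 code):

        unwind:
            for (; n; n = rb_prev(n)) {
                struct vfio_dma *dma = rb_entry(n, struct vfio_dma, node);
                dma_addr_t iova = dma->iova;

                /* Entries mapped before this replay: tear down wholesale. */
                if (dma->iommu_mapped) {
                    iommu_unmap(d->domain, dma->iova, dma->size);
                    continue;
                }

                /* Otherwise, read back from the IOMMU which ranges this
                 * replay actually mapped, and unmap/unpin just those. */
                while (iova < dma->iova + dma->size) {
                    phys_addr_t phys = iommu_iova_to_phys(d->domain, iova);

                    if (!phys) {
                        iova += PAGE_SIZE;
                        continue;
                    }
                    iommu_unmap(d->domain, iova, PAGE_SIZE);
                    vfio_unpin_pages_remote(dma, iova, phys >> PAGE_SHIFT,
                                            1, true);
                    iova += PAGE_SIZE;
                }
            }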

    Fixes: a54eb55045ae ("vfio iommu type1: Add support for mediated devices")
    Reported-by: Zhiyi Guo
    Tested-by: Zhiyi Guo
    Reviewed-by: Cornelia Huck
    Signed-off-by: Alex Williamson
    Signed-off-by: Sasha Levin

    Alex Williamson
     

06 May, 2020

2 commits

  • commit 5cbf3264bc715e9eb384e2b68601f8c02bb9a61d upstream.

    Use follow_pfn() to get the PFN of a PFNMAP VMA instead of assuming that
    vma->vm_pgoff holds the base PFN of the VMA. This fixes a bug where
    attempting to do VFIO_IOMMU_MAP_DMA on an arbitrary PFNMAP'd region of
    memory calculates garbage for the PFN.

    Hilariously, this only got detected because the first "PFN" calculated
    by vaddr_get_pfn() is PFN 0 (vma->vm_pgoff==0), and iommu_iova_to_phys()
    uses PA==0 as an error, which triggers a WARN in vfio_unmap_unpin()
    because the translation "failed". PFN 0 is now unconditionally reserved
    on x86 in order to mitigate L1TF, which causes is_invalid_reserved_pfn()
    to return true and in turn results in vaddr_get_pfn() returning success
    for PFN 0. Eventually the bogus calculation runs into PFNs that aren't
    reserved and leads to failure in vfio_pin_map_dma(). The subsequent
    call to vfio_remove_dma() attempts to unmap PFN 0 and WARNs.

    WARNING: CPU: 8 PID: 5130 at drivers/vfio/vfio_iommu_type1.c:750 vfio_unmap_unpin+0x2e1/0x310 [vfio_iommu_type1]
    Modules linked in: vfio_pci vfio_virqfd vfio_iommu_type1 vfio ...
    CPU: 8 PID: 5130 Comm: sgx Tainted: G W 5.6.0-rc5-705d787c7fee-vfio+ #3
    Hardware name: Intel Corporation Mehlow UP Server Platform/Moss Beach Server, BIOS CNLSE2R1.D00.X119.B49.1803010910 03/01/2018
    RIP: 0010:vfio_unmap_unpin+0x2e1/0x310 [vfio_iommu_type1]
    Code: 0b 49 81 c5 00 10 00 00 e9 c5 fe ff ff bb 00 10 00 00 e9 3d fe
    RSP: 0018:ffffbeb5039ebda8 EFLAGS: 00010246
    RAX: 0000000000000000 RBX: ffff9a55cbf8d480 RCX: 0000000000000000
    RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff9a52b771c200
    RBP: 0000000000000000 R08: 0000000000000040 R09: 00000000fffffff2
    R10: 0000000000000001 R11: ffff9a51fa896000 R12: 0000000184010000
    R13: 0000000184000000 R14: 0000000000010000 R15: ffff9a55cb66ea08
    FS: 00007f15d3830b40(0000) GS:ffff9a55d5600000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 0000561cf39429e0 CR3: 000000084f75f005 CR4: 00000000003626e0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    Call Trace:
    vfio_remove_dma+0x17/0x70 [vfio_iommu_type1]
    vfio_iommu_type1_ioctl+0x9e3/0xa7b [vfio_iommu_type1]
    ksys_ioctl+0x92/0xb0
    __x64_sys_ioctl+0x16/0x20
    do_syscall_64+0x4c/0x180
    entry_SYSCALL_64_after_hwframe+0x44/0xa9
    RIP: 0033:0x7f15d04c75d7
    Code: 3d 01 f0 ff ff 73 01 c3 48 8b 0d 81 48 2d 00 f7 d8 64 89 01 48
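
    The gist of the fix, sketched against vaddr_get_pfn() (variable names
    assumed from the surrounding code):

        vma = find_vma_intersection(mm, vaddr, vaddr + 1);

        if (vma && vma->vm_flags & VM_PFNMAP) {
            /* Before (broken): assumed vm_pgoff holds the base PFN:
             *   *pfn = ((vaddr - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
             * After: ask the mm layer for the real PFN. */
            if (!follow_pfn(vma, vaddr, pfn) &&
                is_invalid_reserved_pfn(*pfn))
                ret = 0;
        }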

    Fixes: 73fa0d10d077 ("vfio: Type1 IOMMU implementation")
    Signed-off-by: Sean Christopherson
    Signed-off-by: Alex Williamson
    Signed-off-by: Greg Kroah-Hartman

    Sean Christopherson
     
  • commit 0ea971f8dcd6dee78a9a30ea70227cf305f11ff7 upstream.

    Add parentheses to avoid a possible vaddr overflow.
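
    The shape of the fix (variable names assumed from the pin-pages path):

        /* Before: dma->vaddr + iova can overflow before dma->iova is
         * subtracted, even though the final result is in range. */
        remote_vaddr = dma->vaddr + iova - dma->iova;

        /* After: parenthesize so the in-range offset is computed first. */
        remote_vaddr = dma->vaddr + (iova - dma->iova);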

    Fixes: a54eb55045ae ("vfio iommu type1: Add support for mediated devices")
    Signed-off-by: Yan Zhao
    Signed-off-by: Alex Williamson
    Signed-off-by: Greg Kroah-Hartman

    Yan Zhao
     

16 Oct, 2019

1 commit

  • After enabling CONFIG_IOMMU_DMA on X86 a new warning appears when
    compiling vfio:

    drivers/vfio/vfio_iommu_type1.c: In function ‘vfio_iommu_type1_attach_group’:
    drivers/vfio/vfio_iommu_type1.c:1827:7: warning: ‘resv_msi_base’ may be used uninitialized in this function [-Wmaybe-uninitialized]
    ret = iommu_get_msi_cookie(domain->domain, resv_msi_base);
    ~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

    The warning is a false positive, because the call to iommu_get_msi_cookie()
    only happens when vfio_iommu_has_sw_msi() returned true. And that only
    happens when it also set resv_msi_base.

    But initialize the variable anyway to get rid of the warning.
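
    The fix is the one-line initialization, roughly:

        phys_addr_t resv_msi_base = 0; /* silences -Wmaybe-uninitialized */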

    Signed-off-by: Joerg Roedel
    Reviewed-by: Cornelia Huck
    Reviewed-by: Eric Auger
    Signed-off-by: Alex Williamson

    Joerg Roedel
     

26 Sep, 2019

1 commit

    This patch is a part of a series that extends the kernel ABI to allow
    passing tagged user pointers (with the top byte set to something other
    than 0x00) as syscall arguments.

    vaddr_get_pfn() uses provided user pointers for vma lookups, which can
    only be done with untagged pointers.

    Untag user pointers in this function.
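
    Sketched against vaddr_get_pfn(), with surrounding lines for context:

        /* Strip any tag bits before the pointer is used for a vma lookup. */
        vaddr = untagged_addr(vaddr);

        down_read(&mm->mmap_sem);
        vma = find_vma_intersection(mm, vaddr, vaddr + 1);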

    Link: http://lkml.kernel.org/r/87422b4d72116a975896f2b19b00f38acbd28f33.1563904656.git.andreyknvl@google.com
    Signed-off-by: Andrey Konovalov
    Reviewed-by: Eric Auger
    Reviewed-by: Vincenzo Frascino
    Reviewed-by: Catalin Marinas
    Reviewed-by: Kees Cook
    Cc: Dave Hansen
    Cc: Will Deacon
    Cc: Al Viro
    Cc: Felix Kuehling
    Cc: Jens Wiklander
    Cc: Khalid Aziz
    Cc: Mauro Carvalho Chehab
    Cc: Mike Rapoport
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Konovalov
     

21 Sep, 2019

1 commit

  • Pull VFIO updates from Alex Williamson:

    - Fix spapr iommu error case (Alexey Kardashevskiy)

    - Consolidate region type definitions (Cornelia Huck)

    - Restore saved original PCI state on release (hexin)

    - Simplify mtty sample driver interrupt path (Parav Pandit)

    - Support for reporting valid IOVA regions to user (Shameer Kolothum)

    * tag 'vfio-v5.4-rc1' of git://github.com/awilliam/linux-vfio:
    vfio_pci: Restore original state on release
    vfio/type1: remove duplicate retrieval of reserved regions
    vfio/type1: Add IOVA range capability support
    vfio/type1: check dma map request is within a valid iova range
    vfio/spapr_tce: Fix incorrect tce_iommu_group memory free
    vfio-mdev/mtty: Simplify interrupt generation
    vfio: re-arrange vfio region definitions
    vfio/type1: Update iova list on detach
    vfio/type1: Check reserved region conflict and update iova list
    vfio/type1: Introduce iova list and add iommu aperture validity check

    Linus Torvalds
     

20 Aug, 2019

6 commits


24 Jul, 2019

2 commits

  • To permit batching of TLB flushes across multiple calls to the IOMMU
    driver's ->unmap() implementation, introduce a new structure for
    tracking the address range to be flushed and the granularity at which
    the flushing is required.

    This is hooked into the IOMMU API and its callers are updated to make use
    of the new structure. Subsequent patches will plumb this into the IOMMU
    drivers as well, but for now the gathered information is ignored.
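
    The tracking structure introduced here is essentially:

        struct iommu_iotlb_gather {
            unsigned long start;
            unsigned long end;
            size_t        pgsize; /* page size used by the queued unmaps */
        };

        static inline void iommu_iotlb_gather_init(struct iommu_iotlb_gather *gather)
        {
            *gather = (struct iommu_iotlb_gather) {
                .start = ULONG_MAX,
            };
        }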

    Signed-off-by: Will Deacon

    Will Deacon
     
  • Commit add02cfdc9bc ("iommu: Introduce Interface for IOMMU TLB Flushing")
    added three new TLB flushing operations to the IOMMU API so that the
    underlying driver operations can be batched when unmapping large regions
    of IO virtual address space.

    However, the ->iotlb_range_add() callback has not been implemented by
    any IOMMU drivers (amd_iommu.c implements it as an empty function, which
    incurs the overhead of an indirect branch). Instead, drivers either flush
    the entire IOTLB in the ->iotlb_sync() callback or perform the necessary
    invalidation during ->unmap().

    Attempting to implement ->iotlb_range_add() for arm-smmu-v3.c revealed
    two major issues:

    1. The page size used to map the region in the page-table is not known,
    and so it is not generally possible to issue TLB flushes in the most
    efficient manner.

    2. The only mutable state passed to the callback is a pointer to the
    iommu_domain, which can be accessed concurrently and therefore
    requires expensive synchronisation to keep track of the outstanding
    flushes.

    Remove the callback entirely in preparation for extending ->unmap() and
    ->iotlb_sync() to update a token on the caller's stack.

    Signed-off-by: Will Deacon

    Will Deacon
     

17 Jul, 2019

1 commit

  • locked_vm accounting is done roughly the same way in five places, so
    unify them in a helper.

    Include the helper's caller in the debug print to distinguish between
    callsites.

    Error codes stay the same, so user-visible behavior does too. The one
    exception is that the -EPERM case in tce_account_locked_vm is removed
    because Alexey has never seen it triggered.
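
    A condensed sketch of the resulting helper (the upstream version also
    warns on underflow and emits the debug print mentioned above; the caller
    holds mmap_sem for write):

        int __account_locked_vm(struct mm_struct *mm, unsigned long pages,
                                bool inc, struct task_struct *task,
                                bool bypass_rlim)
        {
            unsigned long locked_vm, limit;
            int ret = 0;

            locked_vm = mm->locked_vm;
            if (inc) {
                if (!bypass_rlim) {
                    limit = task_rlimit(task, RLIMIT_MEMLOCK) >> PAGE_SHIFT;
                    if (locked_vm + pages > limit)
                        ret = -ENOMEM;
                }
                if (!ret)
                    mm->locked_vm = locked_vm + pages;
            } else {
                mm->locked_vm = locked_vm - pages;
            }
            return ret;
        }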

    [daniel.m.jordan@oracle.com: v3]
    Link: http://lkml.kernel.org/r/20190529205019.20927-1-daniel.m.jordan@oracle.com
    [sfr@canb.auug.org.au: fix mm/util.c]
    Link: http://lkml.kernel.org/r/20190524175045.26897-1-daniel.m.jordan@oracle.com
    Signed-off-by: Daniel Jordan
    Signed-off-by: Stephen Rothwell
    Tested-by: Alexey Kardashevskiy
    Acked-by: Alex Williamson
    Cc: Alan Tull
    Cc: Alex Williamson
    Cc: Benjamin Herrenschmidt
    Cc: Christoph Lameter
    Cc: Christophe Leroy
    Cc: Davidlohr Bueso
    Cc: Jason Gunthorpe
    Cc: Mark Rutland
    Cc: Michael Ellerman
    Cc: Moritz Fischer
    Cc: Paul Mackerras
    Cc: Steve Sistare
    Cc: Wu Hao
    Cc: Ira Weiny
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daniel Jordan
     

19 Jun, 2019

1 commit

  • Based on 2 normalized pattern(s):

    this program is free software you can redistribute it and or modify
    it under the terms of the gnu general public license version 2 as
    published by the free software foundation

    this program is free software you can redistribute it and or modify
    it under the terms of the gnu general public license version 2 as
    published by the free software foundation #

    extracted by the scancode license scanner the SPDX license identifier

    GPL-2.0-only

    has been chosen to replace the boilerplate/reference in 4122 file(s).

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Enrico Weigelt
    Reviewed-by: Kate Stewart
    Reviewed-by: Allison Randal
    Cc: linux-spdx@vger.kernel.org
    Link: https://lkml.kernel.org/r/20190604081206.933168790@linutronix.de
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     

15 May, 2019

1 commit

    Patch series "Add FOLL_LONGTERM to GUP fast and use it".

    HFI1, qib, and mthca, use get_user_pages_fast() due to its performance
    advantages. These pages can be held for a significant time. But
    get_user_pages_fast() does not protect against mapping FS DAX pages.

    Introduce FOLL_LONGTERM and use this flag in get_user_pages_fast() which
    retains the performance while also adding the FS DAX checks. XDP has also
    shown interest in using this functionality.[1]

    In addition we change get_user_pages() to use the new FOLL_LONGTERM flag
    and remove the specialized get_user_pages_longterm call.

    [1] https://lkml.org/lkml/2019/3/19/939

    "longterm" is a relative thing and at this point is probably a misnomer.
    This is really flagging a pin which is going to be given to hardware and
    can't move. I've thought of a couple of alternative names but I think we
    have to settle on if we are going to use FL_LAYOUT or something else to
    solve the "longterm" problem. Then I think we can change the flag to a
    better name.

    Secondly, it depends on how often you are registering memory. I have
    spoken with some RDMA users who consider MR in the performance path...
    For the overall application performance. I don't have the numbers as the
    tests for HFI1 were done a long time ago. But there was a significant
    advantage. Some of which is probably due to the fact that you don't have
    to hold mmap_sem.

    Finally, architecturally I think it would be good for everyone to use
    *_fast. There are patches submitted to the RDMA list which would allow
    the use of *_fast (they reworking the use of mmap_sem) and as soon as they
    are accepted I'll submit a patch to convert the RDMA core as well. Also
    to this point others are looking to use *_fast.

    As an aside, Jasons pointed out in my previous submission that *_fast and
    *_unlocked look very much the same. I agree and I think further cleanup
    will be coming. But I'm focused on getting the final solution for DAX at
    the moment.

    This patch (of 7):

    This patch starts a series which aims to support FOLL_LONGTERM in
    get_user_pages_fast(). Some callers would like to do a longterm (user
    controlled) pin of pages with the fast variant of GUP for performance
    purposes.

    Rather than have a separate get_user_pages_longterm() call, introduce
    FOLL_LONGTERM and change the longterm callers to use it.

    This patch does not change any functionality. In the short term
    "longterm" or user controlled pins are unsafe for Filesystems and FS DAX
    in particular has been blocked. However, callers of get_user_pages_fast()
    were not "protected".

    FOLL_LONGTERM can _only_ be supported with get_user_pages[_fast]() as it
    requires vmas to determine if DAX is in use.

    NOTE: In merging with the CMA changes we opt to change the
    get_user_pages() call in check_and_migrate_cma_pages() to a call of
    __get_user_pages_locked() on the newly migrated pages. This makes the
    code read better in that we are calling __get_user_pages_locked() on the
    pages before and after a potential migration.

    As a side effect some of the interfaces are cleaned up but this is not the
    primary purpose of the series.

    In review[1] it was asked:

    > This I don't get - if you do lock down long term mappings performance
    > of the actual get_user_pages call shouldn't matter to start with.
    >
    > What do I miss?

    A couple of points.

    First "longterm" is a relative thing and at this point is probably a
    misnomer. This is really flagging a pin which is going to be given to
    hardware and can't move. I've thought of a couple of alternative names
    but I think we have to settle on if we are going to use FL_LAYOUT or
    something else to solve the "longterm" problem. Then I think we can
    change the flag to a better name.

    Second, it depends on how often you are registering memory. I have spoken
    with some RDMA users who consider MR in the performance path... For the
    overall application performance. I don't have the numbers as the tests
    for HFI1 were done a long time ago. But there was a significant
    advantage. Some of which is probably due to the fact that you don't have
    to hold mmap_sem.

    Finally, architecturally I think it would be good for everyone to use
    *_fast. There are patches submitted to the RDMA list which would allow
    the use of *_fast (they rework the use of mmap_sem) and as soon as they
    are accepted I'll submit a patch to convert the RDMA core as well. Also
    to this point others are looking to use *_fast.

    As an aside, Jason pointed out in my previous submission that *_fast and
    *_unlocked look very much the same. I agree and I think further cleanup
    will be coming. But I'm focused on getting the final solution for DAX at
    the moment.

    [1] https://lore.kernel.org/lkml/20190220180255.GA12020@iweiny-DESK2.sc.intel.com/T/#md6abad2569f3bf6c1f03686c8097ab6563e94965

    [ira.weiny@intel.com: v3]
    Link: http://lkml.kernel.org/r/20190328084422.29911-2-ira.weiny@intel.com
    Link: http://lkml.kernel.org/r/20190317183438.2057-2-ira.weiny@intel.com
    Signed-off-by: Ira Weiny
    Reviewed-by: Andrew Morton
    Cc: Aneesh Kumar K.V
    Cc: Michal Hocko
    Cc: John Hubbard
    Cc: "Kirill A. Shutemov"
    Cc: Peter Zijlstra
    Cc: Jason Gunthorpe
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: "David S. Miller"
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Cc: Rich Felker
    Cc: Yoshinori Sato
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: Borislav Petkov
    Cc: Ralf Baechle
    Cc: James Hogan
    Cc: Dan Williams
    Cc: Mike Marshall
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ira Weiny
     

12 Apr, 2019

2 commits

    This adds support to determine the isolation type of a mediated
    device group by checking whether it has an iommu device. If an iommu
    device exists, an iommu domain will be allocated and then attached to
    the iommu device. Otherwise, the behavior stays the same as before.

    Cc: Ashok Raj
    Cc: Jacob Pan
    Cc: Kevin Tian
    Signed-off-by: Sanjay Kumar
    Signed-off-by: Liu Yi L
    Signed-off-by: Lu Baolu
    Reviewed-by: Jean-Philippe Brucker
    Reviewed-by: Kirti Wankhede
    Acked-by: Alex Williamson
    Signed-off-by: Joerg Roedel

    Lu Baolu
     
    This adds helpers to attach a domain to, or detach it from, a group.
    They will replace iommu_attach_group(), which only works for non-mdev
    devices.

    If a domain is being attached to a group which includes mediated
    devices, it should be attached to the iommu device (a pci device which
    represents the mdev in iommu scope) instead. The added helpers support
    attaching a domain to groups of both pci and mdev devices.

    Cc: Ashok Raj
    Cc: Jacob Pan
    Cc: Kevin Tian
    Signed-off-by: Sanjay Kumar
    Signed-off-by: Liu Yi L
    Signed-off-by: Lu Baolu
    Reviewed-by: Jean-Philippe Brucker
    Acked-by: Alex Williamson
    Signed-off-by: Joerg Roedel

    Lu Baolu
     

04 Apr, 2019

1 commit

  • Memory backed DMA mappings are accounted against a user's locked
    memory limit, including multiple mappings of the same memory. This
    accounting bounds the number of such mappings that a user can create.
    However, DMA mappings that are not backed by memory, such as DMA
    mappings of device MMIO via mmaps, do not make use of page pinning
    and therefore do not count against the user's locked memory limit.
    These mappings still consume memory, but the memory is not well
    associated to the process for the purpose of oom killing a task.

    To add bounding on this use case, we introduce a limit to the total
    number of concurrent DMA mappings that a user is allowed to create.
    This limit is exposed as a tunable module option where the default
    value of 64K is expected to be well in excess of any reasonable use
    case (a large virtual machine configuration would typically only make
    use of tens of concurrent mappings).

    This fixes CVE-2019-3882.
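
    The plumbing, roughly (the upstream default is U16_MAX, i.e. 64K):

        static unsigned int dma_entry_limit __read_mostly = U16_MAX;
        module_param_named(dma_entry_limit, dma_entry_limit, uint, 0644);
        MODULE_PARM_DESC(dma_entry_limit,
                         "Maximum number of user DMA mappings per container (65535)");

        /* In vfio_dma_do_map(), under iommu->lock: */
        if (!iommu->dma_avail)
            return -ENOSPC;

        /* decremented per mapping, incremented again in vfio_remove_dma() */
        iommu->dma_avail--;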

    Reviewed-by: Eric Auger
    Tested-by: Eric Auger
    Reviewed-by: Peter Xu
    Reviewed-by: Cornelia Huck
    Signed-off-by: Alex Williamson

    Alex Williamson
     

09 Jan, 2019

1 commit

  • The below referenced commit adds a test for integer overflow, but in
    doing so prevents the unmap ioctl from ever including the last page of
    the address space. Subtract one to compare to the last address of the
    unmap to avoid the overflow and wrap-around.
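
    The off-by-one, in code form:

        /* Before: a valid unmap ending exactly at the top of the address
         * space wraps to zero and is rejected. */
        if (unmap->iova + unmap->size < unmap->iova || unmap->size > SIZE_MAX)
            return -EINVAL;

        /* After: compare against the last address covered by the unmap. */
        if (unmap->iova + unmap->size - 1 < unmap->iova ||
            unmap->size > SIZE_MAX)
            return -EINVAL;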

    Fixes: 71a7d3d78e3c ("vfio/type1: silence integer overflow warning")
    Link: https://bugzilla.redhat.com/show_bug.cgi?id=1662291
    Cc: stable@vger.kernel.org # v4.15+
    Reported-by: Pei Zhang
    Debugged-by: Peter Xu
    Reviewed-by: Dan Carpenter
    Reviewed-by: Peter Xu
    Tested-by: Peter Xu
    Reviewed-by: Cornelia Huck
    Signed-off-by: Alex Williamson

    Alex Williamson
     

15 Nov, 2018

1 commit


07 Aug, 2018

1 commit


01 Jul, 2018

1 commit

  • The patch noted in the fixes below converted get_user_pages_fast() to
    get_user_pages_longterm(), however the two calls differ in a few ways.

    First, _fast() is documented to not require the mmap_sem, while
    _longterm() is documented to need it. Hold the mmap_sem as required.

    Second, _fast accepts an 'int write' while _longterm uses 'unsigned int
    gup_flags', so the expression '!!(prot & IOMMU_WRITE)' is only working by
    luck as FOLL_WRITE is currently == 0x1. Use the expected FOLL_WRITE
    constant instead.
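
    A sketch of the corrected call site:

        unsigned int flags = 0;

        if (prot & IOMMU_WRITE)
            flags |= FOLL_WRITE; /* rather than relying on FOLL_WRITE == 0x1 */

        /* _longterm() requires the mmap_sem, unlike _fast(). */
        down_read(&current->mm->mmap_sem);
        ret = get_user_pages_longterm(vaddr, 1, flags, page, NULL);
        up_read(&current->mm->mmap_sem);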

    Fixes: 94db151dc892 ("vfio: disable filesystem-dax page pinning")
    Cc:
    Signed-off-by: Jason Gunthorpe
    Acked-by: Dan Williams
    Signed-off-by: Alex Williamson

    Jason Gunthorpe
     

09 Jun, 2018

1 commit

  • MAP_DMA ioctls might be called from various threads within a process,
    for example when using QEMU, the vCPU threads are often generating
    these calls and we therefore take a reference to that vCPU task.
    However, QEMU also supports vCPU hotplug on some machines and the task
    that called MAP_DMA may have exited by the time UNMAP_DMA is called,
    resulting in the mm_struct pointer being NULL and thus a failure to
    match against the existing mapping.

    To resolve this, we instead take a reference to the thread
    group_leader, which has the same mm_struct and resource limits, but
    is less likely to exit, at least in the QEMU case. A difficulty here is
    guaranteeing that the capabilities of the group_leader match that of
    the calling thread, which we resolve by tracking CAP_IPC_LOCK at the
    time of calling rather than at an indeterminate time in the future.
    Potentially this also results in better efficiency as this is now
    recorded once per MAP_DMA ioctl.
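
    In code, the change at MAP_DMA time is roughly:

        /* The group leader shares the mm and rlimits of the calling vCPU
         * thread but is far less likely to exit before UNMAP_DMA. */
        get_task_struct(current->group_leader);
        dma->task = current->group_leader;

        /* Capture the capability once, at mapping time. */
        dma->lock_cap = capable(CAP_IPC_LOCK);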

    Reported-by: Xu Yandong
    Signed-off-by: Alex Williamson

    Alex Williamson
     

02 Jun, 2018

1 commit

    Bisection by Amadeusz Sławiński implicates this commit as the cause of
    bad page state issues after VM shutdown, likely due to unbalanced page
    references. The original commit was intended only as a performance
    improvement, therefore revert for offline rework.

    Link: https://lkml.org/lkml/2018/6/2/97
    Fixes: 356e88ebe447 ("vfio/type1: Improve memory pinning process for raw PFN mapping")
    Cc: Jason Cai (Xiang Feng)
    Reported-by: Amadeusz Sławiński
    Signed-off-by: Alex Williamson

    Alex Williamson
     

23 Mar, 2018

1 commit

  • When using vfio to pass through a PCIe device (e.g. a GPU card) that
    has a huge BAR (e.g. 16GB), a lot of cycles are wasted on memory
    pinning because PFNs of PCI BAR are not backed by struct page, and
    the corresponding VMA has flag VM_PFNMAP.

    With this change, when pinning a region which is a raw PFN mapping,
    it can skip unnecessary user memory pinning process, and thus, can
    significantly improve VM's boot up time when passing through devices
    via VFIO. In my test on a Xeon E5 2.6GHz, the time mapping a 16GB
    BAR was reduced from about 0.4s to 1.5us.
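
    The idea, sketched against the pinning loop (this is the change that
    was later reverted; see the 02 Jun, 2018 entry above):

        for (i = 1, vaddr += PAGE_SIZE; i < npage; i++, vaddr += PAGE_SIZE) {
            if (rsvd) {
                /* VM_PFNMAP: PFNs are linear, skip per-page pinning. */
                pfn = pfn_base + i;
            } else if (vaddr_get_pfn(mm, vaddr, prot, &pfn)) {
                break;
            }
            /* contiguity and accounting checks elided */
        }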

    Signed-off-by: Jason Cai (Xiang Feng)
    Signed-off-by: Alex Williamson

    Jason Cai (Xiang Feng)
     

22 Mar, 2018

1 commit

    VFIO IOMMU type1 currently unmaps IOVA pages synchronously, which
    requires an IOTLB flush for every unmapping. This results in large
    IOTLB flushing overhead when a pass-through device has a large number
    of mapped IOVAs. This can be avoided by using the new IOTLB flushing
    interface.

    Cc: Alex Williamson
    Cc: Joerg Roedel
    Signed-off-by: Suravee Suthikulpanit
    [aw - use LIST_HEAD]
    Signed-off-by: Alex Williamson

    Suravee Suthikulpanit
     

03 Mar, 2018

1 commit

  • Filesystem-DAX is incompatible with 'longterm' page pinning. Without
    page cache indirection a DAX mapping maps filesystem blocks directly.
    This means that the filesystem must not modify a file's block map while
    any page in a mapping is pinned. In order to prevent userspace from
    holding up filesystem operations indefinitely, disallow 'longterm'
    Filesystem-DAX mappings.

    RDMA has the same conflict and the plan there is to add a 'with lease'
    mechanism to allow the kernel to notify userspace that the mapping is
    being torn down for block-map maintenance. Perhaps something similar can
    be put in place for vfio.

    Note that xfs and ext4 still report:

    "DAX enabled. Warning: EXPERIMENTAL, use at your own risk"

    ...at mount time, and resolving the dax-dma-vs-truncate problem is one
    of the last hurdles to remove that designation.

    Acked-by: Alex Williamson
    Cc: Michal Hocko
    Cc: kvm@vger.kernel.org
    Cc:
    Reported-by: Haozhong Zhang
    Tested-by: Haozhong Zhang
    Fixes: d475c6346a38 ("dax,ext2: replace XIP read and write with DAX I/O")
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Dan Williams

    Dan Williams
     

21 Oct, 2017

1 commit

  • I get a static checker warning about the potential integer overflow if
    we add "unmap->iova + unmap->size". The integer overflow isn't really
    harmful, but we may as well fix it. Also unmap->size gets truncated to
    size_t when we pass it to vfio_find_dma() so we could check for too high
    values of that as well.

    Signed-off-by: Dan Carpenter
    Signed-off-by: Alex Williamson

    Dan Carpenter
     

11 Aug, 2017

2 commits

  • If the IOMMU driver advertises 'real' reserved regions for MSIs, but
    still includes the software-managed region as well, we are currently
    blind to the former and will configure the IOMMU domain to map MSIs into
    the latter, which is unlikely to work as expected.

    Since it would take a ridiculous hardware topology for both regions to
    be valid (which would be rather difficult to support in general), we
    should be safe to assume that the presence of any hardware regions makes
    the software region irrelevant. However, the IOMMU driver might still
    advertise the software region by default, particularly if the hardware
    regions are filled in elsewhere by generic code, so it might not be fair
    for VFIO to be super-strict about not mixing them. To that end, make
    vfio_iommu_has_sw_msi() robust against the presence of both region types
    at once, so that we end up doing what is almost certainly right, rather
    than what is almost certainly wrong.
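
    The resulting logic is essentially:

        static bool vfio_iommu_has_sw_msi(struct list_head *group_resv_regions,
                                          phys_addr_t *base)
        {
            struct iommu_resv_region *region;
            bool ret = false;

            list_for_each_entry(region, group_resv_regions, list) {
                /* Any 'real' MSI region wins over the SW-managed one. */
                if (region->type == IOMMU_RESV_MSI) {
                    ret = false;
                    break;
                }
                if (region->type == IOMMU_RESV_SW_MSI) {
                    *base = region->start;
                    ret = true;
                }
            }
            return ret;
        }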

    Signed-off-by: Robin Murphy
    Tested-by: Shameer Kolothum
    Reviewed-by: Eric Auger
    Signed-off-by: Alex Williamson

    Robin Murphy
     
  • For ARM-based systems with a GICv3 ITS to provide interrupt isolation,
    but hardware limitations which are worked around by having MSIs bypass
    SMMU translation (e.g. HiSilicon Hip06/Hip07), VFIO neglects to check
    for the IRQ_DOMAIN_FLAG_MSI_REMAP capability, (and thus erroneously
    demands unsafe_interrupts) if a software-managed MSI region is absent.

    Fix this by always checking for isolation capability at both the IRQ
    domain and IOMMU domain levels, rather than predicating that on whether
    MSIs require an IOMMU mapping (which was always slightly tenuous logic).

    Signed-off-by: Robin Murphy
    Tested-by: Shameer Kolothum
    Reviewed-by: Eric Auger
    Signed-off-by: Alex Williamson

    Robin Murphy
     

19 Apr, 2017

2 commits

  • vfio_pin_pages_remote() is typically called to iterate over a range
    of memory. Testing CAP_IPC_LOCK is relatively expensive, so it makes
    sense to push it up to the caller, which can then repeatedly call
    vfio_pin_pages_remote() using that value. This can show nearly a 20%
    improvement on the worst case path through VFIO_IOMMU_MAP_DMA with
    contiguous page mapping disabled. Testing RLIMIT_MEMLOCK is much more
    lightweight, but we bring it along on the same principle and it does
    seem to show a marginal improvement.

    Reviewed-by: Peter Xu
    Reviewed-by: Kirti Wankhede
    Signed-off-by: Alex Williamson

    Alex Williamson
     
  • With vfio_lock_acct() testing the locked memory limit under mmap_sem,
    it's redundant to do it here for a single page. We can also reorder
    our tests such that we can avoid testing for reserved pages if we're
    not doing accounting and let vfio_lock_acct() test the process
    CAP_IPC_LOCK. Finally, this function oddly returns 1 on success.
    Update to return zero on success, -errno on error. Since the function
    only pins a single page, there's no need to return the number of pages
    pinned.

    N.B. vfio_pin_pages_remote() can pin a large contiguous range of pages
    before calling vfio_lock_acct(). If we were to similarly remove the
    extra test there, a user could temporarily pin far more pages than
    they're allowed.

    Suggested-by: Kirti Wankhede
    Suggested-by: Eric Auger
    Reviewed-by: Kirti Wankhede
    Reviewed-by: Peter Xu
    Signed-off-by: Alex Williamson

    Alex Williamson
     

14 Apr, 2017

1 commit

    If the mmap_sem is contended then the vfio type1 IOMMU backend will
    defer locked page accounting updates to a workqueue task. This has a
    few problems and depending on which side the user tries to play, they
    might be over-penalized for unmaps that haven't yet been accounted or
    race the workqueue to enter more mappings than they're allowed. The
    original intent of this workqueue mechanism seems to be focused on
    reducing latency through the ioctl, but we cannot do so at the cost
    of correctness. Remove this workqueue mechanism and update the
    callers to allow for failure. We can also now recheck the limit under
    write lock to make sure we don't exceed it.

    vfio_pin_pages_remote() also now necessarily includes an unwind path
    which we can jump to directly if the consecutive page pinning finds
    that we're exceeding the user's memory limits. This avoids the
    current lazy approach which does accounting and mapping up to the
    fault, only to return an error on the next iteration to unwind the
    entire vfio_dma.
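
    The accounting then happens synchronously, roughly:

        down_write(&mm->mmap_sem);
        if (npage > 0 && !lock_cap) {
            unsigned long limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;

            /* Recheck under the write lock; fail instead of deferring. */
            if (mm->locked_vm + npage > limit)
                ret = -ENOMEM;
        }
        if (!ret)
            mm->locked_vm += npage;
        up_write(&mm->mmap_sem);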

    Cc: stable@vger.kernel.org
    Reviewed-by: Peter Xu
    Reviewed-by: Kirti Wankhede
    Signed-off-by: Alex Williamson

    Alex Williamson
     

22 Mar, 2017

1 commit

  • The introduction of reserved regions has left a couple of rough edges
    which we could do with sorting out sooner rather than later. Since we
    are not yet addressing the potential dynamic aspect of software-managed
    reservations and presenting them at arbitrary fixed addresses, it is
    incongruous that we end up displaying hardware vs. software-managed MSI
    regions to userspace differently, especially since ARM-based systems may
    actually require one or the other, or even potentially both at once,
    (which iommu-dma currently has no hope of dealing with at all). Let's
    resolve the former user-visible inconsistency ASAP before the ABI has
    been baked into a kernel release, in a way that also lays the groundwork
    for the latter shortcoming to be addressed by follow-up patches.

    For clarity, rename the software-managed type to IOMMU_RESV_SW_MSI, use
    IOMMU_RESV_MSI to describe the hardware type, and document everything a
    little bit. Since the x86 MSI remapping hardware falls squarely under
    this meaning of IOMMU_RESV_MSI, apply that type to their regions as well,
    so that we tell the same story to userspace across all platforms.

    Secondly, as the various region types require quite different handling,
    and it really makes little sense to ever try combining them, convert the
    bitfield-esque #defines to a plain enum in the process before anyone
    gets the wrong impression.
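
    The region types after the conversion look like this:

        enum iommu_resv_type {
            IOMMU_RESV_DIRECT,   /* memory that must be mapped 1:1 */
            IOMMU_RESV_RESERVED, /* arbitrary "never map this" regions */
            IOMMU_RESV_MSI,      /* hardware MSI doorbell region */
            IOMMU_RESV_SW_MSI,   /* software-managed MSI window */
        };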

    Fixes: d30ddcaa7b02 ("iommu: Add a new type field in iommu_resv_region")
    Reviewed-by: Eric Auger
    CC: Alex Williamson
    CC: David Woodhouse
    CC: kvm@vger.kernel.org
    Signed-off-by: Robin Murphy
    Signed-off-by: Joerg Roedel

    Robin Murphy
     

02 Mar, 2017

2 commits

    We are going to split <linux/sched/signal.h> out of <linux/sched.h>,
    which will have to be picked up from other headers and a couple of .c
    files.

    Create a trivial placeholder <linux/sched/signal.h> file that just
    maps to <linux/sched.h> to make this patch obviously correct and
    bisectable.

    Include the new header in the files that are going to need it.

    Acked-by: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Ingo Molnar
     
    We are going to split <linux/sched/mm.h> out of <linux/sched.h>, which
    will have to be picked up from other headers and a couple of .c files.

    Create a trivial placeholder <linux/sched/mm.h> file that just maps to
    <linux/sched.h> to make this patch obviously correct and bisectable.

    The APIs that are going to be moved first are:

    mm_alloc()
    __mmdrop()
    mmdrop()
    mmdrop_async_fn()
    mmdrop_async()
    mmget_not_zero()
    mmput()
    mmput_async()
    get_task_mm()
    mm_access()
    mm_release()

    Include the new header in the files that are going to need it.

    Acked-by: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Ingo Molnar
     

10 Feb, 2017

1 commit