10 Sep, 2020
1 commit
-
commit 41311242221e3482b20bfed10fa4d9db98d87016 upstream.
With conversion to follow_pfn(), DMA mapping a PFNMAP range depends on
the range being faulted into the vma. Add support to manually provide
that, in the same way as done on KVM with hva_to_pfn_remapped().Reviewed-by: Peter Xu
Signed-off-by: Alex Williamson
Signed-off-by: Ajay Kaher
Signed-off-by: Sasha Levin
26 Aug, 2020
1 commit
-
[ Upstream commit aae7a75a821a793ed6b8ad502a5890fb8e8f172d ]
The vfio_iommu_replay() function does not currently unwind on error,
yet it does pin pages, perform IOMMU mapping, and modify the vfio_dma
structure to indicate IOMMU mapping. The IOMMU mappings are torn down
when the domain is destroyed, but the other actions go on to cause
trouble later. For example, the iommu->domain_list can be empty if we
only have a non-IOMMU backed mdev attached. We don't currently check
if the list is empty before getting the first entry in the list, which
leads to a bogus domain pointer. If a vfio_dma entry is erroneously
marked as iommu_mapped, we'll attempt to use that bogus pointer to
retrieve the existing physical page addresses.This is the scenario that uncovered this issue, attempting to hot-add
a vfio-pci device to a container with an existing mdev device and DMA
mappings, one of which could not be pinned, causing a failure adding
the new group to the existing container and setting the conditions
for a subsequent attempt to explode.To resolve this, we can first check if the domain_list is empty so
that we can reject replay of a bogus domain, should we ever encounter
this inconsistent state again in the future. The real fix though is
to add the necessary unwind support, which means cleaning up the
current pinning if an IOMMU mapping fails, then walking back through
the r-b tree of DMA entries, reading from the IOMMU which ranges are
mapped, and unmapping and unpinning those ranges. To be able to do
this, we also defer marking the DMA entry as IOMMU mapped until all
entries are processed, in order to allow the unwind to know the
disposition of each entry.Fixes: a54eb55045ae ("vfio iommu type1: Add support for mediated devices")
Reported-by: Zhiyi Guo
Tested-by: Zhiyi Guo
Reviewed-by: Cornelia Huck
Signed-off-by: Alex Williamson
Signed-off-by: Sasha Levin
06 May, 2020
2 commits
-
commit 5cbf3264bc715e9eb384e2b68601f8c02bb9a61d upstream.
Use follow_pfn() to get the PFN of a PFNMAP VMA instead of assuming that
vma->vm_pgoff holds the base PFN of the VMA. This fixes a bug where
attempting to do VFIO_IOMMU_MAP_DMA on an arbitrary PFNMAP'd region of
memory calculates garbage for the PFN.Hilariously, this only got detected because the first "PFN" calculated
by vaddr_get_pfn() is PFN 0 (vma->vm_pgoff==0), and iommu_iova_to_phys()
uses PA==0 as an error, which triggers a WARN in vfio_unmap_unpin()
because the translation "failed". PFN 0 is now unconditionally reserved
on x86 in order to mitigate L1TF, which causes is_invalid_reserved_pfn()
to return true and in turns results in vaddr_get_pfn() returning success
for PFN 0. Eventually the bogus calculation runs into PFNs that aren't
reserved and leads to failure in vfio_pin_map_dma(). The subsequent
call to vfio_remove_dma() attempts to unmap PFN 0 and WARNs.WARNING: CPU: 8 PID: 5130 at drivers/vfio/vfio_iommu_type1.c:750 vfio_unmap_unpin+0x2e1/0x310 [vfio_iommu_type1]
Modules linked in: vfio_pci vfio_virqfd vfio_iommu_type1 vfio ...
CPU: 8 PID: 5130 Comm: sgx Tainted: G W 5.6.0-rc5-705d787c7fee-vfio+ #3
Hardware name: Intel Corporation Mehlow UP Server Platform/Moss Beach Server, BIOS CNLSE2R1.D00.X119.B49.1803010910 03/01/2018
RIP: 0010:vfio_unmap_unpin+0x2e1/0x310 [vfio_iommu_type1]
Code: 0b 49 81 c5 00 10 00 00 e9 c5 fe ff ff bb 00 10 00 00 e9 3d fe
RSP: 0018:ffffbeb5039ebda8 EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffff9a55cbf8d480 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff9a52b771c200
RBP: 0000000000000000 R08: 0000000000000040 R09: 00000000fffffff2
R10: 0000000000000001 R11: ffff9a51fa896000 R12: 0000000184010000
R13: 0000000184000000 R14: 0000000000010000 R15: ffff9a55cb66ea08
FS: 00007f15d3830b40(0000) GS:ffff9a55d5600000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000561cf39429e0 CR3: 000000084f75f005 CR4: 00000000003626e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
vfio_remove_dma+0x17/0x70 [vfio_iommu_type1]
vfio_iommu_type1_ioctl+0x9e3/0xa7b [vfio_iommu_type1]
ksys_ioctl+0x92/0xb0
__x64_sys_ioctl+0x16/0x20
do_syscall_64+0x4c/0x180
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f15d04c75d7
Code: 3d 01 f0 ff ff 73 01 c3 48 8b 0d 81 48 2d 00 f7 d8 64 89 01 48Fixes: 73fa0d10d077 ("vfio: Type1 IOMMU implementation")
Signed-off-by: Sean Christopherson
Signed-off-by: Alex Williamson
Signed-off-by: Greg Kroah-Hartman -
commit 0ea971f8dcd6dee78a9a30ea70227cf305f11ff7 upstream.
add parentheses to avoid possible vaddr overflow.
Fixes: a54eb55045ae ("vfio iommu type1: Add support for mediated devices")
Signed-off-by: Yan Zhao
Signed-off-by: Alex Williamson
Signed-off-by: Greg Kroah-Hartman
16 Oct, 2019
1 commit
-
After enabling CONFIG_IOMMU_DMA on X86 a new warning appears when
compiling vfio:drivers/vfio/vfio_iommu_type1.c: In function ‘vfio_iommu_type1_attach_group’:
drivers/vfio/vfio_iommu_type1.c:1827:7: warning: ‘resv_msi_base’ may be used uninitialized in this function [-Wmaybe-uninitialized]
ret = iommu_get_msi_cookie(domain->domain, resv_msi_base);
~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~The warning is a false positive, because the call to iommu_get_msi_cookie()
only happens when vfio_iommu_has_sw_msi() returned true. And that only
happens when it also set resv_msi_base.But initialize the variable anyway to get rid of the warning.
Signed-off-by: Joerg Roedel
Reviewed-by: Cornelia Huck
Reviewed-by: Eric Auger
Signed-off-by: Alex Williamson
26 Sep, 2019
1 commit
-
This patch is a part of a series that extends kernel ABI to allow to pass
tagged user pointers (with the top byte set to something else other than
0x00) as syscall arguments.vaddr_get_pfn() uses provided user pointers for vma lookups, which can
only by done with untagged pointers.Untag user pointers in this function.
Link: http://lkml.kernel.org/r/87422b4d72116a975896f2b19b00f38acbd28f33.1563904656.git.andreyknvl@google.com
Signed-off-by: Andrey Konovalov
Reviewed-by: Eric Auger
Reviewed-by: Vincenzo Frascino
Reviewed-by: Catalin Marinas
Reviewed-by: Kees Cook
Cc: Dave Hansen
Cc: Will Deacon
Cc: Al Viro
Cc: Felix Kuehling
Cc: Jens Wiklander
Cc: Khalid Aziz
Cc: Mauro Carvalho Chehab
Cc: Mike Rapoport
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
21 Sep, 2019
1 commit
-
Pull VFIO updates from Alex Williamson:
- Fix spapr iommu error case case (Alexey Kardashevskiy)
- Consolidate region type definitions (Cornelia Huck)
- Restore saved original PCI state on release (hexin)
- Simplify mtty sample driver interrupt path (Parav Pandit)
- Support for reporting valid IOVA regions to user (Shameer Kolothum)
* tag 'vfio-v5.4-rc1' of git://github.com/awilliam/linux-vfio:
vfio_pci: Restore original state on release
vfio/type1: remove duplicate retrieval of reserved regions
vfio/type1: Add IOVA range capability support
vfio/type1: check dma map request is within a valid iova range
vfio/spapr_tce: Fix incorrect tce_iommu_group memory free
vfio-mdev/mtty: Simplify interrupt generation
vfio: re-arrange vfio region definitions
vfio/type1: Update iova list on detach
vfio/type1: Check reserved region conflict and update iova list
vfio/type1: Introduce iova list and add iommu aperture validity check
20 Aug, 2019
6 commits
-
As we now already have the reserved regions list, just pass that into
vfio_iommu_has_sw_msi() fn.Signed-off-by: Shameer Kolothum
Reviewed-by: Eric Auger
Signed-off-by: Alex Williamson -
This allows the user-space to retrieve the supported IOVA
range(s), excluding any non-relaxable reserved regions. The
implementation is based on capability chains, added to
VFIO_IOMMU_GET_INFO ioctl.Signed-off-by: Shameer Kolothum
Reviewed-by: Eric Auger
Signed-off-by: Alex Williamson -
This checks and rejects any dma map request outside valid iova
range.Signed-off-by: Shameer Kolothum
Reviewed-by: Eric Auger
Signed-off-by: Alex Williamson -
Get a copy of iova list on _group_detach and try to update the list.
On success replace the current one with the copy. Leave the list as
it is if update fails.Signed-off-by: Shameer Kolothum
Reviewed-by: Eric Auger
Signed-off-by: Alex Williamson -
This retrieves the reserved regions associated with dev group and
checks for conflicts with any existing dma mappings. Also update
the iova list excluding the reserved regions.Reserved regions with type IOMMU_RESV_DIRECT_RELAXABLE are
excluded from above checks as they are considered as directly
mapped regions which are known to be relaxable.Signed-off-by: Shameer Kolothum
Reviewed-by: Eric Auger
Signed-off-by: Alex Williamson -
This introduces an iova list that is valid for dma mappings. Make
sure the new iommu aperture window doesn't conflict with the current
one or with any existing dma mappings during attach.Signed-off-by: Shameer Kolothum
Reviewed-by: Eric Auger
Signed-off-by: Alex Williamson
24 Jul, 2019
2 commits
-
To permit batching of TLB flushes across multiple calls to the IOMMU
driver's ->unmap() implementation, introduce a new structure for
tracking the address range to be flushed and the granularity at which
the flushing is required.This is hooked into the IOMMU API and its caller are updated to make use
of the new structure. Subsequent patches will plumb this into the IOMMU
drivers as well, but for now the gathering information is ignored.Signed-off-by: Will Deacon
-
Commit add02cfdc9bc ("iommu: Introduce Interface for IOMMU TLB Flushing")
added three new TLB flushing operations to the IOMMU API so that the
underlying driver operations can be batched when unmapping large regions
of IO virtual address space.However, the ->iotlb_range_add() callback has not been implemented by
any IOMMU drivers (amd_iommu.c implements it as an empty function, which
incurs the overhead of an indirect branch). Instead, drivers either flush
the entire IOTLB in the ->iotlb_sync() callback or perform the necessary
invalidation during ->unmap().Attempting to implement ->iotlb_range_add() for arm-smmu-v3.c revealed
two major issues:1. The page size used to map the region in the page-table is not known,
and so it is not generally possible to issue TLB flushes in the most
efficient manner.2. The only mutable state passed to the callback is a pointer to the
iommu_domain, which can be accessed concurrently and therefore
requires expensive synchronisation to keep track of the outstanding
flushes.Remove the callback entirely in preparation for extending ->unmap() and
->iotlb_sync() to update a token on the caller's stack.Signed-off-by: Will Deacon
17 Jul, 2019
1 commit
-
locked_vm accounting is done roughly the same way in five places, so
unify them in a helper.Include the helper's caller in the debug print to distinguish between
callsites.Error codes stay the same, so user-visible behavior does too. The one
exception is that the -EPERM case in tce_account_locked_vm is removed
because Alexey has never seen it triggered.[daniel.m.jordan@oracle.com: v3]
Link: http://lkml.kernel.org/r/20190529205019.20927-1-daniel.m.jordan@oracle.com
[sfr@canb.auug.org.au: fix mm/util.c]
Link: http://lkml.kernel.org/r/20190524175045.26897-1-daniel.m.jordan@oracle.com
Signed-off-by: Daniel Jordan
Signed-off-by: Stephen Rothwell
Tested-by: Alexey Kardashevskiy
Acked-by: Alex Williamson
Cc: Alan Tull
Cc: Alex Williamson
Cc: Benjamin Herrenschmidt
Cc: Christoph Lameter
Cc: Christophe Leroy
Cc: Davidlohr Bueso
Cc: Jason Gunthorpe
Cc: Mark Rutland
Cc: Michael Ellerman
Cc: Moritz Fischer
Cc: Paul Mackerras
Cc: Steve Sistare
Cc: Wu Hao
Cc: Ira Weiny
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
19 Jun, 2019
1 commit
-
Based on 2 normalized pattern(s):
this program is free software you can redistribute it and or modify
it under the terms of the gnu general public license version 2 as
published by the free software foundationthis program is free software you can redistribute it and or modify
it under the terms of the gnu general public license version 2 as
published by the free software foundation #extracted by the scancode license scanner the SPDX license identifier
GPL-2.0-only
has been chosen to replace the boilerplate/reference in 4122 file(s).
Signed-off-by: Thomas Gleixner
Reviewed-by: Enrico Weigelt
Reviewed-by: Kate Stewart
Reviewed-by: Allison Randal
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190604081206.933168790@linutronix.de
Signed-off-by: Greg Kroah-Hartman
15 May, 2019
1 commit
-
Pach series "Add FOLL_LONGTERM to GUP fast and use it".
HFI1, qib, and mthca, use get_user_pages_fast() due to its performance
advantages. These pages can be held for a significant time. But
get_user_pages_fast() does not protect against mapping FS DAX pages.Introduce FOLL_LONGTERM and use this flag in get_user_pages_fast() which
retains the performance while also adding the FS DAX checks. XDP has also
shown interest in using this functionality.[1]In addition we change get_user_pages() to use the new FOLL_LONGTERM flag
and remove the specialized get_user_pages_longterm call.[1] https://lkml.org/lkml/2019/3/19/939
"longterm" is a relative thing and at this point is probably a misnomer.
This is really flagging a pin which is going to be given to hardware and
can't move. I've thought of a couple of alternative names but I think we
have to settle on if we are going to use FL_LAYOUT or something else to
solve the "longterm" problem. Then I think we can change the flag to a
better name.Secondly, it depends on how often you are registering memory. I have
spoken with some RDMA users who consider MR in the performance path...
For the overall application performance. I don't have the numbers as the
tests for HFI1 were done a long time ago. But there was a significant
advantage. Some of which is probably due to the fact that you don't have
to hold mmap_sem.Finally, architecturally I think it would be good for everyone to use
*_fast. There are patches submitted to the RDMA list which would allow
the use of *_fast (they reworking the use of mmap_sem) and as soon as they
are accepted I'll submit a patch to convert the RDMA core as well. Also
to this point others are looking to use *_fast.As an aside, Jasons pointed out in my previous submission that *_fast and
*_unlocked look very much the same. I agree and I think further cleanup
will be coming. But I'm focused on getting the final solution for DAX at
the moment.This patch (of 7):
This patch starts a series which aims to support FOLL_LONGTERM in
get_user_pages_fast(). Some callers who would like to do a longterm (user
controlled pin) of pages with the fast variant of GUP for performance
purposes.Rather than have a separate get_user_pages_longterm() call, introduce
FOLL_LONGTERM and change the longterm callers to use it.This patch does not change any functionality. In the short term
"longterm" or user controlled pins are unsafe for Filesystems and FS DAX
in particular has been blocked. However, callers of get_user_pages_fast()
were not "protected".FOLL_LONGTERM can _only_ be supported with get_user_pages[_fast]() as it
requires vmas to determine if DAX is in use.NOTE: In merging with the CMA changes we opt to change the
get_user_pages() call in check_and_migrate_cma_pages() to a call of
__get_user_pages_locked() on the newly migrated pages. This makes the
code read better in that we are calling __get_user_pages_locked() on the
pages before and after a potential migration.As a side affect some of the interfaces are cleaned up but this is not the
primary purpose of the series.In review[1] it was asked:
> This I don't get - if you do lock down long term mappings performance
> of the actual get_user_pages call shouldn't matter to start with.
>
> What do I miss?A couple of points.
First "longterm" is a relative thing and at this point is probably a
misnomer. This is really flagging a pin which is going to be given to
hardware and can't move. I've thought of a couple of alternative names
but I think we have to settle on if we are going to use FL_LAYOUT or
something else to solve the "longterm" problem. Then I think we can
change the flag to a better name.Second, It depends on how often you are registering memory. I have spoken
with some RDMA users who consider MR in the performance path... For the
overall application performance. I don't have the numbers as the tests
for HFI1 were done a long time ago. But there was a significant
advantage. Some of which is probably due to the fact that you don't have
to hold mmap_sem.Finally, architecturally I think it would be good for everyone to use
*_fast. There are patches submitted to the RDMA list which would allow
the use of *_fast (they reworking the use of mmap_sem) and as soon as they
are accepted I'll submit a patch to convert the RDMA core as well. Also
to this point others are looking to use *_fast.As an asside, Jasons pointed out in my previous submission that *_fast and
*_unlocked look very much the same. I agree and I think further cleanup
will be coming. But I'm focused on getting the final solution for DAX at
the moment.[1] https://lore.kernel.org/lkml/20190220180255.GA12020@iweiny-DESK2.sc.intel.com/T/#md6abad2569f3bf6c1f03686c8097ab6563e94965
[ira.weiny@intel.com: v3]
Link: http://lkml.kernel.org/r/20190328084422.29911-2-ira.weiny@intel.com
Link: http://lkml.kernel.org/r/20190328084422.29911-2-ira.weiny@intel.com
Link: http://lkml.kernel.org/r/20190317183438.2057-2-ira.weiny@intel.com
Signed-off-by: Ira Weiny
Reviewed-by: Andrew Morton
Cc: Aneesh Kumar K.V
Cc: Michal Hocko
Cc: John Hubbard
Cc: "Kirill A. Shutemov"
Cc: Peter Zijlstra
Cc: Jason Gunthorpe
Cc: Benjamin Herrenschmidt
Cc: Paul Mackerras
Cc: "David S. Miller"
Cc: Martin Schwidefsky
Cc: Heiko Carstens
Cc: Rich Felker
Cc: Yoshinori Sato
Cc: Thomas Gleixner
Cc: Ingo Molnar
Cc: Borislav Petkov
Cc: Ralf Baechle
Cc: James Hogan
Cc: Dan Williams
Cc: Mike Marshall
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
12 Apr, 2019
2 commits
-
This adds the support to determine the isolation type
of a mediated device group by checking whether it has
an iommu device. If an iommu device exists, an iommu
domain will be allocated and then attached to the iommu
device. Otherwise, keep the same behavior as it is.Cc: Ashok Raj
Cc: Jacob Pan
Cc: Kevin Tian
Signed-off-by: Sanjay Kumar
Signed-off-by: Liu Yi L
Signed-off-by: Lu Baolu
Reviewed-by: Jean-Philippe Brucker
Reviewed-by: Kirti Wankhede
Acked-by: Alex Williamson
Signed-off-by: Joerg Roedel -
This adds helpers to attach or detach a domain to a
group. This will replace iommu_attach_group() which
only works for non-mdev devices.If a domain is attaching to a group which includes the
mediated devices, it should attach to the iommu device
(a pci device which represents the mdev in iommu scope)
instead. The added helper supports attaching domain to
groups for both pci and mdev devices.Cc: Ashok Raj
Cc: Jacob Pan
Cc: Kevin Tian
Signed-off-by: Sanjay Kumar
Signed-off-by: Liu Yi L
Signed-off-by: Lu Baolu
Reviewed-by: Jean-Philippe Brucker
Acked-by: Alex Williamson
Signed-off-by: Joerg Roedel
04 Apr, 2019
1 commit
-
Memory backed DMA mappings are accounted against a user's locked
memory limit, including multiple mappings of the same memory. This
accounting bounds the number of such mappings that a user can create.
However, DMA mappings that are not backed by memory, such as DMA
mappings of device MMIO via mmaps, do not make use of page pinning
and therefore do not count against the user's locked memory limit.
These mappings still consume memory, but the memory is not well
associated to the process for the purpose of oom killing a task.To add bounding on this use case, we introduce a limit to the total
number of concurrent DMA mappings that a user is allowed to create.
This limit is exposed as a tunable module option where the default
value of 64K is expected to be well in excess of any reasonable use
case (a large virtual machine configuration would typically only make
use of tens of concurrent mappings).This fixes CVE-2019-3882.
Reviewed-by: Eric Auger
Tested-by: Eric Auger
Reviewed-by: Peter Xu
Reviewed-by: Cornelia Huck
Signed-off-by: Alex Williamson
09 Jan, 2019
1 commit
-
The below referenced commit adds a test for integer overflow, but in
doing so prevents the unmap ioctl from ever including the last page of
the address space. Subtract one to compare to the last address of the
unmap to avoid the overflow and wrap-around.Fixes: 71a7d3d78e3c ("vfio/type1: silence integer overflow warning")
Link: https://bugzilla.redhat.com/show_bug.cgi?id=1662291
Cc: stable@vger.kernel.org # v4.15+
Reported-by: Pei Zhang
Debugged-by: Peter Xu
Reviewed-by: Dan Carpenter
Reviewed-by: Peter Xu
Tested-by: Peter Xu
Reviewed-by: Cornelia Huck
Signed-off-by: Alex Williamson
15 Nov, 2018
1 commit
-
The AMD IOMMU driver can now map a huge-page where smaller
mappings existed before, so this code-path is no longer
triggered.Acked-by: Alex Williamson
Signed-off-by: Joerg Roedel
07 Aug, 2018
1 commit
-
In preparation to enabling -Wimplicit-fallthrough, mark switch cases
where we are expecting to fall through.Signed-off-by: Gustavo A. R. Silva
Signed-off-by: Alex Williamson
01 Jul, 2018
1 commit
-
The patch noted in the fixes below converted get_user_pages_fast() to
get_user_pages_longterm(), however the two calls differ in a few ways.First _fast() is documented to not require the mmap_sem, while _longterm()
is documented to need it. Hold the mmap sem as required.Second, _fast accepts an 'int write' while _longterm uses 'unsigned int
gup_flags', so the expression '!!(prot & IOMMU_WRITE)' is only working by
luck as FOLL_WRITE is currently == 0x1. Use the expected FOLL_WRITE
constant instead.Fixes: 94db151dc892 ("vfio: disable filesystem-dax page pinning")
Cc:
Signed-off-by: Jason Gunthorpe
Acked-by: Dan Williams
Signed-off-by: Alex Williamson
09 Jun, 2018
1 commit
-
MAP_DMA ioctls might be called from various threads within a process,
for example when using QEMU, the vCPU threads are often generating
these calls and we therefore take a reference to that vCPU task.
However, QEMU also supports vCPU hotplug on some machines and the task
that called MAP_DMA may have exited by the time UNMAP_DMA is called,
resulting in the mm_struct pointer being NULL and thus a failure to
match against the existing mapping.To resolve this, we instead take a reference to the thread
group_leader, which has the same mm_struct and resource limits, but
is less likely exit, at least in the QEMU case. A difficulty here is
guaranteeing that the capabilities of the group_leader match that of
the calling thread, which we resolve by tracking CAP_IPC_LOCK at the
time of calling rather than at an indeterminate time in the future.
Potentially this also results in better efficiency as this is now
recorded once per MAP_DMA ioctl.Reported-by: Xu Yandong
Signed-off-by: Alex Williamson
02 Jun, 2018
1 commit
-
Bisection by Amadeusz Sławiński implicates this commit leading to bad
page state issues after VM shutdown, likely due to unbalanced page
references. The original commit was intended only as a performance
improvement, therefore revert for offline rework.Link: https://lkml.org/lkml/2018/6/2/97
Fixes: 356e88ebe447 ("vfio/type1: Improve memory pinning process for raw PFN mapping")
Cc: Jason Cai (Xiang Feng)
Reported-by: Amadeusz Sławiński
Signed-off-by: Alex Williamson
23 Mar, 2018
1 commit
-
When using vfio to pass through a PCIe device (e.g. a GPU card) that
has a huge BAR (e.g. 16GB), a lot of cycles are wasted on memory
pinning because PFNs of PCI BAR are not backed by struct page, and
the corresponding VMA has flag VM_PFNMAP.With this change, when pinning a region which is a raw PFN mapping,
it can skip unnecessary user memory pinning process, and thus, can
significantly improve VM's boot up time when passing through devices
via VFIO. In my test on a Xeon E5 2.6GHz, the time mapping a 16GB
BAR was reduced from about 0.4s to 1.5us.Signed-off-by: Jason Cai (Xiang Feng)
Signed-off-by: Alex Williamson
22 Mar, 2018
1 commit
-
VFIO IOMMU type1 currently upmaps IOVA pages synchronously, which requires
IOTLB flushing for every unmapping. This results in large IOTLB flushing
overhead when handling pass-through devices has a large number of mapped
IOVAs. This can be avoided by using the new IOTLB flushing interface.Cc: Alex Williamson
Cc: Joerg Roedel
Signed-off-by: Suravee Suthikulpanit
[aw - use LIST_HEAD]
Signed-off-by: Alex Williamson
03 Mar, 2018
1 commit
-
Filesystem-DAX is incompatible with 'longterm' page pinning. Without
page cache indirection a DAX mapping maps filesystem blocks directly.
This means that the filesystem must not modify a file's block map while
any page in a mapping is pinned. In order to prevent the situation of
userspace holding of filesystem operations indefinitely, disallow
'longterm' Filesystem-DAX mappings.RDMA has the same conflict and the plan there is to add a 'with lease'
mechanism to allow the kernel to notify userspace that the mapping is
being torn down for block-map maintenance. Perhaps something similar can
be put in place for vfio.Note that xfs and ext4 still report:
"DAX enabled. Warning: EXPERIMENTAL, use at your own risk"
...at mount time, and resolving the dax-dma-vs-truncate problem is one
of the last hurdles to remove that designation.Acked-by: Alex Williamson
Cc: Michal Hocko
Cc: kvm@vger.kernel.org
Cc:
Reported-by: Haozhong Zhang
Tested-by: Haozhong Zhang
Fixes: d475c6346a38 ("dax,ext2: replace XIP read and write with DAX I/O")
Reviewed-by: Christoph Hellwig
Signed-off-by: Dan Williams
21 Oct, 2017
1 commit
-
I get a static checker warning about the potential integer overflow if
we add "unmap->iova + unmap->size". The integer overflow isn't really
harmful, but we may as well fix it. Also unmap->size gets truncated to
size_t when we pass it to vfio_find_dma() so we could check for too high
values of that as well.Signed-off-by: Dan Carpenter
Signed-off-by: Alex Williamson
11 Aug, 2017
2 commits
-
If the IOMMU driver advertises 'real' reserved regions for MSIs, but
still includes the software-managed region as well, we are currently
blind to the former and will configure the IOMMU domain to map MSIs into
the latter, which is unlikely to work as expected.Since it would take a ridiculous hardware topology for both regions to
be valid (which would be rather difficult to support in general), we
should be safe to assume that the presence of any hardware regions makes
the software region irrelevant. However, the IOMMU driver might still
advertise the software region by default, particularly if the hardware
regions are filled in elsewhere by generic code, so it might not be fair
for VFIO to be super-strict about not mixing them. To that end, make
vfio_iommu_has_sw_msi() robust against the presence of both region types
at once, so that we end up doing what is almost certainly right, rather
than what is almost certainly wrong.Signed-off-by: Robin Murphy
Tested-by: Shameer Kolothum
Reviewed-by: Eric Auger
Signed-off-by: Alex Williamson -
For ARM-based systems with a GICv3 ITS to provide interrupt isolation,
but hardware limitations which are worked around by having MSIs bypass
SMMU translation (e.g. HiSilicon Hip06/Hip07), VFIO neglects to check
for the IRQ_DOMAIN_FLAG_MSI_REMAP capability, (and thus erroneously
demands unsafe_interrupts) if a software-managed MSI region is absent.Fix this by always checking for isolation capability at both the IRQ
domain and IOMMU domain levels, rather than predicating that on whether
MSIs require an IOMMU mapping (which was always slightly tenuous logic).Signed-off-by: Robin Murphy
Tested-by: Shameer Kolothum
Reviewed-by: Eric Auger
Signed-off-by: Alex Williamson
19 Apr, 2017
2 commits
-
vfio_pin_pages_remote() is typically called to iterate over a range
of memory. Testing CAP_IPC_LOCK is relatively expensive, so it makes
sense to push it up to the caller, which can then repeatedly call
vfio_pin_pages_remote() using that value. This can show nearly a 20%
improvement on the worst case path through VFIO_IOMMU_MAP_DMA with
contiguous page mapping disabled. Testing RLIMIT_MEMLOCK is much more
lightweight, but we bring it along on the same principle and it does
seem to show a marginal improvement.Reviewed-by: Peter Xu
Reviewed-by: Kirti Wankhede
Signed-off-by: Alex Williamson -
With vfio_lock_acct() testing the locked memory limit under mmap_sem,
it's redundant to do it here for a single page. We can also reorder
our tests such that we can avoid testing for reserved pages if we're
not doing accounting and let vfio_lock_acct() test the process
CAP_IPC_LOCK. Finally, this function oddly returns 1 on success.
Update to return zero on success, -errno on error. Since the function
only pins a single page, there's no need to return the number of pages
pinned.N.B. vfio_pin_pages_remote() can pin a large contiguous range of pages
before calling vfio_lock_acct(). If we were to similarly remove the
extra test there, a user could temporarily pin far more pages than
they're allowed.Suggested-by: Kirti Wankhede
Suggested-by: Eric Auger
Reviewed-by: Kirti Wankhede
Reviewed-by: Peter Xu
Signed-off-by: Alex Williamson
14 Apr, 2017
1 commit
-
If the mmap_sem is contented then the vfio type1 IOMMU backend will
defer locked page accounting updates to a workqueue task. This has a
few problems and depending on which side the user tries to play, they
might be over-penalized for unmaps that haven't yet been accounted or
race the workqueue to enter more mappings than they're allowed. The
original intent of this workqueue mechanism seems to be focused on
reducing latency through the ioctl, but we cannot do so at the cost
of correctness. Remove this workqueue mechanism and update the
callers to allow for failure. We can also now recheck the limit under
write lock to make sure we don't exceed it.vfio_pin_pages_remote() also now necessarily includes an unwind path
which we can jump to directly if the consecutive page pinning finds
that we're exceeding the user's memory limits. This avoids the
current lazy approach which does accounting and mapping up to the
fault, only to return an error on the next iteration to unwind the
entire vfio_dma.Cc: stable@vger.kernel.org
Reviewed-by: Peter Xu
Reviewed-by: Kirti Wankhede
Signed-off-by: Alex Williamson
22 Mar, 2017
1 commit
-
The introduction of reserved regions has left a couple of rough edges
which we could do with sorting out sooner rather than later. Since we
are not yet addressing the potential dynamic aspect of software-managed
reservations and presenting them at arbitrary fixed addresses, it is
incongruous that we end up displaying hardware vs. software-managed MSI
regions to userspace differently, especially since ARM-based systems may
actually require one or the other, or even potentially both at once,
(which iommu-dma currently has no hope of dealing with at all). Let's
resolve the former user-visible inconsistency ASAP before the ABI has
been baked into a kernel release, in a way that also lays the groundwork
for the latter shortcoming to be addressed by follow-up patches.For clarity, rename the software-managed type to IOMMU_RESV_SW_MSI, use
IOMMU_RESV_MSI to describe the hardware type, and document everything a
little bit. Since the x86 MSI remapping hardware falls squarely under
this meaning of IOMMU_RESV_MSI, apply that type to their regions as well,
so that we tell the same story to userspace across all platforms.Secondly, as the various region types require quite different handling,
and it really makes little sense to ever try combining them, convert the
bitfield-esque #defines to a plain enum in the process before anyone
gets the wrong impression.Fixes: d30ddcaa7b02 ("iommu: Add a new type field in iommu_resv_region")
Reviewed-by: Eric Auger
CC: Alex Williamson
CC: David Woodhouse
CC: kvm@vger.kernel.org
Signed-off-by: Robin Murphy
Signed-off-by: Joerg Roedel
02 Mar, 2017
2 commits
-
We are going to split out of , which
will have to be picked up from other headers and a couple of .c files.Create a trivial placeholder file that just
maps to to make this patch obviously correct and
bisectable.Include the new header in the files that are going to need it.
Acked-by: Linus Torvalds
Cc: Mike Galbraith
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar -
We are going to split out of , which
will have to be picked up from other headers and a couple of .c files.Create a trivial placeholder file that just
maps to to make this patch obviously correct and
bisectable.The APIs that are going to be moved first are:
mm_alloc()
__mmdrop()
mmdrop()
mmdrop_async_fn()
mmdrop_async()
mmget_not_zero()
mmput()
mmput_async()
get_task_mm()
mm_access()
mm_release()Include the new header in the files that are going to need it.
Acked-by: Linus Torvalds
Cc: Mike Galbraith
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar
10 Feb, 2017
1 commit
-
Fix to return a negative error code from the error handling
case instead of 0, as done elsewhere in this function.Fixes: 5d704992189f ("vfio/type1: Allow transparent MSI IOVA allocation")
Signed-off-by: Wei Yongjun
Reviewed-by: Eric Auger
Acked-by: Alex Williamson
Signed-off-by: Joerg Roedel