17 Jan, 2019

1 commit

  • commit 58fec830fc19208354895d9832785505046d6c01 upstream.

    The below referenced commit adds a test for integer overflow, but in
    doing so prevents the unmap ioctl from ever including the last page of
    the address space. Subtract one to compare to the last address of the
    unmap to avoid the overflow and wrap-around.

    Fixes: 71a7d3d78e3c ("vfio/type1: silence integer overflow warning")
    Link: https://bugzilla.redhat.com/show_bug.cgi?id=1662291
    Cc: stable@vger.kernel.org # v4.15+
    Reported-by: Pei Zhang
    Debugged-by: Peter Xu
    Reviewed-by: Dan Carpenter
    Reviewed-by: Peter Xu
    Tested-by: Peter Xu
    Reviewed-by: Cornelia Huck
    Signed-off-by: Alex Williamson
    Signed-off-by: Greg Kroah-Hartman

    Alex Williamson
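
The wrap-around described in the commit above can be reproduced in plain C. This is only a sketch under stated assumptions: the function names are hypothetical, 64-bit unsigned addresses are assumed, and rejecting a zero size is a choice made here so that `size - 1` cannot underflow. It is not the kernel code itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Overflow test on the exclusive end: iova + size wraps to 0 when the
 * range ends exactly at the top of the address space, so a valid unmap
 * of the last page is rejected. */
static bool range_ok_end_exclusive(uint64_t iova, uint64_t size)
{
    return iova + size >= iova;
}

/* Overflow test on the last address of the range instead. */
static bool range_ok_last_address(uint64_t iova, uint64_t size)
{
    if (size == 0)
        return false;               /* assumption: empty ranges are rejected */
    return iova + size - 1 >= iova; /* last byte of the range, no wrap for valid input */
}

/* Small usage example for a 4 KiB page at the very top of the space. */
static void self_check(void)
{
    const uint64_t last_page = UINT64_MAX - 4095;

    assert(!range_ok_end_exclusive(last_page, 4096)); /* wrongly rejected */
    assert(range_ok_last_address(last_page, 4096));   /* accepted */
    assert(!range_ok_last_address(last_page, 8192));  /* real overflow still caught */
}
```

Comparing against the last address keeps the overflow protection while letting a range end on the final byte of the address space.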
     

18 Aug, 2018

1 commit

  • Pull powerpc updates from Michael Ellerman:
    "Notable changes:

    - A fix for a bug in our page table fragment allocator, where a page
    table page could be freed and reallocated for something else while
    still in use, leading to memory corruption etc. The fix reuses
    pt_mm in struct page (x86 only) for a powerpc only refcount.

    - Fixes to our pkey support. Several are user-visible changes, but
    bring us in to line with x86 behaviour and/or fix outright bugs.
    Thanks to Florian Weimer for reporting many of these.

    - A series to improve the hvc driver & related OPAL console code,
    which have been seen to cause hardlockups at times. The hvc driver
    changes in particular have been in linux-next for ~month.

    - Increase our MAX_PHYSMEM_BITS to 128TB when SPARSEMEM_VMEMMAP=y.

    - Remove Power8 DD1 and Power9 DD1 support, neither chip should be in
    use anywhere other than as a paper weight.

    - An optimised memcmp implementation using Power7-or-later VMX
    instructions

    - Support for barrier_nospec on some NXP CPUs.

    - Support for flushing the count cache on context switch on some IBM
    CPUs (controlled by firmware), as a Spectre v2 mitigation.

    - A series to enhance the information we print on unhandled signals
    to bring it into line with other arches, including showing the
    offending VMA and dumping the instructions around the fault.

    Thanks to: Aaro Koskinen, Akshay Adiga, Alastair D'Silva, Alexey
    Kardashevskiy, Alexey Spirkov, Alistair Popple, Andrew Donnellan,
    Aneesh Kumar K.V, Anju T Sudhakar, Arnd Bergmann, Bartosz Golaszewski,
    Benjamin Herrenschmidt, Bharat Bhushan, Bjoern Noetel, Boqun Feng,
    Breno Leitao, Bryant G. Ly, Camelia Groza, Christophe Leroy, Christoph
    Hellwig, Cyril Bur, Dan Carpenter, Daniel Klamt, Darren Stevens, Dave
    Young, David Gibson, Diana Craciun, Finn Thain, Florian Weimer,
    Frederic Barrat, Gautham R. Shenoy, Geert Uytterhoeven, Geoff Levand,
    Guenter Roeck, Gustavo Romero, Haren Myneni, Hari Bathini, Joel
    Stanley, Jonathan Neuschäfer, Kees Cook, Madhavan Srinivasan, Mahesh
    Salgaonkar, Markus Elfring, Mathieu Malaterre, Mauro S. M. Rodrigues,
    Michael Hanselmann, Michael Neuling, Michael Schmitz, Mukesh Ojha,
    Murilo Opsfelder Araujo, Nicholas Piggin, Parth Y Shah, Paul
    Mackerras, Paul Menzel, Ram Pai, Randy Dunlap, Rashmica Gupta, Reza
    Arbab, Rodrigo R. Galvao, Russell Currey, Sam Bobroff, Scott Wood,
    Shilpasri G Bhat, Simon Guo, Souptick Joarder, Stan Johnson, Thiago
    Jung Bauermann, Tyrel Datwyler, Vaibhav Jain, Vasant Hegde, Venkat
    Rao, zhong jiang"

    * tag 'powerpc-4.19-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (234 commits)
    powerpc/mm/book3s/radix: Add mapping statistics
    powerpc/uaccess: Enable get_user(u64, *p) on 32-bit
    powerpc/mm/hash: Remove unnecessary do { } while(0) loop
    powerpc/64s: move machine check SLB flushing to mm/slb.c
    powerpc/powernv/idle: Fix build error
    powerpc/mm/tlbflush: update the mmu_gather page size while iterating address range
    powerpc/mm: remove warning about ‘type’ being set
    powerpc/32: Include setup.h header file to fix warnings
    powerpc: Move `path` variable inside DEBUG_PROM
    powerpc/powermac: Make some functions static
    powerpc/powermac: Remove variable x that's never read
    cxl: remove a dead branch
    powerpc/powermac: Add missing include of header pmac.h
    powerpc/kexec: Use common error handling code in setup_new_fdt()
    powerpc/xmon: Add address lookup for percpu symbols
    powerpc/mm: remove huge_pte_offset_and_shift() prototype
    powerpc/lib: Use patch_site to patch copy_32 functions once cache is enabled
    powerpc/pseries: Fix endianness while restoring of r3 in MCE handler.
    powerpc/fadump: merge adjacent memory ranges to reduce PT_LOAD segements
    powerpc/fadump: handle crash memory ranges array index overflow
    ...

    Linus Torvalds
     

17 Aug, 2018

2 commits

  • Pull VFIO updates from Alex Williamson:

    - mark switch fall-through cases (Gustavo A. R. Silva)

    - disable binding SR-IOV enabled PFs (Alex Williamson)

    * tag 'vfio-v4.19-rc1' of git://github.com/awilliam/linux-vfio:
    vfio-pci: Disable binding to PFs with SR-IOV enabled
    vfio: Mark expected switch fall-throughs

    Linus Torvalds
     
  • Pull pci updates from Bjorn Helgaas:

    - Decode AER errors with names similar to "lspci" (Tyler Baicar)

    - Expose AER statistics in sysfs (Rajat Jain)

    - Clear AER status bits selectively based on the type of recovery (Oza
    Pawandeep)

    - Honor "pcie_ports=native" even if HEST sets FIRMWARE_FIRST (Alexandru
    Gagniuc)

    - Don't clear AER status bits if we're using the "Firmware-First"
    strategy where firmware owns the registers (Alexandru Gagniuc)

    - Use sysfs_match_string() to simplify ASPM sysfs parsing (Andy
    Shevchenko)

    - Remove unnecessary includes of (Bjorn Helgaas)

    - Defer DPC event handling to work queue (Keith Busch)

    - Use threaded IRQ for DPC bottom half (Keith Busch)

    - Print AER status while handling DPC events (Keith Busch)

    - Work around IDT switch ACS Source Validation erratum (James
    Puthukattukaran)

    - Emit diagnostics for all cases of PCIe Link downtraining (Links
    operating slower than they're capable of) (Alexandru Gagniuc)

    - Skip VFs when configuring Max Payload Size (Myron Stowe)

    - Reduce Root Port Max Payload Size if necessary when hot-adding a
    device below it (Myron Stowe)

    - Simplify SHPC existence/permission checks (Bjorn Helgaas)

    - Remove hotplug sample skeleton driver (Lukas Wunner)

    - Convert pciehp to threaded IRQ handling (Lukas Wunner)

    - Improve pciehp tolerance of missed events and initially unstable
    links (Lukas Wunner)

    - Clear spurious pciehp events on resume (Lukas Wunner)

    - Add pciehp runtime PM support, including for Thunderbolt controllers
    (Lukas Wunner)

    - Support interrupts from pciehp bridges in D3hot (Lukas Wunner)

    - Mark fall-through switch cases before enabling -Wimplicit-fallthrough
    (Gustavo A. R. Silva)

    - Move DMA-debug PCI init from arch code to PCI core (Christoph
    Hellwig)

    - Fix pci_request_irq() usage of IRQF_ONESHOT when no handler is
    supplied (Heiner Kallweit)

    - Unify PCI and DMA direction #defines (Shunyong Yang)

    - Add PCI_DEVICE_DATA() macro (Andy Shevchenko)

    - Check for VPD completion before checking for timeout (Bert Kenward)

    - Limit Netronome NFP5000 config space size to work around erratum
    (Jakub Kicinski)

    - Set IRQCHIP_ONESHOT_SAFE for PCI MSI irqchips (Heiner Kallweit)

    - Document ACPI description of PCI host bridges (Bjorn Helgaas)

    - Add "pci=disable_acs_redir=" parameter to disable ACS redirection for
    peer-to-peer DMA support (we don't have the peer-to-peer support yet;
    this is just one piece) (Logan Gunthorpe)

    - Clean up devm_of_pci_get_host_bridge_resources() resource allocation
    (Jan Kiszka)

    - Fixup resizable BARs after suspend/resume (Christian König)

    - Make "pci=earlydump" generic (Sinan Kaya)

    - Fix ROM BAR access routines to stay in bounds and check for signature
    correctly (Rex Zhu)

    - Add DMA alias quirk for Microsemi Switchtec NTB (Doug Meyer)

    - Expand documentation for pci_add_dma_alias() (Logan Gunthorpe)

    - To avoid bus errors, enable PASID only if entire path supports
    End-End TLP prefixes (Sinan Kaya)

    - Unify slot and bus reset functions and remove hotplug knowledge from
    callers (Sinan Kaya)

    - Add Function-Level Reset quirks for Intel and Samsung NVMe devices to
    fix guest reboot issues (Alex Williamson)

    - Add function 1 DMA alias quirk for Marvell 88SS9183 PCIe SSD
    Controller (Bjorn Helgaas)

    - Remove Xilinx AXI-PCIe host bridge arch dependency (Palmer Dabbelt)

    - Remove Aardvark outbound window configuration (Evan Wang)

    - Fix Aardvark bridge window sizing issue (Zachary Zhang)

    - Convert Aardvark to use pci_host_probe() to reduce code duplication
    (Thomas Petazzoni)

    - Correct the Cadence cdns_pcie_writel() signature (Alan Douglas)

    - Add Cadence support for optional generic PHYs (Alan Douglas)

    - Add Cadence power management ops (Alan Douglas)

    - Remove redundant variable from Cadence driver (Colin Ian King)

    - Add Kirin MSI support (Xiaowei Song)

    - Drop unnecessary root_bus_nr setting from exynos, imx6, keystone,
    armada8k, artpec6, designware-plat, histb, qcom, spear13xx (Shawn
    Guo)

    - Move link notification settings from DesignWare core to individual
    drivers (Gustavo Pimentel)

    - Add endpoint library MSI-X interfaces (Gustavo Pimentel)

    - Correct signature of endpoint library IRQ interfaces (Gustavo
    Pimentel)

    - Add DesignWare endpoint library MSI-X callbacks (Gustavo Pimentel)

    - Add endpoint library MSI-X test support (Gustavo Pimentel)

    - Remove unnecessary GFP_ATOMIC from Hyper-V "new child" allocation
    (Jia-Ju Bai)

    - Add more devices to Broadcom PAXC quirk (Ray Jui)

    - Work around corrupted Broadcom PAXC config space to enable SMMU and
    GICv3 ITS (Ray Jui)

    - Disable MSI parsing to work around broken Broadcom PAXC logic in some
    devices (Ray Jui)

    - Hide unconfigured functions to work around a Broadcom PAXC defect
    (Ray Jui)

    - Lower iproc log level to reduce console output during boot (Ray Jui)

    - Fix mobiveil iomem/phys_addr_t type usage (Lorenzo Pieralisi)

    - Fix mobiveil missing include file (Lorenzo Pieralisi)

    - Add mobiveil Kconfig/Makefile support (Lorenzo Pieralisi)

    - Fix mvebu I/O space remapping issues (Thomas Petazzoni)

    - Use generic pci_host_bridge in mvebu instead of ARM-specific API
    (Thomas Petazzoni)

    - Whitelist VMD devices with fast interrupt handlers to avoid sharing
    vectors with slow handlers (Keith Busch)

    * tag 'pci-v4.19-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci: (153 commits)
    PCI/AER: Don't clear AER bits if error handling is Firmware-First
    PCI: Limit config space size for Netronome NFP5000
    PCI/MSI: Set IRQCHIP_ONESHOT_SAFE for PCI-MSI irqchips
    PCI/VPD: Check for VPD access completion before checking for timeout
    PCI: Add PCI_DEVICE_DATA() macro to fully describe device ID entry
    PCI: Match Root Port's MPS to endpoint's MPSS as necessary
    PCI: Skip MPS logic for Virtual Functions (VFs)
    PCI: Add function 1 DMA alias quirk for Marvell 88SS9183
    PCI: Check for PCIe Link downtraining
    PCI: Add ACS Redirect disable quirk for Intel Sunrise Point
    PCI: Add device-specific ACS Redirect disable infrastructure
    PCI: Convert device-specific ACS quirks from NULL termination to ARRAY_SIZE
    PCI: Add "pci=disable_acs_redir=" parameter for peer-to-peer support
    PCI: Allow specifying devices using a base bus and path of devfns
    PCI: Make specifying PCI devices in kernel parameters reusable
    PCI: Hide ACS quirk declarations inside PCI core
    PCI: Delay after FLR of Intel DC P3700 NVMe
    PCI: Disable Samsung SM961/PM961 NVMe before FLR
    PCI: Export pcie_has_flr()
    PCI: mvebu: Drop bogus comment above mvebu_pcie_map_registers()
    ...

    Linus Torvalds
     

13 Aug, 2018

1 commit


07 Aug, 2018

2 commits

  • We expect to receive PFs with SR-IOV disabled, however some host
    drivers leave SR-IOV enabled at unbind. This puts us in a state where
    we can potentially assign both the PF and the VF, leading to both
    functionality as well as security concerns due to lack of managing the
    SR-IOV state as well as vendor dependent isolation from the PF to VF.
    If we were to attempt to actively disable SR-IOV on driver probe, we
    risk VF bound drivers blocking, potentially risking live lock
    scenarios. Therefore simply refuse to bind to PFs with SR-IOV enabled
    with a warning message indicating the issue. Users can resolve this
    by re-binding to the host driver and disabling SR-IOV before
    attempting to use the device with vfio-pci.

    Reviewed-by: David Gibson
    Reviewed-by: Peter Xu
    Signed-off-by: Alex Williamson

    Alex Williamson
     
  • In preparation to enabling -Wimplicit-fallthrough, mark switch cases
    where we are expecting to fall through.

    Signed-off-by: Gustavo A. R. Silva
    Signed-off-by: Alex Williamson

    Gustavo A. R. Silva
     

22 Jul, 2018

1 commit

  • Pull powerpc fixes from Michael Ellerman:
    "Two regression fixes, one for xmon disassembly formatting and the
    other to fix the E500 build.

    Two commits to fix a potential security issue in the VFIO code under
    obscure circumstances.

    And finally a fix to the Power9 idle code to restore SPRG3, which is
    user visible and used for sched_getcpu().

    Thanks to: Alexey Kardashevskiy, David Gibson, Gautham R. Shenoy,
    James Clarke"

    * tag 'powerpc-4.18-4' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
    powerpc/powernv: Fix save/restore of SPRG3 on entry/exit from stop (idle)
    powerpc/Makefile: Assemble with -me500 when building for E500
    KVM: PPC: Check if IOMMU page is contained in the pinned physical page
    vfio/spapr: Use IOMMU pageshift rather than pagesize
    powerpc/xmon: Fix disassembly since printf changes

    Linus Torvalds
     

20 Jul, 2018

2 commits

  • Now that the old implementation of pci_reset_bus() is gone, replace
    pci_try_reset_bus() with pci_reset_bus().

    Compared to the old implementation, the new code will fail immediately with
    -EAGAIN if the object lock cannot be obtained.

    Signed-off-by: Sinan Kaya
    Signed-off-by: Bjorn Helgaas

    Sinan Kaya
     
  • Drivers are expected to call pci_try_reset_slot() or pci_try_reset_bus() by
    querying if a system supports hotplug or not. A survey showed that most
    drivers don't do this and we are leaking hotplug capability to the user.

    Hide pci_try_reset_slot() from drivers and embed it into pci_try_reset_bus().
    Change pci_try_reset_bus() parameter from struct pci_bus to struct pci_dev.

    Signed-off-by: Sinan Kaya
    Signed-off-by: Bjorn Helgaas

    Sinan Kaya
     

19 Jul, 2018

1 commit

  • info.index can be indirectly controlled by user-space, hence leading
    to a potential exploitation of the Spectre variant 1 vulnerability.

    This issue was detected with the help of Smatch:

    drivers/vfio/pci/vfio_pci.c:734 vfio_pci_ioctl()
    warn: potential spectre issue 'vdev->region'

    Fix this by sanitizing info.index before indirectly using it to index
    vdev->region

    Notice that given that speculation windows are large, the policy is
    to kill the speculation on the first load and not worry if it can be
    completed with a dependent load/store [1].

    [1] https://marc.info/?l=linux-kernel&m=152449131114778&w=2

    Cc: stable@vger.kernel.org
    Signed-off-by: Gustavo A. R. Silva
    Signed-off-by: Alex Williamson

    Gustavo A. R. Silva
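
The index sanitizing described in the commit above is commonly done with a branch-free mask, so that an out-of-range index collapses to 0 instead of depending on a bounds check that the CPU may speculate past. The kernel offers `array_index_nospec()` for this purpose. The sketch below only illustrates the arithmetic in userspace C: the helper names are hypothetical, it assumes an arithmetic right shift on negative signed values (as GCC and Clang provide), and a userspace compiler gives no speculation guarantees.

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical helper: all ones when index < size, zero otherwise,
 * computed without a conditional branch. Requires size <= LONG_MAX. */
static unsigned long index_mask(unsigned long index, unsigned long size)
{
    /* The top bit of (index | (size - 1 - index)) is set exactly when
     * index is out of range; invert and smear it across the word. */
    long inverted = ~(long)(index | (size - 1UL - index));

    return (unsigned long)(inverted >> (sizeof(long) * CHAR_BIT - 1));
}

/* Clamp an untrusted index: in-range values pass through, others become 0. */
static unsigned long clamp_index(unsigned long index, unsigned long size)
{
    assert(size > 0 && size <= (unsigned long)LONG_MAX);
    return index & index_mask(index, size);
}
```

A caller would still perform the normal bounds check first and then use the clamped value to index the array.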
     

18 Jul, 2018

2 commits

  • A VM which has:
    - a DMA capable device passed through to it (eg. network card);
    - running a malicious kernel that ignores H_PUT_TCE failure;
    - capability of using IOMMU pages bigger than physical pages
    can create an IOMMU mapping that exposes (for example) 16MB of
    the host physical memory to the device when only 64K was allocated to the VM.

    The remaining 16MB - 64K will be some other content of host memory, possibly
    including pages of the VM, but also pages of host kernel memory, host
    programs or other VMs.

    The attacking VM does not control the location of the page it can map,
    and is only allowed to map as many pages as it has pages of RAM.

    We already have a check in drivers/vfio/vfio_iommu_spapr_tce.c that
    an IOMMU page is contained in the physical page so the PCI hardware won't
    get access to unassigned host memory; however this check is missing in
    the KVM fastpath (H_PUT_TCE accelerated code). We were lucky so far and
    did not hit this yet, because the very first time the mapping happens
    we do not have tbl::it_userspace allocated yet and fall back to
    userspace, which in turn calls the VFIO IOMMU driver; this fails and
    the guest does not retry.

    This stores the smallest preregistered page size in the preregistered
    region descriptor and changes the mm_iommu_xxx API to check this against
    the IOMMU page size.

    This calculates maximum page size as a minimum of the natural region
    alignment and compound page size. For the page shift this uses the shift
    returned by find_linux_pte() which indicates how the page is mapped to
    the current userspace - if the page is huge and this is not a zero, then
    it is a leaf pte and the page is mapped within the range.

    Fixes: 121f80ba68f1 ("KVM: PPC: VFIO: Add in-kernel acceleration for VFIO")
    Cc: stable@vger.kernel.org # v4.12+
    Signed-off-by: Alexey Kardashevskiy
    Reviewed-by: David Gibson
    Signed-off-by: Michael Ellerman

    Alexey Kardashevskiy
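
The containment rule from the commit above can be stated with page shifts: an IOMMU page may be mapped only if the backing memory is mapped with pages at least as large. The following is a minimal sketch with hypothetical function names, not the mm_iommu_xxx API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True when an IOMMU page of 2^iommu_shift bytes fits inside a backing
 * page of 2^backing_shift bytes. */
static bool iommu_page_fits(unsigned int backing_shift, unsigned int iommu_shift)
{
    assert(backing_shift < 64 && iommu_shift < 64);
    return iommu_shift <= backing_shift;
}

/* Bytes of unrelated memory a device could reach if the check were skipped. */
static uint64_t exposed_bytes(unsigned int backing_shift, unsigned int iommu_shift)
{
    if (iommu_page_fits(backing_shift, iommu_shift))
        return 0;
    return (UINT64_C(1) << iommu_shift) - (UINT64_C(1) << backing_shift);
}
```

With a 64K backing page (shift 16) and a 16MB IOMMU page (shift 24), `exposed_bytes(16, 24)` gives the "16MB - 64K" quantity mentioned in the commit message.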
     
  • The size is always equal to 1 page so let's use this. Later on this will
    be used for other checks which use page shifts to check the granularity
    of access.

    This should cause no behavioral change.

    Cc: stable@vger.kernel.org # v4.12+
    Reviewed-by: David Gibson
    Acked-by: Alex Williamson
    Signed-off-by: Alexey Kardashevskiy
    Signed-off-by: Michael Ellerman

    Alexey Kardashevskiy
     

16 Jul, 2018

3 commits

  • At the moment we allocate the entire TCE table, twice (hardware part and
    userspace translation cache). This normally works as we normally have
    contiguous memory and the guest will map entire RAM for 64bit DMA.

    However if we have sparse RAM (one example is a memory device), then
    we will allocate TCEs which will never be used as the guest only maps
    actual memory for DMA. If it is a single level TCE table, there is nothing
    we can really do, but if it is a multilevel table, we can skip allocating
    TCEs we know we won't need.

    This adds ability to allocate only first level, saving memory.

    This changes iommu_table::free() to avoid allocating of an extra level;
    iommu_table::set() will do this when needed.

    This adds @alloc parameter to iommu_table::exchange() to tell the callback
    if it can allocate an extra level; the flag is set to "false" for
    the realmode KVM handlers of H_PUT_TCE hcalls and the callback returns
    H_TOO_HARD.

    This still requires the entire table to be counted in mm::locked_vm.

    To be conservative, this only does on-demand allocation when
    the userspace cache table is requested which is the case of VFIO.

    The example math for a system replicating a powernv setup with NVLink2
    in a guest:
    16GB RAM mapped at 0x0
    128GB GPU RAM window (16GB of actual RAM) mapped at 0x244000000000

    the table to cover that all with 64K pages takes:
    (((0x244000000000 + 0x2000000000) >> 16)*8)>>20 = 4656MB

    If we allocate only necessary TCE levels, we will only need:
    (((0x400000000 + 0x400000000) >> 16)*8)>>20 = 4MB (plus some for indirect
    levels).

    Signed-off-by: Alexey Kardashevskiy
    Reviewed-by: David Gibson
    Signed-off-by: Michael Ellerman

    Alexey Kardashevskiy
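
The example math in the commit above can be checked mechanically. The sketch below evaluates the same expression for a flat table of 8-byte TCEs; the function name is hypothetical and the page shift of 16 corresponds to the 64K pages used in the example.

```c
#include <assert.h>
#include <stdint.h>

/* Size in MB of a flat table of 8-byte TCEs covering `bytes` of DMA
 * address space with pages of 2^page_shift bytes. */
static uint64_t tce_table_mb(uint64_t bytes, unsigned int page_shift)
{
    assert(page_shift < 64);
    return ((bytes >> page_shift) * 8) >> 20;
}
```

Calling `tce_table_mb(0x244000000000 + 0x2000000000, 16)` covers everything up to the end of the GPU RAM window, while `tce_table_mb(0x400000000 + 0x400000000, 16)` counts only the two 16GB regions that are actually backed by memory.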
     
  • We want to support sparse memory and therefore huge chunks of DMA windows
    do not need to be mapped. If a DMA window is big enough to require 2 or more
    indirect levels, and a DMA window is used to map all RAM (which is
    a default case for 64bit window), we can actually save some memory by
    not allocating TCEs for regions which we are not going to map anyway.

    The hardware tables already support indirect levels but we also keep
    host-physical-to-userspace translation array which is allocated by
    vmalloc() and is a flat array which might use quite some memory.

    This converts it_userspace from vmalloc'ed array to a multi level table.

    As the format becomes platform dependent, this replaces the direct access
    to it_userspace with an iommu_table_ops::useraddrptr hook which returns
    a pointer to the userspace copy of a TCE; future extension will return
    NULL if the level was not allocated.

    This should not change non-KVM handling of TCE tables and it_userspace
    will not be allocated for non-KVM tables.

    Reviewed-by: David Gibson
    Signed-off-by: Alexey Kardashevskiy
    Signed-off-by: Michael Ellerman

    Alexey Kardashevskiy
     
  • We are going to reuse multilevel TCE code for the userspace copy of
    the TCE table and since it is big endian, let's make the copy big endian
    too.

    Reviewed-by: David Gibson
    Signed-off-by: Alexey Kardashevskiy
    Acked-by: Paul Mackerras
    Signed-off-by: Michael Ellerman

    Alexey Kardashevskiy
     

01 Jul, 2018

1 commit

  • The patch noted in the fixes below converted get_user_pages_fast() to
    get_user_pages_longterm(), however the two calls differ in a few ways.

    First, _fast() is documented to not require the mmap_sem, while _longterm()
    is documented to need it. Hold the mmap_sem as required.

    Second, _fast accepts an 'int write' while _longterm uses 'unsigned int
    gup_flags', so the expression '!!(prot & IOMMU_WRITE)' is only working by
    luck as FOLL_WRITE is currently == 0x1. Use the expected FOLL_WRITE
    constant instead.

    Fixes: 94db151dc892 ("vfio: disable filesystem-dax page pinning")
    Cc:
    Signed-off-by: Jason Gunthorpe
    Acked-by: Dan Williams
    Signed-off-by: Alex Williamson

    Jason Gunthorpe
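
The second point in the commit above, a boolean passed where a flags word is expected, can be shown in isolation. In this sketch the bit value `DEMO_IOMMU_WRITE` and both function names are illustrative assumptions, and the write flag is passed as a parameter so the fragile case is visible.

```c
#include <assert.h>

/* Illustrative bit value; an assumption, not the kernel's definition. */
#define DEMO_IOMMU_WRITE (1u << 1)

/* Explicit translation: correct for any value of the write flag. */
static unsigned int gup_flags_explicit(unsigned int prot, unsigned int foll_write)
{
    assert(foll_write != 0);
    return (prot & DEMO_IOMMU_WRITE) ? foll_write : 0;
}

/* Boolean used as a flags word: only matches when the write flag is 0x1. */
static unsigned int gup_flags_boolean(unsigned int prot)
{
    return !!(prot & DEMO_IOMMU_WRITE);
}
```

Both functions agree while the write flag is 0x1. If the flag were ever renumbered, say to 0x4, the boolean form would silently set the wrong bit.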
     

19 Jun, 2018

1 commit

  • Allow the code which provides extensions to support direct assignment
    of Intel IGD (GVT-d) to be compiled out of the kernel if desired. The
    config option for this was previously automatically enabled on X86,
    therefore the default remains Y. This simply provides the option to
    disable it even for X86.

    Signed-off-by: Alex Williamson

    Alex Williamson
     

13 Jun, 2018

1 commit

  • Pull VFIO updates from Alex Williamson:

    - Bind type1 task tracking to group_leader to facilitate vCPU hotplug
    in QEMU (Alex Williamson)

    - Sample mdev display drivers, including region-based host and guest
    Linux drivers and bochs compatible dmabuf device
    (Gerd Hoffmann)

    - Fix vfio-platform reset module leak (Geert Uytterhoeven)

    - vfio-platform error message consistency (Geert Uytterhoeven)

    - Global checking for mdev uuid collisions rather than per parent
    device (Alex Williamson)

    - Use match_string() helper (Yisheng Xie)

    - vfio-platform PM domain fixes (Geert Uytterhoeven)

    - Fix sample mbochs driver build dependency (Arnd Bergmann)

    * tag 'vfio-v4.18-rc1' of git://github.com/awilliam/linux-vfio:
    samples: mbochs: add DMA_SHARED_BUFFER dependency
    vfio: platform: Fix using devices in PM Domains
    vfio: use match_string() helper
    vfio/mdev: Re-order sysfs attribute creation
    vfio/mdev: Check globally for duplicate devices
    vfio: platform: Make printed error messages more consistent
    vfio: platform: Fix reset module leak in error path
    sample: vfio bochs vbe display (host device for bochs-drm)
    sample: vfio mdev display - guest driver
    sample: vfio mdev display - host device
    vfio/type1: Fix task tracking for QEMU vCPU hotplug

    Linus Torvalds
     

09 Jun, 2018

7 commits

  • If a device is part of a PM Domain (e.g. power and/or clock domain), its
    power state is managed using Runtime PM. Without Runtime PM, the device
    may not be powered up or clocked, causing subtle failures, crashes, or
    system lock-ups when the device is accessed by the guest.

    Fix this by adding Runtime PM support, powering the device when the VFIO
    device is opened by the guest.

    Signed-off-by: Geert Uytterhoeven
    Reviewed-by: Simon Horman
    Acked-by: Eric Auger
    Signed-off-by: Alex Williamson

    Geert Uytterhoeven
     
  • match_string() returns the index of an array for a matching string,
    which can be used instead of the open-coded variant.

    Cc: Alex Williamson
    Cc: kvm@vger.kernel.org
    Signed-off-by: Yisheng Xie
    Reviewed-by: Andy Shevchenko
    Signed-off-by: Alex Williamson

    Yisheng Xie
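
For readers unfamiliar with the helper named in the commit above, the open-coded pattern it replaces looks roughly like the loop below. This is a userspace approximation with a hypothetical name. As a simplification it returns -1 when nothing matches, whereas the kernel helper returns a negative errno.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Return the index of `string` in the first `n` entries of `array`,
 * stopping early at a NULL entry, or -1 if there is no match. */
static int find_string(const char *const *array, size_t n, const char *string)
{
    assert(array != NULL && string != NULL);

    for (size_t i = 0; i < n; i++) {
        if (array[i] == NULL)
            break;
        if (strcmp(array[i], string) == 0)
            return (int)i;
    }
    return -1;
}
```

Replacing several such loops with one shared helper removes duplicated code and keeps the not-found handling consistent.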
     
  • There exists a gap at the end of mdev_device_create() where the device
    is visible to userspace, but we're not yet ready to handle removal, as
    triggered through the 'remove' attribute. We handle this properly in
    mdev_device_remove() with an -EAGAIN return, but we can marginally
    reduce this gap by adding this attribute as a final step of our sysfs
    setup.

    Reviewed-by: Kirti Wankhede
    Reviewed-by: Cornelia Huck
    Acked-by: Halil Pasic
    Signed-off-by: Alex Williamson

    Alex Williamson
     
  • When we create an mdev device, we check for duplicates against the
    parent device and return -EEXIST if found, but the mdev device
    namespace is global since we'll link all devices from the bus. We do
    catch this later in sysfs_do_create_link_sd() to return -EEXIST, but
    with it comes a kernel warning and stack trace for trying to create
    duplicate sysfs links, which makes it an undesirable response.

    Therefore we should really be looking for duplicates across all mdev
    parent devices, or as implemented here, against our mdev device list.
    Using mdev_list to prevent duplicates means that we can remove
    mdev_parent.lock, but in order not to serialize mdev device creation
    and removal globally, we add mdev_device.active which allows UUIDs to
    be reserved such that we can drop the mdev_list_lock before the mdev
    device is fully in place.

    Two behavioral notes; first, mdev_parent.lock had the side-effect of
    serializing mdev create and remove ops per parent device. This was
    an implementation detail, not an intentional guarantee provided to
    the mdev vendor drivers. Vendor drivers can trivially provide this
    serialization internally if necessary. Second, review comments note
    the new -EAGAIN behavior when the device, and in particular the remove
    attribute, becomes visible in sysfs. If a remove is triggered prior
    to completion of mdev_device_create() the user will see a -EAGAIN
    error. While the errno is different, receiving an error during this
    period is not new; the previous implementation returned -ENODEV for the
    same condition. Furthermore, the consistency to the user is improved
    in the case where mdev_device_remove_ops() returns error. Previously
    concurrent calls to mdev_device_remove() could see the device
    disappear with -ENODEV and return in the case of error. Now a user
    would see -EAGAIN while the device is in this transitory state.

    Reviewed-by: Kirti Wankhede
    Reviewed-by: Cornelia Huck
    Acked-by: Halil Pasic
    Acked-by: Zhenyu Wang
    Signed-off-by: Alex Williamson

    Alex Williamson
     
  • - Capitalize the first word of error messages,
    - Unwrap statements that fit on a single line,
    - Use "VFIO" instead of "vfio" as the error message prefix.

    Signed-off-by: Geert Uytterhoeven
    Reviewed-by: Eric Auger
    Acked-by: Eric Auger
    Signed-off-by: Alex Williamson

    Geert Uytterhoeven
     
  • If the IOMMU group setup fails, the reset module is not released.

    Fixes: b5add544d677d363 ("vfio, platform: make reset driver a requirement by default")
    Signed-off-by: Geert Uytterhoeven
    Reviewed-by: Eric Auger
    Reviewed-by: Simon Horman
    Acked-by: Eric Auger
    Signed-off-by: Alex Williamson

    Geert Uytterhoeven
     
  • MAP_DMA ioctls might be called from various threads within a process,
    for example when using QEMU, the vCPU threads are often generating
    these calls and we therefore take a reference to that vCPU task.
    However, QEMU also supports vCPU hotplug on some machines and the task
    that called MAP_DMA may have exited by the time UNMAP_DMA is called,
    resulting in the mm_struct pointer being NULL and thus a failure to
    match against the existing mapping.

    To resolve this, we instead take a reference to the thread
    group_leader, which has the same mm_struct and resource limits, but
    is less likely to exit, at least in the QEMU case. A difficulty here is
    guaranteeing that the capabilities of the group_leader match that of
    the calling thread, which we resolve by tracking CAP_IPC_LOCK at the
    time of calling rather than at an indeterminate time in the future.
    Potentially this also results in better efficiency as this is now
    recorded once per MAP_DMA ioctl.

    Reported-by: Xu Yandong
    Signed-off-by: Alex Williamson

    Alex Williamson
     

05 Jun, 2018

1 commit

  • Pull aio updates from Al Viro:
    "Majority of AIO stuff this cycle. aio-fsync and aio-poll, mostly.

    The only thing I'm holding back for a day or so is Adam's aio ioprio -
    his last-minute fixup is trivial (missing stub in !CONFIG_BLOCK case),
    but let it sit in -next for decency sake..."

    * 'work.aio-1' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (46 commits)
    aio: sanitize the limit checking in io_submit(2)
    aio: fold do_io_submit() into callers
    aio: shift copyin of iocb into io_submit_one()
    aio_read_events_ring(): make a bit more readable
    aio: all callers of aio_{read,write,fsync,poll} treat 0 and -EIOCBQUEUED the same way
    aio: take list removal to (some) callers of aio_complete()
    aio: add missing break for the IOCB_CMD_FDSYNC case
    random: convert to ->poll_mask
    timerfd: convert to ->poll_mask
    eventfd: switch to ->poll_mask
    pipe: convert to ->poll_mask
    crypto: af_alg: convert to ->poll_mask
    net/rxrpc: convert to ->poll_mask
    net/iucv: convert to ->poll_mask
    net/phonet: convert to ->poll_mask
    net/nfc: convert to ->poll_mask
    net/caif: convert to ->poll_mask
    net/bluetooth: convert to ->poll_mask
    net/sctp: convert to ->poll_mask
    net/tipc: convert to ->poll_mask
    ...

    Linus Torvalds
     

02 Jun, 2018

1 commit

  • Bisection by Amadeusz Sławiński implicates this commit leading to bad
    page state issues after VM shutdown, likely due to unbalanced page
    references. The original commit was intended only as a performance
    improvement, therefore revert for offline rework.

    Link: https://lkml.org/lkml/2018/6/2/97
    Fixes: 356e88ebe447 ("vfio/type1: Improve memory pinning process for raw PFN mapping")
    Cc: Jason Cai (Xiang Feng)
    Reported-by: Amadeusz Sławiński
    Signed-off-by: Alex Williamson

    Alex Williamson
     

26 May, 2018

1 commit


07 Apr, 2018

1 commit

  • Pull VFIO updates from Alex Williamson:

    - Adopt iommu_unmap_fast() interface to type1 backend
    (Suravee Suthikulpanit)

    - mdev sample driver fixup (Shunyong Yang)

    - More efficient PFN mapping handling in type1 backend
    (Jason Cai)

    - VFIO device ioeventfd interface (Alex Williamson)

    - Tag new vfio-platform sub-maintainer (Alex Williamson)

    * tag 'vfio-v4.17-rc1' of git://github.com/awilliam/linux-vfio:
    MAINTAINERS: vfio/platform: Update sub-maintainer
    vfio/pci: Add ioeventfd support
    vfio/pci: Use endian neutral helpers
    vfio/pci: Pull BAR mapping setup from read-write path
    vfio/type1: Improve memory pinning process for raw PFN mapping
    vfio-mdev/samples: change RDI interrupt condition
    vfio/type1: Adopt fast IOTLB flush interface when unmap IOVAs

    Linus Torvalds
     

27 Mar, 2018

3 commits

  • The ioeventfd here is actually irqfd handling of an ioeventfd such as
    supported in KVM. A user is able to pre-program a device write to
    occur when the eventfd triggers. This is yet another instance of
    eventfd-irqfd triggering between KVM and vfio. The impetus for this
    is high frequency writes to pages which are virtualized in QEMU.
    Enabling this near-direct write path for selected registers within
    the virtualized page can improve performance and reduce overhead.
    Specifically this is initially targeted at NVIDIA graphics cards where
    the driver issues a write to an MMIO register within a virtualized
    region in order to allow the MSI interrupt to re-trigger.

    Reviewed-by: Peter Xu
    Reviewed-by: Alexey Kardashevskiy
    Signed-off-by: Alex Williamson

    Alex Williamson
     
  • The iowriteXX/ioreadXX functions assume little endian hardware and
    convert to little endian on a write and from little endian on a read.
    We currently do our own explicit conversion to negate this. Instead,
    add some endian dependent defines to avoid all byte swaps. There
    should be no functional change other than big endian systems aren't
    penalized with wasted swaps.
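
The double swap described above can be illustrated with a toy model. This is only a sketch: the Python class and the `write_old`/`write_new` helpers are hypothetical, and the method names merely mirror the kernel helpers `le32_to_cpu`, `iowrite32` and `iowrite32be`. It assumes a big-endian host, where a little-endian accessor must byte-swap on every store.

```python
def swap32(value: int) -> int:
    """Reverse the byte order of a 32-bit value."""
    return int.from_bytes(value.to_bytes(4, "little"), "big")


class BigEndianHost:
    """Toy model of a big-endian CPU that counts byte swaps (hypothetical)."""

    def __init__(self):
        self.swaps = 0
        self.device_register = None

    def _swap(self, value):
        self.swaps += 1
        return swap32(value)

    def le32_to_cpu(self, value):
        # On a big-endian host, LE -> CPU order is a byte swap.
        return self._swap(value)

    def iowrite32(self, value):
        # Little-endian accessor: converts CPU order to LE before the store.
        self.device_register = self._swap(value)

    def iowrite32be(self, value):
        # Big-endian accessor: CPU order already matches, so no swap.
        self.device_register = value


def write_old(host, user_value):
    """Explicit conversion that only exists to undo the accessor's swap."""
    host.iowrite32(host.le32_to_cpu(user_value))


def write_new(host, user_value):
    """Endian-dependent choice of accessor: no swaps at all."""
    host.iowrite32be(user_value)
```

Both paths store the same value in the modelled register; the second one simply avoids the two swaps that cancel each other out.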

    Reviewed-by: Alexey Kardashevskiy
    Signed-off-by: Alex Williamson

    Alex Williamson
     
  • This creates a common helper that we'll use for ioeventfd setup.

    Reviewed-by: Peter Xu
    Reviewed-by: Eric Auger
    Reviewed-by: Alexey Kardashevskiy
    Signed-off-by: Alex Williamson

    Alex Williamson
     

23 Mar, 2018

1 commit

  • When using vfio to pass through a PCIe device (e.g. a GPU card) that
    has a huge BAR (e.g. 16GB), a lot of cycles are wasted on memory
    pinning because PFNs of PCI BAR are not backed by struct page, and
    the corresponding VMA has flag VM_PFNMAP.

    With this change, when pinning a region which is a raw PFN mapping,
    the unnecessary user memory pinning process is skipped, which can
    significantly improve a VM's boot time when passing through devices
    via VFIO. In my test on a Xeon E5 2.6GHz, the time to map a 16GB
    BAR was reduced from about 0.4s to 1.5us.

    Signed-off-by: Jason Cai (Xiang Feng)
    Signed-off-by: Alex Williamson

    Jason Cai (Xiang Feng)
     

22 Mar, 2018

2 commits

  • This reverts commit 2170dd04316e0754cbbfa4892a25aead39d225f7

    The intent of commit 2170dd04316e ("vfio-pci: Mask INTx if a device is
    not capabable of enabling it") was to disallow the user from seeing
    that the device supports INTx if the platform is incapable of enabling
    it. The detection of this case however incorrectly includes devices
    which natively do not support INTx, such as SR-IOV VFs, and further
    discussions reveal gaps even for the target use case.

    Reported-by: Arjun Vynipadath
    Fixes: 2170dd04316e ("vfio-pci: Mask INTx if a device is not capabable of enabling it")
    Signed-off-by: Alex Williamson

    Alex Williamson
     
  • VFIO IOMMU type1 currently unmaps IOVA pages synchronously, which requires
    IOTLB flushing for every unmapping. This results in large IOTLB flushing
    overhead when a pass-through device has a large number of mapped
    IOVAs. This can be avoided by using the new IOTLB flushing interface.
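
The saving can be sketched with a toy counter. This is a simplified model, not the kernel implementation: `ToyIommu` and its methods are hypothetical names, and the only thing modelled is how many IOTLB flushes are issued when the flush is deferred until a batch of unmaps is complete.

```python
class ToyIommu:
    """Hypothetical IOMMU model that only counts IOTLB flushes."""

    def __init__(self):
        self.mapped = set()
        self.flushes = 0
        self.pending = False

    def map(self, iova):
        self.mapped.add(iova)

    def unmap(self, iova):
        # Synchronous unmap: the IOTLB is flushed before returning.
        self.mapped.discard(iova)
        self.flushes += 1

    def unmap_fast(self, iova):
        # Deferred unmap: remember that a flush is still owed.
        self.mapped.discard(iova)
        self.pending = True

    def tlb_sync(self):
        # One flush covers every unmap queued since the last sync.
        if self.pending:
            self.flushes += 1
            self.pending = False


def unmap_sync_each(iommu, iovas):
    for iova in iovas:
        iommu.unmap(iova)


def unmap_batched(iommu, iovas):
    for iova in iovas:
        iommu.unmap_fast(iova)
    iommu.tlb_sync()
```

With 1000 mapped IOVAs the first helper issues 1000 flushes and the second issues one. A real implementation would also need to hold on to the unmapped pages until the sync has completed, since the device may still use stale IOTLB entries before that point.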

    Cc: Alex Williamson
    Cc: Joerg Roedel
    Signed-off-by: Suravee Suthikulpanit
    [aw - use LIST_HEAD]
    Signed-off-by: Alex Williamson

    Suravee Suthikulpanit
     

03 Mar, 2018

1 commit

  • Filesystem-DAX is incompatible with 'longterm' page pinning. Without
    page cache indirection a DAX mapping maps filesystem blocks directly.
    This means that the filesystem must not modify a file's block map while
    any page in a mapping is pinned. In order to prevent the situation of
    userspace holding off filesystem operations indefinitely, disallow
    'longterm' Filesystem-DAX mappings.

    RDMA has the same conflict and the plan there is to add a 'with lease'
    mechanism to allow the kernel to notify userspace that the mapping is
    being torn down for block-map maintenance. Perhaps something similar can
    be put in place for vfio.

    Note that xfs and ext4 still report:

    "DAX enabled. Warning: EXPERIMENTAL, use at your own risk"

    ...at mount time, and resolving the dax-dma-vs-truncate problem is one
    of the last hurdles to remove that designation.

    Acked-by: Alex Williamson
    Cc: Michal Hocko
    Cc: kvm@vger.kernel.org
    Cc:
    Reported-by: Haozhong Zhang
    Tested-by: Haozhong Zhang
    Fixes: d475c6346a38 ("dax,ext2: replace XIP read and write with DAX I/O")
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Dan Williams

    Dan Williams
     

12 Feb, 2018

1 commit

  • This is the mindless scripted replacement of kernel use of POLL*
    variables as described by Al, done by this script:

    for V in IN OUT PRI ERR RDNORM RDBAND WRNORM WRBAND HUP RDHUP NVAL MSG; do
    L=`git grep -l -w POLL$V | grep -v '^t' | grep -v /um/ | grep -v '^sa' | grep -v '/poll.h$'|grep -v '^D'`
    for f in $L; do sed -i "-es/^\([^\"]*\)\(\<POLL$V\>\)/\\1E\\2/" $f; done
    done

    with de-mangling cleanups yet to come.
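
The effect of the substitution can be sketched in Python. This is an approximation under a stated assumption: the second group of the sed expression is taken to be a word-bounded match on POLL$V, and Python's `\b` stands in for sed's word anchors. The function names are hypothetical.

```python
import re

POLL_SUFFIXES = [
    "IN", "OUT", "PRI", "ERR", "RDNORM", "RDBAND",
    "WRNORM", "WRBAND", "HUP", "RDHUP", "NVAL", "MSG",
]


def convert_line(line: str, suffix: str) -> str:
    """Prefix one POLL<suffix> with 'E', but only before the first double quote."""
    # Group 1: everything from the start of the line that contains no '"'.
    # Group 2: the whole-word constant, so EPOLLIN or POLLINFO never match.
    pattern = r'^([^"]*)(\bPOLL' + re.escape(suffix) + r'\b)'
    return re.sub(pattern, r"\1E\2", line)


def convert(line: str) -> str:
    """Apply the substitution once per suffix, as the shell loop does."""
    for suffix in POLL_SUFFIXES:
        line = convert_line(line, suffix)
    return line
```

For example, `convert("if (mask & POLLIN)")` gives `if (mask & EPOLLIN)`, while a constant that appears only inside a string literal is left alone. Because the pattern is anchored at the start of the line and the first group is greedy, at most one occurrence of a given constant is rewritten per line in this sketch.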

    NOTE! On almost all architectures, the EPOLL* constants have the same
    values as the POLL* constants do. But the keyword here is "almost".
    For various bad reasons they aren't the same, and epoll() doesn't
    actually work quite correctly in some cases due to this on Sparc et al.

    The next patch from Al will sort out the final differences, and we
    should be all done.

    Scripted-by: Al Viro
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

02 Feb, 2018

1 commit

  • Pull VFIO updates from Alex Williamson:

    - Mask INTx from user if pdev->irq is zero (Alexey Kardashevskiy)

    - Capability helper cleanup (Alex Williamson)

    - Allow mmaps overlapping MSI-X vector table with region capability
    exposing this feature (Alexey Kardashevskiy)

    - mdev static cleanups (Xiongwei Song)

    * tag 'vfio-v4.16-rc1' of git://github.com/awilliam/linux-vfio:
    vfio: mdev: make a couple of functions and structure vfio_mdev_driver static
    vfio-pci: Allow mapping MSIX BAR
    vfio: Simplify capability helper
    vfio-pci: Mask INTx if a device is not capabable of enabling it

    Linus Torvalds
     

10 Jan, 2018

1 commit

  • The functions vfio_mdev_probe, vfio_mdev_remove and the structure
    vfio_mdev_driver are only used in this file, so make them static.

    Clean up sparse warnings:
    drivers/vfio/mdev/vfio_mdev.c:114:5: warning: no previous prototype
    for 'vfio_mdev_probe' [-Wmissing-prototypes]
    drivers/vfio/mdev/vfio_mdev.c:121:6: warning: no previous prototype
    for 'vfio_mdev_remove' [-Wmissing-prototypes]

    Signed-off-by: Xiongwei Song
    Reviewed-by: Quan Xu
    Reviewed-by: Liu, Yi L
    Reviewed-by: Kirti Wankhede
    Signed-off-by: Alex Williamson

    Xiongwei Song