04 Apr, 2014

1 commit

  • Pull VFIO updates from Alex Williamson:
    "VFIO updates for v3.15 include:

    - Allow the vfio-type1 IOMMU to support multiple domains within a
    container
    - Plumb path to query whether all domains are cache-coherent
    - Wire query into kvm-vfio device to avoid KVM x86 WBINVD emulation
    - Always select CONFIG_ANON_INODES, vfio depends on it (Arnd)

    The first patch also makes the vfio-type1 IOMMU driver completely
    independent of the bus_type of the devices it's handling, which
    enables it to be used for both vfio-pci and a future vfio-platform
    (and hopefully combinations involving both simultaneously)"

    * tag 'vfio-v3.15-rc1' of git://github.com/awilliam/linux-vfio:
    vfio: always select ANON_INODES
    kvm/vfio: Support for DMA coherent IOMMUs
    vfio: Add external user check extension interface
    vfio/type1: Add extension to test DMA cache coherence of IOMMU
    vfio/iommu_type1: Multi-IOMMU domain support

    Linus Torvalds
     

02 Apr, 2014

1 commit

  • Pull PCI changes from Bjorn Helgaas:
    "Enumeration
    - Increment max correctly in pci_scan_bridge() (Andreas Noever)
    - Clarify the "scan anyway" comment in pci_scan_bridge() (Andreas Noever)
    - Assign CardBus bus number only during the second pass (Andreas Noever)
    - Use request_resource_conflict() instead of insert_ for bus numbers (Andreas Noever)
    - Make sure bus number resources stay within their parents bounds (Andreas Noever)
    - Remove pci_fixup_parent_subordinate_busnr() (Andreas Noever)
    - Check for child busses which use more bus numbers than allocated (Andreas Noever)
    - Don't scan random busses in pci_scan_bridge() (Andreas Noever)
    - x86: Drop pcibios_scan_root() check for bus already scanned (Bjorn Helgaas)
    - x86: Use pcibios_scan_root() instead of pci_scan_bus_with_sysdata() (Bjorn Helgaas)
    - x86: Use pcibios_scan_root() instead of pci_scan_bus_on_node() (Bjorn Helgaas)
    - x86: Merge pci_scan_bus_on_node() into pcibios_scan_root() (Bjorn Helgaas)
    - x86: Drop return value of pcibios_scan_root() (Bjorn Helgaas)

    NUMA
    - x86: Add x86_pci_root_bus_node() to look up NUMA node from PCI bus (Bjorn Helgaas)
    - x86: Use x86_pci_root_bus_node() instead of get_mp_bus_to_node() (Bjorn Helgaas)
    - x86: Remove mp_bus_to_node[], set_mp_bus_to_node(), get_mp_bus_to_node() (Bjorn Helgaas)
    - x86: Use NUMA_NO_NODE, not -1, for unknown node (Bjorn Helgaas)
    - x86: Remove acpi_get_pxm() usage (Bjorn Helgaas)
    - ia64: Use NUMA_NO_NODE, not MAX_NUMNODES, for unknown node (Bjorn Helgaas)
    - ia64: Remove acpi_get_pxm() usage (Bjorn Helgaas)
    - ACPI: Fix acpi_get_node() prototype (Bjorn Helgaas)

    Resource management
    - i2o: Fix and refactor PCI space allocation (Bjorn Helgaas)
    - Add resource_contains() (Bjorn Helgaas)
    - Add %pR support for IORESOURCE_UNSET (Bjorn Helgaas)
    - Mark resources as IORESOURCE_UNSET if we can't assign them (Bjorn Helgaas)
    - Don't clear IORESOURCE_UNSET when updating BAR (Bjorn Helgaas)
    - Check IORESOURCE_UNSET before updating BAR (Bjorn Helgaas)
    - Don't try to claim IORESOURCE_UNSET resources (Bjorn Helgaas)
    - Mark 64-bit resource as IORESOURCE_UNSET if we only support 32-bit (Bjorn Helgaas)
    - Don't enable decoding if BAR hasn't been assigned an address (Bjorn Helgaas)
    - Add "weak" generic pcibios_enable_device() implementation (Bjorn Helgaas)
    - alpha, microblaze, sh, sparc, tile: Use default pcibios_enable_device() (Bjorn Helgaas)
    - s390: Use generic pci_enable_resources() (Bjorn Helgaas)
    - Don't check resource_size() in pci_bus_alloc_resource() (Bjorn Helgaas)
    - Set type in __request_region() (Bjorn Helgaas)
    - Check all IORESOURCE_TYPE_BITS in pci_bus_alloc_from_region() (Bjorn Helgaas)
    - Change pci_bus_alloc_resource() type_mask to unsigned long (Bjorn Helgaas)
    - Log IDE resource quirk in dmesg (Bjorn Helgaas)
    - Revert "[PATCH] Insert GART region into resource map" (Bjorn Helgaas)

    PCI device hotplug
    - Make check_link_active() non-static (Rajat Jain)
    - Use link change notifications for hot-plug and removal (Rajat Jain)
    - Enable link state change notifications (Rajat Jain)
    - Don't disable the link permanently during removal (Rajat Jain)
    - Don't check adapter or latch status while disabling (Rajat Jain)
    - Disable link notification across slot reset (Rajat Jain)
    - Ensure very fast hotplug events are also processed (Rajat Jain)
    - Add hotplug_lock to serialize hotplug events (Rajat Jain)
    - Remove a non-existent card, regardless of "surprise" capability (Rajat Jain)
    - Don't turn slot off when hot-added device already exists (Yijing Wang)

    MSI
    - Keep pci_enable_msi() documentation (Alexander Gordeev)
    - ahci: Fix broken single MSI fallback (Alexander Gordeev)
    - ahci, vfio: Use pci_enable_msi_range() (Alexander Gordeev)
    - Check kmalloc() return value, fix leak of name (Greg Kroah-Hartman)
    - Fix leak of msi_attrs (Greg Kroah-Hartman)
    - Fix pci_msix_vec_count() htmldocs failure (Masanari Iida)

    Virtualization
    - Device-specific ACS support (Alex Williamson)

    Freescale i.MX6
    - Wait for retraining (Marek Vasut)

    Marvell MVEBU
    - Use Device ID and revision from underlying endpoint (Andrew Lunn)
    - Fix incorrect size for PCI aperture resources (Jason Gunthorpe)
    - Call request_resource() on the apertures (Jason Gunthorpe)
    - Fix potential issue in range parsing (Jean-Jacques Hiblot)

    Renesas R-Car
    - Check platform_get_irq() return code (Ben Dooks)
    - Add error interrupt handling (Ben Dooks)
    - Fix bridge logic configuration accesses (Ben Dooks)
    - Register each instance independently (Magnus Damm)
    - Break out window size handling (Magnus Damm)
    - Make the Kconfig dependencies more generic (Magnus Damm)

    Synopsys DesignWare
    - Fix RC BAR to be single 64-bit non-prefetchable memory (Mohit Kumar)

    Miscellaneous
    - Remove unused SR-IOV VF Migration support (Bjorn Helgaas)
    - Enable INTx if BIOS left them disabled (Bjorn Helgaas)
    - Fix hex vs decimal typo in cpqhpc_probe() (Dan Carpenter)
    - Clean up par-arch object file list (Liviu Dudau)
    - Set IORESOURCE_ROM_SHADOW only for the default VGA device (Sander Eikelenboom)
    - ACPI, ARM, drm, powerpc, pcmcia, PCI: Use list_for_each_entry() for bus traversal (Yijing Wang)
    - Fix pci_bus_b() build failure (Paul Gortmaker)"

    * tag 'pci-v3.15-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci: (108 commits)
    Revert "[PATCH] Insert GART region into resource map"
    PCI: Log IDE resource quirk in dmesg
    PCI: Change pci_bus_alloc_resource() type_mask to unsigned long
    PCI: Check all IORESOURCE_TYPE_BITS in pci_bus_alloc_from_region()
    resources: Set type in __request_region()
    PCI: Don't check resource_size() in pci_bus_alloc_resource()
    s390/PCI: Use generic pci_enable_resources()
    tile PCI RC: Use default pcibios_enable_device()
    sparc/PCI: Use default pcibios_enable_device() (Leon only)
    sh/PCI: Use default pcibios_enable_device()
    microblaze/PCI: Use default pcibios_enable_device()
    alpha/PCI: Use default pcibios_enable_device()
    PCI: Add "weak" generic pcibios_enable_device() implementation
    PCI: Don't enable decoding if BAR hasn't been assigned an address
    PCI: Enable INTx in pci_reenable_device() only when MSI/MSI-X not enabled
    PCI: Mark 64-bit resource as IORESOURCE_UNSET if we only support 32-bit
    PCI: Don't try to claim IORESOURCE_UNSET resources
    PCI: Check IORESOURCE_UNSET before updating BAR
    PCI: Don't clear IORESOURCE_UNSET when updating BAR
    PCI: Mark resources as IORESOURCE_UNSET if we can't assign them
    ...

    Conflicts:
    arch/x86/include/asm/topology.h
    drivers/ata/ahci.c

    Linus Torvalds
     

04 Mar, 2014

1 commit

  • Commit bf6bddf1924e ("mm: introduce compaction and migration for
    ballooned pages") introduces page_count(page) into memory compaction
    which dereferences page->first_page if PageTail(page).

    This results in a very rare NULL pointer dereference on the
    aforementioned page_count(page). Indeed, anything that does
    compound_head(), including page_count(), is susceptible to racing with
    prep_compound_page() and seeing a NULL or dangling page->first_page
    pointer.

    This patch uses Andrea's implementation of compound_trans_head() that
    deals with such a race and makes it the default compound_head()
    implementation. This includes a read memory barrier that ensures that,
    if PageTail(page) is true, we return a head page that is neither
    NULL nor dangling. The patch then adds a store memory barrier to
    prep_compound_page() to ensure page->first_page is set.

    This is the safest way to ensure we see the head page that we are
    expecting; PageTail(page) is already in the unlikely() path, and the
    memory barriers are unfortunately required.

    Hugetlbfs is the exception: we don't enforce a store memory barrier
    during init since no race is possible.
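
    A minimal sketch of the barrier pairing described above, assuming the
    pre-3.15 struct page layout with a ->first_page field (helper names
    follow the mainline code of that era, trimmed for illustration):

        static inline struct page *compound_head(struct page *page)
        {
                if (unlikely(PageTail(page))) {
                        struct page *head = page->first_page;

                        /*
                         * Pairs with the smp_wmb() in prep_compound_page():
                         * only trust head if the page is still a tail page
                         * after the barrier.
                         */
                        smp_rmb();
                        if (likely(PageTail(page)))
                                return head;
                }
                return page;
        }

        /* In prep_compound_page(), for each tail page p: */
        p->first_page = page;
        /* Make ->first_page visible before PageTail(p) becomes true. */
        smp_wmb();
        __SetPageTail(p);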

    Signed-off-by: David Rientjes
    Cc: Holger Kiehl
    Cc: Christoph Lameter
    Cc: Rafael Aquini
    Cc: Vlastimil Babka
    Cc: Michal Hocko
    Cc: Mel Gorman
    Cc: Andrea Arcangeli
    Cc: Rik van Riel
    Cc: "Kirill A. Shutemov"
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

27 Feb, 2014

3 commits

  • This lets us check extensions, particularly VFIO_DMA_CC_IOMMU using
    the external user interface, allowing KVM to probe IOMMU coherency.

    Signed-off-by: Alex Williamson

    Alex Williamson
     
  • Now that the type1 IOMMU backend can support IOMMU_CACHE, we need to
    be able to test whether coherency is currently enforced. Add an
    extension for this.
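
    A hedged userspace sketch of probing this extension through the
    regular container ioctl (the group attach and VFIO_SET_IOMMU steps
    that would normally precede it are omitted):

        #include <fcntl.h>
        #include <stdio.h>
        #include <sys/ioctl.h>
        #include <linux/vfio.h>

        int main(void)
        {
                int container = open("/dev/vfio/vfio", O_RDWR);

                if (container < 0)
                        return 1;

                /* Non-zero means every IOMMU domain in this container
                 * currently enforces DMA cache coherency. */
                printf("VFIO_DMA_CC_IOMMU: %d\n",
                       ioctl(container, VFIO_CHECK_EXTENSION,
                             VFIO_DMA_CC_IOMMU));
                return 0;
        }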

    Signed-off-by: Alex Williamson

    Alex Williamson
     
  • We currently have a problem that we cannot support advanced features
    of an IOMMU domain (ex. IOMMU_CACHE), because we have no guarantee
    that those features will be supported by all of the hardware units
    involved with the domain over its lifetime. For instance, the Intel
    VT-d architecture does not require that all DRHDs support snoop
    control. If we create a domain based on a device behind a DRHD that
    does support snoop control and enable SNP support via the IOMMU_CACHE
    mapping option, we cannot then add a device behind a DRHD which does
    not support snoop control or we'll get reserved bit faults from the
    SNP bit in the pagetables. To add to the complexity, we can't know
    the properties of a domain until a device is attached.

    We could pass this problem off to userspace and require that a
    separate vfio container be used, but we don't know how to handle page
    accounting in that case. How do we know that a page pinned in one
    container is the same page as one pinned in a different container, so
    that we avoid double billing the user for the page?

    The solution is therefore to support multiple IOMMU domains per
    container. In the majority of cases, only one domain will be required
    since hardware is typically consistent within a system. However, this
    provides us the ability to validate compatibility of domains and
    support mixed environments where page table flags can be different
    between domains.

    To do this, our DMA tracking needs to change. We currently try to
    coalesce user mappings into as few tracking entries as possible. The
    problem then becomes that we lose granularity of user mappings. We've
    never guaranteed that a user is able to unmap at a finer granularity
    than the original mapping, but we must honor the granularity of the
    original mapping. This coalescing code is therefore removed, allowing
    only unmaps covering complete maps. The change in accounting is
    fairly small here: a typical QEMU VM will start out with roughly a
    dozen entries, so it's arguable whether this coalescing was ever needed.

    We also move IOMMU domain creation to the point where a group is
    attached to the container. An interesting side-effect of this is that
    we now have access to the device at the time of domain creation and
    can probe the devices within the group to determine the bus_type.
    This finally makes vfio_iommu_type1 completely device/bus agnostic.
    In fact, each IOMMU domain can host devices on different buses managed
    by different physical IOMMUs, and present a single DMA mapping
    interface to the user. When a new domain is created, mappings are
    replayed to bring the IOMMU pagetables up to the state of the current
    container. And of course, DMA mapping and unmapping automatically
    traverse all of the configured IOMMU domains.
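
    A hedged sketch of the user-visible flow (group paths and sizes are
    illustrative; error handling omitted). The point is that both groups
    attach to one container and see one DMA mapping interface, even if
    they land in different IOMMU domains internally:

        int container = open("/dev/vfio/vfio", O_RDWR);
        int group_a = open("/dev/vfio/26", O_RDWR);   /* example group */
        int group_b = open("/dev/vfio/27", O_RDWR);   /* example group */

        ioctl(group_a, VFIO_GROUP_SET_CONTAINER, &container);
        ioctl(container, VFIO_SET_IOMMU, VFIO_TYPE1_IOMMU);

        struct vfio_iommu_type1_dma_map map = {
                .argsz = sizeof(map),
                .flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE,
                .vaddr = (__u64)(uintptr_t)buffer,   /* user memory */
                .iova  = 0,
                .size  = map_size,
        };
        ioctl(container, VFIO_IOMMU_MAP_DMA, &map);

        /* Attaching another group may create a second IOMMU domain;
         * existing mappings are replayed into it automatically. */
        ioctl(group_b, VFIO_GROUP_SET_CONTAINER, &container);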

    Signed-off-by: Alex Williamson
    Cc: Varun Sethi

    Alex Williamson
     

28 Jan, 2014

1 commit

  • Pull powerpc updates from Ben Herrenschmidt:
    "So here's my next branch for powerpc. A bit late as I was on vacation
    last week. It's mostly the same stuff that was in next already, I
    just added two patches today which are the wiring up of lockref for
    powerpc, which for some reason fell through the cracks last time and
    is trivial.

    The highlights are, in addition to a bunch of bug fixes:

    - Reworked Machine Check handling on kernels running without a
    hypervisor (or acting as a hypervisor). Provides hooks to handle
    some errors in real mode such as TLB errors, handle SLB errors,
    etc...

    - Support for retrieving memory error information from the service
    processor on IBM servers running without a hypervisor and routing
    them to the memory poison infrastructure.

    - _PAGE_NUMA support on server processors

    - 32-bit BookE relocatable kernel support

    - FSL e6500 hardware tablewalk support

    - A bunch of new/revived board support

    - FSL e6500 deeper idle states and altivec powerdown support

    You'll notice a generic mm change here, it has been acked by the
    relevant authorities and is a pre-req for our _PAGE_NUMA support"

    * 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc: (121 commits)
    powerpc: Implement arch_spin_is_locked() using arch_spin_value_unlocked()
    powerpc: Add support for the optimised lockref implementation
    powerpc/powernv: Call OPAL sync before kexec'ing
    powerpc/eeh: Escalate error on non-existing PE
    powerpc/eeh: Handle multiple EEH errors
    powerpc: Fix transactional FP/VMX/VSX unavailable handlers
    powerpc: Don't corrupt transactional state when using FP/VMX in kernel
    powerpc: Reclaim two unused thread_info flag bits
    powerpc: Fix races with irq_work
    Move precessing of MCE queued event out from syscall exit path.
    pseries/cpuidle: Remove redundant call to ppc64_runlatch_off() in cpu idle routines
    powerpc: Make add_system_ram_resources() __init
    powerpc: add SATA_MV to ppc64_defconfig
    powerpc/powernv: Increase candidate fw image size
    powerpc: Add debug checks to catch invalid cpu-to-node mappings
    powerpc: Fix the setup of CPU-to-Node mappings during CPU online
    powerpc/iommu: Don't detach device without IOMMU group
    powerpc/eeh: Hotplug improvement
    powerpc/eeh: Call opal_pci_reinit() on powernv for restoring config space
    powerpc/eeh: Add restore_config operation
    ...

    Linus Torvalds
     

16 Jan, 2014

1 commit

  • PCI resets will attempt to take the device_lock for any device to be
    reset. This is a problem if that lock is already held, for instance
    in the device remove path. It's not sufficient to simply kill the
    user process or skip the reset if called after .remove as a race could
    result in the same deadlock. Instead, we handle all resets as "best
    effort" using the PCI "try" reset interfaces. This prevents the user
    from being able to induce a deadlock by triggering a reset.
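
    A short sketch of the "best effort" idea, using the try-lock reset
    helper on the PCI side (fragment; vdev->pdev is the vfio-pci device):

        /* pci_try_reset_function() takes the device lock with trylock
         * semantics, so a concurrent .remove holding device_lock makes
         * this fail with -EAGAIN instead of deadlocking. */
        ret = pci_try_reset_function(vdev->pdev);
        if (ret)
                return ret;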

    Signed-off-by: Alex Williamson
    Signed-off-by: Bjorn Helgaas

    Alex Williamson
     

15 Jan, 2014

1 commit

  • device_lock is much too prone to lockups. For instance if we have a
    pending .remove then device_lock is already held. If userspace
    attempts to modify AER signaling after that point, a deadlock occurs.
    eventfd setup/teardown is already protected in vfio with the igate
    mutex. AER is not a high performance interrupt, so we can also use
    the same mutex to protect signaling versus setup races.
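
    A hedged fragment of the signaling side under that mutex, assuming
    the AER eventfd context lives in vdev->err_trigger as the vfio-pci
    AER support added:

        mutex_lock(&vdev->igate);
        if (vdev->err_trigger)
                eventfd_signal(vdev->err_trigger, 1);
        mutex_unlock(&vdev->igate);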

    Signed-off-by: Alex Williamson

    Alex Williamson
     

20 Dec, 2013

1 commit

  • This change allows us to support module auto loading using devname
    support in userspace tools. With this, /dev/vfio/vfio will always
    be present and opening it will cause the vfio module to load. This
    should avoid needing to configure the system to statically load
    vfio in order to get libvirt to correctly detect support for it.
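
    A hedged sketch of the wiring that makes this work: a misc device
    with a fixed minor plus devname/module aliases, so the node can be
    created up front and opening it loads the module on demand:

        static struct miscdevice vfio_dev = {
                .minor = VFIO_MINOR,       /* fixed minor from miscdevice.h */
                .name = "vfio",
                .fops = &vfio_fops,
                .nodename = "vfio/vfio",
                .mode = S_IRUGO | S_IWUGO,
        };

        MODULE_ALIAS_MISCDEV(VFIO_MINOR);
        MODULE_ALIAS("devname:vfio/vfio");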

    Suggested-by: Paolo Bonzini
    Signed-off-by: Alex Williamson

    Alex Williamson
     

12 Oct, 2013

1 commit

  • In vfio_iommu_type1.c there is a bug in vfio_dma_do_map, when checking
    that pages are not already mapped. Since the check is being done in a
    for loop nested within the main loop, breaking out of it does not create
    the intended behavior. If the underlying IOMMU driver returns a non-NULL
    value, this will be ignored and mapping the DMA range will be attempted
    anyway, leading to unpredictable behavior.

    This interacts badly with the ARM SMMU driver issue fixed in the patch
    that was submitted with the title:
    "[PATCH 2/2] ARM: SMMU: return NULL on error in arm_smmu_iova_to_phys"
    Both fixes are required in order to use the vfio_iommu_type1 driver
    with an ARM SMMU.

    This patch refactors the function slightly, in order to also make this
    kind of bug less likely.
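
    In rough terms, the check belongs in its own helper so its result
    actually gates the mapping (hypothetical helper name; the real patch
    restructures vfio_dma_do_map() along these lines):

        /* Return true if any page in [iova, iova + size) is already
         * mapped in the IOMMU domain. */
        static bool vfio_range_mapped(struct iommu_domain *domain,
                                      dma_addr_t iova, size_t size)
        {
                dma_addr_t addr;

                for (addr = iova; addr < iova + size; addr += PAGE_SIZE)
                        if (iommu_iova_to_phys(domain, addr))
                                return true;
                return false;
        }

        /* ... in vfio_dma_do_map(): */
        if (vfio_range_mapped(domain, iova, size))
                return -EEXIST;   /* refuse, instead of mapping anyway */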

    Signed-off-by: Antonios Motakis
    Signed-off-by: Alex Williamson

    Antonios Motakis
     

05 Sep, 2013

2 commits

  • The current VFIO_DEVICE_RESET interface only maps to PCI use cases
    where we can isolate the reset to the individual PCI function. This
    means the device must support FLR (PCIe or AF), PM reset on D3hot->D0
    transition, device specific reset, or be a singleton device on a bus
    for a secondary bus reset. FLR does not have widespread support,
    PM reset is not very reliable, and bus topology is dictated by the
    system and device design. We need to provide a means for a user to
    induce a bus reset in cases where the existing mechanisms are not
    available or not reliable.

    This device specific extension to VFIO provides the user with this
    ability. Two new ioctls are introduced:
    - VFIO_DEVICE_PCI_GET_HOT_RESET_INFO
    - VFIO_DEVICE_PCI_HOT_RESET

    The first provides the user with information about the extent of
    devices affected by a hot reset. This is essentially a list of
    devices and the IOMMU groups they belong to. The user may then
    initiate a hot reset by calling the second ioctl. We must be
    careful that the user has ownership of all the affected devices
    found via the first ioctl, so the second ioctl takes a list of file
    descriptors for the VFIO groups affected by the reset. Each group
    must have IOMMU protection established for the ioctl to succeed.
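
    A hedged userspace fragment of the two-step flow (device_fd and
    group_fd are assumed to be already-open VFIO device and group file
    descriptors; a single affected group is assumed for brevity):

        char buf[sizeof(struct vfio_pci_hot_reset_info) +
                 32 * sizeof(struct vfio_pci_dependent_device)];
        struct vfio_pci_hot_reset_info *info = (void *)buf;
        struct vfio_pci_dependent_device *devs = (void *)(info + 1);
        struct vfio_pci_hot_reset *reset;
        unsigned int i;

        memset(buf, 0, sizeof(buf));
        info->argsz = sizeof(buf);
        if (ioctl(device_fd, VFIO_DEVICE_PCI_GET_HOT_RESET_INFO, info))
                return -1;

        for (i = 0; i < info->count; i++)
                printf("affected: %04x:%02x:%02x.%x (group %u)\n",
                       devs[i].segment, devs[i].bus,
                       devs[i].devfn >> 3, devs[i].devfn & 7,
                       devs[i].group_id);

        /* Prove ownership of every affected group by passing its fd. */
        reset = calloc(1, sizeof(*reset) + sizeof(__s32));
        reset->argsz = sizeof(*reset) + sizeof(__s32);
        reset->count = 1;
        reset->group_fds[0] = group_fd;
        return ioctl(device_fd, VFIO_DEVICE_PCI_HOT_RESET, reset);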

    Signed-off-by: Alex Williamson

    Alex Williamson
     
  • Having PCIe/PCI-X capability isn't enough to assume that there are
    extended capabilities. Both specs define that the first capability
    header is all zero if there are no extended capabilities. Testing
    for this avoids an erroneous message about hiding capability 0x0 at
    offset 0x100.
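
    The test itself is tiny; a kernel-side sketch of the idea (fragment):

        u32 header;

        /* Extended config space begins at offset 0x100.  An all-zero
         * first header means there are no extended capabilities, so
         * don't walk (or warn about) the list at all. */
        if (pci_read_config_dword(pdev, PCI_CFG_SPACE_SIZE, &header) ||
            header == 0)
                return 0;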

    Signed-off-by: Alex Williamson

    Alex Williamson
     

28 Aug, 2013

1 commit

  • eventfd_fget() tests to see whether the file is an eventfd file, which
    we then immediately pass to eventfd_ctx_fileget(), which again tests
    whether the file is an eventfd file. Simplify slightly by using
    fdget() so that we only test that we're looking at an eventfd once.
    fget() could also be used, but fdget() makes use of fget_light() for
    another slight optimization.
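
    The simplified pattern looks roughly like this (fragment; fd is the
    eventfd file descriptor passed in by userspace):

        struct eventfd_ctx *ctx;
        struct fd irqfd = fdget(fd);

        if (!irqfd.file)
                return -EBADF;

        /* eventfd_ctx_fileget() verifies this really is an eventfd, so
         * a separate eventfd_fget() check would just repeat the test. */
        ctx = eventfd_ctx_fileget(irqfd.file);
        fdput(irqfd);
        if (IS_ERR(ctx))
                return PTR_ERR(ctx);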

    Signed-off-by: Alex Williamson

    Alex Williamson
     

23 Aug, 2013

2 commits

  • Add the default O_CLOEXEC flag for device file descriptors. This is
    generally considered a safer option as it gives the user a race-free
    way to decide whether file descriptors are inherited across exec,
    with the default avoiding file descriptor leaks.
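
    Concretely, this amounts to requesting the flag when the device fd is
    created (fragment; the fops and name string are illustrative):

        ret = anon_inode_getfd("[vfio-device]", &vfio_device_fops,
                               device, O_RDWR | O_CLOEXEC);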

    Reported-by: Yann Droneaud
    Signed-off-by: Alex Williamson

    Alex Williamson
     
  • Macro get_unused_fd() is used to allocate a file descriptor with
    default flags. Those default flags (0) can be "unsafe":
    O_CLOEXEC must be used by default to not leak file descriptor
    across exec().

    Instead of macro get_unused_fd(), functions anon_inode_getfd()
    or get_unused_fd_flags() should be used with flags given by userspace.
    If not possible, flags should be set to O_CLOEXEC to provide userspace
    with a default safe behavior.

    In a further patch, get_unused_fd() will be removed so that
    new code starts using anon_inode_getfd() or get_unused_fd_flags()
    with correct flags.

    This patch replaces calls to get_unused_fd() with equivalent calls to
    get_unused_fd_flags(0) to preserve current behavior for existing code.

    The hard coded flag value (0) should be reviewed on a per-subsystem basis,
    and, if possible, set to O_CLOEXEC.

    Signed-off-by: Yann Droneaud
    Link: http://lkml.kernel.org/r/cover.1376327678.git.ydroneaud@opteya.com
    Signed-off-by: Alex Williamson

    Yann Droneaud
     

06 Aug, 2013

1 commit

  • VFIO is designed to be used via ioctls on file descriptors
    returned by VFIO.

    However in some situations support for an external user is required.
    The first user is KVM on PPC64 (SPAPR TCE protocol) which is going to
    use the existing VFIO groups for exclusive access in real/virtual mode
    on a host to avoid passing map/unmap requests to user space, which
    would make things pretty slow.

    The protocol includes:

    1. do normal VFIO init operation:
    - opening a new container;
    - attaching group(s) to it;
    - setting an IOMMU driver for a container.
    When IOMMU is set for a container, all groups in it are
    considered ready to use by an external user.

    2. User space passes a group fd to an external user.
    The external user calls vfio_group_get_external_user()
    to verify that:
    - the group is initialized;
    - IOMMU is set for it.
    If both checks pass, vfio_group_get_external_user()
    increments the container user counter to prevent
    the VFIO group from disposal before KVM exits.

    3. The external user calls vfio_external_user_iommu_id()
    to know an IOMMU ID. PPC64 KVM uses it to link logical bus
    number (LIOBN) with IOMMU ID.

    4. When the external KVM finishes, it calls
    vfio_group_put_external_user() to release the VFIO group.
    This call decrements the container user counter.
    Everything gets released.

    The "vfio: Limit group opens" patch is also required for the consistency.

    Signed-off-by: Alexey Kardashevskiy
    Signed-off-by: Alex Williamson

    Alexey Kardashevskiy
     

25 Jul, 2013

3 commits

  • If an attempt is made to unbind a device from vfio-pci while that
    device is in use, the request is blocked until the device becomes
    unused. Unfortunately, that unbind path still grabs the device_lock,
    which certain things like __pci_reset_function() also want to take.
    This means we need to try to acquire the locks ourselves and use the
    pre-locked version, __pci_reset_function_locked().
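
    A hedged fragment of that pattern (the lock order mirrors what
    pci_reset_function() itself takes):

        bool reset_done = false;

        if (pci_cfg_access_trylock(pdev)) {
                if (device_trylock(&pdev->dev)) {
                        if (!__pci_reset_function_locked(pdev))
                                reset_done = true;
                        device_unlock(&pdev->dev);
                }
                pci_cfg_access_unlock(pdev);
        }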

    Signed-off-by: Alex Williamson

    Alex Williamson
     
  • Remove debugging WARN_ON if we get a spurious notify for a group that
    no longer exists. No reports of anyone hitting this, but it would
    likely be a race and not a bug if they did.

    Signed-off-by: Alex Williamson

    Alex Williamson
     
  • BUS_NOTIFY_DEL_DEVICE triggers IOMMU drivers to remove devices from
    their iommu group, but there's really nothing we can do about it at
    this point. If the device is in use, then the vfio sub-driver will
    block the device_del from completing until it's released. If the
    device is not in use or not owned by a vfio sub-driver, then we
    really don't care that it's being removed.

    The current code can be triggered just by unloading an sr-iov driver
    (ex. igb) while the VFs are attached to vfio-pci because it makes an
    incorrect assumption about the ordering of driver remove callbacks
    vs the DEL_DEVICE notification.

    Signed-off-by: Alex Williamson

    Alex Williamson
     

11 Jul, 2013

1 commit

  • Pull vfio updates from Alex Williamson:
    "Largely hugepage support for vfio/type1 iommu and surrounding cleanups
    and fixes"

    * tag 'vfio-v3.11' of git://github.com/awilliam/linux-vfio:
    vfio/type1: Fix leak on error path
    vfio: Limit group opens
    vfio/type1: Fix missed frees and zero sized removes
    vfio: fix documentation
    vfio: Provide module option to disable vfio_iommu_type1 hugepage support
    vfio: hugepage support for vfio_iommu_type1
    vfio: Convert type1 iommu to use rbtree

    Linus Torvalds
     

05 Jul, 2013

1 commit

  • Pull powerpc updates from Ben Herrenschmidt:
    "This is the powerpc changes for the 3.11 merge window. In addition to
    the usual bug fixes and small updates, the main highlights are:

    - Support for transparent huge pages by Aneesh Kumar for 64-bit
    server processors. This allows the use of 16M pages as transparent
    huge pages on kernels compiled with a 64K base page size.

    - Base VFIO support for KVM on power by Alexey Kardashevskiy

    - Wiring up of our nvram to the pstore infrastructure, including
    putting compressed oopses in there by Aruna Balakrishnaiah

    - Move, rework and improve our "EEH" (basically PCI error handling
    and recovery) infrastructure. It is no longer specific to pseries
    but is now usable by the new "powernv" platform as well (no
    hypervisor) by Gavin Shan.

    - I fixed some bugs in our math-emu instruction decoding and made it
    usable to emulate some optional FP instructions on processors with
    hard FP that lack them (such as fsqrt on Freescale embedded
    processors).

    - Support for Power8 "Event Based Branch" facility by Michael
    Ellerman. This facility allows what is basically "userspace
    interrupts" for performance monitor events.

    - A bunch of Transactional Memory vs. Signals bug fixes and HW
    breakpoint/watchpoint fixes by Michael Neuling.

    And more ... I apologize in advance if I've failed to highlight
    something that somebody deemed worth it."

    * 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc: (156 commits)
    pstore: Add hsize argument in write_buf call of pstore_ftrace_call
    powerpc/fsl: add MPIC timer wakeup support
    powerpc/mpic: create mpic subsystem object
    powerpc/mpic: add global timer support
    powerpc/mpic: add irq_set_wake support
    powerpc/85xx: enable coreint for all the 64bit boards
    powerpc/8xx: Erroneous double irq_eoi() on CPM IRQ in MPC8xx
    powerpc/fsl: Enable CONFIG_E1000E in mpc85xx_smp_defconfig
    powerpc/mpic: Add get_version API both for internal and external use
    powerpc: Handle both new style and old style reserve maps
    powerpc/hw_brk: Fix off by one error when validating DAWR region end
    powerpc/pseries: Support compression of oops text via pstore
    powerpc/pseries: Re-organise the oops compression code
    pstore: Pass header size in the pstore write callback
    powerpc/powernv: Fix iommu initialization again
    powerpc/pseries: Inform the hypervisor we are using EBB regs
    powerpc/perf: Add power8 EBB support
    powerpc/perf: Core EBB support for 64-bit book3s
    powerpc/perf: Drop MMCRA from thread_struct
    powerpc/perf: Don't enable if we have zero events
    ...

    Linus Torvalds
     

26 Jun, 2013

2 commits

  • vfio_group_fops_open attempts to limit concurrent sessions by
    disallowing opens once group->container is set. This really doesn't
    do what we want and allows for inconsistent behavior; for instance, a
    group can be opened twice, then a container set giving the user two
    file descriptors to the group. But then it won't allow more to be
    opened. There's not much reason to have the group opened multiple
    times since most access is through devices or the container, so
    complete what the original code intended and only allow a single
    instance.
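
    The single-instance check reduces to a small fragment (assuming an
    'opened' atomic in struct vfio_group, as the patch title suggests):

        /* in vfio_group_fops_open(): */
        if (atomic_cmpxchg(&group->opened, 0, 1))
                return -EBUSY;          /* only one open at a time */

        /* in vfio_group_fops_release(): */
        atomic_set(&group->opened, 0);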

    Signed-off-by: Alex Williamson

    Alex Williamson
     
  • With hugepage support we can only unmap properly aligned and sized ranges.
    We only guarantee that we can unmap the same ranges mapped and not
    arbitrary sub-ranges. This means we might not free anything or might
    free more than requested. The vfio unmap interface started storing
    the unmapped size to return to userspace to handle this. This patch
    fixes a few places where we don't properly handle those cases, moves
    a memory allocation to a place where failure is an option and checks
    our loops to make sure we don't get into an infinite loop trying to
    remove an overlap.

    Signed-off-by: Alex Williamson

    Alex Williamson
     

21 Jun, 2013

3 commits

  • Add a module option to vfio_iommu_type1 to disable IOMMU hugepage
    support. This causes iommu_map to only be called with single page
    mappings, disabling the IOMMU driver's ability to use hugepages.
    This option can be enabled by loading vfio_iommu_type1 with
    disable_hugepages=1 or dynamically through sysfs. If enabled
    dynamically, only new mappings are restricted.
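
    A sketch of how such an option is typically declared (the S_IWUSR
    permission is what makes the dynamic sysfs toggle possible):

        static bool disable_hugepages;
        module_param_named(disable_hugepages, disable_hugepages, bool,
                           S_IRUGO | S_IWUSR);
        MODULE_PARM_DESC(disable_hugepages,
                         "Disable VFIO IOMMU support for IOMMU hugepages.");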

    Signed-off-by: Alex Williamson

    Alex Williamson
     
  • We currently send all mappings to the iommu in PAGE_SIZE chunks,
    which prevents the iommu from enabling support for larger page sizes.
    We still need to pin pages, which means we step through them in
    PAGE_SIZE chunks, but we can batch up contiguous physical memory
    chunks to allow the iommu the opportunity to use larger pages. The
    approach here is a bit different from the one currently used for
    legacy KVM device assignment. Rather than looking at the vma page
    size and using that as the maximum size to pass to the iommu, we
    instead simply look at whether the next page is physically
    contiguous. This means we might ask the iommu to map a 4MB region,
    while legacy KVM might limit itself to a maximum of 2MB.

    Splitting our mapping path also allows us to be smarter about locked
    memory because we can more easily unwind if the user attempts to
    exceed the limit. Therefore, rather than assuming that a mapping
    will result in locked memory, we test each page as it is pinned to
    determine whether it locks RAM vs an mmap'd MMIO region. This should
    result in better locking granularity and less locked page fudge
    factors in userspace.

    The unmap path uses the same algorithm as legacy KVM. We don't want
    to track the pfn for each mapping ourselves, but we need the pfn in
    order to unpin pages. We therefore ask the iommu for the iova to
    physical address translation, ask it to unpin a page, and see how many
    pages were actually unpinned. iommus supporting large pages will
    often return something bigger than a page here, which we know will be
    physically contiguous and we can unpin a batch of pfns. iommus not
    supporting large mappings won't see an improvement in batching here as
    they only unmap a page at a time.

    With this change, we also make a clarification to the API for mapping
    and unmapping DMA. We can only guarantee unmaps at the same
    granularity as used for the original mapping. In other words,
    unmapping a subregion of a previous mapping is not guaranteed and may
    result in a larger or smaller unmapping than requested. The size
    field in the unmapping structure is updated to reflect this.
    Previously this was unmodified on mapping, always returning the
    requested unmap size. This is now updated to return the actual unmap
    size on success, allowing userspace to appropriately track mappings.
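
    From userspace the clarified contract looks roughly like this
    (fragment; container, iova and size are assumed to describe an
    existing mapping):

        struct vfio_iommu_type1_dma_unmap unmap = {
                .argsz = sizeof(unmap),
                .iova  = iova,
                .size  = size,
        };

        if (ioctl(container, VFIO_IOMMU_UNMAP_DMA, &unmap))
                return -1;

        /* On return, .size is the number of bytes actually unmapped,
         * which may differ from the request when hugepages are in play. */
        printf("unmapped %llu bytes\n", (unsigned long long)unmap.size);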

    Signed-off-by: Alex Williamson

    Alex Williamson
     
  • We need to keep track of all the DMA mappings of an iommu container so
    that it can be automatically unmapped when the user releases the file
    descriptor. We currently do this using a simple list, where we merge
    entries with contiguous iovas and virtual addresses. Using a tree for
    this is a bit more efficient and allows us to use common code instead
    of inventing our own.
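
    A hedged sketch of the rbtree lookup this enables (struct and field
    names are illustrative; the real tree is keyed by iova the same way):

        struct vfio_dma {
                struct rb_node  node;
                dma_addr_t      iova;
                size_t          size;
        };

        static struct vfio_dma *vfio_find_dma(struct rb_root *root,
                                              dma_addr_t start, size_t size)
        {
                struct rb_node *n = root->rb_node;

                while (n) {
                        struct vfio_dma *dma =
                                rb_entry(n, struct vfio_dma, node);

                        if (start + size <= dma->iova)
                                n = n->rb_left;
                        else if (start >= dma->iova + dma->size)
                                n = n->rb_right;
                        else
                                return dma;     /* overlap found */
                }
                return NULL;
        }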

    Signed-off-by: Alex Williamson

    Alex Williamson
     

20 Jun, 2013

2 commits

  • This enables VFIO on the pSeries platform, allowing user space
    programs to access PCI devices directly.

    Signed-off-by: Alexey Kardashevskiy
    Cc: David Gibson
    Signed-off-by: Paul Mackerras
    Acked-by: Alex Williamson
    Signed-off-by: Benjamin Herrenschmidt

    Alexey Kardashevskiy
     
  • VFIO implements platform independent stuff such as
    a PCI driver, BAR access (via read/write on a file descriptor
    or direct mapping when possible) and IRQ signaling.

    The platform dependent part includes IOMMU initialization
    and handling. This implements an IOMMU driver for VFIO
    which does mapping/unmapping pages for the guest IO and
    provides information about the DMA window (required by a POWER
    guest).
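
    From userspace, the DMA window information is retrieved much like the
    type1 flow, just with the SPAPR backend (fragment; group attachment
    omitted):

        struct vfio_iommu_spapr_tce_info info = { .argsz = sizeof(info) };

        ioctl(container, VFIO_SET_IOMMU, VFIO_SPAPR_TCE_IOMMU);
        ioctl(container, VFIO_IOMMU_SPAPR_TCE_GET_INFO, &info);
        printf("DMA32 window: start 0x%x size 0x%x\n",
               info.dma32_window_start, info.dma32_window_size);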

    Cc: David Gibson
    Signed-off-by: Alexey Kardashevskiy
    Signed-off-by: Paul Mackerras
    Acked-by: Alex Williamson
    Signed-off-by: Benjamin Herrenschmidt

    Alexey Kardashevskiy
     

03 May, 2013

1 commit

  • Pull vfio updates from Alex Williamson:
    "Changes include extension to support PCI AER notification to
    userspace, byte granularity of PCI config space and access to
    unarchitected PCI config space, better protection around IOMMU driver
    accesses, default file mode fix, and a few misc cleanups."

    * tag 'vfio-for-v3.10' of git://github.com/awilliam/linux-vfio:
    vfio: Set container device mode
    vfio: Use down_reads to protect iommu disconnects
    vfio: Convert container->group_lock to rwsem
    PCI/VFIO: use pcie_flags_reg instead of access PCI-E Capabilities Register
    vfio-pci: Enable raw access to unassigned config space
    vfio-pci: Use byte granularity in config map
    vfio: make local function vfio_pci_intx_unmask_handler() static
    VFIO-AER: Vfio-pci driver changes for supporting AER
    VFIO: Wrapper for getting reference to vfio_device

    Linus Torvalds
     

01 May, 2013

1 commit

  • Minor 0 is the VFIO container device (/dev/vfio/vfio). On its own
    the container does not provide a user with any privileged access. It
    only supports API version check and extension check ioctls. Only by
    attaching a VFIO group to the container does it gain any access. Set
    the mode of the container to allow access.
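
    A hedged sketch of how the mode ends up on the node, via the class
    devnode callback (only the container, minor 0, is assumed to be made
    world-accessible):

        static char *vfio_devnode(struct device *dev, umode_t *mode)
        {
                if (mode && MINOR(dev->devt) == 0)
                        *mode = S_IRUGO | S_IWUGO;

                return kasprintf(GFP_KERNEL, "vfio/%s", dev_name(dev));
        }

        /* at init time: */
        vfio.class->devnode = vfio_devnode;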

    Signed-off-by: Alex Williamson

    Alex Williamson
     

30 Apr, 2013

1 commit

  • Pull PCI updates from Bjorn Helgaas:
    "PCI changes for the v3.10 merge window:

    PCI device hotplug
    - Remove ACPI PCI subdrivers (Jiang Liu, Myron Stowe)
    - Make acpiphp builtin only, not modular (Jiang Liu)
    - Add acpiphp mutual exclusion (Jiang Liu)

    Power management
    - Skip "PME enabled/disabled" messages when not supported (Rafael
    Wysocki)
    - Fix fallback to PCI_D0 (Rafael Wysocki)

    Miscellaneous
    - Factor quirk_io_region (Yinghai Lu)
    - Cache MSI capability offsets & cleanup (Gavin Shan, Bjorn Helgaas)
    - Clean up EISA resource initialization and logging (Bjorn Helgaas)
    - Fix prototype warnings (Andy Shevchenko, Bjorn Helgaas)
    - MIPS: Initialize of_node before scanning bus (Gabor Juhos)
    - Fix pcibios_get_phb_of_node() declaration "weak" annotation (Gabor
    Juhos)
    - Add MSI INTX_DISABLE quirks for AR8161/AR8162/etc (Xiong Huang)
    - Fix aer_inject return values (Prarit Bhargava)
    - Remove PME/ACPI dependency (Andrew Murray)
    - Use shared PCI_BUS_NUM() and PCI_DEVID() (Shuah Khan)"

    * tag 'pci-v3.10-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci: (63 commits)
    vfio-pci: Use cached MSI/MSI-X capabilities
    vfio-pci: Use PCI_MSIX_TABLE_BIR, not PCI_MSIX_FLAGS_BIRMASK
    PCI: Remove "extern" from function declarations
    PCI: Use PCI_MSIX_TABLE_BIR, not PCI_MSIX_FLAGS_BIRMASK
    PCI: Drop msi_mask_reg() and remove drivers/pci/msi.h
    PCI: Use msix_table_size() directly, drop multi_msix_capable()
    PCI: Drop msix_table_offset_reg() and msix_pba_offset_reg() macros
    PCI: Drop is_64bit_address() and is_mask_bit_support() macros
    PCI: Drop msi_data_reg() macro
    PCI: Drop msi_lower_address_reg() and msi_upper_address_reg() macros
    PCI: Drop msi_control_reg() macro and use PCI_MSI_FLAGS directly
    PCI: Use cached MSI/MSI-X offsets from dev, not from msi_desc
    PCI: Clean up MSI/MSI-X capability #defines
    PCI: Use cached MSI-X cap while enabling MSI-X
    PCI: Use cached MSI cap while enabling MSI interrupts
    PCI: Remove MSI/MSI-X cap check in pci_msi_check_device()
    PCI: Cache MSI/MSI-X capability offsets in struct pci_dev
    PCI: Use u8, not int, for PM capability offset
    [SCSI] megaraid_sas: Use correct #define for MSI-X capability
    PCI: Remove "extern" from function declarations
    ...

    Linus Torvalds