08 Mar, 2020

1 commit

  • Merge Linux stable release v5.4.24 into imx_5.4.y

    * tag 'v5.4.24': (3306 commits)
    Linux 5.4.24
    blktrace: Protect q->blk_trace with RCU
    kvm: nVMX: VMWRITE checks unsupported field before read-only field
    ...

    Signed-off-by: Jason Liu

    Conflicts:
    arch/arm/boot/dts/imx6sll-evk.dts
    arch/arm/boot/dts/imx7ulp.dtsi
    arch/arm64/boot/dts/freescale/fsl-ls1028a.dtsi
    drivers/clk/imx/clk-composite-8m.c
    drivers/gpio/gpio-mxc.c
    drivers/irqchip/Kconfig
    drivers/mmc/host/sdhci-of-esdhc.c
    drivers/mtd/nand/raw/gpmi-nand/gpmi-nand.c
    drivers/net/can/flexcan.c
    drivers/net/ethernet/freescale/dpaa/dpaa_eth.c
    drivers/net/ethernet/mscc/ocelot.c
    drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
    drivers/net/ethernet/stmicro/stmmac/stmmac_platform.c
    drivers/net/phy/realtek.c
    drivers/pci/controller/mobiveil/pcie-mobiveil-host.c
    drivers/perf/fsl_imx8_ddr_perf.c
    drivers/tee/optee/shm_pool.c
    drivers/usb/cdns3/gadget.c
    kernel/sched/cpufreq.c
    net/core/xdp.c
    sound/soc/fsl/fsl_esai.c
    sound/soc/fsl/fsl_sai.c
    sound/soc/sof/core.c
    sound/soc/sof/imx/Kconfig
    sound/soc/sof/loader.c

    Jason Liu
     

24 Feb, 2020

1 commit

  • [ Upstream commit 338b4e10f939a71194d8ecef7ece205a942cec05 ]

    The nvlink2 subdriver for IBM Witherspoon machines preregisters
    GPU memory in the IOMMU API so KVM TCE code can map this memory
    for DMA as well. This is done by mm_iommu_newdev() called from
    vfio_pci_nvgpu_regops::mmap.

    In the unlikely event of failure, data->mem remains NULL. Since
    mm_iommu_put() (which unregisters the region and unpins memory
    if that was regular memory) does not expect mem=NULL, it should not be
    called in that case.

    This adds a check to only call mm_iommu_put() for a valid data->mem.

    Fixes: 7f92891778df ("vfio_pci: Add NVIDIA GV100GL [Tesla V100 SXM2] subdriver")
    Signed-off-by: Alexey Kardashevskiy
    Signed-off-by: Alex Williamson
    Signed-off-by: Sasha Levin

    Alexey Kardashevskiy
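
The guard described in this commit can be sketched in user-space C. This is a minimal illustration, not the kernel code: `struct mem_region`, `struct nvgpu_data`, `region_put()` and `nvgpu_release()` are hypothetical stand-ins for the preregistered memory descriptor, the subdriver's private data, `mm_iommu_put()` and the release path.

```c
#include <stddef.h>

/* Hypothetical stand-ins; the real kernel types are different. */
struct mem_region { int refs; };
struct nvgpu_data { struct mem_region *mem; };

int put_calls; /* counts how often the release helper ran */

/* Stand-in for mm_iommu_put(): it dereferences mem, so it must never
 * be handed a NULL pointer. */
void region_put(struct mem_region *mem)
{
    mem->refs--;
    put_calls++;
}

/* Release path: only drop the registration if it was ever created. */
void nvgpu_release(struct nvgpu_data *data)
{
    if (data->mem) {
        region_put(data->mem);
        data->mem = NULL;
    }
}
```

Calling `nvgpu_release()` on data whose registration failed (mem left NULL) is then a no-op instead of a NULL dereference.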
     

21 Dec, 2019

1 commit

  • commit d567fb8819162099035e546b11a736e29c2af0ea upstream.

    Since irq_bypass_register_producer() is called after request_irq(), we
    should do tear-down in reverse order: irq_bypass_unregister_producer()
    then free_irq().

    Specifically, free_irq() may release resources required by the
    irqbypass del_producer() callback. A notable example, provided by
    Marc Zyngier, is arm64 with GICv4, where he indicates the following
    sequence has the potential to wedge the hardware:

      free_irq(irq)
        __free_irq(irq)
          irq_domain_deactivate_irq(irq)
            its_irq_domain_deactivate()
              [unmap the VLPI from the ITS]

      kvm_arch_irq_bypass_del_producer(cons, prod)
        kvm_vgic_v4_unset_forwarding(kvm, irq, ...)
          its_unmap_vlpi(irq)
            [Unmap the VLPI from the ITS (again), remap the original LPI]

    Signed-off-by: Jiang Yi
    Cc: stable@vger.kernel.org # v4.4+
    Fixes: 6d7425f109d26 ("vfio: Register/unregister irq_bypass_producer")
    Link: https://lore.kernel.org/kvm/20191127164910.15888-1-giangyi@amazon.com
    Reviewed-by: Marc Zyngier
    Reviewed-by: Eric Auger
    [aw: commit log]
    Signed-off-by: Alex Williamson
    Signed-off-by: Greg Kroah-Hartman

    Jiang Yi
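
The ordering rule in this commit (tear down in the reverse order of setup) can be illustrated with a small user-space model. All names here are hypothetical; the model only tracks whether the producer is unregistered while the interrupt is still requested, which is the property the commit restores.

```c
#include <stdbool.h>

/* Hypothetical model of one device interrupt context. */
struct vdev_sim {
    bool irq_requested;        /* set between request and free */
    bool producer_registered;  /* set between register and unregister */
    int  errors;               /* unregister ran after the IRQ was freed */
};

void sim_request_irq(struct vdev_sim *d)       { d->irq_requested = true; }
void sim_register_producer(struct vdev_sim *d) { d->producer_registered = true; }
void sim_free_irq(struct vdev_sim *d)          { d->irq_requested = false; }

/* The del_producer step may still need resources owned by the IRQ. */
void sim_unregister_producer(struct vdev_sim *d)
{
    if (!d->irq_requested)
        d->errors++;
    d->producer_registered = false;
}

/* Setup order: request the IRQ, then register the bypass producer. */
void sim_setup(struct vdev_sim *d)
{
    sim_request_irq(d);
    sim_register_producer(d);
}

/* Teardown mirrors setup: unregister the producer, then free the IRQ. */
void sim_teardown(struct vdev_sim *d)
{
    sim_unregister_producer(d);
    sim_free_irq(d);
}
```

Running `sim_setup()` followed by `sim_teardown()` leaves `errors` at zero, whereas freeing the IRQ first and unregistering afterwards increments it.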
     

26 Nov, 2019

10 commits


16 Oct, 2019

1 commit

  • After enabling CONFIG_IOMMU_DMA on X86 a new warning appears when
    compiling vfio:

    drivers/vfio/vfio_iommu_type1.c: In function ‘vfio_iommu_type1_attach_group’:
    drivers/vfio/vfio_iommu_type1.c:1827:7: warning: ‘resv_msi_base’ may be used uninitialized in this function [-Wmaybe-uninitialized]
    ret = iommu_get_msi_cookie(domain->domain, resv_msi_base);
    ~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

    The warning is a false positive, because the call to iommu_get_msi_cookie()
    only happens when vfio_iommu_has_sw_msi() returned true. And that only
    happens when it also set resv_msi_base.

    But initialize the variable anyway to get rid of the warning.

    Signed-off-by: Joerg Roedel
    Reviewed-by: Cornelia Huck
    Reviewed-by: Eric Auger
    Signed-off-by: Alex Williamson

    Joerg Roedel
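
A reduced sketch of the pattern behind this warning, assuming a hypothetical helper `has_sw_msi()` that writes its out-parameter only when it returns true. The compiler cannot always prove that relationship, so the caller initializes the variable explicitly. The function names and the region representation are illustrative and are not the vfio code.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical lookup: writes *base only when it returns true. */
bool has_sw_msi(const uint64_t *regions, size_t n, uint64_t *base)
{
    for (size_t i = 0; i < n; i++) {
        if (regions[i] != 0) {
            *base = regions[i];
            return true;
        }
    }
    return false;
}

/* Returns the reserved MSI base, or 0 when no such region exists. */
uint64_t pick_msi_base(const uint64_t *regions, size_t n)
{
    /* Explicit initialization keeps -Wmaybe-uninitialized quiet even
     * though the value is only read when has_sw_msi() returned true. */
    uint64_t resv_msi_base = 0;
    bool resv_msi = has_sw_msi(regions, n, &resv_msi_base);

    if (resv_msi)
        return resv_msi_base;
    return 0;
}
```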
     

26 Sep, 2019

1 commit

  • This patch is part of a series that extends the kernel ABI to allow
    passing tagged user pointers (with the top byte set to something other
    than 0x00) as syscall arguments.

    vaddr_get_pfn() uses provided user pointers for vma lookups, which can
    only be done with untagged pointers.

    Untag user pointers in this function.

    Link: http://lkml.kernel.org/r/87422b4d72116a975896f2b19b00f38acbd28f33.1563904656.git.andreyknvl@google.com
    Signed-off-by: Andrey Konovalov
    Reviewed-by: Eric Auger
    Reviewed-by: Vincenzo Frascino
    Reviewed-by: Catalin Marinas
    Reviewed-by: Kees Cook
    Cc: Dave Hansen
    Cc: Will Deacon
    Cc: Al Viro
    Cc: Felix Kuehling
    Cc: Jens Wiklander
    Cc: Khalid Aziz
    Cc: Mauro Carvalho Chehab
    Cc: Mike Rapoport
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Konovalov
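
Why a lookup needs the untagged address can be shown with a simplified user-space model. It assumes the tag occupies the top byte of a 64-bit user address and simply clears that byte; the kernel's own untagging helper is architecture-specific and is not reproduced here. `struct range` and `find_range()` are hypothetical stand-ins for a vma lookup.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a mapped virtual address range. */
struct range { uint64_t start, end; };

/* Simplified untagging: clear the top byte of a user address. */
uint64_t untag_user_addr(uint64_t addr)
{
    return addr & ~(UINT64_C(0xff) << 56);
}

/* Linear lookup standing in for a vma search; ranges are [start, end). */
const struct range *find_range(const struct range *r, size_t n, uint64_t addr)
{
    for (size_t i = 0; i < n; i++)
        if (addr >= r[i].start && addr < r[i].end)
            return &r[i];
    return NULL;
}
```

With a tag in the top byte, the raw pointer value falls outside every mapped range, so the lookup only succeeds after `untag_user_addr()` has been applied.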
     

25 Sep, 2019

1 commit

  • Replace PAGE_SHIFT + compound_order(page) with the new page_shift()
    function. Minor improvements in readability.

    [akpm@linux-foundation.org: fix build in tce_page_is_contained()]
    Link: http://lkml.kernel.org/r/201907241853.yNQTrJWd%25lkp@intel.com
    Link: http://lkml.kernel.org/r/20190721104612.19120-3-willy@infradead.org
    Signed-off-by: Matthew Wilcox (Oracle)
    Reviewed-by: Andrew Morton
    Reviewed-by: Ira Weiny
    Acked-by: Kirill A. Shutemov
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)
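
The replacement described above amounts to a one-line helper. The sketch below models it in user space; `PAGE_SHIFT` is assumed to be 12 (4 KiB base pages), and `struct page` is reduced to a single hypothetical `order` field. The containment check is only an illustration of how such a shift might be compared against an IOMMU page size.

```c
#include <stdbool.h>

#define PAGE_SHIFT 12 /* assumption: 4 KiB base pages */

/* Reduced stand-in for struct page: only the compound order is kept. */
struct page { unsigned int order; };

unsigned int compound_order(const struct page *page)
{
    return page->order;
}

/* Number of address bits covered by this (possibly compound) page. */
unsigned int page_shift(const struct page *page)
{
    return PAGE_SHIFT + compound_order(page);
}

/* Illustrative check: does the page fully contain one IOMMU page? */
bool page_contains_iommu_page(const struct page *page, unsigned int it_page_shift)
{
    return page_shift(page) >= it_page_shift;
}
```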
     

21 Sep, 2019

2 commits

  • Pull VFIO updates from Alex Williamson:

    - Fix spapr iommu error case (Alexey Kardashevskiy)

    - Consolidate region type definitions (Cornelia Huck)

    - Restore saved original PCI state on release (hexin)

    - Simplify mtty sample driver interrupt path (Parav Pandit)

    - Support for reporting valid IOVA regions to user (Shameer Kolothum)

    * tag 'vfio-v5.4-rc1' of git://github.com/awilliam/linux-vfio:
    vfio_pci: Restore original state on release
    vfio/type1: remove duplicate retrieval of reserved regions
    vfio/type1: Add IOVA range capability support
    vfio/type1: check dma map request is within a valid iova range
    vfio/spapr_tce: Fix incorrect tce_iommu_group memory free
    vfio-mdev/mtty: Simplify interrupt generation
    vfio: re-arrange vfio region definitions
    vfio/type1: Update iova list on detach
    vfio/type1: Check reserved region conflict and update iova list
    vfio/type1: Introduce iova list and add iommu aperture validity check

    Linus Torvalds
     
  • Pull powerpc updates from Michael Ellerman:
    "This is a bit late, partly due to me travelling, and partly due to a
    power outage knocking out some of my test systems *while* I was
    travelling.

    - Initial support for running on a system with an Ultravisor, which
    is software that runs below the hypervisor and protects guests
    against some attacks by the hypervisor.

    - Support for building the kernel to run as a "Secure Virtual
    Machine", ie. as a guest capable of running on a system with an
    Ultravisor.

    - Some changes to our DMA code on bare metal, to allow devices with
    medium sized DMA masks (> 32 && < 59 bits) to use more than 2GB of
    DMA space.

    - Support for firmware assisted crash dumps on bare metal (powernv).

    - Two series fixing bugs in and refactoring our PCI EEH code.

    - A large series refactoring our exception entry code to use gas
    macros, both to make it more readable and also enable some future
    optimisations.

    As well as many cleanups and other minor features & fixups.

    Thanks to: Adam Zerella, Alexey Kardashevskiy, Alistair Popple, Andrew
    Donnellan, Aneesh Kumar K.V, Anju T Sudhakar, Anshuman Khandual,
    Balbir Singh, Benjamin Herrenschmidt, Cédric Le Goater, Christophe
    JAILLET, Christophe Leroy, Christopher M. Riedl, Christoph Hellwig,
    Claudio Carvalho, Daniel Axtens, David Gibson, David Hildenbrand,
    Desnes A. Nunes do Rosario, Ganesh Goudar, Gautham R. Shenoy, Greg
    Kurz, Guerney Hunt, Gustavo Romero, Halil Pasic, Hari Bathini, Joakim
    Tjernlund, Jonathan Neuschafer, Jordan Niethe, Leonardo Bras, Lianbo
    Jiang, Madhavan Srinivasan, Mahesh Salgaonkar, Mahesh Salgaonkar,
    Masahiro Yamada, Maxiwell S. Garcia, Michael Anderson, Nathan
    Chancellor, Nathan Lynch, Naveen N. Rao, Nicholas Piggin, Oliver
    O'Halloran, Qian Cai, Ram Pai, Ravi Bangoria, Reza Arbab, Ryan Grimm,
    Sam Bobroff, Santosh Sivaraj, Segher Boessenkool, Sukadev Bhattiprolu,
    Thiago Bauermann, Thiago Jung Bauermann, Thomas Gleixner, Tom
    Lendacky, Vasant Hegde"

    * tag 'powerpc-5.4-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (264 commits)
    powerpc/mm/mce: Keep irqs disabled during lockless page table walk
    powerpc: Use ftrace_graph_ret_addr() when unwinding
    powerpc/ftrace: Enable HAVE_FUNCTION_GRAPH_RET_ADDR_PTR
    ftrace: Look up the address of return_to_handler() using helpers
    powerpc: dump kernel log before carrying out fadump or kdump
    docs: powerpc: Add missing documentation reference
    powerpc/xmon: Fix output of XIVE IPI
    powerpc/xmon: Improve output of XIVE interrupts
    powerpc/mm/radix: remove useless kernel messages
    powerpc/fadump: support holes in kernel boot memory area
    powerpc/fadump: remove RMA_START and RMA_END macros
    powerpc/fadump: update documentation about option to release opalcore
    powerpc/fadump: consider f/w load area
    powerpc/opalcore: provide an option to invalidate /sys/firmware/opal/core file
    powerpc/opalcore: export /sys/firmware/opal/core for analysing opal crashes
    powerpc/fadump: update documentation about CONFIG_PRESERVE_FA_DUMP
    powerpc/fadump: add support to preserve crash data on FADUMP disabled kernel
    powerpc/fadump: improve how crashed kernel's memory is reserved
    powerpc/fadump: consider reserved ranges while releasing memory
    powerpc/fadump: make crash memory ranges array allocation generic
    ...

    Linus Torvalds
     

30 Aug, 2019

1 commit

  • Invalidating a TCE cache entry for each updated TCE is quite expensive.
    This makes use of the new iommu_table_ops::xchg_no_kill()/tce_kill()
    callbacks to bring down the time spent in mapping a huge guest DMA window.

    Signed-off-by: Alexey Kardashevskiy
    Signed-off-by: Michael Ellerman
    Link: https://lore.kernel.org/r/20190829085252.72370-4-aik@ozlabs.ru

    Alexey Kardashevskiy
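
The saving comes from separating the table update from the cache invalidation. The following user-space model is a sketch under that assumption: `sim_xchg_no_kill()` and `sim_tce_kill()` are hypothetical stand-ins for the callbacks named in the commit, and the `kills` counter stands for the number of expensive invalidations issued.

```c
#include <stddef.h>

#define TCE_ENTRIES 64

/* Hypothetical model of a TCE table plus an invalidation counter. */
struct tce_table_sim {
    unsigned long entries[TCE_ENTRIES];
    unsigned int kills;
};

/* Swap in a new entry without invalidating the cache. */
void sim_xchg_no_kill(struct tce_table_sim *t, size_t idx, unsigned long *tce)
{
    unsigned long old = t->entries[idx];

    t->entries[idx] = *tce;
    *tce = old;
}

/* Invalidate cached entries for a whole range in one operation. */
void sim_tce_kill(struct tce_table_sim *t, size_t first, size_t count)
{
    (void)first;
    (void)count;
    t->kills++;
}

/* Old behaviour: one invalidation per updated entry. */
void map_per_entry(struct tce_table_sim *t, size_t first, size_t count,
                   unsigned long base)
{
    for (size_t i = 0; i < count; i++) {
        unsigned long tce = base + i;

        sim_xchg_no_kill(t, first + i, &tce);
        sim_tce_kill(t, first + i, 1);
    }
}

/* Batched behaviour: update everything, then invalidate once. */
void map_batched(struct tce_table_sim *t, size_t first, size_t count,
                 unsigned long base)
{
    for (size_t i = 0; i < count; i++) {
        unsigned long tce = base + i;

        sim_xchg_no_kill(t, first + i, &tce);
    }
    sim_tce_kill(t, first, count);
}
```

Both variants leave the table in the same state; only the number of invalidations differs.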
     

24 Aug, 2019

1 commit


23 Aug, 2019

1 commit

  • vfio_pci_enable() saves the device's initial configuration information
    with the intent that it is restored in vfio_pci_disable(). However,
    the commit referenced in Fixes: below replaced the call to
    __pci_reset_function_locked(), which is not wrapped in a state save
    and restore, with pci_try_reset_function(), which overwrites the
    restored device state with the current state before applying it to the
    device. Reinstate use of __pci_reset_function_locked() to return to
    the desired behavior.

    Fixes: 890ed578df82 ("vfio-pci: Use pci "try" reset interface")
    Signed-off-by: hexin
    Signed-off-by: Liu Qi
    Signed-off-by: Zhang Yu
    Signed-off-by: Alex Williamson

    hexin
     

20 Aug, 2019

7 commits


24 Jul, 2019

2 commits

  • To permit batching of TLB flushes across multiple calls to the IOMMU
    driver's ->unmap() implementation, introduce a new structure for
    tracking the address range to be flushed and the granularity at which
    the flushing is required.

    This is hooked into the IOMMU API and its callers are updated to make use
    of the new structure. Subsequent patches will plumb this into the IOMMU
    drivers as well, but for now the gathered information is ignored.

    Signed-off-by: Will Deacon

    Will Deacon
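
A gather structure of the kind described here can be modelled in user space as follows. This is a sketch, not the kernel's definition: the field names, the merge rule (pages of the same size that touch or overlap the pending range are merged, anything else forces a flush first) and the `syncs` counter are assumptions made for illustration.

```c
#include <limits.h>
#include <stddef.h>

/* Hypothetical gather state: pending range, granule, and flush count. */
struct flush_gather {
    unsigned long start;  /* lowest address pending a flush */
    unsigned long end;    /* one past the highest pending address */
    size_t pgsize;        /* granularity of the pending range, 0 if empty */
    unsigned int syncs;   /* number of flushes actually issued */
};

static void gather_reset(struct flush_gather *g)
{
    g->start = ULONG_MAX;
    g->end = 0;
    g->pgsize = 0;
}

void gather_init(struct flush_gather *g)
{
    gather_reset(g);
    g->syncs = 0;
}

/* Issue one flush for everything gathered so far, if anything is pending. */
void gather_sync(struct flush_gather *g)
{
    if (g->pgsize)
        g->syncs++;
    gather_reset(g);
}

/* Record one unmapped page; flush first if it cannot be merged. */
void gather_add_page(struct flush_gather *g, unsigned long iova, size_t size)
{
    unsigned long end = iova + size;

    if (g->pgsize && (g->pgsize != size || end < g->start || iova > g->end))
        gather_sync(g);

    g->pgsize = size;
    if (iova < g->start)
        g->start = iova;
    if (end > g->end)
        g->end = end;
}
```

Unmapping three adjacent 4 KiB pages therefore results in a single pending range and no flush until the caller syncs or a page of a different size is added.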
     
  • Commit add02cfdc9bc ("iommu: Introduce Interface for IOMMU TLB Flushing")
    added three new TLB flushing operations to the IOMMU API so that the
    underlying driver operations can be batched when unmapping large regions
    of IO virtual address space.

    However, the ->iotlb_range_add() callback has not been implemented by
    any IOMMU drivers (amd_iommu.c implements it as an empty function, which
    incurs the overhead of an indirect branch). Instead, drivers either flush
    the entire IOTLB in the ->iotlb_sync() callback or perform the necessary
    invalidation during ->unmap().

    Attempting to implement ->iotlb_range_add() for arm-smmu-v3.c revealed
    two major issues:

    1. The page size used to map the region in the page-table is not known,
    and so it is not generally possible to issue TLB flushes in the most
    efficient manner.

    2. The only mutable state passed to the callback is a pointer to the
    iommu_domain, which can be accessed concurrently and therefore
    requires expensive synchronisation to keep track of the outstanding
    flushes.

    Remove the callback entirely in preparation for extending ->unmap() and
    ->iotlb_sync() to update a token on the caller's stack.

    Signed-off-by: Will Deacon

    Will Deacon
     

18 Jul, 2019

1 commit

  • Pull VFIO updates from Alex Williamson:

    - Static symbol cleanup in mdev samples (Kefeng Wang)

    - Use vma helper in nvlink code (Peng Hao)

    - Remove unused code in mbochs sample (YueHaibing)

    - Send uevents around mdev registration (Alex Williamson)

    * tag 'vfio-v5.3-rc1' of git://github.com/awilliam/linux-vfio:
    mdev: Send uevents around parent device registration
    sample/mdev/mbochs: remove set but not used variable 'mdev_state'
    vfio: vfio_pci_nvlink2: use a vma helper function
    vfio-mdev/samples: make some symbols static

    Linus Torvalds
     

17 Jul, 2019

2 commits

  • Merge more updates from Andrew Morton:
    "VM:
    - z3fold fixes and enhancements by Henry Burns and Vitaly Wool

    - more accurate reclaimed slab caches calculations by Yafang Shao

    - fix MAP_UNINITIALIZED UAPI symbol to not depend on config, by
    Christoph Hellwig

    - !CONFIG_MMU fixes by Christoph Hellwig

    - new novmcoredd parameter to omit device dumps from vmcore, by
    Kairui Song

    - new test_meminit module for testing heap and pagealloc
    initialization, by Alexander Potapenko

    - ioremap improvements for huge mappings, by Anshuman Khandual

    - generalize kprobe page fault handling, by Anshuman Khandual

    - device-dax hotplug fixes and improvements, by Pavel Tatashin

    - enable synchronous DAX fault on powerpc, by Aneesh Kumar K.V

    - add pte_devmap() support for arm64, by Robin Murphy

    - unify locked_vm accounting with a helper, by Daniel Jordan

    - several misc fixes

    core/lib:
    - new typeof_member() macro including some users, by Alexey Dobriyan

    - make BIT() and GENMASK() available in asm, by Masahiro Yamada

    - changed LIST_POISON2 on x86_64 to 0xdead000000000122 for better
    code generation, by Alexey Dobriyan

    - rbtree code size optimizations, by Michel Lespinasse

    - convert struct pid count to refcount_t, by Joel Fernandes

    get_maintainer.pl:
    - add --no-moderated switch to skip moderated ML's, by Joe Perches

    misc:
    - ptrace PTRACE_GET_SYSCALL_INFO interface

    - coda updates

    - gdb scripts, various"

    [ Using merge message suggestion from Vlastimil Babka, with some editing - Linus ]

    * emailed patches from Andrew Morton: (100 commits)
    fs/select.c: use struct_size() in kmalloc()
    mm: add account_locked_vm utility function
    arm64: mm: implement pte_devmap support
    mm: introduce ARCH_HAS_PTE_DEVMAP
    mm: clean up is_device_*_page() definitions
    mm/mmap: move common defines to mman-common.h
    mm: move MAP_SYNC to asm-generic/mman-common.h
    device-dax: "Hotremove" persistent memory that is used like normal RAM
    mm/hotplug: make remove_memory() interface usable
    device-dax: fix memory and resource leak if hotplug fails
    include/linux/lz4.h: fix spelling and copy-paste errors in documentation
    ipc/mqueue.c: only perform resource calculation if user valid
    include/asm-generic/bug.h: fix "cut here" for WARN_ON for __WARN_TAINT architectures
    scripts/gdb: add helpers to find and list devices
    scripts/gdb: add lx-genpd-summary command
    drivers/pps/pps.c: clear offset flags in PPS_SETPARAMS ioctl
    kernel/pid.c: convert struct pid count to refcount_t
    drivers/rapidio/devices/rio_mport_cdev.c: NUL terminate some strings
    select: shift restore_saved_sigmask_unless() into poll_select_copy_remaining()
    select: change do_poll() to return -ERESTARTNOHAND rather than -EINTR
    ...

    Linus Torvalds
     
  • locked_vm accounting is done roughly the same way in five places, so
    unify them in a helper.

    Include the helper's caller in the debug print to distinguish between
    callsites.

    Error codes stay the same, so user-visible behavior does too. The one
    exception is that the -EPERM case in tce_account_locked_vm is removed
    because Alexey has never seen it triggered.

    [daniel.m.jordan@oracle.com: v3]
    Link: http://lkml.kernel.org/r/20190529205019.20927-1-daniel.m.jordan@oracle.com
    [sfr@canb.auug.org.au: fix mm/util.c]
    Link: http://lkml.kernel.org/r/20190524175045.26897-1-daniel.m.jordan@oracle.com
    Signed-off-by: Daniel Jordan
    Signed-off-by: Stephen Rothwell
    Tested-by: Alexey Kardashevskiy
    Acked-by: Alex Williamson
    Cc: Alan Tull
    Cc: Alex Williamson
    Cc: Benjamin Herrenschmidt
    Cc: Christoph Lameter
    Cc: Christophe Leroy
    Cc: Davidlohr Bueso
    Cc: Jason Gunthorpe
    Cc: Mark Rutland
    Cc: Michael Ellerman
    Cc: Moritz Fischer
    Cc: Paul Mackerras
    Cc: Steve Sistare
    Cc: Wu Hao
    Cc: Ira Weiny
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daniel Jordan
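
The shape of such a unified accounting helper can be sketched in user space. The function below is a hypothetical model, not the kernel helper: the limit is passed in directly rather than read from the task's resource limits, and the decrement is clamped at zero for safety, which is an assumption of this sketch.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for the per-process memory descriptor. */
struct mm_sim { unsigned long locked_vm; };

/*
 * Increase or decrease the locked page count.
 * Returns 0 on success, -ENOMEM if an increase would exceed the limit
 * and the caller is not allowed to bypass it.
 */
int account_locked_vm_sim(struct mm_sim *mm, unsigned long pages, bool inc,
                          unsigned long limit, bool bypass_limit)
{
    if (inc) {
        if (!bypass_limit && mm->locked_vm + pages > limit)
            return -ENOMEM;
        mm->locked_vm += pages;
    } else {
        /* Assumption of this sketch: never let the counter underflow. */
        if (pages > mm->locked_vm)
            pages = mm->locked_vm;
        mm->locked_vm -= pages;
    }
    return 0;
}
```

Each call site then reduces to one call with its own page count and direction, instead of repeating the limit check and counter update.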
     

15 Jul, 2019

1 commit


12 Jul, 2019

1 commit


03 Jul, 2019

1 commit


19 Jun, 2019

1 commit

  • Based on 2 normalized pattern(s):

    this program is free software you can redistribute it and or modify
    it under the terms of the gnu general public license version 2 as
    published by the free software foundation

    this program is free software you can redistribute it and or modify
    it under the terms of the gnu general public license version 2 as
    published by the free software foundation #

    extracted by the scancode license scanner the SPDX license identifier

    GPL-2.0-only

    has been chosen to replace the boilerplate/reference in 4122 file(s).

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Enrico Weigelt
    Reviewed-by: Kate Stewart
    Reviewed-by: Allison Randal
    Cc: linux-spdx@vger.kernel.org
    Link: https://lkml.kernel.org/r/20190604081206.933168790@linutronix.de
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     

07 Jun, 2019

3 commits

  • In the following sequences, child devices created while the mdev parent
    device is being removed can be left behind, or a race can lead to the
    removal of half-initialized child mdev devices.

    issue-1:
    --------
           cpu-0                         cpu-1
           -----                         -----
                                      mdev_unregister_device()
                                        device_for_each_child()
                                          mdev_device_remove_cb()
                                            mdev_device_remove()
    create_store()
      mdev_device_create()                   [...]
        device_add()
                                      parent_remove_sysfs_files()

    /* BUG: device added by cpu-0
     * whose parent is getting removed
     * and it won't process this mdev.
     */

    issue-2:
    --------
    The crash below is observed when a user-initiated remove is in progress
    and mdev_unregister_device() completes parent unregistration.

           cpu-0                         cpu-1
           -----                         -----
    remove_store()
      mdev_device_remove()
      active = false;
                                      mdev_unregister_device()
                                      parent device removed.
      [...]
      parent->ops->remove()
    /*
     * BUG: Accessing invalid parent.
     */

    This race is similar to create() racing with mdev_unregister_device().

    BUG: unable to handle kernel paging request at ffffffffc0585668
    PGD e8f618067 P4D e8f618067 PUD e8f61a067 PMD 85adca067 PTE 0
    Oops: 0000 [#1] SMP PTI
    CPU: 41 PID: 37403 Comm: bash Kdump: loaded Not tainted 5.1.0-rc6-vdevbus+ #6
    Hardware name: Supermicro SYS-6028U-TR4+/X10DRU-i+, BIOS 2.0b 08/09/2016
    RIP: 0010:mdev_device_remove+0xfa/0x140 [mdev]
    Call Trace:
    remove_store+0x71/0x90 [mdev]
    kernfs_fop_write+0x113/0x1a0
    vfs_write+0xad/0x1b0
    ksys_write+0x5a/0xe0
    do_syscall_64+0x5a/0x210
    entry_SYSCALL_64_after_hwframe+0x49/0xbe

    Therefore, the mdev core is improved as follows to address the above
    issues.

    Wait for any ongoing mdev create() and remove() to finish before
    unregistering the parent device.
    This continues to allow multiple create and remove operations to
    progress in parallel for different mdev devices, which is the most
    common case.
    At the same time, guard parent removal while the parent is being accessed
    by create() and remove() callbacks.
    create()/remove() and unregister_device() are synchronized by the rwsem.

    Refactor device removal code to mdev_device_remove_common() to avoid
    acquiring unreg_sem of the parent.

    Fixes: 7b96953bc640 ("vfio: Mediated device Core driver")
    Signed-off-by: Parav Pandit
    Reviewed-by: Cornelia Huck
    Signed-off-by: Alex Williamson

    Parav Pandit
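
The reader/writer scheme described in this commit can be modelled with a small try-lock built on C11 atomics. This is a user-space sketch with hypothetical names: create()/remove() take the lock shared, unregistration takes it exclusive. Unlike the kernel's rwsem, the writer here does not block; it reports failure while readers are active, and it keeps the lock after success so that later create()/remove() attempts are refused.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical parent device: sem is -1 when write-locked, else the
 * number of create()/remove() operations currently in flight. */
struct parent_sim { atomic_int sem; };

void parent_init(struct parent_sim *p)
{
    atomic_init(&p->sem, 0);
}

/* Shared acquire for create()/remove(); fails once unregistration won. */
bool child_op_begin(struct parent_sim *p)
{
    int v = atomic_load(&p->sem);

    while (v >= 0) {
        /* On failure v is reloaded with the current value. */
        if (atomic_compare_exchange_weak(&p->sem, &v, v + 1))
            return true;
    }
    return false;
}

void child_op_end(struct parent_sim *p)
{
    atomic_fetch_sub(&p->sem, 1);
}

/* Exclusive acquire for unregistration; only succeeds when no
 * create()/remove() is in flight. The lock is intentionally kept. */
bool parent_try_unregister(struct parent_sim *p)
{
    int expected = 0;

    return atomic_compare_exchange_strong(&p->sem, &expected, -1);
}
```

Several child operations can hold the lock at once, which matches the commit's goal of keeping create and remove parallel for different mdev devices while excluding parent removal.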
     
  • If device removal is initiated by two threads as below, the mdev core
    attempts to create a sysfs remove file on a stale device.
    During this flow, the call trace shown below is observed.

           cpu-0                         cpu-1
           -----                         -----
                                      mdev_unregister_device()
                                        device_for_each_child
                                          mdev_device_remove_cb
                                            mdev_device_remove
    user_syscall
      remove_store()
        mdev_device_remove()
                                             [..]
    unregister device();
    /* not found in list or
     * active=false.
     */
    sysfs_create_file()
    ..Call trace

    Now that the mdev core follows the correct device removal sequence of
    the Linux bus model, remove shouldn't fail in normal cases. If it fails,
    there is no point in creating a stale file or checking for a specific
    error status.

    kernel: WARNING: CPU: 2 PID: 9348 at fs/sysfs/file.c:327
    sysfs_create_file_ns+0x7f/0x90
    kernel: CPU: 2 PID: 9348 Comm: bash Kdump: loaded Not tainted
    5.1.0-rc6-vdevbus+ #6
    kernel: Hardware name: Supermicro SYS-6028U-TR4+/X10DRU-i+, BIOS 2.0b
    08/09/2016
    kernel: RIP: 0010:sysfs_create_file_ns+0x7f/0x90
    kernel: Call Trace:
    kernel: remove_store+0xdc/0x100 [mdev]
    kernel: kernfs_fop_write+0x113/0x1a0
    kernel: vfs_write+0xad/0x1b0
    kernel: ksys_write+0x5a/0xe0
    kernel: do_syscall_64+0x5a/0x210
    kernel: entry_SYSCALL_64_after_hwframe+0x49/0xbe

    Reviewed-by: Cornelia Huck
    Signed-off-by: Parav Pandit
    Signed-off-by: Alex Williamson

    Parav Pandit
     
  • This patch addresses the first two issues below and prepares the code to
    address the third.

    1. The mdev device is placed on the mdev bus before it is created in the
    vendor driver. Once a device is placed on the mdev bus without its
    supporting underlying vendor device having been created, the mdev
    driver's probe() gets triggered. However, there isn't a stable mdev
    available to work on.

    create_store()
      mdev_device_create()
        device_register()
          ...
            vfio_mdev_probe()
        [...]
        parent->ops->create()
          vfio_ap_mdev_create()
            mdev_set_drvdata(mdev, matrix_mdev);
            /* Valid pointer set above */

    Due to this order of initialization, an mdev driver that wants to use
    the mdev doesn't have a valid mdev to work on.

    2. Current creation sequence is,
    parent->ops->create()
    groups_register()

    Remove sequence is,
    parent->ops->remove()
    groups_unregister()

    However, the remove sequence should be the exact mirror of the creation
    sequence. Once this is achieved, all users of the mdev will be terminated
    first, before the underlying vendor device is removed
    (following the standard Linux driver model).
    At that point the vendor's remove() op shouldn't fail, because taking the
    device off the bus should terminate any usage.

    3. When the remove operation fails, mdev sysfs removal attempts to add the
    file back on an already removed device. The following call trace [1] is
    observed.

    [1] call trace:
    kernel: WARNING: CPU: 2 PID: 9348 at fs/sysfs/file.c:327 sysfs_create_file_ns+0x7f/0x90
    kernel: CPU: 2 PID: 9348 Comm: bash Kdump: loaded Not tainted 5.1.0-rc6-vdevbus+ #6
    kernel: Hardware name: Supermicro SYS-6028U-TR4+/X10DRU-i+, BIOS 2.0b 08/09/2016
    kernel: RIP: 0010:sysfs_create_file_ns+0x7f/0x90
    kernel: Call Trace:
    kernel: remove_store+0xdc/0x100 [mdev]
    kernel: kernfs_fop_write+0x113/0x1a0
    kernel: vfs_write+0xad/0x1b0
    kernel: ksys_write+0x5a/0xe0
    kernel: do_syscall_64+0x5a/0x210
    kernel: entry_SYSCALL_64_after_hwframe+0x49/0xbe

    Therefore, the mdev core is improved in the following ways.

    1. Split the device registration/deregistration sequence so that some
    things can be done between initialization of the device and hooking it
    up to the bus and, respectively, after deregistering it from the bus but
    before giving up our final reference.
    In particular, this means invoking the ->create() and ->remove()
    callbacks in those new windows. This gives the vendor driver an
    initialized mdev device to work with during creation.
    At the same time, a bus driver that wishes to bind to the mdev driver
    also gets an initialized mdev device.

    This follows the standard Linux kernel bus and device model.

    2. During the remove flow, first remove the device from the bus. This
    ensures that any bus-specific devices are removed.
    Once the device is taken off the mdev bus, invoke remove() of the mdev
    from the vendor driver.

    3. The driver core device model provides a way to register and
    automatically unregister the device sysfs attribute groups at
    dev->groups.
    Make use of dev->groups to let the core create the groups, eliminating
    the code for explicit group creation and removal.

    To ensure that the new sequence is solid, the stack dump below was
    taken from a process that attempts to remove the device while the device
    is in use by the vfio driver and a user application.
    This stack dump validates that the vfio driver guards against such device
    removal when the device is in use.

    cat /proc/21962/stack
    vfio_del_group_dev+0x216/0x3c0 [vfio]
    mdev_remove+0x21/0x40 [mdev]
    device_release_driver_internal+0xe8/0x1b0
    bus_remove_device+0xf9/0x170
    device_del+0x168/0x350
    mdev_device_remove_common+0x1d/0x50 [mdev]
    mdev_device_remove+0x8c/0xd0 [mdev]
    remove_store+0x71/0x90 [mdev]
    kernfs_fop_write+0x113/0x1a0
    vfs_write+0xad/0x1b0
    ksys_write+0x5a/0xe0
    do_syscall_64+0x5a/0x210
    entry_SYSCALL_64_after_hwframe+0x49/0xbe
    0xffffffffffffffff

    This prepares the code to eliminate calling device_create_file() in
    a subsequent patch.

    Reviewed-by: Cornelia Huck
    Signed-off-by: Parav Pandit
    Signed-off-by: Alex Williamson

    Parav Pandit
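
The mirrored create/remove ordering from this commit can be illustrated with a small state model. Everything here is hypothetical user-space code: the states and function names are invented for the sketch, and the comments only relate each step to the phase the commit describes (initialize, vendor create(), add to bus, and the reverse on removal).

```c
#include <stddef.h>

enum mdev_state {
    MDEV_NONE,          /* no device, or final reference dropped */
    MDEV_INITIALIZED,   /* device initialized, not yet known to the vendor */
    MDEV_VENDOR_READY,  /* vendor create() ran, not yet on the bus */
    MDEV_ON_BUS         /* visible on the bus, bus driver may be bound */
};

struct mdev_sim {
    enum mdev_state state;
    void *drvdata;          /* set by the vendor create() step */
    int probe_saw_drvdata;  /* 1 if probe ran with valid drvdata */
};

int vendor_private; /* placeholder object the vendor step attaches */

/* Creation: initialize, run vendor create(), then add to the bus. */
int mdev_sim_create(struct mdev_sim *m)
{
    if (m->state != MDEV_NONE)
        return -1;
    m->state = MDEV_INITIALIZED;
    m->drvdata = &vendor_private;                /* vendor create() */
    m->state = MDEV_VENDOR_READY;
    m->probe_saw_drvdata = (m->drvdata != NULL); /* probe on bus add */
    m->state = MDEV_ON_BUS;
    return 0;
}

/* Removal mirrors creation: leave the bus, vendor remove(), drop ref. */
int mdev_sim_remove(struct mdev_sim *m)
{
    if (m->state != MDEV_ON_BUS)
        return -1;
    m->state = MDEV_VENDOR_READY;  /* off the bus, users terminated */
    m->drvdata = NULL;             /* vendor remove() */
    m->state = MDEV_INITIALIZED;
    m->state = MDEV_NONE;          /* final reference dropped */
    return 0;
}
```

Because the vendor step runs before the device reaches the bus, the probe step always observes valid driver data, and a second removal attempt is rejected instead of acting on a stale device.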