13 May, 2020

1 commit


23 Dec, 2019

1 commit

  • When running heavy memory pressure workloads, this 5+ year old system
    throws the endless warnings below because disk IO is too slow to recover
    from swapping. Since the volume from alloc_iova_fast() could be large,
    once it calls printk() it will trigger disk IO (writing to the log
    files) and pending softirqs, which could cause an infinite loop so that
    the ongoing memory reclaim makes no progress for days. This is the
    counterpart for Intel; the AMD part has already been merged, see commit
    3d708895325b ("iommu/amd: Silence warnings under memory pressure").
    Since the allocation failure will be reported in intel_alloc_iova()
    anyway, just call dev_err_once() there because even "ratelimited" is
    too much, and silence the one in alloc_iova_mem() to avoid the
    expensive warn_alloc().

    hpsa 0000:03:00.0: DMAR: Allocating 1-page iova failed
    hpsa 0000:03:00.0: DMAR: Allocating 1-page iova failed
    hpsa 0000:03:00.0: DMAR: Allocating 1-page iova failed
    hpsa 0000:03:00.0: DMAR: Allocating 1-page iova failed
    hpsa 0000:03:00.0: DMAR: Allocating 1-page iova failed
    hpsa 0000:03:00.0: DMAR: Allocating 1-page iova failed
    hpsa 0000:03:00.0: DMAR: Allocating 1-page iova failed
    hpsa 0000:03:00.0: DMAR: Allocating 1-page iova failed
    slab_out_of_memory: 66 callbacks suppressed
    SLUB: Unable to allocate memory on node -1, gfp=0xa20(GFP_ATOMIC)
    cache: iommu_iova, object size: 40, buffer size: 448, default order:
    0, min order: 0
    node 0: slabs: 1822, objs: 16398, free: 0
    node 1: slabs: 2051, objs: 18459, free: 31
    SLUB: Unable to allocate memory on node -1, gfp=0xa20(GFP_ATOMIC)
    cache: iommu_iova, object size: 40, buffer size: 448, default order:
    0, min order: 0
    node 0: slabs: 1822, objs: 16398, free: 0
    node 1: slabs: 2051, objs: 18459, free: 31
    SLUB: Unable to allocate memory on node -1, gfp=0xa20(GFP_ATOMIC)
    cache: iommu_iova, object size: 40, buffer size: 448, default order:
    0, min order: 0
    SLUB: Unable to allocate memory on node -1, gfp=0xa20(GFP_ATOMIC)
    SLUB: Unable to allocate memory on node -1, gfp=0xa20(GFP_ATOMIC)
    SLUB: Unable to allocate memory on node -1, gfp=0xa20(GFP_ATOMIC)
    SLUB: Unable to allocate memory on node -1, gfp=0xa20(GFP_ATOMIC)
    SLUB: Unable to allocate memory on node -1, gfp=0xa20(GFP_ATOMIC)
    cache: skbuff_head_cache, object size: 208, buffer size: 640, default
    order: 0, min order: 0
    cache: skbuff_head_cache, object size: 208, buffer size: 640, default
    order: 0, min order: 0
    cache: skbuff_head_cache, object size: 208, buffer size: 640, default
    order: 0, min order: 0
    cache: skbuff_head_cache, object size: 208, buffer size: 640, default
    order: 0, min order: 0
    node 0: slabs: 697, objs: 4182, free: 0
    node 0: slabs: 697, objs: 4182, free: 0
    node 0: slabs: 697, objs: 4182, free: 0
    node 0: slabs: 697, objs: 4182, free: 0
    node 1: slabs: 381, objs: 2286, free: 27
    node 1: slabs: 381, objs: 2286, free: 27
    node 1: slabs: 381, objs: 2286, free: 27
    node 1: slabs: 381, objs: 2286, free: 27
    node 0: slabs: 1822, objs: 16398, free: 0
    cache: skbuff_head_cache, object size: 208, buffer size: 640, default
    order: 0, min order: 0
    node 1: slabs: 2051, objs: 18459, free: 31
    node 0: slabs: 697, objs: 4182, free: 0
    SLUB: Unable to allocate memory on node -1, gfp=0xa20(GFP_ATOMIC)
    node 1: slabs: 381, objs: 2286, free: 27
    cache: skbuff_head_cache, object size: 208, buffer size: 640, default
    order: 0, min order: 0
    node 0: slabs: 697, objs: 4182, free: 0
    node 1: slabs: 381, objs: 2286, free: 27
    hpsa 0000:03:00.0: DMAR: Allocating 1-page iova failed
    warn_alloc: 96 callbacks suppressed
    kworker/11:1H: page allocation failure: order:0,
    mode:0xa20(GFP_ATOMIC), nodemask=(null),cpuset=/,mems_allowed=0-1
    CPU: 11 PID: 1642 Comm: kworker/11:1H Tainted: G B
    Hardware name: HP ProLiant XL420 Gen9/ProLiant XL420 Gen9, BIOS U19
    12/27/2015
    Workqueue: kblockd blk_mq_run_work_fn
    Call Trace:
    dump_stack+0xa0/0xea
    warn_alloc.cold.94+0x8a/0x12d
    __alloc_pages_slowpath+0x1750/0x1870
    __alloc_pages_nodemask+0x58a/0x710
    alloc_pages_current+0x9c/0x110
    alloc_slab_page+0xc9/0x760
    allocate_slab+0x48f/0x5d0
    new_slab+0x46/0x70
    ___slab_alloc+0x4ab/0x7b0
    __slab_alloc+0x43/0x70
    kmem_cache_alloc+0x2dd/0x450
    SLUB: Unable to allocate memory on node -1, gfp=0xa20(GFP_ATOMIC)
    alloc_iova+0x33/0x210
    cache: skbuff_head_cache, object size: 208, buffer size: 640, default
    order: 0, min order: 0
    node 0: slabs: 697, objs: 4182, free: 0
    alloc_iova_fast+0x62/0x3d1
    node 1: slabs: 381, objs: 2286, free: 27
    intel_alloc_iova+0xce/0xe0
    intel_map_sg+0xed/0x410
    scsi_dma_map+0xd7/0x160
    scsi_queue_rq+0xbf7/0x1310
    blk_mq_dispatch_rq_list+0x4d9/0xbc0
    blk_mq_sched_dispatch_requests+0x24a/0x300
    __blk_mq_run_hw_queue+0x156/0x230
    blk_mq_run_work_fn+0x3b/0x40
    process_one_work+0x579/0xb90
    worker_thread+0x63/0x5b0
    kthread+0x1e6/0x210
    ret_from_fork+0x3a/0x50
    Mem-Info:
    active_anon:2422723 inactive_anon:361971 isolated_anon:34403
    active_file:2285 inactive_file:1838 isolated_file:0
    unevictable:0 dirty:1 writeback:5 unstable:0
    slab_reclaimable:13972 slab_unreclaimable:453879
    mapped:2380 shmem:154 pagetables:6948 bounce:0
    free:19133 free_pcp:7363 free_cma:0

    Signed-off-by: Qian Cai
    Signed-off-by: Joerg Roedel

    Qian Cai
     

17 Dec, 2019

1 commit

  • During an ethernet (Marvell octeontx2) set-ring-buffer test:
    ethtool -G eth1 rx tx
    the following kmemleak report will sometimes appear:

    unreferenced object 0xffff000b85421340 (size 64):
    comm "ethtool", pid 867, jiffies 4295323539 (age 550.500s)
    hex dump (first 64 bytes):
    80 13 42 85 0b 00 ff ff ff ff ff ff ff ff ff ff ..B.............
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
    ff ff ff ff ff ff ff ff 00 00 00 00 00 00 00 00 ................
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
    backtrace:
    [] kmem_cache_alloc+0x1b0/0x350
    [] alloc_iova+0x3c/0x168
    [] alloc_iova_fast+0x7c/0x2d8
    [] iommu_dma_alloc_iova.isra.0+0x12c/0x138
    [] __iommu_dma_map+0x8c/0xf8
    [] iommu_dma_map_page+0x98/0xf8
    [] otx2_alloc_rbuf+0xf4/0x158
    [] otx2_rq_aura_pool_init+0x110/0x270
    [] otx2_open+0x15c/0x734
    [] otx2_dev_open+0x3c/0x68
    [] otx2_set_ringparam+0x1ac/0x1d4
    [] dev_ethtool+0xb84/0x2028
    [] dev_ioctl+0x248/0x3a0
    [] sock_ioctl+0x280/0x638
    [] do_vfs_ioctl+0x8b0/0xa80
    [] ksys_ioctl+0x84/0xb8

    The reason:
    alloc_iova_mem() does not zero-initialize the allocation, so by chance
    pfn_lo can already equal IOVA_ANCHOR. When __alloc_and_insert_iova_range()
    then returns -ENOMEM (iova32_full), the new_iova will not be freed in
    free_iova_mem().
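
    As a stand-alone model of the leak described above (a user-space sketch
    with illustrative names, not the kernel code), the pattern is a free
    helper that deliberately skips a reserved sentinel value combined with
    an allocation that is never initialized:

    #include <stdio.h>
    #include <stdlib.h>

    #define SENTINEL 0xFFFFFFFFUL          /* stands in for IOVA_ANCHOR */

    struct node { unsigned long key; };

    static void free_node(struct node *n)
    {
            if (n->key == SENTINEL)        /* reserved anchor: never freed */
                    return;
            free(n);
    }

    int main(void)
    {
            /* malloc() leaves ->key uninitialized; if the stale memory
             * happens to hold SENTINEL, free_node() becomes a no-op on
             * the error path and the object leaks. Zero-initializing the
             * allocation (or setting ->key before any error path) avoids
             * this. */
            struct node *n = malloc(sizeof(*n));

            if (!n)
                    return 1;
            n->key = SENTINEL;             /* simulate unlucky stale contents */
            free_node(n);                  /* leaks: nothing is freed here */
            printf("node at %p was never freed\n", (void *)n);
            return 0;
    }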

    Fixes: bb68b2fbfbd6 ("iommu/iova: Add rbtree anchor node")
    Signed-off-by: Xiaotao Yin
    Reviewed-by: Robin Murphy
    Signed-off-by: Joerg Roedel

    Xiaotao Yin
     

30 Aug, 2019

1 commit

  • In commit 14bd9a607f90 ("iommu/iova: Separate atomic variables
    to improve performance") Jinyu Qi identified that the atomic_cmpxchg()
    in queue_iova() was causing a performance loss and moved critical fields
    so that the false sharing would not impact them.

    However, avoiding the false sharing in the first place seems easy:
    the atomic_cmpxchg() needs to be attempted no more than 100 times per
    second, so checking with an atomic_read() first will keep the cache
    line mostly shared.
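
    A stand-alone sketch of that read-before-cmpxchg pattern (C11 atomics;
    the "timer armed" flag and helper names are illustrative, not the
    kernel code): a plain load filters out the common case, and the
    compare-and-swap, which pulls the cache line into exclusive state, is
    only attempted by the one caller that actually needs to arm the timer.

    #include <stdatomic.h>
    #include <stdio.h>

    static atomic_int timer_armed;         /* cleared when the timer fires */

    static void arm_flush_timer(void)
    {
            puts("timer armed");           /* mod_timer() in the real code */
    }

    static void maybe_arm_timer(void)
    {
            int expected = 0;

            /* Fast path: a plain load keeps the cache line shared. The
             * cmpxchg is only needed at most once per timer period (100
             * times per second for a 10ms timer), so almost every call
             * stops here without taking the line exclusive. */
            if (atomic_load_explicit(&timer_armed, memory_order_relaxed))
                    return;

            if (atomic_compare_exchange_strong(&timer_armed, &expected, 1))
                    arm_flush_timer();     /* we won the race: arm it once */
    }

    int main(void)
    {
            maybe_arm_timer();             /* arms the timer */
            maybe_arm_timer();             /* fast path: nothing to do */
            return 0;
    }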

    This false sharing came with commit 9a005a800ae8
    ("iommu/iova: Add flush timer").

    Signed-off-by: Eric Dumazet
    Fixes: 9a005a800ae8 ('iommu/iova: Add flush timer')
    Cc: Jinyu Qi
    Cc: Joerg Roedel
    Acked-by: Robin Murphy
    Signed-off-by: Joerg Roedel

    Eric Dumazet
     

22 Jul, 2019

2 commits

  • Since the cached32_node is allowed to be advanced above dma_32bit_pfn
    (to provide a shortcut into the limited range), we need to be careful
    when the node to be freed is the cached32_node itself.

    [ 48.477773] BUG: KASAN: use-after-free in __cached_rbnode_delete_update+0x68/0x110
    [ 48.477812] Read of size 8 at addr ffff88870fc19020 by task kworker/u8:1/37
    [ 48.477843]
    [ 48.477879] CPU: 1 PID: 37 Comm: kworker/u8:1 Tainted: G U 5.2.0+ #735
    [ 48.477915] Hardware name: Intel Corporation NUC7i5BNK/NUC7i5BNB, BIOS BNKBL357.86A.0052.2017.0918.1346 09/18/2017
    [ 48.478047] Workqueue: i915 __i915_gem_free_work [i915]
    [ 48.478075] Call Trace:
    [ 48.478111] dump_stack+0x5b/0x90
    [ 48.478137] print_address_description+0x67/0x237
    [ 48.478178] ? __cached_rbnode_delete_update+0x68/0x110
    [ 48.478212] __kasan_report.cold.3+0x1c/0x38
    [ 48.478240] ? __cached_rbnode_delete_update+0x68/0x110
    [ 48.478280] ? __cached_rbnode_delete_update+0x68/0x110
    [ 48.478308] __cached_rbnode_delete_update+0x68/0x110
    [ 48.478344] private_free_iova+0x2b/0x60
    [ 48.478378] iova_magazine_free_pfns+0x46/0xa0
    [ 48.478403] free_iova_fast+0x277/0x340
    [ 48.478443] fq_ring_free+0x15a/0x1a0
    [ 48.478473] queue_iova+0x19c/0x1f0
    [ 48.478597] cleanup_page_dma.isra.64+0x62/0xb0 [i915]
    [ 48.478712] __gen8_ppgtt_cleanup+0x63/0x80 [i915]
    [ 48.478826] __gen8_ppgtt_cleanup+0x42/0x80 [i915]
    [ 48.478940] __gen8_ppgtt_clear+0x433/0x4b0 [i915]
    [ 48.479053] __gen8_ppgtt_clear+0x462/0x4b0 [i915]
    [ 48.479081] ? __sg_free_table+0x9e/0xf0
    [ 48.479116] ? kfree+0x7f/0x150
    [ 48.479234] i915_vma_unbind+0x1e2/0x240 [i915]
    [ 48.479352] i915_vma_destroy+0x3a/0x280 [i915]
    [ 48.479465] __i915_gem_free_objects+0xf0/0x2d0 [i915]
    [ 48.479579] __i915_gem_free_work+0x41/0xa0 [i915]
    [ 48.479607] process_one_work+0x495/0x710
    [ 48.479642] worker_thread+0x4c7/0x6f0
    [ 48.479687] ? process_one_work+0x710/0x710
    [ 48.479724] kthread+0x1b2/0x1d0
    [ 48.479774] ? kthread_create_worker_on_cpu+0xa0/0xa0
    [ 48.479820] ret_from_fork+0x1f/0x30
    [ 48.479864]
    [ 48.479907] Allocated by task 631:
    [ 48.479944] save_stack+0x19/0x80
    [ 48.479994] __kasan_kmalloc.constprop.6+0xc1/0xd0
    [ 48.480038] kmem_cache_alloc+0x91/0xf0
    [ 48.480082] alloc_iova+0x2b/0x1e0
    [ 48.480125] alloc_iova_fast+0x58/0x376
    [ 48.480166] intel_alloc_iova+0x90/0xc0
    [ 48.480214] intel_map_sg+0xde/0x1f0
    [ 48.480343] i915_gem_gtt_prepare_pages+0xb8/0x170 [i915]
    [ 48.480465] huge_get_pages+0x232/0x2b0 [i915]
    [ 48.480590] ____i915_gem_object_get_pages+0x40/0xb0 [i915]
    [ 48.480712] __i915_gem_object_get_pages+0x90/0xa0 [i915]
    [ 48.480834] i915_gem_object_prepare_write+0x2d6/0x330 [i915]
    [ 48.480955] create_test_object.isra.54+0x1a9/0x3e0 [i915]
    [ 48.481075] igt_shared_ctx_exec+0x365/0x3c0 [i915]
    [ 48.481210] __i915_subtests.cold.4+0x30/0x92 [i915]
    [ 48.481341] __run_selftests.cold.3+0xa9/0x119 [i915]
    [ 48.481466] i915_live_selftests+0x3c/0x70 [i915]
    [ 48.481583] i915_pci_probe+0xe7/0x220 [i915]
    [ 48.481620] pci_device_probe+0xe0/0x180
    [ 48.481665] really_probe+0x163/0x4e0
    [ 48.481710] device_driver_attach+0x85/0x90
    [ 48.481750] __driver_attach+0xa5/0x180
    [ 48.481796] bus_for_each_dev+0xda/0x130
    [ 48.481831] bus_add_driver+0x205/0x2e0
    [ 48.481882] driver_register+0xca/0x140
    [ 48.481927] do_one_initcall+0x6c/0x1af
    [ 48.481970] do_init_module+0x106/0x350
    [ 48.482010] load_module+0x3d2c/0x3ea0
    [ 48.482058] __do_sys_finit_module+0x110/0x180
    [ 48.482102] do_syscall_64+0x62/0x1f0
    [ 48.482147] entry_SYSCALL_64_after_hwframe+0x44/0xa9
    [ 48.482190]
    [ 48.482224] Freed by task 37:
    [ 48.482273] save_stack+0x19/0x80
    [ 48.482318] __kasan_slab_free+0x12e/0x180
    [ 48.482363] kmem_cache_free+0x70/0x140
    [ 48.482406] __free_iova+0x1d/0x30
    [ 48.482445] fq_ring_free+0x15a/0x1a0
    [ 48.482490] queue_iova+0x19c/0x1f0
    [ 48.482624] cleanup_page_dma.isra.64+0x62/0xb0 [i915]
    [ 48.482749] __gen8_ppgtt_cleanup+0x63/0x80 [i915]
    [ 48.482873] __gen8_ppgtt_cleanup+0x42/0x80 [i915]
    [ 48.482999] __gen8_ppgtt_clear+0x433/0x4b0 [i915]
    [ 48.483123] __gen8_ppgtt_clear+0x462/0x4b0 [i915]
    [ 48.483250] i915_vma_unbind+0x1e2/0x240 [i915]
    [ 48.483378] i915_vma_destroy+0x3a/0x280 [i915]
    [ 48.483500] __i915_gem_free_objects+0xf0/0x2d0 [i915]
    [ 48.483622] __i915_gem_free_work+0x41/0xa0 [i915]
    [ 48.483659] process_one_work+0x495/0x710
    [ 48.483704] worker_thread+0x4c7/0x6f0
    [ 48.483748] kthread+0x1b2/0x1d0
    [ 48.483787] ret_from_fork+0x1f/0x30
    [ 48.483831]
    [ 48.483868] The buggy address belongs to the object at ffff88870fc19000
    [ 48.483868] which belongs to the cache iommu_iova of size 40
    [ 48.483920] The buggy address is located 32 bytes inside of
    [ 48.483920] 40-byte region [ffff88870fc19000, ffff88870fc19028)
    [ 48.483964] The buggy address belongs to the page:
    [ 48.484006] page:ffffea001c3f0600 refcount:1 mapcount:0 mapping:ffff8888181a91c0 index:0x0 compound_mapcount: 0
    [ 48.484045] flags: 0x8000000000010200(slab|head)
    [ 48.484096] raw: 8000000000010200 ffffea001c421a08 ffffea001c447e88 ffff8888181a91c0
    [ 48.484141] raw: 0000000000000000 0000000000120012 00000001ffffffff 0000000000000000
    [ 48.484188] page dumped because: kasan: bad access detected
    [ 48.484230]
    [ 48.484265] Memory state around the buggy address:
    [ 48.484314] ffff88870fc18f00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
    [ 48.484361] ffff88870fc18f80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
    [ 48.484406] >ffff88870fc19000: fb fb fb fb fb fc fc fc fc fc fc fc fc fc fc fc
    [ 48.484451] ^
    [ 48.484494] ffff88870fc19080: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
    [ 48.484530] ffff88870fc19100: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc

    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=108602
    Fixes: e60aa7b53845 ("iommu/iova: Extend rbtree node caching")
    Signed-off-by: Chris Wilson
    Cc: Robin Murphy
    Cc: Joerg Roedel
    Cc: Joerg Roedel
    Cc: # v4.15+
    Reviewed-by: Robin Murphy
    Signed-off-by: Joerg Roedel

    Chris Wilson
     
  • The Intel VT-d driver was reworked to use the common deferred flushing
    implementation. Previously there was one global per-cpu flush queue;
    afterwards there is one per domain.

    Before deferring a flush, the queue should be allocated and initialized.

    Currently only domains with IOMMU_DOMAIN_DMA type initialize their flush
    queue. It's probably worth initializing it for static or unmanaged
    domains too, but that may be arguable - I'm leaving it to the iommu
    folks.

    Prevent queuing an iova flush if the domain doesn't have a queue.
    The defensive check seems worth keeping even if the queue were
    initialized for all kinds of domains, and it is easily backportable.
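
    A stand-alone model of that defensive check (illustrative structure and
    names, not the VT-d code): when the optional queue pointer was never
    initialized, fall back to freeing immediately instead of touching an
    uninitialized lock.

    #include <stdio.h>
    #include <stdlib.h>

    struct flush_queue { /* ring buffer, lock, ... */ int dummy; };

    struct domain {
            struct flush_queue *fq;        /* NULL unless explicitly set up */
    };

    static void queue_or_free(struct domain *dom, void *entry)
    {
            if (!dom->fq) {
                    free(entry);           /* no queue: free right away */
                    return;
            }
            /* ... otherwise defer: push the entry onto dom->fq ... */
    }

    int main(void)
    {
            struct domain si_domain = { .fq = NULL };   /* never initialized */

            queue_or_free(&si_domain, malloc(16));      /* safe immediate free */
            puts("freed immediately, no deferred queue used");
            return 0;
    }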

    On the 4.19.43 stable kernel this has a user-visible effect: previously
    there were crashes for devices in the si domain, e.g. on sata devices:

    BUG: spinlock bad magic on CPU#6, swapper/0/1
    lock: 0xffff88844f582008, .magic: 00000000, .owner: /-1, .owner_cpu: 0
    CPU: 6 PID: 1 Comm: swapper/0 Not tainted 4.19.43 #1
    Call Trace:

    dump_stack+0x61/0x7e
    spin_bug+0x9d/0xa3
    do_raw_spin_lock+0x22/0x8e
    _raw_spin_lock_irqsave+0x32/0x3a
    queue_iova+0x45/0x115
    intel_unmap+0x107/0x113
    intel_unmap_sg+0x6b/0x76
    __ata_qc_complete+0x7f/0x103
    ata_qc_complete+0x9b/0x26a
    ata_qc_complete_multiple+0xd0/0xe3
    ahci_handle_port_interrupt+0x3ee/0x48a
    ahci_handle_port_intr+0x73/0xa9
    ahci_single_level_irq_intr+0x40/0x60
    __handle_irq_event_percpu+0x7f/0x19a
    handle_irq_event_percpu+0x32/0x72
    handle_irq_event+0x38/0x56
    handle_edge_irq+0x102/0x121
    handle_irq+0x147/0x15c
    do_IRQ+0x66/0xf2
    common_interrupt+0xf/0xf
    RIP: 0010:__do_softirq+0x8c/0x2df

    The same for usb devices that use ehci-pci:
    BUG: spinlock bad magic on CPU#0, swapper/0/1
    lock: 0xffff88844f402008, .magic: 00000000, .owner: /-1, .owner_cpu: 0
    CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.19.43 #4
    Call Trace:

    dump_stack+0x61/0x7e
    spin_bug+0x9d/0xa3
    do_raw_spin_lock+0x22/0x8e
    _raw_spin_lock_irqsave+0x32/0x3a
    queue_iova+0x77/0x145
    intel_unmap+0x107/0x113
    intel_unmap_page+0xe/0x10
    usb_hcd_unmap_urb_setup_for_dma+0x53/0x9d
    usb_hcd_unmap_urb_for_dma+0x17/0x100
    unmap_urb_for_dma+0x22/0x24
    __usb_hcd_giveback_urb+0x51/0xc3
    usb_giveback_urb_bh+0x97/0xde
    tasklet_action_common.isra.4+0x5f/0xa1
    tasklet_action+0x2d/0x30
    __do_softirq+0x138/0x2df
    irq_exit+0x7d/0x8b
    smp_apic_timer_interrupt+0x10f/0x151
    apic_timer_interrupt+0xf/0x20

    RIP: 0010:_raw_spin_unlock_irqrestore+0x17/0x39

    Cc: David Woodhouse
    Cc: Joerg Roedel
    Cc: Lu Baolu
    Cc: iommu@lists.linux-foundation.org
    Cc: # 4.14+
    Fixes: 13cf01744608 ("iommu/vt-d: Make use of iova deferred flushing")
    Signed-off-by: Dmitry Safonov
    Reviewed-by: Lu Baolu
    Signed-off-by: Joerg Roedel

    Dmitry Safonov
     

05 Jun, 2019

1 commit

  • Based on 1 normalized pattern(s):

    this program is free software you can redistribute it and or modify
    it under the terms and conditions of the gnu general public license
    version 2 as published by the free software foundation this program
    is distributed in the hope it will be useful but without any
    warranty without even the implied warranty of merchantability or
    fitness for a particular purpose see the gnu general public license
    for more details you should have received a copy of the gnu general
    public license along with this program if not write to the free
    software foundation inc 59 temple place suite 330 boston ma 02111
    1307 usa

    extracted by the scancode license scanner the SPDX license identifier

    GPL-2.0-only

    has been chosen to replace the boilerplate/reference in 33 file(s).

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Allison Randal
    Reviewed-by: Kate Stewart
    Reviewed-by: Alexios Zavras
    Cc: linux-spdx@vger.kernel.org
    Link: https://lkml.kernel.org/r/20190530000435.254582722@linutronix.de
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     

22 Mar, 2019

1 commit

  • If a 32 bit allocation request is too big to possibly succeed, it
    exits early with a failure and should then never update
    max32_alloc_size. This patch fixes the current code so that the size is
    only updated if the slow path failed while walking the tree. Without
    the fix, the allocation may enter the slow path again even though a
    request of the same or a smaller size had already failed before.
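
    A toy model of the max32_alloc_size logic (a user-space sketch with
    made-up names and a bump allocator standing in for the rbtree walk, not
    the kernel code): requests at or above the recorded failure size exit
    early without updating the watermark, the watermark is only lowered
    when the real search has failed, and it is reset when space is freed.

    #include <stdbool.h>
    #include <stdio.h>

    #define SPACE_END 1024UL               /* pretend 32-bit space size */

    static unsigned long max32_alloc_size = SPACE_END + 1;  /* no failure yet */
    static unsigned long used;             /* toy bump allocator state */

    static bool alloc32(unsigned long size)
    {
            if (size >= max32_alloc_size)  /* early exit: known to fail */
                    return false;          /* note: no watermark update here */

            if (used + size > SPACE_END) { /* the (slow) search failed */
                    max32_alloc_size = size;   /* remember the failing size */
                    return false;
            }
            used += size;
            return true;
    }

    static void free32(unsigned long size)
    {
            used -= size;
            max32_alloc_size = SPACE_END + 1;  /* replenished: allow retries */
    }

    int main(void)
    {
            while (alloc32(256))           /* fill the space */
                    ;
            printf("early exit now hits for size >= %lu\n", max32_alloc_size);
            free32(256);                   /* freeing re-enables the slow path */
            printf("after free: alloc32(256) -> %d\n", alloc32(256));
            return 0;
    }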

    Cc: # 4.20+
    Fixes: bee60e94a1e2 ("iommu/iova: Optimise attempts to allocate iova from 32bit address range")
    Reviewed-by: Robin Murphy
    Signed-off-by: Robert Richter
    Signed-off-by: Joerg Roedel

    Robert Richter
     

25 Sep, 2018

1 commit

  • As an optimisation for PCI devices, an attempt is always made first to
    allocate the iova from the SAC address range. This leads to unnecessary
    attempts when there are no free ranges available. Fix this by tracking
    the most recently failed iova allocation size and allowing further
    attempts only if the requested size is smaller than the failed size.
    The tracked size is updated whenever space is replenished.

    Reviewed-by: Robin Murphy
    Signed-off-by: Ganapatrao Kulkarni
    Signed-off-by: Joerg Roedel

    Ganapatrao Kulkarni
     

22 Nov, 2017

1 commit

  • This converts all remaining cases of the old setup_timer() API into using
    timer_setup(), where the callback argument is the structure already
    holding the struct timer_list. These should have no behavioral changes,
    since they just change which pointer is passed into the callback with
    the same available pointers after conversion. It handles the following
    examples, in addition to some other variations.

    Casting from unsigned long:

    void my_callback(unsigned long data)
    {
            struct something *ptr = (struct something *)data;
            ...
    }
    ...
    setup_timer(&ptr->my_timer, my_callback, ptr);

    and forced object casts:

    void my_callback(struct something *ptr)
    {
            ...
    }
    ...
    setup_timer(&ptr->my_timer, my_callback, (unsigned long)ptr);

    become:

    void my_callback(struct timer_list *t)
    {
            struct something *ptr = from_timer(ptr, t, my_timer);
            ...
    }
    ...
    timer_setup(&ptr->my_timer, my_callback, 0);

    Direct function assignments:

    void my_callback(unsigned long data)
    {
            struct something *ptr = (struct something *)data;
            ...
    }
    ...
    ptr->my_timer.function = my_callback;

    have a temporary cast added, along with converting the args:

    void my_callback(struct timer_list *t)
    {
            struct something *ptr = from_timer(ptr, t, my_timer);
            ...
    }
    ...
    ptr->my_timer.function = (TIMER_FUNC_TYPE)my_callback;

    And finally, callbacks without a data assignment:

    void my_callback(unsigned long data)
    {
            ...
    }
    ...
    setup_timer(&ptr->my_timer, my_callback, 0);

    have their argument renamed to verify they're unused during conversion:

    void my_callback(struct timer_list *unused)
    {
            ...
    }
    ...
    timer_setup(&ptr->my_timer, my_callback, 0);

    The conversion is done with the following Coccinelle script:

    spatch --very-quiet --all-includes --include-headers \
    -I ./arch/x86/include -I ./arch/x86/include/generated \
    -I ./include -I ./arch/x86/include/uapi \
    -I ./arch/x86/include/generated/uapi -I ./include/uapi \
    -I ./include/generated/uapi --include ./include/linux/kconfig.h \
    --dir . \
    --cocci-file ~/src/data/timer_setup.cocci

    @fix_address_of@
    expression e;
    @@

    setup_timer(
    -&(e)
    +&e
    , ...)

    // Update any raw setup_timer() usages that have a NULL callback, but
    // would otherwise match change_timer_function_usage, since the latter
    // will update all function assignments done in the face of a NULL
    // function initialization in setup_timer().
    @change_timer_function_usage_NULL@
    expression _E;
    identifier _timer;
    type _cast_data;
    @@

    (
    -setup_timer(&_E->_timer, NULL, _E);
    +timer_setup(&_E->_timer, NULL, 0);
    |
    -setup_timer(&_E->_timer, NULL, (_cast_data)_E);
    +timer_setup(&_E->_timer, NULL, 0);
    |
    -setup_timer(&_E._timer, NULL, &_E);
    +timer_setup(&_E._timer, NULL, 0);
    |
    -setup_timer(&_E._timer, NULL, (_cast_data)&_E);
    +timer_setup(&_E._timer, NULL, 0);
    )

    @change_timer_function_usage@
    expression _E;
    identifier _timer;
    struct timer_list _stl;
    identifier _callback;
    type _cast_func, _cast_data;
    @@

    (
    -setup_timer(&_E->_timer, _callback, _E);
    +timer_setup(&_E->_timer, _callback, 0);
    |
    -setup_timer(&_E->_timer, &_callback, _E);
    +timer_setup(&_E->_timer, _callback, 0);
    |
    -setup_timer(&_E->_timer, _callback, (_cast_data)_E);
    +timer_setup(&_E->_timer, _callback, 0);
    |
    -setup_timer(&_E->_timer, &_callback, (_cast_data)_E);
    +timer_setup(&_E->_timer, _callback, 0);
    |
    -setup_timer(&_E->_timer, (_cast_func)_callback, _E);
    +timer_setup(&_E->_timer, _callback, 0);
    |
    -setup_timer(&_E->_timer, (_cast_func)&_callback, _E);
    +timer_setup(&_E->_timer, _callback, 0);
    |
    -setup_timer(&_E->_timer, (_cast_func)_callback, (_cast_data)_E);
    +timer_setup(&_E->_timer, _callback, 0);
    |
    -setup_timer(&_E->_timer, (_cast_func)&_callback, (_cast_data)_E);
    +timer_setup(&_E->_timer, _callback, 0);
    |
    -setup_timer(&_E._timer, _callback, (_cast_data)_E);
    +timer_setup(&_E._timer, _callback, 0);
    |
    -setup_timer(&_E._timer, _callback, (_cast_data)&_E);
    +timer_setup(&_E._timer, _callback, 0);
    |
    -setup_timer(&_E._timer, &_callback, (_cast_data)_E);
    +timer_setup(&_E._timer, _callback, 0);
    |
    -setup_timer(&_E._timer, &_callback, (_cast_data)&_E);
    +timer_setup(&_E._timer, _callback, 0);
    |
    -setup_timer(&_E._timer, (_cast_func)_callback, (_cast_data)_E);
    +timer_setup(&_E._timer, _callback, 0);
    |
    -setup_timer(&_E._timer, (_cast_func)_callback, (_cast_data)&_E);
    +timer_setup(&_E._timer, _callback, 0);
    |
    -setup_timer(&_E._timer, (_cast_func)&_callback, (_cast_data)_E);
    +timer_setup(&_E._timer, _callback, 0);
    |
    -setup_timer(&_E._timer, (_cast_func)&_callback, (_cast_data)&_E);
    +timer_setup(&_E._timer, _callback, 0);
    |
    _E->_timer@_stl.function = _callback;
    |
    _E->_timer@_stl.function = &_callback;
    |
    _E->_timer@_stl.function = (_cast_func)_callback;
    |
    _E->_timer@_stl.function = (_cast_func)&_callback;
    |
    _E._timer@_stl.function = _callback;
    |
    _E._timer@_stl.function = &_callback;
    |
    _E._timer@_stl.function = (_cast_func)_callback;
    |
    _E._timer@_stl.function = (_cast_func)&_callback;
    )

    // callback(unsigned long arg)
    @change_callback_handle_cast
    depends on change_timer_function_usage@
    identifier change_timer_function_usage._callback;
    identifier change_timer_function_usage._timer;
    type _origtype;
    identifier _origarg;
    type _handletype;
    identifier _handle;
    @@

    void _callback(
    -_origtype _origarg
    +struct timer_list *t
    )
    {
    (
    ... when != _origarg
    _handletype *_handle =
    -(_handletype *)_origarg;
    +from_timer(_handle, t, _timer);
    ... when != _origarg
    |
    ... when != _origarg
    _handletype *_handle =
    -(void *)_origarg;
    +from_timer(_handle, t, _timer);
    ... when != _origarg
    |
    ... when != _origarg
    _handletype *_handle;
    ... when != _handle
    _handle =
    -(_handletype *)_origarg;
    +from_timer(_handle, t, _timer);
    ... when != _origarg
    |
    ... when != _origarg
    _handletype *_handle;
    ... when != _handle
    _handle =
    -(void *)_origarg;
    +from_timer(_handle, t, _timer);
    ... when != _origarg
    )
    }

    // callback(unsigned long arg) without existing variable
    @change_callback_handle_cast_no_arg
    depends on change_timer_function_usage &&
    !change_callback_handle_cast@
    identifier change_timer_function_usage._callback;
    identifier change_timer_function_usage._timer;
    type _origtype;
    identifier _origarg;
    type _handletype;
    @@

    void _callback(
    -_origtype _origarg
    +struct timer_list *t
    )
    {
    + _handletype *_origarg = from_timer(_origarg, t, _timer);
    +
    ... when != _origarg
    - (_handletype *)_origarg
    + _origarg
    ... when != _origarg
    }

    // Avoid already converted callbacks.
    @match_callback_converted
    depends on change_timer_function_usage &&
    !change_callback_handle_cast &&
    !change_callback_handle_cast_no_arg@
    identifier change_timer_function_usage._callback;
    identifier t;
    @@

    void _callback(struct timer_list *t)
    { ... }

    // callback(struct something *handle)
    @change_callback_handle_arg
    depends on change_timer_function_usage &&
    !match_callback_converted &&
    !change_callback_handle_cast &&
    !change_callback_handle_cast_no_arg@
    identifier change_timer_function_usage._callback;
    identifier change_timer_function_usage._timer;
    type _handletype;
    identifier _handle;
    @@

    void _callback(
    -_handletype *_handle
    +struct timer_list *t
    )
    {
    + _handletype *_handle = from_timer(_handle, t, _timer);
    ...
    }

    // If change_callback_handle_arg ran on an empty function, remove
    // the added handler.
    @unchange_callback_handle_arg
    depends on change_timer_function_usage &&
    change_callback_handle_arg@
    identifier change_timer_function_usage._callback;
    identifier change_timer_function_usage._timer;
    type _handletype;
    identifier _handle;
    identifier t;
    @@

    void _callback(struct timer_list *t)
    {
    - _handletype *_handle = from_timer(_handle, t, _timer);
    }

    // We only want to refactor the setup_timer() data argument if we've found
    // the matching callback. This undoes changes in change_timer_function_usage.
    @unchange_timer_function_usage
    depends on change_timer_function_usage &&
    !change_callback_handle_cast &&
    !change_callback_handle_cast_no_arg &&
    !change_callback_handle_arg@
    expression change_timer_function_usage._E;
    identifier change_timer_function_usage._timer;
    identifier change_timer_function_usage._callback;
    type change_timer_function_usage._cast_data;
    @@

    (
    -timer_setup(&_E->_timer, _callback, 0);
    +setup_timer(&_E->_timer, _callback, (_cast_data)_E);
    |
    -timer_setup(&_E._timer, _callback, 0);
    +setup_timer(&_E._timer, _callback, (_cast_data)&_E);
    )

    // If we fixed a callback from a .function assignment, fix the
    // assignment cast now.
    @change_timer_function_assignment
    depends on change_timer_function_usage &&
    (change_callback_handle_cast ||
    change_callback_handle_cast_no_arg ||
    change_callback_handle_arg)@
    expression change_timer_function_usage._E;
    identifier change_timer_function_usage._timer;
    identifier change_timer_function_usage._callback;
    type _cast_func;
    typedef TIMER_FUNC_TYPE;
    @@

    (
    _E->_timer.function =
    -_callback
    +(TIMER_FUNC_TYPE)_callback
    ;
    |
    _E->_timer.function =
    -&_callback
    +(TIMER_FUNC_TYPE)_callback
    ;
    |
    _E->_timer.function =
    -(_cast_func)_callback;
    +(TIMER_FUNC_TYPE)_callback
    ;
    |
    _E->_timer.function =
    -(_cast_func)&_callback
    +(TIMER_FUNC_TYPE)_callback
    ;
    |
    _E._timer.function =
    -_callback
    +(TIMER_FUNC_TYPE)_callback
    ;
    |
    _E._timer.function =
    -&_callback;
    +(TIMER_FUNC_TYPE)_callback
    ;
    |
    _E._timer.function =
    -(_cast_func)_callback
    +(TIMER_FUNC_TYPE)_callback
    ;
    |
    _E._timer.function =
    -(_cast_func)&_callback
    +(TIMER_FUNC_TYPE)_callback
    ;
    )

    // Sometimes timer functions are called directly. Replace matched args.
    @change_timer_function_calls
    depends on change_timer_function_usage &&
    (change_callback_handle_cast ||
    change_callback_handle_cast_no_arg ||
    change_callback_handle_arg)@
    expression _E;
    identifier change_timer_function_usage._timer;
    identifier change_timer_function_usage._callback;
    type _cast_data;
    @@

    _callback(
    (
    -(_cast_data)_E
    +&_E->_timer
    |
    -(_cast_data)&_E
    +&_E._timer
    |
    -_E
    +&_E->_timer
    )
    )

    // If a timer has been configured without a data argument, it can be
    // converted without regard to the callback argument, since it is unused.
    @match_timer_function_unused_data@
    expression _E;
    identifier _timer;
    identifier _callback;
    @@

    (
    -setup_timer(&_E->_timer, _callback, 0);
    +timer_setup(&_E->_timer, _callback, 0);
    |
    -setup_timer(&_E->_timer, _callback, 0L);
    +timer_setup(&_E->_timer, _callback, 0);
    |
    -setup_timer(&_E->_timer, _callback, 0UL);
    +timer_setup(&_E->_timer, _callback, 0);
    |
    -setup_timer(&_E._timer, _callback, 0);
    +timer_setup(&_E._timer, _callback, 0);
    |
    -setup_timer(&_E._timer, _callback, 0L);
    +timer_setup(&_E._timer, _callback, 0);
    |
    -setup_timer(&_E._timer, _callback, 0UL);
    +timer_setup(&_E._timer, _callback, 0);
    |
    -setup_timer(&_timer, _callback, 0);
    +timer_setup(&_timer, _callback, 0);
    |
    -setup_timer(&_timer, _callback, 0L);
    +timer_setup(&_timer, _callback, 0);
    |
    -setup_timer(&_timer, _callback, 0UL);
    +timer_setup(&_timer, _callback, 0);
    |
    -setup_timer(_timer, _callback, 0);
    +timer_setup(_timer, _callback, 0);
    |
    -setup_timer(_timer, _callback, 0L);
    +timer_setup(_timer, _callback, 0);
    |
    -setup_timer(_timer, _callback, 0UL);
    +timer_setup(_timer, _callback, 0);
    )

    @change_callback_unused_data
    depends on match_timer_function_unused_data@
    identifier match_timer_function_unused_data._callback;
    type _origtype;
    identifier _origarg;
    @@

    void _callback(
    -_origtype _origarg
    +struct timer_list *unused
    )
    {
    ... when != _origarg
    }

    Signed-off-by: Kees Cook

    Kees Cook
     

07 Nov, 2017

1 commit

  • get_cpu_ptr() disables preemption and returns the ->fq object of the
    current CPU. raw_cpu_ptr() does the same except that it does not
    disable preemption, which means the scheduler can move the task to
    another CPU after it has obtained the per-CPU object.
    In this case that is not a problem, because the data structure itself
    is protected with a spin_lock. This change shouldn't matter in general;
    on RT, however, it does, because the sleeping lock can't be acquired
    with preemption disabled.

    Cc: Joerg Roedel
    Cc: iommu@lists.linux-foundation.org
    Reported-by: vinadhy@gmail.com
    Signed-off-by: Sebastian Andrzej Siewior
    Signed-off-by: Alex Williamson

    Sebastian Andrzej Siewior
     

12 Oct, 2017

1 commit

  • Since IOVA allocation failure is not an unusual case, we need to flush
    the CPUs' rcaches in the hope that we will succeed in the next round.

    However, it is useful to be able to decide whether the rcache flush
    step is needed, for two reasons:
    - Scalability: on a large system with ~100 CPUs, iterating over and
    flushing the rcache of each CPU becomes a serious bottleneck, so we may
    want to defer it.
    - free_cpu_cached_iovas() does not care about the max PFN we are
    interested in, so we may flush our rcaches and still get no new IOVA,
    as in the commonly used scenario:

    if (dma_limit > DMA_BIT_MASK(32) && dev_is_pci(dev))
            iova = alloc_iova_fast(iovad, iova_len, DMA_BIT_MASK(32) >> shift);

    if (!iova)
            iova = alloc_iova_fast(iovad, iova_len, dma_limit >> shift);

    1. First alloc_iova_fast() call is limited to DMA_BIT_MASK(32) to get
    PCI devices a SAC address
    2. alloc_iova() fails due to full 32-bit space
    3. rcaches contain PFNs out of 32-bit space so free_cpu_cached_iovas()
    throws entries away for nothing and alloc_iova() fails again
    4. Next alloc_iova_fast() call cannot take advantage of rcache since we
    have just defeated caches. In this case we pick the slowest option
    to proceed.

    This patch reworks the flushed_rcache local flag into an additional
    function argument that controls the rcache flush step, and updates all
    users to do the flush only as a last resort.
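
    A stand-alone sketch of the resulting call pattern (toy allocator and
    made-up helper names, not the kernel interface): the caller decides per
    attempt whether a failed lookup may flush the per-CPU caches, so the
    speculative 32-bit attempt leaves the caches intact and only the final
    attempt pays for the flush.

    #include <stdbool.h>
    #include <stdio.h>

    static bool tree_alloc(unsigned long size, unsigned long limit)
    {
            (void)size; (void)limit;
            return false;                  /* pretend the space is full */
    }

    static void flush_all_cpu_caches(void)
    {
            puts("flushing per-CPU caches (expensive, last resort)");
    }

    static bool alloc_fast(unsigned long size, unsigned long limit,
                           bool flush_rcache)
    {
            /* 1) per-CPU cache lookup, 2) fall back to the tree ... */
            if (tree_alloc(size, limit))
                    return true;
            if (!flush_rcache)
                    return false;          /* caller will retry differently */
            flush_all_cpu_caches();        /* last chance: reclaim cached IOVAs */
            return tree_alloc(size, limit);
    }

    int main(void)
    {
            unsigned long size = 8, limit32 = 1UL << 20, limit64 = ~0UL;

            /* Speculative 32-bit attempt: don't trash the caches on failure. */
            if (!alloc_fast(size, limit32, false))
                    alloc_fast(size, limit64, true);    /* flush only here */
            return 0;
    }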

    Signed-off-by: Tomasz Nowicki
    Reviewed-by: Robin Murphy
    Tested-by: Nate Watterson
    Signed-off-by: Joerg Roedel

    Tomasz Nowicki
     

02 Oct, 2017

1 commit

  • Anchor nodes are not reserved IOVAs in the way that copy_reserved_iova()
    cares about - while the failure from reserve_iova() is benign since the
    target domain will already have its own anchor, we still don't want to
    be triggering spurious warnings.

    Reported-by: kernel test robot
    Signed-off-by: Robin Murphy
    Fixes: bb68b2fbfbd6 ('iommu/iova: Add rbtree anchor node')
    Signed-off-by: Joerg Roedel

    Robin Murphy
     

28 Sep, 2017

3 commits

  • When devices with different DMA masks are using the same domain, or for
    PCI devices where we usually try a speculative 32-bit allocation first,
    there is a fair possibility that the top PFN of the rcache stack at any
    given time may be unsuitable for the lower limit, prompting a fallback
    to allocating anew from the rbtree. Consequently, we may end up
    artificially increasing pressure on the 32-bit IOVA space as unused IOVAs
    accumulate lower down in the rcache stacks, while callers with 32-bit
    masks also impose unnecessary rbtree overhead.

    In such cases, let's try a bit harder to satisfy the allocation locally
    first - scanning the whole stack should still be relatively inexpensive.
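
    A toy version of the limit-aware pop described above (a plain array
    standing in for the magazine, illustrative names, not the kernel code):
    rather than giving up when the top entry is above the limit, scan the
    whole small fixed-size stack and pull out a suitable entry.

    #include <stdbool.h>
    #include <stdio.h>

    #define MAG_SIZE 8

    struct magazine {
            unsigned long pfns[MAG_SIZE];
            unsigned int size;             /* number of valid entries */
    };

    static bool mag_pop_below(struct magazine *mag, unsigned long limit_pfn,
                              unsigned long *out)
    {
            for (unsigned int i = mag->size; i-- > 0; ) {
                    if (mag->pfns[i] <= limit_pfn) {
                            *out = mag->pfns[i];
                            mag->pfns[i] = mag->pfns[--mag->size]; /* fill hole */
                            return true;
                    }
            }
            return false;                  /* nothing suitable is cached */
    }

    int main(void)
    {
            struct magazine mag = { .pfns = { 0x200000, 0x1000, 0x300000 },
                                    .size = 3 };
            unsigned long pfn;

            /* The top entry (0x300000) is above the limit, but scanning the
             * rest of the stack still finds 0x1000 further down. */
            if (mag_pop_below(&mag, 0xfffff, &pfn))
                    printf("got pfn 0x%lx from the cache\n", pfn);
            return 0;
    }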

    Signed-off-by: Robin Murphy
    Signed-off-by: Joerg Roedel

    Robin Murphy
     
  • When popping a pfn from an rcache, we are currently checking it directly
    against limit_pfn for viability. Since this represents iova->pfn_lo, it
    is technically possible for the corresponding iova->pfn_hi to be greater
    than limit_pfn. Although we generally get away with it in practice since
    limit_pfn is typically a power-of-two boundary and the IOVAs are
    size-aligned, it's pretty trivial to make the iova_rcache_get() path
    take the allocation size into account for complete safety.

    Signed-off-by: Robin Murphy
    Signed-off-by: Joerg Roedel

    Robin Murphy
     
  • All put_iova_domain() should have to worry about is freeing memory - by
    that point the domain must no longer be live, so the act of cleaning up
    doesn't need to be concurrency-safe or maintain the rbtree in a
    self-consistent state. There's no need to waste time with locking or
    emptying the rcache magazines, and we can just use the postorder
    traversal helper to clear out the remaining rbtree entries in-place.

    Signed-off-by: Robin Murphy
    Signed-off-by: Joerg Roedel

    Robin Murphy
     

27 Sep, 2017

6 commits

  • The logic of __get_cached_rbnode() is a little obtuse, but then
    __get_prev_node_of_cached_rbnode_or_last_node_and_update_limit_pfn()
    wouldn't exactly roll off the tongue...

    Now that we have the invariant that there is always a valid node to
    start searching downwards from, everything gets a bit easier to follow
    if we simplify that function to do what it says on the tin and return
    the cached node (or anchor node as appropriate) directly. In turn, we
    can then deduplicate the rb_prev() and limit_pfn logic into the main
    loop itself, further reduce the amount of code under the lock, and
    generally make the inner workings a bit less subtle.

    Signed-off-by: Robin Murphy
    Signed-off-by: Joerg Roedel

    Robin Murphy
     
  • Add a permanent dummy IOVA reservation to the rbtree, such that we can
    always access the top of the address space instantly. The immediate
    benefit is that we remove the overhead of the rb_last() traversal when
    not using the cached node, but it also paves the way for further
    simplifications.

    Signed-off-by: Robin Murphy
    Signed-off-by: Joerg Roedel

    Robin Murphy
     
  • Now that the cached node optimisation can apply to all allocations, the
    couple of users which were playing tricks with dma_32bit_pfn in order to
    benefit from it can stop doing so. Conversely, there is also no need for
    all the other users to explicitly calculate a 'real' 32-bit PFN, when
    init_iova_domain() can happily do that itself from the page granularity.

    CC: Thierry Reding
    CC: Jonathan Hunter
    CC: David Airlie
    CC: Sudeep Dutt
    CC: Ashutosh Dixit
    Signed-off-by: Zhen Lei
    Tested-by: Ard Biesheuvel
    Tested-by: Zhen Lei
    Tested-by: Nate Watterson
    [rm: use iova_shift(), rewrote commit message]
    Signed-off-by: Robin Murphy
    Signed-off-by: Joerg Roedel

    Zhen Lei
     
  • The cached node mechanism provides a significant performance benefit for
    allocations using a 32-bit DMA mask, but in the case of non-PCI devices
    or where the 32-bit space is full, the loss of this benefit can be
    significant - on large systems there can be many thousands of entries in
    the tree, such that walking all the way down to find free space every
    time becomes increasingly awful.

    Maintain a similar cached node for the whole IOVA space as a superset of
    the 32-bit space so that performance can remain much more consistent.

    Inspired by work by Zhen Lei .

    Tested-by: Ard Biesheuvel
    Tested-by: Zhen Lei
    Tested-by: Nate Watterson
    Signed-off-by: Robin Murphy
    Signed-off-by: Joerg Roedel

    Robin Murphy
     
  • The mask for calculating the padding size doesn't change, so there's no
    need to recalculate it every loop iteration. Furthermore, once we've
    done that, it becomes clear that we don't actually need to calculate a
    padding size at all - by flipping the arithmetic around, we can just
    combine the upper limit, size, and mask directly to check against the
    lower limit.

    For an arm64 build, this alone knocks 20% off the object code size of
    the entire alloc_iova() function!
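
    A small numerical illustration of the flipped arithmetic (the exact
    kernel expressions differ; this only demonstrates the equivalence):
    instead of computing an explicit pad size and subtracting it, align the
    candidate start (limit - size) down to the alignment in one step and
    compare it directly against the lower limit.

    #include <assert.h>
    #include <stdio.h>

    int main(void)
    {
            unsigned long align = 16;      /* power of two */
            unsigned long mask = align - 1;
            unsigned long lower = 32, limit = 103, size = 16;

            /* Two-step form: compute an explicit pad, then subtract it. */
            unsigned long pad = (limit - size) & mask;
            unsigned long start_a = limit - size - pad;

            /* One-step form: align the candidate start down directly. */
            unsigned long start_b = (limit - size) & ~mask;

            assert(start_a == start_b);
            printf("candidate start %lu, fits: %s\n", start_b,
                   start_b >= lower ? "yes" : "no");
            return 0;
    }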

    Signed-off-by: Zhen Lei
    Tested-by: Ard Biesheuvel
    Tested-by: Zhen Lei
    Tested-by: Nate Watterson
    [rm: simplified more of the arithmetic, rewrote commit message]
    Signed-off-by: Robin Murphy
    Signed-off-by: Joerg Roedel

    Zhen Lei
     
  • Checking the IOVA bounds separately before deciding which direction to
    continue the search (if necessary) results in redundantly comparing both
    pfns twice each. GCC can already determine that the final comparison op
    is redundant and optimise it down to 3 in total, but we can go one
    further with a little tweak of the ordering (which makes the intent of
    the code that much cleaner as a bonus).

    Signed-off-by: Zhen Lei
    Tested-by: Ard Biesheuvel
    Tested-by: Zhen Lei
    Tested-by: Nate Watterson
    [rm: rewrote commit message to clarify]
    Signed-off-by: Robin Murphy
    Signed-off-by: Joerg Roedel

    Zhen Lei
     

16 Aug, 2017

5 commits

  • Add a timer to flush entries from the Flush-Queues every
    10ms. This makes sure that no stale TLB entries remain for
    too long after an IOVA has been unmapped.

    Signed-off-by: Joerg Roedel

    Joerg Roedel
     
  • The lock is taken from the same CPU most of the time, but having it
    allows the queue to also be flushed from another CPU if necessary.

    This will be used by a timer to regularly flush any pending IOVAs from
    the Flush-Queues.

    Signed-off-by: Joerg Roedel

    Joerg Roedel
     
  • There are two counters:

    * fq_flush_start_cnt - Increased when a TLB flush
    is started.

    * fq_flush_finish_cnt - Increased when a TLB flush
    is finished.

    The fq_flush_start_cnt is assigned to every Flush-Queue
    entry on its creation. When freeing entries from the
    Flush-Queue, the value in the entry is compared to the
    fq_flush_finish_cnt. The entry can only be freed when its
    value is less than the value of fq_flush_finish_cnt.

    The reason for these counters is to take advantage of IOMMU
    TLB flushes that happened on other CPUs. These already
    flushed the TLB for Flush-Queue entries on other CPUs so
    that they can already be freed without flushing the TLB
    again.

    This makes it less likely that the Flush-Queue is full and
    saves IOMMU TLB flushes.
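
    A toy model of the two counters (user-space sketch, not the kernel
    code): each queued entry records fq_flush_start_cnt at queue time and
    may be freed once fq_flush_finish_cnt has moved past that value, i.e.
    once at least one complete TLB flush - issued by any CPU - has happened
    after it was queued.

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdio.h>

    static atomic_ulong fq_flush_start_cnt;
    static atomic_ulong fq_flush_finish_cnt;

    struct fq_entry {
            unsigned long counter;         /* start count when queued */
    };

    static void queue_entry(struct fq_entry *e)
    {
            e->counter = atomic_load(&fq_flush_start_cnt);
    }

    static void tlb_flush(void)            /* may run on any CPU */
    {
            atomic_fetch_add(&fq_flush_start_cnt, 1);
            /* ... hardware TLB invalidation would happen here ... */
            atomic_fetch_add(&fq_flush_finish_cnt, 1);
    }

    static bool can_free(const struct fq_entry *e)
    {
            return e->counter < atomic_load(&fq_flush_finish_cnt);
    }

    int main(void)
    {
            struct fq_entry e;

            queue_entry(&e);
            printf("before any flush: can_free = %d\n", can_free(&e)); /* 0 */
            tlb_flush();                   /* e.g. performed by another CPU */
            printf("after a flush:    can_free = %d\n", can_free(&e)); /* 1 */
            return 0;
    }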

    Signed-off-by: Joerg Roedel

    Joerg Roedel
     
  • Add a function to add entries to the Flush-Queue ring
    buffer. If the buffer is full, call the flush-callback and
    free the entries.
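
    A toy version of that add path (fixed-size ring, illustrative names,
    not the kernel code): when the ring is full, run the flush callback
    once and reclaim everything the flush has just made safe to reuse.

    #include <stdio.h>

    #define FQ_RING_SIZE 4

    struct flush_queue {
            unsigned long ring[FQ_RING_SIZE];
            unsigned int head, tail;       /* free from head, add at tail */
            void (*flush_cb)(void);
    };

    static unsigned int fq_full(struct flush_queue *fq)
    {
            return (fq->tail + 1) % FQ_RING_SIZE == fq->head;
    }

    static void fq_ring_add(struct flush_queue *fq, unsigned long pfn)
    {
            if (fq_full(fq)) {
                    fq->flush_cb();        /* IOTLB flush */
                    fq->head = fq->tail;   /* pending entries now reusable */
            }
            fq->ring[fq->tail] = pfn;
            fq->tail = (fq->tail + 1) % FQ_RING_SIZE;
    }

    static void flush_cb(void) { puts("ring full -> flushing IOTLB"); }

    int main(void)
    {
            struct flush_queue fq = { .flush_cb = flush_cb };

            for (unsigned long pfn = 0; pfn < 6; pfn++)
                    fq_ring_add(&fq, pfn); /* triggers one flush (at pfn 3) */
            return 0;
    }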

    Signed-off-by: Joerg Roedel

    Joerg Roedel
     
  • This patch adds the basic data-structures to implement
    flush-queues in the generic IOVA code. It also adds the
    initialization and destroy routines for these data
    structures.

    The initialization routine is designed so that the use of
    this feature is optional for the users of IOVA code.
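
    A minimal sketch of how such an optional flush queue might hang off the
    domain structure (illustrative layout and names, not the kernel
    definitions): users that never call the init routine simply keep a NULL
    queue pointer.

    #include <stdlib.h>

    #define FQ_RING_SIZE 256

    struct fq_entry { unsigned long pfn_lo, pfn_hi, counter; };

    struct flush_queue {
            struct fq_entry ring[FQ_RING_SIZE];
            unsigned int head, tail;
    };

    struct iova_domain_model {
            struct flush_queue *fq;        /* NULL: deferred flushing unused */
            void (*flush_cb)(void);
    };

    static int init_flush_queue(struct iova_domain_model *d, void (*cb)(void))
    {
            d->fq = calloc(1, sizeof(*d->fq));
            if (!d->fq)
                    return -1;
            d->flush_cb = cb;
            return 0;
    }

    static void destroy_flush_queue(struct iova_domain_model *d)
    {
            free(d->fq);
            d->fq = NULL;
    }

    int main(void)
    {
            struct iova_domain_model d = { 0 };   /* opt-out: fq stays NULL */

            if (init_flush_queue(&d, NULL) == 0)  /* opt-in when desired */
                    destroy_flush_queue(&d);
            return 0;
    }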

    Signed-off-by: Joerg Roedel

    Joerg Roedel
     

28 Jun, 2017

2 commits

  • …re', 'x86/vt-d', 'x86/amd', 's390' and 'core' into next

    Joerg Roedel
     
  • Commit 583248e6620a ("iommu/iova: Disable preemption around use of
    this_cpu_ptr()") disables preemption while accessing a per-CPU variable.
    This does keep lockdep quiet. However I don't see why it would be bad
    if we get migrated to another CPU after accessing the variable.
    __iova_rcache_insert() and __iova_rcache_get() immediately locks the
    variable after obtaining it - before accessing its members.
    _If_ we get migrated away after retrieving the address of cpu_rcache
    before taking the lock then the *other* task on the same CPU will
    retrieve the same address of cpu_rcache and will spin on the lock.

    alloc_iova_fast() disables preemption while invoking
    free_cpu_cached_iovas() on each CPU. The function itself uses
    per_cpu_ptr() which does not trigger a warning (like this_cpu_ptr()
    does). It _could_ make sense to use get_online_cpus() instead, but we
    already have a hotplug notifier for CPU down (and none for up), so we
    are good.

    Cc: Joerg Roedel
    Cc: iommu@lists.linux-foundation.org
    Cc: Andrew Morton
    Signed-off-by: Sebastian Andrzej Siewior
    Signed-off-by: Joerg Roedel

    Sebastian Andrzej Siewior
     

17 May, 2017

1 commit

  • When walking the rbtree, the fact that iovad->start_pfn and limit_pfn
    are both inclusive limits creates an ambiguity once limit_pfn reaches
    the bottom of the address space and they overlap. Commit 5016bdb796b3
    ("iommu/iova: Fix underflow bug in __alloc_and_insert_iova_range") fixed
    the worst side-effect of this, that of underflow wraparound leading to
    bogus allocations, but the remaining fallout is that any attempt to
    allocate start_pfn itself erroneously fails.

    The cleanest way to resolve the ambiguity is to simply make limit_pfn an
    exclusive limit when inside the guts of the rbtree. Since we're working
    with PFNs, representing one past the top of the address space is always
    possible without fear of overflow, and elsewhere it just makes life a
    little more straightforward.

    Reported-by: Aaron Sierra
    Signed-off-by: Robin Murphy
    Signed-off-by: Joerg Roedel

    Robin Murphy
     

05 May, 2017

1 commit


07 Apr, 2017

1 commit

  • Normally, calling alloc_iova() using an iova_domain with insufficient
    pfns remaining between start_pfn and dma_limit will fail and return a
    NULL pointer. Unexpectedly, if such a "full" iova_domain contains an
    iova with pfn_lo == 0, the alloc_iova() call will instead succeed and
    return an iova containing invalid pfns.

    This is caused by an underflow bug in __alloc_and_insert_iova_range()
    that occurs after walking the "full" iova tree when the search ends
    at the iova with pfn_lo == 0 and limit_pfn is then adjusted to be just
    below that (-1). This (now huge) limit_pfn gives the impression that a
    vast amount of space is available between it and start_pfn and thus
    a new iova is allocated with the invalid pfn_hi value, 0xFFF.... .

    To remedy this, a check is introduced to ensure that adjustments to
    limit_pfn will not underflow.

    This issue has been observed in the wild, and is easily reproduced with
    the following sample code.

    struct iova_domain *iovad = kzalloc(sizeof(*iovad), GFP_KERNEL);
    struct iova *rsvd_iova, *good_iova, *bad_iova;
    unsigned long limit_pfn = 3;
    unsigned long start_pfn = 1;
    unsigned long va_size = 2;

    init_iova_domain(iovad, SZ_4K, start_pfn, limit_pfn);
    rsvd_iova = reserve_iova(iovad, 0, 0);
    good_iova = alloc_iova(iovad, va_size, limit_pfn, true);
    bad_iova = alloc_iova(iovad, va_size, limit_pfn, true);

    Prior to the patch, this yielded:
    *rsvd_iova == {0, 0} /* Expected */
    *good_iova == {2, 3} /* Expected */
    *bad_iova == {-2, -1} /* Oh no... */

    After the patch, bad_iova is NULL as expected since inadequate
    space remains between limit_pfn and start_pfn after allocating
    good_iova.
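
    A tiny user-space demonstration of the underflow itself (unsigned
    arithmetic only; the guard shown is a generic illustration of the check
    the patch adds): when the walk ends at the reservation with pfn_lo == 0,
    subtracting one wraps around to ULONG_MAX, which is what made the full
    domain look almost empty.

    #include <limits.h>
    #include <stdio.h>

    int main(void)
    {
            unsigned long pfn_lo = 0;      /* the reserved iova at pfn 0 */
            unsigned long limit_pfn = pfn_lo - 1;  /* wraps to ULONG_MAX */

            printf("adjusted limit_pfn = 0x%lx\n", limit_pfn);
            printf("wrapped: %s\n", limit_pfn == ULONG_MAX ? "yes" : "no");

            /* Guarded form: refuse to adjust below zero. */
            if (pfn_lo == 0)
                    puts("guard: no space left below this reservation");
            return 0;
    }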

    Signed-off-by: Nate Watterson
    Signed-off-by: Joerg Roedel

    Nate Watterson
     

21 Mar, 2017

1 commit


04 Jan, 2017

1 commit


15 Nov, 2016

1 commit

  • When searching for a free IOVA range, we optimise the tree traversal
    by starting from the cached32_node, instead of the last node, when
    limit_pfn is equal to dma_32bit_pfn. However, if limit_pfn happens to
    be smaller, then we'll go ahead and start from the top even though
    dma_32bit_pfn is still a more suitable upper bound. Since this is
    clearly a silly thing to do, adjust the lookup condition appropriately.

    Signed-off-by: Robin Murphy
    Signed-off-by: Joerg Roedel

    Robin Murphy
     

27 Jun, 2016

1 commit

  • Between acquiring the this_cpu_ptr() and using it, ideally we don't want
    to be preempted and work on another CPU's private data. this_cpu_ptr()
    checks whether or not preemption is disabled, and get_cpu_ptr() provides
    a convenient wrapper for operating on the cpu ptr inside a preemption
    disabled critical section (which currently is provided by the
    spinlock).

    [ 167.997877] BUG: using smp_processor_id() in preemptible [00000000] code: usb-storage/216
    [ 167.997940] caller is debug_smp_processor_id+0x17/0x20
    [ 167.997945] CPU: 7 PID: 216 Comm: usb-storage Tainted: G U 4.7.0-rc1-gfxbench-RO_Patchwork_1057+ #1
    [ 167.997948] Hardware name: Hewlett-Packard HP Pro 3500 Series/2ABF, BIOS 8.11 10/24/2012
    [ 167.997951] 0000000000000000 ffff880118b7f9c8 ffffffff8140dca5 0000000000000007
    [ 167.997958] ffffffff81a3a7e9 ffff880118b7f9f8 ffffffff8142a927 0000000000000000
    [ 167.997965] ffff8800d499ed58 0000000000000001 00000000000fffff ffff880118b7fa08
    [ 167.997971] Call Trace:
    [ 167.997977] [] dump_stack+0x67/0x92
    [ 167.997981] [] check_preemption_disabled+0xd7/0xe0
    [ 167.997985] [] debug_smp_processor_id+0x17/0x20
    [ 167.997990] [] alloc_iova_fast+0xb7/0x210
    [ 167.997994] [] intel_alloc_iova+0x7f/0xd0
    [ 167.997998] [] intel_map_sg+0xbd/0x240
    [ 167.998002] [] ? debug_lockdep_rcu_enabled+0x1d/0x20
    [ 167.998009] [] usb_hcd_map_urb_for_dma+0x4b9/0x5a0
    [ 167.998013] [] usb_hcd_submit_urb+0xe9/0xaa0
    [ 167.998017] [] ? mark_held_locks+0x6f/0xa0
    [ 167.998022] [] ? __raw_spin_lock_init+0x1c/0x50
    [ 167.998025] [] ? debug_lockdep_rcu_enabled+0x1d/0x20
    [ 167.998028] [] usb_submit_urb+0x3f3/0x5a0
    [ 167.998032] [] ? trace_hardirqs_on_caller+0x122/0x1b0
    [ 167.998035] [] usb_sg_wait+0x67/0x150
    [ 167.998039] [] usb_stor_bulk_transfer_sglist.part.3+0x82/0xd0
    [ 167.998042] [] usb_stor_bulk_srb+0x4c/0x60
    [ 167.998045] [] usb_stor_Bulk_transport+0x17e/0x420
    [ 167.998049] [] usb_stor_invoke_transport+0x242/0x540
    [ 167.998052] [] ? debug_lockdep_rcu_enabled+0x1d/0x20
    [ 167.998058] [] usb_stor_transparent_scsi_command+0x9/0x10
    [ 167.998061] [] usb_stor_control_thread+0x158/0x260
    [ 167.998064] [] ? fill_inquiry_response+0x20/0x20
    [ 167.998067] [] ? fill_inquiry_response+0x20/0x20
    [ 167.998071] [] kthread+0xea/0x100
    [ 167.998078] [] ret_from_fork+0x1f/0x40
    [ 167.998081] [] ? kthread_create_on_node+0x1f0/0x1f0

    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=96293
    Signed-off-by: Chris Wilson
    Cc: Joonas Lahtinen
    Cc: Joerg Roedel
    Cc: iommu@lists.linux-foundation.org
    Cc: linux-kernel@vger.kernel.org
    Fixes: 9257b4a206fc ('iommu/iova: introduce per-cpu caching to iova allocation')
    Signed-off-by: Joerg Roedel

    Chris Wilson
     

21 Apr, 2016

1 commit

  • IOVA allocation has two problems that impede high-throughput I/O.
    First, it can do a linear search over the allocated IOVA ranges.
    Second, the rbtree spinlock that serializes IOVA allocations becomes
    contended.

    Address these problems by creating an API for caching allocated IOVA
    ranges, so that the IOVA allocator isn't accessed frequently. This
    patch adds a per-CPU cache, from which CPUs can alloc/free IOVAs
    without taking the rbtree spinlock. The per-CPU caches are backed by
    a global cache, to avoid invoking the (linear-time) IOVA allocator
    without needing to make the per-CPU cache size excessive. This design
    is based on magazines, as described in "Magazines and Vmem: Extending
    the Slab Allocator to Many CPUs and Arbitrary Resources" (currently
    available at https://www.usenix.org/legacy/event/usenix01/bonwick.html)
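
    A compact, single-threaded model of the magazine scheme (illustrative
    sizes and names, not the kernel implementation): frees go into a small
    fixed-size magazine; when it is full, the whole magazine is handed to a
    global depot, and an empty magazine is refilled from the depot before
    falling back to the slow, locked allocator.

    #include <stdbool.h>
    #include <stdio.h>

    #define MAG_SIZE   16                  /* entries per magazine */
    #define DEPOT_SIZE 4                   /* full magazines kept globally */

    struct magazine { unsigned long pfns[MAG_SIZE]; unsigned int size; };

    static struct magazine cpu_mag;        /* one per CPU in the real design */
    static struct magazine depot[DEPOT_SIZE];  /* global, lock-protected */
    static unsigned int depot_full;        /* number of full magazines held */

    static bool cache_free(unsigned long pfn)  /* fast path for free */
    {
            if (cpu_mag.size == MAG_SIZE) {
                    if (depot_full == DEPOT_SIZE)
                            return false;  /* caller frees into the rbtree */
                    depot[depot_full++] = cpu_mag; /* hand full mag to depot */
                    cpu_mag.size = 0;
            }
            cpu_mag.pfns[cpu_mag.size++] = pfn;
            return true;
    }

    static bool cache_alloc(unsigned long *pfn)  /* fast path for alloc */
    {
            if (cpu_mag.size == 0) {
                    if (depot_full == 0)
                            return false;  /* caller goes to the rbtree */
                    cpu_mag = depot[--depot_full]; /* refill from the depot */
            }
            *pfn = cpu_mag.pfns[--cpu_mag.size];
            return true;
    }

    int main(void)
    {
            unsigned long pfn = 0;

            for (unsigned long i = 0; i < 40; i++)  /* fill mag and depot */
                    cache_free(0x1000 + i);
            while (cache_alloc(&pfn))
                    ;                      /* drain without the rbtree lock */
            printf("cache drained, last pfn 0x%lx\n", pfn);
            return 0;
    }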

    Adding caching on top of the existing rbtree allocator maintains the
    property that IOVAs are densely packed in the IO virtual address space,
    which is important for keeping IOMMU page table usage low.

    To keep the cache size reasonable, we bound the IOVA space a CPU can
    cache by 32 MiB (we cache a bounded number of IOVA ranges, and only
    ranges of size
    [mad@cs.technion.ac.il: rebased, cleaned up and reworded the commit message]
    Signed-off-by: Adam Morrison
    Reviewed-by: Shaohua Li
    Reviewed-by: Ben Serebrin
    [dwmw2: split out VT-d part into a separate patch]
    Signed-off-by: David Woodhouse

    Omer Peleg
     

28 Jul, 2015

3 commits