22 Aug, 2018

2 commits

  • Delete ida_pre_get(), ida_get_new(), ida_get_new_above() and ida_remove()
    from the public API. Some of these functions still exist as internal
    helpers, but they should not be called by consumers.

    Signed-off-by: Matthew Wilcox

    Matthew Wilcox
     
  • get_slot_offset() can be called with a NULL 'parent' argument.
    In this case, the calculated value will not be used, but calculating
    it is undefined. Rather than fixing the caller (__radix_tree_delete)
    to not call get_slot_offset(), make get_slot_offset() robust against
    being called with a NULL parent.

    Signed-off-by: Matthew Wilcox

    Matthew Wilcox
     

26 May, 2018

1 commit

  • If the radix tree underlying the IDR happens to be full and we attempt
    to remove an id which is larger than any id in the IDR, we will call
    __radix_tree_delete() with an uninitialised 'slot' pointer, at which
    point anything could happen. This was easiest to hit with a single
    entry at id 0 and attempting to remove a non-0 id, but it could have
    happened with 64 entries and attempting to remove an id >= 64.

    Roman said:

    The syzcaller test boils down to opening /dev/kvm, creating an
    eventfd, and calling a couple of KVM ioctls. None of this requires
    superuser. And the result is dereferencing an uninitialized pointer
    which is likely a crash. The specific path caught by syzbot is via
    KVM_HYPERV_EVENTD ioctl which is new in 4.17. But I guess there are
    other user-triggerable paths, so cc:stable is probably justified.

    Matthew added:

    We have around 250 calls to idr_remove() in the kernel today. Many of
    them pass an ID which is embedded in the object they're removing, so
    they're safe. Picking a few likely candidates:

    drivers/firewire/core-cdev.c looks unsafe; the ID comes from an ioctl.
    drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c is similar
    drivers/atm/nicstar.c could be taken down by a handcrafted packet

    Link: http://lkml.kernel.org/r/20180518175025.GD6361@bombadil.infradead.org
    Fixes: 0a835c4f090a ("Reimplement IDR and IDA using the radix tree")
    Reported-by:
    Debugged-by: Roman Kagan
    Signed-off-by: Matthew Wilcox
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     

19 May, 2018

1 commit

  • Fix a race in the multi-order iteration code which causes the kernel to
    hit a GP fault. This was first seen with a production v4.15 based
    kernel (4.15.6-300.fc27.x86_64) utilizing a DAX workload which used
    order 9 PMD DAX entries.

    The race has to do with how we tear down multi-order sibling entries
    when we are removing an item from the tree. Remember for example that
    an order 2 entry looks like this:

    struct radix_tree_node.slots[] = [entry][sibling][sibling][sibling]

    where 'entry' is in some slot in the struct radix_tree_node, and the
    three slots following 'entry' contain sibling pointers which point back
    to 'entry.'

    When we delete 'entry' from the tree, we call :

    radix_tree_delete()
    radix_tree_delete_item()
    __radix_tree_delete()
    replace_slot()

    replace_slot() first removes the siblings in order from the first to the
    last, then at then replaces 'entry' with NULL. This means that for a
    brief period of time we end up with one or more of the siblings removed,
    so:

    struct radix_tree_node.slots[] = [entry][NULL][sibling][sibling]

    This causes an issue if you have a reader iterating over the slots in
    the tree via radix_tree_for_each_slot() while only under
    rcu_read_lock()/rcu_read_unlock() protection. This is a common case in
    mm/filemap.c.

    The issue is that when __radix_tree_next_slot() => skip_siblings() tries
    to skip over the sibling entries in the slots, it currently does so with
    an exact match on the slot directly preceding our current slot.
    Normally this works:

    V preceding slot
    struct radix_tree_node.slots[] = [entry][sibling][sibling][sibling]
    ^ current slot

    This lets you find the first sibling, and you skip them all in order.

    But in the case where one of the siblings is NULL, that slot is skipped
    and then our sibling detection is interrupted:

    V preceding slot
    struct radix_tree_node.slots[] = [entry][NULL][sibling][sibling]
    ^ current slot

    This means that the sibling pointers aren't recognized since they point
    all the way back to 'entry', so we think that they are normal internal
    radix tree pointers. This causes us to think we need to walk down to a
    struct radix_tree_node starting at the address of 'entry'.

    In a real running kernel this will crash the thread with a GP fault when
    you try and dereference the slots in your broken node starting at
    'entry'.

    We fix this race by fixing the way that skip_siblings() detects sibling
    nodes. Instead of testing against the preceding slot we instead look
    for siblings via is_sibling_entry() which compares against the position
    of the struct radix_tree_node.slots[] array. This ensures that sibling
    entries are properly identified, even if they are no longer contiguous
    with the 'entry' they point to.

    Link: http://lkml.kernel.org/r/20180503192430.7582-6-ross.zwisler@linux.intel.com
    Fixes: 148deab223b2 ("radix-tree: improve multiorder iterators")
    Signed-off-by: Ross Zwisler
    Reported-by: CR, Sapthagirish
    Reviewed-by: Jan Kara
    Cc: Matthew Wilcox
    Cc: Christoph Hellwig
    Cc: Dan Williams
    Cc: Dave Chinner
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ross Zwisler
     

12 Apr, 2018

1 commit

  • Patch series "XArray", v9. (First part thereof).

    This patchset is, I believe, appropriate for merging for 4.17. It
    contains the XArray implementation, to eventually replace the radix
    tree, and converts the page cache to use it.

    This conversion keeps the radix tree and XArray data structures in sync
    at all times. That allows us to convert the page cache one function at
    a time and should allow for easier bisection. Other than renaming some
    elements of the structures, the data structures are fundamentally
    unchanged; a radix tree walk and an XArray walk will touch the same
    number of cachelines. I have changes planned to the XArray data
    structure, but those will happen in future patches.

    Improvements the XArray has over the radix tree:

    - The radix tree provides operations like other trees do; 'insert' and
    'delete'. But what most users really want is an automatically
    resizing array, and so it makes more sense to give users an API that
    is like an array -- 'load' and 'store'. We still have an 'insert'
    operation for users that really want that semantic.

    - The XArray considers locking as part of its API. This simplifies a
    lot of users who formerly had to manage their own locking just for
    the radix tree. It also improves code generation as we can now tell
    RCU that we're holding a lock and it doesn't need to generate as much
    fencing code. The other advantage is that tree nodes can be moved
    (not yet implemented).

    - GFP flags are now parameters to calls which may need to allocate
    memory. The radix tree forced users to decide what the allocation
    flags would be at creation time. It's much clearer to specify them at
    allocation time.

    - Memory is not preloaded; we don't tie up dozens of pages on the off
    chance that the slab allocator fails. Instead, we drop the lock,
    allocate a new node and retry the operation. We have to convert all
    the radix tree, IDA and IDR preload users before we can realise this
    benefit, but I have not yet found a user which cannot be converted.

    - The XArray provides a cmpxchg operation. The radix tree forces users
    to roll their own (and at least four have).

    - Iterators take a 'max' parameter. That simplifies many users and will
    reduce the amount of iteration done.

    - Iteration can proceed backwards. We only have one user for this, but
    since it's called as part of the pagefault readahead algorithm, that
    seemed worth mentioning.

    - RCU-protected pointers are not exposed as part of the API. There are
    some fun bugs where the page cache forgets to use rcu_dereference()
    in the current codebase.

    - Value entries gain an extra bit compared to radix tree exceptional
    entries. That gives us the extra bit we need to put huge page swap
    entries in the page cache.

    - Some iterators now take a 'filter' argument instead of having
    separate iterators for tagged/untagged iterations.

    The page cache is improved by this:

    - Shorter, easier to read code

    - More efficient iterations

    - Reduction in size of struct address_space

    - Fewer walks from the top of the data structure; the XArray API
    encourages staying at the leaf node and conducting operations there.

    This patch (of 8):

    None of these bits may be used for slab allocations, so we can use them
    as radix tree flags as long as we mask them off before passing them to
    the slab allocator. Move the IDR flag from the high bits to the
    GFP_ZONEMASK bits.

    Link: http://lkml.kernel.org/r/20180313132639.17387-3-willy@infradead.org
    Signed-off-by: Matthew Wilcox
    Acked-by: Jeff Layton
    Cc: Darrick J. Wong
    Cc: Dave Chinner
    Cc: Ryusuke Konishi
    Cc: Will Deacon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     

22 Feb, 2018

1 commit

  • As far as I can tell, the only place the per-cpu ida_bitmap is populated
    is in ida_pre_get. The pre-allocated element is stolen in two places in
    ida_get_new_above, in both cases immediately followed by a memset(0).

    Since ida_get_new_above is called with locks held, do the zeroing in
    ida_pre_get, or rather let kmalloc() do it. Also, apparently gcc
    generates ~44 bytes of code to do a memset(, 0, 128):

    $ scripts/bloat-o-meter vmlinux.{0,1}
    add/remove: 0/0 grow/shrink: 2/1 up/down: 5/-88 (-83)
    Function old new delta
    ida_pre_get 115 119 +4
    vermagic 27 28 +1
    ida_get_new_above 715 627 -88

    Link: http://lkml.kernel.org/r/20180108225634.15340-1-linux@rasmusvillemoes.dk
    Signed-off-by: Rasmus Villemoes
    Acked-by: Matthew Wilcox
    Cc: Eric Biggers
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rasmus Villemoes
     

07 Feb, 2018

1 commit

  • It has no more users, so remove it. Move idr_alloc() back into idr.c,
    move the guts of idr_alloc_cmn() into idr_alloc_u32(), remove the
    wrappers around idr_get_free_cmn() and rename it to idr_get_free().
    While there is now no interface to allocate IDs larger than a u32,
    the IDR internals remain ready to handle a larger ID should a need arise.

    These changes make it possible to provide the guarantee that, if the
    nextid pointer points into the object, the object's ID will be initialised
    before a concurrent lookup can find the object.

    Signed-off-by: Matthew Wilcox

    Matthew Wilcox
     

16 Nov, 2017

1 commit

  • During truncation, the mapping has already been checked for shmem and
    dax so it's known that workingset_update_node is required.

    This patch avoids the checks on mapping for each page being truncated.
    In all other cases, a lookup helper is used to determine if
    workingset_update_node() needs to be called. The one danger is that the
    API is slightly harder to use as calling workingset_update_node directly
    without checking for dax or shmem mappings could lead to surprises.
    However, the API rarely needs to be used and hopefully the comment is
    enough to give people the hint.

    sparsetruncate (tiny)
    4.14.0-rc4 4.14.0-rc4
    oneirq-v1r1 pickhelper-v1r1
    Min Time 141.00 ( 0.00%) 140.00 ( 0.71%)
    1st-qrtle Time 142.00 ( 0.00%) 141.00 ( 0.70%)
    2nd-qrtle Time 142.00 ( 0.00%) 142.00 ( 0.00%)
    3rd-qrtle Time 143.00 ( 0.00%) 143.00 ( 0.00%)
    Max-90% Time 144.00 ( 0.00%) 144.00 ( 0.00%)
    Max-95% Time 147.00 ( 0.00%) 145.00 ( 1.36%)
    Max-99% Time 195.00 ( 0.00%) 191.00 ( 2.05%)
    Max Time 230.00 ( 0.00%) 205.00 ( 10.87%)
    Amean Time 144.37 ( 0.00%) 143.82 ( 0.38%)
    Stddev Time 10.44 ( 0.00%) 9.00 ( 13.74%)
    Coeff Time 7.23 ( 0.00%) 6.26 ( 13.41%)
    Best99%Amean Time 143.72 ( 0.00%) 143.34 ( 0.26%)
    Best95%Amean Time 142.37 ( 0.00%) 142.00 ( 0.26%)
    Best90%Amean Time 142.19 ( 0.00%) 141.85 ( 0.24%)
    Best75%Amean Time 141.92 ( 0.00%) 141.58 ( 0.24%)
    Best50%Amean Time 141.69 ( 0.00%) 141.31 ( 0.27%)
    Best25%Amean Time 141.38 ( 0.00%) 140.97 ( 0.29%)

    As you'd expect, the gain is marginal but it can be detected. The
    differences in bonnie are all within the noise which is not surprising
    given the impact on the microbenchmark.

    radix_tree_update_node_t is a callback for some radix operations that
    optionally passes in a private field. The only user of the callback is
    workingset_update_node and as it no longer requires a mapping, the
    private field is removed.

    Link: http://lkml.kernel.org/r/20171018075952.10627-3-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Johannes Weiner
    Reviewed-by: Jan Kara
    Cc: Andi Kleen
    Cc: Dave Chinner
    Cc: Dave Hansen
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

09 Sep, 2017

1 commit

  • __radix_tree_preload() only disables preemption if no error is returned.

    So we really need to make sure callers always check the return value.

    idr_preload() contract is to always disable preemption, so we need
    to add a missing preempt_disable() if an error happened.

    Similarly, ida_pre_get() only needs to call preempt_enable() in the
    case no error happened.

    Link: http://lkml.kernel.org/r/1504637190.15310.62.camel@edumazet-glaptop3.roam.corp.google.com
    Fixes: 0a835c4f090a ("Reimplement IDR and IDA using the radix tree")
    Fixes: 7ad3d4d85c7a ("ida: Move ida_bitmap to a percpu variable")
    Signed-off-by: Eric Dumazet
    Cc: Matthew Wilcox
    Cc: "Kirill A. Shutemov"
    Cc: [4.11+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric Dumazet
     

07 Sep, 2017

1 commit

  • Pull networking updates from David Miller:

    1) Support ipv6 checksum offload in sunvnet driver, from Shannon
    Nelson.

    2) Move to RB-tree instead of custom AVL code in inetpeer, from Eric
    Dumazet.

    3) Allow generic XDP to work on virtual devices, from John Fastabend.

    4) Add bpf device maps and XDP_REDIRECT, which can be used to build
    arbitrary switching frameworks using XDP. From John Fastabend.

    5) Remove UFO offloads from the tree, gave us little other than bugs.

    6) Remove the IPSEC flow cache, from Florian Westphal.

    7) Support ipv6 route offload in mlxsw driver.

    8) Support VF representors in bnxt_en, from Sathya Perla.

    9) Add support for forward error correction modes to ethtool, from
    Vidya Sagar Ravipati.

    10) Add time filter for packet scheduler action dumping, from Jamal Hadi
    Salim.

    11) Extend the zerocopy sendmsg() used by virtio and tap to regular
    sockets via MSG_ZEROCOPY. From Willem de Bruijn.

    12) Significantly rework value tracking in the BPF verifier, from Edward
    Cree.

    13) Add new jump instructions to eBPF, from Daniel Borkmann.

    14) Rework rtnetlink plumbing so that operations can be run without
    taking the RTNL semaphore. From Florian Westphal.

    15) Support XDP in tap driver, from Jason Wang.

    16) Add 32-bit eBPF JIT for ARM, from Shubham Bansal.

    17) Add Huawei hinic ethernet driver.

    18) Allow to report MD5 keys in TCP inet_diag dumps, from Ivan
    Delalande.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1780 commits)
    i40e: point wb_desc at the nvm_wb_desc during i40e_read_nvm_aq
    i40e: avoid NVM acquire deadlock during NVM update
    drivers: net: xgene: Remove return statement from void function
    drivers: net: xgene: Configure tx/rx delay for ACPI
    drivers: net: xgene: Read tx/rx delay for ACPI
    rocker: fix kcalloc parameter order
    rds: Fix non-atomic operation on shared flag variable
    net: sched: don't use GFP_KERNEL under spin lock
    vhost_net: correctly check tx avail during rx busy polling
    net: mdio-mux: add mdio_mux parameter to mdio_mux_init()
    rxrpc: Make service connection lookup always check for retry
    net: stmmac: Delete dead code for MDIO registration
    gianfar: Fix Tx flow control deactivation
    cxgb4: Ignore MPS_TX_INT_CAUSE[Bubble] for T6
    cxgb4: Fix pause frame count in t4_get_port_stats
    cxgb4: fix memory leak
    tun: rename generic_xdp to skb_xdp
    tun: reserve extra headroom only when XDP is set
    net: dsa: bcm_sf2: Configure IMP port TC2QOS mapping
    net: dsa: bcm_sf2: Advertise number of egress queues
    ...

    Linus Torvalds
     

31 Aug, 2017

1 commit

  • The following new APIs are added:

    int idr_alloc_ext(struct idr *idr, void *ptr, unsigned long *index,
    unsigned long start, unsigned long end, gfp_t gfp);
    void *idr_remove_ext(struct idr *idr, unsigned long id);
    void *idr_find_ext(const struct idr *idr, unsigned long id);
    void *idr_replace_ext(struct idr *idr, void *ptr, unsigned long id);
    void *idr_get_next_ext(struct idr *idr, unsigned long *nextid);

    Signed-off-by: Chris Mi
    Signed-off-by: Jiri Pirko
    Signed-off-by: David S. Miller

    Chris Mi
     

18 Aug, 2017

1 commit

  • This was the competing idea long ago, but it was only with the rewrite
    of the idr as an radixtree and using the radixtree directly ourselves,
    along with the realisation that we can store the vma directly in the
    radixtree and only need a list for the reverse mapping, that made the
    patch performant enough to displace using a hashtable. Though the vma ht
    is fast and doesn't require any extra allocation (as we can embed the node
    inside the vma), it does require a thread for resizing and serialization
    and will have the occasional slow lookup. That is hairy enough to
    investigate alternatives and favour them if equivalent in peak performance.
    One advantage of allocating an indirection entry is that we can support a
    single shared bo between many clients, something that was done on a
    first-come first-serve basis for shared GGTT vma previously. To offset
    the extra allocations, we create yet another kmem_cache for them.

    Signed-off-by: Chris Wilson
    Reviewed-by: Tvrtko Ursulin
    Link: https://patchwork.freedesktop.org/patch/msgid/20170816085210.4199-5-chris@chris-wilson.co.uk

    Chris Wilson
     

04 May, 2017

1 commit

  • The current implementation of the reclaim lockup detection can lead to
    false positives and those even happen and usually lead to tweak the code
    to silence the lockdep by using GFP_NOFS even though the context can use
    __GFP_FS just fine.

    See

    http://lkml.kernel.org/r/20160512080321.GA18496@dastard

    as an example.

    =================================
    [ INFO: inconsistent lock state ]
    4.5.0-rc2+ #4 Tainted: G O
    ---------------------------------
    inconsistent {RECLAIM_FS-ON-R} -> {IN-RECLAIM_FS-W} usage.
    kswapd0/543 [HC0[0]:SC0[0]:HE1:SE1] takes:

    (&xfs_nondir_ilock_class){++++-+}, at: xfs_ilock+0x177/0x200 [xfs]

    {RECLAIM_FS-ON-R} state was registered at:
    mark_held_locks+0x79/0xa0
    lockdep_trace_alloc+0xb3/0x100
    kmem_cache_alloc+0x33/0x230
    kmem_zone_alloc+0x81/0x120 [xfs]
    xfs_refcountbt_init_cursor+0x3e/0xa0 [xfs]
    __xfs_refcount_find_shared+0x75/0x580 [xfs]
    xfs_refcount_find_shared+0x84/0xb0 [xfs]
    xfs_getbmap+0x608/0x8c0 [xfs]
    xfs_vn_fiemap+0xab/0xc0 [xfs]
    do_vfs_ioctl+0x498/0x670
    SyS_ioctl+0x79/0x90
    entry_SYSCALL_64_fastpath+0x12/0x6f

    CPU0
    ----
    lock(&xfs_nondir_ilock_class);

    lock(&xfs_nondir_ilock_class);

    *** DEADLOCK ***

    3 locks held by kswapd0/543:

    stack backtrace:
    CPU: 0 PID: 543 Comm: kswapd0 Tainted: G O 4.5.0-rc2+ #4
    Call Trace:
    lock_acquire+0xd8/0x1e0
    down_write_nested+0x5e/0xc0
    xfs_ilock+0x177/0x200 [xfs]
    xfs_reflink_cancel_cow_range+0x150/0x300 [xfs]
    xfs_fs_evict_inode+0xdc/0x1e0 [xfs]
    evict+0xc5/0x190
    dispose_list+0x39/0x60
    prune_icache_sb+0x4b/0x60
    super_cache_scan+0x14f/0x1a0
    shrink_slab.part.63.constprop.79+0x1e9/0x4e0
    shrink_zone+0x15e/0x170
    kswapd+0x4f1/0xa80
    kthread+0xf2/0x110
    ret_from_fork+0x3f/0x70

    To quote Dave:
    "Ignoring whether reflink should be doing anything or not, that's a
    "xfs_refcountbt_init_cursor() gets called both outside and inside
    transactions" lockdep false positive case. The problem here is lockdep
    has seen this allocation from within a transaction, hence a GFP_NOFS
    allocation, and now it's seeing it in a GFP_KERNEL context. Also note
    that we have an active reference to this inode.

    So, because the reclaim annotations overload the interrupt level
    detections and it's seen the inode ilock been taken in reclaim
    ("interrupt") context, this triggers a reclaim context warning where
    it thinks it is unsafe to do this allocation in GFP_KERNEL context
    holding the inode ilock..."

    This sounds like a fundamental problem of the reclaim lock detection.
    It is really impossible to annotate such a special usecase IMHO unless
    the reclaim lockup detection is reworked completely. Until then it is
    much better to provide a way to add "I know what I am doing flag" and
    mark problematic places. This would prevent from abusing GFP_NOFS flag
    which has a runtime effect even on configurations which have lockdep
    disabled.

    Introduce __GFP_NOLOCKDEP flag which tells the lockdep gfp tracking to
    skip the current allocation request.

    While we are at it also make sure that the radix tree doesn't
    accidentaly override tags stored in the upper part of the gfp_mask.

    Link: http://lkml.kernel.org/r/20170306131408.9828-3-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Suggested-by: Peter Zijlstra
    Acked-by: Peter Zijlstra (Intel)
    Acked-by: Vlastimil Babka
    Cc: Dave Chinner
    Cc: Theodore Ts'o
    Cc: Chris Mason
    Cc: David Sterba
    Cc: Jan Kara
    Cc: Brian Foster
    Cc: Darrick J. Wong
    Cc: Nikolay Borisov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

08 Mar, 2017

1 commit

  • There's a relatively rare race where we look at the per-cpu preallocated
    IDA bitmap, see it's NULL, allocate a new one, and atomically update it.
    If the kmalloc() happened to sleep and we were rescheduled to a different
    CPU, or an interrupt came in at the exact right time, another task
    might have successfully allocated a bitmap and already deposited it.
    I forgot what the semantics of cmpxchg() were and ended up freeing the
    wrong bitmap leading to KASAN reporting a use-after-free.

    Dmitry found the bug with syzkaller & wrote the patch. I wrote the test
    case that will reproduce the bug without his patch being applied.

    Reported-by: Dmitry Vyukov
    Signed-off-by: Matthew Wilcox

    Matthew Wilcox
     

01 Mar, 2017

1 commit

  • Pull IDR rewrite from Matthew Wilcox:
    "The most significant part of the following is the patch to rewrite the
    IDR & IDA to be clients of the radix tree. But there's much more,
    including an enhancement of the IDA to be significantly more space
    efficient, an IDR & IDA test suite, some improvements to the IDR API
    (and driver changes to take advantage of those improvements), several
    improvements to the radix tree test suite and RCU annotations.

    The IDR & IDA rewrite had a good spin in linux-next and Andrew's tree
    for most of the last cycle. Coupled with the IDR test suite, I feel
    pretty confident that any remaining bugs are quite hard to hit. 0-day
    did a great job of watching my git tree and pointing out problems; as
    it hit them, I added new test-cases to be sure not to be caught the
    same way twice"

    Willy goes on to expand a bit on the IDR rewrite rationale:
    "The radix tree and the IDR use very similar data structures.

    Merging the two codebases lets us share the memory allocation pools,
    and results in a net deletion of 500 lines of code. It also opens up
    the possibility of exposing more of the features of the radix tree to
    users of the IDR (and I have some interesting patches along those
    lines waiting for 4.12)

    It also shrinks the size of the 'struct idr' from 40 bytes to 24 which
    will shrink a fair few data structures that embed an IDR"

    * 'idr-4.11' of git://git.infradead.org/users/willy/linux-dax: (32 commits)
    radix tree test suite: Add config option for map shift
    idr: Add missing __rcu annotations
    radix-tree: Fix __rcu annotations
    radix-tree: Add rcu_dereference and rcu_assign_pointer calls
    radix tree test suite: Run iteration tests for longer
    radix tree test suite: Fix split/join memory leaks
    radix tree test suite: Fix leaks in regression2.c
    radix tree test suite: Fix leaky tests
    radix tree test suite: Enable address sanitizer
    radix_tree_iter_resume: Fix out of bounds error
    radix-tree: Store a pointer to the root in each node
    radix-tree: Chain preallocated nodes through ->parent
    radix tree test suite: Dial down verbosity with -v
    radix tree test suite: Introduce kmalloc_verbose
    idr: Return the deleted entry from idr_remove
    radix tree test suite: Build separate binaries for some tests
    ida: Use exceptional entries for small IDAs
    ida: Move ida_bitmap to a percpu variable
    Reimplement IDR and IDA using the radix tree
    radix-tree: Add radix_tree_iter_delete
    ...

    Linus Torvalds
     

14 Feb, 2017

11 commits

  • Many places were missing __rcu annotations. A few places needed a few
    lines of explanation about why it was safe to not use RCU accessors.
    Add a custom CFLAGS setting to the Makefile to ensure that new patches
    don't miss RCU annotations.

    Signed-off-by: Matthew Wilcox

    Matthew Wilcox
     
  • Some of these have been missing for many years. Others were recently
    introduced by me. Fortunately, we have tools that help us find such
    things.

    Signed-off-by: Matthew Wilcox

    Matthew Wilcox
     
  • The address sanitizer occasionally finds an out of bounds error while
    running the test-suite. It turned out to be a read of the pointer
    immediately next to the tree root, but this out of bounds error could
    have occurred elsewhere. This happens because radix_tree_iter_resume()
    dereferences 'slot' before checking whether we've come to the end of
    the chunk. We can just delete this line; the value was never used.

    Signed-off-by: Matthew Wilcox

    Matthew Wilcox
     
  • Instead of having this mysterious private_data in each radix_tree_node,
    store a pointer to the root, which can be useful for debugging. This also
    relieves the mm code from the duty of updating it.

    Acked-by: Johannes Weiner
    Signed-off-by: Matthew Wilcox

    Matthew Wilcox
     
  • Chaining through the ->private_data member means we have to zero
    ->private_data after removing preallocated nodes from the list.
    We're about to initialise ->parent anyway, so we can avoid zeroing it.

    Signed-off-by: Matthew Wilcox

    Matthew Wilcox
     
  • We can use the root entry as a bitmap and save allocating a 128 byte
    bitmap for an IDA that contains only a few entries (30 on a 32-bit
    machine, 62 on a 64-bit machine). This costs about 300 bytes of kernel
    text on x86-64, so as long as 3 IDAs fall into this category, this
    is a net win for memory consumption.

    Thanks to Rasmus Villemoes for his work documenting the problem and
    collecting statistics on IDAs.

    Signed-off-by: Matthew Wilcox

    Matthew Wilcox
     
  • When we preload the IDA, we allocate an IDA bitmap. Instead of storing
    that preallocated bitmap in the IDA, we store it in a percpu variable.
    Generally there are more IDAs in the system than CPUs, so this cuts down
    on the number of preallocated bitmaps that are unused, and about half
    of the IDA users did not call ida_destroy() so they were leaking IDA
    bitmaps.

    Signed-off-by: Matthew Wilcox

    Matthew Wilcox
     
  • The IDR is very similar to the radix tree. It has some functionality that
    the radix tree did not have (alloc next free, cyclic allocation, a
    callback-based for_each, destroy tree), which is readily implementable on
    top of the radix tree. A few small changes were needed in order to use a
    tag to represent nodes with free space below them. More extensive
    changes were needed to support storing NULL as a valid entry in an IDR.
    Plain radix trees still interpret NULL as a not-present entry.

    The IDA is reimplemented as a client of the newly enhanced radix tree. As
    in the current implementation, it uses a bitmap at the last level of the
    tree.

    Signed-off-by: Matthew Wilcox
    Signed-off-by: Matthew Wilcox
    Tested-by: Kirill A. Shutemov
    Cc: Konstantin Khlebnikov
    Cc: Ross Zwisler
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton

    Matthew Wilcox
     
  • Factor the deletion code out into __radix_tree_delete() and provide a
    nice iterator-based wrapper around it. If we free the node, advance
    the iterator to avoid reading from freed memory.

    Signed-off-by: Matthew Wilcox

    Matthew Wilcox
     
  • The counterpart to radix_tree_iter_tag_set(), used by the IDR code

    Signed-off-by: Matthew Wilcox
    Reviewed-by: Rehas Sachdeva

    Matthew Wilcox
     
  • It will be used in drivers/md/raid5-cache.c

    Signed-off-by: Song Liu
    Signed-off-by: Shaohua Li

    Song Liu
     

28 Jan, 2017

1 commit

  • If we're just getting the value of a tag, or looking up an entry,
    we won't modify the radix tree, so we can declare these functions as
    taking a const pointer. Mostly for documentation purposes, though it
    might help code generation.

    Signed-off-by: Matthew Wilcox

    Matthew Wilcox
     

25 Jan, 2017

1 commit

  • The newly introduced warning in radix_tree_free_nodes() was testing the
    wrong variable; it should have been 'old' instead of 'node'.

    Fixes: ea07b862ac8e ("mm: workingset: fix use-after-free in shadow node shrinker")
    Link: http://lkml.kernel.org/r/20170118163746.GA32495@cmpxchg.org
    Signed-off-by: Matthew Wilcox
    Signed-off-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     

08 Jan, 2017

1 commit

  • Several people report seeing warnings about inconsistent radix tree
    nodes followed by crashes in the workingset code, which all looked like
    use-after-free access from the shadow node shrinker.

    Dave Jones managed to reproduce the issue with a debug patch applied,
    which confirmed that the radix tree shrinking indeed frees shadow nodes
    while they are still linked to the shadow LRU:

    WARNING: CPU: 2 PID: 53 at lib/radix-tree.c:643 delete_node+0x1e4/0x200
    CPU: 2 PID: 53 Comm: kswapd0 Not tainted 4.10.0-rc2-think+ #3
    Call Trace:
    delete_node+0x1e4/0x200
    __radix_tree_delete_node+0xd/0x10
    shadow_lru_isolate+0xe6/0x220
    __list_lru_walk_one.isra.4+0x9b/0x190
    list_lru_walk_one+0x23/0x30
    scan_shadow_nodes+0x2e/0x40
    shrink_slab.part.44+0x23d/0x5d0
    shrink_node+0x22c/0x330
    kswapd+0x392/0x8f0

    This is the WARN_ON_ONCE(!list_empty(&node->private_list)) placed in the
    inlined radix_tree_shrink().

    The problem is with 14b468791fa9 ("mm: workingset: move shadow entry
    tracking to radix tree exceptional tracking"), which passes an update
    callback into the radix tree to link and unlink shadow leaf nodes when
    tree entries change, but forgot to pass the callback when reclaiming a
    shadow node.

    While the reclaimed shadow node itself is unlinked by the shrinker, its
    deletion from the tree can cause the left-most leaf node in the tree to
    be shrunk. If that happens to be a shadow node as well, we don't unlink
    it from the LRU as we should.

    Consider this tree, where the s are shadow entries:

    root->rnode
    |
    [0 n]
    | |
    [s ] [sssss]

    Now the shadow node shrinker reclaims the rightmost leaf node through
    the shadow node LRU:

    root->rnode
    |
    [0 ]
    |
    [s ]

    Because the parent of the deleted node is the first level below the
    root and has only one child in the left-most slot, the intermediate
    level is shrunk and the node containing the single shadow is put in
    its place:

    root->rnode
    |
    [s ]

    The shrinker again sees a single left-most slot in a first level node
    and thus decides to store the shadow in root->rnode directly and free
    the node - which is a leaf node on the shadow node LRU.

    root->rnode
    |
    s

    Without the update callback, the freed node remains on the shadow LRU,
    where it causes later shrinker runs to crash.

    Pass the node updater callback into __radix_tree_delete_node() in case
    the deletion causes the left-most branch in the tree to collapse too.

    Also add warnings when linked nodes are freed right away, rather than
    wait for the use-after-free when the list is scanned much later.

    Fixes: 14b468791fa9 ("mm: workingset: move shadow entry tracking to radix tree exceptional tracking")
    Reported-by: Dave Chinner
    Reported-by: Hugh Dickins
    Reported-by: Andrea Arcangeli
    Reported-and-tested-by: Dave Jones
    Signed-off-by: Johannes Weiner
    Cc: Christoph Hellwig
    Cc: Chris Leech
    Cc: Lee Duncan
    Cc: Jan Kara
    Cc: Kirill A. Shutemov
    Cc: Matthew Wilcox
    Cc: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

16 Dec, 2016

1 commit

  • [ This resurrects commit 53855d10f456, which was reverted in
    2b41226b39b6. It depended on commit d544abd5ff7d ("lib/radix-tree:
    Convert to hotplug state machine") so now it is correct to apply ]

    Patch "lib/radix-tree: Convert to hotplug state machine" breaks the test
    suite as it adds a call to cpuhp_setup_state_nocalls() which is not
    currently emulated in the test suite. Add it, and delete the emulation
    of the old CPU hotplug mechanism.

    Link: http://lkml.kernel.org/r/1480369871-5271-36-git-send-email-mawilcox@linuxonhyperv.com
    Signed-off-by: Matthew Wilcox
    Tested-by: Kirill A. Shutemov
    Cc: Konstantin Khlebnikov
    Cc: Ross Zwisler
    Cc: Matthew Wilcox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     

15 Dec, 2016

10 commits

  • radix_tree_join() was freeing nodes with a non-zero ->exceptional count,
    and radix_tree_split() wasn't zeroing ->exceptional when it allocated
    the new node. Fix this by making all callers of radix_tree_node_alloc()
    pass in the new counts (and some other always-initialised fields), which
    will prevent the problem recurring if in future we decide to do
    something similar.

    Link: http://lkml.kernel.org/r/1481667692-14500-3-git-send-email-mawilcox@linuxonhyperv.com
    Signed-off-by: Matthew Wilcox
    Cc: Kirill A. Shutemov
    Cc: Konstantin Khlebnikov
    Cc: Ross Zwisler
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     
  • When replacing an entry with NULL, we need to delete any sibling
    entries. Also account deleting exceptional entries properly. Also fix
    a bug with radix_tree_iter_replace() where we would fail to remove
    entirely freed nodes. Also fix accounting bug when switching between
    normal and exceptional entries with replace_slot. Also add testcases
    for all these bugs.

    Link: http://lkml.kernel.org/r/1480369871-5271-61-git-send-email-mawilcox@linuxonhyperv.com
    Signed-off-by: Matthew Wilcox
    Tested-by: Kirill A. Shutemov
    Cc: Konstantin Khlebnikov
    Cc: Ross Zwisler
    Cc: Matthew Wilcox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     
  • Calculate how many nodes we need to allocate to split an old_order entry
    into multiple entries, each of size new_order. The test suite checks
    that we allocated exactly the right number of nodes; neither too many
    (checked by rtp->nr == 0), nor too few (checked by comparing
    nr_allocated before and after the call to radix_tree_split()).

    Link: http://lkml.kernel.org/r/1480369871-5271-60-git-send-email-mawilcox@linuxonhyperv.com
    Signed-off-by: Matthew Wilcox
    Tested-by: Kirill A. Shutemov
    Cc: Konstantin Khlebnikov
    Cc: Ross Zwisler
    Cc: Matthew Wilcox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     
  • This new function splits a larger multiorder entry into smaller entries
    (potentially multi-order entries). These entries are initialised to
    RADIX_TREE_RETRY to ensure that RCU walkers who see this state aren't
    confused. The caller should then call radix_tree_for_each_slot() and
    radix_tree_replace_slot() in order to turn these retry entries into the
    intended new entries. Tags are replicated from the original multiorder
    entry into each new entry.

    Link: http://lkml.kernel.org/r/1480369871-5271-59-git-send-email-mawilcox@linuxonhyperv.com
    Signed-off-by: Matthew Wilcox
    Tested-by: Kirill A. Shutemov
    Cc: Konstantin Khlebnikov
    Cc: Ross Zwisler
    Cc: Matthew Wilcox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     
  • This new function allows for the replacement of many smaller entries in
    the radix tree with one larger multiorder entry. From the point of view
    of an RCU walker, they may see a mixture of the smaller entries and the
    large entry during the same walk, but they will never see NULL for an
    index which was populated before the join.

    Link: http://lkml.kernel.org/r/1480369871-5271-58-git-send-email-mawilcox@linuxonhyperv.com
    Signed-off-by: Matthew Wilcox
    Tested-by: Kirill A. Shutemov
    Cc: Konstantin Khlebnikov
    Cc: Ross Zwisler
    Cc: Matthew Wilcox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     
  • This is an exceptionally complicated function with just one caller
    (tag_pages_for_writeback). We devote a large portion of the runtime of
    the test suite to testing this one function which has one caller. By
    introducing the new function radix_tree_iter_tag_set(), we can eliminate
    all of the complexity while keeping the performance. The caller can now
    use a fairly standard radix_tree_for_each() loop, and it doesn't need to
    worry about tricksy things like 'start' wrapping.

    The test suite continues to spend a large amount of time investigating
    this function, but now it's testing the underlying primitives such as
    radix_tree_iter_resume() and the radix_tree_for_each_tagged() iterator
    which are also used by other parts of the kernel.

    Link: http://lkml.kernel.org/r/1480369871-5271-57-git-send-email-mawilcox@linuxonhyperv.com
    Signed-off-by: Matthew Wilcox
    Tested-by: Kirill A. Shutemov
    Cc: Konstantin Khlebnikov
    Cc: Ross Zwisler
    Cc: Matthew Wilcox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     
  • This rather complicated function can be better implemented as an
    iterator. It has only one caller, so move the functionality to the only
    place that needs it. Update the test suite to follow the same pattern.

    Link: http://lkml.kernel.org/r/1480369871-5271-56-git-send-email-mawilcox@linuxonhyperv.com
    Signed-off-by: Matthew Wilcox
    Acked-by: Konstantin Khlebnikov
    Tested-by: Kirill A. Shutemov
    Cc: Ross Zwisler
    Cc: Matthew Wilcox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     
  • This fixes several interlinked problems with the iterators in the
    presence of multiorder entries.

    1. radix_tree_iter_next() would only advance by one slot, which would
    result in the iterators returning the same entry more than once if
    there were sibling entries.

    2. radix_tree_next_slot() could return an internal pointer instead of
    a user pointer if a tagged multiorder entry was immediately followed by
    an entry of lower order.

    3. radix_tree_next_slot() expanded to a lot more code than it used to
    when multiorder support was compiled in. And I wasn't comfortable with
    entry_to_node() being in a header file.

    Fixing radix_tree_iter_next() for the presence of sibling entries
    necessarily involves examining the contents of the radix tree, so we now
    need to pass 'slot' to radix_tree_iter_next(), and we need to change the
    calling convention so it is called *before* dropping the lock which
    protects the tree. Also rename it to radix_tree_iter_resume(), as some
    people thought it was necessary to call radix_tree_iter_next() each time
    around the loop.

    radix_tree_next_slot() becomes closer to how it looked before multiorder
    support was introduced. It only checks to see if the next entry in the
    chunk is a sibling entry or a pointer to a node; this should be rare
    enough that handling this case out of line is not a performance impact
    (and such impact is amortised by the fact that the entry we just
    processed was a multiorder entry). Also, radix_tree_next_slot() used to
    force a new chunk lookup for untagged entries, which is more expensive
    than the out of line sibling entry skipping.

    Link: http://lkml.kernel.org/r/1480369871-5271-55-git-send-email-mawilcox@linuxonhyperv.com
    Signed-off-by: Matthew Wilcox
    Tested-by: Kirill A. Shutemov
    Cc: Konstantin Khlebnikov
    Cc: Ross Zwisler
    Cc: Matthew Wilcox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     
  • Print the indices of the entries as unsigned (instead of signed)
    integers and print the parent node of each entry to help navigate around
    larger trees where the layout is not quite so obvious. Print the
    indices covered by a node. Rearrange the order of fields printed so the
    indices and parents line up for each type of entry.

    Link: http://lkml.kernel.org/r/1480369871-5271-53-git-send-email-mawilcox@linuxonhyperv.com
    Signed-off-by: Matthew Wilcox
    Tested-by: Kirill A. Shutemov
    Cc: Konstantin Khlebnikov
    Cc: Ross Zwisler
    Cc: Matthew Wilcox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     
  • Since this function is specialised to the radix tree, pass in the node
    and tag to calculate the address of the bitmap in
    radix_tree_find_next_bit() instead of the caller. Likewise, there is no
    need to pass in the size of the bitmap.

    Link: http://lkml.kernel.org/r/1480369871-5271-52-git-send-email-mawilcox@linuxonhyperv.com
    Signed-off-by: Matthew Wilcox
    Tested-by: Kirill A. Shutemov
    Cc: Konstantin Khlebnikov
    Cc: Ross Zwisler
    Cc: Matthew Wilcox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox