21 Oct, 2020

1 commit

  • Pull XArray updates from Matthew Wilcox:

    - Fix the test suite after introduction of the local_lock

    - Fix a bug in the IDA spotted by Coverity

    - Change the API that allows the workingset code to delete a node

    - Fix xas_reload() when dealing with entries that occupy multiple
    indices

    - Add a few more tests to the test suite

    - Fix an unsigned int being shifted into an unsigned long

    * tag 'xarray-5.9' of git://git.infradead.org/users/willy/xarray:
    XArray: Fix xas_create_range for ranges above 4 billion
    radix-tree: fix the comment of radix_tree_next_slot()
    XArray: Fix xas_reload for multi-index entries
    XArray: Add private interface for workingset node deletion
    XArray: Fix xas_for_each_conflict documentation
    XArray: Test marked multiorder iterations
    XArray: Test two more things about xa_cmpxchg
    ida: Free allocated bitmap in error path
    radix tree test suite: Fix compilation

    Linus Torvalds
     

17 Oct, 2020

1 commit


07 Oct, 2020

1 commit


17 Jul, 2020

1 commit

  • Using uninitialized_var() is dangerous as it papers over real bugs[1]
    (or can in the future), and suppresses unrelated compiler warnings
    (e.g. "unused variable"). If the compiler thinks it is uninitialized,
    either simply initialize the variable or make compiler changes.

    In preparation for removing[2] the[3] macro[4], remove all remaining
    needless uses with the following script:

    git grep '\buninitialized_var\b' | cut -d: -f1 | sort -u | \
    xargs perl -pi -e \
    's/\buninitialized_var\(([^\)]+)\)/\1/g;
    s:\s*/\* (GCC be quiet|to make compiler happy) \*/$::g;'

    drivers/video/fbdev/riva/riva_hw.c was manually tweaked to avoid
    pathological white-space.

    No outstanding warnings were found building allmodconfig with GCC 9.3.0
    for x86_64, i386, arm64, arm, powerpc, powerpc64le, s390x, mips, sparc64,
    alpha, and m68k.

    [1] https://lore.kernel.org/lkml/20200603174714.192027-1-glider@google.com/
    [2] https://lore.kernel.org/lkml/CA+55aFw+Vbj0i=1TGqCR5vQkCzWJ0QxK6CernOU6eedsudAixw@mail.gmail.com/
    [3] https://lore.kernel.org/lkml/CA+55aFwgbgqhbp1fkxvRKEpzyR5J8n1vKT1VZdz9knmPuXhOeg@mail.gmail.com/
    [4] https://lore.kernel.org/lkml/CA+55aFz2500WfbKXAx8s67wrm9=yVJu65TpLgN_ybYNv0VEOKA@mail.gmail.com/

    Reviewed-by: Leon Romanovsky # drivers/infiniband and mlx4/mlx5
    Acked-by: Jason Gunthorpe # IB
    Acked-by: Kalle Valo # wireless drivers
    Reviewed-by: Chao Yu # erofs
    Signed-off-by: Kees Cook

    Kees Cook
     

28 May, 2020

1 commit

  • The radix-tree and idr preload mechanisms use preempt_disable() to protect
    the complete operation between xxx_preload() and xxx_preload_end().

    As the code inside the preempt disabled section acquires regular spinlocks,
    which are converted to 'sleeping' spinlocks on a PREEMPT_RT kernel and
    eventually calls into a memory allocator, this conflicts with the RT
    semantics.

    Convert it to a local_lock which allows RT kernels to substitute them with
    a real per CPU lock. On non RT kernels this maps to preempt_disable() as
    before, but provides also lockdep coverage of the critical region.
    No functional change.

    Signed-off-by: Sebastian Andrzej Siewior
    Signed-off-by: Ingo Molnar
    Acked-by: Peter Zijlstra
    Link: https://lore.kernel.org/r/20200527201119.1692513-3-bigeasy@linutronix.de

    Sebastian Andrzej Siewior
     

01 Feb, 2020

1 commit


03 Nov, 2019

1 commit


31 May, 2019

1 commit

  • Based on 1 normalized pattern(s):

    this program is free software you can redistribute it and or modify
    it under the terms of the gnu general public license as published by
    the free software foundation either version 2 or at your option any
    later version this program is distributed in the hope that it will
    be useful but without any warranty without even the implied warranty
    of merchantability or fitness for a particular purpose see the gnu
    general public license for more details you should have received a
    copy of the gnu general public license along with this program if
    not write to the free software foundation inc 675 mass ave cambridge
    ma 02139 usa

    extracted by the scancode license scanner the SPDX license identifier

    GPL-2.0-or-later

    has been chosen to replace the boilerplate/reference in 77 file(s).

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Allison Randal
    Reviewed-by: Armijn Hemel
    Reviewed-by: Richard Fontana
    Cc: linux-spdx@vger.kernel.org
    Link: https://lkml.kernel.org/r/20190527070032.837555891@linutronix.de
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     

06 Dec, 2018

1 commit

  • Commit 66ee620f06f9 ("idr: Permit any valid kernel pointer to be stored")
    changed the radix tree lookup so that it stops when reaching the bottom
    of the tree. However, the condition was added in the wrong place,
    making it possible to return retry entries to the caller. Reorder the
    tests to check for the retry entry before checking whether we're at the
    bottom of the tree. The retry entry should never be found in the tree
    root, so it's safe to defer the check until the end of the loop.

    Add a regression test to the test-suite to be sure this doesn't come
    back.

    Fixes: 66ee620f06f9 ("idr: Permit any valid kernel pointer to be stored")
    Reported-by: Greg Kurz
    Signed-off-by: Matthew Wilcox

    Matthew Wilcox
     

21 Oct, 2018

13 commits

  • All users have now been converted to the XArray. Removing the support
    reduces code size and ensures new users will use the XArray instead.

    Signed-off-by: Matthew Wilcox

    Matthew Wilcox
     
  • The tag_tagged_items() function is supposed to test the page-writeback
    tagging code. Since that has been converted to the XArray, there's
    not much point in testing the radix tree's tagging code. This requires
    using the pthread mutex embedded in the xarray instead of an external
    lock, so remove the pthread mutexes which protect xarrays/radix trees.
    Also remove radix_tree_iter_tag_set() as this was the last user.

    Signed-off-by: Matthew Wilcox

    Matthew Wilcox
     
  • The page cache was the only user of this interface and it has now
    been converted to the XArray. Transform the test into a test of
    xas_init_marks().

    Signed-off-by: Matthew Wilcox

    Matthew Wilcox
     
  • This function was only used by the page cache which is now converted
    to the XArray.

    Signed-off-by: Matthew Wilcox

    Matthew Wilcox
     
  • radix_tree_split and radix_tree_join were never used upstream. Remove
    them; if they're needed in future they will be replaced by XArray
    equivalents.

    Signed-off-by: Matthew Wilcox

    Matthew Wilcox
     
  • The only user of this functionality was the workingset code, and it's
    now been converted to the XArray. Remove __radix_tree_delete_node()
    entirely as it was also only used by the workingset code.

    Signed-off-by: Matthew Wilcox

    Matthew Wilcox
     
  • xa_find() is a slightly easier API to use than
    radix_tree_gang_lookup_slot() because it contains its own RCU locking.
    This commit removes the last user of radix_tree_gang_lookup_slot()
    so remove the function too.

    Signed-off-by: Matthew Wilcox

    Matthew Wilcox
     
  • Use the XArray APIs to add and replace pages in the page cache. This
    removes two uses of the radix tree preload API and is significantly
    shorter code. It also removes the last user of __radix_tree_create()
    outside radix-tree.c itself, so make it static.

    Signed-off-by: Matthew Wilcox

    Matthew Wilcox
     
  • Use the XA_TRACK_FREE ability to track which entries have a free bit,
    similarly to how it uses the radix tree's IDR_FREE tag. This eliminates
    the per-cpu ida_bitmap preload, and fixes the memory consumption
    regression I introduced when making the IDR able to store any pointer.

    Signed-off-by: Matthew Wilcox

    Matthew Wilcox
     
  • xa_store() differs from radix_tree_insert() in that it will overwrite an
    existing element in the array rather than returning an error. This is
    the behaviour which most users want, and those that want more complex
    behaviour generally want to use the xas family of routines anyway.

    For memory allocation, xa_store() will first attempt to request memory
    from the slab allocator; if memory is not immediately available, it will
    drop the xa_lock and allocate memory, keeping a pointer in the xa_state.
    It does not use the per-CPU cache, although those will continue to exist
    until all radix tree users are converted to the xarray.

    This patch also includes xa_erase() and __xa_erase() for a streamlined
    way to store NULL. Since there is no need to allocate memory in order
    to store a NULL in the XArray, we do not need to trouble the user with
    deciding what memory allocation flags to use.

    Signed-off-by: Matthew Wilcox

    Matthew Wilcox
     
  • The xa_load function brings with it a lot of infrastructure; xa_empty(),
    xa_is_err(), and large chunks of the XArray advanced API that are used
    to implement xa_load.

    As the test-suite demonstrates, it is possible to use the XArray functions
    on a radix tree. The radix tree functions depend on the GFP flags being
    stored in the root of the tree, so it's not possible to use the radix
    tree functions on an XArray.

    Signed-off-by: Matthew Wilcox

    Matthew Wilcox
     
  • This is a direct replacement for struct radix_tree_node. A couple of
    struct members have changed name, so convert those. Use a #define so
    that radix tree users continue to work without change.

    Signed-off-by: Matthew Wilcox
    Reviewed-by: Josef Bacik

    Matthew Wilcox
     
  • This is a direct replacement for struct radix_tree_root. Some of the
    struct members have changed name; convert those, and use a #define so
    that radix_tree users continue to work without change.

    Signed-off-by: Matthew Wilcox
    Reviewed-by: Josef Bacik

    Matthew Wilcox
     

30 Sep, 2018

3 commits

  • Instead of storing a pointer to the slot containing the canonical entry,
    store the offset of the slot. Produces slightly more efficient code
    (~300 bytes) and simplifies the implementation.

    Signed-off-by: Matthew Wilcox
    Reviewed-by: Josef Bacik

    Matthew Wilcox
     
  • Introduce xarray value entries and tagged pointers to replace radix
    tree exceptional entries. This is a slight change in encoding to allow
    the use of an extra bit (we can now store BITS_PER_LONG - 1 bits in a
    value entry). It is also a change in emphasis; exceptional entries are
    intimidating and different. As the comment explains, you can choose
    to store values or pointers in the xarray and they are both first-class
    citizens.

    Signed-off-by: Matthew Wilcox
    Reviewed-by: Josef Bacik

    Matthew Wilcox
     
  • An upcoming change to the encoding of internal entries will set the bottom
    two bits to 0b10. Unfortunately, m68k only aligns some data structures
    to 2 bytes, so the IDR will interpret them as internal entries and things
    will go badly wrong.

    Change the radix tree so that it stops either when the node indicates
    that it's the bottom of the tree (shift == 0) or when the entry is not an
    internal entry. This means we cannot insert an arbitrary kernel pointer
    as a multiorder entry, but the IDR does not permit multiorder entries.

    Annoyingly, this means the IDR can no longer take advantage of the radix
    tree's ability to store a single entry at offset 0 without allocating
    memory. A pointer which is 2-byte aligned cannot be stored directly in
    the root as it would be indistinguishable from a node, so we must allocate
    a node in order to store a 2-byte pointer at index 0. The idr_replace()
    function does not take a GFP flags argument, so cannot allocate memory.
    If a user inserts a 4-byte aligned pointer at index 0 and then replaces
    it with a 2-byte aligned pointer, we must be able to store it.

    Arbitrary pointer values are still not permitted; pointers of the
    form 2 + (i * 4) for values of i between 0 and 1023 are reserved for
    the implementation. These are not valid kernel pointers as they would
    point into the zero page.

    This change does cause a runtime memory consumption regression for
    the IDA. I will recover that later.

    Signed-off-by: Matthew Wilcox
    Tested-by: Guenter Roeck

    Matthew Wilcox
     

22 Aug, 2018

2 commits

  • Delete ida_pre_get(), ida_get_new(), ida_get_new_above() and ida_remove()
    from the public API. Some of these functions still exist as internal
    helpers, but they should not be called by consumers.

    Signed-off-by: Matthew Wilcox

    Matthew Wilcox
     
  • get_slot_offset() can be called with a NULL 'parent' argument.
    In this case, the calculated value will not be used, but calculating
    it is undefined. Rather than fixing the caller (__radix_tree_delete)
    to not call get_slot_offset(), make get_slot_offset() robust against
    being called with a NULL parent.

    Signed-off-by: Matthew Wilcox

    Matthew Wilcox
     

26 May, 2018

1 commit

  • If the radix tree underlying the IDR happens to be full and we attempt
    to remove an id which is larger than any id in the IDR, we will call
    __radix_tree_delete() with an uninitialised 'slot' pointer, at which
    point anything could happen. This was easiest to hit with a single
    entry at id 0 and attempting to remove a non-0 id, but it could have
    happened with 64 entries and attempting to remove an id >= 64.

    Roman said:

    The syzcaller test boils down to opening /dev/kvm, creating an
    eventfd, and calling a couple of KVM ioctls. None of this requires
    superuser. And the result is dereferencing an uninitialized pointer
    which is likely a crash. The specific path caught by syzbot is via
    KVM_HYPERV_EVENTD ioctl which is new in 4.17. But I guess there are
    other user-triggerable paths, so cc:stable is probably justified.

    Matthew added:

    We have around 250 calls to idr_remove() in the kernel today. Many of
    them pass an ID which is embedded in the object they're removing, so
    they're safe. Picking a few likely candidates:

    drivers/firewire/core-cdev.c looks unsafe; the ID comes from an ioctl.
    drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c is similar
    drivers/atm/nicstar.c could be taken down by a handcrafted packet

    Link: http://lkml.kernel.org/r/20180518175025.GD6361@bombadil.infradead.org
    Fixes: 0a835c4f090a ("Reimplement IDR and IDA using the radix tree")
    Reported-by:
    Debugged-by: Roman Kagan
    Signed-off-by: Matthew Wilcox
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     

19 May, 2018

1 commit

  • Fix a race in the multi-order iteration code which causes the kernel to
    hit a GP fault. This was first seen with a production v4.15 based
    kernel (4.15.6-300.fc27.x86_64) utilizing a DAX workload which used
    order 9 PMD DAX entries.

    The race has to do with how we tear down multi-order sibling entries
    when we are removing an item from the tree. Remember for example that
    an order 2 entry looks like this:

    struct radix_tree_node.slots[] = [entry][sibling][sibling][sibling]

    where 'entry' is in some slot in the struct radix_tree_node, and the
    three slots following 'entry' contain sibling pointers which point back
    to 'entry.'

    When we delete 'entry' from the tree, we call :

    radix_tree_delete()
    radix_tree_delete_item()
    __radix_tree_delete()
    replace_slot()

    replace_slot() first removes the siblings in order from the first to the
    last, then at then replaces 'entry' with NULL. This means that for a
    brief period of time we end up with one or more of the siblings removed,
    so:

    struct radix_tree_node.slots[] = [entry][NULL][sibling][sibling]

    This causes an issue if you have a reader iterating over the slots in
    the tree via radix_tree_for_each_slot() while only under
    rcu_read_lock()/rcu_read_unlock() protection. This is a common case in
    mm/filemap.c.

    The issue is that when __radix_tree_next_slot() => skip_siblings() tries
    to skip over the sibling entries in the slots, it currently does so with
    an exact match on the slot directly preceding our current slot.
    Normally this works:

    V preceding slot
    struct radix_tree_node.slots[] = [entry][sibling][sibling][sibling]
    ^ current slot

    This lets you find the first sibling, and you skip them all in order.

    But in the case where one of the siblings is NULL, that slot is skipped
    and then our sibling detection is interrupted:

    V preceding slot
    struct radix_tree_node.slots[] = [entry][NULL][sibling][sibling]
    ^ current slot

    This means that the sibling pointers aren't recognized since they point
    all the way back to 'entry', so we think that they are normal internal
    radix tree pointers. This causes us to think we need to walk down to a
    struct radix_tree_node starting at the address of 'entry'.

    In a real running kernel this will crash the thread with a GP fault when
    you try and dereference the slots in your broken node starting at
    'entry'.

    We fix this race by fixing the way that skip_siblings() detects sibling
    nodes. Instead of testing against the preceding slot we instead look
    for siblings via is_sibling_entry() which compares against the position
    of the struct radix_tree_node.slots[] array. This ensures that sibling
    entries are properly identified, even if they are no longer contiguous
    with the 'entry' they point to.

    Link: http://lkml.kernel.org/r/20180503192430.7582-6-ross.zwisler@linux.intel.com
    Fixes: 148deab223b2 ("radix-tree: improve multiorder iterators")
    Signed-off-by: Ross Zwisler
    Reported-by: CR, Sapthagirish
    Reviewed-by: Jan Kara
    Cc: Matthew Wilcox
    Cc: Christoph Hellwig
    Cc: Dan Williams
    Cc: Dave Chinner
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ross Zwisler
     

12 Apr, 2018

1 commit

  • Patch series "XArray", v9. (First part thereof).

    This patchset is, I believe, appropriate for merging for 4.17. It
    contains the XArray implementation, to eventually replace the radix
    tree, and converts the page cache to use it.

    This conversion keeps the radix tree and XArray data structures in sync
    at all times. That allows us to convert the page cache one function at
    a time and should allow for easier bisection. Other than renaming some
    elements of the structures, the data structures are fundamentally
    unchanged; a radix tree walk and an XArray walk will touch the same
    number of cachelines. I have changes planned to the XArray data
    structure, but those will happen in future patches.

    Improvements the XArray has over the radix tree:

    - The radix tree provides operations like other trees do; 'insert' and
    'delete'. But what most users really want is an automatically
    resizing array, and so it makes more sense to give users an API that
    is like an array -- 'load' and 'store'. We still have an 'insert'
    operation for users that really want that semantic.

    - The XArray considers locking as part of its API. This simplifies a
    lot of users who formerly had to manage their own locking just for
    the radix tree. It also improves code generation as we can now tell
    RCU that we're holding a lock and it doesn't need to generate as much
    fencing code. The other advantage is that tree nodes can be moved
    (not yet implemented).

    - GFP flags are now parameters to calls which may need to allocate
    memory. The radix tree forced users to decide what the allocation
    flags would be at creation time. It's much clearer to specify them at
    allocation time.

    - Memory is not preloaded; we don't tie up dozens of pages on the off
    chance that the slab allocator fails. Instead, we drop the lock,
    allocate a new node and retry the operation. We have to convert all
    the radix tree, IDA and IDR preload users before we can realise this
    benefit, but I have not yet found a user which cannot be converted.

    - The XArray provides a cmpxchg operation. The radix tree forces users
    to roll their own (and at least four have).

    - Iterators take a 'max' parameter. That simplifies many users and will
    reduce the amount of iteration done.

    - Iteration can proceed backwards. We only have one user for this, but
    since it's called as part of the pagefault readahead algorithm, that
    seemed worth mentioning.

    - RCU-protected pointers are not exposed as part of the API. There are
    some fun bugs where the page cache forgets to use rcu_dereference()
    in the current codebase.

    - Value entries gain an extra bit compared to radix tree exceptional
    entries. That gives us the extra bit we need to put huge page swap
    entries in the page cache.

    - Some iterators now take a 'filter' argument instead of having
    separate iterators for tagged/untagged iterations.

    The page cache is improved by this:

    - Shorter, easier to read code

    - More efficient iterations

    - Reduction in size of struct address_space

    - Fewer walks from the top of the data structure; the XArray API
    encourages staying at the leaf node and conducting operations there.

    This patch (of 8):

    None of these bits may be used for slab allocations, so we can use them
    as radix tree flags as long as we mask them off before passing them to
    the slab allocator. Move the IDR flag from the high bits to the
    GFP_ZONEMASK bits.

    Link: http://lkml.kernel.org/r/20180313132639.17387-3-willy@infradead.org
    Signed-off-by: Matthew Wilcox
    Acked-by: Jeff Layton
    Cc: Darrick J. Wong
    Cc: Dave Chinner
    Cc: Ryusuke Konishi
    Cc: Will Deacon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     

22 Feb, 2018

1 commit

  • As far as I can tell, the only place the per-cpu ida_bitmap is populated
    is in ida_pre_get. The pre-allocated element is stolen in two places in
    ida_get_new_above, in both cases immediately followed by a memset(0).

    Since ida_get_new_above is called with locks held, do the zeroing in
    ida_pre_get, or rather let kmalloc() do it. Also, apparently gcc
    generates ~44 bytes of code to do a memset(, 0, 128):

    $ scripts/bloat-o-meter vmlinux.{0,1}
    add/remove: 0/0 grow/shrink: 2/1 up/down: 5/-88 (-83)
    Function old new delta
    ida_pre_get 115 119 +4
    vermagic 27 28 +1
    ida_get_new_above 715 627 -88

    Link: http://lkml.kernel.org/r/20180108225634.15340-1-linux@rasmusvillemoes.dk
    Signed-off-by: Rasmus Villemoes
    Acked-by: Matthew Wilcox
    Cc: Eric Biggers
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rasmus Villemoes
     

07 Feb, 2018

1 commit

  • It has no more users, so remove it. Move idr_alloc() back into idr.c,
    move the guts of idr_alloc_cmn() into idr_alloc_u32(), remove the
    wrappers around idr_get_free_cmn() and rename it to idr_get_free().
    While there is now no interface to allocate IDs larger than a u32,
    the IDR internals remain ready to handle a larger ID should a need arise.

    These changes make it possible to provide the guarantee that, if the
    nextid pointer points into the object, the object's ID will be initialised
    before a concurrent lookup can find the object.

    Signed-off-by: Matthew Wilcox

    Matthew Wilcox
     

16 Nov, 2017

1 commit

  • During truncation, the mapping has already been checked for shmem and
    dax so it's known that workingset_update_node is required.

    This patch avoids the checks on mapping for each page being truncated.
    In all other cases, a lookup helper is used to determine if
    workingset_update_node() needs to be called. The one danger is that the
    API is slightly harder to use as calling workingset_update_node directly
    without checking for dax or shmem mappings could lead to surprises.
    However, the API rarely needs to be used and hopefully the comment is
    enough to give people the hint.

    sparsetruncate (tiny)
    4.14.0-rc4 4.14.0-rc4
    oneirq-v1r1 pickhelper-v1r1
    Min Time 141.00 ( 0.00%) 140.00 ( 0.71%)
    1st-qrtle Time 142.00 ( 0.00%) 141.00 ( 0.70%)
    2nd-qrtle Time 142.00 ( 0.00%) 142.00 ( 0.00%)
    3rd-qrtle Time 143.00 ( 0.00%) 143.00 ( 0.00%)
    Max-90% Time 144.00 ( 0.00%) 144.00 ( 0.00%)
    Max-95% Time 147.00 ( 0.00%) 145.00 ( 1.36%)
    Max-99% Time 195.00 ( 0.00%) 191.00 ( 2.05%)
    Max Time 230.00 ( 0.00%) 205.00 ( 10.87%)
    Amean Time 144.37 ( 0.00%) 143.82 ( 0.38%)
    Stddev Time 10.44 ( 0.00%) 9.00 ( 13.74%)
    Coeff Time 7.23 ( 0.00%) 6.26 ( 13.41%)
    Best99%Amean Time 143.72 ( 0.00%) 143.34 ( 0.26%)
    Best95%Amean Time 142.37 ( 0.00%) 142.00 ( 0.26%)
    Best90%Amean Time 142.19 ( 0.00%) 141.85 ( 0.24%)
    Best75%Amean Time 141.92 ( 0.00%) 141.58 ( 0.24%)
    Best50%Amean Time 141.69 ( 0.00%) 141.31 ( 0.27%)
    Best25%Amean Time 141.38 ( 0.00%) 140.97 ( 0.29%)

    As you'd expect, the gain is marginal but it can be detected. The
    differences in bonnie are all within the noise which is not surprising
    given the impact on the microbenchmark.

    radix_tree_update_node_t is a callback for some radix operations that
    optionally passes in a private field. The only user of the callback is
    workingset_update_node and as it no longer requires a mapping, the
    private field is removed.

    Link: http://lkml.kernel.org/r/20171018075952.10627-3-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Johannes Weiner
    Reviewed-by: Jan Kara
    Cc: Andi Kleen
    Cc: Dave Chinner
    Cc: Dave Hansen
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

09 Sep, 2017

1 commit

  • __radix_tree_preload() only disables preemption if no error is returned.

    So we really need to make sure callers always check the return value.

    idr_preload() contract is to always disable preemption, so we need
    to add a missing preempt_disable() if an error happened.

    Similarly, ida_pre_get() only needs to call preempt_enable() in the
    case no error happened.

    Link: http://lkml.kernel.org/r/1504637190.15310.62.camel@edumazet-glaptop3.roam.corp.google.com
    Fixes: 0a835c4f090a ("Reimplement IDR and IDA using the radix tree")
    Fixes: 7ad3d4d85c7a ("ida: Move ida_bitmap to a percpu variable")
    Signed-off-by: Eric Dumazet
    Cc: Matthew Wilcox
    Cc: "Kirill A. Shutemov"
    Cc: [4.11+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric Dumazet
     

07 Sep, 2017

1 commit

  • Pull networking updates from David Miller:

    1) Support ipv6 checksum offload in sunvnet driver, from Shannon
    Nelson.

    2) Move to RB-tree instead of custom AVL code in inetpeer, from Eric
    Dumazet.

    3) Allow generic XDP to work on virtual devices, from John Fastabend.

    4) Add bpf device maps and XDP_REDIRECT, which can be used to build
    arbitrary switching frameworks using XDP. From John Fastabend.

    5) Remove UFO offloads from the tree, gave us little other than bugs.

    6) Remove the IPSEC flow cache, from Florian Westphal.

    7) Support ipv6 route offload in mlxsw driver.

    8) Support VF representors in bnxt_en, from Sathya Perla.

    9) Add support for forward error correction modes to ethtool, from
    Vidya Sagar Ravipati.

    10) Add time filter for packet scheduler action dumping, from Jamal Hadi
    Salim.

    11) Extend the zerocopy sendmsg() used by virtio and tap to regular
    sockets via MSG_ZEROCOPY. From Willem de Bruijn.

    12) Significantly rework value tracking in the BPF verifier, from Edward
    Cree.

    13) Add new jump instructions to eBPF, from Daniel Borkmann.

    14) Rework rtnetlink plumbing so that operations can be run without
    taking the RTNL semaphore. From Florian Westphal.

    15) Support XDP in tap driver, from Jason Wang.

    16) Add 32-bit eBPF JIT for ARM, from Shubham Bansal.

    17) Add Huawei hinic ethernet driver.

    18) Allow to report MD5 keys in TCP inet_diag dumps, from Ivan
    Delalande.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1780 commits)
    i40e: point wb_desc at the nvm_wb_desc during i40e_read_nvm_aq
    i40e: avoid NVM acquire deadlock during NVM update
    drivers: net: xgene: Remove return statement from void function
    drivers: net: xgene: Configure tx/rx delay for ACPI
    drivers: net: xgene: Read tx/rx delay for ACPI
    rocker: fix kcalloc parameter order
    rds: Fix non-atomic operation on shared flag variable
    net: sched: don't use GFP_KERNEL under spin lock
    vhost_net: correctly check tx avail during rx busy polling
    net: mdio-mux: add mdio_mux parameter to mdio_mux_init()
    rxrpc: Make service connection lookup always check for retry
    net: stmmac: Delete dead code for MDIO registration
    gianfar: Fix Tx flow control deactivation
    cxgb4: Ignore MPS_TX_INT_CAUSE[Bubble] for T6
    cxgb4: Fix pause frame count in t4_get_port_stats
    cxgb4: fix memory leak
    tun: rename generic_xdp to skb_xdp
    tun: reserve extra headroom only when XDP is set
    net: dsa: bcm_sf2: Configure IMP port TC2QOS mapping
    net: dsa: bcm_sf2: Advertise number of egress queues
    ...

    Linus Torvalds
     

31 Aug, 2017

1 commit

  • The following new APIs are added:

    int idr_alloc_ext(struct idr *idr, void *ptr, unsigned long *index,
    unsigned long start, unsigned long end, gfp_t gfp);
    void *idr_remove_ext(struct idr *idr, unsigned long id);
    void *idr_find_ext(const struct idr *idr, unsigned long id);
    void *idr_replace_ext(struct idr *idr, void *ptr, unsigned long id);
    void *idr_get_next_ext(struct idr *idr, unsigned long *nextid);

    Signed-off-by: Chris Mi
    Signed-off-by: Jiri Pirko
    Signed-off-by: David S. Miller

    Chris Mi
     

18 Aug, 2017

1 commit

  • This was the competing idea long ago, but it was only with the rewrite
    of the idr as an radixtree and using the radixtree directly ourselves,
    along with the realisation that we can store the vma directly in the
    radixtree and only need a list for the reverse mapping, that made the
    patch performant enough to displace using a hashtable. Though the vma ht
    is fast and doesn't require any extra allocation (as we can embed the node
    inside the vma), it does require a thread for resizing and serialization
    and will have the occasional slow lookup. That is hairy enough to
    investigate alternatives and favour them if equivalent in peak performance.
    One advantage of allocating an indirection entry is that we can support a
    single shared bo between many clients, something that was done on a
    first-come first-serve basis for shared GGTT vma previously. To offset
    the extra allocations, we create yet another kmem_cache for them.

    Signed-off-by: Chris Wilson
    Reviewed-by: Tvrtko Ursulin
    Link: https://patchwork.freedesktop.org/patch/msgid/20170816085210.4199-5-chris@chris-wilson.co.uk

    Chris Wilson
     

04 May, 2017

1 commit

  • The current implementation of the reclaim lockup detection can lead to
    false positives and those even happen and usually lead to tweak the code
    to silence the lockdep by using GFP_NOFS even though the context can use
    __GFP_FS just fine.

    See

    http://lkml.kernel.org/r/20160512080321.GA18496@dastard

    as an example.

    =================================
    [ INFO: inconsistent lock state ]
    4.5.0-rc2+ #4 Tainted: G O
    ---------------------------------
    inconsistent {RECLAIM_FS-ON-R} -> {IN-RECLAIM_FS-W} usage.
    kswapd0/543 [HC0[0]:SC0[0]:HE1:SE1] takes:

    (&xfs_nondir_ilock_class){++++-+}, at: xfs_ilock+0x177/0x200 [xfs]

    {RECLAIM_FS-ON-R} state was registered at:
    mark_held_locks+0x79/0xa0
    lockdep_trace_alloc+0xb3/0x100
    kmem_cache_alloc+0x33/0x230
    kmem_zone_alloc+0x81/0x120 [xfs]
    xfs_refcountbt_init_cursor+0x3e/0xa0 [xfs]
    __xfs_refcount_find_shared+0x75/0x580 [xfs]
    xfs_refcount_find_shared+0x84/0xb0 [xfs]
    xfs_getbmap+0x608/0x8c0 [xfs]
    xfs_vn_fiemap+0xab/0xc0 [xfs]
    do_vfs_ioctl+0x498/0x670
    SyS_ioctl+0x79/0x90
    entry_SYSCALL_64_fastpath+0x12/0x6f

    CPU0
    ----
    lock(&xfs_nondir_ilock_class);

    lock(&xfs_nondir_ilock_class);

    *** DEADLOCK ***

    3 locks held by kswapd0/543:

    stack backtrace:
    CPU: 0 PID: 543 Comm: kswapd0 Tainted: G O 4.5.0-rc2+ #4
    Call Trace:
    lock_acquire+0xd8/0x1e0
    down_write_nested+0x5e/0xc0
    xfs_ilock+0x177/0x200 [xfs]
    xfs_reflink_cancel_cow_range+0x150/0x300 [xfs]
    xfs_fs_evict_inode+0xdc/0x1e0 [xfs]
    evict+0xc5/0x190
    dispose_list+0x39/0x60
    prune_icache_sb+0x4b/0x60
    super_cache_scan+0x14f/0x1a0
    shrink_slab.part.63.constprop.79+0x1e9/0x4e0
    shrink_zone+0x15e/0x170
    kswapd+0x4f1/0xa80
    kthread+0xf2/0x110
    ret_from_fork+0x3f/0x70

    To quote Dave:
    "Ignoring whether reflink should be doing anything or not, that's a
    "xfs_refcountbt_init_cursor() gets called both outside and inside
    transactions" lockdep false positive case. The problem here is lockdep
    has seen this allocation from within a transaction, hence a GFP_NOFS
    allocation, and now it's seeing it in a GFP_KERNEL context. Also note
    that we have an active reference to this inode.

    So, because the reclaim annotations overload the interrupt level
    detections and it's seen the inode ilock been taken in reclaim
    ("interrupt") context, this triggers a reclaim context warning where
    it thinks it is unsafe to do this allocation in GFP_KERNEL context
    holding the inode ilock..."

    This sounds like a fundamental problem of the reclaim lock detection.
    It is really impossible to annotate such a special usecase IMHO unless
    the reclaim lockup detection is reworked completely. Until then it is
    much better to provide a way to add "I know what I am doing flag" and
    mark problematic places. This would prevent from abusing GFP_NOFS flag
    which has a runtime effect even on configurations which have lockdep
    disabled.

    Introduce __GFP_NOLOCKDEP flag which tells the lockdep gfp tracking to
    skip the current allocation request.

    While we are at it also make sure that the radix tree doesn't
    accidentaly override tags stored in the upper part of the gfp_mask.

    Link: http://lkml.kernel.org/r/20170306131408.9828-3-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Suggested-by: Peter Zijlstra
    Acked-by: Peter Zijlstra (Intel)
    Acked-by: Vlastimil Babka
    Cc: Dave Chinner
    Cc: Theodore Ts'o
    Cc: Chris Mason
    Cc: David Sterba
    Cc: Jan Kara
    Cc: Brian Foster
    Cc: Darrick J. Wong
    Cc: Nikolay Borisov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

08 Mar, 2017

1 commit

  • There's a relatively rare race where we look at the per-cpu preallocated
    IDA bitmap, see it's NULL, allocate a new one, and atomically update it.
    If the kmalloc() happened to sleep and we were rescheduled to a different
    CPU, or an interrupt came in at the exact right time, another task
    might have successfully allocated a bitmap and already deposited it.
    I forgot what the semantics of cmpxchg() were and ended up freeing the
    wrong bitmap leading to KASAN reporting a use-after-free.

    Dmitry found the bug with syzkaller & wrote the patch. I wrote the test
    case that will reproduce the bug without his patch being applied.

    Reported-by: Dmitry Vyukov
    Signed-off-by: Matthew Wilcox

    Matthew Wilcox
     

01 Mar, 2017

1 commit

  • Pull IDR rewrite from Matthew Wilcox:
    "The most significant part of the following is the patch to rewrite the
    IDR & IDA to be clients of the radix tree. But there's much more,
    including an enhancement of the IDA to be significantly more space
    efficient, an IDR & IDA test suite, some improvements to the IDR API
    (and driver changes to take advantage of those improvements), several
    improvements to the radix tree test suite and RCU annotations.

    The IDR & IDA rewrite had a good spin in linux-next and Andrew's tree
    for most of the last cycle. Coupled with the IDR test suite, I feel
    pretty confident that any remaining bugs are quite hard to hit. 0-day
    did a great job of watching my git tree and pointing out problems; as
    it hit them, I added new test-cases to be sure not to be caught the
    same way twice"

    Willy goes on to expand a bit on the IDR rewrite rationale:
    "The radix tree and the IDR use very similar data structures.

    Merging the two codebases lets us share the memory allocation pools,
    and results in a net deletion of 500 lines of code. It also opens up
    the possibility of exposing more of the features of the radix tree to
    users of the IDR (and I have some interesting patches along those
    lines waiting for 4.12)

    It also shrinks the size of the 'struct idr' from 40 bytes to 24 which
    will shrink a fair few data structures that embed an IDR"

    * 'idr-4.11' of git://git.infradead.org/users/willy/linux-dax: (32 commits)
    radix tree test suite: Add config option for map shift
    idr: Add missing __rcu annotations
    radix-tree: Fix __rcu annotations
    radix-tree: Add rcu_dereference and rcu_assign_pointer calls
    radix tree test suite: Run iteration tests for longer
    radix tree test suite: Fix split/join memory leaks
    radix tree test suite: Fix leaks in regression2.c
    radix tree test suite: Fix leaky tests
    radix tree test suite: Enable address sanitizer
    radix_tree_iter_resume: Fix out of bounds error
    radix-tree: Store a pointer to the root in each node
    radix-tree: Chain preallocated nodes through ->parent
    radix tree test suite: Dial down verbosity with -v
    radix tree test suite: Introduce kmalloc_verbose
    idr: Return the deleted entry from idr_remove
    radix tree test suite: Build separate binaries for some tests
    ida: Use exceptional entries for small IDAs
    ida: Move ida_bitmap to a percpu variable
    Reimplement IDR and IDA using the radix tree
    radix-tree: Add radix_tree_iter_delete
    ...

    Linus Torvalds