03 Aug, 2016

1 commit

  • Radix trees may be used not only for storing page cache pages, so
    unconditionally accounting radix tree nodes to the current memory cgroup
    is bad: if a radix tree node is used for storing data shared among
    different cgroups we risk pinning dead memory cgroups forever.

    So let's only account radix tree nodes if it was explicitly requested by
    passing __GFP_ACCOUNT to INIT_RADIX_TREE. Currently, we only want to
    account page cache entries, so mark mapping->page_tree so.

    Fixes: 58e698af4c63 ("radix-tree: account radix_tree_node to memory cgroup")
    Link: http://lkml.kernel.org/r/1470057188-7864-1-git-send-email-vdavydov@virtuozzo.com
    Signed-off-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: [4.6+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     

27 Jul, 2016

1 commit

  • The new helper is similar to radix_tree_maybe_preload(), but tries to
    preload number of nodes required to insert (1 << order) continuous
    naturally-aligned elements.

    This is required to push huge pages into pagecache.

    Link: http://lkml.kernel.org/r/1466021202-61880-24-git-send-email-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

21 May, 2016

34 commits

  • Now that the shift amount is stored in the node, radix_tree_descend()
    can calculate offset itself from index, which removes several lines of
    code from each of the tree walkers.

    Signed-off-by: Matthew Wilcox
    Cc: Konstantin Khlebnikov
    Cc: Kirill Shutemov
    Cc: Jan Kara
    Cc: Neil Brown
    Cc: Ross Zwisler
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     
  • In addition to replacing the entry, we also clear all associated tags.
    This is really a one-off special for page_cache_tree_delete() which had
    far too much detailed knowledge about how the radix tree works.

    For efficiency, factor node_tag_clear() out of radix_tree_tag_clear() It
    can be used by radix_tree_delete_item() as well as
    radix_tree_replace_clear_tags().

    Signed-off-by: Matthew Wilcox
    Cc: Konstantin Khlebnikov
    Cc: Kirill Shutemov
    Cc: Jan Kara
    Cc: Neil Brown
    Cc: Ross Zwisler
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     
  • 1. Rename the existing variable 'slot' to 'child'.
    2. Introduce a new variable called 'slot' which is the address of the
    slot we're dealing with. This lets us simplify the tree insertion,
    and removes the recalculation of 'slot' at the end of the function.
    3. Using 'slot' in the sibling pointer insertion part makes the code
    more readable.

    Signed-off-by: Matthew Wilcox
    Cc: Konstantin Khlebnikov
    Cc: Kirill Shutemov
    Cc: Jan Kara
    Cc: Neil Brown
    Cc: Ross Zwisler
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     
  • Convert radix_tree_range_tag_if_tagged to name the nodes parent, node
    and child instead of node & slot.

    Use parent->offset instead of playing games with 'upindex'.

    Signed-off-by: Matthew Wilcox
    Cc: Konstantin Khlebnikov
    Cc: Kirill Shutemov
    Cc: Jan Kara
    Cc: Neil Brown
    Cc: Ross Zwisler
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     
  • Convert radix_tree_next_chunk to use 'child' instead of 'slot' as the
    name of the child node. Also use node_maxindex() where it makes sense.

    The 'rnode' variable was unnecessary; it doesn't overlap in usage with
    'node', so we can just use 'node' the whole way through the function.

    Improve the testcase to start the walk from every index in the carefully
    constructed tree, and to accept any index within the range covered by
    the entry.

    Signed-off-by: Matthew Wilcox
    Cc: Konstantin Khlebnikov
    Cc: Kirill Shutemov
    Cc: Jan Kara
    Cc: Neil Brown
    Cc: Ross Zwisler
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     
  • Use the more standard 'node' and 'child' instead of 'to_free' and
    'slot'.

    Signed-off-by: Matthew Wilcox
    Cc: Konstantin Khlebnikov
    Cc: Kirill Shutemov
    Cc: Jan Kara
    Cc: Neil Brown
    Cc: Ross Zwisler
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     
  • As with indirect_to_ptr(), ptr_to_indirect() and
    RADIX_TREE_INDIRECT_PTR, change radix_tree_is_indirect_ptr() to
    radix_tree_is_internal_node().

    Signed-off-by: Matthew Wilcox
    Cc: Konstantin Khlebnikov
    Cc: Kirill Shutemov
    Cc: Jan Kara
    Cc: Neil Brown
    Cc: Ross Zwisler
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     
  • Mirrors the earlier commit introducing node_to_entry().

    Also change the type returned to be a struct radix_tree_node pointer.
    That lets us simplify a couple of places in the radix tree shrink &
    extend paths where we could convert an entry into a pointer, modify the
    node, then convert the pointer back into an entry.

    Signed-off-by: Matthew Wilcox
    Cc: Konstantin Khlebnikov
    Cc: Kirill Shutemov
    Cc: Jan Kara
    Cc: Neil Brown
    Cc: Ross Zwisler
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     
  • ptr_to_indirect() was a bad name. What it really means is "Convert this
    pointer to a node into an entry suitable for storing in the radix tree".
    So node_to_entry() seemed like a better name.

    Signed-off-by: Matthew Wilcox
    Cc: Konstantin Khlebnikov
    Cc: Kirill Shutemov
    Cc: Jan Kara
    Cc: Neil Brown
    Cc: Ross Zwisler
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     
  • The name RADIX_TREE_INDIRECT_PTR doesn't really match the meaning.
    RADIX_TREE_INTERNAL_NODE is a better name.

    Signed-off-by: Matthew Wilcox
    Cc: Konstantin Khlebnikov
    Cc: Kirill Shutemov
    Cc: Jan Kara
    Cc: Neil Brown
    Cc: Ross Zwisler
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     
  • The only remaining references to root->height were in extend and shrink,
    where it was updated. Now we can remove it entirely.

    Signed-off-by: Matthew Wilcox
    Reviewed-by: Ross Zwisler
    Cc: Konstantin Khlebnikov
    Cc: Kirill Shutemov
    Cc: Jan Kara
    Cc: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     
  • If radix_tree_shrink returns whether it managed to shrink, then
    __radix_tree_delete_node doesn't ned to query the tree to find out
    whether it did any work or not.

    Signed-off-by: Matthew Wilcox
    Cc: Konstantin Khlebnikov
    Cc: Kirill Shutemov
    Cc: Jan Kara
    Cc: Neil Brown
    Cc: Ross Zwisler
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     
  • node->shift represents the shift necessary for looking in the slots
    array at this level. It is equal to the old (node->height - 1) *
    RADIX_TREE_MAP_SHIFT.

    Signed-off-by: Matthew Wilcox
    Reviewed-by: Ross Zwisler
    Cc: Konstantin Khlebnikov
    Cc: Kirill Shutemov
    Cc: Jan Kara
    Cc: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     
  • Neither piece of information we're storing in node->path can be larger
    than 64, so store each in its own unsigned char instead of shifting and
    masking to store them both in an unsigned int.

    Signed-off-by: Matthew Wilcox
    Reviewed-by: Ross Zwisler
    Cc: Konstantin Khlebnikov
    Cc: Kirill Shutemov
    Cc: Jan Kara
    Cc: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     
  • Typos, whitespace, grammar, line length, using the correct types, etc.

    Signed-off-by: Matthew Wilcox
    Reviewed-by: Ross Zwisler
    Cc: Konstantin Khlebnikov
    Cc: Kirill Shutemov
    Cc: Jan Kara
    Cc: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     
  • The multiorder support is a sufficiently large feature to be worth
    adding copyrigt lines for.

    Signed-off-by: Matthew Wilcox
    Reviewed-by: Ross Zwisler
    Cc: Konstantin Khlebnikov
    Cc: Kirill Shutemov
    Cc: Jan Kara
    Cc: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     
  • - Print which indices are covered by every leaf entry
    - Print sibling entries
    - Print the node pointer instead of the slot entry
    - Build by default in userspace, and make it accessible to the test-suite

    Signed-off-by: Ross Zwisler
    Signed-off-by: Matthew Wilcox
    Cc: Konstantin Khlebnikov
    Cc: Kirill Shutemov
    Cc: Jan Kara
    Cc: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ross Zwisler
     
  • I had previously decided that tagging a single multiorder entry would
    count as tagging 2^order entries for the purposes of 'nr_to_tag'. I now
    believe that decision to be a mistake, and it should count as a single
    entry. That's more likely to be what callers expect.

    When walking back up the tree from a newly-tagged entry, the current
    code assumed we were starting from the lowest level of the tree; if we
    have a multiorder entry with an order at least RADIX_TREE_MAP_SHIFT in
    size then we need to shift the index by 'shift' before we start walking
    back up the tree, or we will end up not setting tags on higher entries,
    and then mistakenly thinking that entries below a certain point in the
    tree are not tagged.

    If the first index we examine is a sibling entry of a tagged multiorder
    entry, we were not tagging it. We need to examine the canonical entry,
    and the easiest way to do that is to use radix_tree_descend(). We then
    have to skip over sibling slots when looking for the next entry in the
    tree or we will end up walking back to the canonical entry.

    Add several tests for radix_tree_range_tag_if_tagged().

    Signed-off-by: Matthew Wilcox
    Reviewed-by: Ross Zwisler
    Cc: Konstantin Khlebnikov
    Cc: Kirill Shutemov
    Cc: Jan Kara
    Cc: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     
  • Use the new multi-order support functions to rewrite
    radix_tree_locate_item(). Modify the locate tests to test multiorder
    entries too.

    [hughd@google.com: radix_tree_locate_item() is often returning the wrong index]
    Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1605012108490.1166@eggly.anvils
    Signed-off-by: Matthew Wilcox
    Reviewed-by: Ross Zwisler
    Cc: Konstantin Khlebnikov
    Cc: Kirill Shutemov
    Cc: Jan Kara
    Cc: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     
  • If the radix tree user attempted to insert a colliding entry with an
    existing multiorder entry, then radix_tree_create() could encounter a
    sibling entry when walking down the tree to look for a slot. Use
    radix_tree_descend() to fix the problem, and add a test-case to make
    sure the problem doesn't come back in future.

    Signed-off-by: Matthew Wilcox
    Reviewed-by: Ross Zwisler
    Cc: Konstantin Khlebnikov
    Cc: Kirill Shutemov
    Cc: Jan Kara
    Cc: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     
  • Use the new multi-order support functions to rewrite
    radix_tree_tag_get()

    Signed-off-by: Ross Zwisler
    Signed-off-by: Matthew Wilcox
    Cc: Konstantin Khlebnikov
    Cc: Kirill Shutemov
    Cc: Jan Kara
    Cc: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ross Zwisler
     
  • Use the new multi-order support functions to rewrite
    radix_tree_tag_clear()

    Signed-off-by: Ross Zwisler
    Signed-off-by: Matthew Wilcox
    Cc: Konstantin Khlebnikov
    Cc: Kirill Shutemov
    Cc: Jan Kara
    Cc: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ross Zwisler
     
  • Use the new multi-order support functions to rewrite
    radix_tree_tag_set()

    Signed-off-by: Ross Zwisler
    Signed-off-by: Matthew Wilcox
    Cc: Konstantin Khlebnikov
    Cc: Kirill Shutemov
    Cc: Jan Kara
    Cc: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ross Zwisler
     
  • This enables the macros radix_tree_for_each_slot() and friends to be
    used with multi-order entries.

    The way that this works is that we treat all entries in a given slots[]
    array as a single chunk. If the index given to radix_tree_next_chunk()
    happens to point us to a sibling entry, we will back up iter->index so
    that it points to the canonical entry, and that will be the place where
    we start our iteration.

    As we're processing a chunk in radix_tree_next_slot(), we process
    canonical entries, skip over sibling entries, and restart the chunk
    lookup if we find a non-sibling indirect pointer. This drops back to
    the radix_tree_next_chunk() code, which will re-walk the tree and look
    for another chunk.

    This allows us to properly handle multi-order entries mixed with other
    entries that are at various heights in the radix tree.

    Signed-off-by: Ross Zwisler
    Signed-off-by: Matthew Wilcox
    Cc: Konstantin Khlebnikov
    Cc: Kirill Shutemov
    Cc: Jan Kara
    Cc: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ross Zwisler
     
  • These BUG_ON tests are to ensure that all the tags are clear when
    inserting a new entry. If we insert a multiorder entry, we'll end up
    looking at the tags for a different node, and so the BUG_ON can end up
    triggering spuriously.

    Also, we now have three tags, not two, so check all three are clear, and
    check all the root tags with a single call to BUG_ON since the bits are
    stored contiguously.

    Include a test-case to ensure this problem does not reoccur.

    Signed-off-by: Matthew Wilcox
    Reviewed-by: Ross Zwisler
    Cc: Konstantin Khlebnikov
    Cc: Kirill Shutemov
    Cc: Jan Kara
    Cc: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     
  • Use the new multi-order support functions to rewrite __radix_tree_lookup()

    Signed-off-by: Matthew Wilcox
    Reviewed-by: Ross Zwisler
    Cc: Konstantin Khlebnikov
    Cc: Kirill Shutemov
    Cc: Jan Kara
    Cc: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     
  • Setting the indirect bit on the user data entry used to be unambiguous
    because the tree walking code knew not to expect internal nodes in the
    last level of the tree. Multiorder entries can appear at any level of
    the tree, and a leaf with the indirect bit set is indistinguishable from
    a pointer to a node.

    Introduce a special entry (RADIX_TREE_RETRY) which is neither a valid
    user entry, nor a valid pointer to a node. The radix_tree_deref_retry()
    function continues to work the same way, but tree walking code can
    distinguish it from a pointer to a node.

    Also fix the condition for setting slot->parent to NULL; it does not
    matter what height the tree is, it only matters whether slot is an
    indirect pointer. Move this code above the comment which is referring
    to the assignment to root->rnode.

    Also fix the condition for preventing the tree from shrinking to a
    single entry if it's a multiorder entry.

    Add a test-case to the test suite that checks that the tree goes back
    down to its original height after an item is inserted & deleted from a
    higher index in the tree.

    Signed-off-by: Matthew Wilcox
    Reviewed-by: Ross Zwisler
    Cc: Konstantin Khlebnikov
    Cc: Kirill Shutemov
    Cc: Jan Kara
    Cc: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     
  • The current code will insert entries at each level, as if we're going to
    add a new entry at the bottom level, so we then get an -EEXIST when we
    try to insert the entry into the tree. The best way to fix this is to
    not check 'order' when inserting into an empty tree.

    We still need to 'extend' the tree to the height necessary for the maximum
    index corresponding to this entry, so pass that value to
    radix_tree_extend() rather than the index we're asked to create, or we
    won't create a tree that's deep enough.

    Signed-off-by: Matthew Wilcox
    Reviewed-by: Ross Zwisler
    Cc: Konstantin Khlebnikov
    Cc: Kirill Shutemov
    Cc: Jan Kara
    Cc: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     
  • All the tree walking functions start with some variant of this code;
    centralise it in one place so we're not chasing subtly different bugs
    everywhere.

    Signed-off-by: Matthew Wilcox
    Reviewed-by: Ross Zwisler
    Cc: Konstantin Khlebnikov
    Cc: Kirill Shutemov
    Cc: Jan Kara
    Cc: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     
  • Now that sibling pointers are handled explicitly, there is no purpose
    served by restricting the order to be >= RADIX_TREE_MAP_SHIFT.

    Signed-off-by: Matthew Wilcox
    Reviewed-by: Ross Zwisler
    Cc: Konstantin Khlebnikov
    Cc: Kirill Shutemov
    Cc: Jan Kara
    Cc: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     
  • If we deleted an entry through an index which looked up a sibling
    pointer, we'd end up zeroing out the wrong slots in the node. Use
    get_slot_offset() to find the right slot.

    Signed-off-by: Matthew Wilcox
    Reviewed-by: Ross Zwisler
    Cc: Konstantin Khlebnikov
    Cc: Kirill Shutemov
    Cc: Jan Kara
    Cc: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     
  • The subtraction was the wrong way round, leading to undefined behaviour
    (shift by an amount larger than the size of the type).

    Signed-off-by: Matthew Wilcox
    Reviewed-by: Ross Zwisler
    Cc: Konstantin Khlebnikov
    Cc: Kirill Shutemov
    Cc: Jan Kara
    Cc: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     
  • The code I previously added to enable multiorder radix tree entries was
    untested and therefore buggy. This commit adds the support functions
    that Ross and I decided were necessary over a four-week period of
    iterating various designs.

    Signed-off-by: Matthew Wilcox
    Reviewed-by: Ross Zwisler
    Cc: Konstantin Khlebnikov
    Cc: Kirill Shutemov
    Cc: Jan Kara
    Cc: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     
  • I've been receiving increasingly concerned notes from 0day about how
    much my recent changes have been bloating the radix tree. Make it
    happier by only including multiorder support if
    CONFIG_TRANSPARENT_HUGEPAGES is set.

    This is an independent Kconfig option, so other radix tree users can
    also set it if they have a need.

    Signed-off-by: Matthew Wilcox
    Reviewed-by: Ross Zwisler
    Cc: Konstantin Khlebnikov
    Cc: Kirill Shutemov
    Cc: Jan Kara
    Cc: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     

18 Mar, 2016

4 commits

  • This is debug code which is #if 0 out.

    Signed-off-by: Matthew Wilcox
    Cc: Johannes Weiner
    Cc: Matthew Wilcox
    Cc: "Kirill A. Shutemov"
    Cc: Ross Zwisler
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     
  • With huge pages, it is convenient to have the radix tree be able to
    return an entry that covers multiple indices. Previous attempts to deal
    with the problem have involved inserting N duplicate entries, which is a
    waste of memory and leads to problems trying to handle aliased tags, or
    probing the tree multiple times to find alternative entries which might
    cover the requested index.

    This approach inserts one canonical entry into the tree for a given
    range of indices, and may also insert other entries in order to ensure
    that lookups find the canonical entry.

    This solution only tolerates inserting powers of two that are greater
    than the fanout of the tree. If we wish to expand the radix tree's
    abilities to support large-ish pages that is less than the fanout at the
    penultimate level of the tree, then we would need to add one more step
    in lookup to ensure that any sibling nodes in the final level of the
    tree are dereferenced and we return the canonical entry that they
    reference.

    Signed-off-by: Matthew Wilcox
    Cc: Johannes Weiner
    Cc: Matthew Wilcox
    Cc: "Kirill A. Shutemov"
    Cc: Ross Zwisler
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     
  • When we introduce entries that can cover multiple indices, we will need
    to stop in __radix_tree_create based on the shift, not the height.
    Split out for ease of bisect.

    Signed-off-by: Matthew Wilcox
    Cc: Johannes Weiner
    Cc: Matthew Wilcox
    Cc: "Kirill A. Shutemov"
    Cc: Ross Zwisler
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     
  • Set the 'indirect_ptr' bit on all the pointers to internal nodes, not
    just on the root node. This enables the following patches to support
    multi-order entries in the radix tree. This patch is split out for ease
    of bisection.

    Signed-off-by: Matthew Wilcox
    Cc: Johannes Weiner
    Cc: Matthew Wilcox
    Cc: "Kirill A. Shutemov"
    Cc: Ross Zwisler
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox