09 Oct, 2012

40 commits

  • kmemleak uses a tree where each node represents an allocated memory object
    in order to quickly find out what object a given address is part of.
    However, the objects don't overlap, so rbtrees are a better choice than
    prio trees for this use. They are both faster and have lower memory
    overhead.
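
    The lookup this relies on can be sketched in userspace C. This is only an
    illustration, not the kmemleak code: struct object, object_insert() and
    object_lookup() are hypothetical names, and the tree is a plain unbalanced
    binary search tree where the kernel would use a balanced rbtree. Because
    the objects never overlap, ordering by start address is enough to find
    the one object that can contain a given address.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a tracked allocation: [start, start + size). */
struct object {
	unsigned long start;
	size_t size;
	struct object *left, *right;
};

/* Insert into a plain (unbalanced) binary search tree keyed on start. */
struct object *object_insert(struct object *root, struct object *obj)
{
	struct object **link = &root;

	obj->left = obj->right = NULL;
	while (*link)
		link = obj->start < (*link)->start ? &(*link)->left
						   : &(*link)->right;
	*link = obj;
	return root;
}

/* Find the object containing addr, or NULL if no object covers it. */
struct object *object_lookup(struct object *root, unsigned long addr)
{
	while (root) {
		if (addr < root->start)
			root = root->left;
		else if (addr - root->start >= root->size)
			root = root->right;
		else
			return root;
	}
	return NULL;
}
```

    No augmented per-subtree information is needed here: at every node the
    comparison against that node's own range decides the direction.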

    Tested by booting a kernel with kmemleak enabled, loading the
    kmemleak_test module, and looking for the expected messages.

    Signed-off-by: Michel Lespinasse
    Cc: Rik van Riel
    Cc: Hillf Danton
    Cc: Peter Zijlstra
    Cc: Andrea Arcangeli
    Cc: David Woodhouse
    Acked-by: Catalin Marinas
    Tested-by: Catalin Marinas
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • Implement an interval tree as a replacement for the VMA prio_tree. The
    algorithms are similar to lib/interval_tree.c; however that code can't be
    directly reused as the interval endpoints are not explicitly stored in the
    VMA. So instead, the common algorithm is moved into a template and the
    details (node type, how to get interval endpoints from the node, etc) are
    filled in using the C preprocessor.
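
    As a rough illustration of the template idea (not the macro added by this
    patch), the sketch below generates an overlap test for two node types
    whose endpoints are obtained differently. INTERVAL_DEFINE, struct range
    and struct mapping are hypothetical names chosen for the example.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical template: generates NAME_overlaps() for any node TYPE, given
 * macros START(n) and LAST(n) that yield the closed interval endpoints.
 */
#define INTERVAL_DEFINE(TYPE, START, LAST, NAME)                        \
bool NAME##_overlaps(const TYPE *n, unsigned long start,                \
		     unsigned long last)                                \
{                                                                       \
	return START(n) <= last && start <= LAST(n);                    \
}

/* Endpoints stored explicitly in the node. */
struct range { unsigned long start, last; };
#define RANGE_START(n) ((n)->start)
#define RANGE_LAST(n)  ((n)->last)
INTERVAL_DEFINE(struct range, RANGE_START, RANGE_LAST, range)

/* Endpoints derived from other fields (page offset and page count). */
struct mapping { unsigned long pgoff, nr_pages; };
#define MAPPING_START(n) ((n)->pgoff)
#define MAPPING_LAST(n)  ((n)->pgoff + (n)->nr_pages - 1)
INTERVAL_DEFINE(struct mapping, MAPPING_START, MAPPING_LAST, mapping)
```

    The algorithm is written once; each instantiation only supplies the node
    type and the two endpoint accessors.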

    Once the interval tree functions are available, using them as a
    replacement to the VMA prio tree is a relatively simple, mechanical job.

    Signed-off-by: Michel Lespinasse
    Cc: Rik van Riel
    Cc: Hillf Danton
    Cc: Peter Zijlstra
    Cc: Catalin Marinas
    Cc: Andrea Arcangeli
    Cc: David Woodhouse
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • Patch 1 implements support for interval trees, on top of the augmented
    rbtree API. It also adds synthetic tests to compare the performance of
    interval trees vs prio trees. The short answer is that interval trees are
    slightly faster (~25%) on insert/erase, and much faster (~2.4-3x)
    on search. It is debatable how realistic the synthetic test is, and I have
    not made such measurements yet, but my impression is that interval trees
    would still come out faster.

    Patch 2 uses a preprocessor template to make the interval tree generic,
    and uses it as a replacement for the vma prio_tree.

    Patch 3 takes the other prio_tree user, kmemleak, and converts it to use
    a basic rbtree. We don't actually need the augmented rbtree support here
    because the intervals are always non-overlapping.

    Patch 4 removes the now-unused prio tree library.

    Patch 5 proposes an additional optimization to rb_erase_augmented, now
    providing it as an inline function so that the augmented callbacks can be
    inlined in. This provides an additional 5-10% performance improvement
    for the interval tree insert/erase benchmark. There is a maintenance cost
    as it exposes augmented rbtree users to some of the rbtree library internals;
    however I think this cost shouldn't be too high as I expect the augmented
    rbtree will always have far fewer users than the base rbtree.

    I should probably add a quick summary of why I think it makes sense to
    replace prio trees with augmented rbtree based interval trees now. One of
    the drivers is that we need augmented rbtrees for Rik's vma gap finding
    code, and once you have them, it just makes sense to use them for interval
    trees as well, as this is the simpler and better known algorithm. Prio
    trees, in comparison, seem *too* clever: they impose an additional 'heap'
    constraint on the tree, which they use to guarantee a faster worst-case
    complexity of O(k+log N) for stabbing queries in a well-balanced prio
    tree, vs O(k*log N) for interval trees (where k=number of matches,
    N=number of intervals). Now this sounds great, but in practice prio trees
    don't realize this theoretical benefit. First, the additional constraint
    makes them harder to update, so that the kernel implementation has to
    simplify things by balancing them like a radix tree, which is not always
    ideal. Second, the fact that there are both index and heap properties
    makes both tree manipulation and search more complex, which results in a
    higher multiplicative time constant. As it turns out, the simple interval
    tree algorithm ends up running faster than the more clever prio tree.
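
    For readers unfamiliar with the simple algorithm, here is a minimal
    userspace sketch of a stabbing query on a binary search tree keyed on the
    interval start and augmented with the largest endpoint found in each
    subtree. It is not lib/interval_tree.c: the names (struct itnode,
    it_insert(), it_stab()) are hypothetical and the tree is left unbalanced
    to keep the example short.

```c
#include <assert.h>
#include <stddef.h>

struct itnode {
	unsigned long start, last;	/* closed interval [start, last] */
	unsigned long subtree_last;	/* max 'last' in this subtree */
	struct itnode *left, *right;
};

/* Unbalanced insert keyed on start; a real tree would also rebalance. */
struct itnode *it_insert(struct itnode *root, struct itnode *n)
{
	struct itnode **link = &root;

	n->left = n->right = NULL;
	n->subtree_last = n->last;
	while (*link) {
		struct itnode *p = *link;

		if (p->subtree_last < n->last)
			p->subtree_last = n->last;	/* keep augmentation */
		link = n->start < p->start ? &p->left : &p->right;
	}
	*link = n;
	return root;
}

/* Count intervals containing point, pruning subtrees that cannot match. */
size_t it_stab(const struct itnode *n, unsigned long point)
{
	size_t hits = 0;

	if (!n || n->subtree_last < point)
		return 0;	/* nothing below ends at or after point */
	hits += it_stab(n->left, point);
	if (n->start <= point) {
		if (point <= n->last)
			hits++;
		/* every start in the right subtree is >= n->start */
		hits += it_stab(n->right, point);
	}
	return hits;
}
```

    The subtree_last test is what lets the search skip whole subtrees; the
    rest is an ordinary ordered walk.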

    This patch:

    Add two test modules:

    - prio_tree_test measures the performance of lib/prio_tree.c, both for
    insertion/removal and for stabbing searches

    - interval_tree_test measures the performance of a library of equivalent
    functionality, built using the augmented rbtree support.

    In order to support the second test module, lib/interval_tree.c is
    introduced. It is kept separate from the interval_tree_test main file
    for two reasons: first we don't want to provide an unfair advantage
    over prio_tree_test by having everything in a single compilation unit,
    and second there is the possibility that the interval tree functionality
    could get some non-test users in kernel over time.

    Signed-off-by: Michel Lespinasse
    Cc: Rik van Riel
    Cc: Hillf Danton
    Cc: Peter Zijlstra
    Cc: Catalin Marinas
    Cc: Andrea Arcangeli
    Cc: David Woodhouse
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • As proposed by Peter Zijlstra, this makes it easier to define the augmented
    rbtree callbacks.

    Signed-off-by: Michel Lespinasse
    Cc: Rik van Riel
    Cc: Peter Zijlstra
    Cc: Andrea Arcangeli
    Cc: David Woodhouse
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • convert arch/x86/mm/pat_rbtree.c to the proposed augmented rbtree api
    and remove the old augmented rbtree implementation.

    Signed-off-by: Michel Lespinasse
    Acked-by: Rik van Riel
    Cc: Peter Zijlstra
    Cc: Andrea Arcangeli
    Cc: David Woodhouse
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • Introduce new augmented rbtree APIs that allow minimal recalculation of
    augmented node information.

    A new callback is added to the rbtree insertion and erase rebalancing
    functions, to be called on each tree rotation. Such rotations preserve
    the subtree's root augmented value, but require recalculation of the one
    child that was previously located at the subtree root.
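
    A small sketch of that observation, assuming the augmented value is the
    maximum of a per-node value over the subtree. The names (struct anode,
    compute_max(), rotate_left()) are hypothetical and the rotation ignores
    parent pointers and colors; it is not the callback interface added here.

```c
#include <assert.h>
#include <stddef.h>

struct anode {
	unsigned long value;		/* per-node value */
	unsigned long subtree_max;	/* augmented: max value in subtree */
	struct anode *left, *right;
};

/* Recompute one node's augmented value from itself and its children. */
unsigned long compute_max(const struct anode *n)
{
	unsigned long max = n->value;

	if (n->left && n->left->subtree_max > max)
		max = n->left->subtree_max;
	if (n->right && n->right->subtree_max > max)
		max = n->right->subtree_max;
	return max;
}

/* Rotate left around old; returns the new subtree root. */
struct anode *rotate_left(struct anode *old)
{
	struct anode *pivot = old->right;

	assert(pivot);
	old->right = pivot->left;
	pivot->left = old;
	/* The subtree holds the same nodes, so the root value carries over. */
	pivot->subtree_max = old->subtree_max;
	/* Only the demoted node now covers a different set of nodes. */
	old->subtree_max = compute_max(old);
	return pivot;
}
```

    One copy and one recomputation per rotation is all that is needed, which
    is why a single rotation callback is sufficient on the insertion side.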

    In the insertion case, the handcoded search phase must be updated to
    maintain the augmented information on insertion, and then the rbtree
    coloring/rebalancing algorithms keep it up to date.

    In the erase case, things are more complicated since it is library
    code that manipulates the rbtree in order to remove internal nodes.
    This requires a couple of additional callbacks to copy a subtree's
    augmented value when a new root is stitched in, and to recompute
    augmented values down the ancestry path when a node is removed from
    the tree.

    In order to preserve maximum speed for the non-augmented case,
    we provide two versions of each tree manipulation function.
    rb_insert_augmented() is the augmented equivalent of rb_insert_color(),
    and rb_erase_augmented() is the augmented equivalent of rb_erase().

    Signed-off-by: Michel Lespinasse
    Acked-by: Rik van Riel
    Cc: Peter Zijlstra
    Cc: Andrea Arcangeli
    Cc: David Woodhouse
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • Small test to measure the performance of augmented rbtrees.

    Signed-off-by: Michel Lespinasse
    Acked-by: Rik van Riel
    Cc: Peter Zijlstra
    Cc: Andrea Arcangeli
    Cc: David Woodhouse
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • Various minor optimizations in rb_erase():
    - Avoid loading node->__rb_parent_color multiple times when computing parent
    and color information (possibly not in close sequence, as there might
    be further branches in the algorithm)
    - In the 1-child subcase of case 1, copy the __rb_parent_color field from
    the erased node to the child instead of recomputing it from the desired
    parent and color
    - When searching for the erased node's successor, differentiate between
    cases 2 and 3 based on whether any left links were followed. This avoids
    a condition later down.
    - In case 3, keep a pointer to the erased node's right child so we don't
    have to refetch it later to adjust its parent.
    - In the no-childs subcase of cases 2 and 3, place the rebalance assignment
    last so that the compiler can remove the following if (rebalance) test.

    Also, added some comments to illustrate cases 2 and 3.

    Signed-off-by: Michel Lespinasse
    Acked-by: Rik van Riel
    Cc: Peter Zijlstra
    Cc: Andrea Arcangeli
    Cc: David Woodhouse
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • An interesting observation for rb_erase() is that when a node has
    exactly one child, the node must be black and the child must be red.
    An interesting consequence is that removing such a node can be done by
    simply replacing it with its child and making the child black,
    which we can do efficiently in rb_erase(). __rb_erase_color() then
    only needs to handle the no-childs case and can be modified accordingly.
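
    The reasoning is that the missing side has a black height of zero, so the
    lone child cannot be black or have black descendants; it must be a red
    leaf, and a red node's parent is black. A minimal userspace sketch of the
    resulting shortcut follows. The names (struct cnode, erase_one_child())
    are hypothetical, with separate parent and color fields for clarity.

```c
#include <assert.h>
#include <stddef.h>

enum color { RED, BLACK };

struct cnode {
	enum color color;
	struct cnode *parent, *left, *right;
};

/* Remove a node that has exactly one child. Returns the resulting root. */
struct cnode *erase_one_child(struct cnode *root, struct cnode *node)
{
	struct cnode *child = node->left ? node->left : node->right;
	struct cnode *parent = node->parent;

	/* Preconditions implied by the red-black invariants. */
	assert(child && !(node->left && node->right));
	assert(node->color == BLACK && child->color == RED);

	child->parent = parent;
	child->color = BLACK;	/* restores the black count on this path */
	if (!parent)
		return child;	/* the erased node was the root */
	if (parent->left == node)
		parent->left = child;
	else
		parent->right = child;
	return root;
}
```

    No rebalancing loop is entered in this case: recoloring the child makes up
    for the black node that was removed.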

    Signed-off-by: Michel Lespinasse
    Acked-by: Rik van Riel
    Cc: Peter Zijlstra
    Cc: Andrea Arcangeli
    Cc: David Woodhouse
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • In rb_erase, move the easy case (node to erase has no more than
    1 child) first. I feel the code reads easier that way.

    Signed-off-by: Michel Lespinasse
    Reviewed-by: Rik van Riel
    Cc: Peter Zijlstra
    Cc: Andrea Arcangeli
    Cc: David Woodhouse
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • Add __rb_change_child() as an inline helper function to replace code that
    would otherwise be duplicated 4 times in the source.

    No changes to binary size or speed.

    Signed-off-by: Michel Lespinasse
    Reviewed-by: Rik van Riel
    Cc: Peter Zijlstra
    Cc: Andrea Arcangeli
    Cc: David Woodhouse
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • Just a small fix to make sparse happy.

    Signed-off-by: Michel Lespinasse
    Reported-by: Fengguang Wu
    Acked-by: Rik van Riel
    Cc: Peter Zijlstra
    Cc: Andrea Arcangeli
    Cc: David Woodhouse
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • When looking to fetch a node's sibling, we went through a sequence of:
    - check if node is the parent's left child
    - if it is, then fetch the parent's right child

    This can be replaced with:
    - fetch the parent's right child as an assumed sibling
    - check that node is NOT the fetched child

    This avoids fetching the parent's left child when node is actually
    that child. Saves a bit on code size, though it doesn't seem to make
    a large difference in speed.
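
    The two sequences can be compared side by side in a small sketch. The
    struct and function names are hypothetical; only the order of the loads
    and the comparison matter.

```c
#include <assert.h>
#include <stddef.h>

struct tnode {
	struct tnode *left, *right;
};

/* Before: test which side node is on, then load the other side. */
struct tnode *sibling_checked(const struct tnode *parent,
			      const struct tnode *node)
{
	if (parent->left == node)
		return parent->right;
	return parent->left;
}

/* After: load the right child first and treat it as the assumed sibling. */
struct tnode *sibling_assumed(const struct tnode *parent,
			      const struct tnode *node)
{
	struct tnode *sibling = parent->right;

	if (node != sibling)
		return sibling;	/* node is the left child; ->left never loaded */
	return parent->left;
}
```

    Both functions return the same node; the second simply avoids one load
    when node turns out to be the left child.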

    Signed-off-by: Michel Lespinasse
    Cc: Andrea Arcangeli
    Cc: David Woodhouse
    Acked-by: Rik van Riel
    Cc: Peter Zijlstra
    Cc: Daniel Santos
    Cc: Jens Axboe
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • Set comment and indentation style to be consistent with Linux coding style
    and the rest of the file, as suggested by Peter Zijlstra.

    Signed-off-by: Michel Lespinasse
    Cc: Andrea Arcangeli
    Acked-by: David Woodhouse
    Cc: Rik van Riel
    Cc: Peter Zijlstra
    Cc: Daniel Santos
    Cc: Jens Axboe
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • In __rb_erase_color(), we often already have pointers to the nodes being
    rotated and/or know what their colors must be, so we can generate more
    efficient code than the generic __rb_rotate_left() and __rb_rotate_right()
    functions.

    Also when the current node is red or when flipping the sibling's color,
    the parent is already known so we can use the more efficient
    rb_set_parent_color() function to set the desired color.

    Signed-off-by: Michel Lespinasse
    Cc: Andrea Arcangeli
    Acked-by: David Woodhouse
    Cc: Rik van Riel
    Cc: Peter Zijlstra
    Cc: Daniel Santos
    Cc: Jens Axboe
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • In __rb_erase_color(), we have to select one of 3 cases depending on the
    colors of the 'other' node's children. If both children are black, we flip a
    few node colors and iterate. Otherwise, we do either one or two tree
    rotations, depending on the color of the 'other' child opposite to 'node',
    and then we are done.

    The corresponding logic had duplicate checks for the color of the 'other'
    child opposite to 'node'. It was checking it first to determine if both
    children are black, and then to determine how many tree rotations are
    required. Rearrange the logic to avoid that extra check.

    Signed-off-by: Michel Lespinasse
    Cc: Andrea Arcangeli
    Acked-by: David Woodhouse
    Cc: Rik van Riel
    Cc: Peter Zijlstra
    Cc: Daniel Santos
    Cc: Jens Axboe
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • In __rb_erase_color(), we were always setting a node to black after
    exiting the main loop. And in one case, after fixing up the tree to
    satisfy all rbtree invariants, we were setting the current node to root
    just to guarantee a loop exit, at which point the root would be set to
    black. However this is not necessary, as the root of an rbtree is already
    known to be black. The only case where the color flip is required is when
    we exit the loop due to the current node being red, and it's easiest to
    just do the flip at that point instead of doing it after the loop.

    [adrian.hunter@intel.com: perf tools: fix build for another rbtree.c change]
    Signed-off-by: Michel Lespinasse
    Cc: Andrea Arcangeli
    Acked-by: David Woodhouse
    Cc: Rik van Riel
    Cc: Peter Zijlstra
    Cc: Daniel Santos
    Cc: Jens Axboe
    Cc: "Eric W. Biederman"
    Signed-off-by: Adrian Hunter
    Cc: Alexander Shishkin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • - Use the newly introduced rb_set_parent_color() function to flip the color
    of nodes whose parent is already known.
    - Optimize rb_parent() when the node is known to be red - there is no need
    to mask out the color in that case.
    - Flipping gparent's color to red requires us to fetch its rb_parent_color
    field, so we can reuse it as the parent value for the next loop iteration.
    - Do not use __rb_rotate_left() and __rb_rotate_right() to handle tree
    rotations: we already have pointers to all relevant nodes, and know their
    colors (either because we want to adjust it, or because we've tested it,
    or we can deduce it as black due to the node proximity to a known red node).
    So we can generate more efficient code by making use of the node pointers
    we already have, and setting both the parent and color attributes for
    nodes all at once. Also in Case 2, some node attributes don't have to
    be set because we know another tree rotation (Case 3) will always follow
    and override them.
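
    The first two items rely on the parent pointer and the color sharing one
    word. A userspace sketch of that packing is shown below; the names are
    hypothetical, and it assumes nodes are aligned to at least 4 bytes so
    that the two low bits of a node address are always zero.

```c
#include <assert.h>
#include <stdint.h>

#define COLOR_RED	((uintptr_t)0)
#define COLOR_BLACK	((uintptr_t)1)

struct pnode {
	uintptr_t parent_color;		/* parent pointer | color bit */
	struct pnode *left, *right;
};

static inline struct pnode *pnode_parent(const struct pnode *n)
{
	return (struct pnode *)(n->parent_color & ~(uintptr_t)3);
}

static inline uintptr_t pnode_color(const struct pnode *n)
{
	return n->parent_color & 1;
}

/* Set both fields with one store; the old value is never read. */
static inline void pnode_set_parent_color(struct pnode *n,
					  struct pnode *parent,
					  uintptr_t color)
{
	n->parent_color = (uintptr_t)parent | color;
}

/* For a node known to be red the color bit is 0, so no mask is needed. */
static inline struct pnode *pnode_red_parent(const struct pnode *n)
{
	assert(pnode_color(n) == COLOR_RED);
	return (struct pnode *)n->parent_color;
}
```

    When the parent is already in a register, writing parent and color
    together avoids the read-modify-write that a separate color setter needs.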

    Signed-off-by: Michel Lespinasse
    Cc: Andrea Arcangeli
    Acked-by: David Woodhouse
    Cc: Rik van Riel
    Cc: Peter Zijlstra
    Cc: Daniel Santos
    Cc: Jens Axboe
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • The root node of an rbtree must always be black. However,
    rb_insert_color() only needs to maintain this invariant when it has been
    broken - that is, when it exits the loop due to the current (red) node
    being the root. In all other cases (exiting after tree rotations, or
    exiting due to an existing black parent) the invariant is already
    satisfied, so there is no need to adjust the root node color.

    Signed-off-by: Michel Lespinasse
    Cc: Andrea Arcangeli
    Acked-by: David Woodhouse
    Cc: Rik van Riel
    Cc: Peter Zijlstra
    Cc: Daniel Santos
    Cc: Jens Axboe
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • It is a well known property of rbtrees that insertion never requires more
    than two tree rotations. In our implementation, after one loop iteration
    identified one or two necessary tree rotations, we would iterate and look
    for more. However at that point the node's parent would always be black,
    which would cause us to exit the loop.

    We can make the code flow more obvious by just adding a break statement
    after the tree rotations, where we know we are done. Additionally, in the
    cases where two tree rotations are necessary, we don't have to update the
    'node' pointer as it wouldn't be used until the next loop iteration, which
    we now avoid due to this break statement.

    Signed-off-by: Michel Lespinasse
    Cc: Andrea Arcangeli
    Acked-by: David Woodhouse
    Cc: Rik van Riel
    Cc: Peter Zijlstra
    Cc: Daniel Santos
    Cc: Jens Axboe
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • This small module helps measure the performance of rbtree insert and
    erase.

    Additionally, we run a few correctness tests to check that the rbtrees
    have all desired properties:

    - contains the right number of nodes in the order desired,
    - never two consecutive red nodes on any path,
    - all paths to leaf nodes have the same number of black nodes,
    - root node is black
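
    A recursive checker for most of these properties might look like the
    sketch below. It is not the test module's code: struct vnode,
    check_subtree() and check_rbtree() are hypothetical names, and the node
    count check is left out for brevity.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct vnode {
	int key;
	bool red;
	struct vnode *left, *right;
};

/*
 * Returns the black height of the subtree, or -1 if a red node has a red
 * child, the black counts differ between paths, or keys are out of order.
 */
int check_subtree(const struct vnode *n, const int *min, const int *max)
{
	int lh, rh;

	if (!n)
		return 0;
	if ((min && n->key < *min) || (max && n->key > *max))
		return -1;
	if (n->red && ((n->left && n->left->red) ||
		       (n->right && n->right->red)))
		return -1;
	lh = check_subtree(n->left, min, &n->key);
	rh = check_subtree(n->right, &n->key, max);
	if (lh < 0 || rh < 0 || lh != rh)
		return -1;
	return lh + (n->red ? 0 : 1);
}

bool check_rbtree(const struct vnode *root)
{
	if (root && root->red)
		return false;	/* root must be black */
	return check_subtree(root, NULL, NULL) >= 0;
}
```

    Running such a check after every insert and erase is expensive, which is
    acceptable in a test module but not in the library itself.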

    [akpm@linux-foundation.org: fix printk warning: sparc64 cycles_t is unsigned long]
    Signed-off-by: Michel Lespinasse
    Cc: Andrea Arcangeli
    Acked-by: David Woodhouse
    Cc: Rik van Riel
    Cc: Peter Zijlstra
    Cc: Daniel Santos
    Cc: Jens Axboe
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • rbtree users must use the documented APIs to manipulate the tree
    structure. Low-level helpers to manipulate node colors and parenthood are
    not part of that API, so move them to lib/rbtree.c

    [dwmw2@infradead.org: fix jffs2 build issue due to renamed __rb_parent_color field]
    Signed-off-by: Michel Lespinasse
    Cc: Andrea Arcangeli
    Acked-by: David Woodhouse
    Cc: Rik van Riel
    Cc: Peter Zijlstra
    Cc: Daniel Santos
    Cc: Jens Axboe
    Cc: "Eric W. Biederman"
    Signed-off-by: David Woodhouse
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • The recently added code to use rbtrees in sysctl did not follow the proper
    rbtree interface on insertion - it was calling rb_link_node() which
    inserts a new node into the binary tree, but missed the call to
    rb_insert_color() which properly balances the rbtree and establishes all
    expected rbtree invariants.

    I found out about this only because the faulty commit also used
    rb_init_node(), which I am removing within this patchset. But I think
    it's an easy mistake to make, and it makes me wonder if we should change
    the rbtree API so that insertions would be done with a single rb_insert()
    call (even if its implementation could still inline the rb_link_node()
    part and call a private __rb_insert_color function to do the rebalancing).

    Signed-off-by: Michel Lespinasse
    Cc: Andrea Arcangeli
    Acked-by: David Woodhouse
    Cc: Rik van Riel
    Cc: Peter Zijlstra
    Cc: Daniel Santos
    Cc: Jens Axboe
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • Empty nodes have no color. We can make use of this property to simplify
    the code emitted by the RB_EMPTY_NODE and RB_CLEAR_NODE macros. Also,
    we can get rid of the rb_init_node function which had been introduced by
    commit 88d19cf37952 ("timers: Add rb_init_node() to allow for stack
    allocated rb nodes") to avoid some issue with the empty node's color not
    being initialized.
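
    One encoding consistent with this description is to mark an empty node by
    making its parent field point at the node itself, so that no color needs
    to be stored or initialized. The userspace sketch below illustrates the
    idea with hypothetical names (struct enode, enode_clear(), enode_empty(),
    enode_link()); it is an assumption about the approach, not a copy of the
    kernel macros.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct enode {
	uintptr_t parent_color;		/* parent pointer | color bit */
	struct enode *left, *right;
};

/* Mark a node as not being in any tree: it becomes its own "parent". */
static inline void enode_clear(struct enode *n)
{
	n->parent_color = (uintptr_t)n;
}

static inline bool enode_empty(const struct enode *n)
{
	return n->parent_color == (uintptr_t)n;
}

/* Linking overwrites every field, so no prior initialization is needed. */
static inline void enode_link(struct enode *n, struct enode *parent,
			      struct enode **link)
{
	n->parent_color = (uintptr_t)parent;	/* color bit 0 */
	n->left = n->right = NULL;
	*link = n;
}
```

    A node that is linked into a tree can never be its own parent, so the
    self-pointer is unambiguous as an "empty" marker.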

    I'm not sure what the RB_EMPTY_NODE checks in rb_prev() / rb_next() are
    doing there, though. axboe introduced them in commit 10fd48f2376d
    ("rbtree: fixed reversed RB_EMPTY_NODE and rb_next/prev"). The way I
    see it, the 'empty node' abstraction is only used by rbtree users to
    flag nodes that they haven't inserted in any rbtree, so asking the
    predecessor or successor of such nodes doesn't make any sense.

    One final rb_init_node() caller was recently added in sysctl code to
    implement faster sysctl name lookups. This code doesn't make use of
    RB_EMPTY_NODE at all, and from what I could see it only called
    rb_init_node() under the mistaken assumption that such initialization was
    required before node insertion.

    [sfr@canb.auug.org.au: fix net/ceph/osd_client.c build]
    Signed-off-by: Michel Lespinasse
    Cc: Andrea Arcangeli
    Acked-by: David Woodhouse
    Cc: Rik van Riel
    Cc: Peter Zijlstra
    Cc: Daniel Santos
    Cc: Jens Axboe
    Cc: "Eric W. Biederman"
    Cc: John Stultz
    Signed-off-by: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • I recently started looking at the rbtree code (with an eye towards
    improving the augmented rbtree support, but I haven't gotten there yet).
    I noticed a lot of possible speed improvements, which I am now proposing
    in this patch set.

    Patches 1-4 are preparatory: remove internal functions from rbtree.h so
    that users won't be tempted to use them instead of the documented APIs,
    clean up some incorrect usages I've noticed (in particular, with the
    recently added fs/proc/proc_sysctl.c rbtree usage), reference the
    documentation so that people have one less excuse to miss it, etc.

    Patch 5 is a small module I wrote to check the rbtree performance. It
    creates 100 nodes with random keys and repeatedly inserts and erases them
    from an rbtree. Additionally, it has code to check for rbtree invariants
    after each insert or erase operation.

    Patches 6-12 are where the rbtree optimizations are done, and they touch
    only that one file, lib/rbtree.c. I am getting good results out of these
    - in my small benchmark doing rbtree insertion (including search) and
    erase, I'm seeing a 30% runtime reduction on Sandybridge E5, which is more
    than I initially thought would be possible. (The results aren't as
    impressive on my two other test hosts though, AMD Barcelona and Intel
    Westmere, where I am seeing only a 14% runtime reduction.) The code size -
    both source (omitting comments) and compiled - is also shorter after these
    changes. However, I do admit that the updated code is more arduous to
    read - one big reason for that is the removal of the tree rotation
    helpers, which added some overhead but also made it easier to reason about
    things locally. Overall, I believe this is an acceptable compromise,
    given that this code doesn't get modified very often, and that I have good
    tests for it.

    Upon Peter's suggestion, I added comments showing the rbtree configuration
    before every rotation. I think they help; however it's still best to have
    a copy of the Cormen/Leiserson/Rivest book when digging into this code.

    This patch: reference Documentation/rbtree.txt for usage instructions

    include/linux/rbtree.h included some basic usage instructions, while
    Documentation/rbtree.txt had more complete and easier to follow
    instructions. Replace the former with a reference to the latter.

    Signed-off-by: Michel Lespinasse
    Cc: Andrea Arcangeli
    Acked-by: David Woodhouse
    Cc: Rik van Riel
    Cc: Peter Zijlstra
    Cc: Daniel Santos
    Cc: Jens Axboe
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • Commit d6629859b36d ("ipc/mqueue: improve performance of send/recv") and
    ce2d52cc ("ipc/mqueue: add rbtree node caching support") introduced an
    rbtree of message priorities, and usage of rb_init_node() to initialize
    the corresponding nodes. As it turns out, rb_init_node() is unnecessary
    here, as the nodes are fully initialized on insertion by rb_link_node()
    and the code doesn't access nodes that aren't inserted on the rbtree.

    Remove the rb_init_node() calls, as that function was removed during the
    rbtree API cleanups (the only other use of it was in a place that
    similarly didn't require it).

    Signed-off-by: Michel Lespinasse
    Acked-by: Doug Ledford
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • This implements the architecture backend for transparent hugepages
    on s390.

    Signed-off-by: Gerald Schaefer
    Cc: Andrea Arcangeli
    Cc: Andi Kleen
    Cc: Hugh Dickins
    Cc: Hillf Danton
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gerald Schaefer
     
  • This patch is part of the architecture backend for thp on s390. It
    disables thp for kvm hosts, because there is no kvm host hugepage support
    so far. Existing thp mappings are split by follow_page() with FOLL_SPLIT,
    and future thp mappings are prevented by setting VM_NOHUGEPAGE in
    mm->def_flags.

    Signed-off-by: Gerald Schaefer
    Cc: Andrea Arcangeli
    Cc: Andi Kleen
    Cc: Hugh Dickins
    Cc: Hillf Danton
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gerald Schaefer
     
  • This patch is part of the architecture backend for thp on s390. It
    provides the pagetable pre-allocation functions
    pgtable_trans_huge_deposit() and pgtable_trans_huge_withdraw(). Unlike
    other archs, s390 has no struct page * as pgtable_t, but rather a pointer
    to the page table. So instead of saving the pagetable pre-allocation
    list info inside the struct page, it is being saved within the pagetable
    itself.

    Signed-off-by: Gerald Schaefer
    Cc: Andrea Arcangeli
    Cc: Andi Kleen
    Cc: Hugh Dickins
    Cc: Hillf Danton
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gerald Schaefer
     
  • This patch is part of the architecture backend for thp on s390. It
    provides the functions related to thp splitting, including serialization
    against gup. Unlike other archs, pmdp_splitting_flush() cannot use a tlb
    flushing operation to serialize against gup on s390, because that wouldn't
    be stopped by the disabled IRQs. So instead, smp_call_function() is
    called with an empty function, which will have the expected effect.

    Signed-off-by: Gerald Schaefer
    Cc: Andrea Arcangeli
    Cc: Andi Kleen
    Cc: Hugh Dickins
    Cc: Hillf Danton
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gerald Schaefer
     
  • This adds a check to hugepage_madvise(), to refuse MADV_HUGEPAGE if
    VM_NOHUGEPAGE is set in mm->def_flags. On s390, the VM_NOHUGEPAGE flag
    will be set in mm->def_flags for kvm processes, to prevent any future thp
    mappings. In order to also prevent MADV_HUGEPAGE on such an mm,
    hugepage_madvise() should check mm->def_flags.
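
    The shape of the added check can be sketched in userspace as follows.
    Everything here is a stand-in: the SK_ flag values, the advice enum,
    struct sk_mm and the -EINVAL return code are assumptions made for the
    example and are not taken from the patch.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical bit values and advice codes, for illustration only. */
#define SK_VM_HUGEPAGE		0x1UL
#define SK_VM_NOHUGEPAGE	0x2UL

enum sk_advice { SK_MADV_HUGEPAGE, SK_MADV_NOHUGEPAGE };

struct sk_mm {
	unsigned long def_flags;	/* defaults applied to new mappings */
};

int sk_hugepage_madvise(const struct sk_mm *mm, unsigned long *vm_flags,
			enum sk_advice advice)
{
	switch (advice) {
	case SK_MADV_HUGEPAGE:
		/* The added check: the whole mm has opted out of thp. */
		if (mm->def_flags & SK_VM_NOHUGEPAGE)
			return -EINVAL;
		*vm_flags &= ~SK_VM_NOHUGEPAGE;
		*vm_flags |= SK_VM_HUGEPAGE;
		break;
	case SK_MADV_NOHUGEPAGE:
		*vm_flags &= ~SK_VM_HUGEPAGE;
		*vm_flags |= SK_VM_NOHUGEPAGE;
		break;
	}
	return 0;
}
```

    With the check in place, a per-mapping request cannot override the
    mm-wide default that was set for kvm processes.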

    Signed-off-by: Gerald Schaefer
    Cc: Andrea Arcangeli
    Cc: Andi Kleen
    Cc: Hugh Dickins
    Cc: Hillf Danton
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gerald Schaefer
     
  • On s390, a valid page table entry must not be changed while it is attached
    to any CPU. So instead of pmd_mknotpresent() and set_pmd_at(), an IDTE
    operation would be necessary there. This patch introduces the
    pmdp_invalidate() function, to allow architecture-specific
    implementations.

    Signed-off-by: Gerald Schaefer
    Cc: Andrea Arcangeli
    Cc: Andi Kleen
    Cc: Hugh Dickins
    Cc: Hillf Danton
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gerald Schaefer
     
  • The thp page table pre-allocation code currently assumes that pgtable_t is
    of type "struct page *". This may not be true for all architectures, so
    this patch removes that assumption by replacing the functions
    prepare_pmd_huge_pte() and get_pmd_huge_pte() with two new functions that
    can be defined architecture-specific.

    It also removes two VM_BUG_ON checks for page_count() and page_mapcount()
    operating on a pgtable_t. Apart from the VM_BUG_ON removal, there will be
    no functional change introduced by this patch.

    Signed-off-by: Gerald Schaefer
    Cc: Andrea Arcangeli
    Cc: Andi Kleen
    Cc: Hugh Dickins
    Cc: Hillf Danton
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gerald Schaefer
     
  • Cleanup patch in preparation for transparent hugepage support on s390.
    Adding new architectures to the TRANSPARENT_HUGEPAGE config option can
    make the "depends" line rather ugly, like "depends on (X86 || (S390 &&
    64BIT)) && MMU".

    This patch adds a HAVE_ARCH_TRANSPARENT_HUGEPAGE instead. x86 already has
    MMU "def_bool y", so the MMU check is superfluous there and
    HAVE_ARCH_TRANSPARENT_HUGEPAGE can be selected in arch/x86/Kconfig.

    Signed-off-by: Gerald Schaefer
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Cc: Andrea Arcangeli
    Cc: Andi Kleen
    Cc: Hugh Dickins
    Cc: Hillf Danton
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gerald Schaefer
     
  • Fix an anon_vma locking issue in the following situation:

    - vma has no anon_vma
    - next has an anon_vma
    - vma is being shrunk / next is being expanded, due to an mprotect call

    We need to take next's anon_vma lock to avoid races with rmap users (such
    as page migration) while next is being expanded.

    Signed-off-by: Michel Lespinasse
    Reviewed-by: Andrea Arcangeli
    Acked-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • Since it is called in start_khugepaged

    Signed-off-by: Xiao Guangrong
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xiao Guangrong
     
  • Use khugepaged_enabled to see whether thp is enabled

    Signed-off-by: Xiao Guangrong
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xiao Guangrong
     
  • Merge khugepaged_loop into khugepaged

    Signed-off-by: Xiao Guangrong
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xiao Guangrong
     
  • They are used to abstract the difference between NUMA enabled and NUMA
    disabled to make the code more readable

    Signed-off-by: Xiao Guangrong
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xiao Guangrong
     
  • If NUMA is enabled, we can release the page in the page pre-alloc
    operation, then the CONFIG_NUMA dependent code can be reduced

    Signed-off-by: Xiao Guangrong
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xiao Guangrong