Eric Lee / smarc-fsl-linux-kernel

09 Oct, 2012

40 commits

85d3a316c kmemleak: use rbtree instead of prio tree ... Browse Code »

kmemleak uses a tree where each node represents an allocated memory object
in order to quickly find out what object a given address is part of.
However, the objects don't overlap, so rbtrees are a better choice than
prio tree for this use. They are both faster and have lower memory
overhead.

Tested by booting a kernel with kmemleak enabled, loading the
kmemleak_test module, and looking for the expected messages.

Signed-off-by: Michel Lespinasse
Cc: Rik van Riel
Cc: Hillf Danton
Cc: Peter Zijlstra
Cc: Andrea Arcangeli
Cc: David Woodhouse
Acked-by: Catalin Marinas
Tested-by: Catalin Marinas
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Michel Lespinasse
2012-10-09 15:22:39 +0800
6b2dbba8b mm: replace vma prio_tree with an interval tree ... Browse Code »

Implement an interval tree as a replacement for the VMA prio_tree. The
algorithms are similar to lib/interval_tree.c; however that code can't be
directly reused as the interval endpoints are not explicitly stored in the
VMA. So instead, the common algorithm is moved into a template and the
details (node type, how to get interval endpoints from the node, etc) are
filled in using the C preprocessor.

Once the interval tree functions are available, using them as a
replacement to the VMA prio tree is a relatively simple, mechanical job.

Signed-off-by: Michel Lespinasse
Cc: Rik van Riel
Cc: Hillf Danton
Cc: Peter Zijlstra
Cc: Catalin Marinas
Cc: Andrea Arcangeli
Cc: David Woodhouse
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Michel Lespinasse
2012-10-09 15:22:39 +0800
fff3fd8a1 rbtree: add prio tree and interval tree tests ... Browse Code »

Patch 1 implements support for interval trees, on top of the augmented
rbtree API. It also adds synthetic tests to compare the performance of
interval trees vs prio trees. Short answers is that interval trees are
slightly faster (~25%) on insert/erase, and much faster (~2.4 - 3x)
on search. It is debatable how realistic the synthetic test is, and I have
not made such measurements yet, but my impression is that interval trees
would still come out faster.

Patch 2 uses a preprocessor template to make the interval tree generic,
and uses it as a replacement for the vma prio_tree.

Patch 3 takes the other prio_tree user, kmemleak, and converts it to use
a basic rbtree. We don't actually need the augmented rbtree support here
because the intervals are always non-overlapping.

Patch 4 removes the now-unused prio tree library.

Patch 5 proposes an additional optimization to rb_erase_augmented, now
providing it as an inline function so that the augmented callbacks can be
inlined in. This provides an additional 5-10% performance improvement
for the interval tree insert/erase benchmark. There is a maintainance cost
as it exposes augmented rbtree users to some of the rbtree library internals;
however I think this cost shouldn't be too high as I expect the augmented
rbtree will always have much less users than the base rbtree.

I should probably add a quick summary of why I think it makes sense to
replace prio trees with augmented rbtree based interval trees now. One of
the drivers is that we need augmented rbtrees for Rik's vma gap finding
code, and once you have them, it just makes sense to use them for interval
trees as well, as this is the simpler and more well known algorithm. prio
trees, in comparison, seem *too* clever: they impose an additional 'heap'
constraint on the tree, which they use to guarantee a faster worst-case
complexity of O(k+log N) for stabbing queries in a well-balanced prio
tree, vs O(k*log N) for interval trees (where k=number of matches,
N=number of intervals). Now this sounds great, but in practice prio trees
don't realize this theorical benefit. First, the additional constraint
makes them harder to update, so that the kernel implementation has to
simplify things by balancing them like a radix tree, which is not always
ideal. Second, the fact that there are both index and heap properties
makes both tree manipulation and search more complex, which results in a
higher multiplicative time constant. As it turns out, the simple interval
tree algorithm ends up running faster than the more clever prio tree.

This patch:

Add two test modules:

- prio_tree_test measures the performance of lib/prio_tree.c, both for
insertion/removal and for stabbing searches

- interval_tree_test measures the performance of a library of equivalent
functionality, built using the augmented rbtree support.

In order to support the second test module, lib/interval_tree.c is
introduced. It is kept separate from the interval_tree_test main file
for two reasons: first we don't want to provide an unfair advantage
over prio_tree_test by having everything in a single compilation unit,
and second there is the possibility that the interval tree functionality
could get some non-test users in kernel over time.

Signed-off-by: Michel Lespinasse
Cc: Rik van Riel
Cc: Hillf Danton
Cc: Peter Zijlstra
Cc: Catalin Marinas
Cc: Andrea Arcangeli
Cc: David Woodhouse
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Michel Lespinasse
2012-10-09 15:22:39 +0800
3908836aa rbtree: add RB_DECLARE_CALLBACKS() macro ... Browse Code »

As proposed by Peter Zijlstra, this makes it easier to define the augmented
rbtree callbacks.

Signed-off-by: Michel Lespinasse
Cc: Rik van Riel
Cc: Peter Zijlstra
Cc: Andrea Arcangeli
Cc: David Woodhouse
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Michel Lespinasse
2012-10-09 15:22:38 +0800
9d9e6f970 rbtree: remove prior augmented rbtree implementation ... Browse Code »

convert arch/x86/mm/pat_rbtree.c to the proposed augmented rbtree api
and remove the old augmented rbtree implementation.

Signed-off-by: Michel Lespinasse
Acked-by: Rik van Riel
Cc: Peter Zijlstra
Cc: Andrea Arcangeli
Cc: David Woodhouse
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Michel Lespinasse
2012-10-09 15:22:38 +0800
14b94af0b rbtree: faster augmented rbtree manipulation ... Browse Code »

Introduce new augmented rbtree APIs that allow minimal recalculation of
augmented node information.

A new callback is added to the rbtree insertion and erase rebalancing
functions, to be called on each tree rotations. Such rotations preserve
the subtree's root augmented value, but require recalculation of the one
child that was previously located at the subtree root.

In the insertion case, the handcoded search phase must be updated to
maintain the augmented information on insertion, and then the rbtree
coloring/rebalancing algorithms keep it up to date.

In the erase case, things are more complicated since it is library
code that manipulates the rbtree in order to remove internal nodes.
This requires a couple additional callbacks to copy a subtree's
augmented value when a new root is stitched in, and to recompute
augmented values down the ancestry path when a node is removed from
the tree.

In order to preserve maximum speed for the non-augmented case,
we provide two versions of each tree manipulation function.
rb_insert_augmented() is the augmented equivalent of rb_insert_color(),
and rb_erase_augmented() is the augmented equivalent of rb_erase().

Signed-off-by: Michel Lespinasse
Acked-by: Rik van Riel
Cc: Peter Zijlstra
Cc: Andrea Arcangeli
Cc: David Woodhouse
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Michel Lespinasse
2012-10-09 15:22:37 +0800
dadf93534 rbtree: augmented rbtree test ... Browse Code »

Small test to measure the performance of augmented rbtrees.

Signed-off-by: Michel Lespinasse
Acked-by: Rik van Riel
Cc: Peter Zijlstra
Cc: Andrea Arcangeli
Cc: David Woodhouse
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Michel Lespinasse
2012-10-09 15:22:37 +0800
4f035ad67 rbtree: low level optimizations in rb_erase() ... Browse Code »

Various minor optimizations in rb_erase():
- Avoid multiple loading of node->__rb_parent_color when computing parent
and color information (possibly not in close sequence, as there might
be further branches in the algorithm)
- In the 1-child subcase of case 1, copy the __rb_parent_color field from
the erased node to the child instead of recomputing it from the desired
parent and color
- When searching for the erased node's successor, differentiate between
cases 2 and 3 based on whether any left links were followed. This avoids
a condition later down.
- In case 3, keep a pointer to the erased node's right child so we don't
have to refetch it later to adjust its parent.
- In the no-childs subcase of cases 2 and 3, place the rebalance assigment
last so that the compiler can remove the following if(rebalance) test.

Also, added some comments to illustrate cases 2 and 3.

Signed-off-by: Michel Lespinasse
Acked-by: Rik van Riel
Cc: Peter Zijlstra
Cc: Andrea Arcangeli
Cc: David Woodhouse
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Michel Lespinasse
2012-10-09 15:22:37 +0800
46b6135a7 rbtree: handle 1-child recoloring in rb_erase() instead of rb_erase_color() ... Browse Code »

An interesting observation for rb_erase() is that when a node has
exactly one child, the node must be black and the child must be red.
An interesting consequence is that removing such a node can be done by
simply replacing it with its child and making the child black,
which we can do efficiently in rb_erase(). __rb_erase_color() then
only needs to handle the no-childs case and can be modified accordingly.

Signed-off-by: Michel Lespinasse
Acked-by: Rik van Riel
Cc: Peter Zijlstra
Cc: Andrea Arcangeli
Cc: David Woodhouse
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Michel Lespinasse
2012-10-09 15:22:37 +0800
60670b803 rbtree: place easiest case first in rb_erase() ... Browse Code »

In rb_erase, move the easy case (node to erase has no more than
1 child) first. I feel the code reads easier that way.

Signed-off-by: Michel Lespinasse
Reviewed-by: Rik van Riel
Cc: Peter Zijlstra
Cc: Andrea Arcangeli
Cc: David Woodhouse
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Michel Lespinasse
2012-10-09 15:22:36 +0800
7abc704ae rbtree: add __rb_change_child() helper function ... Browse Code »

Add __rb_change_child() as an inline helper function to replace code that
would otherwise be duplicated 4 times in the source.

No changes to binary size or speed.

Signed-off-by: Michel Lespinasse
Reviewed-by: Rik van Riel
Cc: Peter Zijlstra
Cc: Andrea Arcangeli
Cc: David Woodhouse
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Michel Lespinasse
2012-10-09 15:22:36 +0800
28d753092 rbtree test: fix sparse warning about 64-bit constant ... Browse Code »

Just a small fix to make sparse happy.

Signed-off-by: Michel Lespinasse
Reported-by: Fengguang Wu
Acked-by: Rik van Riel
Cc: Peter Zijlstra
Cc: Andrea Arcangeli
Cc: David Woodhouse
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Michel Lespinasse
2012-10-09 15:22:36 +0800
59633abf3 rbtree: optimize fetching of sibling node ... Browse Code »

When looking to fetch a node's sibling, we went through a sequence of:
- check if node is the parent's left child
- if it is, then fetch the parent's right child

This can be replaced with:
- fetch the parent's right child as an assumed sibling
- check that node is NOT the fetched child

This avoids fetching the parent's left child when node is actually
that child. Saves a bit on code size, though it doesn't seem to make
a large difference in speed.

Signed-off-by: Michel Lespinasse
Cc: Andrea Arcangeli
Cc: David Woodhouse
Acked-by: Rik van Riel
Cc: Peter Zijlstra
Cc: Daniel Santos
Cc: Jens Axboe
Cc: "Eric W. Biederman"
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Michel Lespinasse
2012-10-09 15:22:35 +0800
7ce6ff9e5 rbtree: coding style adjustments ... Browse Code »

Set comment and indentation style to be consistent with linux coding style
and the rest of the file, as suggested by Peter Zijlstra

Signed-off-by: Michel Lespinasse
Cc: Andrea Arcangeli
Acked-by: David Woodhouse
Cc: Rik van Riel
Cc: Peter Zijlstra
Cc: Daniel Santos
Cc: Jens Axboe
Cc: "Eric W. Biederman"
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Michel Lespinasse
2012-10-09 15:22:35 +0800
6280d2356 rbtree: low level optimizations in __rb_erase_color() ... Browse Code »

In __rb_erase_color(), we often already have pointers to the nodes being
rotated and/or know what their colors must be, so we can generate more
efficient code than the generic __rb_rotate_left() and __rb_rotate_right()
functions.

Also when the current node is red or when flipping the sibling's color,
the parent is already known so we can use the more efficient
rb_set_parent_color() function to set the desired color.

Signed-off-by: Michel Lespinasse
Cc: Andrea Arcangeli
Acked-by: David Woodhouse
Cc: Rik van Riel
Cc: Peter Zijlstra
Cc: Daniel Santos
Cc: Jens Axboe
Cc: "Eric W. Biederman"
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Michel Lespinasse
2012-10-09 15:22:35 +0800
e125d1471 rbtree: optimize case selection logic in __rb_erase_color() ... Browse Code »

In __rb_erase_color(), we have to select one of 3 cases depending on the
color on the 'other' node children. If both children are black, we flip a
few node colors and iterate. Otherwise, we do either one or two tree
rotations, depending on the color of the 'other' child opposite to 'node',
and then we are done.

The corresponding logic had duplicate checks for the color of the 'other'
child opposite to 'node'. It was checking it first to determine if both
children are black, and then to determine how many tree rotations are
required. Rearrange the logic to avoid that extra check.

Signed-off-by: Michel Lespinasse
Cc: Andrea Arcangeli
Acked-by: David Woodhouse
Cc: Rik van Riel
Cc: Peter Zijlstra
Cc: Daniel Santos
Cc: Jens Axboe
Cc: "Eric W. Biederman"
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Michel Lespinasse
2012-10-09 15:22:34 +0800
d6ff12739 rbtree: adjust node color in __rb_erase_color() only when necessary ... Browse Code »

In __rb_erase_color(), we were always setting a node to black after
exiting the main loop. And in one case, after fixing up the tree to
satisfy all rbtree invariants, we were setting the current node to root
just to guarantee a loop exit, at which point the root would be set to
black. However this is not necessary, as the root of an rbtree is already
known to be black. The only case where the color flip is required is when
we exit the loop due to the current node being red, and it's easiest to
just do the flip at that point instead of doing it after the loop.

[adrian.hunter@intel.com: perf tools: fix build for another rbtree.c change]
Signed-off-by: Michel Lespinasse
Cc: Andrea Arcangeli
Acked-by: David Woodhouse
Cc: Rik van Riel
Cc: Peter Zijlstra
Cc: Daniel Santos
Cc: Jens Axboe
Cc: "Eric W. Biederman"
Signed-off-by: Adrian Hunter
Cc: Alexander Shishkin
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Michel Lespinasse
2012-10-09 15:22:34 +0800
5bc9188aa rbtree: low level optimizations in rb_insert_color() ... Browse Code »

- Use the newly introduced rb_set_parent_color() function to flip the color
of nodes whose parent is already known.
- Optimize rb_parent() when the node is known to be red - there is no need
to mask out the color in that case.
- Flipping gparent's color to red requires us to fetch its rb_parent_color
field, so we can reuse it as the parent value for the next loop iteration.
- Do not use __rb_rotate_left() and __rb_rotate_right() to handle tree
rotations: we already have pointers to all relevant nodes, and know their
colors (either because we want to adjust it, or because we've tested it,
or we can deduce it as black due to the node proximity to a known red node).
So we can generate more efficient code by making use of the node pointers
we already have, and setting both the parent and color attributes for
nodes all at once. Also in Case 2, some node attributes don't have to
be set because we know another tree rotation (Case 3) will always follow
and override them.

Signed-off-by: Michel Lespinasse
Cc: Andrea Arcangeli
Acked-by: David Woodhouse
Cc: Rik van Riel
Cc: Peter Zijlstra
Cc: Daniel Santos
Cc: Jens Axboe
Cc: "Eric W. Biederman"
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Michel Lespinasse
2012-10-09 15:22:34 +0800
6d58452dc rbtree: adjust root color in rb_insert_color() only when necessary ... Browse Code »

The root node of an rbtree must always be black. However,
rb_insert_color() only needs to maintain this invariant when it has been
broken - that is, when it exits the loop due to the current (red) node
being the root. In all other cases (exiting after tree rotations, or
exiting due to an existing black parent) the invariant is already
satisfied, so there is no need to adjust the root node color.

Signed-off-by: Michel Lespinasse
Cc: Andrea Arcangeli
Acked-by: David Woodhouse
Cc: Rik van Riel
Cc: Peter Zijlstra
Cc: Daniel Santos
Cc: Jens Axboe
Cc: "Eric W. Biederman"
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Michel Lespinasse
2012-10-09 15:22:33 +0800
1f0528653 rbtree: break out of rb_insert_color loop after tree rotation ... Browse Code »

It is a well known property of rbtrees that insertion never requires more
than two tree rotations. In our implementation, after one loop iteration
identified one or two necessary tree rotations, we would iterate and look
for more. However at that point the node's parent would always be black,
which would cause us to exit the loop.

We can make the code flow more obvious by just adding a break statement
after the tree rotations, where we know we are done. Additionally, in the
cases where two tree rotations are necessary, we don't have to update the
'node' pointer as it wouldn't be used until the next loop iteration, which
we now avoid due to this break statement.

Signed-off-by: Michel Lespinasse
Cc: Andrea Arcangeli
Acked-by: David Woodhouse
Cc: Rik van Riel
Cc: Peter Zijlstra
Cc: Daniel Santos
Cc: Jens Axboe
Cc: "Eric W. Biederman"
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Michel Lespinasse
2012-10-09 15:22:33 +0800
910a742d4 rbtree: performance and correctness test ... Browse Code »

This small module helps measure the performance of rbtree insert and
erase.

Additionally, we run a few correctness tests to check that the rbtrees
have all desired properties:

- contains the right number of nodes in the order desired,
- never two consecutive red nodes on any path,
- all paths to leaf nodes have the same number of black nodes,
- root node is black

[akpm@linux-foundation.org: fix printk warning: sparc64 cycles_t is unsigned long]
Signed-off-by: Michel Lespinasse
Cc: Andrea Arcangeli
Acked-by: David Woodhouse
Cc: Rik van Riel
Cc: Peter Zijlstra
Cc: Daniel Santos
Cc: Jens Axboe
Cc: "Eric W. Biederman"
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Michel Lespinasse
2012-10-09 15:22:33 +0800
bf7ad8eea rbtree: move some implementation details from rbtree.h to rbtree.c ... Browse Code »

rbtree users must use the documented APIs to manipulate the tree
structure. Low-level helpers to manipulate node colors and parenthood are
not part of that API, so move them to lib/rbtree.c

[dwmw2@infradead.org: fix jffs2 build issue due to renamed __rb_parent_color field]
Signed-off-by: Michel Lespinasse
Cc: Andrea Arcangeli
Acked-by: David Woodhouse
Cc: Rik van Riel
Cc: Peter Zijlstra
Cc: Daniel Santos
Cc: Jens Axboe
Cc: "Eric W. Biederman"
Signed-off-by: David Woodhouse
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Michel Lespinasse
2012-10-09 15:22:32 +0800
ea5272f5c rbtree: fix incorrect rbtree node insertion in fs/proc/proc_sysctl.c ... Browse Code »

The recently added code to use rbtrees in sysctl did not follow the proper
rbtree interface on insertion - it was calling rb_link_node() which
inserts a new node into the binary tree, but missed the call to
rb_insert_color() which properly balances the rbtree and establishes all
expected rbtree invariants.

I found out about this only because faulty commit also used
rb_init_node(), which I am removing within this patchset. But I think
it's an easy mistake to make, and it makes me wonder if we should change
the rbtree API so that insertions would be done with a single rb_insert()
call (even if its implementation could still inline the rb_link_node()
part and call a private __rb_insert_color function to do the rebalancing).

Signed-off-by: Michel Lespinasse
Cc: Andrea Arcangeli
Acked-by: David Woodhouse
Cc: Rik van Riel
Cc: Peter Zijlstra
Cc: Daniel Santos
Cc: Jens Axboe
Cc: "Eric W. Biederman"
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Michel Lespinasse
2012-10-09 15:22:32 +0800
4c199a93a rbtree: empty nodes have no color ... Browse Code »

Empty nodes have no color. We can make use of this property to simplify
the code emitted by the RB_EMPTY_NODE and RB_CLEAR_NODE macros. Also,
we can get rid of the rb_init_node function which had been introduced by
commit 88d19cf37952 ("timers: Add rb_init_node() to allow for stack
allocated rb nodes") to avoid some issue with the empty node's color not
being initialized.

I'm not sure what the RB_EMPTY_NODE checks in rb_prev() / rb_next() are
doing there, though. axboe introduced them in commit 10fd48f2376d
("rbtree: fixed reversed RB_EMPTY_NODE and rb_next/prev"). The way I
see it, the 'empty node' abstraction is only used by rbtree users to
flag nodes that they haven't inserted in any rbtree, so asking the
predecessor or successor of such nodes doesn't make any sense.

One final rb_init_node() caller was recently added in sysctl code to
implement faster sysctl name lookups. This code doesn't make use of
RB_EMPTY_NODE at all, and from what I could see it only called
rb_init_node() under the mistaken assumption that such initialization was
required before node insertion.

[sfr@canb.auug.org.au: fix net/ceph/osd_client.c build]
Signed-off-by: Michel Lespinasse
Cc: Andrea Arcangeli
Acked-by: David Woodhouse
Cc: Rik van Riel
Cc: Peter Zijlstra
Cc: Daniel Santos
Cc: Jens Axboe
Cc: "Eric W. Biederman"
Cc: John Stultz
Signed-off-by: Stephen Rothwell
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Michel Lespinasse
2012-10-09 15:22:32 +0800
1457d2877 rbtree: reference Documentation/rbtree.txt for usage instructions ... Browse Code »

I recently started looking at the rbtree code (with an eye towards
improving the augmented rbtree support, but I haven't gotten there yet).
I noticed a lot of possible speed improvements, which I am now proposing
in this patch set.

Patches 1-4 are preparatory: remove internal functions from rbtree.h so
that users won't be tempted to use them instead of the documented APIs,
clean up some incorrect usages I've noticed (in particular, with the
recently added fs/proc/proc_sysctl.c rbtree usage), reference the
documentation so that people have one less excuse to miss it, etc.

Patch 5 is a small module I wrote to check the rbtree performance. It
creates 100 nodes with random keys and repeatedly inserts and erases them
from an rbtree. Additionally, it has code to check for rbtree invariants
after each insert or erase operation.

Patches 6-12 is where the rbtree optimizations are done, and they touch
only that one file, lib/rbtree.c . I am getting good results out of these
- in my small benchmark doing rbtree insertion (including search) and
erase, I'm seeing a 30% runtime reduction on Sandybridge E5, which is more
than I initially thought would be possible. (the results aren't as
impressive on my two other test hosts though, AMD barcelona and Intel
Westmere, where I am seeing 14% runtime reduction only). The code size -
both source (ommiting comments) and compiled - is also shorter after these
changes. However, I do admit that the updated code is more arduous to
read - one big reason for that is the removal of the tree rotation
helpers, which added some overhead but also made it easier to reason about
things locally. Overall, I believe this is an acceptable compromise,
given that this code doesn't get modified very often, and that I have good
tests for it.

Upon Peter's suggestion, I added comments showing the rtree configuration
before every rotation. I think they help; however it's still best to have
a copy of the cormen/leiserson/rivest book when digging into this code.

This patch: reference Documentation/rbtree.txt for usage instructions

include/linux/rbtree.h included some basic usage instructions, while
Documentation/rbtree.txt had some more complete and easier to follow
instructions. Replacing the former with a reference to the latter.

Signed-off-by: Michel Lespinasse
Cc: Andrea Arcangeli
Acked-by: David Woodhouse
Cc: Rik van Riel
Cc: Peter Zijlstra
Cc: Daniel Santos
Cc: Jens Axboe
Cc: "Eric W. Biederman"
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Michel Lespinasse
2012-10-09 15:22:31 +0800
1638113d9 ipc/mqueue: remove unnecessary rb_init_node() calls ... Browse Code »

Commit d6629859b36d ("ipc/mqueue: improve performance of send/recv") and
ce2d52cc ("ipc/mqueue: add rbtree node caching support") introduced an
rbtree of message priorities, and usage of rb_init_node() to initialize
the corresponding nodes. As it turns out, rb_init_node() is unnecessary
here, as the nodes are fully initialized on insertion by rb_link_node()
and the code doesn't access nodes that aren't inserted on the rbtree.

Removing the rb_init_node() calls as I removed that function during
rbtree API cleanups (the only other use of it was in a place that
similarly didn't require it).

Signed-off-by: Michel Lespinasse
Acked-by: Doug Ledford
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Michel Lespinasse
2012-10-09 15:22:31 +0800
1ae1c1d09 thp, s390: architecture backend for thp on s390 ... Browse Code »

This implements the architecture backend for transparent hugepages
on s390.

Signed-off-by: Gerald Schaefer
Cc: Andrea Arcangeli
Cc: Andi Kleen
Cc: Hugh Dickins
Cc: Hillf Danton
Cc: Martin Schwidefsky
Cc: Heiko Carstens
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Gerald Schaefer
2012-10-09 15:22:31 +0800
274023da1 thp, s390: disable thp for kvm host on s390 ... Browse Code »

This patch is part of the architecture backend for thp on s390. It
disables thp for kvm hosts, because there is no kvm host hugepage support
so far. Existing thp mappings are split by follow_page() with FOLL_SPLIT,
and future thp mappings are prevented by setting VM_NOHUGEPAGE in
mm->def_flags.

Signed-off-by: Gerald Schaefer
Cc: Andrea Arcangeli
Cc: Andi Kleen
Cc: Hugh Dickins
Cc: Hillf Danton
Cc: Martin Schwidefsky
Cc: Heiko Carstens
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Gerald Schaefer
2012-10-09 15:22:30 +0800
9501d09fa thp, s390: thp pagetable pre-allocation for s390 ... Browse Code »

This patch is part of the architecture backend for thp on s390. It
provides the pagetable pre-allocation functions
pgtable_trans_huge_deposit() and pgtable_trans_huge_withdraw(). Unlike
other archs, s390 has no struct page * as pgtable_t, but rather a pointer
to the page table. So instead of saving the pagetable pre- allocation
list info inside the struct page, it is being saved within the pagetable
itself.

Signed-off-by: Gerald Schaefer
Cc: Andrea Arcangeli
Cc: Andi Kleen
Cc: Hugh Dickins
Cc: Hillf Danton
Cc: Martin Schwidefsky
Cc: Heiko Carstens
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Gerald Schaefer
2012-10-09 15:22:30 +0800
75077afbe thp, s390: thp splitting backend for s390 ... Browse Code »

This patch is part of the architecture backend for thp on s390. It
provides the functions related to thp splitting, including serialization
against gup. Unlike other archs, pmdp_splitting_flush() cannot use a tlb
flushing operation to serialize against gup on s390, because that wouldn't
be stopped by the disabled IRQs. So instead, smp_call_function() is
called with an empty function, which will have the expected effect.

Signed-off-by: Gerald Schaefer
Cc: Andrea Arcangeli
Cc: Andi Kleen
Cc: Hugh Dickins
Cc: Hillf Danton
Cc: Martin Schwidefsky
Cc: Heiko Carstens
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Gerald Schaefer
2012-10-09 15:22:30 +0800
8e72033f2 thp: make MADV_HUGEPAGE check for mm->def_flags ... Browse Code »

This adds a check to hugepage_madvise(), to refuse MADV_HUGEPAGE if
VM_NOHUGEPAGE is set in mm->def_flags. On s390, the VM_NOHUGEPAGE flag
will be set in mm->def_flags for kvm processes, to prevent any future thp
mappings. In order to also prevent MADV_HUGEPAGE on such an mm,
hugepage_madvise() should check mm->def_flags.

Signed-off-by: Gerald Schaefer
Cc: Andrea Arcangeli
Cc: Andi Kleen
Cc: Hugh Dickins
Cc: Hillf Danton
Cc: Martin Schwidefsky
Cc: Heiko Carstens
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Gerald Schaefer
2012-10-09 15:22:30 +0800
46dcde735 thp: introduce pmdp_invalidate() ... Browse Code »

On s390, a valid page table entry must not be changed while it is attached
to any CPU. So instead of pmd_mknotpresent() and set_pmd_at(), an IDTE
operation would be necessary there. This patch introduces the
pmdp_invalidate() function, to allow architecture-specific
implementations.

Signed-off-by: Gerald Schaefer
Cc: Andrea Arcangeli
Cc: Andi Kleen
Cc: Hugh Dickins
Cc: Hillf Danton
Cc: Martin Schwidefsky
Cc: Heiko Carstens
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Gerald Schaefer
2012-10-09 15:22:29 +0800
e3ebcf643 thp: remove assumptions on pgtable_t type ... Browse Code »

The thp page table pre-allocation code currently assumes that pgtable_t is
of type "struct page *". This may not be true for all architectures, so
this patch removes that assumption by replacing the functions
prepare_pmd_huge_pte() and get_pmd_huge_pte() with two new functions that
can be defined architecture-specific.

It also removes two VM_BUG_ON checks for page_count() and page_mapcount()
operating on a pgtable_t. Apart from the VM_BUG_ON removal, there will be
no functional change introduced by this patch.

Signed-off-by: Gerald Schaefer
Cc: Andrea Arcangeli
Cc: Andi Kleen
Cc: Hugh Dickins
Cc: Hillf Danton
Cc: Martin Schwidefsky
Cc: Heiko Carstens
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Gerald Schaefer
2012-10-09 15:22:29 +0800
15626062f thp, x86: introduce HAVE_ARCH_TRANSPARENT_HUGEPAGE ... Browse Code »

Cleanup patch in preparation for transparent hugepage support on s390.
Adding new architectures to the TRANSPARENT_HUGEPAGE config option can
make the "depends" line rather ugly, like "depends on (X86 || (S390 &&
64BIT)) && MMU".

This patch adds a HAVE_ARCH_TRANSPARENT_HUGEPAGE instead. x86 already has
MMU "def_bool y", so the MMU check is superfluous there and
HAVE_ARCH_TRANSPARENT_HUGEPAGE can be selected in arch/x86/Kconfig.

Signed-off-by: Gerald Schaefer
Cc: Ingo Molnar
Cc: Thomas Gleixner
Cc: "H. Peter Anvin"
Cc: Andrea Arcangeli
Cc: Andi Kleen
Cc: Hugh Dickins
Cc: Hillf Danton
Cc: Martin Schwidefsky
Cc: Heiko Carstens
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Gerald Schaefer
2012-10-09 15:22:29 +0800
ca42b26ab mm: fix potential anon_vma locking issue in mprotect() ... Browse Code »

Fix an anon_vma locking issue in the following situation:

- vma has no anon_vma
- next has an anon_vma
- vma is being shrunk / next is being expanded, due to an mprotect call

We need to take next's anon_vma lock to avoid races with rmap users (such
as page migration) while next is being expanded.

Signed-off-by: Michel Lespinasse
Reviewed-by: Andrea Arcangeli
Acked-by: Hugh Dickins
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Michel Lespinasse
2012-10-09 15:22:28 +0800
227e40474 thp: remove unnecessary set_recommended_min_free_kbytes ... Browse Code »

Since it is called in start_khugepaged

Signed-off-by: Xiao Guangrong
Cc: Andrea Arcangeli
Cc: Hugh Dickins
Cc: David Rientjes
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Xiao Guangrong
2012-10-09 15:22:28 +0800
17c230afa thp: use khugepaged_enabled to remove duplicate code ... Browse Code »

Use khugepaged_enabled to see whether thp is enabled

Signed-off-by: Xiao Guangrong
Cc: Andrea Arcangeli
Cc: Hugh Dickins
Cc: David Rientjes
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Xiao Guangrong
2012-10-09 15:22:28 +0800
b7231789b thp: remove khugepaged_loop ... Browse Code »

Merge khugepaged_loop into khugepaged

Signed-off-by: Xiao Guangrong
Cc: Andrea Arcangeli
Cc: Hugh Dickins
Cc: David Rientjes
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Xiao Guangrong
2012-10-09 15:22:27 +0800
26234f36e thp: introduce khugepaged_prealloc_page and khugepaged_alloc_page ... Browse Code »

They are used to abstract the difference between NUMA enabled and NUMA
disabled to make the code more readable

Signed-off-by: Xiao Guangrong
Cc: Andrea Arcangeli
Cc: Hugh Dickins
Cc: David Rientjes
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Xiao Guangrong
2012-10-09 15:22:27 +0800
420256ef0 thp: release page in page pre-alloc path ... Browse Code »

If NUMA is enabled, we can release the page in the page pre-alloc
operation, then the CONFIG_NUMA dependent code can be reduced

Signed-off-by: Xiao Guangrong
Cc: Andrea Arcangeli
Cc: Hugh Dickins
Cc: David Rientjes
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Xiao Guangrong
2012-10-09 15:22:27 +0800