09 Oct, 2012

28 commits

  • Add a CONFIG_DEBUG_VM_RB build option for the previously existing
    DEBUG_MM_RB code. Now that Andi Kleen modified it to avoid using
    recursive algorithms, we can expose it a bit more.

    Also extend this code to validate_mm() after stack expansion, and to check
    that the vma's start and last pgoffs have not changed since the nodes were
    inserted on the anon vma interval tree (as it is important that the nodes
    be reindexed after each such update).

    Signed-off-by: Michel Lespinasse
    Cc: Andrea Arcangeli
    Cc: Rik van Riel
    Cc: Peter Zijlstra
    Cc: Daniel Santos
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • Update the generic interval tree code that was introduced in "mm: replace
    vma prio_tree with an interval tree".

    Changes:

    - fixed 'endpoing' typo noticed by Andrew Morton

    - replaced include/linux/interval_tree_tmpl.h, which was used as a
    template (including it automatically defined the interval tree
    functions) with include/linux/interval_tree_generic.h, which only
    defines a preprocessor macro INTERVAL_TREE_DEFINE(), which itself
    defines the interval tree functions when invoked. That is a very long
    macro, which is unfortunate, but it does make the usage sites
    (lib/interval_tree.c and mm/interval_tree.c) a bit nicer than before.

    - make use of RB_DECLARE_CALLBACKS() in the INTERVAL_TREE_DEFINE() macro,
    instead of duplicating that code in the interval tree template.

    - replaced vma_interval_tree_add(), which was actually handling the
    nonlinear and interval tree cases, with vma_interval_tree_insert_after()
    which handles only the interval tree case and has an API that is more
    consistent with the other interval tree handling functions.
    The nonlinear case is now handled explicitly in kernel/fork.c dup_mmap().
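    As a rough illustration of the new scheme, a client could instantiate an
    interval tree like this (a sketch of INTERVAL_TREE_DEFINE() usage; the
    my_* and NODE_* names are made up for the example):

        #include <linux/interval_tree_generic.h>

        struct my_node {
                struct rb_node rb;
                unsigned long start;            /* first index covered */
                unsigned long last;             /* last index, inclusive */
                unsigned long __subtree_last;   /* maintained by the macro */
        };

        #define NODE_START(n) ((n)->start)
        #define NODE_LAST(n)  ((n)->last)

        /* Emits my_tree_insert(), my_tree_remove(), my_tree_iter_first()
         * and my_tree_iter_next() as static functions. */
        INTERVAL_TREE_DEFINE(struct my_node, rb, unsigned long,
                             __subtree_last, NODE_START, NODE_LAST,
                             static, my_tree)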

    Signed-off-by: Michel Lespinasse
    Cc: Andrea Arcangeli
    Cc: Rik van Riel
    Cc: Peter Zijlstra
    Cc: Daniel Santos
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • Provide rb_insert_augmented() and rb_erase_augmented() through a new
    rbtree_augmented.h include file. rb_erase_augmented() is defined there as
    an __always_inline function, in order to allow inlining of augmented
    rbtree callbacks into it. Since this generates a relatively large
    function, each augmented rbtree user should make sure to have a single
    call site.
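    A minimal sketch of what such a single call site might look like (the
    my_* names are illustrative, and the callbacks are assumed to be defined
    elsewhere, e.g. via RB_DECLARE_CALLBACKS()):

        #include <linux/rbtree_augmented.h>

        struct my_node {
                struct rb_node rb;
        };

        extern const struct rb_augment_callbacks my_cb;

        /* Wrap the __always_inline rb_erase_augmented() exactly once so
         * that the large inlined body appears only here. */
        static void my_tree_erase(struct my_node *node, struct rb_root *root)
        {
                rb_erase_augmented(&node->rb, root, &my_cb);
        }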

    Signed-off-by: Michel Lespinasse
    Cc: Rik van Riel
    Cc: Hillf Danton
    Cc: Peter Zijlstra
    Cc: Catalin Marinas
    Cc: Andrea Arcangeli
    Cc: David Woodhouse
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • After both prio_tree users have been converted to use red-black trees,
    there is no need to keep around the prio tree library anymore.

    Signed-off-by: Michel Lespinasse
    Cc: Rik van Riel
    Cc: Hillf Danton
    Cc: Peter Zijlstra
    Cc: Catalin Marinas
    Cc: Andrea Arcangeli
    Cc: David Woodhouse
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • Implement an interval tree as a replacement for the VMA prio_tree. The
    algorithms are similar to lib/interval_tree.c; however that code can't be
    directly reused as the interval endpoints are not explicitly stored in the
    VMA. So instead, the common algorithm is moved into a template and the
    details (node type, how to get interval endpoints from the node, etc) are
    filled in using the C preprocessor.

    Once the interval tree functions are available, using them as a
    replacement to the VMA prio tree is a relatively simple, mechanical job.
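    For example, a stabbing query over a file's i_mmap tree now reads as a
    plain iteration (a sketch in the style of the converted rmap code; the
    function name is illustrative):

        #include <linux/mm.h>

        static void visit_mapping_range(struct address_space *mapping,
                                        pgoff_t first, pgoff_t last)
        {
                struct vm_area_struct *vma;

                /* visits every vma overlapping [first, last] */
                vma_interval_tree_foreach(vma, &mapping->i_mmap, first, last) {
                        /* ... operate on vma ... */
                }
        }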

    Signed-off-by: Michel Lespinasse
    Cc: Rik van Riel
    Cc: Hillf Danton
    Cc: Peter Zijlstra
    Cc: Catalin Marinas
    Cc: Andrea Arcangeli
    Cc: David Woodhouse
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • Patch 1 implements support for interval trees, on top of the augmented
    rbtree API. It also adds synthetic tests to compare the performance of
    interval trees vs prio trees. The short answer is that interval trees
    are slightly faster (~25%) on insert/erase, and much faster (~2.4-3x)
    on search. It is debatable how realistic the synthetic test is, and I
    have not measured real workloads yet, but my impression is that interval
    trees
    would still come out faster.

    Patch 2 uses a preprocessor template to make the interval tree generic,
    and uses it as a replacement for the vma prio_tree.

    Patch 3 takes the other prio_tree user, kmemleak, and converts it to use
    a basic rbtree. We don't actually need the augmented rbtree support here
    because the intervals are always non-overlapping.

    Patch 4 removes the now-unused prio tree library.

    Patch 5 proposes an additional optimization to rb_erase_augmented, now
    providing it as an inline function so that the augmented callbacks can be
    inlined into it. This provides an additional 5-10% performance improvement
    for the interval tree insert/erase benchmark. There is a maintenance cost,
    as it exposes augmented rbtree users to some of the rbtree library
    internals; however I think this cost shouldn't be too high, as I expect
    the augmented rbtree will always have far fewer users than the base
    rbtree.

    I should probably add a quick summary of why I think it makes sense to
    replace prio trees with augmented rbtree based interval trees now. One of
    the drivers is that we need augmented rbtrees for Rik's vma gap finding
    code, and once you have them, it just makes sense to use them for interval
    trees as well, since this is the simpler and better-known algorithm. prio
    trees, in comparison, seem *too* clever: they impose an additional 'heap'
    constraint on the tree, which they use to guarantee a faster worst-case
    complexity of O(k+log N) for stabbing queries in a well-balanced prio
    tree, vs O(k*log N) for interval trees (where k=number of matches,
    N=number of intervals). This sounds great, but in practice prio trees
    don't realize this theoretical benefit. First, the additional constraint
    makes them harder to update, so that the kernel implementation has to
    simplify things by balancing them like a radix tree, which is not always
    ideal. Second, the fact that there are both index and heap properties
    makes both tree manipulation and search more complex, which results in a
    higher multiplicative time constant. As it turns out, the simple interval
    tree algorithm ends up running faster than the more clever prio tree.

    This patch:

    Add two test modules:

    - prio_tree_test measures the performance of lib/prio_tree.c, both for
    insertion/removal and for stabbing searches

    - interval_tree_test measures the performance of a library of equivalent
    functionality, built using the augmented rbtree support.

    In order to support the second test module, lib/interval_tree.c is
    introduced. It is kept separate from the interval_tree_test main file
    for two reasons: first we don't want to provide an unfair advantage
    over prio_tree_test by having everything in a single compilation unit,
    and second there is the possibility that the interval tree functionality
    could get some non-test users in the kernel over time.

    Signed-off-by: Michel Lespinasse
    Cc: Rik van Riel
    Cc: Hillf Danton
    Cc: Peter Zijlstra
    Cc: Catalin Marinas
    Cc: Andrea Arcangeli
    Cc: David Woodhouse
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • As proposed by Peter Zijlstra, this makes it easier to define the augmented
    rbtree callbacks.
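    A hedged sketch of an invocation, for a hypothetical tree that keeps the
    maximum of a per-node value over each subtree (all my_* names are made
    up):

        #include <linux/rbtree_augmented.h>

        struct my_node {
                struct rb_node rb;
                unsigned long val;
                unsigned long subtree_max;      /* max of val over subtree */
        };

        static unsigned long my_compute_max(struct my_node *node)
        {
                unsigned long max = node->val, tmp;

                if (node->rb.rb_left) {
                        tmp = rb_entry(node->rb.rb_left,
                                       struct my_node, rb)->subtree_max;
                        if (tmp > max)
                                max = tmp;
                }
                if (node->rb.rb_right) {
                        tmp = rb_entry(node->rb.rb_right,
                                       struct my_node, rb)->subtree_max;
                        if (tmp > max)
                                max = tmp;
                }
                return max;
        }

        /* Expands to my_cb_propagate/_copy/_rotate and the my_cb
         * rb_augment_callbacks structure. */
        RB_DECLARE_CALLBACKS(static, my_cb, struct my_node, rb,
                             unsigned long, subtree_max, my_compute_max)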

    Signed-off-by: Michel Lespinasse
    Cc: Rik van Riel
    Cc: Peter Zijlstra
    Cc: Andrea Arcangeli
    Cc: David Woodhouse
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • convert arch/x86/mm/pat_rbtree.c to the proposed augmented rbtree api
    and remove the old augmented rbtree implementation.

    Signed-off-by: Michel Lespinasse
    Acked-by: Rik van Riel
    Cc: Peter Zijlstra
    Cc: Andrea Arcangeli
    Cc: David Woodhouse
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • Introduce new augmented rbtree APIs that allow minimal recalculation of
    augmented node information.

    A new callback is added to the rbtree insertion and erase rebalancing
    functions, to be called on each tree rotation. Such rotations preserve
    the subtree's root augmented value, but require recalculating the
    augmented value of the one child that was previously located at the
    subtree root.

    In the insertion case, the handcoded search phase must be updated to
    maintain the augmented information on insertion, and then the rbtree
    coloring/rebalancing algorithms keep it up to date.

    In the erase case, things are more complicated since it is library
    code that manipulates the rbtree in order to remove internal nodes.
    This requires a couple of additional callbacks to copy a subtree's
    augmented value when a new root is stitched in, and to recompute
    augmented values down the ancestry path when a node is removed from
    the tree.

    In order to preserve maximum speed for the non-augmented case,
    we provide two versions of each tree manipulation function.
    rb_insert_augmented() is the augmented equivalent of rb_insert_color(),
    and rb_erase_augmented() is the augmented equivalent of rb_erase().
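    A sketch of what an augmented user's handcoded insertion might look like,
    reusing the hypothetical my_node/my_cb definitions sketched in the
    RB_DECLARE_CALLBACKS() entry above:

        static void my_insert(struct rb_root *root, struct my_node *new)
        {
                struct rb_node **link = &root->rb_node, *parent = NULL;
                unsigned long val = new->val;

                while (*link) {
                        struct my_node *p;

                        parent = *link;
                        p = rb_entry(parent, struct my_node, rb);
                        if (p->subtree_max < val)
                                p->subtree_max = val;   /* top-down update */
                        if (val < p->val)
                                link = &parent->rb_left;
                        else
                                link = &parent->rb_right;
                }

                new->subtree_max = val;
                rb_link_node(&new->rb, parent, link);
                /* rebalancing keeps the augmented data up to date */
                rb_insert_augmented(&new->rb, root, &my_cb);
        }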

    Signed-off-by: Michel Lespinasse
    Acked-by: Rik van Riel
    Cc: Peter Zijlstra
    Cc: Andrea Arcangeli
    Cc: David Woodhouse
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • Small test to measure the performance of augmented rbtrees.

    Signed-off-by: Michel Lespinasse
    Acked-by: Rik van Riel
    Cc: Peter Zijlstra
    Cc: Andrea Arcangeli
    Cc: David Woodhouse
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • Various minor optimizations in rb_erase():
    - Avoid multiple loading of node->__rb_parent_color when computing parent
    and color information (possibly not in close sequence, as there might
    be further branches in the algorithm)
    - In the 1-child subcase of case 1, copy the __rb_parent_color field from
    the erased node to the child instead of recomputing it from the desired
    parent and color
    - When searching for the erased node's successor, differentiate between
    cases 2 and 3 based on whether any left links were followed. This avoids
    a condition later down.
    - In case 3, keep a pointer to the erased node's right child so we don't
    have to refetch it later to adjust its parent.
    - In the no-child subcase of cases 2 and 3, place the rebalance assignment
    last so that the compiler can remove the subsequent if (rebalance) test.

    Also, added some comments to illustrate cases 2 and 3.

    Signed-off-by: Michel Lespinasse
    Acked-by: Rik van Riel
    Cc: Peter Zijlstra
    Cc: Andrea Arcangeli
    Cc: David Woodhouse
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • An interesting observation for rb_erase() is that when a node has
    exactly one child, the node must be black and the child must be red.
    An interesting consequence is that removing such a node can be done by
    simply replacing it with its child and making the child black,
    which we can do efficiently in rb_erase(). __rb_erase_color() then
    only needs to handle the no-child case and can be modified accordingly.

    Signed-off-by: Michel Lespinasse
    Acked-by: Rik van Riel
    Cc: Peter Zijlstra
    Cc: Andrea Arcangeli
    Cc: David Woodhouse
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • In rb_erase(), move the easy case (the node to erase has no more than
    one child) first. I feel the code reads more easily that way.

    Signed-off-by: Michel Lespinasse
    Reviewed-by: Rik van Riel
    Cc: Peter Zijlstra
    Cc: Andrea Arcangeli
    Cc: David Woodhouse
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • Add __rb_change_child() as an inline helper function to replace code that
    would otherwise be duplicated 4 times in the source.

    No changes to binary size or speed.

    Signed-off-by: Michel Lespinasse
    Reviewed-by: Rik van Riel
    Cc: Peter Zijlstra
    Cc: Andrea Arcangeli
    Cc: David Woodhouse
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • Just a small fix to make sparse happy.

    Signed-off-by: Michel Lespinasse
    Reported-by: Fengguang Wu
    Acked-by: Rik van Riel
    Cc: Peter Zijlstra
    Cc: Andrea Arcangeli
    Cc: David Woodhouse
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • When looking to fetch a node's sibling, we went through a sequence of:
    - check if node is the parent's left child
    - if it is, then fetch the parent's right child

    This can be replaced with:
    - fetch the parent's right child as an assumed sibling
    - check that node is NOT the fetched child

    This avoids fetching the parent's left child when node is actually
    that child. Saves a bit on code size, though it doesn't seem to make
    a large difference in speed.
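    In code, the change amounts to something like this sketch (the helper
    name is made up; the real code does this inline):

        #include <linux/rbtree.h>

        static struct rb_node *fetch_sibling(struct rb_node *parent,
                                             struct rb_node *node)
        {
                struct rb_node *sibling = parent->rb_right; /* assumed */

                if (node == sibling)    /* node was the right child */
                        sibling = parent->rb_left;
                return sibling;
        }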

    Signed-off-by: Michel Lespinasse
    Cc: Andrea Arcangeli
    Cc: David Woodhouse
    Acked-by: Rik van Riel
    Cc: Peter Zijlstra
    Cc: Daniel Santos
    Cc: Jens Axboe
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • Set the comment and indentation style to be consistent with the Linux
    coding style and the rest of the file, as suggested by Peter Zijlstra.

    Signed-off-by: Michel Lespinasse
    Cc: Andrea Arcangeli
    Acked-by: David Woodhouse
    Cc: Rik van Riel
    Cc: Peter Zijlstra
    Cc: Daniel Santos
    Cc: Jens Axboe
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • In __rb_erase_color(), we often already have pointers to the nodes being
    rotated and/or know what their colors must be, so we can generate more
    efficient code than the generic __rb_rotate_left() and __rb_rotate_right()
    functions.

    Also when the current node is red or when flipping the sibling's color,
    the parent is already known so we can use the more efficient
    rb_set_parent_color() function to set the desired color.

    Signed-off-by: Michel Lespinasse
    Cc: Andrea Arcangeli
    Acked-by: David Woodhouse
    Cc: Rik van Riel
    Cc: Peter Zijlstra
    Cc: Daniel Santos
    Cc: Jens Axboe
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • In __rb_erase_color(), we have to select one of 3 cases depending on the
    color on the 'other' node children. If both children are black, we flip a
    few node colors and iterate. Otherwise, we do either one or two tree
    rotations, depending on the color of the 'other' child opposite to 'node',
    and then we are done.

    The corresponding logic had duplicate checks for the color of the 'other'
    child opposite to 'node'. It was checking it first to determine if both
    children are black, and then to determine how many tree rotations are
    required. Rearrange the logic to avoid that extra check.

    Signed-off-by: Michel Lespinasse
    Cc: Andrea Arcangeli
    Acked-by: David Woodhouse
    Cc: Rik van Riel
    Cc: Peter Zijlstra
    Cc: Daniel Santos
    Cc: Jens Axboe
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • In __rb_erase_color(), we were always setting a node to black after
    exiting the main loop. And in one case, after fixing up the tree to
    satisfy all rbtree invariants, we were setting the current node to root
    just to guarantee a loop exit, at which point the root would be set to
    black. However this is not necessary, as the root of an rbtree is already
    known to be black. The only case where the color flip is required is when
    we exit the loop due to the current node being red, and it's easiest to
    just do the flip at that point instead of doing it after the loop.

    [adrian.hunter@intel.com: perf tools: fix build for another rbtree.c change]
    Signed-off-by: Michel Lespinasse
    Cc: Andrea Arcangeli
    Acked-by: David Woodhouse
    Cc: Rik van Riel
    Cc: Peter Zijlstra
    Cc: Daniel Santos
    Cc: Jens Axboe
    Cc: "Eric W. Biederman"
    Signed-off-by: Adrian Hunter
    Cc: Alexander Shishkin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • - Use the newly introduced rb_set_parent_color() function to flip the color
    of nodes whose parent is already known.
    - Optimize rb_parent() when the node is known to be red - there is no need
    to mask out the color in that case.
    - Flipping gparent's color to red requires us to fetch its rb_parent_color
    field, so we can reuse it as the parent value for the next loop iteration.
    - Do not use __rb_rotate_left() and __rb_rotate_right() to handle tree
    rotations: we already have pointers to all relevant nodes, and know their
    colors (either because we want to adjust it, or because we've tested it,
    or we can deduce it as black due to the node's proximity to a known red node).
    So we can generate more efficient code by making use of the node pointers
    we already have, and setting both the parent and color attributes for
    nodes all at once. Also in Case 2, some node attributes don't have to
    be set because we know another tree rotation (Case 3) will always follow
    and override them.

    Signed-off-by: Michel Lespinasse
    Cc: Andrea Arcangeli
    Acked-by: David Woodhouse
    Cc: Rik van Riel
    Cc: Peter Zijlstra
    Cc: Daniel Santos
    Cc: Jens Axboe
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • The root node of an rbtree must always be black. However,
    rb_insert_color() only needs to maintain this invariant when it has been
    broken - that is, when it exits the loop due to the current (red) node
    being the root. In all other cases (exiting after tree rotations, or
    exiting due to an existing black parent) the invariant is already
    satisfied, so there is no need to adjust the root node color.

    Signed-off-by: Michel Lespinasse
    Cc: Andrea Arcangeli
    Acked-by: David Woodhouse
    Cc: Rik van Riel
    Cc: Peter Zijlstra
    Cc: Daniel Santos
    Cc: Jens Axboe
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • It is a well known property of rbtrees that insertion never requires more
    than two tree rotations. In our implementation, after one loop iteration
    identified one or two necessary tree rotations, we would iterate and look
    for more. However at that point the node's parent would always be black,
    which would cause us to exit the loop.

    We can make the code flow more obvious by just adding a break statement
    after the tree rotations, where we know we are done. Additionally, in the
    cases where two tree rotations are necessary, we don't have to update the
    'node' pointer as it wouldn't be used until the next loop iteration, which
    we now avoid due to this break statement.

    Signed-off-by: Michel Lespinasse
    Cc: Andrea Arcangeli
    Acked-by: David Woodhouse
    Cc: Rik van Riel
    Cc: Peter Zijlstra
    Cc: Daniel Santos
    Cc: Jens Axboe
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • This small module helps measure the performance of rbtree insert and
    erase.

    Additionally, we run a few correctness tests to check that the rbtrees
    have all the desired properties:

    - they contain the right number of nodes, in the desired order,
    - no path ever has two consecutive red nodes,
    - all paths to leaf nodes have the same number of black nodes,
    - the root node is black.
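    A sketch of how such checks can be written. This peeks at the internal
    __rb_parent_color encoding (low bit set means black), as the test module
    itself must; the helper names are illustrative:

        #include <linux/bug.h>
        #include <linux/rbtree.h>

        static int node_is_red(struct rb_node *rb)
        {
                return !(rb->__rb_parent_color & 1);
        }

        /* Returns the black height; WARN_ONs on invariant violations. */
        static int check_subtree(struct rb_node *rb)
        {
                int lh, rh;

                if (!rb)
                        return 1;       /* NULL leaves count as black */
                lh = check_subtree(rb->rb_left);
                rh = check_subtree(rb->rb_right);
                WARN_ON(lh != rh);      /* equal black counts on all paths */
                if (node_is_red(rb)) {  /* red nodes have black children */
                        WARN_ON(rb->rb_left && node_is_red(rb->rb_left));
                        WARN_ON(rb->rb_right && node_is_red(rb->rb_right));
                }
                return lh + !node_is_red(rb);
        }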

    [akpm@linux-foundation.org: fix printk warning: sparc64 cycles_t is unsigned long]
    Signed-off-by: Michel Lespinasse
    Cc: Andrea Arcangeli
    Acked-by: David Woodhouse
    Cc: Rik van Riel
    Cc: Peter Zijlstra
    Cc: Daniel Santos
    Cc: Jens Axboe
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • rbtree users must use the documented APIs to manipulate the tree
    structure. Low-level helpers to manipulate node colors and parenthood are
    not part of that API, so move them to lib/rbtree.c

    [dwmw2@infradead.org: fix jffs2 build issue due to renamed __rb_parent_color field]
    Signed-off-by: Michel Lespinasse
    Cc: Andrea Arcangeli
    Acked-by: David Woodhouse
    Cc: Rik van Riel
    Cc: Peter Zijlstra
    Cc: Daniel Santos
    Cc: Jens Axboe
    Cc: "Eric W. Biederman"
    Signed-off-by: David Woodhouse
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • Empty nodes have no color. We can make use of this property to simplify
    the code emitted by the RB_EMPTY_NODE and RB_CLEAR_NODE macros. Also,
    we can get rid of the rb_init_node function which had been introduced by
    commit 88d19cf37952 ("timers: Add rb_init_node() to allow for stack
    allocated rb nodes") to avoid some issue with the empty node's color not
    being initialized.
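    The intended usage pattern is simple (a sketch; the struct and function
    names are made up):

        #include <linux/rbtree.h>

        struct item {
                struct rb_node node;
        };

        static void item_init(struct item *it)
        {
                /* marks the node as "not in any tree": the parent
                 * pointer is aimed at the node itself */
                RB_CLEAR_NODE(&it->node);
        }

        static void item_remove(struct item *it, struct rb_root *root)
        {
                if (!RB_EMPTY_NODE(&it->node)) {  /* only if inserted */
                        rb_erase(&it->node, root);
                        RB_CLEAR_NODE(&it->node);
                }
        }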

    I'm not sure what the RB_EMPTY_NODE checks in rb_prev() / rb_next() are
    doing there, though. axboe introduced them in commit 10fd48f2376d
    ("rbtree: fixed reversed RB_EMPTY_NODE and rb_next/prev"). The way I
    see it, the 'empty node' abstraction is only used by rbtree users to
    flag nodes that they haven't inserted in any rbtree, so asking the
    predecessor or successor of such nodes doesn't make any sense.

    One final rb_init_node() caller was recently added in sysctl code to
    implement faster sysctl name lookups. This code doesn't make use of
    RB_EMPTY_NODE at all, and from what I could see it only called
    rb_init_node() under the mistaken assumption that such initialization was
    required before node insertion.

    [sfr@canb.auug.org.au: fix net/ceph/osd_client.c build]
    Signed-off-by: Michel Lespinasse
    Cc: Andrea Arcangeli
    Acked-by: David Woodhouse
    Cc: Rik van Riel
    Cc: Peter Zijlstra
    Cc: Daniel Santos
    Cc: Jens Axboe
    Cc: "Eric W. Biederman"
    Cc: John Stultz
    Signed-off-by: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • Introduce HAVE_DEBUG_BUGVERBOSE config option and select it in
    corresponding architecture Kconfig files. Architectures that already
    select GENERIC_BUG don't need to select HAVE_DEBUG_BUGVERBOSE.

    Signed-off-by: Catalin Marinas
    Acked-by: Geert Uytterhoeven
    Cc: David Howells
    Cc: Hirokazu Takata
    Cc: Paul Mundt
    Cc: "David S. Miller"
    Cc: Chris Metcalf
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Catalin Marinas
     
  • Introduce HAVE_DEBUG_KMEMLEAK config option and select it in corresponding
    architecture Kconfig files. DEBUG_KMEMLEAK now only depends on
    HAVE_DEBUG_KMEMLEAK.

    Signed-off-by: Catalin Marinas
    Cc: Russell King
    Cc: Michal Simek
    Cc: Ralf Baechle
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Cc: Paul Mundt
    Cc: "David S. Miller"
    Cc: Chris Metcalf
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Catalin Marinas
     

08 Oct, 2012

1 commit

  • Pull IOMMU updates from Joerg Roedel:
    "This time the IOMMU updates contain a bunch of fixes and cleanups to
    various IOMMU drivers and the DMA debug code. New features are the
    code for IRQ remapping support with the AMD IOMMU (preparation for
    that was already merged in the last release) and a debugfs interface
    to export some statistics in the NVidia Tegra IOMMU driver."

    * tag 'iommu-updates-v3.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu: (27 commits)
    iommu/amd: Remove obsolete comment line
    dma-debug: Remove local BUS_NOTIFY_UNBOUND_DRIVER define
    iommu/amd: Fix possible use after free in get_irq_table()
    iommu/amd: Report irq remapping through IOMMU-API
    iommu/amd: Print message to system log when irq remapping is enabled
    iommu/irq: Use amd_iommu_irq_ops if supported
    iommu/amd: Make sure irq remapping still works on dma init failure
    iommu/amd: Add initialization routines for AMD interrupt remapping
    iommu/amd: Add call-back routine for HPET MSI
    iommu/amd: Implement MSI routines for interrupt remapping
    iommu/amd: Add IOAPIC remapping routines
    iommu/amd: Add routines to manage irq remapping tables
    iommu/amd: Add IRTE invalidation routine
    iommu/amd: Make sure IOMMU is not considered to translate itself
    iommu/amd: Split device table initialization into irq and dma part
    iommu/amd: Check if IOAPIC information is correct
    iommu/amd: Allocate data structures to keep track of irq remapping tables
    iommu/amd: Add slab-cache for irq remapping tables
    iommu/amd: Keep track of HPET and IOAPIC device ids
    iommu/amd: Fix features reporting
    ...

    Linus Torvalds
     

06 Oct, 2012

11 commits

  • Fix the warning:

    WARNING: vmlinux.o(.text+0x14cfd8): Section mismatch in reference from the variable compressed_formats to the function .init.text:gunzip()
    The function compressed_formats() references
    the function __init gunzip().
    etc..

    Within decompress.c, compressed_formats[] needs an __initdata annotation,
    because some of its members refer to functions which will be unloaded
    after init.

    Consequently, its user decompress_method() will get the __init prefix.
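    The general pattern is this (a sketch with hypothetical names, standing
    in for compressed_formats[]/decompress_method()): a table that stores
    pointers into .init.text must itself live in .init.data, and may then
    only be consulted by __init code.

        #include <linux/init.h>

        static int __init gunzip_like(void)     /* lives in .init.text */
        {
                return 0;
        }

        struct handler {
                const char *name;
                int (*fn)(void);
        };

        /* __initdata, because the fn members point at __init functions */
        static struct handler handlers[] __initdata = {
                { "gzip", gunzip_like },
        };

        /* consequently the accessor must be __init as well */
        static int __init pick_handler(void)
        {
                return handlers[0].fn();
        }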

    Signed-off-by: Hein Tibosch
    Cc: Albin Tonnerre
    Cc: Phillip Lougher
    Cc: "H. Peter Anvin"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hein Tibosch
     
    The SG mapping iterator with SG_MITER_ATOMIC set required IRQs to be
    disabled because it originally used KM_BIO_SRC_IRQ to allow use from IRQ
    handlers. kmap_atomic() has long been updated to handle stacking of
    atomic mappings on a per-cpu basis, and only requires not sleeping while
    mapped.

    Update sg_mapping_iter such that atomic iterators only require disabling
    preemption instead of disabling IRQ.

    While at it, convert the weird @ARG@ notations to @ARG in the comment of
    sg_miter_start().
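    A typical atomic iteration, for reference (a sketch; the copy helper is
    made up): after this change the loop body must merely not sleep, rather
    than run with IRQs disabled.

        #include <linux/kernel.h>
        #include <linux/scatterlist.h>
        #include <linux/string.h>

        static size_t copy_from_sg(struct scatterlist *sgl,
                                   unsigned int nents, void *buf, size_t len)
        {
                struct sg_mapping_iter miter;
                size_t copied = 0;

                sg_miter_start(&miter, sgl, nents,
                               SG_MITER_ATOMIC | SG_MITER_FROM_SG);
                while (copied < len && sg_miter_next(&miter)) {
                        size_t n = min(miter.length, len - copied);

                        /* miter.addr is kmap_atomic'd: don't sleep here */
                        memcpy(buf + copied, miter.addr, n);
                        copied += n;
                }
                sg_miter_stop(&miter);
                return copied;
        }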

    Signed-off-by: Tejun Heo
    Cc: Maxim Levitsky
    Cc: Alex Dubov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tejun Heo
     
  • They show up in dmesg

    [ 4.041094] start plist test
    [ 4.045804] end plist test

    without much meaning, so hide them behind the debug loglevel.

    Signed-off-by: Borislav Petkov
    Cc: Lai Jiangshan
    Acked-by: Steven Rostedt
    Cc: Paul Gortmaker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Borislav Petkov
     
  • Xen's pciback points out a couple of deficiencies with vsscanf()'s
    standard conformance:

    - Trailing character matching cannot be checked by the caller: With a
    format string of "(%x:%x.%x) %n" absence of the closing parenthesis
    cannot be checked, as input of "(00:00.0)" doesn't cause the %n to be
    evaluated (because of the code not skipping white space before the
    trailing %n).

    - The parameter corresponding to a trailing %n could get filled even if
    there was a matching error: With a format string of "(%x:%x.%x)%n",
    input of "(00:00.0]" would still fill the respective variable pointed to
    (and hence again make the mismatch non-detectable by the caller).

    This patch aims at fixing those, but leaves other non-conforming aspects
    of it untouched, among them these possibly relevant ones:

    - improper handling of the assignment suppression character '*' (blindly
    discarding all succeeding non-white space from the format and input
    strings),

    - not honoring conversion specifiers for %n,

    - not recognizing the C99 conversion specifier 't' (recognized by
    vsprintf()).
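    With the fix, a caller can use the pattern from the first example above
    to reject trailing garbage (a sketch; the function name is made up):

        #include <linux/errno.h>
        #include <linux/kernel.h>

        static int parse_bdf(const char *buf)
        {
                int domain, bus, slot, parsed = 0;

                /* %n is now only stored once everything before it,
                 * including the ')', actually matched */
                if (sscanf(buf, "(%x:%x.%x) %n",
                           &domain, &bus, &slot, &parsed) != 3 ||
                    parsed == 0 || buf[parsed] != '\0')
                        return -EINVAL; /* e.g. "(00:00.0]" is rejected */
                return 0;
        }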

    Signed-off-by: Jan Beulich
    Cc: Konrad Rzeszutek Wilk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Beulich
     
  • The logic in do_raw_spin_lock() attempts to acquire a spinlock by invoking
    arch_spin_trylock() in a loop with a delay between each attempt. Now
    consider the following situation in a 2 CPU system:

    1. CPU-0 continually acquires and releases a spinlock in a
    tight loop; it stays in this loop until some condition X
    is satisfied. X can only be satisfied by another CPU.

    2. CPU-1 tries to acquire the same spinlock, in an attempt
    to satisfy the aforementioned condition X. However, it
    never sees the unlocked value of the lock because the
    debug spinlock code uses trylock instead of just lock;
    it checks at all the wrong moments - whenever CPU-0 has
    locked the lock.

    Now in the absence of debug spinlocks, the architecture specific spinlock
    code can correctly allow CPU-1 to wait in a "queue" (e.g., ticket
    spinlocks), ensuring that it acquires the lock at some point. However,
    with the debug spinlock code, livelock can easily occur due to the use of
    try_lock, which obviously cannot put the CPU in that "queue". This
    queueing mechanism is implemented in both x86 and ARM spinlock code.

    Note that the situation mentioned above is not hypothetical. A real
    problem was encountered where CPU-0 was running hrtimer_cancel with
    interrupts disabled, and CPU-1 was attempting to run the hrtimer that
    CPU-0 was trying to cancel.

    Address this by actually attempting arch_spin_lock once it is suspected
    that there is a spinlock lockup. If we're in a situation that is
    described above, the arch_spin_lock should succeed; otherwise other
    timeout mechanisms (e.g., watchdog) should alert the system of a lockup.
    Therefore, if there is a genuine system problem and the spinlock can't be
    acquired, the end result (irrespective of this change being present) is
    the same. If there is a livelock caused by the debug code, this change
    will allow the lock to be acquired, depending on the implementation of the
    lower level arch specific spinlock code.
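    A sketch of the resulting flow in lib/spinlock_debug.c (spin_dump() is
    that file's existing lockup reporting helper):

        #include <linux/delay.h>
        #include <linux/spinlock.h>

        static void __spin_lock_debug(raw_spinlock_t *lock)
        {
                u64 i;
                u64 loops = loops_per_jiffy * HZ;

                for (i = 0; i < loops; i++) {
                        if (arch_spin_trylock(&lock->raw_lock))
                                return;
                        __delay(1);
                }
                /* lockup suspected: report it, then queue up for the lock
                 * via arch_spin_lock() instead of spinning on trylock */
                spin_dump(lock, "lockup suspected");
                arch_spin_lock(&lock->raw_lock);
        }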

    [akpm@linux-foundation.org: tweak comment]
    Signed-off-by: Vikram Mulukutla
    Cc: Thomas Gleixner
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vikram Mulukutla
     
  • Permit use of another algorithm than the default first-fit one. For
    example, a custom algorithm could be used to manage alignment
    requirements.

    As I can't predict all the possible requirements/needs for all allocation
    use cases, I add a "free" field 'void *data' to pass any needed
    information to the allocation function. For example, 'data' could point
    to a structure where you store the alignment, the expected memory bank,
    the requesting device, or any information that could influence the
    allocation algorithm.
    A usage example may look like this:

        struct my_pool_constraints {
                int align;
                int bank;
                ...
        };

        unsigned long my_custom_algo(unsigned long *map, unsigned long size,
                                     unsigned long start, unsigned int nr,
                                     void *data)
        {
                struct my_pool_constraints *constraints = data;
                ...
                deal with allocation constraints
                ...
                return the index in the bitmap where to perform the allocation
        }

        void create_my_pool()
        {
                struct my_pool_constraints c;
                struct gen_pool *pool = gen_pool_create(...);

                gen_pool_add(pool, ...);
                gen_pool_set_algo(pool, my_custom_algo, &c);
        }

    Also add a best-fit algorithm function: most of the time best-fit is
    slower than first-fit, but memory fragmentation is lower. The random
    buffer allocation/free tests don't show any arithmetic relation between
    allocation time and fragmentation, but the best-fit algorithm is
    sometimes able to perform the allocation when first-fit can't.

    This new algorithm helps to remove static allocations on ESRAM, a small
    but fast on-chip RAM of a few KB, used for high-performance use cases
    like DMA linked lists, graphics accelerators, and encoders/decoders. On
    the Ux500 (in the ARM tree) we have defined 5 ESRAM banks of 128 KB each,
    and the use of static allocations has become unmaintainable:
    cd arch/arm/mach-ux500 && grep -r ESRAM .
    ./include/mach/db8500-regs.h:/* Base address and bank offsets for ESRAM */
    ./include/mach/db8500-regs.h:#define U8500_ESRAM_BASE 0x40000000
    ./include/mach/db8500-regs.h:#define U8500_ESRAM_BANK_SIZE 0x00020000
    ./include/mach/db8500-regs.h:#define U8500_ESRAM_BANK0 U8500_ESRAM_BASE
    ./include/mach/db8500-regs.h:#define U8500_ESRAM_BANK1 (U8500_ESRAM_BASE + U8500_ESRAM_BANK_SIZE)
    ./include/mach/db8500-regs.h:#define U8500_ESRAM_BANK2 (U8500_ESRAM_BANK1 + U8500_ESRAM_BANK_SIZE)
    ./include/mach/db8500-regs.h:#define U8500_ESRAM_BANK3 (U8500_ESRAM_BANK2 + U8500_ESRAM_BANK_SIZE)
    ./include/mach/db8500-regs.h:#define U8500_ESRAM_BANK4 (U8500_ESRAM_BANK3 + U8500_ESRAM_BANK_SIZE)
    ./include/mach/db8500-regs.h:#define U8500_ESRAM_DMA_LCPA_OFFSET 0x10000
    ./include/mach/db8500-regs.h:#define U8500_DMA_LCPA_BASE
    (U8500_ESRAM_BANK0 + U8500_ESRAM_DMA_LCPA_OFFSET)
    ./include/mach/db8500-regs.h:#define U8500_DMA_LCLA_BASE U8500_ESRAM_BANK4

    I want to use genalloc to do dynamic allocations, but I need to be able
    to fine-tune the allocation algorithm. In my case the best-fit algorithm
    gives better results than first-fit, but that will not be true for every
    use case.

    Signed-off-by: Benjamin Gaignard
    Cc: Huang Ying
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Benjamin Gaignard
     
  • Account for all properties when a and/or b are 0:
    gcd(0, 0) = 0
    gcd(a, 0) = a
    gcd(0, b) = b

    Fixes no known problems in current kernels.
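    A sketch of a gcd() that honors all three identities, in the spirit of
    the fix: swap so that a >= b, then bail out before the Euclid loop when
    b is zero.

        #include <linux/kernel.h>

        unsigned long gcd(unsigned long a, unsigned long b)
        {
                unsigned long r;

                if (a < b)
                        swap(a, b);
                if (!b)         /* covers gcd(a, 0) = a and gcd(0, 0) = 0;
                                 * gcd(0, b) was swapped into gcd(b, 0) */
                        return a;
                while ((r = a % b) != 0) {
                        a = b;
                        b = r;
                }
                return b;
        }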

    Signed-off-by: Davidlohr Bueso
    Cc: Eric Dumazet
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
  • The main option should not appear in the resulting .config when the
    dependencies aren't met (i.e. use "depends on" rather than directly
    setting the default from the combined dependency values).

    The sub-options should depend on the main option rather than a more
    generic higher level one.

    Signed-off-by: Jan Beulich
    Acked-by: Don Zickus
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Beulich
     
  • The result of converting an integer value to another signed integer type
    that's unable to represent the original value is implementation defined.
    (See notes in section 6.3.1.3 of the C standard.)

    In match_number(), the result of simple_strtol() (which returns type long)
    is assigned to a variable of type int.

    Instead, handle the result of simple_strtol() in a well-defined way, and
    return -ERANGE if the result won't fit in the int variable used to hold
    the parsed result.

    No current callers pay attention to the particular error value returned,
    so this additional return code shouldn't do any harm.
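    The essence of the check is narrowing only after an explicit range test
    (a sketch with a made-up helper name, not the exact lib/parser.c code):

        #include <linux/errno.h>
        #include <linux/kernel.h>

        static int parse_int(const char *buf, int base, int *result)
        {
                char *endp;
                long val = simple_strtol(buf, &endp, base);

                if (endp == buf)
                        return -EINVAL; /* no digits were consumed */
                if (val < (long)INT_MIN || val > (long)INT_MAX)
                        return -ERANGE; /* won't fit in the int result */
                *result = (int)val;     /* now a well-defined conversion */
                return 0;
        }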

    [akpm@linux-foundation.org: coding-style tweaks]
    Signed-off-by: Alex Elder
    Cc: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alex Elder
     
  • To avoid name conflicts:

    drivers/video/riva/fbdev.c:281:9: sparse: preprocessor token MAX_LEVEL redefined

    While at it, also make the other names more consistent and add
    parentheses.

    [akpm@linux-foundation.org: repair fallout]
    [sfr@canb.auug.org.au: IB/mlx4: fix for MAX_ID_MASK to MAX_IDR_MASK name change]
    Signed-off-by: Fengguang Wu
    Cc: Bernd Petrovitsch
    Cc: walter harms
    Cc: Glauber Costa
    Signed-off-by: Stephen Rothwell
    Cc: Roland Dreier
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fengguang Wu
     
  • Acked-by: Andrei Emeltchenko
    Signed-off-by: Andy Shevchenko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Shevchenko