04 Aug, 2011

2 commits

  • We have already acknowledged that swapoff of a tmpfs file is slower than
    it was before conversion to the generic radix_tree: a little slower
    there will be acceptable, if the hotter paths are faster.

    But it was a shock to find swapoff of a 500MB file 20 times slower on my
    laptop, taking 10 minutes; and at that rate it significantly slows down
    my testing.

    Now, most of that turned out to be overhead from PROVE_LOCKING and
    PROVE_RCU: without those it was only 4 times slower than before; and
    more realistic tests on other machines don't fare as badly.

    I've tried a number of things to improve it, including tagging the swap
    entries, then doing lookup by tag: I'd expected that to halve the time,
    but in practice it's erratic, and often counter-productive.

    The only change I've so far found to make a consistent improvement is to
    short-circuit the back and forth of gang lookup packing entries into the
    supplied array and shmem then scanning that array for the target entry:
    scanning in place doubles the speed, so it's now only twice as slow as
    before (or three times slower when the PROVEs are on).

    So, add radix_tree_locate_item() as an expedient, once-off,
    single-caller hack to do the lookup directly in place. #ifdef it on
    CONFIG_SHMEM and CONFIG_SWAP, as much to document its limited
    applicability as to save space in other configurations. And, sadly,
    #include sched.h for cond_resched().

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
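
    Illustration (not from the patch): a rough kernel-style sketch of the
    lookup this provides, assuming radix_tree_locate_item() takes the tree
    root and the item, and returns the index it was found at (or -1 if it is
    not present).

        #include <linux/radix-tree.h>

        /*
         * Old scheme: gang lookup packs slots into a temporary array and the
         * caller scans that array for the wanted entry.  New scheme: let the
         * tree scan in place and hand the index back directly.
         */
        static unsigned long find_entry_index(struct radix_tree_root *root,
                                              void *entry)
        {
                /* assumed interface: returns -1 when 'entry' is not present */
                return radix_tree_locate_item(root, entry);
        }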
     
  • A patchset to extend tmpfs to MAX_LFS_FILESIZE by abandoning its
    peculiar swap vector, instead keeping a file's swap entries in the same
    radix tree as its struct page pointers: thus saving memory, and
    simplifying its code and locking.

    This patch:

    The radix_tree is used by several subsystems for different purposes. A
    major use is to store the struct page pointers of a file's pagecache for
    memory management. But what if mm wanted to store something other than
    page pointers there too?

    The low bit of a radix_tree entry is already used to denote an indirect
    pointer, for internal use, and the unlikely radix_tree_deref_retry()
    case.

    Define the next bit as denoting an exceptional entry, and supply inline
    functions radix_tree_exception() to return non-0 in either unlikely
    case, and radix_tree_exceptional_entry() to return non-0 in the second
    case.

    If a subsystem already uses radix_tree with that bit set, no problem: it
    does not affect internal workings at all, but is defined for the
    convenience of those storing well-aligned pointers in the radix_tree.

    The radix_tree_gang_lookups have an implicit assumption that the caller
    can deduce the offset of each entry returned e.g. by the page->index of
    a struct page. But that may not be feasible for some kinds of item to
    be stored there.

    radix_tree_gang_lookup_slot() now allows for an optional indices argument,
    an output array in which to return those offsets. The same could be added
    to the other radix_tree_gang_lookups, but for now keep it to the only one
    for which we need it.

    Signed-off-by: Hugh Dickins
    Acked-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
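
    Illustration (not from the patch): a sketch of packing a small integer
    into an "exceptional" entry. radix_tree_exceptional_entry() is named in
    the changelog above; the flag value and shift below are this sketch's own
    assumptions (bit 0 stays reserved for indirect pointers, bit 1 marks the
    exceptional entry).

        #include <linux/radix-tree.h>
        #include <linux/bug.h>

        #define MY_EXCEPTIONAL_FLAG  2UL   /* bit 1 of the entry */
        #define MY_VALUE_SHIFT       2     /* keep clear of the two low bits */

        static void *encode_value(unsigned long val)
        {
                return (void *)((val << MY_VALUE_SHIFT) | MY_EXCEPTIONAL_FLAG);
        }

        static unsigned long decode_value(void *entry)
        {
                /* radix_tree_exceptional_entry() returns non-0 for these */
                BUG_ON(!radix_tree_exceptional_entry(entry));
                return (unsigned long)entry >> MY_VALUE_SHIFT;
        }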
     

14 Jan, 2011

1 commit

  • …lot during file page migration

    migrate_pages() -> unmap_and_move() only calls rcu_read_lock() for
    anonymous pages, as introduced by git commit
    989f89c57e6361e7d16fbd9572b5da7d313b073d ("fix rcu_read_lock() in page
    migraton"). The point of the RCU protection there is part of getting a
    stable reference to anon_vma and is only held for anon pages as file pages
    are locked which is sufficient protection against freeing.

    However, while a file page's mapping is being migrated, the radix tree is
    double checked to ensure it is the expected page. This uses
    radix_tree_deref_slot() -> rcu_dereference() without the RCU lock held,
    triggering the following warning.

    [ 173.674290] ===================================================
    [ 173.676016] [ INFO: suspicious rcu_dereference_check() usage. ]
    [ 173.676016] ---------------------------------------------------
    [ 173.676016] include/linux/radix-tree.h:145 invoked rcu_dereference_check() without protection!
    [ 173.676016]
    [ 173.676016] other info that might help us debug this:
    [ 173.676016]
    [ 173.676016]
    [ 173.676016] rcu_scheduler_active = 1, debug_locks = 0
    [ 173.676016] 1 lock held by hugeadm/2899:
    [ 173.676016] #0: (&(&inode->i_data.tree_lock)->rlock){..-.-.}, at: [<c10e3d2b>] migrate_page_move_mapping+0x40/0x1ab
    [ 173.676016]
    [ 173.676016] stack backtrace:
    [ 173.676016] Pid: 2899, comm: hugeadm Not tainted 2.6.37-rc5-autobuild
    [ 173.676016] Call Trace:
    [ 173.676016] [<c128cc01>] ? printk+0x14/0x1b
    [ 173.676016] [<c1063502>] lockdep_rcu_dereference+0x7d/0x86
    [ 173.676016] [<c10e3db5>] migrate_page_move_mapping+0xca/0x1ab
    [ 173.676016] [<c10e41ad>] migrate_page+0x23/0x39
    [ 173.676016] [<c10e491b>] buffer_migrate_page+0x22/0x107
    [ 173.676016] [<c10e48f9>] ? buffer_migrate_page+0x0/0x107
    [ 173.676016] [<c10e425d>] move_to_new_page+0x9a/0x1ae
    [ 173.676016] [<c10e47e6>] migrate_pages+0x1e7/0x2fa

    This patch introduces radix_tree_deref_slot_protected() which calls
    rcu_dereference_protected(). Users of it must pass in the
    mapping->tree_lock that is protecting this dereference. Holding the tree
    lock protects against parallel updaters of the radix tree, meaning that
    rcu_dereference_protected is allowable.

    [akpm@linux-foundation.org: remove unneeded casts]
    Signed-off-by: Mel Gorman <mel@csn.ul.ie>
    Cc: Minchan Kim <minchan.kim@gmail.com>
    Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
    Cc: Milton Miller <miltonm@bga.com>
    Cc: Nick Piggin <nickpiggin@yahoo.com.au>
    Cc: Wu Fengguang <fengguang.wu@intel.com>
    Cc: <stable@kernel.org> [2.6.37.early]
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

    Mel Gorman
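
    Illustration (not from the patch): a sketch of the accessor in use,
    modelled on the migration re-check described above. The helper name and
    surrounding code are illustrative; the point is that the caller holds
    mapping->tree_lock and passes it in, so lockdep can verify the protection.

        #include <linux/fs.h>
        #include <linux/radix-tree.h>
        #include <linux/spinlock.h>

        static int slot_still_holds(struct address_space *mapping,
                                    void **pslot, struct page *expected)
        {
                int ok;

                spin_lock_irq(&mapping->tree_lock);
                /* tree_lock excludes concurrent updaters, so
                 * rcu_dereference_protected() is legitimate here */
                ok = (radix_tree_deref_slot_protected(pslot,
                                &mapping->tree_lock) == expected);
                spin_unlock_irq(&mapping->tree_lock);
                return ok;
        }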
     

12 Nov, 2010

1 commit

  • Salman Qazi describes the following radix-tree bug:

    In the following case, we can get a deadlock:

    0. The radix tree contains two items, one has the index 0.
    1. The reader (in this case find_get_pages) takes the rcu_read_lock.
    2. The reader acquires slot(s) for item(s) including the index 0 item.
    3. The non-zero index item is deleted, and as a consequence the other
       item is moved to the root of the tree. The place where it used to be
       is queued for deletion after the readers finish.
    3b. The zero item is deleted, removing it from the direct slot; it
        remains in the rcu-delayed indirect node.
    4. The reader looks at the index 0 slot, and finds that the page has a
       0 ref count.
    5. The reader looks at it again, hoping that the item will either be
       freed or the ref count will increase. This never happens, as the
       slot it is looking at will never be updated. Also, this slot can
       never be reclaimed because the reader is holding rcu_read_lock and
       is in an infinite loop.

    The fix is to extend the existing "indirect" pointer case, which already
    requires a slot lookup retry, into a general "retry the lookup" bit.

    Signed-off-by: Nick Piggin
    Reported-by: Salman Qazi
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
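
    Illustration (not from the patch): a sketch of the reader-side retry
    pattern this enables, loosely modelled on a find_get_page()-style loop;
    names and structure are illustrative only.

        #include <linux/radix-tree.h>
        #include <linux/rcupdate.h>

        static void *lookup_with_retry(struct radix_tree_root *root,
                                       unsigned long index)
        {
                void **slot;
                void *item;

                rcu_read_lock();
        repeat:
                item = NULL;
                slot = radix_tree_lookup_slot(root, index);
                if (slot) {
                        item = radix_tree_deref_slot(slot);
                        /* retry bit set: the tree changed shape under us */
                        if (radix_tree_deref_retry(item))
                                goto repeat;
                }
                rcu_read_unlock();
                return item;
        }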
     

10 Aug, 2010

2 commits

  • We try to avoid livelocks of writeback when someone steadily creates dirty
    pages in a mapping we are writing out. For memory-cleaning writeback,
    using nr_to_write works reasonably well but we cannot really use it for
    data integrity writeback. This patch tries to solve the problem.

    The idea is simple: Tag all pages that should be written back with a
    special tag (TOWRITE) in the radix tree. This can be done rather quickly
    and thus livelocks should not happen in practice. Then we start doing the
    hard work of locking pages and sending them to disk only for those pages
    that have the TOWRITE tag set.

    Note: Adding a new radix tree tag grows the radix tree node from 288 to 296
    bytes for 32-bit archs and from 552 to 560 bytes for 64-bit archs.
    However, the number of slab/slub items per page remains the same (13 and 7
    respectively).

    Signed-off-by: Jan Kara
    Cc: Dave Chinner
    Cc: Nick Piggin
    Cc: Chris Mason
    Cc: Theodore Ts'o
    Cc: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
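
    Illustration (not from the patch): a sketch of the two-phase scheme,
    assuming the tag becomes PAGECACHE_TAG_TOWRITE and that a
    tag_pages_for_writeback()-style helper performs the first pass; a real
    loop would also lock each page, honour the end of the range, and clear
    the tag as it writes.

        #include <linux/fs.h>
        #include <linux/pagemap.h>
        #include <linux/pagevec.h>
        #include <linux/writeback.h>

        static void writeback_range_sketch(struct address_space *mapping,
                                           pgoff_t start, pgoff_t end)
        {
                struct pagevec pvec;
                pgoff_t index = start;

                /* phase 1: mark everything currently dirty in the range */
                tag_pages_for_writeback(mapping, start, end);

                /* phase 2: only TOWRITE pages are written, so pages dirtied
                 * after the tagging pass cannot livelock this loop */
                pagevec_init(&pvec, 0);
                while (pagevec_lookup_tag(&pvec, mapping, &index,
                                          PAGECACHE_TAG_TOWRITE,
                                          PAGEVEC_SIZE)) {
                        /* ... lock and write each pvec.pages[i] here ... */
                        pagevec_release(&pvec);
                }
        }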
     
  • Implement a function for setting one tag if another tag is set, for each
    item in a given range.

    Signed-off-by: Jan Kara
    Cc: Dave Chinner
    Cc: Nick Piggin
    Cc: Chris Mason
    Cc: Theodore Ts'o
    Cc: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
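
    Illustration (not from the patch): a sketch of a batched caller; the exact
    parameter order of radix_tree_range_tag_if_tagged() is assumed from the
    description (range, batch limit, "if" tag, "set" tag), with the starting
    index advanced by the function so the scan can be resumed.

        #include <linux/fs.h>
        #include <linux/pagemap.h>
        #include <linux/radix-tree.h>
        #include <linux/sched.h>
        #include <linux/spinlock.h>

        static void copy_dirty_to_towrite(struct address_space *mapping,
                                          unsigned long start,
                                          unsigned long end)
        {
                unsigned long tagged;

                do {
                        spin_lock_irq(&mapping->tree_lock);
                        tagged = radix_tree_range_tag_if_tagged(
                                        &mapping->page_tree, &start, end,
                                        1024 /* batch */,
                                        PAGECACHE_TAG_DIRTY,
                                        PAGECACHE_TAG_TOWRITE);
                        spin_unlock_irq(&mapping->tree_lock);
                        cond_resched();         /* don't hog the lock */
                } while (tagged >= 1024);
        }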
     

10 Apr, 2010

1 commit

  • radix_tree_tag_get() is not safe to use concurrently with radix_tree_tag_set()
    or radix_tree_tag_clear(). The problem is that the double tag_get() in
    radix_tree_tag_get():

    if (!tag_get(node, tag, offset))
            saw_unset_tag = 1;
    if (height == 1) {
            int ret = tag_get(node, tag, offset);

    may see the value change due to the action of set/clear. RCU is no protection
    against this as no pointers are being changed, no nodes are being replaced
    according to a COW protocol - set/clear alter the node directly.

    The documentation in linux/radix-tree.h, however, says that
    radix_tree_tag_get() is an exception to the rule that "any function modifying
    the tree or tags (...) must exclude other modifications, and exclude any
    functions reading the tree".

    The problem is that the next statement in radix_tree_tag_get() checks that the
    tag doesn't vary over time:

    BUG_ON(ret && saw_unset_tag);

    This has been seen happening in FS-Cache:

    https://www.redhat.com/archives/linux-cachefs/2010-April/msg00013.html

    To this end, remove the BUG_ON() from radix_tree_tag_get() and note in various
    comments that the value of the tag may change whilst the RCU read lock is held,
    and thus that the return value of radix_tree_tag_get() may not be relied upon
    unless radix_tree_tag_set/clear() and radix_tree_delete() are excluded from
    running concurrently with it.

    Reported-by: Romain DEGEZ
    Signed-off-by: David Howells
    Acked-by: Nick Piggin
    Signed-off-by: Linus Torvalds

    David Howells
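
    Illustration (not from the patch): the usage rule the new comments
    document, in sketch form: a tag_get() result is only meaningful while
    tag_set/tag_clear/delete are excluded, for example by the caller's own
    lock (names here are illustrative).

        #include <linux/radix-tree.h>
        #include <linux/spinlock.h>

        static int item_is_tagged(struct radix_tree_root *root,
                                  spinlock_t *lock,
                                  unsigned long index, unsigned int tag)
        {
                int ret;

                spin_lock(lock);        /* excludes tag_set/tag_clear/delete */
                ret = radix_tree_tag_get(root, index, tag);
                spin_unlock(lock);
                return ret;
        }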
     

17 Jun, 2009

1 commit

  • The counterpart of radix_tree_next_hole(). To be used by context readahead.

    Signed-off-by: Wu Fengguang
    Cc: Vladislav Bolkhovitin
    Cc: Jens Axboe
    Cc: Jeff Moyer
    Cc: Nick Piggin
    Cc: Ying Han
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wu Fengguang
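
    Illustration (not from the patch): a usage sketch, assuming the signature
    mirrors radix_tree_next_hole() (scan backwards from index for the first
    empty slot, giving up after max_scan slots); the helper and bound are
    illustrative.

        #include <linux/radix-tree.h>

        /* Where does the run of cached pages ending at 'index' begin? */
        static unsigned long run_start(struct radix_tree_root *root,
                                       unsigned long index)
        {
                unsigned long hole;

                /* first empty slot at or below 'index', scanning at most
                 * 1024 slots backwards */
                hole = radix_tree_prev_hole(root, index, 1024);
                return hole + 1;        /* the run starts just above the hole */
        }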
     

06 Jan, 2009

1 commit

  • An XFS workload showed up a bug in the lockless pagecache patch. Basically it
    would go into an "infinite" loop, although it would sometimes be able to break
    out of the loop! The reason is a missing compiler barrier in the "increment
    reference count unless it was zero" case of the lockless pagecache protocol in
    the gang lookup functions.

    This would cause the compiler to use a cached value of struct page pointer to
    retry the operation with, rather than reload it. So the page might have been
    removed from pagecache and freed (refcount==0) but the lookup would not correctly
    notice the page is no longer in pagecache, and keep attempting to increment the
    refcount and failing, until the page gets reallocated for something else. This
    isn't a data corruption because the condition will be detected if the page has
    been reallocated. However it can result in a lockup.

    Linus points out that ACCESS_ONCE is also required in that pointer load, even
    if its absence is not causing a bug on our particular build. The most general
    way to solve this is just to put an rcu_dereference in radix_tree_deref_slot.

    Assembly of find_get_pages,
    before:
    .L220:
    movq (%rbx), %rax #* ivtmp.1162, tmp82
    movq (%rax), %rdi #, prephitmp.1149
    .L218:
    testb $1, %dil #, prephitmp.1149
    jne .L217 #,
    testq %rdi, %rdi # prephitmp.1149
    je .L203 #,
    cmpq $-1, %rdi #, prephitmp.1149
    je .L217 #,
    movl 8(%rdi), %esi # ._count.counter, c
    testl %esi, %esi # c
    je .L218 #,

    after:
    .L212:
    movq (%rbx), %rax #* ivtmp.1109, tmp81
    movq (%rax), %rdi #, ret
    testb $1, %dil #, ret
    jne .L211 #,
    testq %rdi, %rdi # ret
    je .L197 #,
    cmpq $-1, %rdi #, ret
    je .L211 #,
    movl 8(%rdi), %esi # ._count.counter, c
    testl %esi, %esi # c
    je .L212 #,

    (notice the obvious infinite loop in the first example, if page->count remains 0)

    Signed-off-by: Nick Piggin
    Reviewed-by: Paul E. McKenney
    Signed-off-by: Linus Torvalds

    Nick Piggin
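
    Illustration (not from the patch): the changelog's own suggestion, in
    sketch form; rcu_dereference() both documents the lockless access and
    supplies the barrier that stops the compiler re-using a stale cached copy
    of the pointer across the retry.

        #include <linux/rcupdate.h>

        /* shape of the fixed helper (sketch) */
        static inline void *radix_tree_deref_slot(void **pslot)
        {
                return rcu_dereference(*pslot);
        }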
     

27 Jul, 2008

1 commit

  • Introduce gang_lookup_slot() and gang_lookup_slot_tag() functions, which
    are used by lockless pagecache.

    Signed-off-by: Nick Piggin
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Hugh Dickins
    Cc: "Paul E. McKenney"
    Reviewed-by: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
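
    Illustration (not from the patch): a sketch of a caller of the new
    slot-returning gang lookup; batch size and naming are illustrative, and
    the later optional indices argument (see the 04 Aug, 2011 entry above) is
    not shown.

        #include <linux/radix-tree.h>

        #define BATCH 16

        static unsigned int scan_slots(struct radix_tree_root *root,
                                       unsigned long start)
        {
                void **slots[BATCH];
                unsigned int i, nr;

                nr = radix_tree_gang_lookup_slot(root, slots, start, BATCH);
                for (i = 0; i < nr; i++) {
                        /* each slot can be re-checked or updated in place */
                        void *item = radix_tree_deref_slot(slots[i]);

                        (void)item;     /* ... use the item here ... */
                }
                return nr;
        }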
     

17 Oct, 2007

2 commits

  • Rather than sign direct radix-tree pointers with a special bit, sign the
    indirect one that hangs off the root. This means that, given a lookup_slot
    operation, the invalid result will be differentiated from the valid
    (previously, valid results could have the bit either set or clear).

    This does not affect slot lookups which occur under lock -- they can never
    return an invalid result. This is needed in future for the lockless
    pagecache.

    Signed-off-by: Nick Piggin
    Acked-by: Peter Zijlstra
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
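
    Illustration (not from the patch): the general pointer-tagging technique
    being relied on, with illustrative names rather than the kernel's exact
    macros; nodes are word-aligned, so bit 0 of the pointer is free to mean
    "this points at an internal node, not a data item".

        #define INDIRECT_BIT 1UL

        static inline void *ptr_to_indirect(void *node)
        {
                return (void *)((unsigned long)node | INDIRECT_BIT);
        }

        static inline void *indirect_to_ptr(void *entry)
        {
                return (void *)((unsigned long)entry & ~INDIRECT_BIT);
        }

        static inline int is_indirect_ptr(void *entry)
        {
                return (unsigned long)entry & INDIRECT_BIT;
        }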
     
  • Introduce radix_tree_next_hole(root, index, max_scan) to scan the radix
    tree for the first hole. It will be used in interleaved readahead.

    The implementation is dumb and obviously correct. It can help debug (and
    document) a possible smarter one in future.

    Cc: Nick Piggin
    Signed-off-by: Fengguang Wu
    Cc: Rusty Russell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fengguang Wu
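
    Illustration (not from the patch): a readahead-flavoured usage sketch;
    the helper name and scan bound are illustrative.

        #include <linux/radix-tree.h>

        /* How many pages are already cached starting at 'index'? */
        static unsigned long cached_pages_ahead(struct radix_tree_root *root,
                                                unsigned long index)
        {
                /* index of the first missing slot at or after 'index',
                 * scanning at most 1024 slots forward */
                unsigned long hole = radix_tree_next_hole(root, index, 1024);

                return hole - index;
        }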
     

08 Dec, 2006

1 commit

  • Make radix tree lookups safe to be performed without locks. Readers are
    protected against nodes being deleted by using RCU based freeing. Readers
    are protected against new node insertion by using memory barriers to ensure
    the node itself will be properly written before it is visible in the radix
    tree.

    Each radix tree node keeps a record of its height (above leaf nodes).
    This height does not change after insertion -- when the radix tree is
    extended, higher nodes are only inserted in the top. So a lookup can take
    the pointer to what is *now* the root node, and traverse down it even if
    the tree is concurrently extended and this node becomes a subtree of a new
    root.

    "Direct" pointers (tree height of 0, where root->rnode points directly to
    the data item) are handled by using the low bit of the pointer to signal
    whether rnode is a direct pointer or a pointer to a radix tree node.

    When a reader wants to traverse the next branch, they will take a copy of
    the pointer. This pointer will be either NULL (and the branch is empty) or
    non-NULL (and will point to a valid node).

    [akpm@osdl.org: cleanups]
    [Lee.Schermerhorn@hp.com: bugfixes, comments, simplifications]
    [clameter@sgi.com: build fix]
    Signed-off-by: Nick Piggin
    Cc: "Paul E. McKenney"
    Signed-off-by: Lee Schermerhorn
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
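
    Illustration (not from the patch): the reader/writer split this enables,
    in sketch form; readers need only rcu_read_lock(), while writers still
    serialise among themselves with their own lock (a real reader would also
    take a reference on the item before dropping the RCU lock).

        #include <linux/radix-tree.h>
        #include <linux/rcupdate.h>
        #include <linux/spinlock.h>

        static DEFINE_SPINLOCK(update_lock);            /* writers only */
        static RADIX_TREE(my_tree, GFP_ATOMIC);

        static void *reader(unsigned long index)
        {
                void *item;

                rcu_read_lock();                /* no update_lock needed */
                item = radix_tree_lookup(&my_tree, index);
                rcu_read_unlock();
                return item;
        }

        static int writer(unsigned long index, void *item)
        {
                int err;

                spin_lock(&update_lock);        /* exclude other writers */
                err = radix_tree_insert(&my_tree, index, item);
                spin_unlock(&update_lock);
                return err;
        }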
     

23 Jun, 2006

1 commit

  • The ability to have height 0 radix trees (a direct pointer to the data item
    rather than going through a full node->slot) quietly disappeared with
    old-2.6-bkcvs commit ffee171812d51652f9ba284302d9e5c5cc14bdfd. On 64-bit
    machines this causes nearly 600 bytes to be used for every <= 4K file in
    pagecache.

    Re-introduce this feature, with root tags stored in spare ->gfp_mask bits.

    Simplify radix_tree_delete's complex tag clearing arrangement (which would
    become even more complex) by just falling back to tag clearing functions
    (the pagecache radix-tree never uses this path anyway, so the icache
    savings will mean it's actually a speedup).

    On my 4GB G5, this saves 8MB RAM per kernel source+object tree in
    pagecache.

    Pagecache lookup, insertion, and removal speed for small files will also be
    improved.

    This makes RCU radix tree harder, but it's worth it.

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     

26 Mar, 2006

1 commit

  • Documentation changes to help radix tree users avoid overrunning the tags
    array. RADIX_TREE_TAGS moves to linux/radix-tree.h and is now known as
    RADIX_TREE_MAX_TAGS (Nick Piggin's idea). Tag parameters are changed to
    unsigned, and some comments are updated.

    Signed-off-by: Jonathan Corbet
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jonathan Corbet
     

07 Nov, 2005

1 commit

  • Reiser4 uses radix trees to solve a problem reiser4_readdir has serving nfs
    requests.

    Unfortunately, the radix tree API lacks an operation suitable for modifying
    an existing entry. This patch adds radix_tree_lookup_slot, which returns a
    pointer to the slot of the found item within the tree. That location can
    then be updated.

    Both Nick and Christoph Lameter have patches which need this as well.

    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hans Reiser
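
    Illustration (not from the patch): the modify-in-place use case in sketch
    form; the caller is assumed to hold whatever lock serialises updates to
    this tree, and the helper name is illustrative.

        #include <linux/radix-tree.h>

        /* Swap the item stored at 'index' for a new one, without a
         * delete + insert cycle. */
        static void *replace_item(struct radix_tree_root *root,
                                  unsigned long index, void *new_item)
        {
                void **slot;
                void *old;

                slot = radix_tree_lookup_slot(root, index);
                if (!slot)
                        return NULL;    /* nothing stored at that index */

                old = *slot;
                *slot = new_item;       /* update the entry in place */
                return old;
        }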
     

09 Oct, 2005

1 commit

  • - added typedef unsigned int __nocast gfp_t;

    - replaced __nocast uses for gfp flags with gfp_t - it gives exactly
    the same warnings as far as sparse is concerned, doesn't change
    generated code (from gcc point of view we replaced unsigned int with
    typedef) and documents what's going on far better.

    Signed-off-by: Al Viro
    Signed-off-by: Linus Torvalds

    Al Viro
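
    Illustration (not from the patch): the visible effect of the change is
    that allocation-flag parameters are declared gfp_t instead of a bare
    unsigned int, so sparse keeps checking them; the helper below is
    illustrative, not a kernel API.

        #include <linux/slab.h>
        #include <linux/types.h>

        static void *alloc_buffer(size_t len, gfp_t flags)
        {
                return kmalloc(len, flags);
        }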
     

17 Apr, 2005

1 commit

  • Initial git repository build. I'm not bothering with the full history,
    even though we have it. We can create a separate "historical" git
    archive of that later if we want to, and in the meantime it's about
    3.2GB when imported into git - space that would just make the early
    git days unnecessarily complicated, when we don't have a lot of good
    infrastructure for it.

    Let it rip!

    Linus Torvalds