04 Apr, 2014

4 commits

  • Previously, page cache radix tree nodes were freed after reclaim emptied
    out their page pointers. But now reclaim stores shadow entries in their
    place, which are only reclaimed when the inodes themselves are
    reclaimed. This is problematic for bigger files that are still in use
    after they have a significant amount of their cache reclaimed, without
    any of those pages actually refaulting. The shadow entries will just
    sit there and waste memory. In the worst case, the shadow entries will
    accumulate until the machine runs out of memory.

    To get this under control, the VM will track radix tree nodes
    exclusively containing shadow entries on a per-NUMA node list. Per-NUMA
    rather than global because we expect the radix tree nodes themselves to
    be allocated node-locally and we want to reduce cross-node references of
    otherwise independent cache workloads. A simple shrinker will then
    reclaim these nodes on memory pressure.

    A few things need to be stored in the radix tree node to implement the
    shadow node LRU and allow tree deletions coming from the list:

    1. There is no index available that would describe the reverse path
    from the node up to the tree root, which is needed to perform a
    deletion. To solve this, encode in each node its offset inside the
    parent. This can be stored in the unused upper bits of the same
    member that stores the node's height at no extra space cost.

    2. The number of shadow entries needs to be counted in addition to the
    regular entries, to quickly detect when the node is ready to go to
    the shadow node LRU list. The current entry count is an unsigned
    int but the maximum number of entries is 64, so a shadow counter
    can easily be stored in the unused upper bits.

    3. Tree modification needs tree lock and tree root, which are located
    in the address space, so store an address_space backpointer in the
    node. The parent pointer of the node is in a union with the 2-word
    rcu_head, so the backpointer comes at no extra cost as well.

    4. The node needs to be linked to an LRU list, which requires a list
    head inside the node. This does increase the size of the node, but
    it does not change the number of objects that fit into a slab page.

    [akpm@linux-foundation.org: export the right function]
    Signed-off-by: Johannes Weiner
    Reviewed-by: Rik van Riel
    Reviewed-by: Minchan Kim
    Cc: Andrea Arcangeli
    Cc: Bob Liu
    Cc: Christoph Hellwig
    Cc: Dave Chinner
    Cc: Greg Thelen
    Cc: Hugh Dickins
    Cc: Jan Kara
    Cc: KOSAKI Motohiro
    Cc: Luigi Semenzato
    Cc: Mel Gorman
    Cc: Metin Doslu
    Cc: Michel Lespinasse
    Cc: Ozgun Erdogan
    Cc: Peter Zijlstra
    Cc: Roman Gushchin
    Cc: Ryan Mallon
    Cc: Tejun Heo
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Make struct radix_tree_node part of the public interface and provide API
    functions to create, look up, and delete whole nodes. Refactor the
    existing insert, look up, delete functions on top of these new node
    primitives.

    This will allow the VM to track and garbage collect page cache radix
    tree nodes.

    [sasha.levin@oracle.com: return correct error code on insertion failure]
    Signed-off-by: Johannes Weiner
    Reviewed-by: Rik van Riel
    Cc: Andrea Arcangeli
    Cc: Bob Liu
    Cc: Christoph Hellwig
    Cc: Dave Chinner
    Cc: Greg Thelen
    Cc: Hugh Dickins
    Cc: Jan Kara
    Cc: KOSAKI Motohiro
    Cc: Luigi Semenzato
    Cc: Mel Gorman
    Cc: Metin Doslu
    Cc: Michel Lespinasse
    Cc: Ozgun Erdogan
    Cc: Peter Zijlstra
    Cc: Roman Gushchin
    Cc: Ryan Mallon
    Cc: Tejun Heo
    Cc: Vlastimil Babka
    Signed-off-by: Sasha Levin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • The radix tree hole searching code is only used for page cache, for
    example the readahead code trying to get a a picture of the area
    surrounding a fault.

    It sufficed to rely on the radix tree definition of holes, which is
    "empty tree slot". But this is about to change, though, as shadow page
    descriptors will be stored in the page cache after the actual pages get
    evicted from memory.

    Move the functions over to mm/filemap.c and make them native page cache
    operations, where they can later be adapted to handle the new definition
    of "page cache hole".

    Signed-off-by: Johannes Weiner
    Reviewed-by: Rik van Riel
    Reviewed-by: Minchan Kim
    Acked-by: Mel Gorman
    Cc: Andrea Arcangeli
    Cc: Bob Liu
    Cc: Christoph Hellwig
    Cc: Dave Chinner
    Cc: Greg Thelen
    Cc: Hugh Dickins
    Cc: Jan Kara
    Cc: KOSAKI Motohiro
    Cc: Luigi Semenzato
    Cc: Metin Doslu
    Cc: Michel Lespinasse
    Cc: Ozgun Erdogan
    Cc: Peter Zijlstra
    Cc: Roman Gushchin
    Cc: Ryan Mallon
    Cc: Tejun Heo
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Provide a function that does not just delete an entry at a given index,
    but also allows passing in an expected item. Delete only if that item
    is still located at the specified index.

    This is handy when lockless tree traversals want to delete entries as
    well because they don't have to do an second, locked lookup to verify
    the slot has not changed under them before deleting the entry.

    Signed-off-by: Johannes Weiner
    Reviewed-by: Minchan Kim
    Reviewed-by: Rik van Riel
    Acked-by: Mel Gorman
    Cc: Andrea Arcangeli
    Cc: Bob Liu
    Cc: Christoph Hellwig
    Cc: Dave Chinner
    Cc: Greg Thelen
    Cc: Hugh Dickins
    Cc: Jan Kara
    Cc: KOSAKI Motohiro
    Cc: Luigi Semenzato
    Cc: Metin Doslu
    Cc: Michel Lespinasse
    Cc: Ozgun Erdogan
    Cc: Peter Zijlstra
    Cc: Roman Gushchin
    Cc: Ryan Mallon
    Cc: Tejun Heo
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

04 Mar, 2014

1 commit

  • Running fsx on tmpfs with concurrent memhog-swapoff-swapon, lots of

    BUG: sleeping function called from invalid context at kernel/fork.c:606
    in_atomic(): 0, irqs_disabled(): 0, pid: 1394, name: swapoff
    1 lock held by swapoff/1394:
    #0: (rcu_read_lock){.+.+.+}, at: [] radix_tree_locate_item+0x1f/0x2b6

    followed by

    ================================================
    [ BUG: lock held when returning to user space! ]
    3.14.0-rc1 #3 Not tainted
    ------------------------------------------------
    swapoff/1394 is leaving the kernel with locks still held!
    1 lock held by swapoff/1394:
    #0: (rcu_read_lock){.+.+.+}, at: [] radix_tree_locate_item+0x1f/0x2b6

    after which the system recovered nicely.

    Whoops, I long ago forgot the rcu_read_unlock() on one unlikely branch.

    Fixes e504f3fdd63d ("tmpfs radix_tree: locate_item to speed up swapoff")

    Signed-off-by: Hugh Dickins
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

12 Sep, 2013

1 commit

  • With users of radix_tree_preload() run from interrupt (block/blk-ioc.c is
    one such possible user), the following race can happen:

    radix_tree_preload()
    ...
    radix_tree_insert()
    radix_tree_node_alloc()
    if (rtp->nr) {
    ret = rtp->nodes[rtp->nr - 1];

    ...
    radix_tree_preload()
    ...
    radix_tree_insert()
    radix_tree_node_alloc()
    if (rtp->nr) {
    ret = rtp->nodes[rtp->nr - 1];

    And we give out one radix tree node twice. That clearly results in radix
    tree corruption with different results (usually OOPS) depending on which
    two users of radix tree race.

    We fix the problem by making radix_tree_node_alloc() always allocate fresh
    radix tree nodes when in interrupt. Using preloading when in interrupt
    doesn't make sense since all the allocations have to be atomic anyway and
    we cannot steal nodes from process-context users because some users rely
    on radix_tree_insert() succeeding after radix_tree_preload().
    in_interrupt() check is somewhat ugly but we cannot simply key off passed
    gfp_mask as that is acquired from root_gfp_mask() and thus the same for
    all preload users.

    Another part of the fix is to avoid node preallocation in
    radix_tree_preload() when passed gfp_mask doesn't allow waiting. Again,
    preallocation in such case doesn't make sense and when preallocation would
    happen in interrupt we could possibly leak some allocated nodes. However,
    some users of radix_tree_preload() require following radix_tree_insert()
    to succeed. To avoid unexpected effects for these users,
    radix_tree_preload() only warns if passed gfp mask doesn't allow waiting
    and we provide a new function radix_tree_maybe_preload() for those users
    which get different gfp mask from different call sites and which are
    prepared to handle radix_tree_insert() failure.

    Signed-off-by: Jan Kara
    Cc: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     

06 Jun, 2012

1 commit

  • This patch fixes bug in macro radix_tree_for_each_contig().

    If radix_tree_next_slot() sees NULL in next slot it returns NULL, but following
    radix_tree_next_chunk() switches iterating into next chunk. As result iterating
    becomes non-contiguous and breaks vfs "splice" and all its users.

    Signed-off-by: Konstantin Khlebnikov
    Reported-and-bisected-by: Hans de Bruin
    Reported-and-bisected-by: Ondrej Zary
    Reported-bisected-and-tested-by: Toralf Förster
    Link: https://lkml.org/lkml/2012/6/5/64
    Cc: stable # 3.4.x
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     

30 May, 2012

1 commit


29 Mar, 2012

2 commits

  • Rewrite radix_tree_gang_lookup_* functions using the new radix-tree
    iterator.

    Signed-off-by: Konstantin Khlebnikov
    Tested-by: Hugh Dickins
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     
  • A series of radix tree cleanups, and usage of them in the core pagecache
    code.

    Micro-benchmark:

    lookup 14 slots (typical page-vector size)
    in radix-tree there earch slot filled and tagged
    before/after - nsec per full scan through tree

    * Intel Sandy Bridge i7-2620M 4Mb L3
    New code always faster

    * AMD Athlon 6000+ 2x1Mb L2, without L3
    New code generally faster,
    Minor degradation (marked with "*") for huge sparse trees

    * i386 on Sandy Bridge
    New code faster for common cases: tagged and dense trees.
    Some degradations for non-tagged lookup on sparse trees.

    Ideally, there might help __ffs() analog for searching first non-zero
    long element in array, gcc sometimes cannot optimize this loop corretly.

    Numbers:

    CPU: Intel Sandy Bridge i7-2620M 4Mb L3

    radix-tree with 1024 slots:

    tagged lookup

    step 1 before 7156 after 3613
    step 2 before 5399 after 2696
    step 3 before 4779 after 1928
    step 4 before 4456 after 1429
    step 5 before 4292 after 1213
    step 6 before 4183 after 1052
    step 7 before 4157 after 951
    step 8 before 4016 after 812
    step 9 before 3952 after 851
    step 10 before 3937 after 732
    step 11 before 4023 after 709
    step 12 before 3872 after 657
    step 13 before 3892 after 633
    step 14 before 3720 after 591
    step 15 before 3879 after 578
    step 16 before 3561 after 513

    normal lookup

    step 1 before 4266 after 3301
    step 2 before 2695 after 2129
    step 3 before 2083 after 1712
    step 4 before 1801 after 1534
    step 5 before 1628 after 1313
    step 6 before 1551 after 1263
    step 7 before 1475 after 1185
    step 8 before 1432 after 1167
    step 9 before 1373 after 1092
    step 10 before 1339 after 1134
    step 11 before 1292 after 1056
    step 12 before 1319 after 1030
    step 13 before 1276 after 1004
    step 14 before 1256 after 987
    step 15 before 1228 after 992
    step 16 before 1247 after 999

    radix-tree with 1024*1024*128 slots:

    tagged lookup

    step 1 before 1086102841 after 674196409
    step 2 before 816839155 after 498138306
    step 7 before 599728907 after 240676762
    step 15 before 555729253 after 185219677
    step 63 before 606637748 after 128585664
    step 64 before 608384432 after 102945089
    step 65 before 596987114 after 123996019
    step 128 before 304459225 after 56783056
    step 256 before 158846855 after 31232481
    step 512 before 86085652 after 18950595
    step 12345 before 6517189 after 1674057

    normal lookup

    step 1 before 626064869 after 544418266
    step 2 before 418809975 after 336321473
    step 7 before 242303598 after 207755560
    step 15 before 208380563 after 176496355
    step 63 before 186854206 after 167283638
    step 64 before 176188060 after 170143976
    step 65 before 185139608 after 167487116
    step 128 before 88181865 after 86913490
    step 256 before 45733628 after 45143534
    step 512 before 24506038 after 23859036
    step 12345 before 2177425 after 2018662

    * AMD Athlon 6000+ 2x1Mb L2, without L3

    radix-tree with 1024 slots:

    tag-lookup

    step 1 before 8164 after 5379
    step 2 before 5818 after 5581
    step 3 before 4959 after 4213
    step 4 before 4371 after 3386
    step 5 before 4204 after 2997
    step 6 before 4950 after 2744
    step 7 before 4598 after 2480
    step 8 before 4251 after 2288
    step 9 before 4262 after 2243
    step 10 before 4175 after 2131
    step 11 before 3999 after 2024
    step 12 before 3979 after 1994
    step 13 before 3842 after 1929
    step 14 before 3750 after 1810
    step 15 before 3735 after 1810
    step 16 before 3532 after 1660

    normal-lookup

    step 1 before 7875 after 5847
    step 2 before 4808 after 4071
    step 3 before 4073 after 3462
    step 4 before 3677 after 3074
    step 5 before 4308 after 2978
    step 6 before 3911 after 3807
    step 7 before 3635 after 3522
    step 8 before 3313 after 3202
    step 9 before 3280 after 3257
    step 10 before 3166 after 3083
    step 11 before 3066 after 3026
    step 12 before 2985 after 2982
    step 13 before 2925 after 2924
    step 14 before 2834 after 2808
    step 15 before 2805 after 2803
    step 16 before 2647 after 2622

    radix-tree with 1024*1024*128 slots:

    tag-lookup

    step 1 before 1288059720 after 951736580
    step 2 before 961292300 after 884212140
    step 7 before 768905140 after 547267580
    step 15 before 771319480 after 456550640
    step 63 before 504847640 after 242704304
    step 64 before 392484800 after 177920786
    step 65 before 491162160 after 246895264
    step 128 before 208084064 after 97348392
    step 256 before 112401035 after 51408126
    step 512 before 75825834 after 29145070
    step 12345 before 5603166 after 2847330

    normal-lookup

    step 1 before 1025677120 after 861375100
    step 2 before 647220080 after 572258540
    step 7 before 505518960 after 484041813
    step 15 before 430483053 after 444815320 *
    step 63 before 388113453 after 404250546 *
    step 64 before 374154666 after 396027440 *
    step 65 before 381423973 after 396704853 *
    step 128 before 190078700 after 202619384 *
    step 256 before 100886756 after 102829108 *
    step 512 before 64074505 after 56158720
    step 12345 before 4237289 after 4422299 *

    * i686 on Sandy bridge

    radix-tree with 1024 slots:

    tagged lookup

    step 1 before 7990 after 4019
    step 2 before 5698 after 2897
    step 3 before 5013 after 2475
    step 4 before 4630 after 1721
    step 5 before 4346 after 1759
    step 6 before 4299 after 1556
    step 7 before 4098 after 1513
    step 8 before 4115 after 1222
    step 9 before 3983 after 1390
    step 10 before 4077 after 1207
    step 11 before 3921 after 1231
    step 12 before 3894 after 1116
    step 13 before 3840 after 1147
    step 14 before 3799 after 1090
    step 15 before 3797 after 1059
    step 16 before 3783 after 745

    normal lookup

    step 1 before 5103 after 3499
    step 2 before 3299 after 2550
    step 3 before 2489 after 2370
    step 4 before 2034 after 2302 *
    step 5 before 1846 after 2268 *
    step 6 before 1752 after 2249 *
    step 7 before 1679 after 2164 *
    step 8 before 1627 after 2153 *
    step 9 before 1542 after 2095 *
    step 10 before 1479 after 2109 *
    step 11 before 1469 after 2009 *
    step 12 before 1445 after 2039 *
    step 13 before 1411 after 2013 *
    step 14 before 1374 after 2046 *
    step 15 before 1340 after 1975 *
    step 16 before 1331 after 2000 *

    radix-tree with 1024*1024*128 slots:

    tagged lookup

    step 1 before 1225865377 after 667153553
    step 2 before 842427423 after 471533007
    step 7 before 609296153 after 276260116
    step 15 before 544232060 after 226859105
    step 63 before 519209199 after 141343043
    step 64 before 588980279 after 141951339
    step 65 before 521099710 after 138282060
    step 128 before 298476778 after 83390628
    step 256 before 149358342 after 43602609
    step 512 before 76994713 after 22911077
    step 12345 before 5328666 after 1472111

    normal lookup

    step 1 before 819284564 after 533635310
    step 2 before 512421605 after 364956155
    step 7 before 271443305 after 305721345 *
    step 15 before 223591630 after 273960216 *
    step 63 before 190320247 after 217770207 *
    step 64 before 178538168 after 267411372 *
    step 65 before 186400423 after 215347937 *
    step 128 before 88106045 after 140540612 *
    step 256 before 44812420 after 70660377 *
    step 512 before 24435438 after 36328275 *
    step 12345 before 2123924 after 2148062 *

    bloat-o-meter delta for this patchset + patchset with related shmem cleanups

    bloat-o-meter: x86_64

    add/remove: 4/3 grow/shrink: 5/6 up/down: 928/-939 (-11)
    function old new delta
    radix_tree_next_chunk - 499 +499
    shmem_unuse 428 554 +126
    shmem_radix_tree_replace 131 227 +96
    find_get_pages_tag 354 419 +65
    find_get_pages_contig 345 407 +62
    find_get_pages 362 396 +34
    __kstrtab_radix_tree_next_chunk - 22 +22
    __ksymtab_radix_tree_next_chunk - 16 +16
    __kcrctab_radix_tree_next_chunk - 8 +8
    radix_tree_gang_lookup_slot 204 203 -1
    static.shmem_xattr_set 384 381 -3
    radix_tree_gang_lookup_tag_slot 208 191 -17
    radix_tree_gang_lookup 231 187 -44
    radix_tree_gang_lookup_tag 247 199 -48
    shmem_unlock_mapping 278 190 -88
    __lookup 217 - -217
    __lookup_tag 242 - -242
    radix_tree_locate_item 279 - -279

    bloat-o-meter: i386

    add/remove: 3/3 grow/shrink: 8/9 up/down: 1075/-1275 (-200)
    function old new delta
    radix_tree_next_chunk - 757 +757
    shmem_unuse 352 449 +97
    find_get_pages_contig 269 322 +53
    shmem_radix_tree_replace 113 154 +41
    find_get_pages_tag 277 318 +41
    dcache_dir_lseek 426 458 +32
    __kstrtab_radix_tree_next_chunk - 22 +22
    vc_do_resize 968 977 +9
    snd_pcm_lib_read1 725 733 +8
    __ksymtab_radix_tree_next_chunk - 8 +8
    netlbl_cipsov4_list 1120 1127 +7
    find_get_pages 293 291 -2
    new_slab 467 459 -8
    bitfill_unaligned_rev 425 417 -8
    radix_tree_gang_lookup_tag_slot 177 146 -31
    blk_dump_cmd 267 229 -38
    radix_tree_gang_lookup_slot 212 134 -78
    shmem_unlock_mapping 221 128 -93
    radix_tree_gang_lookup_tag 275 162 -113
    radix_tree_gang_lookup 255 126 -129
    __lookup 227 - -227
    __lookup_tag 271 - -271
    radix_tree_locate_item 277 - -277

    This patch:

    Implement a clean, simple and effective radix-tree iteration routine.

    Iterating divided into two phases:
    * lookup next chunk in radix-tree leaf node
    * iterating through slots in this chunk

    Main iterator function radix_tree_next_chunk() returns pointer to first
    slot, and stores in the struct radix_tree_iter index of next-to-last slot.
    For tagged-iterating it also constuct bitmask of tags for retunted chunk.
    All additional logic implemented as static-inline functions and macroses.

    Also adds radix_tree_find_next_bit() static-inline variant of
    find_next_bit() optimized for small constant size arrays, because
    find_next_bit() too heavy for searching in an array with one/two long
    elements.

    [akpm@linux-foundation.org: rework comments a bit]
    Signed-off-by: Konstantin Khlebnikov
    Tested-by: Hugh Dickins
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     

08 Mar, 2012

1 commit


13 Jan, 2012

1 commit

  • Down, down in the deepest depths of GFP_NOIO page reclaim, we have
    shrink_page_list() calling __remove_mapping() calling __delete_from_
    swap_cache() or __delete_from_page_cache().

    You would not expect those to need much stack, but in fact they call
    radix_tree_delete(): which declares a 192-byte radix_tree_path array on
    its stack (to record the node,offsets it visits when descending, in case
    it needs to ascend to update them). And if any tag is still set [1],
    that calls radix_tree_tag_clear(), which declares a further such
    192-byte radix_tree_path array on the stack. (At least we have
    interrupts disabled here, so won't then be pushing registers too.)

    That was probably a good choice when most users were 32-bit (array of
    half the size), and adding fields to radix_tree_node would have bloated
    it unnecessarily. But nowadays many are 64-bit, and each
    radix_tree_node contains a struct rcu_head, which is only used when
    freeing; whereas the radix_tree_path info is only used for updating the
    tree (deleting, clearing tags or setting tags if tagged) when a lock
    must be held, of no interest when accessing the tree locklessly.

    So add a parent pointer to the radix_tree_node, in union with the
    rcu_head, and remove all uses of the radix_tree_path. There would be
    space in that union to save the offset when descending as before (we can
    argue that a lock must already be held to exclude other users), but
    recalculating it when ascending is both easy (a constant shift and a
    constant mask) and uncommon, so it seems better just to do that.

    Two little optimizations: no need to decrement height when descending,
    adjusting shift is enough; and once radix_tree_tag_if_tagged() has set
    tag on a node and its ancestors, it need not ascend from that node
    again.

    perf on the radix tree test harness reports radix_tree_insert() as 2%
    slower (now having to set parent), but radix_tree_delete() 24% faster.
    Surely that's an exaggeration from rtth's artificially low map shift 3,
    but forcing it back to 6 still rates radix_tree_delete() 8% faster.

    [1] Can a pagecache tag (dirty, writeback or towrite) actually still be
    set at the time of radix_tree_delete()? Perhaps not if the filesystem is
    well-behaved. But although I've not tracked any stack overflow down to
    this cause, I have observed a curious case in which a dirty tag is set
    and left set on tmpfs: page migration's migrate_page_copy() happens to
    use __set_page_dirty_nobuffers() to set PageDirty on the newpage, and
    that sets PAGECACHE_TAG_DIRTY as a side-effect - harmless to a
    filesystem which doesn't use tags, except for this stack depth issue.

    Signed-off-by: Hugh Dickins
    Cc: Jan Kara
    Cc: Dave Chinner
    Cc: Mel Gorman
    Cc: Nai Xia
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

01 Nov, 2011

1 commit

  • radix_tree_tag_get()'s BUG (when it sees a tag after saw_unset_tag) was
    unsafe and removed in 2.6.34, but the pointless saw_unset_tag left behind.

    Remove it now, and return 0 as soon as we see unset tag - we already rely
    upon the root tag to be correct, returning 0 immediately if it's not set.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

04 Aug, 2011

2 commits

  • We have already acknowledged that swapoff of a tmpfs file is slower than
    it was before conversion to the generic radix_tree: a little slower
    there will be acceptable, if the hotter paths are faster.

    But it was a shock to find swapoff of a 500MB file 20 times slower on my
    laptop, taking 10 minutes; and at that rate it significantly slows down
    my testing.

    Now, most of that turned out to be overhead from PROVE_LOCKING and
    PROVE_RCU: without those it was only 4 times slower than before; and
    more realistic tests on other machines don't fare as badly.

    I've tried a number of things to improve it, including tagging the swap
    entries, then doing lookup by tag: I'd expected that to halve the time,
    but in practice it's erratic, and often counter-productive.

    The only change I've so far found to make a consistent improvement, is
    to short-circuit the way we go back and forth, gang lookup packing
    entries into the array supplied, then shmem scanning that array for the
    target entry. Scanning in place doubles the speed, so it's now only
    twice as slow as before (or three times slower when the PROVEs are on).

    So, add radix_tree_locate_item() as an expedient, once-off,
    single-caller hack to do the lookup directly in place. #ifdef it on
    CONFIG_SHMEM and CONFIG_SWAP, as much to document its limited
    applicability as save space in other configurations. And, sadly,
    #include sched.h for cond_resched().

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • A patchset to extend tmpfs to MAX_LFS_FILESIZE by abandoning its
    peculiar swap vector, instead keeping a file's swap entries in the same
    radix tree as its struct page pointers: thus saving memory, and
    simplifying its code and locking.

    This patch:

    The radix_tree is used by several subsystems for different purposes. A
    major use is to store the struct page pointers of a file's pagecache for
    memory management. But what if mm wanted to store something other than
    page pointers there too?

    The low bit of a radix_tree entry is already used to denote an indirect
    pointer, for internal use, and the unlikely radix_tree_deref_retry()
    case.

    Define the next bit as denoting an exceptional entry, and supply inline
    functions radix_tree_exception() to return non-0 in either unlikely
    case, and radix_tree_exceptional_entry() to return non-0 in the second
    case.

    If a subsystem already uses radix_tree with that bit set, no problem: it
    does not affect internal workings at all, but is defined for the
    convenience of those storing well-aligned pointers in the radix_tree.

    The radix_tree_gang_lookups have an implicit assumption that the caller
    can deduce the offset of each entry returned e.g. by the page->index of
    a struct page. But that may not be feasible for some kinds of item to
    be stored there.

    radix_tree_gang_lookup_slot() allow for an optional indices argument,
    output array in which to return those offsets. The same could be added
    to other radix_tree_gang_lookups, but for now keep it to the only one
    for which we need it.

    Signed-off-by: Hugh Dickins
    Acked-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

26 Jan, 2011

1 commit

  • Executed command: fsstress -d /mnt -n 600 -p 850

    crash> bt
    PID: 7947 TASK: ffff880160546a70 CPU: 0 COMMAND: "fsstress"
    #0 [ffff8800dfc07d00] machine_kexec at ffffffff81030db9
    #1 [ffff8800dfc07d70] crash_kexec at ffffffff810a7952
    #2 [ffff8800dfc07e40] oops_end at ffffffff814aa7c8
    #3 [ffff8800dfc07e70] die_nmi at ffffffff814aa969
    #4 [ffff8800dfc07ea0] do_nmi_callback at ffffffff8102b07b
    #5 [ffff8800dfc07f10] do_nmi at ffffffff814aa514
    #6 [ffff8800dfc07f50] nmi at ffffffff814a9d60
    [exception RIP: __lookup_tag+100]
    RIP: ffffffff812274b4 RSP: ffff88016056b998 RFLAGS: 00000287
    RAX: 0000000000000000 RBX: 0000000000000002 RCX: 0000000000000006
    RDX: 000000000000001d RSI: ffff88016056bb18 RDI: ffff8800c85366e0
    RBP: ffff88016056b9c8 R8: ffff88016056b9e8 R9: 0000000000000000
    R10: 000000000000000e R11: ffff8800c8536908 R12: 0000000000000010
    R13: 0000000000000040 R14: ffffffffffffffc0 R15: ffff8800c85366e0
    ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018

    #7 [ffff88016056b998] __lookup_tag at ffffffff812274b4
    #8 [ffff88016056b9d0] radix_tree_gang_lookup_tag_slot at ffffffff81227605
    #9 [ffff88016056ba20] find_get_pages_tag at ffffffff810fc110
    #10 [ffff88016056ba80] pagevec_lookup_tag at ffffffff81105e85
    #11 [ffff88016056baa0] write_cache_pages at ffffffff81104c47
    #12 [ffff88016056bbd0] generic_writepages at ffffffff81105014
    #13 [ffff88016056bbe0] do_writepages at ffffffff81105055
    #14 [ffff88016056bbf0] __filemap_fdatawrite_range at ffffffff810fb2cb
    #15 [ffff88016056bc40] filemap_write_and_wait_range at ffffffff810fb32a
    #16 [ffff88016056bc70] generic_file_direct_write at ffffffff810fb3dc
    #17 [ffff88016056bce0] __generic_file_aio_write at ffffffff810fcee5
    #18 [ffff88016056bda0] generic_file_aio_write at ffffffff810fd085
    #19 [ffff88016056bdf0] do_sync_write at ffffffff8114f9ea
    #20 [ffff88016056bf00] vfs_write at ffffffff8114fcf8
    #21 [ffff88016056bf30] sys_write at ffffffff81150691
    #22 [ffff88016056bf80] system_call_fastpath at ffffffff8100c0b2

    I think this root cause is the following:

    radix_tree_range_tag_if_tagged() always tags the root tag with settag
    if the root tag is set with iftag even if there are no iftag tags
    in the specified range (Of course, there are some iftag tags
    outside the specified range).

    ===============================================================================
    [[[Detailed description]]]

    (1) Why cannot radix_tree_gang_lookup_tag_slot() return forever?

    __lookup_tag():
    - Return with 0.
    - Return with the index which is not bigger than the old one as the
    input parameter.

    Therefore the following "while" repeats forever because the above
    conditions cause "ret" not to be updated and the cur_index cannot be
    changed into the bigger one.

    (So, radix_tree_gang_lookup_tag_slot() cannot return forever.)

    radix_tree_gang_lookup_tag_slot():
    1178 while (ret < max_items) {
    1179 unsigned int slots_found;
    1180 unsigned long next_index; /* Index of next search */
    1181
    1182 if (cur_index > max_index)
    1183 break;
    1184 slots_found = __lookup_tag(node, results + ret,
    1185 cur_index, max_items - ret, &next_index,
    tag);
    1186 ret += slots_found;
    // cannot update ret because slots_found == 0.
    // so, this while loops forever.
    1187 if (next_index == 0)
    1188 break;
    1189 cur_index = next_index;
    1190 }

    (2) Why does __lookup_tag() return with 0 and doesn't update the index?

    Assuming the following:
    - the one of the slot in radix_tree_node is NULL.
    - the one of the tag which corresponds to the slot sets with
    PAGECACHE_TAG_TOWRITE or other.
    - In a certain height(!=0), the corresponding index is 0.

    a) __lookup_tag() notices that the tag is set.

    1005 static unsigned int
    1006 __lookup_tag(struct radix_tree_node *slot, void ***results, unsigned long index,
    1007 unsigned int max_items, unsigned long *next_index, unsigned int tag)
    1008 {
    1009 unsigned int nr_found = 0;
    1010 unsigned int shift, height;
    1011
    1012 height = slot->height;
    1013 if (height == 0)
    1014 goto out;
    1015 shift = (height-1) * RADIX_TREE_MAP_SHIFT;
    1016
    1017 while (height > 0) {
    1018 unsigned long i = (index >> shift) & RADIX_TREE_MAP_MASK ;
    1019
    1020 for (;;) {
    1021 if (tag_get(slot, tag, i))
    1022 break;
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    * the index is not updated yet.

    b) __lookup_tag() notices that the slot is NULL.

    1023 index &= ~((1UL << shift) - 1);
    1024 index += 1UL << shift;
    1025 if (index == 0)
    1026 goto out; /* 32-bit wraparound */
    1027 i++;
    1028 if (i == RADIX_TREE_MAP_SIZE)
    1029 goto out;
    1030 }
    1031 height--;
    1032 if (height == 0) { /* Bottom level: grab some items */
    ...
    1055 }
    1056 shift -= RADIX_TREE_MAP_SHIFT;
    1057 slot = rcu_dereference_raw(slot->slots[i]);
    1058 if (slot == NULL)
    1059 break;
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

    c) __lookup_tag() doesn't update the index and return with 0.

    1060 }
    1061 out:
    1062 *next_index = index;
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    1063 return nr_found;
    1064 }

    (3) Why is the slot NULL even if the tag is set?

    Because radix_tree_range_tag_if_tagged() always sets the root tag with
    PAGECACHE_TAG_TOWRITE if the root tag is set with PAGECACHE_TAG_DIRTY,
    even if there is no tag which can be set with PAGECACHE_TAG_TOWRITE
    in the specified range (from *first_indexp to last_index). Of course,
    some PAGECACHE_TAG_DIRTY nodes must exist outside the specified range.
    (radix_tree_range_tag_if_tagged() is called only from tag_pages_for_writeback())

    640 unsigned long radix_tree_range_tag_if_tagged(struct radix_tree_root
    *root,
    641 unsigned long *first_indexp, unsigned long last_index,
    642 unsigned long nr_to_tag,
    643 unsigned int iftag, unsigned int settag)
    644 {
    645 unsigned int height = root->height;
    646 struct radix_tree_path path[height];
    647 struct radix_tree_path *pathp = path;
    648 struct radix_tree_node *slot;
    649 unsigned int shift;
    650 unsigned long tagged = 0;
    651 unsigned long index = *first_indexp;
    652
    653 last_index = min(last_index, radix_tree_maxindex(height));
    654 if (index > last_index)
    655 return 0;
    656 if (!nr_to_tag)
    657 return 0;
    658 if (!root_tag_get(root, iftag)) {
    659 *first_indexp = last_index + 1;
    660 return 0;
    661 }
    662 if (height == 0) {
    663 *first_indexp = last_index + 1;
    664 root_tag_set(root, settag);
    665 return 1;
    666 }
    ...
    733 root_tag_set(root, settag);
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    734 *first_indexp = index;
    735
    736 return tagged;
    737 }

    As the result, there is no radix_tree_node which is set with
    PAGECACHE_TAG_TOWRITE but the root tag(radix_tree_root) is set with
    PAGECACHE_TAG_TOWRITE.

    [figure: inside radix_tree]
    (Please see the figure with typewriter font)
    ===========================================
    [roottag = DIRTY]
    | tag=0:NOTHING
    tag[0 0 0 1] 1:DIRTY
    [x x x +] 2:WRITEBACK
    | 3:DIRTY,WRITEBACK
    p 4:TOWRITE
    5:DIRTY,TOWRITE ...
    specified range (index: 0 to 2)

    * There is no DIRTY tag within the specified range.
    (But there is a DIRTY tag outside that range.)

    | | | | | | | | |
    after calling tag_pages_for_writeback()
    | | | | | | | | |
    v v v v v v v v v

    [roottag = DIRTY,TOWRITE]
    | p is "page".
    tag[0 0 0 1] x is NULL.
    [x x x +] +- is a pointer to "page".
    |
    p

    * But TOWRITE tag is set on the root tag.
    ============================================

    After that, radix_tree_extend() via radix_tree_insert() is called
    when the page is added.
    This function sets the new radix_tree_node with PAGECACHE_TAG_TOWRITE
    to succeed the status of the root tag.

    246 static int radix_tree_extend(struct radix_tree_root *root, unsigned long
    index)
    247 {
    248 struct radix_tree_node *node;
    249 unsigned int height;
    250 int tag;
    251
    252 /* Figure out what the height should be. */
    253 height = root->height + 1;
    254 while (index > radix_tree_maxindex(height))
    255 height++;
    256
    257 if (root->rnode == NULL) {
    258 root->height = height;
    259 goto out;
    260 }
    261
    262 do {
    263 unsigned int newheight;
    264 if (!(node = radix_tree_node_alloc(root)))
    265 return -ENOMEM;
    266
    267 /* Increase the height. */
    268 node->slots[0] = radix_tree_indirect_to_ptr(root->rnode);
    269
    270 /* Propagate the aggregated tag info into the new root */
    271 for (tag = 0; tag < RADIX_TREE_MAX_TAGS; tag++) {
    272 if (root_tag_get(root, tag))
    273 tag_set(node, tag, 0);
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    274 }

    ===========================================
    [roottag = DIRTY,TOWRITE]
    | :
    tag[0 0 0 1] [0 0 0 0]
    [x x x +] [+ x x x]
    | |
    p p (new page)

    | | | | | | | | |
    after calling radix_tree_insert
    | | | | | | | | |
    v v v v v v v v v

    [roottag = DIRTY,TOWRITE]
    |
    tag [5 0 0 0] * DIRTY and TOWRITE tags are
    [+ + x x] succeeded to the new node.
    | |
    tag [0 0 0 1] [0 0 0 0]
    [x x x +] [+ x x x]
    | |
    p p
    ============================================

    After that, the index 3 page is released by remove_from_page_cache().
    Then we can make the situation that the tag is set with PAGECACHE_TAG_TOWRITE
    and that the slot which corresponds to the tag is NULL.
    ===========================================
    [roottag = DIRTY,TOWRITE]
    |
    tag [5 0 0 0]
    [+ + x x]
    | |
    tag [0 0 0 1] [0 0 0 0]
    [x x x +] [+ x x x]
    | |
    p p
    (remove)

    | | | | | | | | |
    after calling remove_page_cache
    | | | | | | | | |
    v v v v v v v v v

    [roottag = DIRTY,TOWRITE]
    |
    tag [4 0 0 0] * Only DIRTY tag is cleared
    [x + x x] because no TOWRITE tag is existed
    | in the bottom node.
    [0 0 0 0]
    [+ x x x]
    |
    p
    ============================================

    To solve this problem

    Change to that radix_tree_tag_if_tagged() doesn't tag the root tag
    if it doesn't set any tags within the specified range.

    Like this.
    ============================================
    640 unsigned long radix_tree_range_tag_if_tagged(struct radix_tree_root
    *root,
    641 unsigned long *first_indexp, unsigned long last_index,
    642 unsigned long nr_to_tag,
    643 unsigned int iftag, unsigned int settag)
    644 {
    650 unsigned long tagged = 0;
    ...
    733 if (tagged)
    ^^^^^^^^^^^^^^^^^^^^^^^^
    734 root_tag_set(root, settag);
    735 *first_indexp = index;
    736
    737 return tagged;
    738 }

    ============================================

    Signed-off-by: Toshiyuki Okajima
    Acked-by: Jan Kara
    Cc: Dave Chinner
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Toshiyuki Okajima
     

12 Nov, 2010

1 commit

  • Salman Qazi describes the following radix-tree bug:

    In the following case, we get can get a deadlock:

    0. The radix tree contains two items, one has the index 0.
    1. The reader (in this case find_get_pages) takes the rcu_read_lock.
    2. The reader acquires slot(s) for item(s) including the index 0 item.
    3. The non-zero index item is deleted, and as a consequence the other item is
    moved to the root of the tree. The place where it used to be is queued for
    deletion after the readers finish.
    3b. The zero item is deleted, removing it from the direct slot, it remains in
    the rcu-delayed indirect node.
    4. The reader looks at the index 0 slot, and finds that the page has 0 ref
    count
    5. The reader looks at it again, hoping that the item will either be freed or
    the ref count will increase. This never happens, as the slot it is looking
    at will never be updated. Also, this slot can never be reclaimed because
    the reader is holding rcu_read_lock and is in an infinite loop.

    The fix is to re-use the same "indirect" pointer case that requires a slot
    lookup retry into a general "retry the lookup" bit.

    Signed-off-by: Nick Piggin
    Reported-by: Salman Qazi
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     

07 Oct, 2010

1 commit


23 Aug, 2010

4 commits

  • …/linux-2.6-rcu into core/rcu

    Ingo Molnar
     
  • * 'radix-tree' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/xfsdev:
    radix-tree: radix_tree_range_tag_if_tagged() can set incorrect tags
    radix-tree: clear all tags in radix_tree_node_rcu_free

    Linus Torvalds
     
  • Commit ebf8aa44beed48cd17893a83d92a4403e5f9d9e2 ("radix-tree:
    omplement function radix_tree_range_tag_if_tagged") does not safely
    set tags on on intermediate tree nodes. The code walks down the tree
    setting tags before it has fully resolved the path to the leaf under
    the assumption there will be a leaf slot with the tag set in the
    range it is searching.

    Unfortunately, this is not a valid assumption - we can abort after
    setting a tag on an intermediate node if we overrun the number of
    tags we are allowed to set in a batch, or stop scanning because we
    we have passed the last scan index before we reach a leaf slot with
    the tag we are searching for set.

    As a result, we can leave the function with tags set on intemediate
    nodes which can be tripped over later by tag-based lookups. The
    result of these stale tags is that lookup may end prematurely or
    livelock because the lookup cannot make progress.

    The fix for the problem involves reocrding the traversal path we
    take to the leaf nodes, and only propagating the tags back up the
    tree once the tag is set in the leaf node slot. We are already
    recording the path for efficient traversal, so there is no
    additional overhead to do the intermediately node tag setting in
    this manner.

    This fixes a radix tree lookup livelock triggered by the new
    writeback sync livelock avoidance code introduced in commit
    f446daaea9d4a420d16c606f755f3689dcb2d0ce ("mm: implement writeback
    livelock avoidance using page tagging").

    Signed-off-by: Dave Chinner
    Acked-by: Jan Kara

    Dave Chinner
     
  • Commit f446daaea9d4a420d16c606f755f3689dcb2d0ce ("mm: implement
    writeback livelock avoidance using page tagging") introduced a new
    radix tree tag, increasing the number of tags in each node from 2 to
    3. It did not, however, fix up the code in
    radix_tree_node_rcu_free() that cleans up after radix_tree_shrink()
    and hence could leave stray tags set in the new tag array.

    The result is that the livelock avoidance code added in the the
    above commit would hit stale tags when doing tag based lookups,
    resulting in livelocks when trying to traverse the tree.

    Fix this problem in radix_tree_node_rcu_free() so it doesn't happen
    again in the future by using a loop to walk all the tags up to
    RADIX_TREE_MAX_TAGS to clear the stray tags radix_tree_shrink()
    leaves behind.

    Signed-off-by: Dave Chinner
    Acked-by: Nick Piggin
    Acked-by: Jan Kara

    Dave Chinner
     

21 Aug, 2010

1 commit


20 Aug, 2010

1 commit


10 Aug, 2010

1 commit


28 May, 2010

1 commit


10 Apr, 2010

1 commit

  • radix_tree_tag_get() is not safe to use concurrently with radix_tree_tag_set()
    or radix_tree_tag_clear(). The problem is that the double tag_get() in
    radix_tree_tag_get():

    if (!tag_get(node, tag, offset))
    saw_unset_tag = 1;
    if (height == 1) {
    int ret = tag_get(node, tag, offset);

    may see the value change due to the action of set/clear. RCU is no protection
    against this as no pointers are being changed, no nodes are being replaced
    according to a COW protocol - set/clear alter the node directly.

    The documentation in linux/radix-tree.h, however, says that
    radix_tree_tag_get() is an exception to the rule that "any function modifying
    the tree or tags (...) must exclude other modifications, and exclude any
    functions reading the tree".

    The problem is that the next statement in radix_tree_tag_get() checks that the
    tag doesn't vary over time:

    BUG_ON(ret && saw_unset_tag);

    This has been seen happening in FS-Cache:

    https://www.redhat.com/archives/linux-cachefs/2010-April/msg00013.html

    To this end, remove the BUG_ON() from radix_tree_tag_get() and note in various
    comments that the value of the tag may change whilst the RCU read lock is held,
    and thus that the return value of radix_tree_tag_get() may not be relied upon
    unless radix_tree_tag_set/clear() and radix_tree_delete() are excluded from
    running concurrently with it.

    Reported-by: Romain DEGEZ
    Signed-off-by: David Howells
    Acked-by: Nick Piggin
    Signed-off-by: Linus Torvalds

    David Howells
     

30 Mar, 2010

1 commit

  • …it slab.h inclusion from percpu.h

    percpu.h is included by sched.h and module.h and thus ends up being
    included when building most .c files. percpu.h includes slab.h which
    in turn includes gfp.h making everything defined by the two files
    universally available and complicating inclusion dependencies.

    percpu.h -> slab.h dependency is about to be removed. Prepare for
    this change by updating users of gfp and slab facilities include those
    headers directly instead of assuming availability. As this conversion
    needs to touch large number of source files, the following script is
    used as the basis of conversion.

    http://userweb.kernel.org/~tj/misc/slabh-sweep.py

    The script does the followings.

    * Scan files for gfp and slab usages and update includes such that
    only the necessary includes are there. ie. if only gfp is used,
    gfp.h, if slab is used, slab.h.

    * When the script inserts a new include, it looks at the include
    blocks and try to put the new include such that its order conforms
    to its surrounding. It's put in the include block which contains
    core kernel includes, in the same order that the rest are ordered -
    alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
    doesn't seem to be any matching order.

    * If the script can't find a place to put a new include (mostly
    because the file doesn't have fitting include block), it prints out
    an error message indicating which .h file needs to be added to the
    file.

    The conversion was done in the following steps.

    1. The initial automatic conversion of all .c files updated slightly
    over 4000 files, deleting around 700 includes and adding ~480 gfp.h
    and ~3000 slab.h inclusions. The script emitted errors for ~400
    files.

    2. Each error was manually checked. Some didn't need the inclusion,
    some needed manual addition while adding it to implementation .h or
    embedding .c file was more appropriate for others. This step added
    inclusions to around 150 files.

    3. The script was run again and the output was compared to the edits
    from #2 to make sure no file was left behind.

    4. Several build tests were done and a couple of problems were fixed.
    e.g. lib/decompress_*.c used malloc/free() wrappers around slab
    APIs requiring slab.h to be added manually.

    5. The script was run on all .h files but without automatically
    editing them as sprinkling gfp.h and slab.h inclusions around .h
    files could easily lead to inclusion dependency hell. Most gfp.h
    inclusion directives were ignored as stuff from gfp.h was usually
    wildly available and often used in preprocessor macros. Each
    slab.h inclusion directive was examined and added manually as
    necessary.

    6. percpu.h was updated not to include slab.h.

    7. Build test were done on the following configurations and failures
    were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
    distributed build env didn't work with gcov compiles) and a few
    more options had to be turned off depending on archs to make things
    build (like ipr on powerpc/64 which failed due to missing writeq).

    * x86 and x86_64 UP and SMP allmodconfig and a custom test config.
    * powerpc and powerpc64 SMP allmodconfig
    * sparc and sparc64 SMP allmodconfig
    * ia64 SMP allmodconfig
    * s390 SMP allmodconfig
    * alpha SMP allmodconfig
    * um on x86_64 SMP allmodconfig

    8. percpu.h modifications were reverted so that it could be applied as
    a separate patch and serve as bisection point.

    Given the fact that I had only a couple of failures from tests on step
    6, I'm fairly confident about the coverage of this conversion patch.
    If there is a breakage, it's likely to be something in one of the arch
    headers which should be easily discoverable easily on most builds of
    the specific arch.

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>

    Tejun Heo
     

25 Feb, 2010

1 commit

  • Because the radix tree is used with many different locking
    designs, we cannot do any effective checking without changing
    the radix-tree APIs. It might make sense to do this later, but
    only if the RCU lockdep checking proves itself sufficiently
    valuable.

    Signed-off-by: Paul E. McKenney
    Cc: laijs@cn.fujitsu.com
    Cc: dipankar@in.ibm.com
    Cc: mathieu.desnoyers@polymtl.ca
    Cc: josh@joshtriplett.org
    Cc: dvhltc@us.ibm.com
    Cc: niv@us.ibm.com
    Cc: peterz@infradead.org
    Cc: rostedt@goodmis.org
    Cc: Valdis.Kletnieks@vt.edu
    Cc: dhowells@redhat.com
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Paul E. McKenney
     

20 Nov, 2009

2 commits

  • Don't delete pending pages from the page-store tracking tree, but rather send
    them for another write as they've presumably been updated.

    Signed-off-by: David Howells

    David Howells
     
  • __fscache_write_page() attempts to load the radix tree preallocation pool for
    the CPU it is on before calling radix_tree_insert(), as the insertion must be
    done inside a pair of spinlocks.

    Use of the preallocation pool, however, is contingent on the radix tree being
    initialised without __GFP_WAIT specified. __fscache_acquire_cookie() was
    passing GFP_NOFS to INIT_RADIX_TREE() - but that includes __GFP_WAIT.

    The solution is to AND out __GFP_WAIT.

    Additionally, the banner comment to radix_tree_preload() is altered to make
    note of this prerequisite. Possibly there should be a WARN_ON() too.

    Without this fix, I have seen the following recursive deadlock caused by
    radix_tree_insert() attempting to allocate memory inside the spinlocked
    region, which resulted in FS-Cache being called back into to release memory -
    which required the spinlock already held.

    =============================================
    [ INFO: possible recursive locking detected ]
    2.6.32-rc6-cachefs #24
    ---------------------------------------------
    nfsiod/7916 is trying to acquire lock:
    (&cookie->lock){+.+.-.}, at: [] __fscache_uncache_page+0xdb/0x160 [fscache]

    but task is already holding lock:
    (&cookie->lock){+.+.-.}, at: [] __fscache_write_page+0x15c/0x3f3 [fscache]

    other info that might help us debug this:
    5 locks held by nfsiod/7916:
    #0: (nfsiod){+.+.+.}, at: [] worker_thread+0x19a/0x2e2
    #1: (&task->u.tk_work#2){+.+.+.}, at: [] worker_thread+0x19a/0x2e2
    #2: (&cookie->lock){+.+.-.}, at: [] __fscache_write_page+0x15c/0x3f3 [fscache]
    #3: (&object->lock#2){+.+.-.}, at: [] __fscache_write_page+0x197/0x3f3 [fscache]
    #4: (&cookie->stores_lock){+.+...}, at: [] __fscache_write_page+0x19f/0x3f3 [fscache]

    stack backtrace:
    Pid: 7916, comm: nfsiod Not tainted 2.6.32-rc6-cachefs #24
    Call Trace:
    [] __lock_acquire+0x1649/0x16e3
    [] ? __lock_acquire+0x7b7/0x16e3
    [] ? dump_trace+0x248/0x257
    [] lock_acquire+0x57/0x6d
    [] ? __fscache_uncache_page+0xdb/0x160 [fscache]
    [] _spin_lock+0x2c/0x3b
    [] ? __fscache_uncache_page+0xdb/0x160 [fscache]
    [] __fscache_uncache_page+0xdb/0x160 [fscache]
    [] ? __fscache_check_page_write+0x0/0x71 [fscache]
    [] nfs_fscache_release_page+0x86/0xc4 [nfs]
    [] nfs_release_page+0x3c/0x41 [nfs]
    [] try_to_release_page+0x32/0x3b
    [] shrink_page_list+0x316/0x4ac
    [] ? mark_held_locks+0x52/0x70
    [] ? _spin_unlock_irq+0x2b/0x31
    [] shrink_inactive_list+0x392/0x67c
    [] ? mark_held_locks+0x52/0x70
    [] shrink_list+0x8d/0x8f
    [] shrink_zone+0x278/0x33c
    [] ? ktime_get_ts+0xad/0xba
    [] try_to_free_pages+0x22e/0x392
    [] ? isolate_pages_global+0x0/0x212
    [] __alloc_pages_nodemask+0x3dc/0x5cf
    [] cache_alloc_refill+0x34d/0x6c1
    [] ? radix_tree_node_alloc+0x52/0x5c
    [] kmem_cache_alloc+0xb2/0x118
    [] radix_tree_node_alloc+0x52/0x5c
    [] radix_tree_insert+0x57/0x19c
    [] __fscache_write_page+0x1e3/0x3f3 [fscache]
    [] __nfs_readpage_to_fscache+0x58/0x11e [nfs]
    [] nfs_readpage_release+0x34/0x9b [nfs]
    [] nfs_readpage_release_full+0x32/0x4b [nfs]
    [] rpc_release_calldata+0x12/0x14 [sunrpc]
    [] rpc_free_task+0x59/0x61 [sunrpc]
    [] rpc_async_release+0x10/0x12 [sunrpc]
    [] worker_thread+0x1ef/0x2e2
    [] ? worker_thread+0x19a/0x2e2
    [] ? thread_return+0x3e/0x101
    [] ? rpc_async_release+0x0/0x12 [sunrpc]
    [] ? autoremove_wake_function+0x0/0x34
    [] ? trace_hardirqs_on+0xd/0xf
    [] ? worker_thread+0x0/0x2e2
    [] kthread+0x7a/0x82
    [] child_rip+0xa/0x20
    [] ? restore_args+0x0/0x30
    [] ? add_wait_queue+0x15/0x44
    [] ? kthread+0x0/0x82
    [] ? child_rip+0x0/0x20

    Signed-off-by: David Howells

    David Howells
     

17 Jun, 2009

2 commits

  • radix_tree_lookup() and radix_tree_lookup_slot() have much the
    same code except for the return value.

    Introduce radix_tree_lookup_element() to do the real work.

    /*
    * is_slot == 1 : search for the slot.
    * is_slot == 0 : search for the node.
    */
    static void * radix_tree_lookup_element(struct radix_tree_root *root,
    unsigned long index, int is_slot);

    Signed-off-by: Huang Shijie
    Cc: Nick Piggin
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Shijie
     
  • The counterpart of radix_tree_next_hole(). To be used by context readahead.

    Signed-off-by: Wu Fengguang
    Cc: Vladislav Bolkhovitin
    Cc: Jens Axboe
    Cc: Jeff Moyer
    Cc: Nick Piggin
    Cc: Ying Han
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wu Fengguang
     

08 Jan, 2009

1 commit

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (24 commits)
    trivial: chack -> check typo fix in main Makefile
    trivial: Add a space (and a comma) to a printk in 8250 driver
    trivial: Fix misspelling of "firmware" in docs for ncr53c8xx/sym53c8xx
    trivial: Fix misspelling of "firmware" in powerpc Makefile
    trivial: Fix misspelling of "firmware" in usb.c
    trivial: Fix misspelling of "firmware" in qla1280.c
    trivial: Fix misspelling of "firmware" in a100u2w.c
    trivial: Fix misspelling of "firmware" in megaraid.c
    trivial: Fix misspelling of "firmware" in ql4_mbx.c
    trivial: Fix misspelling of "firmware" in acpi_memhotplug.c
    trivial: Fix misspelling of "firmware" in ipw2100.c
    trivial: Fix misspelling of "firmware" in atmel.c
    trivial: Fix misspelled firmware in Kconfig
    trivial: fix an -> a typos in documentation and comments
    trivial: fix then -> than typos in comments and documentation
    trivial: update Jesper Juhl CREDITS entry with new email
    trivial: fix singal -> signal typo
    trivial: Fix incorrect use of "loose" in event.c
    trivial: printk: fix indentation of new_text_line declaration
    trivial: rtc-stk17ta8: fix sparse warning
    ...

    Linus Torvalds
     

07 Jan, 2009

1 commit

  • radix_tree_preloads is unused outside of this file, make it static.

    Noticed by sparse:
    lib/radix-tree.c:84:1: warning: symbol 'per_cpu__radix_tree_preloads' was not declared. Should it be static?

    Signed-off-by: Harvey Harrison
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Harvey Harrison
     

06 Jan, 2009

1 commit


27 Jul, 2008

2 commits

  • Kmem cache passed to constructor is only needed for constructors that are
    themselves multiplexeres. Nobody uses this "feature", nor does anybody uses
    passed kmem cache in non-trivial way, so pass only pointer to object.

    Non-trivial places are:
    arch/powerpc/mm/init_64.c
    arch/powerpc/mm/hugetlbpage.c

    This is flag day, yes.

    Signed-off-by: Alexey Dobriyan
    Acked-by: Pekka Enberg
    Acked-by: Christoph Lameter
    Cc: Jon Tollefson
    Cc: Nick Piggin
    Cc: Matt Mackall
    [akpm@linux-foundation.org: fix arch/powerpc/mm/hugetlbpage.c]
    [akpm@linux-foundation.org: fix mm/slab.c]
    [akpm@linux-foundation.org: fix ubifs]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • Introduce gang_lookup_slot() and gang_lookup_slot_tag() functions, which
    are used by lockless pagecache.

    Signed-off-by: Nick Piggin
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Hugh Dickins
    Cc: "Paul E. McKenney"
    Reviewed-by: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     

05 Jul, 2008

1 commit

  • Remove all clameter@sgi.com addresses from the kernel tree since they will
    become invalid on June 27th. Change my maintainer email address for the
    slab allocators to cl@linux-foundation.org (which will be the new email
    address for the future).

    Signed-off-by: Christoph Lameter
    Signed-off-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Stephen Rothwell
    Cc: Matt Mackall
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

13 Jun, 2008

1 commit

  • We shrink a radix tree when its root node has only one child, in the left
    most slot. The child becomes the new root node. To perform this
    operation in a manner compatible with concurrent lockless lookups, we
    atomically switch the root pointer from the parent to its child.

    However a concurrent lockless lookup may now have loaded a pointer to the
    parent (and is presently deciding what to do next). For this reason, we
    also have to keep the parent node in a valid state after shrinking the
    tree, until the next RCU grace period -- otherwise this lookup with the
    parent pointer may not do the right thing. Notably, we need to keep the
    child in the left most slot there in case that is requested by the lookup.

    This is all pretty standard RCU stuff. It is worth repeating because in
    my eagerness to obey the radix tree node constructor scheme, I had broken
    it by zeroing the radix tree node before the grace period.

    What could happen is that a lookup can load the parent pointer, then
    decide it wants to follow the left most child slot, only to find the slot
    contained NULL due to the concurrent shrinker having zeroed the parent
    node before waiting for a grace period. The lookup would return a false
    negative as a result.

    Fix it by doing that clearing in the RCU callback. I would normally want
    to rip out the constructor entirely, but radix tree nodes are one of those
    places where they make sense (only few cachelines will be touched soon
    after allocation).

    This was never actually found in any lockless pagecache testing or by the
    test harness, but by seeing the odd problem with my scalable vmap rewrite.
    I have not tickled the test harness into reproducing it yet, but I'll
    keep working at it.

    Fortunately, it is not a problem anywhere lockless pagecache is used in
    mainline kernels (pagecache probe is not a guarantee, and brd does not
    have concurrent lookups and deletes).

    Signed-off-by: Nick Piggin
    Acked-by: Peter Zijlstra
    Cc: "Paul E. McKenney"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin