28 May, 2010

2 commits

  • Introduce numa_mem_id(), based on generic percpu variable infrastructure
    to track "nearest node with memory" for archs that support memoryless
    nodes.

    Define API in <linux/topology.h> when CONFIG_HAVE_MEMORYLESS_NODES is
    defined, else stubs. Architectures will define HAVE_MEMORYLESS_NODES
    if/when they support them.

    Archs can override definitions of:

    numa_mem_id() - returns node number of "local memory" node
    set_numa_mem() - initialize [this cpu's] per cpu variable 'numa_mem'
    cpu_to_mem() - return numa_mem for specified cpu; may be used as lvalue

    Generic initialization of 'numa_mem' occurs in __build_all_zonelists().
    This will initialize the boot cpu at boot time, and all cpus on change of
    numa_zonelist_order, or when node or memory hot-plug requires zonelist
    rebuild. Archs that support memoryless nodes will need to initialize
    'numa_mem' for secondary cpus as they're brought on-line.
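
    The idea of a per-cpu "nearest node with memory" can be modelled in user
    space as below. This is only a sketch: the topology tables, the array
    standing in for the per cpu variable, and the helper
    nearest_node_with_memory() are assumptions made for illustration and are
    not the kernel implementation.

```c
#include <assert.h>
#include <limits.h>

#define NR_CPUS  4   /* illustrative sizes, not kernel values */
#define NR_NODES 3

/* Hypothetical topology tables standing in for arch-provided data. */
static const int cpu_node[NR_CPUS] = { 0, 0, 1, 2 };
static const int node_has_memory[NR_NODES] = { 1, 0, 1 };
static const int node_distance[NR_NODES][NR_NODES] = {
    { 10, 20, 30 },
    { 20, 10, 15 },
    { 30, 15, 10 },
};

/* Stand-in for the per cpu variable 'numa_mem'. */
static int numa_mem[NR_CPUS];

/* Return the closest node to 'node' that has memory (possibly itself). */
static int nearest_node_with_memory(int node)
{
    int best = -1, best_dist = INT_MAX;

    for (int n = 0; n < NR_NODES; n++) {
        if (node_has_memory[n] && node_distance[node][n] < best_dist) {
            best = n;
            best_dist = node_distance[node][n];
        }
    }
    return best;
}

/* Model of initializing 'numa_mem' for one cpu. */
static void set_cpu_numa_mem(int cpu, int node)
{
    assert(cpu >= 0 && cpu < NR_CPUS);
    numa_mem[cpu] = node;
}

/* Model of cpu_to_mem(): read back numa_mem for the given cpu. */
static int cpu_to_mem(int cpu)
{
    assert(cpu >= 0 && cpu < NR_CPUS);
    return numa_mem[cpu];
}

/* Model of a generic initialization pass over all cpus. */
static void init_all_numa_mem(void)
{
    for (int cpu = 0; cpu < NR_CPUS; cpu++)
        set_cpu_numa_mem(cpu, nearest_node_with_memory(cpu_node[cpu]));
}
```

    In this model cpu 2 sits on node 1, which has no memory, so its
    'numa_mem' resolves to node 2, the closest node that does.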

    [akpm@linux-foundation.org: fix build]
    Signed-off-by: Lee Schermerhorn
    Signed-off-by: Christoph Lameter
    Cc: Tejun Heo
    Cc: Mel Gorman
    Cc: Christoph Lameter
    Cc: Nick Piggin
    Cc: David Rientjes
    Cc: Eric Whitney
    Cc: KAMEZAWA Hiroyuki
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Cc: "Luck, Tony"
    Cc: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     
  • Rework the generic version of the numa_node_id() function to use the new
    generic percpu variable infrastructure.

    Guard the new implementation with a new config option:

    CONFIG_USE_PERCPU_NUMA_NODE_ID.

    Archs which support this new implementation will default this option to 'y'
    when NUMA is configured. This config option could be removed if/when all
    archs switch over to the generic percpu implementation of numa_node_id().
    Arch support involves:

    1) converting any existing per cpu variable implementations to use
    this implementation. x86_64 is an instance of such an arch.
    2) archs that don't use a per cpu variable for numa_node_id() will
    need to initialize the new per cpu variable "numa_node" as cpus
    are brought on-line. ia64 is an example.
    3) Defining USE_PERCPU_NUMA_NODE_ID in arch dependent Kconfig--e.g.,
    when NUMA is configured. This is required because I have
    retained the old implementation by default to allow archs to
    be modified incrementally, as desired.

    Subsequent patches will convert x86_64 and ia64 to use this implementation.

    Signed-off-by: Lee Schermerhorn
    Cc: Tejun Heo
    Cc: Mel Gorman
    Reviewed-by: Christoph Lameter
    Cc: Nick Piggin
    Cc: David Rientjes
    Cc: Eric Whitney
    Cc: KAMEZAWA Hiroyuki
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Cc: "Luck, Tony"
    Cc: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     

25 May, 2010

10 commits

  • Add global mutex zonelists_mutex to fix the possible race:

    CPU0                                        CPU1                CPU2
    (1) zone->present_pages += online_pages;
    (2) build_all_zonelists();
                                                (3) alloc_page();
                                                (4) free_page();
                                                                    (5) build_all_zonelists();
                                                                    (6)   __build_all_zonelists();
                                                                    (7)     zone->pageset = alloc_percpu();

    In steps (3) and (4), zone->pageset still points to boot_pageset, so bad
    things may happen if 2+ nodes are in this state. Even if only 1 node
    is accessing the boot_pageset, (3) may still consume so much memory
    that the memory allocations in step (7) fail.

    Besides, atomic operation ensures alloc_percpu() in step (7) will never fail
    since there is a new fresh memory block added in step (6).

    [haicheng.li@linux.intel.com: hold zonelists_mutex when build_all_zonelists]
    Signed-off-by: Haicheng Li
    Signed-off-by: Wu Fengguang
    Reviewed-by: Andi Kleen
    Cc: Christoph Lameter
    Cc: Mel Gorman
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Haicheng Li
     
  • For each newly populated zone of a hot-added node, its pagesets need to be
    updated with a dynamically allocated per_cpu_pageset struct for all
    possible CPUs:

    1) Detach zone->pageset from the shared boot_pageset
    at end of __build_all_zonelists().

    2) Use mutex to protect zone->pageset when it's still
    shared in online_pages()

    Otherwise, multiple zones of different nodes would share the same
    bootstrapping boot_pageset for the same CPU, which will finally cause the
    kernel panic below:

    ------------[ cut here ]------------
    kernel BUG at mm/page_alloc.c:1239!
    invalid opcode: 0000 [#1] SMP
    ...
    Call Trace:
    [] __alloc_pages_nodemask+0x131/0x7b0
    [] alloc_pages_current+0x87/0xd0
    [] __page_cache_alloc+0x67/0x70
    [] __do_page_cache_readahead+0x120/0x260
    [] ra_submit+0x21/0x30
    [] ondemand_readahead+0x166/0x2c0
    [] page_cache_async_readahead+0x80/0xa0
    [] generic_file_aio_read+0x364/0x670
    [] nfs_file_read+0xca/0x130
    [] do_sync_read+0xfa/0x140
    [] vfs_read+0xb5/0x1a0
    [] sys_read+0x51/0x80
    [] system_call_fastpath+0x16/0x1b
    RIP [] get_page_from_freelist+0x883/0x900
    RSP
    ---[ end trace 4bda28328b9990db ]

    [akpm@linux-foundation.org: merge fix]
    Signed-off-by: Haicheng Li
    Signed-off-by: Wu Fengguang
    Reviewed-by: Andi Kleen
    Reviewed-by: Christoph Lameter
    Cc: Mel Gorman
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Haicheng Li
     
  • No behavior change here.

    Move some of setup_per_cpu_pageset() code into a new function
    setup_zone_pageset() that will be useful for memory hotplug.

    Signed-off-by: Wu Fengguang
    Signed-off-by: Haicheng Li
    Reviewed-by: Andi Kleen
    Reviewed-by: Christoph Lameter
    Cc: Mel Gorman
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wu Fengguang
     
  • free_hot_cold_page() and __free_pages_ok() have very similar freeing
    preparation. Consolidate them.

    [akpm@linux-foundation.org: fix busted coding style]
    Signed-off-by: KOSAKI Motohiro
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • The fragmentation index may indicate that a failure is due to external
    fragmentation but after a compaction run completes, it is still possible
    for an allocation to fail. There are two obvious reasons as to why:

    o Page migration cannot move all pages so fragmentation remains
    o A suitable page may exist but watermarks are not met

    In the event of compaction followed by an allocation failure, this patch
    defers further compaction in the zone (1 << compact_defer_shift) times.
    If the next compaction attempt also fails, compact_defer_shift is
    increased up to a maximum of 6. If compaction succeeds, the defer
    counters are reset again.

    The zone that is deferred is the first zone in the zonelist - i.e. the
    preferred zone. To defer compaction in the other zones, the information
    would need to be stored in the zonelist or implemented similar to the
    zonelist_cache. This would impact the fast-paths and is not justified at
    this time.
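
    The deferral counters described above can be sketched as a small state
    machine. The struct and function names here are illustrative assumptions,
    and the exact counting in the kernel may differ; only the cap of 6 and the
    (1 << compact_defer_shift) back-off come from the text.

```c
#include <stdbool.h>

#define MAX_DEFER_SHIFT 6   /* cap stated in the commit message */

/* Hypothetical per-zone state; field names are illustrative. */
struct zone_model {
    unsigned int considered;   /* attempts seen since the last deferral */
    unsigned int defer_shift;  /* back-off window is (1 << defer_shift) */
};

/* Called when compaction ran but the allocation still failed. */
static void defer_compaction(struct zone_model *z)
{
    z->considered = 0;
    if (z->defer_shift < MAX_DEFER_SHIFT)
        z->defer_shift++;
}

/* Called when compaction led to a successful allocation. */
static void reset_deferral(struct zone_model *z)
{
    z->considered = 0;
    z->defer_shift = 0;
}

/* Returns true while the zone should still skip compaction. */
static bool compaction_deferred(struct zone_model *z)
{
    unsigned int limit = 1u << z->defer_shift;

    if (z->considered < limit)
        z->considered++;
    return z->considered < limit;
}
```

    In this model the back-off window doubles after each failed compaction
    run, up to 64 attempts, and collapses back to a single attempt once
    compaction succeeds.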

    Signed-off-by: Mel Gorman
    Cc: Rik van Riel
    Cc: Minchan Kim
    Cc: KOSAKI Motohiro
    Cc: Christoph Lameter
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Ordinarily when a high-order allocation fails, direct reclaim is entered
    to free pages to satisfy the allocation. With this patch, it is
    determined if an allocation failed due to external fragmentation instead
    of low memory and if so, the calling process will compact until a suitable
    page is freed. Compaction by moving pages in memory is considerably
    cheaper than paging out to disk and works where there are locked pages or
    no swap. If compaction fails to free a page of a suitable size, then
    reclaim will still occur.

    Direct compaction returns as soon as possible. As each block is
    compacted, it is checked if a suitable page has been freed and if so, it
    returns.

    [akpm@linux-foundation.org: Fix build errors]
    [aarcange@redhat.com: fix count_vm_event preempt in memory compaction direct reclaim]
    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel
    Reviewed-by: Minchan Kim
    Cc: KOSAKI Motohiro
    Cc: Christoph Lameter
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • This patch is the core of a mechanism which compacts memory in a zone by
    relocating movable pages towards the end of the zone.

    A single compaction run involves a migration scanner and a free scanner.
    Both scanners operate on pageblock-sized areas in the zone. The migration
    scanner starts at the bottom of the zone and searches for all movable
    pages within each area, isolating them onto a private list called
    migratelist. The free scanner starts at the top of the zone and searches
    for suitable areas and consumes the free pages within, making them
    available for the migration scanner. The pages isolated for migration are
    then migrated to the newly isolated free pages.
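
    The two-scanner scheme can be illustrated on a toy zone represented as an
    array of page states. This is a simplified sketch, not the kernel code: it
    works page by page rather than on pageblock-sized areas, it migrates each
    page immediately instead of batching on a migratelist, and all names are
    hypothetical.

```c
#include <stddef.h>

enum page_state { PAGE_FREE, PAGE_MOVABLE, PAGE_UNMOVABLE };

/*
 * Compact a toy "zone": movable pages found by the migration scanner
 * (low to high) are moved into free slots found by the free scanner
 * (high to low). Returns the number of pages migrated.
 */
static size_t compact_zone_model(enum page_state *zone, size_t nr_pages)
{
    size_t migrate = 0;          /* migration scanner position */
    size_t free_pos = nr_pages;  /* free scanner position (one past) */
    size_t moved = 0;

    while (migrate < free_pos) {
        if (zone[migrate] != PAGE_MOVABLE) {
            migrate++;
            continue;
        }
        /* Walk the free scanner down to the next free page. */
        while (free_pos > migrate + 1 && zone[free_pos - 1] != PAGE_FREE)
            free_pos--;
        if (free_pos <= migrate + 1)
            break;               /* scanners met: no free target left */

        zone[free_pos - 1] = PAGE_MOVABLE;   /* migrate the page upward */
        zone[migrate] = PAGE_FREE;           /* its old slot becomes free */
        free_pos--;
        migrate++;
        moved++;
    }
    return moved;
}
```

    After a run, free pages collect at the bottom of the array and movable
    pages at the top, while unmovable pages stay where they were.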

    [aarcange@redhat.com: Fix unsafe optimisation]
    [mel@csn.ul.ie: do not schedule work on other CPUs for compaction]
    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel
    Reviewed-by: Minchan Kim
    Cc: KOSAKI Motohiro
    Cc: Christoph Lameter
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • There are two types of zonelist ordering methodologies:

    - node order, preferring that allocations stay local to a node, and

    - zone order, preferring allocations come from a higher zone to avoid
    allocating in lowmem zones even though they may not be local.

    The ordering technique used by the kernel is configurable on the command
    line, but also has some logic to determine what the default should be.

    This logic currently lacks knowledge of systems where a node may only have
    lowmem. For such systems, it is necessary to use node order so that
    GFP_KERNEL allocations may be satisfied by nodes consisting of only
    lowmem.

    If zone order is used, GFP_KERNEL allocations to such nodes are actually
    allocated on a node with local affinity that includes ZONE_NORMAL.

    This change defaults to node zonelist ordering if any node lacks
    ZONE_NORMAL.

    To force zone order, append 'numa_zonelist_order=zone' to the kernel
    command line.
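
    The new default rule can be sketched as follows. The types and the
    'fallback' parameter are assumptions for illustration; 'fallback' stands
    for whatever the rest of the default-selection logic would have chosen,
    which this sketch does not model.

```c
#include <stdbool.h>
#include <stddef.h>

enum zonelist_order { ORDER_NODE, ORDER_ZONE };

/* Hypothetical per-node summary: does the node have ZONE_NORMAL pages? */
struct node_model {
    bool has_zone_normal;
};

/* Pick a default: if any node lacks ZONE_NORMAL, use node order. */
static enum zonelist_order
default_zonelist_order(const struct node_model *nodes, size_t nr_nodes,
                       enum zonelist_order fallback)
{
    for (size_t i = 0; i < nr_nodes; i++) {
        if (!nodes[i].has_zone_normal)
            return ORDER_NODE;
    }
    return fallback;
}
```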

    Signed-off-by: David Rientjes
    Acked-by: Mel Gorman
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • Before applying this patch, cpuset updates task->mems_allowed and
    mempolicy by setting all new bits in the nodemask first, and clearing all
    old unallowed bits later. But along the way, the allocator may find that
    there is no node to allocate memory from.

    The reason is that when cpuset rebinds the task's mempolicy, it clears the
    nodes which the allocator can allocate pages on, for example:

    (mpol: mempolicy)
    task1                     task1's mpol    task2
    alloc page                1
      alloc on node0? NO      1
                              1               change mems from 1 to 0
                              1               rebind task1's mpol
                              0-1               set new bits
                              0                 clear disallowed bits
      alloc on node1? NO      0
      ...
    can't alloc page
      goto oom

    This patch fixes this problem by expanding the nodes range first (setting
    newly allowed bits) and shrinking it lazily (clearing newly disallowed
    bits). So we use a variable to tell the write-side task that the read-side
    task is reading the nodemask, and the write-side task clears newly
    disallowed nodes after the read-side task ends the current memory
    allocation.
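
    The expand-then-lazily-shrink protocol can be sketched with a plain
    bitmask and a reader counter. This is a single-threaded model with
    hypothetical names; the real code also needs memory barriers and a way
    for the writer to wait, none of which is shown here.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical task state: a node bitmask plus a reader counter. */
struct task_model {
    unsigned long mems_allowed;  /* bit n set: node n is allowed */
    int readers;                 /* allocations currently in progress */
};

/* Read side: mark the start and end of one memory allocation. */
static void alloc_begin(struct task_model *t)
{
    t->readers++;
}

static void alloc_end(struct task_model *t)
{
    assert(t->readers > 0);
    t->readers--;
}

/* Step 1: expand. Newly allowed bits are set immediately. */
static void mems_expand(struct task_model *t, unsigned long newmems)
{
    t->mems_allowed |= newmems;
}

/*
 * Step 2: shrink lazily. Newly disallowed bits are cleared only when
 * no allocation is in flight. Returns false if the caller must retry.
 */
static bool mems_try_shrink(struct task_model *t, unsigned long newmems)
{
    if (t->readers > 0)
        return false;
    t->mems_allowed &= newmems;
    return true;
}
```

    While a reader is active the mask is the union of old and new nodes, so
    an allocation that already rejected node 0 can still succeed on node 1.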

    [akpm@linux-foundation.org: fix spello]
    Signed-off-by: Miao Xie
    Cc: David Rientjes
    Cc: Nick Piggin
    Cc: Paul Menage
    Cc: Lee Schermerhorn
    Cc: Hugh Dickins
    Cc: Ravikiran Thirumalai
    Cc: KOSAKI Motohiro
    Cc: Christoph Lameter
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miao Xie
     
  • …re merging to the tail of the free lists

    In order to reduce fragmentation, this patch classifies freed pages in two
    groups according to their probability of being part of a high order merge.
    Pages belonging to a compound whose next-highest buddy is free are more
    likely to be part of a high order merge in the near future, so they will
    be added at the tail of the freelist. The remaining pages are put at the
    front of the freelist.

    In this way, the pages that are more likely to cause a big merge are kept
    free longer. Consequently there is a tendency to aggregate the
    long-living allocations on a subset of the compounds, reducing the
    fragmentation.
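
    The classification can be sketched with buddy index arithmetic on a small
    array. The free_order[] map and function names are assumptions for
    illustration; the sketch only decides head versus tail and does not
    maintain real free lists.

```c
#include <stdbool.h>

#define NR_PAGES  16
#define NOT_FREE  (-1)

/*
 * free_order[i] is the order of the free block starting at page i,
 * or NOT_FREE if no free block starts there. Illustrative layout only.
 */
static int free_order[NR_PAGES];

/* Mark every page as not free. */
static void reset_free_map(void)
{
    for (unsigned int i = 0; i < NR_PAGES; i++)
        free_order[i] = NOT_FREE;
}

/* Index of the buddy of the block at 'idx' with the given order. */
static unsigned int buddy_index(unsigned int idx, unsigned int order)
{
    return idx ^ (1u << order);
}

/*
 * The block at 'idx' is being freed at 'order' and its own buddy is not
 * free, so no merge happens now. If the buddy of the would-be merged
 * block is already free at order + 1, this block is one step away from
 * a larger merge, so it should go to the tail of the free list.
 */
static bool add_to_tail(unsigned int idx, unsigned int order)
{
    unsigned int combined = idx & ~(1u << order);   /* merged block start */
    unsigned int higher_buddy = buddy_index(combined, order + 1);

    if (higher_buddy >= NR_PAGES)
        return false;
    return free_order[higher_buddy] == (int)(order + 1);
}
```

    For example, if pages 2-3 form a free order-1 block, freeing page 0 or
    page 1 at order 0 returns true, because once the other of the two is also
    freed the whole range 0-3 can merge into an order-2 block.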

    This heuristic was tested on three machines, x86, x86-64 and ppc64 with
    3GB of RAM in each machine. The tests were kernbench, netperf, sysbench
    and STREAM for performance and a high-order stress test for huge page
    allocations.

    KernBench X86
    Elapsed mean 374.77 ( 0.00%) 375.10 (-0.09%)
    User mean 649.53 ( 0.00%) 650.44 (-0.14%)
    System mean 54.75 ( 0.00%) 54.18 ( 1.05%)
    CPU mean 187.75 ( 0.00%) 187.25 ( 0.27%)

    KernBench X86-64
    Elapsed mean 94.45 ( 0.00%) 94.01 ( 0.47%)
    User mean 323.27 ( 0.00%) 322.66 ( 0.19%)
    System mean 36.71 ( 0.00%) 36.50 ( 0.57%)
    CPU mean 380.75 ( 0.00%) 381.75 (-0.26%)

    KernBench PPC64
    Elapsed mean 173.45 ( 0.00%) 173.74 (-0.17%)
    User mean 587.99 ( 0.00%) 587.95 ( 0.01%)
    System mean 60.60 ( 0.00%) 60.57 ( 0.05%)
    CPU mean 373.50 ( 0.00%) 372.75 ( 0.20%)

    Nothing notable for kernbench.

    NetPerf UDP X86
    64 42.68 ( 0.00%) 42.77 ( 0.21%)
    128 85.62 ( 0.00%) 85.32 (-0.35%)
    256 170.01 ( 0.00%) 168.76 (-0.74%)
    1024 655.68 ( 0.00%) 652.33 (-0.51%)
    2048 1262.39 ( 0.00%) 1248.61 (-1.10%)
    3312 1958.41 ( 0.00%) 1944.61 (-0.71%)
    4096 2345.63 ( 0.00%) 2318.83 (-1.16%)
    8192 4132.90 ( 0.00%) 4089.50 (-1.06%)
    16384 6770.88 ( 0.00%) 6642.05 (-1.94%)*

    NetPerf UDP X86-64
    64 148.82 ( 0.00%) 154.92 ( 3.94%)
    128 298.96 ( 0.00%) 312.95 ( 4.47%)
    256 583.67 ( 0.00%) 626.39 ( 6.82%)
    1024 2293.18 ( 0.00%) 2371.10 ( 3.29%)
    2048 4274.16 ( 0.00%) 4396.83 ( 2.79%)
    3312 6356.94 ( 0.00%) 6571.35 ( 3.26%)
    4096 7422.68 ( 0.00%) 7635.42 ( 2.79%)*
    8192 12114.81 ( 0.00%)* 12346.88 ( 1.88%)
    16384 17022.28 ( 0.00%)* 17033.19 ( 0.06%)*
    1.64% 2.73%

    NetPerf UDP PPC64
    64 49.98 ( 0.00%) 50.25 ( 0.54%)
    128 98.66 ( 0.00%) 100.95 ( 2.27%)
    256 197.33 ( 0.00%) 191.03 (-3.30%)
    1024 761.98 ( 0.00%) 785.07 ( 2.94%)
    2048 1493.50 ( 0.00%) 1510.85 ( 1.15%)
    3312 2303.95 ( 0.00%) 2271.72 (-1.42%)
    4096 2774.56 ( 0.00%) 2773.06 (-0.05%)
    8192 4918.31 ( 0.00%) 4793.59 (-2.60%)
    16384 7497.98 ( 0.00%) 7749.52 ( 3.25%)

    The tests are run to have confidence limits within 1%. Results marked
    with a * were not confident although in this case, it's only outside by
    small amounts. Even with some results that were not confident, the
    netperf UDP results were generally positive.

    NetPerf TCP X86
    64 652.25 ( 0.00%)* 648.12 (-0.64%)*
    23.80% 22.82%
    128 1229.98 ( 0.00%)* 1220.56 (-0.77%)*
    21.03% 18.90%
    256 2105.88 ( 0.00%) 1872.03 (-12.49%)*
    1.00% 16.46%
    1024 3476.46 ( 0.00%)* 3548.28 ( 2.02%)*
    13.37% 11.39%
    2048 4023.44 ( 0.00%)* 4231.45 ( 4.92%)*
    9.76% 12.48%
    3312 4348.88 ( 0.00%)* 4396.96 ( 1.09%)*
    6.49% 8.75%
    4096 4726.56 ( 0.00%)* 4877.71 ( 3.10%)*
    9.85% 8.50%
    8192 4732.28 ( 0.00%)* 5777.77 (18.10%)*
    9.13% 13.04%
    16384 5543.05 ( 0.00%)* 5906.24 ( 6.15%)*
    7.73% 8.68%

    NETPERF TCP X86-64
    tcp-vanilla pgalloc-delay
    64 1895.87 ( 0.00%)* 1775.07 (-6.81%)*
    5.79% 4.78%
    128 3571.03 ( 0.00%)* 3342.20 (-6.85%)*
    3.68% 6.06%
    256 5097.21 ( 0.00%)* 4859.43 (-4.89%)*
    3.02% 2.10%
    1024 8919.10 ( 0.00%)* 8892.49 (-0.30%)*
    5.89% 6.55%
    2048 10255.46 ( 0.00%)* 10449.39 ( 1.86%)*
    7.08% 7.44%
    3312 10839.90 ( 0.00%)* 10740.15 (-0.93%)*
    6.87% 7.33%
    4096 10814.84 ( 0.00%)* 10766.97 (-0.44%)*
    6.86% 8.18%
    8192 11606.89 ( 0.00%)* 11189.28 (-3.73%)*
    7.49% 5.55%
    16384 12554.88 ( 0.00%)* 12361.22 (-1.57%)*
    7.36% 6.49%

    NETPERF TCP PPC64
    tcp-vanilla pgalloc-delay
    64 594.17 ( 0.00%) 596.04 ( 0.31%)*
    1.00% 2.29%
    128 1064.87 ( 0.00%)* 1074.77 ( 0.92%)*
    1.30% 1.40%
    256 1852.46 ( 0.00%)* 1856.95 ( 0.24%)
    1.25% 1.00%
    1024 3839.46 ( 0.00%)* 3813.05 (-0.69%)
    1.02% 1.00%
    2048 4885.04 ( 0.00%)* 4881.97 (-0.06%)*
    1.15% 1.04%
    3312 5506.90 ( 0.00%) 5459.72 (-0.86%)
    4096 6449.19 ( 0.00%) 6345.46 (-1.63%)
    8192 7501.17 ( 0.00%) 7508.79 ( 0.10%)
    16384 9618.65 ( 0.00%) 9490.10 (-1.35%)

    There was a distinct lack of confidence in the X86* figures so I included
    what the deviation was where the results were not confident. Many of the
    results, whether gains or losses were within the standard deviation so no
    solid conclusion can be reached on performance impact. Looking at the
    figures, only the X86-64 ones look suspicious with a few losses that were
    outside the noise. However, the results were so unstable that without
    knowing why they vary so much, a solid conclusion cannot be reached.

    SYSBENCH X86
    sysbench-vanilla pgalloc-delay
    1 7722.85 ( 0.00%) 7756.79 ( 0.44%)
    2 14901.11 ( 0.00%) 13683.44 (-8.90%)
    3 15171.71 ( 0.00%) 14888.25 (-1.90%)
    4 14966.98 ( 0.00%) 15029.67 ( 0.42%)
    5 14370.47 ( 0.00%) 14865.00 ( 3.33%)
    6 14870.33 ( 0.00%) 14845.57 (-0.17%)
    7 14429.45 ( 0.00%) 14520.85 ( 0.63%)
    8 14354.35 ( 0.00%) 14362.31 ( 0.06%)

    SYSBENCH X86-64
    1 17448.70 ( 0.00%) 17484.41 ( 0.20%)
    2 34276.39 ( 0.00%) 34251.00 (-0.07%)
    3 50805.25 ( 0.00%) 50854.80 ( 0.10%)
    4 66667.10 ( 0.00%) 66174.69 (-0.74%)
    5 66003.91 ( 0.00%) 65685.25 (-0.49%)
    6 64981.90 ( 0.00%) 65125.60 ( 0.22%)
    7 64933.16 ( 0.00%) 64379.23 (-0.86%)
    8 63353.30 ( 0.00%) 63281.22 (-0.11%)
    9 63511.84 ( 0.00%) 63570.37 ( 0.09%)
    10 62708.27 ( 0.00%) 63166.25 ( 0.73%)
    11 62092.81 ( 0.00%) 61787.75 (-0.49%)
    12 61330.11 ( 0.00%) 61036.34 (-0.48%)
    13 61438.37 ( 0.00%) 61994.47 ( 0.90%)
    14 62304.48 ( 0.00%) 62064.90 (-0.39%)
    15 63296.48 ( 0.00%) 62875.16 (-0.67%)
    16 63951.76 ( 0.00%) 63769.09 (-0.29%)

    SYSBENCH PPC64
    sysbench-vanilla pgalloc-delay
    1 7645.08 ( 0.00%) 7467.43 (-2.38%)
    2 14856.67 ( 0.00%) 14558.73 (-2.05%)
    3 21952.31 ( 0.00%) 21683.64 (-1.24%)
    4 27946.09 ( 0.00%) 28623.29 ( 2.37%)
    5 28045.11 ( 0.00%) 28143.69 ( 0.35%)
    6 27477.10 ( 0.00%) 27337.45 (-0.51%)
    7 26489.17 ( 0.00%) 26590.06 ( 0.38%)
    8 26642.91 ( 0.00%) 25274.33 (-5.41%)
    9 25137.27 ( 0.00%) 24810.06 (-1.32%)
    10 24451.99 ( 0.00%) 24275.85 (-0.73%)
    11 23262.20 ( 0.00%) 23674.88 ( 1.74%)
    12 24234.81 ( 0.00%) 23640.89 (-2.51%)
    13 24577.75 ( 0.00%) 24433.50 (-0.59%)
    14 25640.19 ( 0.00%) 25116.52 (-2.08%)
    15 26188.84 ( 0.00%) 26181.36 (-0.03%)
    16 26782.37 ( 0.00%) 26255.99 (-2.00%)

    Again, there is little to conclude here. While there are a few losses,
    the results vary by +/- 8% in some cases. They are the results of most
    concern as there are some large losses but it's also within the variance
    typically seen between kernel releases.

    The STREAM results varied so little and are so verbose that I didn't
    include them here.

    The final test stressed how many huge pages can be allocated. The
    absolute number of huge pages allocated is the same with or without the
    patch. However, the "unusability free space index" which is a measure of
    external fragmentation was slightly lower (lower is better) throughout the
    lifetime of the system. I also measured the latency of how long it took
    to successfully allocate a huge page. The latency was slightly lower and
    on X86 and PPC64, more huge pages were allocated almost immediately from
    the free lists. The improvement is slight but there.

    [mel@csn.ul.ie: Tested, reworked for less branches]
    [czoccolo@gmail.com: fix oops by checking pfn_valid_within()]
    Signed-off-by: Mel Gorman <mel@csn.ul.ie>
    Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
    Cc: Christoph Lameter <cl@linux-foundation.org>
    Acked-by: Rik van Riel <riel@redhat.com>
    Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi>
    Cc: Corrado Zoccolo <czoccolo@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

    Corrado Zoccolo
     

16 Mar, 2010

1 commit


13 Mar, 2010

2 commits

  • - introduce dump_page() to print the page info for debugging some error
    condition.

    - convert three mm users: bad_page(), print_bad_pte() and memory offline
    failure.

    - print an extra field: the symbolic names of page->flags

    Example dump_page() output:

    [ 157.521694] page:ffffea0000a7cba8 count:2 mapcount:1 mapping:ffff88001c901791 index:0x147
    [ 157.525570] page flags: 0x100000000100068(uptodate|lru|active|swapbacked)

    Signed-off-by: Wu Fengguang
    Cc: Ingo Molnar
    Cc: Alex Chiang
    Cc: Rik van Riel
    Cc: Andi Kleen
    Cc: Mel Gorman
    Cc: Christoph Lameter
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wu Fengguang
     
  • __zone_pcp_update() iterates over NR_CPUS instead of limiting the access
    to the possible cpus. This might result in access to uninitialized areas
    as the per cpu allocator only populates the per cpu memory for possible
    cpus.
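
    The difference between walking every index below NR_CPUS and walking only
    the possible cpus can be sketched as below. The bitmask and struct are
    hypothetical stand-ins for the kernel's cpu mask and per cpu pageset.

```c
#include <assert.h>

#define NR_CPUS 8   /* compile-time maximum, illustrative */

/* Hypothetical per-cpu slot; 'populated' marks allocated percpu memory. */
struct pcp_model {
    int populated;
    int high;
};

/*
 * Update only the cpus set in 'possible_mask'. Slots of cpus that are
 * not possible were never populated and must not be touched.
 * Returns the number of slots updated.
 */
static int pcp_update(struct pcp_model *pcp, unsigned long possible_mask,
                      int new_high)
{
    int updated = 0;

    for (int cpu = 0; cpu < NR_CPUS; cpu++) {
        if (!(possible_mask & (1UL << cpu)))
            continue;            /* skip cpus that can never exist */
        assert(pcp[cpu].populated);
        pcp[cpu].high = new_high;
        updated++;
    }
    return updated;
}
```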

    This problem was created as a result of the dynamic allocation of pagesets
    from percpu memory that went in during the merge window - commit
    99dcc3e5a94ed491fbef402831d8c0bbb267f995 ("this_cpu: Page allocator
    conversion").

    Signed-off-by: Thomas Gleixner
    Acked-by: Pekka Enberg
    Acked-by: Tejun Heo
    Acked-by: Christoph Lameter
    Acked-by: Mel Gorman
    Reviewed-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Thomas Gleixner
     

07 Mar, 2010

6 commits

  • free_area_init_nodes() emits pfn ranges for all zones on the system.
    There may be no pages on a higher zone, however, due to memory limitations
    or the use of the mem= kernel parameter. For example:

    Zone PFN ranges:
    DMA 0x00000001 -> 0x00001000
    DMA32 0x00001000 -> 0x00100000
    Normal 0x00100000 -> 0x00100000

    The implementation copies the previous zone's highest pfn, if any, as the
    next zone's lowest pfn. If its highest pfn is then greater than the
    amount of addressable memory, the upper memory limit is used instead.
    Thus, both the lowest and highest possible pfn for higher zones without
    memory may be the same.

    The pfn range for zones without memory is now shown as "empty" instead.
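
    A sketch of the reporting rule, assuming a hypothetical helper that
    formats one line of the report into a buffer (the column widths are
    illustrative, not the kernel's exact format string):

```c
#include <stdio.h>

/*
 * Format one line of the "Zone PFN ranges" report into 'buf'.
 * A zone whose start and end pfn coincide has no pages, so it is
 * reported as "empty" rather than as a zero-length range.
 */
static int format_zone_range(char *buf, size_t len, const char *name,
                             unsigned long start_pfn, unsigned long end_pfn)
{
    if (start_pfn == end_pfn)
        return snprintf(buf, len, "%-8s empty", name);
    return snprintf(buf, len, "%-8s %#010lx -> %#010lx",
                    name, start_pfn, end_pfn);
}
```

    With the example above, the Normal zone (0x00100000 -> 0x00100000) would
    be printed as "empty" while DMA and DMA32 keep their ranges.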

    Signed-off-by: David Rientjes
    Cc: Mel Gorman
    Reviewed-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • There are quite a few GFP_KERNEL memory allocations made during
    suspend/hibernation and resume that may cause the system to hang, because
    the I/O operations they depend on cannot be completed due to the
    underlying devices being suspended.

    Avoid this problem by clearing the __GFP_IO and __GFP_FS bits in
    gfp_allowed_mask before suspend/hibernation and restoring the original
    values of these bits in gfp_allowed_mask during the subsequent resume.
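
    The save/clear/restore of the two bits can be sketched as follows. The
    bit values, the save slot and the function names are assumptions made for
    illustration; the real definitions live in the kernel's gfp headers.

```c
typedef unsigned int gfp_t;

/* Illustrative bit values standing in for __GFP_IO and __GFP_FS. */
#define GFP_IO_BIT  0x40u
#define GFP_FS_BIT  0x80u
#define GFP_IOFS    (GFP_IO_BIT | GFP_FS_BIT)

static gfp_t gfp_allowed_mask = ~0u;   /* everything allowed at runtime */
static gfp_t saved_gfp_bits;           /* hypothetical save slot */

/* Before suspend: remember the two bits, then forbid I/O and FS use. */
static void restrict_gfp_for_suspend(void)
{
    saved_gfp_bits = gfp_allowed_mask & GFP_IOFS;
    gfp_allowed_mask &= ~GFP_IOFS;
}

/* On resume: restore the original values of the two bits. */
static void restore_gfp_after_resume(void)
{
    gfp_allowed_mask |= saved_gfp_bits;
}

/* What an allocator would do with a caller's flags under the mask. */
static gfp_t effective_gfp(gfp_t requested)
{
    return requested & gfp_allowed_mask;
}
```

    While the mask is restricted, a request carrying both bits is silently
    downgraded, so it cannot wait on I/O to a suspended device.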

    [akpm@linux-foundation.org: fix CONFIG_PM=n linkage]
    Signed-off-by: Rafael J. Wysocki
    Reported-by: Maxim Levitsky
    Cc: Sebastian Ott
    Cc: Benjamin Herrenschmidt
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rafael J. Wysocki
     
  • commit e815af95 ("change all_unreclaimable zone member to flags") changed
    the all_unreclaimable member to a bit flag. But it had an undesirable side
    effect. free_one_page() is one of the hottest paths in the Linux kernel and
    increasing atomic ops in it can reduce kernel performance a bit.

    Thus, this patch partially reverts that commit. At least
    all_unreclaimable shouldn't share a memory word with other zone flags.

    [akpm@linux-foundation.org: fix patch interaction]
    Signed-off-by: KOSAKI Motohiro
    Cc: David Rientjes
    Cc: Wu Fengguang
    Cc: KAMEZAWA Hiroyuki
    Cc: Minchan Kim
    Cc: Huang Shijie
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • free_hot_page() is just a wrapper around free_hot_cold_page() with
    parameter 'cold = 0'. After adding a clear comment for
    free_hot_cold_page(), it is reasonable to remove a level of call.

    [akpm@linux-foundation.org: fix build]
    Signed-off-by: Li Hong
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Ingo Molnar
    Cc: Larry Woodman
    Cc: Peter Zijlstra
    Cc: Li Ming Chun
    Cc: KOSAKI Motohiro
    Cc: Americo Wang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li Hong
     
  • Move a call of trace_mm_page_free_direct() from free_hot_page() to
    free_hot_cold_page(). It is clearer and close to kmemcheck_free_shadow(),
    as it is done in function __free_pages_ok().

    Signed-off-by: Li Hong
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Ingo Molnar
    Cc: Larry Woodman
    Cc: Peter Zijlstra
    Cc: Li Ming Chun
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li Hong
     
  • trace_mm_page_free_direct() is called in function __free_pages(). But it
    is called again in free_hot_page() if order == 0, producing duplicate
    records in the trace file for the mm_page_free_direct event, as shown below:

    TASK-PID CPU# TIMESTAMP FUNCTION
    gnome-terminal-1567 [000] 4415.246466: mm_page_free_direct: page=ffffea0003db9f40 pfn=1155800 order=0
    gnome-terminal-1567 [000] 4415.246468: mm_page_free_direct: page=ffffea0003db9f40 pfn=1155800 order=0
    gnome-terminal-1567 [000] 4415.246506: mm_page_alloc: page=ffffea0003db9f40 pfn=1155800 order=0 migratetype=0 gfp_flags=GFP_KERNEL
    gnome-terminal-1567 [000] 4415.255557: mm_page_free_direct: page=ffffea0003db9f40 pfn=1155800 order=0
    gnome-terminal-1567 [000] 4415.255557: mm_page_free_direct: page=ffffea0003db9f40 pfn=1155800 order=0

    This patch removes the first call and adds a call to
    trace_mm_page_free_direct() in __free_pages_ok().

    Signed-off-by: Li Hong
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Ingo Molnar
    Cc: Larry Woodman
    Cc: Peter Zijlstra
    Cc: Li Ming Chun
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li Hong
     

04 Mar, 2010

1 commit

  • …l/git/tip/linux-2.6-tip

    * 'x86-bootmem-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (30 commits)
    early_res: Need to save the allocation name in drop_range_partial()
    sparsemem: Fix compilation on PowerPC
    early_res: Add free_early_partial()
    x86: Fix non-bootmem compilation on PowerPC
    core: Move early_res from arch/x86 to kernel/
    x86: Add find_fw_memmap_area
    Move round_up/down to kernel.h
    x86: Make 32bit support NO_BOOTMEM
    early_res: Enhance check_and_double_early_res
    x86: Move back find_e820_area to e820.c
    x86: Add find_early_area_size
    x86: Separate early_res related code from e820.c
    x86: Move bios page reserve early to head32/64.c
    sparsemem: Put mem map for one node together.
    sparsemem: Put usemap for one node together
    x86: Make 64 bit use early_res instead of bootmem before slab
    x86: Only call dma32_reserve_bootmem 64bit !CONFIG_NUMA
    x86: Make early_node_mem get mem > 4 GB if possible
    x86: Dynamically increase early_res array size
    x86: Introduce max_early_res and early_res_count
    ...

    Linus Torvalds
     

22 Feb, 2010

1 commit

  • These build errors occur on some non-x86 platforms (PowerPC for example):

    mm/page_alloc.c: In function '__alloc_memory_core_early':
    mm/page_alloc.c:3468: error: implicit declaration of function 'find_early_area'
    mm/page_alloc.c:3483: error: implicit declaration of function 'reserve_early_without_check'

    The function is only needed on CONFIG_NO_BOOTMEM.

    Signed-off-by: Yinghai Lu
    Cc: Andrew Morton
    Cc: Johannes Weiner
    Cc: Mel Gorman
    Signed-off-by: Ingo Molnar

    Yinghai Lu
     

13 Feb, 2010

1 commit


02 Feb, 2010

1 commit


30 Jan, 2010

1 commit

  • After memory pressure has forced it to dip into the reserves, 2.6.32's
    5f8dcc21211a3d4e3a7a5ca366b469fb88117f61 "page-allocator: split per-cpu
    list into one-list-per-migrate-type" has been returning MIGRATE_RESERVE
    pages to the MIGRATE_MOVABLE free_list: in some sense depleting reserves.

    Fix that in the most straightforward way (which, considering the overheads
    of alternative approaches, is Mel's preference): the right migratetype is
    already in page_private(page), but free_pcppages_bulk() wasn't using it.

    How did this bug show up? As a 20% slowdown in my tmpfs loop kbuild
    swapping tests, on PowerMac G5 with SLUB allocator. Bisecting to that
    commit was easy, but explaining the magnitude of the slowdown not easy.

    The same effect appears, but much less markedly, with SLAB, and even
    less markedly on other machines (the PowerMac divides into fewer zones
    than x86, I think that may be a factor). We guess that lumpy reclaim
    of short-lived high-order pages is implicated in some way, and probably
    this bug has been tickling a poor decision somewhere in page reclaim.

    But instrumentation hasn't told me much, I've run out of time and
    imagination to determine exactly what's going on, and shouldn't hold up
    the fix any longer: it's valid, and might even fix other misbehaviours.

    Signed-off-by: Hugh Dickins
    Acked-by: Mel Gorman
    Cc: stable@kernel.org
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

17 Jan, 2010

2 commits

  • commit f2260e6b (page allocator: update NR_FREE_PAGES only as necessary)
    introduced a minor regression: if __rmqueue() fails, the NR_FREE_PAGES
    stat goes wrong. This patch fixes it.

    Signed-off-by: KOSAKI Motohiro
    Cc: Mel Gorman
    Reviewed-by: Minchan Kim
    Reported-by: Huang Shijie
    Reviewed-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • The current check for 'backward merging' within add_active_range() does
    not seem correct. start_pfn must be compared against
    early_node_map[i].start_pfn (and NOT against .end_pfn) to find out whether
    the new region is backward-mergeable with the existing range.
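
    The corrected comparison can be sketched on a single range. The struct is
    a hypothetical stand-in for one early_node_map[] entry, and the helper
    models only the backward-merge case described above.

```c
#include <stdbool.h>

/* Hypothetical stand-in for one early_node_map[] entry: [start, end). */
struct pfn_range {
    unsigned long start_pfn;
    unsigned long end_pfn;
};

/*
 * A new range can be merged backward into 'r' when it begins below
 * r->start_pfn and reaches at least r->start_pfn, so that the two
 * ranges touch or overlap. On success r is extended downward.
 */
static bool merge_backward(struct pfn_range *r,
                           unsigned long start_pfn, unsigned long end_pfn)
{
    if (start_pfn < r->start_pfn && end_pfn >= r->start_pfn) {
        r->start_pfn = start_pfn;
        return true;
    }
    return false;
}
```

    With an existing range [100, 200), a new range [150, 250) is not
    backward-mergeable. Comparing start_pfn against .end_pfn instead would
    accept it and move the existing start from 100 up to 150.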

    Signed-off-by: Kazuhisa Ichikawa
    Acked-by: David Rientjes
    Cc: KOSAKI Motohiro
    Cc: Mel Gorman
    Cc: Christoph Lameter
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kazuhisa Ichikawa
     

05 Jan, 2010

1 commit

  • Use the per cpu allocator functionality to avoid per cpu arrays in struct zone.

    This drastically reduces the size of struct zone for systems with large
    numbers of processors and allows placement of critical variables of struct
    zone in one cacheline even on very large systems.

    Another effect is that the pagesets of one processor are placed near one
    another. If multiple pagesets from different zones fit into one cacheline
    then additional cacheline fetches can be avoided on the hot paths when
    allocating memory from multiple zones.

    Bootstrap becomes simpler if we use the same scheme for UP, SMP, NUMA. #ifdefs
    are reduced and we can drop the zone_pcp macro.

    Hotplug handling is also simplified since cpu alloc can bring up and
    shut down cpu areas for a specific cpu as a whole. So there is no need to
    allocate or free individual pagesets.

    V7-V8:
    - Explain chicken-and-egg dilemma with percpu allocator.

    V4-V5:
    - Fix up cases where per_cpu_ptr is called before irq disable
    - Integrate the bootstrap logic that was separate before.

    tj: Build failure in pageset_cpuup_callback() due to missing ret
    variable fixed.

    Reviewed-by: Mel Gorman
    Signed-off-by: Christoph Lameter
    Signed-off-by: Tejun Heo

    Christoph Lameter
     

24 Dec, 2009

1 commit


23 Dec, 2009

1 commit

  • * 'merge' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc: (36 commits)
    powerpc/gc/wii: Remove get_irq_desc()
    powerpc/gc/wii: hlwd-pic: convert irq_desc.lock to raw_spinlock
    powerpc/gamecube/wii: Fix off-by-one error in ugecon/usbgecko_udbg
    powerpc/mpic: Fix problem that affinity is not updated
    powerpc/mm: Fix stupid bug in subpge protection handling
    powerpc/iseries: use DECLARE_COMPLETION_ONSTACK for non-constant completion
    powerpc: Fix MSI support on U4 bridge PCIe slot
    powerpc: Handle VSX alignment faults correctly in little-endian mode
    powerpc/mm: Fix typo of cpumask_clear_cpu()
    powerpc/mm: Fix hash_utils_64.c compile errors with DEBUG enabled.
    powerpc: Convert BUG() to use unreachable()
    powerpc/pseries: Make declarations of cpu_hotplug_driver_lock() ANSI compatible.
    powerpc/pseries: Don't panic when H_PROD fails during cpu-online.
    powerpc/mm: Fix a WARN_ON() with CONFIG_DEBUG_PAGEALLOC and CONFIG_DEBUG_VM
    powerpc/defconfigs: Set HZ=100 on pseries and ppc64 defconfigs
    powerpc/defconfigs: Disable token ring in powerpc defconfigs
    powerpc/defconfigs: Reduce 64bit vmlinux by making acenic and cramfs modules
    powerpc/pseries: Select XICS and PCI_MSI PSERIES
    powerpc/85xx: Wrong variable returned on error
    powerpc/iseries: Convert to proc_fops
    ...

    Linus Torvalds
     

20 Dec, 2009

1 commit

  • …git/tip/linux-2.6-tip

    * 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
    x86, irq: Allow 0xff for /proc/irq/[n]/smp_affinity on an 8-cpu system
    Makefile: Unexport LC_ALL instead of clearing it
    x86: Fix objdump version check in arch/x86/tools/chkobjdump.awk
    x86: Reenable TSC sync check at boot, even with NONSTOP_TSC
    x86: Don't use POSIX character classes in gen-insn-attr-x86.awk
    Makefile: set LC_CTYPE, LC_COLLATE, LC_NUMERIC to C
    x86: Increase MAX_EARLY_RES; insufficient on 32-bit NUMA
    x86: Fix checking of SRAT when node 0 ram is not from 0
    x86, cpuid: Add "volatile" to asm in native_cpuid()
    x86, msr: msrs_alloc/free for CONFIG_SMP=n
    x86, amd: Get multi-node CPU info from NodeId MSR instead of PCI config space
    x86: Add IA32_TSC_AUX MSR and use it
    x86, msr/cpuid: Register enough minors for the MSR and CPUID drivers
    initramfs: add missing decompressor error check
    bzip2: Add missing checks for malloc returning NULL
    bzip2/lzma/gzip: pre-boot malloc doesn't return NULL on failure

    Linus Torvalds
     

18 Dec, 2009

1 commit

  • Memory balloon drivers can allocate a large amount of memory which is not
    movable but could be freed to accommodate memory hotplug remove.

    Prior to calling the memory hotplug notifier chain the memory in the
    pageblock is isolated. Currently, if the migrate type is not
    MIGRATE_MOVABLE the isolation will not proceed, causing the memory removal
    for that page range to fail.

    Rather than failing pageblock isolation if the migratetype is not
    MIGRATE_MOVABLE, this patch checks whether all of the pages in the
    pageblock that are not on the LRU are owned by a registered balloon driver
    (or other entity), using a notifier chain. If all of the non-movable pages
    are owned by a balloon, they can be freed later through the memory notifier
    chain and the range can still be isolated in set_migratetype_isolate().
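
    The ownership check described above might be modelled roughly as
    follows. This is a user-space sketch under stated assumptions, not the
    kernel interface: isolate_count_fn, count_owned_pages(),
    pageblock_can_isolate() and toy_balloon_count() are all hypothetical
    names, and the "notifier chain" is reduced to an array of callbacks
    that each report how many pages in a range they own.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical callback: pages in [start_pfn, start_pfn + nr_pages)
 * that this owner holds and could free later. */
typedef unsigned long (*isolate_count_fn)(unsigned long start_pfn,
                                          unsigned long nr_pages);

/* Ask every registered owner and add up the pages they claim. */
static unsigned long count_owned_pages(const isolate_count_fn *chain, size_t n,
                                       unsigned long start_pfn,
                                       unsigned long nr_pages)
{
    unsigned long total = 0;
    for (size_t i = 0; i < n; i++)
        total += chain[i](start_pfn, nr_pages);
    return total;
}

/*
 * A movable pageblock can always be isolated. Otherwise, isolation is
 * allowed only if every page that is not on the LRU (nr_immobile) is
 * claimed by some registered owner.
 */
static bool pageblock_can_isolate(bool is_movable, unsigned long nr_immobile,
                                  const isolate_count_fn *chain, size_t n,
                                  unsigned long start_pfn,
                                  unsigned long nr_pages)
{
    if (is_movable)
        return true;
    return count_owned_pages(chain, n, start_pfn, nr_pages) == nr_immobile;
}

/* Toy owner for demonstration: holds the pfns in [512, 768). */
static unsigned long toy_balloon_count(unsigned long start_pfn,
                                       unsigned long nr_pages)
{
    const unsigned long own_start = 512, own_end = 768;
    unsigned long end_pfn = start_pfn + nr_pages;
    unsigned long lo = start_pfn > own_start ? start_pfn : own_start;
    unsigned long hi = end_pfn < own_end ? end_pfn : own_end;
    return hi > lo ? hi - lo : 0;
}
```

    In this model, a non-movable block whose 256 immobile pages all fall
    inside the toy balloon's range can be isolated, while a block the
    balloon does not own cannot.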

    Signed-off-by: Robert Jennings
    Cc: Mel Gorman
    Cc: Ingo Molnar
    Cc: Brian King
    Cc: Paul Mackerras
    Cc: Martin Schwidefsky
    Cc: Gerald Schaefer
    Cc: KAMEZAWA Hiroyuki
    Cc: Benjamin Herrenschmidt
    Signed-off-by: Andrew Morton
    Signed-off-by: Benjamin Herrenschmidt

    Robert Jennings
     

17 Dec, 2009

2 commits

  • Found one system that boots from socket1 instead of socket0; its SRAT gets rejected...

    [ 0.000000] SRAT: Node 1 PXM 0 0-a0000
    [ 0.000000] SRAT: Node 1 PXM 0 100000-80000000
    [ 0.000000] SRAT: Node 1 PXM 0 100000000-2080000000
    [ 0.000000] SRAT: Node 0 PXM 1 2080000000-4080000000
    [ 0.000000] SRAT: Node 2 PXM 2 4080000000-6080000000
    [ 0.000000] SRAT: Node 3 PXM 3 6080000000-8080000000
    [ 0.000000] SRAT: Node 4 PXM 4 8080000000-a080000000
    [ 0.000000] SRAT: Node 5 PXM 5 a080000000-c080000000
    [ 0.000000] SRAT: Node 6 PXM 6 c080000000-e080000000
    [ 0.000000] SRAT: Node 7 PXM 7 e080000000-10080000000
    ...
    [ 0.000000] NUMA: Allocated memnodemap from 500000 - 701040
    [ 0.000000] NUMA: Using 20 for the hash shift.
    [ 0.000000] Adding active range (0, 0x2080000, 0x4080000) 0 entries of 3200 used
    [ 0.000000] Adding active range (1, 0x0, 0x96) 1 entries of 3200 used
    [ 0.000000] Adding active range (1, 0x100, 0x7f750) 2 entries of 3200 used
    [ 0.000000] Adding active range (1, 0x100000, 0x2080000) 3 entries of 3200 used
    [ 0.000000] Adding active range (2, 0x4080000, 0x6080000) 4 entries of 3200 used
    [ 0.000000] Adding active range (3, 0x6080000, 0x8080000) 5 entries of 3200 used
    [ 0.000000] Adding active range (4, 0x8080000, 0xa080000) 6 entries of 3200 used
    [ 0.000000] Adding active range (5, 0xa080000, 0xc080000) 7 entries of 3200 used
    [ 0.000000] Adding active range (6, 0xc080000, 0xe080000) 8 entries of 3200 used
    [ 0.000000] Adding active range (7, 0xe080000, 0x10080000) 9 entries of 3200 used
    [ 0.000000] SRAT: PXMs only cover 917504MB of your 1048566MB e820 RAM. Not used.
    [ 0.000000] SRAT: SRAT not used.

    The early_node_map is not sorted because node0, with a non-zero start, comes first.

    So sort it right away after all regions are registered.
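
    A minimal user-space sketch of that sorting step is shown below. It
    uses the C library's qsort() as a stand-in for whatever sort routine
    the kernel uses; struct region_sketch, cmp_region_start() and
    sort_regions_by_start() are hypothetical names chosen for illustration.

```c
#include <stdlib.h>

/* Hypothetical stand-in for one early_node_map[] entry. */
struct region_sketch {
    int nid;
    unsigned long start_pfn;
    unsigned long end_pfn;
};

/* Order regions by ascending start_pfn. */
static int cmp_region_start(const void *a, const void *b)
{
    const struct region_sketch *ra = a;
    const struct region_sketch *rb = b;

    if (ra->start_pfn < rb->start_pfn)
        return -1;
    if (ra->start_pfn > rb->start_pfn)
        return 1;
    return 0;
}

/* Sort the whole map once all regions have been registered. */
static void sort_regions_by_start(struct region_sketch *map, size_t n)
{
    qsort(map, n, sizeof(*map), cmp_region_start);
}
```

    Applied to the first four ranges from the log above, the node 0 entry
    starting at 0x2080000 moves behind the three node 1 entries, so the
    map is ordered by start address regardless of node number.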

    This also fixes a regression introduced by 8716273c (x86: Export srat physical topology).

    -v2: make it more solid to handle cross node case like node0 [0,4g), [8,12g) and node1 [4g, 8g), [12g, 16g)
    -v3: update comments.

    Reported-and-tested-by: Jens Axboe
    Signed-off-by: Yinghai Lu
    LKML-Reference:
    Signed-off-by: H. Peter Anvin

    Yinghai Lu
     
  • * 'hwpoison' of git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-mce-2.6: (34 commits)
    HWPOISON: Remove stray phrase in a comment
    HWPOISON: Try to allocate migration page on the same node
    HWPOISON: Don't do early filtering if filter is disabled
    HWPOISON: Add a madvise() injector for soft page offlining
    HWPOISON: Add soft page offline support
    HWPOISON: Undefine short-hand macros after use to avoid namespace conflict
    HWPOISON: Use new shake_page in memory_failure
    HWPOISON: Use correct name for MADV_HWPOISON in documentation
    HWPOISON: mention HWPoison in Kconfig entry
    HWPOISON: Use get_user_page_fast in hwpoison madvise
    HWPOISON: add an interface to switch off/on all the page filters
    HWPOISON: add memory cgroup filter
    memcg: add accessor to mem_cgroup.css
    memcg: rename and export try_get_mem_cgroup_from_page()
    HWPOISON: add page flags filter
    mm: export stable page flags
    HWPOISON: limit hwpoison injector to known page types
    HWPOISON: add fs/device filters
    HWPOISON: return 0 to indicate success reliably
    HWPOISON: make semantics of IGNORED/DELAYED clear
    ...

    Linus Torvalds
     

16 Dec, 2009

3 commits

  • Fix node-oriented allocation handling in oom-kill.c. I myself think of this
    as a bugfix, not as an enhancement.

    These days, things have changed:
    - alloc_pages() takes a nodemask as an argument, via __alloc_pages_nodemask().
    - mempolicy doesn't maintain its own private zonelists.
    (And cpuset doesn't use nodemask for __alloc_pages_nodemask())

    So the current oom-killer's check function is wrong.

    This patch:
    - checks the nodemask: if a nodemask is given and it doesn't cover all of
    node_states[N_HIGH_MEMORY], this is CONSTRAINT_MEMORY_POLICY.
    - scans the whole zonelist under the nodemask: if it hits the cpuset's
    wall, this failure is from the cpuset.
    And
    - modifies the caller of out_of_memory not to call oom if __GFP_THISNODE.
    This doesn't change "current" behavior. If callers use __GFP_THISNODE
    they should handle "page allocation failure" by themselves.

    - handles the __GFP_NOFAIL+__GFP_THISNODE path.
    This is something like a FIXME but this gfpmask is not used now.
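
    The first two checks in the list above can be sketched as a small
    classification function. This is a user-space model under stated
    assumptions, not the code in oom-kill.c: node sets are plain bitmasks,
    classify_oom() is a hypothetical name, nodes_with_mem stands in for
    node_states[N_HIGH_MEMORY], and CONSTRAINT_NONE and CONSTRAINT_CPUSET
    are assumed to be the companions of CONSTRAINT_MEMORY_POLICY.

```c
#include <stdbool.h>

enum oom_constraint {
    CONSTRAINT_NONE,
    CONSTRAINT_CPUSET,
    CONSTRAINT_MEMORY_POLICY,
};

/*
 * has_nodemask:   whether the allocation carried a nodemask at all
 * nodemask:       nodes the allocation was restricted to
 * nodes_with_mem: nodes that have memory
 * cpuset_allowed: nodes the current cpuset permits
 */
static enum oom_constraint classify_oom(bool has_nodemask,
                                        unsigned long nodemask,
                                        unsigned long nodes_with_mem,
                                        unsigned long cpuset_allowed)
{
    unsigned long scanned;

    /* A nodemask that leaves out some node with memory means the
     * allocation was restricted by memory policy. */
    if (has_nodemask && (nodes_with_mem & ~nodemask) != 0)
        return CONSTRAINT_MEMORY_POLICY;

    /* Otherwise scan the candidate nodes; any node the cpuset forbids
     * means the failure comes from the cpuset. */
    scanned = has_nodemask ? (nodemask & nodes_with_mem) : nodes_with_mem;
    if ((scanned & ~cpuset_allowed) != 0)
        return CONSTRAINT_CPUSET;

    return CONSTRAINT_NONE;
}
```

    For example, with memory on nodes 0 and 1, a nodemask covering only
    node 0 is classified as a memory policy constraint, while an
    unrestricted allocation in a cpuset limited to node 0 is classified as
    a cpuset constraint.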

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: KAMEZAWA Hiroyuki
    Acked-by: David Rientjes
    Cc: Daisuke Nishimura
    Cc: KOSAKI Motohiro
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Most free pages in the buddy system have no PG_buddy set.
    Introduce is_free_buddy_page() for detecting them reliably.
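
    One way to picture such a check: only the first page of a free block
    carries the buddy marker and the block's order, so a page in the middle
    of a free block has to be traced back to a possible block head at each
    order. The user-space sketch below models this with an array indexed by
    pfn; is_free_buddy_pfn(), head_order[], NOT_BUDDY_HEAD and the
    MAX_ORDER value are assumptions for illustration, not the kernel
    implementation.

```c
#include <stdbool.h>

#define MAX_ORDER 11          /* assumed upper bound on block order */
#define NOT_BUDDY_HEAD (-1)   /* pfn is not the head of a free block */

/*
 * head_order[pfn] holds the order of the free block that starts at pfn,
 * or NOT_BUDDY_HEAD. A page is free if, for some order, the aligned
 * block head below it is marked and its block is large enough to
 * contain the page.
 */
static bool is_free_buddy_pfn(const int *head_order, unsigned long pfn)
{
    for (int order = 0; order < MAX_ORDER; order++) {
        unsigned long head = pfn & ~((1UL << order) - 1);

        if (head_order[head] != NOT_BUDDY_HEAD && head_order[head] >= order)
            return true;
    }
    return false;
}
```

    With a single free order-2 block starting at pfn 8, pfns 8 through 11
    are all reported free even though only pfn 8 is marked, while pfn 12 is
    not.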

    CC: Nick Piggin
    CC: Mel Gorman
    Signed-off-by: Wu Fengguang
    Signed-off-by: Andi Kleen

    Wu Fengguang
     
  • Remove three degrees of obfuscation, left over from when we had
    CONFIG_UNEVICTABLE_LRU. MLOCK_PAGES is CONFIG_HAVE_MLOCKED_PAGE_BIT is
    CONFIG_HAVE_MLOCK is CONFIG_MMU. rmap.o (and memory-failure.o) are only
    built when CONFIG_MMU, so don't need such conditions at all.

    Somehow, I feel no compulsion to remove the CONFIG_HAVE_MLOCK* lines from
    169 defconfigs: leave those to evolve in due course.

    Signed-off-by: Hugh Dickins
    Cc: Izik Eidus
    Cc: Andrea Arcangeli
    Cc: Nick Piggin
    Reviewed-by: KOSAKI Motohiro
    Cc: Rik van Riel
    Cc: Lee Schermerhorn
    Cc: Andi Kleen
    Cc: KAMEZAWA Hiroyuki
    Cc: Wu Fengguang
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

12 Nov, 2009

2 commits

  • Commit 341ce06f69abfafa31b9468410a13dbd60e2b237 ("page allocator:
    calculate the alloc_flags for allocation only once") altered watermark
    logic slightly by allowing rt_tasks that are handling an interrupt to set
    ALLOC_HARDER. This patch brings the watermark logic more in line with
    2.6.30.

    This change results in a reduction in the number of high-order GFP_ATOMIC
    allocation failures reported. See
    http://www.gossamer-threads.com/lists/linux/kernel/1144153

    [rientjes@google.com: Spotted the problem]
    Signed-off-by: Mel Gorman
    Reviewed-by: Pekka Enberg
    Reviewed-by: Rik van Riel
    Reviewed-by: KOSAKI Motohiro
    Cc: David Rientjes
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • If a direct reclaim makes no forward progress, it considers whether it
    should go OOM or not. Whether OOM is triggered or not, it may retry the
    allocation afterwards. In times past, this would always wake kswapd as
    well but currently, kswapd is not woken up after direct reclaim fails.
    For order-0 allocations, this makes little difference but if there is a
    heavy mix of higher-order allocations that direct reclaim is failing for,
    it might mean that kswapd is not rewoken for higher orders as much as it
    did previously.

    This patch wakes up kswapd when an allocation is being retried after a
    direct reclaim failure. It would be expected that kswapd is already
    awake, but this has the effect of telling kswapd to reclaim at the higher
    order as well.

    Signed-off-by: Mel Gorman
    Reviewed-by: Christoph Lameter
    Reviewed-by: Pekka Enberg
    Reviewed-by: KOSAKI Motohiro
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman