13 Jan, 2012

4 commits

  • This patch adds a lightweight sync migrate mode, MIGRATE_SYNC_LIGHT,
    that avoids writing back pages to backing storage. Async compaction
    maps to MIGRATE_ASYNC while sync compaction maps to MIGRATE_SYNC_LIGHT.
    For other migrate_pages users such as memory hotplug, MIGRATE_SYNC is
    used.

    This avoids sync compaction stalling for an excessive length of time,
    particularly when copying files to a USB stick where there might be a
    large number of dirty pages backed by a filesystem that does not support
    ->writepages.
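
    As a rough sketch of the mapping described above (the enum follows the
    patch description; the selection line is illustrative, not a quote of
    the patch):

        enum migrate_mode {
                MIGRATE_ASYNC,          /* never block */
                MIGRATE_SYNC_LIGHT,     /* may block, but never call ->writepage */
                MIGRATE_SYNC,           /* may block and write out pages */
        };

        /* Illustrative: compaction picking its mode from its sync flag */
        enum migrate_mode mode = sync ? MIGRATE_SYNC_LIGHT : MIGRATE_ASYNC;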

    [aarcange@redhat.com: This patch is heavily based on Andrea's work]
    [akpm@linux-foundation.org: fix fs/nfs/write.c build]
    [akpm@linux-foundation.org: fix fs/btrfs/disk-io.c build]
    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Andrea Arcangeli
    Cc: Minchan Kim
    Cc: Dave Jones
    Cc: Jan Kara
    Cc: Andy Isaacson
    Cc: Nai Xia
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Commit 39deaf85 ("mm: compaction: make isolate_lru_page() filter-aware")
    noted that compaction does not migrate dirty or writeback pages and that
    it was meaningless to pick the page and re-add it to the LRU list. This
    had to be partially reverted because some dirty pages can be migrated by
    compaction without blocking.

    This patch updates "mm: compaction: make isolate_lru_page" by skipping
    over pages that migration has no possibility of migrating to minimise LRU
    disruption.

    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Andrea Arcangeli
    Reviewed-by: Minchan Kim
    Cc: Dave Jones
    Cc: Jan Kara
    Cc: Andy Isaacson
    Cc: Nai Xia
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • When asynchronous compaction was introduced, the
    /proc/sys/vm/compact_memory handler should have been updated to always use
    synchronous compaction. This did not happen so this patch addresses it.

    The assumption is if a user writes to /proc/sys/vm/compact_memory, they
    are willing for that process to stall.
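
    A minimal sketch of the fix's effect (field names assumed from the
    compaction code of this era):

        /* compact_node() now always requests synchronous compaction */
        struct compact_control cc = {
                .order = -1,    /* compact the whole zone */
                .sync = true,   /* the writer accepts stalling */
        };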

    Signed-off-by: Mel Gorman
    Reviewed-by: Andrea Arcangeli
    Reviewed-by: Rik van Riel
    Reviewed-by: Minchan Kim
    Cc: Dave Jones
    Cc: Jan Kara
    Cc: Andy Isaacson
    Cc: Nai Xia
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Short summary: there are severe stalls when a USB stick using VFAT is
    used with THP enabled, and this series reduces them. If you are
    experiencing this problem, please test and report back. Considering I
    have seen complaints from openSUSE and Fedora users on this as well as a
    few private mails, I'm guessing it's a widespread issue. This is a new
    type of USB-related stall because it is due to synchronous compaction
    writing, whereas in the past the big problem was dirty pages reaching
    the end of the LRU and being written by reclaim.

    Am cc'ing Andrew this time and this series would replace
    mm-do-not-stall-in-synchronous-compaction-for-thp-allocations.patch.
    I'm also cc'ing Dave Jones as he might have merged that patch to Fedora
    for wider testing and ideally it would be reverted and replaced by this
    series.

    That said, the later patches could really do with some review. If this
    series is not the answer then a new direction needs to be discussed
    because as it is, the stalls are unacceptable as the results in this
    leader show.

    For testers that try backporting this to 3.1, it won't work because
    there is a non-obvious dependency on not writing back pages in direct
    reclaim so you need those patches too.

    Changelog since V5
    o Rebase to 3.2-rc5
    o Tidy up the changelogs a bit

    Changelog since V4
    o Added reviewed-bys, credited Andrea properly for sync-light
    o Allow dirty pages without mappings to be considered for migration
    o Bound the number of pages freed for compaction
    o Isolate PageReclaim pages on their own LRU list

    This is against 3.2-rc5 and follows on from discussions on "mm: Do
    not stall in synchronous compaction for THP allocations" and "[RFC
    PATCH 0/5] Reduce compaction-related stalls". Initially, the proposed
    patch eliminated stalls due to compaction, which sometimes resulted in
    user-visible interactivity problems in browsers, by simply never using
    sync compaction. The downside was that THP allocation success rates
    were lower because dirty pages were not being migrated, as reported by
    Andrea. His approach to fixing this was nacked on the grounds that
    it reverted fixes from Rik that reduced the number of pages reclaimed,
    because excessive reclaim had severely impacted his workload's
    performance.

    This series attempts to reconcile the requirements of maximising THP
    usage, without stalling in a user-visible fashion due to compaction
    or cheating by reclaiming an excessive number of pages.

    Patch 1 partially reverts commit 39deaf85 to allow migration to isolate
    dirty pages. This is because migration can move some dirty
    pages without blocking.

    Patch 2 notes that the /proc/sys/vm/compact_memory handler is not using
    synchronous compaction when it should be. This is unrelated
    to the reported stalls but is worth fixing.

    Patch 3 checks if we isolated a compound page during lumpy scan and
    accounts for it properly. For the most part, this affects
    tracing so it's unrelated to the stalls but worth fixing.

    Patch 4 notes that it is possible to abort reclaim early for compaction
    and return 0 to the page allocator potentially entering the
    "may oom" path. This has not been observed in practice but
    the rest of the series potentially makes it easier to happen.

    Patch 5 adds a sync parameter to the migratepage callback and gives
    the callback responsibility for migrating the page without
    blocking if sync==false. For example, fallback_migrate_page
    will not call writepage if sync==false. This increases the
    number of pages that can be handled by asynchronous compaction
    thereby reducing stalls (see the sketch after this list).

    Patch 6 restores filter-awareness to isolate_lru_page for migration.
    In practice, it means that pages under writeback and pages
    without a ->migratepage callback will not be isolated
    for migration.

    Patch 7 avoids calling direct reclaim if compaction is deferred but
    makes sure that compaction is only deferred if sync
    compaction was used.

    Patch 8 introduces a sync-light migration mechanism that sync compaction
    uses. The objective is to allow some stalls but to not call
    ->writepage which can lead to significant user-visible stalls.

    Patch 9 notes that while we want to abort reclaim ASAP to allow
    compaction to go ahead, we leave only a very small window of
    opportunity for compaction to run. This patch allows more pages
    to be freed by reclaim but bounds the number to a reasonable
    level based on the high watermark on each zone.

    Patch 10 allows slabs to be shrunk even after compaction_ready() is
    true for one zone. This is to avoid a problem whereby a single
    small zone can abort reclaim even though no pages have been
    reclaimed and no suitably large zone is in a usable state.

    Patch 11 fixes a problem with the rate of page scanning. As reclaim is
    rarely stalling on pages under writeback it means that scan
    rates are very high. This is particularly true for direct
    reclaim which is not calling writepage. The vmstat figures
    implied that much of this was busy work with PageReclaim pages
    marked for immediate reclaim. This patch is a prototype that
    moves these pages to their own LRU list.
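
    As a sketch of what Patch 5 asks of the callback (referenced from the
    Patch 5 entry above; simplified, and using the migrate-mode naming
    that Patch 8 introduces):

        static int fallback_migrate_page(struct address_space *mapping,
                        struct page *newpage, struct page *page,
                        enum migrate_mode mode)
        {
                if (PageDirty(page)) {
                        /* Only write back pages in full sync migration */
                        if (mode != MIGRATE_SYNC)
                                return -EBUSY;
                        return writeout(mapping, page);
                }

                /* Otherwise the page can be moved without blocking */
                return migrate_page(mapping, newpage, page, mode);
        }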

    This has been tested and other than 2 USB keys getting trashed,
    nothing horrible fell out. That said, I am a bit unhappy with the
    rescue logic in patch 11 but did not find a better way around it. It
    does significantly reduce scan rates and System CPU time indicating
    it is the right direction to take.

    What is of critical importance is that stalls due to compaction
    are massively reduced even though sync compaction was still
    allowed. Testing from people complaining about stalls copying to USBs
    with THP enabled are particularly welcome.

    The following tests all involve THP usage and USB keys in some
    way. Each test follows this type of pattern

    1. Read from some fast storage, be it raw device or file. Each time
    the copy finishes, start again until the test ends
    2. Write a large file to a filesystem on a USB stick. Each time the copy
    finishes, start again until the test ends
    3. When memory is low, start an alloc process that creates a mapping
    the size of physical memory to stress THP allocation. This is the
    "real" part of the test and the part that is meant to trigger
    stalls when THP is enabled. Copying continues in the background.
    4. Record the CPU usage and time to execute of the alloc process
    5. Record the number of THP allocs and fallbacks as well as the number of THP
    pages in use at the end of the test just before alloc exited
    6. Run the test 5 times to get an idea of variability
    7. Between each run, sync is run and caches dropped and the test
    waits until nr_dirty is a small number to avoid interference
    or caching between iterations that would skew the figures.

    The individual tests were then

    writebackCPDeviceBasevfat
    Disable THP, read from a raw device (sda), vfat on USB stick
    writebackCPDeviceBaseext4
    Disable THP, read from a raw device (sda), ext4 on USB stick
    writebackCPDevicevfat
    THP enabled, read from a raw device (sda), vfat on USB stick
    writebackCPDeviceext4
    THP enabled, read from a raw device (sda), ext4 on USB stick
    writebackCPFilevfat
    THP enabled, read from a file on fast storage and USB, both vfat
    writebackCPFileext4
    THP enabled, read from a file on fast storage and USB, both ext4

    The kernels tested were

    3.1         3.1
    vanilla     3.2-rc5
    freemore    Patches 1-10
    immediate   Patches 1-11
    andrea      The 8 patches Andrea posted as a basis of comparison

    The results are very long unfortunately. I'll start with the case
    where we are not using THP at all

    writebackCPDeviceBasevfat
    3.1.0-vanilla rc5-vanilla freemore-v6r1 isolate-v6r1 andrea-v2r1
    System Time 1.28 ( 0.00%) 54.49 (-4143.46%) 48.63 (-3687.69%) 4.69 ( -265.11%) 51.88 (-3940.81%)
    +/- 0.06 ( 0.00%) 2.45 (-4305.55%) 4.75 (-8430.57%) 7.46 (-13282.76%) 4.76 (-8440.70%)
    User Time 0.09 ( 0.00%) 0.05 ( 40.91%) 0.06 ( 29.55%) 0.07 ( 15.91%) 0.06 ( 27.27%)
    +/- 0.02 ( 0.00%) 0.01 ( 45.39%) 0.02 ( 25.07%) 0.00 ( 77.06%) 0.01 ( 52.24%)
    Elapsed Time 110.27 ( 0.00%) 56.38 ( 48.87%) 49.95 ( 54.70%) 11.77 ( 89.33%) 53.43 ( 51.54%)
    +/- 7.33 ( 0.00%) 3.77 ( 48.61%) 4.94 ( 32.63%) 6.71 ( 8.50%) 4.76 ( 35.03%)
    THP Active 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
    +/- 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
    Fault Alloc 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
    +/- 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
    Fault Fallback 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
    +/- 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)

    The THP figures are obviously all 0 because THP was disabled. The
    main thing to watch is the elapsed times and how they compare to
    times when THP is enabled later. It's also important to note that
    elapsed time is improved by this series as System CPU time is much
    reduced.

    writebackCPDevicevfat

    3.1.0-vanilla rc5-vanilla freemore-v6r1 isolate-v6r1 andrea-v2r1
    System Time 1.22 ( 0.00%) 13.89 (-1040.72%) 46.40 (-3709.20%) 4.44 ( -264.37%) 47.37 (-3789.33%)
    +/- 0.06 ( 0.00%) 22.82 (-37635.56%) 3.84 (-6249.44%) 6.48 (-10618.92%) 6.60 (-10818.53%)
    User Time 0.06 ( 0.00%) 0.06 ( -6.90%) 0.05 ( 17.24%) 0.05 ( 13.79%) 0.04 ( 31.03%)
    +/- 0.01 ( 0.00%) 0.01 ( 33.33%) 0.01 ( 33.33%) 0.01 ( 39.14%) 0.01 ( 25.46%)
    Elapsed Time 10445.54 ( 0.00%) 2249.92 ( 78.46%) 70.06 ( 99.33%) 16.59 ( 99.84%) 472.43 ( 95.48%)
    +/- 643.98 ( 0.00%) 811.62 ( -26.03%) 10.02 ( 98.44%) 7.03 ( 98.91%) 59.99 ( 90.68%)
    THP Active 15.60 ( 0.00%) 35.20 ( 225.64%) 65.00 ( 416.67%) 70.80 ( 453.85%) 62.20 ( 398.72%)
    +/- 18.48 ( 0.00%) 51.29 ( 277.59%) 15.99 ( 86.52%) 37.91 ( 205.18%) 22.02 ( 119.18%)
    Fault Alloc 121.80 ( 0.00%) 76.60 ( 62.89%) 155.40 ( 127.59%) 181.20 ( 148.77%) 286.60 ( 235.30%)
    +/- 73.51 ( 0.00%) 61.11 ( 83.12%) 34.89 ( 47.46%) 31.88 ( 43.36%) 68.13 ( 92.68%)
    Fault Fallback 881.20 ( 0.00%) 926.60 ( -5.15%) 847.60 ( 3.81%) 822.00 ( 6.72%) 716.60 ( 18.68%)
    +/- 73.51 ( 0.00%) 61.26 ( 16.67%) 34.89 ( 52.54%) 31.65 ( 56.94%) 67.75 ( 7.84%)
    MMTests Statistics: duration
    User/Sys Time Running Test (seconds) 3540.88 1945.37 716.04 64.97 1937.03
    Total Elapsed Time (seconds) 52417.33 11425.90 501.02 230.95 2520.28

    The first thing to note is the "Elapsed Time" for the vanilla kernels
    of 2249 seconds versus 56 with THP disabled which might explain the
    reports of USB stalls with THP enabled. Applying the patches brings
    performance in line with THP-disabled performance while isolating
    pages for immediate reclaim from the LRU cuts down System CPU time.

    The "Fault Alloc" success rate figures are also improved. The vanilla
    kernel only managed to allocate 76.6 pages on average over the course
    of 5 iterations where as applying the series allocated 181.20 on
    average albeit it is well within variance. It's worth noting that
    applies the series at least descreases the amount of variance which
    implies an improvement.

    Andrea's series had a higher success rate for THP allocations but
    at a severe cost to elapsed time which is still better than vanilla
    but still much worse than disabling THP altogether. One can bring my
    series close to Andrea's by removing this check

        /*
         * If compaction is deferred for high-order allocations, it is because
         * sync compaction recently failed. If this is the case and the caller
         * has requested the system not be heavily disrupted, fail the
         * allocation now instead of entering direct reclaim
         */
        if (deferred_compaction && (gfp_mask & __GFP_NO_KSWAPD))
                goto nopage;

    I didn't include a patch that removed the above check because hurting
    overall performance to improve the THP figure is not what the average
    user wants. It's something to consider though if someone really wants
    to maximise THP usage no matter what it does to the workload initially.

    This is a summary of vmstat figures from the same test.

    3.1.0-vanilla rc5-vanilla freemore-v6r1 isolate-v6r1 andrea-v2r1
    Page Ins 3257266139 1111844061 17263623 10901575 161423219
    Page Outs 81054922 30364312 3626530 3657687 8753730
    Swap Ins 3294 2851 6560 4964 4592
    Swap Outs 390073 528094 620197 790912 698285
    Direct pages scanned 1077581700 3024951463 1764930052 115140570 5901188831
    Kswapd pages scanned 34826043 7112868 2131265 1686942 1893966
    Kswapd pages reclaimed 28950067 4911036 1246044 966475 1497726
    Direct pages reclaimed 805148398 280167837 3623473 2215044 40809360
    Kswapd efficiency 83% 69% 58% 57% 79%
    Kswapd velocity 664.399 622.521 4253.852 7304.360 751.490
    Direct efficiency 74% 9% 0% 1% 0%
    Direct velocity 20557.737 264745.137 3522673.849 498551.938 2341481.435
    Percentage direct scans 96% 99% 99% 98% 99%
    Page writes by reclaim 722646 529174 620319 791018 699198
    Page writes file 332573 1080 122 106 913
    Page writes anon 390073 528094 620197 790912 698285
    Page reclaim immediate 0 2552514720 1635858848 111281140 5478375032
    Page rescued immediate 0 0 0 87848 0
    Slabs scanned 23552 23552 9216 8192 9216
    Direct inode steals 231 0 0 0 0
    Kswapd inode steals 0 0 0 0 0
    Kswapd skipped wait 28076 786 0 61 6
    THP fault alloc 609 383 753 906 1433
    THP collapse alloc 12 6 0 0 6
    THP splits 536 211 456 593 1136
    THP fault fallback 4406 4633 4263 4110 3583
    THP collapse fail 120 127 0 0 4
    Compaction stalls 1810 728 623 779 3200
    Compaction success 196 53 60 80 123
    Compaction failures 1614 675 563 699 3077
    Compaction pages moved 193158 53545 243185 333457 226688
    Compaction move failure 9952 9396 16424 23676 45070

    The main things to look at are

    1. Page In/out figures are much reduced by the series.

    2. Direct page scanning is incredibly high (264745.137 pages scanned
    per second on the vanilla kernel) but isolating PageReclaim pages
    on their own list reduces the number of pages scanned significantly.

    3. The fact that "Page rescued immediate" is a positive number implies
    that we sometimes race removing pages from the LRU_IMMEDIATE list
    that need to be put back on a normal LRU but it happens only for
    0.07% of the pages marked for immediate reclaim.

    writebackCPDeviceext4
    3.1.0-vanilla rc5-vanilla freemore-v6r1 isolate-v6r1 andrea-v2r1
    System Time 1.51 ( 0.00%) 1.77 ( -17.66%) 1.46 ( 2.92%) 1.15 ( 23.77%) 1.89 ( -25.63%)
    +/- 0.27 ( 0.00%) 0.67 ( -148.52%) 0.33 ( -22.76%) 0.30 ( -11.15%) 0.19 ( 30.16%)
    User Time 0.03 ( 0.00%) 0.04 ( -37.50%) 0.05 ( -62.50%) 0.07 ( -112.50%) 0.04 ( -18.75%)
    +/- 0.01 ( 0.00%) 0.02 ( -146.64%) 0.02 ( -97.91%) 0.02 ( -75.59%) 0.02 ( -63.30%)
    Elapsed Time 124.93 ( 0.00%) 114.49 ( 8.36%) 96.77 ( 22.55%) 27.48 ( 78.00%) 205.70 ( -64.65%)
    +/- 20.20 ( 0.00%) 74.39 ( -268.34%) 59.88 ( -196.48%) 7.72 ( 61.79%) 25.03 ( -23.95%)
    THP Active 161.80 ( 0.00%) 83.60 ( 51.67%) 141.20 ( 87.27%) 84.60 ( 52.29%) 82.60 ( 51.05%)
    +/- 71.95 ( 0.00%) 43.80 ( 60.88%) 26.91 ( 37.40%) 59.02 ( 82.03%) 52.13 ( 72.45%)
    Fault Alloc 471.40 ( 0.00%) 228.60 ( 48.49%) 282.20 ( 59.86%) 225.20 ( 47.77%) 388.40 ( 82.39%)
    +/- 88.07 ( 0.00%) 87.42 ( 99.26%) 73.79 ( 83.78%) 109.62 ( 124.47%) 82.62 ( 93.81%)
    Fault Fallback 531.60 ( 0.00%) 774.60 ( -45.71%) 720.80 ( -35.59%) 777.80 ( -46.31%) 614.80 ( -15.65%)
    +/- 88.07 ( 0.00%) 87.26 ( 0.92%) 73.79 ( 16.22%) 109.62 ( -24.47%) 82.29 ( 6.56%)
    MMTests Statistics: duration
    User/Sys Time Running Test (seconds) 50.22 33.76 30.65 24.14 128.45
    Total Elapsed Time (seconds) 1113.73 1132.19 1029.45 759.49 1707.26

    Similar test but the USB stick is using ext4 instead of vfat. As
    ext4 does not use writepage for migration, the large stalls due to
    compaction when THP is enabled are not observed. Still, isolating
    PageReclaim pages on their own list helped completion time largely
    by reducing the number of pages scanned by direct reclaim although
    time spent in congestion_wait could also be a factor.

    Again, Andrea's series had far higher success rates for THP allocation
    at the cost of elapsed time. I didn't look too closely but a quick
    look at the vmstat figures tells me kswapd reclaimed 8 times more pages
    than the patch series and direct reclaim reclaimed roughly three times
    as many pages. It follows that if memory is aggressively reclaimed,
    there will be more available for THP.

    writebackCPFilevfat
    3.1.0-vanilla rc5-vanilla freemore-v6r1 isolate-v6r1 andrea-v2r1
    System Time 1.76 ( 0.00%) 29.10 (-1555.52%) 46.01 (-2517.18%) 4.79 ( -172.35%) 54.89 (-3022.53%)
    +/- 0.14 ( 0.00%) 25.61 (-18185.17%) 2.15 (-1434.83%) 6.60 (-4610.03%) 9.75 (-6863.76%)
    User Time 0.05 ( 0.00%) 0.07 ( -45.83%) 0.05 ( -4.17%) 0.06 ( -29.17%) 0.06 ( -16.67%)
    +/- 0.02 ( 0.00%) 0.02 ( 20.11%) 0.02 ( -3.14%) 0.01 ( 31.58%) 0.01 ( 47.41%)
    Elapsed Time 22520.79 ( 0.00%) 1082.85 ( 95.19%) 73.30 ( 99.67%) 32.43 ( 99.86%) 291.84 ( 98.70%)
    +/- 7277.23 ( 0.00%) 706.29 ( 90.29%) 19.05 ( 99.74%) 17.05 ( 99.77%) 125.55 ( 98.27%)
    THP Active 83.80 ( 0.00%) 12.80 ( 15.27%) 15.60 ( 18.62%) 13.00 ( 15.51%) 0.80 ( 0.95%)
    +/- 66.81 ( 0.00%) 20.19 ( 30.22%) 5.92 ( 8.86%) 15.06 ( 22.54%) 1.17 ( 1.75%)
    Fault Alloc 171.00 ( 0.00%) 67.80 ( 39.65%) 97.40 ( 56.96%) 125.60 ( 73.45%) 133.00 ( 77.78%)
    +/- 82.91 ( 0.00%) 30.69 ( 37.02%) 53.91 ( 65.02%) 55.05 ( 66.40%) 21.19 ( 25.56%)
    Fault Fallback 832.00 ( 0.00%) 935.20 ( -12.40%) 906.00 ( -8.89%) 877.40 ( -5.46%) 870.20 ( -4.59%)
    +/- 82.91 ( 0.00%) 30.69 ( 62.98%) 54.01 ( 34.86%) 55.05 ( 33.60%) 20.91 ( 74.78%)
    MMTests Statistics: duration
    User/Sys Time Running Test (seconds) 7229.81 928.42 704.52 80.68 1330.76
    Total Elapsed Time (seconds) 112849.04 5618.69 571.11 360.54 1664.28

    In this case, the test is reading/writing only from filesystems but as
    it's vfat, it's slow due to calling writepage during compaction. Little
    to observe really - the time to complete the test goes way down
    with the series applied and THP allocation success rates go up in
    comparison to 3.2-rc5. The success rates are lower than 3.1.0 but
    the elapsed time for that kernel is abysmal so it is not really a
    sensible comparison.

    As before, Andrea's series allocates more THPs at the cost of overall
    performance.

    writebackCPFileext4
    3.1.0-vanilla rc5-vanilla freemore-v6r1 isolate-v6r1 andrea-v2r1
    System Time 1.51 ( 0.00%) 1.77 ( -17.66%) 1.46 ( 2.92%) 1.15 ( 23.77%) 1.89 ( -25.63%)
    +/- 0.27 ( 0.00%) 0.67 ( -148.52%) 0.33 ( -22.76%) 0.30 ( -11.15%) 0.19 ( 30.16%)
    User Time 0.03 ( 0.00%) 0.04 ( -37.50%) 0.05 ( -62.50%) 0.07 ( -112.50%) 0.04 ( -18.75%)
    +/- 0.01 ( 0.00%) 0.02 ( -146.64%) 0.02 ( -97.91%) 0.02 ( -75.59%) 0.02 ( -63.30%)
    Elapsed Time 124.93 ( 0.00%) 114.49 ( 8.36%) 96.77 ( 22.55%) 27.48 ( 78.00%) 205.70 ( -64.65%)
    +/- 20.20 ( 0.00%) 74.39 ( -268.34%) 59.88 ( -196.48%) 7.72 ( 61.79%) 25.03 ( -23.95%)
    THP Active 161.80 ( 0.00%) 83.60 ( 51.67%) 141.20 ( 87.27%) 84.60 ( 52.29%) 82.60 ( 51.05%)
    +/- 71.95 ( 0.00%) 43.80 ( 60.88%) 26.91 ( 37.40%) 59.02 ( 82.03%) 52.13 ( 72.45%)
    Fault Alloc 471.40 ( 0.00%) 228.60 ( 48.49%) 282.20 ( 59.86%) 225.20 ( 47.77%) 388.40 ( 82.39%)
    +/- 88.07 ( 0.00%) 87.42 ( 99.26%) 73.79 ( 83.78%) 109.62 ( 124.47%) 82.62 ( 93.81%)
    Fault Fallback 531.60 ( 0.00%) 774.60 ( -45.71%) 720.80 ( -35.59%) 777.80 ( -46.31%) 614.80 ( -15.65%)
    +/- 88.07 ( 0.00%) 87.26 ( 0.92%) 73.79 ( 16.22%) 109.62 ( -24.47%) 82.29 ( 6.56%)
    MMTests Statistics: duration
    User/Sys Time Running Test (seconds) 50.22 33.76 30.65 24.14 128.45
    Total Elapsed Time (seconds) 1113.73 1132.19 1029.45 759.49 1707.26

    Same type of story - elapsed times go down. In this case, allocation
    success rates are roughly the same. As before, Andrea's has higher
    success rates but takes a lot longer.

    Overall the series does reduce latencies and while the tests are
    inherently racy as alloc competes with the cp processes, the variability
    was included. The THP allocation rates are not as high as they could
    be but that is because we would have to be more aggressive about
    reclaim and compaction impacting overall performance.

    This patch:

    Commit 39deaf85 ("mm: compaction: make isolate_lru_page() filter-aware")
    noted that compaction does not migrate dirty or writeback pages and that
    it was meaningless to pick the page and re-add it to the LRU list.

    What was missed during review is that asynchronous migration moves dirty
    pages if their ->migratepage callback is migrate_page() because these can
    be moved without blocking. This potentially impacted hugepage allocation
    success rates by a factor depending on how many dirty pages are in the
    system.

    This patch partially reverts 39deaf85 to allow migration to isolate dirty
    pages again. This increases how much compaction disrupts the LRU but that
    is addressed later in the series.

    Signed-off-by: Mel Gorman
    Reviewed-by: Andrea Arcangeli
    Reviewed-by: Rik van Riel
    Reviewed-by: Minchan Kim
    Cc: Dave Jones
    Cc: Jan Kara
    Cc: Andy Isaacson
    Cc: Nai Xia
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

11 Jan, 2012

1 commit


22 Dec, 2011

1 commit

  • This moves the 'memory sysdev_class' over to a regular 'memory' subsystem
    and converts the devices to regular devices. The sysdev drivers are
    implemented as subsystem interfaces now.

    After all sysdev classes are ported to regular driver core entities, the
    sysdev implementation will be entirely removed from the kernel.

    Signed-off-by: Kay Sievers
    Signed-off-by: Greg Kroah-Hartman

    Kay Sievers
     

01 Nov, 2011

4 commits

  • There's no compact_zone_order() user outside file scope, so make it static.

    Signed-off-by: Kyungmin Park
    Acked-by: David Rientjes
    Reviewed-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kyungmin Park
     
  • In async mode, compaction doesn't migrate dirty or writeback pages. So,
    it's meaningless to pick the page and re-add it to the lru list.

    Of course, when we isolate the page in compaction, the page might be dirty
    or under writeback, but by the time we try to migrate it the page might no
    longer be, in which case it could be migrated. But that's very unlikely,
    as the isolate and migration cycle is much faster than writeout.

    So, this patch reduces CPU overhead and prevents unnecessary LRU churning.
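
    The filter amounts to something like this in the isolation path (a
    sketch; ISOLATE_CLEAN is the flag the related patches use for this):

        isolate_mode_t mode = ISOLATE_INACTIVE | ISOLATE_ACTIVE;

        /* Async compaction cannot wait on dirty or writeback pages */
        if (!cc->sync)
                mode |= ISOLATE_CLEAN;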

    Signed-off-by: Minchan Kim
    Acked-by: Johannes Weiner
    Reviewed-by: KAMEZAWA Hiroyuki
    Reviewed-by: KOSAKI Motohiro
    Acked-by: Mel Gorman
    Acked-by: Rik van Riel
    Reviewed-by: Michal Hocko
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • Change the ISOLATE_XXX macros to a bitwise isolate_mode_t type. Normally,
    macros aren't recommended as they are type-unsafe and make debugging
    harder, since the symbol cannot be passed through to the debugger.

    Quote from Johannes
    " Hmm, it would probably be cleaner to fully convert the isolation mode
    into independent flags. INACTIVE, ACTIVE, BOTH is currently a
    tri-state among flags, which is a bit ugly."

    This patch moves the isolate mode from swap.h to mmzone.h for use by
    memcontrol.h.
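
    A sketch of the resulting type (flag values as introduced around this
    series):

        typedef unsigned __bitwise__ isolate_mode_t;

        #define ISOLATE_INACTIVE        ((__force isolate_mode_t)0x1)
        #define ISOLATE_ACTIVE          ((__force isolate_mode_t)0x2)
        #define ISOLATE_CLEAN           ((__force isolate_mode_t)0x4)
        #define ISOLATE_UNMAPPED        ((__force isolate_mode_t)0x8)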

    Signed-off-by: Minchan Kim
    Cc: Johannes Weiner
    Cc: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Michal Hocko
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • acct_isolated of compaction uses page_lru_base_type, which returns only
    the base type of an LRU list, so it never returns LRU_ACTIVE_ANON or
    LRU_ACTIVE_FILE. In addition, cc->nr_[anon|file] is used only in
    acct_isolated, so it doesn't need fields in compact_control.

    This patch removes the fields from compact_control and clarifies the
    function of acct_isolated, which counts the number of anon|file pages
    isolated.

    Signed-off-by: Minchan Kim
    Acked-by: Johannes Weiner
    Reviewed-by: KAMEZAWA Hiroyuki
    Reviewed-by: KOSAKI Motohiro
    Acked-by: Mel Gorman
    Acked-by: Rik van Riel
    Reviewed-by: Michal Hocko
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     

16 Jun, 2011

4 commits

  • Asynchronous compaction is used when promoting to huge pages. This is all
    very nice but if a number of processes are compacting memory, a
    large number of pages can be isolated. An "asynchronous" process can
    stall for long periods of time as a result with a user reporting that
    firefox can stall for 10s of seconds. This patch aborts asynchronous
    compaction if too many pages are isolated as it's better to fail a
    hugepage promotion than stall a process.

    [minchan.kim@gmail.com: return COMPACT_PARTIAL for abort]
    Reported-and-tested-by: Ury Stankevich
    Signed-off-by: Mel Gorman
    Reviewed-by: Minchan Kim
    Reviewed-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Compaction works with two scanners, a migration and a free scanner. When
    the scanners crossover, migration within the zone is complete. The
    location of the scanner is recorded on each cycle to avoid excessive
    scanning.

    When a zone is small and mostly reserved, it's very easy for the migration
    scanner to be close to the end of the zone. Then the following situation
    can occur

    o migration scanner isolates some pages near the end of the zone
    o free scanner starts at the end of the zone but finds that the
    migration scanner is already there
    o free scanner gets reinitialised for the next cycle as
    cc->migrate_pfn + pageblock_nr_pages
    moving the free scanner into the next zone
    o migration scanner moves into the next zone

    When this happens, NR_ISOLATED accounting goes haywire because some of the
    accounting happens against the wrong zone. One zone's counter remains
    positive while the other goes negative even though the overall global
    count is accurate. This was reported on X86-32 with !SMP because !SMP
    allows the negative counters to be visible. The bug should theoretically
    be possible on other configurations as well.

    Signed-off-by: Mel Gorman
    Reviewed-by: Minchan Kim
    Reviewed-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • fragmentation_index() returns -1000 when the allocation might succeed.
    This doesn't match the comment and code in compaction_suitable(). I
    thought compaction_suitable should return COMPACT_PARTIAL in the -1000
    case, because in that case the allocation could succeed depending on
    watermarks.

    The impact of this is that compaction starts and compact_finished() is
    called which rechecks the watermarks and the free lists. It should have
    the same result in that compaction should not start but is more expensive.
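
    In code, the intended behaviour is roughly the following (a sketch
    based on the description; the watermark variable is assumed to be
    computed as usual in compaction_suitable()):

        fragindex = fragmentation_index(zone, order);
        if (fragindex >= 0 && fragindex <= sysctl_extfrag_threshold)
                return COMPACT_SKIPPED;

        /* -1000 means the allocation may already succeed; check watermarks */
        if (fragindex == -1000 &&
            zone_watermark_ok(zone, order, watermark, 0, 0))
                return COMPACT_PARTIAL;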

    Acked-by: Mel Gorman
    Signed-off-by: Shaohua Li
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shaohua Li
     
  • Commit 56de7263fcf3 ("mm: compaction: direct compact when a high-order
    allocation fails") introduced a check for cc->order == -1 in
    compact_finished. We should continue compacting in that case because
    the request came from userspace and there is no particular order to
    compact for. A similar check was added by 82478fb7 (mm: compaction:
    prevent division-by-zero during user-requested compaction) for
    compaction_suitable.

    The check is, however, done after zone_watermark_ok, which uses order as a
    right hand argument for shifts. Not only is the watermark check pointless
    if we can break out without it, it also uses 1 << -1, which is not well
    defined (at least by the C standard). Let's move the -1 check above
    zone_watermark_ok.
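
    Sketched, the reordering in compact_finished() looks like:

        /* Explicit compaction requests have no particular order to meet */
        if (cc->order == -1)
                return COMPACT_CONTINUE;

        /* Only now is it safe to use order as a shift argument */
        watermark = low_wmark_pages(zone) + (1UL << cc->order);
        if (!zone_watermark_ok(zone, cc->order, watermark, 0, 0))
                return COMPACT_CONTINUE;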

    [minchan.kim@gmail.com: caught compaction_suitable]
    Signed-off-by: Michal Hocko
    Cc: Mel Gorman
    Reviewed-by: Minchan Kim
    Reviewed-by: KAMEZAWA Hiroyuki
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

23 Mar, 2011

4 commits

  • compaction_alloc() isolates pages for migration in isolate_migratepages.
    While it's scanning, IRQs are disabled on the mistaken assumption the
    scanning should be short. Tests show this to be true for the most part
    but contention times on the LRU lock can be increased. Before this patch,
    the IRQ disabled times for a simple test looked like

    Total sampled time IRQs off (not real total time): 5493
    Event shrink_inactive_list..shrink_zone 1596 us count 1
    Event shrink_inactive_list..shrink_zone 1530 us count 1
    Event shrink_inactive_list..shrink_zone 956 us count 1
    Event shrink_inactive_list..shrink_zone 541 us count 1
    Event shrink_inactive_list..shrink_zone 531 us count 1
    Event split_huge_page..add_to_swap 232 us count 1
    Event save_args..call_softirq 36 us count 1
    Event save_args..call_softirq 35 us count 2
    Event __wake_up..__wake_up 1 us count 1

    This patch reduces the worst-case IRQs-disabled latencies by releasing the
    lock every SWAP_CLUSTER_MAX pages that are scanned and releasing the CPU if
    necessary. The cost of this is that the process performing compaction will
    be slower but IRQs being disabled for too long a time has worse consequences
    as the following report shows;

    Total sampled time IRQs off (not real total time): 4367
    Event shrink_inactive_list..shrink_zone 881 us count 1
    Event shrink_inactive_list..shrink_zone 875 us count 1
    Event shrink_inactive_list..shrink_zone 868 us count 1
    Event shrink_inactive_list..shrink_zone 555 us count 1
    Event split_huge_page..add_to_swap 495 us count 1
    Event compact_zone..compact_zone_order 269 us count 1
    Event split_huge_page..add_to_swap 266 us count 1
    Event shrink_inactive_list..shrink_zone 85 us count 1
    Event save_args..call_softirq 36 us count 2
    Event __wake_up..__wake_up 1 us count 1
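
    The mechanism is roughly the following (a simplified sketch of the
    periodic lock release described above):

        /* Give IRQs and lock waiters a chance every SWAP_CLUSTER_MAX pages */
        if (!((low_pfn + 1) % SWAP_CLUSTER_MAX)) {
                spin_unlock_irq(&zone->lru_lock);
                cond_resched();
                spin_lock_irq(&zone->lru_lock);
        }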

    [akpm@linux-foundation.org: simplify with s/unlocked/locked/]
    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Mel Gorman
    Cc: Johannes Weiner
    Cc: Arthur Marsh
    Cc: Clemens Ladisch
    Cc: KAMEZAWA Hiroyuki
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • compaction_alloc() isolates free pages to be used as migration targets.
    While it's scanning, IRQs are disabled on the mistaken assumption the
    scanning should be short. Analysis showed that IRQs were in fact being
    disabled for substantial time. A simple test was run using large
    anonymous mappings with transparent hugepage support enabled to trigger
    frequent compactions. A monitor sampled what the worst IRQ-off latencies
    were and a post-processing tool found the following;

    Total sampled time IRQs off (not real total time): 22355
    Event compaction_alloc..compaction_alloc 8409 us count 1
    Event compaction_alloc..compaction_alloc 7341 us count 1
    Event compaction_alloc..compaction_alloc 2463 us count 1
    Event compaction_alloc..compaction_alloc 2054 us count 1
    Event shrink_inactive_list..shrink_zone 1864 us count 1
    Event shrink_inactive_list..shrink_zone 88 us count 1
    Event save_args..call_softirq 36 us count 1
    Event save_args..call_softirq 35 us count 2
    Event __make_request..__blk_run_queue 24 us count 1
    Event __alloc_pages_nodemask..__alloc_pages_nodemask 6 us count 1

    i.e. compaction disabled IRQs for a prolonged period of time - 8ms in
    one instance. The full report generated by the tool can be found at

    http://www.csn.ul.ie/~mel/postings/minfree-20110225/irqsoff-vanilla-micro.report

    This patch reduces the time IRQs are disabled by simply disabling IRQs at
    the last possible minute. An updated IRQs-off summary report then looks
    like;

    Total sampled time IRQs off (not real total time): 5493
    Event shrink_inactive_list..shrink_zone 1596 us count 1
    Event shrink_inactive_list..shrink_zone 1530 us count 1
    Event shrink_inactive_list..shrink_zone 956 us count 1
    Event shrink_inactive_list..shrink_zone 541 us count 1
    Event shrink_inactive_list..shrink_zone 531 us count 1
    Event split_huge_page..add_to_swap 232 us count 1
    Event save_args..call_softirq 36 us count 1
    Event save_args..call_softirq 35 us count 2
    Event __wake_up..__wake_up 1 us count 1

    A full report is again available at

    http://www.csn.ul.ie/~mel/postings/minfree-20110225/irqsoff-minimiseirq-free-v1r4-micro.report

    As should be obvious, IRQ disabled latencies due to compaction are
    almost eliminated for this particular test.

    [aarcange@redhat.com: Fix initialisation of isolated]
    Signed-off-by: Mel Gorman
    Acked-by: Johannes Weiner
    Reviewed-by: KAMEZAWA Hiroyuki
    Reviewed-by: Minchan Kim
    Acked-by: Andrea Arcangeli
    Cc: Arthur Marsh
    Cc: Clemens Ladisch
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Since cf608ac19c ("mm: compaction: fix COMPACTPAGEFAILED counting"), many
    callers of migrate_pages() check its return value instead of using
    list_empty(). This patch makes compaction's use of migrate_pages
    consistent with the others. It should not change the old behaviour.

    Signed-off-by: Minchan Kim
    Cc: Mel Gorman
    Cc: Andrea Arcangeli
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • This patch reverts 5a03b051 ("thp: use compaction in kswapd for GFP_ATOMIC
    order > 0") due to reports stating that kswapd CPU usage was higher and
    IRQs were being disabled more frequently. This was reported at
    http://www.spinics.net/linux/fedora/alsa-user/msg09885.html.

    Without this patch applied, CPU usage by kswapd hovers around the 20% mark
    according to the tester (Arthur Marsh:
    http://www.spinics.net/linux/fedora/alsa-user/msg09899.html). With this
    patch applied, it's around 2%.

    The problem is not related to THP which specifies __GFP_NO_KSWAPD but is
    triggered by high-order allocations hitting the low watermark for their
    order and waking kswapd on kernels with CONFIG_COMPACTION set. The most
    common trigger for this is network cards configured for jumbo frames but
    it's also possible it'll be triggered by fork-heavy workloads (order-1)
    and some wireless cards which depend on order-1 allocations.

    The symptoms for the user will be high CPU usage by kswapd in low-memory
    situations which could be confused with another writeback problem. While
    a patch like 5a03b051 may be reintroduced in the future, this patch plays
    it safe for now and reverts it.

    [mel@csn.ul.ie: Beefed up the changelog]
    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Mel Gorman
    Reported-by: Arthur Marsh
    Tested-by: Arthur Marsh
    Cc: [2.6.38.1]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     

21 Jan, 2011

1 commit

  • Up until 3e7d344 ("mm: vmscan: reclaim order-0 and use compaction instead
    of lumpy reclaim"), compaction skipped calculating the fragmentation index
    of a zone when compaction was explicitly requested through the procfs
    knob.

    However, when compaction_suitable was introduced, it did not come with an
    extra check for order == -1, set on explicit compaction requests, and
    passed this order on to the fragmentation index calculation, where it
    overshifts the number of requested pages, leading to a division by zero.

    This patch makes sure that order == -1 is recognized as the flag it is
    rather than passing it along as valid order parameter.

    [akpm@linux-foundation.org: add comment, per Mel]
    Signed-off-by: Johannes Weiner
    Reviewed-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

14 Jan, 2011

8 commits

  • It makes no sense not to enable compaction for small order pages as we
    don't want to end up with bad order 2 allocations and good and graceful
    order 9 allocations.

    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • This takes advantage of memory compaction to properly generate pages of
    order > 0 if regular page reclaim fails and priority level becomes more
    severe and we don't reach the proper watermarks.

    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • It's not worth migrating transparent hugepages during compaction. Those
    hugepages don't create fragmentation.

    Signed-off-by: Andrea Arcangeli
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • try_to_compact_pages() is initially called to only migrate pages
    asynchronously and kswapd always compacts asynchronously. Both are being
    optimistic so it is important to complete the work as quickly as possible
    to minimise stalls.

    This patch alters the scanner when asynchronous to only consider
    MIGRATE_MOVABLE pageblocks as migration candidates. This reduces stalls
    when allocating huge pages while not impairing allocation success rates as
    a full scan will be performed if necessary after direct reclaim.
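
    A simplified sketch of the scanner change:

        /* When asynchronous, only take pages from movable pageblocks */
        if (!cc->sync &&
            get_pageblock_migratetype(page) != MIGRATE_MOVABLE) {
                /* Skip the rest of this pageblock */
                low_pfn += pageblock_nr_pages - 1;
                continue;
        }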

    Signed-off-by: Mel Gorman
    Cc: Andrea Arcangeli
    Cc: KOSAKI Motohiro
    Cc: Rik van Riel
    Acked-by: Johannes Weiner
    Cc: Andy Whitcroft
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • With the introduction of the boolean sync parameter, the API looks a
    little inconsistent as offlining is still an int. Convert offlining to a
    bool for the sake of being tidy.

    Signed-off-by: Mel Gorman
    Cc: Andrea Arcangeli
    Cc: KOSAKI Motohiro
    Cc: Rik van Riel
    Acked-by: Johannes Weiner
    Cc: Andy Whitcroft
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • …ompaction in the faster path

    Migration synchronously waits for writeback if the initial pass fails.
    Callers of memory compaction do not necessarily want this behaviour if the
    caller is latency sensitive or expects that synchronous migration is not
    going to have a significantly better success rate.

    This patch adds a sync parameter to migrate_pages() allowing the caller to
    indicate if wait_on_page_writeback() is allowed within migration or not.
    For reclaim/compaction, try_to_compact_pages() is first called
    asynchronously, direct reclaim runs and then try_to_compact_pages() is
    called synchronously as there is a greater expectation that it'll succeed.
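
    The resulting interface is roughly (offlining remains an int here; the
    patch above converts it to a bool):

        int migrate_pages(struct list_head *from, new_page_t get_new_page,
                        unsigned long private, int offlining, bool sync);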

    [akpm@linux-foundation.org: build/merge fix]
    Signed-off-by: Mel Gorman <mel@csn.ul.ie>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
    Cc: Rik van Riel <riel@redhat.com>
    Acked-by: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Andy Whitcroft <apw@shadowen.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

    Mel Gorman
     
  • Lumpy reclaim is disruptive. It reclaims a large number of pages and
    ignores the age of the pages it reclaims. This can incur significant
    stalls and potentially increase the number of major faults.

    Compaction has reached the point where it is considered reasonably stable
    (meaning it has passed a lot of testing) and is a potential candidate for
    displacing lumpy reclaim. This patch introduces an alternative to lumpy
    reclaim when compaction is available, called reclaim/compaction. The basic
    operation is very simple - instead of selecting a contiguous range of
    pages to reclaim, a number of order-0 pages are reclaimed and then
    compaction is performed later by either kswapd (compact_zone_order()) or
    direct compaction (__alloc_pages_direct_compact()).

    [akpm@linux-foundation.org: fix build]
    [akpm@linux-foundation.org: use conventional task_struct naming]
    Signed-off-by: Mel Gorman
    Cc: Andrea Arcangeli
    Cc: KOSAKI Motohiro
    Cc: Rik van Riel
    Acked-by: Johannes Weiner
    Cc: Andy Whitcroft
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • In preparation for patches promoting the use of memory compaction over
    lumpy reclaim, this patch adds trace points for memory compaction
    activity. Using them, we can monitor the scanning activity of the
    migration and free page scanners as well as the number and success rates
    of pages passed to page migration.

    Signed-off-by: Mel Gorman
    Cc: Andrea Arcangeli
    Cc: KOSAKI Motohiro
    Cc: Rik van Riel
    Cc: Johannes Weiner
    Cc: Andy Whitcroft
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

23 Dec, 2010

1 commit

  • del_page_from_lru_list() already called mem_cgroup_del_lru(). So we must
    not call it again. It adds unnecessary overhead.

    It was not a runtime bug because the TestClearPageCgroupAcctLRU() early in
    mem_cgroup_del_lru_list() will prevent any double-deletion, etc.

    Signed-off-by: Minchan Kim
    Acked-by: Balbir Singh
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Mel Gorman
    Reviewed-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     

10 Sep, 2010

1 commit

  • Iram reported that compaction's too_many_isolated() loops forever.
    (http://www.spinics.net/lists/linux-mm/msg08123.html)

    The meminfo when the situation happened showed that inactive anon was
    zero. That's because the system had no memory pressure until then.
    While all anon pages were on the active lru, compaction could select
    the active lru as well as the inactive lru. That's a different thing
    from vmscan's isolation, which is why there are two versions of
    too_many_isolated.

    While compaction can isolate pages from both the active and inactive
    lists, the current implementation of too_many_isolated considers only
    the inactive list. This caused Iram's problem.

    This patch handles active and inactive fairly, because we can't
    predict from which lists, or how many, pages compaction will isolate.

    This patch replaces the check (nr_isolated > nr_inactive) with
    nr_isolated > (nr_active + nr_inactive) / 2.
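
    The resulting check (a sketch matching the description):

        static bool too_many_isolated(struct zone *zone)
        {
                unsigned long active, inactive, isolated;

                inactive = zone_page_state(zone, NR_INACTIVE_FILE) +
                                zone_page_state(zone, NR_INACTIVE_ANON);
                active = zone_page_state(zone, NR_ACTIVE_FILE) +
                                zone_page_state(zone, NR_ACTIVE_ANON);
                isolated = zone_page_state(zone, NR_ISOLATED_FILE) +
                                zone_page_state(zone, NR_ISOLATED_ANON);

                return isolated > (inactive + active) / 2;
        }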

    Signed-off-by: Minchan Kim
    Reported-by: Iram Shahzad
    Acked-by: Mel Gorman
    Acked-by: Wu Fengguang
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     

25 May, 2010

5 commits

  • …hen it should be reclaimed

    The kernel applies some heuristics when deciding if memory should be
    compacted or reclaimed to satisfy a high-order allocation. One of these
    is based on the fragmentation. If the index is below 500, memory will not
    be compacted. This choice is arbitrary and not based on data. To help
    optimise the system and set a sensible default for this value, this patch
    adds a sysctl extfrag_threshold. The kernel will only compact memory if
    the fragmentation index is above the extfrag_threshold.
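
    In the allocation path this amounts to a check like the following
    (a sketch; the default of 500 matches the arbitrary value mentioned
    above):

        int sysctl_extfrag_threshold = 500;

        /* Skip compaction if the failure is not due to fragmentation */
        fragindex = fragmentation_index(zone, order);
        if (fragindex >= 0 && fragindex <= sysctl_extfrag_threshold)
                continue;       /* try reclaim for this zone instead */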

    [randy.dunlap@oracle.com: Fix build errors when proc fs is not configured]
    Signed-off-by: Mel Gorman <mel@csn.ul.ie>
    Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
    Cc: Rik van Riel <riel@redhat.com>
    Cc: Minchan Kim <minchan.kim@gmail.com>
    Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
    Cc: Christoph Lameter <cl@linux-foundation.org>
    Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

    Mel Gorman
     
  • Ordinarily when a high-order allocation fails, direct reclaim is entered
    to free pages to satisfy the allocation. With this patch, it is
    determined if an allocation failed due to external fragmentation instead
    of low memory and if so, the calling process will compact until a suitable
    page is freed. Compaction by moving pages in memory is considerably
    cheaper than paging out to disk and works where there are locked pages or
    no swap. If compaction fails to free a page of a suitable size, then
    reclaim will still occur.

    Direct compaction returns as soon as possible. As each block is
    compacted, it is checked if a suitable page has been freed and if so, it
    returns.

    [akpm@linux-foundation.org: Fix build errors]
    [aarcange@redhat.com: fix count_vm_event preempt in memory compaction direct reclaim]
    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel
    Reviewed-by: Minchan Kim
    Cc: KOSAKI Motohiro
    Cc: Christoph Lameter
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Add a per-node sysfs file called compact. When the file is written to,
    each zone in that node is compacted. The intention is that this would be
    used by something like a job scheduler in a batch system before a job
    starts so that the job can allocate the maximum number of hugepages
    without significant start-up cost.

    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel
    Reviewed-by: KOSAKI Motohiro
    Reviewed-by: Christoph Lameter
    Reviewed-by: Minchan Kim
    Reviewed-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Add a proc file /proc/sys/vm/compact_memory. When an arbitrary value is
    written to the file, all zones are compacted. The expected user of such a
    trigger is a job scheduler that prepares the system before the target
    application runs.
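
    The handler amounts to something like this (sketch):

        /* Writing any value to /proc/sys/vm/compact_memory compacts all zones */
        int sysctl_compaction_handler(struct ctl_table *table, int write,
                        void __user *buffer, size_t *length, loff_t *ppos)
        {
                if (write)
                        return compact_nodes();

                return 0;
        }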

    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel
    Reviewed-by: KAMEZAWA Hiroyuki
    Reviewed-by: Minchan Kim
    Reviewed-by: KOSAKI Motohiro
    Reviewed-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • This patch is the core of a mechanism which compacts memory in a zone by
    relocating movable pages towards the end of the zone.

    A single compaction run involves a migration scanner and a free scanner.
    Both scanners operate on pageblock-sized areas in the zone. The migration
    scanner starts at the bottom of the zone and searches for all movable
    pages within each area, isolating them onto a private list called
    migratelist. The free scanner starts at the top of the zone and searches
    for suitable areas and consumes the free pages within, making them
    available for the migration scanner. The pages isolated for migration are
    then migrated to the newly isolated free pages.
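
    A sketch of the overall control flow (simplified; statistics, error
    handling and the per-pageblock walk are omitted):

        static int compact_zone(struct zone *zone, struct compact_control *cc)
        {
                int ret;

                /* Scanners start at opposite ends and walk towards each other */
                cc->migrate_pfn = zone->zone_start_pfn;
                cc->free_pfn = cc->migrate_pfn + zone->spanned_pages;
                cc->free_pfn &= ~(pageblock_nr_pages - 1);

                while ((ret = compact_finished(zone, cc)) == COMPACT_CONTINUE) {
                        if (!isolate_migratepages(zone, cc))
                                continue;

                        migrate_pages(&cc->migratepages, compaction_alloc,
                                        (unsigned long)cc, 0);
                }

                return ret;
        }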

    [aarcange@redhat.com: Fix unsafe optimisation]
    [mel@csn.ul.ie: do not schedule work on other CPUs for compaction]
    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel
    Reviewed-by: Minchan Kim
    Cc: KOSAKI Motohiro
    Cc: Christoph Lameter
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman