13 Jan, 2012

40 commits

  • Asynchronous compaction is used when allocating transparent hugepages to
    avoid blocking for long periods of time. Due to reports of stalling,
    there was a debate on disabling synchronous compaction but this severely
    impacted allocation success rates. Part of the reason was that many dirty
    pages are skipped in asynchronous compaction by the following check:

    if (PageDirty(page) && !sync &&
            mapping->a_ops->migratepage != migrate_page)
        rc = -EBUSY;

    This check skips all mappings whose a_ops use buffer_migrate_page(),
    even though it is possible to migrate some of these pages without
    blocking. This patch updates the ->migratepage callback with a "sync"
    parameter. It is the responsibility of the callback to fail gracefully
    if migration would block.
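
    As an illustration, a minimal sketch of such a callback follows. It
    assumes the boolean "sync" form of the callback described above; the
    function example_migratepage() and the trylock_example_buffers() helper
    are hypothetical and only show the intended pattern of failing with
    -EBUSY instead of blocking.

        static int example_migratepage(struct address_space *mapping,
                                       struct page *newpage, struct page *page,
                                       bool sync)
        {
                /*
                 * Hypothetical callback: in asynchronous mode, never block.
                 * If the page has buffers that cannot be taken without
                 * sleeping, fail gracefully so async compaction skips the
                 * page instead of stalling.
                 */
                if (!sync && page_has_buffers(page) &&
                    !trylock_example_buffers(page))    /* hypothetical helper */
                        return -EBUSY;

                /* Otherwise fall back to the generic non-blocking copy. */
                return migrate_page(mapping, newpage, page);
        }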

    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Andrea Arcangeli
    Cc: Minchan Kim
    Cc: Dave Jones
    Cc: Jan Kara
    Cc: Andy Isaacson
    Cc: Nai Xia
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • During direct reclaim it is possible that reclaim will be aborted so that
    compaction can be attempted to satisfy a high-order allocation. If this
    decision is made before any pages are reclaimed, it is possible that 0 is
    returned to the page allocator potentially triggering an OOM. This has
    not been observed but it is a possibility so this patch addresses it.
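
    A minimal sketch of the idea, assuming the check sits at the end of the
    direct reclaim path (e.g. do_try_to_free_pages()); the variable name
    aborted_reclaim is illustrative:

        /* After the reclaim priority loop: */
        if (!sc->nr_reclaimed && aborted_reclaim)
                /*
                 * Reclaim was aborted so that compaction could run.
                 * Report progress rather than 0 so the page allocator
                 * retries compaction instead of declaring OOM.
                 */
                return 1;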

    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Andrea Arcangeli
    Cc: Minchan Kim
    Cc: Dave Jones
    Cc: Jan Kara
    Cc: Andy Isaacson
    Cc: Nai Xia
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Properly take into account if we isolated a compound page during the lumpy
    scan in reclaim and skip over the tail pages when encountered. This
    corrects the values given to the tracepoint for the number of lumpy pages
    isolated and avoids breaking the loop early when compound pages smaller
    than the requested allocation size are encountered.

    [mgorman@suse.de: Updated changelog]
    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Mel Gorman
    Reviewed-by: Minchan Kim
    Reviewed-by: Rik van Riel
    Cc: Dave Jones
    Cc: Jan Kara
    Cc: Andy Isaacson
    Cc: Nai Xia
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • When asynchronous compaction was introduced, the
    /proc/sys/vm/compact_memory handler should have been updated to always use
    synchronous compaction. This did not happen so this patch addresses it.

    The assumption is if a user writes to /proc/sys/vm/compact_memory, they
    are willing for that process to stall.

    Signed-off-by: Mel Gorman
    Reviewed-by: Andrea Arcangeli
    Reviewed-by: Rik van Riel
    Reviewed-by: Minchan Kim
    Cc: Dave Jones
    Cc: Jan Kara
    Cc: Andy Isaacson
    Cc: Nai Xia
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Short summary: There are severe stalls when a USB stick using VFAT is
    used with THP enabled, and this series reduces them. If you are
    experiencing this problem, please test and report back. Considering I
    have seen complaints from openSUSE and Fedora users on this, as well as
    a few private mails, I'm guessing it's a widespread issue. This is a new
    type of USB-related stall because it is due to synchronous compaction
    writing, whereas in the past the big problem was dirty pages reaching
    the end of the LRU and being written by reclaim.

    Am cc'ing Andrew this time and this series would replace
    mm-do-not-stall-in-synchronous-compaction-for-thp-allocations.patch.
    I'm also cc'ing Dave Jones as he might have merged that patch to Fedora
    for wider testing and ideally it would be reverted and replaced by this
    series.

    That said, the later patches could really do with some review. If this
    series is not the answer then a new direction needs to be discussed
    because as it is, the stalls are unacceptable as the results in this
    leader show.

    For testers that try backporting this to 3.1, it won't work because
    there is a non-obvious dependency on not writing back pages in direct
    reclaim so you need those patches too.

    Changelog since V5
    o Rebase to 3.2-rc5
    o Tidy up the changelogs a bit

    Changelog since V4
    o Added reviewed-bys, credited Andrea properly for sync-light
    o Allow dirty pages without mappings to be considered for migration
    o Bound the number of pages freed for compaction
    o Isolate PageReclaim pages on their own LRU list

    This is against 3.2-rc5 and follows on from discussions on "mm: Do
    not stall in synchronous compaction for THP allocations" and "[RFC
    PATCH 0/5] Reduce compaction-related stalls". Initially, the proposed
    patch eliminated stalls due to compaction which sometimes resulted in
    user-visible interactivity problems on browsers by simply never using
    sync compaction. The downside was that THP allocation success rates
    were lower because dirty pages were not being migrated, as reported by
    Andrea. His approach to fixing this was nacked on the grounds that
    it reverted fixes from Rik that had been merged to reduce the number of
    pages reclaimed, as doing so severely impacted his workload's performance.

    This series attempts to reconcile the requirements of maximising THP
    usage, without stalling in a user-visible fashion due to compaction
    or cheating by reclaiming an excessive number of pages.

    Patch 1 partially reverts commit 39deaf85 to allow migration to isolate
    dirty pages. This is because migration can move some dirty
    pages without blocking.

    Patch 2 notes that the /proc/sys/vm/compact_memory handler is not using
    synchronous compaction when it should be. This is unrelated
    to the reported stalls but is worth fixing.

    Patch 3 checks if we isolated a compound page during lumpy scan and
    account for it properly. For the most part, this affects
    tracing so it's unrelated to the stalls but worth fixing.

    Patch 4 notes that it is possible to abort reclaim early for compaction
    and return 0 to the page allocator potentially entering the
    "may oom" path. This has not been observed in practice but
    the rest of the series potentially makes it easier to happen.

    Patch 5 adds a sync parameter to the migratepage callback and gives
    the callback responsibility for migrating the page without
    blocking if sync==false. For example, fallback_migrate_page
    will not call writepage if sync==false. This increases the
    number of pages that can be handled by asynchronous compaction
    thereby reducing stalls.

    Patch 6 restores filter-awareness to isolate_lru_page for migration.
    In practice, it means that pages under writeback and pages
    without a ->migratepage callback will not be isolated
    for migration.

    Patch 7 avoids calling direct reclaim if compaction is deferred but
    makes sure that compaction is only deferred if sync
    compaction was used.

    Patch 8 introduces a sync-light migration mechanism that sync compaction
    uses. The objective is to allow some stalls but to avoid calling
    ->writepage, which can lead to significant user-visible stalls (a
    sketch of this idea follows the patch list).

    Patch 9 notes that while we want to abort reclaim ASAP to allow
    compaction to go ahead, we currently leave only a very small window
    of opportunity for compaction to run. This patch allows more pages
    to be freed by reclaim but bounds the number to a reasonable
    level based on the high watermark on each zone.

    Patch 10 allows slabs to be shrunk even after compaction_ready() is
    true for one zone. This is to avoid a problem whereby a single
    small zone can abort reclaim even though no pages have been
    reclaimed and no suitably large zone is in a usable state.

    Patch 11 fixes a problem with the rate of page scanning. As reclaim is
    rarely stalling on pages under writeback it means that scan
    rates are very high. This is particularly true for direct
    reclaim which is not calling writepage. The vmstat figures
    implied that much of this was busy work with PageReclaim pages
    marked for immediate reclaim. This patch is a prototype that
    moves these pages to their own LRU list.
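
    As mentioned in the Patch 8 entry above, the sync-light idea amounts to
    a three-level migration mode rather than a simple on/off sync flag. The
    sketch below uses the names of the enum migrate_mode that later landed
    in mainline; treat the exact semantics as illustrative:

        enum migrate_mode {
                MIGRATE_ASYNC,          /* never block */
                MIGRATE_SYNC_LIGHT,     /* allow some blocking, but never
                                           call ->writepage */
                MIGRATE_SYNC,           /* may block on anything, including
                                           page writeback */
        };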

    This has been tested and other than 2 USB keys getting trashed,
    nothing horrible fell out. That said, I am a bit unhappy with the
    rescue logic in patch 11 but did not find a better way around it. It
    does significantly reduce scan rates and System CPU time indicating
    it is the right direction to take.

    What is of critical importance is that stalls due to compaction
    are massively reduced even though sync compaction was still
    allowed. Testing from people complaining about stalls while copying
    to USB sticks with THP enabled is particularly welcome.

    The following tests all involve THP usage and USB keys in some
    way. Each test follows this type of pattern

    1. Read from some fast storage, be it a raw device or a file. Each time
    the copy finishes, start again until the test ends
    2. Write a large file to a filesystem on a USB stick. Each time the copy
    finishes, start again until the test ends
    3. When memory is low, start an alloc process that creates a mapping
    the size of physical memory to stress THP allocation. This is the
    "real" part of the test and the part that is meant to trigger
    stalls when THP is enabled. Copying continues in the background.
    4. Record the CPU usage and time to execute of the alloc process
    5. Record the number of THP allocs and fallbacks as well as the number of THP
    pages in use at the end of the test just before alloc exited
    6. Run the test 5 times to get an idea of variability
    7. Between each run, sync is run and caches dropped and the test
    waits until nr_dirty is a small number to avoid interference
    or caching between iterations that would skew the figures.

    The individual tests were then

    writebackCPDeviceBasevfat
    Disable THP, read from a raw device (sda), vfat on USB stick
    writebackCPDeviceBaseext4
    Disable THP, read from a raw device (sda), ext4 on USB stick
    writebackCPDevicevfat
    THP enabled, read from a raw device (sda), vfat on USB stick
    writebackCPDeviceext4
    THP enabled, read from a raw device (sda), ext4 on USB stick
    writebackCPFilevfat
    THP enabled, read from a file on fast storage and USB, both vfat
    writebackCPFileext4
    THP enabled, read from a file on fast storage and USB, both ext4

    The kernels tested were

    3.1 3.1
    vanilla 3.2-rc5
    freemore Patches 1-10
    immediate Patches 1-11
    andrea The 8 patches Andrea posted as a basis of comparison

    The results are very long unfortunately. I'll start with the case
    where we are not using THP at all

    writebackCPDeviceBasevfat
    3.1.0-vanilla rc5-vanilla freemore-v6r1 isolate-v6r1 andrea-v2r1
    System Time 1.28 ( 0.00%) 54.49 (-4143.46%) 48.63 (-3687.69%) 4.69 ( -265.11%) 51.88 (-3940.81%)
    +/- 0.06 ( 0.00%) 2.45 (-4305.55%) 4.75 (-8430.57%) 7.46 (-13282.76%) 4.76 (-8440.70%)
    User Time 0.09 ( 0.00%) 0.05 ( 40.91%) 0.06 ( 29.55%) 0.07 ( 15.91%) 0.06 ( 27.27%)
    +/- 0.02 ( 0.00%) 0.01 ( 45.39%) 0.02 ( 25.07%) 0.00 ( 77.06%) 0.01 ( 52.24%)
    Elapsed Time 110.27 ( 0.00%) 56.38 ( 48.87%) 49.95 ( 54.70%) 11.77 ( 89.33%) 53.43 ( 51.54%)
    +/- 7.33 ( 0.00%) 3.77 ( 48.61%) 4.94 ( 32.63%) 6.71 ( 8.50%) 4.76 ( 35.03%)
    THP Active 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
    +/- 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
    Fault Alloc 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
    +/- 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
    Fault Fallback 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
    +/- 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)

    The THP figures are obviously all 0 because THP was disabled. The
    main thing to watch is the elapsed times and how they compare to
    times when THP is enabled later. It's also important to note that
    elapsed time is improved by this series as System CPU time is much
    reduced.

    writebackCPDevicevfat

    3.1.0-vanilla rc5-vanilla freemore-v6r1 isolate-v6r1 andrea-v2r1
    System Time 1.22 ( 0.00%) 13.89 (-1040.72%) 46.40 (-3709.20%) 4.44 ( -264.37%) 47.37 (-3789.33%)
    +/- 0.06 ( 0.00%) 22.82 (-37635.56%) 3.84 (-6249.44%) 6.48 (-10618.92%) 6.60 (-10818.53%)
    User Time 0.06 ( 0.00%) 0.06 ( -6.90%) 0.05 ( 17.24%) 0.05 ( 13.79%) 0.04 ( 31.03%)
    +/- 0.01 ( 0.00%) 0.01 ( 33.33%) 0.01 ( 33.33%) 0.01 ( 39.14%) 0.01 ( 25.46%)
    Elapsed Time 10445.54 ( 0.00%) 2249.92 ( 78.46%) 70.06 ( 99.33%) 16.59 ( 99.84%) 472.43 ( 95.48%)
    +/- 643.98 ( 0.00%) 811.62 ( -26.03%) 10.02 ( 98.44%) 7.03 ( 98.91%) 59.99 ( 90.68%)
    THP Active 15.60 ( 0.00%) 35.20 ( 225.64%) 65.00 ( 416.67%) 70.80 ( 453.85%) 62.20 ( 398.72%)
    +/- 18.48 ( 0.00%) 51.29 ( 277.59%) 15.99 ( 86.52%) 37.91 ( 205.18%) 22.02 ( 119.18%)
    Fault Alloc 121.80 ( 0.00%) 76.60 ( 62.89%) 155.40 ( 127.59%) 181.20 ( 148.77%) 286.60 ( 235.30%)
    +/- 73.51 ( 0.00%) 61.11 ( 83.12%) 34.89 ( 47.46%) 31.88 ( 43.36%) 68.13 ( 92.68%)
    Fault Fallback 881.20 ( 0.00%) 926.60 ( -5.15%) 847.60 ( 3.81%) 822.00 ( 6.72%) 716.60 ( 18.68%)
    +/- 73.51 ( 0.00%) 61.26 ( 16.67%) 34.89 ( 52.54%) 31.65 ( 56.94%) 67.75 ( 7.84%)
    MMTests Statistics: duration
    User/Sys Time Running Test (seconds) 3540.88 1945.37 716.04 64.97 1937.03
    Total Elapsed Time (seconds) 52417.33 11425.90 501.02 230.95 2520.28

    The first thing to note is the "Elapsed Time" for the vanilla kernel
    of 2249 seconds versus 56 with THP disabled, which might explain the
    reports of USB stalls with THP enabled. Applying the patches brings
    performance in line with THP-disabled performance while isolating
    pages for immediate reclaim from the LRU cuts down System CPU time.

    The "Fault Alloc" success rate figures are also improved. The vanilla
    kernel only managed to allocate 76.6 pages on average over the course
    of 5 iterations where as applying the series allocated 181.20 on
    average albeit it is well within variance. It's worth noting that
    applies the series at least descreases the amount of variance which
    implies an improvement.

    Andrea's series had a higher success rate for THP allocations but
    at a severe cost to elapsed time, which is still better than vanilla
    but much worse than disabling THP altogether. One can bring my
    series close to Andrea's by removing this check

    /*
     * If compaction is deferred for high-order allocations, it is because
     * sync compaction recently failed. If this is the case and the caller
     * has requested the system not be heavily disrupted, fail the
     * allocation now instead of entering direct reclaim
     */
    if (deferred_compaction && (gfp_mask & __GFP_NO_KSWAPD))
            goto nopage;

    I didn't include a patch that removed the above check because hurting
    overall performance to improve the THP figure is not what the average
    user wants. It's something to consider though if someone really wants
    to maximise THP usage no matter what it does to the workload initially.

    This is summary of vmstat figures from the same test.

    3.1.0-vanilla rc5-vanilla freemore-v6r1 isolate-v6r1 andrea-v2r1
    Page Ins 3257266139 1111844061 17263623 10901575 161423219
    Page Outs 81054922 30364312 3626530 3657687 8753730
    Swap Ins 3294 2851 6560 4964 4592
    Swap Outs 390073 528094 620197 790912 698285
    Direct pages scanned 1077581700 3024951463 1764930052 115140570 5901188831
    Kswapd pages scanned 34826043 7112868 2131265 1686942 1893966
    Kswapd pages reclaimed 28950067 4911036 1246044 966475 1497726
    Direct pages reclaimed 805148398 280167837 3623473 2215044 40809360
    Kswapd efficiency 83% 69% 58% 57% 79%
    Kswapd velocity 664.399 622.521 4253.852 7304.360 751.490
    Direct efficiency 74% 9% 0% 1% 0%
    Direct velocity 20557.737 264745.137 3522673.849 498551.938 2341481.435
    Percentage direct scans 96% 99% 99% 98% 99%
    Page writes by reclaim 722646 529174 620319 791018 699198
    Page writes file 332573 1080 122 106 913
    Page writes anon 390073 528094 620197 790912 698285
    Page reclaim immediate 0 2552514720 1635858848 111281140 5478375032
    Page rescued immediate 0 0 0 87848 0
    Slabs scanned 23552 23552 9216 8192 9216
    Direct inode steals 231 0 0 0 0
    Kswapd inode steals 0 0 0 0 0
    Kswapd skipped wait 28076 786 0 61 6
    THP fault alloc 609 383 753 906 1433
    THP collapse alloc 12 6 0 0 6
    THP splits 536 211 456 593 1136
    THP fault fallback 4406 4633 4263 4110 3583
    THP collapse fail 120 127 0 0 4
    Compaction stalls 1810 728 623 779 3200
    Compaction success 196 53 60 80 123
    Compaction failures 1614 675 563 699 3077
    Compaction pages moved 193158 53545 243185 333457 226688
    Compaction move failure 9952 9396 16424 23676 45070

    The main things to look at are

    1. Page In/out figures are much reduced by the series.

    2. Direct page scanning is incredibly high (264745.137 pages scanned
    per second on the vanilla kernel) but isolating PageReclaim pages
    on their own list reduces the number of pages scanned significantly.

    3. The fact that "Page rescued immediate" is a positive number implies
    that we sometimes race removing pages from the LRU_IMMEDIATE list
    that need to be put back on a normal LRU but it happens only for
    0.07% of the pages marked for immediate reclaim.

    writebackCPDeviceext4
    3.1.0-vanilla rc5-vanilla freemore-v6r1 isolate-v6r1 andrea-v2r1
    System Time 1.51 ( 0.00%) 1.77 ( -17.66%) 1.46 ( 2.92%) 1.15 ( 23.77%) 1.89 ( -25.63%)
    +/- 0.27 ( 0.00%) 0.67 ( -148.52%) 0.33 ( -22.76%) 0.30 ( -11.15%) 0.19 ( 30.16%)
    User Time 0.03 ( 0.00%) 0.04 ( -37.50%) 0.05 ( -62.50%) 0.07 ( -112.50%) 0.04 ( -18.75%)
    +/- 0.01 ( 0.00%) 0.02 ( -146.64%) 0.02 ( -97.91%) 0.02 ( -75.59%) 0.02 ( -63.30%)
    Elapsed Time 124.93 ( 0.00%) 114.49 ( 8.36%) 96.77 ( 22.55%) 27.48 ( 78.00%) 205.70 ( -64.65%)
    +/- 20.20 ( 0.00%) 74.39 ( -268.34%) 59.88 ( -196.48%) 7.72 ( 61.79%) 25.03 ( -23.95%)
    THP Active 161.80 ( 0.00%) 83.60 ( 51.67%) 141.20 ( 87.27%) 84.60 ( 52.29%) 82.60 ( 51.05%)
    +/- 71.95 ( 0.00%) 43.80 ( 60.88%) 26.91 ( 37.40%) 59.02 ( 82.03%) 52.13 ( 72.45%)
    Fault Alloc 471.40 ( 0.00%) 228.60 ( 48.49%) 282.20 ( 59.86%) 225.20 ( 47.77%) 388.40 ( 82.39%)
    +/- 88.07 ( 0.00%) 87.42 ( 99.26%) 73.79 ( 83.78%) 109.62 ( 124.47%) 82.62 ( 93.81%)
    Fault Fallback 531.60 ( 0.00%) 774.60 ( -45.71%) 720.80 ( -35.59%) 777.80 ( -46.31%) 614.80 ( -15.65%)
    +/- 88.07 ( 0.00%) 87.26 ( 0.92%) 73.79 ( 16.22%) 109.62 ( -24.47%) 82.29 ( 6.56%)
    MMTests Statistics: duration
    User/Sys Time Running Test (seconds) 50.22 33.76 30.65 24.14 128.45
    Total Elapsed Time (seconds) 1113.73 1132.19 1029.45 759.49 1707.26

    Similar test but the USB stick is using ext4 instead of vfat. As
    ext4 does not use writepage for migration, the large stalls due to
    compaction when THP is enabled are not observed. Still, isolating
    PageReclaim pages on their own list helped completion time largely
    by reducing the number of pages scanned by direct reclaim, although
    time spent in congestion_wait could also be a factor.

    Again, Andrea's series had far higher success rates for THP allocation
    at the cost of elapsed time. I didn't look too closely but a quick
    look at the vmstat figures tells me kswapd reclaimed 8 times more pages
    than the patch series and direct reclaim reclaimed roughly three times
    as many pages. It follows that if memory is aggressively reclaimed,
    there will be more available for THP.

    writebackCPFilevfat
    3.1.0-vanilla rc5-vanilla freemore-v6r1 isolate-v6r1 andrea-v2r1
    System Time 1.76 ( 0.00%) 29.10 (-1555.52%) 46.01 (-2517.18%) 4.79 ( -172.35%) 54.89 (-3022.53%)
    +/- 0.14 ( 0.00%) 25.61 (-18185.17%) 2.15 (-1434.83%) 6.60 (-4610.03%) 9.75 (-6863.76%)
    User Time 0.05 ( 0.00%) 0.07 ( -45.83%) 0.05 ( -4.17%) 0.06 ( -29.17%) 0.06 ( -16.67%)
    +/- 0.02 ( 0.00%) 0.02 ( 20.11%) 0.02 ( -3.14%) 0.01 ( 31.58%) 0.01 ( 47.41%)
    Elapsed Time 22520.79 ( 0.00%) 1082.85 ( 95.19%) 73.30 ( 99.67%) 32.43 ( 99.86%) 291.84 ( 98.70%)
    +/- 7277.23 ( 0.00%) 706.29 ( 90.29%) 19.05 ( 99.74%) 17.05 ( 99.77%) 125.55 ( 98.27%)
    THP Active 83.80 ( 0.00%) 12.80 ( 15.27%) 15.60 ( 18.62%) 13.00 ( 15.51%) 0.80 ( 0.95%)
    +/- 66.81 ( 0.00%) 20.19 ( 30.22%) 5.92 ( 8.86%) 15.06 ( 22.54%) 1.17 ( 1.75%)
    Fault Alloc 171.00 ( 0.00%) 67.80 ( 39.65%) 97.40 ( 56.96%) 125.60 ( 73.45%) 133.00 ( 77.78%)
    +/- 82.91 ( 0.00%) 30.69 ( 37.02%) 53.91 ( 65.02%) 55.05 ( 66.40%) 21.19 ( 25.56%)
    Fault Fallback 832.00 ( 0.00%) 935.20 ( -12.40%) 906.00 ( -8.89%) 877.40 ( -5.46%) 870.20 ( -4.59%)
    +/- 82.91 ( 0.00%) 30.69 ( 62.98%) 54.01 ( 34.86%) 55.05 ( 33.60%) 20.91 ( 74.78%)
    MMTests Statistics: duration
    User/Sys Time Running Test (seconds) 7229.81 928.42 704.52 80.68 1330.76
    Total Elapsed Time (seconds) 112849.04 5618.69 571.11 360.54 1664.28

    In this case, the test is reading/writing only from filesystems but as
    it's vfat, it's slow due to calling writepage during compaction. Little
    to observe really - the time to complete the test goes way down
    with the series applied and THP allocation success rates go up in
    comparison to 3.2-rc5. The success rates are lower than 3.1.0 but
    the elapsed time for that kernel is abysmal so it is not really a
    sensible comparison.

    As before, Andrea's series allocates more THPs at the cost of overall
    performance.

    writebackCPFileext4
    3.1.0-vanilla rc5-vanilla freemore-v6r1 isolate-v6r1 andrea-v2r1
    System Time 1.51 ( 0.00%) 1.77 ( -17.66%) 1.46 ( 2.92%) 1.15 ( 23.77%) 1.89 ( -25.63%)
    +/- 0.27 ( 0.00%) 0.67 ( -148.52%) 0.33 ( -22.76%) 0.30 ( -11.15%) 0.19 ( 30.16%)
    User Time 0.03 ( 0.00%) 0.04 ( -37.50%) 0.05 ( -62.50%) 0.07 ( -112.50%) 0.04 ( -18.75%)
    +/- 0.01 ( 0.00%) 0.02 ( -146.64%) 0.02 ( -97.91%) 0.02 ( -75.59%) 0.02 ( -63.30%)
    Elapsed Time 124.93 ( 0.00%) 114.49 ( 8.36%) 96.77 ( 22.55%) 27.48 ( 78.00%) 205.70 ( -64.65%)
    +/- 20.20 ( 0.00%) 74.39 ( -268.34%) 59.88 ( -196.48%) 7.72 ( 61.79%) 25.03 ( -23.95%)
    THP Active 161.80 ( 0.00%) 83.60 ( 51.67%) 141.20 ( 87.27%) 84.60 ( 52.29%) 82.60 ( 51.05%)
    +/- 71.95 ( 0.00%) 43.80 ( 60.88%) 26.91 ( 37.40%) 59.02 ( 82.03%) 52.13 ( 72.45%)
    Fault Alloc 471.40 ( 0.00%) 228.60 ( 48.49%) 282.20 ( 59.86%) 225.20 ( 47.77%) 388.40 ( 82.39%)
    +/- 88.07 ( 0.00%) 87.42 ( 99.26%) 73.79 ( 83.78%) 109.62 ( 124.47%) 82.62 ( 93.81%)
    Fault Fallback 531.60 ( 0.00%) 774.60 ( -45.71%) 720.80 ( -35.59%) 777.80 ( -46.31%) 614.80 ( -15.65%)
    +/- 88.07 ( 0.00%) 87.26 ( 0.92%) 73.79 ( 16.22%) 109.62 ( -24.47%) 82.29 ( 6.56%)
    MMTests Statistics: duration
    User/Sys Time Running Test (seconds) 50.22 33.76 30.65 24.14 128.45
    Total Elapsed Time (seconds) 1113.73 1132.19 1029.45 759.49 1707.26

    Same type of story - elapsed times go down. In this case, allocation
    success rates are roughly the same. As before, Andrea's has higher
    success rates but takes a lot longer.

    Overall the series does reduce latencies, and while the tests are
    inherently racy as alloc competes with the cp processes, the variability
    was included in the reported figures. The THP allocation rates are not
    as high as they could be, but that is because being more aggressive
    about reclaim and compaction would impact overall performance.

    This patch:

    Commit 39deaf85 ("mm: compaction: make isolate_lru_page() filter-aware")
    noted that compaction does not migrate dirty or writeback pages and that
    it was meaningless to pick the page and re-add it to the LRU list.

    What was missed during review is that asynchronous migration moves dirty
    pages if their ->migratepage callback is migrate_page() because these can
    be moved without blocking. This potentially impacted hugepage allocation
    success rates by a factor depending on how many dirty pages are in the
    system.

    This patch partially reverts 39deaf85 to allow migration to isolate dirty
    pages again. This increases how much compaction disrupts the LRU but that
    is addressed later in the series.

    Signed-off-by: Mel Gorman
    Reviewed-by: Andrea Arcangeli
    Reviewed-by: Rik van Riel
    Reviewed-by: Minchan Kim
    Cc: Dave Jones
    Cc: Jan Kara
    Cc: Andy Isaacson
    Cc: Nai Xia
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • In trace_mm_vmscan_lru_isolate(), we don't output 'file' information to
    the trace event, and it is a bit inconvenient for the user to get the
    real information (like pasted below). mm_vmscan_lru_isolate:
    isolate_mode=2 order=0 nr_requested=32 nr_scanned=32 nr_taken=32
    contig_taken=0 contig_dirty=0 contig_failed=0

    'active' can be obtained by analyzing the isolate mode (thanks go to
    Minchan and Mel), so this patch adds 'file' to the trace event and it
    now looks like: mm_vmscan_lru_isolate: isolate_mode=2 order=0
    nr_requested=32 nr_scanned=32 nr_taken=32 contig_taken=0 contig_dirty=0
    contig_failed=0 file=0

    Signed-off-by: Tao Ma
    Acked-by: KOSAKI Motohiro
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: Mel Gorman
    Reviewed-by: Minchan Kim
    Cc: Rik van Riel
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tao Ma
     
  • When an isolated hugepage is split, put its tail subpages at the head of
    the lru reclaim list, as they should presumably be isolated next.

    For non-isolated hugepages under splitting, queue the subpages in
    physical order in the lru. That might provide some theoretical cache
    benefit to the buddy allocator later.

    Signed-off-by: Shaohua Li
    Signed-off-by: Andrea Arcangeli
    Cc: David Rientjes
    Cc: Johannes Weiner
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shaohua Li
     
  • We have tlb_remove_tlb_entry() to indicate that a pte entry should be
    flushed from the TLB, but no corresponding API for a pmd entry. This
    isn't a problem so far because THP is currently x86-only and tlb_flush()
    on x86 flushes the entire TLB. But it is confusing and could be missed
    if THP is ported to another arch.

    Also convert tlb->need_flush = 1 to a VM_BUG_ON(!tlb->need_flush) in
    __tlb_remove_page() as suggested by Andrea Arcangeli. The
    __tlb_remove_page() function is supposed to be called after
    tlb_remove_xxx_tlb_entry() and we can catch any misuse.
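
    A hedged sketch of the new helper described above; the exact macro body
    is illustrative of the pattern rather than a quote of the merged code:

        /*
         * pmd counterpart to tlb_remove_tlb_entry(); __tlb_remove_page()
         * would then start with VM_BUG_ON(!tlb->need_flush) to catch
         * callers that forgot to use one of these helpers first.
         */
        #define tlb_remove_pmd_tlb_entry(tlb, pmdp, address)            \
                do {                                                    \
                        (tlb)->need_flush = 1;                          \
                        __tlb_remove_pmd_tlb_entry(tlb, pmdp, address); \
                } while (0)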

    Signed-off-by: Shaohua Li
    Reviewed-by: Andrea Arcangeli
    Cc: David Rientjes
    Cc: Johannes Weiner
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shaohua Li
     
  • change_protection() will do a TLB flush later, so there is no need for a
    duplicate TLB flush here.

    Signed-off-by: Shaohua Li
    Reviewed-by: Andrea Arcangeli
    Cc: David Rientjes
    Cc: Johannes Weiner
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shaohua Li
     
  • Improve the error code path, for example by deleting the unnecessary
    sysfs file. Also remove the #ifdef xxx to make the code cleaner.

    Signed-off-by: Shaohua Li
    Reviewed-by: Andrea Arcangeli
    Cc: David Rientjes
    Cc: Johannes Weiner
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shaohua Li
     
  • No need for two CONFIG_MEMORY_HOTPLUG blocks.

    Signed-off-by: Bob Liu
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Acked-by: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Bob Liu
     
  • If a zone below ZONE_NORMAL has present_pages, we can set the node
    state to N_NORMAL_MEMORY; there is no need to loop to the end.
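
    A minimal sketch of the idea, assuming a helper along the lines of the
    memory-hotplug check_for_regular_memory() walk; treat the names as
    illustrative:

        static void check_for_regular_memory(pg_data_t *pgdat)
        {
        #ifdef CONFIG_HIGHMEM
                enum zone_type zone_type;

                for (zone_type = 0; zone_type <= ZONE_NORMAL; zone_type++) {
                        struct zone *zone = &pgdat->node_zones[zone_type];

                        if (zone->present_pages) {
                                node_set_state(zone_to_nid(zone), N_NORMAL_MEMORY);
                                break;  /* one populated zone is enough */
                        }
                }
        #endif
        }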

    Signed-off-by: Bob Liu
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Acked-by: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Bob Liu
     
  • We already have for_each_node(node) defined in nodemask.h; better to use
    it.

    Signed-off-by: Bob Liu
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Acked-by: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Bob Liu
     
  • Currently, at LRU handling, the memory cgroup code needs to do complicated
    work to see a valid pc->mem_cgroup, which may be overwritten.

    This patch relaxes the protocol. It guarantees that
    - when pc->mem_cgroup is overwritten, the page must not be on an LRU.

    With this, the LRU routines can trust pc->mem_cgroup and don't need to
    check bits on pc->flags. The new rule may add a small overhead to swapin,
    but in most cases lru handling gets faster.

    After this patch, the PCG_ACCT_LRU bit is obsolete and removed.

    [akpm@linux-foundation.org: remove unneeded VM_BUG_ON(), restore hannes's christmas tree]
    [akpm@linux-foundation.org: clean up code comment]
    [hughd@google.com: fix NULL mem_cgroup_try_charge]
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Miklos Szeredi
    Acked-by: Michal Hocko
    Acked-by: Johannes Weiner
    Cc: Ying Han
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • This is a preparation before removing the PCG_ACCT_LRU flag from
    page_cgroup and reducing atomic ops/complexity in memcg LRU handling.

    In some cases, pages are added to the lru before being charged to a
    memcg and are not classified to a memory cgroup at lru addition. Now,
    the lru where the page should be added is determined by a bit in
    page_cgroup->flags and by pc->mem_cgroup. I'd like to remove the check
    of the flag.

    In that case, pc->mem_cgroup may contain stale pointers if pages
    are added to the LRU before classification. This patch resets
    pc->mem_cgroup to root_mem_cgroup before lru additions.

    [akpm@linux-foundation.org: fix CONFIG_CGROUP_MEM_CONT=n build]
    [hughd@google.com: fix CONFIG_CGROUP_MEM_RES_CTLR=y CONFIG_CGROUP_MEM_RES_CTLR_SWAP=n build]
    [akpm@linux-foundation.org: ksm.c needs memcontrol.h, per Michal]
    [hughd@google.com: stop oops in mem_cgroup_reset_owner()]
    [hughd@google.com: fix page migration to reset_owner]
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Miklos Szeredi
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Cc: Ying Han
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • This patch simplifies LRU handling of the racy case (memcg+SwapCache). At
    charging, SwapCache tends to be on the LRU already. So, before overwriting
    pc->mem_cgroup, the page must be removed from the LRU and added back to
    the LRU later.

    This patch does

        spin_lock(zone->lru_lock);
        if (PageLRU(page))
                remove from LRU
        overwrite pc->mem_cgroup
        if (PageLRU(page))
                add to new LRU.
        spin_unlock(zone->lru_lock);

    And guarantee all pages are not on the LRU while modifying pc->mem_cgroup.
    This patch also unifies the lru handling of replace_page_cache() and
    swapin.

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Miklos Szeredi
    Acked-by: Michal Hocko
    Acked-by: Johannes Weiner
    Cc: Ying Han
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • This patch is a clean up. No functional/logical changes.

    Because of commit ef6a3c6311 ("mm: add replace_page_cache_page()
    function"), FUSE uses replace_page_cache() instead of
    add_to_page_cache(). Then, mem_cgroup_cache_charge() is not called
    against FUSE's pages from splice.

    So now, mem_cgroup_cache_charge() gets pages that are not on the LRU
    with the exception of PageSwapCache pages. For checking,
    WARN_ON_ONCE(PageLRU(page)) is added.

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Miklos Szeredi
    Acked-by: Michal Hocko
    Acked-by: Johannes Weiner
    Cc: Ying Han
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • The oom killer relies on logic that identifies threads that have already
    been oom killed when scanning the tasklist and, if found, deferring
    until such threads have exited. This is done by checking for any
    candidate threads that have the TIF_MEMDIE bit set.

    For memcg ooms, candidate threads are first found by calling
    task_in_mem_cgroup() since the oom killer should not defer if there's an
    oom killed thread in another memcg.

    Unfortunately, task_in_mem_cgroup() excludes threads if they have
    detached their mm in the process of exiting so TIF_MEMDIE is never
    detected for such conditions. This is different for global, mempolicy,
    and cpuset oom conditions where a detached mm is only excluded after
    checking for TIF_MEMDIE and deferring, if necessary, in
    select_bad_process().

    The fix is to return true if a task has a detached mm but is still in
    the memcg or its hierarchy that is currently oom. This will allow the
    oom killer to appropriately defer rather than kill unnecessarily or, in
    the worst case, panic the machine if nothing else is available to kill.
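
    A hedged sketch of the shape of the fix in task_in_mem_cgroup(): when
    find_lock_task_mm() finds no mm, fall back to the task's own memcg
    instead of excluding the task. Details are illustrative:

        p = find_lock_task_mm(task);
        if (p) {
                curr = try_get_mem_cgroup_from_mm(p->mm);
                task_unlock(p);
        } else {
                /*
                 * The task has already detached its mm, but it may have
                 * been oom killed (TIF_MEMDIE) and be exiting, so keep it
                 * visible to the memcg oom killer so it can defer instead
                 * of killing again.
                 */
                task_lock(task);
                curr = mem_cgroup_from_task(task);
                if (curr)
                        css_get(&curr->css);
                task_unlock(task);
        }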

    Signed-off-by: David Rientjes
    Acked-by: KOSAKI Motohiro
    Cc: Johannes Weiner
    Cc: Balbir Singh
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • If we are not able to allocate tree nodes for all NUMA nodes then we
    should release those that were allocated.

    Signed-off-by: Michal Hocko
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Cc: Balbir Singh
    Cc: David Rientjes
    Cc: Andrea Arcangeli
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • There are multiple places which need to get the swap_cgroup address, so
    add a helper function:

        static struct swap_cgroup *swap_cgroup_getsc(swp_entry_t ent,
                                        struct swap_cgroup_ctrl **ctrl);

    to simplify the code.

    Signed-off-by: Bob Liu
    Acked-by: Michal Hocko
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Bob Liu
     
  • mem_cgroup_uncharge_page() is only called on either freshly allocated
    pages without page->mapping or on rmapped PageAnon() pages. There is no
    need to check for a page->mapping that is not an anon_vma.

    Signed-off-by: Johannes Weiner
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Michal Hocko
    Cc: Balbir Singh
    Cc: David Rientjes
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • All callsites pass in freshly allocated pages and a valid mm. As a
    result, all checks pertaining to the page's mapcount, page->mapping or the
    fallback to init_mm are unneeded.

    Signed-off-by: Johannes Weiner
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Michal Hocko
    Cc: David Rientjes
    Cc: Balbir Singh
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • lookup_page_cgroup() is usually used only against pages that are used in
    userspace.

    The exception is the CONFIG_DEBUG_VM-only memcg check from the page
    allocator: it can run on pages without page_cgroup descriptors allocated
    when the pages are fed into the page allocator for the first time during
    boot or memory hotplug.

    Include the array check only when CONFIG_DEBUG_VM is set and save the
    unnecessary check in production kernels.
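
    A sketch of the resulting sparsemem lookup, with the array check
    compiled in only under CONFIG_DEBUG_VM; field names follow the
    page_cgroup code of that era but treat the details as illustrative:

        struct page_cgroup *lookup_page_cgroup(struct page *page)
        {
                unsigned long pfn = page_to_pfn(page);
                struct mem_section *section = __pfn_to_section(pfn);
        #ifdef CONFIG_DEBUG_VM
                /*
                 * The page allocator's sanity checks can run before the
                 * page_cgroup arrays are allocated, during boot or hotplug.
                 */
                if (!section->page_cgroup)
                        return NULL;
        #endif
                return section->page_cgroup + pfn;
        }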

    Signed-off-by: Johannes Weiner
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Michal Hocko
    Cc: Balbir Singh
    Cc: David Rientjes
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Pages have their corresponding page_cgroup descriptors set up before
    they are used in userspace, and thus managed by a memory cgroup.

    The only time where lookup_page_cgroup() can return NULL is in the
    CONFIG_DEBUG_VM-only page sanity checking code that executes while
    feeding pages into the page allocator for the first time.

    Remove the NULL checks against lookup_page_cgroup() results from all
    callsites where we know that corresponding page_cgroup descriptors must
    be allocated, and add a comment to the callsite that actually does have
    to check the return value.

    [hughd@google.com: stop oops in mem_cgroup_update_page_stat()]
    Signed-off-by: Johannes Weiner
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Michal Hocko
    Cc: Balbir Singh
    Cc: David Rientjes
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • The fault accounting functions have a single, memcg-internal user, so they
    don't need to be global. In fact, their one-line bodies can be directly
    folded into the caller. And since faults happen one at a time, use
    this_cpu_inc() directly instead of this_cpu_add(foo, 1).

    Signed-off-by: Johannes Weiner
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Michal Hocko
    Acked-by: Balbir Singh
    Cc: David Rientjes
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Signed-off-by: Johannes Weiner
    Acked-by: David Rientjes
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Michal Hocko
    Cc: Balbir Singh
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • The memcg argument of oom_kill_task() hasn't been used since 341aea2
    'oom-kill: remove boost_dying_task_prio()'. Kill it.

    Signed-off-by: Johannes Weiner
    Acked-by: David Rientjes
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Michal Hocko
    Cc: Balbir Singh
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • The two memcg stats pgpgin/pgpgout have different meaning than the ones
    in vmstat, which indicates that we picked a bad naming for them.

    It might be late to change the stat name, but better documentation is
    always helpful.

    Signed-off-by: Ying Han
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Johannes Weiner
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ying Han
     
  • It should be memsw.max_usage_in_bytes. This typo has been there for
    a really long time.

    Signed-off-by: Zhu Yanhai
    Acked-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zhu Yanhai
     
  • Only the ratelimit checks themselves have to run with preemption
    disabled, the resulting actions - checking for usage thresholds,
    updating the soft limit tree - can and should run with preemption
    enabled.
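
    A simplified, hedged sketch of the resulting structure in the event
    check path: only the ratelimit test runs with preemption disabled, and
    the heavier work runs preemptible afterwards. Helper names are
    illustrative:

        static void memcg_check_events(struct mem_cgroup *memcg, struct page *page)
        {
                bool do_softlimit;

                preempt_disable();
                if (!mem_cgroup_event_ratelimit(memcg, MEM_CGROUP_TARGET_THRESH)) {
                        preempt_enable();
                        return;
                }
                do_softlimit = mem_cgroup_event_ratelimit(memcg,
                                                MEM_CGROUP_TARGET_SOFTLIMIT);
                preempt_enable();

                /* The expensive parts run with preemption enabled. */
                mem_cgroup_threshold(memcg);
                if (do_softlimit)
                        mem_cgroup_update_tree(memcg, page);
        }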

    Signed-off-by: Johannes Weiner
    Reported-by: Yong Zhang
    Tested-by: Yong Zhang
    Reported-by: Luis Henriques
    Tested-by: Luis Henriques
    Cc: Thomas Gleixner
    Cc: Steven Rostedt
    Cc: Peter Zijlstra
    Acked-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • In split_huge_page(), mem_cgroup_split_huge_fixup() is called to handle
    page_cgroup modifcations. It takes move_lock_page_cgroup() and modifies
    page_cgroup and LRU accounting jobs and called HPAGE_PMD_SIZE - 1 times.

    But thinking again,
    - compound_lock() is held at move_accout...then, it's not necessary
    to take move_lock_page_cgroup().
    - LRU is locked and all tail pages will go into the same LRU as
    head is now on.
    - page_cgroup is contiguous in huge page range.

    This patch fixes mem_cgroup_split_huge_fixup() as to be called once per
    hugepage and reduce costs for spliting.

    [akpm@linux-foundation.org: fix typo, per Michal]
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Johannes Weiner
    Cc: Andrea Arcangeli
    Reviewed-by: Michal Hocko
    Cc: Balbir Singh
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • To find the page corresponding to a certain page_cgroup, the pc->flags
    encoded the node or section ID with the base array to compare the pc
    pointer to.

    Now that the per-memory cgroup LRU lists link page descriptors directly,
    there is no longer any code that knows the struct page_cgroup of a PFN
    but not the struct page.

    [hughd@google.com: remove unused node/section info from pc->flags fix]
    Signed-off-by: Johannes Weiner
    Reviewed-by: KAMEZAWA Hiroyuki
    Reviewed-by: Michal Hocko
    Reviewed-by: Kirill A. Shutemov
    Cc: KAMEZAWA Hiroyuki
    Cc: Michal Hocko
    Cc: "Kirill A. Shutemov"
    Cc: Daisuke Nishimura
    Cc: Balbir Singh
    Cc: Ying Han
    Cc: Greg Thelen
    Cc: Michel Lespinasse
    Cc: Rik van Riel
    Cc: Minchan Kim
    Cc: Christoph Hellwig
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Now that all code that operated on global per-zone LRU lists is
    converted to operate on per-memory cgroup LRU lists instead, there is no
    reason to keep the double-LRU scheme around any longer.

    The pc->lru member is removed and page->lru is linked directly to the
    per-memory cgroup LRU lists, which removes two pointers from a
    descriptor that exists for every page frame in the system.

    Signed-off-by: Johannes Weiner
    Signed-off-by: Hugh Dickins
    Signed-off-by: Ying Han
    Reviewed-by: KAMEZAWA Hiroyuki
    Reviewed-by: Michal Hocko
    Reviewed-by: Kirill A. Shutemov
    Cc: Daisuke Nishimura
    Cc: Balbir Singh
    Cc: Greg Thelen
    Cc: Michel Lespinasse
    Cc: Rik van Riel
    Cc: Minchan Kim
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Having a unified structure with a LRU list set for both global zones and
    per-memcg zones allows to keep that code simple which deals with LRU
    lists and does not care about the container itself.

    Once the per-memcg LRU lists directly link struct pages, the isolation
    function and all other list manipulations are shared between the memcg
    case and the global LRU case.
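
    The unified structure is essentially just the set of LRU list heads,
    embedded both in struct zone and in the per-memcg per-zone info; a
    minimal sketch consistent with the struct lruvec that was merged (minus
    any bookkeeping it later grew):

        struct lruvec {
                struct list_head lists[NR_LRU_LISTS];
        };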

    Signed-off-by: Johannes Weiner
    Reviewed-by: KAMEZAWA Hiroyuki
    Reviewed-by: Michal Hocko
    Reviewed-by: Kirill A. Shutemov
    Cc: Daisuke Nishimura
    Cc: Balbir Singh
    Cc: Ying Han
    Cc: Greg Thelen
    Cc: Michel Lespinasse
    Cc: Rik van Riel
    Cc: Minchan Kim
    Cc: Christoph Hellwig
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • The global per-zone LRU lists are about to go away on memcg-enabled
    kernels, global reclaim must be able to find its pages on the per-memcg
    LRU lists.

    Since the LRU pages of a zone are distributed over all existing memory
    cgroups, a scan target for a zone is complete when all memory cgroups
    are scanned for their proportional share of a zone's memory.

    The forced scanning of small scan targets from kswapd is limited to
    zones marked unreclaimable, otherwise kswapd can quickly overreclaim by
    force-scanning the LRU lists of multiple memory cgroups.

    Signed-off-by: Johannes Weiner
    Reviewed-by: KAMEZAWA Hiroyuki
    Reviewed-by: Michal Hocko
    Reviewed-by: Kirill A. Shutemov
    Cc: Daisuke Nishimura
    Cc: Balbir Singh
    Cc: Ying Han
    Cc: Greg Thelen
    Cc: Michel Lespinasse
    Cc: Rik van Riel
    Cc: Minchan Kim
    Cc: Christoph Hellwig
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • root_mem_cgroup, lacking a configurable limit, was never subject to
    limit reclaim, so the pages charged to it could be kept off its LRU
    lists. They would be found on the global per-zone LRU lists upon
    physical memory pressure and it made sense to avoid uselessly linking
    them to both lists.

    The global per-zone LRU lists are about to go away on memcg-enabled
    kernels, with all pages being exclusively linked to their respective
    per-memcg LRU lists. As a result, pages of the root_mem_cgroup must
    also be linked to its LRU lists again. This is purely about the LRU
    list, root_mem_cgroup is still not charged.

    The overhead is temporary, until the double-LRU scheme goes away
    completely.

    Signed-off-by: Johannes Weiner
    Reviewed-by: KAMEZAWA Hiroyuki
    Reviewed-by: Michal Hocko
    Reviewed-by: Kirill A. Shutemov
    Cc: Daisuke Nishimura
    Cc: Balbir Singh
    Cc: Ying Han
    Cc: Greg Thelen
    Cc: Michel Lespinasse
    Cc: Rik van Riel
    Cc: Minchan Kim
    Cc: Christoph Hellwig
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Memory cgroup limit reclaim and traditional global pressure reclaim will
    soon share the same code to reclaim from a hierarchical tree of memory
    cgroups.

    In preparation of this, move the two right next to each other in
    shrink_zone().

    The mem_cgroup_hierarchical_reclaim() polymath is split into a soft
    limit reclaim function, which still does hierarchy walking on its own,
    and a limit (shrinking) reclaim function, which relies on generic
    reclaim code to walk the hierarchy.

    Signed-off-by: Johannes Weiner
    Reviewed-by: KAMEZAWA Hiroyuki
    Reviewed-by: Michal Hocko
    Reviewed-by: Kirill A. Shutemov
    Cc: Daisuke Nishimura
    Cc: Balbir Singh
    Cc: Ying Han
    Cc: Greg Thelen
    Cc: Michel Lespinasse
    Cc: Rik van Riel
    Cc: Minchan Kim
    Cc: Christoph Hellwig
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Memory cgroup limit reclaim currently picks one memory cgroup out of the
    target hierarchy, remembers it as the last scanned child, and reclaims
    all zones in it with decreasing priority levels.

    The new hierarchy reclaim code will pick memory cgroups from the same
    hierarchy concurrently from different zones and priority levels, so it
    becomes necessary that hierarchy roots not only remember the last
    scanned child, but do so for each zone and priority level.

    Until now, we reclaimed memcgs like this:

        mem = mem_cgroup_iter(root)
        for each priority level:
            for each zone in zonelist:
                reclaim(mem, zone)

    But subsequent patches will move the memcg iteration inside the loop
    over the zones:

        for each priority level:
            for each zone in zonelist:
                mem = mem_cgroup_iter(root)
                reclaim(mem, zone)

    And to keep with the original scan order - memcg -> priority -> zone -
    the last scanned memcg has to be remembered per zone and per priority
    level.

    Furthermore, global reclaim will be switched to the hierarchy walk as
    well. Different from limit reclaim, which can just recheck the limit
    after some reclaim progress, its target is to scan all memcgs for the
    desired zone pages, proportional to the memcg size, and so reliably
    detecting a full hierarchy round-trip will become crucial.

    Currently, the code relies on one reclaimer encountering the same memcg
    twice, but that is error-prone with concurrent reclaimers. Instead, use
    a generation counter that is increased every time the child with the
    highest ID has been visited, so that reclaimers can stop when the
    generation changes.
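
    A hedged sketch of the per-zone, per-priority iterator state described
    above; field names are illustrative:

        struct mem_cgroup_reclaim_iter {
                /* last scanned hierarchy member (css id) */
                int position;
                /*
                 * incremented whenever the child with the highest id has
                 * been visited, i.e. one full hierarchy round-trip
                 */
                unsigned int generation;
        };

        /* one of these per zone, per reclaim priority level, kept in the
           hierarchy root's per-zone info */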

    Signed-off-by: Johannes Weiner
    Reviewed-by: Kirill A. Shutemov
    Cc: KAMEZAWA Hiroyuki
    Cc: Michal Hocko
    Cc: Daisuke Nishimura
    Cc: Balbir Singh
    Cc: Ying Han
    Cc: Greg Thelen
    Cc: Michel Lespinasse
    Cc: Rik van Riel
    Cc: Minchan Kim
    Cc: Christoph Hellwig
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Memory cgroup hierarchies are currently handled completely outside of
    the traditional reclaim code, which is invoked with a single memory
    cgroup as an argument for the whole call stack.

    Subsequent patches will switch this code to do hierarchical reclaim, so
    there needs to be a distinction between a) the memory cgroup that is
    triggering reclaim due to hitting its limit and b) the memory cgroup
    that is being scanned as a child of a).

    This patch introduces a struct mem_cgroup_zone that contains the
    combination of the memory cgroup and the zone being scanned, which is
    then passed down the stack instead of the zone argument.
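
    The structure itself is just the pair described above; a minimal sketch
    consistent with that description:

        struct mem_cgroup_zone {
                struct mem_cgroup *mem_cgroup;  /* memcg being scanned */
                struct zone *zone;              /* zone being scanned */
        };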

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Johannes Weiner
    Reviewed-by: KAMEZAWA Hiroyuki
    Reviewed-by: Michal Hocko
    Reviewed-by: Kirill A. Shutemov
    Cc: Daisuke Nishimura
    Cc: Balbir Singh
    Cc: Ying Han
    Cc: Greg Thelen
    Cc: Michel Lespinasse
    Cc: Rik van Riel
    Cc: Minchan Kim
    Cc: Christoph Hellwig
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • The traditional zone reclaim code is scanning the per-zone LRU lists
    during direct reclaim and kswapd, and the per-zone per-memory cgroup LRU
    lists when reclaiming on behalf of a memory cgroup limit.

    Subsequent patches will convert the traditional reclaim code to reclaim
    exclusively from the per-memory cgroup LRU lists. As a result, using
    the predicate for which LRU list is scanned will no longer be
    appropriate to tell global reclaim from limit reclaim.

    This patch adds a global_reclaim() predicate to tell direct/kswapd
    reclaim from memory cgroup limit reclaim and substitutes it in all
    places where currently scanning_global_lru() is used for that.
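
    A minimal sketch of the predicate, assuming the scan_control carries
    the target memcg and a NULL target means global direct/kswapd reclaim:

        static bool global_reclaim(struct scan_control *sc)
        {
        #ifdef CONFIG_CGROUP_MEM_RES_CTLR
                return !sc->target_mem_cgroup;
        #else
                return true;
        #endif
        }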

    Signed-off-by: Johannes Weiner
    Reviewed-by: KAMEZAWA Hiroyuki
    Reviewed-by: Michal Hocko
    Reviewed-by: Kirill A. Shutemov
    Cc: Daisuke Nishimura
    Cc: Balbir Singh
    Cc: Ying Han
    Cc: Greg Thelen
    Cc: Michel Lespinasse
    Cc: Rik van Riel
    Cc: Minchan Kim
    Cc: Christoph Hellwig
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner