13 Jan, 2012

3 commits

  • Mostly we use "enum lru_list lru": change those few "l"s to "lru"s.

    Signed-off-by: Hugh Dickins
    Reviewed-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Commit 39deaf85 ("mm: compaction: make isolate_lru_page() filter-aware")
    noted that compaction does not migrate dirty or writeback pages and that
    it was meaningless to pick the page and re-add it to the LRU list. This
    had to be partially reverted because some dirty pages can be migrated by
    compaction without blocking.

    This patch updates "mm: compaction: make isolate_lru_page" by skipping
    over pages that migration has no possibility of migrating to minimise LRU
    disruption.

    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Andrea Arcangeli
    Reviewed-by: Minchan Kim
    Cc: Dave Jones
    Cc: Jan Kara
    Cc: Andy Isaacson
    Cc: Nai Xia
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Having a unified structure with an LRU list set for both global zones and
    per-memcg zones allows us to keep the code that deals with LRU lists
    simple, without it having to care about the container itself.

    Once the per-memcg LRU lists directly link struct pages, the isolation
    function and all other list manipulations are shared between the memcg
    case and the global LRU case.
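    A minimal sketch of the kind of unified structure being described (the
    exact field and type names here are illustrative):

        struct lruvec {
                struct list_head lists[NR_LRU_LISTS];
        };

        /* Both the global zone and the per-memcg zone embed one of these,
         * so list manipulation helpers only ever see a struct lruvec. */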

    Signed-off-by: Johannes Weiner
    Reviewed-by: KAMEZAWA Hiroyuki
    Reviewed-by: Michal Hocko
    Reviewed-by: Kirill A. Shutemov
    Cc: Daisuke Nishimura
    Cc: Balbir Singh
    Cc: Ying Han
    Cc: Greg Thelen
    Cc: Michel Lespinasse
    Cc: Rik van Riel
    Cc: Minchan Kim
    Cc: Christoph Hellwig
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

11 Jan, 2012

1 commit

  • Per-zone dirty limits try to distribute page cache pages allocated for
    writing across zones in proportion to the individual zone sizes, to reduce
    the likelihood of reclaim having to write back individual pages from the
    LRU lists in order to make progress.

    This patch:

    The amount of dirtyable pages should not include the full number of free
    pages: there is a number of reserved pages that the page allocator and
    kswapd always try to keep free.

    The closer (reclaimable pages - dirty pages) is to the number of reserved
    pages, the more likely it becomes for reclaim to run into dirty pages:

    +----------+ ---
    |   anon   |  |
    +----------+  |
    |          |  |
    |          |  -- dirty limit new    -- flusher new
    |   file   |  |                     |
    |          |  |                     |
    |          |  -- dirty limit old    -- flusher old
    |          |                        |
    +----------+                       --- reclaim
    | reserved |
    +----------+
    |  kernel  |
    +----------+

    This patch introduces a per-zone dirty reserve that takes both the lowmem
    reserve as well as the high watermark of the zone into account, and a
    global sum of those per-zone values that is subtracted from the global
    amount of dirtyable pages. The lowmem reserve is unavailable to page
    cache allocations and kswapd tries to keep the high watermark free. We
    don't want to end up in a situation where reclaim has to clean pages in
    order to balance zones.
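    A rough sketch of the reserve calculation described above (a
    simplification, not the exact merged code):

        static unsigned long dirty_reserve_sketch(void)
        {
                struct zone *zone;
                unsigned long reserve = 0;

                for_each_zone(zone) {
                        unsigned long max_reserve = 0;
                        int i;

                        /* The largest lowmem_reserve entry is the one
                         * that binds page cache allocations. */
                        for (i = 0; i < MAX_NR_ZONES; i++)
                                max_reserve = max(max_reserve,
                                                  zone->lowmem_reserve[i]);

                        /* Pages kswapd keeps free, plus the lowmem
                         * reserve, are not dirtyable. */
                        reserve += max_reserve + high_wmark_pages(zone);
                }
                return reserve;
        }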

    Not treating reserved pages as dirtyable on a global level is only a
    conceptual fix. In reality, dirty pages are not distributed equally
    across zones and reclaim runs into dirty pages on a regular basis.

    But it is important to get this right before tackling the problem on a
    per-zone level, where the distance between reclaim and the dirty pages is
    mostly much smaller in absolute numbers.

    [akpm@linux-foundation.org: fix highmem build]
    Signed-off-by: Johannes Weiner
    Reviewed-by: Rik van Riel
    Reviewed-by: Michal Hocko
    Reviewed-by: Minchan Kim
    Acked-by: Mel Gorman
    Cc: KAMEZAWA Hiroyuki
    Cc: Christoph Hellwig
    Cc: Wu Fengguang
    Cc: Dave Chinner
    Cc: Jan Kara
    Cc: Shaohua Li
    Cc: Chris Mason
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

09 Dec, 2011

1 commit

  • Now all ARCH_POPULATES_NODE_MAP archs select HAVE_MEMBLOCK_NODE_MAP -
    there's no user of early_node_map[] left. Kill early_node_map[] and
    replace ARCH_POPULATES_NODE_MAP with HAVE_MEMBLOCK_NODE_MAP. Also,
    relocate for_each_mem_pfn_range() and helper from mm.h to memblock.h
    as page_alloc.c would no longer host an alternative implementation.

    This change is ultimately a one-to-one mapping and shouldn't cause any
    observable difference; however, after the recent changes, there are
    some functions which now would fit memblock.c better than page_alloc.c
    and dependency on HAVE_MEMBLOCK_NODE_MAP instead of HAVE_MEMBLOCK
    doesn't make much sense on some of them. Further cleanups for
    functions inside HAVE_MEMBLOCK_NODE_MAP in mm.h would be nice.

    -v2: Fix compile bug introduced by mis-spelling
    CONFIG_HAVE_MEMBLOCK_NODE_MAP to CONFIG_MEMBLOCK_HAVE_NODE_MAP in
    mmzone.h. Reported by Stephen Rothwell.

    Signed-off-by: Tejun Heo
    Cc: Stephen Rothwell
    Cc: Benjamin Herrenschmidt
    Cc: Yinghai Lu
    Cc: Tony Luck
    Cc: Ralf Baechle
    Cc: Martin Schwidefsky
    Cc: Chen Liqin
    Cc: Paul Mundt
    Cc: "David S. Miller"
    Cc: "H. Peter Anvin"

    Tejun Heo
     

01 Nov, 2011

5 commits

  • When direct reclaim encounters a dirty page, it gets recycled around the
    LRU for another cycle. This patch marks the page PageReclaim similar to
    deactivate_page() so that the page gets reclaimed almost immediately after
    the page gets cleaned. This is to avoid reclaiming clean pages that are
    younger than a dirty page encountered at the end of the LRU that might
    have been something like a use-once page.
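    A minimal sketch of the behaviour described, in the style of the reclaim
    page scan (not the exact hunk):

        /* Dirty page found at the tail of the LRU: tag it so it is moved
         * back to the tail as soon as writeback completes, rather than
         * circulating ahead of younger clean pages. */
        if (PageDirty(page)) {
                SetPageReclaim(page);
                /* ...leave it for the flushers, then rotate the page... */
        }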

    Signed-off-by: Mel Gorman
    Acked-by: Johannes Weiner
    Cc: Dave Chinner
    Cc: Christoph Hellwig
    Cc: Wu Fengguang
    Cc: Jan Kara
    Cc: Minchan Kim
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Alex Elder
    Cc: Theodore Ts'o
    Cc: Chris Mason
    Cc: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Testing from the XFS folk revealed that there is still too much I/O from
    the end of the LRU in kswapd. Previously it was considered acceptable by
    VM people for a small number of pages to be written back from reclaim with
    testing generally showing about 0.3% of pages reclaimed were written back
    (higher if memory was low). That writing back a small number of pages is
    ok has been heavily disputed for quite some time and Dave Chinner
    explained it well:

    It doesn't have to be a very high number to be a problem. IO
    is orders of magnitude slower than the CPU time it takes to
    flush a page, so the cost of making a bad flush decision is
    very high. And single page writeback from the LRU is almost
    always a bad flush decision.

    To complicate matters, filesystems respond very differently to requests
    from reclaim according to Christoph Hellwig:

    xfs tries to write it back if the requester is kswapd
    ext4 ignores the request if it's a delayed allocation
    btrfs ignores the request

    As a result, each filesystem has different performance characteristics
    when under memory pressure and there are many pages being dirtied. In
    some cases, the request is ignored entirely so the VM cannot depend on the
    IO being dispatched.

    The objective of this series is to reduce writing of filesystem-backed
    pages from reclaim, play nicely with writeback that is already in progress
    and throttle reclaim appropriately when writeback pages are encountered.
    The assumption is that the flushers will always write pages faster than if
    reclaim issues the IO.

    A secondary goal is to avoid the problem whereby direct reclaim splices
    two potentially deep call stacks together.

    There is a potential new problem as reclaim has less control over how long
    before a page in a particular zone or container is cleaned, and direct
    reclaimers depend on kswapd or flusher threads to do the necessary work.
    However, as filesystems sometimes ignore direct reclaim requests already,
    it is not expected to be a serious issue.

    Patch 1 disables writeback of filesystem pages from direct reclaim
    entirely. Anonymous pages are still written.

    Patch 2 removes dead code in lumpy reclaim as it is no longer able
    to synchronously write pages. This hurts lumpy reclaim but
    there is an expectation that compaction is used for hugepage
    allocations these days and lumpy reclaim's days are numbered.

    Patches 3-4 add warnings to XFS and ext4 if called from
    direct reclaim. With patch 1, this "never happens" and is
    intended to catch regressions in this logic in the future.

    Patch 5 disables writeback of filesystem pages from kswapd unless
    the priority is raised to the point where kswapd is considered
    to be in trouble.

    Patch 6 throttles reclaimers if too many dirty pages are being
    encountered and the zones or backing devices are congested.

    Patch 7 invalidates dirty pages found at the end of the LRU so they
    are reclaimed quickly after being written back rather than
    waiting for a reclaimer to find them

    I consider this series to be orthogonal to the writeback work but it is
    worth noting that the writeback work affects the viability of patch 8 in
    particular.

    I tested this on ext4 and xfs using fs_mark, a simple writeback test based
    on dd and a micro benchmark that does a streaming write to a large mapping
    (exercises use-once LRU logic) followed by streaming writes to a mix of
    anonymous and file-backed mappings. The command line for fs_mark when
    booted with 512M looked something like

    ./fs_mark -d /tmp/fsmark-2676 -D 100 -N 150 -n 150 -L 25 -t 1 -S0 -s 10485760

    The number of files was adjusted depending on the amount of available
    memory so that the files created totalled about 3xRAM. For multiple threads,
    the -d switch is specified multiple times.

    The test machine is x86-64 with an older generation of AMD processor with
    4 cores. The underlying storage was 4 disks configured as RAID-0 as this
    was the best configuration of storage I had available. Swap is on a
    separate disk. Dirty ratio was tuned to 40% instead of the default of
    20%.

    Testing was run with and without monitors to both verify that the patches
    were operating as expected and that any performance gain was real and not
    due to interference from monitors.

    Here is a summary of results based on testing XFS.

    512M1P-xfs Files/s mean 32.69 ( 0.00%) 34.44 ( 5.08%)
    512M1P-xfs Elapsed Time fsmark 51.41 48.29
    512M1P-xfs Elapsed Time simple-wb 114.09 108.61
    512M1P-xfs Elapsed Time mmap-strm 113.46 109.34
    512M1P-xfs Kswapd efficiency fsmark 62% 63%
    512M1P-xfs Kswapd efficiency simple-wb 56% 61%
    512M1P-xfs Kswapd efficiency mmap-strm 44% 42%
    512M-xfs Files/s mean 30.78 ( 0.00%) 35.94 (14.36%)
    512M-xfs Elapsed Time fsmark 56.08 48.90
    512M-xfs Elapsed Time simple-wb 112.22 98.13
    512M-xfs Elapsed Time mmap-strm 219.15 196.67
    512M-xfs Kswapd efficiency fsmark 54% 56%
    512M-xfs Kswapd efficiency simple-wb 54% 55%
    512M-xfs Kswapd efficiency mmap-strm 45% 44%
    512M-4X-xfs Files/s mean 30.31 ( 0.00%) 33.33 ( 9.06%)
    512M-4X-xfs Elapsed Time fsmark 63.26 55.88
    512M-4X-xfs Elapsed Time simple-wb 100.90 90.25
    512M-4X-xfs Elapsed Time mmap-strm 261.73 255.38
    512M-4X-xfs Kswapd efficiency fsmark 49% 50%
    512M-4X-xfs Kswapd efficiency simple-wb 54% 56%
    512M-4X-xfs Kswapd efficiency mmap-strm 37% 36%
    512M-16X-xfs Files/s mean 60.89 ( 0.00%) 65.22 ( 6.64%)
    512M-16X-xfs Elapsed Time fsmark 67.47 58.25
    512M-16X-xfs Elapsed Time simple-wb 103.22 90.89
    512M-16X-xfs Elapsed Time mmap-strm 237.09 198.82
    512M-16X-xfs Kswapd efficiency fsmark 45% 46%
    512M-16X-xfs Kswapd efficiency simple-wb 53% 55%
    512M-16X-xfs Kswapd efficiency mmap-strm 33% 33%

    Up until 512M-4X, the FSmark improvements were statistically significant.
    For the 4X and 16X tests the results were within standard deviations but
    just barely. The time to completion for all tests is improved which is an
    important result. In general, kswapd efficiency is not affected by
    skipping dirty pages.

    1024M1P-xfs Files/s mean 39.09 ( 0.00%) 41.15 ( 5.01%)
    1024M1P-xfs Elapsed Time fsmark 84.14 80.41
    1024M1P-xfs Elapsed Time simple-wb 210.77 184.78
    1024M1P-xfs Elapsed Time mmap-strm 162.00 160.34
    1024M1P-xfs Kswapd efficiency fsmark 69% 75%
    1024M1P-xfs Kswapd efficiency simple-wb 71% 77%
    1024M1P-xfs Kswapd efficiency mmap-strm 43% 44%
    1024M-xfs Files/s mean 35.45 ( 0.00%) 37.00 ( 4.19%)
    1024M-xfs Elapsed Time fsmark 94.59 91.00
    1024M-xfs Elapsed Time simple-wb 229.84 195.08
    1024M-xfs Elapsed Time mmap-strm 405.38 440.29
    1024M-xfs Kswapd efficiency fsmark 79% 71%
    1024M-xfs Kswapd efficiency simple-wb 74% 74%
    1024M-xfs Kswapd efficiency mmap-strm 39% 42%
    1024M-4X-xfs Files/s mean 32.63 ( 0.00%) 35.05 ( 6.90%)
    1024M-4X-xfs Elapsed Time fsmark 103.33 97.74
    1024M-4X-xfs Elapsed Time simple-wb 204.48 178.57
    1024M-4X-xfs Elapsed Time mmap-strm 528.38 511.88
    1024M-4X-xfs Kswapd efficiency fsmark 81% 70%
    1024M-4X-xfs Kswapd efficiency simple-wb 73% 72%
    1024M-4X-xfs Kswapd efficiency mmap-strm 39% 38%
    1024M-16X-xfs Files/s mean 42.65 ( 0.00%) 42.97 ( 0.74%)
    1024M-16X-xfs Elapsed Time fsmark 103.11 99.11
    1024M-16X-xfs Elapsed Time simple-wb 200.83 178.24
    1024M-16X-xfs Elapsed Time mmap-strm 397.35 459.82
    1024M-16X-xfs Kswapd efficiency fsmark 84% 69%
    1024M-16X-xfs Kswapd efficiency simple-wb 74% 73%
    1024M-16X-xfs Kswapd efficiency mmap-strm 39% 40%

    All FSMark tests up to 16X had statistically significant improvements.
    For the most part, tests are completing faster with the exception of the
    streaming writes to a mixture of anonymous and file-backed mappings which
    were slower in two cases.

    In the cases where the mmap-strm tests were slower, there was more
    swapping due to dirty pages being skipped. The number of additional pages
    swapped is almost identical to the reduction in pages written from
    reclaim. In other words, roughly the same number of pages were reclaimed
    but swapping was slower. As the test is a bit unrealistic and stresses
    memory heavily, the small shift is acceptable.

    4608M1P-xfs Files/s mean 29.75 ( 0.00%) 30.96 ( 3.91%)
    4608M1P-xfs Elapsed Time fsmark 512.01 492.15
    4608M1P-xfs Elapsed Time simple-wb 618.18 566.24
    4608M1P-xfs Elapsed Time mmap-strm 488.05 465.07
    4608M1P-xfs Kswapd efficiency fsmark 93% 86%
    4608M1P-xfs Kswapd efficiency simple-wb 88% 84%
    4608M1P-xfs Kswapd efficiency mmap-strm 46% 45%
    4608M-xfs Files/s mean 27.60 ( 0.00%) 28.85 ( 4.33%)
    4608M-xfs Elapsed Time fsmark 555.96 532.34
    4608M-xfs Elapsed Time simple-wb 659.72 571.85
    4608M-xfs Elapsed Time mmap-strm 1082.57 1146.38
    4608M-xfs Kswapd efficiency fsmark 89% 91%
    4608M-xfs Kswapd efficiency simple-wb 88% 82%
    4608M-xfs Kswapd efficiency mmap-strm 48% 46%
    4608M-4X-xfs Files/s mean 26.00 ( 0.00%) 27.47 ( 5.35%)
    4608M-4X-xfs Elapsed Time fsmark 592.91 564.00
    4608M-4X-xfs Elapsed Time simple-wb 616.65 575.07
    4608M-4X-xfs Elapsed Time mmap-strm 1773.02 1631.53
    4608M-4X-xfs Kswapd efficiency fsmark 90% 94%
    4608M-4X-xfs Kswapd efficiency simple-wb 87% 82%
    4608M-4X-xfs Kswapd efficiency mmap-strm 43% 43%
    4608M-16X-xfs Files/s mean 26.07 ( 0.00%) 26.42 ( 1.32%)
    4608M-16X-xfs Elapsed Time fsmark 602.69 585.78
    4608M-16X-xfs Elapsed Time simple-wb 606.60 573.81
    4608M-16X-xfs Elapsed Time mmap-strm 1549.75 1441.86
    4608M-16X-xfs Kswapd efficiency fsmark 98% 98%
    4608M-16X-xfs Kswapd efficiency simple-wb 88% 82%
    4608M-16X-xfs Kswapd efficiency mmap-strm 44% 42%

    Unlike the other tests, the fsmark results are not statistically
    significant but the min and max times are both improved and for the most
    part, tests completed faster.

    There are other indications that this is an improvement as well. For
    example, in the vast majority of cases, there were fewer pages scanned by
    direct reclaim implying in many cases that stalls due to direct reclaim
    are reduced. Kswapd is scanning more due to skipping dirty pages, which is
    unfortunate, but the CPU usage is still acceptable.

    In an earlier set of tests, I used blktrace and in almost all cases
    throughput throughout the entire test was higher. However, I ended up
    discarding those results as recording blktrace data was too heavy for my
    liking.

    On a laptop, I plugged in a USB stick and ran a similar set of tests
    using it as backing storage. A desktop environment was running and for
    the entire duration of the tests, firefox and gnome terminal were
    launching and exiting to vaguely simulate a user.

    1024M-xfs Files/s mean 0.41 ( 0.00%) 0.44 ( 6.82%)
    1024M-xfs Elapsed Time fsmark 2053.52 1641.03
    1024M-xfs Elapsed Time simple-wb 1229.53 768.05
    1024M-xfs Elapsed Time mmap-strm 4126.44 4597.03
    1024M-xfs Kswapd efficiency fsmark 84% 85%
    1024M-xfs Kswapd efficiency simple-wb 92% 81%
    1024M-xfs Kswapd efficiency mmap-strm 60% 51%
    1024M-xfs Avg wait ms fsmark 5404.53 4473.87
    1024M-xfs Avg wait ms simple-wb 2541.35 1453.54
    1024M-xfs Avg wait ms mmap-strm 3400.25 3852.53

    The mmap-strm results were hurt because firefox launching had a tendency
    to push the test out of memory. On the positive side, firefox launched
    marginally faster with the patches applied. Time to completion for many
    tests was faster but more importantly - the "Avg wait" time as measured by
    iostat was far lower implying the system would be more responsive. It was
    also the case that "Avg wait ms" on the root filesystem was lower. I
    tested it manually and while the system felt slightly more responsive
    while copying data to a USB stick, it was marginal enough that it could be
    my imagination.

    This patch: do not writeback filesystem pages in direct reclaim.

    When kswapd is failing to keep zones above the min watermark, a process
    will enter direct reclaim in the same manner kswapd does. If a dirty page
    is encountered during the scan, this page is written to backing storage
    using mapping->writepage.

    This causes two problems. First, it can result in very deep call stacks,
    particularly if the target storage or filesystem are complex. Some
    filesystems ignore write requests from direct reclaim as a result. The
    second is that a single-page flush is inefficient in terms of IO. While
    there is an expectation that the elevator will merge requests, this does
    not always happen. Quoting Christoph Hellwig:

    The elevator has a relatively small window it can operate on,
    and can never fix up a bad large scale writeback pattern.

    This patch prevents direct reclaim writing back filesystem pages by
    checking if current is kswapd. Anonymous pages are still written to swap
    as there is not the equivalent of a flusher thread for anonymous pages.
    If the dirty pages cannot be written back, they are placed back on the LRU
    lists. There is now a direct dependency on dirty page balancing to
    prevent too many pages in the system being dirtied which would prevent
    reclaim making forward progress.
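    A minimal sketch of the check described; the helper name is illustrative
    and the real change lives in the reclaim writeback path:

        static bool reclaim_may_writepage_sketch(struct page *page)
        {
                /* Only kswapd may write back file-backed pages; direct
                 * reclaim leaves them to the flusher threads and to dirty
                 * page balancing. */
                if (page_is_file_cache(page) && !current_is_kswapd())
                        return false;

                /* Anonymous pages have no flusher thread, so swap-out is
                 * still allowed from any reclaim context. */
                return true;
        }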

    Signed-off-by: Mel Gorman
    Reviewed-by: Minchan Kim
    Cc: Dave Chinner
    Cc: Christoph Hellwig
    Cc: Johannes Weiner
    Cc: Wu Fengguang
    Cc: Jan Kara
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Alex Elder
    Cc: Theodore Ts'o
    Cc: Chris Mason
    Cc: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • In the __zone_reclaim case, we don't want to shrink mapped pages.
    Nonetheless, we currently isolate mapped pages and re-add them at the
    LRU's head, which is unnecessary CPU overhead and churns the LRU.

    Of course, when we isolate the page it might be mapped, but by the time
    we try to migrate it the page may no longer be mapped, so it could be
    migrated. But that race is rare and, even when it happens, it's no big
    deal.
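    A minimal sketch of the isolation filter being described, modelled on
    __isolate_lru_page(); the flag name is borrowed from the related
    isolation-mode patches and is an assumption here:

        /* Zone reclaim that may not unmap pages: don't bother isolating
         * mapped pages we would only put back. */
        if ((mode & ISOLATE_UNMAPPED) && page_mapped(page))
                return ret;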

    Signed-off-by: Minchan Kim
    Acked-by: Johannes Weiner
    Reviewed-by: KAMEZAWA Hiroyuki
    Reviewed-by: KOSAKI Motohiro
    Reviewed-by: Michal Hocko
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • In async mode, compaction doesn't migrate dirty or writeback pages, so
    it's meaningless to pick such a page and re-add it to the LRU list.

    Of course, when we isolate the page in compaction it might be dirty or
    under writeback, but by the time we try to migrate it the page may no
    longer be dirty or under writeback, so it could be migrated. But that is
    very unlikely, as the isolate and migration cycle is much faster than
    writeout.

    So, this patch reduces CPU overhead and prevents unnecessary LRU
    churning.
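    A similar sketch for the compaction case, again modelled on
    __isolate_lru_page() with the flag name assumed from the same series:

        /* Async compaction cannot migrate dirty or writeback pages, so
         * don't isolate them in the first place. */
        if ((mode & ISOLATE_CLEAN) &&
            (PageDirty(page) || PageWriteback(page)))
                return ret;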

    Signed-off-by: Minchan Kim
    Acked-by: Johannes Weiner
    Reviewed-by: KAMEZAWA Hiroyuki
    Reviewed-by: KOSAKI Motohiro
    Acked-by: Mel Gorman
    Acked-by: Rik van Riel
    Reviewed-by: Michal Hocko
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • Replace the ISOLATE_XXX macros with a bitwise isolate_mode_t type.
    Normally, macros aren't recommended, as they are type-unsafe and make
    debugging harder since the symbol cannot be passed through to the
    debugger.

    Quote from Johannes:
    "Hmm, it would probably be cleaner to fully convert the isolation mode
    into independent flags. INACTIVE, ACTIVE, BOTH is currently a
    tri-state among flags, which is a bit ugly."

    This patch also moves the isolate mode definitions from swap.h to
    mmzone.h, for use by memcontrol.h.
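    A sketch of what the resulting bitwise type looks like (the specific
    values are illustrative):

        typedef unsigned __bitwise__ isolate_mode_t;

        #define ISOLATE_INACTIVE        ((__force isolate_mode_t)0x1)
        #define ISOLATE_ACTIVE          ((__force isolate_mode_t)0x2)
        #define ISOLATE_CLEAN           ((__force isolate_mode_t)0x4)
        #define ISOLATE_UNMAPPED        ((__force isolate_mode_t)0x8)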

    Signed-off-by: Minchan Kim
    Cc: Johannes Weiner
    Cc: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Michal Hocko
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     

27 Jul, 2011

2 commits

  • This allows us to move duplicated code in <asm/atomic.h>
    (atomic_inc_not_zero() for now) to <linux/atomic.h>.

    Signed-off-by: Arun Sharma
    Reviewed-by: Eric Dumazet
    Cc: Ingo Molnar
    Cc: David Miller
    Cc: Eric Dumazet
    Acked-by: Mike Frysinger
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Arun Sharma
     
  • In mm/memcontrol.c, there are many lru stat functions, such as:

    mem_cgroup_zone_nr_lru_pages
    mem_cgroup_node_nr_file_lru_pages
    mem_cgroup_nr_file_lru_pages
    mem_cgroup_node_nr_anon_lru_pages
    mem_cgroup_nr_anon_lru_pages
    mem_cgroup_node_nr_unevictable_lru_pages
    mem_cgroup_nr_unevictable_lru_pages
    mem_cgroup_node_nr_lru_pages
    mem_cgroup_nr_lru_pages
    mem_cgroup_get_local_zonestat

    Some of them are under #ifdef MAX_NUMNODES >1 and others are not.
    This seems bad. This patch consolidates all functions into

    mem_cgroup_zone_nr_lru_pages()
    mem_cgroup_node_nr_lru_pages()
    mem_cgroup_nr_lru_pages()

    For these functions, "which LRU?" information is passed by a mask.

    example:
    mem_cgroup_nr_lru_pages(mem, BIT(LRU_ACTIVE_ANON))

    And I added some macros: ALL_LRU, ALL_LRU_FILE, ALL_LRU_ANON.

    example:
    mem_cgroup_nr_lru_pages(mem, ALL_LRU)

    BTW, considering the NUMA memory placement of the counters, this patch
    seems better.

    Now, when we gather all LRU information, we scan in the following order:
    for_each_lru -> for_each_node -> for_each_zone.

    This means we'll touch cache lines in different nodes in turn.

    After patch, we'll scan
    for_each_node -> for_each_zone -> for_each_lru(mask)

    Then, we'll gather information in the same cacheline at once.

    [akpm@linux-foundation.org: fix warnings, build error]
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Daisuke Nishimura
    Cc: Balbir Singh
    Cc: Michal Hocko
    Cc: Ying Han
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     

28 Jun, 2011

1 commit

  • commit 21a3c96 uses node_start/end_pfn(nid) for detecting the start/end
    of nodes. But, it's not defined in linux/mmzone.h but defined in
    /arch/???/include/mmzone.h which is included only under
    CONFIG_NEED_MULTIPLE_NODES=y.

    Then, we see
    mm/page_cgroup.c: In function 'page_cgroup_init':
    mm/page_cgroup.c:308: error: implicit declaration of function 'node_start_pfn'
    mm/page_cgroup.c:309: error: implicit declaration of function 'node_end_pfn'

    So, fixing page_cgroup.c is an idea...

    But node_start_pfn()/node_end_pfn() are very generic macros and
    should be implemented in the same manner for all archs.
    (m32r has a different implementation...)

    This patch removes definitions of node_start/end_pfn() in each archs
    and defines a unified one in linux/mmzone.h. It's not under
    CONFIG_NEED_MULTIPLE_NODES, now.

    A result of macro expansion is here (mm/page_cgroup.c)

    for !NUMA
    start_pfn = ((&contig_page_data)->node_start_pfn);
    end_pfn = ({ pg_data_t *__pgdat = (&contig_page_data); __pgdat->node_start_pfn + __pgdat->node_spanned_pages;});

    for NUMA (x86-64)
    start_pfn = ((node_data[nid])->node_start_pfn);
    end_pfn = ({ pg_data_t *__pgdat = (node_data[nid]); __pgdat->node_start_pfn + __pgdat->node_spanned_pages;});

    Changelog:
    - fixed to avoid using "nid" twice in node_end_pfn() macro.
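    The expansions above correspond to generic macros along these lines (a
    sketch; the statement-expression form of node_end_pfn() is what avoids
    evaluating "nid" twice):

        #define node_start_pfn(nid)     (NODE_DATA(nid)->node_start_pfn)

        #define node_end_pfn(nid) ({\
                pg_data_t *__pgdat = NODE_DATA(nid);\
                __pgdat->node_start_pfn + __pgdat->node_spanned_pages;\
        })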

    Reported-and-acked-by: Randy Dunlap
    Reported-and-tested-by: Ingo Molnar
    Acked-by: Mel Gorman
    Signed-off-by: KAMEZAWA Hiroyuki
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     

28 May, 2011

1 commit

  • * 'for-linus' of master.kernel.org:/home/rmk/linux-2.6-arm: (45 commits)
    ARM: 6945/1: Add unwinding support for division functions
    ARM: kill pmd_off()
    ARM: 6944/1: mm: allow ASID 0 to be allocated to tasks
    ARM: 6943/1: mm: use TTBR1 instead of reserved context ID
    ARM: 6942/1: mm: make TTBR1 always point to swapper_pg_dir on ARMv6/7
    ARM: 6941/1: cache: ensure MVA is cacheline aligned in flush_kern_dcache_area
    ARM: add sendmmsg syscall
    ARM: 6863/1: allow hotplug on msm
    ARM: 6832/1: mmci: support for ST-Ericsson db8500v2
    ARM: 6830/1: mach-ux500: force PrimeCell revisions
    ARM: 6829/1: amba: make hardcoded periphid override hardware
    ARM: 6828/1: mach-ux500: delete SSP PrimeCell ID
    ARM: 6827/1: mach-netx: delete hardcoded periphid
    ARM: 6940/1: fiq: Briefly document driver responsibilities for suspend/resume
    ARM: 6938/1: fiq: Refactor {get,set}_fiq_regs() for Thumb-2
    ARM: 6914/1: sparsemem: fix highmem detection when using SPARSEMEM
    ARM: 6913/1: sparsemem: allow pfn_valid to be overridden when using SPARSEMEM
    at91: drop at572d940hf support
    at91rm9200: introduce at91rm9200_set_type to specficy cpu package
    at91: drop boot_params and PLAT_PHYS_OFFSET
    ...

    Linus Torvalds
     

27 May, 2011

1 commit

  • During memory reclaim we determine the number of pages to be scanned per
    zone as

    (anon + file) >> priority.
    Assume
    scan = (anon + file) >> priority.

    If scan < SWAP_CLUSTER_MAX, the scan will be skipped this time and the
    priority is raised. This has some problems.

    1. This raises the priority by 1 without doing any scan.
    To do any scan at this priority, the amount of pages should be larger
    than 512M. If pages >> priority < SWAP_CLUSTER_MAX, it's recorded and the
    scan will be batched later (but we lose 1 priority level).
    If the memory size is below 16M, pages >> priority is 0 and there is no
    scan at DEF_PRIORITY, forever.

    2. If zone->all_unreclaimable==true, it's scanned only when priority==0.
    So, x86's ZONE_DMA will never be recovered until the user of its pages
    frees memory by itself.

    3. With memcg, the memory limit can be small. When using a small memcg,
    it reaches priority < DEF_PRIORITY-2 very easily and needs to call
    wait_iff_congested().
    To do any scan before priority=9, 64MB of memory would have to be in use.

    Then, this patch tries to forcibly scan SWAP_CLUSTER_MAX pages when

    1. the target is small enough.
    2. it's kswapd or memcg reclaim.

    Then we can avoid a rapid priority drop and may be able to recover
    all_unreclaimable in small zones. This patch also removes nr_saved_scan.
    This will allow scanning at this priority even when pages >> priority is
    very small.
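    A small standalone illustration of the arithmetic above (plain userspace
    C, assuming 4KB pages, DEF_PRIORITY == 12 and SWAP_CLUSTER_MAX == 32):

        #include <stdio.h>

        int main(void)
        {
                const unsigned long swap_cluster_max = 32;
                const unsigned long page_size = 4096;

                /* Smallest LRU size that still scans at DEF_PRIORITY:
                 * 32 << 12 = 131072 pages = 512MB with 4KB pages. */
                unsigned long pages = swap_cluster_max << 12;
                printf("%lu pages = %luMB\n", pages,
                       pages * page_size >> 20);

                /* A 16MB zone has 4096 pages; 4096 >> 12 == 1, far below
                 * SWAP_CLUSTER_MAX, so its scan is always deferred. */
                unsigned long small = (16UL << 20) / page_size;
                printf("16MB zone scan target at priority 12: %lu\n",
                       small >> 12);
                return 0;
        }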

    Signed-off-by: KAMEZAWA Hiroyuki
    Acked-by: Ying Han
    Cc: Balbir Singh
    Cc: KOSAKI Motohiro
    Cc: Daisuke Nishimura
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     

26 May, 2011

1 commit

  • In commit eb33575c ("[ARM] Double check memmap is actually valid with a
    memmap has unexpected holes V2"), a new function, memmap_valid_within,
    was introduced to mmzone.h so that holes in the memmap which pass
    pfn_valid in SPARSEMEM configurations can be detected and avoided.

    The fix to this problem checks that the pfn page linkages are
    correct by calculating the page for the pfn and then checking that
    page_to_pfn on that page returns the original pfn. Unfortunately, in
    SPARSEMEM configurations, this results in reading from the page flags to
    determine the correct section. Since the memmap here has been freed,
    junk is read from memory and the check is no longer robust.

    In the best case, reading from /proc/pagetypeinfo will give you the
    wrong answer. In the worst case, you get SEGVs, Kernel OOPses and hung
    CPUs. Furthermore, ioremap implementations that use pfn_valid to
    disallow the remapping of normal memory will break.

    This patch allows architectures to provide their own pfn_valid function
    instead of using the default implementation used by sparsemem. The
    architecture-specific version is aware of the memmap state and will
    return false when passed a pfn for a freed page within a valid section.

    Acked-by: Mel Gorman
    Acked-by: Catalin Marinas
    Tested-by: H Hartley Sweeten
    Signed-off-by: Will Deacon
    Signed-off-by: Russell King

    Will Deacon
     

25 May, 2011

2 commits


15 Feb, 2011

1 commit


04 Feb, 2011

1 commit


14 Jan, 2011

3 commits

  • Add hugepage stat information to /proc/vmstat and /proc/meminfo.

    Signed-off-by: Andrea Arcangeli
    Acked-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • Simon Kirby reported the following problem

    We're seeing cases on a number of servers where cache never fully
    grows to use all available memory. Sometimes we see servers with 4 GB
    of memory that never seem to have less than 1.5 GB free, even with a
    constantly-active VM. In some cases, these servers also swap out while
    this happens, even though they are constantly reading the working set
    into memory. We have been seeing this happening for a long time; I
    don't think it's anything recent, and it still happens on 2.6.36.

    After some debugging work by Simon, Dave Hansen and others, the prevailing
    theory became that kswapd is too aggressive about reclaiming the order-3
    pages requested by SLUB.

    There are two apparent problems here. On the target machine, there is a
    small Normal zone in comparison to DMA32. As kswapd tries to balance all
    zones, it would continually try reclaiming for Normal even though DMA32
    was balanced enough for callers. The second problem is that
    sleeping_prematurely() does not use the same logic as balance_pgdat() when
    deciding whether to sleep or not. This keeps kswapd artificially awake.

    A number of tests were run and the figures from previous postings will
    look very different for a few reasons. One, the old figures were forcing
    my network card to use GFP_ATOMIC in an attempt to replicate Simon's
    problem. Second, I previously specified slub_min_order=3, again in an
    attempt to reproduce Simon's problem. In this posting, I'm depending on
    Simon to say
    whether his problem is fixed or not and these figures are to show the
    impact to the ordinary cases. Finally, the "vmscan" figures are taken
    from /proc/vmstat instead of the tracepoints. There is less information
    but recording is less disruptive.

    The first test of relevance was postmark with a process running in the
    background reading a large amount of anonymous memory in blocks. The
    objective was to vaguely simulate what was happening on Simon's machine
    and it's memory intensive enough to have kswapd awake.

    POSTMARK
    traceonly kanyzone
    Transactions per second: 156.00 ( 0.00%) 153.00 (-1.96%)
    Data megabytes read per second: 21.51 ( 0.00%) 21.52 ( 0.05%)
    Data megabytes written per second: 29.28 ( 0.00%) 29.11 (-0.58%)
    Files created alone per second: 250.00 ( 0.00%) 416.00 (39.90%)
    Files create/transact per second: 79.00 ( 0.00%) 76.00 (-3.95%)
    Files deleted alone per second: 520.00 ( 0.00%) 420.00 (-23.81%)
    Files delete/transact per second: 79.00 ( 0.00%) 76.00 (-3.95%)

    MMTests Statistics: duration
    User/Sys Time Running Test (seconds) 16.58 17.4
    Total Elapsed Time (seconds) 218.48 222.47

    VMstat Reclaim Statistics: vmscan
    Direct reclaims 0 4
    Direct reclaim pages scanned 0 203
    Direct reclaim pages reclaimed 0 184
    Kswapd pages scanned 326631 322018
    Kswapd pages reclaimed 312632 309784
    Kswapd low wmark quickly 1 4
    Kswapd high wmark quickly 122 475
    Kswapd skip congestion_wait 1 0
    Pages activated 700040 705317
    Pages deactivated 212113 203922
    Pages written 9875 6363

    Total pages scanned 326631 322221
    Total pages reclaimed 312632 309968
    %age total pages scanned/reclaimed 95.71% 96.20%
    %age total pages scanned/written 3.02% 1.97%

    proc vmstat: Faults
    Major Faults 300 254
    Minor Faults 645183 660284
    Page ins 493588 486704
    Page outs 4960088 4986704
    Swap ins 1230 661
    Swap outs 9869 6355

    Performance is mildly affected because kswapd is no longer doing as much
    work and the background memory consumer process is getting in the way.
    Note that kswapd scanned and reclaimed fewer pages as it's less aggressive
    and overall fewer pages were scanned and reclaimed. Swap in/out is
    particularly reduced again reflecting kswapd throwing out fewer pages.

    The slight performance impact is unfortunate here but it looks like a
    direct result of kswapd being less aggressive. As the bug report is about
    too many pages being freed by kswapd, it may have to be accepted for now.

    The second test is a streaming IO benchmark that was previously used by
    Johannes to show regressions in page reclaim.

    MICRO
    traceonly kanyzone
    User/Sys Time Running Test (seconds) 29.29 28.87
    Total Elapsed Time (seconds) 492.18 488.79

    VMstat Reclaim Statistics: vmscan
    Direct reclaims 2128 1460
    Direct reclaim pages scanned 2284822 1496067
    Direct reclaim pages reclaimed 148919 110937
    Kswapd pages scanned 15450014 16202876
    Kswapd pages reclaimed 8503697 8537897
    Kswapd low wmark quickly 3100 3397
    Kswapd high wmark quickly 1860 7243
    Kswapd skip congestion_wait 708 801
    Pages activated 9635 9573
    Pages deactivated 1432 1271
    Pages written 223 1130

    Total pages scanned 17734836 17698943
    Total pages reclaimed 8652616 8648834
    %age total pages scanned/reclaimed 48.79% 48.87%
    %age total pages scanned/written 0.00% 0.01%

    proc vmstat: Faults
    Major Faults 165 221
    Minor Faults 9655785 9656506
    Page ins 3880 7228
    Page outs 37692940 37480076
    Swap ins 0 69
    Swap outs 19 15

    Again fewer pages are scanned and reclaimed as expected and this time the
    test completed faster. Note that kswapd is hitting its watermarks faster
    (low and high wmark quickly) which I expect is due to kswapd reclaiming
    fewer pages.

    I also ran fs-mark, iozone and sysbench but there is nothing interesting
    to report in the figures. Performance is not significantly changed and
    the reclaim statistics look reasonable.

    This patch:

    When the allocator enters its slow path, kswapd is woken up to balance the
    node. It continues working until all zones within the node are balanced.
    For order-0 allocations, this makes perfect sense but for higher orders it
    can have unintended side-effects. If the zone sizes are imbalanced,
    kswapd may reclaim heavily within a smaller zone discarding an excessive
    number of pages. The user-visible behaviour is that kswapd is awake and
    reclaiming even though plenty of pages are free from a suitable zone.

    This patch alters the "balance" logic for high-order reclaim allowing
    kswapd to stop if any suitable zone becomes balanced to reduce the number
    of pages it reclaims from other zones. kswapd still tries to ensure that
    order-0 watermarks for all zones are met before sleeping.
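    A rough conceptual sketch of the altered check (not the exact
    balance_pgdat() code; the surrounding bookkeeping is simplified):

        bool balanced = false;
        int i;

        for (i = 0; i < pgdat->nr_zones; i++) {
                struct zone *zone = pgdat->node_zones + i;

                if (!populated_zone(zone))
                        continue;

                /* For high-order reclaim, one suitable balanced zone is
                 * enough to let kswapd stop early. */
                if (order && zone_watermark_ok(zone, order,
                                high_wmark_pages(zone), 0, 0)) {
                        balanced = true;
                        break;
                }
        }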

    Signed-off-by: Mel Gorman
    Reviewed-by: Minchan Kim
    Reviewed-by: KAMEZAWA Hiroyuki
    Reviewed-by: Eric B Munson
    Cc: Simon Kirby
    Cc: KOSAKI Motohiro
    Cc: Shaohua Li
    Cc: Dave Hansen
    Cc: Johannes Weiner
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Commit aa45484 ("calculate a better estimate of NR_FREE_PAGES when memory
    is low") noted that watermarks were based on the vmstat NR_FREE_PAGES. To
    avoid synchronization overhead, these counters are maintained on a per-cpu
    basis and drained both periodically and when the delta is above a
    threshold. On large CPU systems, the difference between the estimate and
    real value of NR_FREE_PAGES can be very high. The system can get into a
    case where pages are allocated far below the min watermark potentially
    causing livelock issues. The commit solved the problem by taking a better
    reading of NR_FREE_PAGES when memory was low.

    Unfortunately, as reported by Shaohua Li, this accurate reading can consume a
    large amount of CPU time on systems with many sockets due to cache line
    bouncing. This patch takes a different approach. For large machines
    where counter drift might be unsafe and while kswapd is awake, the per-cpu
    thresholds for the target pgdat are reduced to limit the level of drift to
    what should be a safe level. This incurs a performance penalty in heavy
    memory pressure by a factor that depends on the workload and the machine
    but the machine should function correctly without accidentally exhausting
    all memory on a node. There is an additional cost when kswapd wakes and
    sleeps but the event is not expected to be frequent - in Shaohua's test
    case, there was one recorded sleep and wake event at least.

    To ensure that kswapd wakes up, a safe version of zone_watermark_ok() is
    introduced that takes a more accurate reading of NR_FREE_PAGES when called
    from wakeup_kswapd, when deciding whether it is really safe to go back to
    sleep in sleeping_prematurely() and when deciding if a zone is really
    balanced or not in balance_pgdat(). We are still using an expensive
    function but limiting how often it is called.
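    A sketch of the safe variant described (simplified; treat the exact
    signature as an assumption):

        bool zone_watermark_ok_safe_sketch(struct zone *z, int order,
                                           unsigned long mark,
                                           int classzone_idx, int alloc_flags)
        {
                /* Take the expensive but accurate reading that folds in
                 * the per-cpu deltas instead of the raw vmstat counter. */
                long free_pages = zone_page_state_snapshot(z, NR_FREE_PAGES);

                return __zone_watermark_ok(z, order, mark, classzone_idx,
                                           alloc_flags, free_pages);
        }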

    When the test case is reproduced, the time spent in the watermark
    functions is reduced. The following report is on the percentage of time
    spent cumulatively spent in the functions zone_nr_free_pages(),
    zone_watermark_ok(), __zone_watermark_ok(), zone_watermark_ok_safe(),
    zone_page_state_snapshot(), zone_page_state().

    vanilla 11.6615%
    disable-threshold 0.2584%

    David said:

    : We had to pull aa454840 "mm: page allocator: calculate a better estimate
    : of NR_FREE_PAGES when memory is low and kswapd is awake" from 2.6.36
    : internally because tests showed that it would cause the machine to stall
    : as the result of heavy kswapd activity. I merged it back with this fix as
    : it is pending in the -mm tree and it solves the issue we were seeing, so I
    : definitely think this should be pushed to -stable (and I would seriously
    : consider it for 2.6.37 inclusion even at this late date).

    Signed-off-by: Mel Gorman
    Reported-by: Shaohua Li
    Reviewed-by: Christoph Lameter
    Tested-by: Nicolas Bareil
    Cc: David Rientjes
    Cc: Kyle McMartin
    Cc: [2.6.37.1, 2.6.36.x]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

27 Oct, 2010

2 commits

  • …r if significant congestion is not being encountered in the current zone

    If congestion_wait() is called with no BDI congested, the caller will
    sleep for the full timeout and this may be an unnecessary sleep. This
    patch adds a wait_iff_congested() that checks congestion and only sleeps
    if a BDI is congested else, it calls cond_resched() to ensure the caller
    is not hogging the CPU longer than its quota but otherwise will not sleep.

    This is aimed at reducing some of the major desktop stalls reported during
    IO. For example, while kswapd is operating, it calls congestion_wait()
    but it could just have been reclaiming clean page cache pages with no
    congestion. Without this patch, it would sleep for a full timeout but
    after this patch, it'll just call schedule() if it has been on the CPU too
    long. Similar logic applies to direct reclaimers that are not making
    enough progress.
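    A rough sketch of the intended behaviour; the congestion predicate here
    is a hypothetical helper, not the real API:

        long wait_iff_congested_sketch(int sync, long timeout)
        {
                /* Nothing is congested: don't sleep, just make sure we are
                 * not hogging the CPU, and report that no time was spent. */
                if (!some_bdi_is_congested(sync)) {    /* hypothetical */
                        cond_resched();
                        return 0;
                }

                /* Otherwise behave like congestion_wait(). */
                return congestion_wait(sync, timeout);
        }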

    Signed-off-by: Mel Gorman <mel@csn.ul.ie>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Minchan Kim <minchan.kim@gmail.com>
    Cc: Wu Fengguang <fengguang.wu@intel.com>
    Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
    Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
    Cc: Rik van Riel <riel@redhat.com>
    Cc: Jens Axboe <axboe@kernel.dk>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

    Mel Gorman
     
  • To help developers and applications gain visibility into writeback
    behaviour, add two entries to vm_stat_items and /proc/vmstat. This will
    allow us to track the "written" and "dirtied" counts.

    # grep nr_dirtied /proc/vmstat
    nr_dirtied 3747
    # grep nr_written /proc/vmstat
    nr_written 3618

    Signed-off-by: Michael Rubin
    Reviewed-by: Wu Fengguang
    Cc: Dave Chinner
    Cc: Jens Axboe
    Cc: KOSAKI Motohiro
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michael Rubin
     

10 Sep, 2010

1 commit

  • …low and kswapd is awake

    Ordinarily watermark checks are based on the vmstat NR_FREE_PAGES as it is
    cheaper than scanning a number of lists. To avoid synchronization
    overhead, counter deltas are maintained on a per-cpu basis and drained
    both periodically and when the delta is above a threshold. On large CPU
    systems, the difference between the estimated and real value of
    NR_FREE_PAGES can be very high. If NR_FREE_PAGES is much higher than the
    number of real free pages in the buddy lists, the VM can allocate pages
    below the min watermark, at worst reducing the real number of free pages
    to zero. Even if the OOM killer kills some victim to free memory, it may
    not free memory if the exit path requires a new page, resulting in
    livelock.

    This patch introduces a zone_page_state_snapshot() function (courtesy of
    Christoph) that takes a slightly more accurate view of an arbitrary vmstat
    counter. It is used to read NR_FREE_PAGES while kswapd is awake to avoid
    the watermark being accidentally broken. The estimate is not perfect and
    may result in cache line bounces but is expected to be lighter than the
    IPI calls necessary to continually drain the per-cpu counters while kswapd
    is awake.
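    A sketch of such a snapshot helper, close in spirit to the vmstat code
    (details illustrative):

        static unsigned long snapshot_sketch(struct zone *zone,
                                             enum zone_stat_item item)
        {
                long x = atomic_long_read(&zone->vm_stat[item]);
        #ifdef CONFIG_SMP
                int cpu;

                /* Fold in the not-yet-drained per-cpu deltas. */
                for_each_online_cpu(cpu)
                        x += per_cpu_ptr(zone->pageset, cpu)->vm_stat_diff[item];
        #endif
                if (x < 0)
                        x = 0;
                return x;
        }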

    Signed-off-by: Christoph Lameter <cl@linux.com>
    Signed-off-by: Mel Gorman <mel@csn.ul.ie>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

    Christoph Lameter
     

10 Aug, 2010

2 commits

  • Since 2.6.28 zone->prev_priority is unused, so it can be removed safely.
    This reduces stack usage slightly.

    Now I have to say that I'm sorry. Two years ago, I thought prev_priority
    could be integrated again and would be useful, but four (or more)
    attempts haven't produced good performance numbers. Thus I gave up on
    that approach.

    The rest of this changelog is notes on prev_priority, why it existed in
    the first place and why it might not be necessary any more. This
    information is based heavily on discussions between Andrew Morton, Rik van
    Riel and KOSAKI Motohiro, who is quoted heavily below.

    Historically prev_priority was important because it determined when the VM
    would start unmapping PTE pages. i.e. there are no balances of note within
    the VM, Anon vs File and Mapped vs Unmapped. Without prev_priority, there
    is a potential risk of unnecessarily increasing minor faults as a large
    amount of read activity of use-once pages could push mapped pages to the
    end of the LRU and get unmapped.

    There is no proof this is still a problem but currently it is not considered
    to be. Active files are not deactivated if the active file list is smaller
    than the inactive list, reducing the likelihood that file-mapped pages are
    being pushed off the LRU and referenced executable pages are kept on the
    active list to avoid them getting pushed out by read activity.

    Even if it is a problem, prev_priority wouldn't work nowadays. First of
    all, current vmscan still has a lot of UP-centric code, which exposes
    weaknesses on machines with dozens of CPUs. I think we need more and more
    improvement.

    The problem is that current vmscan mixes up per-system pressure, per-zone
    pressure and per-task pressure a bit. For example, prev_priority tries to
    boost the priority of other concurrent reclaimers, but if another task
    has a mempolicy restriction that is unnecessary, and it also causes
    wrongly large latency and excessive reclaim. Per-task priority plus the
    prev_priority adjustment emulates per-system pressure, but that has two
    issues: 1) the emulation is too rough and brutal, and 2) we need per-zone
    pressure, not per-system pressure.

    Another example: currently DEF_PRIORITY is 12, meaning the LRU is rotated
    about 2 cycles (1/4096 + 1/2048 + 1/1024 + .. + 1) before invoking the
    OOM killer. But if 10,000 threads enter DEF_PRIORITY reclaim at the same
    time, the system is under higher memory pressure than priority==0
    (1/4096 * 10,000 > 2). prev_priority can't solve such multithreaded
    workload issues. In other words, the prev_priority concept assumes the
    system doesn't have lots of threads."

    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: Mel Gorman
    Reviewed-by: Johannes Weiner
    Reviewed-by: Rik van Riel
    Cc: Dave Chinner
    Cc: Chris Mason
    Cc: Nick Piggin
    Cc: Rik van Riel
    Cc: Johannes Weiner
    Cc: Christoph Hellwig
    Cc: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Cc: Andrea Arcangeli
    Cc: Michael Rubin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • get_zone_counts() was dropped from the kernel tree, see:
    http://www.mail-archive.com/mm-commits@vger.kernel.org/msg07313.html but
    its prototype remains.

    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexander Nevenchannyy
     

28 May, 2010

1 commit

  • Introduce numa_mem_id(), based on the generic percpu variable
    infrastructure, to track the "nearest node with memory" for archs that
    support memoryless nodes.

    Define the API when CONFIG_HAVE_MEMORYLESS_NODES is defined, else provide
    stubs. Architectures will define HAVE_MEMORYLESS_NODES if/when they
    support them.

    Archs can override definitions of:

    numa_mem_id() - returns node number of "local memory" node
    set_numa_mem() - initialize [this cpu's] per cpu variable 'numa_mem'
    cpu_to_mem() - return numa_mem for specified cpu; may be used as lvalue

    Generic initialization of 'numa_mem' occurs in __build_all_zonelists().
    This will initialize the boot cpu at boot time, and all cpus on change of
    numa_zonelist_order, or when node or memory hot-plug requires zonelist
    rebuild. Archs that support memoryless nodes will need to initialize
    'numa_mem' for secondary cpus as they're brought on-line.
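    A sketch of the stubbed side of the API for archs without memoryless
    nodes (illustrative; the overridable helpers are listed above):

        #ifndef CONFIG_HAVE_MEMORYLESS_NODES
        static inline void set_numa_mem(int node) { }

        static inline int numa_mem_id(void)
        {
                /* The local memory node is simply the local node. */
                return numa_node_id();
        }

        static inline int cpu_to_mem(int cpu)
        {
                return cpu_to_node(cpu);
        }
        #endif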

    [akpm@linux-foundation.org: fix build]
    Signed-off-by: Lee Schermerhorn
    Signed-off-by: Christoph Lameter
    Cc: Tejun Heo
    Cc: Mel Gorman
    Cc: Christoph Lameter
    Cc: Nick Piggin
    Cc: David Rientjes
    Cc: Eric Whitney
    Cc: KAMEZAWA Hiroyuki
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Cc: "Luck, Tony"
    Cc: Pekka Enberg
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     

25 May, 2010

4 commits

  • Add global mutex zonelists_mutex to fix the possible race:

    CPU0 CPU1 CPU2
    (1) zone->present_pages += online_pages;
    (2) build_all_zonelists();
    (3) alloc_page();
    (4) free_page();
    (5) build_all_zonelists();
    (6) __build_all_zonelists();
    (7) zone->pageset = alloc_percpu();

    In steps (3,4), zone->pageset still points to boot_pageset, so bad
    things may happen if 2+ nodes are in this state. Even if only 1 node
    is accessing the boot_pageset, (3) may still consume enough memory to
    make the allocation in step (7) fail.

    Besides, atomic operation ensures alloc_percpu() in step (7) will never
    fail, since a fresh new memory block was added in step (6).

    [haicheng.li@linux.intel.com: hold zonelists_mutex when build_all_zonelists]
    Signed-off-by: Haicheng Li
    Signed-off-by: Wu Fengguang
    Reviewed-by: Andi Kleen
    Cc: Christoph Lameter
    Cc: Mel Gorman
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Haicheng Li
     
  • For each newly populated zone of a hot-added node, we need to update its
    pagesets with dynamically allocated per_cpu_pageset structs for all
    possible CPUs:

    1) Detach zone->pageset from the shared boot_pageset
    at end of __build_all_zonelists().

    2) Use mutex to protect zone->pageset when it's still
    shared in onlined_pages()

    Otherwise, multiple zones of different nodes would share the same
    bootstrapping boot_pageset for the same CPU, which finally causes the
    kernel panic below:

    ------------[ cut here ]------------
    kernel BUG at mm/page_alloc.c:1239!
    invalid opcode: 0000 [#1] SMP
    ...
    Call Trace:
    [] __alloc_pages_nodemask+0x131/0x7b0
    [] alloc_pages_current+0x87/0xd0
    [] __page_cache_alloc+0x67/0x70
    [] __do_page_cache_readahead+0x120/0x260
    [] ra_submit+0x21/0x30
    [] ondemand_readahead+0x166/0x2c0
    [] page_cache_async_readahead+0x80/0xa0
    [] generic_file_aio_read+0x364/0x670
    [] nfs_file_read+0xca/0x130
    [] do_sync_read+0xfa/0x140
    [] vfs_read+0xb5/0x1a0
    [] sys_read+0x51/0x80
    [] system_call_fastpath+0x16/0x1b
    RIP [] get_page_from_freelist+0x883/0x900
    RSP
    ---[ end trace 4bda28328b9990db ]

    [akpm@linux-foundation.org: merge fix]
    Signed-off-by: Haicheng Li
    Signed-off-by: Wu Fengguang
    Reviewed-by: Andi Kleen
    Reviewed-by: Christoph Lameter
    Cc: Mel Gorman
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Haicheng Li
     
  • Got this while compiling for ARM/SA1100:

    mm/sparse.c: In function '__section_nr':
    mm/sparse.c:135: warning: 'root' is used uninitialized in this function

    This patch follows Russell King's suggestion for a new calculation for
    NR_SECTION_ROOTS. Thanks also to Sergei Shtylyov for pointing out the
    existence of the macro DIV_ROUND_UP.

    Atsushi Nemoto observed:
    : This fix doesn't just silence the warning - it fixes a real problem.
    :
    : Without this fix, mem_section[] might have 0 size so mem_section[0]
    : will share other variable area. For example, I got:
    :
    : c030c700 b __warned.16478
    : c030c700 B mem_section
    : c030c701 b __warned.16483
    :
    : This might cause very strange behavior. Your patch actually fixes it.

    Signed-off-by: Marcelo Roberto Jimenez
    Cc: Atsushi Nemoto
    Cc: KOSAKI Motohiro
    Cc: Christoph Lameter
    Cc: Mel Gorman
    Cc: Minchan Kim
    Cc: Yinghai Lu
    Cc: Sergei Shtylyov
    Cc: Russell King
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Marcelo Roberto Jimenez
     
  • The fragmentation index may indicate that a failure is due to external
    fragmentation but after a compaction run completes, it is still possible
    for an allocation to fail. There are two obvious reasons as to why

    o Page migration cannot move all pages so fragmentation remains
    o A suitable page may exist but watermarks are not met

    In the event of compaction followed by an allocation failure, this patch
    defers further compaction in the zone (1 << compact_defer_shift) times.
    If the next compaction attempt also fails, compact_defer_shift is
    increased up to a maximum of 6. If compaction succeeds, the defer
    counters are reset again.

    The zone that is deferred is the first zone in the zonelist - i.e. the
    preferred zone. To defer compaction in the other zones, the information
    would need to be stored in the zonelist or implemented similar to the
    zonelist_cache. This would impact the fast-paths and is not justified at
    this time.
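    A sketch of the deferral bookkeeping described above (the zone fields are
    the ones this patch adds; helper names here are illustrative):

        #define COMPACT_MAX_DEFER_SHIFT 6

        /* An allocation failed right after compaction: back off
         * exponentially. */
        static void defer_compaction_sketch(struct zone *zone)
        {
                zone->compact_considered = 0;
                if (++zone->compact_defer_shift > COMPACT_MAX_DEFER_SHIFT)
                        zone->compact_defer_shift = COMPACT_MAX_DEFER_SHIFT;
        }

        /* Skip compaction until 1 << compact_defer_shift attempts passed. */
        static bool compaction_deferred_sketch(struct zone *zone)
        {
                unsigned long defer_limit = 1UL << zone->compact_defer_shift;

                return ++zone->compact_considered <= defer_limit;
        }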

    Signed-off-by: Mel Gorman
    Cc: Rik van Riel
    Cc: Minchan Kim
    Cc: KOSAKI Motohiro
    Cc: Christoph Lameter
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

08 Mar, 2010

1 commit


07 Mar, 2010

1 commit

  • commit e815af95 ("change all_unreclaimable zone member to flags") changed
    the all_unreclaimable member to a bit flag. But it had an undesirable side
    effect: free_one_page() is one of the hottest paths in the Linux kernel,
    and increasing atomic ops in it can reduce kernel performance a bit.

    Thus, this patch partially reverts that commit; at least
    all_unreclaimable shouldn't share a memory word with other zone flags.

    [akpm@linux-foundation.org: fix patch interaction]
    Signed-off-by: KOSAKI Motohiro
    Cc: David Rientjes
    Cc: Wu Fengguang
    Cc: KAMEZAWA Hiroyuki
    Cc: Minchan Kim
    Cc: Huang Shijie
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     

04 Mar, 2010

1 commit

  • …l/git/tip/linux-2.6-tip

    * 'x86-bootmem-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (30 commits)
    early_res: Need to save the allocation name in drop_range_partial()
    sparsemem: Fix compilation on PowerPC
    early_res: Add free_early_partial()
    x86: Fix non-bootmem compilation on PowerPC
    core: Move early_res from arch/x86 to kernel/
    x86: Add find_fw_memmap_area
    Move round_up/down to kernel.h
    x86: Make 32bit support NO_BOOTMEM
    early_res: Enhance check_and_double_early_res
    x86: Move back find_e820_area to e820.c
    x86: Add find_early_area_size
    x86: Separate early_res related code from e820.c
    x86: Move bios page reserve early to head32/64.c
    sparsemem: Put mem map for one node together.
    sparsemem: Put usemap for one node together
    x86: Make 64 bit use early_res instead of bootmem before slab
    x86: Only call dma32_reserve_bootmem 64bit !CONFIG_NUMA
    x86: Make early_node_mem get mem > 4 GB if possible
    x86: Dynamically increase early_res array size
    x86: Introduce max_early_res and early_res_count
    ...

    Linus Torvalds
     

17 Feb, 2010

1 commit

  • Add __percpu sparse annotations to core subsystems.

    These annotations are to make sparse consider percpu variables to be
    in a different address space and warn if accessed without going
    through percpu accessors. This patch doesn't affect normal builds.

    Signed-off-by: Tejun Heo
    Reviewed-by: Christoph Lameter
    Acked-by: Paul E. McKenney
    Cc: Jens Axboe
    Cc: linux-mm@kvack.org
    Cc: Rusty Russell
    Cc: Dipankar Sarma
    Cc: Peter Zijlstra
    Cc: Andrew Morton
    Cc: Eric Biederman

    Tejun Heo
     

13 Feb, 2010

1 commit


04 Feb, 2010

1 commit


05 Jan, 2010

1 commit

  • Use the per cpu allocator functionality to avoid per cpu arrays in struct zone.

    This drastically reduces the size of struct zone for systems with large
    amounts of processors and allows placement of critical variables of struct
    zone in one cacheline even on very large systems.

    Another effect is that the pagesets of one processor are placed near one
    another. If multiple pagesets from different zones fit into one cacheline
    then additional cacheline fetches can be avoided on the hot paths when
    allocating memory from multiple zones.

    Bootstrap becomes simpler if we use the same scheme for UP, SMP, NUMA. #ifdefs
    are reduced and we can drop the zone_pcp macro.

    Hotplug handling is also simplified since cpu alloc can bring up and
    shut down cpu areas for a specific cpu as a whole. So there is no need to
    allocate or free individual pagesets.
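    A minimal sketch of the resulting setup (simplified):

        /* One dynamically allocated pageset per possible CPU, replacing
         * the old NR_CPUS-sized array embedded in struct zone. */
        zone->pageset = alloc_percpu(struct per_cpu_pageset);

        /* Hot-path access then goes through the per-cpu accessors. */
        struct per_cpu_pageset *pset = this_cpu_ptr(zone->pageset);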

    V7-V8:
    - Explain chicken and egg dilemma with percpu allocator.

    V4-V5:
    - Fix up cases where per_cpu_ptr is called before irq disable
    - Integrate the bootstrap logic that was separate before.

    tj: Build failure in pageset_cpuup_callback() due to missing ret
    variable fixed.

    Reviewed-by: Mel Gorman
    Signed-off-by: Christoph Lameter
    Signed-off-by: Tejun Heo

    Christoph Lameter