13 Jan, 2012

1 commit

  • Move CMPXCHG_LOCAL and rename it to HAVE_CMPXCHG_LOCAL so architectures
    can simply select the option if it is supported.

    Signed-off-by: Heiko Carstens
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Heiko Carstens
     

01 Nov, 2011

3 commits

  • Avoid false sharing of the vm_stat array.

    This was found to adversely affect tmpfs I/O performance.

    Tests run on a 640 cpu UV system.

    With 120 threads doing parallel writes, each to different tmpfs mounts:
    No patch: ~300 MB/sec
    With vm_stat alignment: ~430 MB/sec
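
    For reference, a minimal sketch of the idea (the actual declaration lives
    in mm/vmstat.c; modifier placement in the real patch may differ):

    /* Give the global counter array its own cache line so that updates
     * from one CPU do not invalidate unrelated data used by others. */
    atomic_long_t vm_stat[NR_VM_ZONE_STAT_ITEMS] __cacheline_aligned_in_smp;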

    Signed-off-by: Dimitri Sivanich
    Acked-by: Christoph Lameter
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dimitri Sivanich
     
  • When direct reclaim encounters a dirty page, it gets recycled around the
    LRU for another cycle. This patch marks the page PageReclaim similar to
    deactivate_page() so that the page gets reclaimed almost immediately after
    the page gets cleaned. This is to avoid reclaiming clean pages that are
    younger than a dirty page encountered at the end of the LRU that might
    have been something like a use-once page.
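
    A minimal sketch of the idea in shrink_page_list() (mm/vmscan.c); the
    real check carries additional conditions:

    if (PageDirty(page)) {
            /*
             * Mark the page so it is moved to the tail of the LRU as
             * soon as writeback cleans it, instead of surviving another
             * full trip around the list before being found again.
             */
            SetPageReclaim(page);
            goto keep_locked;       /* rotate it back to the LRU for now */
    }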

    Signed-off-by: Mel Gorman
    Acked-by: Johannes Weiner
    Cc: Dave Chinner
    Cc: Christoph Hellwig
    Cc: Wu Fengguang
    Cc: Jan Kara
    Cc: Minchan Kim
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Alex Elder
    Cc: Theodore Ts'o
    Cc: Chris Mason
    Cc: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Testing from the XFS folk revealed that there is still too much I/O from
    the end of the LRU in kswapd. Previously it was considered acceptable by
    VM people for a small number of pages to be written back from reclaim with
    testing generally showing about 0.3% of pages reclaimed were written back
    (higher if memory was low). The claim that writing back a small number of
    pages is acceptable has been heavily disputed for quite some time, and Dave
    Chinner explained it well:

    It doesn't have to be a very high number to be a problem. IO
    is orders of magnitude slower than the CPU time it takes to
    flush a page, so the cost of making a bad flush decision is
    very high. And single page writeback from the LRU is almost
    always a bad flush decision.

    To complicate matters, filesystems respond very differently to requests
    from reclaim, according to Christoph Hellwig:

    xfs tries to write it back if the requester is kswapd
    ext4 ignores the request if it's a delayed allocation
    btrfs ignores the request

    As a result, each filesystem has different performance characteristics
    when under memory pressure and there are many pages being dirtied. In
    some cases, the request is ignored entirely so the VM cannot depend on the
    IO being dispatched.

    The objective of this series is to reduce writing of filesystem-backed
    pages from reclaim, play nicely with writeback that is already in progress
    and throttle reclaim appropriately when writeback pages are encountered.
    The assumption is that the flushers will always write pages faster than if
    reclaim issues the IO.

    A secondary goal is to avoid the problem whereby direct reclaim splices
    two potentially deep call stacks together.

    There is a potential new problem as reclaim has less control over how long
    before a page in a particular zone or container is cleaned, and direct
    reclaimers depend on kswapd or flusher threads to do the necessary work.
    However, as filesystems sometimes ignore direct reclaim requests already,
    it is not expected to be a serious issue.

    Patch 1 disables writeback of filesystem pages from direct reclaim
    entirely. Anonymous pages are still written.

    Patch 2 removes dead code in lumpy reclaim as it is no longer able
    to synchronously write pages. This hurts lumpy reclaim but
    there is an expectation that compaction is used for hugepage
    allocations these days and lumpy reclaim's days are numbered.

    Patches 3-4 add warnings to XFS and ext4 if called from
    direct reclaim. With patch 1, this "never happens" and is
    intended to catch regressions in this logic in the future.

    Patch 5 disables writeback of filesystem pages from kswapd unless
    the priority is raised to the point where kswapd is considered
    to be in trouble.

    Patch 6 throttles reclaimers if too many dirty pages are being
    encountered and the zones or backing devices are congested.

    Patch 7 invalidates dirty pages found at the end of the LRU so they
    are reclaimed quickly after being written back rather than
    waiting for a reclaimer to find them

    I consider this series to be orthogonal to the writeback work but it is
    worth noting that the writeback work affects the viability of patch 8 in
    particular.

    I tested this on ext4 and xfs using fs_mark, a simple writeback test based
    on dd and a micro benchmark that does a streaming write to a large mapping
    (exercises use-once LRU logic) followed by streaming writes to a mix of
    anonymous and file-backed mappings. The command line for fs_mark when
    booted with 512M looked something like

    ./fs_mark -d /tmp/fsmark-2676 -D 100 -N 150 -n 150 -L 25 -t 1 -S0 -s 10485760

    The number of files was adjusted depending on the amount of available
    memory so that the total size of the files created was about 3xRAM. For
    multiple threads,
    the -d switch is specified multiple times.

    The test machine is x86-64 with an older generation of AMD processor with
    4 cores. The underlying storage was 4 disks configured as RAID-0 as this
    was the best configuration of storage I had available. Swap is on a
    separate disk. Dirty ratio was tuned to 40% instead of the default of
    20%.

    Testing was run with and without monitors to both verify that the patches
    were operating as expected and that any performance gain was real and not
    due to interference from monitors.

    Here is a summary of results based on testing XFS.

    512M1P-xfs Files/s mean 32.69 ( 0.00%) 34.44 ( 5.08%)
    512M1P-xfs Elapsed Time fsmark 51.41 48.29
    512M1P-xfs Elapsed Time simple-wb 114.09 108.61
    512M1P-xfs Elapsed Time mmap-strm 113.46 109.34
    512M1P-xfs Kswapd efficiency fsmark 62% 63%
    512M1P-xfs Kswapd efficiency simple-wb 56% 61%
    512M1P-xfs Kswapd efficiency mmap-strm 44% 42%
    512M-xfs Files/s mean 30.78 ( 0.00%) 35.94 (14.36%)
    512M-xfs Elapsed Time fsmark 56.08 48.90
    512M-xfs Elapsed Time simple-wb 112.22 98.13
    512M-xfs Elapsed Time mmap-strm 219.15 196.67
    512M-xfs Kswapd efficiency fsmark 54% 56%
    512M-xfs Kswapd efficiency simple-wb 54% 55%
    512M-xfs Kswapd efficiency mmap-strm 45% 44%
    512M-4X-xfs Files/s mean 30.31 ( 0.00%) 33.33 ( 9.06%)
    512M-4X-xfs Elapsed Time fsmark 63.26 55.88
    512M-4X-xfs Elapsed Time simple-wb 100.90 90.25
    512M-4X-xfs Elapsed Time mmap-strm 261.73 255.38
    512M-4X-xfs Kswapd efficiency fsmark 49% 50%
    512M-4X-xfs Kswapd efficiency simple-wb 54% 56%
    512M-4X-xfs Kswapd efficiency mmap-strm 37% 36%
    512M-16X-xfs Files/s mean 60.89 ( 0.00%) 65.22 ( 6.64%)
    512M-16X-xfs Elapsed Time fsmark 67.47 58.25
    512M-16X-xfs Elapsed Time simple-wb 103.22 90.89
    512M-16X-xfs Elapsed Time mmap-strm 237.09 198.82
    512M-16X-xfs Kswapd efficiency fsmark 45% 46%
    512M-16X-xfs Kswapd efficiency simple-wb 53% 55%
    512M-16X-xfs Kswapd efficiency mmap-strm 33% 33%

    Up until 512M-4X, the FSmark improvements were statistically significant.
    For the 4X and 16X tests the results were within standard deviations but
    just barely. The time to completion for all tests is improved which is an
    important result. In general, kswapd efficiency is not affected by
    skipping dirty pages.

    1024M1P-xfs Files/s mean 39.09 ( 0.00%) 41.15 ( 5.01%)
    1024M1P-xfs Elapsed Time fsmark 84.14 80.41
    1024M1P-xfs Elapsed Time simple-wb 210.77 184.78
    1024M1P-xfs Elapsed Time mmap-strm 162.00 160.34
    1024M1P-xfs Kswapd efficiency fsmark 69% 75%
    1024M1P-xfs Kswapd efficiency simple-wb 71% 77%
    1024M1P-xfs Kswapd efficiency mmap-strm 43% 44%
    1024M-xfs Files/s mean 35.45 ( 0.00%) 37.00 ( 4.19%)
    1024M-xfs Elapsed Time fsmark 94.59 91.00
    1024M-xfs Elapsed Time simple-wb 229.84 195.08
    1024M-xfs Elapsed Time mmap-strm 405.38 440.29
    1024M-xfs Kswapd efficiency fsmark 79% 71%
    1024M-xfs Kswapd efficiency simple-wb 74% 74%
    1024M-xfs Kswapd efficiency mmap-strm 39% 42%
    1024M-4X-xfs Files/s mean 32.63 ( 0.00%) 35.05 ( 6.90%)
    1024M-4X-xfs Elapsed Time fsmark 103.33 97.74
    1024M-4X-xfs Elapsed Time simple-wb 204.48 178.57
    1024M-4X-xfs Elapsed Time mmap-strm 528.38 511.88
    1024M-4X-xfs Kswapd efficiency fsmark 81% 70%
    1024M-4X-xfs Kswapd efficiency simple-wb 73% 72%
    1024M-4X-xfs Kswapd efficiency mmap-strm 39% 38%
    1024M-16X-xfs Files/s mean 42.65 ( 0.00%) 42.97 ( 0.74%)
    1024M-16X-xfs Elapsed Time fsmark 103.11 99.11
    1024M-16X-xfs Elapsed Time simple-wb 200.83 178.24
    1024M-16X-xfs Elapsed Time mmap-strm 397.35 459.82
    1024M-16X-xfs Kswapd efficiency fsmark 84% 69%
    1024M-16X-xfs Kswapd efficiency simple-wb 74% 73%
    1024M-16X-xfs Kswapd efficiency mmap-strm 39% 40%

    All FSMark tests up to 16X had statistically significant improvements.
    For the most part, tests are completing faster, with the exception of the
    streaming writes to a mixture of anonymous and file-backed mappings, which
    were slower in two cases.

    In the cases where the mmap-strm tests were slower, there was more
    swapping due to dirty pages being skipped. The number of additional pages
    swapped is almost identical to the reduction in the number of pages written
    from reclaim. In other words, roughly the same number of pages were reclaimed
    but swapping was slower. As the test is a bit unrealistic and stresses
    memory heavily, the small shift is acceptable.

    4608M1P-xfs Files/s mean 29.75 ( 0.00%) 30.96 ( 3.91%)
    4608M1P-xfs Elapsed Time fsmark 512.01 492.15
    4608M1P-xfs Elapsed Time simple-wb 618.18 566.24
    4608M1P-xfs Elapsed Time mmap-strm 488.05 465.07
    4608M1P-xfs Kswapd efficiency fsmark 93% 86%
    4608M1P-xfs Kswapd efficiency simple-wb 88% 84%
    4608M1P-xfs Kswapd efficiency mmap-strm 46% 45%
    4608M-xfs Files/s mean 27.60 ( 0.00%) 28.85 ( 4.33%)
    4608M-xfs Elapsed Time fsmark 555.96 532.34
    4608M-xfs Elapsed Time simple-wb 659.72 571.85
    4608M-xfs Elapsed Time mmap-strm 1082.57 1146.38
    4608M-xfs Kswapd efficiency fsmark 89% 91%
    4608M-xfs Kswapd efficiency simple-wb 88% 82%
    4608M-xfs Kswapd efficiency mmap-strm 48% 46%
    4608M-4X-xfs Files/s mean 26.00 ( 0.00%) 27.47 ( 5.35%)
    4608M-4X-xfs Elapsed Time fsmark 592.91 564.00
    4608M-4X-xfs Elapsed Time simple-wb 616.65 575.07
    4608M-4X-xfs Elapsed Time mmap-strm 1773.02 1631.53
    4608M-4X-xfs Kswapd efficiency fsmark 90% 94%
    4608M-4X-xfs Kswapd efficiency simple-wb 87% 82%
    4608M-4X-xfs Kswapd efficiency mmap-strm 43% 43%
    4608M-16X-xfs Files/s mean 26.07 ( 0.00%) 26.42 ( 1.32%)
    4608M-16X-xfs Elapsed Time fsmark 602.69 585.78
    4608M-16X-xfs Elapsed Time simple-wb 606.60 573.81
    4608M-16X-xfs Elapsed Time mmap-strm 1549.75 1441.86
    4608M-16X-xfs Kswapd efficiency fsmark 98% 98%
    4608M-16X-xfs Kswapd efficiency simple-wb 88% 82%
    4608M-16X-xfs Kswapd efficiency mmap-strm 44% 42%

    Unlike the other tests, the fsmark results are not statistically
    significant but the min and max times are both improved and for the most
    part, tests completed faster.

    There are other indications that this is an improvement as well. For
    example, in the vast majority of cases, there were fewer pages scanned by
    direct reclaim implying in many cases that stalls due to direct reclaim
    are reduced. kswapd is scanning more due to skipping dirty pages, which is
    unfortunate, but the CPU usage is still acceptable.

    In an earlier set of tests, I used blktrace and in almost all cases
    throughput throughout the entire test was higher. However, I ended up
    discarding those results as recording blktrace data was too heavy for my
    liking.

    On a laptop, I plugged in a USB stick and ran a similar set of tests
    using it as backing storage. A desktop environment was running and for
    the entire duration of the tests, firefox and gnome terminal were
    launching and exiting to vaguely simulate a user.

    1024M-xfs Files/s mean 0.41 ( 0.00%) 0.44 ( 6.82%)
    1024M-xfs Elapsed Time fsmark 2053.52 1641.03
    1024M-xfs Elapsed Time simple-wb 1229.53 768.05
    1024M-xfs Elapsed Time mmap-strm 4126.44 4597.03
    1024M-xfs Kswapd efficiency fsmark 84% 85%
    1024M-xfs Kswapd efficiency simple-wb 92% 81%
    1024M-xfs Kswapd efficiency mmap-strm 60% 51%
    1024M-xfs Avg wait ms fsmark 5404.53 4473.87
    1024M-xfs Avg wait ms simple-wb 2541.35 1453.54
    1024M-xfs Avg wait ms mmap-strm 3400.25 3852.53

    The mmap-strm results were hurt because firefox launching had a tendency
    to push the test out of memory. On the positive side, firefox launched
    marginally faster with the patches applied. Time to completion for many
    tests was faster but more importantly - the "Avg wait" time as measured by
    iostat was far lower implying the system would be more responsive. It was
    also the case that "Avg wait ms" on the root filesystem was lower. I
    tested it manually and while the system felt slightly more responsive
    while copying data to a USB stick, it was marginal enough that it could be
    my imagination.

    This patch: do not writeback filesystem pages in direct reclaim.

    When kswapd is failing to keep zones above the min watermark, a process
    will enter direct reclaim in the same manner kswapd does. If a dirty page
    is encountered during the scan, this page is written to backing storage
    using mapping->writepage.

    This causes two problems. First, it can result in very deep call stacks,
    particularly if the target storage or filesystem are complex. Some
    filesystems ignore write requests from direct reclaim as a result. The
    second is that a single-page flush is inefficient in terms of IO. While
    there is an expectation that the elevator will merge requests, this does
    not always happen. Quoting Christoph Hellwig;

    The elevator has a relatively small window it can operate on,
    and can never fix up a bad large scale writeback pattern.

    This patch prevents direct reclaim writing back filesystem pages by
    checking if current is kswapd. Anonymous pages are still written to swap
    as there is not the equivalent of a flusher thread for anonymous pages.
    If the dirty pages cannot be written back, they are placed back on the LRU
    lists. There is now a direct dependency on dirty page balancing to
    prevent too many pages in the system being dirtied which would prevent
    reclaim making forward progress.
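
    A hedged sketch of the check this adds to shrink_page_list() in
    mm/vmscan.c (simplified; the real patch differs in detail):

    if (PageDirty(page)) {
            /*
             * Only kswapd may write back file-backed pages. A direct
             * reclaimer puts the page back on the LRU and relies on
             * kswapd or the flusher threads to clean it.
             */
            if (page_is_file_cache(page) && !current_is_kswapd())
                    goto keep_locked;
            /* anonymous pages still fall through to pageout()/swap */
    }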

    Signed-off-by: Mel Gorman
    Reviewed-by: Minchan Kim
    Cc: Dave Chinner
    Cc: Christoph Hellwig
    Cc: Johannes Weiner
    Cc: Wu Fengguang
    Cc: Jan Kara
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Alex Elder
    Cc: Theodore Ts'o
    Cc: Chris Mason
    Cc: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

15 Sep, 2011

1 commit

  • The vmstat_text array is only defined for CONFIG_SYSFS or CONFIG_PROC_FS,
    yet it is referenced for per-node vmstat with CONFIG_NUMA:

    drivers/built-in.o: In function `node_read_vmstat':
    node.c:(.text+0x1106df): undefined reference to `vmstat_text'

    Introduced in commit fa25c503dfa2 ("mm: per-node vmstat: show proper
    vmstats").

    Define the array for CONFIG_NUMA as well.
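
    One way to express the fix, sketched from mm/vmstat.c (the exact
    preprocessor condition in the final patch may differ):

    #if defined(CONFIG_PROC_FS) || defined(CONFIG_SYSFS) || defined(CONFIG_NUMA)
    const char * const vmstat_text[] = {
            "nr_free_pages",
            "nr_inactive_anon",
            /* ... remaining zone, writeback and event counters ... */
    };
    #endif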

    [akpm@linux-foundation.org: remove unneeded ifdefs]
    Signed-off-by: David Rientjes
    Reported-by: Cong Wang
    Acked-by: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

25 May, 2011

2 commits

  • Currently, cpu hotplug updates pcp->stat_threshold, but memory hotplug
    doesn't. There is no reason for this.

    [akpm@linux-foundation.org: fix CONFIG_SMP=n build]
    Signed-off-by: KOSAKI Motohiro
    Reviewed-by: KAMEZAWA Hiroyuki
    Acked-by: Mel Gorman
    Acked-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • commit 2ac390370a ("writeback: add
    /sys/devices/system/node/<node>/vmstat") added a vmstat entry. But
    strangely it only shows nr_written and nr_dirtied.

    # cat /sys/devices/system/node/node20/vmstat
    nr_written 0
    nr_dirtied 0

    Of course, that's not adequate. With this patch, the per-node vmstat file
    shows all VM statistics, just like /proc/vmstat.

    # cat /sys/devices/system/node/node0/vmstat
    nr_free_pages 899224
    nr_inactive_anon 201
    nr_active_anon 17380
    nr_inactive_file 31572
    nr_active_file 28277
    nr_unevictable 0
    nr_mlock 0
    nr_anon_pages 17321
    nr_mapped 8640
    nr_file_pages 60107
    nr_dirty 33
    nr_writeback 0
    nr_slab_reclaimable 6850
    nr_slab_unreclaimable 7604
    nr_page_table_pages 3105
    nr_kernel_stack 175
    nr_unstable 0
    nr_bounce 0
    nr_vmscan_write 0
    nr_writeback_temp 0
    nr_isolated_anon 0
    nr_isolated_file 0
    nr_shmem 260
    nr_dirtied 1050
    nr_written 938
    numa_hit 962872
    numa_miss 0
    numa_foreign 0
    numa_interleave 8617
    numa_local 962872
    numa_other 0
    nr_anon_transparent_hugepages 0
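
    A sketch of the show routine along the lines of node_read_vmstat() in
    drivers/base/node.c (sysdev-era interface assumed):

    static ssize_t node_read_vmstat(struct sys_device *dev,
                                    struct sysdev_attribute *attr, char *buf)
    {
            int nid = dev->id;
            int i;
            int n = 0;

            /* one "name value" line per zone stat item, as in /proc/vmstat */
            for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++)
                    n += sprintf(buf + n, "%s %lu\n", vmstat_text[i],
                                 node_page_state(nid, i));
            return n;
    }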

    [akpm@linux-foundation.org: no externs in .c files]
    Signed-off-by: KOSAKI Motohiro
    Cc: Michael Rubin
    Cc: Wu Fengguang
    Acked-by: David Rientjes
    Cc: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     

15 Apr, 2011

2 commits

  • I found it difficult to make sense of transparent huge pages without
    having any counters for its actions. Add some counters to vmstat for
    allocation of transparent hugepages and fallback to smaller pages.

    Optional patch, but useful for development and understanding the system.

    Contains improvements from Andrea Arcangeli and Johannes Weiner
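
    The counters are bumped with count_vm_event() at the allocation sites; a
    hedged sketch (the wrapper function below is hypothetical, only the event
    names come from this patch):

    static struct page *thp_alloc_counted(gfp_t gfp, int order)
    {
            struct page *page = alloc_pages(gfp | __GFP_COMP, order);

            if (page)
                    count_vm_event(THP_FAULT_ALLOC);    /* got a huge page  */
            else
                    count_vm_event(THP_FAULT_FALLBACK); /* fall back to 4K  */
            return page;
    }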

    [akpm@linux-foundation.org: coding-style fixes]
    [hannes@cmpxchg.org: fix vmstat_text[] entries]
    Signed-off-by: Andi Kleen
    Acked-by: Andrea Arcangeli
    Reviewed-by: KAMEZAWA Hiroyuki
    Signed-off-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andi Kleen
     
  • Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

23 Mar, 2011

1 commit

  • Add a new __GFP_OTHER_NODE flag to tell the low level numa statistics in
    zone_statistics() that an allocation is on behalf of another thread. This
    way the local and remote counters can still be correct, even when
    background daemons like khugepaged are changing memory mappings.

    This only affects the accounting, but I think it's worth doing that right
    to avoid confusing users.

    I first tried to just pass down the right node, but this required a lot of
    changes to pass down this parameter and at least one addition of a 10th
    argument to a 9 argument function. Using the flag is a lot less
    intrusive.

    Open question: should this also be used for migration?
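
    Roughly how zone_statistics() changes with the flag; a hedged sketch, not
    the verbatim patch:

    void zone_statistics(struct zone *preferred_zone, struct zone *z,
                         gfp_t flags)
    {
            if (z->zone_pgdat == preferred_zone->zone_pgdat) {
                    __inc_zone_state(z, NUMA_HIT);
            } else {
                    __inc_zone_state(z, NUMA_MISS);
                    __inc_zone_state(preferred_zone, NUMA_FOREIGN);
            }

            /* Account "local" against the thread being allocated for,
             * not the daemon doing the allocation. */
            if (z->node == ((flags & __GFP_OTHER_NODE) ?
                            preferred_zone->node : numa_node_id()))
                    __inc_zone_state(z, NUMA_LOCAL);
            else
                    __inc_zone_state(z, NUMA_OTHER);
    }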

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Andi Kleen
    Cc: Andrea Arcangeli
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andi Kleen
     

14 Jan, 2011

3 commits

  • Add hugepage stat information to /proc/vmstat and /proc/meminfo.

    Signed-off-by: Andrea Arcangeli
    Acked-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • reduce_pgdat_percpu_threshold() and restore_pgdat_percpu_threshold() exist
    to adjust the per-cpu vmstat thresholds while kswapd is awake to avoid
    errors due to counter drift. The functions duplicate some code so this
    patch replaces them with a single set_pgdat_percpu_threshold() that takes
    a callback function to calculate the desired threshold as a parameter.
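
    A sketch of the combined helper (modelled on set_pgdat_percpu_threshold()
    in mm/vmstat.c; details may differ):

    void set_pgdat_percpu_threshold(pg_data_t *pgdat,
                                    int (*calculate_pressure)(struct zone *))
    {
            struct zone *zone;
            int cpu, threshold, i;

            for (i = 0; i < pgdat->nr_zones; i++) {
                    zone = &pgdat->node_zones[i];
                    if (!zone->percpu_drift_mark)
                            continue;

                    /* the callback decides how tight the threshold is */
                    threshold = (*calculate_pressure)(zone);
                    for_each_possible_cpu(cpu)
                            per_cpu_ptr(zone->pageset, cpu)->stat_threshold
                                                            = threshold;
            }
    }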

    [akpm@linux-foundation.org: readability tweak]
    [kosaki.motohiro@jp.fujitsu.com: set_pgdat_percpu_threshold(): don't use for_each_online_cpu]
    Signed-off-by: Mel Gorman
    Reviewed-by: Christoph Lameter
    Reviewed-by: KAMEZAWA Hiroyuki
    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Commit aa45484 ("calculate a better estimate of NR_FREE_PAGES when memory
    is low") noted that watermarks were based on the vmstat NR_FREE_PAGES. To
    avoid synchronization overhead, these counters are maintained on a per-cpu
    basis and drained both periodically and when the delta rises above a
    threshold. On large CPU systems, the difference between the estimate and
    real value of NR_FREE_PAGES can be very high. The system can get into a
    case where pages are allocated far below the min watermark potentially
    causing livelock issues. The commit solved the problem by taking a better
    reading of NR_FREE_PAGES when memory was low.

    Unfortunately, as reported by Shaohua Li, this accurate reading can consume
    a large amount of CPU time on systems with many sockets due to cache line
    bouncing. This patch takes a different approach. For large machines
    where counter drift might be unsafe and while kswapd is awake, the per-cpu
    thresholds for the target pgdat are reduced to limit the level of drift to
    what should be a safe level. This incurs a performance penalty in heavy
    memory pressure by a factor that depends on the workload and the machine
    but the machine should function correctly without accidentally exhausting
    all memory on a node. There is an additional cost when kswapd wakes and
    sleeps but the event is not expected to be frequent - in Shaohua's test
    case, there was one recorded sleep and wake event at least.

    To ensure that kswapd wakes up, a safe version of zone_watermark_ok() is
    introduced that takes a more accurate reading of NR_FREE_PAGES when called
    from wakeup_kswapd, when deciding whether it is really safe to go back to
    sleep in sleeping_prematurely() and when deciding if a zone is really
    balanced or not in balance_pgdat(). We are still using an expensive
    function but limiting how often it is called.
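
    A hedged sketch of the safe watermark check (modelled on
    zone_watermark_ok_safe(); the drift mark decides when the expensive
    snapshot is taken):

    bool zone_watermark_ok_safe(struct zone *z, int order, unsigned long mark,
                                int classzone_idx, int alloc_flags)
    {
            long free_pages = zone_page_state(z, NR_FREE_PAGES);

            /* only pay for the accurate read when drift could matter */
            if (z->percpu_drift_mark && free_pages < z->percpu_drift_mark)
                    free_pages = zone_page_state_snapshot(z, NR_FREE_PAGES);

            return __zone_watermark_ok(z, order, mark, classzone_idx,
                                       alloc_flags, free_pages);
    }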

    When the test case is reproduced, the time spent in the watermark
    functions is reduced. The following report is on the percentage of time
    cumulatively spent in the functions zone_nr_free_pages(),
    zone_watermark_ok(), __zone_watermark_ok(), zone_watermark_ok_safe(),
    zone_page_state_snapshot(), zone_page_state().

    vanilla 11.6615%
    disable-threshold 0.2584%

    David said:

    : We had to pull aa454840 "mm: page allocator: calculate a better estimate
    : of NR_FREE_PAGES when memory is low and kswapd is awake" from 2.6.36
    : internally because tests showed that it would cause the machine to stall
    : as the result of heavy kswapd activity. I merged it back with this fix as
    : it is pending in the -mm tree and it solves the issue we were seeing, so I
    : definitely think this should be pushed to -stable (and I would seriously
    : consider it for 2.6.37 inclusion even at this late date).

    Signed-off-by: Mel Gorman
    Reported-by: Shaohua Li
    Reviewed-by: Christoph Lameter
    Tested-by: Nicolas Bareil
    Cc: David Rientjes
    Cc: Kyle McMartin
    Cc: [2.6.37.1, 2.6.36.x]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

08 Jan, 2011

2 commits

  • * 'for-2.6.38' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: (30 commits)
    gameport: use this_cpu_read instead of lookup
    x86: udelay: Use this_cpu_read to avoid address calculation
    x86: Use this_cpu_inc_return for nmi counter
    x86: Replace uses of current_cpu_data with this_cpu ops
    x86: Use this_cpu_ops to optimize code
    vmstat: User per cpu atomics to avoid interrupt disable / enable
    irq_work: Use per cpu atomics instead of regular atomics
    cpuops: Use cmpxchg for xchg to avoid lock semantics
    x86: this_cpu_cmpxchg and this_cpu_xchg operations
    percpu: Generic this_cpu_cmpxchg() and this_cpu_xchg support
    percpu,x86: relocate this_cpu_add_return() and friends
    connector: Use this_cpu operations
    xen: Use this_cpu_inc_return
    taskstats: Use this_cpu_ops
    random: Use this_cpu_inc_return
    fs: Use this_cpu_inc_return in buffer.c
    highmem: Use this_cpu_xx_return() operations
    vmstat: Use this_cpu_inc_return for vm statistics
    x86: Support for this_cpu_add, sub, dec, inc_return
    percpu: Generic support for this_cpu_add, sub, dec, inc_return
    ...

    Fixed up conflicts: in arch/x86/kernel/{apic/nmi.c, apic/x2apic_uv_x.c, process.c}
    as per Tejun.

    Linus Torvalds
     
  • * 'for-2.6.38' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: (33 commits)
    usb: don't use flush_scheduled_work()
    speedtch: don't abuse struct delayed_work
    media/video: don't use flush_scheduled_work()
    media/video: explicitly flush request_module work
    ioc4: use static work_struct for ioc4_load_modules()
    init: don't call flush_scheduled_work() from do_initcalls()
    s390: don't use flush_scheduled_work()
    rtc: don't use flush_scheduled_work()
    mmc: update workqueue usages
    mfd: update workqueue usages
    dvb: don't use flush_scheduled_work()
    leds-wm8350: don't use flush_scheduled_work()
    mISDN: don't use flush_scheduled_work()
    macintosh/ams: don't use flush_scheduled_work()
    vmwgfx: don't use flush_scheduled_work()
    tpm: don't use flush_scheduled_work()
    sonypi: don't use flush_scheduled_work()
    hvsi: don't use flush_scheduled_work()
    xen: don't use flush_scheduled_work()
    gdrom: don't use flush_scheduled_work()
    ...

    Fixed up trivial conflict in drivers/media/video/bt8xx/bttv-input.c
    as per Tejun.

    Linus Torvalds
     

18 Dec, 2010

1 commit

  • Currently the operations to increment vm counters must disable interrupts
    in order to not mess up their housekeeping of counters.

    So use this_cpu_cmpxchg() to avoid the overhead. Since we can no longer
    count on preemption being disabled, we still have some minor issues.
    The fetching of the counter thresholds is racy.
    A threshold from another cpu may be applied if we happen to be
    rescheduled on another cpu. However, the following vmstat operation
    will then bring the counter again under the threshold limit.

    The operations for __xxx_zone_state are not changed since the caller
    has taken care of the synchronization needs (and therefore the cycle
    count is even less than the optimized version for the irq disable case
    provided here).

    The optimization using this_cpu_cmpxchg will only be used if the arch
    supports efficient this_cpu_ops (must have CONFIG_CMPXCHG_LOCAL set!)

    The use of this_cpu_cmpxchg reduces the cycle count for the counter
    operations by 80% (inc_zone_page_state goes from 170 cycles to 32).
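
    A hedged sketch of the irq-free update loop (modelled on mod_state() in
    mm/vmstat.c, simplified):

    static void mod_state(struct zone *zone, enum zone_stat_item item,
                          int delta, int overstep_mode)
    {
            struct per_cpu_pageset __percpu *pcp = zone->pageset;
            long o, n, t, z;

            do {
                    z = 0;  /* amount to fold into the global counter */
                    t = this_cpu_read(pcp->stat_threshold);
                    o = this_cpu_read(pcp->vm_stat_diff[item]);
                    n = delta + o;

                    if (n > t || n < -t) {
                            int os = overstep_mode * (t >> 1);

                            /* overflow: push the delta into the zone counter */
                            z = n + os;
                            n = -os;
                    }
                    /* retry if another context raced with us on this cpu */
            } while (this_cpu_cmpxchg(pcp->vm_stat_diff[item], o, n) != o);

            if (z)
                    zone_page_state_add(z, zone, item);
    }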

    Signed-off-by: Christoph Lameter

    Christoph Lameter
     

17 Dec, 2010

2 commits

  • this_cpu_inc_return() saves us a memory access there. Code
    size does not change.

    V1->V2:
    - Fixed the location of the __per_cpu pointer attributes
    - Sparse checked
    V2->V3:
    - Move fixes to __percpu attribute usage to earlier patch
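
    A sketch of the resulting fast path (modelled on __inc_zone_state() after
    this change):

    void __inc_zone_state(struct zone *zone, enum zone_stat_item item)
    {
            struct per_cpu_pageset __percpu *pcp = zone->pageset;
            s8 __percpu *p = pcp->vm_stat_diff + item;
            s8 v, t;

            v = __this_cpu_inc_return(*p);  /* one RMW, no separate re-read */
            t = __this_cpu_read(pcp->stat_threshold);
            if (unlikely(v > t)) {
                    s8 overstep = t >> 1;

                    zone_page_state_add(v + overstep, zone, item);
                    __this_cpu_write(*p, -overstep);
            }
    }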

    Reviewed-by: Pekka Enberg
    Acked-by: H. Peter Anvin
    Signed-off-by: Christoph Lameter
    Signed-off-by: Tejun Heo

    Christoph Lameter
     
  • this_cpu operations can be used to slightly optimize the function. The
    changes will avoid some address calculations and replace them with the
    use of the percpu segment register.

    If one would have this_cpu_inc_return and this_cpu_dec_return then it
    would be possible to optimize inc_zone_page_state and
    dec_zone_page_state even more.

    V1->V2:
    - Fix __dec_zone_state overflow handling
    - Use s8 variables for temporary storage.

    V2->V3:
    - Put __percpu annotations in correct places.

    Reviewed-by: Pekka Enberg
    Acked-by: H. Peter Anvin
    Signed-off-by: Christoph Lameter
    Signed-off-by: Tejun Heo

    Christoph Lameter
     

15 Dec, 2010

1 commit

  • cancel_rearming_delayed_work[queue]() has been superseded by
    cancel_delayed_work_sync() quite some time ago. Convert all the
    in-kernel users. The conversions are completely equivalent and
    trivial.
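
    For mm/vmstat.c the conversion looks roughly like this (sketch of the
    cpu-down path in the vmstat hotplug callback):

    -       cancel_rearming_delayed_work(&per_cpu(vmstat_work, cpu));
    +       cancel_delayed_work_sync(&per_cpu(vmstat_work, cpu));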

    Signed-off-by: Tejun Heo
    Acked-by: "David S. Miller"
    Acked-by: Greg Kroah-Hartman
    Acked-by: Evgeniy Polyakov
    Cc: Jeff Garzik
    Cc: Benjamin Herrenschmidt
    Cc: Mauro Carvalho Chehab
    Cc: netdev@vger.kernel.org
    Cc: Anton Vorontsov
    Cc: David Woodhouse
    Cc: "J. Bruce Fields"
    Cc: Neil Brown
    Cc: Alex Elder
    Cc: xfs-masters@oss.sgi.com
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Andrew Morton
    Cc: netfilter-devel@vger.kernel.org
    Cc: Trond Myklebust
    Cc: linux-nfs@vger.kernel.org

    Tejun Heo
     

03 Dec, 2010

1 commit

  • The nr_dirty_[background_]threshold fields are misplaced before the
    numa_* fields, and users will read strange values.

    This is the right order. Before the patch, nr_dirty_background_threshold
    would read as 0 (the value from numa_miss).

    numa_hit 128501
    numa_miss 0
    numa_foreign 0
    numa_interleave 7388
    numa_local 128501
    numa_other 0
    nr_dirty_threshold 144291
    nr_dirty_background_threshold 72145

    Signed-off-by: Wu Fengguang
    Cc: Michael Rubin
    Reviewed-by: KOSAKI Motohiro
    Reviewed-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wu Fengguang
     

04 Nov, 2010

1 commit

  • Fix regression introduced by commit 79da826aee6 ("writeback: report
    dirty thresholds in /proc/vmstat").

    The incorrect pointer arithmetic can result in problems like this:

    BUG: unable to handle kernel paging request at 07c06d16
    IP: [] strnlen+0x6/0x20
    Call Trace:
    [] ? string+0x39/0xe0
    [] ? __wake_up_common+0x4b/0x80
    [] ? vsnprintf+0x1ec/0x380
    [] ? seq_printf+0x2e/0x60
    [] ? vmstat_show+0x26/0x30
    [] ? seq_read+0xa6/0x380
    [] ? seq_read+0x0/0x380
    [] ? proc_reg_read+0x5f/0x90
    [] ? vfs_read+0xa1/0x140
    [] ? proc_reg_read+0x0/0x90
    [] ? sys_read+0x41/0x70
    [] ? sysenter_do_call+0x12/0x26

    Reported-by: Tetsuo Handa
    Cc: Michael Rubin
    Signed-off-by: Wu Fengguang
    Signed-off-by: Linus Torvalds

    Wu Fengguang
     

27 Oct, 2010

3 commits

  • This removes the following warning from sparse:

    mm/vmstat.c:466:5: warning: symbol 'fragmentation_index' was not declared. Should it be static?

    [akpm@linux-foundation.org: move the include to top-of-file]
    Signed-off-by: Namhyung Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Namhyung Kim
     
  • The kernel already exposes the user-desired thresholds in /proc/sys/vm
    with dirty_background_ratio and dirty_ratio. But the kernel may
    alter the number requested without giving the user any indication that
    this is the case.

    Knowing the actual ratios the kernel is honoring can help app developers
    understand how their buffered IO will be sent to the disk.

    $ grep threshold /proc/vmstat
    nr_dirty_threshold 409111
    nr_dirty_background_threshold 818223
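
    The values come straight from the writeback code; a hedged sketch of how
    they are filled into the snapshot behind /proc/vmstat (cf. vmstat_start()
    in mm/vmstat.c):

    /* v is the unsigned long array holding the stats to be printed */
    global_dirty_limits(v + NR_DIRTY_BG_THRESHOLD,
                        v + NR_DIRTY_THRESHOLD);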

    Signed-off-by: Michael Rubin
    Cc: Wu Fengguang
    Cc: Dave Chinner
    Cc: Jens Axboe
    Cc: KOSAKI Motohiro
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michael Rubin
     
  • To help developers and applications gain visibility into writeback
    behaviour, this adds two entries to vm_stat_items and /proc/vmstat,
    allowing us to track the "written" and "dirtied" counts.

    # grep nr_dirtied /proc/vmstat
    nr_dirtied 3747
    # grep nr_written /proc/vmstat
    nr_written 3618
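
    A hedged sketch of the two accounting hooks (the real call sites are the
    page-dirtying and writeback-completion paths):

    __inc_zone_page_state(page, NR_DIRTIED);  /* page has just been dirtied  */
    inc_zone_page_state(page, NR_WRITTEN);    /* page writeback has finished */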

    Signed-off-by: Michael Rubin
    Reviewed-by: Wu Fengguang
    Cc: Dave Chinner
    Cc: Jens Axboe
    Cc: KOSAKI Motohiro
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michael Rubin
     

10 Sep, 2010

2 commits

  • …low and kswapd is awake

    Ordinarily watermark checks are based on the vmstat NR_FREE_PAGES as it is
    cheaper than scanning a number of lists. To avoid synchronization
    overhead, counter deltas are maintained on a per-cpu basis and drained
    both periodically and when the delta is above a threshold. On large CPU
    systems, the difference between the estimated and real value of
    NR_FREE_PAGES can be very high. If NR_FREE_PAGES is much higher than the
    number of real free pages in the buddy lists, the VM can allocate pages
    below the min watermark, at worst reducing the real number of free pages
    to zero. Even if
    the OOM killer kills some victim for freeing memory, it may not free
    memory if the exit path requires a new page resulting in livelock.

    This patch introduces a zone_page_state_snapshot() function (courtesy of
    Christoph) that takes a slightly more accurate view of an arbitrary vmstat
    counter. It is used to read NR_FREE_PAGES while kswapd is awake to avoid
    the watermark being accidentally broken. The estimate is not perfect and
    may result in cache line bounces but is expected to be lighter than the
    IPI calls necessary to continually drain the per-cpu counters while kswapd
    is awake.
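
    A sketch of the helper (modelled on zone_page_state_snapshot(); it folds
    the per-cpu deltas into the global counter for a less stale reading):

    static inline unsigned long zone_page_state_snapshot(struct zone *zone,
                                            enum zone_stat_item item)
    {
            long x = atomic_long_read(&zone->vm_stat[item]);
            int cpu;

            /* add in the not-yet-drained per-cpu differentials */
            for_each_online_cpu(cpu)
                    x += per_cpu_ptr(zone->pageset, cpu)->vm_stat_diff[item];

            if (x < 0)
                    x = 0;
            return x;
    }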

    Signed-off-by: Christoph Lameter <cl@linux.com>
    Signed-off-by: Mel Gorman <mel@csn.ul.ie>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

    Christoph Lameter
     
  • refresh_zone_stat_thresholds() calculates its thresholds based on the number of
    online cpus. It's called at cpu offlining but needs to be called at
    onlining, too.
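
    A simplified sketch of the notifier change in mm/vmstat.c:

    case CPU_ONLINE:
    case CPU_ONLINE_FROZEN:
            refresh_zone_stat_thresholds();  /* recompute for the new cpu count */
            start_cpu_timer(cpu);
            break;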

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Christoph Lameter
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     

10 Aug, 2010

2 commits

  • Since 2.6.28, zone->prev_priority has been unused, so it can be removed
    safely. This reduces stack usage slightly.

    Now I have to say that I'm sorry: two years ago I thought prev_priority
    could be integrated again and would be useful, but four (or more) attempts
    have not produced good performance numbers, so I have given up on that
    approach.

    The rest of this changelog is notes on prev_priority, why it existed in
    the first place and why it might not be necessary any more. This information
    is based heavily on discussions between Andrew Morton, Rik van Riel and
    Kosaki Motohiro, who is heavily quoted from.

    Historically prev_priority was important because it determined when the VM
    would start unmapping PTE pages. i.e. there are no balances of note within
    the VM, Anon vs File and Mapped vs Unmapped. Without prev_priority, there
    is a potential risk of unnecessarily increasing minor faults as a large
    amount of read activity of use-once pages could push mapped pages to the
    end of the LRU and get unmapped.

    There is no proof this is still a problem but currently it is not considered
    to be. Active files are not deactivated if the active file list is smaller
    than the inactive list, reducing the likelihood that file-mapped pages are
    being pushed off the LRU and referenced executable pages are kept on the
    active list to avoid them getting pushed out by read activity.

    Even if it is a problem, prev_priority wouldn't work nowadays. First of
    all, current vmscan still has a lot of UP-centric code; it exposes some
    weaknesses on machines with dozens of CPUs. I think we need more and more
    improvement.

    The problem is that current vmscan mixes up per-system pressure, per-zone
    pressure and per-task pressure a bit. For example, prev_priority tries to
    boost the priority to match other concurrent reclaimers, but if another
    task has a mempolicy restriction that is unnecessary, and it also causes
    large latencies and excessive reclaim. Per-task priority plus the
    prev_priority adjustment emulates per-system pressure, but it has two
    issues: 1) the emulation is too rough and brutal, and 2) we need per-zone
    pressure, not per-system.

    Another example: currently DEF_PRIORITY is 12, which means the LRU rotates
    about 2 cycles (1/4096 + 1/2048 + 1/1024 + ... + 1) before invoking the
    OOM killer. But if 10,000 threads enter DEF_PRIORITY reclaim at the same
    time, the system has higher memory pressure than priority==0
    (1/4096 * 10,000 > 2). prev_priority can't solve such multithreaded
    workload issues. In other words, the prev_priority concept assumes the
    system doesn't have lots of threads.

    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: Mel Gorman
    Reviewed-by: Johannes Weiner
    Reviewed-by: Rik van Riel
    Cc: Dave Chinner
    Cc: Chris Mason
    Cc: Nick Piggin
    Cc: Rik van Riel
    Cc: Johannes Weiner
    Cc: Christoph Hellwig
    Cc: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Cc: Andrea Arcangeli
    Cc: Michael Rubin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • sum_vm_events() passes a cpumask to for_each_cpu(), but this is
    unnecessary since we have for_each_online_cpu(). Although the overhead is
    trivial, it is inconsistent coding.

    Let's use for_each_online_cpu() instead of for_each_cpu() with a cpumask
    argument.
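
    The resulting loop, sketched from sum_vm_events() after the change:

    static void sum_vm_events(unsigned long *ret)
    {
            int cpu;
            int i;

            memset(ret, 0, NR_VM_EVENT_ITEMS * sizeof(unsigned long));

            for_each_online_cpu(cpu) {
                    struct vm_event_state *this =
                                    &per_cpu(vm_event_states, cpu);

                    for (i = 0; i < NR_VM_EVENT_ITEMS; i++)
                            ret[i] += this->event[i];
            }
    }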

    Signed-off-by: Minchan Kim
    Reviewed-by: KAMEZAWA Hiroyuki
    Acked-by: Christoph Lameter
    Reviewed-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     

25 May, 2010

4 commits

  • Ordinarily when a high-order allocation fails, direct reclaim is entered
    to free pages to satisfy the allocation. With this patch, it is
    determined if an allocation failed due to external fragmentation instead
    of low memory and if so, the calling process will compact until a suitable
    page is freed. Compaction by moving pages in memory is considerably
    cheaper than paging out to disk and works where there are locked pages or
    no swap. If compaction fails to free a page of a suitable size, then
    reclaim will still occur.

    Direct compaction returns as soon as possible. As each block is
    compacted, it is checked if a suitable page has been freed and if so, it
    returns.

    [akpm@linux-foundation.org: Fix build errors]
    [aarcange@redhat.com: fix count_vm_event preempt in memory compaction direct reclaim]
    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel
    Reviewed-by: Minchan Kim
    Cc: KOSAKI Motohiro
    Cc: Christoph Lameter
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • This patch is the core of a mechanism which compacts memory in a zone by
    relocating movable pages towards the end of the zone.

    A single compaction run involves a migration scanner and a free scanner.
    Both scanners operate on pageblock-sized areas in the zone. The migration
    scanner starts at the bottom of the zone and searches for all movable
    pages within each area, isolating them onto a private list called
    migratelist. The free scanner starts at the top of the zone and searches
    for suitable areas and consumes the free pages within them, making them
    available for the migration scanner. The pages isolated for migration are
    then migrated to the newly isolated free pages.
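
    A purely illustrative sketch of one compaction run as described above;
    the helper names here are hypothetical and stand in for the real
    isolation and migration code:

    static void compact_zone_sketch(struct zone *zone, struct compact_control *cc)
    {
            /* migration scanner walks up from the bottom of the zone,
             * free scanner walks down from the top */
            while (cc->migrate_pfn < cc->free_pfn) {
                    isolate_movable_pages(zone, cc);    /* fill cc->migratepages */
                    isolate_target_freepages(zone, cc); /* fill cc->freepages    */
                    migrate_isolated_pages(cc);         /* move movable -> free  */

                    if (order_page_available(zone, cc->order))
                            break;  /* direct compaction can return early */
            }
    }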

    [aarcange@redhat.com: Fix unsafe optimisation]
    [mel@csn.ul.ie: do not schedule work on other CPUs for compaction]
    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel
    Reviewed-by: Minchan Kim
    Cc: KOSAKI Motohiro
    Cc: Christoph Lameter
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • The fragmentation index is only meaningful if an allocation would fail,
    and it indicates what the failure is due to. A value of -1, such as in
    many of the examples below, indicates that the allocation would succeed.
    If it would fail, the value is between 0 and 1. A value tending towards
    0 implies the allocation failed due to a lack of memory. A value tending
    towards 1 implies it failed due to external fragmentation.

    For the most part, the huge page size will be the size of interest, but
    not necessarily, so it is exported on a per-order and per-zone basis via
    /sys/kernel/debug/extfrag/extfrag_index

    > cat /sys/kernel/debug/extfrag/extfrag_index
    Node 0, zone DMA -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.00
    Node 0, zone Normal -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 0.954
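
    A sketch of the computation, modelled on __fragmentation_index() in
    mm/vmstat.c (struct contig_page_info summarises the zone's free lists):

    static int __fragmentation_index(unsigned int order,
                                     struct contig_page_info *info)
    {
            unsigned long requested = 1UL << order;

            if (!info->free_blocks_total)
                    return 0;

            /* The index only makes sense when an allocation would fail */
            if (info->free_blocks_suitable)
                    return -1000;

            /*
             * Scaled by 1000 to avoid floating point:
             * towards 0    => failure due to lack of memory,
             * towards 1000 => failure due to external fragmentation.
             */
            return 1000 - div_u64((1000 +
                            div_u64(info->free_pages * 1000ULL, requested)),
                            info->free_blocks_total);
    }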

    Signed-off-by: Mel Gorman
    Reviewed-by: Minchan Kim
    Acked-by: Rik van Riel
    Reviewed-by: Christoph Lameter
    Cc: KOSAKI Motohiro
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • The unusable free space index measures how much of the available free
    memory cannot be used to satisfy an allocation of a given size and is a
    value between 0 and 1. The higher the value, the more of free memory is
    unusable and by implication, the worse the external fragmentation is. For
    the most part, the huge page size will be the size of interest but not
    necessarily, so it is exported on a per-order and per-zone basis via
    /sys/kernel/debug/extfrag/unusable_index.

    > cat /sys/kernel/debug/extfrag/unusable_index
    Node 0, zone DMA 0.000 0.000 0.000 0.001 0.005 0.013 0.021 0.037 0.037 0.101 0.230
    Node 0, zone Normal 0.000 0.000 0.000 0.001 0.002 0.002 0.005 0.015 0.028 0.028 0.054
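
    A sketch of the metric, modelled on unusable_free_index() in mm/vmstat.c
    (again scaled by 1000 to avoid floating point):

    static int unusable_free_index(unsigned int order,
                                   struct contig_page_info *info)
    {
            /* No free memory is interpreted as all free memory unusable */
            if (info->free_pages == 0)
                    return 1000;

            /* fraction of free pages not in blocks of at least 2^order */
            return div_u64((info->free_pages -
                            (info->free_blocks_suitable << order)) * 1000ULL,
                           info->free_pages);
    }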

    [akpm@linux-foundation.org: Fix allnoconfig]
    Signed-off-by: Mel Gorman
    Reviewed-by: Minchan Kim
    Reviewed-by: KOSAKI Motohiro
    Reviewed-by: Christoph Lameter
    Cc: Rik van Riel
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

30 Mar, 2010

1 commit

  • …it slab.h inclusion from percpu.h

    percpu.h is included by sched.h and module.h and thus ends up being
    included when building most .c files. percpu.h includes slab.h which
    in turn includes gfp.h making everything defined by the two files
    universally available and complicating inclusion dependencies.

    percpu.h -> slab.h dependency is about to be removed. Prepare for
    this change by updating users of gfp and slab facilities to include those
    headers directly instead of assuming availability. As this conversion
    needs to touch large number of source files, the following script is
    used as the basis of conversion.

    http://userweb.kernel.org/~tj/misc/slabh-sweep.py

    The script does the following.

    * Scan files for gfp and slab usages and update includes such that
    only the necessary includes are there, i.e. if only gfp is used, gfp.h;
    if slab is used, slab.h.

    * When the script inserts a new include, it looks at the include
    blocks and tries to put the new include such that its order conforms
    to its surroundings. It's put in the include block which contains
    core kernel includes, in the same order that the rest are ordered -
    alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
    doesn't seem to be any matching order.

    * If the script can't find a place to put a new include (mostly
    because the file doesn't have a fitting include block), it prints out
    an error message indicating which .h file needs to be added to the
    file.

    The conversion was done in the following steps.

    1. The initial automatic conversion of all .c files updated slightly
    over 4000 files, deleting around 700 includes and adding ~480 gfp.h
    and ~3000 slab.h inclusions. The script emitted errors for ~400
    files.

    2. Each error was manually checked. Some didn't need the inclusion,
    some needed manual addition while adding it to implementation .h or
    embedding .c file was more appropriate for others. This step added
    inclusions to around 150 files.

    3. The script was run again and the output was compared to the edits
    from #2 to make sure no file was left behind.

    4. Several build tests were done and a couple of problems were fixed.
    e.g. lib/decompress_*.c used malloc/free() wrappers around slab
    APIs requiring slab.h to be added manually.

    5. The script was run on all .h files but without automatically
    editing them as sprinkling gfp.h and slab.h inclusions around .h
    files could easily lead to inclusion dependency hell. Most gfp.h
    inclusion directives were ignored as stuff from gfp.h was usually
    wildly available and often used in preprocessor macros. Each
    slab.h inclusion directive was examined and added manually as
    necessary.

    6. percpu.h was updated not to include slab.h.

    7. Build tests were done on the following configurations and failures
    were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
    distributed build env didn't work with gcov compiles) and a few
    more options had to be turned off depending on archs to make things
    build (like ipr on powerpc/64 which failed due to missing writeq).

    * x86 and x86_64 UP and SMP allmodconfig and a custom test config.
    * powerpc and powerpc64 SMP allmodconfig
    * sparc and sparc64 SMP allmodconfig
    * ia64 SMP allmodconfig
    * s390 SMP allmodconfig
    * alpha SMP allmodconfig
    * um on x86_64 SMP allmodconfig

    8. percpu.h modifications were reverted so that it could be applied as
    a separate patch and serve as bisection point.

    Given the fact that I had only a couple of failures from tests on step
    6, I'm fairly confident about the coverage of this conversion patch.
    If there is a breakage, it's likely to be something in one of the arch
    headers which should be easily discoverable on most builds of
    the specific arch.

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>

    Tejun Heo
     

07 Mar, 2010

1 commit

  • commit e815af95 ("change all_unreclaimable zone member to flags") changed
    the all_unreclaimable member to a bit flag, but that had an undesirable
    side effect: free_one_page() is one of the hottest paths in the Linux
    kernel, and increasing the number of atomic ops in it can reduce kernel
    performance a bit.

    Thus, this patch partially reverts that commit; at the least,
    all_unreclaimable shouldn't share a memory word with other zone flags.
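
    A sketch of the shape of the revert (identifier usage approximate):

    -       zone_set_flag(zone, ZONE_ALL_UNRECLAIMABLE);
    +       zone->all_unreclaimable = 1;   /* plain member, no atomic bitop */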

    [akpm@linux-foundation.org: fix patch interaction]
    Signed-off-by: KOSAKI Motohiro
    Cc: David Rientjes
    Cc: Wu Fengguang
    Cc: KAMEZAWA Hiroyuki
    Cc: Minchan Kim
    Cc: Huang Shijie
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     

05 Jan, 2010

2 commits

  • Remove the pageset notifier since it only marks that a processor
    exists on a specific node. Move that code into the vmstat notifier.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Tejun Heo

    Christoph Lameter
     
  • Use the per cpu allocator functionality to avoid per cpu arrays in struct zone.

    This drastically reduces the size of struct zone for systems with large
    amounts of processors and allows placement of critical variables of struct
    zone in one cacheline even on very large systems.

    Another effect is that the pagesets of one processor are placed near one
    another. If multiple pagesets from different zones fit into one cacheline
    then additional cacheline fetches can be avoided on the hot paths when
    allocating memory from multiple zones.

    Bootstrap becomes simpler if we use the same scheme for UP, SMP, NUMA. #ifdefs
    are reduced and we can drop the zone_pcp macro.

    Hotplug handling is also simplified since cpu alloc can bring up and
    shut down cpu areas for a specific cpu as a whole. So there is no need to
    allocate or free individual pagesets.
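
    A simplified sketch of the structural change and the resulting access
    pattern (field and macro names from the kernels of that era):

    /* before: one pageset (or pointer) per possible cpu embedded in each
     * zone, reached through the zone_pcp() macro */
    struct per_cpu_pageset  pageset[NR_CPUS];

    /* after: a single dynamically allocated percpu object per zone */
    struct per_cpu_pageset  __percpu *pageset;

    /* access becomes an ordinary percpu dereference */
    struct per_cpu_pageset *pcp = per_cpu_ptr(zone->pageset, cpu);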

    V7-V8:
    - Explain chicken-and-egg dilemma with percpu allocator.

    V4-V5:
    - Fix up cases where per_cpu_ptr is called before irq disable
    - Integrate the bootstrap logic that was separate before.

    tj: Build failure in pageset_cpuup_callback() due to missing ret
    variable fixed.

    Reviewed-by: Mel Gorman
    Signed-off-by: Christoph Lameter
    Signed-off-by: Tejun Heo

    Christoph Lameter
     

16 Dec, 2009

2 commits

  • If reclaim fails to make sufficient progress, the priority is raised.
    Once the priority is higher, kswapd starts waiting on congestion.
    However, if the zone is below the min watermark then kswapd needs to
    continue working without delay as there is a danger of an increased rate
    of GFP_ATOMIC allocation failure.

    This patch changes the conditions under which kswapd waits on congestion
    by only going to sleep if the min watermarks are being met.
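
    A hedged sketch of the changed condition in balance_pgdat()
    (has_under_min_watermark_zone is computed over the node's zones):

    if (total_scanned && priority < DEF_PRIORITY - 2) {
            if (has_under_min_watermark_zone)
                    count_vm_event(KSWAPD_SKIP_CONGESTION_WAIT);
            else
                    congestion_wait(BLK_RW_ASYNC, HZ/10);
    }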

    [mel@csn.ul.ie: add stats to track how relevant the logic is]
    [mel@csn.ul.ie: make kswapd only check its own zones and rename the relevant counters]
    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • After kswapd balances all zones in a pgdat, it goes to sleep. In the
    event of no IO congestion, kswapd can go to sleep very shortly after the
    high watermark was reached. If there are a constant stream of allocations
    from parallel processes, it can mean that kswapd went to sleep too quickly
    and the high watermark is not being maintained for a sufficient length of time.

    This patch makes kswapd go to sleep as a two-stage process. It first
    tries to sleep for HZ/10. If it is woken up by another process or the
    high watermark is no longer met, it's considered a premature sleep and
    kswapd continues work. Otherwise it goes fully to sleep.

    This adds more counters to distinguish between fast and slow breaches of
    watermarks. A "fast" premature sleep is one where the low watermark was
    hit in a very short time after kswapd going to sleep. A "slow" premature
    sleep indicates that the high watermark was breached after a very short
    interval.
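
    A hedged sketch of the two-stage sleep; the real code sits in kswapd()'s
    sleep path and wraps this in prepare_to_wait()/finish_wait():

    long remaining = 0;

    if (!sleeping_prematurely(pgdat, order, remaining)) {
            remaining = schedule_timeout(HZ / 10);   /* stage 1: short nap  */
            if (!sleeping_prematurely(pgdat, order, remaining))
                    schedule();                      /* stage 2: full sleep */
            /* otherwise: woken early or watermark breached - keep working */
    }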

    Signed-off-by: Mel Gorman
    Cc: Frans Pop
    Cc: KOSAKI Motohiro
    Cc: Rik van Riel
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

29 Oct, 2009

1 commit

  • This patch updates percpu related symbols under kernel/ and mm/ such
    that percpu symbols are unique and don't clash with local symbols.
    This serves two purposes of decreasing the possibility of global
    percpu symbol collision and allowing dropping per_cpu__ prefix from
    percpu symbols.

    * kernel/lockdep.c: s/lock_stats/cpu_lock_stats/

    * kernel/sched.c: s/init_rq_rt/init_rt_rq_var/ (any better idea?)
    s/sched_group_cpus/sched_groups/

    * kernel/softirq.c: s/ksoftirqd/run_ksoftirqd/a

    * kernel/softlockup.c: s/(*)_timestamp/softlockup_\1_ts/
    s/watchdog_task/softlockup_watchdog/
    s/timestamp/ts/ for local variables

    * kernel/time/timer_stats: s/lookup_lock/tstats_lookup_lock/

    * mm/slab.c: s/reap_work/slab_reap_work/
    s/reap_node/slab_reap_node/

    * mm/vmstat.c: local variable changed to avoid collision with vmstat_work

    Partly based on Rusty Russell's "alloc_percpu: rename percpu vars
    which cause name clashes" patch.

    Signed-off-by: Tejun Heo
    Acked-by: (slab/vmstat) Christoph Lameter
    Reviewed-by: Christoph Lameter
    Cc: Rusty Russell
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: Andrew Morton
    Cc: Nick Piggin

    Tejun Heo
     

22 Sep, 2009

1 commit

  • If the system is running a heavy load of processes then concurrent reclaim
    can isolate a large number of pages from the LRU. /proc/vmstat and the
    output generated for an OOM do not show how many pages were isolated.

    This has been observed during process fork bomb testing (mstctl11 in LTP).

    This patch shows the information about isolated pages.

    Reproduced via:

    -----------------------
    % ./hackbench 140 process 1000
    => OOM occur

    active_anon:146 inactive_anon:0 isolated_anon:49245
    active_file:79 inactive_file:18 isolated_file:113
    unevictable:0 dirty:0 writeback:0 unstable:0 buffer:39
    free:370 slab_reclaimable:309 slab_unreclaimable:5492
    mapped:53 shmem:15 pagetables:28140 bounce:0
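
    The counters are maintained while pages are off the LRU; a hedged sketch
    of the accounting (file is 0 for anon, 1 for file-backed):

    /* on isolation, e.g. in shrink_inactive_list(): */
    __mod_zone_page_state(zone, NR_ISOLATED_ANON + file, nr_taken);

    /* on putback, once the batch has been processed: */
    __mod_zone_page_state(zone, NR_ISOLATED_ANON + file, -nr_taken);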

    Signed-off-by: KOSAKI Motohiro
    Acked-by: Rik van Riel
    Acked-by: Wu Fengguang
    Reviewed-by: Minchan Kim
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro