Eric Lee / smarc-fsl-linux-kernel

23 Mar, 2011

1 commit

78afd5612 mm: add __GFP_OTHER_NODE flag ... Browse Code »

Add a new __GFP_OTHER_NODE flag to tell the low level numa statistics in
zone_statistics() that an allocation is on behalf of another thread. This
way the local and remote counters can be still correct, even when
background daemons like khugepaged are changing memory mappings.

This only affects the accounting, but I think it's worth doing that right
to avoid confusing users.

I first tried to just pass down the right node, but this required a lot of
changes to pass down this parameter and at least one addition of a 10th
argument to a 9 argument function. Using the flag is a lot less
intrusive.

Open: should be also used for migration?

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Andi Kleen
Cc: Andrea Arcangeli
Reviewed-by: KAMEZAWA Hiroyuki
Cc: Johannes Weiner
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Andi Kleen
2011-03-23 08:44:05 +0800

14 Jan, 2011

3 commits

79134171d thp: transparent hugepage vmstat ... Browse Code »

Add hugepage stat information to /proc/vmstat and /proc/meminfo.

Signed-off-by: Andrea Arcangeli
Acked-by: Rik van Riel
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Andrea Arcangeli
2011-01-14 09:32:43 +0800
b44129b30 mm: vmstat: use a single setter function and callback for adjusting percpu thresholds ... Browse Code »

reduce_pgdat_percpu_threshold() and restore_pgdat_percpu_threshold() exist
to adjust the per-cpu vmstat thresholds while kswapd is awake to avoid
errors due to counter drift. The functions duplicate some code so this
patch replaces them with a single set_pgdat_percpu_threshold() that takes
a callback function to calculate the desired threshold as a parameter.

[akpm@linux-foundation.org: readability tweak]
[kosaki.motohiro@jp.fujitsu.com: set_pgdat_percpu_threshold(): don't use for_each_online_cpu]
Signed-off-by: Mel Gorman
Reviewed-by: Christoph Lameter
Reviewed-by: KAMEZAWA Hiroyuki
Signed-off-by: KOSAKI Motohiro
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Mel Gorman
2011-01-14 09:32:31 +0800
88f5acf88 mm: page allocator: adjust the per-cpu counter threshold when memory is low ... Browse Code »
1

Commit aa45484 ("calculate a better estimate of NR_FREE_PAGES when memory
is low") noted that watermarks were based on the vmstat NR_FREE_PAGES. To
avoid synchronization overhead, these counters are maintained on a per-cpu
basis and drained both periodically and when a threshold is above a
threshold. On large CPU systems, the difference between the estimate and
real value of NR_FREE_PAGES can be very high. The system can get into a
case where pages are allocated far below the min watermark potentially
causing livelock issues. The commit solved the problem by taking a better
reading of NR_FREE_PAGES when memory was low.

Unfortately, as reported by Shaohua Li this accurate reading can consume a
large amount of CPU time on systems with many sockets due to cache line
bouncing. This patch takes a different approach. For large machines
where counter drift might be unsafe and while kswapd is awake, the per-cpu
thresholds for the target pgdat are reduced to limit the level of drift to
what should be a safe level. This incurs a performance penalty in heavy
memory pressure by a factor that depends on the workload and the machine
but the machine should function correctly without accidentally exhausting
all memory on a node. There is an additional cost when kswapd wakes and
sleeps but the event is not expected to be frequent - in Shaohua's test
case, there was one recorded sleep and wake event at least.

To ensure that kswapd wakes up, a safe version of zone_watermark_ok() is
introduced that takes a more accurate reading of NR_FREE_PAGES when called
from wakeup_kswapd, when deciding whether it is really safe to go back to
sleep in sleeping_prematurely() and when deciding if a zone is really
balanced or not in balance_pgdat(). We are still using an expensive
function but limiting how often it is called.

When the test case is reproduced, the time spent in the watermark
functions is reduced. The following report is on the percentage of time
spent cumulatively spent in the functions zone_nr_free_pages(),
zone_watermark_ok(), __zone_watermark_ok(), zone_watermark_ok_safe(),
zone_page_state_snapshot(), zone_page_state().

vanilla 11.6615%
disable-threshold 0.2584%

David said:

: We had to pull aa454840 "mm: page allocator: calculate a better estimate
: of NR_FREE_PAGES when memory is low and kswapd is awake" from 2.6.36
: internally because tests showed that it would cause the machine to stall
: as the result of heavy kswapd activity. I merged it back with this fix as
: it is pending in the -mm tree and it solves the issue we were seeing, so I
: definitely think this should be pushed to -stable (and I would seriously
: consider it for 2.6.37 inclusion even at this late date).

Signed-off-by: Mel Gorman
Reported-by: Shaohua Li
Reviewed-by: Christoph Lameter
Tested-by: Nicolas Bareil
Cc: David Rientjes
Cc: Kyle McMartin
Cc: [2.6.37.1, 2.6.36.x]
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Mel Gorman
2011-01-14 09:32:31 +0800

08 Jan, 2011

2 commits

72eb6a791 Merge branch 'for-2.6.38' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu ... Browse Code »

* 'for-2.6.38' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: (30 commits)
gameport: use this_cpu_read instead of lookup
x86: udelay: Use this_cpu_read to avoid address calculation
x86: Use this_cpu_inc_return for nmi counter
x86: Replace uses of current_cpu_data with this_cpu ops
x86: Use this_cpu_ops to optimize code
vmstat: User per cpu atomics to avoid interrupt disable / enable
irq_work: Use per cpu atomics instead of regular atomics
cpuops: Use cmpxchg for xchg to avoid lock semantics
x86: this_cpu_cmpxchg and this_cpu_xchg operations
percpu: Generic this_cpu_cmpxchg() and this_cpu_xchg support
percpu,x86: relocate this_cpu_add_return() and friends
connector: Use this_cpu operations
xen: Use this_cpu_inc_return
taskstats: Use this_cpu_ops
random: Use this_cpu_inc_return
fs: Use this_cpu_inc_return in buffer.c
highmem: Use this_cpu_xx_return() operations
vmstat: Use this_cpu_inc_return for vm statistics
x86: Support for this_cpu_add, sub, dec, inc_return
percpu: Generic support for this_cpu_add, sub, dec, inc_return
...

Fixed up conflicts: in arch/x86/kernel/{apic/nmi.c, apic/x2apic_uv_x.c, process.c}
as per Tejun.

Linus Torvalds
2011-01-08 09:02:58 +0800
23d69b09b Merge branch 'for-2.6.38' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq ... Browse Code »

* 'for-2.6.38' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: (33 commits)
usb: don't use flush_scheduled_work()
speedtch: don't abuse struct delayed_work
media/video: don't use flush_scheduled_work()
media/video: explicitly flush request_module work
ioc4: use static work_struct for ioc4_load_modules()
init: don't call flush_scheduled_work() from do_initcalls()
s390: don't use flush_scheduled_work()
rtc: don't use flush_scheduled_work()
mmc: update workqueue usages
mfd: update workqueue usages
dvb: don't use flush_scheduled_work()
leds-wm8350: don't use flush_scheduled_work()
mISDN: don't use flush_scheduled_work()
macintosh/ams: don't use flush_scheduled_work()
vmwgfx: don't use flush_scheduled_work()
tpm: don't use flush_scheduled_work()
sonypi: don't use flush_scheduled_work()
hvsi: don't use flush_scheduled_work()
xen: don't use flush_scheduled_work()
gdrom: don't use flush_scheduled_work()
...

Fixed up trivial conflict in drivers/media/video/bt8xx/bttv-input.c
as per Tejun.

Linus Torvalds
2011-01-08 08:58:04 +0800

18 Dec, 2010

1 commit

7c8391206 vmstat: User per cpu atomics to avoid interrupt disable / enable ... Browse Code »

Currently the operations to increment vm counters must disable interrupts
in order to not mess up their housekeeping of counters.

So use this_cpu_cmpxchg() to avoid the overhead. Since we can no longer
count on preremption being disabled we still have some minor issues.
The fetching of the counter thresholds is racy.
A threshold from another cpu may be applied if we happen to be
rescheduled on another cpu. However, the following vmstat operation
will then bring the counter again under the threshold limit.

The operations for __xxx_zone_state are not changed since the caller
has taken care of the synchronization needs (and therefore the cycle
count is even less than the optimized version for the irq disable case
provided here).

The optimization using this_cpu_cmpxchg will only be used if the arch
supports efficient this_cpu_ops (must have CONFIG_CMPXCHG_LOCAL set!)

The use of this_cpu_cmpxchg reduces the cycle count for the counter
operations by %80 (inc_zone_page_state goes from 170 cycles to 32).

Signed-off-by: Christoph Lameter

Christoph Lameter
2010-12-18 22:54:49 +0800

17 Dec, 2010

2 commits

908ee0f12 vmstat: Use this_cpu_inc_return for vm statistics ... Browse Code »

this_cpu_inc_return() saves us a memory access there. Code
size does not change.

V1->V2:
- Fixed the location of the __per_cpu pointer attributes
- Sparse checked
V2->V3:
- Move fixes to __percpu attribute usage to earlier patch

Reviewed-by: Pekka Enberg
Acked-by: H. Peter Anvin
Signed-off-by: Christoph Lameter
Signed-off-by: Tejun Heo

Christoph Lameter
2010-12-17 22:18:04 +0800
12938a922 vmstat: Optimize zone counter modifications through the use of this cpu operations ... Browse Code »

this cpu operations can be used to slightly optimize the function. The
changes will avoid some address calculations and replace them with the
use of the percpu segment register.

If one would have this_cpu_inc_return and this_cpu_dec_return then it
would be possible to optimize inc_zone_page_state and
dec_zone_page_state even more.

V1->V2:
- Fix __dec_zone_state overflow handling
- Use s8 variables for temporary storage.

V2->V3:
- Put __percpu annotations in correct places.

Reviewed-by: Pekka Enberg
Acked-by: H. Peter Anvin
Signed-off-by: Christoph Lameter
Signed-off-by: Tejun Heo

Christoph Lameter
2010-12-17 22:07:18 +0800

15 Dec, 2010

1 commit

afe2c511f workqueue: convert cancel_rearming_delayed_work[queue]() users to cancel_delayed_work_sync() ... Browse Code »

cancel_rearming_delayed_work[queue]() has been superceded by
cancel_delayed_work_sync() quite some time ago. Convert all the
in-kernel users. The conversions are completely equivalent and
trivial.

Signed-off-by: Tejun Heo
Acked-by: "David S. Miller"
Acked-by: Greg Kroah-Hartman
Acked-by: Evgeniy Polyakov
Cc: Jeff Garzik
Cc: Benjamin Herrenschmidt
Cc: Mauro Carvalho Chehab
Cc: netdev@vger.kernel.org
Cc: Anton Vorontsov
Cc: David Woodhouse
Cc: "J. Bruce Fields"
Cc: Neil Brown
Cc: Alex Elder
Cc: xfs-masters@oss.sgi.com
Cc: Christoph Lameter
Cc: Pekka Enberg
Cc: Andrew Morton
Cc: netfilter-devel@vger.kernel.org
Cc: Trond Myklebust
Cc: linux-nfs@vger.kernel.org

Tejun Heo
2010-12-15 17:56:11 +0800

03 Dec, 2010

1 commit

e172662d1 vmstat: fix dirty threshold ordering ... Browse Code »

The nr_dirty_[background_]threshold fields are misplaced before the
numa_* fields, and users will read strange values.

This is the right order. Before patch, nr_dirty_background_threshold
will read as 0 (the value from numa_miss).

numa_hit 128501
numa_miss 0
numa_foreign 0
numa_interleave 7388
numa_local 128501
numa_other 0
nr_dirty_threshold 144291
nr_dirty_background_threshold 72145

Signed-off-by: Wu Fengguang
Cc: Michael Rubin
Reviewed-by: KOSAKI Motohiro
Reviewed-by: Minchan Kim
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Wu Fengguang
2010-12-03 06:51:14 +0800

04 Nov, 2010

1 commit

ff8b16d7e vmstat: fix offset calculation on void* ... Browse Code »

Fix regression introduced by commit 79da826aee6 ("writeback: report
dirty thresholds in /proc/vmstat").

The incorrect pointer arithmetic can result in problems like this:

BUG: unable to handle kernel paging request at 07c06d16
IP: [] strnlen+0x6/0x20
Call Trace:
[] ? string+0x39/0xe0
[] ? __wake_up_common+0x4b/0x80
[] ? vsnprintf+0x1ec/0x380
[] ? seq_printf+0x2e/0x60
[] ? vmstat_show+0x26/0x30
[] ? seq_read+0xa6/0x380
[] ? seq_read+0x0/0x380
[] ? proc_reg_read+0x5f/0x90
[] ? vfs_read+0xa1/0x140
[] ? proc_reg_read+0x0/0x90
[] ? sys_read+0x41/0x70
[] ? sysenter_do_call+0x12/0x26

Reported-by: Tetsuo Handa
Cc: Michael Rubin
Signed-off-by: Wu Fengguang
Signed-off-by: Linus Torvalds

Wu Fengguang
2010-11-04 02:39:58 +0800

27 Oct, 2010

3 commits

36deb0be3 vmstat: include compaction.h when CONFIG_COMPACTION ... Browse Code »

This removes following warning from sparse:

mm/vmstat.c:466:5: warning: symbol 'fragmentation_index' was not declared. Should it be static?

[akpm@linux-foundation.org: move the include to top-of-file]
Signed-off-by: Namhyung Kim
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Namhyung Kim
2010-10-27 07:52:10 +0800
79da826ae writeback: report dirty thresholds in /proc/vmstat ... Browse Code »

The kernel already exposes the user desired thresholds in /proc/sys/vm
with dirty_background_ratio and background_ratio. But the kernel may
alter the number requested without giving the user any indication that is
the case.

Knowing the actual ratios the kernel is honoring can help app developers
understand how their buffered IO will be sent to the disk.

$ grep threshold /proc/vmstat
nr_dirty_threshold 409111
nr_dirty_background_threshold 818223

Signed-off-by: Michael Rubin
Cc: Wu Fengguang
Cc: Dave Chinner
Cc: Jens Axboe
Cc: KOSAKI Motohiro
Cc: Nick Piggin
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Michael Rubin
2010-10-27 07:52:06 +0800
ea941f0e2 writeback: add nr_dirtied and nr_written to /proc/vmstat ... Browse Code »

To help developers and applications gain visibility into writeback
behaviour adding two entries to vm_stat_items and /proc/vmstat. This will
allow us to track the "written" and "dirtied" counts.

# grep nr_dirtied /proc/vmstat
nr_dirtied 3747
# grep nr_written /proc/vmstat
nr_written 3618

Signed-off-by: Michael Rubin
Reviewed-by: Wu Fengguang
Cc: Dave Chinner
Cc: Jens Axboe
Cc: KOSAKI Motohiro
Cc: Nick Piggin
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Michael Rubin
2010-10-27 07:52:06 +0800

10 Sep, 2010

2 commits

aa4548403 mm: page allocator: calculate a better estimate of NR_FREE_PAGES when memory is … ... Browse Code »

…low and kswapd is awake

Ordinarily watermark checks are based on the vmstat NR_FREE_PAGES as it is
cheaper than scanning a number of lists. To avoid synchronization
overhead, counter deltas are maintained on a per-cpu basis and drained
both periodically and when the delta is above a threshold. On large CPU
systems, the difference between the estimated and real value of
NR_FREE_PAGES can be very high. If NR_FREE_PAGES is much higher than
number of real free page in buddy, the VM can allocate pages below min
watermark, at worst reducing the real number of pages to zero. Even if
the OOM killer kills some victim for freeing memory, it may not free
memory if the exit path requires a new page resulting in livelock.

This patch introduces a zone_page_state_snapshot() function (courtesy of
Christoph) that takes a slightly more accurate view of an arbitrary vmstat
counter. It is used to read NR_FREE_PAGES while kswapd is awake to avoid
the watermark being accidentally broken. The estimate is not perfect and
may result in cache line bounces but is expected to be lighter than the
IPI calls necessary to continually drain the per-cpu counters while kswapd
is awake.

Signed-off-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Christoph Lameter
2010-09-10 09:57:25 +0800
5ee28a447 vmstat: update zone stat threshold when onlining a cpu ... Browse Code »

refresh_zone_stat_thresholds() calculates parameter based on the number of
online cpus. It's called at cpu offlining but needs to be called at
onlining, too.

Signed-off-by: KAMEZAWA Hiroyuki
Cc: Christoph Lameter
Acked-by: Mel Gorman
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

KAMEZAWA Hiroyuki
2010-09-10 09:57:25 +0800

10 Aug, 2010

2 commits

25edde033 vmscan: kill prev_priority completely ... Browse Code »
43

Since 2.6.28 zone->prev_priority is unused. Then it can be removed
safely. It reduce stack usage slightly.

Now I have to say that I'm sorry. 2 years ago, I thought prev_priority
can be integrate again, it's useful. but four (or more) times trying
haven't got good performance number. Thus I give up such approach.

The rest of this changelog is notes on prev_priority and why it existed in
the first place and why it might be not necessary any more. This information
is based heavily on discussions between Andrew Morton, Rik van Riel and
Kosaki Motohiro who is heavily quotes from.

Historically prev_priority was important because it determined when the VM
would start unmapping PTE pages. i.e. there are no balances of note within
the VM, Anon vs File and Mapped vs Unmapped. Without prev_priority, there
is a potential risk of unnecessarily increasing minor faults as a large
amount of read activity of use-once pages could push mapped pages to the
end of the LRU and get unmapped.

There is no proof this is still a problem but currently it is not considered
to be. Active files are not deactivated if the active file list is smaller
than the inactive list reducing the liklihood that file-mapped pages are
being pushed off the LRU and referenced executable pages are kept on the
active list to avoid them getting pushed out by read activity.

Even if it is a problem, prev_priority prev_priority wouldn't works
nowadays. First of all, current vmscan still a lot of UP centric code. it
expose some weakness on some dozens CPUs machine. I think we need more and
more improvement.

The problem is, current vmscan mix up per-system-pressure, per-zone-pressure
and per-task-pressure a bit. example, prev_priority try to boost priority to
other concurrent priority. but if the another task have mempolicy restriction,
it is unnecessary, but also makes wrong big latency and exceeding reclaim.
per-task based priority + prev_priority adjustment make the emulation of
per-system pressure. but it have two issue 1) too rough and brutal emulation
2) we need per-zone pressure, not per-system.

Another example, currently DEF_PRIORITY is 12. it mean the lru rotate about
2 cycle (1/4096 + 1/2048 + 1/1024 + .. + 1) before invoking OOM-Killer.
but if 10,0000 thrreads enter DEF_PRIORITY reclaim at the same time, the
system have higher memory pressure than priority==0 (1/4096*10,000 > 2).
prev_priority can't solve such multithreads workload issue. In other word,
prev_priority concept assume the sysmtem don't have lots threads."

Signed-off-by: KOSAKI Motohiro
Signed-off-by: Mel Gorman
Reviewed-by: Johannes Weiner
Reviewed-by: Rik van Riel
Cc: Dave Chinner
Cc: Chris Mason
Cc: Nick Piggin
Cc: Rik van Riel
Cc: Johannes Weiner
Cc: Christoph Hellwig
Cc: KAMEZAWA Hiroyuki
Cc: KOSAKI Motohiro
Cc: Andrea Arcangeli
Cc: Michael Rubin
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

KOSAKI Motohiro
2010-08-10 11:45:00 +0800
31f961a89 mm: use for_each_online_cpu() in vmstat ... Browse Code »

The sum_vm_events passes cpumask for for_each_cpu(). But it's useless
since we have for_each_online_cpu. Althougth it's tirival overhead, it's
not good about coding consistency.

Let's use for_each_online_cpu instead of for_each_cpu with cpumask
argument.

Signed-off-by: Minchan Kim
Reviewed-by: KAMEZAWA Hiroyuki
Acked-by: Christoph Lameter
Reviewed-by: KOSAKI Motohiro
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Minchan Kim
2010-08-10 11:44:57 +0800

25 May, 2010

4 commits

56de7263f mm: compaction: direct compact when a high-order allocation fails ... Browse Code »

Ordinarily when a high-order allocation fails, direct reclaim is entered
to free pages to satisfy the allocation. With this patch, it is
determined if an allocation failed due to external fragmentation instead
of low memory and if so, the calling process will compact until a suitable
page is freed. Compaction by moving pages in memory is considerably
cheaper than paging out to disk and works where there are locked pages or
no swap. If compaction fails to free a page of a suitable size, then
reclaim will still occur.

Direct compaction returns as soon as possible. As each block is
compacted, it is checked if a suitable page has been freed and if so, it
returns.

[akpm@linux-foundation.org: Fix build errors]
[aarcange@redhat.com: fix count_vm_event preempt in memory compaction direct reclaim]
Signed-off-by: Mel Gorman
Acked-by: Rik van Riel
Reviewed-by: Minchan Kim
Cc: KOSAKI Motohiro
Cc: Christoph Lameter
Cc: KAMEZAWA Hiroyuki
Signed-off-by: Andrea Arcangeli
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Mel Gorman
2010-05-25 23:06:59 +0800
748446bb6 mm: compaction: memory compaction core ... Browse Code »

This patch is the core of a mechanism which compacts memory in a zone by
relocating movable pages towards the end of the zone.

A single compaction run involves a migration scanner and a free scanner.
Both scanners operate on pageblock-sized areas in the zone. The migration
scanner starts at the bottom of the zone and searches for all movable
pages within each area, isolating them onto a private list called
migratelist. The free scanner starts at the top of the zone and searches
for suitable areas and consumes the free pages within making them
available for the migration scanner. The pages isolated for migration are
then migrated to the newly isolated free pages.

[aarcange@redhat.com: Fix unsafe optimisation]
[mel@csn.ul.ie: do not schedule work on other CPUs for compaction]
Signed-off-by: Mel Gorman
Acked-by: Rik van Riel
Reviewed-by: Minchan Kim
Cc: KOSAKI Motohiro
Cc: Christoph Lameter
Cc: KAMEZAWA Hiroyuki
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Mel Gorman
2010-05-25 23:06:59 +0800
f1a5ab121 mm: export fragmentation index via debugfs ... Browse Code »

The fragmentation fragmentation index, is only meaningful if an allocation
would fail and indicates what the failure is due to. A value of -1 such
as in many of the examples above states that the allocation would succeed.
If it would fail, the value is between 0 and 1. A value tending towards
0 implies the allocation failed due to a lack of memory. A value tending
towards 1 implies it failed due to external fragmentation.

For the most part, the huge page size will be the size of interest but not
necessarily so it is exported on a per-order and per-zo basis via
/sys/kernel/debug/extfrag/extfrag_index

> cat /sys/kernel/debug/extfrag/extfrag_index
Node 0, zone DMA -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.00
Node 0, zone Normal -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 0.954

Signed-off-by: Mel Gorman
Reviewed-by: Minchan Kim
Acked-by: Rik van Riel
Reviewed-by: Christoph Lameter
Cc: KOSAKI Motohiro
Cc: KAMEZAWA Hiroyuki
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Mel Gorman
2010-05-25 23:06:59 +0800
d7a5752c0 mm: export unusable free space index via debugfs ... Browse Code »

The unusable free space index measures how much of the available free
memory cannot be used to satisfy an allocation of a given size and is a
value between 0 and 1. The higher the value, the more of free memory is
unusable and by implication, the worse the external fragmentation is. For
the most part, the huge page size will be the size of interest but not
necessarily so it is exported on a per-order and per-zone basis via
/sys/kernel/debug/extfrag/unusable_index.

> cat /sys/kernel/debug/extfrag/unusable_index
Node 0, zone DMA 0.000 0.000 0.000 0.001 0.005 0.013 0.021 0.037 0.037 0.101 0.230
Node 0, zone Normal 0.000 0.000 0.000 0.001 0.002 0.002 0.005 0.015 0.028 0.028 0.054

[akpm@linux-foundation.org: Fix allnoconfig]
Signed-off-by: Mel Gorman
Reviewed-by: Minchan Kim
Reviewed-by: KOSAKI Motohiro
Reviewed-by: Christoph Lameter
Cc: Rik van Riel
Cc: KAMEZAWA Hiroyuki
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Mel Gorman
2010-05-25 23:06:59 +0800

30 Mar, 2010

1 commit

5a0e3ad6a include cleanup: Update gfp.h and slab.h includes to prepare for breaking implic… ... Browse Code »

…it slab.h inclusion from percpu.h

percpu.h is included by sched.h and module.h and thus ends up being
included when building most .c files. percpu.h includes slab.h which
in turn includes gfp.h making everything defined by the two files
universally available and complicating inclusion dependencies.

percpu.h -> slab.h dependency is about to be removed. Prepare for
this change by updating users of gfp and slab facilities include those
headers directly instead of assuming availability. As this conversion
needs to touch large number of source files, the following script is
used as the basis of conversion.

http://userweb.kernel.org/~tj/misc/slabh-sweep.py

The script does the followings.

* Scan files for gfp and slab usages and update includes such that
only the necessary includes are there. ie. if only gfp is used,
gfp.h, if slab is used, slab.h.

* When the script inserts a new include, it looks at the include
blocks and try to put the new include such that its order conforms
to its surrounding. It's put in the include block which contains
core kernel includes, in the same order that the rest are ordered -
alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
doesn't seem to be any matching order.

* If the script can't find a place to put a new include (mostly
because the file doesn't have fitting include block), it prints out
an error message indicating which .h file needs to be added to the
file.

The conversion was done in the following steps.

1. The initial automatic conversion of all .c files updated slightly
over 4000 files, deleting around 700 includes and adding ~480 gfp.h
and ~3000 slab.h inclusions. The script emitted errors for ~400
files.

2. Each error was manually checked. Some didn't need the inclusion,
some needed manual addition while adding it to implementation .h or
embedding .c file was more appropriate for others. This step added
inclusions to around 150 files.

3. The script was run again and the output was compared to the edits
from #2 to make sure no file was left behind.

4. Several build tests were done and a couple of problems were fixed.
e.g. lib/decompress_*.c used malloc/free() wrappers around slab
APIs requiring slab.h to be added manually.

5. The script was run on all .h files but without automatically
editing them as sprinkling gfp.h and slab.h inclusions around .h
files could easily lead to inclusion dependency hell. Most gfp.h
inclusion directives were ignored as stuff from gfp.h was usually
wildly available and often used in preprocessor macros. Each
slab.h inclusion directive was examined and added manually as
necessary.

6. percpu.h was updated not to include slab.h.

7. Build test were done on the following configurations and failures
were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
distributed build env didn't work with gcov compiles) and a few
more options had to be turned off depending on archs to make things
build (like ipr on powerpc/64 which failed due to missing writeq).

* x86 and x86_64 UP and SMP allmodconfig and a custom test config.
* powerpc and powerpc64 SMP allmodconfig
* sparc and sparc64 SMP allmodconfig
* ia64 SMP allmodconfig
* s390 SMP allmodconfig
* alpha SMP allmodconfig
* um on x86_64 SMP allmodconfig

8. percpu.h modifications were reverted so that it could be applied as
a separate patch and serve as bisection point.

Given the fact that I had only a couple of failures from tests on step
6, I'm fairly confident about the coverage of this conversion patch.
If there is a breakage, it's likely to be something in one of the arch
headers which should be easily discoverable easily on most builds of
the specific arch.

Signed-off-by: Tejun Heo <tj@kernel.org>
Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>

Tejun Heo
2010-03-30 21:02:32 +0800

07 Mar, 2010

1 commit

93e4a89a8 mm: restore zone->all_unreclaimable to independence word ... Browse Code »

commit e815af95 ("change all_unreclaimable zone member to flags") changed
all_unreclaimable member to bit flag. But it had an undesireble side
effect. free_one_page() is one of most hot path in linux kernel and
increasing atomic ops in it can reduce kernel performance a bit.

Thus, this patch revert such commit partially. at least
all_unreclaimable shouldn't share memory word with other zone flags.

[akpm@linux-foundation.org: fix patch interaction]
Signed-off-by: KOSAKI Motohiro
Cc: David Rientjes
Cc: Wu Fengguang
Cc: KAMEZAWA Hiroyuki
Cc: Minchan Kim
Cc: Huang Shijie
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

KOSAKI Motohiro
2010-03-07 03:26:25 +0800

05 Jan, 2010

2 commits

ad596925e this_cpu: Remove pageset_notifier ... Browse Code »

Remove the pageset notifier since it only marks that a processor
exists on a specific node. Move that code into the vmstat notifier.

Signed-off-by: Christoph Lameter
Signed-off-by: Tejun Heo

Christoph Lameter
2010-01-05 14:34:51 +0800
99dcc3e5a this_cpu: Page allocator conversion ... Browse Code »

Use the per cpu allocator functionality to avoid per cpu arrays in struct zone.

This drastically reduces the size of struct zone for systems with large
amounts of processors and allows placement of critical variables of struct
zone in one cacheline even on very large systems.

Another effect is that the pagesets of one processor are placed near one
another. If multiple pagesets from different zones fit into one cacheline
then additional cacheline fetches can be avoided on the hot paths when
allocating memory from multiple zones.

Bootstrap becomes simpler if we use the same scheme for UP, SMP, NUMA. #ifdefs
are reduced and we can drop the zone_pcp macro.

Hotplug handling is also simplified since cpu alloc can bring up and
shut down cpu areas for a specific cpu as a whole. So there is no need to
allocate or free individual pagesets.

V7-V8:
- Explain chicken egg dilemmna with percpu allocator.

V4-V5:
- Fix up cases where per_cpu_ptr is called before irq disable
- Integrate the bootstrap logic that was separate before.

tj: Build failure in pageset_cpuup_callback() due to missing ret
variable fixed.

Reviewed-by: Mel Gorman
Signed-off-by: Christoph Lameter
Signed-off-by: Tejun Heo

Christoph Lameter
2010-01-05 14:34:51 +0800

16 Dec, 2009

2 commits

bb3ab5968 vmscan: stop kswapd waiting on congestion when the min watermark is not being met ... Browse Code »

If reclaim fails to make sufficient progress, the priority is raised.
Once the priority is higher, kswapd starts waiting on congestion.
However, if the zone is below the min watermark then kswapd needs to
continue working without delay as there is a danger of an increased rate
of GFP_ATOMIC allocation failure.

This patch changes the conditions under which kswapd waits on congestion
by only going to sleep if the min watermarks are being met.

[mel@csn.ul.ie: add stats to track how relevant the logic is]
[mel@csn.ul.ie: make kswapd only check its own zones and rename the relevant counters]
Signed-off-by: KOSAKI Motohiro
Signed-off-by: Mel Gorman
Reviewed-by: Rik van Riel
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

KOSAKI Motohiro
2009-12-16 00:53:16 +0800
f50de2d38 vmscan: have kswapd sleep for a short interval and double check it should be asleep ... Browse Code »

After kswapd balances all zones in a pgdat, it goes to sleep. In the
event of no IO congestion, kswapd can go to sleep very shortly after the
high watermark was reached. If there are a constant stream of allocations
from parallel processes, it can mean that kswapd went to sleep too quickly
and the high watermark is not being maintained for sufficient length time.

This patch makes kswapd go to sleep as a two-stage process. It first
tries to sleep for HZ/10. If it is woken up by another process or the
high watermark is no longer met, it's considered a premature sleep and
kswapd continues work. Otherwise it goes fully to sleep.

This adds more counters to distinguish between fast and slow breaches of
watermarks. A "fast" premature sleep is one where the low watermark was
hit in a very short time after kswapd going to sleep. A "slow" premature
sleep indicates that the high watermark was breached after a very short
interval.

Signed-off-by: Mel Gorman
Cc: Frans Pop
Cc: KOSAKI Motohiro
Cc: Rik van Riel
Cc: Christoph Lameter
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Mel Gorman
2009-12-16 00:53:16 +0800

29 Oct, 2009

1 commit

1871e52c7 percpu: make percpu symbols under kernel/ and mm/ unique ... Browse Code »

This patch updates percpu related symbols under kernel/ and mm/ such
that percpu symbols are unique and don't clash with local symbols.
This serves two purposes of decreasing the possibility of global
percpu symbol collision and allowing dropping per_cpu__ prefix from
percpu symbols.

* kernel/lockdep.c: s/lock_stats/cpu_lock_stats/

* kernel/sched.c: s/init_rq_rt/init_rt_rq_var/ (any better idea?)
s/sched_group_cpus/sched_groups/

* kernel/softirq.c: s/ksoftirqd/run_ksoftirqd/a

* kernel/softlockup.c: s/(*)_timestamp/softlockup_\1_ts/
s/watchdog_task/softlockup_watchdog/
s/timestamp/ts/ for local variables

* kernel/time/timer_stats: s/lookup_lock/tstats_lookup_lock/

* mm/slab.c: s/reap_work/slab_reap_work/
s/reap_node/slab_reap_node/

* mm/vmstat.c: local variable changed to avoid collision with vmstat_work

Partly based on Rusty Russell's "alloc_percpu: rename percpu vars
which cause name clashes" patch.

Signed-off-by: Tejun Heo
Acked-by: (slab/vmstat) Christoph Lameter
Reviewed-by: Christoph Lameter
Cc: Rusty Russell
Cc: Ingo Molnar
Cc: Thomas Gleixner
Cc: Andrew Morton
Cc: Nick Piggin

Tejun Heo
2009-10-29 21:34:13 +0800

22 Sep, 2009

3 commits

a731286de mm: vmstat: add isolate pages ... Browse Code »

If the system is running a heavy load of processes then concurrent reclaim
can isolate a large number of pages from the LRU. /proc/vmstat and the
output generated for an OOM do not show how many pages were isolated.

This has been observed during process fork bomb testing (mstctl11 in LTP).

This patch shows the information about isolated pages.

Reproduced via:

-----------------------
% ./hackbench 140 process 1000
=> OOM occur

active_anon:146 inactive_anon:0 isolated_anon:49245
active_file:79 inactive_file:18 isolated_file:113
unevictable:0 dirty:0 writeback:0 unstable:0 buffer:39
free:370 slab_reclaimable:309 slab_unreclaimable:5492
mapped:53 shmem:15 pagetables:28140 bounce:0

Signed-off-by: KOSAKI Motohiro
Acked-by: Rik van Riel
Acked-by: Wu Fengguang
Reviewed-by: Minchan Kim
Cc: Hugh Dickins
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

KOSAKI Motohiro
2009-09-22 22:17:29 +0800
4b02108ac mm: oom analysis: add shmem vmstat ... Browse Code »

Recently we encountered OOM problems due to memory use of the GEM cache.
Generally a large amuont of Shmem/Tmpfs pages tend to create a memory
shortage problem.

We often use the following calculation to determine the amount of shmem
pages:

shmem = NR_ACTIVE_ANON + NR_INACTIVE_ANON - NR_ANON_PAGES

however the expression does not consider isolated and mlocked pages.

This patch adds explicit accounting for pages used by shmem and tmpfs.

Signed-off-by: KOSAKI Motohiro
Acked-by: Rik van Riel
Reviewed-by: Christoph Lameter
Acked-by: Wu Fengguang
Cc: David Rientjes
Cc: Hugh Dickins
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

KOSAKI Motohiro
2009-09-22 22:17:27 +0800
c6a7f5728 mm: oom analysis: Show kernel stack usage in /proc/meminfo and OOM log output ... Browse Code »

The amount of memory allocated to kernel stacks can become significant and
cause OOM conditions. However, we do not display the amount of memory
consumed by stacks.

Add code to display the amount of memory used for stacks in /proc/meminfo.

Signed-off-by: KOSAKI Motohiro
Reviewed-by: Christoph Lameter
Reviewed-by: Minchan Kim
Reviewed-by: Rik van Riel
Cc: David Rientjes
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

KOSAKI Motohiro
2009-09-22 22:17:27 +0800

17 Jun, 2009

5 commits

24cf72518 vmscan: count the number of times zone_reclaim() scans and fails ... Browse Code »

On NUMA machines, the administrator can configure zone_reclaim_mode that
is a more targetted form of direct reclaim. On machines with large NUMA
distances for example, a zone_reclaim_mode defaults to 1 meaning that
clean unmapped pages will be reclaimed if the zone watermarks are not
being met.

There is a heuristic that determines if the scan is worthwhile but it is
possible that the heuristic will fail and the CPU gets tied up scanning
uselessly. Detecting the situation requires some guesswork and
experimentation so this patch adds a counter "zreclaim_failed" to
/proc/vmstat. If during high CPU utilisation this counter is increasing
rapidly, then the resolution to the problem may be to set
/proc/sys/vm/zone_reclaim_mode to 0.

[akpm@linux-foundation.org: name things consistently]
Signed-off-by: Mel Gorman
Reviewed-by: Rik van Riel
Cc: Christoph Lameter
Reviewed-by: KOSAKI Motohiro
Cc: Wu Fengguang
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Mel Gorman
2009-06-17 10:47:46 +0800
683776596 mm: remove CONFIG_UNEVICTABLE_LRU config option ... Browse Code »

Currently, nobody wants to turn UNEVICTABLE_LRU off. Thus this
configurability is unnecessary.

Signed-off-by: KOSAKI Motohiro
Cc: Johannes Weiner
Cc: Andi Kleen
Acked-by: Minchan Kim
Cc: David Woodhouse
Cc: Matt Mackall
Cc: Rik van Riel
Cc: Lee Schermerhorn
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

KOSAKI Motohiro
2009-06-17 10:47:42 +0800
08d9ae7cb vmscan: don't export nr_saved_scan in /proc/zoneinfo ... Browse Code »

The lru->nr_saved_scan's are not meaningful counters for even kernel
developers. They typically are smaller than 32 and are always 0 for large
lists. So remove them from /proc/zoneinfo.

Hopefully this interface change won't break too many scripts.
/proc/zoneinfo is too unstructured to be script friendly, and I wonder the
affected scripts - if there are any - are still bleeding since the not
long ago commit "vmscan: split LRU lists into anon & file sets", which
also touched the "scanned" line :)

If we are to re-export accumulated vmscan counts in the future, they can
go to new lines in /proc/zoneinfo instead of the current form, or to
/sys/devices/system/node/node0/meminfo?

Signed-off-by: Wu Fengguang

Acked-by: Rik van Riel
Cc: Nick Piggin
Acked-by: Christoph Lameter
Cc: Peter Zijlstra
Cc: KOSAKI Motohiro
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Wu Fengguang
2009-06-17 10:47:39 +0800
6e08a369e vmscan: cleanup the scan batching code ... Browse Code »

The vmscan batching logic is twisting. Move it into a standalone function
nr_scan_try_batch() and document it. No behavior change.

Signed-off-by: Wu Fengguang
Acked-by: Rik van Riel
Cc: Nick Piggin
Cc: Christoph Lameter
Acked-by: Peter Zijlstra
Acked-by: KOSAKI Motohiro
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Wu Fengguang
2009-06-17 10:47:39 +0800
418589663 page allocator: use allocation flags as an index to the zone watermark ... Browse Code »

ALLOC_WMARK_MIN, ALLOC_WMARK_LOW and ALLOC_WMARK_HIGH determin whether
pages_min, pages_low or pages_high is used as the zone watermark when
allocating the pages. Two branches in the allocator hotpath determine
which watermark to use.

This patch uses the flags as an array index into a watermark array that is
indexed with WMARK_* defines accessed via helpers. All call sites that
use zone->pages_* are updated to use the helpers for accessing the values
and the array offsets for setting.

Signed-off-by: Mel Gorman
Reviewed-by: Christoph Lameter
Cc: KOSAKI Motohiro
Cc: Pekka Enberg
Cc: Peter Zijlstra
Cc: Nick Piggin
Cc: Dave Hansen
Cc: Lee Schermerhorn
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Mel Gorman
2009-06-17 10:47:35 +0800

18 May, 2009

1 commit

eb33575cf [ARM] Double check memmap is actually valid with a memmap has unexpected holes V2 ... Browse Code »

pfn_valid() is meant to be able to tell if a given PFN has valid memmap
associated with it or not. In FLATMEM, it is expected that holes always
have valid memmap as long as there is valid PFNs either side of the hole.
In SPARSEMEM, it is assumed that a valid section has a memmap for the
entire section.

However, ARM and maybe other embedded architectures in the future free
memmap backing holes to save memory on the assumption the memmap is never
used. The page_zone linkages are then broken even though pfn_valid()
returns true. A walker of the full memmap must then do this additional
check to ensure the memmap they are looking at is sane by making sure the
zone and PFN linkages are still valid. This is expensive, but walkers of
the full memmap are extremely rare.

This was caught before for FLATMEM and hacked around but it hits again for
SPARSEMEM because the page_zone linkages can look ok where the PFN linkages
are totally screwed. This looks like a hatchet job but the reality is that
any clean solution would end up consumning all the memory saved by punching
these unexpected holes in the memmap. For example, we tried marking the
memmap within the section invalid but the section size exceeds the size of
the hole in most cases so pfn_valid() starts returning false where valid
memmap exists. Shrinking the size of the section would increase memory
consumption offsetting the gains.

This patch identifies when an architecture is punching unexpected holes
in the memmap that the memory model cannot automatically detect and sets
ARCH_HAS_HOLES_MEMORYMODEL. At the moment, this is restricted to EP93xx
which is the model sub-architecture this has been reported on but may expand
later. When set, walkers of the full memmap must call memmap_valid_within()
for each PFN and passing in what it expects the page and zone to be for
that PFN. If it finds the linkages to be broken, it assumes the memmap is
invalid for that PFN.

Signed-off-by: Mel Gorman
Signed-off-by: Russell King

Mel Gorman
2009-05-18 18:22:24 +0800

03 Apr, 2009

1 commit

98f4ebb29 mm: align vmstat_work's timer ... Browse Code »

Even though vmstat_work is marked deferrable, there are still benefits to
aligning it. For certain applications we want to keep OS jitter as low as
possible and aligning timers and work so they occur together can reduce
their overall impact.

Signed-off-by: Anton Blanchard
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Anton Blanchard
2009-04-03 10:04:48 +0800