09 Oct, 2012

7 commits

  • Presently CMA cannot migrate mlocked pages so it ends up failing to allocate
    contiguous memory space.

    This patch allows mlocked pages to be migrated out. Of course, it can affect
    realtime processes, but in the CMA use case a failed contiguous memory
    allocation is far worse than variable access latency to an mlocked page
    while CMA is running. If someone wants to make the system realtime, they
    shouldn't enable CMA because stalls can still happen at random times.

    [akpm@linux-foundation.org: tweak comment text, per Mel]
    Signed-off-by: Minchan Kim
    Acked-by: Mel Gorman
    Cc: Michal Nazarewicz
    Cc: Bartlomiej Zolnierkiewicz
    Cc: Marek Szyprowski
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • RECLAIM_DISTANCE represents the distance between nodes at which it is
    deemed too costly to allocate from; it's preferred to try to reclaim from
    a local zone before falling back to allocating on a remote node with such
    a distance.

    To do this, zone_reclaim_mode is set if the distance between any two
    nodes on the system is greater than this distance. This, however, ends
    up causing the page allocator to reclaim from every zone regardless of
    its affinity.

    What we really want is to reclaim only from zones that are closer than
    RECLAIM_DISTANCE. This patch adds a nodemask to each node that
    represents the set of nodes that are within this distance. During the
    zone iteration, if the bit for a zone's node is set for the local node,
    then reclaim is attempted; otherwise, the zone is skipped.
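    As an illustration only (simplified, standalone types and names; not the
    kernel code itself), the idea amounts to:

        #include <stdbool.h>

        #define MAX_NUMNODES     64
        #define RECLAIM_DISTANCE 30

        /* Hypothetical per-node mask: bit b is set in reclaim_nodes[a] when
         * node b is within RECLAIM_DISTANCE of node a. */
        static unsigned long reclaim_nodes[MAX_NUMNODES];

        static void build_reclaim_nodes(int nr_nodes, int dist[][MAX_NUMNODES])
        {
                for (int a = 0; a < nr_nodes; a++)
                        for (int b = 0; b < nr_nodes; b++)
                                if (dist[a][b] <= RECLAIM_DISTANCE)
                                        reclaim_nodes[a] |= 1UL << b;
        }

        /* During the zone iteration: attempt zone reclaim only when the zone's
         * node is in the local node's mask; otherwise skip the zone. */
        static bool may_zone_reclaim(int local_node, int zone_node)
        {
                return reclaim_nodes[local_node] & (1UL << zone_node);
        }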

    [akpm@linux-foundation.org: fix CONFIG_NUMA=n build]
    Signed-off-by: David Rientjes
    Cc: Mel Gorman
    Cc: Minchan Kim
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • Compaction caches if a pageblock was scanned and no pages were isolated so
    that the pageblocks can be skipped in the future to reduce scanning. This
    information is not cleared by the page allocator based on activity due to
    the impact it would have on the page allocator fast paths. Hence there is
    a requirement that something clear the cache or pageblocks will be skipped
    forever. Currently the cache is cleared if there were a number of recent
    allocation failures and it has not been cleared within the last 5 seconds.
    Time-based decisions like this are terrible as they have no relationship
    to VM activity and are basically a big hammer.

    Unfortunately, accurate heuristics would add cost to some hot paths so
    this patch implements a rough heuristic. There are two cases where the
    cache is cleared.

    1. If a !kswapd process completes a compaction cycle (migrate and free
    scanner meet), the zone is marked compact_blockskip_flush. When kswapd
    goes to sleep, it will clear the cache. This is expected to be the
    common case where the cache is cleared. It does not really matter if
    kswapd happens to be asleep or going to sleep when the flag is set as
    it will be woken on the next allocation request.

    2. If there have been multiple failures recently and compaction just
    finished being deferred then a process will clear the cache and start a
    full scan. This situation happens if there are multiple high-order
    allocation requests under heavy memory pressure.

    The clearing of the PG_migrate_skip bits and other scans is inherently
    racy but the race is harmless. For allocations that can fail such as THP,
    they will simply fail. For requests that cannot fail, they will retry the
    allocation. Tests indicated that scanning rates were roughly similar to
    when the time-based heuristic was used and the allocation success rates
    were similar.
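    A rough standalone sketch of the two clearing triggers (only
    compact_blockskip_flush is a name taken from the text; everything else,
    including the failure threshold, is illustrative):

        #include <stdbool.h>

        struct zone_model {
                bool compact_blockskip_flush;   /* deferred flush requested */
                bool compaction_deferred;       /* simplified deferral state */
        };

        /* Case 1: a !kswapd compaction cycle completed (scanners met); ask
         * kswapd to flush the pageblock-skip cache when it goes to sleep. */
        static void compaction_cycle_complete(struct zone_model *z, bool is_kswapd)
        {
                if (!is_kswapd)
                        z->compact_blockskip_flush = true;
        }

        static void kswapd_going_to_sleep(struct zone_model *z)
        {
                if (z->compact_blockskip_flush) {
                        /* clear_pageblock_skip_bits(z); -- hypothetical helper */
                        z->compact_blockskip_flush = false;
                }
        }

        /* Case 2: repeated failures and deferral just expired: the allocating
         * process clears the cache itself and starts a full scan. */
        static bool clear_cache_and_rescan(const struct zone_model *z, int recent_failures)
        {
                return recent_failures >= 4 && !z->compaction_deferred;
        }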

    Signed-off-by: Mel Gorman
    Cc: Rik van Riel
    Cc: Richard Davies
    Cc: Shaohua Li
    Cc: Avi Kivity
    Cc: Rafael Aquini
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • This is almost entirely based on Rik's previous patches and discussions
    with him about how this might be implemented.

    Order > 0 compaction stops when enough free pages of the correct page
    order have been coalesced. When doing subsequent higher order
    allocations, it is possible for compaction to be invoked many times.

    However, the compaction code always starts out looking for things to
    compact at the start of the zone, and for free pages to compact things to
    at the end of the zone.

    This can cause quadratic behaviour, with isolate_freepages starting at the
    end of the zone each time, even though previous invocations of the
    compaction code already filled up all free memory on that end of the zone.
    This can cause isolate_freepages to take enormous amounts of CPU with
    certain workloads on larger memory systems.

    This patch caches where the migration and free scanner should start from
    on subsequent compaction invocations using the pageblock-skip information.
    When compaction starts it begins from the cached restart points and will
    update the cached restart points until a page is isolated or a pageblock
    is skipped that would have been scanned by synchronous compaction.
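    A minimal sketch of the cached restart points (names simplified; not the
    actual patch):

        struct compact_cache {
                unsigned long cached_migrate_pfn;  /* migrate scanner resumes here */
                unsigned long cached_free_pfn;     /* free scanner resumes here */
        };

        /* Begin a compaction pass from the cached positions rather than the
         * zone boundaries; 0 means "no cached position yet". */
        static void compaction_start(const struct compact_cache *c,
                                     unsigned long zone_start_pfn,
                                     unsigned long zone_end_pfn,
                                     unsigned long *migrate_pfn,
                                     unsigned long *free_pfn)
        {
                *migrate_pfn = c->cached_migrate_pfn ? c->cached_migrate_pfn
                                                     : zone_start_pfn;
                *free_pfn    = c->cached_free_pfn    ? c->cached_free_pfn
                                                     : zone_end_pfn;
        }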

    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel
    Cc: Richard Davies
    Cc: Shaohua Li
    Cc: Avi Kivity
    Acked-by: Rafael Aquini
    Cc: Fengguang Wu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • When compaction was implemented it was known that scanning could
    potentially be excessive. The ideal was that a counter be maintained for
    each pageblock but maintaining this information would incur a severe
    penalty due to a shared writable cache line. It has reached the point
    where the scanning costs are a serious problem, particularly on
    long-lived systems where a large process starts and allocates a large
    number of THPs at the same time.

    Instead of using a shared counter, this patch adds another bit to the
    pageblock flags called PG_migrate_skip. If a pageblock is scanned by
    either migrate or free scanner and 0 pages were isolated, the pageblock is
    marked to be skipped in the future. When scanning, this bit is checked
    before any scanning takes place and the block skipped if set.

    The main difficulty with a patch like this is "when to ignore the cached
    information?" If it's ignored too often, the scanning rates will still be
    excessive. If the information is too stale then allocations will fail
    that might have otherwise succeeded. In this patch

    o CMA always ignores the information
    o If the migrate and free scanner meet then the cached information will
    be discarded if it's at least 5 seconds since the last time the cache
    was discarded
    o If there are a large number of allocation failures, discard the cache.

    The time-based heuristic is very clumsy but there are few choices for a
    better event. Depending solely on multiple allocation failures still
    allows excessive scanning when THP allocations are failing in quick
    succession due to memory pressure. Waiting until memory pressure is
    relieved would cause compaction to continually fail instead of using
    reclaim/compaction to try to allocate the page. The time-based mechanism is
    clumsy but a better option is not obvious.
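    A standalone model of the skip bit itself (one bit per pageblock, set when
    a scan isolates nothing, tested before scanning again; helper names are
    illustrative):

        #include <stdbool.h>
        #include <string.h>

        #define NR_PAGEBLOCKS 1024

        static unsigned char pageblock_skip[NR_PAGEBLOCKS]; /* models PG_migrate_skip */

        static bool get_pageblock_skip(unsigned long pb) { return pageblock_skip[pb]; }
        static void set_pageblock_skip(unsigned long pb) { pageblock_skip[pb] = 1; }
        static void reset_pageblock_skip(void)           /* "discard the cache" */
        {
                memset(pageblock_skip, 0, sizeof(pageblock_skip));
        }

        static void scan_pageblock(unsigned long pb, bool ignore_skip /* e.g. CMA */)
        {
                if (!ignore_skip && get_pageblock_skip(pb))
                        return;                    /* nothing was found last time */

                int nr_isolated = 0;               /* placeholder for the real scan */
                if (nr_isolated == 0)
                        set_pageblock_skip(pb);    /* remember the empty scan */
        }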

    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel
    Cc: Richard Davies
    Cc: Shaohua Li
    Cc: Avi Kivity
    Acked-by: Rafael Aquini
    Cc: Fengguang Wu
    Cc: Michal Nazarewicz
    Cc: Bartlomiej Zolnierkiewicz
    Cc: Kyungmin Park
    Cc: Mark Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • This reverts commit 7db8889ab05b ("mm: have order > 0 compaction start
    off where it left") and commit de74f1cc ("mm: have order > 0 compaction
    start near a pageblock with free pages"). These patches were a good
    idea and tests confirmed that they massively reduced the amount of
    scanning but the implementation is complex and tricky to understand. A
    later patch will cache what pageblocks should be skipped and
    reimplement the concept of compact_cached_free_pfn on top of it for both
    the migration and free scanners.

    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel
    Cc: Richard Davies
    Cc: Shaohua Li
    Cc: Avi Kivity
    Acked-by: Rafael Aquini
    Acked-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Add NR_FREE_CMA_PAGES counter to be later used for checking watermark in
    __zone_watermark_ok(). For simplicity and to avoid #ifdef hell make this
    counter always available (not only when CONFIG_CMA=y).
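    A tiny sketch of the accounting (simplified; the real counter is a per-zone
    vmstat item):

        enum { MIGRATE_UNMOVABLE, MIGRATE_MOVABLE, MIGRATE_CMA };  /* simplified */

        static long nr_free_pages;
        static long nr_free_cma_pages;   /* models NR_FREE_CMA_PAGES, always present */

        static void account_free_pages(int migratetype, long nr)
        {
                nr_free_pages += nr;               /* nr may be negative on alloc */
                if (migratetype == MIGRATE_CMA)
                        nr_free_cma_pages += nr;
        }

        /* A later watermark check can then exclude CMA pages for requests that
         * cannot use them: usable = nr_free_pages - nr_free_cma_pages; */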

    [akpm@linux-foundation.org: use conventional migratetype naming]
    Signed-off-by: Bartlomiej Zolnierkiewicz
    Signed-off-by: Kyungmin Park
    Cc: Marek Szyprowski
    Cc: Michal Nazarewicz
    Cc: Minchan Kim
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Bartlomiej Zolnierkiewicz
     

01 Aug, 2012

7 commits

  • If swap is backed by network storage such as NBD, there is a risk that a
    large number of reclaimers can hang the system by consuming all
    PF_MEMALLOC reserves. To avoid these hangs, the administrator must tune
    min_free_kbytes in advance which is a bit fragile.

    This patch throttles direct reclaimers if half the PF_MEMALLOC reserves
    are in use. If the system is routinely getting throttled the system
    administrator can increase min_free_kbytes so degradation is smoother but
    the system will keep running.
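    The throttling condition reduces to a simple check; an illustrative
    standalone form (names assumed):

        #include <stdbool.h>

        /* Throttle direct reclaimers while free pages on the node are below
         * half of the summed min watermarks (the PF_MEMALLOC reserve). */
        static bool pfmemalloc_watermark_ok(long node_free_pages, long node_min_wmark_sum)
        {
                return node_free_pages > node_min_wmark_sum / 2;
        }

        /* Direct reclaimers would wait on a queue until this returns true, while
         * kswapd keeps reclaiming and wakes them once the reserves recover. */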

    Signed-off-by: Mel Gorman
    Cc: David Miller
    Cc: Neil Brown
    Cc: Peter Zijlstra
    Cc: Mike Christie
    Cc: Eric B Munson
    Cc: Eric Dumazet
    Cc: Sebastian Andrzej Siewior
    Cc: Mel Gorman
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • When hotplug offlining happens on zone A, it starts marking freed pages as
    MIGRATE_ISOLATE in the buddy allocator to prevent further allocation.
    (MIGRATE_ISOLATE is a rather ironic type because the pages are apparently
    in the buddy allocator, yet we can't allocate them.)

    When a memory shortage happens during hotplug offlining, the current task
    starts to reclaim and then wakes up kswapd. Kswapd checks the watermark and
    then goes to sleep because the current zone_watermark_ok_safe doesn't
    consider the MIGRATE_ISOLATE freed page count. The current task continues
    to reclaim in the direct reclaim path without kswapd's help. The problem is
    that zone->all_unreclaimable is set only by kswapd, so the current task
    would loop forever like below.

    __alloc_pages_slowpath
    restart:
        wake_all_kswapd
    rebalance:
        __alloc_pages_direct_reclaim
            do_try_to_free_pages
                if global_reclaim && !all_unreclaimable
                    return 1; /* It means we did did_some_progress */
        skip __alloc_pages_may_oom
        should_alloc_retry
            goto rebalance;

    If we apply KOSAKI's patch[1], which doesn't depend on kswapd for setting
    zone->all_unreclaimable, we can solve this problem by killing some task in
    the direct reclaim path. But it still doesn't wake up kswapd. That could
    still be a problem if another subsystem needs a GFP_ATOMIC request. So
    kswapd should consider MIGRATE_ISOLATE when it calculates free pages BEFORE
    going to sleep.

    This patch counts the number of MIGRATE_ISOLATE pageblocks and
    zone_watermark_ok_safe will take them into account if the system has such
    blocks (fortunately, they are very rare, so the overhead is negligible and
    kswapd is never a hot path).

    Copy/modify from Mel's quote
    "
    Ideal solution would be "allocating" the pageblock.
    It would keep the free space accounting as it is but historically,
    memory hotplug didn't allocate pages because it would be difficult to
    detect if a pageblock was isolated or if part of some balloon.
    Allocating just full pageblocks would work around this. However,
    it would play very badly with CMA.
    "

    [1] http://lkml.org/lkml/2012/6/14/74
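    The resulting check can be modelled as subtracting the free pages that sit
    in isolated pageblocks before comparing against the watermark (standalone
    sketch, simplified arguments):

        #include <stdbool.h>

        static bool zone_watermark_ok_model(long free_pages, long isolated_free_pages,
                                            long watermark, long lowmem_reserve)
        {
                /* MIGRATE_ISOLATE pages look free but cannot be allocated. */
                long usable = free_pages - isolated_free_pages;

                return usable > watermark + lowmem_reserve;
        }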

    [akpm@linux-foundation.org: simplify nr_zone_isolate_freepages(), rework zone_watermark_ok_safe() comment, simplify set_pageblock_isolate() and restore_pageblock_isolate()]
    [akpm@linux-foundation.org: fix CONFIG_MEMORY_ISOLATION=n build]
    Signed-off-by: Minchan Kim
    Suggested-by: KOSAKI Motohiro
    Tested-by: Aaditya Kumar
    Cc: KAMEZAWA Hiroyuki
    Cc: Mel Gorman
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • When hotadd_new_pgdat() is called to create new pgdat for a new node, a
    fallback zonelist should be created for the new node. There's code to try
    to achieve that in hotadd_new_pgdat() as below:

    /*
    * The node we allocated has no zone fallback lists. For avoiding
    * to access not-initialized zonelist, build here.
    */
    mutex_lock(&zonelists_mutex);
    build_all_zonelists(pgdat, NULL);
    mutex_unlock(&zonelists_mutex);

    But it doesn't work as expected. When hotadd_new_pgdat() is called, the
    new node is still in offline state because node_set_online(nid) hasn't
    been called yet. And build_all_zonelists() only builds zonelists for
    online nodes as:

    for_each_online_node(nid) {
        pg_data_t *pgdat = NODE_DATA(nid);

        build_zonelists(pgdat);
        build_zonelist_cache(pgdat);
    }

    Though we hope to create a zonelist for the new pgdat, it doesn't happen.
    So add a new parameter "pgdat" to build_all_zonelists() so that it also
    builds the zonelist for the new pgdat.
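    The interface change can be sketched as follows (standalone model, not the
    kernel function):

        struct pgdat_model { int nid; int zonelist_built; };

        /* Build zonelists for every online node as before, and additionally for
         * the still-offline pgdat that the caller passed in. */
        static void build_all_zonelists_model(struct pgdat_model *new_pgdat,
                                              struct pgdat_model *online, int nr_online)
        {
                for (int i = 0; i < nr_online; i++)
                        online[i].zonelist_built = 1;   /* for_each_online_node(...) */

                if (new_pgdat)
                        new_pgdat->zonelist_built = 1;  /* the new, not-yet-online node */
        }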

    Signed-off-by: Jiang Liu
    Signed-off-by: Xishi Qiu
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Rusty Russell
    Cc: Yinghai Lu
    Cc: Tony Luck
    Cc: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Cc: David Rientjes
    Cc: Keping Chen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiang Liu
     
  • 0ee332c14518699 ("memblock: Kill early_node_map[]") wanted to replace
    CONFIG_ARCH_POPULATES_NODE_MAP with CONFIG_HAVE_MEMBLOCK_NODE_MAP but
    ended up replacing one occurrence with a reference to the non-existent
    symbol CONFIG_HAVE_MEMBLOCK_NODE.

    The resulting omission of code would probably have been causing problems
    for 32-bit machines with memory hotplug.

    Signed-off-by: Rabin Vincent
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rabin Vincent
     
  • Order > 0 compaction stops when enough free pages of the correct page
    order have been coalesced. When doing subsequent higher order
    allocations, it is possible for compaction to be invoked many times.

    However, the compaction code always starts out looking for things to
    compact at the start of the zone, and for free pages to compact things to
    at the end of the zone.

    This can cause quadratic behaviour, with isolate_freepages starting at the
    end of the zone each time, even though previous invocations of the
    compaction code already filled up all free memory on that end of the zone.

    This can cause isolate_freepages to take enormous amounts of CPU with
    certain workloads on larger memory systems.

    The obvious solution is to have isolate_freepages remember where it left
    off last time, and continue at that point the next time it gets invoked
    for an order > 0 compaction. This could cause compaction to fail if
    cc->free_pfn and cc->migrate_pfn are close together initially; in that
    case we restart from the end of the zone and try once more.

    Forced full (order == -1) compactions are left alone.

    [akpm@linux-foundation.org: checkpatch fixes]
    [akpm@linux-foundation.org: s/laste/last/, use 80 cols]
    Signed-off-by: Rik van Riel
    Reported-by: Jim Schutt
    Tested-by: Jim Schutt
    Cc: Minchan Kim
    Reviewed-by: KAMEZAWA Hiroyuki
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     
  • Signed-off-by: Wanpeng Li
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wanpeng Li
     
  • Sanity:

    CONFIG_CGROUP_MEM_RES_CTLR -> CONFIG_MEMCG
    CONFIG_CGROUP_MEM_RES_CTLR_SWAP -> CONFIG_MEMCG_SWAP
    CONFIG_CGROUP_MEM_RES_CTLR_SWAP_ENABLED -> CONFIG_MEMCG_SWAP_ENABLED
    CONFIG_CGROUP_MEM_RES_CTLR_KMEM -> CONFIG_MEMCG_KMEM

    [mhocko@suse.cz: fix missed bits]
    Cc: Glauber Costa
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Cc: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Cc: Tejun Heo
    Cc: Aneesh Kumar K.V
    Cc: David Rientjes
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     

25 Jul, 2012

1 commit

  • Pull trivial tree from Jiri Kosina:
    "Trivial updates all over the place as usual."

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (29 commits)
    Fix typo in include/linux/clk.h .
    pci: hotplug: Fix typo in pci
    iommu: Fix typo in iommu
    video: Fix typo in drivers/video
    Documentation: Add newline at end-of-file to files lacking one
    arm,unicore32: Remove obsolete "select MISC_DEVICES"
    module.c: spelling s/postition/position/g
    cpufreq: Fix typo in cpufreq driver
    trivial: typo in comment in mksysmap
    mach-omap2: Fix typo in debug message and comment
    scsi: aha152x: Fix sparse warning and make printing pointer address more portable.
    Change email address for Steve Glendinning
    Btrfs: fix typo in convert_extent_bit
    via: Remove bogus if check
    netprio_cgroup.c: fix comment typo
    backlight: fix memory leak on obscure error path
    Documentation: asus-laptop.txt references an obsolete Kconfig item
    Documentation: ManagementStyle: fixed typo
    mm/vmscan: cleanup comment error in balance_pgdat
    mm: cleanup on the comments of zone_reclaim_stat
    ...

    Linus Torvalds
     

12 Jul, 2012

1 commit

  • kswapd_stop() is called to destroy the kswapd work thread when all memory
    of a NUMA node has been offlined. But kswapd_stop() only terminates the
    work thread without resetting NODE_DATA(nid)->kswapd to NULL. The stale
    pointer will prevent kswapd_run() from creating a new work thread when
    adding memory to the memory-less NUMA node again. Eventually the stale
    pointer may cause invalid memory access.
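    The fix essentially clears the pointer when the thread is stopped; a
    minimal kernel-context sketch (not necessarily the exact upstream diff):

        #include <linux/kthread.h>
        #include <linux/mmzone.h>
        #include <linux/sched.h>

        void kswapd_stop(int nid)
        {
                struct task_struct *kswapd = NODE_DATA(nid)->kswapd;

                if (kswapd) {
                        kthread_stop(kswapd);
                        NODE_DATA(nid)->kswapd = NULL;  /* don't leave a stale pointer */
                }
        }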

    An example stack dump is below. It was reproduced with 2.6.32, but the
    latest kernel has the same issue.

    BUG: unable to handle kernel NULL pointer dereference at (null)
    IP: [] exit_creds+0x12/0x78
    PGD 0
    Oops: 0000 [#1] SMP
    last sysfs file: /sys/devices/system/memory/memory391/state
    CPU 11
    Modules linked in: cpufreq_conservative cpufreq_userspace cpufreq_powersave acpi_cpufreq microcode fuse loop dm_mod tpm_tis rtc_cmos i2c_i801 rtc_core tpm serio_raw pcspkr sg tpm_bios igb i2c_core iTCO_wdt rtc_lib mptctl iTCO_vendor_support button dca bnx2 usbhid hid uhci_hcd ehci_hcd usbcore sd_mod crc_t10dif edd ext3 mbcache jbd fan ide_pci_generic ide_core ata_generic ata_piix libata thermal processor thermal_sys hwmon mptsas mptscsih mptbase scsi_transport_sas scsi_mod
    Pid: 7949, comm: sh Not tainted 2.6.32.12-qiuxishi-5-default #92 Tecal RH2285
    RIP: 0010:exit_creds+0x12/0x78
    RSP: 0018:ffff8806044f1d78 EFLAGS: 00010202
    RAX: 0000000000000000 RBX: ffff880604f22140 RCX: 0000000000019502
    RDX: 0000000000000000 RSI: 0000000000000202 RDI: 0000000000000000
    RBP: ffff880604f22150 R08: 0000000000000000 R09: ffffffff81a4dc10
    R10: 00000000000032a0 R11: ffff880006202500 R12: 0000000000000000
    R13: 0000000000c40000 R14: 0000000000008000 R15: 0000000000000001
    FS: 00007fbc03d066f0(0000) GS:ffff8800282e0000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
    CR2: 0000000000000000 CR3: 000000060f029000 CR4: 00000000000006e0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
    Process sh (pid: 7949, threadinfo ffff8806044f0000, task ffff880603d7c600)
    Stack:
    ffff880604f22140 ffffffff8103aac5 ffff880604f22140 ffffffff8104d21e
    ffff880006202500 0000000000008000 0000000000c38000 ffffffff810bd5b1
    0000000000000000 ffff880603d7c600 00000000ffffdd29 0000000000000003
    Call Trace:
    __put_task_struct+0x5d/0x97
    kthread_stop+0x50/0x58
    offline_pages+0x324/0x3da
    memory_block_change_state+0x179/0x1db
    store_mem_state+0x9e/0xbb
    sysfs_write_file+0xd0/0x107
    vfs_write+0xad/0x169
    sys_write+0x45/0x6e
    system_call_fastpath+0x16/0x1b
    Code: ff 4d 00 0f 94 c0 84 c0 74 08 48 89 ef e8 1f fd ff ff 5b 5d 31 c0 41 5c c3 53 48 8b 87 20 06 00 00 48 89 fb 48 8b bf 18 06 00 00 00 48 c7 83 18 06 00 00 00 00 00 00 f0 ff 0f 0f 94 c0 84 c0
    RIP exit_creds+0x12/0x78
    RSP
    CR2: 0000000000000000

    [akpm@linux-foundation.org: add pglist_data.kswapd locking comments]
    Signed-off-by: Xishi Qiu
    Signed-off-by: Jiang Liu
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: KOSAKI Motohiro
    Acked-by: Mel Gorman
    Acked-by: David Rientjes
    Reviewed-by: Minchan Kim
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiang Liu
     

29 Jun, 2012

1 commit


28 Jun, 2012

1 commit


30 May, 2012

3 commits

  • This is the first stage of struct mem_cgroup_zone removal. Further
    patches replace struct mem_cgroup_zone with a pointer to struct lruvec.

    If CONFIG_CGROUP_MEM_RES_CTLR=n lruvec_zone() is just container_of().
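    For the memcg-disabled case this is literally pointer arithmetic back to
    the enclosing zone; a standalone sketch consistent with that statement
    (types simplified, memcg branch omitted):

        #include <stddef.h>

        struct lruvec { int lists; };               /* stand-in for the LRU lists */
        struct zone_model { long flags; struct lruvec lruvec; };

        static inline struct zone_model *lruvec_zone(struct lruvec *lruvec)
        {
                return (struct zone_model *)((char *)lruvec -
                                             offsetof(struct zone_model, lruvec));
        }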

    Signed-off-by: Konstantin Khlebnikov
    Cc: Mel Gorman
    Cc: KAMEZAWA Hiroyuki
    Acked-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     
  • With mem_cgroup_disabled() now explicit, it becomes clear that the
    zone_reclaim_stat structure actually belongs in lruvec, per-zone when
    memcg is disabled but per-memcg per-zone when it's enabled.

    We can delete mem_cgroup_get_reclaim_stat(), and change
    update_page_reclaim_stat() to update just the one set of stats, the one
    which get_scan_count() will actually use.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Konstantin Khlebnikov
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Michal Hocko
    Reviewed-by: Minchan Kim
    Reviewed-by: Michal Hocko
    Cc: Glauber Costa
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • After patch "mm: forbid lumpy-reclaim in shrink_active_list()" we can
    completely remove anon/file and active/inactive lru type filters from
    __isolate_lru_page(), because isolation for 0-order reclaim always
    isolates pages from right lru list. And pages-isolation for lumpy
    shrink_inactive_list() or memory-compaction anyway allowed to isolate
    pages from all evictable lru lists.

    Signed-off-by: Konstantin Khlebnikov
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Acked-by: Michal Hocko
    Cc: Glauber Costa
    Cc: Johannes Weiner
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     

26 May, 2012

1 commit

  • Pull CMA and ARM DMA-mapping updates from Marek Szyprowski:
    "These patches contain two major updates for DMA mapping subsystem
    (mainly for ARM architecture). First one is Contiguous Memory
    Allocator (CMA) which makes it possible for device drivers to allocate
    big contiguous chunks of memory after the system has booted.

    The main difference from similar frameworks is the fact that CMA allows
    the memory region reserved for the big chunk allocation to be
    transparently reused as system memory, so no memory is wasted when no
    big chunk is allocated. Once the alloc request is issued, the framework
    migrates system pages to create space for the required big chunk of
    physically contiguous memory.

    For more information one can refer to nice LWN articles:

    - 'A reworked contiguous memory allocator':
    http://lwn.net/Articles/447405/

    - 'CMA and ARM':
    http://lwn.net/Articles/450286/

    - 'A deep dive into CMA':
    http://lwn.net/Articles/486301/

    - and the following thread with the patches and links to all previous
    versions:
    https://lkml.org/lkml/2012/4/3/204

    The main client for this new framework is ARM DMA-mapping subsystem.

    The second part provides a complete redesign in ARM DMA-mapping
    subsystem. The core implementation has been changed to use common
    struct dma_map_ops based infrastructure with the recent updates for
    new dma attributes merged in v3.4-rc2. This allows using more than one
    implementation of dma-mapping calls and changing/selecting them on a
    struct device basis. The first client of this new infrastructure is the
    dmabounce implementation, which has been completely cut out of the
    core, common code.

    The last patch of this redesign update introduces a new, experimental
    implementation of dma-mapping calls on top of generic IOMMU framework.
    This lets ARM sub-platforms transparently use an IOMMU for DMA-mapping
    calls if the required IOMMU hardware is provided.

    For more information please refer to the following thread:
    http://www.spinics.net/lists/arm-kernel/msg175729.html

    The last patch merges changes from both updates and provides a
    resolution for the conflicts which cannot be avoided when patches have
    been applied on the same files (mainly arch/arm/mm/dma-mapping.c)."

    Acked by Andrew Morton :
    "Yup, this one please. It's had much work, plenty of review and I
    think even Russell is happy with it."

    * 'for-linus' of git://git.linaro.org/people/mszyprowski/linux-dma-mapping: (28 commits)
    ARM: dma-mapping: use PMD size for section unmap
    cma: fix migration mode
    ARM: integrate CMA with DMA-mapping subsystem
    X86: integrate CMA with DMA-mapping subsystem
    drivers: add Contiguous Memory Allocator
    mm: trigger page reclaim in alloc_contig_range() to stabilise watermarks
    mm: extract reclaim code from __alloc_pages_direct_reclaim()
    mm: Serialize access to min_free_kbytes
    mm: page_isolation: MIGRATE_CMA isolation functions added
    mm: mmzone: MIGRATE_CMA migration type added
    mm: page_alloc: change fallbacks array handling
    mm: page_alloc: introduce alloc_contig_range()
    mm: compaction: export some of the functions
    mm: compaction: introduce isolate_freepages_range()
    mm: compaction: introduce map_pages()
    mm: compaction: introduce isolate_migratepages_range()
    mm: page_alloc: remove trailing whitespace
    ARM: dma-mapping: add support for IOMMU mapper
    ARM: dma-mapping: use alloc, mmap, free from dma_ops
    ARM: dma-mapping: remove redundant code and do the cleanup
    ...

    Conflicts:
    arch/x86/include/asm/dma-mapping.h

    Linus Torvalds
     

21 May, 2012

2 commits

  • alloc_contig_range() performs memory allocation, so it should also keep the
    memory watermarks at their correct levels. This commit adds a call to
    *_slowpath-style reclaim to grab enough pages to make sure that the final
    collection of contiguous pages from the freelists will not starve the
    system.
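    The stabilisation loop can be sketched as follows (standalone model; the
    real code reclaims via the page allocator's slowpath helpers):

        #include <stdbool.h>

        static bool watermark_ok(long free_pages, long wmark)
        {
                return free_pages > wmark;
        }

        /* Reclaim until the watermark would still hold after removing the
         * contiguous range from the free lists, so the rest of the system is
         * not starved of memory. */
        static int stabilise_watermark(long *free_pages, long wmark, long range_pages,
                                       long (*reclaim_some)(void))
        {
                while (!watermark_ok(*free_pages - range_pages, wmark)) {
                        long reclaimed = reclaim_some();
                        if (reclaimed == 0)
                                return -1;       /* nothing reclaimable: give up */
                        *free_pages += reclaimed;
                }
                return 0;
        }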

    Signed-off-by: Marek Szyprowski
    Signed-off-by: Kyungmin Park
    CC: Michal Nazarewicz
    Tested-by: Rob Clark
    Tested-by: Ohad Ben-Cohen
    Tested-by: Benjamin Gaignard
    Tested-by: Robert Nelson
    Tested-by: Barry Song

    Marek Szyprowski
     
  • The MIGRATE_CMA migration type has two main characteristics:
    (i) only movable pages can be allocated from MIGRATE_CMA
    pageblocks and (ii) page allocator will never change migration
    type of MIGRATE_CMA pageblocks.

    This guarantees (to some degree) that a page in a MIGRATE_CMA pageblock
    can always be migrated somewhere else (unless there's no memory left in
    the system).

    It is designed to be used for allocating big chunks (eg. 10MiB)
    of physically contiguous memory. Once driver requests
    contiguous memory, pages from MIGRATE_CMA pageblocks may be
    migrated away to create a contiguous block.

    To minimise the number of migrations, the MIGRATE_CMA migration type is
    the last type tried when the page allocator falls back to other
    migration types.
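    The two characteristics can be modelled directly (illustrative standalone
    sketch; the real logic lives in the page allocator's fallback handling):

        #include <stdbool.h>

        enum migrate_type { MIGRATE_UNMOVABLE, MIGRATE_RECLAIMABLE,
                            MIGRATE_MOVABLE, MIGRATE_CMA };

        /* (i) only movable allocations may be served from MIGRATE_CMA blocks */
        static bool can_use_cma_block(enum migrate_type requested)
        {
                return requested == MIGRATE_MOVABLE;
        }

        /* (ii) a MIGRATE_CMA pageblock never has its migratetype changed */
        static enum migrate_type new_block_type(enum migrate_type cur,
                                                enum migrate_type wanted)
        {
                return cur == MIGRATE_CMA ? MIGRATE_CMA : wanted;
        }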

    Signed-off-by: Michal Nazarewicz
    Signed-off-by: Marek Szyprowski
    Signed-off-by: Kyungmin Park
    Acked-by: Mel Gorman
    Reviewed-by: KAMEZAWA Hiroyuki
    Tested-by: Rob Clark
    Tested-by: Ohad Ben-Cohen
    Tested-by: Benjamin Gaignard
    Tested-by: Robert Nelson
    Tested-by: Barry Song

    Michal Nazarewicz
     

15 Apr, 2012

1 commit


22 Mar, 2012

1 commit

  • Currently a failed order-9 (transparent hugepage) compaction can lead to
    memory compaction being temporarily disabled for a memory zone, even if we
    only need compaction for an order-2 allocation, e.g. for jumbo frame
    networking.

    The fix is relatively straightforward: keep track of the highest order at
    which compaction is succeeding, and only defer compaction for orders at
    which compaction is failing.
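    A sketch of the per-order deferral (field and helper names are assumptions;
    the text only describes the behaviour, and the field is assumed to start
    above the maximum order):

        #include <stdbool.h>

        struct zone_model { int compact_order_failed; };

        /* Defer only requests at or above the lowest recently-failing order. */
        static bool compaction_deferred(const struct zone_model *z, int order)
        {
                return order >= z->compact_order_failed;
        }

        static void defer_compaction(struct zone_model *z, int order)
        {
                if (order < z->compact_order_failed)
                        z->compact_order_failed = order;
        }

        static void compaction_succeeded(struct zone_model *z, int order)
        {
                if (order >= z->compact_order_failed)
                        z->compact_order_failed = order + 1;
        }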

    Signed-off-by: Rik van Riel
    Cc: Andrea Arcangeli
    Acked-by: Mel Gorman
    Cc: Johannes Weiner
    Cc: Minchan Kim
    Cc: KOSAKI Motohiro
    Cc: Hillf Danton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     

13 Jan, 2012

3 commits

  • Mostly we use "enum lru_list lru": change those few "l"s to "lru"s.

    Signed-off-by: Hugh Dickins
    Reviewed-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Commit 39deaf85 ("mm: compaction: make isolate_lru_page() filter-aware")
    noted that compaction does not migrate dirty or writeback pages and that
    it was meaningless to pick such a page and re-add it to the LRU list. This
    had to be partially reverted because some dirty pages can be migrated by
    compaction without blocking.

    This patch updates "mm: compaction: make isolate_lru_page" by skipping
    over pages that migration has no possibility of migrating to minimise LRU
    disruption.

    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Andrea Arcangeli
    Reviewed-by: Minchan Kim
    Cc: Dave Jones
    Cc: Jan Kara
    Cc: Andy Isaacson
    Cc: Nai Xia
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Having a unified structure with an LRU list set for both global zones and
    per-memcg zones allows keeping the code that deals with LRU lists, and
    does not care about the container itself, simple.

    Once the per-memcg LRU lists directly link struct pages, the isolation
    function and all other list manipulations are shared between the memcg
    case and the global LRU case.

    Signed-off-by: Johannes Weiner
    Reviewed-by: KAMEZAWA Hiroyuki
    Reviewed-by: Michal Hocko
    Reviewed-by: Kirill A. Shutemov
    Cc: Daisuke Nishimura
    Cc: Balbir Singh
    Cc: Ying Han
    Cc: Greg Thelen
    Cc: Michel Lespinasse
    Cc: Rik van Riel
    Cc: Minchan Kim
    Cc: Christoph Hellwig
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

11 Jan, 2012

1 commit

  • Per-zone dirty limits try to distribute page cache pages allocated for
    writing across zones in proportion to the individual zone sizes, to reduce
    the likelihood of reclaim having to write back individual pages from the
    LRU lists in order to make progress.

    This patch:

    The amount of dirtyable pages should not include the full number of free
    pages: there is a number of reserved pages that the page allocator and
    kswapd always try to keep free.

    The closer (reclaimable pages - dirty pages) is to the number of reserved
    pages, the more likely it becomes for reclaim to run into dirty pages:

    +----------+ ---
    |   anon   |  |
    +----------+  |
    |          |  |
    |          |  -- dirty limit new    -- flusher new
    |   file   |  |                      |
    |          |  |                      |
    |          |  -- dirty limit old    -- flusher old
    |          |  |
    +----------+ --- reclaim
    | reserved |
    +----------+
    |  kernel  |
    +----------+

    This patch introduces a per-zone dirty reserve that takes both the lowmem
    reserve as well as the high watermark of the zone into account, and a
    global sum of those per-zone values that is subtracted from the global
    amount of dirtyable pages. The lowmem reserve is unavailable to page
    cache allocations and kswapd tries to keep the high watermark free. We
    don't want to end up in a situation where reclaim has to clean pages in
    order to balance zones.

    Not treating reserved pages as dirtyable on a global level is only a
    conceptual fix. In reality, dirty pages are not distributed equally
    across zones and reclaim runs into dirty pages on a regular basis.

    But it is important to get this right before tackling the problem on a
    per-zone level, where the distance between reclaim and the dirty pages is
    mostly much smaller in absolute numbers.
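    The reserve and the resulting dirtyable total can be sketched as follows
    (standalone model; names follow the description, not necessarily the code):

        struct zone_model {
                long high_wmark;          /* kswapd keeps this much free */
                long lowmem_reserve;      /* unavailable to page cache allocations */
                long free_pages;
                long reclaimable_pages;
        };

        static long dirty_balance_reserve(const struct zone_model *zones, int nr)
        {
                long reserve = 0;
                for (int i = 0; i < nr; i++)
                        reserve += zones[i].lowmem_reserve + zones[i].high_wmark;
                return reserve;
        }

        static long global_dirtyable_memory(const struct zone_model *zones, int nr)
        {
                long pages = 0;
                for (int i = 0; i < nr; i++)
                        pages += zones[i].free_pages + zones[i].reclaimable_pages;
                return pages - dirty_balance_reserve(zones, nr);
        }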

    [akpm@linux-foundation.org: fix highmem build]
    Signed-off-by: Johannes Weiner
    Reviewed-by: Rik van Riel
    Reviewed-by: Michal Hocko
    Reviewed-by: Minchan Kim
    Acked-by: Mel Gorman
    Cc: KAMEZAWA Hiroyuki
    Cc: Christoph Hellwig
    Cc: Wu Fengguang
    Cc: Dave Chinner
    Cc: Jan Kara
    Cc: Shaohua Li
    Cc: Chris Mason
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

09 Dec, 2011

1 commit

  • Now all ARCH_POPULATES_NODE_MAP archs select HAVE_MEMBLOCK_NODE_MAP -
    there's no user of early_node_map[] left. Kill early_node_map[] and
    replace ARCH_POPULATES_NODE_MAP with HAVE_MEMBLOCK_NODE_MAP. Also,
    relocate for_each_mem_pfn_range() and helper from mm.h to memblock.h
    as page_alloc.c would no longer host an alternative implementation.

    This change is ultimately one to one mapping and shouldn't cause any
    observable difference; however, after the recent changes, there are
    some functions which now would fit memblock.c better than page_alloc.c
    and dependency on HAVE_MEMBLOCK_NODE_MAP instead of HAVE_MEMBLOCK
    doesn't make much sense on some of them. Further cleanups for
    functions inside HAVE_MEMBLOCK_NODE_MAP in mm.h would be nice.

    -v2: Fix compile bug introduced by mis-spelling
    CONFIG_HAVE_MEMBLOCK_NODE_MAP to CONFIG_MEMBLOCK_HAVE_NODE_MAP in
    mmzone.h. Reported by Stephen Rothwell.

    Signed-off-by: Tejun Heo
    Cc: Stephen Rothwell
    Cc: Benjamin Herrenschmidt
    Cc: Yinghai Lu
    Cc: Tony Luck
    Cc: Ralf Baechle
    Cc: Martin Schwidefsky
    Cc: Chen Liqin
    Cc: Paul Mundt
    Cc: "David S. Miller"
    Cc: "H. Peter Anvin"

    Tejun Heo
     

01 Nov, 2011

5 commits

  • When direct reclaim encounters a dirty page, it gets recycled around the
    LRU for another cycle. This patch marks the page PageReclaim similar to
    deactivate_page() so that the page gets reclaimed almost immediately after
    the page gets cleaned. This is to avoid reclaiming clean pages that are
    younger than a dirty page encountered at the end of the LRU that might
    have been something like a use-once page.

    Signed-off-by: Mel Gorman
    Acked-by: Johannes Weiner
    Cc: Dave Chinner
    Cc: Christoph Hellwig
    Cc: Wu Fengguang
    Cc: Jan Kara
    Cc: Minchan Kim
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Alex Elder
    Cc: Theodore Ts'o
    Cc: Chris Mason
    Cc: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Testing from the XFS folk revealed that there is still too much I/O from
    the end of the LRU in kswapd. Previously it was considered acceptable by
    VM people for a small number of pages to be written back from reclaim with
    testing generally showing about 0.3% of pages reclaimed were written back
    (higher if memory was low). That writing back a small number of pages is
    ok has been heavily disputed for quite some time and Dave Chinner
    explained it well;

    It doesn't have to be a very high number to be a problem. IO
    is orders of magnitude slower than the CPU time it takes to
    flush a page, so the cost of making a bad flush decision is
    very high. And single page writeback from the LRU is almost
    always a bad flush decision.

    To complicate matters, filesystems respond very differently to requests
    from reclaim according to Christoph Hellwig;

    xfs tries to write it back if the requester is kswapd
    ext4 ignores the request if it's a delayed allocation
    btrfs ignores the request

    As a result, each filesystem has different performance characteristics
    when under memory pressure and there are many pages being dirtied. In
    some cases, the request is ignored entirely so the VM cannot depend on the
    IO being dispatched.

    The objective of this series is to reduce writing of filesystem-backed
    pages from reclaim, play nicely with writeback that is already in progress
    and throttle reclaim appropriately when writeback pages are encountered.
    The assumption is that the flushers will always write pages faster than if
    reclaim issues the IO.

    A secondary goal is to avoid the problem whereby direct reclaim splices
    two potentially deep call stacks together.

    There is a potential new problem as reclaim has less control over how long
    before a page in a particular zone or container is cleaned and direct
    reclaimers depend on kswapd or flusher threads to do the necessary work.
    However, as filesystems sometimes ignore direct reclaim requests already,
    it is not expected to be a serious issue.

    Patch 1 disables writeback of filesystem pages from direct reclaim
    entirely. Anonymous pages are still written.

    Patch 2 removes dead code in lumpy reclaim as it is no longer able
    to synchronously write pages. This hurts lumpy reclaim but
    there is an expectation that compaction is used for hugepage
    allocations these days and lumpy reclaim's days are numbered.

    Patches 3-4 add warnings to XFS and ext4 if called from
    direct reclaim. With patch 1, this "never happens" and is
    intended to catch regressions in this logic in the future.

    Patch 5 disables writeback of filesystem pages from kswapd unless
    the priority is raised to the point where kswapd is considered
    to be in trouble.

    Patch 6 throttles reclaimers if too many dirty pages are being
    encountered and the zones or backing devices are congested.

    Patch 7 invalidates dirty pages found at the end of the LRU so they
    are reclaimed quickly after being written back rather than
    waiting for a reclaimer to find them

    I consider this series to be orthogonal to the writeback work but it is
    worth noting that the writeback work affects the viability of patch 8 in
    particular.

    I tested this on ext4 and xfs using fs_mark, a simple writeback test based
    on dd and a micro benchmark that does a streaming write to a large mapping
    (exercises use-once LRU logic) followed by streaming writes to a mix of
    anonymous and file-backed mappings. The command line for fs_mark when
    booted with 512M looked something like

    ./fs_mark -d /tmp/fsmark-2676 -D 100 -N 150 -n 150 -L 25 -t 1 -S0 -s 10485760

    The number of files was adjusted depending on the amount of available
    memory so that the files created was about 3xRAM. For multiple threads,
    the -d switch is specified multiple times.

    The test machine is x86-64 with an older generation of AMD processor with
    4 cores. The underlying storage was 4 disks configured as RAID-0 as this
    was the best configuration of storage I had available. Swap is on a
    separate disk. Dirty ratio was tuned to 40% instead of the default of
    20%.

    Testing was run with and without monitors to both verify that the patches
    were operating as expected and that any performance gain was real and not
    due to interference from monitors.

    Here is a summary of results based on testing XFS.

    512M1P-xfs Files/s mean 32.69 ( 0.00%) 34.44 ( 5.08%)
    512M1P-xfs Elapsed Time fsmark 51.41 48.29
    512M1P-xfs Elapsed Time simple-wb 114.09 108.61
    512M1P-xfs Elapsed Time mmap-strm 113.46 109.34
    512M1P-xfs Kswapd efficiency fsmark 62% 63%
    512M1P-xfs Kswapd efficiency simple-wb 56% 61%
    512M1P-xfs Kswapd efficiency mmap-strm 44% 42%
    512M-xfs Files/s mean 30.78 ( 0.00%) 35.94 (14.36%)
    512M-xfs Elapsed Time fsmark 56.08 48.90
    512M-xfs Elapsed Time simple-wb 112.22 98.13
    512M-xfs Elapsed Time mmap-strm 219.15 196.67
    512M-xfs Kswapd efficiency fsmark 54% 56%
    512M-xfs Kswapd efficiency simple-wb 54% 55%
    512M-xfs Kswapd efficiency mmap-strm 45% 44%
    512M-4X-xfs Files/s mean 30.31 ( 0.00%) 33.33 ( 9.06%)
    512M-4X-xfs Elapsed Time fsmark 63.26 55.88
    512M-4X-xfs Elapsed Time simple-wb 100.90 90.25
    512M-4X-xfs Elapsed Time mmap-strm 261.73 255.38
    512M-4X-xfs Kswapd efficiency fsmark 49% 50%
    512M-4X-xfs Kswapd efficiency simple-wb 54% 56%
    512M-4X-xfs Kswapd efficiency mmap-strm 37% 36%
    512M-16X-xfs Files/s mean 60.89 ( 0.00%) 65.22 ( 6.64%)
    512M-16X-xfs Elapsed Time fsmark 67.47 58.25
    512M-16X-xfs Elapsed Time simple-wb 103.22 90.89
    512M-16X-xfs Elapsed Time mmap-strm 237.09 198.82
    512M-16X-xfs Kswapd efficiency fsmark 45% 46%
    512M-16X-xfs Kswapd efficiency simple-wb 53% 55%
    512M-16X-xfs Kswapd efficiency mmap-strm 33% 33%

    Up until 512-4X, the FSmark improvements were statistically significant.
    For the 4X and 16X tests the results were within standard deviations but
    just barely. The time to completion for all tests is improved which is an
    important result. In general, kswapd efficiency is not affected by
    skipping dirty pages.

    1024M1P-xfs Files/s mean 39.09 ( 0.00%) 41.15 ( 5.01%)
    1024M1P-xfs Elapsed Time fsmark 84.14 80.41
    1024M1P-xfs Elapsed Time simple-wb 210.77 184.78
    1024M1P-xfs Elapsed Time mmap-strm 162.00 160.34
    1024M1P-xfs Kswapd efficiency fsmark 69% 75%
    1024M1P-xfs Kswapd efficiency simple-wb 71% 77%
    1024M1P-xfs Kswapd efficiency mmap-strm 43% 44%
    1024M-xfs Files/s mean 35.45 ( 0.00%) 37.00 ( 4.19%)
    1024M-xfs Elapsed Time fsmark 94.59 91.00
    1024M-xfs Elapsed Time simple-wb 229.84 195.08
    1024M-xfs Elapsed Time mmap-strm 405.38 440.29
    1024M-xfs Kswapd efficiency fsmark 79% 71%
    1024M-xfs Kswapd efficiency simple-wb 74% 74%
    1024M-xfs Kswapd efficiency mmap-strm 39% 42%
    1024M-4X-xfs Files/s mean 32.63 ( 0.00%) 35.05 ( 6.90%)
    1024M-4X-xfs Elapsed Time fsmark 103.33 97.74
    1024M-4X-xfs Elapsed Time simple-wb 204.48 178.57
    1024M-4X-xfs Elapsed Time mmap-strm 528.38 511.88
    1024M-4X-xfs Kswapd efficiency fsmark 81% 70%
    1024M-4X-xfs Kswapd efficiency simple-wb 73% 72%
    1024M-4X-xfs Kswapd efficiency mmap-strm 39% 38%
    1024M-16X-xfs Files/s mean 42.65 ( 0.00%) 42.97 ( 0.74%)
    1024M-16X-xfs Elapsed Time fsmark 103.11 99.11
    1024M-16X-xfs Elapsed Time simple-wb 200.83 178.24
    1024M-16X-xfs Elapsed Time mmap-strm 397.35 459.82
    1024M-16X-xfs Kswapd efficiency fsmark 84% 69%
    1024M-16X-xfs Kswapd efficiency simple-wb 74% 73%
    1024M-16X-xfs Kswapd efficiency mmap-strm 39% 40%

    All FSMark tests up to 16X had statistically significant improvements.
    For the most part, tests are completing faster with the exception of the
    streaming writes to a mixture of anonymous and file-backed mappings which
    were slower in two cases

    In the cases where the mmap-strm tests were slower, there was more
    swapping due to dirty pages being skipped. The number of additional pages
    swapped is almost identical to the fewer number of pages written from
    reclaim. In other words, roughly the same number of pages were reclaimed
    but swapping was slower. As the test is a bit unrealistic and stresses
    memory heavily, the small shift is acceptable.

    4608M1P-xfs Files/s mean 29.75 ( 0.00%) 30.96 ( 3.91%)
    4608M1P-xfs Elapsed Time fsmark 512.01 492.15
    4608M1P-xfs Elapsed Time simple-wb 618.18 566.24
    4608M1P-xfs Elapsed Time mmap-strm 488.05 465.07
    4608M1P-xfs Kswapd efficiency fsmark 93% 86%
    4608M1P-xfs Kswapd efficiency simple-wb 88% 84%
    4608M1P-xfs Kswapd efficiency mmap-strm 46% 45%
    4608M-xfs Files/s mean 27.60 ( 0.00%) 28.85 ( 4.33%)
    4608M-xfs Elapsed Time fsmark 555.96 532.34
    4608M-xfs Elapsed Time simple-wb 659.72 571.85
    4608M-xfs Elapsed Time mmap-strm 1082.57 1146.38
    4608M-xfs Kswapd efficiency fsmark 89% 91%
    4608M-xfs Kswapd efficiency simple-wb 88% 82%
    4608M-xfs Kswapd efficiency mmap-strm 48% 46%
    4608M-4X-xfs Files/s mean 26.00 ( 0.00%) 27.47 ( 5.35%)
    4608M-4X-xfs Elapsed Time fsmark 592.91 564.00
    4608M-4X-xfs Elapsed Time simple-wb 616.65 575.07
    4608M-4X-xfs Elapsed Time mmap-strm 1773.02 1631.53
    4608M-4X-xfs Kswapd efficiency fsmark 90% 94%
    4608M-4X-xfs Kswapd efficiency simple-wb 87% 82%
    4608M-4X-xfs Kswapd efficiency mmap-strm 43% 43%
    4608M-16X-xfs Files/s mean 26.07 ( 0.00%) 26.42 ( 1.32%)
    4608M-16X-xfs Elapsed Time fsmark 602.69 585.78
    4608M-16X-xfs Elapsed Time simple-wb 606.60 573.81
    4608M-16X-xfs Elapsed Time mmap-strm 1549.75 1441.86
    4608M-16X-xfs Kswapd efficiency fsmark 98% 98%
    4608M-16X-xfs Kswapd efficiency simple-wb 88% 82%
    4608M-16X-xfs Kswapd efficiency mmap-strm 44% 42%

    Unlike the other tests, the fsmark results are not statistically
    significant but the min and max times are both improved and for the most
    part, tests completed faster.

    There are other indications that this is an improvement as well. For
    example, in the vast majority of cases, there were fewer pages scanned by
    direct reclaim implying in many cases that stalls due to direct reclaim
    are reduced. KSwapd is scanning more due to skipping dirty pages which is
    unfortunate but the CPU usage is still acceptable

    In an earlier set of tests, I used blktrace and in almost all cases
    throughput throughout the entire test was higher. However, I ended up
    discarding those results as recording blktrace data was too heavy for my
    liking.

    On a laptop, I plugged in a USB stick and ran a similar set of tests
    using it as backing storage. A desktop environment was running and for
    the entire duration of the tests, firefox and gnome terminal were
    launching and exiting to vaguely simulate a user.

    1024M-xfs Files/s mean 0.41 ( 0.00%) 0.44 ( 6.82%)
    1024M-xfs Elapsed Time fsmark 2053.52 1641.03
    1024M-xfs Elapsed Time simple-wb 1229.53 768.05
    1024M-xfs Elapsed Time mmap-strm 4126.44 4597.03
    1024M-xfs Kswapd efficiency fsmark 84% 85%
    1024M-xfs Kswapd efficiency simple-wb 92% 81%
    1024M-xfs Kswapd efficiency mmap-strm 60% 51%
    1024M-xfs Avg wait ms fsmark 5404.53 4473.87
    1024M-xfs Avg wait ms simple-wb 2541.35 1453.54
    1024M-xfs Avg wait ms mmap-strm 3400.25 3852.53

    The mmap-strm results were hurt because firefox launching had a tendency
    to push the test out of memory. On the positive side, firefox launched
    marginally faster with the patches applied. Time to completion for many
    tests was faster but more importantly - the "Avg wait" time as measured by
    iostat was far lower implying the system would be more responsive. It was
    also the case that "Avg wait ms" on the root filesystem was lower. I
    tested it manually and while the system felt slightly more responsive
    while copying data to a USB stick, it was marginal enough that it could be
    my imagination.

    This patch: do not writeback filesystem pages in direct reclaim.

    When kswapd is failing to keep zones above the min watermark, a process
    will enter direct reclaim in the same manner kswapd does. If a dirty page
    is encountered during the scan, this page is written to backing storage
    using mapping->writepage.

    This causes two problems. First, it can result in very deep call stacks,
    particularly if the target storage or filesystem are complex. Some
    filesystems ignore write requests from direct reclaim as a result. The
    second is that a single-page flush is inefficient in terms of IO. While
    there is an expectation that the elevator will merge requests, this does
    not always happen. Quoting Christoph Hellwig;

    The elevator has a relatively small window it can operate on,
    and can never fix up a bad large scale writeback pattern.

    This patch prevents direct reclaim writing back filesystem pages by
    checking if current is kswapd. Anonymous pages are still written to swap
    as there is not the equivalent of a flusher thread for anonymous pages.
    If the dirty pages cannot be written back, they are placed back on the LRU
    lists. There is now a direct dependency on dirty page balancing to
    prevent too many pages in the system being dirtied which would prevent
    reclaim making forward progress.
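    The policy reduces to a small check; an illustrative standalone form:

        #include <stdbool.h>

        enum page_kind { PAGE_ANON, PAGE_FILE };

        /* Only kswapd may write back dirty file-backed pages from reclaim;
         * direct reclaimers put them back on the LRU and rely on the flusher
         * threads. Anonymous pages may still be swapped out by anyone, as no
         * flusher thread exists for them. */
        static bool may_write_page_from_reclaim(enum page_kind kind, bool is_kswapd)
        {
                if (kind == PAGE_ANON)
                        return true;
                return is_kswapd;
        }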

    Signed-off-by: Mel Gorman
    Reviewed-by: Minchan Kim
    Cc: Dave Chinner
    Cc: Christoph Hellwig
    Cc: Johannes Weiner
    Cc: Wu Fengguang
    Cc: Jan Kara
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Alex Elder
    Cc: Theodore Ts'o
    Cc: Chris Mason
    Cc: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • In the __zone_reclaim case, we don't want to shrink mapped pages.
    Nonetheless, we isolate mapped pages and re-add them to the head of the
    LRU. This is unnecessary CPU overhead and causes LRU churning.

    Of course, a page might be mapped when we isolate it but no longer mapped
    when we try to migrate it, so it could still be migrated. But the race is
    rare and, even when it happens, it's no big deal.

    Signed-off-by: Minchan Kim
    Acked-by: Johannes Weiner
    Reviewed-by: KAMEZAWA Hiroyuki
    Reviewed-by: KOSAKI Motohiro
    Reviewed-by: Michal Hocko
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • In async mode, compaction doesn't migrate dirty or writeback pages. So it's
    meaningless to pick such pages and re-add them to the lru list.

    Of course, a page might be dirty or under writeback when we isolate it in
    compaction but no longer dirty or under writeback when we try to migrate
    it, so it could still be migrated. But that is very unlikely, as the
    isolate-and-migrate cycle is much faster than writeout.

    So this patch reduces cpu overhead and prevents unnecessary LRU churning.

    Signed-off-by: Minchan Kim
    Acked-by: Johannes Weiner
    Reviewed-by: KAMEZAWA Hiroyuki
    Reviewed-by: KOSAKI Motohiro
    Acked-by: Mel Gorman
    Acked-by: Rik van Riel
    Reviewed-by: Michal Hocko
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • Replace the ISOLATE_XXX macros with a bitwise isolate_mode_t type.
    Normally, macros aren't recommended as they are type-unsafe and make
    debugging harder since the symbol cannot be passed through to the debugger.

    Quote from Johannes
    " Hmm, it would probably be cleaner to fully convert the isolation mode
    into independent flags. INACTIVE, ACTIVE, BOTH is currently a
    tri-state among flags, which is a bit ugly."

    This patch moves isolate mode from swap.h to mmzone.h by memcontrol.h
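    An illustrative flag layout (bit values and names are assumptions; the
    point is that the formerly tri-state mode becomes independently combinable
    bits):

        typedef unsigned int isolate_mode_t;

        #define ISOLATE_INACTIVE  ((isolate_mode_t)0x1)  /* take inactive pages */
        #define ISOLATE_ACTIVE    ((isolate_mode_t)0x2)  /* take active pages */
        #define ISOLATE_CLEAN     ((isolate_mode_t)0x4)  /* only clean pages */
        #define ISOLATE_UNMAPPED  ((isolate_mode_t)0x8)  /* only unmapped pages */

        /* e.g. isolate both lists: ISOLATE_INACTIVE | ISOLATE_ACTIVE */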

    Signed-off-by: Minchan Kim
    Cc: Johannes Weiner
    Cc: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Michal Hocko
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     

27 Jul, 2011

2 commits

  • This allows us to move duplicated code in <asm/atomic.h>
    (atomic_inc_not_zero() for now) to <linux/atomic.h>.

    Signed-off-by: Arun Sharma
    Reviewed-by: Eric Dumazet
    Cc: Ingo Molnar
    Cc: David Miller
    Cc: Eric Dumazet
    Acked-by: Mike Frysinger
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Arun Sharma
     
  • In mm/memcontrol.c, there are many lru stat functions as..

    mem_cgroup_zone_nr_lru_pages
    mem_cgroup_node_nr_file_lru_pages
    mem_cgroup_nr_file_lru_pages
    mem_cgroup_node_nr_anon_lru_pages
    mem_cgroup_nr_anon_lru_pages
    mem_cgroup_node_nr_unevictable_lru_pages
    mem_cgroup_nr_unevictable_lru_pages
    mem_cgroup_node_nr_lru_pages
    mem_cgroup_nr_lru_pages
    mem_cgroup_get_local_zonestat

    Some of them are under #ifdef MAX_NUMNODES >1 and others are not.
    This seems bad. This patch consolidates all functions into

    mem_cgroup_zone_nr_lru_pages()
    mem_cgroup_node_nr_lru_pages()
    mem_cgroup_nr_lru_pages()

    For these functions, "which LRU?" information is passed by a mask.

    example:
    mem_cgroup_nr_lru_pages(mem, BIT(LRU_ACTIVE_ANON))

    And I added some macro as ALL_LRU, ALL_LRU_FILE, ALL_LRU_ANON.

    example:
    mem_cgroup_nr_lru_pages(mem, ALL_LRU)

    BTW, considering the NUMA placement of the counters in memory, this patch
    seems to be better.

    Now, when we gather all LRU information, we scan in the following order:
    for_each_lru -> for_each_node -> for_each_zone.

    This means we'll touch cache lines on different nodes in turn.

    After patch, we'll scan
    for_each_node -> for_each_zone -> for_each_lru(mask)

    Then, we'll gather information in the same cacheline at once.
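    The reordered accumulation can be sketched as follows (standalone model
    with simplified types; two zones per node assumed for brevity):

        enum lru_list { LRU_INACTIVE_ANON, LRU_ACTIVE_ANON, LRU_INACTIVE_FILE,
                        LRU_ACTIVE_FILE, LRU_UNEVICTABLE, NR_LRU_LISTS };

        #define LRU_BIT(l) (1UL << (l))
        #define ALL_LRU    (LRU_BIT(NR_LRU_LISTS) - 1)

        static unsigned long nr_lru_pages(unsigned long counts[][2][NR_LRU_LISTS],
                                          int nr_nodes, unsigned long lru_mask)
        {
                unsigned long total = 0;

                for (int node = 0; node < nr_nodes; node++)          /* for_each_node */
                        for (int zone = 0; zone < 2; zone++)         /* for_each_zone */
                                for (int lru = 0; lru < NR_LRU_LISTS; lru++)
                                        if (lru_mask & LRU_BIT(lru)) /* for_each_lru(mask) */
                                                total += counts[node][zone][lru];
                return total;
        }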

    [akpm@linux-foundation.org: fix warnings, build error]
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Daisuke Nishimura
    Cc: Balbir Singh
    Cc: Michal Hocko
    Cc: Ying Han
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     

28 Jun, 2011

1 commit

  • commit 21a3c96 uses node_start/end_pfn(nid) for detecting the start/end of
    nodes. But these are not defined in linux/mmzone.h; they are defined in
    /arch/???/include/mmzone.h, which is included only under
    CONFIG_NEED_MULTIPLE_NODES=y.

    Then, we see
    mm/page_cgroup.c: In function 'page_cgroup_init':
    mm/page_cgroup.c:308: error: implicit declaration of function 'node_start_pfn'
    mm/page_cgroup.c:309: error: implicit declaration of function 'node_end_pfn'

    So, fixing page_cgroup.c is one idea...

    But node_start_pfn()/node_end_pfn() are very generic macros and should be
    implemented in the same manner for all archs. (m32r has a different
    implementation...)

    This patch removes the definitions of node_start/end_pfn() in each arch
    and defines a unified one in linux/mmzone.h. It's no longer under
    CONFIG_NEED_MULTIPLE_NODES.

    A result of macro expansion is here (mm/page_cgroup.c)

    for !NUMA
    start_pfn = ((&contig_page_data)->node_start_pfn);
    end_pfn = ({ pg_data_t *__pgdat = (&contig_page_data); __pgdat->node_start_pfn + __pgdat->node_spanned_pages;});

    for NUMA (x86-64)
    start_pfn = ((node_data[nid])->node_start_pfn);
    end_pfn = ({ pg_data_t *__pgdat = (node_data[nid]); __pgdat->node_start_pfn + __pgdat->node_spanned_pages;});
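    A unified definition consistent with the expansions shown above
    (illustrative; note that node_end_pfn() evaluates "nid" only once via
    __pgdat):

        #define node_start_pfn(nid)  (NODE_DATA(nid)->node_start_pfn)

        #define node_end_pfn(nid) ({                                    \
                pg_data_t *__pgdat = NODE_DATA(nid);                    \
                __pgdat->node_start_pfn + __pgdat->node_spanned_pages;  \
        })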

    Changelog:
    - fixed to avoid using "nid" twice in node_end_pfn() macro.

    Reported-and-acked-by: Randy Dunlap
    Reported-and-tested-by: Ingo Molnar
    Acked-by: Mel Gorman
    Signed-off-by: KAMEZAWA Hiroyuki
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     

28 May, 2011

1 commit

  • * 'for-linus' of master.kernel.org:/home/rmk/linux-2.6-arm: (45 commits)
    ARM: 6945/1: Add unwinding support for division functions
    ARM: kill pmd_off()
    ARM: 6944/1: mm: allow ASID 0 to be allocated to tasks
    ARM: 6943/1: mm: use TTBR1 instead of reserved context ID
    ARM: 6942/1: mm: make TTBR1 always point to swapper_pg_dir on ARMv6/7
    ARM: 6941/1: cache: ensure MVA is cacheline aligned in flush_kern_dcache_area
    ARM: add sendmmsg syscall
    ARM: 6863/1: allow hotplug on msm
    ARM: 6832/1: mmci: support for ST-Ericsson db8500v2
    ARM: 6830/1: mach-ux500: force PrimeCell revisions
    ARM: 6829/1: amba: make hardcoded periphid override hardware
    ARM: 6828/1: mach-ux500: delete SSP PrimeCell ID
    ARM: 6827/1: mach-netx: delete hardcoded periphid
    ARM: 6940/1: fiq: Briefly document driver responsibilities for suspend/resume
    ARM: 6938/1: fiq: Refactor {get,set}_fiq_regs() for Thumb-2
    ARM: 6914/1: sparsemem: fix highmem detection when using SPARSEMEM
    ARM: 6913/1: sparsemem: allow pfn_valid to be overridden when using SPARSEMEM
    at91: drop at572d940hf support
    at91rm9200: introduce at91rm9200_set_type to specficy cpu package
    at91: drop boot_params and PLAT_PHYS_OFFSET
    ...

    Linus Torvalds